In [1]:
import os
import pandas as pd

In [2]:
os.chdir('../../data')
disability = pd.read_csv('disability_sub_top_sm_lemmas.csv')

## Implementing TF-IDF
TF-IDF, short for **term frequency–inverse document frequency**, is a metric that reflects how important a word is to a **document** in a collection, or **corpus**. In text analysis, the dataset is called a corpus, and each data point is a document. A document can be a post, a paragraph, a webpage, or whatever is considered the individual unit of text for a given dataset. A **term** is each unique token in a document (we previously also referred to this as a **type**).

For example, in a corpus of sentences, a document might be: `"I went to New York City in New York state."`

The processed tokens in that document might be: `[went, new_york, city, new_york, state]`.

The document would have four unique terms: `[went, new_york, city, state]`.
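
As a minimal sketch of the distinction between tokens and unique terms, the example document can be counted with Python's standard-library `Counter`. The variable names here are illustrative only and are not used elsewhere in this notebook.

```python
from collections import Counter

# Processed tokens from the example document above
tokens = ["went", "new_york", "city", "new_york", "state"]

# Count how often each term occurs in the document
term_counts = Counter(tokens)

# Unique terms, in order of first appearance
unique_terms = list(term_counts)

print(term_counts["new_york"])
print(unique_terms)
```

The document has five tokens but only four unique terms, because `new_york` occurs twice.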

The TF-IDF value increases proportionally to the number of times a word appears in the document (the term frequency, or TF), and is offset by the number of documents in the corpus that contain the word (the inverse document frequency, or IDF). This helps to adjust for the fact that some words appear more frequently in general – such as articles and prepositions.

We won't go into much detail about the math behind TF-IDF (the D-Lab Text Analysis workshop videos cover it in more depth). The key points to remember are:

1. There is one TF-IDF score per unique term and unique document.
2. A high TF-IDF score suggests that term is descriptive of that document.
3. A low TF-IDF score can mean either that the term is infrequent in that document or that it is frequent in many documents in the dataset. Either way, it is probably not a good descriptor of that document.

The intuition is that if a word occurs many times in one post but rarely in the rest of the corpus, it is probably useful for characterizing that post; conversely, if a word occurs frequently in a post but also occurs frequently in the corpus, it is probably less characteristic of that post.
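
To make that intuition concrete, here is a rough sketch of one common textbook variant of TF-IDF: the raw count of a term in a document, multiplied by the log of (number of documents / number of documents containing the term). The toy corpus and the `tf_idf` helper are hypothetical and exist only for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each document vector by default, so its numbers will differ from these.

```python
import math

# Hypothetical toy corpus: each document is a list of processed tokens
docs = [
    ["cat", "paws", "cat"],
    ["dog", "cat"],
    ["dog", "park"],
]

def tf_idf(term, doc, docs):
    """Raw count of term in doc, times log(N / number of docs containing term)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0  # term never occurs in the corpus
    return tf * math.log(len(docs) / df)

# "paws" occurs in only one of three documents, so it scores relatively high there
score_paws = tf_idf("paws", docs[0], docs)

# "cat" occurs in two of three documents, so a single occurrence scores lower
score_cat = tf_idf("cat", docs[1], docs)

print(score_paws, score_cat)
```

A term that never appears in a document gets a score of zero for that document, no matter how rare it is in the rest of the corpus.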


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
  'My cat has paws.',
  'Can we let the dog out?',
  'Our dog really likes the cat but the cat does not agree.']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names_out() requires scikit-learn >= 1.0
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
# Use this instead if your scikit-learn is older than 1.0
#pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())

,agree,but,can,cat,does,dog,has,let,likes,my,not,our,out,paws,really,the,we
0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0
1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,1
2,1,1,0,2,1,1,0,0,1,0,1,1,0,0,1,2,0


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Settings that you use for count vectorizer will go here
tfidf_vectorizer = TfidfVectorizer(max_df=0.85,
                                   decode_error='ignore',
                                   stop_words='english',
                                   smooth_idf=True,
                                   use_idf=True)

# Fit and transform the texts
tfidf = tfidf_vectorizer.fit_transform(disability['lemmas'])

In [9]:
# get_feature_names_out() needs scikit-learn >= 1.0; on older versions use get_feature_names()
df = pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

In [10]:
df.sum().sort_values(ascending=False)

disability        410.746160
like              391.987521
people            350.514048
work              346.751137
know              340.956349
                     ...    
habitats            0.024578
drugstores          0.024578
measles             0.024578
nineabsolutely      0.024578
twothe              0.024578
Length: 32131, dtype: float64

Cosine similarity measures how alike two documents' TF-IDF vectors are: a score of 1 means the documents have the same term profile, and the score decreases toward 0 the more dissimilar they are.
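
As a small illustration of what that score measures, cosine similarity is the dot product of two vectors divided by the product of their lengths. The vectors and the `cosine` helper below are made up for this sketch; because TF-IDF vectors contain no negative values, the result for real documents always falls between 0 and 1.

```python
import numpy as np

def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, just twice as long
c = np.array([0.0, 0.0, 3.0])  # shares no nonzero terms with a

same = cosine(a, b)
different = cosine(a, c)
print(same, different)
```

Note that `a` and `b` are not identical, yet they score (approximately) 1: cosine similarity compares the proportions of terms, not document length.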

In [14]:
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(tfidf)
similarities.shape

(14920, 14920)

In [15]:
similarities

array([[1.        , 0.04881839, 0.01612609, ..., 0.01524852, 0.02747433,
        0.01840746],
       [0.04881839, 1.        , 0.0347358 , ..., 0.02236755, 0.05083116,
        0.039858  ],
       [0.01612609, 0.0347358 , 1.        , ..., 0.        , 0.01206677,
        0.00558479],
       ...,
       [0.01524852, 0.02236755, 0.        , ..., 1.        , 0.00775328,
        0.00607863],
       [0.02747433, 0.05083116, 0.01206677, ..., 0.00775328, 1.        ,
        0.0038968 ],
       [0.01840746, 0.039858  , 0.00558479, ..., 0.00607863, 0.0038968 ,
        1.        ]])

In [16]:
doc_idx = 0  # assumption: row index of the document to compare against
similar_df = pd.DataFrame({
    'text': disability['lemmas'].values,  # the same rows that were vectorized above
    'score': similarities[doc_idx]}).sort_values('score', ascending=False)