# Data Science for Social Justice Workshop: Module 2

## Term Frequency-Inverse Document Frequency (TF-IDF)
In Notebook 1, we covered methods to clean and tokenize text into units (like words, bigrams, and trigrams). In Notebook 2, we covered methods to explore the text, including collocations, frequency, and common context. However, for many data science applications, we need a way to convert the content of a text (i.e., the list of tokens) into useful numbers that can be used as the features of a model or analysis. One way to do this is to use a method called Term Frequency-Inverse Document Frequency (TF-IDF). This notebook introduces TF-IDF for comparing subsets of a subreddit.

This notebook is designed to help you use TF-IDF to:
1. Compare (subsets of) datasets
2. Find most-distinctive words in a subreddit
3. Find similar posts


## Retrieving the dataset
Let's get the data. Make sure you're in the "Data" directory when importing by running the magic command `%pwd`.
If you're not in the right directory, use `os.chdir` to navigate there. We are importing the processed data saved at the end of Notebook 1.

**Note**: Feel free to replace the file with the filename of your project data for the workshop.

In [46]:
%pwd

'/Users/emilygrabowski/Documents/GitHub/Data-Science-Social-Justice/data'

In [61]:
import os
import pandas as pd

# We include two ../ because we want to go two levels up in the file structure
#os.chdir('../../Data')

reddf = pd.read_csv('aita_sub_top_sm_lemmas.csv')

In [62]:
reddf.head(3)

Unnamed: 0,idint,idstr,created,created_datetime,nsfw,author,title,selftext,lemmas,score,distinguish,textlen,num_comments,flair_text,flair_css_class
0,427576402,t3_72kg2a,1506433689,2017-09-26 13:48:09,0.0,Ritsku,AITA for breaking up with my girlfriend becaus...,My girlfriend recently went to the beach with ...,girlfriend recently went beach friends tiny bi...,679.0,,4917.0,434.0,no a--holes here,
1,551887974,t3_94kvhi,1533404095,2018-08-04 17:34:55,0.0,hhhhhhffff678,AITA for banning smoking in my house and telli...,My parents smoke like chimneys. I used to as w...,parents smoke like chimneys quit wife got youn...,832.0,,2076.0,357.0,asshole,ass
2,552654542,t3_951az2,1533562299,2018-08-06 13:31:39,0.0,creepatthepool,AITA? Creep wears skimpy bathing suit to pool,Hi guys. Throwaway for obv reasons.\n\nI'm a f...,hi guys throwaway obv reasons i'm female child...,23.0,,1741.0,335.0,Shitpost,


## Implementing TF-IDF
TF-IDF, short for **term frequency–inverse document frequency**, is a metric that reflects how important a word is to a **document** in a collection or **corpus**. When talking about text datasets, the dataset is called a corpus, and each datapoint is a document. A document can be a post, a paragraph, a webpage, or whatever is considered the individual unit of text for a given dataset. A **term** is each unique token in a document (we previously also referred to this as a **type**).

For example in a corpus of sentences, a document might be: "I went to New York City in New York state." 

The processed tokens in that document might be: [went, new_york, city, new_york, state]. 

The document would have four unique terms: [went, new_york, city, state].
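
As a small sketch of the distinction between tokens and unique terms, we can count the example tokens above with Python's built-in `Counter`. The token list is copied from the example; the variable names are our own.

```python
from collections import Counter

# Processed tokens from the example sentence above
tokens = ["went", "new_york", "city", "new_york", "state"]

# Term frequency: how many times each unique term occurs in this document
term_counts = Counter(tokens)

# Unique terms, in order of first appearance
unique_terms = list(term_counts)

print(len(tokens), "tokens,", len(unique_terms), "unique terms")
print(term_counts)
```

The document has five tokens but only four unique terms, because `new_york` occurs twice.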

The TF-IDF value increases proportionally to the number of times a word appears in the document (the term frequency, or TF), and is offset by the number of documents in the corpus that contain the word (the inverse document frequency, or IDF). This helps to adjust for the fact that some words appear more frequently in general – such as articles and prepositions.

We'll talk about the math behind calculating the TF-IDF but the key components to remember are: 
1. There is one TF-IDF score for each unique term in each document
2. A high TF-IDF score means that term is descriptive of that document
3. A low TF-IDF score may be because either the term is not frequent in that document, or that it is frequent in many documents in the dataset - either way, it is not a good descriptor of that document.

The intuition is that if a word occurs many times in one post but rarely in the rest of the corpus, it is probably useful for characterizing that post; conversely, if a word occurs frequently in a post but also occurs frequently in the corpus, it is probably less characteristic of that post.

We will use the [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to generate the term-frequency matrix for a given corpus, which is the first part of calculating the tf-idf. This is a table where nrows = # documents, and ncols = # terms in the corpus. Each cell gives a count for the frequency of that term in the document (which may be zero). Let's try it on a toy dataset. Conveniently, there is a built-in tokenizer in CountVectorizer that we will use for now to preprocess the text.

In [51]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
  'My cat has paws.',
  'Can we let the dog out?',
  'Our dog really likes the cat but the cat does not agree.']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,agree,but,can,cat,does,dog,has,let,likes,my,not,our,out,paws,really,the,we
0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0
1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,1
2,1,1,0,2,1,1,0,0,1,0,1,1,0,0,1,2,0


Each column in the matrix represents a unique word in the vocabulary, while each row represents a document in our dataset. In this case, we have three sentences (i.e., the documents), and therefore we have three rows. The values in each cell are the word counts. Note that with this representation, the count of a word is 0 if the word does not appear in the corresponding document.


Let's look at the shape of the matrix. How many unique words are there in the matrix?

In [52]:
X.shape

(3, 17)

Now we have numbers representing the contents of the documents! This is the first step of calculating tf-idf. Matrices like this are the simplest way to represent texts. However, raw counts are biased toward the most frequent words and end up ignoring rare words that could have helped us process our data more efficiently.

Oftentimes we want to know not only how frequent the words in the corpus are, but also how important they are. This is where tf-idf (term frequency-inverse document frequency) comes in. Tf-idf is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

This is done by combining the **term frequency** calculated above and the **inverse document frequency** of the word across a set of documents. For the latter, we divide the total number of documents by the number of documents containing the term, and then take the logarithm of that quotient. (scikit-learn calculates this for us, using a lightly smoothed variant of the formula.) This tells us if a word is common or rare across all documents.

The combination of these two metrics into the TF-IDF will identify how unique a term is to that document.
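
To make the IDF calculation concrete, here is a minimal sketch on a made-up three-sentence corpus (the corpus and variable names are our own, not part of the workshop data). It computes the plain `log(N / df)` described above, and also the smoothed version `ln((1 + N) / (1 + df)) + 1` that scikit-learn uses by default (`smooth_idf=True`), then checks the latter against `TfidfTransformer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus for illustration only
toy_corpus = [
    "my cat has paws",
    "can we let the dog out",
    "our dog really likes the cat",
]

counts = CountVectorizer().fit_transform(toy_corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: the number of documents that contain each term
doc_freq = (counts > 0).sum(axis=0)

# Plain IDF as described in the text: log(N / df)
plain_idf = np.log(n_docs / doc_freq)

# Smoothed IDF, scikit-learn's default: ln((1 + N) / (1 + df)) + 1
smoothed_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# The smoothed values should match what TfidfTransformer learns
sklearn_idf = TfidfTransformer().fit(counts).idf_
print(np.allclose(smoothed_idf, sklearn_idf))
```

Either way, terms that occur in more documents receive a lower IDF than terms that occur in only one.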


**Note**: Word order is not retained in this type of featurization, since all that is counted is the overall frequency of a word in a document. This is called a **bag-of-words** approach. It significantly simplifies the representation problem, yet it has been found to be effective in capturing key features of documents.
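
A quick way to see what "bag-of-words" means is to vectorize two sentences that contain the same words in a different order. This is only an illustrative sketch with made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, different order, and a different meaning
sentences = ["the dog chased the cat", "the cat chased the dog"]

bow = CountVectorizer().fit_transform(sentences).toarray()
print(bow)

# Both rows are identical: the representation cannot tell the sentences apart
print((bow[0] == bow[1]).all())
```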

### Testing tf-idf with a toy dataset

Let's try tf-idf out with a toy dataset. Here we have six documents about Python, each with a different meaning. If we are trying to distinguish between these documents, the word "Python" would not be very useful, since it occurs in all of the documents, but other terms might, like "Monty", "snake", etc.

In [53]:
document1 = """Python is a 2000 made-for-TV horror movie directed by Richard
Clabaugh. The film features several cult favorite actors, including William
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy,
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean
Whalen."""

document2 = """Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known."""

document3 = """Monty Python (also collectively known as the Pythons) are a British 
surreal comedy group who created the sketch comedy television show Monty Python's 
Flying Circus, which first aired on the BBC in 1969. Forty-five episodes were made 
over four series."""

document4 = """Python is an interpreted, high-level, general-purpose programming language. 
Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes 
code readability with its notable use of significant whitespace. Its language constructs and 
object-oriented approach aim to help programmers write clear, logical code for small and 
large-scale projects."""

document5 = """The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum". It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment."""

document6 = """The Pythonidae, commonly known simply as pythons, from the Greek word python 
(πυθων), are a family of nonvenomous snakes found in Africa, Asia, and Australia. 
Among its members are some of the largest snakes in the world. Eight genera and 31
species are currently recognized."""

test_list = [document1, document2, document3, document4, document5, document6]

Let's use `CountVectorizer()` again, and customize some parameters. Using `max_df` (max document frequency) we can get rid of words that appear in more than 85% of the documents in the corpus, and using `stop_words` we can supply an English stopword list to leave out of the calculations as well.

In [54]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_df=0.85, stop_words='english')
word_count_vector = cv.fit_transform(test_list)
pd.DataFrame(word_count_vector.toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,1955,1969,1991,2000,31,357,44,actors,africa,aim,...,wil,william,word,world,write,year,zabka,πυθων,πύθων,πύθωνας
0,0,0,0,1,0,0,0,1,0,0,...,1,1,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,1
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
4,1,0,0,0,0,1,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0,0,1,0,...,0,0,1,1,0,0,0,1,0,0
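
To check what `max_df` and `stop_words` actually remove, we can compare the vocabulary with and without those settings on a smaller corpus. The three sentences below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus: "python" occurs in every document
mini_corpus = [
    "python is a horror movie",
    "python is a genus of snakes",
    "python is a programming language",
]

plain_cv = CountVectorizer().fit(mini_corpus)
filtered_cv = CountVectorizer(max_df=0.85, stop_words="english").fit(mini_corpus)

print(sorted(plain_cv.vocabulary_))
print(sorted(filtered_cv.vocabulary_))

# Terms that were present before filtering but are gone afterwards
removed = set(plain_cv.vocabulary_) - set(filtered_cv.vocabulary_)
print(sorted(removed))
```

"python" is dropped because it appears in 100% of the documents (more than the 85% threshold), while words like "is" and "of" are dropped by the stopword list.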


How many documents and unique words are there?

## Using `TfidfTransformer`

Next, we need to compute the inverse document frequency values. We'll call [`tfidf_transformer.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) on the word counts we computed earlier.


In [55]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer=TfidfTransformer() 
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

To get a glimpse of how the IDF values look, let's put these into a DataFrame and sort by weights. Remember, a low idf indicates something that is less unique.


In [56]:
# print idf values 
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"]) 
 
# sort ascending 
df_idf.sort_values(by=['idf_weights'])

Unnamed: 0,idf_weights
known,1.336472
pythons,1.559616
greek,1.847298
nonvenomous,1.847298
created,1.847298
...,...
film,2.252763
films,2.252763
flying,2.252763
fame,2.252763


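Notice that the words "known" and "pythons" have the lowest idf values. This is expected: they appear in more documents than any other remaining term. The lower the idf value of a word, the less unique it is to any particular document. The word "python" is missing from the table altogether: it appears in every document, so the `max_df=0.85` setting removed it, and "in" was dropped as a stopword.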

Now that we have the idf values, we can compute the tf-idf scores for our set of documents using `.transform()`

In [57]:
tf_idf_vector=tfidf_transformer.transform(word_count_vector)

By invoking `tfidf_transformer.transform()` we are computing the tf-idf scores for our docs. Internally this is weighting tf scores by their idf scores, so that the more unique a word, the more its frequency counts in a given document.
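
As a sketch of what happens internally, we can reproduce the transform by hand on a made-up corpus: multiply the raw counts by the learned idf weights, then scale each row to unit length (the default `norm='l2'` setting). The corpus and variable names here are our own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus for illustration only
docs = ["cat cat dog", "dog bird", "cat bird bird bird"]
counts = CountVectorizer().fit_transform(docs)

transformer = TfidfTransformer().fit(counts)
sklearn_scores = transformer.transform(counts).toarray()

# Step 1: weight the raw term counts by each term's idf
weighted = counts.toarray() * transformer.idf_

# Step 2: L2-normalize each row so every document vector has length 1
manual_scores = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(manual_scores, sklearn_scores))
```

The normalization step is why tf-idf scores fall between 0 and 1 regardless of how long a document is.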

Let's print the tf-idf values of the third document to see if they make sense. Below, we place the tf-idf scores from the third document (index 2) into a pandas DataFrame and sort it in descending order of scores. We can replace the index in `tf_idf_vector[]` to select a different document to check.

In [58]:
feature_names = cv.get_feature_names_out() 
  
#print the scores 
df = pd.DataFrame(tf_idf_vector[2].T.todense(), index=feature_names, columns=["tfidf"]) 
df.sort_values(by=["tfidf"],ascending=False)

Unnamed: 0,tfidf
comedy,0.424705
monty,0.424705
circus,0.212353
television,0.212353
surreal,0.212353
...,...
englund,0.000000
emphasizes,0.000000
elm,0.000000
discontinued,0.000000


What are the most distinctive words for document 3 (the highest tf-idf scores)? Does it make sense given the document and corpus?

## Using tf-idf on Reddit datasets
Now that we have a good grasp of how TF-IDF is calculated, let's apply this method to our Reddit data. Let's say that, rather than comparing individual posts (for now), we want to compare terms that are important to two subsets of the subreddit. First, let's create two dataframes to separate the data based on whether the community classified the posts as "assholeish" or "non-assholeish". Essentially, our corpus will have two documents, one for each division of the subreddit. We'll put the relevant lemmatized texts into a list.

In [68]:
# Split the posts by community verdict. reset_index() returns a new dataframe,
# which avoids modifying a slice of reddf in place.
df_ass = reddf.loc[reddf.flair_css_class == "ass"].reset_index()
df_not = reddf.loc[reddf.flair_css_class == "not"].reset_index()

In [69]:
aita_list = [' '.join(df_ass['lemmas']), ' '.join(df_not['lemmas'])]

This time, to save time, we will be using scikit-learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidfvectorizer#sklearn.feature_extraction.text.TfidfVectorizer). It is a class that allows us to create a matrix of word counts (what we just did with `CountVectorizer`) and immediately transform them into tf-idf values.
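
As a quick check that the one-step and two-step routes agree, the sketch below runs both on a made-up corpus with the same settings and compares the results. The sentences and variable names are our own.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus for illustration only
docs = [
    "my cat has paws",
    "can we let the dog out",
    "our dog really likes the cat",
]

# Two steps: count the words, then convert the counts to tf-idf
word_counts = CountVectorizer(stop_words="english").fit_transform(docs)
two_step = TfidfTransformer().fit_transform(word_counts)

# One step: TfidfVectorizer does both at once
one_step = TfidfVectorizer(stop_words="english").fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```

Any settings we would have passed to `CountVectorizer` (such as `max_df` or `stop_words`) can be passed to `TfidfVectorizer` instead.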

We simply instantiate an object of the `TfidfVectorizer`. Then, we run it by applying the `fit_transform()` method to our two-document corpus `aita_list`.

In [70]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# settings that you use for count vectorizer will go here
tfidf_vectorizer = TfidfVectorizer(max_df=0.85, decode_error='ignore',
                                   stop_words='english',smooth_idf=True,use_idf=True)

# fit and transform the texts
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(aita_list)


Let's have a look at our first df (assholes), and see which words are typical of those posts when compared to the posts which are by non-assholes. Do the numbers make sense? Do you notice patterns in the terms?

In [76]:
vector_tfidfvectorizer = tfidf_vectorizer_vectors[0] # Note that 0 refers to our first df (assholes), due to zero-based indexing

# place tf-idf values in a DataFrame
df = pd.DataFrame(vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)[:30]

Unnamed: 0,tfidf
johanna,0.279583
sabrina,0.186388
hayleigh,0.155324
elena,0.139791
gfm,0.116493
jeremy,0.116493
bryan,0.108727
zach,0.108727
timmy,0.10096
elmo,0.093194


Let's look at the other subset of our data. Do the high tf-idf scores make sense?

In [77]:
vector_tfidfvectorizer = tfidf_vectorizer_vectors[1] # Note that 1 refers to our second df (non-assholes), due to zero-based indexing

# place tf-idf values in a DataFrame
df = pd.DataFrame(vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)[:30]

Unnamed: 0,tfidf
billy,0.144455
derek,0.134137
bella,0.121755
thank_awards,0.117628
pam,0.115564
tammy,0.101118
phil,0.099055
burrito,0.084609
mandy,0.082546
sasha,0.074291
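
One possible reason both lists are dominated by proper names: with only two documents, any word that appears in both has a document frequency of 100%, which is above `max_df=0.85`, so it is removed before scoring. Only words that occur in just one of the two subsets remain. The sketch below illustrates this on two made-up "documents"; it is not run on the workshop data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus: "said", "sorry" and "friend" occur in both
two_docs = [
    "johanna said sorry wedding friend",
    "billy said sorry burrito friend",
]

vec = TfidfVectorizer(max_df=0.85).fit(two_docs)

# Only the words unique to one document survive the max_df filter
kept_terms = sorted(vec.vocabulary_)
print(kept_terms)
```

If we wanted shared words to stay in the comparison, one option to try would be raising `max_df` to `1.0`.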


## Using TF-IDF to find similar posts

We can also use TF-IDF to work out the similarity between any pair of documents. So given one post or comment, we could see which posts or comments are most similar. This can be useful if you're trying to find other examples of a pattern you have found and want to explore further.

This time, our "documents" will not be entire subreddits, but posts/submissions within one subreddit. Let's take the submissions and run the vectorizer on the raw `selftext`, without the preprocessing and lemmatizing. Tf-idf still works on raw text, and this way we will be able to read our posts.

Let's run TF-IDF over the entire corpus, so each post is compared to all the others.

In [79]:
# we could even add trigrams here by adding "ngram_range=(1,3)" to params
tfidf_vectorizer = TfidfVectorizer(analyzer='word', max_df = .70, stop_words = 'english')
word_count_vectors = tfidf_vectorizer.fit_transform([post for post in df_ass['selftext']])

We'll start by finding a post with a clear topic. Let's grab an entry in our dataframe. What is this post about?

In [81]:
df_ass['selftext'][7]

'It\'s 6PM on a Friday, store has hundreds of people, there are only 3 registers open and lines are ridiculously long. The self checkout has a line that wraps but its the fastest moving line so we wait about 15 minutes in line to check ourselves out.\n\nAfter we check out, a line is forming to exit the building because everyone is waiting for the walmart receipt checker to glance at their receipt.\n\nI\'m already frustrated because of the wait so I skip the line, and the checker anxiously tries to get my attention and loudly says "SIR I NEED TO CHECK YOUR RECEIPT!", I respond with a loud(because its loud in the store) "NO THANK YOU", and walk out of the building.  People start following my example.\n\nThus ensues a 15 minute fight in the car with the wife because she feels I made a scene.\n\nEDIT: Well this turned out well.'

Let's have a quick look at the tfidf scores for the words in this submission to see if these words are indeed typical for this particular submission. Do the distinctive words have to do with the topic of the post?

In [84]:
# get a vector out
vector_tfidfvectorizer = word_count_vectors[7] # change this number if you want to pick out a different vector / text

# place tf-idf values in a pandas data frame
df = pd.DataFrame(vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)[:10]

Unnamed: 0,tfidf
line,0.388256
receipt,0.345678
checker,0.309109
check,0.230729
building,0.194553
loud,0.174266
store,0.158222
anxiously,0.154554
ensues,0.154554
wait,0.152997


Now let's find the closest posts to this one. The fact that our documents are now in a vector space allows us to make use of mathematical similarity metrics.

**Cosine similarity** is one metric used to measure how similar the documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.
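
Here is a minimal sketch of cosine similarity on three made-up posts. Because `TfidfVectorizer` scales every document vector to length 1 by default, the plain dot product (`linear_kernel`) gives the same numbers as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical posts for illustration only
posts = [
    "long line at the store checkout",
    "the checkout line at the airport was long",
    "my roommate ate my leftover burrito",
]

vectors = TfidfVectorizer().fit_transform(posts)

# Pairwise similarity between every pair of posts (3 x 3 matrix)
cos = cosine_similarity(vectors)
dot = linear_kernel(vectors)

print(np.round(cos, 3))
print(np.allclose(cos, dot))
```

The first two posts share several words and get a similarity above 0, while the first and third share none and get exactly 0.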

We can use the helper function below to calculate cosine similarities and find the documents that are closest to the selected document. Just pass in the `word_count_vectors` for the corpus, the index of the document you want to find similar documents to, and the number of similar documents you want to return.

In [89]:
def find_similar(word_count_vectors, index, top_n=5):   # change `top_n` to retrieve more similar documents
    # TfidfVectorizer L2-normalizes each row by default, so the dot product equals the cosine similarity
    cosine_similarities = linear_kernel(word_count_vectors[index:index+1], word_count_vectors).flatten()
    # sort from most to least similar, skipping the selected document itself
    related_docs_indices = [i for i in cosine_similarities.argsort()[::-1] if i != index]
    return [(i, cosine_similarities[i]) for i in related_docs_indices[:top_n]]

We can now throw the resulting scores and similar posts in a list, feed that list into a DataFrame, and check out one of them to see if it works:

In [92]:
cosine = []
for index, score in find_similar(word_count_vectors, 7):
    cosine.append(
        {'cos_score': score, 
        'text': df_ass['selftext'][index]
        })
cosine_df = pd.DataFrame(cosine)
cosine_df

Unnamed: 0,cos_score,text
0,0.182479,I was in the TSA Precheck line where people ar...
1,0.159239,About 2 weeks ago I went to the drive-thru at ...
2,0.142716,Went to WalMart recently for batteries. I have...
3,0.139136,This happened today and I feel like it was jus...
4,0.123122,I (38m) stopped by a local grocery store on my...


Let's look at the first document's text. Does this post seem comparable to the one we selected above?

In [93]:
cosine_df['text'][0]

'I was in the TSA Precheck line where people are generally cognizant of the regulations. Unfortunately, a woman with her 3 children somehow ended up in this line. I have no patience for people who hold up the TSA screening line. This woman had to fish out all her feeding bottles from her 3 or 4 poorly organized bags and empty them before putting her belongings through security screening. As a result, she held up the security line for a long time. I finally had it and told her she should have been aware of the rules about liquids before getting in line and should have prepared for security screening before getting in line. I suggested that she step aside and let other travelers through and my suggestion was met with a few cheers. \n\nShe seemed really embarrassed but didn’t apologize at all. After I passed the screen I mentioned to the TSA agent he should manage the queue better. When a traveler was holding up the line he needed to step in, and he failed. He apologized and I moved on. I

In this notebook we introduced how to calculate TF-IDF scores for a corpus and used them to explore the Reddit data a little further. In this last example, we looked at similar documents. In the next module, we will use **topic modelling** to take this technique one step further by identifying groups of documents with similar themes, based on the kinds of calculations done in this notebook.