# Data Science for Social Justice Workshop: TF-IDF
# PROJECT

## Retrieving the Dataset

In [None]:
import os
import pickle
import pandas as pd

In [None]:
# Replace this with your own csv file!
df = pd.read_csv('../../data/YOUR_FILE.csv')

In [None]:
# Replace this with your own pickle file!
with open('../../data/YOUR_FILE.pickle', 'rb') as f:
    trigrams = pickle.load(f)

## Using TF-IDF on your data

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# The settings you used for CountVectorizer can be reused here
tfidf_vectorizer = TfidfVectorizer(max_df=0.85,
                                   decode_error='ignore',
                                   stop_words='english',
                                   smooth_idf=True,
                                   use_idf=True)

# Fit and transform the texts
tfidf = tfidf_vectorizer.fit_transform(trigrams)
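
To get a feel for what the vectorizer is doing before looking at the real data, here is a minimal sketch on a hypothetical three-document toy corpus (the sentences and the `toy_*` names are made up for illustration and are not part of the workshop data). It uses the vectorizer's default settings and checks that a word appearing in every document receives a lower IDF weight than a word appearing in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the workshop data
toy_docs = [
    "cats chase mice",
    "cats sleep all day",
    "cats purr loudly",
]

# Default settings; the notebook's own vectorizer uses extra options
toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

# Look up the learned IDF weight for each term in the vocabulary
vocab = toy_vectorizer.vocabulary_
idf = toy_vectorizer.idf_

common_idf = idf[vocab["cats"]]  # "cats" appears in every document
rare_idf = idf[vocab["mice"]]    # "mice" appears in one document only
print(f"IDF of 'cats': {common_idf:.3f}, IDF of 'mice': {rare_idf:.3f}")

# Within the first document, the rarer word gets the larger TF-IDF score
first_row = toy_tfidf[0].toarray().ravel()
print(first_row[vocab["mice"]] > first_row[vocab["cats"]])
```

The same pattern should apply to your own corpus: words shared by most posts are down-weighted, while words concentrated in a few posts stand out.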

Let's have a look at some of the TF-IDF values:

In [None]:
# Place TF-IDF values in a DataFrame
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

In [None]:
tfidf_df.head()

In [None]:
# Highest TF-IDF values across documents
tfidf_df.sum().sort_values(ascending=False)

## Using TF-IDF to find Similar Posts

Choose a document to examine by its row position in the DataFrame.

In [None]:
doc_idx = 25

Change this column to `body` if you are working with a comments DataFrame!

In [None]:
df['selftext'].iloc[doc_idx]

Let's have a quick look at the TF-IDF scores for the words in this submission to see if these words are indeed typical for this particular submission. Do the distinctive words have to do with the topic of the post?

In [None]:
tfidf_df.loc[doc_idx].sort_values(ascending=False)

Now let's find the closest posts to this one. The fact that our documents are now in a vector space allows us to make use of mathematical similarity metrics.

**Cosine similarity** is one metric used to measure how similar documents are irrespective of their length. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. Because TF-IDF values are never negative, the score here ranges from 0 to 1: it equals 1 when two documents point in exactly the same direction (for example, when they are identical) and 0 when they share no terms.
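
To make the definition concrete, here is a minimal sketch that computes cosine similarity by hand with NumPy. The helper `cosine_sim` and the three small vectors are hypothetical and exist only for illustration. The zero-vector case returns 0.0 as a convention chosen for this sketch.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # An all-zero vector has no direction; return 0.0 by convention here
    return float(a @ b / denom) if denom else 0.0

short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 10 * short_doc              # same direction, ten times the length
unrelated = np.array([0.0, 0.0, 3.0])  # no terms in common with short_doc

same_direction = cosine_sim(short_doc, long_doc)
no_overlap = cosine_sim(short_doc, unrelated)
print(same_direction, no_overlap)
```

Scaling a vector changes its length but not its direction, which is why a long post and a short post about the same topic can still score close to 1.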

We can use a cosine similarity function from `sklearn` to calculate the cosine similarity between each pair of documents:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(tfidf)
similarities.shape

We can put the text and scores in a dataframe, and sort by the score:

In [None]:
similar_df = pd.DataFrame({
    # Change this to "body" if working with comments
    'text': df['selftext'].values,
    'score': similarities[doc_idx]}).sort_values('score', ascending=False)

The top document will typically be the document itself, since it has a similarity of 1 with itself. So we look at the next document: does it seem similar?

In [None]:
similar_df['text'].iloc[0]

In [None]:
similar_df['text'].iloc[1]
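
If you would rather skip the query document automatically instead of stepping past it by hand, one option is a small helper like the sketch below. The function `most_similar` and the 4x4 `toy_similarities` matrix are hypothetical stand-ins; in the notebook you would pass the real `similarities` array and `doc_idx`.

```python
import numpy as np

def most_similar(sim_matrix, idx, k=3):
    """Return the row positions of the k documents most similar to `idx`,
    excluding `idx` itself (hypothetical helper)."""
    # astype makes a copy, so the original matrix is left untouched
    scores = np.asarray(sim_matrix)[idx].astype(float)
    scores[idx] = -np.inf             # never return the query document
    order = np.argsort(scores)[::-1]  # highest score first
    return order[:k].tolist()

# Hypothetical 4x4 similarity matrix standing in for `similarities`
toy_similarities = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

neighbours = most_similar(toy_similarities, idx=0, k=2)
print(neighbours)
```

The returned positions can then be passed to `.iloc` on the text column to read the matching posts.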