# TF-IDF

In this lesson, we're going to learn about a text analysis method called *term frequency–inverse document frequency* (tf–idf). This method will help us identify the most unique words in a document from a given corpus. 

### Simple Formula

Calculating the most frequent words in a text can be useful. But often the most *frequent words* in a text aren't the most *interesting* words in a text.

Term frequency–inverse document frequency is a method that tries to help with this problem because it identifies the words that are frequent in one text but rare in the other texts it's compared to.

```{margin}
term = word <br>
document = text (or chunk of a text) <br>
corpus = collection of texts <br>
```

**term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in document

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1

We're going to calculate and compare the tf–idf scores for the word *said* and the word *pigeons* in "The Girl Who Raised Pigeons," the first short story in *Lost in the City*.

We need the `log()` function for our calculation, so we're going to import it from the `math` module. By default, `math.log()` computes the natural logarithm.

In [38]:
from math import log

**"said"**

In [39]:
total_number_of_documents = 14 ##total number of short stories in *Lost in the City*
number_of_documents_with_term = 13 ##number of short stories that contain the word "said"

In [40]:
term_frequency = 47 ##number of times "said" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1

In [41]:
term_frequency * inverse_document_frequency

50.48307469122493

**"pigeons"**

In [42]:
total_number_of_documents = 14 ##total number of short stories in *Lost in the City*
number_of_documents_with_term = 2 ##number of short stories that contain the word "pigeons"

In [43]:
term_frequency = 30 ##number of times "pigeons" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1

In [44]:
term_frequency * inverse_document_frequency

88.3773044716594

**tf–idf Scores**

"said" = 50.48<br>
"pigeons" = 88.38

Though the word "said" appears 47 times in "The Girl Who Raised Pigeons" and the word "pigeons" only appears 30 times, "pigeons" has a higher tf–idf score than "said" because it's a rarer word. The word "pigeons" appears in 2 of 14 stories, while "said" appears in 13 of 14 stories, almost all of them.
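To avoid retyping the arithmetic for every word, the calculation above can be wrapped in a small function. This is only a sketch: the name `simple_tfidf` and its parameter order are our own choices, not part of any library, and the counts are the ones used above.

```python
from math import log

def simple_tfidf(term_frequency, total_number_of_documents, number_of_documents_with_term):
    """Simple tf-idf: term frequency times (natural log of N / df, plus 1)."""
    inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1
    return term_frequency * inverse_document_frequency

said_score = simple_tfidf(47, 14, 13)     # "said": 47 times, in 13 of 14 stories
pigeons_score = simple_tfidf(30, 14, 2)   # "pigeons": 30 times, in 2 of 14 stories

print(round(said_score, 2), round(pigeons_score, 2))  # 50.48 88.38
```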

## tf–idf with scikit-learn

We could continue calculating tf–idf scores in this manner — by doing all the math with Python — but conveniently there's a Python library that can calculate tf–idf scores in just a few lines of code.

This library is called [scikit-learn](https://scikit-learn.org/stable/index.html), imported as `sklearn`. It's a popular Python library for machine learning approaches such as clustering, classification, and regression, among others. Though we're not doing any machine learning in this lesson, we're nevertheless going to use scikit-learn's `TfidfVectorizer` and `CountVectorizer`.

In [45]:
!pip install scikit-learn



#### Import Libraries

In [165]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option("display.max_rows", 600)
pd.set_option("display.max_columns", 200)
#pd.options.display.float_format = lambda value : '{:.0f}'.format(value) if round(value,0) == value else '{:,.3f}'.format(value)
from pathlib import Path  
import glob

We're also going to import `pandas` and change two of its default display settings: we're going to increase the maximum number of rows and the maximum number of columns that pandas will display. The commented-out line is an optional number format: if it's a decimal number, format to three decimal places; if it's a whole number, round to the whole number.

Finally, we're going to import two libraries that will help us work with files and the file system: [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) and [`glob`](https://docs.python.org/3/library/glob.html). These libraries will help us read in all the short story text files from *Lost in the City*.

#### Set Directory Path

Below we're setting the directory filepath that contains all the short story text files that we want to analyze.

In [4]:
directory_path = "../texts/literature/Lost-in-the-City_Stories"

Then we're going to use `glob` and `Path` to make a list of all the short story filepaths in that directory and a list of all the short story titles.

In [64]:
text_files = glob.glob(f"{directory_path}/*.txt")

In [65]:
text_files

['../texts/literature/Lost-in-the-City_Stories/11-Gospel.txt',
 '../texts/literature/Lost-in-the-City_Stories/13-A-Dark-Night.txt',
 '../texts/literature/Lost-in-the-City_Stories/01-The-Girl-Who-Raised-Pigeons.txt',
 '../texts/literature/Lost-in-the-City_Stories/12-A-New-Man.txt',
 '../texts/literature/Lost-in-the-City_Stories/02-The-First-Day.txt',
 '../texts/literature/Lost-in-the-City_Stories/07-The-Sunday-Following-Mother’S-Day.txt',
 '../texts/literature/Lost-in-the-City_Stories/03-The-Night-Rhonda-Ferguson-Was-Killed.txt',
 '../texts/literature/Lost-in-the-City_Stories/05-The-Store.txt',
 '../texts/literature/Lost-in-the-City_Stories/08-Lost-In-The-City.txt',
 '../texts/literature/Lost-in-the-City_Stories/14-Marie.txt',
 '../texts/literature/Lost-in-the-City_Stories/09-His-Mother’S-House.txt',
 '../texts/literature/Lost-in-the-City_Stories/10-A-Butterfly-On-F-Street.txt',
 '../texts/literature/Lost-in-the-City_Stories/06-An-Orange-Line-Train-To-Ballston.txt',
 '../texts/literatur

In [68]:
text_titles = [Path(text).stem for text in text_files]

In [69]:
text_titles

['11-Gospel',
 '13-A-Dark-Night',
 '01-The-Girl-Who-Raised-Pigeons',
 '12-A-New-Man',
 '02-The-First-Day',
 '07-The-Sunday-Following-Mother’S-Day',
 '03-The-Night-Rhonda-Ferguson-Was-Killed',
 '05-The-Store',
 '08-Lost-In-The-City',
 '14-Marie',
 '09-His-Mother’S-House',
 '10-A-Butterfly-On-F-Street',
 '06-An-Orange-Line-Train-To-Ballston',
 '04-Young-Lions']
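Notice that `glob` doesn't return the files in numbered order (story 11 comes first above). Here's a small sketch of what `.stem` does and how sorting the filepaths first would give a predictable order. The two example paths are copied from the list above, and `PurePosixPath` is used only so the example doesn't need the files to exist.

```python
from pathlib import PurePosixPath

example_files = [
    "../texts/literature/Lost-in-the-City_Stories/11-Gospel.txt",
    "../texts/literature/Lost-in-the-City_Stories/01-The-Girl-Who-Raised-Pigeons.txt",
]

# .stem drops the directories and the ".txt" extension
example_titles = [PurePosixPath(text).stem for text in sorted(example_files)]
print(example_titles)  # ['01-The-Girl-Who-Raised-Pigeons', '11-Gospel']
```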

## Calculate tf–idf

To calculate tf–idf scores for every word, we're going to use scikit-learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run `TfidfVectorizer` is with smoothing (`smooth_idf = True`) and normalization (`norm='l2'`) turned on. These parameters will better account for differences in story length, and, overall, they'll produce more meaningful tf–idf scores. 

Smoothing and L2 normalization are actually the default settings for `TfidfVectorizer`. To turn them on, you don't need to include any extra code at all.
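For the curious, here is a rough sketch of what these two settings do, based on the formulas in scikit-learn's documentation. With smoothing, the inverse document frequency becomes log((1 + total number of documents) / (1 + number of documents with term)) + 1. With L2 normalization, each document's scores are divided by the square root of the sum of their squares. The function names and the three-word toy document below are made up for illustration (the "coop" counts in particular are hypothetical).

```python
from math import log, sqrt

def smoothed_idf(total_number_of_documents, number_of_documents_with_term):
    # Smoothing adds 1 to both counts, as if one extra document contained every term
    return log((1 + total_number_of_documents) / (1 + number_of_documents_with_term)) + 1

def l2_normalize(scores):
    # Divide each score by the vector's Euclidean length so the squares sum to 1
    length = sqrt(sum(score ** 2 for score in scores.values()))
    return {word: score / length for word, score in scores.items()}

# Toy document in a 14-document corpus (the "coop" numbers are hypothetical)
term_counts = {"said": 47, "pigeons": 30, "coop": 5}
document_frequencies = {"said": 13, "pigeons": 2, "coop": 1}

raw_scores = {word: count * smoothed_idf(14, document_frequencies[word])
              for word, count in term_counts.items()}
normalized_scores = l2_normalize(raw_scores)

for word, score in normalized_scores.items():
    print(word, round(score, 3))
```

Because a real story contains thousands of different words, these numbers won't match the scikit-learn output; the point is only that normalization squeezes every score into the range 0 to 1, which makes long and short stories easier to compare.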

**Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)**

In [124]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

**Plug in "text_files" which contains all our short stories**


In [124]:
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

**Make a DataFrame out of the tf–idf vector and sort by title**

In [125]:
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.sort_index()

**Add a row for the number of documents in which each word appears**

In [125]:
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()
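The expression `(tfidf_df > 0).sum()` counts, for each word, how many documents have a score above zero. Here is the same logic sketched in plain Python. The scores are the first three stories' values for two words, taken from the output shown further down, and the variable names are ours.

```python
# tf-idf scores for the first three stories, one list per word
sample_columns = {
    "pigeons": [0.207, 0.0, 0.0],
    "school": [0.036, 0.134, 0.02],
}

# Each comparison is True or False; summing counts the True values
sample_document_frequency = {
    word: sum(score > 0 for score in scores)
    for word, scores in sample_columns.items()
}
print(sample_document_frequency)  # {'pigeons': 1, 'school': 3}
```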

In [126]:
tfidf_slice = tfidf_df[['pigeons', 'school', 'said', 'church', 'gospelteers', 'thunder','girl', 'street', 'father', 'dreaming', 'car']]
tfidf_slice

| | pigeons | school | said | church | gospelteers | thunder | girl | street | father | dreaming | car |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-The-Girl-Who-Raised-Pigeons | 0.207 | 0.036 | 0.133 | 0.011 | 0.0 | 0.0 | 0.062 | 0.105 | 0.042 | 0.0 | 0.0 |
| 02-The-First-Day | 0.0 | 0.134 | 0.0 | 0.031 | 0.0 | 0.0 | 0.094 | 0.07 | 0.012 | 0.0 | 0.0 |
| 03-The-Night-Rhonda-Ferguson-Was-Killed | 0.0 | 0.02 | 0.212 | 0.003 | 0.0 | 0.0 | 0.032 | 0.082 | 0.061 | 0.0 | 0.092 |
| 04-Young-Lions | 0.0 | 0.015 | 0.186 | 0.0 | 0.0 | 0.0 | 0.005 | 0.065 | 0.073 | 0.0 | 0.012 |
| 05-The-Store | 0.0 | 0.018 | 0.246 | 0.012 | 0.0 | 0.0 | 0.065 | 0.093 | 0.1 | 0.0 | 0.032 |
| 06-An-Orange-Line-Train-To-Ballston | 0.0 | 0.036 | 0.286 | 0.0 | 0.0 | 0.0 | 0.022 | 0.036 | 0.022 | 0.0 | 0.02 |
| 07-The-Sunday-Following-Mother’S-Day | 0.0 | 0.003 | 0.21 | 0.01 | 0.0 | 0.0 | 0.028 | 0.023 | 0.059 | 0.0 | 0.073 |
| 08-Lost-In-The-City | 0.0 | 0.007 | 0.292 | 0.025 | 0.0 | 0.0 | 0.019 | 0.051 | 0.051 | 0.09 | 0.0 |
| 09-His-Mother’S-House | 0.0 | 0.006 | 0.231 | 0.0 | 0.0 | 0.0 | 0.01 | 0.065 | 0.007 | 0.0 | 0.025 |
| 10-A-Butterfly-On-F-Street | 0.0 | 0.0 | 0.171 | 0.0 | 0.0 | 0.0 | 0.0 | 0.128 | 0.043 | 0.0 | 0.037 |


To find out the top 10 words with the highest tf–idf for every story, we're going to make and run the following function: `get_top_tfidf_scores()`

In [127]:
def get_top_tfidf_scores(series, top_n=10):
    pretty_df = series.stack().groupby(level=0).nlargest(top_n).reset_index()
    pretty_df = pretty_df.rename(columns={0:'tfidf_score', 'level_1': 'story', 'level_2': 'word'})
    pretty_df = pretty_df.drop(columns='level_0')
    pretty_df['tfidf_rank'] = pretty_df.groupby('story')['tfidf_score'].rank(method='first', ascending=False)
    return pretty_df

This function will rearrange the dataframe, `.groupby()` short story, and filter for the top 10 highest tf–idf scores in every story. Finally, it will produce a dataframe with a new column `tfidf_rank`, which contains a 1-10 ranking of the highest tf–idf scores.
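If the chain of pandas methods feels opaque, the same idea can be sketched in plain Python: for each story, sort its words by score, keep the top few, and number them. The scores below are a handful of values copied from the table above, and `top_scores_plain` is a hypothetical helper of our own, not part of pandas or scikit-learn.

```python
from heapq import nlargest

# A few scores from the first two stories: {story: {word: tf-idf score}}
sample_scores = {
    "01-The-Girl-Who-Raised-Pigeons": {"pigeons": 0.207, "said": 0.133, "street": 0.105, "girl": 0.062},
    "02-The-First-Day": {"school": 0.134, "girl": 0.094, "street": 0.07},
}

def top_scores_plain(scores_by_story, top_n=10):
    rows = []
    for story, word_scores in scores_by_story.items():
        # Keep the top_n (word, score) pairs with the highest scores
        best = nlargest(top_n, word_scores.items(), key=lambda pair: pair[1])
        # Number the kept words 1, 2, 3, ... from highest score to lowest
        for rank, (word, score) in enumerate(best, start=1):
            rows.append({"story": story, "word": word, "tfidf_score": score, "tfidf_rank": rank})
    return rows

for row in top_scores_plain(sample_scores, top_n=2):
    print(row)
```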

In [128]:
tfidf_df = tfidf_df.drop('Document Frequency', errors='ignore')

In [129]:
top_tfidf = get_top_tfidf_scores(tfidf_df)
top_tfidf

| | story | word | tfidf_score | tfidf_rank |
|---|---|---|---|---|
| 0 | 01-The-Girl-Who-Raised-Pigeons | betsy | 0.358 | 1 |
| 1 | 01-The-Girl-Who-Raised-Pigeons | jenny | 0.35 | 2 |
| 2 | 01-The-Girl-Who-Raised-Pigeons | ann | 0.31 | 3 |
| 3 | 01-The-Girl-Who-Raised-Pigeons | robert | 0.295 | 4 |
| 4 | 01-The-Girl-Who-Raised-Pigeons | coop | 0.223 | 5 |
| 5 | 01-The-Girl-Who-Raised-Pigeons | pigeons | 0.207 | 6 |
| 6 | 01-The-Girl-Who-Raised-Pigeons | miss | 0.163 | 7 |
| 7 | 01-The-Girl-Who-Raised-Pigeons | birds | 0.147 | 8 |
| 8 | 01-The-Girl-Who-Raised-Pigeons | clara | 0.143 | 9 |
| 9 | 01-The-Girl-Who-Raised-Pigeons | said | 0.133 | 10 |


#### Write to a CSV File

In [None]:
filename = "tfidf_Lost-in-The-City.csv"
top_tfidf.to_csv(filename, encoding='UTF-8', index=False)

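Next, we're going to run the same steps on a second corpus, a collection of U.S. Inaugural Addresses.

In [155]:
directory_path = "../texts/history/US_Inaugural_Addresses"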

In [156]:
text_files = glob.glob(f"{directory_path}/*.txt")

In [157]:
text_titles = [Path(text).stem for text in text_files]

In [158]:
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

**Make a DataFrame out of the tf–idf vector and sort by title**

In [159]:
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.sort_index()

**Add a row for the number of documents in which each word appears**

In [160]:
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()

In [163]:
tfidf_df = tfidf_df.drop('Document Frequency', errors='ignore')

In [166]:
top_tfidf = get_top_tfidf_scores(tfidf_df)
top_tfidf

| | story | word | tfidf_score | tfidf_rank |
|---|---|---|---|---|
| 0 | 01_washington_1789 | government | 0.114 | 1 |
| 1 | 01_washington_1789 | immutable | 0.104 | 2 |
| 2 | 01_washington_1789 | impressions | 0.104 | 3 |
| 3 | 01_washington_1789 | providential | 0.104 | 4 |
| 4 | 01_washington_1789 | ought | 0.104 | 5 |
| 5 | 01_washington_1789 | public | 0.103 | 6 |
| 6 | 01_washington_1789 | present | 0.098 | 7 |
| 7 | 01_washington_1789 | qualifications | 0.096 | 8 |
| 8 | 01_washington_1789 | peculiarly | 0.091 | 9 |
| 9 | 01_washington_1789 | article | 0.086 | 10 |


## Your Turn!

Take a few minutes to explore the dataframe below and then answer the following questions.

In [None]:
tfidf_compare

**1.** What is the difference between a tf–idf score and raw word frequency?

**Your answer here**

**2.** Based on the dataframe above, what is one potential problem or limitation that you notice with tf–idf scores?

**Your answer here**

**3.** What's another collection of texts that you think might be interesting to analyze with tf–idf scores? Why?

**Your answer here**