# TF-IDF with HathiTrust Data

In this lesson, we're going to learn about a text analysis method called *term frequency–inverse document frequency*, often abbreviated *tf-idf*.

While calculating the most frequent words in a text can be useful, the most frequent words usually aren't the most interesting ones, even if we get rid of stop words ("the," "and," "to," etc.). Tf-idf is a method that builds on word frequency but tries more specifically to identify the most distinctively frequent or significant words in a document.

In this lesson, we will cover how to:
- Calculate and normalize tf-idf scores for each chapter in Sandra Cisneros's *The House on Mango Street*
- Download and process HathiTrust extracted features — that is, word frequencies for books in the HathiTrust Digital Library (including in-copyright books like *The House on Mango Street*)
- Prepare HathiTrust extracted features for tf-idf analysis

## Dataset

### *The House on Mango Street* by Sandra Cisneros

```{epigraph}
 [T]he pigeon had taken a step and dropped from the ledge. He caught an upwind that took him nearly as high as the tops of the empty K Street houses. He flew farther into Northeast, into the color and sounds of the city's morning. She did nothing, aside from following him, with her eyes, with her heart, as far as she could.

--  Sandra Cisneros, "The Girl Who Raised Pigeons," *The House on Mango Street* (1993)
```

Sandra Cisneros's *The House on Mango Street* (1993) is a collection of 14 short stories set in Washington D.C. The first chapter, "The Girl Who Raised Pigeons," begins with a young girl raising homing pigeons on her roof.

How distinctive is a "pigeon" in the world of *The House on Mango Street*? What does this uniqueness (or lack thereof) tell us about the meaning of pigeons in the first chapter, "The Girl Who Raised Pigeons," and in the collection as a whole? These are just a few of the questions that we're going to try to answer with tf-idf.

If you already have a collection of plain text (.txt) files that you'd like to analyze, one of the easiest ways to calculate tf-idf scores is to use the Python library scikit-learn. It has a quick and nifty module called [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which does all the math for you behind the scenes. We will cover how to use the TfidfVectorizer in the next lesson.

In this lesson, however, we're going to calculate tf-idf scores manually because *The House on Mango Street* is still in-copyright, which means that, for legal reasons, we can't easily share or access plain text files of the book.

Luckily, the [HathiTrust Digital Library](https://www.hathitrust.org/)—which contains digitized books from Google Books as well as many university libraries—has released word frequencies per page for all 17 million books in its catalog. These word frequencies (plus part of speech tags) are otherwise known as "extracted features." There's a lot of text analysis that we can do with extracted features alone, including tf-idf.

So to calculate tf-idf scores for *The House on Mango Street*, we're going to use HathiTrust extracted features. That's why we're not using scikit-learn's TfidfVectorizer. It works well with plain text files but not as well with pre-counted extracted features.

## Breaking Down the TF-IDF Formula

But first, let's quickly discuss the tf-idf formula. The idea is pretty simple.

**tf-idf = term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in document

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1**\***

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), and you flip that fraction on its head (inverse document frequency). In the version we use here, you also take the logarithm of the flipped fraction and add 1. Then you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the *inverse*, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word "said" vs. the word "pigeons." The term "said" appears in 13 (document frequency) of 14 (total documents) *The House on Mango Street* stories, so its flipped fraction is 14 / 13: a smaller inverse document frequency. The term "pigeons" occurs in only 2 (document frequency) of the 14 stories (total documents), so its flipped fraction is 14 / 2: a bigger inverse document frequency and a bigger tf-idf boost.

\*There are a bunch of slightly different ways that you can calculate inverse document frequency. The version of idf that we're going to use is the [scikit-learn default](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer), which uses "smoothing"; that is, it adds a "1" to the numerator and denominator:

**inverse_document_frequency**  = log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1

```{margin}
> If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.  
> -[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)
```

### Let's test it out

We need the `log()` function for our calculation, otherwise known as [logarithm](https://en.wikipedia.org/wiki/Logarithm), so we're going to import the `numpy` package.

In [221]:
import numpy as np

**"said"**

In [230]:
total_number_of_documents = 14 ##total number of short stories in *The House on Mango Street*
number_of_documents_with_term = 13 ##number of short stories that contain the word "said"

In [231]:
term_frequency = 47 ##number of times "said" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = np.log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1

In [232]:
term_frequency * inverse_document_frequency

50.24266495988672

**"pigeons"**

In [233]:
total_number_of_documents = 14 ##total number of short stories in *The House on Mango Street*
number_of_documents_with_term = 2 ##number of short stories that contain the word "pigeons"

In [234]:
term_frequency = 30 ##number of times "pigeons" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = np.log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1

In [235]:
term_frequency * inverse_document_frequency

78.28313737302301

**tf–idf scores for "The Girl Who Raised Pigeons"**

"said" = 50.24<br>
"pigeons" = 78.28

Though the word "said" appears 47 times in "The Girl Who Raised Pigeons" and the word "pigeons" only appears 30 times, "pigeons" has a higher tf–idf score than "said" because it's a rarer word. The word "pigeons" appears in 2 of 14 stories, while "said" appears in 13 of 14 stories, almost all of them.
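
The two calculations above can also be wrapped in a small reusable function. The sketch below uses only Python's built-in `math` module, and the function names `smoothed_idf` and `tf_idf` are made up for illustration; they are not part of scikit-learn or any other library.

```python
import math

def smoothed_idf(total_docs, docs_with_term):
    """Smoothed inverse document frequency, following the formula above."""
    return math.log((1 + total_docs) / (1 + docs_with_term)) + 1

def tf_idf(term_frequency, total_docs, docs_with_term):
    """Raw (un-normalized) tf-idf: term frequency times smoothed idf."""
    return term_frequency * smoothed_idf(total_docs, docs_with_term)

# Same inputs as the two worked examples above
said_score = tf_idf(term_frequency=47, total_docs=14, docs_with_term=13)
pigeons_score = tf_idf(term_frequency=30, total_docs=14, docs_with_term=2)

print(round(said_score, 2), round(pigeons_score, 2))
```

Because the idf for "said" is barely above 1, its score stays close to its raw count, while the idf for "pigeons" more than doubles its count.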

## Get HathiTrust Extracted Features

Now let's try to calculate tf-idf scores for all the words in all the short stories in *The House on Mango Street*. To do so, we need word counts, or HathiTrust extracted features, for each story in the collection.

To work with HathiTrust's extracted features, we first need to install and import the [HathiTrust Feature Reader](https://github.com/htrc/htrc-feature-reader).

Install HathiTrust Feature Reader

In [None]:
!pip install htrc-feature-reader

Import necessary libraries

In [1]:
from htrc_features import Volume
import pandas as pd

In [121]:
pd.options.display.max_rows = 800

Then we need to locate the HathiTrust volume ID for *The House on Mango Street*. If we search the HathiTrust catalog for this book and then click on "Limited (search only)," it will take us to the following web page: https://babel.hathitrust.org/cgi/pt?id=uc1.32106012740764.

The HathiTrust volume ID for *The House on Mango Street* is the part of this URL that comes after `id=`: `uc1.32106012740764`.

### Make DataFrame of Word Frequencies From Volume(s)

#### Single Volume

To get HathiTrust extracted features for a single volume, we can create a [`Volume` object](https://github.com/htrc/htrc-feature-reader#volume) and use the `.tokenlist()` method.

For each page in *The House on Mango Street*, the resulting DataFrame lists the page number and section type as well as every word/token that appears on the page, its part of speech, and the number of times that word/token occurs on the page. There is one row for each token that appears on each page.

Let's look at a sample of rows from page 18 (the slice `[500:520]`).

In [3]:
Volume('uc1.32106012740764').tokenlist()[500:520]

| page | section | token | pos | count |
|---|---|---|---|---|
| 18 | body | moment | NN | 1 |
| 18 | body | much | JJ | 1 |
| 18 | body | my | PRP$ | 4 |
| 18 | body | n't | RB | 1 |
| 18 | body | neighborhood | NN | 1 |
| 18 | body | no | DT | 1 |
| 18 | body | no | RB | 1 |
| 18 | body | not | RB | 2 |
| 18 | body | odd | JJ | 1 |
| 18 | body | of | IN | 6 |


We can also get metadata for a HathiTrust volume by asking for [certain attributes](https://github.com/htrc/htrc-feature-reader#volume).

In [4]:
Volume('uc1.32106012740764').year

1994

In [5]:
Volume('uc1.32106012740764').page_count

168

In [6]:
Volume('uc1.32106012740764').publisher

['A.A. Knopf', 'Distributed by Random House']

#### Multiple Volumes

We might want to get extracted features for multiple volumes at the same time, so we're also going to practice a workflow that will allow us to read in multiple HathiTrust books, even though we're only reading in one book at this moment.

Insert list of desired HathiTrust volume(s)

In [7]:
volume_ids = ['uc1.32106012740764']

Loop through this list of volume IDs and make a DataFrame that includes extracted features, book title, and publication year, then make a list of all DataFrames.

In [88]:
all_tokens = []

for hathi_id in volume_ids:
    
    #Read in HathiTrust volume
    volume = Volume(hathi_id)
    
    #Make dataframe from token list -- do not include part of speech, sections, or case sensitivity
    token_df = volume.tokenlist(case=False, pos=False, drop_section=True)
    
    #Add book column
    token_df['book'] = volume.title
    
    #Add publication year column
    token_df['year'] = volume.year
    
    all_tokens.append(token_df)

Concatenate the list of DataFrames 

In [89]:
mango_df = pd.concat(all_tokens)

Preview the DataFrame

In [90]:
mango_df

| page | lowercase | count | book | year |
|---|---|---|---|---|
| 1 | - | 3 | The house on Mango Street / | 1994 |
| 1 | -\|-\|-\|- | 1 | The house on Mango Street / | 1994 |
| 1 | _ | 1 | The house on Mango Street / | 1994 |
| 1 | \| | 1 | The house on Mango Street / | 1994 |
| 1 | \|- | 4 | The house on Mango Street / | 1994 |
| ... | ... | ... | ... | ... |
| 166 | ſº | 1 | The house on Mango Street / | 1994 |
| 167 | 32106 | 1 | The house on Mango Street / | 1994 |
| 167 | o1274 | 1 | The house on Mango Street / | 1994 |
| 167 | o764 | 1 | The house on Mango Street / | 1994 |


Change from multi-level index to regular index with `reset_index()`

In [91]:
mango_df_flattened = mango_df.reset_index()
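
To see what `reset_index()` does, here is a minimal sketch on a toy DataFrame. The toy data is invented for illustration; only the index names "page" and "lowercase" are taken from the preview above.

```python
import pandas as pd

# Toy stand-in for the HathiTrust token list: counts indexed by (page, lowercase)
toy_index = pd.MultiIndex.from_tuples(
    [(29, 'house'), (29, 'street'), (30, 'house')],
    names=['page', 'lowercase'],
)
toy_df = pd.DataFrame({'count': [3, 1, 2]}, index=toy_index)

# reset_index() moves both index levels into ordinary columns
toy_flat = toy_df.reset_index()
print(toy_flat.columns.tolist())

# Once "lowercase" is a regular column, we can filter on it like any other column
house_rows = toy_flat[toy_flat['lowercase'] == 'house']
print(house_rows)
```

Note that when `tokenlist()` is called with `case=False`, the token column is named "lowercase," so that is the column name to use when filtering.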

In [92]:
#With case=False, the token column is named "lowercase" (there is no "token" column)
mango_df_flattened[mango_df_flattened['lowercase'] == 'march']

Nice! We now have a DataFrame of word counts per page for *The House on Mango Street*.

But what we need to move forward with tf-idf is a way of splitting this collection into its individual stories. Remember: to use tf-idf, we need a *collection* of texts because we need to compare word frequency for one document with all the other documents in the collection.

## Add story titles

How can we split up *The House on Mango Street* into individual stories?

Sometimes HathiTrust Extracted Features helpfully include "section" information for a book, such as chapter titles. Unfortunately, the extracted features for *The House on Mango Street* do not include chapter or story titles.

They do, however, include page numbers and, if you specify `volume.tokenlist(case=True)`, words with case sensitivity. When I manually combed through the HTRC token list with case sensitivity turned on, I noticed that the title page for each chapter seemed to format the title in all-caps. So I searched for all-caps words from each story title and noted down the corresponding page number. This should give us a marker of where every story begins and ends.

The function below assigns a story title to each page number in *The House on Mango Street*. The conditions are checked in order, so a page that satisfies two conditions (for example, page 64) is assigned to the earlier story.

In [227]:
def add_story_titles(page):
    if page >= 0 and page < 29:
        return "Front Matter"
    elif page >= 29 and page < 33:
        return "01: The House on Mango Street"
    elif page >= 33 and page < 35:
        return "02: Hairs"
    elif page >= 35 and page < 37:
        return "03: Boys & Girls"
    elif page >= 37 and page < 40:
        return "04: My Name"
    elif page >= 40 and page < 42:
        return "05: Cathy Queen of Cats"
    elif page >= 42 and page < 46:
        return "06: Our Good Day"
    elif page >= 46 and page < 48:
        return "07: Laughter"
    elif page >= 48 and page < 51:
        return "08: Gil's Furniture Bought & Sold"
    elif page >= 51 and page < 54:
        return "09: Meme Ortiz"
    elif page >= 54 and page < 56:
        return "10: Louie, His Cousin, & His Other Cousin"
    elif page >= 56 and page < 59:
        return "11: Marin"
    elif page >= 59 and page < 61:
        return "12: Those Who Don't"
    elif page >= 61 and page < 63:
        return "13: There Was an Old Woman She Had So Many Children She Didn't Know What to Do"
    elif page >= 63 and page <= 64:
        return "14: Alicia Who Sees Mice"
    elif page >= 64 and page <= 67:
        return "15: Darius & the Clouds"
    elif page >= 67 and page <= 72:
        return "16: And Some More"
    elif page >= 72 and page <= 77:
        return "17: The Family of Little Feet"
    elif page >= 77 and page <= 82:
        return "18: A Rice Sandwich"
    elif page >= 82 and page <= 84:
        return "19: Chanclas"
    elif page >= 84 and page <= 90:
        return "20: Hips"
    elif page >= 90 and page <= 94:
        return "21: The First Job"
    elif page >= 94 and page <= 96:
        return "22: Papa Who Wakes Up Tired in the Dark"
    elif page >= 96 and page <= 102:
        return "23: Born Bad"
    elif page >= 102 and page <= 106:
        return "24: Elenita, Cards, Palm, Water"
    elif page >= 106 and page <= 109:
        return "25: Geraldo No Last Name"
    elif page >= 109 and page <= 113:
        return "26: Edna's Ruthie"
    elif page >= 113 and page <= 116:
        return "27: The Earl of Tennessee"
    elif page >= 116 and page <= 119:
        return "28: Sire"
    elif page >= 119 and page <= 120:
        return "29: Four Skinny Trees"
    elif page >= 120 and page <= 124:
        return "30: No Speak English"
    elif page >= 124 and page <= 127:
        return "31: Rafaela Who Drinks Coconut & Papaya Juice on Tuesdays"
    elif page >= 127 and page <= 131:
        return "32: Sally"
    elif page >= 131 and page <= 133:
        return "33: Minerva Writes Poems"
    elif page >= 133 and page <= 136:
        return "34: Bums in the Attic"
    elif page >= 136 and page <= 137:
        return "35: Beautiful & Cruel"
    elif page >= 137 and page <= 139:
        return "36: Smart Cookie"
    elif page >= 139 and page <= 140:
        return "37: What Sally Said"
    elif page >= 140 and page <= 148:
        return "38: The Monkey Garden"
    elif page >= 148 and page <= 150:
        return "39: Red Clowns"
    elif page >= 150 and page <= 151:
        return "40: Linoleum Roses"
    elif page >= 151 and page <= 156:
        return "41: The Three Sisters"
    elif page >= 156 and page <= 158:
        return "42: Alicia & I Talking on Edna's Steps"
    elif page >= 158 and page <= 159:
        return "43: A House of My Own"
    elif page >= 159 and page <= 160:
        return "44: Mango Says Goodbye Sometimes"
    elif page > 160:
        return "Back Matter"
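
A long `if`/`elif` chain like this is easy to mistype, so an alternative is to keep the start pages in a list and look them up. The sketch below is a hypothetical rewrite, not the method used in this lesson: `STORY_STARTS` and `lookup_story` are made-up names, and only the first few stories are included. Until the remaining stories are appended, any page from 40 onward maps to the last entry.

```python
from bisect import bisect_right

# (first page, title) pairs, copied from the first few branches of add_story_titles;
# the remaining stories would be appended in the same way
STORY_STARTS = [
    (0, "Front Matter"),
    (29, "01: The House on Mango Street"),
    (33, "02: Hairs"),
    (35, "03: Boys & Girls"),
    (37, "04: My Name"),
    (40, "05: Cathy Queen of Cats"),
]
START_PAGES = [start for start, _ in STORY_STARTS]

def lookup_story(page):
    """Return the title whose start page is the largest one <= page."""
    position = bisect_right(START_PAGES, page) - 1
    if position < 0:
        return None  # page number below the first start page
    return STORY_STARTS[position][1]

print(lookup_story(29), '|', lookup_story(34), '|', lookup_story(39))
```

This sketch treats each start page as belonging to the new story. That matches the `<` branches of the function above but not necessarily the later `<=` branches, where a shared boundary page goes to the earlier story.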

Below we add a new column of story titles to the DataFrame by `apply()`ing our function to the "page" column and dumping the results to `mango_df_flattened['story']`. You can read more about applying functions in ["Pandas Basics - Part 3"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part3.html#applying-functions).

In [228]:
mango_df_flattened['story'] = mango_df_flattened['page'].apply(add_story_titles)

We're also going to drop the "Front Matter" and "Back Matter" from the DataFrame.

In [229]:
mango_df_flattened = mango_df_flattened.drop(mango_df_flattened[mango_df_flattened['story'] == 'Front Matter'].index)

In [230]:
mango_df_flattened = mango_df_flattened.drop(mango_df_flattened[mango_df_flattened['story'] == 'Back Matter'].index)

## Sum Word Counts For Each Story

Page-level information is great. But for tf-idf purposes, we really only care about the frequency of words for every story. Below we group by story and calculate the sum of word frequencies for all the pages in that story.

In [231]:
mango_df_flattened.groupby(['story', 'lowercase'])[['count']].sum().reset_index()

|  | story | lowercase | count |
|---|---|---|---|
| 0 | 01: The House on Mango Street | 'd | 5 |
| 1 | 01: The House on Mango Street | 're | 2 |
| 2 | 01: The House on Mango Street | 's | 4 |
| 3 | 01: The House on Mango Street | , | 30 |
| 4 | 01: The House on Mango Street | -- | 3 |
| ... | ... | ... | ... |
| 8829 | 44: Mango Says Goodbye Sometimes | who | 1 |
| 8830 | 44: Mango Says Goodbye Sometimes | why | 1 |
| 8831 | 44: Mango Says Goodbye Sometimes | will | 5 |
| 8832 | 44: Mango Says Goodbye Sometimes | with | 2 |


Notice how the "page" column no longer exists in the DataFrame and the number of rows has slimmed down to 8,833, one row for each distinct token in each story.

In [232]:
word_frequency_df = mango_df_flattened.groupby(['story', 'lowercase'])[['count']].sum().reset_index()
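
To make the group-and-sum step concrete, here is a small sketch on invented data. The column names mirror the ones used above, but the page numbers, counts, and short story labels are made up.

```python
import pandas as pd

# Toy page-level counts: "house" appears on two different pages of story "01"
toy_pages = pd.DataFrame({
    'page':      [29, 29, 30, 33, 33],
    'lowercase': ['house', 'street', 'house', 'hair', 'house'],
    'count':     [3, 1, 2, 4, 1],
    'story':     ['01', '01', '01', '02', '02'],
})

# Group by story and token, then add up the page-level counts
toy_story_counts = (
    toy_pages.groupby(['story', 'lowercase'])[['count']]
    .sum()
    .reset_index()
)
print(toy_story_counts)
```

The two page-level rows for "house" in story "01" collapse into a single row with a count of 5, and the "page" column disappears because it was not one of the grouping columns.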

## Remove Infrequent Words, Stopwords, & Punctuation

We will conclude with some final pre-processing steps. We will remove the list of stopwords defined below.

Make list of stopwords

In [233]:
STOPS = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
         'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
         'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
         'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
         'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
         'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
         'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
         'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
         'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
         'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
         'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
         'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp', "!"]

Remove stopwords

In [234]:
word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].isin(STOPS)].index)

We will also remove punctuation and numbers by using the regular expression `[^A-Za-z\s]`, which matches any character that is not an unaccented letter or whitespace. Any token that contains such a character is dropped from the DataFrame entirely.

In [235]:
word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].str.contains(r'[^A-Za-z\s]', regex=True)].index)
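
To get a feel for which tokens this pattern catches, the sketch below applies the same regular expression to a handful of sample tokens with Python's built-in `re` module. The token list is invented for illustration, though each token resembles ones seen in the previews above.

```python
import re

# Same pattern as above: any character that is not A-Z, a-z, or whitespace
NON_LETTER = re.compile(r'[^A-Za-z\s]')

toy_tokens = ['house', "n't", '--', 'o764', 'mango', ',']

# A token is dropped if the pattern matches anywhere inside it
kept = [token for token in toy_tokens if not NON_LETTER.search(token)]
dropped = [token for token in toy_tokens if NON_LETTER.search(token)]

print(kept)
print(dropped)
```

One consequence of this design is that contractions such as "n't" and tokens containing digits are removed as whole tokens rather than cleaned up, which is usually fine for tf-idf but worth keeping in mind.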

In [236]:
#Optional: keep only words that appear at least 5 times in a story (left disabled here)
#word_frequency_df_test = word_frequency_df[word_frequency_df['count'] >= 5]

In [237]:
word_frequency_df

|  | story | lowercase | count |
|---|---|---|---|
| 12 | 01: The House on Mango Street | always | 3 |
| 14 | 01: The House on Mango Street | anybody | 1 |
| 16 | 01: The House on Mango Street | around | 1 |
| 18 | 01: The House on Mango Street | asked | 1 |
| 20 | 01: The House on Mango Street | away | 1 |
| ... | ... | ... | ... |
| 8814 | 44: Mango Says Goodbye Sometimes | sometimes | 1 |
| 8815 | 44: Mango Says Goodbye Sometimes | street | 2 |
| 8816 | 44: Mango Says Goodbye Sometimes | strong | 1 |
| 8821 | 44: Mango Says Goodbye Sometimes | third | 1 |


## TF-IDF

### Term Frequency

We already have term frequencies for each document. Let's rename the columns so that they're consistent with the tf-idf vocabulary that we've been using.

In [238]:
word_frequency_df = word_frequency_df.rename(columns={'lowercase': 'term','count': 'term_frequency'})

In [239]:
word_frequency_df

|  | story | term | term_frequency |
|---|---|---|---|
| 12 | 01: The House on Mango Street | always | 3 |
| 14 | 01: The House on Mango Street | anybody | 1 |
| 16 | 01: The House on Mango Street | around | 1 |
| 18 | 01: The House on Mango Street | asked | 1 |
| 20 | 01: The House on Mango Street | away | 1 |
| ... | ... | ... | ... |
| 8814 | 44: Mango Says Goodbye Sometimes | sometimes | 1 |
| 8815 | 44: Mango Says Goodbye Sometimes | street | 2 |
| 8816 | 44: Mango Says Goodbye Sometimes | strong | 1 |
| 8821 | 44: Mango Says Goodbye Sometimes | third | 1 |


### Document Frequency

To calculate the number of documents or stories in which each term appears, we're going to create a separate DataFrame and do some Pandas manipulation and calculation.

In [240]:
document_frequency_df = (word_frequency_df.groupby(['story','term']).size().unstack()).sum().reset_index()

If you inspect parts of the complex chain of Pandas methods above (which is always a great way to learn!), you will see that we're momentarily reshaping the DataFrame to see if each term appears in each story...

In [241]:
word_frequency_df.groupby(['story','term']).size().unstack()

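| story | able | abuelito | accent | accident | account | ache | across | add | address | af | ... | yes | yesterday | yet | yolanda | young | youngest | yous | yup | zeze | zillion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01: The House on Mango Street | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 02: Hairs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
| 03: Boys & Girls | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 04: My Name | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |
| 05: Cathy Queen of Cats | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 06: Our Good Day | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 07: Laughter | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 08: Gil's Furniture Bought & Sold | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 09: Meme Ortiz | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10: Louie, His Cousin, & His Other Cousin | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |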


Then we're adding up how many stories each term appears in (`.sum()`) and resetting the index (`.reset_index()`) to make a DataFrame.

Finally, we will rename the column in this DataFrame and merge it into our word frequency DataFrame.

In [242]:
document_frequency_df = document_frequency_df.rename(columns={0:'document_frequency'})

In [243]:
word_frequency_df = word_frequency_df.merge(document_frequency_df, on='term')
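
As a cross-check, document frequency can also be computed more directly by counting the distinct stories for each term. The sketch below is an alternative to the `unstack()` approach above, not the method used in this lesson. It runs on invented data, and the names `toy_tf`, `toy_doc_freq`, and `toy_merged` are made up.

```python
import pandas as pd

# Toy term frequencies: "house" appears in three stories, the other terms in one
toy_tf = pd.DataFrame({
    'story': ['01', '01', '02', '02', '03'],
    'term':  ['house', 'mango', 'house', 'hair', 'house'],
    'term_frequency': [17, 7, 2, 9, 1],
})

# Document frequency: in how many distinct stories does each term appear?
toy_doc_freq = (
    toy_tf.groupby('term')['story']
    .nunique()
    .rename('document_frequency')
    .reset_index()
)

# Attach each term's document frequency to every (story, term) row
toy_merged = toy_tf.merge(toy_doc_freq, on='term')
print(toy_merged)
```

This version produces integer counts rather than the floats (such as 15.0) that result from summing over missing values, but the numbers themselves should agree.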

Now we have term frequency and document frequency.

In [244]:
word_frequency_df

|  | story | term | term_frequency | document_frequency |
|---|---|---|---|---|
| 0 | 01: The House on Mango Street | always | 3 | 15.0 |
| 1 | 04: My Name | always | 1 | 15.0 |
| 2 | 11: Marin | always | 1 | 15.0 |
| 3 | 17: The Family of Little Feet | always | 1 | 15.0 |
| 4 | 18: A Rice Sandwich | always | 2 | 15.0 |
| ... | ... | ... | ... | ... |
| 6051 | 44: Mango Says Goodbye Sometimes | ache | 1 | 1.0 |
| 6052 | 44: Mango Says Goodbye Sometimes | ghost | 1 | 1.0 |
| 6053 | 44: Mango Says Goodbye Sometimes | march | 1 | 1.0 |
| 6054 | 44: Mango Says Goodbye Sometimes | pack | 1 | 1.0 |


As you can see in the DataFrame above, the term "always" appears 3 times in the story "01: The House on Mango Street" (term frequency), and it appears in 15 different stories in the collection overall (document frequency).

### Total Number of Documents 

To calculate the total number of documents in the collection, we count how many unique values are in the "story" column. The answer should match the 44 story titles that we assigned with `add_story_titles()` (the front and back matter have already been dropped).

In [245]:
total_number_of_documents = mango_df_flattened['story'].nunique()

In [246]:
total_number_of_documents

44

### Inverse Document Frequency

As we previously established, there are a lot of slightly different versions of the tf-idf formula, but we're going to use the default version from the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) that adds "smoothing" to inverse document frequency.

```
inverse_document_frequency = log [ (1 + total number of docs) / (1 + document frequency) ] + 1
```

In [247]:
import numpy as np

In [248]:
word_frequency_df['idf'] = np.log((1 + total_number_of_documents) / (1 + word_frequency_df['document_frequency'])) + 1

### TF-IDF

Finally, we will calculate tf-idf by multiplying term frequency and inverse document frequency together.

In [249]:
word_frequency_df['tfidf'] = word_frequency_df['term_frequency'] * word_frequency_df['idf']

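Then we will normalize these values with the scikit-learn library. With `axis=0` and `norm='l2'`, the whole "tfidf" column is rescaled so that its squared values sum to 1. Every score is divided by the same constant, so the ranking of terms does not change.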

In [250]:
from sklearn import preprocessing

In [251]:
word_frequency_df['tfidf_normalized'] = preprocessing.normalize(word_frequency_df[['tfidf']], axis=0, norm='l2')
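
For intuition, L2 normalization of a single column amounts to dividing every value by the square root of the sum of squared values. The sketch below reproduces that arithmetic with Python's built-in `math` module on three sample scores. The names are made up, and because the real column contains thousands of scores, these toy results will not match the values in the DataFrame.

```python
import math

# Three sample tf-idf scores (a tiny stand-in for the full "tfidf" column)
toy_tfidf = [29.165541, 16.861371, 16.760012]

# L2 norm of the column: square root of the sum of squared scores
column_norm = math.sqrt(sum(score ** 2 for score in toy_tfidf))

# Every score is divided by the same constant
toy_normalized = [score / column_norm for score in toy_tfidf]

print([round(value, 3) for value in toy_normalized])
```

After this rescaling the squared values sum to 1, and the relative sizes of the scores are unchanged.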

We did it! Now let's inspect the top 15 words with the highest tf-idf scores for each story in the collection. We can start by selecting all of the rows for the first story.

In [None]:
word_frequency_df[word_frequency_df['story'] == '01: The House on Mango Street']

In [255]:
sorted_df = word_frequency_df.sort_values(by=['story','tfidf_normalized'], ascending=[True,False]).groupby(['story']).head(15)
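
The sort-then-`head()` pattern above is a common way to take the top rows within each group. Here is a minimal sketch on invented data that keeps the top 2 terms per story; the scores and short story labels are made up for illustration.

```python
import pandas as pd

toy_scores = pd.DataFrame({
    'story': ['02', '01', '01', '02', '01', '02'],
    'term':  ['hair', 'house', 'yard', 'bread', 'mango', 'rain'],
    'tfidf_normalized': [0.062, 0.073, 0.034, 0.019, 0.042, 0.014],
})

# Sort by story (ascending) and score (descending), then keep the
# first 2 rows that appear within each story
toy_top2 = (
    toy_scores
    .sort_values(by=['story', 'tfidf_normalized'], ascending=[True, False])
    .groupby('story')
    .head(2)
)
print(toy_top2)
```

Because `head()` keeps rows in their sorted order, the result lists each story's highest-scoring terms first.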

In [257]:
sorted_df[sorted_df['story'] == '01: The House on Mango Street']

|  | story | term | term_frequency | document_frequency | idf | tfidf | tfidf_normalized |
|---|---|---|---|---|---|---|---|
| 447 | 01: The House on Mango Street | house | 17 | 21.0 | 1.71562 | 29.165541 | 0.07314 |
| 726 | 01: The House on Mango Street | mango | 7 | 10.0 | 2.408767 | 16.861371 | 0.042284 |
| 1352 | 01: The House on Mango Street | would | 9 | 18.0 | 1.862224 | 16.760012 | 0.04203 |
| 878 | 01: The House on Mango Street | papa | 6 | 8.0 | 2.609438 | 15.656627 | 0.039263 |
| 1098 | 01: The House on Mango Street | street | 7 | 14.0 | 2.098612 | 14.690286 | 0.036839 |
| 1076 | 01: The House on Mango Street | small | 4 | 3.0 | 3.420368 | 13.681473 | 0.03431 |
| 1370 | 01: The House on Mango Street | yard | 4 | 3.0 | 3.420368 | 13.681473 | 0.03431 |
| 1081 | 01: The House on Mango Street | stairs | 5 | 7.0 | 2.727221 | 13.636105 | 0.034196 |
| 674 | 01: The House on Mango Street | loomis | 3 | 2.0 | 3.70805 | 11.124151 | 0.027896 |
| 1166 | 01: The House on Mango Street | third | 3 | 2.0 | 3.70805 | 11.124151 | 0.027896 |


In [258]:
sorted_df[sorted_df['story'] == '02: Hairs']

|  | story | term | term_frequency | document_frequency | idf | tfidf | tfidf_normalized |
|---|---|---|---|---|---|---|---|
| 1434 | 02: Hairs | hair | 9 | 7.0 | 2.727221 | 24.544989 | 0.061552 |
| 1578 | 02: Hairs | snoring | 2 | 1.0 | 4.113515 | 8.227031 | 0.020631 |
| 1404 | 02: Hairs | bread | 2 | 2.0 | 3.70805 | 7.4161 | 0.018598 |
| 1603 | 02: Hairs | warm | 2 | 2.0 | 3.70805 | 7.4161 | 0.018598 |
| 443 | 02: Hairs | holding | 2 | 5.0 | 3.014903 | 6.029806 | 0.015121 |
| 1534 | 02: Hairs | rain | 2 | 6.0 | 2.860752 | 5.721505 | 0.014348 |
| 1569 | 02: Hairs | smell | 2 | 7.0 | 2.727221 | 5.454442 | 0.013678 |
| 540 | 02: Hairs | like | 5 | 41.0 | 1.068993 | 5.344964 | 0.013404 |
| 879 | 02: Hairs | papa | 2 | 8.0 | 2.609438 | 5.218876 | 0.013088 |
| 1401 | 02: Hairs | bake | 1 | 1.0 | 4.113515 | 4.113515 | 0.010316 |


In [259]:
sorted_df[sorted_df['story'] == '07: Laughter']

|  | story | term | term_frequency | document_frequency | idf | tfidf | tfidf_normalized |
|---|---|---|---|---|---|---|---|
| 3014 | 07: Laughter | mexico | 3 | 2.0 | 3.70805 | 11.124151 | 0.027896 |
| 2740 | 07: Laughter | lucy | 3 | 9.0 | 2.504077 | 7.512232 | 0.018839 |
| 545 | 07: Laughter | like | 7 | 41.0 | 1.068993 | 7.48295 | 0.018765 |
| 469 | 07: Laughter | houses | 2 | 2.0 | 3.70805 | 7.4161 | 0.018598 |
| 2994 | 07: Laughter | laughter | 2 | 2.0 | 3.70805 | 7.4161 | 0.018598 |
| 2764 | 07: Laughter | rachel | 3 | 10.0 | 2.408767 | 7.226302 | 0.018122 |
| 2989 | 07: Laughter | exactly | 2 | 3.0 | 3.420368 | 6.840736 | 0.017155 |
| 1719 | 07: Laughter | right | 3 | 12.0 | 2.241713 | 6.725139 | 0.016865 |
| 764 | 07: Laughter | nenny | 3 | 16.0 | 1.973449 | 5.920347 | 0.014847 |
| 1425 | 07: Laughter | family | 2 | 7.0 | 2.727221 | 5.454442 | 0.013678 |


In [260]:
sorted_df[sorted_df['story'] == '18: A Rice Sandwich']

|  | story | term | term_frequency | document_frequency | idf | tfidf | tfidf_normalized |
|---|---|---|---|---|---|---|---|
| 2011 | 18: A Rice Sandwich | new | 6 | 6.0 | 2.860752 | 17.164514 | 0.043044 |
| 717 | 18: A Rice Sandwich | mama | 7 | 14.0 | 2.098612 | 14.690286 | 0.036839 |
| 4618 | 18: A Rice Sandwich | less | 3 | 1.0 | 4.113515 | 12.340546 | 0.030947 |
| 4703 | 18: A Rice Sandwich | superior | 3 | 1.0 | 4.113515 | 12.340546 | 0.030947 |
| 4563 | 18: A Rice Sandwich | dress | 4 | 5.0 | 3.014903 | 12.059612 | 0.030242 |
| 1691 | 18: A Rice Sandwich | kids | 5 | 10.0 | 2.408767 | 12.043836 | 0.030203 |
| 1043 | 18: A Rice Sandwich | school | 5 | 12.0 | 2.241713 | 11.208566 | 0.028108 |
| 3710 | 18: A Rice Sandwich | letter | 3 | 2.0 | 3.70805 | 11.124151 | 0.027896 |
| 4650 | 18: A Rice Sandwich | nacho | 3 | 2.0 | 3.70805 | 11.124151 | 0.027896 |
| 4717 | 18: A Rice Sandwich | uncle | 3 | 2.0 | 3.70805 | 11.124151 | 0.027896 |


In [252]:
word_frequency_df.sort_values(by=['story','tfidf_normalized'], ascending=[True,False]).groupby(['story']).head(15)

|  | story | term | term_frequency | document_frequency | idf | tfidf | tfidf_normalized |
|---|---|---|---|---|---|---|---|
| 447 | 01: The House on Mango Street | house | 17 | 21.0 | 1.71562 | 29.165541 | 0.07314 |
| 726 | 01: The House on Mango Street | mango | 7 | 10.0 | 2.408767 | 16.861371 | 0.042284 |
| 1352 | 01: The House on Mango Street | would | 9 | 18.0 | 1.862224 | 16.760012 | 0.04203 |
| 878 | 01: The House on Mango Street | papa | 6 | 8.0 | 2.609438 | 15.656627 | 0.039263 |
| 1098 | 01: The House on Mango Street | street | 7 | 14.0 | 2.098612 | 14.690286 | 0.036839 |
| 1076 | 01: The House on Mango Street | small | 4 | 3.0 | 3.420368 | 13.681473 | 0.03431 |
| 1370 | 01: The House on Mango Street | yard | 4 | 3.0 | 3.420368 | 13.681473 | 0.03431 |
| 1081 | 01: The House on Mango Street | stairs | 5 | 7.0 | 2.727221 | 13.636105 | 0.034196 |
| 674 | 01: The House on Mango Street | loomis | 3 | 2.0 | 3.70805 | 11.124151 | 0.027896 |
| 1166 | 01: The House on Mango Street | third | 3 | 2.0 | 3.70805 | 11.124151 | 0.027896 |


It turns out that "house" is the most distinctive word in the first story, "The House on Mango Street," with a normalized tf-idf score of .073, followed by "mango" (.042) and "would" (.042). In "Hairs," the word "hair" stands out with a normalized tf-idf score of .062.

What are some other distinctive words in *The House on Mango Street*?

## Further Resources

- Peter Organisciak and Boris Capitanu, ["Text Mining in Python through the HTRC Feature Reader,"](https://programminghistorian.org/en/lessons/text-mining-with-extracted-features) *The Programming Historian*
