# Vocabulary Analysis Workshop

## $\mbox{TF.IDF}$

The motivation for $\mbox{TF.IDF}$ is to find the words that make a document stand out. These words are considered important for that document. A word that occurs in most documents is not very informative. Similarly, a word that occurs only once in a single document is not useful for summarizing our text. We want to see the words that occur often in a limited number of documents. This is why we are interested in both the number of times a word occurs and the number of documents it occurs in.

$\mbox{TF}$ stands for term frequency  
$\mbox{IDF}$ stands for inverse document frequency

There are many flavors of $\mbox{TF.IDF}$; let's look at one of the more common formulations.

Although $\mbox{TF}$ stands for term frequency, raw counts are often used instead. Similarly, $\mbox{IDF}$ is often the $\log$ of the inverse document frequency.

Here is the mathematical definition for the flavor of $\mbox{TF.IDF}$ we will be using.

$$
\begin{array}{l}
D\ :=\ \text{a collection of documents}\\
d\ :=\ \text{a document in $D$}\\
t\ :=\ \text{a term}\\
N\ :=\ |D|\\
n_{t}\ :=\ |\{d\ :\ t \in d\}|\\
\mbox{TF}(t, d)\ :=\ \text{number of times $t$ occurs in $d$}\\
\mbox{IDF}(t)\ :=\ \log_2{(1+\frac{N}{n_{t}})}\\
\mbox{TF.IDF}(t, d)\ :=\ \mbox{TF}(t, d)\times\mbox{IDF}(t)\\
\end{array}
$$
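
To make the definitions concrete, here is a minimal sketch that evaluates them on a tiny, made-up corpus. The documents in `toy_docs` and the helper names `n_t`, `idf`, and `tfidf` are assumptions for illustration only; they are not part of the workshop's `vocab_analysis` module. The sketch is written for Python 3 and assumes the term appears in at least one document, so that $n_t > 0$.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [
    ["data", "scientist", "data"],
    ["data", "engineer"],
    ["sales", "manager"],
    ["data", "analyst"],
]

N = len(toy_docs)                          # N := |D|
bags = [Counter(doc) for doc in toy_docs]  # bags[i][t] is TF(t, d_i)


def n_t(term):
    """Number of documents that contain the term."""
    return sum(1 for bag in bags if term in bag)


def idf(term):
    """IDF(t) := log2(1 + N / n_t); assumes the term is in the corpus."""
    return math.log2(1 + N / n_t(term))


def tfidf(term, doc_ix):
    """TF.IDF(t, d) := TF(t, d) * IDF(t)."""
    return bags[doc_ix][term] * idf(term)


print(idf("data"))       # "data" occurs in 3 of the 4 documents
print(idf("scientist"))  # "scientist" occurs in 1 of the 4 documents
print(tfidf("data", 0))  # "data" occurs twice in the first document
```

The rarer term gets the larger $\mbox{IDF}$, while a term that is absent from a document gets a $\mbox{TF.IDF}$ of zero for that document because its $\mbox{TF}$ is zero.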

We will be looking at the average $\mbox{TF.IDF}$ of each term over all documents:


$$
\begin{align*}
\overline{\mbox{TF.IDF}}(t)\ &=\ \frac{\sum_{d \in D}{\mbox{TF.IDF}(t, d)}}{N}\\
&=\ \frac{\sum_{d \in D}{\mbox{TF}(t, d)\times\mbox{IDF}(t)}}{N}\\
&=\ \mbox{IDF}(t)\times\frac{\sum_{d \in D}{\mbox{TF}(t, d)}}{N}\\
\end{align*}
$$
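
The derivation above can be checked numerically. The following is a small sketch, assuming a made-up corpus (`toy_docs`) and Python 3; it computes the average both ways, once term by term and once with $\mbox{IDF}(t)$ factored out of the sum.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [
    ["data", "scientist", "data"],
    ["data", "engineer"],
    ["sales", "manager"],
    ["data", "analyst"],
]

N = len(toy_docs)
bags = [Counter(doc) for doc in toy_docs]

term = "data"
n_t = sum(1 for bag in bags if term in bag)  # documents containing the term
idf = math.log2(1 + N / n_t)

# First line of the derivation: average the per-document TF.IDF values.
avg_direct = sum(bag[term] * idf for bag in bags) / N

# Last line of the derivation: IDF(t) times the average TF.
avg_factored = idf * sum(bag[term] for bag in bags) / N

print(avg_direct, avg_factored)
```

Because $\mbox{IDF}(t)$ does not depend on $d$, only the sum of $\mbox{TF}$ per term has to be accumulated while scanning the documents.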

As one might imagine, this average is still susceptible to words whose $\mbox{TF}$ is high enough to outweigh the effect of $\mbox{IDF}$.

(See tf-idf on [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).)

We will produce two kinds of visualizations using $\mbox{TF.IDF}$.

1. A plot of $\mbox{TF}$ vs $\mbox{IDF}$
2. A word cloud, which displays our vocabulary with each word's size proportional to some weight (here, $\mbox{TF.IDF}$)

You will sometimes hear this kind of approach referred to as the _bag-of-words_ approach. This refers to how the documents are treated like _bags_. A _bag_ (AKA [_multiset_](https://en.wikipedia.org/wiki/Multiset)), in this context, is a collection of things with counts of occurrences.
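
As a quick illustration, Python's `collections.Counter` behaves like a bag. The token list below is made up for this sketch.

```python
from collections import Counter

# Hypothetical tokens from a single document.
tokens = ["data", "scientist", "with", "data", "skills"]

bag = Counter(tokens)   # maps each term to its number of occurrences
print(bag["data"])      # 2
print(bag["python"])    # 0, missing terms count as zero

# A bag ignores word order: the reversed document gives the same bag.
same_bag = Counter(reversed(tokens)) == bag
print(same_bag)         # True
```

The count stored for a term is exactly the raw $\mbox{TF}$ of that term in the document.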

In [None]:
from __future__ import division, print_function

%matplotlib inline

from collections import Counter, defaultdict
import numpy as np
import pandas as pd

from vocab_analysis import *

import answers

In [None]:
jobs_df = pd.read_pickle('./data/tokenized.pickle')

In [None]:
jobs_df.head()

In [None]:
def calculate_avg_tfidf(term_rows):
    bags = term_rows.apply(Counter) # convert the documents to bags, this will calculate the TF per document per term
    sum_tf = Counter() # this will hold the sum of the TF per term
    df = Counter() # this will calculate the raw DF (n_t from above)
    for bag in bags:
        sum_tf.update(bag)
        df.update(bag.keys())
    sum_tf = pd.Series(sum_tf)
    df = pd.Series(df)
    idf = np.log2(1 + len(term_rows) / df)
    sum_tfidf = sum_tf * idf # this will calculate the sum TF.IDF per term
    avg_tfidf = sum_tfidf / len(term_rows)  # this will calculate the average TF.IDF per term over the documents
    return pd.DataFrame({'sum_tf': sum_tf, 'idf': idf, 'avg_tfidf': avg_tfidf})

In [None]:
avg_tfidf_df = calculate_avg_tfidf(jobs_df['tokens'])

In [None]:
avg_tfidf_df.describe()

First, let's look at the distribution of $\sum_{d \in D}{\mbox{TF}(t, d)}$ vs $\mbox{IDF}(t)$.

In [None]:
avg_tfidf_df.sort_values('sum_tf').head()

In [None]:
avg_tfidf_df.sort_values('sum_tf', ascending=False).head()

In [None]:
avg_tfidf_df.sort_values('idf').head()

In [None]:
avg_tfidf_df.sort_values('idf', ascending=False).head()

In [None]:
avg_tfidf_df.sort_values('avg_tfidf').head()

In [None]:
avg_tfidf_df.sort_values('avg_tfidf', ascending=False).head()

When searching a collection of documents, the score of each document is often calculated as the sum of the $\mbox{TF.IDF}$ for each term in the query.


$$
\begin{array}{l}
D\ :=\ \text{a collection of documents}\\
d\ :=\ \text{a document in $D$}\\
q\ :=\ \text{a set of terms}\\
t\ :=\ \text{a term}\\
\mbox{TF.IDF}(t, d)\ :=\ \mbox{TF}(t, d)\times\mbox{IDF}(t)\\
\mbox{score}(q, d)\ :=\ \sum_{t \in q}{\mbox{TF.IDF}(t, d)}
\end{array}
$$
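
The scoring rule can be sketched on a tiny, made-up corpus before applying it to the real data. The document ids, tokens, and the helpers `idf` and `score` below are assumptions for illustration (Python 3). Giving an $\mbox{IDF}$ of zero to terms that appear in no document is also an assumption of this sketch, not something the definition above specifies.

```python
import math
from collections import Counter

# Hypothetical toy corpus keyed by document id.
toy_docs = {
    "doc_a": ["data", "scientist", "data"],
    "doc_b": ["data", "engineer"],
    "doc_c": ["sales", "manager"],
}
bags = {doc_id: Counter(tokens) for doc_id, tokens in toy_docs.items()}
N = len(bags)


def idf(term):
    """IDF(t) := log2(1 + N / n_t); unknown terms get 0.0 (an assumption)."""
    n_t = sum(1 for bag in bags.values() if term in bag)
    return math.log2(1 + N / n_t) if n_t else 0.0


def score(query_terms, doc_id):
    """score(q, d) := sum of TF.IDF(t, d) over the distinct terms in q."""
    return sum(bags[doc_id][term] * idf(term) for term in set(query_terms))


query = ["data", "scientist"]
ranking = sorted(bags, key=lambda doc_id: score(query, doc_id), reverse=True)
print(ranking)  # ['doc_a', 'doc_b', 'doc_c']
```

Here `doc_a` ranks first because it contains both query terms, and `doc_c` scores zero because it contains neither.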

Let's build a function for searching our corpus.
First, let's build our _index_ from documents to $\mbox{TF}$.

In [None]:
doc_index = jobs_df['tokens'].apply(Counter)
doc_index.head()

Now we need to build an _inverted index_ from terms to documents. This will let us quickly filter to a subset of documents for calculating $\mbox{TF.IDF}$.

In [None]:
inv_index = defaultdict(set)
for ix, bag in doc_index.items():  # Series.iteritems() was removed in pandas 2.0
    for term in bag:
        inv_index[term].add(ix)
inv_index = pd.Series(inv_index)
inv_index.head()
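
To see how an inverted index narrows down the candidate documents, here is a small sketch using plain dictionaries. The bags, the query terms, and the names `toy_bags`, `toy_inv_index`, and `candidates` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical index from document id to bag of terms.
toy_bags = {
    0: Counter(["data", "scientist", "data"]),
    1: Counter(["data", "engineer"]),
    2: Counter(["sales", "manager"]),
}

# Inverted index: term -> set of ids of the documents containing it.
toy_inv_index = defaultdict(set)
for doc_id, bag in toy_bags.items():
    for term in bag:
        toy_inv_index[term].add(doc_id)

# Candidate documents are the union of the sets for each query term.
# .get() avoids adding empty entries for terms outside the vocabulary.
query_terms = {"scientist", "engineer", "unicorn"}
candidates = set()
for term in query_terms:
    candidates |= toy_inv_index.get(term, set())

print(sorted(candidates))  # [0, 1]
```

Only the candidate documents need a $\mbox{TF.IDF}$ score; document 2 is never touched because it shares no term with the query.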

In [None]:
from my_tokenize import tokenize

In [None]:
def search(query, docs, doc_index, inv_index, idf, processing, limit=10):
    # always process your queries like you process your documents;
    # terms outside the vocabulary cannot match any document, so drop them
    terms = [term for term in set(processing(query)) if term in inv_index.index]
    if not terms:
        print('No documents match the query.')
        return
    term_idfs = idf[terms]
    filter_set_ixs = set()
    for term in terms:
        filter_set_ixs |= inv_index.loc[term]
    # we should only return documents that contain at least one word from the query
    # (pandas does not accept a set as an indexer, so convert it to a list)
    filter_set = doc_index.loc[list(filter_set_ixs)]
    tf_df = pd.DataFrame({term: filter_set.apply(lambda bag: bag[term]) for term in terms})
    tfidf_df = tf_df * term_idfs  # scale each term's TF column by that term's IDF
    score_df = tfidf_df.sum(axis=1).sort_values(ascending=False)
    for doc_id, score in score_df.head(limit).items():
        print('=' * 80)
        print(doc_id)
        print('=' * 30)
        print(docs.loc[doc_id])
        print('=' * 80)

In [None]:
search("data scientist", jobs_df['description'], doc_index, inv_index, avg_tfidf_df['idf'], tokenize)

The calculation of average $\mbox{TF.IDF}$ and the ability to search our documents are useful, but it would be nice to be able to visualize our analysis.

### NEXT => [3. Visualizing](3. Visualizing.ipynb)