# Vocabulary Analysis Workshop

## $\mbox{TF.IDF}$

The motivation for $\mbox{TF.IDF}$ is to find the words that make a document stand out. These words are considered important for the document. A word that occurs in most documents is usually not interesting to us. Similarly, a word that occurs only once, in a single document, is not useful for summarizing our text. We want to see the words that occur often in a limited number of documents. This is why we are interested in both the number of times a word occurs and the number of documents it occurs in.

$\mbox{TF}$ stands for term frequency  
$\mbox{IDF}$ stands for inverse document frequency

There are many flavors of $\mbox{TF.IDF}$. Let's look at one of the more common formulations.

Although $\mbox{TF}$ stands for term frequency, raw counts are often used instead. Similarly, $\mbox{IDF}$ is often the $\log$ of the inverse document frequency.

Here is the mathematical definition for the flavor of $\mbox{TF.IDF}$ we will be using.

$$
\begin{array}{l}
D\ :=\ \text{a collection of documents}\\
d\ :=\ \text{a document in $D$}\\
t\ :=\ \text{a term}\\
N\ :=\ |D|\\
n_{t}\ :=\ |\{d\ :\ t \in d\}|\\
\mbox{TF}(t, d)\ :=\ \text{number of times $t$ occurs in $d$}\\
\mbox{IDF}(t)\ :=\ \log_2{(1+\frac{N}{n_{t}})}\\
\mbox{TF.IDF}(t, d)\ :=\ \mbox{TF}(t, d)\times\mbox{IDF}(t)\\
\end{array}
$$
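To make the definitions concrete, here is a minimal sketch that evaluates them by hand on a tiny corpus. The three documents and the helper names `idf` and `tfidf` are made up for illustration; they are not part of the workshop code.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three tiny, already-tokenized documents.
docs = [
    ["data", "scientist", "data"],
    ["data", "engineer"],
    ["sales", "manager"],
]

N = len(docs)                          # N = |D|
bags = [Counter(doc) for doc in docs]  # TF(t, d) for every document d

# n_t: the number of documents that contain each term
n = Counter()
for bag in bags:
    n.update(bag.keys())


def idf(term):
    # IDF(t) = log2(1 + N / n_t)
    return math.log(1 + float(N) / n[term], 2)


def tfidf(term, doc_ix):
    # TF.IDF(t, d) = TF(t, d) * IDF(t)
    return bags[doc_ix][term] * idf(term)


print(idf("data"))       # log2(1 + 3/2): "data" is in two of three documents
print(idf("sales"))      # log2(1 + 3/1): "sales" is in one document
print(tfidf("data", 0))  # 2 * log2(1 + 3/2): "data" occurs twice in document 0
```

The rarer term ("sales") gets the larger $\mbox{IDF}$, while the repeated term ("data" in the first document) gets its weight from $\mbox{TF}$.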

We will be looking at the average $\mbox{TF.IDF}$ of each word over all documents:


$$
\begin{align*}
\overline{\mbox{TF.IDF}(t, d)}\ &=\ \frac{\sum_{d \in D}{\mbox{TF.IDF}(t, d)}}{N}\\
&=\ \frac{\sum_{d \in D}{\mbox{TF}(t, d)\times\mbox{IDF}(t)}}{N}\\
&=\ \mbox{IDF}(t)\times\frac{\sum_{d \in D}{\mbox{TF}(t, d)}}{N}\\
\end{align*}
$$

As one might imagine, this is still susceptible to words that have a high-enough $\mbox{TF}$ to diminish the effect of $\mbox{IDF}$.
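A small sketch of that effect, using the factored form $\mbox{IDF}(t)\times\frac{\sum_{d \in D}{\mbox{TF}(t, d)}}{N}$. The corpus and the helper `avg_tfidf` are invented for illustration. Note that with this flavor of $\mbox{IDF}$ the weight never drops below $\log_2(1 + N/N) = 1$, so a term that appears everywhere is never fully suppressed.

```python
import math
from collections import Counter

# Hypothetical corpus: "and" appears everywhere, "python" is concentrated in one document.
docs = [
    ["and"] * 10 + ["python"] * 3,
    ["and"] * 10,
    ["and"] * 10,
    ["and"] * 10,
]
N = len(docs)
bags = [Counter(doc) for doc in docs]


def avg_tfidf(term):
    n_t = sum(1 for bag in bags if term in bag)  # documents containing the term
    sum_tf = sum(bag[term] for bag in bags)      # total count over the corpus
    idf = math.log(1 + float(N) / n_t, 2)        # log2(1 + N / n_t)
    return idf * sum_tf / N                      # IDF(t) * (sum of TF) / N


print(avg_tfidf("and"))     # low IDF, but a very large total count
print(avg_tfidf("python"))  # higher IDF, but a small total count
```

Here the ubiquitous "and" still ends up with a larger average $\mbox{TF.IDF}$ than the distinctive "python", because its total count outweighs its low $\mbox{IDF}$.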

(tf-idf [wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf))

We will produce two kinds of visualizations using $\mbox{TF.IDF}$.

1. A plot of $\mbox{TF}$ vs $\mbox{IDF}$
2. A word cloud, which is where we display our vocabulary with size proportional to some weight ($\mbox{TF.IDF}$)

You will sometimes hear this kind of approach referred to as the _bag-of-words_ approach. The name refers to how the documents are treated like _bags_. A _bag_ (AKA [_multiset_](https://en.wikipedia.org/wiki/Multiset)), in this context, is a collection of things with counts of occurrences.
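As a quick illustration, Python's `collections.Counter` behaves like a bag. The token list below is a made-up example, not taken from the job data.

```python
from collections import Counter

# Hypothetical tokenized document.
tokens = ["data", "scientist", "with", "data", "skills"]

bag = Counter(tokens)
print(bag["data"])     # 2: the term occurs twice
print(bag["manager"])  # 0: a Counter reports 0 for terms it has not seen

# A bag ignores word order: the reversed document gives the same bag.
print(Counter(reversed(tokens)) == bag)  # True
```

The per-term counts in the bag are exactly the raw $\mbox{TF}$ values used above.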

In [1]:
from __future__ import division, print_function

%matplotlib inline

from collections import Counter, defaultdict
import numpy as np
import pandas as pd

from vocab_analysis import *

import answers

In [2]:
jobs_df = pd.read_pickle('./data/tokenized.pickle')

In [3]:
jobs_df.head()

id,description,experience,education,is_hourly,is_part_time,is_supervisor,tokens
0,THE COMPANY Employer is a midstream service...,5+,none,False,False,True,"[THE, COMPANY, Employer, is, a, midstream, ser..."
1,ICR Staffing is now accepting resumes for Indu...,2-5,none,False,False,False,"[ICR, Staffing, is, now, accepting, resumes, f..."
2,This is a great position for the right person....,none,none,False,True,False,"[This, is, a, great, position, for, the, right..."
3,A large multi-specialty health center is expan...,none,none,False,False,False,"[A, large, multi, -, specialty, health, center..."
4,JOB PURPOSE: The Account Director is respon...,5+,bs-degree-needed,False,False,True,"[JOB, PURPOSE, :, The, Account, Director, is, ..."


In [4]:
def calculate_avg_tfidf(term_rows):
    bags = term_rows.apply(Counter) # convert the documents to bags, this will calculate the TF per document per term
    sum_tf = Counter() # this will hold the sum of the TF per term
    df = Counter() # this will calculate the raw DF (n_t from above)
    for bag in bags:
        sum_tf.update(bag)
        df.update(bag.keys())
    sum_tf = pd.Series(sum_tf)
    df = pd.Series(df)
    idf = np.log2(1 + len(term_rows) / df)
    sum_tfidf = sum_tf * idf # this will calculate the sum TF.IDF per term
    avg_tfidf = sum_tfidf / len(term_rows)  # this will calculate the average TF.IDF per term over the documents
    return pd.DataFrame({'sum_tf': sum_tf, 'idf': idf, 'avg_tfidf': avg_tfidf})

In [5]:
avg_tfidf_df = calculate_avg_tfidf(jobs_df['tokens'])

In [6]:
avg_tfidf_df.describe()

statistic,avg_tfidf,idf,sum_tf
count,35005.0,35005.0,35005.0
mean,0.03141,10.455301,38.978117
std,0.192839,2.131884,697.780197
min,0.002765,1.008819,1.0
25%,0.002765,9.512082,1.0
50%,0.005072,11.095727,2.0
75%,0.016636,12.095397,8.0
max,14.572458,12.095397,61839.0


First, let's look at the distribution of $\sum_{d \in D}{\mbox{TF}(t, d)}$ vs. $\mbox{IDF}(t)$.

In [7]:
avg_tfidf_df.sort_values('sum_tf').head()

term,avg_tfidf,idf,sum_tf
Terrell,0.002765,12.095397,1
SDO,0.002765,12.095397,1
SDLO,0.002765,12.095397,1
SDH,0.002765,12.095397,1
SDC,0.002765,12.095397,1


In [8]:
avg_tfidf_df.sort_values('sum_tf', ascending=False).head()

term,avg_tfidf,idf,sum_tf
",",14.572458,1.030976,61839
and,14.43216,1.021578,61807
.,13.614096,1.008819,59041
to,7.681988,1.045957,32132
the,6.428974,1.110939,25318


In [9]:
avg_tfidf_df.sort_values('idf').head()

term,avg_tfidf,idf,sum_tf
.,13.614096,1.008819,59041
and,14.43216,1.021578,61807
",",14.572458,1.030976,61839
to,7.681988,1.045957,32132
a,4.933866,1.060095,20362


In [10]:
avg_tfidf_df.sort_values('idf', ascending=False).head()

term,avg_tfidf,idf,sum_tf
Terrell,0.002765,12.095397,1
MindBody,0.002765,12.095397,1
Milestone,0.002765,12.095397,1
employement,0.002765,12.095397,1
emploment,0.002765,12.095397,1


In [11]:
avg_tfidf_df.sort_values('avg_tfidf').head()

term,avg_tfidf,idf,sum_tf
Terrell,0.002765,12.095397,1
SWMPs,0.002765,12.095397,1
SWM,0.002765,12.095397,1
SWIFT,0.002765,12.095397,1
SWCC,0.002765,12.095397,1


In [12]:
avg_tfidf_df.sort_values('avg_tfidf', ascending=False).head()

term,avg_tfidf,idf,sum_tf
",",14.572458,1.030976,61839
and,14.43216,1.021578,61807
.,13.614096,1.008819,59041
•,9.81051,2.088409,20552
*,7.972788,2.121841,16439


When searching a collection of documents, the final score of each document is often calculated as the sum of the $\mbox{TF.IDF}$ of each term in the query.


$$
\begin{array}{l}
D\ :=\ \text{a collection of documents}\\
d\ :=\ \text{a document in $D$}\\
q\ :=\ \text{a set of terms}\\
t\ :=\ \text{a term}\\
\mbox{TF.IDF}(t, d)\ :=\ \mbox{TF}(t, d)\times\mbox{IDF}(t)\\
\mbox{score}(q, d)\ :=\ \sum_{t \in q}{\mbox{TF.IDF}(t, d)}
\end{array}
$$
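The scoring rule can be sketched in plain Python on a toy corpus. The document ids, tokens, and the helper `score` are hypothetical; query terms that never occur in the corpus are simply skipped here, which is an assumption of this sketch rather than part of the definition.

```python
import math
from collections import Counter

# Hypothetical toy corpus keyed by document id.
docs = {
    "a": ["data", "scientist", "data"],
    "b": ["data", "engineer"],
    "c": ["sales", "manager"],
}
N = len(docs)
bags = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

# Document frequency n_t for every term.
df = Counter()
for bag in bags.values():
    df.update(bag.keys())


def score(query_terms, doc_id):
    # score(q, d) = sum over t in q of TF(t, d) * IDF(t)
    bag = bags[doc_id]
    return sum(
        bag[t] * math.log(1 + float(N) / df[t], 2)
        for t in query_terms
        if t in df  # assumption: ignore terms unseen in the corpus
    )


query = {"data", "scientist"}
ranking = sorted(bags, key=lambda doc_id: score(query, doc_id), reverse=True)
print(ranking)  # document ids, best match first
```

Document "a" matches both query terms, "b" matches only "data", and "c" matches neither, so it scores 0.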

Let's build a function for searching our corpus.
First, let's build our _index_ from documents to $\mbox{TF}$.

In [13]:
doc_index = jobs_df['tokens'].apply(Counter)
doc_index.head()

id
0    {u'limited': 2, u'distributors': 2, u'-': 4, u...
1    {u'shop': 1, u'United': 1, u'background': 1, u...
2    {u'HEALTHCAREseeker': 1, u'being': 1, u'bring'...
3    {u'and': 3, u'dedicated': 1, u'be': 1, u'expan...
4    {u'limited': 3, u'all': 12, u'KNOWLEDGE': 2, u...
Name: tokens, dtype: object

Now we need to build an _inverted index_ from terms to documents. This will let us quickly filter to a subset of documents for calculating $\mbox{TF.IDF}$.

In [14]:
inv_index = defaultdict(set)
for ix, bag in doc_index.items():  # .items() also works on pandas versions where .iteritems() has been removed
    for term in bag:
        inv_index[term].add(ix)
inv_index = pd.Series(inv_index)
inv_index.head()

!     {3887, 2, 2051, 2052, 10, 4107, 2050, 4110, 15...
"     {3073, 2731, 5, 4102, 1032, 3887, 1548, 3597, ...
#     {3588, 2055, 4268, 4110, 2576, 17, 18, 1043, 2...
$     {2191, 1818, 1691, 1819, 4265, 4271, 823, 1211...
$.                                               {1512}
dtype: object
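To show how an inverted index narrows the candidate set, here is a small sketch with plain dictionaries instead of pandas. The three-document `doc_index` below is a hypothetical stand-in for the real one.

```python
from collections import Counter, defaultdict

# Hypothetical document index: document id -> bag of terms.
doc_index = {
    0: Counter(["data", "scientist"]),
    1: Counter(["data", "engineer"]),
    2: Counter(["sales", "manager"]),
}

# Inverted index: term -> set of ids of the documents that contain it.
inv_index = defaultdict(set)
for doc_id, bag in doc_index.items():
    for term in bag:
        inv_index[term].add(doc_id)

# Union of the posting sets gives every document that shares
# at least one term with the query.
candidates = set()
for term in ["data", "scientist"]:
    candidates |= inv_index.get(term, set())

print(sorted(candidates))  # [0, 1]
```

Only the candidate documents need to be scored; document 2 shares no term with the query and is never touched.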

In [15]:
from my_tokenize import tokenize

In [16]:
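def search(query, docs, doc_index, inv_index, idf, processing, limit=10):
    # always process your queries like you process your documents;
    # terms that never occur in the corpus cannot match anything, so skip them
    terms = [term for term in set(processing(query)) if term in inv_index.index]
    term_idfs = idf[terms]
    filter_set_ixs = set()
    for term in terms:
        filter_set_ixs |= inv_index.loc[term]
    # we should only return documents that contain at least one word from the query
    # (.loc needs a list here; indexing with a set fails on recent pandas versions)
    filter_set = doc_index.loc[sorted(filter_set_ixs)]
    # a Counter returns 0 for a missing term, so bag[term] is the TF even when the term is absent
    tf_df = pd.DataFrame({term: filter_set.apply(lambda bag: bag[term]) for term in terms})
    tfidf_df = tf_df * term_idfs # scale each term's column by that term's IDF
    scores = tfidf_df.sum(axis=1).sort_values(ascending=False) # score(q, d) per document
    for doc_id, score in scores.head(limit).items():
        print('=' * 80)
        print(doc_id)
        print('=' * 30)
        print(docs.loc[doc_id])
        print('=' * 80)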

In [17]:
search("data scientist", jobs_df['description'], doc_index, inv_index, avg_tfidf_df['idf'], tokenize)

3489
We have a  DB2 DBA fulltime opportunity in Houston, TX,  Job desc is as follows. ---------------- Interview type : Telephone, followed by Skype / in-person.  You will be  responsible for administration of the database management subsystems that reside on the mainframe and midrange platforms.These data base management systems include IMS and all variations of DB2.  ·Assist in the analysis and design of logical and physical data base structures including conceptual design, data modeling, and physical implementation  ·Consult with Application Development and Computing Division personnel on data base application performance, tuning, and debugging  ·Assist with the design and implementation of procedures to ensure recoverability of corporate data resources  ·Assist with the installation, maintenance, and administration of data base related software utilities and tools  ·Maintain multi-tier data base environments including object migrations between those tiers  ·Perform data base mainte

The calculation of average $\mbox{TF.IDF}$ and the ability to search our documents are useful, but it would be nice to be able to visualize our analysis.

### NEXT => [3. Visualizing](3. Visualizing.ipynb)