# Vocabulary Analysis Workshop

## Visualizing

An important part of exploring a data set is producing easy-to-understand visualizations that describe or summarize the data. For structured data we have many well-known visualizations, such as scatter plots, pie charts, and bar charts. How can we use some of these in our vocabulary analysis? Are there ways of characterizing our data that are especially suited to text?

In [None]:
from __future__ import division, print_function

%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import pickle

from vocab_analysis import *

import answers

In [None]:
jobs_df = pd.read_pickle('./data/tokenized.pickle')

In [None]:
jobs_df.head()

We know that our data divides into segments. Ultimately, we will try to build models that predict membership in these segments.

In [None]:
segments = [
    pd.Series(jobs_df['experience'] == '5+', name='5+ years experience'),
    pd.Series(jobs_df['experience'] == '2-5', name='2-5 years experience'),
    pd.Series(jobs_df['experience'] == '1-2', name='1-2 years experience'),
    pd.Series(jobs_df['education'] == 'ms-or-phd-needed', name="Master's Degree or PhD"),
    pd.Series(jobs_df['education'] == 'bs-degree-needed', name="Bachelor's Degree"),
    pd.Series(jobs_df['education'] == 'associate-needed', name="Associate's Degree"),
    pd.Series(jobs_df['is_hourly'], name='Hourly'),
    pd.Series(jobs_df['is_part_time'], name='Part-time'),
    pd.Series(jobs_df['is_supervisor'], name='Supervising'),
]

Let's pickle these for future use.

In [None]:
with open('./data/segments.pickle', 'wb') as fp:
    pickle.dump(segments, fp)
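Each segment is a named boolean `pd.Series`, so it can be used directly as a row mask. The sketch below shows this on a small made-up frame; `toy_df` and `toy_segments` are hypothetical stand-ins for `jobs_df` and `segments`, and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for jobs_df; the real frame is loaded from tokenized.pickle.
toy_df = pd.DataFrame({
    'experience': ['5+', '2-5', '5+', '1-2'],
    'is_hourly': [False, True, True, False],
})

# Same construction as the segments list above, applied to the toy data.
toy_segments = [
    pd.Series(toy_df['experience'] == '5+', name='5+ years experience'),
    pd.Series(toy_df['is_hourly'], name='Hourly'),
]

# A segment is a boolean mask: sum() counts members, mean() gives their share.
segment_sizes = {seg.name: int(seg.sum()) for seg in toy_segments}
segment_shares = {seg.name: float(seg.mean()) for seg in toy_segments}

# Boolean indexing selects the rows that belong to a segment.
senior_rows = toy_df[toy_segments[0]]

print(segment_sizes)
print(segment_shares)
print(senior_rows)
```

Checking segment sizes this way is a quick sanity test before modeling, since a very small segment will give noisy vocabulary statistics.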

First, let's compute $\sum_{d \in D}{\mbox{TF}(t, d)}$ and $\mbox{IDF}(t)$ for every term $t$ in the vocabulary.

In [None]:
avg_tfidf_df = calculate_avg_tfidf(jobs_df['tokens'])

In [None]:
avg_tfidf_df.head()
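To make the three columns concrete, here is a minimal pure-Python sketch of how `sum_tf`, `idf`, and `avg_tfidf` could be computed from lists of tokens. It is not the implementation in `vocab_analysis.py`, which is not shown here. The function name `sketch_tfidf_stats` is hypothetical, and the sketch assumes that $\mbox{TF}(t, d)$ is the raw count of $t$ in $d$, that $\mbox{IDF}(t) = \log(N / \mbox{df}_t)$, and that the average is taken over all $N$ documents.

```python
import math
from collections import Counter


def sketch_tfidf_stats(token_lists):
    """Return {term: (sum_tf, idf, avg_tfidf)} for a list of token lists.

    Assumptions (not taken from vocab_analysis): TF(t, d) is the raw count
    of t in d, and IDF(t) = log(N / df_t) for a corpus of N documents.
    """
    n_docs = len(token_lists)
    sum_tf = Counter()    # total occurrences of each term over the corpus
    doc_freq = Counter()  # number of documents that contain each term
    for tokens in token_lists:
        counts = Counter(tokens)
        sum_tf.update(counts)
        doc_freq.update(counts.keys())

    stats = {}
    for term, total in sum_tf.items():
        idf = math.log(n_docs / doc_freq[term])
        # IDF does not depend on the document, so TF.IDF summed over all
        # documents is sum_tf * idf; dividing by N gives the average.
        stats[term] = (total, idf, total * idf / n_docs)
    return stats


toy_docs = [
    ['manage', 'team', 'team'],
    ['manage', 'budget'],
    ['manage', 'team'],
]
toy_stats = sketch_tfidf_stats(toy_docs)
for term, (total, idf, avg) in sorted(toy_stats.items()):
    print(term, total, round(idf, 3), round(avg, 3))
```

In this toy corpus `manage` appears in every document, so its IDF is zero and its average weight vanishes no matter how often it occurs.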

One way to visualize these values is a scatter plot of summed $\mbox{TF}$ against $\mbox{IDF}$, with each point colored by its average $\mbox{TF.IDF}$.

In [None]:
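def plot_tfidf_freqs(avg_tfidf_df, n, title='sum TF vs IDF', ax=None):
    """Scatter summed TF against IDF, colored by average TF.IDF.

    `n` (the number of documents) is accepted but not used by the plot.
    """
    # Create a new square figure unless the caller supplied an axis.
    if ax is None:
        fig = plt.figure(figsize=(10, 10))
        ax = fig.add_subplot(111)

    ax.scatter(avg_tfidf_df.sum_tf, avg_tfidf_df.idf,
               c=avg_tfidf_df.avg_tfidf, cmap=plt.cm.coolwarm)
    # Anchor both axes at zero so the point cloud is not cropped.
    ax.set_xbound(0, avg_tfidf_df.sum_tf.max())
    ax.set_ybound(0, avg_tfidf_df.idf.max())
    ax.set_xlabel('sum TF')
    ax.set_ylabel('IDF')
    ax.set_title(title)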

In [None]:
plot_tfidf_freqs(avg_tfidf_df, len(jobs_df))

In [None]:
avg_tfidf_df.query('sum_tf > 20000')

In [None]:
avg_tfidf_df.query('sum_tf > 10000 and idf > 2')

It seems that our formulation of $\mbox{TF.IDF}$ has a limitation: some terms in our vocabulary have such a high $\mbox{TF}$ that their $\mbox{IDF}$ barely affects the result.
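A small worked example shows why this happens when the weight is a plain product of term frequency and IDF. The numbers below are hypothetical, chosen only for illustration, and are not taken from the job data.

```python
# Hypothetical values for illustration; they do not come from the data set.
common_term = {'sum_tf': 20000, 'idf': 0.5}  # frequent and found in most documents
rare_term = {'sum_tf': 50, 'idf': 5.0}       # distinctive but infrequent


def weight(term):
    # Unnormalized weight: total term frequency multiplied by IDF.
    return term['sum_tf'] * term['idf']


common_weight = weight(common_term)
rare_weight = weight(rare_term)

print('common term weight:', common_weight)
print('rare term weight:', rare_weight)
print('ratio:', common_weight / rare_weight)
```

Even though the rare term has ten times the IDF, the common term still ends up with the far larger weight, because raw counts grow much faster than IDF can shrink.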

#### Wordclouds

Word clouds are an easy-to-digest visualization that is especially well suited to text.

To produce a wordcloud you need your vocabulary and a weight for each term. Sometimes the weight is just the number of occurrences in a document or corpus ($n_t$ from our $\mbox{TF.IDF}$ formula). We will be using average $\mbox{TF.IDF}$, but let's also look at the wordclouds weighted by $\mbox{TF}$ and by $\mbox{IDF}$.

In [None]:
wordcloud(avg_tfidf_df['avg_tfidf'])

In [None]:
wordcloud(avg_tfidf_df['sum_tf'])

In [None]:
wordcloud(avg_tfidf_df['idf'])
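The core idea behind any word cloud is a mapping from term weight to display size. The sketch below illustrates that mapping with simple linear scaling. It is not how the `wordcloud` helper in `vocab_analysis.py` works internally (that code is not shown here); the function `scale_to_font_sizes`, its size limits, and the toy weights are all assumptions made for illustration.

```python
def scale_to_font_sizes(weights, min_size=10, max_size=60):
    """Map {term: weight} to {term: font size} by linear scaling.

    Hypothetical helper: the smallest weight gets min_size and the
    largest gets max_size.
    """
    low, high = min(weights.values()), max(weights.values())
    span = (high - low) or 1.0  # avoid dividing by zero if all weights are equal
    return {
        term: min_size + (w - low) / span * (max_size - min_size)
        for term, w in weights.items()
    }


# Toy weights, invented for illustration.
toy_weights = {'manage': 0.0, 'team': 0.4, 'budget': 0.2}
font_sizes = scale_to_font_sizes(toy_weights)
for term, size in sorted(font_sizes.items(), key=lambda item: -item[1]):
    print(term, size)
```

Because the scaling is relative, the same vocabulary can look very different depending on the weight used, which is why the three clouds above are worth comparing side by side.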

We should also look at the wordclouds by segment. To make these visualizations easier to use, a helper function is included in `vocab_analysis.py`. It calculates $\mbox{TF.IDF}$, then produces the scatter plot of $\mbox{TF}$ vs $\mbox{IDF}$ and the $\mbox{TF.IDF}$-based wordclouds for the overall corpus as well as for each segment.

In [None]:
analyze(jobs_df, 'tokens', segments)

Let's turn our attention to improving our processing. We need to do something to deal with how messy our tokens are.

### NEXT => [4. Stemming and Lemmatization](4. Stemming and Lemmatization.ipynb)