# Vocabulary Analysis Workshop

## Modeling

Now that we have explored the vocabulary of the different segments of this corpus, let's see if we can predict whether or not a document belongs in a given segment. We can use what we've learned about the vocabulary to make educated guesses about what features we should use.

In [1]:
from __future__ import division, print_function

%matplotlib inline

import ipywidgets as widgets
from ipywidgets import interact
from IPython.display import clear_output, display

from collections import OrderedDict
from matplotlib import pyplot as plt
import os
import pandas as pd
import pickle
from sklearn import feature_extraction as skfeatex
from sklearn import metrics as skmetrics
from sklearn import tree as sktree
from sklearn.model_selection import cross_val_predict, learning_curve

from vocab_analysis import *

import answers

In [2]:
jobs_df = pd.read_pickle('./data/ngrams.pickle')

IOError: [Errno 2] No such file or directory: './data/ngrams.pickle'

In [None]:
jobs_df.head()

We will use the `TfidfVectorizer` from scikit-learn to generate our features. It uses an analyzer to split each document into terms and then creates feature values based on the $\mbox{TF.IDF}$ weight of each term within a document.

In [None]:
print(skfeatex.text.TfidfVectorizer.__doc__)
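To make the weighting concrete, here is a minimal sketch of the textbook TF.IDF calculation in plain Python. The function name `tf_idf` and the toy documents are made up for illustration. scikit-learn's `TfidfVectorizer` smooths the IDF term and normalizes each document vector by default, so its numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document (textbook TF.IDF)."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency times the log of the inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-tokenized job descriptions.
docs = [
    ["hourly", "pay", "cashier"],
    ["salary", "pay", "manager"],
    ["hourly", "pay", "stocker"],
]
weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document ("pay") gets a weight of zero, while a term unique to one document ("cashier") gets the largest weight in that document.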

Let's load all of our functions back so we can use them as analyzers.

In [None]:
from my_tokenize import tokenize
from my_lemmatize import lemmatize, english_lemmas
from my_stopword_removal import stopword_removal
from my_lemma_sentences import lemma_sentences
from my_ngram_func import ngram_func

The `token_analyzer` will only tokenize the document.

In [None]:
def token_analyzer(description):
    return tokenize(description)

The `lemma_analyzer` will tokenize and lemmatize the document.

In [None]:
def lemma_analyzer(description):
    return lemmatize(tokenize(description), english_lemmas)

The `clean_lemma_analyzer` will tokenize, lemmatize, and then remove stop words.

In [None]:
def clean_lemma_analyzer(description):
    return stopword_removal(lemmatize(tokenize(description), english_lemmas))
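If you don't have the workshop functions at hand, the shape of this pipeline can be sketched with toy stand-ins. Everything prefixed with `toy_` or `TOY_` below is hypothetical and far simpler than the real `tokenize`, `lemmatize`, and `stopword_removal`. The point is only that an analyzer takes one raw description and returns a list of terms.

```python
import re

# Hypothetical stand-ins; the workshop's real tables are much larger.
TOY_LEMMAS = {"managers": "manager", "managing": "manage", "teams": "team"}
TOY_STOPWORDS = {"the", "a", "of", "and"}


def toy_tokenize(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def toy_lemmatize(tokens, lemmas):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [lemmas.get(token, token) for token in tokens]


def toy_stopword_removal(tokens):
    """Drop tokens that appear in the stop word set."""
    return [token for token in tokens if token not in TOY_STOPWORDS]


def toy_clean_lemma_analyzer(description):
    """Tokenize, lemmatize, then remove stop words."""
    return toy_stopword_removal(
        toy_lemmatize(toy_tokenize(description), TOY_LEMMAS)
    )


terms = toy_clean_lemma_analyzer("Managing the teams of managers")
print(terms)
```

Because each step takes and returns a plain list of tokens, the steps can be nested in the same way as in the analyzers above.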

The `bigram_analyzer` will split into sentences, tokenize, lemmatize, remove stop words, and then generate bigrams as our terms.

In [None]:
def bigram_analyzer(description):
    lemmatized_sentences = lemma_sentences(description)
    bigrams = ngram_func(2)(lemmatized_sentences)
    return bigrams

The `trigram_analyzer` will split into sentences, tokenize, lemmatize, remove stop words, and then generate trigrams as our terms.

In [None]:
def trigram_analyzer(description):
    lemmatized_sentences = lemma_sentences(description)
    trigrams = ngram_func(3)(lemmatized_sentences)
    return trigrams

The `full_analyzer` will split into sentences, tokenize, lemmatize, remove stop words, and then generate bigrams and trigrams. It will use the cleaned lemmas, bigrams and trigrams as our terms.

In [None]:
def full_analyzer(description):
    lemmatized_sentences = lemma_sentences(description)
    unigrams = [unigram for sentence in lemmatized_sentences for unigram in sentence]
    bigrams = ngram_func(2)(lemmatized_sentences)
    trigrams = ngram_func(3)(lemmatized_sentences)
    return unigrams + bigrams + trigrams
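The n-gram analyzers work sentence by sentence, so that no n-gram spans a sentence boundary. A minimal sketch of that idea follows. The helper `sentence_ngrams` is hypothetical, and joining the tokens with a space is an assumption; the workshop's `ngram_func` may represent n-grams differently.

```python
def sentence_ngrams(n, sentences):
    """Return every n-gram as a space-joined string, never crossing sentences."""
    ngrams = []
    for sentence in sentences:
        # A sentence shorter than n yields an empty range and no n-grams.
        for start in range(len(sentence) - n + 1):
            ngrams.append(" ".join(sentence[start:start + n]))
    return ngrams


# Hypothetical output of a sentence-level lemmatizer: one token list per sentence.
sentences = [["manage", "retail", "team"], ["flexible", "hour"]]

unigrams = [token for sentence in sentences for token in sentence]
bigrams = sentence_ngrams(2, sentences)
trigrams = sentence_ngrams(3, sentences)
terms = unigrams + bigrams + trigrams
print(terms)
```

Note that "team flexible" never appears as a bigram, because those two tokens come from different sentences.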

Our tasks are:
- education: predict what level of education is needed for a job (none, associate-needed, bs-degree-needed, ms-or-phd-needed)
- experience: predict how many years of experience are needed for a job (none, 1-2, 2-5, 5+)
- is_hourly: predict whether a job is hourly or not (True, False)
- is_part_time: predict whether a job is part time or not (True, False)
- is_supervisor: predict whether a job is a supervisory position or not (True, False)


In [None]:
tasks = ['education', 'experience', 'is_hourly', 'is_part_time', 'is_supervisor']

Covering how to use scikit-learn is outside the scope of this tutorial. If you want to know more about using scikit-learn, check out Sebastian Raschka's tutorial [_Learning scikit-learn -- An Introduction to Machine Learning in Python_](https://www.youtube.com/watch?v=9fOWryQq9J8).

We'll be modeling with decision trees ([wiki]()) ([_Python for Data Science_ by Joe McCarthy](http://nbviewer.jupyter.org/github/gumption/Python_for_Data_Science/blob/master/Python_for_Data_Science_all.ipynb#4.-Using-Python-to-Build-and-Use-a-Simple-Decision-Tree-Classifier)). Rather than go into the details of decision trees, let's look at one.

![tree](tree.png)

This is a tree that was built for the `is_hourly` task.
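Each node in a tree like this asks a question about one feature and sends documents down one branch or the other. As a rough sketch of how a single question might be chosen, the code below scores two candidate terms by the Gini impurity of the groups they produce. Gini is the default criterion of scikit-learn's `DecisionTreeClassifier`. The function names and the toy data are made up, and splitting on the mere presence of a term is a simplification: the real trees split on a threshold over the TF.IDF value.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_impurity(docs, labels, term):
    """Weighted Gini impurity after splitting on whether `term` is present."""
    with_term = [label for doc, label in zip(docs, labels) if term in doc]
    without_term = [label for doc, label in zip(docs, labels) if term not in doc]
    total = len(labels)
    return (
        len(with_term) / total * gini(with_term)
        + len(without_term) / total * gini(without_term)
    )


# Hypothetical documents, each reduced to its set of terms.
docs = [
    {"hourly", "cashier"},
    {"hourly", "stocker"},
    {"salary", "manager"},
    {"salary", "cashier"},
]
is_hourly = [True, True, False, False]

candidates = ["hourly", "cashier"]
best_term = min(candidates, key=lambda term: split_impurity(docs, is_hourly, term))
print(best_term)
```

Splitting on "hourly" separates the two classes perfectly, so it has the lower impurity and wins. A full tree repeats this choice inside each branch until it hits limits such as `max_depth` or `min_samples_leaf`.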

I've simplified working with these models by creating some widgets that let us modify some of the parameters.

### Exercise 4: finding the right features

Let's try to find the right features for these tasks.

I've pre-built all the models with my analyzers. Find the best-performing set of features and parameters for each task.

**Note**: If you want to try your own analyzers, uncomment the following cell. It renames the current results folder and creates an empty one, so every model is rebuilt, which can take anywhere from 10 seconds to over a minute per model. It also renames the saved features, which take a few minutes to regenerate.

In [None]:
# ! mv ./results/ ./results-pre-built/
# ! mkdir ./results
# ! mv ./data/all_features.pickle ./data/all_features_pre_built.pickle
# ! mv ./data/all_featurizers.pickle ./data/all_featurizers_pre_built.pickle

In [None]:
featurization_approaches = OrderedDict()
featurization_approaches['tokens'] = token_analyzer
featurization_approaches['lemmas'] = lemma_analyzer
featurization_approaches['clean_lemmas'] = clean_lemma_analyzer
featurization_approaches['bigrams'] = bigram_analyzer
featurization_approaches['trigrams'] = trigram_analyzer
featurization_approaches['full'] = full_analyzer

In [None]:
features_path = './data/all_features.pickle'
featurizers_path = './data/all_featurizers.pickle'
if os.path.exists(features_path):
    print('Loading features')
    # Pickle files are binary, so open them in 'rb' mode.
    with open(features_path, 'rb') as fp:
        all_features = pickle.load(fp)
    print('Loading featurizers')
    with open(featurizers_path, 'rb') as fp:
        all_featurizers = pickle.load(fp)
else:
    all_features = {}
    all_featurizers = {}
    for name, analyzer in featurization_approaches.items():
        print(name)
        featurizer = skfeatex.text.TfidfVectorizer(analyzer=analyzer)
        features = featurizer.fit_transform(jobs_df['description'])
        all_features[name] = features
        all_featurizers[name] = featurizer.get_feature_names()
    print('Saving features')
    with open(features_path, 'wb') as out:
        pickle.dump(all_features, out)
    print('Saving featurizers')
    with open(featurizers_path, 'wb') as out:
        pickle.dump(all_featurizers, out)
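The cell above follows a load-or-compute pattern: if a pickle already exists on disk we load it, otherwise we do the slow work and save the result for next time. A minimal, generic sketch of the pattern is shown here. The helper `load_or_compute` and the `expensive` function are hypothetical. Pickle files are binary, so the sketch opens them in binary mode.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as fp:
            return pickle.load(fp)
    result = compute()
    with open(path, "wb") as out:
        pickle.dump(result, out)
    return result


calls = []


def expensive():
    """Stand-in for slow work such as fitting a vectorizer."""
    calls.append(1)
    return {"tokens": [1, 2, 3]}


# Use a fresh temporary directory so the sketch leaves the workshop data alone.
cache_path = os.path.join(tempfile.mkdtemp(), "features.pickle")
first = load_or_compute(cache_path, expensive)   # computes and saves
second = load_or_compute(cache_path, expensive)  # loads from disk
print(len(calls))
```

This is also why renaming the saved files forces a rebuild: once the path no longer exists, the slow branch runs again.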

In [None]:
def get_model(task, features, max_depth, min_samples_leaf):
    save_path = './results/{}-{}-{}-{}.results.pickle'.format(
        task, features, max_depth, min_samples_leaf
    )
    if os.path.exists(save_path):
        with open(save_path, 'rb') as fp:
            train_sizes, train_scores, test_scores, preds, model = pickle.load(fp)
    else:
        model = sktree.DecisionTreeClassifier(
            max_depth=max_depth,
            min_samples_leaf=min_samples_leaf,
            random_state=123
        )
        train_sizes, train_scores, test_scores = learning_curve(
            model, 
            all_features[features], 
            jobs_df[task], 
            cv=3, 
            n_jobs=-1
        )
        preds = cross_val_predict(model, all_features[features], jobs_df[task], cv=3, n_jobs=-1)
        model.fit(all_features[features], jobs_df[task])
        with open(save_path, 'wb') as out:
            pickle.dump((train_sizes, train_scores, test_scores, preds, model), out)
            
    return train_sizes, train_scores, test_scores, preds, model

@interact(task=tasks, features=list(featurization_approaches.keys()), max_depth=(5, 20, 5), min_samples_leaf=(1, 10, 3))
def display_report(task, features, max_depth, min_samples_leaf):
    global _features_importances
    train_sizes, train_scores, test_scores, preds, model = get_model(
        task, features, max_depth, min_samples_leaf)
    
    fig = plt.figure(figsize=(10, 12))
    learning_curve_ax = fig.add_subplot(2, 1, 1)
    feature_importance_ax = fig.add_subplot(2, 1, 2)
    
    plot_learning_curve(train_sizes, train_scores, test_scores, task, ylim=(0.5, 1.0), ax=learning_curve_ax)
    
    feature_names = all_featurizers[features]
    features_importances = pd.Series(data=model.feature_importances_, index=feature_names)
    # Keep only the terms the tree actually used.
    features_importances = features_importances[features_importances > 0.0]
    try:
        wordcloud(features_importances, title='Feature Importances', ax=feature_importance_ax)
    except ValueError:
        # Stash the importances in a global so they can be inspected after a failed plot.
        _features_importances = features_importances

    plt.show()
    
    print(skmetrics.classification_report(jobs_df[task], preds))
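The report printed at the end lists precision, recall, F1 score, and support for each class. Because the predictions come from `cross_val_predict`, each document is scored by a model that did not see it during training. The sketch below shows, under a made-up set of labels, how those four numbers can be computed by hand. The function `per_class_scores` is hypothetical and is not how scikit-learn implements the report.

```python
def per_class_scores(y_true, y_pred):
    """Compute precision, recall, F1 and support for every class label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        true_positives = sum(
            1 for t, p in zip(y_true, y_pred) if t == label and p == label
        )
        predicted = sum(1 for p in y_pred if p == label)  # times we guessed label
        actual = sum(1 for t in y_true if t == label)     # support
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return scores


# Hypothetical labels for an is_hourly-style task.
y_true = [True, True, True, False, False]
y_pred = [True, True, False, True, False]
scores = per_class_scores(y_true, y_pred)
for label, row in scores.items():
    print(label, row)
```

Precision answers "when the model says this class, how often is it right?", and recall answers "of the documents in this class, how many did the model find?". When comparing feature sets, it helps to look at the rarer classes, where a model can score well overall while missing most of them.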

### Discussion
Once you find the right features/parameters, consider why they worked for that problem.

#### - education
#### - experience
#### - is_hourly
#### - is_part_time
#### - is_supervisor