# Vocabulary Analysis Workshop

## n-Grams and Sentence Boundary Detection

n-grams are fixed-width sequences of words pulled from a text. Let's use the sentence in the code cell below as an example.

**Note**: when n is less than 5, they are sometimes given a special name
- 1-gram = unigram
- 2-gram = bigram
- 3-gram = trigram

In [None]:
tokens = 'the quick brown fox jumped over the lazy dog'.split(' ')

# Slide a window of width n across the tokens; there are len(tokens) - n + 1 windows.
# range is used instead of xrange so the cell runs on both Python 2 and Python 3.
for n, name in [(1, 'unigrams'), (2, 'bigrams'), (3, 'trigrams'), (4, '4-grams'), (5, '5-grams')]:
    print(name)
    print([tokens[i:i+n] for i in range(len(tokens) - n + 1)])

In [None]:
from __future__ import division, print_function

%matplotlib inline

import nltk
import pandas as pd
import pickle

from vocab_analysis import *

import answers

In [None]:
jobs_df = pd.read_pickle('./data/cleaned.pickle')

In [None]:
# pickle files are binary, so open in 'rb' mode (required on Python 3)
with open('./data/segments.pickle', 'rb') as fp:
    segments = pickle.load(fp)

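We have a problem though. What if our sequences run across a sentence boundary? Although such n-grams would likely be rare, for low n they can still cause problems, because the last word of one sentence and the first word of the next get paired even though they are unrelated. We will need to split our documents into sentences.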
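To make the boundary problem concrete, here is a minimal sketch on two made-up, already tokenized sentences. The `ngrams` helper and the example data are assumptions for illustration, not part of the workshop code.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Two toy sentences, already tokenized (hypothetical example data).
sentences = [
    ['we', 'need', 'a', 'python', 'developer'],
    ['experience', 'with', 'sql', 'is', 'a', 'plus'],
]

# Ignoring the boundary: flatten everything into one token stream.
flat = [token for sentence in sentences for token in sentence]
flat_bigrams = ngrams(flat, 2)

# Respecting the boundary: build bigrams one sentence at a time.
sentence_bigrams = [bigram for sentence in sentences for bigram in ngrams(sentence, 2)]

# The only difference is the bigram that straddles the two sentences.
spurious = [bigram for bigram in flat_bigrams if bigram not in sentence_bigrams]
print(spurious)
# [('developer', 'experience')]
```

The flattened stream yields one extra bigram, "developer experience", which never occurs inside either sentence.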

NLTK comes with a class for splitting text into sentences: `PunktSentenceTokenizer`.

(Sentence boundary disambiguation [wikipedia](https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation))  
(PunktTokenizer [docs](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer))  
(Punkt algorithm [paper](https://www.linguistics.ruhr-uni-bochum.de/~kiss/publications/compling2005_KS27.01final.pdf))

The idea in text segmentation like this is to either find the boundaries or find the segments. Punkt finds the boundaries by combining heuristics with collocation learning for identifying abbreviations.
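The sketch below is not Punkt. It is a deliberately naive boundary finder, shown only to illustrate why abbreviations matter. The hand-written `ABBREVIATIONS` set and the `split_sentences` function are assumptions for the example; Punkt learns its abbreviations from a corpus rather than using a fixed list.

```python
import re

text = 'Contact Dr. Smith for details. The role starts in May.'

# Naive rule: a boundary is any '.', '!' or '?' followed by whitespace.
naive = re.split(r'(?<=[.!?])\s+', text)
print(naive)
# ['Contact Dr.', 'Smith for details.', 'The role starts in May.']

# Same rule, but refuse to split after a known abbreviation.
# This fixed set is an assumption made for the example.
ABBREVIATIONS = {'dr.', 'mr.', 'mrs.', 'inc.', 'e.g.', 'i.e.'}

def split_sentences(text):
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        # Close the sentence only on terminal punctuation that is not an abbreviation.
        if word[-1] in '.!?' and word.lower() not in ABBREVIATIONS:
            sentences.append(' '.join(current))
            current = []
    if current:
        sentences.append(' '.join(current))
    return sentences

better = split_sentences(text)
print(better)
# ['Contact Dr. Smith for details.', 'The role starts in May.']
```

The naive rule breaks the first sentence at "Dr.", which is exactly the kind of error that abbreviation detection is meant to prevent.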

In [None]:
from my_tokenize import tokenize
from my_lemmatize import lemmatize, english_lemmas
from my_stopword_removal import stopword_removal

In [None]:
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

def lemma_sentences(job_description):
    """
    This function splits a job description into sentences, then tokenizes,
    lemmatizes, and removes stop words from each sentence
    Parameters
    ----------
    job_description : str
        The text of the job description
    Returns
    ----------
    list[list[str]]
        one list of lemmas (stop words removed) per sentence
    """
    sentences = sent_detector.tokenize(job_description)
    return [stopword_removal(lemmatize(tokenize(sentence), english_lemmas)) for sentence in sentences]

In [None]:
jobs_df['sentences'] = jobs_df['description'].apply(lemma_sentences)

In [None]:
# pickle files are binary, so open in 'rb' mode (required on Python 3)
with open('./data/segments.pickle', 'rb') as fp:
    segments = pickle.load(fp)

Now that we have sentences, we can generate n-grams.

It is worth considering the performance cost of choosing a large n.

Although the increase in terms per document is not prohibitive, a large n drastically increases the cost of calculating $\mbox{TF.IDF}$, since the terms will be much more sparse. Our data set is small enough that we can go to 5-grams, but for very large corpora trigrams are probably a safer limit.
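As a rough illustration of the sparsity point, the sketch below counts n-grams on a tiny made-up corpus and reports how many of the distinct terms occur only once. The corpus and helper names are assumptions for the example; real data will give different numbers, but the trend is the same.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with '-'."""
    return ['-'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical, already lemmatized sentences.
corpus = [
    ['manage', 'sale', 'team'],
    ['manage', 'sale', 'account'],
    ['manage', 'service', 'team'],
    ['lead', 'sale', 'team'],
]

stats = {}
for n in (1, 2, 3):
    counts = Counter(gram for sentence in corpus for gram in ngrams(sentence, n))
    total = sum(counts.values())                             # n-gram occurrences
    distinct = len(counts)                                   # vocabulary size
    singletons = sum(1 for c in counts.values() if c == 1)   # terms seen only once
    stats[n] = (total, distinct, singletons)
    print(n, total, distinct, singletons)
```

On this toy corpus, half of the unigrams occur only once, two thirds of the bigrams do, and every trigram does. Terms that occur once carry little evidence, yet each still needs its own row in any TF.IDF table.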

In [None]:
def ngram_func(n):
    """
    This function creates an ngram extracting function with an appropriate name.
    Parameters
    ----------
    n : int
        the n of the ngrams to be produced
    Returns
    ----------
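    Callable[list[list[str]]] -> list[str]
        the ngram generating function; it takes the sentences of one document
        (each a list of lemmas) and returns the '-'-joined ngrams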
    """
    assert n>1, 'n must be greater than 1'
    def fun(sentences):
        return ['-'.join(ngram) for sentence in sentences for ngram in nltk.ngrams(sentence, n)]
    # __name__ works on both Python 2 and Python 3 (func_name is Python 2 only)
    if n == 2:
        fun.__name__ = 'bigrams'
    elif n == 3:
        fun.__name__ = 'trigrams'
    else:
        fun.__name__ = 'n_{}_grams'.format(n)
    return fun
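To see what the generated function returns, here is a small sketch that reuses the same comprehension as the inner function above. It swaps in a pure-Python stand-in for `nltk.ngrams` so that it runs without NLTK; `ngrams`, `join_ngrams`, and the sample sentences are hypothetical names and data for illustration.

```python
def ngrams(tokens, n):
    """Stand-in for nltk.ngrams: every run of n consecutive tokens."""
    return zip(*[tokens[i:] for i in range(n)])

def join_ngrams(sentences, n):
    # Same comprehension as the inner function, with ngrams() swapped in.
    return ['-'.join(ngram) for sentence in sentences for ngram in ngrams(sentence, n)]

# Hypothetical output of lemma_sentences for a two-sentence description.
sentences = [['senior', 'ui', 'engineer'], ['bachelor', 'degree', 'require']]

print(join_ngrams(sentences, 2))
# ['senior-ui', 'ui-engineer', 'bachelor-degree', 'degree-require']
print(join_ngrams(sentences, 3))
# ['senior-ui-engineer', 'bachelor-degree-require']
print(join_ngrams([['sale']], 2))
# []
```

Note that no n-gram spans the two sentences, and a sentence shorter than n contributes nothing.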

In [None]:
jobs_df['bigrams'] = jobs_df['sentences'].apply(ngram_func(2))

In [None]:
analyze(jobs_df, 'bigrams', segments)

These are even more meaningful than just the lemmas. However, the TF vs IDF plot has become useless. Also, note that our vocabulary has gone from 18155 lemmas to 306245 bigrams, a 1587% increase (about 16.9 times as many terms).

Observations validating some intuitions
- "year-experience" appears to be more important the more experience is required
- "associate-degree", "bachelor-degree", "master-degree" are important for their respective education levels

There are some oddities
- "silver-bullet" is a prominent bigram for jobs requiring a graduate degree
- "engineer-ui" and "ui-engineer" are important for jobs requiring 5+ years experience

Let's look at the "silver-bullet" oddity.

In [None]:
bigram_avg_tfidf_df = calculate_avg_tfidf(jobs_df['bigrams'])
bigram_index, bigram_inv_index = build_indexes(jobs_df['bigrams'])

In [None]:
search(
    "silver bullet", 
    jobs_df['description'], 
    bigram_index, 
    bigram_inv_index, 
    bigram_avg_tfidf_df['idf'],
    lambda q: ngram_func(2)([stopword_removal(lemmatize(tokenize(q), english_lemmas))])
)

We see that "Silver Bullet" is a specific company with multiple jobs in our data. This is the danger with n-grams: the larger n is, the more closely the terms fit our particular data, and the less the conclusions drawn from them generalize.

In [None]:
jobs_df['trigrams'] = jobs_df['sentences'].apply(ngram_func(3))

In [None]:
analyze(jobs_df, 'trigrams', segments)

We have 491719 trigrams, a 61% increase from bigrams, and a 2608% increase from lemmas (about 27.1 times as many terms).

Observations
- There are some formulaic phrases in our data leading to "equal-opportun-employ" being prominent in all segments
- "high-school-diploma", "hours-per-week", and "valid-driver-license" are prominent for hourly jobs

Oddities
- "colorado-spring-co" is prominent for some segments; this is certainly not generalizable
- "engineer-ui-engineer" and "ui-engineer-ui" are prominent for jobs requiring 5+ years of experience

Let's look into the "engineer-ui-engineer" oddity.

In [None]:
trigram_avg_tfidf_df = calculate_avg_tfidf(jobs_df['trigrams'])
trigram_index, trigram_inv_index = build_indexes(jobs_df['trigrams'])

In [None]:
search(
    "engineer ui engineer", 
    jobs_df['description'], 
    trigram_index, 
    trigram_inv_index, 
    trigram_avg_tfidf_df['idf'],
    lambda q: ngram_func(3)([stopword_removal(lemmatize(tokenize(q), english_lemmas))])
)

If you look at the bottom of the description, you will see a classic search engine optimization (SEO) tactic: repeating keywords to boost their $\mbox{TF}$.

Although this is an oddity, it is valuable information. If one were attempting to classify jobs by industry, how might these repeats affect the modeling?
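To get a feel for the size of the effect, the sketch below compares the term frequency of "ui-engineer" in a made-up description with and without a keyword-stuffed footer. The tokens and helper functions are assumptions for illustration; TF is taken here as the share of a document's bigrams equal to the term, which may differ from the weighting used in `vocab_analysis`.

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent token pairs, joined with '-'."""
    return ['-'.join(pair) for pair in zip(tokens, tokens[1:])]

def term_frequency(terms, term):
    """Share of a document's terms that are equal to `term`."""
    return Counter(terms)[term] / float(len(terms))

# Hypothetical lemmatized description and SEO footer.
body = ['design', 'build', 'web', 'interface', 'ui', 'engineer', 'team']
footer = ['ui', 'engineer'] * 4

tf_plain = term_frequency(bigrams(body), 'ui-engineer')
tf_stuffed = term_frequency(bigrams(body + footer), 'ui-engineer')

print(round(tf_plain, 3), round(tf_stuffed, 3))
# 0.167 0.357
```

The footer also creates the reversed bigram "engineer-ui", which is how artifacts such as "engineer-ui-engineer" end up among the prominent terms.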

In [None]:
jobs_df['quadrigrams'] = jobs_df['sentences'].apply(ngram_func(4))

In [None]:
analyze(jobs_df, 'quadrigrams', segments)

We now have 497154 4-grams, a 1% increase from trigrams.
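The growth figures can be checked with a few lines of arithmetic. The vocabulary sizes below are the ones reported in this notebook; `percent_increase` is a helper written for this sketch. A percent increase is (new - old) / old, which is 100 points less than the ratio new / old expressed as a percentage.

```python
# Vocabulary sizes reported in this notebook.
vocab_sizes = [
    ('lemmas', 18155),
    ('bigrams', 306245),
    ('trigrams', 491719),
    ('4-grams', 497154),
]

def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old

base = vocab_sizes[0][1]
previous = base
for name, size in vocab_sizes[1:]:
    print('{}: {:.0f}% more than the previous level, {:.0f}% more than lemmas'.format(
        name, percent_increase(previous, size), percent_increase(base, size)))
    previous = size
```

The vocabulary grows sharply from lemmas to bigrams and again to trigrams, then nearly stops growing at 4-grams.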

Observations
- formulaic phrases have almost completely taken over
- social work appears to be a prominent type of job in our data

Oddities
- "sale-sale-sale-sale" appears; this is likely another instance of SEO

In [None]:
quadrigram_avg_tfidf_df = calculate_avg_tfidf(jobs_df['quadrigrams'])
quadrigram_index, quadrigram_inv_index = build_indexes(jobs_df['quadrigrams'])

In [None]:
# Search the 4-gram indexes built above (the trigram indexes cannot match a 4-gram query)
search(
    "sale sale sale sale",
    jobs_df['description'],
    quadrigram_index,
    quadrigram_inv_index,
    quadrigram_avg_tfidf_df['idf'],
    lambda q: ngram_func(4)([stopword_removal(lemmatize(tokenize(q), english_lemmas))])
)

## Conclusion

What have we learned about our data set?

- We should be able to distinguish between our segments using lemmas with stop words removed, and bigrams of the lemmas.
- Some jobs are using SEO (keyword repetition) to boost their $\mbox{TF}$
- Sales, social work, and medical work are common job types in our data
- Colorado appears to be overrepresented in our data
- There are certain lemmas that appear common across all segments
  - "manage" - "manage", "manager"
  - "experience" - "experience", "experienced"
  - "sale" - "sale", "sales"
  - "service" - "service", "services"
  

In [None]:
save_fun(lemma_sentences, imports=['nltk'], star_imports=['my_tokenize', 'my_lemmatize', 'my_stopword_removal'], 
         sent_detector=sent_detector)
save_fun(ngram_func, imports=['nltk'])

In [None]:
jobs_df.to_pickle('./data/ngrams.pickle')

Now, let's take what we've learned and try to apply it to model building.

### NEXT => [7. Modeling](7. Modeling.ipynb)