# Vocabulary Analysis Workshop

## Stop words

Document search has a name for words like "and", "to", and "the": _stop words_. These are generally words whose average $\mbox{TF}$ is so high that their low $\mbox{IDF}$ does not offset it, so they score well under $\mbox{TF.IDF}$ despite carrying little meaning.
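To make the balance between the two quantities concrete, here is a minimal sketch on a toy corpus. It assumes one common definition (TF as a term's share of a document's tokens, IDF as $\log(N / \mbox{DF})$); the workshop's own `calculate_avg_tfidf` may weight things differently, and the documents below are made up for illustration.

```python
import math

# Toy corpus of tokenized documents (hypothetical data, not from the workshop).
docs = [
    ["the", "manager", "lead", "the", "team"],
    ["the", "sales", "team", "meet", "the", "target"],
    ["the", "python", "developer", "write", "the", "code"],
    ["nurse", "care", "for", "patient"],
]


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing `term`).

    Assumes `term` occurs in at least one document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def avg_tfidf(term, docs):
    """Average of TF * IDF over every document in the corpus."""
    return sum(tf(term, doc) * idf(term, docs) for doc in docs) / len(docs)


for term in ["the", "python"]:
    print(term, round(idf(term, docs), 3), round(avg_tfidf(term, docs), 3))
# the 0.288 0.077
# python 1.386 0.058
```

In this toy corpus "the" has a much lower IDF than "python", yet its average TF.IDF is still higher because it makes up such a large share of the documents it appears in.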

NLTK provides a list of English words that are often considered stop words. Stop word removal is not always appropriate, and what counts as a stop word depends on the task at hand.

Let's take a look at NLTK's stop words.

(See [Stop words on Wikipedia](https://en.wikipedia.org/wiki/Stop_words).)

In [None]:
from __future__ import division, print_function

%matplotlib inline

import nltk
import pandas as pd
import pickle

from vocab_analysis import *

import answers

In [None]:
jobs_df = pd.read_pickle('./data/lemmatized.pickle')

In [None]:
jobs_df.head()

In [None]:
with open('./data/segments.pickle', 'rb') as fp:  # pickles must be read in binary mode
    segments = pickle.load(fp)

In [None]:
# If this raises LookupError, run nltk.download('stopwords') once to fetch the corpus.
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords

Notice that the list includes tokens like "shouldn" and "t", which exist only because of tokenization.
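To see where such fragments come from, here is a minimal sketch. It assumes a simple tokenizer that keeps runs of word characters and drops everything else; the tokenizer actually used to build the workshop data is not shown here and may differ.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of word characters.

    Assumption: apostrophes are treated as separators, so contractions
    are split into two tokens.
    """
    return re.findall(r"\w+", text.lower())


print(tokenize("You shouldn't skip this step"))
# ['you', 'shouldn', 't', 'skip', 'this', 'step']
```

Because the apostrophe is discarded, "shouldn't" becomes the two tokens "shouldn" and "t", which is why both need to be in the stop word list.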

### Exercise 3: finding new stop words

Let's try to find new words to add to our set of stop words. Keep in mind what these words mean: although "manag" and "experi" occur very often, they are still meaningful. You are looking for words that have a high average $\mbox{TF.IDF}$ but seem to lack important meaning.

First we will look at $\mbox{TF.IDF}$ values for NLTK's stop words, and then we will look for new candidates.

Feel free to do this however you like, for example by choosing a threshold for one of our values or by selecting words manually.
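One way to combine both approaches is to let a threshold propose candidates and then prune them by hand. The sketch below uses a plain dictionary with made-up scores as a stand-in for the `avg_tfidf` column; the threshold value and the choice of which lemmas to keep are purely illustrative assumptions.

```python
# Hypothetical average TF.IDF scores per lemma (made-up numbers).
avg_scores = {
    "manag": 0.042,
    "experi": 0.039,
    "work": 0.035,
    "includ": 0.031,
    "nurs": 0.012,
    "python": 0.009,
}

THRESHOLD = 0.03  # assumed cut-off; tune it by inspecting your own data

# Step 1: every lemma scoring at or above the threshold is a candidate.
candidates = {term for term, score in avg_scores.items() if score >= THRESHOLD}

# Step 2: manually keep the candidates that still carry meaning for the task.
keep = {"manag", "experi"}
new_stopwords_sketch = candidates - keep

print(sorted(new_stopwords_sketch))
# ['includ', 'work']
```

The result is a set, so it can be merged directly with the existing stop words.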

In [None]:
lemma_avg_tfidf_df = calculate_avg_tfidf(jobs_df['lemmas'])

In [None]:
lemma_avg_tfidf_df.sort_values('avg_tfidf', ascending=False)

In [None]:
lemma_avg_tfidf_df.describe()

In [None]:
lemma_avg_tfidf_df[lemma_avg_tfidf_df.index.isin(stopwords)].describe()

The `lemma_avg_tfidf_df` dataframe contains the candidates for new stop words. Explore it by sorting by the different values, and look at different strata.

In [None]:
cleaned_lemma_avg_tfidf = lemma_avg_tfidf_df[~lemma_avg_tfidf_df.index.isin(stopwords)]

In [None]:
cleaned_lemma_avg_tfidf.sort_values('avg_tfidf', ascending=False)

In [None]:
cleaned_lemma_avg_tfidf.describe()

In [None]:
new_stopwords = None

if new_stopwords is None:
    raise NotImplementedError("Find additions to the list of stop words")
    
# new_stopwords = answers.additional_stopwords # uncomment this, and comment the above lines to skip this exercise

In [None]:
custom_stopwords = stopwords | set(new_stopwords)  # set() lets the additions be a list or a set

In [None]:
def stopword_removal(terms):
    """
    Remove stop words from a list of terms.

    Parameters
    ----------
    terms : list[str]
        a list of terms from which to remove stop words

    Returns
    -------
    list[str]
        a list of terms with the stop words removed
    """
    return [s for s in terms if s not in custom_stopwords]

In [None]:
jobs_df['cleaned_lemmas'] = jobs_df['lemmas'].apply(stopword_removal)
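As a quick sanity check of what this step does to a single row, here is a self-contained sketch. The stop word set and the term list are toy values, and `remove_stopwords` is a hypothetical variant of the function above that takes the stop words as an argument so it can run on its own.

```python
# Toy stand-in for the notebook's custom_stopwords (hypothetical values).
toy_stopwords = {"the", "and", "to", "shouldn", "t"}


def remove_stopwords(terms, stopwords):
    """Return the terms that are not in `stopwords`, preserving order."""
    return [term for term in terms if term not in stopwords]


terms = ["the", "manag", "shouldn", "t", "lead", "the", "team"]
print(remove_stopwords(terms, toy_stopwords))
# ['manag', 'lead', 'team']
```

Keeping the stop words in a set rather than a list makes each membership test a constant-time lookup on average, which matters when the function is applied to every row.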

Let's look at the effects of stop word removal.

In [None]:
analyze(jobs_df, 'cleaned_lemmas', segments)

In [None]:
save_fun(stopword_removal, custom_stopwords=custom_stopwords)

In [None]:
jobs_df.to_pickle('./data/cleaned.pickle')

**Note**: Again, if you picked stop words that are very different from those in the `answers` module, this analysis may not be applicable.

These terms seem much more meaningful, but a few terms still appear dominant: "management", "experience", "sale", "service". These terms are important to the overall context.

1. n-grams: These should be more distinct for different segments, since the set of sequences of terms is much larger than the set of terms. The larger the value of n, the more distinct we can expect complementary segments to be. The downside is that not all n-grams are meaningful, but these meaningless n-grams are generally not common.

We will be using n-grams.
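As a preview, here is a minimal sketch of how n-grams can be built from a list of terms. The helper name `ngrams` and the example terms are illustrative assumptions, not the implementation used in the next notebook.

```python
def ngrams(terms, n):
    """Return all contiguous n-term sequences as tuples, in order."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]


terms = ["senior", "sale", "manag", "retail", "sale"]
print(ngrams(terms, 2))
# [('senior', 'sale'), ('sale', 'manag'), ('manag', 'retail'), ('retail', 'sale')]
```

The term "sale" occurs twice in this list, but every bigram containing it is different, which illustrates why sequences of terms tend to separate segments better than single terms.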

### NEXT => [6. n-Grams](6. n-Grams.ipynb)