# Stop Words, tf-idf

Let's investigate one of the most useful feature weightings, tf-idf, and see how stop words fall out of it naturally. To start, let's load a set of small documents.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# load data

DATA_DIR = '../../data/'

df = pd.read_csv(DATA_DIR + 'rt_critics.csv')  # DATA_DIR already ends with a slash

In [3]:
# It seems silly to call such short blurbs 'documents', but we'll stick with the NLP nomenclature.

documents = list(df['quote'])
documents[:5]

['So ingenious in concept, design and execution that you could watch it on a postage stamp-sized screen and still be engulfed by its charm.',
 "The year's most inventive comedy.",
 'A winning animated feature that has something for everyone on the age spectrum.',
 "The film sports a provocative and appealing story that's every bit the equal of this technical achievement.",
 "An entertaining computer-generated, hyperrealist animation feature (1995) that's also in effect a toy catalog."]

## Document Frequency

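Let's start by calculating the document frequency for words in these documents, i.e. the fraction of documents in which each word appears. For this task, let's also remove all the punctuation marks and split hyphenated words in two. Note that the cleanup below does not lower-case the text, so capitalized tokens such as 'The' and 'A' are counted separately from 'the' and 'a'.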

In [4]:
from nltk.tokenize import wordpunct_tokenize  # for tokenizing our text
import string  # helps with removing punctuation
from collections import Counter  # great dict-like datastructure for counting things

In [5]:
# This is a bit of text cleanup
word_bag_list = []
for doc in documents:
    cleaned = doc.replace('-', ' ')  # split hyphenated words in two (case is left unchanged)
    for c in string.punctuation:  # strip punctuation marks.
        cleaned = cleaned.replace(c, '')
    word_bag_list.append(wordpunct_tokenize(cleaned))

# How do things look?
print 'a few tokens:', word_bag_list[:3]

# this flattens the nested lists into one big list for some stats
token_list = []
for tokens in word_bag_list:
    token_list.extend(tokens)
print 'number of tokens:', len(token_list)
print 'number of unique tokens:', len(set(token_list))
print 'number of documents:', len(word_bag_list)

a few tokens: [['So', 'ingenious', 'in', 'concept', 'design', 'and', 'execution', 'that', 'you', 'could', 'watch', 'it', 'on', 'a', 'postage', 'stamp', 'sized', 'screen', 'and', 'still', 'be', 'engulfed', 'by', 'its', 'charm'], ['The', 'years', 'most', 'inventive', 'comedy'], ['A', 'winning', 'animated', 'feature', 'that', 'has', 'something', 'for', 'everyone', 'on', 'the', 'age', 'spectrum']]
number of tokens: 280092
number of unique tokens: 25183
number of documents: 14072
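
The tokens above keep their original case, so 'The' and 'the' end up as different words. Here is a minimal sketch of a cleanup step that also lowercases, using only the standard library. The helper name `clean_and_tokenize` and the plain whitespace split (standing in for `wordpunct_tokenize`) are my assumptions for illustration, not what the cells in this notebook ran.

```python
import string


def clean_and_tokenize(doc):
    """Lowercase, split hyphenated words, strip punctuation, split on whitespace."""
    cleaned = doc.lower().replace('-', ' ')
    for c in string.punctuation:
        cleaned = cleaned.replace(c, '')
    return cleaned.split()


sample = "The year's most inventive comedy."
tokens = clean_and_tokenize(sample)
print(tokens)
```

With this variant the recorded counts would change, since capitalized and lower-case forms of a word would be merged into one token.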


In [6]:
# calculate the document frequency of all the unique tokens in the bags of words.

df = Counter()  # document-frequency counter (this reuses the name `df`, replacing the DataFrame loaded earlier)

for doc in word_bag_list:
    # count each word once per document (not its total across all documents);
    # set(doc) removes repeats within a single document
    for token in set(doc):
        df[token] += 1

# normalize the counts by the number of documents (are you getting zeros? Think datatypes.)
for token in df:
    df[token] = df[token] / float(len(documents))

# this prints the 20 highest-scoring words and their scores
df.most_common(20)

[('the', 0.5431353041500853),
 ('and', 0.48493462194428655),
 ('of', 0.4633314383172257),
 ('a', 0.4279420125071063),
 ('is', 0.3306566230812962),
 ('to', 0.31907333712336555),
 ('in', 0.22527003979533827),
 ('that', 0.19783968163729393),
 ('The', 0.179363274587834),
 ('it', 0.16742467310972142),
 ('its', 0.15079590676520752),
 ('with', 0.14830869812393405),
 ('but', 0.13331438317225697),
 ('film', 0.12841102899374646),
 ('movie', 0.12833996588971006),
 ('for', 0.1186753837407618),
 ('as', 0.11725412166003411),
 ('A', 0.10481807845366686),
 ('this', 0.10424957362137578),
 ('an', 0.089681637293917)]

## Stop Words

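Which words are likely to be stop words? The ones that show up in the most documents! The terms with the largest document frequency are the stop words. The threshold above which you call something a stop word is up to you. Notice that in this corpus of movie reviews, 'film' and 'movie' rank alongside 'the' and 'and': stop words derived this way are specific to the corpus.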
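
To make the threshold idea concrete, here is a small sketch on a made-up corpus. The names `toy_docs`, `document_frequency`, and `stop_words_from_df` are hypothetical, and the thresholds are arbitrary choices for the example.

```python
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data).
toy_docs = [
    ['the', 'film', 'is', 'great'],
    ['the', 'plot', 'is', 'thin'],
    ['the', 'cast', 'shines'],
    ['a', 'great', 'cast'],
]


def document_frequency(docs):
    """Fraction of documents in which each token appears at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))
    n_docs = float(len(docs))
    return {token: count / n_docs for token, count in counts.items()}


def stop_words_from_df(doc_freq, threshold):
    """Tokens whose document frequency is at or above the threshold."""
    return sorted(t for t, f in doc_freq.items() if f >= threshold)


toy_df = document_frequency(toy_docs)
strict = stop_words_from_df(toy_df, threshold=0.75)
loose = stop_words_from_df(toy_df, threshold=0.5)
print(strict)
print(loose)
```

On this toy data the stricter threshold keeps only 'the', while the looser one also sweeps up content words like 'cast' and 'great', which is why the cutoff deserves some thought.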

## tf-idf

More interesting than stop words is the tf-idf score, which tells us which words are most discriminative between documents. A word that occurs a lot in one document but doesn't occur in many documents tells you something special about that document:

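$$
\text{tf-idf} = \text{tf} \cdot \left(-\log{\text{df}}\right)
$$

Here tf is the term frequency within a single document and df is the document frequency calculated above. Recall that:

$$
-\log{x} = \log{1 \over x}
$$

so the second factor is the log of the inverse document frequency, 1/df, which is where the "idf" in the name comes from.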
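
As a worked example of the formula above with made-up numbers, take two words that each appear once in a 10-token document: one found in half of all documents, the other in only 0.1% of them. This sketch assumes tf is the word's count divided by the document length; the function name `tf_idf` is just for illustration.

```python
import math


def tf_idf(term_count, doc_length, doc_freq):
    """Relative term frequency times -log(document frequency)."""
    tf = term_count / float(doc_length)
    return tf * -math.log(doc_freq)


# A common word: appears once in a 10-token document, found in 50% of documents.
common = tf_idf(term_count=1, doc_length=10, doc_freq=0.5)
# A rare word: same count in the same document, but found in only 0.1% of documents.
rare = tf_idf(term_count=1, doc_length=10, doc_freq=0.001)

print('common: %.3f, rare: %.3f' % (common, rare))
```

The rare word scores roughly ten times higher even though both words have the same term frequency, and a word that appears in every document (df = 1) scores zero.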

What are the most discriminative words in the first few documents?

In [None]:
# calculate term frequencies and tf-idf scores for each of the first 100 bags of words.

for doc in word_bag_list[:100]:
    tf = Counter()  # term counts for this document
    tfidf = Counter()  # tf-idf scores for this document

    # calculate term frequencies
    for token in doc:
        tf[token] += 1
    total = float(sum(tf.values()))

    # calculate tf-idf scores
    for token in tf:
        tfidf[token] = (tf[token] / total) * (-np.log(df[token]))

    # this prints most significant words in the document
    print tfidf.most_common(5)

## Scikit-Learn

Scikit-Learn comes with utilities to do these calculations for us. How do their results compare?

I confess, I ran out of time to do a proper comparison, but with enough work, we can figure out which features (i.e. words) have the highest scores. What's happening is that each document is converted into a normalized vector (length = 1) in which most of the dimensions/features/words are 0, and each word that occurs in the document gets a score proportional to its tf-idf score.
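
One reason the numbers will not line up exactly with the hand-rolled scores: as I understand the scikit-learn documentation, `TfidfVectorizer` with default settings uses raw counts for tf, a smoothed idf of ln((1 + n) / (1 + d)) + 1 (where n is the number of documents and d is the number of documents containing the term), and then scales each row to unit L2 length. It also lowercases and applies its own tokenizer. Below is a minimal pure-Python sketch of that weighting under those assumed defaults; the function name `sklearn_style_tfidf` and the toy corpus are hypothetical, and this is not the library's actual implementation.

```python
import math
from collections import Counter


def sklearn_style_tfidf(docs):
    """Return one {token: weight} dict per tokenized document.

    Assumes raw-count tf, idf = ln((1 + n) / (1 + d)) + 1,
    and L2-normalized rows.
    """
    n_docs = len(docs)
    doc_count = Counter()  # number of documents containing each token
    for doc in docs:
        doc_count.update(set(doc))

    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {}
        for t in tf:
            idf = math.log((1.0 + n_docs) / (1.0 + doc_count[t])) + 1.0
            weights[t] = tf[t] * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:  # leave empty documents as empty rows
            weights = {t: w / norm for t, w in weights.items()}
        rows.append(weights)
    return rows


# Hypothetical toy corpus: 'the' and 'is' occur in both documents.
toy_docs = [['the', 'film', 'is', 'great'], ['the', 'plot', 'is', 'thin']]
rows = sklearn_style_tfidf(toy_docs)
for row in rows:
    print(sorted(row.items(), key=lambda kv: -kv[1]))
```

Because of the "+ 1" in the idf, a word that appears in every document still gets a non-zero weight here, unlike in the -log(df) version above.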

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
tfidf_vec = TfidfVectorizer()
output = tfidf_vec.fit_transform(documents)
print output.toarray()[20:30, :10]

[[ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.25400101  0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0

In [10]:
print tfidf_vec.get_stop_words()

None

