Created on 8/1/15 to illustrate tf-idf. Tf-idf is a weighting scheme for term-document matrices. Each word's weight depends on (1) its frequency within a given document (higher frequency = higher weight) and (2) its scarcity across the corpus (scarcer = higher weight).
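
To make the two factors concrete, here is a minimal sketch that computes tf-idf weights by hand. It assumes the smoothed idf formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document's vector. The function name `tfidf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 norm of the row
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

# hypothetical toy corpus, already tokenized
toy_docs = [['python', 'horror', 'movie', 'python'],
            ['cult', 'movie'],
            ['karate', 'movie']]
toy_weights = tfidf(toy_docs)
for term, weight in sorted(toy_weights[0].items(), key=lambda kv: -kv[1]):
    print('{0: <10} {1:.4f}'.format(term, weight))
```

In the first toy document, 'python' ranks highest because it occurs twice, 'horror' comes next because it occurs once but in no other document, and 'movie' ranks last because it occurs in every document.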

reference:

http://www.markhneedham.com/blog/2015/02/15/pythonscikit-learn-calculating-tfidf-on-how-i-met-your-mother-transcripts/

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
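'''ngram_range=(1, 3) tells the vectorizer to extract unigrams, bigrams, and trigrams, so "i love python"
would yield six n-grams before any filtering: 'i', 'love', 'python', 'i love', 'love python', 'i love python'.
With stop_words='english' (used below), 'i' is dropped first, leaving 'love', 'python', 'love python'.'''

# min_df=1 keeps every term that appears in at least one document
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')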
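
As a rough illustration of what an n-gram range of 1 to 3 means, the sketch below builds word n-grams from a token list in plain Python. The helper `word_ngrams` is hypothetical and is not scikit-learn's implementation; it ignores stop words and tokenization rules and only shows how the six n-grams for "i love python" arise.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n, shortest first."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens across the list
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

example_ngrams = word_ngrams('i love python'.split())
print(example_ngrams)
```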

In [27]:
# corpus is a list of strings

corpus = ['Python is a 2000 made-for-TV horror movie directed by Richard Clabaugh',
          ' The film features several cult favorite actors', 
          'including William Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien']

In [28]:
# fit learns the vocabulary and idf weights; transform builds the weighted matrix

tfidf_matrix = tf.fit_transform(corpus)
tfidf_matrix # 3x63 because there are 3 documents and 63 distinct n-grams

<3x63 sparse matrix of type '<type 'numpy.float64'>'
	with 63 stored elements in Compressed Sparse Row format>

In [29]:
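# inspect tokens (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)

tf.get_feature_names_out()[:5].tolist()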

[u'2000', u'2000 tv', u'2000 tv horror', u'actors', u'casper']

In [30]:
# inspect the weighted term-document matrix to identify important words

# unpack sparse matrix to dense
dense = tfidf_matrix.todense()

# inspect the first document
# dense is 3x63; take row 0 and convert it into a flat list of 63 scores
document = dense[0].tolist()[0]

In [31]:
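# column index -> n-gram lookup from the fitted vectorizer
feature_names = tf.get_feature_names_out()

# keep (column index, score) pairs for n-grams that occur in this document
phrase_scores = [pair for pair in enumerate(document) if pair[1] > 0]

# sort by score, highest first, and print the top 20
sorted_phrase_scores = sorted(phrase_scores, key=lambda t: t[1], reverse=True)
for phrase, score in [(feature_names[word_id], score) for (word_id, score) in sorted_phrase_scores][:20]:
    print('{0: <20} {1}'.format(phrase, score))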

2000                 0.218217890236
2000 tv              0.218217890236
2000 tv horror       0.218217890236
clabaugh             0.218217890236
directed             0.218217890236
directed richard     0.218217890236
directed richard clabaugh 0.218217890236
horror               0.218217890236
horror movie         0.218217890236
horror movie directed 0.218217890236
movie                0.218217890236
movie directed       0.218217890236
movie directed richard 0.218217890236
python               0.218217890236
python 2000          0.218217890236
python 2000 tv       0.218217890236
richard              0.218217890236
richard clabaugh     0.218217890236
tv                   0.218217890236
tv horror            0.218217890236
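
Every score in the output above is identical, and a quick check suggests why. The matrix has 63 stored elements for 63 columns, so each n-gram occurs in exactly one document and all of them share the same idf. Each n-gram also occurs once in the first document, so the raw weights are equal. Assuming the default L2 normalization, a row with k equal nonzero entries ends up with 1/sqrt(k) in each. The token list below is read off the printed output rather than computed by scikit-learn.

```python
import math

# unigrams left in the first document after stop-word removal (taken from the output above)
doc0_tokens = ['python', '2000', 'tv', 'horror', 'movie', 'directed', 'richard', 'clabaugh']

# number of unigrams, bigrams, and trigrams that 8 tokens produce: 8 + 7 + 6
n_features = sum(len(doc0_tokens) - n + 1 for n in range(1, 4))

# L2-normalizing a vector of n_features equal values gives 1/sqrt(n_features) per entry
equal_score = 1 / math.sqrt(n_features)
print(n_features, equal_score)
```

The first document therefore has 21 nonzero entries; the listing shows only 20 of them because of the `[:20]` slice. With a larger corpus, where n-grams repeat across documents, the scores would start to differ.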
