In [1]:
import nltk

Make Python aware of the movie reviews corpus, which will enable us to ask it about what categories it has, what files it has, etc.

In [2]:
from nltk.corpus import movie_reviews

In [3]:
movie_reviews.fileids()

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt',
 'neg/cv005_29357.txt',
 'neg/cv006_17022.txt',
 'neg/cv007_4992.txt',
 'neg/cv008_29326.txt',
 'neg/cv009_29417.txt',
 'neg/cv010_29063.txt',
 'neg/cv011_13044.txt',
 'neg/cv012_29411.txt',
 'neg/cv013_10494.txt',
 'neg/cv014_15600.txt',
 'neg/cv015_29356.txt',
 'neg/cv016_4348.txt',
 'neg/cv017_23487.txt',
 'neg/cv018_21672.txt',
 'neg/cv019_16117.txt',
 'neg/cv020_9234.txt',
 'neg/cv021_17313.txt',
 'neg/cv022_14227.txt',
 'neg/cv023_13847.txt',
 'neg/cv024_7033.txt',
 'neg/cv025_29825.txt',
 'neg/cv026_29229.txt',
 'neg/cv027_26270.txt',
 'neg/cv028_26964.txt',
 'neg/cv029_19943.txt',
 'neg/cv030_22893.txt',
 'neg/cv031_19540.txt',
 'neg/cv032_23718.txt',
 'neg/cv033_25680.txt',
 'neg/cv034_29446.txt',
 'neg/cv035_3343.txt',
 'neg/cv036_18385.txt',
 'neg/cv037_19798.txt',
 'neg/cv038_9781.txt',
 'neg/cv039_5963.txt',
 'neg/cv040_8829.txt',
 'neg/cv041_22364.txt',


In [4]:
movie_reviews.categories()

['neg', 'pos']

Give the first file ID a name so we can refer to it for a little bit.

In [5]:
first = movie_reviews.fileids()[0]

In [6]:
first

'neg/cv000_29416.txt'

In [7]:
movie_reviews.words(first)

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

We are going to make a list of pairs here. Each pair has the words from a review as its first element, and the category (pos or neg) as its second element.  We'll do this with a complex list comprehension.

In [8]:
documents = [(movie_reviews.words(fileid), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
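A comprehension with two `for` clauses reads left to right, like nested loops. Here is a minimal sketch of that equivalence on toy data; `toy_corpus` and its file names are made up for illustration and are not part of NLTK.

```python
# Hypothetical stand-in for the corpus: category -> {fileid: words}
toy_corpus = {
    'neg': {'neg/a.txt': ['dull', 'plot'], 'neg/b.txt': ['bad', 'acting']},
    'pos': {'pos/c.txt': ['great', 'fun']},
}

# Comprehension form, same shape as the one used for `documents`
pairs = [(words, category)
         for category in toy_corpus
         for (fileid, words) in toy_corpus[category].items()]

# Equivalent nested loops: the first `for` is the outer loop
pairs_loop = []
for category in toy_corpus:
    for fileid, words in toy_corpus[category].items():
        pairs_loop.append((words, category))

print(pairs == pairs_loop)
print(pairs[0])
```

Both versions build the same list of (words, category) pairs, in the same order.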

In [9]:
documents[0]

(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...], 'neg')

Now that we have this big list of the reviews, we'll shuffle them up (which will be useful later when we are splitting the documents into those we train with, and those we test with).

In [10]:
import random

In [11]:
random.shuffle(documents)
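Note that `random.shuffle()` rearranges the list in place, and differently on each run, so the review at the front (and the accuracy figure later on) will change whenever the notebook is re-executed. If repeatable results are wanted, one option is a seeded generator. This is only a sketch; the seed value 42 is arbitrary and is not something used in class.

```python
import random

items = list(range(10))

# A dedicated Random instance with a fixed seed (42 is an arbitrary choice)
rng = random.Random(42)
shuffled_once = items[:]
rng.shuffle(shuffled_once)

# Re-creating the generator with the same seed repeats the same order
rng = random.Random(42)
shuffled_again = items[:]
rng.shuffle(shuffled_again)

print(shuffled_once == shuffled_again)
```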

In [12]:
documents[0]

(['"', 'love', 'to', 'kill', '"', 'starts', 'off', ...], 'neg')

The goal at this point is to find the most common 2000 words.  We have all of the words in this `documents` thing we just built (the first element of each pair in the list), but we can also just get them from the corpus using `words()`, so may as well do it that way.

In [13]:
movie_reviews.words()

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

In [14]:
len(movie_reviews.words())

1583820

So that `The` and `the` are not treated as different words, we will go through and make a new list of words by transforming all the words in the movie reviews corpus into their lower case versions.

In [15]:
lower_words = [w.lower() for w in movie_reviews.words()]

In [16]:
fd = nltk.FreqDist(lower_words)

In [17]:
fd.most_common(5)

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576)]

What we did above was to take a big list (of 1.6 million words!) and make a second, equally big list (of 1.6 million lowercased words!), and then do a frequency distribution of that.  This could have been kind of slow, and it keeps a second big list in memory.  We can skip that middle step of making a second big list of words by instead giving a generator directly to `FreqDist` as I have done below.  In the end, `fd` and `fd2` should be the same; it's just that `fd2` will have been more efficient to construct.

In [18]:
fd2 = nltk.FreqDist(w.lower() for w in movie_reviews.words())
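The same comparison can be tried on a small scale without the corpus. The sketch below uses `collections.Counter` as a stand-in, on the assumption (true as far as I know) that `nltk.FreqDist` is built on top of `Counter`; the word list is made up.

```python
from collections import Counter

words = ['The', 'movie', 'was', 'the', 'best', 'THE', 'end']

# Version 1: build an intermediate list, then count it
lowered = [w.lower() for w in words]
counts_from_list = Counter(lowered)

# Version 2: hand a generator expression straight to the counter,
# so no intermediate list is ever stored
counts_from_gen = Counter(w.lower() for w in words)

print(counts_from_list == counts_from_gen)
print(counts_from_gen.most_common(1))
```

The counts come out identical; the only difference is whether the lowercased words are stored in a list along the way.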

In [19]:
fd['a']

38106

Now, we'll take the most common 2000 and use those words to name the features that we will use when training the classifier.

In [20]:
word_features = [w for (w, c) in fd.most_common(2000)]

In [21]:
word_features[:10]

[',', 'the', '.', 'a', 'and', 'of', 'to', "'", 'is', 'in']

In [22]:
documents[0]

(['"', 'love', 'to', 'kill', '"', 'starts', 'off', ...], 'neg')

The features are going to look like `contains(the)` and will either be `True` or `False`.  We can use `format()` to construct these.

In [23]:
example = {'contains(the)': True, 'contains(garbage)': False}

In [24]:
'contains({})'.format("hello")

'contains(hello)'

The function below will go through a document and return its features, or essentially its "fingerprint" for these purposes.  It is written in a straightforward and sensible way. But. It. Is. Terribly. Slow.

In [25]:
def document_features(document):
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document)
    return features

In [26]:
document_features(documents[0][0])

{'contains(,)': True,
 'contains(the)': True,
 'contains(.)': True,
 'contains(a)': True,
 'contains(and)': True,
 'contains(of)': True,
 'contains(to)': True,
 "contains(')": True,
 'contains(is)': True,
 'contains(in)': True,
 'contains(s)': True,
 'contains(")': True,
 'contains(it)': True,
 'contains(that)': True,
 'contains(-)': True,
 'contains())': False,
 'contains(()': False,
 'contains(as)': True,
 'contains(with)': True,
 'contains(for)': True,
 'contains(his)': True,
 'contains(this)': True,
 'contains(film)': True,
 'contains(i)': True,
 'contains(he)': True,
 'contains(but)': True,
 'contains(on)': False,
 'contains(are)': False,
 'contains(t)': True,
 'contains(by)': True,
 'contains(be)': True,
 'contains(one)': True,
 'contains(movie)': True,
 'contains(an)': True,
 'contains(who)': True,
 'contains(not)': True,
 'contains(you)': False,
 'contains(from)': False,
 'contains(at)': True,
 'contains(was)': True,
 'contains(have)': False,
 'contains(they)': False,
 'contain

I am going to put a # at the beginning of the next cell so that if you run this, it won't try to execute the list comprehension below. It does the right thing, but it takes forever.  Below there is a new function, `document_features_set()`, that improves on the efficiency of `document_features()` by first taking a `set()` of the words in the document.  This, first of all, reduces the number of things Python needs to check when you are trying to see if, e.g., "the" is in the document.  You don't care how many times it is in there, just that it is.  Second, Python is much quicker at checking membership in a set than in a list: a set lookup uses hashing and takes about the same time however big the set is, whereas a list has to be scanned item by item.  So, `document_features_set` is faster.

In [27]:
#featuresets = [(document_features(d), c) for (d, c) in documents]

In [28]:
def document_features_set(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
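To get a feel for the difference between looking things up in a list and in a set, here is a rough timing sketch on made-up data. The sizes (5,000 tokens, 200 lookups) are arbitrary choices, and the exact timings will vary from machine to machine.

```python
import timeit

# Made-up data: a "document" of 5,000 distinct tokens and 200 words to look up
document = ['w{}'.format(i) for i in range(5000)]
lookups = ['w{}'.format(i) for i in range(4800, 5000)]
document_words = set(document)

def check_list():
    # Each `in` scans the list from the start until it finds a match
    return [w in document for w in lookups]

def check_set():
    # Each `in` is a single hash lookup
    return [w in document_words for w in lookups]

list_time = timeit.timeit(check_list, number=5)
set_time = timeit.timeit(check_set, number=5)
print('list: {:.4f}s  set: {:.4f}s'.format(list_time, set_time))
```

Both functions give the same answers; only the time taken differs.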

In [29]:
featuresets = [(document_features_set(d), c) for (d, c) in documents]

In [30]:
featuresets[0]

({'contains(,)': True,
  'contains(the)': True,
  'contains(.)': True,
  'contains(a)': True,
  'contains(and)': True,
  'contains(of)': True,
  'contains(to)': True,
  "contains(')": True,
  'contains(is)': True,
  'contains(in)': True,
  'contains(s)': True,
  'contains(")': True,
  'contains(it)': True,
  'contains(that)': True,
  'contains(-)': True,
  'contains())': False,
  'contains(()': False,
  'contains(as)': True,
  'contains(with)': True,
  'contains(for)': True,
  'contains(his)': True,
  'contains(this)': True,
  'contains(film)': True,
  'contains(i)': True,
  'contains(he)': True,
  'contains(but)': True,
  'contains(on)': False,
  'contains(are)': False,
  'contains(t)': True,
  'contains(by)': True,
  'contains(be)': True,
  'contains(one)': True,
  'contains(movie)': True,
  'contains(an)': True,
  'contains(who)': True,
  'contains(not)': True,
  'contains(you)': False,
  'contains(from)': False,
  'contains(at)': True,
  'contains(was)': True,
  'contains(have)': F

Now, we have a list of pairs, the first element of which is the features of a given document (`True` or `False` for each of the most common words) and the second element of which is whether the review is positive or negative.  It was already jumbled up earlier.  So, now we will split it at the 100th position, so that our training set is everything from 100 to the end, and the testing set is everything from the beginning to 100.

In [31]:
train_set, test_set = featuresets[100:], featuresets[:100]
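The two slices are complementary: `[:n]` takes the first `n` items and `[n:]` takes everything else, so no item lands in both sets. A toy sketch, with made-up feature sets and `n_test` as an illustrative name:

```python
# Ten made-up (features, label) pairs standing in for the real featuresets
toy_featuresets = [({'contains(x)': i % 2 == 0}, 'pos' if i % 3 else 'neg')
                   for i in range(10)]

n_test = 3
toy_train, toy_test = toy_featuresets[n_test:], toy_featuresets[:n_test]

print(len(toy_train), len(toy_test))
# Putting the pieces back together recovers the original list
print(toy_test + toy_train == toy_featuresets)
```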

In [32]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

So, how does it do on the test set after having been trained on the training set?

In [33]:
print(nltk.classify.accuracy(classifier, test_set))

0.76
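Accuracy here is just the fraction of test items whose predicted label matches the true label, so 0.76 on 100 test reviews means 76 were classified correctly. A hand-rolled sketch of that computation, with made-up labels:

```python
# Made-up true labels and predictions for five reviews
gold = ['pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'pos', 'pos', 'neg']

# Count the positions where prediction and truth agree
correct = sum(1 for g, p in zip(gold, predicted) if g == p)
accuracy = correct / len(gold)
print(accuracy)
```

Because the documents were shuffled randomly, the exact figure will differ a bit from run to run.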


Once trained, a classifier can tell you which features are the most informative.  In the output below, *outstanding* turns up about 10.7 times as often in positive training reviews as in negative ones, so a review that contains it is very likely to be positive.

In [34]:
classifier.show_most_informative_features(10)

Most Informative Features
        contains(seagal) = True              neg : pos    =     12.9 : 1.0
   contains(outstanding) = True              pos : neg    =     10.7 : 1.0
         contains(mulan) = True              pos : neg    =      9.1 : 1.0
   contains(wonderfully) = True              pos : neg    =      7.5 : 1.0
         contains(damon) = True              pos : neg    =      6.4 : 1.0
         contains(flynt) = True              pos : neg    =      5.7 : 1.0
        contains(wasted) = True              neg : pos    =      5.4 : 1.0
        contains(poorly) = True              neg : pos    =      5.3 : 1.0
         contains(awful) = True              neg : pos    =      5.1 : 1.0
          contains(lame) = True              neg : pos    =      5.0 : 1.0
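Each ratio in that table compares how likely the feature is under one label with how likely it is under the other. The sketch below shows the idea with made-up counts; it ignores the smoothing that NLTK applies to its probability estimates, so it is an approximation of the idea rather than NLTK's exact computation.

```python
# Made-up counts, for illustration only
n_pos, n_neg = 950, 950                # training reviews per label
pos_with_word, neg_with_word = 64, 6   # reviews containing the word

# Estimated P(contains(word) = True | label), without smoothing
p_word_given_pos = pos_with_word / n_pos
p_word_given_neg = neg_with_word / n_neg

ratio = p_word_given_pos / p_word_given_neg
print('pos : neg = {:.1f} : 1.0'.format(ratio))
```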


**New topic** Now we're looking at authorship attribution.  We are going to pick out three reviews that I know to have been written by JB, three more that I know to have been written by SG, and then one mystery one.  We're going to look at the characteristics of the documents produced by each author, and then see if the mystery one looks closer to one than the other.  Getting this right is largely about picking the things that matter, that correctly characterize a single author's writing and differentiate it from other authors' writing.

In [35]:
jbf = ['29416', '29417', '29439']
sgf = ['29423', '29444', '29465']
myf = ['29497']

In [36]:
movie_reviews.fileids()[0]

'neg/cv000_29416.txt'

Using the id numbers above, we'll build lists of file IDs for JB, SG, and the mystery reviews.

In [37]:
jbfids = [f for f in movie_reviews.fileids() if f[10:15] in jbf]
sgfids = [f for f in movie_reviews.fileids() if f[10:15] in sgf]
myfids = [f for f in movie_reviews.fileids() if f[10:15] in myf]
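The slice `f[10:15]` works because the IDs we want are all five digits long and start at position 10. Some file names have shorter IDs (for example `neg/cv007_4992.txt`), and for those the slice picks up the dot as well. A sketch of an alternative that parses the name instead of counting characters; `review_id` is a hypothetical helper, not something from NLTK.

```python
def review_id(fileid):
    """Pull the ID between the underscore and the extension."""
    name = fileid.split('/')[-1]      # 'cv000_29416.txt'
    stem = name.rsplit('.', 1)[0]     # 'cv000_29416'
    return stem.split('_')[1]         # '29416'

fileids = ['neg/cv000_29416.txt', 'neg/cv007_4992.txt', 'pos/cv015_29439.txt']
wanted = ['29416', '29439']

print([f for f in fileids if review_id(f) in wanted])
# Compare the fixed slice with the parsed ID on a four-digit file
print(fileids[1][10:15], review_id(fileids[1]))
```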

In [38]:
jbfids

['neg/cv000_29416.txt', 'neg/cv009_29417.txt', 'pos/cv015_29439.txt']

In [39]:
movie_reviews.words(jbfids[0])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

Lexical diversity may matter.  So, here's a computation of the lexical diversity of JB's first review.

In [40]:
rwords = movie_reviews.words(jbfids[0])

In [41]:
len(rwords)

879

In [42]:
len(set(rwords))

354

In [43]:
def lex_diversity(word_list):
    return len(set(word_list)) / len(word_list)

In [44]:
lex_diversity(rwords)

0.40273037542662116
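Lexical diversity is the number of distinct words divided by the total number of words, so the value above is just 354 / 879. A small sketch on a made-up word list makes the arithmetic easy to check by hand:

```python
def lex_diversity(word_list):
    # distinct words / total words
    return len(set(word_list)) / len(word_list)

# Six words, four of them distinct ('the' and 'cat' each appear twice)
toy_words = ['the', 'cat', 'saw', 'the', 'other', 'cat']
print(lex_diversity(toy_words))

# The figures from JB's first review
print(354 / 879)
```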

Average word length might be characteristic of an author.  This takes a list of words and provides an average.  It is not the most elegant imaginable version.

In [45]:
def avg_word_length(word_list):
    total = 0  # named `total` so it doesn't hide the built-in sum()
    for w in word_list:
        total = total + len(w)
    return total / len(word_list)

In [46]:
avg_word_length(movie_reviews.words(sgfids[0]))

3.891358024691358

We can do a little bit better by using the built-in `sum()`, which adds up the items in a list.  (I didn't do this in class, but this is better.)

In [47]:
sum([1, 2, 3])

6

In [48]:
def better_avg_length(word_list):
    return sum([len(w) for w in word_list])/len(word_list)

In [49]:
better_avg_length(movie_reviews.words(sgfids[0]))

3.891358024691358

Now, let's do something that provides the average sentence length, since maybe that will correspond to an author.

In [50]:
def avg_sent_length(sent_list):
    total = 0  # named `total` so it doesn't hide the built-in sum()
    for s in sent_list:
        total = total + len(s)
        #print('{}: {}'.format(total, s))
    return total/len(sent_list)

In [51]:
avg_sent_length(movie_reviews.sents(sgfids[0]))

40.5

That feels surprisingly long for an average sentence, but maybe SG is verbose.  (Keep in mind that a sentence's length here is its number of tokens, and punctuation marks count as tokens.)  Also note that `avg_sent_length` is actually just the same as `avg_word_length`.  We don't really need both versions.  They're both just computing the average length of the things in the list we provide as an argument.  And so `better_avg_length` will work just fine for either of them.

As a note, there is a commented out `print()` line there, because at first it seemed to me like 40 was too long for an average sentence.  By adding in `print()` we can watch what the program is doing at each step and see whether it is doing something we didn't expect/intend.  In this case, the program was working; SG is just verbose.  We can compare this to JB's first file below.

In [52]:
avg_sent_length(movie_reviews.sents(jbfids[0]))

19.977272727272727

We get the same thing with the `better_avg_length()` function too, happily.

In [53]:
better_avg_length(movie_reviews.sents(sgfids[0]))

40.5

In [54]:
better_avg_length(movie_reviews.sents(jbfids[0]))

19.977272727272727
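The reason one function covers both cases is that `len()` works on a string (giving its characters) and on a list (giving its items). A sketch on toy data; `avg_length` is an illustrative name, and it uses a generator inside `sum()` so that no intermediate list is built.

```python
def avg_length(items):
    # Average of len(x) over the items, whatever kind of thing they are
    return sum(len(x) for x in items) / len(items)

toy_words = ['a', 'bb', 'ccc']                   # lengths 1, 2, 3
toy_sents = [['a', 'b'], ['c', 'd', 'e', 'f']]   # lengths 2, 4

print(avg_length(toy_words))   # average word length in characters
print(avg_length(toy_sents))   # average sentence length in tokens
```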

In [55]:
sgfids[0]

'neg/cv128_29444.txt'

In [56]:
movie_reviews.words(sgfids[0])

['susan', 'granger', "'", 's', 'review', 'of', '"', ...]

In [57]:
movie_reviews.sents(sgfids[0])

[['susan', 'granger', "'", 's', 'review', 'of', '"', 'ghosts', 'of', 'mars', '"', '(', 'sony', 'pictures', 'entertainment', ')', 'horror', 'auteur', 'john', 'carpenter', '(', '"', 'halloween', ',', '"', '"', 'vampires', '"', ')', 'strikes', 'out', 'with', 'this', 'sci', '-', 'fi', 'eco', '-', 'fable', 'that', "'", 's', 'so', 'bad', 'it', 'boggles', 'the', 'mind', 'to', 'imagine', 'how', 'the', 'project', 'ever', 'got', 'green', '-', 'lit', '.'], ['the', 'script', 'by', 'carpenter', 'and', 'larry', 'sulkis', 'appears', 'to', 'have', 'been', 'lifted', 'directly', 'from', 'last', 'year', "'", 's', '"', 'pitch', 'black', ',', '"', 'involving', 'a', 'violent', 'prisoner', 'who', 'must', 'be', 'released', 'from', 'bondage', 'so', 'that', 'he', 'can', 'help', 'a', 'small', 'band', 'of', 'humans', 'protect', 'themselves', 'from', 'blood', '-', 'thirsty', ',', 'marauding', 'aliens', '.'], ...]

So, now, we make a function that, if given a `fileid`, will create a dictionary of its features, with keys `lex` (lexical diversity), `word` (average word length), and `sent` (average sentence length).

In [58]:
def auth_stats(fileid):
    features = {}
    features['lex'] = lex_diversity(movie_reviews.words(fileid))
    features['word'] = avg_word_length(movie_reviews.words(fileid))
    features['sent'] = avg_sent_length(movie_reviews.sents(fileid))
    return features

In [59]:
auth_stats(sgfids[0])

{'lex': 0.5580246913580247, 'word': 3.891358024691358, 'sent': 40.5}

Now, let's look at all of them for each author.  If this were a larger scale project, we would want to do some kind of statistics here, like finding the mean and standard deviation of each property.  But there are only three reviews per author, so we can eyeball them.  SG is in the first set, JB is in the second.

In [60]:
[auth_stats(f) for f in sgfids]

[{'lex': 0.5580246913580247, 'word': 3.891358024691358, 'sent': 40.5},
 {'lex': 0.5686274509803921,
  'word': 3.8186274509803924,
  'sent': 31.384615384615383},
 {'lex': 0.5763157894736842,
  'word': 4.126315789473685,
  'sent': 34.54545454545455}]

In [61]:
[auth_stats(f) for f in jbfids]

[{'lex': 0.40273037542662116,
  'word': 3.621160409556314,
  'sent': 19.977272727272727},
 {'lex': 0.3960975609756098,
  'word': 3.4634146341463414,
  'sent': 12.654320987654321},
 {'lex': 0.4596100278551532,
  'word': 3.649025069637883,
  'sent': 27.615384615384617}]
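For what it's worth, the mean and standard deviation mentioned above take only a few lines with the standard `statistics` module. This is a sketch using the average sentence lengths copied from the outputs above; with just three reviews per author the numbers are only a rough guide.

```python
from statistics import mean, stdev

# Average sentence lengths copied from the two outputs above
sg_sent = [40.5, 31.384615384615383, 34.54545454545455]
jb_sent = [19.977272727272727, 12.654320987654321, 27.615384615384617]

for name, values in [('SG', sg_sent), ('JB', jb_sent)]:
    print('{}: mean {:.2f}, stdev {:.2f}'.format(name, mean(values), stdev(values)))
```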

Ok, so now let's look at the mystery review. Which author is this most like?

In [62]:
auth_stats(myfids[0])

{'lex': 0.35942492012779553,
 'word': 3.5279552715654954,
 'sent': 18.686567164179106}
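One simple way to put a number on "most like" is to average each author's three reviews and measure how far the mystery review is from each average. The sketch below does that with values rounded from the outputs above. Dividing each difference by the feature's spread across the six known reviews is my own choice, made so that sentence length (a big number) doesn't swamp the other two; it is one of many reasonable options, not a standard method from the book.

```python
from statistics import mean

keys = ['lex', 'word', 'sent']

# Values rounded from the outputs above
sg = [{'lex': 0.558, 'word': 3.891, 'sent': 40.5},
      {'lex': 0.569, 'word': 3.819, 'sent': 31.385},
      {'lex': 0.576, 'word': 4.126, 'sent': 34.545}]
jb = [{'lex': 0.403, 'word': 3.621, 'sent': 19.977},
      {'lex': 0.396, 'word': 3.463, 'sent': 12.654},
      {'lex': 0.460, 'word': 3.649, 'sent': 27.615}]
mystery = {'lex': 0.359, 'word': 3.528, 'sent': 18.687}

def centroid(stats):
    # Per-feature average over one author's reviews
    return {k: mean(s[k] for s in stats) for k in keys}

# Spread of each feature over all six known reviews, used for scaling
known = sg + jb
spread = {k: max(s[k] for s in known) - min(s[k] for s in known) for k in keys}

def distance(a, b):
    # Sum of scaled absolute differences, feature by feature
    return sum(abs(a[k] - b[k]) / spread[k] for k in keys)

scores = {'SG': distance(mystery, centroid(sg)),
          'JB': distance(mystery, centroid(jb))}
closest = min(scores, key=scores.get)
print(scores)
print(closest)
```

The mystery review is closer to JB's average on all three features, so the choice of scaling does not change the answer here.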

This last part is mostly just some demos related to the things discussed in chapter 4.  Variable assignment surprises and so forth.

In [63]:
x = 4

In [64]:
y = x

In [65]:
x

4

In [66]:
y

4

In [67]:
x = 6

In [68]:
y

4

In [69]:
x = [0, 1, 2]

In [70]:
y = x

In [71]:
y

[0, 1, 2]

In [72]:
x[1] = 88

In [73]:
y

[0, 88, 2]
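The surprise with the list comes from the fact that `y = x` does not copy anything: both names end up referring to the same list object. If an independent list is wanted, it has to be copied explicitly. A short sketch (the names `alias` and `separate` are just for illustration):

```python
x = [0, 1, 2]
alias = x            # a second name for the same list
separate = list(x)   # a new list with the same contents

x[1] = 88

print(alias)         # follows the change, since it is the same object
print(separate)      # unaffected
print(alias is x, separate is x)
```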

In [74]:
help(movie_reviews.words())

Help on ConcatenatedCorpusView in module nltk.corpus.reader.util object:

class ConcatenatedCorpusView(nltk.collections.AbstractLazySequence)
 |  ConcatenatedCorpusView(corpus_views)
 |  
 |  A 'view' of a corpus file that joins together one or more
 |  ``StreamBackedCorpusViews<StreamBackedCorpusView>``.  At most
 |  one file handle is left open at any time.
 |  
 |  Method resolution order:
 |      ConcatenatedCorpusView
 |      nltk.collections.AbstractLazySequence
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, corpus_views)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __len__(self)
 |      Return the number of tokens in the corpus file underlying this
 |      corpus view.
 |  
 |  close(self)
 |  
 |  iterate_from(self, start_tok)
 |      Return an iterator that generates the tokens in the corpus
 |      file underlying this corpus view, starting at the token number
 |      ``start``.  If ``start>=len(self)``, then 