#### Day 6: Regular Expressions and Naive Bayes Classifier


##### Part 1: Regular Expressions

- Regular expressions are useful for extracting information from text.
- A regular expression is a set of “rules” to identify or match a particular sequence of characters.
- Most text is encoded in UTF-8 or UTF-16: letters, digits, punctuation and symbols
- In Python, mainly through the `re` library


In [2]:
# Set Directory
import os
os.chdir('C:\\Users\\maria\\Documents\\GitHub\\pythoncamp2023\\Day06\\Lecture')

import re # for regular expressions

For demonstration, we will work with Obama's 2008 concession speech from the New Hampshire primary.

In [3]:
# Read in the example text. Remember:
# readlines() returns a list with one string per line of the file
with open("obama-nh.txt", "r") as f:
  text = f.readlines()

Let's take a look at how this file is structured 

In [4]:
# How does it impact our 'text' object?
print(text[0])
# print(text[1])
# print(text[2])

# print(text[0:3])

I want to congratulate Senator Clinton on a hard-fought victory here in



In [5]:
# Join into one string
# What could we have done at the outset instead?
alltext = ''.join(text) 

Or equivalently

In [6]:
with open("obama-nh.txt", "r") as f:
  alltext = f.read()

##### 1.1 Useful functions from `re` library:

- `re.findall`: Return all non-overlapping matches of pattern
            in string, as a list of strings.
- `re.split`: Split string by the occurrences of pattern.
- `re.match`: Try to match the pattern at the beginning of the
          string only. Returns a match object, or `None` if
          the string does not start with a match.
- `re.search`: Like `re.match`, but scans the whole string and
          returns the first match found at any position.
- `re.compile`: Compile a regular expression pattern into a regular
            expression object, which can be used for matching using
            `match()`, `search()` and other methods.

Source: https://docs.python.org/3/library/re.html
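
To make the difference between these functions concrete, here is a minimal sketch on a short made-up string. The `sample` string and the variable names are illustrative assumptions, not taken from the speech file.

```python
import re

sample = "Yes we can. Yes we did."  # hypothetical example string

# re.match only tries the pattern at position 0 of the string
at_start = re.match("Yes", sample)     # a match object
not_at_start = re.match("we", sample)  # None: the string does not start with "we"

# re.search scans forward and returns the first match at any position
first_we = re.search("we", sample)

# re.findall collects every non-overlapping match as a list of strings
all_we = re.findall("we", sample)

# re.split cuts the string wherever the pattern matches
pieces = re.split("we", sample)

print(at_start.group(), not_at_start)
print(first_we.start(), first_we.group())
print(all_we)
print(pieces)
```

Because `re.match` and `re.search` return `None` when nothing matches, it is usually safest to test the result before calling `.group()` on it.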

Let's run some examples!

In [7]:
# re.findall(pattern="Yes we can", string=alltext)  # same call with keyword arguments
re.findall("Yes we can", alltext)  # all instances of "Yes we can"

['Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can']

In [8]:
re.findall("American", alltext) # All instances of American

['American', 'American', 'American', 'American']

In [9]:
re.findall("\n", alltext) # all line breaks

['\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n']

##### 1.2 Backslash Characters

Regular expressions use the backslash character `\` to indicate special forms or to allow special characters to be used without invoking their special meaning.

Note: this collides with Python's use of the same character for the same purpose in string literals.

How do we find the literal character `\` in our file? 

In [None]:
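# re.findall("\", alltext)   # SyntaxError: \" escapes the closing quote
# re.findall("\\", alltext)  # re.error: the regex engine receives a single, dangling \
re.findall("\\\\", alltext)  # Python reads this as \\, which the regex engine reads as one literal \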

One way to address this issue is to use Python's raw string notation for regular expression patterns.

Backslashes are NOT handled in any special way in a string prefixed with `r`. 

So equivalently: 

In [None]:
re.findall(r"\\", alltext)

In [None]:
print("\n")

In [None]:
print(r"\n")

In [None]:
print("\\n")
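
The three `print` calls above can be checked more directly by comparing string lengths. The following is a small sketch; the `path` value is a hypothetical example used only to have some backslashes to search for.

```python
import re

# "\n" is one character (a newline); r"\n" is two (a backslash and the letter n)
newline = "\n"
raw_newline = r"\n"
print(len(newline), len(raw_newline))

# Both spellings hand the regex engine the same two-character pattern \\
same_pattern = ("\\\\" == r"\\")
print(same_pattern)

# Hypothetical Windows-style path, written as a raw string
path = r"C:\Users\demo\notes.txt"
backslashes = re.findall(r"\\", path)
print(len(backslashes))
```

Writing every regex pattern as a raw string is a common habit, since it removes one of the two layers of escaping.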

##### 1.3 Basic special characters

In [None]:
# \d matches any decimal digit (for ASCII text, the same as [0-9])
re.findall(r"\d", alltext)

In [None]:
# \D matches any character that is NOT a decimal digit (for ASCII text, the same as [^0-9])
re.findall(r"\D", alltext)

`[]` can be used to indicate a set of characters 

In [None]:
# [a] matches every occurrence of the single character a
re.findall("[a]", alltext)

In [None]:
# [a-d] matches any one character in the range a to d
re.findall("[a-d]", alltext)

In [None]:
# [^a-z] matches any one character that is NOT in the range a to z (^ negates the set)
re.findall("[^a-z]", alltext)

In [None]:
# [a-zA-Z0-9] matches any ASCII letter or digit (alphanumeric)
re.findall("[a-zA-Z0-9]", alltext)

In [None]:
# \w matches one "word" character: a letter, a digit, or the underscore
re.findall(r"\w", alltext) # like [a-zA-Z0-9_] for ASCII text

In [None]:
# \W matches any non-word character, the opposite of \w
re.findall(r"\W", alltext) # like [^a-zA-Z0-9_] for ASCII text
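
A quick way to see that `\w` is slightly broader than `[a-zA-Z0-9]` is to run both on a string containing an underscore. This is a minimal sketch; the string `s` is a made-up example.

```python
import re

s = "snake_case id_42!"  # hypothetical example string

word_runs = re.findall(r"\w+", s)            # the underscore counts as a word character
alnum_runs = re.findall(r"[a-zA-Z0-9]+", s)  # the underscore splits the runs
non_word = re.findall(r"\W", s)              # only the space and the exclamation mark

print(word_runs)
print(alnum_runs)
print(non_word)
```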

In [None]:
# \s whitespace
re.findall(r"\s", alltext)

In [None]:
# \S non-whitespace
re.findall(r"\S", alltext)

In [None]:
# . matches any character (including spaces) except a newline
re.findall(".", alltext)

In [None]:
# \ is the escape character: \. matches a literal period (. alone has a special meaning)
re.findall(r"\.", alltext)

In [None]:
# ? Makes the preceding RE optional. (match 0 or 1 repetitions of the preceding RE)
re.findall("Am?", alltext) # This would match A or Am where m is optional

In [None]:
# + matches 1 or more repetitions of the preceding RE
re.findall(r"\d+", alltext)
# re.findall("am+", alltext)

In [None]:
# * match 0 or more repetitions of the preceding RE
re.findall("am*", alltext) # match a, am, or a followed by any number of m's 

In [None]:
# get any word that starts with America
re.findall(r"America[a-z]*", alltext) 

`{m}` specifies that exactly m copies of the previous RE should be matched

In [None]:
# {x} exactly x times (runs of exactly x consecutive digits)
re.findall(r"\d{2}", alltext)

In [None]:
re.findall(r"\d{1}", alltext)

`{m,n}` matches from m to n repetitions of the preceding RE, while attempting to match as many repetitions as possible

In [None]:
re.findall(r"\d{1,3}", alltext)

- There are many more special characters
- Regex can be very powerful, and very complicated
- Use parentheses to group things together when using operators like `+`, `*`, `?`, `^`
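
The point about parentheses can be sketched as follows. The string `s` is a made-up example, and the non-capturing group `(?:...)` is a standard `re` feature not covered above; it is used here because `re.findall` returns only the captured group when a pattern contains a capturing group.

```python
import re

s = "ab abab abbb"  # hypothetical example string

# Without grouping, + applies only to the b
b_repeated = re.findall(r"ab+", s)

# With a (non-capturing) group, + applies to the whole pair "ab"
pair_repeated = re.findall(r"(?:ab)+", s)

# With a capturing group, findall returns the group's text, not the full match
captured = re.findall(r"(ab)+", s)

print(b_repeated)
print(pair_repeated)
print(captured)
```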


<br>

##### Short Exercise: 
How would we grab 10/10 and 19/18 as they appear in the text using `re.findall()`? 

In [None]:
x = "Hi 10/10 hello 19/18 asdf 7/6 and 1/10 or 10/1 "

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

In [None]:
# Answer
re.findall(r"\d{2}/\d{2}", x)
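
One caveat with this answer: the pattern also matches inside longer numbers. A possible refinement, sketched here under the assumption that only standalone two-digit pairs are wanted, uses the word-boundary token `\b` (a standard `re` token not introduced above). The string `y` is a hypothetical example.

```python
import re

x = "Hi 10/10 hello 19/18 asdf 7/6 and 1/10 or 10/1 "
y = "ratio 110/105 and score 10/10"  # hypothetical string with longer numbers

loose = r"\d{2}/\d{2}"       # two digits, slash, two digits, anywhere
strict = r"\b\d{2}/\d{2}\b"  # the same, but not inside a longer number

loose_x = re.findall(loose, x)
strict_x = re.findall(strict, x)
loose_y = re.findall(loose, y)    # also picks up the "10/10" inside "110/105"
strict_y = re.findall(strict, y)

print(loose_x, strict_x)
print(loose_y, strict_y)
```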

##### 1.4 `re.split()`

Split string by the occurrences of pattern. 

In [None]:
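# splits at each digit; the digits themselves are dropped
re.split(r"\d", alltext)

In [None]:
# * applies only to the final "a": splits at "Americ", "America", "Americaa", ...
re.split("America*", alltext)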
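
To see how the choice of pattern changes what `re.split` removes, here is a small sketch on a made-up sentence (`s` is a hypothetical example). It also shows the optional `maxsplit` argument, which limits the number of splits.

```python
import re

s = "Americans love America"  # hypothetical example string

# "America*" consumes "Americ" plus any run of a's, so "ns" is left behind
by_star = re.split(r"America*", s)

# "America[a-z]*" consumes the whole word
by_word = re.split(r"America[a-z]*", s)

# maxsplit=1 stops after the first split
limited = re.split(r"\s", s, maxsplit=1)

print(by_star)
print(by_word)
print(limited)
```

Empty strings appear in the result whenever a match sits at the very start or end of the input.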

##### 1.5 `re.compile()`

Compile a RE pattern into a RE object, which can then be used for matching using the `match()` and `search()` methods. 

In [None]:
keyword = re.compile("America[a-z]*")

In [None]:
# search the line-by-line version of the file for the keyword
for line in text:
    if keyword.search(line): # reuse the compiled RE here
        print(line)

In [None]:
# Create a regex object
pattern = re.compile(r'\d+')

In [None]:
pattern.findall(alltext) # equivalent to re.findall(r"\d+", alltext) from earlier

In [None]:
pattern.split(alltext)
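
A compiled pattern can also carry flags. The sketch below assumes we want a case-insensitive version of the keyword search; the `lines` list is a made-up stand-in for the speech, and `re.IGNORECASE` is a standard `re` flag not covered above.

```python
import re

# Compile once, with a flag, then reuse the object
keyword_ci = re.compile(r"america[a-z]*", re.IGNORECASE)

lines = ["America is ready.", "Yes we can.", "the american people"]  # hypothetical

# Keep the lines that contain the keyword in any capitalization
hits = [ln for ln in lines if keyword_ci.search(ln)]

# The same object also offers findall, split, and the other methods
found = keyword_ci.findall(" ".join(lines))

print(hits)
print(found)
```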

##### 1.6 `re.MULTILINE` or `re.M`

When specified, `^` and `$` match at the start and end of every line in the string, not only at the start and end of the whole string.

In [None]:
mline = "python\nis\nfun"
print(mline)

I want to search for "fun" in the third line, where it starts with an "f"

- We can use `^` to match at the start of a string
- Be careful: inside `[]`, a leading `^` negates the character set
- `$` can be used to match at the end of a string

In [None]:
re.findall(r"^f\w*", mline)

In [None]:
# re.findall(r"^f\w*", mline, re.M)
re.findall(r"^f\w*", mline, re.MULTILINE)
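
The same flag changes the behaviour of `$`. A minimal sketch on the `mline` string from above:

```python
import re

mline = "python\nis\nfun"

# Without the flag, ^ and $ anchor to the whole string
start_plain = re.findall(r"^\w+", mline)
end_plain = re.findall(r"\w+$", mline)

# With re.MULTILINE, they anchor to every line
start_multi = re.findall(r"^\w+", mline, re.MULTILINE)
end_multi = re.findall(r"\w+$", mline, re.MULTILINE)

print(start_plain, end_plain)
print(start_multi, end_multi)
```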

##### Short Exercise:

What does the following code search for? 

In [None]:
re.findall(r"^.*\.$", alltext, re.MULTILINE)

##### Part 2: Naive Bayes Classification


Docs for this library: https://www.nltk.org/api/nltk.classify.naivebayes.html

#### 2.1 Installation and Import Libraries

In [10]:
!pip3 install nltk
import nltk
nltk.download('names')
from nltk.corpus import names
import random

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\names.zip.


In [11]:
# Create a list of (name, label) tuples
# Note: this rebinds `names`, so re-import the corpus before re-running this cell
names = ([(name, 'male') for name in names.words('male.txt')] +
        [(name, 'female') for name in names.words('female.txt')])

In [12]:
names

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male'),
 ('Abdullah', 'male'),
 ('Abe', 'male'),
 ('Abel', 'male'),
 ('Abelard', 'male'),
 ('Abner', 'male'),
 ('Abraham', 'male'),
 ('Abram', 'male'),
 ('Ace', 'male'),
 ('Adair', 'male'),
 ('Adam', 'male'),
 ('Adams', 'male'),
 ('Addie', 'male'),
 ('Adger', 'male'),
 ('Aditya', 'male'),
 ('Adlai', 'male'),
 ('Adnan', 'male'),
 ('Adolf', 'male'),
 ('Adolfo', 'male'),
 ('Adolph', 'male'),
 ('Adolphe', 'male'),
 ('Adolpho', 'male'),
 ('Adolphus', 'male'),
 ('Adrian', 'male'),
 ('Adrick', 'male'),
 ('Adrien', 'male'),
 ('Agamemnon', 'male'),
 ('Aguinaldo', 'male'),
 ('Aguste', 'male'),
 ('Agustin', 'male'),
 ('Aharon', 'male'),
 ('Ahmad', 'male'),
 ('Ahmed', 'male'),
 ('Ahmet', 'male'),
 ('Ajai', 'male'),
 ('Ajay', 'male'),
 ('Al', 'male'),
 ('Alaa', 'male'),
 ('Alain', 'male'),
 ('Alan', 'male

In [13]:
# Now, we shuffle
random.shuffle(names)
print(names)

[('Ariela', 'female'), ('Hashim', 'male'), ('Arlena', 'female'), ('Broddie', 'male'), ('Rebecka', 'female'), ('Theodore', 'male'), ('Rochella', 'female'), ('Daune', 'female'), ('Querida', 'female'), ('Udell', 'male'), ('Cori', 'female'), ('Elbert', 'male'), ('Alton', 'male'), ('Rey', 'male'), ('Avrit', 'female'), ('Gusta', 'female'), ('Waylin', 'male'), ('Jerold', 'male'), ('Rosalinda', 'female'), ('Judson', 'male'), ('Odell', 'male'), ('Joelie', 'female'), ('Travers', 'male'), ('Rab', 'male'), ('Marie-Ann', 'female'), ('Ben', 'male'), ('Gwennie', 'female'), ('Ambrosia', 'female'), ('Rob', 'male'), ('Beatrice', 'female'), ('Urbano', 'male'), ('Michelina', 'female'), ('Gayla', 'female'), ('Jesse', 'male'), ('Inci', 'female'), ('Dario', 'male'), ('Debby', 'female'), ('Aloysia', 'female'), ('Whittaker', 'male'), ('Eliot', 'male'), ('Bee', 'female'), ('Elton', 'male'), ('Arron', 'male'), ('Emera', 'female'), ('Kelsy', 'female'), ('Meagan', 'female'), ('Leorah', 'female'), ('Mateo', 'male')

#### 2.2 Split Training and Test Sets

In [14]:
len(names) # N of observations

7944

In [15]:
# Define the training set size; the remaining names form the test set
train_size = 5000

# Split train and test objects
train_names = names[:train_size]
test_names = names[train_size:]
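
Because the split depends on the shuffle, results change from run to run. One way to make them repeatable is sketched below, assuming a fixed seed and a proportional split rather than the fixed size used above. The `toy_names` list, the seed value, and the 70% fraction are all illustrative assumptions.

```python
import random

# Hypothetical stand-in for the labeled names list
toy_names = [("name%d" % i, "male" if i % 2 else "female") for i in range(10)]

rng = random.Random(2023)  # fixed seed (arbitrary) so the shuffle is repeatable
shuffled = toy_names[:]    # copy, so the original order is left untouched
rng.shuffle(shuffled)

train_fraction = 0.7       # assumed proportion, not the lecture's fixed 5000
cut = round(train_fraction * len(shuffled))
train_part, test_part = shuffled[:cut], shuffled[cut:]

print(len(train_part), len(test_part))
```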

#### 2.3 Define Features

In [20]:
# A simple feature: Get the last letter of the name
def g_features1(name):
  return {'last_letter': name[-1]}

Tip: a Python function can return multiple values (they come back as a tuple)

In [21]:
# Quick break for some syntax:
def return_two():
  return 5, 10

# When a function returns two values, we can unpack them like this:
x, y = return_two()
x, y

(5, 10)

#### 2.4 Data Preparation

Loop over the names and build a tuple of (feature dictionary, label) for each one

In [22]:
train_set = [(g_features1(n), g) for (n, g) in train_names]
test_set = [(g_features1(n), g) for (n,g) in test_names]

In [23]:
train_set

[({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'm'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'l'}, 'male'),
 ({'last_letter': 'i'}, 'female'),
 ({'last_letter': 't'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 't'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'd'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'l'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 's'}, 'male'),
 ({'last_letter': 'b'}, 'male'),
 ({'last_letter': 'n'}, 'female'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'b'}, 'male'),
 ({'last_letter

#### 2.5 Train the Classifier

In [24]:
# Train the naive Bayes classifier on the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)
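
To give a sense of what training computes, here is a simplified pure-Python sketch of naive Bayes with a single feature (the last letter). It is not NLTK's implementation: the toy data is made up, and add-one smoothing is used as a stand-in for whatever smoothing the library applies.

```python
from collections import Counter, defaultdict

# Hypothetical toy data in the same (name, label) shape as train_names
toy = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
       ("Irene", "female"), ("Mark", "male"), ("Peter", "male"),
       ("Jack", "male"), ("Luca", "male")]

# Count labels, and last letters within each label
label_counts = Counter(label for _, label in toy)
letter_counts = defaultdict(Counter)
for name, label in toy:
    letter_counts[label][name[-1].lower()] += 1

# Distinct last letters seen in training (used by the smoothing term)
letters = {name[-1].lower() for name, _ in toy}

def posterior(name, alpha=1.0):
    """Return P(label | last letter) using Bayes' rule with add-alpha smoothing."""
    letter = name[-1].lower()
    scores = {}
    for label, n in label_counts.items():
        prior = n / len(toy)  # P(label)
        likelihood = (letter_counts[label][letter] + alpha) / (n + alpha * len(letters))
        scores[label] = prior * likelihood  # proportional to P(label | letter)
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

post = posterior("Sofia")
print(post)
```

With several features, the "naive" assumption is that they are independent given the label, so the likelihoods of the individual features are simply multiplied together.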

#### 2.6 Test your Classifier

In [25]:
# Apply the classifier to some names
classifier.classify(g_features1('Cecilia'))

'female'

In [26]:
classifier.classify(g_features1('Leticia'))

'female'

In [27]:
classifier.classify(g_features1('Irene'))

'female'

In [28]:
classifier.classify(g_features1('Jie'))

'female'

In [29]:
classifier.classify(g_features1('Tian'))

'male'

In [30]:
classifier.classify(g_features1('Masanori'))

'female'

In [31]:
classifier.classify(g_features1('Peter'))

'male'

In [43]:
# Get the probability of female:
classifier.prob_classify(g_features1('Cecilia')).prob("female")

0.9853069906549963

In [33]:
classifier.prob_classify(g_features1('Peter')).prob("male")

0.8233487067059713

We can check the overall accuracy with our test set. 

More on accuracy, F1, precision, recall: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

In [34]:
print(nltk.classify.accuracy(classifier, test_set))

0.7588315217391305
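
Accuracy is the share of test items whose predicted label equals the true label. The sketch below computes it by hand, together with precision, recall and F1 for one label. The `gold` and `pred` lists are made-up examples, not output from the classifier above.

```python
# Hypothetical true labels and predictions for six names
gold = ["female", "female", "male", "male", "female", "male"]
pred = ["female", "male", "male", "female", "female", "male"]

# Accuracy: fraction of positions where prediction and truth agree
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Counts for the label "female"
tp = sum(g == "female" and p == "female" for g, p in zip(gold, pred))  # true positives
fp = sum(g != "female" and p == "female" for g, p in zip(gold, pred))  # false positives
fn = sum(g == "female" and p != "female" for g, p in zip(gold, pred))  # false negatives

precision = tp / (tp + fp)  # of the names predicted female, how many are female
recall = tp / (tp + fn)     # of the female names, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```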


#### 2.7 Feature Attribution

In [35]:
# Let's see what is driving this
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     39.9 : 1.0
             last_letter = 'k'              male : female =     19.4 : 1.0
             last_letter = 'v'              male : female =     11.7 : 1.0
             last_letter = 'o'              male : female =      9.9 : 1.0
             last_letter = 'w'              male : female =      9.1 : 1.0


Let's be smarter and add more features!

In [82]:
# What are we including now?
def g_features2(name):
  features = {}
  features["firstletter"] = name[0].lower()
  features["lastletter"] = name[-1].lower()
  for letter in 'abcdefghijklmnopqrstuvwxyz':
      features["count(%s)" % letter] = name.lower().count(letter)
      features["has(%s)" % letter] = (letter in name.lower())
  return features

In [83]:
# What are we including now?
def g_features3(name):
  features = {}
  features["firstletter"] = name[0].lower()
  #features["lastletter"] = name[-1].lower()
  features["length"] = len(name)
  # Despite the key name, name[:2] is the first TWO letters (e.g. 'ce' for 'Cecilia')
  features['secondletter'] = name[:2].lower()
  for letter in 'abcdefghijklmnopqrstuvwxyz':
      features["count(%s)" % letter] = name.lower().count(letter)
      features["has(%s)" % letter] = (letter in name.lower())
  return features

In [84]:
g_features3('Cecilia')

{'firstletter': 'c',
 'length': 7,
 'secondletter': 'ce',
 'count(a)': 1,
 'has(a)': True,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 2,
 'has(c)': True,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 1,
 'has(e)': True,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 0,
 'has(h)': False,
 'count(i)': 2,
 'has(i)': True,
 'count(j)': 0,
 'has(j)': False,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 1,
 'has(l)': True,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 0,
 'has(n)': False,
 'count(o)': 0,
 'has(o)': False,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [85]:
# Run for train set
train_set = [(g_features3(n), g) for (n,g) in train_names]
# Run for test set
test_set = [(g_features3(n), g) for (n,g) in test_names]

In [86]:
# Run new classifier
classifier_new = nltk.NaiveBayesClassifier.train(train_set)

In [87]:
# Check the overall accuracy with test set
print(nltk.classify.accuracy(classifier_new, test_set))

0.704483695652174


In [71]:
# Let's see what is driving this
classifier_new.show_most_informative_features(20)

Most Informative Features
            secondletter = 'k'              male : female =      9.5 : 1.0
                count(v) = 2              female : male   =      6.9 : 1.0
                count(a) = 3              female : male   =      5.1 : 1.0
             firstletter = 'w'              male : female =      5.0 : 1.0
                count(w) = 1                male : female =      4.8 : 1.0
                  has(w) = True             male : female =      4.8 : 1.0
            secondletter = 'c'              male : female =      4.4 : 1.0
                count(i) = 3                male : female =      3.7 : 1.0
                count(a) = 2              female : male   =      3.2 : 1.0
                count(o) = 2                male : female =      3.1 : 1.0
                count(e) = 3              female : male   =      3.0 : 1.0
                count(e) = 4              female : male   =      3.0 : 1.0
                count(y) = 2              female : male   =      3.0 : 1.0

In [44]:
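# Worse? Better? How can we refine?
# Let's look at the errors from this model
# and see if we can do better.
# Caution: `classifier` was trained on g_features1 (key 'last_letter'), so it knows
# none of the g_features2 keys. NLTK discards unseen features, so every guess
# falls back on the label prior. Pair each classifier with the feature function
# it was trained on (e.g. classifier_new with g_features3).
errors = []
for (name, label) in test_names:
  guess = classifier.classify(g_features2(name))
  if guess != label:
    prob = classifier.prob_classify(g_features2(name)).prob(guess)
    errors.append((label, guess, prob, name))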

In [45]:
for (label, guess, prob, name) in sorted(errors):
  print('correct={} guess={} prob={:.2f} name={}'.format(label, guess, prob, name))

correct=male guess=female prob=0.63 name=Abbey
correct=male guess=female prob=0.63 name=Abbot
correct=male guess=female prob=0.63 name=Abdullah
correct=male guess=female prob=0.63 name=Abelard
correct=male guess=female prob=0.63 name=Abner
correct=male guess=female prob=0.63 name=Abraham
correct=male guess=female prob=0.63 name=Abram
correct=male guess=female prob=0.63 name=Adam
correct=male guess=female prob=0.63 name=Adams
correct=male guess=female prob=0.63 name=Aditya
correct=male guess=female prob=0.63 name=Adlai
correct=male guess=female prob=0.63 name=Adnan
correct=male guess=female prob=0.63 name=Adolf
correct=male guess=female prob=0.63 name=Adolphe
correct=male guess=female prob=0.63 name=Adrick
correct=male guess=female prob=0.63 name=Adrien
correct=male guess=female prob=0.63 name=Agustin
correct=male guess=female prob=0.63 name=Aharon
correct=male guess=female prob=0.63 name=Ahmet
correct=male guess=female prob=0.63 name=Ajay
correct=male guess=female prob=0.63 name=Al
cor

What could we do to improve it? (Lab Assignment)

<br>
<br>
Now let's look at some bigger documents.

This may take a while to download.

In [46]:
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.


True

In [47]:
# list of tuples
# ([words], label)
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]


In [None]:
# type(documents[0])
# type(documents)
# documents[0][1] # only neg & pos

In [48]:
random.shuffle(documents)

In [49]:
# Dictionary of words and number of instances
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
len(all_words)

39768

In [50]:
all_words

FreqDist({',': 77717, 'the': 76529, '.': 65876, 'a': 38106, 'and': 35576, 'of': 34123, 'to': 31937, "'": 30585, 'is': 25195, 'in': 21822, ...})

In [51]:
# Check the frequency of ','
all_words[',']

77717

In [52]:
# Keep only the words that appear more than 5 times in the corpus
word_features = [k for k in all_words.keys() if all_words[k] > 5]

In [53]:
len(word_features)

13214
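
The frequency filter can be tried out on a tiny example. The sketch below uses `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist`; the `tokens` list and the threshold of 1 are illustrative assumptions.

```python
from collections import Counter

# Hypothetical tokenized text
tokens = "The movie was great , the plot was great".split()

# Count lowercased tokens, as done for all_words above
freq = Counter(w.lower() for w in tokens)

# Keep words above a frequency threshold (1 here, 5 in the lecture)
frequent = [w for w, n in freq.items() if n > 1]

# Alternative: keep a fixed number of the most common words
top3 = [w for w, _ in freq.most_common(3)]

print(freq)
print(frequent)
print(top3)
```

A threshold keeps the vocabulary size data-dependent, while `most_common(k)` fixes it in advance; which is preferable depends on the task.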

In [54]:
# Function to get document features
def document_features(document):
  document_words = set(document)
  features = {}
  for word in word_features:
      features['contains(%s)' % word] = (word in document_words)
  return features
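
One detail worth checking is case: the vocabulary was built from lowercased words, so a token such as `'This'` will not match `'this'` unless the document is lowercased too. The sketch below illustrates this with a hypothetical three-word vocabulary and a helper, `contains_features`, that mirrors `document_features` but takes the vocabulary as an argument.

```python
def contains_features(document, vocabulary):
    """Map each vocabulary word to True/False depending on whether the document has it."""
    document_words = set(document)  # set membership tests are fast
    return {"contains(%s)" % word: (word in document_words) for word in vocabulary}

vocab = ["horrible", "great", "this"]  # hypothetical small vocabulary (lowercase)
doc = ["This", "is", "a", "horrible", "movie"]

as_is = contains_features(doc, vocab)
lowered = contains_features([w.lower() for w in doc], vocab)

print(as_is)
print(lowered)
```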

In [55]:
document_features(['This', 'is', 'a', 'horrible', 'movie'])

{'contains(plot)': False,
 'contains(:)': False,
 'contains(two)': False,
 'contains(teen)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(to)': False,
 'contains(a)': True,
 'contains(church)': False,
 'contains(party)': False,
 'contains(,)': False,
 'contains(drink)': False,
 'contains(and)': False,
 'contains(then)': False,
 'contains(drive)': False,
 'contains(.)': False,
 'contains(they)': False,
 'contains(get)': False,
 'contains(into)': False,
 'contains(an)': False,
 'contains(accident)': False,
 'contains(one)': False,
 'contains(of)': False,
 'contains(the)': False,
 'contains(guys)': False,
 'contains(dies)': False,
 'contains(but)': False,
 'contains(his)': False,
 'contains(girlfriend)': False,
 'contains(continues)': False,
 'contains(see)': False,
 'contains(him)': False,
 'contains(in)': False,
 'contains(her)': False,
 'contains(life)': False,
 'contains(has)': False,
 'contains(nightmares)': False,
 'contains(what)': False,
 "contains(')": F

In [56]:
document_features(movie_reviews.words('pos/cv957_8737.txt'))

{'contains(plot)': True,
 'contains(:)': True,
 'contains(two)': True,
 'contains(teen)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(to)': True,
 'contains(a)': True,
 'contains(church)': False,
 'contains(party)': False,
 'contains(,)': True,
 'contains(drink)': False,
 'contains(and)': True,
 'contains(then)': True,
 'contains(drive)': False,
 'contains(.)': True,
 'contains(they)': True,
 'contains(get)': True,
 'contains(into)': True,
 'contains(an)': True,
 'contains(accident)': False,
 'contains(one)': True,
 'contains(of)': True,
 'contains(the)': True,
 'contains(guys)': False,
 'contains(dies)': False,
 'contains(but)': True,
 'contains(his)': True,
 'contains(girlfriend)': True,
 'contains(continues)': False,
 'contains(see)': False,
 'contains(him)': True,
 'contains(in)': True,
 'contains(her)': False,
 'contains(life)': False,
 'contains(has)': True,
 'contains(nightmares)': False,
 'contains(what)': True,
 "contains(')": True,
 'contains(s)': T

In [57]:
# Now we build tuples of ({features}, label)
# The first 1,000 shuffled reviews are for training, the next 500 for testing
train_docs = documents[:1000]
test_docs = documents[1000:1500]
train_set = [(document_features(d), c) for (d,c) in train_docs]
test_set = [(document_features(d), c) for (d,c) in test_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [58]:
print(nltk.classify.accuracy(classifier, test_set))

0.778


In [59]:
classifier.show_most_informative_features(10)

Most Informative Features
     contains(ludicrous) = True              neg : pos    =     15.7 : 1.0
     contains(stupidity) = True              neg : pos    =     13.1 : 1.0
         contains(anger) = True              pos : neg    =     11.4 : 1.0
        contains(shines) = True              pos : neg    =     11.4 : 1.0
      contains(poignant) = True              pos : neg    =     10.8 : 1.0
     contains(authentic) = True              pos : neg    =     10.1 : 1.0
         contains(sucks) = True              neg : pos    =      9.9 : 1.0
          contains(jedi) = True              pos : neg    =      9.4 : 1.0
   contains(mesmerizing) = True              pos : neg    =      9.4 : 1.0
     contains(testament) = True              pos : neg    =      9.4 : 1.0


In [None]:
# Copyright of the original version:

# Copyright (c) 2014 Matt Dickenson
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
