# Regular Expressions and Naive Bayes Classification

### 1. Regular Expressions

- Regular expressions are useful for extracting information from text.
- Set of “rules” to identify or match a particular sequence of characters.
- Most text in utf-8 or utf-16: letters, digits, punctuation and symbols
- In Python, mainly through library `re`

In [1]:
# Set Directory
import os
# os.getcwd()
# os.chdir('/Users/almavelazquez/Documents/GitHub/PythonCamp2024/Day06/Lecture')

In [2]:
# For regular expressions
import re 

* As a demonstration, we will work with Obama's 2008 concession speech from New Hampshire primary.
* Read in the sample text, and remember:
    * `readlines` makes a list of each line break in file

In [3]:
with open("obama-nh.txt", "r") as f:
  text = f.readlines()

* Let's take a look at how this file is structured 

In [53]:
print(text[0])
print(text[1])
print(text[2])

print(text[0:3])

I want to congratulate Senator Clinton on a hard-fought victory here in

New Hampshire.



['I want to congratulate Senator Clinton on a hard-fought victory here in\n', 'New Hampshire.\n', '\n']


* We can also join all lines into one string

In [None]:
alltext = ''.join(text) 

"I want to congratulate Senator Clinton on a hard-fought victory here in\nNew Hampshire.\n\nA few weeks ago, no one imagined that we'd have accomplished what we did\nhere tonight. For most of this campaign, we were far behind, and we\nalways knew our climb would be steep. But in record numbers, you came\nout and spoke up for change. And with your voices and your votes, you\nmade it clear that at this moment - in this election - there is\nsomething happening in America.\n\nThere is something happening when men and women in Des Moines and\nDavenport; in Lebanon and Concord come out in the snows of January to\nwait in lines that stretch block after block because they believe in\nwhat this country can be.\n\nThere is something happening when Americans who are young in age and in\nspirit - who have never before participated in politics - turn out in\nnumbers we've never seen because they know in their hearts that this\ntime must be different.\n\nThere is something happening when people vote

* What could we have done at the outset instead?

In [51]:
with open("obama-nh.txt", "r") as f:
  alltext = f.read()
  print(alltext)

I want to congratulate Senator Clinton on a hard-fought victory here in
New Hampshire.

A few weeks ago, no one imagined that we'd have accomplished what we did
here tonight. For most of this campaign, we were far behind, and we
always knew our climb would be steep. But in record numbers, you came
out and spoke up for change. And with your voices and your votes, you
made it clear that at this moment - in this election - there is
something happening in America.

There is something happening when men and women in Des Moines and
Davenport; in Lebanon and Concord come out in the snows of January to
wait in lines that stretch block after block because they believe in
what this country can be.

There is something happening when Americans who are young in age and in
spirit - who have never before participated in politics - turn out in
numbers we've never seen because they know in their hearts that this
time must be different.

There is something happening when people vote not just for the par

#### 1.1 Useful functions from the `re` module:

- `re.findall`: Return all non-overlapping matches of pattern 
            in string, as a list of strings
- `re.split`: Split string by the occurrences of pattern.
- `re.match`: Search the beginning of the string for a
          regular expression and return the first occurrence.
          Returns a match object.
- `re.search`: Like re.match, but will check all lines of the input string. (first match)
- `re.compile`: Compile a regular expression pattern into a regular 
            expression object, which can be used for matching using
            match(), search() and other methods

Source: https://docs.python.org/3/library/re.html

Let's run some examples!

* Both lines find all instances of "Yes we can"

In [7]:
# re.findall(pattern = "Yes we can", string= alltext) 
re.findall("Yes we can", alltext) 

['Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can']

* Here we find all instances of "American"

In [8]:
re.findall("American", alltext)

['American', 'American', 'American', 'American']

* ...And of all breaklines

In [9]:
re.findall("\n", alltext)[0:4] #...but only print 4

['\n', '\n', '\n', '\n']

* Example of `re.split()`

In [10]:
re.split("and", alltext)[7:12]

[" in\nspirit - who have never before participated in politics - turn out in\nnumbers we've never seen because they know in their hearts that this\ntime must be different.\n\nThere is something happening when people vote not just for the party\nthey belong to but the hopes they hold in common - that whether we are\nrich or poor; black or white; Latino or Asian; whether we hail from Iowa\nor New Hampshire, Nevada or South Carolina, we are ready to take this\ncountry in a fundamentally new direction. That is what's happening in\nAmerica right now. Change is what's happening in America.\n\nYou can be the new majority who can lead this nation out of a long\npolitical darkness - Democrats, Independents ",
 ' Republicans who are\ntired of the division ',
 ' distraction that has clouded Washington; who\nknow that we can disagree without being disagreeable; who underst',
 '\nthat if we mobilize our voices to challenge the money ',
 " influence\nthat's stood in our way "]

#### 1.2 Backslash Characters

* Regular expressions use the backslash character `\` to indicte special forms, or to allow special characters to be used without invoking their special meaning. 
* This collides with Python's usage of the same character for the same purpose in string literals 

* How do we find the literal character `\` in our file?
* First 2 will give errors

In [None]:
# re.findall("\", alltext)
# re.findall("\\", alltext)
re.findall("\\\\", alltext) 
# regrex engine sees a single backslash as an escape character, so to match a single backslash, we need to escape it with another backslash.   
# regrex engine sees \\, which means a single backslash.
# output is a string representation of a single backslash

['\\']

##### Instead of typing 4 backslashes every time we need to find one...

* Another way to address this is to use Python's *raw string notation* for regular expression patterns.
* This looks like this`r""`
* Backslashes are not handled in any special way in a string prefixed with `r`.  

##### So equivalently:

In [12]:
re.findall(r"\\", alltext)

['\\']

##### We can also see it in action here:

* Prints an actual linebreak

In [13]:
print("\n")





* Prints the character "\n"

In [14]:
print(r"\n")

\n


* Also prints the character "\n"

In [15]:
print("\\n")

\n


#### 1.3 Basic special characters

* `\d`: any decimal digit, equivalent to [0-9]

In [16]:
re.findall(r"\d", alltext) 

['9', '1', '1']

* `\D`: any character that is NOT a decimal digit, equivalent to ^[0-9]

In [None]:
# re.findall(r"\D", alltext) 

['I',
 ' ',
 'w',
 'a',
 'n',
 't',
 ' ',
 't',
 'o',
 ' ',
 'c',
 'o',
 'n',
 'g',
 'r',
 'a',
 't',
 'u',
 'l',
 'a',
 't',
 'e',
 ' ',
 'S',
 'e',
 'n',
 'a',
 't',
 'o',
 'r',
 ' ',
 'C',
 'l',
 'i',
 'n',
 't',
 'o',
 'n',
 ' ',
 'o',
 'n',
 ' ',
 'a',
 ' ',
 'h',
 'a',
 'r',
 'd',
 '-',
 'f',
 'o',
 'u',
 'g',
 'h',
 't',
 ' ',
 'v',
 'i',
 'c',
 't',
 'o',
 'r',
 'y',
 ' ',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'i',
 'n',
 '\n',
 'N',
 'e',
 'w',
 ' ',
 'H',
 'a',
 'm',
 'p',
 's',
 'h',
 'i',
 'r',
 'e',
 '.',
 '\n',
 '\n',
 'A',
 ' ',
 'f',
 'e',
 'w',
 ' ',
 'w',
 'e',
 'e',
 'k',
 's',
 ' ',
 'a',
 'g',
 'o',
 ',',
 ' ',
 'n',
 'o',
 ' ',
 'o',
 'n',
 'e',
 ' ',
 'i',
 'm',
 'a',
 'g',
 'i',
 'n',
 'e',
 'd',
 ' ',
 't',
 'h',
 'a',
 't',
 ' ',
 'w',
 'e',
 "'",
 'd',
 ' ',
 'h',
 'a',
 'v',
 'e',
 ' ',
 'a',
 'c',
 'c',
 'o',
 'm',
 'p',
 'l',
 'i',
 's',
 'h',
 'e',
 'd',
 ' ',
 'w',
 'h',
 'a',
 't',
 ' ',
 'w',
 'e',
 ' ',
 'd',
 'i',
 'd',
 '\n',
 'h',
 'e',
 'r',
 'e',
 ' ',


* `[]` can be used to indicate *a set* of characters
* Line below returns all instances of *each* of the characters in `[]`

In [None]:
re.findall("[ar]", alltext)[0:10]  # pull out all a, r

['a', 'r', 'a', 'a', 'a', 'r', 'a', 'a', 'r', 'r']

* All instances of the form "char1 to char2" in `[char1-char2]`

In [19]:
re.findall("[a-d]", alltext)[0:10] 

['a', 'c', 'a', 'a', 'a', 'a', 'a', 'd', 'c', 'a']

* Returns all characters, `^`: *except* for those of the form "char 1 to char 2" in [^char1-char2]

In [20]:
re.findall("[^a-z]", alltext)[0:20] 

['I',
 ' ',
 ' ',
 ' ',
 ' ',
 'S',
 ' ',
 'C',
 ' ',
 ' ',
 ' ',
 '-',
 ' ',
 ' ',
 ' ',
 '\n',
 'N',
 ' ',
 'H',
 '.']

* All characters and digits (alphanumeric)

In [21]:
re.findall("[a-zA-Z0-9]", alltext)[0:19]

['I',
 'w',
 'a',
 'n',
 't',
 't',
 'o',
 'c',
 'o',
 'n',
 'g',
 'r',
 'a',
 't',
 'u',
 'l',
 'a',
 't',
 'e']

* `\w`: Any alphanumeric, one word character

In [None]:
re.findall(r"\w", alltext)[0:19] # includes _

['I',
 'w',
 'a',
 'n',
 't',
 't',
 'o',
 'c',
 'o',
 'n',
 'g',
 'r',
 'a',
 't',
 'u',
 'l',
 'a',
 't',
 'e']

*  `\W`: non-alphanumeric, the inverse of `\w`

In [None]:
re.findall(r"\W", alltext)[0:15] # same as re.findall(r"[^a-zA-Z0-9_]", alltext)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '-', ' ', ' ', ' ', '\n', ' ', '.']

* `\s`: whitespace

In [24]:
re.findall(r"\s", alltext)[0:5]

[' ', ' ', ' ', ' ', ' ']

* `\S`: *non*-whitespace characters

In [25]:
re.findall(r"\S", alltext)[0:7]

['I', 'w', 'a', 'n', 't', 't', 'o']

* `.` any character (include white spaces, except a newline)

In [26]:
re.findall(".", alltext)[0:10]

['I', ' ', 'w', 'a', 'n', 't', ' ', 't', 'o', ' ']

 * `\` is an escape character (`.` has a special use)

In [27]:
re.findall(r"\.", alltext)[0:10]

['.', '.', '.', '.', '.', '.', '.', '.', '.', '.']

* `?`: Makes the preceding expression optional; match 0 or 1 repetitions of the preceding expression

In [28]:
re.findall("Am?", alltext)[0:5] # This would match A or Am where m is optional

['A', 'A', 'Am', 'Am', 'A']

* `+`: match 1 or more repetitions of the preceding expression

In [29]:
re.findall(r"\d+", alltext)
# re.findall("am+", alltext)

['9', '11']

* `*`: match 0 or more repetitions of the preceding expression

In [30]:
re.findall("am*", alltext)[0:8] # match a, am, or a followed by any number of m's 

['a', 'a', 'a', 'a', 'a', 'a', 'am', 'a']

* Get any word that starts with America

In [31]:
re.findall(r"America[a-z]*", alltext) 

['America',
 'Americans',
 'America',
 'America',
 'American',
 'Americans',
 'America',
 'America',
 'Americans',
 'America',
 'America']

* `{m}` specifies exactly m copies of the previous expression should be matched

In [32]:
# {x} exactly x times (numbers with exact number of digits)
re.findall(r"\d{2}", alltext) 

['11']

* `{m,n}` matches from m to n repetitions of the preceding expression, while attempting to match as many repetitions as possible

In [33]:
re.findall("o{2,3}", alltext) 

['oo', 'oo', 'oo', 'oo', 'oo', 'oo', 'oo', 'oo']

- There are so many more special characters
- Regex can be super powerful and complicated 
- Use parenthese to group things together when using operators like `+`, `*`, `?`, `^`

##### Short Exercise: 
How would we grab 10/10 and 19/18 as they appear in the text using `re.findall()`? 

In [34]:
x = "Hi 10/10 hello 19/18 asdf 7/6 and 1/10 or 10/1 "

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

##### Answer

In [35]:
re.findall(r"\d{2}/\d{2}", x) 

['10/10', '19/18']

#### 1.4 `re.split()`

##### Split string by the occurrences of pattern. 

In [36]:
# splits at 'or', deletes 'or'
re.split("or", alltext)[0:4]

['I want to congratulate Senat',
 ' Clinton on a hard-fought vict',
 "y here in\nNew Hampshire.\n\nA few weeks ago, no one imagined that we'd have accomplished what we did\nhere tonight. F",
 ' most of this campaign, we were far behind, and we\nalways knew our climb would be steep. But in rec']

In [None]:
re.split("America*", alltext)[0:3] # Americ, America, Americaa, ...

["I want to congratulate Senator Clinton on a hard-fought victory here in\nNew Hampshire.\n\nA few weeks ago, no one imagined that we'd have accomplished what we did\nhere tonight. For most of this campaign, we were far behind, and we\nalways knew our climb would be steep. But in record numbers, you came\nout and spoke up for change. And with your voices and your votes, you\nmade it clear that at this moment - in this election - there is\nsomething happening in ",
 '.\n\nThere is something happening when men and women in Des Moines and\nDavenport; in Lebanon and Concord come out in the snows of January to\nwait in lines that stretch block after block because they believe in\nwhat this country can be.\n\nThere is something happening when ',
 "ns who are young in age and in\nspirit - who have never before participated in politics - turn out in\nnumbers we've never seen because they know in their hearts that this\ntime must be different.\n\nThere is something happening when people vote no

#### 1.5 `re.compile()`

##### Compile a regular expression pattern into a RE object, which can then be used for matching using the `match()` and `search()` methods. 

In [38]:
keyword = re.compile("America[a-z]*")

In [39]:
# search file for keyword in line by line version
for l in text: 
    if keyword.search(l): # reuse the RE here
        print(l)

something happening in America.

There is something happening when Americans who are young in age and in

America right now. Change is what's happening in America.

Our new American majority can end the outrage of unaffordable,

working Americans who deserve it.

is a challenge that should unite America and the world against the

But in the unlikely story that is America, there has never been anything

we can't, generations of Americans have responded with a simple creed

remember that there is something happening in America; that we are not

nation; and together, we will begin the next great chapter in America's



* Create a regex object

In [40]:
pattern = re.compile(r'\d+')

In [41]:
pattern.findall(alltext) # equivalent to the earlier but longer version using RE

['9', '11']

In [57]:
pattern.split(alltext)

["I want to congratulate Senator Clinton on a hard-fought victory here in\nNew Hampshire.\n\nA few weeks ago, no one imagined that we'd have accomplished what we did\nhere tonight. For most of this campaign, we were far behind, and we\nalways knew our climb would be steep. But in record numbers, you came\nout and spoke up for change. And with your voices and your votes, you\nmade it clear that at this moment - in this election - there is\nsomething happening in America.\n\nThere is something happening when men and women in Des Moines and\nDavenport; in Lebanon and Concord come out in the snows of January to\nwait in lines that stretch block after block because they believe in\nwhat this country can be.\n\nThere is something happening when Americans who are young in age and in\nspirit - who have never before participated in politics - turn out in\nnumbers we've never seen because they know in their hearts that this\ntime must be different.\n\nThere is something happening when people vot

#### 1.6 `re.MULTILINE` or `re.M`



##### When specified, it helps to search across lines in a single string. 

In [43]:
mline = "python\nis\nfun"
print(mline)

python
is
fun


I want to search for "fun" in the third line, where it starts with an "f"

- We can use `^` to search the start of a string
- Be careful, `^` when used in `[]` means negating characters
- `$` can be used to match the end of a string

In [None]:
re.findall(r"^f\w*", mline) # ^ match only start of string, \w: word characters ([a-zA-Z0-9_]

[]

In [45]:
# re.findall("^f\w*", mline, re.M)
re.findall(r"^f\w*", mline, re.MULTILINE)

['fun']

#### Short Exercise: 

What does the following code search for? 

In [46]:
re.findall(r"^.*\.$", alltext, re.MULTILINE)[0:15]

['New Hampshire.',
 'something happening in America.',
 'what this country can be.',
 'time must be different.',
 "America right now. Change is what's happening in America.",
 'fulfill.',
 'working Americans who deserve it.',
 'can do this with our new majority.',
 'weapons; climate change and poverty; genocide and disease.',
 'ideas. And all are patriots who serve this country honorably.',
 'the people who love this country, can do to change it.',
 "That's why tonight belongs to you.",
 'believed in our improbable journey and rallied so many others to join.',
 'in the weeks to come.',
 'offering the people of this nation false hope.']

### 2. Naive Bayes Classification

* A conditional probability model that assigns probabilities for each class $k$ to an observation based on its $n$ features or $p(C_k|x_1, ...x_n)$.
* The central assumption: all $n$ features are independent of each other, conditional on the class/category $C_k$.
* Algorithm relies on Bayes' Theorem + a decision rule $$p(C_k|\mathbf{x}) = \dfrac{p(C_k)p(\mathbf{x}|C_k)}{p(\mathbf{x})}$$

Read more about it: https://en.wikipedia.org/wiki/Naive_Bayes_classifier

##### Why/When Naive Bayes

* Great first classifier to try, relatively fast, requires less data than other classifiers, and can be very accurate provided assumptions hold
* Popular in text classification problems

More resources: https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/

#### 2.1 Installation and Import Libraries

In [47]:
# !pip3 install nltk

In [None]:
import nltk
nltk.download('names') 
from nltk.corpus import names
import random

[nltk_data] Downloading package names to /Users/riverjeon/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.


Docs for this library: https://www.nltk.org/api/nltk.classify.naivebayes.html

* Create a list of tuples with names

In [None]:
names = ([(name, 'male') for name in names.words('male.txt')] +
        [(name, 'female') for name in names.words('female.txt')])

# .words(fileid) reads the given file (male.txt or female.txt) and returns a Python list of words

In [62]:
names[0:20]

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male'),
 ('Abdullah', 'male'),
 ('Abe', 'male'),
 ('Abel', 'male'),
 ('Abelard', 'male'),
 ('Abner', 'male'),
 ('Abraham', 'male'),
 ('Abram', 'male'),
 ('Ace', 'male'),
 ('Adair', 'male'),
 ('Adam', 'male')]

##### Now, we shuffle

In [63]:
random.shuffle(names)
names[0:5]

[('Del', 'female'),
 ('Lorrin', 'female'),
 ('Daren', 'male'),
 ('Sky', 'male'),
 ('Adger', 'male')]

#### 2.2 Split Training and Test Sets

In [None]:
len(names) # N total observations
# print(names)

[('Del', 'female'), ('Lorrin', 'female'), ('Daren', 'male'), ('Sky', 'male'), ('Adger', 'male'), ('Dafna', 'female'), ('William', 'male'), ('Edsel', 'male'), ('Iris', 'female'), ('Belle', 'female'), ('Keene', 'male'), ('Alaine', 'female'), ('Marcelle', 'female'), ('Kathy', 'female'), ('Hansel', 'male'), ('Austin', 'male'), ('Temp', 'male'), ('Gretal', 'female'), ('Gwennie', 'female'), ('Seth', 'male'), ('Melisent', 'female'), ('Ayn', 'female'), ('Christabelle', 'female'), ('Emalia', 'female'), ('Matilda', 'female'), ('Skipton', 'male'), ('Allyce', 'female'), ('Tammy', 'male'), ('Codie', 'female'), ('Hoyt', 'male'), ('Tessie', 'female'), ('Orella', 'female'), ('Tera', 'female'), ('Malia', 'female'), ('Pearl', 'female'), ('Rafa', 'female'), ('Carilyn', 'female'), ('Von', 'male'), ('Erastus', 'male'), ('Cass', 'male'), ('Lindy', 'male'), ('Karina', 'female'), ('Mersey', 'female'), ('Dorelle', 'female'), ('Blayne', 'male'), ('Rodie', 'female'), ('Lester', 'male'), ('Troy', 'male'), ('Steph

* Define training and test set sizes

In [65]:
train_size = 5000

* Split train and test objects

In [66]:
train_names = names[:train_size]
test_names = names[train_size:]

#### 2.3 Define Features

* A simple feature: get the last letter of the name

In [67]:
def g_features1(name):
  return {'last_letter': name[-1]}

Tips: Python functions can return multiple values

In [68]:
# Quick break — some syntax:
def return_two():
  return 5, 10

# When a method returns two values, we can use this format: 
x, y = return_two()
x, y

(5, 10)

#### 2.4 Data Preparation

Loop over names, and return tuple of dictionary and label

In [274]:
def g_features1(name):
  return {'last_letter': name[-1]}

# print(train_names[:3]) # [('Del', 'female'), ('Lorrin', 'female'), ('Daren', 'male')]

train_set = [(g_features1(n), g) for (n, g) in train_names] # predicting based on the last letter of the name
test_set = [(g_features1(n), g) for (n,g) in test_names]

train_set[0]

({'last_letter': 'l'}, 'female')

#### 2.5 Train the Classifier

##### Run the naive Bayes classifier for the train set

In [275]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

#### 2.6 Test your Classifier

* Apply the classifier to some names

In [276]:
classifier.classify(g_features1('Soyeon'))

'male'

In [277]:
classifier.classify(g_features1('Leticia'))

'female'

In [278]:
classifier.classify(g_features1('Jacob'))

'male'

* Get probabilities

In [279]:
classifier.prob_classify(g_features1('Soyeon')).prob("female")

0.4556629520076205

In [280]:
classifier.prob_classify(g_features1('Leticia')).prob("male")

0.019007603449529075

##### We can check the overall accuracy with our test set. 

More on accuracy, F1, precision, recall: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

In [281]:
print(nltk.classify.accuracy(classifier, test_set))

0.7595108695652174


#### 2.7 Feature Attribution

* Lets see what is driving this

In [282]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'k'              male : female =     30.5 : 1.0
             last_letter = 'a'            female : male   =     30.1 : 1.0
             last_letter = 'f'              male : female =     17.7 : 1.0
             last_letter = 'p'              male : female =     13.1 : 1.0
             last_letter = 'm'              male : female =     10.6 : 1.0


Let's be smarter and add more features!

In [333]:
# What all are we including now?
def g_features2(name):
  features = {}
  features["firstletter"] = name[0].lower()
  features["lastletter"] = name[-1].lower()
  for letter in 'abcdefghijklmnopqrstuvwxyz':
      features["count(%s)" % letter] = name.lower().count(letter)
      features["has(%s)" % letter] = (letter in name.lower())
  return features

In [334]:
g_features2('Soyeon')

{'firstletter': 's',
 'lastletter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 1,
 'has(e)': True,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 0,
 'has(h)': False,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 0,
 'has(j)': False,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 2,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 1,
 'has(s)': True,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 1,
 'has(y)': True,
 'count(z)': 0,
 'has(z)': False}

* Run for train set

In [335]:
train_set = [(g_features2(n), g) for (n,g) in train_names]

* Run for test set

In [336]:
test_set = [(g_features2(n), g) for (n,g) in test_names]

* Run new classifier

In [337]:
classifier_new = nltk.NaiveBayesClassifier.train(train_set)

* Check the overall accuracy with test set

In [338]:
print(nltk.classify.accuracy(classifier_new, test_set))

0.7707201086956522


* Lets see what is driving this

In [339]:
classifier_new.show_most_informative_features(20)

Most Informative Features
              lastletter = 'k'              male : female =     30.5 : 1.0
              lastletter = 'a'            female : male   =     30.1 : 1.0
              lastletter = 'f'              male : female =     17.7 : 1.0
              lastletter = 'p'              male : female =     13.1 : 1.0
              lastletter = 'm'              male : female =     10.6 : 1.0
              lastletter = 'o'              male : female =      9.8 : 1.0
              lastletter = 'd'              male : female =      8.3 : 1.0
              lastletter = 'r'              male : female =      6.8 : 1.0
              lastletter = 'v'              male : female =      6.5 : 1.0
              lastletter = 'b'              male : female =      6.3 : 1.0
                count(v) = 2              female : male   =      6.0 : 1.0
              lastletter = 'u'              male : female =      5.8 : 1.0
              lastletter = 'g'              male : female =      5.4 : 1.0

In [None]:
classifier_new.prob_classify(g_features2('Soyeon')).prob("female") # haha

0.07012806399520419

* Worse? Better? How can we refine?
* Lets look at the errors from this model and see if we can do better

In [None]:
errors = []
for (name, label) in test_names:
  guess = classifier.classify(g_features2(name)) # predicts male, female
  if guess != label:
    prob = classifier.prob_classify(g_features2(name)).prob(guess)
    # prob_classify returns a probability distribution over the classes, e.g., {'male': 0.85, 'female': 0.15}
    errors.append((label, guess, prob, name))

In [361]:
for (label, guess, prob, name) in sorted(errors):
  print('correct={} guess={} prob={:.2f} name={}'.format(label, guess, prob, name))

correct=female guess=neg prob=0.51 name=Abbe
correct=female guess=neg prob=0.51 name=Abbie
correct=female guess=neg prob=0.51 name=Abigail
correct=female guess=neg prob=0.51 name=Abigale
correct=female guess=neg prob=0.51 name=Adah
correct=female guess=neg prob=0.51 name=Addie
correct=female guess=neg prob=0.51 name=Adela
correct=female guess=neg prob=0.51 name=Adelaide
correct=female guess=neg prob=0.51 name=Adele
correct=female guess=neg prob=0.51 name=Adelind
correct=female guess=neg prob=0.51 name=Adelle
correct=female guess=neg prob=0.51 name=Adey
correct=female guess=neg prob=0.51 name=Adiana
correct=female guess=neg prob=0.51 name=Adora
correct=female guess=neg prob=0.51 name=Adorne
correct=female guess=neg prob=0.51 name=Adrea
correct=female guess=neg prob=0.51 name=Adrian
correct=female guess=neg prob=0.51 name=Adrianne
correct=female guess=neg prob=0.51 name=Adrien
correct=female guess=neg prob=0.51 name=Adriena
correct=female guess=neg prob=0.51 name=Adrienne
correct=female 

What could we do to improve it? (Lab Assignment)

##### Now lets look at some bigger documents
* This may take a while to download.

In [362]:
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/riverjeon/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [None]:
# list of tuples
# ([words], label)
documents = [(list(movie_reviews.words(fileid)), category) # movie_reviews.words('neg/cv000_29416.txt'), returns the tokenized words of the review file
              for category in movie_reviews.categories() # categories are 'pos' and 'neg'
              for fileid in movie_reviews.fileids(category)]  # For each category, it loops over all the file IDs belonging to that category.

print(movie_reviews.fileids('pos')[:3])


['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt']


In [384]:
print(type(documents))
print(documents[0])

print(type(documents[0]))
# documents[0][0]
documents[0][1] # only neg & pos


<class 'list'>
(['capsule', ':', 'five', 'friends', 'at', 'a', 'stag', 'party', 'are', 'involved', 'in', 'the', 'accidental', 'killing', 'of', 'a', 'prostitute', '.', 'the', 'cover', '-', 'up', 'attempt', 'becomes', 'a', 'monster', 'that', 'eats', 'up', 'the', 'friends', ',', 'two', 'wives', 'and', 'several', 'innocent', 'bystanders', '.', 'this', 'was', 'a', 'real', 'audience', 'pleaser', 'at', 'toronto', ',', 'but', 'it', 'did', 'not', 'do', 'much', 'for', 'me', '.', ',', 'low', '0', '(', '-', '4', 'to', '+', '4', ')', '-', 'directed', 'by', 'peter', 'berg', 'who', 'acted', 'in', 'the', 'last', 'seduction', 'and', 'copland', '.', '-', 'five', 'buddies', 'go', 'on', 'a', 'stag', 'outing', 'to', 'las', 'vegas', 'while', 'cameron', 'diaz', 'works', 'through', 'the', 'logistics', 'of', 'her', 'upcoming', 'wedding', 'to', 'one', 'of', 'them', '.', 'one', 'of', 'the', 'buddies', 'accidentally', 'kills', 'a', 'prostitute', '.', '-', 'several', 'people', 'with', 'no', 'moral', 'compass', '.'

'neg'

In [365]:
random.shuffle(documents)

* Dictionary of words and number of instances

In [366]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
len(all_words)

39768

In [367]:
all_words

FreqDist({',': 77717, 'the': 76529, '.': 65876, 'a': 38106, 'and': 35576, 'of': 34123, 'to': 31937, "'": 30585, 'is': 25195, 'in': 21822, ...})

* Check the frequency of `,`

In [368]:
all_words[',']

77717

In [387]:
word_features = [k for k in all_words.keys() if all_words[k] > 5]

In [370]:
len(word_features)

13214

* Define function to get document features

In [None]:
def document_features(document):
  document_words = set(document)
  features = {}
  for word in word_features: # word_features is a list of words that appear more than 5 times in the corpus
      features['contains(%s)' % word] = (word in document_words)
  return features

In [388]:
document_features(['This', 'is', 'a', 'horrible', 'movie'])

{'contains(plot)': False,
 'contains(:)': False,
 'contains(two)': False,
 'contains(teen)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(to)': False,
 'contains(a)': True,
 'contains(church)': False,
 'contains(party)': False,
 'contains(,)': False,
 'contains(drink)': False,
 'contains(and)': False,
 'contains(then)': False,
 'contains(drive)': False,
 'contains(.)': False,
 'contains(they)': False,
 'contains(get)': False,
 'contains(into)': False,
 'contains(an)': False,
 'contains(accident)': False,
 'contains(one)': False,
 'contains(of)': False,
 'contains(the)': False,
 'contains(guys)': False,
 'contains(dies)': False,
 'contains(but)': False,
 'contains(his)': False,
 'contains(girlfriend)': False,
 'contains(continues)': False,
 'contains(see)': False,
 'contains(him)': False,
 'contains(in)': False,
 'contains(her)': False,
 'contains(life)': False,
 'contains(has)': False,
 'contains(nightmares)': False,
 'contains(what)': False,
 "contains(')": F

In [389]:
document_features(movie_reviews.words('pos/cv957_8737.txt'))

{'contains(plot)': True,
 'contains(:)': True,
 'contains(two)': True,
 'contains(teen)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(to)': True,
 'contains(a)': True,
 'contains(church)': False,
 'contains(party)': False,
 'contains(,)': True,
 'contains(drink)': False,
 'contains(and)': True,
 'contains(then)': True,
 'contains(drive)': False,
 'contains(.)': True,
 'contains(they)': True,
 'contains(get)': True,
 'contains(into)': True,
 'contains(an)': True,
 'contains(accident)': False,
 'contains(one)': True,
 'contains(of)': True,
 'contains(the)': True,
 'contains(guys)': False,
 'contains(dies)': False,
 'contains(but)': True,
 'contains(his)': True,
 'contains(girlfriend)': True,
 'contains(continues)': False,
 'contains(see)': False,
 'contains(him)': True,
 'contains(in)': True,
 'contains(her)': False,
 'contains(life)': False,
 'contains(has)': True,
 'contains(nightmares)': False,
 'contains(what)': True,
 "contains(')": True,
 'contains(s)': T

* Now we have tuple of `({features}, label)`

In [374]:
train_docs = documents[:1000]
test_docs = documents[1000:1500]
train_set = [(document_features(d), c) for (d,c) in train_docs]
test_set = [(document_features(d), c) for (d,c) in test_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [375]:
print(nltk.classify.accuracy(classifier, test_set))

0.762


In [376]:
classifier.show_most_informative_features(10)

Most Informative Features
   contains(outstanding) = True              pos : neg    =     19.7 : 1.0
        contains(sloppy) = True              neg : pos    =     12.4 : 1.0
      contains(lifeless) = True              neg : pos    =     11.0 : 1.0
        contains(asleep) = True              neg : pos    =     10.3 : 1.0
      contains(chilling) = True              pos : neg    =      9.7 : 1.0
     contains(acclaimed) = True              pos : neg    =      9.1 : 1.0
         contains(legal) = True              pos : neg    =      9.1 : 1.0
        contains(smooth) = True              pos : neg    =      9.1 : 1.0
           contains(era) = True              pos : neg    =      8.7 : 1.0
      contains(flawless) = True              pos : neg    =      8.4 : 1.0


In [377]:
# Copyright of the original version:

# Copyright (c) 2014 Matt Dickenson
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
