# Preprocessing and Classification
<ul><li>Preview</li>
<li>Review</li>
<li>Import Corpus</li>
<li>Pre-Processing</li>
<ul><li>Tokenise</li>
<li>Stop Words</li>
<li>Lemmatise</li>
<li>Most Frequent Terms</li></ul>
<li>Classification</li>
<ul><li>Featurise</li>
<li>Training</li>
<li>Prediction</li>
<li>Extra: Cross-Validation</li></ul>
<li>Literary Distinction</li>
</ul>

# 0. Preview

Run the three cells below for a preview of the rest of the module. You do not need to understand everything yet; the aim is just to show you how easy it is to create a machine learning model!

In [1]:
# Install and download required packages and data
# (newer NLTK releases may also need 'punkt_tab' for word_tokenize)
!pip install nltk
import nltk
nltk.download(['punkt','wordnet', 'stopwords','omw-1.4'])



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alexb\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\alexb\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alexb\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\alexb\AppData\Roaming\nltk_data...


True

In [2]:
# Get texts of interest that belong to identifiably different categories
unladen_swallow = 'high air-speed velocity'
swallow_grasping_coconut = 'low air-speed velocity'

# Transform them into the format NLTK expects
unladen_features_tagged = ({'high':True, 'air-speed': True, 'velocity': True}, 'unladen')
coconut_features_tagged = ({'low':True, 'air-speed': True, 'velocity': True}, 'coconut')

# Train a classifier to learn distinguishing features
classifier = nltk.NaiveBayesClassifier.train([unladen_features_tagged, coconut_features_tagged])

In [3]:
# It's a simple question of weight ratios!
# A five ounce bird could not carry a one pound coconut. 

unknown_swallow = "high velocity"
unknown_features = { 'high': True, 'velocity':True}

classifier.classify(unknown_features)

'unladen'

# 1. Review

The cells below are a quick reminder of the previous module, plus some basic text manipulation methods:

In [4]:
# Read Renfrew and Bahn
rb_text = open('data/r_and_b_sample.txt',encoding="utf-8").read()

In [5]:
# Inspect the text
rb_text

'Where?\nSurvey and Excavation of Sites\nand Features\nA major characteristic of archaeology is that it commonly\ndraws evidence from fieldwork, that is, from\nsurvey and excavation. Traditionally, fieldwork was seen\nalmost exclusively in terms of the discovery and uncovering\nof sites. Today, archaeologists are more aware of\nthe high cost and destructiveness of excavation, and so\nwhile sites and their excavation remain of paramount\nimportance, site surface survey and subsurface detection\nusing non-destructive remote sensing devices\nare now widely used. Indeed, some archaeological\nprojects may not include excavation at all, and instead\nfocus on various forms of survey. The use of non-invasive\nsurvey not only leaves the archaeology undisturbed but\nalso helps to answer new questions. For example, the\nstudy of entire landscapes by regional survey is now a\nmajor part of archaeological fieldwork.\nIn archaeological fieldwork, we may distinguish\nbetween the methods used in the d

In [6]:
# Make the text lower case
rb_lower = rb_text.lower()

In [7]:
# Tokenise
from nltk import word_tokenize
rb_tokens = word_tokenize(rb_lower)

In [8]:
# Check out the tokens
rb_tokens

['where',
 '?',
 'survey',
 'and',
 'excavation',
 'of',
 'sites',
 'and',
 'features',
 'a',
 'major',
 'characteristic',
 'of',
 'archaeology',
 'is',
 'that',
 'it',
 'commonly',
 'draws',
 'evidence',
 'from',
 'fieldwork',
 ',',
 'that',
 'is',
 ',',
 'from',
 'survey',
 'and',
 'excavation',
 '.',
 'traditionally',
 ',',
 'fieldwork',
 'was',
 'seen',
 'almost',
 'exclusively',
 'in',
 'terms',
 'of',
 'the',
 'discovery',
 'and',
 'uncovering',
 'of',
 'sites',
 '.',
 'today',
 ',',
 'archaeologists',
 'are',
 'more',
 'aware',
 'of',
 'the',
 'high',
 'cost',
 'and',
 'destructiveness',
 'of',
 'excavation',
 ',',
 'and',
 'so',
 'while',
 'sites',
 'and',
 'their',
 'excavation',
 'remain',
 'of',
 'paramount',
 'importance',
 ',',
 'site',
 'surface',
 'survey',
 'and',
 'subsurface',
 'detection',
 'using',
 'non-destructive',
 'remote',
 'sensing',
 'devices',
 'are',
 'now',
 'widely',
 'used',
 '.',
 'indeed',
 ',',
 'some',
 'archaeological',
 'projects',
 'may',
 'not',

In [9]:
# Just how long is this text?
len(rb_tokens)

12179

In [11]:
# Create a dictionary that counts token frequencies
from collections import Counter
rb_dict = Counter(rb_tokens)

In [12]:
# Dictionaries pair keys with values
rb_dict

Counter({'where': 19,
         '?': 6,
         'survey': 72,
         'and': 358,
         'excavation': 41,
         'of': 455,
         'sites': 65,
         'features': 36,
         'a': 231,
         'major': 8,
         'characteristic': 1,
         'archaeology': 21,
         'is': 114,
         'that': 101,
         'it': 63,
         'commonly': 3,
         'draws': 1,
         'evidence': 11,
         'from': 72,
         'fieldwork': 8,
         ',': 612,
         '.': 425,
         'traditionally': 3,
         'was': 27,
         'seen': 2,
         'almost': 5,
         'exclusively': 1,
         'in': 271,
         'terms': 2,
         'the': 583,
         'discovery': 10,
         'uncovering': 3,
         'today': 6,
         'archaeologists': 26,
         'are': 96,
         'more': 27,
         'aware': 1,
         'high': 6,
         'cost': 3,
         'destructiveness': 1,
         'so': 16,
         'while': 6,
         'their': 24,
         'remain': 2,
         

In [13]:
# Report the ten most common tokens in the text
rb_dict.most_common(10)

[(',', 612),
 ('the', 583),
 ('of', 455),
 ('.', 425),
 ('and', 358),
 ('in', 271),
 ('to', 255),
 ('a', 231),
 ('(', 140),
 (')', 140)]

In [15]:
# Get the frequency of a specific word
rb_dict['archaeology']

21

# 2. Import Corpus

## Operating System Interface!

Up to this point, we have only worked with one text at a time. It is simple enough to read in a single plaintext file, but we often find ourselves with many files residing in a folder on our hard drive. In order to access these, we will have to instruct Python to navigate our computer and access the files sequentially.

Even though it sounds simple, this is the moment your computer ceases to be an appliance and becomes a tool. The <i>os</i> package allows Python to communicate with the rest of your computer's systems and file storage. You now have access to any file on your computer and can manipulate it using the code you have learned so far. With great power comes great responsibility!

For now, we will look at just one function from <i>os</i> that will return a list of the files in a given directory.

In [17]:
import os

In [18]:
# Report the files in the current folder
os.listdir()

['.ipynb_checkpoints',
 '01-sequence-labeling.ipynb',
 '02-preprocessing-and-classification.ipynb',
 'data']

In [19]:
# Follow one of the reported folders
os.listdir('data/movie_reviews')

['ebert', 'negative', 'positive']

In [20]:
# And follow deeper
os.listdir('data/movie_reviews/negative')

['cv000_29416.txt',
 'cv001_19502.txt',
 'cv002_17424.txt',
 'cv003_12683.txt',
 'cv004_12641.txt',
 'cv005_29357.txt',
 'cv006_17022.txt',
 'cv007_4992.txt',
 'cv008_29326.txt',
 'cv009_29417.txt',
 'cv010_29063.txt',
 'cv011_13044.txt',
 'cv012_29411.txt',
 'cv013_10494.txt',
 'cv014_15600.txt',
 'cv015_29356.txt',
 'cv016_4348.txt',
 'cv017_23487.txt',
 'cv018_21672.txt',
 'cv019_16117.txt',
 'cv020_9234.txt',
 'cv021_17313.txt',
 'cv022_14227.txt',
 'cv023_13847.txt',
 'cv024_7033.txt',
 'cv025_29825.txt',
 'cv026_29229.txt',
 'cv027_26270.txt',
 'cv028_26964.txt',
 'cv029_19943.txt',
 'cv030_22893.txt',
 'cv031_19540.txt',
 'cv032_23718.txt',
 'cv033_25680.txt',
 'cv034_29446.txt',
 'cv035_3343.txt',
 'cv036_18385.txt',
 'cv037_19798.txt',
 'cv038_9781.txt',
 'cv039_5963.txt',
 'cv040_8829.txt',
 'cv041_22364.txt',
 'cv042_11927.txt',
 'cv043_16808.txt',
 'cv044_18429.txt',
 'cv045_25077.txt',
 'cv046_10613.txt',
 'cv047_18725.txt',
 'cv048_18380.txt',
 'cv049_21917.txt',
 'cv050_

In [21]:
# Assign that list to a variable
negative_files = os.listdir('data/movie_reviews/negative')

In [22]:
# Inspect first element in the list
negative_files[0]

'cv000_29416.txt'

In [23]:
## EXERCISE. How many reviews are there in the 'positive' folder?
##           How many in the 'negative' one?

negative_path = "data/movie_reviews/negative/"
positive_path = "data/movie_reviews/positive/"

## CHALLENGE: Find a list of files and folders on your desktop.

## Corpus

Unfortunately, freely available text collections in the archaeology domain are rare. Instead, we will apply text mining methods to a toy corpus distributed with NLTK: movie reviews. These have been divided into positive and negative categories, with one thousand of each.

In essence, our task will be to learn the vocabulary of positive and negative reviews.

In [24]:
# Open the first file from 'negative_files'
open('data/movie_reviews/negative/cv000_29416.txt').read()

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience membe

In [25]:
# When opening others, filenames change but the path doesn't!
negative_path = "data/movie_reviews/negative/"
open(negative_path+'cv000_29416.txt').read()

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience membe

In [26]:
# Read all files and assign to a variable
negative_reviews = [open(negative_path+name,'r').read() for name in negative_files]

In [None]:
# NOTE: If you are using macOS, your operating system may sometimes
# include hidden files (such as .DS_Store) in your folders that confuse Python.

# If you get an error while running the above line, try including an 'if' condition
# in your list comprehension to prevent Python from tripping over these.

# For example:
negative_reviews = [open(negative_path+name,'r').read() for name in negative_files if name.endswith('.txt')]
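A slightly more defensive version of the reading step can be sketched as follows. This is a minimal sketch rather than the notebook's own method: the helper name `read_txt_files` is hypothetical, and it assumes every review is a UTF-8 (or plain ASCII) `.txt` file. It sorts the filenames, because `os.listdir` returns them in arbitrary order, and it uses a `with` block so that each file is closed after it has been read.

```python
import os
import tempfile

def read_txt_files(folder):
    """Return the contents of every .txt file in 'folder', in filename order."""
    texts = []
    # os.listdir gives no guaranteed order, so sort for reproducibility
    for name in sorted(os.listdir(folder)):
        # Skip hidden files such as .DS_Store and anything that is not a .txt file
        if name.startswith('.') or not name.endswith('.txt'):
            continue
        # os.path.join builds the path with the right separator for the OS
        with open(os.path.join(folder, name), encoding='utf-8') as f:
            texts.append(f.read())
    return texts

# Demonstration on a throwaway folder (a stand-in for data/movie_reviews/negative/)
with tempfile.TemporaryDirectory() as demo_folder:
    demo_files = [('b.txt', 'second review'), ('a.txt', 'first review'), ('.DS_Store', 'junk')]
    for name, text in demo_files:
        with open(os.path.join(demo_folder, name), 'w', encoding='utf-8') as f:
            f.write(text)
    demo_reviews = read_txt_files(demo_folder)

print(demo_reviews)
```

With the real corpus, the call would presumably be `read_txt_files(negative_path)`.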

In [27]:
# Inspect

negative_reviews[0]

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience membe

In [28]:
# Repeat process for positive reviews

positive_path = 'data/movie_reviews/positive/'
positive_files = os.listdir(positive_path)
positive_reviews = [open(positive_path+name,'r').read() for name in positive_files]

In [29]:
# Inspect first element in list
positive_reviews[0]

'films adapted from comic books have had plenty of success , whether they\'re about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there\'s never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid \'80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . \nin other words , don\'t dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell\'s directors , albert and allen hughes . \ngetting the hughes brothers to direct this seem

In [None]:
## EXERCISE. How long is the list of positive movie reviews? Negative reviews?
##           Do these match the number of files you had observed in the folders earlier?

# 3. Pre-Processing

The pre-processing phase of our workflow transforms the strings that Python has read from the plaintext files into useful sets of features. Not only will we tokenise the texts, as we have previously, but we will perform three further steps described in the lectures: <i>stop word</i> removal, <i>lemmatisation</i> of nouns, and low-frequency word removal.

Although pre-processing feels like a nitty-gritty task, it is important to recognise that how we pre-process our texts depends on the questions we are trying to answer. Not every project lemmatises or stems its vocabulary. We can easily imagine research questions for which the grammatical functions indicated by word endings would be useful.
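The idea that each pre-processing step is a choice can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: the function `preprocess`, the three-word stop list, and the crude `strip_plural_s` helper, which merely stands in for a real lemmatiser. Each step is applied only if the caller asks for it.

```python
def preprocess(tokens, stop_words=None, lemmatise=None):
    """Apply optional pre-processing steps to a list of tokens."""
    if stop_words is not None:
        # Drop any token found in the stop word collection
        tokens = [token for token in tokens if token not in stop_words]
    if lemmatise is not None:
        # Apply the supplied function to every remaining token
        tokens = [lemmatise(token) for token in tokens]
    return tokens

# Hypothetical toy inputs: a tiny stop word set and a crude plural-stripper
toy_stop_words = {'the', 'of', 'and'}

def strip_plural_s(token):
    # Stand-in for a real lemmatiser; it only removes a trailing 's'
    return token[:-1] if token.endswith('s') else token

toy_tokens = ['the', 'survey', 'of', 'sites', 'and', 'features']

no_steps = preprocess(toy_tokens)
stops_only = preprocess(toy_tokens, stop_words=toy_stop_words)
both_steps = preprocess(toy_tokens, stop_words=toy_stop_words, lemmatise=strip_plural_s)

print(no_steps)
print(stops_only)
print(both_steps)
```

A project interested in word endings could simply leave out the `lemmatise` argument and keep the inflected forms.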

## Tokenise

In [30]:
from nltk import word_tokenize

In [31]:
# Tokenise first negative review

word_tokenize(negative_reviews[0])

['plot',
 ':',
 'two',
 'teen',
 'couples',
 'go',
 'to',
 'a',
 'church',
 'party',
 ',',
 'drink',
 'and',
 'then',
 'drive',
 '.',
 'they',
 'get',
 'into',
 'an',
 'accident',
 '.',
 'one',
 'of',
 'the',
 'guys',
 'dies',
 ',',
 'but',
 'his',
 'girlfriend',
 'continues',
 'to',
 'see',
 'him',
 'in',
 'her',
 'life',
 ',',
 'and',
 'has',
 'nightmares',
 '.',
 'what',
 "'s",
 'the',
 'deal',
 '?',
 'watch',
 'the',
 'movie',
 'and',
 '``',
 'sorta',
 '``',
 'find',
 'out',
 '.',
 '.',
 '.',
 'critique',
 ':',
 'a',
 'mind-fuck',
 'movie',
 'for',
 'the',
 'teen',
 'generation',
 'that',
 'touches',
 'on',
 'a',
 'very',
 'cool',
 'idea',
 ',',
 'but',
 'presents',
 'it',
 'in',
 'a',
 'very',
 'bad',
 'package',
 '.',
 'which',
 'is',
 'what',
 'makes',
 'this',
 'review',
 'an',
 'even',
 'harder',
 'one',
 'to',
 'write',
 ',',
 'since',
 'i',
 'generally',
 'applaud',
 'films',
 'which',
 'attempt',
 'to',
 'break',
 'the',
 'mold',
 ',',
 'mess',
 'with',
 'your',
 'head',
 '

In [32]:
# Tokenise our sets of reviews; tokens remain grouped by review

negative_tokenized = [word_tokenize(review) for review in negative_reviews]
positive_tokenized = [word_tokenize(review) for review in positive_reviews]

In [33]:
## Q.  The texts in the movie review corpus are already in lower case, however many
##     texts found in the wild are not. How would you change the code in the cell above
##     to produce a corpus of tokenized and lower-case text?

## EXERCISE. How many tokens are there in the first negative movie review? positive?

## CHALLENGE: How many tokens are there on average in each negative movie review?

## Stop Words

Stop words, sometimes referred to as <i>function words</i>, include articles, prepositions, pronouns, and conjunctions, among others. Although their frequencies encode information about textual features like authorship, they carry little semantic meaning and are often removed before analysis.

In [34]:
# Import our list of stop words

from nltk.corpus import stopwords

In [35]:
# Pull up NLTK's list of English-language stop words

stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [36]:
# How many stop words are in the list?

len(stopwords.words('english'))

179

In [38]:
# NLTK has stopwords for many Western languages

stopwords.words('dutch')

['de',
 'en',
 'van',
 'ik',
 'te',
 'dat',
 'die',
 'in',
 'een',
 'hij',
 'het',
 'niet',
 'zijn',
 'is',
 'was',
 'op',
 'aan',
 'met',
 'als',
 'voor',
 'had',
 'er',
 'maar',
 'om',
 'hem',
 'dan',
 'zou',
 'of',
 'wat',
 'mijn',
 'men',
 'dit',
 'zo',
 'door',
 'over',
 'ze',
 'zich',
 'bij',
 'ook',
 'tot',
 'je',
 'mij',
 'uit',
 'der',
 'daar',
 'haar',
 'naar',
 'heb',
 'hoe',
 'heeft',
 'hebben',
 'deze',
 'u',
 'want',
 'nog',
 'zal',
 'me',
 'zij',
 'nu',
 'ge',
 'geen',
 'omdat',
 'iets',
 'worden',
 'toch',
 'al',
 'waren',
 'veel',
 'meer',
 'doen',
 'toen',
 'moet',
 'ben',
 'zonder',
 'kan',
 'hun',
 'dus',
 'alles',
 'onder',
 'ja',
 'eens',
 'hier',
 'wie',
 'werd',
 'altijd',
 'doch',
 'wordt',
 'wezen',
 'kunnen',
 'ons',
 'zelf',
 'tegen',
 'na',
 'reeds',
 'wil',
 'kon',
 'niets',
 'uw',
 'iemand',
 'geweest',
 'andere']

In [39]:
tokenized_sentence = ['what', 'is', 'the', 'air-speed', 'velocity', 'of', 'an', 'unladen', 'swallow']

In [40]:
# Remove stop words from the tokenised sentence

for word in tokenized_sentence:
    if word not in stopwords.words('english'):
        print(word)

air-speed
velocity
unladen
swallow


In [41]:
# But what if we have more than one review at a time?

two_sentences = [['what', 'is', 'the', 'air-speed', 'velocity', 'of', 'an', 'unladen', 'swallow'],\
               ['what', 'do', 'you', 'mean', 'african', 'or', 'european']]

In [42]:
# Use a nested for-loop!

for sentence in two_sentences:
    for word in sentence:
        if word not in stopwords.words('english'):
            print(word)

air-speed
velocity
unladen
swallow
mean
african
european


In [43]:
# The example sentences were short so it's hard to tell, but looking up
# whether a word is 'in' a list takes a pretty long time

# We can improve the speed of our task by converting the list to a set!

stopword_set = set(stopwords.words('english'))
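The speed difference between a list and a set can be checked with a rough timing sketch. The vocabulary below is made up (10,000 invented "words"), and the exact timings will vary from machine to machine, so treat the numbers as indicative only.

```python
import timeit

# Hypothetical stand-in vocabulary: 10,000 made-up "words"
word_list = ['word' + str(i) for i in range(10000)]
word_set = set(word_list)

# A missing word is the worst case for a list: every item must be checked
target = 'not-in-vocabulary'

# Time 200 membership checks against each container
list_seconds = timeit.timeit(lambda: target in word_list, number=200)
set_seconds = timeit.timeit(lambda: target in word_set, number=200)

print('list:', list_seconds)
print('set: ', set_seconds)
```

A list is searched item by item, whereas a set uses hashing to jump almost directly to the answer, which is why the gap grows with the size of the vocabulary.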

In [44]:
# Inspect

stopword_set

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [45]:
# And we can remove the stop words from our movie reviews

negative_no_stops = [[word for word in review if word not in stopword_set] for review in negative_tokenized]
positive_no_stops = [[word for word in review if word not in stopword_set] for review in positive_tokenized]

In [None]:
## Q.  Stop words are typically the most frequent words in a language, yet do not convey semantic meaning.
##     Does this make sense based on the words in NLTK's list of English stop words?
##     What about other languages with which you are familiar?

stopword_languages = ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian',\
                      'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']

## EXERCISE. How many tokens did we remove from the first negative review in total? What percent were removed?

## CHALLENGE.  Stop words are often instrumental in language detection for unknown texts.
##             How might you write a program to do this?

## Lemmatise

Lemmatisation transforms an inflected word into its dictionary form, or <i>lemma</i>. For nouns, this often converts the plural form to the singular. For verbs, this often produces the infinitive. So, for example, both 'digging' and 'dug' are transformed to their base form, 'dig'.

NLTK's lemmatisation function assumes that all words are nouns by default and leaves most non-nouns untouched!

<h3>WordNet Abbreviations</h3>
<table align='left'>
<tr><td>Noun</td><td>'n'</td><td>wordnet.NOUN</td></tr>
<tr><td>Verb</td><td>'v'</td><td>wordnet.VERB</td></tr>
<tr><td>Adjective</td><td>'a'</td><td>wordnet.ADJ</td></tr>
<tr><td>Adverb</td><td>'r'</td><td>wordnet.ADV</td></tr>
</table>

In [46]:
# Import NLTK's lemmatiser

from nltk.stem import WordNetLemmatizer


In [47]:
# Initialise the lemmatiser and assign it to a variable

wnl = WordNetLemmatizer()

In [49]:
# The lemmatisation is called as a method: .lemmatize()

wnl.lemmatize('artefacts')

'artefact'

In [50]:
# Doesn't seem to work properly on 'digging' because 'wnl' assumes it is seeing a noun

wnl.lemmatize('digging')

'digging'

In [51]:
# Fortunately we can pass in a label for a different part of speech
# Perhaps one might use this along with a POS tagger!

wnl.lemmatize('digging', pos='v')

'dig'
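To combine the lemmatiser with a POS tagger, the tagger's labels first have to be translated into the WordNet abbreviations listed in the table above. The sketch below assumes the tagger returns Penn Treebank tags (such as 'NNS' or 'VBG'); the function name `penn_to_wordnet` and the hand-written tagged tokens are hypothetical, and the mapping is deliberately coarse.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    # Default to noun, which is also the lemmatiser's own default
    return 'n'

# Hypothetical tagged tokens, written by hand in (token, tag) format
tagged = [('archaeologists', 'NNS'), ('were', 'VBD'),
          ('digging', 'VBG'), ('carefully', 'RB')]

wordnet_tags = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(wordnet_tags)
```

Each pair could then be handed to the lemmatiser as `wnl.lemmatize(token, pos=letter)`.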

In [52]:
# A new list of tokens

famous_sketch = ['ministry','of', 'silly', 'walks']

In [53]:
# Use a for-loop to lemmatise each word sequentially

for word in famous_sketch:
    print(wnl.lemmatize(word))

ministry
of
silly
walk


In [54]:
# Now, two lists of tokens

two_sketches = [['ministry','of', 'silly', 'walks'],['musical','mice']]

In [55]:
# As a nested loop

for sketch in two_sketches:
    for word in sketch:
        print(wnl.lemmatize(word))

ministry
of
silly
walk
musical
mouse


In [56]:
# Let's lemmatise the nouns in our movie reviews

negative_lemmatized = [[wnl.lemmatize(word) for word in review] for review in negative_no_stops]
positive_lemmatized = [[wnl.lemmatize(word) for word in review] for review in positive_no_stops]

In [None]:
## EXERCISE. Lemmatise the list of plural nouns below

plural_nouns = ['parrots', 'witches', 'volcanoes', 'soliloquies', 'cherries', 'addenda', \
                'baths', 'knives', 'oxen', 'lice', 'brethren', 'alumni', 'alumnae', 'matrices']

## CHALLENGE: Use the part-of-speech tagger we looked at in the previous module and include
##            a POS argument while lemmatising the following sentence.

brave_sir_robin = "When danger reared its ugly head, he bravely turned his tail and fled!"

## Minimum Document Frequency

Intuitively, not all words in the corpus will convey the same amount of information about whether a movie review is positive or negative or whether a poem is a haiku. At the extreme, if a word appears in just a single text out of thousands, it doesn't tell us much either way about whether that word is associated with a category. By removing infrequent terms from our model, we can also save computational time.

When we measure document frequency, we do not need to know how many times a token appears in a text. We simply need to know which tokens appear in each. Python has an easy, built-in data type that tells us the unique elements appearing in a list: <i>set</i>. As a data type, a set is like a list but it does not retain information about the order of elements. Also, it is very efficient when we want to check 'if' a particular element is contained 'in' a group of words.

In order to count the document frequency for each word in the corpus, we will produce a set of unique words for each text. Then we will pull out each word from each set and put these into a single list. Words belonging to multiple sets will appear multiple times; words belonging to just a single set will appear only once. Finally, we will use the <i>Counter</i> to tally how many times each word appears in the term-document frequency list. 
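The three steps just described can be sketched on a toy corpus before applying them to the movie reviews. The three mini "reviews" below are invented for illustration, and the variable names are hypothetical.

```python
from collections import Counter

# Three hypothetical, already-tokenised mini "reviews"
toy_reviews = [['great', 'plot', 'great', 'cast'],
               ['dull', 'plot'],
               ['great', 'fun']]

# Step 1: one set of unique words per review (repeats inside a review vanish)
toy_sets = [set(review) for review in toy_reviews]

# Step 2: flatten the sets into one list; a word appears once per review containing it
toy_flat = [word for review_set in toy_sets for word in review_set]

# Step 3: tally the flattened list to get document frequencies
toy_df = Counter(toy_flat)

print(toy_df['great'], toy_df['plot'], toy_df['dull'])
```

Note that 'great' occurs three times in the toy corpus but only in two reviews, so its document frequency is 2, not 3.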

### Set: Function/Data-Type

In [57]:
# Here's a text that reuses many of its words

gertrude_stein = ['rose','is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

In [58]:
# Return a 'set' of the unique tokens in the text
# A set is like a list but does not retain information about order, and does not include duplicate items

set(gertrude_stein)

{'.', 'a', 'is', 'rose'}

In [59]:
# The size of the set is much smaller than the text itself

len(gertrude_stein), len(set(gertrude_stein))

(11, 4)

In [60]:
# Produce a set of unique tokens contained in each review

negative_sets = [set(review) for review in negative_lemmatized]
positive_sets = [set(review) for review in positive_lemmatized]

### Document Frequency

In [61]:
# We want frequencies for the whole corpus, so we'll put our sets of words together now

all_sets = negative_sets + positive_sets

In [62]:
# Check that we got them all

len(negative_sets), len(positive_sets), len(all_sets)

(1000, 1000, 2000)

In [63]:
# And we'll pull out the words from each review set
# This produces a list in which each token appears as many times as the number of documents to which it belongs

term_document_frequency_list = [word for review in all_sets for word in review]

In [64]:
# We want to count the number of times each token appears, so we'll use 'Counter'

from collections import Counter

tdf_counts = Counter(term_document_frequency_list)

In [65]:
# Inspect

tdf_counts

Counter({'somewhere': 120,
         'applaud': 10,
         'watch': 489,
         '(': 1900,
         'decided': 97,
         'clue': 68,
         'fantasy': 101,
         'slasher': 47,
         'life': 900,
         'really': 867,
         'kid': 318,
         'director': 868,
         'throughout': 261,
         'apparently': 183,
         'hold': 185,
         'terribly': 53,
         'always': 433,
         'also': 1070,
         'stir': 21,
         'street': 159,
         'church': 46,
         'party': 125,
         '&': 166,
         'downshift': 2,
         'got': 365,
         'unraveling': 2,
         'generally': 98,
         'thing': 959,
         'new': 758,
         'u': 703,
         'plot': 893,
         'arrow': 22,
         'showing': 138,
         'seems': 686,
         'accident': 83,
         'mind': 389,
         '.': 2000,
         'giving': 192,
         'bentley': 5,
         'deal': 241,
         '``': 1540,
         'five': 199,
         'need': 427,
     

In [66]:
# Let's refresh ourselves on Counter's methods

tdf_counts.keys()



In [67]:
# How many reviews refer to the larger film industry?

tdf_counts['hollywood']

385

In [68]:
# Produce a list of words whose tdf-count is greater than 1

more_than_once = [key for key in tdf_counts.keys() if tdf_counts[key]>1]

In [69]:
# Inspect

more_than_once

['somewhere',
 'applaud',
 'watch',
 '(',
 'decided',
 'clue',
 'fantasy',
 'slasher',
 'life',
 'really',
 'kid',
 'director',
 'throughout',
 'apparently',
 'hold',
 'terribly',
 'always',
 'also',
 'stir',
 'street',
 'church',
 'party',
 '&',
 'downshift',
 'got',
 'unraveling',
 'generally',
 'thing',
 'new',
 'u',
 'plot',
 'arrow',
 'showing',
 'seems',
 'accident',
 'mind',
 '.',
 'giving',
 'bentley',
 'deal',
 '``',
 'five',
 'need',
 'back',
 'thrilling',
 'go',
 'sad',
 'running',
 'excites',
 '20',
 'happen',
 'chase',
 'want',
 'sorta',
 'personally',
 'fed',
 'two',
 'still',
 'would',
 'salvation',
 'package',
 'bit',
 'elm',
 'character',
 'dy',
 'sense',
 'every',
 'neighborhood',
 'since',
 'kind',
 'see',
 'rarely',
 'attempt',
 'dead',
 'main',
 'obviously',
 'little',
 'offering',
 'folk',
 'disappearance',
 'entire',
 'took',
 'lost',
 'beauty',
 'horror',
 'neat',
 'edge',
 'unravel',
 'packaged',
 '9/10',
 'exact',
 'chopped',
 'teen',
 'redundant',
 'audience'

In [70]:
# Just how many words did we remove from our vocabulary?

len(more_than_once), len(tdf_counts.keys())

(22567, 41950)
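The two figures above can be turned into the number and share of vocabulary items removed with a little arithmetic. The values are copied from the output above; the variable names are hypothetical. Bear in mind that the vocabulary still contains punctuation tokens, so "words" here really means distinct tokens.

```python
# Figures copied from the output above
kept = 22567    # tokens appearing in more than one review
total = 41950   # all distinct tokens in the corpus

# Tokens that appeared in only a single review
removed = total - kept

# Share of the vocabulary that was dropped, as a percentage
percent_removed = 100 * removed / total

print(removed, round(percent_removed, 1))
```

In other words, nearly half of the distinct tokens occur in only one review.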

In [71]:
# Now we can go back through our movie reviews and remove
# any words that were not included in 'more_than_once'

# As mentioned above, it is much more efficient to perform that task using a 'set'

more_than_once_set = set(more_than_once)

In [72]:
negative_min_df = [[word for word in review if word in more_than_once_set] for review in negative_sets]
positive_min_df = [[word for word in review if word in more_than_once_set] for review in positive_sets]

In [73]:
# Inspect
negative_min_df[0]

['somewhere',
 'applaud',
 'watch',
 '(',
 'decided',
 'clue',
 'fantasy',
 'slasher',
 'life',
 'really',
 'kid',
 'director',
 'throughout',
 'apparently',
 'hold',
 'terribly',
 'always',
 'also',
 'stir',
 'street',
 'church',
 'party',
 '&',
 'downshift',
 'got',
 'unraveling',
 'generally',
 'thing',
 'new',
 'u',
 'plot',
 'arrow',
 'showing',
 'seems',
 'accident',
 'mind',
 '.',
 'giving',
 'bentley',
 'deal',
 '``',
 'five',
 'need',
 'back',
 'thrilling',
 'go',
 'sad',
 'running',
 'excites',
 '20',
 'happen',
 'chase',
 'want',
 'sorta',
 'personally',
 'fed',
 'two',
 'still',
 'would',
 'salvation',
 'package',
 'bit',
 'elm',
 'character',
 'dy',
 'sense',
 'every',
 'neighborhood',
 'since',
 'kind',
 'see',
 'rarely',
 'attempt',
 'dead',
 'main',
 'obviously',
 'little',
 'offering',
 'folk',
 'disappearance',
 'entire',
 'took',
 'lost',
 'beauty',
 'horror',
 'neat',
 'edge',
 'unravel',
 'packaged',
 '9/10',
 'exact',
 'chopped',
 'teen',
 'redundant',
 'audience'

In [None]:
## EXERCISE.  Get a list of the 500 words with the highest document frequencies.

## CHALLENGE: Only words should be contained in the list.

## CHALLENGE: You can remove low-frequency words based on document frequency.
##            Another popular method is to retain only high-frequency
##            words based on the raw number of times they appear in the corpus
##            (i.e. the sum of their counts in all texts).

##            Use this method to find the 500 most common terms in the movie corpus.
##            Does this match with the list in the previous exercise?

# 4. Classification

### Featurise

For humans, reading a string of text is a relatively easy task, but for the computer to learn about language, text has to be represented in very particular ways. We refer to this as <i>featurisation</i>: the transformation of a text into a quantitative feature set.

In order for the NLTK classifier to work, we have to represent each text as a set of True/False values: Is a given word from our high-frequency vocabulary present in this review? More specifically, these values will be contained in a <i>dictionary</i>, where each key is a vocabulary word and its value is whether or not it is present. In fact, we need only include terms that are present, so all of our values will be True.

Once we have processed each text according to this rubric, we will then attach a label for the text's category ('negative'/'positive'). The classifier will use these labels to identify which features are associated with each category.

In [None]:
# Let's revisit some earlier tokens

unladen_tokens = ['high','air-speed','velocity']

In [None]:
# In order to represent our tokens to the classifier, we need to
# associate them with a 'True' value in a dictionary

# Sure looks like a list comprehension! (This is its dictionary counterpart.)

{token:True for token in unladen_tokens}

In [None]:
# In general, we are not limited to True/False values for our dictionary entries
# For example:

{token:len(token) for token in unladen_tokens}

In [None]:
# Turn our reviews into dictionaries that indicate whether a word is present

negative_featurized = [{word:True for word in review} for review in negative_min_df]
positive_featurized = [{word:True for word in review} for review in positive_min_df]

In [None]:
# Inspect first review

negative_featurized[0]

In [None]:
# Attach a label to each review
negative_tagged = [(review,'negative') for review in negative_featurized]
positive_tagged = [(review,'positive') for review in positive_featurized]

In [None]:
# Inspect
negative_tagged[0]

In [None]:
# Combine these lists of featurized, tagged reviews
all_tagged = negative_tagged + positive_tagged

In [None]:
## NOTE. We'll spend more time with dictionaries in the next module, so let's hand wave
##       it as a formatting step for now and move on to classification!
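
The featurising and tagging steps above can also be wrapped in a single helper. This is only a sketch: the function name `featurise` and the small vocabulary are hypothetical and chosen for illustration.

```python
def featurise(tokens, vocabulary, label):
    """Return a (feature dictionary, label) pair in the shape built above."""
    # Keep only tokens from the vocabulary and mark each as present
    features = {token: True for token in tokens if token in vocabulary}
    return (features, label)

# Hypothetical vocabulary; in the notebook this role is played by more_than_once_set
vocabulary = {'high', 'low', 'air-speed', 'velocity'}

tagged = featurise(['high', 'air-speed', 'velocity', 'swallow'], vocabulary, 'unladen')
print(tagged)
```

The token 'swallow' is dropped because it is not in the vocabulary, which mirrors the filtering we applied to the reviews before featurising them.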

### Classification

We have selected an algorithm that specifically relies on <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes' Theorem</a> to model relationships between textual features and categories in our corpus of movie reviews. (See link for more information about the method and its assumptions.)

Two ways that we learn about the model are its feature weights and its predictions on new texts. The algorithm can explicitly report which direction each word leans category-wise and how strongly. Based on those weights, it makes further predictions about the valences of previously unseen movie reviews.
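
To make the idea concrete, here is a simplified sketch of the Naive Bayes calculation: the score for a category is its prior probability multiplied by the probability of each observed word given that category, treating words as independent (the "naive" assumption). This is not NLTK's actual implementation; the toy reviews and the add-one smoothing are assumptions made for the example.

```python
from collections import Counter

# Toy training data: (set of words present, label). Hypothetical examples.
training = [
    ({'great', 'fun'}, 'positive'),
    ({'great', 'story'}, 'positive'),
    ({'boring', 'story'}, 'negative'),
    ({'boring', 'mess'}, 'negative'),
]

labels = sorted({label for _, label in training})
vocabulary = set().union(*(words for words, _ in training))

# How many reviews per label, and in how many of them each word appears
doc_counts = Counter(label for _, label in training)
word_counts = {label: Counter() for label in labels}
for words, label in training:
    word_counts[label].update(words)

def posterior(words):
    """Return P(label | words) for each label, using add-one smoothing."""
    scores = {}
    for label in labels:
        score = doc_counts[label] / len(training)  # prior P(label)
        for word in words:
            if word in vocabulary:
                # Smoothed estimate of P(word present | label)
                score *= (word_counts[label][word] + 1) / (doc_counts[label] + 2)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

probs = posterior({'great', 'story'})
print(probs)
```

Here 'great' appears only in positive toy reviews while 'story' appears equally in both, so 'great' alone tips the prediction toward 'positive'.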

In [None]:
# Train the classifier and assign it to a variable

from nltk import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(all_tagged)

In [None]:
# Report feature information: the words most strongly associated with either positive or negative reviews
classifier.show_most_informative_features(10)

In [None]:
# Roger Ebert's reviews of a couple of family films

# "This movie made my heart glad. It is filled with innocence, hope, and good cheer."
#    -Roger Ebert, on ET

positive_ET = {'best': True, 'baffled': True, 'space': True, 'relationship': True, 'people': True, 'little': True,\
            'friendship': True, 'love': True, 'story': True, 'becomes': True,'hear': True, 'outer': True,\
            'boy': True, 'friend': True, 'tells': True, 'creature': True, 'described': True}

# "I hated this movie. Hated, hated, hated, hated, hated this movie. Hated it.
#  Hated every simpering stupid vacant audience-insulting moment of it."
#     -Roger Ebert, on "North"

negative_north = {'simpering': True, 'belief': True, 'every': True, 'thought': True, 'implied': True,\
                  'entertained': True, 'insulting': True, 'vacant': True, 'sensibility': True, 'stupid': True,\
                  'insult': True, 'audience': True, 'anyone': True, 'movie': True, 'hated': True, 'moment': True}

In [None]:
# What does the classifier think?

classifier.classify(positive_ET)

In [None]:
classifier.classify(negative_north)

In [None]:
# Predictions for a list of reviews

review_list = [positive_ET, negative_north]
classifier.classify_many(review_list)

In [None]:
# Although our classification is binary, Bayes' theorem assigns
# a probability of membership in either category

# Just how confident is our classifier of its predictions?

classifier.prob_classify(positive_ET).prob('positive')

In [None]:
classifier.prob_classify(negative_north).prob('negative')

In [None]:
## EXERCISE. Below are the locations of two movie reviews that the classifier has not yet seen.
##           Use the classifier to predict whether they are positive or negative.

ebert_neg_location = 'movie_reviews/ebert/Roger Ebert - Battlefield Earth (negative review).txt'
ebert_pos_location = 'movie_reviews/ebert/Roger Ebert - Toy Story (positive review).txt'

## Q.  What kinds of patterns do you notice among the 'most informative features'
##     in the movie review corpus? Where are critics focusing their attention?
##     Try looking at the top several hundred most informative words.

### Extra: Cross-Validation

Just how good is our classifier? We can evaluate it by randomly selecting reviews from each category and setting them aside before training. We then see how well the classifier predicts their (known) categories.

Remember that if the classifier is trying to predict membership for just two categories, we would expect it to be correct about 50% of the time based on random chance. As a rule of thumb, if this kind of classifier has 65% accuracy or better under cross-validation, it has often identified a meaningful pattern.
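
The 50% figure assumes the two categories are the same size. When they are not, a fairer baseline is the accuracy of always guessing the most common label. The sketch below computes that baseline for a list shaped like `all_tagged`; the function name `majority_baseline` and the toy data are hypothetical.

```python
from collections import Counter

def majority_baseline(tagged_reviews):
    """Return the most common label and the accuracy of always guessing it."""
    label_counts = Counter(label for _, label in tagged_reviews)
    most_common_label, count = label_counts.most_common(1)[0]
    return most_common_label, count / len(tagged_reviews)

# Toy, deliberately unbalanced data in the (features, label) shape used above
toy_tagged = [
    ({'good': True}, 'positive'),
    ({'fine': True}, 'positive'),
    ({'great': True}, 'positive'),
    ({'dull': True}, 'negative'),
]

baseline_label, baseline_accuracy = majority_baseline(toy_tagged)
print(baseline_label, baseline_accuracy)
```

With three positive reviews out of four, always guessing 'positive' is already right 75% of the time, so a classifier would need to beat that figure rather than 50%.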

In [None]:
# Randomise our list of movie reviews (in place)

!pip install numpy
import numpy
numpy.random.shuffle(all_tagged)

In [None]:
# We'll train our classifier on the first 90% of reviews
# and validate using the last 10%

n_validation = len(all_tagged) // 10   # size of the 10% validation set
training_set = all_tagged[:-n_validation]
validation_set = all_tagged[-n_validation:]

In [None]:
# Train, validate, show accuracy

classifier = nltk.NaiveBayesClassifier.train(training_set)
nltk.classify.accuracy(classifier, validation_set)
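
A single split gives only one accuracy estimate, and it depends on which reviews happened to land in the validation set. A common refinement is k-fold cross-validation, where every review is used for validation exactly once. The sketch below only generates the splits; the function name `k_fold_splits` is hypothetical, and it assumes the list has already been shuffled.

```python
def k_fold_splits(items, k):
    """Yield (training, validation) pairs; each item is validated exactly once."""
    fold_size = len(items) // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder
        end = start + fold_size if i < k - 1 else len(items)
        validation = items[start:end]
        training = items[:start] + items[end:]
        yield training, validation

# Toy stand-in for the shuffled list of tagged reviews
toy_items = list(range(10))
splits = list(k_fold_splits(toy_items, 5))
for training, validation in splits:
    print(len(training), len(validation))
```

For each pair you could train with `nltk.NaiveBayesClassifier.train(training)`, score with `nltk.classify.accuracy(classifier, validation)`, and then average the k accuracies.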

In [None]:
## CHALLENGE. In fact, this is not the best implementation of cross-validation, since we used
##            the entire corpus to identify words that appear more than once. In effect, we have
##            passed information from our validation set into the classifier we wish to test.

##            Repeat the processing of the corpus based only on the training set and use this to
##            make predictions about the validation set. How much does the classifier's accuracy change? Why?