# LX 496 / 796 Homework 2

In this homework we will do some classification and categorization, and work through some Python.  We talked through a lot of these things in class, so much of this is kind of just review, but also provides some practice doing it yourself.

**Using the "autograder":** The way this notebook is set up is that you read a bunch of stuff, and then periodically you will be prompted to answer a question.  The question might take the form of a question you answer in prose, or (more often) a question you answer in code. The code questions come with a test that will check your answer.  If you have it correct, it will tell you that the question "passed."  So, you can be fairly confident as you go along that you got the code working.  You'll want to submit a notebook where things pass.

**Submitting at the end:** When you are ready to submit, download the `.ipynb` file and then go to Gradescope and upload it there.  The autograder will run and double-check your answers.  In principle there may be some tests ("hidden tests") that Gradescope runs that you didn't have access to when you were working through it.

**Getting started**: In order for the checking procedure to work, you need to run this cell below first.  It will download the autograder stuff and the tests.


In [1]:
%%capture
# Run this first, it makes the autograder go.
# The "%%capture" line above keeps it from showing all the output
# Install the autograder (otter-grader) and get the test files
files = "https://github.com/bucomplx/lx496f22/raw/main/assets/ipynb/hw2/tests.zip"
!pip install otter-grader && wget $files && unzip -o tests.zip
# enable the otter test evaluator
import otter
grader = otter.Notebook()

Before we get too far into this, let's try a question that the autograder can check.  

### q0 (test autograder) ###

**Question:** In the code block below, change the `...` to be the number `4` and then execute the check.  It will fail.  Then change it to be the number `10` and it will pass.

<!--
BEGIN QUESTION
name: q0
-->

In [2]:
# First change the ... to 4 to see what failure looks like, then change to 10
first_answer = ...

In [None]:
grader.check("q0")

Great.  So, that's how the autograder works.  Whenever you see `...` like that after `=`, that's a spot for you to fill something in.  The instructions will tell you what to fill in.  Actually, you might also have noticed that when you had the wrong answer in, the autograder revealed what was supposed to be there.  Specifically, you'll have seen `assert first_answer == 10` in there, which is `True` if `first_answer` is `10` and otherwise presents an error.  So, you can get clues to what it is looking for, if your test isn't passing.  Of course, usually, getting the test to pass will be more involved than just setting the number to be what it expects to see, but it is a source of hints.
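
If `assert` is new to you, the sketch below shows the behavior the test relies on.  The variable name `example_answer` is made up for illustration; it is not something the autograder checks.

```python
# "example_answer" is a hypothetical name, not one the autograder looks at.
example_answer = 10

# A true condition: assert does nothing and execution continues.
assert example_answer == 10

# A false condition: assert raises AssertionError, which is what the
# autograder reports as a failed test.
try:
  assert example_answer == 4
  outcome = "passed"
except AssertionError:
  outcome = "failed"

print(outcome)
```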

# Sentiment analysis on movie reviews


## Preparing the corpus data

NLTK comes with a corpus of movie reviews, that has been segmented into positive and negative reviews. Using this, we will try to construct a machine that can guess whether a new review is positive or negative. ("Sentiment analysis")

In [4]:
# make NLTK and the movie reviews corpus available
import nltk
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

Different corpora are organized in different ways, but generally they are collections of individual files, often categorized. The movie reviews corpus has many files, one file per review, and categorized into those that are positive and those that are negative. The positive reviews are in the `pos` directory (folder) and the negative reviews are in the `neg` directory. (When writing out a filename, folders are indicated by a `/` character.)

**Activity**.  When we did the `from nltk.corpus import movie_reviews` line above, we effectively created an object called `movie_reviews` which is a corpus-type object and knows how to do a number of different things.  Among the things it knows how to do is tell you what different categories it has in it.  It can also tell you what files it contains (`fileids`), and some other things.  Below, try typing just `movie_reviews.` and stop after the `.`. You'll see a bunch of options come up; these are the things the `movie_reviews` corpus knows how to do.  Continue then by typing `categories()` to ask it to tell you what categories it has.

In [5]:
# Below, type
# movie_reviews.categories()
# to see what categories this corpus has defined.
# Conceptually, we are addressing the movie_reviews object and asking it
# to do something, specifically to run the categories() function for us.
# Try stopping typing after you type the . and, after a moment, it should
# pop up a list of the various things that movie_reviews knows how to do.
# That's useful if you can't quite remember the name of the function, or are
# just curious about other things you can do with the movie reviews corpus.


When you looked at what `movie_reviews` knew how to do (that is, "what methods are defined" in Python parlance), which you got above by typing `movie_reviews.` and looking at the autocompletion popup, you saw `raw` and `categories` and `readme` and some other things.  The `readme` method gives you information about the corpus. 

### q1 (find dataset year) ###

**Question:** Get the corpus to tell you the readme information, in the same kind of way we got it to tell us what the categories were above.  Then use the information you find there to answer the question (fill in the year below).

<!--
BEGIN QUESTION
name: q1
-->

In [7]:
# Put a command below that will print out the readme information
...

# Question: what year was this (v2.0) dataset released?
# So the autograder can check it, assign the proper year to v2_year
# It should look like
# v2_year = 1066
# Note: 1066 is not the correct year
v2_year = ...

In [None]:
grader.check("q1")

Now, back to the problem at hand. We now know what the categories are (you saw them back in the "activity" earlier), but let's print them in a nicer way.  First, we'll count how many there are by asking for the `len()` (length) of the list that `categories()` returned, and then we'll list them out, joined by commas between them. We can do this using `len()` and `format()` and `join()`. This should be kind of familiar, but make sure you understand how these lines work.

In [9]:
print("There are {} categories".format(len(movie_reviews.categories())))
print("They are: " + ", ".join(movie_reviews.categories()))
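
To see the three pieces separately, here is the same pattern on a small made-up list.  The list `toy_categories` is an illustration only, not something from the corpus.

```python
# A hypothetical stand-in for what categories() returns.
toy_categories = ["comedy", "drama", "horror"]

# len() counts the items in the list.
how_many = len(toy_categories)

# format() drops its argument into the {} placeholder.
count_line = "There are {} categories".format(how_many)

# join() glues the list items together, using ", " between them.
list_line = "They are: " + ", ".join(toy_categories)

print(count_line)
print(list_line)
```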

Nice!  That's much easier to read.  Another thing that the corpus can do is give us the names of all the files it contains. We can get that list with `fileids()`. That list is long though.  There are (as we'll see below) 2000 of them. So, printing all 2000 of them out with commas in between would just be an unreadable mess.  We don't need to see them all, we can just look at the first three to get a sense of what they look like. It's not a bad idea to just make sure that the things you're working with (like categories or fileids) look like what you expect them to look like. So this is a kind of sanity check.

In [10]:
print("There are {} files".format(len(movie_reviews.fileids())))
print("The first 3 are: " + ", ".join(movie_reviews.fileids()[:3]))
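
The `[:3]` slice is what keeps that output short.  Here is a small sketch of slicing on invented filenames (the names in `toy_ids` are not real fileids):

```python
# Hypothetical filenames, made up for illustration.
toy_ids = ["a/01.txt", "a/02.txt", "a/03.txt", "b/01.txt", "b/02.txt"]

# [:3] means "from the start, up to but not including position 3".
first_three = toy_ids[:3]

# Asking for more items than exist is not an error; you just get them all.
first_ten = toy_ids[:10]

print(", ".join(first_three))
print(len(first_ten))
```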

Once we did the `...import movie_reviews` step, `movie_reviews` became a known corpus object, that we can interact with.  We can tell the corpus object to give us its categories, as above, with `categories()`, and we can ask it to give us its filenames with `fileids()`. If we want just the filenames for a single category, we can specify it like this: `fileids(categories='pos')` or (if you want the filenames from several categories) `fileids(categories=['neg', 'pos'])`. The `categories=` parameter can take either a string (naming a category) or a list of strings (each of which names a category).

The code below will go through the categories, and, for each category, it will call the current category `c` and then execute the indented block.  The first line gets the fileids for all the reviews in category `c` (whatever `c` is on this iteration), and then tells us how many there are and what the first 3 fileids are.

In [11]:
for c in movie_reviews.categories():
  ids = movie_reviews.fileids(categories=c)
  print("There are {} reviews in category {}.".format(len(ids), c))
  print("The first 3 files are: " + ', '.join(ids[:3]))

The hypothesis we will pursue first is that we can guess whether a review is positive or negative based on what words it contains. Roughly speaking, a review that contains the word "boring" is probably negative, and one that contains the word "exciting" is probably positive.

So, let's try to work that out.

These reviews are just text, they have not been processed except to split the text up into words.  So one thing we want to do right away is convert all the words to lowercase, so we don't consider a word like "boring" to be two different words depending on whether it is at the beginning of a sentence or in the middle of a sentence.

The second thing we want to do is to remove all the super-common/grammatical words and punctuation so that what's left are the more contentful words.  This is all an approximation, but the idea is that `.` and `the` and `and` are not providing useful information about the contents of a review, so we want to filter all those out.  These are, in NLP terminology, "stopwords." And NLTK has a corpus of them (in several languages, we'll use the English ones).

So what we are going to do is create a filtered version of these reviews, where the stopwords are removed and the remaining words are lowercased.

The first step is to make a list of words we will exclude.  This list will include the stopwords as well as the "punctuation words" (a list of words, each of which is one punctuation mark).

In [12]:
# make the stopwords corpus available.
from nltk.corpus import stopwords
nltk.download('stopwords')

In [13]:
# make the punctuation list available (which is something that "string" knows how to do)
import string

Now that the stopwords and punctuation list have been loaded up, we want to combine them into a list of words to exclude.

In [14]:
# if we inquire about the fileids in the stopwords corpus,
# we see the list of languages that it has stopwords for.
print(stopwords.fileids())

In [15]:
# the ones we want are the English ones, so we use words() to pull those out.
eng_stopwords = stopwords.words('english')

In [16]:
# let's take a look at what we got.  It's a list of words.  They're all lowercase.
print(eng_stopwords)

We now want to augment the list of stopwords with "punctuation words". Most tokenizers (which split raw text into words) and corpora split off punctuation into their own words, so a sentence "I left." in a corpus might be represented like `['I', 'left', '.']`, where the `.` represents its own word.  We want to filter out those punctuation marks as well as the stopwords, so we're planning on making a bigger list of words that contains all the stopwords and the punctuation words.

In [17]:
# here is a string containing all the relevant punctuation marks.
print(string.punctuation)

A string can be viewed as a list of characters (sort of, it isn't *actually* a list, but a lot of things you can do with lists you can also do with strings). Among the things you can do with a string is iterate through it, character by character, just like you would iterate through a regular list.
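
Here is a quick sketch of the "sort of a list" point, using a short made-up string rather than the punctuation string.  Many list operations work unchanged, but a string cannot be modified in place.

```python
word = "left."   # an arbitrary example string

print(len(word))     # number of characters
print(word[0])       # indexing works as it does for lists
print(word[-1])      # so does negative indexing
print("." in word)   # and so does membership testing

# Iterating visits one character at a time.
vowels = 0
for ch in word:
  if ch in "aeiou":
    vowels += 1
print(vowels)

# Unlike a list, a string is immutable: assigning to a position fails.
try:
  word[0] = "L"
  changed = True
except TypeError:
  changed = False
print(changed)
```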

Just to show this in action, consider the code below. It prints each punctuation character between square brackets.  The `end=''` part of the `print()` command tells Python not to move to the next line but just keep printing on the same line (if you do not specify the end parameter, the default is to move to the next line).


In [18]:
for c in string.punctuation:
  print(' [{}]'.format(c), end='')

### q2 (create punctuation word list) ###

**Question**: Create a list of words formed from the punctuation string.  I expect you'll use a list comprehension. Though if you do, take a look at what the list comprehension is doing. There's actually an even easier way to get the same result, but however you get the result is fine.  Your list should look like `['!', '"', '#', '$', etc.]` once you've formed it.

<!--
BEGIN QUESTION
name: q2
-->


In [19]:
punc_words = ...
print(punc_words)

In [None]:
grader.check("q2")

Now that we have `punc_words` and `eng_stopwords` defined, we can create a full list of words to filter out.

In [21]:
skipwords = eng_stopwords + punc_words
print("We have {} words to filter out.".format(len(skipwords)))

Ok, so now we have the list of words we want to filter out, we can finally go back to address the reviews. Let's walk through the steps of lowercasing and filtering a single review by hand first, then we can generalize it to a function that we can apply to each of the reviews.  (One reason we want to lowercase the reviews is that the stopwords are all lowercase, so we want to filter out not only "my" in the middle of a sentence but also "My" at the beginning of one.)


### q3 (fileid of first positive review) ###

**Question:** 
Find the `fileid` of the first positive review.  You can do this however you like, just get to the right answer.

<!--
BEGIN QUESTION
name: q3
-->

In [22]:
# Find the fileid of the first positive review
# Do this however you want to, just get to the right answer.
file_first_pos = ...

In [None]:
grader.check("q3")

Once we have the first positive review identified, we will retrieve the words from it, by asking the `movie_reviews` object to give us words for the `fileids` provided (in this case, just the one).

In [24]:
# get the words for this review
rev_words = movie_reviews.words(fileids=file_first_pos)
# print out how many words it has, as a sanity check
# to give us some confidence that rev_words indeed contains a review
# It should have 862 words.
print("The review has {} words".format(len(rev_words)))

### q4 (convert words to lowercase) ###

**Question:** Now, convert all the words to lowercase, so that `rev_lower` is a list of the words in the first review, except with all of the words in lowercase.

<!--
BEGIN QUESTION
name: q4
-->


In [25]:
# now we want to convert all the words to lowercase
rev_lower = ...

In [None]:
grader.check("q4")

Now that we have our review in `rev_lower`, we can pull out any matching stopwords and punctuation.  



### q5 (remove skipwords) ###

**Question:** Define `rev_filtered` to be a list of the words from `rev_lower` that are not in `skipwords`.  You probably want to use a list comprehension for this.  You should wind up with 402 words after filtering.  The test is checking for that.

<!--
BEGIN QUESTION
name: q5
-->


In [27]:
# now we want to make a list of just those words in this review that
# are NOT in skipwords.
rev_filtered = ...
# and print out how many words are left after removing the stopwords
print("After filtering, the review has {} words.".format(len(rev_filtered)))

In [None]:
grader.check("q5")

Now that we've stepped through the procedure for a single review, let's generalize that so that we can do this for any review. We'll define a function `filter_review()` that goes through those same steps for any list of words (assumed to be a review) that is fed into it.  Specifically, it will make it lowercase, and then filter out the skipwords, and return the resulting list.

In [29]:
def filter_review(review_words, words_to_skip):
  rev_lower = [w.lower() for w in review_words]
  return [w for w in rev_lower if w not in words_to_skip]

In [30]:
# to see if it worked, let's try it on the review we did by hand.
# just to make sure we get the same result
rev_filtered = filter_review(rev_words, skipwords)
num_filtered_words = len(rev_filtered)
print("The review has {} non-stopwords.".format(num_filtered_words))
if num_filtered_words == 402:
  print("\\o/")

A couple of notes here, particularly if Python is kind of new to you.  The `filter_review` function has two arguments, the first is a list of the words in the review, and the second is a list of words to filter out.  We have defined a list of words in the general context of this notebook that we want to filter out, we called it `skipwords`, and when we call the `filter_review` function, we will pass it that list.  It would have been possible to just refer to `skipwords` directly inside this function (that is, to have just assumed that it is already defined to be the right thing and use it), but that is not great programming practice. The reason is just that we usually want our functions to be as modular as they can be, so that we can reuse them.  Ideally, we want not to make assumptions about what people have defined in the outside context.  Better to define a function that takes as arguments all the information it needs to perform its function.

This relates to "variable scope" if you wanted to look it up.  This mainly relates to what definitions are visible from where; if you define a variable in the "global" context of the notebook, functions can in principle see and use (and even change) those values.  If you define a variable within a function, it is only visible from elsewhere in that function, and not from outside. Maybe I'll demonstrate this a bit later.
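
Here is a minimal sketch of that visibility rule.  All the names in it (`greeting`, `make_message`, `local_note`) are invented for the demonstration.

```python
greeting = "hello"   # defined in the global (notebook-level) context

def make_message(name):
  # local_note exists only while this function is running
  local_note = "from inside"
  # the function can read the global variable greeting
  return "{} {} ({})".format(greeting, name, local_note)

message = make_message("world")
print(message)

# Outside the function, local_note is not visible.
try:
  local_note
  visible_outside = True
except NameError:
  visible_outside = False
print(visible_outside)
```

Passing `name` in as an argument, rather than reading it from the global context, is the same choice we made for `words_to_skip` in `filter_review`.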

Ok, now we're ready to do what we just did for each of the reviews.  We'll start by doing it for the positive reviews.

In [31]:
# Apply the filter function to all the positive reviews.
pos_fileids = movie_reviews.fileids(categories='pos')
pos_filtered = [filter_review(movie_reviews.words(f), skipwords) for f in pos_fileids]

Note that this actually took a little while.  It's going through 1000 reviews, and for each one retrieving the words, then going through them to make sure each one isn't in `skipwords`. This isn't even all that large a dataset, either. So it becomes important to be somewhat efficient in your operations if your datasets get larger. Or, have a lot of patience.
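
One common way to speed up this kind of filtering, which this notebook does not itself use, is to convert the skip list to a `set` before filtering.  Checking `w not in some_list` has to scan the list item by item, while a set can answer the same question without scanning.  The sketch below repeats the `filter_review` definition so it runs on its own, and uses a tiny made-up review and skip list.

```python
def filter_review(review_words, words_to_skip):
  rev_lower = [w.lower() for w in review_words]
  return [w for w in rev_lower if w not in words_to_skip]

# Invented data, standing in for a real review and for skipwords.
toy_review = ["The", "plot", "was", "boring", "."]
toy_skip_list = ["the", "was", "."]
toy_skip_set = set(toy_skip_list)

# The function does not care whether it is handed a list or a set.
from_list = filter_review(toy_review, toy_skip_list)
from_set = filter_review(toy_review, toy_skip_set)

print(from_list)
print(from_set == from_list)
```

The result is the same either way; only the time taken differs, and the difference only becomes noticeable on larger data.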

In [32]:
# did it work? The first one should have 402 words in it.
num_filtered_first_pos = len(pos_filtered[0])
print("Filtered, first pos rev has {} words.".format(num_filtered_first_pos))
if num_filtered_first_pos != 402:
  print("Disaster! Something is wrong. This number is not right.")

Now we have processed all the positive reviews, so we should do the negative ones too. We could just repeat what we did above except for the negative ones. Let's do that quickly.



### q6 (define neg_filtered) ###

**Question:** Define `neg_fileids` and `neg_filtered` to provide a list of filtered reviews for the negative ones, just modeling your answer on what we did just above for the positive ones.

<!--
BEGIN QUESTION
name: q6
-->


In [33]:
neg_fileids = ...
neg_filtered = ...

In [None]:
grader.check("q6")

Now we have two lists, one a list of negative reviews, one a list of positive reviews. Each one lowercased and filtered.

The goal is to train a classifier on these, so that it can guess whether a review it sees is positive or negative. To train it, we need to give it a bunch of examples of positive ones and a bunch of examples of negative ones, with a label on each one that says what the right answer is so that, while it is training, it can check to see if it got it right and adjust itself if it didn't.

The format these need to be in is essentially:

```
[(question, answer), (question, answer), ... ]
```
That is, a list of pairs, where each pair has a "question" (the thing being classified) and an "answer" (what the classifier should be classifying it as). In this case, the answer is the category from which the review was drawn.

A straightforward way to get these pairs is just build them with list comprehensions, as follows. Then add the pairs together.

In [35]:
pos_pairs = [(review, 'pos') for review in pos_filtered]
neg_pairs = [(review, 'neg') for review in neg_filtered]
all_pairs = pos_pairs + neg_pairs

In [36]:
# see if we have something like what we expect
# specifically, the first pair should have:
# a list of words in the first review as its first member, and
# the category/answer (pos) as its second member
first_pair = all_pairs[0]
print(first_pair[0])
print(first_pair[1])

Great, now we have a big list of pairs in `all_pairs`.  Remember that each of those pairs is of the form `(words, category)` where `words` is a list of words in a given review, and `category` is either `pos` or `neg` depending on which category the review came from.

The last step here is to create a `NaiveBayesClassifier` and train it. What a `NaiveBayesClassifier` does is look at a set of properties of each review, and correlate the properties with the category we tell it the review had. The classifier can't read the review directly, so we have one more step to take before we can train a classifier. We have to decide what properties of the review the classifier will have access to. What is important?

We'd already kind of decided that what we'll care about is what words are in it.  So the properties of the reviews are going to be something like `contains(terrible)` or `contains(exhilarating)`.  Then the classifier will look at those properties and learn to judge how likely a review is to be negative or positive based on the values of those properties.

The first thing we could try is just to make a set of properties for each review that has `contains(w)` for every word `w` in the review.  We can use `set()` here because we are (by hypothesis/assumption) supposing that it doesn't matter how many times `terrible` appears in a review, only whether it appears or not.

In [37]:
def extract_features(words):
  features = {}
  for w in set(words):
    features['contains({})'.format(w)] = True
  return features
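
A toy run makes the effect of `set()` visible.  The function is repeated here so that the sketch is self-contained, and the word list is invented.

```python
def extract_features(words):
  features = {}
  for w in set(words):
    features['contains({})'.format(w)] = True
  return features

# "terrible" appears twice in this made-up word list...
toy_words = ["terrible", "plot", "terrible", "acting"]
toy_features = extract_features(toy_words)

# ...but it yields only one feature, so 4 words give 3 features.
print(len(toy_words))
print(len(toy_features))
print(sorted(toy_features))
```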

Let's try this on one review. It'll be the first one. The first member of the list `all_pairs` is a pair. That pair has as its first member the filtered review (and as its second member the category). So, the review is in `all_pairs[0][0]`.

In [38]:
print(extract_features(all_pairs[0][0]))

We've just done this for one review (the one at the beginning of `all_pairs`), now let's apply that to all the reviews.  We're converting a pair like `(review, category)` to a pair like `(features, category)`. That is, we need to retain the category as the second member of the pair, but extract the features from the review and use those features as the first member of the pair.

In [39]:
# Now that we've done that for one, let's do it for all of them.
feature_pairs = [(extract_features(words), cat) for (words, cat) in all_pairs]

In [40]:
# double check to see if this is what we expected to get
print(feature_pairs[0][0])
print(feature_pairs[0][1])

## Training the classifier

Now, let's try to make a `NaiveBayesClassifier`. The first thing to do is to shuffle the `feature_pairs` list (so that all the positive reviews aren't at the front), then split it into a training set and test set, train a `NaiveBayesClassifier` on the training set, and see how it does on the test set.

In [41]:
# make Python aware of the randomization commands
import random

In [42]:
# now shuffle feature_pairs
random.shuffle(feature_pairs)
# they should now be all jumbled up
# let's take the first 600 (about 30%) as the test set, the rest as the training set
test_rev = feature_pairs[:600]
train_rev = feature_pairs[600:]
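
Because `random.shuffle` produces a different order on every run, your accuracy numbers will change from run to run.  If you want a repeatable split while experimenting, one option (not something this notebook requires) is to seed the random number generator first.  The sketch uses a made-up list of ten items and a 3/7 split in place of the 600/1400 split above; the seed value is arbitrary.

```python
import random

toy_items = list(range(10))   # stand-in for feature_pairs

random.seed(496)              # any fixed number gives a repeatable shuffle
random.shuffle(toy_items)

toy_test = toy_items[:3]      # first 3 items (30%) for testing
toy_train = toy_items[3:]     # the remaining 7 for training

print(len(toy_test), len(toy_train))
```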

In [43]:
# ask nltk to create and train a classifier
classifier = nltk.NaiveBayesClassifier.train(train_rev)

In [44]:
# ok, now it is all trained up, how did it do?
print(nltk.classify.accuracy(classifier, test_rev))

In [45]:
# compare that to how well it learned its training set
print(nltk.classify.accuracy(classifier, train_rev))

The classifier is doing better than chance (better than guessing) at identifying positive and negative reviews for reviews it hasn't seen before, but there's a **BIG** difference between how well it does on its training data and how well it does on its test data. That's not optimal, it kind of looks like it memorized the training data. That's called "overfitting" and basically means that it's not likely to generalize well if it is memorizing idiosyncrasies of the training set.

Because this data set is actually kind of small, this is probably not the right place to talk about the issues with overfitting, because there's not much improvement we can squeeze out of this on the test set by avoiding overfitting.  But one thing to consider is, how accurate *should* it be if given just a list of the words contained in a review at guessing whether it's positive or negative? It's very suspicious if it is 98% accurate; no person should be able to do that (given an unordered list of words in a review, guess to 98% accuracy whether it is positive or negative).  Anyway, we'll come back to overfitting later.  For the moment, we can at least be happy that it is getting around 70% on reviews it has never seen before.

We can get some insight into how it is making its decisions by asking the classifier what the most informative features are.  You'll see something slightly different from everyone else, given that it is based on a random shuffling, but it will tell you something like: a review that contains the word "outstanding" is about 20-to-1 odds of being a positive review, and one that contains "turkey" is about 18-to-1 odds of being a negative review.


In [46]:
classifier.show_most_informative_features()
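
The ratios in that table compare how often a feature shows up in the training reviews of each category.  The arithmetic below is a rough sketch with made-up counts.  It ignores the smoothing that NLTK applies when it estimates probabilities, so real output will not match it exactly.

```python
# Hypothetical counts, invented for illustration only.
n_pos_reviews = 700    # positive reviews in the training set
n_neg_reviews = 700    # negative reviews in the training set
pos_with_word = 40     # positive reviews containing the word
neg_with_word = 2      # negative reviews containing the word

# How likely is the word to appear, given each category?
p_word_given_pos = pos_with_word / n_pos_reviews
p_word_given_neg = neg_with_word / n_neg_reviews

# The "odds" reported are the ratio of those two likelihoods.
ratio = p_word_given_pos / p_word_given_neg
print("pos : neg = {:.1f} : 1.0".format(ratio))
```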

# Sentiment analysis on Twitter samples

Let's do something else similar that somehow seems more relevant.  NLTK has a set of Twitter samples that we can use.  We did/will kind of run through some of this in class, but watching me do it is not quite the same as doing it yourself, so let's try this out in the context of a homework assignment.

> This exercise is largely modeled on [Shaumik Daityari's tutorial](https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk).

## Preparing the corpus data

First, we make the corpus available.  Again, involving the two steps of `from nltk.corpus import` and `nltk.download`.

In [47]:
# make the twitter samples corpus available
from nltk.corpus import twitter_samples
nltk.download('twitter_samples')

In [48]:
# what files are in this corpus?
twitter_samples.fileids()

The `twitter_samples` corpus contains tokenized tweets (so, already broken up into words), and we can just use those. If we request `tokenized(fileid)` of the corpus object, we will get those.

In [49]:
# look at the first tokenized positive tweet (a list of words)
first_tweet = twitter_samples.tokenized('positive_tweets.json')[0]
print(first_tweet)

We're going to do a couple of things that are a little bit more sophisticated with these. The first thing is to do something analogous to what we did when we lowercased all the words. The reason we lowercased the words before is so that we aren't treating "Happily" at the beginning of a sentence differently from "happily" in the middle of the sentence, etc. What we're going to do here is go one level of abstraction further by attempting to neutralize differences in tense and agreement.  That is, we want to make "annoy", "annoying", "annoyed" all come out as the same concept/word.

The process of removing inflection like this is called "stemming" and there are various ways that it can be done, but we'll just use a built-in way to do this using NLTK.  We're going to use the WordNetLemmatizer, which can strip the endings off of verbs and nouns if you tell it what kind of word it is.  (A "lemma" is basically the same as a "stem", so this stemmer is called a lemmatizer, because, I don't know why.) 

Let's load that up.

> The "omw" is the Open Multilingual WordNet.  This is a relatively recent addition to NLTK, but is now required to be able to use the stemmer/lemmatizer. Older instructions typically have you download just 'wordnet' and do not mention 'omw-1.4'.  This might be useful to know if you look at examples out on the web that are older than a few months.


In [50]:
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

Here is what the stemmer/lemmatizer will do: if you give it a singular noun and a plural noun, it should return some neutral form that doesn't distinguish them.  If you give it a past tense verb and a present tense verb, it should give you some neutral form. But you do have to tell it whether you are trying to "stem" verbs or nouns.  That matters to its algorithms.

In [51]:
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('is', 'v'))
print(lemmatizer.lemmatize('being', 'v'))
print(lemmatizer.lemmatize('cried', 'v'))
print(lemmatizer.lemmatize('crying', 'v'))
print(lemmatizer.lemmatize('taken', 'v'))
print(lemmatizer.lemmatize('took', 'v'))
print(lemmatizer.lemmatize('sing', 'v'))
print(lemmatizer.lemmatize('singing', 'v'))
print(lemmatizer.lemmatize('stapler', 'n'))
print(lemmatizer.lemmatize('staplers', 'n'))
print(lemmatizer.lemmatize('ox', 'n'))
print(lemmatizer.lemmatize('oxen', 'n'))
print(lemmatizer.lemmatize('person', 'n'))
print(lemmatizer.lemmatize('people', 'n'))

Pretty good. It didn't manage to quite factor out plurality for "people" but it did pretty well anyway.

Now: what we want to do is stem the words in our tweets.

Except, wait.  We need to know what the nouns and verbs are.  And we don't have that information.  So first we need to classify the words in the tweets as nouns or verbs.

That is in fact the same kind of problem we're already in the midst of trying to solve: it is a classification problem, this time classifying words by part of speech.  However, we'll just use a pre-existing classifier for that.  NLTK contains a pretrained model that can look at words and guess their part of speech, and that will be good enough for this.  So, let's load up this "tagger" (named that way because it tags words with their parts of speech).


In [52]:
from nltk.tag import pos_tag
nltk.download('averaged_perceptron_tagger')

And then let's see what it does.  Let's apply it to that first tweet.

In [53]:
print(pos_tag(first_tweet))

So it has converted our list of words into a list of pairs. The first member of the pair is the word, the second member of the pair is the tag (that is, the part of speech). We learn from this that it considers "#FollowFriday" to be an adjective.  I guess.

In there, we have a verb "being" and a noun "members".  We know that "being" is a verb because it has been given a part-of-speech tag of "VBG", and that "members" is a noun because it has been given a part-of-speech tag of "NNS".  Specifically, those first two letters of the part-of-speech tag are what tells us if it is a noun, if it is a verb, etc.  And the lemmatizer only cares about nouns and verbs (none of the more specific information), so we just give it "n" or "v" as the part of speech.
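
The tag-to-letter step can be sketched without running the tagger at all.  The helper name `simple_pos` and the hand-written tagged pairs are made up for this illustration; the fallback to `'a'` for everything else mirrors what the cleaning function later in this notebook does.

```python
def simple_pos(tag):
  # Only the first two letters of the tag matter here.
  if tag.startswith('NN'):
    return 'n'
  if tag.startswith('VB'):
    return 'v'
  return 'a'

# Hand-written (word, tag) pairs in the shape that pos_tag() returns.
toy_tagged = [('being', 'VBG'), ('members', 'NNS'), ('top', 'JJ')]

# Keep each word, but reduce its tag to the single letter.
toy_letters = [(word, simple_pos(tag)) for word, tag in toy_tagged]
print(toy_letters)
```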

Let's try this out by hand on those two words.

In [54]:
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('being', 'v'))
print(lemmatizer.lemmatize('members', 'n'))

Now we'll define something that will take a whole sentence, figure out the part of speech of each word, and -- if the word is a noun or a verb -- stem it.  We'll do other processing with it as well, while we're at it.  Specifically, we'll lowercase all the words, and we'll exclude all the skipwords as well (stopwords for English and punctuation), and drop any @-mentions and links from consideration.

> You will see `continue` in the code below.  What this does is stops processing of the current block and returns control to the `for...:` to move on to the next word (if any are left).  So, if `continue` is processed before we add the token to the `cleaned_tokens` list, the token does not get added, we just move on to consider the next token.
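
Here is a small sketch of `continue` on a made-up token list (no tagging or stemming, just the skipping logic).  The names `toy_tokens`, `toy_skip`, and `kept` are invented for the example.

```python
toy_tokens = ["@friend", "loved", "it", "http://example.com", "!"]
toy_skip = ["it", "!"]

kept = []
for token in toy_tokens:
  if token in toy_skip:
    continue             # jump straight to the next token
  if token.startswith("@") or token.startswith("http"):
    continue             # same here: nothing below this line runs
  kept.append(token)     # only reached when no continue fired
print(kept)
```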

In [55]:
def clean_tweet(tweet_tokens, skipwords=()):
  # we will return a list of "cleaned" tokens, which excludes
  # all of the stopwords, punctuation, @-mentions, and
  # links, and has all the tokens lowercased and stemmed.
  # the cleaned_tokens list is where we are collecting those.
  # so we start off by making it an empty list, which we will
  # then add to as we go through the tweet
  cleaned_tokens = []
  # define the lemmatizer
  lemmatizer = WordNetLemmatizer()

  # pos_tag() will convert the tweet into a list of pairs
  # each pair will have the word as its first member
  # and the part of speech as its second
  for token, tag in pos_tag(tweet_tokens):

    lowercased_token = token.lower()

    # if this word is one of the skipwords (stopwords + punctuation)
    # then skip it
    if lowercased_token in skipwords:
      continue
    # if this is an @-mention or a web address, then skip it
    if lowercased_token.startswith('http') or lowercased_token.startswith('@'):
      continue

    # set part of speech for this word to be n or v if the pos tagger
    # said it was a noun or a verb, otherwise set the part of speech to be a.
    if tag.startswith('NN'):
      pos = 'n'
    elif tag.startswith('VB'):
      pos = 'v'
    else:
      pos = 'a'
    
    # lemmatize the word
    token = lemmatizer.lemmatize(lowercased_token, pos)
    # add the lemmatized word to the list
    cleaned_tokens.append(token)
  return cleaned_tokens

Read through the function above to see what it is doing.
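
If the role of `continue` is still a little murky, here is the same filtering pattern in isolation, without the tagging or lemmatizing. The token list and the one-word `skip` list are made up for illustration.

```python
tokens = ['Great', 'day', '@someone', 'http://example.com', 'the', 'park']
skip = ['the']  # made-up stand-in for skipwords

kept = []
for token in tokens:
  lowered = token.lower()
  if lowered in skip:
    continue  # go straight to the next token; nothing below runs
  if lowered.startswith('http') or lowered.startswith('@'):
    continue
  kept.append(lowered)

print(kept)  # ['great', 'day', 'park']
```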

Let's see what it did to the first tweet, by printing the un-cleaned tweet for comparison, and then the cleaned tweet.

(Remember that we defined `skipwords` a long time ago; it contains the English stopwords and the punctuation marks.)

In [56]:
# skipwords = stopwords.words('english') + list(string.punctuation)
print(" ".join(first_tweet))
print(" ".join(clean_tweet(first_tweet, skipwords)))

Ok, now that we've worked through the first tweet by hand, let's prepare to process all the tweets this way.

First, we'll collect all the tokenized tweets into named variables so we can work with them.

In [57]:
# collect all the tokenized tweets
pos_tweets = twitter_samples.tokenized('positive_tweets.json')
neg_tweets = twitter_samples.tokenized('negative_tweets.json')
test_tweets = twitter_samples.tokenized('tweets.20150430-223406.json')

Finally, something for you to do again, after all that reading.  We have a list of positive tweets in `pos_tweets` and we now want to clean them all using `clean_tweet()` to get a list of cleaned positive tweets.



### q7 (clean positive tweets into pos_cleaned) ###

**Question:** Define a list called `pos_cleaned` that is a list of the results of calling `clean_tweet()` for each tweet in `pos_tweets`.

<!--
BEGIN QUESTION
name: q7
-->

In [58]:
pos_cleaned = ...

In [None]:
grader.check("q7")

### q8 (clean negative tweets into neg_cleaned) ###

**Question:** Now do the analogous thing to create `neg_cleaned` that is a list of the results of calling `clean_tweet()` for each tweet in `neg_tweets`.

<!--
BEGIN QUESTION
name: q8
-->

In [60]:
neg_cleaned = ...

In [None]:
grader.check("q8")

Just to see what we've got, let's look at the 501st positive tweet in its original form and then in its cleaned version.

In [62]:
print(pos_tweets[500])
print(pos_cleaned[500])

Now, it is time for a hypothesis.  We have the cleaned data and we need to figure out what information we think our classifier is going to need in order to make the call on whether a tweet is positive or negative.

For the moment, we're going to use a simple hypothesis.  It'll be like the movie reviews: we'll suppose that there are words that are common in positive tweets and not in negative tweets, and words that are common in negative tweets and not in positive tweets.

We can actually just use the `extract_features()` function defined a while ago, which just adds a feature for each word in the tweet.  As a reminder, that was defined like this:

```python
def extract_features(words):
  features = {}
  for w in set(words):
    features['contains({})'.format(w)] = True
  return features
```
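
To make the shape of the result concrete, here is what it gives for a short made-up list of tokens. The function is repeated so the snippet runs on its own. Note that a repeated word yields only one feature, because the loop goes over `set(words)`.

```python
def extract_features(words):
  features = {}
  for w in set(words):
    features['contains({})'.format(w)] = True
  return features

example = ['love', 'salad', 'love']  # made-up tokens
features = extract_features(example)
print(len(features))  # 2
print(features)
```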

In [63]:
# create the list of pairs
pos_tagged_tweets = [(extract_features(t), 'pos') for t in pos_cleaned]
neg_tagged_tweets = [(extract_features(t), 'neg') for t in neg_cleaned]
tagged_tweets = pos_tagged_tweets + neg_tagged_tweets

In [64]:
# make sure they look like we're expecting them to
tagged_tweets[500]

## Training the classifier

Now that we have all the tagged tweets, we just need to jumble them up, split them into a training set and a test set, train a `NaiveBayesClassifier` on the training set and test it on the testing set.



### q9 (train and test a NaiveBayesClassifier) ###

**Question:** So, do that.  Figure out how many tweets there are in `tagged_tweets`, shuffle them, cut off the first 30% as the test set, leaving the last 70% as the training set, train a new `NaiveBayesClassifier` on the training set and see how well it does on the test set. I've provided a couple of prompts (partly so that I can control what variable names you use, so that the results can be checked).

<!--
BEGIN QUESTION
name: q9
-->


In [65]:
# shuffle tagged_tweets
...
# determine the number of tweets in tagged_tweets
num_tweets = ...
print("A test set with 30% in it will have {:d} tweets".format(int(.3*num_tweets)))
# define test_tweets to be the first 30%, train_tweets to be the rest
test_tweets = ...
train_tweets = ...
# define and train a new NaiveBayesClassifier on train_tweets
tweet_classifier = ...
# test and print out how successful the classifier is on test_tweets
accuracy = ...
print("The classifier got an accuracy score of {}".format(accuracy))

In [None]:
grader.check("q9")

That's really, really good.

Really very good.

That's kind of surprisingly good. What did it use to make these calls?

In [68]:
tweet_classifier.show_most_informative_features()

Well, now, wait a minute. It seems like a great many of these tweets have basically an "I am a positive tweet" or "I am a negative tweet" right in them.  How many had these?

In [69]:
pos_smileys = sum([':)' in tw or ':-)' in tw for tw in pos_cleaned])
pos_frownies = sum([':(' in tw or ':-(' in tw for tw in pos_cleaned])
neg_smileys = sum([':)' in tw or ':-)' in tw for tw in neg_cleaned])
neg_frownies = sum([':(' in tw or ':-(' in tw for tw in neg_cleaned])
print("positive tweets with smiley: {}%".format(100*pos_smileys/len(pos_cleaned)))
print("negative tweets with smiley: {}%".format(100*neg_smileys/len(neg_cleaned)))
print("positive tweets with frowny: {}%".format(100*pos_frownies/len(pos_cleaned)))
print("negative tweets with frowny: {}%".format(100*neg_frownies/len(neg_cleaned)))
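
The counting above works because Python treats `True` as 1 and `False` as 0 when summing, and because each cleaned tweet is a list of tokens, so `':)' in tw` asks whether the smiley is one of the tokens. A small sketch on made-up "tweets":

```python
toy_tweets = [['good', 'morning', ':)'], ['ugh', ':('], ['hello'], ['yay', ':-)']]

# one True/False per tweet: does it contain either kind of smiley?
flags = [':)' in tw or ':-)' in tw for tw in toy_tweets]
print(flags)       # [True, False, False, True]
print(sum(flags))  # 2
print("{:.1f}%".format(100 * sum(flags) / len(toy_tweets)))  # 50.0%
```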

At the outset, this problem seemed like it was going to be a bit more challenging than it has turned out to be.

But counting on a smiley/frowny seems a little bit like cheating.  What if we excluded those from consideration, and asked something like: what face would be included in this tweet if a face were to be included? 

In [70]:
skipfaces2 = skipwords + [':)', ':-)', ':(', ':-(']
pos_cleaned2 = [clean_tweet(t, skipfaces2) for t in pos_tweets]
neg_cleaned2 = [clean_tweet(t, skipfaces2) for t in neg_tweets]
pos_tagged_tweets2 = [(extract_features(t), 'pos') for t in pos_cleaned2]
neg_tagged_tweets2 = [(extract_features(t), 'neg') for t in neg_cleaned2]
tagged_tweets2 = pos_tagged_tweets2 + neg_tagged_tweets2
random.shuffle(tagged_tweets2)
test_tweets2 = tagged_tweets2[:3000]
train_tweets2 = tagged_tweets2[3000:]
tweet_classifier2 = nltk.NaiveBayesClassifier.train(train_tweets2)
print(nltk.classify.accuracy(tweet_classifier2, test_tweets2))
tweet_classifier2.show_most_informative_features()

That's a bit more like it. It's still pretty good: it's right about three quarters of the time, even without smiley and frowny faces to guide it.
(And if we explored this further, we might find that there are other emoticons we should be removing too, to keep from "cheating" this way.)

Let's try making up a tweet just to see what it would do.

In [71]:
def test_sentence(sent):
  words = sent.split()
  clean_words = clean_tweet(words, skipfaces2)
  result = tweet_classifier2.classify(extract_features(clean_words))
  print("{}: {}".format(result, sent))

test_sentence("I just had the best salad of my life")
test_sentence("That salad was the worst thing I have ever eaten")

We can also ask our classifier what the actual numerical probability is (rather than just making the binary call). It will tell us "pos" if it is over 50% likely that it is positive, but it might be interesting to see the confidence it has.
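
To get a feel for where such a number comes from, here is a very simplified sketch of the naive Bayes arithmetic: multiply a starting probability for each label by how likely each word is under that label, then rescale so the results sum to 1. All of the numbers and the name `posterior` are made up for illustration, and NLTK's classifier also smooths its estimates, so this is not what it computes exactly.

```python
# Made-up numbers for illustration only (not taken from the tweet data)
prior = {'pos': 0.5, 'neg': 0.5}
# probability that a tweet with this label contains the word
likelihood = {
  'best':  {'pos': 0.08, 'neg': 0.02},
  'salad': {'pos': 0.01, 'neg': 0.01},
}

def posterior(words):
  scores = {}
  for label in prior:
    score = prior[label]
    for w in words:
      score *= likelihood[w][label]
    scores[label] = score
  total = sum(scores.values())
  # rescale so the probabilities over the labels sum to 1
  return {label: score / total for label, score in scores.items()}

probs = posterior(['best', 'salad'])
call = max(probs, key=probs.get)
print("{} ({:6.2f}%)".format(call, 100 * probs[call]))
```

With these numbers, "salad" is equally likely under both labels and so does not move the result, while "best" is four times as likely under "pos", which puts the call at about 80% positive.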


In [72]:
def test_sentence_prob(sent):
  words = sent.split()
  clean_words = clean_tweet(words, skipfaces2)
  result = tweet_classifier2.prob_classify(extract_features(clean_words))
  call = result.max()
  prob = result.prob(call)  # a probability between 0 and 1
  print("{} ({:6.2f}%): {}".format(call, 100 * prob, sent))

test_sentence_prob("I just had the best salad of my life")
test_sentence_prob("The salad was marginally better than anticipated")
test_sentence_prob("That salad was the worst thing I have ever eaten")
test_sentence_prob("That salad could have been better")


It would seem that the classifier could be better at handling complex pragmatics. But, thinking that "That salad could have been better" is positive is a completely understandable error, since "better" is probably a pretty good signal of a positive tweet. 

We can look and see what contributions the individual words are making by evaluating a "tweet" that contains just that word.  And indeed, "better" pushes positive.  In fact, even "salad" does. Only "could" is pushing the other way.  So we can sympathize with our robot for thinking that "That salad could have been better" is a positive statement.

In [73]:
test_sentence_prob("that")
test_sentence_prob("salad")
test_sentence_prob("could")
test_sentence_prob("have")
test_sentence_prob("been")
test_sentence_prob("better")

<!-- BEGIN QUESTION -->

### q10 (thoughts) ###

**Question:** What do you make of the result above for "that", "have", and "been"?  They're all the same, why are they all the same?  And, what might be the reason for them not being exactly 50%, assuming they aren't? (For me, I am seeing pos 50.67%, but there's a random element that should lead you to see something different.)  There is kind of a right answer here, and it's not just that "that" is kind of a grammatical word that doesn't add much to the meaning, although that is in a sense related to the answer.  There's something more concrete at play here that leads them all to have the same value, and leads that value to be not quite exactly 50%.

<!--
BEGIN QUESTION
name: q10
manual: true
-->


_Type your answer here, replacing this text._

<!-- END QUESTION -->



At this point I was intending to add in something about authorship attribution, but it's a pretty similar endeavor; instead of classifying something as positive or negative, you are classifying something as one author or another.  So, this is long enough, we'll leave homework 2 at that.

# Chat corpus, dialogue acts (advanced, optional)

Let's use the [NPS Chat Corpus](https://faculty.nps.edu/cmartell/NPSChat.htm). These are things taken from a few chat rooms in 2006, and they are tagged both for part of speech and for dialogue-acts.  Dialogue-acts are like "ynQuestion", "Statement", and some other things.  There is a description and list at the link above, and [a bit more detailed description at the Linguistic Data Consortium](https://catalog.ldc.upenn.edu/LDC2010T05).

Download and import.  We will abbreviate `nps_chat` as `nps` for typing convenience.

In [74]:
from nltk.corpus import nps_chat as nps
nltk.download('nps_chat')

In [75]:
# Here are the files.  The filename indicates date, age range, number of posts
nps.fileids()

My recollection of this corpus is that some of the text is kind of unpleasant.  These are chat rooms, this was 2006.  I'll pick one for an example.

First, we retrieve the posts.

In [76]:
posts = nps.xml_posts()

Now, we'll retrieve one particular post.  Number 119.  The dialogue-act can be retrieved by using `p.get('class')`, and the user who typed it can be retrieved by using `p.get('user')`.  The user name is anonymized, but the number at the end should consistently identify the same speaker throughout the corpus.

In [77]:
p = posts[119]
print(p.text)
print(p.get('class'))
print(p.get('user'))
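
The `.text` and `.get()` used here are the usual way of reading an XML element in Python. Assuming the posts are ElementTree-style elements (which is what those two suggest), the same access pattern can be tried on a made-up post built with the standard library. The tag name, attribute values, and text below are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up post: attributes on the tag, the message as its text
toy = ET.fromstring('<Post class="Greet" user="10-19-20sUser7">hi everyone</Post>')
print(toy.text)            # hi everyone
print(toy.get('class'))    # Greet
print(toy.get('user'))     # 10-19-20sUser7
print(toy.get('missing'))  # None (no such attribute)
```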

**Task.** Extract the complete list of dialogue acts represented in the list of posts.  The NPS page gives a list against which you can check.  Collect them, count them (there should be 15), and display them.

In [78]:
# retrieve the list of dialogue-acts
acts = {'Statement', 'Bye'} # in this format
print("If all is well, this should say 15: {}".format(len(acts)))
acts

We will use a tokenizer (breaks up text into words) and this depends on `punkt`, so download that.

In [80]:
nltk.download('punkt')

Here is a basic feature-extractor that we can use to begin with. Just making a feature out of each word in the chat sentence.

In [81]:
def nps_features(post):  #v1
  features = {}
  words = nltk.word_tokenize(post.text)
  for w in words:
    features['contains({})'.format(w.lower())] = True
  return features

Applied to post 119, it looks like:

In [82]:
print(nps_features(posts[119]))

**Task**. Now, make a list of pairs out of the posts. The first member will be the features provided by `nps_features` and the second member will be the classification (dialogue-act).

In [83]:
# like this but with all of the posts in the list.
fposts = [(nps_features(posts[119]), posts[119].get('class'))]
print("There should be 10567 posts and there are {}".format(len(fposts)))
fposts

**Task**. Split `fposts` into `nps_train` and `nps_test`, start by training on 80% and testing on 20%.  This should probably be randomized too, so that the test set doesn't come entirely out of one chat room/age group (on which the classifier was not trained).

In [85]:
# define nps_train, nps_test but make them 80%, 20% of fposts
# also shuffle them first
nps_train, nps_test = fposts[:1], fposts[1:]

Now, we will train the classifier and see how it did.

In [87]:
nps_classifier = nltk.NaiveBayesClassifier.train(nps_train)
print(nltk.classify.accuracy(nps_classifier, nps_test))

Not great.  The rest of this is just playing around trying to see what might improve this.

This is a chat room.  It's possible at least that a `ynQuestion` might be more likely to be followed by a `yAnswer` than by a `Greet`.  Perhaps we could use the context to make better predictions.  That is, might the category of the preceding line, as well as the words contained in the current line, be able to predict the category of the current line better?

**Task**.  Try adding `prev-class` to the features of the posts that we train on.  (That is, in addition to all the `contains(like): True` type features we have, add something like `prev-class: ynQuestion` as well.)  This probably can no longer be done with a list comprehension, so I defined `fpost_list(posts)`, which returns the list of (features, category) pairs.  That way it can go through the (unshuffled) posts, keep track of what the category of the previous text was, and record that in the features of the present chat text.
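
The "keep track of the previous category" idea can be sketched on made-up data, with plain (text, category) pairs standing in for the posts. The names `toy_posts`, `toy_features`, and `toy_fpost_list`, and the `'START'` placeholder for the first line, are all assumptions for illustration; your version would work with the real posts and `nps_features`.

```python
# Made-up (text, category) pairs, in their original order
toy_posts = [
  ('hi all', 'Greet'),
  ('anyone here from boston', 'ynQuestion'),
  ('yes', 'yAnswer'),
]

def toy_features(text):
  return {'contains({})'.format(w.lower()): True for w in text.split()}

def toy_fpost_list(posts):
  fposts = []
  prev_class = 'START'  # placeholder: nothing precedes the first line
  for text, act in posts:
    features = toy_features(text)
    features['prev-class'] = prev_class
    fposts.append((features, act))
    prev_class = act  # remember this category for the next line
  return fposts

for features, act in toy_fpost_list(toy_posts):
  print(act, '<- preceded by', features['prev-class'])
```

The important detail is the order of operations: the list has to be built before shuffling, since after shuffling the "previous" line is no longer the one that actually came before.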

In [88]:
# define something to give you a revised context-aware version of fposts
# that records the class of the preceding line as one of the features
# then randomize, split, train, test


Was it better?  For me, no.  For me, it was actually worse!

**Task.** Continue to play with it.  Can you get it to do better?  Some ideas:
 - Maybe we should ignore messages in the "System" category?
 - Maybe try training and testing on subcorpora for just chat rooms with the same age range?
 - Maybe include the previous *two* message categories? Ordered or unordered?

See what you can do.  For me, the best I ever got it to do is actually to include the *next* category as a feature.  That is, I built the set in reverse, so that the features of a message were the words it included and the category of the following message.  Perhaps a story can be told about how `yAnswer` predicts a preceding `ynQuestion` effectively or something.  Curious to see if anyone can get it much over 70%, that was about as much as I could squeeze out of it.

I suppose it's possible that a human looking at these wouldn't score super-high, except a human did the original classification.  However, perhaps not all humans would agree.

# Submission instructions

Go to File at the upper left of this web page and click "Download .ipynb" to download a copy of this.  Then go to Gradescope to submit the homework, and drag the .ipynb file in.  That should be all you need to do.  The autograder will run, but if your tests all passed in here, they should pass there as well.