# LX 394 / 594 / 694 Homework 2

In this homework we will do some classification and categorization.  It was a long time coming.  We talked through a lot of these things in class, so much of this is review and practice doing it yourself.


In [5]:
# Run this first, it makes the autograder go.
files = "https://github.com/bucomplx/lx394s21/raw/main/assets/ipynb/hw2/tests.zip"
!pip install otter-grader && wget $files && unzip -o tests.zip
import otter
grader = otter.Notebook()

NLTK comes with a corpus of movie reviews that has been segmented into positive and negative reviews. Using this, we will try to construct a machine that can guess whether a new review is positive or negative. (This task is known as "sentiment analysis.")

In [6]:
# make NLTK and the movie reviews corpus available
import nltk
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

Different corpora are organized in different ways, but generally they are collections of individual files, often categorized. The movie reviews corpus has many files, one file per review, and categorized into those that are positive and those that are negative. The positive reviews are in the `pos` directory (folder) and the negative reviews are in the `neg` directory. (When writing out a filename, folders are indicated by a `/` character.)

In [7]:
# Activity:
# categories() is defined for movie_reviews.
# this isn't defined for ALL corpora, but it is for this one.
# to see that, try typing
# movie_reviews.
# on the line below and look at what pops up after you type .
# We see things like raw, paras, readme, and categories


In [8]:
# Let's get the corpus to tell us what categories it has defined
print(movie_reviews.categories())

**Activity**. When you looked at what `movie_reviews` knew how to do (that is, "what methods are defined" in Python parlance), which you got above by typing `movie_reviews.` and looking at the autocompletion popup, you saw `raw` and `categories` and `readme` and some other things.  The `readme` method gives you information about the corpus. Get the corpus to tell you this information, in the same kind of way we got it to tell us what the categories were above.

In [11]:
# Put a command below that will print out the readme information
...

# Question: what year was this (v2.0) dataset released?
# So the autograder can check it, assign the proper year to v2_year
# It should look like v2_year = 1812 (note: that is not the correct year)
v2_year = ...

In [None]:
grader.check("q1")

We now know what the categories are, but let's print them in a nicer way.  First, we'll count how many there are by asking for the `len()` (length) of the list that `categories()` returned, and then we'll list them out, joined by commas between them. We can do this using `len()` and `format()` and `join()`. This should be kind of familiar, but make sure you understand how these lines work.

In [None]:
print("There are {} categories".format(len(movie_reviews.categories())))
print("They are: " + ", ".join(movie_reviews.categories()))
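
If the `len()`, `format()`, and `join()` calls feel opaque, here is a minimal sketch of the same pattern on a small made-up list. The list `fruits` and the variable names are just for illustration; they are not part of the corpus.

```python
# A made-up list of strings, standing in for what categories() returns.
fruits = ["apple", "banana", "cherry"]

# len() counts the items; format() drops that number into the {} slot.
count_line = "There are {} fruits".format(len(fruits))

# join() is called on the separator string (", ") and glues the
# list items together with that separator between them.
list_line = "They are: " + ", ".join(fruits)

print(count_line)
print(list_line)
```

Note that `join()` belongs to the separator string, not to the list, which is why the call reads `", ".join(...)`.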

Nice!  That's much easier to read.  Another thing that the corpus can do is give us the names of all the files it contains. We can get that list with `fileids()`. That list is long, though.  There are (as we'll see below) 2000 of them, so printing all of them out with commas in between would just be an unreadable mess.  We don't need to see them all; we can just look at the first three to get a sense of what they look like. It's not a bad idea to make sure that the things you're working with (like categories or fileids) look like what you expect them to look like. So this is a kind of sanity check.

In [None]:
print("There are {} files".format(len(movie_reviews.fileids())))
print("The first 3 are: " + ", ".join(movie_reviews.fileids()[:3]))
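
The `[:3]` part is a slice: it asks for the items from the start of the list up to (but not including) position 3. As a small sketch on a made-up list of filenames (the names here are invented, not real fileids):

```python
# An invented list of filenames, standing in for what fileids() returns.
names = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]

# [:3] keeps the items at positions 0, 1, and 2.
first_three = names[:3]

print("The first 3 are: " + ", ".join(first_three))
```

Slicing does not change the original list; `names` still has all five items afterward.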

Once we did the `...import movie_reviews` step, `movie_reviews` became a known corpus object that we can interact with.  We can tell the corpus object to give us its categories, as above, with `categories()`, and we can ask it to give us its filenames with `fileids()`. If we want just the filenames for a single category, we can specify it like this: `fileids(categories='pos')` or (if you want the filenames from several categories) `fileids(categories=['neg', 'pos'])`. The `categories=` parameter can take either a string (naming a category) or a list of strings (each of which names a category).

The code below will go through the categories, and, for each category, it will call the current category `c` and then execute the indented block.  The first line gets the fileids for all the reviews in category `c` (whatever `c` is on this iteration), and then tells us how many there are and what the first 3 fileids are.

In [None]:
for c in movie_reviews.categories():
  ids = movie_reviews.fileids(categories=c)
  print("There are {} reviews in category {}.".format(len(ids), c))
  print("The first 3 files are: " + ', '.join(ids[:3]))

The hypothesis we will pursue first is that we can guess whether a review is positive or negative based on what words it contains. Roughly speaking, a review that contains the word "boring" is probably negative, and one that contains the word "exciting" is probably positive.

So, let's try to work that out.

These reviews are just raw text; they have not been processed.  So one thing we want to do right away is convert all the words to lowercase, so we don't consider a word like "boring" to be two different words when it is at the beginning of a sentence and when it is in the middle of a sentence.

The second thing we want to do is to remove all the super-common/grammatical words and punctuation so that what's left are the more contentful words.  This is all an approximation, but the idea is that `.` and `the` and `and` are not providing useful information about the contents of a review, so we want to filter all those out.  These are, in NLP terminology, "stopwords." And NLTK has a corpus of them (in several languages; we'll use the English ones).

So what we are going to do is create a filtered version of these reviews, where the stopwords are removed and the remaining words are lowercased.
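
To make the goal concrete, here is a rough sketch of that kind of filtering on a toy example. Everything in it is made up for illustration: `toy_skipwords` is a tiny hand-written stand-in for the real stopword list, and `toy_review` is an invented sentence that has already been split into words. It is one possible way to do the filtering, not necessarily the way we will end up doing it.

```python
# Hypothetical mini skip list; the real stopwords come from NLTK.
toy_skipwords = ["the", "was", "and", "."]

# An invented review, already split into words (punctuation is its own word).
toy_review = ["The", "plot", "was", "boring", "and", "slow", "."]

# Lowercase each word, and keep it only if its lowercase form
# is not in the skip list.
filtered = [w.lower() for w in toy_review if w.lower() not in toy_skipwords]

print(filtered)
```

What is left is just the contentful words, all in lowercase, so "The" at the start of the sentence gets filtered out just like "the" would.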

In [None]:
# make the stopwords corpus available.
from nltk.corpus import stopwords
nltk.download('stopwords')

In [None]:
# make the punctuation list available
import string

Now that the stopwords and punctuation list have been loaded up, we want to combine them into a list of words to exclude.

In [None]:
# if we inquire about the fileids in the stopwords corpus,
# we see the list of languages that it includes.
print(stopwords.fileids())

In [None]:
# the ones we want are the English ones, so we use words() to pull those out.
eng_stopwords = stopwords.words('english')

In [None]:
# let's take a look at what we got.  It's a list of words.  They're all lowercase.
print(eng_stopwords)

We now want to augment the list of stopwords with "punctuation words". Most tokenizers and corpora split off punctuation into their own words, so a sentence in a corpus might be represented like `['I', 'left', '.']`, with the `.` representing its own word.  We want to filter out those punctuation marks as well as the stopwords, so we're planning on making a bigger list of words that contains all the stopwords and the punctuation words.

In [None]:
# here is a string containing all the relevant punctuation marks.
print(string.punctuation)

In [None]:
# you can go through this string and print each character.
# a string can be viewed as a list of characters, so you can
# iterate through a string by character just like iterating
# through a list.  The code below prints each punctuation character
# between square brackets.  The end='' part of the print()
# command tells python not to move to the next line but just
# keep printing on the same line (if you do not specify the end
# parameter, the default is to move to the next line).
for c in string.punctuation:
  print(' [{}]'.format(c), end='')

**Question** Create a list of words formed from the punctuation string.  I expect you'll use a list comprehension. Though if you do, take a look at what the list comprehension is doing. There's actually an even easier way to get the same result, but however you get the result is fine.  Your list should look like `['!', '"', '#', '$', etc.]` once you've formed it.


In [None]:
punc_words = ...
print(punc_words)

In [None]:
grader.check("q2")

Now that we have `punc_words` and `eng_stopwords` defined, we can create a full list of words to filter out.

In [None]:
skipwords = eng_stopwords + punc_words
print("We have {} words to filter out.".format(len(skipwords)))

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()