# Bag-of-Words Assignment #

The goal of this notebook is to create models that classify sentiment (positive or negative) on the basis of textual movie reviews.  This largely follows Chapter 8 of (M), where you can see the original notebook. You should also see my text-sentiment.mp4 video and the Jupyter notebook that accompanied it.  (It's '01-29-24.ipynb' on the class Github.) This notebook simplifies the presentation in (M) so that you can concentrate on the important concepts.  We will do this in three parts:

1. Cleaning up the original data.
2. Building bag-of-words feature vectors to use as input to the model.
3. Building various models and checking for accuracy.

# The Data #

You can download the original dataset at [this URL](http://ai.stanford.edu/~amaas/data/sentiment/).  A more accessible CSV is also available on the [course Github](https://github.com/aleahy-work/CS-STAT323-W26/tree/main/ClassData) in the file movie_data.csv.

Here's what the file looks like after converting to a CSV:

In [4]:
import pandas as pd

df = pd.read_csv('Data/movie_data.csv', encoding='utf-8')
df = df.rename(columns={"0": "review", "1": "sentiment"})

df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


In [8]:
df.shape # there are 50000 reviews

(50000, 2)

You can access individual reviews using index notation:

In [10]:
df.iloc[15,0]  # review number 15

"I saw this movie on the strength of the single positive review and I can only imagine that guy is a shill.<br /><br />The acting of the female lead is actually quite good, but the entire film is just so excruciatingly boring I could hardly bear to sit through it. This is the very definition of dullness.<br /><br />So far, this film is rated as 8 out of 10 on 7 votes. That must mean the director, director's girlfriend, producer, actress and drinking buddies have given their own film a 10.<br /><br />For the rest of you, who simply want to be entertained or enjoy a good story, avoid this.<br /><br />This man on the street shall give it a 2 out of 10.<br /><br />FDA note: while this movie can be used as an aide to obtaining a good nights sleep, no medicinal value is implied or offered."

In [11]:
df.iloc[1000,0]  # Review number 1000

'The story is derived from "King Lear"; the setting is a farm in Iowa. Here\'s a test for this kind of thing: if you find yourself asking, "Why did so-and-so do such-and-such," and the answer is, "because that\'s what happened in \'King Lear\'," you know that the film has failed. Well, that IS what happens here. The father figure in this story isn\'t living his own life, he\'s mimicking a fictional one. But there\'s more wrong with the film than this.<br /><br />Jocelyn Moorhouse is ambitious - far more ambitious than I think she realises. She\'s trying to take the King Lear story and completely change the setting. This is a task in itself. The likeliest result is that the transplanted story will die, and nobody will quite be able to work out why (although there are enough successful transplants, like "West Side Story", to make it worth trying). But she\'s ALSO attempting a revisionist retelling. In the version of "King Lear" she wishes to create, Reagan and Goneril command our sympath

# Cleaning up the Data #

The first step is to clean up each review so that it is reduced to literally just a 'bag of words' that can then be used to construct a fixed-length TF-IDF vector.  Details of this are discussed in (M) Chapter 8 and in my previous video.  You should expect to learn things about (1) [Python string methods](https://www.w3schools.com/python/python_ref_string.asp), (2) the [Python re package](https://www.w3schools.com/python/python_regex.asp), in particular the ".sub()" and ".split()" methods, and (3) the [NLTK toolkit](https://www.nltk.org/), in particular its list of stop words and its various stemmers and lemmatizers.

Here are things that you should consider doing to each review (in no particular order):

1. Convert it all to lower case
2. Strip out all punctuation and HTML code (and contractions?)
3. Remove all numbers
4. Remove stop words: words that are so common they don't help distinguish one review from another
5. Perform stemming or lemmatization

**Note:** These last two steps produce *lists* of words, which you will have to join back together to make a single string.  The syntax will look something like:

newreviewstring = ' '.join(review_words_list)

Google 'python join list into string' for more details.
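To make the steps above concrete, here is a minimal sketch of one possible cleaning function. It is not the approach from (M) or from my video, just an illustration: the name `clean_review` and the tiny `STOP_WORDS` set are made up for this example (NLTK's English stop word list is much longer), and step 5 is left out so that the sketch needs nothing beyond the standard library.

```python
import re

# Tiny stand-in stop word set (an assumption for this sketch);
# in practice you would use a fuller list such as NLTK's.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "this", "to", "i", "so", "but"}

def clean_review(text, stop_words=STOP_WORDS):
    """Reduce one raw review to a space-separated string of lower-case words."""
    text = text.lower()                     # step 1: lower case
    text = re.sub(r"<[^>]+>", " ", text)    # step 2: strip HTML tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text)   # steps 2-3: replace punctuation and digits with spaces
    words = text.split()                    # split on runs of whitespace
    words = [w for w in words if w not in stop_words]  # step 4: drop stop words
    return " ".join(words)                  # join the list back into one string

print(clean_review("The acting is GOOD,<br /><br />but I give it a 2 out of 10!"))
# prints: acting good give out
```

Note that this particular regex turns apostrophes into spaces, so "don't" becomes the two tokens "don" and "t". That is one reason contractions deserve some thought. A stemmer or lemmatizer would be applied to each word in the list just before the final join.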

**At the end of this process you should produce a random sample of reviews to convince yourself (and me) that you really have done as much as you can to clean up all of the unnecessary textual dross from your reviews.**

By the way, you might want to keep your original uncleaned dataset around to check your results.
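One way to do both things at once (keep the raw text and spot-check a random sample) is sketched below. The toy DataFrame, the `clean_review` column name, and the `placeholder_clean` function are all assumptions for illustration; substitute your own DataFrame and cleaning function.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (assumption: same column names as above).
df = pd.DataFrame({
    "review": ["Great FILM!<br />Loved it.", "Dull, dull, dull...", "A 10 out of 10."],
    "sentiment": [1, 0, 1],
})

def placeholder_clean(text):
    # Hypothetical stand-in for your own cleaning function.
    return text.lower()

# Write the cleaned text to a NEW column so the raw review stays available.
df["clean_review"] = df["review"].apply(placeholder_clean)

# Draw a reproducible random sample and print raw and cleaned text side by side.
sample = df.sample(n=2, random_state=0)
for raw, clean in zip(sample["review"], sample["clean_review"]):
    print("RAW:  ", raw)
    print("CLEAN:", clean)
```

Setting `random_state` makes the sample repeatable, which helps when you want to look at the same reviews again after changing the cleaning code.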

In [None]:
# put your work here

# Fit a TF-IDF Vectorizer #

This should be clear enough.  But tell me a little bit about your data.  For instance, how long are your feature vectors?  Does the data cluster in any way? (Can you project it down into 2 dimensions and get something interesting?  Probably not . . . )  Have fun exploring your data!

You might also want to perform your train-test split at the **end** of this step.

In [None]:
# put your work here

# Build some models and test your outcomes #

We have lots of different classifiers at this point:

1. logistic regression
2. LDA
3. QDA
4. Naive Bayes
5. Decision Trees and enhancements (bagging, boosting, etc)
6. SVM

I've probably forgotten something . . . 

Your tasks:

**First:** The book builds a logistic regression.  Try building a logistic regression model based on *your* feature vectors and check the accuracy.

**Second:** Then try a couple of other classifiers (preferably from different classification philosophies . . .) and see how accurate those results are.  Which model(s) seem to work best, or is there no difference?

**Warning:** Your feature vectors are going to be huge (probably around 100K features IIRC) so it's quite possible that some classifiers won't work.  But be sure to check out 'which sklearn classifiers work with sparse data'.
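Here is a minimal sketch of how such a comparison could be set up. It uses three scikit-learn classifiers that accept sparse TF-IDF matrices directly (logistic regression, multinomial naive Bayes, and a linear SVM). The toy reviews are made up, so the accuracies it prints say nothing about what you will see on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy cleaned reviews and labels (assumption; use your own data).
reviews = [
    "great film loved acting", "wonderful story great cast",
    "loved every minute", "brilliant funny moving",
    "dull boring film", "terrible acting awful story",
    "boring waste time", "awful dull avoid",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(reviews)   # sparse feature matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": MultinomialNB(),
    "linear SVM": LinearSVC(),
}

# Fit each model on the same split and record its test accuracy.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, results[name])
```

Keeping the models in a dictionary makes it easy to add or remove a classifier while holding the train-test split fixed, so the accuracies are comparable.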

**Related:**  It seems to me that such a high dimensional space might be exactly what PCA was invented for--but I could be wrong.  If time permits, you might want to try to reduce your feature space *dramatically* in terms of dimensions and see if anything interesting happens to your results.
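If you want to experiment with this, one option is sketched below. Note the swap: it uses scikit-learn's `TruncatedSVD` rather than PCA itself. `TruncatedSVD` is closely related but does not center the data, which lets it work efficiently on sparse TF-IDF matrices. The toy reviews and the choice of 2 components are assumptions for illustration only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy cleaned reviews (assumption; use your own data).
reviews = [
    "great film loved",
    "dull boring film",
    "loved acting great story",
    "boring story avoid",
]
X = TfidfVectorizer().fit_transform(reviews)   # sparse, one column per word

# Project the sparse matrix onto a small number of components.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)               # dense array, shape (n_reviews, 2)

print(X.shape, "->", X_reduced.shape)
print("variance explained:", svd.explained_variance_ratio_.sum())
```

The reduced matrix is dense, so it can be fed to classifiers that do not accept sparse input, and with 2 components it can also be plotted directly.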

In [None]:
# put your work here