In [4]:
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
np.random.seed(0)

## Movie Review Classifier 🍿📽️

In this assignment, we'll be training a model to classify movie reviews as 'good' or 'bad.'\
The data consists of 40,000 real move reviews from IMBD.\


We'll load the data as a zipped csv. \
Notice that `pd.read_csv()` can take a URL as the path argument and that we can read in a compressed file without first expanding it if we specify the `compression` format!

In [5]:
data_url = './data/movie_reviews.zip'
df = pd.read_csv(data_url, compression='zip')

In [6]:
df.head()

Unnamed: 0,text,label
0,If you haven't seen this movie than you need t...,1
1,"but ""Cinderella"" gets my vote, not only for th...",0
2,"This movie is pretty cheesy, but I do give it ...",1
3,"I have not seen a Van Damme flick for a while,...",1
4,This is a 'sleeper'. It defines Nicholas Cage....,1


In [7]:
df.text[0]

"If you haven't seen this movie than you need to. It rocks and you have to watch it. It is so funny and will make you laugh your guts out!! so you have to watch it and i saw it about a billion and a half times and still think it is funny. so you have to. yes i have memorized the whole movie and could quote it to you from start to finish. you must see this move. it is also cute because it is half a chick flick. if you don't watch it then you are really missing out.this movie even has cute guys in it and that is always a bonus. so in summary watch the movie now and trust me you will not be making a mistake. did i mention the music is good too. So you should like it if you enjoy music. This is a movie that they rated correctly and it will work for anyone."

In [8]:
df.label.unique()

array([1, 0], dtype=int64)

We see that the dataset consists of text reviews and binary labels. Intuitively, the positive class is "good" while the negative is "bad."

Here are two examples from the dataset:

In [9]:
labels = {0: 'bad', 1: 'good'}
seen = {'bad': False, 'good': False}
for i in range(df.shape[0]):
    label = df.loc[i,'label']
    if not seen[labels[label]]:         #为了表示这俩标签是否都被遍历到过
        # display/print combination used to appease Ed's strange output behavior
        display(df.loc[i, 'text'])
        print()
        display(f"label: {labels[label]}")
        print()
        seen[labels[label]] = True
    if all(val == True for val in seen.values()):
        break

"If you haven't seen this movie than you need to. It rocks and you have to watch it. It is so funny and will make you laugh your guts out!! so you have to watch it and i saw it about a billion and a half times and still think it is funny. so you have to. yes i have memorized the whole movie and could quote it to you from start to finish. you must see this move. it is also cute because it is half a chick flick. if you don't watch it then you are really missing out.this movie even has cute guys in it and that is always a bonus. so in summary watch the movie now and trust me you will not be making a mistake. did i mention the music is good too. So you should like it if you enjoy music. This is a movie that they rated correctly and it will work for anyone."




'label: good'




'but "Cinderella" gets my vote, not only for the worst of Disney\'s princess movies, but for the worst movie the company made during Walt\'s lifetime. The music is genuinely pretty, and the story deserves to be called "classic." What fails in this movie are the characters, particularly the title character, who could only be called "the heroine" in the loosest sense of the term.<br /><br />After a brief prologue, the audience is introduced to Cinderella. She is waking up in the morning and singing "A Dream is A wish Your Heart Makes." This establishes her as an idealist (and thus deserving of our sympathy). Unfortunately, the script gives us no clue as to what she is dreaming about. Freedom from her servant role? The respect of her step-family? Someone to talk to besides mice and birds? In one song (cut from the movie but presented in the special features section of the latest DVD) Cinderella relates her wish that there could be many of her so she could do her work more efficiently. You




'label: bad'




**Some Preprocessing**

In the 2nd example, we can see some html tags inside the review text.

Complete the `remove_br()` function by providing its call to `re.sub()` with a regex that removes those pesky "\<br />" tags from an input string, `x`.\
Speciffically, we should replace 2 consecutive occurances of "\<br />" with a single space (can you see why?).

**Hint:** It is good practice to use 'raw' string when writing regular expressions to ensure that special characters are treated correctly. Raw strings are appended with an 'r' like this: `r'this is a raw string'`

In [10]:
# please fill this code block!
# fill in the regular expression
# Define the regular expression to match two consecutive "<br />" tags and replace them with a space.
remove_br = lambda x: re.sub(r'<br />\s*<br />', ' ', x)


Use the dataframe's `apply()` method to apply `remove_br` to each review in both train and test.

In [11]:
# please fill this code block!
# Apply the function on the 'text' column of the dataframe
df['text'] = df['text'].apply(remove_br)

And we can see that the tags have been removed!

In [12]:
df.loc[4,'text']

"This is a 'sleeper'. It defines Nicholas Cage. The plot is intricate and totally absorbing. The ending will blow you away. See it whenever you have the opportunity."

Don't worry about any newline characters or backslashes you may see before apostrophes in the examples above. This is just a quirk of how Jupyter displays strings by default.\
We don't see that these characters if we explicitly `print` the string.

In [13]:
example_str = df.loc[4,'text']
print(example_str)

This is a 'sleeper'. It defines Nicholas Cage. The plot is intricate and totally absorbing. The ending will blow you away. See it whenever you have the opportunity.


We'll continue our preprocessing by next **removing punctuation**.\
But first, let's keep a copy of the data *with* punctuation. This will be useful at the end of the notebook when we want to display the original text of specific observations.

In [14]:
# store copy of data with punctuation
df_raw = df.copy()

The next regex we need is a bit more involved.\
**This should match any non-whitespace, any non-alphanumeric characters, and underscores** (strangly, underscores are not covered by the first 2 conditions).

**Hints:**
- `\w` matches alphanumeric characters
- `\s` matches whitespace
- `[]` can be used to denote a set of characters. ex: `r'[ab]'` will match on 'a' *or* 'b'
- `^` at the beginning of a character set denotes *negation*. ex: `r'[^0-9]'` will matching any non-integer
- `|` is the *logical or* operator. ex: `r'cat|dog'` will match the strings 'cat' *or* 'dog' 
- There are many helpful sites for testing regexes. [Here's a nice one](https://www.regextester.com/).

In [15]:
# please fill this code block!
# create a regex that will match the characters described above 
punc_regex = r'[^\w\s]'

Here we'll use an alternative to the `apply` approach we saw above.\
Pandas has its own set of built-in string methods which includes a version of `replace`. But unlike Python's `str.replace()` this can actually use regexes!

In [16]:
df['text'] = df.text.str.replace(punc_regex, '', regex=True) # remove punctuation

If all went well we can see that punctuation has been removed from our dataset.

In [17]:
example_str = df.loc[4,'text']
print(example_str)

This is a sleeper It defines Nicholas Cage The plot is intricate and totally absorbing The ending will blow you away See it whenever you have the opportunity


**Train/Test Split**

Rather than splitting the data directly with `train_test_split` we'll instead use it to generate indices for the train and test data.\
This may seem strange, but there is a good reason for it. These indices will later allow us to recover the original, unprocessed text from `df_raw` for any given training and test observations. 

Notice too that we are stratifying on the label. This will help ensure that good and bad reviews appear in the same proportions in both train and test.

In [18]:
# generate indices to designate train and test observations
# 保证标签均衡satisfy
train_idx, test_idx = train_test_split(range(df.shape[0]), test_size=0.2, random_state=0, stratify=df['label'])

In [19]:
# Separate the predictor from the response
x = df.text.values
y = df.label.values

In [20]:
# Create train and test sets using the generated indices
x_train = x[train_idx]
y_train = y[train_idx]
x_test = x[test_idx]
y_test = y[test_idx]

**Building the Classifier Pipeline**\
**Step 1: Vectorizor**

It's true that there are still several preprocessing steps to be done such as converting to lowercase and tokenizing the reviews, but these can be done for using sklearn's [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). 

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

Instantiate a `TfidfVectorizer` with parameters such that it will:
- set all reviews to lowercase
- remove english stopwords
- exclude words that occur in less than 1 review in 10,000
- exclude words that occur in more than 90% of reviews

**Hint:** Reading the documentation, you'll see the arguments you need are `lowercase`, `stop_words`, `min_df`, and `max_df`

In [22]:
# please fill this code block!
#通过TF-IDF来确认单词在某个文本中的重要程度，TF是term frequency而idf是iverse document frequency,一个是在单句中的出现频率，一个是在所有文档中出现的反频率
vec = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    min_df=0.0001,
    max_df=0.9
)
newx = vec.fit_transform(x_train)

**Step 2: Classifier**

We'll use logistic regression with l2 regularization as our classifier model. The [LogisticRegressionCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html?highlight=logisticregressioncv#sklearn.linear_model.LogisticRegressionCV) object allows us to easily tune for the best regularization parameter.

In [23]:
from sklearn.linear_model import LogisticRegressionCV

With 40,000 training observations and each word in the vectorizer's vocabulary counting acting as a predictor training could be slow.\
This issue is exacerbated when using cross validation as we need fit the model multiple times!\
We'll set our classifier CV parameters so as to help keep the training time down to around 30 seconds or so.
- l2 penalty (e.g., Ridge)
- 10 iterations per fit (remember, logistic regression has no closed form solution for the betas!)
- 5-fold CV
- random state of 0 (the fitting can be stochastic)

In [139]:
# please fill this code block!
# Instantiate our Classifier
clf = LogisticRegressionCV(penalty='l2',max_iter=20,random_state=0,cv=5)

**Step 3: Pipeline**

Any text data going into our classifier will have to first be converted to numerical data by our vectorizer.\
One way to do this would be to:
1. fit the vectorizor on the training data
2. transform a dataset with the fitted vectorizer
3. pass the transformed data to the classifier

(1) only needs to be done once, but (2) & (3) would need to be done manually for train, test, and any other data we want to give them model.\
This would be tedious! Luckily, sklearn's [Pipline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline) object allow use to connect one more 'transformers' (such as a scaler or vectorizer) with a model.

In [25]:
from sklearn.pipeline import make_pipeline

Use [make_pipeline()](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html?highlight=make_pipeline#sklearn.pipeline.make_pipeline) to connect the vectorizor, `vec`, and our classifier, `clf`, into a single pipeline.

**Hint:** You can set `verbose=True` to see the individual steps during the fit process later.

In [148]:
# please fill this code block!
# Construct the pipeline
vec = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    min_df=0.00001,
    max_df=0.9
)
clf = LogisticRegressionCV(penalty='l2',max_iter=10,random_state=0,cv=5)
pipe = make_pipeline(vec,clf)

pipe.fit(x_train, y_train)

# please fill this code block!
# Predict class labels on test data
y_pred = pipe.predict(x_test)

# Predict probabilities of the positive on the test data
y_pred_proba = pipe.predict_proba(x_test)[:,1]

# Calculate test accuracy (there are several ways to do this)
from sklearn.metrics import accuracy_score
test_acc = accuracy_score(y_test, y_pred)

print(f"test accuracy: {test_acc:0.3f}")

test accuracy: 0.894


**Step 4: Fitting**

When it comes to fitting, we can treat the pipeline object as if it were the classifier object itself, and simply call `fit` on the pipeline.

In [123]:
# For the sake of time, we are fitting quickly and we may not converge
# We'll supress those pesky warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
# We also ignore FutureWarnings due to version issues on Ed
simplefilter("ignore", category=(ConvergenceWarning, FutureWarning))

In [137]:
### edTest(test_fit) ###
# Fit the model via the pipeline
# fit对于模型来讲，就是train了，而对于其他的预处理则是，形成一个预处理器
pipe.fit(x_train, y_train)

We can inspect the steps of the pipeline.

In [29]:
pipe.get_params()['steps']

[('tfidfvectorizer',
  TfidfVectorizer(max_df=0.9, min_df=0.0001, stop_words='english')),
 ('logisticregressioncv',
  LogisticRegressionCV(cv=5, max_iter=10, random_state=0))]

By default they are named using the all lowercase class name of each object.\
We can use these names to access the fitted objects inside. Here we see the size of our vectorizer's vocabulary.

In [30]:
features = pipe.get_params()['tfidfvectorizer'].get_feature_names_out()
print('# of features:', len(features))

# of features: 36657


There are too many to print, but we can peek at a random sample.

In [31]:
sample_size = 40
feature_sample_idx = np.random.choice(len(features), size=sample_size, replace=False)
print(np.array(features)[feature_sample_idx])

['marco' 'goodit' 'rabbit' 'ration' 'april' 'teutonic' 'talbot' 'hdtv'
 'tensely' 'wikipedia' 'rhythms' 'invokes' 'drew' 'meyer' 'lestrade'
 'snorting' 'thenwife' 'seasonal' 'hardships' 'grilled' 'norm' 'harry'
 'pomp' 'dissapointment' 'broccoli' 'taunt' 'sherlock' '10000000' 'genx'
 'helsings' 'animosity' 'os' 'ethical' 'evacuated' 'dryers' 'sue'
 'belushi' 'parenting' 'reloading' 'travolta']


Similarly, we can access the fitted logistic model and see what regularization parameter was used.

In [32]:
best_C = pipe.get_params()['logisticregressioncv'].C_[0]
print(f'Best C from cross-validation: {best_C:.4f}')

Best C from cross-validation: 2.7826


**Step 5: Prediction**

Just like we did when fitting, we can treat the pipeline object as the classifier when making predictions.\
Predict on the test data to get:
1. class labels
2. probabilities of being the positive class (i.e., 'good' reviews)
3. test accuracy

In [149]:
# please fill this code block!
# Predict class labels on test data
y_pred = pipe.predict(x_test)

# Predict probabilities of the positive on the test data
y_pred_proba = pipe.predict_proba(x_test)[:,1]

# Calculate test accuracy (there are several ways to do this)
from sklearn.metrics import accuracy_score
test_acc = accuracy_score(y_test, y_pred)

print(f"test accuracy: {test_acc:0.3f}")

test accuracy: 0.894


Can you get better than 0.893 by tweaking the preprocessing, or vetorizer and classifier parameters? Perhaps inspecting how our model makes its predictions may help us decide how we might improve the model in the future.

### Kaggle Submission Process for Movie Review Classification

In the subsequent steps, we'll process the test dataset provided on Kaggle, produce a predicted output, and generate a CSV file suitable for submission. This will allow us to evaluate our model's predictions on Kaggle. Access the competition through this [link](https://www.kaggle.com/competitions/dsaa-6100-movie-review-classification/): **DSAA 6100 Movie Review Classification**.

When participating in the competition on Kaggle, please ensure your displayed username follows the format "StudentID_Name". This will help the teaching assistants to easily identify and verify your scores. For instance, change your Kaggle display name to a format similar to "50013772_Yupeng Xie" before submitting.

In [34]:
# 1. Load the 'test_data.csv' file
test_data = pd.read_csv('data/test_data.csv')

# Extract reviews from 'test_data.csv' assuming the column name is "text"
test_reviews = test_data['text']

# 2. Predict sentiments using the trained model
y_pred_kaggle = pipe.predict(test_reviews)

# 3. Create a dataframe for Kaggle submission
# Assuming 'test_data' has a column named 'Id' for identifying each review
submission = pd.DataFrame({'Id': test_data['Id'], 'Category': y_pred_kaggle})

# 4. Save the predictions to a .csv file for submission
submission.to_csv('kaggle_submission.csv', index=False)

print("Kaggle submission file saved as 'kaggle_submission.csv'")

Kaggle submission file saved as 'kaggle_submission.csv'


### Interpretation

Below we'll use the `eli5` library to have some fun interpreting what is driving our model's predictions on specific test observations.

- [ELI5](https://eli5.readthedocs.io/en/latest/) is a Python library which allows to visualize and debug various Machine Learning models using unified API. It has built-in support for several ML frameworks and provides a way to explain black-box models.

In [35]:
# please fill this code block!
# Install ELI5

In [36]:
# For interpretation
import eli5
# for parsing/formating eli5's HTML output
from bs4 import BeautifulSoup
# for displaying formatted HTML output
from IPython.display import HTML

Here are the words driving positive class predictions.

In [37]:
eli5.show_weights(clf,feature_names=vec.get_feature_names_out(), top=25)

Weight?,Feature
+9.181,excellent
+8.755,710
+8.585,great
+7.259,amazing
+7.041,wonderful
+6.905,best
+6.701,perfect
+6.234,favorite
+6.048,brilliant
… 18540 more positive …,… 18540 more positive …


Hmm, those digits like 710, 810, and 410 driving predictions seems strange. What might they represent?\

We'll use the 'raw' data with punctuation when inspecting the data (See! It is coming in handy!)

In [38]:
x_train_raw = df_raw.text[train_idx].values
x_test_raw = df_raw.text[test_idx].values

In [39]:
df_raw[df.text.str.contains(' 710 ')].iloc[0].text

"I have seen a lot of PPV's in the past but this is the most entertaining, intense PPV and the most complete DVD i have ever seen. The DVD extras are worth it because they it gives a different view of how the wrestlers act after the show (such as the chris benoit interview/edge interview), some glimpse into the Monday Night Wars era,the first match of Hogan winning tag title gold and some promotional talk. Additionally there is a good music video. 1. Tag Team Table match: Bubby Ray and Spike Dudley vs. Eddie Guerro and Chris benoit 7/10 This was a pretty good intense match to start off the show. Not too many holds and just pure raw physicallity. Spike can hold his own in tables matches and Guerro and Benoit gave good pure wrestling skills on the mat.  2. WWE Crusierweight championship: Jamie Noble w/ Nidia v. Billy Kidman 3/10 The crowd really didn't care about either wrestler and didn't get interested until Kidman did a shooting star press. Usually people expect a lot of high flying i

In [73]:
vec.vocabulary_

{'bad': 2707,
 'feeling': 12033,
 'seconds': 28735,
 'film': 12209,
 'pair': 23504,
 'overworked': 23404,
 'probably': 25374,
 'left': 18763,
 'western': 35736,
 'blew': 3751,
 'scene': 28420,
 'grew': 14225,
 'later': 18591,
 'obligatory': 22634,
 'opening': 22966,
 'apparent': 1904,
 'reason': 26372,
 'lowered': 19486,
 'rear': 26365,
 'view': 35018,
 'mirror': 21024,
 'shadow': 29120,
 'seat': 28715,
 'minutes': 21005,
 'credits': 7664,
 'treated': 33523,
 'overhead': 23314,
 'shot': 29424,
 'car': 5112,
 'rocking': 27665,
 'forth': 12822,
 'dramatic': 9922,
 'music': 21722,
 'informs': 16663,
 'killing': 18088,
 'taking': 32229,
 'place': 24525,
 'makeout': 19826,
 'session': 29039,
 '27': 308,
 'hours': 15765,
 'idiotic': 16089,
 'psychotics': 25693,
 'compelled': 6692,
 'drive': 10020,
 'desert': 8821,
 'southwest': 30485,
 'going': 13868,
 'like': 19034,
 'demented': 8642,
 'abbot': 535,
 'costello': 7407,
 'shocking': 29360,
 'twists': 33863,
 'end': 10782,
 'merely': 20680,
 '

These are actually numerical ratings embedded in the reviews! Looking at the text without the punctuation made it hard for us to see this at first.

Here's a helper function used to remove some extraneous things from `eli5`'s output. We just want to see the highlighted text.\
You don't need to read through the function but it is here as a nice resource/example. 🤓

In [95]:
def eli5_html(clf, vec, observation):
    """
    helper function for nicely formatting and displaying eli5 output
    """
    # Get info on is driving a given observation's predictions
    eli5_results = eli5.show_prediction(estimator=clf, doc=observation, vec=vec, targets=[True], target_names=['bad', 'good'],feature_names=vec.get_feature_names_out())
    #eli5_results = eli5.show_prediction(estimator=clf, doc=observation, vec=vec, targets=[True], target_names=['bad', 'good'])
    soup = BeautifulSoup(eli5_results.data, 'html.parser')
    #print(soup)
    # Remove a table we don't want
    soup.table.decompose()
    # Remove the first <p> tag with unwanted text
    soup.p.decompose()
    # Display the newly formatted HTML!
    
    display(HTML(str(soup)))

Now all you need to do is find the specific observations requested.\
You'll need your `y_pred_proba` values for this section to find which elements from `x_test_raw` to select.

**Hint:** [np.argsort()](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html), [np.flip()](https://numpy.org/doc/stable/reference/generated/numpy.flip.html?highlight=flip#numpy.flip), and [np.abs()](https://numpy.org/doc/stable/reference/generated/numpy.absolute.html) may be useful here. 

### What are the **5 worst** movie reviews in the test set according to your model? 🍅

In [75]:
# please fill this code block!
# Find indices of 5 worst reviews
worst5_indices = np.argsort(y_pred_proba)[:5]
worst5 = x_test_raw[worst5_indices]

In [96]:
for i, review in enumerate(worst5):
    style = 'background-color:black;color:white;font-weight:bold;padding:4px'
    display(HTML(f"<p style={style}>Bad Movie #{i+1} 🍅</p>"))
    eli5_html(clf, vec, review)

### What are the **5 best** movie review in the test set according to your model? 🏆

In [97]:
# please fill this code block!
# Find indices of 5 best reviews
best_idx = np.argsort(y_pred_proba)[-5:]
best5 = x_test_raw[best_idx]

In [98]:
for i, review in enumerate(best5):
    display(HTML(f"<p style={style}>Good Movie #{i+1} 🏆</p>"))
    eli5_html(clf, vec, review)

What are the **5 most 'meh'** movie review in the test set according to your model? 😐\
That is, which reviews are the most neutral according to your model?\
Upon reading some of these reviews you may find their sentiment to actually *not* be very ambiguous. What might be confusing our model?

In [99]:
# please fill this code block!
# Find indices of the 5 most neutral reviews
probs = pipe.predict_proba(x_test)
prob_diff = abs(probs[:, 1] - probs[:, 0])
meh5 = x_test_raw[prob_diff.argsort()[:5]]

In [100]:
for i, review in enumerate(meh5):
    display(HTML(f"<p style={style}>'Meh' Movie #{i+1} 😐</p>"))
    eli5_html(clf, vec, review)

Despite some difficulties with a few of the 'meh' movies, our model is actually pretty good! In fact, it works so well you can actually use it to find _mistakes_ in the manually labeled data!\
This can be done by inspecting which training observation predictions differ the most from the provided labels.\

**Write your own review**

Finally, you can try writing a review of your own and see what your model does with it!

In [116]:
my_review = """
            this is the best movie i have seen, though i don't like the actor who is picky.
            """

# Remove punctuation using your regex from earlier
my_review = re.sub(punc_regex, '', my_review)
# Remove leading & trailing whitespace
# and put into a numpy array (which the model expects)
my_review = np.array([my_review.strip()])
my_review

array(['this is the best movie i have seen though i dont like the actor who is picky'],
      dtype='<U76')

In [117]:
my_review_proba = pipe.predict_proba(my_review)[:,1][0]
my_review_label = pipe.predict(my_review)[0]
print('predicted class:', my_review_label)
print('predicted probability:', my_review_proba)

predicted class: 1
predicted probability: 0.7971789596170398


In [118]:
display(HTML(f"<p style={style}>My Review 🍿</p>"))
eli5_html(clf, vec, my_review[0])