# CNN Classifier: Positive and negative movie reviews

Here we're going to use the same movie review dataset you used for PS1 and in the [Colab notebook for Class 6.3 or 7.1: Neural Nets for Classification](https://colab.research.google.com/drive/14RWCVTA8F56v_6d1H6Xg0kGOPdOGmjOG?usp=sharing). This time we'll train a convolutional neural network (CNN) to predict whether a movie review is positive or negative.


## Import statements

In [34]:
#keeping import statements
import nltk
from nltk import FreqDist
import glob
from nltk.corpus import stopwords
import math
import re
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split


## Getting and preparing the data

The first several code blocks are just processing the data -- in exactly the same way we did in the Class 6.3 Colab notebook, where we trained a multilayer perceptron.

In [35]:
#reading in our data
! curl -O https://raw.githubusercontent.com/gaylorav/NLPFinal/main/bg_descriptions.csv



In [36]:
descriptions_df = pd.read_csv("bg_descriptions.csv")
X=descriptions_df['description']
y=descriptions_df['sentiment']
X=np.array(X)
y=np.array(y)
X_pretrain, X_pretest, y_pretrain, y_pretest = train_test_split(X, y, test_size=0.2, random_state=25)
print(X_pretrain[0])

The players are cowboys who confront each other in a duel, playing their Revolver cards simultaneously. Pure minimalist double guessing in the heart of Far West! Each player takes the 9 cards of one color. He then discards 2 cards of his choice, face down. The game is played in a series of rounds, each composed of several turns (Duels). At each turn, the players secretly choose one of their cards and place it in front of them, next to the other cards that have been already played. If the difference between the 2 cards is less or equal to 3, the highest card wins as much Gold as the difference between the cards. If the difference between the 2 cards is more than 3, the lowest card wins as much Gold as the difference between the cards.  If both cards are equal, there is no winner. 2 Gold are placed between the 2 cards. description from the publisher 


In [37]:
###########################
## READ IN TRAINING DATA ##
###########################
nltk.download('stopwords')

stops = stopwords.words('english')
stops.extend([",", ".", "!", "?", "'", '"', "I", "i", "n't", "'ve", "'d", "'s"])

allwords = []

# creating lists for each sentiment
poswords = []
neuwords = []
negwords = []
one_hot_y = []
integer_y = []

#probably should change the names because X_train is used later
for i in range(len(X_pretrain)):
  desc = X_pretrain[i]
  sentiment = y_pretrain[i]
  words = desc.rstrip().split()
  # keep each non-stop word once per description
  toextend = list(set(w for w in words if w not in stops))
  allwords.extend(toextend)
  if sentiment==-1:
    negwords.append(list(set(toextend)))
    integer_y.append(0)
    one_hot_y.append([1,0,0])
  elif sentiment==0:
    neuwords.append(list(set(toextend)))
    integer_y.append(1)
    one_hot_y.append([0,1,0])
  else:
    poswords.append(list(set(toextend)))
    integer_y.append(2)
    one_hot_y.append([0,0,1])

print(poswords[:25])
print(negwords[:25])
print(neuwords[:25])
print(len(poswords),len(neuwords),len(negwords))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[['9', 'Pure', 'At', 'choice,', 'If', 'publisher', 'placed', 'card', 'Revolver', 'The', '2', 'lowest', 'already', 'Each', '3,', 'discards', 'Far', 'secretly', 'confront', 'rounds,', 'highest', 'them,', 'guessing', 'cards', 'He', 'minimalist', 'difference', 'composed', 'series', 'equal,', 'next', 'heart', 'played', '(Duels).', 'double', 'color.', 'choose', 'one', 'equal', 'takes', 'much', 'cards.', 'place', 'played.', 'turn,', 'duel,', 'down.', 'turns', 'winner.', 'less', 'face', 'game', 'several', 'Gold', 'players', 'player', 'front', 'playing', 'description', 'cowboys', 'wins', 'simultaneously.', 'West!'], ['wasnt', 'day', 'standing.', 'every', 'course,', 'From', 'play', 'That', 'draw', 'cage', 'quite', 'Game', 'In', 'card', 'left', 'cleanest', 'something', 'line', 'mess', 'food', 'Each', 'poo', 'Poo', 'eight', 'break.', 'monkey', 'world,', 'Its', 'flinging.', 'cards', 'fifteen', 'done', 'Catalyst', 'play.', 'thing', 'anywhere', 'either', 'waiting', 'another', '-', 'card,', 'furious',
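The word lists printed above still carry punctuation (`'choice,'`, `'West!'`) and capitalized stop words (`'The'`, `'If'`), because the descriptions are split on whitespace only and the stop list is lowercase. A minimal sketch of one possible alternative, assuming a simple regex tokenizer and a tiny made-up stop list (`tokenize` and `demo_stops` are illustrative names, not part of the notebook):

```python
import re

# Hypothetical, tiny stop list for illustration; the notebook uses NLTK's English list.
demo_stops = {"the", "of", "in", "a", "is"}

def tokenize(text, stops=demo_stops):
    """Lowercase, keep runs of letters/digits/apostrophes, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in stops]

sample = "Pure minimalist double guessing in the heart of Far West!"
print(tokenize(sample))
# ['pure', 'minimalist', 'double', 'guessing', 'heart', 'far', 'west']
```

Whether lowercasing helps depends on the task; it merges `'The'` with `'the'` but also loses any signal carried by capitalization.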

In [38]:
## Get the 1000 most frequent words
## These will be your features
wfreq = FreqDist(allwords)
top1000 = wfreq.most_common(1000)

training = []
traininglabel = []
balanced_training = []
balanced_traininglabel = []
# Take each review, and create a feature vector.
# For each word in the top1000, if that review contains
# that word, set its vector value to 1; otherwise 0.
count = 0
for p in poswords:
    vec = []
    for t in top1000:
        if t[0] in p:
            vec.append(1)
        else:
            vec.append(0)
    if count <=2500:
      balanced_training.append(vec)
      balanced_traininglabel.append([0,0,1])
      count+=1
    training.append(vec)
    traininglabel.append(1)

count = 0
for n in negwords:
    vec = []
    for t in top1000:
        if t[0] in n:
            vec.append(1)
        else:
            vec.append(0)
    if count <=2500:
      balanced_training.append(vec)
      balanced_traininglabel.append([1,0,0])
      count+=1
    training.append(vec)
    traininglabel.append(-1)  # negative: same raw sentiment value as in testinglabel

count = 0
for n in neuwords:
    vec = []
    for t in top1000:
        if t[0] in n:
            vec.append(1)
        else:
            vec.append(0)
    if count <=2500:
      balanced_training.append(vec)
      balanced_traininglabel.append([0,1,0])
      count+=1
    training.append(vec)
    traininglabel.append(0)  # neutral: same raw sentiment value as in testinglabel
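The feature-building loops above can be condensed into a few lines. A minimal sketch on toy data, assuming `collections.Counter` in place of NLTK's `FreqDist` (both count occurrences); `docs`, `top_k`, and `to_binary_vector` are made-up names for illustration:

```python
from collections import Counter

# Toy "reviews", already tokenized (made-up data).
docs = [["fun", "card", "game"], ["dull", "card", "game"], ["fun", "dice"]]

# Count each word once per document, as the notebook does via set().
doc_freq = Counter(w for d in docs for w in set(d))

# Keep the top_k most frequent words as features.
top_k = 3
vocab = [w for w, _ in doc_freq.most_common(top_k)]

def to_binary_vector(doc, vocab):
    """1 if the document contains the vocabulary word, else 0."""
    present = set(doc)
    return [1 if w in present else 0 for w in vocab]

vectors = [to_binary_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Turning each document into a set first makes the membership test constant-time, which matters once there are 1000 features and thousands of documents.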

In [39]:
print(len(traininglabel))
print(len(training[0]))
print(training[0])
print(traininglabel[0])

12471
1000
[1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
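In the feature-building cell, the `count <=2500` check caps each class at its first 2,501 examples (the counter starts at 0). A sketch of an alternative that samples at random instead of taking the first examples, assuming NumPy's `default_rng`; the helper `balance_classes` and the toy labels are hypothetical:

```python
import numpy as np

def balance_classes(labels, per_class, seed=0):
    """Return sorted indices with at most per_class examples of each label."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        n = min(per_class, idx.size)
        keep.append(rng.choice(idx, size=n, replace=False))
    return np.sort(np.concatenate(keep))

toy_labels = [2, 2, 2, 2, 2, 0, 0, 1, 1, 1]
idx = balance_classes(toy_labels, per_class=2)
print(np.bincount(np.asarray(toy_labels)[idx]))  # [2 2 2]
```

The returned indices can be used to slice both the feature matrix and the label array, so the two stay aligned.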

In [40]:
# Now read in the testing data.
# For each testing example, create a vector
# of binary features just as you did for the training data.
# Keep both the raw sentiment label and a one-hot version;
# the model below is validated against the one-hot labels.
testing = []
testinglabel = []
one_hot_y_test = []

for i in range(len(X_pretest)):
  desc = X_pretest[i]
  sentiment = y_pretest[i]
  words = desc.rstrip().split()
  vec = []
  for t in top1000:
      if t[0] in words:
          vec.append(1)
      else:
          vec.append(0)
  testing.append(vec)
  testinglabel.append(sentiment)
  if sentiment==-1:
    one_hot_y_test.append([1,0,0])
  elif sentiment==0:
    one_hot_y_test.append([0,1,0])
  else:
    one_hot_y_test.append([0,0,1])

print(len(testing))
print(len(testing[0]))
print(testing[0])

3118
1000
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
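The `if`/`elif` chain that builds the one-hot labels can also be written as an indexing operation. A small sketch, assuming the sentiment column only contains -1, 0, and 1 (the array `sentiments` is made up):

```python
import numpy as np

# Sentiment labels in the same coding as the CSV: -1, 0, 1 (made-up values).
sentiments = np.array([-1, 0, 1, 1, -1])

# Shift to 0..2 so each label can index a row of the 3x3 identity matrix.
class_ids = sentiments + 1
one_hot = np.eye(3, dtype=int)[class_ids]

# argmax recovers the class index from a one-hot row.
recovered = one_hot.argmax(axis=1) - 1
print(one_hot)
```

This gives `[1,0,0]` for negative, `[0,1,0]` for neutral, and `[0,0,1]` for positive, the same layout as the loop above.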

## Formatting the data

We need the data to be in a specific shape for the CNN. This is specific to our data and to the CNN we are going to define later on.
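Keras `Conv1D` layers expect input of shape `(batch, steps, channels)`, so a 2-D feature matrix needs a trailing channel axis of size 1. A toy illustration with made-up sizes (4 examples, 6 features):

```python
import numpy as np

# Toy stand-in for the feature matrix: 4 examples, 6 binary features.
X = np.zeros((4, 6), dtype=int)

# Add a single channel axis: (4, 6) -> (4, 6, 1).
X_cnn = X.reshape(-1, 6, 1)

# Equivalent, without hard-coding the feature count.
X_cnn_alt = X[..., np.newaxis]

print(X.shape, X_cnn.shape, X_cnn_alt.shape)  # (4, 6) (4, 6, 1) (4, 6, 1)
```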

In [41]:
import numpy as np
X_train = np.array(training)
X_test = np.array(testing)
y_train = np.array(traininglabel)
y_test = np.array(testinglabel)

X_btrain = np.array(balanced_training)
y_btrain = np.array(balanced_traininglabel)

print('Shape of training data: ')
print(X_train.shape)
print(y_train.shape)
print('Shape of test data: ')
print(X_test.shape)
print(y_test.shape)

X_train = X_train.reshape(-1, 1000, 1)
X_test = X_test.reshape(-1, 1000, 1)

X_btrain = X_btrain.reshape(-1, 1000, 1)

print('Shape of training data: ')
print(X_train.shape)
print(y_train.shape)
print('Shape of test data: ')
print(X_test.shape)
print(y_test.shape)


Shape of training data: 
(12471, 1000)
(12471,)
Shape of test data: 
(3118, 1000)
(3118,)
Shape of training data: 
(12471, 1000, 1)
(12471,)
Shape of test data: 
(3118, 1000, 1)
(3118,)


## Initializing the model

The code blocks below import the Keras layers, define the CNN architecture, and compile the model.

In [42]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, Dense, Flatten



In [43]:
# Initialize the model

# Here's the model.
# We have:
#    2 convolutional blocks (Conv1D + max pooling + dropout)
#    1 flatten layer
#    1 hidden Dense layer with dropout
#    a final Dense layer to get the three-class classification

# For the convolutional layers, we are pooling with max
# (i.e., after the filters are applied, keep the max of
# each window of 2 neighboring values)
# and we have a dropout of 0.5 (i.e., randomly zero out half
# the activations during training so you don't overfit).

# You can change some of these parameters or add or remove
# layers to see whether it might improve your results.
# (Probably leave Flatten and Dense as-is, though.)


model = Sequential([
    Conv1D(filters=32, kernel_size=3, activation='relu'),
    MaxPooling1D(pool_size=2),
    Dropout(0.5),

    Conv1D(filters=64, kernel_size=3, activation='relu'),
    MaxPooling1D(pool_size=2),
    Dropout(0.5),

    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.5),

    Dense(3, activation='sigmoid')  # have 3 nodes for the output layer b/c we have 3 classes
])
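To see how the sequence length shrinks through the model above, the shape arithmetic can be worked out by hand. This sketch assumes the Keras defaults for the layers as written: `padding='valid'` and stride 1 for `Conv1D`, and a pooling stride equal to `pool_size` for `MaxPooling1D`. The helper names are made up.

```python
def conv1d_out(length, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return length - kernel_size + 1

def pool1d_out(length, pool_size):
    """Output length of non-overlapping max pooling (stride == pool_size)."""
    return length // pool_size

steps = 1000
steps = conv1d_out(steps, 3)   # first Conv1D:        1000 -> 998
steps = pool1d_out(steps, 2)   # first MaxPooling1D:   998 -> 499
steps = conv1d_out(steps, 3)   # second Conv1D:        499 -> 497
steps = pool1d_out(steps, 2)   # second MaxPooling1D:  497 -> 248

flat = steps * 64              # Flatten: 248 steps x 64 filters
dense_params = flat * 64 + 64  # weights + biases of Dense(64)
print(steps, flat, dense_params)  # 248 15872 1015872
```

Under these assumptions, most of the model's parameters sit in the first `Dense` layer, which is worth keeping in mind when changing filter counts or pooling sizes.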



In [44]:
# Compile the model
# Again, you can change the optimizer, the loss function, or which metric you'd like to report
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Print summary
model.summary()
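The model above pairs a sigmoid output with `binary_crossentropy`, which scores each of the three outputs as an independent yes/no decision. Since each description has exactly one label, a softmax output with `categorical_crossentropy` is a common alternative. The NumPy sketch below, with made-up logits, only illustrates the difference between the two; it is not the notebook's training code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, -1.0]])   # made-up scores for one example
target = np.array([[1.0, 0.0, 0.0]])    # one-hot: class 0

p_sig = sigmoid(logits)    # three independent probabilities
p_soft = softmax(logits)   # one distribution over the three classes

# Binary cross-entropy averages a yes/no loss over the three outputs.
bce = -np.mean(target * np.log(p_sig) + (1 - target) * np.log(1 - p_sig))
# Categorical cross-entropy only scores the probability of the true class.
cce = -np.sum(target * np.log(p_soft), axis=-1).mean()

print(p_sig.sum(), p_soft.sum())
print(bce, cce)
```

The sigmoid outputs need not sum to 1, while the softmax outputs always do. Either way, `np.argmax` over the outputs picks the predicted class.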

In [45]:
# Finally train the model!
# You can change the number of epochs and the batch size, too.
# We train on the balanced subset (X_btrain, y_btrain) and
# validate on the full test set with its one-hot labels.
one_hot_y = np.array(one_hot_y)
one_hot_y_test = np.array(one_hot_y_test)

model.fit(X_btrain, y_btrain, epochs=15, batch_size=32, validation_data=(X_test, one_hot_y_test))

Epoch 1/15
235/235 ━━━━━━━━━━━━━━━━━━━━ 18s 64ms/step - accuracy: 0.3935 - loss: 0.6411 - val_accuracy: 0.5574 - val_loss: 0.5840
Epoch 2/15
235/235 ━━━━━━━━━━━━━━━━━━━━ 17s 71ms/step - accuracy: 0.4774 - loss: 0.5967 - val_accuracy: 0.5776 - val_loss: 0.5567
Epoch 3/15
235/235 ━━━━━━━━━━━━━━━━━━━━ 15s 64ms/step - accuracy: 0.5104 - loss: 0.5831 - val_accuracy: 0.5471 - val_loss: 0.5646
Epoch 4/15
235/235 ━━━━━━━━━━━━━━━━━━━━ 15s 63ms/step - accuracy: 0.5422 - loss: 0.5668 - val_accuracy: 0.5446 - val_loss: 0.5641
Epoch 5/15
235/235 ━━━━━━━━━━━━━━━━━━━━ 15s 63ms/step - accuracy: 0.5578 - loss: 0.5542 - val_accuracy: 0.5334 - val_loss: 0.5686
Epoch 6/15
235/235 ━━━━━━━━━━━━━━━━━━━━ 20s 62ms/step - accuracy: 0.5770 - loss: 0.5359 - val_accuracy: 0.5449 - val_loss: 0.5585
Epoch 7/15
...

<keras.src.callbacks.history.History at 0x7e733140dfd0>

In [46]:
from sklearn.metrics import classification_report

model.evaluate(X_test, one_hot_y_test)

# Check results: convert predicted scores and one-hot labels to class indices
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true = np.argmax(one_hot_y_test, axis=1)

print(classification_report(y_true, y_pred_classes, target_names=["negative", "neutral", "positive"]))

# How many test examples were assigned to each class
print(np.unique(y_pred_classes, return_counts=True))


98/98 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - accuracy: 0.4881 - loss: 0.6287
98/98 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step
              precision    recall  f1-score   support

    negative       0.41      0.52      0.46       673
     neutral       0.27      0.43      0.33       687
    positive       0.77      0.52      0.62      1758

    accuracy                           0.50      3118
   macro avg       0.48      0.49      0.47      3118
weighted avg       0.58      0.50      0.52      3118

(array([0, 1, 2]), array([ 844, 1089, 1185]))
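The per-class precision and recall in the report above come from comparing the argmax of each prediction with the true class. A small NumPy sketch on made-up scores shows the calculation (`y_prob`, `y_hat`, and `cm` are illustrative names and values, not the notebook's):

```python
import numpy as np

# Made-up predicted scores and true class indices (0=neg, 1=neu, 2=pos).
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7],
                   [0.6, 0.3, 0.1]])
y_true = np.array([0, 1, 2, 2, 2])

y_hat = y_prob.argmax(axis=1)  # highest-scoring class per row

# Confusion matrix: rows are true classes, columns are predictions.
cm = np.zeros((3, 3), dtype=int)
np.add.at(cm, (y_true, y_hat), 1)

precision = cm.diagonal() / cm.sum(axis=0)  # correct / predicted, per class
recall = cm.diagonal() / cm.sum(axis=1)     # correct / actual, per class
print(cm)
print(precision, recall)
```

In this toy case every class is predicted at least once. If a class is never predicted, its column sum is 0 and the precision is undefined, which `classification_report` reports as 0.00.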


unbalanced:

```
              precision    recall  f1-score   support

    negative       0.13      0.01      0.01       673
     neutral       0.00      0.00      0.00       687
    positive       0.56      0.99      0.72      1758

    accuracy                           0.56      3118
   macro avg       0.23      0.33      0.24      3118
weighted avg       0.35      0.56      0.41      3118
```

balanced:

```
              precision    recall  f1-score   support

    negative       0.41      0.52      0.46       673
     neutral       0.27      0.43      0.33       687
    positive       0.77      0.52      0.62      1758

    accuracy                           0.50      3118
   macro avg       0.48      0.49      0.47      3118
weighted avg       0.58      0.50      0.52      3118
```
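One way to read the two reports: the unbalanced run's accuracy of 0.56 equals the share of positive examples in the test set, which fits a model that predicts "positive" almost every time (recall 0.99). The sketch below recomputes that baseline, plus the macro and weighted F1 averages of the balanced run, using only the numbers printed in the reports:

```python
# Test-set support per class, taken from the reports above.
support = {"negative": 673, "neutral": 687, "positive": 1758}
total = sum(support.values())

# Accuracy of always predicting the largest class.
majority_baseline = max(support.values()) / total

# F1 scores from the balanced run, as printed in its report.
f1 = {"negative": 0.46, "neutral": 0.33, "positive": 0.62}
macro_f1 = sum(f1.values()) / len(f1)                       # every class counts equally
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total   # classes weighted by support

print(round(majority_baseline, 2), round(macro_f1, 2), round(weighted_f1, 2))
# 0.56 0.47 0.52
```

Because the test set is imbalanced, macro F1 (0.24 unbalanced vs. 0.47 balanced) is a fairer comparison of the two runs than accuracy alone.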

