In [165]:
# Data manipulation and linear algebra
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Text preprocessing
from unidecode import unidecode

# Text Embedding
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Tensorflow.keras
from tensorflow.keras import Sequential
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping


# Binary Sentiment Analysis of French Movie Reviews

### Objectives
1. Text cleaning
2. Text preprocessing for custom embedding Neural Network
3. Train RNN model for sentiment analysis

⚠️ This notebook will be your final deliverable. 
- Make sure it can run "restart and run all"
- Delete useless code cells
- Do not "clear output"

# 0. Load data

Our dataset contains roughly 30,000 French movie reviews, each labelled with a binary polarity: 1 (positive) or 0 (negative).

In [3]:
# We load the dataset for you
data = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/movies.csv')
data

Unnamed: 0,review,polarity
0,Ça commence à devenir énervant d'avoir l'impre...,0
1,"J'ai aimé ce film, si il ressemble a un docume...",1
2,Une grosse merde ce haneke ce faire produire p...,0
3,"Beau mélodrame magnifiquement photographié, ""V...",1
4,A la poursuite du diamant vers est un film pro...,1
...,...,...
29946,Le meilleur film de super-héros derrière le ba...,1
29947,Un drame qui est d'une efficacité remarquable....,1
29948,"Une daube hollywoodienne de plus, aucun intérê...",0
29949,Et voilà un nouveau biopic sur la star du X Li...,0


In [4]:
# Separate the target (polarity) from the feature (raw review text)
y = data.polarity
X = data.review

# Check the class balance
print(y.value_counts())

1    15051
0    14900
Name: polarity, dtype: int64


In [5]:
# We check various reviews
print(f'polarity: {y[0]} \n')
print(X[0])

polarity: 0 

Ça commence à devenir énervant d'avoir l'impression de voir et revoir le même genre de film à savoir : la comédie romantique, surement le genre le plus prolifique de le production française actuelle. Le problème c'est que l'on a souvent affaire à des niaiseries de faible niveau comme celui ci. Avec un scenario ultra balisé et conventionnel, c'est à se demander comment ça peut passer les portes d'un producteur. Bref cette sempiternel histoire d'un homme mentant au nom de l'amour pour reconquérir une femme et qui à la fin se prend son mensonge en pleine figure est d'une originalité affligeante, et ce n'est pas la présence au casting de l'ex miss météo Charlotte Le Bon qui rêve surement d'avoir la même carrière que Louise Bourgoin qui change la donne.


# 1. Clean Text

❓ We need to give a _quick & dirty_ cleaning to all the sentences in the dataset. Create a variable `X_clean` of the same shape as `X`, with the following cleaning applied:
- Replace French accented characters with their unaccented equivalents using the [unidecode.unidecode()](https://pypi.org/project/Unidecode/) function
- Convert all uppercase letters to lowercase
- Remove any characters outside of a-z, for instance using `str.isalpha()`

😌 You will be given the solution `X_clean` in the next question to make sure you can complete the challenge
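
The three cleaning steps above can be sketched with the standard library only. This is a minimal illustration, not the reference solution: the helper name `clean_text_stdlib` is hypothetical, and `unicodedata` is used here as a stand-in for `unidecode`, so results may differ for characters that have no accent decomposition (the ligature `œ`, for instance, is simply dropped by this sketch). It also collapses repeated spaces, which the list above does not ask for.

```python
import re
import unicodedata


def clean_text_stdlib(text):
    """Quick & dirty cleaning: strip accents, lowercase, keep a-z and spaces."""
    # Decompose accented characters ("é" becomes "e" + a combining accent),
    # then drop the combining marks. Assumption: stand-in for unidecode.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))

    lowered = no_accents.lower()

    # Remove everything that is not a lowercase ASCII letter or a space
    letters_only = re.sub(r"[^a-z ]", "", lowered)

    # Collapse runs of spaces left behind by removed punctuation
    return re.sub(r" +", " ", letters_only).strip()


print(clean_text_stdlib("Ça commence à devenir énervant : la comédie romantique !"))
# ca commence a devenir enervant la comedie romantique
```

Note that apostrophes are removed without inserting a space, so `d'avoir` becomes `davoir`, just as in a character-level filter based on `str.isalpha()`.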

In [39]:
# test text
text = X[0]
print(text)

Ça commence à devenir énervant d'avoir l'impression de voir et revoir le même genre de film à savoir : la comédie romantique, surement le genre le plus prolifique de le production française actuelle. Le problème c'est que l'on a souvent affaire à des niaiseries de faible niveau comme celui ci. Avec un scenario ultra balisé et conventionnel, c'est à se demander comment ça peut passer les portes d'un producteur. Bref cette sempiternel histoire d'un homme mentant au nom de l'amour pour reconquérir une femme et qui à la fin se prend son mensonge en pleine figure est d'une originalité affligeante, et ce n'est pas la présence au casting de l'ex miss météo Charlotte Le Bon qui rêve surement d'avoir la même carrière que Louise Bourgoin qui change la donne.


In [40]:
### YOUR CODE
def clean_text(text):
    # remove accents
    text = unidecode(text)
    
    # lower case
    text = text.lower()
    
    # keep only letters and spaces (punctuation and digits are dropped)
    text = ''.join(c for c in text if c.isalpha() or c == ' ')
    
    return text

X_clean = X.apply(clean_text)

In [42]:
X.shape == X_clean.shape

True

In [41]:
# Test
X_clean[0]

'ca commence a devenir enervant davoir limpression de voir et revoir le meme genre de film a savoir  la comedie romantique surement le genre le plus prolifique de le production francaise actuelle le probleme cest que lon a souvent affaire a des niaiseries de faible niveau comme celui ci avec un scenario ultra balise et conventionnel cest a se demander comment ca peut passer les portes dun producteur bref cette sempiternel histoire dun homme mentant au nom de lamour pour reconquerir une femme et qui a la fin se prend son mensonge en pleine figure est dune originalite affligeante et ce nest pas la presence au casting de lex miss meteo charlotte le bon qui reve surement davoir la meme carriere que louise bourgoin qui change la donne'

In [43]:
from nbresult import ChallengeResult

result = ChallengeResult('C14',
    shape = X_clean.shape,
    first_sentence = X_clean[0]
)
result.write()

# 2. Preprocess data

Now that we have clean sentences, we need to convert each one into a fixed-size list of integers.
- For example, the sentence `"this was good"` should become something like `array([1, 3, 18, 0, 0, 0, ...0], dtype=int32)`, where each integer maps to one _unique_ word in your corpus of sentences.

❓ Create a numpy ndarray `X_input` of shape (29951, 100) that will be the direct input to your Neural Network. 

- 29951 represents the number of reviews in the dataset `X_clean`
- 100 represents the maximum number of words to keep for each movie review.
- It must contain only numerical values, without any `NaN`
- In the process, compute and save the number of _unique_ words in your cleaned corpus under `vocab_size` variable
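
One way to read the requirements above is "map each word to an integer id, then truncate or zero-pad every review to 100 ids". The sketch below illustrates that idea on a toy corpus with plain Python and NumPy. The helper names `build_vocab` and `encode_and_pad` are hypothetical, whitespace tokenization is an assumption, and this is not necessarily how the provided `movies_X_input.csv` was built.

```python
import numpy as np


def build_vocab(sentences):
    """Map each unique word to an integer id, starting at 1 (0 is kept for padding)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode_and_pad(sentences, vocab, maxlen=100):
    """Turn sentences into an int32 array of shape (n_sentences, maxlen)."""
    X = np.zeros((len(sentences), maxlen), dtype="int32")
    for i, sentence in enumerate(sentences):
        # Keep at most the first `maxlen` words; unknown words are skipped
        ids = [vocab[w] for w in sentence.split() if w in vocab][:maxlen]
        X[i, :len(ids)] = ids  # remaining positions stay at 0 (padding)
    return X


toy_corpus = ["this was good", "this was bad really bad"]
vocab = build_vocab(toy_corpus)
vocab_size = len(vocab)  # number of unique words in the toy corpus
X_toy = encode_and_pad(toy_corpus, vocab, maxlen=6)

print(vocab_size)  # 5
print(X_toy)
# [[1 2 3 0 0 0]
#  [1 2 4 5 4 0]]
```

Unlike a bag-of-words count matrix, this representation preserves word order, which is what a recurrent layer needs.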

👉 First, you **must** start again from the clean solution below (14 MB)

In [44]:
X_clean = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/movies_X_clean.csv")['review']
X_clean

0        ca commence a devenir enervant de voir et revo...
1        aime ce film si il ressemble a un documentaire...
2        une grosse merde ce haneke ce faire produire p...
3        beau melodrame magnifiquement photographie ver...
4        a la poursuite du diamant vers est un film pro...
                               ...                        
29946    le meilleur film de derriere le batman de nola...
29947    un drame qui est efficacite remarquable un fil...
29948    une daube hollywoodienne de plus aucun interet...
29949    et voila un nouveau biopic sur la star du x li...
29950    un film qui fait vieux avec des acteurs pas to...
Name: review, Length: 29951, dtype: object

In [134]:
# Fit a CountVectorizer on the whole corpus to build the vocabulary (bag-of-words counts)
vec1 = CountVectorizer()
vocab = vec1.fit_transform(X_clean)

In [135]:
# Count the unique tokens found by CountVectorizer
# (its default token pattern ignores single-character tokens such as "a")
vocab_size = len(vec1.get_feature_names_out())
vocab_size

62353

In [136]:
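# Keep only the 100 most frequent tokens: each row holds the counts of those tokens
# in one review, so the shape is (n_reviews, 100) but word order is not preserved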
vec = CountVectorizer(max_features=100)
X_input = vec.fit_transform(X_clean)
X_input = X_input.toarray()

In [137]:
# Explore
pd.DataFrame(X_input, columns = vec.get_feature_names_out()).head(1)

Unnamed: 0,acteurs,assez,au,aussi,aux,avec,beaucoup,bien,bon,ca,...,tout,tres,trop,un,une,vie,voir,vous,vraiment,vu
0,0,0,2,0,0,1,0,0,1,2,...,0,0,0,1,1,0,1,0,0,0


In [138]:
# Verify Type
type(X_input)

numpy.ndarray

In [139]:
# Verify Shape
X_input.shape

(29951, 100)

In [140]:
# Verify NaN Values
pd.DataFrame(X_input).isna().sum().sum()

0

In [141]:
from nbresult import ChallengeResult

result = ChallengeResult('C1415',
    type_X = type(X_input),
    shape = X_input.shape, 
    input_1 = X_input[1], 
)
result.write()

# 3. Neural Network

❓ Create and fit a Neural Network that takes `X_input` and `y` as input, to classify each sentence's sentiment as positive or negative

- You cannot use transfer learning or other pre-existing Word2Vec models
- You must use a "recurrent" architecture to _capture_ a notion of order in the sentences' words
- The performance metric for this task is "accuracy"
- Store your model in a variable `model` 
- Store the result of your `model.fit()` in a variable `history`. 
- ⚠️ `history.history` must include a measure of the `val_accuracy` at each epoch.
- You don't need to cross-validate your model

😌 Don't worry, you will not be judged on your computing power: you should be able to reach an accuracy significantly better than the baseline in less than 3 minutes, even without a GPU.

👉 But first, you **must** start again from the solution below (70 MB)

In [142]:
url = 'https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/movies_X_input.csv'
X_input = np.genfromtxt(url, delimiter=',', dtype='int32')

In [149]:
# ___Step 1: Holdout
X_train, X_test, y_train, y_test = train_test_split(X_input, y, test_size=0.3)

In [153]:
X_train[0]

array([14870,   293,   250,    32,    36,    13,     6,    15, 34653,
           3,    15,   158,   180,   162,    15,   636,     8,  6420,
         100,   356,   321,    55, 10015, 14253,    49,    15, 26715,
       13008,     3,    15,  7336,  6185,    59,  1005,   596,     6,
       43785,     8,   298,    59,  1531, 14252,   229,   274, 13466,
          73,     6, 15069,  5031,  5124, 47336,     8,    19,     6,
        2036,    87,    10,    13,     2,    32,    59,  5780, 48053,
       36037,  1021,   614,    55,    31,     6,    15,  1464,  1465,
         353,    46,   443, 55251,    88,    32,    59, 21356,   120,
         630,     6,  7305,   114,  1996,   450,   122,  1447,  5641,
       14870,  3047,     8,  3769,    36,   201,    79,    13,  5802,
        1046], dtype=int32)

In [163]:
# ___Step 2: Baseline Model
# A simple baseline: always predict the most frequent label in y_train
unique, counts = np.unique(y_train, return_counts=True)
counts = dict(zip(unique, counts))
print('Number of labels in train set', counts)

y_pred = 0 if counts[0] > counts[1] else 1

baseline_accuracy = accuracy_score(y_test, [y_pred]*len(y_test))

print('Baseline accuracy: ', baseline_accuracy)

Number of labels in train set {0: 10430, 1: 10535}
Baseline accuracy:  0.5025595370576452


In [224]:
X_input.max()

62379
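
The largest index matters because an embedding layer behaves like a lookup table with one row per possible index, from 0 up to the maximum. With a maximum index of 62379, the table needs at least 62379 + 1 rows. The NumPy sketch below illustrates the idea on toy sizes; it is an analogy for the lookup, not Keras's implementation, and `max_index` and `output_dim` are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

max_index = 7    # toy stand-in for X_input.max()
output_dim = 4   # toy embedding size

# One row per possible index 0..max_index, hence max_index + 1 rows
embedding_matrix = rng.normal(size=(max_index + 1, output_dim))

sequence = np.array([3, 7, 1, 0])     # a toy encoded sentence
vectors = embedding_matrix[sequence]  # row lookup: one vector per word
print(vectors.shape)                  # (4, 4)

# With only max_index rows, the largest index has no row to look up
too_small = rng.normal(size=(max_index, output_dim))
try:
    too_small[sequence]
    lookup_failed = False
except IndexError:
    lookup_failed = True
print(lookup_failed)                  # True
```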

In [228]:
# ___Step 3: Model

model = Sequential()
model.add(layers.Embedding(input_dim=int(X_input.max())+1, output_dim=100))  # input_dim must exceed the largest index in X_input (62379),
                                                                   # and each word is represented by a vector of size 100
model.add(layers.LSTM(25, activation='tanh'))
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [229]:
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

In [239]:
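# Stop once val_loss (the default monitored quantity) has not improved for 5 epochs,
# then restore the weights from the best epoch
es = EarlyStopping(patience=5, restore_best_weights=True)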

history = model.fit(X_train, y_train, 
          batch_size = 32,
          epochs=100,
          validation_split=0.3,
          callbacks=[es]
         )

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100


In [240]:
history.history

{'loss': [0.13665771484375,
  0.13583135604858398,
  0.12147479504346848,
  0.10539568215608597,
  0.08776560425758362,
  0.08190279453992844,
  0.07474476099014282,
  0.0738164409995079],
 'accuracy': [0.957206130027771,
  0.9574787020683289,
  0.9582964181900024,
  0.9660647511482239,
  0.9704259037971497,
  0.9716524481773376,
  0.974923312664032,
  0.9757410287857056],
 'val_loss': [0.34605300426483154,
  0.3748410642147064,
  0.34586039185523987,
  0.4484981596469879,
  0.38980528712272644,
  0.41468220949172974,
  0.4187156558036804,
  0.3788270354270935],
 'val_accuracy': [0.8794912695884705,
  0.8828299045562744,
  0.886963427066803,
  0.8882352709770203,
  0.8912559747695923,
  0.887122392654419,
  0.8813990354537964,
  0.8818759918212891]}
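
The `val_loss` values above can be replayed through a simple patience counter to see why training stopped after 8 of the 100 requested epochs. This is a hand-written sketch of the early-stopping idea, assuming the callback monitors `val_loss` with `patience=5`; the function `simulate_early_stopping` is hypothetical and is not the Keras implementation.

```python
def simulate_early_stopping(val_losses, patience):
    """Replay validation losses through a patience counter.

    Returns (best_epoch, stop_epoch), both 1-indexed. stop_epoch is None
    when the patience is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None


# val_loss values copied from history.history above
val_loss = [
    0.34605300426483154, 0.3748410642147064, 0.34586039185523987,
    0.4484981596469879, 0.38980528712272644, 0.41468220949172974,
    0.4187156558036804, 0.3788270354270935,
]

best_epoch, stop_epoch = simulate_early_stopping(val_loss, patience=5)
print(best_epoch, stop_epoch)  # 3 8
```

Under these assumptions the lowest `val_loss` is reached at epoch 3, and the five following epochs bring no improvement, so training stops at epoch 8. The highest `val_accuracy` occurs at epoch 5, so the weights restored by `restore_best_weights=True` are those with the best validation loss, not necessarily the best validation accuracy.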

In [241]:
model_accuracy = model.evaluate(X_test, y_test)[1]



In [242]:
model_accuracy > baseline_accuracy

True

In [243]:
from nbresult import ChallengeResult
result = ChallengeResult('C1517',
                         history=history.history)
result.write()