# TP1: Machine learning (reminder)
Master LiTL - 2021-2022

## Requirements
In this practical session, we will explore machine learning models for NLP applications; specifically, we will train a classifier for sentiment analysis on a French dataset of movie reviews. 
For these exercises, we will use Python (v3.*) and a number of modules for data processing and machine learning: *numpy*, *scipy*, *scikit-learn*, *pandas* and *spacy*. 
They are all already available in Colab. 
If you want to use your own computer, you will need to make sure these are installed (e.g. using the command *pip*). If you’re using *Miniconda*, you can use the command
```
conda install <modulename>
```
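
If you work on your own computer, it can help to check which modules are importable before you start. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and note that *scikit-learn* is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the modules listed above
# (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "scipy", "sklearn", "pandas", "spacy"]

def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing modules:", missing_modules(REQUIRED))
```

Any name printed here still needs to be installed with *pip* or *conda*.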


First, download the archive for the practical session from the course page to an appropriate working directory, and unzip it. Under Linux, you can issue the following commands:
```
$ unzip tp3.zip 
$ cd tp3
```


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Task and dataset

We’ll go through the following stages of an NLP machine learning pipeline, using sentiment classification as an application:
* data preprocessing (tokenization) 
* feature extraction
* model training
* evaluation

As a dataset, we’ll be using a set of reviews for television series in French, extracted from the website allocine.fr. 
The dataset consists of the text of the review, as well as a sentiment label (positive or negative).

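The dataset is divided into a training part (5576 reviews, about 90%) and a test part (for evaluation, 544 reviews, about 10%). The training part is itself split into the training set proper (5027 reviews) and a development set (549 reviews), as the shapes printed in Exercise 1 show. 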
The dataset is balanced, which means positive and negative instances are evenly distributed. 
Additionally, training and test set contain reviews about different TV series (in order to avoid possible bias when evaluating).
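
Both properties (balanced labels, no series shared between splits) are easy to verify. Here is a minimal sketch on made-up toy records; the keys `movie_id` and `sentiment` match the column names of the dataset, but the values and the helper names are invented for illustration.

```python
from collections import Counter

# Toy records standing in for rows of the dataset (values are made up).
train_rows = [
    {"movie_id": 1, "sentiment": 1},
    {"movie_id": 1, "sentiment": 0},
    {"movie_id": 2, "sentiment": 1},
    {"movie_id": 2, "sentiment": 0},
]
test_rows = [
    {"movie_id": 3, "sentiment": 0},
    {"movie_id": 3, "sentiment": 1},
]

def label_counts(rows):
    """Count how many reviews carry each sentiment label."""
    return Counter(row["sentiment"] for row in rows)

def shared_series(rows_a, rows_b):
    """Return the ids that appear in both splits (should be empty)."""
    return {r["movie_id"] for r in rows_a} & {r["movie_id"] for r in rows_b}

print(label_counts(train_rows))
print(shared_series(train_rows, test_rows))
```

On the real data, the same checks can be run on the `sentiment` and `movie_id` columns once the files are loaded.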

### Exercise 1: Preprocessing (code given)

First, we’ll load the training, development and test sets and explore the data.



In [None]:
import pandas as pd

train_path = "allocine_train.tsv"
dev_path = "allocine_dev.tsv"
test_path = "allocine_test.tsv"

train = pd.read_csv(train_path, header=0, delimiter='\t', quoting=3) 
print("TRAIN:", train.shape)
print(train.columns.values)
print(train['sentiment'][0], train['review'][0])
print()

dev = pd.read_csv(dev_path, header=0, delimiter='\t', quoting=3) 
print("DEV:", dev.shape)
print()

test = pd.read_csv(test_path, header=0, delimiter='\t', quoting=3) 
print("TEST:", test.shape)

TRAIN: (5027, 4)
['movie_id' 'user_id' 'sentiment' 'review']
0 Stephen King doit bien ricaner en constatant cette navrante histoire de disparus, les scénaristes semblent s'être inspirés de ses oeuvres mais ont bien moins son talent que celui du business. Quel perte de temps que de regarder ces personnages perdus au centre d'une histoire sans fin et sans intérêt, où 2 ou 3 épisodes suffisent pour décrocher, à l'inverse d'une série comme Desperate housewives dont les dialogues, les scénarii et les personnages contribuent sans cesse à relancer l'intérêt et le plaisir au fil des épisodes. Pourtant mes goûts initiaux m'auraient porté davantage du côté de la série fantastique. Il ne faut préjuger de rien! A bon entendeur...

DEV: (549, 4)

TEST: (544, 4)


We need to preprocess the dataset to be able to properly extract features from it.
In order to do so, we’ll create a function that makes use of spacy’s preprocessing pipeline. 
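
To make the target format concrete (lowercased tokens separated by single spaces), here is a rough regex-based stand-in. It is only an approximation for illustration, not spaCy's tokenizer, and the name `rough_tokenize` is hypothetical.

```python
import re

# Rough approximation: a word followed by an apostrophe (elision, e.g. "l'"),
# a plain word, or a run of punctuation. spaCy's rules are more elaborate.
TOKEN_RE = re.compile(r"\w+'|\w+|[^\w\s]+")

def rough_tokenize(raw_review):
    """Lowercase the review and return its tokens joined by single spaces."""
    return " ".join(TOKEN_RE.findall(raw_review.lower()))

print(rough_tokenize("À l'inverse d'une série fantastique..."))
```

Separating punctuation and elided articles from words means that, for instance, *série* and *série.* end up as the same feature later on.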

You need to download the spaCy model for French. Two options (the second is the one run in the next cell): 

```
!python -m spacy download fr_core_news_sm
# or
import spacy.cli
spacy.cli.download("fr_core_news_sm")
```

In [None]:
import spacy.cli
spacy.cli.download("fr_core_news_sm")

✔ Download and installation successful
You can now load the model via spacy.load('fr_core_news_sm')


In [None]:
# Preprocessing = tokenize data
import spacy
nlp = spacy.load('fr_core_news_sm', disable=['tagger', 'parser', 'ner'])


def preprocess_data( dataset ):
  num_reviews = dataset['review'].size
  print("#Reviews =", num_reviews)
  dataset_tok = []
  for i in range(num_reviews):
      clean_review = review_to_tokens(dataset['review'][i])
      dataset_tok.append(clean_review)
  for i, r in enumerate(dataset_tok[:2]):
      print('\n', i, r) 
  return dataset_tok

def review_to_tokens(raw_review):
    doc = nlp(raw_review)
    tokenList = [token.text for token in doc]
    tokenized_string = ' '.join(tokenList)
    tokenized_string_lowercase = tokenized_string.lower()
    return tokenized_string_lowercase

print("-- Preprocess TRAIN:")
train_tok = preprocess_data( train )

print("\n-- Preprocess DEV:")
dev_tok = preprocess_data( dev )

print("\n-- Preprocess TEST:")
test_tok = preprocess_data( test )

-- Preprocess TRAIN:
#Reviews = 5027

 0 stephen king doit bien ricaner en constatant cette navrante histoire de disparus , les scénaristes semblent s' être inspirés de ses oeuvres mais ont bien moins son talent que celui du business . quel perte de temps que de regarder ces personnages perdus au centre d' une histoire sans fin et sans intérêt , où 2 ou 3 épisodes suffisent pour décrocher , à l' inverse d' une série comme desperate housewives dont les dialogues , les scénarii et les personnages contribuent sans cesse à relancer l' intérêt et le plaisir au fil des épisodes . pourtant mes goûts initiaux m' auraient porté davantage du côté de la série fantastique . il ne faut préjuger de rien ! a bon entendeur ...

 1 excellentissime ! une série à l' apparence toute calme et lisse , qui se révèle être un véritable noeud de problèmes , de secrets , de mensonges ... les actrices sont vraiment toutes très bonnes dans leurs rôles , avec une petite préférence pour bree , qui pète complètement 

### Exercise 2: Feature extraction 

Now it’s time to decide which features to use in our classifier. We’ll start with simple bag of words features.
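
To see what a bag-of-words representation contains, here is a small pure-Python sketch of the idea. It is not scikit-learn's implementation: the helper names are hypothetical, the token pattern imitates the default of `CountVectorizer` (tokens of at least two word characters, which is why punctuation and forms like *l'* do not show up in the vocabulary), and ties between equally frequent words may be resolved differently.

```python
import re
from collections import Counter

# Same idea as CountVectorizer's default token pattern:
# keep tokens made of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def fit_vocabulary(docs, max_features):
    """Keep the most frequent words of the training corpus, sorted alphabetically."""
    counts = Counter(tok for doc in docs for tok in TOKEN.findall(doc))
    top = [word for word, _ in counts.most_common(max_features)]
    return {word: index for index, word in enumerate(sorted(top))}

def transform(docs, vocab):
    """Turn each document into a vector of counts over a fixed vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in TOKEN.findall(doc):
            if tok in vocab:  # words unseen at training time are ignored
                row[vocab[tok]] += 1
        rows.append(row)
    return rows

train_docs = ["une série géniale , vraiment géniale", "une série nulle"]
dev_docs = ["série géniale mais fin décevante"]

vocab = fit_vocabulary(train_docs, max_features=3)  # learned on TRAIN only
train_X = transform(train_docs, vocab)
dev_X = transform(dev_docs, vocab)                  # reuses the TRAIN vocabulary
print(vocab, train_X, dev_X)
```

The key point is that the vocabulary is learned once on the training set and then reused as is: the dev set must be mapped onto the same columns, otherwise the classifier's weights would no longer line up with the features.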

▶▶ **TODO: write the code to vectorize the dev set.**

In [None]:
# Vectorizing data: BOW representation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_bow = CountVectorizer( analyzer = 'word', max_features = 500 )

print("Vectorize TRAIN:")
train_feats_bow = vectorizer_bow.fit_transform( train_tok )
print(train_feats_bow.shape)

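# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = vectorizer_bow.get_feature_names_out().tolist()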
print("Vocabulary:", vocab[:20])

# --------------------------------------------------------
# TODO: Write the code to vectorize the DEV set
# --------------------------------------------------------
print("\nVectorize DEV:")
dev_feats_bow = vectorizer_bow.transform( dev_tok )
print(dev_feats_bow.shape)

# Vocabulary should remain the same!
print("Vocabulary:", vocab[:20])

Vectorize TRAIN:
(5027, 500)
Vocabulary: ['10', '24', 'absolument', 'accroché', 'acteur', 'acteurs', 'action', 'actrice', 'actrices', 'adore', 'ados', 'ah', 'ai', 'ailleurs', 'aime', 'ainsi', 'ait', 'alias', 'alors', 'ambiance']

Vectorize DEV:
(549, 500)
Vocabulary: ['10', '24', 'absolument', 'accroché', 'acteur', 'acteurs', 'action', 'actrice', 'actrices', 'adore', 'ados', 'ah', 'ai', 'ailleurs', 'aime', 'ainsi', 'ait', 'alias', 'alors', 'ambiance']


### Exercise 3: Classification (Code given)

We’ll start with Naive Bayes, the simplest classifier, which nevertheless often performs well.
Train the classifier and report its performance on the dev set.
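
As a reminder of what the model computes, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-one smoothing (the role played by `alpha=1.0` in scikit-learn). The function names and the toy counts are invented for illustration; this is not scikit-learn's code.

```python
import math

def train_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    log_prior, log_lik = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each feature in class c, plus the smoothing term
        totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(totals)
        log_lik[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_lik

def predict_nb(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    log_prior, log_lik = model
    scores = {
        c: log_prior[c] + sum(n * w for n, w in zip(x, log_lik[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy count vectors over the features ["génial", "nul", "série"]
X = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
model = train_nb(X, y)
print(predict_nb(model, [1, 0, 1]), predict_nb(model, [0, 1, 0]))
```

Smoothing matters here: without it, *nul* would have probability zero in the positive class, and a single occurrence would rule that class out entirely.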



In [None]:
## Classification with NAIVE BAYES

# Train the classifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
classifier = MultinomialNB()
classifier.fit(train_feats_bow, train['sentiment'])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [None]:
# Compute the performance on the dev set
score = classifier.score( dev_feats_bow, dev['sentiment'] )
print(score)

0.8178506375227687


#### Exercise 3-b (code given)
* What does the score represent?
* Look at the instances that were misclassified. Can you see why each review was misclassified?
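
As a hint for the first question, the sketch below computes a score by hand on toy labels. The helper name `accuracy` and the toy lists are made up; the final line is a consistency check inferred from the dev score printed above (0.8178... × 549 ≈ 449), not an output of the original notebook.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)

gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
print(accuracy(gold, pred))  # 3 correct predictions out of 5

# Inferred from the reported dev score: 449 correct reviews out of 549
print(449 / 549)
```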

* What does the score represent? It is the mean accuracy on the given data and labels, as stated in the documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.score

```score(X, y, sample_weight=None)```
Return the mean accuracy on the given test data and labels.

In [None]:
## Look at misclassified instances
pred = classifier.predict( dev_feats_bow )
print(pred)  # array of predicted labels (not very readable as is)
# print(dev['sentiment'])  # gold labels
print()

print('Misclassified examples: ')
count_err = 0
for i in range(len(pred)):
    if pred[i] != dev['sentiment'][i]:
        print( "\nGOLD=", dev['sentiment'][i], "PRED=",pred[i] , i, dev['review'][i])
        count_err += 1
        
print( "CHECK: ", "#Total=", len(pred), "#Errors=", count_err, "Acc=", (len(pred)-count_err)/len(pred))

[1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 1 0 1
 0 0 1 1 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 1 0 0 1 1 1 1 1 1 0 0
 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0
 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 1 1
 1 1 0 1 1 0 1 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1
 1 1 1 0 1 0 0 0 0 1 0 0 1 1 1 0 0 1 1 0 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0
 0 0 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0
 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1
 1 1 1 0 1 0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0
 1 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 1 1 1
 0 1 1 1 1 0 0 1 0 1 1 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 0 0 1 1 0 0 0 1 1 1
 1 1 0 0 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 1 0
 0 1 0 0 1 1 1 1 1 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 1 1
 1 1 1 0 1 0 1 1 1 1 1 1 

### Exercise 4: Experiment with different feature sets.

Here, we'll just try bi-grams.

▶▶ **TODO: write the code to vectorize the data into bigrams. Keep 'max_features = 500'. Then retrain and evaluate the classifier.**

We could also have tried to:
  * Exclude a list of stopwords (high-frequency words that are considered too general to be meaningful, such as *une* or *le*)
  * Experiment with n-grams with n>2 
  * Combine features (e.g. BOW + bi-grams)

Can you think of other features to include?
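
To illustrate these options, here is a small pure-Python sketch that extracts n-grams from a tokenized review and combines stopword-filtered unigrams with bigrams. The stopword list is a tiny made-up sample and the function names are hypothetical.

```python
# Tiny illustrative list, not a full French stopword list
STOPWORDS = {"une", "le", "la", "de"}

def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def combined_features(tokenized_review):
    """Unigrams without stopwords, followed by all bigrams."""
    tokens = tokenized_review.split()
    unigrams = [t for t in tokens if t not in STOPWORDS]
    return unigrams + ngrams(tokens, 2)

print(combined_features("je n' aime pas la série"))
```

Bigrams such as *aime pas* capture negation that single words miss, at the price of a much sparser feature space.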

In [None]:
# --------------------------------------------------------
# TODO: Write the code to build a vectorizer that extracts bigrams
# --------------------------------------------------------

vectorizer_big = CountVectorizer( analyzer = 'word', max_features = 500, ngram_range=(2,2) )

# --------------------------------------------------------
# TODO: Vectorize the train and dev sets
# --------------------------------------------------------
print("Vectorize TRAIN:")
train_feats_big = vectorizer_big.fit_transform( train_tok )
print(train_feats_big.shape)

print("\nVectorize DEV:")
dev_feats_big = vectorizer_big.transform( dev_tok )
print(dev_feats_big.shape)

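# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = vectorizer_big.get_feature_names_out().tolist()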
print("\nVocabulary:", vocab[:20])

# --------------------------------------------------------
# TODO: Train a Naive Bayes classifier and evaluate on dev
# --------------------------------------------------------
print("\nTraining classifier")
classifier_big = MultinomialNB()
classifier_big.fit(train_feats_big, train['sentiment'])

# Compute the performance on the dev set
score = classifier_big.score( dev_feats_big, dev['sentiment'] )
print(score)

Vectorize TRAIN:
(5027, 500)

Vectorize DEV:
(549, 500)

Vocabulary: ['acteurs et', 'acteurs jouent', 'acteurs ne', 'acteurs qui', 'acteurs sont', 'adore cette', 'ai jamais', 'ai pas', 'ai regardé', 'ai vu', 'ai été', 'aime pas', 'alors que', 'arrive pas', 'au bout', 'au début', 'au fil', 'au final', 'au moins', 'aujourd hui']

Training classifier
0.7085610200364298


### Exercise 5

Experiment with different classifiers and compare:
* Naive Bayes 
* MaxEnt (logistic regression)

▶▶ **Compare the results obtained with NB to the ones obtained with MaxEnt.**

▶▶ **When you're done, see what happens if you remove 'max_features = 500'. What do you conclude?**
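
As a reminder, MaxEnt is another name for logistic regression: the model learns one weight per feature and passes the weighted sum through a sigmoid. Below is a minimal pure-Python sketch trained with plain gradient descent on toy counts. It is an illustration only: the function names, learning rate and data are made up, and unlike scikit-learn's `LogisticRegression` it applies no regularization.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of the positive class for one count vector."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def train_logreg(X, y, lr=0.5, epochs=200):
    """Batch gradient descent on the average log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, target in zip(X, y):
            error = predict_proba(w, b, x) - target
            grad_b += error
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy count vectors over the features ["génial", "nul", "série"]
X = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
print([round(wi, 2) for wi in w], round(b, 2))
```

The learned weight is positive for *génial*, negative for *nul*, and close to zero for *série*, which occurs equally often in both classes.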

In [None]:
# --------------------------------------------------------
# TODO: Train a MaxEnt classifier and evaluate on dev, using the best features
# --------------------------------------------------------

from sklearn.linear_model import LogisticRegression
classifier_lr = LogisticRegression()

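# Note: this solution reuses the bigram features, although the unigram BOW
# features scored higher on dev with NB above (0.818 vs. 0.709)
classifier_lr.fit(train_feats_big, train['sentiment'])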

# Compute the performance on the dev set
score = classifier_lr.score( dev_feats_big, dev['sentiment'])
print(score)

0.7158469945355191


We don't have time here, but we would need to tune our model. Tuning would probably lead to MaxEnt giving better results than NB in both cases. 

### Exercise 6

You’ve determined the best feature set and classification algorithm (what is still missing is the best set of hyper-parameters). 

▶▶ **Compute the performance on the test set.**

In [None]:
# --------------------------------------------------------
# TODO: Compute the final results on the TEST set
# --------------------------------------------------------

test_feats_big = vectorizer_big.transform( test_tok )
score = classifier_lr.score( test_feats_big, test['sentiment'])
print(score)

0.7757352941176471


## Intrinsic model evaluation (code given)

Some models allow us to look at the most informative features. 

▶▶ **Examine both the top and the bottom of the list. Which features are most informative?**
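
The ranking logic can be sketched on a handful of (weight, feature) pairs. The values below are copied from the outputs printed in this notebook; the variable names are hypothetical.

```python
# A few (weight, feature) pairs taken from the notebook's outputs
coefficients = [
    (-2.537, "étoile"), (2.491, "excellent"), (0.698, "voir"),
    (-0.615, "minutes"), (2.428, "adore"), (-2.206, "mauvaise"),
]

# Highest weights first: the top pushes towards the positive class,
# the bottom towards the negative class.
ranked = sorted(coefficients, reverse=True)
top_positive = [feature for _, feature in ranked[:2]]
top_negative = [feature for _, feature in ranked[-2:]]

# Sorting by absolute value ranks features by how informative they are,
# regardless of the class they point to.
by_magnitude = sorted(coefficients, key=lambda pair: abs(pair[0]), reverse=True)

print(top_positive, top_negative, by_magnitude[0])
```

Weights near zero (such as *voir* here) carry little information; large weights of either sign are the informative ones.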

In [None]:
classifier_lr_bow = LogisticRegression()
classifier_lr_bow.fit(train_feats_bow, train['sentiment'])

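# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = vectorizer_bow.get_feature_names_out().tolist()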

allCoefficients = [(classifier_lr_bow.coef_[0,i], vocab[i]) for i in range(len(vocab))]
allCoefficients.sort()
allCoefficients.reverse()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [None]:
print("Top features for positive class:")
print( '\n'.join( [ f+':\t'+str((round(w,3))) for (w,f) in allCoefficients[:50]] ) )

Top features for positive class:
excellent:	2.491
adore:	2.428
génial:	2.107
bravo:	2.021
meilleure:	1.878
géniale:	1.835
superbe:	1.798
excellente:	1.68
plaisir:	1.448
magnifique:	1.403
espère:	1.387
tres:	1.371
attachants:	1.361
meilleur:	1.359
fantastique:	1.329
super:	1.327
merci:	1.318
étoiles:	1.305
fan:	1.25
meilleures:	1.177
vivement:	1.172
drôle:	1.157
bonne:	1.155
coeur:	1.125
jack:	1.086
vrai:	1.082
suspense:	1.082
petit:	1.039
culte:	0.999
doute:	0.947
moment:	0.944
mes:	0.893
bons:	0.888
charme:	0.882
mélange:	0.857
fil:	0.853
ait:	0.853
revoir:	0.846
ambiance:	0.822
suite:	0.792
certaines:	0.774
rôle:	0.76
grande:	0.756
rebondissements:	0.754
ceux:	0.724
surtout:	0.713
enfin:	0.711
voix:	0.706
simple:	0.702
voir:	0.698


In [None]:
print("Top features for negative class:")
print( '\n'.join( [ f+':'+str((round(w,3))) for (w,f) in allCoefficients[-50:]] ) )

Top features for negative class:
minutes:-0.615
etre:-0.628
départ:-0.634
complètement:-0.65
pu:-0.663
demande:-0.664
sens:-0.664
arrive:-0.703
nom:-0.71
vont:-0.722
scénarios:-0.798
femme:-0.806
force:-0.819
maison:-0.825
quelle:-0.836
ah:-0.861
veut:-0.868
bout:-0.877
rien:-0.899
père:-0.902
sentiments:-0.926
dialogues:-0.942
mal:-0.967
ennuie:-0.98
manque:-0.997
truc:-1.046
chez:-1.056
aucun:-1.15
malheureusement:-1.158
comprends:-1.177
morale:-1.182
gros:-1.195
mauvais:-1.203
jouer:-1.212
feux:-1.232
personne:-1.25
image:-1.28
aucune:-1.36
pathétique:-1.368
déçu:-1.382
idée:-1.429
nulle:-1.633
nullité:-1.664
intérêt:-1.716
nul:-1.781
copie:-2.009
pire:-2.066
ridicule:-2.162
mauvaise:-2.206
étoile:-2.537


### Exercise 3: Tuning 

#### K-fold cross validation

Usually, we will want to try out different parameters, in order to see what works best for our task. As such, we might experiment with:

* Different features
* Different classification algorithms 
* Different model parameters

However, we have to be careful: we cannot use our test set over and over again, as we’ll be optimizing our parameters for that particular test set (and run the risk of overfitting, which means we are not able to properly generalize to data we haven’t trained on). 
For this reason, we need to make use of a validation set. 
However, our training set is already quite small; creating a separate validation set would give us even less training data. 

Fortunately, we don’t have to create a separate set: we can use k-fold cross validation. 
The idea is the following:
* Break up the data into k (e.g. 10) parts (folds) 
* For each fold
     * The current fold is used as a temporary test set
     * The other k-1 folds (here 9) are used as training data
     * Performance is computed on the test fold
* Average the performance over the k (here 10) runs
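
The procedure can be sketched in pure Python as follows. This is a simplified illustration with contiguous folds and no grouping; the function names are hypothetical, and the "classifier" is a made-up majority-class baseline standing in for a real model.

```python
from collections import Counter

def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validate(labels, k):
    """Average accuracy of a majority-class baseline over k folds."""
    scores = []
    for train, test in kfold_indices(len(labels), k):
        # "Training": find the most frequent label in the training folds
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        # "Evaluation": accuracy on the held-out fold
        scores.append(sum(labels[i] == majority for i in test) / len(test))
    return sum(scores) / len(scores)

labels = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
print(cross_validate(labels, k=5))
```

Every instance is used for testing exactly once and for training k-1 times, so no data is lost to a fixed validation set.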

Note that, again, we want to make sure that the movies that are reviewed in our training set are different from the ones that appear in our validation set. scikit-learn has a class for this, GroupKFold:

In [None]:
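# Fit a classifier with grouped k-fold cross-validation and print the mean score

from sklearn.model_selection import GroupKFold
group_kfold = GroupKFold(n_splits=10)
score_kfold = []
# Grouping by movie_id keeps all reviews of a given title in the same fold.
# The original cell used an undefined name (train_data_features); it is assumed
# here to be the BOW matrix train_feats_bow built in Exercise 2.
for train_index, test_index in group_kfold.split(train_feats_bow, train['sentiment'], train['movie_id']):
    X_train, X_test = train_feats_bow[train_index], train_feats_bow[test_index]
    # .iloc selects by position, which is what split() returns
    y_train, y_test = train['sentiment'].iloc[train_index], train['sentiment'].iloc[test_index]
    classifier.fit(X_train, y_train)
    score_kfold.append(classifier.score(X_test, y_test))
print(sum(score_kfold) / len(score_kfold))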
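
To see what the grouping constraint means in practice, here is a small pure-Python sketch that assigns whole groups to folds in round-robin order. It only illustrates the constraint and is not how GroupKFold assigns groups; the function name and the ids are made up.

```python
def group_folds(groups, k):
    """Assign every group to exactly one fold and return the indices per fold."""
    fold_of_group = {g: i % k for i, g in enumerate(sorted(set(groups)))}
    folds = [[] for _ in range(k)]
    for index, g in enumerate(groups):
        folds[fold_of_group[g]].append(index)
    return folds

movie_ids = [10, 10, 11, 12, 12, 12, 13]
folds = group_folds(movie_ids, k=2)
print(folds)

# No id may appear in more than one fold
ids_per_fold = [{movie_ids[i] for i in fold} for fold in folds]
print(ids_per_fold[0] & ids_per_fold[1])
```

Since all reviews of a given title land in the same fold, the model is always evaluated on titles it has never seen during training.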