# TP1: Machine learning (reminder)
Master LiTL - 2021-2022

## Requirements
In this practical session, we will explore machine learning models for NLP applications; specifically, we will train a classifier for sentiment analysis on a French dataset of TV series reviews. 
For these exercises, we will use Python (v3.*) and a number of modules for data processing and machine learning: *numpy*, *scipy*, *scikit-learn*, *pandas* and *spacy*. 
If you want to use your own computer, you will need to make sure these are installed (e.g. using the command *pip*). If you’re using *Miniconda*, you can use the command:
```
conda install <modulename>
```


First, download the archive for the practical session from the course page to an appropriate working directory, and unzip it. Under Linux, you can issue the following commands:
```
$ unzip tp3.zip 
$ cd tp3
```


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Task and dataset

We’ll go through the following stages of an NLP machine learning pipeline, using sentiment classification as an application:
* data preprocessing (tokenization) 
* feature extraction
* model training
* evaluation

As a dataset, we’ll be using a set of reviews for television series in French, extracted from the website allocine.fr. 
The dataset consists of the text of the review, as well as a sentiment label (positive or negative).

The dataset is divided into a training part (for training, 5576 reviews, about 90%) and a test part (for evaluation, 544 reviews, about 10%). 
The dataset is balanced, which means positive and negative instances are evenly distributed. 
Additionally, the training and test sets contain reviews about different TV series (in order to avoid possible bias when evaluating).

### Exercise 1: Preprocessing (code given)

First, we’ll load the training set and explore the dataset.



In [None]:
import pandas as pd

train_path = "allocine_train.tsv"
dev_path = "allocine_dev.tsv"
test_path = "allocine_test.tsv"

train = pd.read_csv(train_path, header=0, delimiter='\t', quoting=3) #'allocine_train.tsv'
print("TRAIN:", train.shape)
print(train.columns.values)
print(train['sentiment'][0], train['review'][0])
print()

dev = pd.read_csv(dev_path, header=0, delimiter='\t', quoting=3) 
print("DEV:", dev.shape)
print()

test = pd.read_csv(test_path, header=0, delimiter='\t', quoting=3) 
print("TEST:", test.shape)


We need to preprocess the dataset to be able to properly extract features from it.
In order to do so, we’ll create a function that makes use of spacy’s preprocessing pipeline. 

You need to download the spaCy model for French. Two options (the second has not been tested): 

```
!python -m spacy download fr_core_news_sm
# or
import spacy.cli
spacy.cli.download("fr_core_news_sm")
```

In [None]:
import spacy.cli
spacy.cli.download("fr_core_news_sm")

In [None]:
# Preprocessing = tokenize data
import spacy
nlp = spacy.load('fr_core_news_sm', disable=['tagger', 'parser', 'ner'])


def preprocess_data(dataset):
    """Tokenize every review and print the first two as a sanity check."""
    num_reviews = dataset['review'].size
    print("#Reviews =", num_reviews)
    dataset_tok = []
    for i in range(num_reviews):
        clean_review = review_to_tokens(dataset['review'][i])
        dataset_tok.append(clean_review)
    for i, r in enumerate(dataset_tok[:2]):
        print('\n', i, r)
    return dataset_tok

def review_to_tokens(raw_review):
    doc = nlp(raw_review)
    tokenList = [token.text for token in doc]
    tokenized_string = ' '.join(tokenList)
    tokenized_string_lowercase = tokenized_string.lower()
    return tokenized_string_lowercase

print("-- Preprocess TRAIN:")
train_tok = preprocess_data( train )

print("\n-- Preprocess DEV:")
dev_tok = preprocess_data( dev )

print("\n-- Preprocess TEST:")
test_tok = preprocess_data( test )
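
To see what `review_to_tokens` produces without loading a spaCy model, here is a rough sketch that approximates the same three steps (tokenize, join with spaces, lowercase) using a regular expression. The function `rough_tokenize` and the toy sentence are hypothetical, and the regular expression is only an approximation: spaCy's French tokenizer handles elisions such as *c'est* differently.

```python
import re

def rough_tokenize(raw_review):
    """Separate words from punctuation, join tokens with spaces, lowercase."""
    # \w+ matches a run of word characters, [^\w\s] matches one punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", raw_review)
    return " ".join(tokens).lower()

toy_tokenized = rough_tokenize("Super série, vraiment géniale !")
print(toy_tokenized)
```

The point of the space-joined output is that a later whitespace-based vectorizer sees punctuation as separate tokens instead of glued to the neighbouring word.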

### Exercise 2: Feature extraction 

Now it’s time to decide which features to use in our classifier. We’ll start with simple bag of words features.
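
As an illustration of the idea, the sketch below builds a bag-of-words representation by hand with the standard library. It is not how `CountVectorizer` is implemented; the helper names (`build_vocabulary`, `to_bow`) and the toy reviews are made up for this example. It shows two things: each review becomes a vector of token counts, and the vocabulary is fixed once on the training data and then reused for unseen text.

```python
from collections import Counter

def build_vocabulary(tokenized_reviews, max_features):
    """Keep the max_features most frequent tokens, sorted alphabetically."""
    counts = Counter(tok for review in tokenized_reviews for tok in review.split())
    most_common = [tok for tok, _ in counts.most_common(max_features)]
    return sorted(most_common)

def to_bow(review, vocabulary):
    """Count how often each vocabulary token occurs in one review."""
    counts = Counter(review.split())
    return [counts[tok] for tok in vocabulary]

toy_train = ["très très bonne série", "bonne série", "série nulle"]
toy_vocab = build_vocabulary(toy_train, max_features=3)
toy_vectors = [to_bow(r, toy_vocab) for r in toy_train]

# An unseen review is mapped onto the same vocabulary;
# tokens outside the vocabulary are simply ignored.
unseen_vector = to_bow("série très décevante", toy_vocab)

print(toy_vocab)
print(toy_vectors)
print(unseen_vector)
```

Note that the token *nulle* is dropped because only the three most frequent tokens are kept, which is the same kind of trade-off that `max_features = 500` makes on the real data.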

▶▶ **TODO: write the code to vectorize the dev set.**

In [None]:
# Vectorizing data: BOW representation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_bow = CountVectorizer( analyzer = 'word', max_features = 500 )

print("Vectorize TRAIN:")
train_feats_bow = vectorizer_bow.fit_transform( train_tok )
print(train_feats_bow.shape)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer_bow.get_feature_names_out()
print("Vocabulary:", vocab[:20])

# --------------------------------------------------------
# TODO: Write the code to vectorize the DEV set
# --------------------------------------------------------
print("\nVectorize DEV:")


# Vocabulary should remain the same!
print("Vocabulary:", vocab[:20])

### Exercise 3: Classification (Code given)

We’ll start with a simple classifier that often performs well: Naive Bayes.
Train the classifier and report its performance on the dev set.
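
To make the model less of a black box, here is a minimal sketch of multinomial Naive Bayes with add-one smoothing, written with the standard library only. It is an illustration under simplifying assumptions, not the scikit-learn implementation: the function names, the three-word vocabulary and the label convention (1 = positive, 0 = negative) are all hypothetical.

```python
import math

def train_multinomial_nb(vectors, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    classes = sorted(set(labels))
    n_features = len(vectors[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(vectors))
        # Total count of each feature over all reviews of class c.
        feature_totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(feature_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in feature_totals]
    return log_prior, log_likelihood

def predict_nb(vector, log_prior, log_likelihood):
    """Return the class with the highest log prior + log likelihood score."""
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(vector, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy counts over a hypothetical vocabulary ["bonne", "nulle", "série"].
toy_X = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
toy_y = [1, 1, 0, 0]
prior, likelihood = train_multinomial_nb(toy_X, toy_y)
toy_nb_pred = [predict_nb(v, prior, likelihood) for v in [[1, 0, 1], [0, 1, 0]]]
print(toy_nb_pred)
```

The smoothing term `alpha` keeps a word that never occurs in one class from producing a probability of zero, which would otherwise rule that class out entirely.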



In [None]:
## Classification with NAIVE BAYES

# Train the classifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
classifier = MultinomialNB()
classifier.fit(train_feats_bow, train['sentiment'])

In [None]:
# Compute the performance on the dev set
score = classifier.score( dev_feats_bow, dev['sentiment'] )
print(score)

#### Exercise 3-b (code given)
* What does the score represent?
* Look at the instances that were misclassified. Can you see why each review was misclassified?

In [None]:
## Look at misclassified instances
pred = classifier.predict( dev_feats_bow )
print(pred)  # = array of predicted labels, hard to read
#print(test['sentiment']) #gold - y_test
print()

print('Misclassified examples: ')
count_err = 0
for i in range(len(pred)):
    if pred[i] != dev['sentiment'][i]:
        print( "\nGOLD=", dev['sentiment'][i], "PRED=",pred[i] , i, dev['review'][i])
        count_err += 1
        
print( "CHECK: ", "#Total=", len(pred), "#Errors=", count_err, "Acc=", (len(pred)-count_err)/len(pred))
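
Beyond listing individual errors, it can help to count which kind of error occurs most often. The sketch below tallies (gold, predicted) pairs on made-up labels; the variable names and the label convention (1 = positive, 0 = negative) are assumptions for the example, not values from the dataset.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count how often each (gold, predicted) label pair occurs."""
    return Counter(zip(gold, predicted))

toy_gold = [1, 1, 0, 0, 1, 0]
toy_predicted = [1, 0, 0, 1, 1, 0]

counts = confusion_counts(toy_gold, toy_predicted)
# Pairs where gold and prediction agree are the correct predictions.
toy_accuracy = sum(n for (g, p), n in counts.items() if g == p) / len(toy_gold)

for (g, p), n in sorted(counts.items()):
    print("gold =", g, "pred =", p, "count =", n)
print("accuracy =", toy_accuracy)
```

On the real data, the same function could be called with `dev['sentiment']` and `pred` to see whether positive or negative reviews are misclassified more often.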

### Exercise 4: Experiment with different feature sets.

Here, we'll just try bi-grams.

▶▶ **TODO: write the code to vectorize the data into bigrams. Keep 'max_features = 500'. Then retrain and evaluate the classifier.**

We could also have tried to:
  * Exclude a list of stopwords (high-frequency words that are considered too general to be meaningful, such as *une* or *le*)
  * Experiment with n-grams with n>2 
  * Combine features (e.g. BOW + bi-grams)
  * Can you think of other features to include?
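
To make the notion of n-gram features concrete, here is a small standard-library sketch that extracts n-grams from a tokenized review. The helper `extract_ngrams` and the toy review are hypothetical; the exercise itself is meant to be solved with the vectorizer.

```python
def extract_ngrams(tokenized_review, n):
    """Return all contiguous sequences of n tokens, each joined by a space."""
    tokens = tokenized_review.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toy_review = "pas très bonne série"
toy_unigrams = extract_ngrams(toy_review, 1)
toy_bigrams = extract_ngrams(toy_review, 2)

# Combining feature sets amounts to concatenating the two lists.
toy_combined = toy_unigrams + toy_bigrams

print(toy_bigrams)
print(toy_combined)
```

The bigram *pas très* keeps the negation attached to what follows, information that is lost when each word is counted separately.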

In [None]:
# --------------------------------------------------------
# TODO: Write the code to build a vectorizer that extracts bigrams
# --------------------------------------------------------

#vectorizer_big = ...

# --------------------------------------------------------
# TODO: Vectorize the train and dev sets
# --------------------------------------------------------
print("Vectorize TRAIN:")


print("\nVectorize DEV:")


# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer_big.get_feature_names_out()
print("\nVocabulary:", vocab[:20])

# --------------------------------------------------------
# TODO: Train a Naive Bayes classifier and evaluate on dev
# --------------------------------------------------------
print("\nTraining classifier")
#classifier_big = ...

# Compute the performance on the dev set


### Exercise 5

Experiment with different classifiers and compare:
* Naive Bayes 
* MaxEnt (i.e. logistic regression)

▶▶ **Compare the results obtained with NB to the ones obtained with MaxEnt.**

▶▶ **When you're done, try what happens if you remove 'max_features = 500'. What do you conclude?**
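
As a reminder of what `max_features` does, the sketch below fits two `CountVectorizer` instances on a tiny made-up corpus, one capped at three features and one uncapped, and compares the shapes of the resulting matrices. The corpus and variable names are hypothetical; it only shows the effect on the number of features, not on classification performance.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["une série géniale", "une série nulle", "des acteurs géniaux"]

# Capped: only the 3 most frequent tokens of the corpus are kept as features.
capped = CountVectorizer(analyzer="word", max_features=3)
capped_matrix = capped.fit_transform(toy_corpus)

# Uncapped: every token seen during fitting becomes a feature.
uncapped = CountVectorizer(analyzer="word")
uncapped_matrix = uncapped.fit_transform(toy_corpus)

print("capped:", capped_matrix.shape)
print("uncapped:", uncapped_matrix.shape)
```

On the real training set the uncapped vocabulary will be far larger than 500, so expect both more information for the classifier and a larger, sparser feature matrix.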

In [None]:
from sklearn.linear_model import LogisticRegression

# --------------------------------------------------------
# TODO: Train a MaxEnt classifier and evaluate on dev, using the best features
# --------------------------------------------------------

# Train a model with MaxEnt


# Compute the performance on the dev set


### Exercise 6

You’ve determined the best feature set and classification algorithm (what is still missing is a search for the best set of hyper-parameters, which we leave out here). 

▶▶ **Compute the performance on the test set.**

In [None]:
# --------------------------------------------------------
# TODO: Compute the final results on the TEST set
# --------------------------------------------------------


## Intrinsic model evaluation (code given)

Some models allow us to look at the most informative features. 

▶▶ **Examine both the top and the bottom of the list. Which features are most informative?**

In [None]:
classifier_lr_bow = LogisticRegression()
classifier_lr_bow.fit(train_feats_bow, train['sentiment'])

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer_bow.get_feature_names_out()

allCoefficients = [(classifier_lr_bow.coef_[0,i], vocab[i]) for i in range(len(vocab))]
allCoefficients.sort()
allCoefficients.reverse()

In [None]:
print("Top features for positive class:")
print( '\n'.join( [ f+':\t'+str((round(w,3))) for (w,f) in allCoefficients[:50]] ) )

In [None]:
print("Top features for negative class:")
print( '\n'.join( [ f+':\t'+str((round(w,3))) for (w,f) in allCoefficients[-50:]] ) )
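
To interpret these coefficients, recall that a binary logistic regression model turns a weighted sum of feature counts into a probability through the sigmoid function. The sketch below illustrates this with made-up weights; `positive_probability` and `toy_weights` are hypothetical and the values are not taken from the trained model. It assumes that a positive coefficient pushes the prediction towards the positive class.

```python
import math

def positive_probability(counts, weights, bias=0.0):
    """Logistic model: sigmoid of the weighted sum of feature counts."""
    z = bias + sum(weights[tok] * n for tok, n in counts.items() if tok in weights)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen by hand for illustration only.
toy_weights = {"excellent": 1.5, "ennuyeux": -2.0, "série": 0.0}

p_with_excellent = positive_probability({"excellent": 1, "série": 1}, toy_weights)
p_with_ennuyeux = positive_probability({"ennuyeux": 1, "série": 1}, toy_weights)

print(round(p_with_excellent, 3), round(p_with_ennuyeux, 3))
```

A feature with a coefficient close to zero, like *série* here, barely moves the prediction in either direction, which is why the informative features sit at the two ends of the sorted list.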