# FastText Deep NLP 

We use the FastText library to build the best possible model for sentiment prediction on the IMDb dataset.

#### Authors
* Denjoy Segolene
* Le Helloco Quentin
* Sharpin Etienne

### Import FastText

In [70]:
import fasttext

### Create Dataset

We use the IMDb dataset from Hugging Face. It contains 25,000 reviews labeled positive or negative for training and another 25,000 for testing.

In [71]:
from datasets import load_dataset

dataset = load_dataset("imdb")

Reusing dataset imdb (/Users/quentinlehelloco/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)



In [4]:
x_train = dataset["train"][:]["text"]
y_train = dataset["train"][:]["label"]

x_test = dataset["test"][:]["text"]
y_test = dataset["test"][:]["label"]

Let's look at the first sample to see what kind of format is used.

In [5]:
x_train[0]

'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'

In [14]:
y_train[0]

1

Each sample is a single string made up of multiple sentences. The label is 1 for positive reviews and 0 for negative ones.

### Format dataset for FastText usage

FastText uses a specific input format: a text file in which each line is one sample, prefixed with **\_\_label\_\_\<label_name\>**.

In [103]:
import pandas as pd
import numpy as np

In [104]:
df = pd.DataFrame({'Value':x_train, 'Positive':y_train})
df.head()

Unnamed: 0,Value,Positive
0,Bromwell High is a cartoon comedy. It ran at t...,1
1,Homelessness (or Houselessness as George Carli...,1
2,Brilliant over-acting by Lesley Ann Warren. Be...,1
3,This is easily the most underrated film inn th...,1
4,This is not the typical Mel Brooks film. It wa...,1


#### We need to add "\__label\__positive" and "\__label\__negative" before every document

In [105]:
def add_label(s, p):
    if p == 1:
        return "__label__positive " + s
    else:
        return "__label__negative " + s 

In [106]:
df["Value"] = df.apply(lambda x: add_label(x.Value, x.Positive), axis=1)

In [107]:
df

Unnamed: 0,Value,Positive
0,__label__positive Bromwell High is a cartoon c...,1
1,__label__positive Homelessness (or Houselessne...,1
2,__label__positive Brilliant over-acting by Les...,1
3,__label__positive This is easily the most unde...,1
4,__label__positive This is not the typical Mel ...,1
...,...,...
24995,__label__negative Towards the end of the movie...,0
24996,__label__negative This is the kind of movie th...,0
24997,__label__negative I saw 'Descent' last night a...,0
24998,__label__negative Some films that you pick up ...,0


The samples are now formatted for a FastText input file. However, the dataset lists all positive samples first; to avoid biasing the model with this ordering, we need to shuffle the dataset.

### Shuffle dataset

In [108]:
df = df.sample(frac=1).reset_index(drop=True)
df

Unnamed: 0,Value,Positive
0,__label__positive This is simply a classic fil...,1
1,__label__negative Wow. Some movies just leave ...,0
2,__label__negative As an ex-teacher(!) I must c...,0
3,"__label__positive Robert Jannuci,Luca Venantin...",1
4,__label__positive I would never have thought I...,1
...,...,...
24995,__label__positive This project was originally ...,1
24996,"__label__positive When I first watched this, w...",1
24997,__label__positive Madhur Bhandarkar goes all o...,1
24998,__label__negative I rented this thinking it mi...,0


The samples can now be written to a FastText input file.

### Put data as a file for FastText input

In [42]:
with open("FastText_input.txt", 'w+') as f:
    for values in df["Value"]:
        f.write(values + "\n")
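
As a complement to the cell above, here is a minimal, dependency-free sketch of the same formatting and writing steps. The helper names (`to_fasttext_line`, `write_fasttext_file`) and the demo samples are hypothetical, and the flattening of line breaks is an added precaution that the original code does not perform: since each line of the file is one sample, a review containing a line break would otherwise be split into several samples.

```python
import os
import tempfile

def to_fasttext_line(text: str, label: int) -> str:
    """Format one sample as '__label__<name> <text>' on a single line."""
    name = "positive" if label == 1 else "negative"
    # One sample per line: collapse line breaks and repeated whitespace
    flat_text = " ".join(text.split())
    return f"__label__{name} {flat_text}"

def write_fasttext_file(texts, labels, file_name):
    """Write all samples to file_name, one formatted sample per line."""
    with open(file_name, "w", encoding="utf-8") as f:
        for text, label in zip(texts, labels):
            f.write(to_fasttext_line(text, label) + "\n")

# Tiny made-up samples, only to illustrate the format
demo_texts = ["Great movie,\nloved it!", "Terrible plot."]
demo_labels = [1, 0]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_input.txt")
write_fasttext_file(demo_texts, demo_labels, demo_path)

# Read the file back to check what FastText would see
with open(demo_path, encoding="utf-8") as f:
    demo_lines = f.read().splitlines()
```

Reading the file back gives one `__label__...` line per sample, which is a quick sanity check before training.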

### Create model 

FastText provides a function that creates and trains a model in a single call, so let's use it directly.

In [109]:
model = fasttext.train_supervised("FastText_input.txt")

Read 5M words
Number of words:  281132
Number of labels: 2
Progress: 100.0% words/sec/thread: 2805986 lr:  0.000000 avg.loss:  0.427186 ETA:   0h 0m 0s


### Save model

The model was fast to train, but we can still save it to a file so that it survives a kernel shutdown.

In [110]:
model.save_model("model.bin")

### Test model predictions

Let's test some easy sentences to see if the model predicts them correctly.

In [111]:
model.predict("I loved this movie !")

(('__label__positive',), array([0.99994802]))

In [112]:
model.predict("I hated this movie !")

(('__label__negative',), array([0.98021179]))

In [114]:
model.predict("I really enjoyed how bad the actors played")

(('__label__negative',), array([0.86012012]))

In [113]:
model.predict("I did not enjoyed it as I should have")

(('__label__positive',), array([0.97754908]))

The model classifies most of these sentences correctly, but the last one, perhaps too ambiguous, is mispredicted as positive.
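
As the outputs above show, `model.predict` returns a pair: a tuple of label strings and an array of probabilities. The sketch below unpacks such a pair into a plain label name and a float. The helper `parse_prediction` is hypothetical, and a plain list stands in for the NumPy array so that the example runs without a trained model.

```python
def parse_prediction(prediction, prefix="__label__"):
    """Turn a (labels, probabilities) pair into (label_name, probability)."""
    labels, probabilities = prediction
    label = labels[0]
    # Strip the FastText prefix to recover the bare label name
    if label.startswith(prefix):
        label = label[len(prefix):]
    return label, float(probabilities[0])

# Values copied from the "I loved this movie !" cell above;
# the list replaces the NumPy array for illustration only
example = (("__label__positive",), [0.99994802])
name, confidence = parse_prediction(example)
print(name, confidence)
```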

### Test our accuracy

A few sentences are a good start, but a test dataset is better. Fortunately we have one, so let's look at the scores when predicting all 25,000 test samples.

In [115]:
def pred_tests(x_test, y_test, model):
    """
    Predict the label of every test sample with a model
    
    Input:
        x_test : list of sentences to predict
        y_test : list of labels for x_test sentences
        model : model to predict from
        
    Output:
        List of predictions
    """
    
    preds = []
    
    number_doc = len(x_test)
    
    for i in range(number_doc):
        pred = model.predict(x_test[i])
        
        # Map the FastText label string back to the integer labels used in y_test
        if pred[0][0] == '__label__negative':
            preds.append(0)
        else:
            preds.append(1)
                
    return preds

In [116]:
%%time
y_pred = pred_tests(x_test, y_test, model)

CPU times: user 1.59 s, sys: 65 ms, total: 1.66 s
Wall time: 1.66 s


In [117]:
from sklearn.metrics import classification_report

In [118]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.86      0.86     12500
           1       0.86      0.86      0.86     12500

    accuracy                           0.86     25000
   macro avg       0.86      0.86      0.86     25000
weighted avg       0.86      0.86      0.86     25000
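
To make the columns of this report concrete, here is a small sketch that recomputes accuracy, precision, recall, and F1 for the positive class from two label lists. The function `binary_report` and the toy labels are hypothetical; this is not how scikit-learn implements its report, only the standard definitions written out.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against empty denominators
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels for illustration only
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]
toy_scores = binary_report(toy_true, toy_pred)
print(toy_scores)
```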



An accuracy of 0.86 is better than what the logistic regression and naive Bayes implementations achieved. We expected this, since FastText is designed to train strong text classifiers quickly. The next step is to see whether we can improve the model further.

First we can try to tune hyperparameters to see if FastText can offer a better model without modifying the dataset.

### Hyperparameters fitting

To tune hyperparameters, we could proceed by hand and try multiple values of the learning rate, word n-grams, and number of epochs. However, FastText provides an all-in-one autotune feature that searches for the best hyperparameter combination for our dataset.
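
For comparison, the manual approach could look like the sketch below: a plain grid search over a few candidate values. The function `grid_search`, the candidate values, and the `dummy_evaluate` scorer are all hypothetical; the dummy scorer only exists so that the example runs without training a model.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the best-scoring one.

    param_grid: dict mapping a parameter name to the list of values to try
    evaluate: callable taking a dict of parameters and returning a score
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Candidate values are arbitrary illustrations, not recommendations
param_grid = {"lr": [0.1, 0.5, 1.0], "epoch": [5, 25], "wordNgrams": [1, 2]}

def dummy_evaluate(params):
    """Stand-in scorer so the sketch runs without training anything."""
    return -abs(params["lr"] - 0.5) + 0.01 * params["epoch"] + 0.1 * params["wordNgrams"]

best_params, best_score = grid_search(param_grid, dummy_evaluate)
print(best_params, best_score)
```

With FastText, `evaluate` could train a model with `fasttext.train_supervised(input=..., **params)` on a training file and return the precision reported by `model.test(...)` on a validation file. Each combination requires a full training run, which is why the built-in autotune is more convenient here.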

#### Creating val and train dataset for hyperparameters tuning

In [119]:
def df_to_FastText_file(value_column:pd.Series, file_name:str):
    """
    Create a file with FastText Format from a DataFrame
    
    Input:
        value_column : column of dataFrame containing formatted values
        file_name : name of the file to create
        
    Output: None
    """
    
    with open(file_name, "w+") as f:
        for values in value_column:
            f.write(values + "\n")

FastText needs a training set and a validation set to tune the hyperparameters according to the validation score. We create a validation set of 2,000 samples; the remaining 23,000 samples are used for training.

In [120]:
# Create val dataset as FastText input file
df_to_FastText_file(df[:2000]["Value"],"FastText_val.txt")

In [123]:
# Create train dataset as FastText input file
df_to_FastText_file(df[2000:]["Value"], "FastText_train.txt")
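
Because the earlier shuffle does not fix a random seed, the content of the validation and training files changes from one run to the next. The sketch below shows one possible way to make the split reproducible using only the standard library. The function `shuffle_and_split`, the seed value, and the demo lines are hypothetical.

```python
import random

def shuffle_and_split(lines, n_val, seed=0):
    """Shuffle a copy of lines with a fixed seed, then split off n_val lines."""
    shuffled = list(lines)
    # A dedicated Random instance keeps the global random state untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_val], shuffled[n_val:]

# Made-up formatted samples for illustration
demo_lines = [f"__label__positive sample {i}" for i in range(10)]
val_lines, train_lines = shuffle_and_split(demo_lines, n_val=2)
print(len(val_lines), len(train_lines))
```

With pandas, passing `random_state` to `df.sample` achieves the same reproducibility.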

#### Using FastText hyperparameters tuning method

In [170]:
tuned_model = fasttext.train_supervised(input='FastText_train.txt', autotuneValidationFile='FastText_val.txt')

Progress: 100.0% Trials:   12 Best score:  0.881500 ETA:   0h 0m 0s
Training again with best arguments
Read 5M words
Number of words:  266902
Number of labels: 2
Progress: 100.0% words/sec/thread: 4008359 lr:  0.000000 avg.loss:  0.060689 ETA:   0h 0m 0s


In [174]:
%%time
y_pred_tuned = pred_tests(x_test, y_test, tuned_model)
print(classification_report(y_test, y_pred_tuned))

              precision    recall  f1-score   support

           0       0.87      0.88      0.87     12500
           1       0.87      0.87      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000

CPU times: user 1.37 s, sys: 13.4 ms, total: 1.38 s
Wall time: 1.4 s


After 5 minutes of hyperparameter autotuning against the validation file, FastText reached a best validation score of 0.88. On the test set the tuned model scores 0.87, slightly better than our previous 0.86.

To increase this score, we could let autotune run longer, but a more promising option is to preprocess the input data by removing stopwords and applying lemmatization or stemming.

### Testing preprocessing methods

Preprocessing the data may be the most complicated part because of the large number of samples in the dataset. Applying stopword removal, stemming and/or lemmatization can take a long time to run.

First, let's create an all-in-one function that creates and trains a model, so that we only need to pass preprocessed samples as input to compute the score.

In [132]:
def full_predict(x_train:list, y_train:list, x_test:list, y_test:list, file_name:str):
    """
    Predict samples from x_test using FastText trained model with x_train samples.
    
    Input:
        x_train: list of samples (train)
        y_train: list of labels (train)
        x_test: list of samples (test)
        y_test: list of labels (test)
        file_name: name of file created for FastText input formatting
    """
    # Create dataframe
    df = pd.DataFrame({'Value':x_train, 'Positive':y_train})
    
    # Format samples
    df["Value"] = df.apply(lambda x: add_label(x.Value, x.Positive), axis=1)
    
    # Shuffle data
    df = df.sample(frac=1).reset_index(drop=True)
    
    # Create file
    with open(file_name, 'w+') as f:
        for values in df["Value"]:
            f.write(values + "\n")
            
    # Train model
    print("Training model")
    model = fasttext.train_supervised(file_name)
    
    # Predict test
    print("Predicting test samples")
    y_pred = pred_tests(x_test, y_test, model)
    
    # Print scores
    print(classification_report(y_test, y_pred))
    
    return (model, y_pred)

#### Now we just need to add preprocessing and train a new model

We can start by removing stopwords to see whether it affects the prediction scores.

In [None]:
import re
import spacy
from multiprocessing import Pool

# Reg expr for tokenization
re_word = re.compile(r"^\w+$")

# loading small English model
nlp = spacy.load("en_core_web_sm")

In [164]:
def stopword_removal(text:str):
    tokens_str = [str(token) for token in nlp(text.lower()) if re_word.match(token.text) and not token.is_stop]
    return " ".join(tokens_str)

def process_treatment(func, data):
    with Pool() as p:
        preproc_data = p.map(func, data)
        
    return preproc_data

In [165]:
preproc_x_train = process_treatment(stopword_removal, x_train)

Process SpawnPoolWorker-22:
Traceback (most recent call last):
  File "/Users/quentinlehelloco/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/quentinlehelloco/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/quentinlehelloco/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/Users/quentinlehelloco/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/queues.py", line 368, in get
    return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'stopword_removal' on <module '__main__' (built-in)>

KeyboardInterrupt: 
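
The `AttributeError` above is what `multiprocessing` raises when worker processes are started with the spawn method (the default on macOS since Python 3.8) and the mapped function is defined in the notebook itself: the workers cannot import `stopword_removal` from `__main__`. Two common workarounds are to move the function into an importable `.py` module, or to avoid worker processes. The sketch below illustrates the second option. It is not the spaCy pipeline used above: it relies on a small hypothetical stop-word list and a regular expression, and the names `simple_stopword_removal` and `process_serially` are made up for this example.

```python
import re

# Small hypothetical stop-word list for illustration;
# spaCy's English list is much larger
STOP_WORDS = {"a", "an", "and", "as", "i", "is", "it", "of", "the", "this", "to", "was"}

# Matches runs of word characters, which drops punctuation
RE_WORD = re.compile(r"\w+")

def simple_stopword_removal(text: str) -> str:
    """Lowercase text, keep word tokens only, and drop stop words."""
    tokens = RE_WORD.findall(text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def process_serially(func, data):
    """Apply func to every sample in the current process (no pickling involved)."""
    return [func(sample) for sample in data]

demo = process_serially(simple_stopword_removal, ["This is a great movie!", "I hated it."])
print(demo)
```

Running serially is slower than a working process pool, but it is simple and behaves the same in a notebook and in a script.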

### Find wrongly classified samples to analyze why it failed

To understand why some samples were predicted incorrectly, we can look at a few of them and see whether they share a pattern that may be difficult for the model to learn.

In [149]:
def find_wrong_class(x_pred:list, y_pred:list, y_true:list, nb_samples=None) -> list:
    """
    Find wrong classification of samples in x_pred based on y_pred prediction
    
    Input:
        x_pred: list of samples
        y_pred: list of predicted labels
        y_true: list of true labels
        nb_samples: max number of samples to extract (could be less if not enough in provided samples)
    """
    wrongly_classified = []
    count_sample = 0
    
    for i in range(len(y_pred)):
        if nb_samples is not None and count_sample == nb_samples:
            break
        
        if y_pred[i] != y_true[i]:
            wrongly_classified.append((x_pred[i], y_pred[i]))
            count_sample += 1
            
    return wrongly_classified

In [147]:
few_wrong = find_wrong_class(x_test, y_pred, y_test)
few_wrong[-2]

('Four things intrigued me as to this film - firstly, it stars Carly Pope (of "Popular" fame), who is always a pleasure to watch. Secdonly, it features brilliant New Zealand actress Rena Owen. Thirdly, it is filmed in association with the New Zealand Film Commission. Fourthly, a friend recommended it to me. However, I was utterly disappointed. The whole storyline is absurd and complicated, with very little resolution. Pope\'s acting is fine, but Owen is unfortunately under-used. The other actors and actresses are all okay, but I am unfamiliar with them all. Aside from the nice riddles which are littered throughout the movie (and Pope and Owen), this film isn\'t very good. So the moral of the story is...don\'t watch it unless you really want to.',
 1)

In [151]:
few_wrong[42]

("The plot of this movie is as dumb as a bag of hair. Jimmy Smit plays a character that could have been upset by the ridiculousness of the story. He is evil and a wife beater. It's a character as far from his NYPD and LA Law roles as you could possibly get.<br /><br />If you've thought he had the looks and the acting chops to play the really bad boy role, her's your present.<br /><br />But!!!!!!!! Mary Louis Parker wears black miniskirts and little black minidresses throughout the movie.<br /><br />She has always had some of the greatest legs in the history of the movies. This makes the movie well worth it for this leg admirer.<br /><br />I'd buy the DVD for this reason only if it was available.",
 0)

Looking at these two mispredicted samples, we can see that neither clearly states whether the reviewer liked the movie; the opinion is only implied through complicated wording and sentence structure.
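
To look for such patterns more systematically, it can help to separate the errors by direction. The sketch below splits misclassified samples into false positives and false negatives. The function `split_errors` and the toy data are hypothetical, and the labels follow the convention used in this notebook (1 = positive, 0 = negative).

```python
def split_errors(texts, y_pred, y_true):
    """Separate misclassified samples into false positives and false negatives."""
    false_positives, false_negatives = [], []
    for text, pred, true in zip(texts, y_pred, y_true):
        if pred == true:
            continue
        if pred == 1:
            # Predicted positive, actually negative
            false_positives.append(text)
        else:
            # Predicted negative, actually positive
            false_negatives.append(text)
    return false_positives, false_negatives

# Toy data for illustration only
toy_texts = ["nice riddles but not very good", "dumb plot, well worth it", "loved it"]
fp_texts, fn_texts = split_errors(toy_texts, [1, 0, 1], [0, 1, 1])
print(len(fp_texts), len(fn_texts))
```

Applied to `x_test`, `y_pred`, and `y_test`, this would show whether the model fails more often on negative reviews that contain praise or on positive reviews that contain criticism.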

### Improve model

Now that we have seen what FastText can do out of the box, let's try to beat this baseline and target a prediction score of 0.90.

To do so, we are going to use FastText's pretrained embeddings and merge them into a classifier to fine-tune them for our sentiment prediction task.