# FastText Deep NLP 

Using the FastText library to build the best possible model for sentiment prediction on the IMDb dataset.

#### Authors
* Denjoy Segolene
* Le Helloco Quentin
* Sharpin Etienne

### Import FastText

In [199]:
import fasttext

### Create Dataset

We use the IMDb dataset from Hugging Face. It contains 25,000 reviews labeled positive or negative for training and 25,000 for testing.

In [71]:
from datasets import load_dataset

dataset = load_dataset("imdb")

Reusing dataset imdb (/Users/quentinlehelloco/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)



In [4]:
x_train = dataset["train"][:]["text"]
y_train = dataset["train"][:]["label"]

x_test = dataset["test"][:]["text"]
y_test = dataset["test"][:]["label"]

Let's look at the first sample to see what kind of format is used.

In [5]:
x_train[0]

'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'

In [14]:
y_train[0]

1

Each sample is a single string made up of multiple sentences. The label is 1 for positive reviews and 0 for negative ones.

### Format dataset for FastText usage

FastText expects a specific input format: a text file in which each line is one sample, prefixed with **\_\_label\_\_\<label_name\>**.
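
As a minimal sketch of that format, assuming each review is a plain Python string: the helper below (`to_fasttext_line` is a hypothetical name, not part of FastText) prefixes the label and collapses any embedded line breaks, since one line of the file must hold exactly one sample.

```python
def to_fasttext_line(text: str, label_name: str) -> str:
    """Return one FastText training line: '__label__<label_name> <text>'."""
    # Collapse newlines, tabs and repeated spaces so the sample stays on one line
    single_line = " ".join(text.split())
    return f"__label__{label_name} {single_line}"


print(to_fasttext_line("Great movie.\nWould watch again.", "positive"))
```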

In [103]:
import pandas as pd
import numpy as np

In [104]:
df = pd.DataFrame({'Value':x_train, 'Positive':y_train})
df.head()

| | Value | Positive |
|---|---|---|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | 1 |
| 1 | Homelessness (or Houselessness as George Carli... | 1 |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | 1 |
| 3 | This is easily the most underrated film inn th... | 1 |
| 4 | This is not the typical Mel Brooks film. It wa... | 1 |


#### We need to add "\__label\__positive" and "\__label\__negative" before every document

In [105]:
def add_label(s:str, p:int) -> str:
    if p == 1:
        return "__label__positive " + s
    else:
        return "__label__negative " + s 

In [106]:
df["Value"] = df.apply(lambda x: add_label(x.Value, x.Positive), axis=1)

In [107]:
df

| | Value | Positive |
|---|---|---|
| 0 | __label__positive Bromwell High is a cartoon c... | 1 |
| 1 | __label__positive Homelessness (or Houselessne... | 1 |
| 2 | __label__positive Brilliant over-acting by Les... | 1 |
| 3 | __label__positive This is easily the most unde... | 1 |
| 4 | __label__positive This is not the typical Mel ... | 1 |
| ... | ... | ... |
| 24995 | __label__negative Towards the end of the movie... | 0 |
| 24996 | __label__negative This is the kind of movie th... | 0 |
| 24997 | __label__negative I saw 'Descent' last night a... | 0 |
| 24998 | __label__negative Some films that you pick up ... | 0 |


The samples are now formatted for a FastText input file. However, the dataset lists all positive samples first, so we shuffle it to avoid biasing training with that ordering.

### Shuffle dataset

In [108]:
df = df.sample(frac=1).reset_index(drop=True)
df

| | Value | Positive |
|---|---|---|
| 0 | __label__positive This is simply a classic fil... | 1 |
| 1 | __label__negative Wow. Some movies just leave ... | 0 |
| 2 | __label__negative As an ex-teacher(!) I must c... | 0 |
| 3 | __label__positive Robert Jannuci,Luca Venantin... | 1 |
| 4 | __label__positive I would never have thought I... | 1 |
| ... | ... | ... |
| 24995 | __label__positive This project was originally ... | 1 |
| 24996 | __label__positive When I first watched this, w... | 1 |
| 24997 | __label__positive Madhur Bhandarkar goes all o... | 1 |
| 24998 | __label__negative I rented this thinking it mi... | 0 |
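
The call `df.sample(frac=1)` draws a different permutation on every run. A small sketch, assuming reproducibility matters when comparing later experiments: passing `random_state` fixes the permutation. The seed value 42 and the toy frame are arbitrary choices for illustration.

```python
import pandas as pd

# Toy frame standing in for the labelled reviews (illustrative data only)
toy = pd.DataFrame({
    "Value": ["__label__positive a", "__label__positive b",
              "__label__negative c", "__label__negative d"],
    "Positive": [1, 1, 0, 0],
})

# A fixed seed makes the shuffle repeatable across notebook runs
shuffled_once = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_twice = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_once.equals(shuffled_twice))
```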


The samples can now be written to a FastText input file.

### Put data as a file for FastText input

In [42]:
with open("FastText_input.txt", 'w+') as f:
    for values in df["Value"]:
        f.write(values + "\n")

### Create model 

FastText provides a method that creates and trains a model in one call, so let's use it directly.

In [109]:
model = fasttext.train_supervised("FastText_input.txt")

Read 5M words
Number of words:  281132
Number of labels: 2
Progress: 100.0% words/sec/thread: 2805986 lr:  0.000000 avg.loss:  0.427186 ETA:   0h 0m 0s


### Save model

Training was fast, but we can still save the model to a file so that it survives a kernel shutdown.

In [110]:
model.save_model("model.bin")

### Test model predictions

Let's test a few easy sentences to see if the model predicts them correctly.

In [111]:
model.predict("I loved this movie !")

(('__label__positive',), array([0.99994802]))

In [112]:
model.predict("I hated this movie !")

(('__label__negative',), array([0.98021179]))

In [114]:
model.predict("I really enjoyed how bad the actors played")

(('__label__negative',), array([0.86012012]))

In [113]:
model.predict("I did not enjoyed it as I should have")

(('__label__positive',), array([0.97754908]))

The model seems to classify most of the sentences correctly, but the last one is perhaps too ambiguous and is mispredicted.
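
To look beyond the top label, `predict` accepts a `k` argument that returns the `k` most likely labels with their probabilities. A small sketch, assuming the two label names used in this notebook; `positive_probability` is a hypothetical helper, and the values in the example call are made up for illustration rather than real model output.

```python
def positive_probability(labels, probs) -> float:
    """Extract P(positive) from the (labels, probabilities) pair
    returned by model.predict(text, k=2)."""
    for label, prob in zip(labels, probs):
        if label == "__label__positive":
            return float(prob)
    # The positive label was not among the returned labels
    return 0.0


# With a trained model this would be:
#   labels, probs = model.predict("I did not enjoy it as I should have", k=2)
# Illustrative values only, not real model output:
labels, probs = ("__label__negative", "__label__positive"), [0.75, 0.25]
print(positive_probability(labels, probs))
```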

### Test our accuracy

A few sentences are a good start, but using a test dataset is better. Fortunately we have one, so let's see the scores when predicting all 25,000 test samples.

In [115]:
def pred_tests(x_test:list, y_test:list, model:fasttext.FastText._FastText) -> list:
    """
    Calculate accuracy of predictions for a model
    
    Input:
        x_test : list of sentences to predict
        y_test : list of labels for x_test sentences
        model : model to predict from
        
    Output:
        List of predictions
    """
    
    preds = []
    
    number_doc = len(x_test)
    
    for i in range(number_doc):
        pred = model.predict(x_test[i])
        
        # Map the FastText label string back to the integer class (0 or 1)
        if pred[0][0] == '__label__negative':
            preds.append(0)
        else:
            preds.append(1)
                
    return preds

In [116]:
%%time
y_pred = pred_tests(x_test, y_test, model)

CPU times: user 1.59 s, sys: 65 ms, total: 1.66 s
Wall time: 1.66 s
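
`predict` also accepts a list of strings, which avoids the Python-level loop. A sketch under two assumptions: the input strings contain no newline characters (the Python binding processes one line at a time), and the label names are the two used above. `labels_to_ints` is a hypothetical helper, and the stand-in output is illustrative only.

```python
def labels_to_ints(batch_labels) -> list:
    """Map the per-sample label lists returned by model.predict(list_of_texts)
    to 0 (negative) / 1 (positive)."""
    return [0 if labels[0] == "__label__negative" else 1 for labels in batch_labels]


# With a trained model this would be:
#   batch_labels, batch_probs = model.predict([t.replace("\n", " ") for t in x_test])
# Stand-in for the first element of that result (illustrative only):
batch_labels = [["__label__positive"], ["__label__negative"], ["__label__positive"]]
print(labels_to_ints(batch_labels))
```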


In [117]:
from sklearn.metrics import classification_report

In [118]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.86      0.86     12500
           1       0.86      0.86      0.86     12500

    accuracy                           0.86     25000
   macro avg       0.86      0.86      0.86     25000
weighted avg       0.86      0.86      0.86     25000



An accuracy of 0.86 is better than the logistic regression and naive Bayes implementations. We expected this, as FastText is known to produce strong text classifiers with little effort. Now we need to see whether we can improve the model to get better scores.

First we can try to tune hyperparameters to see if FastText can offer a better model without modifying the dataset.

### Hyperparameters fitting

We could tune hyperparameters by hand, trying multiple values for the learning rate, word n-grams and number of epochs. But FastText provides an all-in-one method that searches for the best hyperparameter combination for our dataset.
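
For comparison, the manual approach could look like the small grid search below. This is only a sketch: `expand_grid` and `manual_search` are hypothetical helpers, the file paths are placeholders for a training file and a validation file in FastText format, and the candidate values are arbitrary. It relies on the `lr`, `epoch` and `wordNgrams` arguments of `train_supervised` and on `model.test`, which returns the number of samples, precision and recall.

```python
from itertools import product


def expand_grid(grid: dict) -> list:
    """Turn {'lr': [0.1, 0.5], 'epoch': [5]} into a list of keyword dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def manual_search(train_path: str, val_path: str, grid: dict):
    """Train one model per combination and keep the best by precision@1."""
    import fasttext  # imported lazily so the grid helper works without it

    best_params, best_score = None, -1.0
    for params in expand_grid(grid):
        model = fasttext.train_supervised(input=train_path, **params)
        # model.test returns (number of samples, precision@1, recall@1)
        _, precision, _ = model.test(val_path)
        if precision > best_score:
            best_params, best_score = params, precision
    return best_params, best_score


# Arbitrary candidate values, chosen only to illustrate the idea
candidate_grid = {"lr": [0.1, 0.5], "epoch": [5, 25], "wordNgrams": [1, 2]}
print(len(expand_grid(candidate_grid)))
```

Even this small grid requires eight full training runs, which is why the built-in search is attractive.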

#### Creating val and train dataset for hyperparameters tuning

In [205]:
def list_to_FastText_file(value_column:list, file_name:str):
    """
    Create a file with FastText Format from a DataFrame
    
    Input:
        value_column : column of dataFrame containing formatted values
        file_name : name of the file to create
        
    Output: None
    """
    
    with open(file_name, "w+") as f:
        for values in value_column:
            f.write(values + "\n")

FastText needs a training set and a validation set to tune the hyperparameters against the validation score. We use the first 2,000 shuffled samples for validation; the remaining 23,000 are used for training.
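
The split itself is a plain slice of the already-shuffled lines. A minimal sketch, with `split_train_val` as a hypothetical helper and toy lines standing in for the real reviews; the point is that the two parts never share a sample.

```python
def split_train_val(lines: list, val_size: int):
    """Split already-shuffled FastText lines into (validation, train) parts."""
    if not 0 < val_size < len(lines):
        raise ValueError("val_size must be between 1 and len(lines) - 1")
    return lines[:val_size], lines[val_size:]


# Toy lines standing in for the formatted reviews (illustrative only)
toy_lines = [f"__label__positive review {i}" for i in range(10)]
val_lines, train_lines = split_train_val(toy_lines, val_size=2)
print(len(val_lines), len(train_lines))
```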

In [120]:
# Create val dataset as FastText input file
list_to_FastText_file(df[:2000]["Value"],"FastText_val.txt")

In [123]:
# Create train dataset as FastText input file
list_to_FastText_file(df[2000:]["Value"], "FastText_train.txt")

#### Using FastText hyperparameters tuning method

In [170]:
tuned_model = fasttext.train_supervised(input='FastText_train.txt', autotuneValidationFile='FastText_val.txt')

Progress: 100.0% Trials:   12 Best score:  0.881500 ETA:   0h 0m 0s
Training again with best arguments
Read 5M words
Number of words:  266902
Number of labels: 2
Progress: 100.0% words/sec/thread: 4008359 lr:  0.000000 avg.loss:  0.060689 ETA:   0h 0m 0s


In [174]:
%%time
y_pred_tuned = pred_tests(x_test, y_test, tuned_model)
print(classification_report(y_test, y_pred_tuned))

              precision    recall  f1-score   support

           0       0.87      0.88      0.87     12500
           1       0.87      0.87      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000

CPU times: user 1.37 s, sys: 13.4 ms, total: 1.38 s
Wall time: 1.4 s


After 5 minutes of hyperparameter autotuning with a validation file, FastText found a best validation score of 0.88. On the test set the tuned model reaches 0.87 accuracy, which is better than our previous 0.86.

To increase this score further, we could let autotune run longer, but a better option is to preprocess the input data by removing stopwords and applying lemmatization or stemming.
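
Letting autotune run longer maps to the `autotuneDuration` argument of `train_supervised`, expressed in seconds. A minimal sketch, where `autotune_arguments` is a hypothetical helper and the 10-minute budget is an arbitrary example, not a recommendation:

```python
def autotune_arguments(train_path: str, val_path: str, minutes: int) -> dict:
    """Build keyword arguments for fasttext.train_supervised with a longer
    autotune budget (autotuneDuration is expressed in seconds)."""
    return {
        "input": train_path,
        "autotuneValidationFile": val_path,
        "autotuneDuration": minutes * 60,
    }


kwargs = autotune_arguments("FastText_train.txt", "FastText_val.txt", minutes=10)
print(kwargs["autotuneDuration"])

# With FastText installed and the two files present:
#   import fasttext
#   longer_tuned_model = fasttext.train_supervised(**kwargs)
```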

### Testing preprocessing methods

Preprocessing the data may be the most demanding part because of the large number of samples in the dataset. Applying stopword removal, stemming and/or lemmatization can take a long time to run.

First, let's create an all-in-one function that builds and trains a model, so that we only need to pass preprocessed samples as input to compute the score.

In [236]:
def full_predict(x_train:list, y_train:list ,x_test:list, y_test:list, file_name:str, autotune:bool=False, save:bool=True):
    """
    Predict samples from x_test using FastText trained model with x_train samples.
    
    Input:
        x_train: list of samples (train)
        y_train: list of labels (train)
        x_test: list of samples (test)
        y_test: list of labels (test)
        file_name: name of file created for FastText input formatting
        autotune: specify if the model needs to autotune hyperparameters with a validation dataset
        save: specify if the test dataset needs to be saved as a file
        
    Output:
        model and predictions of x_test
    """
    # Create dataframe
    df = pd.DataFrame({'Value':x_train, 'Positive':y_train})
    
    # Format samples
    df["Value"] = df.apply(lambda x: add_label(x.Value, x.Positive), axis=1)
    
    # Shuffle data
    df = df.sample(frac=1).reset_index(drop=True)
    
    # If autotune is True, create validation dataset from 20%
    if autotune:
        val_size = int(len(x_train) * 0.2)
    
        # Create file for validation set
        list_to_FastText_file(df[:val_size]["Value"],file_name + "_AutoTune_val.txt")
        
        # Create file for train set
        list_to_FastText_file(df[val_size:]["Value"], file_name + "_AutoTune_train.txt")
        
        
        # Train model
        print("Autotuning model")
        model = fasttext.train_supervised(input=file_name + "_AutoTune_train.txt",\
                                          autotuneValidationFile=file_name + "_AutoTune_val.txt")
        
    else:
        # Create file for train
        list_to_FastText_file(df["Value"], file_name + "_train.txt")
        
        model = fasttext.train_supervised(input=file_name + "_train.txt")
        
    if save:
        # Create test df
        test_df = pd.DataFrame({'Value':x_test, 'Positive':y_test})

        # Format tests samples
        test_df["Value"] = test_df.apply(lambda x: add_label(x.Value, x.Positive), axis=1)

        # Create file for test
        list_to_FastText_file(test_df["Value"], file_name + "_test.txt")
    
    # Predict test
    print("Predicting test samples")
    y_pred = pred_tests(x_test, y_test, model)
    
    # Print scores
    print(classification_report(y_test, y_pred))
    
    return (model, y_pred)

#### Now we just need to add preprocessing and train a new model

We can start by removing stopwords to see if it affects the prediction scores.

In [231]:
from multiprocessing import Pool
import functools

In [237]:
def get_FastText_score(preproc_func, file_name, remove_stopwords: bool = False, autotune: bool = False):
    """
    Description:
        Generate model and predictions from x_test with model trained on pretreated 
        x_train with preproc_func function and save as FastText input in file_name.
    
    Input:
        preproc_func: preprocessing function.
        file_name: name of file to save pretreated dataset.
        remove_stopwords: boolean to specify if stopwords need to be removed (default: False).
        autotune: boolean to specify if hyperparameters are autotuned (default: False).
        
    Output:
        FastText model and predictions of x_test.
    """
    # Apply pretreatment to x_train
    print("Process x_train")
    with Pool() as p:
        preproc_x_train = p.map(functools.partial(preproc_func, remove_stopwords=remove_stopwords), x_train)
    
    # Apply pretreatment to x_test
    print("Process x_test")
    with Pool() as p:
        preproc_x_test = p.map(functools.partial(preproc_func, remove_stopwords=remove_stopwords), x_test)
        
    return full_predict(preproc_x_train, y_train, preproc_x_test, y_test, file_name, autotune)

#### Removing stopwords

In [239]:
import preprocessing as pp
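
The `preprocessing` module is not included in this notebook. The sketch below is only a guess at what a function with the same call signature as `pp.basic` could look like. The cleaning rules and the short stop list are assumptions chosen for illustration; the log output below suggests the real module relies on NLTK's stopword list instead.

```python
import re

# Tiny illustrative stop list (assumption); not the list used by the real module
ILLUSTRATIVE_STOPWORDS = {"the", "a", "an", "is", "it", "this", "of", "and", "to"}


def basic(text: str, remove_stopwords: bool = False) -> str:
    """Lowercase, drop HTML line breaks and punctuation, optionally drop stopwords."""
    text = text.lower().replace("<br />", " ")
    # Keep letters, digits and apostrophes; everything else becomes a space
    tokens = re.sub(r"[^a-z0-9']+", " ", text).split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in ILLUSTRATIVE_STOPWORDS]
    return " ".join(tokens)


print(basic("This is a GREAT film!<br />Loved it.", remove_stopwords=True))
```

Keeping `remove_stopwords` as a keyword argument is what lets `functools.partial` bind it before the function is mapped over the samples.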

In [253]:
no_stopwords = get_FastText_score(pp.basic, "No_StopWords", remove_stopwords=True)

Process x_train


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/quentinlehelloco/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Process x_test


Read 2M words
Number of words:  73049
Number of labels: 2
Progress: 100.0% words/sec/thread: 2342250 lr:  0.000000 avg.loss:  0.307644 ETA:   0h 0m 0s


Predicting test samples
              precision    recall  f1-score   support

           0       0.88      0.87      0.88     12500
           1       0.87      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000



#### Removing stopwords + stemming

In [259]:
stemming_nostop = get_FastText_score(pp.stemming, "Stemming_NoStop", remove_stopwords=True)

Process x_train




Process x_test


Read 3M words
Number of words:  49050
Number of labels: 2
Progress: 100.0% words/sec/thread: 2737150 lr:  0.000000 avg.loss:  0.324717 ETA:   0h 0m 0s


Predicting test samples
              precision    recall  f1-score   support

           0       0.88      0.87      0.87     12500
           1       0.87      0.88      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000



#### Removing stopwords + lemmatization

In [240]:
lemming_nostop = get_FastText_score(pp.lemming, "lemming_NoStop", remove_stopwords=True)

Process x_train




Process x_test


Read 2M words
Number of words:  60661
Number of labels: 2
Progress: 100.0% words/sec/thread: 2443393 lr:  0.000000 avg.loss:  0.315808 ETA:   0h 0m 0s


Predicting test samples
              precision    recall  f1-score   support

           0       0.87      0.87      0.87     12500
           1       0.87      0.87      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000



It seems that removing stopwords alone gives the best results (0.88 accuracy, against 0.87 with stemming or lemmatization). Let's try to tune the hyperparameters for these models.


To avoid repeating the computation, we can reload the preprocessed data from the FastText input files with a small helper function.

In [278]:
def load_list_from_FastText_file(file:str) -> tuple:
    """
    Load list from FastText input file
    
    Input:
        file: name of the saved file
        
    Output:
        samples and labels
    """
    
    labels = []
    samples = []
    
    with open(file, "r") as f:
        for line in f:
            # Drop the trailing newline written by list_to_FastText_file
            line = line.rstrip("\n")
            label = -1

            # Only the prefix carries the label; the review text may contain anything
            if line.startswith('__label__positive '):
                label = 1
                line = line[len('__label__positive '):]
            elif line.startswith('__label__negative '):
                label = 0
                line = line[len('__label__negative '):]

            labels.append(label)
            samples.append(line)

    return samples, labels

#### Autotune lemmatization + stopwords removed

In [251]:
# Get pretreated datasets from file to avoid more computation.
lemming_x_train, lemming_y_train = load_list_from_FastText_file("lemming_NoStop_train.txt")
lemming_x_test, lemming_y_test = load_list_from_FastText_file("lemming_NoStop_test.txt")

In [252]:
lemming_autotune = full_predict(lemming_x_train, lemming_y_train, \
                                lemming_x_test, lemming_y_test, \
                                "lemming_NoStop", autotune=True)

Autotuning model


Progress: 100.0% Trials:   12 Best score:  0.870600 ETA:   0h 0m 0s
Training again with best arguments
Read 2M words
Number of words:  54963
Number of labels: 2
Progress: 100.0% words/sec/thread: 1474224 lr:  0.000000 avg.loss:  0.347242 ETA:   0h 0m 0s


Predicting test samples
              precision    recall  f1-score   support

           0       0.87      0.86      0.87     12500
           1       0.87      0.87      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000



#### Autotune stopwords removed

In [254]:
# Get pretreated datasets from file to avoid more computation.
stopword_x_train, stopword_y_train = load_list_from_FastText_file("No_StopWords_train.txt")
stopword_x_test, stopword_y_test = load_list_from_FastText_file("No_StopWords_test.txt")

In [258]:
stopword_autotune = full_predict(stopword_x_train, stopword_y_train, \
                                stopword_x_test, stopword_y_test, \
                                "No_Stop", autotune=True)

Autotuning model


Progress: 100.0% Trials:   10 Best score:  0.876200 ETA:   0h 0m 0s
Training again with best arguments
Read 2M words
Number of words:  66849
Number of labels: 2
Progress: 100.0% words/sec/thread: 2252626 lr:  0.000000 avg.loss:  0.332764 ETA:   0h 0m 0s


Predicting test samples
              precision    recall  f1-score   support

           0       0.88      0.87      0.87     12500
           1       0.87      0.88      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000



#### Autotune stemming + stopwords removed

In [261]:
# Get pretreated datasets from file to avoid more computation.
stemming_x_train, stemming_y_train = load_list_from_FastText_file("Stemming_NoStop_train.txt")
stemming_x_test, stemming_y_test = load_list_from_FastText_file("Stemming_NoStop_test.txt")

In [262]:
stemming_autotune = full_predict(stemming_x_train, stemming_y_train, \
                                stemming_x_test, stemming_y_test, \
                                "Stemming_NoStop", autotune=True)

Autotuning model


Progress: 100.0% Trials:   10 Best score:  0.874000 ETA:   0h 0m 0s
Training again with best arguments
Read 2M words
Number of words:  44584
Number of labels: 2
Progress: 100.0% words/sec/thread: 2595407 lr:  0.000000 avg.loss:  0.343354 ETA:   0h 0m 0s


Predicting test samples
              precision    recall  f1-score   support

           0       0.87      0.88      0.87     12500
           1       0.88      0.87      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000



### Find wrongly classified samples to analyze why it failed

To understand why some samples were predicted incorrectly, we can look at a few of them to see whether there is a pattern that may be difficult for the model to learn.

In [149]:
def find_wrong_class(x_pred:list, y_pred:list, y_true:list, nb_samples=None) -> list:
    """
    Find wrong classification of samples in x_pred based on y_pred prediction
    
    Input:
        x_pred: list of samples
        y_pred: list of predicted labels
        y_true: list of true labels
        nb_samples: max number of samples to extract (could be less if not enough in provided samples)
    """
    wrongly_classified = []
    count_sample = 0
    
    for i in range(len(y_pred)):
        if nb_samples is not None and count_sample == nb_samples:
            break
        
        if y_pred[i] != y_true[i]:
            wrongly_classified.append((x_pred[i], y_pred[i]))
            count_sample += 1
            
    return wrongly_classified
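
When inspecting errors it can help to keep the sample index and the true label next to the prediction. A compact variant, offered as a sketch (`misclassified` is a hypothetical name and the toy data is illustrative):

```python
def misclassified(samples: list, y_pred: list, y_true: list) -> list:
    """Return (index, sample, predicted, true) for every wrong prediction."""
    return [
        (i, sample, pred, true)
        for i, (sample, pred, true) in enumerate(zip(samples, y_pred, y_true))
        if pred != true
    ]


# Toy data standing in for the test reviews and their labels
toy_samples = ["good", "bad", "fine", "awful"]
errors = misclassified(toy_samples, y_pred=[1, 1, 1, 0], y_true=[1, 0, 1, 0])
print(errors)
```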

In [147]:
few_wrong = find_wrong_class(x_test, y_pred, y_test)
few_wrong[-2]

('Four things intrigued me as to this film - firstly, it stars Carly Pope (of "Popular" fame), who is always a pleasure to watch. Secdonly, it features brilliant New Zealand actress Rena Owen. Thirdly, it is filmed in association with the New Zealand Film Commission. Fourthly, a friend recommended it to me. However, I was utterly disappointed. The whole storyline is absurd and complicated, with very little resolution. Pope\'s acting is fine, but Owen is unfortunately under-used. The other actors and actresses are all okay, but I am unfamiliar with them all. Aside from the nice riddles which are littered throughout the movie (and Pope and Owen), this film isn\'t very good. So the moral of the story is...don\'t watch it unless you really want to.',
 1)

In [151]:
few_wrong[42]

("The plot of this movie is as dumb as a bag of hair. Jimmy Smit plays a character that could have been upset by the ridiculousness of the story. He is evil and a wife beater. It's a character as far from his NYPD and LA Law roles as you could possibly get.<br /><br />If you've thought he had the looks and the acting chops to play the really bad boy role, her's your present.<br /><br />But!!!!!!!! Mary Louis Parker wears black miniskirts and little black minidresses throughout the movie.<br /><br />She has always had some of the greatest legs in the history of the movies. This makes the movie well worth it for this leg admirer.<br /><br />I'd buy the DVD for this reason only if it was available.",
 0)

Looking at these two mispredicted samples, we can see that they do not clearly state whether the reviewer liked the movie; the opinion is only implied through complicated wording and sentence structure.

### Improve model

Now that we have seen all that we could do with FastText, let's try to beat the baseline and target a prediction score of 0.90.

To do so, we are going to use FastText's pretrained embeddings and feed them into a classifier trained for our sentiment prediction task.

In [281]:
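from nltk.tokenize import word_tokenize
import fasttext.util  # the util submodule must be imported explicitly

# Download the pretrained English model if it does not already exist
fasttext.util.download_model('en', if_exists='ignore')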

In [273]:
ft = fasttext.load_model('cc.en.300.bin')



In [295]:
def get_sample_vector(sample:str):
    """
    Get the average vector of all words in sample from pretrained FastText model
    
    Input:
        sample: string containing words
        
    Output:
        average vector for a sample
    """
    
    # Tokenize sample in single words
    words = word_tokenize(sample)
        
    # Create list of vectors
    vectors = []
        
    for w in words:
        vectors.append(ft.get_word_vector(w))
            
    # Make average vector
    vectors = np.asarray(vectors)
    
    average = np.average(vectors, axis=0)
        
    return average
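
One edge case is worth guarding against: if a sample is empty after preprocessing, there are no word vectors, and averaging an empty array yields NaN values. A sketch of a defensive averaging step, assuming a zero vector is an acceptable fallback; `average_vectors` is a hypothetical helper and the toy vectors are illustrative.

```python
import numpy as np


def average_vectors(vectors: list, dim: int = 300) -> np.ndarray:
    """Average a list of word vectors; return a zero vector for an empty list."""
    if len(vectors) == 0:
        # Averaging an empty array would produce NaN values
        return np.zeros(dim, dtype=np.float32)
    return np.mean(np.asarray(vectors, dtype=np.float32), axis=0)


# Stand-in 3-dimensional "word vectors" (illustrative values only)
toy_vectors = [np.array([1.0, 0.0, 2.0]), np.array([3.0, 2.0, 0.0])]
print(average_vectors(toy_vectors, dim=3))
print(average_vectors([], dim=3))
```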

Now that we can get an average embedding for a sample, let's compute it for the whole dataset and use the result as the input of a classifier for our sentiment prediction.

In [300]:
%%time
# Get all dataset embedding average vectors
embedding_x_train = []
embedding_y_train = stopword_y_train

for samples in stopword_x_train:
    embedding_x_train.append(get_sample_vector(samples))

CPU times: user 1min 32s, sys: 58.4 s, total: 2min 30s
Wall time: 7min 8s


In [306]:
%%time
# Get all test dataset embedding average vectors
embedding_x_test = []
embedding_y_test = stopword_y_test


for samples in stopword_x_test:
    embedding_x_test.append(get_sample_vector(samples))

CPU times: user 1min 19s, sys: 26.9 s, total: 1min 46s
Wall time: 3min 47s


In [313]:
embedding_x_train = np.asarray(embedding_x_train)
embedding_x_test = np.asarray(embedding_x_test)

embedding_x_train.shape, embedding_x_test.shape

((25000, 300), (25000, 300))

In [312]:
# Just to be sure not to lose our data
np.save("save_embedding_xtrain", embedding_x_train)
np.save("save_embedding_xtest", embedding_x_test)
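
To reload the arrays in a later session, note that `np.save` appends the `.npy` extension when the file name does not already have it. A short round-trip sketch on a toy array in a temporary directory (the file name is a placeholder):

```python
import os
import tempfile

import numpy as np

# Stand-in array; in the notebook this would be embedding_x_train
toy_embeddings = np.arange(6, dtype=np.float32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path_without_extension = os.path.join(tmp_dir, "save_embedding_demo")
    np.save(path_without_extension, toy_embeddings)
    # np.save appended ".npy", so the file must be loaded with that suffix
    restored = np.load(path_without_extension + ".npy")

print(np.array_equal(toy_embeddings, restored))
```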

Now we can use a classic classifier to predict the sentiment from these sample vectors.

In [315]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

In [316]:
# Create pipeline for SVM classifier
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(embedding_x_train, embedding_y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(gamma='auto'))])

In [317]:
clf.score(embedding_x_test, embedding_y_test)

0.84056

We get an accuracy of 0.84, which is worse than when using FastText directly.

To improve this score, we could look for another way to combine the embeddings of the words in a sample, or use another type of model such as logistic regression or a random forest.
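
As one possible alternative to a plain average, the sketch below concatenates the element-wise mean and maximum of the word vectors, which doubles the feature size. This is an assumption about what might help, not something measured in this notebook; `mean_max_pool` is a hypothetical helper and the matrix is toy data.

```python
import numpy as np


def mean_max_pool(word_vectors: np.ndarray) -> np.ndarray:
    """Concatenate the element-wise mean and max of a (n_words, dim) matrix."""
    return np.concatenate([word_vectors.mean(axis=0), word_vectors.max(axis=0)])


# Stand-in (n_words=3, dim=2) matrix; real FastText vectors would have dim=300
toy_matrix = np.array([[1.0, -1.0], [3.0, 0.0], [2.0, 4.0]])
print(mean_max_pool(toy_matrix))
```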