# Logistique Regression model  (TP n°1)

## Import

First of all, we will need several python packages in our method :

- dataset : we need it to download the dataset
- math : we need it for the log function
- sklearn : we will use the __LogisticRegression__ function to create our model and __precision_recall_fscore_support__ to calculate the score of our models
- numpy : to define the random seed to make the results reproducible
- panda : to manipulate the dataset
- typing : to type our functions
- nltk and spacy : for lemming and stemming steps
- re : for regex manipulation

In [1]:
import datasets
import math
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
import numpy as np
import pandas as pd
from typing import List
from typing import Callable
from nltk import tokenize
from nltk.stem.snowball import SnowballStemmer
import spacy
import re

## The dataset

We download the dataset :

In [2]:
dataset = datasets.load_dataset('imdb')

Reusing dataset imdb (/home/leherlemaxime/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


  0%|          | 0/3 [00:00<?, ?it/s]

Now take a look at the dataset format :

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

So we can see that our dataset is composed of 2 parts, a __train__ part and a __test__ part which both contain 25000 elements. Each element of our dataset has 2 components, the first is the text and the second is the corresponding label. 

And we can also see what an element of our dataset looks like :

In [4]:
dataset["train"][42]

{'text': 'Titanic has to be one of my all-time favorite movies. It has its problems (what movies don\'t) but still, it\'s enjoyable.<br /><br />When I stumble across someone who asks me why I like Titanic, I suppose my first reaction is "wait a minute, you don\'t?" I know so many people who don\'t like this movie, and I\'m not saying I don\'t see why. "The love story is too cheesy" well, yes but isn\'t it enjoyable and moving? All right, the love story between Jack and Rose is very unrealistic, everyone knows that love like this doesn\'t actually exist. But this is a movie, doesn\'t everyone enjoy watching a beautiful story that lets us slip slightly into fantasy for a while? The next complaint, DiCaprio and Winslet are terrible actors. Well, OK, in this movie, I agree that they do not perform to their full potentials. However I think it\'s unfair to say that they are terrible actors. I personally think they are both very talented actors who unfortunately are very famous for a movie th

We also need to define the types that we will use in the typing of our functions.

In [5]:
feature_type = List[int]
DataFrame = pd.core.frame.DataFrame
list_of_words = List[str]

## Random

We set the random seed to to ensure that our results are reproducible.

In [6]:
random_seed = 38
random = np.random.default_rng(random_seed)

## First model

### Creation of the features

For this model we will need to create features from our text to use as input to our logistic regression. We will use the following features:

- 1 if "no" appear in the doc, 0 otherwise
- The count of first and second pronouns in the document
- The count of "." => number of classic sentences
- 1 if "!" is in the document, 0 otherwise
- log(word count in the document)
- Number of words in the document which are in the positive lexicon
- Number of words in the document which are in the negative lexicon
- Number of words in the document wich are in the lexicon but netral

So we have used the basic features proposed then we have added 2 new features and we are now going to explain our choice.
For the feature __sentences count__ we noticed that often the positive criticism contained shorter sentences, indeed they are less detailed than the sentences with negative conotation so that's why we wanted to add this feature.  And then for __netral words__ we decided to choose this because we already use the number of words but here we use the number of interesting words that add meaning to the text.

In [7]:
def word_array_to_feature(word_array : list_of_words) -> feature_type:
    feature = []
    
    ''' No feature '''
    
    if ("no" in word_array):
        feature.append(1)
    else:
        feature.append(0)
        
    ''' Pronouns feature '''
    
    valid_pronouns = ["i", "me", "my", "mine", "myself", "you", "your", "yours", "yourself", "we", "us", "our", "ourselves"]
    pronouns_count = 0
    for elem in word_array:
        if (elem in valid_pronouns):
            pronouns_count += 1
            
    feature.append(pronouns_count)
    
    ''' "." feature '''
    
    feature.append(word_array.count("."))
            
    ''' ! feature '''
        
    if ("!" in word_array):
        feature.append(1)
    else:
        feature.append(0)
        
    ''' log(nb_word) feature '''
    
    feature.append(math.log(len(word_array)))
    
    ''' positive, negative and neutral feature '''
    positive_count = 0
    negative_count = 0
    netral_count = 0
    
    for elem in word_array :
        if ((elem in dict_value) and (float(dict_value[elem]) >= 1.5)):
            positive_count += 1
        elif ((elem in dict_value) and (float(dict_value[elem]) <= -1.5)):
            negative_count += 1
        elif ((elem in dict_value)):
            netral_count += 1
            
    feature.append(positive_count)
    feature.append(negative_count)
    feature.append(netral_count)
    
    return feature

### Create the lexicon

We have speak about a lexcion so now we need to create it.

In [8]:
''' Open and read all line of the file '''
file = open("vader_lexicon.txt", "r", encoding="utf-8")
lines = file.readlines()
file.close()

''' We define a list of all our line but in the good format '''
good_format = []

''' For each line we replace all tablutaion by space and we split to get the 2 first elements '''
for line in lines:
    line_temp = line.replace("\t", " ")
    line_temp = line_temp.split(" ")
    line_temp = line_temp[:2]
    good_format.append(line_temp)

''' Now we create a dict from our list to use it after '''
dict_value = dict(good_format)

### Formatting the dataset

We need a function that can take our dataset and make it usable for our other function.

In [9]:
def dataset_to_array(dataset : DataFrame) -> (List[str], List[int], List[str], List[int]):
    '''
        This function take the dataset and create and array from this dataset
        
        Parameters :
                dataset (dataset): it's the input dataset thta we want to format in array
                
        Returns:
                x_train (list[str) : We have all the list avec all word of text in the train part of the datatset
                y_train (list[int]) : We have a list of all label of text in the train part  of the dataset
                x_test (list[str]) : We have all the list avec all word of text in the test part of the datatset
                y_test (list[int]) : We have a list of all label of text in the test part  of the dataset
    '''
    x_train = []
    y_train = []
    x_test = []
    y_test = []
    
    for elem in dataset["train"]:
        y_train.append(elem["label"])
        x_train.append(elem["text"])
        
    for elem in dataset["test"]:
        y_test.append(elem["label"])
        x_test.append(elem["text"])
        
    return x_train, y_train, x_test, y_test

In [10]:
%%time
x_train, y_train, x_test, y_test = dataset_to_array(dataset)

CPU times: user 18.3 s, sys: 2.69 s, total: 21 s
Wall time: 20.7 s


We can see that formatting the dataset don't take a lot's of time : 2.22 seconds

In [11]:
%%time
x_train_feature = [word_array_to_feature(elem.split(" ")) for elem in x_train]
x_test_feature = [word_array_to_feature(elem.split(" ")) for elem in x_test]

CPU times: user 12.1 s, sys: 2.17 ms, total: 12.1 s
Wall time: 12.1 s


The creation of features just take 3.34 seconds wich is fast. 

We create a __LogisticRegression__ model from sikitleran and we train it with the train data.
We also set the random_state to the variable __random_seed__ to control the random and make the result reproducible.

In [12]:
%%time
clf = LogisticRegression(random_state=random_seed).fit(x_train_feature, y_train)

CPU times: user 1.54 s, sys: 1.41 s, total: 2.95 s
Wall time: 411 ms


The training time is really fast.

#### Result of the first model

In [13]:
y_pred = clf.predict(x_test_feature)

In [14]:
def print_formated_result(result):
    print("For labbel 0 :")
    print("\t - number of entry : ", result[3][0])
    print("\t - precision : ", result[0][0])
    print("\t - recall : ", result[2][0])
    print("\t - fbeta_score : ", result[1][0])
    print("\n\nFor labbel 1 :")
    print("\t - number of entry : ", result[3][1])
    print("\t - precision : ", result[0][1])
    print("\t - recall : ", result[2][1])
    print("\t - fbeta_score : ", result[1][1])

In [15]:
print_formated_result(precision_recall_fscore_support(y_test, y_pred))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312


First of all we can see that we have 12500 elements of each labels.

So we can see that the results are quite correct. We have a precission and a recall between __0,68__ and __0,70__ on each label. The fbeta_score, which is the weighted harmonic mean of precision and recall, is of __0,68__ for label __0__ and __0.70__ for label __1__.

## Pre-processing

### Clean the text

As we can see in the text of the elements there are still some formatting characters in fact there are still __"\t"__, __";"__ and so on.
So we will define a list of words and formatting characters that we will remove from the text. Moreover we will also put spaces before and after the __"!"__ so that they are considered as words.

In [16]:
def clean_the_text(text_array : str) -> List[str]:
    '''
        This function return a list of all word and char in the text in parameters.

            Parameters:
                    text_array (str): The text in a string format.

            Returns:
                    result_array (list[str]) : A list with all the word and char in the inpt text.
    '''
    
    specialChars = "()\\\'',;:\"?-" 
    for specialChar in specialChars:
        text_array = text_array.replace(specialChar, ' ')
        
    text_array = text_array.replace(".", " . ")
    text_array = text_array.replace("/>", ' ')
    text_array = text_array.replace("<br", ' ')
    ''' As say before we add space before and after '!' for the split function '''
    text_array = text_array.replace("!", " ! ")
    
    
    return text_array.lower()

In [17]:
def dataset_to_array_clean(dataset : DataFrame) -> (List[str], List[int], List[str], List[int]):
    '''
        This function take the dataset and create and array from this dataset
        
        Parameters :
                dataset (dataset): it's the input dataset thta we want to format in array
                
        Returns:
                x_train (list[str]) : We have all the list avec all word of text in the train part of the datatset
                y_train (list[int]) : We have a list of all label of text in the train part  of the dataset
                x_test (list[str]) : We have all the list avec all word of text in the test part of the datatset
                y_test (list[int]) : We have a list of all label of text in the test part  of the dataset
    '''
    x_train = []
    y_train = []
    x_test = []
    y_test = []
    
    for elem in dataset["train"]:
        y_train.append(elem["label"])
        x_train.append(clean_the_text(elem["text"]))
        
    for elem in dataset["test"]:
        y_test.append(elem["label"])
        x_test.append(clean_the_text(elem["text"]))
        
    return x_train, y_train, x_test, y_test

In [18]:
%%time
x_train_clean, y_train_clean, x_test_clean, y_test_clean = dataset_to_array(dataset)

CPU times: user 6.52 s, sys: 302 ms, total: 6.82 s
Wall time: 6.71 s


In [19]:
%%time
x_train_feature_clean = [word_array_to_feature(elem.split(" ")) for elem in x_train_clean]
x_test_feature_clean = [word_array_to_feature(elem.split(" ")) for elem in x_test_clean]

CPU times: user 10.9 s, sys: 0 ns, total: 10.9 s
Wall time: 10.9 s


In [20]:
%%time
clf_clean = LogisticRegression(random_state=random_seed).fit(x_train_feature_clean, y_train_clean)

CPU times: user 1.64 s, sys: 1.13 s, total: 2.76 s
Wall time: 383 ms


Overall we can see that cleaning the text adds computational time but it remains negligible.

Now take a look at result.

In [21]:
y_pred_clean = clf_clean.predict(x_test_feature_clean)

In [22]:
print_formated_result(precision_recall_fscore_support(y_test_clean, y_pred_clean))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312


We can see with the scores that the results do not change the results, in fact it only modifies the features a little and it is not enough to be able to modify the prediction of an element, so we will not keep this optimization for the rest of our notebook.

### Stemming

We will now try to use stemming on our text to see if it changes our results.

In [23]:
re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")

stem_word: Callable[[str], str] = lambda w : stemmer.stem(w.lower()) if re_word.match(w) else w

In [24]:
def stemmed_text(text : str) -> str:
    '''
        This function steeming the text in parameter and return in
        
        Parameters :
                text (str) : the text to stemming
                
        Returns :
                return_text (str) : the text stemmed
    '''
    list_of_words = text.split(" ")
    
    list_of_words = [stem_word(word) for word in list_of_words]
    
    return_text = " ".join(list_of_words)
    
    return return_text

Now define an other function for this optimisation.

In [25]:
def dataset_to_array_stemmed(dataset : DataFrame) -> (List[list_of_words], List[int], List[list_of_words], List[int]):
    '''
        This function take the dataset and create and array from this dataset
        
        Parameters :
                dataset (dataset): it's the input dataset thta we want to format in array
                
        Returns:
                x_train (list[list[str]]) : We have all the list avec all word of text in the train part of the datatset
                y_train (list[int]) : We have a list of all label of text in the train part  of the dataset
                x_test (list[list[str]]) : We have all the list avec all word of text in the test part of the datatset
                y_test (list[int]) : We have a list of all label of text in the test part  of the dataset
    '''
    x_train = []
    y_train = []
    x_test = []
    y_test = []
    
    for elem in dataset["train"]:
        y_train.append(elem["label"])
        x_train.append(stemmed_text(elem["text"]))
        
    for elem in dataset["test"]:
        y_test.append(elem["label"])
        x_test.append(stemmed_text(elem["text"]))
        
    return x_train, y_train, x_test, y_test

In [26]:
%%time
x_train_stemmed, y_train_stemmed, x_test_stemmed, y_test_stemmed = dataset_to_array_stemmed(dataset)

CPU times: user 2min 37s, sys: 224 ms, total: 2min 37s
Wall time: 2min 37s


In [27]:
%%time
x_train_feature_stemmed = [word_array_to_feature(elem.split(" ")) for elem in x_train_stemmed]
x_test_feature_stemmed = [word_array_to_feature(elem.split(" ")) for elem in x_test_stemmed]

CPU times: user 9 s, sys: 0 ns, total: 9 s
Wall time: 8.99 s


In [28]:
%%time
clf_stemmed = LogisticRegression(random_state=random_seed).fit(x_train_feature_stemmed, y_train_stemmed)

CPU times: user 643 ms, sys: 551 ms, total: 1.19 s
Wall time: 184 ms


We can now test out new optimisation and see if it's change the result.

In [29]:
y_pred_stemmed = clf_stemmed.predict(x_test_feature_stemmed)

In [30]:
print_formated_result(precision_recall_fscore_support(y_test_stemmed, y_pred_stemmed))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6702224427354668
	 - recall :  0.6591306469320538
	 - fbeta_score :  0.6484


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6594871000232432
	 - recall :  0.6700515605935372
	 - fbeta_score :  0.68096


So we can see that once again the computation time is increased but in a rather negligible way because we remain in very short computation times. On the other hand our results are less good indeed that comes from the fact that the stemming changes the words but can thus not make them match with the dictionary we will thus have to create a new dictionary to match with the stemming.

In [31]:
''' Open and read all line of the file '''
file = open("vader_lexicon.txt", "r", encoding="utf-8")
lines = file.readlines()
file.close()

''' We define a list of all our line but in the good format '''
good_format_stemming = []

''' For each line we replace all tablutaion by space and we split to get the 2 first elements '''
for line in lines:
    line_temp = line.replace("\t", " ")
    line_temp = line_temp.split(" ")
    line_temp = line_temp[:2]
    line_temp[0] = stem_word(line_temp[0])
    good_format_stemming.append(line_temp)

''' Now we create a dict from our list to use it after '''
dict_value_stemming = dict(good_format_stemming)

In [32]:
def word_array_to_feature_stemming(word_array : list_of_words) -> feature_type:
    feature = []
    
    ''' No feature '''
    
    if ("no" in word_array):
        feature.append(1)
    else:
        feature.append(0)
        
    ''' Pronouns feature '''
    
    valid_pronouns = ["i", "me", "my", "mine", "myself", "you", "your", "yours", "yourself", "we", "us", "our", "ourselves"]
    pronouns_count = 0
    for elem in word_array:
        if (elem in valid_pronouns):
            pronouns_count += 1
            
    feature.append(pronouns_count)
    
    ''' "." feature '''
    
    feature.append(word_array.count("."))
            
    ''' ! feature '''
        
    if ("!" in word_array):
        feature.append(1)
    else:
        feature.append(0)
        
    ''' log(nb_word) feature '''
    
    feature.append(math.log(len(word_array)))
    
    ''' positive, negative and neutral feature '''
    positive_count = 0
    negative_count = 0
    netral_count = 0
    
    for elem in word_array :
        if ((elem in dict_value_stemming) and (float(dict_value_stemming[elem]) >= 1.5)):
            positive_count += 1
        elif ((elem in dict_value_stemming) and (float(dict_value_stemming[elem]) <= -1.5)):
            negative_count += 1
        elif ((elem in dict_value_stemming)):
            netral_count += 1
            
    feature.append(positive_count)
    feature.append(negative_count)
    feature.append(netral_count)
    
    return feature

In [33]:
%%time
x_train_feature_stemmed = [word_array_to_feature_stemming(elem.split(" ")) for elem in x_train_stemmed]
x_test_feature_stemmed = [word_array_to_feature_stemming(elem.split(" ")) for elem in x_test_stemmed]

CPU times: user 8.88 s, sys: 0 ns, total: 8.88 s
Wall time: 8.88 s


In [34]:
%%time
clf_stemmed = LogisticRegression(random_state=random_seed).fit(x_train_feature_stemmed, y_train_stemmed)

CPU times: user 1.04 s, sys: 752 ms, total: 1.79 s
Wall time: 258 ms


Now let's see the results with the modified dictionary

In [35]:
y_pred_stemmed = clf_stemmed.predict(x_test_feature_stemmed)

In [36]:
print_formated_result(precision_recall_fscore_support(y_test_stemmed, y_pred_stemmed))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.692652329749104
	 - recall :  0.6863900548918308
	 - fbeta_score :  0.68024


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6858692235146181
	 - recall :  0.6919600380589915
	 - fbeta_score :  0.69816


We can see that the results are of the same order of validation but as the claculus time is not significantly longer we will keep this optimization.

### Lemming

We will now try to use lemming on our text to see if it changes our results.

In [37]:
!python -m spacy download en_core_web_sm > output_dl.txt

In [38]:
# loading the small English model
nlp = spacy.load("en_core_web_sm")

So we can now create the new dict and new feature function.

In [39]:
''' Open and read all line of the file '''
file = open("vader_lexicon.txt", "r", encoding="utf-8")
lines = file.readlines()
file.close()

''' We define a list of all our line but in the good format '''
good_format_lemming = []

''' For each line we replace all tablutaion by space and we split to get the 2 first elements '''
for line in lines:
    line_temp = line.replace("\t", " ")
    line_temp = line_temp.split(" ")
    line_temp = line_temp[:2]
    line_temp[0] = nlp(line_temp[0])
    good_format_lemming.append(line_temp)

''' Now we create a dict from our list to use it after '''
dict_value_lemming = dict(good_format_lemming)

In [40]:
def word_array_to_feature_lemming(word_array : list_of_words) -> feature_type:
    feature = []
    
    ''' No feature '''
    
    if ("no" in word_array):
        feature.append(1)
    else:
        feature.append(0)
        
    ''' Pronouns feature '''
    
    valid_pronouns = ["i", "me", "my", "mine", "myself", "you", "your", "yours", "yourself", "we", "us", "our", "ourselves"]
    pronouns_count = 0
    for elem in word_array:
        if (elem in valid_pronouns):
            pronouns_count += 1
            
    feature.append(pronouns_count)
    
    ''' "." feature '''
    
    feature.append(word_array.count("."))
            
    ''' ! feature '''
        
    if ("!" in word_array):
        feature.append(1)
    else:
        feature.append(0)
        
    ''' log(nb_word) feature '''
    
    feature.append(math.log(len(word_array)))
    
    ''' positive, negative and neutral feature '''
    positive_count = 0
    negative_count = 0
    netral_count = 0
    
    for elem in word_array :
        if ((elem in dict_value_lemming) and (float(dict_value_lemming[elem]) >= 1.5)):
            positive_count += 1
        elif ((elem in dict_value_lemming) and (float(dict_value_lemming[elem]) <= -1.5)):
            negative_count += 1
        elif ((elem in dict_value_lemming)):
            netral_count += 1
            
    feature.append(positive_count)
    feature.append(negative_count)
    feature.append(netral_count)
    
    return feature

We can now define an other function that take this time the lemming transformation instead of stemming.

In [41]:
def dataset_to_array_lemmed(dataset : DataFrame) -> (List[list_of_words], List[int], List[list_of_words], List[int]):
    '''
        This function take the dataset and create and array from this dataset
        
        Parameters :
                dataset (dataset): it's the input dataset thta we want to format in array
                
        Returns:
                x_train (list[list[str]]) : We have all the list avec all word of text in the train part of the datatset
                y_train (list[int]) : We have a list of all label of text in the train part  of the dataset
                x_test (list[list[str]]) : We have all the list avec all word of text in the test part of the datatset
                y_test (list[int]) : We have a list of all label of text in the test part  of the dataset
    '''
    x_train = []
    y_train = []
    x_test = []
    y_test = []
    
    for elem in dataset["train"]:
        y_train.append(elem["label"])
        x_train.append(' '.join([token.lemma_ for token in nlp(elem["text"])]))
        
    for elem in dataset["test"]:
        y_test.append(elem["label"])
        x_test.append(' '.join([token.lemma_ for token in nlp(elem["text"])]))
        
    return x_train, y_train, x_test, y_test

In [42]:
%%time
x_train_lemmed, y_train_lemmed, x_test_lemmed, y_test_lemmed = dataset_to_array_lemmed(dataset)

CPU times: user 22min 37s, sys: 4.84 s, total: 22min 41s
Wall time: 22min 42s


In [43]:
%%time
x_train_feature_lemmed = [word_array_to_feature_lemming(elem.split(" ")) for elem in x_train_lemmed]
x_test_feature_lemmed = [word_array_to_feature_lemming(elem.split(" ")) for elem in x_test_lemmed]

CPU times: user 4.27 s, sys: 20.6 ms, total: 4.29 s
Wall time: 4.28 s


In [44]:
%%time
clf_lemmed = LogisticRegression(random_state=random_seed).fit(x_train_feature_lemmed, y_train_lemmed)

CPU times: user 629 ms, sys: 400 ms, total: 1.03 s
Wall time: 160 ms


Add now the results.

In [45]:
y_pred_lemmed = clf_lemmed.predict(x_test_feature_lemmed)

In [46]:
print_formated_result(precision_recall_fscore_support(y_test_lemmed, y_pred_lemmed))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.5998505976095617
	 - recall :  0.46931618936294567
	 - fbeta_score :  0.38544


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.5472654408297972
	 - recall :  0.630242975430976
	 - fbeta_score :  0.74288


So we can see that the results are 

## Regularization

We will use the different regulation on the first model to comprare the performances.

### regularization L1

In [47]:
%%time
clf_l1 = LogisticRegression(random_state=random_seed).fit(x_train_feature, y_train)
y_pred_l1 = clf_l1.predict(x_test_feature)
print_formated_result(precision_recall_fscore_support(y_test, y_pred_l1))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312
CPU times: user 1.34 s, sys: 1.09 s, total: 2.44 s
Wall time: 318 ms


### regularization L2

In [48]:
%%time
clf_l2 = LogisticRegression(random_state=random_seed).fit(x_train_feature, y_train)
y_pred_l2 = clf_l2.predict(x_test_feature)
print_formated_result(precision_recall_fscore_support(y_test, y_pred_l2))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312
CPU times: user 1.42 s, sys: 1.17 s, total: 2.59 s
Wall time: 324 ms


### Regularization Elasticnet

In [49]:
%%time
clf_elasticnet = LogisticRegression(random_state=random_seed).fit(x_train_feature, y_train)
y_pred_elasticnet = clf_elasticnet.predict(x_test_feature)
print_formated_result(precision_recall_fscore_support(y_test, y_pred_elasticnet))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312
CPU times: user 1.33 s, sys: 1.15 s, total: 2.49 s
Wall time: 311 ms


### No regularization

In [50]:
clf_without_regularization = LogisticRegression(random_state=random_seed).fit(x_train_feature, y_train)
y_pred_without_regularization = clf_without_regularization.predict(x_test_feature)
print_formated_result(precision_recall_fscore_support(y_test, y_pred_without_regularization))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312


So we can see that the regularization doesn't change the result, so we'll stay on the proposed default regularization function.

## Conclusion

Our better results is only with the clean text optimiszation.

In [51]:
print_formated_result(precision_recall_fscore_support(y_test_clean, y_pred_clean))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312


We can now check why some text are wrongly classified :

In [52]:
wrong_0_index = []
wrong_1_index = []

for i in range(len(y_pred_clean)):
    if ((y_pred_clean[i] == 0) and (y_test_clean[i] == 1)):
        wrong_0_index.append(i)
    elif ((y_pred_clean[i] == 1) and (y_test_clean[i] == 0)):
        wrong_1_index.append(i)

In [53]:
len(wrong_0_index)

3711

In [54]:
len(wrong_1_index)

3980

2 examples of wrong 0 label :

In [55]:
dataset["test"][int(random.choice(wrong_0_index))]["text"]

'GUTS OF A BEAUTY is a bit better than its predecessor GUTS OF A VIRGIN. Although this film isn\'t really a sequel in the sense that it has absolutely nothing to do with the first installment, I did find BEAUTY to be a little stronger and better put together all-the-way-around than VIRGIN...but then again, that\'s not really saying much.<br /><br />BEAUTY starts off as a pretty rough and straight-faced exploit film. A couple of Yakuza cats are holding a young woman prisoner and begin gang raping her in pretty brutal fashion. As this nastiness is going on, the head guy tells the girl that they did the same to her sister and sold her into slavery in Africa, and that they\'re gonna do the same to her. They then shoot her up with some drugs and rape her some more. She somehow gets away and ends up at a clinic where the nurse there listens to her sob story. The rapee ends up freaking out from the stress of her prior experience and commits suicide. The clinic worker, moved by the young lady\

In [56]:
dataset["test"][int(random.choice(wrong_0_index))]["text"]

"With documentary films, the question of realism always crops up. How much of the film is real and how much is manipulated by the film maker? In LITTLE DIETER NEEDS TO FLY, Herzog is far too absorbed in telling the story of a man telling his own story to even address the question of realism versus formalism. From the beginning, Herzog's role as storyteller is obvious. Luckily, he is a master storyteller. LITTLE DIETER is the finest, most engaging documentary I have ever seen. Dieter's story is enthralling, and Herzog's efforts at reenactment, putting Dieter through the paces of reliving his story on location while it is being filmed, are very effective. The story that Dieter tells is real, but Herzog is ever-present, wrenching absurdist commentary from the realism. This film is a must-see for any students of documentary film and/or of Werner Herzog."

2 examples of wrong 1 label :

In [57]:
dataset["test"][int(random.choice(wrong_1_index))]["text"]

"This Asterix is very similar to modern Disney cartoons. Soulless, technically good and the usual in-jokes for adults. Maybe it's because this is the first cartoon I watched after Laputa: Castle in the Sky, but it was quite disappointing.<br /><br />The plot is contrived and forgettable but it involves Asterix and Obelix going to the Viking's territory to rescue a spoilt teenager who then learns humility and finds love as well. Oh and initially they don't get on but after facing adversity they all share a deep bond of friendship... yadda yadda.<br /><br />The best bit is to watch out for the little jokes. The Vikings get all the best ones. Such as Vikea (the Viking's chief's wife) giving a list of furniture and skulls to bring back from the next raid. Or the Vikings not knowing the meaning of mercy (literally). Oh, and Olaf the dumbest Viking is actually hilarious (as much for the voice acting as the dialogue).<br /><br />For example, aboard the Viking ship: (After a speech by Abba, th

In [58]:
dataset["test"][int(random.choice(wrong_1_index))]["text"]

'(As a note, I\'d like to say that I saw this movie at my annual church camp, where the entire youth group laughed at it. I bought it when I saw it on a shelf one year later, if only for the humor I derived from a bad attempt at making an evangelical movie.)<br /><br />Lay it Down falls short of many movie fans\' expectations on several different planes. Most of the problems lie within the impersonal acting. Regardless of the nice cars in the film, or the truth in Christ\'s sacrifice for you, as a movie AND witnessing tool, Lay it Down hardly delivers. <br /><br />Most good opinions of the movies are supported by Christians agreeing with the message. While it\'s easy for a Christian to agree with the points delivered, the audience hardly ever witnesses life outside a cliché. The fighting scene between Ben and his brother is horribly dubbed. And there are at least three blatant typos in the subtitles.<br /><br />I encourage anyone to watch the movie a second time with the director\'s co

With these examples, and others that we have studied, we can easily see some problems that our classifier may have:

- It does not detect irony in sentences, i.e. if a person uses irony there is a high chance that his text will be classified in the other category.
- It also doesn't detect false compliments, for example if someone says "This is a great ninja movie for kids" it will just keep the positive side when it can mean that it is an interesting movie when you have a childish vision of the story.
- When critics use certain expressions, it can also be a misclassification, often expressions with several meanings that are not necessarily well understood.
- When the text is too descriptive, it takes over the possible short passages with the emphasis on the feeling and so these cases are at the border of the 2 classes and so are sometimes misclassified.

These are the major problems, there are surely other less visible ones that we have not seen yet. But overall we have a result close to 70% which for a method that only takes a few minutes to execute is quite correct from our point of view.