# Logistique Regression model  (TP n°1)

## Import

First of all we need to import several packages.

- dataset : we need it to download the database
- math : we need it for several math function
- sklearn : we need 'LogisticRegression' to compute it on the data and we also need 'precision_recall_fscore_support'for compute  and see results of our model. 
- numpy : for set the random seed to make result reproducible
- pandas : for dataframe manipulation
- typing : for type the function
- nltk and spacy : for lemming and stemming
- re : for use the regex expression

In [45]:
import datasets
import math
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
import numpy as np
import pandas as pd
from typing import List
from typing import Callable
from nltk import tokenize
from nltk.stem.snowball import SnowballStemmer
import spacy
import re

## The dataset

We download the dataset :

In [2]:
dataset = datasets.load_dataset('imdb')

Reusing dataset imdb (/home/leherlemaxime/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


  0%|          | 0/3 [00:00<?, ?it/s]

Now take a lokk at the dataset format :

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

So we can see that the dataset is compose of 2 part "train" and "test". Each part have the same numbers of elements 25000. All element have 2 component, the first id the text and the second is the label : 0 for positive element and 0 for negative element So we will use the "train" part to train our model and the "test" part to check the performance of our model.

We can see how look at a element of our dataset :

In [4]:
dataset["train"][42]

{'text': 'Titanic has to be one of my all-time favorite movies. It has its problems (what movies don\'t) but still, it\'s enjoyable.<br /><br />When I stumble across someone who asks me why I like Titanic, I suppose my first reaction is "wait a minute, you don\'t?" I know so many people who don\'t like this movie, and I\'m not saying I don\'t see why. "The love story is too cheesy" well, yes but isn\'t it enjoyable and moving? All right, the love story between Jack and Rose is very unrealistic, everyone knows that love like this doesn\'t actually exist. But this is a movie, doesn\'t everyone enjoy watching a beautiful story that lets us slip slightly into fantasy for a while? The next complaint, DiCaprio and Winslet are terrible actors. Well, OK, in this movie, I agree that they do not perform to their full potentials. However I think it\'s unfair to say that they are terrible actors. I personally think they are both very talented actors who unfortunately are very famous for a movie th

And we also need to define the type that we will use in the following notebook

In [5]:
feature_type = List[int]
DataFrame = pd.core.frame.DataFrame
list_of_words = List[str]

## Random

set a defined random generator, better for reproducible results.

In [6]:
random_seed = 38
random = np.random.default_rng(random_seed)

## First model

### Creation of the features

As we say before we now need to create feature from our textn we have the followinf feature :

- 1 if "no" appear in the doc, 0 otherwise
- The count of first and second pronouns in the document
- The count of "." => number of classic sentences
- 1 if "!" is in the document, 0 otherwise
- log(word count in the document)
- Number of words in the document which are in the positive lexicon
- Number of words in the document which are in the negative lexicon
- Number of words in the document wich are in the lexcion but netral

In [7]:
def word_array_to_feature(word_array : list_of_words) -> feature_type:
    feature = []
    
    ''' No feature '''
    
    if ("no" in word_array):
        feature.append(1)
    else:
        feature.append(0)
        
    ''' Pronouns feature '''
    
    valid_pronouns = ["i", "me", "my", "mine", "myself", "you", "your", "yours", "yourself", "we", "us", "our", "ourselves"]
    pronouns_count = 0
    for elem in word_array:
        if (elem in valid_pronouns):
            pronouns_count += 1
            
    feature.append(pronouns_count)
    
    ''' "." feature '''
    
    feature.append(word_array.count("."))
            
    ''' ! feature '''
        
    if ("!" in word_array):
        feature.append(1)
    else:
        feature.append(0)
        
    ''' log(nb_word) feature '''
    
    feature.append(math.log(len(word_array)))
    
    ''' positive, negative and neutral feature '''
    positive_count = 0
    negative_count = 0
    netral_count = 0
    
    for elem in word_array :
        if ((elem in dict_value) and (float(dict_value[elem]) >= 1.5)):
            positive_count += 1
        elif ((elem in dict_value) and (float(dict_value[elem]) <= -1.5)):
            negative_count += 1
        elif ((elem in dict_value)):
            netral_count += 1
            
    feature.append(positive_count)
    feature.append(negative_count)
    feature.append(netral_count)
    
    return feature

### Create the lexicon

We have speak about a lexcion so now we need to create it

In [8]:
''' Open and read all line of the file '''
file = open("vader_lexicon.txt", "r", encoding="utf-8")
lines = file.readlines()
file.close()

''' We define a list of all our line but in the good format '''
good_format = []

''' For each line we replace all tablutaion by space and we split to get the 2 first elements '''
for line in lines:
    line_temp = line.replace("\t", " ")
    line_temp = line_temp.split(" ")
    line_temp = line_temp[:2]
    good_format.append(line_temp)

''' Now we create a dict from our list to use it after '''
dict_value = dict(good_format)

### Formatting the dataset

We need a function that can take our dataset and make it usable for our other function.

In [9]:
def dataset_to_array(dataset : DataFrame) -> (List[list_of_words], List[int], List[list_of_words], List[int]):
    '''
        This function take the dataset and create and array from this dataset
        
        Parameters :
                dataset (dataset): it's the input dataset thta we want to format in array
                
        Returns:
                x_train (list[list[str]]) : We have all the list avec all word of text in the train part of the datatset
                y_train (list[int]) : We have a list of all label of text in the train part  of the dataset
                x_test (list[list[str]]) : We have all the list avec all word of text in the test part of the datatset
                y_test (list[int]) : We have a list of all label of text in the test part  of the dataset
    '''
    x_train = []
    y_train = []
    x_test = []
    y_test = []
    
    for elem in dataset["train"]:
        y_train.append(elem["label"])
        x_train.append(elem["text"])
        
    for elem in dataset["test"]:
        y_test.append(elem["label"])
        x_test.append(elem["text"])
        
    return x_train, y_train, x_test, y_test

In [10]:
%%time
x_train, y_train, x_test, y_test = dataset_to_array(dataset)

CPU times: user 2.58 s, sys: 58.4 ms, total: 2.63 s
Wall time: 2.61 s


We can see that formatting the dataset don't take a lot's of time : 2.22 seconds

In [11]:
%%time
x_train_feature = [word_array_to_feature(elem.split(" ")) for elem in x_train]
x_test_feature = [word_array_to_feature(elem.split(" ")) for elem in x_test]

CPU times: user 4.24 s, sys: 10.2 ms, total: 4.25 s
Wall time: 4.25 s


The creation of the feature just take 3.34 seconds wich is fast. 

We create a 'LogisticRegression' model from sikitleran and we train it with the train data.
We also set the random_state to the variable random_seed to control the random and make the result reproducible.

In [12]:
%%time
clf = LogisticRegression(random_state=random_seed).fit(x_train_feature, y_train)

CPU times: user 704 ms, sys: 575 ms, total: 1.28 s
Wall time: 187 ms


#### Result of the first model

In [13]:
y_pred = clf.predict(x_test_feature)

In [14]:
def print_formated_result(result):
    print("For labbel 0 :")
    print("\t - number of entry : ", result[3][0])
    print("\t - precision : ", result[0][0])
    print("\t - recall : ", result[2][0])
    print("\t - fbeta_score : ", result[1][0])
    print("\n\nFor labbel 1 :")
    print("\t - number of entry : ", result[3][1])
    print("\t - precision : ", result[0][1])
    print("\t - recall : ", result[2][1])
    print("\t - fbeta_score : ", result[1][1])

In [15]:
print_formated_result(precision_recall_fscore_support(y_test, y_pred))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312


First of all we can see that we have 12500 elements of each labels.

So we can see that the result are quite correct. We have a precission of __0,69__ on each label. A recall of __0,68__ for label 0 and __0,70__ for label 1. The fbeta_score, which is the weighted harmonic mean of precision and recall, is of __0,79__ for each label.

## Pre-processing

### Clean the text

As we can see before, the text-format is not perfect, we have for exemple '\t' or '<br\>' that are formated text. So here we just need to have a list of all word (or char) to use them to compute our features. So for this we will replace all special char by space. And we will also add space before and after '!' to use the split function of python for string split.

In [16]:
def clean_the_text(text_array : str) -> List[str]:
    '''
        This function return a list of all word and char in the text in parameters.

            Parameters:
                    text_array (str): The text in a string format.

            Returns:
                    result_array (list[str]) : A list with all the word and char in the inpt text.
    '''
    
    specialChars = "()\\\'',;:\"?-" 
    for specialChar in specialChars:
        text_array = text_array.replace(specialChar, ' ')
        
    text_array = text_array.replace(".", " . ")
    text_array = text_array.replace("/>", ' ')
    text_array = text_array.replace("<br", ' ')
    ''' As say before we add space before and after '!' for the split function '''
    text_array = text_array.replace("!", " ! ")
    
    ''' We split the text by the space to get a list of all the word in the text'''
    text_array = text_array.split(" ")
    
    ''' We put all word in lowercase to copare them to the word in the dict'''
    result_array = [elem.lower() for elem in text_array if(len(elem) != 0)]
    
    return result_array

Now we can create the new function with the clean feature

This time the formatting take more time : 5.45 seconds when we add 2.22 seconds without optimisation.

In [17]:
def dataset_to_array_clean(dataset : DataFrame) -> (List[list_of_words], List[int], List[list_of_words], List[int]):
    '''
        This function take the dataset and create and array from this dataset
        
        Parameters :
                dataset (dataset): it's the input dataset thta we want to format in array
                
        Returns:
                x_train (list[list[str]]) : We have all the list avec all word of text in the train part of the datatset
                y_train (list[int]) : We have a list of all label of text in the train part  of the dataset
                x_test (list[list[str]]) : We have all the list avec all word of text in the test part of the datatset
                y_test (list[int]) : We have a list of all label of text in the test part  of the dataset
    '''
    x_train = []
    y_train = []
    x_test = []
    y_test = []
    
    for elem in dataset["train"]:
        y_train.append(elem["label"])
        x_train.append(clean_the_text(elem["text"]))
        
    for elem in dataset["test"]:
        y_test.append(elem["label"])
        x_test.append(clean_the_text(elem["text"]))
        
    return x_train, y_train, x_test, y_test

In [18]:
%%time
x_train_clean, y_train_clean, x_test_clean, y_test_clean = dataset_to_array(dataset)

CPU times: user 2.19 s, sys: 101 ms, total: 2.29 s
Wall time: 2.26 s


In [19]:
%%time
x_train_feature_clean = [word_array_to_feature(elem) for elem in x_train_clean]
x_test_feature_clean = [word_array_to_feature(elem) for elem in x_test_clean]

CPU times: user 14 s, sys: 1.1 ms, total: 14 s
Wall time: 14 s


In [20]:
%%time
clf_clean = LogisticRegression(random_state=random_seed).fit(x_train_feature_clean, y_train_clean)

CPU times: user 2.03 s, sys: 1.47 s, total: 3.5 s
Wall time: 460 ms


Now take a look at this result.

In [21]:
y_pred_clean = clf_clean.predict(x_test_feature_clean)

In [22]:
print_formated_result(precision_recall_fscore_support(y_test_clean, y_pred_clean))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.5424659346474401
	 - recall :  0.5938880440292563
	 - fbeta_score :  0.65608


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.5649666059502125
	 - recall :  0.49888303100705933
	 - fbeta_score :  0.44664


First of all we can see that we have 12500 elements of each labels. This don't change from the previous model.

So we can see that the result are quite correct. We have a precission of __0,71__ on each label. A recall of __0,70__ for label 0 and __0,71__ for label 1. The fbeta_score, which is the weighted harmonic mean of precision and recall, is of __0,71__ for each label.

### Stemming

In [23]:
re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")

stem_word: Callable[[str], str] = lambda w : stemmer.stem(w.lower()) if re_word.match(w) else w

In [24]:
def stemmed_text(text : str) -> str:
    '''
        This function steeming the text in parameter and return in
        
        Parameters :
                text (str) : the text to stemming
                
        Returns :
                return_text (str) : the text stemmed
    '''
    list_of_words = text.split(" ")
    
    list_of_words = [stem_word(word) for word in list_of_words]
    
    return_text = " ".join(list_of_words)
    
    return return_text

Now define an other function for this optimisation.

In [25]:
def dataset_to_array_stemmed(dataset : DataFrame) -> (List[list_of_words], List[int], List[list_of_words], List[int]):
    '''
        This function take the dataset and create and array from this dataset
        
        Parameters :
                dataset (dataset): it's the input dataset thta we want to format in array
                
        Returns:
                x_train (list[list[str]]) : We have all the list avec all word of text in the train part of the datatset
                y_train (list[int]) : We have a list of all label of text in the train part  of the dataset
                x_test (list[list[str]]) : We have all the list avec all word of text in the test part of the datatset
                y_test (list[int]) : We have a list of all label of text in the test part  of the dataset
    '''
    x_train = []
    y_train = []
    x_test = []
    y_test = []
    
    for elem in dataset["train"]:
        y_train.append(elem["label"])
        x_train.append(stemmed_text(elem["text"]))
        
    for elem in dataset["test"]:
        y_test.append(elem["label"])
        x_test.append(stemmed_text(elem["text"]))
        
    return x_train, y_train, x_test, y_test

In [26]:
%%time
x_train_stemmed, y_train_stemmed, x_test_stemmed, y_test_stemmed = dataset_to_array_stemmed(dataset)

CPU times: user 1min 7s, sys: 305 ms, total: 1min 8s
Wall time: 1min 7s


In [27]:
%%time
x_train_feature_stemmed = [word_array_to_feature(elem.split(" ")) for elem in x_train_stemmed]
x_test_feature_stemmed = [word_array_to_feature(elem.split(" ")) for elem in x_test_stemmed]

CPU times: user 3.98 s, sys: 18.8 ms, total: 4 s
Wall time: 4 s


In [28]:
%%time
clf_stemmed = LogisticRegression(random_state=random_seed).fit(x_train_feature_stemmed, y_train_stemmed)

CPU times: user 342 ms, sys: 442 ms, total: 784 ms
Wall time: 123 ms


We can now test out new optimisation and see if it's change the result.

In [29]:
y_pred_stemmed = clf_stemmed.predict(x_test_feature_stemmed)

In [30]:
print_formated_result(precision_recall_fscore_support(y_test_stemmed, y_pred_stemmed))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6702224427354668
	 - recall :  0.6591306469320538
	 - fbeta_score :  0.6484


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6594871000232432
	 - recall :  0.6700515605935372
	 - fbeta_score :  0.68096


So we can see that the results are 

### Lemming

In [46]:
!python -m spacy download en_core_web_sm > output_dl.txt

In [50]:
# loading the small English model
nlp = spacy.load("en_core_web_sm")

We can now define an other function that take this time the lemming transformation instead of stemming.

In [51]:
def dataset_to_array_lemmed(dataset : DataFrame) -> (List[list_of_words], List[int], List[list_of_words], List[int]):
    '''
        This function take the dataset and create and array from this dataset
        
        Parameters :
                dataset (dataset): it's the input dataset thta we want to format in array
                
        Returns:
                x_train (list[list[str]]) : We have all the list avec all word of text in the train part of the datatset
                y_train (list[int]) : We have a list of all label of text in the train part  of the dataset
                x_test (list[list[str]]) : We have all the list avec all word of text in the test part of the datatset
                y_test (list[int]) : We have a list of all label of text in the test part  of the dataset
    '''
    x_train = []
    y_train = []
    x_test = []
    y_test = []
    
    for elem in dataset["train"]:
        y_train.append(elem["label"])
        x_train.append(' '.join([token.lemma_ for token in nlp(elem["text"])]))
        
    for elem in dataset["test"]:
        y_test.append(elem["label"])
        x_test.append(' '.join([token.lemma_ for token in nlp(elem["text"])]))
        
    return x_train, y_train, x_test, y_test

In [52]:
%%time
x_train_lemmed, y_train_lemmed, x_test_lemmed, y_test_lemmed = dataset_to_array_lemmed(dataset)

KeyboardInterrupt: 

In [53]:
%%time
x_train_feature_lemmed = [word_array_to_feature(elem.split(" ")) for elem in x_train_lemmed]
x_test_feature_lemmed = [word_array_to_feature(elem.split(" ")) for elem in x_test_lemmed]

NameError: name 'x_train_lemmed' is not defined

In [54]:
%%time
clf_lemmed = LogisticRegression(random_state=random_seed).fit(x_train_feature_lemmed, y_train_lemmed)

NameError: name 'x_train_feature_lemmed' is not defined

Add now the results.

In [55]:
y_pred_lemmed = clf_lemmed.predict(x_test_feature_lemmed)

NameError: name 'clf_lemmed' is not defined

In [None]:
print_formated_result(precision_recall_fscore_support(y_test_lemmed, y_pred_lemmed))

So we can see that the results are 

## Regularization

We win use the different regulation on the first model to comprare the performances.

### regularization L1

In [56]:
%%time
clf_l1 = LogisticRegression(random_state=random_seed).fit(x_train_feature, y_train)
y_pred_l1 = clf_l1.predict(x_test_feature)
print_formated_result(precision_recall_fscore_support(y_test, y_pred_l1))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312
CPU times: user 1.16 s, sys: 833 ms, total: 2 s
Wall time: 283 ms


### regularization L2

In [57]:
%%time
clf_l2 = LogisticRegression(random_state=random_seed).fit(x_train_feature, y_train)
y_pred_l2 = clf_l2.predict(x_test_feature)
print_formated_result(precision_recall_fscore_support(y_test, y_pred_l2))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312
CPU times: user 1.06 s, sys: 1.01 s, total: 2.07 s
Wall time: 284 ms


### Regularization Elasticnet

In [59]:
%%time
clf_elasticnet = LogisticRegression(random_state=random_seed).fit(x_train_feature, y_train)
y_pred_elasticnet = clf_elasticnet.predict(x_test_feature)
print_formated_result(precision_recall_fscore_support(y_test, y_pred_elasticnet))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312
CPU times: user 1.19 s, sys: 922 ms, total: 2.11 s
Wall time: 288 ms


### No regularization

In [61]:
clf_without_regularization = LogisticRegression(random_state=random_seed).fit(x_train_feature, y_train)
y_pred_without_regularization = clf_without_regularization.predict(x_test_feature)
print_formated_result(precision_recall_fscore_support(y_test, y_pred_without_regularization))

For labbel 0 :
	 - number of entry :  12500
	 - precision :  0.6965906303654648
	 - recall :  0.6890137883627835
	 - fbeta_score :  0.6816


For labbel 1 :
	 - number of entry :  12500
	 - precision :  0.6883076200172292
	 - recall :  0.6956349677470418
	 - fbeta_score :  0.70312


So we can see that the best regularization is

## Conclusion

Our better results is for lemming.