# Logistic Regression model (TP n°1)

## Import

First, we import the packages we need.

- datasets : to download the dataset
- math : for mathematical functions (here, the logarithm)
- sklearn : 'LogisticRegression' to fit the model on the data, and 'precision_recall_fscore_support' to compute and inspect the results of our model
- numpy : to set the random seed and make results reproducible
- pandas : for dataframe manipulation
- typing : for function type hints

In [1]:
import datasets
import math
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
import numpy as np
import pandas as pd
from typing import List

## The dataset

We download the dataset:

In [2]:
dataset = datasets.load_dataset('imdb')

Reusing dataset imdb (/home/leherlemaxime/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)



Now let's take a look at the dataset format:

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

The dataset is made of three splits: "train", "test" and "unsupervised". The "train" and "test" splits each contain 25000 elements, and the "unsupervised" split, which is not used here, contains 50000. Each element has two components: the first is the text and the second is the label, 0 for a negative review and 1 for a positive one. We will use the "train" split to train our model and the "test" split to check its performance.

Here is what one element of the dataset looks like:

In [4]:
dataset["train"][42]

{'text': 'Titanic has to be one of my all-time favorite movies. It has its problems (what movies don\'t) but still, it\'s enjoyable.<br /><br />When I stumble across someone who asks me why I like Titanic, I suppose my first reaction is "wait a minute, you don\'t?" I know so many people who don\'t like this movie, and I\'m not saying I don\'t see why. "The love story is too cheesy" well, yes but isn\'t it enjoyable and moving? All right, the love story between Jack and Rose is very unrealistic, everyone knows that love like this doesn\'t actually exist. But this is a movie, doesn\'t everyone enjoy watching a beautiful story that lets us slip slightly into fantasy for a while? The next complaint, DiCaprio and Winslet are terrible actors. Well, OK, in this movie, I agree that they do not perform to their full potentials. However I think it\'s unfair to say that they are terrible actors. I personally think they are both very talented actors who unfortunately are very famous for a movie th

## Preprocessing

We create a seeded random generator, which makes the results reproducible.

In [5]:
random_seed = 42
random = np.random.default_rng(random_seed)

Later, we will compute several features on each text. For this we need a dict that associates words with their sentiment (positive or negative). The file 'vader_lexicon.txt' gives a sentiment score for a large number of words, so we convert this file into a dict.

In [6]:
# Open the file and read all of its lines
with open("vader_lexicon.txt", "r", encoding="utf-8") as file:
    lines = file.readlines()

# List of [word, score] pairs, one per line of the file
good_format = []

# For each line, replace every tab with a space, split on spaces and keep the first 2 elements
for line in lines:
    line_temp = line.replace("\t", " ")
    line_temp = line_temp.split(" ")
    line_temp = line_temp[:2]
    good_format.append(line_temp)

# Build a dict from the list: each word maps to its score (still stored as a string)
dict_value = dict(good_format)
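An alternative way to build the same kind of dictionary is to split each line on the tab character directly and convert the score to `float` once, which also keeps intact any token that itself contains a space. This is only a sketch: it assumes the layout `token<TAB>score<TAB>...` suggested by the cell above, and `parse_lexicon` and the sample lines are made up for illustration.

```python
from typing import Dict, Iterable

def parse_lexicon(lines: Iterable[str]) -> Dict[str, float]:
    """Map each token to its score (hypothetical helper, not part of the notebook)."""
    lexicon = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        lexicon[fields[0]] = float(fields[1])
    return lexicon

# Made-up sample lines in the assumed tab-separated layout
sample_lines = ["good\t1.9\t0.9\n", "awful\t-2.0\t1.2\n", "\n"]
sample_lexicon = parse_lexicon(sample_lines)
print(sample_lexicon)
```

With scores stored as floats, the later `float(...)` conversions would still work but would no longer be needed.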

As we saw above, the raw text is not clean: it contains, for example, '\t' or '<br />', which are formatting markup. Here we only need a list of all the words (or characters) in the text to compute our features, so we replace all special characters with spaces. We also add a space before and after each '!' so that Python's string split function isolates it as its own element.

In [7]:
def clean_the_text(text_array : str) -> List[str]:
    '''
        This function returns a list of all the words and characters in the given text.

            Parameters:
                    text_array (str): The text in string format.

            Returns:
                    result_array (list[str]) : A list of all the words and characters in the input text.
    '''
    
    # Replace punctuation and other special characters with spaces
    specialChars = "()\\\''.,;:\"?-" 
    for specialChar in specialChars:
        text_array = text_array.replace(specialChar, ' ')
        
    # Remove what is left of the '<br />' HTML tags
    text_array = text_array.replace("/>", ' ')
    text_array = text_array.replace("<br", ' ')
    # As said before, add a space before and after '!' for the split function
    text_array = text_array.replace("!", " ! ")
    
    # Split the text on spaces to get a list of all the words in the text
    text_array = text_array.split(" ")
    
    # Lowercase every word to compare it with the words in the dict, and drop empty strings
    result_array = [elem.lower() for elem in text_array if(len(elem) != 0)]
    
    return result_array
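To see what this cleaning step produces, here is a condensed sketch of the same replacement rules applied to a short made-up review. `tokenize` is a hypothetical stand-in for `clean_the_text`, redefined here only so that the example runs on its own.

```python
from typing import List

def tokenize(text: str) -> List[str]:
    """Condensed stand-in for clean_the_text (same replacement rules)."""
    # Replace punctuation and special characters with spaces
    for char in "()\\'.,;:\"?-":
        text = text.replace(char, " ")
    # Remove what is left of the '<br />' tags
    text = text.replace("/>", " ").replace("<br", " ")
    # Isolate '!' so that it becomes its own token
    text = text.replace("!", " ! ")
    return [word.lower() for word in text.split(" ") if word]

tokens = tokenize("Great movie!<br /><br />I didn't like the ending, though.")
print(tokens)
# ['great', 'movie', '!', 'i', 'didn', 't', 'like', 'the', 'ending', 'though']
```

Note that the apostrophe is treated as a separator, so "didn't" becomes the two tokens "didn" and "t".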

We need a function that converts our dataset into a format that our other functions can use.

In [8]:
from typing import Tuple

list_of_words = List[str]

def dataset_to_array(dataset : datasets.DatasetDict) -> Tuple[List[list_of_words], List[int], List[list_of_words], List[int]]:
    '''
        This function takes the dataset and converts its "train" and "test" splits into lists.
        
        Parameters :
                dataset (DatasetDict): the input dataset that we want to convert into lists
                
        Returns:
                x_train (list[list[str]]) : the list of words of each text in the train split of the dataset
                y_train (list[int]) : the label of each text in the train split of the dataset
                x_test (list[list[str]]) : the list of words of each text in the test split of the dataset
                y_test (list[int]) : the label of each text in the test split of the dataset
    '''
    x_train = []
    y_train = []
    x_test = []
    y_test = []
    
    for elem in dataset["train"]:
        y_train.append(elem["label"])
        x_train.append(clean_the_text(elem["text"]))
        
    for elem in dataset["test"]:
        y_test.append(elem["label"])
        x_test.append(clean_the_text(elem["text"]))
        
    return x_train, y_train, x_test, y_test

In [9]:
x_train, y_train, x_test, y_test = dataset_to_array(dataset)

## Creation of the features

As mentioned above, we now need to create features from our text. We use the following features:

- 1 if "no" appears in the document, 0 otherwise
- The count of first- and second-person pronouns in the document
- 1 if "!" is in the document, 0 otherwise
- log(word count in the document)
- Number of words in the document which are in the positive lexicon (score of at least 1.5)
- Number of words in the document which are in the negative lexicon (score of at most -1.5)

In [10]:
# The log feature is a float, so a feature vector is a list of floats
feature_type = List[float]

def word_array_to_feature(word_array : list_of_words) -> feature_type:
    '''
        This function computes the six features of a document from its list of words.
    '''
    feature = []
    
    # "no" feature: 1 if "no" appears in the document, 0 otherwise
    if ("no" in word_array):
        feature.append(1)
    else:
        feature.append(0)
        
    # Pronouns feature: count of first- and second-person pronouns
    valid_pronouns = ["i", "me", "my", "mine", "myself", "you", "your", "yours", "yourself", "we", "us", "our", "ourselves"]
    pronouns_count = 0
    for elem in word_array:
        if (elem in valid_pronouns):
            pronouns_count += 1
            
    feature.append(pronouns_count)
            
    # "!" feature: 1 if "!" appears in the document, 0 otherwise
    if ("!" in word_array):
        feature.append(1)
    else:
        feature.append(0)
        
    # log(word count) feature (assumes the document contains at least one word)
    feature.append(math.log(len(word_array)))
    
    # Positive and negative features: count the words whose lexicon score
    # is at least 1.5 (positive) or at most -1.5 (negative)
    positive_count = 0
    negative_count = 0
    
    for elem in word_array :
        if ((elem in dict_value) and (float(dict_value[elem]) >= 1.5)):
            positive_count += 1
        elif ((elem in dict_value) and (float(dict_value[elem]) <= -1.5)):
            negative_count += 1
            
    feature.append(positive_count)
    feature.append(negative_count)
    
    return feature

In [11]:
x_train_feature = [word_array_to_feature(elem) for elem in x_train]
x_test_feature = [word_array_to_feature(elem) for elem in x_test]
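As a sanity check, the six features can be worked out for a tiny made-up document. The sketch below is a compact re-implementation written only for this example: `extract_features`, the toy lexicon and its scores are assumptions, not the notebook's own data.

```python
import math
from typing import Dict, List

PRONOUNS = {"i", "me", "my", "mine", "myself", "you", "your", "yours",
            "yourself", "we", "us", "our", "ourselves"}

def extract_features(tokens: List[str], lexicon: Dict[str, float],
                     threshold: float = 1.5) -> List[float]:
    """Compute the six features for one tokenized document (illustrative helper)."""
    # Scores of the tokens that appear in the lexicon
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return [
        1 if "no" in tokens else 0,            # "no" present
        sum(t in PRONOUNS for t in tokens),    # pronoun count
        1 if "!" in tokens else 0,             # "!" present
        math.log(len(tokens)),                 # log of the word count
        sum(s >= threshold for s in scores),   # positive words
        sum(s <= -threshold for s in scores),  # negative words
    ]

# Made-up lexicon and document
toy_lexicon = {"great": 3.1, "boring": -1.3, "awful": -2.0}
toy_tokens = ["i", "found", "it", "great", "!", "no", "awful", "scenes"]
features = extract_features(toy_tokens, toy_lexicon)
print(features)
```

Here the document contains "no", one pronoun ("i"), a "!", eight tokens, one word above the positive threshold ("great") and one below the negative threshold ("awful").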

## Creation and training of the model

We create a 'LogisticRegression' model from scikit-learn and train it on the train data.
We also set random_state to the variable random_seed to control the randomness and make the result reproducible.

In [12]:
clf = LogisticRegression(random_state=random_seed).fit(x_train_feature, y_train)
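For reference, the model fitted here is a binary logistic regression. Writing $x \in \mathbb{R}^6$ for the feature vector of a document, and $w$ and $b$ for the learned weights and intercept (notation introduced here, not taken from the notebook), the predicted probability of the positive label is:

```latex
P(y = 1 \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

`clf.predict` returns label 1 when this probability is above 0.5, which is the same as $w^\top x + b > 0$. With its default settings, scikit-learn's `LogisticRegression` also adds an L2 regularization term on $w$ during training.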

## Test our model

We predict the label of every text in the test split of the dataset and compare the predictions with the true labels.

In [13]:
y_pred = clf.predict(x_test_feature)

In [14]:
precision_recall_fscore_support(y_test, y_pred)

(array([0.70999516, 0.70619946]),
 array([0.70352, 0.71264]),
 array([0.70674275, 0.70940511]),
 array([12500, 12500]))

First of all, we can see that there are 12500 elements for each label.

The results are reasonable for a model built on six hand-crafted features. We have a precision of __0.71__ for each label, and a recall of __0.70__ for label 0 and __0.71__ for label 1. The F-beta score (here with beta = 1, so the F1 score, which is the harmonic mean of precision and recall) is __0.71__ for each label.
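These numbers can be cross-checked by hand. The sketch below starts from the recall and support values printed above, rebuilds the number of correct and incorrect predictions per label, and recomputes precision, F1 and overall accuracy. The variable names are chosen for this example only.

```python
# Values copied from the precision_recall_fscore_support output above
recall = [0.70352, 0.71264]
support = [12500, 12500]

# recall_k = correct_k / support_k, so the correct predictions per label are:
correct = [round(r * n) for r, n in zip(recall, support)]
# Test reviews of label k that were assigned to the other label
missed = [n - c for n, c in zip(support, correct)]

# precision_k = correct_k / (everything predicted as k); with two labels,
# the wrong predictions of k are exactly the misses of the other label
precision = [correct[k] / (correct[k] + missed[1 - k]) for k in (0, 1)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
accuracy = sum(correct) / sum(support)

print(correct, missed)
print([round(p, 4) for p in precision])
print([round(f, 4) for f in f1])
print(round(accuracy, 4))
```

The recomputed precision and F1 values agree with the reported ones. Because both labels have the same support, the accuracy is the mean of the two recalls, about 0.708.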