## Baseline with Simple Models

In this notebook we try two simple baseline models, logistic regression and multinomial naive Bayes, on bag-of-words and TF-IDF features, and evaluate them with stratified k-fold cross-validation.

### Import libraries
We use pandas to read the datasets, pandarallel to parallelize processing of the dataset, and scikit-learn to split the dataset, build the features, train the Naive Bayes and Logistic Regression models, and compute the metrics.

In [1]:
import pandas as pd
from pandarallel import pandarallel
import ast
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_extraction.text import TfidfVectorizer


In [2]:
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 56 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## Read the dataset
The cleaned dataset stores each token list as a string, so after loading it we use **ast.literal_eval** to convert those strings back into Python objects.
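To illustrate the conversion, here is a minimal sketch on a made-up string. The value of `stored` is hypothetical and only mimics what a token list looks like after being written to and read back from a CSV file.

```python
import ast

# Hypothetical example of a token list as it appears after a CSV round trip
stored = "['great', 'movie', 'twist']"

# literal_eval parses Python literals only; it does not execute arbitrary code
tokens = ast.literal_eval(stored)

print(type(stored).__name__, "->", type(tokens).__name__)
print(tokens)
```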

In [3]:
dataRew=pd.read_csv('../Dataset/datiClean.csv')
dataMovie=pd.read_csv('../Dataset/movieclean.csv')

In [4]:
dataRew["clean_review"]=dataRew.loc[:,"clean_review"].parallel_apply(ast.literal_eval)


In [5]:
dataMovie["plot_clean"]=dataMovie.loc[:,"plot_clean"].parallel_apply(ast.literal_eval)


In [6]:
dataMovie.drop(['plot_synopsis','plot_summary'],axis=1,inplace=True)

### Split the Dataset

We split the dataset into train and test sets with stratification, so that both sets keep the same class proportions. We use the same random state in every notebook so that the split is identical and the results can be compared directly.

First, drop the columns that are not needed.

In [8]:
dataRew.drop(['review_date','movie_id','user_id','rating','review_summary','review_text'],axis=1,inplace=True)

In [9]:
x=dataRew['clean_review']
y=dataRew['is_spoiler']

In [10]:
# Stratified split: train and test keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=42)

In [11]:
y_train.value_counts()

is_spoiler
False    338391
True     120739
Name: count, dtype: int64

In [12]:
y_test.value_counts()

is_spoiler
False    84598
True     30185
Name: count, dtype: int64
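In the counts above, spoilers make up about 26.3% of both the train set (120739 of 459130) and the test set (30185 of 114783). The sketch below reproduces this effect of `stratify` on synthetic labels; `X_demo` and `y_demo` are made-up stand-ins, not the review data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the imbalance seen above (about 26% positives)
rng = np.random.default_rng(0)
y_demo = rng.random(10_000) < 0.26
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratify, the positive rate should be nearly identical in both parts
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate  train: {train_rate:.4f}  test: {test_rate:.4f}")
```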

### Apply Logistic Regression and Naive Bayes with k-fold cross-validation

### Functions for reporting the model results

In [13]:
def print_mean():
    # Calculate the averages of the metrics
    mean_accuracy = np.mean(metrics['accuracy'])
    mean_precision = np.mean(metrics['precision'])
    mean_recall = np.mean(metrics['recall'])
    mean_f1_score = np.mean(metrics['f1_score'])

    # Print the averages of the metrics
    print("Mean Accuracy:", mean_accuracy)
    print("Mean Precision:", mean_precision)
    print("Mean Recall:", mean_recall)
    print("Mean F1 Score:", mean_f1_score)

In [14]:
def print_test(y_pred_test):
    # Compute the evaluation metrics on the test set
    accuracy_test = accuracy_score(y_test, y_pred_test)
    precision_test = precision_score(y_test, y_pred_test)
    recall_test = recall_score(y_test, y_pred_test)
    f1_score_test = f1_score(y_test, y_pred_test)

    # Print the evaluation metrics on the test set
    print("Test Accuracy:", accuracy_test)
    print("Test Precision:", precision_test)
    print("Test Recall:", recall_test)
    print("Test F1 Score:", f1_score_test)
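As a reminder of what the four scores measure, the sketch below computes them by hand from confusion-matrix counts and compares them with the scikit-learn functions used above. The labels `y_true` and `y_hat` are toy values made up for illustration, with `True` standing for the spoiler class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels, made up for illustration (True = spoiler)
y_true = np.array([True, True, True, False, False, False, False, False])
y_hat = np.array([True, False, False, True, False, False, False, False])

tp = np.sum(y_true & y_hat)      # spoilers correctly flagged
fp = np.sum(~y_true & y_hat)     # non-spoilers flagged by mistake
fn = np.sum(y_true & ~y_hat)     # spoilers that were missed
tn = np.sum(~y_true & ~y_hat)    # non-spoilers correctly left alone

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values should agree with scikit-learn
print(accuracy, accuracy_score(y_true, y_hat))
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))
```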

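The following function trains a logistic regression model with stratified k-fold cross-validation. It takes the number of folds, the maximum number of solver iterations, the feature matrix, and the training labels.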

In [15]:
# Shared by both training functions and read by print_mean().
# Note: the lists are not cleared between calls, so print_mean()
# averages over every fold recorded since this dict was last created.
metrics = {
    'accuracy': [],
    'precision': [],
    'recall': [],
    'f1_score': []
}
def computeLogistic(folds, max_iter, X, y_train):
    logistic_reg = LogisticRegression(max_iter=max_iter)
    # Stratified k-fold keeps the class proportions balanced in every fold
    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
    for train_index, val_index in kf.split(X, y_train):
        X_fold_train, X_fold_val = X[train_index], X[val_index]
        y_fold_train, y_fold_val = y_train[train_index], y_train[val_index]

        # Training
        logistic_reg.fit(X_fold_train, y_fold_train)

        # Validation
        y_pred = logistic_reg.predict(X_fold_val)

        # Compute metrics
        metrics['accuracy'].append(accuracy_score(y_fold_val, y_pred))
        metrics['precision'].append(precision_score(y_fold_val, y_pred))
        metrics['recall'].append(recall_score(y_fold_val, y_pred))
        metrics['f1_score'].append(f1_score(y_fold_val, y_pred))
    # The returned model is the one fitted on the last fold
    return logistic_reg

The following function does the same for a multinomial Naive Bayes model. It takes the number of folds, the feature matrix, and the training labels.

In [16]:
# Re-create the shared metrics dict (running this cell empties it)
metrics = {
    'accuracy': [],
    'precision': [],
    'recall': [],
    'f1_score': []
}
def compute_naive(folds, X, y_train):
    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
    naive_bayes = MultinomialNB()
    
    for train_index, val_index in kf.split(X, y_train):
        X_fold_train, X_fold_val = X[train_index], X[val_index]
        y_fold_train, y_fold_val = y_train[train_index], y_train[val_index]
        
        # Training
        naive_bayes.fit(X_fold_train, y_fold_train)
        
        # Validation
        y_pred = naive_bayes.predict(X_fold_val)
        
        # Compute metrics
        metrics['accuracy'].append(accuracy_score(y_fold_val, y_pred))
        metrics['precision'].append(precision_score(y_fold_val, y_pred))
        metrics['recall'].append(recall_score(y_fold_val, y_pred))
        metrics['f1_score'].append(f1_score(y_fold_val, y_pred))
    return naive_bayes
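Both training functions above append to the same module-level `metrics` dict, which is never cleared between calls, so each later `print_mean()` also averages over folds recorded by earlier runs. One possible variant, sketched below, returns a fresh dict per run instead. The helper name `cross_validate_model`, its `seed` parameter, and the synthetic data are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB


def cross_validate_model(model, folds, X, y, seed=42):
    """Hypothetical helper: return the last fitted model and this run's fold metrics."""
    run_metrics = {"accuracy": [], "precision": [], "recall": [], "f1_score": []}
    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    fitted = None
    for train_index, val_index in kf.split(X, y):
        # clone() gives an unfitted copy, so folds do not share state
        fitted = clone(model).fit(X[train_index], y[train_index])
        y_pred = fitted.predict(X[val_index])
        y_val = y[val_index]
        run_metrics["accuracy"].append(accuracy_score(y_val, y_pred))
        run_metrics["precision"].append(precision_score(y_val, y_pred))
        run_metrics["recall"].append(recall_score(y_val, y_pred))
        run_metrics["f1_score"].append(f1_score(y_val, y_pred))
    return fitted, run_metrics


# Synthetic data, made non-negative so that MultinomialNB accepts it
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_demo = np.abs(X_demo)

_, lr_metrics = cross_validate_model(LogisticRegression(max_iter=1000), 5, X_demo, y_demo)
_, nb_metrics = cross_validate_model(MultinomialNB(), 5, X_demo, y_demo)

# Each run has exactly `folds` entries, independent of earlier runs
print("LR mean accuracy:", np.mean(lr_metrics["accuracy"]))
print("NB mean accuracy:", np.mean(nb_metrics["accuracy"]))
```

Because every call builds its own dict, the mean for one model cannot be affected by folds from another model.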

### Using Bag of Words
To apply Bag of Words, we first rebuild a whitespace-joined text from each token list, then fit a **CountVectorizer** on it. The result is a matrix in which each row corresponds to a document and each column to a token, and each entry counts how often that token occurs in that document.

In [17]:
text=[" ".join(word) for word in X_train]

In [18]:
textT=[" ".join(word) for word in X_test]

In [19]:
## Bag of Words for train
vect=CountVectorizer()
X=vect.fit_transform(text)

In [20]:
## Bag of Words for test
X_t=vect.transform(textT)

In [21]:
y_train=y_train.values
y_test=y_test.values

## Logistic Regression

### Cross-validation results

In [22]:
logistic_reg=computeLogistic(5,1000,X,y_train)
print_mean()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Mean Accuracy: 0.7641365190686733
Mean Precision: 0.5817804667891033
Mean Recall: 0.3667166575982666
Mean F1 Score: 0.44986233168201845
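The warning in the output above means that the solver stopped at `max_iter` before reaching its convergence tolerance. One way to check this programmatically is to compare the fitted model's `n_iter_` attribute with the limit. The sketch below does that on a small synthetic problem; the helper `hit_iteration_limit` and the demo data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def hit_iteration_limit(model, max_iter):
    """Hypothetical helper: True if any solver run used up all max_iter iterations."""
    return bool(np.any(np.asarray(model.n_iter_) >= max_iter))


# Small synthetic problem that is expected to converge well within the limit
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_limit = 1000
demo_model = LogisticRegression(max_iter=demo_limit).fit(X_demo, y_demo)

print("iterations used:", demo_model.n_iter_)
print("hit the limit:", hit_iteration_limit(demo_model, demo_limit))
```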




### Test results

In [23]:
y_pred_test = logistic_reg.predict(X_t)
print_test(y_pred_test)


Test Accuracy: 0.7651568612076701
Test Precision: 0.5845420746714144
Test Recall: 0.36981944674507206
Test F1 Score: 0.4530254453958849


## Naive Bayes

### Cross-validation results

In [24]:
naive_bayes=compute_naive(5,X,y_train)
print_mean()

Mean Accuracy: 0.7515649162546555
Mean Precision: 0.5428177570680055
Mean Recall: 0.42837858697856496
Mean F1 Score: 0.47335420922726074


### Test results

In [25]:
y_pred_test = naive_bayes.predict(X_t)

print_test(y_pred_test)

Test Accuracy: 0.7410853523605412
Test Precision: 0.5081326352530541
Test Recall: 0.4822925294020209
Test F1 Score: 0.49487549927764085


## TF-IDF

In [26]:

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(text)
X_test_tfidf = tfidf_vectorizer.transform(textT)

In [27]:
X_train_tfidf

<459130x240545 sparse matrix of type '<class 'numpy.float64'>'
	with 47974839 stored elements in Compressed Sparse Row format>
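The cells above fit the vectorizer on the training text only and reuse it to transform the test text, so both matrices share the same columns. Here is a minimal sketch of that pattern on toy documents. The documents and the names `toy_tfidf`, `train_mat`, and `test_mat` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for illustration
train_docs = ["great movie", "movie ending twist"]
test_docs = ["movie unseen"]  # "unseen" is not in the training vocabulary

toy_tfidf = TfidfVectorizer()
train_mat = toy_tfidf.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_mat = toy_tfidf.transform(test_docs)        # reuses them, ignores unknown words

# With the default settings each non-empty row is scaled to unit L2 norm
row_norms = np.asarray(np.sqrt(train_mat.multiply(train_mat).sum(axis=1))).ravel()

print(train_mat.shape, test_mat.shape)
print(row_norms)
```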

### Logistic Regression

In [28]:
logistic_reg=computeLogistic(5,1000,X_train_tfidf,y_train)
print_mean()

Mean Accuracy: 0.7607983214630569
Mean Precision: 0.5814609280366747
Mean Recall: 0.39667934390511417
Mean F1 Score: 0.4631081223667101


In [29]:
y_pred_test = logistic_reg.predict(X_test_tfidf)
print_test(y_pred_test)

Test Accuracy: 0.7797583265814624
Test Precision: 0.6580932121446529
Test Recall: 0.33821434487328145
Test F1 Score: 0.4468029235415117


### Naive Bayes

In [30]:
naive_bayes=compute_naive(5,X_train_tfidf,y_train)
print_mean()

Mean Accuracy: 0.755747282904624
Mean Precision: 0.6148767691415962
Mean Recall: 0.30314977314182534
Mean F1 Score: 0.358266525473654


In [31]:
y_pred_test = naive_bayes.predict(X_test_tfidf)

print_test(y_pred_test)

Test Accuracy: 0.7404406575886673
Test Precision: 0.7848837209302325
Test Recall: 0.017889680304787145
Test F1 Score: 0.03498202312700418
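To put these numbers in context, the sketch below uses the test-set class counts printed earlier to compute the accuracy of always predicting the majority class. It also recomputes F1 from the precision and recall just reported. The result suggests that the 0.740 accuracy here is only slightly above the majority-class baseline of about 0.737, which fits the very low recall.

```python
# Test-set class counts copied from the y_test.value_counts() output above
n_not_spoiler = 84598
n_spoiler = 30185

# Accuracy of a classifier that always predicts the majority class (False)
majority_accuracy = n_not_spoiler / (n_not_spoiler + n_spoiler)

# F1 recomputed from the precision and recall reported just above
precision = 0.7848837209302325
recall = 0.017889680304787145
f1 = 2 * precision * recall / (precision + recall)

print(f"majority-class accuracy: {majority_accuracy:.4f}")
print(f"F1 from precision/recall: {f1:.4f}")
```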


In [None]:
#### MAYBE ALSO USE embeddings