## Baseline with Simple Model

This notebook builds baselines with two simple models, logistic regression and naive Bayes, on bag-of-words and TF-IDF features, using stratified k-fold cross-validation for training.

In [1]:
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=1

env: CUDA_DEVICE_ORDER=PCI_BUS_ID
env: CUDA_VISIBLE_DEVICES=1


### Import libraries
We use pandas to read the datasets, pandarallel to parallelize row-wise processing, and scikit-learn for the Naive Bayes and Logistic Regression models, the train/test split, and the evaluation metrics. cuML is imported for a GPU implementation of logistic regression.

In [2]:
import pandas as pd
from pandarallel import pandarallel
import ast
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

In [3]:
import cuml
import cuml.linear_model
import cuml.common
from numba import cuda
from cuml.linear_model import LogisticRegression as cuMLLogisticRegression

In [4]:

device_id = 0
cuda.select_device(device_id)

<weakproxy at 0x754de0cdde90 to Device at 0x754de0ce8a90>

In [5]:
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 56 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## Read the dataset
The function **ast.literal_eval** converts strings back into Python objects. It is needed because the token lists in the cleaned dataset come back as strings when the CSV is loaded.
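
As a minimal sketch of why this step is needed: a token list written to CSV is read back as its string representation, and `ast.literal_eval` parses that string into a list again. The sample value below is made up for illustration and is not taken from the dataset.

```python
import ast

# Hypothetical cell value as it would look after a CSV round trip:
# the token list has become a plain string.
raw_cell = "['great', 'movie', 'twist', 'ending']"

# literal_eval only parses Python literals, so it does not execute code.
tokens = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(tokens).__name__)
print(tokens)
```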

In [6]:
dataRew=pd.read_csv('../Dataset/datiClean.csv')
dataMovie=pd.read_csv('../Dataset/movieclean.csv')

In [7]:
dataRew["clean_review"]=dataRew.loc[:,"clean_review"].parallel_apply(ast.literal_eval)

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=10249), Label(value='0 / 10249')))…



In [8]:
dataMovie["plot_clean"]=dataMovie.loc[:,"plot_clean"].parallel_apply(ast.literal_eval)

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=29), Label(value='0 / 29'))), HBox…

In [9]:
dataMovie.drop(['plot_synopsis','plot_summary'],axis=1,inplace=True)

### Split the Dataset

We divide the dataset into train and test sets with a stratified split, so that both sets keep the same class proportions. The same random state is used in every notebook so that the split is identical and the results are easier to compare.

Drop the unused fields.

In [10]:
dataRew.drop(['review_date','movie_id','user_id','rating','review_summary','review_text'],axis=1,inplace=True)

In [11]:
x=dataRew['clean_review']
y=dataRew['is_spoiler']

In [12]:
# Stratify on y so that train and test keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=42)

The train and test sets follow the spoiler distribution of the original dataset.

In [13]:
y_train.value_counts()

is_spoiler
False    338391
True     120739
Name: count, dtype: int64

In [14]:
y_test.value_counts()

is_spoiler
False    84598
True     30185
Name: count, dtype: int64
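
The two `value_counts()` outputs can be checked with a little arithmetic. The sketch below copies the counts printed above and compares the share of spoilers in each set; the helper name `spoiler_share` is only illustrative.

```python
# Class counts copied from the value_counts() outputs above.
train_counts = {False: 338391, True: 120739}
test_counts = {False: 84598, True: 30185}

def spoiler_share(counts):
    """Fraction of reviews labelled as spoilers."""
    return counts[True] / (counts[True] + counts[False])

train_share = spoiler_share(train_counts)
test_share = spoiler_share(test_counts)

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())
test_fraction = n_test / (n_train + n_test)

print(f"Spoiler share (train): {train_share:.4f}")
print(f"Spoiler share (test):  {test_share:.4f}")
print(f"Test fraction of all rows: {test_fraction:.4f}")
```

Both shares come out at roughly 26%, so a classifier that always predicts "not spoiler" would already reach about 74% accuracy. This is a useful reference point when reading the accuracy values later in the notebook.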

### Apply Logistic Regression and Naive Bayes with k-fold cross-validation

### Functions for reporting the model results

In [15]:
def print_mean():
    # Calculate the averages of the metrics
    mean_accuracy = np.mean(metrics['accuracy'])
    mean_precision = np.mean(metrics['precision'])
    mean_recall = np.mean(metrics['recall'])
    mean_f1_score = np.mean(metrics['f1_score'])

    # Print the averages of the metrics
    print("Mean Accuracy:", mean_accuracy)
    print("Mean Precision:", mean_precision)
    print("Mean Recall:", mean_recall)
    print("Mean F1 Score:", mean_f1_score)

In [16]:
def print_test(y_pred_test):
    # Compute the evaluation metrics on the test set
    accuracy_test = accuracy_score(y_test, y_pred_test)
    precision_test = precision_score(y_test, y_pred_test)
    recall_test = recall_score(y_test, y_pred_test)
    f1_score_test = f1_score(y_test, y_pred_test)

    # Print the evaluation metrics on the test set
    print("Test Accuracy:", accuracy_test)
    print("Test Precision:", precision_test)
    print("Test Recall:", recall_test)
    print("Test F1 Score:", f1_score_test)
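
For reference, the four scores used throughout this notebook can be written directly in terms of confusion-matrix counts. The sketch below is a plain-Python illustration with made-up counts; returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical confusion-matrix counts, chosen only for illustration.
acc, prec, rec, f1 = binary_metrics(tp=30, fp=10, fn=70, tn=290)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

With an imbalanced label such as `is_spoiler`, accuracy can look good while recall on the positive class stays low, which is why all four values are reported.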

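Function to train the logistic regression model with stratified k-fold cross-validation. It takes as input the number of folds and the maximum number of iterations, together with the regularization settings, and returns the model fitted on the last fold.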

In [17]:
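# Module-level store shared by the training functions and print_mean();
# scores accumulate across calls until this cell is re-run.
metrics = {
    'accuracy': [],
    'precision': [],
    'recall': [],
    'f1_score': []
}
def computeLogistic(folds,max_iter,X,y_train,penalty,C,class_weight):
    logistic_reg=LogisticRegression(max_iter=max_iter,penalty=penalty,C=C,class_weight=class_weight)
    # Stratified k-fold keeps the class proportions the same in every fold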

    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
    for train_index, val_index in kf.split(X, y_train):
        X_fold_train, X_fold_val = X[train_index], X[val_index]
        y_fold_train, y_fold_val = y_train[train_index], y_train[val_index]
        
        # Training
        logistic_reg.fit(X_fold_train, y_fold_train)
        
        # Validation
        y_pred = logistic_reg.predict(X_fold_val)
        
        # Compute metrics
        metrics['accuracy'].append(accuracy_score(y_fold_val, y_pred))
        metrics['precision'].append(precision_score(y_fold_val, y_pred))
        metrics['recall'].append(recall_score(y_fold_val, y_pred))
        metrics['f1_score'].append(f1_score(y_fold_val, y_pred))
    return logistic_reg
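
The idea behind stratified folds can be sketched in plain Python: the indices of each class are shuffled and then dealt round-robin across the folds, so every fold receives about the same class mix. This is a simplified illustration, not scikit-learn's exact algorithm, and the function name `stratified_folds` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits, seed=42):
    """Assign each sample index to a fold, class by class (simplified)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Toy labels with roughly the same imbalance as the dataset (1 in 4 positive).
labels = [True] * 25 + [False] * 75
folds = stratified_folds(labels, n_splits=5)

for k, val_idx in enumerate(folds):
    positives = sum(labels[i] for i in val_idx)
    print(f"fold {k}: size={len(val_idx)} positives={positives}")
```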

Function to train the Naive Bayes model in the same way. It takes as input the number of folds, the smoothing parameter `alpha`, and `fit_prior`.

In [18]:
metrics = {
    'accuracy': [],
    'precision': [],
    'recall': [],
    'f1_score': []
}
def compute_naive(folds,X,y_train,alpha,fit_prior):
    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
    naive_bayes = MultinomialNB(alpha=alpha,fit_prior=fit_prior)
    
    for train_index, val_index in kf.split(X, y_train):
        X_fold_train, X_fold_val = X[train_index], X[val_index]
        y_fold_train, y_fold_val = y_train[train_index], y_train[val_index]
        
        # Training
        naive_bayes.fit(X_fold_train, y_fold_train)
        
        # Validation
        y_pred = naive_bayes.predict(X_fold_val)
        
        # Compute metrics
        metrics['accuracy'].append(accuracy_score(y_fold_val, y_pred))
        metrics['precision'].append(precision_score(y_fold_val, y_pred))
        metrics['recall'].append(recall_score(y_fold_val, y_pred))
        metrics['f1_score'].append(f1_score(y_fold_val, y_pred))
    return naive_bayes
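
Both training functions append to the module-level `metrics` dictionary, and `print_mean()` averages everything stored in it. Unless the defining cell is re-run between experiments, the means mix fold scores from different runs. One possible way to guard against this is sketched below; `reset_metrics` is a hypothetical helper that is not part of the original notebook, and the scores are made up.

```python
# Stand-in for the notebook's module-level dictionary.
metrics = {'accuracy': [], 'precision': [], 'recall': [], 'f1_score': []}

def reset_metrics():
    """Empty every metric list in place so the next run starts clean."""
    for values in metrics.values():
        values.clear()

# Pretend a previous 5-fold run left its scores behind (made-up numbers).
metrics['accuracy'].extend([0.77, 0.78, 0.77, 0.78, 0.77])

reset_metrics()
print({name: len(values) for name, values in metrics.items()})
```

Calling such a helper right before `computeLogistic(...)` or `compute_naive(...)` would make each call to `print_mean()` report only the folds of the latest run.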

## Using Bag of Words
To apply Bag of Words, we first rebuild each document as a single string by joining its tokens with spaces, then fit `CountVectorizer` on these strings. The result is a matrix in which each row corresponds to a document and each column to a token.

In [19]:
text=[" ".join(word) for word in X_train]

In [20]:
textT=[" ".join(word) for word in X_test]

In [21]:
## Bag of Words for train
vect=CountVectorizer()
X=vect.fit_transform(text)

In [22]:
## Bag of Words for test
X_t=vect.transform(textT)
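
Conceptually, the count matrix built above looks like the plain-Python sketch below, shown on three made-up token lists. `CountVectorizer` stores the result as a sparse matrix and applies its own tokenization (by default it lowercases the text and ignores single-character tokens), so this is only an illustration of the row/column layout.

```python
from collections import Counter

# Toy tokenised documents (hypothetical, not from the dataset).
docs = [["hero", "dies", "end"], ["hero", "wins"], ["end", "end", "credits"]]

# Vocabulary: one column per distinct token, in sorted order.
vocabulary = sorted({token for doc in docs for token in doc})

# Count matrix: one row per document, one column per token.
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts.get(token, 0) for token in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```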

In [23]:
y_train=y_train.values
y_test=y_test.values

### Logistic Regression

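### Try Grid Search
Grid search over the regularization settings for logistic regression. The `l1` candidates score `nan` in the log because scikit-learn's default `lbfgs` solver does not support the `l1` penalty; fitting them would need a solver such as `liblinear` or `saga`.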

In [53]:

param_grid = {         
    'penalty': ['l1', 'l2'],                     # Regularization type
    'C': [0.001, 0.01, 0.1, 1, 10, 100],         # Inverse regularization strength
    'class_weight': [None, 'balanced']           # Class weighting
}
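
The size of this search follows directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. The sketch below enumerates the combinations with `itertools.product`, assuming the 5-fold setting (`cv=5`) used in the grid search call.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'class_weight': [None, 'balanced'],
}
cv_folds = 5  # assumed to match cv=5 in the grid search call

# One dict per combination of parameter values.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

print("candidates:", len(candidates))
print("total fits:", len(candidates) * cv_folds)
```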



In [None]:
model=LogisticRegression(max_iter=1000)
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END C=0.001, class_weight=None, penalty=l1;, score=nan total time=   0.2s
[CV 2/5] END C=0.001, class_weight=None, penalty=l1;, score=nan total time=   0.2s
[CV 3/5] END C=0.001, class_weight=None, penalty=l1;, score=nan total time=   0.2s
[CV 4/5] END C=0.001, class_weight=None, penalty=l1;, score=nan total time=   0.2s
[CV 5/5] END C=0.001, class_weight=None, penalty=l1;, score=nan total time=   0.2s
[CV 1/5] END C=0.001, class_weight=None, penalty=l2;, score=0.774 total time=  30.3s
[CV 2/5] END C=0.001, class_weight=None, penalty=l2;, score=0.775 total time=  29.3s
[CV 3/5] END C=0.001, class_weight=None, penalty=l2;, score=0.773 total time=  26.6s
[CV 4/5] END C=0.001, class_weight=None, penalty=l2;, score=0.774 total time=  28.7s
[CV 5/5] END C=0.001, class_weight=None, penalty=l2;, score=0.771 total time=  26.7s
[CV 1/5] END C=0.001, class_weight=balanced, penalty=l1;, score=nan total time=   0.2s
[CV 2/5] EN

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5] END C=1, class_weight=None, penalty=l2;, score=0.764 total time= 5.5min
[CV 2/5] END C=1, class_weight=None, penalty=l2;, score=0.765 total time= 5.6min
[CV 3/5] END C=1, class_weight=None, penalty=l2;, score=0.763 total time= 5.5min
[CV 4/5] END C=1, class_weight=None, penalty=l2;, score=0.767 total time= 5.5min
[CV 5/5] END C=1, class_weight=None, penalty=l2;, score=0.765 total time= 5.2min
[CV 1/5] END C=1, class_weight=balanced, penalty=l1;, score=nan total time=   0.2s
[CV 2/5] END C=1, class_weight=balanced, penalty=l1;, score=nan total time=   0.2s
[CV 3/5] END C=1, class_weight=balanced, penalty=l1;, score=nan total time=   0.2s
[CV 4/5] END C=1, class_weight=balanced, penalty=l1;, score=nan total time=   0.2s
[CV 5/5] END C=1, class_weight=balanced, penalty=l1;, score=nan total time=   0.2s
[CV 1/5] END C=1, class_weight=balanced, penalty=l2;, score=0.720 total time= 5.5min
[CV 2/5] END C=1, class_weight=balanced, penalty=l2;, score=0.725 total time= 5.5min
[CV 3/5] END C=1, class_weight=balanced, penalty=l2;, score=0.725 total time= 5.5min
[CV 4/5] END C=1, class_weight=balanced, penalty=l2;, score=0.724 total time= 5.7min
[CV 5/5] END C=1, class_weight=balanced, penalty=l2;, score=0.722 total time= 5.8min
[CV 1/5] END .C=10, class_weight=None, penalty=l1;, score=nan total time=   0.2s
[CV 2/5] END .C=10, class_weight=None, penalty=l1;, score=nan total time=   0.2s
[CV 3/5] END .C=10, class_weight=None, penalty=l1;, score=nan total time=   0.2s
[CV 4/5] END .C=10, class_weight=None, penalty=l1;, score=nan total time=   0.2s
[CV 5/5] END .C=10, class_weight=None, penalty=l1;, score=nan total time=   0.2s
[CV 1/5] END C=10, class_weight=None, penalty=l2;, score=0.749 total time= 5.3min
[CV 2/5] END C=10, class_weight=None, penalty=l2;, score=0.748 total time= 5.2min
[CV 3/5] END C=10, class_weight=None, penalty=l2;, score=0.748 total time= 5.3min
[CV 4/5] END C=10, class_weight=None, penalty=l2;, score=0.754 total time= 5.3min
[CV 5/5] END C=10, class_weight=None, penalty=l2;, score=0.747 total time= 5.3min
[CV 1/5] END C=10, class_weight=balanced, penalty=l1;, score=nan total time=   0.2s
[CV 2/5] END C=10, class_weight=balanced, penalty=l1;, score=nan total time=   0.2s
[CV 3/5] END C=10, class_weight=balanced, penalty=l1;, score=nan total time=   0.2s
[CV 4/5] END C=10, class_weight=balanced, penalty=l1;, score=nan total time=   0.2s
[CV 5/5] END C=10, class_weight=balanced, penalty=l1;, score=nan total time=   0.2s
[CV 1/5] END C=10, class_weight=balanced, penalty=l2;, score=0.708 total time= 5.6min
[CV 2/5] END C=10, class_weight=balanced, penalty=l2;, score=0.712 total time= 5.5min


In [61]:
best_params = grid_search.best_params_

In [62]:
print(best_params)

{'C': 0.01, 'class_weight': None, 'penalty': 'l2'}
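
The search selected `class_weight=None`. For reference, the `'balanced'` option weights each class by `n_samples / (n_classes * class_count)`, which is the heuristic documented by scikit-learn. The sketch below applies that formula to the training-set counts from the split section of this notebook.

```python
# Training-set class counts from the split section of this notebook.
counts = {False: 338391, True: 120739}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: weight = n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * count)
           for label, count in counts.items()}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these weights both classes contribute the same total weight to the loss, which typically trades some accuracy for recall on the minority class. In the log above, the `balanced` candidates indeed score lower on accuracy than the unweighted ones.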



In [None]:
## Save the best parameters found by the grid search

with open("../Output/outputGridLog.txt", "a") as f:
    print(f"Best parameters: {best_params}",file=f)

### Training and Validation Results

In [29]:
C=0.01
class_weight=None
penalty='l2'

In [30]:
logistic_reg=computeLogistic(5,1500,X,y_train,penalty,C,class_weight)

print_mean()

Mean Accuracy: 0.7776555659617101
Mean Precision: 0.6742965163142707
Mean Recall: 0.29888436967766563
Mean F1 Score: 0.4141759272972262


### Test Results

In [31]:
y_pred_test = logistic_reg.predict(X_t)
print_test(y_pred_test)


Test Accuracy: 0.7788696932472579
Test Precision: 0.677533821246396
Test Recall: 0.3036276296173596
Test F1 Score: 0.41933565153733526
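
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the test F1 can be recomputed from the two values printed above. The numbers in the sketch are copied from that output.

```python
# Test precision and recall as printed above
# (logistic regression on bag-of-words features).
precision = 0.677533821246396
recall = 0.3036276296173596
reported_f1 = 0.41933565153733526

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"recomputed F1: {f1:.6f}  reported F1: {reported_f1:.6f}")
```

The same identity does not hold exactly for the cross-validation means, because the mean of per-fold F1 scores is not the harmonic mean of the mean precision and mean recall.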


## Naive Bayes

### Grid Search for parameters

In [30]:
### Grid search
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0, 5.0],
    'fit_prior': [True, False]
}
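
The `alpha` parameter controls additive smoothing of the per-class token probabilities, estimated as `(count + alpha) / (total + alpha * vocabulary_size)`, while `fit_prior` decides whether class priors are learned from the data or taken as uniform. The sketch below shows the effect of `alpha` on made-up counts; the function name and the data are illustrative only.

```python
def smoothed_likelihoods(token_counts, vocabulary, alpha):
    """P(token | class) with additive (Lidstone) smoothing."""
    total = sum(token_counts.get(token, 0) for token in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {token: (token_counts.get(token, 0) + alpha) / denominator
            for token in vocabulary}

# Hypothetical token counts for one class; "twist" was never seen in it.
vocabulary = ["ending", "plot", "twist"]
counts_in_class = {"ending": 6, "plot": 4}

for alpha in (0.1, 1.0, 5.0):
    probs = smoothed_likelihoods(counts_in_class, vocabulary, alpha)
    print(alpha, {token: round(p, 3) for token, p in probs.items()})
```

A small `alpha` keeps the estimates close to the raw frequencies, while a large one pulls all tokens toward the same probability. Unseen tokens never get probability zero as long as `alpha` is positive.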


In [39]:
### TRY GRID
naive_bayes = MultinomialNB()
# Run the grid search
grid_search = GridSearchCV(naive_bayes, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y_train)

In [40]:
best_params = grid_search.best_params_

In [57]:
print(best_params)

{'alpha': 0.1, 'fit_prior': True}


### Training and Validation Results

In [42]:
naive_bayes=compute_naive(5,X,y_train,best_params['alpha'],best_params['fit_prior'])
print_mean()

Mean Accuracy: 0.7335580336723803
Mean Precision: 0.5083741452480295
Mean Recall: 0.45899421060909995
Mean F1 Score: 0.4684245825990197


### Test Results

In [43]:
y_pred_test = naive_bayes.predict(X_t)

print_test(y_pred_test)

Test Accuracy: 0.7562618157741129
Test Precision: 0.5612109115103127
Test Recall: 0.3353321186019546
Test F1 Score: 0.41981709213828


## TF-IDF

In [32]:

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(text)
X_test_tfidf = tfidf_vectorizer.transform(textT)

In [33]:
X_train_tfidf

<459130x240545 sparse matrix of type '<class 'numpy.float64'>'
	with 47974839 stored elements in Compressed Sparse Row format>
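
To make the weighting concrete, the sketch below computes TF-IDF in plain Python on three made-up token lists, using what `TfidfVectorizer` does with its default settings: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of every row. The function name `tfidf_rows` is hypothetical, and the real vectorizer also applies its own tokenization and returns a sparse matrix.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalised rows."""
    vocabulary = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log((1 + n_docs) / (1 + doc_freq[token])) + 1
           for token in vocabulary}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[token] * idf[token] for token in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

# Toy tokenised documents (hypothetical, not from the dataset).
docs = [["hero", "dies", "end"], ["hero", "wins"], ["end", "end", "credits"]]
vocabulary, rows = tfidf_rows(docs)

print(vocabulary)
for row in rows:
    print([round(value, 3) for value in row])
```

Tokens that occur in fewer documents receive a larger weight than common ones. In the matrix shown above only about 0.04% of the entries are stored (47,974,839 out of 459,130 x 240,545), which is why a sparse format is used.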

### Logistic Regression

### Try different regularization parameters for TF-IDF

For TF-IDF, we keep the penalty and class weight chosen by the previous grid search and tune only the parameter C.

In [34]:

param_grid = {                        
    'C': [0.001, 0.01, 0.1, 1, 10, 100],         # Inverse regularization strength
}


In [36]:
model=cuMLLogisticRegression(max_iter=1500,penalty='l2')
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', verbose=3)
grid_search.fit(X_train_tfidf,y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 1/5] END ...........................C=0.001;, score=0.737 total time=   0.4s
[CV 2/5] END ...........................C=0.001;, score=0.737 total time=   0.4s
[CV 3/5] END ...........................C=0.001;, score=0.737 total time=   0.4s
[CV 4/5] END ...........................C=0.001;, score=0.737 total time=   0.4s
[CV 5/5] END ...........................C=0.001;, score=0.737 total time=   0.4s
[CV 1/5] END ............................C=0.01;, score=0.745 total time=   0.4s
[CV 2/5] END ............................C=0.01;, score=0.744 total time=   0.4s
[CV 3/5] END ............................C=0.01;, score=0.745 total time=   0.5s
[CV 4/5] END ............................C=0.01;, score=0.745 total time=   0.4s
[CV 5/5] END ............................C=0.01;, score=0.745 total time=   0.4s
[CV 1/5] END .............................C=0.1;, score=0.772 total time=   0.6s
[CV 2/5] END .............................C=0.1;,

In [37]:
best_params = grid_search.best_params_

In [38]:
print(best_params)

{'C': 1}


### Apply the model

In [41]:
C=best_params['C']

In [43]:
logistic_reg=computeLogistic(5,1500,X_train_tfidf,y_train,'l2',C,None)
print_mean()

Mean Accuracy: 0.774037191323098
Mean Precision: 0.6846282462388742
Mean Recall: 0.27978059056329424
Mean F1 Score: 0.3819288801432443


In [45]:
y_pred_test = logistic_reg.predict(X_test_tfidf)
print_test(y_pred_test)

Test Accuracy: 0.7797583265814624
Test Precision: 0.6580932121446529
Test Recall: 0.33821434487328145
Test F1 Score: 0.4468029235415117


### Naive Bayes

In [26]:
### Grid search
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0, 5.0],
    'fit_prior': [True, False]
}

In [27]:
### TRY GRID
naive_bayes = MultinomialNB()
# Run the grid search
grid_search = GridSearchCV(naive_bayes, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_tfidf, y_train)

In [28]:
best_params = grid_search.best_params_

In [29]:
print(best_params)

{'alpha': 0.1, 'fit_prior': True}


In [30]:
naive_bayes=compute_naive(5,X_train_tfidf,y_train,best_params['alpha'],best_params['fit_prior'])
print_mean()

Mean Accuracy: 0.7538714525297846
Mean Precision: 0.6373659583614911
Mean Recall: 0.14860982514345233
Mean F1 Score: 0.2410190994486531


In [31]:
y_pred_test = naive_bayes.predict(X_test_tfidf)

print_test(y_pred_test)

Test Accuracy: 0.7543103072754677
Test Precision: 0.6647840531561462
Test Recall: 0.13258240848103361
Test F1 Score: 0.22107443723242645
