# Topic 11 - Exercise
## Improving a machine learning model

1. Using the procedure described in chapter 11 of the book with the caret
library, search for the optimal parameters of any of the models from the
assessment exercises of topics 3, 4, 5, 6 or 7.<br>
Comment on the results obtained.



We choose topic 4 (Naive Bayes).

We follow the same steps as in that exercise and then search for the best hyperparameters with scikit-learn's GridSearchCV.

We import the dependencies

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches
import seaborn as sb

%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk import PorterStemmer as Stemmer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

[nltk_data] Downloading package stopwords to /home/francd/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

We load the input CSV file with pandas, specifying the comma as the separator. head(5) displays the first 5 records.

In [12]:
reviewsdf = pd.read_csv(r"Movie_pang02.csv",sep=',')
#reviewsdf.head(5)
#reviewsdf.tail(5)

In [13]:
#Summary of target variable
# reviewsTable = reviewsdf.groupby("class")
# totals = reviewsTable.size()
# total = sum(totals)
# print(totals)
#An easier way
reviewsdf.groupby('class').describe()

| class | text: count | text: unique | text: top | text: freq |
|---|---|---|---|---|
| Neg | 1000 | 1000 | plot two teen couples go to a church party ... | 1 |
| Pos | 1000 | 1000 | films adapted from comic books have had plent... | 1 |


This confirms that the file contains the same number of positive and negative reviews (1000 each).

### Cleaning and preparation

In [14]:
def process(text):
    # To lowercase
    text = text.lower()
    # Remove digits
    text = ''.join([t for t in text if not t.isdigit()])
    # Remove punctuation
    text = ''.join([t for t in text if t not in string.punctuation])
    # Remove stopwords (split() also discards surplus whitespace)
    text = [t for t in text.split() if t not in stopwords.words('english')]
    # Stemming
    st = Stemmer()
    text = [st.stem(t) for t in text]
    # Return the cleaned tokens joined into a single string
    result = ' '.join(text)
    return result
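
In the cell above, `stopwords.words('english')` is evaluated again for every token, which probably accounts for much of the running time reported below. The following is a minimal sketch of the same cleaning steps with the stop-word collection built once and stored as a set. The name `process_fast`, the tiny `STOP_WORDS` set and the identity `stem` default are hypothetical stand-ins so the example runs without NLTK; in the notebook one would presumably pass `set(stopwords.words('english'))` and `Stemmer().stem` instead. Note that it strips only ASCII digits, whereas `isdigit()` also matches other Unicode digits.

```python
import string

# Hypothetical stand-in stop-word set; the notebook uses stopwords.words('english').
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "is", "in", "it"})

# Translation table that deletes ASCII digits and punctuation in a single pass.
_DROP_CHARS = str.maketrans("", "", string.digits + string.punctuation)


def process_fast(text, stop_words=STOP_WORDS, stem=lambda token: token):
    """Lowercase, strip digits/punctuation, drop stop words, stem, re-join."""
    tokens = text.lower().translate(_DROP_CHARS).split()
    # Set membership is O(1), and the set is built once rather than per token.
    return " ".join(stem(t) for t in tokens if t not in stop_words)


print(process_fast("The 2 plots of it: an EPIC, thrilling story!"))
# -> plots epic thrilling story
```

With a real stemmer plugged in, the output should match that of `process` for ASCII text, but this has not been checked against the review corpus.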


In [15]:
%%time
clean_reviewsdf = reviewsdf['text'].apply(process)

CPU times: user 1min 13s, sys: 10.7 s, total: 1min 24s
Wall time: 1min 24s


In [16]:
clean_reviewsdf.head(5)

0    film adapt comic book plenti success whether s...
1    everi movi come along suspect studio everi ind...
2    got mail work alot better deserv order make fi...
3    jaw rare film grab attent show singl imag scre...
4    moviemak lot like gener manag nfl team post sa...
Name: text, dtype: object

Separate the independent variables from the dependent/target variable

In [17]:
# Separate features and target variable
X = clean_reviewsdf
y = reviewsdf['class']

### Text vectorization

In [20]:
#CountVectorizer
vect1000 = CountVectorizer(ngram_range=(1, 3), max_features = 1000)
vect1000

In [21]:
%%time
X = vect1000.fit_transform(X).astype(np.int8)

CPU times: user 3.46 s, sys: 124 ms, total: 3.58 s
Wall time: 3.59 s


In [22]:
X

<Compressed Sparse Row sparse matrix of dtype 'int8'
	with 291608 stored elements and shape (2000, 1000)>

In [23]:
feature_names_count = vect1000.get_feature_names_out()
# Print the feature names
print("CountVectorizer feature names:", feature_names_count.shape)

CountVectorizer feature names: (1000,)


In [24]:
#TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df = 100)
tfidf_vectorizer

In [25]:
%%time
X2 = tfidf_vectorizer.fit_transform(clean_reviewsdf)

CPU times: user 366 ms, sys: 7.79 ms, total: 374 ms
Wall time: 376 ms


In [26]:
X2

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 301410 stored elements and shape (2000, 1077)>

In [22]:
feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()
# Print the feature names
feature_names_tfidf
#print("TfidfVectorizer feature names:", feature_names_tfidf)

array(['abil', 'abl', 'absolut', ..., 'yet', 'york', 'young'],
      shape=(1077,), dtype=object)

### Split the reviews into two different sets (training and test)
The reviews are already vectorized

In [27]:
#CountVectorizer
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10, shuffle=True, stratify=y)
print(f"X_train.shape: {X_train.shape}, X_test.shape: {X_test.shape}, y_train.shape: {y_train.shape}, y_test.shape: {y_test.shape}") 

X_train.shape: (1500, 1000), X_test.shape: (500, 1000), y_train.shape: (1500,), y_test.shape: (500,)


In [28]:
#TfidfVectorizer
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.25, random_state=10, shuffle=True, stratify=y)
print(f"X2_train.shape: {X2_train.shape}, X2_test.shape: {X2_test.shape}, y2_train.shape: {y2_train.shape}, y2_test.shape: {y2_test.shape}") 

X2_train.shape: (1500, 1077), X2_test.shape: (500, 1077), y2_train.shape: (1500,), y2_test.shape: (500,)
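
Both vectorizers above were fitted on all 2000 reviews before the split, so the selected vocabulary also reflects the test reviews. One possible alternative, sketched here under assumptions, is to split the raw text first and let a scikit-learn `Pipeline` fit the vectorizer on the training part only. The toy corpus and the variable names (`texts`, `labels`, `pipe`, `txt_train`, ...) are hypothetical and stand in for the cleaned reviews; this is not what the notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the cleaned reviews.
texts = ["good great fun film", "great act good plot", "fun good stori",
         "great fun movi", "good film great cast", "fun great good end",
         "bad bore dull film", "dull act bad plot", "bore bad stori",
         "dull bore movi", "bad film dull cast", "bore dull bad end"]
labels = ["Pos"] * 6 + ["Neg"] * 6

# Split the raw text first so the test documents never reach fit().
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=10, stratify=labels)

# Same vectorizer settings as vect1000 above, now fitted inside the pipeline.
pipe = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 3), max_features=1000)),
    ("nb", MultinomialNB()),
])
pipe.fit(txt_train, lab_train)

vocab_size = len(pipe.named_steps["vect"].vocabulary_)
test_accuracy = pipe.score(txt_test, lab_test)
print(vocab_size, test_accuracy)
```

A pipeline like this can also be passed to `GridSearchCV`, in which case parameters are addressed with the step name as prefix, for example `nb__alpha`.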


## Training and testing using GridSearchCV


In [36]:
# aux. function
def classifier_testing(clf, X_train, X_test, y_train, y_test):
    # Training
    clf.fit(X_train, y_train)

    #Predictions
    y_pred = clf.predict(X_test)

    #Accuracy
    clf_accuracy_score = accuracy_score(y_test, y_pred)
    print("Accuracy Score:\n", clf_accuracy_score, "\n")

    #Classification Report
    class_rep = classification_report(y_test, y_pred)
    print("Classification Report:\n", class_rep, "\n")

    #Confusion Matrix
    conf_mtx = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:\n", conf_mtx, "\n")

**BernoulliNB Classifier**

In [31]:
from sklearn.model_selection import GridSearchCV 

In [57]:
%%time

# Define parameter grid
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 3, 5,10,20,100],
    'fit_prior': [True, False]
}

grid_search = GridSearchCV(
    BernoulliNB(),
    param_grid,
    cv=5,
    scoring='accuracy',
    verbose=1
)

grid_search.fit(X_train, y_train)
grid_search.best_params_

Fitting 5 folds for each of 16 candidates, totalling 80 fits
CPU times: user 514 ms, sys: 0 ns, total: 514 ms
Wall time: 513 ms


{'alpha': 1.0, 'fit_prior': True}
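
`best_params_` reports only the winning combination. To judge how far apart the candidates are, the `cv_results_` attribute of a fitted `GridSearchCV` can be inspected. The sketch below does this on a synthetic count matrix; `X_demo`, `y_demo` and the reduced grid are hypothetical and only stand in for `X_train`, `y_train` and `param_grid`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Hypothetical synthetic count data standing in for X_train / y_train.
rng = np.random.default_rng(0)
y_demo = np.repeat(["Neg", "Pos"], 100)
rates = np.where(y_demo[:, None] == "Pos", 0.6, 0.4)  # per-row Poisson rate
X_demo = rng.poisson(lam=rates, size=(200, 30))

grid = GridSearchCV(
    BernoulliNB(),
    {"alpha": [0.1, 1.0, 10], "fit_prior": [True, False]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_demo, y_demo)

# Order the candidates by rank and keep mean and std of the fold accuracies.
res = grid.cv_results_
order = np.argsort(res["rank_test_score"])
ranking = [
    (res["params"][i], res["mean_test_score"][i], res["std_test_score"][i])
    for i in order
]
for params, mean, std in ranking:
    print(f"{mean:.3f} +/- {std:.3f}  {params}")
```

When several candidates lie within one standard deviation of each other, the choice among them is largely arbitrary, which is worth keeping in mind when comparing against a manually chosen setting.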

In [58]:
# Evaluate on test set
best_model = grid_search.best_estimator_
accuracy = best_model.score(X_test, y_test)
print(f"Test set accuracy: {accuracy:.3f}")

Test set accuracy: 0.746


In [59]:
# CountVectorizer features, Laplace smoothing alpha = 1

# Initializing NaiveBayes-BernoulliNB Classifier
BNB = BernoulliNB(alpha=1, fit_prior=True)    #  by default force_alpha=True

In [60]:
%%time
# Training / Testing
classifier_testing(BNB, X_train, X_test, y_train, y_test)

Accuracy Score:
 0.746 

Classification Report:
               precision    recall  f1-score   support

         Neg       0.72      0.81      0.76       250
         Pos       0.78      0.68      0.73       250

    accuracy                           0.75       500
   macro avg       0.75      0.75      0.75       500
weighted avg       0.75      0.75      0.75       500
 

Confusion Matrix:
 [[202  48]
 [ 79 171]] 

CPU times: user 21 ms, sys: 118 μs, total: 21.1 ms
Wall time: 19.5 ms


<br>

**Previously, in exercise 4:**

Accuracy Score:
 0.746 

Classification Report:
               precision    recall  f1-score   support

         Neg       0.72      0.80      0.76       250
         Pos       0.78      0.69      0.73       250

    accuracy                           0.75       500
   macro avg       0.75      0.75      0.75       500
weighted avg       0.75      0.75      0.75       500
 

Confusion Matrix:<br>
 [[200  50]<br>
 [ 77 173]] 

CPU times: user 37.1 ms, sys: 125 μs, total: 37.2 ms
Wall time: 37.1 ms

**MultinomialNB Classifier**

In [76]:
%%time

# Define parameter grid
param_grid_Multi = {
    'alpha': [0.1, 0.5, 1.0, 3, 5,10,20,100],
    'fit_prior': [True, False],
}

grid_search_Multi = GridSearchCV(
    MultinomialNB(),
    param_grid_Multi,
    cv=5,
    scoring='accuracy',
    verbose=1
)

grid_search_Multi.fit(X_train, y_train)
grid_search_Multi.best_params_

Fitting 5 folds for each of 16 candidates, totalling 80 fits
CPU times: user 398 ms, sys: 112 μs, total: 399 ms
Wall time: 396 ms


{'alpha': 0.5, 'fit_prior': True}

In [77]:
# Evaluate on test set
best_model_Multi = grid_search_Multi.best_estimator_
accuracy = best_model_Multi.score(X_test, y_test)
print(f"Test set accuracy: {accuracy:.3f}")

Test set accuracy: 0.782


In [80]:
# CountVectorizer features, additive smoothing alpha = 0.5

# Initializing NaiveBayes-MultinomialNB Classifier
MNB = MultinomialNB(alpha=0.5, fit_prior=True)  #  by default force_alpha=True

In [81]:
%%time

# Training / Testing
classifier_testing(MNB, X_train, X_test, y_train, y_test)

Accuracy Score:
 0.782 

Classification Report:
               precision    recall  f1-score   support

         Neg       0.78      0.78      0.78       250
         Pos       0.78      0.78      0.78       250

    accuracy                           0.78       500
   macro avg       0.78      0.78      0.78       500
weighted avg       0.78      0.78      0.78       500
 

Confusion Matrix:
 [[196  54]
 [ 55 195]] 

CPU times: user 20.8 ms, sys: 0 ns, total: 20.8 ms
Wall time: 19.5 ms


<br>

**Previously, in exercise 4:**

Accuracy Score:
 0.786 

Classification Report:
               precision    recall  f1-score   support

         Neg       0.78      0.80      0.79       250
         Pos       0.79      0.78      0.78       250

    accuracy                           0.79       500
   macro avg       0.79      0.79      0.79       500
weighted avg       0.79      0.79      0.79       500
 

Confusion Matrix:<br>
 [[199  51]<br>
 [ 56 194]] 

CPU times: user 34.7 ms, sys: 0 ns, total: 34.7 ms
Wall time: 34.9 ms

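### Conclusion:

The best hyperparameters appear to have been found already when exercise 4 was carried out.
GridSearchCV did not improve on them: BernoulliNB reaches the same test accuracy (0.746), and MultinomialNB scores 0.782 against 0.786 previously.
Since the grid search selects parameters by 5-fold cross-validated accuracy on the training set, its choice need not maximize accuracy on this particular test split.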
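
As a quick check, the accuracies compared in this notebook can be recomputed from the confusion matrices printed above. This is plain arithmetic; the helper `accuracy` and the dictionary `matrices` are names introduced only for this illustration.

```python
def accuracy(cm):
    """Accuracy = correctly classified (diagonal) / all test samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Confusion matrices copied from the outputs above (rows: true Neg, true Pos).
matrices = {
    "BernoulliNB, grid search": [[202, 48], [79, 171]],
    "BernoulliNB, exercise 4": [[200, 50], [77, 173]],
    "MultinomialNB, grid search": [[196, 54], [55, 195]],
    "MultinomialNB, exercise 4": [[199, 51], [56, 194]],
}
for name, cm in matrices.items():
    print(f"{name}: {accuracy(cm):.3f}")
# BernoulliNB, grid search: 0.746
# BernoulliNB, exercise 4: 0.746
# MultinomialNB, grid search: 0.782
# MultinomialNB, exercise 4: 0.786
```

The MultinomialNB gap amounts to 2 of the 500 test reviews (393 correct before, 391 now).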