# Topic 4 - Exercise
## Probabilistic Learning. Classification with Naive Bayes

The file “Movie_pang02.csv”, available in the Master's Evaluation Tests folder, contains a sample of 2000 movie reviews from the IMDB website, used in the paper by Pang, B. and Lee, L., A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Proceedings of ACL 2004. The reviews are labeled through the variable class as positive (“Pos”) or negative (“Neg”).
Using this dataset, build a model that classifies reviews based on their text, following the procedure described in chapter 4 of the course textbook for the SMS Spam example. In particular, generate word clouds for the positive and negative reviews, and fit the model with the laplace parameter of the naiveBayes() function set to 0 and to 1, comparing the confusion matrices of the two variants.



We import the dependencies:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches
import seaborn as sb

%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

import string
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk import PorterStemmer as Stemmer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

We load the input CSV file with pandas, specifying the comma as the separator. head(5) shows the first 5 records.

In [3]:
reviewsdf = pd.read_csv(r"Movie_pang02.csv",sep=',')
reviewsdf.head(5)

Unnamed: 0,class,text
0,Pos,films adapted from comic books have had plent...
1,Pos,every now and then a movie comes along from a...
2,Pos,you ve got mail works alot better than it des...
3,Pos,jaws is a rare film that grabs your atte...
4,Pos,moviemaking is a lot like being the general m...


In [4]:
reviewsdf.tail(5)

Unnamed: 0,class,text
1995,Neg,if anything stigmata should be taken as...
1996,Neg,john boorman s zardoz is a goofy cinemati...
1997,Neg,the kids in the hall are an acquired taste ...
1998,Neg,there was a time when john carpenter was a gr...
1999,Neg,two party guys bob their heads to haddaway s ...


**Statistical summary**

The pandas equivalent of R's summary function is describe:

In [5]:
reviewsdf.describe()

Unnamed: 0,class,text
count,2000,2000
unique,2,2000
top,Pos,tommy lee jones chases an innocent victim aro...
freq,1000,1


In [6]:
#Summary of target variable
reviewsTable = reviewsdf.groupby("class")
totals = reviewsTable.size()
total = sum(totals)
print(totals)

class
Neg    1000
Pos    1000
dtype: int64


This confirms that the file contains the same number of positive and negative reviews.

In [7]:
#An easier way
reviewsdf.groupby('class').describe()

Unnamed: 0_level_0,text,text,text,text
Unnamed: 0_level_1,count,unique,top,freq
class,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Neg,1000,1000,two party guys bob their heads to haddaway s ...,1
Pos,1000,1000,truman true man burbank is the perfec...,1


### Cleaning and preparation

In [8]:
def process(text):
    """Clean one review and return it as a single space-separated string."""
    # To lowercase
    text = text.lower()

    # Remove numbers
    text = ''.join([t for t in text if not t.isdigit()])

    # Remove punctuation
    text = ''.join([t for t in text if t not in string.punctuation])

    # Remove stopwords; build the set once per call instead of once per token.
    # split() also eliminates unneeded whitespace.
    stop_words = set(stopwords.words('english'))
    text = [t for t in text.split() if t not in stop_words]

    # Stemming
    st = Stemmer()
    text = [st.stem(t) for t in text]

    # Return the stemmed tokens joined back into one string
    return ' '.join(text)


We test this function:

In [9]:
#Testing function process with one of the reviews
#
#text = """
#the film pales in comparison to that in the black and white comic    oscar winner martin childs    shakespeare in love   production design turns the original prague surroundings into one creepy place    
#even the acting in from hell is solid   with the dreamy depp turning in a typically strong performance and deftly handling a british accent    ians holm   joe gould s secret   and richardson   102 dalmatians   
#log in great supporting roles   but the big surprise here is graham    i cringed the first time she opened her mouth   imagining her attempt at an irish accent   but it actually wasn t half bad    
#the film   however   is all good    2   00   r for strong violence/gore   sexuality   language and drug content "
#"""
#print(text)
#
#result = process(text)
#print(result)
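
The cleaning steps can also be tried without downloading the NLTK stop-word corpus. The following is a minimal sketch, not the notebook's function: `toy_process` and `TOY_STOPWORDS` are hypothetical names, the stop-word set is deliberately tiny, and stemming is left out.

```python
import string

# Hypothetical, deliberately tiny stop-word set; the notebook uses
# nltk.corpus.stopwords.words('english') instead.
TOY_STOPWORDS = {"the", "is", "a", "in", "and", "of"}


def toy_process(text, stop_words=TOY_STOPWORDS):
    """Lowercase, drop digits and punctuation, then drop stop words."""
    text = text.lower()
    # Delete digits and punctuation in a single pass.
    text = text.translate(str.maketrans("", "", string.digits + string.punctuation))
    # split() with no argument also collapses repeated whitespace.
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(tokens)


print(toy_process("The film   is SOLID, and 102 Dalmatians is a 2/10!"))
```

Because punctuation is deleted rather than replaced by a space, a token such as `2/10!` disappears entirely once its digits are removed as well.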

In [10]:
clean_reviewsdf = reviewsdf['text'].apply(process)

In [11]:
clean_reviewsdf.head(5)

0    film adapt comic book plenti success whether s...
1    everi movi come along suspect studio everi ind...
2    got mail work alot better deserv order make fi...
3    jaw rare film grab attent show singl imag scre...
4    moviemak lot like gener manag nfl team post sa...
Name: text, dtype: object
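
The task statement asks for word clouds of the positive and negative reviews. A word cloud is drawn from per-class term frequencies, which can be sketched with the standard library alone. The texts, labels, and the helper `word_frequencies` below are hypothetical stand-ins for `clean_reviewsdf` and `reviewsdf['class']`.

```python
from collections import Counter

# Hypothetical cleaned reviews and labels standing in for
# clean_reviewsdf and reviewsdf['class'].
toy_texts = ["great film great cast", "bore film weak plot", "great plot"]
toy_labels = ["Pos", "Neg", "Pos"]


def word_frequencies(texts, labels, target):
    """Count tokens over all texts whose label equals target."""
    counts = Counter()
    for text, label in zip(texts, labels):
        if label == target:
            counts.update(text.split())
    return counts


pos_freq = word_frequencies(toy_texts, toy_labels, "Pos")
neg_freq = word_frequencies(toy_texts, toy_labels, "Neg")
print(pos_freq.most_common(2))
print(neg_freq.most_common(2))
```

If the third-party `wordcloud` package is installed, a frequency mapping like `pos_freq` could be passed to `WordCloud().generate_from_frequencies(...)` and drawn with matplotlib; that step is not shown here.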

Separate the independent variables from the dependent/target variable

In [12]:
# Separate features and target variable
X = clean_reviewsdf
y = reviewsdf['class']
#print(X.shape) #(2000,)
#print(y.shape) #(2000,)

In [13]:
#X.tail(5)
#y.head(5)
X

0       film adapt comic book plenti success whether s...
1       everi movi come along suspect studio everi ind...
2       got mail work alot better deserv order make fi...
3       jaw rare film grab attent show singl imag scre...
4       moviemak lot like gener manag nfl team post sa...
                              ...                        
1995    anyth stigmata taken warn releas similarli the...
1996    john boorman zardoz goofi cinemat debacl funda...
1997    kid hall acquir tast took least season watch s...
1998    time john carpent great horror director cours ...
1999    two parti guy bob head haddaway danc hit love ...
Name: text, Length: 2000, dtype: object

In [14]:
y

0       Pos
1       Pos
2       Pos
3       Pos
4       Pos
       ... 
1995    Neg
1996    Neg
1997    Neg
1998    Neg
1999    Neg
Name: class, Length: 2000, dtype: object

### Text vectorization

In [15]:
#CountVectorizer
vect1000 = CountVectorizer(ngram_range=(1, 3), max_features = 1000)
vect1000

In [16]:
X = vect1000.fit_transform(X).astype(np.int8)

In [17]:
X

<Compressed Sparse Row sparse matrix of dtype 'int8'
	with 291556 stored elements and shape (2000, 1000)>

In [18]:
feature_names_count = vect1000.get_feature_names_out()
# Print the feature names
#feature_names_count
print("CountVectorizer feature names:", feature_names_count.shape)

CountVectorizer feature names: (1000,)


In [41]:
#TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df = 100)

In [42]:
X2 = tfidf_vectorizer.fit_transform(clean_reviewsdf)
X2

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 301410 stored elements and shape (2000, 1077)>

In [43]:
feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()
# Print the feature names
feature_names_tfidf
#print("TfidfVectorizer feature names:", feature_names_tfidf)

array(['abil', 'abl', 'absolut', ..., 'yet', 'york', 'young'],
      shape=(1077,), dtype=object)

### Split the reviews into two separate sets (training and test)
The reviews have already been vectorized

In [22]:
#CountVectorizer
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10, shuffle=True, stratify=y)
print(f"X_train.shape: {X_train.shape}, X_test.shape: {X_test.shape}, y_train.shape: {y_train.shape}, y_test.shape: {y_test.shape}") 

X_train.shape: (1500, 1000), X_test.shape: (500, 1000), y_train.shape: (1500,), y_test.shape: (500,)


In [23]:
#TfidfVectorizer
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.25, random_state=10, shuffle=True, stratify=y)
print(f"X2_train.shape: {X2_train.shape}, X2_test.shape: {X2_test.shape}, y2_train.shape: {y2_train.shape}, y2_test.shape: {y2_test.shape}") 

X2_train.shape: (1500, 1077), X2_test.shape: (500, 1077), y2_train.shape: (1500,), y2_test.shape: (500,)
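
Since the vectorizers were fitted on all 2000 reviews before the split, the vocabulary (and, for TF-IDF, the document frequencies) also reflects the test reviews. One possible alternative, sketched here under the assumption that splitting the raw texts first is acceptable, is to put the vectorizer and the classifier in a scikit-learn pipeline. The toy data and names are hypothetical, and the effect on the reported results is not measured here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for clean_reviewsdf and reviewsdf['class'].
toy_texts = ["great film", "love great cast", "great plot", "fine film",
             "bore film", "weak plot", "bore cast", "bad film"]
toy_labels = ["Pos", "Pos", "Pos", "Pos", "Neg", "Neg", "Neg", "Neg"]

# Split the raw texts first ...
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=10, stratify=toy_labels)

# ... so that the vocabulary is learned from the training texts only.
toy_model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
toy_model.fit(txt_train, lab_train)
toy_pred = toy_model.predict(txt_test)
print(len(txt_train), len(txt_test), toy_pred)
```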


## Training and Testing


In [24]:
def classifier_testing(clf, X_train, X_test, y_train, y_test):
    # Training
    clf.fit(X_train, y_train)

    #Predictions
    y_pred = clf.predict(X_test)

    #Accuracy
    clf_accuracy_score = accuracy_score(y_test, y_pred)
    print("Accuracy Score:\n", clf_accuracy_score, "\n")

    #Classification Report
    class_rep = classification_report(y_test, y_pred)
    print("Classification Report:\n", class_rep, "\n")

    #Confusion Matrix
    conf_mtx = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:\n", conf_mtx, "\n")

**BernoulliNB Classifier**

In [25]:
#CountVectorizer

# Initializing NaiveBayes-BernoulliNB Classifier
BNB = BernoulliNB(fit_prior=False, alpha=1.3)

# Training / Testing
classifier_testing(BNB, X_train, X_test, y_train, y_test)

Accuracy Score:
 0.748 

Classification Report:
               precision    recall  f1-score   support

         Neg       0.72      0.80      0.76       250
         Pos       0.78      0.69      0.73       250

    accuracy                           0.75       500
   macro avg       0.75      0.75      0.75       500
weighted avg       0.75      0.75      0.75       500
 

Confusion Matrix:
 [[201  49]
 [ 77 173]] 
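
The figures in the classification report can be recomputed by hand from the confusion matrix. scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, in sorted label order (Neg, Pos). A short worked check using the matrix printed above:

```python
# Confusion matrix printed above; rows are true classes, columns are
# predictions, in sorted label order (Neg, Pos).
cm = [[201, 49],
      [77, 173]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

recall_neg = cm[0][0] / sum(cm[0])
precision_neg = cm[0][0] / (cm[0][0] + cm[1][0])
recall_pos = cm[1][1] / sum(cm[1])
precision_pos = cm[1][1] / (cm[0][1] + cm[1][1])

print(f"accuracy={accuracy:.3f}")
print(f"Neg: precision={precision_neg:.2f}, recall={recall_neg:.2f}")
print(f"Pos: precision={precision_pos:.2f}, recall={recall_pos:.2f}")
```

The 77 positive reviews predicted as negative are what pull the Pos recall down to 0.69.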



In [45]:
#TfidfVectorizer

# Initializing NaiveBayes-BernoulliNB Classifier
BNB = BernoulliNB(fit_prior=False, alpha=1.3)

# Training / Testing
classifier_testing(BNB, X2_train, X2_test, y2_train, y2_test)

Accuracy Score:
 0.732 

Classification Report:
               precision    recall  f1-score   support

         Neg       0.71      0.78      0.74       250
         Pos       0.76      0.68      0.72       250

    accuracy                           0.73       500
   macro avg       0.73      0.73      0.73       500
weighted avg       0.73      0.73      0.73       500
 

Confusion Matrix:
 [[195  55]
 [ 79 171]] 
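
One point worth keeping in mind for this run: `BernoulliNB` has a `binarize` parameter that defaults to `0.0`, so every value greater than zero is mapped to 1 before fitting. The TF-IDF weights therefore act only as presence/absence indicators here. The sketch below illustrates this on hypothetical weights; the array and variable names are made up for the example.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical non-negative weights standing in for TF-IDF values.
toy_X = np.array([[0.9, 0.0, 0.1],
                  [0.7, 0.2, 0.0],
                  [0.0, 0.8, 0.3],
                  [0.0, 0.6, 0.0]])
toy_y = np.array(["Pos", "Pos", "Neg", "Neg"])

# Default binarize=0.0 maps every value > 0 to 1 before fitting.
bnb_weights = BernoulliNB(alpha=1.3).fit(toy_X, toy_y)
bnb_binary = BernoulliNB(alpha=1.3).fit((toy_X > 0).astype(int), toy_y)

# Compare the per-class feature log-probabilities of both models.
same = np.allclose(bnb_weights.feature_log_prob_, bnb_binary.feature_log_prob_)
print(same)
```

The difference from the CountVectorizer run thus comes from the different vocabulary (unigrams with `min_df = 100` versus the 1000 most frequent n-grams), not from the weighting.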



**MultinomialNB Classifier**

In [27]:
#CountVectorizer

# Initializing NaiveBayes-MultinomialNB Classifier
MNB = MultinomialNB(fit_prior=False, alpha=1.3)

# Training / Testing
classifier_testing(MNB, X_train, X_test, y_train, y_test)

Accuracy Score:
 0.784 

Classification Report:
               precision    recall  f1-score   support

         Neg       0.78      0.80      0.79       250
         Pos       0.79      0.77      0.78       250

    accuracy                           0.78       500
   macro avg       0.78      0.78      0.78       500
weighted avg       0.78      0.78      0.78       500
 

Confusion Matrix:
 [[199  51]
 [ 57 193]] 



In [46]:
#TfidfVectorizer

# Initializing NaiveBayes-MultinomialNB Classifier
MNB = MultinomialNB(fit_prior=False, alpha=1.3)

# Training / Testing
classifier_testing(MNB, X2_train, X2_test, y2_train, y2_test)

Accuracy Score:
 0.794 

Classification Report:
               precision    recall  f1-score   support

         Neg       0.80      0.79      0.79       250
         Pos       0.79      0.80      0.80       250

    accuracy                           0.79       500
   macro avg       0.79      0.79      0.79       500
weighted avg       0.79      0.79      0.79       500
 

Confusion Matrix:
 [[197  53]
 [ 50 200]] 



### Vectorization without a feature limit

In [29]:
X = clean_reviewsdf
vect = CountVectorizer(ngram_range=(1, 3)) # no max_features = 1000
vect

In [30]:
X = vect.fit_transform(X).astype(np.int8)

In [31]:
X

<Compressed Sparse Row sparse matrix of dtype 'int8'
	with 1871216 stored elements and shape (2000, 1180397)>

In [32]:
feature_names_count = vect.get_feature_names_out()
# Print the feature names
#feature_names_count
print("CountVectorizer feature names:", feature_names_count.shape)

CountVectorizer feature names: (1180397,)


Split into training and test sets

In [33]:
#CountVectorizer no limit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10, shuffle=True, stratify=y)
print(f"X_train.shape: {X_train.shape}, X_test.shape: {X_test.shape}, y_train.shape: {y_train.shape}, y_test.shape: {y_test.shape}") 

X_train.shape: (1500, 1180397), X_test.shape: (500, 1180397), y_train.shape: (1500,), y_test.shape: (500,)


In [34]:
#CountVectorizer BernoulliNB

# Initializing NaiveBayes-BernoulliNB Classifier
BNB = BernoulliNB(fit_prior=False, alpha=1.3)

# Training / Testing
classifier_testing(BNB, X_train, X_test, y_train, y_test)

Accuracy Score:
 0.516 

Classification Report:
               precision    recall  f1-score   support

         Neg       0.51      1.00      0.67       250
         Pos       0.90      0.04      0.07       250

    accuracy                           0.52       500
   macro avg       0.70      0.52      0.37       500
weighted avg       0.70      0.52      0.37       500
 

Confusion Matrix:
 [[249   1]
 [241   9]] 



In [35]:
#CountVectorizer MultinomialNB

# Initializing NaiveBayes-MultinomialNB Classifier
MNB = MultinomialNB(fit_prior=False, alpha=1.35)

# Training / Testing
classifier_testing(MNB, X_train, X_test, y_train, y_test)

Accuracy Score:
 0.798 

Classification Report:
               precision    recall  f1-score   support

         Neg       0.78      0.83      0.80       250
         Pos       0.82      0.77      0.79       250

    accuracy                           0.80       500
   macro avg       0.80      0.80      0.80       500
weighted avg       0.80      0.80      0.80       500
 

Confusion Matrix:
 [[207  43]
 [ 58 192]] 
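
The task statement refers to laplace values of 0 and 1, while the runs above use `alpha=1.3` and `alpha=1.35`. In scikit-learn, `alpha` is the additive (Laplace/Lidstone) smoothing parameter. The following sketch, with hypothetical counts and a hypothetical helper `smoothed_prob`, shows the multinomial estimate $(N_{yi} + \alpha) / (N_y + \alpha n)$ for a word never seen in a class, with and without smoothing.

```python
def smoothed_prob(word_count, class_total, vocab_size, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * |V|)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)


# Hypothetical counts: a word never seen in one class,
# out of 100 tokens in that class and a 50-word vocabulary.
for alpha in (0, 1):
    p = smoothed_prob(word_count=0, class_total=100, vocab_size=50, alpha=alpha)
    print(f"alpha={alpha}: P(word | class) = {p:.4f}")
```

With `alpha=0` the unseen word gets probability zero, which zeroes out the whole class likelihood for any review containing it. With `alpha=1` it keeps a small positive probability.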

