# Unit 4 - Exercise
## Probabilistic Learning. Classification with Naive Bayes

The file “Movie_pang02.csv”, available in the Master's Evaluation Tests folder, contains a sample of 2000 movie reviews from the IMDB website, used in the paper by Pang, B. and Lee, L., "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts", Proceedings of ACL 2004. The reviews are labelled through the variable class as positive (“Pos”) or negative (“Neg”).
Using this dataset, build a model that classifies reviews based on their text, following the procedure described in chapter 4 of the course textbook for the SMS Spam example. In particular, generate word clouds for the positive and negative reviews, and fit the model with the laplace parameter of the naiveBayes() function set to 0 and to 1, comparing the confusion matrices of the two variants.
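
As a reminder of what the laplace parameter controls, the following is a minimal sketch of additive (Laplace) smoothing for a single word likelihood. The function name `word_likelihood` and all counts are hypothetical, and the formula is the generic multinomial form; the exact denominator used by a given Naive Bayes implementation may differ.

```python
def word_likelihood(count_in_class, total_in_class, vocab_size, laplace=0):
    """Estimate P(word | class) from counts with optional additive smoothing."""
    return (count_in_class + laplace) / (total_in_class + laplace * vocab_size)

# Hypothetical counts: a word that never appears in the negative reviews
p_no_smoothing = word_likelihood(0, 1000, vocab_size=500, laplace=0)
p_smoothed = word_likelihood(0, 1000, vocab_size=500, laplace=1)

print(p_no_smoothing)  # zero: this word alone would veto the class
print(p_smoothed)      # small but non-zero
```

With laplace = 0, a single unseen word drives the whole class probability to zero; with laplace = 1 the estimate stays positive, which is why the two variants can produce different confusion matrices.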



Import the dependencies.

In [49]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches
import seaborn as sb

%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk import PorterStemmer as Stemmer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

[nltk_data] Downloading package stopwords to /home/francd/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

We load the input CSV file with pandas, specifying the comma as the separator. head(5) shows the first 5 records.

In [3]:
reviewsdf = pd.read_csv(r"Movie_pang02.csv",sep=',')
reviewsdf.head(5)

Unnamed: 0,class,text
0,Pos,films adapted from comic books have had plent...
1,Pos,every now and then a movie comes along from a...
2,Pos,you ve got mail works alot better than it des...
3,Pos,jaws is a rare film that grabs your atte...
4,Pos,moviemaking is a lot like being the general m...


In [4]:
reviewsdf.tail(5)

Unnamed: 0,class,text
1995,Neg,if anything stigmata should be taken as...
1996,Neg,john boorman s zardoz is a goofy cinemati...
1997,Neg,the kids in the hall are an acquired taste ...
1998,Neg,there was a time when john carpenter was a gr...
1999,Neg,two party guys bob their heads to haddaway s ...


**Statistical summary**

The pandas equivalent of R's summary function is describe:

In [5]:
reviewsdf.describe()

Unnamed: 0,class,text
count,2000,2000
unique,2,2000
top,Pos,tommy lee jones chases an innocent victim aro...
freq,1000,1


In [6]:
#Summary of target variable
reviewsTable = reviewsdf.groupby("class")
totals = reviewsTable.size()
total = sum(totals)
print(totals)

class
Neg    1000
Pos    1000
dtype: int64


This confirms that the file contains the same number of positive and negative reviews (1000 each).
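
The same check can be expressed as proportions rather than raw counts. This is a small sketch on a toy data frame (the `toy` frame and its contents are made up for illustration); on the real data the call would be applied to the class column of reviewsdf.

```python
import pandas as pd

# Toy stand-in for reviewsdf, with the same column names
toy = pd.DataFrame({
    "class": ["Pos", "Neg", "Pos", "Neg"],
    "text": ["great film", "dull plot", "fine cast", "weak script"],
})

# normalize=True returns relative frequencies instead of counts
proportions = toy["class"].value_counts(normalize=True)
print(proportions)
```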

In [7]:
#An easier way
reviewsdf.groupby('class').describe()

class,text count,text unique,text top,text freq
Neg,1000,1000,two party guys bob their heads to haddaway s ...,1
Pos,1000,1000,truman true man burbank is the perfec...,1


**Split the reviews into two separate sets (training and test):**

In [77]:
reviewsdf_train, reviewsdf_test = train_test_split(reviewsdf, test_size=0.25)

In [78]:
reviewsdf_train.groupby('class').describe()

class,text count,text unique,text top,text freq
Neg,754,754,there is a scene in patch adams in which patc...,1
Pos,746,746,it s a curious thing i ve found that when w...,1


In [79]:
reviewsdf_test.groupby('class').describe()

class,text count,text unique,text top,text freq
Neg,246,246,what were they thinking nostalgia for the ...,1
Pos,254,254,the central focus of michael winterbottom s ...,1


The proportion of positive and negative reviews is approximately preserved: 754 Neg / 746 Pos in the training set and 246 Neg / 254 Pos in the test set.
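
The split above is random, so the class balance is only approximate and changes from run to run. If an exact balance and a reproducible split are wanted, one option is to pass stratify and random_state to train_test_split. The sketch below uses a toy data frame; the seed value 42 is an arbitrary choice, not something taken from this exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for reviewsdf: 20 positive and 20 negative reviews
toy = pd.DataFrame({
    "class": ["Pos"] * 20 + ["Neg"] * 20,
    "text": [f"review {i}" for i in range(40)],
})

# stratify keeps the class proportions identical in both subsets;
# random_state makes the split reproducible
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["class"], random_state=42
)

train_counts = train["class"].value_counts()
test_counts = test["class"].value_counts()
print(train_counts)
print(test_counts)
```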

### Cleaning and preparation

In [25]:
# Build the stopword set and the stemmer once instead of once per token
STOPWORDS = set(stopwords.words('english'))
STEMMER = Stemmer()

def process(text):
    """Clean one review and return it as a space-separated string of stems."""
    # To lowercase
    text = text.lower()

    # Remove numbers
    text = ''.join([t for t in text if not t.isdigit()])

    # Remove punctuation
    text = ''.join([t for t in text if t not in string.punctuation])

    # Tokenize (split() also discards extra whitespace) and remove stopwords
    tokens = [t for t in text.split() if t not in STOPWORDS]

    # Stemming
    tokens = [STEMMER.stem(t) for t in tokens]

    # Return the cleaned text as a single string
    return ' '.join(tokens)


We test this function:

In [76]:
#Testing function process with one of the reviews
#
#text = """
#the film pales in comparison to that in the black and white comic    oscar winner martin childs    shakespeare in love   production design turns the original prague surroundings into one creepy place    
#even the acting in from hell is solid   with the dreamy depp turning in a typically strong performance and deftly handling a british accent    ians holm   joe gould s secret   and richardson   102 dalmatians   
#log in great supporting roles   but the big surprise here is graham    i cringed the first time she opened her mouth   imagining her attempt at an irish accent   but it actually wasn t half bad    
#the film   however   is all good    2   00   r for strong violence/gore   sexuality   language and drug content "
#"""
#print(text)
#
#result = process(text)
#print(result)
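
Since the test cell above is commented out, here is a dependency-free sketch of the same cleaning steps on a short string. It is a stand-in, not the process function itself: `toy_process` and `TOY_STOPWORDS` are hypothetical names, the stopword set is a tiny hand-picked list instead of the NLTK one, and stemming is omitted.

```python
import string

# Tiny hand-picked stopword set, a stand-in for stopwords.words('english')
TOY_STOPWORDS = {"the", "in", "is", "a", "and", "for"}

# Translation table that deletes every digit and punctuation character
DROP_CHARS = str.maketrans("", "", string.digits + string.punctuation)

def toy_process(text):
    """Lowercase, drop digits and punctuation, remove stopwords (no stemming)."""
    text = text.lower().translate(DROP_CHARS)
    tokens = [t for t in text.split() if t not in TOY_STOPWORDS]
    return " ".join(tokens)

sample = "The film   however   is all good    2   00   R for strong violence/gore"
cleaned = toy_process(sample)
print(cleaned)  # film however all good r strong violencegore
```

Note that deleting punctuation (rather than replacing it with a space) fuses "violence/gore" into the single token "violencegore"; the process function above behaves the same way.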

Apply the function to all texts in both data frames:

In [58]:
clean_reviewsdf_train = reviewsdf_train['text'].apply(process)
clean_reviewsdf_test = reviewsdf_test['text'].apply(process)

In [73]:
#clean_reviewsdf_train.head(5)

In [72]:
#clean_reviewsdf_test.head(5)

In [74]:
#have a look
#print(reviewsdf['text'][337])
#print("\n")
#print(clean_reviewsdf_train[337])

In [75]:
#have a look
#print(reviewsdf['text'][799])
#print("\n")
#print(clean_reviewsdf_test[799])

Create the DTM from the training data frame (clean_reviewsdf_train)

In [96]:
# Sample documents
#documents = [
#    cleanreviewsdf[1],
#    cleanreviewsdf[10],
#    cleanreviewsdf[100],
#    cleanreviewsdf[1000],
#]

# CountVectorizer/TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df = 100)
dtm_train_count = tfidf_vectorizer.fit_transform(clean_reviewsdf_train)

# Get the feature names (tokens)
feature_names_count = tfidf_vectorizer.get_feature_names_out()

# Print the number of features
print("TfidfVectorizer feature names length:", len(feature_names_count))

# Print the feature names
#print("TfidfVectorizer feature names:", feature_names_count)

# Print the document-term matrices
#print("TfidfVectorizer document-term matrix:")
#print(dtm_train_count.toarray())
print("dtm_train_count.shape: ",dtm_train_count.shape)

TfidfVectorizer feature names length: 801
dtm_train_count.shape:  (1500, 801)


If the DTM is built from all the reviews, the result is very similar to the one obtained in R: R yielded 1027 words, while TfidfVectorizer(min_df = 100) yields 50 more. However, when the DTM is built from the training set only, about 200 fewer words are obtained (801), because min_df = 100 is an absolute document count and the training set holds only 1500 of the 2000 reviews.

In [43]:
# Most of these steps could be attempted with TfidfVectorizer itself: lowercasing (done by default), stopwords, ...
# It does not remove numbers (see '10' and '1998' below) or apply stemming; its default tokenizer already treats punctuation as a separator

# CountVectorizer/TfidfVectorizer
#
# tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df = 100)
# X_count = tfidf_vectorizer.fit_transform(reviewsdf['text'])

# feature_names_count = tfidf_vectorizer.get_feature_names_out()
#
# print("TfidfVectorizer feature names length:", len(feature_names_count))
#
# print("TfidfVectorizer feature names:", feature_names_count)
#
# print("TfidfVectorizer document-term matrix:")
# print(X_count.toarray())
# print(X_count.shape)
#
# TfidfVectorizer feature names length: 887
# TfidfVectorizer feature names: ['10' '1998' 'ability' 'able' 'absolutely' 'act' 'acting' 'action' 'actor'
#  'actors' 'actress' 'actual' 'actually' 'add' 'adds' 'admit' 'age' 'agent'
#  ...

Create the DTM from the test data frame (clean_reviewsdf_test)

In [97]:
# CountVectorizer/TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df = 100)
dtm_test_count = tfidf_vectorizer.fit_transform(clean_reviewsdf_test)

# Get the feature names (tokens)
feature_names_count = tfidf_vectorizer.get_feature_names_out()

# Print the number of features
print("TfidfVectorizer feature names length:", len(feature_names_count))

# Print the feature names
#print("TfidfVectorizer feature names:", feature_names_count)

# Print the document-term matrices
#print("TfidfVectorizer document-term matrix:")
#print(dtm_test_count.toarray())
print("dtm_test_count.shape: ",dtm_test_count.shape)

TfidfVectorizer feature names length: 197
dtm_test_count.shape:  (500, 197)
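
Fitting a second vectorizer on the test set gives the test matrix its own vocabulary (197 columns here against 801 for training), so a model trained on the training matrix could not be applied to it directly. A common alternative is to fit the vectorizer once on the training texts and only call transform on the test texts. The sketch below shows this on made-up toy documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good film good cast", "bad film bad plot", "good plot"]
test_docs = ["good story", "bad cast"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training texts only
dtm_train = vectorizer.fit_transform(train_docs)

# Reuse that vocabulary for the test texts; unseen words are ignored
dtm_test = vectorizer.transform(test_docs)

print(dtm_train.shape, dtm_test.shape)
```

Both matrices now have the same columns in the same order, and the word "story", which never occurs in the training texts, is simply dropped.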
