# Unit 4 - Exercise
## Probabilistic Learning. Classification Using Naive Bayes

The file “Movie_pang02.csv”, available in the Master's Evaluation Tests folder, contains a sample of 2000 movie reviews from the IMDB website, used in the paper by Pang, B. and Lee, L., A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Proceedings of ACL 2004. The reviews are labeled through the variable class as positive (“Pos”) or negative (“Neg”).
Using this dataset, build a model that classifies reviews based on their text, following the procedure described in chapter 4 of the course textbook for the SMS Spam example. In particular, generate word clouds for the positive and negative reviews, and fit the model with the laplace parameter of the naiveBayes() function set to 0 and to 1, comparing the confusion matrices of the two model variants.
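
To make the role of the laplace parameter concrete, the following is a minimal sketch of additive (Laplace) smoothing on word-presence features. The function name `presence_probability` and the toy token lists are hypothetical and are not part of the exercise data. The denominator assumes two feature levels (word present or absent), which is an assumption about how the smoothing is applied to categorical features.

```python
def presence_probability(docs, word, laplace=0):
    """Estimate P(word appears in a document | class) with additive smoothing."""
    containing = sum(1 for tokens in docs if word in tokens)
    # Two feature levels (present / absent), hence 2 * laplace in the denominator.
    return (containing + laplace) / (len(docs) + 2 * laplace)


# Hypothetical toy corpora: one list of token lists per class.
pos_docs = [["great", "fun"], ["great", "act"], ["solid", "fun"]]
neg_docs = [["bad", "plot"], ["bore", "plot"], ["bad", "act"]]

for laplace in (0, 1):
    p_pos = presence_probability(pos_docs, "bad", laplace)
    p_neg = presence_probability(neg_docs, "bad", laplace)
    print(f"laplace={laplace}: P(bad|Pos)={p_pos:.2f}, P(bad|Neg)={p_neg:.2f}")
```

With laplace set to 0, a word never seen in a class gets probability zero, which zeroes out the whole product of likelihoods for that class. With laplace set to 1, every word keeps a small non-zero probability.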



We import the dependencies.

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches
import seaborn as sb

%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk import PorterStemmer as Stemmer



[nltk_data] Downloading package stopwords to /home/francd/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

We load the input CSV file with pandas, specifying the comma as the separator. head(5) shows the first 5 records.

In [3]:
reviewsdf = pd.read_csv(r"Movie_pang02.csv",sep=',')
reviewsdf.head(5)

Unnamed: 0,class,text
0,Pos,films adapted from comic books have had plent...
1,Pos,every now and then a movie comes along from a...
2,Pos,you ve got mail works alot better than it des...
3,Pos,jaws is a rare film that grabs your atte...
4,Pos,moviemaking is a lot like being the general m...


In [4]:
reviewsdf.tail(5)

Unnamed: 0,class,text
1995,Neg,if anything stigmata should be taken as...
1996,Neg,john boorman s zardoz is a goofy cinemati...
1997,Neg,the kids in the hall are an acquired taste ...
1998,Neg,there was a time when john carpenter was a gr...
1999,Neg,two party guys bob their heads to haddaway s ...


**Statistical summary**

The pandas equivalent of R's summary function is describe:

In [5]:
reviewsdf.describe()

Unnamed: 0,class,text
count,2000,2000
unique,2,2000
top,Pos,tommy lee jones chases an innocent victim aro...
freq,1000,1


In [6]:
#Summary of target variable
reviewsTable = reviewsdf.groupby("class")
totals = reviewsTable.size()
total = sum(totals)
print(totals)

class
Neg    1000
Pos    1000
dtype: int64


This confirms that the file contains equal numbers of positive and negative reviews.
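
Balanced classes mean that the class priors a naive Bayes model would estimate from this file are equal. A minimal sketch, using the counts printed above (the variable names are illustrative):

```python
# Class counts taken from the groupby output above.
class_counts = {"Neg": 1000, "Pos": 1000}

total = sum(class_counts.values())

# Prior probability of each class: its share of all reviews.
priors = {label: count / total for label, count in class_counts.items()}
print(priors)
```

With equal priors, any difference in the predicted class comes only from the word likelihoods.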

In [7]:
#An easier way
reviewsdf.groupby('class').describe()

Unnamed: 0_level_0,text,text,text,text
Unnamed: 0_level_1,count,unique,top,freq
class,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Neg,1000,1000,two party guys bob their heads to haddaway s ...,1
Pos,1000,1000,truman true man burbank is the perfec...,1


### Cleaning and preparation

In [40]:
def process(text):
    """Clean a review and return its list of stemmed tokens."""
    # Build the stopword set once per call instead of once per token
    stop_words = set(stopwords.words('english'))

    # To lowercase
    text = text.lower()

    # Remove digits
    text = ''.join([t for t in text if not t.isdigit()])

    # Remove punctuation
    text = ''.join([t for t in text if t not in string.punctuation])

    # Remove stopwords; split() also collapses redundant whitespace
    text = [t for t in text.split() if t not in stop_words]

    # Stemming
    st = Stemmer()
    text = [st.stem(t) for t in text]

    # Return the token list
    return text
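

One design choice in the punctuation step is worth noting: deleting punctuation characters joins the words on either side of them, so a string such as "violence/gore" becomes the single token "violencegore". The sketch below compares that behaviour with a possible alternative that replaces punctuation with spaces. The helper names are hypothetical, and whether the alternative is preferable depends on the corpus.

```python
import string

# Translation table mapping every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punctuation_delete(text):
    """Delete punctuation characters (the approach used in process)."""
    return "".join(ch for ch in text if ch not in string.punctuation)


def strip_punctuation_space(text):
    """Replace punctuation characters with spaces so adjacent words stay separate."""
    return text.translate(PUNCT_TO_SPACE)


sample = "strong violence/gore"
print(strip_punctuation_delete(sample).split())  # ['strong', 'violencegore']
print(strip_punctuation_space(sample).split())   # ['strong', 'violence', 'gore']
```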


We test this function:

In [41]:
#Testing process with one of the reviews

text = """
the film pales in comparison to that in the black and white comic    oscar winner martin childs    shakespeare in love   production design turns the original prague surroundings into one creepy place    
even the acting in from hell is solid   with the dreamy depp turning in a typically strong performance and deftly handling a british accent    ians holm   joe gould s secret   and richardson   102 dalmatians   
log in great supporting roles   but the big surprise here is graham    i cringed the first time she opened her mouth   imagining her attempt at an irish accent   but it actually wasn t half bad    
the film   however   is all good    2   00   r for strong violence/gore   sexuality   language and drug content "
"""
print(text)

print(process(text))



the film pales in comparison to that in the black and white comic    oscar winner martin childs    shakespeare in love   production design turns the original prague surroundings into one creepy place    
even the acting in from hell is solid   with the dreamy depp turning in a typically strong performance and deftly handling a british accent    ians holm   joe gould s secret   and richardson   102 dalmatians   
log in great supporting roles   but the big surprise here is graham    i cringed the first time she opened her mouth   imagining her attempt at an irish accent   but it actually wasn t half bad    
the film   however   is all good    2   00   r for strong violence/gore   sexuality   language and drug content "

['film', 'pale', 'comparison', 'black', 'white', 'comic', 'oscar', 'winner', 'martin', 'child', 'shakespear', 'love', 'product', 'design', 'turn', 'origin', 'pragu', 'surround', 'one', 'creepi', 'place', 'even', 'act', 'hell', 'solid', 'dreami', 'depp', 'turn', 'typic', 

Apply the function to all the texts in the reviewsdf dataframe:

In [55]:
cleanreviewsdf = reviewsdf['text'].apply(process)
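
The token lists produced this way can be turned into per-class term frequencies, which is the kind of input word clouds are built from. The following is a minimal sketch with hypothetical stand-in data. In the notebook, `token_lists` would correspond to `cleanreviewsdf` and `labels` to `reviewsdf['class']`.

```python
from collections import Counter

# Hypothetical stand-ins for a few processed reviews and their labels.
token_lists = [["film", "great", "fun"], ["film", "bore"], ["great", "act", "film"]]
labels = ["Pos", "Neg", "Pos"]

# One frequency counter per class.
frequencies = {"Pos": Counter(), "Neg": Counter()}
for tokens, label in zip(token_lists, labels):
    frequencies[label].update(tokens)

# Most frequent stems among the positive reviews.
print(frequencies["Pos"].most_common(2))
```

A word-to-count mapping like `frequencies["Pos"]` can then be passed to whichever plotting tool is used for the word clouds.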

In [65]:
# Have a look at one processed review
#print(reviewsdf['text'][10])
print(cleanreviewsdf[10])

['watch', 'rat', 'race', 'last', 'week', 'notic', 'cheek', 'sore', 'realiz', 'laugh', 'aloud', 'held', 'grin', 'virtual', 'film', 'minut', 'saturday', 'night', 'attend', 'anoth', 'sneak', 'preview', 'movi', 'damn', 'enjoy', 'much', 'second', 'time', 'first', 'rat', 'race', 'great', 'goofi', 'delight', 'dandi', 'mix', 'energet', 'perform', 'inspir', 'sight', 'gag', 'flat', 'silli', 'hand', 'fun', 'film', 'summer', 'movi', 'begin', 'zippi', 'retro', 'style', 'open', 'credit', 'look', 'like', 'torn', 'straight', 'slapstick', 'comedi', 'featur', 'anim', 'photo', 'cast', 'attach', 'herki', 'jerki', 'bodi', 'bound', 'around', 'screen', 'come', 'setup', 'donald', 'sinclair', 'john', 'clees', 'extrem', 'rich', 'owner', 'venetian', 'hotel', 'casino', 'la', 'vega', 'enjoy', 'concoct', 'unusu', 'bet', 'high', 'roll', 'client', 'end', 'place', 'half', 'dozen', 'special', 'token', 'slot', 'machin', 'gather', 'togeth', 'lucki', 'token', 'holder', 'explain', 'today', 'chanc', 'play', 'game', 'odd', '