## NLP and ML Exercise 02

Using the **`dialogos_TBBT.pkl`** dataset, train a classification model that predicts which character spoke a given phrase or sentence.

- The characters are: Sheldon, Leonard, Penny, Howard, Raj.

- Apply preprocessing, Bag-of-Words and TF-IDF, and use a PCA model to reduce the dimensionality of the dataset.

- Define a function that takes a list of models as a parameter and returns a DataFrame with each model's metrics and execution time.

In [41]:
import pickle
import pandas as pd
import numpy as np
import nltk
import re
from nltk.stem import PorterStemmer

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

from sklearn.decomposition import PCA

# Scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler

# Train, Test
from sklearn.model_selection import train_test_split

# Metrics
from sklearn.metrics import jaccard_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Validation
from sklearn.model_selection import StratifiedKFold


In [2]:
with open("dialogos_TBBT.pkl", "rb") as file:
    data = pickle.load(file)
    
len(data)

55182

In [3]:
data

['Scene: A corridor at a sperm bank.',
 'Sheldon: So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits.',
 'Leonard: Agreed, what’s your point?',
 'Sheldon: There’s no point, I just think it’s a good idea for a tee-shirt. ',
 'Leonard: Excuse me?',
 'Receptionist: Hang on. ',
 'Leonard: One across is Aegean, eight down is Nabakov, twenty-six across is MCM, fourteen down is… move your finger… phylum, which makes fourteen across Port-au-Prince. See, Papa Doc’s capital idea, that’s Port-au-Prince. Haiti. ',
 'Receptionist: Can I help you?',
 'Leonard: Yes. Um, is this the High IQ sperm bank?',
 'Receptionist: If you have to ask, maybe you shouldn’t be here.',
 'Sheldon: I think this is the place.',
 'Receptionist: Fill these out.',
 'Leonard: Thank-you. We’ll be righ

In [4]:
# Build a single DataFrame with one column for the speaker name and one for the dialogue line
# The goal is to predict who says a given line
# Use SMOTE to handle class imbalance

In [5]:
personajes = ["Sheldon", "Leonard", "Penny", "Howard", "Raj"]

In [6]:
datos = list()

for linea in data:
    datos.append(linea.split(":", maxsplit = 1))
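
The split on the first colon leaves the leading space in the text, and any line without a colon would come back as a one-element list. A minimal sketch of the same parsing step, assuming a hypothetical helper `parse_line` and made-up sample lines, that strips whitespace and skips malformed lines:

```python
def parse_line(line):
    """Split 'Speaker: text' into (speaker, text); return None if there is no colon."""
    parts = line.split(":", maxsplit=1)
    if len(parts) != 2:
        return None
    speaker, text = parts
    return speaker.strip(), text.strip()


# Hypothetical sample lines, in the same shape as the dataset
sample = [
    "Leonard: Agreed, what’s your point?",
    "Receptionist: Hang on. ",
    "a line without a speaker",
]
parsed = [pair for pair in map(parse_line, sample) if pair is not None]
```

Because `maxsplit=1` is kept, colons inside the dialogue itself stay in the text part.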

In [7]:
df = pd.DataFrame(datos, columns = ["personaje", "texto"])

df = df[df["personaje"].isin(personajes)].reset_index(drop = True)

df

Unnamed: 0,personaje,texto
0,Sheldon,So if a photon is directed through a plane wi...
1,Leonard,"Agreed, what’s your point?"
2,Sheldon,"There’s no point, I just think it’s a good id..."
3,Leonard,Excuse me?
4,Leonard,"One across is Aegean, eight down is Nabakov, ..."
...,...,...
38719,Sheldon,"Uh, breakfast yes, lunch no. I did have a cou..."
38720,Sheldon,How thoughtful. Thank you.
38721,Sheldon,"And I with you. Question, are you seeking a r..."
38722,Sheldon,"Well, that would raise a number of problems. ..."


In [8]:
df["personaje"].unique()

array(['Sheldon', 'Leonard', 'Penny', 'Howard', 'Raj'], dtype=object)

In [9]:
# Preprocessing

In [10]:
stopwords = nltk.corpus.stopwords.words("english")
stopwords.append("<br />")

In [11]:
def limpiar_reviews(lista, stopwords):
    tokens_reviews = list()

    for review in lista:
        
        tokens_limpios = list()
        tokens = nltk.word_tokenize(text = review.lower(), language = "english")
        
        for token in tokens:
            if (token not in stopwords) and (len(token) > 2) and (re.findall(r"[\d]", token) == []):
                tokens_limpios.append(token)
        
        tokens_reviews.append(tokens_limpios)    
    
    return tokens_reviews
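
The filter keeps a token only if it is not a stop word, is longer than two characters, and contains no digits. The sketch below isolates that rule so it can be checked without NLTK data. The helper `keep_token`, the tiny stop-word set, and the pre-tokenized input are assumptions for illustration; a `set` is used because membership tests on it are faster than on a list.

```python
import re

HAS_DIGIT = re.compile(r"\d")


def keep_token(token, stopwords):
    """Same three conditions as the filter above: not a stop word, longer than 2 chars, no digits."""
    return token not in stopwords and len(token) > 2 and HAS_DIGIT.search(token) is None


# Hypothetical stop-word set and pre-tokenized, lower-cased sentence
stop = {"the", "this", "is", "i", "think"}
tokens = ["i", "think", "this", "is", "the", "place", "42nd", "ok", "tee-shirt"]
kept = [t for t in tokens if keep_token(t, stop)]
```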

In [12]:
tokens_dialogos = limpiar_reviews(df["texto"], stopwords)
tokens_dialogos

[['photon',
  'directed',
  'plane',
  'two',
  'slits',
  'either',
  'slit',
  'observed',
  'slits',
  'unobserved',
  'however',
  'observed',
  'left',
  'plane',
  'hits',
  'target',
  'gone',
  'slits'],
 ['agreed', 'point'],
 ['point', 'think', 'good', 'idea', 'tee-shirt'],
 ['excuse'],
 ['one',
  'across',
  'aegean',
  'eight',
  'nabakov',
  'twenty-six',
  'across',
  'mcm',
  'fourteen',
  'is…',
  'move',
  'finger…',
  'phylum',
  'makes',
  'fourteen',
  'across',
  'port-au-prince',
  'see',
  'papa',
  'doc',
  'capital',
  'idea',
  'port-au-prince',
  'haiti'],
 ['yes', 'high', 'sperm', 'bank'],
 ['think', 'place'],
 ['thank-you', 'right', 'back'],
 ['leonard', 'think'],
 ['kidding', 'semi-pro'],
 ['committing',
  'genetic',
  'fraud',
  'guarantee',
  'sperm',
  'going',
  'generate',
  'high',
  'offspring',
  'think',
  'sister',
  'basic',
  'dna',
  'mix',
  'hostesses',
  'fuddruckers'],
 ['sheldon',
  'idea',
  'little',
  'extra',
  'money',
  'get',
  'fra

In [13]:
df["texto"] = tokens_dialogos
df

Unnamed: 0,personaje,texto
0,Sheldon,"[photon, directed, plane, two, slits, either, ..."
1,Leonard,"[agreed, point]"
2,Sheldon,"[point, think, good, idea, tee-shirt]"
3,Leonard,[excuse]
4,Leonard,"[one, across, aegean, eight, nabakov, twenty-s..."
...,...,...
38719,Sheldon,"[breakfast, yes, lunch, cough, drop, really, r..."
38720,Sheldon,"[thoughtful, thank]"
38721,Sheldon,"[question, seeking, romantic, relationship]"
38722,Sheldon,"[well, would, raise, number, problems, colleag..."


In [14]:
def porter_stemmer(lista):
    
    textos = list()
    
    stemmer = PorterStemmer()  # create the stemmer once and reuse it for every text

    for texto in lista:
        textos.append(" ".join([stemmer.stem(word) for word in texto]))

    return textos

In [15]:
df["texto"] = porter_stemmer(df["texto"])
df

Unnamed: 0,personaje,texto
0,Sheldon,photon direct plane two slit either slit obser...
1,Leonard,agre point
2,Sheldon,point think good idea tee-shirt
3,Leonard,excus
4,Leonard,one across aegean eight nabakov twenty-six acr...
...,...,...
38719,Sheldon,breakfast ye lunch cough drop realli ride line...
38720,Sheldon,thought thank
38721,Sheldon,question seek romant relationship
38722,Sheldon,well would rais number problem colleagu curren...


In [16]:
df["personaje"].value_counts()

Sheldon    11408
Leonard     9570
Penny       7474
Howard      5712
Raj         4560
Name: personaje, dtype: int64

In [17]:
# Reduce the number of Sheldon texts

In [18]:
# Bag-of-Words
count_vectorizer = CountVectorizer()

# Fit the vectorizer and transform the data.
bag = count_vectorizer.fit_transform(df["texto"])

bag

<38724x13040 sparse matrix of type '<class 'numpy.int64'>'
	with 202721 stored elements in Compressed Sparse Row format>

In [19]:
# TF-IDF

# Initialize a TfidfTransformer object
tfidf = TfidfTransformer()

# Set NumPy's print precision to 2 decimal places
np.set_printoptions(precision = 2)

# Fit the TF-IDF transformer and transform the bag variable
bag = tfidf.fit_transform(bag)

bag

<38724x13040 sparse matrix of type '<class 'numpy.float64'>'
	with 202721 stored elements in Compressed Sparse Row format>
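
With default settings, `CountVectorizer` followed by `TfidfTransformer` gives the same matrix as scikit-learn's `TfidfVectorizer` in one step. A small check on made-up stemmed texts (the `docs` list is hypothetical, not taken from the dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical stemmed texts, in the same shape as df["texto"]
docs = ["agre point", "point think good idea", "think place"]

# Two-step route, as in the cells above
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
```

Keeping the two steps separate is still useful when the raw counts are needed on their own.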

In [20]:
# Undersampling: select Sheldon and Leonard, the two largest classes
# X = bag
y = df[df["personaje"].isin(["Sheldon", "Leonard"])]["personaje"]

bag.shape, y.shape

((38724, 13040), (20978,))

In [21]:
bag[y.index]

<20978x13040 sparse matrix of type '<class 'numpy.float64'>'
	with 118163 stored elements in Compressed Sparse Row format>

In [22]:
Counter(y)

Counter({'Sheldon': 11408, 'Leonard': 9570})

In [23]:
undersampling = RandomUnderSampler()
X_balanceado1, y_balanceado1 = undersampling.fit_resample(bag[y.index], y)
Counter(y_balanceado1)

Counter({'Leonard': 9570, 'Sheldon': 9570})

In [24]:
# Increase the number of Raj texts

In [25]:
# SMOTE
# X = bag
y = df[df["personaje"].isin(["Penny", "Raj"])]["personaje"]

bag[y.index].shape, y.shape

((12034, 13040), (12034,))

In [26]:
Counter(y)

Counter({'Penny': 7474, 'Raj': 4560})

In [27]:
oversampling = SMOTE(sampling_strategy = 0.8)
X_balanceado2, y_balanceado2 = oversampling.fit_resample(bag[y.index], y)

Counter(y_balanceado2)

Counter({'Penny': 7474, 'Raj': 5979})

In [28]:
# Increase the number of Howard texts

In [29]:
# SMOTE
# X = bag
y = df[df["personaje"].isin(["Penny", "Howard"])]["personaje"]

bag[y.index].shape, y.shape

((13186, 13040), (13186,))

In [30]:
Counter(y)

Counter({'Penny': 7474, 'Howard': 5712})

In [31]:
oversampling = SMOTE(sampling_strategy = 0.83)
X_balanceado3, y_balanceado3 = oversampling.fit_resample(bag[y.index], y)

Counter(y_balanceado3)

Counter({'Penny': 7474, 'Howard': 6203})

In [32]:
X_balanceado = pd.concat([pd.DataFrame(X_balanceado1.toarray()),
                          pd.DataFrame(X_balanceado2.toarray()),
                          pd.DataFrame(X_balanceado3.toarray())])
# .toarray() converts each sparse matrix to a dense array so it can be wrapped in a DataFrame

In [33]:
y_balanceado = pd.concat([pd.DataFrame(y_balanceado1),
                          pd.DataFrame(y_balanceado2),
                          pd.DataFrame(y_balanceado3)]) # each y is a Series; pd.DataFrame keeps "personaje" as the column name

In [34]:
X_balanceado

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13030,13031,13032,13033,13034,13035,13036,13037,13038,13039
0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13672,0.0,0.0,0.0,0.0,0.0,0.0,0.286758,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13673,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13674,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13675,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [35]:
y_balanceado

Unnamed: 0,personaje
0,Leonard
1,Leonard
2,Leonard
3,Leonard
4,Leonard
...,...
13672,Howard
13673,Howard
13674,Howard
13675,Howard


In [36]:
df_balanceado = pd.concat([X_balanceado, y_balanceado], axis = 1).reset_index(drop = True)
df_balanceado

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13031,13032,13033,13034,13035,13036,13037,13038,13039,personaje
0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Leonard
1,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Leonard
2,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Leonard
3,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Leonard
4,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Leonard
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46265,0.0,0.0,0.0,0.0,0.0,0.0,0.286758,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Howard
46266,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Howard
46267,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Howard
46268,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Howard


In [37]:
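# Drop repeated rows: Penny's rows are in both SMOTE subsets, and identical
# feature vectors with the same label also collapse into one
df_balanceado.drop_duplicates(inplace = True, keep = "first")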
# type(df_balanceado)

In [38]:
df_balanceado.shape

(32673, 13041)

In [39]:
df_balanceado["personaje"].value_counts(normalize = True)

Sheldon    0.263031
Leonard    0.230098
Penny      0.180302
Raj        0.163346
Howard     0.163223
Name: personaje, dtype: float64

# PCA

In [42]:
x_scaler = MaxAbsScaler()

X = x_scaler.fit_transform(df_balanceado.drop("personaje", axis=1))


In [43]:
pca = PCA(n_components = 5) 
X = pca.fit_transform(X) 
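
Holding the 32673 x 13040 feature matrix as dense `float64` takes roughly 3.4 GB. One alternative, not used in this notebook, is `TruncatedSVD`, which accepts the sparse TF-IDF matrix directly. The sketch below runs on made-up texts and also reads `explained_variance_ratio_` to see how much variance the kept components retain; with only a handful of components on a 13040-term vocabulary that share may well be small, which is worth checking on the real data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stemmed texts standing in for df["texto"]
docs = [
    "photon direct plane two slit",
    "agre point",
    "point think good idea",
    "think place",
    "question seek romant relationship",
]

tfidf_matrix = TfidfVectorizer().fit_transform(docs)  # stays sparse

svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf_matrix)  # dense array, shape (n_docs, 2)

# Share of the total variance kept by the two components
retained = svd.explained_variance_ratio_.sum()
```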


In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, df_balanceado["personaje"], test_size = 0.3, random_state = 42)

print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_test: {X_test.shape},  y_test: {y_test.shape}")

X_train: (22871, 5), y_train: (22871,)
X_test: (9802, 5),  y_test: (9802,)


In [50]:
# Classification models: fit each one and print its metrics

def modelo_clasificar(lista_clasificadores, X_train, y_train, X_test, y_test):

    for clasificador in lista_clasificadores:
        print(clasificador)

        clasificador.fit(X_train, y_train)

        yhat = clasificador.predict(X_test)
        
        print("Jaccard Index:", jaccard_score(y_test, yhat, average = "macro"))
        print("Accuracy:"     , accuracy_score(y_test, yhat))
        print("Precisión:"    , precision_score(y_test, yhat, average = "macro"))
        print("Sensibilidad:" , recall_score(y_test, yhat, average = "macro"))
        print("F1-score:"     , f1_score(y_test, yhat, average = "macro"))
        print("Confusion Matrix:\n", confusion_matrix(y_test, yhat))
        print("*"*100)
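
The exercise asks for a function that returns a DataFrame with each model's metrics and execution time, while the function above only prints. A minimal sketch of that variant follows. The name `evaluate_models`, the column names, and the synthetic five-class data are assumptions for illustration; `zero_division=0` avoids the undefined-metric warning when a class is never predicted.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and return one row of macro metrics plus elapsed seconds per model."""
    rows = []
    for model in models:
        start = time.perf_counter()
        model.fit(X_train, y_train)
        yhat = model.predict(X_test)
        seconds = time.perf_counter() - start  # fit + predict time
        rows.append({
            "model": type(model).__name__,
            "accuracy": accuracy_score(y_test, yhat),
            "precision": precision_score(y_test, yhat, average="macro", zero_division=0),
            "recall": recall_score(y_test, yhat, average="macro", zero_division=0),
            "f1": f1_score(y_test, yhat, average="macro", zero_division=0),
            "seconds": seconds,
        })
    return pd.DataFrame(rows)


# Synthetic five-class data standing in for the PCA features
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

results = evaluate_models(
    [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=42)],
    X_tr, y_tr, X_te, y_te,
)
```

The returned frame can then be sorted, for example with `results.sort_values("f1", ascending=False)`, to compare models side by side.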

In [51]:
lista_clasificadores = [KNeighborsClassifier(n_neighbors = 3),
                        RadiusNeighborsClassifier(radius = 0.8,
                                                  outlier_label = "most_frequent"),
                        NearestCentroid(metric = "euclidean"), 
                        GaussianNB(),
                        LogisticRegression(),
                        DecisionTreeClassifier(),
                        RandomForestClassifier(),
                        SVC(),
                        AdaBoostClassifier(),
                        GradientBoostingClassifier()
                       ]

In [52]:
modelo_clasificar(lista_clasificadores,X_train, y_train, X_test, y_test)

KNeighborsClassifier(n_neighbors=3)
Jaccard Index: 0.1266523943222003
Accuracy: 0.23331973066721076
Precisión: 0.24464758341520806
Sensibilidad: 0.22973502025339543
F1-score: 0.22365364879272337
Confusion Matrix:
 [[489 439 204 110 296]
 [749 669 314 148 361]
 [502 567 326 111 241]
 [545 417 206 180 268]
 [824 700 303 210 623]]
****************************************************************************************************
RadiusNeighborsClassifier(outlier_label='most_frequent', radius=0.8)
Jaccard Index: 0.05489338024155386
Accuracy: 0.2717812691287492
Precisión: 0.15435404742436631
Sensibilidad: 0.2004719857788131
F1-score: 0.08655853253829743
Confusion Matrix:
 [[   0    0    2    0 1536]
 [   0    1    4    0 2236]
 [   0    5    4    0 1738]
 [   0    0    1    0 1615]
 [   0    0    1    0 2659]]
****************************************************************************************************
NearestCentroid()
Jaccard Index: 0.13418569417548804
Accuracy: 0.3087125076514997
Precisión: 0.2627484672940274
Sensibilidad: 0.26329153770822944
F1-score: 0.2236032595647596
Confusion Matrix:
 [[  37  168  281  139  913]
 [  40  313  481  178 1229]
 [  34  208  591  137  777]
 [  32  151  307  126 1000]
 [  38  167  321  175 1959]]
****************************************************************************************************
GaussianNB()
Jaccard Index: 0.1130560083631369
Accuracy: 0.3102428075902877
Precisión: 0.21947062820934882
Sensibilidad: 0.25203338487563326
F1-score: 0.18560513052491207
Confusion Matrix:
 [[   0  120  219    7 1192]
 [   0  255  387    4 1595]
 [   0  194  499    7 1047]
 [   0  109  220    4 1283]
 [   0  158  216    3 2283]]
****************************************************************************************************
LogisticRegression()
Jaccard Index: 0.11408897621355507
Accuracy: 0.3117731075290757
Precisión: 0.19090024360713398
Sensibilidad: 0.24858294794195052
F1-score: 0.1876870022115698
Confusion Matrix:
 [[   0  294  110    0 1134]
 [   0  491  207    0 1543]
 [   0  423  303    0 1021]
 [   0  284  119    0 1213]
 [   0  311   87    0 2262]]
****************************************************************************************************
DecisionTreeClassifier()
Jaccard Index: 0.12524327257692872
Accuracy: 0.22985105080595797
Precisión: 0.22220992800465883
Sensibilidad: 0.22091014784506857
F1-score: 0.22123169872199436
Confusion Matrix:
 [[262 388 274 240 374]
 [425 544 441 347 484]
 [272 472 355 265 383]
 [293 368 262 320 373]
 [487 571 385 445 772]]
****************************************************************************************************
RandomForestClassifier()
Jaccard Index: 0.13950513403615594
Accuracy: 0.2708630891654764
Precisión: 0.2447926785809294
Sensibilidad: 0.24479845157565214
F1-score: 0.24011159869652654
Confusion Matrix:
 [[ 166  377  231  184  580]
 [ 245  586  444  222  74