# Alberto J. Orio García.

## Knowledge Test: Machine Learning & NLP

Using the **`Tweets.csv`** dataset and **`NLP`** data-processing methods, develop a model that predicts the **`sentiment`** column.

- Use several classification models and compare their metrics and execution times.
- Return a **`DataFrame`** with the results (metrics) of all the models.
- Select the best model and apply **`GridSearchCV`** to find the best parameters.
- Use **`PCA`** or **`SMOTE`** if you consider it necessary.

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.stem import WordNetLemmatizer

# Bag-of-Words and TF-IDF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Normalization
from sklearn.preprocessing import MinMaxScaler

# GridSearchCV
from sklearn.model_selection import GridSearchCV

# Train, Test
from sklearn.model_selection import train_test_split

# Metrics
from sklearn.metrics import jaccard_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import get_scorer_names  # replaces SCORERS, which was removed in scikit-learn 1.3

# Classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Validation
from sklearn.model_selection import StratifiedKFold

In [2]:
df = pd.read_csv("Tweets.csv")

df.head(3)

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative


### Preprocessing

In [3]:
# The textID and selected_text columns are not needed,
# so they are dropped here.

df.drop(['textID', 'selected_text'], axis = 1, inplace = True)
df

Unnamed: 0,text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD I will miss you here in San Diego!!!,negative
2,my boss is bullying me...,negative
3,what interview! leave me alone,negative
4,"Sons of ****, why couldn`t they put them on t...",negative
...,...,...
27476,wish we could come see u on Denver husband l...,negative
27477,I`ve wondered about rake to. The client has ...,negative
27478,Yay good for both of you. Enjoy the break - y...,positive
27479,But it was worth it ****.,positive


In [4]:
# Count the non-NaN values in each column.
# The text column turns out to contain one NaN.

df.count()

text         27480
sentiment    27481
dtype: int64

In [5]:
# Remove that NaN (strictly speaking, this drops every row
# that contains at least one NaN value).

df.dropna(inplace = True)
df

Unnamed: 0,text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD I will miss you here in San Diego!!!,negative
2,my boss is bullying me...,negative
3,what interview! leave me alone,negative
4,"Sons of ****, why couldn`t they put them on t...",negative
...,...,...
27476,wish we could come see u on Denver husband l...,negative
27477,I`ve wondered about rake to. The client has ...,negative
27478,Yay good for both of you. Enjoy the break - y...,positive
27479,But it was worth it ****.,positive


In [6]:
# Check the size of the data: 27,480 rows and 2 columns.
# With only a text column and a label at this stage, PCA is
# probably not needed to reduce dimensionality.

df.shape

(27480, 2)

In [7]:
# Count the number of rows in each class.
# There is no significant imbalance, so applying
# SMOTE will not be necessary.
df['sentiment'].value_counts()

neutral     11117
positive     8582
negative     7781
Name: sentiment, dtype: int64

In [8]:
# Replace the values neutral, positive and negative with
# numbers (0, 1 and 2, respectively) so that they can be
# used as labels in the prediction models.

# Assigning the mapped column back avoids chained inplace replace,
# which is deprecated in recent pandas versions.
df['sentiment'] = df['sentiment'].map({'neutral': 0, 'positive': 1, 'negative': 2})
df['sentiment'].value_counts()

0    11117
1     8582
2     7781
Name: sentiment, dtype: int64

In [9]:
# Collect all the texts in a list.

lista_textos = df['text'].to_list()

In [10]:
lista_textos[0:10]

[' I`d have responded, if I were going',
 ' Sooo SAD I will miss you here in San Diego!!!',
 'my boss is bullying me...',
 ' what interview! leave me alone',
 ' Sons of ****, why couldn`t they put them on the releases we already bought',
 'http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth',
 '2am feedings for the baby are fun when he is all smiles and coos',
 'Soooo high',
 ' Both of you',
 ' Journey!? Wow... u just became cooler.  hehe... (is that possible!?)']

In [11]:
# Collect the stopwords: words with little meaning or importance
# that will be excluded from the analysis. HTML line-break tags and
# runs of asterisks (which probably mask profanity) are added too.
# Requires the NLTK stopwords corpus: nltk.download('stopwords').

stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(['<br />', '*', '**', '***', '****'])

In [12]:
# This function lowercases the texts, discards short
# tokens (2 characters or fewer), and removes the stopwords
# defined above.

def depurar(lista, stopwords):
    # A set makes each membership test O(1) instead of scanning a list.
    stopwords = set(stopwords)
    tokens_depurados = list()

    for texto in lista:
        tokens = nltk.word_tokenize(text = texto.lower(), language = 'english')

        # Keep only tokens that are long enough and are not stopwords.
        tokens_textos = [token for token in tokens
                         if (token not in stopwords) and (len(token) > 2)]

        tokens_depurados.append(tokens_textos)

    return tokens_depurados

In [13]:
%%time
# Using the previous function, the list of
# cleaned texts is stored in a variable.

lista_textos = depurar(lista_textos, stopwords)

Wall time: 8.97 s


In [14]:
lista_textos[0:10]

[['responded', 'going'],
 ['sooo', 'sad', 'miss', 'san', 'diego'],
 ['boss', 'bullying', '...'],
 ['interview', 'leave', 'alone'],
 ['sons', 'put', 'releases', 'already', 'bought'],
 ['http',
  '//www.dothebouncy.com/smf',
  'shameless',
  'plugging',
  'best',
  'rangers',
  'forum',
  'earth'],
 ['2am', 'feedings', 'baby', 'fun', 'smiles', 'coos'],
 ['soooo', 'high'],
 [],
 ['journey', 'wow', '...', 'became', 'cooler', 'hehe', '...', 'possible']]

In [15]:
# Reduce each word to its lemma (dictionary form),
# which preserves its meaning.

def lematizar(lista):

    # Create the lemmatizer once instead of once per text.
    lemmatizer = WordNetLemmatizer()

    textos = list()

    for texto in lista:
        textos.append(" ".join([lemmatizer.lemmatize(word) for word in texto]))

    return textos

In [16]:
%%time
# Again, the list of processed texts is stored
# in a variable, using the function defined
# in the previous cell.

lista_textos = lematizar(lista_textos)

Wall time: 4.12 s


In [17]:
lista_textos[0:10]

['responded going',
 'sooo sad miss san diego',
 'bos bullying ...',
 'interview leave alone',
 'son put release already bought',
 'http //www.dothebouncy.com/smf shameless plugging best ranger forum earth',
 '2am feeding baby fun smile coo',
 'soooo high',
 '',
 'journey wow ... became cooler hehe ... possible']

In [18]:
# Turn the texts into numbers in the form of a matrix:
# each row is one text, each column is one word of the
# vocabulary, and each cell holds the number of times
# that word appears in that text.

count_vectorizer = CountVectorizer()

bag = count_vectorizer.fit_transform(lista_textos)

bag

<27480x24212 sparse matrix of type '<class 'numpy.int64'>'
	with 184815 stored elements in Compressed Sparse Row format>
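
To make the layout of this matrix concrete, here is a minimal sketch on a made-up three-text corpus (the texts are illustrative and not taken from `Tweets.csv`). Each row is one text, each column one vocabulary word, and each cell the number of occurrences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_texts = ["sad sad day", "good day", "good good good"]

toy_vectorizer = CountVectorizer()
toy_bag = toy_vectorizer.fit_transform(toy_texts)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

# Densifying is harmless here because the matrix is only 3 x 3.
dense_counts = toy_bag.toarray()

print(vocabulary)
print(dense_counts)
```

Reading the first row against the vocabulary shows how often each word occurs in the first text.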

In [19]:
# TF-IDF down-weights words that appear in many texts.
# It is fitted on and applied to the previous result, and
# the output is stored in the variable bag, as before.
# np.set_printoptions only limits the displayed precision to
# 2 decimals; the stored values are unchanged. The data is then
# converted to a dense array for convenience. Note that a dense
# 27480 x 24212 float64 matrix takes about 5 GiB, and many scikit-learn
# estimators (though not GaussianNB) accept the sparse matrix directly.

tfidf = TfidfTransformer()

np.set_printoptions(precision = 2)

bag = tfidf.fit_transform(bag).toarray()

bag

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
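
The dense conversion is the expensive step in this cell. As a rough check, assuming 8 bytes per `float64` value, the sizes can be worked out from the shapes reported in this notebook, including the one quoted in the `MemoryError` at the end. The sketch below also shows, on a made-up corpus, that the `TfidfTransformer` output is sparse unless `.toarray()` is called; `dense_gib` is a hypothetical helper name.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def dense_gib(rows, cols, bytes_per_value=8):
    """Size in GiB of a dense rows x cols array (float64 by default)."""
    return rows * cols * bytes_per_value / 2**30


full_matrix_gib = dense_gib(27480, 24212)  # the whole TF-IDF matrix
cv_fold_gib = dense_gib(17587, 24212)      # shape quoted in the MemoryError

print(f"full matrix: {full_matrix_gib:.2f} GiB")
print(f"one CV training fold: {cv_fold_gib:.2f} GiB")

# Hypothetical toy corpus: the TF-IDF result stays sparse by default.
toy_texts = ["sad sad day", "good day", "good good good"]
toy_counts = CountVectorizer().fit_transform(toy_texts)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)

print(sparse.issparse(toy_tfidf), toy_tfidf.shape)
```

With only 184,815 stored elements out of roughly 665 million cells, the sparse form is far smaller than either dense figure.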

In [20]:
# Split the processed, numeric data into X and y,
# and into training and test sets.

X_train, X_test, y_train, y_test = train_test_split(bag, df['sentiment'].values, test_size = 0.2, random_state = 42)

print(f'X_train: {X_train.shape}, y_train: {y_train.shape}')
print(f'X_test: {X_test.shape},  y_test: {y_test.shape}')

X_train: (21984, 24212), y_train: (21984,)
X_test: (5496, 24212),  y_test: (5496,)
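
The split above is purely random. One optional variant, not used in the original notebook, is to pass `stratify` so that both subsets keep the class proportions. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 3 classes in a 50/30/20 ratio.
toy_y = np.array([0] * 50 + [1] * 30 + [2] * 20)
toy_X = np.arange(100).reshape(-1, 1)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Count how many test samples fall in each class.
test_counts = np.bincount(toy_y_test)
train_counts = np.bincount(toy_y_train)

print("train:", train_counts)
print("test:", test_counts)
```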


### Classification models

In [21]:
lista_modelos = [KNeighborsClassifier(n_neighbors = 3),
                RadiusNeighborsClassifier(radius = 0.8, outlier_label = "most_frequent"),
                NearestCentroid(metric = "euclidean"), 
                GaussianNB(),
                LogisticRegression(),
                DecisionTreeClassifier(),
                RandomForestClassifier(),
                SVC(),
                AdaBoostClassifier(),
                GradientBoostingClassifier()]

In [22]:
def clasificar(lista_modelos, X_train, y_train, X_test, y_test):

    for modelo in lista_modelos:
        print(modelo)

        modelo.fit(X_train, y_train)

        yhat = modelo.predict(X_test)
        
        print("Jaccard Index:", jaccard_score(y_test, yhat, average = "macro"))
        print("Accuracy:"     , accuracy_score(y_test, yhat))
        print("Precisión:"    , precision_score(y_test, yhat, average = "macro"))
        print("Sensibilidad:" , recall_score(y_test, yhat, average = "macro"))
        print("F1-score:"     , f1_score(y_test, yhat, average = "macro"))
        #print("ROC AUC:"     , roc_auc_score(y_test, yhat)) Raises an error for multiclass targets: multi_class must be in ('ovo', 'ovr')
        print("Confusion Matrix:\n", confusion_matrix(y_test, yhat))
        print("*"*100)
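
The task also asks for a `DataFrame` with the metrics and run time of every model, which `clasificar` only prints. A possible variant is sketched below; `clasificar_df` is a hypothetical helper name, and the synthetic data from `make_classification` merely stands in for the TF-IDF features.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def clasificar_df(lista_modelos, X_train, y_train, X_test, y_test):
    """Fit each model and collect macro-averaged metrics plus elapsed time."""
    rows = []
    for modelo in lista_modelos:
        start = time.perf_counter()
        modelo.fit(X_train, y_train)
        yhat = modelo.predict(X_test)
        seconds = time.perf_counter() - start  # fit + predict time

        rows.append({
            "model": type(modelo).__name__,
            "accuracy": accuracy_score(y_test, yhat),
            "precision": precision_score(y_test, yhat, average="macro", zero_division=0),
            "recall": recall_score(y_test, yhat, average="macro", zero_division=0),
            "f1": f1_score(y_test, yhat, average="macro", zero_division=0),
            "seconds": seconds,
        })

    # Best accuracy first, so the candidate for tuning is on top.
    return pd.DataFrame(rows).sort_values("accuracy", ascending=False, ignore_index=True)


# Synthetic 3-class stand-in data (an assumption, not the tweet features).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

resultados = clasificar_df(
    [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=42)],
    X_tr, y_tr, X_te, y_te,
)
print(resultados)
```

Timing fit and predict together is a design choice; the two could also be recorded in separate columns.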

In [23]:
%%time
clasificar(lista_modelos, X_train, y_train, X_test, y_test)

KNeighborsClassifier(n_neighbors=3)
Jaccard Index: 0.2154748028119092
Accuracy: 0.46033478893740903
Precisión: 0.6520297086850556
Sensibilidad: 0.39765532854077595
F1-score: 0.3316919988760159
Confusion Matrix:
 [[2149   44   43]
 [1435  240   13]
 [1419   12  141]]
****************************************************************************************************
RadiusNeighborsClassifier(outlier_label='most_frequent', radius=0.8)
Jaccard Index: 0.1866682757702561
Accuracy: 0.4410480349344978
Precisión: 0.6491423413207852
Sensibilidad: 0.3742180405080268
F1-score: 0.28476744260531295
Confusion Matrix:
 [[2183   27   26]
 [1523  159    6]
 [1479   11   82]]
****************************************************************************************************
NearestCentroid()
Jaccard Index: 0.4362057189917848
Accuracy: 0.6153566229985444
Precisión: 0.6474934634811107
Sensibilidad: 0.5961147277746609
F1-score: 0.6067286662247192
Confusion Matrix:
 [[1712  251  273]
 [ 690  907   91]
 [ 6

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Jaccard Index: 0.5192834933301474
Accuracy: 0.6841339155749636
Precisión: 0.7061876842617387
Sensibilidad: 0.6723528074420916
F1-score: 0.6821958030667413
Confusion Matrix:
 [[1714  244  278]
 [ 457 1167   64]
 [ 614   79  879]]
****************************************************************************************************
DecisionTreeClassifier()
Jaccard Index: 0.4949042299094277
Accuracy: 0.6610262008733624
Precisión: 0.661128751738762
Sensibilidad: 0.6610804400510331
F1-score: 0.6609191489583254
Confusion Matrix:
 [[1451  365  420]
 [ 330 1229  129]
 [ 463  156  953]]
****************************************************************************************************
RandomForestClassifier()
Jaccard Index: 0.5425973984788003
Accuracy: 0.7050582241630277
Precisión: 0.7168076933231268
Sensibilidad: 0.6971489195985289
F1-score: 0.7018910837819753
Confusion Matrix:
 [[1676  309  251]
 [ 334 1303   51]
 [ 539  137  896]]
**************************************************************
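
The ROC AUC line commented out in `clasificar` fails because, for a multiclass target, `roc_auc_score` expects per-class probabilities and an explicit `multi_class` strategy rather than hard predictions. A minimal sketch on synthetic data (a stand-in, not the tweet features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# One probability column per class; each row sums to 1.
proba = model.predict_proba(X_te)

# One-vs-rest AUC, averaged over the three classes.
auc_ovr = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print(auc_ovr)
```

This only works for estimators that expose `predict_proba`; `SVC()` with its default `probability=False`, for example, would need to be configured differently or skipped.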

### GridSearch

In [26]:
%%time
# Looking at the classifiers' metrics, the best model for
# predicting the sentiment column appears to be
# RandomForestClassifier (it has the highest accuracy, and on
# the other metrics it is either the best or among the best).
# GridSearchCV is therefore applied to RandomForestClassifier
# to look for the best parameters, optimizing for accuracy.

model = RandomForestClassifier()

params = {"n_estimators"           : [100, 200],
          "criterion"              : ["gini", "entropy"],
          "max_depth"              : [3, 4, 5],
          "max_features"           : [2, 3],
          "max_leaf_nodes"         : [8],
          "min_impurity_decrease"  : [0.02, 0.3],
          "min_samples_split"      : [2, 5]}

scorers = ["f1_macro", "accuracy", "recall_macro"]  # list of scorer names, the documented form for multi-metric scoring

grid_solver = GridSearchCV(estimator  = model    , 
                           param_grid = params   , 
                           scoring    = scorers  ,
                           cv         = 5        ,
                           refit      = "accuracy",
                           n_jobs     = -1        )

model_result = grid_solver.fit(X_train, y_train)

print(model_result.cv_results_["mean_test_recall_macro"].mean())
print(model_result.cv_results_["mean_test_f1_macro"].mean())
print(model_result.cv_results_["mean_test_accuracy"].mean())

print("*"*100)

print(model_result.best_score_)
print(model_result.best_params_)

MemoryError: Unable to allocate 3.17 GiB for an array with shape (17587, 24212) and data type float64
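
One way around this `MemoryError`, sketched here under assumptions rather than taken from the original notebook, is to keep the TF-IDF features sparse and run the search over a `Pipeline`, which also refits the vectorizer inside each fold. The toy corpus, its labels, and the small grid are made up for illustration.

```python
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; labels follow the notebook's encoding
# (1 = positive, 2 = negative).
toy_texts = ["sad bad day", "awful sad news", "bad awful mood", "terrible sad day",
             "good happy day", "great happy news", "good great mood", "lovely happy day"]
toy_labels = [2, 2, 2, 2, 1, 1, 1, 1]

# TfidfVectorizer combines CountVectorizer and TfidfTransformer
# and returns a sparse matrix, which RandomForestClassifier accepts.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("forest", RandomForestClassifier(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
params = {"forest__n_estimators": [10, 20],
          "forest__max_depth": [3, 5]}

grid = GridSearchCV(
    estimator=pipeline,
    param_grid=params,
    scoring=["accuracy", "f1_macro"],
    refit="accuracy",
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
    n_jobs=1,  # a single worker keeps memory use predictable
)
grid.fit(toy_texts, toy_labels)

# The features produced by the fitted vectorizer are still sparse.
features = grid.best_estimator_.named_steps["tfidf"].transform(toy_texts)
print(sparse.issparse(features), grid.best_params_)
```

On the real data the grid and the number of folds would need to be chosen with run time in mind; the values here are placeholders.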