# Machine learning algorithm

This notebook implements a machine learning algorithm for the tweet dataset.

The CSV file containing the tweet feature vectors is loaded into memory.

In [490]:
import pandas as pd

df = pd.read_csv("tweets_feature_extracion.csv", encoding = "utf-8")

All feature vectors of tweets classified as neutral ("neutro") are removed.

In [491]:
df.drop(df[ df["class"] == "neutro"].index, inplace = True)
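
The same filter can also be written as a boolean mask, which returns a new frame instead of mutating the original in place. This is a minimal sketch on a toy frame; the values are made up for illustration and do not come from the tweet dataset.

```python
import pandas as pd

# Toy data standing in for the tweet feature table (illustrative values only).
toy = pd.DataFrame({
    "f1": [0.1, -0.2, 0.3, 0.4],
    "class": ["positivo", "neutro", "negativo", "neutro"],
})

# Keep only the rows whose label is not "neutro"; `toy` itself is left untouched.
filtered = toy[toy["class"] != "neutro"]

print(filtered["class"].tolist())
```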

In [492]:
df.head()

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f92,f93,f94,f95,f96,f97,f98,f99,f100,class
0,-0.185263,-0.126442,0.687119,-0.110711,-0.115368,-0.382538,0.177452,0.319352,0.41437,0.906571,...,-0.06524,-0.096958,-0.332892,-0.002203,0.488303,0.204036,-0.002951,0.283848,0.113339,positivo
1,-0.623082,-0.271673,1.00097,0.00867,-0.036696,-0.473044,0.546534,0.174511,0.489129,1.187792,...,-0.156622,-0.499596,-0.379724,0.104167,0.715333,0.232296,-0.278074,0.229985,-0.180596,positivo
2,-0.657994,-0.780325,0.999145,0.198315,0.05362,-0.494484,0.285496,0.32096,0.3713,1.08135,...,0.032576,0.075115,-0.187369,0.372731,0.648799,0.4583,-0.082725,0.435323,0.333019,positivo
3,-0.830661,-0.3382,1.131208,0.00791,0.039084,-0.531723,0.678928,0.09881,0.553012,1.354421,...,-0.186377,-0.6619,-0.428507,0.161155,0.830715,0.277351,-0.370832,0.256122,-0.219536,positivo
4,-0.351396,-0.39434,0.774259,-0.005173,0.01089,-0.29029,0.307829,0.23842,0.25853,0.89345,...,-0.041784,-0.14321,-0.350679,0.139063,0.584188,0.264411,-0.176282,0.150352,-0.062272,positivo


The tweet feature vectors are stored in the variable x.

In [493]:
x = df.drop("class", axis = 1, inplace = False)

The class label of each tweet is extracted and encoded as an integer:
    * 1 = "negativo" (negative)
    * 5 = "positivo" (positive)

In [494]:
df["class"] = [5 if tag == "positivo"  else 1 for tag in df["class"]]
y = df["class"]
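
The list comprehension above sends every label other than "positivo" to 1. A possible alternative, sketched here under the assumption that stray labels should be surfaced rather than silently counted as negative, is `Series.map` with an explicit dictionary. The "otro" label is a hypothetical value used only to show the behaviour.

```python
import pandas as pd

# "otro" is a made-up stray label, included to show how unknown values are handled.
labels = pd.Series(["positivo", "negativo", "positivo", "otro"])

label_to_int = {"negativo": 1, "positivo": 5}
encoded = labels.map(label_to_int)

# Labels missing from the dictionary become NaN, so they are easy to detect.
unexpected = labels[encoded.isna()].unique().tolist()

print(encoded.tolist())
print(unexpected)
```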

In [495]:
df.head()

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f92,f93,f94,f95,f96,f97,f98,f99,f100,class
0,-0.185263,-0.126442,0.687119,-0.110711,-0.115368,-0.382538,0.177452,0.319352,0.41437,0.906571,...,-0.06524,-0.096958,-0.332892,-0.002203,0.488303,0.204036,-0.002951,0.283848,0.113339,5
1,-0.623082,-0.271673,1.00097,0.00867,-0.036696,-0.473044,0.546534,0.174511,0.489129,1.187792,...,-0.156622,-0.499596,-0.379724,0.104167,0.715333,0.232296,-0.278074,0.229985,-0.180596,5
2,-0.657994,-0.780325,0.999145,0.198315,0.05362,-0.494484,0.285496,0.32096,0.3713,1.08135,...,0.032576,0.075115,-0.187369,0.372731,0.648799,0.4583,-0.082725,0.435323,0.333019,5
3,-0.830661,-0.3382,1.131208,0.00791,0.039084,-0.531723,0.678928,0.09881,0.553012,1.354421,...,-0.186377,-0.6619,-0.428507,0.161155,0.830715,0.277351,-0.370832,0.256122,-0.219536,5
4,-0.351396,-0.39434,0.774259,-0.005173,0.01089,-0.29029,0.307829,0.23842,0.25853,0.89345,...,-0.041784,-0.14321,-0.350679,0.139063,0.584188,0.264411,-0.176282,0.150352,-0.062272,5


In [496]:
y.head()

0    5
1    5
2    5
3    5
4    5
Name: class, dtype: int64

The tweet feature vectors are standardized: each feature is rescaled to zero mean and unit variance.

In [497]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
normalized_x = scaler.fit_transform(x)

In [498]:
print(normalized_x)

[[ 0.32510994  0.70942713 -0.27430557 ...  0.87882734  1.24333992
   1.31214435]
 [-1.14343497 -0.10760975  1.29750533 ... -0.77069928  0.80968509
  -0.26629212]
 [-1.26053554 -2.96917911  1.2883662  ...  0.40053531  2.46289794
   2.49183036]
 ...
 [ 0.43318049 -1.11697778 -0.78828387 ... -0.2273637  -1.44548556
   0.97151615]
 [ 0.45384243 -0.72585761 -0.23426219 ...  0.40874805 -0.34699723
  -0.28413279]
 [-0.74390649 -1.04858376  1.45309583 ... -1.23156128 -0.93277647
  -2.19579293]]
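
To make explicit what `StandardScaler` computes, the sketch below reproduces it by hand on synthetic data (the array is random and only stands in for x). Each column has its mean subtracted and is divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix x (50 rows, 4 features).
rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(50, 4))

scaled = StandardScaler().fit_transform(features)

# Manual version: per-column mean and population standard deviation (ddof=0).
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```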


The dataset is split into two parts:
    * training - 80 %
    * test - 20 %

Stratified sampling is used so that the training and test sets keep the same class proportions.

In [499]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(normalized_x,
                                                    y,
                                                    test_size = 0.2,
                                                    random_state = 0,
                                                    stratify = y)
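
The effect of `stratify` can be seen on a small synthetic example. The 70/30 label vector below is made up for illustration; with stratification both parts keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 70/30 class ratio; the features are just row ids.
labels = np.array([5] * 70 + [1] * 30)
features = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

# Share of class 5 in each part.
print((y_tr == 5).mean(), (y_te == 5).mean())
```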

In [500]:
y_train.value_counts()

5    5276
1    2102
Name: class, dtype: int64

In [501]:
y_test.value_counts()

5    1319
1     526
Name: class, dtype: int64
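
As a quick arithmetic check, the counts printed above can be turned into proportions. The numbers are copied from the two outputs; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() outputs above.
train_counts = {5: 5276, 1: 2102}
test_counts = {5: 1319, 1: 526}

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())

# Fraction of all rows that ended up in the test set.
test_fraction = n_test / (n_train + n_test)

# Share of class 5 ("positivo") in each part.
train_pos_share = train_counts[5] / n_train
test_pos_share = test_counts[5] / n_test

print(n_train, n_test)
print(round(test_fraction, 4), round(train_pos_share, 4), round(test_pos_share, 4))
```

The test set holds about 20 % of the rows, and class 5 makes up roughly 71.5 % of both parts, which is what the stratified split is meant to guarantee.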

A support vector machine will be used.

To find the hyperparameters that best fit the model, a grid search with K = 3 cross-validation is performed (the cv = 3 argument passed to GridSearchCV).

The metric used is F1.

In [502]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {  
    'C': [1, 10, 100],
    'gamma': [0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf'],
    'tol': [0.01, 0.001, 0.0001, 0.00001]
}

grid_search = GridSearchCV(estimator = SVC(),
                           param_grid = param_grid,
                           scoring = 'f1',
                           cv = 3)

grid_search.fit(x_train, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': [1, 10, 100], 'gamma': [0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf'], 'tol': [0.01, 0.001, 0.0001, 1e-05]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1', verbose=0)
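
To get a sense of the cost of this search, the number of candidates can be counted with `ParameterGrid`. The sketch below repeats the grid from the cell above; `n_folds` mirrors the `cv = 3` argument.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = {
    'C': [1, 10, 100],
    'gamma': [0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf'],
    'tol': [0.01, 0.001, 0.0001, 0.00001],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 1 * 4 combinations
n_folds = 3                                    # matches cv = 3
n_cv_fits = n_candidates * n_folds

print(n_candidates, n_cv_fits)
```

Since `refit=True`, one additional fit on the whole training set follows the cross-validated fits.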

The grid search results are retrieved.

In [503]:
print('Tuning hyperparameters for f1')
print()

print('Best hyperparameters found with cross-validation:')
print()
print(grid_search.best_params_)
print()
print('f1 scores on the validation folds:')
print()
# Mean and standard deviation of the f1 score across folds, per candidate.
means = grid_search.cv_results_['mean_test_score']
stds = grid_search.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_search.cv_results_['params']):
    print('%0.4f (+/-%0.04f) for %r'
          % (mean, std * 2, params))

Tuning hyperparameters for f1

Best hyperparameters found with cross-validation:

{'C': 100, 'gamma': 0.001, 'kernel': 'rbf', 'tol': 0.001}

f1 scores on the validation folds:

0.7520 (+/-0.0086) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf', 'tol': 0.01}
0.7518 (+/-0.0089) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf', 'tol': 0.001}
0.7518 (+/-0.0089) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf', 'tol': 0.0001}
0.7518 (+/-0.0089) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf', 'tol': 1e-05}
0.7862 (+/-0.0110) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf', 'tol': 0.01}
0.7864 (+/-0.0111) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf', 'tol': 0.001}
0.7864 (+/-0.0111) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf', 'tol': 0.0001}
0.7866 (+/-0.0112) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf', 'tol': 1e-05}
0.7620 (+/-0.0194) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf', 'tol': 0.01}
0.7619 (+/-0.0192) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf', 'tol': 0.001}
0.
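
A long printed list like the one above can be easier to read as a table sorted by rank. The sketch below shows one way to do that with `cv_results_`; it runs on a small synthetic problem with a reduced grid, so its scores are unrelated to the tweet results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary problem standing in for the tweet vectors.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_search = GridSearchCV(
    estimator=SVC(kernel="rbf"),
    param_grid={"C": [1, 10], "gamma": [0.1, 0.01]},  # reduced grid for the demo
    scoring="f1",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best candidate first.
columns = ["param_C", "param_gamma", "mean_test_score", "std_test_score", "rank_test_score"]
results = (
    pd.DataFrame(demo_search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)

print(results)
```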

Results are shown for both the training and test sets.

In [504]:
support_vector_model = grid_search

support_vector_model_train_prediction = support_vector_model.predict(x_train)
support_vector_model_test_prediction = support_vector_model.predict(x_test)
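
Calling `predict` on the fitted `GridSearchCV` object works because, with `refit=True`, the search retrains the best candidate on the full training set and forwards `predict` to it. The sketch below checks this on synthetic data; the names `X_demo`, `y_demo` and `demo_search` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary problem standing in for the tweet vectors.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# refit=True is the default, so best_estimator_ is available after fit().
demo_search = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10]}, scoring="f1", cv=3)
demo_search.fit(X_demo, y_demo)

via_search = demo_search.predict(X_demo)
via_best = demo_search.best_estimator_.predict(X_demo)

print(np.array_equal(via_search, via_best))
```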

In [505]:
from sklearn.metrics import f1_score

print('Support vector machine')
print('\t-Best hyperparameters: ' + str(support_vector_model.best_params_))
print('\t-f1 score on training: %.4f' % f1_score(y_train, support_vector_model_train_prediction, average = 'binary'))
print('\t-f1 score on test: %.4f' % f1_score(y_test, support_vector_model_test_prediction, average = 'binary'))

Support vector machine
	-Best hyperparameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf', 'tol': 0.001}
	-f1 score on training: 0.8397
	-f1 score on test: 0.8290
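
One detail worth keeping in mind when reading these scores: with `average = 'binary'`, `f1_score` treats the label given by `pos_label` as the positive class, and the default is `pos_label=1`. With the encoding used in this notebook, that default corresponds to the "negativo" class. The sketch below uses made-up labels to show how the score changes with `pos_label`, and adds a confusion matrix for context.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels using the notebook's encoding: 1 = "negativo", 5 = "positivo".
y_true = [1, 1, 1, 5, 5, 5, 5, 5, 5, 5]
y_pred = [1, 1, 5, 1, 5, 5, 5, 5, 5, 5]

# Default pos_label=1: F1 of the "negativo" class.
f1_negative = f1_score(y_true, y_pred, average="binary")

# Explicit pos_label=5: F1 of the "positivo" class.
f1_positive = f1_score(y_true, y_pred, average="binary", pos_label=5)

# Rows are true labels, columns are predicted labels, both ordered as [1, 5].
matrix = confusion_matrix(y_true, y_pred, labels=[1, 5])

print(round(f1_negative, 4), round(f1_positive, 4))
print(matrix)
```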
