This notebook covers data reading, feature engineering, hyperparameter selection, and model training.

In [None]:
import pandas as pd
from xml.dom import minidom
from sklearn.model_selection import train_test_split

The path of each data file (XML format) is assigned to a variable, and the paths are combined into a tuple called path_list.

In [None]:

path_es = 'ES_train.xml'
path_mx = 'MX_train.xml'
path_pe = 'PE_train.xml'
path_uy = 'UY_train.xml'

path_list  = (path_es, path_mx, path_pe, path_uy)


In [None]:
def contenido_atributos(path_list):
    '''Extract the comments and their sentiment labels from XML files.

    Parameters:
        path_list: list or tuple of paths to the XML files.

    Return:
        Two lists covering all files: the comments and the sentiments.
    '''
    # Local lists, so calling the function again does not duplicate data.
    contenido = []
    sentimiento = []
    for path in path_list:
        mydoc = minidom.parse(path)
        for element in mydoc.getElementsByTagName('content'):
            contenido.append(element.firstChild.data)
        for element in mydoc.getElementsByTagName('value'):
            sentimiento.append(element.firstChild.data)

    return contenido, sentimiento
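
To illustrate what the function expects, here is a minimal sketch that parses an in-memory document instead of a file. The XML layout below (a `tweet` element wrapping `content` and `sentiment/polarity/value`) is an assumption made for illustration; only the `content` and `value` tag names come from the code above.

```python
from xml.dom import minidom

# Hypothetical sample; the real files may contain more fields.
sample_xml = """<tweets>
  <tweet>
    <content>Me encanta este lugar</content>
    <sentiment><polarity><value>P</value></polarity></sentiment>
  </tweet>
  <tweet>
    <content>Que mal servicio</content>
    <sentiment><polarity><value>N</value></polarity></sentiment>
  </tweet>
</tweets>"""

doc = minidom.parseString(sample_xml)

# firstChild is the text node inside each element; .data is its string.
comments = [node.firstChild.data for node in doc.getElementsByTagName('content')]
labels = [node.firstChild.data for node in doc.getElementsByTagName('value')]

pairs = list(zip(comments, labels))
print(pairs)
```

An empty element has no child node, so `firstChild` would be `None` and `.data` would raise an `AttributeError`; real data may need a guard for that case.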

We apply the function to path_list.

In [None]:
contenido, sentimiento = contenido_atributos(path_list)

In [None]:
len(contenido), len(sentimiento)

The two variables with the comment and the sentiment are merged into a single list called datos_totales.

In [None]:

datos_totales = list(zip(contenido, sentimiento))



In [None]:
datos_totales

In [None]:
len(datos_totales)

In [None]:
datos_totales[5]

In [None]:
contenido[5], sentimiento[5]

We transform the list into a dataframe to work more easily.

In [None]:
df = pd.DataFrame(datos_totales)
df

We rename the columns for better understanding.

In [None]:
df = df.rename(columns={0: 'comentario', 1:'sentimiento'})
df

In [None]:
df['sentimiento'].value_counts()

We have four different sentiment categories, but we are only going to work with two of them, N and P.

In [None]:
((df['sentimiento']=='NEU') | (df['sentimiento']=='NONE')).value_counts()

In [None]:
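# .copy() gives an independent dataframe, so later assignments do not raise SettingWithCopyWarning
df_final = df[(df['sentimiento']=='P') | (df['sentimiento']=='N')].copy()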
df_final['sentimiento'].value_counts()

We encode the category N as 0 and P as 1 so that the machine learning model receives numeric labels.

In [None]:
# map() encodes both labels in one step and yields an integer column
df_final = df_final.assign(sentimiento=df_final['sentimiento'].map({'N': 0, 'P': 1}))
df_final['sentimiento'].value_counts()
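
The filtering and encoding steps can be sketched on a toy dataframe. The rows below are invented for illustration; the column names and the four category labels follow the notebook.

```python
import pandas as pd

# Invented rows covering the four categories seen in the data.
toy = pd.DataFrame({
    'comentario': ['a', 'b', 'c', 'd'],
    'sentimiento': ['P', 'NEU', 'N', 'NONE'],
})

# Keep only the two classes of interest; .copy() makes the result
# independent of the original dataframe.
toy_bin = toy[toy['sentimiento'].isin(['N', 'P'])].copy()

# Encode N as 0 and P as 1.
toy_bin['sentimiento'] = toy_bin['sentimiento'].map({'N': 0, 'P': 1})
print(toy_bin)
```

Any label missing from the mapping would become NaN under `map`, which makes an unexpected category easy to spot.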

The final dataframe is now ready to be used by the machine learning model.

In [None]:
df_final

In [None]:
df_comentario = df_final['comentario']
df_sentimiento = df_final['sentimiento']
df_comentario

In [None]:
df_sentimiento

df_comentario is our X, and df_sentimiento is our y. We split them into 85% training data and 15% test data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_comentario, df_sentimiento, test_size=0.15)
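
The split above is random, so its contents change on every run. As a sketch of one possible variation (not what the notebook does), `random_state` makes the split reproducible and `stratify` keeps the class proportions similar in both parts. The toy texts and labels are invented for illustration.

```python
from sklearn.model_selection import train_test_split

# Invented data: 100 comments with alternating labels.
texts = [f'comentario {i}' for i in range(100)]
labels = [i % 2 for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.15,      # same proportion as in the notebook
    random_state=42,     # assumption: any fixed integer gives a repeatable split
    stratify=labels,     # keep the 0/1 ratio similar in train and test
)

print(len(X_tr), len(X_te))
```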

In [None]:
len(X_train), len(X_test)

TfidfVectorizer converts the text into TF-IDF weighted feature vectors, a representation well suited to text classification. For more detail, consult the scikit-learn documentation.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizacion = TfidfVectorizer(ngram_range=(1,1))
train_x_vect = vectorizacion.fit_transform(X_train)
test_x_vect = vectorizacion.transform(X_test)
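
As a rough sketch of what the vectorizer computes, the following pure-Python code builds TF-IDF vectors for three invented documents. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which matches the TfidfVectorizer defaults; the whitespace tokenization is a simplification of the library's tokenizer.

```python
import math
from collections import Counter

# Invented mini-corpus, tokenized by whitespace (a simplification).
docs = ['buen servicio', 'mal servicio', 'buen buen producto']
tokenized = [d.split() for d in docs]

vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]


vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

Terms that appear in fewer documents (here `mal` and `producto`) receive a higher idf than terms shared across documents.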

We will use an SVM for this classification problem. For more information about this and other classification models, consult the sklearn documentation.

In [None]:
from sklearn.svm import SVC

svc = SVC()
grid_svc = SVC()

In [None]:
svc.fit(train_x_vect, y_train)

In [None]:
svc.score(test_x_vect, y_test)

In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, svc.predict(test_x_vect), average=None, labels=[1, 0])
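
With `average=None`, `f1_score` returns one value per class. As a sketch of the underlying arithmetic, the per-class F1 can be computed by hand from precision and recall. The toy labels below are invented for illustration.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented labels: three positives and four negatives.
y_true_toy = [1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0, 1]

f1_pos = f1_per_class(y_true_toy, y_pred_toy, label=1)
f1_neg = f1_per_class(y_true_toy, y_pred_toy, label=0)
print(f1_pos, f1_neg)
```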

In [None]:
prueba = ['tus maquinas 3d no tienen niun brillo maldito surenio', 'era mentira te quiero mucho uwu']
prueba_transformado = vectorizacion.transform(prueba)

svc.predict(prueba_transformado)

Having trained and evaluated the baseline model, we use GridSearchCV to look for hyperparameters that improve performance.

In [None]:
from sklearn.model_selection import GridSearchCV

parametros = {
    'kernel': ('linear', 'rbf', 'poly'),
    'C': [0.001, 0.01, 0.1, 10],
    'gamma': ('scale', 'auto')
}

In [None]:
svc_final = GridSearchCV(grid_svc, parametros, cv=5, scoring='roc_auc')
svc_final
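
To get a sense of the cost of this search, the sketch below enumerates the combinations in the grid and counts the model fits implied by 5-fold cross-validation. It only reproduces the arithmetic and does not call scikit-learn.

```python
from itertools import product

# Same grid as in the notebook.
grid = {
    'kernel': ('linear', 'rbf', 'poly'),
    'C': [0.001, 0.01, 0.1, 10],
    'gamma': ('scale', 'auto'),
}

# Every combination of one value per hyperparameter.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

n_folds = 5
total_fits = len(combos) * n_folds  # the final refit on all training data adds one more

print(len(combos), total_fits)
```

Two observations that may be worth checking: the `linear` kernel does not use `gamma`, so some combinations are redundant, and the `C` values skip 1, which is the SVC default.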

In [None]:
svc_final.fit(train_x_vect, y_train)

In [None]:
svc_final.best_params_

In [None]:
svc_final.best_score_

In [None]:
svc_final.score(test_x_vect, y_test)

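After tuning the hyperparameters, the score increases from 75% to 82%. Note that svc.score reports accuracy, whereas svc_final.score reports ROC AUC (the scoring metric passed to GridSearchCV), so the two figures are not directly comparable.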
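
Since the baseline is scored with accuracy and the grid search with `scoring='roc_auc'`, it helps to see how the two metrics can differ on the same data. The sketch below computes both by hand; the toy labels and decision scores are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented labels and scores; 0.5 is used as the decision threshold.
y_toy = [0, 0, 1, 1, 1]
scores_toy = [0.2, 0.6, 0.55, 0.7, 0.9]
preds_toy = [1 if s >= 0.5 else 0 for s in scores_toy]

acc_toy = accuracy(y_toy, preds_toy)
auc_toy = roc_auc(y_toy, scores_toy)
print(acc_toy, auc_toy)
```

For a like-for-like comparison, both models would need to be evaluated with the same metric on the same test set.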

In [None]:
f1_score(y_test, svc_final.predict(test_x_vect), average=None, labels=[1, 0])

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
# A separate name keeps the imported confusion_matrix function usable when the cell is re-run
matriz_confusion = confusion_matrix(y_test, svc_final.predict(test_x_vect))

In [None]:
matriz_confusion

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
sns.heatmap(matriz_confusion, annot=True, fmt='.0f')
plt.show()
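
To read the heatmap, recall that scikit-learn places true labels on the rows and predicted labels on the columns, with labels in sorted order. The sketch below rebuilds such a matrix by hand for invented labels; `confusion_counts` is a hypothetical helper, not a library function.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Invented labels: three negatives and four positives.
y_true_cm = [0, 0, 0, 1, 1, 1, 1]
y_pred_cm = [0, 0, 1, 1, 1, 1, 0]

cm = confusion_counts(y_true_cm, y_pred_cm)

# With labels (0, 1): first row is (TN, FP), second row is (FN, TP).
(tn, fp), (fn, tp) = cm
print(cm)
```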

In [None]:
print(classification_report(y_test, svc_final.predict(test_x_vect), labels=[1, 0]))

The evaluation metrics above summarize the performance achieved by the tuned model.

In [None]:
# positive = 1, negative = 0
prueba = ['parece cualquer cosa menos persona']
prueba_transformado = vectorizacion.transform(prueba)

svc_final.predict(prueba_transformado)

In [None]:
import joblib 

In [None]:
# export the best model
joblib.dump(svc_final, 'best_model_espanol')

In [None]:
# load the best model
model = joblib.load('best_model_espanol')

In [None]:
# positive = 1, negative = 0
prueba = ['ojala todos fueran asi de buenos']
prueba_transformado = vectorizacion.transform(prueba)

resultado = model.predict(prueba_transformado)

# predict returns an array; take the single prediction explicitly
if resultado[0] == 1:
    print('Positivo')
else:
    print('Negativo')

In [None]:
joblib.dump(vectorizacion, 'vect_fit_espanol')
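
The notebook saves the model and the fitted vectorizer as two separate files, and both must be loaded together at prediction time. One alternative, sketched here under the assumption that a single artifact is preferable, is to wrap both steps in a scikit-learn pipeline and persist that. The training sentences, the `linear` kernel, and the file name `pipeline_demo.joblib` are invented for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented training data: 1 = positive, 0 = negative.
train_texts = [
    'me gusta mucho', 'muy buen servicio', 'excelente trabajo',
    'no me gusta nada', 'muy mal servicio', 'pesimo trabajo',
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text and then classifies it.
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))
pipeline.fit(train_texts, train_labels)

# Save and reload the whole pipeline as a single file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pipeline_demo.joblib')
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

new_texts = ['buen trabajo', 'mal trabajo']
original_pred = pipeline.predict(new_texts)
restored_pred = restored.predict(new_texts)
print(original_pred, restored_pred)
```

The restored pipeline accepts raw strings directly, so there is no separate `transform` call to forget.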

In [7]:
# import the library
import joblib

# load the best model and the fitted vectorizer
model = joblib.load('best_model_espanol')
vectorizacion_espanol = joblib.load('vect_fit_espanol')

# list of sentences to classify
prueba = ['no sabes nada']
prueba_transformado = vectorizacion_espanol.transform(prueba)

# model prediction
resultado = model.predict(prueba_transformado)

# positive = 1, negative = 0
if resultado[0] == 1:
    print('Positivo')
else:
    print('Negativo')

Negativo
