# **COMPETITION**
One of the best-known Natural Language Inference datasets is SNLI. Given two sentences A and B, the task is to decide whether B is entailed by A ("entailment"), B contradicts A ("contradiction"), or B is neutral with respect to A ("neutral"). A is called the premise and B the hypothesis.

Gururangan et al., 2018 showed that this dataset contains biases, caused for example by the heuristics that human annotators use when writing these sentence pairs (A, B). To demonstrate this, they built a model that can classify the pair (A, B) into one of the dataset's three classes without ever seeing the premise A.

In this assignment we try to predict the class of each hypothesis without looking at the premise. The goal is to replicate the results published in Gururangan et al., 2018 and, if possible, improve on them with more complex classifiers.
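
As a minimal sketch of what "hypothesis-only" means in practice, the snippet below trains a bag-of-words classifier on a handful of invented sentence pairs and simply discards the premise. The toy data and the variable names (`pairs`, `hypotheses`, `model`) are assumptions made for illustration; they are not taken from the competition files.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy (premise, hypothesis, label) triples, invented for illustration only.
pairs = [
    ("A man plays guitar on stage.", "A man is playing an instrument.", "entailment"),
    ("A man plays guitar on stage.", "Nobody is playing music.", "contradiction"),
    ("A man plays guitar on stage.", "A man is playing for his family.", "neutral"),
    ("Two kids run on the beach.", "Kids are outside.", "entailment"),
    ("Two kids run on the beach.", "Nobody is on the beach.", "contradiction"),
    ("Two kids run on the beach.", "The kids are running to their mother.", "neutral"),
]

# Hypothesis-only setup: the premise is dropped before training.
hypotheses = [hypothesis for _, hypothesis, _ in pairs]
labels = [label for _, _, label in pairs]

# Bag-of-words features followed by a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(hypotheses, labels)

# The model only ever sees the hypothesis text.
prediction = model.predict(["Nobody is outside."])
print(prediction)
```

If such a model scores well above chance (one third for three balanced classes), the hypotheses alone carry information about the label, which is the kind of bias the paper describes.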

## IMPORTS

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from string import punctuation

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
!pip install -U -q kaggle
!mkdir -p ~/.kaggle
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"camilasolguardia","key":"<redacted>"}'}

In [None]:
!cp kaggle.json ~/.kaggle/

In [None]:
!kaggle competitions download -c sesgos-en-el-dataset-de-snli

Downloading test_data.hdf5.zip to /content
  0% 0.00/200k [00:00<?, ?B/s]
100% 200k/200k [00:00<00:00, 75.1MB/s]
Downloading train_data.hdf5.zip to /content
 35% 9.00M/25.6M [00:00<00:00, 34.7MB/s]
100% 25.6M/25.6M [00:00<00:00, 73.5MB/s]
Downloading valid_data.hdf5.zip to /content
  0% 0.00/440k [00:00<?, ?B/s]
100% 440k/440k [00:00<00:00, 137MB/s]
Downloading submission_sample.csv to /content
  0% 0.00/152k [00:00<?, ?B/s]
100% 152k/152k [00:00<00:00, 160MB/s]


In [None]:
!unzip test_data.hdf5.zip
!unzip train_data.hdf5.zip
!unzip valid_data.hdf5.zip

Archive:  test_data.hdf5.zip
  inflating: test_data.hdf5          
Archive:  train_data.hdf5.zip
  inflating: train_data.hdf5         
Archive:  valid_data.hdf5.zip
  inflating: valid_data.hdf5         


In [None]:
df_train = pd.read_hdf("train_data.hdf5")
df_valid = pd.read_hdf("valid_data.hdf5")
df_test = pd.read_hdf("test_data.hdf5")

In [None]:
text_train = df_train["text"].tolist()
labels_train = df_train["gold_label"].tolist()

text_valid = df_valid["text"].tolist()
labels_valid = df_valid["gold_label"].tolist()

text_test = df_test["text"].tolist()

## PRE PROCESAMIENTO DE DATOS

In [None]:
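def pre_procesamiento(text, lemmatize=True, stopWords=True, stemming=True, filtrado=True):
  """Tokenize each sentence; optionally lemmatize, drop stopwords, stem and keep alphabetic tokens."""
  # Build the helpers once instead of once per sentence (or per token).
  lemmatizer = WordNetLemmatizer()
  stemmer = PorterStemmer()
  stop_words = set(stopwords.words('english'))

  procesamiento = []

  for em in text:
    tok = word_tokenize(em)

    # Lemmatize every token as a verb.
    lem = [lemmatizer.lemmatize(x, pos='v') for x in tok] if lemmatize else tok

    # Check the stopWords flag; the nltk `stopwords` module itself is always truthy.
    stop = [x for x in lem if x not in stop_words] if stopWords else lem

    stem = [stemmer.stem(x) for x in stop] if stemming else stop

    # Keep alphabetic tokens only; when disabled, pass the stemmed tokens through.
    alpha = [x for x in stem if x.isalpha()] if filtrado else stem

    procesamiento.append(" ".join(alpha))

  return procesamiento

In [None]:
# Real booleans: the strings 'true' and 'false' are both truthy,
# so with them every step would always run.
f = False
t = True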

We now try different preprocessing variants

In [None]:
pre_procesado = pre_procesamiento(text_train,t,f,t,f)
print(pre_procesado[0:50])

['insid hous', 'two guy yard', 'they yardwork', 'A man swim', 'two young white men near bush', 'two men outsid', 'A young boy scooter past old woman', 'two young men outsid', 'two young white men talk work bush', 'men stand empti field', 'men outdoor', 'men wait ride pick', 'two femal near bush', 'two white young male near bush', 'two young male cut bush', 'two men hang bu stop', 'two men outsid', 'two men stand bush tri find cat', 'man home alon day', 'two men outsid', 'men talk plan week', 'the man pick appl', 'the man outdoor', 'the man garden', 'the peopl stranger', 'the friend togeth', 'they watch tv', 'sever women oper giant pulley system', 'men wear hard hat', 'these men construct worker', 'stock broker examin market', 'sever men wear protect gear', 'sever men work construct', 'the construct worker enjoy lunch sit truck', 'the construct worker work heavi equip', 'the men hard hat readi demolish condemn build', 'the detect oper tool crime scene', 'worker use simpl machin', 'the t

In [None]:
pre_procesado_2 = pre_procesamiento(text_train,t,t,f,f)

In [None]:
pre_procesado_3 = pre_procesamiento(text_train,f,f,f,f)

In [None]:
pre_procesado_4 = pre_procesamiento(text_train,t,t,t,t)

After trying the different preprocessing variants, we concluded that the one giving the best results is "pre_procesado", which applies tokenization, lemmatization and stemming. Because we work with whole sentences rather than isolated words, it is not necessary to apply every possible filter to each sentence, since valuable information is lost.

To avoid re-running the same code (it takes a long time), we save the lists with the preprocessed text to Drive, since they are fixed data, and then load them back.

In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

%cd drive
%cd Shareddrives
%cd Redes neuronales

Mounted at /content/drive
/content/drive
/content/drive/Shareddrives
/content/drive/Shareddrives/Redes neuronales


***UNCOMMENT***

In [None]:

with open('pre_procesado.txt', 'w') as file: # change the file name
  file_lines = "\n".join(pre_procesado)
  file.write(file_lines)

# with open('pre_procesado_2.txt', 'w') as file: # change the file name
#   file_lines_2 = "\n".join(pre_procesado_2)
#   file.write(file_lines_2)

# with open('pre_procesado_3.txt', 'w') as file: # change the file name
#   file_lines_3 = "\n".join(pre_procesado_3)
#   file.write(file_lines_3)

# with open('pre_procesado_4.txt', 'w') as file: # change the file name
#   file_lines_4 = "\n".join(pre_procesado_4)
#   file.write(file_lines_4)

We then load the data from Drive

In [None]:
# Use a context manager so the file is closed after reading.
with open("pre_procesado.txt", "r") as file:
  pre_procesado = file.read().split("\n")

# with open("pre_procesado_2.txt", "r") as file:
#   pre_procesado_2 = file.read().split("\n")

# with open("pre_procesado_3.txt", "r") as file:
#   pre_procesado_3 = file.read().split("\n")

# with open("pre_procesado_4.txt", "r") as file:
#   pre_procesado_4 = file.read().split("\n")
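
The save-and-reload step only works if the restored list still lines up one-to-one with the labels. The sketch below checks that round trip on a few stand-in sentences, writing to a temporary directory rather than to Drive; the sentences and the file name `pre_procesado_demo.txt` are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical preprocessed sentences standing in for `pre_procesado`.
# An empty string is included because a sentence can lose all its tokens.
sentences = ["two men outsid", "men wait ride pick", "", "the man garden"]

path = os.path.join(tempfile.mkdtemp(), "pre_procesado_demo.txt")

# Write one sentence per line, the same way the notebook does.
with open(path, "w") as fh:
    fh.write("\n".join(sentences))

# Read the file back and split on newlines.
with open(path, "r") as fh:
    restored = fh.read().split("\n")

# The restored list must match the original, element by element.
assert restored == sentences
print(len(restored), "sentences restored")
```

The round trip relies on no sentence containing a newline character, which holds here because each sentence is a space-joined list of tokens.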



## COUNT VECTORIZERS AND DATA VALIDATION

We preprocess the validation text

In [None]:
pre_procesado_valid = pre_procesamiento(text_valid,t,f,t,f)

We export all the data to Drive and load it back

In [None]:
with open('pre_procesado_valid.txt', 'w') as file: # change the file name
  file_lines_valid = "\n".join(pre_procesado_valid)
  file.write(file_lines_valid)

In [None]:
# Use a context manager so the file is closed after reading.
with open("pre_procesado_valid.txt", "r") as file:
  pre_procesado_valid = file.read().split("\n")

### NAIVE BAYES MODEL

#### VARYING ALPHA

In [None]:
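alphas = []
score_trains = []
score_valids = []

# cv4 and cv4_train (a fitted CountVectorizer and its training matrix)
# are assumed to come from an earlier cell that is not shown here.
cv_valid_prueba = cv4.transform(pre_procesado_valid)

for a in range(1,200):
  alpha = a/100

  clf_prueba = MultinomialNB(alpha=alpha)
  clf_prueba.fit(cv4_train, labels_train)
  score_train = clf_prueba.score(cv4_train, labels_train)

  # Score the classifier fitted in this iteration on the validation matrix.
  score_valid = clf_prueba.score(cv_valid_prueba, labels_valid)

  alphas.append(alpha)
  score_trains.append(score_train)
  score_valids.append(score_valid)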


We swept a range of alpha values and chose the one that gave the highest score, which is 1e-10. Next we vary df_min and df_max.
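
To make the role of alpha concrete, here is a small sketch of the additive (Lidstone) smoothing that a multinomial Naive Bayes model applies to its word likelihoods. The function name `smoothed_likelihood` and the counts are hypothetical; only the formula, (count + alpha) / (class total + alpha × vocabulary size), is the standard one.

```python
def smoothed_likelihood(count, class_total, vocab_size, alpha):
    """Additively smoothed estimate of P(word | class)."""
    return (count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical counts: the word never appears in this class.
class_total = 1000   # total word occurrences observed in the class
vocab_size = 500     # number of distinct features in the vocabulary

for alpha in (1e-10, 0.01, 1.0):
    p = smoothed_likelihood(0, class_total, vocab_size, alpha)
    print(f"alpha={alpha:g}: P(unseen word | class) = {p:.3e}")
```

With a very small alpha, a single n-gram never seen with a class drives that class's probability close to zero, so the model behaves almost like a hard filter; larger values soften that effect.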

#### VARYING DFMIN AND DFMAX

In [None]:
df_mins = []
score_trains_b = []
score_valids_b = []

for df_min in range(2,20):

  df_mins.append(df_min)

  cv_prueba2 = CountVectorizer(max_df = 1.0, min_df=df_min, ngram_range = (1,4)) 
  cv_train_prueba2 = cv_prueba2.fit_transform(pre_procesado)

  clf_prueba2 = MultinomialNB(alpha=1e-10)
  clf_prueba2.fit(cv_train_prueba2, labels_train)

  print("Para df_min = ",df_min)
  
  score1 = clf_prueba2.score(cv_train_prueba2, labels_train)
  score_trains_b.append(score1)
  print(score1)

  cv_valid_prueba2 = cv_prueba2.transform(pre_procesado_valid)

  score2 = clf_prueba2.score(cv_valid_prueba2, labels_valid)
  score_valids_b.append(score2)
  print(score2,"\n")

In [None]:
best_score = np.max(score_valids_b)  # avoid shadowing the built-in max
idx = score_valids_b.index(best_score)
df_min = df_mins[idx]
print(df_min)

4


We therefore find that df_min = 4 gives the highest score

Now we try different values of df_max

In [None]:
df_maxs = []
score_trains_c = []
score_valids_c = []

for i in range(1,10):

  df_max = i/10

  df_maxs.append(df_max)

  cv_prueba2 = CountVectorizer(max_df = df_max, min_df=4, ngram_range = (1,4)) 
  cv_train_prueba2 = cv_prueba2.fit_transform(pre_procesado)

  clf_prueba2 = MultinomialNB(alpha=1e-10)
  clf_prueba2.fit(cv_train_prueba2, labels_train)

  print("Para df_max = ",df_max)
  
  score1 = clf_prueba2.score(cv_train_prueba2, labels_train)
  score_trains_c.append(score1)
  print(score1)

  cv_valid_prueba2 = cv_prueba2.transform(pre_procesado_valid)

  score2 = clf_prueba2.score(cv_valid_prueba2, labels_valid)
  score_valids_c.append(score2)
  print(score2,"\n")

In [None]:
best_score = np.max(score_valids_c)  # avoid shadowing the built-in max
idx = score_valids_c.index(best_score)
df_max = df_maxs[idx]
print(df_max)

0.3


We take df_max = 0.3
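
The effect of the two document-frequency thresholds can be seen on a tiny corpus. This is only a sketch: the four sentences are invented, and the thresholds (`min_df=2`, `max_df=0.8`) are chosen to suit such a small example rather than to match the values selected above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration.
docs = [
    "the man is outside",
    "the man is sleeping",
    "the woman is outside",
    "the dog sleeps",
]

# No filtering: every token seen at least once is kept.
full = CountVectorizer().fit(docs)

# min_df=2 drops terms that occur in fewer than 2 documents;
# max_df=0.8 drops terms that occur in more than 80% of the documents.
filtered = CountVectorizer(min_df=2, max_df=0.8).fit(docs)

print(sorted(full.vocabulary_))      # all eight distinct tokens
print(sorted(filtered.vocabulary_))  # → ['is', 'man', 'outside']
```

Here "the" is removed for being too frequent (it appears in every document), while "dog", "sleeps", "sleeping" and "woman" are removed for being too rare.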

#### VARYING N_GRAMS

Now we try different n_grams
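
For reference, a word n-gram range of (1, n) means that every contiguous run of 1 up to n tokens becomes a feature. The helper below, `word_ngrams`, is a hypothetical pure-Python sketch of that idea and not the vectorizer's own implementation.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "two men stand outsid".split()
print(word_ngrams(tokens, 1, 3))
# → ['two', 'men', 'stand', 'outsid', 'two men', 'men stand',
#    'stand outsid', 'two men stand', 'men stand outsid']
```

Raising the upper bound adds longer phrases as features, which grows the vocabulary quickly; the document-frequency thresholds then decide how many of those phrases survive.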

In [None]:
grams = []
score_trains_d = []
score_valids_d = []

for i in range(1,7):

  grams.append(i)

  cv_prueba2 = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,i)) 
  cv_train_prueba2 = cv_prueba2.fit_transform(pre_procesado)

  clf_prueba2 = MultinomialNB(alpha=1e-10)
  clf_prueba2.fit(cv_train_prueba2, labels_train)

  print("Para n_grams = ","(",1,",",i,")")
  
  score1 = clf_prueba2.score(cv_train_prueba2, labels_train)
  score_trains_d.append(score1)
  print(score1)

  cv_valid_prueba2 = cv_prueba2.transform(pre_procesado_valid)

  score2 = clf_prueba2.score(cv_valid_prueba2, labels_valid)
  score_valids_d.append(score2)
  print(score2,"\n")

In [None]:
best_score = np.max(score_valids_d)  # avoid shadowing the built-in max
idx = score_valids_d.index(best_score)
gram = grams[idx]
print("(",1,",",gram,")")

( 1 , 3 )


FINALLY, WE END UP WITH

In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3))
cv_train = cv.fit_transform(pre_procesado)

clf = MultinomialNB(alpha=1e-10)
clf.fit(cv_train, labels_train)

score_train = clf.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)

score_valid = clf.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Score train:  0.6745818369141212
Score valid:  0.6333062385693965


Since TF-IDF is an alternative to CountVectorizer, we also try this representation with the same parameters
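
As a rough sketch of what changes when counts are replaced by TF-IDF weights, the snippet below computes the weights by hand for a toy corpus. It assumes the smoothed formulation idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each document, which to our understanding matches scikit-learn's defaults; the documents and helper names (`idf`, `tfidf`) are hypothetical.

```python
import math
from collections import Counter

# Toy tokenized documents, invented for illustration.
docs = [["man", "outsid"], ["man", "sleep"], ["dog", "sleep", "sleep"]]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf(docs[0])
print(vec)
```

In the first document, "outsid" (present in one document) gets a larger weight than "man" (present in two), whereas a plain count vectorizer would give both a value of 1.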

In [None]:
tfidf = TfidfVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3))
tfidf_train = tfidf.fit_transform(pre_procesado)

clf_tfidf = MultinomialNB(alpha=1e-10)
clf_tfidf.fit(tfidf_train, labels_train)

score_train = clf_tfidf.score(tfidf_train, labels_train)
print("Score train: ",score_train)

tfidf_valid = tfidf.transform(pre_procesado_valid)

score_valid = clf_tfidf.score(tfidf_valid, labels_valid)
print("Score valid: ",score_valid)

The result is worse, so we keep the CountVectorizer (cv)

#### PREDICTION

In [None]:
pre_procesado_test = pre_procesamiento(text_test,t,f,t,f)

In [None]:
cv_test = cv.transform(pre_procesado_test)
test_labels = clf.predict(cv_test)

In [None]:
test_labels

array(['neutral', 'neutral', 'neutral', ..., 'contradiction',
       'entailment', 'neutral'], dtype='<U13')

We build the submission

In [None]:
df_test = pd.DataFrame(data=test_labels, columns=["pred_labels"],)
df_test.head()
df_test.index.names = ["pairID"]
df_test.to_csv("MNB_test_entrega.csv")

#### METRICS

**Explanation of the metrics**


---


*   PRECISION: the fraction of positive predictions that are correct. It is the ability of a classifier not to label as positive an instance that is actually negative. For each class, it is defined as the ratio of true positives to the sum of true and false positives.



> $Precision = \frac{TP}{TP + FP}$


*   RECALL: the fraction of actual positives that were captured (the ability to find all positive instances). For each class, it is defined as the ratio of true positives to the sum of true positives and false negatives.

> $Recall = \frac{TP}{TP + FN}$


*   F1-SCORE: the harmonic mean of precision and recall, so that the best score is 1 and the worst is 0. In general, F1 scores are lower than accuracy figures, since they incorporate precision and recall in their computation. As a rule of thumb, the weighted average of F1, not global accuracy, should be used to compare classifier models.

> $F_1 = \frac{2 \cdot (Recall \cdot Precision)}{Recall + Precision}$


*  ACCURACY: the fraction of samples predicted correctly.


> $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
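
The formulas above can be sketched directly from a confusion matrix. The matrix below is hypothetical (it is not the model's actual output), with rows as true classes and columns as predicted classes; for a multiclass problem, accuracy reduces to the number of correct predictions divided by the total.

```python
# Hypothetical 3-class confusion matrix: rows are true classes,
# columns are predicted classes.
classes = ["contradiction", "entailment", "neutral"]
confusion = [
    [50, 10, 15],
    [5, 60, 10],
    [20, 10, 45],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
accuracy = correct / total

def per_class_metrics(k):
    """Precision, recall and F1 for class index k (one-vs-rest)."""
    tp = confusion[k][k]
    fp = sum(confusion[i][k] for i in range(len(classes))) - tp  # column minus diagonal
    fn = sum(confusion[k]) - tp                                  # row minus diagonal
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k, name in enumerate(classes):
    p, r, f1 = per_class_metrics(k)
    print(f"{name:>13}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading precision down a column and recall across a row also explains why swapping the true and predicted labels exchanges the two metrics.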







In [None]:
from sklearn.metrics import classification_report

cv_metrics = cv.transform(pre_procesado_valid)
valid_labels_metrics = clf.predict(cv_metrics)

# classification_report expects (y_true, y_pred), in that order.
print(classification_report(labels_valid, valid_labels_metrics))

               precision    recall  f1-score   support

contradiction       0.64      0.63      0.64      3363
   entailment       0.63      0.65      0.64      3235
      neutral       0.63      0.62      0.63      3244

     accuracy                           0.63      9842
    macro avg       0.63      0.63      0.63      9842
 weighted avg       0.63      0.63      0.63      9842



In [None]:
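from sklearn.metrics import roc_auc_score

# Multiclass ROC AUC needs class probabilities rather than predicted labels.
# multi_class must be 'ovr' or 'ovo' ('ovr' is assumed here); the macro
# averaging goes in the `average` argument.
valid_proba_metrics = clf.predict_proba(cv_metrics)
roc_auc_score(labels_valid, valid_proba_metrics, multi_class='ovr', average='macro', labels=clf.classes_)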

### MLP


---



In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler 

In [None]:
pre_procesado_valid = pre_procesamiento(text_valid,t,f,t,f)

Building on the count vectorizer used in the Naive Bayes model, we now vary the number of neurons per layer and the number of layers.
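
To get a feel for how these choices change model size, the sketch below counts the weights and biases of a fully connected network for a few layer configurations. The helper `mlp_parameter_count` and the vocabulary size of 50,000 are assumptions for illustration; the real input size depends on the fitted CountVectorizer.

```python
def mlp_parameter_count(n_inputs, hidden_layer_sizes, n_outputs):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

# Hypothetical vocabulary size; three output classes as in SNLI.
n_features = 50_000
n_classes = 3

for hidden in [(1,), (5,), (20, 10, 20), (12, 10, 7)]:
    print(hidden, mlp_parameter_count(n_features, hidden, n_classes))
```

With a sparse bag-of-words input, almost all parameters sit in the first layer, so the width of the first hidden layer matters far more for model size than the number of later layers.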

In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3))
cv_train = cv.fit_transform(pre_procesado)

mlp = MLPClassifier(hidden_layer_sizes=(1,), verbose=True)
mlp.fit(cv_train, labels_train)

score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)

score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.99720045
Iteration 2, loss = 0.91668726
Iteration 3, loss = 0.88696200
Iteration 4, loss = 0.86753999
Iteration 5, loss = 0.85504816
Iteration 6, loss = 0.84659931
Iteration 7, loss = 0.84051112
Iteration 8, loss = 0.83596831
Iteration 9, loss = 0.83241753
Iteration 10, loss = 0.82943128
Iteration 11, loss = 0.82702101
Iteration 12, loss = 0.82506415
Iteration 13, loss = 0.82334217
Iteration 14, loss = 0.82187562
Iteration 15, loss = 0.82059351
Iteration 16, loss = 0.81925172
Iteration 17, loss = 0.81838219
Iteration 18, loss = 0.81739827
Iteration 19, loss = 0.81664172
Iteration 20, loss = 0.81595356
Iteration 21, loss = 0.81531153
Iteration 22, loss = 0.81467307
Iteration 23, loss = 0.81416435
Iteration 24, loss = 0.81368833
Iteration 25, loss = 0.81321469
Iteration 26, loss = 0.81294278
Iteration 27, loss = 0.81250320
Iteration 28, loss = 0.81215198
Iteration 29, loss = 0.81189985
Iteration 30, loss = 0.81162561
Iteration 31, loss = 0.81134769
Iteration 32, los

In [None]:

cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3)) 
cv_train = cv.fit_transform(pre_procesado)
 

mlp = MLPClassifier(hidden_layer_sizes=(2,), verbose=True)
mlp.fit(cv_train, labels_train)
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)


score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.93394752
Iteration 2, loss = 0.77869477
Iteration 3, loss = 0.72926482
Iteration 4, loss = 0.70273332
Iteration 5, loss = 0.68574617
Iteration 6, loss = 0.67409205
Iteration 7, loss = 0.66537279
Iteration 8, loss = 0.65848169
Iteration 9, loss = 0.65323614
Iteration 10, loss = 0.64865416
Iteration 11, loss = 0.64497116
Iteration 12, loss = 0.64182682
Iteration 13, loss = 0.63888522
Iteration 14, loss = 0.63663741
Iteration 15, loss = 0.63439642
Iteration 16, loss = 0.63245626
Iteration 17, loss = 0.63075709
Iteration 18, loss = 0.62920775
Iteration 19, loss = 0.62768400
Iteration 20, loss = 0.62646107
Iteration 21, loss = 0.62519474
Iteration 22, loss = 0.62412709
Iteration 23, loss = 0.62319300
Iteration 24, loss = 0.62224030
Iteration 25, loss = 0.62129759
Iteration 26, loss = 0.62046382
Iteration 27, loss = 0.61975611
Iteration 28, loss = 0.61913543
Iteration 29, loss = 0.61842304
Iteration 30, loss = 0.61786291
Iteration 31, loss = 0.61738518
Iteration 32, los

In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3)) 
cv_train = cv.fit_transform(pre_procesado_4)
 

mlp = MLPClassifier(hidden_layer_sizes=(5,), verbose=True)
mlp.fit(cv_train, labels_train)
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)


score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.87009557
Iteration 2, loss = 0.75103693
Iteration 3, loss = 0.70542984
Iteration 4, loss = 0.67629371
Iteration 5, loss = 0.65483755
Iteration 6, loss = 0.63804874
Iteration 7, loss = 0.62372831
Iteration 8, loss = 0.61185874
Iteration 9, loss = 0.60108433
Iteration 10, loss = 0.59144470
Iteration 11, loss = 0.58272127
Iteration 12, loss = 0.57476244
Iteration 13, loss = 0.56754637
Iteration 14, loss = 0.56044298
Iteration 15, loss = 0.55413458
Iteration 16, loss = 0.54790458
Iteration 17, loss = 0.54240817
Iteration 18, loss = 0.53696733
Iteration 19, loss = 0.53181738
Iteration 20, loss = 0.52718271
Iteration 21, loss = 0.52250006
Iteration 22, loss = 0.51824213
Iteration 23, loss = 0.51453187
Iteration 24, loss = 0.51039556
Iteration 25, loss = 0.50677852
Iteration 26, loss = 0.50327662
Iteration 27, loss = 0.50024362
Iteration 28, loss = 0.49689507
Iteration 29, loss = 0.49415049
Iteration 30, loss = 0.49100060
Iteration 31, loss = 0.48831759
Iteration 32, los

In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=5, ngram_range = (1,2)) 
cv_train = cv.fit_transform(pre_procesado)
 

mlp = MLPClassifier(hidden_layer_sizes=(20, 10, 20), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train) 
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)


score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.85677202
Validation score: 0.638804
Iteration 2, loss = 0.75636023
Validation score: 0.640042
Iteration 3, loss = 0.70981105
Validation score: 0.636784
Iteration 4, loss = 0.66875336
Validation score: 0.634854
Iteration 5, loss = 0.62816451
Validation score: 0.630031
Iteration 6, loss = 0.58793373
Validation score: 0.625498
Iteration 7, loss = 0.55022498
Validation score: 0.619055
Iteration 8, loss = 0.51650944
Validation score: 0.617234
Iteration 9, loss = 0.48693911
Validation score: 0.615323
Iteration 10, loss = 0.46190664
Validation score: 0.607987
Iteration 11, loss = 0.44145323
Validation score: 0.605075
Iteration 12, loss = 0.42364999
Validation score: 0.604765
Iteration 13, loss = 0.40909509
Validation score: 0.604165
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.7023829243474763
Score valid:  0.6509855720382036


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=3, ngram_range = (1,2)) 
cv_train = cv.fit_transform(pre_procesado)
 

mlp = MLPClassifier(hidden_layer_sizes=(15, 12, 10), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train) 
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)


score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.85459893
Validation score: 0.635983
Iteration 2, loss = 0.73622758
Validation score: 0.633052
Iteration 3, loss = 0.67220727
Validation score: 0.629375
Iteration 4, loss = 0.61906064
Validation score: 0.623623
Iteration 5, loss = 0.57228817
Validation score: 0.621057
Iteration 6, loss = 0.53011223
Validation score: 0.617398
Iteration 7, loss = 0.49322361
Validation score: 0.610317
Iteration 8, loss = 0.46115860
Validation score: 0.610117
Iteration 9, loss = 0.43473492
Validation score: 0.606822
Iteration 10, loss = 0.41219059
Validation score: 0.604092
Iteration 11, loss = 0.39342410
Validation score: 0.599996
Iteration 12, loss = 0.37784673
Validation score: 0.600433
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.6924205494687522
Score valid:  0.6506807559439138


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=3, ngram_range = (1,2)) 
cv_train = cv.fit_transform(pre_procesado)
 

mlp = MLPClassifier(hidden_layer_sizes=(12, 10, 7), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train) 
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)


score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.85805147
Validation score: 0.637585
Iteration 2, loss = 0.73751656
Validation score: 0.638349
Iteration 3, loss = 0.67375583
Validation score: 0.633507
Iteration 4, loss = 0.62347461
Validation score: 0.629758
Iteration 5, loss = 0.58065443
Validation score: 0.625043
Iteration 6, loss = 0.54253095
Validation score: 0.621312
Iteration 7, loss = 0.51026796
Validation score: 0.618326
Iteration 8, loss = 0.48269056
Validation score: 0.614012
Iteration 9, loss = 0.45891650
Validation score: 0.613339
Iteration 10, loss = 0.43905635
Validation score: 0.609444
Iteration 11, loss = 0.42242319
Validation score: 0.606786
Iteration 12, loss = 0.40831937
Validation score: 0.606677
Iteration 13, loss = 0.39584183
Validation score: 0.604856
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.7230976742323437
Score valid:  0.6540337329811015


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=3, ngram_range = (1,2)) 
cv_train = cv.fit_transform(pre_procesado)
 

mlp = MLPClassifier(hidden_layer_sizes=(12, 12, 12), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train) 
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)


score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.85582369
Validation score: 0.636292
Iteration 2, loss = 0.73749189
Validation score: 0.633380
Iteration 3, loss = 0.67719290
Validation score: 0.629321
Iteration 4, loss = 0.62806858
Validation score: 0.622331
Iteration 5, loss = 0.58515473
Validation score: 0.621293
Iteration 6, loss = 0.54709797
Validation score: 0.615341
Iteration 7, loss = 0.51466848
Validation score: 0.610426
Iteration 8, loss = 0.48673426
Validation score: 0.608697
Iteration 9, loss = 0.46320572
Validation score: 0.608206
Iteration 10, loss = 0.44363388
Validation score: 0.605166
Iteration 11, loss = 0.42666767
Validation score: 0.601544
Iteration 12, loss = 0.41280916
Validation score: 0.601089
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.6908405492139135
Score valid:  0.6501727291200975


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=3, ngram_range = (1,2)) 
cv_train = cv.fit_transform(pre_procesado)
 

mlp = MLPClassifier(hidden_layer_sizes=(7, 7, 7), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train) 
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)


score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.88866420
Validation score: 0.636802
Iteration 2, loss = 0.75243579
Validation score: 0.638222
Iteration 3, loss = 0.70136850
Validation score: 0.632761
Iteration 4, loss = 0.66515840
Validation score: 0.628920
Iteration 5, loss = 0.63619114
Validation score: 0.625171
Iteration 6, loss = 0.61113067
Validation score: 0.620929
Iteration 7, loss = 0.58926239
Validation score: 0.620565
Iteration 8, loss = 0.56904794
Validation score: 0.616852
Iteration 9, loss = 0.55112037
Validation score: 0.614395
Iteration 10, loss = 0.53461262
Validation score: 0.611591
Iteration 11, loss = 0.52000365
Validation score: 0.612611
Iteration 12, loss = 0.50670957
Validation score: 0.607878
Iteration 13, loss = 0.49442487
Validation score: 0.607732
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.7133300689702876
Score valid:  0.6501727291200975


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=1, ngram_range = (1,2)) 
cv_train = cv.fit_transform(pre_procesado)
 

mlp = MLPClassifier(hidden_layer_sizes=(5, 10, 15), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train) 
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)


score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.86584154
Validation score: 0.635364
Iteration 2, loss = 0.67744953
Validation score: 0.628902
Iteration 3, loss = 0.55013919
Validation score: 0.615814
Iteration 4, loss = 0.47724106
Validation score: 0.606913
Iteration 5, loss = 0.43656704
Validation score: 0.602345
Iteration 6, loss = 0.41071406
Validation score: 0.599687
Iteration 7, loss = 0.39328619
Validation score: 0.599559
Iteration 8, loss = 0.38059579
Validation score: 0.596210
Iteration 9, loss = 0.36997706
Validation score: 0.594736
Iteration 10, loss = 0.36164675
Validation score: 0.591368
Iteration 11, loss = 0.35506994
Validation score: 0.592606
Iteration 12, loss = 0.34993716
Validation score: 0.592624
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.7198011529633196
Score valid:  0.6509855720382036


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3)) 
cv_train = cv.fit_transform(pre_procesado)
 
mlp = MLPClassifier(hidden_layer_sizes=(20, 10, 5), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train)
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)

score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.93886749
Validation score: 0.630941
Iteration 2, loss = 0.76159472
Validation score: 0.633963
Iteration 3, loss = 0.69289265
Validation score: 0.628957
Iteration 4, loss = 0.63808862
Validation score: 0.624261
Iteration 5, loss = 0.58653529
Validation score: 0.617635
Iteration 6, loss = 0.53886322
Validation score: 0.611646
Iteration 7, loss = 0.49656900
Validation score: 0.606331
Iteration 8, loss = 0.46017891
Validation score: 0.601416
Iteration 9, loss = 0.42942408
Validation score: 0.602854
Iteration 10, loss = 0.40431553
Validation score: 0.598285
Iteration 11, loss = 0.38430886
Validation score: 0.595209
Iteration 12, loss = 0.36855369
Validation score: 0.593534
Iteration 13, loss = 0.35521633
Validation score: 0.594299
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.7186998855045899
Score valid:  0.6496647022962813


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3)) 
cv_train = cv.fit_transform(pre_procesado)
 
mlp = MLPClassifier(hidden_layer_sizes=(5, 10, 5), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train)
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)

score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.90106436
Validation score: 0.632470
Iteration 2, loss = 0.75679875
Validation score: 0.634927
Iteration 3, loss = 0.70413542
Validation score: 0.631396
Iteration 4, loss = 0.67041190
Validation score: 0.627628
Iteration 5, loss = 0.64548832
Validation score: 0.623168
Iteration 6, loss = 0.62460063
Validation score: 0.620438
Iteration 7, loss = 0.60549756
Validation score: 0.616579
Iteration 8, loss = 0.58822641
Validation score: 0.616524
Iteration 9, loss = 0.57202665
Validation score: 0.614304
Iteration 10, loss = 0.55622245
Validation score: 0.612047
Iteration 11, loss = 0.54144943
Validation score: 0.610226
Iteration 12, loss = 0.52791109
Validation score: 0.609844
Iteration 13, loss = 0.51543155
Validation score: 0.607114
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.7126784098790062
Score valid:  0.6480390164600691


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=1, ngram_range = (1,2)) 
cv_train = cv.fit_transform(pre_procesado_4)
 

mlp = MLPClassifier(hidden_layer_sizes=(20, 10, 20), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train) 
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)


score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.85350072
Validation score: 0.636129
Iteration 2, loss = 0.65649157
Validation score: 0.627792
Iteration 3, loss = 0.51342125
Validation score: 0.616743
Iteration 4, loss = 0.42544471
Validation score: 0.608679
Iteration 5, loss = 0.37131110
Validation score: 0.606713


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3)) 
cv_train = cv.fit_transform(pre_procesado)
 
mlp = MLPClassifier(hidden_layer_sizes=(8, 10, 15), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train)
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)

score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.86230517
Validation score: 0.634345
Iteration 2, loss = 0.74182943
Validation score: 0.631705
Iteration 3, loss = 0.68578117
Validation score: 0.627009
Iteration 4, loss = 0.64293308
Validation score: 0.622677
Iteration 5, loss = 0.60524602
Validation score: 0.617835
Iteration 6, loss = 0.57123795
Validation score: 0.613175
Iteration 7, loss = 0.54072924
Validation score: 0.611246
Iteration 8, loss = 0.51279211
Validation score: 0.610499
Iteration 9, loss = 0.48864524
Validation score: 0.603728
Iteration 10, loss = 0.46720862
Validation score: 0.601234
Iteration 11, loss = 0.44841218
Validation score: 0.597994
Iteration 12, loss = 0.43244022
Validation score: 0.597776
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.6913447658851005
Score valid:  0.6488518593781751
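
In the log above the loss falls at every iteration while the internal validation score peaks at the first iteration and then declines, which is the usual signature of overfitting. With `early_stopping=True`, scikit-learn holds out part of the training data (10% by default) to compute that validation score, so it is not the same quantity as the `Score valid` line, which uses the separate validation set. The following is a minimal sketch of the stopping rule, assuming it counts consecutive epochs that fail to beat the best score by `tol` and stops once that count exceeds `n_iter_no_change`. The function name `early_stopping_epoch` is hypothetical and not part of scikit-learn.

```python
import math


def early_stopping_epoch(valid_scores, tol=1e-4, n_iter_no_change=10):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None when the stopping rule is never triggered.
    """
    best_score = -math.inf
    best_epoch = None
    no_improvement = 0
    for epoch, score in enumerate(valid_scores, start=1):
        # An epoch counts as "no improvement" unless it beats the best score by at least tol
        if score < best_score + tol:
            no_improvement += 1
        else:
            no_improvement = 0
        # Remember which epoch produced the best validation score so far
        if score > best_score:
            best_score, best_epoch = score, epoch
        # Stop once the counter exceeds the patience
        if no_improvement > n_iter_no_change:
            return epoch, best_epoch
    return None, best_epoch


# Validation scores copied from the (8, 10, 15) run above
scores = [0.634345, 0.631705, 0.627009, 0.622677, 0.617835, 0.613175,
          0.611246, 0.610499, 0.603728, 0.601234, 0.597994, 0.597776]
stop_epoch, best_epoch = early_stopping_epoch(scores)
print(stop_epoch, best_epoch)
```

Under this rule the run stops after iteration 12 with its best internal score at iteration 1, which agrees with the log. The fitted model keeps the weights from the best-scoring epoch, which would explain why the final `Score valid` is higher than the last logged validation scores.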


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3)) 
cv_train = cv.fit_transform(pre_procesado)
 
mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 20), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train)
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)

score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.85713168
Validation score: 0.634964
Iteration 2, loss = 0.73754514
Validation score: 0.631159
Iteration 3, loss = 0.67935475
Validation score: 0.628010
Iteration 4, loss = 0.63432322
Validation score: 0.622823
Iteration 5, loss = 0.59394904
Validation score: 0.618381
Iteration 6, loss = 0.55649041
Validation score: 0.615050
Iteration 7, loss = 0.52232541
Validation score: 0.609425
Iteration 8, loss = 0.49157387
Validation score: 0.604929
Iteration 9, loss = 0.46559709
Validation score: 0.603728
Iteration 10, loss = 0.44358627
Validation score: 0.601453
Iteration 11, loss = 0.42464907
Validation score: 0.599377
Iteration 12, loss = 0.40867284
Validation score: 0.598649
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.6919855033156341
Score valid:  0.64681975208291


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3)) 
cv_train = cv.fit_transform(pre_procesado)
 
mlp = MLPClassifier(hidden_layer_sizes=(10, 5, 10), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train)
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)

score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.89859313
Validation score: 0.633762
Iteration 2, loss = 0.75531210
Validation score: 0.634636
Iteration 3, loss = 0.70000307
Validation score: 0.629394
Iteration 4, loss = 0.66129434
Validation score: 0.625535
Iteration 5, loss = 0.62910848
Validation score: 0.621803
Iteration 6, loss = 0.60040827
Validation score: 0.618599
Iteration 7, loss = 0.57372504
Validation score: 0.614449
Iteration 8, loss = 0.54974992
Validation score: 0.613321
Iteration 9, loss = 0.52708187
Validation score: 0.610645
Iteration 10, loss = 0.50603912
Validation score: 0.605585
Iteration 11, loss = 0.48661654
Validation score: 0.604711
Iteration 12, loss = 0.46846078
Validation score: 0.602326
Iteration 13, loss = 0.45162919
Validation score: 0.599742
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.7140272349813512
Score valid:  0.6496647022962813


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3)) 
cv_train = cv.fit_transform(pre_procesado)
 
mlp = MLPClassifier(hidden_layer_sizes=(5, 10, 5), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train)
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)

score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.88027813
Validation score: 0.632306
Iteration 2, loss = 0.75241797
Validation score: 0.633562
Iteration 3, loss = 0.70125143
Validation score: 0.628047
Iteration 4, loss = 0.66769703
Validation score: 0.623733
Iteration 5, loss = 0.64115542
Validation score: 0.617780
Iteration 6, loss = 0.61887540
Validation score: 0.614012
Iteration 7, loss = 0.59913632
Validation score: 0.612702
Iteration 8, loss = 0.58170440
Validation score: 0.610026
Iteration 9, loss = 0.56534193
Validation score: 0.607824
Iteration 10, loss = 0.55096908
Validation score: 0.607223
Iteration 11, loss = 0.53758861
Validation score: 0.602035
Iteration 12, loss = 0.52571315
Validation score: 0.603400
Iteration 13, loss = 0.51463179
Validation score: 0.601307
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.714637027706433
Score valid:  0.6497663076610445


In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3)) 
cv_train = cv.fit_transform(pre_procesado)
 
mlp = MLPClassifier(hidden_layer_sizes=(5, 3, 5), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train)
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)

score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.88786967
Validation score: 0.627628
Iteration 2, loss = 0.75214470
Validation score: 0.630413
Iteration 3, loss = 0.69969151
Validation score: 0.627391
Iteration 4, loss = 0.66631220
Validation score: 0.622167
Iteration 5, loss = 0.64138466
Validation score: 0.619237
Iteration 6, loss = 0.62076960
Validation score: 0.617926
Iteration 7, loss = 0.60309708
Validation score: 0.612429
Iteration 8, loss = 0.58749642
Validation score: 0.612829
Iteration 9, loss = 0.57330341
Validation score: 0.609262
Iteration 10, loss = 0.56070773
Validation score: 0.608333
Iteration 11, loss = 0.54893362
Validation score: 0.606495
Iteration 12, loss = 0.53837562
Validation score: 0.604092
Iteration 13, loss = 0.52840656
Validation score: 0.603892
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.7114187783394343
Score valid:  0.6478358057305426


The best validation score (0.6546) was obtained with three hidden layers of 12, 10 and 7 neurons, combined with `min_df=3` and `ngram_range=(1,2)` in the vectorizer.
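
The architecture search in the cells above was done by editing and re-running one cell per configuration. A minimal sketch of the same comparison as a loop is given here, assuming synthetic stand-in data from `make_classification` instead of `cv_train` and `cv_valid`, and a fixed `random_state` that the original cells do not set. The names `architectures`, `results` and `best_layers` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): the notebook would use cv_train / cv_valid here
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The hidden-layer configurations tried in the cells above
architectures = [(5, 10, 5), (8, 10, 15), (10, 10, 20),
                 (10, 5, 10), (5, 3, 5), (12, 10, 7)]

results = {}
for layers in architectures:
    mlp = MLPClassifier(hidden_layer_sizes=layers, max_iter=50,
                        early_stopping=True, random_state=0)
    mlp.fit(X_train, y_train)
    # Store (train accuracy, validation accuracy) for each architecture
    results[layers] = (mlp.score(X_train, y_train), mlp.score(X_valid, y_valid))

# Pick the architecture with the highest validation accuracy
best_layers = max(results, key=lambda layers: results[layers][1])
for layers, (train_acc, valid_acc) in results.items():
    print(layers, round(train_acc, 4), round(valid_acc, 4))
print("best:", best_layers)
```

Fixing `random_state` matters when comparing architectures: the two `(5, 10, 5)` runs above give validation scores of 0.6480 and 0.6498, a gap about as large as the differences between architectures under the same vectorizer.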

In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=4, ngram_range = (1,3)) 
cv_train = cv.fit_transform(pre_procesado)
 

mlp = MLPClassifier(hidden_layer_sizes=(12, 10, 7), max_iter=50, early_stopping= True, verbose= True)
mlp.fit(cv_train, labels_train) 
  
score_train = mlp.score(cv_train, labels_train)
print("Score train: ",score_train)

cv_valid = cv.transform(pre_procesado_valid)


score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ",score_valid)

Iteration 1, loss = 0.86572901
Validation score: 0.634381
Iteration 2, loss = 0.74197337
Validation score: 0.630668
Iteration 3, loss = 0.68597138
Validation score: 0.628374
Iteration 4, loss = 0.64020358
Validation score: 0.620693
Iteration 5, loss = 0.59852202
Validation score: 0.618163
Iteration 6, loss = 0.56060213
Validation score: 0.613066
Iteration 7, loss = 0.52645102
Validation score: 0.606986
Iteration 8, loss = 0.49667676
Validation score: 0.603218
Iteration 9, loss = 0.47034844
Validation score: 0.601362
Iteration 10, loss = 0.44815297
Validation score: 0.600269
Iteration 11, loss = 0.42904452
Validation score: 0.597503
Iteration 12, loss = 0.41284684
Validation score: 0.595191
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.6921202037981895
Score valid:  0.6494614915667547


In [None]:
# Final configuration: unigrams and bigrams that appear in at least 3 documents and in at most 30% of them
cv = CountVectorizer(max_df=0.3, min_df=3, ngram_range=(1, 2))
cv_train = cv.fit_transform(pre_procesado)

mlp = MLPClassifier(hidden_layer_sizes=(12, 10, 7), max_iter=50, early_stopping=True, verbose=True)
mlp.fit(cv_train, labels_train)

score_train = mlp.score(cv_train, labels_train)
print("Score train: ", score_train)

# Reuse the vocabulary fitted on the training set for the validation set
cv_valid = cv.transform(pre_procesado_valid)

score_valid = mlp.score(cv_valid, labels_valid)
print("Score valid: ", score_valid)

Iteration 1, loss = 0.86537060
Validation score: 0.637876
Iteration 2, loss = 0.74250840
Validation score: 0.636839
Iteration 3, loss = 0.68432273
Validation score: 0.632069
Iteration 4, loss = 0.63800574
Validation score: 0.627519
Iteration 5, loss = 0.59687667
Validation score: 0.622076
Iteration 6, loss = 0.55954784
Validation score: 0.616433
Iteration 7, loss = 0.52600430
Validation score: 0.613303
Iteration 8, loss = 0.49649077
Validation score: 0.609917
Iteration 9, loss = 0.47179323
Validation score: 0.605330
Iteration 10, loss = 0.45031168
Validation score: 0.603054
Iteration 11, loss = 0.43200108
Validation score: 0.603255
Iteration 12, loss = 0.41644393
Validation score: 0.600397
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Score train:  0.6881920464825881
Score valid:  0.6546433651696809


#### PREDICTION

In [None]:
pre_procesado_test = pre_procesamiento(text_test,t,f,t,f)

In [None]:
cv = CountVectorizer(max_df = 0.3, min_df=3, ngram_range = (1,2)) 
cv_train = cv.fit_transform(pre_procesado)

In [None]:
cv_test = cv.transform(pre_procesado_test)
test_labels = mlp.predict(cv_test)

In [None]:
test_labels

array(['contradiction', 'neutral', 'neutral', ..., 'contradiction',
       'entailment', 'contradiction'], dtype='<U13')

We build the submission file.

In [None]:
# Collect the predictions; the default integer index is written out as the pairID column
df_test = pd.DataFrame(data=test_labels, columns=["pred_labels"])
df_test.index.names = ["pairID"]
df_test.to_csv("MLP_test_entrega.csv")
df_test.head()
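
A quick way to check the layout of the submission file is to write it to an in-memory buffer and read it back. This is a minimal sketch with three made-up predictions. It assumes, as in the cell above, that the row position serves as `pairID`, which may or may not match the identifiers the competition expects.

```python
import io

import pandas as pd

# Toy predictions (assumption): stand-ins for the array returned by mlp.predict
test_labels = ["contradiction", "neutral", "entailment"]

df_test = pd.DataFrame({"pred_labels": test_labels})
df_test.index.name = "pairID"

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_test.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back to confirm the header and the labels survive the round trip
round_trip = pd.read_csv(io.StringIO(csv_text), index_col="pairID")
```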

#### METRICS

In [None]:
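from sklearn.metrics import classification_report

cv_metrics = cv.transform(pre_procesado_valid)
valid_labels_metrics = mlp.predict(cv_metrics)

# classification_report expects (y_true, y_pred), in that order.
# The report below was produced with the two arguments swapped, so its precision
# and recall columns are exchanged and its support column counts predictions.
print(classification_report(labels_valid, valid_labels_metrics))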

               precision    recall  f1-score   support

contradiction       0.67      0.64      0.66      3388
   entailment       0.67      0.67      0.67      3317
      neutral       0.63      0.65      0.64      3137

     accuracy                           0.65      9842
    macro avg       0.65      0.65      0.65      9842
 weighted avg       0.65      0.65      0.65      9842
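
The scikit-learn metric functions take the true labels first and the predictions second. A minimal sketch with six made-up labels shows what changes when the two are swapped: precision and recall trade places, and the support column counts predictions instead of true examples. `precision_recall_fscore_support` is used here because it returns the same per-class numbers that `classification_report` prints.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels (assumption): invented for illustration only
y_true = ["entailment", "entailment", "neutral", "neutral",
          "contradiction", "contradiction"]
y_pred = ["entailment", "neutral", "neutral", "neutral",
          "contradiction", "entailment"]
labels = ["contradiction", "entailment", "neutral"]

# Correct order: (y_true, y_pred)
p, r, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=labels)

# Swapped order: (y_pred, y_true)
p_sw, r_sw, f1_sw, support_sw = precision_recall_fscore_support(
    y_pred, y_true, labels=labels)

for name, row in zip(labels, zip(p, r, p_sw, r_sw, support, support_sw)):
    print(name, [round(float(v), 2) for v in row])
```

The F1 score and the overall accuracy are unaffected by the swap, so only the precision, recall and support columns need to be read with care.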



In [None]:
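from sklearn.metrics import roc_auc_score

# For multiclass targets, roc_auc_score needs per-class probabilities rather than predicted labels
valid_proba = mlp.predict_proba(cv_metrics)
roc_auc_score(labels_valid, valid_proba, multi_class='ovr', labels=mlp.classes_)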
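
For a three-class problem, the one-vs-rest ROC AUC is computed from a matrix of class probabilities with one column per class, not from hard label predictions. A minimal sketch on synthetic data follows, assuming a small `MLPClassifier` as a stand-in for the trained model and invented string labels in place of the real ones.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): replaces the vectorized hypotheses
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
names = ["contradiction", "entailment", "neutral"]
y_str = [names[i] for i in y]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y_str, test_size=0.25, stratify=y_str, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(12, 10, 7), max_iter=200, random_state=0)
clf.fit(X_train, y_train)

# One probability column per class, in the order given by clf.classes_
valid_proba = clf.predict_proba(X_valid)
auc = roc_auc_score(y_valid, valid_proba, multi_class="ovr", labels=clf.classes_)
print(round(auc, 3))
```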

## CONCLUSIONS

Comparing the two classifiers, the MLP reaches a higher accuracy than the Multinomial Naive Bayes model. One likely reason is that Naive Bayes is a purely probabilistic approach to classification: it uses how often each word occurs in each class to estimate the probability that a sentence belongs to that class. The *naive* part of the method is the assumption that the features (words) are independent of each other given the class; since we are working with sentences, the words are clearly not independent. An MLP, in contrast, does not make that assumption: its hidden layers apply nonlinear activation functions to weighted combinations of the input features, and its output layer then estimates the class probabilities, much as a multinomial logistic regression would on those learned features.
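
The independence assumption can be made concrete with a small hand-written multinomial Naive Bayes. This is a minimal sketch on six invented hypotheses, with add-one smoothing. The helper names `log_posterior` and `predict` are illustrative, and the sketch is not the `MultinomialNB` implementation used earlier in the notebook.

```python
import math
from collections import Counter, defaultdict

# Toy training hypotheses (assumption): invented for illustration only
train = [
    ("a person is outdoors", "entailment"),
    ("an animal is outside", "entailment"),
    ("nobody is outside", "contradiction"),
    ("the man is sleeping", "contradiction"),
    ("a tall person is waiting", "neutral"),
    ("the sad man is outside", "neutral"),
]

word_counts = defaultdict(Counter)  # per-class word frequencies
class_counts = Counter()            # number of sentences per class
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}


def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(label | text) under the naive independence assumption."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        if word in vocab:
            # Each word contributes its own term, independently of its neighbours
            score += math.log((word_counts[label][word] + alpha)
                              / (total + alpha * len(vocab)))
    return score


def predict(text):
    return max(class_counts, key=lambda label: log_posterior(text, label))


print(predict("nobody is sleeping"))
print(predict("sleeping is nobody"))
```

Because the score is a sum of per-word terms, shuffling the words of a sentence leaves the prediction unchanged, which is exactly the information about word relationships that the model discards.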

Regarding the bias in the dataset, we conclude that the sample is not sufficiently representative, that is, it contains biases. The hypothesis-only models reach about 65% accuracy on validation, well above the roughly one in three expected from uniform guessing among three classes, so the hypotheses alone carry information about the label. At the same time, across the different models and parameter settings we tried, validation accuracy never exceeded 70%, which suggests that this ceiling depends on the dataset rather than on the classifier used. In addition, as we saw in class, the test score is itself biased and tends to fall slightly below the validation (CV) score: when many models are tried and always validated on the same set, a few of them will probably look quite good, and the one chosen is the one that best fits that validation set rather than unseen data.
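
The selection effect described above can be illustrated with a small simulation. This is a minimal sketch under strong assumptions: every candidate model has the same true accuracy (0.65), each measured accuracy is an independent draw over a finite sample, and the numbers of models, samples and trials are arbitrary choices rather than values from this notebook.

```python
import random


def simulate(n_models=10, n_samples=300, true_acc=0.65, n_trials=100, seed=0):
    """Average validation score of the selected model vs. its score on fresh data."""
    rng = random.Random(seed)

    def measured_accuracy():
        # Accuracy observed on n_samples examples for a model with accuracy true_acc
        hits = sum(rng.random() < true_acc for _ in range(n_samples))
        return hits / n_samples

    best_valid, test_of_best = [], []
    for _ in range(n_trials):
        # Evaluate every candidate on the same validation set and keep the best one
        valid_scores = [measured_accuracy() for _ in range(n_models)]
        best_valid.append(max(valid_scores))
        # The selected model is then scored once on unseen test data
        test_of_best.append(measured_accuracy())
    return sum(best_valid) / n_trials, sum(test_of_best) / n_trials


mean_best_valid, mean_test = simulate()
print(round(mean_best_valid, 3), round(mean_test, 3))
```

Even though no model is truly better than the others, the winner's validation score is optimistic on average, while its test score stays close to the true accuracy.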