#### Exercise 2: "Other text embeddings" (optional)

Try another method of building the text embedding from the word embeddings and report the results.

In [None]:
import numpy as np
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

import os.path 
import glob
import pandas as pd

import nltk
nltk.download('all')
import tqdm


# Solution

## Functions needed to load the train and test datasets

In [12]:
import os.path 
import glob
import pandas as pd



def get_text(file_path):
    # Read the whole review at once; UTF-8 is assumed for the IMDb files.
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()

def load_dataset(is_train, limit):
    dataset_path = '../data/aclImdb/'
    cls = ['pos', 'neg']
    data_path = os.path.join(dataset_path, 'train' if is_train else 'test')
    data = []
    limit_per_class = limit//2
    for c in cls:
        class_path = os.path.join(data_path,c)
        regex_glob = os.path.join(class_path,"*.txt")
        for i, file_path in enumerate(glob.glob(regex_glob)):
            if i == limit_per_class:
                break
            data.append((get_text(file_path),c))
    return pd.DataFrame(data=data,columns=['text', 'class'])
            
 

## Loading the dataset

In [13]:
       
df = load_dataset(is_train=True, limit=100000).sample(frac=1)
df.head(10)

|       | text | class |
|-------|------|-------|
| 1246 | When an actor has to play the role of an actor... | pos |
| 6837 | I first saw this movie on a local station on t... | pos |
| 23009 | Something to Sing About was produced at Grand ... | neg |
| 24118 | I would have enjoyed this movie slightly more ... | neg |
| 9846 | And this somebody is me. And not only me, as I... | pos |
| 14266 | Scotty (Grant Cramer, who would go on to star ... | neg |
| 24830 | "The Cat's Meow" contains a few scenes that bo... | neg |
| 12798 | I really don't get how people made this film a... | neg |
| 1587 | Laurence Olivier, Merle Oberon, Ralph Richards... | pos |
| 11198 | On first watching this film it is hard to know... | pos |


## Loading the embedding model

In [16]:
from gensim.test.utils import get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

embeddings_path = '../models/glove.6B/glove.6B.300d.txt'

# Convert the GloVe file to word2vec text format in a temporary file.
word2vec_glove_file = get_tmpfile("glove.6B.300d.word2vec.txt")
glove2word2vec(embeddings_path, word2vec_glove_file)

# Load the converted 300-dimensional vectors.
model_en = KeyedVectors.load_word2vec_format(word2vec_glove_file)



## Defining the embedding functions

Our features (the independent variable) are texts. How do we obtain a feature array from a text, given our word embeddings?

First, we need to compute the sequence of words that appear in the text.

There are several ways to compute the text features. All of them start by looking up the embedding of each word and then:
1. Compute the sum.
2. Compute the mean.
3. Compute another aggregate (max, min, etc.), or a combination of them.
4. Concatenate them. In this case we must decide how many embeddings to concatenate, adding padding (zero embeddings) to fill short reviews and truncating long ones. Remember that all samples must have the same number of features for a Machine Learning algorithm to use them.
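The four options listed above can be sketched on a toy example. This is only an illustration: `toy_vectors`, `DIM`, `lookup` and the `embed_*` helpers are hypothetical names, and the three-dimensional vectors stand in for the real 300-dimensional GloVe vectors.

```python
import numpy as np

# Toy 3-dimensional word vectors; a hypothetical stand-in for model_en.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 1.0, 1.0]),
    "bad": np.array([-1.0, 0.0, 2.0]),
}
DIM = 3

def lookup(words):
    """Return an (n_known_words, DIM) array, skipping out-of-vocabulary words."""
    vecs = [toy_vectors[w] for w in words if w in toy_vectors]
    return np.array(vecs).reshape(-1, DIM)

def embed_sum(words):
    # Option 1: element-wise sum (zeros when no word is known).
    return lookup(words).sum(axis=0)

def embed_mean(words):
    # Option 2: element-wise mean, guarding against an empty lookup.
    vecs = lookup(words)
    return vecs.mean(axis=0) if len(vecs) else np.zeros(DIM)

def embed_max(words):
    # Option 3: another aggregate, here the element-wise maximum.
    vecs = lookup(words)
    return vecs.max(axis=0) if len(vecs) else np.zeros(DIM)

def embed_concat(words, max_words=4):
    # Option 4: concatenate the first max_words vectors.
    vecs = lookup(words)[:max_words]                   # truncate long texts
    padding = np.zeros((max_words - len(vecs), DIM))   # zero-pad short texts
    return np.concatenate([vecs, padding]).ravel()

words = ["good", "movie", "unknown"]
print(embed_sum(words))
print(embed_mean(words))
print(embed_max(words))
print(embed_concat(words).shape)
```

Whatever the length of the text, each helper returns a vector of fixed size: `DIM` for the aggregates and `max_words * DIM` for the concatenation.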

<strong>We will use option 1 (the sum).</strong>

In [18]:
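def get_text_embedding(text):
    # Tokenize in lowercase (the GloVe 6B vocabulary is lowercase).
    words = nltk.word_tokenize(text.lower())
    # Keep only the words present in the embedding vocabulary.
    vectors = [model_en[w] for w in words if w in model_en]
    if not vectors:
        # No known word: fall back to a zero vector of the same size.
        return np.zeros((model_en.vector_size,))
    return np.array(vectors).sum(axis=0)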

In [19]:


def get_df_embeddings(df):
    embs = []
    for i, row in tqdm.tqdm(df.iterrows(), total=len(df)):
        embs.append(get_text_embedding(row['text']))
    embs=np.array(embs)
    return embs

## Computing the embeddings

In [20]:
X = get_df_embeddings(df)

100%|██████████| 25000/25000 [00:18<00:00, 1379.53it/s]


Next we compute the targets (the dependent variable):

1 = Positive

0 = Negative

In [24]:
y = (df['class']=='pos').astype(int)

In [25]:
X.shape, y.shape

((25000, 300), (25000,))

## Splitting the training DataFrame into train and validation sets

In [26]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

((20000, 300), (5000, 300), (20000,), (5000,))

## Building the models

Finally, we train and evaluate a few simple classifiers: one based on KNN, another based on RandomForest, and an ensemble that averages their predicted probabilities.

In [27]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=11)
knn_clf.fit(X_train, y_train)

In [28]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)

In [29]:
pred_rf = rf_clf.predict(X_valid)
pred_knn = knn_clf.predict(X_valid)

probs_rf = rf_clf.predict_proba(X_valid)
probs_knn = knn_clf.predict_proba(X_valid)
ensamble_pred = ((probs_rf[:,1]+probs_knn[:,1])/2.0)>0.5

pred_rf.shape, pred_knn.shape, ensamble_pred.shape

((5000,), (5000,), (5000,))

In [30]:
((probs_rf[:,1]>0.5) == pred_rf).all()

True

In [31]:
((probs_knn[:,1]>0.5) == pred_knn).all()

True
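The ensemble in the cells above averages the positive-class probabilities of the two classifiers and thresholds the result at 0.5, a scheme usually called soft voting. The two checks confirm that applying the same threshold to each model on its own reproduces its `predict` output. A minimal sketch with made-up probabilities (`p_a` and `p_b` are hypothetical values, not outputs of the models trained here) shows how the average behaves when the models disagree:

```python
import numpy as np

# Hypothetical positive-class probabilities from two classifiers
# for four validation samples (not taken from the notebook's models).
p_a = np.array([0.9, 0.4, 0.6, 0.2])
p_b = np.array([0.7, 0.7, 0.3, 0.1])

# Soft voting: average the probabilities, then threshold at 0.5.
p_avg = (p_a + p_b) / 2.0
soft_vote = (p_avg > 0.5).astype(int)

# Each model's own hard decision, for comparison.
hard_a = (p_a > 0.5).astype(int)
hard_b = (p_b > 0.5).astype(int)

print(hard_a, hard_b, soft_vote)
```

On the second and third samples the two models disagree, and the one whose probability is further from 0.5 decides the vote.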

## Model evaluation

In [32]:
from sklearn.metrics import accuracy_score

print(f'Random Forest acc: {accuracy_score(y_valid,pred_rf)}')
print(f'KNN acc: {accuracy_score(y_valid,pred_knn)}')
print(f'Ensemble acc: {accuracy_score(y_valid,ensamble_pred)}')

Random Forest acc: 0.7406
KNN acc: 0.686
Ensemble acc: 0.7336


## Comparison of results

| Model | Mean | Sum |
|------------|------------|------------|
| Random Forest acc | 0.7748 | 0.7406 |
| KNN acc | 0.7194 | 0.686 |
| Ensemble acc | 0.7576 | 0.7336 |

<strong>Conclusion:</strong> Judging only by accuracy on the validation set, the models built on the mean of the word embeddings are slightly better than those built on the sum, by about 2.4 to 3.4 percentage points depending on the classifier.
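One possible explanation, not verified in this notebook, is that the sum grows with the number of tokens, so review length leaks into the features, whereas the mean removes that scale. The sketch below illustrates the effect with random vectors; `word_vecs`, `short_text` and `long_text` are hypothetical names and the values are not GloVe embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(5, 4))  # hypothetical embeddings of a 5-word text

short_text = word_vecs                    # the 5 words once
long_text = np.tile(word_vecs, (10, 1))   # the same words repeated 10 times

# The mean is identical for both texts...
same_mean = np.allclose(short_text.mean(axis=0), long_text.mean(axis=0))
# ...while the sum is 10 times larger for the long one.
scaled_sum = np.allclose(long_text.sum(axis=0), 10 * short_text.sum(axis=0))

print(same_mean, scaled_sum)  # True True
```

Under this assumption, distance-based models such as KNN would be the most affected, since two reviews with similar content but different lengths end up far apart in the summed representation.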