# Task 4: Opinion analysis

Alberto Ramos Sánchez

20/12/20


-------------------------

In this assignment we compare several NLP preprocessing techniques (basic, with *lemmatization*, and with *stemming*), feature extraction methods (term frequency and *TF-IDF*), and classifiers (*SVC* and *Random Forest*), all applied to the task of classifying opinions.

In the optional part, each preprocessing technique (basic, with *lemmatization*, and with *stemming*) is compared using an LSTM classifier.

The dataset is a collection of Amazon food reviews with more than 500,000 samples. Each review has several fields, such as product identifier, user identifier and name, rating, summary, and review text. Ratings range from 1 to 5, where 1 is very bad and 5 is very good.

In [1]:
import pandas as pd
import numpy as np

np.random.seed(42)

import string
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.preprocessing import LabelEncoder

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Alberto\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Alberto\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Alberto\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Alberto\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


---
## Data loading

We load the reviews and keep only the columns we need: *Text*, which contains the review, and *Score*, the rating given to the product by the reviewer.

In [2]:
df = pd.read_csv('Reviews.csv')
# Keep only the 'Score' and 'Text' columns (the default integer index is retained)
df.drop(columns=df.columns.difference(['Score', 'Text']), inplace=True)
df.head()

| | Score | Text |
|---|---|---|
| 0 | 5 | I have bought several of the Vitality canned d... |
| 1 | 1 | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 4 | This is a confection that has been around a fe... |
| 3 | 2 | If you are looking for the secret ingredient i... |
| 4 | 5 | Great taffy at a great price. There was a wid... |


To reduce execution time, a random subset of the dataset is used.

In [3]:
n_rows = 10_000

df = df.sample(n_rows)
df.reset_index(inplace=True)
df.shape

(10000, 3)

## Data preprocessing

In [4]:
df_lem = df.copy()
df_stm = df.copy()

reviews = df['Text']

### Basic data preprocessing

###### Convert all letters to lowercase

In [5]:
reviews = reviews.apply(lambda x: x.lower())
reviews[0]

'having tried a couple of other brands of gluten-free sandwich cookies, these are the best of the bunch.  they\'re crunchy and true to the texture of the other "real" cookies that aren\'t gluten-free.  some might think that the filling makes them a bit too sweet, but for me that just means i\'ve satisfied my sweet tooth sooner!  the chocolate version from glutino is just as good and has a true "chocolatey" taste - something that isn\'t there with the other gluten-free brands out there.'

###### Remove punctuation

In [6]:
reviews = reviews.apply(lambda x: ''.join([l for l in x if l not in string.punctuation]))
reviews[0]

'having tried a couple of other brands of glutenfree sandwich cookies these are the best of the bunch  theyre crunchy and true to the texture of the other real cookies that arent glutenfree  some might think that the filling makes them a bit too sweet but for me that just means ive satisfied my sweet tooth sooner  the chocolate version from glutino is just as good and has a true chocolatey taste  something that isnt there with the other glutenfree brands out there'

###### Remove *stopwords*

In [7]:
reviews = reviews.apply(lambda x: nltk.word_tokenize(x))
len(reviews[0])

83

In [8]:
stop_words = set(nltk.corpus.stopwords.words('english'))  # set gives fast membership tests
reviews = reviews.apply(lambda x: [w for w in x if w not in stop_words])
print(len(reviews[0]))
print(reviews[0])

39
['tried', 'couple', 'brands', 'glutenfree', 'sandwich', 'cookies', 'best', 'bunch', 'theyre', 'crunchy', 'true', 'texture', 'real', 'cookies', 'arent', 'glutenfree', 'might', 'think', 'filling', 'makes', 'bit', 'sweet', 'means', 'ive', 'satisfied', 'sweet', 'tooth', 'sooner', 'chocolate', 'version', 'glutino', 'good', 'true', 'chocolatey', 'taste', 'something', 'isnt', 'glutenfree', 'brands']


In [9]:
df['Text'] = reviews.apply(lambda x: ' '.join(x))
df['Text'][0]

'tried couple brands glutenfree sandwich cookies best bunch theyre crunchy true texture real cookies arent glutenfree might think filling makes bit sweet means ive satisfied sweet tooth sooner chocolate version glutino good true chocolatey taste something isnt glutenfree brands'
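
The four steps above (lowercasing, punctuation removal, tokenization, stop-word removal) can be collected into a single function. The following is a minimal sketch under stated assumptions: `basic_preprocess`, `STOP_WORDS` and `PUNCT_TABLE` are hypothetical names, the stop-word list is a tiny illustrative subset rather than NLTK's English list, and a whitespace split stands in for `nltk.word_tokenize`.

```python
import string

# Hypothetical mini stop-word list, for illustration only;
# the notebook uses NLTK's full English list.
STOP_WORDS = frozenset({"the", "of", "a", "are", "these", "and", "to", "that"})

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def basic_preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, tokenize and drop stop words."""
    cleaned = text.lower().translate(PUNCT_TABLE)
    tokens = cleaned.split()  # whitespace split stands in for nltk.word_tokenize
    return " ".join(tok for tok in tokens if tok not in stop_words)


print(basic_preprocess("These are the BEST of the bunch. They're crunchy!"))
# best bunch theyre crunchy
```

As in the notebook, removing punctuation before filtering turns contractions such as "they're" into "theyre", which no longer matches any stop word and therefore survives.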

### Stemming

The goal of stemming is to obtain a single representation for words that share the same root. The method applies a heuristic process to extract the root (stem) of each word.

In [10]:
stemmer = nltk.stem.PorterStemmer()
stem = stemmer.stem
reviews_stem = reviews.apply(lambda x: [stem(w) for w in x])

df_stm['Text'] = reviews_stem.apply(lambda x: ' '.join(x))
df_stm.iloc[0]['Text']

'tri coupl brand glutenfre sandwich cooki best bunch theyr crunchi true textur real cooki arent glutenfre might think fill make bit sweet mean ive satisfi sweet tooth sooner chocol version glutino good true chocolatey tast someth isnt glutenfre brand'

### Lemmatization

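This method uses table lookups to obtain the canonical (dictionary) form, or lemma, of each word. Because `lemmatize` is called here without a part-of-speech tag, every token is treated as a noun, which is why verb forms such as "tried" are left unchanged in the output.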

In [11]:
lemmatizer = nltk.stem.WordNetLemmatizer()
lem = lemmatizer.lemmatize
reviews_lem = reviews.apply(lambda x: [lem(w) for w in x])

df_lem['Text'] = reviews_lem.apply(lambda x: ' '.join(x))
df_lem.iloc[0]['Text']

'tried couple brand glutenfree sandwich cooky best bunch theyre crunchy true texture real cooky arent glutenfree might think filling make bit sweet mean ive satisfied sweet tooth sooner chocolate version glutino good true chocolatey taste something isnt glutenfree brand'
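
To make the contrast between the two approaches concrete, the sketch below places a rule-based stemmer next to a table-based lemmatizer. Both are toy stand-ins written for illustration: `toy_stem`, `toy_lemmatize`, `SUFFIX_RULES` and `LEMMA_TABLE` are hypothetical names, and neither function reproduces the Porter algorithm or the WordNet lookup used in the notebook.

```python
# Toy stand-ins for illustration; NOT the Porter or WordNet algorithms.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
LEMMA_TABLE = {"cookies": "cookie", "brands": "brand", "makes": "make"}


def toy_stem(word, min_stem=3):
    """Heuristic: apply the first suffix rule that leaves a long enough stem."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Table lookup: return the stored lemma, or the word itself if unknown."""
    return LEMMA_TABLE.get(word, word)


for w in ["cookies", "filling", "tried", "makes"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer may produce strings that are not words ("cooki", "tri"), while the lookup only changes words it knows and always returns a dictionary form.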

## Feature extraction

Feature extraction produces a matrix representation that records the occurrence of each term in each document of the dataset (a document-term matrix, with one row per document and one column per term).

The following example extracts term-frequency and *TF-IDF* features from the data preprocessed with *stemming*.
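
As a rough illustration of what the two vectorizers compute, the sketch below builds a small document-term matrix by hand. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as `TfidfVectorizer` defaults; the three example documents and all variable names are made up for the illustration.

```python
import math
from collections import Counter

docs = ["good taste good price", "bad taste", "good coffee"]  # made-up reviews

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

# Term-frequency matrix: one row per document, one column per vocabulary term.
counts = []
for toks in tokenized:
    term_counts = Counter(toks)
    counts.append([term_counts[term] for term in vocab])

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
# Smoothed inverse document frequency (assumed scikit-learn default).
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left as is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in counts]

print(vocab)
print(counts)
print([[round(v, 3) for v in row] for row in tfidf])
```

Both matrices share the same sparsity pattern, which matches the identical number of stored elements reported for the count and TF-IDF matrices in the cells that follow; only the values differ, with rarer terms such as "bad" weighted more heavily than frequent ones such as "good".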

### Stemming

We split the stemmed dataset into training (59.5%), validation (10.5%) and test (30%) sets.

In [12]:
df_train_stem, df_test_stem = train_test_split(df_stm, train_size=0.7)
df_train_stem, df_val_stem = train_test_split(df_train_stem, train_size=0.85)

#### Frequency

In [13]:
bowTF = CountVectorizer(max_features=500, ngram_range=(1, 1))

Xtrain_tf_stem = bowTF.fit_transform(df_train_stem['Text'])
Xval_tf_stem = bowTF.transform(df_val_stem['Text'])
Xtest_tf_stem = bowTF.transform(df_test_stem['Text'])

encoder = LabelEncoder()
Ytrain_tf_stem = encoder.fit_transform(df_train_stem['Score'])
Yval_tf_stem = encoder.transform(df_val_stem['Score'])
Ytest_tf_stem = encoder.transform(df_test_stem['Score'])

Xtrain_tf_stem

<5950x500 sparse matrix of type '<class 'numpy.int64'>'
	with 123973 stored elements in Compressed Sparse Row format>

#### TF-IDF

In [14]:
bowTFIDF = TfidfVectorizer(max_features=500, ngram_range=(1, 1))

Xtrain_tfidf_stem = bowTFIDF.fit_transform(df_train_stem['Text'])
Xval_tfidf_stem = bowTFIDF.transform(df_val_stem['Text'])
Xtest_tfidf_stem = bowTFIDF.transform(df_test_stem['Text'])

encoder = LabelEncoder()
Ytrain_tfidf_stem = encoder.fit_transform(df_train_stem['Score'])
Yval_tfidf_stem = encoder.transform(df_val_stem['Score'])
Ytest_tfidf_stem = encoder.transform(df_test_stem['Score'])

Xtrain_tfidf_stem

<5950x500 sparse matrix of type '<class 'numpy.float64'>'
	with 123973 stored elements in Compressed Sparse Row format>

### Lemmatization

The same example follows for the dataset preprocessed with *lemmatization*.

In [15]:
df_train_lem, df_test_lem = train_test_split(df_lem, train_size=0.7)
df_train_lem, df_val_lem = train_test_split(df_train_lem, train_size=0.85)

#### Frequency

In [16]:
bowTF = CountVectorizer(max_features=500, ngram_range=(1, 1))

Xtrain_tf_lem = bowTF.fit_transform(df_train_lem['Text'])
Xval_tf_lem = bowTF.transform(df_val_lem['Text'])
Xtest_tf_lem = bowTF.transform(df_test_lem['Text'])

encoder = LabelEncoder()
Ytrain_tf_lem = encoder.fit_transform(df_train_lem['Score'])
Yval_tf_lem = encoder.transform(df_val_lem['Score'])
Ytest_tf_lem = encoder.transform(df_test_lem['Score'])

Xtrain_tf_lem

<5950x500 sparse matrix of type '<class 'numpy.int64'>'
	with 117865 stored elements in Compressed Sparse Row format>

#### TF-IDF

In [17]:
bowTFIDF = TfidfVectorizer(max_features=500, ngram_range=(1, 1))

Xtrain_tfidf_lem = bowTFIDF.fit_transform(df_train_lem['Text'])
Xval_tfidf_lem = bowTFIDF.transform(df_val_lem['Text'])
Xtest_tfidf_lem = bowTFIDF.transform(df_test_lem['Text'])

encoder = LabelEncoder()
Ytrain_tfidf_lem = encoder.fit_transform(df_train_lem['Score'])
Yval_tfidf_lem = encoder.transform(df_val_lem['Score'])
Ytest_tfidf_lem = encoder.transform(df_test_lem['Score'])

Xtrain_tfidf_lem

<5950x500 sparse matrix of type '<class 'numpy.float64'>'
	with 117865 stored elements in Compressed Sparse Row format>

### Original

The same example follows for the dataset with basic preprocessing only.

In [18]:
df_train_orig, df_test_orig = train_test_split(df, train_size=0.7)
df_train_orig, df_val_orig = train_test_split(df_train_orig, train_size=0.85)

#### Frequency

In [19]:
bowTF = CountVectorizer(max_features=500, ngram_range=(1, 1))

Xtrain_tf_orig = bowTF.fit_transform(df_train_orig['Text'])
Xval_tf_orig = bowTF.transform(df_val_orig['Text'])
Xtest_tf_orig = bowTF.transform(df_test_orig['Text'])

encoder = LabelEncoder()
Ytrain_tf_orig = encoder.fit_transform(df_train_orig['Score'])
Yval_tf_orig = encoder.transform(df_val_orig['Score'])
Ytest_tf_orig = encoder.transform(df_test_orig['Score'])

Xtrain_tf_orig

<5950x500 sparse matrix of type '<class 'numpy.int64'>'
	with 113525 stored elements in Compressed Sparse Row format>

#### TF-IDF

In [20]:
bowTFIDF = TfidfVectorizer(max_features=500, ngram_range=(1, 1))

Xtrain_tfidf_orig = bowTFIDF.fit_transform(df_train_orig['Text'])
Xval_tfidf_orig = bowTFIDF.transform(df_val_orig['Text'])
Xtest_tfidf_orig = bowTFIDF.transform(df_test_orig['Text'])

encoder = LabelEncoder()
Ytrain_tfidf_orig = encoder.fit_transform(df_train_orig['Score'])
Yval_tfidf_orig = encoder.transform(df_val_orig['Score'])
Ytest_tfidf_orig = encoder.transform(df_test_orig['Score'])

Xtrain_tfidf_orig

<5950x500 sparse matrix of type '<class 'numpy.float64'>'
	with 113525 stored elements in Compressed Sparse Row format>

<hr style="border:2px solid gray"> </hr>

## Feature extraction (generic)

The function *feature_extraction* was implemented to compare the different types of preprocessing across *n-gram* ranges and numbers of selected features. In the next section we use different parameters of this function to compare the *SVC* and *Random Forest* classifiers.

In [21]:
def feature_extraction(dataset, vectorizer, max_features, ngram_range, test_size=0.3, val_size=0.15):
    data_train, data_test = train_test_split(dataset, train_size=1-test_size)
    data_train, data_val = train_test_split(data_train, train_size=1-val_size)
    
    bow = vectorizer(max_features=max_features, ngram_range=ngram_range)
    
    Xtrain = bow.fit_transform(data_train['Text'])
    Xval = bow.transform(data_val['Text'])
    Xtest = bow.transform(data_test['Text'])
    
    encoder = LabelEncoder()
    Ytrain = encoder.fit_transform(data_train['Score'])
    Yval = encoder.transform(data_val['Score'])
    Ytest = encoder.transform(data_test['Score'])
    
    return Xtrain, Ytrain, Xval, Yval, Xtest, Ytest
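
Note that `feature_extraction` draws a new random split on every call, so two configurations are scored on different validation rows and part of any accuracy difference may be split noise. One possible remedy, sketched below, is to compute the row indices once and reuse them for every configuration and every preprocessing variant. `split_indices` is a hypothetical helper, not part of the notebook; passing a fixed `random_state` to `train_test_split` would be another way to get a repeatable split.

```python
import random


def split_indices(n_rows, test_size=0.3, val_size=0.15, seed=42):
    """Return (train, val, test) row-index lists from one seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_test = round(n_rows * test_size)
    # The validation share is taken from what remains after the test split,
    # mirroring the two-step split in feature_extraction.
    n_val = round((n_rows - n_test) * val_size)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(10_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 5950 1050 3000
```

The same index lists could then be applied (for example with `DataFrame.iloc`) to the basic, stemmed and lemmatized versions of the data, so that all three are compared on identical reviews.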

<hr style="border:2px solid gray"> </hr>

## Classification

For each classifier, we look for the parameter setting and feature extractor that give the best results among those listed below. The comparison metric is *accuracy*.

In [22]:
features = [500, 1000, 2000] # Number of features to select
ngram_range = [(1, 1), (1, 2), (1, 3)] # N-gram ranges

vect = [CountVectorizer, TfidfVectorizer] # Feature extractors
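
The search over these three lists is an exhaustive grid of 2 × 3 × 3 = 18 configurations. A compact way to express it is sketched below with `itertools.product`. `grid_search` and `dummy_evaluate` are hypothetical names: in the notebook, the `evaluate` callback would wrap `feature_extraction` and `train_classifier` and return the validation accuracy, whereas the dummy scorer here only exists to make the sketch runnable.

```python
from itertools import product


def grid_search(evaluate, vectorizers, feature_counts, ngram_ranges):
    """Try every combination and return (best_config, best_score)."""
    best_config, best_score = None, float("-inf")
    # product iterates in the same order as three nested for loops.
    for config in product(vectorizers, feature_counts, ngram_ranges):
        score = evaluate(*config)
        if score > best_score:  # strict '>' keeps the first of any tied configs
            best_config, best_score = config, score
    return best_config, best_score


def dummy_evaluate(vectorizer, n_features, ngram):
    """Made-up score, used only to exercise the search loop."""
    return (1.0 if vectorizer == "tfidf" else 0.0) + ngram[1] - n_features / 1000


best_config, best_score = grid_search(
    dummy_evaluate,
    ["count", "tfidf"],
    [500, 1000, 2000],
    [(1, 1), (1, 2), (1, 3)],
)
print(best_config, best_score)
```

Keeping the loop in one function avoids repeating the same cell for each classifier and preprocessing variant.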

In [27]:
def train_classifier(classifier, Xtrain, Ytrain, Xval, Yval):
    classifier.fit(Xtrain, Ytrain)
    Ypredicted = classifier.predict(Xval)
    
    accuracy = accuracy_score(Yval, Ypredicted)*100
    precision = precision_score(Yval, Ypredicted, average='macro', zero_division=1)*100
    recall = recall_score(Yval, Ypredicted, average='macro', zero_division=1)*100
    
    return accuracy, precision, recall
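
When reading the metrics returned by `train_classifier`, it helps to see how macro averaging interacts with `zero_division=1`. The sketch below recomputes macro precision and recall by hand for a made-up case in which a classifier predicts the majority class for every sample; `macro_precision_recall` and the label lists are hypothetical and are not taken from the notebook's data.

```python
def macro_precision_recall(y_true, y_pred, zero_division=1.0):
    """Macro-averaged precision and recall over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        # A class that is never predicted has an undefined precision (0 / 0);
        # zero_division decides which value is used in its place.
        precisions.append(tp / predicted if predicted else zero_division)
        recalls.append(tp / actual if actual else zero_division)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Made-up example: six 5-star reviews and one review for each other score,
# with a classifier that always answers 5.
y_true = [5] * 6 + [1, 2, 3, 4]
y_pred = [5] * 10

p1, r1 = macro_precision_recall(y_true, y_pred, zero_division=1.0)
p0, r0 = macro_precision_recall(y_true, y_pred, zero_division=0.0)
print(round(p1, 2), round(r1, 2))  # 0.92 0.2
print(round(p0, 2), round(r0, 2))  # 0.12 0.2
```

In this example the accuracy is 60%, macro recall is 20%, and macro precision is 92% only because the four never-predicted classes each count as 1. A similar pattern (high precision, recall near 20%) is therefore what a mostly majority-class predictor would produce under this setting.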


### SVC

#### Stemming

The best parameter setting for SVC classification of the data preprocessed with *stemming* is:
- N-grams (1, 3). Number of features: 500
- Vectorizer: TfidfVectorizer
- Accuracy: 67.71
- Precision: 81.44
- Recall: 24.34

In [28]:
bestData = [None, None, None, None, None, None] # split of best solution
best_features = [None, None]
stats = [0, 0, 0] # accuracy, precision, recall of best solution
best_svc = None # best svc
best_vectorizer = None

for vec in vect:
    for ft in features:
        for ngram in ngram_range:
            print("Feature extraction: ngram {0}, n_feat {1}".format(ngram, ft))
            
            Xtrain, Ytrain, Xval, Yval, Xtest, Ytest = \
                feature_extraction(df_stm, vec, ft, ngram)
            
            svc = SVC()
            
            acc, prec, rec = train_classifier(svc, Xtrain, Ytrain, Xval, Yval)
            
            
            print("\tStats: ")
            print("\t\tVectorizer: {0}".format(vec.__name__))
            print("\t\tAccuracy: {0:.2f}".format(acc))
            print("\t\tPrecision: {0:.2f}".format(prec))
            print("\t\tRecall: {0:.2f}".format(rec))
            
            if acc > stats[0]:
                bestData = [Xtrain, Ytrain, Xval, Yval, Xtest, Ytest]
                stats = [acc, prec, rec]
                best_svc = svc
                best_features = [ft, ngram]
                best_vectorizer = vec

print("Best solution: \n")
print("Feature extraction: ngram {0}, n_feat {1}".format(best_features[1], best_features[0]))
print("\tStats: ")
print("\t\tVectorizer: {0}".format(best_vectorizer.__name__))
print("\t\tAccuracy: {0:.2f}".format(stats[0]))
print("\t\tPrecision: {0:.2f}".format(stats[1]))
print("\t\tRecall: {0:.2f}".format(stats[2]))

Feature extraction: ngram (1, 1), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 59.90
		Precision: 79.19
		Recall: 22.31
Feature extraction: ngram (1, 2), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 66.38
		Precision: 89.00
		Recall: 24.64
Feature extraction: ngram (1, 3), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 63.52
		Precision: 69.91
		Recall: 22.06
Feature extraction: ngram (1, 1), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 64.38
		Precision: 62.93
		Recall: 21.94
Feature extraction: ngram (1, 2), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 66.48
		Precision: 76.76
		Recall: 23.70
Feature extraction: ngram (1, 3), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.05
		Precision: 82.51
		Recall: 23.49
Feature extraction: ngram (1, 1), n_feat 2000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 66.10
		Precision: 79.06
		Recall: 22.86
Feature extraction: ngram (1, 2), n_f

#### Lemmatization

The best parameter setting for SVC classification of the data preprocessed with *lemmatization* is:
- N-grams (1, 2). Number of features: 1000
- Vectorizer: CountVectorizer
- Accuracy: 67.81
- Precision: 88.06
- Recall: 23.94

In [29]:
bestData = [None, None, None, None, None, None] # split of best solution
best_features = [None, None]
stats = [0, 0, 0] # accuracy, precision, recall of best solution
best_svc = None # best svc
best_vectorizer = None

for vec in vect:
    for ft in features:
        for ngram in ngram_range:
            print("Feature extraction: ngram {0}, n_feat {1}".format(ngram, ft))
            
            Xtrain, Ytrain, Xval, Yval, Xtest, Ytest = \
                feature_extraction(df_lem, vec, ft, ngram)
            
            svc = SVC()
            
            acc, prec, rec = train_classifier(svc, Xtrain, Ytrain, Xval, Yval)
            
            
            print("\tStats: ")
            print("\t\tVectorizer: {0}".format(vec.__name__))
            print("\t\tAccuracy: {0:.2f}".format(acc))
            print("\t\tPrecision: {0:.2f}".format(prec))
            print("\t\tRecall: {0:.2f}".format(rec))
            
            if acc > stats[0]:
                bestData = [Xtrain, Ytrain, Xval, Yval, Xtest, Ytest]
                stats = [acc, prec, rec]
                best_svc = svc
                best_features = [ft, ngram]
                best_vectorizer = vec

print("Best solution: \n")
print("Feature extraction: ngram {0}, n_feat {1}".format(best_features[1], best_features[0]))
print("\tStats: ")
print("\t\tVectorizer: {0}".format(best_vectorizer.__name__))
print("\t\tAccuracy: {0:.2f}".format(stats[0]))
print("\t\tPrecision: {0:.2f}".format(stats[1]))
print("\t\tRecall: {0:.2f}".format(stats[2]))

Feature extraction: ngram (1, 1), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.43
		Precision: 91.34
		Recall: 22.98
Feature extraction: ngram (1, 2), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 63.52
		Precision: 89.55
		Recall: 22.93
Feature extraction: ngram (1, 3), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 64.57
		Precision: 79.56
		Recall: 23.80
Feature extraction: ngram (1, 1), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.52
		Precision: 86.91
		Recall: 22.83
Feature extraction: ngram (1, 2), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 67.81
		Precision: 88.06
		Recall: 23.94
Feature extraction: ngram (1, 3), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 66.48
		Precision: 87.39
		Recall: 22.73
Feature extraction: ngram (1, 1), n_feat 2000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.62
		Precision: 86.37
		Recall: 23.16
Feature extraction: ngram (1, 2), n_f

#### Original

The best parameter setting for SVC classification of the data with basic preprocessing is:
- N-grams (1, 1). Number of features: 500
- Vectorizer: TfidfVectorizer
- Accuracy: 70.10
- Precision: 67.77
- Recall: 25.37

In [30]:
bestData = [None, None, None, None, None, None] # split of best solution
best_features = [None, None]
stats = [0, 0, 0] # accuracy, precision, recall of best solution
best_svc = None # best svc
best_vectorizer = None

for vec in vect:
    for ft in features:
        for ngram in ngram_range:
            print("Feature extraction: ngram {0}, n_feat {1}".format(ngram, ft))
            
            Xtrain, Ytrain, Xval, Yval, Xtest, Ytest = \
                feature_extraction(df, vec, ft, ngram)
            
            svc = SVC()
            
            acc, prec, rec = train_classifier(svc, Xtrain, Ytrain, Xval, Yval)
            
            
            print("\tStats: ")
            print("\t\tVectorizer: {0}".format(vec.__name__))
            print("\t\tAccuracy: {0:.2f}".format(acc))
            print("\t\tPrecision: {0:.2f}".format(prec))
            print("\t\tRecall: {0:.2f}".format(rec))
            
            if acc > stats[0]:
                bestData = [Xtrain, Ytrain, Xval, Yval, Xtest, Ytest]
                stats = [acc, prec, rec]
                best_svc = svc
                best_features = [ft, ngram]
                best_vectorizer = vec

print("Best solution: \n")
print("Feature extraction: ngram {0}, n_feat {1}".format(best_features[1], best_features[0]))
print("\tStats: ")
print("\t\tVectorizer: {0}".format(best_vectorizer.__name__))
print("\t\tAccuracy: {0:.2f}".format(stats[0]))
print("\t\tPrecision: {0:.2f}".format(stats[1]))
print("\t\tRecall: {0:.2f}".format(stats[2]))

Feature extraction: ngram (1, 1), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.52
		Precision: 86.40
		Recall: 23.19
Feature extraction: ngram (1, 2), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 64.67
		Precision: 92.85
		Recall: 22.46
Feature extraction: ngram (1, 3), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 63.43
		Precision: 90.41
		Recall: 22.10
Feature extraction: ngram (1, 1), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 64.57
		Precision: 87.28
		Recall: 24.37
Feature extraction: ngram (1, 2), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 67.90
		Precision: 89.88
		Recall: 23.01
Feature extraction: ngram (1, 3), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 66.57
		Precision: 88.97
		Recall: 22.89
Feature extraction: ngram (1, 1), n_feat 2000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.62
		Precision: 89.31
		Recall: 23.48
Feature extraction: ngram (1, 2), n_f

### Random Forest

#### Stemming

The best parameter setting for *Random Forest* classification of the data preprocessed with *stemming* is:

- N-grams (1, 2). Number of features: 500
- Vectorizer: TfidfVectorizer
- Accuracy: 69.52
- Precision: 63.44
- Recall: 26.15

In [31]:
bestData = [None, None, None, None, None, None] # split of best solution
best_features = [None, None]
stats = [0, 0, 0] # accuracy, precision, recall of best solution
best_rf = None # best random forest
best_vectorizer = None

for vec in vect:
    for ft in features:
        for ngram in ngram_range:
            print("Feature extraction: ngram {0}, n_feat {1}".format(ngram, ft))
            
            Xtrain, Ytrain, Xval, Yval, Xtest, Ytest = \
                feature_extraction(df_stm, vec, ft, ngram)
            
            rf = RandomForestClassifier()
            
            acc, prec, rec = train_classifier(rf, Xtrain, Ytrain, Xval, Yval)
            
            
            print("\tStats: ")
            print("\t\tVectorizer: {0}".format(vec.__name__))
            print("\t\tAccuracy: {0:.2f}".format(acc))
            print("\t\tPrecision: {0:.2f}".format(prec))
            print("\t\tRecall: {0:.2f}".format(rec))
            
            if acc > stats[0]:
                bestData = [Xtrain, Ytrain, Xval, Yval, Xtest, Ytest]
                stats = [acc, prec, rec]
                best_rf = rf
                best_features = [ft, ngram]
                best_vectorizer = vec

print("Best solution: \n")
print("Feature extraction: ngram {0}, n_feat {1}".format(best_features[1], best_features[0]))
print("\tStats: ")
print("\t\tVectorizer: {0}".format(best_vectorizer.__name__))
print("\t\tAccuracy: {0:.2f}".format(stats[0]))
print("\t\tPrecision: {0:.2f}".format(stats[1]))
print("\t\tRecall: {0:.2f}".format(stats[2]))

Feature extraction: ngram (1, 1), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 67.81
		Precision: 40.58
		Recall: 23.80
Feature extraction: ngram (1, 2), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.90
		Precision: 66.50
		Recall: 26.16
Feature extraction: ngram (1, 3), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 67.05
		Precision: 53.57
		Recall: 26.10
Feature extraction: ngram (1, 1), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.62
		Precision: 64.82
		Recall: 26.31
Feature extraction: ngram (1, 2), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 66.57
		Precision: 66.25
		Recall: 26.86
Feature extraction: ngram (1, 3), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 66.48
		Precision: 60.44
		Recall: 25.80
Feature extraction: ngram (1, 1), n_feat 2000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.24
		Precision: 60.45
		Recall: 24.83
Feature extraction: ngram (1, 2), n_f

#### Lemmatization

The best parameter setting for *Random Forest* classification of the data preprocessed with *lemmatization* is:
- N-grams (1, 3). Number of features: 1000
- Vectorizer: TfidfVectorizer
- Accuracy: 67.90
- Precision: 71.96
- Recall: 25.36

In [32]:
bestData = [None, None, None, None, None, None] # split of best solution
best_features = [None, None]
stats = [0, 0, 0] # accuracy, precision, recall of best solution
best_rf = None # best random forest
best_vectorizer = None

for vec in vect:
    for ft in features:
        for ngram in ngram_range:
            print("Feature extraction: ngram {0}, n_feat {1}".format(ngram, ft))
            
            Xtrain, Ytrain, Xval, Yval, Xtest, Ytest = \
                feature_extraction(df_lem, vec, ft, ngram)
            
            rf = RandomForestClassifier()
            
            acc, prec, rec = train_classifier(rf, Xtrain, Ytrain, Xval, Yval)
            
            
            print("\tStats: ")
            print("\t\tVectorizer: {0}".format(vec.__name__))
            print("\t\tAccuracy: {0:.2f}".format(acc))
            print("\t\tPrecision: {0:.2f}".format(prec))
            print("\t\tRecall: {0:.2f}".format(rec))
            
            if acc > stats[0]:
                bestData = [Xtrain, Ytrain, Xval, Yval, Xtest, Ytest]
                stats = [acc, prec, rec]
                best_rf = rf
                best_features = [ft, ngram]
                best_vectorizer = vec

print("Best solution: \n")
print("Feature extraction: ngram {0}, n_feat {1}".format(best_features[1], best_features[0]))
print("\tStats: ")
print("\t\tVectorizer: {0}".format(best_vectorizer.__name__))
print("\t\tAccuracy: {0:.2f}".format(stats[0]))
print("\t\tPrecision: {0:.2f}".format(stats[1]))
print("\t\tRecall: {0:.2f}".format(stats[2]))

Feature extraction: ngram (1, 1), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 66.19
		Precision: 71.25
		Recall: 24.35
Feature extraction: ngram (1, 2), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.52
		Precision: 67.60
		Recall: 26.20
Feature extraction: ngram (1, 3), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.81
		Precision: 63.86
		Recall: 26.21
Feature extraction: ngram (1, 1), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.24
		Precision: 60.03
		Recall: 25.60
Feature extraction: ngram (1, 2), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 62.86
		Precision: 74.37
		Recall: 25.36
Feature extraction: ngram (1, 3), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.33
		Precision: 46.78
		Recall: 24.58
Feature extraction: ngram (1, 1), n_feat 2000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 66.95
		Precision: 64.48
		Recall: 25.78
Feature extraction: ngram (1, 2), n_f

#### Original

The best parameter setting for *Random Forest* classification of the data with basic preprocessing is:
- N-grams (1, 1). Number of features: 1000
- Vectorizer: CountVectorizer
- Accuracy: 68.29
- Precision: 76.50
- Recall: 24.84

In [33]:
bestData = [None, None, None, None, None, None] # split of best solution
best_features = [None, None]
stats = [0, 0, 0] # accuracy, precision, recall of best solution
best_rf = None # best random forest
best_vectorizer = None

for vec in vect:
    for ft in features:
        for ngram in ngram_range:
            print("Feature extraction: ngram {0}, n_feat {1}".format(ngram, ft))
            
            Xtrain, Ytrain, Xval, Yval, Xtest, Ytest = \
                feature_extraction(df, vec, ft, ngram)
            
            rf = RandomForestClassifier()
            
            acc, prec, rec = train_classifier(rf, Xtrain, Ytrain, Xval, Yval)
            
            
            print("\tStats: ")
            print("\t\tVectorizer: {0}".format(vec.__name__))
            print("\t\tAccuracy: {0:.2f}".format(acc))
            print("\t\tPrecision: {0:.2f}".format(prec))
            print("\t\tRecall: {0:.2f}".format(rec))
            
            if acc > stats[0]:
                bestData = [Xtrain, Ytrain, Xval, Yval, Xtest, Ytest]
                stats = [acc, prec, rec]
                best_rf = rf
                best_features = [ft, ngram]
                best_vectorizer = vec

print("Best solution: \n")
print("Feature extraction: ngram {0}, n_feat {1}".format(best_features[1], best_features[0]))
print("\tStats: ")
print("\t\tVectorizer: {0}".format(best_vectorizer.__name__))
print("\t\tAccuracy: {0:.2f}".format(stats[0]))
print("\t\tPrecision: {0:.2f}".format(stats[1]))
print("\t\tRecall: {0:.2f}".format(stats[2]))

Feature extraction: ngram (1, 1), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 64.38
		Precision: 58.13
		Recall: 25.58
Feature extraction: ngram (1, 2), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 64.48
		Precision: 58.26
		Recall: 24.78
Feature extraction: ngram (1, 3), n_feat 500
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 67.33
		Precision: 66.82
		Recall: 26.08
Feature extraction: ngram (1, 1), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 68.29
		Precision: 76.50
		Recall: 24.84
Feature extraction: ngram (1, 2), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 62.48
		Precision: 59.91
		Recall: 24.90
Feature extraction: ngram (1, 3), n_feat 1000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 64.95
		Precision: 68.01
		Recall: 27.35
Feature extraction: ngram (1, 1), n_feat 2000
	Stats: 
		Vectorizer: CountVectorizer
		Accuracy: 65.14
		Precision: 75.16
		Recall: 27.49
Feature extraction: ngram (1, 2), n_f

## Conclusion

In all results we obtain a low recall and a higher precision, which indicates that the classifiers are conservative: they produce few false positives but many false negatives. Note also that `train_classifier` computes macro-averaged precision with `zero_division=1`, so any class that is never predicted contributes a precision of 1 to the average, which can inflate the reported precision. This result may be due to the small number of samples used or to class imbalance (the next section checks this with a balanced dataset).

In general, for both algorithms, when additional preprocessing (*stemming* or *lemmatization*) is applied, fewer features need to be selected to reach the same accuracy as without it.

---

Next, the best parameters found are checked on a balanced dataset.

We balance the dataset.

In [38]:
df_eq = pd.read_csv('Reviews.csv')
# Keep only the 'Score' and 'Text' columns (the default integer index is retained)
df_eq.drop(columns=df_eq.columns.difference(['Score', 'Text']), inplace=True)

n_rows_each_score = 50_000//5

df_sampled = []
for score_i in range(1, 5+1, 1):
    df_sampled.append(df_eq[df_eq['Score'] == score_i].sample(n_rows_each_score))
df_eq = pd.concat(df_sampled, ignore_index=True)

print(df_eq.groupby(['Score']).count())
df_eq


        Text
Score       
1      10000
2      10000
3      10000
4      10000
5      10000


Unnamed: 0,Score,Text
0,1,We purchased a six box carton of the product. ...
1,1,We ordered Wolfgang Puck Sumatra blend and Col...
2,1,These were selling in my gym and I decided to ...
3,1,This product has to be one of the most frustra...
4,1,"Zero out of five cats, all ferals, said euck.<..."
...,...,...
49995,5,This is a great drink that has the flavor of m...
49996,5,Mad Dog's Revenge was an awesome purchase!! I ...
49997,5,I really like these. One reviewer mentioned th...
49998,5,I like this tea better than any I can find at ...
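
The same balancing idea, independent of pandas, is to group rows by label and draw the same number from each group. The following is a minimal sketch on toy data; `balance_by_label` and the sample rows are hypothetical and only serve as illustration.

```python
import random
from collections import Counter, defaultdict


def balance_by_label(rows, n_per_label, seed=42):
    """Draw n_per_label rows at random (without replacement) per label."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for label, text in rows:
        groups[label].append((label, text))
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], n_per_label))
    return balanced


# Toy, skewed data: label 5 is far more frequent than the others.
rows = ([(5, f"good {i}") for i in range(20)]
        + [(1, f"bad {i}") for i in range(4)]
        + [(3, f"ok {i}") for i in range(6)])

balanced = balance_by_label(rows, n_per_label=4)
counts = Counter(label for label, _ in balanced)
print(counts)
```

`random.sample` raises `ValueError` when a group has fewer rows than requested, so the per-class sample size is bounded by the rarest class.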


In [39]:
reviews = df_eq['Text']

reviews = reviews.apply(lambda x: x.lower())

reviews = reviews.apply(lambda x: ''.join([l for l in x if l not in string.punctuation]))

reviews = reviews.apply(lambda x: nltk.word_tokenize(x))

stop_words = nltk.corpus.stopwords.words('english')
reviews = reviews.apply(lambda x: [w for w in x if w not in stop_words])

df_eq['Text'] = reviews.apply(lambda x: ' '.join(x))
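
For reference, the four steps above (lowercasing, punctuation removal, tokenization and stop-word filtering) can be sketched with the standard library alone. This is a simplification: it splits on whitespace instead of calling `nltk.word_tokenize`, and `STOP_WORDS` is a tiny hypothetical list rather than NLTK's English stop words.

```python
import string

# Hypothetical, deliberately small stop-word list for illustration.
STOP_WORDS = {"the", "a", "is", "this", "of", "and"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def basic_preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower().translate(_PUNCT_TABLE)
    tokens = text.split()
    return " ".join(w for w in tokens if w not in STOP_WORDS)


cleaned = basic_preprocess("This is the BEST tea, and a bargain!")
print(cleaned)
```

Holding the stop words in a `set` keeps each membership test constant-time, which matters when the filter runs over tens of thousands of reviews.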




In [42]:
ft = 500
ngram = (1, 1)
vec = TfidfVectorizer

Xtrain, Ytrain, Xval, Yval, Xtest, Ytest = \
                feature_extraction(df_eq, vec, ft, ngram)
            
rf = RandomForestClassifier()

acc, prec, rec = train_classifier(rf, Xtrain, Ytrain, Xval, Yval)


print("\tStats: ")
print("\t\tVectorizer: {0}".format(vec.__name__))
print("\t\tAccuracy: {0:.2f}".format(acc))
print("\t\tPrecision: {0:.2f}".format(prec))
print("\t\tRecall: {0:.2f}".format(rec))

	Stats: 
		Vectorizer: TfidfVectorizer
		Accuracy: 50.48
		Precision: 50.25
		Recall: 50.44


Recall now rises to the same level as precision. However, precision and accuracy both drop to about 50%, well below the values obtained on the unbalanced data. Selecting a larger number of samples might improve these results.
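
A useful reference point for these figures is the accuracy of a classifier that always predicts the most frequent class. The sketch below computes it from class counts; the balanced counts are the ones printed above, while `skewed_counts` is a hypothetical example and not the real distribution of the dataset.

```python
def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


balanced_counts = {1: 10000, 2: 10000, 3: 10000, 4: 10000, 5: 10000}
skewed_counts = {1: 10, 2: 5, 3: 8, 4: 14, 5: 63}  # hypothetical

print(majority_baseline(balanced_counts))
print(majority_baseline(skewed_counts))
```

On the balanced data the baseline is 20%, so an accuracy near 50% is clearly above chance even though it is lower than the figure obtained on the unbalanced data.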

<hr style="border:2px solid gray"> </hr>

## Optional

In the optional part we compare an *LSTM* architecture across the different types of preprocessing. Hyperparameters are tuned with the *talos* hyperparameter optimization module.

In [31]:
import talos
from talos.model.normalizers import lr_normalizer

from talos import Evaluate

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Dense, Activation, Dropout, Input, Embedding

from tensorflow.keras.optimizers import RMSprop, SGD, Adam

from tensorflow.keras.callbacks import EarlyStopping

In [32]:
max_words = 1000
max_len = 150

The *modelo* function builds, compiles and trains the *LSTM* model according to the parameters that *talos* supplies in the *params* dictionary.

In [33]:
def modelo(Xtrain, Ytrain, Xval, Yval, params):
    # Input: padded sequences of word indices, max_len tokens each
    inputs = Input(name='inputs', shape=[max_len])

    # Learned embeddings for the max_words most frequent words
    layer = Embedding(max_words, params['emb_dim'], input_length=max_len)(inputs)

    layer = LSTM(params['lstm_dim'])(layer)

    layer = Dense(params['fc1_dim'], name='FC1')(layer)

    layer = Activation('relu')(layer)

    layer = Dropout(params['dropout'])(layer)

    # One output unit per score (1 to 5)
    layer = Dense(5, name='output_layer')(layer)

    layer = Activation('softmax')(layer)

    model = Model(inputs=inputs, outputs=layer)

    model.compile(loss=params['losses'],
                  optimizer=params['optimizer'](lr=lr_normalizer(params['lr'], params['optimizer'])),
                  metrics=["accuracy"])

    # validation_data is documented as a tuple (x_val, y_val); a list may not
    # be unpacked into inputs and targets by some TensorFlow 2.x versions
    history = model.fit(Xtrain, Ytrain,
                        validation_data=(Xval, Yval),
                        batch_size=params['batch_size'],
                        callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=10)],
                        epochs=params['epochs'],
                        verbose=0)

    return history, model

The following dictionary defines the *LSTM* hyperparameters that *talos* will optimize.

In [34]:
parameters = {
    "emb_dim": [25, 50],
    "lstm_dim": [32, 64, 128],
    "fc1_dim": [128, 256, 512],
    "dropout": [0, 0.5],
    "losses": ['categorical_crossentropy'],
    'kernel_initializer': ['uniform','normal'],  # not read by modelo; only enlarges the search space
    "optimizer": [Adam, SGD],
    "batch_size": [16, 32],
    "epochs": [100, 200, 500],
    "lr": [0.005, 0.01]
}
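
To get a sense of how much of this search space is explored, the number of possible combinations can be counted. The sketch below copies the value lists from the dictionary above (optimizers are written as strings so that it runs without TensorFlow) and compares the total with `round_limit=5`, the limit used in the scans of this notebook.

```python
from math import prod

# Same value lists as the dictionary above; optimizer classes are replaced
# by their names so the snippet has no TensorFlow dependency.
grid = {
    "emb_dim": [25, 50],
    "lstm_dim": [32, 64, 128],
    "fc1_dim": [128, 256, 512],
    "dropout": [0, 0.5],
    "losses": ["categorical_crossentropy"],
    "kernel_initializer": ["uniform", "normal"],
    "optimizer": ["Adam", "SGD"],
    "batch_size": [16, 32],
    "epochs": [100, 200, 500],
    "lr": [0.005, 0.01],
}

n_permutations = prod(len(values) for values in grid.values())
round_limit = 5

print(n_permutations)
print(f"{round_limit / n_permutations:.2%} of the grid is sampled")
```

With only a handful of rounds out of more than a thousand combinations, the search behaves like a very small random sample of the grid.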

### Original data

Next, the *LSTM* is trained on the data that received only the basic preprocessing.

We split the dataset into training, validation and test sets.

In [35]:
df_train_orig, df_test_orig = train_test_split(df, test_size=0.10)
df_train_orig, df_val_orig = train_test_split(df_train_orig, test_size=0.15)

"Nº train {0:.04f} Nº val {1:.04f} Nº test {2:.04f}".format(len(df_train_orig), len(df_val_orig), len(df_test_orig))

'Nº train 7650.0000 Nº val 1350.0000 Nº test 1000.0000'
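
These sizes follow from applying the two splits in sequence. Here is a quick arithmetic check, assuming 10,000 rows in total (the sum of the three sizes printed above); it uses integer arithmetic and does not reproduce scikit-learn's own rounding rules.

```python
n_total = 10_000                      # assumed: 7650 + 1350 + 1000

n_test = n_total * 10 // 100          # 10% held out for test
n_remaining = n_total - n_test
n_val = n_remaining * 15 // 100       # 15% of the remainder for validation
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
print(f"validation share of the whole dataset: {n_val / n_total:.1%}")
```

Because the second split is taken from the remaining 90%, the validation set amounts to 13.5% of the full dataset rather than 15%.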

We convert the *Score* labels to *one-hot encoding*.

In [36]:
Xtrain_orig = df_train_orig.Text
Ytrain_orig = df_train_orig.Score

Xtest_orig = df_test_orig.Text
Ytest_orig = df_test_orig.Score

Xval_orig = df_val_orig.Text
Yval_orig = df_val_orig.Score

# Fit the encoders on the training labels only and reuse them on the
# other splits so that every split uses the same class-to-column mapping
le = LabelEncoder()
Ytrain_orig = le.fit_transform(Ytrain_orig)
Ytest_orig = le.transform(Ytest_orig)
Yval_orig = le.transform(Yval_orig)

Ytrain_orig = Ytrain_orig.reshape(-1, 1)
Ytest_orig = Ytest_orig.reshape(-1, 1)
Yval_orig = Yval_orig.reshape(-1, 1)

enc = OneHotEncoder()
Ytrain_orig = enc.fit_transform(Ytrain_orig)
Ytest_orig = enc.transform(Ytest_orig)
Yval_orig = enc.transform(Yval_orig)

Ytrain_orig = Ytrain_orig.toarray()
Ytest_orig = Ytest_orig.toarray()
Yval_orig = Yval_orig.toarray()

Ytest_orig

array([[0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.]])
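
One detail worth illustrating is why the encoding should be defined once and reused for every split. The sketch below is a plain-Python stand-in for the label and one-hot encoders (the helper `one_hot` is hypothetical): it derives the column order from a reference set of labels, so a split that happens to lack some scores still gets five columns in the same order.

```python
def one_hot(labels, categories):
    """Encode labels as rows with a single 1.0 in the column of their category."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0.0] * len(categories)
        row[index[label]] = 1.0
        rows.append(row)
    return rows


train_scores = [5, 1, 3, 4, 2, 5]
test_scores = [3, 5, 5]                  # no reviews with score 1, 2 or 4

categories = sorted(set(train_scores))   # learned from the training data only
encoded = one_hot(test_scores, categories)
print(encoded)
```

Had the categories been derived from `test_scores` alone, each row would have only two columns and would no longer line up with the five outputs of the network.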

We convert the reviews to sequences of integers.

In [37]:
tok = Tokenizer(num_words = max_words)

tok.fit_on_texts(Xtrain_orig)

train_orig = tok.texts_to_sequences(Xtrain_orig)
train_orig_matrix = sequence.pad_sequences(train_orig, maxlen=max_len)

test_orig = tok.texts_to_sequences(Xtest_orig)
test_orig_matrix = sequence.pad_sequences(test_orig, maxlen=max_len)

val_orig = tok.texts_to_sequences(Xval_orig)
val_orig_matrix = sequence.pad_sequences(val_orig, maxlen=max_len)

val_orig_matrix

array([[  0,   0,   0, ...,  22, 207, 901],
       [  0,   0,   0, ..., 284, 150, 336],
       [  0,   0,   0, ...,  45,   1, 193],
       ...,
       [  0,   0,   0, ...,  42,   7,  74],
       [  0,   0,   0, ...,  74,   7, 797],
       [  0,   0,   0, ...,  96, 654, 148]])
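
The leading zeros in this matrix come from padding. As an illustration of the default behaviour of `pad_sequences` (zeros are added at the front, and over-long sequences are cut from the front), here is a plain-Python sketch; `pad_pre` is a hypothetical helper and not part of Keras.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen items of long ones."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]                          # drop the oldest tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # fill at the front
    return padded


padded = pad_pre([[22, 207, 901], [1, 2, 3, 4, 5, 6, 7]], maxlen=5)
print(padded)
```

Index 0 is reserved for padding, which is why the word indices produced by the tokenizer start at 1.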

We run the hyperparameter optimization. The results are stored in the object *t* and in *csv* files in the *lstm_classifier_originales* folder. We pass the *modelo* function, which defines the network, and the *parameters* dictionary as arguments.

In [38]:
t = talos.Scan(x=train_orig_matrix,
               y=Ytrain_orig,
               x_val=val_orig_matrix,
               y_val=Yval_orig,
               model=modelo,
               params=parameters,
               experiment_name='lstm_classifier_originales',
               round_limit=5)

100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [22:11<00:00, 266.24s/it]


Below are the results for each round of the optimizer.

In [39]:
t.data

Unnamed: 0,start,end,duration,round_epochs,loss,accuracy,val_loss,val_accuracy,batch_size,dropout,emb_dim,epochs,fc1_dim,kernel_initializer,losses,lr,lstm_dim,optimizer
0,12/09/20-163704,12/09/20-164143,278.874527,11,1.491676,0.637255,0.0,0.0,16,0.5,25,100,256,normal,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.g...
1,12/09/20-164143,12/09/20-164626,282.664845,11,1.129904,0.637255,0.0,0.0,16,0.5,25,500,512,uniform,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.a...
2,12/09/20-164626,12/09/20-165112,285.730194,11,1.379455,0.637255,0.0,0.0,16,0.5,50,500,512,uniform,categorical_crossentropy,0.01,128,<class 'tensorflow.python.keras.optimizer_v2.g...
3,12/09/20-165112,12/09/20-165400,168.395534,11,1.139502,0.637255,0.0,0.0,32,0.5,50,100,256,normal,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.a...
4,12/09/20-165401,12/09/20-165915,314.275641,11,1.30624,0.637255,0.0,0.0,16,0.0,50,100,128,normal,categorical_crossentropy,0.01,128,<class 'tensorflow.python.keras.optimizer_v2.g...
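
Every round stops after 11 epochs while *val_loss* stays at 0.0. That is what a patience-based stopping rule produces when the monitored value never changes, as the following simplified simulation suggests. It is a sketch of the general rule with the same `min_delta` and `patience` values as above, not Keras's actual `EarlyStopping` code.

```python
def epochs_until_stop(val_losses, patience=10, min_delta=0.0001):
    """Return the epoch at which a patience-based rule would stop training."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # improvement larger than min_delta
            best = loss
            wait = 0
        else:                         # no improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)


constant_val_loss = [0.0] * 100
stopped_at = epochs_until_stop(constant_val_loss)
print(stopped_at)
```

The first epoch counts as an improvement over the initial value, and the next ten epochs exhaust the patience, which gives 11 epochs in total regardless of the `epochs` setting.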


With the *Evaluate* object we can evaluate the results of the optimizer on the *test* set.

In [40]:
e = Evaluate(t)
evaluation = e.evaluate(test_orig_matrix, 
                        Ytest_orig,
                        folds=10,
                        shuffle=False,
                        metric='accuracy',
                        print_out=True,
                        task='multi_label',
                        asc=False)

mean : 0.16 
 std : 0.01


The following table shows the results of each evaluation round.

In [41]:
e.data

Unnamed: 0,start,end,duration,round_epochs,loss,accuracy,val_loss,val_accuracy,batch_size,dropout,emb_dim,epochs,fc1_dim,kernel_initializer,losses,lr,lstm_dim,optimizer
0,12/09/20-163704,12/09/20-164143,278.874527,11,1.491676,0.637255,0.0,0.0,16,0.5,25,100,256,normal,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.g...
1,12/09/20-164143,12/09/20-164626,282.664845,11,1.129904,0.637255,0.0,0.0,16,0.5,25,500,512,uniform,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.a...
2,12/09/20-164626,12/09/20-165112,285.730194,11,1.379455,0.637255,0.0,0.0,16,0.5,50,500,512,uniform,categorical_crossentropy,0.01,128,<class 'tensorflow.python.keras.optimizer_v2.g...
3,12/09/20-165112,12/09/20-165400,168.395534,11,1.139502,0.637255,0.0,0.0,32,0.5,50,100,256,normal,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.a...
4,12/09/20-165401,12/09/20-165915,314.275641,11,1.30624,0.637255,0.0,0.0,16,0.0,50,100,128,normal,categorical_crossentropy,0.01,128,<class 'tensorflow.python.keras.optimizer_v2.g...


*Talos* lets us extract the best model found with the *best_model* function. We use accuracy as the metric for comparing results.

The returned summary shows the layer sizes of the selected network.

In [42]:
# asc=False: higher accuracy is better (asc=True is meant for metrics to minimize, such as loss)
best_model_orig = t.best_model(metric='accuracy', asc=False)

In [43]:
best_model_orig.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          [(None, 150)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 150, 25)           25000     
_________________________________________________________________
lstm (LSTM)                  (None, 128)               78848     
_________________________________________________________________
FC1 (Dense)                  (None, 256)               33024     
_________________________________________________________________
activation (Activation)      (None, 256)               0         
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
output_layer (Dense)         (None, 5)                

#### Results

With the basic preprocessing, the network reaches a training *accuracy* of 63.73% and a *loss* of 1.49, with the following parameters chosen by *talos*:
- Epochs run: 11 of 100
- Batch size: 16
- *Embedding* dimension: 25
- *LSTM* dimension: 128
- *Fully-connected* layer dimension: 256
- Dropout: 0.5
- Optimizer: SGD
- Learning rate: 0.005

### Stemming

As in the previous section, the *LSTM* is now trained on the data that was preprocessed with *stemming*.

We split the dataset into training, validation and test sets.

In [46]:
df_train_stem, df_test_stem = train_test_split(df_stm, test_size=0.10)
df_train_stem, df_val_stem = train_test_split(df_train_stem, test_size=0.15)

"Nº train {0:.04f} Nº val {1:.04f} Nº test {2:.04f}".format(len(df_train_stem), len(df_val_stem), len(df_test_stem))

'Nº train 7650.0000 Nº val 1350.0000 Nº test 1000.0000'

The *Score* labels are converted to *one-hot encoding*.

In [47]:
Xtrain_stem = df_train_stem.Text
Ytrain_stem = df_train_stem.Score

Xtest_stem = df_test_stem.Text
Ytest_stem = df_test_stem.Score

Xval_stem = df_val_stem.Text
Yval_stem = df_val_stem.Score

# Fit the encoders on the training labels only and reuse them on the
# other splits so that every split uses the same class-to-column mapping
le = LabelEncoder()
Ytrain_stem = le.fit_transform(Ytrain_stem)
Ytest_stem = le.transform(Ytest_stem)
Yval_stem = le.transform(Yval_stem)

Ytrain_stem = Ytrain_stem.reshape(-1, 1)
Ytest_stem = Ytest_stem.reshape(-1, 1)
Yval_stem = Yval_stem.reshape(-1, 1)

enc = OneHotEncoder()
Ytrain_stem = enc.fit_transform(Ytrain_stem)
Ytest_stem = enc.transform(Ytest_stem)
Yval_stem = enc.transform(Yval_stem)

Ytrain_stem = Ytrain_stem.toarray()
Ytest_stem = Ytest_stem.toarray()
Yval_stem = Yval_stem.toarray()

Ytest_stem

array([[0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.]])

The reviews are converted to sequences of integers, padded or truncated to a maximum length.

In [48]:
tok = Tokenizer(num_words = max_words)

tok.fit_on_texts(Xtrain_stem)

train_stem = tok.texts_to_sequences(Xtrain_stem)
train_stem_matrix = sequence.pad_sequences(train_stem, maxlen=max_len)

test_stem = tok.texts_to_sequences(Xtest_stem)
test_stem_matrix = sequence.pad_sequences(test_stem, maxlen=max_len)

val_stem = tok.texts_to_sequences(Xval_stem)
val_stem_matrix = sequence.pad_sequences(val_stem, maxlen=max_len)

val_stem_matrix

array([[  0,   0,   0, ..., 316, 193,  48],
       [  0,   0,   0, ..., 415, 308, 149],
       [  0,   0,   0, ...,  54,  41, 187],
       ...,
       [  0,   0,   0, ...,   3, 157,  91],
       [  0,   0,   0, ...,  17, 192,  91],
       [  0,   0,   0, ...,   3, 157,  91]])

We run the hyperparameter optimization. This time the results are written to *csv* files in the *lstm_classifier_stemming* folder.

In [49]:
t = talos.Scan(x=train_stem_matrix,
               y=Ytrain_stem,
               x_val=val_stem_matrix,
               y_val=Yval_stem,
               model=modelo,
               params=parameters,
               experiment_name='lstm_classifier_stemming',
               round_limit=5)

100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [19:28<00:00, 233.78s/it]


The following table contains the results of each optimization round.

In [50]:
t.data

Unnamed: 0,start,end,duration,round_epochs,loss,accuracy,val_loss,val_accuracy,batch_size,dropout,emb_dim,epochs,fc1_dim,kernel_initializer,losses,lr,lstm_dim,optimizer
0,12/09/20-171030,12/09/20-171234,124.035707,11,1.423369,0.635033,0.0,0.0,16,0.0,25,100,256,normal,categorical_crossentropy,0.01,32,<class 'tensorflow.python.keras.optimizer_v2.g...
1,12/09/20-171235,12/09/20-171738,303.12733,11,1.129079,0.635033,0.0,0.0,16,0.0,25,500,256,uniform,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.a...
2,12/09/20-171738,12/09/20-172249,310.539641,11,1.478115,0.635033,0.0,0.0,16,0.5,50,500,256,uniform,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.g...
3,12/09/20-172249,12/09/20-172751,302.406538,11,1.479346,0.635033,0.0,0.0,16,0.5,50,100,128,uniform,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.g...
4,12/09/20-172751,12/09/20-172959,127.544448,11,1.121236,0.635033,0.0,0.0,16,0.0,50,500,512,uniform,categorical_crossentropy,0.005,32,<class 'tensorflow.python.keras.optimizer_v2.a...


We evaluate each of the network's results on the test data.

In [51]:
e = Evaluate(t)
evaluation = e.evaluate(test_stem_matrix, 
                        Ytest_stem,
                        folds=10,
                        shuffle=False,
                        metric='accuracy',
                        print_out=True,
                        task='multi_label',
                        asc=False)

mean : 0.16 
 std : 0.02


Each of the evaluation results is shown below.

In [52]:
e.data

Unnamed: 0,start,end,duration,round_epochs,loss,accuracy,val_loss,val_accuracy,batch_size,dropout,emb_dim,epochs,fc1_dim,kernel_initializer,losses,lr,lstm_dim,optimizer
0,12/09/20-171030,12/09/20-171234,124.035707,11,1.423369,0.635033,0.0,0.0,16,0.0,25,100,256,normal,categorical_crossentropy,0.01,32,<class 'tensorflow.python.keras.optimizer_v2.g...
1,12/09/20-171235,12/09/20-171738,303.12733,11,1.129079,0.635033,0.0,0.0,16,0.0,25,500,256,uniform,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.a...
2,12/09/20-171738,12/09/20-172249,310.539641,11,1.478115,0.635033,0.0,0.0,16,0.5,50,500,256,uniform,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.g...
3,12/09/20-172249,12/09/20-172751,302.406538,11,1.479346,0.635033,0.0,0.0,16,0.5,50,100,128,uniform,categorical_crossentropy,0.005,128,<class 'tensorflow.python.keras.optimizer_v2.g...
4,12/09/20-172751,12/09/20-172959,127.544448,11,1.121236,0.635033,0.0,0.0,16,0.0,50,500,512,uniform,categorical_crossentropy,0.005,32,<class 'tensorflow.python.keras.optimizer_v2.a...


We extract the best model and inspect the chosen parameters.

In [53]:
# asc=False: higher accuracy is better (asc=True is meant for metrics to minimize, such as loss)
best_model_stem = t.best_model(metric='accuracy', asc=False)
best_model_stem.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          [(None, 150)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 150, 25)           25000     
_________________________________________________________________
lstm (LSTM)                  (None, 32)                7424      
_________________________________________________________________
FC1 (Dense)                  (None, 256)               8448      
_________________________________________________________________
activation (Activation)      (None, 256)               0         
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
output_layer (Dense)         (None, 5)                

#### Results

With the data preprocessed with *stemming*, the network reaches a training *accuracy* of 63.50% and a *loss* of 1.42, with the following parameters chosen by *talos*:
- Epochs run: 11 of 100
- Batch size: 16
- *Embedding* dimension: 25
- *LSTM* dimension: 32
- *Fully-connected* layer dimension: 256
- Dropout: 0.0
- Optimizer: SGD
- Learning rate: 0.01

### Lemmatization

We now apply the same procedure to train the *LSTM* on the data that was preprocessed with *lemmatization*.

We split the dataset into training, validation and test sets.

In [54]:
df_train_lem, df_test_lem = train_test_split(df_lem, test_size=0.10)
df_train_lem, df_val_lem = train_test_split(df_train_lem, test_size=0.15)

"Nº train {0:.04f} Nº val {1:.04f} Nº test {2:.04f}".format(len(df_train_lem), len(df_val_lem), len(df_test_lem))

'Nº train 7650.0000 Nº val 1350.0000 Nº test 1000.0000'

We encode the *Score* label as *one-hot encoding*.

In [55]:
Xtrain_lem = df_train_lem.Text
Ytrain_lem = df_train_lem.Score

Xtest_lem = df_test_lem.Text
Ytest_lem = df_test_lem.Score

Xval_lem = df_val_lem.Text
Yval_lem = df_val_lem.Score

# Fit the encoders on the training labels only and reuse them on the
# other splits so that every split uses the same class-to-column mapping
le = LabelEncoder()
Ytrain_lem = le.fit_transform(Ytrain_lem)
Ytest_lem = le.transform(Ytest_lem)
Yval_lem = le.transform(Yval_lem)

Ytrain_lem = Ytrain_lem.reshape(-1, 1)
Ytest_lem = Ytest_lem.reshape(-1, 1)
Yval_lem = Yval_lem.reshape(-1, 1)

enc = OneHotEncoder()
Ytrain_lem = enc.fit_transform(Ytrain_lem)
Ytest_lem = enc.transform(Ytest_lem)
Yval_lem = enc.transform(Yval_lem)

Ytrain_lem = Ytrain_lem.toarray()
Ytest_lem = Ytest_lem.toarray()
Yval_lem = Yval_lem.toarray()

Ytest_lem

array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.]])

We convert the reviews to sequences of integers.

In [56]:
tok = Tokenizer(num_words = max_words)

tok.fit_on_texts(Xtrain_lem)

train_lem = tok.texts_to_sequences(Xtrain_lem)
train_lem_matrix = sequence.pad_sequences(train_lem, maxlen=max_len)

test_lem = tok.texts_to_sequences(Xtest_lem)
test_lem_matrix = sequence.pad_sequences(test_lem, maxlen=max_len)

val_lem = tok.texts_to_sequences(Xval_lem)
val_lem_matrix = sequence.pad_sequences(val_lem, maxlen=max_len)

val_lem_matrix

array([[  0,   0,   0, ...,  24,  13,  35],
       [  0,   0,   0, ..., 137,   4, 457],
       [  0,   0,   0, ..., 891, 400, 755],
       ...,
       [  0,   0,   0, ...,  22,  93, 164],
       [  0,   0,   0, ...,  84,  12, 784],
       [  0,   0,   0, ..., 800,  74, 186]])

We run the hyperparameter optimization. The results are written to the *lstm_classifier_lemmatization* directory.

In [57]:
t = talos.Scan(x=train_lem_matrix,
               y=Ytrain_lem,
               x_val=val_lem_matrix,
               y_val=Yval_lem,
               model=modelo,
               params=parameters,
               experiment_name='lstm_classifier_lemmatization',
               round_limit=5)

100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [12:25<00:00, 149.11s/it]


The following table holds the results for each solution obtained by the optimizer.

In [58]:
t.data

Unnamed: 0,start,end,duration,round_epochs,loss,accuracy,val_loss,val_accuracy,batch_size,dropout,emb_dim,epochs,fc1_dim,kernel_initializer,losses,lr,lstm_dim,optimizer
0,12/09/20-181833,12/09/20-182014,100.941295,11,1.130054,0.637516,0.0,0.0,32,0.5,25,500,512,normal,categorical_crossentropy,0.01,64,<class 'tensorflow.python.keras.optimizer_v2.a...
1,12/09/20-182014,12/09/20-182248,153.81773,11,1.495309,0.637516,0.0,0.0,16,0.5,25,500,256,uniform,categorical_crossentropy,0.005,64,<class 'tensorflow.python.keras.optimizer_v2.g...
2,12/09/20-182248,12/09/20-182745,296.177344,11,1.391446,0.637516,0.0,0.0,16,0.0,25,500,512,uniform,categorical_crossentropy,0.01,128,<class 'tensorflow.python.keras.optimizer_v2.g...
3,12/09/20-182745,12/09/20-182949,124.10438,11,1.122573,0.637516,0.0,0.0,16,0.0,25,100,128,normal,categorical_crossentropy,0.005,32,<class 'tensorflow.python.keras.optimizer_v2.a...
4,12/09/20-182949,12/09/20-183058,69.017961,11,1.503583,0.637516,0.0,0.0,32,0.5,50,200,512,normal,categorical_crossentropy,0.01,32,<class 'tensorflow.python.keras.optimizer_v2.g...


We evaluate each solution on the test set.

In [59]:
e = Evaluate(t)
evaluation = e.evaluate(test_lem_matrix, 
                        Ytest_lem,
                        folds=10,
                        shuffle=False,
                        metric='accuracy',
                        print_out=True,
                        task='multi_label',
                        asc=False)
e.data

mean : 0.16 
 std : 0.01


Unnamed: 0,start,end,duration,round_epochs,loss,accuracy,val_loss,val_accuracy,batch_size,dropout,emb_dim,epochs,fc1_dim,kernel_initializer,losses,lr,lstm_dim,optimizer
0,12/09/20-181833,12/09/20-182014,100.941295,11,1.130054,0.637516,0.0,0.0,32,0.5,25,500,512,normal,categorical_crossentropy,0.01,64,<class 'tensorflow.python.keras.optimizer_v2.a...
1,12/09/20-182014,12/09/20-182248,153.81773,11,1.495309,0.637516,0.0,0.0,16,0.5,25,500,256,uniform,categorical_crossentropy,0.005,64,<class 'tensorflow.python.keras.optimizer_v2.g...
2,12/09/20-182248,12/09/20-182745,296.177344,11,1.391446,0.637516,0.0,0.0,16,0.0,25,500,512,uniform,categorical_crossentropy,0.01,128,<class 'tensorflow.python.keras.optimizer_v2.g...
3,12/09/20-182745,12/09/20-182949,124.10438,11,1.122573,0.637516,0.0,0.0,16,0.0,25,100,128,normal,categorical_crossentropy,0.005,32,<class 'tensorflow.python.keras.optimizer_v2.a...
4,12/09/20-182949,12/09/20-183058,69.017961,11,1.503583,0.637516,0.0,0.0,32,0.5,50,200,512,normal,categorical_crossentropy,0.01,32,<class 'tensorflow.python.keras.optimizer_v2.g...


We obtain the best model and its hyperparameter configuration.

In [60]:
# asc=False: higher accuracy is better (asc=True is meant for metrics to minimize, such as loss)
best_model_lem = t.best_model(metric='accuracy', asc=False)
best_model_lem.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          [(None, 150)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 150, 25)           25000     
_________________________________________________________________
lstm (LSTM)                  (None, 64)                23040     
_________________________________________________________________
FC1 (Dense)                  (None, 512)               33280     
_________________________________________________________________
activation (Activation)      (None, 512)               0         
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
output_layer (Dense)         (None, 5)                

#### Results

With the data preprocessed with *lemmatization*, the network reaches a training *accuracy* of 63.75% and a *loss* of 1.13, with the following parameters chosen by *talos*:
- Epochs run: 11 of 500
- Batch size: 32
- *Embedding* dimension: 25
- *LSTM* dimension: 64
- *Fully-connected* layer dimension: 512
- Dropout: 0.5
- Optimizer: Adam
- Learning rate: 0.01

## Conclusion

According to these results, preprocessing does not appear to change the network's accuracy appreciably: it stays between roughly 63.5% and 63.8% in all three cases. Even so, the selected *LSTM* configuration is larger when the data is not preprocessed with *stemming* or *lemmatization*. The network selected for the data with basic preprocessing has 128 LSTM units and took ~278 seconds to train, whereas the networks selected for the preprocessed data have 32 units (*stemming*, ~124 seconds) and 64 units (*lemmatization*, ~101 seconds). Since each search sampled only 5 rounds and all rounds reached the same accuracy, this difference should be read as tentative.
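
The size difference can be made concrete by counting trainable parameters. The sketch below uses the usual parameter-count formulas for an LSTM layer and a dense layer and reproduces the LSTM and FC1 figures printed in the three model summaries; the function names and the `configs` dictionary are hypothetical.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


emb_dim = 25  # embedding dimension shared by the three selected models
configs = {
    "basic": {"lstm_dim": 128, "fc1_dim": 256},
    "stemming": {"lstm_dim": 32, "fc1_dim": 256},
    "lemmatization": {"lstm_dim": 64, "fc1_dim": 512},
}

for name, cfg in configs.items():
    n_lstm = lstm_params(emb_dim, cfg["lstm_dim"])
    n_fc1 = dense_params(cfg["lstm_dim"], cfg["fc1_dim"])
    print(f"{name}: LSTM={n_lstm} FC1={n_fc1}")
```

The recurrent term grows quadratically with the number of units, so going from 32 to 128 units multiplies the LSTM parameter count by roughly ten.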