## Doramas Báez Bernal - Sentiment Analysis

The goal of this work is to analyze different natural language processing techniques for a sentiment analysis problem. To that end, we use a dataset of Amazon food reviews with more than 500,000 instances.

### Objectives
- Mandatory task: run the analysis with the classic NLP pipeline, considering the original text and two preprocessed versions, and two different features to build the term-document matrix.
- Optional task: apply connectionist (neural network) methods used in NLP.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import string

import nltk

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.preprocessing import LabelEncoder

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adrian\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\adrian\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\adrian\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
df = pd.read_csv('Reviews.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568438 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB


In [4]:
df.isnull().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

## Dataset

The dataset consists of 568,454 reviews of food products sold on Amazon. Each review includes information about the product, the user, the rating, and a plain-text account of the user's experience with the product.

The full dataset will not be used, since processing 568,454 samples is computationally expensive. To reduce its size, two steps are applied:

1. Remove rows with null values
2. Reduce the dataset to 100,000 samples (20,000 per score)

In [5]:
# Drop every row that contains at least one null value
df.dropna(how='any',inplace=True)
print("Dataset sin valores nulos:", df.shape)

# Split the reviews by score (1 to 5)
df1 = df.loc[df['Score'] == 1]
df2 = df.loc[df['Score'] == 2]
df3 = df.loc[df['Score'] == 3]
df4 = df.loc[df['Score'] == 4]
df5 = df.loc[df['Score'] == 5]

# Keep the first 20000 reviews of each score to get a balanced subset
df = pd.concat([df1[:20000],df2[:20000],df3[:20000],df4[:20000],df5[:20000]])

print("Dataset Balanceado a 100000 muestras: ", df.shape)

Dataset sin valores nulos: (568411, 10)
Dataset Balanceado a 100000 muestras:  (100000, 10)
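
The cell above keeps the first 20,000 rows of each score, so the subset depends on the order of the file. As a hedged alternative, the sketch below draws the rows at random with a fixed seed using `DataFrame.groupby(...).sample`. The `toy` frame and `n_per_class` are hypothetical stand-ins for `df` and the 20,000 used in the notebook.

```python
import pandas as pd

# Toy stand-in for the review table (made-up rows, not the real Reviews.csv).
toy = pd.DataFrame({
    "Score": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5],
    "Text": [f"review {i}" for i in range(17)],
})

n_per_class = 2  # the notebook uses 20000 per score

# Draw the same number of rows from every score, at random but reproducibly.
balanced = toy.groupby("Score").sample(n=n_per_class, random_state=0)

counts = balanced["Score"].value_counts().sort_index()
print(counts)
```

Every score ends up with exactly `n_per_class` rows, as in the notebook, but the selection no longer favors the reviews that happen to appear first.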


In [6]:
fig = px.pie(df, names='Score', title='Percentage of the score of the products')
fig.show()

## Data balancing

The original dataset is imbalanced: there are far more samples with a score of 5 (almost 66% of the dataset) than with any other score. Without balancing, training would be biased toward that class, producing a high accuracy that is misleading.

The accuracy would be high only because the model learns to answer "score 5" for nearly every review, so it would not really be learning to tell the scores apart. Balancing the data, as done above with 20,000 reviews per score, avoids this problem.
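
To make the argument concrete, here is a minimal sketch with made-up labels. Only the 66% share of score 5 comes from the text; the proportions of the other scores are invented for illustration. A "model" that always answers 5 reaches 66% accuracy, yet its macro-averaged recall shows that it recognizes only one of the five classes.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical label distribution: 66 of 100 reviews have score 5.
y_true = [5] * 66 + [4] * 14 + [3] * 8 + [2] * 5 + [1] * 7
y_pred = [5] * len(y_true)  # always predict the majority class

acc = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")

print(f"accuracy:     {acc:.2f}")
print(f"macro recall: {macro_recall:.2f}")
```

The recall is 1 for score 5 and 0 for every other score, so the macro average is 1/5.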



In [7]:
score_df = pd.DataFrame(df, columns=['Text','Score'])
score_df_lem = score_df.copy()
score_df.head()

|    | Text | Score |
|----|------|-------|
| 1  | Product arrived labeled as Jumbo Salted Peanut... | 1 |
| 12 | My cats have been happily eating Felidae Plati... | 1 |
| 26 | The candy is just red , No flavor . Just plan... | 1 |
| 50 | This oatmeal is not good. Its mushy, soft, I d... | 1 |
| 62 | Arrived in 6 days and were so stale i could no... | 1 |


## Text Preprocessing

In this section we apply the transformations needed to improve predictions from the text, such as lowercasing everything and removing punctuation.

The same treatment is applied to "score_df" and "score_df_lem". The first is then stemmed and the second lemmatized, so that the results of both can be compared.

In [8]:
text = score_df['Text']
# Lowercase every review
text = text.apply(lambda x: x.lower())
# Remove punctuation characters
text = text.apply(lambda x: ''.join(letra for letra in x if letra not in string.punctuation))
# Split each review into word tokens
text = text.apply(lambda x: nltk.word_tokenize(x))
print(text[0])

['i', 'have', 'bought', 'several', 'of', 'the', 'vitality', 'canned', 'dog', 'food', 'products', 'and', 'have', 'found', 'them', 'all', 'to', 'be', 'of', 'good', 'quality', 'the', 'product', 'looks', 'more', 'like', 'a', 'stew', 'than', 'a', 'processed', 'meat', 'and', 'it', 'smells', 'better', 'my', 'labrador', 'is', 'finicky', 'and', 'she', 'appreciates', 'this', 'product', 'better', 'than', 'most']
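
One side effect of deleting punctuation outright is that words separated only by punctuation get glued together, which is where a token like `peanutsthe` in the stemming output further down comes from. The sketch below contrasts that approach with a possible alternative that maps punctuation to spaces. The sentence in `raw` is an illustrative example, not taken verbatim from the dataset, and the alternative is a suggestion rather than what the notebook does.

```python
import string

raw = "Labeled as Jumbo Salted Peanuts...the peanuts were actually small."

# Approach used in the notebook: drop punctuation characters outright.
dropped = ''.join(ch for ch in raw.lower() if ch not in string.punctuation)

# Alternative: turn each punctuation character into a space before splitting.
to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = raw.lower().translate(to_space)

print(dropped.split())
print(spaced.split())
```

The trade-off is that contractions such as "don't" become two tokens ("don" and "t") with the second approach, so which one is preferable depends on the later steps.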


## Stemmer and Lemmatizer

The texts have been processed down to their word tokens. The next step is to remove the stopwords (articles, pronouns, prepositions and similar) to reduce the dimensionality of the problem without losing the semantic meaning of the sentence.

Afterwards, two natural language processing techniques are applied, stemming and lemmatization, in order to obtain two different datasets to experiment with during training.

In [9]:
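# A set makes the membership test below much faster than a list
stop_words = set(nltk.corpus.stopwords.words('english'))
stemmer = nltk.stem.PorterStemmer()
lemmatizer = nltk.stem.WordNetLemmatizer()

# Remove stopwords from every tokenized review
text = text.apply(lambda x: [palabra for palabra in x if palabra not in stop_words])

# Keep a copy of the stopword-free tokens for the lemmatized variant
text_lem = text.copy()

# Variant 1: Porter stemming
text = text.apply(lambda x: [stemmer.stem(palabra) for palabra in x])

# Variant 2: WordNet lemmatization (default part of speech: noun)
text_lem = text_lem.apply(lambda x: [lemmatizer.lemmatize(palabra) for palabra in x])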


In [12]:
print(text[1])
print(text_lem[1])

['product', 'arriv', 'label', 'jumbo', 'salt', 'peanutsth', 'peanut', 'actual', 'small', 'size', 'unsalt', 'sure', 'error', 'vendor', 'intend', 'repres', 'product', 'jumbo']
['product', 'arrived', 'labeled', 'jumbo', 'salted', 'peanutsthe', 'peanut', 'actually', 'small', 'sized', 'unsalted', 'sure', 'error', 'vendor', 'intended', 'represent', 'product', 'jumbo']


In [13]:
score_df['Text'] = text.apply(lambda x: ' '.join(x))
score_df.head()

|    | Text | Score |
|----|------|-------|
| 1  | product arriv label jumbo salt peanutsth peanu... | 1 |
| 12 | cat happili eat felida platinum two year got n... | 1 |
| 26 | candi red flavor plan chewi would never buy | 1 |
| 50 | oatmeal good mushi soft dont like quaker oat w... | 1 |
| 62 | arriv 6 day stale could eat 6 bag | 1 |


In [14]:
score_df_lem['Text'] = text_lem.apply(lambda x: ' '.join(x))
score_df_lem.head()

|    | Text | Score |
|----|------|-------|
| 1  | product arrived labeled jumbo salted peanutsth... | 1 |
| 12 | cat happily eating felidae platinum two year g... | 1 |
| 26 | candy red flavor plan chewy would never buy | 1 |
| 50 | oatmeal good mushy soft dont like quaker oat w... | 1 |
| 62 | arrived 6 day stale could eat 6 bag | 1 |


As the outputs above show, two datasets have been created, one for each preprocessing variant.

In [13]:
bowTF = CountVectorizer()
resTF = bowTF.fit_transform(score_df['Text'])

In [14]:
bowTFIDF = TfidfVectorizer()
resTFIDF = bowTFIDF.fit_transform(score_df['Text'])
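
The two vectorizers above produce the two representations of the term-document matrix compared in this work. As a small sketch on a made-up three-document corpus, `CountVectorizer` stores raw term counts, while `TfidfVectorizer` reweights them by inverse document frequency and, with its default settings, scales each row to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus, only for illustration.
docs = ["good oatmeal good", "stale oatmeal", "good candy"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# Vocabulary terms ordered by their column index.
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)
print(counts.toarray())

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)
print(tfidf.toarray().round(2))

# Each TF-IDF row has unit length under the default norm='l2'.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```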

In [15]:
df_train, df_test = train_test_split(score_df, train_size=0.80)

df_train, df_validation = train_test_split(df_train,train_size=0.85)

print('Num. train: {}'.format(len(df_train)))
print('Num. validacion: {}'.format(len(df_validation)))
print('Num. test: {}'.format(len(df_test)))

Num. train: 68000
Num. validacion: 12000
Num. test: 20000


In [16]:
df_train_lem, df_test_lem = train_test_split(score_df_lem, train_size=0.80)

df_train_lem, df_validation_lem = train_test_split(df_train_lem,train_size=0.85)

print('Num. train: {}'.format(len(df_train_lem)))
print('Num. validacion: {}'.format(len(df_validation_lem)))
print('Num. test: {}'.format(len(df_test_lem)))

Num. train: 68000
Num. validacion: 12000
Num. test: 20000
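
Note that the two splits above are drawn independently and without a fixed seed, so the stemmed and lemmatized variants are not evaluated on the same reviews. A possible way to make the comparison like-for-like is sketched below: split row positions once, with `random_state` and `stratify`, and reuse them for both frames. `stem_df` and `lem_df` are toy stand-ins for `score_df` and `score_df_lem`, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins: same rows and scores, different preprocessed text.
scores = np.repeat([1, 2, 3, 4, 5], 20)
stem_df = pd.DataFrame({'Text': [f'stem {i}' for i in range(100)], 'Score': scores})
lem_df = pd.DataFrame({'Text': [f'lemma {i}' for i in range(100)], 'Score': scores})

positions = np.arange(len(stem_df))

# Split the row positions once, reproducibly and stratified by score.
pos_rest, pos_test = train_test_split(
    positions, train_size=0.80, random_state=42, stratify=scores)
pos_train, pos_val = train_test_split(
    pos_rest, train_size=0.85, random_state=42, stratify=scores[pos_rest])

# Reuse the same positions for both preprocessing variants.
df_train, df_val, df_test = (stem_df.iloc[p] for p in (pos_train, pos_val, pos_test))
df_train_lem, df_val_lem, df_test_lem = (lem_df.iloc[p] for p in (pos_train, pos_val, pos_test))

print(len(df_train), len(df_val), len(df_test))
```

The proportions are the same as in the notebook (68% train, 12% validation, 20% test), but both variants now share identical train, validation and test rows.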


In [17]:
def train_classifier(classifier, Xtrain, Ytrain, Xtest, Ytest):
    classifier.fit(Xtrain,Ytrain)
    Ypredicted = classifier.predict(Xtest)
    print('Accuracy: {:.2f}'.format(accuracy_score(Ytest,Ypredicted)*100))
    print('Precision: {:.2f}'.format(precision_score(Ytest,Ypredicted,average='macro')*100))
    print('Recall: {:.2f}'.format(recall_score(Ytest,Ypredicted,average='macro')*100))
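
The function reports precision and recall with `average='macro'`, which is the unweighted mean of the per-class values, so every score counts equally regardless of how many samples it has. A minimal check on made-up labels:

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels for three classes, only for illustration.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

per_class = precision_score(y_true, y_pred, average=None)  # one value per class
macro = precision_score(y_true, y_pred, average='macro')   # their plain mean

print(per_class)
print(macro, per_class.mean())
```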

In [18]:
tf_vectorizer = CountVectorizer()
Xtrain = tf_vectorizer.fit_transform(df_train['Text'])
Xval = tf_vectorizer.transform(df_validation['Text'])
Xtest = tf_vectorizer.transform(df_test['Text'])

encoder = LabelEncoder()
Ytrain = encoder.fit_transform(df_train['Score'])
Yval = encoder.transform(df_validation['Score'])
Ytest = encoder.transform(df_test['Score'])
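
In the cell above the vectorizer is fitted on the training split only and then reused with `transform` on validation and test, so the vocabulary never sees held-out text. The sketch below, on made-up documents, shows the consequence: a word that appears only in the test document has no column and is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents; 'chewi' never appears in the training texts.
train_docs = ["oatmeal good", "candi stale"]
test_docs = ["oatmeal stale chewi"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # learns the vocabulary
X_test = vec.transform(test_docs)        # reuses it, no refitting

print(sorted(vec.vocabulary_))
print(X_test.toarray())
```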

RandomForest with stemming and CountVectorizer

In [19]:
rf = RandomForestClassifier(verbose=1,n_jobs=-1)
train_classifier(rf, Xtrain,Ytrain,Xtest,Ytest)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  3.9min finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.5s
Accuracy: 59.15
Precision: 59.02
Recall: 59.21
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    1.0s finished


In [20]:
bowTF_lem = CountVectorizer()
resTF_lem = bowTF_lem.fit_transform(score_df_lem['Text'])

In [21]:
tf_vectorizer = CountVectorizer()
Xtrain = tf_vectorizer.fit_transform(df_train_lem['Text'])
Xval = tf_vectorizer.transform(df_validation_lem['Text'])
Xtest = tf_vectorizer.transform(df_test_lem['Text'])

encoder = LabelEncoder()
Ytrain = encoder.fit_transform(df_train_lem['Score'])
Yval = encoder.transform(df_validation_lem['Score'])
Ytest = encoder.transform(df_test_lem['Score'])

  (0, 40087)	1
  (0, 47280)	1
  (0, 60672)	1
  (0, 27919)	1
  (0, 58458)	1
  (0, 8701)	1
  (0, 67947)	1
  (0, 6251)	1


RandomForest with lemmatization and CountVectorizer

In [22]:
rf = RandomForestClassifier(verbose=1,n_jobs=-1)
train_classifier(rf, Xtrain,Ytrain,Xtest,Ytest)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  4.3min finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.4s
Accuracy: 59.56
Precision: 59.55
Recall: 59.55
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.9s finished


In [23]:
tfidf_vectorizer = TfidfVectorizer()

Xtrain = tfidf_vectorizer.fit_transform(df_train['Text'])
Xval = tfidf_vectorizer.transform(df_validation['Text'])
Xtest = tfidf_vectorizer.transform(df_test['Text'])

encoder = LabelEncoder()
Ytrain = encoder.fit_transform(df_train['Score'])
Yval = encoder.transform(df_validation['Score'])
Ytest = encoder.transform(df_test['Score'])

RandomForest with stemming and TF-IDF

In [24]:
rf = RandomForestClassifier(verbose=1,n_jobs=-1)
train_classifier(rf, Xtrain,Ytrain,Xtest,Ytest)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  3.6min finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.3s
Accuracy: 59.27
Precision: 59.21
Recall: 59.32
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.9s finished


In [25]:
bowTFIDF_lem = TfidfVectorizer()
resTFIDF_lem = bowTFIDF_lem.fit_transform(score_df_lem['Text'])

In [26]:
tfidf_vectorizer = TfidfVectorizer()

Xtrain = tfidf_vectorizer.fit_transform(df_train_lem['Text'])
Xval = tfidf_vectorizer.transform(df_validation_lem['Text'])
Xtest = tfidf_vectorizer.transform(df_test_lem['Text'])

encoder = LabelEncoder()
Ytrain = encoder.fit_transform(df_train_lem['Score'])
Yval = encoder.transform(df_validation_lem['Score'])
Ytest = encoder.transform(df_test_lem['Score'])

RandomForest with lemmatization and TF-IDF

In [27]:
rf = RandomForestClassifier(verbose=1,n_jobs=-1)
train_classifier(rf, Xtrain,Ytrain,Xtest,Ytest)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  4.3min finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.4s
Accuracy: 59.46
Precision: 59.51
Recall: 59.45
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    1.0s finished
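
The four experiments above repeat the same vectorize, fit and score steps. As a hedged sketch of how they could be expressed more compactly, the loop below wraps each vectorizer and a random forest in a scikit-learn pipeline. The texts, labels, `n_estimators=10` and the seed are toy values chosen so the example runs quickly; they are not the notebook's settings, and the scores it prints say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Made-up preprocessed reviews and scores, only for illustration.
train_text = ["great tasty love", "love great flavor", "tasty good love",
              "stale awful bad", "bad taste awful", "awful stale waste"]
train_y = [5, 5, 5, 1, 1, 1]
test_text = ["great love", "awful bad"]
test_y = [5, 1]

vectorizers = {'count': CountVectorizer, 'tfidf': TfidfVectorizer}
results = {}

for name, vectorizer_cls in vectorizers.items():
    # The pipeline fits the vectorizer on the training text only.
    model = make_pipeline(
        vectorizer_cls(),
        RandomForestClassifier(n_estimators=10, random_state=0),
    )
    model.fit(train_text, train_y)
    results[name] = accuracy_score(test_y, model.predict(test_text))

print(results)
```

Running the same loop once per preprocessing variant (stemmed and lemmatized) would cover the four combinations evaluated above.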
