# Sentiment analysis

This notebook contains sentiment analysis models for movie review sites: IMDb for English and FilmAffinity for Spanish.

In [1]:
import pandas as pd
import re

from nltk.corpus import stopwords

In [1]:
# Read the scraped IMDb reviews and map each rating to a sentiment class.
def readDataIMDB(inputSize):
    data = []
    for file in range(1, inputSize + 1):
        path = "Datasets scrapeados/IMDb_reviews_0" + str(file) + "_03_2020.txt"
        # Each line has the form "<rating>;<review text>".
        with open(path, 'r', encoding='latin') as f:
            for line in f:
                splitedLine = line.split(';', maxsplit=1)
                clasification = int(splitedLine[0])
                # Ratings 1-4 -> class 0, 5-6 -> class 1, anything else -> class 2.
                if clasification >= 1 and clasification <= 4:
                    clasification = 0
                elif clasification == 5 or clasification == 6:
                    clasification = 1
                else:
                    clasification = 2
                data.append([splitedLine[1], clasification])
    return data
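The label mapping inside `readDataIMDB` can be isolated and checked on its own. The sketch below is only an illustration, assuming each line has the form `rating;review text` as the parsing code implies. `rating_to_class` and `parse_line` are hypothetical helper names that do not appear in the notebook, and reading the three classes as negative, neutral and positive is an interpretation, not something the source states.

```python
def rating_to_class(rating):
    """Map a rating to class 0, 1 or 2 (read here as negative/neutral/positive)."""
    if 1 <= rating <= 4:
        return 0
    if rating in (5, 6):
        return 1
    return 2


def parse_line(line):
    """Split 'rating;review' on the first ';' only, so reviews may contain ';'."""
    rating, review = line.split(';', maxsplit=1)
    return review, rating_to_class(int(rating))


print(parse_line("10;Great film; would watch again."))
```

Using `maxsplit=1` matters here because review text frequently contains semicolons of its own.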

In [3]:
# Read the data.
dataFrame = pd.DataFrame(readDataIMDB(7), columns=['review', 'calification'])

In [4]:
dataFrame.head()

Unnamed: 0,review,calification
0,It's not really a review but my attempt to exp...,2
1,I am remarkably stingy with my 10/10 ratings. ...,2
2,I can't remember the last time I saw a movie t...,2
3,This movie is a gosh darn masterpiece. It will...,2
4,Parasite was directed and written by Bong Joon...,2


In [5]:
# Remove periods and other punctuation marks.
# Raw strings avoid invalid-escape warnings for sequences such as \s and \[.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    
    return reviews

dataFrame['review_clean'] = dataFrame['review'].apply(lambda review: REPLACE_NO_SPACE.sub("", review.lower()))

In [6]:
dataFrame['review_clean'] = dataFrame['review_clean'].apply(lambda review: REPLACE_WITH_SPACE.sub(" ", review))
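To see what the two regular expressions do in combination, the following sketch applies them to a single string. `clean_review` is a hypothetical helper name; the patterns are the same as in the cells above, written as raw strings.

```python
import re

REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")


def clean_review(text):
    """Lowercase, drop punctuation, then turn breaks, '-' and '/' into spaces."""
    text = REPLACE_NO_SPACE.sub("", text.lower())
    return REPLACE_WITH_SPACE.sub(" ", text)


print(clean_review("I can't rate it 10/10. Well-made!"))
# prints: i cant rate it 10 10 well made
```

Note that apostrophes are deleted instead of being replaced by a space, which is why "can't" becomes "cant" in the cleaned column.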

In [7]:
dataFrame.head()

Unnamed: 0,review,calification,review_clean
0,It's not really a review but my attempt to exp...,2,its not really a review but my attempt to expl...
1,I am remarkably stingy with my 10/10 ratings. ...,2,i am remarkably stingy with my 10 10 ratings i...
2,I can't remember the last time I saw a movie t...,2,i cant remember the last time i saw a movie th...
3,This movie is a gosh darn masterpiece. It will...,2,this movie is a gosh darn masterpiece it will ...
4,Parasite was directed and written by Bong Joon...,2,parasite was directed and written by bong joon...


In [8]:
# Note: removing every stop word is not always beneficial,
# so a custom stop-word list may be worth defining.

# Stop words are removed because doing so generally improves training.
# A set makes the membership test below much faster than a list.
english_stop_words = set(stopwords.words('english'))

#dataFrame['review'].apply(lambda x: [item for item in x if item not in english_stop_words])
dataFrame['review_without_stop'] = dataFrame['review_clean'].apply(lambda x: ' '.join([word for word in x.split() if word not in (english_stop_words)]))
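The comment in the cell above suggests that a custom stop-word list might work better than removing everything. One possible version, sketched below, keeps negations because they can flip the sentiment of a phrase. The word lists are small hypothetical examples; in practice the base set would come from `stopwords.words('english')`, and `remove_stop_words` is an illustrative helper name.

```python
# Hypothetical, hand-picked stop words; a real list would start from
# stopwords.words('english') minus the words one wants to keep.
BASE_STOP_WORDS = {"i", "the", "a", "is", "this", "not", "no", "but", "it"}
KEEP = {"not", "no"}  # negations often carry sentiment
CUSTOM_STOP_WORDS = BASE_STOP_WORDS - KEEP


def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token found in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)


print(remove_stop_words("this is not a good movie", BASE_STOP_WORDS))
# prints: good movie
print(remove_stop_words("this is not a good movie", CUSTOM_STOP_WORDS))
# prints: not good movie
```

With bigrams enabled in the vectorizer, keeping "not" lets a feature such as "not good" exist at all. Whether that helps on this dataset would need to be measured.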

In [9]:
dataFrame.head()

Unnamed: 0,review,calification,review_clean,review_without_stop
0,It's not really a review but my attempt to exp...,2,its not really a review but my attempt to expl...,really review attempt explain interpreted movi...
1,I am remarkably stingy with my 10/10 ratings. ...,2,i am remarkably stingy with my 10 10 ratings i...,remarkably stingy 10 10 ratings ill first pers...
2,I can't remember the last time I saw a movie t...,2,i cant remember the last time i saw a movie th...,cant remember last time saw movie contained ma...
3,This movie is a gosh darn masterpiece. It will...,2,this movie is a gosh darn masterpiece it will ...,movie gosh darn masterpiece make belly laugh c...
4,Parasite was directed and written by Bong Joon...,2,parasite was directed and written by bong joon...,parasite directed written bong joon ho tells s...


In [10]:
# Lemmatize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

# For the lemmatizer to work well, it must be told the part of speech of each word.
# This function returns the part of speech of a single word.
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [11]:
lemmatizer.lemmatize('stingy', get_wordnet_pos('stingy'))

'stingy'

In [12]:
dataFrame['review_lemmatized'] = dataFrame['review_without_stop'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in x.split()]))

In [13]:
dataFrame.head()

Unnamed: 0,review,calification,review_clean,review_without_stop,review_lemmatized
0,It's not really a review but my attempt to exp...,2,its not really a review but my attempt to expl...,really review attempt explain interpreted movi...,really review attempt explain interpret movie ...
1,I am remarkably stingy with my 10/10 ratings. ...,2,i am remarkably stingy with my 10 10 ratings i...,remarkably stingy 10 10 ratings ill first pers...,remarkably stingy 10 10 rating ill first perso...
2,I can't remember the last time I saw a movie t...,2,i cant remember the last time i saw a movie th...,cant remember last time saw movie contained ma...,cant remember last time saw movie contain many...
3,This movie is a gosh darn masterpiece. It will...,2,this movie is a gosh darn masterpiece it will ...,movie gosh darn masterpiece make belly laugh c...,movie gosh darn masterpiece make belly laugh c...
4,Parasite was directed and written by Bong Joon...,2,parasite was directed and written by bong joon...,parasite directed written bong joon ho tells s...,parasite direct write bong joon ho tell story ...


In [14]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [15]:
dataFrame['review_stemmed'] = dataFrame['review_without_stop'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))

In [10]:
#dataFrame.to_csv('datoslimpios.csv')

The already processed data is loaded. Processing takes considerable time, mainly the lemmatization step.
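Because `get_wordnet_pos` tags each word in isolation, the lemma depends only on the word itself, so repeated words could be served from a cache instead of being recomputed. The sketch below shows the idea with `functools.lru_cache`. `slow_lemma` is a hypothetical stand-in with a toy rule, not WordNet; in the notebook its body would be the call `lemmatizer.lemmatize(word, get_wordnet_pos(word))`. How much time this saves on the real data has not been measured here.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive function really runs


def slow_lemma(word):
    """Stand-in for lemmatizer.lemmatize(word, get_wordnet_pos(word))."""
    global calls
    calls += 1
    return word[:-1] if word.endswith("s") else word  # toy rule, not WordNet


@lru_cache(maxsize=None)
def cached_lemma(word):
    """Compute the lemma once per distinct word and reuse it afterwards."""
    return slow_lemma(word)


def lemmatize_text(text):
    return ' '.join(cached_lemma(word) for word in text.split())


print(lemmatize_text("movies movies ratings movies"))
# prints: movie movie rating movie
print(calls)
# prints: 2
```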

In [11]:
# Read the already processed data.
dataFrame = pd.read_csv('datoslimpios.csv')

In [13]:
dataFrame.head()

Unnamed: 0.1,Unnamed: 0,calification,review_clean,review_without_stop,review_lemmatized,review_stemmed
0,87259,2,this is one of the masterpiece bengali movie i...,one masterpiece bengali movie didnt bored sing...,one masterpiece bengali movie didnt bore singl...,one masterpiec bengali movi didnt bore singl m...
1,45630,1,honneponnetje is a completely meaningless but ...,honneponnetje completely meaningless entertain...,honneponnetje completely meaningless entertain...,honneponnetj complet meaningless entertain 80 ...
2,133723,2,we see a family that gradually starts to feed ...,see family gradually starts feed privileged fa...,see family gradually start feed privileged fam...,see famili gradual start feed privileg famili ...
3,67367,2,i saw dirty mary at the venice beach film fest...,saw dirty mary venice beach film festival love...,saw dirty mary venice beach film festival love...,saw dirti mari venic beach film festiv love am...
4,6868,2,masculin feminin is a definitive example of fr...,masculin feminin definitive example french new...,masculin feminin definitive example french new...,masculin feminin definit exampl french new wav...


In [15]:
dataFrame.drop(columns=['Unnamed: 0'], inplace=True)

A seed is defined so that the results can be reproduced:

In [15]:
seed = 3

The dataset is very large, so the computer's memory may not be sufficient. A random sample of 50,000 rows is therefore used to build the model.

In [None]:
dataFrame = dataFrame.sample(50000, random_state= seed)

Of these, 40,000 are used for training and 10,000 for the test set.

Training, validation, and test sets are generated; the validation set is split off from the 40,000 training rows in a later cell.

In [18]:
dataFrameTrain = dataFrame[0:40000]
dataFrameTest = dataFrame[40000:]
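`DataFrame.sample` returns the selected rows in random order, so slicing the first 40,000 rows afterwards is itself a random train/test split. The pure-Python sketch below mirrors that idea at a smaller scale; `seeded_split` is a hypothetical helper, and the sizes are reduced for illustration.

```python
import random


def seeded_split(rows, sample_size, train_size, seed):
    """Draw a reproducible random sample, then slice it into train and test."""
    rng = random.Random(seed)
    sample = rng.sample(rows, sample_size)  # random order, no repeats
    return sample[:train_size], sample[train_size:]


rows = list(range(100))  # stand-in for the row indices of a larger dataset
train, test = seeded_split(rows, sample_size=50, train_size=40, seed=3)
print(len(train), len(test))
# prints: 40 10
```

Reusing the same seed gives the same split on every run, which is what makes the reported accuracies repeatable.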

In [19]:
# Define the training data.
# First, the dataset without stop words but with no other preprocessing is used.
X = dataFrameTrain['review_without_stop']
y = dataFrameTrain['calification']

A logistic regression model is used.

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(X)
X = ngram_vectorizer.transform(X)
target = y
# Split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80, random_state = seed)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c, max_iter=200, random_state=seed, solver = 'lbfgs')
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

Accuracy for C=0.01: 0.824
Accuracy for C=0.05: 0.83125
Accuracy for C=0.25: 0.833375
Accuracy for C=0.5: 0.8335
Accuracy for C=1: 0.833
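The vectorizer above is configured with `binary=True` and `ngram_range=(1, 2)`, meaning each feature records whether a given word or word pair appears in a review at all. The following pure-Python sketch illustrates that representation. It is a simplification, not the scikit-learn implementation: it splits on whitespace, whereas `CountVectorizer`'s default token pattern also drops single-character tokens. `ngrams` and `binary_features` are hypothetical helper names.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def binary_features(text, vocabulary):
    """1 if the n-gram occurs in the text at least once, else 0."""
    present = set(ngrams(text.split()))
    return [1 if term in present else 0 for term in vocabulary]


docs = ["good movie good cast", "bad movie"]
vocabulary = sorted({g for d in docs for g in ngrams(d.split())})
print(vocabulary)
# prints: ['bad', 'bad movie', 'cast', 'good', 'good cast', 'good movie', 'movie', 'movie good']
print([binary_features(d, vocabulary) for d in docs])
# prints: [[0, 0, 1, 1, 1, 1, 1, 1], [1, 1, 0, 0, 0, 0, 1, 0]]
```

The word "good" occurs twice in the first document but still contributes a single 1, which is the effect of the binary setting.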


Next, the model is trained on the lemmatized set.

In [21]:
# Define the training data. In this case, the lemmatized text.
X = dataFrameTrain['review_lemmatized']
y = dataFrameTrain['calification']

In [22]:
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(X)
X = ngram_vectorizer.transform(X)
target = y
# Split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80, random_state = seed)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c, max_iter=200, random_state = seed, solver = 'lbfgs')
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

Accuracy for C=0.01: 0.822875
Accuracy for C=0.05: 0.834375
Accuracy for C=0.25: 0.834875
Accuracy for C=0.5: 0.8355
Accuracy for C=1: 0.835375
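Accuracy values like these are easier to interpret next to the score of a trivial classifier that always predicts the most frequent class. A minimal sketch follows; the label list is a made-up toy example, not the notebook's data, and `majority_baseline_accuracy` is a hypothetical helper. In the notebook the same function could be applied to `y_val`.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


toy_labels = [2, 2, 2, 1, 0, 2, 2, 0, 2, 1]  # illustrative only
print(majority_baseline_accuracy(toy_labels))
# prints: 0.6
```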


Finally, the model is trained on the stemmed set.

In [26]:
# Define the training data. In this case, the stemmed text.
X = dataFrameTrain['review_stemmed']
y = dataFrameTrain['calification']

In [27]:
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(X)
X = ngram_vectorizer.transform(X)
target = y
# Split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80, random_state = seed)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c, max_iter=200, solver = 'lbfgs', random_state = seed)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

Accuracy for C=0.01: 0.824
Accuracy for C=0.05: 0.835
Accuracy for C=0.25: 0.833
Accuracy for C=0.5: 0.83325
Accuracy for C=1: 0.83375


For the logistic regression model, the best results are obtained by lemmatizing the words, with C equal to 0.5.
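The selection can be double-checked by collecting the validation accuracies printed by the three runs above and taking the maximum. The numbers in the sketch are copied from those outputs; the dictionary layout and key names are just one convenient way to hold them.

```python
# Validation accuracies copied from the three runs above, keyed by C.
results = {
    "without_stop": {0.01: 0.824, 0.05: 0.83125, 0.25: 0.833375, 0.5: 0.8335, 1: 0.833},
    "lemmatized": {0.01: 0.822875, 0.05: 0.834375, 0.25: 0.834875, 0.5: 0.8355, 1: 0.835375},
    "stemmed": {0.01: 0.824, 0.05: 0.835, 0.25: 0.833, 0.5: 0.83325, 1: 0.83375},
}

# Flatten to (preprocessing, C, accuracy) triples and pick the highest accuracy.
best = max(
    ((name, c, acc) for name, by_c in results.items() for c, acc in by_c.items()),
    key=lambda item: item[2],
)
print(best)
# prints: ('lemmatized', 0.5, 0.8355)
```

The margins are narrow. With 8,000 validation reviews (20% of 40,000), the gap between 0.8355 and the runner-up 0.835375 corresponds to a single review, so the ranking of the top configurations should be read with some caution.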

Defining the final model:

In [28]:
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(dataFrame['review_lemmatized'])

CountVectorizer(analyzer='word', binary=True, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [29]:
X = ngram_vectorizer.transform(dataFrame['review_lemmatized'])
y = dataFrame['calification']

In [33]:
X_train = X[0:40000]
y_train = y[0:40000]
X_test = X[40000:]
y_test = y[40000:]

In [35]:
lr = LogisticRegression(C=0.5, max_iter=200, solver = 'lbfgs', random_state = seed)
lr.fit(X_train, y_train)
# Report the C actually used; the loop variable c still holds 1 from the earlier search.
print ("Accuracy for C=%s: %s" 
       % (0.5, accuracy_score(y_test, lr.predict(X_test))))

Accuracy for C=0.5: 0.8382
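In the final model the vectorizer is fitted on all 50,000 sampled rows, including the 10,000 later used as the test set. Only the vocabulary is learned from those rows, not the labels, so the effect on the reported accuracy is probably small, but a stricter evaluation would fit on the training rows only. The pure-Python sketch below illustrates that ordering; `fit_vocabulary` and `transform` are hypothetical stand-ins for `CountVectorizer.fit` and `CountVectorizer.transform`, restricted to unigrams for brevity.

```python
def fit_vocabulary(texts):
    """Learn a token -> column index mapping from the given texts only."""
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(texts, vocabulary):
    """Binary bag-of-words rows; tokens unseen at fit time are ignored."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for tok in text.split():
            if tok in vocabulary:
                row[vocabulary[tok]] = 1
        rows.append(row)
    return rows


train_texts = ["great movie", "boring movie"]
test_texts = ["great unseenword"]

vocabulary = fit_vocabulary(train_texts)  # test rows are not used here
print(vocabulary)
# prints: {'boring': 0, 'great': 1, 'movie': 2}
print(transform(test_texts, vocabulary))
# prints: [[0, 1, 0]]
```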


### Filmaffinity

In [1]:
# Read the scraped FilmAffinity reviews and map each rating to a sentiment class.
def readDataFilmaffinity():
    data = []
    for file in range(7, 9):
        path = "Filmaffinity/filmaffinity_reviews_0" + str(file) + "_03_2020.txt"
        # Each line has the form "<rating>;<review text>".
        with open(path, 'r', encoding='utf8') as f:
            for line in f:
                splitedLine = line.split(';', maxsplit=1)
                clasification = int(splitedLine[0])
                # Ratings 1-4 -> class 0, 5-6 -> class 1, anything else -> class 2.
                if clasification >= 1 and clasification <= 4:
                    clasification = 0
                elif clasification == 5 or clasification == 6:
                    clasification = 1
                else:
                    clasification = 2
                data.append([splitedLine[1], clasification])
    return data

In [4]:
# Read the data.
dataFrame = pd.DataFrame(readDataFilmaffinity(), columns=['review', 'calification'])

In [5]:
dataFrame.head()

Unnamed: 0,review,calification
0,Pensé que iba a tratarse de otro telefilm conv...,1
1,"La verdad, no sé qué espera la gente cuando ve...",1
2,"La peli tiene un humor bastante poco ""correcto...",2
3,"""Origen"" no es lo mismo que ""Torrente 2"". Ni ""...",2
4,"Me reí mucho viendo a Ben Stiller, vaya crack....",1


In [6]:
# Remove periods and other punctuation marks.
# Raw strings avoid invalid-escape warnings for sequences such as \s and \[.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    
    return reviews

dataFrame['review_clean'] = dataFrame['review'].apply(lambda review: REPLACE_NO_SPACE.sub("", review.lower()))
dataFrame['review_clean'] = dataFrame['review_clean'].apply(lambda review: REPLACE_WITH_SPACE.sub(" ", review))

In [7]:
# Remove accents (diacritics).
dataFrame['review_clean'] = dataFrame['review_clean'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
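The accent-removal step chains Unicode NFKD normalization with an ASCII encode that ignores anything it cannot represent. The same logic on a plain string, using only the standard library, looks roughly like this; `strip_accents` is a hypothetical helper name.

```python
import unicodedata


def strip_accents(text):
    """Decompose accented characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', errors='ignore').decode('utf-8')


print(strip_accents("pensé que reí"))
# prints: pense que rei
print(strip_accents("año"))
# prints: ano
```

Two side effects are worth knowing: "ñ" is reduced to "n", which can merge distinct words, and characters without an ASCII decomposition (for example "¿") are removed entirely.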

In [10]:
# Stop words are removed because doing so generally improves training.
# A set makes the membership test below much faster than a list.
spanish_stop_words = set(stopwords.words('spanish'))

dataFrame['review_without_stop'] = dataFrame['review_clean'].apply(lambda x: ' '.join([word for word in x.split() if word not in (spanish_stop_words)]))
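One detail to watch in this ordering: accents are stripped from the reviews before stop words are removed, so any stop word that is stored with an accent can no longer match its stripped form in the text. The sketch below demonstrates the effect with a hypothetical three-word list standing in for `stopwords.words('spanish')`; whether and how many accented entries the NLTK list actually contains should be checked against the installed corpus.

```python
import unicodedata


def strip_accents(text):
    """Same normalization as applied to the reviews."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', errors='ignore').decode('utf-8')


# Hypothetical three-word list standing in for stopwords.words('spanish').
stop_words = ["también", "está", "de"]
normalized_stop_words = {strip_accents(word) for word in stop_words}

text = "tambien esta llena de humor"  # review text after accent removal
tokens = text.split()

print([t for t in tokens if t not in stop_words])
# prints: ['tambien', 'esta', 'llena', 'humor']
print([t for t in tokens if t not in normalized_stop_words])
# prints: ['llena', 'humor']
```

Normalizing the stop-word list with the same function as the text keeps the two in agreement.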

In [12]:
from nltk import SnowballStemmer
spanishstemmer=SnowballStemmer("spanish")

dataFrame['review_stemmed'] = dataFrame['review_without_stop'].apply(lambda x: ' '.join([spanishstemmer.stem(word) for word in x.split()]))

In [13]:
dataFrame.head()

Unnamed: 0,review,calification,review_clean,review_without_stop,review_stemmed
0,Pensé que iba a tratarse de otro telefilm conv...,1,pense que iba a tratarse de otro telefilm conv...,pense iba tratarse telefilm convencional solo ...,pens iba trat telefilm convencional sol salv p...
1,"La verdad, no sé qué espera la gente cuando ve...",1,la verdad no se que espera la gente cuando ve ...,verdad espera gente ve peli dos horas guiones ...,verd esper gent ve peli dos hor guion sup curr...
2,"La peli tiene un humor bastante poco ""correcto...",2,la peli tiene un humor bastante poco correcto ...,peli humor bastante correcto tiempos corren pa...,peli humor bastant correct tiemp corr par reir...
3,"""Origen"" no es lo mismo que ""Torrente 2"". Ni ""...",2,origen no es lo mismo que torrente 2 ni gladia...,origen mismo torrente 2 gladiator mismo estupi...,orig mism torrent 2 gladiator mism estup pelic...
4,"Me reí mucho viendo a Ben Stiller, vaya crack....",1,me rei mucho viendo a ben stiller vaya crack l...,rei viendo ben stiller vaya crack peli momento...,rei viend ben still vay crack peli moment absu...


In [3]:
#dataFrame.to_csv('datoslimpiosFilmaffinity.csv')
dataFrame = pd.read_csv('datoslimpiosFilmaffinity.csv')

In [5]:
seed = 3

The procedure used for the English reviews is replicated.

In [6]:
dataFrame = dataFrame.sample(50000, random_state= seed)

In [7]:
dataFrameTrain = dataFrame[0:40000]
dataFrameTest = dataFrame[40000:]

In [8]:
# Define the training data.
# First, the dataset without stop words but with no other preprocessing is used.
X = dataFrameTrain['review_without_stop']
y = dataFrameTrain['calification']

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(X)
X = ngram_vectorizer.transform(X)
target = y
# Split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80, random_state = seed)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c, max_iter=200, random_state=seed, solver = 'lbfgs')
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

Accuracy for C=0.01: 0.794
Accuracy for C=0.05: 0.805125
Accuracy for C=0.25: 0.805375
Accuracy for C=0.5: 0.805875
Accuracy for C=1: 0.80525


In [21]:
# Define the training data. In this case, the stemmed text.
X = dataFrameTrain['review_stemmed']
y = dataFrameTrain['calification']

In [22]:
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(X)
X = ngram_vectorizer.transform(X)
target = y
# Split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80, random_state = seed)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c, max_iter=200, solver = 'lbfgs', random_state = seed)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

Accuracy for C=0.01: 0.79625
Accuracy for C=0.05: 0.809
Accuracy for C=0.25: 0.8075
Accuracy for C=0.5: 0.807875
Accuracy for C=1: 0.80675


For the Spanish reviews, the best value of C was 0.05, on the stemmed set. The final model is built with this configuration.

In [10]:
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(dataFrame['review_stemmed'])

CountVectorizer(analyzer='word', binary=True, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [11]:
X = ngram_vectorizer.transform(dataFrame['review_stemmed'])
y = dataFrame['calification']

In [12]:
X_train = X[0:40000]
y_train = y[0:40000]
X_test = X[40000:]
y_test = y[40000:]

In [14]:
lr = LogisticRegression(C=0.05, max_iter=200, solver = 'lbfgs', random_state = seed)
lr.fit(X_train, y_train)
print ("Accuracy for C=%s: %s" 
       % (0.05, accuracy_score(y_test, lr.predict(X_test))))

Accuracy for C=0.05: 0.8072


The previous models used logistic regression. We can check whether another model performs better; here, a random forest is tried.

In [21]:
from sklearn.ensemble import RandomForestClassifier as RFC

In [25]:
X = dataFrame['review_without_stop']
y = dataFrame['calification']

In [26]:
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(X)
X = ngram_vectorizer.transform(X)
target = y
# Split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.80)

for n in [10, 30, 50, 100, 200]:
    
    model = RFC(n_estimators=n)
    model.fit(X_train, y_train)
    print ("Accuracy for n_estimators=%s: %s" 
           % (n, accuracy_score(y_val, model.predict(X_val))))

Accuracy for n_estimators=10: 0.7906
Accuracy for n_estimators=30: 0.7852
Accuracy for n_estimators=50: 0.7834
Accuracy for n_estimators=100: 0.7842


KeyboardInterrupt: 