# Sentiment Analysis on Game Reviews

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ohtar10/icesi-nlp/blob/main/Sesion1/7-sentiment-analysis.ipynb)

Now let's put some of these concepts into practice on a more realistic case. In this exercise we will run sentiment analysis on a small set of opinions about Battlefield 6, drawn from outlets such as PC Gamer and GamesRadar. At its core this is a simple binary classification task (`pos` vs. `neg`), and any classifier could be used for it; the extra work here is preprocessing the text inputs.

Let's start by loading the dataset:

In [56]:
import pandas as pd
import numpy as np

reviews = pd.read_csv('./bf6_opinions_clean.tsv', sep='\t')
reviews.head()

Unnamed: 0,source,sentiment,speaker,review,url
0,PC Gamer / official preview,pos,Morgan Park,"After 4 hours with Battlefield 6, I think Batt...",https://www.pcgamer.com/games/fps/after-4-hour...
1,DLCompare (news),pos,Community writer,The community has embraced Battlefield 6’s ret...,https://www.dlcompare.com/gaming-news/battlefi...
2,PC Gamer (news),neg,Tester / community,"Raised concerns about a 'zero recoil' LMG, tho...",https://www.pcgamer.com/games/fps/people-are-f...
3,GamesRadar (news),neg,Analyst / writer,Raises questions if the new fast-paced movemen...,https://www.gamesradar.com/games/battlefield/b...
4,FandomWire (review),neg,Murky‑String1114,The gunplay is quite underwhelming... it feels...,https://fandomwire.com/battlefield-6-beta-feed...


Next, let's do some cleanup: we drop the columns we don't need and remove nulls and empty values:

In [57]:
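# Keep only the label and the text; the metadata columns are not used here.
reviews.drop(columns=['source', 'speaker', 'url'], inplace=True)
# Drop rows with missing values, then trim surrounding whitespace.
reviews.dropna(inplace=True)
reviews['review'] = reviews['review'].str.strip()
# Remove reviews that are empty after trimming.
blanks = reviews[reviews.review == ''].index
reviews.drop(blanks, inplace=True)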

In [58]:
reviews[reviews.review == ''].index

Index([], dtype='int64')

In [59]:
reviews.sentiment.value_counts()

sentiment
pos    5
neg    5
Name: count, dtype: int64

We have a balanced dataset, but a very small one: only 5 examples per class.

To keep things simple, we will use VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment model, to compute a positive or negative score for each review. An implementation already ships with NLTK.

In [60]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/luis/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [61]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
reviews['scores'] = reviews.review.apply(lambda r: sid.polarity_scores(r))
reviews.head()

Unnamed: 0,sentiment,review,scores
0,pos,"After 4 hours with Battlefield 6, I think Batt...","{'neg': 0.342, 'neu': 0.658, 'pos': 0.0, 'comp..."
1,pos,The community has embraced Battlefield 6’s ret...,"{'neg': 0.109, 'neu': 0.588, 'pos': 0.303, 'co..."
2,neg,"Raised concerns about a 'zero recoil' LMG, tho...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
3,neg,Raises questions if the new fast-paced movemen...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
4,neg,The gunplay is quite underwhelming... it feels...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
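
For some intuition about the `compound` field: as we understand it, the reference VADER implementation (and NLTK's port of it) sums the adjusted valence of the words in the text and squashes that sum into the range [-1, 1] with x / sqrt(x² + α), using α = 15. The sketch below reproduces only that normalization step in plain Python. The function name `normalize_compound` and the sample sums are illustrative assumptions, not values taken from our dataset.

```python
import math

def normalize_compound(valence_sum: float, alpha: float = 15.0) -> float:
    """Squash a raw valence sum into [-1, 1] (assumed VADER-style normalization)."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clip defensively against floating-point rounding at the extremes.
    return max(-1.0, min(1.0, score))

# Hypothetical raw sums, from strongly negative to strongly positive.
for raw in (-8.0, -2.0, 0.0, 2.0, 8.0):
    print(f"{raw:+.1f} -> {normalize_compound(raw):+.4f}")
```

A raw sum of 0, which is what we get when no word in the review appears in the lexicon, maps to a compound of exactly 0.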


With these scores we can now turn the result into a predicted label. We take the `compound` score and label anything above 0 as `pos` and everything else as `neg`:

In [62]:
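# Extract the overall (compound) score and threshold it at 0.
reviews['compound'] = reviews.scores.apply(lambda s: s['compound'])
# Note: a compound of exactly 0 is labeled 'neg'.
reviews['prediction'] = reviews['compound'].apply(lambda c: 'pos' if c > 0 else 'neg')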
reviews.head()

Unnamed: 0,sentiment,review,scores,compound,prediction
0,pos,"After 4 hours with Battlefield 6, I think Batt...","{'neg': 0.342, 'neu': 0.658, 'pos': 0.0, 'comp...",-0.6369,neg
1,pos,The community has embraced Battlefield 6’s ret...,"{'neg': 0.109, 'neu': 0.588, 'pos': 0.303, 'co...",0.6808,pos
2,neg,"Raised concerns about a 'zero recoil' LMG, tho...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neg
3,neg,Raises questions if the new fast-paced movemen...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neg
4,neg,The gunplay is quite underwhelming... it feels...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neg
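
The `c > 0` rule above forces every review into one of two classes, so a compound of exactly 0 ends up as `neg`. A common alternative, suggested in the VADER project's documentation, treats compound scores near zero as neutral, with ±0.05 as the usual cutoffs. Below is a minimal sketch of that rule. The function name `label_from_compound` and the `'neu'` label are our own choices for illustration; since our dataset has no neutral class, this would mainly help to flag reviews on which VADER has no opinion.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to 'pos', 'neg', or 'neu'."""
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'

# Compound scores copied from the first five rows shown above.
for c in (-0.6369, 0.6808, 0.0, 0.0, 0.0):
    print(c, label_from_compound(c))
```

With this rule, the three rows that scored 0.0 would be reported as `neu` instead of being counted as `neg` predictions.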


Finally, let's compute a few quality metrics for the model:

In [64]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_true = reviews.sentiment.values
y_pred = reviews.prediction.values

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
cr = classification_report(y_true, y_pred)


print(f"Accuracy:\n{acc}\n")
print(f"Classification Report:\n{cr}")
print(f"Confusion Matrix:\n{cm}")

Accuracy:
0.6

Classification Report:
              precision    recall  f1-score   support

         neg       0.57      0.80      0.67         5
         pos       0.67      0.40      0.50         5

    accuracy                           0.60        10
   macro avg       0.62      0.60      0.58        10
weighted avg       0.62      0.60      0.58        10

Confusion Matrix:
[[4 1]
 [3 2]]


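The accuracy is not great: at 0.6 it is only slightly above the 50% baseline, and with just 10 examples that margin is not conclusive, so there is plenty of room for improvement. The problem is mostly with the positive class: 3 of the 5 positive reviews were predicted as `neg` (recall 0.40). Note also that the three negative reviews shown earlier scored exactly 0 and count as correct only because a score of 0 defaults to `neg`.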
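
To see where the numbers in the classification report come from, we can recompute them by hand from the confusion matrix. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, both in sorted order (`neg`, `pos`). The helper name `prf` is ours.

```python
# Confusion matrix from the output above: rows = true, columns = predicted.
labels = ['neg', 'pos']
cm = [[4, 1],
      [3, 2]]

def prf(cm, i):
    """Precision, recall and F1 for the class at index i."""
    tp = cm[i][i]
    fp = sum(cm[r][i] for r in range(len(cm))) - tp  # predicted as i, but wrong
    fn = sum(cm[i]) - tp                              # truly i, but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
# Always predicting the most frequent class gives the baseline accuracy.
baseline = max(sum(row) for row in cm) / total

print(f"accuracy={accuracy:.2f} baseline={baseline:.2f}")
for i, name in enumerate(labels):
    p, r, f = prf(cm, i)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The second row of the matrix is the one to watch: of the 5 truly positive reviews, 3 land in the `neg` column, which is exactly the 0.40 recall reported for `pos`.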