# Sentiment Analysis of Movie Reviews

# Version: Ivan Moran - Erik Vergara


Now let's put some of these concepts into practice on a more realistic case. In this exercise we run sentiment analysis on a set of movie reviews. This is a simple binary classification task, and any classifier could be used for it; the extra work here is pre-processing the text inputs.

### References
* [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action)

In [1]:
# Install only the required libraries
!pip install nltk==3.9.1



In [4]:
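# Download the dataset when running in Colab (assumes IN_COLAB was defined earlier in the session)
!test '{IN_COLAB}' = 'True' && wget https://github.com/Ohtar10/icesi-nlp/raw/refs/heads/main/Sesion1/moviereviews.tsv

Let's start by loading the dataset: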

In [7]:
import pandas as pd
import numpy as np

reviews = pd.read_csv('./moviereviews.tsv', sep='\t')
reviews.head()

| | label | review |
|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


## Data Cleaning

Before the analysis, we run a basic cleaning pass to make sure the dataset is usable.

The steps are:
- Removing null values
- Removing empty reviews
- Basic text normalization (stripping leading and trailing whitespace)

This step prevents errors in the later processing and modeling stages.

In [8]:
reviews.dropna(inplace=True)
reviews.review = reviews.review.apply(lambda r: r.strip())
blanks = reviews[reviews.review == ''].index
reviews.drop(blanks, inplace=True)

In [9]:
reviews[reviews.review == ''].index

Index([], dtype='int64')
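
The order of the cleaning steps matters. As a minimal sketch on a toy DataFrame (the rows below are invented for illustration and are not part of `moviereviews.tsv`): a review that contains only spaces is not null, so `dropna` keeps it, and it only shows up as an empty string after the whitespace is stripped.

```python
import pandas as pd

# Toy data (hypothetical rows, not from moviereviews.tsv)
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["  great film  ", None, "   ", "dull plot"],
})

# 1. Drop rows with missing values (removes the None review)
# 2. Strip surrounding whitespace; the whitespace-only review becomes ''
clean = toy.dropna().assign(review=lambda d: d["review"].str.strip())

# 3. Drop rows whose review is now an empty string
clean = clean[clean["review"] != ""]

print(clean)
```

Only the first and last rows survive, with the first one trimmed to `great film`.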

In [10]:
reviews.label.value_counts()

| label | count |
|---|---|
| neg | 969 |
| pos | 969 |


## Applying VADER for Sentiment Analysis

The dataset is balanced, with 969 examples per class.

For this first approach we use **VADER (Valence Aware Dictionary and sEntiment Reasoner)**, a rule- and lexicon-based model that estimates the polarity of a text automatically.

VADER was designed for sentiment analysis of short texts and ships with the NLTK library.

The model computes four metrics:

- **neg** → proportion of negative content
- **neu** → proportion of neutral content
- **pos** → proportion of positive content
- **compound** → overall normalized score between -1 and 1

The `compound` value is typically used as the final polarity indicator.
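
As a rough illustration of why `compound` stays within [-1, 1]: the VADER reference implementation sums the valence of the words in the text and squashes that sum with x / sqrt(x² + α), where α defaults to 15. The sketch below reproduces only that normalization step. The `raw_sum` values are invented, and the real analyzer also applies rules for negation, punctuation, and intensifiers before reaching this step.

```python
import math

def normalize(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the interval [-1, 1]."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    return max(-1.0, min(1.0, score))

# Hypothetical valence sums for a neutral, a mildly positive,
# and a strongly negative text
for raw_sum in (0.0, 1.9, -12.0):
    print(f"{raw_sum:6.1f} -> {normalize(raw_sum):+.4f}")
```

One consequence is that long texts, which accumulate large sums, tend to produce compound scores close to -1 or 1.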

In [11]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [12]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
reviews['scores'] = reviews.review.apply(lambda r: sid.polarity_scores(r))
reviews.head()

| | label | review | scores |
|---|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... | {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co... |
| 1 | neg | some talented actresses are blessed with a dem... | {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com... |
| 2 | pos | this has been an extraordinary year for austra... | {'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com... |
| 3 | pos | according to hollywood movies made in last few... | {'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co... |
| 4 | neg | my first press screening of 1998 and already i... | {'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co... |


## Converting Scores to Predicted Labels

Once we have VADER's `compound` score, we need to turn it into a binary label (`pos` / `neg`).

To do so, we define a **decision threshold** that sets the value above which a review counts as positive.

The thresholds usually recommended for VADER are:

- compound ≥ 0.05 → positive
- compound ≤ -0.05 → negative
- values in between → neutral

Since this is a binary classification task, we use 0.05 as the single cutoff: reviews scoring above it are labeled positive, and all others, including the neutral band, are labeled negative.
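
The difference between the three-way rule and the binary rule can be sketched as follows. The helper names `vader_three_way` and `binary_label` are hypothetical and exist only for this illustration; the sample scores are arbitrary.

```python
def vader_three_way(compound: float, threshold: float = 0.05) -> str:
    """Three-way rule: positive, negative, or neutral."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

def binary_label(compound: float, threshold: float = 0.05) -> str:
    """Binary rule: anything not above the threshold is labeled 'neg'."""
    return "pos" if compound > threshold else "neg"

for c in (-0.9125, -0.02, 0.03, 0.9951):
    print(c, vader_three_way(c), binary_label(c))
```

Scores inside the neutral band, such as -0.02 and 0.03, are the ones that change: the three-way rule calls them neutral, while the binary rule assigns them to the negative class.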

In [13]:
# Threshold raised from 0 to 0.05 to improve results on the negative class
reviews['compound'] = reviews.scores.apply(lambda s: s['compound'])
# Previous rule: reviews['prediction'] = reviews['compound'].apply(lambda c: 'pos' if c > 0 else 'neg')
reviews['prediction'] = reviews['compound'].apply(lambda c: 'pos' if c > 0.05 else 'neg')
reviews.head()

| | label | review | scores | compound | prediction |
|---|---|---|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... | {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co... | -0.9125 | neg |
| 1 | neg | some talented actresses are blessed with a dem... | {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com... | -0.8618 | neg |
| 2 | pos | this has been an extraordinary year for austra... | {'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com... | 0.9951 | pos |
| 3 | pos | according to hollywood movies made in last few... | {'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co... | 0.9972 | pos |
| 4 | neg | my first press screening of 1998 and already i... | {'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co... | -0.2484 | neg |


## Evaluating Model Performance

With the predictions in place, we evaluate the model by comparing the true labels (`label`) against the predicted labels (`prediction`).

We use standard classification metrics:

- Accuracy
- Precision
- Recall
- F1-score
- Confusion matrix
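
To make these metrics concrete, here is a minimal sketch that computes them by counting, without scikit-learn. The function `binary_metrics` and the toy labels are hypothetical and only serve the illustration.

```python
def binary_metrics(y_true, y_pred, positive="pos"):
    """Compute accuracy, precision, recall and F1 for one class by counting."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (invented for illustration)
toy_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
toy_pred = ["pos", "pos", "neg", "pos", "neg", "neg"]

toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Precision asks how many of the predicted positives are correct, while recall asks how many of the actual positives were found; F1 is their harmonic mean.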

In [21]:


from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# True and predicted labels
y_true = reviews.label.values
y_pred = reviews.prediction.values

# Metrics
acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
cr = classification_report(y_true, y_pred)

print(f"Accuracy:\n{acc:.4f}\n")
print("Classification Report:\n", cr)
print("Confusion Matrix:\n", cm)

Accuracy:
0.6357

Classification Report:
               precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938

Confusion Matrix:
 [[427 542]
 [164 805]]


## Conclusions for the VADER-Based Model

The model reaches an accuracy of about 63.6%, which is 13.6 percentage points above a random baseline (50% on this balanced dataset). Even so, performance remains limited.

The main problem is in the negative class:

- Recall on negatives: 0.44
- Recall on positives: 0.83

This means the model struggles to identify negative reviews and tends to classify them as positive.

In other words, there is a bias toward the positive class, which is also visible in the confusion matrix: 542 of the 969 negative reviews are predicted as positive.
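
The bias can be quantified directly from the confusion matrix reported above. The sketch below hard-codes those four counts (rows are the true class and columns the predicted class, both in the order `neg`, `pos`); the variable names are chosen for this example only.

```python
# Confusion matrix from the VADER evaluation:
# rows = true class, columns = predicted class, order [neg, pos]
cm_vader = [[427, 542],
            [164, 805]]

total = sum(sum(row) for row in cm_vader)
predicted_pos = cm_vader[0][1] + cm_vader[1][1]   # everything in the 'pos' column
neg_as_pos = cm_vader[0][1] / sum(cm_vader[0])    # negatives labeled positive
pos_as_neg = cm_vader[1][0] / sum(cm_vader[1])    # positives labeled negative

print(f"Predicted positive: {predicted_pos / total:.1%}")
print(f"Negative reviews labeled positive: {neg_as_pos:.1%}")
print(f"Positive reviews labeled negative: {pos_as_neg:.1%}")
```

About 69.5% of all reviews are predicted positive even though only half of them are, and the error rate on negatives (55.9%) is more than three times the error rate on positives (16.9%).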

## Supervised Model: TF-IDF + Logistic Regression

In this section we implement a supervised text classification model.

Unlike VADER (a rule-based model), we now train a model that learns patterns directly from the dataset.

The pipeline includes:

1. Stratified train/test split
2. TF-IDF vectorization
3. Training with Logistic Regression
4. Evaluation on the test set
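
The four steps can also be sketched compactly with scikit-learn's `make_pipeline`, which chains the vectorizer and the classifier so that TF-IDF is fitted on the training texts only. This is an alternative arrangement shown on a toy corpus invented for illustration, not the code used in this notebook; the default vectorizer settings are an assumption made to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical, not from moviereviews.tsv)
texts = [
    "a great and moving film", "excellent cast and a perfect ending",
    "hilarious and fun from start to finish", "a wonderful performance",
    "a boring and stupid plot", "the worst script of the year",
    "a ridiculous waste of time", "awful acting and a poor story",
]
labels = ["pos"] * 4 + ["neg"] * 4

# 1. Stratified split keeps the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# 2-3. Fit TF-IDF on the training texts, then train the classifier
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# 4. Evaluate on the held-out texts
print("Test accuracy:", pipe.score(X_te, y_te))
```

With only eight documents the accuracy itself is meaningless; the point is the structure, in which the test texts never influence the vocabulary or the IDF weights.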

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Train/test split (stratified)
X_train, X_test, y_train, y_test = train_test_split(
    reviews.review,
    reviews.label,
    test_size=0.2,
    random_state=42,
    stratify=reviews.label
)

# TF-IDF vectorizer
tfidf = TfidfVectorizer(
    stop_words='english',
    max_features=5000,
    ngram_range=(1, 2)
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Logistic Regression model
model = LogisticRegression(
    max_iter=1000,
    class_weight=None  # dataset is balanced
)

model.fit(X_train_tfidf, y_train)

# Predictions on the test set
y_pred = model.predict(X_test_tfidf)

# Store predictions in the dataframe (test rows only)
reviews.loc[X_test.index, 'prediction_ml'] = y_pred

In [16]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Evaluate the TF-IDF model on the held-out test set only.
# `y_test` comes from the train_test_split call in the previous cell.

y_true_tfidf = y_test  # true labels for the test set
y_pred_tfidf = model.predict(X_test_tfidf)  # predictions from the TF-IDF + Logistic Regression model

acc = accuracy_score(y_true_tfidf, y_pred_tfidf)
cm = confusion_matrix(y_true_tfidf, y_pred_tfidf)
cr = classification_report(y_true_tfidf, y_pred_tfidf)

print(f"Accuracy (TF-IDF Model):\n{acc}\n")
print(f"Classification Report (TF-IDF Model):\n{cr}")
print(f"Confusion Matrix (TF-IDF Model):\n{cm}")

Accuracy (TF-IDF Model):
0.8273195876288659

Classification Report (TF-IDF Model):
              precision    recall  f1-score   support

         neg       0.83      0.82      0.83       194
         pos       0.83      0.83      0.83       194

    accuracy                           0.83       388
   macro avg       0.83      0.83      0.83       388
weighted avg       0.83      0.83      0.83       388

Confusion Matrix (TF-IDF Model):
[[160  34]
 [ 33 161]]


## Analysis of Results

The supervised model reaches an accuracy of about **82.7%**, roughly 19 percentage points above the rule-based model (VADER ≈ 63.6%). The two figures are not measured on the same data: VADER was scored on all 1,938 reviews, while the supervised model is scored on the 388-review test set.

### Per-class metrics

- Negative class:
  - Precision: 0.83
  - Recall: 0.82
  - F1-score: 0.83

- Positive class:
  - Precision: 0.83
  - Recall: 0.83
  - F1-score: 0.83

Performance is balanced across both classes, with no evident bias.

The supervised model performs better because it:
- Learns patterns specific to this dataset
- Uses bigrams (`ngram_range=(1, 2)`)
- Does not depend on a fixed dictionary
- Learns correlations between words and labels directly from the data

VADER, by contrast:
- Does not learn from data
- Does not adapt to the domain
- Is tuned for social media text, not long reviews

**Influential words**

In [17]:
feature_names = tfidf.get_feature_names_out()
coefs = model.coef_[0]

top_pos = sorted(zip(coefs, feature_names), reverse=True)[:15]
top_neg = sorted(zip(coefs, feature_names))[:15]

print("Top positive words:")
for coef, word in top_pos:
    print(word)

print("\nTop negative words:")
for coef, word in top_neg:
    print(word)

Top positive words:
life
great
seen
truman
family
mulan
fun
jackie
quite
performance
excellent
true
gives
hilarious
perfect

Top negative words:
bad
worst
plot
stupid
boring
supposed
movie
waste
reason
unfortunately
script
poor
godzilla
ridiculous
awful
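
The sign convention behind these two lists: for a binary problem, scikit-learn's `LogisticRegression` stores a single row of coefficients, and positive values push the prediction toward `classes_[1]`. Because string labels are sorted, `classes_` is `['neg', 'pos']` here, so large positive coefficients mark words associated with positive reviews. A minimal check on a toy one-feature problem (data invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature problem (hypothetical): larger x means 'pos'
X_toy = np.array([[0.0], [0.2], [0.8], [1.0]])
y_toy = np.array(["neg", "neg", "pos", "pos"])

clf = LogisticRegression().fit(X_toy, y_toy)

# String labels are sorted, so classes_ is ['neg', 'pos'] and
# coef_[0] holds the weights that favor classes_[1], i.e. 'pos'
print(clf.classes_)
print(clf.coef_[0])
```

Words that are tied to specific films, such as `truman`, `mulan`, or `godzilla`, show that the model also picks up title-specific signals from this dataset rather than only general sentiment vocabulary.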


**Sentence-BERT Model**

In [18]:
# ================================
# SENTENCE-BERT + Logistic Regression
# ================================

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import numpy as np

# 1. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    reviews.review,
    reviews.label,
    test_size=0.2,
    random_state=42,
    stratify=reviews.label
)

# 2. Load the Sentence-BERT model
model_sbert = SentenceTransformer('all-MiniLM-L6-v2')

# 3. Generate embeddings
X_train_embeddings = model_sbert.encode(
    X_train.tolist(),
    convert_to_numpy=True,
    show_progress_bar=True
)

X_test_embeddings = model_sbert.encode(
    X_test.tolist(),
    convert_to_numpy=True,
    show_progress_bar=True
)

# 4. Classifier (Logistic Regression)
classifier = LogisticRegression(max_iter=1000)

classifier.fit(X_train_embeddings, y_train)

# 5. Predictions
y_pred = classifier.predict(X_test_embeddings)

# Store predictions in the dataframe (test rows only); a separate column
# keeps the TF-IDF predictions in 'prediction_ml' from being overwritten
reviews.loc[X_test.index, 'prediction_sbert'] = y_pred




In [19]:
# ================================
# METRICS
# ================================

y_true = y_test.values

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
cr = classification_report(y_true, y_pred)

print(f"Accuracy:\n{acc}\n")
print(f"Classification Report:\n{cr}")
print(f"Confusion Matrix:\n{cm}")

Accuracy:
0.6984536082474226

Classification Report:
              precision    recall  f1-score   support

         neg       0.69      0.73      0.71       194
         pos       0.71      0.66      0.69       194

    accuracy                           0.70       388
   macro avg       0.70      0.70      0.70       388
weighted avg       0.70      0.70      0.70       388

Confusion Matrix:
[[142  52]
 [ 65 129]]


**Examples:**

(Positive)

The Housemaid is the real deal. It’s a slick, high-voltage thriller that takes a classic setup and turns it into a total masterclass in suspense. The lead is absolutely magnetic, and the twists are so sharp they actually land instead of just being "filler." It’s stylish, mean, and keeps you hooked from start to finish. Honestly? It’s a total knockout that makes the rest of the genre look lazy.

(Negative)

The Housemaid is a tired exercise in predictability that mistakes recycled clichés for genuine suspense. While visually polished, its hollow script and telegraphed twists waste a talented cast, delivering a thriller that is as shallow as it is derivative. It’s a chore to watch a film so convinced of its own cleverness while offering nothing we haven't seen a dozen times before.

In [20]:
# 1. Enter a new comment
nuevo_comentario = input("Enter a comment: ")

print("\nComment entered:")
print(nuevo_comentario)
print("\n--- RESULTS ---")

# ==========================================
# MODEL 1: NLTK (VADER)
# ==========================================

scores = sid.polarity_scores(nuevo_comentario)

# Binary rule: a compound score of 0 or more counts as positive.
# Note: this differs from the 0.05 threshold used in the evaluation above.
if scores['compound'] >= 0:
    pred_nltk = "pos"
else:
    pred_nltk = "neg"

print(f"NLTK (VADER): {pred_nltk} | Score: {scores['compound']:.4f}")


# ==========================================
# MODEL 2: TF-IDF + Logistic Regression
# ==========================================

nuevo_tfidf = tfidf.transform([nuevo_comentario])
pred_tfidf = model.predict(nuevo_tfidf)

print(f"TF-IDF + Logistic Regression: {pred_tfidf}")


# ==========================================
# MODEL 3: Sentence-BERT + Logistic Regression
# ==========================================

nuevo_embedding = model_sbert.encode(
    [nuevo_comentario],
    convert_to_numpy=True
)

pred_sbert = classifier.predict(nuevo_embedding)[0]

print(f"Sentence-BERT + Logistic Regression: {pred_sbert}")

print("\n===================================")

Enter a comment: Gracias

Comment entered:
Gracias

--- RESULTS ---
NLTK (VADER): pos | Score: 0.0000
TF-IDF + Logistic Regression: ['pos']
Sentence-BERT + Logistic Regression: pos

