# Sentiment Analysis on Movie Reviews

# Version: Ivan Moran - Erik Vergara


Now let's put some of these concepts into practice on a more realistic case. In this exercise we will run sentiment analysis on a set of movie reviews. This is a simple binary classification problem, and any classifier could be used for it; the extra work here is the preprocessing of the text inputs.

### References
* [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action)

In [32]:
# Install only the required libraries
!pip install nltk==3.9.1



In [33]:
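IN_COLAB = 'google.colab' in str(get_ipython())  # assumption: IN_COLAB is not defined elsewhere in this notebook
!test '{IN_COLAB}' = 'True' && wget https://github.com/Ohtar10/icesi-nlp/raw/refs/heads/main/Sesion1/moviereviews.tsv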

Let's start by loading the dataset:

In [34]:
import pandas as pd
import numpy as np

reviews = pd.read_csv('./moviereviews.tsv', sep='\t')
reviews.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


Next, let's do some cleanup by removing nulls and empty values:

In [35]:
reviews.dropna(inplace=True)
reviews.review = reviews.review.apply(lambda r: r.strip())
blanks = reviews[reviews.review == ''].index
reviews.drop(blanks, inplace=True)

In [36]:
reviews[reviews.review == ''].index

Index([], dtype='int64')

In [37]:
reviews.label.value_counts()

label,count
neg,969
pos,969


We have a balanced dataset with 969 examples per class, almost a thousand each.

To keep things simple, we will use VADER to compute the positive and negative scores. This model is already implemented in NLTK.
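To give a rough feel for how a lexicon-based scorer such as VADER arrives at a single compound score, here is a minimal sketch. The tiny `TOY_LEXICON`, its valence values, and the function names are illustrative assumptions. The real VADER also applies rules for negation, intensifiers, punctuation and capitalization, all omitted here. The normalization step, score / sqrt(score² + α) with α = 15, mirrors the one in NLTK's VADER module.

```python
import math

# Hypothetical mini-lexicon: word -> valence (the real lexicon has thousands of entries).
TOY_LEXICON = {"great": 3.1, "good": 1.9, "bad": -2.5, "boring": -1.3}

def normalize(score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

def toy_compound(text: str) -> float:
    """Sum the valence of known words and normalize the total."""
    total = sum(TOY_LEXICON.get(word, 0.0) for word in text.lower().split())
    return normalize(total)

print(round(toy_compound("a great film with a good cast"), 4))
print(round(toy_compound("bad plot and boring dialogue"), 4))
```

Because the score is a normalized sum over the whole text, a long review that mixes praise and criticism can land close to zero even when its overall verdict is clear.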

In [38]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [39]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
reviews['scores'] = reviews.review.apply(lambda r: sid.polarity_scores(r))
reviews.head()

Unnamed: 0,label,review,scores
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co..."
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com..."
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com..."
3,pos,according to hollywood movies made in last few...,"{'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co..."
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co..."


With these scores we can now convert the result into a predicted label:

In [40]:
# Threshold set to 0.05 to improve results on the negative class
reviews['compound'] = reviews.scores.apply(lambda s: s['compound'])
# reviews['prediction'] = reviews['compound'].apply(lambda c: 'pos' if c > 0 else 'neg')
reviews['prediction'] = reviews['compound'].apply(lambda c: 'pos' if c > 0.05 else 'neg')
reviews.head()

Unnamed: 0,label,review,scores,compound,prediction
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...",-0.9125,neg
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...",-0.8618,neg
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com...",0.9951,pos
3,pos,according to hollywood movies made in last few...,"{'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co...",0.9972,pos
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co...",-0.2484,neg
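The cell above uses a single cut-off at 0.05. For comparison, the sketch below puts that binary rule next to the three-way convention commonly suggested for VADER (positive at compound >= 0.05, negative at compound <= -0.05, neutral in between). The function names are hypothetical and are not part of NLTK.

```python
def binary_label(compound: float, threshold: float = 0.05) -> str:
    """Binary rule used in this notebook: anything not above the threshold is 'neg'."""
    return "pos" if compound > threshold else "neg"

def three_way_label(compound: float, margin: float = 0.05) -> str:
    """Conventional VADER rule with a neutral band around zero."""
    if compound >= margin:
        return "pos"
    if compound <= -margin:
        return "neg"
    return "neu"

for c in (-0.9125, 0.9951, 0.02):
    print(c, binary_label(c), three_way_label(c))
```

Since this dataset only has `pos` and `neg` labels, the binary rule folds the neutral band into `neg`.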


Finally, let's compute a few quality metrics for the model:

In [41]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_true = reviews.label.values
y_pred = reviews.prediction.values

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
cr = classification_report(y_true, y_pred)


print(f"Accuracy:\n{acc}\n")
print(f"Classification Report:\n{cr}")
print(f"Confusion Matrix:\n{cm}")

Accuracy:
0.6357069143446853

Classification Report:
              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938

Confusion Matrix:
[[427 542]
 [164 805]]


The accuracy is not great: at about 64% it beats the 50% baseline, but we can still do much better. The negative class is the weak spot: recall for `neg` is only 0.44, meaning 542 of the 969 negative reviews were predicted as positive.
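To make the link between the confusion matrix and the report explicit, the following sketch recomputes accuracy, precision, recall and F1 from the four counts printed above. It assumes scikit-learn's layout (rows are true labels, columns are predicted labels, classes in sorted order `neg`, `pos`). The helper names are illustrative.

```python
# Confusion matrix reported above: rows = true label, columns = predicted label,
# in the order ['neg', 'pos'].
cm = [[427, 542],
      [164, 805]]
labels = ["neg", "pos"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

def precision(i: int) -> float:
    """Correct predictions of class i divided by all predictions of class i."""
    return cm[i][i] / sum(row[i] for row in cm)

def recall(i: int) -> float:
    """Correct predictions of class i divided by all true members of class i."""
    return cm[i][i] / sum(cm[i])

print(f"accuracy: {accuracy:.2f}")
for i, name in enumerate(labels):
    p, r = precision(i), recall(i)
    f1 = 2 * p * r / (p + r)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The low `neg` recall comes straight from the first row: only 427 of the 969 negative reviews end up in the `neg` column.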

# TF-IDF + Logistic Regression Model

Next we try a supervised model.

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Train/test split (stratified)
X_train, X_test, y_train, y_test = train_test_split(
    reviews.review,
    reviews.label,
    test_size=0.2,
    random_state=42,
    stratify=reviews.label
)

# TF-IDF vectorizer
tfidf = TfidfVectorizer(
    stop_words='english',
    max_features=5000,
    ngram_range=(1, 2)
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Logistic Regression model
model = LogisticRegression(
    max_iter=1000,
    class_weight=None  # balanced dataset
)

model.fit(X_train_tfidf, y_train)

# Predictions on the test set
y_pred = model.predict(X_test_tfidf)

# Store predictions in the dataframe (test rows only)
reviews.loc[X_test.index, 'prediction_ml'] = y_pred

In [45]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Evaluate the TF-IDF model on the test set only.
# `y_test` (from the train_test_split above) holds the true labels for the test set;
# `model.predict(X_test_tfidf)` regenerates the model's predictions for those same rows.

y_true_tfidf = y_test  # true labels for the TF-IDF test set
y_pred_tfidf = model.predict(X_test_tfidf)  # predictions from the TF-IDF Logistic Regression model

acc = accuracy_score(y_true_tfidf, y_pred_tfidf)
cm = confusion_matrix(y_true_tfidf, y_pred_tfidf)
cr = classification_report(y_true_tfidf, y_pred_tfidf)

print(f"Accuracy (TF-IDF Model):\n{acc}\n")
print(f"Classification Report (TF-IDF Model):\n{cr}")
print(f"Confusion Matrix (TF-IDF Model):\n{cm}")

Accuracy (TF-IDF Model):
0.8273195876288659

Classification Report (TF-IDF Model):
              precision    recall  f1-score   support

         neg       0.83      0.82      0.83       194
         pos       0.83      0.83      0.83       194

    accuracy                           0.83       388
   macro avg       0.83      0.83      0.83       388
weighted avg       0.83      0.83      0.83       388

Confusion Matrix (TF-IDF Model):
[[160  34]
 [ 33 161]]


Why does the supervised model do better (0.83 accuracy on the held-out test set, versus 0.64 for VADER on the full dataset)? The supervised model:
- Learns patterns specific to this dataset
- Uses bigrams (ngram_range=(1,2))
- Does not depend on a fixed lexicon
- Learns the actual correlations between words and labels

VADER:
- Does not learn from the data
- Does not adapt to the domain
- Is tuned for social media text, not long reviews
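To illustrate what the unigram-plus-bigram TF-IDF features look like, here is a pure-Python sketch. It uses the smoothed inverse document frequency that scikit-learn documents for `TfidfVectorizer` with default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The helper names are hypothetical, tokenization is a naive whitespace split, and no stop words are removed, so the numbers will not match the notebook's vectorizer.

```python
import math
from collections import Counter

def ngrams(text: str, n_max: int = 2) -> list[str]:
    """Return all word n-grams of length 1..n_max (naive whitespace tokenizer)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Compute L2-normalized TF-IDF vectors with smoothed idf."""
    counts = [Counter(ngrams(doc)) for doc in docs]
    n_docs = len(docs)
    df = Counter(term for c in counts for term in c)  # document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for c in counts:
        raw = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = ["not good at all", "good fun", "not bad"]
for vec in tfidf(docs):
    top = sorted(vec.items(), key=lambda kv: -kv[1])[:3]
    print([(t, round(w, 3)) for t, w in top])
```

The bigram `not good` becomes a feature of its own, which is one way a bag-of-words model can pick up simple negation that single words miss.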

**Influential words**

In [46]:
feature_names = tfidf.get_feature_names_out()
coefs = model.coef_[0]

top_pos = sorted(zip(coefs, feature_names), reverse=True)[:15]
top_neg = sorted(zip(coefs, feature_names))[:15]

print("Top positive words:")
for coef, word in top_pos:
    print(word)

print("\nTop negative words:")
for coef, word in top_neg:
    print(word)

Top positive words:
life
great
seen
truman
family
mulan
fun
jackie
quite
performance
excellent
true
gives
hilarious
perfect

Top negative words:
bad
worst
plot
stupid
boring
supposed
movie
waste
reason
unfortunately
script
poor
godzilla
ridiculous
awful


**Sentence-BERT model**

In [47]:
# ================================
# SENTENCE-BERT + Logistic Regression
# ================================

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import numpy as np

# 1️⃣ Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    reviews.review,
    reviews.label,
    test_size=0.2,
    random_state=42,
    stratify=reviews.label
)

# 2️⃣ Load the Sentence-BERT model
model_sbert = SentenceTransformer('all-MiniLM-L6-v2')

# 3️⃣ Generate embeddings
X_train_embeddings = model_sbert.encode(
    X_train.tolist(),
    convert_to_numpy=True,
    show_progress_bar=True
)

X_test_embeddings = model_sbert.encode(
    X_test.tolist(),
    convert_to_numpy=True,
    show_progress_bar=True
)

# 4️⃣ Classifier (Logistic Regression)
classifier = LogisticRegression(max_iter=1000)

classifier.fit(X_train_embeddings, y_train)

# 5️⃣ Predictions
y_pred = classifier.predict(X_test_embeddings)

# Store predictions in the dataframe (test rows only), in a separate column so
# the TF-IDF predictions in 'prediction_ml' are not overwritten
reviews.loc[X_test.index, 'prediction_sbert'] = y_pred




In [48]:
# ================================
# METRICS
# ================================

y_true = y_test.values

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
cr = classification_report(y_true, y_pred)

print(f"Accuracy:\n{acc}\n")
print(f"Classification Report:\n{cr}")
print(f"Confusion Matrix:\n{cm}")

Accuracy:
0.6984536082474226

Classification Report:
              precision    recall  f1-score   support

         neg       0.69      0.73      0.71       194
         pos       0.71      0.66      0.69       194

    accuracy                           0.70       388
   macro avg       0.70      0.70      0.70       388
weighted avg       0.70      0.70      0.70       388

Confusion Matrix:
[[142  52]
 [ 65 129]]


**Examples:**

(Positive)

The Housemaid is the real deal. It’s a slick, high-voltage thriller that takes a classic setup and turns it into a total masterclass in suspense. The lead is absolutely magnetic, and the twists are so sharp they actually land instead of just being "filler." It’s stylish, mean, and keeps you hooked from start to finish. Honestly? It’s a total knockout that makes the rest of the genre look lazy.

(Negative)

The Housemaid is a tired exercise in predictability that mistakes recycled clichés for genuine suspense. While visually polished, its hollow script and telegraphed twists waste a talented cast, delivering a thriller that is as shallow as it is derivative. It’s a chore to watch a film so convinced of its own cleverness while offering nothing we haven't seen a dozen times before.

In [54]:
# 1️⃣ Enter a new comment
nuevo_comentario = input("Enter a comment: ")

print("\nComment entered:")
print(nuevo_comentario)
print("\n--- RESULTS ---")

# ==========================================
# 🔹 MODEL 1: NLTK (VADER)
# ==========================================

scores = sid.polarity_scores(nuevo_comentario)

# Same 0.05 threshold used for the VADER evaluation above
if scores['compound'] > 0.05:
    pred_nltk = "pos"
else:
    pred_nltk = "neg"

print(f"NLTK (VADER): {pred_nltk} | Score: {scores['compound']:.4f}")


# ==========================================
# 🔹 MODEL 2: TF-IDF + Logistic Regression
# ==========================================

nuevo_tfidf = tfidf.transform([nuevo_comentario])
pred_tfidf = model.predict(nuevo_tfidf)

print(f"TF-IDF + Logistic Regression: {pred_tfidf}")


# ==========================================
# 🔹 MODEL 3: Sentence-BERT + Logistic Regression
# ==========================================

nuevo_embedding = model_sbert.encode(
    [nuevo_comentario],
    convert_to_numpy=True
)

pred_sbert = classifier.predict(nuevo_embedding)[0]

print(f"Sentence-BERT + Logistic Regression: {pred_sbert}")

print("\n===================================")

Enter a comment: The Housemaid is the real deal. It’s a slick, high-voltage thriller that takes a classic setup and turns it into a total masterclass in suspense. The lead is absolutely magnetic, and the twists are so sharp they actually land instead of just being "filler." It’s stylish, mean, and keeps you hooked from start to finish. Honestly? It’s a total knockout that makes the rest of the genre look lazy.

Comment entered:
The Housemaid is the real deal. It’s a slick, high-voltage thriller that takes a classic setup and turns it into a total masterclass in suspense. The lead is absolutely magnetic, and the twists are so sharp they actually land instead of just being "filler." It’s stylish, mean, and keeps you hooked from start to finish. Honestly? It’s a total knockout that makes the rest of the genre look lazy.

--- RESULTS ---
NLTK (VADER): pos | Score: 0.2263
TF-IDF + Logistic Regression: ['pos']
Sentence-BERT + Logistic Regression: pos

