# Practice 2: Analytical Solution (YouTube Comment Sentiment)

**Dataset:** `AmaanP314/youtube-comment-sentiment` (Hugging Face)

**Objective:**  
1) Define one discrete and one continuous target variable, and justify each.  
2) Train the best model for the discrete target.  
3) Train the best model for the continuous target.  
4) Interpret the results in business terms.

## 0. Loading and initial exploration of the dataset

In this section we:
- Load the dataset from Hugging Face.
- Review the available columns.
- Check the size and missing values.

In [3]:
import pandas as pd
import numpy as np

from datasets import load_dataset

ds = load_dataset("AmaanP314/youtube-comment-sentiment")
df = ds["train"].to_pandas()

df.shape, df.columns




((1032225, 12),
 Index(['CommentID', 'VideoID', 'VideoTitle', 'AuthorName', 'AuthorChannelID',
        'CommentText', 'Sentiment', 'Likes', 'Replies', 'PublishedAt',
        'CountryCode', 'CategoryID'],
       dtype='object'))

### Initial dataset audit (quality and consistency)

Goals:
1) Check for **missing** values (NaN/None) in ALL columns.
2) Detect **disguised missing values**: empty strings, whitespace, "nan", "null", etc.
3) Validate types (dtypes) and basic ranges (Likes/Replies).
4) Parse `PublishedAt` and measure parse failures.
5) Check the cardinality of the categorical columns (`Sentiment`/`CountryCode`/`CategoryID`).

In [7]:
missing_all = df.isna().mean().sort_values(ascending=False)
missing_all[missing_all > 0]

AuthorName    0.000611
dtype: float64

In [8]:
def disguised_missing_stats(s: pd.Series) -> pd.Series:
    s = s.astype("string")
    stripped = s.str.strip()

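    # Tokens treated as disguised missing values (compared case-insensitively).
    # "" is in the set, so pct_missing_words also counts empty strings.
    missing_words = {"", "nan", "null", "none", "na", "n/a", "nil", "missing"}

    is_empty = stripped.eq("")
    is_word_missing = stripped.str.lower().isin(missing_words)

    # Catches a pandas NA marker that was stored as the literal text "<NA>".
    is_literal_na = stripped.eq("<NA>")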

    return pd.Series({
        "pct_empty_or_spaces": float(is_empty.mean()),
        "pct_missing_words": float(is_word_missing.mean()),
        "pct_literal_<NA>": float(is_literal_na.mean()),
        "examples_empty": stripped[is_empty].head(3).tolist(),
        "examples_missing_words": stripped[is_word_missing].head(3).tolist(),
    })

text_cols = ["VideoTitle", "AuthorName", "AuthorChannelID", "CommentText", "CountryCode", "PublishedAt", "Sentiment"]
audit_text = pd.DataFrame({c: disguised_missing_stats(df[c]) for c in text_cols}).T
audit_text


| Column | pct_empty_or_spaces | pct_missing_words | `pct_literal_<NA>` | examples_empty | examples_missing_words |
|---|---|---|---|---|---|
| VideoTitle | 0.0 | 0.0 | 0.0 | [] | [] |
| AuthorName | 0.0 | 0.0 | 0.0 | [] | [] |
| AuthorChannelID | 0.0 | 0.0 | 0.0 | [] | [] |
| CommentText | 0.000156 | 0.00016 | 0.0 | ['', '', ''] | ['', '', ''] |
| CountryCode | 0.0 | 0.0 | 0.0 | [] | [] |
| PublishedAt | 0.0 | 0.0 | 0.0 | [] | [] |
| Sentiment | 0.0 | 0.0 | 0.0 | [] | [] |


In [9]:
df["Sentiment"].value_counts(dropna=False)

Sentiment
Negative    346075
Positive    343317
Neutral     342833
Name: count, dtype: int64

In [10]:
cc = df["CountryCode"].astype("string").str.strip()
cc_len = cc.str.len()

{
    "pct_missing_like_empty": float((cc_len==0).mean()),
    "len_counts_top": cc_len.value_counts(dropna=False).head(10).to_dict(),
    "sample_weird_len": cc[~cc_len.isin([0,2])].dropna().unique()[:15].tolist()
}

{'pct_missing_like_empty': 0.0,
 'len_counts_top': {np.int64(2): 1032225},
 'sample_weird_len': []}

In [11]:
cat = pd.to_numeric(df["CategoryID"], errors="coerce")
{
    "pct_non_numeric": float(cat.isna().mean()),
    "min": float(cat.min()),
    "max": float(cat.max()),
    "n_unique": int(cat.nunique(dropna=True)),
    "top_values": cat.value_counts().head(10).to_dict()
}

{'pct_non_numeric': 0.0,
 'min': 1.0,
 'max': 28.0,
 'n_unique': 11,
 'top_values': {25: 332543,
  27: 290237,
  26: 85502,
  17: 69322,
  15: 49635,
  24: 48406,
  28: 47887,
  2: 44749,
  20: 32088,
  22: 17532}}

In [12]:
likes = pd.to_numeric(df["Likes"], errors="coerce")
replies = pd.to_numeric(df["Replies"], errors="coerce")

{
    "likes_pct_non_numeric": float(likes.isna().mean()),
    "replies_pct_non_numeric": float(replies.isna().mean()),
    "likes_pct_negative": float((likes < 0).mean()),
    "replies_pct_negative": float((replies < 0).mean()),
    "likes_desc": likes.describe(percentiles=[.5,.9,.95,.99]).to_dict(),
    "replies_desc": replies.describe(percentiles=[.5,.9,.95,.99]).to_dict(),
}

{'likes_pct_non_numeric': 0.0,
 'replies_pct_non_numeric': 0.0,
 'likes_pct_negative': 0.0,
 'replies_pct_negative': 0.0,
 'likes_desc': {'count': 1032225.0,
  'mean': 101.66075419603284,
  'std': 1538.978145542954,
  'min': 0.0,
  '50%': 0.0,
  '90%': 35.0,
  '95%': 157.0,
  '99%': 1633.0,
  'max': 275849.0},
 'replies_desc': {'count': 1032225.0,
  'mean': 2.023081208069946,
  'std': 14.144702381178911,
  'min': 0.0,
  '50%': 0.0,
  '90%': 2.0,
  '95%': 7.0,
  '99%': 42.0,
  'max': 751.0}}

In [13]:
published_dt = pd.to_datetime(df["PublishedAt"], errors="coerce", utc=True)

{
    "pct_parse_fail": float(published_dt.isna().mean()),
    "min_dt": str(published_dt.min()),
    "max_dt": str(published_dt.max()),
    "sample_raw_failures": df.loc[published_dt.isna(), "PublishedAt"].astype("string").head(10).tolist()
}

{'pct_parse_fail': 0.0,
 'min_dt': '2013-04-05 22:47:16+00:00',
 'max_dt': '2025-02-05 14:33:11+00:00',
 'sample_raw_failures': []}

In [14]:
dup_commentid = df["CommentID"].duplicated().mean()
dup_text = df["CommentText"].astype("string").str.strip().duplicated().mean()

{"pct_dup_commentid": float(dup_commentid), "pct_dup_commenttext": float(dup_text)}

{'pct_dup_commentid': 0.00035457385744387123,
 'pct_dup_commenttext': 0.04103368936036232}

### Final cleaning for modeling

We apply minimal, justified cleaning:

1) Drop comments whose `CommentText` is empty or whitespace-only.
2) Drop duplicates by `CommentID` (identifier consistency).
3) Parse `PublishedAt` to datetime and create time features.
4) (Optional) Drop exact duplicates of `CommentText` to reduce leakage.
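
Step 4 is optional and is not applied in the cleaning cell of this section. A minimal sketch of what it could look like is shown below on a toy frame. The toy rows are invented for illustration, and normalizing only with `str.strip()` (case preserved) is an assumption that mirrors how the audit measured duplicate texts.

```python
import pandas as pd

# Toy stand-in for df_model (hypothetical rows, same column names as the dataset).
toy = pd.DataFrame({
    "CommentID": ["a", "b", "c", "d"],
    "CommentText": ["great video", "  great video ", "first", "Great video"],
})

# Normalize only by stripping whitespace, so "Great video" stays distinct.
text_key = toy["CommentText"].astype("string").str.strip()

# keep="first" retains the earliest occurrence of each exact text.
toy_dedup = toy[~text_key.duplicated(keep="first")].copy()

print(toy_dedup["CommentID"].tolist())
```

In the audit, about 4.1% of the raw rows were flagged as duplicate texts (`pct_dup_commenttext`), so applying this step to the full dataset would remove a noticeable share of comments.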

In [15]:
df_model = df.copy()

# 1) Remove empty or whitespace-only texts
df_model["CommentText"] = df_model["CommentText"].astype("string")
mask_nonempty = df_model["CommentText"].str.strip().fillna("").ne("")
df_model = df_model[mask_nonempty].copy()

# 2) Deduplicate by CommentID (keep the first occurrence)
df_model = df_model.drop_duplicates(subset=["CommentID"], keep="first").copy()

# 3) Date parsing + time features
df_model["published_dt"] = pd.to_datetime(df_model["PublishedAt"], errors="coerce", utc=True)
df_model["hour"] = df_model["published_dt"].dt.hour
df_model["dow"] = df_model["published_dt"].dt.dayofweek
df_model["month"] = df_model["published_dt"].dt.month
df_model["is_weekend"] = df_model["dow"].isin([5,6]).astype(int)

# Targets
df_model["y_disc"] = df_model["Sentiment"].astype(str)
df_model["Likes"] = pd.to_numeric(df_model["Likes"], errors="coerce")
df_model["Replies"] = pd.to_numeric(df_model["Replies"], errors="coerce")

df_model["y_cont"] = np.log1p(df_model["Likes"].astype(float))

df_model.shape, df_model[["CommentText","y_disc","Likes","Replies","hour","dow","month","is_weekend","y_cont"]].head(3)

((1031698, 19),
                                          CommentText    y_disc  Likes  \
 0                    Anyone know what movie this is?   Neutral      0   
 1  The fact they're holding each other back while...  Positive      0   
 2                        waiting next video will be?   Neutral      1   
 
    Replies  hour  dow  month  is_weekend    y_cont  
 0        2     0    2      1           0  0.000000  
 1        0    23    0      1           0  0.000000  
 2        0    13    0      7           0  0.693147  )
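
The row count drops from 1,032,225 to 1,031,698, a loss of 527 rows. The arithmetic sketch below checks whether that loss is explained by the two filters, assuming the audit fractions reported earlier (one of them rounded) can be converted back to row counts. The variable names are illustrative.

```python
n_raw = 1_032_225      # rows in the raw dataset (from df.shape)
n_clean = 1_031_698    # rows in df_model after cleaning

pct_empty_text = 0.000156                    # rounded value from the text audit
pct_dup_commentid = 0.00035457385744387123   # value from the duplicate audit

# Convert audit fractions back to approximate row counts.
n_empty = round(pct_empty_text * n_raw)
n_dup_id = round(pct_dup_commentid * n_raw)

removed = n_raw - n_clean
print(n_empty, n_dup_id, removed)
```

Under these assumptions the two counts add up to the rows removed, which is consistent with the empty-text filter and the `CommentID` deduplication acting on disjoint rows.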

## 1. Target variables

Two target variables are required:

### 1.1 Discrete target variable (classification)
We define the discrete target variable as:

- **y_disc = Sentiment** ∈ {Negative, Neutral, Positive}

**Justification / business value:**
Classifying comment sentiment supports reputation monitoring, content moderation, prioritizing which comments get attention, and tracking the overall mood of the community.

### 1.2 Continuous target variable (regression)
We define the continuous target variable as comment engagement, measured by likes:

- **Likes** = number of likes on the comment  
- Because `Likes` is highly skewed in the raw data (median 0, 99th percentile 1,633, maximum 275,849), we use:
- **y_cont = log1p(Likes)** = log(1 + Likes)
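
To make the transform concrete, the sketch below applies `log1p` to a few like counts taken from the audit statistics (median, 90th percentile, 99th percentile, maximum) and inverts it with `expm1`, which is one way a prediction on the log scale could be mapped back to likes. The variable names are illustrative.

```python
import numpy as np

# Like counts taken from the audit: median, p90, p99 and max.
likes = np.array([0, 35, 1633, 275849], dtype=float)

y_cont = np.log1p(likes)        # log(1 + Likes); maps 0 likes to 0.0
likes_back = np.expm1(y_cont)   # inverse transform: exp(y) - 1

print(np.round(y_cont, 3))
print(np.allclose(likes_back, likes))
```

The raw range of 0 to 275,849 compresses to roughly 0 to 12.5 on the log scale, so the few very popular comments no longer dominate a squared-error loss.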

**Justification / business value:**
Predicting engagement helps prioritize valuable comments, improve ranking/visibility, and allocate moderation or response resources.

In [16]:
df_model[["y_disc", "Likes", "y_cont"]].head(5)

| index | y_disc | Likes | y_cont |
|---|---|---|---|
| 0 | Neutral | 0 | 0.0 |
| 1 | Positive | 0 | 0.0 |
| 2 | Neutral | 1 | 0.693147 |
| 3 | Neutral | 0 | 0.0 |
| 4 | Positive | 3 | 1.386294 |


In [17]:
df_model["y_disc"].value_counts(normalize=True).round(4)

y_disc
Negative    0.3354
Positive    0.3326
Neutral     0.3321
Name: proportion, dtype: float64
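
Because the three classes are nearly balanced, a classifier that always predicts the most frequent class would reach only about 33.5% accuracy, which gives a floor for judging later models. A minimal sketch of that calculation, using the proportions printed above as hard-coded inputs (the variable names are illustrative):

```python
import pandas as pd

# Class proportions copied from the value_counts output (rounded to 4 decimals).
proportions = pd.Series({"Negative": 0.3354, "Positive": 0.3326, "Neutral": 0.3321})

majority_class = proportions.idxmax()
baseline_accuracy = proportions.max()   # accuracy of always predicting majority_class

print(majority_class, baseline_accuracy)
```

With a balance this close, plain accuracy and macro-averaged metrics should tell a similar story, so no resampling or class weighting appears necessary at this stage.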