# Exercise 2: Reddit API Data Collection & Sentiment Analysis

**Author:** Bianca Peraltilla  
**Course:** Python for Data Science (UP, 2025-II)  

---

## Objective
- Connect to the Reddit API using **PRAW**.  
- Extract posts from political subreddits:  
  - r/politics  
  - r/PoliticalDiscussion  
  - r/worldnews  
- Save the main information for each post:  
  - title, score (upvotes), number of comments, id, and url.  
- Extract comments from the collected posts (up to 5 per post).  
- Store the data in **pandas** DataFrames for later sentiment analysis.  

---

## Step 1: Environment setup
In this step:
1. I install and load the required libraries:  
   - **praw** → connection to the Reddit API.  
   - **python-dotenv** → loading credentials from a `.env` file instead of hard-coding them.  
   - **pandas** → data handling and storage.  
2. I verify that all the libraries import correctly.  

In [1]:
%pip install --quiet praw python-dotenv pandas


Note: you may need to restart the kernel to use updated packages.


In [2]:
import praw
import pandas as pd
from dotenv import load_dotenv

print("PRAW:", praw.__version__)
print("Pandas:", pd.__version__)
print("dotenv ready ✔")


PRAW: 7.8.1
Pandas: 2.2.2
dotenv ready ✔


In [3]:
try:
    import praw, pandas as pd
    from dotenv import load_dotenv
    print("✅ All good, you can continue with Reddit.")
except ImportError as e:
    print("❌ Something is missing:", e)


✅ All good, you can continue with Reddit.


## Step 2: Credential setup and connection to the Reddit API

In this step:  
1. I create a `.env` file that keeps my credentials out of the notebook.  
2. I load those credentials into the notebook with **python-dotenv**.  
3. I establish the connection to the Reddit API using **PRAW**.  
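
Before connecting, it can help to confirm that every credential was actually loaded. The following is a minimal sketch, assuming the `.env` file defines the five variable names read by the connection cell. The helper `missing_credentials` is a hypothetical name for illustration and is not part of PRAW or python-dotenv.

```python
import os

# Variable names read by the connection cell. In .env each one is a
# NAME=value line, for example: REDDIT_USER_AGENT=my-script/0.1
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USER_AGENT",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def missing_credentials(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an in-memory mapping instead of the real environment
example_env = {"REDDIT_CLIENT_ID": "abc", "REDDIT_USER_AGENT": "my-script/0.1"}
print(missing_credentials(example_env))
# ['REDDIT_CLIENT_SECRET', 'REDDIT_USERNAME', 'REDDIT_PASSWORD']
```

Calling `missing_credentials()` with no arguments after `load_dotenv()` checks the real environment; an empty list means all five values are present.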


In [4]:
import os
from dotenv import load_dotenv
import praw

# Load variables from the .env file
load_dotenv()

# Connect to Reddit using PRAW
reddit = praw.Reddit(
    client_id=os.getenv("REDDIT_CLIENT_ID"),
    client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
    user_agent=os.getenv("REDDIT_USER_AGENT"),
    username=os.getenv("REDDIT_USERNAME"),
    password=os.getenv("REDDIT_PASSWORD")
)

# Connection test: reddit.user.me() makes the first authenticated request,
# so I call it before reporting success
me = reddit.user.me()
print("✅ Successfully connected to Reddit")
print("Authenticated user:", me)


✅ Successfully connected to Reddit
Authenticated user: Positive-Fly6703


## Part 2 (Step 3): Collecting posts per subreddit

**Goal:** download **20 posts** from each political subreddit using **PRAW** and save the results to a CSV for later analysis.

- Target subreddits:
  - `r/politics`
  - `r/PoliticalDiscussion`
  - `r/worldnews`
- Listing type: **configurable** as `hot` or `top` (I start with `hot`).
- Fields extracted per post:
  - `title`, `score` (upvotes), `num_comments`, `id`, `url`, `subreddit`
- Output: `output/reddit_posts.csv`

> Note: this dataset is the basis for extracting comments (next step) and then running the sentiment analysis.
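
The configurable listing type can be made stricter by rejecting unknown values instead of silently defaulting. This is a minimal sketch of that idea, not the notebook's actual code: `get_listing` and `FakeSubreddit` are hypothetical names, and the fake class exists only so the dispatch can run without network access. With PRAW, the same `getattr` lookup would resolve to the subreddit's `hot` or `top` method.

```python
def get_listing(subreddit, list_type="hot", limit=20):
    """Return the listing for `list_type`, rejecting unknown values."""
    allowed = ("hot", "top")
    if list_type not in allowed:
        raise ValueError(f"list_type must be one of {allowed}, got {list_type!r}")
    # Look up the method by name: subreddit.hot(...) or subreddit.top(...)
    return getattr(subreddit, list_type)(limit=limit)


class FakeSubreddit:
    """Stand-in used only to demonstrate the dispatch offline."""

    def hot(self, limit):
        return [f"hot-{i}" for i in range(limit)]

    def top(self, limit):
        return [f"top-{i}" for i in range(limit)]


print(get_listing(FakeSubreddit(), "top", limit=2))  # ['top-0', 'top-1']
```

A typo such as `"hto"` then raises a `ValueError` immediately rather than quietly returning a different listing.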


In [5]:
# Step 3: Download 20 posts per subreddit and save them to CSV

import time
from pathlib import Path
import pandas as pd

# 1) Parameters (I can switch "hot" to "top" if I want)
subreddits = ["politics", "PoliticalDiscussion", "worldnews"]
list_type = "hot"     # options: "hot" or "top"
limit_per_sub = 20    # posts per subreddit

# 2) Make sure the output folder exists
OUTDIR = Path("output")
OUTDIR.mkdir(exist_ok=True)

def fetch_posts_df(reddit, subreddit: str, list_type: str = "hot", limit: int = 20) -> pd.DataFrame:
    """
    Download posts from a subreddit and return a clean DataFrame
    with the columns I will analyze later.
    """
    sr = reddit.subreddit(subreddit)
    # Any list_type other than "hot" falls through to sr.top()
    it = sr.hot(limit=limit) if list_type == "hot" else sr.top(limit=limit)

    rows = []
    for p in it:
        rows.append({
            "subreddit": subreddit,
            "id": p.id,
            "title": p.title or "",
            "score": int(getattr(p, "score", 0) or 0),
            "num_comments": int(getattr(p, "num_comments", 0) or 0),
            "url": p.url or ""
        })
    # Drop repeated ids so each post appears only once
    df = pd.DataFrame(rows).drop_duplicates(subset=["id"]).reset_index(drop=True)
    return df

# 3) Collect 20 from each subreddit, concatenate, and save
frames = []
for s in subreddits:
    print(f"Downloading {limit_per_sub} posts ({list_type}) from r/{s}…")
    df_s = fetch_posts_df(reddit, s, list_type=list_type, limit=limit_per_sub)
    print(f"  → {len(df_s)} posts retrieved")
    frames.append(df_s)
    time.sleep(0.8)  # short pause to be gentle with the API

posts = pd.concat(frames, ignore_index=True)
posts_path = OUTDIR / "reddit_posts.csv"
posts.to_csv(posts_path, index=False, encoding="utf-8")

print(f"\n✅ Saved posts to: {posts_path} (total={len(posts)})")
print("\nDistribution by subreddit:")
display(posts.groupby("subreddit")["id"].count().to_frame("n_posts"))

print("\nPreview:")
display(posts.head(10))


Downloading 20 posts (hot) from r/politics…
  → 20 posts retrieved
Downloading 20 posts (hot) from r/PoliticalDiscussion…
  → 20 posts retrieved
Downloading 20 posts (hot) from r/worldnews…
  → 20 posts retrieved

✅ Saved posts to: output\reddit_posts.csv (total=60)

Distribution by subreddit:


subreddit,n_posts
PoliticalDiscussion,20
politics,20
worldnews,20



Preview:


index,subreddit,id,title,score,num_comments,url
0,politics,1n62egw,Donald Trump Declares D.C. a 'Crime Free Zone'...,2943,253,https://www.rollingstone.com/politics/politics...
1,politics,1n5xbq5,PragerU reveals full list of questions from Ok...,5266,694,https://www.newson6.com/story/68b49a3e4c96f952...
2,politics,1n5ru6d,Trump Admits His Administration Is Being ‘Ripp...,10569,368,https://www.thedailybeast.com/trump-admits-his...
3,politics,1n5odqt,Donald Trump posting week-old photo raises eye...,23242,1995,https://www.newsweek.com/donald-trump-health-p...
4,politics,1n5n1rf,Trump faces returning $100bn in tariffs after ...,34344,1311,https://www.thetimes.com/article/a09594e1-46f2...
5,politics,1n5vzou,Trump Lashes Out After Putin and Modi Are Seen...,4825,333,https://www.thedailybeast.com/trump-lashes-out...
6,politics,1n5yycl,Trump Family Amasses $6 Billion Fortune After ...,2999,138,https://www.wsj.com/finance/currencies/trump-f...
7,politics,1n5rek2,Trump Is Wiping Out Unions. Why Are They So Qu...,5327,512,https://www.nytimes.com/2025/09/01/opinion/tru...
8,politics,1n63tdc,Most US Voters Say Israel Is Committing Genoci...,1169,109,https://truthout.org/articles/most-us-voters-o...
9,politics,1n5u494,Gavin Newsom Mocks Trump With Labor Day Messag...,3063,138,https://www.latintimes.com/gavin-newsom-mocks-...


## Part 3 (Step 4): Collecting comments (up to 5 per post)

**Goal:** for each post downloaded earlier, I collect **up to 5 comments** with the highest score and save them to a CSV.

- Source: the file `output/reddit_posts.csv` generated in the previous step.
- For each `post_id`:
  - I take the comments already loaded with the post (dropping the "load more comments" placeholders) and filter out those that are empty or deleted.
  - I take the **5 with the highest `score`** (if there are fewer, I keep the ones that exist).
- Fields per comment:
  - `post_id` (to link with the posts table)
  - `body` (comment text, without line breaks)
  - `score` (comment upvotes)
- Output: `output/reddit_comments.csv`

> Note: I use short pauses to avoid hammering the API and handle errors per post, so the whole process does not stop if one fails.
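
The filter-then-rank step described above can be sketched on plain dictionaries, without any API call. This is an illustrative sketch under stated assumptions: `top_n_comments` is a hypothetical helper, the sample data is invented, and `heapq.nlargest` is swapped in for the full sort used in the notebook cell.

```python
import heapq

# Placeholder bodies Reddit shows for deleted or removed comments
PLACEHOLDERS = {"[deleted]", "[removed]"}


def top_n_comments(comments, n=5):
    """Keep comments with real text and return the n highest-scored ones."""
    usable = [
        c for c in comments
        if c.get("body") and c["body"] not in PLACEHOLDERS
    ]
    # nlargest returns at most n items, ordered by score descending
    return heapq.nlargest(n, usable, key=lambda c: c.get("score", 0))


# Invented sample: one deleted comment and one empty body get filtered out
sample = [
    {"body": "first", "score": 3},
    {"body": "[deleted]", "score": 50},
    {"body": "", "score": 9},
    {"body": "second", "score": 12},
]
print([c["body"] for c in top_n_comments(sample, n=5)])  # ['second', 'first']
```

Because only two usable comments exist in the sample, the function returns two even though `n=5`, which mirrors the "up to 5" behavior.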


In [6]:
# Step 4: Download comments (up to 5 per post) and save them to CSV

import time
from pathlib import Path
import pandas as pd

# 1) Load the posts from the previous step
OUTDIR = Path("output")
posts_path = OUTDIR / "reddit_posts.csv"
assert posts_path.exists(), "Cannot find output/reddit_posts.csv. Run Step 3 first."

posts = pd.read_csv(posts_path)

print(f"Total posts loaded: {len(posts)}")
display(posts.head())

# 2) Function to fetch the top comments for a post_id
def fetch_top_comments(reddit, post_id: str, per_post: int = 5):
    """
    Return a list of dictionaries with: post_id, body, score.
    Takes up to 'per_post' of the highest-scored comments.
    """
    subm = reddit.submission(id=post_id)
    # limit=0 removes the MoreComments placeholders without fetching them
    subm.comments.replace_more(limit=0)
    # list() flattens the comment tree, so replies at any depth are included
    comments = subm.comments.list()

    # Filter out deleted/empty comments and sort by score, descending
    clean = []
    for c in comments:
        body = getattr(c, "body", None)
        if body and body not in ("[deleted]", "[removed]"):
            clean.append(c)
    clean.sort(key=lambda c: getattr(c, "score", 0) or 0, reverse=True)

    rows = []
    for c in clean[:per_post]:
        txt = (c.body or "").replace("\n", " ").strip()
        rows.append({"post_id": post_id, "body": txt, "score": int(getattr(c, "score", 0) or 0)})
    return rows

# 3) Iterate over all posts and accumulate the results
all_comments = []
errors = 0

for i, pid in enumerate(posts["id"].tolist(), start=1):
    try:
        rows = fetch_top_comments(reddit, pid, per_post=5)
        all_comments.extend(rows)
        # Short pause to be gentle with the API
        time.sleep(0.5)
    except Exception as e:
        errors += 1
        print(f"⚠️ Error on post {pid}: {e}")
        time.sleep(1.0)

# 4) Build the DataFrame and save it
comments = pd.DataFrame(all_comments, columns=["post_id", "body", "score"])
comments_path = OUTDIR / "reddit_comments.csv"
comments.to_csv(comments_path, index=False, encoding="utf-8")

print(f"\n✅ Saved comments to: {comments_path} (total={len(comments)})")
print(f"ℹ️ Posts with errors: {errors}")

# 5) Quick integrity checks
print("\nExamples (5 rows):")
display(comments.head())

print("\nNumber of comments per post (top 10):")
display(comments.groupby("post_id")["score"].count().sort_values(ascending=False).head(10).to_frame("n_comments"))


Total posts loaded: 60


index,subreddit,id,title,score,num_comments,url
0,politics,1n62egw,Donald Trump Declares D.C. a 'Crime Free Zone'...,2943,253,https://www.rollingstone.com/politics/politics...
1,politics,1n5xbq5,PragerU reveals full list of questions from Ok...,5266,694,https://www.newson6.com/story/68b49a3e4c96f952...
2,politics,1n5ru6d,Trump Admits His Administration Is Being ‘Ripp...,10569,368,https://www.thedailybeast.com/trump-admits-his...
3,politics,1n5odqt,Donald Trump posting week-old photo raises eye...,23242,1995,https://www.newsweek.com/donald-trump-health-p...
4,politics,1n5n1rf,Trump faces returning $100bn in tariffs after ...,34344,1311,https://www.thetimes.com/article/a09594e1-46f2...



✅ Saved comments to: output\reddit_comments.csv (total=295)
ℹ️ Posts with errors: 0

Examples (5 rows):


index,post_id,body,score
0,1n62egw,"As a reminder, this subreddit [is for civil di...",1
1,1n62egw,Donald Trump is declaring a false victory agai...,1
2,1n62egw,There’s a reason it’s all Truth Social posts a...,1
3,1n62egw,So is he dead or what?,1
4,1n62egw,Trump declaring that DC is crime free is direc...,1



Number of comments per post (top 10):


post_id,n_comments
1bwbuka,5
1msce9l,5
1mxwbnt,5
1myv6j9,5
1mz6m1w,5
1mzn0u5,5
1mzt9qg,5
1mzv1a2,5
1n09ltj,5
1n0wgiu,5


### Result of the comment collection
- Generated file: `output/reddit_comments.csv`
- Key columns:
  - `post_id` lets me link to `reddit_posts.csv`.
  - `score` helps prioritize the most relevant comments.
  - `body` is the text I will later use for sentiment analysis.
- This completes the data collection (posts + comments) and leaves me ready for the **exploratory analysis / sentiment analysis** phase.
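
Since `post_id` is the linking key, it may be worth reading it back as text explicitly. Reddit ids are short alphanumeric strings, and if every id in a file happened to contain only digits, `pandas.read_csv` would infer an integer column. The sketch below uses invented CSV text to show the difference; it is a precaution under that assumption, not something the collected data required.

```python
import io

import pandas as pd

# Invented CSV text in which the only id consists of digits
csv_text = "post_id,body,score\n123456,example comment,1\n"

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"post_id": str})

print(inferred["post_id"].dtype)               # int64
print(repr(as_text["post_id"].iloc[0]))        # '123456'
```

Passing the same `dtype` mapping for `id` when reading `reddit_posts.csv` keeps both sides of the join as strings.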


## Final part: Verifying the post–comment link + Mini-EDA

**Goal:**  
- I verify that each comment (`reddit_comments.csv`) is correctly linked to its parent post (`reddit_posts.csv`) through `post_id`.  
- I run a quick exploratory analysis (EDA) to add value:
  - Top posts by `score` (upvotes).
  - Posts with the most comments (according to `num_comments`).
  - Comments per subreddit.
  - Top posts by "comment impact" (sum of the `score` of the collected comments, up to 5 per post).

In addition, I generate a file `output/reddit_merged_sample.csv` with a sample of the *post + comment* join as evidence of the link.
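
One way to make the link check explicit is the `indicator` option of `DataFrame.merge`, which labels each row by where its key was found. The sketch below runs on small invented tables (the ids `p1`, `p2`, `p9` are illustrative) and is an alternative to counting missing titles, not the notebook's method.

```python
import pandas as pd

# Invented data: the last comment points at a post id that is not in posts
posts = pd.DataFrame({"id": ["p1", "p2"], "title": ["First", "Second"]})
comments = pd.DataFrame({"post_id": ["p1", "p1", "p9"], "body": ["a", "b", "c"]})

checked = comments.merge(
    posts,
    left_on="post_id",
    right_on="id",
    how="left",
    validate="many_to_one",  # raises MergeError if a post id repeats in posts
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Rows marked "left_only" are comments whose post was not found
orphans = checked[checked["_merge"] == "left_only"]
print(orphans["post_id"].tolist())  # ['p9']
```

An empty `orphans` frame corresponds to the "0 comments without a matched post" result expected from a clean collection.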


In [7]:
# Verify the post–comment link and run a small EDA

import pandas as pd
from pathlib import Path

OUTDIR = Path("output")
posts_path = OUTDIR / "reddit_posts.csv"
comments_path = OUTDIR / "reddit_comments.csv"

# 1) Load the generated tables
df_posts = pd.read_csv(posts_path)
df_comments = pd.read_csv(comments_path)

print(f"Posts: {len(df_posts)} rows | Comments: {len(df_comments)} rows")

# 2) Verification: join comments to their post by id
merged = df_comments.merge(
    df_posts[["id", "title", "subreddit", "score", "num_comments", "url"]],
    left_on="post_id",
    right_on="id",
    how="left",
    validate="many_to_one"  # each comment belongs to exactly one post
)

# Check link integrity (how many comments were left without a post)
missing = merged["title"].isna().sum()
print(f"Comments without a matched post: {missing}")

# 3) Export a sample as evidence of the post–comment link.
# Both tables have a "score" column, so the merge suffixes them:
# score_x comes from the comment and score_y from the post.
merged = merged.rename(columns={
    "body": "comment_body",
    "score_x": "comment_score",
    "score_y": "post_score",
})
sample_cols = ["post_id", "subreddit", "title", "comment_body", "comment_score"]

evidence = merged[sample_cols].head(50)
evidence_path = OUTDIR / "reddit_merged_sample.csv"
evidence.to_csv(evidence_path, index=False, encoding="utf-8")
print(f"✅ Evidence saved: {evidence_path}")

display(evidence.head(10))

# 4) Mini-EDA that adds value
print("\nTop 10 posts by score (upvotes):")
top_by_score = df_posts.sort_values("score", ascending=False).head(10)[["subreddit", "title", "score", "num_comments", "url"]]
display(top_by_score)

print("\nTop 10 posts by number of comments (according to post metadata):")
top_by_num_comments = df_posts.sort_values("num_comments", ascending=False).head(10)[["subreddit", "title", "num_comments", "score", "url"]]
display(top_by_num_comments)

print("\nComments per subreddit (count) according to the comments dataset:")
comments_per_sub = merged.groupby("subreddit")["comment_body"].count().sort_values(ascending=False).to_frame("n_comments")
display(comments_per_sub)

print("\nTop 10 posts by 'comment impact' (sum of comment scores):")
impact = merged.groupby(["post_id", "title", "subreddit"], dropna=False)["comment_score"].sum().reset_index()
impact = impact.sort_values("comment_score", ascending=False).head(10)
display(impact)


Posts: 60 rows | Comments: 295 rows
Comments without a matched post: 0
✅ Evidence saved: output\reddit_merged_sample.csv


index,post_id,subreddit,title,comment_body,comment_score
0,1n62egw,politics,Donald Trump Declares D.C. a 'Crime Free Zone'...,"As a reminder, this subreddit [is for civil di...",1
1,1n62egw,politics,Donald Trump Declares D.C. a 'Crime Free Zone'...,Donald Trump is declaring a false victory agai...,1
2,1n62egw,politics,Donald Trump Declares D.C. a 'Crime Free Zone'...,There’s a reason it’s all Truth Social posts a...,1
3,1n62egw,politics,Donald Trump Declares D.C. a 'Crime Free Zone'...,So is he dead or what?,1
4,1n62egw,politics,Donald Trump Declares D.C. a 'Crime Free Zone'...,Trump declaring that DC is crime free is direc...,1
5,1n5xbq5,politics,PragerU reveals full list of questions from Ok...,"As a reminder, this subreddit [is for civil di...",1
6,1n5xbq5,politics,PragerU reveals full list of questions from Ok...,"The certificate misspells the word ""certify"", ...",1
7,1n5xbq5,politics,PragerU reveals full list of questions from Ok...,Oklahoma is dead last in education rankings. ...,1
8,1n5xbq5,politics,PragerU reveals full list of questions from Ok...,This is all for performance. No teacher from ...,1
9,1n5xbq5,politics,PragerU reveals full list of questions from Ok...,"The one thing more dangerous than old, senile ...",1



Top 10 posts by score (upvotes):


index,subreddit,title,score,num_comments,url
55,worldnews,Zelenskyy points out that Trump’s “two weeks” ...,37444,694,https://www.pravda.com.ua/eng/news/2025/08/31/...
4,politics,Trump faces returning $100bn in tariffs after ...,34344,1311,https://www.thetimes.com/article/a09594e1-46f2...
3,politics,Donald Trump posting week-old photo raises eye...,23242,1995,https://www.newsweek.com/donald-trump-health-p...
43,worldnews,EU head's plane hit by suspected Russian GPS i...,22070,749,https://tvpworld.com/88657721/eu-heads-plane-h...
41,worldnews,"To defend against Russian tanks, Finland and P...",16105,411,https://www.france24.com/en/europe/20250828-to...
2,politics,Trump Admits His Administration Is Being ‘Ripp...,10569,368,https://www.thedailybeast.com/trump-admits-his...
42,worldnews,"Maduro warns of ""bloody threat"" as Trump deplo...",5941,380,https://www.newsweek.com/maduro-wans-trump-ven...
7,politics,Trump Is Wiping Out Unions. Why Are They So Qu...,5327,512,https://www.nytimes.com/2025/09/01/opinion/tru...
1,politics,PragerU reveals full list of questions from Ok...,5266,694,https://www.newson6.com/story/68b49a3e4c96f952...
11,politics,Trump’s bill is a ‘death warrant’ say parents ...,4911,360,https://www.independent.co.uk/news/world/ameri...



Top 10 posts by number of comments (according to post metadata):


index,subreddit,title,num_comments,score,url
20,PoliticalDiscussion,Casual Questions Thread,8311,89,https://www.reddit.com/r/PoliticalDiscussion/c...
3,politics,Donald Trump posting week-old photo raises eye...,1995,23242,https://www.newsweek.com/donald-trump-health-p...
4,politics,Trump faces returning $100bn in tariffs after ...,1311,34344,https://www.thetimes.com/article/a09594e1-46f2...
43,worldnews,EU head's plane hit by suspected Russian GPS i...,749,22070,https://tvpworld.com/88657721/eu-heads-plane-h...
55,worldnews,Zelenskyy points out that Trump’s “two weeks” ...,694,37444,https://www.pravda.com.ua/eng/news/2025/08/31/...
1,politics,PragerU reveals full list of questions from Ok...,694,5266,https://www.newson6.com/story/68b49a3e4c96f952...
23,PoliticalDiscussion,What do you think about Gavin Newsom's new soc...,540,889,https://www.reddit.com/r/PoliticalDiscussion/c...
7,politics,Trump Is Wiping Out Unions. Why Are They So Qu...,512,5327,https://www.nytimes.com/2025/09/01/opinion/tru...
41,worldnews,"To defend against Russian tanks, Finland and P...",411,16105,https://www.france24.com/en/europe/20250828-to...
42,worldnews,"Maduro warns of ""bloody threat"" as Trump deplo...",380,5941,https://www.newsweek.com/maduro-wans-trump-ven...



Comments per subreddit (count) according to the comments dataset:


subreddit,n_comments
PoliticalDiscussion,100
politics,100
worldnews,95



Top 10 posts by 'comment impact' (sum of comment scores):


index,post_id,title,subreddit,comment_score
24,1n5n1rf,Trump faces returning $100bn in tariffs after ...,politics,31354
27,1n5odqt,Donald Trump posting week-old photo raises eye...,politics,17970
23,1n5k4p2,EU head's plane hit by suspected Russian GPS i...,worldnews,17249
20,1n5asa1,Zelenskyy points out that Trump’s “two weeks” ...,worldnews,11308
36,1n5tdc8,"To defend against Russian tanks, Finland and P...",worldnews,9068
34,1n5ru6d,Trump Admits His Administration Is Being ‘Ripp...,politics,7814
33,1n5rek2,Trump Is Wiping Out Unions. Why Are They So Qu...,politics,7659
45,1n5yfan,"Maduro warns of ""bloody threat"" as Trump deplo...",worldnews,3063
26,1n5nv9j,Trump’s bill is a ‘death warrant’ say parents ...,politics,2581
7,1mzv1a2,Trump has said the DOJ will be filing a lawsui...,PoliticalDiscussion,2286


# Conclusions: Exercise 2, Reddit API & Sentiment Data Collection

- I **successfully connected to the Reddit API (PRAW)** using credentials stored in `.env`.
- I collected **20 posts from each subreddit** (`r/politics`, `r/PoliticalDiscussion`, `r/worldnews`), saving title, score, number of comments, id, and url.
- I downloaded up to **5 top comments per post**, linking them to their post through `post_id`.
- The data is stored in two CSV files (`reddit_posts.csv` and `reddit_comments.csv`), and the post–comment relationship was verified with a `merge`.

### Insights from the mini-EDA:
- The posts with the highest score (upvotes) and those with the most comments do not always coincide, which reflects the difference between "popularity" and "participation". For example, the "Casual Questions Thread" has the most comments (8311) with a score of only 89.
- The comment counts per subreddit are nearly identical (100, 100, and 95) because collection is capped at 5 comments per post, so they do not measure interaction. In the comment-impact ranking, however, `r/politics` and `r/worldnews` take 9 of the top 10 places, ahead of `r/PoliticalDiscussion`.
- Some posts stand out for their **high "comment impact"** (sum of the score of the collected comments), which helps detect threads where the discussion resonated most.
- The collected data can serve as the basis for a **sentiment** analysis of the comments (positive/negative), adding to the understanding of political polarization.
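
The first insight can be checked directly from the two top-10 tables printed earlier. The sketch below copies the row indices from those tables and counts how many posts appear in both rankings; only the variable names are introduced here.

```python
# Row indices copied from the two top-10 tables in the EDA output
top_by_score = [55, 4, 3, 43, 41, 2, 42, 7, 1, 11]
top_by_comments = [20, 3, 4, 43, 55, 1, 23, 7, 41, 42]

shared = set(top_by_score) & set(top_by_comments)
only_score = [i for i in top_by_score if i not in shared]
only_comments = [i for i in top_by_comments if i not in shared]

print(len(shared))     # 8
print(only_score)      # [2, 11]
print(only_comments)   # [20, 23]
```

Eight posts appear in both lists. The two that rank only by comment count (rows 20 and 23) are both from `r/PoliticalDiscussion`, which fits the "popularity" versus "participation" distinction.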

---

📌 **This completes Exercise 2:** the API was connected, linked posts and comments were stored, and a preliminary EDA was carried out.
