# 📓 08_transform_playlist_items_snapshot.ipynb

This notebook is 100% modeling, not extraction.

🎯 Objective

Build the monthly history of the composition of manual playlists.

1 row = 1 video in 1 playlist in 1 month.

This notebook:

❌ Does NOT call the API (target design; this development version still does, as the architecture note below explains)  
❌ Does NOT rewrite state  
✅ Adds temporal context  
✅ Enables history tracking
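
The grain stated above (one row per video, per playlist, per month) can be checked mechanically. The following is a minimal sketch, assuming a hypothetical `snapshot_month` label derived from `snapshot_date`; the helper name `find_grain_violations` and the sample rows are illustrative and not part of the pipeline.

```python
import pandas as pd

# Assumed key columns for the "1 video in 1 playlist in 1 month" grain.
GRAIN = ["snapshot_month", "playlist_id", "video_id"]


def find_grain_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return every row whose (month, playlist, video) key appears more than once."""
    keyed = df.assign(
        # Collapse the snapshot date to a "YYYY-MM" month label.
        snapshot_month=pd.to_datetime(df["snapshot_date"]).dt.strftime("%Y-%m")
    )
    return keyed[keyed.duplicated(subset=GRAIN, keep=False)]


# Illustrative data only: the second and third rows share the same key.
sample = pd.DataFrame(
    {
        "snapshot_date": ["2026-02-14", "2026-02-14", "2026-02-20", "2026-03-01"],
        "playlist_id": ["PL_a", "PL_b", "PL_b", "PL_b"],
        "video_id": ["v1", "v2", "v2", "v2"],
    }
)
violations = find_grain_violations(sample)
```

An empty result would mean the table respects the grain; here the two February rows for the same playlist and video are flagged.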

In [4]:
import os
from datetime import date, datetime
from pathlib import Path

import pandas as pd
import requests
from dotenv import load_dotenv

# Load environment variables (including YOUTUBE_API_KEY) from a local .env file
load_dotenv()

# Project root is the parent of the current working directory
PROJECT_ROOT = Path.cwd().parent
PROCESSED_PATH = PROJECT_ROOT / "data" / "processed" / "youtube"
RAW_PATH = PROJECT_ROOT / "data" / "raw" / "youtube"

API_KEY = os.getenv("YOUTUBE_API_KEY")


In [5]:
df_playlists = pd.read_parquet(
    PROCESSED_PATH / "playlists_manual_static.parquet"
)

df_playlists[["playlist_id", "title"]].head()


Unnamed: 0,playlist_id,title
0,PLV4oS06_KpqbsY_I8iR4HRvb6w3vXUBIM,SQL - Repaso
1,PLV4oS06_KpqZGwOHo-tsdIiaZts7qaqql,Python - Repaso
2,PLV4oS06_KpqaqyS9x6h5ys3REiUfUDOgy,Curso gratuito de SQL en BigQuery | Funciones ...
3,PLV4oS06_KpqbhnVieDd19KJczH_BlBArN,Git - Repaso
4,PLV4oS06_KpqYRtYRoQHo_F_KsEjmqcDK7,Power Bi - Repaso


In [7]:
from datetime import date, datetime, timezone

# Snapshot date (date only, no time)
SNAPSHOT_DATE = date.today()

# Timestamp explicitly in UTC (timezone-aware)
EXTRACTED_AT = datetime.now(timezone.utc)

SNAPSHOT_DATE, EXTRACTED_AT


(datetime.date(2026, 2, 14),
 datetime.datetime(2026, 2, 14, 23, 0, 21, 11925, tzinfo=datetime.timezone.utc))
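
Because `date.today()` uses the machine's local time zone while `EXTRACTED_AT` is in UTC, the two values can fall on different calendar days near midnight. One possible way to keep them aligned, and to obtain a month key for the monthly history, is to derive both from a single UTC timestamp. The names `snapshot_keys` and `snapshot_month` are hypothetical and do not appear elsewhere in this notebook.

```python
from datetime import date, datetime, timezone


def snapshot_keys(extracted_at: datetime) -> tuple[date, date]:
    """Return (snapshot_date, snapshot_month) for a timezone-aware timestamp."""
    if extracted_at.tzinfo is None:
        raise ValueError("extracted_at must be timezone-aware")
    # Convert to UTC first so the date does not depend on the local time zone.
    utc_date = extracted_at.astimezone(timezone.utc).date()
    # The month is represented by its first day.
    return utc_date, utc_date.replace(day=1)


snapshot_date, snapshot_month = snapshot_keys(
    datetime(2026, 2, 14, 23, 0, 21, tzinfo=timezone.utc)
)
```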

⚠️ Note (Final Architecture)

In this version of the notebook, playlist_items_snapshot is built by calling the YouTube API directly for each playlist.

In the final production pipeline, however:

First, playlist_items_manual_static is updated.

Then playlist_items_snapshot is built from that static table.
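
The final flow described in this note could look roughly like the sketch below. It assumes that `playlist_items_manual_static` exposes the columns `playlist_id`, `video_id`, `position` and `added_at`; that schema, the function name `build_snapshot_from_static` and the sample rows are assumptions for illustration, since the static table is not shown in this notebook.

```python
from datetime import date, datetime, timezone

import pandas as pd

SNAPSHOT_COLUMNS = [
    "snapshot_date", "playlist_id", "video_id",
    "position", "added_at", "extracted_at",
]


def build_snapshot_from_static(
    df_static: pd.DataFrame, snapshot_date: date, extracted_at: datetime
) -> pd.DataFrame:
    """Stamp the current static state with temporal context; no API calls."""
    snapshot = df_static.assign(
        snapshot_date=snapshot_date,
        extracted_at=extracted_at,
    )
    return snapshot[SNAPSHOT_COLUMNS]


# Illustrative stand-in for playlist_items_manual_static (assumed schema).
df_static = pd.DataFrame(
    {
        "playlist_id": ["PL_a", "PL_a"],
        "video_id": ["v1", "v2"],
        "position": [0, 1],
        "added_at": pd.to_datetime(
            ["2026-01-23T06:51:06Z", "2026-01-23T06:37:27Z"], utc=True
        ),
    }
)
df_snapshot = build_snapshot_from_static(
    df_static, date(2026, 2, 14), datetime(2026, 2, 14, 23, 0, tzinfo=timezone.utc)
)
```

Under this design the snapshot step only adds `snapshot_date` and `extracted_at` to rows that already exist, which matches the stated goal of modeling without extraction.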

In [8]:
rows = []

for playlist_id in df_playlists["playlist_id"]:

    page_token = None

    while True:
        params = {
            "part": "snippet,contentDetails",
            "playlistId": playlist_id,
            "maxResults": 50,
            "key": API_KEY
        }

        if page_token:
            params["pageToken"] = page_token

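        # A timeout keeps the loop from hanging forever on a stalled connection
        response = requests.get(
            "https://www.googleapis.com/youtube/v3/playlistItems",
            params=params,
            timeout=30,
        )
        # Stop on any HTTP error (bad key, quota exceeded, unknown playlist)
        response.raise_for_status()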

        data = response.json()

        for item in data.get("items", []):
            rows.append({
                "snapshot_date": SNAPSHOT_DATE,
                "playlist_id": playlist_id,
                "video_id": item["contentDetails"]["videoId"],
                "position": item["snippet"]["position"],
                "added_at": item["snippet"]["publishedAt"],
                "extracted_at": EXTRACTED_AT
            })

        page_token = data.get("nextPageToken")
        if not page_token:
            break
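
The pagination loop above follows the `nextPageToken` pattern: request a page, collect its items, and repeat until no token comes back. Below is a sketch of the same pattern as a reusable generator, with the HTTP call abstracted behind a hypothetical `fetch_page` callable so the logic can be exercised without network access. The names `iter_items`, `fake_fetch` and `FAKE_PAGES` are illustrative and not part of the notebook.

```python
from typing import Callable, Iterator, Optional


def iter_items(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield items from every page until the response has no nextPageToken."""
    page_token: Optional[str] = None
    while True:
        data = fetch_page(page_token)
        yield from data.get("items", [])
        page_token = data.get("nextPageToken")
        if not page_token:
            break


# Stand-in for the API: two pages keyed by the token that requests them.
FAKE_PAGES = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "nextPageToken": "t2"},
    "t2": {"items": [{"id": "c"}]},
}


def fake_fetch(page_token: Optional[str]) -> dict:
    """Return the canned page for the given token."""
    return FAKE_PAGES[page_token]


item_ids = [item["id"] for item in iter_items(fake_fetch)]
```

In the notebook, `fetch_page` would wrap the `requests.get` call for a single playlist and return `response.json()`.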


In [9]:
df_playlist_items_snapshot = pd.DataFrame(rows)

df_playlist_items_snapshot.head()


Unnamed: 0,snapshot_date,playlist_id,video_id,position,added_at,extracted_at
0,2026-02-14,PLV4oS06_KpqbsY_I8iR4HRvb6w3vXUBIM,7bwkNrRpgw0,0,2026-01-23T06:51:06Z,2026-02-14 23:00:21.011925+00:00
1,2026-02-14,PLV4oS06_KpqbsY_I8iR4HRvb6w3vXUBIM,HDyKUodeuNw,1,2026-01-23T06:37:27Z,2026-02-14 23:00:21.011925+00:00
2,2026-02-14,PLV4oS06_KpqZGwOHo-tsdIiaZts7qaqql,Zj6uiqMvFOU,0,2026-01-17T20:02:37Z,2026-02-14 23:00:21.011925+00:00
3,2026-02-14,PLV4oS06_KpqZGwOHo-tsdIiaZts7qaqql,RiYjYfMTGvw,1,2026-01-11T23:05:55Z,2026-02-14 23:00:21.011925+00:00
4,2026-02-14,PLV4oS06_KpqZGwOHo-tsdIiaZts7qaqql,0VmI47XeOuE,2,2026-01-11T23:05:33Z,2026-02-14 23:00:21.011925+00:00


In [10]:
print("Rows:", len(df_playlist_items_snapshot))
print("Unique playlists:", df_playlist_items_snapshot["playlist_id"].nunique())
print("Unique videos:", df_playlist_items_snapshot["video_id"].nunique())


Rows: 165
Unique playlists: 16
Unique videos: 165


In [11]:
df_playlist_items_snapshot.dtypes

snapshot_date                 object
playlist_id                      str
video_id                         str
position                       int64
added_at                         str
extracted_at     datetime64[us, UTC]
dtype: object

In [12]:
# added_at → UTC datetime
df_playlist_items_snapshot["added_at"] = (
    pd.to_datetime(df_playlist_items_snapshot["added_at"], utc=True)
)

In [13]:

df_playlist_items_snapshot.dtypes

snapshot_date                 object
playlist_id                      str
video_id                         str
position                       int64
added_at         datetime64[us, UTC]
extracted_at     datetime64[us, UTC]
dtype: object

In [14]:
df_playlist_items_snapshot = df_playlist_items_snapshot[
    [
        "snapshot_date",
        "playlist_id",
        "video_id",
        "position",
        "added_at",
        "extracted_at",
    ]
]


> ⚠️ Note (development phase)
>
> Saving to Parquet is a temporary step used for testing and for keeping the notebooks decoupled.
> In the final version of the pipeline (.py), this step will be skipped and the DataFrame will be sent directly to BigQuery, where the definitive history will live.


In [None]:
output_file = PROCESSED_PATH / "playlist_items_snapshot.parquet"

df_playlist_items_snapshot.to_parquet(output_file, index=False)

output_file


PosixPath('/Users/angelgarciachanga/repositorios/publico/youtube-v3-data-pipeline/data/processed/youtube/playlist_items_snapshot.parquet')