# 03 – Transformation: channels_snapshot

This notebook transforms the raw JSON of the `channels` resource
into a monthly snapshot table (`channels_snapshot`) ready
for historical analysis and loading into BigQuery.

In [1]:
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
PROJECT_ROOT

PosixPath('/Users/angelgarciachanga/repositorios/publico/youtube-v3-data-pipeline')

In [2]:
import json
import pandas as pd
from datetime import datetime, date

In [4]:
RAW_PATH = PROJECT_ROOT / "data" / "raw" / "youtube"
PROCESSED_PATH = PROJECT_ROOT / "data" / "processed" / "youtube"

RAW_PATH, PROCESSED_PATH


(PosixPath('/Users/angelgarciachanga/repositorios/publico/youtube-v3-data-pipeline/data/raw/youtube'),
 PosixPath('/Users/angelgarciachanga/repositorios/publico/youtube-v3-data-pipeline/data/processed/youtube'))

In [5]:
PROCESSED_PATH.mkdir(parents=True, exist_ok=True)


In [6]:
with open(RAW_PATH / "channels.json", "r", encoding="utf-8") as f:
    channels_raw = json.load(f)

channels_raw.keys()


dict_keys(['kind', 'etag', 'pageInfo', 'items'])

In [7]:
item = channels_raw["items"][0]
item.keys()


dict_keys(['kind', 'etag', 'id', 'snippet', 'contentDetails', 'statistics'])
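The cells above inspect the payload by eye. A minimal sketch of the same check done programmatically follows; the helper name `validate_channels_payload`, the `REQUIRED_PARTS` tuple, and the sample payload are assumptions for illustration, not part of the original notebook.

```python
# Parts the later cells rely on (taken from the keys printed above).
REQUIRED_PARTS = ("id", "snippet", "contentDetails", "statistics")


def validate_channels_payload(payload: dict) -> list:
    """Return the channel items, or raise ValueError if the shape is unexpected."""
    items = payload.get("items")
    if not items:
        raise ValueError("channels payload has no 'items'")
    for position, channel in enumerate(items):
        missing = [part for part in REQUIRED_PARTS if part not in channel]
        if missing:
            raise ValueError(f"item {position} is missing parts: {missing}")
    return items


# Hypothetical payload that mirrors the structure of channels.json.
sample_payload = {
    "items": [
        {"id": "UC_example", "snippet": {}, "contentDetails": {}, "statistics": {}}
    ]
}

items = validate_channels_payload(sample_payload)
print(len(items), items[0]["id"])
```

Failing early here gives a clearer error than a `KeyError` raised later while building the snapshot row.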

In [8]:
from datetime import date, datetime, timezone

# Snapshot date (date only, no time)
SNAPSHOT_DATE = date.today()

# Timestamp explicitly in UTC (timezone-aware)
EXTRACTED_AT = datetime.now(timezone.utc)

SNAPSHOT_DATE, EXTRACTED_AT

(datetime.date(2026, 2, 16),
 datetime.datetime(2026, 2, 16, 6, 52, 5, 893517, tzinfo=datetime.timezone.utc))
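Note that `date.today()` uses the machine's local time zone while `EXTRACTED_AT` is in UTC, so the two values can fall on different calendar days close to midnight. If that matters for the pipeline, one option is to derive both from a single UTC instant. The sketch below assumes a hypothetical helper named `snapshot_stamps`; it is not part of the original notebook.

```python
from datetime import datetime, timedelta, timezone


def snapshot_stamps(now=None):
    """Return (snapshot_date, extracted_at), both derived from one UTC instant.

    `now` must be timezone-aware when provided; it defaults to the current time.
    """
    extracted_at = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return extracted_at.date(), extracted_at


# Example: 22:30 on Feb 15 in a UTC-5 zone is already Feb 16 in UTC.
local_evening = datetime(2026, 2, 15, 22, 30, tzinfo=timezone(timedelta(hours=-5)))
snapshot_date, extracted_at = snapshot_stamps(local_evening)
print(snapshot_date, extracted_at)
```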

In [9]:
row_snapshot = {
    "snapshot_date": SNAPSHOT_DATE,
    "channel_id": item["id"],
    "subscriber_count": int(item["statistics"].get("subscriberCount", 0)),
    "view_count": int(item["statistics"].get("viewCount", 0)),
    "video_count": int(item["statistics"].get("videoCount", 0)),
    "extracted_at": EXTRACTED_AT
}

df_channels_snapshot = pd.DataFrame([row_snapshot])

df_channels_snapshot


| | snapshot_date | channel_id | subscriber_count | view_count | video_count | extracted_at |
|---|---|---|---|---|---|---|
| 0 | 2026-02-16 | UCUEOHBht8pnQhQvCfIcl-gg | 2060 | 169848 | 198 | 2026-02-16 06:52:05.893517+00:00 |
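The cell above builds a row only for `items[0]`. If the raw file ever contains more than one channel, the same mapping can be applied to every item. The following is a minimal sketch under that assumption; `build_snapshot_row`, `build_snapshot_frame`, and the sample items are hypothetical names and data, not part of the original notebook.

```python
from datetime import date, datetime, timezone

import pandas as pd


def build_snapshot_row(channel, snapshot_date, extracted_at):
    """Map one channel item to a snapshot row; missing counters fall back to 0."""
    stats = channel.get("statistics", {})
    return {
        "snapshot_date": snapshot_date,
        "channel_id": channel["id"],
        "subscriber_count": int(stats.get("subscriberCount", 0)),
        "view_count": int(stats.get("viewCount", 0)),
        "video_count": int(stats.get("videoCount", 0)),
        "extracted_at": extracted_at,
    }


def build_snapshot_frame(items, snapshot_date, extracted_at):
    """Build one DataFrame with a row per channel item."""
    rows = [build_snapshot_row(ch, snapshot_date, extracted_at) for ch in items]
    return pd.DataFrame(rows)


# Hypothetical items; the counters are strings, hence the int() casts.
sample_items = [
    {"id": "UC_example_1", "statistics": {"subscriberCount": "2060", "viewCount": "169848", "videoCount": "198"}},
    {"id": "UC_example_2", "statistics": {"viewCount": "10", "videoCount": "1"}},
]

frame = build_snapshot_frame(
    sample_items,
    snapshot_date=date(2026, 2, 16),
    extracted_at=datetime(2026, 2, 16, 6, 52, tzinfo=timezone.utc),
)
print(frame)
```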


In [10]:
df_channels_snapshot.shape

(1, 6)

In [11]:
df_channels_snapshot.dtypes

snapshot_date                    object
channel_id                       object
subscriber_count                  int64
view_count                        int64
video_count                       int64
extracted_at        datetime64[ns, UTC]
dtype: object

Note: `snapshot_date` holds `datetime.date` objects, which pandas stores with dtype `object`. This is expected. The values are not strings and this is not an error; it is simply how pandas handles `date` values.

In [29]:
type(df_channels_snapshot["snapshot_date"].iloc[0])

datetime.date
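The type check above confirms that the column holds real `date` objects. As a small illustration on a throwaway DataFrame (`df_demo` is a hypothetical name), the sketch below reproduces the `object` dtype and shows that `pd.to_datetime` can convert the column to a datetime64 dtype if one is ever needed for date arithmetic.

```python
from datetime import date

import pandas as pd

# A one-row frame with a plain datetime.date value, as in the snapshot table.
df_demo = pd.DataFrame({"snapshot_date": [date(2026, 2, 16)]})

# pandas keeps the Python date objects as-is, so the dtype is reported as object.
print(df_demo["snapshot_date"].dtype)
print(isinstance(df_demo["snapshot_date"].iloc[0], date))

# Optional conversion to a datetime64 column (time set to midnight).
converted = pd.to_datetime(df_demo["snapshot_date"])
print(converted.dtype)
```

Keeping the column as `date` objects, as the notebook does, preserves the date-only meaning; the conversion is shown only as an option.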

In [32]:
df_channels_snapshot = df_channels_snapshot[
    [
        "snapshot_date",
        "channel_id",
        "subscriber_count",
        "view_count",
        "video_count",
        "extracted_at"
    ]
]

> ⚠️ Note (development phase)
>
> Saving to Parquet is a temporary step used for testing and for handing data between notebooks.
> In the final version of the pipeline (.py), this step will be omitted and the DataFrame will be sent directly to BigQuery, which will hold the definitive history.

In [None]:
output_file = PROCESSED_PATH / "channels_snapshot.parquet"

df_channels_snapshot.to_parquet(output_file, index=False)

output_file
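For the final pipeline mentioned in the note, where the DataFrame goes straight to BigQuery, a rough sketch could look like the following. It assumes the `google-cloud-bigquery` client library (with `pyarrow`) is installed and credentials are configured; the function name `load_snapshot_to_bigquery` and the `table_id` value are hypothetical, and the append-only write mode is an assumption about how the history would be kept.

```python
def load_snapshot_to_bigquery(df, table_id):
    """Append a snapshot DataFrame to a BigQuery table and return the rows loaded.

    `table_id` is a hypothetical 'project.dataset.table' string.
    """
    # Imported lazily so this sketch can be defined without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        # Assumption: each run appends new snapshot rows to the history.
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        # Partial schema: state the intended types for the two temporal columns.
        schema=[
            bigquery.SchemaField("snapshot_date", "DATE"),
            bigquery.SchemaField("extracted_at", "TIMESTAMP"),
        ],
    )
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()  # Block until the load job finishes (raises on failure).
    return job.output_rows


# Usage (not executed here; requires credentials and an existing dataset):
# load_snapshot_to_bigquery(df_channels_snapshot, "my-project.youtube.channels_snapshot")
```

Appending on every run can create duplicate rows if the same snapshot is loaded twice, so a real pipeline would likely need a uniqueness check on `snapshot_date` and `channel_id`.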


In [34]:
import pandas as pd
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
file_path = PROJECT_ROOT / "data" / "processed" / "youtube" / "channels_snapshot.parquet"

df = pd.read_parquet(file_path)
df


| | snapshot_date | channel_id | subscriber_count | view_count | video_count | extracted_at |
|---|---|---|---|---|---|---|
| 0 | 2026-02-14 | UCUEOHBht8pnQhQvCfIcl-gg | 2060 | 169848 | 198 | 2026-02-14 20:48:31.518251+00:00 |
