# 📘 10_load_videos_snapshot_to_bigquery

### 🎯 Objective

This notebook does not transform data.  
Its only responsibility is to load df_videos_snapshot (generated in notebook 04) into BigQuery as a historical table.  
Destination: youtube-datasets-360.angelgarciadatablog.videos_snapshot

In [None]:
import os
from dotenv import load_dotenv
from google.cloud import bigquery

# Load variables such as GCP_PROJECT from a local .env file, if present
load_dotenv()

In [8]:
PROJECT_ID = os.getenv("GCP_PROJECT")
DATASET_ID = "angelgarciadatablog"
TABLE_ID = "videos_snapshot"

FULL_TABLE_ID = f"{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}"

client = bigquery.Client(project=PROJECT_ID)

print("Destination configured:", FULL_TABLE_ID)

Destination configured: youtube-datasets-360.angelgarciadatablog.videos_snapshot
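
`os.getenv` returns `None` when `GCP_PROJECT` is not set, in which case the f-string above would quietly produce a table ID that starts with `None.`. A minimal sketch of a fail-fast check is shown below; the helper name `build_table_id` is hypothetical and not part of the original notebook.

```python
import os


def build_table_id(project_id, dataset_id, table_id):
    """Return 'project.dataset.table', refusing empty or missing parts.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    parts = {"project": project_id, "dataset": dataset_id, "table": table_id}
    missing = [name for name, value in parts.items() if not value]
    if missing:
        raise ValueError(f"Missing table ID component(s): {', '.join(missing)}")
    return f"{project_id}.{dataset_id}.{table_id}"


# GCP_PROJECT is the variable the notebook reads; it may be unset in other environments.
try:
    print(build_table_id(os.getenv("GCP_PROJECT"), "angelgarciadatablog", "videos_snapshot"))
except ValueError as exc:
    print("Configuration error:", exc)
```

Raising early keeps a misconfigured environment from surfacing later as a confusing BigQuery error.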


## 🧱 Load snapshot from Parquet (temporary)

⚠️ Temporary note:
During the notebook phase, the DataFrame is loaded from Parquet, which serves as the exchange mechanism between notebooks.
In the production version (.py scripts), the DataFrame will be passed directly, with no intermediate storage.

In [9]:
from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path.cwd().parents[0]
PROCESSED_PATH = PROJECT_ROOT / "data" / "processed" / "youtube"

file_path = PROCESSED_PATH / "videos_snapshot.parquet"

print("Path:", file_path)
print("Exists:", file_path.exists())

df_videos_snapshot = pd.read_parquet(file_path)

df_videos_snapshot.head()


Path: /Users/angelgarciachanga/repositorios/publico/youtube-v3-data-pipeline/data/processed/youtube/videos_snapshot.parquet
Exists: True


|   | snapshot_date | video_id | channel_id | published_at | duration_seconds | view_count | like_count | comment_count | extracted_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2026-02-14 | xB4ecIksJSY | UCUEOHBht8pnQhQvCfIcl-gg | 2026-01-24 12:04:21+00:00 | 960 | 30 | 1 | 0 | 2026-02-14 22:16:51.958746+00:00 |
| 1 | 2026-02-14 | 7bwkNrRpgw0 | UCUEOHBht8pnQhQvCfIcl-gg | 2026-01-23 06:52:23+00:00 | 69 | 16 | 2 | 0 | 2026-02-14 22:16:51.958746+00:00 |
| 2 | 2026-02-14 | HDyKUodeuNw | UCUEOHBht8pnQhQvCfIcl-gg | 2026-01-23 06:43:39+00:00 | 294 | 9 | 1 | 0 | 2026-02-14 22:16:51.958746+00:00 |
| 3 | 2026-02-14 | Zj6uiqMvFOU | UCUEOHBht8pnQhQvCfIcl-gg | 2026-01-17 20:07:55+00:00 | 1186 | 18 | 1 | 0 | 2026-02-14 22:16:51.958746+00:00 |
| 4 | 2026-02-14 | RiYjYfMTGvw | UCUEOHBht8pnQhQvCfIcl-gg | 2026-01-11 23:25:57+00:00 | 1138 | 13 | 0 | 0 | 2026-02-14 22:16:51.958746+00:00 |


In [10]:
df_videos_snapshot.dtypes

snapshot_date                    object
video_id                            str
channel_id                          str
published_at        datetime64[us, UTC]
duration_seconds                  int64
view_count                        int64
like_count                        int64
comment_count                     int64
extracted_at        datetime64[us, UTC]
dtype: object

## 🏗 Create the partitioned table using the DataFrame's schema




In [11]:
from google.api_core.exceptions import NotFound
from google.cloud.bigquery import SchemaField

schema = [
    SchemaField("snapshot_date", "DATE"),
    SchemaField("video_id", "STRING"),
    SchemaField("channel_id", "STRING"),
    SchemaField("published_at", "TIMESTAMP"),
    SchemaField("duration_seconds", "INT64"),
    SchemaField("view_count", "INT64"),
    SchemaField("like_count", "INT64"),
    SchemaField("comment_count", "INT64"),
    SchemaField("extracted_at", "TIMESTAMP"),
]

try:
    client.get_table(FULL_TABLE_ID)
    print("Table already exists.")
    
except NotFound:
    table = bigquery.Table(FULL_TABLE_ID, schema=schema)

    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="snapshot_date",
    )

    client.create_table(table)
    print("Partitioned table created.")


Partitioned table created.
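
Because the table is created with a fixed schema, it can help to confirm that the DataFrame's column names line up with it before loading. The sketch below assumes the expected names are copied from the `schema` list above; `compare_columns` is a hypothetical helper, and the sample input is made up to show both kinds of mismatch.

```python
EXPECTED_COLUMNS = [
    "snapshot_date", "video_id", "channel_id", "published_at", "duration_seconds",
    "view_count", "like_count", "comment_count", "extracted_at",
]


def compare_columns(actual, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names, preserving order.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    actual = list(actual)
    missing = [name for name in expected if name not in actual]
    unexpected = [name for name in actual if name not in expected]
    return missing, unexpected


# Made-up input: drops "extracted_at" and adds a "title" column.
missing, unexpected = compare_columns(EXPECTED_COLUMNS[:-1] + ["title"])
print("missing:", missing)
print("unexpected:", unexpected)
```

In the notebook, the call would presumably be `compare_columns(df_videos_snapshot.columns)`, with both lists expected to come back empty.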


## 🔒 Idempotency control by `snapshot_date`

Before the current snapshot is inserted, any existing rows with the same `snapshot_date` are deleted.

This makes the process **idempotent**: if the pipeline runs several times on the same day (tests, re-runs, or recovery from errors), the final result in BigQuery stays consistent and free of duplicates.
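
The delete-then-append pattern can be tried locally before touching BigQuery. The sketch below uses an in-memory SQLite table as a stand-in, which is an assumption made purely for illustration; the function names and the reduced three-column table are hypothetical, while the sample IDs and view counts come from the snapshot shown earlier.

```python
import sqlite3


def replace_snapshot(conn, snapshot_date, rows):
    """Delete the rows for snapshot_date, then insert the new ones, in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "DELETE FROM videos_snapshot WHERE snapshot_date = ?", (snapshot_date,)
        )
        conn.executemany(
            "INSERT INTO videos_snapshot (snapshot_date, video_id, view_count) "
            "VALUES (?, ?, ?)",
            [(snapshot_date, video_id, views) for video_id, views in rows],
        )


def count_rows(conn):
    """Return the total number of rows in the stand-in table."""
    return conn.execute("SELECT COUNT(*) FROM videos_snapshot").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE videos_snapshot (snapshot_date TEXT, video_id TEXT, view_count INTEGER)"
)

rows = [("xB4ecIksJSY", 30), ("7bwkNrRpgw0", 16)]
replace_snapshot(conn, "2026-02-14", rows)
replace_snapshot(conn, "2026-02-14", rows)  # re-run on the same day: no duplicates
replace_snapshot(conn, "2026-02-15", rows)  # a later day adds a new snapshot

print(count_rows(conn))  # prints 4: two rows per distinct snapshot_date
```

One difference is worth keeping in mind: SQLite wraps both statements in a single transaction here, whereas in this notebook the `DELETE` query and the load job run as two separate BigQuery jobs.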


In [None]:
# Delete existing rows for this snapshot_date (assumes the DataFrame holds a single date)
snapshot_date = df_videos_snapshot["snapshot_date"].iloc[0]

delete_query = f"""
DELETE FROM `{FULL_TABLE_ID}`
WHERE snapshot_date = @snapshot_date
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter(
            "snapshot_date",
            "DATE",
            snapshot_date
        )
    ]
)

client.query(delete_query, job_config=job_config).result()

print(f"Rows for snapshot {snapshot_date} deleted, if any existed.")



Rows for snapshot 2026-02-14 deleted, if any existed.


## 📌 Load the Parquet data into BigQuery

In [None]:
# Load the rows from the DataFrame into BigQuery. WRITE_APPEND adds new rows to the existing table.
job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_APPEND"
)

job = client.load_table_from_dataframe(
    df_videos_snapshot,
    FULL_TABLE_ID,
    job_config=job_config
)

job.result()

print("Load completed successfully.")



Load completed successfully.
