# YouTube Trending ETL: Exploring Global Cultural Preferences

This notebook processes daily trending YouTube video data from 113 countries to prepare it for analysis in Tableau. The objective of this project is to uncover global cultural preferences by examining patterns in video titles, sentiment, engagement metrics, and publishing behavior.

The full dataset is updated daily and sourced from Kaggle.

**Project Goals:**
- Download and validate the raw dataset
- Clean and standardize the data
- Engineer relevant features for analysis
- Export the processed dataset for visualization

Dataset source: https://www.kaggle.com/datasets/asaniczka/trending-youtube-videos-113-countries

## Environment Setup

In this section, we import all necessary Python libraries for data processing, text analysis, and emoji detection. We also configure the notebook to ensure that all file paths are relative to the project root directory, rather than the `notebooks/` folder. This allows consistent file organization when reading or writing to the `data/` and `processed/` directories.

All required dependencies are listed in the `requirements.txt` file in the root directory. To install them, run:

```bash
pip install -r requirements.txt
```

In [1]:
# Core Libraries
import pandas as pd
import numpy as np
import re
import unicodedata
import os
from datetime import datetime

# Text Analysis
from textblob import TextBlob
import emoji

# Dataset Access
import kagglehub

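# Ensure paths are relative to the project root, not the notebook location
from pathlib import Path

# Set working directory to project root (parent of /notebooks).
# Note: in a notebook, "__file__" is not defined, so the fallback uses the current
# working directory. Re-running this cell moves one level up again, so run it
# only once per kernel session.
project_root = Path(__file__).resolve().parent.parent if "__file__" in globals() else Path.cwd().parent
os.chdir(project_root)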

print("Working directory set to:", os.getcwd())


Working directory set to: c:\Users\derek\projects\youtube-preference-trends-analysis
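
The `Path.cwd().parent` fallback above depends on where the kernel was started. As a minimal sketch of a more defensive alternative, the helper below walks upward until it finds a marker file. The function name `find_project_root` is hypothetical, and using `requirements.txt` as the marker is an assumption based on that file living in the project root.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "requirements.txt") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found at or above {start}")
```

Calling `os.chdir(find_project_root(Path.cwd()))` would then give the same result whether the cell runs once or several times, because the search is anchored to the marker file rather than to a fixed number of parent levels.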


## Downloading the Dataset

We use the `kagglehub` library to programmatically download the latest version of the dataset from Kaggle. This ensures we always work with the most recent snapshot. The downloaded files will be cached locally, but for clarity and reproducibility, we will move the dataset into the `data/raw/` directory once downloaded.


In [2]:
# Download the dataset using kagglehub
dataset_path = kagglehub.dataset_download("asaniczka/trending-youtube-videos-113-countries")

print("Dataset downloaded to:", dataset_path)

# List the contents to verify
os.listdir(dataset_path)

Dataset downloaded to: C:\Users\derek\.cache\kagglehub\datasets\asaniczka\trending-youtube-videos-113-countries\versions\644


['trending_yt_videos_113_countries.csv']

## Organizing the Raw Data

After downloading the dataset using `kagglehub`, we relocate the CSV file to a dedicated `data/raw/` directory. This improves reproducibility and organization by separating raw inputs from processed outputs. It also allows us to version and reference the dataset consistently across the project.

In [3]:
# Create raw data directory if it doesn't exist
raw_dir = os.path.join("data", "raw")
os.makedirs(raw_dir, exist_ok=True)

# Define source and destination paths
src_file = os.path.join(dataset_path, "trending_yt_videos_113_countries.csv")
dst_file = os.path.join(raw_dir, "trending_yt_videos_113_countries.csv")

# Copy the file to the data/raw/ directory
if not os.path.exists(dst_file):
    import shutil
    shutil.copy2(src_file, dst_file)
    print("Copied file to:", dst_file)
else:
    print("File already exists in:", dst_file)

Copied file to: data\raw\trending_yt_videos_113_countries.csv
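
Because the copy is skipped whenever the destination already exists, a newer Kaggle version in the cache will not replace an older file in `data/raw/`. One possible way to detect that case, sketched here with the standard library only, is to compare content hashes. The helper names `file_sha256` and `needs_refresh` are hypothetical and not part of the original notebook.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large CSVs never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_refresh(src: Path, dst: Path) -> bool:
    """True when dst is missing or its contents differ from src."""
    return not dst.exists() or file_sha256(src) != file_sha256(dst)
```

With these helpers, the condition `if not os.path.exists(dst_file)` could be swapped for `if needs_refresh(Path(src_file), Path(dst_file))`, at the cost of reading both files once per run.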


## Loading and Previewing the Raw Data

We now load the raw CSV file into a Pandas DataFrame and inspect its structure. This step helps verify the format and identify any obvious issues such as missing values, inconsistent columns, or encoding problems.

In [4]:
# Load the dataset from the raw data directory
csv_file = os.path.join("data", "raw", "trending_yt_videos_113_countries.csv")
df = pd.read_csv(csv_file)

# Basic inspection
print("Dataset shape:", df.shape)
df.info()
df.head(3)

Dataset shape: (3594728, 18)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3594728 entries, 0 to 3594727
Data columns (total 18 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   title            object
 1   channel_name     object
 2   daily_rank       int64 
 3   daily_movement   int64 
 4   weekly_movement  int64 
 5   snapshot_date    object
 6   country          object
 7   view_count       int64 
 8   like_count       int64 
 9   comment_count    int64 
 10  description      object
 11  thumbnail_url    object
 12  video_id         object
 13  channel_id       object
 14  video_tags       object
 15  kind             object
 16  publish_date     object
 17  langauge         object
dtypes: int64(6), object(12)
memory usage: 493.7+ MB


|  | title | channel_name | daily_rank | daily_movement | weekly_movement | snapshot_date | country | view_count | like_count | comment_count | description | thumbnail_url | video_id | channel_id | video_tags | kind | publish_date | langauge |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Master H - Haudi Kudiwa (Official Video) ft. T... | Master H | 1 | 0 | 49 | 2025-07-25 | ZW | 1109150 | 32819 | 3103 | Artist : @master_H_official Ft @Bottom_Camp ... | https://i.ytimg.com/vi/rgwL9V3t1og/mqdefault.jpg | rgwL9V3t1og | UC5jrU8WxzE16gPTnlqQAqew | #zimbabwe #music #trending #masterh #anika #da... | youtube#video | 2025-07-04 00:00:00+00:00 |  |
| 1 | EA SPORTS FC 26 \| Official Reveal Trailer | EA SPORTS FC | 2 | 0 | 23 | 2025-07-25 | ZW | 8365074 | 266900 | 22703 | Innovation powered by you in every mode. The C... | https://i.ytimg.com/vi/TSi0iJYSQ24/mqdefault.jpg | TSi0iJYSQ24 | UCoyaxd5LQSuP4ChkxK0pnZQ | yt:cc=on, EA SPORTS FC, FC 25, EA FC, EA FC re... | youtube#video | 2025-07-16 00:00:00+00:00 | en |
| 2 | [LIVE] Manchester United vs Leeds United Pre-s... | Kampleng Com | 3 | 0 | 47 | 2025-07-25 | ZW | 143492 | 617 | 3 | [LIVE] Manchester United vs Leeds United Pre-s... | https://i.ytimg.com/vi/MjpunTTElhY/mqdefault.jpg | MjpunTTElhY | UCdMJwGhTUhAzk1ksH6iHjUw | manchester united vs leeds united, manchester ... | youtube#video | 2025-07-19 00:00:00+00:00 | id |
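
Since the source file is refreshed daily, its schema could change between runs. A small guard like the one below, offered as a sketch rather than part of the original notebook, fails fast when a column the later cells rely on is absent. The function name `check_required_columns` and the choice of `REQUIRED_COLUMNS` are assumptions; the column names themselves come from the `df.info()` output above.

```python
import pandas as pd

# Columns used by the cleaning and feature-engineering cells (an assumed selection)
REQUIRED_COLUMNS = {
    "title", "channel_name", "snapshot_date", "country", "view_count",
    "like_count", "comment_count", "description", "video_id", "publish_date",
}


def check_required_columns(frame: pd.DataFrame, required: set) -> None:
    """Raise ValueError listing any required columns the frame lacks."""
    missing = sorted(required - set(frame.columns))
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
```

Calling `check_required_columns(df, REQUIRED_COLUMNS)` right after `pd.read_csv` would surface a schema change before any cleaning step runs.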


## Data Cleaning and Standardization

This section focuses on preparing the dataset for analysis by resolving common data quality issues. Specifically, we:
- Drop duplicate records based on `video_id`, `snapshot_date`, and `country`
- Remove rows with missing required fields
- Correct inconsistent column names
- Standardize text fields by trimming whitespace
- Convert date fields to proper datetime format

In [5]:
# Drop exact duplicate rows (if any)
df.drop_duplicates(inplace=True)

# Drop duplicate trending entries for the same video, day, and country.
# Including "country" keeps one row per country when a video trends in several
# countries on the same day, which the cross-country comparison relies on.
df.drop_duplicates(subset=["video_id", "snapshot_date", "country"], inplace=True)

# Drop rows missing critical fields (video title, country, or view count)
df.dropna(subset=["title", "country", "view_count"], inplace=True)

# Fix column name typo ("langauge" → "language")
df.rename(columns={"langauge": "language"}, inplace=True)

# Standardize string fields
df["title"] = df["title"].str.strip()
df["description"] = df["description"].fillna("").str.strip()
df["channel_name"] = df["channel_name"].str.strip()
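
The columns passed to `subset` decide which rows count as duplicates, and that choice matters for a multi-country dataset. The toy frame below (invented values, for illustration only) compares de-duplicating on video and day alone with de-duplicating on video, day, and country.

```python
import pandas as pd

# Invented rows: video "a" trends in two countries on the same day,
# and its US entry appears twice.
toy = pd.DataFrame({
    "video_id": ["a", "a", "a", "b"],
    "snapshot_date": ["2025-07-25"] * 4,
    "country": ["ZW", "US", "US", "ZW"],
})

by_video_day = toy.drop_duplicates(subset=["video_id", "snapshot_date"])
by_video_day_country = toy.drop_duplicates(
    subset=["video_id", "snapshot_date", "country"]
)

print(len(by_video_day))          # rows left when country is ignored
print(len(by_video_day_country))  # rows left when country is part of the key
```

Without `country` in the key, only the first country's row for video "a" survives; with it, each country keeps its own entry and only the repeated US row is dropped.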

### Parsing Date Columns

We convert the `snapshot_date` and `publish_date` fields to datetime format. This allows for accurate calculation of time-based features in the next stage of the pipeline.

In [6]:
# Parse date columns
df["snapshot_date"] = pd.to_datetime(df["snapshot_date"], errors="coerce")
df["publish_date"] = pd.to_datetime(df["publish_date"], utc=True, errors="coerce")

# Drop rows with unparseable dates, if any
df.dropna(subset=["snapshot_date", "publish_date"], inplace=True)
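
As a quick illustration of what `errors="coerce"` and `utc=True` do, the sketch below parses one valid timestamp and one invalid string. The input values are made up for the example.

```python
import pandas as pd

raw = pd.Series(["2025-07-04 00:00:00+00:00", "not a date"])

# Invalid entries become NaT instead of raising; valid ones become UTC-aware
parsed = pd.to_datetime(raw, utc=True, errors="coerce")

print(parsed.isna().tolist())
print(parsed.dt.tz)
```

Rows that come back as `NaT` are the ones the subsequent `dropna` call removes.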

## Feature Engineering

In this section, we derive new features to capture patterns in title structure, engagement behavior, and publishing timelines. These features will support cross-country comparisons of content preferences and viewer interaction styles.

The engineered features include:
- Title characteristics (length, emoji usage, presence of a question)
- Sentiment of the title
- Engagement ratios (likes/views, comments/views)
- Temporal features (days between publish and trending, day of week)

In [7]:
# Title length (characters and word count)
df["title_length_chars"] = df["title"].str.len()
df["title_word_count"] = df["title"].str.split().str.len()

# Emoji presence in title
df["title_has_emoji"] = df["title"].apply(lambda x: any(char in emoji.EMOJI_DATA for char in x))

# Question mark in title
df["title_has_question"] = df["title"].str.contains(r"\?", regex=True)

# Sentiment polarity of title
df["title_sentiment"] = df["title"].apply(lambda x: TextBlob(x).sentiment.polarity)
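
The structural title features can be sanity-checked on a couple of hand-written titles before running them over millions of rows. This sketch uses only pandas; the two titles are invented, and a literal match (`regex=False`) stands in for the escaped-regex form used above.

```python
import pandas as pd

titles = pd.Series(["Why is this trending?", "Official Trailer"])

features = pd.DataFrame({
    "title_length_chars": titles.str.len(),
    "title_word_count": titles.str.split().str.len(),
    # Literal match on "?" is equivalent to the regex r"\?"
    "title_has_question": titles.str.contains("?", regex=False),
})

print(features)
```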

In [8]:
# Avoid division by zero: only the denominator (view_count) needs protecting.
# Zero likes or comments are valid numerators and should yield a ratio of 0.
df["view_count"] = df["view_count"].replace(0, np.nan)

# Drop rows where view_count is missing or was zero
df.dropna(subset=["view_count"], inplace=True)

# Engagement ratios
df["like_ratio"] = df["like_count"] / df["view_count"]
df["comment_ratio"] = df["comment_count"] / df["view_count"]
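
To see how masking zeros affects the ratios, here is a small sketch on invented numbers in which only the denominator is masked. A zero `view_count` then produces a missing ratio, while a zero `like_count` still produces a ratio of 0.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "view_count": [1000, 0, 500],
    "like_count": [50, 3, 0],
})

# Mask zero denominators so the division yields NaN rather than inf
views = toy["view_count"].replace(0, np.nan)
toy["like_ratio"] = toy["like_count"] / views

print(toy)
```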

In [9]:
# Convert both to timezone-naive datetime
df["publish_date"] = df["publish_date"].dt.tz_localize(None)
df["snapshot_date"] = pd.to_datetime(df["snapshot_date"])  # already naive, but reparse just in case

# Compute days between publish and snapshot
df["days_since_publish"] = (df["snapshot_date"] - df["publish_date"]).dt.days

# Day of week published
df["publish_day_of_week"] = df["publish_date"].dt.day_name()
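
The temporal features can be verified on a single row. The sketch below reuses the dates from the first previewed record (snapshot on 2025-07-25, published on 2025-07-04 UTC) and applies the same steps to a one-row frame built for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "snapshot_date": pd.to_datetime(["2025-07-25"]),
    "publish_date": pd.to_datetime(["2025-07-04 00:00:00+00:00"], utc=True),
})

# Strip the UTC timezone so both columns are naive and can be subtracted
toy["publish_date"] = toy["publish_date"].dt.tz_localize(None)

toy["days_since_publish"] = (toy["snapshot_date"] - toy["publish_date"]).dt.days
toy["publish_day_of_week"] = toy["publish_date"].dt.day_name()

print(toy[["days_since_publish", "publish_day_of_week"]])
```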

## Exporting the Processed Dataset

With all features engineered and the dataset cleaned, we now export the final data to a CSV file for use in Tableau. The exported file will include all video-level records, enriched with derived features such as title sentiment, engagement ratios, and temporal indicators.

An aggregated dataset that summarizes feature values by country can optionally be generated as well, to support country-level comparisons in Tableau.


In [10]:
# Create the processed data directory if it doesn't exist
os.makedirs("processed", exist_ok=True)

# Export the full cleaned dataset
output_path_full = os.path.join("processed", "trending_youtube_enriched.csv")
df.to_csv(output_path_full, index=False)

print("Exported full enriched dataset to:", output_path_full)

Exported full enriched dataset to: processed\trending_youtube_enriched.csv
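
The country-level summary mentioned above is not produced by the export cell. One way it might be built is sketched below, using pandas named aggregation. The function name `summarize_by_country` and the output column names are hypothetical; the input columns are the ones engineered earlier in the notebook.

```python
import pandas as pd


def summarize_by_country(frame: pd.DataFrame) -> pd.DataFrame:
    """Aggregate video-level features to one row per country."""
    return (
        frame.groupby("country")
        .agg(
            video_rows=("video_id", "size"),
            avg_like_ratio=("like_ratio", "mean"),
            avg_title_sentiment=("title_sentiment", "mean"),
            share_with_emoji=("title_has_emoji", "mean"),
        )
        .reset_index()
    )
```

The result of `summarize_by_country(df)` could then be written with `to_csv` alongside the full export, under whatever file name suits the project.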
