# 🎥 YouTube Audience Trends Analysis

**Objective:**  
Analyze how YouTube audience preferences evolve over time using trending video data. We'll explore changes in engagement, content types, and metadata patterns across different countries and timeframes.

**Dataset:**  
[YouTube Trending Video Statistics (Kaggle)](https://www.kaggle.com/datasets/datasnaek/youtube-new)

### Step 1: Import Required Libraries

We begin by importing all necessary Python libraries for the **ETL process** — including tools for file management, data manipulation, visualization, environment variable handling, and accessing the Kaggle API to download the dataset.

In [15]:
# If not already installed, run this in a separate cell:
# !pip install kaggle python-dotenv pandas numpy matplotlib seaborn

# File handling & environment variables
import os
from dotenv import load_dotenv

# Data manipulation & exploration
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Kaggle API
from kaggle.api.kaggle_api_extended import KaggleApi

# Plot settings
sns.set(style='whitegrid')

### Step 2: Extract YouTube Trending Dataset from Kaggle

We load our Kaggle authentication credentials from the `.env` file and use the Kaggle API to download the trending video dataset published by [datasnaek](https://www.kaggle.com/datasets/datasnaek/youtube-new). This dataset includes trending videos from ten countries, including the US, UK, Canada, and India.

In [16]:
# Load environment variables from .env file
load_dotenv()

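# load_dotenv() has already copied the credentials into os.environ;
# fail early with a clear message if either one is missing
for var in ('KAGGLE_USERNAME', 'KAGGLE_KEY'):
    if not os.getenv(var):
        raise RuntimeError(f"{var} is not set; add it to your .env file")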

# Authenticate and download dataset
api = KaggleApi()
api.authenticate()
api.dataset_download_files(
    'datasnaek/youtube-new',
    path='youtube_data',
    unzip=True
)

Dataset URL: https://www.kaggle.com/datasets/datasnaek/youtube-new


### Step 3: Load and Inspect the Raw YouTube Dataset

We'll begin by listing all of the files in the downloaded folder and then loading the US trending videos dataset (`USvideos.csv`) as our working subset. This initial inspection will help us understand the structure of the data, identify key fields, and begin planning our cleaning steps.

In [28]:
# List all files in the dataset folder
files = os.listdir('youtube_data')
print("Files in dataset folder:", files)

# Load the US trending videos dataset
df_us = pd.read_csv(os.path.join('youtube_data', 'USvideos.csv'))

# Preview structure
print("Shape of dataset:", df_us.shape)
df_us.head()

Files in dataset folder: ['CAvideos.csv', 'CA_category_id.json', 'DEvideos.csv', 'DE_category_id.json', 'FRvideos.csv', 'FR_category_id.json', 'GBvideos.csv', 'GB_category_id.json', 'INvideos.csv', 'IN_category_id.json', 'JPvideos.csv', 'JP_category_id.json', 'KRvideos.csv', 'KR_category_id.json', 'MXvideos.csv', 'MX_category_id.json', 'RUvideos.csv', 'RU_category_id.json', 'USvideos.csv', 'US_category_id.json']
Shape of dataset: (40949, 16)


Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


In [29]:
# Check data types and nulls
df_us.info()
df_us.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   video_id                40949 non-null  object
 1   trending_date           40949 non-null  object
 2   title                   40949 non-null  object
 3   channel_title           40949 non-null  object
 4   category_id             40949 non-null  int64 
 5   publish_time            40949 non-null  object
 6   tags                    40949 non-null  object
 7   views                   40949 non-null  int64 
 8   likes                   40949 non-null  int64 
 9   dislikes                40949 non-null  int64 
 10  comment_count           40949 non-null  int64 
 11  thumbnail_link          40949 non-null  object
 12  comments_disabled       40949 non-null  bool  
 13  ratings_disabled        40949 non-null  bool  
 14  video_error_or_removed  40949 non-null  bool  
 15  de

video_id                    0
trending_date               0
title                       0
channel_title               0
category_id                 0
publish_time                0
tags                        0
views                       0
likes                       0
dislikes                    0
comment_count               0
thumbnail_link              0
comments_disabled           0
ratings_disabled            0
video_error_or_removed      0
description               570
dtype: int64

### Step 4: Transform and Enrich the US Dataset

Now that we've previewed the US trending dataset, we'll apply a full set of transformations to prepare it for analysis. These changes focus on time parsing, categorical mapping, and extracting new features that will help us explore audience preferences and cultural trends.

#### Transformations performed:
- Convert `publish_time` and `trending_date` to datetime
- Extract time-related features (e.g., publish/trending hour, weekday, etc.)
- Map `category_id` to readable `category_name` using the provided JSON file
- Clean the `title` column and derive:
  - `title_length`
  - `has_emoji`
  - `has_question`
- Add `tag_count` for analysis of metadata use
- Prepare the final cleaned DataFrame `df_us_clean` for export
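
The `trending_date` strings are written as year.day.month (for example `17.14.11`), so the parser needs an explicit format. The snippet below is a minimal sketch of that parsing on two sample strings; the first is taken from the preview above and the second is made up for illustration.

```python
import pandas as pd

# Sample values in the dataset's yy.dd.mm layout
# (the second value is hypothetical, chosen so that day and month could be confused)
sample = pd.Series(['17.14.11', '18.01.06'])

# %y = two-digit year, %d = day of month, %m = month
parsed = pd.to_datetime(sample, format='%y.%d.%m')
print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2017-11-14', '2018-06-01']

# publish_time is an ISO 8601 string ending in Z, so it parses as timezone-aware UTC
published = pd.to_datetime(pd.Series(['2017-11-13T17:13:01.000Z']))
print(published.dt.hour.iloc[0], published.dt.day_name().iloc[0])
# 17 Monday
```

Because `publish_time` is in UTC, `publish_hour` and `publish_dayofweek` describe UTC time rather than the uploader's local time.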

In [32]:
import json
import emoji
import re

# ---------- Date Conversion ----------
df_us['publish_time'] = pd.to_datetime(df_us['publish_time'])
df_us['trending_date'] = pd.to_datetime(df_us['trending_date'], format='%y.%d.%m')

# ---------- Time Features ----------
df_us['publish_hour'] = df_us['publish_time'].dt.hour
df_us['publish_dayofweek'] = df_us['publish_time'].dt.day_name()
df_us['trending_year'] = df_us['trending_date'].dt.year
df_us['trending_month'] = df_us['trending_date'].dt.month_name()
df_us['trending_dayofweek'] = df_us['trending_date'].dt.day_name()

# ---------- Category Mapping ----------
with open('youtube_data/US_category_id.json', 'r') as f:
    category_data = json.load(f)

# Flatten JSON to get category_id → category_name mapping
id_to_name = {int(item['id']): item['snippet']['title'] for item in category_data['items']}
df_us['category_name'] = df_us['category_id'].map(id_to_name)

# ---------- Title Cleaning + Feature Engineering ----------
df_us['title'] = df_us['title'].str.strip()

# Title length
df_us['title_length'] = df_us['title'].str.len()

# Contains emoji
df_us['has_emoji'] = df_us['title'].apply(lambda x: any(char in emoji.EMOJI_DATA for char in x))

# Contains question mark
df_us['has_question'] = df_us['title'].str.contains(r'\?', regex=True)

# ---------- Tag Feature ----------
df_us['tag_count'] = df_us['tags'].apply(lambda x: 0 if x == '[none]' else len(x.split('|')))

# ---------- Final Preview ----------
df_us_clean = df_us.copy()
df_us_clean.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,...,publish_hour,publish_dayofweek,trending_month,trending_year,trending_dayofweek,category_name,title_length,has_emoji,has_question,tag_count
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13 17:13:01+00:00,SHANtell martin,748374,57527,2966,...,17,Monday,November,2017,Tuesday,People & Blogs,34,False,False,1
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13 07:30:00+00:00,"last week tonight trump presidency|""last week ...",2418783,97185,6146,...,7,Monday,November,2017,Tuesday,Entertainment,62,False,False,4
2,5qpjK5DgCt4,2017-11-14,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12 19:05:24+00:00,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,...,19,Sunday,November,2017,Tuesday,Comedy,53,False,False,23
3,puqaWrEC7tY,2017-11-14,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13 11:00:04+00:00,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,...,11,Monday,November,2017,Tuesday,Entertainment,32,False,True,27
4,d380meD0W0M,2017-11-14,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12 18:01:41+00:00,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,...,18,Sunday,November,2017,Tuesday,Entertainment,24,False,True,14
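
To make the title and tag features easier to check by hand, here is a small sketch that rebuilds `title_length`, `has_question`, and `tag_count` on two sample rows. The titles come from the preview above; the padded whitespace, the shortened tag string, and the helper name `count_tags` are assumptions made for illustration.

```python
import pandas as pd

# Sample rows; tag strings follow the dataset's pipe-delimited layout
sample = pd.DataFrame({
    'title': ['  Nickelback Lyrics: Real or Fake?  ', 'WE WANT TO TALK ABOUT OUR MARRIAGE'],
    'tags': ['rhett and link|"gmm"|"good mythical morning"', '[none]'],
})

def count_tags(raw):
    """Return the number of pipe-separated tags; '[none]' marks a video without tags."""
    return 0 if raw == '[none]' else len(raw.split('|'))

# Strip first so surrounding whitespace does not inflate the length
sample['title'] = sample['title'].str.strip()
sample['title_length'] = sample['title'].str.len()

# regex=False treats '?' as a literal character, so no escaping is needed
sample['has_question'] = sample['title'].str.contains('?', regex=False)
sample['tag_count'] = sample['tags'].apply(count_tags)

print(sample[['title_length', 'has_question', 'tag_count']])
```

The resulting lengths (32 and 34) match the `title_length` values shown for the same titles in the output above.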


### Step 5: Review Feature Enrichment and Save Cleaned US Dataset

Before moving on to processing the other country datasets, we briefly review the enriched features in the US dataset to confirm that the transformations were successful and meaningful.

#### Notable observations:
- `publish_dayofweek` and `trending_dayofweek` make it possible to compare the weekday a video was uploaded with the weekday it trended.
- `has_emoji` is `False` in all five previewed rows; whether emoji use changes over the dataset's time span can be checked later.
- `has_question` is `True` for two of the five previewed titles, which points to variation in tone and style, possibly linked to clickbait or audience engagement tactics.
- `tag_count` varies widely, from 1 to 27 in the previewed rows, hinting at differing metadata strategies by creators.
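
These observations rest on a five-row preview, so it may help to quantify them on the full frame. The sketch below shows one way to do that; the tiny DataFrame is a hypothetical stand-in for `df_us_clean`, and its values are not results from the dataset.

```python
import pandas as pd

# Tiny hypothetical frame standing in for df_us_clean
df = pd.DataFrame({
    'trending_year': [2017, 2017, 2018, 2018],
    'has_emoji': [False, False, True, False],
    'has_question': [True, False, False, True],
    'tag_count': [1, 23, 27, 14],
})

# The mean of a boolean column is the share of True values
share_by_year = df.groupby('trending_year')[['has_emoji', 'has_question']].mean()
print(share_by_year)

# Spread of tag counts (count, mean, std, min, quartiles, max)
print(df['tag_count'].describe())
```

Replacing `df` with `df_us_clean` should give the per-year emoji and question-mark shares for the real data.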

We now save this cleaned US dataset to a CSV file so it can be reused for Tableau visualization or combined with other cleaned datasets.


In [33]:
# Ensure output directory exists
os.makedirs('clean_data', exist_ok=True)

# Save cleaned US dataset
df_us_clean.to_csv('clean_data/US_clean.csv', index=False)
print("✅ US cleaned dataset saved to 'clean_data/US_clean.csv'")

✅ US cleaned dataset saved to 'clean_data/US_clean.csv'


### Step 6: Clean and Combine All Country Datasets

Now that we've verified the enrichment process on the US dataset, we'll apply the same transformations to all other country files. This includes:

- Parsing date and time fields
- Mapping category IDs to names
- Extracting title and tag-based features
- Adding a `country` column to each dataset

We'll skip the already-processed US file, clean the remaining datasets in batch, and then combine all of them into a single master file for cross-country analysis. Each cleaned country file will also be saved individually in the `clean_data/` folder.


In [35]:
def clean_country_df(file_path, country_code, category_map):
    df = pd.read_csv(file_path, encoding='utf-8', encoding_errors='replace')

    # Parse dates
    df['publish_time'] = pd.to_datetime(df['publish_time'])
    df['trending_date'] = pd.to_datetime(df['trending_date'], format='%y.%d.%m')

    # Time features
    df['publish_hour'] = df['publish_time'].dt.hour
    df['publish_dayofweek'] = df['publish_time'].dt.day_name()
    df['trending_year'] = df['trending_date'].dt.year
    df['trending_month'] = df['trending_date'].dt.month_name()
    df['trending_dayofweek'] = df['trending_date'].dt.day_name()

    # Map category ID to name
    df['category_name'] = df['category_id'].map(category_map)

    # Title features
    df['title'] = df['title'].str.strip()
    df['title_length'] = df['title'].str.len()
    df['has_emoji'] = df['title'].apply(lambda x: any(char in emoji.EMOJI_DATA for char in x))
    df['has_question'] = df['title'].str.contains(r'\?', regex=True)

    # Tag features
    df['tag_count'] = df['tags'].apply(lambda x: 0 if x == '[none]' else len(x.split('|')))

    # Country label
    df['country'] = country_code

    return df


# Load category map (the US mapping is reused for every country; this assumes category IDs are shared)
with open('youtube_data/US_category_id.json', 'r') as f:
    category_data = json.load(f)
category_map = {int(item['id']): item['snippet']['title'] for item in category_data['items']}

# Get list of country files
all_files = os.listdir('youtube_data')
country_files = [f for f in all_files if f.endswith('.csv') and not f.startswith('US')]

# Process each file
clean_dfs = []
for filename in country_files:
    country_code = filename[:2].upper()
    print(f"🔄 Processing {country_code}...")
    df_clean = clean_country_df(os.path.join('youtube_data', filename), country_code, category_map)
    clean_dfs.append(df_clean)

    # Save individual country CSV
    output_path = f'clean_data/{country_code}_clean.csv'
    df_clean.to_csv(output_path, index=False)
    print(f"✅ Saved cleaned {country_code} data to {output_path}")

# Combine with US cleaned dataset
df_us_clean['country'] = 'US'  # Ensure US has country column
df_all = pd.concat([df_us_clean] + clean_dfs, ignore_index=True)

# Save combined dataset
df_all.to_csv('clean_data/all_countries_clean.csv', index=False)
print("✅ Combined dataset saved to 'clean_data/all_countries_clean.csv'")


🔄 Processing CA...
✅ Saved cleaned CA data to clean_data/CA_clean.csv
🔄 Processing DE...
✅ Saved cleaned DE data to clean_data/DE_clean.csv
🔄 Processing FR...
✅ Saved cleaned FR data to clean_data/FR_clean.csv
🔄 Processing GB...
✅ Saved cleaned GB data to clean_data/GB_clean.csv
🔄 Processing IN...
✅ Saved cleaned IN data to clean_data/IN_clean.csv
🔄 Processing JP...
✅ Saved cleaned JP data to clean_data/JP_clean.csv
🔄 Processing KR...
✅ Saved cleaned KR data to clean_data/KR_clean.csv
🔄 Processing MX...
✅ Saved cleaned MX data to clean_data/MX_clean.csv
🔄 Processing RU...
✅ Saved cleaned RU data to clean_data/RU_clean.csv
✅ Combined dataset saved to 'clean_data/all_countries_clean.csv'
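
Since the US category file is reused for every country, it can be worth checking that no `category_id` falls outside the map. The following is a rough sketch of the flattening step plus such a check; the two-item `category_data` excerpt and the ID `99` are hypothetical.

```python
import pandas as pd

# Hypothetical excerpt in the layout of the *_category_id.json files
category_data = {
    'items': [
        {'id': '22', 'snippet': {'title': 'People & Blogs'}},
        {'id': '24', 'snippet': {'title': 'Entertainment'}},
    ]
}

# IDs are strings in the JSON but integers in the CSV, hence int()
category_map = {int(item['id']): item['snippet']['title'] for item in category_data['items']}

ids = pd.Series([22, 24, 99])  # 99 is a made-up ID with no entry in the map
names = ids.map(category_map)

# IDs missing from the map come back as NaN, so they are easy to list
unmapped = sorted(ids[names.isna()].unique())
print(names.tolist())
print("Unmapped IDs:", unmapped)
```

An empty `unmapped` list for each country would support the shared-ID assumption.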


### Step 7: Validate the Combined Dataset

Before handing the combined dataset to Tableau, we perform a few quick checks to confirm the success of the cleaning process across all countries:

- Preview the combined dataset
- Check the presence and consistency of key columns
- Validate country distribution
- Look for unexpected nulls or formatting issues

In [36]:
# Preview rows and shape
print("Shape of combined dataset:", df_all.shape)
df_all.head()

Shape of combined dataset: (375942, 28)


Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,...,publish_dayofweek,trending_month,trending_year,trending_dayofweek,category_name,title_length,has_emoji,has_question,tag_count,country
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13 17:13:01+00:00,SHANtell martin,748374,57527,2966,...,Monday,November,2017,Tuesday,People & Blogs,34,False,False,1,US
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13 07:30:00+00:00,"last week tonight trump presidency|""last week ...",2418783,97185,6146,...,Monday,November,2017,Tuesday,Entertainment,62,False,False,4,US
2,5qpjK5DgCt4,2017-11-14,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12 19:05:24+00:00,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,...,Sunday,November,2017,Tuesday,Comedy,53,False,False,23,US
3,puqaWrEC7tY,2017-11-14,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13 11:00:04+00:00,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,...,Monday,November,2017,Tuesday,Entertainment,32,False,True,27,US
4,d380meD0W0M,2017-11-14,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12 18:01:41+00:00,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,...,Sunday,November,2017,Tuesday,Entertainment,24,False,True,14,US


In [37]:
# Confirm all countries present
print("Countries included:", df_all['country'].unique())

# Check that each country has data
print("\nRows per country:")
print(df_all['country'].value_counts())

# Check for nulls in important columns
print("\nMissing values in key fields:")
print(df_all[['title', 'publish_time', 'trending_date', 'category_name']].isnull().sum())

# Quick look at derived features
print("\nHas emoji breakdown:")
print(df_all['has_emoji'].value_counts())

print("\nAverage title length per country:")
print(df_all.groupby('country')['title_length'].mean())


Countries included: ['US' 'CA' 'DE' 'FR' 'GB' 'IN' 'JP' 'KR' 'MX' 'RU']

Rows per country:
country
US    40949
CA    40881
DE    40840
RU    40739
FR    40724
MX    40451
GB    38916
IN    37352
KR    34567
JP    20523
Name: count, dtype: int64

Missing values in key fields:
title            0
publish_time     0
trending_date    0
category_name    0
dtype: int64

Has emoji breakdown:
has_emoji
False    364824
True      11118
Name: count, dtype: int64

Average title length per country:
country
CA    53.709254
DE    55.407542
FR    53.717562
GB    49.549774
IN    70.563022
JP    37.880817
KR    41.244395
MX    58.042891
RU    53.124058
US    48.578183
Name: title_length, dtype: float64
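
The checks above do not compare column sets across the per-country frames. A minimal sketch of such a check is shown below; the helper `missing_columns` and the three small frames are hypothetical, and in the notebook the input would be built from `df_us_clean` and `clean_dfs`.

```python
import pandas as pd

# Hypothetical per-country frames keyed by country code
frames = {
    'US': pd.DataFrame({'video_id': ['a'], 'title': ['x'], 'country': ['US']}),
    'CA': pd.DataFrame({'video_id': ['b'], 'title': ['y'], 'country': ['CA']}),
    'JP': pd.DataFrame({'video_id': ['c'], 'country': ['JP']}),  # 'title' left out on purpose
}

def missing_columns(frames, required):
    """Map each country code to the required columns its frame lacks, if any."""
    report = {}
    for code, frame in frames.items():
        missing = sorted(set(required) - set(frame.columns))
        if missing:
            report[code] = missing
    return report

print(missing_columns(frames, ['video_id', 'title', 'country']))
# {'JP': ['title']}
```

This matters because `pd.concat` keeps the union of all columns by default, so a column present in only some frames appears in the combined frame and is filled with NaN for the others.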


### Step 8: Final Remarks and Next Steps

The data cleaning and enrichment process has been completed successfully across all ten countries. Each dataset has been:

- Parsed for accurate datetime values
- Enriched with time-based and text-based features
- Mapped to human-readable category names
- Consolidated into a single master dataset with a `country` label

#### Key Takeaways from the Validation:
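- No missing values in the four key fields checked (`title`, `publish_time`, `trending_date`, `category_name`)
- Emoji usage is rare overall but present (~3% of rows)
- Average title length varies by country, from about 38 characters (JP) to about 71 (IN), which may reflect cultural or platform strategy differences; because length is counted in characters, differences in script may also contribute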
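
The emoji figure can be reproduced from the counts printed in the "Has emoji breakdown" output of Step 7. A short worked check, with the two counts copied from that output:

```python
# Counts copied from the "Has emoji breakdown" output in Step 7
with_emoji = 11118
without_emoji = 364824

total = with_emoji + without_emoji
share = with_emoji / total
print(f"{total} rows, {share:.2%} with emoji")
# 375942 rows, 2.96% with emoji
```

The total also matches the row count reported in the shape of the combined dataset.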

The final cleaned dataset (`clean_data/all_countries_clean.csv`) is now ready for visualization in Tableau. The next phase will focus on exploring how audience preferences have evolved over time — across content types, formats, and countries — to better understand the cultural pulse of YouTube.

👉 Proceed to Tableau for interactive analysis and storytelling.