## 0.1. Preprocessed Data Initial Exploration

This notebook provides a brief exploration of the raw dataset obtained from `youtube_trends/dataset.py` and saved in `data/raw/dataset.csv`. This initial exploration was performed to determine the techniques and tools to use during data processing for future analysis. The data processing stage can also be found in `youtube_trends/dataset.py`.

---
#### About Raw Dataset

The raw dataset contains information about trending YouTube videos, including details about the videos and their respective channels.

**Video Information**
- `video_id`: Unique identifier for the video on YouTube.
- `video_published_at`: The date and time when the video was published.
- `video_trending_date`: The date when the video was identified as trending.
- `video_trending_country`: The country where the video is trending (ISO 3166-1 alpha-2 country code, e.g., "US" for the United States).
- `video_title`: The title of the video as displayed on YouTube.
- `video_description`: The description provided by the video creator.
- `video_default_thumbnail`: URL of the default thumbnail for the video.
- `video_category_id`: Numeric ID representing the category of the video (e.g., Music, Gaming, etc.).
- `video_tags`: List of tags associated with the video for categorization and discoverability.
- `video_duration`: Duration of the video in ISO 8601 format (e.g., "PT10M15S" for 10 minutes and 15 seconds).
- `video_dimension`: Dimension of the video (e.g., "2d", "3d").
- `video_definition`: Video resolution quality (e.g., "hd" for high definition, "sd" for standard definition).
- `video_licensed_content`: Boolean indicating if the video contains licensed content.
- `video_view_count`: Total number of views for the video.
- `video_like_count`: Total number of likes for the video.
- `video_comment_count`: Total number of comments on the video.
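
Because `video_duration` is stored as an ISO 8601 string, it has to be converted to a number before it can be used in numeric analysis. The following is a minimal sketch of one way to do that with a regular expression; the helper name `duration_to_seconds` is hypothetical and not part of `youtube_trends/dataset.py`, and it assumes durations only use day, hour, minute and second components.

```python
import re

# Matches durations such as "PT10M15S", "PT1H" or "P1DT2H"; every component is optional.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(duration):
    """Convert an ISO 8601 duration string to seconds; return None if it does not match."""
    match = _DURATION_RE.match(duration) if isinstance(duration, str) else None
    if match is None:
        return None
    # Components missing from the string are treated as zero
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT10M15S"))  # 615
```

Returning `None` for unparseable or missing values keeps the helper safe to use with `Series.map` on a column that contains NaN.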

**Channel Information**
- `channel_id`: Unique identifier for the YouTube channel.
- `channel_title`: The name/title of the channel.
- `channel_description`: Description provided by the channel owner.
- `channel_custom_url`: Custom URL for the channel (if available).
- `channel_published_at`: The date and time when the channel was created.
- `channel_country`: The country associated with the channel (if specified by the creator).
- `channel_view_count`: Total number of views across all videos on the channel.
- `channel_subscriber_count`: Total number of subscribers to the channel.
- `channel_have_hidden_subscribers`: Boolean indicating if the channel has hidden its subscriber count.
- `channel_video_count`: Total number of videos uploaded by the channel.
- `channel_localized_title`: The localized title of the channel (if available in a different language).
- `channel_localized_description`: The localized description of the channel (if available in a different language).

---

In [None]:
import torch
import warnings
import requests
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.io as pio
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from PIL import Image
from io import BytesIO
from pathlib import Path
from ultralytics import YOLO
from IPython.display import HTML
from IPython.display import display
from youtube_trends.config import RAW_DATA_DIR

pio.renderers.default = 'notebook' 
warnings.filterwarnings('ignore')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
df = pd.read_csv(RAW_DATA_DIR / "dataset.csv", low_memory=False)

In [None]:
display(df)

In [None]:
print(f"Shape of the dataset: {df.shape}\n")

In [None]:
df.info()

#### Object columns

In [None]:
# Check for missing values
missing_values = df.select_dtypes(include=['object']).isnull().sum()
missing_percentage = 100 * missing_values / len(df)
missing_df = pd.DataFrame({
    'Column': df.select_dtypes(include=['object']).columns,
    'Missing Values': missing_values,
    'Missing Percentage': missing_percentage
})
missing_df = missing_df.reset_index(drop=True)
missing_df

In [None]:
# Check for unique values in each object column
unique_values = df.select_dtypes(include=['object']).nunique()
unique_values_df = unique_values.reset_index()
unique_values_df.columns = ['Column', 'Unique Values']
unique_values_df

The columns `video_dimension` and `channel_have_hidden_subscribers` each contain a single value across all non-null rows, so they will not be considered in future analysis.
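
As a rough illustration of how such constant columns could be found programmatically rather than by reading the table, the sketch below flags every column with at most one distinct non-null value. The helper `constant_columns` and the toy frame are assumptions made for the example, not code from the project.

```python
import pandas as pd


def constant_columns(frame):
    """Return the columns whose non-null values are all identical."""
    unique_counts = frame.nunique(dropna=True)
    return unique_counts[unique_counts <= 1].index.tolist()


# Toy data, not the real dataset
toy = pd.DataFrame({
    "video_dimension": ["2d", "2d", None],
    "video_definition": ["hd", "sd", "hd"],
})

print(constant_columns(toy))  # ['video_dimension']
```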

In [None]:
def histogram_unique(df, column):
    values = df[column].dropna()
    counts = values.value_counts(normalize=True) 
    
    fig = go.Figure(data=[go.Bar(
        x=counts.index,
        y=counts.values,
        marker_color='blue',
        opacity=0.6,
        text=[f'{p * 100:.2f}%' for p in counts.values], 
        textposition='outside',
    )])

    fig.update_layout(
        title=f'{column} Distribution',
        xaxis_title=column,
        yaxis_title='Proportion',
        template='plotly_white',
        xaxis=dict(tickmode='array', tickvals=counts.index),
        yaxis=dict(showgrid=True),
        showlegend=False
    )

    fig.show() 

In [None]:
histogram_unique(df, 'video_definition')

In [None]:
histogram_unique(df, 'video_licensed_content')

The `video_definition` and `video_licensed_content` columns are highly imbalanced. We suspect that the `sd` and `false` classes, respectively, have too few samples to carry useful information for the model.
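
One way to put a number on this imbalance is to look at the share of the least frequent class. The sketch below does this on toy data; `minority_share` is a hypothetical helper, and the 98/2 split is invented for illustration rather than taken from the dataset.

```python
import pandas as pd


def minority_share(series):
    """Return the label and proportion of the least frequent non-null class."""
    shares = series.dropna().value_counts(normalize=True)
    return shares.idxmin(), float(shares.min())


# Toy data: 98 "hd" rows and 2 "sd" rows
toy = pd.Series(["hd"] * 98 + ["sd"] * 2, name="video_definition")
label, share = minority_share(toy)

print(label, round(share, 2))  # sd 0.02
```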

In [None]:
histogram_unique(df, 'video_category_id')

**Note:** Because almost no videos fall into the categories `Nonprofits & Activism`, `CHK CHK (촉촉)` and `Better Voice`, it is not worth considering these categories when creating the model. Also, for simplicity during the analysis of the dataset, we will drop the NaN values of this column.
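
The notebook removes these categories by name. As an alternative sketch, rare categories could be dropped with a frequency threshold so the list does not have to be maintained by hand. The function `drop_rare_categories` and the `min_share` parameter are assumptions for illustration, and the toy counts are invented.

```python
import pandas as pd


def drop_rare_categories(frame, column, min_share=0.01):
    """Drop rows with a missing category or a category rarer than min_share."""
    frame = frame.dropna(subset=[column])
    shares = frame[column].value_counts(normalize=True)
    kept = shares[shares >= min_share].index
    return frame[frame[column].isin(kept)]


# Toy data: two common categories, one rare category and a few missing values
toy = pd.DataFrame({
    "video_category_id": ["Music"] * 60 + ["Gaming"] * 39 + ["Better Voice"] + [None] * 5
})
filtered = drop_rare_categories(toy, "video_category_id", min_share=0.05)

print(sorted(filtered["video_category_id"].unique()))  # ['Gaming', 'Music']
```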

In [None]:
removed_categories = ['Nonprofits & Activism', 'CHK CHK (촉촉)', 'Better Voice']
df = df[~df['video_category_id'].isin(removed_categories)]
df = df.dropna(subset=['video_category_id'])

For thumbnails, we need to be able to view them when necessary and detect what types of objects they contain. To do this, we'll use the YOLOv5x model pre-trained on the COCO dataset.

In [None]:
def show_thumbnails(df, n_thumbnails=12):
    html = ""
    for url in df['video_default_thumbnail'].head(n_thumbnails):
        html += f'<img src="{url}" width="120" height="90" style="margin:5px">'
    display(HTML(html))

In [None]:
show_thumbnails(df)

In [None]:
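def detect_thumbnails(df, n_thumbnails=6):
    """Run the COCO-pretrained yolov5x model on the first n_thumbnails thumbnails."""
    print('yolov5x inferences:')
    # Load the model once and reuse it for every thumbnail
    model = torch.hub.load('ultralytics/yolov5', 'yolov5x').to(device)
    for url in df['video_default_thumbnail'].head(n_thumbnails):
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # stop early if the thumbnail could not be downloaded
        img = Image.open(BytesIO(response.content))
        results = model(img)
        results.show()   # display the image with the detected bounding boxes
        results.print()  # print a summary of the detected classes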

In [None]:
detect_thumbnails(df)

#### Numeric columns

In [None]:
# Parse both timestamps as timezone-naive datetimes so they can be subtracted
df['video_published_at'] = pd.to_datetime(df['video_published_at'], errors='coerce').dt.tz_localize(None)
df['video_trending_date'] = pd.to_datetime(df['video_trending_date'], errors='coerce').dt.tz_localize(None)
df['days_until_trend'] = (df['video_trending_date'] - df['video_published_at']).dt.days

In [None]:
print(df['days_until_trend'])

In [None]:
# Check for missing values in both float64 and int64 columns
missing_values = df.select_dtypes(include=['float64', 'int64']).isnull().sum()
missing_percentage = 100 * missing_values / len(df)

# Create a DataFrame showing missing values and their percentage
missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Values': missing_values,
    'Missing Percentage': missing_percentage
})

missing_df = missing_df.reset_index(drop=True)
missing_df

In [None]:
df.describe()

**Note:** A video cannot trend before it is published, so negative values in `days_until_trend` point to unreliable records. These rows will not be taken into account.
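
To make the source of negative values concrete, here is a small sketch on invented timestamps (the two rows are not from the dataset). It assumes the publication time is a UTC timestamp and the trending date is a plain date. Note that `.dt.days` floors towards negative infinity, so a difference of minus one day and eight hours is reported as `-2`.

```python
import pandas as pd

# Toy data: the second video is listed as trending before it was published
toy = pd.DataFrame({
    "video_published_at": ["2024-09-01T10:00:00Z", "2024-09-05T08:00:00Z"],
    "video_trending_date": ["2024-09-03", "2024-09-04"],
})

# Parse as UTC, then drop the timezone so both columns are timezone-naive
published = pd.to_datetime(
    toy["video_published_at"], errors="coerce", utc=True
).dt.tz_localize(None)
trending = pd.to_datetime(toy["video_trending_date"], errors="coerce")

toy["days_until_trend"] = (trending - published).dt.days
valid = toy[toy["days_until_trend"] >= 0]

print(toy["days_until_trend"].tolist())  # [1, -2]
print(len(valid))                        # 1
```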

In [None]:
df = df[df['days_until_trend'] >= 0]

In [None]:
correlation_matrix = df.corr(numeric_only=True)
correlation_matrix

In [None]:

fig = go.Figure(data=go.Heatmap(
    z=correlation_matrix.loc[correlation_matrix.columns[::-1], correlation_matrix.columns].values, 
    x=correlation_matrix.columns,  
    y=correlation_matrix.columns[::-1],  
    colorscale='Blues',  
    colorbar=dict(title="Correlation"),  
    hoverongaps=True, 
    zmin=-1,  
    zmax=1,   
))

annotations = []
for i in range(len(correlation_matrix.columns)):
    for j in range(len(correlation_matrix.columns)):
        value = correlation_matrix.loc[correlation_matrix.columns[j], correlation_matrix.columns[i]]
        text_color = "black" if abs(value) < 0.5 else "white"  
        annotations.append(
            dict(
                x=correlation_matrix.columns[i],  
                y=correlation_matrix.columns[j], 
                text=f"{value:.2f}", 
                showarrow=False,
                font=dict(size=12, color=text_color),
                align="center"
            )
        )

fig.update_layout(
    title="Correlation Matrix",
    xaxis=dict(title="Variables", tickangle=45), 
    yaxis=dict(title="Variables"),  
    height=900,  
    width=900,   
    annotations=annotations  
)

fig.show() 

#### Time series

In [None]:
videos_per_day = df.groupby(df['video_published_at'].dt.date).size().reset_index(name='count')
videos_per_day['video_published_at'] = pd.to_datetime(videos_per_day['video_published_at'])

videos_per_day_rolling = videos_per_day['count'].rolling(window=7).mean()

fig = px.line(
    videos_per_day,
    x='video_published_at',
    y='count',
    markers=True,
    title='Number of videos uploaded per day',
    labels={'video_published_at': 'Date', 'count': 'Number of videos'}
)

fig.add_trace(
    go.Scatter(
        x=videos_per_day['video_published_at'],
        y=videos_per_day_rolling,
        mode='lines',
        name='Moving average (7 days)',
        line=dict(color='red'),
        opacity=1
    )
)

fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Number of videos',
    template='plotly_white',
    showlegend=True
)

fig.show() 

In [None]:
videos_per_category = df.groupby([df['video_category_id'], df['video_published_at'].dt.date]).size().reset_index(name='count')
videos_per_category['video_published_at'] = pd.to_datetime(videos_per_category['video_published_at'])

category_figures = []

for category in videos_per_category['video_category_id'].unique():
    category_data = videos_per_category[videos_per_category['video_category_id'] == category]
    category_rolling = category_data['count'].rolling(window=7).mean()

    fig = px.line(
        category_data,
        x='video_published_at',
        y='count',
        markers=True,
        title=f'Number of videos uploaded per day ({category})',
        labels={'video_published_at': 'Date', 'count': 'Number of videos'}
    )

    fig.add_trace(
        go.Scatter(
            x=category_data['video_published_at'],
            y=category_rolling,
            mode='lines',
            name='Moving average (7 days)',
            line=dict(color='red'),
            opacity=1
        )
    )

    fig.update_layout(
        xaxis_title='Date',
        yaxis_title='Number of videos',
        template='plotly_white',
        showlegend=True
    )

    category_figures.append(fig)

for fig in category_figures:
    fig.show()     

In [None]:
days_trend = df['days_until_trend'].nunique()
fig = px.histogram(df, x='days_until_trend', nbins=days_trend, title='Days until it becomes a trend')
fig.update_layout(
    xaxis_title='Days',
    yaxis_title='Frequency'
)

fig.show()


In [None]:
category_figures_days_until_trend = []

for category in df['video_category_id'].unique():
    if pd.isna(category):
        continue

    df_category = df[df['video_category_id'] == category]

    days_trend = df_category['days_until_trend'].nunique()

    fig = go.Figure(data=[go.Histogram(
        x=df_category['days_until_trend'],
        nbinsx=days_trend,
        marker_color='blue',
        opacity=0.6
    )])

    fig.update_layout(
        title=f'Days Until Trend for Category {category}',
        xaxis_title='Days',
        yaxis_title='Frequency',
        template='plotly_white',
        showlegend=False
    )

    category_figures_days_until_trend.append(fig)

for fig in category_figures_days_until_trend:
    fig.show()

In [None]:
df['day_of_week'] = df['video_published_at'].dt.dayofweek
day_names = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
df['day_name'] = df['day_of_week'].map(day_names)

videos_per_dayofweek = df['day_name'].value_counts().reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

fig = go.Figure(data=[go.Bar(
    x=videos_per_dayofweek.index,
    y=videos_per_dayofweek.values,
    marker_color='blue',
    opacity=0.6
)])

fig.update_layout(
    title='Number of videos uploaded per day of the week',
    xaxis_title='Day of the week',
    yaxis_title='Number of videos',
    template='plotly_white',
    xaxis=dict(tickmode='array', tickvals=videos_per_dayofweek.index),
    yaxis=dict(showgrid=True),
    showlegend=False
)

fig.show() 

In [None]:
category_figures_day = []

for category in df['video_category_id'].unique():
    if pd.isna(category):
        continue

    category_data = df[df['video_category_id'] == category]
    videos_per_dayofweek = category_data['day_name'].value_counts().reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

    fig = go.Figure(data=[go.Bar(
        x=videos_per_dayofweek.index,
        y=videos_per_dayofweek.values,
        marker_color='blue',
        opacity=0.6
    )])

    fig.update_layout(
        title=f'Number of videos uploaded per day of the week ({category})',
        xaxis_title='Day of the week',
        yaxis_title='Number of videos',
        template='plotly_white',
        xaxis=dict(tickmode='array', tickvals=videos_per_dayofweek.index),
        yaxis=dict(showgrid=True),
        showlegend=False
    )

    category_figures_day.append(fig)

for fig in category_figures_day:
    fig.show() 

In [None]:
df['hour'] = df['video_published_at'].dt.hour

heatmap_data = df.pivot_table(
    index='day_of_week',
    columns='hour',
    values='video_published_at',
    aggfunc='count',
    fill_value=0
)

day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
heatmap_data.index = [day_names[int(d)] for d in heatmap_data.index]

full_hours = list(range(24))
for h in full_hours:
    if h not in heatmap_data.columns:
        heatmap_data[h] = 0

heatmap_data = heatmap_data[sorted(heatmap_data.columns)]


fig = go.Figure(data=go.Heatmap(
    z=heatmap_data.values,
    x=heatmap_data.columns,  
    y=heatmap_data.index,    
    colorscale='Blues',
    colorbar=dict(title='Count'),
    hoverongaps=False,
    zmin=0,
    text=heatmap_data.values,
    texttemplate="%{text}",
    showscale=True
))

fig.update_layout(
    title='Number of videos published by day of the week and time',
    xaxis_title='Time of day',
    yaxis_title='Day of the week',
    height=600,
    width=2000,
    xaxis=dict(
        tickmode='array', 
        tickvals=list(range(24)), 
        ticktext=[str(i) for i in range(24)] 
    ),
    yaxis=dict(autorange="reversed")  
)

fig.show() 

In [None]:
categories = df['video_category_id'].unique()

for category in categories:
    category_df = df[df['video_category_id'] == category]

    heatmap_data = category_df.pivot_table(
        index='day_of_week',
        columns='hour',
        values='video_published_at',
        aggfunc='count',
        fill_value=0
    )

    day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    heatmap_data.index = [day_names[int(d)] for d in heatmap_data.index]

    full_hours = list(range(24))
    for h in full_hours:
        if h not in heatmap_data.columns:
            heatmap_data[h] = 0

    heatmap_data = heatmap_data[sorted(heatmap_data.columns)]

    fig = go.Figure(data=go.Heatmap(
        z=heatmap_data.values,
        x=heatmap_data.columns, 
        y=heatmap_data.index,  
        colorscale='Blues',
        colorbar=dict(title='Count'),
        hoverongaps=False,
        zmin=0,
        text=heatmap_data.values,
        texttemplate="%{text}",
        showscale=True
    ))

    fig.update_layout(
        title=f'Number of videos published by day of the week and time (Category {category})',
        xaxis_title='Time of day',
        yaxis_title='Day of the week',
        height=600,
        width=2000,
        yaxis=dict(autorange="reversed"), 
        xaxis=dict(
            tickmode='array', 
            tickvals=list(range(24)), 
            ticktext=[str(i) for i in range(24)] 
        )
    )

    fig.show() 

In [None]:
df['month'] = df['video_published_at'].dt.month

month_names = {1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May', 6: 'June', 7: 'July', 8: 'August', 9: 'September', 10: 'October', 11: 'November', 12: 'December'}
df['month_name'] = df['month'].map(month_names)

videos_per_month = df['month_name'].value_counts().reindex(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])

fig = go.Figure(data=[go.Bar(
    x=videos_per_month.index,
    y=videos_per_month.values,
    marker_color='blue',  
    opacity=0.6           
)])

fig.update_layout(
    title='Number of videos uploaded per month',
    xaxis_title='Month',
    yaxis_title='Number of videos',
    template='plotly_white',
    xaxis=dict(tickmode='array', tickvals=videos_per_month.index),
    yaxis=dict(showgrid=True),
    showlegend=False
)

fig.show() 

In [None]:
category_figures_month = []

for category in df['video_category_id'].unique():
    if pd.isna(category):
        continue

    category_data = df[df['video_category_id'] == category]
    videos_per_month = category_data['month_name'].value_counts().reindex(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])

    fig = go.Figure(data=[go.Bar(
        x=videos_per_month.index,
        y=videos_per_month.values,
        marker_color='blue',  
        opacity=0.6           
    )])

    fig.update_layout(
        title=f'Number of videos uploaded per month ({category})',
        xaxis_title='Month',
        yaxis_title='Number of videos',
        template='plotly_white',
        xaxis=dict(tickmode='array', tickvals=videos_per_month.index),
        yaxis=dict(showgrid=True),
        showlegend=False
    )

    category_figures_month.append(fig)

for fig in category_figures_month:
    fig.show() 

In [None]:
df['month'] = df['video_published_at'].dt.month

heatmap_data_month = df.pivot_table(
    index='month', 
    columns='hour',  
    values='video_published_at',  
    aggfunc='count',  
    fill_value=0
)

month_names = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
heatmap_data_month.index = heatmap_data_month.index.map(lambda x: month_names[x-1])

full_hours = list(range(24))
for h in full_hours:
    if h not in heatmap_data_month.columns:
        heatmap_data_month[h] = 0

heatmap_data_month = heatmap_data_month[sorted(heatmap_data_month.columns)]

fig = go.Figure(data=go.Heatmap(
    z=heatmap_data_month.values,
    x=heatmap_data_month.columns,  
    y=heatmap_data_month.index,   
    colorscale='Blues',
    colorbar=dict(title='Count'),
    hoverongaps=False,
    zmin=0,
    text=heatmap_data_month.values,
    texttemplate="%{text}",
    showscale=True
))

fig.update_layout(
    title='Number of videos published per month and hour',
    xaxis_title='Time of day (Hour)',
    yaxis_title='Month',
    height=600,
    width=2000,
    xaxis=dict(
        tickmode='array', 
        tickvals=list(range(24)),  
        ticktext=[str(i) for i in range(24)]  
    ),
    yaxis=dict(
        tickmode='array',
        tickvals=heatmap_data_month.index,
        ticktext=list(heatmap_data_month.index)  # label only the months present in the data
    )
)

fig.show() 

In [None]:

df['hour'] = df['video_published_at'].dt.hour
df['month'] = df['video_published_at'].dt.month

month_names = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
categories = df['video_category_id'].unique()

for category in categories:
    category_data = df[df['video_category_id'] == category]
    
    heatmap_data_month = category_data.pivot_table(
        index='month', 
        columns='hour', 
        values='video_published_at', 
        aggfunc='count', 
        fill_value=0
    )

    heatmap_data_month.index = heatmap_data_month.index.map(lambda x: month_names[x-1])
   
    full_hours = list(range(24))
    for h in full_hours:
        if h not in heatmap_data_month.columns:
            heatmap_data_month[h] = 0
 
    heatmap_data_month = heatmap_data_month[sorted(heatmap_data_month.columns)]

    fig = go.Figure(data=go.Heatmap(
        z=heatmap_data_month.values,
        x=heatmap_data_month.columns,  
        y=heatmap_data_month.index,    
        colorscale='Blues',
        colorbar=dict(title='Count'),
        hoverongaps=False,
        zmin=0,
        text=heatmap_data_month.values,
        texttemplate="%{text}",
        showscale=True
    ))

    fig.update_layout(
        title=f'Number of videos published per month and hour for Category {category}',
        xaxis_title='Time of day (Hour)',
        yaxis_title='Month',
        height=600,
        width=2000,
        xaxis=dict(
            tickmode='array', 
            tickvals=list(range(24)),  
            ticktext=[str(i) for i in range(24)]  
        ),
        yaxis=dict(
            tickmode='array',
            tickvals=heatmap_data_month.index,
            ticktext=list(heatmap_data_month.index)  # label only the months present in the data
        )
    )

    fig.show() 

**Note:** The dataframe contains data from September 2024 to the present, which explains why some months have no records.
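
A quick way to confirm which months have no records is to compare the months present in the date column against the full calendar. The sketch below uses invented dates, and `missing_months` is a hypothetical helper rather than part of the project.

```python
import pandas as pd


def missing_months(dates):
    """Return the calendar month numbers (1-12) with no records in `dates`."""
    present = set(pd.to_datetime(dates).dt.month.dropna().astype(int))
    return sorted(set(range(1, 13)) - present)


# Toy data: publication dates from September 2024 to January 2025
toy = pd.Series(["2024-09-15", "2024-10-01", "2024-12-24", "2025-01-03"])

print(missing_months(toy))  # [2, 3, 4, 5, 6, 7, 8, 11]
```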

In [None]:
#pio.renderers.default = 'iframe_connected' 