YouTube is often described as the world's second-largest search engine, and with over 2 billion logged-in monthly users it generates a wealth of data. Understanding YouTube analytics can give content creators and marketers useful insight into how their videos perform. Let's dive into a dataset of daily per-video metrics and see what stories it tells us.

<div style="text-align:center; border-radius:15px; padding:15px; color:#FFC0CB; margin:0; font-size:150%; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden"><b>Table of contents</b></div>

1. Data Loading and Overview
2. Data Cleaning and Preprocessing
3. Exploratory Data Analysis (EDA)
4. Correlation Analysis
5. Predictive Modeling
6. Conclusion and Future Work

<div style="text-align:center; border-radius:15px; padding:15px; color:#FFC0CB; margin:0; font-size:150%; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden"><b>1. Data Loading and Overview</b></div>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Load the dataset
file_path = '/kaggle/input/200k-youtube-channel-analytics/all_youtube_analytics.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe
df.head()

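|  | video_id | day | views | redViews | comments | likes | dislikes | videosAddedToPlaylists | videosRemovedFromPlaylists | shares | ... | annotationClicks | annotationCloses | cardClickRate | cardTeaserClickRate | cardImpressions | cardTeaserImpressions | cardClicks | cardTeaserClicks | subscribersGained | subscribersLost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | YuQaT52VEwo | 2019-09-06 | 8.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | YuQaT52VEwo | 2019-09-07 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | SfTEVOQP-Hk | 2019-09-07 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | YuQaT52VEwo | 2019-09-08 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | SfTEVOQP-Hk | 2019-09-08 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |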
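
A quick structural overview can complement `df.head()`. The sketch below is illustrative only: it builds a tiny in-memory CSV from a few columns of the preview rows above instead of reading the real file, and the helper name `overview` is hypothetical rather than part of the notebook.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV, using values from the preview rows
sample_csv = io.StringIO(
    "video_id,day,views,likes\n"
    "YuQaT52VEwo,2019-09-06,8.0,1.0\n"
    "YuQaT52VEwo,2019-09-07,7.0,0.0\n"
    "SfTEVOQP-Hk,2019-09-07,6.0,0.0\n"
    "YuQaT52VEwo,2019-09-08,4.0,0.0\n"
    "SfTEVOQP-Hk,2019-09-08,2.0,0.0\n"
)

def overview(frame: pd.DataFrame) -> dict:
    """Summarize the size, video count, and date range of a daily-metrics frame."""
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "unique_videos": frame["video_id"].nunique(),
        "first_day": frame["day"].min(),
        "last_day": frame["day"].max(),
    }

# parse_dates converts 'day' to datetime while loading
sample = pd.read_csv(sample_csv, parse_dates=["day"])
summary = overview(sample)
print(summary)
```

Calling the same helper on the full `df` would show how many distinct videos and days the dataset covers.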


<div style="text-align:center; border-radius:15px; padding:15px; color:#FFC0CB; margin:0; font-size:150%; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden"><b>2. Data Cleaning and Preprocessing</b></div>

In [2]:
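# Check for missing values (printed explicitly, because a notebook cell
# only displays the value of its last expression)
print(df.isnull().sum())

# Convert 'day' column to datetime
df['day'] = pd.to_datetime(df['day'])

# Summary statistics
df.describe()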

|  | day | views | redViews | comments | likes | dislikes | videosAddedToPlaylists | videosRemovedFromPlaylists | shares | estimatedMinutesWatched | ... | annotationClicks | annotationCloses | cardClickRate | cardTeaserClickRate | cardImpressions | cardTeaserImpressions | cardClicks | cardTeaserClicks | subscribersGained | subscribersLost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 234889 | 234889.0 | 234889.0 | 234889.0 | 234889.0 | 234889.0 | 234889.0 | 234889.0 | 234889.0 | 234889.0 | ... | 234889.0 | 234889.0 | 234889.0 | 234889.0 | 234889.0 | 234889.0 | 234889.0 | 234889.0 | 234889.0 | 234889.0 |
| mean | 2022-11-17 13:28:29.122181120 | 88.842121 | 18.717326 | 0.039419 | 0.969816 | 0.032215 | 1.262835 | 0.228908 | 0.337198 | 3466.270749 | ... | 0.0 | 0.0 | 0.000429 | 0.000178 | 0.040185 | 10.419104 | 0.00215 | 0.017587 | 0.167173 | 0.004743 |
| min | 2019-09-06 00:00:00 | 0.0 | 0.0 | 0.0 | -11.0 | -19.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2021-11-20 00:00:00 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2022-12-24 00:00:00 | 8.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 168.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 2023-12-28 00:00:00 | 35.0 | 8.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1047.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 2024-11-10 00:00:00 | 8818.0 | 2658.0 | 24.0 | 206.0 | 11.0 | 2678.0 | 2647.0 | 251.0 | 285103.0 | ... | 0.0 | 0.0 | 1.25 | 7.0 | 60.0 | 5894.0 | 8.0 | 111.0 | 31.0 | 9.0 |
| std |  | 331.280375 | 78.2934 | 0.355816 | 3.984013 | 0.241694 | 7.535244 | 5.70264 | 1.552007 | 12548.191609 | ... | 0.0 | 0.0 | 0.013099 | 0.016315 | 0.617346 | 90.929272 | 0.061275 | 0.364308 | 0.826207 | 0.079772 |
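
The summary statistics show minimum values of -11 for `likes` and -19 for `dislikes`. This suggests these columns may hold daily net changes rather than raw counts, though that is an interpretation and not something documented here. A minimal sketch of how such rows might be counted, using a small made-up frame in place of `df`:

```python
import pandas as pd

# Made-up stand-in for df; the values are illustrative only
sample = pd.DataFrame({
    "views": [8.0, 7.0, 6.0, 4.0],
    "likes": [1.0, -2.0, 0.0, 3.0],
    "dislikes": [0.0, 0.0, -1.0, -1.0],
})

# Count negative entries in every numeric column
numeric = sample.select_dtypes(include="number")
negative_counts = (numeric < 0).sum()

# Report only the columns that actually contain negatives
print(negative_counts[negative_counts > 0])
```

Whether to keep, clip, or drop such rows depends on how the metric is defined, so the check only reports them.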


<div style="text-align:center; border-radius:15px; padding:15px; color:#FFC0CB; margin:0; font-size:150%; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden"><b>3. Exploratory Data Analysis (EDA)</b></div>

In [3]:
# Distribution of views
plt.figure(figsize=(10, 6))
sns.histplot(df['views'], bins=50, kde=True, color='blue')
plt.title('Distribution of Views')
plt.xlabel('Views')
plt.ylabel('Frequency')
plt.show()
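
The summary statistics point to a heavily right-skewed distribution (median of 8 views, mean of about 89, maximum of 8818), so a plain histogram tends to pile almost every row into the first few bins. One common remedy is a `log1p` transform before plotting. The sketch below uses synthetic log-normal data as a stand-in for `df['views']`; the distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for df['views'] (assumed parameters)
rng = np.random.default_rng(42)
views = pd.Series(np.floor(rng.lognormal(mean=2.0, sigma=1.5, size=5000)))

# log1p copes with zero-view days, since log1p(0) == 0
log_views = np.log1p(views)

skew_raw = views.skew()
skew_log = log_views.skew()
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# In the notebook, the transformed column could be plotted with
# sns.histplot(np.log1p(df['views']), bins=50)
```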

In [4]:
# Scatter plot of likes vs dislikes
plt.figure(figsize=(10, 6))
sns.scatterplot(x='likes', y='dislikes', data=df, alpha=0.5)
plt.title('Likes vs Dislikes')
plt.xlabel('Likes')
plt.ylabel('Dislikes')
plt.show()
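
Because each row is one video on one day, and the 75th percentile of both `likes` and `dislikes` is 0, most points in a daily scatter plot sit at the origin. Aggregating to one row per video can make the relationship easier to read. A minimal sketch, with a made-up frame standing in for `df`:

```python
import pandas as pd

# Made-up daily rows for two hypothetical videos
sample = pd.DataFrame({
    "video_id": ["a", "a", "b", "b", "b"],
    "likes": [1.0, 0.0, 3.0, 2.0, 0.0],
    "dislikes": [0.0, 1.0, 0.0, 0.0, 1.0],
})

# Sum the daily values so that each video becomes a single row
per_video = sample.groupby("video_id")[["likes", "dislikes"]].sum()
print(per_video)

# In the notebook, this frame could be passed to
# sns.scatterplot(x='likes', y='dislikes', data=per_video)
```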

<div style="text-align:center; border-radius:15px; padding:15px; color:#FFC0CB; margin:0; font-size:150%; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden"><b>4. Correlation Analysis</b></div>

In [5]:
# Correlation heatmap
numeric_df = df.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 10))
sns.heatmap(numeric_df.corr(), cmap='viridis', annot=True, fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()
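
Columns with zero variance, such as `annotationClicks` and `annotationCloses` in the summary statistics (maximum and standard deviation both 0), have undefined correlations and appear as empty cells in a heatmap. The sketch below drops constant columns and then ranks the remaining ones by their correlation with `views`. The data is made up for illustration.

```python
import pandas as pd

# Made-up stand-in for numeric_df, including one constant column
sample = pd.DataFrame({
    "views": [8.0, 7.0, 6.0, 4.0, 2.0],
    "likes": [4.0, 3.0, 3.0, 2.0, 1.0],
    "shares": [0.0, 1.0, 0.0, 1.0, 0.0],
    "annotationClicks": [0.0, 0.0, 0.0, 0.0, 0.0],
})

# A column with a single distinct value carries no correlation information
constant_cols = [c for c in sample.columns if sample[c].nunique() <= 1]
informative = sample.drop(columns=constant_cols)

# Rank the remaining columns by absolute correlation with the target
corr_with_views = (
    informative.corr()["views"]
    .drop("views")
    .sort_values(key=abs, ascending=False)
)
print(constant_cols)
print(corr_with_views)
```

A ranked list like this is often easier to scan than a fully annotated heatmap when there are dozens of columns.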

<div style="text-align:center; border-radius:15px; padding:15px; color:#FFC0CB; margin:0; font-size:150%; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden"><b>5. Predictive Modeling</b></div>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Define features and target
features = numeric_df.drop(columns=['views'])
target = numeric_df['views']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Train a Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
rmse
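
An RMSE value is easier to interpret next to a naive baseline. Because the rows are daily observations, a chronological split is one alternative to a random one, since it never uses a later day to predict an earlier one. The sketch below is a hypothetical illustration on synthetic data, not the notebook's model: the column names `lagged_views` and `likes`, the 80% cutoff, and the data-generating formula are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic daily data (assumed relationship between features and views)
rng = np.random.default_rng(0)
n = 600
data = pd.DataFrame({
    "day": pd.date_range("2023-01-01", periods=n, freq="D"),
    "lagged_views": rng.poisson(20, size=n).astype(float),
    "likes": rng.poisson(2, size=n).astype(float),
})
data["views"] = (
    0.8 * data["lagged_views"] + 5 * data["likes"] + rng.normal(0, 2, size=n)
)

# Chronological split: first 80% of days for training, the rest for testing
data = data.sort_values("day")
cutoff = int(len(data) * 0.8)
train, test = data.iloc[:cutoff], data.iloc[cutoff:]
feature_cols = ["lagged_views", "likes"]

def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Baseline that always predicts the mean of the training target
baseline = DummyRegressor(strategy="mean")
baseline.fit(train[feature_cols], train["views"])

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(train[feature_cols], train["views"])

rmse_baseline = rmse(test["views"], baseline.predict(test[feature_cols]))
rmse_forest = rmse(test["views"], forest.predict(test[feature_cols]))
print(f"baseline RMSE: {rmse_baseline:.2f}, forest RMSE: {rmse_forest:.2f}")
```

If a model does not clearly beat the mean baseline, its features carry little usable signal for the target.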

<div style="text-align:center; border-radius:15px; padding:15px; color:#FFC0CB; margin:0; font-size:150%; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden"><b>6. Conclusion and Future Work</b></div>

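In this notebook, we loaded a YouTube analytics dataset of daily per-video metrics, explored the distributions of views, likes, and dislikes, examined correlations among the numeric columns, and set up a Random Forest model to predict daily views, evaluated by RMSE on a held-out 20% test split. The modeling cell has no recorded output here, so the RMSE still needs to be computed and compared against a simple baseline before drawing conclusions about accuracy. Note also that the feature set includes metrics closely related to views, such as `redViews` and `estimatedMinutesWatched`, so a low error could partly reflect that overlap. Future work could involve feature engineering, trying different algorithms, or incorporating external data sources to improve the model's performance. If you found this notebook insightful, consider giving it an upvote.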

## Credits
This notebook was created with the help of [Devra AI data science assistant](https://devra.ai/ref/kaggle)