# EDA: Global YouTube Statistics
This notebook provides a reproducible exploratory data analysis for the dataset `Global YouTube Statistics.csv`.
Place this notebook in `analytics_project/notebooks/` and run cells in order.

## 1) Environment Setup & Imports

In [None]:
import sys
import os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 200)
sns.set_theme()  # apply seaborn's default plot styling

## 2) Locate and Load CSV file

In [None]:
# Locate the CSV in the workspace root, two levels above the notebook directory
# (assumes the kernel's working directory is analytics_project/notebooks/)
nb_dir = Path.cwd()
workspace_root = nb_dir.parents[1]
default_csv = workspace_root / 'Global YouTube Statistics.csv'
csv_candidates = sorted(workspace_root.glob('*.csv'))
print('CSV candidates:', [p.name for p in csv_candidates[:10]])
# Prefer the expected file name; otherwise fall back to the first CSV found
csv_path = default_csv if default_csv.exists() or not csv_candidates else csv_candidates[0]
print('Using CSV:', csv_path)
# Load the CSV, falling back to latin-1 if the default UTF-8 decoding fails
try:
    df = pd.read_csv(csv_path)
except UnicodeDecodeError as e:
    print('UTF-8 read failed, retrying with latin-1:', e)
    df = pd.read_csv(csv_path, encoding='latin-1')
df.shape

## 3) Data Snapshot and Metadata

In [None]:
df.head(10)

In [None]:
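df.info()
# Only the last expression in a cell is rendered, so display each table explicitly
display(df.describe(include='all').T)
display(df.isnull().sum().sort_values(ascending=False).head(20))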
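
Raw null counts are easier to compare across columns when shown next to a percentage. The helper below is a minimal sketch of that idea; `missing_summary` is a hypothetical name introduced here, not part of pandas or of this project, and the demo frame contains made-up values.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing-value counts and percentages, worst columns first."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        'n_missing': n_missing,
        'pct_missing': (n_missing / max(len(df), 1) * 100).round(2),
        'dtype': df.dtypes.astype(str),
    })
    # Keep only the columns that actually have gaps
    return summary[summary['n_missing'] > 0].sort_values('n_missing', ascending=False)


# Tiny illustrative frame (made-up values, not taken from the dataset)
demo = pd.DataFrame({
    'a': [1, None, 3, None],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})
demo_summary = missing_summary(demo)
```

In the notebook, `missing_summary(df).head(20)` could stand in for the plain null-count listing.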

## 4) Data Cleaning & Preprocessing (light)

The function below performs only light cleaning (trimming column names and coercing two count columns to numeric); extend it as needed for a full cleaning pass.

In [None]:
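def basic_cleaning(df):
    """Return a lightly cleaned copy of df; the input frame is left unchanged."""
    df = df.copy()
    # Strip stray whitespace from column names
    df.columns = [c.strip() for c in df.columns]
    # Coerce counts to numeric; values that cannot be parsed become NaN
    if 'subscribers' in df.columns:
        df['subscribers'] = pd.to_numeric(df['subscribers'], errors='coerce')
    # Store views under a snake_case name; the original 'video views' column is kept
    if 'video views' in df.columns:
        df['video_views'] = pd.to_numeric(df['video views'], errors='coerce')
    return df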

df = basic_cleaning(df)
df.shape
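
One way to extend the light cleaning is to convert every text column that mostly holds numbers, instead of naming columns one by one. The sketch below assumes a column is safe to convert when at least 90% of its non-null values parse; `coerce_numeric_columns` and `min_parse_rate` are hypothetical names, and the demo values are invented for illustration.

```python
import pandas as pd


def coerce_numeric_columns(df: pd.DataFrame, min_parse_rate: float = 0.9) -> pd.DataFrame:
    """Convert non-numeric columns to numeric when most non-null values parse as numbers."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue
        non_null = out[col].dropna()
        if non_null.empty:
            continue
        # Drop thousands separators and surrounding whitespace before parsing
        cleaned = out[col].map(
            lambda v: v.replace(',', '').strip() if isinstance(v, str) else v
        )
        parsed = pd.to_numeric(cleaned, errors='coerce')
        parse_rate = parsed.notna().sum() / len(non_null)
        if parse_rate >= min_parse_rate:
            out[col] = parsed
    return out


# Made-up demo values; only 'subscribers' clears the default 90% threshold
demo = pd.DataFrame({
    'subscribers': ['1,000', '2500', ' 300 ', None],
    'category': ['Music', 'Gaming', 'Music', 'News'],
    'uploads': ['10', 'n/a', '30', '40'],
})
converted = coerce_numeric_columns(demo)
```

The threshold is a judgment call: a lower value converts more columns but silently turns more stray text into NaN, so it is worth checking the missing-value counts again after running it.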

## 5) Exploratory Data Analysis (EDA)

In [None]:
# Top categories by count
top_cats = df['category'].fillna('Unknown').value_counts().head(20)
top_cats

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x=top_cats.values, y=top_cats.index)
plt.title('Top categories in dataset')
plt.xlabel('Count')
plt.ylabel('Category')
plt.tight_layout()
# Save figures under analytics_project/outputs/, creating the directory if needed
out_dir = workspace_root / 'analytics_project' / 'outputs'
out_dir.mkdir(parents=True, exist_ok=True)
plt.savefig(out_dir / 'top_categories.png', dpi=150)
plt.show()

In [None]:
# Subscribers distribution (numeric)
if 'subscribers' in df.columns:
    plt.figure(figsize=(8,5))
    sns.histplot(pd.to_numeric(df['subscribers'], errors='coerce').dropna(), bins=80)
    plt.title('Subscribers distribution')
    plt.tight_layout()
    plt.savefig(out_dir / 'subscribers_dist.png')
    plt.show()

## 6) Visualizations (saved to disk)

This section is a placeholder for additional visualizations: a correlation heatmap, the top channels by subscribers, and a scatter plot of subscribers against video views.
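
The three plots named above could be sketched roughly as follows. This is an illustration under assumptions, not the notebook's final implementation: `save_extra_plots` is a hypothetical helper, `'Youtuber'` is an assumed name for the channel-name column (pass `name_col` to change it), the output file names are invented, and `video_views` is the column created by `basic_cleaning`. The demo frame uses made-up values and writes to a temporary directory.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def save_extra_plots(df, out_dir, name_col='Youtuber', top_n=10):
    """Save up to three plots to out_dir and return the paths that were written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []

    def _save(fig, filename):
        fig.tight_layout()
        path = out_dir / filename
        fig.savefig(path)
        plt.close(fig)  # drop this line to also render the figure inline
        written.append(path)

    # 1) Correlation heatmap over numeric columns (needs at least two)
    numeric = df.select_dtypes(include='number')
    if numeric.shape[1] >= 2:
        fig, ax = plt.subplots(figsize=(8, 6))
        sns.heatmap(numeric.corr(), cmap='coolwarm', vmin=-1, vmax=1, ax=ax)
        ax.set_title('Correlation between numeric columns')
        _save(fig, 'correlation_heatmap.png')

    # 2) Top channels by subscribers (name_col is an assumed column name)
    if {name_col, 'subscribers'} <= set(df.columns):
        subs = pd.to_numeric(df['subscribers'], errors='coerce')
        top = df.assign(subscribers=subs).nlargest(top_n, 'subscribers')
        fig, ax = plt.subplots(figsize=(8, 6))
        sns.barplot(x=top['subscribers'].to_numpy(),
                    y=top[name_col].astype(str).to_numpy(), ax=ax)
        ax.set_title(f'Top {len(top)} channels by subscribers')
        ax.set_xlabel('Subscribers')
        _save(fig, 'top_channels.png')

    # 3) Subscribers vs video views
    if {'subscribers', 'video_views'} <= set(df.columns):
        fig, ax = plt.subplots(figsize=(8, 6))
        sns.scatterplot(data=df, x='subscribers', y='video_views', alpha=0.6, ax=ax)
        ax.set_title('Subscribers vs video views')
        _save(fig, 'subscribers_vs_views.png')

    return written


# Demo on a tiny made-up frame, written to a temporary directory
demo = pd.DataFrame({
    'Youtuber': ['A', 'B', 'C', 'D'],
    'subscribers': [100, 80, 60, 40],
    'video_views': [1000, 900, 400, 500],
})
demo_paths = save_extra_plots(demo, Path(tempfile.mkdtemp()))
```

In the notebook, the call would be `save_extra_plots(df, out_dir)`, reusing the `out_dir` defined in section 5. Each plot is skipped when its required columns are missing, so the function is safe to run before the cleaning step is finalized.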

## 7+) Modeling, Evaluation, Tests (placeholders)
The remaining sections are placeholders for building a pipeline, training models, cross-validation, hyperparameter search, saving models with joblib, and writing pytest-based tests for the pipeline. For a reproducible single-run workflow, move that code into scripts under `analytics_project/src/` and add small unit tests under `analytics_project/tests/`.
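
As a starting point for the unit tests, a pytest-style test of the cleaning step might look like the sketch below. It assumes the test lives in a file such as `analytics_project/tests/test_cleaning.py` (a hypothetical file name). The function is copied inline from section 4 only so the example runs on its own; a real test would import it from a module under `analytics_project/src/`.

```python
import pandas as pd


def basic_cleaning(df):
    # Compact copy of the section 4 function so this sketch is self-contained
    df = df.copy()
    df.columns = [c.strip() for c in df.columns]
    if 'subscribers' in df.columns:
        df['subscribers'] = pd.to_numeric(df['subscribers'], errors='coerce')
    if 'video views' in df.columns:
        df['video_views'] = pd.to_numeric(df['video views'], errors='coerce')
    return df


def test_basic_cleaning_coerces_counts():
    # Made-up input with a padded column name and one unparseable value
    raw = pd.DataFrame({
        ' subscribers ': ['1000', 'not a number'],
        'video views': ['5000', '7000'],
    })
    cleaned = basic_cleaning(raw)
    # Column names are stripped and the snake_case views column is added
    assert 'subscribers' in cleaned.columns
    assert 'video_views' in cleaned.columns
    # Unparseable values become NaN instead of raising
    assert cleaned['subscribers'].iloc[0] == 1000
    assert pd.isna(cleaned['subscribers'].iloc[1])
    assert cleaned['video_views'].tolist() == [5000, 7000]
    # The input frame is left untouched
    assert list(raw.columns) == [' subscribers ', 'video views']
```

pytest collects functions whose names start with `test_` from files matching `test_*.py`, so running `pytest analytics_project/tests/` would pick this up once the file is in place.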

## How to proceed
1. Install requirements from `analytics_project/requirements.txt`.
2. Run the notebook interactively and edit cleaning/feature-engineering steps as required.
3. Use `analytics_project/src/eda.py` to run scripted EDA and save outputs to `analytics_project/outputs/`.