# DistroKid Raw Data Exploration

This notebook explores and prepares raw DistroKid HTML data for cleaning and analysis.

**Goals:**
- Explore the structure of messy HTML exports from DistroKid
- Extract streaming data from tables and nested elements
- Identify data quality issues
- Prototype a cleaning pipeline for accurate CSV export


## 1. Load and Preview Raw HTML Files
- Load sample HTML files from `../raw/distrokid/streams/`
- Display the raw HTML and preview the file structure


In [ ]:
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup

# List available HTML files
html_dir = Path('../raw/distrokid/streams')
html_files = sorted(html_dir.glob('*.html'))  # sorted so the sample file is deterministic
print('Found HTML files:', html_files)

# Load a sample file
sample_file = html_files[0] if html_files else None
if sample_file:
    with open(sample_file, 'r', encoding='utf-8') as f:
        html = f.read()
    print(html[:1000])  # Preview first 1000 characters
else:
    print('No HTML files found.')


## 2. Explore HTML Structure
- Inspect the DOM tree and locate tables
- Identify headers, nested elements, and quirks


In [ ]:
if sample_file:
    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table')
    print(f'Found {len(tables)} tables.')
    for i, table in enumerate(tables):
        print(f'--- Table {i+1} ---')
        print(table.prettify()[:1000])  # Preview
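
To get an overview of the DOM before reading the tables themselves, a compact tag outline can help. The following is a minimal sketch, not part of the original notebook: the helper name `outline`, its `max_depth` parameter, and the demo markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Tag

def outline(node, depth=0, max_depth=3):
    """Return indented tag names (with CSS classes) for the tree under `node`."""
    lines = []
    for child in node.children:
        if not isinstance(child, Tag):
            continue  # skip text nodes and comments
        classes = child.get('class')
        label = child.name + ('.' + '.'.join(classes) if classes else '')
        lines.append('  ' * depth + label)
        if depth + 1 < max_depth:
            lines.extend(outline(child, depth + 1, max_depth))
    return lines

# Hypothetical markup for illustration only; not taken from a real DistroKid export
demo_soup = BeautifulSoup(
    '<div class="report"><table><tr><td>1</td></tr></table></div>',
    'html.parser',
)
demo_outline = outline(demo_soup)
print('\n'.join(demo_outline))
```

In the notebook, `outline(soup)` could be called on the parsed sample file; raising `max_depth` shows deeper nesting at the cost of longer output.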


## 3. Extract Streaming Data
- Parse tables to extract stream counts, dates, track info
- Handle inconsistencies: missing headers, merged cells, nested tables


In [ ]:
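# Example: Extract rows from the first table
from io import StringIO

if sample_file and tables:
    try:
        # Wrap the markup in StringIO: recent pandas versions deprecate
        # passing a literal HTML string to read_html
        df = pd.read_html(StringIO(str(tables[0])))[0]
        print(df.head())
        # TODO: Normalize column names and handle quirks
    except Exception as e:
        print('Error parsing table:', e)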
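
When `pd.read_html` struggles with merged cells or nested tables, rows can be walked manually with BeautifulSoup. This is a sketch under assumptions: the helper name `table_rows` and the demo markup are hypothetical, and it assumes every `colspan` attribute holds a plain integer.

```python
from bs4 import BeautifulSoup

def table_rows(table):
    """Return the table's own rows as lists of cell text, skipping rows of nested tables."""
    rows = []
    for tr in table.find_all('tr'):
        # Skip rows that belong to a table nested inside this one
        if tr.find_parent('table') is not table:
            continue
        cells = []
        for cell in tr.find_all(['th', 'td'], recursive=False):
            text = cell.get_text(' ', strip=True)
            # Repeat the value across merged columns so every row keeps the same width
            span = int(cell.get('colspan', 1))
            cells.extend([text] * span)
        rows.append(cells)
    return rows

# Hypothetical markup for illustration only; not taken from a real DistroKid export
demo_html = """
<table>
  <tr><th>Date</th><th>Track</th><th>Streams</th></tr>
  <tr><td>2024-01-01</td><td>Song A</td><td>1,234</td></tr>
  <tr><td colspan="2">Total</td><td>1,234</td></tr>
</table>
"""
demo_table = BeautifulSoup(demo_html, 'html.parser').find('table')
demo_rows = table_rows(demo_table)
print(demo_rows)
```

The resulting list of lists can be passed to `pd.DataFrame(demo_rows[1:], columns=demo_rows[0])` once the header row has been identified. Note that text inside a nested table still appears in its parent cell's text; only the nested rows are skipped.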


## 4. Data Quality Checks
- Identify missing/duplicate/ambiguous values
- Visualize distributions (e.g., stream counts per day/track)


In [ ]:
# Example: Visualize stream count distribution
import matplotlib.pyplot as plt
if sample_file and 'df' in locals():
    if 'Streams' in df.columns:
        df['Streams'] = pd.to_numeric(df['Streams'], errors='coerce')
        df['Streams'].hist(bins=30)
        plt.title('Distribution of Stream Counts')
        plt.xlabel('Streams')
        plt.ylabel('Frequency')
        plt.show()
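
Missing and duplicate values can be summarized before plotting anything. The sketch below is illustrative: `quality_report`, the `key_cols` argument, and the `Date`/`Track` columns are assumptions, since the real column names depend on what the export contains.

```python
import pandas as pd

def quality_report(frame, key_cols):
    """Summarize row count, missing values per column, and duplicated keys."""
    return {
        'rows': len(frame),
        'missing_per_column': frame.isna().sum().to_dict(),
        # Rows whose key columns repeat an earlier row
        'duplicate_keys': int(frame.duplicated(subset=key_cols).sum()),
    }

# Hypothetical data for illustration only
demo = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'Track': ['Song A', 'Song A', 'Song B'],
    'Streams': [10, 10, None],
})
report = quality_report(demo, key_cols=['Date', 'Track'])
print(report)
```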


## 5. Cleaning Pipeline Prototype
- Develop functions to normalize and flatten the data
- Output a well-structured DataFrame
- Export to CSV


In [ ]:
# TODO: Write cleaning functions
# def clean_distrokid_df(df):
#     ...
# cleaned = clean_distrokid_df(df)
# cleaned.to_csv('../curated/distrokid_streams_clean.csv', index=False)
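
One possible starting point for `clean_distrokid_df` is sketched below. It assumes a `Streams` column whose counts may contain thousands separators; neither assumption has been checked against a real export, and the demo data is hypothetical.

```python
import pandas as pd

def clean_distrokid_df(df):
    """Sketch: normalize column names, parse stream counts, drop exact duplicates."""
    cleaned = df.copy()
    # snake_case column names, e.g. ' Streams ' -> 'streams'
    cleaned.columns = [str(c).strip().lower().replace(' ', '_') for c in cleaned.columns]
    if 'streams' in cleaned.columns:
        # Remove thousands separators, then coerce unparseable values to NaN
        as_text = cleaned['streams'].astype(str).str.replace(',', '', regex=False)
        cleaned['streams'] = pd.to_numeric(as_text, errors='coerce')
    return cleaned.drop_duplicates().reset_index(drop=True)

# Hypothetical data for illustration only
demo_raw = pd.DataFrame({
    'Track Name': ['Song A', 'Song A', 'Song B'],
    ' Streams ': ['1,234', '1,234', 'n/a'],
})
demo_clean = clean_distrokid_df(demo_raw)
print(demo_clean)
```

The function returns a new DataFrame and leaves its input untouched, so the raw data stays available for comparison while the cleaning rules are still changing.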


## 6. Documentation & Comments
- Use markdown and code comments to explain transformation logic
- Summarize findings and next steps
