# EDA Starter (Data Analytics)

This notebook is a small, practical walkthrough of **exploratory data analysis (EDA)**.

You can run it as-is: it generates a tiny sample dataset if you don’t have a CSV handy.

## 1) Setup

We’ll try to use `pandas`. If it’s not installed, the notebook still runs: the sample dataset is built as a plain list of dicts, and the pandas-only steps are skipped.

In [None]:
from __future__ import annotations

from datetime import date, timedelta
import random

try:
    import pandas as pd
    HAS_PANDAS = True
except Exception:
    pd = None
    HAS_PANDAS = False

HAS_PANDAS

## 2) Create (or load) a dataset

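When you have a real dataset, replace the generator cell below with `df = pd.read_csv('your_file.csv')`. Later cells expect a DataFrame named `df` with the columns `date`, `channel`, `visits`, `signups`, and `revenue`, so adjust the column names if yours differ.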

In [None]:
random.seed(7)

start = date(2026, 1, 1)
rows = []
for i in range(60):
    day = start + timedelta(days=i)
    channel = random.choice(['organic', 'ads', 'referral'])
    visits = random.randint(50, 250) + (30 if channel == 'ads' else 0)
    signups = max(0, int(visits * random.uniform(0.03, 0.12)))
    revenue = round(signups * random.uniform(8.0, 35.0), 2)
    rows.append({'date': day.isoformat(), 'channel': channel, 'visits': visits, 'signups': signups, 'revenue': revenue})

if HAS_PANDAS:
    df = pd.DataFrame(rows)
    # A bare expression inside if/else is not auto-rendered, so show it explicitly.
    display(df.head())
else:
    print(rows[:3])
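If pandas is unavailable, a CSV can still be loaded into the same list-of-dicts shape with the standard library. The following is a minimal sketch, not part of the original notebook: `load_rows` and `SAMPLE_CSV` are hypothetical names, and the CSV text is made-up data that assumes the same five columns as the generated dataset. With pandas installed, `pd.read_csv('your_file.csv', parse_dates=['date'])` is the simpler route.

```python
import csv
import io

# Hypothetical CSV text standing in for 'your_file.csv' (values are made up).
SAMPLE_CSV = """date,channel,visits,signups,revenue
2026-01-01,organic,120,6,84.50
2026-01-02,ads,210,15,310.00
"""

def load_rows(handle):
    """Read CSV records into dicts, casting the numeric columns from str."""
    loaded = []
    for record in csv.DictReader(handle):
        loaded.append({
            'date': record['date'],
            'channel': record['channel'],
            'visits': int(record['visits']),
            'signups': int(record['signups']),
            'revenue': float(record['revenue']),
        })
    return loaded

loaded_rows = load_rows(io.StringIO(SAMPLE_CSV))
print(loaded_rows[0])
```

For a file on disk, the same function can be called as `with open('your_file.csv', newline='') as f: loaded_rows = load_rows(f)`.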

## 3) Quick checks

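Check the shape, column dtypes, missing-value counts, and duplicate rows before any analysis. Note that `date` is still a string column at this point.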

In [None]:
if not HAS_PANDAS:
    print('Install pandas for the full EDA workflow: pip install pandas')
else:
    print('shape:', df.shape)
    display(df.dtypes)
    display(df.isna().sum())
    print('duplicates:', df.duplicated().sum())
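The same checks can be approximated without pandas. This is a rough sketch under stated assumptions: `quick_checks` is a hypothetical helper, missing values are assumed to be represented as `None`, and the sample rows are made up to include one missing value and one repeated row.

```python
# Illustrative rows in the same shape as the generated dataset (values are made up).
sample_rows = [
    {'date': '2026-01-01', 'channel': 'organic', 'visits': 120, 'signups': 6, 'revenue': 84.5},
    {'date': '2026-01-02', 'channel': 'ads', 'visits': 210, 'signups': None, 'revenue': 310.0},
    {'date': '2026-01-01', 'channel': 'organic', 'visits': 120, 'signups': 6, 'revenue': 84.5},
]

def quick_checks(records):
    """Return shape, per-column missing counts, and the number of repeated rows."""
    columns = list(records[0]) if records else []
    # Assumption: a missing value is stored as None.
    missing = {col: sum(1 for r in records if r.get(col) is None) for col in columns}
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(col) for col in columns)
        if key in seen:
            duplicates += 1  # counts rows that repeat an earlier row
        else:
            seen.add(key)
    return {'shape': (len(records), len(columns)), 'missing': missing, 'duplicates': duplicates}

report = quick_checks(sample_rows)
print(report)
```

Like `df.duplicated().sum()`, this counts only the repeats, not the first occurrence of each row.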

## 4) Basic summary stats

Start with `describe()` and simple group-bys.

In [None]:
if HAS_PANDAS:
    display(df[['visits', 'signups', 'revenue']].describe())
    by_channel = df.groupby('channel')[['visits', 'signups', 'revenue']].agg(['mean', 'sum'])
    display(by_channel)
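For readers running without pandas, the group-by above can be sketched with `collections.defaultdict` and `statistics.mean`. The helper name `summarize_by_channel` and the sample rows are illustrative assumptions, not part of the original notebook.

```python
from collections import defaultdict
from statistics import mean

# Illustrative rows (values are made up).
sample_rows = [
    {'channel': 'organic', 'visits': 100, 'signups': 5, 'revenue': 50.0},
    {'channel': 'organic', 'visits': 140, 'signups': 7, 'revenue': 90.0},
    {'channel': 'ads', 'visits': 200, 'signups': 20, 'revenue': 300.0},
]

def summarize_by_channel(records, metrics=('visits', 'signups', 'revenue')):
    """Group records by channel and compute the mean and sum of each metric."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r['channel']].append(r)
    summary = {}
    for channel, members in grouped.items():
        summary[channel] = {
            m: {'mean': mean(x[m] for x in members), 'sum': sum(x[m] for x in members)}
            for m in metrics
        }
    return summary

summary = summarize_by_channel(sample_rows)
for channel, stats in summary.items():
    print(channel, stats)
```

The nested dict mirrors the two-level columns that `agg(['mean', 'sum'])` produces: metric first, then statistic.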

## 5) Simple visualizations

If you have `matplotlib` installed, this plots visits over time and revenue by channel.

In [None]:
if HAS_PANDAS:
    try:
        import matplotlib.pyplot as plt

        df2 = df.copy()
        df2['date'] = pd.to_datetime(df2['date'])
        daily = df2.groupby('date', as_index=False)[['visits', 'revenue']].sum()

        fig, axes = plt.subplots(1, 2, figsize=(12, 4))
        axes[0].plot(daily['date'], daily['visits'])
        axes[0].set_title('Daily visits')
        axes[0].set_xlabel('date')
        axes[0].set_ylabel('visits')

        rev = df2.groupby('channel', as_index=False)['revenue'].sum().sort_values('revenue', ascending=False)
        axes[1].bar(rev['channel'], rev['revenue'])
        axes[1].set_title('Revenue by channel (total)')
        axes[1].set_xlabel('channel')
        axes[1].set_ylabel('revenue')

        plt.tight_layout()
        plt.show()
    except ImportError as e:
        # Only a missing matplotlib is handled here; other errors should surface normally.
        print('Plotting skipped (install matplotlib): pip install matplotlib')
        print('Error:', e)

## Next ideas

- Load a real CSV and repeat the same checks
- Handle missing values (imputation / dropping)
- Detect outliers
- Create KPI metrics (conversion rate = signups/visits, ARPU, etc.)
- Make a small dashboard notebook
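Two of the ideas above, KPI metrics and outlier detection, can be sketched in plain Python. This is one possible approach under stated assumptions: `add_kpis` and `iqr_outliers` are hypothetical helpers, revenue per signup is used as a stand-in for ARPU because the dataset has no user count, and the 1.5 × IQR rule is a common convention rather than something the notebook prescribes. The sample values are made up.

```python
from statistics import quantiles

# Illustrative rows (values are made up).
sample_rows = [
    {'channel': 'ads', 'visits': 200, 'signups': 10, 'revenue': 150.0},
    {'channel': 'organic', 'visits': 0, 'signups': 0, 'revenue': 0.0},
]

def add_kpis(records):
    """Attach conversion rate (signups/visits) and revenue per signup to each record."""
    enriched = []
    for r in records:
        # Guard against division by zero; None marks an undefined ratio.
        conversion = r['signups'] / r['visits'] if r['visits'] else None
        # Assumption: revenue per signup stands in for ARPU.
        per_signup = r['revenue'] / r['signups'] if r['signups'] else None
        enriched.append({**r, 'conversion_rate': conversion, 'revenue_per_signup': per_signup})
    return enriched

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

kpis = add_kpis(sample_rows)
outliers = iqr_outliers([100, 110, 120, 130, 140, 900])
print(kpis[0]['conversion_rate'], kpis[0]['revenue_per_signup'])
print(outliers)
```

With pandas, the conversion rate is a one-liner such as `df['conversion_rate'] = df['signups'] / df['visits']`, though days with zero visits would still need handling.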