# 01: PCOS Dataset Exploration

**Goal:** Load and inspect the PCOS prediction dataset to understand features, data quality, and key patterns.

**Dataset:** [Kaggle PCOS Prediction Dataset](https://www.kaggle.com/datasets/ankushpanday1/pcos-prediction-datasettop-75-countries)

**Next steps:**
- Feature correlation analysis
- Identify key predictive features
- Build baseline ML classifier

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

In [None]:
# Load the dataset
# Update the filename if yours is different
df = pd.read_csv("../data/PCOS_data.csv")

print(f"Dataset shape: {df.shape}")
print(f"Number of samples (rows): {df.shape[0]}")
# The column count includes the target column, so it is not a pure feature count
print(f"Number of columns: {df.shape[1]}")

In [None]:
# Preview the data
df.head(10)

In [None]:
# Check column names and data types
print("Column names:")
print(df.columns.tolist())
print("\nData types:")
print(df.dtypes)

In [None]:
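# Check for missing values
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_percent
})

# Report only columns with at least one missing value, worst first
missing_only = missing_df[missing_df['Missing Count'] > 0]
if missing_only.empty:
    print("No missing values found.")
else:
    print("Missing values per column:")
    print(missing_only.sort_values('Missing Count', ascending=False))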
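
If the check above reports gaps in numerical columns, median imputation is one common way to fill them. The sketch below is only an illustration under assumptions: the `toy` frame and its column names are made up, and median imputation is not necessarily the strategy this project will adopt.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real dataset, with one missing value
toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 30.0, 26.0],
    "age": [25, 31, 28, 40],
})

# Fill each numerical column's gaps with that column's median
numeric_cols = toy.select_dtypes(include="number").columns
imputed = toy.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

print(imputed)
print(f"Remaining missing values: {imputed.isnull().sum().sum()}")
```

Working on a copy keeps the raw data intact, so imputed and original values can still be compared later.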

In [None]:
# Summary statistics
df.describe()

## Target Variable Analysis

In [None]:
# Find the target column (usually named 'PCOS' or 'PCOS (Y/N)' or similar)
# Adjust column name based on your dataset
target_col = 'PCOS (Y/N)'  # Change this if needed

if target_col in df.columns:
    print("Target variable distribution:")
    print(df[target_col].value_counts())
    print("\nClass balance:")
    print(df[target_col].value_counts(normalize=True))

    # Visualize
    plt.figure(figsize=(8, 5))
    # Sort by class label so bar order and colors do not depend on class frequency
    df[target_col].value_counts().sort_index().plot(kind='bar', color=['#2ecc71', '#e74c3c'])
    plt.title('PCOS Distribution in Dataset', fontsize=14, fontweight='bold')
    plt.xlabel('PCOS Status')
    plt.ylabel('Count')
    plt.xticks(rotation=0)
    plt.tight_layout()
    plt.show()
else:
    print(f"Column '{target_col}' not found. Available columns:")
    print(df.columns.tolist())
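
When the target column's exact name is uncertain, a case-insensitive search over the column names can narrow it down. The following is a minimal sketch: `find_target_candidates` is a hypothetical helper, and the `toy` frame with its column names is made up for illustration.

```python
import pandas as pd

def find_target_candidates(frame, keyword="pcos"):
    """Return column names that contain `keyword`, ignoring case."""
    return [col for col in frame.columns if keyword.lower() in str(col).lower()]

# Made-up frame; note the stray leading space in one column name
toy = pd.DataFrame({
    "Age": [25, 31],
    " PCOS (Y/N)": [0, 1],
    "BMI": [22.1, 27.4],
})

candidates = find_target_candidates(toy)
print(candidates)
```

The helper returns names exactly as stored, so stray whitespace in a header (a frequent cause of "column not found") shows up in the printed list.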

## Feature Distribution Analysis

In [None]:
# Select numerical columns for visualization (excluding the target, if it is numeric)
numerical_cols = [
    col for col in df.select_dtypes(include=[np.number]).columns
    if col != target_col
]
print(f"Numerical features ({len(numerical_cols)}): {numerical_cols}")

In [None]:
# Plot distributions of the first six numerical features
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()
cols_to_plot = numerical_cols[:6]

for idx, col in enumerate(cols_to_plot):
    axes[idx].hist(df[col].dropna(), bins=30, color='skyblue', edgecolor='black')
    axes[idx].set_title(col, fontweight='bold')
    axes[idx].set_xlabel('Value')
    axes[idx].set_ylabel('Frequency')

# Hide unused subplots when fewer than six numerical features exist
for ax in axes[len(cols_to_plot):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()

## Initial Observations

**Key findings:**
- Dataset size: [fill in number of samples and features after running]
- Target variable distribution: [fill in after running]
- Missing data: [note any patterns]
- Potential outliers: [note any unusual distributions]
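
To put numbers on the "potential outliers" item, one option is the interquartile range (IQR) rule, which flags values more than 1.5 × IQR beyond the quartiles. This is a rough sketch, not a fixed part of the analysis: `iqr_outlier_counts` is a hypothetical helper, the `toy` data is made up, and the 1.5 multiplier is a convention rather than a requirement.

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.sum()

# Made-up data: column "a" has one extreme value, column "b" has none
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [10, 11, 12, 13, 14],
})

outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

A flagged value is only a candidate for review; in clinical data an extreme measurement can be genuine, so it is worth inspecting before dropping anything.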

**Next steps:**
1. Feature correlation analysis
2. Handle missing values
3. Feature engineering
4. Build baseline ML model
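
As a starting point for the correlation step, the sketch below ranks numerical features by the strength of their Pearson correlation with the target. It runs on a made-up `toy` frame; the column names, including `target`, are placeholders to swap for the real ones, and Pearson correlation against a binary target is only a rough first screen.

```python
import pandas as pd

# Made-up data standing in for the real dataset
toy = pd.DataFrame({
    "feature_up": [1, 2, 3, 4, 5, 6],
    "feature_down": [6, 5, 4, 3, 2, 1],
    "target": [0, 0, 0, 1, 1, 1],
})
target_col = "target"  # placeholder name

# Correlate every numerical column with the target, then drop the target itself
corr_with_target = (
    toy.select_dtypes(include="number")
    .corr()[target_col]
    .drop(target_col)
)

# Rank features by absolute correlation, strongest first
ranking = corr_with_target.abs().sort_values(ascending=False)
print(corr_with_target)
print(ranking)
```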