# Example Data Analysis Notebook

This notebook demonstrates how to structure a data analysis workflow using the project template. It includes examples of loading data, exploratory data analysis, feature engineering, model training, and evaluation.

## NBDoc Documentation

This notebook is documented using NBDoc, which allows for generating documentation from Jupyter notebooks. NBDoc uses special comment syntax to mark sections of the notebook for documentation generation.

<!-- #nbdoc:title Example Data Analysis -->
<!-- #nbdoc:description This notebook demonstrates a complete data analysis workflow using the project template. -->
<!-- #nbdoc:version 0.1.0 -->
<!-- #nbdoc:author Your Name -->
<!-- #nbdoc:keywords data analysis, machine learning, visualization -->
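
The markers above are ordinary HTML comments of the form `<!-- #nbdoc:key value -->`. Purely to illustrate the syntax, the following sketch extracts such markers from a Markdown string with a regular expression. The function name `parse_nbdoc_markers` is hypothetical, and this is not NBDoc's own parser.

```python
import re

# Matches comments such as: <!-- #nbdoc:title Example Data Analysis -->
MARKER_PATTERN = re.compile(r"<!--\s*#nbdoc:(\w+)\s+(.*?)\s*-->")


def parse_nbdoc_markers(markdown_text: str) -> list[tuple[str, str]]:
    """Return (key, value) pairs for every nbdoc marker found, in order."""
    return MARKER_PATTERN.findall(markdown_text)


cell = """
<!-- #nbdoc:title Example Data Analysis -->
<!-- #nbdoc:version 0.1.0 -->
"""
markers = parse_nbdoc_markers(cell)
print(markers)  # [('title', 'Example Data Analysis'), ('version', '0.1.0')]
```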

## Setup

First, let's import the necessary libraries and set up the environment.

In [None]:
# Standard libraries
import logging
import sys
from pathlib import Path

# Numerics and data processing
import numpy as np
import polars as pl

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Add the project root to the path so we can import our modules.
# A bit hacky, but we use it because it is common, easy to understand, and it works.
project_root = Path.cwd().parent
sys.path.append(str(project_root))

In [None]:
# Import project modules
from src.tabular_data_utils import (
    calculate_feature_importance,
    encode_categorical,
    normalize_features,
    split_train_test,
)

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logging.getLogger("matplotlib.category").setLevel(logging.WARNING)
logger = logging.getLogger(__name__)

# Set plot style
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 8)
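
`Path.cwd().parent` in the setup cell assumes the notebook is started from a directory exactly one level below the project root. A slightly more defensive variant, sketched below, walks upward until it finds a marker directory. The helper name `find_project_root` and the choice of `src` as the marker are assumptions for illustration, not part of the template.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Usage (hypothetical): extend sys.path only once, even if the cell is re-run.
# project_root = find_project_root(Path.cwd())
# if str(project_root) not in sys.path:
#     sys.path.append(str(project_root))
```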

## Data Loading

Let's load a sample dataset for our analysis. In a real project, you would replace this with your actual data loading code.

In [None]:
# For demonstration, we'll create a synthetic dataset
# In a real project, you would use load_dataset() to load your data

# Create synthetic data
np.random.seed(42)
n_samples = 1000

# Numeric features
feature1 = np.random.normal(0, 1, n_samples)
feature2 = np.random.normal(5, 2, n_samples)
feature3 = np.random.uniform(0, 10, n_samples)

# Categorical features
categories = ["A", "B", "C"]
category1 = np.random.choice(categories, n_samples)
category2 = np.random.choice(["X", "Y"], n_samples)

# Target variable (binary classification)
# Target depends on features to create a pattern
target_probs = 1 / (
    1
    + np.exp(
        -(
            0.5 * feature1
            - 0.2 * feature2
            + 0.1 * feature3
            + 0.5 * (category1 == "A")
            + 0.7 * (category2 == "X")
        )
    )
)
target = np.random.binomial(1, target_probs)

# Create DataFrame
data = {
    "feature1": feature1,
    "feature2": feature2,
    "feature3": feature3,
    "category1": category1,
    "category2": category2,
    "target": target,
}

df = pl.DataFrame(data)

# Display the first few rows
df.head()

## Exploratory Data Analysis

Let's explore the dataset to understand its characteristics.

In [None]:
# Basic statistics
print("Dataset shape:", df.shape)
print("\nNumeric columns statistics:")
df.select(pl.col(["feature1", "feature2", "feature3"])).describe()

In [None]:
# Distribution of categorical variables
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

category1_counts = df.group_by("category1").agg(pl.len()).sort("len", descending=True)
category2_counts = df.group_by("category2").agg(pl.len()).sort("len", descending=True)

sns.barplot(
    x=category1_counts["category1"].to_numpy(),
    y=category1_counts["len"].to_numpy(),
    ax=axes[0],
)
axes[0].set_title("Distribution of Category1")
axes[0].set_ylabel("Count")

sns.barplot(
    x=category2_counts["category2"].to_numpy(),
    y=category2_counts["len"].to_numpy(),
    ax=axes[1],
)
axes[1].set_title("Distribution of Category2")
axes[1].set_ylabel("Count")

plt.tight_layout()
plt.show()

In [None]:
# Distribution of numeric features
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Feature1 distribution
sns.histplot(df["feature1"].to_numpy(), kde=True, ax=axes[0])
axes[0].set_title("Distribution of Feature1")

# Feature2 distribution
sns.histplot(df["feature2"].to_numpy(), kde=True, ax=axes[1])
axes[1].set_title("Distribution of Feature2")

# Feature3 distribution
sns.histplot(df["feature3"].to_numpy(), kde=True, ax=axes[2])
axes[2].set_title("Distribution of Feature3")

plt.tight_layout()
plt.show()

In [None]:
# Target distribution
plt.figure(figsize=(10, 6))

target_counts = df.group_by("target").agg(pl.len())
sns.barplot(x=target_counts["target"].to_numpy(), y=target_counts["len"].to_numpy())
plt.title("Distribution of Target Variable")
plt.xlabel("Target")
plt.ylabel("Count")
plt.xticks([0, 1], ["Class 0", "Class 1"])
plt.show()

In [None]:
# Convert Polars DataFrame to pandas for seaborn
df_pandas = df.to_pandas()

# Relationship between features and target
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Feature1 vs Target
sns.boxplot(x="target", y="feature1", data=df_pandas, ax=axes[0])
axes[0].set_title("Feature1 vs Target")
axes[0].set_xlabel("Target")
axes[0].set_ylabel("Feature1")
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(["Class 0", "Class 1"])

# Feature2 vs Target
sns.boxplot(x="target", y="feature2", data=df_pandas, ax=axes[1])
axes[1].set_title("Feature2 vs Target")
axes[1].set_xlabel("Target")
axes[1].set_ylabel("Feature2")
axes[1].set_xticks([0, 1])
axes[1].set_xticklabels(["Class 0", "Class 1"])

# Feature3 vs Target
sns.boxplot(x="target", y="feature3", data=df_pandas, ax=axes[2])
axes[2].set_title("Feature3 vs Target")
axes[2].set_xlabel("Target")
axes[2].set_ylabel("Feature3")
axes[2].set_xticks([0, 1])
axes[2].set_xticklabels(["Class 0", "Class 1"])

plt.tight_layout()
plt.show()

## Feature Engineering

Now let's prepare the data for modeling by normalizing numeric features and encoding categorical features.

In [None]:
# Normalize numeric features
numeric_columns = ["feature1", "feature2", "feature3"]
df_normalized = normalize_features(df, columns=numeric_columns, method="standard")

# Encode categorical features
categorical_columns = ["category1", "category2"]
df_processed = encode_categorical(
    df_normalized, columns=categorical_columns, method="one_hot"
)

# Display the processed data
df_processed.head()

## Model Training

Let's split the data into training and testing sets, then train a random forest classifier.

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = split_train_test(
    df_processed, target_column="target", test_size=0.2, random_seed=42
)

# Train a Random Forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train.to_numpy(), y_train.to_numpy())

print("Model trained successfully!")

## Model Evaluation

Let's evaluate the model's performance on the test set.

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test.to_numpy())

# Calculate and display classification metrics
print("Classification Report:")
print(classification_report(y_test.to_numpy(), y_pred))

# Plot confusion matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test.to_numpy(), y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.xticks([0.5, 1.5], ["Class 0", "Class 1"])
plt.yticks([0.5, 1.5], ["Class 0", "Class 1"])
plt.show()
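
The numbers in the classification report follow directly from the confusion matrix. As a small worked sketch, assuming scikit-learn's layout for the binary case (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`), precision and recall for class 1 and overall accuracy can be recomputed by hand. The function name `binary_metrics` and the example counts are made up for illustration.

```python
def binary_metrics(cm):
    """Derive class-1 precision, recall, and overall accuracy from a 2x2 matrix.

    Assumes at least one predicted positive and one actual positive,
    so neither denominator is zero.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "precision": tp / (tp + fp),  # of predicted positives, how many are right
        "recall": tp / (tp + fn),  # of actual positives, how many are found
        "accuracy": (tp + tn) / total,
    }


# Made-up counts: 50 true negatives, 10 false positives,
# 5 false negatives, 35 true positives.
metrics = binary_metrics([[50, 10], [5, 35]])
print(metrics)
```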

In [None]:
# Calculate and visualize feature importance
feature_names = X_train.columns
importance_values = model.feature_importances_

# Get the top 10 most important features
feature_importance = calculate_feature_importance(
    feature_names, importance_values, top_n=10
)

# Plot feature importance
plt.figure(figsize=(12, 8))
sns.barplot(x=list(feature_importance.values()), y=list(feature_importance.keys()))
plt.title("Top 10 Feature Importance")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
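
The plotting code above calls `.keys()` and `.values()` on the result of `calculate_feature_importance`, which suggests it returns a mapping from feature name to score. Its real implementation is not shown, so the following is only a plausible sketch under the hypothetical name `top_feature_importance`: pair names with scores, sort in descending order, and keep the first `top_n`.

```python
def top_feature_importance(feature_names, importance_values, top_n=10):
    """Pair names with scores and keep the `top_n` largest, highest first."""
    ranked = sorted(
        zip(feature_names, importance_values), key=lambda pair: pair[1], reverse=True
    )
    return {name: float(score) for name, score in ranked[:top_n]}


top = top_feature_importance(["a", "b", "c"], [0.2, 0.5, 0.3], top_n=2)
print(top)  # {'b': 0.5, 'c': 0.3}
```

Because Python dictionaries preserve insertion order, the bars are drawn from most to least important when the result is passed to `sns.barplot` as above.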

## Conclusion

In this notebook, we've demonstrated a complete data analysis workflow using the project template. We've covered:

1. Data loading and exploration
2. Feature engineering
3. Model training
4. Model evaluation

This notebook serves as a template for your own data analysis projects. To adapt it, replace the synthetic data with your actual data and customize the analysis steps.

<!-- #nbdoc:section Conclusion -->
<!-- #nbdoc:description Summary of the analysis and next steps -->