# AI Programming Foundations Project: Reproducible Data Workflow

**Name:** Frank Allen Motley  
**Dataset:** Titanic (Seaborn example dataset) — https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv  
**Project Description:** This notebook demonstrates a complete, reproducible data workflow using Python, NumPy, Pandas, and Matplotlib/Seaborn. It loads a real dataset, cleans and transforms it with reusable functions, performs exploratory analysis, creates labeled visualizations, and summarizes insights with notes on limitations and responsible data handling.


## 1. Setup

**Environment note (important):** This project is tested with **NumPy 1.x** (specifically `numpy==1.26.4`), which is pinned in `requirements.txt`. The pin avoids binary compatibility issues with pandas and pyarrow, which are still stabilizing support for NumPy 2.x. If you see an error such as “a module compiled using NumPy 1.x cannot be run in NumPy 2.x”, create a clean environment and install the pinned dependencies from `requirements.txt` before running this notebook.


In [None]:
# Imports (step 1/2)
import os
import numpy as np

# Compatibility note: this project expects NumPy 1.x
print('NumPy version:', np.__version__)
if int(np.__version__.split('.')[0]) >= 2:
    print(
        'WARNING: You are running NumPy 2.x. If you encounter import errors for pandas/pyarrow,\n'
        'create a clean environment and install dependencies from requirements.txt (pins NumPy 1.x).'
    )


In [None]:
# Imports (step 2/2)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Display and plotting settings
pd.set_option('display.max_columns', 50)
sns.set_theme(style='whitegrid')


## 2. Data Ingestion

In [None]:
# Load dataset (downloads from a stable URL if local file is not present)
DATA_URL = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
DATA_PATH = "titanic.csv"

if not os.path.exists(DATA_PATH):
    df = pd.read_csv(DATA_URL)
    df.to_csv(DATA_PATH, index=False)
else:
    df = pd.read_csv(DATA_PATH)

df.head()
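
Because the cached `titanic.csv` is reused on later runs, recording a checksum of the file is one way to confirm that every run sees the same bytes. The sketch below is illustrative only: `file_sha256` is a hypothetical helper that is not part of this notebook, and it is demonstrated on a throwaway file rather than on `DATA_PATH`.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a throwaway file; in the notebook the argument would be DATA_PATH
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_bytes(b"survived,age\n0,22\n1,38\n")
    checksum = file_sha256(sample_path)

print(checksum)
```

Storing the digest alongside the report would let a later run detect whether the upstream file or the local cache has changed.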


In [None]:
# Quick structural check
df.info()


## 3. Cleaning and Transformation

This section defines **reusable cleaning functions** (with docstrings) and applies them to the dataset.

In [None]:
def clean_column_names(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the DataFrame with standardized column names.

    Standardization improves readability and reduces errors when referencing columns.
    This function strips surrounding whitespace, lowercases names, and replaces runs of whitespace with underscores.
    """
    df_clean = data.copy()
    df_clean.columns = (
        df_clean.columns
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )
    return df_clean


def impute_missing_values(data: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values using simple, transparent rules.

    - Numeric columns: fill missing values with the column median.
    - Categorical/object columns: fill missing values with the column mode (most frequent),
      falling back to "Unknown" when a column has no observed values.

    This approach is suitable for exploratory workflows and keeps assumptions explicit.
    """
    df_clean = data.copy()

    for col in df_clean.columns:
        if pd.api.types.is_numeric_dtype(df_clean[col]):
            if df_clean[col].isna().any():
                df_clean[col] = df_clean[col].fillna(df_clean[col].median())
        else:
            if df_clean[col].isna().any():
                mode = df_clean[col].mode(dropna=True)
                fill_value = mode.iloc[0] if len(mode) else "Unknown"
                df_clean[col] = df_clean[col].fillna(fill_value)

    return df_clean


In [None]:
# Apply cleaning functions
df_clean = clean_column_names(df)
df_clean = impute_missing_values(df_clean)

# Check remaining missingness
df_clean.isna().sum().sort_values(ascending=False).head(10)
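
Checking missingness only after imputation hides how much of each column was filled in. A minimal sketch of a pre-imputation report follows. `missing_report` is a hypothetical helper that is not defined elsewhere in this notebook, and the toy frame (made-up values) stands in for the raw `df`.

```python
import numpy as np
import pandas as pd


def missing_report(data: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most-missing first.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    report = pd.DataFrame({
        "n_missing": data.isna().sum(),
        "frac_missing": data.isna().mean(),
    })
    return report.sort_values("frac_missing", ascending=False)


# Toy data (made up for this example) standing in for the raw Titanic frame
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [np.nan, np.nan, np.nan, "C"],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

report = missing_report(toy)
print(report)
```

Running such a report on `df` before `impute_missing_values` would show which columns depend most on filled values. Columns with a large missing fraction may be better flagged or dropped than mode-filled.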


In [None]:
# Convert low-cardinality columns to the categorical dtype
# (Figure 2 below relies on `class` being categorical via the `.cat` accessor)
for c in ["sex", "class", "embarked", "who", "deck", "embark_town", "alive", "alone"]:
    if c in df_clean.columns:
        df_clean[c] = df_clean[c].astype("category")

df_clean.head()


## 4. Exploratory Data Analysis (EDA)

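The `eda_summary` function below collects the dataset shape, dtypes, numeric summary statistics, and survival rates (overall, by sex, and by class) into a single dictionary.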

In [None]:
def eda_summary(data: pd.DataFrame) -> dict:
    """Compute a small set of reusable EDA outputs.

    Returns a dict containing:
    - shape, dtypes
    - summary statistics for numeric columns
    - survival rate overall and by key groups (if present)

    Designed for quick reuse across exploratory notebooks.
    """
    out = {
        "shape": data.shape,
        "dtypes": data.dtypes.astype(str).to_dict(),
        "numeric_summary": data.select_dtypes(include=[np.number]).describe(),
    }

    if "survived" in data.columns:
        out["survival_rate_overall"] = float(data["survived"].mean())

        if "sex" in data.columns:
            out["survival_by_sex"] = (
                data.groupby("sex", observed=False)["survived"]
                .mean()
                .sort_values(ascending=False)
            )
            
        if "class" in data.columns:
            out["survival_by_class"] = (
                data.groupby("class", observed=False)["survived"]
                .mean()
                .sort_values(ascending=False)
            )

    return out


eda = eda_summary(df_clean)

eda["shape"], eda.get("survival_rate_overall")


In [None]:
# View numeric summary
eda["numeric_summary"]


In [None]:
# Survival by sex / class (if available)
eda.get("survival_by_sex"), eda.get("survival_by_class")


## 5. Visualizations

Each visualization includes a **title** and **labeled axes**, and is saved to the `figures/` folder for use in the written report.

In [None]:
from pathlib import Path

# Create a figures folder relative to the current working directory (normally the notebook's folder)
FIGURES_DIR = Path("figures")
FIGURES_DIR.mkdir(parents=True, exist_ok=True)


In [None]:
# Figure 1: Survival rate by sex
plt.figure()
ax = sns.barplot(data=df_clean, x="sex", y="survived", errorbar=None)
ax.set_title("Figure 1. Survival Rate by Sex (Titanic)")
ax.set_xlabel("Sex")
ax.set_ylabel("Mean Survival Rate")
plt.tight_layout()
plt.savefig(FIGURES_DIR / "figure1_survival_by_sex.png", dpi=200)
plt.show()

# Interpretation:
# In this dataset, female passengers show a higher mean survival rate than male passengers.


In [None]:
# Figure 2: Survival rate by passenger class
plt.figure()
# Alphabetical order of the category labels coincides with First, Second, Third
ax = sns.barplot(
    data=df_clean,
    x="class",
    y="survived",
    errorbar=None,
    order=sorted(df_clean["class"].cat.categories),
)
ax.set_title("Figure 2. Survival Rate by Passenger Class (Titanic)")
ax.set_xlabel("Passenger Class")
ax.set_ylabel("Mean Survival Rate")
plt.tight_layout()
plt.savefig(FIGURES_DIR / "figure2_survival_by_class.png", dpi=200)
plt.show()

# Interpretation:
# Mean survival rate is highest in first class and lowest in third class, suggesting strong stratification in outcomes.


In [None]:
# Figure 3: Age distribution by survival outcome
plt.figure()
ax = sns.histplot(
    data=df_clean,
    x="age",
    hue="survived",
    bins=30,
    element="step",
    stat="density",
    common_norm=False,
)
ax.set_title("Figure 3. Age Distribution by Survival Outcome (Titanic)")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Density")
plt.tight_layout()
plt.savefig(FIGURES_DIR / "figure3_age_by_survival.png", dpi=200)
plt.show()

# Interpretation:
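# The age distributions overlap substantially, so age alone separates the two outcomes only weakly.
# Missing ages were filled with the median, which can produce an artificial peak at that value.
# Any relationship between age and survival may interact with class and sex.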


## 6. Summary, Interpretation, and Limitations

**Key takeaways (based on this exploratory workflow):**
- Survival differs strongly by **sex** and **passenger class**.
- Age patterns appear more subtle; additional stratified analysis (e.g., age-by-class-by-sex) would be a good next step.
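
The stratified follow-up suggested above could be sketched as follows. This is an illustration under stated assumptions rather than part of the analysis: the rows are made up, the age cut points (12 and 40) are arbitrary choices, and in the notebook the input would be `df_clean`.

```python
import pandas as pd

# Toy rows (made up for illustration); in the notebook this would be df_clean
toy = pd.DataFrame({
    "age": [4, 30, 45, 8, 28, 60, 25, 33],
    "sex": ["female", "male", "female", "male", "female", "male", "male", "female"],
    "class": ["First", "First", "Second", "Third", "Third", "Third", "Second", "First"],
    "survived": [1, 0, 1, 1, 0, 0, 0, 1],
})

# Bin age into coarse bands; the cut points are an assumption, not from the notebook
toy["age_band"] = pd.cut(
    toy["age"],
    bins=[0, 12, 40, 120],
    labels=["child", "adult", "older adult"],
)

# Mean survival and group size for each observed (class, sex, age band) combination
stratified = (
    toy.groupby(["class", "sex", "age_band"], observed=True)["survived"]
    .agg(rate="mean", n="count")
    .reset_index()
)
print(stratified)
```

Reporting `n` next to each rate matters here because three-way splits leave small cells, and a rate computed from a handful of passengers should be read with caution.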

**Limitations / assumptions:**
- Missing-value imputation uses simple rules (median for numeric, mode for categorical). This reduces variance in the imputed columns and may change group comparisons, especially for columns with many missing values such as `deck`.
- The dataset is historical and represents a specific context; it should not be used to generalize beyond the Titanic passenger population.
- This analysis is descriptive; it does not establish causal relationships.
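
The variance effect of median imputation noted above can be seen on a small example. The ages below are made up for illustration. The fill rule is the same one `impute_missing_values` applies to numeric columns.

```python
import numpy as np
import pandas as pd

# Toy ages (made up for illustration) with three missing entries
ages = pd.Series([2.0, 18.0, 25.0, np.nan, 40.0, np.nan, 63.0, np.nan])

# Standard deviation over observed values only (pandas skips NaN by default)
std_observed = ages.std()

# Median imputation: every missing entry receives the same central value
ages_filled = ages.fillna(ages.median())
std_filled = ages_filled.std()

print(f"std (observed only): {std_observed:.2f}")
print(f"std (median-filled): {std_filled:.2f}")
```

Because every filled value sits exactly at the median, the filled series is less spread out than the observed values, and the shrinkage grows with the share of missing entries.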

**Next steps (future AI integration):**
- Prepare train/test splits and feature engineering for ML models.
- Use pipelines for reproducible preprocessing.
- Add dataset documentation (datasheet-style) and evaluate fairness impacts across subgroups.
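
One detail worth settling before modeling is the order of splitting and imputing. The sketch below uses only pandas and made-up rows, and it learns the fill value from the training rows alone. It is a simplified stand-in, not a definitive approach. In practice, scikit-learn's `train_test_split` and `Pipeline` are typical tools for this step.

```python
import numpy as np
import pandas as pd

# Toy frame (made up for illustration) standing in for the raw Titanic data
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0, np.nan, 2.0, 27.0, 14.0, np.nan, 40.0],
    "survived": [0, 1, 1, 0, 0, 1, 1, 0, 1, 0],
})

# Hold out 20% of rows; a fixed random_state keeps the split reproducible
test = toy.sample(frac=0.2, random_state=42)
train = toy.drop(test.index)

# Learn the fill value from the training rows only, then reuse it on both
# splits so that no information from the test set leaks into preprocessing
train_median = train["age"].median()
train = train.assign(age=train["age"].fillna(train_median))
test = test.assign(age=test["age"].fillna(train_median))

print(len(train), len(test))
```

By contrast, the exploratory workflow above imputes over the full dataset, which is fine for description but would leak test-set information if reused unchanged for model evaluation.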
