# Answers

## A1 – Initial Overview

_Load the `titanic` dataset from seaborn, report its shape, and preview the first five rows._

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

sns.set_theme(style='whitegrid')

titanic = sns.load_dataset('titanic')
print(f'Shape: {titanic.shape}')
titanic.head()
# Observation: The dataset contains 891 passengers across 15 columns of demographic, fare, and survival detail.


## A2 – Column Types

_Inspect the current data types and convert passenger descriptors (`sex`, `embarked`, `class`, `who`, `adult_male`, `alone`, `alive`) to categorical dtypes._

In [None]:
dtype_series = titanic.dtypes.sort_values()
print(dtype_series)

category_cols = ['sex', 'embarked', 'class', 'who', 'adult_male', 'alone', 'alive']
titanic[category_cols] = titanic[category_cols].astype('category')
titanic.dtypes.loc[category_cols]
# Observation: Casting to categorical reduces memory footprint and clarifies the nominal nature of these fields.
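
The memory claim can be checked directly. The sketch below uses a toy column (hypothetical values, not the real manifest) and compares `memory_usage(deep=True)` for the same data stored as `object` and as `category`.

```python
import pandas as pd

# Toy low-cardinality column (hypothetical data): 10,000 port codes.
ports = pd.Series(['S', 'C', 'Q', 'S'] * 2500, name='embarked')

as_object = ports.astype('object')
as_category = ports.astype('category')

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f'object: {object_bytes} bytes, category: {category_bytes} bytes')
```

The saving comes from storing one small integer code per row plus a single copy of each distinct label, so it shrinks as the number of distinct values grows.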


## A3 – Missingness Scan

_Compute the count and percentage of missing values for every column, sorted by the highest percentage first._

In [None]:
missing_summary = (
    titanic.isna().sum()
    .to_frame(name='missing_count')
    .assign(missing_pct=lambda df_: (df_['missing_count'] / len(titanic)) * 100)
    .sort_values('missing_pct', ascending=False)
    .round(2)
)
missing_summary
# Observation: `deck` (the cabin letter) is the sparsest field at roughly 77% missing, followed by `age` at about 20%; most core demographic columns are fully populated.
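
A shorter route to the percentage alone is the mean of the boolean mask, since `True` counts as 1. A minimal sketch on a toy frame (hypothetical values):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with different amounts of missingness per column.
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan],
    'fare': [7.25, 71.28, 8.05, 53.10],
    'deck': [None, 'C', None, None],
})

# The mean of a boolean mask is the fraction of True values, i.e. the missing share.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```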


## A4 – Median Age Imputation

_Create an `age_filled` column by imputing missing ages with the median age within each passenger class._

In [None]:
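# observed=True avoids the categorical-grouper FutureWarning in recent pandas releases.
titanic['age_filled'] = titanic.groupby('class', observed=True)['age'].transform(lambda s: s.fillna(s.median()))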
titanic[['age', 'age_filled']].head()
# Observation: Class-based medians preserve broad age differences between passenger tiers.
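
To see what the group-wise fill does row by row, here is a small sketch on a toy frame (hypothetical ages). It uses `transform('median')` to broadcast each class median back to the original rows, which is an equivalent formulation of the lambda above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): one missing age in each class.
toy = pd.DataFrame({
    'class': ['First', 'First', 'First', 'Third', 'Third', 'Third'],
    'age': [40.0, np.nan, 50.0, 20.0, 24.0, np.nan],
})

# transform returns a Series aligned to the original rows, so it can feed fillna directly.
group_median = toy.groupby('class')['age'].transform('median')
toy['age_filled'] = toy['age'].fillna(group_median)
print(toy)
```

The First-class gap receives 45.0 and the Third-class gap receives 22.0, each taken from its own group.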


## A5 – Embarkation Cleanup

_Fill missing `embarked` values with the most frequent embarkation port in the dataset and confirm no nulls remain._

In [None]:
embarked_mode = titanic['embarked'].mode(dropna=True)[0]
titanic['embarked'] = titanic['embarked'].fillna(embarked_mode)
titanic['embarked'].isna().sum()
# Observation: The mode is 'S' (Southampton), so the two passengers with a missing port default to it.
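
One detail worth knowing: `Series.mode` returns every tied value in sorted order, so indexing the first element silently breaks ties alphabetically. A sketch on a toy series (hypothetical values) with a deliberate tie:

```python
import pandas as pd

# Toy series (hypothetical data): 'S' and 'C' are tied for most frequent.
ports = pd.Series(['S', 'C', 'S', 'C', None])

modes = ports.mode(dropna=True)
print(modes.tolist())

# Taking the first entry picks 'C' here, purely because it sorts first.
filled = ports.fillna(modes.iloc[0])
print(filled.tolist())
```

On the Titanic data there is no tie, so this does not change the result, but checking `len(modes)` is a cheap safeguard on other datasets.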


## A6 – Duplicate Detection

_Check for duplicate passenger records using all columns and remove any that exist, keeping the first occurrence._

In [None]:
dup_mask = titanic.duplicated()
print(f'Duplicate rows: {dup_mask.sum()}')

titanic = titanic.loc[~dup_mask].reset_index(drop=True)
print(f'Post-drop shape: {titanic.shape}')
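# Observation: The seaborn copy has no name or ticket identifiers, so passengers who share every recorded attribute are flagged as duplicates (on the order of a hundred rows) and the shape shrinks. Some of these may be distinct people, so dropping them is a judgment call.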
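
Whether a flagged row is a true duplicate depends on which columns identify a passenger. The sketch below runs the same calls on a toy frame (hypothetical rows) to show what `keep='first'` marks and removes.

```python
import pandas as pd

# Toy frame (hypothetical data): rows 0 and 1 agree on every column.
toy = pd.DataFrame({
    'class': ['Third', 'Third', 'First', 'Third'],
    'sex': ['male', 'male', 'female', 'male'],
    'fare': [8.05, 8.05, 71.28, 7.90],
})

# keep='first' marks only the later repeats, so the first occurrence survives.
dup_mask = toy.duplicated(keep='first')
deduped = toy.drop_duplicates(keep='first').reset_index(drop=True)
print(dup_mask.tolist())
print(deduped)
```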


## A7 – Cabin Availability Flag

_Create a boolean `has_cabin` indicator based on whether the `cabin` field is known, and show the value counts._

In [None]:
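# seaborn's titanic has no raw `cabin` column; `deck` holds the cabin's leading letter, so use it as a fallback.
cabin_source = 'cabin' if 'cabin' in titanic.columns else 'deck'
titanic['has_cabin'] = titanic[cabin_source].notna()
titanic['has_cabin'].value_counts()
# Observation: Only a minority of passengers (roughly a quarter) have recorded cabin information.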


## A8 – Deck Extraction

_Derive a new categorical `deck` column by taking the first letter of `cabin` and report its distribution._

In [None]:
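# seaborn's titanic already ships `deck` (the cabin's first letter); derive it only when a raw `cabin` column exists.
if 'cabin' in titanic.columns:
    titanic['deck'] = titanic['cabin'].str[0]
titanic['deck'] = titanic['deck'].astype('category')
titanic['deck'].value_counts(dropna=False)
# Observation: Deck C is the most common among known cabins, but the majority of entries are missing.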


## A9 – Deck Imputation

_Fill missing `deck` values with `'Unknown'` and store the result as an ordered categorical feature with `Unknown` last._

In [None]:
# Use the declared categories so reorder_categories still matches if a level has no rows left.
deck_categories = sorted(titanic['deck'].cat.categories) + ['Unknown']
titanic['deck'] = titanic['deck'].cat.add_categories(['Unknown']).fillna('Unknown')
titanic['deck'] = titanic['deck'].cat.reorder_categories(deck_categories, ordered=True)
titanic['deck'].value_counts()
# Observation: Treating Unknown as an explicit level makes grouping logic downstream more transparent.
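
The same three steps can be traced on a toy series (hypothetical values) to confirm that `Unknown` ends up as the last, highest level of the ordered categorical.

```python
import pandas as pd

# Toy deck column (hypothetical data) with two missing entries.
deck = pd.Series(['C', None, 'A', None, 'B'], dtype='category')

# 'Unknown' must be registered as a category before it can be used as a fill value.
levels = sorted(deck.cat.categories) + ['Unknown']
deck = deck.cat.add_categories(['Unknown']).fillna('Unknown')
deck = deck.cat.reorder_categories(levels, ordered=True)

# sort_index follows the category order, so 'Unknown' is listed last.
counts = deck.value_counts().sort_index()
print(counts)
```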


## A10 – Fare Log Transform

_Add a `fare_log` column using the natural log of `fare + 1` to dampen skewness, and describe the new feature._

In [None]:
titanic['fare_log'] = np.log1p(titanic['fare'])
titanic['fare_log'].describe().round(2)
# Observation: The log transform compresses extreme fares, yielding a more symmetric distribution.
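
`np.log1p` is used here because it handles zero fares without producing negative infinity, and `np.expm1` inverts it exactly when values need to be mapped back to the original scale. A sketch on toy fares (hypothetical values):

```python
import numpy as np
import pandas as pd

# Toy fares (hypothetical data), including a zero fare.
fare = pd.Series([0.0, 7.25, 31.0, 512.33])

fare_log = np.log1p(fare)      # log(1 + fare); a zero fare maps to 0.0
recovered = np.expm1(fare_log)  # exp(x) - 1 undoes the transform

print(fare_log.round(3).tolist())
print(recovered.round(2).tolist())
```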


## A11 – Age Bands

_Bin `age_filled` into categorical bands (`Child`, `Teen`, `Adult`, `Mature`, `Senior`) using appropriate numeric cutoffs._

In [None]:
# Open-ended top edge: with right=False, a finite maximum edge would exclude the oldest passenger.
age_bins = [0, 12, 18, 40, 60, np.inf]
age_labels = ['Child', 'Teen', 'Adult', 'Mature', 'Senior']
titanic['age_band'] = pd.cut(titanic['age_filled'], bins=age_bins, labels=age_labels, right=False)
titanic['age_band'].value_counts()
# Observation: Adults dominate the manifest, with far fewer children and seniors aboard.
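
With `right=False` each bin is closed on the left and open on the right, so a value equal to the final edge belongs to no bin. The sketch below, on toy ages (hypothetical values), contrasts a finite top edge with an open-ended one.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical data); 80.0 is the maximum.
ages = pd.Series([5.0, 12.0, 30.0, 60.0, 80.0])
labels = ['Child', 'Teen', 'Adult', 'Mature', 'Senior']

# Finite top edge: the last bin is [60, 80), so the value 80.0 is left unbinned.
closed_top = pd.cut(ages, bins=[0, 12, 18, 40, 60, ages.max()], labels=labels, right=False)

# Open-ended top edge: the last bin is [60, inf), so every age gets a band.
open_top = pd.cut(ages, bins=[0, 12, 18, 40, 60, np.inf], labels=labels, right=False)

print(closed_top.tolist())
print(open_top.tolist())
```

Note also that an age of exactly 12 lands in `Teen` and exactly 60 in `Senior`, because the left edge is inclusive.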


## A12 – Family Size Feature

_Create `family_size` as `sibsp + parch + 1` and display its distribution._

In [None]:
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1
titanic['family_size'].value_counts().sort_index()
# Observation: Most passengers traveled alone or with one companion, while large families were rare.


## A13 – Fare Outliers

_Calculate the interquartile range for `fare` and count how many fares exceed the upper fence._

In [None]:
q1, q3 = titanic['fare'].quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outlier_count = (titanic['fare'] > upper_fence).sum()
print(f'IQR: {iqr:.2f}, Upper fence: {upper_fence:.2f}, Outliers: {outlier_count}')
# Observation: Fares are heavily right-skewed, so a sizeable share of passengers (on the order of a hundred, not a handful) sits above the Tukey upper fence.
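
The fence calculation generalizes to a small helper. This is a sketch only: `tukey_fences` is a hypothetical name, and the toy fares are made up so the arithmetic can be followed by hand.

```python
import pandas as pd


def tukey_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy fares (hypothetical data): Q1 = 9.5, Q3 = 27.0, IQR = 17.5.
fares = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 250.0])
lower, upper = tukey_fences(fares)
outliers = fares[fares > upper]

print(f'Lower fence: {lower:.2f}, upper fence: {upper:.2f}')
print(outliers.tolist())
```

Here the upper fence is 27.0 + 1.5 × 17.5 = 53.25, so only the 250.0 fare is flagged.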


## A14 – Fare Capping

_Create a `fare_capped` column where values above the upper fence are replaced with the 95th percentile fare._

In [None]:
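cap_value = titanic['fare'].quantile(0.95)
# Replace only fares above the Tukey upper fence (upper_fence from A13); clip(upper=cap_value) would leave fares between the fence and the cap unchanged.
titanic['fare_capped'] = titanic['fare'].mask(titanic['fare'] > upper_fence, cap_value)
titanic[['fare', 'fare_capped']].describe().round(2)
# Observation: Every fare above the fence takes the 95th-percentile value, while the bulk of the distribution at or below the fence is untouched.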
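
Replacing values above a threshold with a fixed cap is not the same as clipping at that cap, and the two differ exactly for values that lie between the threshold and the cap. A sketch on toy fares, where `upper_fence` and `cap_value` are hypothetical numbers chosen for illustration:

```python
import pandas as pd

# Toy fares and thresholds (hypothetical values, not computed from the dataset).
fare = pd.Series([10.0, 70.0, 120.0, 300.0])
upper_fence = 65.0
cap_value = 112.0

# clip: only values above the cap change; 70.0 stays as it is.
clipped = fare.clip(upper=cap_value)

# mask: every value above the fence becomes the cap; 70.0 is raised to 112.0.
replaced = fare.mask(fare > upper_fence, cap_value)

print(clipped.tolist())
print(replaced.tolist())
```

Clipping keeps the ordering of fares intact, whereas replacement can raise a moderate outlier above its original value, so the choice depends on which behavior the exercise intends.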


## A15 – Scaled Ages

_Rescale `age_filled` to a 0–1 range (min-max scaling) as `age_scaled` and confirm the min/max boundaries._

In [None]:
age_min = titanic['age_filled'].min()
age_max = titanic['age_filled'].max()
titanic['age_scaled'] = (titanic['age_filled'] - age_min) / (age_max - age_min)
titanic['age_scaled'].agg(['min', 'max']).round(3)
# Observation: Min-max scaling maps the youngest passenger to 0 and the oldest to 1 for downstream comparability.
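
The formula divides by `max - min`, which is zero for a constant column. A sketch of a reusable helper with that guard follows; `min_max_scale` is a hypothetical name, and returning zeros for a constant column is one possible convention, not the only one.

```python
import pandas as pd


def min_max_scale(values: pd.Series) -> pd.Series:
    """Rescale to [0, 1]; a constant series maps to all zeros (assumed convention)."""
    low, high = values.min(), values.max()
    span = high - low
    if span == 0:
        return pd.Series(0.0, index=values.index)
    return (values - low) / span


# Toy ages (hypothetical data).
ages = pd.Series([0.5, 20.0, 40.25, 80.0])
scaled = min_max_scale(ages)
constant = min_max_scale(pd.Series([30.0, 30.0, 30.0]))

print(scaled.tolist())
print(constant.tolist())
```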


## A16 – Boolean Encoding

_Convert the `adult_male` indicator to an integer `adult_male_int` column._

In [None]:
titanic['adult_male_int'] = titanic['adult_male'].astype('int')
titanic[['adult_male', 'adult_male_int']].head()
# Observation: Boolean casts provide clean 0/1 columns for modeling pipelines.
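
Because `adult_male` was cast to `category` in A2, the integer cast goes through its boolean categories. Routing through `bool` first makes that intent explicit; a minimal sketch on a toy flag (hypothetical values):

```python
import pandas as pd

# Toy flag (hypothetical data) stored as a categorical, as after the A2 cast.
flag = pd.Series([True, False, True]).astype('category')

# category -> bool -> int spells out the conversion path.
flag_int = flag.astype(bool).astype(int)
print(flag_int.tolist())
```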


## A17 – Column Ordering

_Assemble a cleaned view `titanic_cleaned` with key demographics first (survived, class, sex, age_filled, fare_capped, deck, family_size)._

In [None]:
ordered_cols = ['survived', 'class', 'sex', 'age_filled', 'fare_capped', 'deck', 'family_size']
secondary_cols = [c for c in titanic.columns if c not in ordered_cols]
titanic_cleaned = titanic[ordered_cols + secondary_cols]
titanic_cleaned.head()
# Observation: Reordering surfaces the most relevant modeling columns at a glance.


## A18 – Class Summary

_Produce an aggregated summary by passenger class for `fare_capped`, `age_filled`, and `family_size` (mean, median)._

In [None]:
class_summary = titanic_cleaned.groupby('class', observed=True).agg({
    'fare_capped': ['mean', 'median'],
    'age_filled': ['mean', 'median'],
    'family_size': ['mean', 'median']
}).round(2)
class_summary
# Observation: First-class travelers pay far higher fares and are noticeably older on average than second- and third-class travelers.


## A19 – Deck Survival Rate

_Compute survival rates by `deck`, sorting from highest to lowest, and include the passenger count per deck._

In [None]:
deck_survival = (
    titanic_cleaned.groupby('deck', observed=True)
    .agg(passengers=('survived', 'size'), survival_rate=('survived', 'mean'))
    .sort_values('survival_rate', ascending=False)
    .round({'survival_rate': 3})
)
deck_survival
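# Observation: Passengers with a known deck survive at far higher rates than the `Unknown` group; among known decks, B, D, and E lead rather than the uppermost deck A, and sample sizes vary markedly.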
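
Since sample sizes vary, rates from very small groups deserve caution. One option is to rank only groups above a minimum count. In the sketch below the toy rows and the `min_passengers` threshold are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical data): deck G has a single passenger.
toy = pd.DataFrame({
    'deck': ['A', 'A', 'B', 'B', 'B', 'B', 'G'],
    'survived': [0, 1, 1, 1, 0, 1, 1],
})

summary = toy.groupby('deck').agg(
    passengers=('survived', 'size'),
    survival_rate=('survived', 'mean'),
)

# Keep only decks with enough passengers for the rate to be meaningful.
min_passengers = 3
reliable = summary[summary['passengers'] >= min_passengers].sort_values(
    'survival_rate', ascending=False
)
print(summary)
print(reliable)
```

Deck G shows a 100% rate from one passenger and is excluded, while deck B keeps its 0.75 rate from four passengers.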


## A20 – Cleaning Checklist

_Compile a Python list called `cleaning_steps` summarizing the major transformations applied so far, then display it._

In [None]:
cleaning_steps = [
    'Cast categorical descriptors to category dtype',
    'Imputed age by passenger class median',
    'Filled missing embarkation ports with global mode',
    'Dropped exact duplicate rows, keeping the first occurrence',
    'Flagged cabin availability and standardized deck levels',
    'Log-transformed and capped fare to reduce skew',
    'Engineered family size, age bands, and scaled ages'
]
cleaning_steps
# Observation: Documenting steps clarifies the reproducible data-prep pipeline for collaborators.
