# Titanic Survival Classification – Data Preparation

## 1. Setup and Data Load
## 2. Initial Exploration
## 3. Preprocessing
## 4. Feature and Target Selection
## 5. Save Cleaned Data (Optional)

In [None]:
import pandas as pd

# Load Titanic dataset
file_path = 'data/titanic.csv'  # adjust if needed
titanic_data = pd.read_csv(file_path)

In [None]:
# View basic dataset shape
print("Shape:", titanic_data.shape)

# Preview column names
print("\nColumns:")
print(titanic_data.columns.tolist())

# Show first few rows
print("\nTop rows:")
print(titanic_data.head())

# Summary statistics
print("\nSummary statistics:")
print(titanic_data.describe(include='all'))

# Check for missing values
print("\nMissing values per column:")
print(titanic_data.isnull().sum())

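Initial inspection shows several columns with missing values (e.g., Age, Cabin). The preprocessing step handles them column by column: Cabin is dropped, missing Age values are filled with the median, and rows with a missing Embarked value are removed.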
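
The share of missing values per column is often easier to act on than raw counts. The sketch below shows one way to compute it; it uses a small made-up frame as a stand-in for `titanic_data`, so the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for titanic_data; values are made up for illustration only.
demo = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", None],
})

# isnull().mean() gives the fraction of missing entries in each column.
missing_share = demo.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns that are mostly empty are candidates for dropping, while columns with only a few gaps are candidates for imputation or row removal.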

In [None]:
# Drop columns unlikely to help with classification or too incomplete
titanic_data = titanic_data.drop(columns=['Cabin', 'Ticket', 'Name'])

# Drop rows with missing target ('Survived') or critical features
titanic_data = titanic_data.dropna(subset=['Survived', 'Embarked'])

# Fill missing Age with median
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())

# Encode 'Sex' and 'Embarked' as integers (values not in a mapping become NaN)
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})
titanic_data['Embarked'] = titanic_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

In [None]:
# Define target and features
y = titanic_data['Survived']

feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = titanic_data[feature_names]

print(X.head())
print(y.head())

In [None]:
# Save cleaned data
titanic_data.to_csv('data/titanic_cleaned.csv', index=False)

## Notes – Day 4

- Loaded Titanic dataset and explored structure and missing values.
- Dropped irrelevant columns (Name, Ticket, Cabin).
- Filled missing Age values with the median and dropped rows with a missing Embarked value.
- Encoded 'Sex' and 'Embarked' for modeling.
- Selected 7 features to use in classification modeling tomorrow.
- Dataset is clean and ready for training.
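
A quick check that a cleaned frame meets these expectations could look like the sketch below. It runs on a tiny hand-made frame rather than `data/titanic_cleaned.csv` so that it is self-contained; the helper name `check_clean` and the demo values are hypothetical.

```python
import pandas as pd

FEATURES = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']


def check_clean(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame is not ready for modeling."""
    cols = FEATURES + ['Survived']
    # Every feature and the target must be present.
    missing = [c for c in cols if c not in df.columns]
    assert not missing, f"missing columns: {missing}"
    # No gaps may remain in the columns the model will use.
    assert df[cols].notnull().all().all(), "found missing values"
    # All model inputs must be numeric after encoding.
    assert all(pd.api.types.is_numeric_dtype(df[c]) for c in FEATURES), \
        "non-numeric feature"


# Hand-made rows for illustration only (not real passenger records).
demo = pd.DataFrame({
    'Survived': [0, 1], 'Pclass': [3, 1], 'Sex': [0, 1], 'Age': [22.0, 38.0],
    'SibSp': [1, 1], 'Parch': [0, 0], 'Fare': [7.25, 71.28], 'Embarked': [0, 1],
})
check_clean(demo)
print("demo frame passes all checks")
```

To apply it to the real data, pass the frame loaded from the cleaned CSV instead of `demo`.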

## Day 5 – Train/Test Split and Decision Tree Classification

Today’s goal is to train a DecisionTreeClassifier to predict Titanic survival.  
We'll evaluate its accuracy and interpret the model using a confusion matrix, precision, and recall.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

In [None]:
# Reload the cleaned data saved on Day 4 and rebuild the target and features
titanic_data = pd.read_csv('data/titanic_cleaned.csv')

y = titanic_data['Survived']
feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = titanic_data[feature_names]

In [None]:
# Hold out a validation set (train_test_split defaults to a 25% test share)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

In [None]:
# Fit a decision tree with default hyperparameters; random_state keeps it reproducible
model = DecisionTreeClassifier(random_state=1)
model.fit(train_X, train_y)

In [None]:
# Predict on the held-out validation set and score overall accuracy
val_predictions = model.predict(val_X)
accuracy = accuracy_score(val_y, val_predictions)
print(f"Validation Accuracy: {accuracy:.3f}")

In [None]:
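# Rows of cm are actual classes, columns are predicted classes;
# precision and recall treat class 1 (survived) as the positive class
cm = confusion_matrix(val_y, val_predictions)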
precision = precision_score(val_y, val_predictions)
recall = recall_score(val_y, val_predictions)

print("Confusion Matrix:\n", cm)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")

## Day 5 Results – Titanic Classification

- **Model:** DecisionTreeClassifier
- **Validation Accuracy:** 80.3%
- **Confusion Matrix:**

| Actual \ Predicted | 0 (Did Not Survive) | 1 (Survived) |
|--------------------|---------------------|--------------|
| **0 (Did Not Survive)** | 113                 | 25           |
| **1 (Survived)**         | 19                  | 66           |

- **Precision:** 72.5% – Of the predicted survivors, 72.5% actually survived.
- **Recall:** 77.6% – The model correctly identified 77.6% of the actual survivors.
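
The three headline numbers follow directly from the confusion-matrix counts, which makes for a useful cross-check. A minimal sketch of the arithmetic, with the counts copied from the table above:

```python
# Counts taken from the confusion matrix above.
tn, fp = 113, 25   # actual non-survivors: correctly / incorrectly classified
fn, tp = 19, 66    # actual survivors: missed / correctly classified

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted survivors who survived
recall = tp / (tp + fn)        # share of actual survivors the model found

print(f"Accuracy:  {accuracy:.3f}")   # 0.803
print(f"Precision: {precision:.3f}")  # 0.725
print(f"Recall:    {recall:.3f}")     # 0.776
```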

### Interpretation

- With default hyperparameters, the model reaches 80.3% validation accuracy, compared with 61.9% (138 of 223) for always predicting non-survival.
- It makes more false positives (25) than false negatives (19), so it predicts slightly more survivors (91) than there actually are (85), while still catching most real survivors.
- Future improvements could include:
  - Trying a `max_depth` limit to reduce overfitting.
  - Testing a RandomForestClassifier for better generalization.
  - Exploring class imbalance solutions (e.g. `class_weight='balanced'`).
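
The three ideas above can be compared side by side with a small loop. This is a sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the Titanic features so that it runs on its own, and `max_depth=4` and `n_estimators=100` are arbitrary starting values, not tuned choices. The scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the seven Titanic features (NOT the real data).
X, y = make_classification(n_samples=400, n_features=7, n_informative=4,
                           weights=[0.6, 0.4], random_state=1)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# One candidate per improvement idea; hyperparameter values are assumptions.
candidates = {
    "tree, max_depth=4": DecisionTreeClassifier(max_depth=4, random_state=1),
    "tree, balanced weights": DecisionTreeClassifier(class_weight="balanced",
                                                     random_state=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

# Fit each candidate on the same split and record validation accuracy.
scores = {}
for name, model in candidates.items():
    model.fit(train_X, train_y)
    scores[name] = accuracy_score(val_y, model.predict(val_X))
    print(f"{name}: {scores[name]:.3f}")
```

To try this on the real data, replace the synthetic `X` and `y` with the features and target loaded from `data/titanic_cleaned.csv`, and compare precision and recall as well as accuracy.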