# Introduction to Machine Learning – Titanic Dataset

This notebook introduces basic supervised learning with:
- Preprocessing (missing values, encoding)
- Feature scaling
- Pipeline creation with Scikit-learn
- Model training & evaluation
- Model saving and serving with FastAPI

In [19]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import joblib

In [20]:
# 📥 Load Titanic Dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
df.to_csv("titanic.csv", index=False)
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [21]:
# 🧹 Select Features and Target
features = ["Pclass", "Sex", "Age", "Fare", "Embarked"]
target = "Survived"

X = df[features]
y = df[target]

In [22]:
# 🔧 Define Preprocessing Pipeline
numeric_features = ["Age", "Fare"]
numeric_transformer = Pipeline(
    [("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_features = ["Pclass", "Sex", "Embarked"]
categorical_transformer = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    [
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

In [23]:
# 🔁 Full Pipeline with Model
clf_pipeline = Pipeline(
    [("preprocessing", preprocessor), ("classifier", LogisticRegression(max_iter=1000))]
)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model
clf_pipeline.fit(X_train, y_train)

# Evaluate
y_pred = clf_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.84      0.82       105
           1       0.76      0.72      0.74        74

    accuracy                           0.79       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179





## Save the Trained Pipeline

In [24]:
joblib.dump(clf_pipeline, "titanic_pipeline.pkl")

['titanic_pipeline.pkl']
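A saved pipeline can be read back with `joblib.load` and used for prediction without refitting. The sketch below checks that round trip on a small stand-in pipeline; `X_toy`, `y_toy`, and the file name `toy_pipeline.pkl` are hypothetical and are used only so the example runs on its own.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the Titanic features.
X_toy = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
        "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08],
    }
)
y_toy = [0, 1, 1, 1, 0, 0]

toy_pipeline = Pipeline(
    [("scaler", StandardScaler()), ("classifier", LogisticRegression(max_iter=1000))]
)
toy_pipeline.fit(X_toy, y_toy)

# Write the fitted pipeline to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should give the same predictions as the original.
same_predictions = bool((restored.predict(X_toy) == toy_pipeline.predict(X_toy)).all())
print("Round trip preserved predictions:", same_predictions)
```

Because the file is a pickle, it is safest to load it only from a trusted source and with the same scikit-learn version that wrote it.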

## Exercise 1: Try a Different Classifier
Replace the logistic regression model in the pipeline with another classifier, such as `RandomForestClassifier`, and compare the results.

```python
from sklearn.ensemble import RandomForestClassifier
# Replace the classifier in clf_pipeline
```

*What changes do you observe in precision and recall?*

In [25]:
print("=" * 60)
print("EXERCISE 1: Random Forest Classifier")
print("=" * 60)

# Select features and target
features = ["Pclass", "Sex", "Age", "Fare", "Embarked"]
target = "Survived"

X = df[features]
y = df[target]

# Define preprocessing pipeline (same as before)
numeric_features = ["Age", "Fare"]
numeric_transformer = Pipeline(
    [("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_features = ["Pclass", "Sex", "Embarked"]
categorical_transformer = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    [
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Create pipeline with Random Forest
rf_pipeline = Pipeline(
    [
        ("preprocessing", preprocessor),
        ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
    ]
)

# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Random Forest model
rf_pipeline.fit(X_train, y_train)
y_pred_rf = rf_pipeline.predict(X_test)

print("Random Forest Results:")
print(classification_report(y_test, y_pred_rf))

# Compare with Logistic Regression
lr_pipeline = Pipeline(
    [
        ("preprocessing", preprocessor),
        ("classifier", LogisticRegression(max_iter=1000, random_state=42)),
    ]
)

lr_pipeline.fit(X_train, y_train)
y_pred_lr = lr_pipeline.predict(X_test)

print("\nLogistic Regression Results:")
print(classification_report(y_test, y_pred_lr))

EXERCISE 1: Random Forest Classifier
Random Forest Results:
              precision    recall  f1-score   support

           0       0.82      0.81      0.81       105
           1       0.73      0.74      0.74        74

    accuracy                           0.78       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.78      0.78      0.78       179


Logistic Regression Results:
              precision    recall  f1-score   support

           0       0.81      0.84      0.82       105
           1       0.76      0.72      0.74        74

    accuracy                           0.79       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179





Moving from Logistic Regression to Random Forest changes the test-set metrics as follows:

- Precision for Class 0 (did not survive): improved from 0.81 to 0.82 (+0.01)
- Recall for Class 0: decreased from 0.84 to 0.81 (-0.03)
- Precision for Class 1 (survived): decreased from 0.76 to 0.73 (-0.03)
- Recall for Class 1: improved from 0.72 to 0.74 (+0.02)

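The Random Forest trades precision for recall on the survivor class: it finds slightly more of the actual survivors (higher recall for Class 1), but a larger share of its survival predictions are wrong (lower precision for Class 1). Overall accuracy is nearly unchanged (0.78 versus 0.79 for Logistic Regression).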
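To make the two metrics concrete, here is a minimal sketch that computes precision and recall by hand from true and predicted labels. The label lists are hypothetical and are not the notebook's predictions; `precision_recall` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 4 actual survivors, 6 non-survivors.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true_toy, y_pred_toy)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.60, recall=0.75
```

In this toy case the model finds three of the four survivors (recall 0.75), but two of its five survival predictions are false alarms (precision 0.60).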

## Exercise 2: Use Cross-Validation
Apply cross-validation on the pipeline instead of a single train/test split.

```python
from sklearn.model_selection import cross_val_score
```

*Is the model stable across folds?*

In [26]:
print("\n" + "=" * 60)
print("EXERCISE 2: Cross-Validation")
print("=" * 60)

# Cross-validation for Random Forest
rf_cv_scores = cross_val_score(rf_pipeline, X, y, cv=5, scoring="accuracy")
print(f"Random Forest CV Scores: {rf_cv_scores}")
print(
    f"Random Forest CV Mean: {rf_cv_scores.mean():.3f} (+/- {rf_cv_scores.std() * 2:.3f})"
)

# Cross-validation for Logistic Regression
lr_cv_scores = cross_val_score(lr_pipeline, X, y, cv=5, scoring="accuracy")
print(f"\nLogistic Regression CV Scores: {lr_cv_scores}")
print(
    f"Logistic Regression CV Mean: {lr_cv_scores.mean():.3f} (+/- {lr_cv_scores.std() * 2:.3f})"
)

print("\nModel stability analysis:")
print(f"Random Forest std deviation: {rf_cv_scores.std():.4f}")
print(f"Logistic Regression std deviation: {lr_cv_scores.std():.4f}")
print("Lower std deviation indicates more stable model across folds.")


EXERCISE 2: Cross-Validation
Random Forest CV Scores: [0.79329609 0.79775281 0.84831461 0.79213483 0.82022472]
Random Forest CV Mean: 0.810 (+/- 0.043)

Logistic Regression CV Scores: [0.78212291 0.80898876 0.78089888 0.76966292 0.80337079]
Logistic Regression CV Mean: 0.789 (+/- 0.030)

Model stability analysis:
Random Forest std deviation: 0.0215
Logistic Regression std deviation: 0.0148
Lower std deviation indicates more stable model across folds.



- Random Forest: mean accuracy 0.810 with a standard deviation of 0.0215; fold scores range from 79.2% to 84.8%.
- Logistic Regression: mean accuracy 0.789 with a standard deviation of 0.0148; fold scores range from 77.0% to 80.9%.

Logistic Regression varies less across folds, while Random Forest has the higher mean accuracy (0.810 versus 0.789) at the cost of a wider spread. For both models the fold scores stay within a few percentage points of the mean.
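The summary figures can be re-derived from the fold scores printed above. The sketch below uses only the standard library; `pstdev` is the population standard deviation, which matches NumPy's default `std()` used in the notebook.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation output above.
rf_scores = [0.79329609, 0.79775281, 0.84831461, 0.79213483, 0.82022472]
lr_scores = [0.78212291, 0.80898876, 0.78089888, 0.76966292, 0.80337079]

for name, scores in [("Random Forest", rf_scores), ("Logistic Regression", lr_scores)]:
    print(
        f"{name}: mean={mean(scores):.3f}, std={pstdev(scores):.4f}, "
        f"range={min(scores):.3f} to {max(scores):.3f}"
    )

# Gap between the two means, for comparison with the per-model spread.
mean_gap = mean(rf_scores) - mean(lr_scores)
print(f"Difference in means: {mean_gap:.3f}")
```

The gap between the two means (about 0.021) is roughly the size of the Random Forest's own standard deviation, so with only five folds the ranking of the two models should be read with some caution.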

## Exercise 3: Add Feature Engineering
Add a new column to the Titanic data, such as `FamilySize = SibSp + Parch`, and evaluate if this feature improves the model.

```python
df['FamilySize'] = df['SibSp'] + df['Parch']
# Then include it in the feature list and re-run the pipeline
```

*Does the new feature improve the prediction metrics?*

In [27]:
print("\n" + "=" * 60)
print("EXERCISE 3: Feature Engineering - Family Size")
print("=" * 60)

# Add FamilySize feature
df["FamilySize"] = df["SibSp"] + df["Parch"]

# Updated feature list including FamilySize
features_with_family = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize"]
X_family = df[features_with_family]

# Update preprocessing pipeline to handle the new feature
numeric_features_family = ["Age", "Fare", "FamilySize"]
categorical_features_family = ["Pclass", "Sex", "Embarked"]

numeric_transformer_family = Pipeline(
    [("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_transformer_family = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor_family = ColumnTransformer(
    [
        ("num", numeric_transformer_family, numeric_features_family),
        ("cat", categorical_transformer_family, categorical_features_family),
    ]
)

# Create new pipeline with family size feature
rf_family_pipeline = Pipeline(
    [
        ("preprocessing", preprocessor_family),
        ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
    ]
)

# Train and evaluate with family size feature
X_train_family, X_test_family, y_train_family, y_test_family = train_test_split(
    X_family, y, test_size=0.2, random_state=42
)

rf_family_pipeline.fit(X_train_family, y_train_family)
y_pred_family = rf_family_pipeline.predict(X_test_family)

print("Random Forest with FamilySize feature:")
print(classification_report(y_test_family, y_pred_family))

# Cross-validation comparison
cv_scores_original = cross_val_score(rf_pipeline, X, y, cv=5, scoring="accuracy")
cv_scores_family = cross_val_score(
    rf_family_pipeline, X_family, y, cv=5, scoring="accuracy"
)

print(f"\nOriginal features CV mean: {cv_scores_original.mean():.3f}")
print(f"With FamilySize CV mean: {cv_scores_family.mean():.3f}")
print(f"Improvement: {cv_scores_family.mean() - cv_scores_original.mean():.3f}")

# Save the FamilySize pipeline (better on this test split, slightly lower CV mean)
joblib.dump(rf_family_pipeline, "titanic_best_pipeline.pkl")
print("\nBest model saved as 'titanic_best_pipeline.pkl'")


EXERCISE 3: Feature Engineering - Family Size
Random Forest with FamilySize feature:
              precision    recall  f1-score   support

           0       0.84      0.86      0.85       105
           1       0.79      0.77      0.78        74

    accuracy                           0.82       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179


Original features CV mean: 0.810
With FamilySize CV mean: 0.801
Improvement: -0.009

Best model saved as 'titanic_best_pipeline.pkl'


Mixed results:

- Yes on the test set: accuracy improved from 78% to 82%.
- No in cross-validation: the mean CV score decreased from 81.0% to 80.1%.

The FamilySize feature helps on this particular train/test split but does not raise the cross-validated mean, so the gain is likely specific to the split rather than a consistent improvement.
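One way to look past a single split is to score both feature sets on the same explicit folds and inspect the per-fold differences. The sketch below does this on synthetic data from `make_classification`, not on the Titanic data; the variable names and the choice of a shuffled `StratifiedKFold` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: 6 columns, of which the last plays the role of the "new" feature.
X_all, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_base = X_all[:, :5]  # baseline feature set
X_extra = X_all        # baseline plus one extra column

# A fixed splitter guarantees both feature sets are scored on identical folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

base_scores = cross_val_score(model, X_base, y_syn, cv=folds, scoring="accuracy")
extra_scores = cross_val_score(model, X_extra, y_syn, cv=folds, scoring="accuracy")

# Per-fold differences show whether the extra feature helps consistently.
diff = extra_scores - base_scores
print("Per-fold difference:", diff.round(3))
print(f"Mean difference: {diff.mean():.3f}")
print(f"Folds where the extra feature helped: {(diff > 0).sum()} of {len(diff)}")
```

If the extra feature only helps in some folds and hurts in others, a small change in the mean, like the one reported above, is weak evidence either way.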

## Exercise 4 (Bonus): Create a Streamlit Interface
Build a simple Streamlit UI to load the trained model and predict survival based on user input.

```python
# Example streamlit interface
import streamlit as st
import joblib
import pandas as pd

model = joblib.load("titanic_pipeline.pkl")
Pclass = st.selectbox("Pclass", [1, 2, 3])
Sex = st.selectbox("Sex", ["male", "female"])
Age = st.slider("Age", 0, 100, 25)
Fare = st.slider("Fare", 0.0, 500.0, 32.0)
Embarked = st.selectbox("Embarked", ["S", "C", "Q"])

if st.button("Predict"):
    X_new = pd.DataFrame([[Pclass, Sex, Age, Fare, Embarked]],
                         columns=["Pclass", "Sex", "Age", "Fare", "Embarked"])
    pred = model.predict(X_new)
    st.write("Prediction:", "Survived" if pred[0] == 1 else "Did not survive")
```

Run the following commands in two separate bash terminals:

1. Start the API:
```bash
cd ML && uvicorn main:app --reload
```

2. Start the Streamlit app:
```bash
cd ML && streamlit run app_ui.py
```