# Introduction to Machine Learning – Titanic Dataset

This notebook introduces basic supervised learning with:
- Preprocessing (missing values, encoding)
- Feature scaling
- Pipeline creation with Scikit-learn
- Model training & evaluation
- Model saving and serving with Streamlit

In [5]:
# 📦 Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import joblib

In [6]:
# 📥 Load Titanic Dataset
import requests
import io
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the download failed
df = pd.read_csv(io.StringIO(response.content.decode('utf-8')))
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [7]:
# 🧹 Select Features and Target
features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked']
target = 'Survived'

X = df[features]
y = df[target]

In [8]:
# 🔧 Define Preprocessing Pipeline
numeric_features = ['Age', 'Fare']
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_features = ['Pclass', 'Sex', 'Embarked']
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
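
To make the effect of the preprocessing step concrete, here is a minimal sketch that fits the same kind of transformer on a small hand-made sample. The `toy` frame, its values, and the `sparse_threshold=0` setting are assumptions added for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical five-passenger sample with one gap in Age and one in Embarked.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
    "Embarked": ["S", "C", "S", "S", np.nan],
})

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(
    [
        ("num", numeric, ["Age", "Fare"]),
        ("cat", categorical, ["Pclass", "Sex", "Embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array so it is easy to inspect
)

encoded = toy_preprocessor.fit_transform(toy)

# 2 scaled numeric columns + 3 (Pclass) + 2 (Sex) + 2 (Embarked) one-hot columns
print(encoded.shape)  # (5, 9)
```

The missing `Age` is filled with the median of the other four values, the missing `Embarked` with the most frequent port, and each categorical column expands into one indicator column per observed category.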

In [9]:
# 🔁 Full Pipeline with Model
clf_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
clf_pipeline.fit(X_train, y_train)

# Evaluate
y_pred = clf_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.84      0.82       105
           1       0.76      0.72      0.74        74

    accuracy                           0.79       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179
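
The precision and recall values in the report can be traced back to confusion-matrix counts. The sketch below is a worked example under one assumption: the four counts are inferred from the rounded recall values and the supports (105 and 74) in the report above; the notebook itself never prints them.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Inferred counts (an assumption, see the lead-in): class 0 is "did not survive".
tn, fp = 88, 17   # 105 actual non-survivors: 88 predicted correctly
fn, tp = 21, 53   # 74 actual survivors: 53 predicted correctly

# Rebuild label vectors that reproduce those counts.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_hat = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_hat))

precision_1 = precision_score(y_true, y_hat)  # tp / (tp + fp) = 53 / 70
recall_1 = recall_score(y_true, y_hat)        # tp / (tp + fn) = 53 / 74
accuracy = (tn + tp) / len(y_true)            # 141 / 179

print(f"precision={precision_1:.2f} recall={recall_1:.2f} accuracy={accuracy:.4f}")
```

Read this way, a class 1 recall of 0.72 means the model misses roughly one survivor in four, which a single accuracy figure would hide.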



## Save the Trained Pipeline

In [10]:
joblib.dump(clf_pipeline, "titanic_pipeline.pkl")

['titanic_pipeline.pkl']
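
Because the saved file contains the preprocessing steps as well as the classifier, it can be reloaded and given raw passenger rows directly. The following is a minimal sketch of that round trip. To stay runnable without the download, it trains a simplified stand-in pipeline (no imputers) on the five rows shown by `df.head()`; the names `demo_pipeline`, `restored`, and `new_passenger`, the temporary file path, and the new passenger's values are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The five rows shown by df.head(), restricted to the columns the model uses.
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Survived": [0, 1, 1, 1, 0],
})
feature_cols = ["Pclass", "Sex", "Age", "Fare", "Embarked"]

# Simplified stand-in for clf_pipeline (assumption: no missing values here).
demo_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("num", StandardScaler(), ["Age", "Fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Pclass", "Sex", "Embarked"]),
    ])),
    ("classifier", LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(train[feature_cols], train["Survived"])

# Save and reload, mirroring joblib.dump(clf_pipeline, "titanic_pipeline.pkl").
path = os.path.join(tempfile.mkdtemp(), "demo_pipeline.pkl")
joblib.dump(demo_pipeline, path)
restored = joblib.load(path)

# New data must be a DataFrame with the same column names used at fit time.
new_passenger = pd.DataFrame([
    {"Pclass": 2, "Sex": "female", "Age": 29.0, "Fare": 26.0, "Embarked": "S"}
])
print(restored.predict(new_passenger))        # predicted class (0 or 1)
print(restored.predict_proba(new_passenger))  # [P(class 0), P(class 1)]
```

A pickled pipeline should be loaded with the same scikit-learn version that produced it, since the format is not guaranteed to be compatible across releases.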

## Exercise 1: Try a Different Classifier
Replace the logistic regression model in the pipeline with another classifier, such as `RandomForestClassifier`, and compare the results.

```python
from sklearn.ensemble import RandomForestClassifier
# Replace the classifier in clf_pipeline
```

*What changes do you observe in precision and recall?*

In [15]:
from sklearn.ensemble import RandomForestClassifier

# Replace the classifier in clf_pipeline
rf_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Train the model
rf_pipeline.fit(X_train, y_train)

# Evaluate
y_pred_rf = rf_pipeline.predict(X_test)
print("Random Forest Results:")
print(classification_report(y_test, y_pred_rf))

# Compare with original Logistic Regression results
print("\nOriginal Logistic Regression Results:")
print(classification_report(y_test, y_pred))

# Answer the question: What changes do you observe in precision and recall?
# The numbers below describe the run recorded in this notebook;
# RandomForestClassifier() is unseeded, so a rerun can differ slightly.
print("\nWhat changes do you observe in precision and recall?")
print("In this run the two models are nearly tied:")
print("- Survivors (class 1): precision falls 0.76 -> 0.73, recall rises 0.72 -> 0.74")
print("- Non-survivors (class 0): precision rises 0.81 -> 0.82, recall falls 0.84 -> 0.81")
print("- Accuracy is 0.78 vs 0.79, so the forest shows no clear gain on 179 test rows")

Random Forest Results:
              precision    recall  f1-score   support

           0       0.82      0.81      0.81       105
           1       0.73      0.74      0.74        74

    accuracy                           0.78       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.78      0.78      0.78       179


Original Logistic Regression Results:
              precision    recall  f1-score   support

           0       0.81      0.84      0.82       105
           1       0.76      0.72      0.74        74

    accuracy                           0.79       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179


What changes do you observe in precision and recall?
In this run the two models are nearly tied:
- Survivors (class 1): precision falls 0.76 -> 0.73, recall rises 0.72 -> 0.74
- Non-survivors (class 0): precision rises 0.81 -> 0.82, recall falls 0.84 -> 0.81
- Accuracy is 0.78 vs 0.79, so the forest shows no clear gain on 179 test rows
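
A single 179-row split gives only a rough comparison between two classifiers, and an unseeded forest changes from run to run. One possible way to make the comparison steadier is sketched below: copy the pipeline, swap only the `classifier` step with `set_params`, fix the forest's seed, and score both models on the same folds. The synthetic data from `make_classification` and the names `lr_demo`, `rf_demo`, and `results` are assumptions so that the sketch runs offline; with the notebook's objects you would pass `X` and `y` instead.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), purely numeric for brevity.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

lr_demo = Pipeline([
    ("preprocessing", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Copy the pipeline and replace only the final step.
# A fixed random_state makes the forest's results repeatable.
rf_demo = clone(lr_demo).set_params(
    classifier=RandomForestClassifier(random_state=0)
)

# Reusing one seeded splitter means both models see identical folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in [("logistic regression", lr_demo), ("random forest", rf_demo)]:
    results[name] = cross_val_score(model, X_demo, y_demo, cv=folds)
    print(f"{name}: {results[name].mean():.3f} +/- {results[name].std():.3f}")
```

If the gap between the two means is smaller than the fold-to-fold spread, the data does not support calling one model better.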


## Exercise 2: Use Cross-Validation
Apply cross-validation on the pipeline instead of a single train/test split.

```python
from sklearn.model_selection import cross_val_score
```

*Is the model stable across folds?*

In [14]:
from sklearn.model_selection import cross_val_score

# Apply cross-validation on the pipeline instead of a single train/test split
cv_scores = cross_val_score(clf_pipeline, X, y, cv=5)

print("Cross-validation scores:")
for i, score in enumerate(cv_scores):
    print(f"Fold {i+1}: {score:.4f}")

print(f"\nMean CV Score: {cv_scores.mean():.4f}")
print(f"Standard Deviation: {cv_scores.std():.4f}")

# Answer the question: Is the model stable across folds?
if cv_scores.std() < 0.02:
    print("\n✅ Yes, the model is stable across folds (low standard deviation)")
elif cv_scores.std() < 0.05:
    print("\n✅ The model shows good stability across folds")
else:
    print("\n⚠️ The model shows some variability across folds")

Cross-validation scores:
Fold 1: 0.7821
Fold 2: 0.8090
Fold 3: 0.7809
Fold 4: 0.7697
Fold 5: 0.8034

Mean CV Score: 0.7890
Standard Deviation: 0.0148

✅ Yes, the model is stable across folds (low standard deviation)
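
`cross_val_score` reports one metric (accuracy by default for classifiers). Since the earlier cells focus on precision and recall, a natural extension is `cross_validate`, which accepts several scorers at once. The sketch below is illustrative only: it uses synthetic data and a simplified pipeline (both assumptions) so that it runs without the Titanic download.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with X and y from the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

cv_model = Pipeline([
    ("preprocessing", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Shuffled, stratified folds keep the class ratio similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

metrics = ["accuracy", "precision", "recall"]
cv_results = cross_validate(cv_model, X_demo, y_demo, cv=folds, scoring=metrics)

# Each metric is stored under the key "test_<metric>" with one value per fold.
for metric in metrics:
    scores = cv_results[f"test_{metric}"]
    print(f"{metric:>9}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Looking at the spread of recall as well as accuracy shows whether stability holds for the minority class, not only overall.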


## Exercise 3: Add Feature Engineering
Add a new column to the Titanic data, such as `FamilySize = SibSp + Parch`, and evaluate whether this feature improves the model.

```python
df['FamilySize'] = df['SibSp'] + df['Parch']
# Then include it in the feature list and re-run the pipeline
```

*Does the new feature improve the prediction metrics?*

In [13]:
from sklearn.metrics import accuracy_score

# Add new feature: FamilySize
df['FamilySize'] = df['SibSp'] + df['Parch']

# Update feature list to include the new feature
features_with_family = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'FamilySize']

X_with_family = df[features_with_family]
y = df['Survived']

# Update preprocessing to handle the new numeric feature
numeric_features_updated = ['Age', 'Fare', 'FamilySize']
categorical_features_updated = ['Pclass', 'Sex', 'Embarked']

numeric_transformer_updated = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer_updated = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor_updated = ColumnTransformer([
    ('num', numeric_transformer_updated, numeric_features_updated),
    ('cat', categorical_transformer_updated, categorical_features_updated)
])

# Create new pipeline with FamilySize feature
clf_pipeline_family = Pipeline([
    ('preprocessing', preprocessor_updated),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Split and train with new feature
X_train_family, X_test_family, y_train_family, y_test_family = train_test_split(
    X_with_family, y, test_size=0.2, random_state=42
)

clf_pipeline_family.fit(X_train_family, y_train_family)

# Evaluate the model with FamilySize
y_pred_family = clf_pipeline_family.predict(X_test_family)

print("="*50)
print("ORIGINAL MODEL (without FamilySize)")
print("="*50)
print(classification_report(y_test, y_pred))
original_accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {original_accuracy:.4f}")

print("\n" + "="*50)
print("ENHANCED MODEL (with FamilySize)")
print("="*50)
print(classification_report(y_test_family, y_pred_family))
family_accuracy = accuracy_score(y_test_family, y_pred_family)
print(f"Accuracy: {family_accuracy:.4f}")

print("\n" + "="*30)
print("IMPROVEMENT ANALYSIS")
print("="*30)
improvement = family_accuracy - original_accuracy
print(f"Original accuracy: {original_accuracy:.4f}")
print(f"With FamilySize:   {family_accuracy:.4f}")
print(f"Improvement:       {improvement:+.4f}")

# A gain on one 179-row split can come down to a couple of passengers,
# so treat the result as a hint rather than proof.
if improvement > 0:
    print("✅ FamilySize improves accuracy on this split (confirm with cross-validation)")
else:
    print("❌ FamilySize does not improve accuracy on this split")

# Save the updated model
joblib.dump(clf_pipeline_family, "titanic_pipeline_with_family.pkl")

ORIGINAL MODEL (without FamilySize)
              precision    recall  f1-score   support

           0       0.81      0.84      0.82       105
           1       0.76      0.72      0.74        74

    accuracy                           0.79       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179

Accuracy: 0.7877

ENHANCED MODEL (with FamilySize)
              precision    recall  f1-score   support

           0       0.81      0.86      0.83       105
           1       0.78      0.72      0.75        74

    accuracy                           0.80       179
   macro avg       0.80      0.79      0.79       179
weighted avg       0.80      0.80      0.80       179

Accuracy: 0.7989

IMPROVEMENT ANALYSIS
Original accuracy: 0.7877
With FamilySize:   0.7989
Improvement:       +0.0112
✅ FamilySize improves accuracy on this split (confirm with cross-validation)


['titanic_pipeline_with_family.pkl']
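
The pipeline saved above expects a `FamilySize` column, so any code that loads it must recreate that column first. One possible alternative, sketched below, is to compute the feature inside the pipeline with `FunctionTransformer`, so that the model accepts raw `SibSp` and `Parch` values. The helper `add_family_size`, the name `family_pipeline`, and the simplified preprocessing (no imputers) are assumptions; the training rows are the five shown by `df.head()`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def add_family_size(frame):
    """Return a copy of the frame with FamilySize = SibSp + Parch."""
    out = frame.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"]
    return out


# The five rows shown by df.head(), with the raw SibSp and Parch columns.
raw = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Survived": [0, 1, 1, 1, 0],
})
X_raw = raw.drop(columns="Survived")

family_pipeline = Pipeline([
    # Step 1: derive FamilySize from the raw columns.
    ("family", FunctionTransformer(add_family_size)),
    # Step 2: scale and encode; SibSp and Parch are dropped by default.
    ("preprocessing", ColumnTransformer([
        ("num", StandardScaler(), ["Age", "Fare", "FamilySize"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Pclass", "Sex", "Embarked"]),
    ])),
    ("classifier", LogisticRegression(max_iter=1000)),
])

family_pipeline.fit(X_raw, raw["Survived"])
predictions = family_pipeline.predict(X_raw)
print(predictions)
```

One caveat with this design: pickling stores a reference to `add_family_size`, not its code, so the function has to be importable wherever the saved pipeline is loaded.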

## Exercise 4 (Bonus): Create a Streamlit Interface
Build a simple Streamlit UI to load the trained model and predict survival based on user input.

```python
# Example streamlit interface
import streamlit as st
import joblib
import pandas as pd

model = joblib.load("titanic_pipeline.pkl")
Pclass = st.selectbox("Pclass", [1, 2, 3])
Sex = st.selectbox("Sex", ["male", "female"])
Age = st.slider("Age", 0, 100, 25)
Fare = st.slider("Fare", 0.0, 500.0, 32.0)
Embarked = st.selectbox("Embarked", ["S", "C", "Q"])

if st.button("Predict"):
    X_new = pd.DataFrame([[Pclass, Sex, Age, Fare, Embarked]],
                         columns=["Pclass", "Sex", "Age", "Fare", "Embarked"])
    pred = model.predict(X_new)
    st.write("Prediction:", "Survived" if pred[0] == 1 else "Did not survive")
```

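👉 *Try running your Streamlit app locally: save the code above to a file (for example `app.py`), keep `titanic_pipeline.pkl` in the same folder, and start it with `streamlit run app.py`.*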