<a href="https://colab.research.google.com/github/ayeshahabib01/github-introfall25-ayeshahabib01/blob/main/hw5_ml_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 5 - Understanding Machine Learning Pipeline

- Read the code below and understand the whole process of building a machine learning pipeline
- Answer the 5 questions in the markdown cells
- Store your answers and submit your ipynb file via Canvas
- You CAN use any resources including internet and GenAI tools (remember you can use ChatGPT to help you understand the code)

## A Machine Learning Pipeline for Titanic Dataset Survival Prediction

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

# Load dataset
titanic = sns.load_dataset('titanic')

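# Data cleaning
# Assign the filled column back instead of calling fillna(inplace=True) on a column
# selection; pandas warns that the chained inplace form stops working in 3.0.
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])
titanic['fare'] = titanic['fare'].fillna(titanic['fare'].median())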

# Encode categorical variables
titanic['sex'] = LabelEncoder().fit_transform(titanic['sex'])
titanic['embarked'] = LabelEncoder().fit_transform(titanic['embarked'].astype(str))

# Define raw features
X_raw = titanic[['pclass', 'sex', 'age', 'fare', 'sibsp', 'parch', 'embarked']]
y = titanic['survived']

# Feature engineering
# Actually, some of the features below can be insightful, but some might be noise.
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1 # family size = sibsp + parch + 1
titanic['is_alone'] = (titanic['family_size'] == 1).astype(int) # is_alone = 1 if family size == 1, otherwise 0
titanic['fare_bin'] = pd.qcut(titanic['fare'], 4, labels=[1, 2, 3, 4]).astype(int) # fare_bin = 1, 2, 3, 4; mapping fare into 4 bins
titanic['age_fare_ratio'] = titanic['age'] / (titanic['fare'] + 1) # age_fare_ratio = age / (fare + 1)
titanic['sibsp_parch_ratio'] = (titanic['sibsp'] + 1) / (titanic['parch'] + 1) # sibsp_parch_ratio = (sibsp + 1) / (parch + 1)
titanic['age_class_interaction'] = titanic['age'] * titanic['pclass'] # age_class_interaction = age * pclass
titanic['fare_per_family_member'] = titanic['fare'] / (titanic['family_size']) # fare_per_family_member = fare / (family_size)

# Combine features
X_engineered = titanic[['pclass', 'sex', 'age', 'fare', 'embarked', 'family_size', 'is_alone',
                       'fare_bin', 'age_fare_ratio', 'sibsp_parch_ratio', 'age_class_interaction',
                       'fare_per_family_member']]

model = RandomForestClassifier(random_state=42)

# Define pipeline with feature selection and scaling with Random Forest model
pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=6)), # select 6 features with the highest F-values versus full features
    ('scaling', StandardScaler()),
    ('model', model)
])

# Perform 5-fold cross-validation with the pipeline
cv_scores = cross_val_score(pipeline, X_engineered, y, cv=5, scoring='accuracy')

# Output cross-validation results
print("5-Fold Cross-Validation AccuracyScores:", cv_scores)
print("Mean Accuracy Score:", cv_scores.mean())

5-Fold Cross-Validation AccuracyScores: [0.74860335 0.79213483 0.86516854 0.79775281 0.8258427 ]
Mean Accuracy Score: 0.8059004456719603


#### Question 1: please describe the process of building this machine learning pipeline - step by step.

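Your answer here: The process has five stages. First, the Titanic dataset is loaded with seaborn and cleaned: missing values in age and fare are filled with the column median, and missing values in embarked are filled with the most frequent value. Second, the categorical columns sex and embarked are converted to integers with LabelEncoder. Third, seven new features are engineered from the raw columns (family_size, is_alone, fare_bin, age_fare_ratio, sibsp_parch_ratio, age_class_interaction, and fare_per_family_member) and combined with the original columns pclass, sex, age, fare, and embarked into X_engineered. Fourth, a scikit-learn Pipeline chains three steps: SelectKBest keeps the six features with the highest ANOVA F-scores (f_classif) against the target, StandardScaler rescales those features to a mean of zero and a standard deviation of one, and RandomForestClassifier is trained on the result. Scaling has little effect on a tree-based model such as a random forest, but it keeps the pipeline reusable with scale-sensitive models. Finally, the whole pipeline is evaluated with 5-fold cross-validation: the data is split into five parts, the pipeline is fit on four parts and scored on the fifth, and this is repeated so that each part serves as the test set once. Because selection and scaling sit inside the pipeline, those two steps are refit on each training split and never see the held-out fold.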
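
The pipeline and cross-validation steps described in this answer can be sketched on synthetic data. This is a minimal illustration, not the notebook's code: `make_classification` stands in for the Titanic features, and the sizes (`n_samples=300`, `n_features=12`, `n_estimators=50`) and the `demo_` names are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered Titanic features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=42
)

demo_pipeline = Pipeline([
    ("feature_selection", SelectKBest(score_func=f_classif, k=6)),
    ("scaling", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each of the 5 folds refits selection, scaling, and the forest on its own
# training split, then scores accuracy on the held-out split.
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(demo_scores, demo_scores.mean())
```

Keeping the selector and scaler inside the `Pipeline` is the design choice that matters here: `cross_val_score` clones and refits the whole pipeline per fold, so the held-out fold is not used when choosing features or computing scaling statistics.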

#### Question 2: based on what you have learned in the lecture, please explain why we need to extract more features versus using raw features only.

Your answer here: We need to extract more features because raw features don't always capture complex patterns or interactions in the data. Feature engineering lets us create new variables that give the model more information, for example by combining or transforming existing features to highlight important relationships. In the Titanic dataset, raw features like sibsp or parch alone may not fully reflect a passenger’s social context, but engineered features like family_size or is_alone capture that context directly.
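
As a small worked example of the two features named in this answer, the sketch below applies the notebook's formulas for `family_size` and `is_alone` to three made-up rows. The `toy` frame and its values are hypothetical and are not taken from the Titanic dataset.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the Titanic dataset.
toy = pd.DataFrame({"sibsp": [0, 1, 3], "parch": [0, 2, 1]})

# family_size counts the passenger plus siblings/spouses and parents/children.
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1
# is_alone is 1 only when nobody else travels with the passenger.
toy["is_alone"] = (toy["family_size"] == 1).astype(int)

print(toy)
```

The first row (no relatives aboard) gets `family_size` 1 and `is_alone` 1; the other two rows get sizes 4 and 5 and `is_alone` 0.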

#### Question 3: based on what you have learned in the lecture, please explain why we need feature selection.

Your answer here: We need feature selection because not all features contribute useful information to the model, and some may even introduce noise or irrelevant patterns. By keeping only the most informative features, we can reduce the risk of overfitting, often improve accuracy, and make the model faster and easier to interpret. Feature selection also focuses the model on the variables that matter most, which can simplify analysis and improve generalization to new data. In the pipeline, SelectKBest with f_classif automatically picks the six features with the highest ANOVA F-scores against the target, keeping the inputs most strongly related to survival.
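
To make the selection step concrete, here is a hedged sketch of `SelectKBest` with `f_classif` on synthetic data. The arrays `X_toy` and `y_toy` are assumptions built for illustration: column 0 is constructed to follow the label and the other four columns are pure noise, so the F-score of column 0 should dominate.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1], 100)                  # 100 samples per class
signal = y_toy + rng.normal(0, 0.3, size=200)   # tracks the label closely
noise = rng.normal(0, 1, size=(200, 4))         # unrelated to the label
X_toy = np.column_stack([signal, noise])

# Score every column against the label and keep the top k=2.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_toy, y_toy)
kept = selector.get_support(indices=True)

print("F-scores:", selector.scores_.round(1))
print("Kept columns:", kept)
```

Note that `f_classif` scores each feature on its own, so it cannot detect a feature that is only useful in combination with another.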

#### Question 4: see printed results (accuracy scores across 5 folds versus averaged accuracy score), explain the benefits of using 5-fold CV versus one-time 80-20 split.

Your answer here: Using 5-fold cross-validation has several benefits over a one-time 80-20 train-test split. In 5-fold CV, the data is split into five parts; the model is trained on four parts and tested on the remaining part, and the process repeats five times so that each part is used as the test set once. This produces five accuracy scores and their mean, which is a more reliable estimate of model performance than any single score. The printed results show why: the fold accuracies range from about 0.749 to 0.865, while the mean is about 0.806. A single 80-20 split evaluates the model on only one random subset, so its result could land near either end of that range, too optimistic or too pessimistic, depending on which samples fall into the test set. Averaging over five folds reduces this variance, ensures that every data point is used for testing exactly once, and gives a better sense of how the model will generalize to unseen data.
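
The spread across folds can be quantified directly from the printed scores. The sketch below uses only the standard library; the variable names are illustrative, and the fold values are copied from the output of the first cell.

```python
from statistics import mean, stdev

# Fold accuracies copied from the printed output of the first cell.
fold_scores = [0.74860335, 0.79213483, 0.86516854, 0.79775281, 0.8258427]

cv_mean = mean(fold_scores)
cv_std = stdev(fold_scores)                     # sample standard deviation
cv_range = max(fold_scores) - min(fold_scores)  # best fold minus worst fold

print(f"mean={cv_mean:.4f} std={cv_std:.4f} range={cv_range:.4f}")
```

The gap between the best and worst fold is close to 12 percentage points, which is roughly how far apart two single 80-20 splits of this data could plausibly land.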


## Compare the performance of different feature sets:
- 1. raw features only;
- 2. with engineered features;
- 3. engineered features with feature selection

In [None]:
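# This cell requires the first cell to have been run in the same session: it reuses
# X_raw, X_engineered, y, model, and pipeline. Running it in a fresh session raises a NameError.
# The imports used by this cell are repeated here for clarity.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Initialize a pipeline without feature selection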
pipeline_no_selection = Pipeline([
    ('scaling', StandardScaler()),
    ('model', model)
])

# 1. Raw features only
cv_scores_raw = cross_val_score(pipeline_no_selection, X_raw, y, cv=5, scoring='accuracy')

# 2. Original plus engineered features (without feature selection)
cv_scores_extraction = cross_val_score(pipeline_no_selection, X_engineered, y, cv=5, scoring='accuracy')

# 3. Engineered features with feature selection
# (uses the pipeline from the first cell; SelectKBest is refit inside each fold)
cv_scores_full = cross_val_score(pipeline, X_engineered, y, cv=5, scoring='accuracy')

# Results summary
results_df = pd.DataFrame({
    'Experiment': [
        'Raw Features Only',
        'With Feature Extraction',
        'With Feature Extraction & Feature Selection'
    ],
    'Mean Score': [
        cv_scores_raw.mean(),
        cv_scores_extraction.mean(),
        cv_scores_full.mean()
    ]
})

# Display results
print(results_df)


NameError: name 'Pipeline' is not defined

#### Question 5: use formal language to explain printed results "results_df".

Your answer here: results_df is intended to compare the model’s mean 5-fold cross-validation accuracy across three feature sets. As recorded above, the cell stopped with a NameError, which indicates that the first cell had not been run in that session, so no table was printed. The interpretation below therefore describes what each row measures and how to read it once the cell is rerun.

1) Raw Features Only: This baseline uses only the original variables (pclass, sex, age, fare, sibsp, parch, embarked). Its mean accuracy reflects the predictive power of the raw data without any derived features.

2) With Feature Extraction: This configuration uses X_engineered, which adds the engineered features (e.g., family_size, is_alone, fare_bin) and drops sibsp and parch. If its mean accuracy exceeds the baseline, the engineered features provide additional signal for distinguishing survivors from non-survivors; if it does not, some of them mainly add noise.

3) With Feature Extraction & Feature Selection: This is the same pipeline and data evaluated in the first cell, so its mean accuracy should match the value printed there, about 0.806. If this value is higher than in row 2, selecting the six most relevant features removes noise from less informative variables; if it is similar, selection achieves comparable accuracy with a simpler model.

## Feature Importance Analysis
- Note from lecturer:
- Feature importance, in ML, measures how each feature contributes to the prediction accuracy of a machine learning model. It quantitatively reveals the influence of various factors on the model's predictive outcomes.
- I found some final project groups proposed investigating the influence of certain factors on predictive outcomes but did not include a detailed methodological plan.
- More specifically, descriptive analysis and visualizations are good for finding insights, but they are not enough to draw reliable quantitative conclusions.
- For these groups, feature importance analysis might be a helpful approach to highlight the impact of different factors. I’ve attached the following code as a reference!

In [None]:
# Fit pipeline on the entire dataset to select features
pipeline.fit(X_engineered, y)

# Get selected features
selected_features_indices = pipeline.named_steps['feature_selection'].get_support(indices=True)
selected_features = X_engineered.columns[selected_features_indices]

# Scale the data using the fitted scaler
X_selected_scaled = pipeline.named_steps['scaling'].transform(X_engineered.iloc[:, selected_features_indices])

# Train model on the entire dataset with selected features for global feature importance
model = RandomForestClassifier(random_state=42)
model.fit(X_selected_scaled, y)
feature_importances = model.feature_importances_

# Display selected features and their importance
feature_importance_df = pd.DataFrame({
    'Feature': selected_features,
    'Importance': feature_importances,
    'Original or Engineered': ['Engineered' if f in ['family_size', 'is_alone', 'fare_bin',
                                                     'age_fare_ratio', 'sibsp_parch_ratio',
                                                     'age_class_interaction', 'fare_per_family_member']
                               else 'Original' for f in selected_features]
}).sort_values(by='Importance', ascending=False)

# Display results
print(feature_importance_df)



                  Feature  Importance Original or Engineered
4   age_class_interaction    0.283923             Engineered
1                     sex    0.260717               Original
5  fare_per_family_member    0.191335             Engineered
2                    fare    0.175876               Original
0                  pclass    0.058137               Original
3                fare_bin    0.030013             Engineered


#### No question for feature importance :)

#### Question 5: Considering these results, why might engineered features like age_class_interaction and fare_per_family_member outperform original features?


Your answer here: In the importance table, age_class_interaction (0.284) and fare_per_family_member (0.191) rank first and third, ahead of the original fare (0.176) and pclass (0.058), although the original feature sex (0.261) ranks second. Engineered features like these can rank highly because they encode relationships that the raw features only express separately. For example, age_class_interaction combines a passenger’s age and class, reflecting that survival likelihood may depend not on age or class independently but on how the two interact. Similarly, fare_per_family_member divides the fare by family size, giving a measure of the fare paid per person rather than the total fare alone. These features hand the model a ready-made signal that it would otherwise have to approximate with several splits, which can improve predictive accuracy compared with using only the original features. The importances should still be read with care: fare, fare_bin, and fare_per_family_member are all derived from the same column, so importance tends to be split among them.
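
The table above reports the random forest's impurity-based `feature_importances_`, which are computed from the training data. One way to cross-check such a ranking is permutation importance on held-out data. This is a different technique from the one the notebook uses, and the sketch below runs on synthetic data from `make_classification`; all sizes and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the six selected features (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importances: derived from the splits made on the training data.
impurity_based = forest.feature_importances_

# Permutation importance: drop in held-out accuracy when one column is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=42)

for i in range(X_demo.shape[1]):
    print(f"feature {i}: impurity={impurity_based[i]:.3f} "
          f"permutation={perm.importances_mean[i]:.3f}")
```

If the two rankings broadly agree, the conclusion about which features matter is on firmer ground; large disagreements are a prompt to look at correlated or high-cardinality features.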

#### Question 6: Why do you think sex is such an important predictor? Are there any ethical concerns with including this as a variable in a model? Explain your answers.


Your answer here: The variable sex is an important predictor because, on the Titanic, men and women had very different survival rates due to the “women and children first” rule. This makes sex strongly linked to whether someone survived or not.
However, using sex in a model can raise ethical concerns. In real-world applications, relying on sensitive attributes like gender can lead to biased or unfair decisions. While it is fine in this historical dataset for understanding past events, in modern situations we need to be careful to avoid discrimination and consider fairness when including such features.