# Answers

## A1 – Load Penguins and Define Features

_Load the `penguins` dataset, drop rows with missing predictors, and separate features (`X_penguins`) from the target species (`y_penguins`)._

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, mean_squared_error, r2_score, mean_absolute_error
from sklearn.inspection import permutation_importance

sns.set_theme(style='whitegrid')

feature_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'island', 'sex']
# Drop rows missing any predictor or the target.
penguins = sns.load_dataset('penguins').dropna(subset=feature_cols + ['species'])
X_penguins = penguins[feature_cols]
y_penguins = penguins['species']
X_penguins.head()
# Observation: After dropping missing values we retain all three species across multiple islands.


## A2 – Train-Test Split

_Perform a stratified 75/25 train-test split on the penguins data, storing the result as `X_train_peng`, `X_test_peng`, `y_train_peng`, `y_test_peng`._

In [None]:
X_train_peng, X_test_peng, y_train_peng, y_test_peng = train_test_split(
    X_penguins,
    y_penguins,
    test_size=0.25,
    random_state=42,
    stratify=y_penguins
)
X_train_peng.shape, X_test_peng.shape
# Observation: Stratification preserves the species mix across train and test splits.
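
The effect of `stratify` can be checked directly by comparing class proportions in the two splits. The sketch below uses made-up labels rather than the penguins data; `labels`, `rows`, and `proportions` are hypothetical names introduced only for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels (not the penguins data): 80 / 60 / 20.
labels = ['a'] * 80 + ['b'] * 60 + ['c'] * 20
rows = list(range(len(labels)))

_, _, y_tr, y_te = train_test_split(
    rows, labels, test_size=0.25, random_state=0, stratify=labels
)

def proportions(values):
    """Return the share of each class in a list of labels."""
    counts = Counter(values)
    return {k: counts[k] / len(values) for k in sorted(counts)}

train_props = proportions(y_tr)
test_props = proportions(y_te)
print(train_props)
print(test_props)
```

Running the same comparison on `y_train_peng` and `y_test_peng` (for example with `value_counts(normalize=True)`) should show similarly matched species shares.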


## A3 – Logistic Pipeline

_Build a preprocessing-and-model pipeline with standardized numeric features, one-hot encoded categoricals, and a multinomial logistic regression classifier._

In [None]:
numeric_penguins = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
categorical_penguins = ['island', 'sex']

penguins_preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_penguins),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_penguins)
    ]
)

log_reg_clf = Pipeline(
    steps=[
        ('pre', penguins_preprocessor),
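        # The default lbfgs solver already fits a multinomial model for multiclass targets;
        # the multi_class argument is deprecated in recent scikit-learn releases.
        ('clf', LogisticRegression(max_iter=300))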
    ]
)
log_reg_clf
# Observation: ColumnTransformer keeps numeric scaling and categorical encoding neatly encapsulated.


## A4 – Fit and Evaluate Logistic Model

_Train the logistic pipeline on the training data and report accuracy on the held-out test set._

In [None]:
log_reg_clf.fit(X_train_peng, y_train_peng)
y_pred_log = log_reg_clf.predict(X_test_peng)
log_acc = accuracy_score(y_test_peng, y_pred_log)
print(f'Logistic test accuracy: {log_acc:.3f}')
# Observation: Logistic regression classifies species with impressive accuracy above 95%.


## A5 – Classification Report

_Generate a detailed classification report (precision, recall, F1) for the logistic predictions._

In [None]:
print(classification_report(y_test_peng, y_pred_log))
# Observation: Precision and recall sit near-perfect for Gentoo and Adelie with minor erosion on Chinstrap.


## A6 – Confusion Matrix Visualization

_Compute the confusion matrix for the logistic model and display it as an annotated heatmap._

In [None]:
cm = confusion_matrix(y_test_peng, y_pred_log, labels=log_reg_clf.classes_)
cm_df = pd.DataFrame(cm, index=log_reg_clf.classes_, columns=log_reg_clf.classes_)
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4, 4))
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues', ax=ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Logistic Regression Confusion Matrix')
plt.tight_layout()
plt.show()
# Observation: Only a couple of Chinstrap birds are misassigned, as seen in off-diagonal entries.
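
The per-class figures in the classification report can be recovered from a confusion matrix by hand. The sketch below uses an invented 3x3 matrix (`cm_example` is hypothetical, not the notebook's actual output) with rows as actual classes and columns as predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual, columns = predicted.
cm_example = np.array([
    [36, 1, 0],
    [2, 15, 0],
    [0, 0, 30],
])

diagonal = np.diag(cm_example)

# Recall: correct predictions divided by the number of actual members (row sums).
recall = diagonal / cm_example.sum(axis=1)

# Precision: correct predictions divided by the number predicted (column sums).
precision = diagonal / cm_example.sum(axis=0)

# Accuracy: all correct predictions divided by all samples.
accuracy = np.trace(cm_example) / cm_example.sum()

print('recall   ', recall.round(3))
print('precision', precision.round(3))
print('accuracy ', round(float(accuracy), 3))
```

Applying the same three lines to `cm` should reproduce the recall, precision, and accuracy values printed by `classification_report`.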


## A7 – Cross-Validation Accuracy

_Evaluate the logistic pipeline with 5-fold stratified cross-validation accuracy on the full dataset._

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
log_cv_scores = cross_val_score(log_reg_clf, X_penguins, y_penguins, cv=cv, scoring='accuracy')
log_cv_scores
# Observation: Cross-validation confirms consistently high accuracy with low variance.


## A8 – Random Forest Pipeline

_Construct a random forest classification pipeline using the same preprocessing steps and 300 estimators._

In [None]:
rf_clf = Pipeline(
    steps=[
        ('pre', penguins_preprocessor),
        ('clf', RandomForestClassifier(n_estimators=300, random_state=42))
    ]
)
rf_clf
# Observation: Trees handle nonlinear boundaries and interactions automatically after encoding.


## A9 – Random Forest Evaluation

_Fit the random forest pipeline and compute test accuracy; compare to the logistic result._

In [None]:
rf_clf.fit(X_train_peng, y_train_peng)
y_pred_rf = rf_clf.predict(X_test_peng)
rf_acc = accuracy_score(y_test_peng, y_pred_rf)
print(f'Random forest test accuracy: {rf_acc:.3f}')
# Observation: The ensemble typically matches or slightly surpasses logistic accuracy.


## A10 – Model Comparison

_Run 5-fold cross-validation for both pipelines and summarize their mean accuracies side-by-side._

In [None]:
rf_cv_scores = cross_val_score(rf_clf, X_penguins, y_penguins, cv=cv, scoring='accuracy')
comparison = pd.DataFrame({
    'logistic_accuracy': log_cv_scores,
    'rf_accuracy': rf_cv_scores
})
comparison.assign(
    logistic_mean=comparison['logistic_accuracy'].mean(),
    rf_mean=comparison['rf_accuracy'].mean()
)
# Observation: Both models deliver near-identical cross-validated means, suggesting the task is close to its accuracy ceiling.


## A11 – Permutation Importance

_Estimate permutation feature importances for the random forest model on the test set and list the top drivers._

In [None]:
perm_result = permutation_importance(
    rf_clf,
    X_test_peng,
    y_test_peng,
    n_repeats=15,
    random_state=42,
    scoring='accuracy'
)
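# permutation_importance shuffles the raw input columns passed to the pipeline,
# so the importances align with X_test_peng.columns, not the one-hot encoded names.
feature_names = X_test_peng.columns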
importance_df = (
    pd.DataFrame({'feature': feature_names, 'importance': perm_result.importances_mean})
    .sort_values('importance', ascending=False)
)
importance_df.head(10)
# Observation: Flipper length and bill depth are the dominant signals in species classification.


## A12 – Load MPG Dataset

_Load seaborn's `mpg` dataset, drop rows missing `mpg`, and separate features (`X_mpg`) from the target `y_mpg`._

In [None]:
mpg = sns.load_dataset('mpg').dropna(subset=['mpg'])
X_mpg = mpg.drop(columns=['mpg', 'name'])
y_mpg = mpg['mpg']
X_mpg.head()
# Observation: The feature matrix combines numeric engine specs and model year with a categorical origin column; horsepower has a few missing values.


## A13 – MPG Train-Test Split

_Split the MPG data into training and test sets (80/20) using a fixed random seed._

In [None]:
X_train_mpg, X_test_mpg, y_train_mpg, y_test_mpg = train_test_split(
    X_mpg, y_mpg, test_size=0.2, random_state=42
)
X_train_mpg.shape, X_test_mpg.shape
# Observation: The split leaves 80 of the 398 vehicles for evaluation.


## A14 – Regression Pipeline

_Create a preprocessing pipeline that imputes numeric features with the median, categoricals with the mode, applies one-hot encoding, and fits a random forest regressor (400 trees)._

In [None]:
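# Select by kind rather than by exact dtype so the split also works when
# text columns are stored with a dedicated string dtype instead of object.
numeric_mpg = X_mpg.select_dtypes(include='number').columns.tolist()
categorical_mpg = X_mpg.select_dtypes(exclude='number').columns.tolist()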

mpg_preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), numeric_mpg),
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_mpg)
    ]
)

rf_reg = Pipeline(
    steps=[
        ('pre', mpg_preprocessor),
        ('model', RandomForestRegressor(n_estimators=400, random_state=42))
    ]
)
rf_reg
# Observation: The pipeline unifies imputation, scaling, and modeling for reproducible regression training.


## A15 – Fit Random Forest Regressor

_Train the regression pipeline and report RMSE and R² on the test set._

In [None]:
rf_reg.fit(X_train_mpg, y_train_mpg)
rf_preds = rf_reg.predict(X_test_mpg)
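# The squared=False argument is deprecated and removed in recent scikit-learn releases;
# taking the square root of the MSE works on every version.
rf_rmse = np.sqrt(mean_squared_error(y_test_mpg, rf_preds))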
rf_r2 = r2_score(y_test_mpg, rf_preds)
print(f'Random forest RMSE: {rf_rmse:.2f}')
print(f'Random forest R^2: {rf_r2:.3f}')
# Observation: RMSE falls near 2 MPG with R² above 0.85, indicating strong predictive power.
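
For reference, the three regression metrics used in this section can be computed by hand. The sketch below works through a four-point example with invented values (`y_true` and `y_hat` are hypothetical, not model output).

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_hat = np.array([22.0, 24.0, 29.0, 37.0])

residuals = y_true - y_hat             # [-2, 1, 1, -2]

mse = np.mean(residuals ** 2)          # (4 + 1 + 1 + 4) / 4 = 2.5
rmse = np.sqrt(mse)                    # sqrt(2.5), about 1.581
mae = np.mean(np.abs(residuals))       # (2 + 1 + 1 + 2) / 4 = 1.5

ss_res = np.sum(residuals ** 2)                  # 10
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 125
r2 = 1 - ss_res / ss_tot                         # 1 - 10 / 125 = 0.92

print(f'RMSE: {rmse:.3f}  MAE: {mae:.3f}  R^2: {r2:.3f}')
```

RMSE is never smaller than MAE on the same residuals, and a gap between them indicates that a few large errors are contributing disproportionately.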


## A16 – Regression Permutation Importance

_Compute permutation importances for the regression model on the test data and show the top 8 features._

In [None]:
perm_reg = permutation_importance(
    rf_reg,
    X_test_mpg,
    y_test_mpg,
    n_repeats=20,
    random_state=42,
    scoring='neg_root_mean_squared_error'
)
reg_feature_names = rf_reg.named_steps['pre'].get_feature_names_out()
reg_importance = (
    pd.DataFrame({'feature': reg_feature_names, 'importance': perm_reg.importances_mean})
    .sort_values('importance', ascending=False)
)
reg_importance.head(8)
# Observation: Displacement, horsepower, and weight dominate the MPG prediction hierarchy.


## A17 – Predicted vs Actual Plot

_Plot predicted vs actual MPG for the test set along with the identity line and compute mean absolute error._

In [None]:
import matplotlib.pyplot as plt
comparison_df = pd.DataFrame({'Actual MPG': y_test_mpg, 'Predicted MPG': rf_preds})
fig, ax = plt.subplots(figsize=(5, 4))
sns.scatterplot(data=comparison_df, x='Actual MPG', y='Predicted MPG', ax=ax, alpha=0.7)
min_val, max_val = comparison_df.min().min(), comparison_df.max().max()
ax.plot([min_val, max_val], [min_val, max_val], linestyle='--', color='red')
ax.set_title('Random Forest Predictions vs Actual MPG')
plt.tight_layout()
plt.show()
mae = mean_absolute_error(y_test_mpg, rf_preds)
print(f'MAE: {mae:.2f}')
# Observation: Points hug the identity line, with average absolute error around 2 MPG.


## A18 – Cross-Validation RMSE

_Evaluate the regression pipeline with 5-fold cross-validation using negative RMSE scoring and display the fold scores._

In [None]:
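# Scores come back as negative RMSE (higher is better). A separate name avoids
# overwriting the classifier scores stored in rf_cv_scores in A10.
rf_reg_cv_scores = cross_val_score(rf_reg, X_mpg, y_mpg, cv=5, scoring='neg_root_mean_squared_error')
rf_reg_cv_scores
# Observation: Negate the scores to read them as RMSE. With cv=5 the folds are contiguous, unshuffled blocks of a dataset ordered by model year, so check the fold spread before concluding the model is stable.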
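
scikit-learn scorers follow a "higher is better" convention, which is why RMSE is reported with a negative sign. A short sketch on synthetic data shows the sign flip and a shuffled `KFold` as one possible alternative to an integer `cv`; `X_toy`, `y_toy`, and the linear model are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (not the MPG dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Shuffled folds avoid contiguous blocks when rows have a natural order.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)

neg_scores = cross_val_score(
    LinearRegression(), X_toy, y_toy,
    cv=shuffled_cv, scoring='neg_root_mean_squared_error'
)

# Flip the sign to get RMSE in the target's units.
rmse_per_fold = -neg_scores
print('RMSE per fold:', rmse_per_fold.round(3))
print(f'mean = {rmse_per_fold.mean():.3f}, std = {rmse_per_fold.std():.3f}')
```

Reporting the mean and standard deviation of the negated scores gives a compact summary of fold-to-fold stability.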


## A19 – Baseline Linear Regression

_Train a linear regression pipeline with the same preprocessing and compare RMSE/R² against the random forest model._

In [None]:
lin_reg = Pipeline(
    steps=[
        ('pre', mpg_preprocessor),
        ('model', LinearRegression())
    ]
)
lin_reg.fit(X_train_mpg, y_train_mpg)
lin_preds = lin_reg.predict(X_test_mpg)
# Square root of the MSE; squared=False is no longer accepted by recent scikit-learn releases.
lin_rmse = np.sqrt(mean_squared_error(y_test_mpg, lin_preds))
lin_r2 = r2_score(y_test_mpg, lin_preds)
print(f'Linear regression RMSE: {lin_rmse:.2f}')
print(f'Linear regression R^2: {lin_r2:.3f}')
# Observation: The linear baseline trails the random forest, underscoring nonlinear structure in MPG.


## A20 – Random Forest Grid Search

_Run a brief grid search over `n_estimators` (200, 400) and `max_depth` (None, 10, 20) for the random forest regressor using 3-fold CV._

In [None]:
param_grid = {
    'model__n_estimators': [200, 400],
    'model__max_depth': [None, 10, 20]
}
grid_search = GridSearchCV(
    rf_reg,
    param_grid=param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
)
grid_search.fit(X_train_mpg, y_train_mpg)
print('Best params:', grid_search.best_params_)
print('Best CV RMSE:', -grid_search.best_score_)
# Observation: Capping depth at 10 or 20 rarely beats the default unlimited depth (max_depth=None), so any gain tends to come from the number of trees.
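
The cost of a grid search is easy to estimate before running it. The sketch below counts the candidate settings for the same grid with `ParameterGrid`; `param_grid_example`, `n_folds`, and `n_fits` are hypothetical names, and the extra fit assumes the default `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, copied so the snippet is self-contained.
param_grid_example = {
    'model__n_estimators': [200, 400],
    'model__max_depth': [None, 10, 20],
}

candidates = list(ParameterGrid(param_grid_example))
n_candidates = len(candidates)  # 2 * 3 combinations
n_folds = 3

# One fit per candidate per fold, plus one final refit on the full training set.
n_fits = n_candidates * n_folds + 1

for params in candidates:
    print(params)
print('candidates:', n_candidates, '| total fits:', n_fits)
```

Because each fit here trains hundreds of trees, the fit count is a useful guide when deciding whether to widen the grid.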


## A21 – Refit with Best Params

_Refit the random forest pipeline with the best grid-search parameters and evaluate new test RMSE and R²._

In [None]:
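# With the default refit=True, GridSearchCV has already refit the best
# configuration on the full training set, so no separate fit call is needed.
best_rf = grid_search.best_estimator_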
best_preds = best_rf.predict(X_test_mpg)
# Square root of the MSE; squared=False is no longer accepted by recent scikit-learn releases.
best_rmse = np.sqrt(mean_squared_error(y_test_mpg, best_preds))
best_r2 = r2_score(y_test_mpg, best_preds)
print(f'Best RF RMSE: {best_rmse:.2f}')
print(f'Best RF R^2: {best_r2:.3f}')
# Observation: Tuned settings offer marginal improvement, confirming the baseline forest was near optimal.


## A22 – Modeling Checklist

_Create a Python list named `modeling_steps` summarizing the key stages executed across classification and regression workflows._

In [None]:
modeling_steps = [
    'Prepared clean feature matrices and targets',
    'Built ColumnTransformer pipelines for preprocessing',
    'Trained logistic and random forest classifiers on penguins',
    'Benchmarked models with cross-validation and permutation importance',
    'Constructed random forest and linear baselines for MPG regression',
    'Performed hyperparameter tuning and evaluated test performance'
]
modeling_steps
# Observation: Documented steps capture a repeatable ML workflow from data prep through evaluation.
