# **Exercise**
> Tackle the **Titanic dataset**. A great place to start is on Kaggle.
Alternatively, you can download the data from
https://homl.info/titanic.tgz and unzip this tarball like you did for the
housing data in Chapter 2. This will give you two CSV files, `train.csv`
and `test.csv`, which you can load using `pandas.read_csv()`. The goal is to
**train a classifier that can predict the Survived column** based on the other
columns.

---

### **Step 1**: Load the Dataset
We begin by loading `train.csv` and `test.csv`. `train.csv` has both features and the target (Survived), while `test.csv` has only features. We'll train on the former and predict on the latter.

In [1]:
import pandas as pd

train_df = pd.read_csv("datasets/titanic/train.csv")
test_df = pd.read_csv("datasets/titanic/test.csv")
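
If `datasets/titanic/` does not exist yet, the download-and-extract step mentioned in the exercise could look roughly like the sketch below. The helper names `extract_tarball` and `fetch_titanic` are hypothetical, and the assumption that the archive unpacks into a `titanic/` subfolder is inferred from the paths used above rather than stated in the exercise.

```python
import tarfile
import urllib.request
from pathlib import Path

TITANIC_URL = "https://homl.info/titanic.tgz"  # URL given in the exercise text


def extract_tarball(tarball_path, dest_dir):
    """Extract every member of a .tgz archive into dest_dir."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball_path) as tarball:
        tarball.extractall(path=dest_dir)
    return dest_dir


def fetch_titanic(data_dir="datasets"):
    """Download the tarball once, then extract it under data_dir.

    Assumption: the archive unpacks into a titanic/ subfolder, which is
    what the paths "datasets/titanic/train.csv" and "test.csv" imply.
    """
    data_dir = Path(data_dir)
    tarball_path = data_dir / "titanic.tgz"
    if not tarball_path.is_file():
        data_dir.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(TITANIC_URL, tarball_path)
    extract_tarball(tarball_path, data_dir)
    return data_dir / "titanic"
```

Calling `fetch_titanic()` once before reading the CSV files should be enough. Later calls skip the download because the tarball is already on disk.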

### **Step 2**: Initial Exploration (EDA)
We inspect the structure, check for missing values, and look at distributions. Observations:

- `Age` (714 of 891 non-null), `Cabin` (204 of 891), and `Embarked` (889 of 891) *have missing values*.
- `Sex`, `Pclass`, `Embarked` are *categorical*.
- We leave `Cabin` (mostly missing), `Ticket`, and `Name` out of the first model.

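In [2]:
train_df.info()      # prints dtypes and non-null counts
train_df.describe()  # summary statistics (not displayed: not the cell's last expression)
train_df.head()      # first rows (not displayed, same reason)
train_df["Survived"].value_counts(normalize=True)  # class balance, shown as the cell output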

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Survived
0    0.616162
1    0.383838
Name: proportion, dtype: float64
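
The two checks described in this step, missing values per column and how survival varies across a categorical feature, can be sketched with two small helpers. The function names are hypothetical and the rows in `toy_df` are made up so that the block runs on its own; with the real data you would pass `train_df` instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df (made-up rows, NOT real Titanic records)
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Embarked": ["S", "C", "S", None, "S", "Q"],
})


def missing_fraction(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def survival_rate_by(df, column):
    """Mean of Survived within each category of `column`."""
    return df.groupby(column)["Survived"].mean()


print(missing_fraction(toy_df))
print(survival_rate_by(toy_df, "Sex"))
```

On the real training set, `survival_rate_by(train_df, "Pclass")` and `survival_rate_by(train_df, "Sex")` are natural first things to look at before choosing features.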

### **Step 3**: Separate Features and Labels
We separate the input features (`X_train`) from the target label (`y_train`). We'll train our model using `X_train` to predict `y_train`.

In [3]:
X_train = train_df.drop(["Survived"], axis=1)
y_train = train_df["Survived"]

### **Step 4**: Preprocessing Pipelines
We build separate pipelines for:

- **Numerical features**: fill missing values using the median, then scale.
- **Categorical features**: fill missing values with the most frequent value, then one-hot encode.

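The `ColumnTransformer` applies each pipeline to its own list of columns. Columns that appear in neither list (`PassengerId`, `Name`, `SibSp`, `Parch`, `Ticket`, `Cabin`) are dropped, which is the default `remainder="drop"` behavior.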

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Select feature columns
num_features = ["Age", "Fare"]
cat_features = ["Pclass", "Sex", "Embarked"]

# Numeric pipeline: impute + scale
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical pipeline: impute + one-hot encode
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Combine both
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_features),
    ("cat", cat_pipeline, cat_features)
])

### **Step 5**: Full Modeling Pipeline
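We wrap **preprocessing** and **classification** into one pipeline. Because the imputers, scaler, and encoder are fitted together with the classifier, every training, validation, and test set goes through exactly the same transformations, and statistics such as the median age are learned only from the data the pipeline is fitted on.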

In [5]:
from sklearn.ensemble import RandomForestClassifier

full_pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42))
])

### **Step 6**: Train and Evaluate
We use 5-fold **cross-validation** to estimate accuracy. Averaging over held-out folds gives a more reliable estimate of the model’s performance on unseen data than a single train/validation split, and makes overfitting easier to detect.

In [6]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(full_pipeline, X_train, y_train, cv=5, scoring="accuracy")
print("CV accuracy:", scores.mean())

CV accuracy: 0.8047517418868871
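
An accuracy figure is easier to judge against a trivial baseline. Since about 61.6% of the training passengers did not survive, always predicting `0` already scores roughly that much. The sketch below measures this with `DummyClassifier`. The helper name `baseline_cv_accuracy` is hypothetical, and the label array is rebuilt from the class counts implied by Step 2 (549 and 342 out of 891) so that the block runs without the CSV files.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score


def baseline_cv_accuracy(y, cv=5):
    """Cross-validated accuracy of always predicting the most frequent class."""
    X_dummy = np.zeros((len(y), 1))  # features are ignored by this strategy
    baseline = DummyClassifier(strategy="most_frequent")
    return cross_val_score(baseline, X_dummy, y, cv=cv, scoring="accuracy")


# Labels with the class counts implied by the proportions in Step 2
y_demo = np.array([0] * 549 + [1] * 342)

baseline_scores = baseline_cv_accuracy(y_demo)
print("Baseline CV accuracy:", baseline_scores.mean())
print("Std across folds:", baseline_scores.std())
```

With the real data, `baseline_cv_accuracy(y_train)` gives the same kind of reference point for the random forest's score.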


### **Step 7**: Grid Search for Hyperparameter Tuning
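We search for the best combination of `RandomForestClassifier` hyperparameters using `GridSearchCV`. The grid has 3 × 3 × 2 = 18 combinations, each evaluated with 5-fold cross-validation. After the search, `grid_search.best_estimator_` holds the full pipeline refit on the whole training set with the best parameters.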

In [7]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [None, 5, 10],
    "classifier__max_features": ["sqrt", "log2"]
}

grid_search = GridSearchCV(full_pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best score:", grid_search.best_score_)
print("Best params:", grid_search.best_params_)

Best score: 0.8271797125102003
Best params: {'classifier__max_depth': 10, 'classifier__max_features': 'sqrt', 'classifier__n_estimators': 50}
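
The best score alone hides how close the other candidates were. One way to compare them is to load `cv_results_` into a DataFrame and sort by rank. The sketch below uses a hypothetical helper, `top_grid_results`, and a small synthetic dataset with a reduced grid so that it runs quickly on its own; none of the numbers it prints relate to the Titanic results above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def top_grid_results(search, n=5):
    """Return the n best parameter combinations with mean and std test score."""
    results = pd.DataFrame(search.cv_results_)
    columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    return results.sort_values("rank_test_score")[columns].head(n)


# Synthetic stand-in data (assumption: only here to make the sketch runnable)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 20], "max_depth": [None, 3]},
    cv=3,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

top = top_grid_results(demo_search)
print(top)
```

Applied to the search above, `top_grid_results(grid_search)` would show whether the winning combination clearly beats the runners-up or sits within one standard deviation of them.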


### **Step 8**: Predict on Test Set
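We apply the **tuned pipeline** to the **real test** set to generate predictions. Since `test.csv` has no `Survived` column, these predictions can only be scored by submitting them to Kaggle.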

In [9]:
# Prepare test features
X_test = test_df.copy()
passenger_ids = X_test["PassengerId"]
X_test = X_test.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)

# Predict
final_model = grid_search.best_estimator_
y_pred = final_model.predict(X_test)

# Build the submission file: one row per test passenger (PassengerId, Survived)
submission_df = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": y_pred
})
submission_df.to_csv("titanic_submission.csv", index=False)
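
Before uploading, a quick sanity check on the submission can catch misaligned or malformed rows. The function below is a hypothetical helper, not part of pandas or Kaggle's tooling, and the passenger IDs in the toy frames are made up for illustration.

```python
import pandas as pd


def check_submission(submission_df, test_df):
    """Raise ValueError if the submission does not line up with the test set."""
    if list(submission_df.columns) != ["PassengerId", "Survived"]:
        raise ValueError("expected exactly the columns PassengerId, Survived")
    if len(submission_df) != len(test_df):
        raise ValueError("one prediction per test row is required")
    same_ids = (
        submission_df["PassengerId"].to_numpy() == test_df["PassengerId"].to_numpy()
    )
    if not same_ids.all():
        raise ValueError("PassengerId values differ from the test set")
    if not submission_df["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must be 0 or 1")
    return True


# Toy frames with made-up IDs, only to show the call
toy_test = pd.DataFrame({"PassengerId": [892, 893, 894]})
toy_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})
print(check_submission(toy_submission, toy_test))
```

With the real objects, the call would be `check_submission(submission_df, test_df)`.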