
# Survival Prediction in Titanic Dataset
## Week 4 Assignment: Cross-Validation, Hyperparameter Tuning & Ensemble Methods

Topics covered in this notebook:

- Cross-validation and hyperparameter tuning
- Decision Trees and Random Forests
- Boosting algorithms (Gradient Boosting, AdaBoost, XGBoost, CatBoost, LightGBM)
- K-Nearest Neighbors (KNN)

We'll predict passenger survival (the `Survived` column) from features such as passenger class, sex, age, and fare.


In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

df = pd.read_excel('titanic.csv.xlsx')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Data Preprocessing
We fill missing values, one-hot encode the categorical features, and drop identifier-like columns. Feature scaling is applied later, only for KNN.

In [2]:
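# Fill missing ages with the mean age and missing ports with the most common port.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# One-hot encode the categorical columns; drop_first avoids a redundant dummy column.
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

# Drop free-text and identifier columns.
# Note: 'Unnamed: 0' (the saved row index) is not dropped, so it stays in X as a feature.
df = df.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)

# Separate the features from the target.
X = df.drop('Survived', axis=1)
y = df['Survived']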
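
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The frame `toy` and its three rows are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy frame with the same two categorical columns (values are illustrative).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# The first category of each column (alphabetically) is dropped:
# "female" for Sex and "C" for Embarked.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(list(encoded.columns))  # ['Sex_male', 'Embarked_Q', 'Embarked_S']
```

The dropped category is implied when all remaining dummy columns for that feature are zero.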

### Splitting Data into Train and Test

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
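
The split above does not stratify on the target. As an optional variation, the sketch below shows how `stratify` keeps the class ratio equal in both parts. The arrays `X_demo` and `y_demo` are synthetic placeholders, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 100 rows, 30% positive.
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the 30/70 class ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))      # 80 20
print(y_tr.mean(), y_te.mean())  # positive rate in each part
```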


### Cross-Validation & Hyperparameter Tuning
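Use **GridSearchCV** with 5-fold cross-validation to find the best Decision Tree parameters among all 24 combinations of depth, minimum split size, and split criterion.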


In [4]:
param_grid = {
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("CV Score:", grid.best_score_)

Best Parameters: {'criterion': 'entropy', 'max_depth': 3, 'min_samples_split': 2}
CV Score: 0.8230079779375554
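
Beyond the single best score, `cv_results_` holds the mean and spread of the fold scores for every combination. The sketch below runs the same grid on synthetic data (`X_demo`, `y_demo`, and `grid_demo` are assumed stand-ins, so its numbers will not match the output above).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid_demo = {
    "max_depth": [3, 5, 7, 9],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"],
}

grid_demo = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid_demo, cv=5)
grid_demo.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(grid_demo.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score"]].head())
```

Comparing `std_test_score` across the top rows helps judge whether the best combination is clearly ahead or within fold-to-fold noise.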



### Random Search
Use **RandomizedSearchCV** to sample 5 of the 36 possible Random Forest parameter combinations, each scored with 5-fold cross-validation.


In [5]:
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}

random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=5, cv=5, random_state=42)
random_search.fit(X_train, y_train)

print("Best Parameters:", random_search.best_params_)
print("CV Score:", random_search.best_score_)

Best Parameters: {'n_estimators': 100, 'min_samples_split': 10, 'max_depth': 5}
CV Score: 0.8300502314586822
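
As a quick check of how much of the search space `n_iter=5` covers, the sketch below counts the combinations in the same parameter dictionary with `ParameterGrid`. The variable names `n_total` and `n_iter` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 5, 10],
}

# Total number of distinct combinations: 3 * 4 * 3.
n_total = len(ParameterGrid(param_dist))
n_iter = 5

print(n_total)                     # 36
print(f"{n_iter / n_total:.1%}")   # share of the grid that is sampled
```

Raising `n_iter` trades extra fitting time for better coverage of the grid.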



### Decision Tree & Random Forest
Retrain both models on the training set with the best parameters found above, then evaluate them on the held-out test set.


In [6]:
dt = DecisionTreeClassifier(**grid.best_params_)
dt.fit(X_train, y_train)
rf = RandomForestClassifier(**random_search.best_params_)
rf.fit(X_train, y_train)

dt_pred = dt.predict(X_test)
rf_pred = rf.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, dt_pred))
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))

Decision Tree Accuracy: 0.7988826815642458
Random Forest Accuracy: 0.8100558659217877
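
With the default `refit=True`, a fitted search object already holds a model retrained on the full training set as `best_estimator_`, so a separate refit is optional. The sketch below illustrates this on synthetic data; `search_demo` and the `X_demo`/`y_demo` arrays are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

search_demo = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, 7]},
    cv=5,
)
search_demo.fit(X_tr, y_tr)

# best_estimator_ is already fitted on all of X_tr with the best parameters.
direct_acc = accuracy_score(y_te, search_demo.best_estimator_.predict(X_te))

# search_demo.score delegates to the same refitted estimator.
print(direct_acc, search_demo.score(X_te, y_te))
```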



### Boosting: Gradient Boosting, AdaBoost


In [7]:
gb = GradientBoostingClassifier(n_estimators=300)
ab = AdaBoostClassifier(n_estimators=300)

gb.fit(X_train, y_train)
ab.fit(X_train, y_train)

gb_pred = gb.predict(X_test)
ab_pred = ab.predict(X_test)

print("Gradient Boosting Accuracy:", accuracy_score(y_test, gb_pred))
print("AdaBoost Accuracy:", accuracy_score(y_test, ab_pred))

Gradient Boosting Accuracy: 0.8212290502793296
AdaBoost Accuracy: 0.8100558659217877
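
Both models above use a fixed `n_estimators=300`. One way to see how accuracy evolves as trees are added is `staged_predict`, sketched below on synthetic data (`gb_demo`, `X_demo`, and `y_demo` are assumed names). In practice the stage count should be chosen on a validation split rather than the test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Titanic features).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

gb_demo = GradientBoostingClassifier(n_estimators=300, random_state=0)
gb_demo.fit(X_tr, y_tr)

# Accuracy after 1, 2, ..., 300 boosting stages.
staged_acc = [accuracy_score(y_te, pred) for pred in gb_demo.staged_predict(X_te)]

# Earliest stage count that reaches the highest accuracy.
best_n = max(range(len(staged_acc)), key=staged_acc.__getitem__) + 1
print(best_n, staged_acc[best_n - 1], staged_acc[-1])
```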


### Boosting: XGBoost, CatBoost, LightGBM

In [19]:
xgb = XGBClassifier(max_depth=2)
cat = CatBoostClassifier(verbose=0)
lgbm = LGBMClassifier(verbose=-1)

xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, xgb_pred))

cat.fit(X_train, y_train)
cat_pred = cat.predict(X_test)
print("CatBoost Accuracy:", accuracy_score(y_test, cat_pred))

lgbm.fit(X_train, y_train)
lgbm_pred = lgbm.predict(X_test)
print("LightGBM Accuracy:", accuracy_score(y_test, lgbm_pred))

XGBoost Accuracy: 0.8212290502793296
CatBoost Accuracy: 0.8268156424581006
LightGBM Accuracy: 0.8379888268156425



### K-Nearest Neighbors (KNN)


In [9]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train_scaled, y_train)
knn_pred = knn.predict(X_test_scaled)

print("KNN Accuracy:", accuracy_score(y_test, knn_pred))

KNN Accuracy: 0.8268156424581006
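
The value `n_neighbors=2` above is fixed by hand. A possible alternative, sketched here on synthetic data, is to tune it with cross-validation and to wrap the scaler in a `Pipeline` so that it is fitted only on each training fold. The candidate list and the names `pipe` and `k_search` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling lives inside the pipeline, so each CV fold scales with its own statistics.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

k_values = [1, 2, 3, 5, 7, 9, 11]
k_search = GridSearchCV(pipe, {"knn__n_neighbors": k_values}, cv=5)
k_search.fit(X_demo, y_demo)

print(k_search.best_params_, k_search.best_score_)
```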


In [10]:
print(f"Decision Tree: {accuracy_score(y_test, dt_pred):.8f}")
print(f"Random Forest: {accuracy_score(y_test, rf_pred):.8f}")
print(f"Gradient Boosting: {accuracy_score(y_test, gb_pred):.8f}")
print(f"AdaBoost: {accuracy_score(y_test, ab_pred):.8f}")
print(f"XGBoost: {accuracy_score(y_test, xgb_pred):.8f}")
print(f"CatBoost: {accuracy_score(y_test, cat_pred):.8f}")
print(f"LightGBM: {accuracy_score(y_test, lgbm_pred):.8f}")

Decision Tree: 0.79888268
Random Forest: 0.81005587
Gradient Boosting: 0.82122905
AdaBoost: 0.81005587
XGBoost: 0.82122905
CatBoost: 0.82681564
LightGBM: 0.83798883
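
The summary cell above leaves out KNN. A small sketch that ranks all eight models, using the test accuracies copied from the outputs reported in this notebook (the `scores` and `ranked` names are introduced here):

```python
import pandas as pd

# Test accuracies copied from the cell outputs in this notebook.
scores = pd.Series(
    {
        "Decision Tree": 0.7988826815642458,
        "Random Forest": 0.8100558659217877,
        "Gradient Boosting": 0.8212290502793296,
        "AdaBoost": 0.8100558659217877,
        "XGBoost": 0.8212290502793296,
        "CatBoost": 0.8268156424581006,
        "LightGBM": 0.8379888268156425,
        "KNN": 0.8268156424581006,
    },
    name="test_accuracy",
)

# Highest accuracy first.
ranked = scores.sort_values(ascending=False)
print(ranked.round(4))
```

Because none of the models set `random_state`, rerunning the notebook may shift these values slightly.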


## Conclusion
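- Applied 5-fold cross-validation with grid search (Decision Tree) and randomized search (Random Forest) for hyperparameter tuning.
- Implemented Decision Trees, Random Forests, five boosting algorithms, and KNN.
- Compared test accuracy on the Titanic dataset: LightGBM scored highest (0.838) and the tuned Decision Tree lowest (0.799), with the remaining models between 0.810 and 0.827.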

This concludes the Week 4 assignment.