# COE305 – Machine Learning Project (Stage 2 + Stage 3 Combined)
## Titanic Survival Prediction using Machine Learning Models

**Team Members**
- Beyza Özel (210905042)
- Derviş Karakoca (220905068)
- Tolga Kaplan (220905095)

**Dataset (Kaggle)**
https://www.kaggle.com/competitions/titanic

**GitHub Notebook Link**
(Replace with your GitHub link after uploading this notebook)

---
This notebook contains:
- **Stage 2:** Dataset cleaning + Feature Engineering + EDA  
- **Stage 3:** Baseline models (**≥ 3 algorithms**) + initial evaluation results


In [None]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, classification_report, RocCurveDisplay
)

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier


## Load Dataset (Kaggle Titanic - train set)

We use the Titanic dataset from Kaggle for a binary classification task where:
- Target: **Survived** (0/1)
- Inputs: passenger demographic & socio-economic features
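
Before any cleaning, it can help to confirm that the target really is binary and to look at the class balance. The snippet below is a minimal sketch on a hypothetical toy frame that only borrows the Kaggle column names; `toy` and `check_binary_target` are illustrative names, not part of our notebook, and the rows are not real passenger data.

```python
import pandas as pd

# Hypothetical toy rows using the Kaggle column names (not real passenger data).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None],
})

def check_binary_target(frame, target="Survived"):
    """Return the share of each class after confirming the target holds only 0 and 1."""
    values = set(frame[target].dropna().unique())
    if not values <= {0, 1}:
        raise ValueError(f"Unexpected target values: {sorted(values)}")
    return frame[target].value_counts(normalize=True).sort_index()

shares = check_binary_target(toy)
print(shares)
```

The same helper could be called on `raw_df` right after loading; an imbalanced result would be one reason to stratify the train/test split.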


In [None]:
# If you downloaded Kaggle 'train.csv', you can use:
# raw_df = pd.read_csv("train.csv")

# In our submission, we also keep our Stage files (raw/clean) as Excel.
# Note: pd.read_excel needs the openpyxl package installed to read .xlsx files.
raw_df = pd.read_excel("raw (4).xlsx")
raw_df.head()


# STAGE 2 — Dataset Cleaning, Feature Engineering, and EDA

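Stage 2 goal: produce a model-ready dataset by filling missing values (Age, Embarked, Fare), dropping columns that are mostly missing or not useful for modeling (Cabin, PassengerId, Name, Ticket), and creating new features (FamilySize, IsAlone, Title, Fare_log).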
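
One design alternative worth noting: filling missing values on the full table uses statistics from rows that later end up in the test split. The sketch below shows a possible variant where imputation lives inside the preprocessing pipeline, so medians and modes are learned from training rows only. This is an assumption-laden illustration, not what our Stage 2 code does; `toy_train`, `toy_test`, `numeric`, `categorical`, and `prep` are hypothetical names and the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows; values are illustrative only.
toy_train = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", np.nan, "S"],
})
toy_test = pd.DataFrame({"Age": [np.nan], "Fare": [10.0], "Embarked": ["C"]})

# Numeric columns: median imputation, then scaling.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("num", numeric, ["Age", "Fare"]),
    ("cat", categorical, ["Embarked"]),
])

# Statistics are learned from the training rows only...
prep.fit(toy_train)
learned_age_median = prep.named_transformers_["num"].named_steps["impute"].statistics_[0]

# ...and then reused unchanged when transforming unseen rows.
test_matrix = prep.transform(toy_test)
print(learned_age_median, test_matrix.shape)
```

On a dataset of this size the difference from global imputation is usually small, but the pipeline version keeps the evaluation free of test-set information.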


In [None]:
# Missing values overview
raw_df.isnull().sum().sort_values(ascending=False)


In [None]:
# ---- Cleaning + Feature Engineering (consistent with our Stage 2 report) ----
df = raw_df.copy()

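# 1) Missing values (computed from the data instead of hardcoded)
df["Age"] = df["Age"].fillna(df["Age"].median())                 # median age (28 in our Stage 2 report)
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # mode ('S' in our Stage 2 report)
df["Fare"] = df["Fare"].fillna(df["Fare"].median())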

# 2) Drop Cabin (too many missing values)
df = df.drop(columns=["Cabin"], errors="ignore")

# 3) Feature Engineering
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Title extraction from Name
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
df["Title"] = df["Title"].replace(["Mlle", "Ms"], "Miss")
df["Title"] = df["Title"].replace("Mme", "Mrs")
rare_titles = ["Lady","Countess","Capt","Col","Don","Dr","Major","Rev","Sir","Jonkheer","Dona"]
df["Title"] = df["Title"].replace(rare_titles, "Rare")

# Fare log transform
df["Fare_log"] = np.log1p(df["Fare"])

# Build model dataset (drop IDs/text columns not needed for modeling)
data = df.drop(columns=["PassengerId", "Name", "Ticket"], errors="ignore")

data.head()


## Stage 2 — EDA (Key Visuals)

We include the most important plots that show relationships highlighted in Stage 2:
- Target distribution  
- Survival rate by Sex and Pclass  
- Distributions of Age and Fare_log
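
The survival-rate plots boil down to a group-wise mean of the 0/1 target. A minimal sketch of that computation on a hypothetical toy frame (`toy`, `rate_by_sex`, `rate_by_class`, and `rate_table` are illustrative names; the rows are not real data):

```python
import pandas as pd

# Hypothetical toy rows (not real passenger data).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate per group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# Two-way view: rows are Sex, columns are Pclass, cells are survival rates.
rate_table = toy.pivot_table(index="Sex", columns="Pclass",
                             values="Survived", aggfunc="mean")

print(rate_by_sex)
print(rate_by_class)
print(rate_table)
```

Combinations with no rows in the toy data show up as NaN in `rate_table`, which is a useful reminder to check group sizes before reading too much into a rate.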


In [None]:
# Target distribution
data["Survived"].value_counts().plot(kind="bar")
plt.title("Target Distribution (Survived)")
plt.xlabel("Survived")
plt.ylabel("Count")
plt.show()

# Sex vs Survived (row-normalized: bars show the share of non-survivors and survivors per group)
pd.crosstab(data["Sex"], data["Survived"], normalize="index").plot(kind="bar")
plt.title("Survival Rate by Sex")
plt.ylabel("Proportion of passengers")
plt.show()

# Pclass vs Survived (row-normalized, same reading as above)
pd.crosstab(data["Pclass"], data["Survived"], normalize="index").plot(kind="bar")
plt.title("Survival Rate by Pclass")
plt.ylabel("Proportion of passengers")
plt.show()

# Age distribution
data["Age"].plot(kind="hist", bins=30)
plt.title("Age Distribution")
plt.xlabel("Age")
plt.show()

# Fare_log distribution
data["Fare_log"].plot(kind="hist", bins=30)
plt.title("Fare_log Distribution")
plt.xlabel("Fare_log")
plt.show()


# STAGE 3 — Midterm Progress Report (Baseline Models)

Stage 3 requirement: **implement ≥ 3 models** and report **baseline results**.
We train and evaluate:
- Logistic Regression
- KNN
- Random Forest

Metrics:
- Accuracy, Precision, Recall, F1-score (the last three computed for the positive class, Survived = 1), ROC-AUC
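
To make the threshold-based metrics concrete, here is a small worked sketch that derives them from confusion-matrix counts in plain Python. The function name `binary_metrics` and the label lists are hypothetical and exist only for illustration; in the notebook itself we rely on the scikit-learn metric functions. ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survivors predicted as survivors
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-survivors predicted as survivors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # survivors predicted as non-survivors
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-survivors predicted as non-survivors

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels, not model output from this notebook.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

Here the toy predictions contain 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so all four metrics come out at 0.75.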


In [None]:
# Train/Test Split
X = data.drop(columns=["Survived"])
y = data["Survived"]

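# Split columns by dtype; select_dtypes also handles text columns stored with a non-object string dtype
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()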

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42)
}

results = []

for name, model in models.items():
    pipe = Pipeline(steps=[("prep", preprocess), ("model", model)])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)

    # ROC-AUC needs probabilities
    proba = pipe.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None

    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, preds),
        "Precision": precision_score(y_test, preds),
        "Recall": recall_score(y_test, preds),
        "F1": f1_score(y_test, preds),
        "ROC_AUC": roc_auc_score(y_test, proba) if proba is not None else np.nan
    })

results_df = pd.DataFrame(results).sort_values("F1", ascending=False)
results_df


In [None]:
# Best model detailed report + ROC curve
best_model_name = results_df.iloc[0]["Model"]
best_model = models[best_model_name]

best_pipe = Pipeline(steps=[("prep", preprocess), ("model", best_model)])
best_pipe.fit(X_train, y_train)

best_preds = best_pipe.predict(X_test)
print("Best Model:", best_model_name)
print(classification_report(y_test, best_preds))

cm = confusion_matrix(y_test, best_preds)
print("Confusion Matrix:\n", cm)

# ROC curve (only if predict_proba exists)
if hasattr(best_model, "predict_proba"):
    proba = best_pipe.predict_proba(X_test)[:, 1]
    RocCurveDisplay.from_predictions(y_test, proba)
    plt.title(f"ROC Curve - {best_model_name}")
    plt.show()


## Project Links (Required)

- Dataset (Kaggle): https://www.kaggle.com/competitions/titanic  
- GitHub Repository: (paste after upload)  
- Colab Link (optional): (paste if you use Colab)
