# Titanic Kaggle: XGBoost Model
This notebook demonstrates a commonly used approach for tackling the Titanic classification problem on Kaggle.

Steps:
1. Load libraries
2. Load data
3. Exploratory Data Analysis & Feature Engineering
4. Model building with XGBoost
5. Evaluation (local CV) and Prediction
6. Generate submission file

In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import re
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

## 1. Load Data
Make sure you have `train.csv` and `test.csv` from the Titanic Kaggle competition in the same folder as this notebook.

In [None]:
# Load train and test data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

print("Train shape:", train_data.shape)
print("Test shape:\t", test_data.shape)

train_data.head()

## 2. Exploratory Data Analysis & Basic Cleaning
- Check missing values
- Fill or drop as needed
- Convert categorical variables
- Feature engineering (Title, FamilySize, etc.)

In [None]:
# Check missing values in train
train_data.isnull().sum()

`Age` and `Cabin` have many missing values, and `Embarked` has a few. The `Cabin` column is often dropped or reduced to a simpler encoding. We take a minimal approach here.
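
To make the extent of the gaps easier to compare, the raw counts can be paired with percentages. The sketch below is one possible helper; `missing_summary` is a hypothetical name, not part of the notebook, and the toy frame uses made-up values in place of `train_data`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Toy frame standing in for train_data (values are made up)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook, calling the helper on both `train_data` and `test_data` would show whether the two files have gaps in the same columns.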

**Key feature engineering steps** often used:
1. **Title extraction**: from Name (e.g., Mr, Mrs, Miss, Master)
2. **Fill missing Age** with median (optionally by Title)
3. **Family size**: SibSp + Parch + 1 (the passenger)
4. **Embarked**: fill with mode (most common)
5. **Fare**: fill missing (in test) with median, or do binning
6. **Cabin**: either drop or turn into an indicator for having a cabin vs. not
7. **Drop unused columns** such as Ticket, plus Name and Cabin once features have been extracted from them.

We’ll demonstrate a typical pipeline below.

In [None]:
# Combine train and test for consistent feature engineering
full_data = pd.concat([train_data, test_data], sort=False).reset_index(drop=True)
full_data.head()

### 2.1 Extract Title

In [None]:
def get_title(name):
    # Extract title from the name
    # e.g., 'Braund, Mr. Owen Harris' -> 'Mr'
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""  # in case there's no title

full_data['Title'] = full_data['Name'].apply(get_title)

# Some titles are very rare and can be grouped
rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major',
               'Rev', 'Sir', 'Jonkheer', 'Dona']
full_data['Title'] = full_data['Title'].replace(rare_titles, 'Rare')

# Map alternative spellings onto the common titles
full_data['Title'] = full_data['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
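
As a quick check of what the regular expression captures, the sketch below applies the same pattern to a few illustrative names in the dataset's "Surname, Title. Given names" layout. It is a standalone copy of the extraction logic for experimentation, and the sample list is an assumption chosen to cover a plain title, a multi-word prefix, and a name with no title.

```python
import re
import pandas as pd

def get_title(name: str) -> str:
    """Return the word between a space and the first following period, e.g. 'Mr'."""
    match = re.search(r' ([A-Za-z]+)\.', name)
    return match.group(1) if match else ""

# Illustrative names in the "Surname, Title. Given names" layout
sample_names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'No Title Here',
])
titles = sample_names.apply(get_title)
print(titles.value_counts())
```

Running `full_data['Title'].value_counts()` right after extraction, before any grouping, is a simple way to see which titles are rare enough to merge.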

### 2.2 Fill Missing Data
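- **Embarked**: fill missing with mode (S).
- **Fare**: fill missing values (in test) with the median.
- **Age**: fill with the median of each Title group.
- **Cabin**: replace with a `HasCabin` indicator, then drop it together with the other unused columns.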

In [None]:
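# Fill Embarked with the mode ('S').
# Assigning the result back avoids pandas' chained-assignment pitfalls with inplace=True.
full_data['Embarked'] = full_data['Embarked'].fillna('S')

# Fill Fare (if missing in test)
full_data['Fare'] = full_data['Fare'].fillna(full_data['Fare'].median())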

# Fill missing Age with the median Age of the passenger's Title group.
# transform('median') returns one value per row, aligned with full_data's index.
age_medians = full_data.groupby('Title')['Age'].transform('median')
full_data['Age'] = full_data['Age'].fillna(age_medians)

# We'll drop Cabin but create a feature "HasCabin" (0 or 1)
full_data['HasCabin'] = full_data['Cabin'].notna().astype(int)

# Now drop columns we won't use
full_data.drop(['PassengerId','Name','Cabin','Ticket'], axis=1, inplace=True)
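
The feature list earlier mentions binning `Fare` as an alternative to using the raw value. A minimal sketch of quantile binning with `pd.qcut` follows. The fares are made-up stand-ins for `full_data['Fare']`, and the choice of four bins is an assumption rather than something the notebook prescribes.

```python
import pandas as pd

# Toy fares standing in for full_data['Fare'] (values are made up)
fares = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08])

# qcut splits on quantiles, so each bin holds roughly the same number of rows.
# labels=False returns integer bin codes (0..3) instead of interval objects.
fare_bin = pd.qcut(fares, q=4, labels=False)
print(fare_bin.tolist())
```

Tree models split on thresholds anyway, so binning mostly matters when the same features are shared with linear models or when coarser groups are easier to interpret.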

### 2.3 Create Family Size

In [None]:
full_data['FamilySize'] = full_data['SibSp'] + full_data['Parch'] + 1
full_data.head()
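
The arithmetic can be checked on a toy frame with made-up values. The `IsAlone` flag in this sketch is a hypothetical extra feature that is often derived from family size; it is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows (made-up values) with the two relative-count columns
toy = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

# Siblings/spouses + parents/children + the passenger themself
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# Hypothetical extra feature (not part of the notebook): 1 if travelling alone
toy['IsAlone'] = (toy['FamilySize'] == 1).astype(int)
print(toy)
```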

### 2.4 Convert Categorical to Numeric
We'll encode **Sex**, **Embarked**, and **Title** as integers.

**Sex** takes the values `male`/`female`, so we map it to 1/0. **Embarked** and **Title** are label-encoded. The feature matrix is passed to XGBoost as a NumPy array, so every column must be numeric before training.

In [None]:
# Encode Sex as 0/1
full_data['Sex'] = full_data['Sex'].map({'male':1, 'female':0}).astype(int)

# Embarked can also be encoded
le_embarked = LabelEncoder()
full_data['Embarked'] = le_embarked.fit_transform(full_data['Embarked'])

# Title
le_title = LabelEncoder()
full_data['Title'] = le_title.fit_transform(full_data['Title'])

full_data.head()
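
Label encoding assigns integers in sorted order of the labels, which implies an ordering that the categories do not really have. The sketch below, on a toy `Embarked` column, shows how to inspect the mapping through `classes_` and how one-hot encoding with `pd.get_dummies` could be used instead. Whether one-hot helps for this dataset is not something the notebook tests.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# LabelEncoder sorts the distinct values and numbers them in that order
le = LabelEncoder()
codes = le.fit_transform(ports)
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# One-hot alternative: one indicator column per port, no implied ordering
one_hot = pd.get_dummies(ports, prefix='Embarked', dtype=int)
print(one_hot)
```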

### 2.5 Split back into train/test

In [None]:
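# Survived is the target (and NaN for the test rows), so it must not be a feature
features = full_data.drop(columns=['Survived'])

train_df = features.iloc[:len(train_data)]
test_df = features.iloc[len(train_data):]

# Our target is Survived in train_data
y = train_data['Survived'].values
X = train_df.values

X_test = test_df.values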

print("Train shape:", X.shape)
print("Test shape:\t", X_test.shape)
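
The combined frame carried the `Survived` column through feature engineering, so it is worth confirming that the target is absent from the feature frame before it is converted to arrays. The helper below is a sketch; `check_features` is a hypothetical name and the toy frames hold made-up values.

```python
import pandas as pd

def check_features(features: pd.DataFrame, target: str = 'Survived') -> None:
    """Raise AssertionError if the frame is not safe to pass to the model."""
    assert target not in features.columns, f"{target} leaked into the features"
    # XGBoost can handle NaN itself, but this pipeline imputes every column,
    # so leftover NaN values point to a missed step
    assert not features.isnull().any().any(), "features still contain NaN"
    non_numeric = features.select_dtypes(exclude='number').columns.tolist()
    assert not non_numeric, f"non-numeric columns remain: {non_numeric}"

# Toy frames with made-up values
clean = pd.DataFrame({'Pclass': [3, 1], 'Sex': [1, 0], 'Age': [22.0, 38.0]})
leaky = clean.assign(Survived=[0, 1])

check_features(clean)  # passes silently
try:
    check_features(leaky)
    leak_detected = False
except AssertionError as err:
    leak_detected = True
    print("caught:", err)
```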

## 3. Model Building (XGBoost)
We’ll do a quick local validation with a stratified `train_test_split`.

The hyperparameters below are hand-picked rather than tuned. People often use Bayesian optimization or larger grid searches to find the best parameters. Here we demonstrate a smaller approach.

In [None]:
# We'll do a simple train/val split for local check
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Build the model
# early_stopping_rounds is set on the estimator (supported since XGBoost 1.6);
# passing it to fit() is deprecated and rejected by recent releases
model = xgb.XGBClassifier(
    n_estimators=1000,
    max_depth=4,
    learning_rate=0.01,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss',  # for binary classification
    early_stopping_rounds=50
)

# Fit; training stops once validation logloss has not improved for 50 rounds
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

val_preds = model.predict(X_val)
val_acc = accuracy_score(y_val, val_preds)
print(f"Validation Accuracy: {val_acc:.4f}")

You can try adjusting parameters like `max_depth`, `learning_rate`, etc., to see if you get better local validation accuracy. Typical local validation accuracies with a well-tuned XGBoost for Titanic might be **0.82–0.86**.

If you want a more robust measure, you can do **k-fold cross validation**. But for demonstration, a single split is fine.

## 4. Train on Full Training Set & Generate Submission
Now that we’ve done a local check, we can train on the **entire** dataset and produce predictions on the test set for Kaggle submission.

In [None]:
# Retrain on the entire training data
model_full = xgb.XGBClassifier(
    n_estimators=1000,
    max_depth=4,
    learning_rate=0.01,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss'
)
model_full.fit(X, y)

# Predict on test data
test_preds = model_full.predict(X_test)
test_preds[:10]

### 4.1 Create Submission File
Kaggle expects a CSV with format:

```
PassengerId,Survived
892,0
893,1
...
```

But we previously dropped PassengerId. We’ll fetch it from the original test data.

In [None]:
submission = pd.read_csv('test.csv', usecols=['PassengerId'])
submission['Survived'] = test_preds.astype(int)
submission.to_csv('submission_xgb.csv', index=False)
submission.head()
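
Before uploading, a few cheap checks can catch a malformed file. The sketch below validates the layout described above; `validate_submission` is a hypothetical helper, and the default of 418 rows matches the size of the competition's `test.csv`.

```python
import pandas as pd

def validate_submission(sub: pd.DataFrame, expected_rows: int = 418) -> None:
    """Check the layout Kaggle expects for the Titanic competition."""
    assert list(sub.columns) == ['PassengerId', 'Survived'], "unexpected columns"
    assert len(sub) == expected_rows, f"expected {expected_rows} rows, got {len(sub)}"
    assert sub['PassengerId'].is_unique, "duplicate PassengerId"
    assert sub['Survived'].isin([0, 1]).all(), "Survived must be 0 or 1"

# Toy submission with made-up predictions; the real file has 418 rows
toy_sub = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
validate_submission(toy_sub, expected_rows=3)
print("submission looks valid")
```

In the notebook, the call would simply be `validate_submission(submission)`.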

You can now **upload** this `submission_xgb.csv` to the Titanic Kaggle competition. Typical XGBoost solutions with decent feature engineering often score in the **0.78–0.82** range on the public leaderboard (sometimes higher if well-tuned).

## 5. Possible Improvements
1. **More Feature Engineering**: Create advanced features from `Cabin` or `Name` (Title grouping is just a start), or compute family survival rates by grouping passengers across the combined train and test sets.
2. **Hyperparameter Tuning**: Use cross-validation or a tuning library (Optuna, GridSearchCV, etc.) to find the best parameters (max_depth, learning_rate, n_estimators, etc.).
3. **Ensembling**: Combine multiple models (e.g., random forest, LightGBM, logistic regression) to get a better final prediction.
4. **Handling Outliers**: If any outlier is present in `Fare` or `Age`, apply transformations or binning.
5. **Stacking**: Build multiple levels of models (meta-model) for improved performance.
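
As an illustration of the ensembling idea in item 3, the sketch below combines three classifiers with soft voting on synthetic data. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the boosted model so that the example depends on scikit-learn only; in the notebook, `xgb.XGBClassifier(...)` could take its place. The model choices and settings are assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X, y
X_demo, y_demo = make_classification(
    n_samples=891, n_features=10, n_informative=5, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
        # Logistic regression benefits from scaled inputs, hence the pipeline
        ('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        # Stand-in for the boosted model used in this notebook
        ('gb', GradientBoostingClassifier(random_state=42)),
    ],
    voting='soft',  # average predicted probabilities rather than counting votes
)
ensemble.fit(X_tr, y_tr)
ens_acc = accuracy_score(y_va, ensemble.predict(X_va))
print(f"Ensemble validation accuracy: {ens_acc:.4f}")
```

Soft voting tends to help most when the member models make different kinds of errors, which is why a linear model is mixed in with the tree-based ones.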

---
## Conclusion
In this notebook, we:
1. Loaded and preprocessed the Titanic dataset
2. Engineered features (Title, HasCabin, FamilySize)
3. Trained a strong baseline model using XGBoost
4. Produced a CSV submission file for Kaggle

This approach typically scores well on the leaderboard. You can further refine and tune hyperparameters for even better performance.