# Classification using `SVC` (Support Vector Classifier) on the Titanic Dataset
## 🛠 Project Overview

In this project, I am building a Support Vector Machine (SVM) model to predict survival on the Titanic dataset. The workflow includes:

1. **Data Loading & Exploration**
2. **Handling Missing Values**
3. **Feature Engineering** (dropping irrelevant columns & encoding categorical features)
4. **Feature Scaling**
5. **Visualization** (PCA projection and feature plots)
6. **Model Training** with `SVC`
7. **Model Evaluation** (classification report & accuracy score)

You can **download** the **dataset** directly from [Kaggle](https://www.kaggle.com/competitions/titanic/data).

## 📦 Import Required Libraries

We start by importing all the required libraries for data manipulation, visualization, preprocessing, and model training.

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Disabling warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully.")

## 📥 Load the Dataset

Read the Titanic training dataset (`train.csv`) and display the first few rows to understand the data structure.

In [None]:
# Reading the dataset
df_train = pd.read_csv("Titanic Dataset/train.csv")
df_train.head()

## 📊 Dataset Information

Check dataset information (column names, non-null counts, and datatypes) using `.info()`. This helps in identifying categorical and numerical features.

In [None]:
df_train.info()

## 📈 Summary Statistics

Use `.describe()` to view summary statistics of numerical columns (mean, std, min, max, quartiles).

In [None]:
df_train.describe()

## ❓ Missing Values Check

Check how many missing values are present in each column. scikit-learn's `SVC` does not accept `NaN` values, so every missing entry has to be imputed or dropped before training.

In [None]:
df_train.isnull().sum()
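
Raw counts are easier to interpret as a share of all rows. The sketch below shows one way to express the same check as percentages; the `toy` frame is a hypothetical stand-in for `df_train`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives booleans; their mean is the fraction of missing rows per column
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```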

## 🔧 Handle Missing Values — `Age`

The `Age` column contains missing values. Instead of filling them with a single overall statistic, we fill each missing `Age` with the median age of passengers who share the same `Pclass` and `Sex`, since typical age differs across these groups.

In [None]:
# Filling `Age` column
df_train["Age"] = df_train["Age"].fillna(
    df_train.groupby(["Pclass", "Sex"])["Age"].transform("median")
)

df_train.isnull().sum()
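
To make the group-wise fill concrete, here is a minimal sketch on a hypothetical six-row frame (the values are made up for illustration). `transform("median")` returns a series aligned with the original rows, so it can be passed straight to `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two (Pclass, Sex) groups, each with one missing Age
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median age per group, broadcast back to every row of that group (NaN is skipped)
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Only the missing entries are replaced; observed ages stay untouched
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

One caveat: if every `Age` in a group were missing, that group's median would itself be `NaN` and the gap would remain, so a final `isnull().sum()` check is still worthwhile.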

## 🔧 Handle Missing Values — Embarked

The `Embarked` column has a small number of missing values. We fill them using the mode (most frequent port of embarkation).

In [None]:
# Fill missing `Embarked` values with the mode (most frequent port)
df_train["Embarked"] = df_train["Embarked"].fillna(df_train["Embarked"].mode()[0])
df_train.isnull().sum()

## 🗑 Remove Irrelevant Columns

Drop irrelevant or high-missing-value columns that won’t add value to the prediction:

- `PassengerId` (just an identifier)
- `Name` (too diverse)
- `Ticket` (not meaningful for survival prediction)
- `Cabin` (too many missing values)

In [None]:
# Drop columns that are not used as features
df_train.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"], inplace=True)

df_train.head()

## 🔄 One-Hot Encoding

Convert categorical variables (`Sex`, `Embarked`) into numerical dummy variables using `One-Hot Encoding`. This prepares the dataset for machine learning algorithms that require numeric input.

In [None]:
# One-Hot Encoding for categorical columns
df_train = pd.get_dummies(df_train, columns=['Sex', 'Embarked'])
df_train.head()
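
The sketch below shows what `pd.get_dummies` produces on a hypothetical three-row frame. It also shows the optional `drop_first=True` argument, which the notebook does not use; it removes one redundant column per encoded feature.

```python
import pandas as pd

# Hypothetical toy frame with the two categorical columns and one numeric column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 7.75],
})

# One indicator column per category, as in the notebook
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"])
print(list(encoded.columns))

# Optional variant: drop the first category of each feature to avoid redundancy
reduced = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(list(reduced.columns))
```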

## 🎯 Define Features and Target

- Target (`y`) = `Survived`
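- Features (`X`) = All remaining columns after preprocessing

We also apply `StandardScaler` to standardize each feature to zero mean and unit variance. The RBF kernel depends on distances between samples, so without scaling, wide-range features such as `Fare` would dominate.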

In [None]:
# Splitting the dataset into features and target variable
X = df_train.drop("Survived", axis=1)
y = df_train["Survived"]

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
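
Note that the cell above fits the scaler on the full dataset, so statistics from the rows that later become the test set influence the scaling. A common alternative is to fit the scaler on the training portion only, for example by wrapping it in a pipeline. The following is a minimal sketch of that idea on synthetic data from `make_classification` (an assumption for illustration, not the Titanic data and not what this notebook does).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 200 rows, 6 features)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then reuses it for X_te
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)

test_accuracy = clf.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```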

## 🔍 PCA Visualization

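Apply Principal Component Analysis (PCA) to project the numeric features onto two components for visualization. The cell below runs PCA on the unscaled `int64`/`float64` columns of `df_train` (the dtype filter excludes the one-hot columns when pandas stores them as `bool` or `uint8`), so high-variance features such as `Fare` are likely to dominate the components.
The scatter plot gives a rough idea of how survival outcomes are distributed in the reduced feature space.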

In [None]:
# Select only numeric columns
A = df_train.drop('Survived', axis=1).select_dtypes(include=['int64','float64'])
B = df_train['Survived']

# PCA reduce to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(A)

plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=B, cmap='coolwarm', s=50)
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title('Titanic Data (PCA Projection)')
plt.show()
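
Because PCA follows variance, the scale of each column affects the projection. The sketch below compares the explained variance ratio with and without standardization on hypothetical random data whose three columns have very different spreads; it is an illustration of the effect, not a result from the Titanic data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: three independent columns with standard deviations 1, 50 and 10
data = np.column_stack([
    rng.normal(0, 1, 300),
    rng.normal(0, 50, 300),
    rng.normal(0, 10, 300),
])

# PCA on raw values: the widest column carries almost all of the variance
raw_ratio = PCA(n_components=2).fit(data).explained_variance_ratio_

# PCA after standardization: each column contributes on an equal footing
scaled = StandardScaler().fit_transform(data)
scaled_ratio = PCA(n_components=2).fit(scaled).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```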


## 📊 Age vs Fare Visualization

Plot `Age` vs `Fare`, coloring points by `survival` outcome. This helps visualize how these two features relate to survival probability.

In [None]:
plt.scatter(df_train['Age'], df_train['Fare'], 
            c=df_train['Survived'], cmap='bwr', s=50)

plt.xlabel("Age")
plt.ylabel("Fare")
plt.title("Titanic Dataset: Age vs Fare (colored by Survival)")
plt.show()

## ✂️ Train-Test Split

Split the dataset into `training` (65%) and `testing` (35%) sets.
This allows us to train the model on one part of the data and evaluate performance on unseen data.
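
With imbalanced classes, a plain random split can leave the two sets with somewhat different class proportions. `train_test_split` accepts a `stratify` argument that preserves them. This notebook does not use it; the sketch below, on hypothetical labels, only shows the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60% class 0, 40% class 1
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=42, stratify=y_demo
)

# The mean of a 0/1 label vector is the share of class 1
print(f"train share of class 1: {y_tr.mean():.2f}")
print(f"test share of class 1:  {y_te.mean():.2f}")
```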

In [None]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

## 🤖 Train the SVM Model

Train a Support Vector Classifier (SVC) with:

- `kernel` = `'rbf'`
- `class_weight` = `'balanced'` (weights classes inversely to their frequency to handle class imbalance)
- `C` = `1.0` (regularization parameter; smaller values mean stronger regularization)
- `gamma` = `0.001` (how far the influence of a single training example reaches; smaller values mean a wider reach)

In [None]:
# Training the SVC model with an RBF kernel
model = SVC(kernel='rbf', class_weight='balanced', C=1.0, gamma=0.001, random_state=42)
model.fit(X_train, y_train)

# Printing Success Message
print("Model Trained Successfully.")

## 📈 Model Evaluation

Make predictions on the test set and evaluate using:

- `Classification Report` (precision, recall, F1-score)
- `Accuracy Score`

This shows how well the model predicts survival.

In [None]:
# Predicting the values and printing the classification report and accuracy score

y_pred = model.predict(X_test)
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
print(f"Accuracy Score: {accuracy_score(y_test, y_pred)}")
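
The precision and recall figures in the classification report are derived from the confusion matrix (`confusion_matrix` is imported above but not otherwise used). A minimal sketch of that relationship, using hypothetical labels rather than the model's actual predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (not the notebook's results)
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_hat = [0, 0, 1, 1, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

# Precision and recall for class 1 (Survived)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

print(cm)
print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"accuracy={accuracy_score(y_true, y_hat):.2f}")
```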

## ✅ Conclusion  

The Support Vector Classifier (SVC) with an **RBF kernel** achieved an **accuracy of ~73%** on the Titanic dataset.  

### 📊 Performance Breakdown  
- **Class 0 (Did Not Survive):**
  - Precision: **0.77**
  - Recall: **0.79**
  - F1-score: **0.78**
- **Class 1 (Survived):**
  - Precision: **0.67**
  - Recall: **0.65**
  - F1-score: **0.66**

### 🔎 Observations  
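- The model performs **slightly better at predicting non-survivors (Class 0)** compared to survivors (Class 1).  
- This gap is plausible since the dataset itself is skewed toward non-survivors (about 62% of passengers in `train.csv`), even with `class_weight='balanced'`.  
- **Feature scaling and missing value imputation** are standard preparation for SVMs and likely helped here, though this notebook does not compare against an unscaled or unimputed baseline.  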

### 💡 Possible Improvements  
- Apply **hyperparameter tuning** (GridSearchCV / RandomizedSearchCV) for optimal `C`, `gamma`, and kernel values.  
- Experiment with **other models** such as Random Forest, Gradient Boosting, or Logistic Regression for comparison.  
- Perform **feature engineering** (e.g., FamilySize, Title from Name, Fare per Person) to capture hidden patterns.  
- Use **cross-validation** for more robust and generalized performance evaluation.  
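
As a starting point for the tuning and cross-validation ideas above, here is a hedged sketch that combines them with `GridSearchCV`. It runs on synthetic data from `make_classification`, and the grid values are arbitrary assumptions; with the Titanic features, `X_demo` and `y_demo` would be replaced by the unscaled feature matrix and `Survived`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(class_weight="balanced")),
])

# Illustrative grid; the "svc__" prefix routes parameters to the SVC step
param_grid = {
    "svc__C": [0.1, 1.0, 10.0],
    "svc__gamma": [0.001, 0.01, 0.1],
    "svc__kernel": ["rbf"],
}

# 5-fold cross-validated search over all grid combinations
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Mean CV accuracy of best setting: {search.best_score_:.3f}")
```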

### 📌 Final Note  
With **73% accuracy**, the model demonstrates **reasonable predictive power**, but there is **room for improvement** through tuning and advanced feature engineering.  