# Titanic Survival Prediction System
## Machine Learning Model Development

This notebook develops a machine learning model to predict whether a passenger survived the Titanic disaster based on selected features.

### Selected Features (5 input features):
1. **Pclass** - Passenger class (1, 2, or 3)
2. **Sex** - Gender (male or female)
3. **Age** - Age in years
4. **SibSp** - Number of siblings/spouses aboard
5. **Fare** - Ticket fare

### Target Variable:
**Survived** (0 = Did not survive, 1 = Survived)

### Algorithm Used:
**Logistic Regression**
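
As a rough illustration of how logistic regression turns the five inputs into a survival probability, the sketch below applies the sigmoid function to a weighted sum of already-scaled features. The weights, bias, and feature vector are hypothetical placeholders chosen for illustration, not the parameters fitted later in this notebook.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Return P(Survived = 1) for one already-scaled feature vector."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical values for illustration only
# (feature order: Pclass, Sex, Age, SibSp, Fare).
example_weights = [-0.9, -1.3, -0.4, -0.3, 0.1]
example_bias = -0.6

# An all-zero vector means every feature sits at its mean after standardization.
example_features = [0.0, 0.0, 0.0, 0.0, 0.0]

p = predict_proba(example_features, example_weights, example_bias)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual default decision threshold
print(f"P(survived) = {p:.4f}, predicted label = {label}")
```

With an all-zero input the score reduces to the bias, so a negative bias alone pushes the prediction toward "did not survive".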

## Step 1: Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")

All libraries imported successfully!


## Step 2: Define the TitanicSurvivalPredictor Class

In [2]:
class TitanicSurvivalPredictor:
    """
    A machine learning model to predict Titanic passenger survival.
    
    Selected Features (5 input features):
    1. Pclass - Passenger class (1, 2, or 3)
    2. Sex - Gender (male or female)
    3. Age - Age in years
    4. SibSp - Number of siblings/spouses aboard
    5. Fare - Ticket fare
    
    Target Variable: Survived (0 = Did not survive, 1 = Survived)
    """
    
    def __init__(self, model_path='titanic_model.pkl'):
        """Initialize the predictor with model path."""
        self.model_path = model_path
        self.model = None
        self.scaler = None
        self.label_encoders = {}
        self.feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Fare']
        
    def load_data(self, filepath):
        """
        Load the Titanic dataset from CSV file.
        
        Args:
            filepath: Path to the CSV file
            
        Returns:
            DataFrame containing the loaded data
        """
        print("=" * 60)
        print("STEP 1: LOADING DATASET")
        print("=" * 60)
        
        df = pd.read_csv(filepath)
        print(f"Dataset loaded successfully!")
        print(f"Shape: {df.shape}")
        print(f"Columns: {list(df.columns)}\n")
        
        return df
    
    def preprocess_data(self, df, is_training=True):
        """
        Perform comprehensive data preprocessing.
        
        Includes:
        - Feature selection
        - Handling missing values
        - Encoding categorical variables
        - Feature scaling
        
        Args:
            df: DataFrame to preprocess
            is_training: Boolean indicating if this is training data
            
        Returns:
            Preprocessed features (X) and target (y) if training,
            or just features (X) if testing
        """
        print("=" * 60)
        print("STEP 2: DATA PREPROCESSING")
        print("=" * 60)
        
        # Feature selection: Select only required features
        print("\na) Feature Selection:")
        print(f"   Selected features: {self.feature_names}")
        print("   Target variable: Survived")
        
        df_processed = df.copy()
        
        # Display missing values before handling
        print(f"\n   Missing values before handling:")
        print(f"   - Pclass: {df_processed['Pclass'].isna().sum()}")
        print(f"   - Sex: {df_processed['Sex'].isna().sum()}")
        print(f"   - Age: {df_processed['Age'].isna().sum()}")
        print(f"   - SibSp: {df_processed['SibSp'].isna().sum()}")
        print(f"   - Fare: {df_processed['Fare'].isna().sum()}")
        
        # Handle missing values
        print(f"\nb) Handling Missing Values:")
        
        # Fill Age with median. Assign the result back instead of calling
        # fillna(inplace=True) on a column selection, which is deprecated in
        # pandas 2.x and has no effect under copy-on-write.
        age_median = df_processed['Age'].median()
        df_processed['Age'] = df_processed['Age'].fillna(age_median)
        print(f"   - Age: Filled missing values with median ({age_median:.2f})")
        
        # Fill Fare with median
        fare_median = df_processed['Fare'].median()
        df_processed['Fare'] = df_processed['Fare'].fillna(fare_median)
        print(f"   - Fare: Filled missing values with median ({fare_median:.2f})")
        
        # Drop rows with missing Sex (if any), since it cannot be median-imputed
        df_processed.dropna(subset=['Sex'], inplace=True)
        
        # Encode categorical variables
        print(f"\nc) Encoding Categorical Variables:")
        
        # Sex encoding
        if is_training:
            self.label_encoders['Sex'] = LabelEncoder()
            df_processed['Sex'] = self.label_encoders['Sex'].fit_transform(df_processed['Sex'])
            print(f"   - Sex: Encoded as {dict(zip(self.label_encoders['Sex'].classes_, self.label_encoders['Sex'].transform(self.label_encoders['Sex'].classes_)))}")
        else:
            df_processed['Sex'] = self.label_encoders['Sex'].transform(df_processed['Sex'])
        
        # Prepare features and target
        X = df_processed[self.feature_names]
        
        if is_training:
            y = df_processed['Survived']
            print(f"\nd) Feature Scaling:")
            print(f"   Using StandardScaler for numerical features")
            
            # Fit and transform features
            self.scaler = StandardScaler()
            X_scaled = self.scaler.fit_transform(X)
            
            print(f"\n   Data after preprocessing:")
            print(f"   - Features shape: {X_scaled.shape}")
            print(f"   - Target shape: {y.shape}")
            print(f"   - Survival distribution:\n{y.value_counts()}\n")
            
            return X_scaled, y
        else:
            # Transform only (using fitted scaler)
            X_scaled = self.scaler.transform(X)
            return X_scaled
    
    def train_model(self, X_train, y_train):
        """
        Train the Logistic Regression model.
        
        Args:
            X_train: Training features
            y_train: Training target
        """
        print("=" * 60)
        print("STEP 3: MODEL TRAINING")
        print("=" * 60)
        
        print("\nAlgorithm Selected: Logistic Regression")
        print("Training the model...")
        
        self.model = LogisticRegression(random_state=42, max_iter=1000)
        self.model.fit(X_train, y_train)
        
        # Training accuracy
        train_accuracy = self.model.score(X_train, y_train)
        print(f"\nModel Training Complete!")
        print(f"Training Accuracy: {train_accuracy:.4f}\n")
    
    def evaluate_model(self, X_test, y_test):
        """
        Evaluate the trained model on test data.
        
        Args:
            X_test: Test features
            y_test: Test target
        """
        print("=" * 60)
        print("STEP 4: MODEL EVALUATION")
        print("=" * 60)
        
        # Make predictions
        y_pred = self.model.predict(X_test)
        
        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        print(f"\nTest Accuracy: {accuracy:.4f}")
        
        # Classification Report
        print("\nClassification Report:")
        print("-" * 60)
        print(classification_report(y_test, y_pred, 
                                   target_names=['Did Not Survive', 'Survived']))
        
        # Confusion Matrix
        print("\nConfusion Matrix:")
        cm = confusion_matrix(y_test, y_pred)
        print(cm)
        print(f"\nTrue Negatives: {cm[0, 0]}")
        print(f"False Positives: {cm[0, 1]}")
        print(f"False Negatives: {cm[1, 0]}")
        print(f"True Positives: {cm[1, 1]}\n")
    
    def save_model(self):
        """Save the trained model and scaler to disk using pickle."""
        print("=" * 60)
        print("STEP 5: SAVING THE MODEL")
        print("=" * 60)
        
        # Save model using pickle
        with open(self.model_path, 'wb') as f:
            pickle.dump(self.model, f)
        print(f"\nModel saved to: {self.model_path}")
        
        # Save scaler
        scaler_path = self.model_path.replace('.pkl', '_scaler.pkl')
        with open(scaler_path, 'wb') as f:
            pickle.dump(self.scaler, f)
        print(f"Scaler saved to: {scaler_path}")
        
        # Save label encoders
        encoder_path = self.model_path.replace('.pkl', '_encoders.pkl')
        with open(encoder_path, 'wb') as f:
            pickle.dump(self.label_encoders, f)
        print(f"Label encoders saved to: {encoder_path}\n")
    
    def load_model(self):
        """Load the trained model and scaler from disk."""
        print("=" * 60)
        print("STEP 6: RELOADING MODEL FOR PREDICTION")
        print("=" * 60)
        
        # Load model
        with open(self.model_path, 'rb') as f:
            self.model = pickle.load(f)
        print(f"\nModel loaded from: {self.model_path}")
        
        # Load scaler
        scaler_path = self.model_path.replace('.pkl', '_scaler.pkl')
        with open(scaler_path, 'rb') as f:
            self.scaler = pickle.load(f)
        print(f"Scaler loaded from: {scaler_path}")
        
        # Load label encoders
        encoder_path = self.model_path.replace('.pkl', '_encoders.pkl')
        with open(encoder_path, 'rb') as f:
            self.label_encoders = pickle.load(f)
        print(f"Label encoders loaded from: {encoder_path}\n")
    
    def predict_survival(self, passenger_data):
        """
        Predict survival for a single or multiple passengers.
        
        Args:
            passenger_data: Dictionary or list of dictionaries with passenger features
                           Keys should be: Pclass, Sex, Age, SibSp, Fare
        
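        Returns:
            Tuple (predictions, probabilities): an array of 0/1 labels and an
            array of shape (n_passengers, 2) holding
            [P(did not survive), P(survived)] for each passenger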
        """
        # Handle single passenger
        if isinstance(passenger_data, dict):
            passenger_data = [passenger_data]
        
        # Create DataFrame
        df_pred = pd.DataFrame(passenger_data)
        
        # Encode Sex
        df_pred['Sex'] = self.label_encoders['Sex'].transform(df_pred['Sex'])
        
        # Select features
        X_pred = df_pred[self.feature_names]
        
        # Scale features
        X_pred_scaled = self.scaler.transform(X_pred)
        
        # Make predictions
        predictions = self.model.predict(X_pred_scaled)
        probabilities = self.model.predict_proba(X_pred_scaled)
        
        return predictions, probabilities
    
    def run_full_pipeline(self, train_filepath):
        """
        Run the complete pipeline: load, preprocess, train, evaluate, and save.
        
        Args:
            train_filepath: Path to the training dataset
        """
        # Load data
        df = self.load_data(train_filepath)
        
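        # Preprocess data
        # Note: the imputation medians and the scaler are fit on the full
        # dataset here, before the split below, so test rows influence them.
        X, y = self.preprocess_data(df, is_training=True)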
        
        # Split data into training and testing sets (80-20 split)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )
        
        print("=" * 60)
        print("DATA SPLIT")
        print("=" * 60)
        print(f"Training set size: {X_train.shape[0]} samples")
        print(f"Test set size: {X_test.shape[0]} samples\n")
        
        # Train model
        self.train_model(X_train, y_train)
        
        # Evaluate model
        self.evaluate_model(X_test, y_test)
        
        # Save model
        self.save_model()

print("TitanicSurvivalPredictor class defined successfully!")

TitanicSurvivalPredictor class defined successfully!
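
The figures printed by `evaluate_model` all derive from four confusion-matrix counts. The sketch below computes those counts and the resulting accuracy, precision, and recall by hand. The label lists are hypothetical toy data, not results from this notebook, and the helper names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (0 = negative, 1 = positive)."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp


def summarize(tn, fp, fn, tp):
    """Derive accuracy, precision, and recall from the four counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical toy labels for illustration only.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
metrics = summarize(tn, fp, fn, tp)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(metrics)
```

The count order matches the matrix layout used in the class above: row 0 holds TN and FP, row 1 holds FN and TP.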


## Step 3: Initialize and Run the Complete Pipeline

In [3]:
# Initialize the predictor
predictor = TitanicSurvivalPredictor(model_path='titanic_model.pkl')

# Run the complete pipeline
predictor.run_full_pipeline('train.csv')

STEP 1: LOADING DATASET
Dataset loaded successfully!
Shape: (891, 12)
Columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

STEP 2: DATA PREPROCESSING

a) Feature Selection:
   Selected features: ['Pclass', 'Sex', 'Age', 'SibSp', 'Fare']
   Target variable: Survived

   Missing values before handling:
   - Pclass: 0
   - Sex: 0
   - Age: 177
   - SibSp: 0
   - Fare: 0

b) Handling Missing Values:
   - Age: Filled missing values with median (28.00)
   - Fare: Filled missing values with median (14.45)

c) Encoding Categorical Variables:
   - Sex: Encoded as {'female': np.int64(0), 'male': np.int64(1)}

d) Feature Scaling:
   Using StandardScaler for numerical features

   Data after preprocessing:
   - Features shape: (891, 5)
   - Target shape: (891,)
   - Survival distribution:
Survived
0    549
1    342
Name: count, dtype: int64

DATA SPLIT
Training set size: 712 samples
Test set size: 179 samples

STEP 3: MODEL 

## Step 4: Demonstrate Model Reload and Prediction

This step demonstrates that the saved model can be reloaded and used for prediction without retraining.
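
The save-and-reload idea can be sketched with the standard library alone. The example below pickles a plain dictionary standing in for the fitted artifacts and reads it back from a temporary directory. The file name `artifacts.pkl` and the dictionary layout are assumptions for illustration. The notebook itself writes three separate files.

```python
import os
import pickle
import tempfile

# Stand-in for fitted objects; a real run would pickle the model, scaler,
# and encoders rather than this illustrative dictionary.
artifacts = {
    "feature_names": ["Pclass", "Sex", "Age", "SibSp", "Fare"],
    "sex_mapping": {"female": 0, "male": 1},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "artifacts.pkl")  # hypothetical file name

    # Serialize to disk in binary mode.
    with open(path, "wb") as f:
        pickle.dump(artifacts, f)

    # Deserialize into a new, independent object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifacts)
```

Bundling related artifacts in one file is one way to keep them from drifting out of sync. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.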

In [4]:
print("=" * 60)
print("DEMONSTRATION: RELOADING AND PREDICTING")
print("=" * 60)

# Create a new predictor instance
predictor_reload = TitanicSurvivalPredictor(model_path='titanic_model.pkl')

# Load the saved model
predictor_reload.load_model()

print("\nTesting model with sample passengers:\n")

DEMONSTRATION: RELOADING AND PREDICTING
STEP 6: RELOADING MODEL FOR PREDICTION

Model loaded from: titanic_model.pkl
Scaler loaded from: titanic_model_scaler.pkl
Label encoders loaded from: titanic_model_encoders.pkl


Testing model with sample passengers:



### Sample 1: Wealthy Female Passenger (High Chance of Survival)

In [5]:
sample_1 = {
    'Pclass': 1,
    'Sex': 'female',
    'Age': 35,
    'SibSp': 1,
    'Fare': 71.28
}

pred, prob = predictor_reload.predict_survival(sample_1)
survival_label = "SURVIVED" if pred[0] == 1 else "DID NOT SURVIVE"

print("Passenger 1: Wealthy female (1st class, 35 years old)")
print(f"  Prediction: {survival_label}")
print(f"  Probability of NOT surviving: {prob[0][0]:.4f}")
print(f"  Probability of SURVIVING: {prob[0][1]:.4f}\n")

Passenger 1: Wealthy female (1st class, 35 years old)
  Prediction: SURVIVED
  Probability of NOT surviving: 0.0973
  Probability of SURVIVING: 0.9027



### Sample 2: Poor Male Passenger (Low Chance of Survival)

In [6]:
sample_2 = {
    'Pclass': 3,
    'Sex': 'male',
    'Age': 25,
    'SibSp': 0,
    'Fare': 7.75
}

pred, prob = predictor_reload.predict_survival(sample_2)
survival_label = "SURVIVED" if pred[0] == 1 else "DID NOT SURVIVE"

print("Passenger 2: Poor male (3rd class, 25 years old)")
print(f"  Prediction: {survival_label}")
print(f"  Probability of NOT surviving: {prob[0][0]:.4f}")
print(f"  Probability of SURVIVING: {prob[0][1]:.4f}\n")

Passenger 2: Poor male (3rd class, 25 years old)
  Prediction: DID NOT SURVIVE
  Probability of NOT surviving: 0.8908
  Probability of SURVIVING: 0.1092



### Sample 3: Middle-Class Young Male

In [7]:
sample_3 = {
    'Pclass': 2,
    'Sex': 'male',
    'Age': 18,
    'SibSp': 2,
    'Fare': 21.07
}

pred, prob = predictor_reload.predict_survival(sample_3)
survival_label = "SURVIVED" if pred[0] == 1 else "DID NOT SURVIVE"

print("Passenger 3: Middle-class male (2nd class, 18 years old)")
print(f"  Prediction: {survival_label}")
print(f"  Probability of NOT surviving: {prob[0][0]:.4f}")
print(f"  Probability of SURVIVING: {prob[0][1]:.4f}\n")

Passenger 3: Middle-class male (2nd class, 18 years old)
  Prediction: DID NOT SURVIVE
  Probability of NOT surviving: 0.7865
  Probability of SURVIVING: 0.2135



## Summary

### Key Accomplishments:

✓ **Model was saved to disk using pickle** - Files saved:
  - `titanic_model.pkl` - The trained Logistic Regression model
  - `titanic_model_scaler.pkl` - The StandardScaler for feature normalization
  - `titanic_model_encoders.pkl` - The dictionary of LabelEncoders for categorical features (here, only `Sex`)

✓ **Model was successfully reloaded without retraining**

✓ **Model can make predictions on new data** with probability scores

✓ **Feature Selection** - Selected and preprocessed 5 input features

✓ **Data Preprocessing** - Handled missing values, encoded categorical variables, and scaled features

✓ **Model Evaluation** - Generated a classification report and a confusion matrix on the held-out test set
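
The two numeric preprocessing steps listed above, median imputation and standardization, can be sketched in plain Python as follows. The age values are hypothetical, and the helper names are illustrative rather than part of the notebook's class.

```python
import statistics


def fill_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values], median


def standardize(values):
    """Rescale to zero mean and unit variance using the population std,
    which is the convention StandardScaler follows."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical ages with two missing entries, for illustration only.
ages = [22.0, None, 38.0, 26.0, None, 35.0]

filled, median = fill_with_median(ages)
scaled = standardize(filled)
print(f"median used for imputation: {median}")
print([round(v, 3) for v in scaled])
```

To keep evaluation honest, the median, mean, and standard deviation would normally be computed on training rows only and then reused unchanged for test rows and new passengers.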