# Logistic Regression

This notebook implements a **Logistic Regression** model from scratch to predict passenger survival on the Titanic. The process includes:
1. Loading the dataset.
2. Cleaning and preprocessing the data (handling missing values and categorical features).
3. Training a custom logistic regression model using gradient descent.
4. Evaluating the model's accuracy.
5. Comparing its performance to the standard `scikit-learn` implementation.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

## 1. Core Model Functions

Here we define the functions our custom Logistic Regression model needs: the sigmoid activation function, the gradient calculation, the training loop, and the prediction function.
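
For reference, the functions in the next cell appear to implement the standard gradient-descent update for binary cross-entropy. The symbols here are notation introduced for this note, not the notebook's own: $X$ is the feature matrix with $n$ rows, $y$ the labels, $w$ and $b$ correspond to `weights` and `bias`, and $\eta$ to `learning_rate`.

$$
\hat{y} = \sigma(Xw + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
\mathcal{L}(w, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$

$$
\nabla_w \mathcal{L} = \frac{1}{n} X^\top (\hat{y} - y), \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)
$$

$$
w \leftarrow w - \eta \, \nabla_w \mathcal{L}, \qquad b \leftarrow b - \eta \, \frac{\partial \mathcal{L}}{\partial b}
$$

The two gradient expressions match what `compute_gradients` returns as `gradients_w` and `gradient_b`.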

In [2]:
def train(x, y, epochs, learning_rate):
    """Trains the logistic regression model using gradient descent."""
    weights = np.zeros(x.shape[1])
    bias = 0
    for i in range(epochs):
        x_dot_weights = np.matmul(weights, x.transpose()) + bias
        pred = sigmoid(x_dot_weights)
        error_w, error_b = compute_gradients(x, y, pred)
        weights -= learning_rate * error_w
        bias -= learning_rate * error_b
    return weights, bias

def compute_gradients(x, y_true, y_pred):
    """Computes the gradients of the binary cross-entropy loss function."""
    difference = y_pred - y_true
    gradient_b = np.mean(difference)
    gradients_w = np.matmul(x.transpose(), difference) / len(y_true)
    return gradients_w, gradient_b

def sigmoid(x):
    """Element-wise wrapper for the sigmoid function."""
    return np.array([sigmoid_function(value) for value in x])

def sigmoid_function(x):
    """Numerically stable implementation of the sigmoid function."""
    if x >= 0:
        z = np.exp(-x)
        return 1 / (1 + z)
    else:
        z = np.exp(x)
        return z / (1 + z)

def predict(x, weights, bias):
    """Uses the learned weights and bias to make predictions."""
    x_dot_weights = np.matmul(weights, x.transpose()) + bias
    probabilities = sigmoid(x_dot_weights)
    return [1 if p > 0.5 else 0 for p in probabilities]
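
The list comprehension in `sigmoid` calls a Python function once per element. A vectorized variant can keep the same two-branch stability trick while operating on the whole array at once. The sketch below is one possible way to do that, assuming a 1-D NumPy array as input; `sigmoid_vectorized` is a hypothetical name and not part of the notebook.

```python
import numpy as np

def sigmoid_vectorized(z):
    """Numerically stable sigmoid for a 1-D array (hypothetical helper)."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, exp(z) cannot overflow.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

# Extreme inputs saturate to 0 and 1 without overflow warnings.
demo = sigmoid_vectorized(np.array([-1000.0, 0.0, 1000.0]))
print(demo)
```

Because it accepts and returns an array, it could stand in for `sigmoid` inside `train` and `predict` without other changes.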

## 2. Data Loading and Preprocessing

This is a critical step where we load the raw Titanic dataset and clean it up. We will:
- Convert the categorical features `Sex` and `Embarked` into numerical values.
- Handle missing data by filling `NaN` values in `Age` with the mean age and in `Embarked` with the most common port (`S`).
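
Mapping `Embarked` to the integers 0, 1, 2 implies an ordering between ports that a linear model will treat as meaningful. One-hot encoding is a common alternative that avoids this. The sketch below shows it on a small made-up frame using `pd.get_dummies`; the `toy` data is hypothetical and this is not what the notebook does.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the Titanic features.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per port instead of a single ordinal column.
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=float)
print(encoded)
```

Each row then has exactly one of `Embarked_C`, `Embarked_Q`, `Embarked_S` set to 1.0, at the cost of two extra feature columns.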

In [3]:
# Set the path to your CSV file.
path = r"Titanic-Dataset.csv"

try:
    file = pd.read_csv(path)
    print(f"File '{path}' loaded successfully!")

    # --- Feature Extraction and Cleaning ---
    y = file["Survived"].values
    # Create a copy to avoid SettingWithCopyWarning
    features_df = file[["Age", "Sex", "Pclass", "Fare", "Embarked"]].copy()
    
    # 1. Fill missing Age values with the mean.
    # Assigning the result back avoids the chained `inplace=True` call,
    # which pandas warns will stop working in pandas 3.0.
    features_df['Age'] = features_df['Age'].fillna(features_df['Age'].mean())
    
    # 2. Convert 'Sex' to numerical (male: 1, female: 0)
    features_df['Sex'] = features_df['Sex'].map({'male': 1, 'female': 0})

    # 3. Convert 'Embarked' to numerical (S: 0, C: 1, Q: 2)
    # First, fill any missing embarked values with the most common port ('S')
    features_df['Embarked'] = features_df['Embarked'].fillna('S')
    features_df['Embarked'] = features_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

    # Create the final feature matrix `x`
    x = features_df.values

    print("\nData cleaning complete. First 5 rows of processed features (x):")
    print(pd.DataFrame(x, columns=features_df.columns).head())

except FileNotFoundError:
    print(f"Error: The file was not found at '{path}'")
    print("Please make sure 'Titanic-Dataset.csv' is in the same directory as the notebook.")
    x, y = None, None

File 'Titanic-Dataset.csv' loaded successfully!

Data cleaning complete. First 5 rows of processed features (x):
    Age  Sex  Pclass     Fare  Embarked
0  22.0  1.0     3.0   7.2500       0.0
1  38.0  0.0     1.0  71.2833       1.0
2  26.0  0.0     3.0   7.9250       0.0
3  35.0  0.0     1.0  53.1000       0.0
4  35.0  1.0     3.0   8.0500       0.0


## 3. Splitting Data for Training and Testing

We'll split our processed data into a training set and a testing set. The model will learn from the training set, and we'll evaluate its performance on the unseen testing set.

In [4]:
if x is not None:
    # A common 80/20 split is used for training and testing.
    # random_state ensures that the split is the same every time we run the code.
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

    print("Data split complete:")
    print(f"x_train shape: {x_train.shape}")
    print(f"x_test shape:  {x_test.shape}")

Data split complete:
x_train shape: (712, 5)
x_test shape:  (179, 5)
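
The split above is purely random. `train_test_split` also accepts a `stratify` argument, which keeps the class proportions the same in both splits. Whether that changes anything for this dataset is not tested here; the sketch below uses made-up labels (`demo_x`, `demo_y`) only to show the call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 40% positive class.
demo_x = np.arange(200, dtype=float).reshape(100, 2)
demo_y = np.array([0] * 60 + [1] * 40)

# stratify=demo_y preserves the 60/40 class ratio in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    demo_x, demo_y, train_size=0.8, random_state=42, stratify=demo_y
)
print(y_tr.mean(), y_te.mean())
```

With the notebook's variables, the equivalent change would be adding `stratify=y` to the existing call.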


## 4. Training and Evaluating Our Custom Model

Now we train our custom model using the training data and then evaluate its performance by making predictions on the test set and calculating the accuracy.
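
Accuracy alone does not show whether gradient descent has converged. Tracking the binary cross-entropy, the loss that `compute_gradients` differentiates, is one way to check. The helper below is a minimal sketch; `binary_cross_entropy` and its `eps` clipping parameter are hypothetical additions, not part of the notebook.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))

# With all-zero weights and bias every prediction is 0.5,
# so training starts from a loss of ln(2).
initial_loss = binary_cross_entropy([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(initial_loss)
```

Evaluating this every few hundred epochs inside `train` would show whether the loss is still falling when training stops.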

In [5]:
if x is not None:
    # Train the model and get the learned parameters
    weights, bias = train(x_train, y_train, epochs=1000, learning_rate=0.0001)

    # Make predictions on the test set
    y_pred = predict(x_test, weights, bias)

    print("--- Custom Logistic Regression Model ---")
    print(f"Learned Weights: {weights}")
    print(f"Learned Bias: {bias}")
    print(f"Our model accuracy: {accuracy_score(y_test, y_pred):.4f}")

--- Custom Logistic Regression Model ---
Learned Weights: [-0.0272077  -0.01265766 -0.01469868  0.0128648   0.00156715]
Learned Bias: -0.002478327364715696
Our model accuracy: 0.6536
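
The learned weights are all small in magnitude, which suggests that 1000 steps at a learning rate of 0.0001 leave the model far from convergence. The tiny learning rate is likely needed because `Age` and `Fare` are on much larger scales than `Sex` or `Embarked`. Standardizing the features is a common remedy. The sketch below shows the idea on synthetic stand-in data; `standardize` and the `demo_*` arrays are hypothetical, and the effect on the Titanic accuracy is not measured here.

```python
import numpy as np

def standardize(train, test):
    """Scale both splits using statistics from the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (train - mean) / std, (test - mean) / std

# Hypothetical columns on very different scales.
rng = np.random.default_rng(0)
loc, scale = [30.0, 0.5, 35.0], [14.0, 0.5, 50.0]
demo_train = rng.normal(loc=loc, scale=scale, size=(200, 3))
demo_test = rng.normal(loc=loc, scale=scale, size=(50, 3))

demo_train_s, demo_test_s = standardize(demo_train, demo_test)
print(demo_train_s.mean(axis=0).round(6))
print(demo_train_s.std(axis=0).round(6))
```

With the notebook's variables this would be `x_train_s, x_test_s = standardize(x_train, x_test)` before calling `train`. On standardized inputs a much larger learning rate is usually workable, though the right value would need to be found by experiment.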


## 5. Comparison with Scikit-learn

Finally, we train a `scikit-learn` `LogisticRegression` model on the exact same data to see how our implementation compares to a standard, highly optimized library.

In [6]:
if x is not None:
    # Create an instance of the sklearn model
    model = LogisticRegression(max_iter=1000)

    # Train the model
    model.fit(x_train, y_train)

    # Make predictions
    sklearn_pred = model.predict(x_test)

    # Calculate accuracy
    sklearn_accuracy = accuracy_score(y_test, sklearn_pred)

    print("--- Scikit-learn Logistic Regression Model ---")
    print(f"Sklearn model accuracy: {sklearn_accuracy:.4f}")

--- Scikit-learn Logistic Regression Model ---
Sklearn model accuracy: 0.7933
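
Part of the gap may come from the optimization rather than the model itself: by default `LogisticRegression` uses the `lbfgs` solver with an L2 penalty (`C=1.0`), while the custom loop runs plain, unregularized gradient descent on unscaled features. One way to compare the two beyond accuracy is to inspect the fitted parameters, which scikit-learn exposes as `coef_` and `intercept_`. The sketch below uses synthetic data (`demo_x`, `demo_y` are hypothetical) to show their shapes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical two-feature problem with a known linear signal.
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(300, 2))
noise = rng.normal(scale=0.5, size=300)
demo_y = (demo_x[:, 0] - 0.5 * demo_x[:, 1] + noise > 0).astype(int)

demo_model = LogisticRegression(max_iter=1000).fit(demo_x, demo_y)

# For binary problems coef_ has shape (1, n_features), intercept_ shape (1,).
print(demo_model.coef_.shape, demo_model.intercept_.shape)
print(demo_model.coef_[0], demo_model.intercept_[0])
```

In the notebook, `model.coef_[0]` and `model.intercept_[0]` could be printed next to `weights` and `bias` to see how far apart the two solutions are.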
