# Logistic Regression

This notebook implements a **Logistic Regression** model for binary classification from scratch using Python and NumPy. The model's weights and bias are learned with **gradient descent**.

We train our custom model on the Pima Indians Diabetes dataset and then compare its accuracy against the highly optimized Logistic Regression implementation from the `scikit-learn` library.

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

## 1. Core Model Functions

Here we define the building blocks of our model:
- **`sigmoid_function`**: The activation function that maps any real value to a probability between 0 and 1.
- **`sigmoid`**: Applies `sigmoid_function` element-wise to an array.
- **`compute_gradients`**: Calculates the gradients (derivatives) of the binary cross-entropy loss with respect to the weights and bias, which are needed to update the model's parameters.
- **`predict`**: Uses the trained model (weights and bias) to make predictions on new data.
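
For reference, the building blocks above correspond to the standard logistic regression formulas. The notation is introduced here for illustration and does not appear in the code: $X$ is the feature matrix with rows $x_i$, $n$ is the number of samples, $w$ and $b$ are the weights and bias, and $\hat{y}_i$ is the predicted probability for sample $i$.

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y}_i = \sigma\left(w^\top x_i + b\right)
$$

$$
L(w, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]
$$

$$
\frac{\partial L}{\partial w} = \frac{1}{n} X^\top \left(\hat{y} - y\right), \qquad \frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)
$$

The two gradient expressions are what `compute_gradients` evaluates as `gradients_w` and `gradient_b`.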

In [2]:
def sigmoid_function(x):
    """Numerically stable implementation of the sigmoid function."""
    if x >= 0:
        # For non-negative x, exp(-x) cannot overflow.
        z = np.exp(-x)
        return 1 / (1 + z)
    else:
        # For negative x, exp(x) cannot overflow.
        z = np.exp(x)
        return z / (1 + z)

def sigmoid(x):
    """Element-wise wrapper for the sigmoid function."""
    return np.array([sigmoid_function(value) for value in x])

def compute_gradients(x, y_true, y_pred):
    """Computes the gradients of the binary cross-entropy loss function."""
    difference = y_pred - y_true
    gradient_b = np.mean(difference)
    gradients_w = np.matmul(x.transpose(), difference) / len(y_true)
    return gradients_w, gradient_b

def predict(x, weights, bias):
    """Returns 0/1 class labels by thresholding the predicted probabilities at 0.5."""
    x_dot_weights = np.matmul(weights, x.transpose()) + bias
    probabilities = sigmoid(x_dot_weights)
    return np.array([1 if p > 0.5 else 0 for p in probabilities])
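
The `sigmoid` wrapper calls `sigmoid_function` once per element in a Python loop, which can become slow on large arrays. A vectorized variant that keeps the same two-branch stability trick might look like the sketch below. `sigmoid_vectorized` is a hypothetical name, not part of the notebook's code, and the function assumes an array-like input with at least one dimension.

```python
import numpy as np

def sigmoid_vectorized(x):
    """Vectorized, numerically stable sigmoid (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)

    # Non-negative inputs: exp(-x) stays in (0, 1], so it cannot overflow.
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))

    # Negative inputs: exp(x) stays in (0, 1), so it cannot overflow either.
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

# Example call on a small array.
print(sigmoid_vectorized(np.array([-2.0, 0.0, 2.0])))
```

It could be swapped in wherever `sigmoid` is called, since both take an array and return an array of the same length.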

## 2. Training with Gradient Descent

The `train` function implements the gradient descent algorithm. It iteratively adjusts the model's **weights** and **bias** to minimize the prediction error over a number of **epochs**.
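
The prediction error being minimized is the binary cross-entropy loss whose gradients `compute_gradients` calculates. The `train` function never evaluates the loss itself, so a small helper like the one below could be used to check that the loss actually falls across epochs. This is a sketch: `binary_cross_entropy` and its `eps` clipping parameter are assumptions added for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so that log() stays finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))

# Uninformative predictions of 0.5 give a loss of ln(2), about 0.693.
print(binary_cross_entropy([0, 1], [0.5, 0.5]))
```

Calling it on the predicted probabilities every few hundred epochs gives a quick signal of whether the chosen learning rate and epoch count are enough for convergence.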

In [3]:
def train(x, y, epochs, learning_rate):
    """Trains the logistic regression model using batch gradient descent."""
    weights = np.zeros(x.shape[1])
    bias = 0.0

    for _ in range(epochs):
        # 1. Get predicted probabilities
        x_dot_weights = np.matmul(weights, x.transpose()) + bias
        pred = sigmoid(x_dot_weights)

        # 2. Compute gradients of the loss with respect to weights and bias
        grad_w, grad_b = compute_gradients(x, y, pred)

        # 3. Step against the gradient
        weights -= learning_rate * grad_w
        bias -= learning_rate * grad_b

    return weights, bias

## 3. Data Loading and Splitting

We load the diabetes dataset, build the feature matrix (`x`, using only the `Glucose` and `BMI` columns) and the target vector (`y`, the `Outcome` column), and then split them into training and testing sets. This lets us evaluate the model on data it has never seen before.

In [4]:
# Set the path to your CSV file.
# For best results, place 'diabetes.csv' in the same folder as this notebook.
path = r"diabetes.csv"

try:
    file = pd.read_csv(path)
    print(f"File '{path}' loaded successfully!")
    
    # Prepare the data
    y = np.array(file["Outcome"])
    x = np.array(file[["Glucose", "BMI"]])

    # Split the dataset into training and testing sets
    # An 80/20 split is common practice (train_size=0.8)
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)
    print("Data split complete:")
    print(f"x_train shape: {x_train.shape}")
    print(f"x_test shape:  {x_test.shape}")

except FileNotFoundError:
    print(f"Error: The file was not found at '{path}'")
    print("Please make sure the 'diabetes.csv' file is in the same directory as the notebook, or update the 'path' variable.")

File 'diabetes.csv' loaded successfully!
Data split complete:
x_train shape: (614, 2)
x_test shape:  (154, 2)


## 4. Training and Evaluating Our Custom Model

Now we train our custom model using the training data and then evaluate its performance by making predictions on the test set and calculating the accuracy.

In [5]:
# Set hyperparameters
epochs = 2000
learning_rate = 0.001

# Train the model and get the learned parameters
weights, bias = train(x_train, y_train, epochs, learning_rate)

# Make predictions on the test set
y_pred = predict(x_test, weights, bias)

print("--- Custom Logistic Regression Model ---")
print(f"Learned Weights: {weights}")
print(f"Learned Bias: {bias}")
print(f"Our model accuracy: {accuracy_score(y_test, y_pred):.4f}")

--- Custom Logistic Regression Model ---
Learned Weights: [ 0.02700935 -0.06431908]
Learned Bias: -0.08658905905005242
Our model accuracy: 0.4026
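
An accuracy of 0.4026 is below 0.5 on a binary task, which suggests that gradient descent has not reached a useful solution with these hyperparameters. One plausible cause, stated here as an assumption rather than something verified in this notebook, is that `Glucose` and `BMI` are used on their raw scales, so a single small learning rate converges slowly (the learned bias has barely moved from 0). A common remedy is to standardize the features using statistics from the training set only. The helper below is a minimal sketch; `standardize` is a hypothetical name, not part of the code above.

```python
import numpy as np

def standardize(train, test):
    """Scale features to zero mean and unit variance using training statistics only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)

    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std[std == 0] = 1.0

    return (train - mean) / std, (test - mean) / std

# Toy example with two features on very different scales.
toy_train = np.array([[100.0, 20.0], [150.0, 30.0], [200.0, 40.0]])
toy_test = np.array([[125.0, 25.0]])
toy_train_s, toy_test_s = standardize(toy_train, toy_test)
print(toy_train_s.mean(axis=0), toy_train_s.std(axis=0))
```

With the notebook's variables this would be `x_train_s, x_test_s = standardize(x_train, x_test)`, followed by `train(x_train_s, y_train, epochs, learning_rate)` and `predict(x_test_s, weights, bias)`. A larger learning rate may also be appropriate on standardized data, and the resulting accuracy would need to be re-measured.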


## 5. Comparison with Scikit-learn

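Finally, we train a `scikit-learn` `LogisticRegression` model on the same training data to see how our implementation compares to a standard, highly optimized library. Note that with its default settings `LogisticRegression` applies L2 regularization and uses the L-BFGS solver rather than plain gradient descent, so the two models are not trained in exactly the same way.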

In [6]:
# Create an instance of the sklearn model
model = LogisticRegression()

# Train the model
model.fit(x_train, y_train)

# Make predictions
sklearn_pred = model.predict(x_test)

# Calculate accuracy
sklearn_accuracy = accuracy_score(y_test, sklearn_pred)

print("--- Scikit-learn Logistic Regression Model ---")
print(f"Sklearn model accuracy: {sklearn_accuracy:.4f}")

--- Scikit-learn Logistic Regression Model ---
Sklearn model accuracy: 0.7662
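
To put both accuracies in context, it can help to compare them with a trivial classifier that always predicts the most frequent class in the training set. The sketch below is one way to compute that baseline. `majority_baseline_accuracy` is a hypothetical helper, and it assumes the labels are non-negative integers such as the 0/1 `Outcome` column.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    # np.bincount counts occurrences of each non-negative integer label.
    majority_label = np.bincount(y_train).argmax()
    return float(np.mean(y_test == majority_label))

# Toy example: label 0 is the majority in the training labels.
print(majority_baseline_accuracy([0, 0, 1], [0, 1, 0, 0]))  # prints 0.75
```

Calling `majority_baseline_accuracy(y_train, y_test)` with the notebook's split gives the accuracy a model must beat to be doing better than ignoring the features entirely.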
