## Logistic Regression Explained

### Introduction

Logistic regression is a statistical model used for predicting binary outcomes, like whether an email is spam or not spam, based on given features.

### How It Works

1. **Binary Outcome**: Logistic regression predicts outcomes that are binary, meaning they have two possible values (e.g., 0 or 1, yes or no).

2. **Linear Relationship**: It assumes a linear relationship between the input features and the log-odds of the outcome. The log-odds are then converted into probabilities by the sigmoid (logistic) function described next.

3. **Sigmoid Function**: Logistic regression uses the sigmoid function to map the predicted values to probabilities. The sigmoid function looks like this:

   $$\sigma(z) = \frac{1}{1 + e^{-z}}$$

   Here, \( z \) is a linear combination of the input features and their corresponding coefficients.
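
As a quick illustration of the formula above, the following sketch evaluates the sigmoid at a few points with NumPy. The standalone function name `sigmoid` and the sample inputs are illustrative choices, not part of the original text.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs: a large negative score, zero, and a large positive score
z = np.array([-5.0, 0.0, 5.0])
probs = sigmoid(z)
print(probs)  # sigmoid(0) is exactly 0.5; the other two values are close to 0 and 1
```

The output is symmetric around 0.5: `sigmoid(-z)` equals `1 - sigmoid(z)`.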

### Assumptions

- **Linearity**: It assumes that the relationship between the input features and the log-odds of the outcome is linear.
  
- **Independence of Errors**: The observations are assumed to be independent of one another, so the error on one example carries no information about the error on another.

- **No Multicollinearity**: The input features should not be highly correlated with each other.
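
One rough way to screen for the multicollinearity assumption is to inspect pairwise feature correlations. The sketch below uses synthetic data and an arbitrary cutoff of 0.9. Both are assumptions made for illustration, and pairwise correlation will not catch every form of multicollinearity.

```python
import numpy as np

# Synthetic features: the third column is nearly a copy of the first
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + 0.01 * rng.normal(size=100)
X = np.column_stack([a, b, c])

# Feature-by-feature correlation matrix (columns are the variables)
corr = np.corrcoef(X, rowvar=False)

# Flag feature pairs whose absolute correlation exceeds an illustrative cutoff
cutoff = 0.9
pairs = [(i, j)
         for i in range(X.shape[1])
         for j in range(i + 1, X.shape[1])
         if abs(corr[i, j]) > cutoff]
print(pairs)
```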

### Equations

- **Log-Odds**: The model predicts the log-odds of the outcome being 1 as a linear function of the input features:

   $$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$$

   Here, \( p \) is the probability of the outcome being 1.

- **Probability Prediction**: Using the sigmoid function, the probability of the outcome being 1 is:

   $$p(y=1 \mid x) = \sigma(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n)$$

   And \( p(y=0 \mid x) = 1 - p(y=1 \mid x) \).
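
To make the two equations above concrete, here is a small numeric sketch. The coefficient values and the feature vector are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical coefficients and one feature vector (illustrative values only)
beta_0 = -1.0
beta = np.array([2.0, -0.5])
x = np.array([1.5, 2.0])

log_odds = beta_0 + np.dot(beta, x)    # linear predictor: -1 + 3 - 1 = 1
p1 = 1.0 / (1.0 + np.exp(-log_odds))   # p(y=1 | x) via the sigmoid
p0 = 1.0 - p1                          # p(y=0 | x)

# Applying the logit to p1 recovers the linear predictor
recovered = np.log(p1 / (1.0 - p1))
print(log_odds, p1, p0, recovered)
```

The last line shows that the log-odds and the sigmoid are inverses of each other: converting the probability back with the logit returns the original linear predictor (up to floating-point error).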

### Learning from Data

- **Parameter Estimation**: Logistic regression estimates the parameters \( \beta \) using maximum likelihood estimation (MLE): it finds the parameter values under which the observed labels are most probable given the model.

- **Gradient Descent**: The likelihood has no closed-form maximizer, so the parameters are found iteratively, often with gradient descent. Gradient descent repeatedly adjusts the parameters to reduce the negative log-likelihood (the cross-entropy loss), which penalizes predicted probabilities that are far from the actual outcomes.
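
For reference, the objective and its gradient can be written out explicitly. The notation here is introduced for illustration and is not from the original text: \( m \) training examples \( (x^{(i)}, y^{(i)}) \), predicted probability \( \hat{p}^{(i)} \), average cost \( J \), and learning rate \( \alpha \). The symbols \( \beta \) and \( \sigma \) are the same as in the equations above.

$$\ell(\beta) = \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{p}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right], \qquad \hat{p}^{(i)} = \sigma\left(\beta_0 + \beta_1 x^{(i)}_1 + \ldots + \beta_n x^{(i)}_n\right)$$

$$J(\beta) = -\frac{1}{m}\,\ell(\beta), \qquad \frac{\partial J}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{p}^{(i)} - y^{(i)}\right) x^{(i)}_j \quad \left(\text{with } x^{(i)}_0 = 1\right)$$

$$\beta_j \leftarrow \beta_j - \alpha \, \frac{\partial J}{\partial \beta_j}$$

Maximizing the log-likelihood \( \ell \) is equivalent to minimizing \( J \), and each gradient descent step moves \( \beta \) a small distance against the gradient of \( J \).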

### Making Predictions

- Once trained, the logistic regression model can predict the probability of a new data point belonging to a certain class (e.g., spam or not spam).
  
- It classifies the outcome based on a chosen threshold (typically 0.5). If the predicted probability is >= 0.5, it predicts class 1; otherwise, it predicts class 0.
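
The thresholding rule above can be sketched in a few lines of NumPy. The probabilities are made-up values for illustration.

```python
import numpy as np

# Illustrative predicted probabilities for five new data points
probs = np.array([0.10, 0.49, 0.50, 0.51, 0.95])
threshold = 0.5

# Class 1 when the probability reaches the threshold, class 0 otherwise
labels = (probs >= threshold).astype(int)
print(labels)  # [0 0 1 1 1]
```

Note that the `predict` method later in this notebook uses a strict `> 0.5` comparison, so a probability of exactly 0.5 is mapped to class 0 there.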


# Function to load data from a CSV file

In [1]:
import numpy as np
import csv


def load_data(filename):
    with open(filename, 'r') as file:
        reader = csv.reader(file)
        next(reader)  # Skip header row
        data = [[float(x) for x in row] for row in reader]
    return data

# Function to preprocess data

In [2]:

def preprocess_data(data):
    X = np.array([row[:-1] for row in data])
    y = np.array([row[-1] for row in data])
    return X, y

# Feature scaling using standardization

In [3]:

def feature_scaling(X):
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    X_scaled = (X - mean) / std
    return X_scaled
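
A possible variant of the scaling step, sketched under two assumptions that go beyond the function above: the test set should be scaled with the training set's statistics, and constant features (standard deviation 0) should not cause a division by zero. The helper names `fit_scaler` and `apply_scaler` are hypothetical.

```python
import numpy as np

def fit_scaler(X_train):
    """Compute per-feature mean and standard deviation from the training set."""
    mean = np.mean(X_train, axis=0)
    std = np.std(X_train, axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant features
    return mean, std

def apply_scaler(X, mean, std):
    """Standardize X with previously computed statistics."""
    return (X - mean) / std

# Tiny illustrative arrays (the second feature is constant)
X_train_demo = np.array([[1.0, 5.0], [3.0, 5.0]])
X_test_demo = np.array([[2.0, 5.0]])

mean, std = fit_scaler(X_train_demo)
print(apply_scaler(X_train_demo, mean, std))
print(apply_scaler(X_test_demo, mean, std))
```

Reusing the training statistics keeps training and test features on the same scale, which is also what `StandardScaler` does when it is fitted on the training data and then applied to the test data.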

# Logistic Regression Class

## Logistic Regression Algorithm

### 1. Initialization
- **Class Definition**: `LogisticRegression`
  - **`__init__` Method**: Initializes the logistic regression model with a specified learning rate and number of iterations.
    - **Parameters**:
      - `learning_rate`: The step size for gradient descent updates.
      - `num_iterations`: The number of iterations for the gradient descent optimization.

### 2. Sigmoid Function
- **`sigmoid` Method**: Computes the sigmoid function, which maps any real-valued number into the range (0, 1).
  - **Formula**: 
    $$
    \sigma(z) = \frac{1}{1 + e^{-z}}
    $$

### 3. Training the Model
- **`fit` Method**: Trains the logistic regression model using gradient descent.
  - **Parameters**:
    - `X`: Feature matrix.
    - `y`: Target vector.
  - **Process**:
    1. Initialize weights and bias to zero.
    2. Iterate for the specified number of iterations:
       - Compute the linear combination of inputs and weights: $ z = X \cdot \text{weights} + \text{bias} $.
       - Apply the sigmoid function to the linear combination to get the predicted probabilities: $ \hat{y} = \sigma(z) $.
       - Compute the gradients of the cost function $ J $ with respect to the weights and bias, where $ m $ is the number of training examples:
         - $ \frac{\partial J}{\partial \text{weights}} = \frac{1}{m} X^\top ( \hat{y} - y ) $
         - $ \frac{\partial J}{\partial \text{bias}} = \frac{1}{m} \sum_{i=1}^{m} ( \hat{y}_i - y_i ) $
       - Update the weights and bias using the gradients and the learning rate:
         - `weights -= learning_rate * dw`
         - `bias -= learning_rate * db`
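
One way to gain confidence in the gradient expressions above is to compare them with finite differences. The sketch below assumes the cost $ J $ is the average negative log-likelihood (binary cross-entropy), which is the cost these gradients correspond to. The data is random and the helper names `cost` and `gradients` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, weights, bias):
    """Average negative log-likelihood (binary cross-entropy)."""
    y_hat = sigmoid(X @ weights + bias)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradients(X, y, weights, bias):
    """Analytic gradients of the cost with respect to weights and bias."""
    m = X.shape[0]
    error = sigmoid(X @ weights + bias) - y
    return X.T @ error / m, np.sum(error) / m

# Small synthetic problem (random data, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
weights = rng.normal(size=3)
bias = 0.1

dw, db = gradients(X, y, weights, bias)

# Central finite differences, one weight at a time
eps = 1e-6
dw_numeric = np.zeros_like(weights)
for j in range(len(weights)):
    step = np.zeros_like(weights)
    step[j] = eps
    dw_numeric[j] = (cost(X, y, weights + step, bias)
                     - cost(X, y, weights - step, bias)) / (2 * eps)
db_numeric = (cost(X, y, weights, bias + eps)
              - cost(X, y, weights, bias - eps)) / (2 * eps)

# Both differences should be tiny if the analytic gradients are correct
print(np.max(np.abs(dw - dw_numeric)), abs(db - db_numeric))
```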

### 4. Making Predictions
- **`predict` Method**: Predicts binary labels for input data based on the learned weights and bias.
  - **Parameters**:
    - `X`: Feature matrix for which predictions are to be made.
  - **Process**:
    1. Compute the linear combination of inputs and weights: $ z = X \cdot \text{weights} + \text{bias} $.
    2. Apply the sigmoid function to the linear combination to get the predicted probabilities: $ \hat{y} = \sigma(z) $.
    3. Convert the predicted probabilities to binary labels based on a threshold of 0.5: probabilities strictly greater than 0.5 become 1, all others become 0.

In [4]:

class LogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations

    def sigmoid(self, z):
        # Clip z so that np.exp(-z) cannot overflow for large negative inputs
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        self.m, self.n = X.shape
        self.weights = np.zeros(self.n)
        self.bias = 0

        for _ in range(self.num_iterations):
            linear_model = np.dot(X, self.weights) + self.bias
            y_predicted = self.sigmoid(linear_model)

            dw = (1 / self.m) * np.dot(X.T, (y_predicted - y))
            db = (1 / self.m) * np.sum(y_predicted - y)

            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = self.sigmoid(linear_model)
        return [1 if i > 0.5 else 0 for i in y_predicted]

# Function to calculate accuracy

In [5]:

def accuracy(y_true, y_pred):
    correct = np.sum(y_true == y_pred)
    return correct / len(y_true)


# Calculating the Accuracy of Our Custom Logistic Regression Model

In [6]:
def evaluate_logistic_regression(X_train, y_train, X_test, y_test):
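    # Note: each set is standardized with its own mean and standard deviation;
    # reusing the training statistics for the test set (as evaluate_sklearn does)
    # keeps both sets on the same scale.
    X_train = feature_scaling(X_train)
    X_test = feature_scaling(X_test)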


    model = LogisticRegression(learning_rate=0.01, num_iterations=10000)

    model.fit(X_train, y_train)


    y_train_pred = model.predict(X_train)


    y_test_pred = model.predict(X_test)


    train_accuracy = accuracy(y_train, y_train_pred)
    test_accuracy = accuracy(y_test, y_test_pred)


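    # Note: the file names below are hardcoded, so the same labels are printed
    # even when the function is called with the second dataset.
    print(f"Training Accuracy for data1_train.csv: {train_accuracy * 100:.2f}%")
    print(f"Test Accuracy for data1_test.csv: {test_accuracy * 100:.2f}%")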

    
train_data = load_data('data1_train.csv')                               # Load and preprocess the first dataset
test_data = load_data('data1_test.csv')
X_train1, y_train1 = preprocess_data(train_data)
X_test1, y_test1 = preprocess_data(test_data)


train_data2 = load_data('data2_train.csv')                              # Load and preprocess the second dataset
test_data2 = load_data('data2_test.csv')
X_train2, y_train2 = preprocess_data(train_data2)
X_test2, y_test2 = preprocess_data(test_data2)


print("Results for data1:")                                            # Evaluate on the first dataset
evaluate_logistic_regression(X_train1, y_train1, X_test1, y_test1)



print("\nResults for data2:")                                             # Evaluate on the second dataset
evaluate_logistic_regression(X_train2, y_train2, X_test2, y_test2)

Results for data1:
Training Accuracy for data1_train.csv: 50.00%
Test Accuracy for data1_test.csv: 56.50%

Results for data2:
Training Accuracy for data1_train.csv: 98.88%
Test Accuracy for data1_test.csv: 97.00%


# Hyperparameter Tuning

### Introduction
Hyperparameter tuning is the process of optimizing the hyperparameters of a model to improve its performance. For logistic regression, common hyperparameters include the learning rate and the number of iterations. The goal is to find the best combination of these hyperparameters that maximizes the model's accuracy.

### Grid Search Method
Grid search is a systematic method for hyperparameter tuning. It trains the model once for every combination of the candidate hyperparameter values and evaluates each resulting model. The combination that yields the highest accuracy is chosen as the best. In this notebook, that accuracy is measured on the training data itself.



#### Parameters
- `X_train`: Feature matrix for the training data.
- `y_train`: Target vector for the training data.
- `learning_rates`: List of learning rates to try.
- `num_iterations_list`: List of iteration counts to try.

#### Process
1. **Initialize Best Accuracy and Parameters:**
   - `best_accuracy` is initialized to 0.
   - `best_params` is initialized as an empty dictionary.
   
2. **Iterate Over Hyperparameter Combinations:**
   - For each learning rate in `learning_rates`:
     - For each number of iterations in `num_iterations_list`:
       - Create a new instance of the `LogisticRegression` model with the current learning rate and number of iterations.
       - Fit the model to the training data.
       - Predict the labels for the training data.
       - Calculate the accuracy of the model on the training data.
       - If the current accuracy is higher than `best_accuracy`, update `best_accuracy` and `best_params`.

3. **Return Best Parameters and Accuracy:**
   - After iterating through all combinations, return the `best_params` and `best_accuracy`.
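
The process above selects hyperparameters by training accuracy. A common variant scores each combination on a held-out validation split instead, so that the choice reflects performance on data the model has not seen. The sketch below illustrates that variant. It is not the notebook's `grid_search`: the helpers `fit_gd` and `predict_labels`, the synthetic data, and the 75/25 split are all assumptions made for illustration.

```python
import itertools
import numpy as np

def fit_gd(X, y, learning_rate, num_iterations):
    """Plain batch gradient descent for logistic regression (illustrative helper)."""
    weights, bias = np.zeros(X.shape[1]), 0.0
    for _ in range(num_iterations):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ weights + bias)))
        weights -= learning_rate * (X.T @ (y_hat - y)) / len(y)
        bias -= learning_rate * np.mean(y_hat - y)
    return weights, bias

def predict_labels(X, weights, bias):
    # z > 0 is equivalent to sigmoid(z) > 0.5
    return (X @ weights + bias > 0).astype(int)

# Synthetic, linearly separable data (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out the last 25% of rows as a validation set
split = 150
X_fit, y_fit = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]

best_params, best_accuracy = None, -1.0
for lr, n_iter in itertools.product([0.001, 0.01, 0.1], [100, 500, 1000]):
    weights, bias = fit_gd(X_fit, y_fit, lr, n_iter)
    acc = np.mean(predict_labels(X_val, weights, bias) == y_val)
    if acc > best_accuracy:
        best_params = {'learning_rate': lr, 'num_iterations': n_iter}
        best_accuracy = acc

print(best_params, best_accuracy)
```

Initializing `best_accuracy` to `-1.0` guarantees that `best_params` is set even if every combination scores 0.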


In [7]:
def grid_search(X_train, y_train, learning_rates, num_iterations_list):
    best_accuracy = 0
    best_params = {}
    
    for lr in learning_rates:
        for num_iter in num_iterations_list:
            model = LogisticRegression(learning_rate=lr, num_iterations=num_iter)
            model.fit(X_train, y_train)
            y_pred = model.predict(X_train)
            acc = accuracy(y_train, y_pred)
            if acc > best_accuracy:
                best_accuracy = acc
                best_params = {'learning_rate': lr, 'num_iterations': num_iter}
    
    return best_params, best_accuracy



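# Note: unlike evaluate_logistic_regression, this function does not standardize
# the features, and X_test / y_test are accepted but not used.
def evaluate_hyperparameter(X_train, y_train, X_test, y_test):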
    learning_rates = [0.001, 0.01, 0.1]
    num_iterations_list = [100, 500, 1000]
    best_params, best_accuracy = grid_search(X_train, y_train, learning_rates, num_iterations_list)
    print(f"Best Parameters: {best_params}")
    print(f"Best Training Accuracy: {best_accuracy * 100:.2f}%")

train_data = load_data('data1_train.csv')                               # Load and preprocess the first dataset
test_data = load_data('data1_test.csv')
X_train1, y_train1 = preprocess_data(train_data)
X_test1, y_test1 = preprocess_data(test_data)


train_data2 = load_data('data2_train.csv')                              # Load and preprocess the second dataset
test_data2 = load_data('data2_test.csv')
X_train2, y_train2 = preprocess_data(train_data2)
X_test2, y_test2 = preprocess_data(test_data2)


print("Results for data1:")                                            # Evaluate on the first dataset
evaluate_hyperparameter(X_train1, y_train1, X_test1, y_test1)



print("\nResults for data2:")                                             # Evaluate on the second dataset
evaluate_hyperparameter(X_train2, y_train2, X_test2, y_test2)

Results for data1:
Best Parameters: {'learning_rate': 0.001, 'num_iterations': 100}
Best Training Accuracy: 32.25%

Results for data2:
Best Parameters: {'learning_rate': 0.001, 'num_iterations': 1000}
Best Training Accuracy: 89.38%


# Comparison with Scikit-Learn

### `evaluate_sklearn` Function Explanation

The `evaluate_sklearn` function performs the following steps to train and evaluate a logistic regression model using scikit-learn:

1. **Data Scaling**:
   - **Standardization**: The function begins by scaling the features with a `StandardScaler`. The scaler is fitted on the training data, so each training feature has a mean of 0 and a standard deviation of 1, and the same transformation is then applied to the test data. Standardization puts all features on a comparable scale, which helps the optimizer converge.

2. **Model Initialization**:
   - A `LogisticRegression` model is initialized. In this case, the `lbfgs` solver is used, which is efficient for small to medium-sized datasets. The maximum number of iterations is set to a high value (10,000) to give the solver ample room to converge.

3. **Model Training**:
   - The logistic regression model is trained on the scaled training data. During training, the model learns the coefficients that best fit the relationship between the input features and the target labels by minimizing the logistic loss function. With its default settings, scikit-learn also adds an L2 penalty to this loss.

4. **Prediction**:
   - After training, the model makes predictions on both the training and test datasets. This involves calculating the probability that each sample belongs to a particular class and then assigning the sample to the class with the highest probability.

5. **Evaluation**:
   - The function evaluates the model's performance by calculating the accuracy on the training and test sets. Accuracy is the ratio of correctly predicted instances to the total instances:
     $$
     \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
     $$
   - The function then prints the accuracy for both the training and test datasets.
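
As a small worked instance of the accuracy formula above, the following sketch uses made-up labels in which four of five predictions are correct. The variable names are illustrative.

```python
import numpy as np

# Illustrative labels: 4 of the 5 predictions match the true labels
y_true_demo = np.array([1, 0, 1, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0])

acc_value = np.sum(y_true_demo == y_pred_demo) / len(y_true_demo)
print(f"{acc_value * 100:.2f}%")  # 80.00%
```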


In [8]:
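# Imported under an alias so it does not shadow the custom LogisticRegression class above
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score


def evaluate_sklearn(X_train, y_train, X_test, y_test):
    # Fit the scaler on the training data only, then apply it to both sets
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model = SklearnLogisticRegression(solver='lbfgs', max_iter=10000)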


    model.fit(X_train_scaled, y_train)


    y_train_pred = model.predict(X_train_scaled)


    y_test_pred = model.predict(X_test_scaled)


    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    print(f"Training Accuracy for : {train_accuracy * 100:.2f}%")
    print(f"Test Accuracy for : {test_accuracy * 100:.2f}%")


train_data = load_data('data1_train.csv')                               # Load and preprocess the first dataset
test_data = load_data('data1_test.csv')
X_train1, y_train1 = preprocess_data(train_data)
X_test1, y_test1 = preprocess_data(test_data)


train_data2 = load_data('data2_train.csv')                              # Load and preprocess the second dataset
test_data2 = load_data('data2_test.csv')
X_train2, y_train2 = preprocess_data(train_data2)
X_test2, y_test2 = preprocess_data(test_data2)


print("Results for data1:")                                            # Evaluate on the first dataset
evaluate_sklearn(X_train1, y_train1, X_test1, y_test1)



print("\nResults for data2:")                                             # Evaluate on the second dataset
evaluate_sklearn(X_train2, y_train2, X_test2, y_test2)


Results for data1:
Training Accuracy for : 97.25%
Test Accuracy for : 98.00%

Results for data2:
Training Accuracy for : 98.88%
Test Accuracy for : 99.00%
