# Logistic Regression

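Logistic Regression is a supervised learning algorithm for classification, that is, for predicting a discrete target.

It is most often used for `binary classification` problems. Logistic regression predicts the probability of an event occurring, and a class label is then obtained by thresholding that probability (typically at 0.5).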


In [1]:
from sklearn.linear_model import LogisticRegression

In [2]:
import numpy as np

# Hours studied 
X = np.array([0.5, 0.75, 1, 1.25, 1.5, 1.75, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 4, 4.25, 4.5, 4.75, 5, 5.5]).reshape(-1, 1)

# Pass (1) or fail (0)
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

In [3]:
model = LogisticRegression()
model.fit(X, y)

In [25]:
# Predict pass or fail
new_X = [[2]] # 2 hours studied
prediction = model.predict(new_X)

print("Prediction: ", "Pass" if prediction[0] == 1 else "Fail")

new_X = [[3]] # 3 hours studied
prediction = model.predict(new_X)

print("Prediction: ", "Pass" if prediction[0] == 1 else "Fail")

Prediction:  Fail
Prediction:  Pass



Logistic Regression uses a `sigmoid function` to squeeze the output of a linear equation into the range between 0 and 1, so that it can be read as a probability.

```
sigmoid(x) = 1 / (1 + e^(-x))
```

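The cost function for a single training example is

`Cost(h(x), y) = -y * log(h(x)) - (1 - y) * log(1 - h(x))` where `h(x)` is the predicted probability (the sigmoid applied to the linear output) and `y` is the true label, 0 or 1.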

In [28]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
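
To make the squeezing behaviour concrete, here is a small sketch that evaluates the sigmoid at a few inputs. The input values are arbitrary and chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    # Same definition as in the cell above
    return 1 / (1 + np.exp(-x))

# Arbitrary inputs: large negative, small, zero, and large positive
values = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = sigmoid(values)

for v, p in zip(values, probs):
    print(f"sigmoid({v:+.1f}) = {p:.4f}")
```

An input of 0 maps to exactly 0.5, large negative inputs approach 0, and large positive inputs approach 1. The curve is symmetric: `sigmoid(-x)` equals `1 - sigmoid(x)`.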

In [29]:
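def cost_function(X, y, weights):
    # Predicted probability of class 1 for every row of X
    h = sigmoid(np.dot(X, weights))
    predict_1 = y * np.log(h)            # contribution of examples with label 1
    predict_0 = (1 - y) * np.log(1 - h)  # contribution of examples with label 0
    # Average cross-entropy (log loss) over all examples
    return -sum(predict_1 + predict_0) / len(X)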
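
As a rough illustration of how this cost behaves, the sketch below compares a confident correct prediction with a confident wrong one, and checks the cost when all weights are zero. The three-row `X_demo` and `y_demo` are hypothetical values made up for this example, and `cost_function` is restated in a compact equivalent form so the block runs on its own.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cost_function(X, y, weights):
    # Average log loss, equivalent to the definition above
    h = sigmoid(np.dot(X, weights))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Per-example cost when the true label is 1
confident_right = -np.log(0.9)  # model predicts 0.9
confident_wrong = -np.log(0.1)  # model predicts 0.1
print(f"right: {confident_right:.4f}, wrong: {confident_wrong:.4f}")

# Hypothetical rows of [bias term, hours studied]
X_demo = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 4.0]])
y_demo = np.array([0, 1, 1])

# With all-zero weights every prediction is 0.5, so the cost is ln(2)
zero_cost = cost_function(X_demo, y_demo, np.zeros(2))
print(f"cost at zero weights: {zero_cost:.4f}")
```

The wrong prediction is penalized far more heavily than the correct one is rewarded, which is what pushes the model away from confident mistakes. The value ln(2), about 0.693, is a useful baseline: a trained model should end up below it.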

### Gradient Descent

Gradient Descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model.


1. Initialize the parameters of the model with some values.
2. Calculate the gradient of the cost function with respect to each parameter. The gradient is a vector that points in the direction of the greatest rate of increase of the function, and its magnitude is the rate of increase in that direction.
3. Update the parameters by moving in the direction of the negative gradient (i.e., the direction of steepest descent). The size of the step is determined by the learning rate, a hyperparameter that you choose.
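4. Repeat steps 2 and 3 until the algorithm converges to a minimum, in practice until the cost stops decreasing noticeably or a maximum number of iterations is reached.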
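
The steps above can be sketched on a deliberately simple one-parameter problem. The function `f(w) = (w - 3)^2`, the starting value, the learning rate, and the iteration count are all assumptions chosen for illustration; the minimum is known to be at `w = 3`.

```python
def f(w):
    # Toy cost function with its minimum at w = 3
    return (w - 3.0) ** 2

def grad_f(w):
    # Derivative of f with respect to w
    return 2.0 * (w - 3.0)

w = 0.0              # step 1: initialize the parameter
learning_rate = 0.1  # size of each step

for step in range(100):          # step 4: repeat
    g = grad_f(w)                # step 2: compute the gradient
    w -= learning_rate * g       # step 3: move against the gradient

print(f"w = {w:.4f}, f(w) = {f(w):.2e}")
```

Each update shrinks the distance to the minimum by a constant factor here, so `w` approaches 3 quickly. A learning rate that is too large would make the updates overshoot instead.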

There are several variants of gradient descent that differ in how much data we use to compute the gradient of the objective function. The three main forms are batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

- **Batch gradient descent** computes the gradient using the whole dataset. This works well for convex or relatively smooth error surfaces, where each step moves fairly directly towards an optimum (local or global). However, every update requires a full pass over the data, which makes it slow for large datasets.

- **Stochastic gradient descent (SGD)** computes the gradient using a single sample at a time. Because it updates the parameters after every sample, it often makes progress faster than batch gradient descent. The updates are noisy, which can help escape shallow local minima but also makes the cost fluctuate rather than decrease smoothly.

- **Mini-batch gradient descent** is a compromise between batch gradient descent and SGD. It uses a mini-batch of `n` samples to compute the gradient at each step. This gives less noisy updates than SGD and allows vectorized computation, which is often more efficient in practice.
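
The three variants differ only in how many rows go into each gradient computation, which the following sketch tries to show. The function `minibatch_gradient_descent` and its defaults (`alpha`, `batch_size`, `epochs`, `seed`) are hypothetical names and values picked for this example; it reuses the hours-studied data from earlier. Setting `batch_size=1` would give SGD, and `batch_size=len(y)` would give batch gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(X, y, weights):
    h = sigmoid(X @ weights)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def minibatch_gradient_descent(X, y, alpha=0.05, batch_size=5, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(epochs):
        order = rng.permutation(m)  # reshuffle the rows every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            h = sigmoid(X[idx] @ weights)
            # Gradient estimated from the current mini-batch only
            gradient = X[idx].T @ (h - y[idx]) / len(idx)
            weights -= alpha * gradient
    return weights

hours = np.array([0.5, 0.75, 1, 1.25, 1.5, 1.75, 1.75, 2, 2.25, 2.5,
                  2.75, 3, 3.25, 3.5, 4, 4.25, 4.5, 4.75, 5, 5.5])
X_demo = np.column_stack((np.ones_like(hours), hours))  # bias column + hours
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

weights = minibatch_gradient_descent(X_demo, y_demo)
final_loss = log_loss(X_demo, y_demo, weights)
print("Weights:", weights)
print("Final loss:", final_loss)
```

Because each update sees only part of the data, the weights jitter slightly around the optimum instead of settling exactly on it; a smaller or decaying learning rate reduces that jitter.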

In [30]:
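def gradient_descent(X, y, weights, alpha, num_iterations):
    m = len(y)  # number of training examples
    for i in range(num_iterations):
        z = np.dot(X, weights)
        h = sigmoid(z)  # predicted probabilities
        # Gradient of the average log loss with respect to the weights
        gradient = np.dot(X.T, (h - y)) / m
        # Step against the gradient; note this updates `weights` in place
        weights -= alpha * gradient
        cost = cost_function(X, y, weights)
        if i % 1000 == 0:
            print(f"Cost after iteration {i}: {cost}")
    return weights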

In [31]:
X = np.array([0.5, 0.75, 1, 1.25, 1.5, 1.75, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 4, 4.25, 4.5, 4.75, 5, 5.5]).reshape(-1, 1)
# Add a 1 to each input to account for the bias term
X = np.hstack((np.ones((X.shape[0], 1)), X))

y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

# Initialize weights
weights = np.zeros(X.shape[1])

alpha = 0.01 # learning rate
num_iterations = 10000

# Run gradient descent
weights = gradient_descent(X, y, weights, alpha, num_iterations)

print("Weights after gradient descent: ", weights)

Cost after iteration 0: 0.690616095200909
Cost after iteration 1000: 0.518972736605199
Cost after iteration 2000: 0.46474846732535574
Cost after iteration 3000: 0.4386448374933137
Cost after iteration 4000: 0.42463199409812324
Cost after iteration 5000: 0.4165035663973164
Cost after iteration 6000: 0.4115206358923655
Cost after iteration 7000: 0.40833995868307565
Cost after iteration 8000: 0.40624707399606014
Cost after iteration 9000: 0.40483741299062537
Weights after gradient descent:  [-3.55491702  1.32858743]
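
The learned weights can be turned back into predictions by hand. The sketch below plugs the weights printed above into the sigmoid for 2 and 3 hours of study and computes the point where the predicted probability crosses 0.5; the 0.5 threshold is an assumption matching the usual convention.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Weights printed above: [bias, coefficient for hours studied]
weights = np.array([-3.55491702, 1.32858743])

labels = []
for hours in (2, 3):
    p = sigmoid(weights[0] + weights[1] * hours)
    label = "Pass" if p >= 0.5 else "Fail"
    labels.append(label)
    print(f"{hours} hours -> P(pass) = {p:.3f} -> {label}")

# The probability is 0.5 where bias + coefficient * hours = 0
boundary = -weights[0] / weights[1]
print(f"Decision boundary: {boundary:.2f} hours")
```

The probabilities come out at roughly 0.29 and 0.61, so the hand-trained model agrees with the Fail/Pass predictions from scikit-learn earlier, and the boundary sits at about 2.68 hours of study.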


### Regularization

Regularization is a technique for reducing overfitting: instead of removing features by hand, we add a penalty on the size of the coefficients to the cost function. The types of regularization are

1. L1 Regularization
2. L2 Regularization
3. Elastic Net Regularization

L1 Regularization adds a penalty term to the cost function that is proportional to the sum of the absolute values of the coefficients. This encourages sparsity in the model, meaning it can drive some coefficients to exactly zero, effectively removing those features from the model.

L2 Regularization adds a penalty term to the cost function that is proportional to the sum of the squared coefficients. This encourages smaller coefficients, shrinking them towards zero without usually making them exactly zero.

Elastic Net Regularization is a combination of L1 and L2 regularization. It adds both penalty terms to the cost function, allowing for a balance between sparsity and shrinkage.
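
The three penalty terms can be written out directly. This is a minimal sketch under assumed conventions: `lam` is a hypothetical name for the regularization strength, `l1_ratio` for the mix between the two penalties, and libraries differ in details such as extra factors of 1/2 or division by the number of samples. The coefficient values are made up.

```python
import numpy as np

def l1_penalty(weights, lam):
    # Proportional to the sum of absolute values
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    # Proportional to the sum of squares
    return lam * np.sum(weights ** 2)

def elastic_net_penalty(weights, lam, l1_ratio):
    # Weighted mix of the L1 and L2 penalties
    return (l1_ratio * l1_penalty(weights, lam)
            + (1 - l1_ratio) * l2_penalty(weights, lam))

coefs = np.array([0.5, -2.0, 0.0])  # hypothetical coefficients, bias excluded
lam = 0.1

l1 = l1_penalty(coefs, lam)
l2 = l2_penalty(coefs, lam)
en = elastic_net_penalty(coefs, lam, l1_ratio=0.5)
print(l1, l2, en)
```

Whichever penalty is chosen is added to the cost function before minimizing. The bias term is usually left out of the penalty so that the model can still shift its predictions freely.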

Regularization helps to prevent overfitting by reducing the complexity of the model and improving its generalization ability. It is particularly useful when dealing with high-dimensional datasets or when there is a large number of features compared to the number of observations.
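
To see the shrinking effect in the gradient descent setting used earlier, the sketch below adds an L2 term to the update and compares the result with an unpenalized fit. The function `fit_l2`, the strength `lam=5.0`, and the choice to scale the penalty by `1/m` and leave the bias unpenalized are assumptions for this example, not the only possible convention.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_l2(X, y, lam, alpha=0.1, num_iterations=5000):
    m = len(y)
    weights = np.zeros(X.shape[1])
    for _ in range(num_iterations):
        h = sigmoid(X @ weights)
        gradient = X.T @ (h - y) / m
        # Gradient of (lam / (2 * m)) * sum(weights ** 2)
        penalty_grad = (lam / m) * weights
        penalty_grad[0] = 0.0  # leave the bias term unpenalized
        weights -= alpha * (gradient + penalty_grad)
    return weights

hours = np.array([0.5, 0.75, 1, 1.25, 1.5, 1.75, 1.75, 2, 2.25, 2.5,
                  2.75, 3, 3.25, 3.5, 4, 4.25, 4.5, 4.75, 5, 5.5])
X_demo = np.column_stack((np.ones_like(hours), hours))  # bias column + hours
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

w_plain = fit_l2(X_demo, y_demo, lam=0.0)  # no regularization
w_reg = fit_l2(X_demo, y_demo, lam=5.0)    # with L2 penalty
print("Without penalty:", w_plain)
print("With L2 penalty:", w_reg)
```

The coefficient for hours studied is smaller in magnitude with the penalty, which makes the fitted probability curve less steep. With a single feature the benefit is modest; the effect matters more when there are many features relative to the number of observations.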
