# Logistic Regression From Scratch

We will build a logistic regression model that classifies whether a patient has diabetes. We will
use only Python, writing our own functions for reading and normalizing data, optimizing parameters, and more.

## What is Logistic Regression?

Logistic regression is a **supervised** machine learning algorithm used for **classification**.
It is similar to linear regression but has a different **cost function**
and **prediction function**.

$$
\text{Sigmoid Function: } g(z) = \frac{1}{1+e^{-z}}
$$

$$
\text{Hypothesis: } h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}
$$

Note that the range of $g$ is $(0,1)$. Outputs at or above a threshold $t\in (0,1)$ represent class 1, and outputs below
$t$ represent class 0.
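
To make the thresholding rule concrete, here is a minimal sketch of the sigmoid and the resulting class decision. The names `sigmoid` and `classify` and the default threshold of 0.5 are assumptions chosen for illustration; they are not part of the notebook's code.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 when g(z) is at or above the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

# g is increasing in z, so larger scores move the output toward class 1.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Because the comparison is `>=`, a score of exactly $z = 0$ (where $g(z) = 0.5$) is assigned to class 1 under the assumed threshold.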

## Cost Function

A **cost function** measures the error between the **actual value** and the **predicted value** of our
algorithm. The error should be as small as possible.

In the case of linear regression, the formula is

$$
J(\theta) = \frac{1}{m} \sum_{i=1}^m (\theta^T x^i - y^i)^2
$$

where $m$ is the number of examples (rows) in the data set, $x^i$ is the vector of feature values of the $i$-th example,
and $y^i$ is the actual outcome of the $i$-th example. For linear regression, $h_\theta(x^i) = \theta^T x^i$. We want each $(h_\theta(x^i) - y^i)^2$ to be as small as possible, but this formula **cannot** be used for **logistic regression**: when $h_\theta$ is the sigmoid hypothesis, the squared-error cost is not convex in $\theta$, so gradient descent may settle in a local minimum and miss the global minimum. Let us change each term in the summation above to

$$
-y^i \log (h_\theta(x^i)) - (1-y^i)\log (1 - h_\theta(x^i))
$$

When $y^i=1$, the cost approaches 0 as $h_\theta(x^i)$ approaches 1. Conversely, the cost grows to infinity as $h_\theta(x^i)$ approaches 0. This is shown on the left side of the plot below.

<img src='img/ex1.png'>

This is a desirable property: we want a bigger penalty as the algorithm predicts something far away from the actual value. If the label is $y^i=1$ but the algorithm predicts $h_\theta(x^i) = 0$, the outcome is completely wrong.

The same intuition applies when $y^i=0$, depicted on the right side of the plot above: the penalty grows when the label is $y^i=0$ but the algorithm predicts $h_\theta(x^i)$ close to 1. Each term is convex and we want each term to be as small as possible, so we can write our new cost function as

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^m \big[y^i \log (h_\theta(x^i)) + (1-y^i)\log (1 - h_\theta(x^i))\big]
$$
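
To see the penalty behaviour numerically, the following sketch evaluates the per-example term and the averaged cost on a few hand-picked probabilities. The function names `example_cost` and `average_cost` and the sample values are assumptions for illustration only.

```python
import math

def example_cost(y, h):
    """Cost of one example with label y (0 or 1) and predicted probability h."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

def average_cost(labels, probs):
    """J(theta): the mean of the per-example costs."""
    return sum(example_cost(y, h) for y, h in zip(labels, probs)) / len(labels)

# A confident correct prediction is cheap; a confident wrong one is expensive.
print(example_cost(1, 0.9))   # about 0.105
print(example_cost(1, 0.1))   # about 2.303
print(average_cost([1, 0, 1], [0.9, 0.2, 0.6]))
```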

## Gradient Descent

The goal of an ML algorithm is to find the set of parameters that **minimizes**
the **cost function**. This is where optimization techniques come in. One of them
is called gradient descent.

We start with initial parameter values (in most cases **zero**), then
repeatedly update the parameters to reduce $J(\theta)$. With learning rate $\alpha$, the update rule is:

Repeat:
$$
\theta_j:= \theta_j - \alpha\frac{\partial}{\partial\theta_j} J(\theta)
$$

Note that 
$$
\frac{\partial}{\partial\theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^i) - y^i)x_j^i
$$ 
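
For readers who want the intermediate step, here is a sketch of the derivation for a single example. It uses the identity $g'(z) = g(z)(1-g(z))$ and the shorthand $h = h_\theta(x^i)$, $z = \theta^T x^i$ (the shorthand is ours):

$$
\frac{\partial h}{\partial \theta_j} = g(z)\,(1-g(z))\,x_j^i = h\,(1-h)\,x_j^i
$$

$$
\frac{\partial}{\partial\theta_j}\Big[-y^i\log h - (1-y^i)\log(1-h)\Big]
= \Big(-\frac{y^i}{h} + \frac{1-y^i}{1-h}\Big)\,h\,(1-h)\,x_j^i
= (h - y^i)\,x_j^i
$$

Averaging this over the $m$ examples gives the expression above.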

So we have

Repeat:
$$
\theta_j:= \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^i) - y^i)x_j^i
$$
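
A quick way to check the gradient formula is to compare it with a finite-difference estimate on toy data, then take one update step and confirm that the cost goes down. The sketch below is self-contained; the helper names (`h`, `J`, `analytic_grad`, `numeric_grad`), the toy data, and the learning rate `alpha = 0.5` are assumptions made for illustration.

```python
import math

def h(theta, x):
    """Hypothesis: sigmoid of the dot product of theta and x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def J(theta, X, y):
    """Average log-loss over all examples."""
    total = sum(yi * math.log(h(theta, xi)) + (1 - yi) * math.log(1 - h(theta, xi))
                for xi, yi in zip(X, y))
    return -total / len(X)

def analytic_grad(theta, X, y):
    """Gradient from the closed-form expression (1/m) * sum((h - y) * x_j)."""
    m = len(X)
    return [sum((h(theta, xi) - yi) * xi[j] for xi, yi in zip(X, y)) / m
            for j in range(len(theta))]

def numeric_grad(theta, X, y, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for j in range(len(theta)):
        up, down = theta[:], theta[:]
        up[j] += eps
        down[j] -= eps
        grad.append((J(up, X, y) - J(down, X, y)) / (2 * eps))
    return grad

# Toy data: the first entry of each x is the constant 1 for the intercept.
X = [[1.0, 0.2], [1.0, 0.9], [1.0, 0.4], [1.0, 0.7]]
y = [0, 1, 0, 1]
theta = [0.1, -0.3]

print(analytic_grad(theta, X, y))
print(numeric_grad(theta, X, y))

# One gradient descent step with an assumed learning rate.
alpha = 0.5
new_theta = [t - alpha * g for t, g in zip(theta, analytic_grad(theta, X, y))]
print(J(theta, X, y), J(new_theta, X, y))
```

The two printed gradients should agree to several decimal places, and the second cost should be smaller than the first.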

## Dataset

We will be using the **Pima Indians Diabetes Dataset**, which involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:

1. Number of times pregnant.
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skinfold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index (weight in kg/(height in m)^2).
7. Diabetes pedigree function.
8. Age (years).
9. Class variable (0 or 1).

## Now Let's Code

### Convert csv file to tabular data

In [1]:
import csv

# read data
with open('Pima_Indians_Diabetes_Data.csv', newline='') as file:
    data = list(csv.reader(file))

# convert each value from string to float
df = [[float(val) for val in row] for row in data]
df[:10]

[[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0],
 [1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0, 0.0],
 [8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0, 1.0],
 [1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0, 0.0],
 [0.0, 137.0, 40.0, 35.0, 168.0, 43.1, 2.288, 33.0, 1.0],
 [5.0, 116.0, 74.0, 0.0, 0.0, 25.6, 0.201, 30.0, 0.0],
 [3.0, 78.0, 50.0, 32.0, 88.0, 31.0, 0.248, 26.0, 1.0],
 [10.0, 115.0, 0.0, 0.0, 0.0, 35.3, 0.134, 29.0, 0.0],
 [2.0, 197.0, 70.0, 45.0, 543.0, 30.5, 0.158, 53.0, 1.0],
 [8.0, 125.0, 96.0, 0.0, 0.0, 0.0, 0.232, 54.0, 1.0]]
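
Since missing values are believed to be encoded as zeros and the classes are not balanced, it can help to count both before modelling. The sketch below does this on the first five rows printed above. The helper names and the choice of which columns treat zero as implausible (glucose, blood pressure, skinfold, insulin, BMI) are our assumptions, not something the dataset description states.

```python
# First five rows of the dataset as printed above (8 features + class label).
sample = [
    [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0],
    [1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0, 0.0],
    [8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0, 1.0],
    [1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0, 0.0],
    [0.0, 137.0, 40.0, 35.0, 168.0, 43.1, 2.288, 33.0, 1.0],
]

# Assumption: a zero in these columns most likely marks a missing measurement.
SUSPECT_COLS = {1: "glucose", 2: "blood pressure", 3: "skinfold", 4: "insulin", 5: "BMI"}

def count_zeros(rows, cols):
    """Count zero entries per suspect column."""
    return {name: sum(1 for row in rows if row[idx] == 0.0) for idx, name in cols.items()}

def class_counts(rows):
    """Count how many rows carry each class label (last column)."""
    counts = {0: 0, 1: 0}
    for row in rows:
        counts[int(row[-1])] += 1
    return counts

zero_counts = count_zeros(sample, SUSPECT_COLS)
label_counts = class_counts(sample)
print(zero_counts)
print(label_counts)
```

Running the same two helpers on the full `df` would give the counts for all 768 observations.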

### Min-Max Scaling (Normalization)

In [2]:
def get_maxmin(N):
    """Return [min, max] for every column of the table N (a list of rows)."""
    col_minmax = []

    for col in zip(*N):  # zip(*N) iterates over the columns of N
        col_minmax.append([min(col), max(col)])

    return col_minmax

minmax_data = get_maxmin(df)

In [3]:
import copy

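def normalize(N, minmax):
    """Min-max scale every column of N in place and return N.

    The label column is scaled too, but its values are already 0 or 1, so it is unchanged.
    """
    num_cols = len(N[0])
    num_rows = len(N)

    for i in range(num_cols):
        col_min, col_max = minmax[i]
        for j in range(num_rows):
            N[j][i] = (N[j][i] - col_min)/(col_max - col_min)

    return N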

df_deepcopy = copy.deepcopy(df)

norm_df = normalize(df_deepcopy, minmax_data)
print(get_maxmin(norm_df))

[[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
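
The cells above apply min-max scaling, which maps each value $x$ in a column to $(x - \min)/(\max - \min)$. As a small standalone sketch, the hypothetical helper `scale_column` below does the same for one column and adds a guard for a constant column, where $\max - \min = 0$ and the division would otherwise raise `ZeroDivisionError`. Mapping a constant column to 0.0 is an assumed convention.

```python
def scale_column(values):
    """Min-max scale one column of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: assumed convention is to map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [50.0, 31.0, 32.0, 21.0, 33.0]
scaled_ages = scale_column(ages)
print(scaled_ages)
print(scale_column([3.0, 3.0, 3.0]))
```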


### Split Data: 80% Training Data, 20% Test Data

In [5]:
from random import shuffle

def train_test_split(dataset):
    shuffle(dataset)
    n = int(.8*len(dataset))
    train = dataset[:n]
    test = dataset[n:]

    return train, test

train, test = train_test_split(norm_df)
print('size of train {} and size of test {}'.format(len(train), len(test)))

size of train 614 and size of test 154
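
Because `shuffle` reorders the list in place and draws from the global random state, every run produces a different split. If a repeatable split is wanted, one option is sketched below. The function name `split_rows`, the `seed` parameter, and the toy rows are assumptions for illustration.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=0):
    """Return (train, test) without reordering the caller's list."""
    shuffled = list(rows)                  # shallow copy; the original order is kept
    random.Random(seed).shuffle(shuffled)  # private RNG, so the global state is untouched
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

rows = [[i, i % 2] for i in range(10)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Calling `split_rows` again with the same `seed` returns the same partition, which makes accuracy figures comparable between runs.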


### Accuracy

In [None]:
def get_accuracy(actual, predicted):
    count = 0
    n = len(actual)
    for i in range(n):
        if actual[i] == predicted[i]:
            count += 1
    
    return count/n
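
Since the classes in this dataset are not balanced, an accuracy figure is easier to interpret next to the accuracy of always predicting the most common class. The sketch below shows that comparison on made-up labels. The names `accuracy` and `majority_baseline` and the sample lists are assumptions for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the prediction matches the label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def majority_baseline(actual):
    """Accuracy obtained by always predicting the most common label."""
    majority = max(set(actual), key=actual.count)
    return accuracy(actual, [majority] * len(actual))

actual    = [0, 0, 0, 1, 0, 1, 0, 0]
predicted = [0, 0, 1, 1, 0, 0, 0, 0]
print(accuracy(actual, predicted))   # 6 of 8 correct
print(majority_baseline(actual))     # always predicting 0 is also right 6 of 8 times
```

Here both numbers are 0.75, so these made-up predictions are no better than the trivial baseline.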

### Hypothesis Function

Our prediction function is the hypothesis function `hyp_function`, which takes a whole row (the features followed by the label) and the parameters as arguments.

In [None]:
import math

def hyp_function(arr, beta):
    arr = [1] + arr[:-1]  # drop the label (last entry) and prepend 1 for the intercept term
    n = len(arr)
    dot = sum([arr[i]*beta[i] for i in range(n)])
    denom = 1 + math.exp(-dot)

    return 1/denom
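
One caveat: `math.exp` raises `OverflowError` for arguments above roughly 709, so the expression `1 + math.exp(-dot)` fails when `dot` is a large negative number. With features scaled to $[0,1]$ and small parameters this is unlikely, but a numerically stable variant is easy to sketch. The name `stable_sigmoid` is an assumption; it is not used elsewhere in this notebook.

```python
import math

def stable_sigmoid(z):
    """Sigmoid that never calls math.exp with a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) lies in [0, 1) and cannot overflow
    return ez / (1.0 + ez)

print(stable_sigmoid(0.0))
print(stable_sigmoid(-1000.0))  # the naive 1 / (1 + math.exp(1000)) raises OverflowError
print(stable_sigmoid(1000.0))
```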

### Cost Function

We will use the cost function to compute the cost at every iteration so that we can plot it.
$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^m \big[y^i \log (h_\theta(x^i)) + (1-y^i)\log (1 - h_\theta(x^i))\big]
$$

In [None]:
def cost(dataset, beta):
    m = len(dataset)
    # hyp_function expects the whole row (features followed by the label) and beta
    terms = [row[-1]*math.log(hyp_function(row, beta)) + (1-row[-1])*math.log(1-hyp_function(row, beta)) for row in dataset]
    total = sum(terms)
    avg = -(1/m)*total

    return avg
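
A practical detail: `math.log(0)` raises `ValueError`, so the cost breaks if a predicted probability ever rounds to exactly 0 or 1. A common workaround is to clip probabilities slightly away from the ends of the interval. The sketch below works on plain lists of labels and probabilities; the name `safe_log_loss` and the clipping constant `EPS` are assumptions.

```python
import math

EPS = 1e-12  # assumed clipping constant

def safe_log_loss(labels, probs, eps=EPS):
    """Average log-loss with probabilities clipped into [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

print(safe_log_loss([1, 0], [1.0, 0.0]))  # close to 0, no ValueError
print(safe_log_loss([1, 0], [0.0, 1.0]))  # large but finite
```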

### Optimization Technique

Here we use the `gradient_descent` function to find the best set of parameters for our model. This function
takes **dataset**, **epochs** (number of iterations), and **alpha** (learning rate) as arguments.

Recall the update rule:

Repeat:
$$
\theta_j:= \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^i) - y^i)x_j^i
$$

In [None]:
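import matplotlib.pyplot as plt

MAX_ITER = 1000

# alpha=0.1 is an assumed default; the text does not state a learning rate
def gradient_descent(dataset, epochs=MAX_ITER, alpha=0.1):
    m, n = len(dataset), len(dataset[0])      # n = 8 feature weights + 1 intercept weight
    xs = [[1] + row[:-1] for row in dataset]  # feature vectors with the intercept term prepended
    beta, costs = [0.0] * n, []
    for _ in range(epochs):
        errors = [hyp_function(row, beta) - row[-1] for row in dataset]
        beta = [beta[j] - alpha / m * sum(e * x[j] for e, x in zip(errors, xs)) for j in range(n)]
        costs.append(cost(dataset, beta))

    return beta, costs

beta, costs = gradient_descent(train)
plt.plot(costs)  # cost after each iteration
plt.show()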
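
Once parameters have been found, classifying a row means evaluating the hypothesis and comparing it with a threshold. The sketch below is self-contained and assumes the same row layout as the notebook (features followed by the label). The names `predict_proba` and `predict_label`, the threshold of 0.5, and the hand-picked parameters and rows are hypothetical and only serve the illustration.

```python
import math

def predict_proba(row, beta):
    """Probability of class 1 for a row laid out as [features..., label]."""
    x = [1.0] + row[:-1]  # drop the label and prepend the intercept term
    z = sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(row, beta, threshold=0.5):
    """Return 1.0 when the probability reaches the threshold, else 0.0."""
    return 1.0 if predict_proba(row, beta) >= threshold else 0.0

# Hypothetical parameters and rows (one feature plus label), for illustration only.
beta = [-2.0, 4.0]
rows = [[0.1, 0.0], [0.9, 1.0], [0.4, 0.0], [0.7, 1.0]]

predicted = [predict_label(row, beta) for row in rows]
actual = [row[-1] for row in rows]
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(rows)
print(predicted, accuracy)
```

The same pattern, applied to the held-out test rows with the learned parameters, gives the test accuracy of the model.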


### Conclusion

We have successfully built a **Logistic Regression** model from scratch without using **pandas** or **scikit-learn**.
Note that matplotlib was not necessary, but we used it to see how the cost function decreases with each iteration.