<a href="https://colab.research.google.com/github/davy-datascience/portfolio/blob/master/LogisticRegression/Approach-1/Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Logistic Regression - Perceptron Algorithm

## Intro

I first tried coding the perceptron algorithm as taught by Luis Serrano. Luis is a very good teacher who produces YouTube videos on data-science subjects with easy-to-understand visualizations. In the first part of his video [Logistic Regression and the Perceptron Algorithm: A friendly introduction](https://www.youtube.com/watch?v=jbluHIgBmBo), he uses the following approach:

![perceptron algorithm](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LogisticRegression/Approach-1/img/PercetronAlgorithm.png)

**Note:**

The dataset we're using is the Iris dataset, which I simplified.

It consists of 2 different types of irises (Setosa and Virginica).

The features available are sepal length, sepal width, petal length and petal width.

I chose 2 features: sepal length and petal width, named sep_long and pet_larg respectively.

The variable we are trying to predict is the iris species: 0 stands for Setosa and 1 for Virginica.

The x-axis represents the feature sep_long.

The y-axis represents the feature pet_larg.

So we are trying to find a line that can separate Setosa irises from Virginica irises.

## Implementation

Run the following cell to import all needed modules:

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import progressbar

So, we are trying to find a line that can separate our 2 iris species. The line equation can be written as: ![line equation](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LogisticRegression/Approach-1/img/eq0.png)
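
To make the role of the equation concrete, here is a minimal sketch. It assumes the line is written as ax + by + c = 0, which is the form the code in this notebook relies on. The helper name `side_of_line` and the sample coefficients are made up for illustration and are not fitted values.

```python
def side_of_line(x, y, coefs):
    """Return 1 if a*x + b*y + c is positive, else 0 (hypothetical helper)."""
    a, b, c = coefs
    return int(a * x + b * y + c > 0)

# Illustrative coefficients only: the horizontal line y = 1 (0*x + 1*y - 1 = 0)
example_coefs = [0, 1, -1]

print(side_of_line(5.0, 0.2, example_coefs))  # petal width below 1 -> 0
print(side_of_line(6.5, 2.0, example_coefs))  # petal width above 1 -> 1
```

Only the sign of the expression matters for the classification, so multiplying all three coefficients by the same positive number leaves the predictions unchanged.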

Run the following cell. It contains the functions that will be used in the program: 

In [0]:
def drawAll(dataset, x_min, x_max, coefs):
    """ Plot the points from the dataset and draw the current line """
    datasetTrue = dataset[dataset["specy"] == 1]
    datasetFalse = dataset[dataset["specy"] == 0]
    
    x = np.linspace(x_min,x_max)
    y = (-coefs[0] * x - coefs[2]) / coefs[1]
    
    plt.plot(x, y)
    
    plt.scatter(datasetFalse["sep_long"], datasetFalse["pet_larg"], color = 'red')
    plt.scatter(datasetTrue["sep_long"], datasetTrue["pet_larg"], color = 'green')
    plt.show()

def transformLine(x, y, categUp, lineCoefs, learning_rate):
    """ Update the line according to the randomly chosen point """
    
    # Check on which side of the line the point lies
    # by plugging the point (x, y) into the expression ax + by + c:
    # if ax + by + c > 0 the point is on the positive side of the line
    # (above it when b > 0), otherwise it is on the negative side
    
    position = lineCoefs[0] * x + lineCoefs[1] * y + lineCoefs[2]
    
    # If the point is incorrectly classified, move the line towards the point.
    # All three coefficients move in the same direction (perceptron update rule)
    if position > 0 and not categUp:
        lineCoefs[0] -= x * learning_rate
        lineCoefs[1] -= y * learning_rate
        lineCoefs[2] -= learning_rate
    elif position < 0 and categUp:
        lineCoefs[0] += x * learning_rate
        lineCoefs[1] += y * learning_rate
        lineCoefs[2] += learning_rate
    
    return lineCoefs

def predict(X, lineCoefs):
    """ Use the model (the equation of the line) to predict which species each row of X belongs to """
    prediction = []
    
    a = lineCoefs[0]
    b = lineCoefs[1]
    c = lineCoefs[2]
    
    for row in X.iterrows():
        x = row[1].loc["sep_long"]
        y = row[1].loc["pet_larg"]
        
        # The sign of ax + by + c tells if the point is in category 1 (positive) or category 0 (negative)
        position = a * x + b * y + c
        prediction.append(int(position > 0))

    return pd.DataFrame(prediction, index= X.index)
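
As a point of comparison, here is a minimal sketch of the textbook perceptron update applied to a single point. The helper name `perceptron_step` and the sample point are assumptions made for illustration. Unlike the cell above, it returns a new list instead of modifying the coefficients in place, and it treats a point lying exactly on the line as class 0.

```python
def perceptron_step(x, y, label, coefs, learning_rate):
    """One textbook perceptron update (hypothetical helper, returns a new list)."""
    a, b, c = coefs
    position = a * x + b * y + c
    predicted = 1 if position > 0 else 0
    # error is +1 (label 1 predicted 0), -1 (label 0 predicted 1) or 0 (correct)
    error = label - predicted
    return [a + error * learning_rate * x,
            b + error * learning_rate * y,
            c + error * learning_rate]

coefs = [0.0, 1.0, 0.0]  # the line y = 0

# A label-0 point on the positive side of the line is misclassified
new_coefs = perceptron_step(5.0, 0.2, 0, coefs, learning_rate=0.01)

before = coefs[0] * 5.0 + coefs[1] * 0.2 + coefs[2]
after = new_coefs[0] * 5.0 + new_coefs[1] * 0.2 + new_coefs[2]
print(before, after)  # the value of ax + by + c for this point decreases
```

Because a, b and c all move in the same direction, the value of ax + by + c at the misclassified point always moves towards the correct sign. A correctly classified point leaves the coefficients unchanged.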

Run the following cell to launch the logistic regression program:

In [0]:
# Set the learning rate and the number of iterations
learning_rate = 0.01
nb_epochs = 1000

# Read the data
dataset = pd.read_csv("https://raw.githubusercontent.com/davy-datascience/portfolio/master/LogisticRegression/Approach-1/dataset/dataset.csv", index_col='id')

# Separate the dataset into a training set and a test set
train, test = train_test_split(dataset, test_size = 0.2)

# Find the range of x values (sep_long) in the training set, used to draw the line
x_max = train["sep_long"].max()
x_min = train["sep_long"].min()

# Begin with the line y = 0
lineCoefs = [0, 1, 0]

drawAll(train, x_min, x_max, lineCoefs)

# Iterate choosing a random point and moving the line with the function transformLine
for i in progressbar.progressbar(range(nb_epochs)):
    sample = train.sample()
    sepL = sample.iloc[0].sep_long
    petL = sample.iloc[0].pet_larg
    categUp = sample.iloc[0].specy
    lineCoefs = transformLine(sepL, petL, categUp, lineCoefs, learning_rate)
    #drawAll(train, x_min, x_max, lineCoefs)  # Uncomment this line to see the line at each iteration

drawAll(train, x_min, x_max, lineCoefs)

# Predict the test set with the model and print the mean absolute error (MAE).
# With 0/1 labels, the MAE is the fraction of misclassified points
y_pred = predict(test, lineCoefs)
print("MAE : {}".format(mean_absolute_error(test.specy, y_pred)))
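
Since the labels are 0 or 1, the MAE printed above can be read as an error rate. The following sketch, using made-up labels and two hypothetical helper functions written in plain Python, illustrates that the MAE and the accuracy add up to 1 in this setting.

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute difference between true and predicted labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels for illustration: 2 of the 5 predictions are wrong
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

mae = mean_abs_error(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(mae, acc)
```

An MAE of 0 therefore means every test point was classified correctly, and an MAE of 0.5 on a balanced test set is no better than guessing.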