<a href="https://colab.research.google.com/github/davy-datascience/portfolio/blob/master/LinearRegression/Approach-1/Linear%20Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Linear Regression - with a single variable

## Intro

I first tried implementing the linear regression algorithm taught by Luis Serrano. Luis produces YouTube videos on data-science subjects with easy-to-understand visualizations. In his video [Linear Regression: A friendly introduction](https://www.youtube.com/watch?v=wYPUhge9w5c) he uses the following approach:

![linear regression algorithm](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-1/img/linear-regression-algo.png)

**Note:**

The dataset we're using contains the salaries of a group of people and their number of years of experience.

We are trying to predict the salary given the number of years of experience.

So the number of years of experience is the independent variable and the salary is the dependent variable.

The x-axis represents the number of years of experience.

The y-axis represents the salary.

The y-intercept is the point of the line where x = 0, in other words the point where the line intersects the y-axis.

Increasing the y-intercept translates the line up, and decreasing the y-intercept translates the line down.
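
The update rule from the figure can be sketched in plain Python, without any geometry library. This is only a minimal sketch and not the implementation used in this notebook: the helper name `update_line`, the parameter `x_ref` and the toy numbers are assumptions made for illustration. The horizontal distance is measured from a reference x value (the implementation in this notebook uses the median of the x values).

```python
def update_line(slope, intercept, x, y, x_ref, learning_rate):
    """Apply one update step for a single point (x, y).

    Hypothetical helper, for illustration only.
    """
    # Vertical distance: positive if the point is above the line, negative if below
    vertical_distance = y - (slope * x + intercept)
    # Horizontal distance, measured from a reference x value (assumption: x_ref)
    horizontal_distance = x - x_ref
    # Rotate the line: add to the slope
    new_slope = slope + learning_rate * vertical_distance * horizontal_distance
    # Translate the line: add to the y-intercept
    new_intercept = intercept + learning_rate * vertical_distance
    return new_slope, new_intercept


# Toy example: the line is y = 1*x + 0 and the point (2, 5) lies 3 units above it
slope, intercept = update_line(1.0, 0.0, x=2.0, y=5.0, x_ref=0.0, learning_rate=0.1)
print(slope, intercept)  # approximately 1.6 and 0.3
```

A point that already lies on the line has a vertical distance of 0, so the line is left unchanged. The further a point is from the line, the larger the correction.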

## Implementation

Run the following cell to import all the required modules. You must have opened this document in Google Colab before doing so: <a href="https://colab.research.google.com/github/davy-datascience/portfolio/blob/master/LinearRegression/Approach-1/Linear%20Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
from sympy.geometry import Point, Line
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
import progressbar

I used the class Line from the module sympy.geometry. To create a Line I need to specify two Points. The line is also characterized by 3 coefficients (a, b and c) that satisfy the following equation:
![equation](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-1/img/eq0.png) 

In my approach I am dealing with a line equation of this form:
![equation](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-1/img/eq1.png) 

So I rearranged the first equation to match the form I need:

![equation](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-1/img/eq2.png) 
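
As a sketch of that conversion, assume the general form is a·x + b·y + c = 0 and the target form is y = slope·x + intercept. The code in this notebook computes y as (-a·x - c) / b, which is consistent with that reading. The helper names below are hypothetical, and the way `coefficients_from_points` derives (a, b, c) is just one valid choice, since any non-zero multiple of the three coefficients describes the same line.

```python
def coefficients_from_points(p1, p2):
    """Return (a, b, c) such that a*x + b*y + c = 0 passes through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    a = y1 - y2
    b = x2 - x1
    c = x1 * y2 - y1 * x2
    return a, b, c


def slope_intercept(a, b, c):
    """Rewrite a*x + b*y + c = 0 as y = slope*x + intercept (requires b != 0)."""
    if b == 0:
        raise ValueError("vertical line: the slope is undefined")
    return -a / b, -c / b


# Line through (0, 1) and (2, 5)
a, b, c = coefficients_from_points((0, 1), (2, 5))
slope, intercept = slope_intercept(a, b, c)
print(a, b, c)           # -4 2 -2
print(slope, intercept)  # 2.0 1.0
```

Dividing by b is why a vertical line (b = 0) cannot be written in the second form.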

Run the following cell. It contains the functions that will be used in the program:

In [0]:
def drawAll(X, Y, line):
    """ plot the points from the dataset and draw the actual Line """
    # Cast the sympy coefficients to float so that NumPy and matplotlib work on plain numeric arrays
    a, b, c = (float(v) for v in line.coefficients)
    x = np.linspace(X.min(), X.max())
    y = (-a * x - c) / b
    plt.plot(x, y)
    plt.scatter(X, Y, color = 'red')
    plt.show()

def transformLine(point, line, x_median, learning_rate): 
    """ According to the random point, update the Line """
    # Horizontal distances are measured from the median of the x values, which gives better results
    
    # Creation of the vertical line passing through the new point
    # Any two distinct y values define it; taking them from the line's end points fails when the line is horizontal
    vertical_line = Line(Point(point.x, 0), Point(point.x, 1))
    # Find the intersection with our line (to calculate the vertical distance)
    I = line.intersection(vertical_line)
    vertical_distance = point.y - I[0].y
    horizontal_distance = point.x - x_median
    
    coefs = line.coefficients

    a = coefs[0]
    b = coefs[1]
    c = coefs[2]
    
    # Calculation of the points which constitute the new line
    # Reminder: we add (learning_rate * vertical_distance * horizontal_distance) to the slope and we add (learning_rate * vertical_distance) to y-intercept
    # The equation now looks like : 
    # y = - (a/b)*x + (learning_rate * vertical_distance * horizontal_distance) * x - (c/b) + learning_rate * vertical_distance

    # We keep the same x-range for the line, so its min and max x values don't change
    
    x_min = line.points[0].x
    y_min = - (a/b)*x_min + (learning_rate * vertical_distance * horizontal_distance * x_min) - (c/b) + learning_rate * vertical_distance
    
    x_max = line.points[1].x
    y_max = - (a/b)*x_max + (learning_rate * vertical_distance * horizontal_distance * x_max) - (c/b) + learning_rate * vertical_distance
    
    newLine = Line(Point(x_min, y_min), Point(x_max, y_max))
    return newLine
    
def predict(X, line):
    """ I use my model (the equation of the line) to predict new values """
    prediction = []
    coefs = line.coefficients
    a = coefs[0]
    b = coefs[1]
    c = coefs[2]
    for x in X.values:
        y = - (a/b)*x - (c/b)
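        # Convert the sympy number to a plain float so that the sklearn metrics receive numeric values
        prediction.append(float(y))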
    return prediction

Run the following cell to launch the linear regression program:

In [0]:
# Set the learning rate and the number of iterations
learning_rate = 0.01
nb_epochs = 1000

# Read the data
dataset = pd.read_csv("https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-1/dataset/Salary_Data.csv")

# Separate the dataset into a training set and a test set
train, test = train_test_split(dataset, test_size = 0.2)

# Split the training set and the test set into the independent variable X and the dependent variable y
X_train = train.YearsExperience
y_train = train.Salary

X_test = test.YearsExperience
y_test = test.Salary

# Build the initial line
# The line must span the same x-range as the scatter points from the dataset,
# so I build it from the point that has the min x-value and the point that has the max x-value

# Find the point with the maximum value of x in the training set
idx_max = X_train.idxmax()
point_max = Point(X_train.loc[idx_max], y_train.loc[idx_max])

# Find the point with the minimum value of x in the training set
idx_min = X_train.idxmin()
point_min = Point(X_train.loc[idx_min], y_train.loc[idx_min])

# Build the line from the 2 points
line = Line(point_min, point_max)
drawAll(X_train, y_train, line)

# Iterate: pick a random point and move the line with the function transformLine
x_median = X_train.median()
for i in progressbar.progressbar(range(nb_epochs)):
    sample = train.sample()
    # sample is a one-row DataFrame, so extract the scalar values explicitly
    point = Point(sample.YearsExperience.iloc[0], sample.Salary.iloc[0])
    line = transformLine(point, line, x_median, learning_rate)
    #drawAll(X_train, y_train, line)    # Uncomment this line to see the line at each iteration

drawAll(X_train, y_train, line)

# Predict the test set with my model and evaluate the result
y_pred = predict(X_test, line)
print("MAE (Mean Absolute Error) is used to evaluate the model accuracy")
print("MAE for my model: {}".format(mean_absolute_error(y_test, y_pred)))

# Predict the test set with the sklearn algorithm
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train.to_frame(), y_train)
y_pred2 = regressor.predict(X_test.to_frame())
print("MAE for the algorithm of the sklearn module: {}".format(mean_absolute_error(y_test, y_pred2)))
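
For reference, the mean absolute error is the average of the absolute differences between the true values and the predicted values, so a lower value means a better fit. Here is a minimal pure-Python sketch with made-up numbers (the function name `mae` and the values are illustrative and are not taken from the dataset):

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up salaries and predictions, for illustration only
y_true = [40000.0, 60000.0, 80000.0]
y_pred = [42000.0, 57000.0, 81000.0]
print(mae(y_true, y_pred))  # 2000.0
```

The errors here are 2000, 3000 and 1000, so the model is off by 2000 on average. The result is expressed in the same unit as the salary, which makes the two MAE values printed above directly comparable.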