# Linear Regression - multi variable (using batch gradient descent)

## Intro

This notebook implements batch gradient descent for linear regression. The theory comes from Andrew Ng's [Machine Learning](https://www.coursera.org/learn/machine-learning/) course, which I strongly recommend.

I implemented the algorithm in Python.

We are seeking to build a regression function with the hypothesis below: ![hypothesis equation](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-2/img/hypothesis.png)

n = number of features

![feature values](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-2/img/x1x2x3.png) = the values of each feature

We can represent the feature values as the following column vector:

![x vector](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-2/img/x.png)

![theta parameters](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-2/img/tetha_list.png) = parameters (coefficients) of the hypothesis equation

We can represent those parameters as the following column vector:


![theta vector](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-2/img/theta.png) 

The hypothesis can then be written as the product of these two vectors (the transposed parameter vector times the feature vector):

![hypothesis equation](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-2/img/hypothesis_matrix.png)
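
As a minimal sketch of this vector form, the hypothesis for a single example is a dot product in NumPy. The function name `hypothesis` and the sample values are hypothetical, and the feature vector is assumed to already contain the bias term x0 = 1.

```python
import numpy as np

def hypothesis(theta, x):
    """Return theta^T x for one example; x must already include x0 = 1."""
    return float(np.dot(theta, x))

# Hypothetical example with n = 2 features
theta = np.array([1.0, 2.0, 3.0])   # theta_0, theta_1, theta_2
x = np.array([1.0, 4.0, 5.0])       # x_0 = 1, x_1, x_2
print(hypothesis(theta, x))         # 1*1 + 2*4 + 3*5 = 24.0
```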


## Cost function & Gradient descent

The cost function represents the error: how far are the predicted values from the actual values?

The cost function chosen here is the mean squared error:

![mean squared error](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-2/img/mse.png)
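
The cost can be sketched in a few lines of NumPy. This assumes the 1/(2m) scaling used in the course, which is consistent with the 1/m factor in the update implemented further down; the function name `cost` and the toy data are hypothetical.

```python
import numpy as np

def cost(X, y, theta):
    """Half mean squared error. X is (m, n+1) with a leading column of ones."""
    m = len(y)
    errors = X @ theta - y
    return float(errors @ errors) / (2 * m)

# Toy data following y = 2 * x exactly
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(cost(X, y, np.array([0.0, 2.0])))  # perfect fit, so the cost is 0.0
print(cost(X, y, np.array([0.0, 0.0])))  # (4 + 16 + 36) / 6
```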

Our goal is to minimize the cost function. To do so, we simultaneously update every parameter theta by subtracting a small fraction (the learning rate) of the partial derivative of the cost function with respect to that parameter:

![gradient descent](https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-2/img/gradient_descent.png)
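
One such update can be written in vectorized form, which makes the "simultaneous" part explicit: every partial derivative is computed from the old parameters before any of them changes. This is a sketch under the same 1/m convention as the implementation below; `gradient_step` and the toy data are hypothetical.

```python
import numpy as np

def gradient_step(X, y, theta, learning_rate):
    """One simultaneous update of every parameter."""
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m   # all partial derivatives at once
    return theta - learning_rate * gradient

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column is x0 = 1
y = np.array([2.0, 4.0, 6.0])
theta = gradient_step(X, y, np.zeros(2), learning_rate=0.1)
print(theta)  # approximately [0.4, 0.9333]
```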


## Implementation

Open this document in Google Colab first, then run the following cell to import the required modules: <a href="" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
from pandas.core.frame import DataFrame
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
# normalize, feature_engineering and predict are defined in the next cell
import progressbar

Run the following cell. It contains the functions that will be used in the program:

In [0]:
def normalize(X):
    ''' Apply mean normalization, so each feature is at the same scale '''
    avg_X = X.mean()
    # A constant feature has a range of 0: divide by 1 instead to avoid a division by zero
    range_X = (X.max() - X.min()).replace(0, 1)
    
    # Column-wise (x - mean) / range, keeping the original column names and index
    return (X - avg_X) / range_X

def feature_engineering(X):
    ''' Select interesting features and create new calculated features '''
    features = ['LotArea', '1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'KitchenAbvGr', 'FullBath', 'HalfBath', 'BsmtFullBath', 'BsmtHalfBath', 'PoolArea']
    
    X = X[features].copy()
    X["FlrSF"] = X["1stFlrSF"] + X["2ndFlrSF"]
    X["Bath"] = X["FullBath"] +  X["HalfBath"] + X["BsmtFullBath"] + X["BsmtHalfBath"]
    
    X = X.drop(columns=['1stFlrSF', '2ndFlrSF', "FullBath", "HalfBath", "BsmtFullBath", "BsmtHalfBath"])
    
    return X

def predict(X, tetas):
    ''' Predict with the model (tetas)
        X input can be either a Series for a single example to predict or a DataFrame
    '''
    X = X.copy() # So we do not alter the input X
    
    # Check if the input is either a DataFrame or a Series
    if isinstance(X, DataFrame):
        # Add the bias feature x0 = 1, then compute one prediction per row
        X.insert(0, "X0", 1)
        return pd.Series(X.to_numpy().dot(tetas), index=X.index)
    
    else:
        # Series.append was removed in pandas 2.0, so pd.concat is used instead
        X = pd.concat([pd.Series([1]), X])
        return tetas.dot(X)
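
Note that `normalize` computes the mean and range from whatever frame it receives, so a training set and a test set normalized separately are each scaled with their own statistics. A common alternative is to measure the statistics on the training set only and reuse them for the test set. The sketch below illustrates that idea; `fit_normalizer` and `apply_normalizer` are hypothetical helpers that are not part of this notebook, and the sample values are made up.

```python
import pandas as pd

def fit_normalizer(X_train):
    """Return the per-feature mean and range measured on the training set."""
    avg = X_train.mean()
    rng = (X_train.max() - X_train.min()).replace(0, 1)  # constant feature: divide by 1
    return avg, rng

def apply_normalizer(X, avg, rng):
    """Scale any frame with statistics computed elsewhere."""
    return (X - avg) / rng

# Made-up values, for illustration only
train = pd.DataFrame({"LotArea": [1000.0, 2000.0, 3000.0], "Bath": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"LotArea": [2500.0], "Bath": [2.0]})

avg, rng = fit_normalizer(train)
test_scaled = apply_normalizer(test, avg, rng)
print(test_scaled)  # LotArea = 0.25, Bath = 0.0
```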

Run the following cell to launch the linear regression program:

In [0]:
# Set the learning rate and the number of iterations
learning_rate = 0.1
epochs = 50

# Read the data
dataset = pd.read_csv("https://raw.githubusercontent.com/davy-datascience/portfolio/master/LinearRegression/Approach-2/dataset/house_pricing.csv")

# Separate the dataset into a training set and a test set
train, test = train_test_split(dataset, test_size = 0.2)

# Split the train set and the test set into independent variables X and dependent variable y
# Apply the feature engineering and normalize X
X_train = feature_engineering(train)
X_train = normalize(X_train)
y_train = train["SalePrice"]

X_test = feature_engineering(test)
X_test = normalize(X_test)
y_test = test["SalePrice"]

# Arbitrarily start with all tetas = 1
tetas = np.ones(X_train.columns.size + 1)

for epoch in progressbar.progressbar(range(epochs)):
    # Build new_tetas separately so that all tetas are updated simultaneously
    new_tetas = []
    
    # Iterate over all tetas to calculate their new values
    for j in range(len(tetas)):
        sum_cost = []
        # Iterate over all training data
        for i in range(len(X_train.index)):
            # Calculate the predicted value of the data row with the current tetas
            h = predict(X_train.iloc[i], tetas)
            y = y_train.iloc[i]
            x = 1
            if j != 0:
                x = X_train.iloc[i, j - 1]
            sum_cost.append((h - y) * x)
        new_tetas.append(tetas[j] - learning_rate * (1/len(X_train.index)) * sum(sum_cost))

    # Simultaneous update of thetas
    tetas = np.array(new_tetas)

# Predict the test set with my model and report the mean absolute error
y_pred = predict(X_test, tetas)
print("MAE for my model: {}".format(mean_absolute_error(y_test, y_pred)))

# Predict the test set with the sklearn algorithm
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred2 = regressor.predict(X_test)
print("MAE for the algorithm of the sklearn module: {}".format(mean_absolute_error(y_test, y_pred2)))
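
The training loop above recomputes a prediction for every row and every parameter, which is easy to follow but slow. The same batch update can be sketched with matrix operations. This is an illustrative alternative and not the code used above: `batch_gradient_descent` is a hypothetical name, and the data is synthetic so that the expected parameters are known in advance.

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.1, epochs=50):
    """X: (m, n) array of normalized features, y: (m,) array of targets."""
    m = X.shape[0]
    Xb = np.column_stack([np.ones(m), X])   # prepend the bias feature x0 = 1
    tetas = np.ones(Xb.shape[1])            # same starting point as the loop above
    for _ in range(epochs):
        errors = Xb @ tetas - y             # h - y for every training row
        tetas = tetas - learning_rate * (Xb.T @ errors) / m
    return tetas

# Synthetic data for illustration only: y = 3 + 2 * x
rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 + 2 * X[:, 0]
tetas = batch_gradient_descent(X, y, learning_rate=0.5, epochs=2000)
print(tetas)  # close to [3, 2]
```

With the variables from the cell above, it could presumably be called as `batch_gradient_descent(X_train.to_numpy(), y_train.to_numpy(), learning_rate, epochs)`.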