<h1> Regression Notes </h1>

- A linear regression model makes a prediction by computing the weighted sum of the input features plus a constant called the "bias term" (or intercept).
- A common way to measure how well the model fits the data is the RMSE (Root Mean Squared Error).
- To train the model, we need to find the values of $\theta$ (the bias term and the coefficients of the input features) that minimize the RMSE.
- In practice we minimize the MSE instead, which is simpler to work with than the RMSE.
- Minimizing the RMSE and the MSE leads to the same result: the square root is an increasing function, so the value that minimizes a non-negative function also minimizes its square root.
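
Written out, the points above look as follows. The notation is an assumption made here for illustration: $m$ is the no. of training instances, $n$ the no. of features, $\theta_{0}$ the bias term and $y_{pred}^{(i)}$ the prediction for the $i$-th instance.

\begin{gather*}
y_{pred}=\theta_{0}+\theta_{1}x_{1}+\dots+\theta_{n}x_{n} \\
MSE(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left(y_{pred}^{(i)}-y^{(i)}\right)^{2} \\
RMSE(\theta)=\sqrt{MSE(\theta)}
\end{gather*}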

In [5]:
#import necessary libraries for data wrangling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#import machine learning libraries and modules
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error

Let's use multiple linear regression to predict the income of a person based on two key variables:
- Age: Age of the person
- Experience: No. of years the person has worked

In [6]:
#Load some data from Kaggle
import os

# Directory that contains the dataset
path = "C:\\Users\\jaspe\\OneDrive\\Documents"

# Name of the CSV file to load
file_to_find = "multiple_linear_regression_dataset.csv"

# Build the full path and read the file as a CSV using pandas
# (raises FileNotFoundError if the file is not in that directory,
# instead of silently leaving `data` undefined)
data = pd.read_csv(os.path.join(path, file_to_find))
data


Unnamed: 0,age,experience,income
0,25,1,30450
1,30,3,35670
2,47,2,31580
3,32,5,40130
4,43,10,47830
5,51,7,41630
6,28,5,41340
7,33,4,37650
8,37,5,40250
9,39,8,45150


For simplicity, let's just look at the relationship between income and these two variables.

In [7]:
X = data[['age', 'experience']] #independent variable
y = data['income'] #dependent variable

To perform basic linear regression analysis, we do the following:

In [8]:
lr = LinearRegression() #instantiate 
lr_fit = lr.fit(X, y) #fit/train model

To find the intercept and coefficients:

In [9]:
lr.intercept_, lr.coef_

(31261.689854101285, array([ -99.19535546, 2162.40419192]))

From here, we can see that our model is now:

\begin{gather*}
y_{pred}=-99.20x_{1}+2162.40x_{2}+31261.69
\end{gather*}

\begin{gather*}
y_{pred}: \text{predicted income} \\
x_{1}: \text{age input feature} \\
x_{2}: \text{experience input feature}
\end{gather*}

Let's predict the income of a person based on the model above using the following example:
- Age = 23
- Experience = 3

In [10]:
X_test = pd.DataFrame([[23, 3]], columns=['age', 'experience']) #age and experience, with the same column names used for fitting

y_pred = lr_fit.predict(X_test) #predict our income
y_pred



array([35467.40925427])

Therefore, the predicted income for a 23-year-old with 3 years of experience is about $35,467.
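
As a quick sanity check, the same prediction can be reproduced by hand from the intercept and coefficients printed earlier. This is only a sketch; the variable names are illustrative and not part of the notebook.

```python
# Intercept and coefficients copied from the lr.intercept_, lr.coef_ output
intercept = 31261.689854101285
coef_age, coef_experience = -99.19535546, 2162.40419192

age, experience = 23, 3

# Weighted sum of the input features plus the bias term
manual_pred = intercept + coef_age * age + coef_experience * experience
print(round(manual_pred, 2))
```

The result should agree with the value returned by `predict` up to the rounding of the printed coefficients.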

LinearRegression() computes the pseudoinverse of the feature matrix using a standard matrix factorization technique called Singular Value Decomposition (SVD). The computational complexity of this approach is about $O(n^{2})$, where $n$ is the number of features, so if you double the number of features in the model, you multiply the computation time by roughly 4.
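
The pseudoinverse approach can be sketched directly with NumPy. The example below assumes only the ten rows displayed earlier are available, so its numbers need not match the `lr.intercept_` and `lr.coef_` values from the full dataset; the variable names are illustrative.

```python
import numpy as np

# The ten rows shown in the output earlier: age, experience, income
rows = np.array([
    [25, 1, 30450], [30, 3, 35670], [47, 2, 31580], [32, 5, 40130],
    [43, 10, 47830], [51, 7, 41630], [28, 5, 41340], [33, 4, 37650],
    [37, 5, 40250], [39, 8, 45150],
], dtype=float)
X, y = rows[:, :2], rows[:, 2]

# Add a column of ones so the first parameter acts as the bias term
X_b = np.c_[np.ones((len(X), 1)), X]

# np.linalg.pinv computes the Moore-Penrose pseudoinverse via SVD
theta_pinv = np.linalg.pinv(X_b) @ y

# np.linalg.lstsq solves the same least-squares problem directly
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_pinv)   # [bias, age coefficient, experience coefficient]
print(theta_lstsq)
```

Both routes should give the same parameters up to floating-point error.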

Solving least squares directly with the Normal Equation has a few issues. It may not work when $X^{T}X$ is not invertible, for example:
- If the no. of training instances in our dataset is less than the no. of features
- If some features are redundant

The pseudoinverse, by contrast, is always defined.
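
A small sketch of the redundant-feature case, under the assumption of a made-up extra column (`age_months`, which is not in the dataset): $X^{T}X$ loses rank and cannot be inverted, while the pseudoinverse still yields a least-squares fit.

```python
import numpy as np

age = np.array([25, 30, 47, 32, 43, 51, 28, 33, 37, 39], dtype=float)
income = np.array([30450, 35670, 31580, 40130, 47830,
                   41630, 41340, 37650, 40250, 45150], dtype=float)

# Hypothetical redundant feature: age in months carries no new information
age_months = 12 * age
X_red = np.c_[np.ones(len(age)), age, age_months]

# X^T X is 3x3 but only has rank 2, so it cannot be inverted
gram = X_red.T @ X_red
rank = np.linalg.matrix_rank(gram, tol=1e-6)
print("rank of X^T X:", rank)

# The pseudoinverse is still defined and gives a least-squares solution
theta_red = np.linalg.pinv(X_red, rcond=1e-10) @ income
pred_red = X_red @ theta_red

# Compare with the fit that simply leaves the redundant column out
X_ok = np.c_[np.ones(len(age)), age]
theta_ok, *_ = np.linalg.lstsq(X_ok, income, rcond=None)
pred_ok = X_ok @ theta_ok
print(np.allclose(pred_red, pred_ok))
```

The fitted values agree because both feature matrices span the same column space; only the split of the weight between `age` and `age_months` is arbitrary.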

However, the training time is linear with regards to the no. of training instances (it is $O(m)$). Therefore, it can handle large training sets provided that they fit in memory.

Once your linear regression model is trained, it takes very little time to predict, since the computational complexity is linear with regards to both the no. of instances you want to make predictions on and the no. of features, e.g. if you double the no. of instances (or the no. of features), it will take roughly twice as long to predict.

<h1>Gradient Descent</h1>

Gradient Descent is a generic optimization algorithm that trains a model by iteratively tweaking its parameters to minimize the cost function.

- Gradient Descent measures the local gradient (steepness) of the cost function with regards to the parameter vector $\theta$ and takes a step in the direction of descending gradient. It repeats this until the gradient becomes 0, at which point it has reached a minimum.
- Gradient Descent starts by filling $\theta$ with random values (random initialization).
- The goal is to minimize the cost function (here the MSE) as much as possible
- The step size is determined by the learning rate:
    - Learning rate is too high = steps are too big and the algorithm may overshoot and end up diverging from the global minimum
    - Learning rate is too low = steps are too small and the model will take a long time to train
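
The effect of the learning rate can be seen on a toy problem. This sketch assumes a one-parameter cost $f(\theta)=\theta^{2}$ (minimum at 0), chosen purely for illustration; the function name and the learning rates are hypothetical.

```python
def gradient_descent_1d(eta, theta0=5.0, n_steps=50):
    """Minimize the toy cost f(theta) = theta**2 with a fixed learning rate."""
    theta = theta0
    for _ in range(n_steps):
        gradient = 2 * theta            # derivative of theta**2
        theta = theta - eta * gradient  # step against the gradient
    return theta

too_small = gradient_descent_1d(eta=0.001)  # barely moves away from 5.0
reasonable = gradient_descent_1d(eta=0.1)   # ends very close to the minimum at 0
too_big = gradient_descent_1d(eta=1.1)      # overshoots further on every step

print(too_small, reasonable, too_big)
```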

Sometimes, the cost function may not be perfectly convex (shaped like a regular bowl) and may contain ridges, plateaus and bumps along the way. Smaller dips along the cost function are known as local minima. For example, picture a cost curve with a local minimum on the left and a long plateau on the right:
- If random initialization of $\theta$ starts on the left, it will converge to the local minimum, which is not as good as the global minimum
- If random initialization starts on the right, it will take a long time to cross the plateau
- If we stop too early, we may never reach the global minimum

Fortunately, the MSE cost function for a linear regression model is convex, meaning that if we take any two points on the curve and draw a line segment between them, the segment never crosses below the curve. This implies there are no local minima, just one global minimum. It is also a continuous function.
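
The line-segment description of convexity can be stated as an inequality. This is the standard textbook definition, written here for a generic function $f$ and two arbitrary points $a$ and $b$:

\begin{gather*}
f(\lambda a+(1-\lambda)b)\le\lambda f(a)+(1-\lambda)f(b)\quad\text{for all } a, b \text{ and } 0\le\lambda\le 1
\end{gather*}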

A few other key points:
- Gradient descent can get very close to the global minimum (if we wait long enough and the learning rate isn't too high)

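Visually speaking, when all features have a similar scale, Gradient Descent heads almost straight towards the global minimum. If the features have very different scales, the bowl becomes elongated: Gradient Descent first moves in a direction almost orthogonal to the direction of the global minimum and then slowly travels down an almost flat valley, so it takes much longer to converge.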

<b>NOTE: YOU MUST SCALE YOUR FEATURES (e.g. using StandardScaler()) when using Gradient Descent, or else, due to the different magnitudes and scales of the original data, it could take a long time to converge.</b>
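
A minimal sketch of scaling with `StandardScaler`, assuming only the age and experience values from the ten rows displayed earlier:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# age and experience columns from the ten rows shown earlier
X = np.array([
    [25, 1], [30, 3], [47, 2], [32, 5], [43, 10],
    [51, 7], [28, 5], [33, 4], [37, 5], [39, 8],
], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learn mean/std per column, then standardize

print(X_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_scaled.std(axis=0))   # approximately 1 for each feature

# New data is transformed with the statistics learned from the training data
X_new_scaled = scaler.transform([[23, 3]])
print(X_new_scaled)
```

Reusing the fitted scaler for new inputs keeps them on the same scale the model was trained on.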


<h2>Batch Gradient Descent</h2>

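Batch Gradient Descent calculates the gradient of the cost function with regards to each model parameter $\theta_{j}$. Essentially, we are taking the partial derivative of the cost function with regards to each parameter, i.e. measuring how much the cost changes if that parameter changes just a little bit.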
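
For the MSE cost function of a linear regression model, these partial derivatives have a closed form. The notation is assumed here for illustration: $X$ is the feature matrix with a leading column of ones, $x^{(i)}$ is its $i$-th row, $m$ is the no. of training instances and $\eta$ is the learning rate.

\begin{gather*}
\frac{\partial}{\partial\theta_{j}}MSE(\theta)=\frac{2}{m}\sum_{i=1}^{m}\left(\theta^{T}x^{(i)}-y^{(i)}\right)x_{j}^{(i)} \\
\nabla_{\theta}MSE(\theta)=\frac{2}{m}X^{T}(X\theta-y) \\
\theta\leftarrow\theta-\eta\,\nabla_{\theta}MSE(\theta)
\end{gather*}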

Batch Gradient Descent has one main downfall:
- It uses the full training set at every step, so it becomes terribly slow as the number of training instances increases.

But...
- Gradient Descent scales well with the no. of features, so training a linear regression model on data with a very large no. of features will be quicker with Gradient Descent than with the Normal Equation or SVD.





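# Scale the features first; with unscaled age/experience, eta = 0.1 diverges
X_scaled = StandardScaler().fit_transform(X)

m = len(X_scaled)                        # no. of training instances
X_b = np.c_[np.ones((m, 1)), X_scaled]   # add x0 = 1 to each instance for the bias term
y_col = y.to_numpy().reshape(-1, 1)      # targets as an (m, 1) column vector

eta = 0.1            # learning rate
n_iterations = 1000

theta = np.random.randn(3, 1)  # random initialization: bias + 2 feature weights

for iteration in range(n_iterations):
    # gradient of the MSE cost computed over the full training set
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y_col)
    theta = theta - eta * gradients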

If the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), Batch Gradient Descent with a fixed learning rate will eventually converge to the optimal solution, provided the learning rate is not too high.
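
This convergence can be checked numerically. The sketch below assumes the ten rows displayed earlier, standardizes the features with plain NumPy, and compares the Batch Gradient Descent result with the exact least-squares solution; the seed and variable names are illustrative.

```python
import numpy as np

# The ten rows shown earlier: age, experience, income
rows = np.array([
    [25, 1, 30450], [30, 3, 35670], [47, 2, 31580], [32, 5, 40130],
    [43, 10, 47830], [51, 7, 41630], [28, 5, 41340], [33, 4, 37650],
    [37, 5, 40250], [39, 8, 45150],
], dtype=float)
X, y = rows[:, :2], rows[:, 2:]   # y kept as an (m, 1) column vector

# Standardize the features (zero mean, unit variance per column)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
m = len(X_scaled)
X_b = np.c_[np.ones((m, 1)), X_scaled]

eta, n_iterations = 0.1, 1000
rng = np.random.default_rng(42)
theta = rng.standard_normal((3, 1))   # random initialization

for _ in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # full-batch MSE gradient
    theta = theta - eta * gradients

# Exact least-squares solution on the same scaled data, for comparison
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta.ravel())
print(theta_exact.ravel())
```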

<h2>Stochastic Gradient Descent</h2>

Stochastic Gradient Descent picks a random instance from the training set at each step, rather than using the full training set. This means that the gradient is calculated from that single instance alone.
- It works on a single instance rather than the full training set each time, which makes each step much faster
- Can be used for very large training sets, since only one instance needs to be in memory at each step
- SGD can be good at getting out of local minima due to random "jumping" in error values

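Issue however:
- Because the method is random, the cost does not decrease gently. It bounces up and down, decreasing only on average, and once it gets near the minimum it continues to "bounce and fluctuate" in error without ever settling.
- This makes it difficult to attain the optimal minimum: the final parameter values are good, but not optimal
- On the other hand, if the cost function is irregular, this randomness makes it easier for SGD than for Batch Gradient Descent to escape local minima and reach the global minimum.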

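To solve the issue of never settling at the global minimum we can:
- gradually reduce the learning rate
- make the first step sizes larger (to make quick progress and escape local minima), then slowly decrease them so that $\theta$ can settle at the global minimum

The function that determines the learning rate at each iteration is called the learning <b>schedule</b>.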
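
A minimal sketch of Stochastic Gradient Descent with a simple learning schedule, assuming the ten rows displayed earlier. The schedule form `t0 / (t + t1)`, its hyperparameters, the no. of epochs and the seed are all illustrative choices, not values taken from the notebook.

```python
import numpy as np

rows = np.array([
    [25, 1, 30450], [30, 3, 35670], [47, 2, 31580], [32, 5, 40130],
    [43, 10, 47830], [51, 7, 41630], [28, 5, 41340], [33, 4, 37650],
    [37, 5, 40250], [39, 8, 45150],
], dtype=float)
X, y = rows[:, :2], rows[:, 2:]
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the features
m = len(X_scaled)
X_b = np.c_[np.ones((m, 1)), X_scaled]

def mse(theta):
    """MSE of the linear model with parameters theta on the training data."""
    return float(np.mean((X_b @ theta - y) ** 2))

t0, t1 = 5, 50   # hypothetical learning schedule hyperparameters

def learning_schedule(t):
    """Learning rate that shrinks as the step counter t grows."""
    return t0 / (t + t1)

rng = np.random.default_rng(42)
theta = rng.standard_normal((3, 1))   # random initialization
initial_mse = mse(theta)

n_epochs = 50
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)                       # pick one random instance
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)    # gradient from that instance only
        eta = learning_schedule(epoch * m + i)      # large steps first, smaller later
        theta = theta - eta * gradients

final_mse = mse(theta)
print(initial_mse, final_mse)
```

Scikit-Learn's `SGDRegressor`, imported at the top of the notebook, offers the same idea with built-in learning schedules.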