<h1> Regression Notes </h1>

- Linear regression makes a prediction by computing the weighted sum of the input features plus a constant called the "bias term" (the intercept).
- A common way to measure how well the model fits is RMSE (Root Mean Squared Error).
- To train the model, we need to find the values of theta (the bias term and the coefficients of the input features) that minimize the RMSE.
- In practice, we minimize the MSE, which is simpler than minimizing the RMSE.
- Minimizing the MSE and the RMSE leads to the same result, since the square root is an increasing function: the value that minimizes the MSE also minimizes its square root.
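
As a minimal sketch of the points above, the snippet below computes predictions as a weighted sum plus a bias and then evaluates the MSE and RMSE with NumPy. The arrays and the parameter values (`theta`, `bias`) are made-up numbers for illustration only; they are not taken from the dataset used later.

```python
import numpy as np

# Hypothetical toy data: 4 instances, 2 features (illustrative values only)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([8.0, 6.5, 10.5, 15.0])

# Hypothetical parameters: feature weights and the bias term
theta = np.array([2.0, 1.5])
bias = 3.0

# Prediction = weighted sum of the input features + bias
y_hat = X @ theta + bias

# MSE is the mean of the squared errors; RMSE is its square root
mse = np.mean((y_hat - y) ** 2)
rmse = np.sqrt(mse)
print(mse, rmse)
```

Because `rmse` is just the square root of `mse`, any change to `theta` or `bias` that lowers one also lowers the other.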

In [2]:
#import necessary libraries for data wrangling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#import machine learning libraries and modules
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error

Let's use multiple linear regression to predict the income of a person based on two key variables:
- Age: Age of the person
- Experience: No. of years the person has worked

In [25]:
#Load some data from kaggle
import os
import pandas as pd

# Specify the path to the directory you want to search in
path = "C:\\Users\\jaspe\\OneDrive\\Documents"

# Specify the name of the file you're looking for
file_to_find = "multiple_linear_regression_dataset.csv"

# Build the full path and stop with a clear error if the file is missing
csv_path = os.path.join(path, file_to_find)
if not os.path.isfile(csv_path):
    raise FileNotFoundError(f"{file_to_find} not found in {path}")

# Read the file as a CSV using pandas
data = pd.read_csv(csv_path)
data


   age  experience  income
0   25           1   30450
1   30           3   35670
2   47           2   31580
3   32           5   40130
4   43          10   47830
5   51           7   41630
6   28           5   41340
7   33           4   37650
8   37           5   40250
9   39           8   45150


For simplicity, let's use just two input features (age and experience) to predict income.

In [22]:
X = data[['age', 'experience']] #independent variable
y = data['income'] #dependent variable

To perform basic linear regression analysis, we do the following:

In [32]:
lr = LinearRegression() #instantiate 
lr_fit = lr.fit(X, y) #fit/train model

To find the intercept and coefficients:

In [30]:
lr.intercept_, lr.coef_

(31261.689854101285, array([ -99.19535546, 2162.40419192]))

From here, we can see that our model is now:

\begin{gather*}
y_{pred} = -99.20x_{1} + 2162.40x_{2} + 31261.69
\end{gather*}

\begin{gather*}
y_{pred}: \text{predicted income} \\
x_{1}: \text{age input feature} \\
x_{2}: \text{experience input feature}
\end{gather*}

Let's predict the income of a person based on the model above using the following example:
- Age = 23
- Experience = 3

In [39]:
X_test = [[23,3]] #initialise age and experience as array

y_pred = lr_fit.predict(X_test) #predict our income
y_pred



array([35467.40925427])

Therefore, the predicted income for a 23-year-old with 3 years of experience is $35,467.
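
As a quick sanity check, the same prediction can be reproduced by hand from the fitted parameters. The sketch below copies the intercept and coefficients from the `lr.intercept_, lr.coef_` output above; the variable names are only illustrative.

```python
import numpy as np

# Values copied from the lr.intercept_ and lr.coef_ output above
intercept = 31261.689854101285
coef = np.array([-99.19535546, 2162.40419192])

# Same example as above: age = 23, experience = 3
x = np.array([23, 3])

# Weighted sum of the input features plus the bias term
manual_pred = intercept + coef @ x
print(round(manual_pred, 2))
```

The result should agree with the value returned by `predict` up to the rounding of the printed coefficients.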

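Scikit-learn's `LinearRegression` does not invert $X^TX$ directly as the Normal Equation would; it computes the pseudoinverse of $X$ using a standard matrix factorization technique called Singular Value Decomposition (SVD). With this approach the computational cost grows roughly with the square of the number of features, so if you double the number of features in the model, you multiply the computation time by roughly 4.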
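
As a rough sketch of the two routes mentioned above, the snippet below solves a small least-squares problem once with the Normal Equation and once with the SVD-based pseudoinverse (`np.linalg.pinv`). The synthetic data and the "true" parameters (4, 3, -2) are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 4 + 3*x1 - 2*x2 + noise
X = rng.normal(size=(100, 2))
y = 4 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Add a column of ones so the first parameter acts as the bias term
X_b = np.c_[np.ones(len(X)), X]

# Normal Equation: theta = (X^T X)^-1 X^T y
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Pseudoinverse route (np.linalg.pinv is computed via SVD)
theta_pinv = np.linalg.pinv(X_b) @ y

print(theta_normal)
print(theta_pinv)
```

On well-conditioned data like this, both routes should give practically the same parameters.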

The Normal Equation (least-squares) has a few issues:
- It may not work if $X^TX$ is not invertible, e.g., when the no. of training instances is less than the no. of features
- The same problem occurs if some features are redundant
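
To illustrate the redundant-feature case, here is a small sketch with hypothetical data in which the second feature is exactly twice the first. The matrix $X^TX$ then loses rank and cannot be inverted reliably, while the pseudoinverse still yields a least-squares solution.

```python
import numpy as np

# Hypothetical data: the second feature is exactly twice the first (redundant)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.c_[x1, 2 * x1]
y = 1 + 5 * x1  # illustrative target

# Add a column of ones for the bias term
X_b = np.c_[np.ones(len(X)), X]

# X^T X is 3x3 but its rank is lower, so it is singular
rank = np.linalg.matrix_rank(X_b.T @ X_b)
print(rank)

# The pseudoinverse is still defined and gives a least-squares solution
theta = np.linalg.pinv(X_b) @ y
print(np.allclose(X_b @ theta, y))
```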

However, the training time is linear with regards to the no. of training instances.

Therefore, it can handle large training sets efficiently, provided that they fit in memory.

Once your linear regression model is trained, predictions are very fast: the computational complexity is linear with regards to both the no. of instances you want to make predictions on and the no. of features, e.g., if you double the no. of instances (or the no. of features), prediction will take roughly twice as long.

<h1>Gradient Descent</h1>

Gradient Descent is a generic optimization algorithm that iteratively tweaks a model's parameters to minimize a cost function.

- Gradient Descent measures the local gradient (steepness) of the cost function with regard to the parameter vector $\theta$ and steps in the direction of descending gradient; once the gradient is 0, it has reached a minimum.
- It starts by filling $\theta$ with random values (random initialization).
- The goal is to minimize the cost function (here, MSE) as much as possible.
- The step size is determined by the learning rate:
    - Learning rate too high = steps are too big and the algorithm may diverge from the minimum
    - Learning rate too low = steps are too small and the model will take a long time to train
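
The bullets above can be sketched as a minimal batch Gradient Descent for linear regression with an MSE cost. The synthetic data, the function name `batch_gradient_descent`, and the learning-rate values are assumptions chosen for illustration.

```python
import numpy as np

def batch_gradient_descent(X_b, y, eta, n_iterations=1000, seed=42):
    """Minimize the MSE of a linear model; X_b must include a bias column."""
    rng = np.random.default_rng(seed)
    m = len(X_b)
    theta = rng.normal(size=X_b.shape[1])  # random initialization
    for _ in range(n_iterations):
        gradients = 2 / m * X_b.T @ (X_b @ theta - y)  # gradient of the MSE
        theta = theta - eta * gradients  # step against the gradient
    return theta

# Synthetic data (illustrative assumption): y = 4 + 3*x + noise
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(scale=0.5, size=100)
X_b = np.c_[np.ones(100), x]

theta_good = batch_gradient_descent(X_b, y, eta=0.1)      # reasonable learning rate
theta_small = batch_gradient_descent(X_b, y, eta=0.0001)  # too small: still far from the minimum
theta_big = batch_gradient_descent(X_b, y, eta=1.0, n_iterations=50)  # too big: diverges

print(theta_good)
print(theta_small)
print(theta_big)
```

With these settings, `theta_good` should land close to the least-squares solution, `theta_small` should still be far from it after the same number of iterations, and `theta_big` should blow up to very large values.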

Sometimes, your cost function may not be perfectly convex (a regular bowl shape) and may contain ridges and bumps along the way; smaller dips in the cost function are known as local minima.
- If the random initialization of $\theta$ starts on the left, it will reach a local minimum instead of the global minimum
- If the random initialization starts on the right, it will take a long time to converge
- If we stop too early, we may never reach the global minimum
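
To make the local-minimum case concrete, here is a small sketch using a hypothetical one-dimensional, non-convex cost function with two dips. The function `f`, the starting points, and the learning rate are all assumptions chosen for illustration; they do not come from the regression problem above.

```python
def f(x):
    # Hypothetical non-convex cost with two minima (chosen for illustration)
    return x**4 - 4 * x**2 - x

def grad_f(x):
    # Derivative of f
    return 4 * x**3 - 8 * x - 1

def gradient_descent(x0, eta=0.01, n_iterations=500):
    x = x0
    for _ in range(n_iterations):
        x = x - eta * grad_f(x)  # step against the gradient
    return x

x_left = gradient_descent(-2.0)   # start on the left
x_right = gradient_descent(2.0)   # start on the right

print(x_left, f(x_left))
print(x_right, f(x_right))
```

Starting on the left, the algorithm settles in the shallower dip (a local minimum); starting on the right, it reaches the deeper one, which is the global minimum of this particular function.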

In [3]:
import os
print(os.getcwd())

C:\Users\jaspe\Documents\Regression


In [5]:
import os
os.chdir("C:\\Users\\jaspe\\onedrive\\Documents")

In [6]:
print(os.getcwd())

C:\Users\jaspe\onedrive\Documents
