***FCIM.FIA - Fundamentals of Artificial Intelligence***

> **Lab 4:** Learning \
> **Performed by:** Dobrojan Alexandru, FAF-212 \
> **Verified by:** Elena Graur, asist. univ.

# Theory
Linear regression is one of the simplest machine learning models for predicting a numeric value. It is based on the regression line, or line of best fit, from statistics: the line that best fits a set of data points that contain some amount of noise. In this laboratory work I need only predictive linear regression, which takes the input variables and produces a float value, in my case the predicted salary of a person.

As the figure below shows, the regression line approximates the trend of the dependent variable, i.e. where the data points would lie if no noise were added to them. A regression line between the dependent and independent variables is meaningful only if there is a correlation between them. Correlation measures the tendency of two variables to change together and ranges from 1, meaning perfect positive correlation, to -1, meaning perfect negative correlation; 0 means there is no linear correlation at all and the variables change independently of each other. The regression line can be expressed by the function $y = \alpha x + \beta$ for a single independent variable and $y = \alpha_1 x_1 + \alpha_2 x_2 + \dots + \alpha_n x_n + \beta$ for $n$ independent variables.

The model does essentially the same thing: it finds the intercept and the slopes of the regression line and uses the resulting function to produce outputs for the input variables.

![linear regression example](img/linear-regression.png)
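
The ideas above can be sketched on synthetic data. The slope 2.0, intercept 1.0 and noise level below are arbitrary assumptions chosen for illustration, not values from the lab dataset; the sketch measures the Pearson correlation and fits a regression line with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data: y = 2x + 1 plus Gaussian noise (illustrative values only)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Pearson correlation coefficient between x and y, always within [-1, 1]
correlation = np.corrcoef(x, y)[0, 1]

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

print(f"correlation = {correlation:.3f}")
print(f"y ~ {slope:.3f} * x + {intercept:.3f}")
```

With this little noise the correlation is close to 1 and the fitted slope and intercept land near the values used to generate the data.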

## Linear regression types
There are several types of linear regression algorithms based on different formulas. Each type of linear regression has its strengths and weaknesses. In this laboratory work I used Ordinary Least Squares, Lasso and Ridge methods.

- **Ordinary least squares** (OLS) regression finds the best-fit line by minimizing the sum of squared errors (SSE): $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
Finding the coefficients requires solving the normal equation $$\beta = (X^T X)^{-1} X^T y$$ for an arbitrary number of regressors. Here $\beta$ denotes the whole vector of coefficients, not only the intercept as in the line equation above.
- **Lasso** regression is an optimization with a hyperparameter $\alpha$. It adds an L1 regularization term to the objective function, making it $$\min_{\beta} \frac{1}{2}\sum_{i=1}^n (y_i - X_i \beta)^2 + \alpha \sum_{j=1}^p |\beta_j|$$ where $\alpha$ is the regularization strength, a non-negative float in the range [0, $\infty$).
- **Ridge** regression, also known as Tikhonov regularization, is a type of linear regression that includes an L2 regularization term to prevent overfitting and improve the model's generalization to new data. Its objective function is $$\min_{\beta} \frac{1}{2}\sum_{i=1}^n (y_i - X_i \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2$$ where $\lambda$ plays the same role as $\alpha$ does in Lasso (scikit-learn names this parameter `alpha` for both models).
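
The closed-form solutions can be sketched with NumPy. The data, the coefficient values and `lam` below are assumptions made up for illustration. The ridge line follows from setting the gradient of the objective written above (with its factor of 1/2) to zero, which gives $(X^T X + 2\lambda I)\beta = X^T y$; Lasso has no closed form and is left out.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic design matrix with a leading column of ones for the intercept
n_samples = 100
features = rng.normal(size=(n_samples, 2))
X = np.column_stack([np.ones(n_samples), features])
true_beta = np.array([3.0, 1.5, -2.0])  # illustrative values only
y = X @ true_beta + rng.normal(scale=0.1, size=n_samples)

# OLS: solve (X^T X) beta = X^T y rather than forming the inverse explicitly
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge for the objective 1/2 * SSE + lam * sum(beta_j^2):
# zero gradient gives (X^T X + 2 * lam * I) beta = X^T y.
# Note: this simple form also penalizes the intercept, which libraries
# such as scikit-learn usually avoid.
lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + 2 * lam * np.eye(X.shape[1]), X.T @ y)

print("OLS:  ", beta_ols.round(3))
print("Ridge:", beta_ridge.round(3))
```

The ridge coefficients come out smaller in norm than the OLS ones, which is the shrinkage effect the penalty is meant to produce.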

# Practice

## Global definitions part, imports, loading the data and printing some information

In [54]:
# Imports section

import pandas as pd
import seaborn as sns
import numpy as np

# Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
# Dataset selection
from sklearn.model_selection import train_test_split
# Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [55]:
# Global definitions
COLUMN_TO_PREDICT = 'TotalPay'
# Fraction of the dataset reserved for testing: 0.15 means 15% of the rows are randomly
# chosen for testing (verifying) and the remaining 85% are used for training the model
TEST_DATASET_SIZE = 0.15

In [56]:
# Loading the data
data = pd.read_csv('data/data.csv', low_memory=False)
# Let's have a look at the loaded data
data.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


In [57]:
# Print the data types of each column
print(data.dtypes)

Id                    int64
EmployeeName         object
JobTitle             object
BasePay              object
OvertimePay          object
OtherPay             object
Benefits             object
TotalPay            float64
TotalPayBenefits    float64
Year                  int64
Notes               float64
Agency               object
Status               object
dtype: object


The object data type is the generic fallback in pandas: the column stores arbitrary Python objects, most often strings. A column ends up as object when it holds text or a mix of types, for example strings together with integers and floats. In my case EmployeeName and JobTitle should be strings, so in the next section I convert these columns to the string type.

### Analyzing and transforming the dataset

In [58]:
# Convert object data types to string
data['EmployeeName'] = data['EmployeeName'].astype('string')
data['JobTitle'] = data['JobTitle'].astype('string')
print(data.dtypes)

Id                           int64
EmployeeName        string[python]
JobTitle            string[python]
BasePay                     object
OvertimePay                 object
OtherPay                    object
Benefits                    object
TotalPay                   float64
TotalPayBenefits           float64
Year                         int64
Notes                      float64
Agency                      object
Status                      object
dtype: object
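
The pay columns above are still `object` after this conversion. The sketch below shows one way such a column can arise and how it could be made numeric. The series and its placeholder text are hypothetical; the source does not show what the real pay columns contain.

```python
import pandas as pd

# Hypothetical column mixing numbers, numeric strings and a text placeholder
pay = pd.Series([1200.5, "950", "Not Provided", None], name="HypotheticalPay")
print(pay.dtype)  # mixed content is stored as generic Python objects

# Coerce to numbers; entries that cannot be parsed become NaN
pay_numeric = pd.to_numeric(pay, errors="coerce")
print(pay_numeric.dtype)
print("unparseable or missing entries:", pay_numeric.isna().sum())
```

After coercion the column can take part in arithmetic and correlation, at the cost of turning unparseable entries into missing values that still have to be handled.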


Next, a very important step is looking for columns with missing data. Missing values may be encoded as 0 or as NaN; in my case they are NaN. Here is how I print the names of the columns that contain at least one NaN value:

In [59]:
# See what columns have rows with missing data
print(data.columns[data.isna().any()].values)

['BasePay' 'Benefits' 'Notes' 'Status']
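
Knowing how many values are missing, not only which columns are affected, helps decide between filling and dropping. A minimal sketch on a small hypothetical frame (the rows are made up and stand in for the real dataset):

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the real dataset
frame = pd.DataFrame({
    "JobTitle": ["Clerk", "Nurse", "Clerk", "Driver"],
    "BasePay": [50000.0, np.nan, 52000.0, 61000.0],
    "Notes": [np.nan, np.nan, np.nan, np.nan],
})

missing_count = frame.isna().sum()   # number of NaN values per column
missing_share = frame.isna().mean()  # fraction of rows that are NaN

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary[summary["missing"] > 0])
```

A column with a share of 1.0 is entirely empty, while a small share may be worth filling in.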


In [60]:
print(data.dtypes)


Id                           int64
EmployeeName        string[python]
JobTitle            string[python]
BasePay                     object
OvertimePay                 object
OtherPay                    object
Benefits                    object
TotalPay                   float64
TotalPayBenefits           float64
Year                         int64
Notes                      float64
Agency                      object
Status                      object
dtype: object


Now the data types are more consistent. Next, I look at the unique values in the following columns, because when I inspected the raw dataset they appeared to contain empty cells or only a single value. I use the pandas *unique* function to show the unique values:

In [61]:
# See if these columns have more than a single value
for column in ['Year', 'Notes', 'Agency', 'Status']:
    print(f"{column}: {data[column].unique()}")

Year: [2011 2012 2013 2014]
Notes: [nan]
Agency: ['San Francisco']
Status: [nan 'PT' 'FT']


As the output shows, Notes and Agency can be removed: each has only one value, which makes it useless for prediction. Year has several values, so I keep it. The Status column contains NaN values, which means I have to either fill them in or get rid of the whole column, because missing values would hurt the accuracy of the predictions. The next cell drops Status together with the identifier and pay-component columns, leaving only JobTitle, TotalPay and Year.

In [62]:
cols_to_remove = ['Id', 'EmployeeName', 'BasePay', 'OvertimePay', 'OtherPay', 'Benefits', 'TotalPayBenefits', 'Notes',
                  'Agency', 'Status']
data = data.drop(cols_to_remove, axis='columns')
data.head()

Unnamed: 0,JobTitle,TotalPay,Year
0,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,567595.43,2011
1,CAPTAIN III (POLICE DEPARTMENT),538909.28,2011
2,CAPTAIN III (POLICE DEPARTMENT),335279.91,2011
3,WIRE ROPE CABLE MAINTENANCE MECHANIC,332343.61,2011
4,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",326373.19,2011
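
The choice discussed before this cell (fill in the missing Status values or drop the column) can be sketched as follows. The frame is hypothetical, and the `"Unknown"` label is an assumption for illustration rather than the approach taken in this lab, which drops the column.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the columns inspected above
frame = pd.DataFrame({
    "Agency": ["San Francisco"] * 4,
    "Notes": [np.nan] * 4,
    "Status": [np.nan, "PT", "FT", np.nan],
    "Year": [2011, 2012, 2013, 2014],
})

# Columns with at most one distinct value (NaN counted as a value) carry no information
constant_cols = [c for c in frame.columns if frame[c].nunique(dropna=False) <= 1]

# Option 1: keep Status and mark the missing entries with an explicit label
filled = frame["Status"].fillna("Unknown")

# Option 2: drop the constant columns together with Status
reduced = frame.drop(columns=constant_cols + ["Status"])

print(constant_cols)
print(filled.tolist())
print(reduced.columns.tolist())
```

Filling keeps the rows usable when the column is informative; dropping is simpler when most of the values are missing.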


Next, I have to get rid of the string column, because linear regression works only with numbers, so each string must be replaced by a number. I chose to factorize each job title into an integer that acts as an ID, as shown below:

In [63]:
# Replace (factorize) Job titles from string to integers, for example [a, b, c, a, b] -> [0, 1, 2, 0, 1]
data['JobTitle'], _ = pd.factorize(data['JobTitle'])
data.head()

Unnamed: 0,JobTitle,TotalPay,Year
0,0,567595.43,2011
1,1,538909.28,2011
2,1,335279.91,2011
3,2,332343.61,2011
4,3,326373.19,2011
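
For reference, `pd.factorize` also returns the unique values, so the codes can be mapped back to the titles. The sketch below uses hypothetical titles. It also shows one-hot encoding with `pd.get_dummies`, an alternative not used in this lab, which avoids implying an order between titles but creates one column per distinct title.

```python
import pandas as pd

titles = pd.Series(["Nurse", "Clerk", "Driver", "Nurse", "Clerk"])  # hypothetical titles

# factorize returns integer codes plus the unique values in order of first appearance
codes, uniques = pd.factorize(titles)
print(codes.tolist())  # [0, 1, 2, 0, 1]
print(list(uniques))

# The codes can be mapped back to the original strings
decoded = uniques[codes]

# Alternative: one-hot encoding, one indicator column per distinct title
one_hot = pd.get_dummies(titles, prefix="JobTitle")
print(one_hot.columns.tolist())
```

With many distinct job titles the one-hot table becomes very wide, which is one practical reason to prefer integer codes despite their arbitrary ordering.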


In [64]:
# Show correlations with remaining columns
data.corr()[COLUMN_TO_PREDICT]

JobTitle   -0.350263
TotalPay    1.000000
Year        0.032090
Name: TotalPay, dtype: float64

The correlation with Year is minimal, meaning the two are almost independent of each other. The JobTitle correlation is not meaningful, since the column was factorized from strings and the integer codes carry no natural order.
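
To illustrate why a correlation with factorized codes should not be over-interpreted, the sketch below encodes the same hypothetical titles in two different orders. All values are made up; only the effect of the code assignment matters.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "JobTitle": ["Manager", "Clerk", "Nurse", "Manager", "Clerk", "Nurse"],
    "TotalPay": [200.0, 50.0, 100.0, 210.0, 55.0, 95.0],
})

# Order of first appearance: Manager=0, Clerk=1, Nurse=2
codes_appearance, _ = pd.factorize(frame["JobTitle"])
# Alphabetical order: Clerk=0, Manager=1, Nurse=2
codes_sorted, _ = pd.factorize(frame["JobTitle"], sort=True)

corr_appearance = np.corrcoef(codes_appearance, frame["TotalPay"])[0, 1]
corr_sorted = np.corrcoef(codes_sorted, frame["TotalPay"])[0, 1]

print(f"appearance order: {corr_appearance:.3f}")
print(f"alphabetical order: {corr_sorted:.3f}")
```

The same data yields a negative correlation under one encoding and a positive one under the other, so the sign and size of the coefficient reflect how the codes were assigned rather than a property of the job titles.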

In [65]:
# Divide data into train and test sets
X = data.drop(COLUMN_TO_PREDICT, axis='columns')
y = data[COLUMN_TO_PREDICT]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_DATASET_SIZE)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(126355, 2) (126355,)
(22299, 2) (22299,)
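
Because the split is random, each run of the notebook selects different test rows, so the metrics vary slightly between runs. A minimal sketch of a reproducible split on hypothetical arrays follows; `random_state=42` is an arbitrary seed chosen for this example and is not used in the lab code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 hypothetical rows, 2 features
y = np.arange(20)

# The same seed yields the same partition on every call
split_a = train_test_split(X, y, test_size=0.15, random_state=42)
split_b = train_test_split(X, y, test_size=0.15, random_state=42)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)
print("identical test rows:", np.array_equal(split_a[1], split_b[1]))
```

Fixing the seed makes the reported errors comparable across runs and across models.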


In [66]:
least_squares_regression = LinearRegression()
lasso_regression = Lasso(alpha=5) # default 1
ridge_regression = Ridge(alpha=5) # default 1

predictors = X_train.columns
# Train the model
least_squares_regression.fit(X_train[predictors], y_train)
lasso_regression.fit(X_train[predictors], y_train)
ridge_regression.fit(X_train[predictors], y_train)
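
After fitting, each scikit-learn model exposes the learned slopes in `coef_` and the intercept in `intercept_`. The sketch below compares them on synthetic data; the target definition and `alpha=0.5` are assumptions for illustration and differ from the lab's settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
# Hypothetical target: only the first feature matters
y = 4.0 * X[:, 0] + 10.0 + rng.normal(scale=0.5, size=200)

models = {
    "Least Squares": LinearRegression(),
    "Lasso": Lasso(alpha=0.5),
    "Ridge": Ridge(alpha=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    # coef_ holds one slope per feature, intercept_ the constant term
    print(name, model.coef_.round(3), round(float(model.intercept_), 3))
```

In this sketch Lasso shrinks the first slope and sets the slope of the irrelevant feature to zero, while Ridge stays very close to the OLS solution because the penalty is small relative to the amount of data.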

In [67]:
least_squares_y_pred = least_squares_regression.predict(X_test)
least_squares_y_pred

array([105859.22644974, 111075.81773567, 113792.69638845, ...,
        90510.22580697,  21304.27472325, 104677.38950985])

In [68]:
lasso_y_pred = lasso_regression.predict(X_test)
lasso_y_pred

array([105860.26893026, 111076.07084698, 113781.21322201, ...,
        90502.26571999,  21314.33691773, 104667.28566226])

In [69]:
ridge_y_pred = ridge_regression.predict(X_test)
ridge_y_pred

array([105859.38701244, 111075.85890108, 113790.95668171, ...,
        90509.01898928,  21305.79768305, 104675.8584342 ])

In [70]:
merged_df = X_test.copy()
merged_df['Predicted (Least Squares)'] = least_squares_y_pred
merged_df['Predicted (Lasso)'] = lasso_y_pred
merged_df['Predicted (Ridge)'] = ridge_y_pred
merged_df['Actual'] = y_test
merged_df.head()

Unnamed: 0,JobTitle,Year,Predicted (Least Squares),Predicted (Lasso),Predicted (Ridge),Actual
12059,144,2011,105859.22645,105860.26893,105859.387012,88830.79
9146,49,2011,111075.817736,111076.070847,111075.858901,101931.31
113336,1047,2014,113792.696388,113781.213222,113790.956682,160073.15
111241,1045,2014,113902.519363,113891.019578,113900.777143,197673.19
20641,128,2011,106737.810245,106738.719779,106737.950699,61244.72


In [71]:
least_squares_mae = mean_absolute_error(y_test, least_squares_y_pred)
lasso_mae = mean_absolute_error(y_test, lasso_y_pred)
ridge_mae = mean_absolute_error(y_test, ridge_y_pred)
print('Least Squares MAE:', least_squares_mae)
print('Lasso MAE:', lasso_mae)
print('Ridge MAE:', ridge_mae)

Least Squares MAE: 34525.11296455352
Lasso MAE: 34525.25952910238
Ridge MAE: 34525.135032945516
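
MAE is the average of the absolute errors, expressed in the same units as the target. MSE, computed in the next cell, averages the squared errors instead. A small worked example with hypothetical numbers shows how the two react to a single large error:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])  # hypothetical salaries
y_pred = np.array([110.0, 190.0, 310.0, 300.0])  # the last prediction is far off

errors = y_true - y_pred
mae = np.mean(np.abs(errors))  # average absolute error, in the target's units
mse = np.mean(errors ** 2)     # average squared error, in squared units

print("MAE:", mae)  # 32.5
print("MSE:", mse)  # 2575.0
```

The single error of 100 contributes 10000 of the 10300 total squared error, which is why MSE is much more sensitive to outliers than MAE.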


In [72]:
least_squares_mse = mean_squared_error(y_test, least_squares_y_pred)
lasso_mse = mean_squared_error(y_test, lasso_y_pred)
ridge_mse = mean_squared_error(y_test, ridge_y_pred)
print('Least Squares MSE:', least_squares_mse)
print('Lasso MSE:', lasso_mse)
print('Ridge MSE:', ridge_mse)

Least Squares MSE: 1957812452.8090372
Lasso MSE: 1957805297.0043395
Ridge MSE: 1957811364.0930016
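
MSE is expressed in squared units, which makes it hard to compare with the salaries themselves. Taking the square root gives the RMSE, in the same units as TotalPay. The sketch below applies this to the MSE values printed above; RMSE is an addition for readability and is not part of the original lab code.

```python
import math

# MSE values copied from the output above
reported_mse = {
    "Least Squares": 1957812452.8090372,
    "Lasso": 1957805297.0043395,
    "Ridge": 1957811364.0930016,
}

# RMSE is the square root of MSE, in the same units as TotalPay
reported_rmse = {name: math.sqrt(value) for name, value in reported_mse.items()}
for name, value in reported_rmse.items():
    print(f"{name} RMSE: {value:.2f}")
```

All three RMSE values are about 44247 and differ from each other by less than one unit, so on this split the regularized models behave almost identically to ordinary least squares.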


# Conclusions

