# Recall Machine Learning Linear Regression

At the end of this Lesson the studen will remember the main steps to train a model:

 - Split dataset in train and test subsets
 - Standardize continuous varuables
 - Transform categorical variables to dummy
 - Train linear regression models
 - Train classification models
 - Interpret the error and accuracy metrics to validate the built models

**You have two exercises at the end of the notebook**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Import data and libraries

In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn import metrics

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [3]:
df = pd.read_csv('/content/drive/MyDrive/USC Upstate/IronHack/Prework/DATA/data/Fish.csv')
df

Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width
0,Bream,242.0,23.2,25.4,30.0,11.5200,4.0200
1,Bream,290.0,24.0,26.3,31.2,12.4800,4.3056
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961
3,Bream,363.0,26.3,29.0,33.5,12.7300,4.4555
4,Bream,430.0,26.5,29.0,34.0,12.4440,5.1340
...,...,...,...,...,...,...,...
154,Smelt,12.2,11.5,12.2,13.4,2.0904,1.3936
155,Smelt,13.4,11.7,12.4,13.5,2.4300,1.2690
156,Smelt,12.2,12.1,13.0,13.8,2.2770,1.2558
157,Smelt,19.7,13.2,14.3,15.2,2.8728,2.0672


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Species  159 non-null    object 
 1   Weight   159 non-null    float64
 2   Length1  159 non-null    float64
 3   Length2  159 non-null    float64
 4   Length3  159 non-null    float64
 5   Height   159 non-null    float64
 6   Width    159 non-null    float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB


### Species variable treatment

Species is a categorical variable, hence we need to transform it to dummies before inserting in the model

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html


In [5]:
df.Species.value_counts()

Perch        56
Bream        35
Roach        20
Pike         17
Smelt        14
Parkki       11
Whitefish     6
Name: Species, dtype: int64

Firstly, let's reduce the categories to Perch, Bream and Others

In [6]:
def fish_species(x):
    if x == 'Perch':
        return 'Perch'
    elif x == 'Bream':
        return 'Bream'
    else:
        return 'Others'

df['fish_species'] = df['Species'].apply(fish_species)
df

Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width,fish_species
0,Bream,242.0,23.2,25.4,30.0,11.5200,4.0200,Bream
1,Bream,290.0,24.0,26.3,31.2,12.4800,4.3056,Bream
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961,Bream
3,Bream,363.0,26.3,29.0,33.5,12.7300,4.4555,Bream
4,Bream,430.0,26.5,29.0,34.0,12.4440,5.1340,Bream
...,...,...,...,...,...,...,...,...
154,Smelt,12.2,11.5,12.2,13.4,2.0904,1.3936,Others
155,Smelt,13.4,11.7,12.4,13.5,2.4300,1.2690,Others
156,Smelt,12.2,12.1,13.0,13.8,2.2770,1.2558,Others
157,Smelt,19.7,13.2,14.3,15.2,2.8728,2.0672,Others


### Get dummies

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

https://stats.stackexchange.com/questions/350492/why-do-we-create-dummy-variables

https://towardsdatascience.com/what-are-dummy-variables-and-how-to-use-them-in-a-regression-model-ee43640d573e

In [7]:
df_dum = pd.get_dummies(df.fish_species)
df = df.merge(df_dum, right_index = True, left_index = True, how = 'left')


In [8]:
df_dum.head()

Unnamed: 0,Bream,Others,Perch
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0


In [9]:
df.head()

Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width,fish_species,Bream,Others,Perch
0,Bream,242.0,23.2,25.4,30.0,11.52,4.02,Bream,1,0,0
1,Bream,290.0,24.0,26.3,31.2,12.48,4.3056,Bream,1,0,0
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961,Bream,1,0,0
3,Bream,363.0,26.3,29.0,33.5,12.73,4.4555,Bream,1,0,0
4,Bream,430.0,26.5,29.0,34.0,12.444,5.134,Bream,1,0,0


In [10]:
#We eliminate the other columns because we have created the dummies
df.drop(['Species','fish_species'], axis = 1, inplace = True)
df.columns

Index(['Weight', 'Length1', 'Length2', 'Length3', 'Height', 'Width', 'Bream',
       'Others', 'Perch'],
      dtype='object')

### Train test split

It is mandatory to randomly divide the dataset into two. One for training the model and the test split for validate it.

If error metrics are low with the test split means that our model is robust


https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


 The `random_state` parameter is set to 0, which means that the random splitting of the data will be reproducible. (default is also 0)
 `test_size = 0.2`, it means that we divided our dataset into 80% for training and 20% for testing. (default is 0.25)

In [11]:
fish_train, fish_test = train_test_split(df, test_size=0.2, random_state=0)

In [12]:
print(fish_train.shape)
print(fish_test.shape)

(127, 9)
(32, 9)


### Standardize the numerical variables

Sometimes numerical variables in our dataset have very different scales, that's to have very different values between one column and other. That can harm model accuracy.

For solve this situation, we standardize, that's to put every continuous variable centered in 0 and with standard deviation 1

We **first** standardize the training set, then the test set with the training set parameters

**We do not standardize the target variable**

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler


https://www.askpython.com/python/examples/standardize-data-in-python#:~:text=Ways%20to%20Standardize%20Data%20in%20Python%201%201.,load_iris%20...%202%202.%20Using%20StandardScaler%20%28%29%20function



In [13]:
fish_train.head()

Unnamed: 0,Weight,Length1,Length2,Length3,Height,Width,Bream,Others,Perch
143,1550.0,56.0,60.0,64.0,9.6,6.144,0,1,0
130,300.0,32.7,35.0,38.8,5.9364,4.3844,0,1,0
16,700.0,30.4,33.0,38.3,14.8604,5.2854,1,0,0
96,225.0,22.0,24.0,25.5,7.293,3.723,0,0,1
107,300.0,26.9,28.7,30.1,7.5852,4.6354,0,0,1


In [14]:
fish_train.columns

Index(['Weight', 'Length1', 'Length2', 'Length3', 'Height', 'Width', 'Bream',
       'Others', 'Perch'],
      dtype='object')

In [15]:
scale= StandardScaler()
variables_sc = ['Length1', 'Length2', 'Length3', 'Height', 'Width'] # Variables that we have to standardize

X_train = fish_train[['Length1', 'Length2', 'Length3', 'Height', 'Width', 'Bream','Others', 'Perch']]
y_train  = fish_train['Weight']  # target we don't standardize

X_test = fish_test[['Length1', 'Length2', 'Length3', 'Height', 'Width', 'Bream', 'Others', 'Perch']]
y_test  = fish_test['Weight']   # target


scale_train = scale.fit(X_train[variables_sc]) # Scale the train data. In dummies variables we don't need it because thay are already standardize

X_train_sc = pd.DataFrame(scale_train.transform(X_train[variables_sc]), columns = [variables_sc])
X_train = X_train.drop(variables_sc, axis = 1) # , inplace = True . We elimate the previous variables, that are not standardize with the one that they are -> X_train_sc
X_train = X_train.reset_index(drop = True) # Eliminate the index column
X_train = pd.concat([X_train, X_train_sc], axis = 1) # Concat dummies variables and the standardize comumns
X_train.columns = ['Length1', 'Length2', 'Length3', 'Height', 'Width', 'Bream','Others', 'Perch']
y_train = y_train.reset_index(drop=True) # Eliminate the index colum

X_test_sc = pd.DataFrame(scale_train.transform(X_test[variables_sc]), columns =[variables_sc])
X_test = X_test.drop(variables_sc, axis = 1) # , inplace = True
X_test = X_test.reset_index(drop = True)
X_test = pd.concat([X_test, X_test_sc], axis = 1)
X_test.columns = ['Length1', 'Length2', 'Length3', 'Height', 'Width', 'Bream','Others', 'Perch']
y_test = y_test.reset_index(drop=True) # Eliminate the index colum

In [16]:
X_train_sc.head()

Unnamed: 0,Length1,Length2,Length3,Height,Width
0,3.170918,3.139044,2.989534,0.165693,1.083663
1,0.732809,0.700467,0.734117,-0.6726,0.025189
2,0.492137,0.505381,0.689367,1.369361,0.567179
3,-0.386838,-0.372507,-0.456242,-0.362187,-0.372671
4,0.125897,0.085945,-0.044539,-0.295327,0.176176


## Linear Regression

https://medium.com/swlh/interpreting-linear-regression-through-statsmodels-summary-4796d359035a

In [17]:
X_train = sm.add_constant(X_train)
result = sm.OLS(y_train, X_train).fit()

print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                 Weight   R-squared:                       0.895
Model:                            OLS   Adj. R-squared:                  0.889
Method:                 Least Squares   F-statistic:                     144.5
Date:                Fri, 29 Sep 2023   Prob (F-statistic):           4.55e-55
Time:                        19:51:25   Log-Likelihood:                -774.25
No. Observations:                 127   AIC:                             1564.
Df Residuals:                     119   BIC:                             1587.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        304.3119     10.022     30.365      0.0

In [18]:
X_train['predict'] = result.predict(X_train)

X_test = sm.add_constant(X_test)
X_test['predict'] = result.predict(X_test)

In [19]:
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((sum(y_true) - sum(y_pred)) / sum(y_true))) * 100

print("MAE: ", metrics.mean_absolute_error(y_train, X_train['predict']))
print("MSE: ", metrics.mean_squared_error(y_train, X_train['predict']))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, X_train['predict'])))
print("MAPE: ", mean_absolute_percentage_error(y_train, X_train['predict']))
print("R2: ", metrics.r2_score(y_train, X_train['predict']))


MAE:  79.48159685784083
MSE:  11556.170674549494
RMSE:  107.4996310437831
MAPE:  6.033857813902521e-14
R2:  0.8947146195772225


In [20]:
print("MAE: ", metrics.mean_absolute_error(y_test, X_test['predict']))
print("MSE: ", metrics.mean_squared_error(y_test, X_test['predict']))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, X_test['predict'])))
print("MAPE: ", mean_absolute_percentage_error(y_test, X_test['predict']))
print("R2: ", metrics.r2_score(y_test, X_test['predict']))


MAE:  99.17774652309214
MSE:  26636.577496366965
RMSE:  163.20716129008238
MAPE:  3.1590670956092
R2:  0.860065737204647


## Exercise 1

Response the answers in 4-5 lines each, read the links you have along this document, or in the theory notebooks, or you can also search on the internet:

 - Which type of variables do we transform into dummies? Why do we do it?

 In this exercise, the target variable is `Weight`, and to predict this column we have some variables: Species, Length1, Length2, Length3,	Height and Width. All this variables are continuos unless `Species`, that is categorical. So we want to transform this categorical variables into numbers.

 In our case, we choose the 3 most common species and the ohter ones we collect in other variable. When we applied the dummies function, we will have 4 new columns: the 3 most common species and the others one. Each variable is converted in as many 0/1 variables as there are different values.

 We do it because we can't train the model with categorical variables, for machine learning models require numerical input.

 - Why is so important to divide our data into train and test datasets? Which is the purpose of doing it?

  When we train some model, we don't usually validate with the same data that we trained. The main problem that we can find training and validating the model with the same data is overfitting the model (overfitting).
  This is because our model could be very focused in our data, and if we introduce other type of data may not be powerful in predicting future observations. Therefore, this observations are unknown, so the best option is to validate out model with outside observations. So the main porpose is to having better results in our model,  testing it with unknown data.

  Usually, we divided our model into two datasets: train and test, with a proportion of 80% and 20% respectively.

 - Why do we standardize some varaiables? Which type of variables do we standardize?

  Usually, our data has been buildt for a particular problem statement and is built from various sources. It has different features, and each feature has different scales.

  In order for our model to work well, it is very necessary for the data to have the same scale in terms of the Feature to avoid bias in the outcome.

  The types of variables we standardize are the continuous variables centered in 0 and with a standard desviation 1.

  First we standardize the training dataset, and then the test with the training paramentes.

  We do not standardize the target variable.

## Exercise 2

Regarding the summary and the errors, Would you use this model to predict the weights of the fishes? Justify your answer. Comment the usefulness of the main indicators of the summary and the errors.

Firstly, we have used the Ordinary Least Squares function from the statsmodels library, to print `OLS Regression Results`. In simple terms, it checks how much the actual data points differ from the line that represents the best prediction, helping us see how much error there is in our predictions.

This table starts saying us which is our Dependant Variable that is Weight and, then the model OLS (Ordinary Least Squares) and the Date and Time we’ve created the Model. Then, it says the number of observations we have used to train this model (127). Otherwise, it says DF Residuals (degrees of freedom residuals), which is calculated:

$ \text{Df}_{\text{Residuals}} = \text{No}_{\text{Observations}} - \text{Df}_{\text{Model}} - 1 $

And then our model has a non robust covariance type. Talking about, the R-squared. R-squared (R²) tells us how much our model explains the differences in fish weights. An R² of 0.895 means our model accounts for a large part of these differences, which is good.

Adjusted R-squared (Adj. R²) is like R-squared but adjusts for the number of things we're predicting. An Adj. R² of 0.889 means our model still works well even when considering multiple predictors.

The F-statistic checks if our whole model is useful. A high F-statistic (144.5) with a tiny p-value (4.55e-55) means our model is very useful and better than no predictors.

Coefficients show how each predictor affects fish weight. Length1, Length2, Bream, Others, and Perch have strong effects. However, Length3, Height, and Width might not matter much because their p-values are high.

Now, let's look at the evaluation metrics for both the training and test datasets:

**Training Metrics:**

*   MAE (Mean Absolute Error): 79.48
*   MSE (Mean Squared Error): 11556.17
*   RMSE (Root Mean Squared Error): 107.50
*   MAPE (Mean Absolute Percentage Error): Very close to zero
*   R² (R-squared): 0.895

**Test Metrics:**

*   MAE: 99.18
*   MSE: 26636.58
*   RMSE: 163.21
*   MAPE: 3.16
*   R²: 0.860

Here's the analysis:

The training R² is high (0.895), indicating that the model fits the training data well.
The test R² is also reasonably high (0.860), suggesting that the model generalizes well to unseen data.
Both training and test MAE values are relatively low, indicating that the model's predictions are close to the actual weights on average.
The MAPE on both datasets is low, indicating low percentage errors in predictions.


Overall, this model appears to be a good predictor of fish weights, given the high R-squared values and low error metrics. However, it's essential to keep in mind that this assessment is based on the available data, and the model's performance should be monitored and validated with new data to ensure it continues to perform well in a real-world setting. Additionally, you may want to investigate further why some predictors (e.g., Length3, Height, and Width) are not statistically significant and consider whether they should be included in the model.