# Recap: Machine Learning with Linear Regression

At the end of this lesson, the student will remember the main steps to train a model:

 - Split the dataset into train and test subsets
 - Standardize continuous variables
 - Transform categorical variables into dummies
 - Train linear regression models
 - Train classification models
 - Interpret the error and accuracy metrics to validate the built models

**You have two exercises at the end of the notebook**

### Import data and libraries

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn import metrics

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [2]:
df = pd.read_csv('../data/Fish.csv')  # forward slashes avoid invalid escape sequences such as '\d'
df

In [None]:
df.info()

### Species variable treatment

Species is a categorical variable, so we need to transform it into dummy variables before passing it to the model.

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
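As a minimal sketch of what dummy encoding produces, the example below uses a hypothetical four-row table (`toy` is invented for illustration and does not come from Fish.csv). Each category becomes its own 0/1 column, and every row has exactly one column set to 1.

```python
import pandas as pd

# Hypothetical toy data, not rows from Fish.csv
toy = pd.DataFrame({"fish_species": ["Perch", "Bream", "Others", "Perch"]})

# One indicator column per category; dtype=int yields 0/1 instead of True/False
toy_dummies = pd.get_dummies(toy["fish_species"], dtype=int)
print(toy_dummies)

# Each row belongs to exactly one category, so every row sums to 1
print(toy_dummies.sum(axis=1).tolist())
```

The columns are created in alphabetical order of the category names, which is worth remembering when selecting them later by name.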


In [None]:
df.Species.value_counts()

First, let's reduce the categories to Perch, Bream and Others.

In [None]:
def fish_species(x):
    if x == 'Perch':
        return 'Perch'
    elif x == 'Bream':
        return 'Bream'
    else:
        return 'Others'

df['fish_species'] = df['Species'].apply(fish_species)
df

### Get dummies

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

https://stats.stackexchange.com/questions/350492/why-do-we-create-dummy-variables

https://towardsdatascience.com/what-are-dummy-variables-and-how-to-use-them-in-a-regression-model-ee43640d573e

In [None]:
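# dtype=int gives 0/1 columns; recent pandas versions default to True/False,
# which statsmodels may not accept when mixed with numeric columns
df_dum = pd.get_dummies(df.fish_species, dtype=int)
df = df.merge(df_dum, right_index=True, left_index=True, how='left')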


In [None]:
df.drop(['Species','fish_species'], axis = 1, inplace = True)
df.columns

### Train test split

We must randomly divide the dataset into two parts: a training split to fit the model and a test split to validate it.

If the error metrics on the test split are low and close to those on the training split, the model generalizes well to data it has not seen.


https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
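A minimal sketch of how `train_test_split` behaves, using a hypothetical ten-row DataFrame (`toy` is invented for illustration). With `test_size=0.2`, two rows go to the test subset and eight to the training subset, and no row appears in both.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 columns
toy = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2.0})

# random_state fixes the shuffle so the split is reproducible
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

print(toy_train.shape, toy_test.shape)  # (8, 2) (2, 2)

# The two subsets share no rows
shared_rows = set(toy_train.index) & set(toy_test.index)
print(shared_rows)  # set()
```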


In [None]:
fish_train, fish_test = train_test_split(df, test_size=0.2, random_state=0)

In [None]:
print(fish_train.shape)
print(fish_test.shape)

### Standardize the numerical variables

Sometimes the numerical variables in our dataset have very different scales, that is, the values in one column are much larger or smaller than those in another. This can harm the accuracy of some models and makes the coefficients harder to compare.

To solve this, we standardize: we center every continuous variable at 0 and rescale it to a standard deviation of 1.

We **first** fit the scaler on the training set, then transform the test set with the training set parameters.

We do not standardize the target variable
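A minimal sketch of this procedure on hypothetical arrays (`train` and `test` are invented for illustration). Standardizing computes z = (x - mean) / std per column; the mean and standard deviation are learned from the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two columns on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_sc = scaler.fit_transform(train)  # learn mean and std on the training set only
test_sc = scaler.transform(test)        # reuse the training parameters on the test set

print(scaler.mean_)                                   # per-column training means
print(train_sc.mean(axis=0), train_sc.std(axis=0))    # approximately 0 and 1
print(test_sc)                                        # test values need not be centered at 0
```

Fitting the scaler on the test set as well would leak information about the test data into the preprocessing step, which is why only `transform` is called on it.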

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler


https://www.askpython.com/python/examples/standardize-data-in-python



In [None]:
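# Generic template written for a different dataset: X_train1, X_test1, y_train1 and
# y_test1 are not defined in this notebook, so this cell will not run as is.
# The next cell applies the same steps to the fish data.
scale = StandardScaler()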

variables_sc = ['duration_ms', 'loudness', 'tempo']

scale_fit = scale.fit(X_train1[variables_sc])

X_train_sc = pd.DataFrame(scale.transform(X_train1[variables_sc]), columns = variables_sc)

X_test_sc = pd.DataFrame(scale.transform(X_test1[variables_sc]), columns = variables_sc)

X_train_sc.shape

X_train1.drop(variables_sc, axis = 1, inplace = True)
X_train1 = X_train1.reset_index(drop = True)
y_train1 = y_train1.reset_index(drop=True)
X_train = pd.concat([X_train1, X_train_sc], axis = 1)

X_test1.drop(variables_sc, axis = 1, inplace = True)
X_test1 = X_test1.reset_index(drop = True)
y_test1 = y_test1.reset_index(drop=True)
X_test = pd.concat([X_test1, X_test_sc], axis = 1)

In [None]:
scale = StandardScaler()
variables_sc = ['Length1', 'Length2', 'Length3', 'Height', 'Width']
features = variables_sc + ['Bream', 'Others', 'Perch']

# Select features and target; reset the index so rows are numbered from 0
X_train = fish_train[features].reset_index(drop=True)
y_train = fish_train['Weight'].reset_index(drop=True)

X_test = fish_test[features].reset_index(drop=True)
y_test = fish_test['Weight'].reset_index(drop=True)

# Fit the scaler on the training set only, then apply the same parameters to both sets.
# Assigning in place keeps every column under its own name and in its original position.
scale.fit(X_train[variables_sc])
X_train[variables_sc] = scale.transform(X_train[variables_sc])
X_test[variables_sc] = scale.transform(X_test[variables_sc])

## Linear Regression

https://medium.com/swlh/interpreting-linear-regression-through-statsmodels-summary-4796d359035a

In [None]:
X_train = sm.add_constant(X_train)
result = sm.OLS(y_train, X_train).fit()

print(result.summary())

In [None]:
X_train['predict'] = result.predict(X_train)

X_test = sm.add_constant(X_test)
X_test['predict'] = result.predict(X_test)

In [None]:
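def mean_absolute_percentage_error(y_true, y_pred):
    # Mean of the per-row absolute percentage errors.
    # Rows whose true value is 0 are skipped to avoid dividing by zero.
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100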

print("MAE: ", metrics.mean_absolute_error(y_train, X_train['predict']))
print("MSE: ", metrics.mean_squared_error(y_train, X_train['predict']))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, X_train['predict'])))
print("MAPE: ", mean_absolute_percentage_error(y_train, X_train['predict']))
print("R2: ", metrics.r2_score(y_train, X_train['predict']))


In [None]:
print("MAE: ", metrics.mean_absolute_error(y_test, X_test['predict']))
print("MSE: ", metrics.mean_squared_error(y_test, X_test['predict']))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, X_test['predict'])))
print("MAPE: ", mean_absolute_percentage_error(y_test, X_test['predict']))
print("R2: ", metrics.r2_score(y_test, X_test['predict']))
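To make the metrics above concrete, here is a small worked example on hypothetical values (`y_true` and `y_pred` are invented for illustration). Each metric is computed by hand with NumPy and the first three are checked against scikit-learn.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted weights
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

errors = y_true - y_pred                        # [-10, 20, 0]
mae = np.mean(np.abs(errors))                   # (10 + 20 + 0) / 3 = 10
mse = np.mean(errors ** 2)                      # (100 + 400 + 0) / 3, about 166.67
rmse = np.sqrt(mse)                             # about 12.91, same units as the target
mape = np.mean(np.abs(errors / y_true)) * 100   # (10% + 10% + 0%) / 3, about 6.67
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, mape, r2)

# The hand-computed values agree with scikit-learn
print(np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, metrics.mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, metrics.r2_score(y_true, y_pred)))
```

MAE and RMSE are expressed in the units of the target, MAPE is relative to the size of each true value, and R2 compares the model against always predicting the mean.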


## Exercise 1

Answer each question in 4-5 lines. You can read the links provided throughout this document or in the theory notebooks, or search on the internet:

 - Which type of variables do we transform into dummies? Why do we do it?
 - Why is it so important to divide our data into train and test datasets? What is the purpose of doing it?
 - Why do we standardize some variables? Which type of variables do we standardize?

The variables we transform into dummies are the categorical ones. We do it to preserve the information of the original categories while avoiding errors.

The reason we split the data into train and test is to train on one portion of the data and then validate on the remaining portion. We do it to avoid overfitting.

We standardize variables to avoid errors when they are on different scales, which can lead to a less accurate prediction. We standardize numerical variables.

## Exercise 2

Based on the summary and the errors, would you use this model to predict the weight of the fish? Justify your answer. Comment on the usefulness of the main indicators in the summary and of the error metrics.


Yes, because according to the plots we can observe a linear relationship between the weight of the fish and the other variables.

The summary helps us view the model's data, the variable we want to predict and the covariance matrix of the error. This lets us observe the error of each method in order to select the best one.