<div style="max-width:66ch;">

# Scikit-learn exercises 

These are introductory exercises in machine learning with a focus on **scikit-learn** and **linear regression**.

<p class = "alert alert-info" role="alert"><b>Note</b> that you may not always get exactly the same answers as I do, but that doesn't necessarily mean yours are wrong. We might have used different parameters or randomization. Also, very importantly, there won't be any answer sheets in the future, so use your skills in data analysis, mathematics and statistics to back up your work.</p>

<p class = "alert alert-info" role="alert"><b>Note</b> that in cases when you start to repeat code, try not to. Create functions to reuse code instead. </p>

<p class = "alert alert-info" role="alert"><b>Remember</b> to use <b>descriptive variable, function, index</b> and <b>column names</b> to get readable code.</p>

The number of stars (\*), (\*\*), (\*\*\*) denotes the difficulty level of the task.

---

</div>

<div style="max-width:66ch;">

## 0. EDA (*)

Throughout this exercise we will work with the "mpg" dataset from seaborn. Start by loading the "mpg" dataset with the ```load_dataset``` function in the seaborn module. The goal is to use linear regression to predict mpg (miles per gallon).

&nbsp; a) Start by doing some initial EDA such as info(), describe() and figure out what you want to do with the missing values.

&nbsp; b) Use describe only on the columns for which summary statistics are meaningful.

&nbsp; c) Make some plots on some of the columns that you find interesting.

&nbsp; d) Check if there are any columns you might want to drop.

</div>
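One common way to handle the missing values in a) is to fill them with the median of a related group. The sketch below assumes a made-up miniature DataFrame (the `cars` frame and its values are hypothetical, not the real mpg data) and fills missing `horsepower` with the median of cars that have the same number of cylinders.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the mpg data (values are made up)
cars = pd.DataFrame({
    "cylinders": [4, 4, 4, 6, 6, 6],
    "horsepower": [70.0, np.nan, 90.0, 95.0, 105.0, np.nan],
})

# Count missing values per column before imputing
missing_per_column = cars.isnull().sum()

# Median horsepower within each cylinder group, aligned to the original rows
group_median = cars.groupby("cylinders")["horsepower"].transform("median")

# Fill only the missing entries with the median of their own group
cars["horsepower"] = cars["horsepower"].fillna(group_median)

print(missing_per_column)
print(cars)
```

Using `transform("median")` keeps the result aligned with the original index, so the same line works for any number of groups without hard-coding the median values.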


In [29]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt 
import numpy as np

df=sns.load_dataset("mpg")




In [30]:
# summary statistics for 4-cylinder cars; the '50%' row holds the medians
stat_4_cyl=df[df['cylinders']==4].describe()
hp4cyl=stat_4_cyl[stat_4_cyl.index=='50%']['horsepower']

# same summary for 6-cylinder cars
stat_6_cyl = df[df['cylinders']==6].describe()
hp6cyl=stat_6_cyl[stat_6_cyl.index=='50%']['horsepower']

# compare the median horsepower of the two groups
print(f"{hp4cyl}  vs \n {hp6cyl}")


50%    78.0
Name: horsepower, dtype: float64  vs 
 50%    100.0
Name: horsepower, dtype: float64


In [31]:
# rows with missing horsepower, split by cylinder count
missing_hp_4cyl=(df['cylinders']==4) & (df['horsepower'].isnull())
missing_hp_6cyl=(df['cylinders']==6) & (df['horsepower'].isnull())

# impute with the group medians found above (78 for 4 cylinders, 100 for 6)
df.loc[missing_hp_6cyl,'horsepower']=df.loc[missing_hp_6cyl,'horsepower'].fillna(100)
df.loc[missing_hp_4cyl,'horsepower']=df.loc[missing_hp_4cyl,'horsepower'].fillna(78)

# check that no missing horsepower values remain (an empty frame is expected)
df[df['horsepower'].isnull()]



Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


<div style="max-width:66ch;">

---

## 1. Train|test split (*)

We want to predict "mpg". Split the data into X and y, and perform a train|test split using scikit-learn. Choose a test_size of 0.2 and random_state 42. Check the shapes of X_train, X_test, y_train and y_test.

</div>
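As a minimal sketch of the split and the shape check, the example below uses a synthetic DataFrame (the `toy` frame, its columns and coefficients are assumptions made for illustration) in place of the mpg data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the mpg data (made-up values, 100 rows)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "weight": rng.uniform(1500, 5000, size=100),
    "horsepower": rng.uniform(50, 200, size=100),
})
toy["mpg"] = 45 - 0.006 * toy["weight"] - 0.03 * toy["horsepower"] + rng.normal(0, 1, size=100)

# Features are every column except the target
X, y = toy.drop("mpg", axis="columns"), toy["mpg"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With test_size=0.2, 20 of the 100 rows end up in the test set
for name, part in [("X_train", X_train), ("X_test", X_test),
                   ("y_train", y_train), ("y_test", y_test)]:
    print(name, part.shape)
```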


In [33]:
from sklearn.model_selection import train_test_split

# "name" is a free-text label for each car, so drop it before modelling
df=df.drop("name", axis="columns")

# keep "origin" in y for now so that it follows the same split as "mpg"
X,y = df.drop("mpg", axis="columns"), df[["mpg",'origin']]

X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size=0.2, random_state=42)

# numeric features and the regression target, with "origin" removed
X_train=X1_train.drop('origin', axis='columns')
y_train=y1_train.drop('origin', axis='columns')
X_test=X1_test.drop('origin', axis='columns')
y_test=y1_test.drop('origin',axis='columns')

# single features, used later for the polynomial models
disp_train=X1_train['displacement']
acc_train=X1_train['acceleration']
disp_test=X1_test['displacement']

# "origin" labels, used later as the target of the classifier
ys_train=y1_train['origin']
ys_test=y1_test['origin']

from sklearn.preprocessing import MinMaxScaler

# instantiate a scaler from the MinMaxScaler class
scaler=MinMaxScaler()
# fit finds the min and max of each column in X_train and stores them
scaler.fit(X_train)
scaler

In [34]:
scaled_X_train= scaler.transform(X_train)
# transform X_test with the min and max learned from X_train
scaled_X_test = scaler.transform(X_test)

#scaled_disp_train = scaler.transform(disp_train)
#scaled_acc_train = scaler.transform(acc_train)

print(f"{scaled_X_train.shape =}")
print(f"{scaled_X_train.min() =}")
print(f"{scaled_X_train.max() =}")

scaled_X_train.shape =(318, 6)
scaled_X_train.min() =0.0
scaled_X_train.max() =1.0


In [35]:
print(f"{scaled_X_test.shape =}")
print(f"{scaled_X_test.min() =}")
print(f"{scaled_X_test.max() =}")


scaled_X_test.shape =(80, 6)
scaled_X_test.min() =0.0
scaled_X_test.max() =1.0279329608938548
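The test maximum above is slightly larger than 1 because the scaler only knows the minimum and maximum of the training data. A small sketch with made-up numbers (the arrays below are assumptions for illustration) shows how test values outside the training range land outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data: the test set goes beyond the training range
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [33.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learns min=10 and max=30 from the training data only

scaled_train = scaler.transform(train)  # (x - 10) / 20 -> 0.0, 0.5, 1.0
scaled_test = scaler.transform(test)    # (x - 10) / 20 -> -0.25, 0.75, 1.15

print(scaled_train.ravel())
print(scaled_test.ravel())
```

This is expected behaviour: fitting the scaler on the test data as well would leak information from the test set into the preprocessing.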


<div style="max-width:66ch;">

---

## 2. Function for evaluation (*)

Create a function for training a regression model, predicting, and computing the metrics MAE, MSE and RMSE. It should take the parameters X_train, X_test, y_train, y_test and model. Now create a linear regression model using scikit-learn's ```LinearRegression()``` (OLS normal equation with SVD) and call your function to get the metrics.

</div>


In [36]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

def trainmodel(X_tr, X_te, y_tr, y_te, model):
    # fit on the training data, then predict on the test data
    model.fit(X_tr,y_tr)
    y_pred=model.predict(X_te)
    MAE=mean_absolute_error(y_te, y_pred)
    MSE=mean_squared_error(y_te,y_pred)
    RMSE=np.sqrt(MSE)
    print(f" MAE={MAE}, MSE={MSE}, RMSE={RMSE}")


trainmodel(scaled_X_train,scaled_X_test,y_train,y_test,LinearRegression())

 MAE=2.4669745391493647, MSE=9.44152909449315, RMSE=3.0727071280050673
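To see what the three metrics measure, they can be computed by hand and compared with scikit-learn. The sketch below uses made-up values for `y_true` and `y_pred` (assumptions for illustration): MAE is the mean of the absolute errors, MSE the mean of the squared errors, and RMSE the square root of MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred          # 0.5, 0.0, -2.0, -1.0

mae = np.mean(np.abs(errors))     # (0.5 + 0 + 2 + 1) / 4 = 0.875
mse = np.mean(errors ** 2)        # (0.25 + 0 + 4 + 1) / 4 = 1.3125
rmse = np.sqrt(mse)               # same unit as the target

print(mae, mean_absolute_error(y_true, y_pred))
print(mse, mean_squared_error(y_true, y_pred))
print(rmse)
```

Because the errors are squared, MSE and RMSE penalize the single large error (2.0) more heavily than MAE does.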


In [44]:
from sklearn.linear_model import LinearRegression

# instantiate a model instance from LinearRegression class
model = LinearRegression()
model.fit(scaled_X_train, y_train)
print(f"{model.intercept_ =}")
print(f"{model.coef_ =}")

# unscaled
model2 = LinearRegression()
model2.fit(X_train, y_train)
print(f"{model2.intercept_ =}")
print(f"{model2.coef_ =}")

# stochastic gradient descent (here a classifier that predicts origin)
from sklearn.linear_model import SGDClassifier
model3 = SGDClassifier(loss="hinge", penalty="l2", max_iter=1000)
model3.fit(scaled_X_train, ys_train)

print(f"{model3.intercept_ =}")
print(f"{model3.coef_ =}")

# Polynomial linear regression degree 1
from sklearn.preprocessing import PolynomialFeatures
# poly = PolynomialFeatures(degree=2, include_bias=False)
# poly_features = poly.fit_transform(x.reshape(-1, 1))
# poly_reg_model = LinearRegression()
# poly_reg_model.fit(poly_features, y)
# y_predicted = poly_reg_model.predict(poly_features)

model4 = PolynomialFeatures(degree=1, include_bias=False)
model4_features = model4.fit_transform(np.array(disp_train).reshape(-1,1))
model4_reg_mod = LinearRegression()
model4_reg_mod.fit(model4_features,y_train)

# Polynomial linear regression degree 2
model5 = PolynomialFeatures(degree=2, include_bias=False)
model5_features = model5.fit_transform(np.array(disp_train).reshape(-1,1))
model5_reg_mod =LinearRegression()
model5_reg_mod.fit(model5_features, y_train)


# Polynomial linear regression degree 3
model6 = PolynomialFeatures(degree=3, include_bias=False)
model6_features=model6.fit_transform(np.array(disp_train).reshape(-1,1))
model6_reg_mod=LinearRegression()
model6_reg_mod.fit(model6_features,y_train)

# for degree in range(1,100):
#     model_poly=PolynomialFeatures(degree, include_bias=False)
#     # transform the feature space
#     train_features= model_poly.fit_transform(X_train)
#     val_features= model_poly.fit_transform(X_val)

#     model_linear_regression = LinearRegression()
#     model_linear_regression.fit(train_features,y_train)
#     # predict on validation data
#     y_pred_val=model_linear_regression.predict(val_features)
#     RMSE_val.append(np.sqrt(mean_squared_error(y_val,y_pred_val)))




model.intercept_ =array([27.11315166])
model.coef_ =array([[  0.34335647,   0.61747941,   0.54045672, -24.85179699,
          1.36486055,   9.61895667]])
model2.intercept_ =array([-18.73531059])
model2.coef_ =array([[ 0.06867129,  0.00159555,  0.00301931, -0.00704616,  0.0812417 ,
         0.80157972]])
model3.intercept_ =array([ 0.38049523, -4.05513114, -0.13504928])
model3.coef_ =array([[  0.78385264, -18.71439728,  -6.71456634,  14.96883693,
         -2.09571712,  -1.95963159],
       [  4.07363886, -18.61315309,  14.78379035,  -9.73609548,
          2.11236081,   1.56678418],
       [ -5.74825266,  30.4164867 ,  -4.76223319,  -4.29411379,
         -0.2955    ,   0.87094737]])


<div style="max-width:66ch;">

---

## 3. Compare models (*)

Create the following models 
- Linear regression (SVD)
- Linear regression (SVD) with scaled data (feature standardization)
- Stochastic gradient descent with scaled data (feature standardization)
- Polynomial linear regression with degree 1
- Polynomial linear regression with degree 2
- Polynomial linear regression with degree 3

Make a DataFrame with evaluation metrics and model. Which model performed overall best?

---
</div>



In [46]:
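import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score

def evaluate(y_te,y_pred):
    # return [MAE, MSE, RMSE] for the true and predicted values
    MAE=mean_absolute_error(y_te, y_pred)
    MSE=mean_squared_error(y_te,y_pred)
    RMSE=np.sqrt(MSE)
    return [MAE, MSE, RMSE]

df_eval=pd.DataFrame()
arr1=evaluate(y_test,model.predict(scaled_X_test))
# model2 was fit on the unscaled X_train but is given scaled_X_test here,
# which explains the large errors in the "linear" column
arr2=evaluate(y_test, model2.predict(scaled_X_test))
# model3 is a classifier for origin, so it is scored with accuracy instead
ys_pred=model3.predict(scaled_X_test)
ACC=accuracy_score(ys_test, ys_pred)
# the polynomial transformers are already fitted, so only transform the test data
model5_features_test = model5.transform(np.array(disp_test).reshape(-1,1))
model6_features_test = model6.transform(np.array(disp_test).reshape(-1,1))

# degree 1 without bias leaves the feature unchanged, so disp_test is used directly
arr3=evaluate(y_test, model4_reg_mod.predict(np.array(disp_test).reshape(-1,1)))
arr4=evaluate(y_test, model5_reg_mod.predict(model5_features_test))
arr5=evaluate(y_test, model6_reg_mod.predict(model6_features_test))
# rows of df_eval are MAE, MSE, RMSE in that order
df_eval['linear scaled']=arr1
df_eval['linear']=arr2
df_eval['poly1']=arr3
df_eval['poly2']=arr4
# poly3 is assigned arr4, so this column repeats the degree 2 metrics;
# the degree 3 metrics are in arr5
df_eval['poly3']=arr4

df_eval 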



Unnamed: 0,linear scaled,linear,poly1,poly2,poly3
0,2.466975,41.432113,3.39496,2.980251,2.980251
1,9.441529,1768.427403,18.102544,15.107354,15.107354
2,3.072707,42.052674,4.254708,3.886818,3.886818
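One way to keep each model paired with its own preprocessing is to wrap both in a pipeline, so that scaling and polynomial expansion are always applied consistently to the train and test data. The sketch below is an assumption-laden illustration, not the solution above: it uses synthetic data, `StandardScaler` for the standardization named in the task, and `SGDRegressor` as the regression variant of stochastic gradient descent.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic term (made up for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 + 2 * X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def poly_model(degree):
    # Polynomial feature expansion followed by ordinary least squares
    return make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False), LinearRegression()
    )

models = {
    "linear": LinearRegression(),
    "linear scaled": make_pipeline(StandardScaler(), LinearRegression()),
    "sgd scaled": make_pipeline(StandardScaler(), SGDRegressor(random_state=42)),
    "poly1": poly_model(1),
    "poly2": poly_model(2),
    "poly3": poly_model(3),
}

def evaluate(model):
    # Fit on the training split and score on the test split
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {"MAE": mean_absolute_error(y_test, y_pred), "MSE": mse, "RMSE": np.sqrt(mse)}

# One column per model, one row per metric
df_eval = pd.DataFrame({name: evaluate(model) for name, model in models.items()})
print(df_eval.round(3))
```

In this sketch the "linear" and "linear scaled" columns agree, since ordinary least squares with an intercept gives the same predictions whether or not the features are rescaled.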


<div style="max-width:66ch;">


## 4. Further explorations (**)

Feel free to further explore the dataset, for example you could choose to 
- drop different columns
- find out feature importance in polynomial models
- fine-tune a specific model further by exploring its hyperparameters (check the documentation for which parameters can be changed)

---

</div>
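As one possible starting point for the hyperparameter exploration, the polynomial degree can be treated as a hyperparameter and chosen with cross-validation. The sketch below assumes synthetic one-feature data and the step names `poly` and `regression`, all chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data generated from a quadratic function plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(150, 1))
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 0] ** 2 + rng.normal(0, 0.5, size=150)

pipeline = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("regression", LinearRegression()),
])

# Try several polynomial degrees with 5-fold cross-validation
search = GridSearchCV(
    pipeline,
    param_grid={"poly__degree": [1, 2, 3, 4, 5]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated RMSE of the best degree
```

The score is negated because scikit-learn maximizes scores, so error metrics are exposed with a `neg_` prefix.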



<div style="background-color: #FFF; color: #212121; border-radius: 20px; width:25ch; box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px; display: flex; justify-content: center; align-items: center;">
<div style="padding: 1em; width: 60%;">
    <h2 style="font-size: 1.2rem;">Kokchun Giang</h2>
    <a href="https://www.linkedin.com/in/kokchungiang/" target="_blank" style="display: flex; align-items: center; gap: .4em; color:#0A66C2;">
        <img src="https://content.linkedin.com/content/dam/me/business/en-us/amp/brand-site/v2/bg/LI-Bug.svg.original.svg" width="20"> 
        LinkedIn profile
    </a>
    <a href="https://github.com/kokchun/Portfolio-Kokchun-Giang" target="_blank" style="display: flex; align-items: center; gap: .4em; margin: 1em 0; color:#0A66C2;">
        <img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width="20"> 
        Github portfolio
    </a>
    <span>AIgineer AB</span>
</div>
</div>