# Lab 07 - Error-based learning

This week we are going to focus on linear regression and logistic regression.

#### Importing libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import classification_report, mean_squared_error
from sklearn.model_selection import train_test_split

## Linear regression

Before starting, let's load the Boston dataset from sklearn. Note that `load_boston` was removed in scikit-learn 1.2, so this cell needs an older release.

In [2]:
data = datasets.load_boston()
data_df = pd.DataFrame(data.data, columns=data.feature_names)
target_df = pd.DataFrame(data.target, columns=['Target'])

data_df.head()

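|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |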


In [3]:
target_df.head()

|   | Target |
|---|---|
| 0 | 24.0 |
| 1 | 21.6 |
| 2 | 34.7 |
| 3 | 33.4 |
| 4 | 36.2 |


The Boston dataset has 13 descriptive features, all of them numerical, so we can feed them to our
algorithm as they are. Let's split the dataset into training and test sets.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(data_df, target_df)
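
Called without extra arguments, `train_test_split` shuffles the rows randomly and holds out 25% of them for testing, so the numbers further down change from run to run. The sketch below shows one way to make a split repeatable. The toy arrays `X_demo` and `y_demo` and the seed value are assumptions made for illustration, not part of the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for a feature table and target: 20 rows, 3 features (hypothetical data).
X_demo = np.arange(60).reshape(20, 3)
y_demo = np.arange(20)

# test_size=0.25 is the default ratio; random_state fixes the shuffle so the
# same rows end up in the test set on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(X_tr.shape, X_te.shape)  # (15, 3) (5, 3)
```

Fixing `random_state` is mainly useful when you want to compare models on exactly the same test rows.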

As you saw earlier, the target feature this time is continuous, so we need a regressor rather than a classifier as our
model. Let's instantiate a linear regression model.

Linear regression models live in the sklearn.linear_model package, from which we already imported LinearRegression at the top of the notebook. Note that the `normalize` argument used below was removed in scikit-learn 1.2. You can find the documentation
[here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression).

In [5]:
lin_reg = LinearRegression(normalize=True)

It's time to train the model. Because every feature is already numerical, we can pass the data to the model directly.
Let's train our model and evaluate its performance.

In [6]:
lin_reg.fit(X=X_train, y=y_train)

LinearRegression(normalize=True)

In [7]:
pred = lin_reg.predict(X=X_test)
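
To make concrete what `fit` and `predict` do, here is a minimal sketch on a tiny hand-made dataset. The arrays `X_demo` and `y_demo` are hypothetical and built from a known linear rule. A fitted linear regression predicts a weighted sum of the features plus an intercept, so the prediction can be reproduced by hand from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 3*x1 - 2*x2 + 5 (no noise).
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

model = LinearRegression().fit(X_demo, y_demo)

# predict() is the dot product of features and weights, plus the intercept.
manual = X_demo @ model.coef_ + model.intercept_

print(model.coef_, model.intercept_)
print(np.allclose(manual, model.predict(X_demo)))
```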

We cannot use performance measures like accuracy, recall or f1-score here, because those metrics are defined for
classification tasks. There are a few metrics you can use to evaluate the performance of a regression task. One of the
simplest is the mean squared error, the average of the squared differences between the predicted and true values.

In [8]:
print("Mean squared error:", mean_squared_error(y_pred=pred, y_true=y_test))

Mean squared error: 19.66812408357505
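
The mean squared error is easy to compute by hand, which helps when interpreting the number. The sketch below uses four made-up predictions (the arrays are assumptions, not the lab's actual output) and also shows two related metrics: the root mean squared error, which is in the same units as the target, and the R² score from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions, for illustration only.
y_true_demo = np.array([24.0, 21.6, 34.7, 33.4])
y_pred_demo = np.array([25.0, 20.6, 32.7, 33.4])

# MSE: average of the squared errors.
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true=y_true_demo, y_pred=y_pred_demo)

# RMSE: square root of the MSE, expressed in the target's units.
rmse = np.sqrt(mse_manual)

# R^2: fraction of the target's variance explained by the predictions.
r2 = r2_score(y_true_demo, y_pred_demo)

print("MSE (manual): ", mse_manual)
print("MSE (sklearn):", mse_sklearn)
print("RMSE:", rmse)
print("R^2: ", r2)
```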


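You can inspect the trained coefficients of our model to get an idea of how each feature influences the prediction.
Each weight is the change in the predicted target per unit change of that feature, so a higher absolute value means a stronger effect per unit. Because the features are measured on different scales, compare the raw weights with care.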

In [9]:
corr_df = pd.DataFrame(data.feature_names, columns=['Features'])
corr_df['weight'] = lin_reg.coef_[0]
print(corr_df)

   Features     weight
0      CRIM  -0.126973
1        ZN   0.046281
2     INDUS   0.000489
3      CHAS   2.376636
4       NOX -18.926876
5        RM   3.580899
6       AGE   0.008522
7       DIS  -1.510305
8       RAD   0.325122
9       TAX  -0.011137
10  PTRATIO  -1.027901
11        B   0.008949
12    LSTAT  -0.547230
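
Raw weights depend on the units of each feature, which is one reason a small-valued feature such as NOX can get a very large weight. The sketch below illustrates this on synthetic data, which is an assumption made purely for illustration. Two features contribute equally to the target but live on very different scales. Standardizing the features first, here with `StandardScaler`, puts the weights on a comparable footing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two synthetic features on very different scales.
small_scale = rng.normal(0.0, 0.1, n)
large_scale = rng.normal(0.0, 100.0, n)
X_demo = np.column_stack([small_scale, large_scale])

# Each term below has a standard deviation of about 1,
# so both features matter equally for y.
y_demo = 10.0 * small_scale + 0.01 * large_scale + rng.normal(0.0, 0.1, n)

raw = LinearRegression().fit(X_demo, y_demo)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X_demo), y_demo)

print("raw weights:         ", raw.coef_)     # roughly [10, 0.01]
print("standardized weights:", scaled.coef_)  # roughly [1, 1]
```

The raw weights differ by a factor of about a thousand, while the standardized weights are close to each other, matching the equal contribution built into the data.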


## Logistic Regression

Next, we will perform a classification task using error-based learning. Let's first import a dataset for the classification task.

In [10]:
heart_df = pd.read_csv('Heart.csv')

heart_df.head()

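|   | Unnamed: 0 | Age | Sex | ChestPain | RestBP | Chol | Fbs | RestECG | MaxHR | ExAng | Oldpeak | Slope | Ca | Thal | AHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63 | 1 | typical | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | fixed | No |
| 1 | 2 | 67 | 1 | asymptomatic | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | normal | Yes |
| 2 | 3 | 67 | 1 | asymptomatic | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | reversable | Yes |
| 3 | 4 | 37 | 1 | nonanginal | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | normal | No |
| 4 | 5 | 41 | 0 | nontypical | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | normal | No |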


In [11]:
print(heart_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  303 non-null    int64  
 1   Age         303 non-null    int64  
 2   Sex         303 non-null    int64  
 3   ChestPain   303 non-null    object 
 4   RestBP      303 non-null    int64  
 5   Chol        303 non-null    int64  
 6   Fbs         303 non-null    int64  
 7   RestECG     303 non-null    int64  
 8   MaxHR       303 non-null    int64  
 9   ExAng       303 non-null    int64  
 10  Oldpeak     303 non-null    float64
 11  Slope       303 non-null    int64  
 12  Ca          299 non-null    float64
 13  Thal        301 non-null    object 
 14  AHD         303 non-null    object 
dtypes: float64(2), int64(10), object(3)
memory usage: 35.6+ KB
None


The Ca and Thal columns have a few null values. Let's fill them with the most frequent value of each column.

In [12]:
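# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to heart_df.
heart_df['Ca'] = heart_df['Ca'].fillna(heart_df['Ca'].mode(dropna=True)[0])
heart_df['Thal'] = heart_df['Thal'].fillna(heart_df['Thal'].mode(dropna=True)[0])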
print(heart_df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  303 non-null    int64  
 1   Age         303 non-null    int64  
 2   Sex         303 non-null    int64  
 3   ChestPain   303 non-null    object 
 4   RestBP      303 non-null    int64  
 5   Chol        303 non-null    int64  
 6   Fbs         303 non-null    int64  
 7   RestECG     303 non-null    int64  
 8   MaxHR       303 non-null    int64  
 9   ExAng       303 non-null    int64  
 10  Oldpeak     303 non-null    float64
 11  Slope       303 non-null    int64  
 12  Ca          303 non-null    float64
 13  Thal        303 non-null    object 
 14  AHD         303 non-null    object 
dtypes: float64(2), int64(10), object(3)
memory usage: 35.6+ KB
None
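
Filling with the mode means replacing each missing entry with the most frequent value of its column. A minimal sketch on a toy frame (the values are made up, and only the column names mirror the heart data) shows the effect for one numeric and one text column.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (hypothetical values).
toy = pd.DataFrame({
    "Ca": [0.0, 3.0, np.nan, 0.0],
    "Thal": ["normal", None, "normal", "fixed"],
})

for col in ["Ca", "Thal"]:
    # mode() returns a Series of the most frequent values; take the first one.
    most_common = toy[col].mode(dropna=True)[0]
    toy[col] = toy[col].fillna(most_common)

print(toy)
```

Mode imputation is a simple default. For a numeric column, the median or mean is a common alternative.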


You can see that the dataset contains some non-ordinal categorical features (the object columns ChestPain and Thal). We have to
encode them as numbers before we can use them in a linear model.

You can use the get_dummies function in pandas to one-hot encode a categorical feature.

In [13]:
X = heart_df.drop(['AHD', 'Unnamed: 0'], axis=1)
y = heart_df['AHD']
df_encoded = pd.get_dummies(X)

df_encoded.head()

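|   | Age | Sex | RestBP | Chol | Fbs | RestECG | MaxHR | ExAng | Oldpeak | Slope | Ca | ChestPain_asymptomatic | ChestPain_nonanginal | ChestPain_nontypical | ChestPain_typical | Thal_fixed | Thal_normal | Thal_reversable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 67 | 1 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 67 | 1 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 37 | 1 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 41 | 0 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |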
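
One-hot encoding replaces a categorical column with one indicator column per category, and exactly one indicator is set in each row. The sketch below shows this on a three-row toy frame, which is an assumption for illustration. The `dtype=int` argument is used so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (hypothetical rows).
toy = pd.DataFrame({
    "Age": [63, 67, 37],
    "ChestPain": ["typical", "asymptomatic", "nonanginal"],
})

# Only the listed column is encoded; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["ChestPain"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The new columns are named `<column>_<category>`, which is where names such as `ChestPain_typical` in the table above come from.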


To perform a classification task using error-based learning, we use the logistic regression algorithm. You can find the documentation of LogisticRegression
 [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression).

In [14]:
log_reg = LogisticRegression(max_iter=10000)
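
Logistic regression reuses the weighted sum from linear regression but passes it through the logistic (sigmoid) function, which maps any score to a probability between 0 and 1. The following is a minimal sketch of that idea in plain NumPy. The helper `sigmoid` and the example scores are assumptions for illustration, not scikit-learn internals.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (weights . features + intercept) for three patients.
scores = np.array([-2.0, 0.0, 2.0])
probs = sigmoid(scores)

# Predict the positive class when the probability exceeds 0.5.
labels = np.where(probs > 0.5, "Yes", "No")

print(probs)
print(labels)
```

A score of 0 gives a probability of exactly 0.5, which is the decision boundary between the two classes.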

Let's split our encoded dataset, fit the logistic regression model on the training part, and measure the performance.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(df_encoded, y)

In [16]:
log_reg.fit(X_train, y_train)
pred = log_reg.predict(X_test)

print(classification_report(y_pred=pred, y_true=y_test))

              precision    recall  f1-score   support

          No       0.81      0.83      0.82        42
         Yes       0.79      0.76      0.78        34

    accuracy                           0.80        76
   macro avg       0.80      0.80      0.80        76
weighted avg       0.80      0.80      0.80        76
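
The precision and recall values in a classification report come from the confusion matrix. The sketch below recomputes them by hand for the positive class. It uses a synthetic dataset from `make_classification` as a stand-in (an assumption, since Heart.csv is not bundled here), and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)
pred_demo = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true=y_te, y_pred=pred_demo)

# Precision: of everything predicted as class 1, how much really is class 1.
precision_manual = cm[1, 1] / cm[:, 1].sum()
# Recall: of everything that really is class 1, how much was found.
recall_manual = cm[1, 1] / cm[1, :].sum()

print(cm)
print(precision_manual, precision_score(y_te, pred_demo))
print(recall_manual, recall_score(y_te, pred_demo))
```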



### Task
* Using the given Heart.csv, train a linear regression model to predict the 'Chol' level of a patient. Use the
  'AHD' values as a descriptive feature.
  1. Read the documentation and identify the hyperparameters of the linear regression algorithm. Using
     hyperparameter tuning, identify the set of parameters that works best for your model.

* Using the dataset you cleaned in labs 02 and 03,
  1. First,
     * Train a logistic regression model and measure its performance.
     * Then encode all the non-ordinal categorical variables using one-hot encoding and train another logistic
       regression model.

     Compare the results of the two scenarios.

  2. Read the documentation and identify the hyperparameters of the logistic regression algorithm. Using hyperparameter
     tuning, identify the set of parameters that works best for your model.
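
One common way to do hyperparameter tuning in scikit-learn is a cross-validated grid search with `GridSearchCV`. The sketch below only shows the mechanics. The synthetic data, the choice of the `C` parameter and the candidate values are all assumptions, so pick the parameters and ranges for your own model from the documentation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with your own features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values for the inverse regularization strength C (illustrative only).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid,
    cv=5,                # 5-fold cross-validation for each candidate
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```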