# Supervised Learning
Supervised learning is a type of machine learning that trains models from __labeled training data__. It allows you to predict outputs for future or unseen data.

The two types of supervised learning are:
1. Regression
2. Classification

## 1. Regression - used when the target variable is continuous (numeric)
e.g. prediction of house prices, stock prices, height of a person, salary

__Algorithms:__

1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. Ridge Regression
5. Lasso Regression
6. ElasticNet Regression

## 2. Classification - used when the target variable is categorical
e.g. whether a transaction is fraudulent or not, whether a patient is diabetic or not, spam/not-spam email (2 classes), student grades (6 classes from A to F), movie ratings, or classifying a trip as long/medium/short distance.

Note: the number of classes depends on the scenario; it can be two or more.

__Algorithms:__
1. Logistic Regression
2. Decision trees
3. Support Vector Machines etc.
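
As a minimal sketch of binary classification, logistic regression can be trained and scored as below. The data comes from `make_classification`, a synthetic generator used here only for illustration; it is not a dataset from these notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data (an assumption for illustration only)
X_clf, y_clf = make_classification(n_samples=200, n_features=4, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X_clf, y_clf, test_size=0.2, random_state=0)

clf = LogisticRegression()
clf.fit(X_tr, y_tr)

# predict returns one class label (0 or 1) per test row
labels = clf.predict(X_te)
accuracy = clf.score(X_te, y_te)  # fraction of test rows classified correctly
print('Predicted labels:', labels[:10])
print('Accuracy:', accuracy)
```

The workflow is the same as for regression; only the estimator and the meaning of the output (a class label instead of a number) change.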

### Examples of supervised learning
1. Weather apps predict the weather at a given time based on historical weather data for a particular place.

2. Email filters sort messages into the inbox (normal/ham) or the junk folder (spam) based on past examples of spam.

3. Netflix/Amazon recommendations use what you liked and what other people who liked the same items also liked.

Note: In these notes, each model has exactly ONE target/output feature.


Classification predicts/identifies the class variable, while categorization organizes data.

Grouping similar data points is called clustering.


## Variable types

### Quantitative variables

- continuous (float)
- discrete (integer, e.g. number of kids)

### Qualitative variables
- categorical - nominal, ordinal, or binary
- nominal - no inherent order
- binary - true or false, yes or no
- ordinal - ordered categories, e.g. tall > medium > short

When the target feature is quantitative, use regression.

When the target feature is qualitative, use classification.
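
The variable types above can be sketched in pandas as follows. The column names and values are hypothetical toy data, chosen only to show one column per type.

```python
import pandas as pd

# Hypothetical toy data, one column per variable type
df_types = pd.DataFrame({
    'salary': [52000.5, 61000.0, 48000.75],  # quantitative, continuous
    'kids': [0, 2, 1],                        # quantitative, discrete
    'city': ['Lagos', 'Accra', 'Nairobi'],    # qualitative, nominal
    'smoker': [True, False, False],           # qualitative, binary
})

# Ordinal: categories with an explicit order short < medium < tall
df_types['height'] = pd.Categorical(
    ['tall', 'short', 'medium'],
    categories=['short', 'medium', 'tall'],
    ordered=True,
)

print(df_types.dtypes)
print(df_types['height'].min())  # ordered categories support min/max
```

Checking `dtypes` on the target column is a quick first hint about whether the problem is regression or classification, although a numeric column can still hold class codes.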

## Machine Learning model building steps

1. Import the right kind of data according to the problem statement.
2. Explore and visualize your data to gain insight into it (shape, describe, boxplot, graphs, info, etc.).
3. Find the relationship between the features and the target, i.e. how they are related.
4. Encode all the necessary categorical variables present in the dataset. __Most machine learning algorithms cannot work with object-type (text) data; they only work with numeric data, so you need to convert categorical data to numeric form. This process is called ENCODING.__
5. Identify the target variable and split the data into X (features) and y (target), i.e. isolate/remove your target variable from the features.
6. Split the original dataset into a train set and a test set, e.g. 80:20 or 70:30 (industry dependent), where 80:20 means 80% for training and 20% for testing. For example, with an original dataset of 1000 data points:
   - training - 800 data points
   - testing - 200 data points
7. Build your machine learning model with the training dataset (features, target).
8. Test your model with the features of the test dataset and observe the prediction output. You then compare the predicted output and the actual target of the test dataset.
9. Use various evaluation metrics for regression and classification to determine how well your model is performing.
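
The steps above can be sketched end to end as follows. The dataset is a hypothetical toy table made up for illustration, and `pd.get_dummies` is only one of several possible encoding approaches.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Step 1: hypothetical toy dataset (made up for illustration)
df = pd.DataFrame({
    'area': [50, 60, 80, 100, 120, 150, 70, 90, 110, 130],
    'city': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'price': [100, 130, 160, 210, 240, 310, 140, 190, 220, 270],
})

# Step 4: encode the categorical column as a numeric 0/1 column
df_encoded = pd.get_dummies(df, columns=['city'], drop_first=True, dtype=int)

# Step 5: separate the features (X) from the target (y)
X = df_encoded.drop('price', axis=1)
y = df_encoded['price']

# Step 6: 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 7: build the model on the training set
model = LinearRegression().fit(X_train, y_train)

# Step 8: predict on the test features
y_pred = model.predict(X_test)

# Step 9: evaluate with RMSE
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print('RMSE:', rmse)
```

Steps 2 and 3 (exploration and relationship checks) are left out here to keep the sketch short.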


ETL (Extract, Transform, Load) engineers prepare the data for data scientists.


# Linear Regression

Linear regression models the relationship between your features and the target variable as a straight line; the coefficients express the __strength of relationship__:

__y = mx + c__

m = regression coefficient which expresses the strength of relationship between the feature and the target

- y = target
- x = feature
- m = slope
- c = intercept

### 1. Simple linear regression:
For simple linear regression, you have one target variable and ONLY one feature

### 2. Multiple linear regression
You have more than one feature

__y = b1x1 + b2x2 + b3x3 +.......+ bnxn + c__

where b1, b2, b3, ...,bn are regression coefficients which express the strength of the relationship between x1, x2, x3...,xn respectively with y.
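
As a worked example of the formula, the prediction for one observation is the dot product of the coefficients and the feature values, plus the intercept. The numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical coefficients b1..b3 and intercept c
b = np.array([2.0, -1.0, 0.5])
c = 10.0

# One observation with features x1, x2, x3
x = np.array([3.0, 4.0, 6.0])

# y = b1*x1 + b2*x2 + b3*x3 + c
y = np.dot(b, x) + c
print(y)  # 2*3 - 1*4 + 0.5*6 + 10 = 15.0
```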

#### Evaluation metrics for regression
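- Root Mean Squared Error (RMSE) - square root of the mean of the squared errors (where error = actual - predicted)
- R2 Score - how well the model fits the data, i.e. the proportion of the variance in the target that the model explains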
        
### 3. Polynomial



## 1. Import the dataset

In [4]:
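# using one of the 7 datasets built into sklearn
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this import needs an older version
from sklearn.datasets import load_boston

boston = load_boston()  # call the loader and assign the result to a variable

# this is not a dataframe but a Bunch - an sklearn data type
type(boston)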

sklearn.utils.Bunch
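
On recent scikit-learn versions the `load_boston` import fails, because the loader was deprecated in 1.0 and removed in 1.2. The sketch below is one way to keep the rest of the workflow runnable. The fallback to `make_regression` is an assumption added for illustration: it produces synthetic data with the same shape (506 rows, 13 features), not real housing data, so the numbers will not match the outputs in these notes.

```python
# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so the import raises ImportError on recent versions.
try:
    from sklearn.datasets import load_boston
    data = load_boston()
    X_raw, y_raw = data.data, data.target
except ImportError:
    # Fallback: synthetic stand-in with the same shape.
    # These values are NOT real housing data.
    from sklearn.datasets import make_regression
    X_raw, y_raw = make_regression(
        n_samples=506, n_features=13, noise=10.0, random_state=0
    )

print(X_raw.shape, y_raw.shape)
```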

In [5]:
# check the keys available in the Bunch
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

In [6]:
# check the documentation for the dataset
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [7]:
boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

## 2. Explore Data

In [10]:
# remember that it's not a dataframe yet, so first convert it to a dataframe
# the boston feature values are stored as an array under the 'data' key
import pandas as pd
boston_df = pd.DataFrame(boston.data)
boston_df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 |


Notice that there are no column names

In [11]:
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 |


Notice that the output/target MEDV is not included

### Adding the target 

In [12]:
# create a column MEDV and assign the target data to it
boston_df['MEDV'] = boston.target

boston_df.head()

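| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |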


In [14]:
boston_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB


In [15]:
boston_df.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


In [16]:
# check for null values - you'll get True where a value is null and False where it isn't
boston_df.isnull()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 502 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 503 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 504 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |


Chain .sum() after .isnull() to count the True values (True counts as 1, False as 0). This shows the number of null values for each feature; 0 means there are no nulls.

In [17]:
# checking for null values
boston_df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64
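
Since the Boston data has no nulls, the counting behaviour is easier to see on a small frame that does. The sketch below uses a hypothetical toy frame named `toy` with two missing values in one column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two missing values in 'a', none in 'b'
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan], 'b': [1, 2, 3, 4]})

# True counts as 1 and False as 0, so the sum is the number of nulls per column
null_counts = toy.isnull().sum()
print(null_counts)
```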

## 3. Splitting your X and y

X - variable which holds all the features

y - variable which holds the target value

Here you can either use the Bunch object, which already stores the features (X) and target (y) separately, or drop the target column from the dataframe and assign the features to X and the target to y.

In [18]:
# Option 1: Using the data from sklearn in bunch format - with separate feature and target values already
X = boston.data
y = boston.target

In [19]:
# Option 2: splitting X and y from the dataframe which has features and target
X = boston_df.drop('MEDV', axis=1)  # 1 for column
y = boston_df['MEDV']

## 4. Split the dataset into train and test sets

In [20]:
# import train_test_split
from sklearn.model_selection import train_test_split

In [21]:
# call train_test_split; it returns 4 values, which are assigned to the variables X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# test_size as a float must be strictly between 0 and 1:
# it is the fraction of the data held out for testing (the rest is used for training)
# 0.2 - 20% of data is taken for testing
# 0.3 - 30% of data is taken for testing
# (an integer test_size is instead read as an absolute number of test rows)

In [22]:
# check the shape of your split to confirm it's 80:20

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(404, 13)
(102, 13)
(404,)
(102,)
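
To see how `test_size` controls the split, the sketch below runs `train_test_split` on a synthetic array of 1000 rows, once with a fraction and once with an absolute row count. The array names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(2000).reshape(1000, 2)  # 1000 rows, 2 features
y_demo = np.arange(1000)

# float test_size: fraction of rows held out for testing
_, X_te_frac, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# int test_size: absolute number of rows held out for testing
_, X_te_abs, _, _ = train_test_split(X_demo, y_demo, test_size=50, random_state=0)

print(X_te_frac.shape)  # (300, 2)
print(X_te_abs.shape)   # (50, 2)
```

When the fraction does not divide evenly, the test count is rounded up, which is why 20% of 506 rows gives 102 test rows above.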


## 5. Build regression model

In [23]:
#1. import the necessary model function
from sklearn.linear_model import LinearRegression

#2. instantiate the estimator object
model_lr = LinearRegression()

#3. fit the model on the data i.e. training the model with the data - supervised approach
model_lr.fit(X_train, y_train)

print('The intercept for the LR model is:', model_lr.intercept_)
print('The coefficient for all the features:', model_lr.coef_)  # there will be 13 values i.e. 1 value for each feature

The intercept for the LR model is: 35.413008956629454
The coefficient for all the features: [-1.17203860e-01  2.88328374e-02 -6.02101673e-03  3.12189192e+00
 -1.85715694e+01  3.68520052e+00  3.03683595e-04 -1.42177340e+00
  3.05356843e-01 -1.08651262e-02 -8.89154520e-01  1.12857303e-02
 -5.11914784e-01]


Note that without setting the __random state__ (i.e. a seed), you will get different values each time you run the model because the records are selected at random during the split.

In [24]:
# set the seed to 21, can be any integer value
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)
print('The intercept for the LR model is:', model_lr.intercept_)
print('The coefficient for all the features:', model_lr.coef_)  

The intercept for the LR model is: 40.653176529790514
The coefficient for all the features: [-8.77422649e-02  4.87770336e-02  1.94746142e-02  3.06314365e+00
 -1.84821160e+01  3.34704170e+00  3.22024333e-03 -1.42569490e+00
  3.25184188e-01 -1.20259158e-02 -1.05582832e+00  1.07682087e-02
 -5.38356500e-01]
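
The effect of the seed can be checked directly: two splits with the same `random_state` select the same rows. This is a minimal sketch on a small synthetic array with hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Same seed twice: the same rows end up in the test set
_, _, _, y_te_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=21)
_, _, _, y_te_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=21)

print(np.array_equal(y_te_a, y_te_b))  # True
```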


## 6. Evaluate the model

#### 6a. RMSE
__RMSE__ is used for evaluating regression.

error = actual - predicted

- when actual > predicted, the error is positive
- when actual < predicted, the error is negative

So RMSE is the square root of the mean of all squared errors. The errors are squared so that positive and negative values do not cancel out when averaged, and the square root brings the result back to the units of the target.
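
The arithmetic can be followed step by step on a few hypothetical values. The numbers below are made up for illustration and are not taken from the Boston data.

```python
import numpy as np

actual = np.array([24.0, 21.0, 35.0, 30.0])
predicted = np.array([26.0, 20.0, 32.0, 30.0])

errors = actual - predicted   # [-2, 1, 3, 0]
squared = errors ** 2         # [4, 1, 9, 0]
mse = squared.mean()          # 14 / 4 = 3.5
rmse = np.sqrt(mse)

print(errors.mean())  # 0.5: the signed errors partly cancel out
print(rmse)           # about 1.87
```

The plain mean of the errors understates the deviation because the negative error offsets the positive ones, which is the reason for squaring first.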

In [26]:
# Evaluate the model - use the same estimator object and the predict function
# Use only your X_test so your model can come up with the predicted target values
y_pred = model_lr.predict(X_test)

In [28]:
# compare your predicted value (y_pred) with the y_test values
# we want to show the values side by side so create a dataframe with values in a dict
pd.DataFrame({'Actual y_test': y_test, 'Predicted y_test':y_pred})

| | Actual y_test | Predicted y_test |
|---|---|---|
| 455 | 14.1 | 15.311568 |
| 142 | 13.4 | 15.324187 |
| 311 | 22.1 | 26.890855 |
| 232 | 41.7 | 37.384876 |
| 290 | 28.5 | 33.375220 |
| ... | ... | ... |
| 486 | 19.1 | 20.027912 |
| 468 | 19.1 | 17.513802 |
| 302 | 26.4 | 29.172278 |
| 244 | 17.6 | 16.904621 |


You can see that the predicted values deviate from the actual values. To find the RMSE, import numpy so you can use its sqrt function: here we take mean_squared_error from sklearn and let numpy supply the ROOT part.

The mean_squared_error function takes two required arguments (in this case, the actual and predicted values).

In [29]:
from sklearn.metrics import mean_squared_error
import numpy as np

print('RMSE value of testing dataset')
print(np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE value of testing dataset
5.179324335658004


The RMSE tells us that the predictions typically deviate from the actual values by about 5.18.
Remember that our y, i.e. MEDV, is in units of $1000, so this corresponds to roughly +/-$5.18K in house price. Whether an RMSE value is acceptable depends on the problem statement and on what the target represents; a deviation of this size may be considered acceptable in the real estate field.

If the RMSE value is too high, the model is not good enough. You can try another model type, e.g. polynomial regression, or select the features that correlate best with the target.

Note: Only use __linear regression__ when there is a linear relationship between the target and the features. Use a scatterplot to check the relationship/spread between each feature and the target variable.

If the spread is non-linear, use polynomial regression.
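
A minimal sketch of the difference, assuming synthetic data that follows y = x^2 exactly: a straight line fits it poorly, while adding a squared feature with `PolynomialFeatures` and fitting the same `LinearRegression` estimator captures the curve. The variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data: y = x^2 exactly (an assumption for illustration)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = (x ** 2).ravel()

# A straight line cannot follow the curve
linear = LinearRegression().fit(x, y)

# Add an x^2 column, then fit the same linear model on the expanded features
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)
quadratic = LinearRegression().fit(x_poly, y)

print('linear R2:', linear.score(x, y))
print('degree-2 R2:', quadratic.score(x_poly, y))
```

Polynomial regression is still linear in its coefficients; only the features are transformed.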


__Note:__ Using multiple features as above is Multiple Linear Regression, not Simple Linear Regression. You should also use the features that are good predictors of the target variable, i.e. the features with the best correlation. Refer to the IBM DA0101EN - Data Analysis with Python training.
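
One way to rank features by correlation with the target is `DataFrame.corr()`. The sketch below uses synthetic data with hypothetical column names, where one feature drives the target and the other is pure noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)          # feature that drives the target
noise_feature = rng.normal(size=n)   # feature unrelated to the target
target = 3.0 * strong + rng.normal(scale=0.5, size=n)

frame = pd.DataFrame({
    'strong': strong,
    'noise_feature': noise_feature,
    'target': target,
})

# Pearson correlation of every feature with the target, strongest first
corr_with_target = frame.corr()['target'].drop('target')
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only measures linear association, so a feature with a curved relationship to the target can still rank low.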

#### 6b. R2 Score
__R2 Score__ is also used to determine how well the model fits the data. Use it alongside the RMSE.

In [31]:
# the score method is built into the LinearRegression estimator, so you don't need to import any new metric

print('R2 Score is:', model_lr.score(X_test, y_test))

R2 Score is: 0.714936416139223


This score tells us that our model explains about 71% of the variance in the target (MEDV) on the test set.
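
The R2 score can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses hypothetical values and compares the manual result with `r2_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import r2_score

actual = np.array([24.0, 21.0, 35.0, 30.0])
predicted = np.array([26.0, 20.0, 32.0, 30.0])

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares = 14
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares = 117
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                    # about 0.88
print(r2_score(actual, predicted))  # same value from sklearn
```

A score of 1 means the predictions match the actual values exactly, and a score of 0 means the model does no better than always predicting the mean.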