<a href="https://colab.research.google.com/github/aantonelloborges/ml/blob/master/Supervised_Learning_REGRESSION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to Machine Learning - https://bit.ly/2JVEj2W

We will be doing a simple exercise using a well-known data collection - Boston area home prices. The problem is to predict home prices from this data using REGRESSION.


*Disclaimer: this data is old, so some of the information will not reflect current trends. Good ML modeling needs current, reliable and relevant data!*

### We will be using these packages to help us
- numpy: NumPy is the fundamental package for scientific computing with Python

- pandas: pandas provides data analysis tools for the Python programming language

- scipy: SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering

- matplotlib: Matplotlib is a plotting library for the Python programming language

- sklearn: Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language

- statsmodels: statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models

- seaborn: Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

## Loading Packages
Please execute the next cell to load our required packages - note how package names are referenced in short form by the use of _"as"_

In [0]:
get_ipython().run_line_magic('matplotlib', 'inline')

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import statsmodels.api as sm

import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

# special matplotlib argument for improved plots
from matplotlib import rcParams

### Load Boston dataset

In [0]:
# sklearn provides datasets as part of the package! You don't have to download datasets from other websites.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston
boston = load_boston() # load_boston in sklearn already has information about the housing data!

In [0]:
print(boston.DESCR) # describe the dataset

In [0]:
# Let's get some more information about our dataset
print(boston.keys())
print(type(boston.data))
print(boston.data.shape)

In [0]:
print(boston.feature_names)

In [0]:
bos = pd.DataFrame(boston.data) # pandas is used here to format the data into a DataFrame called "bos"- rows and columns 
print(bos.head()) # top 5 rows of dataset bos

In [0]:
bos.columns = boston.feature_names # we are adding the feature names to the columns
print(bos.head())

In [0]:
print(boston.target.shape) # target holds the prices - it is a single column vector

In [0]:
bos['PRICE'] = boston.target # the dependent variable > Price is dependent on the features.
print(bos.head())

In [0]:
bos.describe() # get all stats for each column of data quickly !

# this is useful to understand various statistical measures about our data - please review the output below!


In [0]:
X = bos.drop('PRICE', axis = 1) # these are our variables
Y = bos['PRICE'] # this is the result

In [0]:
import sklearn.model_selection




### Split the data into training and test set using the train_test_split function

Why do we split the data ? 

The data is split into training and test datasets, and both contain the known output (the price). As this is supervised learning, the "training data" helps the model learn. To validate this learning we use the "test data", which the model has not seen during training. Because the test data came from the original dataset, it also has the known output - making it possible for us to verify the model!

The command in the cell below uses the sklearn function "train_test_split" on the input data (X, Y) to create a 2/3 (training) - 1/3 (test) split of the data, set by the test_size parameter (=0.33).

Please note there are several ways to split data and that depends on the particular problem to be solved.
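
To make the idea of a shuffled split concrete, here is a minimal sketch of the same 2/3 - 1/3 split in plain Python. It is not how scikit-learn implements "train_test_split"; the helper name `split_indices` is made up for illustration, and it assumes the Boston data has 506 rows.

```python
import math
import random

def split_indices(n_rows, test_size=0.33, seed=5):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # reproducible shuffle
    n_test = math.ceil(n_rows * test_size)  # round the test share up
    return indices[n_test:], indices[:n_test]

# Assumption: 506 rows, as in the Boston dataset
train_idx, test_idx = split_indices(506)
print(len(train_idx), len(test_idx))  # 339 167
```

Every row lands in exactly one of the two lists, so the model is never tested on a row it was trained on.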


In [0]:
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size = 0.33, random_state = 5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)



### Linear Regression
Build the model and predict the target value using the Linear Regression model

Linear Regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. It’s used to predict values within a continuous range, (e.g. sales, price) rather than trying to classify them into categories (e.g. cat, dog).
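
As a rough illustration of the "constant slope" idea, the sketch below fits a straight line to one made-up feature using the closed-form least squares formulas. The helper `fit_line` and the toy numbers are assumptions for illustration only; the sklearn model in this notebook fits all 13 features at once.

```python
def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that follows price = 2 + 3 * rooms exactly
toy_rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
toy_prices = [2.0 + 3.0 * r for r in toy_rooms]

toy_slope, toy_intercept = fit_line(toy_rooms, toy_prices)
print(toy_slope, toy_intercept)  # 3.0 2.0
```

Each extra room adds the same amount (the slope) to the predicted price, whatever the starting number of rooms.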

In [0]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, Y_train)

Y_pred = lm.predict(X_test)

Let's plot the predicted prices vs actual prices along with the ideal prediction

In [0]:
plt.scatter(Y_test, Y_pred)
plt.xlabel("Prices: $Y_i$")
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")  # raw string so "\h" is not read as an escape sequence
plt.title(r"Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$")
plt.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], 'k--', lw=1)

### Mean Squared Error - from the linear regression model

We use loss functions to improve our model. This means that our goal is to minimize the "loss" or "error" between predicted and actual results.

Mean Squared Error (MSE) is the most common regression loss function. MSE is the mean (average) of the squared differences between the target values and the predicted values.
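
In the notation of the plot above, the MSE over $n$ test homes is $\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$. Here is a small worked example with made-up numbers; the helper name `mean_squared_error_by_hand` is only for illustration.

```python
def mean_squared_error_by_hand(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

# Made-up prices: the errors are 1, 0, -2 and -1
toy_actual = [3.0, 5.0, 2.0, 7.0]
toy_predicted = [2.0, 5.0, 4.0, 8.0]

toy_mse = mean_squared_error_by_hand(toy_actual, toy_predicted)
print(toy_mse)  # (1 + 0 + 4 + 1) / 4 = 1.5
```

Squaring means that one large miss counts for more than several small ones.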


In [0]:
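# calculate the MSE between the actual and predicted prices on the test set
import sklearn.metrics  # import the submodule explicitly rather than relying on a side effect
mse = sklearn.metrics.mean_squared_error(Y_test, Y_pred)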
print(mse)

### Let's see how another model performs - this is the RandomForestRegressor model 

This model trains many decision trees on random samples of the training data and averages their predictions.

Article https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
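
To show how several trees can be combined, here is a minimal sketch that averages very small trees trained on random resamples of the data. It is not scikit-learn's implementation: it uses depth-1 trees ("stumps") on a single made-up feature and skips the random feature selection a real random forest also does. All names and numbers are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold and one mean per side."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skip the minimum so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x in the sample is identical: always predict the mean
        mean_y = sum(ys) / len(ys)
        return (float("inf"), mean_y, mean_y)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def fit_stump_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return stumps

def predict_stump_forest(stumps, x):
    """The ensemble prediction is the average of the individual predictions."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)

# Made-up data: more rooms, higher price
demo_rooms = [4, 5, 5, 6, 6, 7, 8, 8]
demo_prices = [12, 15, 16, 21, 22, 30, 41, 43]

demo_forest = fit_stump_forest(demo_rooms, demo_prices)
small_home = predict_stump_forest(demo_forest, 4.5)
large_home = predict_stump_forest(demo_forest, 8)
print(small_home, large_home)
```

Each stump only knows two price levels, but because every stump sees a different resample and picks a different threshold, the average is smoother than any single stump.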

In [0]:
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X_train, Y_train)

Y_pred = regr.predict(X_test)

plt.scatter(Y_test, Y_pred)
plt.xlabel("Prices: $Y_i$")
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")  # raw string so "\h" is not read as an escape sequence
plt.title(r"Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$")
plt.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], 'k--', lw=1)

### Mean Squared Error - from the Random Forest regression model

In [0]:
mse = sklearn.metrics.mean_squared_error(Y_test, Y_pred)
print(mse)

In [0]:
prices = boston.target

In [0]:
print(boston.feature_names)

## Predicting 

### Did you compare the MSEs between the 2 models? One using linear regression and the other using random forest?
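
One way to make an MSE easier to read is to take its square root, which gives an error in the same units as the price (thousands of dollars in this dataset). The sketch below uses placeholder MSE values, not results from this notebook, and the helper name `rmse_in_dollars` is made up for illustration.

```python
import math

def rmse_in_dollars(mse, dollars_per_unit=1000):
    """Square root of the MSE, converted from $1000s to dollars."""
    return math.sqrt(mse) * dollars_per_unit

# Placeholder values only - substitute the MSEs printed by your own run
placeholder_mse = {"linear regression": 25.0, "random forest": 16.0}

for model_name, value in placeholder_mse.items():
    print("{}: typical error about ${:,.0f}".format(model_name, rmse_in_dollars(value)))
```

The model with the lower MSE also has the lower root mean squared error, so the ranking of the two models does not change.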


Now that we have a model **regr** let's see how it does on predictions...

Everything else being the same - how do you think the price will be affected by

* the number of rooms (RM)
* the Pupil Teacher Ratio (PTRATIO) and
* the level of poverty of people in the neighborhood (LSTAT)?



Here we have 3 homes with these characteristics :

    
| Client | Rooms (RM) | PTRATIO | LSTAT |
| ------ | ---------- | ------- | ----- |
| 1 | 5 | 15 | 20 |
| 2 | 4 | 30 | 10 |
| 3 | 8 | 12 | 3 |
   
   
   -----------------------------------------
  
  
Let's see how our model predicts prices for these homes. You may want to guess whether the prices will be higher or lower than the median value.
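
The model expects every home as a row of all 13 features in the same column order as the training data. The sketch below is one possible way to build such rows from the three values in the table; the helper `build_row` is made up for illustration, the shared values are the ones this notebook uses for its client rows, and the column order is assumed to be the dataset's feature_names order.

```python
# Assumed column order of the Boston features
FEATURE_ORDER = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Values shared by all three homes (everything except RM, PTRATIO and LSTAT)
shared_values = {"CRIM": 0.06905, "ZN": 0.0, "INDUS": 2.18, "CHAS": 0,
                 "NOX": 0.469, "AGE": 61.1, "DIS": 4.09, "RAD": 1.0,
                 "TAX": 296.0, "B": 332.09}

# The three values that differ per client, taken from the table
client_table = [
    {"RM": 5.0, "PTRATIO": 15.0, "LSTAT": 20},
    {"RM": 4.0, "PTRATIO": 30.0, "LSTAT": 10},
    {"RM": 8.0, "PTRATIO": 12.0, "LSTAT": 3},
]

def build_row(overrides):
    """Merge the shared values with one client's values, in training column order."""
    values = {**shared_values, **overrides}
    return [values[name] for name in FEATURE_ORDER]

client_rows = [build_row(client) for client in client_table]
for row in client_rows:
    print(row)
```

Naming the features this way makes it harder to put a value in the wrong column than typing the rows by hand.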
   

In [0]:
# This is a simple test dataset that is in the same format as our original dataset

clients = [[0.06905, 0.00, 2.18, 0, 0.469, 5.0, 61.1, 4.0900, 1.0, 296.0, 15.00, 332.09, 20],
           [0.06905, 0.00, 2.18, 0, 0.469, 4.0, 61.1, 4.0900, 1.0, 296.0, 30.00, 332.09, 10],
           [0.06905, 0.00, 2.18, 0, 0.469, 8.0, 61.1, 4.0900, 1.0, 296.0, 12.00, 332.09, 3]]


In [0]:
# Let's use the "regr" model and predict prices for each of the clients
# (the target is in units of $1000, so multiply by 1000 to print dollars)

for i, price in enumerate(regr.predict(clients)):
    print("Predicted price for client {}'s home: ${:,.2f}".format(i + 1, price * 1000))


In [0]:
plt.hist(prices,bins=20)
for price in regr.predict(clients):
    plt.axvline(price, lw=5, c='r')

In [0]:
print("Min Price ", prices.min())
print("Max Price ", prices.max())
print("Mean Price ", prices.mean())
print("Median Price ", np.median(prices))
print("Prices - Standard Deviation ", np.std(prices))

### Visualizing the data we see that 

* Client 1 is below the median (5 rooms, 15 to 1 PT ratio and 20% lower income)
* Client 2 is closer to the median (4 rooms, 30 to 1 PT ratio and 10% lower income) and
* Client 3 is closer to the max (8 rooms, 12 to 1 PT ratio and 3% lower income) 
    
Was your intuition similar ? :)




## In Conclusion

We used Regression to fit data during the training of the model, we tested the model using test data and we tried using the model to predict some fictional homes. 

We used several packages in this process. 

This is a taste of what ML can do. There are many resources on the Net, so please have fun exploring them!
Thank you!