# How does a machine learn?

![Machine learning](images/ml_comic3.png)

https://github.com/Calysto/notebook-extensions
https://github.com/damianavila/RISE

## Scenario 1 - Selling a house

### You're planning to sell your old 600 Sq Feet house
### Next step: come up with a reasonable selling price
### Okay, let's get some data on how much other houses in the same location sold for

###### Let's start by importing some libraries and examining the data.

In [3]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
from IPython.display import Image, YouTubeVideo
from scipy.optimize import curve_fit
import random
import time
import matplotlib
import re
%matplotlib inline

### CSS Stuff :/

In [4]:
font = {'family' : "DejaVu Sans",
        'weight' : 'normal',
        'size'   : 25}

matplotlib.rc('font', **font)

def bigger_table(val):
    """
    bigger fonts for the table
    """
    size = 'x-large'
    #color =  'black' if re.match("^\d+?\.\d+?$", val) else 'red'
    return 'font-size: %s' % size

### Here is the table of data which you've collected

In [5]:
path = 'data/house_price_list1.txt'
house_price_data_1 = pd.read_csv(path, header=None, names=['Area', 'Price'])
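# DataFrame.append was removed in pandas 2.0, so the label row is joined on with pd.concat
header_row = pd.DataFrame([['Area (Sq. Feet)', 'Price (INR)']], columns=['Area', 'Price'])
pd.concat([header_row, house_price_data_1]).reset_index(drop=True).style.applymap(bigger_table)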



### Can you estimate the price of your house (600 Sq Feet) from this data?

### Approach 1 - Calculate the price per square foot of each house

In [None]:
house_price_data_1['Unit Price'] = house_price_data_1.Price / house_price_data_1.Area
#house_price_data_1.head().style.applymap(bigger_table)

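# DataFrame.append was removed in pandas 2.0, so the label row is joined on with pd.concat
header_row = pd.DataFrame([['Area (Sq. Feet)', 'Price (INR)', 'Unit Price (Rs per Sq. Feet)']], columns=['Area', 'Price','Unit Price'])
pd.concat([header_row, house_price_data_1]).reset_index(drop=True).head(6).style.applymap(bigger_table)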
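
One simple way to turn unit prices into an estimate is to average them and multiply by your own area. The sketch below uses made-up numbers: the `areas` and `prices` lists are hypothetical and are not the contents of `house_price_list1.txt`.

```python
# Hypothetical data for illustration only; not the real house_price_list1.txt
areas = [500, 800, 1000, 1200]                   # Sq. Feet
prices = [2500000, 4000000, 5000000, 6000000]    # INR

# Unit price (Rs per Sq. Foot) of each house
unit_prices = [price / area for area, price in zip(areas, prices)]

# Average unit price across the neighbourhood
mean_unit_price = sum(unit_prices) / len(unit_prices)

# Estimated selling price for a 600 Sq. Feet house
my_area = 600
estimate = mean_unit_price * my_area
print('Estimated price: Rs %.0f' % estimate)
```

Averaging works well here only because every hypothetical house has the same unit price; with real data the unit prices will vary, which is what motivates the next approach.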

### Approach 2 - Use high-school maths

In [None]:
fig = plt.figure(figsize=(8, 6))
plt.scatter(x = house_price_data_1['Area'], y = house_price_data_1['Price'], s=180, color='red')
plt.title('Area Vs Cost')
plt.xlabel("Area (Sq Feet)")
plt.ylabel("Price in ₹")
plt.show()

### What next?

### See if you can draw a straight line through all the data points!

### How will you define that line?

#### Equation of a 2D line: 
### y = m * x + b
- m --> Slope
- b --> Y-intercept

### How do we find the slope of a given line? - Recall high-school math

<img src="images/slope.gif" alt="Drawing" style="width: 600px;"/>

### Once you find the slope of the line, you can draw it and use it as a reference to predict the value of any house in your area
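
Finding the slope and intercept from two points on the line can be sketched as follows. The helper `line_through` and the two example houses are hypothetical, chosen only to illustrate the rise-over-run calculation.

```python
def line_through(p1, p2):
    """Return slope and y-intercept of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)    # rise over run
    intercept = y1 - slope * x1      # solve y1 = slope*x1 + intercept
    return slope, intercept

# Two hypothetical houses: (area in Sq. Feet, price in INR)
slope, intercept = line_through((500, 2500000), (1000, 5000000))

# Price the line predicts for a 600 Sq. Feet house
predicted_price = slope * 600 + intercept
```

With these two points the slope works out to 5000 Rs per Sq. Foot and the intercept to 0, so the 600 Sq. Feet house lands at Rs 3,000,000.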

In [None]:
font = {'family' : "DejaVu Sans",
        'weight' : 'normal',
        'size'   : 20}

matplotlib.rc('font', **font)



fig = plt.figure(figsize=(8, 6))
plt.scatter(x = house_price_data_1['Area'], y = house_price_data_1['Price'], s=280, color='red')
plt.scatter(x = [600], y = [3000000], s=300)
plt.plot(list(map(lambda x: x*5000,range(1200))), color = 'green', marker ='.')
plt.axvline(x=600)
plt.axhline(y=3000000)
plt.title('Area Vs Price')
plt.xlabel("Area (Sq Feet)")
plt.ylabel("Price in ₹")
plt.locator_params(nbins=12)  # nbins is the parameter the default linear tick locator accepts
plt.show()


font = {'family' : "DejaVu Sans",
        'weight' : 'normal',
        'size'   : 25}

matplotlib.rc('font', **font)
#font size

### You've successfully estimated the price and sold your old house. Selling was so easy! Congratulations!

## Scenario 2
### Oh boy!  I need to buy a new house

### A bigger one ;) and close to a beach

### Well, house prices are different near the beach, let's collect some data again! 
### Use some more math and predict how much it will cost me!

### Data is here!

In [None]:
path = 'data/house_price_list2.txt'
house_price_data_2 = pd.read_csv(path, header=None, names=['Area', 'Price'])


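# DataFrame.append was removed in pandas 2.0, so the label row is joined on with pd.concat
header_row = pd.DataFrame([['Area (Sq. Feet)', 'Price (INR)']], columns=['Area', 'Price'])
pd.concat([header_row, house_price_data_2]).reset_index(drop=True).head(15).style.applymap(bigger_table)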

### Let's visualize

In [None]:
fig = plt.figure(figsize=(7, 8))
plt.scatter(x = house_price_data_2['Area'], y = house_price_data_2['Price'], s=80, color='red')
plt.title('Area Vs Cost')
plt.xlabel("Area (Sq Feet)")
plt.ylabel("Price in ₹")
plt.show()

### Whoa, draw a straight line through these points?


### Let's draw some lines and see how they look

In [None]:
fig = plt.figure(figsize=(7, 8))
plt.scatter(x = house_price_data_2['Area'], y = house_price_data_2['Price'], s=80, color='red')
m,b = 150, 0
plt.plot(list(map(lambda x: x*m+b,range(5000))), color = 'red')
m,b = 70, 100000
plt.plot(list(map(lambda x: x*m+b,range(5000))), color = 'green')
m,b = -150, 1000000
plt.plot(list(map(lambda x: x*m+b,range(5000))), color = 'blue')
plt.title('Area Vs Cost')
plt.xlabel("Area (Sq Feet)")
plt.ylabel("Price in ₹")
plt.show()

### Interesting! Which one do you think will serve you best?

### Why is the red line better?
### How much better is it than the other two lines? Can we quantify that?

### A little more math - Mean Squared Error

![Machine learning](images/MSE.gif)
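
Written out for our line y = m * x + b, the formula would read as below. The notation is an assumption made here for illustration: n is the number of houses, and x_i and y_i are the area and price of house i.

```latex
\mathrm{MSE}(m, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2
```

Each term is the squared vertical gap between a house's real price and the price the line predicts for its area.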

In [None]:
def mean_square_error(slope, offset, x_values, y_values):
    mse = 0
    for x,y in zip(x_values, y_values):
        mse += ((y - (slope*x + offset))*(y - (slope*x + offset)))
    mse/=len(x_values)
    return mse

def plot_and_return_mse(slope, offset, x_values, y_values, color='red', size=80,
                        x_limit = 5000, y_limit = 770000, title = 'Area Vs Cost'):
    matplotlib.rc('font', **font)
    fig = plt.figure(figsize=(18, 10))

    plt.scatter(x = x_values, y = y_values, s=size, color=color)
    plt.xlim(0, x_limit)
    plt.ylim(0, y_limit)
    plt.title(title)
    plt.xlabel("Area (Sq Feet)")
    plt.ylabel("Price in ₹")

    m,b = slope, offset
    plt.plot(list(map(lambda x: x*m+b,range(x_limit))), color = 'red')
    for area,price in zip(x_values, y_values):  # use the data passed in, not the global table
        plt.axvline(x=area, ymin=min(price,area*m+b)/y_limit, ymax=max(price,area*m+b)/y_limit)
    clear_output()
    plt.show()
    return (mean_square_error(m,b, x_values, y_values))
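
As a quick sanity check, here is a tiny worked example with three made-up points (hypothetical numbers, not taken from the data files). The helper `mse_of_line` is an illustrative restatement of the same calculation so that the block runs on its own.

```python
def mse_of_line(slope, offset, x_values, y_values):
    """Average of the squared vertical gaps between the points and the line."""
    squared_errors = [(y - (slope * x + offset)) ** 2
                      for x, y in zip(x_values, y_values)]
    return sum(squared_errors) / len(squared_errors)

# Three hypothetical points chosen to keep the arithmetic easy
xs = [1, 2, 3]
ys = [2, 4, 7]

# Line y = 2x: errors are 0, 0, 1 -> squared 0, 0, 1 -> mean 1/3
mse_good = mse_of_line(2, 0, xs, ys)

# Line y = 1x: errors are 1, 2, 4 -> squared 1, 4, 16 -> mean 7
mse_bad = mse_of_line(1, 0, xs, ys)
```

The line that hugs the points more closely gets the smaller number, which is what makes MSE usable as a score for comparing lines.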

<img src="images/MSE.gif" alt="Drawing" style="width: 300px;"/>

<h1> Understanding MSE with our Red and Green Lines </h1>

In [None]:
fig = plt.figure(figsize=(22, 10))

plt.subplot(121)

plt.scatter(x = house_price_data_2['Area'], y = house_price_data_2['Price'], s=80, color='red')
x_limit = 5000
y_limit = 770000
plt.xlim(0, x_limit)
plt.ylim(0, y_limit)

m,b = 150, 0
plt.plot(list(map(lambda x: x*m+b,range(x_limit))), color = 'red')
for area,price in zip(list(house_price_data_2['Area']), list(house_price_data_2['Price'])):
    plt.axvline(x=area, ymin=min(price,area*m+b)/y_limit, ymax=max(price,area*m+b)/y_limit)
    plt.plot([area],[area*m+b],marker='o', color='blue')
    
red_mse = mean_square_error(m,b, list(house_price_data_2['Area']), list(house_price_data_2['Price']))
    

plt.title('Area Vs Cost - Red Line - m=%s, b=%s'%(m,b))
plt.xlabel("Area (Sq Feet)")
plt.ylabel("Price in ₹")


plt.subplot(122)
plt.scatter(x = house_price_data_2['Area'], y = house_price_data_2['Price'], s=80, color='red')
x_limit = 5000
y_limit = 770000
plt.xlim(0, x_limit)
plt.ylim(0, y_limit)

m,b = 70, 100000  # same green line as in the earlier three-line plot
plt.plot(list(map(lambda x: x*m+b,range(x_limit))), color = 'green')
for area,price in zip(list(house_price_data_2['Area']), list(house_price_data_2['Price'])):
    plt.axvline(x=area, ymin=min(price,area*m+b)/y_limit, ymax=max(price,area*m+b)/y_limit)
    plt.plot([area],[area*m+b],marker='o', color='blue')

green_mse = mean_square_error(m,b, list(house_price_data_2['Area']), list(house_price_data_2['Price']))

plt.title('Area Vs Cost - Green Line - m=%s, b=%s'%(m,b))
plt.xlabel("Area (Sq Feet)")
plt.ylabel("Price in ₹")

plt.show()


print ('Red-Line\'s Mean Square Error \t= %s'%red_mse)
print ('Green-Line\'s Mean Square Error \t= %s'%green_mse)

### Okay great! I can use this metric to pick the best line
### The only thing I have to do now is find the line with the lowest possible MSE

### So, how do I find the line with the lowest MSE?

### Approach 1 : Try random 'm' and 'b' values, and pick the ones giving the lowest MSE

In [None]:
no_of_iterations = 100
delay = 0
history = {}
# Try random slopes and intercepts, remembering the MSE of each line
for iteration in range(no_of_iterations):
    m = random.randint(-200,200)
    b = random.randint(0,1000000)
    mse = plot_and_return_mse(m, b, house_price_data_2['Area'], house_price_data_2['Price'], color='red', size=80,
                        x_limit = 5000, y_limit = 770000, title = 'Area Vs Cost - m=%s, b=%s'%(m,b))
    history[mse] = (m,b)
    time.sleep(delay)
    
#plotting best line
mse = sorted(history.keys())[0]
m,b = history[mse]
mse = plot_and_return_mse(m, b, house_price_data_2['Area'], house_price_data_2['Price'], color='red', size=80,
                        x_limit = 5000, y_limit = 770000, title = 'Area Vs Cost - m=%s, b=%s'%(m,b))

#Print the lines in ascending order of MSE
# (DataFrame.append was removed in pandas 2.0, so build the table in one go)
results = pd.DataFrame([{'m': m, 'b': b, 'mse': mse} for mse,(m,b) in sorted(history.items())],
                       columns=['m', 'b', 'mse'])

header_row = pd.DataFrame([['m (Slope)', 'b (Y-intercept)', 'MSE (Mean Squared Error)']], columns=['m', 'b', 'mse'])
pd.concat([header_row, results]).reset_index(drop=True).head(15).style.applymap(bigger_table)

### Approach 2 : Try all possible combinations of 'm' and 'b' values, and pick the ones giving the lowest MSE
### Wow, that's a lot of work, so let's fix 'b = 0' and try m values from 0 to 280 in steps of 20

In [None]:
history = {}
delay = 1
m_range = list(range(0,300,20))

for m in m_range:
    mse = plot_and_return_mse(m, 0, house_price_data_2['Area'], house_price_data_2['Price'], color='red', size=80,
                        x_limit = 5000, y_limit = 770000, title = 'Area Vs Cost - m=%s, b=%s'%(m,0))
    history[mse] = (m,0)
    time.sleep(delay)
    
#plotting best line
mse = sorted(history.keys())[0]
m,b = history[mse]
mse = plot_and_return_mse(m, b, house_price_data_2['Area'], house_price_data_2['Price'], color='red', size=80,
                        x_limit = 5000, y_limit = 770000, title = 'Area Vs Cost - m=%s, b=%s'%(m,b))

#Print the lines in ascending order of MSE
# (DataFrame.append was removed in pandas 2.0, so build the table in one go)
results = pd.DataFrame([{'m': m, 'b': b, 'mse': mse} for mse,(m,b) in sorted(history.items())],
                       columns=['m', 'b', 'mse'])

header_row = pd.DataFrame([['m (Slope)', 'b (Y-intercept)', 'MSE (Mean Squared Error)']], columns=['m', 'b', 'mse'])
pd.concat([header_row, results]).reset_index(drop=True).head(11).style.applymap(bigger_table)

### Let's visualize the results of Approach 2

In [None]:
fig = plt.figure(figsize=(7, 8))
plt.scatter(x = results['m'], y = results['mse'], s=80, color='red')
plt.title('m(Slope) Vs MSE for b = 0')
plt.xlabel("m (Slope)")
plt.ylabel("MSE")
plt.show()

### We could visualize the result as a 2-D curve because we fixed b=0
### When b is also varied, we get a 3-D surface
### The 2-D curve we got is a cross-section of that 3-D surface at b = 0
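
With b fixed at 0, the MSE depends only on m, and expanding the square shows it is a quadratic in m. That is why the points trace out a bowl-shaped curve with a single lowest point. In the notation assumed here (n houses, with area x_i and price y_i for house i), the cross-section and its minimum can be derived as:

```latex
\begin{aligned}
\mathrm{MSE}(m) &= \frac{1}{n} \sum_{i=1}^{n} (y_i - m x_i)^2
  = \Bigl(\tfrac{1}{n}\sum_i x_i^2\Bigr) m^2
  - \Bigl(\tfrac{2}{n}\sum_i x_i y_i\Bigr) m
  + \tfrac{1}{n}\sum_i y_i^2 \\
\frac{d\,\mathrm{MSE}}{dm} &= -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - m x_i) = 0
  \quad\Longrightarrow\quad
  m^{*} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}
\end{aligned}
```

The grid of m values tried above only approximates this m*, since it steps through slopes 20 at a time.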

In [None]:
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull  

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')

# Plot every (m, b, MSE) triple as a blue point in 3-D
points = list(zip(results['m'], results['b'], results['mse']))
ax.plot(*zip(*points),'bo') 

ax.set_xlabel('m (Slope)', labelpad=15)
ax.set_ylabel('b (Y-Intercept)', labelpad=15)
ax.set_zlabel('MSE', labelpad=15)


plt.show()

### Let's iterate over both m and b and get the entire 3-D surface

In [None]:
history = {}
delay = 0
m_range = list(range(-300,300,10))
b_range = list(range(0,700000,25000))

for b in b_range:
    for m in m_range:
        mse = mean_square_error(m, b, house_price_data_2['Area'], house_price_data_2['Price'])
#         mse = plot_and_return_mse(m, 0, house_price_data_2['Area'], house_price_data_2['Price'], color='red', size=80,
#                             x_limit = 5000, y_limit = 770000, title = 'Area Vs Cost - m=%s, c=%s'%(m,c))
        history[mse] = (m,b)
        time.sleep(delay)
    
#plotting best line
mse = sorted(history.keys())[0]
m,b = history[mse]
mse = plot_and_return_mse(m, b, house_price_data_2['Area'], house_price_data_2['Price'], color='red', size=80,
                        x_limit = 5000, y_limit = 770000, title = 'Area Vs Cost - m=%0.3f, b=%0.2f'%(m,b))

#Print the lines in ascending order of MSE
# (DataFrame.append was removed in pandas 2.0, so build the table in one go)
results_complete = pd.DataFrame([{'m': m, 'b': b, 'mse': mse} for mse,(m,b) in sorted(history.items())],
                                columns=['m', 'b', 'mse'])

header_row = pd.DataFrame([['m (Slope)', 'b (Y-intercept)', 'MSE (Mean Squared Error)']], columns=['m', 'b', 'mse'])
pd.concat([header_row, results_complete]).reset_index(drop=True).head(11).style.applymap(bigger_table)

### Here is the 3-D surface which represents the error function / loss function
### Each point on this surface corresponds to a different line, with its own combination of slope and intercept

In [None]:
fig = plt.figure(figsize=(14, 14))
ax = fig.add_subplot(111, projection='3d')

# Plot every (m, b, MSE) triple as a blue point in 3-D
points = list(zip(results_complete['m'], results_complete['b'], results_complete['mse']))
ax.plot(*zip(*points),'bo') 

ax.set_xlabel('m (Slope)', labelpad=15)
ax.set_ylabel('b (Y-Intercept)', labelpad=15)
ax.set_zlabel('MSE', labelpad=15)


plt.show()

### What is the problem with Approach 1 (trying random lines)?

### The best line could easily be missed

### What is the problem with Approach 2 (trying all possible lines)?

### An infinite number of lines is possible

### Approach 3 : Gradient Descent - A mathematical technique for finding the minimum of a function


![gradient_descent_error_surface](images/gradient_descent_error_surface.png)

![gradient_descent_example](images/gradient_descent_example.gif)
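
In symbols, gradient descent repeatedly nudges m and b in the direction that lowers the error fastest. A sketch of the update rule for the mean squared error of the line y = m * x + b follows; the learning rate α (a step size you choose) and the index notation x_i, y_i, n are assumptions introduced here for illustration.

```latex
\begin{aligned}
\mathrm{MSE}(m, b) &= \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2 \\
\frac{\partial\,\mathrm{MSE}}{\partial m} &= -\frac{2}{n} \sum_{i=1}^{n} x_i \bigl( y_i - (m x_i + b) \bigr) \\
\frac{\partial\,\mathrm{MSE}}{\partial b} &= -\frac{2}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr) \\
m &\leftarrow m - \alpha \, \frac{\partial\,\mathrm{MSE}}{\partial m},
\qquad
b \leftarrow b - \alpha \, \frac{\partial\,\mathrm{MSE}}{\partial b}
\end{aligned}
```

Each update moves one small step downhill on the error surface; repeating it many times walks towards the bottom of the bowl.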

### Visualizing how gradient descent works

In [None]:
YouTubeVideo('kJgx2RcJKZY',width=800, height=600)

### Gradient descent - Rolling ball analogy 

In [None]:
YouTubeVideo('vWFjqgb-ylQ',width=800, height=600, start=5)

### Let's make use of gradient descent and find the best line for our scenario

In [None]:
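def straight_line(x, m, b): # the model being fitted: y = m*x + b
    return m*x + b

# Note: curve_fit uses a least-squares solver (Levenberg-Marquardt by default), not
# gradient descent itself; it is a shortcut to the same lowest-MSE line.
m, b = curve_fit(straight_line, house_price_data_2['Area'], house_price_data_2['Price'])[0]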
mse = plot_and_return_mse(m, b, house_price_data_2['Area'], house_price_data_2['Price'], color='red', size=80,
                        x_limit = 5000, y_limit = 770000, title = 'Area Vs Cost - m=%0.3f, b=%0.2f'%(m,b))
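
For comparison, here is a minimal sketch of gradient descent written out by hand. It is an illustration under stated assumptions rather than the method used in the cell above: the data is synthetic and noise-free, the function name `gradient_descent`, the learning rate and the step count are all hypothetical choices, and the area is rescaled first because raw areas in the thousands would make a fixed learning rate unstable.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.1, n_steps=1000):
    """Fit y = m*x + b by repeatedly stepping downhill on the MSE surface."""
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_steps):
        error = y - (m * x + b)                  # gap between real and predicted
        grad_m = -2.0 / n * np.sum(x * error)    # dMSE/dm
        grad_b = -2.0 / n * np.sum(error)        # dMSE/db
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Hypothetical data lying exactly on price = 150 * area + 20000
area = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = 150.0 * area + 20000.0

# Rescale area to zero mean and unit spread so one learning rate suits both parameters
mean, spread = area.mean(), area.std()
area_scaled = (area - mean) / spread

m_scaled, b_scaled = gradient_descent(area_scaled, price)

# Convert the fitted parameters back to the original units
m_fit = m_scaled / spread
b_fit = b_scaled - m_fit * mean
```

Because the synthetic points sit exactly on a line, the recovered `m_fit` and `b_fit` should match the slope and intercept used to generate them; on real, noisy data the result would instead be the line with the lowest MSE.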

### What we have been doing so far is Linear Regression
### Our example was a Linear Regression problem with 1 variable (Area of the House)

### When more variables come into the picture, it becomes a multivariate Linear Regression problem
### Example data for multivariate Linear Regression is shown below:

In [None]:
path = 'data/kaggle_house_value.csv'
house_price_data_3 = pd.read_csv(path, header=0)
pd.set_option('display.max_columns', 500)
house_price_data_3.head(10)
#(pd.DataFrame([['Area (Sq. Feet)', 'Price (INR)']], columns=['Area', 'Price']).append(house_price_data_1).\
# reset_index(drop=True)).style.applymap(bigger_table)

### In multivariate Linear Regression, the line to be fitted becomes a plane (or hyperplane) in a higher-dimensional space, not a line in 2-D space as in one-variable Linear Regression

### Its general equation with n input variables can be written as:

### y = m1*x1 + m2*x2 + m3*x3 +  .. . .  + mn*xn + b

### Linear Regression is a simple Machine Learning model that learns all these parameters m1, m2, m3 .. mn and b from the given data
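
To make the multi-variable case concrete, here is a small sketch that learns m1, m2 and b from data with two input variables. Everything in it is hypothetical: the feature values, the "true" parameters used to generate the prices, and the choice of NumPy's least-squares solver `np.linalg.lstsq` as the fitting method.

```python
import numpy as np

# Hypothetical houses with two features: area (Sq. Feet) and number of bedrooms
X = np.array([[1000.0, 2.0],
              [1500.0, 3.0],
              [2000.0, 3.0],
              [2500.0, 4.0],
              [3000.0, 5.0]])

# Prices generated from known parameters so the result can be checked
true_m = np.array([100.0, 50000.0])   # m1, m2
true_b = 10000.0
y = X @ true_m + true_b

# Append a column of ones so the intercept b is learned like any other weight
X_with_ones = np.column_stack([X, np.ones(len(X))])

# Least-squares solution for [m1, m2, b]
params, *_ = np.linalg.lstsq(X_with_ones, y, rcond=None)
m_learned, b_learned = params[:-1], params[-1]
```

The column-of-ones trick folds b into the same equation as the slopes, so one solver call recovers every parameter at once.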

### Congratulations!

### You've learnt how a simple machine learning model learns 

### Thanks for learning