# Learn Machine Learning

This notebook takes snippets from the [kaggle.com/learn](http://kaggle.com/learn) Machine Learning course and builds on them.

### Dataset: House Prices: Advanced Regression Techniques

In [1]:
import pandas as pd

train_path = "../input/train.csv"

df = pd.read_csv(train_path)

In [2]:
df.head()

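| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |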


In [3]:
df.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


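* **count**: Number of rows with non-missing values in that column (for example, LotFrontage has 1201 of 1460)
* **mean**: Arithmetic average
* **std**: Standard deviation, a measure of how spread out the values are
* **min, 25%, 50%, 75%, max**: The smallest value, the 25th, 50th (median) and 75th percentiles, and the largest value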
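
To make these statistics concrete, here is a minimal sketch on a tiny hand-made DataFrame. The `toy` frame and its values are assumptions for illustration only, not the house-price data; one LotFrontage value is deliberately missing so that **count** differs between columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in "LotFrontage"
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

summary = toy.describe()
print(summary)

# count skips the missing value, so the two columns report different counts
frontage_count = summary.loc["count", "LotFrontage"]
area_count = summary.loc["count", "LotArea"]

# The 50% row is the median of the column
area_median = summary.loc["50%", "LotArea"]
print(frontage_count, area_count, area_median)
```

Comparing **count** against the number of rows is a quick way to spot columns with missing values before modelling.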

### Selecting and Filtering Data

**df.columns**: Returns the labels of all the columns in a pandas DataFrame (as an Index object)

In [4]:
columns = df.columns
print(columns)

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

To select a single column out of the dataset, use **df["COLUMN_NAME"]**

In [5]:
df["SalePrice"].head(5)

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

We often need a subset of the data, for example for ad-hoc analysis or for testing a user-defined function (UDF). Passing a list of column names is one of several ways to select a subset of columns.

In [6]:
columns_of_interest = ["SaleCondition","SalePrice"]
df[columns_of_interest].head()

| | SaleCondition | SalePrice |
|---|---|---|
| 0 | Normal | 208500 |
| 1 | Normal | 181500 |
| 2 | Normal | 223500 |
| 3 | Abnorml | 140000 |
| 4 | Normal | 250000 |


In [7]:
df[columns_of_interest].describe()

| | SalePrice |
|---|---|
| count | 1460.0 |
| mean | 180921.19589 |
| std | 79442.502883 |
| min | 34900.0 |
| 25% | 129975.0 |
| 50% | 163000.0 |
| 75% | 214000.0 |
| max | 755000.0 |
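
The summary above lists only SalePrice because, for a DataFrame with mixed column types, **describe()** summarises numeric columns by default, so the text column SaleCondition is left out. The sketch below rebuilds the five rows shown by `head()` as a small stand-alone frame (the name `toy` is an assumption for illustration) and passes `include="all"` to summarise both columns.

```python
import pandas as pd

# The first five rows of the two columns, rebuilt by hand for a self-contained example
toy = pd.DataFrame({
    "SaleCondition": ["Normal", "Normal", "Normal", "Abnorml", "Normal"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

numeric_only = toy.describe()              # default: numeric columns only
everything = toy.describe(include="all")   # adds unique/top/freq for text columns

print(list(numeric_only.columns))
print(everything)

# Most frequent category and how often it occurs
top_condition = everything.loc["top", "SaleCondition"]
top_frequency = everything.loc["freq", "SaleCondition"]
print(top_condition, top_frequency)
```

For a text column, the numeric rows (mean, std, percentiles) are reported as NaN, and the rows **unique**, **top** and **freq** carry the information instead.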


### Building your first scikit-learn model

Target Variable: **SalePrice**

For now, we are using the Decision Tree Regressor: [Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)

In [9]:
from sklearn.tree import DecisionTreeRegressor

In [8]:
y = df["SalePrice"]
predictor_variables = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]
X = df[predictor_variables]

The model is fitted with the **.fit()** method, which takes the training data and the labels:

* Training Data: **X**
* Labels: **y**

In [10]:
regressor = DecisionTreeRegressor()
regressor.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

Predicting labels with the **.predict()** method: it takes the rows to predict, which must have the same columns, in the same order, as the data used for fitting

Side note: Here the prediction data is the first 5 rows of the training set, which is why the predictions below match the actual prices exactly. In real-life problems, predictions are made on data the model has not seen
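
To illustrate why predicting on training rows is misleading, here is a minimal sketch on synthetic data (the data, the variable names and the even/odd hold-out are all assumptions for illustration, not part of the original notebook). A decision tree with no size limit can memorise the rows it was fitted on, so its error on those rows is close to zero while its error on held-out rows is not.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the house-price data)
rng = np.random.default_rng(0)
X_toy = np.arange(200, dtype=float).reshape(-1, 1)      # 200 distinct feature values
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0.0, 2.0, 200)   # linear trend plus noise

# Fit on the even-indexed rows, hold out the odd-indexed rows
X_seen, y_seen = X_toy[::2], y_toy[::2]
X_unseen, y_unseen = X_toy[1::2], y_toy[1::2]

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_seen, y_seen)

seen_mae = mean_absolute_error(y_seen, tree.predict(X_seen))
unseen_mae = mean_absolute_error(y_unseen, tree.predict(X_unseen))
print("MAE on rows used for fitting:", seen_mae)
print("MAE on held-out rows:", unseen_mae)
```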

In [11]:
regressor.predict(X.head(5))

array([ 208500.,  181500.,  223500.,  140000.,  250000.])

### Model Validation

To estimate how our model performs on data it has not seen, we divide the overall training data into two sets: a training set and a testing set

The model is trained on the training set, and predictions are made for the testing set

**Mean Absolute Error** is then computed from the original labels and the predicted values for the testing set

There is no universally ideal train/test split. This notebook uses **60:40**; ratios such as 70:30 or 80:20 are also common, and the right choice differs from case to case
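
As a rough illustration of what the split ratio does, the sketch below runs `train_test_split` with a few `test_size` values on placeholder arrays (the arrays and the `sizes` dictionary are assumptions for illustration) and records how many rows land in each set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features (values are arbitrary)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

sizes = {}
for test_size in (0.2, 0.3, 0.4):
    tr_X, te_X, tr_y, te_y = train_test_split(
        X_toy, y_toy, test_size=test_size, random_state=6
    )
    sizes[test_size] = (len(tr_X), len(te_X))
    print("test_size =", test_size, "-> train rows:", len(tr_X), "test rows:", len(te_X))
```

A larger test set gives a steadier error estimate but leaves fewer rows to learn from, which is why the ratio is a judgment call.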

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=6, test_size=0.40)

In [14]:
regressor = DecisionTreeRegressor()
regressor.fit(train_X, train_y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [15]:
predictions = regressor.predict(test_X)

**Mean Absolute Error** (MAE) is the average of the absolute values of the individual errors, where each error is the actual value minus the predicted value
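
As computed by scikit-learn, MAE averages the absolute differences between actual and predicted values. The short sketch below works through that arithmetic by hand and checks it against `mean_absolute_error`; the `predicted` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

actual = np.array([208500, 181500, 223500, 140000])
predicted = np.array([200000, 190000, 223500, 150000])  # hypothetical predictions

errors = actual - predicted            # 8500, -8500, 0, -10000
manual_mae = np.mean(np.abs(errors))   # (8500 + 8500 + 0 + 10000) / 4
sklearn_mae = mean_absolute_error(actual, predicted)

print(manual_mae, sklearn_mae)
```

Because the sign is dropped before averaging, over-predictions and under-predictions do not cancel each other out.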

In [16]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(test_y, predictions)

30421.69006849315

Tuning the maximum number of leaf nodes (**max_leaf_nodes**) to reduce MAE: too few leaves tend to underfit, too many tend to overfit

In [17]:
def get_mae(leaf_nodes, train_X, train_y, test_X, test_y):
    # Fit a tree capped at `leaf_nodes` leaves and score it on the held-out set
    regressor = DecisionTreeRegressor(random_state=0, max_leaf_nodes=leaf_nodes)
    regressor.fit(train_X, train_y)
    predictions = regressor.predict(test_X)
    error = mean_absolute_error(test_y, predictions)
    print("Leaf Nodes: " + str(leaf_nodes) + " Error: " + str(error))
    # Return the error as well, so callers can compare candidates in code
    return error

In [18]:
nodes = [50, 100, 500, 1000, 2000]

for node in nodes:
    get_mae(node, train_X, train_y, test_X, test_y)

Leaf Nodes: 50 Error: 29514.142466
Leaf Nodes: 100 Error: 29022.8035487
Leaf Nodes: 500 Error: 31430.6623002
Leaf Nodes: 1000 Error: 31493.3236301
Leaf Nodes: 2000 Error: 31493.3236301


In [19]:
nodes = [5, 10, 50, 100, 500]

for node in nodes:
    get_mae(node, train_X, train_y, test_X, test_y)

Leaf Nodes: 5 Error: 36994.303582
Leaf Nodes: 10 Error: 33207.8198356
Leaf Nodes: 50 Error: 29514.142466
Leaf Nodes: 100 Error: 29022.8035487
Leaf Nodes: 500 Error: 31430.6623002


In [20]:
nodes = [100, 200, 300, 400, 500]

for node in nodes:
    get_mae(node, train_X, train_y, test_X, test_y)

Leaf Nodes: 100 Error: 29022.8035487
Leaf Nodes: 200 Error: 29522.5853728
Leaf Nodes: 300 Error: 30430.6891636
Leaf Nodes: 400 Error: 31074.9160041
Leaf Nodes: 500 Error: 31430.6623002
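
In the runs above, 100 leaf nodes gave the lowest error among the values tried. Rather than reading the minimum off printed output, the search can be done in code. The sketch below is one possible way to do that, using synthetic data and a hypothetical helper `mae_for_leaf_nodes` that returns the error; neither is part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def mae_for_leaf_nodes(leaf_nodes, train_X, train_y, test_X, test_y):
    """Fit a tree capped at `leaf_nodes` leaves and return its test MAE."""
    model = DecisionTreeRegressor(random_state=0, max_leaf_nodes=leaf_nodes)
    model.fit(train_X, train_y)
    return mean_absolute_error(test_y, model.predict(test_X))


# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 2))
y_toy = 5.0 * X_toy[:, 0] + rng.normal(0.0, 3.0, size=300)
tr_X, te_X, tr_y, te_y = train_test_split(
    X_toy, y_toy, test_size=0.40, random_state=6
)

candidates = [5, 10, 50, 100, 500]
scores = {n: mae_for_leaf_nodes(n, tr_X, tr_y, te_X, te_y) for n in candidates}

# Pick the candidate with the lowest test MAE
best_leaf_nodes = min(scores, key=scores.get)
print(scores)
print("Best max_leaf_nodes:", best_leaf_nodes)
```

Note that choosing a setting by its test-set error makes that error an optimistic estimate; a separate validation set or cross-validation is a common way around this.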


### Training a RandomForestRegressor

[Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

In [24]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(random_state=0)
regressor.fit(train_X, train_y)
predictions = regressor.predict(test_X)
mean_absolute_error(test_y, predictions)

25041.4450913242

On the same test split, the random forest with default settings reaches an MAE of about 25,041, lower than the best single decision tree above (about 29,023 at 100 leaf nodes).