## Using Pandas to get familiar with the data
The most important part of the Pandas library is the <b>DataFrame</b>. It holds the type of data we can think of as a table.

In [1]:
import pandas as pd

melbourn_file = 'datasets/melb_data.csv'

In [2]:
data = pd.read_csv(melbourn_file)
data.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


## Interpreting the Data Description
The results show 8 numbers for each numeric column in the dataset. The first number, <b>count</b>, shows how many rows have <b>non-missing values</b>.
<br>
The second value is the <b>mean</b>, which is the average. <br>
Next is <b>std</b>, the standard deviation, which measures how numerically spread out the values are.<br>
To interpret the <b>min, 25%, 50%, 75%</b> and <b>max</b> values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter of the way through the list, you will find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced <b>25th percentile</b>). The 50th and 75th percentiles are defined analogously, and the max is the largest number.<br>
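
The summary statistics can be checked by hand on a tiny column. The sketch below uses a hypothetical toy Series (not the Melbourne data) with one missing entry, so the effect on <b>count</b> and the percentiles is easy to see.

```python
import pandas as pd

# Hypothetical toy column: five known values and one missing value.
toy = pd.Series([10, 20, 30, 40, 50, None], name="toy")

summary = toy.describe()
print(summary)

# count ignores the missing entry, so it reports 5 rather than 6.
# The 25%, 50% and 75% rows are percentiles of the sorted known values.
```

Because the missing entry is skipped, the percentiles are taken over the five known values only.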


# <u>Selecting Data for Modeling</u>
We have too many variables to wrap our heads around. We'll start by picking a few variables using our intuition; later we'll see statistical techniques to automatically prioritise the variables.<br>

In [3]:
data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [4]:
# Drop every row that contains at least one missing value
data = data.dropna(axis=0)
data.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1068828.0,9.751097,3101.947708,2.902034,1.57634,1.573596,471.00694,141.568645,1964.081988,-37.807904,144.990201,7435.489509
std,0.971079,675156.4,5.612065,86.421604,0.970055,0.711362,0.929947,897.449881,90.834824,38.105673,0.07585,0.099165,4337.698917
min,1.0,131000.0,0.0,3000.0,0.0,1.0,0.0,0.0,0.0,1196.0,-38.16492,144.54237,389.0
25%,2.0,620000.0,5.9,3044.0,2.0,1.0,1.0,152.0,91.0,1940.0,-37.855438,144.926198,4383.75
50%,3.0,880000.0,9.0,3081.0,3.0,1.0,1.0,373.0,124.0,1970.0,-37.80225,144.9958,6567.0
75%,4.0,1325000.0,12.4,3147.0,3.0,2.0,2.0,628.0,170.0,2000.0,-37.7582,145.0527,10175.0
max,8.0,9000000.0,47.4,3977.0,9.0,8.0,10.0,37000.0,3112.0,2018.0,-37.45709,145.52635,21650.0
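
Dropping rows this way is costly here: the count falls from 13580 to 6196. A minimal sketch, on a hypothetical toy frame whose column names only mimic the housing data, of how one might count missing values per column before deciding to drop rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
    "YearBuilt": [1900.0, 1970.0, np.nan, 2005.0],
})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()

# axis=0 drops every row that has at least one missing entry.
complete_rows = toy.dropna(axis=0)

print(missing_per_column)
print(len(toy), "rows before,", len(complete_rows), "rows after dropna")
```

Only rows with no gaps in any column survive, so a few sparse columns can remove most of the data.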


## Selecting a subset of your data
You can pull out a variable using dot notation. The single column is stored in a <b>Series</b>, which is like a DataFrame with only a single column of data.<br>
We use dot notation to select the column we want to predict, which is called the <b>prediction target</b>. By convention the prediction target is called "y".

In [5]:
y = data.Price
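
As a small aside, dot notation is one of two equivalent ways to select a column. The sketch below, on a hypothetical toy frame, shows that bracket notation returns the same Series, and that passing a list of names returns a DataFrame instead.

```python
import pandas as pd

# Hypothetical toy frame, made up for illustration.
toy = pd.DataFrame({"Price": [100, 200, 300], "Rooms": [2, 3, 4]})

by_dot = toy.Price          # dot notation
by_bracket = toy["Price"]   # bracket notation, works for any column name
subset = toy[["Rooms"]]     # a list of names returns a DataFrame

print(type(by_dot).__name__, type(subset).__name__)
```

Bracket notation is the safer choice when a column name contains spaces or clashes with a DataFrame method name.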

## Choosing Features
The columns that are fed into our model and later used to make predictions are called <b>features</b>. Here they are the columns used to determine the home price. Sometimes we may use all columns except the target column as features. Other times we are better off with fewer features.

In [6]:
features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
#by convention we call it x
x = data[features]
x.describe()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635


In [7]:
x.head()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


# <u>Building your model</u>
Steps to build a model are: <br>
1) Define - what type of model will it be?<br>
2) Fit - Capture the patterns provided by the data. This is the heart of modeling.<br>
3) Predict<br>
4) Evaluate - determine how accurate the model's predictions are.<br>



In [8]:
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run

model = DecisionTreeRegressor(random_state = 1)

# Fit model
model.fit(x, y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=1, splitter='best')

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.

In [9]:
print("Making predictions for the following 5 houses:")
print(x.head())
print("The predictions are")
print(model.predict(x.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]


In [10]:
print(x)

       Rooms  Bathroom  Landsize  Lattitude  Longtitude
1          2       1.0     156.0  -37.80790   144.99340
2          3       2.0     134.0  -37.80930   144.99440
4          4       1.0     120.0  -37.80720   144.99410
6          3       2.0     245.0  -37.80240   144.99930
7          2       1.0     256.0  -37.80600   144.99540
...      ...       ...       ...        ...         ...
12205      3       2.0     972.0  -37.51232   145.13282
12206      3       1.0     179.0  -37.86558   144.90474
12207      1       1.0       0.0  -37.85588   144.89936
12209      2       1.0       0.0  -37.85581   144.99025
12212      6       3.0    1087.0  -37.81038   144.89389

[6196 rows x 5 columns]


# Model Validation
We have to evaluate every model we build. In most (though not all) applications, the relevant measure of model quality is predictive accuracy. <br>
We need to summarise the model quality in a single metric.<br>
There are many such metrics, but let's start with:<br>
<br>
1) Mean Absolute Error (MAE)<br>
Here the prediction error for each house is: <b>error = actual - predicted</b>
<br>
With MAE, we take the absolute value of each error. This converts each error into a positive number. Then we take the average of those absolute errors. This is our measure of model quality.
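
The arithmetic can be followed by hand on three houses. The prices below are hypothetical and chosen only to keep the numbers simple.

```python
# Hypothetical prices, chosen only to make the arithmetic easy to follow.
actual = [500_000, 750_000, 1_000_000]
predicted = [450_000, 800_000, 1_000_000]

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Taking absolute values stops positive and negative errors cancelling out.
absolute_errors = [abs(e) for e in errors]

mae = sum(absolute_errors) / len(absolute_errors)
print(errors, absolute_errors, mae)
```

Without the absolute value, the first two errors would cancel and the average error would misleadingly be zero.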

In [11]:
from sklearn.metrics import mean_absolute_error
predicted_home_prices = model.predict(x)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

## The Problem with In-Sample Scores
The measure we just computed can be called an in-sample score. We used a single sample of houses both for building the model and for evaluating it. Here is why this is bad.<br>
Imagine that in a large real estate market, door color is unrelated to price. However, in the sample of data that you used to build your model, all the homes with green doors were expensive. The model's job is to find patterns that predict home prices, so it will see this pattern and will always predict high prices for homes with green doors. <br>
Since this pattern was derived from the training data, the model will appear accurate on the training data. But if the pattern doesn't hold when the model sees new data, the model will be very inaccurate when used in practice.
<br>
Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use it to test the model's accuracy on data it hasn't seen before. This data is called validation data.<br><br>
We will use scikit-learn's train_test_split function to split the data.

In [12]:
from sklearn.model_selection import train_test_split
# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we run this script

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state = 0)
model = DecisionTreeRegressor()
model.fit(train_x, train_y)

val_predictions = model.predict(val_x)
print(mean_absolute_error(val_y, val_predictions))

274127.193458145


So in reality the model's mean absolute error on data it has not seen is more than <b>250,000 dollars</b>, whereas the in-sample score was only about <b>1,000 dollars</b>.
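
To see what the split itself does, here is a minimal sketch on hypothetical data of 100 rows. With no size arguments, train_test_split holds out 25% of the rows, and reusing the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with a single feature.
features = np.arange(100).reshape(-1, 1)
target = np.arange(100)

# Two calls with the same random_state.
a_train, a_val, _, _ = train_test_split(features, target, random_state=0)
b_train, b_val, _, _ = train_test_split(features, target, random_state=0)

print(len(a_train), len(a_val))
print(np.array_equal(a_val, b_val))
```

A different random_state would give a different but equally valid split, so scores can shift slightly from one seed to another.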

# Underfitting and Overfitting
<b>Fine tuning for better performance</b><br>

An example of the kind of tree we are using here:
![image.png](attachment:image.png)
In practice it is common for a tree to have 10 splits between the top level and a leaf. Each split doubles the number of groups, so a tree with n levels of splits has up to 2^n leaves, which is 2^10 = 1024 leaves for 10 levels.<br>
When we divide the houses among many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data, because each prediction is based on only a few houses. This is called <b>overfitting</b>, where a model matches the training data almost perfectly but does poorly on validation and other new data. <br>
On the other hand, if we make our tree very shallow so that there are more houses per leaf, it doesn't divide the houses into very distinct groups. At an extreme, if a tree divides the houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data. When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on the training data, that is called <b>underfitting</b>.
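
The two failure modes can be sketched on synthetic data. Everything below (the noisy sine curve, the sizes, the leaf counts) is an assumption made for illustration, not part of the housing analysis. The sketch compares training and validation MAE for a very small tree, a medium tree and an unrestricted tree.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
features = rng.uniform(0, 10, size=(400, 1))
target = np.sin(features[:, 0]) + rng.normal(0, 0.3, size=400)

f_train, f_val, t_train, t_val = train_test_split(features, target, random_state=0)

results = {}
for leaves in [2, 20, None]:  # None places no limit on the number of leaves
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(f_train, t_train)
    train_mae = mean_absolute_error(t_train, tree.predict(f_train))
    val_mae = mean_absolute_error(t_val, tree.predict(f_val))
    results[leaves] = (train_mae, val_mae)
    print(leaves, round(train_mae, 3), round(val_mae, 3))
```

The unrestricted tree reproduces the training targets exactly, yet its validation error stays well above zero, which is the signature of overfitting. The 2-leaf tree has a high error even on the training data, which is the signature of underfitting.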

<br><br>
Since we care about our accuracy on new data, which we estimate from our validation data, we want to find a sweet spot between underfitting and overfitting. Visually we want the low point of the (red) validation curve.<br>
![image-2.png](attachment:image-2.png)<br>
There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the <b>max_leaf_nodes</b> argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.<br><br>

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

In [13]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def get_mae(max_leaf_nodes, train_x, val_x, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes, random_state = 0)
    model.fit(train_x, train_y)
    preds_val = model.predict(val_x)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state = 0)

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_x, val_x, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  385696
Max leaf nodes: 50  		 Mean Absolute Error:  279794
Max leaf nodes: 500  		 Mean Absolute Error:  261718
Max leaf nodes: 5000  		 Mean Absolute Error:  271996
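
Of the four options tried, the lowest validation error comes from 500 leaves. One way to pick the best value programmatically is to store the scores in a dictionary and take the key with the smallest value. The scores below are copied from the printed output; the variable names are illustrative.

```python
# Validation MAE per max_leaf_nodes value, copied from the printed output.
scores = {5: 385696, 50: 279794, 500: 261718, 5000: 271996}

# min over the keys, ranked by their associated MAE.
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)
```

Having chosen a tree size, a common next step is to refit the model on all of the data with that setting, since the validation rows no longer need to be held out.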


# Random Forest
A random forest uses many trees and it makes a prediction by averaging the predictions of each component tree. It generally has much better prediction accuracy than a normal decision tree and works well with default parameters. <br>
We use the RandomForestRegressor class from sklearn instead of DecisionTreeRegressor

In [14]:
from sklearn.ensemble import RandomForestRegressor
forest_model = RandomForestRegressor(random_state = 1)
forest_model.fit(train_x, train_y)
melb_pred = forest_model.predict(val_x)
print(mean_absolute_error(val_y, melb_pred))

207190.6873773146
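
The averaging can be checked directly. The sketch below uses synthetic data (an assumption for illustration) and the estimators_ attribute of a fitted RandomForestRegressor, which holds the individual trees, to compare a manual average with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
features = rng.uniform(0, 1, size=(200, 3))
target = features @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(features, target)

# Each fitted tree is available in forest.estimators_.
per_tree = np.array([tree.predict(features[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(features[:5])))
```

Individual trees disagree with one another, and averaging smooths out much of that disagreement.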


# <u>Dealing with missing values</u>
There are 3 approaches to dealing with missing values in a real dataset.<br>
1) A simple option: Drop Columns with missing values<br>
Unless most values in the dropped columns are missing, the model loses access to a lot of potentially useful information with this approach. <br><br>
2) A better option: Imputation<br>
Imputation fills in the missing values with some number. For instance, we can fill in the mean value along each column. The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than dropping the column entirely.<br><br>
3) An extension to Imputation<br>
Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing. In this approach, we impute the missing values, as before. And, additionally, for each column with missing entries in the original dataset, we add a new column that shows the location of the imputed entries. In some cases, this will meaningfully improve results. In other cases, it doesn't help at all.
![image.png](attachment:image.png)

In [15]:
data2 = pd.read_csv(melbourn_file)
y2 = data2.Price

#To keep things simple we will only use numerical predictors
melb_predictors = data2.drop(['Price'], axis = 1)
x2 = melb_predictors.select_dtypes(exclude = ['object'])

x_train, x_valid, y_train, y_valid = train_test_split(x2, y2, train_size = 0.8, test_size = 0.2, random_state = 0)


In [16]:
#Function for comparing different approaches.
def score_dataset(x_train, x_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators = 10, random_state = 0)
    model.fit(x_train, y_train)
    preds = model.predict(x_valid)
    return mean_absolute_error(y_valid, preds)

In [17]:
# From Approach 1 Drop Columns with Missing Values

# Get names of columns with missing values
cols_with_missing = [col for col in x_train.columns if x_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_x_train = x_train.drop(cols_with_missing, axis = 1)
reduced_x_valid = x_valid.drop(cols_with_missing, axis = 1)
print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_x_train, reduced_x_valid, y_train, y_valid))

MAE from Approach 1 (Drop columns with missing values):
183550.22137772635


In [18]:
# Score from Approach 2 (Imputation)
# Next, we use SimpleImputer to replace missing values with the mean value along each column.

# Although it's simple, filling in the mean value generally performs quite well (but this varies by dataset). 
# While statisticians have experimented with more complex ways to determine imputed values (such as regression imputation, 
# for instance), the complex strategies typically give no additional benefit once you plug the results into 
# sophisticated machine learning models.

from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer()
imputed_x_train = pd.DataFrame(my_imputer.fit_transform(x_train))
imputed_x_valid = pd.DataFrame(my_imputer.transform(x_valid))

# Imputation removed column names; put them back
imputed_x_train.columns = x_train.columns
imputed_x_valid.columns = x_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_x_train, imputed_x_valid, y_train, y_valid))

MAE from Approach 2 (Imputation):
178166.46269899711


In [19]:
# Approach 3, Extension of Imputed method

# Make copy to avoid changing original data (when imputing)
x_train_plus = x_train.copy()
x_valid_plus = x_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    x_train_plus[col + '_was_missing'] = x_train_plus[col].isnull()
    x_valid_plus[col + '_was_missing'] = x_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_x_train_plus = pd.DataFrame(my_imputer.fit_transform(x_train_plus))
imputed_x_valid_plus = pd.DataFrame(my_imputer.transform(x_valid_plus))

# Imputation removed column names; put them back
imputed_x_train_plus.columns = x_train_plus.columns
imputed_x_valid_plus.columns = x_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_x_train_plus, imputed_x_valid_plus, y_train, y_valid))

MAE from Approach 3 (An Extension to Imputation):
178927.503183954
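
As an alternative to building the indicator columns by hand, SimpleImputer accepts an add_indicator argument that appends them automatically. This is a sketch on a hypothetical toy matrix, not a replacement for the cell above; note that it returns a plain array, so column names would still need to be restored separately.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix: the second column has one missing entry.
toy = np.array([
    [1.0, np.nan],
    [2.0, 4.0],
    [3.0, 6.0],
])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputed = imputer.fit_transform(toy)

# Columns: the two original features (with the gap filled by the column mean),
# then one indicator column for the feature that had missing values.
print(imputed)
```

Indicator columns are added only for features that had missing values when the imputer was fitted, so the first column gets none here.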


# Categorical Variables

A categorical variable takes only a limited number of values.<br>
Consider a survey that asks people what brand of car they own; the responses would fall into categories like Honda, Toyota, Ford etc. In this case the data is categorical.<br>
<br>
We will get an error if we try to plug these variables into most machine learning models in Python without preprocessing them first.<br>
There are 3 approaches that we can use to prepare our categorical data.<br>
<br>
1) Drop Categorical Variables<br>
The easiest way to deal with categorical variables is to simply remove them from the dataset. This approach only works well if the columns do not contain useful information.<br>
2) Label Encoding<br>
Label encoding assigns each unique value to a different integer.
![image.png](attachment:image.png)
This approach assumes an ordering of the categories. The assumption makes sense in this example, as there is an indisputable ranking of the categories. Not all categorical variables have a clear ordering in the values, but we refer to those that do as ordinal variables. For tree-based models (decision trees and random forests), we can expect label encoding to work well with ordinal variables.<br>
3) One-Hot Encoding<br>
One-hot encoding creates new columns indicating the presence or absence of each possible value in the original data.
![image-2.png](attachment:image-2.png)
It does not assume an ordering of the categories, unlike label encoding. Thus we can expect this approach to work well if there is no clear ordering in the categorical data. For example, "Red" is neither more nor less than "Yellow". We refer to categorical variables without an intrinsic ranking as nominal variables.<br>
One-hot encoding generally does not perform well if the categorical variable takes on a large number of values, i.e. we generally won't use it for variables taking more than 15 different values.
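
Both encodings can be sketched with plain pandas. The survey answers, the column names and the category order below are hypothetical; the order dictionary in particular is an assumption that would have to come from knowledge of the data.

```python
import pandas as pd

# Hypothetical survey answers, assumed for illustration.
toy = pd.DataFrame({
    "Breakfast": ["Never", "Rarely", "Every day", "Rarely"],
    "Color": ["Red", "Yellow", "Green", "Red"],
})

# Label encoding with an explicit order for the ordinal column.
order = {"Never": 0, "Rarely": 1, "Most days": 2, "Every day": 3}
toy["Breakfast_encoded"] = toy["Breakfast"].map(order)

# One-hot encoding for the nominal column: one 0/1 column per color.
one_hot = pd.get_dummies(toy["Color"], prefix="Color").astype(int)

print(toy[["Breakfast", "Breakfast_encoded"]])
print(one_hot)
```

Spelling out the order by hand keeps the integers meaningful. An encoder that assigns integers alphabetically would place "Every day" below "Never".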

In [20]:
x_train.head()

Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
12167,1,5.0,3182.0,1.0,1.0,1.0,0.0,,1940.0,-37.85984,144.9867,13240.0
6524,2,8.0,3016.0,2.0,2.0,1.0,193.0,,,-37.858,144.9005,6380.0
8413,3,12.6,3020.0,3.0,1.0,1.0,555.0,,,-37.7988,144.822,3755.0
2919,3,13.0,3046.0,3.0,1.0,1.0,265.0,,1995.0,-37.7083,144.9158,8870.0
6043,3,13.3,3020.0,3.0,1.0,2.0,673.0,673.0,1970.0,-37.7623,144.8272,4217.0
