<a href="https://colab.research.google.com/github/arun-arunisto/Machine_Learning_Tutorial/blob/todo/MachineLearningTutorial1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Topics covered in part 1
- Models
- Data exploration
- first ML model
- model validation
- underfitting & overfitting
- random forests


##Models
- we'll start with a model called <b>Decision Tree</b>
- there are fancier models that give more accurate predictions, but we use decision trees because they are easy to understand
- for simplicity, we will start with the simplest possible decision tree
<img src="https://storage.googleapis.com/kaggle-media/learn/images/7tsb5b1.png">

- we use data to decide how to break the houses into 2 groups, and then again to determine the predicted price in each group.
- this step of capturing patterns from data is called "fitting" or "training" the model
- the data used to <b>fit</b> the model is called the <b>training data</b>
- once the model is fit <b>(eg: it has learned how to split up the data)</b>, you can apply it to new data to <b>predict</b>.
- you can capture more factors using a tree that has more "splits". these are called <b>deeper</b> trees.
<img src="https://storage.googleapis.com/kaggle-media/learn/images/R3ywQsR.png">
- the point at the bottom of the tree where we make a prediction is called a <b>leaf</b>
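
A one-split tree can be written as plain Python to make the idea concrete. This is only a minimal sketch: the function name `predict_price`, the two-bedroom threshold and both prices are made-up illustration values, not numbers learned from the Melbourne data.

```python
# A hand-written "decision tree" with one split and two leaves.
# The threshold and both prices are hypothetical illustration values.
def predict_price(bedrooms):
    if bedrooms > 2:      # the split
        return 188000     # leaf for houses with more than 2 bedrooms
    return 178000         # leaf for houses with 2 bedrooms or fewer

print(predict_price(2))
print(predict_price(4))
```

Fitting a real model means letting the data choose the split and the leaf values instead of typing them in by hand.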

##Data Exploration
- using pandas to get familiar with the data

In [1]:
#importing pandas
import pandas as pd

- here we are going to use the Melbourne housing data to explore in Python

In [4]:
file_path = "/content/drive/MyDrive/Datascience&MachineLearning/datasets/melb_data.csv"

#using pandas to open the csv file
data = pd.read_csv(file_path)

#printing the summary of the data
data.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


- <b>count</b> - shows how many rows have non-missing values.
- <b>mean</b> - average
- <b>std</b> - standard deviation which measures how numerically spread out values are.
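- <b>min, 25%, 50%, 75%</b> and <b>max</b> - if each column is sorted from lowest to highest value, <b>min</b> is the first value, <b>25%</b>, <b>50%</b> and <b>75%</b> are the values a quarter, half and three quarters of the way through the list, and <b>max</b> is the last value.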
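
To see how <b>count</b> and the percentiles behave, here is a small sketch on a made-up table (the `toy` DataFrame and its values are assumptions for illustration, not part of the Melbourne data).

```python
import pandas as pd

# Toy data (made up): the "Car" column has one missing value
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, 2.0, None, 1.0],
})

summary = toy.describe()
print(summary)

# count ignores the missing entry, so "Car" has a smaller count than "Rooms"
print(summary.loc["count"])

# the 50% row is the median of each column
print(summary.loc["50%"])
```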

##First ML Model
- Selecting data for modeling

In [5]:
#to list the columns/variables of the csv file we use "columns"
data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

- the data has some missing values (some houses for which some columns/variables weren't recorded)
- we are going to use the simplest option for now: dropping those houses from our data
- the code below drops every row that has a missing value

In [6]:
#dropna with axis=0 drops the rows that contain missing values
data = data.dropna(axis=0)
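
The `axis` argument decides whether rows or columns are removed. A small sketch on made-up data (the `toy_houses` table is an assumption for illustration) shows the difference:

```python
import pandas as pd

# Toy data (made up): one house has no recorded BuildingArea
toy_houses = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, None, 142.0],
})

rows_dropped = toy_houses.dropna(axis=0)  # removes the row holding the NaN
cols_dropped = toy_houses.dropna(axis=1)  # removes the column holding the NaN

print(rows_dropped.shape)
print(cols_dropped.shape)
```

With `axis=0` we keep every column but lose houses, which is why the row count in the later `describe()` output is smaller than in the first one.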

- you can pull out a single variable/column with "dot-notation"
- next we are going to use "dot-notation" to select the column we want to predict, which is called the "prediction target"

In [7]:
#selecting the prediction target
#Price is the column we are going to predict
y = data.Price

- after selecting the prediction target, we choose the features that will be used to make predictions
- we select multiple features by providing a list of column names

In [8]:
#list that contains features
features = ["Rooms", "Bathroom", "Landsize", "Lattitude", "Longtitude"]

In [9]:
#then we select those columns and assign them to X
X = data[features]
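
Both selection styles can be compared on a tiny made-up table (the `toy_sales` DataFrame is an assumption for illustration): dot-notation and single brackets return one column as a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Toy data (made up)
toy_sales = pd.DataFrame({
    "Rooms": [2, 3],
    "Bathroom": [1.0, 2.0],
    "Price": [500000, 750000],
})

target_dot = toy_sales.Price                     # dot-notation
target_bracket = toy_sales["Price"]              # same column, bracket style
toy_features = toy_sales[["Rooms", "Bathroom"]]  # list of names

print(type(target_dot).__name__)
print(type(toy_features).__name__)
print(target_dot.equals(target_bracket))
```

Bracket style is the safer habit when a column name contains spaces or clashes with a DataFrame method name.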

In [10]:
#we are going to review the data using "describe" method
X.describe()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635


In [11]:
#"head" method which shows the top few rows
X.head()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


###Building Model
- we are going to use the <b>scikit-learn</b> library to create our models.

Steps to building and using a model are:
- <b>Define</b> - what type of model will it be? a decision tree? some other type of model? some other parameters of the model type are specified too.
- <b>Fit</b> - capture patterns from the provided data, this is the heart of modeling
- <b>Predict</b> - use the fitted model to predict the target for given houses
- <b>Evaluate</b> - determine how accurate the model's predictions are

In [12]:
#from scikit-learn we are going to import the decision tree regressor
from sklearn.tree import DecisionTreeRegressor

#defining model, specifying a number for random_state
#to ensure the same results each run
model = DecisionTreeRegressor(random_state=1)

#fitting the model by using the X and y variables
model.fit(X, y)

- now that we have fit the model, we can use it to make predictions
- we are going to predict for only the first few rows of the training data, to see how the predictions work

In [13]:
print("Making predictions for the following houses")
print(X.head())

Making predictions for the following houses
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954


In [15]:
#going to predict the prices of above houses
print("Predictions are:")
print(model.predict(X.head()))

Predictions are:
[1035000. 1465000. 1600000. 1876000. 1636000.]


##Model Validation
- after building the model, you need to check how good it is
- we are going to use model validation to measure the quality of the model; measuring model quality is the key to improving your models
- how to summarize model quality?
if you compare predicted prices to actual values for 10,000 houses, you will get a mix of good and bad predictions.
looking through a list of 10,000 predicted and actual values is pointless. so, we need to summarize this into a single metric.
- there are many metrics for summarizing model quality
- here we are going to use <b>Mean Absolute Error</b> (also called <b>MAE</b>)
- the prediction error for each house is
<code>error = actual - predicted</code>
- so, if the actual value of a house is $150,000 and the predicted price is $100,000, the error is $50,000
- MAE takes the absolute value of each error and then averages those absolute errors
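
The calculation can be done by hand on three made-up houses. This sketch uses hypothetical prices and plain Python rather than scikit-learn, just to show what the metric does: take each error, drop its sign, then average.

```python
# Hypothetical actual and predicted prices for three houses
actual = [150000, 200000, 320000]
predicted = [100000, 210000, 290000]

# error = actual - predicted, computed house by house
errors = [a - p for a, p in zip(actual, predicted)]

# the absolute value turns every error into a positive number
absolute_errors = [abs(e) for e in errors]

# MAE is the average of the absolute errors
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)
print(mae)
```

Taking the absolute value first stops positive and negative errors from cancelling each other out in the average.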

In [16]:
#importing the MAE
from sklearn.metrics import mean_absolute_error

In [17]:
#next we are going find the error between actual and predicted
#setting a variable for predicted prices
predicted_prices = model.predict(X)

In [18]:
#the actual value is y and the predicted value is predicted_prices
mean_absolute_error(y, predicted_prices)

1115.7467183128902

- the score above is an "in-sample" score: we used the same sample of data to build the model and to evaluate it, so it looks far more accurate than it really is
- a model can be very accurate on its training data and still be inaccurate on new data, so we need to measure performance on data that wasn't used to build the model
- scikit-learn has a function called <b>train_test_split</b> that breaks up the data into two pieces.
- we will use one piece as training data to fit the model
- and we will use the other piece as validation data to calculate MAE
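
As a quick sketch of what the split produces, the example below runs `train_test_split` on 20 made-up rows (the `toy_*` names and values are assumptions for illustration). With no size argument given, scikit-learn holds out a quarter of the rows for validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up rows with 2 features each, and a made-up target
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.arange(20)

toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=0
)

# number of rows in each piece
print(len(toy_train_X), len(toy_val_X))

# no row appears in both pieces
print(set(toy_train_y) & set(toy_val_y))
```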

In [19]:
#importing train_test_split from scikit-learn
from sklearn.model_selection import train_test_split

In [21]:
#next we are going to split the data into training and validation sets
#and we are setting the random_state argument to get the same split
#every time the script runs
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
#fitting the model with training data
model.fit(train_X, train_y)

#then predicting the values for the validation data
val_train_predict = model.predict(val_X)
print(mean_absolute_error(val_y, val_train_predict))

273518.01872175594


###finally
The MAE went from about 1,100 dollars on in-sample data to more than 270,000 dollars on validation data.

##Underfitting and Overfitting
- now we are going to experiment with alternative models and see which gives the best predictions
- the decision tree has many options
- the most important option determines the tree's depth
- a tree's depth is a measure of how many splits it makes before coming to a prediction.
<img src="https://storage.googleapis.com/kaggle-media/learn/images/R3ywQsR.png">
In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree only had 1 split, it divides the data into 2 groups. If each group is split again, we would get 4 groups of houses. Splitting each of those again would create 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2**10 groups of houses by the time we get to the 10th level. That's 1024 leaves.
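
The doubling is easy to check with a few lines of Python. The helper name `max_leaves` is made up for this sketch, and the last line assumes, purely for illustration, that the 6196 rows left after `dropna` were spread evenly over the leaves.

```python
# Upper bound on the number of leaves when every group is split at each level
def max_leaves(depth):
    return 2 ** depth

for depth in [1, 2, 3, 10]:
    print(depth, max_leaves(depth))

# Rough average number of houses per leaf at depth 10
# (idealised: assumes the 6196 rows are spread evenly)
print(6196 / max_leaves(10))
```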


###overfitting
When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses). This is a phenomenon called <b>overfitting</b>.

###underfitting
At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called <b>underfitting</b>.
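
Both effects can be explored on synthetic data. The sketch below is not the Melbourne data: the noisy sine curve, the `toy_*` names and the leaf counts are all assumptions chosen for illustration. It prints the training MAE and validation MAE for a very small tree, a medium tree and a tree with no leaf limit.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus random noise
rng = np.random.RandomState(0)
toy_X = np.linspace(0, 1, 200).reshape(-1, 1)
toy_y = np.sin(6 * toy_X[:, 0]) + rng.normal(scale=0.3, size=200)

tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)

toy_results = {}
for leaves in [2, 10, None]:  # None puts no limit on the number of leaves
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    toy_results[leaves] = (
        mean_absolute_error(tr_y, tree.predict(tr_X)),  # training MAE
        mean_absolute_error(va_y, tree.predict(va_X)),  # validation MAE
    )
    print(leaves, toy_results[leaves])
```

The unlimited tree memorises its training rows, so its training MAE drops to zero, while the validation MAE is the number that tells us how it would do on new data.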

- we care about accuracy on new data, which we estimate from our validation data
- so we want to find the sweet spot between underfitting and overfitting
- visually, we want the low point of the (red) validation curve
<img src="https://storage.googleapis.com/kaggle-media/learn/images/AXSEOfI.png">

- there are a few alternatives for controlling the tree depth
- and many allow some routes through the tree to have greater depth than other routes.
- the <b>max_leaf_nodes</b> argument provides a very sensible way to control overfitting vs underfitting
- the more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area

In [22]:
#we are going to make a function to get MAE
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
  model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
  model.fit(train_X, train_y)
  preds_val = model.predict(val_X)
  mae = mean_absolute_error(val_y, preds_val)
  return mae

In [24]:
#we are going to use a for loop to compare the model's accuracy with
#different values for max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
  mae_val = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
  print(f"Max leaf nodes: {max_leaf_nodes}\nMean Absolute Error:{int(mae_val)}")


Max leaf nodes: 5
Mean Absolute Error:385696
Max leaf nodes: 50
Mean Absolute Error:279794
Max leaf nodes: 500
Mean Absolute Error:261718
Max leaf nodes: 5000
Mean Absolute Error:271320
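
Picking the winner does not have to be done by eye. This sketch copies the four scores printed above into a dictionary (the name `leaf_scores` is made up here) and asks Python for the key with the smallest MAE.

```python
# Validation MAE for each max_leaf_nodes value, copied from the output above
leaf_scores = {5: 385696, 50: 279794, 500: 261718, 5000: 271320}

# min with key=leaf_scores.get returns the key whose value is smallest
best_leaf_nodes = min(leaf_scores, key=leaf_scores.get)

print(best_leaf_nodes)
print(leaf_scores[best_leaf_nodes])
```

Of the options tried, 500 leaves gives the lowest validation MAE: fewer leaves underfit and 5000 leaves start to overfit.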


##Random Forests
- even today's most sophisticated modeling techniques face this tension between underfitting and overfitting
- but many models use clever ideas that can lead to better performance
- we are going to take <b>Random Forest</b> as an example

- a random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree
- it generally has much better predictive accuracy than a single decision tree and it works well with default parameters
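
The averaging can be checked directly. This is a sketch on synthetic data (the `toy_*` names, the data and the choice of 10 trees are assumptions for illustration); it relies on the `estimators_` attribute, which holds the fitted trees of a scikit-learn forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 60 rows, 2 features, and a simple made-up target
rng = np.random.RandomState(0)
toy_X = rng.rand(60, 2)
toy_y = 3 * toy_X[:, 0] + toy_X[:, 1]

toy_forest = RandomForestRegressor(n_estimators=10, random_state=1)
toy_forest.fit(toy_X, toy_y)

# Predict with every tree separately, then average the results
per_tree = np.array([tree.predict(toy_X[:5]) for tree in toy_forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction matches the average of its trees
print(np.allclose(manual_average, toy_forest.predict(toy_X[:5])))
```

Each tree sees a different random sample of the rows, so individual trees make different mistakes and the average tends to be steadier than any single tree.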


In [25]:
#importing the random_forest
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
forest_pre = forest_model.predict(val_X)
print(mean_absolute_error(val_y, forest_pre))

207190.6873773146
