### Model Evaluation & Validation Project

## <font color="#347797">Project Description</font>

You want to be the best real estate agent out there. In order to compete with other agents in your area, you decide to use machine learning. You are going to use various statistical analysis tools to build the best model to predict the value of a given house. Your task is to find the best price your client can sell their house at. The best guess from a model is one that best generalizes the data.

For this assignment your client has a house with the following feature set: [11.95, 0.00, 18.100, 0, 0.6590, 5.6090, 90.00, 1.385, 24, 680.0, 20.20, 332.09, 12.13]. To get started, use the example scikit implementation. You will have to modify the code slightly to get the file up and running.

When you are done implementing the code please answer the following questions in a report with the appropriate sections provided.

In [1]:
from boston_housing_students import *

In [2]:
city_data = load_data()

In [3]:
print(city_data.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

## <font color="#347797">Questions and Report Structure</font>

### 1) Statistical Analysis and Data Exploration

In [4]:
explore_city_data(city_data)

Number of data points              : 506
Number of features                 : 13
Minimum house price                : 5,000.00
Maximum house price                : 50,000.00
Mean house price                   : 22,532.81
Median house price                 : 21,200.00
Standard deviation of house prices : 9,188.01


### 2) Evaluating Model Performance

***Which measure of model performance is best to use for regression and predicting Boston housing data?***

Mean Absolute Error

***Why is this measurement most appropriate? Why might the other measurements not be appropriate here?***

We have seen several performance measurements for both classification and regression. Since this is a regression problem, precision, recall, accuracy and F1 score do not apply. For regression we have seen the mean squared error (MSE) and the mean absolute error (MAE). I would argue that the MAE is preferable because we do not necessarily want to put more weight on outliers. MSE squares each error, so a few large errors dominate the score. Some houses will be very expensive and they should not play a major role in predicting the house prices.
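
To make the outlier argument concrete, here is a minimal sketch in plain Python. The two helper functions are hand-written for illustration (they are not scikit-learn's implementations), and the prices are made-up values in $1000s, not rows from the Boston dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical prices in $1000s; the last house is the expensive outlier.
y_true = [20.0, 22.0, 25.0, 50.0]
y_pred = [21.0, 21.0, 24.0, 30.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

# Share of each metric that comes from the single outlier (error of 20).
outlier_share_mae = 20.0 / (mae * len(y_true))
outlier_share_mse = 20.0 ** 2 / (mse * len(y_true))

print(mae, mse)
print(round(outlier_share_mae, 3), round(outlier_share_mse, 3))
```

In this toy case the outlier accounts for roughly 87% of the MAE but more than 99% of the MSE, which is the extra weight the answer above wants to avoid.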

***Why is it important to split the data into training and testing data? What happens if you do not do this?***

Testing on the same data, or part of the same data, that was used for training makes the performance estimate overly optimistic. The model would be predicting data points it has already seen, which amounts to cheating. If we were to choose a model based on those predictions, the results could be catastrophic: we would have no idea how it performs on examples it has never seen before.
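
A hold-out split can be sketched in a few lines of standard-library Python. The function name `holdout_split`, the 30% test fraction and the seed are assumptions made for this illustration; they are not taken from the project code.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Stand-in for the 506 data points: row ids only, no real features.
rows = list(range(506))
train_rows, test_rows = holdout_split(rows)

print(len(train_rows), len(test_rows))
# No row appears on both sides of the split.
print(set(train_rows).isdisjoint(test_rows))
```

The model is then fitted on `train_rows` only, and its error is reported on `test_rows`.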

***Which cross validation technique do you think is most appropriate and why?***

At this point in the class we have only seen K-Fold, so I will go with that. As far as I can tell, the other CV techniques are largely variations on K-Fold. The main advantage of K-Fold is that every data point gets used for both training and testing, just never in the same round.
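
The idea behind K-Fold can be sketched as an index generator. This is a simplified, unshuffled version written for illustration; `k_fold_indices` is a hypothetical helper, and 5 folds is an arbitrary choice.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(506, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))

# Every sample lands in exactly one test fold.
tested = sorted(i for _, test_idx in folds for i in test_idx)
print(tested == list(range(506)))
```

Averaging the error over the folds gives an estimate that uses all 506 points for testing without ever testing on a point the model was trained on in that round.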

***What does grid search do and why might you want to use it?***

Hyperparameters can only be chosen empirically: we have to try a value and see whether it works well or not. Doing this manually is tedious, and fortunately libraries like Scikit Learn make it easy. Grid search tries every combination of the specified hyperparameter values and runs the learning algorithm with each one. It selects the best combination and uses it for further predictions. It's all nicely wrapped and easy to use. So basically grid search saves a lot of hassle and programming time, but takes more time to train. If the dataset is very big or the learning algorithm naturally takes a long time to train, like neural nets, then you might want to think twice, or at least narrow down the candidates first rather than throw a wide range of hyperparameters at it at random.
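
The exhaustive search itself is simple enough to sketch with `itertools.product`. Everything here is illustrative: `grid_search` is a hand-written stand-in rather than scikit-learn's `GridSearchCV`, `toy_error` is an arbitrary function standing in for a cross-validated error, and `min_samples_leaf` is added to the grid only to show more than one dimension.

```python
from itertools import product


def grid_search(param_grid, error_fn):
    """Try every combination in param_grid and keep the lowest error."""
    names = sorted(param_grid)
    best_params, best_error = None, float("inf")
    n_tried = 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        error = error_fn(params)
        n_tried += 1
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error, n_tried


def toy_error(params):
    """Arbitrary placeholder; a real search would fit and score a model here."""
    return abs(params["max_depth"] - 4) + 0.1 * params["min_samples_leaf"]


param_grid = {"max_depth": range(1, 11), "min_samples_leaf": [1, 5]}
best_params, best_error, n_tried = grid_search(param_grid, toy_error)

print(n_tried)
print(best_params)
```

The number of fits is the product of the grid sizes (10 × 2 = 20 here), which is why the cost grows quickly as more hyperparameters are added.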

### 3) Analyzing Model Performance

***Look at all learning curve graphs provided. What is the general trend of training and testing error as training size increases?***

<table>
    <tr>
        <td>
            <img src="figure_1.png" alt="Depth 1">
        </td>
        <td>
            <img src="figure_2.png" alt="Depth 2">
        </td>
        <td>
            <img src="figure_3.png" alt="Depth 3">
        </td>
    </tr>
    <tr>
        <td>
            <img src="figure_4.png" alt="Depth 4">
        </td>
        <td>
            <img src="figure_5.png" alt="Depth 5">
        </td>
        <td>
            <img src="figure_6.png" alt="Depth 6">
        </td>
    </tr>
    <tr>
        <td>
            <img src="figure_7.png" alt="Depth 7">
        </td>
        <td>
            <img src="figure_8.png" alt="Depth 8">
        </td>
        <td>
            <img src="figure_9.png" alt="Depth 9">
        </td>
    </tr>
    <tr>
        <td>
            <img src="figure_10.png" alt="Depth 10">
        </td>
    </tr>
</table>

Both training and testing error are quite high when the decision tree is shallow. The testing error soon stabilizes no matter the depth of the decision tree, whereas the training error keeps getting lower and lower as the depth increases, creating a huge gap between the testing and training error curves.

***Look at the learning curves for the decision tree regressor with max depth 1 and 10 (first and last learning curve graphs). When the model is fully trained does it suffer from either high bias/underfitting or high variance/overfitting?***

At depth 1, both training and testing error are high, which points to high bias (underfitting). As the depth increases, the variance gets higher: at depth 10 an overfitting problem appears, where the training error is close to zero but the testing error stays the same. That indicates an inability to generalize well to unseen data.

***Look at the model complexity graph. How do the training and test error relate to increasing model complexity?***

<img src="model complexity.png">

As we saw with the previous graphs, as the complexity of the model increases, the gap between the testing and training error grows wider. The testing error stays approximately the same after a certain point, so there is no need to increase the complexity beyond that point.

***Based on this relationship, which model (max depth) best generalizes the dataset and why?***

A depth of 4 or 5 is good enough for generalization. Beyond that the training error is getting lower, but not the testing error. If increasing the complexity of the model does not allow better generalization, then we should stick to a point where testing and training error are closer. We will save computation time as a side effect.
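
The selection rule described here (take the simplest model once the testing error stops improving) can be written down as a small helper. This is only a sketch: `smallest_adequate_depth` and the 5% tolerance are assumptions, and the error values are made-up placeholders, not numbers read from the model complexity graph.

```python
def smallest_adequate_depth(test_error_by_depth, tolerance=0.05):
    """Return the shallowest depth whose test error is within
    `tolerance` (relative) of the best test error observed."""
    best = min(test_error_by_depth.values())
    for depth in sorted(test_error_by_depth):
        if test_error_by_depth[depth] <= best * (1 + tolerance):
            return depth


# Placeholder errors for illustration only (NOT the project's results).
placeholder_errors = {1: 10.0, 2: 6.0, 3: 5.0, 4: 5.1, 5: 5.0}
chosen_depth = smallest_adequate_depth(placeholder_errors)

print(chosen_depth)
```

With these placeholder values the helper stops at depth 3, because deeper trees do not reduce the testing error any further.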

### 4) Model Prediction

***Model makes predicted housing price with detailed model parameters***

GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, random_state=None,
           splitter='best'),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'max_depth': (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring=make_scorer(mean_absolute_error), verbose=0)
       
       
House: [11.95, 0.0, 18.1, 0, 0.659, 5.609, 90.0, 1.385, 24, 680.0, 20.2, 332.09, 12.13]

Prediction: [ 19.93372093]

***Compare prediction to earlier statistics***

The predicted price is $19,933.72. It is within one standard deviation of the mean and not too far from the mean and the median.

The interquartile range of the provided examples is 17,025-25,000, so the predicted price sits well inside it and is nowhere near the outliers.

Minimum house price                : 5,000.00<br>
Maximum house price                : 50,000.00<br>
Mean house price                   : 22,532.81<br>
Median house price                 : 21,200.00<br>
Standard deviation of house prices : 9,188.01<br>
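
The comparison can be checked numerically from the figures reported above. The 1.5 × IQR fences used to flag outliers are a common convention assumed here; the report itself does not name an outlier rule.

```python
# Figures taken from the report (prices in dollars).
mean_price = 22532.81
std_price = 9188.01
q1, q3 = 17025.0, 25000.0

# The model predicts in $1000s.
predicted_price = 19.93372093 * 1000

# Distance from the mean in standard deviations.
z_score = (predicted_price - mean_price) / std_price
within_one_std = abs(z_score) <= 1

# Interquartile range check and assumed 1.5 * IQR outlier fences.
in_iqr = q1 <= predicted_price <= q3
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = not (lower_fence <= predicted_price <= upper_fence)

print(round(z_score, 2), within_one_std, in_iqr, is_outlier)
```

The prediction lies about 0.28 standard deviations below the mean, inside the interquartile range and well within the assumed fences.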

