# Statistical Analysis
   
   <p>All requested statistics for the Boston Housing dataset are accurately calculated. Student correctly leverages NumPy functionality to obtain these results.</p>

# Answer:

In [None]:
import numpy as np
import pandas as pd

data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)

print("Boston housing dataset has {} datapoints with {} variables each".format(*data.shape))

min_price = np.min(prices)
print("Min:"+"\n"+str(min_price))

max_price = np.max(prices)
print("Max:"+"\n"+str(max_price))

median_price = np.median(prices)
print("Median:"+"\n"+str(median_price))

std_price = np.std(prices)
print("Standard Deviation:"+"\n"+str(std_price))


# Question 1: Feature Observation

   
   <p>
    'RM' is the average number of rooms among homes in the neighborhood.
    'LSTAT' is the percentage of homeowners in the neighborhood considered "lower class" (working poor). 
    'PTRATIO' is the ratio of students to teachers in primary and secondary schools in the neighborhood.
    
    Using your intuition, for each of the three features above, do you think that an increase in the value of that feature would lead to an increase in the value of 'MEDV' or a decrease in the value of 'MEDV'? Justify your answer for each.</p>

# Answer:

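   <p>Independent variables:
    1) RM: Average number of rooms among homes in the neighborhood.
    2) LSTAT: Percentage of the neighborhood population considered lower class (working poor).
    3) PTRATIO: Pupil-teacher ratio in the neighborhood's schools.
    
    Dependent variable:
    1) MEDV: Price of the house
    
    Expected correlations:
    1) RM: Positively correlated with price. More rooms usually mean a larger home, so as RM goes up, price goes up.
    2) LSTAT: Negatively correlated with price. A larger share of working-poor residents suggests lower purchasing power in the neighborhood, so as LSTAT goes up, price goes down.
    3) PTRATIO: Negatively correlated with price. More students per teacher suggests less well-resourced schools, which makes the neighborhood less attractive to buyers, so as PTRATIO goes up, price goes down.</p>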
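
   <p>One way to check these intuitions is to compute the correlation of each feature with the price. The sketch below uses a handful of made-up rows (an assumption for illustration, not values from housing.csv), and <code>feature_correlations</code> is a hypothetical helper name. Running the same function on the real <code>data</code> frame would show whether the signs match the expectations above.</p>

```python
import pandas as pd

# Made-up rows for illustration only; they are NOT taken from housing.csv.
toy = pd.DataFrame({
    "RM":      [4.5, 5.5, 6.0, 6.8, 7.5],
    "LSTAT":   [30.0, 22.0, 14.0, 8.0, 4.0],
    "PTRATIO": [21.0, 20.0, 18.5, 16.0, 14.0],
    "MEDV":    [200000, 310000, 420000, 560000, 800000],
})

def feature_correlations(df, target="MEDV"):
    """Pearson correlation of every feature column with the target column."""
    return df.drop(target, axis=1).corrwith(df[target])

correlations = feature_correlations(toy)
# A positive value means the feature rises with price, a negative value the opposite.
print(correlations)
```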

# Question 2 - Goodness of Fit
   
   <p>
    Assume that a dataset contains five data points and a model made the following predictions for the target variable:
    <br>
    <br>
    True Values:[3.0, -0.5, 2.0, 7.0, 4.2]<br>	
    Predictions:[2.5, 0.0, 2.1, 7.8, 5.3]<br><br>
        
    Run the code cell below to use the performance_metric function and calculate this model's coefficient of determination.
  </p>

In [None]:
# Calculate the performance of this model
score = performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
print("Model has a coefficient of determination, R^2, of {:.3f}.".format(score))

   
   <p>Would you consider this model to have successfully captured the variation of the target variable? Why or why not?</p>


# Answer:

   <p>The r2_score function was imported from sklearn.metrics and the arrays of true and predicted values were plugged in. The resulting value calculated was 0.923.</p>

In [None]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    # R^2 is a score (higher is better), not an error
    score = r2_score(y_true, y_predict)
    return score

print("{:.3f}".format(performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])))
# Output: 0.923


   <p>I would say this model is fairly successful in capturing the variation of the target variable for this data. An R^2 of 0.923 means that about 92% of the variance in the true values is explained by the predictions. While the predictions are off by about 0.1-1.1, they are consistently higher for the higher true values and lower for the lower true values.</p>
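
   <p>As a cross-check, the same value can be reproduced by hand from the definition R^2 = 1 - SS_res / SS_tot. This is a minimal sketch using only NumPy and the five values from the question. The variable names are illustrative.</p>

```python
import numpy as np

# Values from the question above.
y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.2])
y_pred = np.array([2.5, 0.0, 2.1, 7.8, 5.3])

# Residual sum of squares: squared error left over after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: squared error of always predicting the mean of y_true.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot
print("R^2 = {:.3f}".format(r2))  # prints "R^2 = 0.923"
```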

# Question 3:
   
   <p>What is the benefit to splitting a dataset into some ratio of training and testing subsets for a learning algorithm?</p>

# Answer:
   
   <p>Saving a portion of the dataset for testing allows us to evaluate how well our model generalizes to data it has not seen and to determine whether it is overfitting or underfitting. If the model works well for the training data but poorly for the testing data, then the model overfits. If the model works poorly on both the training data and the testing data, then the model underfits. 
    
    Below, the code for splitting the dataset is shown: </p>

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
#Obtain the data
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)
#Split the data: features first, then the target, so the outputs unpack as X, X, y, y
X_train, X_test, y_train, y_test = train_test_split(features,
                                                    prices,
                                                    test_size=0.2)

# Question 4:
   
   <p>Choose one of the graphs above and state the maximum depth for the model.
What happens to the score of the training curve as more training points are added? What about the testing curve?
Would having more training points benefit the model?
</p>

# Answer
   
   <p>Below, the learning curves produced by the line 'vs.ModelLearning(features, prices)' in boston_housing.py are shown.</p>

<img src="figure_1.png">

   
   <p>Based on these images, the best fit is the model with max_depth=3, because the scores for the testing and training sets converge at a fairly high level. About 150 data points are enough for the two scores to nearly converge, and beyond 300 data points the model does not improve significantly, so adding more training points would bring little benefit.</p>

# Question 5
   
   <p>When the model is trained with a maximum depth of 1, does the model suffer from high bias or from high variance?
How about when the model is trained with a maximum depth of 10? What visual cues in the graph justify your conclusions?</p>

# Answer

   <p>The model with max_depth 10 always has a high score for the training data, but its score for the testing data never gets very good no matter how many data points are provided. The persistent gap between the two curves indicates a model with high variance/overfitting.
    
    The model with a max_depth of 1 never has a very high score for either the training or testing sets, but the two score values converge fairly quickly. This is a model with high bias/underfitting.</p>
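
   <p>The same pattern can be seen numerically by comparing training and testing R^2 at several depths. The sketch below is an illustration only: it uses a synthetic noisy sine curve rather than housing.csv, and the sample size, noise level, and seeds are assumptions. A large gap between the two scores points to high variance. Two low scores that sit close together point to high bias.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not housing.csv):
# a noisy sine curve with a single input feature.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scores = {}
for depth in (1, 3, 10):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # For regressors, score() returns the coefficient of determination R^2.
    scores[depth] = (model.score(X_train, y_train),
                     model.score(X_test, y_test))
    print("max_depth={:2d}  train R^2={:.3f}  test R^2={:.3f}".format(
        depth, *scores[depth]))
```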

# Question 6
   
   <p>Which maximum depth do you think results in a model that best generalizes to unseen data?
What intuition led you to this answer?</p>

# Answer:
   
   <p>The best model appears to be the model with max_depth=3, because the scores for the training and testing sets seem to converge at a fairly high value, using a reasonable number of data points.</p>

# Question 7
   
   <p>What is the grid search technique? How can it be applied to optimize a learning algorithm?</p>

# Answer
    
   <p>The grid search technique tests a range of models against the data set, with the models varying by one or more hyperparameters (e.g. polynomial degree, maximum tree depth, etc.), which can be imagined as axes on a grid. Each point on the grid represents one combination of hyperparameter values, i.e. one candidate model. Each point is trained and given a score on validation data, and the model at the best-scoring point is selected as the optimal model.</p>
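
   <p>A minimal sketch of a two-axis grid, using scikit-learn's <code>GridSearchCV</code> on synthetic data (the data, the choice of <code>min_samples_leaf</code> as a second axis, and the value ranges are assumptions for illustration, not part of the project). Five depths times three leaf sizes gives fifteen grid points, each scored by cross-validated R^2.</p>

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption, not housing.csv).
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Each key is one axis of the grid; every combination is one candidate model.
param_grid = {
    "max_depth": [1, 2, 3, 4, 5],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="r2",  # built-in R^2 scorer; higher is better
    cv=5,
)
search.fit(X, y)

print("Candidates evaluated:", len(search.cv_results_["params"]))
print("Best parameters:", search.best_params_)
print("Best mean CV R^2: {:.3f}".format(search.best_score_))
```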

# Question 8
   <p>What is the k-fold cross-validation training technique?
What benefit does this technique provide for grid search when optimizing a model?</p>

# Answer
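   <p>The drawback to splitting the data set into distinct training and testing sets is that each data point is used for only one purpose, so the score depends on which points happen to land in the testing set. With k-fold cross-validation, we instead split the data points into a number (k) of buckets. The model is trained k times, each time holding out a different bucket for testing and training on the remaining k-1 buckets, and the k scores are averaged. This way, we get to use all the data for both testing and training. For grid search, this means each parameter combination is judged by an averaged score rather than by a single split, so the chosen parameters are less likely to be tuned to one particular validation set.</p>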
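
   <p>The bucket rotation can be sketched with NumPy alone. <code>k_fold_indices</code> is a hypothetical helper written for this illustration. In practice scikit-learn's <code>KFold</code> does the same job. The example shows that with ten points and five folds, every point is held out for testing exactly once.</p>

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index arrays for k-fold cross-validation."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)  # k buckets of nearly equal size
    for i in range(k):
        test_idx = folds[i]
        # Train on every bucket except the one currently held out.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Ten data points, five folds: each point lands in a test bucket exactly once.
held_out = []
splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    held_out.extend(test_idx.tolist())
    print("train:", sorted(train_idx.tolist()),
          "test:", sorted(test_idx.tolist()))
```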

# Question 9
   <p>What maximum depth does the optimal model have? How does this result compare to your guess in Question 6?</p>

# Answer
   <p>The code for my fit_model function is shown below: </p>

In [None]:
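from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def fit_model(X, y):
    regressor = DecisionTreeRegressor()
    parameters = {'max_depth': (1, 2, 3, 4, 5, 6, 7, 8, 9)}
    # performance_metric returns R^2, a score, so larger values are better
    scoring_function = make_scorer(performance_metric,
                                   greater_is_better=True)
    reg = GridSearchCV(regressor, parameters, scoring=scoring_function)
    reg.fit(X, y)
    print(reg.best_estimator_)
    return reg.best_estimator_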

<br><p>The printed value of <code>reg.best_estimator_</code> is shown below:</p><br>

In [None]:
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

   <p>The max_depth of the optimal model is 4. This is close to my best guess of 3 from the learning curves.</p>

# Question 10
   <p> Imagine that you were a real estate agent in the Boston area looking to use this model to help price homes owned by your clients that they wish to sell. You have collected the following information from three of your clients:
<table>    
    <tr>
        <td>Feature</td><td>Client 1</td><td>Client 2</td><td>Client 3</td>
    </tr>
    <tr>
        <td>total number of rooms in home</td><td>5</td><td>4</td><td>8</td>
    </tr>
    <tr>
        <td>Neighborhood poverty (as %)</td>
        <td>17%</td>
        <td>32%</td>
        <td>3%</td>
    </tr>
    <tr>
        <td>Student-teacher ratio</td>
        <td>15-to-1</td>
        <td>22-to-1</td>
        <td>12-to-1</td>
    </tr>
		    	    
</table>
<br>
What price would you recommend each client sell his/her home at?

<br>
Do these prices seem reasonable given the values for the respective features?</p>

# Answer
   <p>The code for predicting the home values is shown below:</p>

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

client_data = [[5,17,15],[4,32,22],[8,3,12]]

data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)

def performance_metric(y_true, y_predict):
    # Coefficient of determination between true and predicted values
    r2 = r2_score(y_true, y_predict)
    return r2

def fit_model(X, y):
    regressor = DecisionTreeRegressor()
    parameters = {'max_depth': (1, 2, 3, 4, 5, 6, 7, 8, 9)}
    # R^2 is a score, so larger values are better
    scoring_function = make_scorer(performance_metric,
                                   greater_is_better=True)
    reg = GridSearchCV(regressor, parameters, scoring=scoring_function)
    reg.fit(X, y)
    print(reg.best_estimator_)
    return reg.best_estimator_

print("Predicted sales prices for client input data:"+"\n"+str(fit_model(features, prices).predict(client_data)))

In [None]:
#Predicted sales prices for client input data:
>>[ 408800, 231253.44827586, 938053.84615385]

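   <p>These numbers seem reasonable given the features. Client 3's home has the most rooms (8), the lowest neighborhood poverty (3%), and the lowest student-teacher ratio (12-to-1), and it receives the highest predicted price. Client 2's home has the fewest rooms (4), the highest poverty (32%), and the highest student-teacher ratio (22-to-1), and it receives the lowest. Client 1's home falls between the two on every feature and on price.</p>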