## Model Evaluation and Validation

## Project: Predicting Boston Housing Prices

In [1]:
#import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LinearRegression

#import Supplementary visualization code visuals.py
import visuals as vs

# pretty display for notebook
%matplotlib inline

#load the Boston housing dataset

data = pd.read_csv('dataset/housing.csv')
prices = data['MEDV']
features = data.drop('MEDV',axis=1)

# Success
print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))

Boston housing dataset has 489 data points with 4 variables each.


In [2]:
# Check how many data points and features we have ('MEDV' is the target, not a feature)
print("We have {} Data Points.".format(features.shape[0]))
print("We have {} Features.".format(features.shape[1]))

We have 489 Data Points.
We have 3 Features.


In [3]:
#Checking which features we have
data.columns

Index(['RM', 'LSTAT', 'PTRATIO', 'MEDV'], dtype='object')

In [4]:
features.shape

(489, 3)

In [5]:
features.head(5)

|   | RM | LSTAT | PTRATIO |
| :-: | :---: | :---: | :-----: |
| 0 | 6.575 | 4.98 | 15.3 |
| 1 | 6.421 | 9.14 | 17.8 |
| 2 | 7.185 | 4.03 | 17.8 |
| 3 | 6.998 | 2.94 | 18.7 |
| 4 | 7.147 | 5.33 | 18.7 |


In [6]:
prices.head(5)

0    504000.0
1    453600.0
2    728700.0
3    701400.0
4    760200.0
Name: MEDV, dtype: float64

In [7]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489 entries, 0 to 488
Data columns (total 3 columns):
RM         489 non-null float64
LSTAT      489 non-null float64
PTRATIO    489 non-null float64
dtypes: float64(3)
memory usage: 11.5 KB


In [8]:
prices.describe()

count    4.890000e+02
mean     4.543429e+05
std      1.653403e+05
min      1.050000e+05
25%      3.507000e+05
50%      4.389000e+05
75%      5.187000e+05
max      1.024800e+06
Name: MEDV, dtype: float64

In [9]:
#Minimum price
minimum_price = np.min(prices)
print("Minimum price: ${}".format(minimum_price))

Minimum price: $105000.0


In [10]:
#Maximum Price
maximum_price = np.max(prices)
print("Maximum price: ${}".format(maximum_price))

Maximum price: $1024800.0


In [11]:
#Mean Price
mean_price = np.mean(prices)
print("Mean Price: ${}".format(mean_price))

Mean Price: $454342.9447852761


In [12]:
#Median Of price
median_price = np.median(prices)
print("Median price: ${}".format(median_price))

Median price: $438900.0


In [13]:
#Standard deviation of price
std_price = np.std(prices)
print("Standard deviation of prices: ${}".format(std_price))

Standard deviation of prices: $165171.13154429477
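This value is slightly lower than the `std` reported by `prices.describe()` above (1.653403e+05). The likely reason is that `np.std` defaults to the population formula (`ddof=0`), while pandas defaults to the sample formula (`ddof=1`). The short check below is only a sketch: it rescales the number printed above using the 489 data points and should land close to the `describe()` value.

```python
import math

n = 489                               # number of data points reported above
population_std = 165171.13154429477   # np.std(prices) output copied from the cell above

# Convert a population standard deviation (divide by n) into a
# sample standard deviation (divide by n - 1).
sample_std = population_std * math.sqrt(n / (n - 1))

print("Sample standard deviation: ${:.1f}".format(sample_std))
```

Passing `ddof=1` to `np.std` should give the same sample estimate directly.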


### Question 1 - Feature Observation
As a reminder, we are using three features from the Boston housing dataset: `'RM'`, `'LSTAT'`, and `'PTRATIO'`. For each data point (neighborhood):
- `'RM'` is the average number of rooms among homes in the neighborhood.
- `'LSTAT'` is the percentage of homeowners in the neighborhood considered "lower class" (working poor).
- `'PTRATIO'` is the ratio of students to teachers in primary and secondary schools in the neighborhood.


**Using your intuition, for each of the three features above, do you think that an increase in the value of that feature would lead to an *increase* or a *decrease* in the value of `'MEDV'`? Justify your answer for each.**

- Would you expect a home that has an `'RM'` value (number of rooms) of 6 to be worth more or less than a home that has an `'RM'` value of 7?
- Would you expect a neighborhood that has an `'LSTAT'` value (percent of lower class workers) of 15 to have home prices worth more or less than a neighborhood that has an `'LSTAT'` value of 20?
- Would you expect a neighborhood that has a `'PTRATIO'` value (ratio of students to teachers) of 10 to have home prices worth more or less than a neighborhood that has a `'PTRATIO'` value of 15?

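## Observations:
- If we increase the value of `'RM'`, we expect the price of a house to increase, since more rooms usually mean a larger home.
- If we increase the value of `'LSTAT'`, we expect the price of a house to decrease, since a higher share of lower-income residents usually goes with lower home values in the neighborhood.
- If we increase the value of `'PTRATIO'`, we expect the price of a house to decrease, since more students per teacher suggests less well-resourced schools, which tends to make a neighborhood less attractive to buyers.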


## Developing a Simple Model Without Extra Effort

In [14]:
# Let's fit the data to a linear regression model
X = features
y = prices
model_reg = LinearRegression()
model_reg.fit(X,y)

sample_house = [[3.575, 1.98, 10.0]]

#Predict housing price for the sample_house
prediction = model_reg.predict(sample_house)
#Print Prediction
print("The Predicted Price For {} Sample is :: ${}".format(*sample_house,prediction[0]))

The Predicted Price For [3.575, 1.98, 10.0] Sample is :: $508532.26538074866


## Let's Verify the Above Observations:


In [15]:
#increase the value of 'RM'
sample_house = [[10.575, 1.98, 10.0]]
#Predict housing price for the sample_house
prediction = model_reg.predict(sample_house)
#Print Prediction
print("The Predicted Price For {} Sample is :: ${}".format(*sample_house,prediction[0]))

The Predicted Price For [10.575, 1.98, 10.0] Sample is :: $1114488.9183116534


In [16]:
#increasing the value of 'LSTAT'
sample_house = [[3.575, 4.0, 10.0]]
#Predict housing price for the sample_house
prediction = model_reg.predict(sample_house)
#Print Prediction
print("The Predicted Price For {} Sample is :: ${}".format(*sample_house,prediction[0]))

The Predicted Price For [3.575, 4.0, 10.0] Sample is :: $486616.59780546196


In [17]:
#increasing the value of 'PTRATIO'
sample_house = [[3.575, 1.98, 15.0]]
#Predict housing price for the sample_house
prediction = model_reg.predict(sample_house)
#Print Prediction
print("The Predicted Price For {} Sample is :: ${}".format(*sample_house,prediction[0]))


The Predicted Price For [3.575, 1.98, 15.0] Sample is :: $411071.6872050121
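Because the model is linear, the change in predicted price per unit change in one feature is constant, so the four predictions above are enough to recover the direction and approximate size of each effect. The sketch below uses only the numbers printed in the cells above; the variable names are introduced here for illustration.

```python
# Predictions copied from the notebook output above.
baseline = 508532.26538074866     # sample [3.575, 1.98, 10.0]
rm_up = 1114488.9183116534        # RM raised from 3.575 to 10.575
lstat_up = 486616.59780546196     # LSTAT raised from 1.98 to 4.0
ptratio_up = 411071.6872050121    # PTRATIO raised from 10.0 to 15.0

# Change in predicted price divided by the change in the feature value.
slopes = {
    "RM": (rm_up - baseline) / (10.575 - 3.575),
    "LSTAT": (lstat_up - baseline) / (4.0 - 1.98),
    "PTRATIO": (ptratio_up - baseline) / (15.0 - 10.0),
}

for name, slope in slopes.items():
    print("{}: {:+.2f} dollars per unit increase".format(name, slope))
```

The signs agree with the observations: positive for `'RM'`, negative for `'LSTAT'` and `'PTRATIO'`. For a fitted `LinearRegression`, these values should match the entries of `model_reg.coef_`.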


## Performance Metric
The coefficient of determination, R<sup>2</sup>, typically ranges from 0 to 1 and measures the proportion of the variance in the **target variable** that the model's predictions account for. A model with an R<sup>2</sup> of 0 is no better than a model that always predicts the *mean* of the target variable, whereas a model with an R<sup>2</sup> of 1 perfectly predicts the target variable. Any value between 0 and 1 indicates what fraction of the variance in the target variable can be explained by the **features** using this model. _A model can also have a negative R<sup>2</sup>, which indicates that it performs **worse** than one that always predicts the mean of the target variable; there is no lower bound on how negative the score can be._
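For reference, the standard definition of the coefficient of determination can be written as follows. The symbols are notation introduced here, not taken from the notebook: $y_i$ are the true values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the true values.

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

The numerator is the model's squared error and the denominator is the squared error of always predicting the mean. When the numerator exceeds the denominator, the ratio is greater than 1 and R<sup>2</sup> becomes negative.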

For the `performance_metric` function in the code cell below, you will need to implement the following:
- Use `r2_score` from `sklearn.metrics` to calculate the performance score between `y_true` and `y_predict`.
- Assign the performance score to the `score` variable.

In [18]:
from sklearn.metrics import r2_score

In [22]:
# 'r2_score'

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between 
        true and predicted values based on the metric chosen. """
    
    # Calculate the performance score between 'y_true' and 'y_predict'
    score = r2_score(y_true,y_predict)
    
    # Return the score
    return score

### Question 2 - Goodness of Fit
Assume that a dataset contains five data points and a model made the following predictions for the target variable:

| True Value | Prediction |
| :-------------: | :--------: |
| 3.0 | 2.5 |
| -0.5 | 0.0 |
| 2.0 | 2.1 |
| 7.0 | 7.8 |
| 4.2 | 5.3 |

Run the code cell below to use the `performance_metric` function and calculate this model's coefficient of determination.

In [23]:
# Calculate the performance of this model
score = performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
print("Model has a coefficient of determination, R^2, of {:.3f}.".format(score))

Model has a coefficient of determination, R^2, of 0.923.
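The score above can be reproduced by hand, which makes it clearer what the number measures. The sketch below computes R<sup>2</sup> as one minus the ratio of the residual sum of squares to the total sum of squares, using plain Python and the five data points from the table. The names `ss_res` and `ss_tot` are introduced here for illustration.

```python
y_true = [3, -0.5, 2, 7, 4.2]
y_predict = [2.5, 0.0, 2.1, 7.8, 5.3]

mean_true = sum(y_true) / len(y_true)

# Squared error of the model's predictions.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_predict))
# Squared error of always predicting the mean of the true values.
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot
print("SS_res = {:.3f}, SS_tot = {:.3f}, R^2 = {:.3f}".format(ss_res, ss_tot, r2))
# prints: SS_res = 2.360, SS_tot = 30.592, R^2 = 0.923
```

The model's squared error is only about 8% of what the mean-only baseline would incur, which is where the 0.923 comes from.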


* Would you consider this model to have successfully captured the variation of the target variable? 
* Why or why not?

**Hint:** The R2 score is the proportion of the variance in the dependent variable that is predictable from the independent variable. In other words:
* An R2 score of 0 means that the dependent variable cannot be predicted from the independent variable.
* An R2 score of 1 means that the dependent variable can be predicted perfectly from the independent variable.
* An R2 score between 0 and 1 indicates the extent to which the dependent variable is predictable. For example, an R2 score of 0.40 means that 40 percent of the variance in Y is predictable from X.

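**Answer:** Yes. An R<sup>2</sup> of 0.923 means that about 92.3% of the variance in Y is predictable from X, so the model captures most of the variation in the target variable. With only five data points, though, this score says little about how the model would perform on new data.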

### Implementation: Shuffle and Split Data
The next implementation requires that you take the Boston housing dataset and split it into training and testing subsets. Typically, the data is also shuffled into a random order when creating the training and testing subsets to remove any bias in the ordering of the dataset.

For the code cell below, you will need to implement the following:
- Use `train_test_split` from `sklearn.model_selection` to shuffle and split the `features` and `prices` data into training and testing sets.
  - Split the data into 80% training and 20% testing.
  - Set the `random_state` for `train_test_split` to a value of your choice. This ensures results are consistent.
- Assign the training and testing splits to `X_train`, `X_test`, `y_train`, and `y_test`.

In [27]:
#import train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=30)
# Success
print("Training and testing split was successful.")

Training and testing split was successful.
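To make the "shuffle, then split" idea concrete, here is a minimal pure-Python sketch of the same steps applied to row indices. It is a hand-written stand-in, not scikit-learn's implementation: `shuffle_split_indices` is a hypothetical helper, rounding the test size up is an assumption, and the rows it selects will differ from those chosen by `train_test_split` even with the same seed.

```python
import math
import random

def shuffle_split_indices(n_samples, test_size=0.20, seed=30):
    """Hypothetical helper: shuffle row indices, then split into train/test."""
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(indices)
    # Assumption: the test set size is rounded up to a whole number of rows.
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split_indices(489)
print("Training rows: {}, testing rows: {}".format(len(train_idx), len(test_idx)))
```

With 489 data points and a 20% test share, this gives 391 training rows and 98 testing rows, and no row appears in both subsets.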


### Question 3 - Training and Testing

* What is the benefit of splitting a dataset into some ratio of training and testing subsets for a learning algorithm?

**Think about how overfitting or underfitting is contingent upon how the data is split.**