# Boston House Prices dataset

## Notes

**Data Set Characteristics:**

* Number of Instances: 506 
* Number of Attributes: 13 numeric/categorical predictive
* Median Value (attribute 14) is usually the target
* Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's
* Missing Attribute Values: None
* Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of the UCI ML housing dataset:
http://archive.ics.uci.edu/ml/datasets/Housing

The dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University.

In [2]:
# https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9
    
import numpy as np
import pandas as pd
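from sklearn import datasets  # scikit-learn's bundled datasets module
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions.
data = datasets.load_boston()  # load the Boston dataset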


In [3]:

# Build a DataFrame of the predictors, using the dataset's feature names as column labels
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


In [5]:
# From here on, the notebook works with a separate stock-price dataset rather than the Boston data
df = pd.read_csv("stocks.csv")
df.head()

Unnamed: 0,date,close,volume,open,high,low
0,2018/03/16,28.41,125084.0,28.35,28.57,28.21
1,2018/03/15,28.29,237656.0,28.3,28.54,28.08
2,2018/03/14,28.33,108618.0,28.08,28.46,28.01
3,2018/03/13,28.01,108562.0,28.34,28.49,27.94
4,2018/03/12,28.29,99621.0,28.29,28.42,28.12


In [7]:
df.corr()

Unnamed: 0,close,volume,open,high,low
close,1.0,0.102636,0.930245,0.978459,0.967709
volume,0.102636,1.0,0.076809,0.122905,0.052798
open,0.930245,0.076809,1.0,0.957596,0.973654
high,0.978459,0.122905,0.957596,1.0,0.964729
low,0.967709,0.052798,0.973654,0.964729,1.0
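
`df.corr()` computes pairwise Pearson correlation coefficients by default. As a minimal sketch of what a single entry in that table means, the function below computes the coefficient for two lists in plain Python. The name `pearson` is illustrative; the sample values are the `high` and `low` columns of the five rows shown by `df.head()` above, so the result will differ from the full-data value in the table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Only the five head() rows, not the full dataset
highs = [28.57, 28.54, 28.46, 28.49, 28.42]
lows = [28.21, 28.08, 28.01, 27.94, 28.12]
print(pearson(highs, lows))
```

A value near 1 or -1 indicates a strong linear relationship, which is why `open`, `high`, and `low` all correlate so strongly with `close` in the table.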


In [8]:
df = df.dropna()

In [10]:
X = df[['volume','open','high','low']]
y = df[['close']]

In [15]:
# Split Data
# Now we can split our data into a training set and a test set:

from sklearn.model_selection import train_test_split

# Split X and y into training and test sets, holding out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
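
To make the idea behind the split concrete, here is a minimal plain-Python sketch of a seeded shuffle-and-hold-out split. It is an illustration of the concept only, not scikit-learn's implementation; the helper name `simple_train_test_split` and the eight stand-in rows are hypothetical.

```python
import random

def simple_train_test_split(rows, test_size=0.25, seed=1):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(8))  # hypothetical stand-in for 8 data rows
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 6 2
```

Fixing the seed plays the same role as `random_state=1` above: rerunning the cell gives the same partition, so scores are comparable between runs.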

In [16]:
# Train Model
# We train our LinearRegression model using the training set of data.

from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [17]:
# Now that our model is trained, we can view its coefficients in regression_model.coef_.
# Because y was passed as a one-column DataFrame, coef_ is a 2D array with one row of
# coefficients per target, so we read the coefficients from row 0.

for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))

The coefficient for volume is -5.074218482083912e-08
The coefficient for open is -0.5467155020487786
The coefficient for high is 0.806823599855042
The coefficient for low is 0.7368877862586202


In [20]:
# regression_model.intercept_ holds one intercept (β0) per target; we have a single target, so we take element 0


intercept = regression_model.intercept_[0]

print("The intercept for our model is {}".format(intercept))

The intercept for our model is 0.07705062402589746


## Scoring Model

A common method of measuring the accuracy of regression models is to use the R^2 statistic.

The R^2 statistic is defined as follows:

R^2 = 1 - (RSS/TSS)

* The RSS (residual sum of squares) measures the variability left unexplained after performing the regression
* The TSS (total sum of squares) measures the total variability in Y
* Therefore the R^2 statistic measures the proportion of variability in Y that is explained by X using our model
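
The definition above can be sketched directly in plain Python. The function name `r_squared` and the sample values are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from the stock data.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS/TSS for paired true and predicted values."""
    mean_y = sum(y_true) / len(y_true)
    # RSS: squared error left over after the model's predictions
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # TSS: squared deviation of the true values from their mean
    tss = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - rss / tss

# Hypothetical values, for illustration only
y_true = [28.0, 28.5, 29.0, 29.5]
y_pred = [28.1, 28.4, 29.1, 29.4]
print(r_squared(y_true, y_pred))  # roughly 0.968
```

Two reference points help when reading the result: perfect predictions give exactly 1, and always predicting the mean of Y gives 0.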

In [22]:
# R^2 can be determined using our test set and the model's score method.

regression_model.score(X_test, y_test)

# So in our model, 95.7% of the variability in Y in the test set can be explained using X

0.95731510573595224

In [23]:
# We can also compute the mean squared error with scikit-learn's mean_squared_error function,
# comparing the predictions for the test set (data not used for training) with its ground truth:

from sklearn.metrics import mean_squared_error

y_predict = regression_model.predict(X_test)

# Pass the ground truth first, matching the function's (y_true, y_pred) signature
regression_model_mse = mean_squared_error(y_test, y_predict)

regression_model_mse

0.023328878725050219

In [24]:
import math

# Root mean squared error (RMSE), expressed in the same units as the close price
math.sqrt(regression_model_mse)

0.1527379413408804
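
As a minimal sketch of what these two numbers are, the snippet below computes the mean squared error and its square root by hand. The helper name `mean_squared_error_manual` and the sample values are hypothetical; scikit-learn's `mean_squared_error` remains the function actually used in this notebook.

```python
import math

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values, for illustration only: every prediction is off by 0.2
y_true = [28.0, 28.5, 29.0, 29.5]
y_pred = [28.2, 28.3, 29.2, 29.3]

mse = mean_squared_error_manual(y_true, y_pred)
rmse = math.sqrt(mse)
print(mse, rmse)  # roughly 0.04 and 0.2
```

The square root brings the error back to the units of the target, so the value of about 0.15 above can be read as a typical prediction error of about 0.15 in the close price.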

In [25]:
# Making Predictions
# We can use our model to predict the closing price for another day.

# In the dataset, the data for 3/5/2018 is as follows:

# close: 27.69
# volume: 90590
# open: 27.85
# high: 28.04
# low: 27.51

# First, let's see how close our model's prediction is to the actual close, given these exact values:

new_data = [[90590, 27.85, 28.04, 27.51]]

regression_model.predict(new_data)



array([[ 27.7415439]])
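
The prediction above is just the intercept plus the sum of each coefficient times its feature value. The sketch below reproduces it by hand, using the coefficients and intercept printed earlier in this notebook; the dictionary layout and variable names are illustrative choices.

```python
# Coefficients and intercept as printed earlier in this notebook
coefficients = {
    "volume": -5.074218482083912e-08,
    "open": -0.5467155020487786,
    "high": 0.806823599855042,
    "low": 0.7368877862586202,
}
intercept = 0.07705062402589746

# The 3/5/2018 row used in the cell above
row = {"volume": 90590, "open": 27.85, "high": 28.04, "low": 27.51}

# y = b0 + b1*x1 + b2*x2 + ...
manual_prediction = intercept + sum(
    coefficients[name] * row[name] for name in coefficients
)
print(round(manual_prediction, 4))  # 27.7415
```

The result agrees with `regression_model.predict(new_data)`, and is close to the actual close of 27.69 for that day.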

In [28]:
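# Now let's change some of the values so that the input is new to our model
# (our model wasn't trained or tested on this data).
# Note that these values do not describe a realistic trading day: the open (30.85) is above the high (28.04).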

new_data = [[200590, 30.85, 28.04, 20.51]]
regression_model.predict(new_data)

array([[ 20.93760125]])