# Linear Regression on Boston Housing Dataset
The Boston Housing dataset contains information about houses in Boston. We'll access the data through the scikit-learn library. There are 506 samples and 13 feature variables in the dataset. The objective is to predict house prices from the given features.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

%matplotlib inline

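Let's load the housing data from scikit-learn and look at its structure. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell below needs an older release.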

In [2]:
from sklearn.datasets import load_boston
boston_dataset = load_boston()
boston_dataset.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

In [3]:
boston_dataset['data'].shape

(506, 13)

In [5]:
boston_dataset['target'].shape

(506,)

In [8]:
type(boston_dataset.DESCR)

str
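The cells above mix key access (`boston_dataset['data']`) with attribute access (`boston_dataset.DESCR`). Both work because scikit-learn loaders return a `Bunch`, a dictionary subclass that also exposes its keys as attributes. Here is a minimal sketch using a hypothetical toy `Bunch` (the values are placeholders, not the Boston data):

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical toy container; the key names mirror those listed above.
toy = Bunch(
    data=np.zeros((4, 2)),
    target=np.zeros(4),
    feature_names=["f0", "f1"],
    DESCR="A toy description string.",
)

# Key access and attribute access return the very same object.
same_array = toy["data"] is toy.data
print(same_array, toy.data.shape, type(toy.DESCR).__name__)
```

Either style can be used throughout the notebook; attribute access is only a convenience.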

The description of all features is given below:

    CRIM: Per capita crime rate by town
    ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
    INDUS: Proportion of non-retail business acres per town
    CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    NOX: Nitric oxide concentration (parts per 10 million)
    RM: Average number of rooms per dwelling
    AGE: Proportion of owner-occupied units built prior to 1940
    DIS: Weighted distances to five Boston employment centers
    RAD: Index of accessibility to radial highways
    TAX: Full-value property tax rate per $10,000
    PTRATIO: Pupil-teacher ratio by town
    B: 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
    LSTAT: Percentage of lower status of the population
    MEDV: Median value of owner-occupied homes in $1000s
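Most of the variables above are raw measurements, but `B` is a derived quantity: a quadratic transform of the proportion `Bk`. As a small arithmetic sketch (the function name `b_feature` is hypothetical, not part of scikit-learn), the formula can be evaluated at a few proportions:

```python
def b_feature(bk: float) -> float:
    """Return 1000 * (Bk - 0.63)**2 for a proportion bk in [0, 1]."""
    if not 0.0 <= bk <= 1.0:
        raise ValueError("bk must be a proportion between 0 and 1")
    return 1000.0 * (bk - 0.63) ** 2


# Evaluate the transform at the two endpoints and at its minimum.
for bk in (0.0, 0.63, 1.0):
    print(f"Bk = {bk:.2f} -> B = {b_feature(bk):.2f}")
```

Because the transform is not monotonic, two different proportions can map to the same `B`. The largest possible value, 396.9, occurs at `Bk = 0`, which is why 396.9 shows up repeatedly in the `B` column.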

The house price, given by the variable `MEDV`, is our **target variable**; the remaining variables are the **feature variables** from which we will predict the value of a house.

In [9]:
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head()

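|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |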


We can see that the target `MEDV` is missing from the data, so we add the target values to the dataframe as a new column.

In [10]:
boston['MEDV'] = boston_dataset.target
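Once the target sits in the same dataframe as the features, the two can be separated again whenever a model needs them. The sketch below illustrates the pattern on a self-contained toy frame: the three feature columns are copied from the first five rows printed above, while the `MEDV` values are placeholders for illustration only, and the names `toy`, `X`, and `y` are assumptions rather than part of the original notebook.

```python
import pandas as pd

# Three feature columns taken from the first five rows shown above.
toy = pd.DataFrame(
    {
        "CRIM": [0.00632, 0.02731, 0.02729, 0.03237, 0.06905],
        "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
        "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33],
    }
)

# Placeholder target values (illustration only), added as a new column
# in the same way the notebook adds boston_dataset.target.
toy["MEDV"] = [20.0, 21.0, 22.0, 23.0, 24.0]

# Split back into a feature matrix and a target vector.
X = toy.drop(columns="MEDV")
y = toy["MEDV"]

print(X.shape, y.shape)
print(toy.isnull().sum().sum())  # total count of missing values
```

Keeping the target as a column makes exploratory steps such as correlation matrices and pair plots convenient, and `drop(columns=...)` returns a new frame without modifying the original.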