# Linear Regression on Boston Housing Dataset

This dataset was originally part of the UCI Machine Learning Repository. It also shipped with the scikit-learn library until `load_boston` was removed in version 1.2, which is why we load it from a CSV file here. 
There are 506 samples and 13 feature variables in this dataset. The objective is to predict house prices (the target **MEDV**) from the given features.

The description of all the features is given below:

  **CRIM**: Per capita crime rate by town

  **ZN**: Proportion of residential land zoned for lots over 25,000 sq. ft

  **INDUS**: Proportion of non-retail business acres per town

  **CHAS**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

  **NOX**: Nitric oxide concentration (parts per 10 million)

  **RM**: Average number of rooms per dwelling

  **AGE**: Proportion of owner-occupied units built prior to 1940

  **DIS**: Weighted distances to five Boston employment centers

  **RAD**: Index of accessibility to radial highways

  **TAX**: Full-value property tax rate per $10,000

  **PTRATIO**: Pupil-teacher ratio by town

  **B**: 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town

  **LSTAT**: Percentage of lower status of the population

  **MEDV**: Median value of owner-occupied homes in $1000s


https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155


**Import the required libraries**

In [None]:
import numpy as np
import matplotlib.pyplot as plt 

import pandas as pd  
import seaborn as sns 

from sklearn.model_selection import train_test_split

%matplotlib inline

**Load the Boston Housing dataset from a CSV file into a pandas DataFrame**

In [None]:
boston = pd.read_csv('HousingData.csv')
boston.head()

**Data preprocessing**

In [None]:
boston.describe()

In [None]:
# check for missing values in all the columns
boston.isnull().sum()

### **EXERCISE: set all the null values to 0.0**
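
One possible way to approach this exercise is sketched below on a small made-up frame rather than the real data. The frame `toy` and its values are assumptions for illustration; on the notebook's data the same call would be applied to `boston`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `boston`; the values are made up.
toy = pd.DataFrame({
    "RM": [6.5, np.nan, 5.9],
    "LSTAT": [4.98, 9.14, np.nan],
})

# Count the missing values per column, as in the cell above.
missing_before = toy.isnull().sum()

# Replace every NaN with 0.0; fillna returns a new frame by default.
toy_filled = toy.fillna(0.0)
missing_after = toy_filled.isnull().sum()

print(missing_before)
print(missing_after)
```

Filling with 0.0 is what the exercise asks for. Filling with a column mean or median would be a common alternative, since 0 may not be a plausible value for every column.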

**Data Visualization**

In [None]:
# set the size of the figure
sns.set(rc={'figure.figsize':(11.7,8.27)})

# plot a histogram showing the distribution of the target values
sns.histplot(boston['MEDV'], bins=30)
plt.show()

**Correlation matrix**

In [None]:
# compute the pairwise correlation for all columns
correlation_matrix = boston.corr().round(2)

In [None]:
# use the heatmap function from seaborn to plot the correlation matrix
# annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True)

**Observations**




*   From the above correlation plot we can see that **MEDV** is strongly correlated with **LSTAT** (negatively) and **RM** (positively).

*   **RAD** and **TAX** are strongly correlated with each other, so we should not include both of them as features, to avoid multicollinearity.
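
Reading strongly correlated pairs off a heatmap by eye can be error-prone, so the check can also be done programmatically. The following is a minimal sketch on synthetic data: the frame `toy`, the way its columns are generated, and the `threshold` of 0.8 are all assumptions for illustration, not values taken from the Boston data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `boston`; column names are borrowed for illustration only.
rng = np.random.default_rng(0)
rad = rng.normal(size=200)
toy = pd.DataFrame({
    "RAD": rad,
    "TAX": rad * 2.0 + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of RAD
    "RM": rng.normal(size=200),                           # generated independently
})

corr = toy.corr().round(2)
threshold = 0.8  # assumed cut-off for "strongly correlated"

# Keep only the upper triangle (above the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# NaN entries, if any survive stacking, compare as False and are filtered out here.
strong_pairs = pairs[pairs.abs() > threshold]
print(strong_pairs)
```

Applied to the notebook's `correlation_matrix`, the same masking step would list every feature pair above the chosen threshold.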




In [None]:
plt.figure(figsize=(20, 5))

features = ['LSTAT', 'RM']
target = boston['MEDV']

for i, col in enumerate(features):
    plt.subplot(1, len(features) , i+1)
    x = boston[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')

**Observations:**

1. The prices increase roughly linearly as the value of **RM** increases. There are a few outliers, and the data seems to be capped at 50.
2. The prices tend to decrease as **LSTAT** increases, though the relationship does not look exactly linear.
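
The apparent cap at 50 can be quantified by counting how many rows sit exactly at the maximum target value. This is a small sketch on made-up numbers; the frame `toy` is hypothetical, and on the real data the same expressions would use `boston['MEDV']`.

```python
import pandas as pd

# Hypothetical target values (in $1000s); two of them sit at the cap of 50.
toy = pd.DataFrame({"MEDV": [24.0, 50.0, 21.6, 50.0, 34.7]})

cap = toy["MEDV"].max()

# Number and share of rows whose target equals the maximum value.
n_capped = int((toy["MEDV"] == cap).sum())
share_capped = n_capped / len(toy)

print(cap, n_capped, share_capped)
```

A noticeable share of rows at the maximum would suggest the values were censored there rather than being genuine outliers.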

**Prepare the data for training**

For simple linear regression, we first use only **RM** as the single feature.

In [None]:
# double brackets keep X as a one-column DataFrame (2-D), as scikit-learn expects
X = boston[['RM']]
y = boston['MEDV']

**Split the data into training and testing sets**

In [None]:
# split the data into training and test sets in an 80% : 20% ratio
# fix random_state to any integer so the same split is produced on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
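
To illustrate what `test_size` and `random_state` do, here is a minimal sketch on placeholder arrays with the same number of rows as the dataset (506). The arrays `X_demo` and `y_demo` are assumptions and contain no housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 506 rows, matching the sample count stated above.
X_demo = np.arange(506).reshape(-1, 1)
y_demo = np.arange(506)

# First split with a fixed seed.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5
)

# Second split with the same seed: the rows land in the same sets.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5
)

print(Xa_train.shape, Xa_test.shape)
print(np.array_equal(Xa_test, Xb_test))
```

With 506 rows and `test_size=0.2`, the split yields 404 training rows and 102 test rows, and repeating it with the same `random_state` reproduces the same partition.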

**Train the model using sklearn LinearRegression**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)

y_train_predict = lin_model.predict(X_train)

print("Intercept:", lin_model.intercept_)  
print("Coefficient:", lin_model.coef_)
print('Mean MEDV:', np.mean(y))
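
The intercept and coefficient fully describe a simple linear regression model: a prediction is the intercept plus the coefficient times the feature value. The sketch below checks this on made-up data that follows an exact line; the arrays, the value `x_new`, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 3x - 10.
X_demo = np.array([[4.0], [5.0], [6.0], [7.0], [8.0]])
y_demo = 3.0 * X_demo[:, 0] - 10.0

model = LinearRegression().fit(X_demo, y_demo)

# Simple linear regression: y_hat = intercept + coefficient * x
x_new = 6.5
manual = model.intercept_ + model.coef_[0] * x_new
via_predict = model.predict(np.array([[x_new]]))[0]

print(manual, via_predict)
```

For the model fitted on **RM**, the coefficient can be read the same way: it is the change in predicted **MEDV** (in $1000s) for one additional room on average.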

In [None]:
# model evaluation for training set

print('Mean Absolute Error:', mean_absolute_error(y_train, y_train_predict))  
print('Mean Squared Error:', mean_squared_error(y_train, y_train_predict))  
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_train, y_train_predict)))

print('R2 score:', r2_score(y_train, y_train_predict))

# compare the training data (red) with the fitted regression line (blue)
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, y_train_predict, color = 'blue')
plt.show()
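
The four metrics printed above can also be computed by hand, which makes their definitions explicit. This is a sketch using NumPy only, on made-up true and predicted values; the arrays `y_true` and `y_pred` are assumptions, not results from the model.

```python
import numpy as np

# Made-up targets and predictions (in $1000s) for illustration.
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 32.7, 35.4])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, same units as the target

# R2 compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

Because RMSE is in the units of the target, it can be compared directly with the mean **MEDV** printed earlier to judge how large the typical error is.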

In [None]:
# model evaluation for testing set

y_test_predict = lin_model.predict(X_test)

print('Mean Absolute Error:', mean_absolute_error(y_test, y_test_predict))  
print('Mean Squared Error:', mean_squared_error(y_test, y_test_predict))  
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_test_predict)))

print('R2 score:', r2_score(y_test, y_test_predict))

# compare the test data (green) with the regression line fitted on the training set (blue)
plt.scatter(X_test, y_test, color = 'green')
plt.plot(X_train, y_train_predict, color = 'blue')
plt.show()

### **EXERCISE: repeat this simple linear regression analysis with the variable LSTAT**