This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, MA.
As described on the [The Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) page, each case in the dataset has 14 attributes. They are:
* CRIM - per capita crime rate by town
* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS - proportion of non-retail business acres per town.
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX - nitric oxides concentration (parts per 10 million)
* RM - average number of rooms per dwelling
* AGE - proportion of owner-occupied units built prior to 1940
* DIS - weighted distances to five Boston employment centres
* RAD - index of accessibility to radial highways
* TAX - full-value property-tax rate per \$10,000
* PTRATIO - pupil-teacher ratio by town
* B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT - % lower status of the population
* MEDV - Median value of owner-occupied homes in $1000's

Variable #14 seems to be censored at 50.00 (corresponding to a median price of $50,000). Censoring is suggested by the fact that the highest median price of exactly $50,000 is reported in 16 cases, while 15 cases have prices between $40,000 and $50,000, with prices rounded to the nearest hundred.
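The censoring check described above can be sketched as follows. This is only an illustration on a made-up series: `medv` is a toy stand-in for the MEDV column, not the real data, and the variable names are assumptions.

```python
import pandas as pd

# Toy stand-in for the MEDV column (values in $1000's); NOT the real data.
medv = pd.Series([24.0, 50.0, 21.6, 50.0, 34.7, 43.8, 50.0, 48.5])

cap = medv.max()                                    # suspected censoring threshold
n_at_cap = int((medv == cap).sum())                 # rows sitting exactly on the cap
n_40_to_cap = int(((medv >= 40) & (medv < cap)).sum())  # rows with 40 <= value < cap

print(f"cap={cap}, at cap={n_at_cap}, in [40, cap)={n_40_to_cap}")
```

A pile-up of rows exactly at the maximum, compared with the number of rows just below it, is the pattern that suggests censoring.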

Our goal is to select the variables that best predict MEDV, and to suggest a machine learning model for predicting MEDV.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("/kaggle/input/boston-housing/Boston Housing.csv")

In [None]:
df.head()

In [None]:
df.describe()

The summary above shows the count, mean, percentiles, and so on for each column.
Looking at the data:
1. The ZN column is 0 at the 25th and 50th percentiles and only takes nonzero values toward the 75th percentile and above, so it is heavily skewed. Since ZN is the proportion of residential land zoned for lots over 25,000 sq.ft., it is nonzero only for towns that have such zoning.
2. The Chas column is 0 at the 25th, 50th, and 75th percentiles. CHAS is the Charles River dummy variable (1 if tract bounds river; 0 otherwise), so it is categorical data (0, 1).

Both ZN and Chas are skewed and could distort the prediction of MEDV, so we remove them.
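One way to back up the observation about ZN and Chas is to look at the share of zeros and the skewness of each column. The sketch below uses a small made-up frame (`toy`), not the real dataset; the column names are reused only for readability.

```python
import pandas as pd

# Made-up values that mimic the shape described above; NOT the real data.
toy = pd.DataFrame({
    "ZN":   [0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 25.0, 80.0],
    "Chas": [0, 0, 0, 0, 0, 0, 0, 1],
})

zero_share = (toy == 0).mean()               # fraction of rows equal to 0, per column
skewness = toy.skew()                        # sample skewness, per column
chas_counts = toy["Chas"].value_counts()     # class balance of the dummy variable

print(zero_share)
print(skewness)
print(chas_counts)
```

A large zero share together with a strongly positive skew matches the pattern seen in the `describe()` output.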

#Removing ZN & Chas
  

In [None]:
df = df.drop(["ZN", "Chas"], axis = 1)

In [None]:
df.describe()

In [None]:
#checking columns / rows
df.shape

In [None]:
#lets see if we have any null values in our data
df.isnull().sum()

In [None]:
#moving to our next step to treat Outliers
#lets visualize the data through box plot
for i in df.columns:
  sns.boxplot(y=i, data=df)
  plt.tight_layout(pad=0.4)
  plt.show()


From the boxplots above we can see that the columns CRIM, RM, DIS, PTRATIO, B, LSTAT and MEDV have outliers.
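The points a boxplot draws beyond its whiskers follow the 1.5 * IQR rule by default, so the visual reading can be cross-checked by counting them. The helper name `iqr_outlier_mask` and the toy frame are assumptions made for this sketch.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy data: column "a" has one obvious outlier, column "b" has none.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})

outlier_counts = toy.apply(iqr_outlier_mask).sum()  # number of flagged rows per column
print(outlier_counts)
```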


In [None]:
df2 = df.copy()

In [None]:
#Detect outliers in each column with the 1.5*IQR rule and replace them with the column mean
#(np.percentile does not need sorted input, so no sorting step is required)
#Note: the loop runs over every column, so the target Medv is treated as well
for i in df2.columns:
  q1, q3 = np.percentile(df2[i], [25,75])
  iqr = q3-q1
  upper_bound = q3+(1.5 * iqr)
  lower_bound = q1-(1.5 * iqr)
  mean = df2[i].mean() #mean of the column before any replacement
  df2.loc[df2[i]< lower_bound, i] = mean
  df2.loc[df2[i]> upper_bound, i] = mean
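The same replacement step can be written as a small reusable helper. This is a sketch on toy data; the function name `replace_outliers_with_mean` is hypothetical.

```python
import numpy as np
import pandas as pd

def replace_outliers_with_mean(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values outside the k*IQR fences with the mean of the original series."""
    q1, q3 = np.percentile(s, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mean = s.mean()  # mean of the original values, outliers included
    # keep values inside the fences, substitute the mean everywhere else
    return s.where(s.between(lower, upper), mean)

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
cleaned = replace_outliers_with_mean(toy)
print(cleaned.tolist())  # [1.0, 2.0, 3.0, 4.0, 22.0]
```

In this toy case the mean (22.0) is itself pulled up by the outlier and still lies above the upper fence (7.0), which is one reason the median is sometimes preferred as the replacement value.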

In [None]:
df2.shape

In [None]:
df2.describe()

In [None]:
#separating features and target (Medv is the last of the 12 remaining columns)
X = df2.iloc[:, :-1] #independent variables
y = df2.iloc[:, -1] #dependent variable

In [None]:
import statsmodels.api as sm
X = df2.iloc[:, :-1] #independent variable
y = df2.iloc[:, 11] #dependent variable
X = sm.add_constant(X) # adding a constant

model = sm.OLS(y, X).fit()
predictions = model.predict(X) 

print_model = model.summary()
print(print_model)

In [None]:
#lets make the correlation matrix
corr_data = df2.corr()
corr_data.style.background_gradient(cmap="coolwarm")

In [None]:
#the correlation matrix above shows that Tax and Rad are highly correlated;
#from the data, Rad looks like the more important variable for predicting Medv, so Tax is dropped here
df2 = df2.drop(["Tax"], axis = 1)
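Instead of reading highly correlated pairs off the coloured matrix by eye, they can be listed programmatically. The sketch below runs on synthetic data; the 0.8 threshold and the column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),                        # independent of the others
})

corr = toy.corr().abs()
# keep only the strict upper triangle so each pair appears once and the diagonal is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8].sort_values(ascending=False)  # assumed threshold

print(high_pairs)
```

Which member of a pair to drop remains a judgment call, for example keeping the one that correlates more strongly with the target.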

In [None]:
df2.shape

In [None]:
#lets make the correlation matrix
corr_data = df2.corr()
corr_data.style.background_gradient(cmap="coolwarm")

In [None]:
#after dropping Tax, Crim and Rad turn out to be highly correlated; judging by their relation to Medv, Rad is dropped rather than Crim
df2 = df2.drop(["Rad"], axis=1)

In [None]:
#lets make the correlation matrix again
corr_data = df2.corr()
corr_data.style.background_gradient(cmap="coolwarm")

In [None]:
#after dropping Rad, Nox and Indus turn out to be highly correlated, so one of the two is removed; the line below drops Nox and keeps Indus
df2 = df2.drop(["Nox"], axis=1)

In [None]:
#lets make the correlation matrix again
corr_data = df2.corr()
corr_data.style.background_gradient(cmap="coolwarm")

In [None]:
#so far we have removed the highly correlated features that might affect our predictions of Medv.
#Let's look at the correlation of Medv with the other variables and remove the least correlated one (using the Pearson method here)
from scipy.stats import pearsonr
for i in df2.columns:
  corr, p_val = pearsonr(df2[i], df2["Medv"])
  print (i, corr)

In [None]:
#from the Pearson results above, B is the least correlated with Medv and has the least impact on it, so it is removed
df2 = df2.drop(["B"], axis= 1)
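The printed correlations are easier to compare when sorted by absolute value, with the p-value kept alongside. A sketch on synthetic data follows; `toy`, `strong`, `weak` and `target` are made-up names.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 300
strong = rng.normal(size=n)
weak = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "target": 3.0 * strong + 0.1 * weak + rng.normal(size=n),
})

rows = []
for col in toy.columns.drop("target"):       # skip the target itself
    r, p = pearsonr(toy[col], toy["target"])
    rows.append({"feature": col, "r": r, "p_value": p})

# strongest absolute correlation first
ranking = (
    pd.DataFrame(rows)
    .sort_values("r", key=lambda s: s.abs(), ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```

Pearson correlation only captures linear association, so a low value does not by itself prove that a feature is useless to a non-linear model.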

In [None]:
df2.describe()

Let's implement the machine learning models.
This is a regression problem, because we have to predict a continuous (non-categorical) value.

In [None]:
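#rebuild X and y from the reduced df2: the earlier X still contains the dropped columns and the statsmodels constant
X = df2.iloc[:, :-1]
y = df2.iloc[:, -1]
#splitting data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 0)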

Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
y_compare = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
from sklearn.model_selection import cross_val_score as cvs
accuracy = cvs(lr, X_train, y_train, scoring='r2', cv=5)
print (accuracy.mean())
y_compare.head() #Comparison b/w Actual & Predicted

Polynomial Regression

In [None]:
#polynomial regression: expand the features to degree 2, then fit an ordinary linear regression on the expanded features
from sklearn.preprocessing import PolynomialFeatures
polyReg = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = polyReg.fit_transform(X_train)
X_test_poly = polyReg.transform(X_test) #reuse the transformer fitted on the training data
poly = LinearRegression()
poly.fit(X_train_poly, y_train)
y_pred = poly.predict(X_test_poly)
y_compare_poly = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
accuracy = cvs(poly, X_train_poly, y_train, scoring='r2', cv=5)
print (accuracy.mean())
y_compare_poly.head() #Comparison b/w Actual & Predicted
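The two-step polynomial workflow can also be expressed as a single scikit-learn pipeline, so that the feature expansion is refitted inside each cross-validation fold. This sketch uses synthetic data from `make_regression`; `X_demo`, `y_demo` and `poly_model` are hypothetical names.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic regression data standing in for the housing features; NOT the real data.
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Degree-2 expansion followed by ordinary least squares, treated as one estimator.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

scores = cross_val_score(poly_model, X_demo, y_demo, scoring="r2", cv=5)
print(scores.mean())
```

With a pipeline there is no separate `X_train_poly` to keep in sync, and the same object can be passed to any loop that scores several models on the untransformed features.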

Support Vector Regression

In [None]:
from sklearn.svm import SVR
svr = SVR (kernel = 'rbf', gamma = 'scale')
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)
y_compare_svr = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
accuracy = cvs(svr, X_train, y_train, scoring='r2', cv=5)
print (accuracy.mean())
y_compare_svr.head()

Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor (random_state = 0)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
y_compare_dt = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
accuracy = cvs(dt, X_train, y_train, scoring='r2', cv=5)
print (accuracy.mean())
y_compare_dt.head()

Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor(n_estimators = 160, random_state = 0)
RF.fit(X_train, y_train)
y_pred = RF.predict(X_test)
y_compare_RF = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
accuracy = cvs(RF, X_train, y_train, scoring='r2', cv=5)
print (accuracy.mean())
y_compare_RF.head()

K-NN

In [None]:
from sklearn.neighbors import KNeighborsRegressor
KNN = KNeighborsRegressor(n_neighbors = 4)
KNN.fit(X_train, y_train)
y_pred = KNN.predict(X_test)
y_compare_KNN = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
accuracy = cvs(KNN, X_train, y_train, scoring='r2', cv=5)
print (accuracy.mean())
y_compare_KNN.head()

Plotting a comparison of the actual and predicted values of MEDV obtained with the different machine learning models



In [None]:
fig, ax = plt.subplots(nrows=1, ncols=6, figsize=(25,4))
ax = ax.flatten()
y_compare.head(10).plot(kind='bar', title='Linear Regression', grid=True, ax=ax[0])
y_compare_dt.head(10).plot(kind='bar', title='Decision Tree', grid=True, ax=ax[1])
y_compare_KNN.head(10).plot(kind='bar', title='KNN', grid=True, ax=ax[2])
y_compare_RF.head(10).plot(kind='bar', title='Random Forest', grid=True, ax=ax[3])
y_compare_svr.head(10).plot(kind='bar', title='SVR', grid=True, ax=ax[4])
y_compare_poly.head(10).plot(kind='bar', title='Poly', grid=True, ax=ax[5])

In [None]:
print('According to the R squared scoring method we got the scores below for our machine learning models:')
modelNames = ['Linear', 'Polynomial', 'Support Vector', 'Random Forest', 'K-Nearest Neighbour', 'Decision Tree']
modelRegressors = [lr, poly, svr, RF, KNN, dt]
score=[]
for name, reg in zip(modelNames, modelRegressors):
  #the polynomial model was fitted on the expanded features, so it is scored on those
  X_cv = X_train_poly if reg is poly else X_train
  accuracy = cvs(reg, X_cv, y_train, scoring='r2', cv=5)
  print('R squared of %s Regression model is %.2f' %(name, accuracy.mean()))
  score.append(accuracy.mean())

In [None]:
print('According to the Mean Absolute Error scoring method we got the scores below for our machine learning models:')
modelNames = ['Linear', 'Polynomial', 'Support Vector', 'Random Forest', 'K-Nearest Neighbour', 'Decision Tree']
modelRegressors = [lr, poly, svr, RF, KNN, dt]
score=[]
for name, reg in zip(modelNames, modelRegressors):
  #the polynomial model was fitted on the expanded features, so it is scored on those
  X_cv = X_train_poly if reg is poly else X_train
  accuracy = cvs(reg, X_cv, y_train, scoring='neg_mean_absolute_error', cv=5)
  mae = -accuracy.mean() #scikit-learn returns the negated error, so flip the sign (lower is better)
  print('Mean Absolute Error of %s Regression model is %.2f' %(name, mae))
  score.append(mae)

In [None]:
print('According to the Mean Squared Error scoring method we got the scores below for our machine learning models:')
modelNames = ['Linear', 'Polynomial', 'Support Vector', 'Random Forest', 'K-Nearest Neighbour', 'Decision Tree']
modelRegressors = [lr, poly, svr, RF, KNN, dt]
score=[]
for name, reg in zip(modelNames, modelRegressors):
  #the polynomial model was fitted on the expanded features, so it is scored on those
  X_cv = X_train_poly if reg is poly else X_train
  accuracy = cvs(reg, X_cv, y_train, scoring='neg_mean_squared_error', cv=5)
  mse = -accuracy.mean() #scikit-learn returns the negated error, so flip the sign (lower is better)
  print('Mean Squared Error of %s Regression model is %.2f' %(name, mse))
  score.append(mse)

From the R2, MSE and MAE results above, we found that Random Forest gives the best results for predicting MEDV.
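The three scoring passes can be collected into one table with `cross_validate`, which accepts several metrics at once. The sketch below uses synthetic data and only three of the models; `X_demo`, `y_demo`, `candidates` and `summary` are hypothetical names, and the numbers it prints say nothing about the housing data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data; NOT the housing data.
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=15.0, random_state=0)

candidates = {
    "Linear": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "K-NN": KNeighborsRegressor(n_neighbors=4),
}
scoring = {"r2": "r2", "mae": "neg_mean_absolute_error", "mse": "neg_mean_squared_error"}

rows = []
for name, est in candidates.items():
    cv = cross_validate(est, X_demo, y_demo, scoring=scoring, cv=5)
    rows.append({
        "model": name,
        "r2": cv["test_r2"].mean(),
        "mae": -cv["test_mae"].mean(),   # flip sign: scikit-learn reports negated errors
        "mse": -cv["test_mse"].mean(),
    })

summary = pd.DataFrame(rows).set_index("model").sort_values("r2", ascending=False)
print(summary)
```

Having R2, MAE and MSE side by side makes it easier to see whether one model wins on all three metrics or only on some.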

I would like to close by mentioning an important fact: no data science technique is perfect, and there is always scope for improvement.

**Please comment your suggestions.**

**Please upvote if this notebook is helpful.**