In [12]:
import pandas as pd

df = pd.read_csv('housing2.csv')
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280,565.0,259.0,3.8462,342200.0,NEAR BAY


In [14]:
df.shape

(19675, 10)

## Feature engineering
* We have to turn ocean_proximity into dummy variables.

In [15]:
df.ocean_proximity.unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [16]:
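# One-hot encode ocean_proximity; drop_first=True drops the first category ('<1H OCEAN') as the baseline.
dummies = pd.get_dummies(df['ocean_proximity'], drop_first=True)
df = pd.concat([df, dummies], axis=1).drop('ocean_proximity', axis=1)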
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129,322.0,126.0,8.3252,452600.0,0,0,1,0
1,-122.22,37.86,21.0,7099.0,1106,2401.0,1138.0,8.3014,358500.0,0,0,1,0
2,-122.24,37.85,52.0,1467.0,190,496.0,177.0,7.2574,352100.0,0,0,1,0
3,-122.25,37.85,52.0,1274.0,235,558.0,219.0,5.6431,341300.0,0,0,1,0
4,-122.25,37.85,52.0,1627.0,280,565.0,259.0,3.8462,342200.0,0,0,1,0
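To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame (the `toy` DataFrame is made up for illustration and is not part of the housing data). It shows which category becomes the implicit baseline. The `dtype=int` argument is an addition here, used only so the output is 0/1 on pandas versions that default to booleans.

```python
import pandas as pd

# Toy column holding the same five categories as ocean_proximity (illustrative only).
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "<1H OCEAN", "INLAND", "NEAR OCEAN", "ISLAND"]
})

# Categories are sorted before encoding, and drop_first=True removes the first one.
dummies = pd.get_dummies(toy["ocean_proximity"], drop_first=True, dtype=int)
print(list(dummies.columns))

# A row whose dummies are all zero belongs to the dropped baseline category.
baseline_rows = toy.loc[dummies.sum(axis=1) == 0, "ocean_proximity"]
print(baseline_rows.tolist())
```

Dropping one level avoids perfectly collinear dummy columns in a regression with an intercept, since the dropped level is fully determined by the others.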


### Looking at Variance Inflation Factor

More features are not always better, especially when the features are highly multicollinear. Multicollinearity is a problem in multiple regression because the inputs are correlated with one another rather than independent, which makes it difficult to separate out how much each independent variable contributes to the dependent variable (the outcome) in the regression model. The variance inflation factor (VIF) measures how severe the multicollinearity is for each feature so that the model can be adjusted.
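As a rough illustration of what the VIF measures, the sketch below computes it from its textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The helper `vif_from_r2` and the synthetic columns `a`, `b`, `c` are hypothetical names made up for this example, and the data is random rather than taken from the housing set.

```python
import numpy as np

def vif_from_r2(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Include an intercept column so R^2 is the usual centred R^2.
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centred = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centred @ centred)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)              # unrelated to the other two columns
b = rng.normal(size=n)
c = b + 0.1 * rng.normal(size=n)    # nearly a copy of b
X = np.column_stack([a, b, c])

for j, name in enumerate("abc"):
    print(name, round(vif_from_r2(X, j), 2))
```

The unrelated column should come out close to 1, while the two near-duplicates should come out far larger, which is the pattern to look for in the real features.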

In [17]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculating VIF for the numeric features
num_features = df[['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']]
vif = pd.DataFrame()
vif["variables"] = list(num_features.columns)
# variance_inflation_factor takes the full feature matrix and the index of the column to score
vif["VIF"] = [variance_inflation_factor(num_features.values, i) for i in range(num_features.shape[1])]
print(vif)


            variables        VIF
0  housing_median_age   3.701284
1         total_rooms  27.794914
2      total_bedrooms  86.556098
3          population  15.851491
4          households  89.351744
5       median_income   5.208743


Results show that total_bedrooms and households have very high VIF values (about 87 and 89), a sign that they are indeed collinear. This matches the relationship we found earlier and exploited to impute the missing values for total_bedrooms. Since I am doing this project only as a proof of concept for myself and to showcase some data analysis work with predictive modelling, I will not remove these collinear features. However, removing them becomes more important when the goal is to relate explanatory variables to real-world outcomes and to obtain fine-grained details about the relationships between variables.

In [18]:
df.to_csv('data.csv', index=False) # export to csv for modelling later.