In [1]:
%%capture
# move to src folder so we can import code
%cd ../src

In [2]:
from common.kaggle import download_competition_data
import config

In [3]:
download_competition_data(config.COMPETITION, config.INPUTS)


In this competition we will be using data generated by a deep learning model trained on the [California housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). We can expect the relationships between variables to be similar to those in the original dataset, but not exactly the same.

We will be predicting the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). The independent variables at our disposal are:

* __MedInc__ - median income in block group
* __HouseAge__ - median house age in block group
* __AveRooms__ - average number of rooms per household
* __AveBedrms__ - average number of bedrooms per household
* __Population__ - block group population
* __AveOccup__ - average number of household members
* __Latitude__ - block group latitude
* __Longitude__ - block group longitude

The evaluation metric is the standard Root Mean Squared Error (RMSE). The useful thing to keep in mind about this metric is that, because it squares each error, outliers (predictions that err by a lot) are disproportionately penalized!
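To make the point about squared errors concrete, here is a minimal sketch using only the standard library. The `rmse` helper and the toy values are illustrative assumptions, not the competition's scoring code. Both sets of predictions have the same total absolute error, but the one that concentrates it in a single large miss gets a higher RMSE.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy targets (illustrative values only)
y_true = [1.0, 2.0, 3.0, 4.0]

# Four small misses of 0.5 each: total absolute error is 2.0
spread_out = [1.5, 2.5, 3.5, 4.5]

# One large miss of 2.0: total absolute error is also 2.0
one_big_miss = [1.0, 2.0, 3.0, 6.0]

rmse_spread = rmse(y_true, spread_out)
rmse_big = rmse(y_true, one_big_miss)

print(f"spread-out errors: {rmse_spread:.3f}")
print(f"one big miss:      {rmse_big:.3f}")
```

The single large miss doubles the score here, which is why a handful of badly predicted districts can dominate the leaderboard metric.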

# Let's take a look at the data

In [4]:
from pathlib import Path
import pandas as pd

In [5]:
df = pd.read_csv(config.TRAINING_DATA)
df.head()

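|   | id | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|----|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 0 | 2.3859 | 15.0 | 3.82716 | 1.1121 | 1280.0 | 2.486989 | 34.6 | -120.12 | 0.98 |
| 1 | 1 | 3.7188 | 17.0 | 6.013373 | 1.054217 | 1504.0 | 3.813084 | 38.69 | -121.22 | 0.946 |
| 2 | 2 | 4.775 | 27.0 | 6.535604 | 1.103175 | 1061.0 | 2.464602 | 34.71 | -120.45 | 1.576 |
| 3 | 3 | 2.4138 | 16.0 | 3.350203 | 0.965432 | 1255.0 | 2.089286 | 32.66 | -117.09 | 1.336 |
| 4 | 4 | 3.75 | 52.0 | 4.284404 | 1.069246 | 1793.0 | 1.60479 | 37.8 | -122.41 | 4.5 |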


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37137 entries, 0 to 37136
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           37137 non-null  int64  
 1   MedInc       37137 non-null  float64
 2   HouseAge     37137 non-null  float64
 3   AveRooms     37137 non-null  float64
 4   AveBedrms    37137 non-null  float64
 5   Population   37137 non-null  float64
 6   AveOccup     37137 non-null  float64
 7   Latitude     37137 non-null  float64
 8   Longitude    37137 non-null  float64
 9   MedHouseVal  37137 non-null  float64
dtypes: float64(9), int64(1)
memory usage: 2.8 MB


In [7]:
df.nunique()

id             37137
MedInc         12310
HouseAge          51
AveRooms       22069
AveBedrms      14066
Population      3694
AveOccup       21078
Latitude         791
Longitude        755
MedHouseVal     3723
dtype: int64

In [8]:
len(df[df.duplicated()])

0

We have eight numeric features, all stored as float64, although 'HouseAge' and 'Population' seem to hold only integer values.
The 'id' column can be discarded, as it is just a unique row identifier.
The target is MedHouseVal, a float value that seems to be always positive and that can exceed 1 (for example, 4.5 in the sample above).
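The "seem to be integers" observation can be checked rather than eyeballed. The sketch below is one possible way to do it; `integer_valued_columns` is a hypothetical helper, and the small frame just reuses a few values from the sample rows so the block runs on its own. On the real data you would pass `df` instead.

```python
import pandas as pd


def integer_valued_columns(frame):
    """Return the numeric columns whose non-null values are all whole numbers."""
    numeric = frame.select_dtypes(include="number")
    return [
        col for col in numeric.columns
        if (numeric[col].dropna() % 1 == 0).all()
    ]


# A few values copied from the first rows shown above (stand-in for `df`)
sample = pd.DataFrame({
    "id": [0, 1, 2],
    "MedInc": [2.3859, 3.7188, 4.775],
    "HouseAge": [15.0, 17.0, 27.0],
    "Population": [1280.0, 1504.0, 1061.0],
    "MedHouseVal": [0.98, 0.946, 1.576],
})

whole_number_cols = integer_valued_columns(sample)
print(whole_number_cols)

# Drop the identifier and separate the target before modelling
features = sample.drop(columns=["id", "MedHouseVal"])
target = sample["MedHouseVal"]
```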

# Location features

In [9]:
# # import folium
# # from folium.plugins import HeatMap

# # heat_map = folium.Map(df[['Latitude', 'Longitude']].mean(axis=0),
# #                     zoom_start = 6) 

# # df['Latitude'] = df['Latitude'].astype(float)
# # df['Longitude'] = df['Longitude'].astype(float)

# # lat_long_list = [[row['Latitude'],row['Longitude']] for index, row in df.iterrows()]
# # HeatMap(lat_long_list, radius=10).add_to(heat_map)
# # heat_map

__Insights__

* Data is distributed across the entire state of California
* We could consider using external socioeconomic data to gain additional insight into each geographic location
* Lots of properties are in Los Angeles
* Train and test are expected to follow the same distribution of latitude-longitude pairs, in which case a random KFold is enough

TODO: check that, as expected, the test data has a similar distribution
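One possible way to tackle the TODO is to bin both sets of coordinates on a shared grid and compare the resulting histograms. The sketch below uses the total variation distance as the comparison; this is a swapped-in technique chosen for simplicity, not something the notebook already does. The function name, the bin count, and the synthetic coordinates are all assumptions for illustration. With real data you would pass the `Latitude`/`Longitude` columns of the train and test frames as arrays.

```python
import numpy as np


def location_tv_distance(train_latlon, test_latlon, bins=10):
    """Total variation distance between two binned (lat, lon) distributions.

    Returns a value in [0, 1]: 0 means the binned distributions are
    identical, 1 means they do not overlap at all.
    """
    train_latlon = np.asarray(train_latlon, dtype=float)
    test_latlon = np.asarray(test_latlon, dtype=float)

    # Shared bin edges so both histograms live on the same grid
    both = np.vstack([train_latlon, test_latlon])
    lat_edges = np.linspace(both[:, 0].min(), both[:, 0].max(), bins + 1)
    lon_edges = np.linspace(both[:, 1].min(), both[:, 1].max(), bins + 1)

    h_train, _, _ = np.histogram2d(
        train_latlon[:, 0], train_latlon[:, 1], bins=[lat_edges, lon_edges]
    )
    h_test, _, _ = np.histogram2d(
        test_latlon[:, 0], test_latlon[:, 1], bins=[lat_edges, lon_edges]
    )

    p = h_train / h_train.sum()
    q = h_test / h_test.sum()
    return 0.5 * np.abs(p - q).sum()


# Synthetic coordinates, roughly covering California's bounding box
rng = np.random.default_rng(0)
train = np.column_stack([rng.uniform(32, 42, 5000), rng.uniform(-124, -114, 5000)])
test_similar = np.column_stack([rng.uniform(32, 42, 3000), rng.uniform(-124, -114, 3000)])
test_shifted = np.column_stack([rng.uniform(38, 42, 3000), rng.uniform(-124, -114, 3000)])

d_similar = location_tv_distance(train, test_similar)
d_shifted = location_tv_distance(train, test_shifted)
print(f"similar test set: {d_similar:.3f}")
print(f"shifted test set: {d_shifted:.3f}")
```

A small distance is consistent with the random KFold assumption. A large one would suggest a location-aware split is worth considering.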

## Latitude & Longitude with respect to target

In [10]:
# # import branca

# # inferno_colors = [
# #     (0, 0, 4),
# #     (40, 11, 84),
# #     (101, 21, 110),
# #     (159, 42, 99),
# #     (212, 72, 66),
# #     (245, 125, 21),
# #     (250, 193, 39),
# #     (252, 255, 164)
# # ]

# # map = folium.Map(df[['Latitude', 'Longitude']].mean(axis=0), zoom_start = 6)
# # lat = list(df.Latitude)
# # lon = list(df.Longitude)
# # populations = list(df.Population)
# # targets = list(df.MedHouseVal)

# # # define colormap using inferno colors and normalizing them according MedHouseVal
# # cmap = branca.colormap.LinearColormap(
# #     inferno_colors, vmin=min(targets), vmax=max(targets)
# # )

# # for loc, population, target in zip(zip(lat, lon), populations, targets):
# #     folium.Circle(
# #         location=loc,
# #         radius=population,
# #         fill=True,
# #         color=cmap(target),
# #         fill_opacity=0.2,
# #         weight=0
# #     ).add_to(map)

# # map.add_child(cmap)
# # display(map)

__Insights__

* The most expensive properties tend to be in or near big cities such as San Francisco and Los Angeles, and close to the beaches
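If proximity to big cities matters, one natural candidate feature is the distance from each block group to the nearest city centre. The sketch below uses the haversine formula with approximate city-centre coordinates. The `CITIES` dictionary and both function names are assumptions made for illustration and are not part of `compute_geo_features`.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Approximate city-centre coordinates (latitude, longitude); illustrative choice
CITIES = {
    "San Francisco": (37.7749, -122.4194),
    "Los Angeles": (34.0522, -118.2437),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_nearest_city(lat, lon):
    """Distance in kilometres to the closest city in CITIES."""
    return min(
        haversine_km(lat, lon, city_lat, city_lon)
        for city_lat, city_lon in CITIES.values()
    )


# The 4.5 row from the sample above sits at (37.8, -122.41)
d_expensive_row = distance_to_nearest_city(37.8, -122.41)
print(f"{d_expensive_row:.1f} km to the nearest city centre")
```

To apply it to the whole frame, something like `df.apply(lambda r: distance_to_nearest_city(r.Latitude, r.Longitude), axis=1)` would work, although a vectorised NumPy version would be faster on 37,137 rows.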


In [12]:
from common.feature_engineering.geo import compute_geo_features

In [15]:
df_geo = compute_geo_features(df, cache=config.TRAIN_GEO_CACHE)
df_geo.isnull().sum()

road               1920
neighbourhood     32935
town              27959
county             1335
city              14771
state_district    17945
postcode           1810
dtype: int64
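Raw null counts are easier to act on as fractions of the 37,137 rows: `neighbourhood`, for example, is missing in roughly 89% of them. The sketch below shows one possible way to rank the columns by missing share and drop the sparsest ones. The helper names and the 0.5 threshold are assumptions, and the toy frame stands in for `df_geo` so the block runs on its own.

```python
import pandas as pd


def missing_fraction(frame):
    """Share of null values per column, most sparse first."""
    return frame.isnull().mean().sort_values(ascending=False)


def drop_sparse_columns(frame, max_missing=0.5):
    """Keep only columns whose null share is at most `max_missing` (assumed threshold)."""
    fractions = frame.isnull().mean()
    keep = fractions[fractions <= max_missing].index
    return frame[keep]


# Toy stand-in for `df_geo`, with made-up values
toy_geo = pd.DataFrame({
    "county": ["A", None, "B", "C"],
    "neighbourhood": [None, None, None, "X"],
})

fractions = missing_fraction(toy_geo)
print(fractions)

dense_geo = drop_sparse_columns(toy_geo, max_missing=0.5)
print(list(dense_geo.columns))
```

Dropping is not the only option: filling nulls with an explicit category such as `"unknown"` keeps the column available to tree-based models.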

# Merge original data

Adding the original California housing dataset to the training data has been shown to improve the performance of the models.
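A minimal sketch of such a merge follows, assuming the original data is available as a DataFrame with the same feature and target column names (for instance from `sklearn.datasets.fetch_california_housing(as_frame=True).frame`). The `add_original_data` helper and the `is_original` flag column are hypothetical names. The flag lets a model learn any systematic difference between synthetic and original rows.

```python
import pandas as pd


def add_original_data(synthetic, original, flag_column="is_original"):
    """Stack original rows under the synthetic ones, keeping only shared columns.

    A 0/1 flag column records where each row came from.
    """
    synthetic = synthetic.assign(**{flag_column: 0})
    original = original.assign(**{flag_column: 1})
    # Columns such as `id` exist only in the competition data and are dropped
    shared = [col for col in synthetic.columns if col in original.columns]
    return pd.concat([synthetic[shared], original[shared]], ignore_index=True)


# Toy frames with made-up values, standing in for the real datasets
synthetic_toy = pd.DataFrame({
    "id": [0, 1],
    "MedInc": [2.3859, 3.7188],
    "MedHouseVal": [0.98, 0.946],
})
original_toy = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6],
    "MedHouseVal": [4.5, 3.5, 3.4],
})

combined = add_original_data(synthetic_toy, original_toy)
print(combined)
```

When cross-validating, it is usually safer to score only on the synthetic rows, since those are what the test set contains.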