In [2]:
import sys
import pandas as pd

sys.path.insert(1, '..')
import src.constants as cst
from src.external_data import (
    external_geo_data as ext_geo,
    external_insee_data as ext_insee
)

import plotly.express as px

# Load data

In [3]:
train = pd.read_csv(cst.PREPROCESSED_NB_TRAIN_PATH, index_col=0)
test = pd.read_csv(cst.PREPROCESSED_NB_TEST_PATH, index_col=0)

In [4]:
transports = pd.read_csv(cst.RAW_TRANSPORTS_PATH, sep=';')

In [5]:
revenus = pd.read_csv(cst.RAW_INSEE_PATH, sep=';')

# IDF Mobilités - Open transportation data 

I used an open dataset from IDF Mobilités that lists all inbound and outbound connections for metros, RER, trains and tramways in Île-de-France.

To avoid computing the distance from every unit to every station in the dataset, I first extract the stations within 1.2 km of each district center (districts are defined by the original `code_district_custom` column). Then, for each housing unit in the DVF dataset, I compute the geodesic distance between the unit and the stations within 1.2 km of its district's center.

The geodesic distance is the length of the shortest path between two points on a surface (here, the surface of the Earth). It is computed from latitude and longitude coordinates. Once I have these distances, I keep the stations that lie within 0.5 km of the housing unit and count the total number of metro, train and RER connections.
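The two-step filtering described above can be sketched in plain Python. This is only an illustration, not the code in `ext_geo`: it uses the haversine great-circle formula as a stand-in for a true geodesic distance, the coordinates are made up, the record keys (`lat`, `lon`, `mode`) and function names are hypothetical, and it assumes each record represents one connection.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stations_near(point, stations, radius_km):
    """Keep the station records lying within radius_km of point (lat, lon)."""
    lat, lon = point
    return [s for s in stations if haversine_km(lat, lon, s["lat"], s["lon"]) <= radius_km]


def count_connections(unit, district_stations, radius_km=0.5):
    """Count connections per mode among the stations close to a housing unit."""
    counts = {}
    for s in stations_near(unit, district_stations, radius_km):
        counts[s["mode"]] = counts.get(s["mode"], 0) + 1
    return counts


# Made-up records: one row per (station, line) connection.
stations = [
    {"lat": 48.8570, "lon": 2.3530, "mode": "metro"},
    {"lat": 48.8570, "lon": 2.3530, "mode": "metro"},  # second line, same station
    {"lat": 48.8620, "lon": 2.3450, "mode": "rer"},
    {"lat": 48.9000, "lon": 2.3522, "mode": "metro"},  # far from the district
]
district_center = (48.8566, 2.3522)
unit = (48.8575, 2.3525)

# Step 1: coarse pre-filter, done once per district.
district_stations = stations_near(district_center, stations, radius_km=1.2)
# Step 2: fine filter and count, done for each unit against the short list.
counts = count_connections(unit, district_stations, radius_km=0.5)
print(counts)
```

The pre-filter is what keeps the cost down: each unit is compared only with the handful of stations around its district instead of with the full station list.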

In [6]:
train, test = ext_geo.count_close_stations_all_housing_units_in_train_test(train, test, transports)

Preprocessing train...
Processed 1000 units...
Processed 2000 units...
Processed 3000 units...
Processed 4000 units...
Processed 5000 units...
Processed 6000 units...
Processed 7000 units...
Processed 8000 units...
Processed 9000 units...
Processed 10000 units...
Processed 11000 units...
Processed 12000 units...
Processed 13000 units...
Processed 14000 units...
Processed 15000 units...
Processed 16000 units...
Processed 17000 units...
Processed 18000 units...
Preprocessing test...
Processed 1000 units...


In [7]:
fig = px.bar(train['n_metros_within_0.5km'].value_counts(), title='Distribution of housing units by number of close metro stations')
fig.update_layout(xaxis_title='Metro stations within 0.5km', yaxis_title='Volume of units', showlegend=False)
fig.show()

fig = px.bar(train['n_trains_within_0.5km'].value_counts(), title='Distribution of housing units by number of close train stations')
fig.update_layout(xaxis_title='Train stations within 0.5km', yaxis_title='Volume of units', showlegend=False)
fig.show()

# INSEE - Median revenue data by IRIS code

I also tried adding the median income (revenue) of each IRIS zone as a feature, joined on the original `code_IRIS` column. I used open INSEE data for the year 2018.

Some IRIS zones could not be found in the INSEE dataset. To work around this, I map each missing IRIS code to the closest non-missing IRIS code, where the location of an IRIS code is the average of the coordinates of all its housing units in both train and test. Strictly speaking, only the train set should be used for this (but I ran out of time here).

In practice, this feature did not improve the predictions. 
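The nearest-IRIS fallback described above could look roughly like the sketch below. It is not the code in `ext_insee`: apart from `code_IRIS`, the keys and function names are hypothetical, the income values are invented for illustration, and closeness is measured with a simple equirectangular approximation rather than a geodesic distance.

```python
import math
from collections import defaultdict


def iris_centroids(units):
    """Average the coordinates of the housing units of each IRIS code."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for u in units:
        acc = sums[u["code_IRIS"]]
        acc[0] += u["lat"]
        acc[1] += u["lon"]
        acc[2] += 1
    return {code: (s[0] / s[2], s[1] / s[2]) for code, s in sums.items()}


def approx_dist(p, q):
    """Equirectangular distance in degrees, good enough to rank close neighbours."""
    mean_lat = math.radians((p[0] + q[0]) / 2)
    return math.hypot((p[1] - q[1]) * math.cos(mean_lat), p[0] - q[0])


def fill_missing_income(centroids, income_by_iris):
    """Give every IRIS code an income, borrowing from the nearest known code."""
    known = [code for code in centroids if code in income_by_iris]
    filled = {}
    for code, center in centroids.items():
        if code not in income_by_iris:
            # Fall back to the closest IRIS code that has an income value.
            code = min(known, key=lambda k: approx_dist(centroids[k], center))
            filled_code_income = income_by_iris[code]
        else:
            filled_code_income = income_by_iris[code]
        filled[next(c for c, ctr in centroids.items() if ctr is center)] = filled_code_income
    return filled


# Toy data: IRIS code "C" has no income value and sits next to "A".
units = [
    {"code_IRIS": "A", "lat": 48.85, "lon": 2.35},
    {"code_IRIS": "A", "lat": 48.87, "lon": 2.35},
    {"code_IRIS": "B", "lat": 48.90, "lon": 2.40},
    {"code_IRIS": "C", "lat": 48.861, "lon": 2.351},
]
income_by_iris = {"A": 30000.0, "B": 20000.0}

centroids = iris_centroids(units)
filled = fill_missing_income(centroids, income_by_iris)
print(filled)
```

Passing only the train units to `iris_centroids` would keep test-set coordinates out of the feature, at the cost of having no centroid for IRIS codes that appear only in test.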

In [8]:
train, test = ext_insee.add_external_insee_revenue_data(train, test, revenus)

In [9]:
train.to_csv(cst.PREPROCESSED_NB_TRAIN_PATH, index=True)
test.to_csv(cst.PREPROCESSED_NB_TEST_PATH, index=True)