#### The goal of this notebook is to test the following hypothesis: if you build a model over a given dataset, adding new columns, even totally unrelated or random ones, is likely to bring some apparent gain in model performance.

In [2]:
import pandas as pd
import os

In [11]:
datasets = {filename: pd.read_csv(os.path.join('data', filename), sep='|') for filename in os.listdir('data')}

#### Let's check which datasets cover the years 2013, 2014, and 2015 and, for those that do, use the timestamps from these years as join keys. The goal is to avoid any containment ratio lower than one.

In [18]:
years = ['2013', '2014', '2015']
datasets_for_experiment = []
for d in datasets.keys():
    # Distinct years in this dataset: the part of each 'time' string before the first '-'
    timestamp_years = set(datasets[d]['time'].str.split('-').str[0])
    if all(year in timestamp_years for year in years):
        datasets_for_experiment.append(d)
print(datasets_for_experiment)

['311_category_taxi.csv', '311_category_Agency_Issues_added_zeros.csv', '311_category_DOH_New_License_Application_Request.csv', '311_category_SCRIE_added_zeros.csv', '311_category_SCRIE.csv', '311_category_Industrial_waste.csv', 'cyclist_killed_sum.csv', '311_category_electric_added_zeros.csv', '311_category_Illegal_parking_added_zeros.csv', '311_category_Vacant_Lot_added_zeros.csv', '311_category_Illegal_parking.csv', '311_category_consumer_complaint_added_zeros.csv', '311_category_Violation_of_Park_Rules.csv', '311_category_Enforcement.csv', '311_category_Street_light_condition.csv', '311_category_derelict.csv', '311_category_collection.csv', '311_category_Litter_basket_added_zeros.csv', '311_category_dof_added_zeros.csv', '311_category_graffiti.csv', '311_category_Paint.csv', '311_category_Construction_added_zeros.csv', '311_category_Highway_condition.csv', '311_category_DOH_New_License_Application_Request_added_zeros.csv', '13316684_pedestrians_killed_sum.csv', '311_category_sewer.
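
The containment ratio mentioned above is not defined in this notebook. A minimal sketch, assuming the usual definition (the fraction of distinct keys in one dataset that also appear in another), could look like this; `containment_ratio` and the sample timestamps are hypothetical and only illustrate the idea.

```python
def containment_ratio(query_keys, candidate_keys):
    """Fraction of distinct query keys that also appear among the candidate keys.

    Assumed definition: |Q intersect C| / |Q|. A value of 1.0 means every
    key of the query dataset finds a match in the candidate dataset.
    """
    query = set(query_keys)
    if not query:
        raise ValueError("query_keys must not be empty")
    return len(query & set(candidate_keys)) / len(query)


# Made-up timestamps, used only to exercise the function
a = ['2013-01-01', '2013-01-02', '2013-01-03']
b = ['2013-01-02', '2013-01-03', '2013-01-04']

partial = containment_ratio(a, b)    # two of the three keys in `a` appear in `b`
full = containment_ratio(a, a + b)   # every key in `a` appears in the candidate
```

Under this definition the ratio is asymmetric: `containment_ratio(a, b)` and `containment_ratio(b, a)` can differ when the two key sets have different sizes.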

#### Let's "curate" a few initial, larger datasets, along with targets for learning, based on datasets_for_experiment.

#### We know, for example, that there is a high correlation between X_killed_sum.csv and Y_injured_sum.csv.

In [30]:
from functools import reduce

injured_and_killed = [datasets['cyclist_injured_sum.csv'].rename(columns={'sum': 'cyclist_injured_sum'}), 
                      datasets['cyclist_killed_sum.csv'].rename(columns={'sum': 'cyclist_killed_sum'}), 
                      datasets['motorist_injured_sum.csv'].rename(columns={'sum': 'motorist_injured_sum'}), 
                      datasets['motorist_killed_sum.csv'].rename(columns={'sum': 'motorist_killed_sum'}), 
                      datasets['pedestrians_injured_sum.csv'].rename(columns={'sum': 'pedestrians_injured_sum'}), 
                      datasets['persons_injured_sum.csv'].rename(columns={'sum': 'persons_injured_sum'}), 
                      datasets['persons_killed_sum.csv'].rename(columns={'sum': 'persons_killed_sum'})]

# Inner-join all frames on the shared 'time' key
initial_injured_killed_dataset = reduce(lambda left, right: pd.merge(left, right, on='time'), injured_and_killed)
# Keep only rows whose timestamp starts with 2013, 2014, or 2015
initial_injured_killed_dataset = initial_injured_killed_dataset[
    initial_injured_killed_dataset['time'].str.match('2013|2014|2015')]

initial_injured_killed_dataset.head()

Unnamed: 0,time,cyclist_injured_sum,cyclist_killed_sum,motorist_injured_sum,motorist_killed_sum,pedestrians_injured_sum,persons_injured_sum,persons_killed_sum
182,2013-01-01,3,0,106,0,28,137,0
183,2013-01-02,2,0,79,1,19,100,1
184,2013-01-03,10,0,65,0,39,114,0
185,2013-01-04,6,0,48,1,31,85,3
186,2013-01-05,4,0,68,3,27,99,4
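
The correlation between the killed and injured counts, stated earlier as known, can be checked directly on a merged frame. The sketch below assumes Pearson correlation is the measure of interest and uses a small made-up frame in place of initial_injured_killed_dataset; `correlation_with` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for initial_injured_killed_dataset; the values are made up
toy = pd.DataFrame({
    'time': ['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04'],
    'cyclist_injured_sum': [1, 2, 3, 4],
    'persons_injured_sum': [10, 20, 30, 40],
    'cyclist_killed_sum': [0, 1, 0, 1],
})


def correlation_with(df, target):
    """Pearson correlation of every other numeric column with `target`,
    ordered by absolute strength (strongest first)."""
    corr = df.select_dtypes(include='number').corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


ranked = correlation_with(toy, 'persons_injured_sum')
print(ranked)
```

Calling the same helper on the real merged frame, with one of its columns as `target`, would show which of the other columns move most closely with it.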


#### For each column in initial_injured_killed_dataset, we'll build a model to predict its values based on growing combinations of the other attributes.
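
One way to read "growing combinations" is: for a given target column, enumerate every subset of the remaining attributes, smallest subsets first. The sketch below assumes that reading and also assumes the 'time' key is excluded from the features; `growing_feature_sets` is a hypothetical helper, and the model fitting itself is left out.

```python
from itertools import combinations


def growing_feature_sets(columns, target, key='time'):
    """Yield (target, features) pairs, with feature subsets of increasing size.

    The target itself and the join key are never used as features.
    """
    candidates = [c for c in columns if c not in (target, key)]
    for size in range(1, len(candidates) + 1):
        for features in combinations(candidates, size):
            yield target, list(features)


# A shortened column list, taken from the merged dataset above
columns = ['time', 'cyclist_injured_sum', 'cyclist_killed_sum', 'motorist_injured_sum']
plans = list(growing_feature_sets(columns, 'cyclist_killed_sum'))
for target, features in plans:
    print(target, '<-', features)
```

With n candidate attributes this produces 2^n - 1 feature sets per target, so for the seven value columns here each target would get 63 models. If the intent is instead to add one attribute at a time along a single path, a simple loop over prefixes of the candidate list would replace the call to `combinations`.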