# Creating Negative Class

Because the crime data contains only positive instances (recorded crimes), there is no negative class. Creating one lets me use classification models to predict violent crime occurrences for 2017.

## Importing python libraries and dataframes

In [104]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

In [77]:
crime_2016 = pd.read_csv('../Data/crime_2016.csv', index_col = 0)
weather_2016 = pd.read_csv('../Data/weather_2016.csv', index_col = 0)
crime_2017 = pd.read_csv('../Data/crime_2017.csv', index_col = 0)
weather_2017 = pd.read_csv('../Data/weather_2017.csv', index_col = 0)

## Creating hours dataframe

We first need a dataframe with one row for every hour of the year, built with `timedelta` and `datetime`. The crime data is recorded at hourly granularity, while the weather data is merged on month and day only, so the weather data needs an hour feature before the two can be joined. Merging the hours dataframe with the weather data gives every hour its weather values and allows a clean merge with the crime data on the time features. The merged dataframes will be our final train and test datasets (2016 is train and 2017 is test).

In [3]:
def daterange(start_date, end_date):
    delta = timedelta(hours=1)
    while start_date < end_date:
        yield start_date
        start_date += delta
        
start_date = datetime(2016, 1, 1, 0, 00)
end_date = datetime(2017, 1, 1, 0, 00)
dates_list = []
for single_date in daterange(start_date, end_date):
    dates_list.append(single_date.strftime("%Y-%m-%d %H:%M"))

In [4]:
hours_df = pd.DataFrame(dates_list, columns=['date'])

In [5]:
hours_df['date'] = pd.to_datetime(hours_df['date'])
hours_df['hr'] = hours_df.date.dt.hour
hours_df['da'] = hours_df.date.dt.day
hours_df['mo'] = hours_df.date.dt.month
hours_df.head()

Unnamed: 0,date,hr,da,mo
0,2016-01-01 00:00:00,0,1,1
1,2016-01-01 01:00:00,1,1,1
2,2016-01-01 02:00:00,2,1,1
3,2016-01-01 03:00:00,3,1,1
4,2016-01-01 04:00:00,4,1,1
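The same hourly frame can also be built with `pd.date_range` instead of a generator loop. The sketch below is an alternative, not the code used in this notebook; `build_hours_df` is a hypothetical helper name. Taking the year as a parameter makes it easy to build a separate frame for 2017, which matters because 2016 is a leap year and has 24 more hours.

```python
import pandas as pd

def build_hours_df(year):
    """Hypothetical helper: one row per hour of `year`, with hr/da/mo columns."""
    one_hour = pd.Timedelta(hours=1)
    dates = pd.date_range(
        start=pd.Timestamp(year=year, month=1, day=1),
        # Stop one hour before the next year begins
        end=pd.Timestamp(year=year + 1, month=1, day=1) - one_hour,
        freq=one_hour,
    )
    hours = pd.DataFrame({"date": dates})
    hours["hr"] = hours["date"].dt.hour
    hours["da"] = hours["date"].dt.day
    hours["mo"] = hours["date"].dt.month
    return hours

hours_2016 = build_hours_df(2016)  # leap year: 366 * 24 rows
hours_2017 = build_hours_df(2017)  # 365 * 24 rows
```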


## Merging hours dataframe with weather data

In [78]:
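# Give every hour the weather values of its calendar day (weather is keyed on month and day)
hourly_weather_2016 = hours_df.merge(weather_2016, how='left', left_on=['mo', 'da'], right_on=['a_mo', 'a_da'])
# NOTE: hours_df as built above covers 2016, a leap year; this merge assumes it has been rebuilt for 2017 first
hourly_weather_2017 = hours_df.merge(weather_2017, how='left', left_on=['mo', 'da'], right_on=['a_mo', 'a_da'])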

In [79]:
hourly_weather_2016.head()

Unnamed: 0,date,hr,da,mo,a_stn,a_wban,a_year,a_mo,a_da,a_temp,...,b_wban,b_name,b_country,b_state,b_call,b_lat,b_lon,b_elev,b_begin,b_end
0,2016-01-01 00:00:00,0,1,1,725300,94846,2016,1,1,23.4,...,94846,CHICAGO O'HARE INTERNATIONAL,US,IL,KORD,41.995,-87.934,201.8,19461001,20181015
1,2016-01-01 01:00:00,1,1,1,725300,94846,2016,1,1,23.4,...,94846,CHICAGO O'HARE INTERNATIONAL,US,IL,KORD,41.995,-87.934,201.8,19461001,20181015
2,2016-01-01 02:00:00,2,1,1,725300,94846,2016,1,1,23.4,...,94846,CHICAGO O'HARE INTERNATIONAL,US,IL,KORD,41.995,-87.934,201.8,19461001,20181015
3,2016-01-01 03:00:00,3,1,1,725300,94846,2016,1,1,23.4,...,94846,CHICAGO O'HARE INTERNATIONAL,US,IL,KORD,41.995,-87.934,201.8,19461001,20181015
4,2016-01-01 04:00:00,4,1,1,725300,94846,2016,1,1,23.4,...,94846,CHICAGO O'HARE INTERNATIONAL,US,IL,KORD,41.995,-87.934,201.8,19461001,20181015


Now that we have weather values for each hour of 2016, we can add the beat label as a new column, repeating the full set of hours once per beat. This gives one row per beat per hour, so each crime record can be matched to the correct beat and hour when merging. The features that will be merged on are month, day, hour, and beat.

## Adding beat label to dataframe

In [86]:
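import csv

filename = '/Users/blisspaik/notebooks/DSI-US-5/capstone/Data/cumulative-train.csv'
# Create the column up front so the header row written below includes 'beat_label'
hourly_weather_2016['beat_label'] = None
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(hourly_weather_2016.columns)

# Append one full copy of the hourly weather data per beat
for beat_label in crime_2016.beat.value_counts().index:
    hourly_weather_2016['beat_label'] = beat_label
    hourly_weather_2016.to_csv(filename, mode='a', index=False, header=False)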

This cell writes the header row to a csv file, then appends one copy of the hourly weather data for each beat, with the beat label column set to that beat. The resulting file has one row per beat per hour, and we can read it back in as a new dataframe.
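The same beat-by-hour grid can be built in memory with a cross join instead of appending to a file. This is a minimal sketch on toy data, not the approach used in this notebook; `hourly_toy` and `beats_toy` are hypothetical stand-ins for the hourly weather dataframe and the unique beat labels. Joining on a constant key pairs every hourly row with every beat.

```python
import pandas as pd

# Toy stand-ins (assumed for illustration only)
hourly_toy = pd.DataFrame({
    "mo": [1, 1],
    "da": [1, 1],
    "hr": [0, 1],
    "a_temp": [23.4, 23.4],
})
beats_toy = pd.DataFrame({"beat_label": [111, 112, 113]})

# Cross join: a constant key matches every hourly row with every beat
grid = (
    hourly_toy.assign(_key=1)
    .merge(beats_toy.assign(_key=1), on="_key")
    .drop(columns="_key")
)
```

For the full data this holds every row in memory at once (about 2.4 million rows per year here), which is the trade-off against writing to disk beat by beat.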

In [87]:
train_df = pd.read_csv('../Data/cumulative-train.csv')

In [88]:
train_df.shape

(2406816, 18)

We need to do the same for the test data.

In [89]:
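import csv

filename = '/Users/blisspaik/notebooks/DSI-US-5/capstone/Data/cumulative-test.csv'
# Create the column up front so the header row written below includes 'beat_label'
hourly_weather_2017['beat_label'] = None
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    # write the header
    writer.writerow(hourly_weather_2017.columns)

# Append one full copy of the hourly weather data per beat
for beat_label in crime_2017.beat.value_counts().index:
    hourly_weather_2017['beat_label'] = beat_label
    hourly_weather_2017.to_csv(filename, mode='a', index=False, header=False)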

In [90]:
test_df = pd.read_csv('../Data/cumulative-test.csv')

In [91]:
test_df.shape

(2400240, 18)

## Merging crime and weather dataframes

We first add a target column filled with 1's to the crime data. After a left merge onto the beat-and-hour dataframes, each row has either a 1 or a NaN in the target column. The NaN's mark beat-hours in which no crime was recorded. Replacing those nulls with 0's gives us the negative class.

In [95]:
crime_2016['target'] = 1
crime_2017['target'] = 1

In [96]:
train_df = train_df.merge(crime_2016[['beat', 'a_mo', 'a_da', 'a_hour', 'target']], how='left', left_on=['beat_label', 'mo', 'da', 'hr'], 
                       right_on=['beat', 'a_mo', 'a_da', 'a_hour'])
test_df = test_df.merge(crime_2017[['beat', 'a_mo', 'a_da', 'a_hour', 'target']], how='left', left_on=['beat_label', 'mo', 'da', 'hr'], 
                       right_on=['beat', 'a_mo', 'a_da', 'a_hour'])

In [110]:
train_df.shape, test_df.shape

((2410013, 17), (2403351, 17))
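The merged frames have more rows than the frames before the merge (2,410,013 versus 2,406,816 for train). That is what a left merge produces when several crime records share the same beat and hour. Whether to keep those repeated rows is a modelling choice. The toy sketch below, with made-up data and hypothetical names (`grid`, `crime_toy`), shows the effect and one way to keep a single row per beat-hour by dropping duplicate keys before merging.

```python
import pandas as pd

# Toy beat-hour grid: two beats, two hours
grid = pd.DataFrame({
    "beat_label": [111, 111, 112, 112],
    "mo": [1, 1, 1, 1],
    "da": [1, 1, 1, 1],
    "hr": [0, 1, 0, 1],
})
# Two crimes in beat 111 at hour 0, one crime in beat 112 at hour 1
crime_toy = pd.DataFrame({
    "beat": [111, 111, 112],
    "a_mo": [1, 1, 1],
    "a_da": [1, 1, 1],
    "a_hour": [0, 0, 1],
    "target": [1, 1, 1],
})
left_keys = ["beat_label", "mo", "da", "hr"]
right_keys = ["beat", "a_mo", "a_da", "a_hour"]

# Plain left merge: the beat-hour with two crimes appears twice
merged_raw = grid.merge(crime_toy, how="left", left_on=left_keys, right_on=right_keys)

# Deduplicate the crime keys first: exactly one row per beat-hour
merged_unique = grid.merge(
    crime_toy.drop_duplicates(subset=right_keys),
    how="left", left_on=left_keys, right_on=right_keys,
)
# Unmatched beat-hours have NaN targets; these become the negative class
merged_unique["target"] = merged_unique["target"].fillna(0).astype(int)
```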

## Cleaning up redundant columns

In [99]:
for df in [train_df, test_df]:
    df.drop(['beat', 'a_mo_x', 'a_da_x', 'a_mo_y', 'a_da_y', 'a_hour'], axis=1, inplace=True)
    df.sort_values(['a_year', 'mo', 'da', 'hr', 'beat_label'], inplace=True)
    df.fillna(0, inplace = True)

I dropped all columns that were duplicated in the merge, and replaced all nulls with 0's. Now that we have both classes, we are one step closer to modeling.
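One thing to keep in mind is that `fillna(0)` on the whole dataframe replaces nulls in every column, not only in the target. If a weather column had a missing value, it would become 0 as well. Whether that happens in this data is not shown here, so the sketch below uses made-up values (`merged_toy` is a hypothetical name) to contrast filling everything with filling only the target column.

```python
import numpy as np
import pandas as pd

# Toy merged frame: one missing temperature, two beat-hours without crime
merged_toy = pd.DataFrame({
    "a_temp": [23.4, np.nan, 30.1],
    "target": [1.0, np.nan, np.nan],
})

# Filling the whole frame also turns the missing temperature into 0
filled_all = merged_toy.fillna(0)

# Filling only the target leaves other nulls visible for separate handling
filled_target = merged_toy.copy()
filled_target["target"] = filled_target["target"].fillna(0).astype(int)
```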

## Saving dataframes

In [103]:
train_df.to_csv('../Data/train_df.csv')
test_df.to_csv('../Data/test_df.csv')