### Cleaning Spray Dataset

In this notebook, we will clean the spray dataset.

In [1]:
# Import libraries

import numpy as np
import pandas as pd
import time
import datetime
import re
import os

#### Import and Inspect Data

The spray dataset has 14,835 entries and 4 columns: Date, Time, Latitude and Longitude. The Time column has 584 null values, all of which come from spray data recorded on 2011-09-07. We decided to drop the Time column because sprays are logged so frequently (the first entries are only ten seconds apart) that the exact time of each spray is unlikely to be critical for our analysis.

In [2]:
spray = pd.read_csv('./datasets/spray.csv')

In [3]:
spray.head()

Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,6:56:58 PM,42.391623,-88.089163
1,2011-08-29,6:57:08 PM,42.391348,-88.089163
2,2011-08-29,6:57:18 PM,42.391022,-88.089157
3,2011-08-29,6:57:28 PM,42.390637,-88.089158
4,2011-08-29,6:57:38 PM,42.39041,-88.088858


In [4]:
spray.shape

(14835, 4)

In [5]:
spray.isnull().sum()

Date           0
Time         584
Latitude       0
Longitude      0
dtype: int64

In [6]:
spray[spray.Time.isnull()]

Unnamed: 0,Date,Time,Latitude,Longitude
1030,2011-09-07,,41.987092,-87.794286
1031,2011-09-07,,41.987620,-87.794382
1032,2011-09-07,,41.988004,-87.794574
1033,2011-09-07,,41.988292,-87.795486
1034,2011-09-07,,41.988100,-87.796014
...,...,...,...,...
1609,2011-09-07,,41.995876,-87.811615
1610,2011-09-07,,41.995972,-87.810271
1611,2011-09-07,,41.995684,-87.810319
1612,2011-09-07,,41.994724,-87.810415
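
The observation that every missing Time value falls on a single date can be checked directly instead of by eye. The sketch below is a minimal illustration on a four-row sample copied from the outputs above. The names `sample`, `null_dates` and `all_on_one_date` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Small stand-in for the spray data, using rows shown in the outputs above.
sample = pd.DataFrame({
    'Date': ['2011-08-29', '2011-08-29', '2011-09-07', '2011-09-07'],
    'Time': ['6:56:58 PM', '6:57:08 PM', np.nan, np.nan],
    'Latitude': [42.391623, 42.391348, 41.987092, 41.987620],
    'Longitude': [-88.089163, -88.089163, -87.794286, -87.794382],
})

# Count the missing Time values per spray date.
null_dates = sample.loc[sample['Time'].isnull(), 'Date'].value_counts()
print(null_dates)

# True when every missing Time value belongs to one date.
all_on_one_date = null_dates.size == 1
print(all_on_one_date)
```

Run against the full `spray` DataFrame, the same `value_counts()` call would list every date that has missing times along with the number of affected rows.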


#### Clean Data

As with the train, test and weather datasets, we split the Date column into year, month and day columns to make it easier to analyse seasonality later on. We then save the processed spray data to a separate folder.

In [7]:
# Date is a 'YYYY-MM-DD' string, so each part can be taken by position.
def create_yr(x):
    return x.split('-')[0]

def create_mth(x):
    return x.split('-')[1]

def create_day(x):
    return x.split('-')[2]

def rename_columns(columns):
    # Lower-case every column name.
    return [column.lower() for column in columns]

def clean_data(df):
    # Split Date into year, month and day columns.
    df['year_spray'] = df.Date.apply(create_yr)
    df['month_spray'] = df.Date.apply(create_mth)
    df['day_spray'] = df.Date.apply(create_day)
    # Drop the original Date column and the Time column, which has nulls.
    df.drop(['Date', 'Time'], axis=1, inplace=True)
    df.columns = rename_columns(df.columns)
    return df
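
As an alternative to splitting the date string by hand, the same columns could be derived by parsing Date once with `pd.to_datetime` and reading the parts through the `.dt` accessor. This is only a sketch, assuming every Date value follows the `YYYY-MM-DD` format. The function name `clean_data_vectorised` is hypothetical, and unlike `clean_data` it returns integer date parts and leaves the input DataFrame unmodified.

```python
import pandas as pd

def clean_data_vectorised(df):
    """Hypothetical variant of clean_data that parses Date once."""
    out = df.copy()  # work on a copy so the caller's DataFrame is untouched
    dates = pd.to_datetime(out['Date'], format='%Y-%m-%d')
    out['year_spray'] = dates.dt.year    # integers, not strings
    out['month_spray'] = dates.dt.month
    out['day_spray'] = dates.dt.day
    out = out.drop(columns=['Date', 'Time'])
    out.columns = [column.lower() for column in out.columns]
    return out

# Two rows copied from the outputs above, one with a missing Time value.
sample = pd.DataFrame({
    'Date': ['2011-08-29', '2011-09-07'],
    'Time': ['6:56:58 PM', None],
    'Latitude': [42.391623, 41.987092],
    'Longitude': [-88.089163, -87.794286],
})

cleaned = clean_data_vectorised(sample)
print(cleaned)
```

Parsing with an explicit `format` also raises an error on a malformed date, whereas the string split would silently pass it through.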

In [13]:
# Drop the Time column, which contains NaN values (likely due to human error), and split Date into year, month and day.
spray = clean_data(spray)

In [9]:
spray.head()

Unnamed: 0,latitude,longitude,year_spray,month_spray,day_spray
0,42.391623,-88.089163,2011,8,29
1,42.391348,-88.089163,2011,8,29
2,42.391022,-88.089157,2011,8,29
3,42.390637,-88.089158,2011,8,29
4,42.39041,-88.088858,2011,8,29


In [10]:
spray.isnull().sum()

latitude       0
longitude      0
year_spray     0
month_spray    0
day_spray      0
dtype: int64

In [11]:
spray.shape

(14835, 5)

#### Output Data

In [14]:
spray.to_csv('./clean data/spray_clean.csv', index = False)
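
`to_csv` does not create missing folders, so the write above fails if `./clean data` does not exist yet. The sketch below shows one way to guard against that and to confirm the file round-trips. It uses a temporary directory as a stand-in for the notebook's output folder and a two-row sample as a stand-in for the cleaned data; the names `cleaned`, `out_dir`, `out_path` and `reloaded` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for the cleaned spray data, with string date parts.
cleaned = pd.DataFrame({
    'latitude': [42.391623, 42.391348],
    'longitude': [-88.089163, -88.089163],
    'year_spray': ['2011', '2011'],
    'month_spray': ['08', '08'],
    'day_spray': ['29', '29'],
})

# A temporary directory stands in for the notebook's './clean data' folder.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, 'clean data')
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    out_path = os.path.join(out_dir, 'spray_clean.csv')
    cleaned.to_csv(out_path, index=False)

    # Read the file back to confirm the shape survived the round trip.
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
print(reloaded.dtypes)
```

Note that string date parts such as `'08'` are inferred as integers when the CSV is read back with default settings, so downstream notebooks see `8` rather than `'08'`.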