## Cleaning Spray Dataset

In this notebook, we will clean the spray dataset. 

In [None]:
# import libraries

import datetime
import os
import re
import time

import numpy as np
import pandas as pd

In [None]:
# file paths

input_path = '../data/2_input/'
clean_path = '../data/3_clean/'
output_path = '../data/4_output/'

image_path = '../images/'

### Import and Inspect Data

The spray dataset has 14,835 entries and 4 columns: `Date`, `Time`, `Latitude` and `Longitude`. The `Time` column has 584 null values, all of which come from sprays recorded on 2011-09-07. We drop the `Time` column because, given how frequently spraying occurs, the time of day of each spray is unlikely to matter for our analysis.

In [None]:
spray = pd.read_csv(input_path+'spray.csv')


In [None]:
spray.shape

In [None]:
spray.head()

In [None]:
spray.isnull().sum()

In [None]:
spray[spray.Time.isnull()]
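
Rather than reading the filtered rows by eye, the claim that every missing `Time` falls on a single date can be checked by counting null rows per date. The following is a minimal sketch on a small hypothetical stand-in for `spray.csv` (the `sample` frame and its values are illustrative only, not taken from the real file); the same two lines should work on `spray` itself.

```python
import pandas as pd

# Hypothetical miniature stand-in for spray.csv (illustrative values only).
sample = pd.DataFrame({
    'Date': ['2011-08-29', '2011-09-07', '2011-09-07', '2013-07-17'],
    'Time': ['6:56:58 PM', None, None, '7:45:45 PM'],
    'Latitude': [42.39, 41.98, 41.99, 41.99],
    'Longitude': [-88.09, -87.79, -87.80, -87.79],
})

# Keep only rows with a missing Time, then count them per spray date.
null_time_by_date = sample.loc[sample['Time'].isnull(), 'Date'].value_counts()
print(null_time_by_date)
```

If the result has a single index entry, all missing times belong to that one date.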

### Clean Data

As with the train, test and weather datasets, we split the `Date` column into `year_spray`, `month_spray` and `day_spray` columns so that seasonality is easier to analyse later on. The original `Date` and `Time` columns are dropped and the remaining column names are lowercased. We then save the processed spray data to the clean data folder.

In [None]:
# Dates are 'YYYY-MM-DD' strings, so each helper returns its part as a string.
def create_yr(x):
    return x.split('-')[0]

def create_mth(x):
    return x.split('-')[1]

def create_day(x):
    return x.split('-')[2]

def rename_columns(columns):
    return [column.lower() for column in columns]

def clean_data(df):
    # Note: this modifies df in place and also returns it.
    df['year_spray'] = df.Date.apply(create_yr)
    df['month_spray'] = df.Date.apply(create_mth)
    df['day_spray'] = df.Date.apply(create_day)
    df.drop(['Date', 'Time'], axis=1, inplace=True)
    df.columns = rename_columns(df.columns)
    return df

spray = clean_data(spray)

In [None]:
spray.head()
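
As an aside, the same split can be done without row-wise `apply` by parsing the dates once with `pd.to_datetime` and using the `.dt` accessor. This is only a sketch of an alternative, not the approach used in this notebook: `clean_data_vectorized` and `sample` are hypothetical names, it assumes dates in `YYYY-MM-DD` form, and unlike `clean_data` it returns integers (9 rather than `'09'`) and leaves the input frame untouched.

```python
import pandas as pd

def clean_data_vectorized(df):
    """Hypothetical alternative to clean_data; returns a new DataFrame."""
    # Parse the date strings once; assumes the 'YYYY-MM-DD' format.
    dates = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    # Drop the original columns on a copy so the input is not modified.
    out = df.drop(columns=['Date', 'Time']).copy()
    out['year_spray'] = dates.dt.year
    out['month_spray'] = dates.dt.month
    out['day_spray'] = dates.dt.day
    out.columns = [column.lower() for column in out.columns]
    return out

# Hypothetical two-row stand-in for the raw spray data.
sample = pd.DataFrame({
    'Date': ['2011-08-29', '2011-09-07'],
    'Time': ['6:56:58 PM', None],
    'Latitude': [42.39, 41.98],
    'Longitude': [-88.09, -87.79],
})

cleaned = clean_data_vectorized(sample)
print(cleaned)
```

Whether integer or string date parts are preferable depends on how the later analysis groups the data; the column order matches what `clean_data` produces.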

### Output Data

In [None]:
spray.to_csv(clean_path+'spray_clean.csv', index = False)
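
One thing to keep in mind when the clean file is loaded again: `clean_data` stores the date parts as zero-padded strings, but `pd.read_csv` infers numeric types by default, so `'09'` comes back as the integer `9` unless a `dtype` is given. The sketch below illustrates this with an in-memory buffer instead of the real `spray_clean.csv`; the one-row `cleaned` frame is hypothetical.

```python
import io

import pandas as pd

# Hypothetical one-row frame shaped like the cleaned spray data.
cleaned = pd.DataFrame({
    'latitude': [41.98],
    'longitude': [-87.79],
    'year_spray': ['2011'],
    'month_spray': ['09'],
    'day_spray': ['07'],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default read: date parts are inferred as integers.
buffer.seek(0)
as_int = pd.read_csv(buffer)

# Explicit dtype: date parts stay as zero-padded strings.
buffer.seek(0)
date_cols = {'year_spray': str, 'month_spray': str, 'day_spray': str}
as_str = pd.read_csv(buffer, dtype=date_cols)

print(as_int.dtypes)
print(as_str.dtypes)
```

Either representation works for seasonality analysis, as long as later notebooks are consistent about which one they expect.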