# Appendix i - Cleaning: Spray Data

Description: Cleaning the spray data and preparing it for potential merging with our other datasets.

*Note: We decided to merge the date and time columns into a new pandas datetime column.
In the end we opted not to use the spray data in the modeling process. The data was used to plot
the frequency of spraying by year and location, which was taken into account in the cost-benefit analysis.*

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('../data/spray.csv')

# Checking the shape

In [4]:
df.shape

(14835, 4)

# Inspecting the columns

In [5]:
df.head()

Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,6:56:58 PM,42.391623,-88.089163
1,2011-08-29,6:57:08 PM,42.391348,-88.089163
2,2011-08-29,6:57:18 PM,42.391022,-88.089157
3,2011-08-29,6:57:28 PM,42.390637,-88.089158
4,2011-08-29,6:57:38 PM,42.39041,-88.088858


# Dropping NaN values

In [6]:
df.dropna(inplace=True)

In [7]:
df.isnull().sum()

Date         0
Time         0
Latitude     0
Longitude    0
dtype: int64

In [8]:
df.shape

(14251, 4)
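
Comparing the two shapes, `dropna` removed 14835 - 14251 = 584 rows. The sketch below shows the same check on a small made-up frame. The values and the choice of which column holds the missing entry are assumptions for illustration only, not taken from `spray.csv`.

```python
import pandas as pd

# Toy frame standing in for the spray data (values are illustrative only).
toy = pd.DataFrame({
    "Date": ["2011-08-29", "2011-09-07", "2011-09-07"],
    "Time": ["6:56:58 PM", None, "7:44:32 PM"],  # one missing entry, chosen arbitrarily
    "Latitude": [42.391623, 41.983917, 41.986460],
    "Longitude": [-88.089163, -87.793088, -87.794225],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isnull().sum()

# dropna() removes every row that contains at least one missing value.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"rows dropped: {rows_dropped}")
```

Running `isnull().sum()` before the drop, rather than only after, shows which column is responsible for the lost rows.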

# Making a new datetime column
Here I made a datetime column and converted it to the pandas datetime type.

In [10]:
df["DateTime"] = df.Date + " " + df.Time

In [11]:
df.DateTime.head()

0    2011-08-29 6:56:58 PM
1    2011-08-29 6:57:08 PM
2    2011-08-29 6:57:18 PM
3    2011-08-29 6:57:28 PM
4    2011-08-29 6:57:38 PM
Name: DateTime, dtype: object

In [12]:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14251 entries, 0 to 14834
Data columns (total 5 columns):
Date         14251 non-null object
Time         14251 non-null object
Latitude     14251 non-null float64
Longitude    14251 non-null float64
DateTime     14251 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 668.0+ KB
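
The call above lets pandas infer the timestamp layout. As a hedged alternative, the layout can be passed explicitly so that an unexpected string raises an error instead of being guessed at. The format string below is an assumption based on the sample values shown above (`2011-08-29 6:56:58 PM`), not something the notebook itself uses.

```python
import pandas as pd

# Two sample strings copied from the DateTime preview above.
sample = pd.Series(["2011-08-29 6:56:58 PM", "2011-08-29 6:57:08 PM"])

# %I is the 12-hour clock hour and %p the AM/PM marker (assumed layout).
parsed = pd.to_datetime(sample, format="%Y-%m-%d %I:%M:%S %p")

print(parsed)
```

The parsed values come out on a 24-hour clock, which matches the `18:56:58` style seen once the conversion has run.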


In [13]:
df.drop(['Time'], axis=1, inplace=True)


# Checking the columns


In [14]:
df.head()

Unnamed: 0,Date,Latitude,Longitude,DateTime
0,2011-08-29,42.391623,-88.089163,2011-08-29 18:56:58
1,2011-08-29,42.391348,-88.089163,2011-08-29 18:57:08
2,2011-08-29,42.391022,-88.089157,2011-08-29 18:57:18
3,2011-08-29,42.390637,-88.089158,2011-08-29 18:57:28
4,2011-08-29,42.39041,-88.088858,2011-08-29 18:57:38


# Checking the types
Double-checking the DateTime column type.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14251 entries, 0 to 14834
Data columns (total 4 columns):
Date         14251 non-null object
Latitude     14251 non-null float64
Longitude    14251 non-null float64
DateTime     14251 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 556.7+ KB


# Checking the unique values
Here I wanted to see the range of dates in the data.

In [16]:
df.DateTime.unique()

array(['2011-08-29T18:56:58.000000000', '2011-08-29T18:57:08.000000000',
       '2011-08-29T18:57:18.000000000', ...,
       '2013-09-05T20:35:21.000000000', '2013-09-05T20:35:31.000000000',
       '2013-09-05T20:35:41.000000000'], dtype='datetime64[ns]')
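
Since `unique()` truncates its output, the range of dates is easier to read from the minimum and maximum. This is a minimal sketch on three timestamps copied from the array above; on the real data the same calls would apply to `df.DateTime`.

```python
import pandas as pd

# Three timestamps copied from the unique() output above.
stamps = pd.to_datetime(pd.Series([
    "2011-08-29 18:56:58",
    "2011-08-29 18:57:08",
    "2013-09-05 20:35:41",
]))

# Earliest and latest timestamps give the overall range directly.
first, last = stamps.min(), stamps.max()

# normalize() resets the time to midnight, so nunique() counts calendar days.
n_days = stamps.dt.normalize().nunique()

print(first, last, n_days)
```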

# Inspecting the duplicated values


In [17]:
df[df.duplicated()]

Unnamed: 0,Date,Latitude,Longitude,DateTime
485,2011-09-07,41.983917,-87.793088,2011-09-07 19:43:40
490,2011-09-07,41.986460,-87.794225,2011-09-07 19:44:32
491,2011-09-07,41.986460,-87.794225,2011-09-07 19:44:32
492,2011-09-07,41.986460,-87.794225,2011-09-07 19:44:32
493,2011-09-07,41.986460,-87.794225,2011-09-07 19:44:32
494,2011-09-07,41.986460,-87.794225,2011-09-07 19:44:32
495,2011-09-07,41.986460,-87.794225,2011-09-07 19:44:32
496,2011-09-07,41.986460,-87.794225,2011-09-07 19:44:32
497,2011-09-07,41.986460,-87.794225,2011-09-07 19:44:32
498,2011-09-07,41.986460,-87.794225,2011-09-07 19:44:32
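
The notebook only inspects the duplicates and does not remove them. As a hedged sketch of how they could be counted and dropped, the toy frame below repeats one of the rows shown above three times. The number of repeats and the frame itself are assumptions for illustration.

```python
import pandas as pd

# One row from the output above repeated three times, plus one distinct row.
toy = pd.DataFrame({
    "Date": ["2011-09-07"] * 4,
    "Latitude": [41.986460, 41.986460, 41.986460, 41.983917],
    "Longitude": [-87.794225, -87.794225, -87.794225, -87.793088],
    "DateTime": pd.to_datetime(["2011-09-07 19:44:32"] * 3 + ["2011-09-07 19:43:40"]),
})

# duplicated() flags every repeat after the first occurrence.
dup_mask = toy.duplicated()

# keep=False flags all copies, including the first one.
all_copies = toy.duplicated(keep=False)

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates()

print(dup_mask.sum(), all_copies.sum(), len(deduped))
```

Whether to drop these rows depends on whether repeated records reflect real spraying or a logging error, which the data alone does not settle.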


In [17]:
df.Date.unique()

array(['2011-08-29', '2011-09-07', '2013-07-17', '2013-07-25',
       '2013-08-08', '2013-08-15', '2013-08-16', '2013-08-22',
       '2013-08-29', '2013-09-05'], dtype=object)
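
The note at the top mentions looking at spray frequency by year. A rough sketch of that count, using the ten unique dates listed above, is shown below. It counts spray days per year, not individual spray records, and is not the plotting code used in the analysis.

```python
import pandas as pd

# The ten unique spray dates from the output above.
spray_dates = pd.to_datetime(pd.Series([
    "2011-08-29", "2011-09-07", "2013-07-17", "2013-07-25",
    "2013-08-08", "2013-08-15", "2013-08-16", "2013-08-22",
    "2013-08-29", "2013-09-05",
]))

# Extract the year and count how many spray days fall in each one.
days_per_year = spray_dates.dt.year.value_counts().sort_index()

print(days_per_year)
```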

# Exporting the data as a csv

In [18]:
df.to_csv('../data/updated_spray.csv')
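
Because `to_csv` writes the index by default, reading the file back with a plain `read_csv` produces an extra `Unnamed: 0` column, and the `DateTime` column comes back as text. The sketch below shows one way to reload it cleanly. It uses an in-memory string instead of `../data/updated_spray.csv`, and the toy rows are copied from the outputs above.

```python
import io

import pandas as pd

# Two rows with a non-contiguous index, as is left behind after dropna().
toy = pd.DataFrame(
    {
        "Date": ["2011-08-29", "2011-09-07"],
        "Latitude": [42.391623, 41.983917],
        "Longitude": [-88.089163, -87.793088],
        "DateTime": pd.to_datetime(["2011-08-29 18:56:58", "2011-09-07 19:43:40"]),
    },
    index=[0, 2],
)

# With no path given, to_csv() returns the CSV text (index included by default).
csv_text = toy.to_csv()

# A plain read turns the saved index into a column named "Unnamed: 0".
plain = pd.read_csv(io.StringIO(csv_text))

# Restoring the index and parsing the dates recovers the original structure.
reloaded = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["DateTime"])

print(plain.columns.tolist())
print(reloaded.dtypes)
```

If the original row labels are not needed downstream, passing `index=False` to `to_csv` is a simpler option.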