In [9]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

## Reading the data

In [8]:
data = pd.read_csv('data/pollution_us_2000_2016.csv')

## Some elementary analysis

In [6]:
data.columns

Index(['State Code', 'County Code', 'Site Num', 'Address', 'State', 'County',
       'City', 'Date Local', 'NO2 Units', 'NO2 Mean', 'NO2 1st Max Value',
       'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units', 'O3 Mean',
       'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units',
       'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI',
       'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI'],
      dtype='object')

In [10]:
data.head()

,Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,...,SO2 Units,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI
0,0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.0,9.0,21,13.0,Parts per million,1.145833,4.2,21,
1,1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.0,9.0,21,13.0,Parts per million,0.878947,2.2,23,25.0
2,2,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975,6.6,23,,Parts per million,1.145833,4.2,21,
3,3,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975,6.6,23,,Parts per million,0.878947,2.2,23,25.0
4,4,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,Parts per billion,...,Parts per billion,1.958333,3.0,22,4.0,Parts per million,0.85,1.6,23,


In [13]:
data.nunique()

Unnamed: 0           134576
State Code               47
County Code              73
Site Num                110
Address                 204
State                    47
County                  133
City                    144
Date Local             5996
NO2 Units                 1
NO2 Mean              31859
NO2 1st Max Value       990
NO2 1st Max Hour         24
NO2 AQI                 129
O3 Units                  1
O3 Mean                8196
O3 1st Max Value        134
O3 1st Max Hour          24
O3 AQI                  125
SO2 Units                 1
SO2 Mean              12736
SO2 1st Max Value       921
SO2 1st Max Hour         24
SO2 AQI                 140
CO Units                  1
CO Mean               34123
CO 1st Max Value       2698
CO 1st Max Hour          24
CO AQI                  107
dtype: int64

#### From the statistics so far, I found the following:
    - I should drop Unnamed: 0. This is just a leftover index column.
    
    - The NO2 Units, O3 Units, SO2 Units and CO Units columns add no information.
        - Each has only a single unique value (for example, "Parts per billion" for NO2 and SO2 and "Parts per million" for CO).
        - I will drop these too, since constant string columns only take up memory.
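
As a cross-check on the reasoning above, constant columns can be found programmatically rather than read off the `nunique()` output by eye. The following is a minimal sketch, not part of the original notebook: `sample` is a hypothetical toy frame whose values are copied from the `data.head()` output, and `constant_cols`, `saved_bytes` and `trimmed` are illustrative names.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; values copied from head().
sample = pd.DataFrame({
    "NO2 Units": ["Parts per billion"] * 3,
    "CO Units": ["Parts per million"] * 3,
    "NO2 Mean": [19.041667, 19.041667, 22.958333],
})

# A column with exactly one distinct value cannot distinguish any rows.
n_unique = sample.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

# Print the single value each constant column holds before dropping it,
# so the unit information can be recorded elsewhere if needed.
for col in constant_cols:
    print(col, "->", sample[col].iloc[0])

# Bytes occupied by the constant columns (deep=True counts the strings).
saved_bytes = sample[constant_cols].memory_usage(deep=True, index=False).sum()

trimmed = sample.drop(columns=constant_cols)
```

Printing the single value first is useful here because the units differ between pollutants, and that information is lost once the columns are gone.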

In [16]:
data = data.drop(['Unnamed: 0','NO2 Units','O3 Units','SO2 Units','CO Units'],axis=1)

In [17]:
data.head()

,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,...,O3 1st Max Hour,O3 AQI,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI
0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,...,10,34,3.0,9.0,21,13.0,1.145833,4.2,21,
1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,...,10,34,3.0,9.0,21,13.0,0.878947,2.2,23,25.0
2,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,...,10,34,2.975,6.6,23,,1.145833,4.2,21,
3,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,19.041667,49.0,...,10,34,2.975,6.6,23,,0.878947,2.2,23,25.0
4,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,22.958333,36.0,...,10,27,1.958333,3.0,22,4.0,0.85,1.6,23,


#### I also feel that I should drop the following columns, but I am keeping them for now
    - County Code, State Code, Site Num
        - These are just identifiers and carry no analytical significance.
        - But for now I am keeping these because I think we can use them as id selectors for selections using d3.
        
    - Address
        - This is the address of the recording station.
        - I am keeping this because I think we can pinpoint these locations on the map, and hovering over one would show an infobox with the details of the selected recording station.
        - But this will be an optional feature!

## Now, let's check NaN values in the dataset

In [18]:
data.isnull().sum()

State Code                0
County Code               0
Site Num                  0
Address                   0
State                     0
County                    0
City                      0
Date Local                0
NO2 Mean                  0
NO2 1st Max Value         0
NO2 1st Max Hour          0
NO2 AQI                   0
O3 Mean                   0
O3 1st Max Value          0
O3 1st Max Hour           0
O3 AQI                    0
SO2 Mean                  0
SO2 1st Max Value         0
SO2 1st Max Hour          0
SO2 AQI              872907
CO Mean                   0
CO 1st Max Value          0
CO 1st Max Hour           0
CO AQI               873323
dtype: int64

* That's strange! Out of the four AQI columns, only SO2 AQI and CO AQI have nulls: 872,907 and 873,323 respectively.
* But I am not going to blindly drop these rows!
* That would throw away far too much data.
* Instead, I'll now come up with a strategy to fill in these values.
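
One candidate strategy is suggested by the `data.head()` output above: the first four rows share the same station and date, and each AQI value is present in some of those rows and missing in others. The sketch below collapses such duplicates and keeps the first non-null value per column. It is only an illustration under stated assumptions, not necessarily the strategy used later: `sample` is a hypothetical toy frame built from the head values, and `keys` assumes that `Site Num` plus `Date Local` identify one station-day (on the full data, `State Code` and `County Code` would probably need to be part of the key as well).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; AQI values copied from the first five rows of head().
sample = pd.DataFrame({
    "Site Num": [3002] * 5,
    "Date Local": ["2000-01-01"] * 4 + ["2000-01-02"],
    "SO2 AQI": [13.0, 13.0, np.nan, np.nan, 4.0],
    "CO AQI": [np.nan, 25.0, np.nan, 25.0, np.nan],
})

# Assumed key for one station-day; an assumption, not confirmed by the source.
keys = ["Site Num", "Date Local"]

# How many nulls each AQI column has before collapsing.
nulls_before = sample[["SO2 AQI", "CO AQI"]].isnull().sum()

# GroupBy.first() returns the first non-null entry of each column per group,
# so a value present in any duplicate row survives the collapse.
collapsed = sample.groupby(keys, as_index=False).first()

nulls_after = collapsed[["SO2 AQI", "CO AQI"]].isnull().sum()
```

Two caveats apply. Other columns can differ between the duplicate rows (in the head output, `SO2 Mean` is 3.0 in some rows and 2.975 in others), so taking the first value silently picks one of them. And a station-day with no non-null AQI in any of its rows stays null after the collapse, so a second rule would still be needed for those cases.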