In [43]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

We have taken data on the particulates in the air in London from [www.londonair.org.uk](https://www.londonair.org.uk).

In particular, we chose data on:
* Nitric oxide (NO)
* Nitrogen dioxide (NO2)
* Oxides of nitrogen (NOX)
* PM2.5 particulate (PM2.5)
* Sulphur dioxide (SO2)

(all measured in ug/m3).

The air was sampled at Haringey Town Hall at 15-minute intervals between 1st June 2018 and 30th June 2018.

[Query entered here](https://www.londonair.org.uk/london/asp/datasite.asp?CBXSpecies1=NOm&CBXSpecies2=NO2m&CBXSpecies3=NOXm&CBXSpecies5=PM25m&CBXSpecies6=SO2m&day1=1&month1=jun&year1=2018&day2=30&month2=jun&year2=2018&period=15min&ratidate=&site=HG1&res=6&Submit=Replot+graph)

In [44]:
df = pd.read_csv('https://www.londonair.org.uk/london/asp/downloadsite.asp?site=HG1&species1=NOm&species2=NO2m&species3=NOXm&species4=PM25m&species5=SO2m&species6=&start=1-jun-2018&end=30-jun-2018&res=6&period=15min&units=ugm3')

## Save the raw data, and point Dotscience to it.

In [45]:
### DOTSCIENCE TO DO: add input dataset.

This will put the data under version control.

In [46]:
df.to_csv("input_data/raw_air_data_harringey.csv")

Let's take a look at the data:

In [47]:
df

Unnamed: 0,Site,Species,ReadingDateTime,Value,Units,Provisional or Ratified
0,HG1,NO,01/06/2018 00:00,2.7,ug m-3,P
1,HG1,NO,01/06/2018 00:15,3.0,ug m-3,P
2,HG1,NO,01/06/2018 00:30,4.7,ug m-3,P
3,HG1,NO,01/06/2018 00:45,,ug m-3,P
4,HG1,NO,01/06/2018 01:00,,ug m-3,P
5,HG1,NO,01/06/2018 01:15,1.6,ug m-3,P
6,HG1,NO,01/06/2018 01:30,8.1,ug m-3,P
7,HG1,NO,01/06/2018 01:45,1.7,ug m-3,P
8,HG1,NO,01/06/2018 02:00,8.0,ug m-3,P
9,HG1,NO,01/06/2018 02:15,6.6,ug m-3,P


There are too many rows to view this easily. The records are sorted by `Species`. Let's see what unique values are in this column:

In [48]:
df.Species.unique()

array(['NO', 'NO2', 'NOX', 'PM2.5', 'SO2'], dtype=object)

We can see that there are five values: `NO` for nitric oxide, `NO2` for nitrogen dioxide, `NOX` for oxides of nitrogen, `PM2.5` for PM2.5 particulate, and `SO2` for sulphur dioxide.

How is each measurement recorded?

In [49]:
df.loc[df.ReadingDateTime == "01/06/2018 11:00"]

Unnamed: 0,Site,Species,ReadingDateTime,Value,Units,Provisional or Ratified
44,HG1,NO,01/06/2018 11:00,26.4,ug m-3,P
2828,HG1,NO2,01/06/2018 11:00,44.2,ug m-3,P
5612,HG1,NOX,01/06/2018 11:00,84.7,ug m-3 as NO2,P
8396,HG1,PM2.5,01/06/2018 11:00,,ug m-3,P
11180,HG1,SO2,01/06/2018 11:00,,ug m-3,P
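
The data is in "long" format: one row per species per timestamp. As a rough sketch of how the same information could be laid out with one row per timestamp, here is a pivot on a small sample. The 11:00 values are copied from the output above; the 11:15 values are made up for illustration, and `sample` and `wide` are names we have introduced here.

```python
import pandas as pd

# Small long-format sample. The 11:00 values come from the output above;
# the 11:15 values are invented purely for illustration.
sample = pd.DataFrame({
    "Species": ["NO", "NO2", "NOX", "NO", "NO2", "NOX"],
    "ReadingDateTime": ["01/06/2018 11:00"] * 3 + ["01/06/2018 11:15"] * 3,
    "Value": [26.4, 44.2, 84.7, 20.0, 40.0, None],
})

# One row per timestamp, one column per species.
wide = sample.pivot(index="ReadingDateTime", columns="Species", values="Value")
print(wide)
```

A wide layout like this makes it easy to compare species at the same timestamp, but we stay with the long format for the rest of this notebook.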


Each timestamp appears to have five records: one for each of the particulates. Let's check that this is true in general.

In [50]:
datetimes_no_data = []
# Check each distinct timestamp once; it should have exactly five records.
for date in df.ReadingDateTime.unique():
    if len(df.loc[df.ReadingDateTime == date]) != 5:
        datetimes_no_data.append(date)
        print(date)

if len(datetimes_no_data) == 0:
    print("no datetimes without sample data for all particulates: NO, NO2, NOX, PM2.5 and SO2!")
        

no datetimes without sample data for all particulates: NO, NO2, NOX, PM2.5 and SO2!
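
The loop above filters the whole dataframe once per timestamp, which gets slow on larger datasets. A possible alternative, sketched here on a made-up sample rather than the real download, is to count the records per timestamp in a single pass with `groupby`. The names `sample`, `expected_records` and `incomplete` are ours.

```python
import pandas as pd

# Made-up long-format sample: the second timestamp is missing its NOX record.
sample = pd.DataFrame({
    "Species": ["NO", "NO2", "NOX", "NO", "NO2"],
    "ReadingDateTime": ["01/06/2018 11:00"] * 3 + ["01/06/2018 11:15"] * 2,
})

# Number of records we expect per timestamp (3 in this sample, 5 in the real data).
expected_records = sample.Species.nunique()

# Count the records for every timestamp in one pass.
records_per_datetime = sample.groupby("ReadingDateTime").size()

# Keep only the timestamps that do not have the expected number of records.
incomplete = records_per_datetime[records_per_datetime != expected_records]
print(incomplete)
```

On the real data, an empty `incomplete` would correspond to the message printed above.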


Although every timestamp has a record for the particulates we are interested in, it looks like a lot of these are null (`NaN`). We want to remove null values from our data. First, let's check whether any of the other columns have null values:

In [51]:
df.isnull().any()

Site                       False
Species                    False
ReadingDateTime            False
Value                       True
Units                      False
Provisional or Ratified    False
dtype: bool

It is just the `Value` column that has some nulls. Let's see how many nulls it contains:

In [52]:
df['Value'].isnull().sum()

5754

### Number of null values for each particulate

In [53]:
particulates = [str(species) for species in df.Species.unique()]

# For each particulate: (name, number of null values, total number of records)
nulls_and_totals_per_particulate = zip(
    particulates,
    [len(df.loc[(df.Value.isnull()) & (df.Species == particulate)]) for particulate in particulates],
    [len(df.loc[df.Species == particulate]) for particulate in particulates],
)

print("particulate: null values: non-null values:")
for (name, nulls, total) in nulls_and_totals_per_particulate:
    print(name, "\t\t", nulls, "\t\t", total - nulls)
    
    

particulate: null values: non-null values:
NO 		62 		2722
NO2 		62 		2722
NOX 		62 		2722
PM2.5 		2784 		0
SO2 		2784 		0
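
The same kind of tally can probably be written more compactly with `groupby`. This is only a sketch on a made-up sample (the values are illustrative and `sample` and `summary` are our own names), not the code that produced the table above.

```python
import pandas as pd

# Made-up sample in the notebook's long format.
sample = pd.DataFrame({
    "Species": ["NO", "NO", "NO2", "NO2", "SO2", "SO2"],
    "Value": [2.7, None, 44.2, 40.1, None, None],
})

# Null values per species: flag the nulls as 0/1, then sum the flags per species.
null_counts = sample["Value"].isnull().astype(int).groupby(sample["Species"]).sum()

# Non-null values per species: count() ignores nulls.
nonnull_counts = sample.groupby("Species")["Value"].count()

summary = pd.DataFrame({"null": null_counts, "non_null": nonnull_counts})
print(summary)
```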


So, all of our records for PM2.5 and SO2 are null, and 62 of the records for each of the remaining particulates are null.

Let's go ahead and deal with these null values.


## Removing null values

All the PM2.5 and SO2 measures are null, so let's remove those records altogether from our dataframe. 
The remaining particulates have only a few nulls, so rather than dropping the timestamps with null measurements, we will replace each null with the mean value for that particulate.
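
Here is a minimal sketch of the mean replacement, assuming the mean is taken separately for each species over its non-null values. The sample data is made up and the names `sample` and `species_means` are ours.

```python
import pandas as pd

# Made-up sample: one null for NO and one for NO2.
sample = pd.DataFrame({
    "Species": ["NO", "NO", "NO", "NO2", "NO2", "NO2"],
    "Value": [2.0, None, 4.0, 40.0, 44.0, None],
})

# Mean of the non-null values within each species, repeated on every row
# of that species so that it lines up with the original dataframe.
species_means = sample.groupby("Species")["Value"].transform("mean")

# Fill each null with the mean of its own species.
sample["Value"] = sample["Value"].fillna(species_means)
print(sample)
```

Using `transform` rather than a single overall mean keeps a missing NO reading from being filled with a value influenced by NO2 or NOX, which are on different scales.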

### Drop the PM2.5 and SO2 rows


In [59]:
rows_to_drop = df.index[df['Species'] == "SO2"].tolist() + df.index[df['Species'] == "PM2.5"].tolist()


In [60]:
len(rows_to_drop)

5568

In [56]:
#df.dropna(inplace = True)

In [61]:
# rows_to_drop already holds index labels, so pass it to drop() directly.
df.drop(rows_to_drop, inplace=True)


In [62]:
len(df)

8352
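
Another way to get the same result, sketched here on a small sample, is to filter with a boolean mask instead of collecting index labels first. The values for NO, NO2 and NOX are copied from the 11:00 readings shown earlier; `species_to_drop` and `filtered` are names we have introduced.

```python
import pandas as pd

# One record per species, mirroring the 11:00 readings shown earlier.
sample = pd.DataFrame({
    "Species": ["NO", "NO2", "NOX", "PM2.5", "SO2"],
    "Value": [26.4, 44.2, 84.7, None, None],
})

# Keep only the rows whose species is not in the unwanted list.
species_to_drop = ["PM2.5", "SO2"]
filtered = sample[~sample["Species"].isin(species_to_drop)]
print(len(filtered))
```

This returns a new dataframe rather than modifying the original in place, so the unfiltered data stays available if we need it.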

Now, let's save this dataset. We'll give it a new name to preserve the raw data saved earlier.

In [63]:
df.to_csv("input_data/nonnull_air_data_harringey.csv")

Then, we can replace the 62 null measurements for each of NO, NO2 and NOX with the mean for that particulate.

In [67]:
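# Index labels of the rows for each remaining species (the PM2.5 rows were dropped above).
NO_rows = df.index[df['Species'] == "NO"].tolist()
NO2_rows = df.index[df['Species'] == "NO2"].tolist()
NOX_rows = df.index[df['Species'] == "NOX"].tolist()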

In [None]:
### set the null NO2 values to the mean of NO2, and likewise for NO and NOX

In [None]:
df.loc[df.Species == "NO2"]

In [57]:
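# Fill each null Value with the mean of the non-null Values for the same species.
df.Value = df.Value.fillna(df.groupby("Species")["Value"].transform("mean"))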

df

5754

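And we'll save our newly cleaned data. The file name differs from that of the raw data, so the raw data is preserved, but note that it is the same name used after dropping the PM2.5 and SO2 rows, so that file is overwritten: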

In [58]:
df.to_csv("input_data/nonnull_air_data_harringey.csv")
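
One thing to be aware of when saving: by default `to_csv` writes the dataframe index as an extra, unnamed first column. The sketch below shows the difference that `index=False` makes, using an in-memory buffer instead of a file. Whether the index should be kept in the saved files is a choice we have not made in this notebook, and the sample and variable names are ours.

```python
import io

import pandas as pd

sample = pd.DataFrame({"Species": ["NO", "NO2"], "Value": [26.4, 44.2]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# With index=False only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```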