In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

In [None]:
print('DOTSCIENCE_INPUTS=["gov_data_air_quality"]')
print('DOTSCIENCE_OUTPUTS=["input_data"]')
print('DOTSCIENCE_LABELS={"model_type": "data_cleaning"}')

We have taken data on the particulates in the air in London from [www.londonair.org.uk](https://www.londonair.org.uk).

In particular, we chose data on:
* Nitric Oxide (NO)
* Nitrogen dioxide (NO2)
* Oxides of nitrogen (NOX) 
* PM2.5 Particulate (PM2.5)
* Sulphur Dioxide (SO2)

(all measured in ug/m3).

The air was sampled at Haringey Town Hall at 15-minute intervals between 1st June 2018 and 30th June 2018.

[Query entered here](https://www.londonair.org.uk/london/asp/datasite.asp?CBXSpecies1=NOm&CBXSpecies2=NO2m&CBXSpecies3=NOXm&CBXSpecies5=PM25m&CBXSpecies6=SO2m&day1=1&month1=jun&year1=2018&day2=30&month2=jun&year2=2018&period=15min&ratidate=&site=HG1&res=6&Submit=Replot+graph)

In [None]:
df = pd.read_csv('https://www.londonair.org.uk/london/asp/downloadsite.asp?site=HG1&species1=NOm&species2=NO2m&species3=NOXm&species4=PM25m&species5=SO2m&species6=&start=1-jun-2018&end=30-jun-2018&res=6&period=15min&units=ugm3')

## Save the raw data, and point Dotscience to it.

In [None]:
### DOTSCIENCE TO DO: add input dataset.

This will put the data under version control.

In [None]:
df.to_csv("input_data/raw_air_data_harringey.csv")

Let's take a look at the data:

In [None]:
df

There are too many rows to view this easily. The records are sorted by `Species`. Let's see what unique values are in this column:

In [None]:
df.Species.unique()

We can see that there are five values: `NO` for nitric oxide, `NO2` for nitrogen dioxide, `NOX` for oxides of nitrogen, `PM2.5` for PM2.5 Particulate, and `SO2` for sulphur dioxide.

How is each measurement recorded?

In [None]:
df.loc[df.ReadingDateTime == "01/06/2018 11:00"]

It appears that each unique timestamp has five records: one for each of the particulates. Let's check that this is true in general.

In [None]:
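# Count the records for each timestamp; every timestamp should have exactly five.
records_per_datetime = df.groupby('ReadingDateTime').size()
datetimes_no_data = records_per_datetime[records_per_datetime != 5].index.tolist()

# Print each offending timestamp once.
for date in datetimes_no_data:
    print(date)

if len(datetimes_no_data) == 0:
    print("no datetimes without sample data for all particulates: NO, NO2, NOX, PM2.5 and SO2!")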

Although every timestamp has a record for the particulates we are interested in, it looks like a lot of these are null (`NaN`). We want to remove null values from our data. First, let's check whether any of the other columns have null values:

In [None]:
df.isnull().any()

It is just the `Value` column that has some nulls. Let's see how many nulls it contains:

In [None]:
df['Value'].isnull().sum()

### Number of null values for each particulate

In [None]:
particulates = [str(array) for array in df.Species.unique()]

nulls_and_nonnulls_per_particulate = zip(particulates, 
                            [len(df.loc[(df.Value.isnull()) & (df.Species == particulate)]) for particulate in particulates],
                            [(len(df.loc[(df.Species == particulate)])) for particulate in particulates]
                           )


print("particulate: null values: non-null values:")
for (a, b, c) in nulls_and_nonnulls_per_particulate:
    print(a, "\t\t", b, "\t\t", c-b)
    
    

So, all of our records for PM2.5 and SO2 are null. 62 of our records for the remaining particulates are null.
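
The same per-particulate tally can also be computed in one pass with `groupby`. The following is only a minimal sketch on a small made-up frame: the `toy` data and the `null_summary` helper are hypothetical names introduced for illustration, and only the `Species` and `Value` column names come from the data above.

```python
import numpy as np
import pandas as pd


def null_summary(frame):
    """Count null and non-null `Value` entries for each `Species`."""
    # 1 where Value is null, 0 otherwise, so that a sum gives the null count.
    is_null = frame["Value"].isnull().astype(int)
    summary = is_null.groupby(frame["Species"]).agg(nulls="sum", total="count")
    summary["non_nulls"] = summary["total"] - summary["nulls"]
    return summary[["nulls", "non_nulls"]]


# Hypothetical toy data, not real measurements.
toy = pd.DataFrame({
    "Species": ["NO", "NO", "NO2", "NO2", "SO2", "SO2"],
    "Value": [1.0, np.nan, 3.0, 4.0, np.nan, np.nan],
})
print(null_summary(toy))
```

A species whose `non_nulls` count is zero, like `SO2` in this toy frame, has no usable measurements at all.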

Let's go ahead and deal with the null values.


## Removing null values

All the PM2.5 and SO2 measurements are null, so let's remove those records altogether from our dataframe.
The remaining particulates have only a few nulls, so rather than dropping the timestamps with null measurements, we will replace each null with the mean value for that particulate.

### Drop the PM2.5 and SO2 rows


In [None]:
rows_to_drop = df.index[df['Species'] == "SO2"].tolist() + df.index[df['Species'] == "PM2.5"].tolist()


In [None]:
len(rows_to_drop)

In [None]:
#df.dropna(inplace = True)

In [None]:
# rows_to_drop already holds index labels, so pass it to drop() directly.
df.drop(rows_to_drop, inplace=True)


In [None]:
len(df)
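
An alternative to collecting index labels is to filter with a boolean mask. This is a minimal sketch under that assumption, not the approach used in the cells above; `drop_species` and the `toy` frame are hypothetical names introduced for illustration.

```python
import pandas as pd


def drop_species(frame, unwanted):
    """Return a copy of `frame` without rows whose `Species` is in `unwanted`."""
    keep = ~frame["Species"].isin(unwanted)
    return frame.loc[keep].copy()


# Hypothetical toy data, not real measurements.
toy = pd.DataFrame({
    "Species": ["NO", "NO2", "NOX", "PM2.5", "SO2"],
    "Value": [1.0, 2.0, 3.0, None, None],
})
kept = drop_species(toy, ["PM2.5", "SO2"])
print(kept["Species"].tolist())
```

Because the mask is built from the column values rather than from row labels, this version does not depend on what kind of index the dataframe has, and it leaves the original frame unchanged.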

Now, let's save this dataset. We'll give it a new name to preserve the raw data saved earlier.

In [None]:
df.to_csv("input_data/nonnull_air_data_harringey.csv")

Then, we can replace those 62 null NO, NO2 and NOX measurements with their respective means.

In [None]:
NO2_rows = df.index[df['Species'] == "NO2"].tolist()
# The PM2.5 rows were dropped above, so this list is empty.
PM25_rows = df.index[df['Species'] == "PM2.5"].tolist()

In [None]:
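### Set null NO, NO2 and NOX values to the mean for their species (the PM2.5 rows were dropped above)

In [None]:
#df.loc[df.Species == "NO2"]

In [None]:
# Fill each null Value with the mean Value of its own species.
df['Value'] = df['Value'].fillna(df.groupby('Species')['Value'].transform('mean'))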
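
To make the per-particulate mean fill concrete, here is a minimal sketch on a small made-up frame. The `fill_with_species_mean` helper and the `toy` values are hypothetical and used only for illustration; the `Species` and `Value` column names are the ones from the data above.

```python
import numpy as np
import pandas as pd


def fill_with_species_mean(frame):
    """Return a copy with null `Value` entries replaced by their species mean."""
    filled = frame.copy()
    # transform("mean") returns one value per row: the mean of that row's species.
    species_mean = filled.groupby("Species")["Value"].transform("mean")
    filled["Value"] = filled["Value"].fillna(species_mean)
    return filled


# Hypothetical toy data, not real measurements.
toy = pd.DataFrame({
    "Species": ["NO", "NO", "NO", "NO2", "NO2", "NO2"],
    "Value": [2.0, np.nan, 4.0, 10.0, 20.0, np.nan],
})
filled = fill_with_species_mean(toy)
print(filled["Value"].tolist())  # [2.0, 3.0, 4.0, 10.0, 20.0, 15.0]
```

Note that a species with no non-null measurements has no mean to fill with, so its values would stay null under this approach.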



And we'll once again save our newly cleaned data. This uses the same file name as the previous save, so the raw data saved at the start is still preserved:

In [None]:
df.to_csv("input_data/nonnull_air_data_harringey.csv")

In [None]:
## hacky summary stat used to try to get a commit made

In [None]:
import json

print('DOTSCIENCE_PARAMETERS=' + json.dumps({"features": "blah"}))

print('DOTSCIENCE_SUMMARY=' + json.dumps({"thing": 1}))

