# Cleaning methodology for pollutant data

This notebook contains the key steps taken to gather and clean the air pollution data. As the data set is quite large and can only be downloaded in small chunks, we are currently investigating different ways of downloading it more efficiently. As we continue to explore the data, more substantial filtering and cleaning will take place.

## Step 1: Identify boundary for monitoring

Having manually sifted through the data, we split the monitoring sites into "inner" and "outer" ranges based on their geographic location. The sites closest to our area of focus (Heathrow Airport) are categorised as "inner" locations and will be vital to our investigation into emissions. The "outer" locations serve as a comparison, giving a deeper understanding of the scale of the impact arising from air pollution.

| CCG | Borough | Inner location - monitoring station | Outer location - monitoring site |
|:----|:--------|:------------------------------------|:---------------------------------|
| Hillingdon | Hillingdon | Hillingdon South Ruislip, Hillingdon 2 Hillingdon Hospital, Hillingdon Oxford Avenue, Hillingdon Harmondsworth, Hillingdon Harmondsworth Osiris, Hillingdon Hayes, Heathrow LHR2, Heathrow Bath Road, Hillingdon Sipson, Heathrow Green Gates | - |
| East Berkshire | Slough | Slough Town Centre Wellington Street, Slough Brands Hill London Road, Slough Windmill Bath Road, Slough Colnbrook, Slough Town Centre A4, Slough Lakeside 1 Osiris, Slough Colnbrook Osiris, Slough Chalvey, Slough Lakeside 2, Slough Lakeside 2 Osiris, Slough - Dennis Way LP11, Slough - Monksfield Way LP20, Slough - The Hawthorns LP2, Slough - Erica Close LP3, Slough - Hatton Avenue LP13, Slough - St Andrews Way LP12, Slough - The Hawthorns LP10, Slough - Francis Way LP13, Slough - The Hawthorns LP1, Slough - Monksfield Way LP19, Slough - Brighton Spur LP3, Slough - Bower Way LP1, Slough - Hatton Avenue LP3, Slough - Cinder Track LP37 | - |
| Hounslow | Hounslow | Hounslow Cranford, Hounslow Chiswick, Hounslow Brentford, Hounslow Heston, Hounslow Hatton Cross, Hounslow Feltham, Hounslow Gunnersbury | - |
| Ealing | Ealing | Ealing Horn Lane | - |
| Buckinghamshire | South Bucks | Iver Thorney Lane North, Iver North Park Road, Iver Primary School | - |
| Surrey Heartlands | Richmond | - | Elmbridge |
| - | Spelthorne | Spelthorne Shepperton Squire's Bridge Road, Spelthorne Knowle Green, Spelthorne Sunbury Cross, Heathrow Oaks Road | - |
| - | Waverly and Woking | - | H&F Shepherd’s Bush, Godalming Ockford Road 2 |
| South West London | Richmond | London Teddington Bushy Park | - |
| Hammersmith & Fulham | London Borough of Hammersmith and Fulham | - | H&F Hammersmith Town Centre, H&F Shepherd’s Bush |
| Watford | Hertfordshire and Bedfordshire | - | Watford Town Hall |
| Oxfordshire | Oxfordshire | - | Oxford High St, Oxford St Ebbes (Cal Club), Oxford Center Roadside, Oxford St Ebbes |
| Berkshire West | Reading | - | Reading Caversham Road, Reading Oxford Road, Reading London Road, Reading New Town |
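
The inner/outer split can be carried into code as a simple lookup, so that each reading can be tagged with its range later on. The following is a minimal sketch that lists only a handful of sites from the table; `SITE_RANGE` and `classify_site` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical lookup built from a few rows of the table above;
# the remaining sites would be added in the same way.
SITE_RANGE = {
    "Heathrow LHR2": "inner",
    "Hillingdon Harmondsworth": "inner",
    "Slough Colnbrook": "inner",
    "Hounslow Hatton Cross": "inner",
    "Watford Town Hall": "outer",
    "Reading Caversham Road": "outer",
    "Godalming Ockford Road 2": "outer",
}


def classify_site(site_name: str) -> str:
    """Return 'inner', 'outer', or 'unknown' for a monitoring site name."""
    return SITE_RANGE.get(site_name.strip(), "unknown")


print(classify_site("Heathrow LHR2"))         # inner
print(classify_site("Farnham The Woolmead"))  # unknown (not listed in the table)
```

Returning `"unknown"` rather than raising makes it easy to spot sites that appear in a download but have not yet been assigned a range.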

## Step 2: Format the Data

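In [4]:
# Imports assumed to come from earlier cells that are not shown here
import numpy as np
import pandas as pd

# sheet_name=None returns a dict of DataFrames, one per sheet;
# header=[0,1] reads the two header rows (site, measurement) as a column MultiIndex
df_dict = pd.read_excel('./raw_data/waverly.xlsx', header=[0,1], sheet_name=None)

In [5]:
# Stack all sheets row-wise; from here on df_dict holds a single DataFrame
df_dict = pd.concat(df_dict.values(), axis=0)

In [6]:
df = pd.DataFrame(df_dict, columns=df_dict.keys())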

In [7]:
df

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Godalming Ockford Road 2,Godalming Ockford Road 2
Unnamed: 0_level_1,Date,Time,Nitrogen dioxide,Status,PM10 particulate matter (Hourly measured),Status.1,Nitrogen dioxide,Status
0,2018-01-01,01:00:00,3.79294,V ugm-3,No data,,No data,
1,2018-01-01,02:00:00,10.808,V ugm-3,No data,,No data,
2,2018-01-01,03:00:00,9.00229,V ugm-3,No data,,No data,
3,2018-01-01,04:00:00,6.37322,V ugm-3,No data,,No data,
4,2018-01-01,05:00:00,2.65853,V ugm-3,No data,,No data,
...,...,...,...,...,...,...,...,...
25579,2020-12-01,20:00:00,35.3732,N ugm-3,27.5,N ugm-3 (Ref.eq),44.7591,N ugm-3
25580,2020-12-01,21:00:00,32.6193,N ugm-3,19.2,N ugm-3 (Ref.eq),40.9585,N ugm-3
25581,2020-12-01,22:00:00,30.7076,N ugm-3,15.8,N ugm-3 (Ref.eq),37.8524,N ugm-3
25582,2020-12-01,23:00:00,31.9377,N ugm-3,13.3,N ugm-3 (Ref.eq),34.697,N ugm-3


In [8]:
# pd.set_option('display.max_columns', 100)
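
The sheet-combining step can be tried without the workbook. The sketch below uses two small in-memory frames as stand-ins for the dict that `read_excel(..., sheet_name=None)` returns; the site name, sheet names and values are made up for illustration. It also passes `ignore_index=True`, which the notebook does not use, so that the combined frame gets one continuous row index.

```python
import pandas as pd

# Two-level header: site name on top, measurement underneath,
# mirroring what header=[0, 1] produces for the real workbook.
columns = pd.MultiIndex.from_tuples([
    ("Site A", "Nitrogen dioxide"),
    ("Site A", "Status"),
])

# Stand-ins for the per-sheet DataFrames (hypothetical sheet names).
sheets = {
    "2018": pd.DataFrame([[3.79, "V ugm-3"], [10.81, "V ugm-3"]], columns=columns),
    "2019": pd.DataFrame([[9.00, "V ugm-3"]], columns=columns),
}

# ignore_index=True gives one continuous 0..n-1 index instead of
# repeating each sheet's own row numbers.
combined = pd.concat(list(sheets.values()), axis=0, ignore_index=True)
print(combined.shape)  # (3, 2)
```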

## Step 3: Unify missing data

In [9]:
# Treat both spellings of the placeholder as missing values
df = df.replace(['No data', 'No Data'], np.nan)
df

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Godalming Ockford Road 2,Godalming Ockford Road 2
Unnamed: 0_level_1,Date,Time,Nitrogen dioxide,Status,PM10 particulate matter (Hourly measured),Status.1,Nitrogen dioxide,Status
0,2018-01-01,01:00:00,3.79294,V ugm-3,,,,
1,2018-01-01,02:00:00,10.80804,V ugm-3,,,,
2,2018-01-01,03:00:00,9.00229,V ugm-3,,,,
3,2018-01-01,04:00:00,6.37322,V ugm-3,,,,
4,2018-01-01,05:00:00,2.65853,V ugm-3,,,,
...,...,...,...,...,...,...,...,...
25579,2020-12-01,20:00:00,35.37315,N ugm-3,27.5,N ugm-3 (Ref.eq),44.75912,N ugm-3
25580,2020-12-01,21:00:00,32.61926,N ugm-3,19.2,N ugm-3 (Ref.eq),40.95849,N ugm-3
25581,2020-12-01,22:00:00,30.70755,N ugm-3,15.8,N ugm-3 (Ref.eq),37.85241,N ugm-3
25582,2020-12-01,23:00:00,31.93767,N ugm-3,13.3,N ugm-3 (Ref.eq),34.69698,N ugm-3
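
An alternative way to unify missing readings is to coerce each measurement column to numeric, which catches any spelling of the placeholder in one pass. This is a sketch on made-up values, not the approach taken in the notebook; the column names simply mirror the ones shown in the output.

```python
import pandas as pd

readings = pd.DataFrame({
    "Nitrogen dioxide": [3.79294, "No data", "no data ", 6.37322],
    "Status": ["V ugm-3", None, None, "V ugm-3"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN,
# so every variant of the placeholder text is handled at once.
readings["Nitrogen dioxide"] = pd.to_numeric(
    readings["Nitrogen dioxide"], errors="coerce"
)

print(readings["Nitrogen dioxide"].isna().sum())  # 2
print(readings["Nitrogen dioxide"].dtype)         # float64
```

The trade-off is that coercion also hides genuinely malformed values, so it is worth counting the NaNs before and after to check nothing unexpected was dropped.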


## Step 4: Identify closed monitoring stations

In [10]:
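# Keep only columns with at least one reading; entirely empty columns are taken to indicate closed stations
non_null_columns = [col for col in df.columns if df.loc[:, col].notna().any()]
# .copy() avoids a SettingWithCopyWarning when columns are reassigned in Step 5
open_monitoring_sites = df[non_null_columns].copy()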
open_monitoring_sites

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Godalming Ockford Road 2,Godalming Ockford Road 2
Unnamed: 0_level_1,Date,Time,Nitrogen dioxide,Status,PM10 particulate matter (Hourly measured),Status.1,Nitrogen dioxide,Status
0,2018-01-01,01:00:00,3.79294,V ugm-3,,,,
1,2018-01-01,02:00:00,10.80804,V ugm-3,,,,
2,2018-01-01,03:00:00,9.00229,V ugm-3,,,,
3,2018-01-01,04:00:00,6.37322,V ugm-3,,,,
4,2018-01-01,05:00:00,2.65853,V ugm-3,,,,
...,...,...,...,...,...,...,...,...
25579,2020-12-01,20:00:00,35.37315,N ugm-3,27.5,N ugm-3 (Ref.eq),44.75912,N ugm-3
25580,2020-12-01,21:00:00,32.61926,N ugm-3,19.2,N ugm-3 (Ref.eq),40.95849,N ugm-3
25581,2020-12-01,22:00:00,30.70755,N ugm-3,15.8,N ugm-3 (Ref.eq),37.85241,N ugm-3
25582,2020-12-01,23:00:00,31.93767,N ugm-3,13.3,N ugm-3 (Ref.eq),34.69698,N ugm-3
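
The cell above filters column by column. To name the closed stations themselves, the same check can be grouped by site. The sketch below works on a toy frame with made-up site names and assumes that a site counts as closed only when every one of its columns is empty.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ("Site A", "Nitrogen dioxide"), ("Site A", "Status"),
    ("Site B", "Nitrogen dioxide"), ("Site B", "Status"),
])
toy = pd.DataFrame(
    [[3.8, "V ugm-3", np.nan, np.nan],
     [10.8, "V ugm-3", np.nan, np.nan]],
    columns=columns,
)

# One boolean per column (is it entirely empty?), then reduced per site:
# a site is flagged only if all of its columns are empty.
site_is_empty = toy.isna().all(axis=0).groupby(level=0).all()
closed_sites = site_is_empty[site_is_empty].index.tolist()

# Drop every column belonging to a closed site.
open_sites = toy.drop(columns=closed_sites, level=0)

print(closed_sites)  # ['Site B']
```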


## Step 5: Setting Date and Time columns to Datetime

In [11]:
# Keep only the trailing HH:MM:SS part of each time value
open_monitoring_sites[('Unnamed: 1_level_0','Time')] = open_monitoring_sites[('Unnamed: 1_level_0','Time')].astype(str).str[-8:]

In [12]:
open_monitoring_sites

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Godalming Ockford Road 2,Godalming Ockford Road 2
Unnamed: 0_level_1,Date,Time,Nitrogen dioxide,Status,PM10 particulate matter (Hourly measured),Status.1,Nitrogen dioxide,Status
0,2018-01-01,01:00:00,3.79294,V ugm-3,,,,
1,2018-01-01,02:00:00,10.80804,V ugm-3,,,,
2,2018-01-01,03:00:00,9.00229,V ugm-3,,,,
3,2018-01-01,04:00:00,6.37322,V ugm-3,,,,
4,2018-01-01,05:00:00,2.65853,V ugm-3,,,,
...,...,...,...,...,...,...,...,...
25579,2020-12-01,20:00:00,35.37315,N ugm-3,27.5,N ugm-3 (Ref.eq),44.75912,N ugm-3
25580,2020-12-01,21:00:00,32.61926,N ugm-3,19.2,N ugm-3 (Ref.eq),40.95849,N ugm-3
25581,2020-12-01,22:00:00,30.70755,N ugm-3,15.8,N ugm-3 (Ref.eq),37.85241,N ugm-3
25582,2020-12-01,23:00:00,31.93767,N ugm-3,13.3,N ugm-3 (Ref.eq),34.69698,N ugm-3


In [13]:
# Validate each value as HH:MM:SS; after strftime the column holds strings again, not datetimes
open_monitoring_sites[('Unnamed: 1_level_0','Time')] = pd.to_datetime(open_monitoring_sites[('Unnamed: 1_level_0','Time')], format='%H:%M:%S').dt.strftime("%H:%M:%S")

In [14]:
open_monitoring_sites.columns

MultiIndex([(      'Unnamed: 0_level_0', ...),
            (      'Unnamed: 1_level_0', ...),
            (    'Farnham The Woolmead', ...),
            (    'Farnham The Woolmead', ...),
            (    'Farnham The Woolmead', ...),
            (    'Farnham The Woolmead', ...),
            ('Godalming Ockford Road 2', ...),
            ('Godalming Ockford Road 2', ...)],
           )

In [15]:
open_monitoring_sites.set_index([('Unnamed: 0_level_0','Date'), ('Unnamed: 1_level_0','Time')], inplace=True)
open_monitoring_sites

Unnamed: 0_level_0,Unnamed: 1_level_0,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Godalming Ockford Road 2,Godalming Ockford Road 2
Unnamed: 0_level_1,Unnamed: 1_level_1,Nitrogen dioxide,Status,PM10 particulate matter (Hourly measured),Status.1,Nitrogen dioxide,Status
"(Unnamed: 0_level_0, Date)","(Unnamed: 1_level_0, Time)",Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2018-01-01,01:00:00,3.79294,V ugm-3,,,,
2018-01-01,02:00:00,10.80804,V ugm-3,,,,
2018-01-01,03:00:00,9.00229,V ugm-3,,,,
2018-01-01,04:00:00,6.37322,V ugm-3,,,,
2018-01-01,05:00:00,2.65853,V ugm-3,,,,
...,...,...,...,...,...,...,...
2020-12-01,20:00:00,35.37315,N ugm-3,27.5,N ugm-3 (Ref.eq),44.75912,N ugm-3
2020-12-01,21:00:00,32.61926,N ugm-3,19.2,N ugm-3 (Ref.eq),40.95849,N ugm-3
2020-12-01,22:00:00,30.70755,N ugm-3,15.8,N ugm-3 (Ref.eq),37.85241,N ugm-3
2020-12-01,23:00:00,31.93767,N ugm-3,13.3,N ugm-3 (Ref.eq),34.69698,N ugm-3


In [16]:
open_monitoring_sites = open_monitoring_sites.rename_axis(['Date','Time'])
open_monitoring_sites

Unnamed: 0_level_0,Unnamed: 1_level_0,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Farnham The Woolmead,Godalming Ockford Road 2,Godalming Ockford Road 2
Unnamed: 0_level_1,Unnamed: 1_level_1,Nitrogen dioxide,Status,PM10 particulate matter (Hourly measured),Status.1,Nitrogen dioxide,Status
Date,Time,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2018-01-01,01:00:00,3.79294,V ugm-3,,,,
2018-01-01,02:00:00,10.80804,V ugm-3,,,,
2018-01-01,03:00:00,9.00229,V ugm-3,,,,
2018-01-01,04:00:00,6.37322,V ugm-3,,,,
2018-01-01,05:00:00,2.65853,V ugm-3,,,,
...,...,...,...,...,...,...,...
2020-12-01,20:00:00,35.37315,N ugm-3,27.5,N ugm-3 (Ref.eq),44.75912,N ugm-3
2020-12-01,21:00:00,32.61926,N ugm-3,19.2,N ugm-3 (Ref.eq),40.95849,N ugm-3
2020-12-01,22:00:00,30.70755,N ugm-3,15.8,N ugm-3 (Ref.eq),37.85241,N ugm-3
2020-12-01,23:00:00,31.93767,N ugm-3,13.3,N ugm-3 (Ref.eq),34.69698,N ugm-3
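
After these cells, Date and Time are still two separate index levels, with Time stored as a string. If a single timestamp is needed, the two can be combined. The sketch below uses made-up values and rests on one assumption that should be checked against the raw files: that a `00:00:00` reading closes the day it is filed under (i.e. it was recorded as 24:00), so it is moved to the start of the following day.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-01"]),
    "Time": ["01:00:00", "23:00:00", "00:00:00"],
    "Nitrogen dioxide": [3.79, 31.94, 29.14],
})

# Add the time of day to the date as a Timedelta.
timestamp = toy["Date"] + pd.to_timedelta(toy["Time"])

# Assumption: 00:00:00 marks the end of the day it is listed under,
# so shift those rows forward by one day.
is_midnight = toy["Time"] == "00:00:00"
timestamp = timestamp + pd.to_timedelta(is_midnight.astype(int), unit="D")

toy = toy.drop(columns=["Date", "Time"]).set_index(timestamp.rename("Datetime"))
print(toy.index.max())  # 2018-01-02 00:00:00
```

If the raw data turns out to use 00:00 for the start of the day instead, the midnight shift should simply be dropped.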


## Step 6: Removing Multi-tier Columns

In [17]:
# Move the site name (column level 0) into the row index: one row per date, time and location
open_monitoring_sites_stack = open_monitoring_sites.stack(0, dropna=True).rename_axis(('Date', 'Time','Location'))
open_monitoring_sites_stack

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Nitrogen dioxide,PM10 particulate matter (Hourly measured),Status,Status.1
Date,Time,Location,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2018-01-01,01:00:00,Farnham The Woolmead,3.79294,,V ugm-3,
2018-01-01,02:00:00,Farnham The Woolmead,10.80804,,V ugm-3,
2018-01-01,03:00:00,Farnham The Woolmead,9.00229,,V ugm-3,
2018-01-01,04:00:00,Farnham The Woolmead,6.37322,,V ugm-3,
2018-01-01,05:00:00,Farnham The Woolmead,2.65853,,V ugm-3,
...,...,...,...,...,...,...
2020-12-01,22:00:00,Godalming Ockford Road 2,37.85241,,N ugm-3,
2020-12-01,23:00:00,Farnham The Woolmead,31.93767,13.3,N ugm-3,N ugm-3 (Ref.eq)
2020-12-01,23:00:00,Godalming Ockford Road 2,34.69698,,N ugm-3,
2020-12-01,00:00:00,Farnham The Woolmead,29.14097,12.5,N ugm-3,N ugm-3 (Ref.eq)


In [18]:
# Turn Location back into an ordinary column (display only; the result is not assigned)
open_monitoring_sites_stack.reset_index(level=['Location'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Location,Nitrogen dioxide,PM10 particulate matter (Hourly measured),Status,Status.1
Date,Time,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2018-01-01,01:00:00,Farnham The Woolmead,3.79294,,V ugm-3,
2018-01-01,02:00:00,Farnham The Woolmead,10.80804,,V ugm-3,
2018-01-01,03:00:00,Farnham The Woolmead,9.00229,,V ugm-3,
2018-01-01,04:00:00,Farnham The Woolmead,6.37322,,V ugm-3,
2018-01-01,05:00:00,Farnham The Woolmead,2.65853,,V ugm-3,
...,...,...,...,...,...,...
2020-12-01,22:00:00,Godalming Ockford Road 2,37.85241,,N ugm-3,
2020-12-01,23:00:00,Farnham The Woolmead,31.93767,13.3,N ugm-3,N ugm-3 (Ref.eq)
2020-12-01,23:00:00,Godalming Ockford Road 2,34.69698,,N ugm-3,
2020-12-01,00:00:00,Farnham The Woolmead,29.14097,12.5,N ugm-3,N ugm-3 (Ref.eq)
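
To see what stacking does on a small scale, the sketch below reshapes a toy two-site frame from wide to long form. The site names and values are made up. It calls `dropna(how="all")` explicitly after `stack` rather than passing `dropna=True`, on the assumption that this behaves the same across pandas versions whose `stack` defaults differ.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ("Site A", "Nitrogen dioxide"), ("Site A", "Status"),
    ("Site B", "Nitrogen dioxide"), ("Site B", "Status"),
])
index = pd.MultiIndex.from_tuples(
    [("2018-01-01", "01:00:00"), ("2018-01-01", "02:00:00")],
    names=["Date", "Time"],
)
wide = pd.DataFrame(
    [[3.8, "V ugm-3", np.nan, np.nan],
     [10.8, "V ugm-3", 44.8, "N ugm-3"]],
    index=index, columns=columns,
)

long = (
    wide.stack(level=0)        # site names move from columns to rows
        .dropna(how="all")     # drop site/hour pairs with no data at all
        .rename_axis(["Date", "Time", "Location"])
        .reset_index()         # Date, Time, Location become ordinary columns
)
print(len(long))  # 3
```

Site B has no reading at 01:00, so that combination is dropped and three rows remain.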


## Next Steps

- Use API call to gather data from monitoring stations
- Investigate other data sources that allow bulk downloads and easier access
- Aggregate data on a daily basis
- Investigate similarities/differences between outer and inner locations, perform EDA on each
- Aggregate data annually and merge with Health and Flight data
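
As a possible starting point for the daily aggregation step, the sketch below averages hourly readings per location and day on made-up data. It assumes the long format from Step 6 with `Location` as an ordinary column; the mean is only one candidate statistic.

```python
import pandas as pd

hourly = pd.DataFrame({
    "Date": ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-01"],
    "Location": ["Site A", "Site A", "Site A", "Site B"],
    "Nitrogen dioxide": [4.0, 10.0, 6.0, 40.0],
})

# One row per location and day, holding the mean of that day's hourly readings.
daily = (
    hourly.groupby(["Location", "Date"], as_index=False)["Nitrogen dioxide"]
          .mean()
)
print(daily)
```

`mean` skips missing values by default, so a day with only a few valid hours still produces an average; a minimum data-capture threshold may be worth adding before comparing sites.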
