## Some analysis of Delhi Pollution Data

Pollution datasets for the US are readily available, and a good deal of research has already been done on them. I wanted to see if I could do some analysis of the pollution patterns of Indian cities. Unfortunately, my searches did not turn up an openly accessible historical pollution dataset for Indian cities. The folks over at [aqicn.org](http://aqicn.org) apparently provide access to institutions but not to individuals. In any case, I was able to locate a fantastic initiative by the [Delhi Pollution Control Committee](http://www.dpccairdata.com/dpccairdata/display/index.php). They provide *raw* pollution data from six sensor clusters inside the city. While the availability could be better, and not every sensor cluster covers every metric, this data is incredibly useful. Kudos to them for having made this available! One problem is that they do not provide historical data, so I had to collect the realtime data over time. What follows is some analysis of that data. Hopefully, as the dataset grows, we will be able to derive more insights from it.

In [119]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

import re

%matplotlib inline

The data I am dumping out into the CSV file below is pretty raw. It looks like this:

In [109]:
rawdata = pd.read_csv('./netfile.csv', names=['location', 'metric', 'ts', 'reading', 'guidance'])
rawdata.head()

,location,metric,ts,reading,guidance
0,Punjabi Bagh,Ammonia,1471960200,19.1 µg/m3,400 µg/m3
1,Punjabi Bagh,Benzene,1471960200,0.2 µg/m3,05 µg/m3 *
2,Punjabi Bagh,Carbon Monoxide,1471960200,1.2 mg/m3,04 mg/m3
3,Punjabi Bagh,Nitrogen Dioxide,1471960200,49.6 µg/m3,80 µg/m3
4,Punjabi Bagh,Nitrogen Oxide,1471960200,4.3 µg/m3,-


In [110]:
rawdata.location.value_counts()

IGI Airport     1660
RK Puram        1577
Punjabi Bagh    1577
Anand Vihar     1577
Civil Lines      913
Name: location, dtype: int64

In [111]:
rawdata.metric.value_counts()

Sulphur Dioxide                                       415
Nitrogen Dioxide                                      415
Benzene                                               415
Ozone                                                 415
Toluene                                               415
p-Xylene                                              415
Ammonia                                               415
Carbon Monoxide                                       415
Nitrogen Oxide                                        415
Wind Direction                                        332
Horizontal Wind Speed                                 249
Particulate Matter < 10 µg                            249
Relative Humidity                                     249
Barometric Pressure                                   249
Oxides of Nitrogen                                    249
Solar Radiation                                       249
Ambient Temperature                                   249
Vertical Wind 

Some of these metrics sound inferable: for example, *Nitrogen Dioxide* and *Nitrogen Oxide* should give a good estimate for *Oxides of Nitrogen* at locations where it is not reported independently. However, we'll look at that later. For now, let's munge this into a more useful dataframe.

In [112]:
rawdata['ts'] = pd.to_datetime(rawdata.ts, unit='s')

def mungeReading(x):
    # Keep the first number in the string, sign and exponent included; '-' or no number gives None (NaN after to_numeric).
    m = re.search(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?", x)
    return m.group(0) if m else None

rawdata['reading'] = pd.to_numeric(rawdata.reading.apply(mungeReading), errors='coerce')
rawdata['guidance'] = pd.to_numeric(rawdata.guidance.apply(mungeReading), errors='coerce')

rawdata.head()

,location,metric,ts,reading,guidance
0,Punjabi Bagh,Ammonia,2016-08-23 13:50:00,19.1,400.0
1,Punjabi Bagh,Benzene,2016-08-23 13:50:00,0.2,5.0
2,Punjabi Bagh,Carbon Monoxide,2016-08-23 13:50:00,1.2,4.0
3,Punjabi Bagh,Nitrogen Dioxide,2016-08-23 13:50:00,49.6,80.0
4,Punjabi Bagh,Nitrogen Oxide,2016-08-23 13:50:00,4.3,
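
The parsing step keeps only the number, so the unit is lost. In the raw rows shown earlier, Carbon Monoxide is reported in mg/m3 while the other gases are in µg/m3. Below is a minimal sketch of one way to keep the unit next to the value, using pandas' `Series.str.extract` on a hand-made sample copied from those rows. The names `raw`, `pattern`, and `parsed` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hand-made sample shaped like the raw 'reading' column (values copied from the rows shown earlier).
raw = pd.Series(["19.1 µg/m3", "0.2 µg/m3", "1.2 mg/m3", "-"], name="reading")

# Two named groups: the leading number (optional sign, decimals, exponent)
# and whatever text follows it, which is treated as the unit.
pattern = r"^\s*(?P<value>[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)\s*(?P<unit>.*?)\s*$"

parsed = raw.str.extract(pattern)
parsed["value"] = pd.to_numeric(parsed["value"], errors="coerce")

# Rows with no number (the bare '-') come back as NaN in both columns.
print(parsed)
```

With the unit in its own column, readings in mg/m3 could be rescaled before comparing them against readings in µg/m3. Whether that is needed depends on how the metrics are used later.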


In [114]:
rawdata.reading.isnull().sum(), len(rawdata.reading), rawdata.guidance.isnull().sum(), len(rawdata.guidance)

(426, 7304, 4186, 7304)

So that looks reasonable. I expect that the guidance is a simple function of the metric and should not change that often. Let us check:

In [154]:
{k:v for k,v in dict(rawdata[['metric', 'guidance']].groupby('metric').guidance.nunique()).items() if v > 0}

{'Ammonia': 2,
 'Benzene': 1,
 'Carbon Monoxide': 2,
 "Mass Concentration PM 10 (Previous Day's Average)": 1,
 'Nitrogen Dioxide': 2,
 'Nitrogen Oxide': 1,
 'Oxides of Nitrogen': 1,
 'Ozone': 2,
 'Particulate Matter < 10 µg': 2,
 'Particulate Matter < 2.5 µg': 3,
 'Sulphur Dioxide': 2,
 'Vertical Wind Speed': 4}

Looks like that guess was incorrect: several metrics have more than one guidance value.
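
A natural follow-up is to check whether the guidance differs by location or changes over time. The sketch below runs on a made-up toy frame: the locations `A` and `B` and all guidance numbers are hypothetical, not taken from the dataset. It lists the distinct guidance values per metric and flags metrics whose guidance is constant within each location but differs between locations.

```python
import pandas as pd

# Hypothetical toy rows shaped like rawdata after munging; all numbers are made up.
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B", "A", "B"],
    "metric": ["Ozone", "Ozone", "Ozone", "Ozone", "Benzene", "Benzene"],
    "guidance": [100.0, 100.0, 180.0, 180.0, 5.0, 5.0],
})

# Distinct guidance values seen for each metric.
per_metric = toy.groupby("metric")["guidance"].unique()

# Number of distinct guidance values for each (metric, location) pair.
per_pair = toy.groupby(["metric", "location"])["guidance"].nunique()

# True where a metric has several guidance values overall
# but only one within every location.
varies_overall = toy.groupby("metric")["guidance"].nunique() > 1
constant_within = per_pair.groupby(level="metric").max() == 1
by_location = varies_overall & constant_within

print(per_metric)
print(by_location)
```

Running the same grouping on `rawdata` would show which explanation, if either, applies to the real guidance column.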

In [153]:
data = rawdata[['location', 'metric', 'ts', 'reading']].dropna()
data.groupby(['location', 'metric']).describe()

location,metric,,reading
Anand Vihar,Ambient Temperature,count,83.000000
Anand Vihar,Ambient Temperature,mean,32.431325
Anand Vihar,Ambient Temperature,std,0.540985
Anand Vihar,Ambient Temperature,min,32.100000
Anand Vihar,Ambient Temperature,25%,32.200000
Anand Vihar,Ambient Temperature,50%,32.200000
Anand Vihar,Ambient Temperature,75%,32.200000
Anand Vihar,Ambient Temperature,max,34.500000
Anand Vihar,Ammonia,count,83.000000
Anand Vihar,Ammonia,mean,32.037349
