# Using the OpenAQ API

The `openaq` API is an easy-to-use wrapper built around the [OpenAQ API](https://docs.openaq.org/). Complete API documentation can be found on their website.

There are no keys or rate limits (as of March 2017), so working with the API is straightforward. If you are building a website or app, you may want to just use the Python wrapper and interact with the data in JSON format. However, the rest of this tutorial assumes you are interested in analyzing the data. To get more out of it, I recommend installing `seaborn` for controlling the aesthetics of plots and `pandas` for working with data as DataFrames. For more information on these, check out the installation section of this documentation.

From this point forward, I assume you have at least a basic knowledge of Python and matplotlib. This documentation was built using the following versions of all packages:

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import openaq
import warnings

warnings.simplefilter('ignore')

%matplotlib inline

# Set major seaborn aesthetics
sns.set("notebook", style='ticks', font_scale=1.0)

# Increase the quality of inline plots
mpl.rcParams['figure.dpi'] = 500

print ("pandas v{}".format(pd.__version__))
print ("matplotlib v{}".format(mpl.__version__))
print ("seaborn v{}".format(sns.__version__))
print ("openaq v{}".format(openaq.__version__))

pandas v0.23.4
matplotlib v3.0.0
seaborn v0.9.0
openaq v1.1.0


## OpenAQ API

The OpenAQ API has only eight endpoints that we are interested in:

  * cities: provides a simple listing of cities within the platform
  * countries: provides a simple listing of countries within the platform
  * fetches: provides data about individual fetch operations that are used to populate data in the platform
  * latest: provides the latest value of each available parameter for every location in the system
  * locations: provides a list of measurement locations and their metadata
  * measurements: provides data about individual measurements
  * parameters: provides a simple listing of parameters within the platform
  * sources: provides a list of data sources
  
For detailed documentation about each one in the context of this API wrapper, please check out the API documentation.

### Your First Request

Real quick, let's go ahead and initiate an instance of the `openaq.OpenAQ` class so we can begin looking at data:

In [3]:
api = openaq.OpenAQ()

## Cities

The cities API endpoint lists the cities available within the platform. Results can be subselected by country and paginated to retrieve all results in the database. Let's start by performing a basic query with an increased limit (so we can get all of them) and return it as a DataFrame:

In [4]:
resp = api.cities(df=True, limit=10000)

# Summarize the returned DataFrame
resp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2449 entries, 0 to 2448
Data columns (total 4 columns):
city         2449 non-null object
count        2449 non-null int64
country      2449 non-null object
locations    2449 non-null int64
dtypes: int64(2), object(2)
memory usage: 76.6+ KB


So we retrieved 2400+ entries from the database. We can then take a look at them:

In [5]:
print (resp.head(10))

                                             city   count country  locations
0                              Escaldes-Engordany   42206      AD          2
1                                          unused    3238      AD          1
2                                       Abu Dhabi   10633      AE          1
3                                           Dubai    3626      AE          1
4                                    Buenos Aires   14976      AR          4
5                 Amt der Tiroler Landesregierung  113161      AT         19
6                Gemeinde Wien, MA22 Umweltschutz  130328      AT         21
7  Amt der Niederösterreichischen Landesregierung    1413      AT          3
8    Amt der Oberösterreichischen Landesregierung    2828      AT          6
9                                         Austria  121987      AT        174


Let's try to find out which ones are in India:

In [6]:
print (resp.query("country == 'IN'"))

                   city    count country  locations
1219             Jaipur   258364      IN          9
1220              Medak     2671      IN          1
1221            Jodhpur   186822      IN          2
1222              Delhi  1674023      IN         70
1223              Satna    13907      IN          3
1224           Gurugram    23513      IN          1
1225               Agra   117957      IN          2
1226          Mandideep    50439      IN          1
1227          Hyderabad   697767      IN         16
1228           Siliguri    15975      IN          3
1229             Howrah   113197      IN          5
1230             Rohtak   125266      IN          2
1231         Aurangabad   135837      IN          2
1232            Udaipur    58266      IN          2
1233            Bhiwadi    43008      IN          2
1234            Gurgaon   153776      IN          2
1235           Chittoor     2013      IN          1
1236          Singrauli    52661      IN          1
1237        

Great! For the rest of the tutorial, we are going to focus on Delhi, India. Why? Well, because there are over 500,000 data points and my personal research is primarily in India. We will also take a look at some $SO_2$ data from Hawai'i later on (another great research locale).
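
To see how Delhi compares with other cities, you can rank the filtered rows by their measurement count. The sketch below rebuilds a tiny DataFrame from a handful of rows copied from the outputs above (so it runs without a network call); with a live response you would apply the same `query` and `sort_values` calls to `resp` directly.

```python
import pandas as pd

# A few rows copied from the printed outputs above (not a full API response)
cities = pd.DataFrame({
    "city": ["Abu Dhabi", "Jaipur", "Delhi", "Hyderabad", "Buenos Aires"],
    "count": [10633, 258364, 1674023, 697767, 14976],
    "country": ["AE", "IN", "IN", "IN", "AR"],
    "locations": [1, 9, 70, 16, 4],
})

# Keep only Indian cities and rank them by number of measurements
india = cities.query("country == 'IN'").sort_values("count", ascending=False)

top_city = india.iloc[0]["city"]
print(india[["city", "count", "locations"]])
```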

## Countries

Similar to the `cities` endpoint, the `countries` endpoint lists the countries available. The only parameters we have to play with are the limit and page number. If we want to grab them all, we can just up the limit to the maximum (10000).

In [7]:
res = api.countries(limit=10000, df=True)

print (res.head())

   cities code    count  locations       name
0       2   AD    45444          3    Andorra
1       1   AR    14976          4  Argentina
2      19   AU  4493135        103  Australia
3      16   AT  1521351        306    Austria
4       1   BH    24239          1    Bahrain


## Fetches

If you are interested in getting information pertaining to the individual data fetch operations, go ahead and use this endpoint. Most people won't need to use this. This API method does not allow the `df` parameter; if you would like it to be added, drop me a message.

Otherwise, here is how you can access the JSON-formatted data:

In [8]:
status, resp = api.fetches(limit=1)

# Print out the meta info
resp['meta']

{'name': 'openaq-api',
 'license': 'CC BY 4.0',
 'website': 'https://docs.openaq.org/',
 'page': 1,
 'limit': 1,
 'found': 123231,
 'pages': 123231}
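
The `found` and `limit` fields in the meta block are enough to page through an endpoint until everything has been collected. Here is a minimal sketch of that loop. The helper `fetch_all` and the stand-in `fake_endpoint` are hypothetical names I made up for illustration; the only assumption about the real wrapper is that an endpoint method accepts `page` and `limit` and returns `(status, resp)` in the shape shown above.

```python
def fetch_all(fetch_page, limit=1000, max_pages=100):
    """Collect results from every page of a paginated endpoint.

    fetch_page is any callable taking page= and limit= keyword arguments
    and returning (status, resp), where resp has 'meta' and 'results'.
    """
    results = []
    page = 1
    while page <= max_pages:
        status, resp = fetch_page(page=page, limit=limit)
        results.extend(resp["results"])

        # Stop once we have covered everything the meta block says exists
        if page * limit >= resp["meta"]["found"] or not resp["results"]:
            break
        page += 1
    return results


def fake_endpoint(page, limit):
    """Hypothetical stand-in for an API call, serving 25 dummy records."""
    data = list(range(25))
    start = (page - 1) * limit
    meta = {"found": len(data), "page": page, "limit": limit}
    return 200, {"meta": meta, "results": data[start:start + limit]}


all_rows = fetch_all(fake_endpoint, limit=10)
print(len(all_rows))
```

With the real wrapper, something like `fetch_all(lambda page, limit: api.fetches(page=page, limit=limit))` should work under the assumption stated above. The `max_pages` cap is there so that a typo in `limit` does not turn into thousands of requests.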

## Parameters

The `parameters` endpoint will provide a listing of all the parameters available:

In [9]:
res = api.parameters(df=True)

print (res)

                                         description    id   name  \
0                                       Black Carbon    bc     BC   
1                                    Carbon Monoxide    co     CO   
2                                   Nitrogen Dioxide   no2    NO2   
3                                              Ozone    o3     O3   
4  Particulate matter less than 10 micrometers in...  pm10   PM10   
5  Particulate matter less than 2.5 micrometers i...  pm25  PM2.5   
6                                     Sulfur Dioxide   so2    SO2   

  preferredUnit  
0         µg/m³  
1           ppm  
2           ppm  
3           ppm  
4         µg/m³  
5         µg/m³  
6           ppm  
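
Because the preferred unit differs between gases (ppm) and particulates (µg/m³), it can be handy to keep a lookup from parameter id to unit around when labeling plots. The sketch below rebuilds the relevant columns from the table above so it runs offline; with a live response you would start from `res` instead.

```python
import pandas as pd

# Columns copied from the parameters table printed above
params = pd.DataFrame({
    "id": ["bc", "co", "no2", "o3", "pm10", "pm25", "so2"],
    "name": ["BC", "CO", "NO2", "O3", "PM10", "PM2.5", "SO2"],
    "preferredUnit": ["µg/m³", "ppm", "ppm", "ppm", "µg/m³", "µg/m³", "ppm"],
})

# Map each parameter id to its preferred unit
unit_by_id = params.set_index("id")["preferredUnit"].to_dict()

# Parameters reported as mixing ratios rather than mass concentrations
gases_in_ppm = sorted(k for k, v in unit_by_id.items() if v == "ppm")
print(gases_in_ppm)
```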


## Sources

The `sources` endpoint will provide a list of the sources where the raw data came from.

In [29]:
status, resp = api.sources(df=False)

# Print out the first two results
resp['results'][0:2]

[[{'url': 'unused',
   'adapter': 'unused',
   'name': 'Dr. Raphael E. Arku and Colleagues',
   'city': 'Accra',
   'country': 'GH',
   'description': 'Manual ingest of data from Dr. Raphael E. Arku and colleagues',
   'sourceURL': 'https://www.ncbi.nlm.nih.gov/pubmed?term=Arku+RE%5BAuthor%5D+AND+Accra+air+pollution%5BAll+Fields%5D&cmd=DetailsSearch',
   'contacts': ['info@openaq.org'],
   'active': False}],
 [{'url': 'http://discomap.eea.europa.eu/map/fme/latest/',
   'adapter': 'eea-direct',
   'name': 'EEA Gibraltar',
   'city': '',
   'country': 'GI',
   'description': 'Gibraltar data from UTD service',
   'sourceURL': 'http://www.eea.europa.eu/themes/air/air-quality',
   'contacts': ['info@openaq.org'],
   'active': True}]]

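Note that requesting a DataFrame from this endpoint raises an error with the package versions listed above:

In [21]: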
res = api.sources(df=True)

AttributeError: 'list' object has no attribute 'values'
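
The JSON output above shows each source wrapped in its own single-item list, and that nesting is a plausible reason the DataFrame conversion fails. As a workaround sketch (my own suggestion, not something the wrapper provides), you can flatten the nested list yourself and hand it to `pandas`. The records below are trimmed copies of the two results shown above.

```python
from itertools import chain

import pandas as pd

# Trimmed copies of the two nested results printed above
nested = [
    [{"name": "Dr. Raphael E. Arku and Colleagues", "city": "Accra",
      "country": "GH", "active": False}],
    [{"name": "EEA Gibraltar", "city": "",
      "country": "GI", "active": True}],
]

# Flatten the list of lists into a single list of dicts
flat = list(chain.from_iterable(nested))

sources = pd.DataFrame(flat)

# Keep only the sources that are still being fetched
active = sources[sources["active"]]
print(active[["name", "country"]])
```

With a live response, `nested` would simply be `resp['results']`, assuming it keeps the shape shown above.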

## Locations

The `locations` endpoint will return the list of measurement locations and their metadata. We can do quite a bit of querying with this one.

Let's see what the data looks like:

In [None]:
res = api.locations(df=True)

res.info()

In [None]:
# Print out the first location (.ix is deprecated in pandas; use .iloc)
res.iloc[0]

What if we just want to grab the locations in Delhi?

In [None]:
res = api.locations(city='Delhi', df=True)

res.iloc[0]

What about just figuring out which locations in Delhi have $PM_{2.5}$ data?

In [None]:
res = api.locations(city='Delhi', parameter='pm25', df=True)

res.iloc[0]

## Latest

Grab the latest data from a location or locations.

What was the most recent $PM_{2.5}$ data in Delhi?

In [None]:
res = api.latest(city='Delhi', parameter='pm25', df=True)

res.head()

What about the most recent $SO_2$ data in Hawai'i?

In [None]:
res = api.latest(city='Hilo', parameter='so2', df=True)

res

## Measurements

Finally, the endpoint we've all been waiting for! Measurements allows you to grab all of the data! You can query on a whole bunch of parameters listed in the API documentation. Let's dive in.

Let's grab the past 10000 data points for $PM_{2.5}$ in Delhi:

In [None]:
res = api.measurements(city='Delhi', parameter='pm25', limit=10000, df=True)

# Print out the statistics on a per-location basis
res.groupby(['location'])['value'].describe()

Clearly, we should be doing some serious data cleaning ;) Why don't we go ahead and plot all of these locations on a figure.

In [None]:
fig, ax = plt.subplots(1, figsize=(10, 6))

for group, df in res.groupby('location'):
    # Query the data to only get non-negative values and resample to hourly
    _df = df.query("value >= 0.0").resample('1h').mean()
    
    _df.value.plot(ax=ax, label=group)
    
ax.legend(loc='best')
ax.set_ylabel(r"$PM_{2.5}$  [$\mu g m^{-3}$]", fontsize=20)
ax.set_xlabel("")
sns.despine(offset=5)

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.show()
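
If you want to see what the cleaning step inside the loop above actually does (drop negative values, then average to hourly), you can try it in isolation. The readings below are synthetic values I made up for illustration, with `-999` standing in for a sensor error code; they are not real OpenAQ data.

```python
import pandas as pd

# Synthetic 15-minute readings; -999 mimics a sensor error code
idx = pd.date_range("2017-03-01 00:00", periods=8, freq="15min")
raw = pd.DataFrame(
    {"value": [100.0, 120.0, -999.0, 140.0, 90.0, 110.0, 130.0, 70.0]},
    index=idx,
)

# Drop negative values, then average what is left within each hour
clean = raw.query("value >= 0.0")
hourly = clean[["value"]].resample("1h").mean()

print(hourly)
```

The first hour averages only the three valid readings (120.0) and the second hour averages all four (100.0). Without the filter, the single `-999` would drag the first hour's mean far below zero.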

Don't worry too much about how ugly and uninteresting the plot above is...we'll take care of that in the next tutorial! Let's go ahead and look at the distribution of $PM_{2.5}$ values seen in Delhi by various sensors. This is the same data as above, but viewed in a different way.

In [None]:
fig, ax = plt.subplots(1, figsize=(14,7))

ax = sns.boxplot(
    x='location', 
    y='value', 
    data=res.query("value >= 0.0"), 
    fliersize=0, 
    palette='deep',
    ax=ax)

ax.set_ylim([0, 750])
ax.set_ylabel(r"$PM_{2.5}\;[\mu gm^{-3}]$", fontsize=18)
ax.set_xlabel("")

sns.despine(offset=10)

plt.xticks(rotation=90)
plt.show()

If we remember from above, there was at least one location where many parameters were measured. Let's go ahead and look at that location and see if there is any correlation among parameters!

In [None]:
res = api.measurements(city='Delhi', location='Anand Vihar', limit=1000, df=True)

# Which params do we have?
res.parameter.unique()

In [None]:
df = pd.DataFrame()

for u in res.parameter.unique():
    _df = res[res['parameter'] == u][['value']]
    _df.columns = [u]
    
    # Merge the dataframes together
    df = pd.merge(df, _df, left_index=True, right_index=True, how='outer')

# Drop rows where any parameter is missing
df.dropna(how='any', inplace=True)

g = sns.PairGrid(df, diag_sharey=False)

g.map_lower(sns.kdeplot, cmap='Blues_d')
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3)

plt.show()
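
As an alternative to the merge loop above, the same long-to-wide reshaping can be sketched with `pivot_table`. This is a swapped-in technique rather than what the cell above does, and the values are synthetic ones made up for illustration. It also assumes the timestamps live in a column (here called `date`, a name I chose), so with a date-indexed response you would call `reset_index()` first.

```python
import pandas as pd

# Synthetic long-format measurements: one row per (timestamp, parameter)
long_df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-03-01 00:00", "2017-03-01 00:00",
        "2017-03-01 01:00", "2017-03-01 01:00",
        "2017-03-01 02:00",
    ]),
    "parameter": ["pm25", "no2", "pm25", "no2", "pm25"],
    "value": [150.0, 40.0, 170.0, 45.0, 160.0],
})

# One column per parameter; duplicate timestamps are averaged
wide = long_df.pivot_table(
    index="date", columns="parameter", values="value", aggfunc="mean"
)

# Keep only timestamps where every parameter was reported
wide = wide.dropna(how="any")

print(wide)
```

The resulting `wide` frame can be passed to `sns.PairGrid` just like `df` in the cell above.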

For kicks, let's go ahead and look at a timeseries of $SO_2$ data in Hawai'i. Quiz: What do you expect? Did you know that Hawai'i has a huge $SO_2$ problem?

In [None]:
res = api.measurements(city='Hilo', parameter='so2', limit=10000, df=True)

# Print out the statistics on a per-location basis
res.groupby(['location'])['value'].describe()

In [None]:
fig, ax = plt.subplots(1, figsize=(10, 5))

for group, df in res.groupby('location'):
    # Query the data to only get non-negative values and resample to 6-hour means
    _df = df.query("value >= 0.0").resample('6h').mean()
    
    # Convert from ppm to ppb by multiplying by 1000
    _df['value'] *= 1e3
    
    _df.value.plot(ax=ax, label=group)
    
ax.legend(loc='best')
ax.set_ylabel(r"$SO_2 \; [ppb]$", fontsize=18)
ax.set_xlabel("")

sns.despine(offset=5)

plt.show()

**NOTE:** These values are 6-hour means. The local readings can actually get much, much higher (>5 ppm!) when looking at 1-minute data.
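
To get a feel for how much a 6-hour mean can hide, here is a small sketch using a synthetic 1-minute series: a flat background of 0.01 ppm with a single one-minute spike at 5 ppm. These numbers are made up for illustration and are not real measurements.

```python
import pandas as pd

# Synthetic 1-minute SO2 series covering six hours, in ppm
idx = pd.date_range("2017-03-01 00:00", periods=360, freq="1min")
ppm = pd.Series(0.01, index=idx)
ppm.iloc[180] = 5.0  # one-minute spike

# Convert from ppm to ppb
ppb = ppm * 1e3

six_hour_mean = ppb.resample("6h").mean()
peak = ppb.max()

print("6h mean: {:.1f} ppb, peak: {:.0f} ppb".format(six_hour_mean.iloc[0], peak))
```

The spike reaches 5000 ppb, yet the 6-hour mean comes out at roughly 24 ppb, which is why short-lived events are easy to miss in averaged data.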