## Summary Statistics

This notebook draws box plots showing how the concentration of each pollutant is distributed in each month, for several cities. The data is read from the file multi_city.csv.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

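# 'seaborn-bright' was renamed 'seaborn-v0_8-bright' in matplotlib 3.6;
# use whichever name this installation provides
style = 'seaborn-bright' if 'seaborn-bright' in plt.style.available else 'seaborn-v0_8-bright'
plt.style.use(style)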

In [None]:
df = pd.read_csv('multi_city.csv')

## A quick look at the data, to make sure we understand what's what

In [None]:
df.describe()

In [None]:
df.head()

Interesting: The dataframe has columns for City and date, followed by *four* pollutants, but describe() only summarizes three of them.

Let's also see what cities we have...


In [None]:
df.City.unique()

In [None]:
df[df.City == 'Pune'].describe()

In [None]:
df.dtypes

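Aha! Here's why pm10 wasn't being described: it is not stored as a numeric type, and describe() only summarizes numeric columns by default. We need to convert it to a numeric type, and also parse the date column as proper dates.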
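
To see this behaviour in isolation, here is a minimal sketch on a toy dataframe (the values are made up, not taken from multi_city.csv). A single non-numeric entry is enough to make pandas store the whole column as text, and describe() then skips it unless asked to include everything.

```python
import pandas as pd

# Toy frame (hypothetical values): the ' ' entry keeps pm10 from being numeric
toy = pd.DataFrame({"pm25": [12.0, 30.5, 18.2], "pm10": ["40", " ", "55"]})

print(toy.dtypes)  # pm25 is float64, pm10 is a text (object/string) column

# By default only numeric columns are summarized
print(toy.describe().columns.tolist())

# include="all" also reports on non-numeric columns (count, unique, top, freq)
summary_all = toy.describe(include="all")
print(summary_all)
```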

## A bit of cleanup!
When I tried changing the type of pm10, I got an error: there's an entry that cannot be interpreted as an int. The checks below for NaNs and empty strings don't find it, though.

In [None]:
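# Positions (row indices, column indices) of missing values
print(np.where(pd.isnull(df)))
# Positions of empty strings (applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
print(np.where(df.applymap(lambda x: x == '')))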

### A bit of a hack
Row 437 has an empty value for pm10, but this is not detected either by isnull or as an empty string. In the original csv, the cell is just... empty! How do we detect and remove these?

For now, I'm just going to remove this row

In [None]:
df.iloc[437]

In [None]:
df.drop(index=437, inplace=True)
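
One possible way to find such cells in general, sketched on a toy column with made-up values. It assumes the problem cell holds something like a whitespace-only string, which I have not verified for row 437. The idea is to let pd.to_numeric try every entry and then look at the ones that failed.

```python
import pandas as pd

# Toy column (hypothetical values); ' ' stands in for the assumed problem cell
toy = pd.DataFrame({"pm10": ["40", " ", "55", "n/a"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
as_number = pd.to_numeric(toy["pm10"], errors="coerce")

# Rows that were not missing originally but failed to parse
bad_rows = toy.index[as_number.isna() & toy["pm10"].notna()]
print(bad_rows.tolist())

# Keep only the rows that parsed cleanly
clean = toy.assign(pm10=as_number).dropna(subset=["pm10"])
print(clean)
```

This flags the offending rows by index instead of hard-coding one, so it would also catch other unparseable entries if there were any.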

Left to itself, to_datetime cannot tell which field is the day and which is the month, so the date column gets misinterpreted. I force the order with an explicit format string.

In [None]:
df.date = pd.to_datetime(df.date, format='%d/%m/%Y')  # the format string says which field is the day and which is the month
df.pm10 = pd.to_numeric(df.pm10)
df.dtypes
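
A small illustration of why the format string matters, using two made-up date strings where both readings are valid dates. The same text parses to different months depending on the format.

```python
import pandas as pd

# Hypothetical date strings, written day/month/year
dates = pd.Series(["03/04/2019", "05/06/2019"])

day_first = pd.to_datetime(dates, format="%d/%m/%Y")    # the reading used in this notebook
month_first = pd.to_datetime(dates, format="%m/%d/%Y")  # the other possible reading

print(day_first.dt.month.tolist())    # [4, 6]
print(month_first.dt.month.tolist())  # [3, 5]
```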

Finally, I add another column that just encodes the months. This helps me create the box plots grouped by month (see below)

I also sort the dataframe so that cities line up nicely (most to least polluted also corresponds to alphabetical order, in this case!)

In [None]:
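df['month'] = df.date.dt.month          # month number (1-12), used to group the box plots
df['Month'] = df.date.dt.month_name()   # month name, for readability
# mergesort is stable, so rows keep their original order within each city
df.sort_values(by=['City'], kind='mergesort', inplace=True)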

In [None]:
df.head()
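
A side note on the two month columns: the numeric one sorts in calendar order for free, while plain month names sort alphabetically. If the names were ever used for grouping, one option would be an ordered categorical. A minimal sketch with made-up dates, assuming an English locale for the month names:

```python
import calendar
import pandas as pd

# Hypothetical dates, deliberately out of order
dates = pd.to_datetime(pd.Series(["2019-03-01", "2019-01-15", "2019-02-10"]))
month_names = dates.dt.month_name()

# calendar.month_name[0] is an empty string, so skip it
order = list(calendar.month_name)[1:]
ordered = pd.Series(pd.Categorical(month_names, categories=order, ordered=True))

print(sorted(month_names))           # alphabetical order
print(list(ordered.sort_values()))   # calendar order
```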

## Plot the data! PM 2.5

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(18,5), sharey=True)
plt.title('')
plt.suptitle('')

for i, c in enumerate(df.City.unique()):
    df[df.City == c].boxplot(column='pm25', by='month', ax=ax[i])
    #print(i, ":", c)
    ax[i].set_title(c)
    ax[i].axhline(25, c='g')
    ax[i].axhline(60, c='orange')

plt.show()

I don't like the 'month' label at the bottom and the 'Boxplot grouped by month' title, but not enough that I'll invest the time and energy to fix it!
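
For what it's worth, here is a sketch of one way to get rid of both, shown on toy data with made-up values and hypothetical city names 'A' and 'B'. It relies on boxplot(by=...) setting the x label and the figure title each time it is called, so they have to be cleared after the call rather than before.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical values) with the same column names as the notebook
toy = pd.DataFrame({
    "City": ["A"] * 4 + ["B"] * 4,
    "month": [1, 1, 2, 2] * 2,
    "pm25": [10, 20, 30, 40, 15, 25, 35, 45],
})

cities = toy.City.unique()
fig, ax = plt.subplots(1, len(cities), figsize=(10, 4), sharey=True)

for i, c in enumerate(cities):
    toy[toy.City == c].boxplot(column="pm25", by="month", ax=ax[i])
    ax[i].set_title(c)
    ax[i].set_xlabel("")  # drop the automatic 'month' label

# Clear the 'Boxplot grouped by month' figure title once all panels are drawn
fig.suptitle("")
```

In a notebook the figure renders at the end of the cell; in a script, add plt.show().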

## PM 10

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15,8), sharey=True)
plt.title('')
plt.suptitle('')

for i, c in enumerate(df.City.unique()):
    df[df.City == c].boxplot(column='pm10', by='month', ax=ax[i])
    #print(i, ":", c)
    ax[i].set_title(c)
    ax[i].axhline(50, c='g')
    ax[i].axhline(100, c='orange')

plt.show()

## CO

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15,8), sharey=True)
plt.title('')
plt.suptitle('')

for i, c in enumerate(df.City.unique()):
    df[df.City == c].boxplot(column='co', by='month', ax=ax[i])
    #print(i, ":", c)
    ax[i].set_title(c)
    #ax[i].axhline(100, c='g')
    #ax[i].axhline(100, c='orange')
plt.show()

## O3

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15,8), sharey=True)
plt.title('')
plt.suptitle('')

for i, c in enumerate(df.City.unique()):
    df[df.City == c].boxplot(column='o3', by='month', ax=ax[i])
    #print(i, ":", c)
    ax[i].set_title(c)
    ax[i].axhline(100, c='g')
    #ax[i].axhline(100, c='orange')
plt.show()

## And all together

Just because I want to!

In [None]:
fig, ax = plt.subplots(4, 3, figsize=(15,16), sharey='row')
plt.title('')
plt.suptitle('')
cities = df.City.unique()
polls = ['pm25', 'pm10', 'co', 'o3']

for i, a in enumerate(ax.flatten()):
    p = polls[i // len(cities)]
    c = cities[i % len(cities)]
    df[df.City == c].boxplot(column=p, by='month', ax=a)
    #print(i, ":", p, ":", c)
    a.set_title(f'{c}: {p}')  # include the pollutant so the rows can be told apart
    #ax[i].axhline(100, c='g')
    #ax[i].axhline(100, c='orange')
plt.show()
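
The index arithmetic in the loop above maps a flat position in the grid to a (pollutant, city) pair: integer division picks the row, the remainder picks the column. A quick check of that mapping, with hypothetical city names standing in for the real ones:

```python
from itertools import product

polls = ['pm25', 'pm10', 'co', 'o3']
cities = ['X', 'Y', 'Z']  # hypothetical city names

# Flat index i -> (row, column) of a grid with len(cities) columns
pairs = [(polls[i // len(cities)], cities[i % len(cities)])
         for i in range(len(polls) * len(cities))]

# The same pairs, written as a nested loop over pollutants, then cities
assert pairs == list(product(polls, cities))
print(pairs[:4])
```

So each row of the grid is one pollutant and each column is one city, which is why sharey='row' makes sense here.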