## DataFrame Rationalization

We have data in two different formats. For large cities, the tables are organized as follows:

The primary columns are Date, City and Specie (pollutant). The remaining columns report the count, min, max, median and variance of the pollutant for that city on that particular date. 

| | Unnamed: 0 | Date | Country | City | Specie | count | min | max | median | variance |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2979 | 06/04/2020 | IN | Hyderabad | no2 | 95 | 0.1 | 31.2 | 6.1 | 848.08 |

Thus, to extract the data for either a pollutant or a city, we need to filter the rows. I use something along the lines of
`df[(df.City == c)]` and `df[(df.Specie == p)]`, and these can of course be combined.
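
As a minimal sketch of combining the two filters: the toy table below mimics the large-city layout, but its values (and the Delhi row) are invented for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy table in the large-city (long) layout; all values are made up.
df = pd.DataFrame({
    "Date": ["06/04/2020", "06/04/2020", "07/04/2020", "06/04/2020"],
    "City": ["Hyderabad", "Hyderabad", "Hyderabad", "Delhi"],
    "Specie": ["no2", "pm25", "no2", "no2"],
    "median": [6.1, 58.0, 7.3, 12.4],
})

c, p = "Hyderabad", "no2"

# Each condition is a boolean Series; wrap each in parentheses and join with &.
subset = df[(df.City == c) & (df.Specie == p)]
print(subset)
```

The parentheses matter here, because `&` binds more tightly than `==` in Python.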

For rural areas, however, the data is organized differently. The primary columns are now the Date and the City, and for each pair, the remaining columns report the pollution levels for the different pollutants.

| | date | pm25 | pm10 | o3 | no2 | so2 | co | city |
|---|---|---|---|---|---|---|---|---|
| 3 | 06/01/2020 | 62 | 20 | 9 | 19 | 2 | 8 | coimbatore |

So we can select the rows for a city and then pick the pollutant's column, using a variant of `df.loc[df.city == c, 'pm25']`.
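
A small sketch of that row-and-column selection with `.loc`, which takes a boolean row mask and a column label together. The toy table follows the rural layout, and its values (and the second city) are invented for illustration.

```python
import pandas as pd

# Toy table in the rural (wide) layout; all values are made up.
df = pd.DataFrame({
    "date": ["06/01/2020", "07/01/2020", "06/01/2020"],
    "pm25": [62, 55, 40],
    "no2": [19, 17, 11],
    "city": ["coimbatore", "coimbatore", "madurai"],
})

c = "coimbatore"

# Boolean mask selects the rows, the label selects the pollutant column.
pm25_c = df.loc[df.city == c, "pm25"]
print(pm25_c)
```

Passing a list of labels instead, such as `["pm25", "no2"]`, would return a DataFrame with several pollutants at once.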

This works fine when plotting things separately for rural and urban regions. However, what happens when we want to compare the data from the two tables? We need to convert the two datasets into a single representation. Depending on what we want to do with the data, one format will be preferred over the other. In this sheet, I document how to do the conversion both ways.


In [None]:
import pandas as pd

----
First, let's load in the data and clean it up a bit:

In [None]:
year = 2020
df_r = pd.read_csv(f'data/{year}_7rural.csv')
df_u = pd.read_csv(f'data/{year}_7cities.csv')

In [None]:
print(df_r.head(5))
print(df_u.head(5))

In [None]:
# cleanup: rename the column titles for rural, remove min/max/count and variance for urban. 
df_r.rename(columns = {'date':'Date', 'city':'City'}, inplace=True)
df_u.drop(['Unnamed: 0', 'count', 'min', 'max', 'variance', 'Country'], axis=1, inplace=True)

# Fix the Dates for both and remove data that does not belong to the current year.
# We need the format string to specify which field is the day and which is the month.
df_r.Date = pd.to_datetime(df_r.Date, format='%d/%m/%Y')
df_u.Date = pd.to_datetime(df_u.Date, format='%d/%m/%Y')
df_r = df_r[df_r['Date'].dt.year == year]
df_u = df_u[df_u['Date'].dt.year == year]

In [None]:
print(df_r.head(5))
print(df_u.head(5))

----
### 1. Converting the Rural to the Urban format

The format of both tables is as above. For every city and date, we'd like to add an additional row for each pollutant. This is pretty easy to do using the `melt` method:

In [None]:
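# rural: one row per (City, Date, pollutant) instead of one column per pollutant
df_ru = pd.melt(df_r, id_vars=['City', 'Date'], value_vars=['pm25', 'pm10', 'o3', 'no2', 'so2', 'co'])
df_ru.rename(columns={'variable': 'Specie', 'value': 'Value'}, inplace=True)
# urban: already in this layout, so only the value column needs renaming
df_uu = df_u.copy()
df_uu.rename(columns={'median': 'Value'}, inplace=True)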
print("rural:", df_ru.shape)
print(df_ru.head(5))
print("urban:", df_uu.shape)
print(df_uu.head(5))

... and we're pretty much done! As a finishing touch, I'll just merge these into one table:

In [None]:
combo_u = pd.concat([df_uu, df_ru], sort=False)
combo_u
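
----
Once everything is in the long format, comparing a rural and an urban city is a filter or a group-by away. Here is a self-contained sketch on toy data. The values are made up, and the names `rural`, `urban`, `combo` and `mean_by_city` are only for illustration, not part of the notebook above.

```python
import pandas as pd

# Made-up rural table in the wide layout (one column per pollutant).
rural = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-06", "2020-01-07"]),
    "City": ["coimbatore", "coimbatore"],
    "pm25": [62, 58],
    "no2": [19, 21],
})

# Made-up urban table, already in the long layout.
urban = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-06", "2020-01-07"]),
    "City": ["Hyderabad"] * 4,
    "Specie": ["pm25", "pm25", "no2", "no2"],
    "Value": [80.0, 90.0, 6.0, 8.0],
})

# Wide -> long: every non-id column becomes a (Specie, Value) pair.
rural_long = rural.melt(id_vars=["City", "Date"], var_name="Specie", value_name="Value")

# Stack the two tables; ignore_index avoids repeated row labels.
combo = pd.concat([urban, rural_long], ignore_index=True, sort=False)

# Mean level per pollutant, with one column per city for easy comparison.
mean_by_city = combo.groupby(["Specie", "City"])["Value"].mean().unstack("City")
print(mean_by_city)
```

Passing `var_name` and `value_name` to `melt` is an alternative to renaming the `variable` and `value` columns afterwards.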

----
### 2. Converting the Urban to the Rural format

In [None]:
print(df_r.head(5))
print(df_u.head(5))

What we want to do here is split up the urban data so that we have a column for each pollutant, with Date and City as the primary columns.

Fairly straightforward to accomplish:

In [None]:
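# attempt 1: pivot_table (aggregates duplicate entries, using the mean by default)
# without `values=`, the columns come out as a ('median', pollutant) MultiIndex
df_ur = df_u.pivot_table(index=['City', 'Date'], columns='Specie')
df_ur
# df_ur.reset_index()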

In [None]:
# attempt 2: set_index + unstack, which gives one flat column per pollutant
df_ur = df_u.set_index(['Date', 'City', 'Specie'])['median'].unstack().reset_index()
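
----
The two attempts differ in how they treat repeated (Date, City, Specie) combinations. The sketch below uses a made-up table with one deliberately duplicated entry. Whether the real files contain such duplicates is not something I have checked, so treat this as an illustration of the behaviour only.

```python
import pandas as pd

# Made-up long table; the no2 reading for this city and date appears twice.
long_df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-04-06"] * 3),
    "City": ["Hyderabad"] * 3,
    "Specie": ["no2", "no2", "pm25"],
    "median": [6.0, 8.0, 58.0],
})

# pivot_table aggregates the duplicates (the default aggfunc is the mean).
wide_pt = long_df.pivot_table(
    index=["City", "Date"], columns="Specie", values="median"
).reset_index()
wide_pt.columns.name = None  # drop the leftover 'Specie' label on the columns
print(wide_pt)

# set_index/unstack does not aggregate and raises a ValueError instead.
try:
    long_df.set_index(["Date", "City", "Specie"])["median"].unstack()
    unstack_failed = False
except ValueError as err:
    unstack_failed = True
    print("unstack failed:", err)
```

Passing `values='median'` as a plain string keeps the pivoted columns flat, so the result lines up with the rural table without further reshaping.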

... and as before, we merge the result with the rural data:

In [None]:
combo_r = pd.concat([df_r, df_ur], sort=False)
combo_r