In [3]:
import pandas as pd
import numpy as np

# Brian Kitano Coding Test

Start: 9:23 AM

## 2. Data Cleaning

Start: 9:23 AM

The central task is to collapse the grid-level dataset into a district-level dataset.

The final product from this section is a district-level daily dataset from 2009-2013 with temperature, rainfall, and total rainfall variables.

Steps:
1. Calculate the set of points $P$ to include for each district.
2. Use the formulas to calculate the district statistic.

The actual EPIC project used 339 districts in India, 112 years of data and a 0.25°(latitude) x 0.25°(longitude) grid for rainfall. How would your code scale up with this larger dataset? Would you need any additional computing resources? Be as specific as possible.

### Creating the P

We want the smallest possible set of points for each district: that minimizes the scope of each query, which will scale better.

Algorithm: take a weighted average of daily mean temperature, mean rainfall, and total rainfall over all grid points within 100 km of each district's geographic center. The weights are the inverse of the squared distance from the district center.

The weighted average is:
$$
\bar t_i = \sum_{p \in P} w_p t_{p,i} 
$$

where $P$ is the set of points within 100km of the district centroid, $c$ is the district centroid, $d(c,p)$ is the distance from $c$ to grid point $p$, $i$ is the date, and where 

$$w_p = \frac{d(c,p)^{-2}}{\sum_{q \in P} d(c,q)^{-2}}$$



Assumptions:
- the `rainfall` csv's are identical
- the weights are normalized to sum to 1 within each district and day; otherwise the averages end up with nonsensical magnitudes
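As a quick check on the weight formula, here is a minimal sketch of the normalization. The helper name `inverseSquareWeights` is hypothetical (it is not part of the notebook); the three distances are the `km` values that appear further down for district 1711.

```python
import numpy as np

def inverseSquareWeights(km):
    """Normalized inverse-square-distance weights (hypothetical helper)."""
    km = np.asarray(km, dtype=float)
    raw = km ** -2.0          # d(c, p)^-2 for each grid point
    return raw / raw.sum()    # divide by the total so the weights sum to 1

# Distances (km) of the three grid points within 100 km of district 1711
weights = inverseSquareWeights([84.662535, 39.326242, 83.343798])
print(weights)
```

The closest point (about 39 km away) receives the largest weight, and because the weights sum to 1 the weighted average stays on the same scale as the raw temperatures.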

In [4]:


raw_rain_df = pd.read_csv('./data/Rainfall/rainfall_2010.csv')
raw_temp_df = pd.read_csv('./data/Temperature/temperature_2010.csv')


In order to do our analysis, we need to ensure that every entry has at least a day, month, year, latitude and longitude. Let's remove any that don't.

In [5]:

def getDefinedRows(df):
    rows_with_data = \
        pd.notna(df['latitude']) & \
        pd.notna(df['longitude']) & \
        pd.notna(df['day']) & \
        pd.notna(df['month']) & \
        pd.notna(df['year'])
    
    return rows_with_data

rain_df = raw_rain_df[getDefinedRows(raw_rain_df)]
temp_df = raw_temp_df[getDefinedRows(raw_temp_df)]
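The same filter can probably be written more compactly with `DataFrame.dropna(subset=...)`. The sketch below uses a made-up toy dataframe, and `dropRowsMissingKeys` is a hypothetical name, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

KEY_COLUMNS = ['latitude', 'longitude', 'day', 'month', 'year']

def dropRowsMissingKeys(df, key_columns=KEY_COLUMNS):
    # Drop a row only if one of the key columns is missing;
    # missing measurements (e.g. rainfall) are left in place.
    return df.dropna(subset=key_columns)

# Toy data: row 1 lacks a latitude, row 2 lacks a day
toy_df = pd.DataFrame({
    'latitude': [22.5, np.nan, 23.5],
    'longitude': [71.5, 72.5, 72.5],
    'day': [1, 1, np.nan],
    'month': [1, 1, 1],
    'year': [2010, 2010, 2010],
    'rainfall': [0.0, 1.3, np.nan],
})
clean_df = dropRowsMissingKeys(toy_df)
print(clean_df)
```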

I'm also assuming that in any 'year' file, there is actually only one calendar year represented.

To join the two dataframes we need a key that is identical in both. Note that the `date` column is formatted differently in each dataframe, but `day`, `month`, and `year` are the same, so we merge on those together with `latitude` and `longitude`.

In [6]:
on_columns = ['latitude', 'longitude', 'day', 'month', 'year']
total_df = rain_df.merge(temp_df, how='outer', left_on=on_columns, right_on=on_columns)

In [7]:
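# sanity check: the outer merge should not add rows to either input
# (use `and`, not `&`: `&` binds tighter than `==` and would compare the wrong values)
len(total_df) == len(rain_df) and len(total_df) == len(temp_df)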

True
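Comparing lengths only tells us that no rows were added. A slightly stronger check, sketched here on made-up toy data, is to pass `indicator=True` to `merge`, which adds a `_merge` column saying whether each row came from the left frame, the right frame, or both.

```python
import pandas as pd

on_columns = ['latitude', 'longitude', 'day', 'month', 'year']

# Toy frames: one shared key, one rain-only key, one temperature-only key
toy_rain = pd.DataFrame({
    'latitude': [22.5, 22.5], 'longitude': [71.5, 72.5],
    'day': [1, 1], 'month': [1, 1], 'year': [2010, 2010],
    'rainfall': [0.0, 1.3],
})
toy_temp = pd.DataFrame({
    'latitude': [22.5, 23.5], 'longitude': [71.5, 72.5],
    'day': [1, 1], 'month': [1, 1], 'year': [2010, 2010],
    'temperature': [22.32, 18.83],
})

checked_df = toy_rain.merge(toy_temp, how='outer', on=on_columns, indicator=True)
match_counts = checked_df['_merge'].value_counts()
print(match_counts)
```

On the real data, any nonzero `left_only` or `right_only` count would flag grid-point days that exist in only one of the two files.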

In [8]:
rainfall_values = sum(pd.notna(total_df['rainfall']))
print(f'rainfall values: {rainfall_values}')

temp_values = sum(pd.notna(total_df['temperature']))
print(f'temp values: {temp_values}')

rainfall values: 112785
temp values: 128471


In [9]:
distances_df = pd.read_csv('./distances.csv')

In [10]:
def getPointsWithinDistance(distances_df, centroid_id, threshold):
    return distances_df[np.logical_and(distances_df['km'] <= threshold, distances_df['unique_dist_id'] == centroid_id)]

In [11]:
getPointsWithinDistance(distances_df, 1711, 100)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,unique_dist_id,uid,km,latitude,longitude,centroid_latitude,centroid_longitude
376,376,28200,1711,0x178,84.662535,22.5,71.5,22.7812,72.267197
377,377,28275,1711,0x179,39.326242,22.5,72.5,22.7812,72.267197
408,408,30600,1711,0x198,83.343798,23.5,72.5,22.7812,72.267197
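The notebook does not show how `distances.csv` was built. Assuming the `km` column is a great-circle distance, a haversine sketch like the one below reproduces it closely. The function name and the 6371 km Earth radius are assumptions, not taken from the source.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversineKm(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Grid point (22.5, 72.5) to the centroid of district 1711
km = haversineKm(22.5, 72.5, 22.7812, 72.267197)
print(km)
```

For this pair the result should land close to the 39.33 km recorded in the table above, which is consistent with (but does not prove) a haversine-style calculation.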


In [12]:
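def getDataFromPoints(dataDf, pointsDf):
    # outer merge + dropna() keeps only (point, day) rows that matched on both
    # sides and have no missing values in any column
    df = pointsDf.merge(
        dataDf,
        how='outer',
        on=['latitude', 'longitude']
    ).dropna()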
    
    return df
    

In [14]:
pointDataDf = getDataFromPoints(total_df, getPointsWithinDistance(distances_df, 1711, 100))

In [15]:
pointDataDf

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,unique_dist_id,uid,km,latitude,longitude,centroid_latitude,centroid_longitude,date_x,rainfall,year,month,day,date_y,temperature
0,376.0,28200.0,1711.0,0x178,84.662535,22.5,71.5,22.7812,72.267197,20100101,0.0,2010,1,1,1012010,22.32
1,376.0,28200.0,1711.0,0x178,84.662535,22.5,71.5,22.7812,72.267197,20100102,0.0,2010,1,2,2012010,22.92
2,376.0,28200.0,1711.0,0x178,84.662535,22.5,71.5,22.7812,72.267197,20100103,0.0,2010,1,3,3012010,22.29
3,376.0,28200.0,1711.0,0x178,84.662535,22.5,71.5,22.7812,72.267197,20100104,0.0,2010,1,4,4012010,20.68
4,376.0,28200.0,1711.0,0x178,84.662535,22.5,71.5,22.7812,72.267197,20100105,0.0,2010,1,5,5012010,19.90
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1090,408.0,30600.0,1711.0,0x198,83.343798,23.5,72.5,22.7812,72.267197,20101227,0.0,2010,12,27,27122010,18.83
1091,408.0,30600.0,1711.0,0x198,83.343798,23.5,72.5,22.7812,72.267197,20101228,0.0,2010,12,28,28122010,19.97
1092,408.0,30600.0,1711.0,0x198,83.343798,23.5,72.5,22.7812,72.267197,20101229,0.0,2010,12,29,29122010,19.84
1093,408.0,30600.0,1711.0,0x198,83.343798,23.5,72.5,22.7812,72.267197,20101230,0.0,2010,12,30,30122010,18.88


In [16]:
def getDailyAveragesFromData(dataDf):
    pointDataDf = dataDf.copy()  # work on a copy so the caller's dataframe is not mutated
    pointDataDf['weight'] = 1./np.power(pointDataDf['km'], 2.)
    pointDataDf['weightedTemperature'] = (pointDataDf['temperature']*pointDataDf['weight'])
    pointDataDf['weightedRainfall'] = (pointDataDf['rainfall']*pointDataDf['weight'])
    
    keep_cols = ['day', 'month', 'weight', 'year', 'rainfall', 'weightedTemperature', 'weightedRainfall', 'unique_dist_id']
    districtDf = pointDataDf[keep_cols].groupby(by=['month','day', 'year', 'unique_dist_id']).sum()
    districtDf['weightedAverageTemperature'] = districtDf['weightedTemperature']/districtDf['weight']
    districtDf['weightedAverageRainfall'] = districtDf['weightedRainfall']/districtDf['weight']
    
    return districtDf.drop(columns=['weight', 'weightedTemperature', 'weightedRainfall']).reset_index()

In [17]:
dailyAveragesDf = getDailyAveragesFromData(pointDataDf)

In [18]:
dailyAveragesDf

Unnamed: 0,month,day,year,unique_dist_id,rainfall,weightedAverageTemperature,weightedAverageRainfall
0,1,1,2010,1711.0,0.0,22.102138,0.000000
1,1,2,2010,1711.0,0.0,22.924632,0.000000
2,1,3,2010,1711.0,1.3,22.266090,0.471435
3,1,4,2010,1711.0,0.0,21.029408,0.000000
4,1,5,2010,1711.0,0.0,19.833670,0.000000
...,...,...,...,...,...,...,...
360,12,27,2010,1711.0,0.0,21.186616,0.000000
361,12,28,2010,1711.0,0.0,22.180334,0.000000
362,12,29,2010,1711.0,0.0,21.982689,0.000000
363,12,30,2010,1711.0,0.0,21.048189,0.000000


In [20]:
def getDailyAverages(rain_path, temp_path, distances_path, centroid_id, threshold=100):
    
    # read from csv, using the paths passed in rather than hard-coded ones
    raw_rain_df = pd.read_csv(rain_path)
    raw_temp_df = pd.read_csv(temp_path)
    distances_df = pd.read_csv(distances_path)
    
    # keep only rows with defined key columns
    rain_df = raw_rain_df[getDefinedRows(raw_rain_df)]
    temp_df = raw_temp_df[getDefinedRows(raw_temp_df)]
    
    # merge df's
    on_columns = ['latitude', 'longitude', 'day', 'month', 'year']
    total_df = rain_df.merge(temp_df, how='outer', left_on=on_columns, right_on=on_columns)
    
    # get points
    pointDataDf = getDataFromPoints(total_df, getPointsWithinDistance(distances_df, centroid_id, threshold))
    dailyAveragesDf = getDailyAveragesFromData(pointDataDf)
    return dailyAveragesDf

In [21]:
RAIN_PATH = './data/Rainfall/rainfall_2010.csv'
TEMP_PATH = './data/Temperature/temperature_2010.csv'
DISTANCES_PATH = './distances.csv'

getDailyAverages(RAIN_PATH, TEMP_PATH, DISTANCES_PATH, 1711)

Unnamed: 0,month,day,year,unique_dist_id,rainfall,weightedAverageTemperature,weightedAverageRainfall
0,1,1,2010,1711.0,0.0,22.102138,0.000000
1,1,2,2010,1711.0,0.0,22.924632,0.000000
2,1,3,2010,1711.0,1.3,22.266090,0.471435
3,1,4,2010,1711.0,0.0,21.029408,0.000000
4,1,5,2010,1711.0,0.0,19.833670,0.000000
...,...,...,...,...,...,...,...
360,12,27,2010,1711.0,0.0,21.186616,0.000000
361,12,28,2010,1711.0,0.0,22.180334,0.000000
362,12,29,2010,1711.0,0.0,21.982689,0.000000
363,12,30,2010,1711.0,0.0,21.048189,0.000000
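`getDailyAverages` handles one district per call and re-reads the CSVs each time. For the scaling question at the top, one possible direction is to compute every district in a single merge and `groupby`. The sketch below is an illustration under that assumption: `getDailyAveragesAllDistricts` is a hypothetical name, the toy frames are made up, and the column names follow the ones used above.

```python
import pandas as pd

def getDailyAveragesAllDistricts(total_df, distances_df, threshold=100):
    # Keep only (district, grid point) pairs within the distance threshold
    nearby = distances_df.loc[
        distances_df['km'] <= threshold,
        ['unique_dist_id', 'latitude', 'longitude', 'km'],
    ]
    # Attach the daily observations of every nearby grid point
    merged = nearby.merge(total_df, how='inner', on=['latitude', 'longitude'])
    merged = merged.dropna(subset=['rainfall', 'temperature'])

    # Inverse-square weights, normalized per district and day via the groupby sum
    merged['weight'] = merged['km'] ** -2.0
    merged['weightedTemperature'] = merged['temperature'] * merged['weight']
    merged['weightedRainfall'] = merged['rainfall'] * merged['weight']

    sum_cols = ['weight', 'rainfall', 'weightedTemperature', 'weightedRainfall']
    grouped = merged.groupby(['unique_dist_id', 'year', 'month', 'day'])[sum_cols].sum()
    grouped['weightedAverageTemperature'] = grouped['weightedTemperature'] / grouped['weight']
    grouped['weightedAverageRainfall'] = grouped['weightedRainfall'] / grouped['weight']

    keep = ['rainfall', 'weightedAverageTemperature', 'weightedAverageRainfall']
    return grouped[keep].reset_index()

# Toy data: two districts, three grid points, one day
toy_distances = pd.DataFrame({
    'unique_dist_id': [1, 1, 2, 2],
    'latitude': [22.5, 22.5, 22.5, 23.5],
    'longitude': [71.5, 72.5, 72.5, 72.5],
    'km': [10.0, 20.0, 10.0, 150.0],   # the 150 km point is excluded
})
toy_total = pd.DataFrame({
    'latitude': [22.5, 22.5, 23.5],
    'longitude': [71.5, 72.5, 72.5],
    'day': [1, 1, 1], 'month': [1, 1, 1], 'year': [2010, 2010, 2010],
    'rainfall': [0.0, 5.0, 1.0],
    'temperature': [20.0, 30.0, 10.0],
})
all_districts_df = getDailyAveragesAllDistricts(toy_total, toy_distances)
print(all_districts_df)
```

Whether this fits in memory for the full 339-district, 112-year grid depends on the size of the merged frame; processing one year file at a time would be a natural way to bound it.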


In [26]:
!jupyter nbconvert --to script BrianKitano-DataCleaning.ipynb

[NbConvertApp] Converting notebook BrianKitano-DataCleaning.ipynb to script
[NbConvertApp] Writing 5745 bytes to BrianKitano-DataCleaning.py
