In [133]:
import pandas as pd
from geopy import distance
import numpy as np
import sklearn.neighbors

# Brian Kitano Coding Test

Start: 9:23 AM

## 2. Data Cleaning

Start: 9:23 AM

The central task is to collapse the grid-level dataset into a district-level dataset.

The final product of this section is a district-level daily dataset covering 2009-2013, with mean temperature, mean rainfall, and total rainfall variables.

Steps:
1. Calculate the set of points $P$ to include for each district.
2. Use the formulas to calculate the district statistic.

The actual EPIC project used 339 districts in India, 112 years of data and a 0.25°(latitude) x 0.25°(longitude) grid for rainfall. How would your code scale up with this larger dataset? Would you need any additional computing resources? Be as specific as possible.

### Creating the P

Assumptions:
- The `rainfall` CSVs are assumed to be identical in structure; this still needs to be checked.
- I'm also assuming that each 'year' file contains only one calendar year.

We want the smallest workable answer, to minimize the scope of each query, which will scale better. We also don't want our functions to rely on holding a merged data frame in memory: with 339 districts and 112 years of data, the data frames would be too large.
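
One way to avoid holding a merged multi-year frame in memory is to reduce each year file on its own and concatenate only the much smaller district-level results. The following is a minimal sketch of that idea, not the notebook's actual pipeline: `summarize_year`, `summarize_all`, and the `district` and `value` column names are hypothetical, and the toy frames stand in for per-year files.

```python
import pandas as pd

def summarize_year(year_df):
    # Hypothetical reducer: collapse one year of grid-level rows to one row
    # per (district, year, month, day). Column names are assumptions.
    keys = ['district', 'year', 'month', 'day']
    return year_df.groupby(keys, as_index=False)['value'].mean()

def summarize_all(year_frames):
    # year_frames can be any iterable of per-year DataFrames, for example a
    # generator that reads one file at a time, so only one year of
    # grid-level data needs to be in memory at once.
    parts = [summarize_year(df) for df in year_frames]
    return pd.concat(parts, ignore_index=True)

# Toy data standing in for two year files.
y2009 = pd.DataFrame({
    'district': [1, 1], 'year': [2009, 2009],
    'month': [1, 1], 'day': [1, 1], 'value': [2.0, 4.0],
})
y2010 = pd.DataFrame({
    'district': [1], 'year': [2010],
    'month': [1], 'day': [1], 'value': [10.0],
})

result = summarize_all(iter([y2009, y2010]))
```

Only the reduced per-year results are kept, so peak memory is driven by the largest single year file rather than by the full 112-year span.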

Algorithm: for each district, take a weighted average of daily mean temperature, mean rainfall, and total rainfall over all grid points within 100 km of the district's geographic center. The weights are the inverse of the squared distance from the district center.

$$
\bar t = \frac{1}{|P|} \sum_{p \in P} \frac{t_p}{(d - p)^2}
$$
$$
\bar r = \frac{1}{|P|} \sum_{p \in P} \frac{r_p}{(d - p)^2}
$$
$$
R = \sum_{p \in P} \frac{r_p}{(d - p)^2}
$$

where $P$ is the set of grid points within 100 km of the district centroid, $d$ is the district centroid, and $(d - p)$ denotes the distance between $d$ and grid point $p$.
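
As a worked illustration, the three formulas can be evaluated for a single district-day with NumPy. This is a sketch under stated assumptions: `district_stats` is a hypothetical helper, the inputs are made-up numbers, and it follows the formulas exactly as written (dividing by $|P|$, not by the sum of the weights). It does not handle a grid point at distance zero, which would divide by zero.

```python
import numpy as np

def district_stats(temps, rains, dist_km):
    # temps, rains: values at the grid points in P for one district-day.
    # dist_km: distance from each of those points to the district centroid.
    temps = np.asarray(temps, dtype=float)
    rains = np.asarray(rains, dtype=float)
    inv_sq = 1.0 / np.asarray(dist_km, dtype=float) ** 2  # 1 / (d - p)^2
    n = len(inv_sq)                                       # |P|
    t_bar = (temps * inv_sq).sum() / n
    r_bar = (rains * inv_sq).sum() / n
    r_total = (rains * inv_sq).sum()
    return t_bar, r_bar, r_total

# Two made-up grid points at 1 km and 2 km from the centroid.
t_bar, r_bar, r_total = district_stats([20.0, 30.0], [4.0, 8.0], [1.0, 2.0])
```

With weights of 1 and 0.25, the temperature sum is 20 + 7.5 = 27.5 and the rainfall sum is 4 + 2 = 6, so $\bar t = 13.75$, $\bar r = 3$, and $R = 6$.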


In [39]:
raw_rain_df = pd.read_csv('./data/Rainfall/rainfall_2010.csv')

In order to do our analysis, we need to ensure that every entry has at least a day, month, year, latitude and longitude. Let's remove any that don't.

In [40]:
def getDefinedRows(df):
    # Boolean mask: True for rows where every required field is present.
    required = ['latitude', 'longitude', 'day', 'month', 'year']
    return df[required].notna().all(axis=1)

rain_df = raw_rain_df[getDefinedRows(raw_rain_df)]

Since the haversine distance is slow to compute, we don't want to compute it separately for rainfall and for temperature, so we should merge the dataframes. In fact, we can store the haversine distances for every point-centroid pair so that they only need to be computed once. With 961 points and 339 districts, that amounts to ~325k distance calculations. If I had the time, I would try to replicate the approach outlined here: https://iliauk.wordpress.com/2016/02/16/millions-of-distances-high-performance-python/
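
For reference, the full centroid-by-point distance matrix can also be computed in one vectorized pass with NumPy broadcasting. This is a sketch of the standard haversine formula, not the method from the linked post; `haversine_matrix` is a hypothetical helper, and the default radius of 6367 km simply matches the constant used later in this notebook.

```python
import numpy as np

def haversine_matrix(lat1, lon1, lat2, lon2, radius_km=6367.0):
    # Inputs are 1-D sequences in degrees. Returns a matrix of great-circle
    # distances in km with shape (len(lat1), len(lat2)).
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Two points on the equator measured against a third point at longitude 90.
d = haversine_matrix([0.0, 0.0], [0.0, 10.0], [0.0], [90.0])
```

Broadcasting the first set of points as a column and the second as a row produces every pairwise distance without a Python loop, which is what makes ~325k calculations cheap.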

In [93]:
distances_df = pd.DataFrame()

In [121]:
geo_df = pd.read_csv('./data/Geo/district_crosswalk_small.csv')

In [122]:
points_df = rain_df[['latitude', 'longitude']].drop_duplicates(subset=['latitude', 'longitude'])

In [124]:
points_df['uid'] = [hex(i) for i in range(len(points_df))]

In [127]:
geo_df.head()

Unnamed: 0,stname_iaa,distname_iaa,stid_iaa,distid_iaa,centroid_longitude,centroid_latitude,unique_dist_id
0,gujarat,ahmedabad,17,11,72.267197,22.7812,1711
1,gujarat,amreli,17,5,71.177803,21.373199,1705
2,gujarat,banaskantha,17,8,71.939903,24.219601,1708
3,gujarat,baroda,17,14,73.5327,22.230101,1714
4,gujarat,bhavnagar,17,4,71.770798,21.5945,1704


In [128]:
# scikit-learn's haversine metric expects [latitude, longitude] in radians.
geo_df[['lat_radians', 'lon_radians']] = np.radians(geo_df.loc[:, ['centroid_latitude', 'centroid_longitude']])

In [130]:
points_df[['lat_radians', 'lon_radians']] = np.radians(points_df.loc[:, ['latitude', 'longitude']])

In [140]:
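# Note: in recent scikit-learn releases, DistanceMetric lives in sklearn.metrics.
dist = sklearn.neighbors.DistanceMetric.get_metric('haversine')

# pairwise() returns great-circle distances in radians
# (rows: districts, columns: grid points); multiplying by an
# Earth radius of 6367 km converts them to kilometres.
dist_matrix = dist.pairwise(
    geo_df[['lat_radians', 'lon_radians']],
    points_df[['lat_radians', 'lon_radians']]
) * 6367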

In [152]:
df_dist_matrix = pd.DataFrame(
    dist_matrix,
    index=geo_df['unique_dist_id'],
    columns=points_df['uid'],
)

In [156]:
df_dist_long = pd.melt(df_dist_matrix.reset_index(),id_vars='unique_dist_id')
df_dist_long = df_dist_long.rename(columns={'value':'km'})

In [170]:
distances_df = df_dist_long.merge(points_df, how='outer', on='uid').drop(columns=['lat_radians', 'lon_radians'])
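
With a long table of one row per (district, grid point) pair, the set $P$ for each district is a filter on the `km` column. A minimal sketch on toy data, assuming the same `unique_dist_id`, `uid`, and `km` column names as above; the toy values and the choice to include points at exactly 100 km are assumptions.

```python
import pandas as pd

# Toy stand-in for distances_df: one row per (district, grid point) pair.
toy = pd.DataFrame({
    'unique_dist_id': [1711, 1711, 1711, 1705],
    'uid': ['0x0', '0x1', '0x2', '0x0'],
    'km': [35.2, 99.9, 140.0, 80.5],
})

RADIUS_KM = 100  # cutoff from the algorithm description

# Keep only the pairs inside the cutoff.
within = toy[toy['km'] <= RADIUS_KM]

# Map each district id to the list of grid-point uids in its P.
points_by_district = within.groupby('unique_dist_id')['uid'].apply(list).to_dict()
```

Filtering once up front means later joins against rainfall and temperature only touch the points that actually contribute to a district.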

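Once this table has been saved to `./distances.csv` (the save step is not shown here), we can just load our distances instead of recomputing them.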

In [173]:
distances_df = pd.read_csv('./distances.csv')
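
The compute-once, load-afterwards pattern can be wrapped in a small helper. This is a sketch only: `load_or_compute` and `fake_compute` are hypothetical names, and the demo writes to a temporary directory rather than to `./distances.csv`.

```python
import os
import tempfile
import pandas as pd

def load_or_compute(path, compute_fn):
    # Hypothetical helper: reuse the cached CSV if it exists, otherwise
    # build the table with compute_fn() and save it for next time.
    if os.path.exists(path):
        return pd.read_csv(path)
    df = compute_fn()
    df.to_csv(path, index=False)
    return df

calls = []

def fake_compute():
    # Stand-in for the expensive distance calculation; records each call.
    calls.append(1)
    return pd.DataFrame({'unique_dist_id': [1711], 'uid': ['0x0'], 'km': [35.2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'distances.csv')
    first = load_or_compute(path, fake_compute)   # computes and saves
    second = load_or_compute(path, fake_compute)  # loads from disk
```

The second call reads the saved file, so the expensive step runs only once.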