In [1]:
import pandas as pd
import numpy as np

# Brian Kitano Coding Test

Start: 9:23 AM

## 2. Data Cleaning

Start: 9:23 AM

The central task is to collapse the grid-level dataset into a district-level dataset.

The final product from this section is a district-level daily dataset from 2009-2013 with temperature, rainfall, and total rainfall variables.

Steps:
1. Calculate the set of points $P$ to include for each district.
2. Use the formulas to calculate the district statistic.

The actual EPIC project used 339 districts in India, 112 years of data and a 0.25°(latitude) x 0.25°(longitude) grid for rainfall. How would your code scale up with this larger dataset? Would you need any additional computing resources? Be as specific as possible.

### Creating the set $P$

We want the smallest possible set of points for each district, to minimize the scope of each query; smaller queries will scale better.

Algorithm: for each district, take a weighted average of daily mean temperature, mean rainfall, and total rainfall over all grid points within 100 km of the district's geographic center. The weights are the inverse of the squared distance from the district center.

$$
\bar t = \frac{1}{|P|} \sum_{p \in P} \frac{t_p}{(d - p)^2}
$$
$$
\bar r = \frac{1}{|P|} \sum_{p \in P} \frac{r_p}{(d - p)^2}
$$
$$
R = \sum_{p \in P} \frac{r_p}{(d - p)^2}
$$

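where $P$ is the set of grid points within 100 km of the district centroid, $d$ is the district centroid, $(d - p)^2$ is the squared distance between $d$ and grid point $p$, and $t_p$ and $r_p$ are the temperature and rainfall recorded at $p$.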
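
As a rough illustration, the three formulas above can be written directly in pandas. This is a minimal sketch that follows the formulas exactly as stated (dividing by $|P|$). The function name `district_stats` and the toy values are hypothetical, and the column names `km`, `temperature`, and `rainfall` are assumed to describe one district on one day.

```python
import pandas as pd

def district_stats(points: pd.DataFrame) -> pd.Series:
    """Apply the three formulas to the grid points in P for one district-day.

    Assumes columns `km` (distance to the centroid), `temperature`, `rainfall`,
    and that rows with missing values were dropped beforehand.
    """
    weights = 1.0 / points['km'] ** 2          # inverse squared distance
    n_points = len(points)                      # |P|
    weighted_rain = (weights * points['rainfall']).sum()
    return pd.Series({
        'mean_temp': (weights * points['temperature']).sum() / n_points,
        'mean_rain': weighted_rain / n_points,
        'total_rain': weighted_rain,
    })

# Hypothetical toy input: two grid points at 1 km and 2 km from the centroid.
example = pd.DataFrame({
    'km': [1.0, 2.0],
    'temperature': [20.0, 40.0],
    'rainfall': [4.0, 8.0],
})
stats = district_stats(example)
```

If a normalized weighted mean is intended instead, the divisor would be the sum of the weights rather than $|P|$.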


Assumptions:
- Are the `rainfall` CSVs identical?

In [2]:
raw_rain_df = pd.read_csv('./data/Rainfall/rainfall_2010.csv')
raw_temp_df = pd.read_csv('./data/Temperature/temperature_2010.csv')


In order to do our analysis, we need to ensure that every entry has at least a day, month, year, latitude and longitude. Let's remove any that don't.

In [3]:

def getDefinedRows(df):
    # Boolean mask: True for rows where every required column is present
    required_columns = ['latitude', 'longitude', 'day', 'month', 'year']
    rows_with_data = df[required_columns].notna().all(axis=1)

    return rows_with_data

rain_df = raw_rain_df[getDefinedRows(raw_rain_df)]
temp_df = raw_temp_df[getDefinedRows(raw_temp_df)]
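
The same filter can also be expressed with `DataFrame.dropna(subset=...)`. The sketch below uses a made-up toy frame (not the real data) to show that only the required columns are checked, so a missing `rainfall` value does not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

REQUIRED = ['latitude', 'longitude', 'day', 'month', 'year']

# Hypothetical toy data: row 1 lacks a latitude, row 2 lacks a day.
toy = pd.DataFrame({
    'latitude': [22.5, np.nan, 23.5],
    'longitude': [71.5, 72.5, 72.5],
    'day': [1, 1, np.nan],
    'month': [1, 1, 1],
    'year': [2010, 2010, 2010],
    'rainfall': [np.nan, 0.3, 1.2],
})

# Drop rows missing any required column; other columns may still be NaN.
clean = toy.dropna(subset=REQUIRED)
```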

I'm also assuming that in any 'year' file, there is actually only one calendar year represented.

To join the two dataframes we need a key that is identical in both. The `date` column is formatted differently in each dataframe, but `day`, `month`, and `year` match, so together with `latitude` and `longitude` they serve as the join key.

In [4]:
on_columns = ['latitude', 'longitude', 'day', 'month', 'year']
# the key columns have the same names in both frames, so `on=` is enough
total_df = rain_df.merge(temp_df, how='outer', on=on_columns)

In [5]:
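# sanity check: the outer merge should not add rows to either input
# (use `and`, not `&`: `&` binds tighter than `==` and would chain the comparison)
len(total_df) == len(rain_df) and len(total_df) == len(temp_df)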

True
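
A stricter version of this check is possible with two options that `DataFrame.merge` provides: `validate='one_to_one'` raises if a key is duplicated on either side, and `indicator=True` adds a `_merge` column recording where each row came from. The toy frames below are invented for illustration. Note that matching on float `latitude`/`longitude` relies on the values being exactly equal in both files.

```python
import pandas as pd

keys = ['latitude', 'longitude', 'day', 'month', 'year']

# Hypothetical toy data: one grid point is shared, one is unique to each side.
toy_rain = pd.DataFrame({
    'latitude': [22.5, 23.5], 'longitude': [72.5, 72.5],
    'day': [1, 1], 'month': [1, 1], 'year': [2010, 2010],
    'rainfall': [0.0, 1.2],
})
toy_temp = pd.DataFrame({
    'latitude': [22.5, 24.5], 'longitude': [72.5, 72.5],
    'day': [1, 1], 'month': [1, 1], 'year': [2010, 2010],
    'temperature': [18.0, 17.0],
})

merged = toy_rain.merge(
    toy_temp, how='outer', on=keys,
    validate='one_to_one',   # raises MergeError on duplicate keys
    indicator=True,          # adds `_merge`: left_only / right_only / both
)
counts = merged['_merge'].value_counts()
```

Any `left_only` or `right_only` rows indicate grid points or days present in only one of the two files.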

In [6]:
# count the non-missing observations for each variable
rainfall_values = total_df['rainfall'].notna().sum()
print(f'rainfall values: {rainfall_values}')

temp_values = total_df['temperature'].notna().sum()
print(f'temp values: {temp_values}')

rainfall values: 112785
temp values: 128471


In [7]:
distances_df = pd.read_csv('./distances.csv')
points_df = pd.read_csv('./points.csv')

In [8]:
def getPointsWithinDistance(centroid_id, threshold):
    # use the arguments rather than hard-coded values
    is_close = distances_df['km'] <= threshold
    is_district = distances_df['unique_dist_id'] == centroid_id
    return distances_df[is_close & is_district]
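
The helper above selects $P$ for a single district. One possible way to build $P$ for every district in a single pass is to filter once and group, sketched below. The function name `points_by_district` and the toy frame are hypothetical; only the column names `unique_dist_id`, `uid`, and `km` come from `distances.csv`.

```python
import pandas as pd

# Toy stand-in for distances.csv; the values are illustrative only.
toy_distances = pd.DataFrame({
    'unique_dist_id': [1, 1, 1, 2, 2],
    'uid': ['a', 'b', 'c', 'a', 'd'],
    'km': [40.0, 99.0, 150.0, 120.0, 10.0],
})

def points_by_district(distances: pd.DataFrame, threshold_km: float) -> dict:
    """Map each district id to the list of grid-point uids within threshold_km."""
    within = distances[distances['km'] <= threshold_km]
    return within.groupby('unique_dist_id')['uid'].apply(list).to_dict()

p_sets = points_by_district(toy_distances, 100)
```

Filtering before grouping keeps the grouped frame small, which matters more as the number of districts and grid points grows.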

In [9]:
getPointsWithinDistance(1711, 100)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,unique_dist_id,uid,km,latitude,longitude,centroid_latitude,centroid_longitude
376,376,28200,1711,0x178,84.662535,22.5,71.5,22.7812,72.267197
377,377,28275,1711,0x179,39.326242,22.5,72.5,22.7812,72.267197
408,408,30600,1711,0x198,83.343798,23.5,72.5,22.7812,72.267197
