## Feature Extraction Pipeline (working)
This script processes the summer Data+ team's figshare imagery for feature extraction.
In particular, it isolates the VIIRS band (#48) and calculates multiple statistics from it.
These statistics become the feature columns of a new dataframe, which is the input to basic ML models
that predict electrification rates in Bihar, India.
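
As a rough illustration of the per-village statistics described above, the sketch below masks one band of a multi-band image and summarizes the pixels inside the mask. It is not the code in utils.py: the function name `masked_band_stats`, the channels-last array layout, and the mapping of band #48 to zero-based index 47 are all assumptions, and only a few of the statistics are computed.

```python
import numpy as np

def masked_band_stats(image, mask, band_index):
    """Summarize one band over the pixels where mask is nonzero (hypothetical helper)."""
    band = image[..., band_index].astype(float)   # assumes shape (rows, cols, bands)
    values = band[mask.astype(bool)]              # keep only pixels inside the mask
    if values.size == 0:
        return None                               # empty mask: nothing to summarize
    return {
        "min": float(values.min()),
        "10th_percentile": float(np.percentile(values, 10)),
        "median": float(np.median(values)),
        "90th_percentile": float(np.percentile(values, 90)),
        "max": float(values.max()),
        "mean": float(values.mean()),
        "st_dev": float(values.std()),
        "sum": float(values.sum()),
    }

# Toy 4x4 image with 48 bands; index 47 stands in for the VIIRS band (assumption).
rng = np.random.default_rng(0)
image = rng.random((4, 4, 48))
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1                                # a 2x2 "village" footprint
stats = masked_band_stats(image, mask, band_index=47)
print(stats)
```

Returning `None` for an empty mask is one way to flag villages that cannot be summarized; the real pipeline may handle such cases differently.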

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import csv
import math
from skimage import io
import re

# Import the feature list from feature_list.py, which must be in the same directory as this .ipynb file.
from feature_list import feature_labels
# Import the helper functions from utils.py (also a Python file).
from utils import *
confirm_utils()

'Utilities library successfully loaded.'

In [2]:
# Load GARV Data, preprocess, into DataFrame
GARV_DATA_PATH = "./indian_village_dataset/garv_data_bihar.csv"
garv_data = preprocess_garv(GARV_DATA_PATH)
garv_data.to_csv('processed_garv.csv')

In [3]:
garv_data = pd.read_csv('processed_garv.csv')
del garv_data['Unnamed: 0']
garv_data.head()

Unnamed: 0,Census 2011 ID,Village Name,District Name,State Name,Number of Households,Number of Electrified Households,Percentage Electrified
0,215989,Kalapani,Pashchim Champaran,Bihar,445.0,42.0,9.438202
1,215991,Tharhi,Pashchim Champaran,Bihar,339.0,214.0,63.126844
2,215992,Pipra,Pashchim Champaran,Bihar,107.0,59.0,55.140187
3,215993,Kotaraha,Pashchim Champaran,Bihar,128.0,64.0,50.0
4,215995,Lachhmipur,Pashchim Champaran,Bihar,615.0,449.0,73.00813


In [4]:
print('Loaded {} feature labels to extract for each village. To add more, edit feature_list.py.'.format(len(feature_labels)))

Loaded 338 feature labels to extract for each village. To add more, edit feature_list.py.


In [5]:
"""
This first code block takes the VIIRS imagery and binary masks as inputs.
Output is a csv (in same directory) with each village's extracted features.
"""
VIIRS_IMAGE_PATH = "./indian_village_dataset/imagery_res30_48bands/"
MASK_IMAGE_PATH = "./indian_village_dataset/masks_res30/"
csv_name = "output_1.csv"
debug = False 

create_csv(feature_labels, VIIRS_IMAGE_PATH, MASK_IMAGE_PATH, csv_name, debug)

Debug = False: Running for all 45220 villages.
Initialized file reading.
0 of 45220 image files read.


4522 of 45220 image files read.
9044 of 45220 image files read.
13566 of 45220 image files read.
18088 of 45220 image files read.
22610 of 45220 image files read.
27132 of 45220 image files read.
31654 of 45220 image files read.
36176 of 45220 image files read.
40698 of 45220 image files read.
Number of invalid images: 16, number of invalid IDs: 296


'Finished writing CSV file output_1.csv.'

In [6]:
# Sanity Check
read_in = pd.read_csv(csv_name) 
read_in.head()

Unnamed: 0.1,Unnamed: 0,min,10th_percentile,median,90th_percentile,max,mean,st_dev,sum,area,...,rain_nov_sum,rain_nov_10th,rain_nov_90th,rain_dec_max,rain_dec_mean,rain_dec_std,rain_dec_median,rain_dec_sum,rain_dec_10th,rain_dec_90th
0,227021,0.844861,0.844861,1.384335,1.480145,1.480145,1.310248,0.240218,133.645309,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,239084,0.543411,0.599819,0.849988,1.041997,1.159564,0.819896,0.185459,1784.912842,41.0,...,334.822571,0.1538,0.1538,0.5212,0.5212,5.960464e-08,0.5212,1134.652344,0.5212,0.5212
2,217323,0.474838,0.599013,0.681873,0.798177,0.798177,0.706367,0.077902,838.457642,26.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,216061,0.139219,0.16981,0.208386,0.281756,0.4456,0.218269,0.054169,778.130371,82.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,250270,0.317444,0.397299,0.542575,0.548436,0.548436,0.502933,0.072423,214.752197,33.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
"""
This code block loads the csv of village features extracted by the previous code block
and cleans up the resulting dataframe so it can be merged with the GARV dataframe loaded earlier.
"""
# dataframe setup for village features extracted
df_features = pd.read_csv(csv_name, skip_blank_lines=True).dropna(axis=0, how='all')
df_features = df_features.rename(index=str, columns={"Unnamed: 0": "Census 2011 ID"})
df_features['Census 2011 ID'] = df_features['Census 2011 ID'].astype(str)
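# Drop every row whose index label appears more than once
df_features = df_features[~df_features.index.duplicated(keep=False)]
# Convert columns to numeric dtypes where possible (this turns the ID back into an integer)
df_features = df_features.apply(pd.to_numeric, errors="ignore")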

df_features.head()

Unnamed: 0,Census 2011 ID,min,10th_percentile,median,90th_percentile,max,mean,st_dev,sum,area,...,rain_nov_sum,rain_nov_10th,rain_nov_90th,rain_dec_max,rain_dec_mean,rain_dec_std,rain_dec_median,rain_dec_sum,rain_dec_10th,rain_dec_90th
0,227021,0.844861,0.844861,1.384335,1.480145,1.480145,1.310248,0.240218,133.645309,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,239084,0.543411,0.599819,0.849988,1.041997,1.159564,0.819896,0.185459,1784.912842,41.0,...,334.822571,0.1538,0.1538,0.5212,0.5212,5.960464e-08,0.5212,1134.652344,0.5212,0.5212
2,217323,0.474838,0.599013,0.681873,0.798177,0.798177,0.706367,0.077902,838.457642,26.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,216061,0.139219,0.16981,0.208386,0.281756,0.4456,0.218269,0.054169,778.130371,82.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,250270,0.317444,0.397299,0.542575,0.548436,0.548436,0.502933,0.072423,214.752197,33.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
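
The cleanup step filters on the index, which only catches repeated row labels. If the goal is one row per village, deduplicating on the ID column itself may be closer to the intent. The sketch below shows that variant on toy data; the repeated row is hypothetical, and `keep="first"` is a choice made here (the cell above uses `keep=False`, which discards every copy of a duplicate).

```python
import pandas as pd

# Toy frame shaped like the extracted-features CSV; the repeated ID is made up.
features = pd.DataFrame({
    "Unnamed: 0": ["227021", "239084", "239084", "217323"],
    "mean": [1.310248, 0.819896, 0.819896, 0.706367],
})

features = features.rename(columns={"Unnamed: 0": "Census 2011 ID"})
# Make the key numeric so it has the same dtype as an integer ID column elsewhere.
features["Census 2011 ID"] = pd.to_numeric(features["Census 2011 ID"])
# Keep one row per village ID.
deduped = (
    features.drop_duplicates(subset="Census 2011 ID", keep="first")
    .reset_index(drop=True)
)
print(deduped)
```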


In [25]:
df_features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44908 entries, 0 to 44907
Columns: 339 entries, Census 2011 ID to rain_dec_90th
dtypes: float64(338), int64(1)
memory usage: 117.7+ MB
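
The row count above appears consistent with the totals printed by create_csv, assuming the 16 invalid images and the 296 invalid IDs are disjoint and are both left out of the CSV. That is an inference from the logs, not something the notebook states.

```python
# Counts copied from the create_csv log output.
total_villages = 45220
invalid_images = 16
invalid_ids = 296

# Assumption: both kinds of invalid entries are skipped and do not overlap.
expected_rows = total_villages - invalid_images - invalid_ids
print(expected_rows)  # 44908, matching the entries reported by df_features.info()
```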


In [23]:
"""
Merge the village features onto the GARV dataframe, matching on the Census 2011 ID column.
pd.merge defaults to an inner join, so only villages whose ID appears in both dataframes are kept.
Any row in either dataframe without a match in the other is dropped.

The result is exported to a csv in a later cell.
"""
df_merged = pd.merge(left=garv_data, right=df_features, on='Census 2011 ID')
df_merged.head()

Unnamed: 0,Census 2011 ID,Village Name,District Name,State Name,Number of Households,Number of Electrified Households,Percentage Electrified,min,10th_percentile,median,...,rain_nov_sum,rain_nov_10th,rain_nov_90th,rain_dec_max,rain_dec_mean,rain_dec_std,rain_dec_median,rain_dec_sum,rain_dec_10th,rain_dec_90th
0,215989,Kalapani,Pashchim Champaran,Bihar,445.0,42.0,9.438202,0.203942,0.265501,0.324645,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,215991,Tharhi,Pashchim Champaran,Bihar,339.0,214.0,63.126844,0.086101,0.172867,0.260938,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,215992,Pipra,Pashchim Champaran,Bihar,107.0,59.0,55.140187,0.370665,0.425144,0.872517,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,215993,Kotaraha,Pashchim Champaran,Bihar,128.0,64.0,50.0,0.22584,0.274397,0.424527,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,215995,Lachhmipur,Pashchim Champaran,Bihar,615.0,449.0,73.00813,0.211082,0.281324,0.339738,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
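
To see how many rows a merge like this drops from each side, one option is an outer join with `indicator=True`. The sketch below uses toy frames; the ID 999999 and its value are hypothetical, and the other values are borrowed from the tables above purely for illustration.

```python
import pandas as pd

garv = pd.DataFrame({
    "Census 2011 ID": [215989, 215991, 215992],
    "Percentage Electrified": [9.438202, 63.126844, 55.140187],
})
feats = pd.DataFrame({
    "Census 2011 ID": [215989, 215992, 999999],   # 999999 is a made-up unmatched ID
    "min": [0.203942, 0.370665, 0.5],
})

# Same call shape as the notebook: the default how="inner" keeps only matched IDs.
inner = pd.merge(left=garv, right=feats, on="Census 2011 ID")

# An outer join with indicator=True labels each row by where it came from.
audit = pd.merge(garv, feats, on="Census 2011 ID", how="outer", indicator=True)
n_both = int((audit["_merge"] == "both").sum())
n_garv_only = int((audit["_merge"] == "left_only").sum())
n_feats_only = int((audit["_merge"] == "right_only").sum())
print(len(inner), n_both, n_garv_only, n_feats_only)
```

If every GARV village should be retained even without features, `how="left"` would do that, at the cost of NaN feature values for unmatched rows.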


In [26]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32822 entries, 0 to 32821
Columns: 345 entries, Census 2011 ID to rain_dec_90th
dtypes: float64(341), int64(1), object(3)
memory usage: 86.6+ MB


In [9]:
"""
Export the results of the previous merge to a new csv
"""
df_merged.to_csv('new_try.csv')
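
Writing with the default `index=True` is what produces the `Unnamed: 0` column when a CSV is read back, as seen earlier in this notebook. A minimal sketch of the difference, using in-memory buffers instead of files (the toy frame is illustrative only):

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Census 2011 ID": [215989, 215991],
    "Percentage Electrified": [9.438202, 63.126844],
})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index)

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```

Passing `index=False` when exporting would remove the need to delete or rename `Unnamed: 0` after reading the file back.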