## Energy Access Team Feature Extraction and Modeling

The Immediate Problem  
The Indian government's GARV dataset reports electrification statistics as numbers of unelectrified households, which is useful as a baseline but incomplete. We aim to complete and enhance this data with a simple machine learning model.  
Our first goal is to use Python/Jupyter to associate each village in the dataset with its VIIRS lights-at-night image, and then train a linear regression that uses the features compiled in the dataframe below to predict the target variable, "Percentage Electrified". If that proves difficult, a binary 0/1 target could be predicted with logistic regression instead.

For each village, there are a number of features that could be used in the regression analysis to predict electrification. The VIIRS image has to be reduced to a numerical summary, such as an average, before it can be used in a regression, and the current electrification statistics are only available for some villages. For now, we can simply attempt a linear regression of the form  

$y = f(VIIRS)$

This would be a good starting point because it would let us predict electrification percentages for villages that have VIIRS data but no household data.
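As a minimal sketch of the $y = f(VIIRS)$ idea, the cell below fits a straight line with `np.polyfit` and uses it to predict new values. The numbers are toy values chosen for illustration and are not taken from the GARV or VIIRS data; the names `mean_light`, `pct_electrified`, and `predict_pct` are hypothetical.

```python
import numpy as np

# Toy values for illustration only; NOT taken from the GARV or VIIRS data.
mean_light = np.array([0.1, 0.3, 0.5, 0.7, 0.9])            # feature: mean night-light value
pct_electrified = np.array([12.0, 28.0, 51.0, 69.0, 88.0])  # target: percentage electrified

# Least-squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(mean_light, pct_electrified, deg=1)

def predict_pct(x):
    """Predict percentage electrified, clipped to the valid 0-100 range."""
    return np.clip(slope * np.asarray(x, dtype=float) + intercept, 0.0, 100.0)

# Predict for villages that have a night-light value but no household data.
predictions = predict_pct([0.2, 0.6])
```

Clipping to 0-100 is a design choice for this sketch: an unconstrained line can otherwise predict percentages outside the valid range.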

This is the central notebook in which the data is loaded and visualized; it will be passed on to the modeling group.

In [41]:
import pandas as pd
import numpy as np

Data: 
1. Bihar CSV data of electrified and unelectrified households 
2. 30 m resolution, 48-band VIIRS imagery 

In [75]:
df = pd.read_csv('indian_village_dataset/garv_data_bihar.csv')
df = df.replace(-9, np.nan)  # replace -9 (unknown values) with NaN
# create a 'Percentage Electrified' column as a simple percentage -> this can be our non-binary target variable
df['Percentage Electrified'] = (df['Number of Electrified Households'] / df['Number of Households']) * 100
df

Unnamed: 0,Census 2011 ID,Village Name,District Name,State Name,Number of Households,Number of Electrified Households,Percentage Electrified
0,215990.0,Bhaisalotan,Pashchim Champaran,Bihar,,,
1,215989.0,Kalapani,Pashchim Champaran,Bihar,445.0,42.0,9.438202
2,215991.0,Tharhi,Pashchim Champaran,Bihar,339.0,214.0,63.126844
3,216180.0,Naurangia,Pashchim Champaran,Bihar,,,
4,216179.0,Gardi,Pashchim Champaran,Bihar,,,
5,215992.0,Pipra,Pashchim Champaran,Bihar,107.0,59.0,55.140187
6,216186.0,Dhayar,Pashchim Champaran,Bihar,,,
7,216181.0,Majurha,Pashchim Champaran,Bihar,,,
8,216187.0,Betahni,Pashchim Champaran,Bihar,,,
9,215993.0,Kotaraha,Pashchim Champaran,Bihar,128.0,64.0,50.000000


In [76]:
print("Array of Features for a Single Village")
# this row is the feature array for one village; more relevant later when we have more than one feature
df.iloc[1]

Array of Features for a Single Village


Census 2011 ID                                  215989
Village Name                                  Kalapani
District Name                       Pashchim Champaran
State Name                                       Bihar
Number of Households                               445
Number of Electrified Households                    42
Percentage Electrified                          9.4382
Name: 1, dtype: object

In [113]:
# Regression can be implemented here, based on the mean of the array holding the 48th band of the
# VIIRS data (the lights-at-night image).

## Loading VIIRS Data
We used a Python script, 'load_viirs.py', to reduce the numpy array representing the 48th band of each village's VIIRS image to a single number that can be used in linear regression. If changes need to be made to the script, it is in 'Z:\data\energyaccessbc', like this notebook and the other files and data. The resulting VIIRS values are stored in a CSV file called 'output.csv' in the same directory, which is loaded into a pandas dataframe in the next cell.
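As a rough illustration of that reduction, here is one way a band could be collapsed to a masked mean. This is not the contents of 'load_viirs.py'; the function name `masked_mean` and the `nodata` value of -9999.0 are assumptions made for the example.

```python
import numpy as np

def masked_mean(band, nodata=-9999.0):
    """Mean of the valid pixels in a 2-D band; NaN if no pixel is valid.

    `nodata` is a hypothetical fill value, not one confirmed for this imagery.
    """
    band = np.asarray(band, dtype=float)
    # A pixel is valid if it is finite and not equal to the no-data value.
    valid = np.isfinite(band) & (band != nodata)
    masked = np.ma.masked_array(band, mask=~valid)
    if masked.count() == 0:
        return float("nan")
    return float(masked.mean())

# Small made-up tile: two valid pixels, one fill value, one NaN.
tile = np.array([[0.5, 1.5], [-9999.0, np.nan]])
tile_mean = masked_mean(tile)
```

Returning NaN for a fully masked tile keeps the village in the output while marking its value as missing, which pandas then treats like any other missing entry.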

In [107]:
viirs = pd.read_csv('output.csv')
viirs.columns = ['Village Name', 'Average Lights at Night']
viirs
# dataframe with the averages of the lights at night data (np.mean of the masked matrix)

Unnamed: 0,Village Name,Average Lights at Night
0,Panapur Firoz,6.980392e-01
1,Ranna,2.023970e+14
2,Ratanpur Kewal,9.985774e-01
3,Bhabhuar,9.396663e-01
4,Shankarpur Doulat,7.211694e-01
5,Ghoghraha,4.068781e-01
6,Tola Singhna,1.319042e-01
7,Dahad,7.802587e-01
8,Manaitha,3.209889e-01
9,Rahman Chak,3.460858e+15


In [108]:
# Keep only the villages for which we have VIIRS data (28,500 of the 45,000)
viirs_names = set(viirs['Village Name'].values)  # set lookup is much faster than scanning the array each time
c = [str(x).strip() for x in df['Village Name'].values if str(x).strip() in viirs_names]

In [112]:
c = set(c)
print (len(c))
# We need to create a separate dataframe that includes only the villages for which we have lights-at-night data.
# Note: while there are indeed 45,000 tiff images in the folder, it seems some may have been filtered out by error handling
# in our Python script, and there were duplicate VIIRS images for some villages.

28469


## Next Steps
The modeling team can add the VIIRS data as another column on the existing 'df' dataframe or build a new dataframe for regression analysis. Because there are more villages than VIIRS records, a new 'clean' dataframe seems ideal: one that includes only the villages for which we have both GARV data (not NaN) and VIIRS data. Training will associate each lights-at-night mean with an electrification percentage, and testing will use k-fold cross-validation, in which part of the labeled data is withheld from training and used for evaluation instead.
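
The two steps described here could be sketched as follows. This is an assumption-laden outline rather than the modeling team's method: `build_clean_frame` and `kfold_rmse` are hypothetical helpers, and the join uses 'Village Name' alone because that is the only key in 'output.csv'. Village names may repeat across districts, so some matches could be wrong unless a unique identifier such as 'Census 2011 ID' is carried through.

```python
import numpy as np
import pandas as pd

def build_clean_frame(df, viirs):
    """Inner-join GARV rows with VIIRS means and drop rows missing either value."""
    left = df.assign(**{"Village Name": df["Village Name"].astype(str).str.strip()})
    right = viirs.assign(**{"Village Name": viirs["Village Name"].astype(str).str.strip()})
    # Duplicate VIIRS rows for the same village name are collapsed to the first entry.
    right = right.drop_duplicates(subset="Village Name")
    merged = left.merge(right, on="Village Name", how="inner")
    return merged.dropna(subset=["Percentage Electrified", "Average Lights at Night"])

def kfold_rmse(x, y, k=5, seed=0):
    """Mean RMSE of a one-feature least-squares line over k shuffled folds."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(x))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)  # every index not in the held-out fold
        slope, intercept = np.polyfit(x[train], y[train], deg=1)
        pred = slope * x[fold] + intercept
        scores.append(np.sqrt(np.mean((pred - y[fold]) ** 2)))
    return float(np.mean(scores))
```

With the notebook's dataframes, usage would look like `clean = build_clean_frame(df, viirs)` followed by `kfold_rmse(clean['Average Lights at Night'], clean['Percentage Electrified'])`.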