Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data & Code] Predicting poverty and wealth from mobile phone metadata #11

Open
chengjun opened this issue Aug 11, 2020 · 4 comments
Open

Comments

@chengjun
Copy link
Member

chengjun commented Aug 11, 2020

Predicting poverty and wealth from mobile phone metadata

 
Joshua Blumenstock, Gabriel Cadamuro, Robert On
 
image

Mobile phone data were supplied by an anonymous service provider in Rwanda and are not available for distribution.

All other data and code, including all intermediate data needed to replicate these results and apply these methods in other contexts, are available through the Inter-university Consortium for Political and Social Research (http://doi.org/10.3886/E50592V2). https://www.openicpsr.org/openicpsr/project/100144/version/V5/view

INTRODUCTION

The code and data required to replicate the findings in the paper are organized into three top-level directories:
/modeling: Used to construct features from call records, and to perform supervised learning on the phone survey sample
/validation: Used to aggregate and spatially join phone and DHS data
/analysis: Used to construct most figures and tables

SYSTEM REQUIREMENTS
* Python running sklearn 0.16.1 and graphlab 1.6.1
* JVM
* Spark
* SBT
* Edit paths in run_local.sh
* Additional details on individual files, including dependencies, is provided in this readme for convenience:

CODE OVERVIEW
Note: ICPSR does not allow certain file extensions, these are automatically replaced by .txt and should be changed back

/analysis
/analysis/fig1.R

  • used to construct figure 1 scatter plot and ROC curves
  • requires: /modeling/data/train/PhoneSurveyFeats.csv
    /modeling/data/train/ROC.csv

/analysis/fig3.R

  • produces Figure 3 and SM Figure 6
  • requires: /validation/data/out/districts2010.csv
    /validation/data/out/clusters2010.csv

/analysis/figS2.R

  • produces SM Figure 2
  • requires: /modeling/data/train/sampleSize.csv

/analysis/figS3.R

  • produces SM Figure 3
  • requires: /modeling/data/train/feature_r2.csv
  • requires: /analysis/vioPlot.R

/analysis/figS8.R

  • produces SM Figure 8
  • requires: /validation/data/out/districts2010.csv
    /validation/data/out/districts2007.csv
    /analysis/rwmap/District_Boundary_2012/

/analysis/figS4/code/figS4.R

  • produces SM Figure 4
  • requires: /modeling/data/train/feature_r2.csv
    /analysis/figS4/data/in/trainFeats.csv
    /analysis/figS4/data/in/trainObj.csv

/modeling
Note: all python files in /modeling/code/scripts require a link to a config file as the first command line argument

/modeling/code/scripts/PhoneSurveyPreProc.py

  • Produces the composite wealth index using PCA
  • requires /modeling/data/misc/districtWeights.csv
    survey population weights
  • output
    /modeling/data/preprocessing/survey_weights.csv
    /modeling/data/preprocessing/survey_means.csv
    /modeling/data/preprocessing/survey_stds.csv
    PCA vectors, used to construct projections later

/modeling/code/scripts/FeatureGenerator.py
Performs feature engineering on original CDR files using graphlab. DFA structure specified in config file

  • requires /modeling/data/cdr/domestic_cdr.csv
    /modeling/data/cdr/sms_cdr.csv
    /modeling/data/cdr/intl_cdr.csv
    Original call detail records (redacted)
    /modeling/data/cdr/allIds.csv
    List of all ids appearing in any CDR (redacted)
    /modeling/code/libs/AggregateFeatureLib.py
    /modeling/code/libs/FeatureLib.py
    /modeling/code/libs/LocalNetworkLib.py
  • output
    /modeling/data/train/PhoneSurveyFeats.csv
    contains survey responses and CDR features for all survey respondents, including actual and predicted wealth composite (redacted).

/modeling/data/train/PhoneSurveyFeatsHash.csv

same as above, but includes original data and masks the call metrics for each subscriber.

Can be used for replication of e.g. Fig 1.

			/modeling/data/features/finalFeatures.csv
				CDR Features for all subscribers in Rwanda (redacted)

/modeling/code/scripts/TrainModel.py
Supervised learning models fit using cross-validation on training set, and out-of-sample predictions generated. Utilizes CLI’s to generate additional analysis files used to construct figures (see next to each output file for the command line required)

  • requires /modeling/data/train/PhoneSurveyFeats.csv
    /modeling/data/train/PhoneSurveyFeatsHash.csv
    /modeling/data/features/finalFeatures.csv
    /modeling/data/aux/districtWeights.csv
    /modeling/code/libs/TrainLib.py
    /modeling/code/libs/FeatureNamerLib.py
  • output
    /modeling/data/train/allPredictions.csv
    Predicted values of wealth and assets for all mobile subscribers
    /modeling/data/train/_roc.csv [with arg = “roc”]
    Data used to generate ROC curves, based on predictions
    /modeling/data/train/feature_r2.csv [with arg = “singler2”]
    Bivariate R2 values from a regression of wealth index on each individual CDR feature. Reports AUC values for binary assets
    /modeling/data/train/sampleSize.csv [with arg = “samplesize”]
    Model performance as a function of sample size

/validation/
/validation/code/run_joint.sh
Joins CDR predictions with DHS data to validate predictions. Script first reads in asset and wealth predictions from CDR data and DHS data. Then reads location data from CDR towers and DHS cluster locations. Performs spatial join of wealth imputed from CDR locations to wealth recorded in DHS locations. Aggregates to district and cluster level

  • requires: data/in/DhsRwandaClusters0708.csv - 2007 DHS Cluster Data
    data/in/DhsRwandaClusters10.csv - 2010 DHS Cluster Data
    data/in/DhsRwandaHH0708.tsv - 2007 DHS Household Survey Data
    data/in/DhsRwandaHH10.tsv - 2010 DHS Household Survey Data
    Note that Dhs* data is available at dhsprogram.com, these files only contain the header
    data/in/Towers.csv - Cell tower lat/long coordinates (redacted)
    data/in/*_HourlyCOG.csv - Approximate location of each subscriber (redacted)
    modeling/data/train/allPredictions.csv - see above
  • output
    /validation/data/out/districts2010.csv
    /validation/data/out/clusters2010.csv
    /validation/data/out/districts2007.csv
    /validation/data/out/clusters2007.csv
@chengjun
Copy link
Member Author

@chengjun
Copy link
Member Author

image

Train and predict using the same dataset!

@chengjun
Copy link
Member Author

# Script inputs:
# 1) Training file (implict from config)
# 2) Which objectives we wish to train (must have mapping in config file)
# 3) Any options : e.g tree based, specific features
# 4) It will generate both a model as well as a score file telling you how the model does


# import graphlab as gl
# import graphlab.aggregate as agg
import csv
#import ConfigParser
import sys
import datetime
import time
import math
import datetime
import numpy as np
import os.path
import code
from scipy.stats import mode
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn import linear_model
from sklearn.model_selections import cross_validation
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import PredefinedSplit
from sklearn import metrics
from sklearn.svm import SVR
from sklearn import svm
from sklearn import tree
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_regression
from sklearn import preprocessing
from sklearn.feature_selection import RFE

sys.path.append('../libs/')
import ConfigLoader
import TrainingLib
import FeatureNamerLib
#config section
configFile = sys.argv[1]
cf = ConfigLoader.setupGraphLabEnv(configFile)

train_dir = cf.get("Filepath","data_abs_home")+"/"+cf.get("Filepath","train_rel_home") +"/"
train_file = train_dir + cf.get("Filepath","train_file_trainingset")
features_dir = cf.get("Filepath","data_abs_home")+"/"+cf.get("Filepath","features_rel_home") +"/"
#features_dir = features_dir+"/CDRV"+str(cf.getint("RawCDR","version"))+"_FeaturesV"+str(cf.getint("Features","version"))+"/"
features_file = features_dir +cf.get("Filepath","features_file_final")

districtWeights_file = cf.get("Filepath","data_abs_home")+"/"+cf.get("Filepath","aux_rel_home") +"/" + cf.get("Filepath","aux_file_districtWeights")

trainFeatures_excludeList = cf.get("Train","train_excludeprefix_list").split(",")
trainFeatures_exclusiveList = cf.get("Train","train_exclusiveFeatures_list").split(",")
if len(trainFeatures_exclusiveList) == 1 and trainFeatures_exclusiveList[0] =='':
    trainFeatures_exclusiveList =[]

if len(trainFeatures_excludeList) == 1 and trainFeatures_excludeList[0] =='':
    trainFeatures_excludeList =[]

train_runId = cf.get("Train","train_experimentId")
trainObj_regList = cf.get("Train","train_objreg_list").split(",")
trainObj_boolList = cf.get("Train","train_objbool_list").split(",")
train_option_windsorFeats = cf.getint("Train","train_featswindsor_int")
train_option_doLog = cf.getboolean("Train","train_logFeats_bool")
train_option_doQuad = cf.getboolean("Train","train_quadFeats_bool")

train_idCol = "Id"
run_anon = True #This is for the case where we don't have the ids (for privacy reaons)
if run_anon:
    train_idCol = trainObj_regList[0]

#now read in modes to run in
analysis_singler2= "singler2" in sys.argv
analysis_sampleSize = "samplesize" in sys.argv
analysis_roc = "roc" in sys.argv
do_predict = "predict" in sys.argv

train_min_file = train_dir +"trainingSet_"+train_runId+"_mins.csv"
#train_preproc_file = 
normalPrefixes =["CDR","SMS","Int","Fea","fea"] #The last two are in case we want to run with anonymized feature names

# PART 1: Loading in the features correctly

#1.A: in roder to use logarthmic transforms of features we need to compute the minimum of all features
if not os.path.exists(train_min_file):
    print "***Calculating new minimum values ***\n"
    TrainingLib.writeMins(features_file,train_min_file)
wg_minMap = TrainingLib.readMins(train_min_file)

#1.B: now load in the training set
#Loading in the file and getting all the features cleaned up, preprocessed etc.
rawFeats, trainHeaders, trainIndices, trainIds =TrainingLib.processFile(train_file, normalPrefixes, trainFeatures_excludeList, train_idCol, 
                                                                        fileDelimiter= ",", exclusiveList = trainFeatures_exclusiveList)
wg_mins = [wg_minMap[header] for header in trainHeaders]


distCol = TrainingLib.readDistrictAssign(train_file, "district")
dWeightMap = TrainingLib.makeDistWeightMap(districtWeights_file)
reWeigh_rowCols, reWeight_crossVal = TrainingLib.sampleWithRepl(distCol, dWeightMap) 
rawFeats = rawFeats[reWeigh_rowCols]

#Now we apply the config defind cleaning/transforms etc
allFeats, unMeans, unStds, unWinds,unLogWinds,unQuadWinds = TrainingLib.featureCleanAndTrans(rawFeats, wg_mins, 
                                                                      train_option_doLog, train_option_doQuad,train_option_windsorFeats)

allHeaders = trainHeaders
if train_option_doLog:
    allHeaders = allHeaders + ["Log " + s for s in trainHeaders]
if train_option_doQuad:
    allHeaders = allHeaders + ["Quad " + s for s in trainHeaders]

#1.c: defining some basic training search params, algs etc.
crossVal_method = reWeight_crossVal

reg_method = [linear_model.ElasticNetCV, linear_model.ElasticNet,"ElasticNet"]
reg_alpha = [1000.0,100.0,10.0,1.0,0.1,0.01, 0.001, 0.0001, 0.00001, 0.000001]
reg_alphaLevel2 = True
reg_metric = "r2"

class_method = ["l1"] #Note that we always use log regression, so you jsut need to deifn the reg mehod
class_alpha = [0.001,0.01,0.1,1,10,100,1000]
class_alphaLevel2 = False
class_metric = "roc_auc"

#Part 2: Nowe move onto training. Part A consists in picking the objective, loading and cleaning it depending on whether its normal regression or logisitc regressiom
for objName in (trainObj_regList+trainObj_boolList):
    dummyFeats, dummyHeaders, dummyIndices, dummyIds, objs =TrainingLib.processFile(train_file, [], trainFeatures_excludeList, train_idCol,objectiveName=objName)
    
    currObj_reg = objName in trainObj_regList
    #preparing the different training methods if normal regression or logistic regression
    if currObj_reg:
        objs = TrainingLib.cleanData(objs, train_option_windsorFeats)
        currAlphas = reg_alpha
        currMethodCV = reg_method[0](alphas = currAlphas, cv = crossVal_method)
        currMethod = reg_method[1](max_iter=2000)
        curr_alphaLevel2 = reg_alphaLevel2
        currMetric = reg_metric
    else:
        objs = TrainingLib.cleanData(objs, 0)
        objs = TrainingLib.processBoolObj(objName, objs)
        currAlphas=class_alpha
        currMethod = LogisticRegression(penalty=class_method[0], tol=0.01, class_weight="auto")
        currMethodCV = LogisticRegressionCV(Cs = class_alpha, penalty=class_method[0], tol=0.01, class_weight="auto",cv = crossVal_method,solver='liblinear', max_iter=1000, scoring = class_metric)
        curr_alphaLevel2=class_alphaLevel2
        currMetric = class_metric
    objs = objs[reWeigh_rowCols]




    #Analysis operations (as defined in command line)
    
    #investigating performance on smaller subsets of train data
    if analysis_sampleSize  and currObj_reg:
        TrainingLib.doSubset(allFeats, objs, crossVal_method, train_dir+"sampleSize.csv")
        continue

    #generate R2 for single features to see which ones are best
    if analysis_singler2:
        if os.path.isfile(train_dir +objName+"_feature_r2.csv"):
            continue
        featR2= {}
        reps =10
        for i in range(0, len(trainHeaders)):
            runSum = 0
            for rep in range(0, reps):
                newFeats = np.transpose(np.array([allFeats[:,i]]))
                if currObj_reg:
                    score = np.mean(cross_validation.cross_val_score(linear_model.LinearRegression(), newFeats, objs,cv=crossVal_method, scoring="r2"))
                else:
                    score = np.mean(cross_validation.cross_val_score(linear_model.LogisticRegression(), newFeats, objs,cv=crossVal_method, scoring="roc_auc"))
                runSum += score
            #print trainHeaders[i] +":" + `runSum/reps`
            featR2[trainHeaders[i]] = runSum/reps
        with open(train_dir+objName+"_feature_r2.csv", 'wb') as f:
            writer = csv.writer(f)
            for trainHeader in trainHeaders:
                writer.writerow([trainHeader,`featR2[trainHeader]`])
        continue
        

    #2.B: Now the training params and data is set up we find the optimal reg params and get the train/test scores
    (finalModel, finalModel_trainScore,finalModel_testScore) = TrainingLib.runTraining(allFeats, objs, currMethod,
                                                                                       currMethodCV, crossVal_method,
                                                                                       curr_alphaLevel2, currObj_reg,
                                                                                       currMetric)
    if currObj_reg:
        finalParam = finalModel.alpha
    else:
        finalParam = finalModel.C
    print  'Objective: ' + objName
    print 'Test score: ' + `finalModel_testScore`
    print 'Train score: ' + `finalModel_trainScore`
    print 'Regularization parameter: ' + `finalParam`
    print 'Num Feats: ' + `np.count_nonzero(finalModel.coef_)`
    
    #Analysis to generate roc curves for boolean predictions
    #THis analysis has to be done AFTER the model is trained
    if analysis_roc and not currObj_reg: 
        trainFeats = allFeats[crossVal_method.test_fold != 0]
        trainObjs = objs[crossVal_method.test_fold != 0]
        testFeats = allFeats[crossVal_method.test_fold == 0]
        testObjs = objs[crossVal_method.test_fold == 0]
        train_bestModel_roc = finalModel
        train_bestModel_roc.fit(trainFeats,trainObjs)
        logReg_probs = train_bestModel_roc.predict_proba(testFeats)
        fpr, tpr, thresholds = metrics.roc_curve(testObjs,logReg_probs[:,1])
        output = np.transpose(np.append(np.array([fpr]),np.array([tpr]),0))
        np.savetxt(train_dir+objName+"_roc.csv", output,delimiter=",", fmt ='%3f')

    if not do_predict:
        continue
    
    #Part 2C:Now we store the important model infromation that will enable us to repredict stuff
    #code.interact(local=locals())
    finalModel_colIndices = np.nonzero(finalModel.coef_)[0]
    finalModel_bestHeaders = [allHeaders[i] for i in finalModel_colIndices]
    finalModel_bestHeaders_repl = [word.replace("Log ","") for word in finalModel_bestHeaders]
    finalModel_bestHeaders_repl= [word.replace("Quad ","") for word in finalModel_bestHeaders_repl]

    #Part 3: Finally we load back in the all the features and make the prediction
    wg_rawFeats, wg_headers, wg_indices, wg_ids = TrainingLib.processFile(features_file, normalPrefixes, trainFeatures_excludeList, "numId", 
                                                              includeList = finalModel_bestHeaders_repl)

    #Handle the transforms and cleans
    wg_logIndices = [i for i in range(0,len(finalModel_bestHeaders )) if finalModel_bestHeaders[i].startswith("Log ")]
    wg_quadIndices =[i for i in range(0,len(finalModel_bestHeaders )) if finalModel_bestHeaders[i].startswith("Quad ")]
    wg_normIndices = [i for i in range(0,len(finalModel_bestHeaders ))  if i not in wg_quadIndices
                      and i not in wg_logIndices]
    wg_logHeaders = np.array(wg_headers)[wg_logIndices]
    wg_logMins = [wg_minMap[s] for s in wg_logHeaders]
    TrainingLib.cleanAndTransformSubset(wg_rawFeats, wg_normIndices, 0, func_transform=None)
    TrainingLib.cleanAndTransformSubset(wg_rawFeats, wg_logIndices, 0, func_transform=TrainingLib.logTransform, wgMins = list(wg_logMins))
    TrainingLib.cleanAndTransformSubset(wg_rawFeats, wg_quadIndices, 0, func_transform=TrainingLib.quadTransform)
    unAllWinds = np.concatenate([unWinds, unLogWinds], axis = 1)
    unAllWinds = np.concatenate([unAllWinds, unQuadWinds], axis= 1)
    TrainingLib.windsor_apply(wg_rawFeats, unAllWinds[:,finalModel_colIndices])
    wg_rawFeats = TrainingLib.removeLowVar(wg_rawFeats,t=1e-9)
    unMeans_curr = np.array([unMeans])[:,finalModel_colIndices]
    unStds_curr = np.array([unStds])[:,finalModel_colIndices]
    for i in range(0, unStds_curr.shape[0]):
        if unStds_curr[0][i] == 0:
            unStds_curr[0][i] = 1
    wg_allFeats = (wg_rawFeats-unMeans_curr)/unStds_curr
    finalModel.coef_ = finalModel.coef_[finalModel_colIndices]
    if currObj_reg:    
        wg_prediction = finalModel.predict(wg_allFeats)
        wg_out = np.transpose(np.array([wg_ids,wg_prediction]))
    else:
        wg_predictionprob =finalModel.predict_proba(wg_allFeats)[:,1]
        wg_out = np.transpose(np.array([wg_ids,wg_predictionprob]))
    TrainingLib.outputPredictions(wg_out, train_dir+train_runId+"_"+objName+"_predictions.csv", stdevs = (currObj_reg))

if analysis_singler2:
    FeatureNamerLib.joinR2Preds(trainObj_regList+trainObj_boolList, train_dir)

@chengjun
Copy link
Member Author

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score, roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split 

matplotlib.style.use('ggplot')

df = pd.read_csv('./modeling/data/train/PhoneSurveyFeatsHash.csv')
df.head()

X = df.drop(['Unnamed: 0',
 'district',
 'own_electric',
 'many_radio',
 'many_tv',
 'own_fridge',
 'many_bike',
 'many_moto',
 'type_toilet',
 'hh_adults',
 'hh_children',
 'wealthActual',
 'wealthPredicted'], axis = 1)

y = df['wealthActual']

X1, X2, y1, y2 = train_test_split(X, y, random_state=0,
                                  train_size=0.8, test_size = 0.2)

@chengjun chengjun added this to Done in Classic Research Nov 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

1 participant