# Aggregation and Analysis of Diffusion of Microfinance Data

## Explanation and Sources
*All data comes from the Harvard Dataverse website. The paper is cited below:*

*Banerjee, Abhijit; Chandrasekhar, Arun G.; Duflo, Esther; Jackson, Matthew O., 2013, "The Diffusion of Microfinance", hdl:1902.1/21538, Harvard Dataverse, V9*

**The paper studies how microfinance is diffused through social networks within villages in rural India. People's everyday activity's and interactions were coded into adjacency matrices so that their interactions can be studied and analyzed.**

The notebook below is, for now, an exploratory analysis of this data. I start off by taking the large number of files (75 villages, several interaction types, and Person/House type, and storing all these files into dictionaries of dataframes that can be called by their adjusted filenames. This helped with organization, as I also built a dataframe of file names corresponding to the village, activity, and type I wanted to look at.

The exploration so far, which I am working on finishing, involves calculating a variety of complex network metrics for each village and interaction type, that can be put into its own dataframe. The thought behind this is that I can pair this with the demographic data for each person/house and see how well the demographic data given can predict various network characteristics/hierarchies/positions within the networks.

For example, one potential (unproven outcome) would be that households with roof type 3 are more likely to have (or do have) a higher betweenness centrality than other households in the village. Once the functions are defined to organize this data, I can train some machine learning algorithms on the data to see how well this fits.

In [1]:
#Import necessary library packages to render HTML in a code cell
from IPython.display import HTML

#Define the javascript function and HTML to produce the show/hide code button
text = str('''
    <script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')

In [2]:
HTML(text)

In [3]:
#######========> Import libraries needed for this notebook

import numpy as np #Linear algebra library
import pandas as pd #Data manipulation library
import matplotlib.pyplot as plt #Visualization library
import seaborn as sns #Advanced visualization library
import os #Libary to reference file paths
import re #Regular expressions library
import datetime #Library I am using to estimate time remaining for chunks of code
pd.set_option('display.multi_sparse', False)
import networkx as nx #Library used to analyze and view complex networks/graphs 
from networkx_viewer import Viewer
%matplotlib inline

#######========>End library imports, begin code

#######========>Set file paths for each directory that the files are located in

#Set file path name string for the adjanceny matrices
adj_path = os.getcwd() + '/datav4.0/Data/1. Network Data/Adjacency Matrices/'

#Set the file path name string for the adjacency matrix keys
adj_key_path = os.getcwd() + '/datav4.0/Data/1. Network Data/Adjacency Matrix Keys/'

#Set the file path name string for the Demographics and Outcomes files
dem_path = os.getcwd() + '/datav4.0/Data/2. Demographics and Outcomes/'

#######========>End file path decleration

house_dtypes = {'village' : str, 'adjmatrix_key' : str, 'HHnum_in_village' : str, 'hhid' : str,
                'hohreligion' : str, 'castesubcaste' : str, 'rooftype1' : int, 'rooftype2' : int,
                'rooftype3' : int, 'rooftype4' : int, 'rooftype5' : int, 'rooftypeoth' : str,
                'room_no' : int, 'bed_no' : int, 'electricity' : str, 'latrine' : str,
                'ownrent' : str, 'hhSurveyed' : int, 'leader' : int}
indiv_dtypes = {'adjmatrix_key': str,'age': int,'caste': str,'educ': str,'electioncard': str,
                'english': str,'hhid': str,'hindi': str,'kannada': str,'mothertongue': str,
                'movecontact': str,'movecontact_hhid': str,'movecontact_name': str,
                'movecontact_pid': str,'movecontact_res': str,'movereason': str,'native_district': str,
                'native_name': str,'native_taluk': str,'native_type': str,'occupation': str,
                'otherlang': str,'pid': str,'privategovt': str,'rationcard': str, 
                'rationcard_colour': str,'religion': str,'res_time_mths': float,'res_time_yrs': str,
                'resp_gend': str,'resp_id': str,'resp_status': str,'savings': str,'savings_no': str,
                'shg_no': str,'shgparticipate': str,'speakother': str,'subcaste': str,'tamil': str,
                'telugu': str,'urdu': str,'village': str,'villagenative': str,'work_freq': float,
                'work_freq_type': str,'work_outside': str,'work_outside_freq': str,'workflag': str}

house_dem = pd.read_stata(dem_path + 'household_characteristics.dta', preserve_dtypes = False)
house_dem = house_dem.astype(house_dtypes)
indiv_dem = pd.read_stata(dem_path + 'individual_characteristics.dta', preserve_dtypes = False)
indiv_dem = indiv_dem.astype(indiv_dtypes)

In [4]:
#######========>Use to calculate estimated time remaining

a = datetime.datetime.now()
count, last_time = 0, 0

#######========>End time remaining variables

#######========>Declare dicitionary and list variables

adj_dfs = {} #Store all of the adjacency matrix dataframes in a dictionary
adj_names = [] #Store all of the adjacency matrix file names in a list

#######========>End variable decleration, begin looping through files

files = os.listdir(adj_path)
files.remove('.DS_Store')

for file in files:
    
#######========>Cast each adjacency matrix to a dataframe, stored in a dictionary of dataframes    
    
    #Get rid of filename extension to create filename to be stored
    filename = str(file.replace('.csv','').replace('adj_',''))
    
    #Add name of the file to list of file names
    adj_names.append(filename)
    
    #Create dataframe and cast to dictionary
    adj_dfs[filename] = pd.read_csv(adj_path + file, header = None)

#######========>End casting of dataframe to dictionary on this iteration
    
#######========>Create new columns with file characteristics and reorder and reindex as needed

    #Create village number column
    adj_dfs[filename]['Village'] = str(re.findall('\d+', filename)[0])
    
    #Create type column, can be house or individual
    adj_dfs[filename]['Type'] = (lambda x: 'Person' if not re.findall('HH', x) else 'House')(filename)
    
    #Create activity column, used to show how the people are interacting, i.e. going to temple
    adj_dfs[filename]['Activity'] = (lambda x: x[0:x.index('_')])(filename)
    
    #Add the village and type columns to the index
    adj_dfs[filename].set_index(['Village','Type'], append = True, inplace = True)
    
    #Set the Activity column to column position 0 in the DataFrame
    for column in ['Activity']:
        holder = adj_dfs[filename][column]
        adj_dfs[filename].drop(labels = [column], axis = 1, inplace = True)
        adj_dfs[filename].insert(0, column, holder)
    
#######========>End reordering and reindexing
    
#######========>Used to calculate estimated time remaining in the cell. Not relevant code

    count += 1
    if count%250 == 0:
        b = datetime.datetime.now()
        time_remaining = ((b-a).total_seconds()*(len(os.listdir(adj_path))/count) - last_time)/60
        print('Time remaining: ' + str(round(time_remaining, 2)) + ' minutes')
        last_time = (b-a).total_seconds()
        
#######========>END not time remaining relevant code

Time remaining: 4.31 minutes
Time remaining: 4.07 minutes
Time remaining: 3.76 minutes
Time remaining: 2.87 minutes
Time remaining: 2.4 minutes
Time remaining: 1.97 minutes
Time remaining: 1.28 minutes
Time remaining: 0.79 minutes


In [5]:
#######========>Declare dicitionary and list variables

key_dfs = {} #Store all of the adjacency key matrix dataframes in a dictionary
key_names = [] #Store all of the adjacency key matrix filenames in a list

#######========>End variable decleration, begin looping through files

files = os.listdir(adj_key_path)

for file in files:
    
#######========>Cast each adjacency key matrix to a dataframe, stored in a dictionary of dataframes
    
    #Get rid of filename extension to create filename to be stored
    filename = str(file.replace('.csv',''))
    
    #Add name of the file to list of file names
    key_names.append(filename)
    
    #Create dataframe and cast to dictionary
    key_dfs[filename] = pd.read_csv(adj_key_path + file, header = None, names = ['Key'])
    
#######========>End casting of dataframe to dictionary on this iteration

#######========>Create new columns with file characteristics and reorder and reindex as needed

    #Create village number column
    key_dfs[filename]['Village'] = str(re.findall('\d+', filename)[0])
    
    #Create type column, can be house or individual
    key_dfs[filename]['Type'] = (lambda x: 'Person' if not re.findall('HH', x) else 'House')(filename)
    
    #Add the village and type columns to the index
    key_dfs[filename].set_index(['Village','Type'], append = True, inplace = True)
    
#######========>End reordering and reindexing

In [6]:
#######========>Create dataframe of the adjacency matrix filenames and their corresponding characteristics

#Create dataframe of the adjacency matrix filenames
adj_nameframe = pd.DataFrame(adj_names, columns = ['Adj_Name'])

#Add village number column
adj_nameframe['Village'] = adj_nameframe['Adj_Name'].apply(lambda x: str(re.findall('\d+', x)[0]))

#Add type column
adj_nameframe['Type'] = adj_nameframe['Adj_Name'].apply(lambda x: 'Person' if not re.findall('HH', x) else 'House')

#Add activity column
adj_nameframe['Activity'] = adj_nameframe['Adj_Name'].apply(lambda x: x[0:x.index('_')])

#######========>End creation of dataframe for filenames

#######========>Create dataframe of filenames for the adjacency key matrices

#Create dataframe of the key file names
key_nameframe = pd.DataFrame(key_names, columns = ['Key_Name'])

#Add village number column
key_nameframe['Village'] = key_nameframe['Key_Name'].apply(lambda x: str(re.findall('\d+', x)[0]))

#Add type column
key_nameframe['Type'] = key_nameframe['Key_Name'].apply(lambda x: 'Person' if not re.findall('HH', x) else 'House')

#######========>End creation of dataframe for the key filenames

#######========>Merge the two dataframes together so that the files can be referenced relative to each other

#Merge the two on village number and type (what makes them unique), so the two can be accessed from each other
adj_nameframe = pd.merge(adj_nameframe, key_nameframe, how = 'left', on = ['Village','Type'])

adj_nameframe.head(10)

#######========>End by displaying the first 10 rows of the resulting dataframe

Unnamed: 0,Adj_Name,Village,Type,Activity,Key_Name
0,allVillageRelationships_HH_vilno_1,1,House,allVillageRelationships,key_HH_vilno_1
1,allVillageRelationships_HH_vilno_10,10,House,allVillageRelationships,key_HH_vilno_10
2,allVillageRelationships_HH_vilno_11,11,House,allVillageRelationships,key_HH_vilno_11
3,allVillageRelationships_HH_vilno_12,12,House,allVillageRelationships,key_HH_vilno_12
4,allVillageRelationships_HH_vilno_14,14,House,allVillageRelationships,key_HH_vilno_14
5,allVillageRelationships_HH_vilno_15,15,House,allVillageRelationships,key_HH_vilno_15
6,allVillageRelationships_HH_vilno_16,16,House,allVillageRelationships,key_HH_vilno_16
7,allVillageRelationships_HH_vilno_17,17,House,allVillageRelationships,key_HH_vilno_17
8,allVillageRelationships_HH_vilno_18,18,House,allVillageRelationships,key_HH_vilno_18
9,allVillageRelationships_HH_vilno_19,19,House,allVillageRelationships,key_HH_vilno_19


In [7]:
#######========>Use to calculate estimated time remaining

a = datetime.datetime.now()
count, last_time = 0, 0

#######========>End time remaining variables

#######========>Declare dicitionary and list variables

combined_dfs = {} #Store all of the adjacency matrix dataframes with key values in a dictionary

#######========>End variable decleration, begin looping through files and merging

for file in adj_nameframe['Adj_Name']:
    
    key_file = adj_nameframe[adj_nameframe.Adj_Name == file]['Key_Name'].values[0]
    combined_dfs[file] = pd.merge(adj_dfs[file], key_dfs[key_file], how = 'left',
                                  left_index = True, right_index = True)
    
    combined_dfs[file]['Key'] = combined_dfs[file][combined_dfs[file].Key.notnull()]['Key'].astype(int).astype(str)
    
    combined_dfs[file].set_index(['Key','Activity'], append = True, inplace = True)
    
#######========>Used to calculate estimated time remaining in the cell. Not relevant code

    count += 1
    if count%250 == 0:
        b = datetime.datetime.now()
        time_remaining = ((b-a).total_seconds()*(len(os.listdir(adj_path))/count) - last_time)/60
        print('Time remaining: ' + str(round(time_remaining, 2)) + ' minutes')
        last_time = (b-a).total_seconds()
        
#######========>END not time remaining relevant code

combined_dfs[adj_names[0]].head()

Time remaining: 0.99 minutes
Time remaining: 0.89 minutes
Time remaining: 0.83 minutes
Time remaining: 0.67 minutes
Time remaining: 0.55 minutes
Time remaining: 0.44 minutes
Time remaining: 0.28 minutes
Time remaining: 0.17 minutes


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,0,1,2,3,4,5,6,7,8,9,...,172,173,174,175,176,177,178,179,180,181
Unnamed: 0_level_1,Village,Type,Key,Activity,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1
0,1,House,1,allVillageRelationships,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,House,2,allVillageRelationships,1,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,House,3,allVillageRelationships,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,House,4,allVillageRelationships,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,House,5,allVillageRelationships,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [68]:
#Function to get the graph, connected subgraph and percentage of degree zero nodes
def graph(attributes):
    
    Village = attributes[0]
    Category = attributes[1]
    Activity = attributes[2]
    
    #Define the file name for the criteria given in the function
    graph_name = adj_nameframe[(adj_nameframe.Village == Village)&
                 (adj_nameframe.Type == Category)&
                 (adj_nameframe.Activity == Activity)]['Adj_Name'].values[0]
    
    if Category == 'Person':
        
        label_frame = indiv_dem[indiv_dem.village == Village]
        
    else:
        
        label_frame = house_dem[house_dem.village == Village]
    
    #Initialize the graph that includes all the nodes for the filename
    df = combined_dfs[graph_name]
    G = nx.Graph(df.values)
    
    keys = list(df.index.get_level_values('Key'))
    categories = list(df.index.get_level_values('Type'))
    activities = list(df.index.get_level_values('Activity'))
    nodes = list(G.nodes())

    key_dict = {nodes[i] : keys[i] for i in range(len(nodes))}
    cat_dict = {nodes[i] : categories[i] for i in range(len(nodes))}
    act_dict = {nodes[i] : activities[i] for i in range(len(nodes))}
    
    nx.set_node_attributes(G, key_dict, 'Keys')
    nx.set_node_attributes(G, cat_dict, 'Type')
    nx.set_node_attributes(G, act_dict, 'Activity')
    
    #Define the number of nodes in the full graph
    num_nodes = G.order()
    
    #Define the number of edges in the full graph
    num_edges = G.size()
    
    #Define a list of degree zero nodes and get the number
    isolates = list(nx.isolates(G))
    num_isolates = len(isolates)
    
    #Calculate the percentage of degree zero nodes in the graph
    pct_isolates = num_isolates/num_nodes
    
    output = {'graph' : G, 'nodes' : num_nodes, 
              'edges' : num_edges, 'label frame' : label_frame,
              'pct_isolates' : pct_isolates, 'frame' : df,
              'keys' : key_dict, 'type' : cat_dict, 'activity' : act_dict}
    
    return output
    

In [69]:
#Defines a function that takes in a fully connected graph/sub graph and returns a dictionary of outputs
def multi_values(G):
    
    shortest_path = dict(nx.shortest_path_length(G))
    shortest_path = {k : np.mean([j for i,j in v.items() if i != k]) for k,v in shortest_path.items()}
    
    #Define a dictionary that maps algorithm names to their outputs for G
    multi = {'betweenness centrality' : nx.betweenness_centrality(G),
             'eigenvector centrality' : nx.eigenvector_centrality(G),
             'shortest path' : shortest_path,
             'clustering' : nx.clustering(G),
             'closeness centrality' : nx.closeness_centrality(G),
             'katz centrality' : nx.katz_centrality_numpy(G),
             'load centrality' : nx.load_centrality(G),
             'eccentricity' : nx.eccentricity(G), 
             'degrees' : dict(G.degree()),
             'triangles' : nx.triangles(G),
             'square clustering' : nx.square_clustering(G),
             'subgraph centrality' : nx.subgraph_centrality(G)}
    
    #Return the dictionary
    return multi

In [70]:
def single_values(G):
    
    #Define a dictionary that maps algorithm names to their outputs for G
    single = {'diameter' : nx.diameter(G),
              'avg shortest path' : nx.average_shortest_path_length(G),
              'avg clustering' : nx.average_clustering(G),
              'wiener index' : nx.wiener_index(G),
              'radius' : nx.radius(G), 
              'density' : nx.density(G), 
              'transitivity' : nx.transitivity(G),
              'estrada index' : nx.estrada_index(G)}
    
    return single

In [103]:
def make_frames(G, attributes):
    
    Village = attributes[0]
    Category = attributes[1]
    Activity = attributes[2]
    
    names = ['Village','Type','Activity']
    
    single_dfs = {}
    multi_dfs = {}
    
    for i, component in enumerate(nx.connected_component_subgraphs(G)):
    
        #Call the single_values function to get a dictionary of single output values for the graph component
        single = pd.DataFrame.from_dict(single_values(component), orient = 'index').transpose()
        
        #Call the multi_values function to get the dictionary of node specific values for the graph component
        multi = pd.DataFrame.from_dict(multi_values(component))
        
        for i, name in enumerate(names):
            single[name] = attributes[i]
            multi[name] = attributes[i]
        
        #Append the single values dataframe to the single_dfs dictionary
        single_dfs[str(i)] = single
        
        #Append the multi values dataframe to the multi_dfs dictionary
        multi_dfs[str(i)] = multi
    
    #Cocatenate all of the connected subgraph single values dataframes into one so all nodes are present
    single_df = pd.concat(single_dfs.values())
    
    #Cocatenate all of the connected subgraph multi values dataframes into one so all nodes are present
    multi_df = pd.concat(multi_dfs.values())
    
    return single_dfs, multi_dfs

In [98]:
#Define graph to display the connected graph with distortion based on desired centrality measure
def graph_with_distortion(G, summary, label_frame, size = None, 
                          with_labels = False, labels = None, 
                          node_color = None, node_map = None,
                          prog = 'neato',label = None):
    
    if size:
        
        #Define the list of sizes from the measure in the summary dictionary
        size_list = list(summary[size].values())
    
        #Scale the size list so it is appropriate for the graph
        size_list = [(300.0/np.mean(size_list))*x for x in size_list]
        
    else:
        size_list = None
        
    if node_color:
        
        color = list(summary[node_color].values())
    
    else:
        
        color = None
        
    if labels:
        
        label_dict = label_frame.set_index('adjmatrix_key')[labels].to_dict()
        
        label_dict = {int(k) : v for k,v in label_dict.items() if int(k) in list(G.nodes)}
    
    else:
        
        label_dict = None
    
    pos = nx.drawing.nx_agraph.graphviz_layout(G, prog = prog)
    
    #Create a matplotlib figure
    fig = plt.figure(figsize = (10,10))
    plt.title(prog)
    
    #Use the draw spring algorithm from networkx to draw the network
    nx.draw_networkx(G, with_labels = with_labels, node_color = color,
                     cmap = node_map, labels = label_dict, 
                     node_size = size_list, pos = pos, label = label)
    return None

In [99]:
attributes = ['5', 'Person', 'allVillageRelationships']
output = graph(attributes)

In [104]:
G = output['graph']
single_dfs, multi_dfs = make_frames(G, attributes)

In [101]:
multi_dfs['2']

Unnamed: 0,betweenness centrality,closeness centrality,clustering,degrees,eccentricity,eigenvector centrality,katz centrality,load centrality,shortest path,square clustering,subgraph centrality,triangles,Village,Type,Activity
597,0.0,1.0,1.0,2,1,0.57735,0.57735,0.0,1.0,0,2.708272,1,5,Person,allVillageRelationships
598,0.0,1.0,1.0,2,1,0.57735,0.57735,0.0,1.0,0,2.708272,1,5,Person,allVillageRelationships
599,0.0,1.0,1.0,2,1,0.57735,0.57735,0.0,1.0,0,2.708272,1,5,Person,allVillageRelationships


In [96]:
for component in list(nx.connected_component_subgraphs(G)):
    print(component.nodes())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,

In [None]:
for i, component in enumerate(nx.connected_component_subgraphs(output['graph'])):
        if i == 0:
            single_df, multi_df = make_frames(component, attributes)

In [None]:
multi_df.head()

In [None]:
summary, G = component_summary(output, 0)

In [None]:
progs = ['neato', 'dot']
for prog in progs:
    graph_with_distortion(G, summary, label_frame, node_map = plt.cm.Reds, 
                          node_color = 'eigenvector centrality', prog = prog, label = prog)