# Aggregation and Analysis of Diffusion of Microfinance Data

## Explanation and Sources
*All data comes from the Harvard Dataverse website. The paper is cited below:*

*Banerjee, Abhijit; Chandrasekhar, Arun G.; Duflo, Esther; Jackson, Matthew O., 2013, "The Diffusion of Microfinance", hdl:1902.1/21538, Harvard Dataverse, V9*

**The paper studies how microfinance is diffused through social networks within villages in rural India. People's everyday activities and interactions were coded into adjacency matrices so that they can be studied and analyzed.**

The notebook below is, for now, an exploratory analysis of this data. I start by taking the large number of files (75 villages, several interaction types, and Person/House type) and storing them in dictionaries of dataframes that can be called by their adjusted filenames. This helped with organization, as I also built a dataframe of filenames corresponding to the village, activity, and type I wanted to look at.

The exploration so far, which I am still finishing, involves calculating a variety of complex network metrics for each village and interaction type and collecting them in their own dataframe. The idea is to pair these metrics with the demographic data for each person/house and see how well the demographic data can predict various network characteristics, hierarchies, and positions within the networks.

For example, one potential (unproven) outcome would be that households with roof type 3 tend to have a higher betweenness centrality than other households in the village. Once the functions to organize this data are defined, I can train some machine learning algorithms on it to see how well the demographic data fits.
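
To make that kind of comparison concrete, here is a minimal sketch on a toy network. The graph, the `toy_dem` table, and its `rooftype3` values are all made up for illustration; nothing here comes from the real data or says anything about the actual result.

```python
import networkx as nx
import pandas as pd

# Toy network: node 0 is the hub of a star, so every shortest path between leaves runs through it
G = nx.star_graph(4)  # nodes 0..4, with edges 0-1, 0-2, 0-3, 0-4

# Hypothetical attribute table; the 'rooftype3' values are invented for this example
toy_dem = pd.DataFrame({'node': [0, 1, 2, 3, 4],
                        'rooftype3': [1, 0, 0, 0, 0]})

# Per-node betweenness centrality, attached as a new column
bc = nx.betweenness_centrality(G)
toy_dem['betweenness'] = toy_dem['node'].map(bc)

# Mean betweenness for each attribute group
group_means = toy_dem.groupby('rooftype3')['betweenness'].mean()
print(group_means)
```

On the real data the same pattern would apply after the network metrics have been merged with the demographic frames.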

**A note: The term complex networks doesn't refer to the subject being difficult to understand (though it can be!); it refers to the organization of the networks themselves. A 'complex network' has certain characteristics, which Wikipedia defines as follows:**

*In the context of network theory, a complex network is a graph (network) with non-trivial topological features—features that do not occur in simple networks such as lattices or random graphs but often occur in graphs modelling of real systems.*

## Show Code Button

First, below is a JavaScript function embedded in HTML that, when rendered, lets the user show or hide the code.

In [1]:
#Import necessary library packages to render HTML in a code cell
from IPython.display import HTML

#Define the javascript function and HTML to produce the show/hide code button
text = str('''
    <script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')

In [2]:
HTML(text)

## Diving Into the Code

### Importing Libraries and Initial Variables

In the cell below, I begin by importing all of the libraries needed for this analysis. The main libraries used are pandas (for dataframe creation and manipulation) and networkx (for working with the complex networks/graphs derived from the adjacency matrices in the files). Networkx, while not a sophisticated network graphing library, makes it very easy to work with the graphs and derive characteristics from them.

There is no output from the cell below, just initialization of the file paths and data types of the resulting dataframes.

In [3]:
#######========> Import libraries needed for this notebook

import numpy as np #Linear algebra library
import pandas as pd #Data manipulation library
import matplotlib.pyplot as plt #Visualization library
import seaborn as sns #Advanced visualization library
import os #Library to reference file paths
import re #Regular expressions library
import datetime #Library I am using to estimate time remaining for chunks of code
pd.set_option('display.multi_sparse', False) #Set so that full pandas multi-index is displayed
import networkx as nx #Library used to analyze and view complex networks/graphs
import warnings #Set so that warnings are not displayed
warnings.filterwarnings('ignore')
%matplotlib inline

#######========>End library imports, begin code

#######========>Set file paths for each directory that the files are located in

#Set file path name string for the adjacency matrices
adj_path = os.getcwd() + '/datav4.0/Data/1. Network Data/Adjacency Matrices/'

#Set the file path name string for the adjacency matrix keys
adj_key_path = os.getcwd() + '/datav4.0/Data/1. Network Data/Adjacency Matrix Keys/'

#Set the file path name string for the Demographics and Outcomes files
dem_path = os.getcwd() + '/datav4.0/Data/2. Demographics and Outcomes/'

#######========>End file path declaration

#Declare all of the python data types for each of the columns in the house demographic file
house_dtypes = {'village' : str, 'adjmatrix_key' : str, 'HHnum_in_village' : str, 'hhid' : str,
                'hohreligion' : str, 'castesubcaste' : str, 'rooftype1' : int, 'rooftype2' : int,
                'rooftype3' : int, 'rooftype4' : int, 'rooftype5' : int, 'rooftypeoth' : str,
                'room_no' : int, 'bed_no' : int, 'electricity' : str, 'latrine' : str,
                'ownrent' : str, 'hhSurveyed' : int, 'leader' : int}

#Declare all of the python data types for each of the columns in the individual demographic file
indiv_dtypes = {'adjmatrix_key': str,'age': int,'caste': str,'educ': str,'electioncard': str,
                'english': str,'hhid': str,'hindi': str,'kannada': str,'mothertongue': str,
                'movecontact': str,'movecontact_hhid': str,'movecontact_name': str,
                'movecontact_pid': str,'movecontact_res': str,'movereason': str,'native_district': str,
                'native_name': str,'native_taluk': str,'native_type': str,'occupation': str,
                'otherlang': str,'pid': str,'privategovt': str,'rationcard': str, 
                'rationcard_colour': str,'religion': str,'res_time_mths': float,'res_time_yrs': str,
                'resp_gend': str,'resp_id': str,'resp_status': str,'savings': str,'savings_no': str,
                'shg_no': str,'shgparticipate': str,'speakother': str,'subcaste': str,'tamil': str,
                'telugu': str,'urdu': str,'village': str,'villagenative': str,'work_freq': float,
                'work_freq_type': str,'work_outside': str,'work_outside_freq': str,'workflag': str}

#Use pandas to read in the stata file of the household demographics
house_dem = pd.read_stata(dem_path + 'household_characteristics.dta', preserve_dtypes = False)
#Declare the data types for the house_dem dataframe using the above dictionary
house_dem = house_dem.astype(house_dtypes)

#Use pandas to read in the stata file of the individual demographics
indiv_dem = pd.read_stata(dem_path + 'individual_characteristics.dta', preserve_dtypes = False)
#Declare the data types for the indiv_dem dataframe using the above dictionary
indiv_dem = indiv_dem.astype(indiv_dtypes)

Since there are many hundreds of files, each with an adjacency matrix, I needed an effective way to organize all of them, and have a single entity where I could easily access each one. At first, I tried combining all of the dataframes into one, but this had several shortcomings and was not a tractable method.

So what I ended up doing was creating a dictionary of pandas dataframes. Each key of the dictionary is the adjusted filename (without the '.csv'), and each value of the dictionary is the corresponding adjacency matrix.

In the cell below, the only output is the time remaining I included to check how long this process was taking. The code below, to create the dictionary of dataframes, takes about 5 minutes.

In [4]:
#######========>Use to calculate estimated time remaining

a = datetime.datetime.now()
count, last_time = 0, 0

#######========>End time remaining variables

#######========>Declare dictionary and list variables

adj_dfs = {} #Store all of the adjacency matrix dataframes in a dictionary
adj_names = [] #Store all of the adjacency matrix file names in a list

#######========>End variable declaration, begin looping through files

#Keep only the csv files, which also skips a macOS '.DS_Store' entry if one is present
files = [f for f in os.listdir(adj_path) if f.endswith('.csv')]

for file in files:
    
#######========>Cast each adjacency matrix to a dataframe, stored in a dictionary of dataframes    
    
    #Get rid of filename extension to create filename to be stored
    filename = str(file.replace('.csv','').replace('adj_',''))
    
    #Add name of the file to list of file names
    adj_names.append(filename)
    
    #Create dataframe and cast to dictionary
    adj_dfs[filename] = pd.read_csv(adj_path + file, header = None)

#######========>End casting of dataframe to dictionary on this iteration
    
#######========>Create new columns with file characteristics and reorder and reindex as needed

    #Create village number column
    adj_dfs[filename]['Village'] = str(re.findall(r'\d+', filename)[0])
    
    #Create type column, can be house or individual
    adj_dfs[filename]['Type'] = (lambda x: 'Person' if not re.findall('HH', x) else 'House')(filename)
    
    #Create activity column, used to show how the people are interacting, i.e. going to temple
    adj_dfs[filename]['Activity'] = (lambda x: x[0:x.index('_')])(filename)
    
    #Add the village and type columns to the index
    adj_dfs[filename].set_index(['Village','Type'], append = True, inplace = True)
    
    #Set the Activity column to column position 0 in the DataFrame
    for column in ['Activity']:
        holder = adj_dfs[filename][column]
        adj_dfs[filename].drop(labels = [column], axis = 1, inplace = True)
        adj_dfs[filename].insert(0, column, holder)
    
#######========>End reordering and reindexing
    
#######========>Used to calculate estimated time remaining in the cell. Not relevant code

    count += 1
    if count%250 == 0:
        b = datetime.datetime.now()
        time_remaining = ((b-a).total_seconds()*(len(os.listdir(adj_path))/count) - last_time)/60
        print('Time remaining: ' + str(round(time_remaining, 2)) + ' minutes')
        last_time = (b-a).total_seconds()
        
#######========>End of time remaining code

Time remaining: 4.82 minutes
Time remaining: 4.59 minutes
Time remaining: 4.36 minutes
Time remaining: 3.41 minutes
Time remaining: 2.88 minutes
Time remaining: 2.43 minutes
Time remaining: 1.49 minutes
Time remaining: 0.89 minutes
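
The filename handling in the loop above can be sketched on its own. `parse_adj_name` is a hypothetical helper, not part of the notebook, and the example file names are assumptions based on the stored names shown later (the `adj_` prefix and `.csv` extension are what the loop strips).

```python
import re

def parse_adj_name(file_name):
    """Split an adjacency-matrix file name into (stored_name, village, type, activity)."""
    # Drop the extension and the 'adj_' prefix, as the loop above does
    name = file_name.replace('.csv', '').replace('adj_', '')
    # The first run of digits is taken to be the village number
    village = re.findall(r'\d+', name)[0]
    # Names containing 'HH' are household-level; everything else is person-level
    kind = 'House' if 'HH' in name else 'Person'
    # The activity is everything before the first underscore
    activity = name[:name.index('_')]
    return name, village, kind, activity

print(parse_adj_name('adj_allVillageRelationships_HH_vilno_1.csv'))
```

This relies on the activity name containing no digits and no underscore, which holds for the names shown in this notebook but is worth checking against the full file list.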


Continuing in the same vein, the cell below repeats the process above for the keys corresponding to the indices in the adjacency matrices. This is necessary because I will merge the keys, village numbers, activity, and category so that the adjacency matrix characteristics can be paired with the demographic data.

The cell below doesn't output anything; it just creates the dictionary of key dataframes.

In [5]:
#######========>Declare dictionary and list variables

key_dfs = {} #Store all of the adjacency key matrix dataframes in a dictionary
key_names = [] #Store all of the adjacency key matrix filenames in a list

#######========>End variable declaration, begin looping through files

files = os.listdir(adj_key_path)

for file in files:
    
#######========>Cast each adjacency key matrix to a dataframe, stored in a dictionary of dataframes
    
    #Get rid of filename extension to create filename to be stored
    filename = str(file.replace('.csv',''))
    
    #Add name of the file to list of file names
    key_names.append(filename)
    
    #Create dataframe and cast to dictionary
    key_dfs[filename] = pd.read_csv(adj_key_path + file, header = None, names = ['Key'])
    
#######========>End casting of dataframe to dictionary on this iteration

#######========>Create new columns with file characteristics and reorder and reindex as needed

    #Create village number column
    key_dfs[filename]['Village'] = str(re.findall(r'\d+', filename)[0])
    
    #Create type column, can be house or individual
    key_dfs[filename]['Type'] = (lambda x: 'Person' if not re.findall('HH', x) else 'House')(filename)
    
    #Add the village and type columns to the index
    key_dfs[filename].set_index(['Village','Type'], append = True, inplace = True)
    
#######========>End reordering and reindexing

The cell below creates a dataframe of file names corresponding to each combination of village, activity, and category (house/person). This is helpful so that it is possible to get an adjacency matrix from the dataframe by knowing the village, activity, and category, without having to memorize the way the filenames are structured.

This cell outputs the first 10 lines of this dataframe, called adj_nameframe.

In [6]:
#######========>Create dataframe of the adjacency matrix filenames and their corresponding characteristics

#Create dataframe of the adjacency matrix filenames
adj_nameframe = pd.DataFrame(adj_names, columns = ['Adj_Name'])

#Add village number column
adj_nameframe['Village'] = adj_nameframe['Adj_Name'].apply(lambda x: str(re.findall(r'\d+', x)[0]))

#Add type column
adj_nameframe['Type'] = adj_nameframe['Adj_Name'].apply(lambda x: 'Person' if not re.findall('HH', x) else 'House')

#Add activity column
adj_nameframe['Activity'] = adj_nameframe['Adj_Name'].apply(lambda x: x[0:x.index('_')])

#######========>End creation of dataframe for filenames

#######========>Create dataframe of filenames for the adjacency key matrices

#Create dataframe of the key file names
key_nameframe = pd.DataFrame(key_names, columns = ['Key_Name'])

#Add village number column
key_nameframe['Village'] = key_nameframe['Key_Name'].apply(lambda x: str(re.findall(r'\d+', x)[0]))

#Add type column
key_nameframe['Type'] = key_nameframe['Key_Name'].apply(lambda x: 'Person' if not re.findall('HH', x) else 'House')

#######========>End creation of dataframe for the key filenames

#######========>Merge the two dataframes together so that the files can be referenced relative to each other

#Merge the two on village number and type (what makes them unique), so the two can be accessed from each other
adj_nameframe = pd.merge(adj_nameframe, key_nameframe, how = 'left', on = ['Village','Type'])

adj_nameframe.head(10)

#######========>End by displaying the first 10 rows of the resulting dataframe

|   | Adj_Name | Village | Type | Activity | Key_Name |
|---|---|---|---|---|---|
| 0 | allVillageRelationships_HH_vilno_1 | 1 | House | allVillageRelationships | key_HH_vilno_1 |
| 1 | allVillageRelationships_HH_vilno_10 | 10 | House | allVillageRelationships | key_HH_vilno_10 |
| 2 | allVillageRelationships_HH_vilno_11 | 11 | House | allVillageRelationships | key_HH_vilno_11 |
| 3 | allVillageRelationships_HH_vilno_12 | 12 | House | allVillageRelationships | key_HH_vilno_12 |
| 4 | allVillageRelationships_HH_vilno_14 | 14 | House | allVillageRelationships | key_HH_vilno_14 |
| 5 | allVillageRelationships_HH_vilno_15 | 15 | House | allVillageRelationships | key_HH_vilno_15 |
| 6 | allVillageRelationships_HH_vilno_16 | 16 | House | allVillageRelationships | key_HH_vilno_16 |
| 7 | allVillageRelationships_HH_vilno_17 | 17 | House | allVillageRelationships | key_HH_vilno_17 |
| 8 | allVillageRelationships_HH_vilno_18 | 18 | House | allVillageRelationships | key_HH_vilno_18 |
| 9 | allVillageRelationships_HH_vilno_19 | 19 | House | allVillageRelationships | key_HH_vilno_19 |
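
As a small illustration of how this name frame can be used, the sketch below looks up the adjacency and key file names for one village/type/activity combination. `toy_nameframe` copies two rows from the output above, and `lookup_names` is a hypothetical helper rather than something defined in the notebook.

```python
import pandas as pd

# Toy stand-in for adj_nameframe, using two rows shown in the output above
toy_nameframe = pd.DataFrame({
    'Adj_Name': ['allVillageRelationships_HH_vilno_1', 'allVillageRelationships_HH_vilno_10'],
    'Village': ['1', '10'],
    'Type': ['House', 'House'],
    'Activity': ['allVillageRelationships', 'allVillageRelationships'],
    'Key_Name': ['key_HH_vilno_1', 'key_HH_vilno_10'],
})

def lookup_names(nameframe, village, kind, activity):
    """Return (Adj_Name, Key_Name) for one village/type/activity combination."""
    mask = ((nameframe['Village'] == village) &
            (nameframe['Type'] == kind) &
            (nameframe['Activity'] == activity))
    match = nameframe.loc[mask]
    if match.empty:
        # Fail with a readable message instead of an IndexError on .values[0]
        raise KeyError('No file for ' + str((village, kind, activity)))
    row = match.iloc[0]
    return row['Adj_Name'], row['Key_Name']

print(lookup_names(toy_nameframe, '10', 'House', 'allVillageRelationships'))
```

Note that the village has to be passed as a string, since the Village column is stored as `str`.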


In the cell below, I combine the key information with the adjacency matrices. The resulting dataframe has the village, key number, activity, and category as its indices. This information needs to be in the index rather than in columns so that the values of the dataframe are still a square adjacency matrix that can be read into a graph easily.

A new dictionary, called combined_dfs, will be created, with the adjusted filename as the key, and the combined dataframe as the value.

The output of the cell, in addition to displaying the time remaining throughout the iteration, is the first 5 rows of one of the dataframes in the dictionary.

In [7]:
#######========>Use to calculate estimated time remaining

a = datetime.datetime.now()
count, last_time = 0, 0

#######========>End time remaining variables

#######========>Declare dictionary and list variables

combined_dfs = {} #Store all of the adjacency matrix dataframes with key values in a dictionary

#######========>End variable declaration, begin looping through files and merging

for file in adj_nameframe['Adj_Name']:
    
    key_file = adj_nameframe[adj_nameframe.Adj_Name == file]['Key_Name'].values[0]
    combined_dfs[file] = pd.merge(adj_dfs[file], key_dfs[key_file], how = 'left',
                                  left_index = True, right_index = True)
    
    combined_dfs[file]['Key'] = combined_dfs[file][combined_dfs[file].Key.notnull()]['Key'].astype(int).astype(str)
    
    combined_dfs[file].set_index(['Key','Activity'], append = True, inplace = True)
    
#######========>Used to calculate estimated time remaining in the cell. Not relevant code

    count += 1
    if count%250 == 0:
        b = datetime.datetime.now()
        time_remaining = ((b-a).total_seconds()*(len(os.listdir(adj_path))/count) - last_time)/60
        print('Time remaining: ' + str(round(time_remaining, 2)) + ' minutes')
        last_time = (b-a).total_seconds()
        
#######========>End of time remaining code

combined_dfs[adj_names[0]].head()

Time remaining: 1.08 minutes
Time remaining: 0.97 minutes
Time remaining: 0.89 minutes
Time remaining: 0.72 minutes
Time remaining: 0.6 minutes
Time remaining: 0.49 minutes
Time remaining: 0.32 minutes
Time remaining: 0.19 minutes


|   | Village | Type | Key | Activity | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | House | 1 | allVillageRelationships | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | House | 2 | allVillageRelationships | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | House | 3 | allVillageRelationships | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | House | 4 | allVillageRelationships | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | House | 5 | allVillageRelationships | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
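
The reason for keeping the metadata in the index can be shown with a toy frame. This is a minimal sketch with a made-up 3x3 matrix and a placeholder activity name; it uses `nx.from_numpy_array`, which is available in networkx 2.0 and later.

```python
import numpy as np
import pandas as pd
import networkx as nx

# Toy 3x3 symmetric adjacency matrix (made-up values)
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])

# Metadata lives in a MultiIndex rather than in extra columns
index = pd.MultiIndex.from_tuples(
    [(0, '1', 'House', '1', 'toyActivity'),
     (1, '1', 'House', '2', 'toyActivity'),
     (2, '1', 'House', '3', 'toyActivity')],
    names=[None, 'Village', 'Type', 'Key', 'Activity'])
toy = pd.DataFrame(adj, index=index)

# .values ignores the index, so the matrix is still square
print(toy.values.shape)

# The square array converts directly into an undirected graph
G = nx.from_numpy_array(toy.values)
print(sorted(G.edges()))
```

Had Village, Type, Key, and Activity been stored as columns, `toy.values` would be 3x7 and could not be read as an adjacency matrix without first dropping them.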


## Defining the Functions

Now that the data is neatly organized into a dictionary of dataframes that can be easily called, I will define a series of functions to get analyzable data from the complex networks. None of these function definitions outputs anything until called, so if you have clicked the button to hide the code, you won't see the functions. I will, however, briefly explain what each does in the text directly above its cell.

The first function defined below, called 'graph', generates a networkx graph from a given set of attributes (village, category, activity), and returns the graph object along with a couple characteristics, and dictionaries to map the nodes to the correct keys.

In [8]:
#Function to get the graph, its node and edge counts, and the percentage of degree zero nodes
def graph(attributes):
    
    #Enumerate the list of attributes, which is a list of size three, including the below attributes
    Village = attributes[0]
    Category = attributes[1]
    Activity = attributes[2]
    
    #Define the file name for the criteria given in the function
    graph_name = adj_nameframe[(adj_nameframe.Village == Village)&
                 (adj_nameframe.Type == Category)&
                 (adj_nameframe.Activity == Activity)]['Adj_Name'].values[0]
    
    #Define a conditional that calls the correct dataframe to pull the labels frame (house or indiv)
    if Category == 'Person':
        
        #Pull the section of the individual demographic dataframe that is equal to the current village
        label_frame = indiv_dem[indiv_dem.village == Village]
        
    else:
        
        #Pull the section of the household demographic dataframe that is equal to the current village
        label_frame = house_dem[house_dem.village == Village]
    
    #Call the dataframe in the combined_dfs dictionary corresponding to the pulled filename
    df = combined_dfs[graph_name]
    #Initialize the graph from the adjacency matrix that includes all the nodes for the filename
    G = nx.Graph(df.values)
    
    #Define a list of keys, categories, and activities that correspond to the indices of the current dataframe
    keys = list(df.index.get_level_values('Key'))
    categories = list(df.index.get_level_values('Type'))
    activities = list(df.index.get_level_values('Activity'))
    
    #Define a list of all nodes of the graph
    nodes = list(G.nodes())
    
    #Create dictionaries for mapping the list of nodes to the keys, categories, and activities for this graph
    key_dict = {nodes[i] : keys[i] for i in range(len(nodes))}
    cat_dict = {nodes[i] : categories[i] for i in range(len(nodes))}
    act_dict = {nodes[i] : activities[i] for i in range(len(nodes))}
    
    #Use those mapping dictionaries to set the Keys, Type, and Activity as node attributes
    nx.set_node_attributes(G, key_dict, 'Keys')
    nx.set_node_attributes(G, cat_dict, 'Type')
    nx.set_node_attributes(G, act_dict, 'Activity')
    
    #Define the number of nodes in the full graph
    num_nodes = G.order()
    
    #Define the number of edges in the full graph
    num_edges = G.size()
    
    #Define a list of degree zero nodes and get the number
    isolates = list(nx.isolates(G))
    num_isolates = len(isolates)
    
    #Calculate the percentage of degree zero nodes in the graph
    pct_isolates = num_isolates/num_nodes
    
    #Put all of the relevant calculations/characteristics into a dictionary
    output = {'graph' : G, 'nodes' : num_nodes, 
              'edges' : num_edges, 'label frame' : label_frame,
              'pct_isolates' : pct_isolates, 'frame' : df,
              'keys' : key_dict, 'type' : cat_dict, 'activity' : act_dict}
    
    return output
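
Two of the steps inside `graph` are easy to check on a toy matrix: attaching key labels as node attributes and computing the share of degree zero nodes. The matrix and the key labels below are made up for illustration.

```python
import numpy as np
import networkx as nx

# Toy adjacency matrix: nodes 0 and 1 are linked, node 2 has no ties
adj = np.array([[0, 1, 0],
                [1, 0, 0],
                [0, 0, 0]])
G = nx.from_numpy_array(adj)

# Hypothetical key labels, one per row of the matrix
keys = ['101', '102', '103']
nx.set_node_attributes(G, dict(zip(G.nodes(), keys)), 'Keys')

# Degree zero nodes and their share of all nodes
isolates = list(nx.isolates(G))
pct_isolates = len(isolates) / G.order()

print(isolates, round(pct_isolates, 3))
print(G.nodes[2]['Keys'])
```

Despite its name, `pct_isolates` is a fraction between 0 and 1 rather than a percentage, both here and in the function above.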
    

The next function, called 'multi_values', creates a dictionary of the outputs of all the algorithms that return a unique value for each node in the graph (or connected subgraph if the graph is not connected). 

There is no output for the cell below; it is just used to return a dictionary when called in other functions.

In [9]:
#Defines a function that takes in a fully connected graph/sub graph and returns a dictionary of outputs
def multi_values(G):
    
    #Define the average shortest path length per node, rather than for the entire graph
    shortest_path = dict(nx.shortest_path_length(G))
    shortest_path = {k : np.mean([j for i,j in v.items() if i != k]) for k,v in shortest_path.items()}
    
    #Fall back to the numpy solver if the power iteration fails (e.g. it does not converge within max_iter)
    def eig(G):
        try:
            return nx.eigenvector_centrality(G)
        except Exception:
            return nx.eigenvector_centrality_numpy(G)
    
    #Define a dictionary that maps algorithm names to their outputs for G
    multi = {'betweenness centrality' : nx.betweenness_centrality(G),
             'eigenvector centrality' : eig(G),
             'shortest path' : shortest_path,
             'clustering' : nx.clustering(G),
             'closeness centrality' : nx.closeness_centrality(G),
             'katz centrality' : nx.katz_centrality_numpy(G),
             'load centrality' : nx.load_centrality(G),
             'eccentricity' : nx.eccentricity(G), 
             'degrees' : dict(G.degree()),
             'triangles' : nx.triangles(G),
             'square clustering' : nx.square_clustering(G),
             'subgraph centrality' : nx.subgraph_centrality(G)}
    
    #Return the dictionary
    return multi
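
The per-node average shortest path used in `multi_values` is easiest to see on a three-node path. This sketch repeats the same dictionary comprehension on a toy graph.

```python
import networkx as nx
import numpy as np

G = nx.path_graph(3)  # 0 - 1 - 2

# All-pairs shortest path lengths: {source: {target: length}}
lengths = dict(nx.shortest_path_length(G))

# Average distance from each node to every other node, excluding itself
mean_dist = {src: np.mean([d for tgt, d in targets.items() if tgt != src])
             for src, targets in lengths.items()}
print(mean_dist)
```

The middle node is on average one step from the others, while each end node averages one and a half steps. For a component with a single node the list is empty, so `np.mean` returns `nan`.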

Similar to the function above, the function single_values creates a dictionary of outputs that pertain to the entire graph (connected subgraph if the graph is not connected). 

There is no output from this cell, but the function is called in other functions to return the desired values.

In [10]:
def single_values(G):
    
    #Define a function that returns whether the component is an isolate within the greater graph
    def is_isolate(G):
        
        #Conditional for if the number of nodes is equal to 1 (i.e. is an isolate)
        if G.order() == 1:
            return "Yes"
        else:
            return "No"
        
    #Define a dictionary that maps algorithm names to their outputs for G
    single = {'diameter' : nx.diameter(G),
              'avg shortest path' : nx.average_shortest_path_length(G),
              'avg clustering' : nx.average_clustering(G),
              'wiener index' : nx.wiener_index(G),
              'radius' : nx.radius(G), 
              'density' : nx.density(G), 
              'transitivity' : nx.transitivity(G),
              'estrada index' : nx.estrada_index(G),
              'Edges' : G.size(), 'Nodes' : G.order(),
              'Isolate' : is_isolate(G)}
    
    return single
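
Several of these whole-graph measures are only defined on a connected graph, which is why `single_values` is applied to one connected component at a time. A minimal sketch on a toy graph with two components:

```python
import networkx as nx

# Two separate components: a path 0-1-2 and a single edge 3-4
G = nx.Graph([(0, 1), (1, 2), (3, 4)])

try:
    nx.diameter(G)
    whole_graph_ok = True
except nx.NetworkXError:
    # Distances between components are infinite, so networkx raises an error
    whole_graph_ok = False

# Computing the measure per component avoids the problem
diameters = [nx.diameter(G.subgraph(c)) for c in nx.connected_components(G)]
print(whole_graph_ok, sorted(diameters))
```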

The function below, make_frames, takes in a single graph as input and creates two dataframes, one for the algorithm outputs pertaining to each node, and one for the outputs pertaining to the entire graph. It does this by looping through all of the connected subgraphs, and calling the multi_values and single_values functions for each subgraph. After getting the dictionaries for each connected subgraph, the function converts the dicts to dataframes and then concatenates them.

The cell below has no output, but it can be called on its own, and is also called by the all_files function, returning the dataframes for each file.

In [11]:
def make_frames(G, attributes):
    
    #Initialize a dictionary containing the mapping from the node values to the key values
    keys = pd.DataFrame.from_dict(nx.get_node_attributes(G, 'Keys'), orient = 'index', dtype = str)
    
    #Name the Keys columns appropriately
    keys.reset_index(inplace = True)
    keys.rename(columns = {0 : 'Keys', 'index' : 'Node_Num'}, inplace = True)
    
    #Initialize the attributes specific to this file
    Village = attributes[0] #The village number
    Category = attributes[1] #The category (House/Person)
    Activity = attributes[2] #The interaction type/activity interaction recorded on
    
    #Initialize a list of attribute names, so that they can be entered as columns in the dataframes
    names = ['Village','Type','Activity']
    
    #Initialize empty dictionaries for the connected subgraph values to be stored in
    single_dfs = {}
    multi_dfs = {}
    
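    #Loop over the connected subgraphs of the graph for G
    #(nx.connected_component_subgraphs was removed in networkx 2.4, so the subgraphs are built explicitly)
    for i, component in enumerate(G.subgraph(c).copy() for c in nx.connected_components(G)):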
        
        nodes, edges = component.order(), component.size()
    
        #Call the single_values function to get a dictionary of single output values for the graph component
        single = pd.DataFrame.from_dict(single_values(component), orient = 'index').transpose()
        
        #Call the multi_values function to get the dictionary of node specific values for the graph component
        multi = pd.DataFrame.from_dict(multi_values(component))
        
        #Append node and edge numbers to the multi value dataframe
        multi['Nodes'] = nodes
        multi['Edges'] = edges
        
        #Loop over the list of attribute names to create new columns for the current dataframes
        for j, name in enumerate(names):
            single[name] = attributes[j]
            multi[name] = attributes[j]
        
        #Append the single values dataframe to the single_dfs dictionary
        single_dfs[str(i)] = single
        
        #Append the multi values dataframe to the multi_dfs dictionary
        multi_dfs[str(i)] = multi
        
        
    #Concatenate all of the connected subgraph single values dataframes into one so all nodes are present
    single_df = pd.concat(single_dfs.values(), ignore_index = True)
    
    #Concatenate all of the connected subgraph multi values dataframes into one so all nodes are present
    multi_df = pd.concat(multi_dfs.values())
    
    
    #Merge the keys dataframe to assign each of the node index values to the key present in the demographic frame
    multi_df = multi_df.merge(keys, how = 'left', left_index = True, right_on = 'Node_Num')
    
    return single_df, multi_df
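
The component-by-component pattern in `make_frames` can be tried on a toy graph. This sketch keeps only two per-node metrics and adds a `Component` column that the notebook's function does not have; it is an illustration of the approach, not a replacement for it.

```python
import networkx as nx
import pandas as pd

G = nx.Graph([(0, 1), (1, 2), (3, 4)])
G.add_node(5)  # an isolate forms its own one-node component

frames = []
for i, nodes in enumerate(nx.connected_components(G)):
    component = G.subgraph(nodes).copy()
    # A dict of {metric: {node: value}} becomes one row per node
    per_node = pd.DataFrame.from_dict({'degrees': dict(component.degree()),
                                       'clustering': nx.clustering(component)})
    per_node['Component'] = i
    per_node['Nodes'] = component.order()
    frames.append(per_node)

# Stack the per-component frames; the index still holds the original node ids
toy_multi = pd.concat(frames)
print(toy_multi)
```

Because the index carries the original node ids, the result can later be joined to the key labels, as the merge on `Node_Num` does above.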

The function below, called all_files, loops through either a specified number of files, or all the files in the adj_nameframe. For each file, it calls the make_frames function, and then concatenates the resulting dataframes into one dataframe for the multiple value outputs and one for the single value outputs.

The goal of this function is to create a single dataframe that can be merged with the house demographic and individual demographic data for each person/house in each village. Once this is done, analysis can be done to see if any of the demographic characteristics are correlated with characteristics of the resulting complex networks.

The cell below doesn't produce any output, but when the function is called it prints an estimate of the time remaining once every 25 files.

In [None]:
def all_files(num_files = 'all'):
    
    #Initialize variables to calculate how much time is remaining
    a = datetime.datetime.now()
    count, last_time = 0, 0
    
    #Initialize empty dictionaries to store the dataframes as the file names are iterated over
    single_dfs, multi_dfs = {}, {}
    
    #Define a conditional that decides whether to include a slice of the files or all of them
    if num_files == 'all':
        #Define the full list of file names to loop through
        files = list(adj_nameframe.Adj_Name.unique())
        num_iter = len(files)
        
    else:
        #Define a partial slice of all the file names to loop through
        files = list(adj_nameframe.Adj_Name.unique())[:num_files]
        num_iter = num_files
    
    #Loop over the predefined number of files
    for file in files:
        
        #Define the village, category, and activity, then store in a list to initialize the graph
        Village = str(adj_nameframe[adj_nameframe.Adj_Name == file]['Village'].values[0])
        Category = str(adj_nameframe[adj_nameframe.Adj_Name == file]['Type'].values[0])
        Activity = str(adj_nameframe[adj_nameframe.Adj_Name == file]['Activity'].values[0])
        attributes = [Village, Category, Activity]
        
        #Use the attributes list defined above to pass into the graph function
        output = graph(attributes)
        
        #Get the full graph from the dictionary returned by the graph function
        G = output['graph']
        
        #Use the make_frames function to make a concatenated dataframe for all components of the graph
        s_df, m_df = make_frames(G, attributes)
        
        #Store both the single value and multi value dataframes in dictionaries
        single_dfs[file] = s_df
        multi_dfs[file] = m_df
        
        #Set up the timing variables to print the time remaining every 25th iteration
        count += 1
        if count%25 == 1:
            b = datetime.datetime.now()
            time_remaining = ((b-a).total_seconds()*(num_iter/count) - last_time)/60
            print('Time remaining: ' + str(round(time_remaining, 2)) + ' minutes')
            last_time = (b-a).total_seconds()
    
    #After the iteration is finished, concatenate each of the single values dataframes into one
    single_df = pd.concat(single_dfs.values(), ignore_index = True)
    
    #After the iteration is finished, concatenate each of the multi values dataframes into one
    multi_df = pd.concat(multi_dfs.values(), ignore_index = True)
    
    #Return the fully concatenated dataframes
    return single_df, multi_df
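
The function above collects one dataframe per file in a dictionary and concatenates them only once, after the loop. A minimal sketch of that pattern in isolation is shown below; the file names and the toy columns are hypothetical stand-ins for what `make_frames` would return.

```python
import pandas as pd

# Hypothetical stand-ins for the per-file dataframes built inside the loop
frames = {}
for name, n_rows in [("file_a", 2), ("file_b", 3)]:
    frames[name] = pd.DataFrame({"file": [name] * n_rows,
                                 "value": list(range(n_rows))})

# Concatenate once at the end; ignore_index renumbers the rows from 0
combined = pd.concat(list(frames.values()), ignore_index=True)
print(combined.shape)
```

Concatenating once at the end avoids repeatedly copying a growing dataframe inside the loop.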
            

With all of the necessary functions defined, I can now loop through the files and create the single value and multi value dataframes. This takes a significant amount of time, but once it is done I can write each dataframe to a csv file. With the results stored in csv files, all I need to do when I restart the notebook is load them along with the demographic data, so this process only needs to run once.

The cell below only outputs the time remaining estimate.

In [None]:
#Call the all files function, and store the outputs as single_df and multi_df
single_df, multi_df = all_files()

Time remaining: 131.27 minutes
Time remaining: 145.98 minutes
Time remaining: 155.53 minutes
Time remaining: 192.12 minutes
Time remaining: 963.85 minutes
Time remaining: 1383.71 minutes
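
The estimates printed above are an extrapolation from the time spent so far. A minimal sketch of one way to compute such an estimate is shown below, assuming every file takes roughly the same time to process; the function name `estimate_minutes_remaining` and the example timestamps are hypothetical.

```python
import datetime

def estimate_minutes_remaining(start, now, done, total):
    """Average seconds per finished file, times the files left, in minutes."""
    elapsed = (now - start).total_seconds()
    return elapsed / done * (total - done) / 60

# Example: 25 of 100 files finished after 10 minutes of work
start = datetime.datetime(2020, 1, 1, 12, 0, 0)
now = start + datetime.timedelta(minutes=10)
remaining = estimate_minutes_remaining(start, now, done=25, total=100)
print(round(remaining, 2))
```

Because this is a linear extrapolation, the estimate drifts whenever later graphs are slower to process than earlier ones.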


After the characteristics of each graph are combined into a dataframe, I can write the dataframes to csv files.

In the cells below, I display the first 10 rows of each of the single/multi value dataframes.

In [None]:
#Display first 10 rows of the multi value dataframe
multi_df.head(10)

In [None]:
#Display the first 10 rows of the single value dataframe
single_df.head(10)

The cell below doesn't output anything; it just writes the dataframes to csv files.

In [None]:
#Write each of the dataframes to csv files for subsequent import and restarting notebook at this point
multi_df.to_csv('multi_value.csv', index = False)
single_df.to_csv('single_value.csv', index = False)
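
Once these files exist, a later session can reload them instead of rerunning the long loop. The sketch below shows one way to do that, assuming the csv files sit in the notebook's working directory; the helper `load_if_cached` is hypothetical and not part of the original notebook.

```python
import os
import pandas as pd

def load_if_cached(path):
    """Return the dataframe stored at path, or None if the file is missing."""
    if os.path.exists(path):
        return pd.read_csv(path)
    return None

# Reload the saved results; a value of None means the loop still has to be run
multi_df = load_if_cached('multi_value.csv')
single_df = load_if_cached('single_value.csv')
```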

The function below visualizes the graph of one of the adjacency matrices. It can take in graph characteristics, such as eigenvector centrality, and use them to distort the sizes or colors of the nodes.

The layout programs come from graphviz, which networkx can use. The function doesn't return anything, but it displays a graph when it is finished. The title of the graph is the graphviz program name.

The cell below just defines the function and doesn't have an output.

In [267]:
#Define a function to display the connected graph with distortion based on desired centrality measure
def graph_with_distortion(G, summary, label_frame, size = None, 
                          with_labels = False, labels = None, 
                          node_color = None, node_map = None,
                          prog = 'neato',label = None):
    
    #Conditional for deciding what the size_list passed into the graph function will be
    if size:
        
        #Define the list of sizes from the measure in the summary dictionary
        size_list = list(summary[size].values())
    
        #Scale the size list so it is appropriate for the graph
        size_list = [(300.0/np.mean(size_list))*x for x in size_list]
        
    else:
        
        #If no size metric is passed, set the size_list to be passed to none
        size_list = None
        
    #Define a conditional that determines what to make the color passed into the graph function
    if node_color:
        
        #Color takes on the values of the metric passed as node_color
        color = list(summary[node_color].values())
    
    else:
        
        #No color values are passed
        color = None
    
    #Define a conditional to map the desired node labels to the nodes in the graph
    if labels:
        
        #Use the label_frame passed into the function and map from node numbers
        label_dict = label_frame.set_index('adjmatrix_key')[labels].to_dict()
        label_dict = {int(k) : v for k,v in label_dict.items() if int(k) in G}
    
    else:
        
        #If labels is none, pass none
        label_dict = None
    
    #Define the position based on the available graphviz programs
    pos = nx.drawing.nx_agraph.graphviz_layout(G, prog = prog)
    
    #Create a matplotlib figure
    fig = plt.figure(figsize = (10,10))
    #Define the title to be the Graphviz program used
    plt.title(prog)
    
    #Use networkx to draw the network at the positions computed by the graphviz layout
    nx.draw_networkx(G, with_labels = with_labels, node_color = color,
                     cmap = node_map, labels = label_dict, 
                     node_size = size_list, pos = pos, label = label)
    return None
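
The size scaling inside the function can be checked on its own. The sketch below applies the same `300.0/np.mean(...)` rescaling to eigenvector centrality values; it uses networkx's built-in karate club graph as a stand-in for a village graph, and the names `G_demo`, `raw`, and `sizes` are only for illustration.

```python
import networkx as nx
import numpy as np

# Stand-in graph; the notebook would use a graph built from an adjacency matrix
G_demo = nx.karate_club_graph()

# Centrality values listed in the same order as the nodes of the graph
centrality = nx.eigenvector_centrality(G_demo, max_iter=1000)
raw = [centrality[n] for n in G_demo.nodes]

# Rescale so the average node size is 300 while relative differences are kept
sizes = [(300.0 / np.mean(raw)) * x for x in raw]
print(len(sizes), round(float(np.mean(sizes)), 6))
```

Dividing by the mean keeps the average node at a readable size regardless of the centrality measure's scale, so more central nodes appear proportionally larger.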

The cell below loops through two of the graphviz layout programs and displays a graph for each.

In [None]:
progs = ['neato', 'dot']
for prog in progs:
    graph_with_distortion(G, summary, label_frame, node_map = plt.cm.Reds, 
                          node_color = 'eigenvector centrality', prog = prog, label = prog)