# Simplicial Complex Analysis

<span style='color:Red'>TODO: Get more robust numbers working for when to display summary information for shapes as a whole and the largest shape.</span> <br>
<span style='color:Red'> Currently using the largest number of nodes in each shape. </span> <br> 
Written by Frederick Miller, Casey McKean, and Wako Bungula <br>
The kepler mapper object gives an output that is not easily navigable. To resolve this, we wish to create shapes that are easier to navigate and understand, and that reveal the data inside of them. <br>
We generate all the shapes (connected components) in the simplicial complex, condense 1-simplices where possible, and obtain summary statistics on the shapes and the nodes within the shapes.

In [1]:
import numpy as np
import pandas as pd 
import queue
import json
pd.set_option('display.max_rows', None)
print("Imports Done")

Imports Done


# File paths and `.json` import

From the `kmapper_demo` file, I added one extra code block to place the results in a `.json` file, which is a way to keep dictionaries in long-term storage. Only the file paths in the code below need to be changed; it then reads the simplicial complexes generated by kepler mapper. <br>
Here, we also import the actual data set, with data interpolated for the specific pool. <br>
Lastly, a list of the 11 continuous variables (the interpolated versions) is created.

In [2]:
jsonFilePath = r"C:\Users\forre\Desktop\REU\TDA\Data\TDAOutputs\2PCA30Perc7Cube.json"
# The context manager closes the file even if json.load raises
with open(jsonFilePath, "r") as jsonFile:
    jsonData = json.load(jsonFile)

dataFilePath = r"C:\Users\forre\Desktop\REU\TDA\Data\interpolatedPool26.csv"
df = pd.read_csv(dataFilePath)

variables = ['PREDICTED_TN',
             'PREDICTED_TP',
             'PREDICTED_TEMP',
             'PREDICTED_DO',
             'PREDICTED_TURB',
             'PREDICTED_COND',
             'PREDICTED_VEL',
             'PREDICTED_SS',
             'PREDICTED_WDP',
             'PREDICTED_CHLcal',
             'PREDICTED_SECCHI']
print("Json file imported")

print(len(jsonData.keys()))

Json file imported
25
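
For readers without the original files, the following is a minimal sketch of the layout the rest of the notebook appears to rely on: each top-level key maps to one simplicial complex, which holds a `nodes` dictionary (node name to row indices) and a `simplices` list (one-element lists for vertices, two-element lists for edges). The key name, node names, and indices below are made up for illustration, and the real kepler mapper output may carry additional fields.

```python
import json

# Hypothetical toy data mirroring the layout used in this notebook.
toyJsonData = {
    "Toy Stratum": {
        "nodes": {
            "cube0_cluster0": [0, 1, 2],
            "cube1_cluster0": [2, 3],
            "cube2_cluster0": [7],
        },
        "simplices": [
            ["cube0_cluster0"],
            ["cube1_cluster0"],
            ["cube2_cluster0"],
            ["cube0_cluster0", "cube1_cluster0"],
        ],
    }
}

# Round-trip through JSON text to show that the structure survives storage.
restored = json.loads(json.dumps(toyJsonData))

toyComplex = restored["Toy Stratum"]
# Edges (1-simplices) are the simplices with exactly two nodes.
toyEdges = [s for s in toyComplex["simplices"] if len(s) == 2]
print(len(restored.keys()))  # number of complexes stored
print(toyEdges)
```

Because JSON only stores lists and dictionaries with string keys, the node names and index lists come back unchanged after loading.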


# Functions
See each function's docstring for what it does and how it works.

In [3]:
def getSubdf(scomplex, shape, df):
    """
    Returns the part of the data frame from the particular shape in the simplicial complex.
    params:
    scomplex: the entire simplicial complex
    shape: the particular shape being inspected (within the simplicial complex)
    df: the entire data frame
    
    Description:
    1. Get all the nodes from the particular simplicial complex. 
    2. Generate the indices we care about from the particular shape. To do this, we read each node and append its 
    indices to a list. Then, we convert the list to a set and then back to a list to eliminate duplicates.
    3. Return the dataframe with only those indices.
    """
    nodes = scomplex.get('nodes')
    indices = []
    npShape = np.array(shape).flatten()  # do this in the shapes function!!
    for node in npShape:
        indices.append(nodes.get(node))
    indices = list(set([item for sublist in indices for item in sublist]))
    subdf = df.loc[indices]
    return subdf

def shapeDataSummary(scomplex, shape, df, variables, verbose = False):
    """
    Generates summary statistics of the given variables for a given shape in the simplicial complex.
    params:
    scomplex: the entire simplicial complex
    shape: the particular shape being inspected (within the simplicial complex) at this function call.
    df: the entire dataframe
    variables: the variables of interest
    verbose: Determines if the function will print out extra information. False by default
    
    Description:
    1. Create an empty result dataframe to store the summary statistics.
    2. Get the sub dataframe (see getSubdf) for the particular shape
    3. For each variable we are analyzing, generate summary statistics from the sub dataframe and place them
    inside the result dataframe.
    4. Return the result dataframe
    
    NOTE: this only creates summaries for one particular shape. In executing this method, it is done for each shape 
    outside of the function.
    
    """
    result = pd.DataFrame()
    if verbose:
        print("Obtaining sub dataframe for: ", shape)
        print("The number of nodes in this shape is: ", len(shape))
    subdf = getSubdf(scomplex, shape, df)
    if verbose:
        print("The number of datapoints in this shape is: ", subdf.shape[0])
    for var in variables:
        result[var] = subdf[var].describe()
    return result
    
    

def adjacent(v, scomplex):
    """
    Determines the nodes adjacent to a given vertex
    
    params:
    v: vertex
    scomplex: the entire simplicial complex
    
    Description:
    Determines the nodes that are adjacent to a given vertex.
    """
    
    simplices = scomplex.get('simplices')
    edges = [item for item in simplices if len(item) == 2]
    result = []
    for edge in edges:
        if v in edge:
            for item in edge:
                if item != v:
                    result.append(item)
    return result

def bfs(node, scomplex):
    """
    Conducts a breadth first search to obtain the entire shape from a given node
    params:
    node: the start node
    scomplex: the entire simplicial complex
    
    Description:
    Performs a breadth first search to obtain the entire shape for a given start node.
    """
    Q = queue.Queue()
    result = []
    result.append(node)
    Q.put(node)
    while not Q.empty():
        v = Q.get()
        adjacentEdges = adjacent(v, scomplex)
        for edge in adjacentEdges:
            if edge not in result:
                result.append(edge)
                Q.put(edge)
    return result


        
    
def getShapes(scomplex):
    """
    Gets all of the shapes from a given simplicial complex.
    
    params:
    scomplex: the entire simplicial complex
    
    Description:
    1. Obtain all the nodes for the entire complex
    2. For each node, perform a breadth first search to obtain everything in that particular shape. 
    If this entire shape has not already been discovered, add it to the set of results. 
    The result item is a set as the order of the shapes does not matter. The resulting shape is a frozenset
    which means items cannot be added or removed once created, and is needed to allow the set object to have other sets within it.
    3. Convert each shape to a list and the result to a list for easier navigation outside of the function.
    4. Return the result
    
    """
    
    nodes = list(scomplex.get('nodes').keys())
    result = set()
    for node in nodes: # currently does more computations than necessary due to going through every node without considering it is already in a shape
        bfsResult = frozenset(bfs(node, scomplex))
        result.add(bfsResult)
    result = [list(x) for x in result]
    # Sort the list depending on what is decided: nodes or indices. Currently doing it by number of nodes
    result.sort(key = len, reverse = True)
    
    
    return result

def nodeDataSummary(node, scomplex, variables,df):
    """
    Returns a data summary of a particular node
    params:
    node: node in question
    scomplex: The entire simplicial complex
    variables: The variables to obtain summaries
    df: the entire dataframe 
    
    description:
    1. Creates a result dataframe
    2. Get all the indices from the node from the simplicial complex
    3. Generate summaries for each variable
    4. Return the result
    """
    result = pd.DataFrame()
    indices = scomplex.get('nodes').get(node)
    subdf = df.loc[indices]
    for var in variables:
        result[var] = subdf[var].describe()
    return result
    
    
def condenseShape(shape, scomplex):
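    """
    Condenses a two-node shape to a single node when the indices of one node are a subset of the other's.
    
    params:
    shape: a shape (list) of exactly two nodes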
    scomplex: the entire simplicial complex
    
    description:
    gets the two nodes a and b
    gets the indices for a and b (what is inside the nodes)
    if a \subseteq b, return b
    elif b \subseteq a, return a 
    else return shape 
    
    """
    nodes = scomplex.get('nodes')
    a = shape[0]
    b = shape[1]
    aIndices = set(nodes.get(a))
    bIndices = set(nodes.get(b))
    
    if aIndices.issubset(bIndices):
        return b
    elif bIndices.issubset(aIndices):
        return a
    else:
        return shape

def clean_getShapes(scomplex):
    """
    Condenses 1-simplices down to 0-simplices when one node 
    is a subset of the other 
    
    params:
    scomplex: the entire simplicial complex
    
    Description:
    1. Get all the shapes from the original getShapes function
    2. For shapes of length 2, if one node is a subset of the other, keep only the larger of the two.
        Otherwise, do nothing
    3. return the clean Shapes list 
    
    """
    shapes = getShapes(scomplex)
    cleanShapes = []
    for shape in shapes:
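        if len(shape) == 2:
            condensed = condenseShape(shape, scomplex)
            # condenseShape returns one node name, or the original two-node list if nothing was condensed
            if isinstance(condensed, list):
                cleanShapes.append(condensed)
            else:
                cleanShapes.append([condensed])
        else:
            cleanShapes.append(shape)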
    return cleanShapes
print("Functions loaded")

Functions loaded
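
The breadth first search described in the docstrings can be illustrated on a toy complex. The sketch below is not the notebook's implementation: it builds an adjacency map once and skips nodes that already belong to a discovered shape, which avoids the repeated searches noted in the comment inside `getShapes`. The names `buildAdjacency`, `connectedShapes`, and the toy complex are hypothetical.

```python
from collections import deque

def buildAdjacency(scomplex):
    """Map each node to the set of nodes it shares an edge with."""
    adjacency = {node: set() for node in scomplex["nodes"]}
    for simplex in scomplex["simplices"]:
        if len(simplex) == 2:
            a, b = simplex
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def connectedShapes(scomplex):
    """Return the connected components (shapes), largest node count first."""
    adjacency = buildAdjacency(scomplex)
    visited = set()
    shapes = []
    for start in scomplex["nodes"]:
        if start in visited:
            continue  # already part of a shape found earlier
        visited.add(start)
        shape = []
        pending = deque([start])
        while pending:
            v = pending.popleft()
            shape.append(v)
            for neighbor in sorted(adjacency[v]):
                if neighbor not in visited:
                    visited.add(neighbor)
                    pending.append(neighbor)
        shapes.append(shape)
    shapes.sort(key=len, reverse=True)
    return shapes

# Made-up complex: a-b-c form one shape, d is isolated.
toyComplex = {
    "nodes": {"a": [0, 1], "b": [1, 2], "c": [2, 3], "d": [9]},
    "simplices": [["a"], ["b"], ["c"], ["d"], ["a", "b"], ["b", "c"]],
}
toyShapes = connectedShapes(toyComplex)
print(toyShapes)
```

Tracking visited nodes in a set means each node is searched from at most once, whereas starting a search from every node repeats the work for every member of a shape.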


# Generating Summary Statistics on the entire simplicial complex
For each `mapper` output from `kepler-mapper`, we can generate the summary statistics for each of the continuous variables. This is done by first obtaining a list of the keys from the `.json` file, and then iterating through each complex, generating its shapes and obtaining data summaries on each shape.

In [4]:
allComplices = list(jsonData.keys())
for key in allComplices: # covers all strata and time periods; slice allComplices (e.g. [0:1]) to inspect a subset
    print("Current Simplical Complex: ", key)
    scomplex = jsonData.get(key)
    shapes = clean_getShapes(scomplex)
    for shape in shapes:
        summaries = shapeDataSummary(scomplex, shape, df, variables, verbose = False)
        if summaries.loc['count'].iloc[0] > 5 and len(shape)  > 2: # at least 6 datapoints and 3 nodes to see info
            print("The shape is: ",shape)
            print("The number of nodes in the shape is: ", len(shape))
            #display(summaries) # Uncomment to see summaries

Current Simplical Complex:  ['Stratum 1 SUMMER 93-00: ']
The shape is:  ['cube17_cluster2', 'cube21_cluster0', 'cube4_cluster0', 'cube7_cluster0', 'cube22_cluster0', 'cube23_cluster1', 'cube21_cluster2', 'cube16_cluster2', 'cube21_cluster1', 'cube10_cluster1', 'cube16_cluster0', 'cube15_cluster1', 'cube9_cluster0', 'cube14_cluster0', 'cube15_cluster0', 'cube9_cluster1', 'cube15_cluster4', 'cube21_cluster4', 'cube14_cluster3', 'cube23_cluster0', 'cube3_cluster0', 'cube24_cluster3', 'cube14_cluster2', 'cube14_cluster6', 'cube14_cluster1', 'cube16_cluster3', 'cube8_cluster0']
The number of nodes in the shape is:  27
Current Simplical Complex:  ['Stratum 2 SUMMER 93-00: ']
The shape is:  ['cube32_cluster0', 'cube30_cluster0', 'cube12_cluster1', 'cube4_cluster0', 'cube2_cluster0', 'cube12_cluster0', 'cube25_cluster0', 'cube11_cluster0', 'cube23_cluster1', 'cube19_cluster3', 'cube10_cluster0', 'cube18_cluster3', 'cube3_cluster1', 'cube13_cluster0', 'cube16_cluster0', 'cube31_cluster0', 'cube
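
To make the pooling step behind the shape summaries concrete, here is a small sketch on made-up data. It uses the standard library's `statistics` module as a stand-in for the pandas `describe()` call used in the notebook; the variable values, node names, and indices are illustrative assumptions and do not come from the pool data set.

```python
import statistics

# Made-up values for one variable, keyed by row index.
toyTemp = {0: 10.0, 1: 12.0, 2: 14.0, 3: 16.0, 4: 30.0}
toyNodes = {"n0": [0, 1, 2], "n1": [2, 3]}
toyShape = ["n0", "n1"]

# Pool the row indices of all nodes; the set drops the duplicate row 2.
pooled = sorted({idx for node in toyShape for idx in toyNodes[node]})
values = [toyTemp[idx] for idx in pooled]

toySummary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "min": min(values),
    "max": max(values),
}
print(pooled)
print(toySummary)
```

Row 2 belongs to both nodes but is counted once, so the count reflects distinct data points rather than the sum of the node sizes.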

# Analyzing the largest structure
Here, "largest" means the shape with the most nodes. The largest structure is likely to be the dominant feature of the stratum during this particular time period, so it is important to analyze the nodes within it. To do this, we generate all the shapes and, since they are returned in descending order of node count, take the first one. From there, we can perform an analysis on each of its nodes.

In [5]:
allComplices = list(jsonData.keys())
for key in allComplices: # covers all strata and time periods; slice allComplices (e.g. [0:1]) to inspect a subset
    print("Current Simplical Complex: ", key)
    scomplex = jsonData.get(key)
    largestShape = clean_getShapes(scomplex)[0]
    npLargestShape = np.array(largestShape).flatten()
    nodes = scomplex.get('nodes')
    print("Largest shape is: ", largestShape)
    print("Number of nodes is: ", len(largestShape))
    for node in npLargestShape:
        summary = nodeDataSummary(node, scomplex,variables,df)
        if summary.loc['count'].iloc[0] > 5: # 5 is chosen arbitrarily
            print("Information for: ", node)
            #display(summary)

Current Simplical Complex:  ['Stratum 1 SUMMER 93-00: ']
Largest shape is:  ['cube17_cluster2', 'cube21_cluster0', 'cube4_cluster0', 'cube7_cluster0', 'cube22_cluster0', 'cube23_cluster1', 'cube21_cluster2', 'cube16_cluster2', 'cube21_cluster1', 'cube10_cluster1', 'cube16_cluster0', 'cube15_cluster1', 'cube9_cluster0', 'cube14_cluster0', 'cube15_cluster0', 'cube9_cluster1', 'cube15_cluster4', 'cube21_cluster4', 'cube14_cluster3', 'cube23_cluster0', 'cube3_cluster0', 'cube24_cluster3', 'cube14_cluster2', 'cube14_cluster6', 'cube14_cluster1', 'cube16_cluster3', 'cube8_cluster0']
Number of nodes is:  27
Information for:  cube21_cluster0
Information for:  cube22_cluster0
Information for:  cube16_cluster2
Information for:  cube15_cluster1
Information for:  cube9_cluster0
Information for:  cube23_cluster0
Information for:  cube14_cluster2
Information for:  cube8_cluster0
Current Simplical Complex:  ['Stratum 2 SUMMER 93-00: ']
Largest shape is:  ['cube32_cluster0', 'cube30_cluster0', 'cube12
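
Ranking by node count is only one possible notion of "largest". As a hedged sketch of an alternative, the example below ranks shapes by the number of distinct data rows they cover. The helper `uniqueRowCount` and the toy data are hypothetical and are not part of the notebook's functions.

```python
def uniqueRowCount(shape, nodes):
    """Number of distinct data rows covered by the nodes of a shape."""
    return len({idx for node in shape for idx in nodes[node]})

# Made-up nodes: three nodes sharing one row, and two nodes covering seven rows.
toyNodes = {
    "a": [0], "b": [0], "c": [0],
    "d": [1, 2, 3, 4, 5], "e": [5, 6, 7],
}
toyShapes = [["a", "b", "c"], ["d", "e"]]

byNodeCount = sorted(toyShapes, key=len, reverse=True)
byRowCount = sorted(toyShapes, key=lambda s: uniqueRowCount(s, toyNodes), reverse=True)

print(byNodeCount[0])
print(byRowCount[0])
```

The two orderings can disagree: a shape with many nodes may cover very few rows, so the choice of key changes which shape is treated as the dominant feature.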

# Condensing 1-simplices
Currently, many of the 1-simplices connect two nodes where the data in one node is a subset of the data in the other. To resolve this, we replace such a pair with the single node that contains all of the indices.

This is implemented in the `clean_getShapes(scomplex)` function. Below is a comparison of running the two functions.

In [6]:
allComplices = list(jsonData.keys())
print("Standard shape version")
for key in allComplices[0:1]:
    print("Current Simplical Complex: ", key)
    scomplex = jsonData.get(key)
    nodes = scomplex.get('nodes')
    shapes = getShapes(scomplex)
    for shape in shapes:
        indices = []
        for node in shape:
            indices.append(nodes.get(node))
        indices = list(set([item for sublist in indices for item in sublist]))
        print(str(shape) + " : " + str(indices))

print("Clean shape version")
for key in allComplices[0:1]:
    print("Current Simplical Complex: ", key)
    scomplex = jsonData.get(key)
    nodes = scomplex.get('nodes')
    cleanShapes = clean_getShapes(scomplex)
    for shape in cleanShapes:
        indices = []
        for node in shape:
            indices.append(nodes.get(node))
        indices = list(set([item for sublist in indices for item in sublist]))
        print(str(shape) + " : " + str(indices))

Standard shape version
Current Simplical Complex:  ['Stratum 1 SUMMER 93-00: ']
['cube17_cluster2', 'cube21_cluster0', 'cube4_cluster0', 'cube7_cluster0', 'cube22_cluster0', 'cube23_cluster1', 'cube21_cluster2', 'cube16_cluster2', 'cube21_cluster1', 'cube10_cluster1', 'cube16_cluster0', 'cube15_cluster1', 'cube9_cluster0', 'cube14_cluster0', 'cube15_cluster0', 'cube9_cluster1', 'cube15_cluster4', 'cube21_cluster4', 'cube14_cluster3', 'cube23_cluster0', 'cube3_cluster0', 'cube24_cluster3', 'cube14_cluster2', 'cube14_cluster6', 'cube14_cluster1', 'cube16_cluster3', 'cube8_cluster0'] : [0, 1, 2, 5, 11, 12, 13, 14, 15, 17, 18, 19, 21, 22, 24, 25, 26, 27, 29, 33, 34, 35, 37, 39, 40, 41, 42, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 61, 62, 63, 64, 65, 69, 70, 71, 72, 73, 74, 76, 77, 78, 87, 88, 104, 105, 107, 108, 109, 110, 111, 112, 113, 114, 116, 117, 118, 119]
['cube19_cluster1', 'cube20_cluster1', 'cube12_cluster1', 'cube13_cluster1'] : [91]
['cube31_cluster0', 'cube25

TypeError: unhashable type: 'list'
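
One plausible explanation for the `TypeError` above is that a two-node shape that could not be condensed ends up wrapped in an extra list, so the node lookup receives a list instead of a node name. The sketch below illustrates the condensing rule with a hypothetical helper, `condensePair`, that always returns a flat list of node names, and then reproduces the failure on a nested list. The toy nodes are made up.

```python
def condensePair(shape, nodes):
    """Condense a two-node shape; always return a flat list of node names."""
    a, b = shape
    aIndices, bIndices = set(nodes[a]), set(nodes[b])
    if aIndices <= bIndices:
        return [b]          # a is contained in b, keep b
    if bIndices <= aIndices:
        return [a]          # b is contained in a, keep a
    return list(shape)      # neither contains the other, keep both

toyNodes = {"a": [1, 2], "b": [1, 2, 3], "c": [4, 5], "d": [5, 6]}

print(condensePair(["a", "b"], toyNodes))  # one node absorbs the other
print(condensePair(["c", "d"], toyNodes))  # neither is a subset

# A nested list such as [["c", "d"]] hands a list to dict.get,
# which fails because lists are unhashable.
nested = [["c", "d"]]
try:
    for node in nested:
        toyNodes.get(node)
    lookupError = None
except TypeError as err:
    lookupError = str(err)
print(lookupError)
```

Returning the same type in every branch keeps the calling loop simple, since every shape can then be iterated as a list of node names.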

# Unique Samples
Something Killian was talking about

In [None]:
allComplices = list(jsonData.keys())
for key in allComplices[0:1]: # remove the indices here to get all the strata for all the time periods
    print("Current Simplical Complex: ", key)
    scomplex = jsonData.get(key)
    
