# goo.gl/yFU5aA

# A graph processing engine for Genomics on Ray


###### Devin Petersohn

## Problem Statement


##### System Problems
* Integrating datasets -- hard!
    * Datasets are massive
    * Variety can be a challenge to manage
    * Structure -- variable! (schemas)
* New datasets frequently available
* Variety of different relationships -- hard to model in current architectures
    * e.g. locality in DNA: 3D vs linear
* Compact representation (only have to represent variation)
* Machine Learning is challenging across datasets

##### User Problems
* Adoption
    * Faster is not enough
    * Users want new capabilities
* API/Architecture
    * Complicated architectures die (with caveats)
    * Must be as simple as possible
    

# Architecture

## Architecture

##### Genome graph architecture/motivation

![](GenomeGraph.png)

## Architecture

##### Layered graph architecture/motivation

![](Layers.png)

## Architecture

##### Layered graph architecture representation

![](Graph_hierarchy.png)

## API Goal: Simplicity

#### Load, Parse, Label neighbors, and add to graph
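
The four steps can be sketched without Ray or the demo's `graphlib`. This is a minimal, hypothetical stand-in: `TinyGraph` and its `add_node` method are invented for illustration, and the `Edge` field names are assumptions based on how edges are built and read in the demo (the middle field, called `weight` here, is a guess; the demo always passes `0`).

```python
from collections import namedtuple

# Hypothetical stand-in for the demo's Edge type.
# "weight" is an assumed name for the middle argument.
Edge = namedtuple("Edge", ["destination", "weight", "orientation"])


class TinyGraph:
    """A plain in-memory graph: node data plus an adjacency list per key."""

    def __init__(self):
        self.nodes = {}
        self.adjacency = {}

    def add_node(self, key, data, neighbors=()):
        self.nodes[key] = data
        self.adjacency.setdefault(key, []).extend(neighbors)


# Load
sequence = "CAGT"

graph = TinyGraph()
for i, base in enumerate(sequence):
    # Parse
    coordinate = float(i)

    # Label neighbors
    neighbors = []
    if i != 0:
        neighbors.append(Edge(float(i - 1), 0, "left"))
    if i != len(sequence) - 1:
        neighbors.append(Edge(float(i + 1), 0, "right"))

    # Add to graph
    graph.add_node(coordinate, base, neighbors)

print(graph.nodes[1.0], [e.orientation for e in graph.adjacency[1.0]])
```

Every dataset in the demo follows this same loop; only the parse and neighbor-labelling steps change.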

## Why Ray?

* Asynchronous model is attractive for graphs
* Python is a plus -- it is widely used in genomics
* Simple API and design philosophy
* Relatively low overhead
* RISELab project

# Demo

In [1]:
from graphlib import *
import ray

# Start Ray
ray.init()

Waiting for redis server at 127.0.0.1:19174 to respond...
Waiting for redis server at 127.0.0.1:34174 to respond...
Starting local scheduler with 8 CPUs, 0 GPUs

View the web UI at http://localhost:8911/notebooks/ray_ui62910.ipynb?token=ebff4df804f48920d03ea4b247dbfa97fd7288c9693a7810



{'local_scheduler_socket_names': ['/tmp/scheduler30709081'],
 'node_ip_address': '127.0.0.1',
 'object_store_addresses': [ObjectStoreAddress(name='/tmp/plasma_store55243114', manager_name='/tmp/plasma_manager24476538', manager_port=56323)],
 'redis_address': '127.0.0.1:19174',
 'webui_url': 'http://localhost:8911/notebooks/ray_ui62910.ipynb?token=ebff4df804f48920d03ea4b247dbfa97fd7288c9693a7810'}

## Starting the system: building a genome graph

In [2]:
# Create a graph collection
graph_collection = Graph_collection()

# Create a new graph called "genome_graph"
graph_collection.add_graph("genome_graph")

## Populating the genome graph

#### Step 1: Start with a reference genome

In [3]:
# Load
reference_genome = "CAGTCCTAGCTACGCTCTATCCTCTCAGAGGACCGATCGATATACGCGTGAAACTAGTGCACTAGACTCGAACTGA"

for i in range(len(reference_genome)):
    # Parse
    coordinate = float(i)
    
    # Label neighbors
    neighbors = []
    
    if i != 0:
        neighbors.append(Edge(float(i - 1), 0, "left"))
    if i != len(reference_genome) - 1:
        neighbors.append(Edge(float(i + 1), 0, "right"))

    # create a new node
    node = Node(reference_genome[i])

    # store a link to the object in the masterStore
    graph_collection.add_node_to_graph("genome_graph", 
                                       coordinate, 
                                       node, 
                                       neighbors)



## Populating the genome graph

#### Step 2: Add the variation data

In [4]:
# Load
dna_test_data = [{"individualID":0, "dnaData":
                    [{"coordinateStart":7.1, 
                          "coordinateStop":8.0, 
                          "variantAllele": "C"},
                     {"coordinateStart":12.2, 
                          "coordinateStop":13.0, 
                          "variantAllele": "T"},
                     {"coordinateStart":26.2222, 
                          "coordinateStop":27.0, 
                          "variantAllele": "TTTT"}]},
                 {"individualID":1, "dnaData":
                    [{"coordinateStart":7.2, 
                          "coordinateStop":8.0, 
                          "variantAllele": "G"},
                     {"coordinateStart":12.2, 
                          "coordinateStop":13.0, 
                          "variantAllele": "T"}]}]

for indiv in dna_test_data:
    for variant in indiv["dnaData"]:

        # Parse
        coordinate = variant["coordinateStart"]
        node = Node(variant["variantAllele"])

        # Label neighbors
        left_conn = Edge(float(int(coordinate) - 1), 0, "left")
        right_conn = Edge(float(int(variant["coordinateStop"])), 0, "right")
        neighbors = [left_conn, right_conn]

        connections_to_other_graphs = {}
        connections_to_other_graphs["individuals"] = indiv["individualID"]

        # Add to graph
        graph_collection.add_node_to_graph("genome_graph",
                                           coordinate,
                                           node,
                                           neighbors,
                                           connections_to_other_graphs)
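
Because only the differences from the reference are stored, an individual's sequence can be rebuilt by substituting that individual's variants into the reference. The sketch below does this in plain Python rather than through the graph API. It assumes, based on the neighbor labels above, that a variant starting at coordinate `c` replaces the reference bases from `int(c)` up to but not including `int(coordinateStop)`. `apply_variants` is a hypothetical helper, and the reference is a shortened toy string.

```python
def apply_variants(reference, variants):
    """Return `reference` with each variant allele substituted in.

    Assumes the variants do not overlap one another.
    """
    pieces = []
    cursor = 0
    for variant in sorted(variants, key=lambda v: v["coordinateStart"]):
        start = int(variant["coordinateStart"])
        stop = int(variant["coordinateStop"])
        pieces.append(reference[cursor:start])   # unchanged reference bases
        pieces.append(variant["variantAllele"])  # the individual's allele
        cursor = stop
    pieces.append(reference[cursor:])            # tail after the last variant
    return "".join(pieces)


toy_reference = "CAGTCCTAGCTACGC"
toy_variants = [
    {"coordinateStart": 7.1, "coordinateStop": 8.0, "variantAllele": "C"},
    {"coordinateStart": 12.2, "coordinateStop": 13.0, "variantAllele": "TTTT"},
]

individual_sequence = apply_variants(toy_reference, toy_variants)
print(individual_sequence)
```

The fractional part of the coordinate (`7.1` vs. `7.2`) only distinguishes alternative alleles at the same position; under the assumption above it does not affect where the allele is placed.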



## Simple query of the data

#### Let's see what the graph has for each individual

In [5]:
def query_for_individuals(individual_id):
    genome_graph = graph_collection.get_graph("genome_graph")
    connections = ray.get(genome_graph
                          .get_inter_graph_connections
                          .remote())
    # Fetch the coordinate -> object ID mapping once, not once per match
    oid_dictionary = ray.get(genome_graph
                             .get_oid_dictionary
                             .remote())

    for coordinate in connections:
        if "individuals" not in connections[coordinate]:
            continue

        if individual_id in ray.get(connections[coordinate]["individuals"]):
            # print the coordinate and the data in the node
            print(str(coordinate), "\t",
                  str(ray.get(oid_dictionary[coordinate]).data))

print("Individual 0")
query_for_individuals(0)

print("\nIndividual 1")
query_for_individuals(1)

individuals_graph_conns = ray.get(graph_collection.get_graph("individuals")
                             .get_inter_graph_connections.remote())

print("\nIndividuals graph (for bi-directionality):")
for indiv in individuals_graph_conns:
    print(str(individuals_graph_conns[indiv]))


Individual 0
7.1 	 C
12.2 	 T
26.2222 	 TTTT

Individual 1
12.2 	 T
7.2 	 G

Individuals graph (for bi-directionality):
{'genome_graph': ObjectID(16e9f8d75af74c6e8c1c9a102875927eed8d5352)}
{'genome_graph': ObjectID(9583c2033d4e964c09966ed8b0ef8ea16687c5b2)}
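
The last two lines of output show that linking a genome-graph node to an individual also records the reverse link on the individual's side. A minimal sketch of that bookkeeping follows, with plain dictionaries in place of Ray object IDs. `connect` and the nested-dict layout are assumptions for illustration, not the library's internals.

```python
from collections import defaultdict

# inter_graph[graph][key][other_graph] -> set of keys in other_graph
inter_graph = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))


def connect(graph_a, key_a, graph_b, key_b):
    """Record a link between two nodes in both directions."""
    inter_graph[graph_a][key_a][graph_b].add(key_b)
    inter_graph[graph_b][key_b][graph_a].add(key_a)


# Variant nodes from the demo, linked to the individuals that carry them.
connect("genome_graph", 7.1, "individuals", 0)
connect("genome_graph", 12.2, "individuals", 0)
connect("genome_graph", 12.2, "individuals", 1)
connect("genome_graph", 7.2, "individuals", 1)

# Forward direction: who carries the variant at 12.2?
print(sorted(inter_graph["genome_graph"][12.2]["individuals"]))
# Reverse direction: which variant coordinates does individual 1 have?
print(sorted(inter_graph["individuals"][1]["genome_graph"]))
```

Keeping both directions means a query can start from either graph without scanning the other.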


## Populating the individuals graph

In [6]:
individuals = {0: {"Name":"John Doe", "Gender":"M"}, 
               1: {"Name":"Jane Doe", "Gender":"F"}}

for indiv_id, data in individuals.items():
    node = Node(data)
    graph_collection.add_node_to_graph("individuals", indiv_id, node)

for indiv_id in individuals:
    print(ray.get(ray.get(graph_collection.get_graph("individuals")
                                          .get_oid_dictionary
                                          .remote())[indiv_id]).data)


{'Name': 'John Doe', 'Gender': 'M'}
{'Name': 'Jane Doe', 'Gender': 'F'}


## Adding relationships

#### Suppose that Jane is John's mother

In [7]:
john_is_son = Edge(0, 0, "son")
jane_is_mom = Edge(1, 0, "mother")

graph_collection.append_to_connections("individuals", 1, john_is_son)
graph_collection.append_to_connections("individuals", 0, jane_is_mom)

for indiv_id in individuals:
    connections = ray.get(ray.get(graph_collection.get_graph("individuals")
                                                  .get_adjacency_list
                                                  .remote())[indiv_id])
    for conn in connections:
        print(individuals[conn.destination]["Name"], 
              "is",
              str(conn.orientation),
              "to",
              str(individuals[indiv_id]["Name"]))

Jane Doe is mother to John Doe
John Doe is son to Jane Doe


## Leveraging the asynchrony in Ray

#### A real example with real data

In [8]:
# this will store all reads in their original form
graph_collection.add_graph("reads")
# this will store the genome graph for all reads
graph_collection.add_graph("reads_genome_graph")

#sample reads
sample_read_data = [{"contigName": "chr1", "start": 268051, "end": 268101, "mapq": 0, "readName": "D3NH4HQ1:95:D0MT5ACXX:2:2307:5603:121126", "sequence": "GGAGTGGGGGCAGCTACGTCCTCTCTTGAGCTACAGCAGATTCACTCNCT", "qual": "BCCFDDFFHHHHHJJJIJJJJJJIIIJIGJJJJJJJJJIIJJJJIJJ###", "cigar": "50M", "readPaired": False, "properPair": False, "readMapped": True, "mateMapped": False, "failedVendorQualityChecks": False, "duplicateRead": False, "readNegativeStrand": False, "mateNegativeStrand": False, "primaryAlignment": True, "secondaryAlignment": False, "supplementaryAlignment": False, "mismatchingPositions": "47T0G1", "origQual": None, "attributes": "XT:A:R\tXO:i:0\tXM:i:2\tNM:i:2\tXG:i:0\tXA:Z:chr16,-90215399,50M,2;chr6,-170736451,50M,2;chr8,+71177,50M,3;chr1,+586206,50M,3;chr1,+357434,50M,3;chr5,-181462910,50M,3;chr17,-83229095,50M,3;\tX1:i:5\tX0:i:3", "recordGroupName": None, "recordGroupSample": None, "mateAlignmentStart": None, "mateAlignmentEnd": None, "mateContigName": None, "inferredInsertSize": None},
                    {"contigName": "chr1", "start": 1424219, "end": 1424269, "mapq": 37, "readName": "D3NH4HQ1:95:D0MT5ACXX:2:2107:15569:102571", "sequence": "AGCGCTGTAGGGACACTGCAGGGAGGCCTCTGCTGCCCTGCTAGATGTCA", "qual": "CCCFFFFFHHHHHJJJJJJJJJJIJJJJJJJJJJJJJJJIJJIJHIIGHI", "cigar": "50M", "readPaired": False, "properPair": False, "readMapped": True, "mateMapped": False, "failedVendorQualityChecks": False, "duplicateRead": False, "readNegativeStrand": False, "mateNegativeStrand": False, "primaryAlignment": True, "secondaryAlignment": False, "supplementaryAlignment": False, "mismatchingPositions": "50", "origQual": None, "attributes": "XT:A:U\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX1:i:0\tX0:i:1", "recordGroupName": None, "recordGroupSample": None, "mateAlignmentStart": None, "mateAlignmentEnd": None, "mateContigName": None, "inferredInsertSize": None},
                    {"contigName": "chr1", "start": 45520936, "end": 45520986, "mapq": 0, "readName": "D3NH4HQ1:95:D0MT5ACXX:2:2103:19714:5712", "sequence": "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT", "qual": "############################BBBCDEEA<?:FDCADDD?;=+", "cigar": "50M", "readPaired": False, "properPair": False, "readMapped": True, "mateMapped": False, "failedVendorQualityChecks": False, "duplicateRead": False, "readNegativeStrand": True, "mateNegativeStrand": False, "primaryAlignment": True, "secondaryAlignment": False, "supplementaryAlignment": False, "mismatchingPositions": "50", "origQual": None, "attributes": "XT:A:R\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX0:i:1406", "recordGroupName": None, "recordGroupSample": None, "mateAlignmentStart": None, "mateAlignmentEnd": None, "mateContigName": None, "inferredInsertSize": None},
                    {"contigName": "chr1", "start": 45520938, "end": 45520988, "mapq": 0, "readName": "D3NH4HQ1:95:D0MT5ACXX:2:2103:21028:126413", "sequence": "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT", "qual": "###########################B?;;AFHFIGDDHDDDDDBD@?=", "cigar": "50M", "readPaired": False, "properPair": False, "readMapped": True, "mateMapped": False, "failedVendorQualityChecks": False, "duplicateRead": False, "readNegativeStrand": True, "mateNegativeStrand": False, "primaryAlignment": True, "secondaryAlignment": False, "supplementaryAlignment": False, "mismatchingPositions": "50", "origQual": None, "attributes": "XT:A:R\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX0:i:1406", "recordGroupName": None, "recordGroupSample": None, "mateAlignmentStart": None, "mateAlignmentEnd": None, "mateContigName": None, "inferredInsertSize": None},
                    {"contigName": "chr1", "start": 2653642, "end": 2653692, "mapq": 25, "readName": "D3NH4HQ1:95:D0MT5ACXX:2:2306:20003:84408", "sequence": "ANNACACCCCCAGGCGAGCATCTGACAGCCTGGAACAGCACCCACACCCC", "qual": "######JJJJJJJIJIJJIHGGGIIJJJJJJJJJJJJHHFHHDDDBFC@@", "cigar": "50M", "readPaired": False, "properPair": False, "readMapped": True, "mateMapped": False, "failedVendorQualityChecks": False, "duplicateRead": False, "readNegativeStrand": True, "mateNegativeStrand": False, "primaryAlignment": True, "secondaryAlignment": False, "supplementaryAlignment": False, "mismatchingPositions": "0T0C0C47", "origQual": None, "attributes": "XT:A:U\tXO:i:0\tXM:i:3\tNM:i:3\tXG:i:0\tX1:i:0\tX0:i:1", "recordGroupName": None, "recordGroupSample": None, "mateAlignmentStart": None, "mateAlignmentEnd": None, "mateContigName": None, "inferredInsertSize": None},
                    {"contigName": "chr1", "start": 2664732, "end": 2664782, "mapq": 0, "readName": "D3NH4HQ1:95:D0MT5ACXX:2:2106:12935:169714", "sequence": "GAGCATGTGACAGCCTAGGTCGGCACCCACACCCCCAGGTGAGCATCTGA", "qual": "FDBDCHFFEHDCCAFHIHA6EGB?8GGFF?8IEHEB@FHDHGEDDBD@@@", "cigar": "50M", "readPaired": False, "properPair": False, "readMapped": True, "mateMapped": False, "failedVendorQualityChecks": False, "duplicateRead": False, "readNegativeStrand": True, "mateNegativeStrand": False, "primaryAlignment": True, "secondaryAlignment": False, "supplementaryAlignment": False, "mismatchingPositions": "6C9G33", "origQual": None, "attributes": "XT:A:R\tXO:i:0\tXM:i:2\tNM:i:2\tXG:i:0\tX1:i:13\tX0:i:5", "recordGroupName": None, "recordGroupSample": None, "mateAlignmentStart": None, "mateAlignmentEnd": None, "mateContigName": None, "inferredInsertSize": None},
                    {"contigName": "chr1", "start": 2683541, "end": 2683591, "mapq": 0, "readName": "D3NH4HQ1:95:D0MT5ACXX:2:2107:5053:12847", "sequence": "AGCACCCACAACCACAGGTGAGCATCCGACAGCCTGGAACAGCACCCACA", "qual": "CCCFFFFFHGHHHJIJJJHGGIIJJJJJIJGIIJJIJJIJJJJIJIIJJJ", "cigar": "50M", "readPaired": False, "properPair": False, "readMapped": True, "mateMapped": False, "failedVendorQualityChecks": False, "duplicateRead": False, "readNegativeStrand": False, "mateNegativeStrand": False, "primaryAlignment": True, "secondaryAlignment": False, "supplementaryAlignment": False, "mismatchingPositions": "50", "origQual": None, "attributes": "XT:A:R\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tXA:Z:chr1,+2687435,50M,0;chr1,+2694861,50M,0;chr1,+2755813,50M,1;\tX1:i:1\tX0:i:3", "recordGroupName": None, "recordGroupSample": None, "mateAlignmentStart": None, "mateAlignmentEnd": None, "mateContigName": None, "inferredInsertSize": None},
                    {"contigName": "chr1", "start": 2689861, "end": 2689911, "mapq": 0, "readName": "D3NH4HQ1:95:D0MT5ACXX:2:2108:5080:115408", "sequence": "GGTGAGCATCTGACAGCCCGGAGCAGCACGCAAACCCCCAGGTGAGCATC", "qual": "@@BFBBDFHHHHGJIJGIIFIEIJJJJIJJJJJJJJJJJJIJGHHICEHH", "cigar": "50M", "readPaired": False, "properPair": False, "readMapped": True, "mateMapped": False, "failedVendorQualityChecks": False, "duplicateRead": False, "readNegativeStrand": False, "mateNegativeStrand": False, "primaryAlignment": True, "secondaryAlignment": False, "supplementaryAlignment": False, "mismatchingPositions": "18T3A27", "origQual": None, "attributes": "XT:A:R\tXO:i:0\tXM:i:2\tNM:i:2\tXG:i:0\tX1:i:21\tX0:i:2", "recordGroupName": None, "recordGroupSample": None, "mateAlignmentStart": None, "mateAlignmentEnd": None, "mateContigName": None, "inferredInsertSize": None},
                    {"contigName": "chr1", "start": 2750194, "end": 2750244, "mapq": 0, "readName": "D3NH4HQ1:95:D0MT5ACXX:2:1204:10966:151563", "sequence": "CCCCCNCACCCCCAGGTGAGCATCTGATGGTCTGGAGCAGCACCCACACC", "qual": "######F;JJJJJJJJJJJJIIIJIJJJJFJJIJJGJHHHHHFFDDD?BB", "cigar": "50M", "readPaired": False, "properPair": False, "readMapped": True, "mateMapped": False, "failedVendorQualityChecks": False, "duplicateRead": False, "readNegativeStrand": True, "mateNegativeStrand": False, "primaryAlignment": True, "secondaryAlignment": False, "supplementaryAlignment": False, "mismatchingPositions": "1A3A12C31", "origQual": None, "attributes": "XT:A:R\tXO:i:0\tXM:i:3\tNM:i:3\tXG:i:0\tXA:Z:chr1,-2653118,50M,3;chr1,-2652838,50M,3;chr1,-2653681,50M,3;chr1,-2694823,50M,3;chr1,-2687397,50M,3;chr1,-2755775,50M,3;chr1,-2653921,50M,3;\tX1:i:0\tX0:i:8", "recordGroupName": None, "recordGroupSample": None, "mateAlignmentStart": None, "mateAlignmentEnd": None, "mateContigName": None, "inferredInsertSize": None},
                    {"contigName": "chr1", "start": 3052271, "end": 3052321, "mapq": 25, "readName": "D3NH4HQ1:95:D0MT5ACXX:2:2107:21352:43370", "sequence": "TCANTCATCTTCCATCCATCCGTCCAACAACCATTTGTTGATCATCTCTC", "qual": "@@<#4AD?ACDCDHGIDA>C?<A;8CBEEBAG1D?BG?GH?@DEHFG@FH", "cigar": "50M", "readPaired": False, "properPair": False, "readMapped": True, "mateMapped": False, "failedVendorQualityChecks": False, "duplicateRead": False, "readNegativeStrand": False, "mateNegativeStrand": False, "primaryAlignment": True, "secondaryAlignment": False, "supplementaryAlignment": False, "mismatchingPositions": "3C44A0T0", "origQual": None, "attributes": "XT:A:U\tXO:i:0\tXM:i:3\tNM:i:3\tXG:i:0\tX1:i:0\tX0:i:1", "recordGroupName": None, "recordGroupSample": None, "mateAlignmentStart": None, "mateAlignmentEnd": None, "mateContigName": None, "inferredInsertSize": None}]

for read in sample_read_data:
    graph_collection.add_node_to_graph("reads", read["readName"], Node(read))
    for index in range(len(read["sequence"])):
        data = read["sequence"][index]

        neighbors = []
        if index != 0:
            neighbors.append(Edge(read["contigName"] + 
                                  "\t" + 
                                  str(read["start"] + index - 1), 
                              0, 
                              "left"))

        if index != len(read["sequence"]) - 1:
            neighbors.append(Edge(read["contigName"] + 
                                  "\t" + 
                                  str(read["start"] + index + 1), 
                              0, 
                              "right"))

        coordinate = read["contigName"] + "\t" + str(read["start"] + index)
        node = Node(data)

        graph_collection.add_node_to_graph("reads_genome_graph", coordinate, node, neighbors)
        graph_collection.add_inter_graph_connection("reads_genome_graph", 
                                                    coordinate, 
                                                    "reads", 
                                                    read["readName"])


In [9]:
# for storing the feature data
graph_collection.add_graph("features")

sampleFeatures = [{"featureName": "0", "contigName": "chr1", "start": 45520936, "end": 45522463, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "878", "thickStart": "482.182760214932", "thickEnd": "-1"}},
                    {"featureName": "1", "contigName": "chr1", "start": 88891087, "end": 88891875, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "423", "thickStart": "446.01797654123", "thickEnd": "-1"}},
                    {"featureName": "2", "contigName": "chr1", "start": 181088138, "end": 181090451, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "626", "thickStart": "444.771802710521", "thickEnd": "-1"}},
                    {"featureName": "3", "contigName": "chr1", "start": 179954184, "end": 179955452, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "647", "thickStart": "440.10466093652", "thickEnd": "-1"}},
                    {"featureName": "4", "contigName": "chr1", "start": 246931401, "end": 246932507, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "423", "thickStart": "436.374938660247", "thickEnd": "-1"}},
                    {"featureName": "5", "contigName": "chr1", "start": 28580676, "end": 28582443, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "1106", "thickStart": "434.111845970505", "thickEnd": "-1"}},
                    {"featureName": "6", "contigName": "chr1", "start": 23691459, "end": 23692369, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "421", "thickStart": "426.055504846001", "thickEnd": "-1"}},
                    {"featureName": "7", "contigName": "chr1", "start": 201955033, "end": 201956082, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "522", "thickStart": "423.882565088207", "thickEnd": "-1"}},
                    {"featureName": "8", "contigName": "chr1", "start": 207321011, "end": 207323021, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "741", "thickStart": "423.625988483304", "thickEnd": "-1"}},
                    {"featureName": "9", "contigName": "chr1", "start": 145520936, "end": 145522463, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "878", "thickStart": "482.182760214932", "thickEnd": "-1"}},
                    {"featureName": "10", "contigName": "chr1", "start": 188891087, "end": 188891875, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "423", "thickStart": "446.01797654123", "thickEnd": "-1"}},
                    {"featureName": "11", "contigName": "chr1", "start": 1181088138, "end": 1181090451, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "626", "thickStart": "444.771802710521", "thickEnd": "-1"}},
                    {"featureName": "12", "contigName": "chr1", "start": 1179954184, "end": 1179955452, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "647", "thickStart": "440.10466093652", "thickEnd": "-1"}},
                    {"featureName": "13", "contigName": "chr1", "start": 1246931401, "end": 1246932507, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "423", "thickStart": "436.374938660247", "thickEnd": "-1"}},
                    {"featureName": "14", "contigName": "chr1", "start": 128580676, "end": 128582443, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "1106", "thickStart": "434.111845970505", "thickEnd": "-1"}},
                    {"featureName": "15", "contigName": "chr1", "start": 123691459, "end": 123692369, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "421", "thickStart": "426.055504846001", "thickEnd": "-1"}},
                    {"featureName": "16", "contigName": "chr1", "start": 1201955033, "end": 1201956082, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "522", "thickStart": "423.882565088207", "thickEnd": "-1"}},
                    {"featureName": "17", "contigName": "chr1", "start": 1207321011, "end": 1207323021, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "741", "thickStart": "423.625988483304", "thickEnd": "-1"}},
                    {"featureName": "18", "contigName": "chr1", "start": 1110963118, "end": 1110964762, "score": 0.0, "attributes": {"itemRgb": "5.0696939910406", "blockCount": "758", "thickStart": "421.056761458099", "thickEnd": "-1"}}]
    
for feature in sampleFeatures:
    node = Node(feature)
    graph_collection.add_node_to_graph("features", feature["featureName"], node)
    coordinates = []
    for index in range(feature["end"] - feature["start"]):
        coordinates.append(feature["contigName"] + 
                           "\t" + 
                           str(feature["start"] + index))

    graph_collection.add_multiple_inter_graph_connections("features", 
                                                          feature["featureName"], 
                                                          "reads_genome_graph", 
                                                          coordinates)


import time
reads_graph = graph_collection.get_graph("reads_genome_graph")
for i in range(11):
    print("After", i*3, "seconds:",
          str(len(ray.get(reads_graph.get_inter_graph_connections.remote()))),
          "completed")
    time.sleep(3)

After 0 seconds: 460 completed
After 3 seconds: 3361 completed
After 6 seconds: 6406 completed
After 9 seconds: 9464 completed
After 12 seconds: 12466 completed
After 15 seconds: 15609 completed
After 18 seconds: 18590 completed
After 21 seconds: 21664 completed
After 24 seconds: 24895 completed
After 27 seconds: 27520 completed
After 30 seconds: 27520 completed
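
The counts keep growing while the notebook runs because the inserts above return immediately and the work finishes in the background. The same submit-then-poll pattern can be sketched with the standard library's `concurrent.futures` as a stand-in for Ray tasks. This is only an analogy, not how Ray or the demo's graph is implemented, and `slow_insert` is a hypothetical placeholder for a graph update.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait

store = {}
store_lock = threading.Lock()


def slow_insert(key, value):
    """Pretend graph update that takes a little time to complete."""
    time.sleep(0.01)
    with store_lock:
        store[key] = value


executor = ThreadPoolExecutor(max_workers=4)

# Submitting returns a future right away; the caller is not blocked.
futures = [executor.submit(slow_insert, i, "node") for i in range(40)]

# Poll progress while the inserts are still running.
for _ in range(3):
    with store_lock:
        print("completed so far:", len(store))
    time.sleep(0.02)

# Block until everything has finished, then shut the pool down.
wait(futures)
executor.shutdown()
print("final count:", len(store))
```

The intermediate counts depend on timing and will vary between runs; only the final count is fixed.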


## A real query:

#### How many reads overlap with a particular feature?

In [10]:
features_graph = graph_collection.get_graph("features")

query_feature = "0"

inter_graph_for_query = ray.get(features_graph
                            .get_inter_graph_connections
                            .remote())[query_feature]

reads_inter_graph = ray.get(reads_graph.get_inter_graph_connections
                                       .remote())

independent_features = set()
for key in ray.get(inter_graph_for_query["reads_genome_graph"]):
    if "reads" in reads_inter_graph[key]:
        x = ray.get(reads_inter_graph[key]["reads"])
        independent_features.update(x)
        
print("Number of reads_with_features: ", len(independent_features))
print("Reads: ", independent_features)

Number of reads_with_features:  2
Reads:  {'D3NH4HQ1:95:D0MT5ACXX:2:2103:19714:5712', 'D3NH4HQ1:95:D0MT5ACXX:2:2103:21028:126413'}
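
The result can be cross-checked without the graph by comparing intervals directly. The sketch below assumes half-open intervals `[start, end)`, which matches how coordinates are enumerated with `range` above, and derives each read's end from its start plus its 50-base length. `overlaps` is a hypothetical helper, and only a subset of the sample reads is listed.

```python
READ_LENGTH = 50  # every sample read has a "50M" CIGAR string

# (readName, start) for a few of the sample reads on chr1
reads = [
    ("D3NH4HQ1:95:D0MT5ACXX:2:2307:5603:121126", 268051),
    ("D3NH4HQ1:95:D0MT5ACXX:2:2107:15569:102571", 1424219),
    ("D3NH4HQ1:95:D0MT5ACXX:2:2103:19714:5712", 45520936),
    ("D3NH4HQ1:95:D0MT5ACXX:2:2103:21028:126413", 45520938),
    ("D3NH4HQ1:95:D0MT5ACXX:2:2306:20003:84408", 2653642),
]

# Feature "0" from the sample features
feature_start, feature_end = 45520936, 45522463


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


overlapping_reads = {
    name
    for name, start in reads
    if overlaps(start, start + READ_LENGTH, feature_start, feature_end)
}

print("Number of overlapping reads:", len(overlapping_reads))
```

The graph query reaches the same answer by following feature-to-coordinate and coordinate-to-read links instead of comparing every read against the feature.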


## Future Work

#### Extending this beyond genomics
* Streaming applications
* Applications where data is 
* Complex querying
* Querying/ML on streaming data

#### Extending to support machine learning
* Prediction problem: What is the confidence for a connection between two nodes?
* General associations: Deep Learning
* Reinforcement Learning?

#### Future Systems work
* Correct execution of queries on data
    * Requires a time library
* Support for transactions
* More robustness
* Testing on a large dataset (on a cluster)

## Future Work sample stack diagram

![](Stack%20Diagram.png)

# Questions?