<a href="https://colab.research.google.com/github/tjann/api-python/blob/udpatenb/AnalyzingGenomicDatawithBiomedicalDataCommons.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing Genomic Data with Biomedical Data Commons
Datacommons is intended for various data science tasks. This tutorial introduces the datacommons knowledge graph and discusses two tools to help integrate its data into your data science projects: (1) the [datacommons browser](https://browser.datacommons.org/) and (2) the [Python API](https://github.com/datacommonsorg/api-python). Before getting started, we will need to install the Python API package.


In [1]:
# Install datacommons
!pip install --upgrade --quiet datacommons

# What is Biomedical Data Commons?
Data Commons is an open knowledge graph of structured data. It contains statements about real world objects such as:
*   The genome assembly [hg38](https://browser.datacommons.org/kg?dcid=bio%2Fhg38) is a reference genome for the species *[Homo sapiens](https://browser.datacommons.org/kg?dcid=bio%2Fhs)*.
*   For the [hg38](https://browser.datacommons.org/kg?dcid=bio%2Fhg38) genome assembly [chr17](https://browser.datacommons.org/kg?dcid=bio%2Fhg38_chr17) has [83,257,441 base pairs](https://browser.datacommons.org/kg?dcid=BasePairs83257441).
*   [BRCA1](https://browser.datacommons.org/kg?dcid=bio/hg38_BRCA1) genomic coordinates are [chr17](https://browser.datacommons.org/kg?dcid=bio%2Fhg38_chr17):[43,044,294-43,125,483](https://browser.datacommons.org/kg?dcid=Position43044294To43125483) for genome assembly [hg38](https://browser.datacommons.org/kg?dcid=bio%2Fhg38).

In the graph, [entities](https://en.wikipedia.org/wiki/Entity) like the genome assembly [hg38](https://browser.datacommons.org/kg?dcid=bio%2Fhg38) are represented by nodes. Every node has a type corresponding to what the node represents. For example, *[Homo sapiens](https://browser.datacommons.org/kg?dcid=bio%2Fhs)* is a [Species](https://browser.datacommons.org/kg?dcid=Species). Relations between entities are represented by edges between these nodes. For example, the statement "The genome assembly hg38 is a reference genome for the species *Homo sapiens*." is represented in the graph as two nodes, "hg38" and "HomoSapiens", with an edge labeled "[is_species](https://browser.datacommons.org/kg?dcid=is_chromosome)" pointing from "hg38" to "HomoSapiens". Data Commons follows the [Schema.org data model](https://schema.org/docs/datamodel.html) and leverages the schema.org schema to provide a common set of types and properties. To accommodate biological data, this schema has been expanded to reflect the schemas used across biological databases created by the scientific community.
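
The node-and-edge model described above can be sketched with plain Python tuples. This is only a toy illustration, not how Data Commons stores its graph: the `triples` list and the `objects` helper are hypothetical, and the property names `ofSpecies`, `typeOf`, and `inChromosome` are the ones that appear in the API calls and outputs of this tutorial.

```python
# Toy model: each statement is a (subject, property, object) triple.
# Illustrative only; this is not the Data Commons storage format.
triples = [
    ("bio/hg38", "typeOf", "GenomeAssembly"),
    ("bio/hg38", "ofSpecies", "bio/hs"),
    ("bio/hs", "typeOf", "Species"),
    ("bio/hg38_BRCA1", "inChromosome", "bio/hg38_chr17"),
]


def objects(subject, prop):
    """Return every object reached from `subject` by an edge labeled `prop`."""
    return [o for s, p, o in triples if s == subject and p == prop]


# Follow the species edge that points away from the hg38 node.
print(objects("bio/hg38", "ofSpecies"))
```

Each tuple is one edge; the first and last elements are the nodes it connects.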

# Data Commons Browser
The Data Commons browser provides a way to explore the data in a human-readable format. It is the best way to explore what is in Data Commons. Searching in the browser for an entity like [BRCA1](https://browser.datacommons.org/kg?dcid=bio/hg38_BRCA1) takes you to a page about the entity, including properties like [refSeqID](https://browser.datacommons.org/kg?dcid=refSeqID) and [typeOfGene](https://browser.datacommons.org/kg?dcid=typeOfGene).

An important property for all entities is the dcid. The dcid (DataCommons identifier) is a unique identifier assigned to each entity in the knowledge graph. With this identifier, you will be able to search for and query information on the given entity in ways that we will discuss later. The dcid is listed at the top of the page next to "About: " and also in the list of properties.
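
Since every browser link in this tutorial has the form `kg?dcid=<dcid>`, a dcid can be turned into a browser URL with the standard library. The sketch below assumes that URL pattern stays as shown in the links above; `browser_link` is a hypothetical helper, not part of the Python API.

```python
from urllib.parse import urlencode

# Base URL taken from the browser links used in this tutorial.
BROWSER_URL = "https://browser.datacommons.org/kg"


def browser_link(dcid):
    """Build a browser URL for a dcid, percent-encoding characters such as '/'."""
    return BROWSER_URL + "?" + urlencode({"dcid": dcid})


print(browser_link("bio/hg38_BRCA1"))
```

Encoding the dcid matters because most biomedical dcids contain a `/` after the `bio` prefix.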

# Python API

The [Python API](https://github.com/datacommonsorg/api-python) provides functions for users to extract structured information from Data Commons programmatically and view it in different formats such as Python `dict`s and [Pandas](https://pandas.pydata.org/) DataFrames. DataFrames allow access to all the data processing, analytical and visualization tools provided by packages such as Pandas, NumPy, SciPy, and Matplotlib. For more information, check out the [documentation](https://datacommons.readthedocs.io/en/latest/modules.html) on the Python API's modules.

Every notebook begins by loading the datacommons library as follows:

In [2]:
# Import Data Commons
import datacommons as dc

# Import other required libraries
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import pandas as pd
import requests
import json

## Example: Identify the Genome Assemblies Supported by Biomedical Data Commons
For this exercise we will identify the genome assemblies and their related species that are currently supported by Biomedical Data Commons. We will start by looking up the dcid for '[GenomeAssembly](https://browser.datacommons.org/kg?dcid=GenomeAssembly)'.



## Using get_property_values to Access Node Properties
Our first task for this tutorial will be to extract the genome assemblies from Data Commons using the Python API and view them in a Pandas DataFrame. Looking up [GenomeAssembly](https://browser.datacommons.org/kg?dcid=GenomeAssembly) in the browser shows that its dcid is `GenomeAssembly`.

For all properties, one can use [**`get_property_values`**](https://datacommons.readthedocs.io/en/latest/_autosummary/datacommons_core/datacommons.core.get_property_values.html) to get the associated values. We would like to find the instances of "GenomeAssembly" by following the `typeOf` edges that are oriented towards the node identified by `GenomeAssembly`. `get_property_values` accepts the following parameters:


*  **`dcids`** - A list of dcids to get property values for.
*  **`prop`** - The property to get property values for.
*  **`out`**`[=True]` - An optional flag that indicates the property is oriented away from the given nodes if true.
*  **`value_type`**`[=None]` - An optional parameter which filters property values by the given type.
*  **`limit`**`[=100]` - An optional parameter which limits the total number of property values returned aggregated over all given nodes.

The return value is a `dict` that maps each given dcid to its property values. Some properties may have many values for a single node. Consequently, each entry of the returned `dict` is always a list of property values, even when it holds a single value. Let's take a look:

In [3]:
# Call get_property_values. The return value is a dict keyed by 'GenomeAssembly'.
genomeAssembly_dcids = dc.get_property_values(['GenomeAssembly'], 'typeOf', out=False)['GenomeAssembly']
# Display the list of dcids
print(genomeAssembly_dcids)

['bio/ce10', 'bio/ce9', 'bio/danRer10', 'bio/danRer11', 'bio/dm3', 'bio/dm6', 'bio/galGal5', 'bio/galGal6', 'bio/hg19', 'bio/hg38', 'bio/mm10', 'bio/mm9', 'bio/sacCer3', 'bio/xenLae2']
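
Because each result is a list, it often helps to flatten a dcid-to-list mapping into one row per value before tabulating it. The sketch below works on a hand-written sample rather than a live API call; `flatten` and the `bio/unknown` entry are hypothetical and only illustrate the shape of the data.

```python
# Sample shaped like a get_property_values result: dcid -> list of values.
# The values are illustrative and are not fetched from Data Commons.
sample = {
    "bio/hg38": ["bio/hs"],
    "bio/mm10": ["bio/mm"],
    "bio/unknown": [],  # hypothetical dcid with no values for the property
}


def flatten(values_by_dcid):
    """Return one (dcid, value) pair per property value; empty lists add nothing."""
    return [(dcid, value)
            for dcid, values in values_by_dcid.items()
            for value in values]


for dcid, value in flatten(sample):
    print(dcid, value)
```

This is the same reshaping that `DataFrame.explode` performs on a column of lists, except that `explode` keeps a row with a missing value for an empty list.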


## Example: List All Genome Assemblies in Human Readable Format
Let's continue learning about the genome assemblies supported by Biomedical Data Commons. We are next going to find the names and species of the genome assemblies that are associated with the list of dcids of genome assemblies that we found. We are going to display the information in a human readable table.

In [4]:
# Initialize the Data Frame
df_genomeAssemblies = pd.DataFrame()

# Add genome assemblies name and dcid
df_genomeAssemblies['name'] = pd.Series(dc.get_property_values(genomeAssembly_dcids, 'name'))
df_genomeAssemblies.reset_index(level=0, inplace=True)
df_genomeAssemblies = df_genomeAssemblies.rename(columns={"index": "dcid"}).explode('name')

# Add Species dcid
df_genomeAssemblies['species_dcid'] = df_genomeAssemblies['dcid'].map(
    dc.get_property_values(df_genomeAssemblies['dcid'], 'ofSpecies'))
df_genomeAssemblies = df_genomeAssemblies.explode('species_dcid')

# Add Species name
df_genomeAssemblies['species_name'] = df_genomeAssemblies['species_dcid'].map(
    dc.get_property_values(df_genomeAssemblies['species_dcid'], 'name'))
df_genomeAssemblies = df_genomeAssemblies.explode('species_name')

print(df_genomeAssemblies)

            dcid      name species_dcid             species_name
0       bio/ce10      ce10       bio/ce    CaenorhabditisElegans
1        bio/ce9       ce9       bio/ce    CaenorhabditisElegans
2   bio/danRer10  danRer10   bio/danRer               DanioRerio
3   bio/danRer11  danRer11   bio/danRer               DanioRerio
4        bio/dm3       dm3       bio/dm   DrosophilaMelanogaster
5        bio/dm6       dm6       bio/dm   DrosophilaMelanogaster
6    bio/galGal5   galGal5   bio/galGal             GallusGallus
7    bio/galGal6   galGal6   bio/galGal             GallusGallus
8       bio/hg19      hg19       bio/hs              HomoSapiens
9       bio/hg38      hg38       bio/hs              HomoSapiens
10      bio/mm10      mm10       bio/mm              MusMusculus
11       bio/mm9       mm9       bio/mm              MusMusculus
12   bio/sacCer3   sacCer3   bio/sacCer  SaccharomycesCerevisiae
13   bio/xenLae2   xenLae2   bio/xenLae            XenopusLaevis


Congratulations! You've found the basic information on all the genome assemblies and species currently supported by Biomedical Data Commons. As you can see, we currently support 14 genome assemblies across 8 model organisms.

# Example: Analyze Genetic Variants within RUNX1
For this exercise, we will be analyzing genetic variants within the gene RUNX1. We will start by identifying the genetic variants within the gene region, then limit our list to those within the coding region of RUNX1, and then to those with known clinical significance. First, let's start by looking up the dcid for '[RUNX1](https://browser.datacommons.org/kg?dcid=bio/hg38_RUNX1)'.

Note that 'Gene' defines the Data Commons type. Let's start by using [**`get_property_labels`**](https://datacommons.readthedocs.io/en/latest/_autosummary/datacommons_core/datacommons.core.get_property_labels.html) to identify all the properties associated with RUNX1. `get_property_labels` accepts the following parameters:
-  **`dcids`** `[list of str]` – A list of nodes identified by their dcids.
-  **`out`** `[bool, optional]` – Whether or not the property points away from the given list of nodes.

The output of `get_property_labels` is a dict mapping dcids to lists of property labels. If out is True, then property labels correspond to edges directed away from given nodes. Otherwise, they correspond to edges directed towards the given nodes.

In [5]:
# Call get_property_labels
dc.get_property_labels(['bio/hg38_RUNX1'])

{'bio/hg38_RUNX1': ['description',
  'fullName',
  'geneSymbol',
  'genomicCoordinates',
  'hasRNATranscript',
  'inChromosome',
  'inGenomeAssembly',
  'mRNA',
  'mapLocation',
  'modificationDate',
  'name',
  'ncbiGeneID',
  'ncbiTaxonID',
  'nomenclatureStatus',
  'ofSpecies',
  'provenance',
  'refSeqID',
  'strandOrientation',
  'typeOf',
  'typeOfGene']}
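
The effect of the `out` flag can be illustrated with a small in-memory edge list. This is a toy model under stated assumptions, not the library's implementation: `edges` and `property_labels` are hypothetical, and the edges are a hand-picked subset of the relations that this tutorial reports for RUNX1.

```python
# Toy edge list: (source dcid, property label, target dcid). Illustrative only.
edges = [
    ("bio/hg38_RUNX1", "inChromosome", "bio/hg38_chr21"),
    ("bio/hg38_RUNX1", "typeOf", "Gene"),
    ("bio/hg38_RUNX1", "hasRNATranscript", "bio/hg38_ENST00000300305.7"),
    ("bio/hg38_ENST00000300305.7", "typeOf", "RNATranscript"),
]


def property_labels(dcid, out=True):
    """Sorted labels of edges leaving `dcid` (out=True) or arriving at it (out=False)."""
    if out:
        return sorted({label for source, label, _ in edges if source == dcid})
    return sorted({label for _, label, target in edges if target == dcid})


# Edges pointing away from the gene node.
print(property_labels("bio/hg38_RUNX1"))
# Edges pointing towards the transcript node.
print(property_labels("bio/hg38_ENST00000300305.7", out=False))
```

The same property can therefore show up for two different nodes, once as an outgoing label and once as an incoming one.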

### Identify RUNX1 Genomic Coordinates
Great, now we see the type of information known about RUNX1. To identify the genetic variants within the gene region, we need to know the chromosome and the genomic coordinates of RUNX1. Let's grab that information using `get_property_values`.

In [6]:
# Initialize the Data Frame
df_RUNX1 = pd.DataFrame({'gene': ['bio/hg38_RUNX1']})

# Grab the chromosome and genomic coordinates using get_property_values
df_RUNX1['chromosome'] = df_RUNX1['gene'].map(dc.get_property_values(df_RUNX1['gene'], 'inChromosome'))
df_RUNX1['genomicCoordinates'] = df_RUNX1['gene'].map(dc.get_property_values(df_RUNX1['gene'], 'genomicCoordinates'))

# display the genomic coordinates
print(df_RUNX1)

# define start and stop of RUNX1
# (str.strip removes a set of characters, not a prefix, so drop 'Position' with replace)
start, stop = df_RUNX1['genomicCoordinates'][0][0].replace('Position', '', 1).split('To')
start = int(start)
stop = int(stop)

             gene        chromosome            genomicCoordinates
0  bio/hg38_RUNX1  [bio/hg38_chr21]  [Position34787800To35049344]
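
Coordinate values such as `Position34787800To35049344` can also be parsed with a regular expression, which rejects unexpected input instead of silently producing wrong numbers. The sketch below assumes every coordinate follows the `Position<start>To<stop>` pattern seen in the output above; `parse_position_range` is a hypothetical helper.

```python
import re

# Pattern inferred from the coordinate strings printed in this tutorial.
_POSITION_RE = re.compile(r"^Position(\d+)To(\d+)$")


def parse_position_range(value):
    """Parse 'Position<start>To<stop>' into a (start, stop) tuple of ints."""
    match = _POSITION_RE.match(value)
    if match is None:
        raise ValueError(f"unexpected coordinate format: {value!r}")
    return int(match.group(1)), int(match.group(2))


gene_start, gene_stop = parse_position_range("Position34787800To35049344")
print(gene_start, gene_stop)
```

Raising on a mismatch makes a schema change visible immediately rather than several filtering steps later.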


## Find All The Genetic Variants Within RUNX1
We found the coordinates of RUNX1 in the hg38 genome: chr21:34787800-35049344. Now that we know this, let's find all the genetic variants of class [GeneticVariant](https://browser.datacommons.org/kg?dcid=GeneticVariant) that occur in the region. But first, let's see which properties are recorded for genetic variants using `get_property_values`.


In [7]:
# Identify all the properties of genetic variants
print(dc.get_property_values(['GeneticVariant'], 'domainIncludes', value_type='Property',  out=False))

{'GeneticVariant': ['alleleOrigin', 'alleleType', 'averageHeterozygosity', 'averageHeterozygositySE', 'clinVarAlleleID', 'clinVarFilterStatus', 'clinVarID', 'clinVarQualityScore', 'clinVarReviewStatus', 'clinicalSignificance', 'clinicalSignificanceConflicting', 'clinicalSignificanceType', 'clinicalSource', 'dbSNPBuildID', 'dbVarID', 'diseaseDescription', 'diseaseName', 'experimentalFactorOntologyID', 'geneID', 'geneReviewsID', 'geneSymbol', 'geneticTestingRegistryID', 'geneticVariantAlignmentQuality', 'geneticVariantAttribute', 'geneticVariantClass', 'geneticVariantExceptions', 'geneticVariantFunctionalCategory', 'geneticVariantImpercise', 'geneticVariantLength', 'geneticVariantLocType', 'geneticVariantSubmitterCount', 'geneticVariantValidationStatus', 'geneticsHomeReferenceID', 'genomicPosition', 'hg19GenomicLocation', 'hg19GenomicPosition', 'hg38GenomicLocation', 'hg38GenomicPosition', 'humanGenomeVariationSocietyNomenclature', 'humanPhenotypeOntologyID', 'medGenID', 'medicalGeneticS

### Using SPARQL and get_property_values to find genetic variants in RUNX1
Like genes, genetic variants also point to the chromosome on which they reside. Their position on the chromosome is specified by `hg38GenomicPosition`. Using [**`query`**](https://datacommons.readthedocs.io/en/latest/_autosummary/datacommons.query.html) we will identify the dcids of all genetic variants on chr21. `query` returns the results of executing a SPARQL query on the Data Commons graph and accepts the following parameters:
-  **`query_string`** – The SPARQL query to execute.
-  **`select`** `[optional]` – An optional argument that this tutorial does not use.

A SPARQL query can match on several conditions at once and returns the rows that satisfy all of them. There is no limit to the number of values that can be returned by a SPARQL query. In our query here, we specify that we want all genetic variants on chr21. We will then collect the returned genetic variant dcids into a list and use `get_property_values` again to keep only the genetic variants whose `hg38GenomicPosition` is within RUNX1.

In [8]:
# Query for all genetic variants on chr21, the chromosome that contains RUNX1
query = '''
SELECT ?gv  ?p
WHERE { 
    ?chr dcid "bio/hg38_chr21" . 
    ?gv inChromosome ?chr .
    ?gv typeOf GeneticVariant .
    ?gv hg38GenomicPosition ?p
}
'''
print(query)
rows = dc.query(query)
dcids = set()
for row in rows:
  dcids.add(row['?gv'])
dcids = list(dcids)
print(len(dcids))


SELECT ?gv  ?p
WHERE { 
    ?chr dcid "bio/hg38_chr21" . 
    ?gv inChromosome ?chr .
    ?gv typeOf GeneticVariant .
    ?gv hg38GenomicPosition ?p
}

8028
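
Each row returned by the query is a dict keyed by the query variables (`?gv` and `?p` here), so position filtering can be written as a small predicate over rows. The sketch below runs on made-up sample rows and does not call the API; `make_position_filter` and the `bio/example*` dcids are hypothetical.

```python
def make_position_filter(range_start, range_stop, var="?p"):
    """Build a predicate that keeps rows positioned in [range_start, range_stop)."""
    def keep(row):
        try:
            return range_start <= int(row[var]) < range_stop
        except (KeyError, ValueError, TypeError):
            return False  # skip rows with a missing or non-numeric position
    return keep


# RUNX1 coordinates found earlier in this tutorial.
in_runx1 = make_position_filter(34787800, 35049344)

# Sample rows shaped like query results; the dcids and positions are made up.
sample_rows = [
    {"?gv": "bio/example1", "?p": "34800000"},
    {"?gv": "bio/example2", "?p": "10000000"},
    {"?gv": "bio/example3", "?p": "35049344"},  # equal to the stop, so excluded
]
print([row["?gv"] for row in sample_rows if in_runx1(row)])
```

If the installed version of the library accepts a row-level function for the `select` argument, the same predicate could presumably be passed as `dc.query(query, select=in_runx1)`; that usage is an assumption and is not tested here.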


In [9]:
# filter all chr21 genetic variants for the ones within RUNX1
gen_positions = dc.get_property_values(dcids, 'hg38GenomicPosition')
data = pd.DataFrame(gen_positions).transpose()
data.reset_index(level=0, inplace=True)
data = data.rename(columns={"index": "dcid", 0: "position"})
data['position'] = pd.to_numeric(data['position'])
# keep positions inside the half-open range [start, stop)
data = data[(data['position'] >= start) & (data['position'] < stop)]
RUNX1_geneticVariants = list(set(data['dcid']))

# print the first few genetic variants
print(RUNX1_geneticVariants[:5])

# check how many genetic variants are in RUNX1
print(len(RUNX1_geneticVariants))

['bio/rs1569084170', 'bio/643883', 'bio/rs1569002337', 'bio/839054', 'bio/rs150481777']
504


## Identify Which Genetic Variants Are In Coding Regions
We've identified 504 genetic variants within RUNX1. However, these can be in introns or exons. Let's further restrict the genetic variant list to ones in the coding region of RUNX1. To do this, we need to identify the positions of the exons of RUNX1. We know that RUNX1 has a property called `hasRNATranscript`. Let's use `get_property_values` to find out more information on the RUNX1 transcripts.

In [10]:
# Get property values for hasRNATranscript of RUNX1
df_RUNX1['rnaTranscript'] = df_RUNX1['gene'].map(dc.get_property_values(df_RUNX1['gene'], 'hasRNATranscript'))
print(df_RUNX1)

             gene  ...                                      rnaTranscript
0  bio/hg38_RUNX1  ...  [bio/hg38_ENST00000300305.7, bio/hg38_ENST0000...

[1 rows x 4 columns]


### Identify the properties of RNATranscripts
Several dcids are associated with this property, and they point to nodes of class [RNATranscript](https://browser.datacommons.org/kg?dcid=RNATranscript). Let's verify that this is indeed the case and then check the properties of RNATranscript using `get_property_values`.

In [11]:
# Specify an rnaTranscript associated with RUNX1
RUNX1_transcript = df_RUNX1.iloc[0]['rnaTranscript'][0]

# Check what type the rnaTranscript is
print(dc.get_property_values([RUNX1_transcript], 'typeOf'))

# Identify properties of RNATranscript
dict_temp = dc.get_property_values(['RNATranscript'], 'domainIncludes', value_type='Property',  out=False)
for prop in dict_temp['RNATranscript']:
  print(prop)

{'bio/hg38_ENST00000300305.7': ['RNATranscript']}
codingCoordinates
exonCoordinates
exonFrame
makesProtein
refSeqID
transcriptionCoordinates


### Explore the difference between codingCoordinates and exonCoordinates
Two properties may be useful for identifying which genetic variants are in the coding region of RUNX1. Let's decide which one to use going forward by grabbing the values associated with both properties using `get_property_values`.

In [12]:
# Get the values for codingCoordinates and exonCoordinates
temp_dict = dc.get_property_values([RUNX1_transcript], 'codingCoordinates')
print('Coding Coordinate Values:')
for value in temp_dict[RUNX1_transcript]:
  print(value)
print('\n')
temp_dict = dc.get_property_values([RUNX1_transcript], 'exonCoordinates')
print('Exon Coordinate Values:')
for value in temp_dict[RUNX1_transcript]:
  print(value)

Coding Coordinate Values:
Position34792134To35048899


Exon Coordinate Values:
Position34787800To34792610
Position34799300To34799462
Position34834409To34834601
Position34859473To34859578
Position34880556To34880713
Position34886842To34887096
Position34892924To34892963
Position35048841To35049344
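
A quick arithmetic check on the printed values shows how differently the two properties behave: the single coding range spans far more bases than the exon ranges add up to, which suggests that the coding range also covers the introns between exons. The lengths below assume each range is half-open, so a range's length is `stop - start`.

```python
# Coordinates copied from the output above for transcript ENST00000300305.7.
coding = (34792134, 35048899)
exons = [
    (34787800, 34792610), (34799300, 34799462), (34834409, 34834601),
    (34859473, 34859578), (34880556, 34880713), (34886842, 34887096),
    (34892924, 34892963), (35048841, 35049344),
]

# One contiguous range versus the sum of eight separate ranges.
coding_span = coding[1] - coding[0]
exon_total = sum(exon_stop - exon_start for exon_start, exon_stop in exons)

print(f"coding span: {coding_span:,} bp")
print(f"exon total:  {exon_total:,} bp across {len(exons)} exons")
```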


### Find the exon coordinates reported for all RNA transcripts of RUNX1
From the values of codingCoordinates and exonCoordinates, we observe that codingCoordinates contains the range of base pairs spanning the entire coding region of RUNX1, including introns, whereas exonCoordinates reports the genomic coordinates of each exon of RUNX1. We want to find all genetic variants in exons of RUNX1, so going forward we will use the exonCoordinates of transcripts. There are multiple RNA transcripts reported for RUNX1. Let's make a unique list of all exonCoordinates recorded for RUNX1 using `get_property_values`.

In [13]:
# Initiate an empty set
RUNX1_exonCoordinates = set()

# Using get_property_values get all the exon coordinates for all rnaTranscripts
for rnaTranscript_dcid in df_RUNX1['rnaTranscript'][0]:
  temp_dict = dc.get_property_values([rnaTranscript_dcid], 'exonCoordinates')
  for item in temp_dict[rnaTranscript_dcid]:
    RUNX1_exonCoordinates.add(item)

# check the first few exons
RUNX1_exonCoordinates = list(RUNX1_exonCoordinates)
print(RUNX1_exonCoordinates[:5])

# check how many unique exon coordinates have been reported for RUNX1
print(len(RUNX1_exonCoordinates))

['Position35049167To35049298', 'Position34791754To34792610', 'Position35988129To35988171', 'Position35984571To35984830', 'Position34859473To34859578']
41


### Identify the genetic variants in the exon coding regions
Now that we know all the reported exon coordinates for RUNX1, we can identify which genetic variants are in exons. Note that RUNX1 has 4 - 9 exons depending on the isoform. Many of the exon coordinates from transcripts of different isoforms overlap with each other without being identical, which is why 41 unique coordinate ranges are observed. We are interested in genetic variants within any reported exon coordinates in this example and will therefore use them all for our next filtering step. When filtering by position, please remember that each coordinate range is half-open, [start, stop): the first number is considered inside the range, but the last number is not.

In [14]:
# initialize empty list for storing genetic variants in RUNX1 exons
RUNX1_exon_geneticVariants = []

# parse each exon coordinate string once into an (exon_start, exon_stop) pair
exon_ranges = []
for exonCoordinates in RUNX1_exonCoordinates:
  exon_start, exon_stop = exonCoordinates.replace('Position', '', 1).split('To')
  exon_ranges.append((int(exon_start), int(exon_stop)))

# for each genetic variant identify its hg38GenomicPosition and check if it's in an exon
for geneticVariant_dcid in RUNX1_geneticVariants:
  position = int(dc.get_property_values([geneticVariant_dcid], 'hg38GenomicPosition')[geneticVariant_dcid][0])
  # keep variants inside any half-open exon range [exon_start, exon_stop)
  if any(exon_start <= position < exon_stop for exon_start, exon_stop in exon_ranges):
    RUNX1_exon_geneticVariants.append(geneticVariant_dcid)
RUNX1_exon_geneticVariants = list(set(RUNX1_exon_geneticVariants))

# check how many of the genetic variants are in exons
print(len(RUNX1_exon_geneticVariants))

474
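
Because many of the exon ranges overlap, one optional refinement is to merge them first and then locate each position with a binary search. This is a sketch of that alternative, not the method used in the cell above; `merge_intervals` and `in_any_interval` are hypothetical helpers, and both treat ranges as half-open.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, stop) intervals."""
    merged = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def in_any_interval(position, merged):
    """True if position lies inside one of the merged, sorted intervals."""
    starts = [start for start, _ in merged]
    i = bisect_right(starts, position) - 1  # last interval starting at or before position
    return i >= 0 and position < merged[i][1]


# Three exon ranges taken from the outputs above; the first two overlap.
merged_exons = merge_intervals([
    (34787800, 34792610), (34791754, 34792610), (34859473, 34859578),
])
print(merged_exons)
print(in_any_interval(34792609, merged_exons), in_any_interval(34792610, merged_exons))
```

For 41 ranges the speed difference is negligible; merging mainly helps when the same check is repeated over many more ranges or variants.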


### Filter genetic variants for ones that have been clinically studied
Great! We've identified 474 genetic variants in exons that are worth further consideration. Let's narrow our candidate list further by identifying genetic variants with clinical data associated with them (reported in ClinVar). We can do this by checking whether the genetic variant has a numeric value for the property `clinVarAlleleID` using `get_property_values`. For the ones that have been clinically reported, we are going to grab the following additional clinical information for the genetic variant: `diseaseName`, `clinicalSignificance`, and `clinVarReviewStatus`.

In [15]:
# collect one row per clinically reported variant, then build the Data Frame once
column_names = ['name', 'clinVarAlleleID', 'diseaseName', 'clinicalSignificance', 'clinVarReviewStatus', 'dcid']
clinical_rows = []

for geneticVariant_dcid in RUNX1_exon_geneticVariants:
  clinVarID = dc.get_property_values([geneticVariant_dcid], 'clinVarAlleleID')[geneticVariant_dcid][0]
  if clinVarID.isdigit():
    # drop the 'bio/' prefix (str.strip would remove a set of characters, not a prefix)
    name = geneticVariant_dcid.replace('bio/', '', 1)
    diseaseName = dc.get_property_values([geneticVariant_dcid], 'diseaseName')[geneticVariant_dcid][0]
    clinicalSignificance = dc.get_property_values([geneticVariant_dcid], 'clinicalSignificance')[geneticVariant_dcid][0]
    clinVarReviewStatus = dc.get_property_values([geneticVariant_dcid], 'clinVarReviewStatus')[geneticVariant_dcid][0]
    clinical_rows.append({'name': name, 'clinVarAlleleID': clinVarID, 'diseaseName': diseaseName,
                          'clinicalSignificance': clinicalSignificance, 'clinVarReviewStatus': clinVarReviewStatus,
                          'dcid': geneticVariant_dcid})

# DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
df_RUNX1_genVar_clinical = pd.DataFrame(clinical_rows, columns=column_names)

# visualize the head of the dataframe containing the clinical info
print(df_RUNX1_genVar_clinical.head())

# see how many clinical genetic variants in RUNX1
print(df_RUNX1_genVar_clinical.shape[0])

           name  ...              dcid
0  rs1569084170  ...  bio/rs1569084170
1        643883  ...        bio/643883
2  rs1569002337  ...  bio/rs1569002337
3        839054  ...        bio/839054
4   rs150481777  ...   bio/rs150481777

[5 rows x 6 columns]
474


## Filter For Pathogenic Genetic Variants
There are 474 genetic variants in the exons of RUNX1 that have recorded clinical information. Let's keep only the ones reported as pathogenic to establish our final candidate list of genetic variants that affect the function of RUNX1. We'll do this by filtering our pandas dataframe with the clinical information on the genetic variants.

In [16]:
# identify the clinical significance types for the genetic variants
print(df_RUNX1_genVar_clinical.clinicalSignificance.unique())

['ClinSigPathogenic' 'ClinSigUncertain' 'ClinSigBenign'
 'ClinSigConflictingPathogenicity']
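
Before filtering, it can be useful to see how many variants fall under each label. The sketch below tallies labels with `collections.Counter` on hypothetical records (the `bio/example*` dcids are made up; only the label names come from the output above). On the real DataFrame, `df_RUNX1_genVar_clinical.clinicalSignificance.value_counts()` should give the equivalent tally.

```python
from collections import Counter

# Hypothetical records; only the label names are taken from the output above.
records = [
    {"dcid": "bio/example1", "clinicalSignificance": "ClinSigPathogenic"},
    {"dcid": "bio/example2", "clinicalSignificance": "ClinSigBenign"},
    {"dcid": "bio/example3", "clinicalSignificance": "ClinSigUncertain"},
    {"dcid": "bio/example4", "clinicalSignificance": "ClinSigPathogenic"},
]

# Count how many records carry each clinical significance label.
counts = Counter(record["clinicalSignificance"] for record in records)
for label, n in counts.most_common():
    print(label, n)
```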


In [17]:
# filter genetic variants for those with pathogenicity
clinSig = ['ClinSigPathogenic']
df_final_geneticVariants = df_RUNX1_genVar_clinical[df_RUNX1_genVar_clinical.clinicalSignificance.isin(clinSig)]

# check how many genetic variants made the final cut
print(df_final_geneticVariants.shape[0])

# identify the diseases associated with these pathogenic variants
print(df_final_geneticVariants.diseaseName.unique())

# print the final genetic variant dataframe
print(df_final_geneticVariants)

21
['not provided'
 'Familial platelet disorder with associated myeloid malignancy'
 'Acute myeloid leukemia']
             name  ...              dcid
0    rs1569084170  ...  bio/rs1569084170
26         871175  ...        bio/871175
64   rs1555889984  ...  bio/rs1555889984
65   rs1569084082  ...  bio/rs1569084082
90    rs587776811  ...   bio/rs587776811
95     rs74315451  ...    bio/rs74315451
129  rs1569008655  ...  bio/rs1569008655
143  rs1569084530  ...  bio/rs1569084530
183  rs1555899813  ...  bio/rs1555899813
192    rs74315450  ...    bio/rs74315450
205  rs1060499616  ...  bio/rs1060499616
207   rs121912498  ...   bio/rs121912498
240        840868  ...        bio/840868
266   rs121912499  ...   bio/rs121912499
321        647118  ...        bio/647118
326  rs1057519748  ...  bio/rs1057519748
377  rs1569061762  ...  bio/rs1569061762
394  rs1569061831  ...  bio/rs1569061831
408  rs1569061768  ...  bio/rs1569061768
444  rs1555884790  ...  bio/rs1555884790
463        869209  ...      

## Conclusion of RUNX1 Analysis
In this exercise we were able to filter the 8,028 genetic variants returned for chr21 down to a handful that meet our specific parameters. To do this, we used multiple datasets from UCSC Genome Browser, NCBI/gene, and ClinVar, all in a single Python notebook. We found that there are 21 genetic variants in exons of RUNX1 that have been reported to be pathogenic by clinical data. We also learned that these pathogenic variants are associated with familial platelet disorder with associated myeloid malignancy and with acute myeloid leukemia. Such analyses can be performed using any gene and are a way to easily identify candidate genetic variants of interest. The synthesis of data of different types across multiple databases makes complex analyses easy to run, all in a single Python notebook.