# **Final Project**

## Annie Lydens
#### *CSPB 3287*

----

#### **Project Overview and Motivation** 
For my semester project, I combined several scientific datasets from the DataArc project, which spans the fields of paleoecology, archeology, and climatology, to build my own connected “concept map”.

The DataArc project is an open-source effort to create conceptual mappings between datasets from different scientific disciplines, so scientists can easily share data and work together to identify emerging patterns due to climate change.

The DataArc project addresses a specific problem: many researchers use datasets that contain the same or similar entities and ideas, but these datasets are not connected to each other, so researchers cannot easily draw conclusions or gather information across them. Scientists are essentially working within data silos.

A primary goal of the DataArc project is to link different datasets by mapping the concepts within them to shared topics and concepts, so researchers can easily discover new connections within a shared “concept map”. Once the main concept map is established, the hope is that scientists from different disciplines can continue to expand the concept map to empower interdisciplinary studies.

The DataArc project aims to make connections on a large scale across many different datasets and disciplines. For my project, I chose to tackle a subset of the concept map by connecting items from four different datasets in a graph database using Neo4j. The nodes in the completed graph represented different concepts like a “combinator” (an item from a dataset, like a discovery of bone fragments), a “topic”, or a “user” (researcher), and were connected with edges if, for example, an item was tagged with a topic or had been contributed by a user.
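The node and edge structure described above can be sketched as a single parameterized Cypher statement wrapped in a transaction function. This is only an illustration under stated assumptions: the `Combinator` and `User` labels, the `title` property, and the `TAGGED_WITH` and `CONTRIBUTED` relationship types are hypothetical names chosen for the example. Only the `Topic` label and its `name` property appear in this notebook's own queries.

```python
# Hypothetical schema: only the Topic label and its `name` property come from
# this notebook; every other label, property, and relationship type is assumed.
LINK_ITEM_QUERY = (
    "MERGE (c:Combinator {title: $title}) "
    "MERGE (t:Topic {name: $topic}) "
    "MERGE (u:User {name: $user}) "
    "MERGE (c)-[:TAGGED_WITH]->(t) "
    "MERGE (u)-[:CONTRIBUTED]->(c)"
)


def link_item(tx, title, topic, user):
    """Create (or reuse) an item, its topic, and its contributor, then connect them."""
    # MERGE avoids duplicate nodes and edges if the same item is loaded twice.
    tx.run(LINK_ITEM_QUERY, title=title, topic=topic, user=user)
```

A function like this would be passed to `session.write_transaction(link_item, ...)`, in the same way the read queries later in this notebook are passed to `session.read_transaction`.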

I was in contact with Dr. Rachel Opitz, one of the main researchers on the DataArc project, to gather initial information to work on this semester project. I will continue working with her and others on the project to assist with building out the concept map this summer.

I am presenting my results in a Jupyter notebook to show examples of how a researcher could perform dynamic queries about a conceptual entity (like a specific topic or name) and get a result that combines multiple data sources and shows the conceptual links between those sources.

I chose a Jupyter notebook as the user interface for my project because the end users of my project are familiar with Jupyter notebooks and use them often in their work. Additionally, a Jupyter notebook allows me to provide in-line documentation and comments, which is an important aspect of this project for the DataArc researchers.


In [None]:
import sys, time
from neo4j import GraphDatabase

#### Connecting to the database

In [None]:
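import os
url = "bolt://hobby-ndpkpgedihfpgbkeffpfgnel.dbs.graphenedb.com:24787"
# Read the credentials from environment variables rather than hard-coding them.
driver = GraphDatabase.driver(url, auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]))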
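
Because the connection depends on the `NEO4J_USERNAME` and `NEO4J_PASSWORD` environment variables, it can help to check for them before creating the driver. The helper below is a minimal sketch; `neo4j_auth_from_env` is a hypothetical name and not part of the Neo4j driver.

```python
import os


def neo4j_auth_from_env(env=os.environ):
    """Return (username, password) from the environment, failing loudly if unset."""
    required = ("NEO4J_USERNAME", "NEO4J_PASSWORD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["NEO4J_USERNAME"], env["NEO4J_PASSWORD"]
```

The returned tuple could then be passed directly as the `auth` argument, for example `GraphDatabase.driver(url, auth=neo4j_auth_from_env())`.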

#### Issuing a query

In [None]:
# class BookmarksExample:
#     def __init__(self, uri, auth):
#         self.driver = GraphDatabase.driver(uri, auth=auth)

#     def close(self):
#         self.driver.close()

#     # Create a person node.
#     @classmethod
#     def create_person(cls, tx, name):
#         tx.run("CREATE (:Person {name: $name})", name=name)

#     # Create an employment relationship to a pre-existing company node.
#     # This relies on the person first having been created.
#     @classmethod
#     def employ(cls, tx, person_name, company_name):
#         tx.run("MATCH (person:Person {name: $person_name}) "
#                "MATCH (company:Company {name: $company_name}) "
#                "CREATE (person)-[:WORKS_FOR]->(company)",
#                person_name=person_name, company_name=company_name)

#     # Create a friendship between two people.
#     @classmethod
#     def create_friendship(cls, tx, name_a, name_b):
#         tx.run("MATCH (a:Person {name: $name_a}) "
#                "MATCH (b:Person {name: $name_b}) "
#                "MERGE (a)-[:KNOWS]->(b)",
#                name_a=name_a, name_b=name_b)

#     # Match and display all friendships.
#     @classmethod
#     def print_friendships(cls, tx):
#         result = tx.run("MATCH (a)-[:KNOWS]->(b) RETURN a.name, b.name")
#         for record in result:
#             print("{} knows {}".format(record["a.name"], record["b.name"]))

#     def main(self):
#         saved_bookmarks = []  # To collect the session bookmarks

#         # Create the first person and employment relationship.
#         with self.driver.session() as session_a:
#             session_a.write_transaction(self.create_person, "Alice")
#             session_a.write_transaction(self.employ, "Alice", "Wayne Enterprises")
#             saved_bookmarks.append(session_a.last_bookmark())

#         # Create the second person and employment relationship.
#         with self.driver.session() as session_b:
#             session_b.write_transaction(self.create_person, "Bob")
#             session_b.write_transaction(self.employ, "Bob", "LexCorp")
#             saved_bookmarks.append(session_b.last_bookmark())

#         # Create a friendship between the two people created above.
#         with self.driver.session(bookmarks=saved_bookmarks) as session_c:
#             session_c.write_transaction(self.create_friendship, "Alice", "Bob")
#             session_c.read_transaction(self.print_friendships)
# ######   
            
# def create_person_node(tx, name):
#     tx.run("CREATE (a:Person {name: $name})", name=name)

# def match_person_node(tx, name):
#     result = tx.run("MATCH (a:Person {name: $name}) RETURN count(a)", name=name)
#     return result.single()[0]

# def add_person(name):
#     with driver.session() as session:
#         session.write_transaction(create_person_node, name)
#         persons = session.read_transaction(match_person_node, name)
#         return persons

In [16]:
def match_topic(tx, name):
    result = tx.run("MATCH (t:Topic {name: $name}) RETURN t", name=name)
    return result.single()
        
def query_topic(name):
    with driver.session() as session:
        topics = session.read_transaction(match_topic, name)
        return topics

print(query_topic("humans"))

<Record t=<Node id=51 labels={'Topic'} properties={'name': 'humans', 'subjectIdentifier': ['http://wandora.org/si/temp/1484334746766-6'], 'topicId': 'L1513996235_topic68_1587130146743'}>>
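
The topic lookup above returns a single node. A query that follows edges from that topic to the items tagged with it would show how results from several datasets can be combined. The sketch below assumes a `Combinator` label, a `TAGGED_WITH` relationship type, and `title` and `dataset` properties on each item; none of these names are confirmed by the graph shown here, so they would need to be adjusted to the real schema.

```python
def match_items_for_topic(tx, name):
    """Return (title, dataset) pairs for items tagged with the named topic."""
    # Combinator, TAGGED_WITH, title, and dataset are assumed names.
    result = tx.run(
        "MATCH (c:Combinator)-[:TAGGED_WITH]->(t:Topic {name: $name}) "
        "RETURN c.title AS title, c.dataset AS dataset",
        name=name,
    )
    # Read the records inside the transaction function, before it completes.
    return [(record["title"], record["dataset"]) for record in result]


def group_by_dataset(rows):
    """Group item titles by the dataset they came from."""
    grouped = {}
    for title, dataset in rows:
        grouped.setdefault(dataset, []).append(title)
    return grouped
```

With a live connection, this would be called as `session.read_transaction(match_items_for_topic, "humans")`, and `group_by_dataset` would then show which data sources share that topic.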


In [None]:
from py2neo import Graph,Node,Relationship

In [None]:
import os
graph = Graph("bolt://hobby-ndpkpgedihfpgbkeffpfgnel.dbs.graphenedb.com:24787", auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]))

In [None]:
# Cypher queries
number_of_person_nodes = "MATCH (p:Person) RETURN count(p)"
number_of_movie_nodes = "MATCH (m:Movie) RETURN count(m)"
# Evaluate the Cypher queries
result_persons = graph.evaluate(number_of_person_nodes)
result_movies = graph.evaluate(number_of_movie_nodes)
# Print the results
print(f"Number of Person nodes: {result_persons}; number of Movie nodes: {result_movies}")

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
df_result_count = pd.Series({'Person': result_persons, 'Movie': result_movies})
df_result_count.plot(kind='bar', color=['blue', 'darkorange'])
plt.xlabel('Node Label')
plt.ylabel('Number of Nodes')


----

#### **Project Relations and Data Sources** 
20% - Did the database project involve multiple relations and/or data sources?

As mentioned in the project overview, I mapped nodes from four different datasets, which come from different scientific disciplines.


In the visualization below, you can see the nodes in the graph that represent the different datasets. Edges between the nodes are not yet displayed.


In [None]:
from IPython.display import IFrame

IFrame(src='./vis.html', width=900, height=700)


----

#### **Project Platform and Uses** 

20% - Does the project effectively use the chosen platform? For example, does it support input boxes if appropriate, or provide directions on how the data is populated and obtained?


