# Graph Databases: Intro to Neo4J and Cypher
![Neo4J Logo](http://neo4j.com/wp-content/themes/neo4jweb/assets/images/neo4j-logo-2015.png)
![PyNeo](http://py2neo.org/v3/_static/py2neo-2016.260x152.png)

# Introduction

Neo4J differs from most other databases: it persists a graph as a **linked-list (adjacency) representation** in which every node stores direct pointers to its relationships. The alternative layout is the **triple store**, which stores each (subject, predicate, object) triple as a row with three columns. The advantage of Neo4J's layout is that finding the neighbours of a node only requires following pointers, so the cost per neighbour is O(1) and does not grow with the size of the graph, whereas a triple layout without an index has to scan all N triples (O(N)). This makes Neo4J well suited to queries that require **graph traversal**, for example route planning. Since a traversal always needs a starting point, Neo4J can also maintain **indexes** to find the starting nodes of a traversal quickly, thus avoiding a full scan of all nodes.
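To make the difference between the two layouts concrete, here is a small in-memory sketch. It is a toy illustration only and not Neo4J's actual storage engine; the names `triples`, `adjacency` and both helper functions are made up for this example.

```python
# Toy comparison of two in-memory graph layouts (not Neo4J's real storage format).
triples = [
    ("Alice", "KNOWS", "Bob"),
    ("Alice", "KNOWS", "Carol"),
    ("Bob", "KNOWS", "Dave"),
]


def neighbours_from_triples(triples, source):
    """Triple layout without an index: every row is scanned to find the subject."""
    return [obj for subj, _, obj in triples if subj == source]


# Adjacency layout: each node keeps direct references to its neighbours.
adjacency = {}
for subj, _, obj in triples:
    adjacency.setdefault(subj, []).append(obj)
    adjacency.setdefault(obj, [])


def neighbours_from_adjacency(adjacency, source):
    """Adjacency layout: one lookup, then follow the stored references."""
    return adjacency[source]


print(neighbours_from_triples(triples, "Alice"))
print(neighbours_from_adjacency(adjacency, "Alice"))
```

Both calls return the same neighbours, but the first one touches every triple while the second one only touches the entries stored for the source node.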


# Py2Neo

1. In Jupyter (start window), open a terminal (click new > terminal) and install py2neo with: **pip install py2neo**
2. Py2Neo essentials: https://py2neo.org/v4/
3. Neo4J Cypher query language: https://neo4j.com/docs/cypher-manual/current/




In [1]:
import py2neo

In [2]:
from py2neo import Graph

ip_virtual_wall = "193.191.169.38"
user_pass_tup = ('neo4j', 'admin')  # (user, password)
port = '7687'

# build the Bolt connection string
connect_string = "bolt://" + ip_virtual_wall + ":" + port

# connect to the authenticated graph database
graph = Graph(connect_string, auth=user_pass_tup)
print(graph)


<Graph database=<Database uri='bolt://193.191.169.38:7687' secure=False user_agent='py2neo/4.3.0 neobolt/1.7.16 Python/3.7.6-final-0 (linux)'> name='data'>
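As an alternative to hard-coding the address and password in the notebook, the connection settings could be read from environment variables. The sketch below only assembles the connection string; the variable names `NEO4J_HOST`, `NEO4J_USER` and `NEO4J_PASSWORD` and the helper `build_bolt_uri` are assumptions made for this example, not something py2neo or Neo4J prescribes.

```python
import os


def build_bolt_uri(host, port=7687):
    """Assemble a Bolt connection string such as bolt://localhost:7687."""
    return "bolt://{}:{}".format(host, port)


# Hypothetical environment variable names; fall back to local defaults.
host = os.environ.get("NEO4J_HOST", "localhost")
auth = (os.environ.get("NEO4J_USER", "neo4j"), os.environ.get("NEO4J_PASSWORD", ""))

connect_string = build_bolt_uri(host)
print(connect_string)

# With py2neo installed and a server running, the connection would then be:
# graph = Graph(connect_string, auth=auth)
```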


In [3]:
# reset the graph: this deletes ALL nodes and relationships in the database
graph.delete_all()

# 1. Getting familiar with Py2Neo
### a. Creating nodes and edges in Py2Neo

* https://py2neo.org/v4/data.html#node-and-relationship-objects
* https://py2neo.org/v4/database.html#the-graph

In [4]:
from py2neo import Node, Relationship

alice = Node("Person", name="Alice")
bob = Node("Person", name="Bob")
alice_knows_bob = Relationship(alice, "KNOWS", bob, since=2002)

# graph.create persists the relationship (and both of its nodes) in the Neo4J database
graph.create(alice_knows_bob)

#### Check the neo4J interface at < ip >:7474 to verify the nodes were created

* Click on the database icon (left + top), and click on one of the node labels or relationship types




### b. Update 

In [5]:
alice.add_label("Woman")

alice

(_6627:Person:Woman {name: 'Alice'})

#### Check the Neo4J interface again: is the new node label present?

* <b> Important: </b> Client side objects are not automatically persisted, you'll have to push your changes!

In [6]:
alice.add_label("Woman")
alice['age'] = 33
bob["age"] = 44

#push the changes
graph.push(alice)
graph.push(bob)

* Properties are closely related to dictionaries, labels to sets

In [7]:
print(alice)
print(bob)

print(alice.labels)
print(dict(alice))

print(dict(alice_knows_bob))

(_6627:Person:Woman {age: 33, name: 'Alice'})
(_6628:Person {age: 44, name: 'Bob'})
:Person:Woman
{'name': 'Alice', 'age': 33}
{'since': 2002}


# Exercise 1: Exploring (your) family tree with Py2Neo

### 1a. Create a list of nodes and relationships representing a family tree (for example the Simpsons, but you are free to choose another family)

![Simpsons](http://www.englishexercises.org/exercisesmaker/uploads/images/1621353/nueva.png)

Details on <a href="http://simpsons.wikia.com/wiki/Simpson_family"> wikia </a>


* Use the relationships: FATHER_OF, MOTHER_OF, TOGETHER_WITH
* Add birthyear and first name as properties and gender as a label
* Calculate the number of relationships and family members
* Check in Neo4J browser for correctness (are all relationships present + node labels?)

In [8]:
graph.delete_all()

abe = Node("Person", name="Abe", birthyear=None)
abe.add_label("Man")
mona = Node("Person", name="Mona", birthyear=1929)
mona.add_label("Woman")

herb = Node("Person", name="Herb", birthyear=None)
herb.add_label("Man")


homer = Node("Person", name="Homer", birthyear=1950)
homer.add_label("Man")
marge = Node("Person", name="Marge", birthyear=1960)
marge.add_label("Woman")
bart = Node("Person", name="Bart", birthyear=1985)
bart.add_label("Man")
lisa = Node("Person", name="Lisa", birthyear=1987)
lisa.add_label("Woman")
maggie = Node("Person", name="Maggie", birthyear=1989)
maggie.add_label("Woman")


clancy = Node("Person", name="Clancy", birthyear=1918)
clancy.add_label("Man")
jacqueline = Node("Person", name="Jacqueline", birthyear=1919)
jacqueline.add_label("Woman")

patty = Node("Person", name="Patty", birthyear=1952)
patty.add_label("Woman")
selma = Node("Person", name="Selma", birthyear=1952)
selma.add_label("Woman")
ling = Node("Person", name="Ling", birthyear=1988)
ling.add_label("Woman")

# the other nodes are added to the subgraph through the relationships they take part in
nodes = [homer, marge, bart, lisa, maggie]

In [9]:
edges = [
    Relationship(abe, "TOGETHER_WITH", mona),
    Relationship(abe, "FATHER_OF", herb),
    Relationship(abe, "FATHER_OF", homer),
    Relationship(mona, "MOTHER_OF", herb),
    Relationship(mona, "MOTHER_OF", homer),
    
    Relationship(homer, "TOGETHER_WITH", marge),
    Relationship(homer, "FATHER_OF", bart),
    Relationship(homer, "FATHER_OF", lisa),
    Relationship(homer, "FATHER_OF", maggie),
    Relationship(marge, "MOTHER_OF", bart),
    Relationship(marge, "MOTHER_OF", lisa),
    Relationship(marge, "MOTHER_OF", maggie),
    
    Relationship(clancy, "TOGETHER_WITH", jacqueline),
    Relationship(clancy, "FATHER_OF", marge),
    Relationship(clancy, "FATHER_OF", selma),
    Relationship(clancy, "FATHER_OF", patty),
    Relationship(jacqueline, "MOTHER_OF", marge),
    Relationship(jacqueline, "MOTHER_OF", selma),
    Relationship(jacqueline, "MOTHER_OF", patty),
    
    Relationship(selma, "MOTHER_OF", ling)
]

In [10]:
from py2neo import Subgraph

subgraph = Subgraph(nodes, edges)
graph.create(subgraph)


In [11]:
print("The number of nodes in my family tree is: " + str(len(graph.nodes)))
print("The number of relationships in my family tree is: " + str(len(graph.relationships)))


The number of nodes in my family tree is: 13
The number of relationships in my family tree is: 20


### Export the Neo4J graph visualization to svg and add it to your lab submission (double click on the markdown cells to see how to import images)
![graph.svg](graph.svg)

In [12]:
#export from the browser, no code needed here!

## Intermezzo: Finding nodes and edges

* Have a look at the py2neo documentation, specifically for the following functions:
    - match, match_one
    - merge
    - New in version v4 is that you can select with .nodes and .relationships!

### 1b. Finding relationships and nodes

* Find all couples in the family tree and print their names
* Find all of your direct relatives and print their names, and the relationship
* Find all males, print their names and birth year

In [13]:
for rel in graph.match(r_type="TOGETHER_WITH"):
    print(rel.start_node["name"] + " is together with " + rel.end_node["name"])

Homer is together with Marge
Clancy is together with Jacqueline
Abe is together with Mona


In [14]:
# Homer's parents and children (no TOGETHER_WITH); a set of nodes matches relationships in either direction
for rel in graph.match({homer}, r_type=["FATHER_OF", "MOTHER_OF"]):
    if rel.end_node["name"] != "Homer":
        print(rel.end_node["name"])
    else:
        print(rel.start_node["name"])

Maggie
Abe
Lisa
Bart
Mona


In [15]:
# Tip: In the graph class you can specifically match nodes or relationships
matcher = py2neo.NodeMatcher(graph)
men = matcher.match("Man")

for node in men:
    print(node["name"] + " : " + str(node["birthyear"]))

Homer : 1950
Bart : 1985
Herb : None
Clancy : 1918
Abe : None


### 1c. Find all siblings and write a function to add all SIBLING_OF relationships

* Sibling: A sibling is one of two or more individuals having one or both parents in common.


In [16]:
# this function creates each sibling relationship exactly once:
# node1.identity < node2.identity keeps only one direction per pair
def createSiblings(graph):
    matcher = py2neo.NodeMatcher(graph)
    sibling_rels = []
    for node1 in matcher.match("Person"):
        for node2 in matcher.match("Person"):
            if (node1 != node2 and node1.identity < node2.identity):
                father1 = graph.match((None, node1), r_type="FATHER_OF").first()
                father2 = graph.match((None, node2), r_type="FATHER_OF").first()
                mother1 = graph.match((None, node1), r_type="MOTHER_OF").first()
                mother2 = graph.match((None, node2), r_type="MOTHER_OF").first()

                # replace each relationship by its parent node (None stays None if the parent is unknown)
                if father1 is not None: father1 = father1.start_node
                if father2 is not None: father2 = father2.start_node
                if mother1 is not None: mother1 = mother1.start_node
                if mother2 is not None: mother2 = mother2.start_node

                # a shared parent must actually exist: two unknown (None) parents do not count as equal
                same_father = father1 is not None and father1 == father2
                same_mother = mother1 is not None and mother1 == mother2
                if same_father or same_mother:
                    siblings_rel = Relationship(node1, "IS_SIBLING_OF", node2)
                    print(siblings_rel)
                    sibling_rels.append(siblings_rel)
    return sibling_rels

# add siblings to graph
siblings = createSiblings(graph)
siblings_subgraph = Subgraph(nodes, siblings)
graph.create(siblings_subgraph)

(Homer)-[:IS_SIBLING_OF {}]->(Herb)
(Bart)-[:IS_SIBLING_OF {}]->(Maggie)
(Bart)-[:IS_SIBLING_OF {}]->(Lisa)
(Maggie)-[:IS_SIBLING_OF {}]->(Lisa)
(Selma)-[:IS_SIBLING_OF {}]->(Marge)
(Selma)-[:IS_SIBLING_OF {}]->(Patty)
(Marge)-[:IS_SIBLING_OF {}]->(Patty)
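The sibling rule can also be cross-checked without a database. The sketch below applies the definition (at least one parent in common) to a plain list of (parent, child) pairs copied from the FATHER_OF and MOTHER_OF edges of exercise 1a; `parent_of` and `sibling_pairs` are names made up for this example.

```python
from itertools import combinations

# (parent, child) pairs copied from the FATHER_OF / MOTHER_OF edges of the family tree.
parent_of = [
    ("Abe", "Herb"), ("Abe", "Homer"), ("Mona", "Herb"), ("Mona", "Homer"),
    ("Homer", "Bart"), ("Homer", "Lisa"), ("Homer", "Maggie"),
    ("Marge", "Bart"), ("Marge", "Lisa"), ("Marge", "Maggie"),
    ("Clancy", "Marge"), ("Clancy", "Selma"), ("Clancy", "Patty"),
    ("Jacqueline", "Marge"), ("Jacqueline", "Selma"), ("Jacqueline", "Patty"),
    ("Selma", "Ling"),
]


def sibling_pairs(parent_of):
    """Return each unordered pair of people sharing at least one parent, once."""
    parents = {}
    for parent, child in parent_of:
        parents.setdefault(child, set()).add(parent)
    pairs = set()
    # combinations() yields every pair once, so no reverse duplicates appear
    for a, b in combinations(sorted(parents), 2):
        if parents[a] & parents[b]:
            pairs.add((a, b))
    return pairs


for pair in sorted(sibling_pairs(parent_of)):
    print(pair)
```

People without a known parent never appear as a key in `parents`, so two unknown parents cannot be mistaken for a shared one.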


# Exercise 2: Exploring family tree with Cypher
## Intermezzo: Cypher query language

The Cypher query language is quite intuitive. You <b>MATCH</b> graph patterns in a rather visual fashion:
* Nodes: `(n1:labelname {prop:'propvalue'})`
* Edges: `-->` (directed), `--` (undirected), `-[:REL]->` (directed, with relationship type REL)

* In the current version of the Py2Neo API we can convert the result of a query directly to a dataframe

* The <b>WITH</b> clause allows query parts to be chained together, piping the results from one to be used as starting points or criteria in the next. (see also https://neo4j.com/docs/cypher-manual/current/clauses/with/)

In [17]:
# all males
query = """
MATCH (n:Man)
RETURN n
"""
df = graph.run(query).to_data_frame()
df

Unnamed: 0,n
0,"{'birthyear': 1950, 'name': 'Homer'}"
1,"{'birthyear': 1985, 'name': 'Bart'}"
2,{'name': 'Herb'}
3,"{'birthyear': 1918, 'name': 'Clancy'}"
4,{'name': 'Abe'}


In [18]:
# all males younger than 40 in the (hard-coded) year 2020
query = """
MATCH (n:Man)
WHERE n.birthyear + 40 > 2020 
RETURN n
"""
df = graph.run(query).to_data_frame()
df

Unnamed: 0,n
0,"{'birthyear': 1985, 'name': 'Bart'}"
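The year and the age limit in the query above are hard-coded. A possible variation, sketched here, passes them as Cypher parameters (`$age`, `$year`) through the keyword arguments of `graph.run`. The helper name `men_younger_than` and the constant `YOUNGER_THAN_QUERY` are made up for this example.

```python
# Sketch: pass the reference year and the age limit as Cypher parameters.
YOUNGER_THAN_QUERY = """
MATCH (n:Man)
WHERE n.birthyear + $age > $year
RETURN n.name AS name, n.birthyear AS birthyear
"""


def men_younger_than(graph, age, year):
    """Run the query on a py2neo Graph (or any object with a compatible run method)."""
    return graph.run(YOUNGER_THAN_QUERY, age=age, year=year)


# Usage against the graph of this notebook (needs a running database):
# df = men_younger_than(graph, age=40, year=2020).to_data_frame()
```

Parameters keep the query text constant, so the same string can be reused for another year without editing it.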


In [19]:
# all males which have children
query = """
MATCH (n:Man)-[:FATHER_OF]->(m)
RETURN n
"""

cursor = graph.run(query)
for record in cursor:
    print(record)

<Record n=(_6629:Man:Person {birthyear: 1950, name: 'Homer'})>
<Record n=(_6629:Man:Person {birthyear: 1950, name: 'Homer'})>
<Record n=(_6629:Man:Person {birthyear: 1950, name: 'Homer'})>
<Record n=(_6632:Man:Person {birthyear: 1918, name: 'Clancy'})>
<Record n=(_6632:Man:Person {birthyear: 1918, name: 'Clancy'})>
<Record n=(_6632:Man:Person {birthyear: 1918, name: 'Clancy'})>
<Record n=(_6633:Man:Person {name: 'Abe'})>
<Record n=(_6633:Man:Person {name: 'Abe'})>


In [20]:
# all distinct males which have children
query = """
MATCH (n:Man)-[:FATHER_OF]->(m)
RETURN DISTINCT(n)
"""
df = graph.run(query).to_data_frame()
df

Unnamed: 0,n
0,"{'birthyear': 1950, 'name': 'Homer'}"
1,"{'birthyear': 1918, 'name': 'Clancy'}"
2,{'name': 'Abe'}


### Let's go ALL IN with dataframes!


In [21]:
# WITH is a method to pull complicated expressions apart
query = """
MATCH (n:Man)-[:FATHER_OF]->(m)
WITH DISTINCT(n) as fathers
RETURN fathers.name, fathers.birthyear
"""
df = graph.run(query).to_data_frame()
df

Unnamed: 0,fathers.name,fathers.birthyear
0,Homer,1950.0
1,Clancy,1918.0
2,Abe,


### 2a. By adding an ORDER BY (ASC|DESC) you can sort the results. Write a query to return all sons and order them by age, oldest to youngest.

In [22]:
query =  """
MATCH (n)-[:FATHER_OF | :MOTHER_OF]->(m:Man)
WITH DISTINCT(m) as sons
RETURN sons.name, sons.birthyear
ORDER BY sons.birthyear
"""
df = graph.run(query).to_data_frame()
df

Unnamed: 0,sons.name,sons.birthyear
0,Homer,1950.0
1,Bart,1985.0
2,Herb,


## Intermezzo: Aggregations

* AVG, COUNT, SUM,  
* MIN, MAX
* COLLECT
* ..

Have a look at the page describing 
<a href="https://neo4j.com/docs/cypher-manual/current/functions/aggregating/"> aggregations in Cypher </a>

### 2b. What are the average, minimum and maximum age of all mothers?

In [23]:
query = """
MATCH (n)-[:MOTHER_OF]->(m)
WITH DISTINCT(n) as mothers
RETURN avg(2020-mothers.birthyear), min(2020-mothers.birthyear), max(2020-mothers.birthyear)
"""
df = graph.run(query).to_data_frame()
df

Unnamed: 0,avg(2020-mothers.birthyear),min(2020-mothers.birthyear),max(2020-mothers.birthyear)
0,80.0,60,101
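As a sanity check on the aggregation result, the same numbers can be recomputed in plain Python from the birth years entered in exercise 1a. This is only a cross-check under the same assumption as the query (reference year 2020); `mother_birthyears` is a name made up for this example.

```python
from statistics import mean

# Birth years of the four distinct mothers, copied from the node definitions in exercise 1a.
mother_birthyears = {"Mona": 1929, "Marge": 1960, "Jacqueline": 1919, "Selma": 1952}
reference_year = 2020  # same hard-coded year as in the Cypher query

ages = [reference_year - year for year in mother_birthyears.values()]
print(mean(ages), min(ages), max(ages))
```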


### 2c. Get all people who are both father and son

In [24]:
query = """
MATCH (n:Man)-[:FATHER_OF]->(m:Man)-[:FATHER_OF]->(p)
WITH DISTINCT(m) as fathers_and_sons
RETURN fathers_and_sons.name
"""
df = graph.run(query).to_data_frame()
df

Unnamed: 0,fathers_and_sons.name
0,Homer


### 2d. Using wildcard syntax (below), output all grandpa(-father-)grandchild relations


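* `[r*2]` matches exactly two consecutive relationships (two hops)
* `[r*2..3]` matches a range: two to three hops
* `[r*0..]` has a lower bound of 0, so the starting node itself is included
* `[r*]` matches paths of any length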


In [25]:
query = """
MATCH (n:Man)-[:FATHER_OF*2]->(m)
RETURN n.name,m.name
"""
df = graph.run(query).to_data_frame()
df

Unnamed: 0,n.name,m.name
0,Abe,Bart
1,Abe,Lisa
2,Abe,Maggie
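What a hop range such as `*2` or `*0..1` selects can be imitated on a plain edge list. The sketch below is a toy traversal over the FATHER_OF edges of the family tree, not how Neo4J evaluates the pattern; it assumes an acyclic graph and does not enforce Cypher's rule that a relationship is used at most once per path. The names `father_of` and `hops` are made up for this example.

```python
# FATHER_OF edges copied from the family tree, as (father, child) pairs.
father_of = [
    ("Abe", "Herb"), ("Abe", "Homer"),
    ("Homer", "Bart"), ("Homer", "Lisa"), ("Homer", "Maggie"),
    ("Clancy", "Marge"), ("Clancy", "Selma"), ("Clancy", "Patty"),
]


def hops(edges, min_hops, max_hops):
    """Return (start, end, length) for every directed path of min_hops..max_hops edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    all_nodes = sorted({p for p, _ in edges} | {c for _, c in edges})

    results = []
    frontier = [(node, node) for node in all_nodes]  # zero-length paths
    for length in range(0, max_hops + 1):
        if length >= min_hops:
            results.extend((start, end, length) for start, end in frontier)
        # extend every path in the frontier by one more edge
        frontier = [(start, nxt) for start, end in frontier
                    for nxt in children.get(end, [])]
    return results


print(hops(father_of, 2, 2))
```

With a lower bound of 0 the result also contains zero-length paths, that is, every node paired with itself.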


### List functions on paths

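* <a href="https://neo4j.com/docs/cypher-manual/current/functions/list/">Cypher list functions</a>, which can be applied to the node and relationship lists of a path:
* **EXTRACT**( identifier in collection | expression ) (removed in Neo4j 4.0; the list comprehension [ identifier IN collection | expression ] replaces it)
* **REDUCE**( accumulator = initial, identifier in collection | expression )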
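For readers who know Python better than Cypher, the accumulator pattern of REDUCE corresponds closely to `functools.reduce`. The sketch below uses a plain list of dictionaries as a stand-in for the nodes of a path; it is toy data, not a py2neo `Path` object.

```python
from functools import reduce

# A path represented as a plain list of node dictionaries (toy stand-in for nodes(p)).
path_nodes = [{"name": "Clancy"}, {"name": "Marge"}, {"name": "Lisa"}]

# Cypher: reduce(names = [], n IN nodes(p) | names + n.name)
names = reduce(lambda acc, node: acc + [node["name"]], path_nodes, [])

# Cypher list comprehension: [n IN nodes(p) | n.name]
names_comprehension = [node["name"] for node in path_nodes]

print(names)
print(names_comprehension)
```

Both forms collect one value per node; REDUCE is the more general of the two because the accumulator does not have to be a list.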

### 2e. Write a query that returns all directed(!) paths of length 0..3 in the family tree that start at a male and end at a female. Return 3 columns:
* 'path' is a list of the names of the nodes along each path
* 'relations' is a list of the relationships occurring along a path
* 'length' is the length of the corresponding path
* order the paths according to length (short to long)

**NOTE:** Make sure the result is human-readable (clean!)

In [26]:
query = """
MATCH p = (:Man)-[:FATHER_OF | :MOTHER_OF *0..3]->(:Woman)
WITH reduce(path=[], n IN nodes(p) | path + n.name ) as path,
     reduce(relationships=[], n IN relationships(p) | relationships + type(n)) as relationships,
     length(p) as length
RETURN path, relationships, length
ORDER BY length
"""
df = graph.run(query).to_data_frame()
df

Unnamed: 0,path,relationships,length
0,"[Homer, Lisa]",[FATHER_OF],1
1,"[Homer, Maggie]",[FATHER_OF],1
2,"[Clancy, Patty]",[FATHER_OF],1
3,"[Clancy, Marge]",[FATHER_OF],1
4,"[Clancy, Selma]",[FATHER_OF],1
5,"[Clancy, Marge, Maggie]","[FATHER_OF, MOTHER_OF]",2
6,"[Clancy, Marge, Lisa]","[FATHER_OF, MOTHER_OF]",2
7,"[Clancy, Selma, Ling]","[FATHER_OF, MOTHER_OF]",2
8,"[Abe, Homer, Lisa]","[FATHER_OF, FATHER_OF]",2
9,"[Abe, Homer, Maggie]","[FATHER_OF, FATHER_OF]",2


### And Finally: Operations on paths

* <a href="https://neo4j.com/docs/cypher-manual/current/clauses/match/#query-shortest-path"> Shortest Path Query </a>
* <a href="https://neo4j.com/docs/cypher-manual/current/clauses/match/#all-shortest-paths"> All Shortest Paths </a>

**NOTE:** These functions will be used in the next notebook on route planning!



# You finished part I, proceed to part II on route planning!