# Introduction

This notebook introduces basic Python methods for querying a Neo4j database. It uses the [official driver](https://neo4j.com/docs/api/python-driver/current/) for these interactions. The driver does not come preinstalled in Google Colab, so we install it first. We will use a free [Neo4j Sandbox database](https://sandbox.neo4j.com). You will need the URI and password of your own Sandbox instance to run the following code.

_It is necessary to have the database populated prior to running this notebook._

In [1]:
!pip install neo4j

Collecting neo4j
  Downloading neo4j-4.3.7.tar.gz (76 kB)
[?25l[K     |████▎                           | 10 kB 13.9 MB/s eta 0:00:01[K     |████████▋                       | 20 kB 16.2 MB/s eta 0:00:01[K     |█████████████                   | 30 kB 19.9 MB/s eta 0:00:01[K     |█████████████████▏              | 40 kB 22.9 MB/s eta 0:00:01[K     |█████████████████████▌          | 51 kB 25.8 MB/s eta 0:00:01[K     |█████████████████████████▉      | 61 kB 29.0 MB/s eta 0:00:01[K     |██████████████████████████████▏ | 71 kB 31.2 MB/s eta 0:00:01[K     |████████████████████████████████| 76 kB 4.5 MB/s 
Building wheels for collected packages: neo4j
  Building wheel for neo4j (setup.py) ... done
  Created wheel for neo4j: filename=neo4j-4.3.7-py3-none-any.whl size=100642 sha256=7b0f1a2aa2b95187ab57dc4b4c067b5c7da1ee1302cc37c087c6d33ffa2c4708
  Stored in directory: /root/.cache/pip/wheels/b5/24/bb/cece9fcfdd5e1aa0683e2533945e1e3f27f70f342ff7e28993
Successfully built

In [2]:
import pandas as pd
from neo4j import GraphDatabase

In [3]:
class Neo4jConnection:
    """Thin wrapper around the Neo4j driver that runs a query and returns its records."""

    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)

    def close(self):
        """Close the driver if it was created."""
        if self.__driver is not None:
            self.__driver.close()

    def query(self, query, parameters=None, db=None):
        """Run a Cypher query; return a list of records, or None if the query failed."""
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        try:
            # Use the named database if one is given, otherwise the default.
            session = self.__driver.session(database=db) if db is not None else self.__driver.session()
            # Materialize the records before the session is closed.
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally:
            if session is not None:
                session.close()
        return response
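
Note that `query` prints the error and returns `None` when a query fails, so any code that iterates over the result would then raise a `TypeError`. A minimal guard might look like the sketch below. `records_to_rows` is a hypothetical helper name, and plain dicts stand in for driver records so the example runs without a database.

```python
def records_to_rows(records):
    """Turn query results into a list of dicts; treat None (a failed query) as empty."""
    if records is None:
        return []
    return [dict(record) for record in records]

# Stand-in data: plain dicts instead of real driver records.
rows = records_to_rows([{"id": "1", "subject": "Example"}])
empty = records_to_rows(None)
print(rows, empty)
```

With a live connection, `records_to_rows(conn.query(query))` could be passed to `pd.DataFrame` in place of a bare list comprehension.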

## Connect using the Sandbox URI and password

If all goes well, this cell should not print an error.

In [5]:
uri = ''
pwd = ''

conn = Neo4jConnection(uri=uri, user="neo4j", pwd=pwd)
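
Rather than pasting the URI and password into the cell, one option is to read them from environment variables. The sketch below assumes two hypothetical variable names, `NEO4J_URI` and `NEO4J_PASSWORD`, which are not defined anywhere in this notebook; you would need to set them yourself. The values in the example call are placeholders, not real credentials.

```python
import os

def load_credentials(env=None):
    """Return (uri, pwd) read from a mapping of environment variables.

    NEO4J_URI and NEO4J_PASSWORD are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "")
    pwd = env.get("NEO4J_PASSWORD", "")
    if not uri or not pwd:
        raise ValueError("Set NEO4J_URI and NEO4J_PASSWORD before connecting.")
    return uri, pwd

# Example with an explicit mapping of placeholder values.
uri, pwd = load_credentials(
    {"NEO4J_URI": "bolt://localhost:7687", "NEO4J_PASSWORD": "secret"}
)
# conn = Neo4jConnection(uri=uri, user="neo4j", pwd=pwd)
```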

## Count the number of nodes within the graph

In [6]:
query = """MATCH (n) RETURN COUNT(n)"""
result = conn.query(query)
result

[<Record COUNT(n)=2708>]

## View the properties of a single node

In [9]:
query = """MATCH (p:Paper) RETURN p LIMIT 1"""
result = conn.query(query)
result

[<Record p=<Node id=0 labels=frozenset({'Paper'}) properties={'features': '[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

## View the same data as a pandas DataFrame

In [13]:
query = """MATCH (p:Paper) RETURN p.id AS id, p.subject AS subject, p.features AS features LIMIT 5"""

result_df = pd.DataFrame([dict(_) for _ in conn.query(query)])
result_df.head()

|   | id | subject | features |
|---|---|---|---|
| 0 | 31336 | Neural_Networks | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1 | 1061127 | Rule_Learning | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ... |
| 2 | 1106406 | Reinforcement_Learning | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 3 | 13195 | Reinforcement_Learning | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 4 | 37879 | Probabilistic_Methods | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
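
The single-node output shown earlier suggests that `features` is stored as a string such as `'[0, 0, 1]'` rather than as a list. Here is a minimal sketch for turning such a string into a list of integers, assuming every value is formatted like the ones shown. `parse_features` is a hypothetical helper name.

```python
import json

def parse_features(raw):
    """Convert a string like '[0, 1, 0]' into a list of ints."""
    values = json.loads(raw)
    return [int(v) for v in values]

example = parse_features("[0, 0, 1, 0]")
print(example, sum(example))
```

With the DataFrame from the cell above, `result_df['features'].apply(parse_features)` would apply the same conversion to every row.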


## Calculate inDegree

In [14]:
query = """MATCH (p:Paper)
           WITH p, SIZE(()-[:CITES]->(p)) AS inDegree
           RETURN p.id, p.subject, inDegree
           ORDER BY inDegree DESC
           LIMIT 5
"""

result_df = pd.DataFrame([dict(_) for _ in conn.query(query)])
result_df.head()


|   | p.id | p.subject | inDegree |
|---|---|---|---|
| 0 | 35 | Genetic_Algorithms | 166 |
| 1 | 6213 | Reinforcement_Learning | 76 |
| 2 | 1365 | Neural_Networks | 74 |
| 3 | 3229 | Neural_Networks | 61 |
| 4 | 114 | Reinforcement_Learning | 42 |
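
For intuition, in-degree is the number of relationships pointing at a node. The sketch below computes it in plain Python on a small made-up edge list (the IDs are illustrative and not taken from the database), mirroring what the Cypher query counts for each paper.

```python
from collections import Counter

# Hypothetical (citing, cited) pairs; not data from the Sandbox graph.
edges = [("a", "x"), ("b", "x"), ("c", "x"), ("a", "y"), ("x", "y")]

def in_degree(edge_list):
    """Count incoming edges per target node."""
    return Counter(target for _source, target in edge_list)

degrees = in_degree(edges)
top = degrees.most_common(2)
print(top)  # [('x', 3), ('y', 2)]
```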


## Match all papers that cite a given target paper

_Note:_ We imported the paper IDs as strings, so the ID in the query must be quoted. If you would prefer integers, you can apply `toInteger()` in the import query.

In [19]:
query = """MATCH (p1:Paper)-[:CITES]->(target:Paper)
           WHERE target.id = '114'
           RETURN p1.id AS papers_that_cite
"""

result_df = pd.DataFrame([dict(_) for _ in conn.query(query)])
result_df.head()

|   | papers_that_cite |
|---|---|
| 0 | 1118245 |
| 1 | 124064 |
| 2 | 1118332 |
| 3 | 64484 |
| 4 | 1103315 |
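
The `query` method defined earlier accepts a `parameters` argument, so the target ID can be passed as a query parameter (`$target_id`) instead of being typed into the Cypher string. A small sketch follows; `citing_papers_query` is a hypothetical helper, and the cast to `str` reflects the note above that IDs are stored as strings.

```python
def citing_papers_query(target_id):
    """Return a (query, parameters) pair for the papers that cite target_id."""
    query = """MATCH (p1:Paper)-[:CITES]->(target:Paper)
               WHERE target.id = $target_id
               RETURN p1.id AS papers_that_cite
    """
    # IDs were imported as strings, so convert ints before sending them.
    parameters = {"target_id": str(target_id)}
    return query, parameters

query, parameters = citing_papers_query(114)
print(parameters)  # {'target_id': '114'}
# records = conn.query(query, parameters=parameters)
```

Passing the value separately avoids quoting mistakes and lets the same query text be reused for other IDs.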


## Match all papers that are within 3 hops of a target paper

Think about how complicated this query would be in SQL...

In [24]:
query = """MATCH (p1:Paper {id: '114'})<-[*1..3]-(p2:Paper)
           RETURN p2.id AS id, p2.subject AS source_subject, p1.subject AS target_subject
           LIMIT 5
"""

result_df = pd.DataFrame([dict(_) for _ in conn.query(query)])
result_df.head()

|   | id | source_subject | target_subject |
|---|---|---|---|
| 0 | 193742 | Reinforcement_Learning | Reinforcement_Learning |
| 1 | 1111230 | Reinforcement_Learning | Reinforcement_Learning |
| 2 | 6152 | Reinforcement_Learning | Reinforcement_Learning |
| 3 | 299195 | Neural_Networks | Reinforcement_Learning |
| 4 | 1105394 | Reinforcement_Learning | Reinforcement_Learning |
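
To see what the variable-length pattern `<-[*1..3]-` is asking for, here is a plain-Python sketch that collects every node able to reach a target in at most three hops. The edge list is made up for illustration. Unlike the Cypher query, which returns one row per matching path, this sketch reports each node once, with its shortest hop count.

```python
from collections import deque

# Hypothetical (source, target) edges; not data from the Sandbox graph.
edges = [("b", "a"), ("c", "b"), ("d", "c"), ("e", "d"), ("f", "a")]

def within_hops(edge_list, target, max_hops=3):
    """Return {node: hops} for nodes with a directed path of 1..max_hops to target."""
    # Map each node to the set of nodes that point at it.
    incoming = {}
    for source, dest in edge_list:
        incoming.setdefault(dest, set()).add(source)

    found = {}  # node -> shortest hop count to the target
    queue = deque([(target, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for source in incoming.get(node, ()):
            if source != target and source not in found:
                found[source] = hops + 1
                queue.append((source, hops + 1))
    return found

print(sorted(within_hops(edges, "a").items()))
# [('b', 1), ('c', 2), ('d', 3), ('f', 1)]
```

Node `e` is four hops away from `a`, so it is excluded.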
