# Introduction

This notebook walks through solving a basic data science problem with the [Graph Data Science Library](https://neo4j.com/developer/graph-data-science/). We will likely not have time to go through it during this session, but it is provided in case you want to work through some problems on your own.

In [None]:
# Run this cell first if you are working in Google Colab: it installs the Neo4j Python driver

!pip install neo4j

In [1]:
import pandas as pd
from neo4j import GraphDatabase

In [2]:
class Neo4jConnection:
    """Thin wrapper around the Neo4j driver for running Cypher queries."""

    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)

    def close(self):
        """Close the underlying driver, if one was created."""
        if self.__driver is not None:
            self.__driver.close()

    def query(self, query, parameters=None, db=None):
        """Run a Cypher query and return a list of records, or None on failure."""
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        try:
            # Use the named database if one is given, otherwise the default one
            session = self.__driver.session(database=db) if db is not None else self.__driver.session()
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally:
            # Always release the session, even if the query raised
            if session is not None:
                session.close()
        return response

## How to make the connection

From the Sandbox UI, select "Connection Details" and copy the Bolt URL and password. The username is always `neo4j`.

In [12]:
uri = ''
pwd = ''

conn = Neo4jConnection(uri=uri, user="neo4j", pwd=pwd)
result = conn.query('MATCH (n) RETURN COUNT(n) AS count')

print('One way to get results back: ', result)
print('Another way: ', result[0]['count'])

One way to get results back:  [<Record count=2642>]
Another way:  2642
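
The `query` method defined earlier returns `None` when a query fails, so indexing `result[0]` would raise a `TypeError` in that case. The following is a minimal sketch of a guard. The helper name `records_to_rows` is hypothetical, and the sample response is stand-in data shaped like the output above rather than a live query result.

```python
def records_to_rows(records):
    """Convert a query response into a list of plain dicts.

    `records` is whatever Neo4jConnection.query returned: a list of
    records, or None if the query failed.
    """
    if records is None:
        return []
    return [dict(record) for record in records]


# Stand-in data shaped like the response above; with a live connection
# you would pass conn.query(...) instead.
fake_response = [{"count": 2642}]
rows = records_to_rows(fake_response)
print(rows[0]["count"])

# A failed query yields an empty list instead of None
print(records_to_rows(None))
```

With a real connection, `records_to_rows(conn.query(...))` gives a list that is safe to iterate over or to hand to `pd.DataFrame`.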


## Graph Data Science Library

From here we will be using the Neo4j [Graph Data Science Library](https://neo4j.com/developer/graph-data-science/).  You are encouraged to consult the [API docs](https://neo4j.com/docs/graph-data-science/current/) on how to use it.  However, the general approach is:

1. Create a graph projection
2. Run a graph algorithm on it

We will demonstrate this with a classic centrality calculation, [PageRank](https://en.wikipedia.org/wiki/PageRank).
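
To give a feel for what PageRank computes, here is a small pure-Python sketch. It assumes the un-normalized formulation, where each node's score is `(1 - d)` plus `d` times the sum of each in-neighbor's score divided by that neighbor's out-degree, with `d = 0.85` and 20 iterations (the `dampingFactor` and `maxIterations` values that appear in the configuration output further down). It is an illustration only, not the library's implementation, and the toy graph and node names are made up.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Un-normalized PageRank by repeated updates over a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    incoming = {node: [] for node in nodes}
    for source, target in edges:
        out_degree[source] += 1
        incoming[target].append(source)

    # Every node starts with the same score
    scores = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Each source shares its current score equally among its out-links
        scores = {
            node: (1 - damping)
            + damping * sum(scores[src] / out_degree[src] for src in incoming[node])
            for node in nodes
        }
    return scores


# Hypothetical toy graph: b, c and d all point at a, and a points back at b
toy_edges = [("b", "a"), ("c", "a"), ("d", "a"), ("a", "b")]
scores = pagerank(toy_edges)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(name, round(score, 3))
```

Nodes with no incoming relationships (`c` and `d` here) end up with the baseline score `1 - d`, while `a` collects score from everything that points at it.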

First, we will create a graph projection called `people` containing all `Person` nodes and all (`*`) relationships between them...

In [13]:
query = """CALL gds.graph.create('people', 'Person', '*')"""
conn.query(query)

[<Record nodeProjection={'Person': {'label': 'Person', 'properties': {}}} relationshipProjection={'__ALL__': {'orientation': 'NATURAL', 'aggregation': 'DEFAULT', 'type': '*', 'properties': {}}} graphName='people' nodeCount=2166 relationshipCount=8170 createMillis=346>]
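
Named projections live in an in-memory catalog, so running the creation cell a second time with the same name fails. A minimal sketch of a guard follows, assuming the `gds.graph.exists` procedure is available in your GDS version and that `conn` exposes `query(query, parameters=...)` like the class defined earlier. The helper name `ensure_projection` is hypothetical.

```python
def ensure_projection(conn, graph_name, node_label, relationship_type):
    """Create a named projection only if the catalog does not already hold it.

    Returns True if the projection was created, False if it already existed.
    """
    exists = conn.query(
        "CALL gds.graph.exists($name) YIELD exists RETURN exists",
        parameters={"name": graph_name},
    )
    if exists and exists[0]["exists"]:
        return False

    # Same procedure as in the cell above; newer GDS releases name it
    # gds.graph.project instead, so adjust to the version you are running.
    conn.query(
        "CALL gds.graph.create($name, $label, $rel_type)",
        parameters={
            "name": graph_name,
            "label": node_label,
            "rel_type": relationship_type,
        },
    )
    return True


# Usage against a live database (not executed here):
# ensure_projection(conn, 'people', 'Person', '*')
```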

### Now that we have a graph projection, we can run PageRank on it to find the most influential person in Game of Thrones, writing the calculated value back to each node as a property...

In [14]:
pagerank_query = """CALL gds.pageRank.write(
                        'people',
                        { writeProperty: 'pagerank' }
                    )
                    """
conn.query(pagerank_query)

[<Record writeMillis=1123 nodePropertiesWritten=2166 ranIterations=20 didConverge=False centralityDistribution={'p99': 1.8554067611694336, 'min': 0.14999961853027344, 'max': 14.432188987731934, 'mean': 0.2852511044904686, 'p90': 0.5038976669311523, 'p50': 0.14999961853027344, 'p999': 8.362792015075684, 'p95': 0.9612417221069336, 'p75': 0.19827938079833984} postProcessingMillis=168 createMillis=0 computeMillis=429 configuration={'maxIterations': 20, 'writeConcurrency': 4, 'relationshipWeightProperty': None, 'cacheWeights': False, 'concurrency': 4, 'sourceNodes': [], 'writeProperty': 'pagerank', 'scaler': 'NONE', 'nodeLabels': ['*'], 'sudo': False, 'dampingFactor': 0.85, 'relationshipTypes': ['*'], 'tolerance': 1e-07, 'username': None}>]
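
Note that the output above reports `didConverge=False` with `ranIterations=20`, which is the `maxIterations` value shown in the configuration. One thing to try is raising that cap. The sketch below assumes `conn` behaves like the `Neo4jConnection` class defined earlier; the helper name `run_pagerank` and the value 50 are arbitrary choices for illustration, and whether a higher cap changes the ranking on this dataset has not been checked here.

```python
def run_pagerank(conn, graph_name, write_property, max_iterations=50):
    """Run gds.pageRank.write with an explicit iteration cap.

    Returns whatever conn.query returns: a list with one summary record
    (ranIterations, didConverge), or None if the query failed.
    """
    query = """
    CALL gds.pageRank.write(
        $graph_name,
        { writeProperty: $write_property, maxIterations: $max_iterations }
    )
    YIELD ranIterations, didConverge
    RETURN ranIterations, didConverge
    """
    parameters = {
        "graph_name": graph_name,
        "write_property": write_property,
        "max_iterations": max_iterations,
    }
    return conn.query(query, parameters=parameters)


# Usage against a live database (not executed here):
# summary = run_pagerank(conn, 'people', 'pagerank', max_iterations=50)
# print(summary[0]['ranIterations'], summary[0]['didConverge'])
```

Passing the graph name and settings as query parameters avoids building the Cypher string by hand.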

### And now let's explore the results, imported here as a pandas DataFrame...

In [15]:
query = """MATCH (p:Person)
           RETURN p.name, p.pagerank
           ORDER BY p.pagerank DESC
           LIMIT 10
           """

top_people_df = pd.DataFrame([dict(record) for record in conn.query(query)])
top_people_df.head(10)

|   | p.name | p.pagerank |
|---|---|---|
| 0 | Tyrion Lannister | 14.432143 |
| 1 | Stannis Baratheon | 8.389708 |
| 2 | Tywin Lannister | 8.362748 |
| 3 | Varys | 7.134114 |
| 4 | Yandry | 5.493203 |
| 5 | Ysilla | 5.468612 |
| 6 | Theon Greyjoy | 4.745098 |
| 7 | Walder Frey | 4.525921 |
| 8 | Sansa Stark | 4.489441 |
| 9 | Perra Royce | 3.997016 |


### If you know the show, those results look strange.

This could be because we included every relationship type. Some of these characters, while not important to the story, interact with important characters, perhaps frequently. Let's limit the interactions to the first book and see what happens. To do this, we will repeat the two steps above using a new graph projection.

In [16]:
projection_query = """CALL gds.graph.create('people_1', 'Person', 'INTERACTS_1')"""

pagerank_query = """CALL gds.pageRank.write(
                        'people_1',
                        { writeProperty: 'pagerank_1' }
                    )
                    """

conn.query(projection_query)
conn.query(pagerank_query)

query = """MATCH (p:Person)
           RETURN p.name, p.pagerank_1
           ORDER BY p.pagerank_1 DESC
           LIMIT 10
           """

top_people_df_1 = pd.DataFrame([dict(record) for record in conn.query(query)])
top_people_df_1.head(10)

|   | p.name | p.pagerank_1 |
|---|---|---|
| 0 | Tyrion Lannister | 4.369831 |
| 1 | Varys | 3.544865 |
| 2 | Tywin Lannister | 2.984199 |
| 3 | Robert Baratheon | 2.074483 |
| 4 | Sansa Stark | 1.933146 |
| 5 | Walder Frey | 1.883858 |
| 6 | Robb Stark | 1.301297 |
| 7 | Willis Wode | 1.209997 |
| 8 | Jon Snow | 1.187182 |
| 9 | Vardis Egen | 1.181491 |
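
To see how much restricting the projection changed the picture, the two top-10 lists can be compared directly. The sketch below hard-codes the names from the two result tables above so it runs without a database; with the live DataFrames you could use `top_people_df['p.name']` and `top_people_df_1['p.name']` instead.

```python
# Names copied from the two result tables above, in ranked order
top_all = [
    "Tyrion Lannister", "Stannis Baratheon", "Tywin Lannister", "Varys",
    "Yandry", "Ysilla", "Theon Greyjoy", "Walder Frey", "Sansa Stark",
    "Perra Royce",
]
top_book_1 = [
    "Tyrion Lannister", "Varys", "Tywin Lannister", "Robert Baratheon",
    "Sansa Stark", "Walder Frey", "Robb Stark", "Willis Wode", "Jon Snow",
    "Vardis Egen",
]

book_1_names = set(top_book_1)
all_names = set(top_all)

# Preserve ranked order while splitting the lists
shared = [name for name in top_all if name in book_1_names]
only_all = [name for name in top_all if name not in book_1_names]
only_book_1 = [name for name in top_book_1 if name not in all_names]

print("In both lists:", shared)
print("Only with all relationships:", only_all)
print("Only with INTERACTS_1:", only_book_1)
```

Half of the names appear in both lists; the other half changes once only the `INTERACTS_1` relationships are projected.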
