<a href="https://colab.research.google.com/github/conker84/from-0-to-graph-hero/blob/main/from_zero_to_graph_hero.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# From zero to a graph hero

In this talk we'll discuss why graphs are important, and how we can create a simple recommendation engine starting from a few CSV files.


# Prerequisites 

Please install all the required dependencies before executing the notebook.

In [None]:
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
!pip install neo4j
!pip install ipycytoscape
!pip install networkx

In [None]:
from getpass import getpass
from neo4j import GraphDatabase

neo4j_user = input('Neo4j user: ')
neo4j_password = getpass('Neo4j password: ')
neo4j_uri = input('Neo4j uri: ')


neo4j_driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_password))

In [None]:
import pandas as pd
import networkx as nx
from IPython.core.magic import (register_line_magic, register_cell_magic)
import ipycytoscape
from neo4j.graph import Graph

colors = {
  ':Customer': '#fffb00',
  ':Order': '#00f900',
  ':Product': '#ff2600',
  ':Category': '#53d5fd'
}
captions =  {
  ':Customer': 'companyName',
  ':Order': 'shipName',
  ':Product': 'productName',
  ':Category': 'categoryName'
}
node_centered = {
  'selector': 'node',
  'style': {
    'font-size': '10',
    'label': 'data(title)',
    'height': '60',
    'width': '60',
    'text-max-width': '60',
    'text-wrap': 'wrap',
    'text-valign': 'center',
    'background-color': 'data(color)',
    'background-opacity': 0.6,
    'border-width': 3,
    'border-color': '#D3D3D3'
  }
}
edge_directed_named = {
  'selector': 'edge',
  'style': {
    'font-size': '8',
    'label': 'data(label)',
    'line-color': '#D3D3D3',
    'text-rotation': 'autorotate',
    'target-arrow-shape': 'triangle',
    'target-arrow-color': '#D3D3D3',
    'curve-style': 'bezier',
    'text-background-color': "#FCFCFC",
    'text-background-opacity': 0.8,
    'text-background-shape': 'rectangle',
    'width': 'data(weight)'
  }
}
my_style = [node_centered, edge_directed_named]


def to_nextworkx(graph):
  networkx_graph = nx.MultiDiGraph()

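  def add_node(node):
    label = ':' + ':'.join(node.labels)
    props = dict(node.items())
    # fall back to a neutral grey for labels that have no configured color
    color = colors.get(label, '#D3D3D3')
    # use the configured caption property as title, falling back to the label
    title = str(props.get(captions.get(label), label))
    networkx_graph.add_node(node.id, label=label, color=color, properties=props, title=title, tooltip=str(props))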


  def add_edge(edge):
    props = dict(edge.items())
    networkx_graph.add_edge(edge.start_node.id, edge.end_node.id, weight=2, label=edge.type, tooltip=str(props))

  for node in graph._nodes.values():
    add_node(node)

  for rel in graph._relationships.values():
    add_edge(rel)

  return networkx_graph


def display_graph(networkx_graph, config={'layout': 'dagre', 'padding': 0, 'nodeSpacing': 10, 'edgeLengthVal': 10, 'animate': True, 'randomize': True}):
    w = ipycytoscape.CytoscapeWidget()
    w.graph.add_graph_from_networkx(networkx_graph)
    w.set_style(my_style)
    w.set_layout(name=config['layout'],
                 padding=config['padding'],
                 nodeSpacing=config['nodeSpacing'],
                 edgeLengthVal=config['edgeLengthVal'],
                 animate=config['animate'],
                 randomize=config['randomize'],
                 maxSimulations=1500)
    w.set_tooltip_source('tooltip')
    display(w)


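def run_query(query):
  # run every ';'-separated statement, but return only the result of the last one
  with neo4j_driver.session() as session:
    result = None
    for sub_query in query.split(';'):
      sub_query = sub_query.strip()
      if sub_query != "":
        result = session.run(sub_query)
    if result is None:
      # the cell contained no statement at all
      return None
    graph = result.graph()
    if len(graph._nodes) > 0:
      # the result contains graph entities: render them as a graph
      return display_graph(to_nextworkx(graph))
    else:
      # otherwise show the tabular result
      return result.to_df()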


@register_cell_magic
def cypher(line, cell):
  return run_query(cell)

# Why are graphs important?

> "I think the next century will be the century of complexity." - Stephen Hawking

<img src="https://miro.medium.com/max/1400/1*K6avHhlmtIE0dnGj7whLag.jpeg" alt="Rail Network in Europe by naturalearthdata.com">

*Rail Network in Europe by naturalearthdata.com*

We are surrounded by systems that are hopelessly complicated. Consider for example: 

* The network encoding the interactions between genes, proteins, and metabolites integrates these components into live cells. The very existence of this cellular network is a prerequisite of life.
* The wiring diagram capturing the connections between neurons, called the neural network, holds the key to our understanding of how the brain functions and to our consciousness.
* The sum of all professional, friendship, and family ties, often called the social network, is the fabric of society and determines the spread of knowledge, behavior and resources.
* Communication networks, describing which communication devices interact with each other, through wired internet connections or wireless links, are at the heart of the modern communication system.
* The power grid, a network of generators and transmission lines, supplies virtually all modern technology with energy.
* Trade networks maintain our ability to exchange goods and services, being responsible for the material prosperity that the world has enjoyed since WWII.


Indeed, behind each complex system there is an intricate network that encodes the interactions between the system’s components.

If we want to understand a complex system, we first need to know how its components interact with each other. In other words, we need a map of its wiring diagram. This is where graphs (networks) come in: the network representation offers a common language to study systems that may differ greatly in nature, appearance, or scope.

A graph $G$ is a structure $G=\langle V,E \rangle$ where:
* $V$ is the set of vertices (nodes); each `node` can be assigned one or more `label`s (types) and a set of `properties` as key/value bindings;
* $E$ is the set of edges; each `edge` has a `source` and a `target` node, a `type`, and a set of `properties` as key/value bindings.
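
To make the definition concrete, here is a minimal sketch of such a structure in plain Python. The class names (`Node`, `Edge`, `PropertyGraph`) and the sample people are assumptions made for illustration only; this is not how Neo4j stores data internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    labels: set                       # one or more labels (types)
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int                       # id of the source node
    target: int                       # id of the target node
    type: str
    properties: dict = field(default_factory=dict)


@dataclass
class PropertyGraph:
    nodes: dict = field(default_factory=dict)   # V, keyed by node id
    edges: list = field(default_factory=list)   # E

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_edge(self, edge):
        # an edge is only accepted if both endpoints belong to V
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must exist in the graph")
        self.edges.append(edge)


# hypothetical sample data: one person following another
g = PropertyGraph()
g.add_node(Node(1, {"Person"}, {"name": "Alice"}))
g.add_node(Node(2, {"Person"}, {"name": "Bob"}))
g.add_edge(Edge(1, 2, "FOLLOWS", {"since": 2020}))
```

Both nodes and edges carry their own key/value properties, which is what distinguishes this model from a bare mathematical graph.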

<img src="https://raw.githubusercontent.com/conker84/from-0-to-graph-hero/main/images/person-follows-person.png" >

# Why do we need a Graph Database?

Relational databases are well suited to many use cases, but not to those where you need to traverse the data. Why?

## JOINs are expensive

Without going too deep into the problem, it is well known that when you join two or more tables, each join typically costs roughly $O(\log(n))$ per lookup (for example through an index), so performance degrades as the number of joined tables grows.

Graph databases are instead designed around a data structure called the [Adjacency Matrix](https://en.wikipedia.org/wiki/Adjacency_matrix), which reduces the cost of traversing a relationship (given a start node) to $O(1)$.

<table>
  <thead>
    <tr>
      <th>Graph</th>
      <th>Matrix representation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <img src="https://github.com/conker84/from-0-to-graph-hero/blob/main/images/adiacency-graph.png?raw=true" >
      </td>
      <td>
        <img src="https://github.com/conker84/from-0-to-graph-hero/blob/main/images/adiacency-matrix.png?raw=true" >
        <br/>
        The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph.
      </td>
    </tr>
  </tbody>
</table>
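
The difference between the two lookup costs can be sketched in a few lines of Python. The four-vertex graph below is a hypothetical toy example, and the sorted list searched with `bisect` only stands in, very loosely, for an index lookup on a join table; it is not a model of any real database engine.

```python
from bisect import bisect_left

# hypothetical toy graph with 4 vertices (0..3) and directed edges
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# adjacency matrix: matrix[i][j] == 1 when there is an edge i -> j
matrix = [[0] * n for _ in range(n)]
for src, dst in edges:
    matrix[src][dst] = 1


def is_adjacent(src, dst):
    # a single indexed read, independent of how many edges exist: O(1)
    return matrix[src][dst] == 1


# for contrast: a sorted "join table" of (src, dst) rows probed with
# binary search, which costs O(log n) per lookup
join_table = sorted(edges)


def is_adjacent_via_index(src, dst):
    pos = bisect_left(join_table, (src, dst))
    return pos < len(join_table) and join_table[pos] == (src, dst)
```

Both functions answer the same question; what changes is how the cost of each answer grows with the size of the data.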



## How to query a Graph?

Given the definitions of graph and graph database above, how can you query one?

[openCypher](https://opencypher.org/) (Cypher) is a language specification built to query graphs. Think of it as SQL, but for graphs!




### How does it work?

Cypher has a neat way of representing graphs: it uses ASCII art. But how?

#### Cypher and ASCII ART

In Cypher:

* `nodes` are represented as `()` and they can also contain an identifier `(person)`
* `relationships` are represented as `-[]-`; they can have an identifier `-[follows]-`, a type `-[:FOLLOWS]-`, and a direction `-[]->` or `<-[]-`, and they must have a source and a target node `()-[]->()`

Consider this simple graph:

<img src="https://github.com/conker84/from-0-to-graph-hero/blob/main/images/person-drives-car.png?raw=true" >

The Cypher representation of it is:

`(p:Person)-[d:DRIVES]->(c:Car)`

where:
* `p` is the identifier of the source node which is of type `Person`
* `c` is the identifier of the target node which is of type `Car`
* `d` is the identifier of the relationship whose type is `DRIVES`

Given that, you can:

* `CREATE` a graph entity
* `MATCH` a graph entity
* `MERGE` a graph entity; this is an idempotent operation that first checks whether the entity exists and creates it only if it doesn't, returning it in either case.

```cypher
MATCH path = (p:Person)-[d:DRIVES]->(c:Car)
WHERE p.name = 'Andrea' AND p.surname = 'Santurbano'
RETURN path
```
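
The difference between `CREATE`, `MATCH` and `MERGE` can be illustrated with a small in-memory analogy in Python. The functions below are hypothetical helpers written for this sketch and only mimic the behaviour on single nodes; real Cypher `MERGE` works on whole patterns and has more nuanced semantics.

```python
# toy in-memory store: a list of (label, properties) tuples stands in for the nodes
nodes = []


def create(label, **props):
    # CREATE always adds a new node, even if an identical one already exists
    node = (label, props)
    nodes.append(node)
    return node


def match(label, **props):
    # MATCH returns the nodes with this label whose properties include `props`
    return [n for n in nodes
            if n[0] == label and all(n[1].get(k) == v for k, v in props.items())]


def merge(label, **props):
    # MERGE returns an existing match and creates the node only when none exists
    found = match(label, **props)
    return found[0] if found else create(label, **props)


merge("Person", name="Andrea", surname="Santurbano")
merge("Person", name="Andrea", surname="Santurbano")   # idempotent: no duplicate
create("Person", name="Andrea", surname="Santurbano")  # adds a duplicate
```

Running `merge` twice leaves a single node in the store, while the extra `create` adds a second one. This is why the loading queries in this notebook rely on `MERGE` for relationships and for nodes that may appear more than once in the CSV files.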



# Dataset

The Northwind database is a famous dataset containing purchase history. It has been used to teach relational databases for years, and it is a great place to start.

<img src="https://github.com/conker84/from-0-to-graph-hero/blob/main/images/northwind.gif?raw=true" >

It provides us with a rich dataset, but here we want to use only a subset of the information to create a graph like this:

<img src="https://github.com/conker84/from-0-to-graph-hero/blob/main/images/graph.png?raw=true" />

In [None]:
%%cypher
// show that there are no constraints
SHOW CONSTRAINTS

In [None]:
%%cypher
// Let's create the constraints
CREATE CONSTRAINT product_id IF NOT EXISTS FOR (p:Product) REQUIRE (p.id) IS UNIQUE;
CREATE CONSTRAINT order_id IF NOT EXISTS FOR (o:Order) REQUIRE (o.id) IS UNIQUE;
CREATE CONSTRAINT customer_id IF NOT EXISTS FOR (c:Customer) REQUIRE (c.id) IS UNIQUE;
CREATE CONSTRAINT category_id IF NOT EXISTS FOR (c:Category) REQUIRE (c.id) IS UNIQUE;

In [None]:
%%cypher
// show that there are constraints
SHOW CONSTRAINTS

In [None]:
%%cypher
// just check that the database is empty
MATCH (n)
RETURN count(n) AS totalNodes

In [None]:
%%cypher
// create the Customer nodes
USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/conker84/from-0-to-graph-hero/main/data/customers.csv" AS row
CREATE (:Customer {id: row.customerID, companyName: row.companyName, fax: row.fax, phone: row.phone});

// create the Order nodes
USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/conker84/from-0-to-graph-hero/main/data/orders.csv" AS row
MERGE (o:Order {id: row.orderID}) ON CREATE SET o.shipName =  row.shipName;

// create the Product nodes
USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/conker84/from-0-to-graph-hero/main/data/products.csv" AS row
CREATE (:Product {productName: row.productName, id: row.productID, unitPrice: toFloat(row.unitPrice)});

// create the Category nodes
USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/conker84/from-0-to-graph-hero/main/data/categories.csv" AS row
MERGE (c:Category {id: row.categoryID}) ON CREATE SET c.categoryName = row.categoryName, c.description = row.description;

// create the PURCHASED relationships
USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/conker84/from-0-to-graph-hero/main/data/orders.csv" AS row
MATCH (o:Order {id: row.orderID})
MATCH (customer:Customer {id: row.customerID})
MERGE (customer)-[:PURCHASED]->(o);

// create the CONTAINS relationships
USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/conker84/from-0-to-graph-hero/main/data/order-details.csv" AS row
MATCH (o:Order {id: row.orderID})
MATCH (product:Product {id: row.productID})
MERGE (o)-[pu:CONTAINS]->(product)
ON CREATE SET pu.unitPrice = toFloat(row.unitPrice), pu.quantity = toFloat(row.quantity);

// create the HAS relationships
USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/conker84/from-0-to-graph-hero/main/data/products.csv" AS row
MATCH (product:Product {id: row.productID})
MATCH (category:Category {id: row.categoryID})
MERGE (product)-[:HAS]->(category);
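
As a rough mental model of what the loading statements above do, the following Python sketch reads CSV rows and processes them in batches of 100, the way `USING PERIODIC COMMIT 100` commits after every 100 rows. The inline sample data and the `batches` helper are assumptions made for illustration; the notebook itself loads the real files from the URLs above.

```python
import csv
import io
from itertools import islice

# hypothetical inline sample standing in for customers.csv
sample = "customerID,companyName\n" + "".join(
    f"C{i},Company {i}\n" for i in range(250)
)


def batches(rows, size):
    # yield lists of at most `size` rows until the input is exhausted
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


committed = []  # stands in for the database: one entry per committed batch
for chunk in batches(csv.DictReader(io.StringIO(sample)), 100):
    # each row is a dict keyed by the CSV header, like `row` in LOAD CSV WITH HEADERS
    committed.append(
        [{"id": row["customerID"], "companyName": row["companyName"]} for row in chunk]
    )
```

Note that every value read from a CSV file is a string, which is why the Cypher statements convert numeric columns explicitly with `toFloat`.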

In [None]:
%%cypher
// just check that the database is not empty
MATCH (n)
RETURN count(n) AS totalNodes

## Check the graph model

In [None]:
%%cypher
CALL apoc.meta.graph()

## Visualize a simple graph

In [None]:
%%cypher
MATCH (c:Customer)-[r*..2]->(a)
WHERE c.id = 'CHOPS'
RETURN *

# Let's build a recommendation engine

Recommender systems are a type of information filtering system that seeks to generate meaningful recommendations to users for items they may be interested in.

## Popular Products

To find the most popular products in the dataset, we can follow the path from `:Customer` to `:Product`.


In [None]:
%%cypher
// get all the customers that purchased a product
MATCH (c:Customer)-[:PURCHASED]->(o:Order)-[:CONTAINS]->(p:Product)
// return the company, the product and the number of times that they bought it
RETURN c.companyName, p.productName, count(o) as orders
ORDER BY orders desc
LIMIT 5

Nothing too fancy, right? Let's do something more graph oriented.

## Content Based Recommendations

The simplest recommendation we can make for a `:Customer` is a content based recommendation.
Based on their previous purchases, can we recommend them anything that they haven't already bought?
For every product our customer has purchased, let's see what other customers have also purchased.
Each `:Product` is related to a `:Category`  so we can use this to further narrow down the list of products to recommend.

**Does it sound familiar?**

<img src="https://github.com/conker84/from-0-to-graph-hero/blob/main/images/amazon-recommendations.png?raw=true" >


It's much the same logic behind what Amazon shows you when you buy a product: related products in the same category.
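
Before running the Cypher query, the traversal can be sketched in plain Python over a tiny in-memory dataset. The customers, orders and products below are hypothetical toy data, and the `recommend` function is only an illustration of the reasoning, not the way the database executes the query.

```python
from collections import defaultdict

# hypothetical toy data mirroring the graph model
purchased = {"CHOPS": ["o1"], "OTHER": ["o2", "o3"]}   # (:Customer)-[:PURCHASED]->(:Order)
contains = {                                           # (:Order)-[:CONTAINS]->(:Product)
    "o1": ["Chai"],
    "o2": ["Chai", "Chang"],
    "o3": ["Chai", "Chang", "Tofu"],
}
category = {"Chai": "Beverages", "Chang": "Beverages", "Tofu": "Produce"}  # (:Product)-[:HAS]->(:Category)


def recommend(customer, limit=5):
    # products the customer has already bought
    own_products = {p for o in purchased[customer] for p in contains[o]}
    seen_orders = defaultdict(set)  # (p, p2) -> distinct orders containing both
    for p in own_products:
        for o2, items in contains.items():
            if p not in items:
                continue
            for p2 in items:
                # keep products not yet bought that share the category of p
                if p2 not in own_products and category[p2] == category[p]:
                    seen_orders[(p, p2)].add(o2)
    ranked = sorted(seen_orders.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(p, p2, len(orders)) for (p, p2), orders in ranked[:limit]]
```

With this toy data, `recommend("CHOPS")` suggests `Chang` because it appears together with `Chai` in two other orders and belongs to the same category, while `Tofu` is discarded because its category differs.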

In [None]:
%%cypher
// For every product our customer has purchased, let's see what other customers have also purchased
MATCH (c:Customer)-[:PURCHASED]->(o:Order)-[:CONTAINS]->(p:Product)<-[:CONTAINS]-(o2:Order)-[:CONTAINS]->(p2:Product)-[:HAS]->(:Category)<-[:HAS]-(p)
WHERE c.id = 'CHOPS' AND NOT( (c)-[:PURCHASED]->(:Order)-[:CONTAINS]->(p2) )
RETURN c.companyName, p.productName as has_purchased, p2.productName as has_also_purchased, count(DISTINCT o2) as occurrences
ORDER BY occurrences desc
LIMIT 5

There you have it!  Quick and simple recommendations using graph theory and Cypher.


# Lessons learned

* We saw how powerful graph theory is for describing different scenarios like neural networks, power grids, the exchange of goods and so on
* We saw why a graph database is necessary when you need to traverse data
* We created our first graph
* We did our first graph powered recommendation query by traversing our data.
