# Blog 4: Data as a graph network
Data is becoming ever more complex and interconnected. Since the 1970s, the most common type of database has been the relational one, where data lives in highly structured tables split into rows and columns, and is quite often normalised.

Graph networks are a mathematical concept in which pairwise relations between objects are expressed as nodes (vertices) and edges.
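To make the idea concrete, here's a minimal sketch of a graph in plain Python. The node names and the `neighbours` and `degree` helpers are made up for illustration; they aren't part of any library.

```python
# A tiny directed graph: nodes are names, edges are (source, target) pairs.
nodes = {"A", "B", "C", "D"}
edges = [("A", "B"), ("A", "C"), ("C", "D")]


def neighbours(node, edges):
    """Return the nodes reachable from `node` by following one edge."""
    return sorted(target for source, target in edges if source == node)


def degree(node, edges):
    """Count the edges that touch `node`, in either direction."""
    return sum(node in edge for edge in edges)


print(neighbours("A", edges))  # ['B', 'C']
print(degree("C", edges))      # 2
```

Everything a graph database stores boils down to these two things: a set of nodes, and a list of edges that connect pairs of them.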

The classic example of data in this format is social media data. Users are interconnected, with hundreds of properties and relationships. To model this in a relational setting we'd need lots of intermediate tables to describe the many-to-many relationships, e.g. I'm friends with Alex and Paul, but only Paul is friends with me. With large datasets, querying across those tables becomes resource-expensive and time-consuming. In an age where apps need to return answers in milliseconds, making the user wait is a definite no-no.
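Here's a rough sketch of that friendship example, using Python's built-in `sqlite3` module. The `users` and `friendships` tables are assumptions I've invented for illustration. The point is that finding mutual friends needs a self-join on the intermediate table, whereas with a set of directed edges it is a simple membership check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    -- intermediate table: one row per directed "is friends with" link
    CREATE TABLE friendships (user_id INTEGER, friend_id INTEGER);
    INSERT INTO users VALUES (1, 'Me'), (2, 'Alex'), (3, 'Paul');
    INSERT INTO friendships VALUES (1, 2), (1, 3), (3, 1);
""")

# Relational version: self-join the intermediate table to find links
# that exist in both directions.
mutual = conn.execute("""
    SELECT u.name
    FROM friendships AS f
    JOIN friendships AS back
      ON back.user_id = f.friend_id AND back.friend_id = f.user_id
    JOIN users AS u ON u.id = f.friend_id
    WHERE f.user_id = 1
""").fetchall()
print(mutual)  # [('Paul',)]

# Graph version: the same data as a set of directed edges.
friends = {("Me", "Alex"), ("Me", "Paul"), ("Paul", "Me")}
mutual_graph = sorted(b for a, b in friends if a == "Me" and (b, a) in friends)
print(mutual_graph)  # ['Paul']
```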

As shown in the diagram below, graph databases are a form of NoSQL (not only SQL), but they sit in their own category, away from other database technologies like Cassandra, MongoDB or Redis (all of which are useful for storing large amounts of unstructured data).

Diagram is taken from chapter 6 of Introducing Data Science - Big Data, Machine Learning and more, using Python tools (2016) by Davy Cielen, Arno D. B. Meysman and Mohamed Ali.



![alt text](NOSQL.PNG "Graph_POC")

The graph database software I'll be using is called Neo4j. You can learn more about the software and graph databases in general at https://neo4j.com/


The script below uses a Python client library (neo4jrestclient) for Neo4j to connect to my graph instance and set up some sample mock data. Neo4j uses its own query language called Cypher. "Cypher is a declarative, SQL-inspired language for describing patterns in graphs visually using an ascii-art syntax."

It's very similar to SQL, but instead of SELECT you use MATCH to find patterns in the graph structure, and MERGE to match a pattern or create it if it doesn't exist yet.
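To show what that ascii-art syntax looks like, here's a small sketch that builds a MATCH query as a string. The `cypher_pattern` helper is hypothetical and exists only to show the shape of a pattern: round brackets for nodes, square brackets for the relationship, and an arrow for its direction. The labels are the ones used later in this post.

```python
def cypher_pattern(start_label, rel_type, end_label):
    """Build the ASCII-art pattern (a:Start)-[r:TYPE]->(b:End)."""
    return f"(a:{start_label})-[r:{rel_type}]->(b:{end_label})"


pattern = cypher_pattern("Account", "holds", "Equity")
query = f"MATCH {pattern} RETURN a.account_id, r.units, b.name"
print(query)
# MATCH (a:Account)-[r:holds]->(b:Equity) RETURN a.account_id, r.units, b.name
```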

I am using the Python module for the Neo4j REST API to authenticate and then perform inserts and queries on a graph.
You can view the GUI at localhost:7474/browser.

In [12]:
from neo4jrestclient.client import GraphDatabase
 
db = GraphDatabase("http://localhost:7474", username="neo4j", password="someneopassword")
 

In [19]:
query1 = """MATCH (n) DETACH DELETE n"""

print(db.query(query1).get_response())


people = db.labels.create("Person")

Sam = db.nodes.create(name="Sam", age=26)
Alex = db.nodes.create(name="Alex", age=34)
Mark = db.nodes.create(name="Mark", age=26)
James = db.nodes.create(name="James", age=43)
Billy = db.nodes.create(name="Billy", age=36)


people.add(Sam, Alex, Mark, James, Billy)


account = db.labels.create("Account")
account1=db.nodes.create(account_id="1234", type="Basic Trading Account")
account2=db.nodes.create(account_id="2234", type="ISA")
account3=db.nodes.create(account_id="3234", type="ISA")
account4=db.nodes.create(account_id="4234", type="ISA")
account5=db.nodes.create(account_id="5234", type="Basic Trading Account")

account.add(account1,account2,account3,account4,account5)


ic = db.labels.create("Investment Club")
ic1=db.nodes.create(name="London's Best Investors")

ic.add(ic1)

oeic = db.labels.create("OEIC")
oeic1=db.nodes.create(name="Global Tech Acc")
oeic2=db.nodes.create(name="EM Small Cap Acc")

oeic.add(oeic1,oeic2)

equity=db.labels.create("Equity")
equity1=db.nodes.create(name="Lloyds plc")

equity.add(equity1)

Sam.relationships.create("accountHolder", account1, date_opened="2016-01-12")
Alex.relationships.create("accountHolder", account2, date_opened="2016-01-15")
Mark.relationships.create("accountHolder", account3, date_opened="2016-04-15")
James.relationships.create("accountHolder", account4, date_opened="2016-01-19")
Billy.relationships.create("accountHolder", account5, date_opened="2016-01-19")
account1.relationships.create("holds", oeic1, units=10)
account2.relationships.create("holds", oeic2, units=20)
account3.relationships.create("holds", oeic1, units=5)
Sam.relationships.create("clubMember",ic1, date_joined="2016-04-23")
Alex.relationships.create("clubMember",ic1, date_joined="2016-04-23")
account2.relationships.create("holds", equity1, units=5)
ic1.relationships.create("holds", equity1, units=5)



{'data': [], 'columns': []}


<Neo4j Relationship: http://localhost:7474/db/data/relationship/52>

## What does it look like?

Below is the picture taken from the Neo4j browser. It consists of...

5 Customers.
5 Accounts.
1 Investment Club.
3 Investments.

How good does that look! If we'd created this schema in a relational model, the potential tables we'd need would be...
customers, accounts, clubs, members, equity, OEICs, orders, holdings.
What about polymorphic associations? We'd need extra tables to hold the many-to-many relationships, or perhaps multi-value records if you are old school.

We can see that James, Sam and Alex are part of the Super Investment club, which owns account 5234 and holds 41 units of Lloyds PLC. We can also see that Sam owns 2 accounts, one holding Global Tech and the other Lloyds plc, that he's 23 years old, and that he was acquired via the organic marketing channel.


![alt text](graph.png "Graph_POC")

## Querying the network

I know this is a small-scale proof of concept, but let's find out how many units in total our customers hold in Lloyds plc.

In [29]:
query2=("MATCH ()-[r:holds]->(p:Equity) WHERE p.name='Lloyds plc'  RETURN sum(r.units) as units_held")

print(db.query(query2).get_response())

{'data': [[10]], 'columns': ['units_held']}
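As a sanity check on that figure, the same aggregation can be reproduced in plain Python. The `holds` list below is copied by hand from the "holds" relationships in the setup script, and the `units_held` helper is only for illustration.

```python
# (holder, investment, units), one tuple per "holds" relationship
holds = [
    ("account 1234", "Global Tech Acc", 10),
    ("account 2234", "EM Small Cap Acc", 20),
    ("account 3234", "Global Tech Acc", 5),
    ("account 2234", "Lloyds plc", 5),
    ("London's Best Investors", "Lloyds plc", 5),
]


def units_held(investment, holds):
    """Sum the units on every 'holds' edge pointing at `investment`."""
    return sum(units for _, target, units in holds if target == investment)


print(units_held("Lloyds plc", holds))  # 10
```

The total is 10 rather than 5 because the pattern `()-[r:holds]->` matches any kind of holder, so the investment club's 5 units are counted alongside the 5 held by account 2234.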


Let's go one step further and find out which accounts hold Lloyds plc and what type of account each one is.

In [33]:
query3="MATCH (a:Account)-[r:holds]->(p:Equity) WHERE p.name='Lloyds plc'  RETURN  a.account_id as account_id,a.type as account_type,sum(r.units) as units_held"

print(db.query(query3).get_response())

{'data': [['2234', 'ISA', 5]], 'columns': ['account_id', 'account_type', 'units_held']}


For the RDBMS model of this schema, the equivalent SQL query might look like the following, assuming we have a table called orders.

SELECT a.acc_id, a.type, SUM(o.units) AS units_held
FROM accounts AS a
JOIN orders AS o ON (a.acc_id = o.acc_id)
JOIN investments AS i ON (o.inv_id = i.inv_id)
WHERE i.name = 'Lloyds plc'
GROUP BY a.acc_id, a.type
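To check the relational version end to end, here's a minimal sketch using Python's built-in `sqlite3` module. The `accounts`, `orders` and `investments` tables and their columns are assumptions that mirror the query; they aren't taken from a real system. The rows are the account holdings from the mock data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (acc_id TEXT PRIMARY KEY, type TEXT);
    CREATE TABLE investments (inv_id INTEGER PRIMARY KEY, name TEXT);
    -- intermediate table linking accounts to investments
    CREATE TABLE orders (acc_id TEXT, inv_id INTEGER, units INTEGER);

    INSERT INTO accounts VALUES
        ('1234', 'Basic Trading Account'), ('2234', 'ISA'), ('3234', 'ISA');
    INSERT INTO investments VALUES
        (1, 'Global Tech Acc'), (2, 'EM Small Cap Acc'), (3, 'Lloyds plc');
    INSERT INTO orders VALUES
        ('1234', 1, 10), ('2234', 2, 20), ('3234', 1, 5), ('2234', 3, 5);
""")

# Two joins are needed to walk from an account to the investment's name.
rows = conn.execute("""
    SELECT a.acc_id, a.type, SUM(o.units) AS units_held
    FROM accounts AS a
    JOIN orders AS o ON a.acc_id = o.acc_id
    JOIN investments AS i ON o.inv_id = i.inv_id
    WHERE i.name = ?
    GROUP BY a.acc_id, a.type
""", ("Lloyds plc",)).fetchall()
print(rows)  # [('2234', 'ISA', 5)]
```

The result matches the Cypher output, but the path from account to investment has to be rebuilt through the `orders` table on every query, whereas in the graph it is stored directly as a relationship.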

## Proof of concept

The aim of this was to show you how a stockbroker's data could be modelled as a graph network. In fact, some people argue that any data problem in the world can be modelled as a graph network.

I hope you can see it provides a very flexible schema (structure) and can make relationship-heavy lookups much quicker, since following an edge replaces a join across intermediate tables!
