Graph Databases versus SQL for Data Science: Identifying ‘Graph-y’ Problems in Your Data

Written by: Dr. Clair J. Sullivan, Data Science Advocate, Neo4j

Twitter: @CJLovesData1

Last updated: March 23, 2022

Introduction

A frequent question from data scientists is “why would I want to use a graph database when I can do everything I need in SQL?” In some cases an RDBMS is a fine solution. However, there are many times when your data is a graph, even though it might not be immediately recognizable as such. This course walks through how to identify whether a problem is actually a graph problem and the benefits of analyzing it as a graph rather than with traditional SQL.

In this course we will be working with a data set of routes between airports. It is based on the graph data that can be found here.
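To make the idea of a ‘graph-y’ problem concrete, here is a minimal sketch (the routes and airport codes below are hypothetical, not taken from the course data set) of why route data is naturally a graph: a question like “which airports can I reach in two hops?” is a traversal, which requires a self-join or recursive CTE in SQL but is only a few lines over an adjacency map in Python.

# Hypothetical mini edge list of routes (source airport -> destination airport).
routes = [
    ("DEN", "ORD"),
    ("ORD", "JFK"),
    ("DEN", "SFO"),
    ("SFO", "NRT"),
]

# Build an adjacency map: the natural "graph" view of the same rows.
adjacency = {}
for src, dst in routes:
    adjacency.setdefault(src, set()).add(dst)

# Airports reachable from DEN in exactly two hops -- the kind of traversal
# question that needs a self-join (or recursive CTE) in SQL.
two_hops = {
    hop2
    for hop1 in adjacency.get("DEN", set())
    for hop2 in adjacency.get(hop1, set())
}
print(two_hops)  # {'JFK', 'NRT'}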

Required tools and packages

  • Official neo4j Python driver (pip install neo4j); see the connection sketch after this list
  • Traditional Python data science packages (numpy, pandas)
  • A notebook environment (Jupyter, Google Colab, VS Code Notebooks, etc.)
  • A SQL environment (see comment on Docker below)
  • Neo4j Sandbox
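As a quick way to confirm the driver and your Sandbox are set up, here is a minimal connection check. This is only a sketch: the Bolt URI and password below are placeholders that you should replace with the values shown on your Neo4j Sandbox instance.

from neo4j import GraphDatabase

# Placeholder connection details -- replace with the Bolt URI, username, and
# password shown on your Neo4j Sandbox instance (these values are hypothetical).
uri = "bolt://<sandbox-ip>:7687"
auth = ("neo4j", "<sandbox-password>")

driver = GraphDatabase.driver(uri, auth=auth)

with driver.session() as session:
    result = session.run("RETURN 1 AS ok")
    print(result.single()["ok"])  # prints 1 if the connection works

driver.close()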

Recommended tools

  • Docker (optional)
    • We will specifically be using docker-compose
    • The Docker setup provided in this repo contains the PostgreSQL database for this course. The docker-compose.yml file defines three containers: Portainer, which we will use to manage and interface with the other two, the Postgres database, and pgAdmin 4.
    • The use of Docker is recommended so that we can all run the same SQL database system without interfering with any databases on your local machine. However, if you prefer not to use Docker and would rather use your own Postgres installation, you can simply run the database population queries in ./sql (see the sketch after this list).
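If you do go the non-Docker route (or want to script the population step), the following is a minimal sketch of running one of those scripts from Python. Note that psycopg2 is not in the required-packages list above, the host, port, database name, and script file name are placeholders, and the credentials shown are the ones set in this repo's docker-compose.yml (postgres and letmein).

import psycopg2  # assumption: installed separately, e.g. pip install psycopg2-binary

# Credentials come from docker-compose.yml (postgres / letmein); the host, port,
# database name, and script file name are placeholders for your own setup.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="postgres",
    password="letmein",
    dbname="postgres",
)

with conn, conn.cursor() as cur:
    # Run one of the population scripts from ./sql (file name is hypothetical).
    with open("sql/populate_tables.sql") as f:
        cur.execute(f.read())

conn.close()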

Basic container instructions

In this course we will be building and running the containers from the command line. To bring them up, run the following command:

docker-compose build && docker-compose up

Once you are done with the containers, you can bring them down by hitting CTRL+C and then running:

docker-compose down

Accessing the various containers

We will be using Portainer to monitor and interface with our containers. To get into Portainer, use your browser to navigate to http://localhost:9443. Note that your browser will likely flag this connection as untrusted, so you will need to proceed to the address through the Advanced button.

There are then two ways that we can reach the Postgres database:

1. Using the command line

Via Portainer, open the console for the pg_container container. From there, you can get into Postgres with the following command:

psql -h localhost -U postgres

2. Using pgAdmin

Using Portainer, open port 80 of pgadmin4_container. You will need to provide the login and password set in the docker-compose.yml file (admin@admin.com and letmein). Then establish a server connection to the database using the IP address of pg_container and the database login and password, also set in docker-compose.yml (postgres and letmein).
