
# Neo4j Streams Sink

This module lets Neo4j consume data from a Kafka topic, and it does so in a "smart" way: you define a custom Cypher query for each topic. All you need to do is add a line like this to your neo4j.conf:

`streams.sink.topic.cypher.<TOPIC>=<CYPHER_QUERY>`

So if you define a query just like this:

```
streams.sink.topic.cypher.my-topic=MERGE (n:Person{id: event.id}) \
    ON CREATE SET n += event.properties
```

And for events like this:

`{id:"alice@example.com",properties:{name:"Alice",age:32}}`

Under the hood the Sink module will execute a query like this:

```
UNWIND {batch} AS event
MERGE (n:Person {id: event.id})
    ON CREATE SET n += event.properties
```

Continuing with the example above, a possible full representation could be:

```
WITH [{id:"alice@example.com",properties:{name:"Alice",age:32}},
    {id:"bob@example.com",properties:{name:"Bob",age:42}}] AS batch
UNWIND batch AS event
MERGE (n:Person {id: event.id})
    ON CREATE SET n += event.properties
```

This gives developers the power to define their own business rules: the query decides whether each incoming event updates, adds to, removes from, or otherwise adapts the graph data.
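
To make the `MERGE ... ON CREATE SET` behaviour of the template concrete, here is a minimal pure-Python sketch that mimics what the batched query does to the two sample events. The `apply_batch` function and the dictionary standing in for the graph are hypothetical illustrations, not part of the Sink module.

```python
def apply_batch(store, batch):
    """Mimic:
        UNWIND batch AS event
        MERGE (n:Person {id: event.id})
            ON CREATE SET n += event.properties
    `store` is a plain dict standing in for the Person nodes (an assumption).
    """
    for event in batch:
        # MERGE: only create the node when no node with this id exists
        if event["id"] not in store:
            node = {"id": event["id"]}
            # ON CREATE SET n += event.properties
            node.update(event["properties"])
            store[event["id"]] = node
    return store


batch = [
    {"id": "alice@example.com", "properties": {"name": "Alice", "age": 32}},
    {"id": "bob@example.com", "properties": {"name": "Bob", "age": 42}},
]
people = apply_batch({}, batch)

# A later event for an existing id changes nothing, because the
# properties are only set when the node is created.
people = apply_batch(
    people, [{"id": "alice@example.com", "properties": {"age": 99}}]
)
print(people["alice@example.com"])
```

Swapping `ON CREATE SET` for a plain `SET` in the Cypher template would correspond to always calling `node.update(...)`, which is one way the query encodes a different business rule.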

## Our configuration

```
NEO4J_streams_sink_topic_cypher_pharma: "
          MERGE (p:Pharmacy{fiscalId: event.FISCAL_ID}) ON CREATE SET p.name = event.NAME
          MERGE (t:PharmacyType{type: event.TYPE_NAME})
          MERGE (a:Address{name: event.ADDRESS + ', ' + event.CITY})
            ON CREATE SET a.latitude = event.LATITUDE, a.longitude = event.LONGITUDE,
              a.code = event.POSTAL_CODE, a.point = event.POINT
          MERGE (c:City{name: event.CITY})
          MERGE (p)-[:IS_TYPE]-(t)
          MERGE (p)-[:HAS_ADDRESS]-(a)
          MERGE (a)-[:IS_LOCATED_IN]->(c)"
```
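
The variable name above follows the Neo4j Docker image convention for passing settings through the environment: the `NEO4J_` prefix is dropped, single underscores become dots, and double underscores become a single underscore. The sketch below reproduces that mapping in simplified form. The `env_to_conf_key` helper is hypothetical; the real image performs the translation in its entrypoint script.

```python
def env_to_conf_key(env_name, prefix="NEO4J_"):
    """Translate a Docker-style environment variable name into a neo4j.conf key."""
    if not env_name.startswith(prefix):
        raise ValueError(f"{env_name!r} does not start with {prefix!r}")
    body = env_name[len(prefix):]
    # Protect double underscores, turn single ones into dots, then restore
    return body.replace("__", "\0").replace("_", ".").replace("\0", "_")


print(env_to_conf_key("NEO4J_streams_sink_topic_cypher_pharma"))
print(env_to_conf_key("NEO4J_dbms_memory_heap_max__size"))
```

So the configuration in this section is the environment-variable form of `streams.sink.topic.cypher.pharma=<CYPHER_QUERY>`.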


# (Neo4j)-[:LOVES]->(Kafka)

## Architecture

<div class="img-responsive center-block" style="background-image: url('https://cdn-images-1.medium.com/max/2000/1*0RNrK1OSS779TJ6F3sysjQ.png'); width: 1327px; height: 300px; background-position: center; background-size: cover;"></div>

# The Open Data API

We'll use the Italian Ministry of Health's open dataset of pharmacies.

## The data model

<div class="img-responsive center-block" style="background-image: url('https://cdn-images-1.medium.com/max/1600/1*1J4GGP2XenkCfuBi8ZCrjw.png'); width: 481px; height: 134px; background-position: center; background-size: cover;"></div>

## Link

https://neo4j-contrib.github.io/neo4j-streams/

# Initialize Spark & Neo4j Session

In [None]:
# Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').config(
    "spark.jars.packages",
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2"
).getOrCreate()

# Neo4j
import sys
!{sys.executable} -m pip install py2neo sparksql-magic neographviz

from py2neo import Graph

graph = Graph("bolt://neo4j:7687", auth=("neo4j", "zeppelin"))

%load_ext sparksql_magic

## Set the URL of the Open Data API

In [None]:
fileUrl = input("File URL: ") # http://www.dati.salute.gov.it/imgs/C_17_dataset_5_download_itemDownload0_upFile.CSV

import os
from urllib.parse import urlparse

fileUrlParsed = urlparse(fileUrl)
localFilePath = "/home/streams/" + os.path.basename(fileUrlParsed.path)
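
As a quick check of what the cell above computes, the sketch below wraps the same logic in a function and applies it to the sample URL from the comment. The `local_path_for` name is hypothetical, and `posixpath` is used here only so the result does not depend on the host operating system.

```python
import posixpath
from urllib.parse import urlparse


def local_path_for(file_url, target_dir="/home/streams/"):
    """Derive the local file path from the last segment of the URL path."""
    parsed = urlparse(file_url)
    # parsed.path excludes the query string, so "?x=1" never ends up in the name
    return target_dir + posixpath.basename(parsed.path)


url = "http://www.dati.salute.gov.it/imgs/C_17_dataset_5_download_itemDownload0_upFile.CSV"
print(local_path_for(url))
```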

## Download the Open Data dataset from the URL (if it's not already present in the file system)

In [None]:
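import os
import requests

# Download only when the file is not already on disk
if not os.path.exists(localFilePath):
    r = requests.get(fileUrl, allow_redirects=True)
    r.raise_for_status()  # fail early instead of saving an HTTP error page
    with open(localFilePath, 'wb') as f:
        f.write(r.content)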

## Load the CSV into a Dataframe

In [None]:
csvDF = spark.read \
    .format("csv") \
    .option("delimiter", ";") \
    .option("header", "true") \
    .load(localFilePath)

## Inspect the schema

In [None]:
csvDF.printSchema()

## Sample some data

In [None]:
csvDF.show(10)

## Create a temp view

In [None]:
csvDF.createOrReplaceTempView("open_data")

# Filter the Dataframe and replace the old one

In [None]:
%%sparksql

CREATE OR REPLACE TEMP VIEW OPEN_DATA_EN AS
SELECT CODICEIDENTIFICATIVOFARMACIA AS PHARMA_ID,
    INDIRIZZO AS ADDRESS,
    DESCRIZIONEFARMACIA AS NAME,
    PARTITAIVA AS FISCAL_ID,
    CAP AS POSTAL_CODE,
    DESCRIZIONECOMUNE AS CITY,
    DESCRIZIONETIPOLOGIA AS TYPE_NAME,
    REGEXP_REPLACE(LATITUDINE, ',', '.') AS LATITUDE,
    REGEXP_REPLACE(LONGITUDINE, ',', '.') AS LONGITUDE,
    CONCAT(REGEXP_REPLACE(LATITUDINE, ',', '.'), ',', REGEXP_REPLACE(LONGITUDINE, ',', '.')) AS POINT
FROM OPEN_DATA
WHERE DATAFINEVALIDITA <> '-'
AND CODICEIDENTIFICATIVOFARMACIA <> '-'
AND PARTITAIVA <> '-'
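
The `REGEXP_REPLACE` calls in the view above turn decimal commas into decimal points, and `POINT` joins the two cleaned values with a comma. A minimal Python sketch of the same transformation on a single row follows; the `normalize_coordinates` helper and the sample values are made up for illustration.

```python
def normalize_coordinates(row):
    """Replicate the LATITUDE, LONGITUDE and POINT columns of OPEN_DATA_EN."""
    latitude = row["LATITUDINE"].replace(",", ".")
    longitude = row["LONGITUDINE"].replace(",", ".")
    return {
        "LATITUDE": latitude,
        "LONGITUDE": longitude,
        # CONCAT(latitude, ',', longitude)
        "POINT": latitude + "," + longitude,
    }


# Hypothetical input row using decimal commas
sample = {"LATITUDINE": "45,4642", "LONGITUDINE": "9,19"}
print(normalize_coordinates(sample))
```

Note that all three values stay strings, exactly as in the SQL view; any conversion to numbers would have to happen later, for example inside the Cypher query.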

## Create a new Dataframe that will be used to send the data to Kafka

In [None]:
%%sparksql

CREATE OR REPLACE TEMP VIEW OPEN_DATA_KAFKA_STAGE AS
SELECT 
    PHARMA_ID AS KEY,
    TO_JSON(
        STRUCT(PHARMA_ID,
            ADDRESS,
            NAME,
            FISCAL_ID,
            POSTAL_CODE,
            CITY,
            TYPE_NAME,
            LATITUDE,
            LONGITUDE,
            POINT)
    ) AS VALUE
FROM OPEN_DATA_EN
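
Each row of the staging view becomes one Kafka record: the pharmacy identifier is the key and the JSON document is the value, whose fields are what the Cypher query later reads as `event.FISCAL_ID`, `event.NAME`, and so on. The sketch below builds such a pair in plain Python. The `to_kafka_record` helper and the sample row are hypothetical, and the output is not guaranteed to match Spark's `TO_JSON` byte for byte.

```python
import json

FIELDS = [
    "PHARMA_ID", "ADDRESS", "NAME", "FISCAL_ID", "POSTAL_CODE",
    "CITY", "TYPE_NAME", "LATITUDE", "LONGITUDE", "POINT",
]


def to_kafka_record(row):
    """Return the (key, value) pair for one row of OPEN_DATA_EN."""
    key = row["PHARMA_ID"]
    value = json.dumps({field: row[field] for field in FIELDS})
    return key, value


# Made-up row, for illustration only
row = {
    "PHARMA_ID": "1", "ADDRESS": "Via Roma 1", "NAME": "Farmacia Esempio",
    "FISCAL_ID": "00000000000", "POSTAL_CODE": "20100", "CITY": "MILANO",
    "TYPE_NAME": "Ordinaria", "LATITUDE": "45.4642", "LONGITUDE": "9.19",
    "POINT": "45.4642,9.19",
}
key, value = to_kafka_record(row)
print(key, value)
```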

In [None]:
%%sparksql

SELECT * FROM OPEN_DATA_KAFKA_STAGE

# Create the Constraints on Neo4j

In [None]:
graph.run("CREATE CONSTRAINT ON (p:Pharmacy) ASSERT p.fiscalId IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (t:PharmacyType) ASSERT t.type IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (a:Address) ASSERT a.name IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE")
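
The four statements differ only in label and property, so they can be generated from a small table. The sketch below also shows the `FOR ... REQUIRE` form that Neo4j 4.4 introduced and Neo4j 5 requires; which form applies depends on the server version, which this notebook does not state. The `unique_constraint` helper and its `modern_syntax` flag are hypothetical.

```python
CONSTRAINTS = [
    ("Pharmacy", "fiscalId"),
    ("PharmacyType", "type"),
    ("Address", "name"),
    ("City", "name"),
]


def unique_constraint(label, prop, modern_syntax=False):
    """Build a uniqueness-constraint statement for one label/property pair."""
    if modern_syntax:
        # Syntax for Neo4j 4.4 and later
        return (f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:{label}) "
                f"REQUIRE n.{prop} IS UNIQUE")
    # Older syntax, as used in the cell above
    return f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.{prop} IS UNIQUE"


statements = [unique_constraint(label, prop) for label, prop in CONSTRAINTS]
for statement in statements:
    print(statement)  # each string could be passed to graph.run(...)
```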

## Clean the DB (if it's necessary)

In [None]:
graph.run("MATCH (n) DETACH DELETE n")

## Check the current content of the DB

In [None]:
graph.run("MATCH (n) return count(n)")

# Get the staging Dataset and send the data to the "pharma" topic

In [None]:
spark.table("OPEN_DATA_KAFKA_STAGE") \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .write \
    .format("kafka") \
    .option("kafka.enable.auto.commit", "true") \
    .option("kafka.bootstrap.servers", "broker:9093") \
    .option("topic", "pharma") \
    .save()

## Query the dataset over Neo4j

In [None]:
city = input("City: ")
query = "MATCH x=(p:Pharmacy)-[pha:HAS_ADDRESS]->(a:Address)-[aic:IS_LOCATED_IN]->(c:City{{name:'{city}'}}) RETURN x LIMIT 100".format(city=city)

# Import only plot so that py2neo's Graph class is not shadowed
from neographviz import plot

plot(graph, query)
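
Interpolating the input straight into the Cypher string breaks for city names that contain a quote, such as `L'Aquila`. One way around this is a query parameter, sketched below. The `pharmacies_in_city` helper is hypothetical. py2neo's `Graph.run` accepts parameters as keyword arguments, but whether neographviz's `plot` can forward them is not something this notebook shows, so the sketch only builds the query and its parameters.

```python
def pharmacies_in_city(city):
    """Return a parameterised Cypher query plus its parameter map."""
    query = (
        "MATCH x=(p:Pharmacy)-[:HAS_ADDRESS]->(a:Address)"
        "-[:IS_LOCATED_IN]->(c:City {name: $city}) "
        "RETURN x LIMIT 100"
    )
    return query, {"city": city}


query, params = pharmacies_in_city("L'Aquila")
print(query)
print(params)
# With a py2neo Graph this could be executed as:
#     graph.run(query, **params)
```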
