# Citation Dataset Loading

In this notebook we're going to load the citation dataset into Neo4j.

First let's import the Python library that will help us with this process.

We'll start by importing the official Neo4j Python driver, which we'll use to import the data into Neo4j. The driver connects to the database over the Bolt protocol and lets us run Cypher queries from within Python applications.

In [1]:
from neo4j import GraphDatabase

In [2]:
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "letmein"))        
print(driver.address)

localhost:7687


In [3]:
# Alternative connection, for when Neo4j is reachable under the hostname link-prediction-neo4j:
# driver = GraphDatabase.driver("bolt://link-prediction-neo4j", auth=("neo4j", "admin"))
print(driver.address)

localhost:7687
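
The two connection cells above differ only in the URI and credentials, so one option is to read those values from the environment instead of editing the notebook. The following is a minimal sketch, not part of the original notebook: the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `connection_settings` are hypothetical, and the defaults are the values used in cell 2.

```python
import os


def connection_settings(env=os.environ):
    """Return (uri, (user, password)), falling back to the notebook's defaults.

    The environment variable names are hypothetical; pick whatever suits your setup.
    """
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD", "letmein")
    return uri, (user, password)


uri, auth = connection_settings()
print(uri)
# The driver would then be created exactly as before:
# driver = GraphDatabase.driver(uri, auth=auth)
```

Setting `NEO4J_URI=bolt://link-prediction-neo4j` before starting the notebook would then select the alternative host without any code changes.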


## Create Constraints

First let's create some constraints to make sure we don't import duplicate data:

In [4]:
with driver.session(database="neo4j") as session:
    display(session.run("CREATE CONSTRAINT ON (a:Article) ASSERT a.index IS UNIQUE").consume().counters)
    display(session.run("CREATE CONSTRAINT ON (a:Author) ASSERT a.name IS UNIQUE").consume().counters)
    display(session.run("CREATE CONSTRAINT ON (v:Venue) ASSERT v.name IS UNIQUE").consume().counters)

{'constraints_added': 1}

{'constraints_added': 1}

{'constraints_added': 1}
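
The `ON ... ASSERT` form used above was deprecated in later Neo4j releases. If you are running a newer server, the same three constraints can probably be written in the `FOR ... REQUIRE` form instead. The sketch below only builds the statements as strings. It assumes a server that accepts this syntax together with `IF NOT EXISTS` (which should be Neo4j 4.4 or later), and the constraint names it generates are a hypothetical naming scheme.

```python
# (label, property) pairs taken from the constraints created above.
UNIQUE_KEYS = [("Article", "index"), ("Author", "name"), ("Venue", "name")]


def constraint_statement(label, prop):
    """Build a uniqueness constraint in the newer FOR ... REQUIRE syntax."""
    name = f"{label.lower()}_{prop}_unique"  # hypothetical naming scheme
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


statements = [constraint_statement(label, prop) for label, prop in UNIQUE_KEYS]
for statement in statements:
    print(statement)

# Each statement could then be run the same way as in the cell above:
# with driver.session(database="neo4j") as session:
#     for statement in statements:
#         session.run(statement).consume()
```

With `IF NOT EXISTS`, re-running the cell should be a no-op rather than an error when the constraints are already in place.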

## Loading the data

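Now let's load the data into the database. We'll create nodes for Articles, Venues, and Authors, and connect them with `VENUE`, `AUTHOR`, and `CITED` relationships. The query uses APOC's `apoc.periodic.iterate` to read the four JSON files and write the rows in batches of 1,000.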
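
To make the mapping concrete, here is a rough Python sketch of how the import query below splits a single JSON record. It is an illustration only: `split_record` is a hypothetical helper, the sample record is made up, and the filtering mimics our reading of `apoc.map.clean(value, ["id","authors","references","venue"], [0])`, which drops the listed keys and any entry whose value is `0`.

```python
SKIPPED_KEYS = {"id", "authors", "references", "venue"}
SKIPPED_VALUES = [0]


def split_record(record):
    """Split one JSON record into the pieces the import query writes."""
    properties = {
        key: value
        for key, value in record.items()
        if key not in SKIPPED_KEYS
        # None is dropped as well: a null property is never stored in Neo4j.
        and value is not None
        and value not in SKIPPED_VALUES
    }
    return {
        "index": record["id"],                      # key of the Article node
        "properties": properties,                   # remaining Article properties
        "venue": record.get("venue"),               # one Venue node
        "authors": record.get("authors") or [],     # one Author node each
        "citations": record.get("references") or [],  # one CITED relationship each
    }


# A made-up record, shaped like the fields the query refers to.
sample = {
    "id": "a1",
    "title": "A hypothetical paper",
    "year": 2017,
    "n_citation": 0,
    "venue": "Some Venue",
    "authors": ["Ada", "Bob"],
    "references": ["b2"],
}

parts = split_record(sample)
print(parts["properties"])  # {'title': 'A hypothetical paper', 'year': 2017}
```

Note that a cited article such as `b2` is merged by `index` alone, so it carries no title unless its own record is also present in the files.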


In [5]:
query = """
CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json", "dblp-ref-1.json", "dblp-ref-2.json", "dblp-ref-3.json"] AS file
   CALL apoc.load.json("https://github.com/mneedham/link-prediction/raw/master/data/" + file)
   YIELD value
   RETURN value',
  'MERGE (a:Article {index:value.id})
   SET a += apoc.map.clean(value,["id","authors","references", "venue"],[0])
   WITH a, value.authors as authors, value.references AS citations, value.venue AS venue
   MERGE (v:Venue {name: venue})
   MERGE (a)-[:VENUE]->(v)
   FOREACH(author in authors | 
     MERGE (b:Author{name:author})
     MERGE (a)-[:AUTHOR]->(b))
   FOREACH(citation in citations | 
     MERGE (cited:Article {index:citation})
     MERGE (a)-[:CITED]->(cited))', 
   {batchSize: 1000, iterateList: true});
"""

with driver.session(database="neo4j") as session:
    result = session.run(query)
    for row in result:
        print(row)

<Record batches=52 total=51956 timeTaken=18 committedOperations=51956 failedOperations=0 failedBatches=0 retries=0 errorMessages={} batch={'total': 52, 'committed': 52, 'failed': 0, 'errors': {}} operations={'total': 51956, 'committed': 51956, 'failed': 0, 'errors': {}} wasTerminated=False failedParams={}>
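
The output reports 52 batches for 51,956 rows, which matches what we'd expect from `batchSize: 1000`. As a small sanity check, the arithmetic can be sketched in plain Python. The helpers `batch_count` and `batches` are hypothetical names, and the slicing only approximates how the rows are grouped, not how APOC implements it.

```python
import math


def batch_count(total_rows, batch_size):
    """Number of batches needed to cover total_rows."""
    return math.ceil(total_rows / batch_size)


def batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


print(batch_count(51956, 1000))  # 52

sizes = [len(batch) for batch in batches(list(range(51956)), 1000)]
print(len(sizes), sizes[-1])  # 52 956
```

The first 51 batches hold 1,000 rows each and the last one holds the remaining 956.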


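The import also created `Article` nodes for every cited paper, including papers that don't appear in the files we loaded. Those nodes have an `index` but no other properties, so let's delete every article without a title, along with its relationships:

In [6]: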
query = """
MATCH (a:Article) 
WHERE a.title IS NULL
DETACH DELETE a
"""

with driver.session(database="neo4j") as session:
    result = session.run(query)
    print(result.consume().counters)

{'nodes_deleted': 132357, 'relationships_deleted': 261202}
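
After the cleanup it can be useful to check how many nodes of each label remain. The following is a minimal sketch rather than part of the original notebook: `count_nodes_by_label` is a hypothetical helper that expects an open session like the ones used above, and it assumes every node has exactly one label, which is the case for the nodes created by the import query.

```python
COUNT_QUERY = """
MATCH (n)
RETURN labels(n)[0] AS label, count(*) AS count
ORDER BY label
"""


def count_nodes_by_label(session):
    """Run the count query and return a {label: count} dictionary."""
    return {record["label"]: record["count"] for record in session.run(COUNT_QUERY)}


# Usage, with the driver created at the top of the notebook:
# with driver.session(database="neo4j") as session:
#     print(count_nodes_by_label(session))
```

Taking the session as a parameter keeps the helper independent of how the connection was created.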


In the next notebook we'll explore the data that we've imported. 