# Importing Data

## A Database Container

![](images/DB_containers_internal.png)

In [246]:
%%bash
docker run \
    -d --name neo4j \
    --rm \
    --publish=7474:7474 \
    --publish=7687:7687 \
    --env NEO4J_AUTH=neo4j/class \
    neo4j

8866f3a8c6f3fe942e2755fa1f0ab25bdcedefbc273951043d140ad9717e38b6


# A Simple Social Network

We will consider a simple social network, a graph with persons having a name, a job title, and a birthday. On top of that, persons can endorse each other.

![](./images/endorsment_graph.png)

Consequently, our data model looks as follows:
![](./images/datamodel.png)

## Creating and Using Indexes

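You can create an index on a property of all nodes with a given label. Listing the indexes with `db.indexes()` before and after the `CREATE INDEX` statement shows the effect: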

In [247]:
CALL db.indexes()



In [248]:
CREATE INDEX ON :Person(name)



In [249]:
CALL db.indexes()

+------------------------------------------------------------------------------------------------------------------------------+
| description              | label    | properties | state    | type                  | provider                               |
+------------------------------------------------------------------------------------------------------------------------------+
| "INDEX ON :Person(name)" | "Person" | ["name"]   | "ONLINE" | "node_label_property" | {version: "1.0", key: "lucene+native"} |
+------------------------------------------------------------------------------------------------------------------------------+

1 row available after 5 ms, consumed after another 5 ms

You usually do not need to specify which indexes a query should use. When a suitable index exists, Cypher uses it for comparisons in `WHERE` clauses, including equality, inequality, `IN`, `STARTS WITH`, and `exists` (called `has` before Neo4j 3.0). To drop an index, run:

```cypher
DROP INDEX ON :Person(name);
```

## Creating and Using Constraints

A constraint ensures, for example, that the value of a certain property is unique. Creating a uniqueness constraint automatically creates an index on that property.

In [250]:
CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE;



In [251]:
CALL db.indexes()

+-------------------------------------------------------------------------------------------------------------------------------+
| description              | label    | properties | state    | type                   | provider                               |
+-------------------------------------------------------------------------------------------------------------------------------+
| "INDEX ON :Person(name)" | "Person" | ["name"]   | "ONLINE" | "node_label_property"  | {version: "1.0", key: "lucene+native"} |
| "INDEX ON :Person(id)"   | "Person" | ["id"]     | "ONLINE" | "node_unique_property" | {version: "1.0", key: "lucene+native"} |
+-------------------------------------------------------------------------------------------------------------------------------+

2 rows available after 5 ms, consumed after another 0 ms

## Importing Data from a CSV File in Cypher


In [252]:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/datsoftlyngby/soft2018spring-databases-teaching-material/master/data/social_network_nodes_small.csv" AS row
MERGE (:Person {id: toInt(row.node_id), name: row.name, job: row.job, 
                          birthday: row.birthday});



To import the edges, which are stored in a separate CSV file, run:

In [253]:
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/datsoftlyngby/soft2018spring-databases-teaching-material/master/data/social_network_edges_small.csv" AS row
MATCH (f:Person {id: toInt(row.source_node_id)}), 
                      (t:Person {id: toInt(row.target_node_id)})
CREATE (f)-[:ENDORSES]->(t);



### What does `MERGE` do?

The `MERGE` clause ensures that a pattern exists in the graph. Either the pattern already exists, or it needs to be created.

`MERGE` either matches existing nodes and binds them, or it creates new data and binds that. It is like a combination of `MATCH` and `CREATE` that additionally allows you to specify what happens if the data was matched or created. For example, you can specify that the graph must contain a node for a user with a certain name. If there is not a node with the correct name, a new node will be created and its name property set.
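
The match-or-create behaviour described above can be sketched in plain Java. This is only an in-memory analogy, not how Neo4j implements `MERGE`: the class `MergeSketch`, the method `mergePerson`, and the ids used are hypothetical names chosen for illustration. It mimics a statement such as `MERGE (p:Person {id: 1}) ON CREATE SET p.name = "..."`.

```java
import java.util.HashMap;
import java.util.Map;

public class MergeSketch {

    // Hypothetical in-memory "graph": person nodes keyed by their id property
    private final Map<Integer, Map<String, Object>> personsById = new HashMap<>();

    /**
     * Returns the existing node with this id if there is one (the MATCH case),
     * otherwise creates it and sets its name (the CREATE case).
     */
    public Map<String, Object> mergePerson(int id, String name) {
        return personsById.computeIfAbsent(id, key -> {
            Map<String, Object> node = new HashMap<>();
            node.put("id", key);
            node.put("name", name);
            return node;
        });
    }

    public int nodeCount() {
        return personsById.size();
    }

    public static void main(String[] args) {
        MergeSketch graph = new MergeSketch();
        graph.mergePerson(1, "Sol Linkert");   // no match: node is created
        graph.mergePerson(1, "Sol Linkert");   // match: existing node is reused
        graph.mergePerson(2, "Gianna Alan");   // no match: node is created
        System.out.println(graph.nodeCount()); // prints 2
    }
}
```

Running the same merge twice leaves the node count unchanged, which is why `MERGE` is convenient for imports that may be repeated.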

### What do we have in the DB now?

In [254]:
MATCH (n)
RETURN count(n)

+----------+
| count(n) |
+----------+
| 200      |
+----------+

1 row available after 33 ms, consumed after another 1 ms

In [255]:
CALL db.schema()

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| nodes                                                                                                                                  | relationships                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [(:Person {name: "Person", _id_: -1, indexes: ["name"], constraints: ["CONSTRAINT ON ( person:Person ) ASSERT person.id IS UNIQUE"]})] | [[:ENDORSES {_id_: -1}[-1>-1]]] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

1 row available after 17 ms, consumed after another 1 ms

### Who endorses me?

To find the persons who endorse a given person, query for the tail nodes of that person's incoming endorsement relations.

In [256]:
MATCH ({name:"Sol Linkert"})<-[:ENDORSES]-(other)
RETURN other

+-----------------------------------------------------------------------------------------------------------------------+
| other                                                                                                                 |
+-----------------------------------------------------------------------------------------------------------------------+
| (:Person {birthday: "1987-10-05", name: "Merlin Bussone", _id_: 179, id: 179, job: "Transfer-Car Operator"})          |
| (:Person {birthday: "1954-05-24", name: "Eddie Macqueen", _id_: 121, id: 121, job: "Plastic-Joint Maker"})            |
| (:Person {birthday: "1993-09-28", name: "Toney Gatz", _id_: 101, id: 101, job: "Wire Drawer"})                        |
| (:Person {birthday: "1994-08-30", name: "Mellisa Sultani", _id_: 89, id: 89, job: "Diamond Selector"})                |
| (:Person {birthday: "1948-11-13", name: "Florine Chargualaf", _id_: 86, id: 86, job: "Tester, Semiconductor Wafers"}) |
| (:Person {birthday: "1

### Whom do I endorse?

Conversely, the persons whom a given person endorses are the head nodes of that person's outgoing endorsement relations.

In [257]:
MATCH ({name:"Sol Linkert"})-[:ENDORSES]->(other)
RETURN other

+------------------------------------------------------------------------------------------------------------------------------+
| other                                                                                                                        |
+------------------------------------------------------------------------------------------------------------------------------+
| (:Person {birthday: "1956-04-18", name: "Rudolph Bicklein", _id_: 161, id: 161, job: "Cupboard Builder"})                    |
| (:Person {birthday: "1975-08-05", name: "Elly Glosser", _id_: 22, id: 22, job: "Catalyst Operator, Gasoline"})               |
| (:Person {birthday: "1983-12-07", name: "Dulcie Miyares", _id_: 12, id: 12, job: "Needle-Loom Tender"})                      |
| (:Person {birthday: "1965-06-12", name: "Gianna Alan", _id_: 2, id: 2, job: "Foreign Clerk"})                                |
| (:Person {birthday: "1940-06-09", name: "Dino Kalt", _id_: 137, id: 137, job: "Street-Light Rep

### Who is endorsed by those whom I endorse?

Graph databases make it easy to query relations of a certain depth. For example, to find endorsements of depth two, i.e., those who are endorsed by those whom I endorse, we just create a path of depth two in our match pattern.

In [258]:
MATCH ({name:"Sol Linkert"})-[:ENDORSES]->()-[:ENDORSES]->(other_other)
RETURN other_other

+---------------------------------------------------------------------------------------------------------------------------------+
| other_other                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------+
| (:Person {birthday: "1945-04-23", name: "Erik Shoger", _id_: 186, id: 186, job: "Flare Breaker"})                               |
| (:Person {birthday: "1995-10-30", name: "Elfrieda Witherington", _id_: 25, id: 25, job: "Mill-And-Coal-Transport Operator"})    |
| (:Person {birthday: "1994-02-12", name: "Irvin Vanwagenen", _id_: 198, id: 198, job: "Track-Surfacing-Machine Operator"})       |
| (:Person {birthday: "1954-05-18", name: "Claude Mumme", _id_: 163, id: 163, job: "Director Of Pupil Personnel Program"})        |
| (:Person {birthday: "1961-07-20", name: "Hien Schwalenberg", _id_: 53, id:
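
To make the depth-two pattern above concrete, here is a minimal Java sketch of the same traversal over an in-memory adjacency map. The class `TwoHopSketch`, the method `endorsedAtDepthTwo`, and the node names `A` to `E` are made up for illustration. Unlike the Cypher query, which returns one row per matching path, this sketch collects the results into a set and therefore drops duplicates.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TwoHopSketch {

    /** Returns everyone reachable from start via exactly two ENDORSES hops. */
    public static Set<String> endorsedAtDepthTwo(Map<String, List<String>> endorses,
                                                 String start) {
        Set<String> result = new TreeSet<>();
        // first hop: everyone whom start endorses
        for (String other : endorses.getOrDefault(start, Collections.emptyList())) {
            // second hop: everyone whom "other" endorses
            result.addAll(endorses.getOrDefault(other, Collections.emptyList()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical toy graph, not the course dataset
        Map<String, List<String>> endorses = new HashMap<>();
        endorses.put("A", Arrays.asList("B", "C"));
        endorses.put("B", Arrays.asList("D"));
        endorses.put("C", Arrays.asList("D", "E"));
        System.out.println(endorsedAtDepthTwo(endorses, "A")); // prints [D, E]
    }
}
```

Each additional level of depth needs another nested loop here, whereas in Cypher it only needs one more relationship in the pattern.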

## Importing Data from the CLI

  * Small to medium-sized datasets can be imported with `LOAD CSV`
  * Large datasets can be _batch_ imported with `neo4j-admin import`

In [None]:
%%bash
docker stop neo4j

First, we download the dataset so that we can import it from the command line.

In [None]:
%%bash
mkdir import
cd import/
wget https://raw.githubusercontent.com/datsoftlyngby/soft2018spring-databases-teaching-material/master/data/social_network_nodes_small.csv
wget https://raw.githubusercontent.com/datsoftlyngby/soft2018spring-databases-teaching-material/master/data/social_network_edges_small.csv
cd ..

In [None]:
%%bash
ls -ltrh import/social_network_*.csv

In [None]:
%%bash
head import/social_network_nodes_small.csv

When importing data from CSV files, Neo4j can interpret columns correctly if they follow a certain naming convention. Consequently, we change the header lines of the two files. You could also do that in an editor or with any other tool of your choice.

In [None]:
%%bash
sed -i -E '1s/.*/:ID,name,job,birthday/' import/social_network_nodes_small.csv

In [None]:
%%bash
head import/social_network_nodes_small.csv

In [None]:
%%bash
head import/social_network_edges_small.csv

In [None]:
%%bash
sed -i -E '1s/.*/:START_ID,:END_ID/' import/social_network_edges_small.csv

In [None]:
%%bash
head import/social_network_edges_small.csv
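
As an alternative to the `sed` calls above, the header rewrite could also be sketched in Java. The class `HeaderRewriteSketch` and the method `withHeader` are hypothetical names, and the demo rows are made up rather than taken from the course dataset. The sketch reads the whole file into memory, which is fine for small files only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderRewriteSketch {

    /** Returns a copy of the lines with the first line replaced; empty input stays empty. */
    public static List<String> withHeader(List<String> lines, String newHeader) {
        List<String> result = new ArrayList<>(lines);
        if (!result.isEmpty()) {
            result.set(0, newHeader);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Demo on a temporary file with made-up rows, not the course dataset
        Path demo = Files.createTempFile("nodes", ".csv");
        Files.write(demo, Arrays.asList("node_id,name,job,birthday",
                                        "1,Jane Doe,Tester,1990-01-01"),
                    StandardCharsets.UTF_8);

        // Read all lines, swap the header, and write the file back
        List<String> rewritten = withHeader(
                Files.readAllLines(demo, StandardCharsets.UTF_8),
                ":ID,name,job,birthday");
        Files.write(demo, rewritten, StandardCharsets.UTF_8);

        System.out.println(Files.readAllLines(demo, StandardCharsets.UTF_8).get(0));
        // prints :ID,name,job,birthday
        Files.delete(demo);
    }
}
```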

### Getting APOC and the Graph Algorithms Plugins

In the following, we will use functionality that is not directly part of the Cypher language. We need the _APOC_ and the _Graph Algorithms_ plugins.

Neo4j is extensible through _procedures_, which are executed via the `CALL` clause; see https://neo4j.com/docs/developer-manual/current/extending-neo4j/procedures/.

In [None]:
%%bash
mkdir plugins
cd plugins
echo "Downloading APOC plugin"
wget https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/3.3.0.1/apoc-3.3.0.1-all.jar
echo "Downloading algorithms plugin"
wget https://github.com/neo4j-contrib/neo4j-graph-algorithms/releases/download/3.3.2.0/graph-algorithms-algo-3.3.2.0.jar
cd ..

To clean up our previous database, we stop its container.

In [None]:
%%bash
docker ps -a | grep neo4j

In [None]:
%%bash
docker stop neo4j

### Starting a New Neo4j Instance

Unlike previously, we mount the `import` and the `plugins` directories into the container, so that the corresponding data and JAR files are _visible_ to the Neo4j engine.

Additionally, you can set environment variables, which might be important for performance tuning.

```bash
docker run \
    ...
    --env=NEO4J_dbms_memory_pagecache_size=6G \
    --env=NEO4J_dbms_memory_heap_max__size=10G \
    ...
    neo4j
```

In [None]:
%%bash
docker run \
    -d --name neo4j \
    --rm \
    --publish=7474:7474 \
    --publish=7687:7687 \
    --volume=$(pwd)/import:/import \
    --volume=$(pwd)/plugins:/plugins \
    --env NEO4J_AUTH=neo4j/class \
    --env=NEO4J_dbms_security_procedures_unrestricted=apoc.\\\*,algo.\\\* \
    neo4j

### Batch Import from the Command-line

To batch import data via the command-line, we have to:

  * Stop the Neo4j engine,
  * Remove the current database files,
  * Import the corresponding data, and
  * Restart the Neo4j engine.
  
If you have Neo4j installed directly on your host machine, run:
```bash
$ neo4j stop
$ rm -rf data/databases/graph.db
$ neo4j-admin import \
      --nodes:Person import/social_network_nodes_small.csv \
      --relationships:ENDORSES import/social_network_edges_small.csv \
      --ignore-missing-nodes=true \
      --ignore-duplicate-nodes=true \
      --id-type=INTEGER
$ neo4j start
```

#### Stopping the Neo4j Engine and Removing the Database Files in Docker

In [None]:
%%bash
docker exec neo4j sh -c 'neo4j stop'
docker exec neo4j sh -c 'rm -rf data/databases/graph.db'

#### Importing the Data in Docker

In [None]:
%%bash
docker exec neo4j sh -c 'neo4j-admin import \
    --nodes:Person /import/social_network_nodes_small.csv \
    --relationships:ENDORSES /import/social_network_edges_small.csv \
    --ignore-missing-nodes=true \
    --ignore-duplicate-nodes=true \
    --id-type=INTEGER'

#### Restarting the Neo4j Engine in Docker

In [None]:
%%bash
docker restart neo4j

In [None]:
MATCH (n)
RETURN count(n)

In [None]:
CALL db.schema();

#### Counting all Nodes and Relationships

In [None]:
MATCH ()
WITH count(*) AS count
RETURN "nodes" AS type, count
UNION
MATCH ()-[]->()
WITH count(*) AS count
RETURN "relationships" AS type, count

# Accessing Neo4j from Java

The following is a condensed tutorial on how to execute Cypher queries and access a Neo4j database from Java code. The guide assumes that you use Maven to manage your code dependencies.

  * Create a Maven project. In NetBeans `New Project -> Maven -> Java Application`
  * Add a dependency on the Neo4j driver to your project configuration (`pom.xml`)
    
    ```xml
    <dependencies>
        <!-- tag::bolt-dependency[] -->
        <dependency>
            <groupId>org.neo4j.driver</groupId>
            <artifactId>neo4j-java-driver</artifactId>
            <version>1.4.4</version>
        </dependency>

        <!-- ...any other dependencies -->
    </dependencies>
    ```

  * Create a Java Class `ConnectionTest.java` and type in the following code:
  
    ```java
    package dk.cphbusiness.db.neo4j.intro.db.neo4j;

    import org.neo4j.driver.v1.*;

    public class ConnectionTest {

        public static void main(String[] args) {
            // try-with-resources closes the session and the driver, even on errors
            try (Driver driver = GraphDatabase.driver(
                         "bolt://localhost:7687",
                         AuthTokens.basic("neo4j", "class"));
                 Session session = driver.session()) {

                // Run a query matching all Person nodes
                StatementResult result = session.run(
                        "MATCH (s:Person) " +
                        "RETURN s.name AS name, s.job AS job");

                while (result.hasNext()) {
                    Record record = result.next();
                    System.out.println(record.get("name").asString() + " -> " +
                                       record.get("job").asString());
                }
            }
        }
    }
    ```
    
  * This program should print one line with name and job for each `Person` node in your Neo4j database.

This example is based on: https://neo4j.com/developer/java/#_the_example_project

See https://neo4j.com/developer/language-guides/ for guides on using languages other than Java.

## The Neo4j Java API

Alternatively, if you do not want to rely on Cypher queries to communicate with your Neo4j database, you can access it directly via the Java API (http://neo4j.com/docs/java-reference/current/javadocs/).


## Spring Data Neo4j, an Object-Graph Mapper

The following is based on chapter nine of _Neo4j in Action_ and the official documentation at https://neo4j.com/developer/spring-data-neo4j/.

Until now we have been working directly with the core Neo4j graph primitives—nodes and relationships—to represent and interact with (that is, read and persist) various domain model concepts.

Though that approach is extremely powerful and flexible, operating with the low-level Neo4j APIs can sometimes be quite verbose and result in a lot of boilerplate code, especially when it comes to working with domain model entities.

"In a nutshell, Spring Data Neo4j (SDN), is an object-graph mapping (OGM) framework that was created to make life easier for (currently only Java) developers who need, or would prefer, to work with a POJO-based domain model, where some or all of the data is stored in Neo4j."

```java
public class User {
    String userId;
    String name;
    Set<User> friends;
    Set<Viewing> views;
    User referredBy;
}
public class Movie {
    String title;
    Set<Viewing> views;
}
public class Viewing {
    User user;
    Movie movie;
    Integer stars;
}
```

SDN is an annotation-based object-graph mapping library. This means it is a library that relies on being able to recognize certain SDN-specific annotations attached to parts of your code. These annotations provide instructions about how to transform the associated code to the underlying structures in the graph.

Sometimes you may even find that you do not need to annotate certain pieces of code. This is because SDN tries to infer some sensible defaults, applying the principle of convention over configuration. OGM is to graphs what ORM is to an RDBMS.


```java
@NodeEntity
public class User {
    String name;

    @Indexed(unique=true)
    String userId;

    @GraphId
    Long nodeId;
    User referredBy;

    @RelatedTo(type = "IS_FRIEND_OF", direction = Direction.BOTH)
    Set<User> friends;

    @RelatedToVia
    Set<Viewing> views;
}

@NodeEntity
public class Movie {
    String title;

    @GraphId
    Long nodeId;

    @RelatedToVia(direction = Direction.INCOMING)
    Iterable<Viewing> views;
}

@RelationshipEntity(type = "HAS_SEEN")
public class Viewing {
    Integer stars;

    @GraphId
    Long relationshipId;

    @StartNode
    User user;

    @EndNode
    Movie movie;
}
```
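
To illustrate how an annotation-based mapper can work in principle, the following sketch uses Java reflection to turn an annotated object into a label and a property map. It is not how SDN is implemented: the annotations `@Node` and `@Skip`, the class `AnnotationMappingSketch`, and the example values are all hypothetical and defined only for this example. Unannotated fields are mapped by default, which mirrors the idea of convention over configuration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

public class AnnotationMappingSketch {

    // Hypothetical annotations for illustration; these are NOT the SDN annotations
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface Node { String label(); }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Skip { }

    @Node(label = "User")
    public static class User {
        String userId = "u1";
        String name = "Jane";
        @Skip String sessionToken = "secret"; // excluded from the mapping
    }

    /** Reads the node label from the class-level annotation. */
    public static String labelOf(Object entity) {
        Node node = entity.getClass().getAnnotation(Node.class);
        if (node == null) {
            throw new IllegalArgumentException("Not annotated with @Node");
        }
        return node.label();
    }

    /** Maps every field to a property unless it is marked with @Skip. */
    public static Map<String, Object> propertiesOf(Object entity) {
        Map<String, Object> properties = new TreeMap<>();
        for (Field field : entity.getClass().getDeclaredFields()) {
            if (field.isSynthetic() || field.isAnnotationPresent(Skip.class)) {
                continue;
            }
            field.setAccessible(true);
            try {
                properties.put(field.getName(), field.get(entity));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        User user = new User();
        System.out.println("(:" + labelOf(user) + " " + propertiesOf(user) + ")");
        // prints (:User {name=Jane, userId=u1})
    }
}
```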

## Getting Practical!

The PageRank algorithm, as described in _Artificial Intelligence: A Modern Approach_ (third edition) by Stuart J. Russell and Peter Norvig:

![](https://github.com/HelgeCPH/db_course_nosql/raw/1936259151dc57c43d1e83f40e35b88c86c2c650/lecture_notes/images/pr_descr.png)

In [None]:
CALL apoc.help("apoc.algo.pagerank") YIELD name, text
RETURN *

In [None]:
MATCH (p:Person) WITH collect(p) as persons
CALL apoc.algo.pageRank(persons) YIELD node, score
RETURN id(node), node.name, score
ORDER BY score DESC

#### How did that work?

Have a look at the implementation in Java: https://github.com/neo4j-contrib/neo4j-graph-algorithms/blob/3.2/algo/src/main/java/org/neo4j/graphalgo/PageRankProc.java.


#### Doing each Iterative Step Manually in Cypher

In [None]:
MATCH (a:Person)
SET a.pageRank = 1.0

In [None]:
// we have to go through all nodes
MATCH (a:Person)
WITH
  collect(DISTINCT a) AS pages
UNWIND pages as dest
  // let's find all source citations for a given node
  MATCH (source:Person)-[:ENDORSES]->(dest:Person)
  WITH
    collect(DISTINCT source) AS sources,
    dest AS dest
    UNWIND sources AS src
      // we have to know how many relationships the source node has
      MATCH (src:Person)-[r:ENDORSES]->(:Person)
      WITH
        src.pageRank / count(r) AS points,
        dest AS dest
      // now we have all information to update the destination node with the new pagerank
      WITH
        sum(points) AS p,
        dest AS dest
      SET dest.pageRank = 0.15 + 0.85 * p;

In [None]:
MATCH (a:Person) 
RETURN id(a), a.name, a.pageRank
ORDER BY a.pageRank DESC 
LIMIT 25;
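
The iterative step from the Cypher query above can be sketched in plain Java on a small in-memory graph. The sketch assumes the same update as that query, `0.15 + 0.85 * p`, where `p` is the sum over all endorsers of their rank divided by their number of outgoing endorsements. The class `PageRankStepSketch`, the method `step`, and the three-node toy graph are hypothetical and only meant for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageRankStepSketch {

    static final double DAMPING = 0.85;

    /** One synchronous update: all new ranks are computed from the old ranks. */
    public static Map<String, Double> step(Map<String, List<String>> endorses,
                                           Map<String, Double> ranks) {
        // Sum up, per destination, the shares handed out by its endorsers
        Map<String, Double> incoming = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : endorses.entrySet()) {
            List<String> targets = entry.getValue();
            if (targets.isEmpty()) {
                continue;
            }
            // Assumption: nodes without a stored rank start at 1.0, as in the SET above
            double share = ranks.getOrDefault(entry.getKey(), 1.0) / targets.size();
            for (String dest : targets) {
                incoming.merge(dest, share, Double::sum);
            }
        }
        // Nodes without incoming endorsements keep their old rank, as in the Cypher query
        Map<String, Double> updated = new HashMap<>(ranks);
        for (Map.Entry<String, Double> entry : incoming.entrySet()) {
            updated.put(entry.getKey(), (1 - DAMPING) + DAMPING * entry.getValue());
        }
        return updated;
    }

    public static void main(String[] args) {
        // Hypothetical toy graph: A endorses B and C, B endorses C, C endorses A
        Map<String, List<String>> endorses = new HashMap<>();
        endorses.put("A", Arrays.asList("B", "C"));
        endorses.put("B", Arrays.asList("C"));
        endorses.put("C", Arrays.asList("A"));

        Map<String, Double> ranks = new HashMap<>();
        ranks.put("A", 1.0);
        ranks.put("B", 1.0);
        ranks.put("C", 1.0);

        // After one step: roughly A = 1.0, B = 0.575, C = 1.425
        System.out.println(step(endorses, ranks));
    }
}
```

Calling `step` repeatedly on its own result corresponds to re-running the Cypher update cell several times.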


  * The above is based on the following reply: https://stackoverflow.com/a/36734460
  * See more on the APOC PageRank algorithm here: https://www.adamcowley.co.uk/neo4j/wordpress-recommendations-neo4j-part-4-pagerank-with-apoc/