# Plan for today
* [Last week's assignment](https://github.com/datsoftlyngby/soft2019spring-databases/blob/master/assignments/assignment12.md)
    * Link to shared document is on moodle (and [here](https://docs.google.com/document/d/1c8W921VuAG5FaQJuMjIn6JSdAuYiA2eBQxrKk-YYXVs/edit))
* More Neo4J
* This week's assignment
* (Preliminary) exam info (full info next week)

# Graph databases
It is not only Neo4J:

https://en.wikipedia.org/wiki/Graph_database

# Neo4J continued


The following is based on http://neo4j.com/graphgist/6619085 and https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/ and in particular on `:play http://guides.neo4j.com/modeling_airports`.

The original data is extracted from https://www.transtats.bts.gov/databases.asp?Mode_ID=1&Mode_Desc=Aviation&Subject_ID2=0

In [None]:
%%bash
docker run \
    -d --name neo4j \
    --rm \
    --publish=7474:7474 \
    --publish=7687:7687 \
    --env NEO4J_AUTH=neo4j/fancy!99Doorknob \
    neo4j

In [None]:
%%bash
docker ps -a

In [None]:
import sys
from neo4j import GraphDatabase

uri = "bolt://localhost:7687"
auth=("neo4j", "fancy!99Doorknob")
driver = GraphDatabase.driver(uri, auth=auth)

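def neo(command):
    """Run a Cypher command and return its records as a list (None on error)."""
    try:
        with driver.session() as session:
            # Read all records while the session is still open; newer driver
            # versions may refuse to read a result after its session is closed.
            return list(session.run(command))
    except Exception as ex:
        print(str(ex), file=sys.stderr)
        
def neov(command):
    """Run a Cypher command and return each row as a list of values."""
    records = neo(command)
    if records is None:
        return 'Query failed, see the error message above'
    return [record.values() for record in records]
"done"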

## Recap on Modelling in Neo4j



When modeling data it is useful to have a use case of a system's application in mind. For example, we could start with the following question:

> As an air travel enthusiast
>
> I want to know how airports are connected
>
> So that I can find the busiest ones


Consequently, we could create the following model:

![initial_model](http://guides.neo4j.com/modeling_airports/img/initial.png)


### Manually creating the model

Before we start working with a large dataset let’s create some nodes and relationships manually. First we will create some airports:

In [None]:
neov("""
CREATE (:Airport {code: "LAX"}),
       (:Airport {code: "LAS"}),
       (:Airport {code: "ABQ"})
""")

Navigate to http://localhost:7474/browser/ to work with a browser based Neo4j console.

See https://www.world-airport-codes.com/ for the airport codes.

We can find `LAX` by changing the `CREATE` to a `MATCH` and returning the matched node:

In [None]:
neov('''
MATCH (lax:Airport {code: "LAX"})
RETURN lax
''')

### Create relationships

Now let’s create some connections between those airports.

In [None]:
neov('''
MATCH (las:Airport {code: "LAS"})
MATCH (lax:Airport {code: "LAX"})
CREATE (las)-[connection:CONNECTED_TO {
  airline: "WN",
  flightNumber: "82",
  date: "2008-1-3",
  departure: 1715,
  arrival: 1820}]->(lax)
''')


We can check that the relationship was created correctly by executing the following query:

In [None]:
neov('''
MATCH connection = (las:Airport {code: "LAS"})
        -[:CONNECTED_TO]->
        (lax:Airport {code: "LAX"})
RETURN connection
''')

### Create Relationships Idempotently

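What does *idempotent* mean? For our purposes, an idempotent operation leaves the graph in the same state whether it is run once or many times. The dictionary definition reads: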

> *idempotent* ... denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.

When using the `MERGE` command, we only need to inline the properties that make the `CONNECTED_TO` relationship unique. In this case it is the combination of airline, flightNumber, and date. To idempotently create a specific connection between airports we can run the following query:

In [None]:
neov('''
MATCH (las:Airport {code: "LAS"})
MATCH (lax:Airport {code: "LAX"})
MERGE (las)-
    [connection:
        CONNECTED_TO { 
            airline: "WN", 
            flightNumber: "82", 
            date: "2008-1-3"}
    ]
    ->(lax)
ON CREATE SET connection.departure = 1715, connection.arrival = 1820
''')

Let’s try it with another connection to get the hang of it:

In [None]:
neov('''
MATCH (a:Airport {code: "LAS"})
MATCH (b:Airport {code: "ABQ"})
MERGE 
    (a) -[connection:
        CONNECTED_TO { 
            airline: "WN", 
            flightNumber: "500", 
            date: "2008-1-3"}]
    ->(b)
ON CREATE SET connection.departure = 1450, connection.arrival = 1710
''')

Try running the query multiple times. The relationship will only be created once.
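The difference between `CREATE` and `MERGE` can be mimicked in plain Python. The following is only a toy model, assuming an in-memory list in place of the graph and a key that leaves out the two endpoint airports. The names `create_connection` and `merge_connection` are made up for this illustration and are not part of any Neo4j API.

```python
# Toy model: a list plays the role of the relationships stored in the graph.
connections = []

def create_connection(key, **properties):
    """Like CREATE: always adds a new relationship, even a duplicate."""
    connections.append({"key": key, **properties})

def merge_connection(key, **on_create):
    """Like MERGE ... ON CREATE SET: add only if no match on the key exists."""
    for connection in connections:
        if connection["key"] == key:
            return connection  # already there, nothing changes
    connection = {"key": key, **on_create}
    connections.append(connection)
    return connection

key = ("WN", "500", "2008-1-3")  # airline, flightNumber, date
for _ in range(3):
    merge_connection(key, departure=1450, arrival=1710)
print(len(connections))  # 1

for _ in range(3):
    create_connection(key, departure=1450, arrival=1710)
print(len(connections))  # 4
```

Repeating the merge leaves a single entry, while every repeated create adds another duplicate.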


### Find All the Connections leaving an Airport

We can now find any connections leaving LAS:

In [None]:
neov('''
MATCH  
    (las:Airport {code: "LAS"})-[connection:CONNECTED_TO]->(dest:Airport)
RETURN connection
''')

### Exploring data with LOAD CSV

While we are working out the appropriate model for our dataset it is much easier to work with a subset of the data so that we can iterate quickly. A smaller dataset containing the first 1,000 connections lives in `flights_1k.csv`, see https://github.com/neo4j-contrib/training/tree/master/modeling/data.

We can run the following query to see what data we’ve got to work with:

In [None]:
neov('''
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_1k.csv" AS row
RETURN row
LIMIT 5
''')

This query:

  * loads the file `flights_1k.csv`
  * iterates over the file, referring to each line as the variable row
  * and returns the first 5 lines in the file

We have got lots of different fields but the ones that will be helpful for answering our question are: `Origin`, `Dest`, and `FlightNum`.


### Importing connections and airports

Run the following query to create nodes and relationships for these connections:

In [None]:
neov('''
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_1k.csv" 
    AS row
MERGE (origin:Airport {code: row.Origin})
MERGE (destination:Airport {code: row.Dest})
MERGE (origin)
    -[connection:CONNECTED_TO {
      airline: row.UniqueCarrier,
      flightNumber: row.FlightNum,
      date: toInteger(row.Year) 
          + "-" + toInteger(row.Month) 
          + "-" + toInteger(row.DayofMonth)}
    ]->
    (destination)
ON CREATE SET 
    connection.departure = toInteger(row.CRSDepTime), 
    connection.arrival = toInteger(row.CRSArrTime)
''')

This query:

  * iterates through each row in the file
  * creates nodes with the Airport label for the origin and destination airports if they don’t already exist
  * creates a connection relationship between origin and destination airports for each row in the file

By default properties will be stored as strings. We know that year, month, and day are actually numeric values so we will coerce them using the toInteger function.
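The effect of this coercion on the `date` property can be seen with a small Python analogue. This is a sketch only: `build_date` is a hypothetical helper and the sample row is invented, using the column names from the query above.

```python
def build_date(row):
    """Mimic toInteger(Year) + "-" + toInteger(Month) + "-" + toInteger(DayofMonth)."""
    year, month, day = (int(row[name]) for name in ("Year", "Month", "DayofMonth"))
    return f"{year}-{month}-{day}"

row = {"Year": "2008", "Month": "01", "DayofMonth": "3"}  # invented sample row
print(build_date(row))  # 2008-1-3
```

Because the integers are joined without zero padding, the queries in this notebook match on `"2008-1-3"` rather than `"2008-01-03"`.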

Now we are ready to start querying the data.

### Finding the most popular airports

We can see some of what we have imported by writing the following query, which finds the airports with the most outgoing connections.

This query:

  * finds every node with the `Airport` label
  * finds all outgoing `CONNECTED_TO` relationships
  * counts them up grouped by airport
  * returns the `Airport` nodes and the `outgoing` count in descending order by `outgoing`
  * limits the number of airports returned to `5`

In [None]:
neov('''
MATCH (a:Airport)-[:CONNECTED_TO]->()
RETURN a, COUNT(*) AS outgoing
ORDER BY outgoing DESC
LIMIT 5
''')
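What the aggregation does can be mirrored in a few lines of Python. This is a sketch on an invented list of (origin, destination) pairs, not on the imported data.

```python
from collections import Counter

# Invented sample connections as (origin, destination) pairs.
connections = [("LAS", "LAX"), ("LAS", "ABQ"), ("LAX", "LAS"), ("LAS", "MDW")]

# Group by origin airport and count, like RETURN a, COUNT(*) AS outgoing.
outgoing = Counter(origin for origin, _ in connections)

# Sort descending and keep the top entries, like ORDER BY ... DESC LIMIT.
print(outgoing.most_common(2))  # [('LAS', 3), ('LAX', 1)]
```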

### Exercise: Finding connections

Now it is your turn! Try and write queries to answer the following questions:

  * Find the airports that have the most incoming connections
  * Find all the connections into Las Vegas (LAS)
  * Find all the connections from Las Vegas (LAS) to Los Angeles (LAX)

**Hint:** Refer to the Cypher refcard (http://neo4j.com/docs/stable/cypher-refcard/) for Cypher Syntax.


# Break

## Refactoring and Profiling


### Finding specific connections


The model has worked well so far. We have been able to find the popular airports and find the connections between pairs of airports without much trouble.

What about if we want to find all the occurrences of a specific connection?

> As an air travel enthusiast
>
> I want to know the schedule for a given flight number
>
> So that I know when I will be able to spot those planes taking off and landing


Our next query finds all the instances of connection `WN 1016`:

In [None]:
neov('''
MATCH  (origin:Airport)-[connection:CONNECTED_TO]->(destination:Airport)
WHERE connection.airline = "WN" AND connection.flightNumber = "1016"
RETURN 
    origin.code, 
    destination.code, 
    connection.date, 
    connection.departure, 
    connection.arrival
''')


It is still reasonably quick because we only have 1000 rows, but under the covers we’re actually doing a lot of unnecessary work.


We can *profile* our query by prefixing it with the `PROFILE` keyword:

> `PROFILE`
If you want to run the statement and see which operators are doing most of the work, use PROFILE. This will run your statement and keep track of how many rows pass through each operator, and how much each operator needs to interact with the storage layer to retrieve the necessary data. Please note that profiling your query uses more resources, so you should not profile unless you are actively working on a query. https://neo4j.com/docs/developer-manual/current/cypher/query-tuning/how-do-i-profile-a-query/


## Profiler
Run the following in the Neo4j browser:
```cypher
PROFILE
MATCH  (origin:Airport)-[connection:CONNECTED_TO]->(destination:Airport)
WHERE connection.airline = "WN" AND connection.flightNumber = "1016"
RETURN origin.code, destination.code, connection.date, connection.departure, connection.arrival
```

What we get back is an execution plan which describes the Cypher operators used to execute this query. You can read more about these in the developer manual (https://neo4j.com/docs/developer-manual/current/cypher/#execution-plans)

In this one the query starts with a `NodeByLabelScan` on the `:Airport` label, which means that we first scanned all the airports. Next we followed the `CONNECTED_TO` relationship to `origin` airports, and we can see from the estimated rows count that we followed 1000 of these.

In fact we actually looked at every single flight, which we can confirm by executing the following query:


In [None]:
neov('''
MATCH ()-[:CONNECTED_TO]->()
RETURN count(*)
''')

So it is clear that our model is not optimal - we are doing far too much work just to find the destinations and origins of one flight.

It is time to **refactor** the model. The following is based on `:play http://guides.neo4j.com/modeling_airports/02_flight.html`


## Refactoring

### Refactoring - Creating flights

We are now ready to introduce `Flight` nodes to our data model. That is, we want to create a data model of the following kind:


![refactored_model](http://guides.neo4j.com/modeling_airports/img/flight_first_class.png)

### Ensuring flight uniqueness

When we refactor the model we want to make sure we only create each flight once.

Neo4j allows us to create unique constraints to ensure uniqueness across a label/property pair, but at the moment we can only create constraints on single properties. We want to ensure uniqueness across several properties so we will combine those together into a single dummy property.

The combination of airline, flight number, and date makes a flight unique. As we saw in the previous section, however, some flights can have multiple legs so we will need to consider departure and arrival airports as well. We will create a flightId with this format: `{airline}{flightNumber}_{year}-{month}-{day}_{origin}_{destination}`
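A Python sketch of how such an identifier can be assembled, mirroring the string concatenation used when the `Flight` nodes are created (`flight_id` is a hypothetical helper, and the second leg in the example is invented):

```python
def flight_id(airline, flight_number, date, origin, destination):
    """Combine the identifying fields into one string usable as a unique key."""
    return f"{airline}{flight_number}_{date}_{origin}_{destination}"

print(flight_id("WN", "82", "2008-1-3", "LAS", "LAX"))  # WN82_2008-1-3_LAS_LAX

# Two legs of the same flight number on the same day get different ids.
first_leg = flight_id("WN", "82", "2008-1-3", "LAS", "LAX")
second_leg = flight_id("WN", "82", "2008-1-3", "LAX", "ABQ")
print(first_leg != second_leg)  # True
```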

Run the following query to create a unique constraint on the Flight/id label/property pair:


In [None]:
neov('''
CREATE CONSTRAINT ON (f:Flight)
ASSERT f.id IS UNIQUE
''')

## Making the flights
Run the following query to create `Flight` nodes for every `CONNECTED_TO` relationship.

The query:

  * finds all `(origin, connection, destination)` paths
  * creates a `Flight` node if one doesn’t already exist
  * creates an `ORIGIN` relationship to the origin airport and a `DESTINATION` relationship to the destination airport

In [None]:
neov('''
MATCH (origin:Airport)-[connection:CONNECTED_TO]->(destination:Airport)
MERGE (newFlight:Flight 
    { id: connection.airline 
        + connection.flightNumber 
        + "_" + connection.date 
        + "_" + origin.code 
        + "_" + destination.code }   )
ON CREATE SET newFlight.date = connection.date,
              newFlight.airline = connection.airline,
              newFlight.number = connection.flightNumber,
              newFlight.departure = connection.departure,
              newFlight.arrival = connection.arrival
MERGE (origin)<-[:ORIGIN]-(newFlight)
MERGE (newFlight)-[:DESTINATION]->(destination)
''')

In [None]:
neov('''
CALL db.schema()
''')

### Find all the flights for flight number WN 1016

First let’s create an index on `(Flight, number)` so that we can quickly find the appropriate flights.

In [None]:
neov('''
CREATE INDEX ON :Flight(number)
''')
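Conceptually, an index trades a full scan for a direct lookup. The following Python sketch illustrates the idea with a dictionary. It is an analogy under simplifying assumptions: the sample flights are invented, and a hash map stands in for whatever index structure Neo4j actually uses.

```python
from collections import defaultdict

flights = [  # invented sample data
    {"airline": "WN", "number": "1016", "date": "2008-1-3"},
    {"airline": "WN", "number": "82", "date": "2008-1-3"},
    {"airline": "WN", "number": "1016", "date": "2008-1-4"},
]

def scan(number):
    """Without an index: inspect every flight."""
    return [f for f in flights if f["number"] == number]

# With an index: build the lookup table once, then jump straight to the matches.
by_number = defaultdict(list)
for f in flights:
    by_number[f["number"]].append(f)

print(len(scan("1016")), len(by_number["1016"]))  # 2 2
```

Both approaches return the same flights; the difference is how many entries have to be inspected per lookup.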

In [None]:
neov('''
MATCH (origin)<-[:ORIGIN]-(flight:Flight)-[:DESTINATION]->(destination)
WHERE flight.airline = "WN" AND flight.number = "1016"
RETURN origin, destination, flight
''')

Before we delete the `CONNECTED_TO` relationship we should profile the two versions of the query to see whether our refactoring has improved things.


In [None]:
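# neov returns only the result rows; to see the execution plan,
# paste the query into the Neo4j browser.
neov('''
PROFILE
MATCH (origin)<-[:ORIGIN]-(flight:Flight)-[:DESTINATION]->(destination)
WHERE flight.airline = "WN" AND flight.number = "1016"
RETURN origin, destination, flight
''')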

In [None]:
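# neov returns only the result rows; to see the execution plan,
# paste the query into the Neo4j browser.
neov('''
PROFILE
MATCH (origin:Airport)-[flight:CONNECTED_TO]->(destination:Airport)
WHERE flight.airline = "WN" AND flight.flightNumber = "1016"
RETURN origin, destination, flight
''')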

# Refactoring Edges

`:play http://guides.neo4j.com/modeling_airports/03_flight_booking.html`


### Flight booking

As our system evolves, we get a new requirement to satisfy:

> As a frequent traveller
> 
> I want to find flights from `origin` to `destination` on `date`
> 
> So that I can book my business flight

Before we write queries to satisfy this requirement, let’s import some more data.

### Import more flights

We initially loaded 1000 flights. That was a fun initial dataset to play with, but now that we have got a model we are happy with let’s load in a bit more data.

`flights_10k.csv` contains 10000 flights. We can run the following query to import those flights:

In [None]:
neov('''
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_10k.csv" AS row
MERGE (origin:Airport {code: row.Origin})
MERGE (destination:Airport {code: row.Dest})
MERGE (newFlight:Flight { id: row.UniqueCarrier + row.FlightNum + "_" + row.Year + "-" + row.Month + "-" + row.DayofMonth + "_" + row.Origin + "_" + row.Dest }   )
ON CREATE SET newFlight.date = toInteger(row.Year) + "-" + toInteger(row.Month) + "-" + toInteger(row.DayofMonth),
              newFlight.airline = row.UniqueCarrier,
              newFlight.number = row.FlightNum,
              newFlight.departure = toInteger(row.CRSDepTime),
              newFlight.arrival = toInteger(row.CRSArrTime)
MERGE (newFlight)-[:ORIGIN]->(origin)
MERGE (newFlight)-[:DESTINATION]->(destination)
''')

# Finding flights
Now it is time to write a query to find available flights between two airports on a specific date.

Let’s find all the flights going from Las Vegas (LAS) to Chicago Midway International (MDW) on 3 January 2008. Run the following query:

This returns quite quickly but try prefixing it with `PROFILE`. 

What do you notice?

In [None]:
neov('''
MATCH path = (o:Airport {code: "LAS"})<-[:ORIGIN]-(flight:Flight)-[:DESTINATION]->(d:Airport {code: "MDW"})
WHERE flight.date = "2008-1-3"
RETURN path
''')
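The literal values in the query above can also be passed as Cypher parameters, which avoids editing the query text for every search. A minimal sketch, assuming the `$name` parameter syntax and the parameters argument of `session.run` in the Python driver. `find_flights_query` is a hypothetical helper and not part of this notebook's `neo`/`neov` functions.

```python
def find_flights_query(origin, destination, date):
    """Return a parameterised Cypher query and its parameter dictionary."""
    query = (
        "MATCH path = (o:Airport {code: $origin})<-[:ORIGIN]-(flight:Flight)"
        "-[:DESTINATION]->(d:Airport {code: $destination}) "
        "WHERE flight.date = $date "
        "RETURN path"
    )
    return query, {"origin": origin, "destination": destination, "date": date}

query, params = find_flights_query("LAS", "MDW", "2008-1-3")
print(params)

# With a running database and the `driver` object created earlier:
# with driver.session() as session:
#     for record in session.run(query, params):
#         print(record["path"])
```

The query text stays the same for every search; only the parameter dictionary changes.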

### Profiling the finding flights to book query

The query starts by using an index to find `MDW` but then has to traverse all incoming `DESTINATION` relationships and check the date property on the `:Flight` nodes on the other side. The more flights an airport has the more we will have to scan through, and since we are only working with 10000 flights so far we should probably find a better way to model our data before importing any more rows.

Can you think of a way that we can change our model to avoid doing all these property lookups?


One way that we can tweak our model to be more aligned with our queries is by bundling flights by day.


## Introducing Airport Day

We want to introduce `:AirportDay` nodes so that we do not have to scan through all the flights going from an airport when we are only interested in a subset of them.
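Why this helps can be illustrated outside the database. In the Python sketch below (invented sample data, hypothetical names), flights are bundled per airport and day, so a lookup for one day only touches that day's flights instead of every flight of the airport.

```python
from collections import defaultdict

flights = [  # (origin, date, flight id); invented sample data
    ("LAS", "2008-1-3", "WN1"),
    ("LAS", "2008-1-3", "WN2"),
    ("LAS", "2008-1-4", "WN3"),
    ("LAX", "2008-1-3", "WN4"),
]

# One bucket per (airport, date), analogous to one :AirportDay node.
airport_days = defaultdict(list)
for origin, date, flight in flights:
    airport_days[(origin, date)].append(flight)

print(airport_days[("LAS", "2008-1-3")])  # ['WN1', 'WN2']
```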

Try and write a query to evolve our current model to include this new concept.


![](http://guides.neo4j.com/modeling_airports/img/airport_day.png)


Before we create anything let’s put a unique constraint on :AirportDay so we don’t create any duplicates:

In [None]:
neov('''
CREATE CONSTRAINT ON (airportDay:AirportDay)
ASSERT airportDay.id IS UNIQUE
''')

We’ll use the combination of the airport code and the flight date as the unique key for an `:AirportDay`:

In [None]:
neov('''
MATCH (origin:Airport)<-[:ORIGIN]-(flight:Flight)-[:DESTINATION]->(destination:Airport)
MERGE (originAirportDay:AirportDay {id: origin.code + "_" + flight.date})
SET originAirportDay.date = flight.date
MERGE (destinationAirportDay:AirportDay {id: destination.code + "_" + flight.date})
SET destinationAirportDay.date = flight.date
MERGE (origin)-[:HAS_DAY]->(originAirportDay)
MERGE (flight)-[:ORIGIN]->(originAirportDay)

MERGE (flight)-[:DESTINATION]->(destinationAirportDay)
MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)
''')

### Find flights to book

Now let’s try finding those flights between Las Vegas and Chicago Midway International again. To recap, this was our original query:

In [None]:
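# neov returns only the result rows; to see the execution plan,
# paste the query into the Neo4j browser.
neov('''
PROFILE
MATCH path = (origin:Airport {code: "LAS"})<-[:ORIGIN]-(flight:Flight)-[:DESTINATION]->(destination:Airport {code: "MDW"})
WHERE flight.date = "2008-1-3"
RETURN path
''')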

This is the equivalent query which makes use of `:AirportDay`

In [None]:
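# neov returns only the result rows; to see the execution plan,
# paste the query into the Neo4j browser.
neov('''
PROFILE
MATCH (origin:Airport {code: "LAS"})-[:HAS_DAY]->(:AirportDay {date: "2008-1-3"})<-[:ORIGIN]-(flight:Flight),
      (flight)-[:DESTINATION]->(:AirportDay {date: "2008-1-3"})<-[:HAS_DAY]-(destination:Airport {code: "MDW"})
RETURN *
''')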

# Accessing Neo4J from Java

The following is a condensed tutorial on how to execute Cypher queries and access a Neo4J database from Java code. The guide assumes that you use Maven to manage your code dependencies.

  * Create a Maven project. In NetBeans `New Project -> Maven -> Java Application`
  * Add a dependency on the Neo4J driver to your project configuration (`pom.xml`)
    
    ```xml
    <dependencies>
        <dependency>
            <groupId>org.neo4j.driver</groupId>
            <artifactId>neo4j-java-driver</artifactId>
            <version>1.7.3</version>
        </dependency>

        <!-- ...any other dependencies -->
    </dependencies>
    ```

### Create a Java Class:
  
```java
package whopackagesajava.neo4j;

import org.neo4j.driver.v1.*;

public class ConnectionTest {

    public static void main(String[] args) {
        Driver driver = GraphDatabase.driver( 
                "bolt://localhost:7687", 
                AuthTokens.basic( "neo4j", "fancy!99Doorknob" ) );
        Session session = driver.session();

        // Run a query matching pairs of connected airports
        StatementResult result = session.run(
                "MATCH (s:Airport)-[:CONNECTED_TO]->(d:Airport) " +
                "RETURN s.code AS source, d.code AS dest " +
                "LIMIT 5");
        System.out.println( "---------------" );
        while ( result.hasNext() ) {
            Record record = result.next();
            System.out.println( 
                record.get("source").asString() + 
                " -> " + 
                record.get("dest").asString());
        }
        System.out.println( "---------------" );
        session.close();
        driver.close();
       }
}
```
    
#### Output:
```
---------------
MDW -> LAX
MCI -> LAX
LAS -> LAX
MCI -> LAX
MDW -> LAX
---------------
```

This example is based on: https://neo4j.com/developer/java/#_the_example_project

See https://neo4j.com/developer/language-guides/ for guides on using languages other than Java.

# Modelling Guidelines


`:play http://guides.neo4j.com/modeling_airports/04_specific_relationship_types.html`

See the `Modeling Guidelines.pdf` in the literature folder.


# Further reading

The following two tutorials expand on the examples further:

`:play http://guides.neo4j.com/modeling_airports/04_specific_relationship_types.html`

`:play http://guides.neo4j.com/modeling_airports/05_refactoring_large_graphs.html`
  
  
  

# [Assignment](https://github.com/datsoftlyngby/soft2019spring-databases/blob/master/assignments/assignment13.md)



# Preliminary exam info

[See here](https://github.com/datsoftlyngby/soft2019spring-databases) (Exam folder on our material site)
