Author: Kevin ALBERT  

Created: Oct 2020 (Updated: 14 Apr 2021)

TestRun: 28 Feb 2021

(helped identify a bug [#1808](https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/1808))

# Graph Database
_**How to load data and interact using Neo4j**_  

## Contents
1. [Introduction](#Introduction)  
1. [Setup](#Setup)  
● [import](#import)  
● [environment](#environment)  
● [installed](#installed)  
● [config](#config)  
● [plugins](#plugins)  
● [parameters](#parameters)  
1. [Data](#Data)  
● [definitions](#definitions)  
● [exploring](#exploring)  
● [modeling](#modeling)  
● [cleaning](#cleaning)   
1. [Import](#Import)  
1. [Querying](#Querying)  
● [Which 10 countries have the most wineries ?](#Which-10-countries-have-the-most-wineries-?)  
● [Which wineries are across multiple provinces ?](#Which-wineries-are-across-multiple-provinces-?)  
● [The top 10 most prolific wine tasters ?](#The-top-10-most-prolific-wine-tasters-?)  
● [How many wine varieties contain the word 'red' ?](#How-many-wine-varieties-contain-the-word-'red'-?)  
● [Which Year had the most Wine ?](#Which-Year-had-the-most-Wine-?)  
● [Which top 5 Winery produces the most Wine for a given Year ?](#Which-top-5-Winery-produces-the-most-Wine-for-a-given-Year-?)  
● [Show Variety linked to VarietyName](#Show-Variety-linked-to-VarietyName)  
● [Which VarietyName have the most Wine ?](#Which-VarietyName-have-the-most-Wine-?)  
● [Find the popular product flavors in this 'merlot' variety](#Find-the-popular-product-flavors-in-this-'merlot'-variety)  
● [Which wines have 100 points ?](#Which-wines-have-100-points-?)  
● [Show the most expensive wines (>1000)](#Show-the-most-expensive-wines-(>1000))  
● [Show the highest scoring wines](#Show-the-highest-scoring-wines)  
1. [Documentation](#Documentation)  
● [how to datamodel](#howto-datamodel) 
1. [Maintenance](#Maintenance)  
● [cypher](#cypher)  
● [python](#python)  
● [data loading test](#data-loading-test)  

## Introduction

This how-to uses the `Kaggle` [wines data](https://www.kaggle.com/zynicide/wine-reviews/data) and follows the `YouTube` series [adventures with wine data](https://www.youtube.com/playlist?list=PL9Hl4pk2FsvU7skL6tC-ZoSALfDQ552bI) and the `Git` [repository](https://github.com/lju-lazarevic/wine) by Lju Lazarevic.  
It combines these with other sources into one complete set of code and methods for future projects.  

## Setup

### import

In [1]:
from py2neo import Graph, Node, Relationship
import pandas as pd
from IPython.display import Javascript
from fuzzywuzzy import process, fuzz

In [2]:
import platform
import psutil
import os

In [3]:
# pd.describe_option('display')            # show all pandas options, parameters can slow down notebook
pd.set_option('display.max_colwidth', 100) # default 50, the maximum width in characters of a column
pd.set_option('display.max_columns', 40)   # default 20, the maximum amount of columns in view 
pd.set_option('display.max_rows', 60)      # default 60, the maximum amount of rows in view

### environment

In [4]:
print(f"Cores : {psutil.cpu_count(logical=True)} ({psutil.cpu_freq().current/1000:.0f}GHz)")
print(f"Memory: {psutil.virtual_memory().total/(1024**3):.2f} GB ({psutil.virtual_memory().percent}%)")
print(f"Swap  : {os.path.getsize('/swapfile')/(1024**3):.0f} GB")
disk_size = psutil.disk_usage(psutil.disk_partitions()[0].mountpoint).total
disk_used = psutil.disk_usage(psutil.disk_partitions()[0].mountpoint).percent
disk_fs   = psutil.disk_partitions()[0].fstype 
print(f"Disk  : {disk_size/(1024**3):.0f} GB ({disk_used}% {disk_fs})")
print(f"System: {platform.uname().version.split('~')[1].split()[0]}")

Cores : 2 (2GHz)
Memory: 7.78 GB (22.0%)
Swap  : 8 GB
Disk  : 145 GB (48.6% ext4)
System: 18.04.1-Ubuntu


### installed
python modules

In [5]:
conda_version = ! conda -V
print(f"conda : {conda_version[0].split()[1]}")
pip_version = ! pip -V
print(f"pip   : {pip_version[0].split()[1]}")
python_version = ! python -V
print(f"python: {python_version[0].split()[1]}")
pandas_version = ! pip list |grep -i pandas
print(f"pandas: {pandas_version[0].split()[1]}")
py2neo_version = ! pip list |grep -i py2neo
print(f"py2neo: {py2neo_version[0].split()[1]}")
fuzzywuzzy_version = ! pip list |grep -i fuzzywuzzy
print(f"fuzzywuzzy: {fuzzywuzzy_version[0].split()[1]}")

conda : 4.9.2
pip   : 21.0.1
python: 3.8.8
pandas: 1.2.2
py2neo: 4.2.0
fuzzywuzzy: 0.18.0


### config

In [6]:
! sudo cat ../../neo4j/conf/neo4j.conf


dbms.default_listen_address=0.0.0.0


neo4j.bloom.license_file=/plugins/bloom-plugin.license
neo4j.bloom.authorization_role=admin,architect
dbms.unmanaged_extension_classes=com.neo4j.bloom.server=/browser/bloom
dbms.tx_log.rotation.retention_policy=100M size
dbms.security.procedures.whitelist=apoc.*,gds.*
dbms.security.procedures.unrestricted=apoc.*,gds.*,bloom.*
dbms.memory.pagecache.size=2G
dbms.memory.heap.max_size=2G
dbms.directories.plugins=/plugins
dbms.directories.logs=/logs
dbms.directories.import=/import
causal_clustering.transaction_advertised_address=9ee153507f17:6000
causal_clustering.raft_advertised_address=9ee153507f17:7000
causal_clustering.discovery_advertised_address=9ee153507f17:5000
apoc.import.file.enabled=true


### plugins

In [7]:
! sudo ls -l ../../neo4j/plugins

total 72108
-rw-rw-r-- 1 ubuntu ubuntu 18542753 Feb 28 10:33 apoc-4.1.0.6-all.jar
-rw-r--r-- 1 root   root    7809568 Jan 26 14:35 apoc-couchbase-dependencies-4.1.0.6.jar
-rw-r--r-- 1 root   root     709133 Jan 26 14:35 apoc-email-dependencies-4.1.0.6.jar
-rw-r--r-- 1 root   root    1483695 Jan 26 14:35 apoc-mongodb-dependencies-4.1.0.6.jar
-rw-rw-r-- 1 ubuntu ubuntu 10848418 Feb 28 09:58 apoc-nlp-dependencies-4.1.0.6.jar
-rw-r--r-- 1 root   root   13956779 Jan 26 14:35 apoc-xls-dependencies-4.1.0.6.jar
-rw-r--r-- 1 ubuntu root   11192049 Jan  5 17:55 bloom-plugin-4.x-1.5.0.jar
-rw-r--r-- 1 ubuntu root         84 Feb 28 08:59 bloom-plugin.license
-rw-r--r-- 1 ubuntu root    9269836 Jan  5 17:55 neo4j-bloom-1.5.0-assets.zip


### parameters

In [8]:
server  = "40.127.98.81"
port    = "7687"
user    = "neo4j"
passw   = "digityser"
db_name = "neo4j"        # default name (v4.x)

In [9]:
# load graph connection instance
graph = Graph(host=server, auth=(user, passw), name=db_name, encrypted=False)

In [10]:
# open neo4j browser
display(Javascript('window.open("{url}");'.format(url="http://"+server+":7474")))

<IPython.core.display.Javascript object>
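
The connection parameters above are written directly into the notebook. As a minimal sketch, assuming the notebook is shared with others, the same values could be read from environment variables instead. The `NEO4J_*` variable names and the `load_neo4j_params` helper are hypothetical, chosen only for this illustration.

```python
import os

def load_neo4j_params(env=None):
    """Collect connection settings from environment variables.

    The NEO4J_* names are hypothetical; adapt them to your own setup.
    """
    env = os.environ if env is None else env
    return {
        "server": env.get("NEO4J_HOST", "localhost"),
        "port": env.get("NEO4J_PORT", "7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "passw": env.get("NEO4J_PASSWORD", ""),
        "db_name": env.get("NEO4J_DATABASE", "neo4j"),  # default name (v4.x)
    }

# a plain dict stands in for os.environ so the sketch runs anywhere
params = load_neo4j_params({"NEO4J_HOST": "127.0.0.1", "NEO4J_PASSWORD": "secret"})
browser_url = "http://" + params["server"] + ":7474"
print(browser_url)
```

The returned values can be passed to `Graph(...)` in the same way as the variables in the cells above.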

## Data
**[download](https://www.kaggle.com/zynicide/wine-reviews/data?select=winemag-data_first150k.csv)** original dataset from kaggle `winemag-data_first150k.csv`  
**[download](https://www.kaggle.com/zynicide/wine-reviews/data)** updated dataset from kaggle `winemag-data-130k-v2.csv`  
**[download](https://raw.githubusercontent.com/lju-lazarevic/wine/tree/master/data/winemag-data-130k-v2.csv)** updated dataset from github `winemag-data-130k-v2.csv`  
**[download](https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv)** cleaned dataset from github `winemag-data-130k-v3.csv`  

In [11]:
# load cleaned dataset (v3) data (URL)
dataset = "https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv"
prep = pd.read_csv(dataset)

### definitions

|Header|Sample|Description|
|-|-|-|
|id|14952|unique id of each record|
|country|France|the country that the wine is from|
|description|This is the latest wine from the popular Whispering Angel rosé|description of the tasting features of the wine|
|designation|Rock Angel|vineyard within the winery. <br> a winery may have more than one vineyard|
|points|91|score the taster gave the wine, on a scale of 1 to 100|
|price|35.0|the cost of a bottle of the wine in dollars ($)|
|province|Provence|the province or state that the wine is from|
|region_1|Côtes de Provence|the wine growing area in a province or state|
|region_2|NaN|more specific region within a growing area|
|taster_name|Roger Voss|name of the taster|
|taster_twitter_handle|@vossroger|taster's twitter handle|
|title|Château d'Esclans 2016 Rock Angel Rosé (Côtes de Provence)|the title of the wine review|
|variety|Rosé|the type of grapes used to make the wine|
|winery|Château d'Esclans|the winery that made the wine|


### exploring
Generate an interactive and static report about the data.

In [12]:
# concise summary information
prep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119988 entries, 0 to 119987
Data columns (total 14 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   id                     119988 non-null  int64  
 1   country                119929 non-null  object 
 2   description            119988 non-null  object 
 3   designation            85443 non-null   object 
 4   points                 119988 non-null  int64  
 5   price                  111593 non-null  float64
 6   province               119929 non-null  object 
 7   region_1               100428 non-null  object 
 8   region_2               46769 non-null   object 
 9   taster_name            95071 non-null   object 
 10  taster_twitter_handle  90542 non-null   object 
 11  title                  119988 non-null  object 
 12  variety                119987 non-null  object 
 13  winery                 119988 non-null  object 
dtypes: float64(1), int64(2), object(11)


In [13]:
# dimensionality (rows, columns)
prep.shape

(119988, 14)

In [14]:
# count distinct (=unique) observations (+ missing), sorted (high cardinality > 390)
prep.nunique(dropna=False).sort_values(ascending=False)

id                       119988
description              119955
title                    118840
designation               37977
winery                    16757
region_1                   1230
variety                     708
province                    426
price                       391
country                      44
points                       21
taster_name                  20
region_2                     18
taster_twitter_handle        16
dtype: int64

In [15]:
# count missing values, sorted (high missing > 5%)
prep.isnull().apply(lambda x: x.sum() * 100 / len(prep)).round(1).sort_values(ascending=True)

id                        0.0
country                   0.0
description               0.0
points                    0.0
province                  0.0
title                     0.0
variety                   0.0
winery                    0.0
price                     7.0
region_1                 16.3
taster_name              20.8
taster_twitter_handle    24.5
designation              28.8
region_2                 61.0
dtype: float64

### modeling
**suggestions:**  
 * high-cardinality columns are candidates for node Properties, because keeping them as Properties makes the graph less complex
 * low-cardinality columns are candidates for Nodes, because a query on a Property has to check every Node, while a shared Node needs fewer lookups and is faster
 * Neo4j does not store null property values, so missing values have to be handled before loading
   * columns with many missing values get a placeholder, e.g. `df.fillna('No Country')`
   ```python
   prep['country'] = prep['country'].fillna('No Country')
   ```
   * columns with few missing values have those rows excluded, e.g. `WHERE price <> ""`
   ```cypher
   USING PERIODIC COMMIT 1000
   LOAD CSV WITH HEADERS FROM 'file:///winedata.csv' AS line FIELDTERMINATOR ','
   WITH line.price AS price, line.title AS wine WHERE price <> ""
   ```
 * the datamodel is designed around the questions you are going to answer
 * Nodes are the subject entities
 * Properties are metadata
 * Relationships are the verbs; values that occur in the same record are connected
 * use MATCH for the Node the arrow leaves, then MERGE for the Node where the arrow arrives
 * save the cleaned dataset to `neo4j/import` in CSV format
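
The cardinality and missing-value rules of thumb above can be sketched as a small check. The numbers are copied from the exploring output, and the two thresholds come from the comments in those cells (`> 390` and `> 5%`). The `flag` helper is hypothetical and only reports the flags; the final model still depends on the questions to answer, which is why some high-cardinality columns such as `winery` end up as nodes anyway.

```python
# Distinct counts and missing percentages copied from the exploring output.
profile = {
    # column: (distinct values incl. missing, missing %)
    "designation": (37977, 28.8),
    "winery": (16757, 0.0),
    "province": (426, 0.0),
    "price": (391, 7.0),
    "country": (44, 0.0),
    "taster_name": (20, 20.8),
    "region_2": (18, 61.0),
}

HIGH_CARDINALITY = 390   # threshold from the nunique() comment
HIGH_MISSING_PCT = 5.0   # threshold from the isnull() comment

def flag(column):
    """Return the two rule-of-thumb flags for one column."""
    distinct, missing_pct = profile[column]
    return {
        "high_cardinality": distinct > HIGH_CARDINALITY,
        "high_missing": missing_pct > HIGH_MISSING_PCT,
    }

for column in profile:
    print(column, flag(column))
```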
 

### cleaning

In [16]:
%%time
# replace nan
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')
prep['designation'] = prep['designation'].fillna('No Designation')
prep['taster_name'] = prep['taster_name'].fillna('No Taster')
prep['variety'] = prep['variety'].fillna('No Variety')
prep['title'] = prep['title'].fillna('No Title')

# extract the first year in the title; the pattern matches 1970-1999, 2000-2019 and 2100-2119
prep['year'] = prep['title'].str.extract(r"(([2][0-1][0-1][0-9])|([1][9][7-9][0-9]))")[0]
prep['year'] = prep['year'].fillna('No Year')

# remove 4-digit runs, then collapse repeated spaces
prep['wine_group'] = prep['title'].str.replace(r"(\d{4})", '', regex=True) # remove any 4 consecutive digits
prep['wine_group'] = prep['wine_group'].str.replace(r"([ ]{2,})", ' ', regex=True) # replace 2 or more spaces with 1 space

# tokenize variety_name on dashes or spaces
prep['variety_name'] = prep['variety'].str.lower().str.split('[ ]|[-]')

# clean up: 
#  * different spellings for the same grape, e.g. Aragonez and Aragonês
#  * different names for the same grape, e.g. Syrah and Shiraz
#  * different ordering of wine blends, e.g. Cabernet-Shiraz and Shiraz-Cabernet
list_of_strings = prep['variety'].unique().tolist()
# look for the best match, return the single most similar string with its score
score_set = [(one_string,) + i
             for one_string in list_of_strings
             for i in process.extract(one_string, list_of_strings, scorer=fuzz.partial_token_sort_ratio, limit=1)]
oldstr_newstr_map = {oldstr: newstr for oldstr, newstr, score in score_set}
# overwrite variety, substitute each value
prep["variety"] = prep['variety'].map(oldstr_newstr_map)

# prep['price'] = prep['price'].fillna('No Price') # do not fill price with a string, keep the column numeric

# save file to /import
file = "winedata.csv"
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)



CPU times: user 9.62 s, sys: 89.6 ms, total: 9.71 s
Wall time: 10 s
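
The year and wine-group steps can be tried on a single title with the standard `re` module. This is a sketch that reuses the same patterns as the cell above on the sample record shown in the import section; the `extract_year` and `wine_group` helper names and the second title are made up for illustration.

```python
import re

# same pattern as in the cleaning cell
YEAR_PATTERN = re.compile(r"(([2][0-1][0-1][0-9])|([1][9][7-9][0-9]))")

def extract_year(title):
    """Return the first matching year, or the 'No Year' placeholder."""
    match = YEAR_PATTERN.search(title)
    return match.group(0) if match else "No Year"

def wine_group(title):
    """Drop 4-digit runs, then collapse repeated spaces."""
    without_digits = re.sub(r"(\d{4})", "", title)
    return re.sub(r"([ ]{2,})", " ", without_digits)

title = "Pali 2014 Radian Vineyard Pinot Noir (Sta. Rita Hills)"
print(extract_year(title))
print(wine_group(title))

# hypothetical title: the third digit must be 0 or 1, so 2020 is not matched
print(extract_year("Example Winery 2020 Red"))
```

Vintages from 2020 through 2099 fall outside the pattern and end up as `No Year`, which is worth keeping in mind when the dataset is refreshed.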


In [17]:
%%time
# load data captured using Azure cognitive services
entities = pd.read_parquet("../../data/bronze/winegraph/description_entities.parquet")

# clean
entities = entities.drop(columns=['entities', 'entity_confidence_score'])
entities = entities[entities['entity_category'].isin(['Event', 'Location', 'Product'])]

# save file to /import
file2 = "description_entities.csv"
entities.to_csv("../../neo4j/import/"+file2, sep=',', index=False)

CPU times: user 3.22 s, sys: 400 ms, total: 3.62 s
Wall time: 3.53 s


## Import
 1. create datamodel
 1. create index
 1. create nodes
 1. create relationships

[link to Arrows for data modeling](http://www.apcjones.com/arrows/#)  
[arrows source code](../../docs/datagraph/howto_graph_datamodel1.markup)  

![datamodel1](../../image/howto_graph/datamodel1.png)

In [18]:
prep.iloc[[117671]][["id", "winery", "province", "country"]]

Unnamed: 0,id,winery,province,country
117671,127202,Pali,California,US


In [19]:
%%time
# graph.run("""CREATE INDEX ON :Winery(name)""")
# graph.run("""CREATE INDEX ON :Province(name)""")
# graph.run("""CREATE INDEX ON :Country(name)""")
graph.run("CREATE CONSTRAINT ON (w:Winery) ASSERT w.name IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (p:Province) ASSERT p.name IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (c:Country) ASSERT c.name IS UNIQUE")
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (w:Winery {name:(line.winery)})
MERGE (p:Province {name:(line.province)})
MERGE (c:Country {name:(line.country)})
MERGE (w)-[:FROM_PROVENCE]->(p)
MERGE (p)-[:PROVINCE_COUNTRY]->(c)
"""
display(graph.run(query).stats())

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 17227
labels_removed: 0
nodes_created: 17227
nodes_deleted: 0
properties_set: 17227
relationships_created: 19481
relationships_deleted: 0

CPU times: user 7.25 ms, sys: 0 ns, total: 7.25 ms
Wall time: 19.4 s
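
Every import cell repeats the same two header lines. As a sketch, a small helper could assemble the query text; the name `load_csv_query` is hypothetical and not part of py2neo. The resulting string would still be passed to `graph.run` as in the cells of this section.

```python
def load_csv_query(csv_file, body, batch_size=1000):
    """Wrap a Cypher body in the LOAD CSV header used by the import cells."""
    header = (
        f"USING PERIODIC COMMIT {batch_size}\n"
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_file}' AS line FIELDTERMINATOR ','\n"
    )
    return header + body.strip() + "\n"

query = load_csv_query("winedata.csv", """
MERGE (w:Winery {name:(line.winery)})
MERGE (p:Province {name:(line.province)})
MERGE (w)-[:FROM_PROVENCE]->(p)
""")
print(query)
```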


[link to Arrows for data modeling](http://www.apcjones.com/arrows/#)  
[arrows source code](../../docs/datagraph/howto_graph_datamodel2.markup)  

![datamodel2](../../image/howto_graph/datamodel2.png)

In [20]:
prep.iloc[[117671]][["id", "title", "winery", "designation", "country"]]

Unnamed: 0,id,title,winery,designation,country
117671,127202,Pali 2014 Radian Vineyard Pinot Noir (Sta. Rita Hills),Pali,Radian Vineyard,US


In [21]:
%%time
# graph.run("""CREATE INDEX ON :Wine(id)""")
# graph.run("""CREATE INDEX ON :Designation(name)""")
graph.run("CREATE CONSTRAINT ON (w:Wine) ASSERT w.id IS UNIQUE")
# graph.run("CREATE CONSTRAINT ON (t:Taster) ASSERT t.name IS UNIQUE")
# graph.run("CREATE CONSTRAINT ON (v:Variety) ASSERT v.name IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (d:Designation) ASSERT d.name IS UNIQUE")
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (winery:Winery {name:(line.winery)})
MERGE (d:Designation {name:(line.designation)})
MERGE (wine:Wine {id:(line.id), name:(line.title)})
MERGE (wine)-[:FROM_WINERY]->(winery)
MERGE (wine)-[:HAS_DESIGNATION]->(d)
"""
display(graph.run(query).stats())

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 157965
labels_removed: 0
nodes_created: 157965
nodes_deleted: 0
properties_set: 277953
relationships_created: 239976
relationships_deleted: 0

CPU times: user 0 ns, sys: 5.23 ms, total: 5.23 ms
Wall time: 22.7 s
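
The statistics returned by the first two imports can be cross-checked against the distinct counts from the exploring section. The arithmetic below is a sanity check only, assuming one node per distinct value and one `Wine` per row.

```python
# Distinct counts taken from the nunique() output in the exploring section.
distinct = {"winery": 16757, "province": 426, "country": 44, "designation": 37977}
rows = 119988

# First import: one node (with one name property) per distinct winery, province and country.
expected_nodes_1 = distinct["winery"] + distinct["province"] + distinct["country"]

# Second import: one Wine per row (id is unique) plus one node per distinct designation.
expected_nodes_2 = rows + distinct["designation"]
# Wine carries id and name, Designation carries name.
expected_props_2 = 2 * rows + distinct["designation"]
# Each row adds FROM_WINERY and HAS_DESIGNATION.
expected_rels_2 = 2 * rows

print(expected_nodes_1, expected_nodes_2, expected_props_2, expected_rels_2)
```

These values agree with `nodes_created`, `properties_set` and `relationships_created` in the two outputs above.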


[link to Arrows for data modeling](http://www.apcjones.com/arrows/#)  
[arrows source code](../../docs/datagraph/howto_graph_datamodel3.markup)  

![datamodel3](../../image/howto_graph/datamodel3.png)

In [22]:
prep.iloc[[117671]][["id", "title", "year", "wine_group"]]

Unnamed: 0,id,title,year,wine_group
117671,127202,Pali 2014 Radian Vineyard Pinot Noir (Sta. Rita Hills),2014,Pali Radian Vineyard Pinot Noir (Sta. Rita Hills)


In [23]:
%%time
# graph.run("""CREATE INDEX ON :Year(value)""")
# graph.run("""CREATE INDEX ON :WineGroup(name)""")
graph.run("CREATE CONSTRAINT ON (y:Year) ASSERT y.value IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (wg:WineGroup) ASSERT wg.name IS UNIQUE")
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w:Wine {id:(line.id)})
MERGE (y:Year {value:(line.year)})
MERGE (wg:WineGroup {name:(line.wine_group)})
MERGE (w)-[:FROM_YEAR]->(y)
MERGE (w)-[:IN_WINE_GROUP]->(wg)
"""
display(graph.run(query).stats())

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 81730
labels_removed: 0
nodes_created: 81730
nodes_deleted: 0
properties_set: 81730
relationships_created: 239976
relationships_deleted: 0

CPU times: user 4.18 ms, sys: 10 µs, total: 4.19 ms
Wall time: 13.8 s


[link to Arrows for data modeling](http://www.apcjones.com/arrows/#)  
[arrows source code](../../docs/datagraph/howto_graph_datamodel4.markup)  

![datamodel4](../../image/howto_graph/datamodel4.png)

In [24]:
prep.iloc[[117671]][["id", "title", "variety", "variety_name"]]

Unnamed: 0,id,title,variety,variety_name
117671,127202,Pali 2014 Radian Vineyard Pinot Noir (Sta. Rita Hills),Pinot Noir,"[pinot, noir]"


In [25]:
%%time
# create wine, variety and relationships
# graph.run("""CREATE INDEX ON :Variety(name)""")
# graph.run("""CREATE INDEX ON :VarietyName(name)""")
graph.run("CREATE CONSTRAINT ON (v:Variety) ASSERT v.name IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (vn:VarietyName) ASSERT vn.name IS UNIQUE")
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w:Wine {id:(line.id)})
UNWIND apoc.text.split(line.variety_name, ",") AS varnamelist
FOREACH (varname IN varnamelist|
 MERGE (v:Variety {name:(line.variety)})
 MERGE (vn:VarietyName {name:apoc.text.clean(varname)})
 MERGE (w)-[:HAS_VARIETY]->(v)
 MERGE (vn)-[:IS_COMPONENT_OF]->(v)
 )
"""
display(graph.run(query).stats())

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 1016
labels_removed: 0
nodes_created: 1016
nodes_deleted: 0
properties_set: 1016
relationships_created: 120807
relationships_deleted: 0

CPU times: user 240 µs, sys: 3.77 ms, total: 4.01 ms
Wall time: 16 s
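
The `variety_name` column holds a Python list, which is written to the CSV as its string form. The sketch below shows why the Cypher above has to split on `,` and clean each part. `approx_apoc_clean` is a hypothetical stand-in that only approximates `apoc.text.clean` (keep letters and digits, lowercase); it is not the APOC implementation.

```python
import re

def tokenize_variety(variety):
    """Same split as the cleaning cell: lowercase, then split on spaces or dashes."""
    return re.split("[ ]|[-]", variety.lower())

def approx_apoc_clean(text):
    """Approximation of apoc.text.clean: keep only alphanumerics, lowercased."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

tokens = tokenize_variety("Pinot Noir")
csv_field = str(tokens)  # how a list cell is serialized in the CSV
recovered = [approx_apoc_clean(part) for part in csv_field.split(",")]

print(csv_field)
print(recovered)
```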


[link to Arrows for data modeling](http://www.apcjones.com/arrows/#)  
[arrows source code](../../docs/datagraph/howto_graph_datamodel5.markup)  

![datamodel5](../../image/howto_graph/datamodel5.png)

In [26]:
prep.iloc[[117671]][["id", "title", "description"]]

Unnamed: 0,id,title,description
117671,127202,Pali 2014 Radian Vineyard Pinot Noir (Sta. Rita Hills),"From one of the most impressive vineyards located at the region's far-western edge, this bottlin..."


In [27]:
%%time
# graph.run("""CREATE INDEX ON :Description(value)""")
graph.run("CREATE CONSTRAINT ON (d:Description) ASSERT d.value IS UNIQUE")
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w:Wine {id:(line.id)})
MERGE (d:Description {value:(line.description)})
MERGE (w)-[:HAS_DESCRIPTION]->(d)
"""
display(graph.run(query).stats())

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 119955
labels_removed: 0
nodes_created: 119955
nodes_deleted: 0
properties_set: 119955
relationships_created: 119988
relationships_deleted: 0

CPU times: user 4.02 ms, sys: 1 µs, total: 4.02 ms
Wall time: 13.5 s


[link to Arrows for data modeling](http://www.apcjones.com/arrows/#)  
[arrows source code](../../docs/datagraph/howto_graph_datamodel6.markup)  

![datamodel6](../../image/howto_graph/datamodel6.png)

In [28]:
print(prep.iloc[[117671]]["description"])
# extracted entities from description with cognitive textanalysis
entities[entities["document"].str.contains("impressive vineyards")][["document", "entity_text", "entity_category"]]

117671    From one of the most impressive vineyards located at the region's far-western edge, this bottlin...
Name: description, dtype: object


Unnamed: 0,document,entity_text,entity_category
489244,"From one of the most impressive vineyards located at the region's far-western edge, this bottlin...",vineyards,Location
489246,"From one of the most impressive vineyards located at the region's far-western edge, this bottlin...",pomegranate,Product
489247,"From one of the most impressive vineyards located at the region's far-western edge, this bottlin...",dried fennel,Product
489248,"From one of the most impressive vineyards located at the region's far-western edge, this bottlin...",mint,Product
489249,"From one of the most impressive vineyards located at the region's far-western edge, this bottlin...",light tobacco,Product
489250,"From one of the most impressive vineyards located at the region's far-western edge, this bottlin...",sesame,Product
489251,"From one of the most impressive vineyards located at the region's far-western edge, this bottlin...",thyme,Product
489252,"From one of the most impressive vineyards located at the region's far-western edge, this bottlin...",minty eucalyptus flavor,Product
489253,"From one of the most impressive vineyards located at the region's far-western edge, this bottlin...",candied cherries,Product
489254,"From one of the most impressive vineyards located at the region's far-western edge, this bottlin...",fruit,Product


In [29]:
%%time
# graph.run("""CREATE INDEX ON :Event(value)""")
# graph.run("""CREATE INDEX ON :Product(value)""")
# graph.run("""CREATE INDEX ON :Location(value)""")
graph.run("CREATE CONSTRAINT ON (e:Event) ASSERT e.value IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (p:Product) ASSERT p.value IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (l:Location) ASSERT l.value IS UNIQUE")
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file2+"""' AS line FIELDTERMINATOR ','
WITH line WHERE line.entity_category = 'Event'
MATCH (description:Description {value:(line.document)})
MERGE (event:Event {value:(line.entity_text)})
MERGE (event)-[:EVENT_IN]->(description)
"""
display(graph.run(query).stats())
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file2+"""' AS line FIELDTERMINATOR ','
WITH line WHERE line.entity_category = 'Location'
MATCH (description:Description {value:(line.document)})
MERGE (location:Location {value:(line.entity_text)})
MERGE (location)-[:LOCATION_IN]->(description)
"""
display(graph.run(query).stats())
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file2+"""' AS line FIELDTERMINATOR ','
WITH line WHERE line.entity_category = 'Product'
MATCH (description:Description {value:(line.document)})
MERGE (product:Product {value:(line.entity_text)})
MERGE (product)-[:PRODUCT_IN]->(description)
"""
display(graph.run(query).stats())

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 95
labels_removed: 0
nodes_created: 95
nodes_deleted: 0
properties_set: 95
relationships_created: 149
relationships_deleted: 0

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 6315
labels_removed: 0
nodes_created: 6315
nodes_deleted: 0
properties_set: 6315
relationships_created: 26215
relationships_deleted: 0

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 43903
labels_removed: 0
nodes_created: 43903
nodes_deleted: 0
properties_set: 43903
relationships_created: 476509
relationships_deleted: 0

CPU times: user 12.1 ms, sys: 493 µs, total: 12.6 ms
Wall time: 49.2 s
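
The three entity queries differ only in the category name. Labels and relationship types cannot be passed as Cypher parameters, so the text has to be built per category; a sketch with a hypothetical `entity_query` helper could look like this. Building the string is acceptable here only because the categories come from a fixed tuple, not from user input.

```python
CATEGORIES = ("Event", "Location", "Product")

def entity_query(category, csv_file="description_entities.csv"):
    """Build the import query for one entity category."""
    relationship = category.upper() + "_IN"
    return (
        "USING PERIODIC COMMIT 1000\n"
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_file}' AS line FIELDTERMINATOR ','\n"
        f"WITH line WHERE line.entity_category = '{category}'\n"
        "MATCH (description:Description {value:(line.document)})\n"
        f"MERGE (entity:{category} {{value:(line.entity_text)}})\n"
        f"MERGE (entity)-[:{relationship}]->(description)\n"
    )

queries = [entity_query(category) for category in CATEGORIES]
print(queries[0])
```

Each string in `queries` would then be run with `graph.run(...)` as in the cell above.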


[link to Arrows for data modeling](http://www.apcjones.com/arrows/#)  
[arrows source code](../../docs/datagraph/howto_graph_datamodel7.markup)  

![datamodel7](../../image/howto_graph/datamodel7.png)  

In [30]:
prep.iloc[[117671]][["id", "title", "taster_name", "points"]]

Unnamed: 0,id,title,taster_name,points
117671,127202,Pali 2014 Radian Vineyard Pinot Noir (Sta. Rita Hills),Matt Kettmann,94


In [31]:
%%time
graph.run("CREATE CONSTRAINT ON (t:Taster) ASSERT t.name IS UNIQUE")
graph.run("CREATE CONSTRAINT ON (p:Points) ASSERT p.value IS UNIQUE")
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w:Wine {id:(line.id)})
MERGE (t:Taster {name:(line.taster_name)})
MERGE (p:Points {value:toInteger(line.points)})
MERGE (w)-[:TASTED_BY]->(t)
MERGE (t)-[:GAVE_POINTS]->(p)
MERGE (w)-[:HAS_POINTS]->(p)
"""
display(graph.run(query).stats())

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 41
labels_removed: 0
nodes_created: 41
nodes_deleted: 0
properties_set: 41
relationships_created: 240302
relationships_deleted: 0

CPU times: user 370 µs, sys: 3.79 ms, total: 4.16 ms
Wall time: 11.4 s


[link to Arrows for data modeling](http://www.apcjones.com/arrows/#)  
[arrows source code](../../docs/datagraph/howto_graph_datamodel8.markup)  

![datamodel8](../../image/howto_graph/datamodel8.png)

In [32]:
prep.iloc[[117671]][["id", "title", "price"]]

Unnamed: 0,id,title,price
117671,127202,Pali 2014 Radian Vineyard Pinot Noir (Sta. Rita Hills),58.0


In [33]:
%%time
# graph.run("""CREATE INDEX ON :Price(value)""")
graph.run("CREATE CONSTRAINT ON (p:Price) ASSERT p.value IS UNIQUE")
# create Price Nodes and Relationships, ignore nan (load as int)
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
WITH line.id AS id, line.price AS price, line.title AS wine WHERE price <> ""
MATCH (w:Wine {id:id})
MERGE (p:Price {value:toInteger(price)})
MERGE (w)-[:HAS_PRICE]->(p)
"""
display(graph.run(query).stats())

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 390
labels_removed: 0
nodes_created: 390
nodes_deleted: 0
properties_set: 390
relationships_created: 111593
relationships_deleted: 0

CPU times: user 3.51 ms, sys: 0 ns, total: 3.51 ms
Wall time: 8.82 s
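
The price import skips rows without a price, so its statistics can be checked against the exploring output. A short worked calculation, using only numbers already shown in this document:

```python
rows = 119988
priced_rows = 111593     # non-null count for price from prep.info()
distinct_prices = 391    # nunique(dropna=False), which counts NaN as one value

skipped_rows = rows - priced_rows
skipped_pct = round(skipped_rows * 100 / rows, 1)
expected_price_nodes = distinct_prices - 1  # NaN never becomes a node

print(skipped_rows, skipped_pct, expected_price_nodes)
```

The 7.0% matches the missing-value table, and the node and relationship counts match `nodes_created` and `relationships_created` above.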


## Result

[link to arrows for data modeling](http://www.apcjones.com/arrows/#)   
[arrows source code](../../docs/datagraph/apcjones_datamodel_full.markup)  
![full data model](../../image/howto_graph/datamodel_full.png)

```cypher
CALL db.schema.visualization
:style
```
![neo4j browser schema visualization](../../image/howto_graph/schema_full.png)  
[link to yworks data exploration](http://www.yworks.com/neo4j-explorer/)  
[yworks nodes template](../../docs/datagraph/yworks_template.json)  
![yworks schema visualization](../../image/howto_graph/schema_full_yworks.png)  

## Querying

### Which 10 countries have the most wineries ?
note: make sure to count each winery only once

In [34]:
%%time
query = """
MATCH (w:Winery)-[:FROM_PROVENCE]->(p:Province)-[:PROVINCE_COUNTRY]->(c:Country)
RETURN c.name AS Country, count(DISTINCT w) AS totalNrWineries
ORDER BY totalNrWineries DESC LIMIT 10
"""
graph.run(query).to_data_frame()

CPU times: user 0 ns, sys: 3.61 ms, total: 3.61 ms
Wall time: 339 ms


Unnamed: 0,Country,totalNrWineries
0,US,5375
1,France,3864
2,Italy,2934
3,Spain,1435
4,Argentina,531
5,Australia,474
6,Portugal,430
7,Chile,317
8,New Zealand,300
9,South Africa,294


In [35]:
%%time
result = prep[["winery","country"]].groupby(['country'])['winery'].nunique()
result = result.rename_axis(['Country']).rename('totalNrWineries').sort_values(ascending=False).reset_index()
result.head(10)

CPU times: user 55.5 ms, sys: 210 µs, total: 55.7 ms
Wall time: 59 ms


Unnamed: 0,Country,totalNrWineries
0,US,5375
1,France,3864
2,Italy,2934
3,Spain,1435
4,Argentina,531
5,Australia,474
6,Portugal,430
7,Chile,317
8,New Zealand,300
9,South Africa,294


### Which wineries are across multiple provinces ?
alt: Which provinces are associated with each winery ?

In [36]:
%%time
query = """
MATCH (w:Winery)-[:FROM_PROVENCE]->(p:Province)
WITH w, COLLECT(p.name) AS Provinces, count(p) AS Total
RETURN w.name AS Winery, Provinces, Total
ORDER BY Total DESC LIMIT 10
"""
graph.run(query).to_data_frame()

CPU times: user 3.35 ms, sys: 0 ns, total: 3.35 ms
Wall time: 237 ms


Unnamed: 0,Winery,Provinces,Total
0,Undurraga,"[Colchagua Valley, Maule Valley, Maipo Valley, Rapel Valley, Leyda Valley, Casablanca Valley, Cu...",19
1,Concha y Toro,"[Colchagua Valley, Maule Valley, Maipo Valley, Rapel Valley, Leyda Valley, Casablanca Valley, Li...",16
2,Santa Carolina,"[Colchagua Valley, Maule Valley, Maipo Valley, Rapel Valley, Leyda Valley, Casablanca Valley, Cu...",14
3,San Pedro,"[Northern Spain, Mendoza Province, Maule Valley, Maipo Valley, Leyda Valley, Casablanca Valley, ...",12
4,Kirkland Signature,"[Northern Spain, California, Mendoza Province, Bordeaux, Washington, Burgundy, Tuscany, Aconcagu...",12
5,Santa Rita,"[Colchagua Valley, Maipo Valley, Rapel Valley, Leyda Valley, Aconcagua Valley, Casablanca Valley...",11
6,Bacalhôa Wines of Portugal,"[Douro, Alentejano, Lisboa, Península de Setúbal, Dão, Vinho Espumante, Setubal, Vinho Espumante...",11
7,Wines & Winemakers,"[Douro, Tejo, Alentejano, Vinho Verde, Península de Setúbal, Port, Dão, Bairrada, Setubal, Palmela]",10
8,Tussock Jumper,"[Rheinhessen, California, Other, Colchagua Valley, France Other, Central Spain, Marlborough, Wes...",10
9,Casca Wines,"[Douro, Tejo, Alentejano, Vinho Verde, Lisboa, Dão, Bairrada, Minho, Bucelas, Távora-Varosa]",10


In [37]:
%%time
result = prep.groupby('winery').agg({'province':[lambda x: x.unique(), lambda x: x.nunique()]}).reset_index()
result.columns = ['Winery', 'Provinces', 'Total']
result = result.sort_values(by='Total',ascending=False).reset_index(drop=True)
result.head(10)

CPU times: user 3.78 s, sys: 0 ns, total: 3.78 s
Wall time: 3.78 s


Unnamed: 0,Winery,Provinces,Total
0,Undurraga,"[Maipo Valley, Leyda Valley, Chile, Cauquenes Valley, Curicó Valley, Rapel Valley, San Antonio, ...",19
1,Concha y Toro,"[Chile, Central Valley, Maipo Valley, Casablanca Valley, Rapel Valley, Peumo, Marchigue, Puente ...",16
2,Santa Carolina,"[Cachapoal Valley, Colchagua Valley, Casablanca Valley, Leyda Valley, Maipo Valley, Central Vall...",14
3,San Pedro,"[Lontué Valley, Cachapoal Valley, Maipo Valley, Central Valley, Leyda Valley, Maule Valley, Elqu...",12
4,Kirkland Signature,"[California, Washington, Bordeaux, Rhône Valley, Tuscany, Mendoza Province, Marlborough, Norther...",12
5,Santa Rita,"[Leyda Valley, Central Valley, Maipo Valley, Aconcagua Valley, Rapel Valley, Colchagua Valley, A...",11
6,Bacalhôa Wines of Portugal,"[Douro, Setubal, Península de Setúbal, Lisboa, Alentejano, Dão, Moscatel de Setúbal, Vinho Espum...",11
7,Xavier Flouret,"[Central Valley, Bordeaux, Provence, Burgundy, Loire Valley, Northern Spain, Other, Mendoza Prov...",10
8,Barton & Guestier,"[France Other, No Province, Bordeaux, Burgundy, Languedoc-Roussillon, Beaujolais, Loire Valley, ...",10
9,Echeverria,"[Central Valley, Maipo Valley, Curicó Valley, Maipo Valley-Colchagua Valley, Molina, Colchagua V...",10


### The top 10 most prolific wine tasters ?
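note: count the unique wine titles per taster instead of the total number of tastings.  
The Cypher query below reaches `Wine` through the shared `Points` nodes, so each taster is linked to every wine that received a score the taster has ever given. This inflates its totals (about 118,000 for every taster) compared with the pandas result that follows. Matching the direct `(w:Wine)-[:TASTED_BY]->(t:Taster)` relationship, as in the 100 points query further down, avoids the inflation.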

In [38]:
%%time
query = """
MATCH (t: Taster)
WHERE t.name <> "No Taster"
MATCH (t)-[:GAVE_POINTS]->(p:Points)<-[:HAS_POINTS]-(w:Wine)-[:HAS_VARIETY]->(v:Variety)
WITH t, count(DISTINCT w.name) AS Total, COLLECT(DISTINCT v.name) AS Varieties
RETURN t.name AS Taster, Varieties, Total
ORDER BY Total DESC LIMIT 10
"""
graph.run(query).to_data_frame()

CPU times: user 16.4 ms, sys: 0 ns, total: 16.4 ms
Wall time: 3.49 s


Unnamed: 0,Taster,Varieties,Total
0,Roger Voss,"[Chardonnay, Syrah, Melon, Cabernet Sauvignon, Primitivo, Tempranillo-Merlot, Red Blend, Silvane...",118840
1,Paul Gregutt,"[Chardonnay, Syrah, Melon, Cabernet Sauvignon, Primitivo, Tempranillo-Merlot, Red Blend, Silvane...",118840
2,Joe Czerwinski,"[Chardonnay, Syrah, Melon, Cabernet Sauvignon, Primitivo, Tempranillo-Merlot, Red Blend, Silvane...",118840
3,Kerin O’Keefe,"[Chardonnay, Syrah, Melon, Cabernet Sauvignon, Primitivo, Tempranillo-Merlot, Red Blend, Silvane...",118840
4,Virginie Boone,"[Chardonnay, Syrah, Melon, Cabernet Sauvignon, Primitivo, Tempranillo-Merlot, Red Blend, Silvane...",118823
5,Anna Lee C. Iijima,"[Chardonnay, Syrah, Melon, Cabernet Sauvignon, Primitivo, Tempranillo-Merlot, Red Blend, Silvane...",118791
6,Michael Schachner,"[Chardonnay, Syrah, Melon, Cabernet Sauvignon, Primitivo, Tempranillo-Merlot, Red Blend, Silvane...",118791
7,Jim Gordon,"[Chardonnay, Syrah, Melon, Cabernet Sauvignon, Primitivo, Tempranillo-Merlot, Red Blend, Silvane...",118715
8,Sean P. Sullivan,"[Chardonnay, Syrah, Melon, Cabernet Sauvignon, Primitivo, Tempranillo-Merlot, Red Blend, Silvane...",118715
9,Matt Kettmann,"[Chardonnay, Syrah, Melon, Cabernet Sauvignon, Primitivo, Tempranillo-Merlot, Red Blend, Silvane...",118323


In [39]:
%%time
result = prep[prep.taster_name != "No Taster"]
result = result.groupby(['taster_name']).agg({'variety':[lambda x: list(x)], 'title':[lambda x: x.nunique()]}).reset_index()
result.columns = ['Taster', 'Varieties', 'Total']
result = result.sort_values(by='Total',ascending=False).reset_index(drop=True)
result.head(10)

CPU times: user 122 ms, sys: 2.57 ms, total: 124 ms
Wall time: 125 ms


Unnamed: 0,Taster,Varieties,Total
0,Roger Voss,"[Portuguese Red, Gewürztraminer, Pinot Gris, Gewürztraminer, Gamay, Gamay, Gamay, Bordeaux-style...",22973
1,Michael Schachner,"[Tempranillo-Merlot, Malbec, Malbec, Tempranillo Blend, Chardonnay, Tempranillo-Merlot, Petit Ve...",13944
2,Kerin O’Keefe,"[White Blend, Frappato, Nerello Mascalese, White Blend, Nero d'Avola, White Blend, Nero d'Avola,...",9662
3,Paul Gregutt,"[Pinot Gris, Pinot Noir, Pinot Noir, Pinot Noir, Pinot Noir, Pinot Noir, Pinot Noir, Pinot Noir,...",8856
4,Virginie Boone,"[Cabernet Sauvignon, Cabernet Sauvignon, Pinot Noir, Chenin Blanc, Chardonnay, Cabernet Sauvigno...",8689
5,Matt Kettmann,"[Chardonnay, Tempranillo-Merlot, Sauvignon Blanc, Zinfandel, Bordeaux-style Red Blend, Chardonna...",5698
6,Joe Czerwinski,"[Chardonnay, Rosé, Cabernet Sauvignon, Bordeaux-style Red Blend, Sauvignon Blanc, Cabernet Sauvi...",4753
7,Sean P. Sullivan,"[Malbec, Cabernet Franc, Bordeaux-style Red Blend, Chardonnay, Albariño, Viognier-Chardonnay, Te...",4448
8,Anna Lee C. Iijima,"[Gewürztraminer, Riesling, Riesling, Riesling, Riesling, Pinot Gris, Riesling, Riesling, Rieslin...",4012
9,Jim Gordon,"[Red Blend, Cabernet Franc, White Blend, Grenache Blanc, Grenache Blanc, White Blend, Pinot Noir...",3750


### How many wine varieties contain the word 'red' ?

In [40]:
%%time
query = """
MATCH (v:Variety)
WHERE tolower(v.name) CONTAINS 'red'
RETURN v.name AS redVariety
ORDER BY redVariety
"""
graph.run(query).to_data_frame()

CPU times: user 0 ns, sys: 2.44 ms, total: 2.44 ms
Wall time: 109 ms


Unnamed: 0,redVariety
0,Bordeaux-style Red Blend
1,Portuguese Red
2,Provence red blend
3,Red Blend


In [41]:
%%time
pd.DataFrame(sorted(prep["variety"][prep["variety"].str.contains('red', case=False)].unique()), columns=["redVariety"])

CPU times: user 95.5 ms, sys: 0 ns, total: 95.5 ms
Wall time: 143 ms


Unnamed: 0,redVariety
0,Bordeaux-style Red Blend
1,Portuguese Red
2,Provence red blend
3,Red Blend


### Which Year had the most Wine ?
**note:** this is a distinct (unique) count of wine titles

In [42]:
%%time
query = """
MATCH (w:Wine)-[:FROM_YEAR]->(y:Year)
WITH y, collect(DISTINCT w.name) AS wines
RETURN y.value AS year, size(wines) AS wines ORDER BY wines DESC LIMIT 5
"""
graph.run(query).to_data_frame()

CPU times: user 3.34 ms, sys: 0 ns, total: 3.34 ms
Wall time: 306 ms


Unnamed: 0,year,wines
0,2012,14302
1,2013,14261
2,2014,13914
3,2011,11504
4,2010,11228


In [43]:
%%time
result = prep[prep.year != "No Year"]
result = result.groupby(['year'])['title'].nunique().reset_index()
result.columns = ['year', 'wines']
result = result.sort_values(by='wines',ascending=False).reset_index(drop=True)
result.head(5)

CPU times: user 101 ms, sys: 4.04 ms, total: 105 ms
Wall time: 108 ms


Unnamed: 0,year,wines
0,2012,14302
1,2013,14261
2,2014,13914
3,2011,11504
4,2010,11228


### Which top 5 Winery produces the most Wine for a given Year ?
**note:** the Cypher query returns the distinct (unique) count of wine titles per winery and year

In [44]:
%%time
query = """
MATCH (wy:Winery)<-[:FROM_WINERY]-(w:Wine)-[:FROM_YEAR]->(y:Year)
WITH wy, y, COLLECT(DISTINCT w.name) AS wines
RETURN wy.name AS Winery, y.value AS Year, size(wines) AS `No of Wines`
ORDER BY `No of Wines` DESC LIMIT 5
"""
graph.run(query).to_data_frame()

CPU times: user 3.45 ms, sys: 172 µs, total: 3.62 ms
Wall time: 1.1 s


Unnamed: 0,Winery,Year,No of Wines
0,Wines & Winemakers,2013,39
1,Georges Duboeuf,2015,38
2,Wines & Winemakers,2014,38
3,Louis Latour,2014,37
4,Georges Duboeuf,2014,37


In [45]:
%%time
result = prep.groupby(['winery', 'year']).agg({'title':['nunique']}).reset_index()
result.columns = ['Winery', 'Year', 'No of Wines']
result = result.sort_values(by='No of Wines',ascending=False).reset_index(drop=True)
result.head(5)

CPU times: user 126 ms, sys: 0 ns, total: 126 ms
Wall time: 127 ms


Unnamed: 0,Winery,Year,No of Wines
0,Wines & Winemakers,2013,39
1,Georges Duboeuf,2015,38
2,Wines & Winemakers,2014,38
3,Louis Latour,2014,37
4,Georges Duboeuf,2014,37


### Show Variety linked to VarietyName

In [46]:
%%time
query = """
MATCH (vn:VarietyName)-[:IS_COMPONENT_OF]->(v:Variety)
WITH vn, COLLECT(v.name) AS var
RETURN vn.name, var, size(var) AS s
ORDER BY s DESC LIMIT 5
"""
graph.run(query).to_data_frame()

CPU times: user 3.55 ms, sys: 0 ns, total: 3.55 ms
Wall time: 90.6 ms


Unnamed: 0,vn.name,var,s
0,blanc,"[Pinot Noir, Chardonnay, Chenin Blanc, Sauvignon Blanc, Grenache Blanc, Pinot Blanc, Fumé Blanc,...",17
1,blend,"[White Blend, Malbec, Tempranillo Blend, Red Blend, Bordeaux-style White Blend, Champagne Blend,...",16
2,cabernet,"[Cabernet Sauvignon, Malbec, Sangiovese, Cabernet Franc, Carmenère, Syrah, Tannat-Cabernet, Merl...",14
3,tinta,"[Tinta Miúda, Tinta de Toro, Tinta Fina, Tinta Roriz, Tinta Barroca, Tinta del Pais, Tinta del T...",12
4,pinot,"[Pinot Gris, Pinot Noir, Chardonnay, Pinot Blanc, Pinot Bianco, Viognier, Pinot Grigio, Pinot Ne...",11


### Which VarietyName have the most Wine ?

In [47]:
%%time
query = """
MATCH (vn:VarietyName)-[:IS_COMPONENT_OF]->(v:Variety)<-[:HAS_VARIETY]-(w:Wine)
WITH vn, COLLECT(DISTINCT w.name) AS wines
RETURN vn.name, size(wines) AS s
ORDER BY s DESC LIMIT 5
"""
graph.run(query).to_data_frame()

CPU times: user 2.84 ms, sys: 0 ns, total: 2.84 ms
Wall time: 1.44 s


Unnamed: 0,vn.name,s
0,syrah,34437
1,sauvignon,33631
2,blanc,30058
3,pinot,26373
4,blend,26086


### Find the popular product flavors in this 'merlot' variety

* Exclude the variety names that are generic words. This list was added manually: 'black', 'red', 'white', 'blend', 'style', 'other', 'blank', 'gris'
* Match the remaining variety names to the product words found in the descriptions
* Keep only the wines whose variety contains 'merlot'
* Count how often each distinct combination of products occurs across those wines

In [48]:
%%time
query = """
MATCH (vn:VarietyName)
WHERE NOT vn.name in ['black', 'red', 'white', 'blend', 'style', 'other', 'blank', 'gris']
WITH vn
MATCH (p:Product {value:vn.name})
WITH p
MATCH (p:Product)-[:PRODUCT_IN]->(d:Description)<-[:HAS_DESCRIPTION]-(w:Wine)-[:HAS_VARIETY]->(v:Variety)
WHERE tolower(v.name) contains('merlot')
WITH w, p ORDER BY p.value
WITH w, collect(DISTINCT p.value) as grapes
RETURN grapes, count(grapes) as popularity order by popularity desc
"""
graph.run(query).to_data_frame()

CPU times: user 3 ms, sys: 0 ns, total: 3 ms
Wall time: 866 ms


Unnamed: 0,grapes,popularity
0,[orange],3
1,[apple],2
2,[melon],1
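
The steps of the query above can be sketched in plain Python. This is an assumption-laden illustration, not the notebook's method: the variety names, wines and description words below are hypothetical, and sets replace the `Product` and `Description` nodes.

```python
from collections import Counter

# Hypothetical inputs for illustration; none of these rows come from the dataset.
variety_names = {"merlot", "orange", "melon", "apple", "red", "blend"}
generic = {"black", "red", "white", "blend", "style", "other", "blank", "gris"}
wines = [
    {"variety": "Merlot", "words": {"orange", "apple", "oak"}},
    {"variety": "Merlot-Cabernet", "words": {"orange", "vanilla"}},
    {"variety": "Red Blend", "words": {"orange", "apple"}},
    {"variety": "Merlot", "words": {"apple", "orange"}},
]

# Step 1: drop the generic variety names.
candidates = variety_names - generic

# Steps 2-4: for each merlot wine keep the description words that are also
# variety names, then count how often each sorted combination occurs.
popularity = Counter()
for wine in wines:
    if "merlot" in wine["variety"].lower():
        matched = tuple(sorted(candidates & wine["words"]))
        if matched:
            popularity[matched] += 1

ranking = popularity.most_common()
print(ranking)  # [(('apple', 'orange'), 2), (('orange',), 1)]
```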


### Which wines have 100 points ?

In [49]:
%%time
# ...
query = """
MATCH (p:Points {value:100})<-[:HAS_POINTS]-(w:Wine)-[:HAS_VARIETY]->(v:Variety)
MATCH (w)-[:TASTED_BY]->(t:Taster)
RETURN t.name AS `Reviewer`, w.name AS `Wine title`, v.name AS `Grape variety` ORDER BY `Grape variety`
"""
graph.run(query).to_data_frame()

CPU times: user 126 µs, sys: 3.61 ms, total: 3.74 ms
Wall time: 134 ms


Unnamed: 0,Reviewer,Wine title,Grape variety
0,No Taster,Verité 2007 La Muse Red (Sonoma County),Bordeaux-style Red Blend
1,Roger Voss,Château Léoville Las Cases 2010 Saint-Julien,Bordeaux-style Red Blend
2,Roger Voss,Château Cheval Blanc 2010 Saint-Émilion,Bordeaux-style Red Blend
3,Roger Voss,Château Lafite Rothschild 2010 Pauillac,Bordeaux-style Red Blend
4,Roger Voss,Château Léoville Barton 2010 Saint-Julien,Bordeaux-style Red Blend
5,Roger Voss,Château Haut-Brion 2014 Pessac-Léognan,Bordeaux-style White Blend
6,No Taster,Cardinale 2006 Cabernet Sauvignon (Napa Valley),Cabernet Sauvignon
7,Roger Voss,Louis Roederer 2008 Cristal Vintage Brut (Champagne),Champagne Blend
8,Roger Voss,Krug 2002 Brut (Champagne),Champagne Blend
9,Roger Voss,Salon 2006 Le Mesnil Blanc de Blancs Brut Chardonnay (Champagne),Chardonnay


In [50]:
%%time
result = prep[(prep["points"] == 100)][["taster_name", "title", "variety", "points"]]
result = result.sort_values(by=["variety"], ascending=True).reset_index(drop=True)
result = result.rename(columns={'taster_name':'Reviewer', 'title':'Wine Title', 'variety':'Variety'})
result

CPU times: user 5.22 ms, sys: 60 µs, total: 5.28 ms
Wall time: 4.9 ms


Unnamed: 0,Reviewer,Wine Title,Variety,points
0,No Taster,Verité 2007 La Muse Red (Sonoma County),Bordeaux-style Red Blend,100
1,Roger Voss,Château Léoville Las Cases 2010 Saint-Julien,Bordeaux-style Red Blend,100
2,Roger Voss,Château Cheval Blanc 2010 Saint-Émilion,Bordeaux-style Red Blend,100
3,Roger Voss,Château Lafite Rothschild 2010 Pauillac,Bordeaux-style Red Blend,100
4,Roger Voss,Château Léoville Barton 2010 Saint-Julien,Bordeaux-style Red Blend,100
5,Roger Voss,Château Haut-Brion 2014 Pessac-Léognan,Bordeaux-style White Blend,100
6,No Taster,Cardinale 2006 Cabernet Sauvignon (Napa Valley),Cabernet Sauvignon,100
7,Roger Voss,Krug 2002 Brut (Champagne),Champagne Blend,100
8,Roger Voss,Louis Roederer 2008 Cristal Vintage Brut (Champagne),Champagne Blend,100
9,Roger Voss,Salon 2006 Le Mesnil Blanc de Blancs Brut Chardonnay (Champagne),Chardonnay,100


### Show the most expensive wines (>1000)

In [51]:
%%time
query = """
MATCH (po:Points)<-[:HAS_POINTS]-(wi:Wine)-[:HAS_PRICE]->(pr:Price)
WHERE pr.value >= 1000
RETURN wi.name AS Title, pr.value AS Price, po.value AS Points ORDER BY po.value
"""
graph.run(query).to_data_frame()

CPU times: user 2.71 ms, sys: 0 ns, total: 2.71 ms
Wall time: 136 ms


Unnamed: 0,Title,Price,Points
0,Château les Ormes Sorbet 2013 Médoc,3300,88
1,Blair 2013 Roger Rose Vineyard Chardonnay (Arroyo Seco),2013,91
2,Château La Mission Haut-Brion 2009 Pessac-Léognan,1000,94
3,Emmerich Knoll 2013 Ried Loibenberg Smaragd Grüner Veltliner (Wachau),1100,94
4,Domaine du Comte Liger-Belair 2006 La Romanée,1125,94
5,Château Haut-Brion 2009 Pessac-Léognan,1200,96
6,Château Mouton Rothschild 2009 Pauillac,1300,96
7,Domaine du Comte Liger-Belair 2005 La Romanée,2000,96
8,Domaine du Comte Liger-Belair 2010 La Romanée,2500,96
9,Château Pétrus 2014 Pomerol,2500,96


### Show the highest scoring wines

In [52]:
%%time
query = """
MATCH (pr:Price)<-[:HAS_PRICE]-(wi:Wine)-[:HAS_POINTS]->(po:Points)
WHERE po.value > 96
WITH pr, wi, po
MATCH (wi)-[:HAS_VARIETY]->(va:Variety)
RETURN wi.name AS Title, va.name AS Variety, po.value AS Points, pr.value AS Price ORDER BY Price ASC
"""
graph.run(query).to_data_frame()

CPU times: user 12.3 ms, sys: 16 µs, total: 12.3 ms
Wall time: 143 ms


Unnamed: 0,Title,Variety,Points,Price
0,Donkey & Goat 2010 Fenaughty Vineyard Syrah (El Dorado),Syrah,97,35
1,Taylor Fladgate NV 325 Anniversary (Port),Portuguese Red,97,40
2,Failla 2010 Estate Vineyard Chardonnay (Sonoma Coast),Chardonnay,99,44
3,Château Coutet 2014 Barsac,Bordeaux-style White Blend,97,45
4,Trefethen 2005 Estate Cabernet Sauvignon (Oak Knoll District),Cabernet Sauvignon,97,50
...,...,...,...,...
317,Château La Mission Haut-Brion 2009 Pessac-Léognan,Bordeaux-style Red Blend,97,1100
318,Château Cheval Blanc 2010 Saint-Émilion,Bordeaux-style Red Blend,100,1500
319,Château Lafite Rothschild 2010 Pauillac,Bordeaux-style Red Blend,100,1500
320,Château Margaux 2009 Margaux,Bordeaux-style Red Blend,98,1900


## Documentation
### useful links, commands and documentation
* [**data modelling 'arrows'**](http://www.apcjones.com/arrows/#)
* [**overview docs**](https://neo4j.com/docs/)  
  * [**cypher**](https://neo4j.com/docs/cypher-manual/current/)
  * [**apoc**](https://neo4j.com/labs/apoc/4.1/)
  * [**gds**](https://neo4j.com/docs/graph-data-science/current/)
  * [**py2neo**](https://py2neo.readthedocs.io/en/latest/)
* [**cypher ref card**](https://neo4j.com/docs/cypher-refcard/current/)  
* [**forums community**](https://community.neo4j.com/)  
* [**bloom**](https://neo4j.com/docs/bloom-user-guide/current/)
* [**data explorer yworks**](http://www.yworks.com/neo4j-explorer/)  
  * [**video tutorial**](https://www.youtube.com/watch?v=kSMh8NtNk_k)
* [**gists**](https://neo4j.com/graphgists/)

### working with unstructured text data
regex generator : http://regex.inginf.units.it/  
regex checker : https://regex101.com/  
neo4j apoc text replace : https://neo4j.com/labs/apoc/4.1/misc/text-functions/#text-functions-regex  
pandas series replace : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html  
pandas series extract : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html  
text similarity : https://neo4j.com/labs/apoc/4.1/misc/text-functions/#text-functions-text-similarity

**apply FuzzyWuzzy in one column using token set ratio**  
[compare 2 strings](https://stackoverflow.com/questions/31806695/when-to-use-which-fuzz-function-to-compare-2-strings)  
[find similar strings with fuzzywuzzy](https://github.com/thuynh323/NLP-with-Python/blob/master/FuzzyWuzzy%20-%20Ramen%20Rater%20List/Find%20similar%20strings%20with%20FuzzyWuzzy.ipynb)  

the `token_*` methods ignore case and punctuation  
`ratio` calculates a similarity score from the Levenshtein distance  
`token_sort_ratio` splits the strings into words, sorts them in alphanumeric order, then applies `ratio`  
`token_set_ratio` ignores duplicate words (treats the words as a set)  
`partial_token_sort_ratio` works like `token_sort_ratio` but uses `partial_ratio` instead of `ratio`  
[doc](https://medium.com/@laxmi17sarki/string-matching-using-fuzzywuzzy-24be9e85c88d)  
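
To make the difference between these scorers concrete, here is a rough approximation using only the standard library. It is not FuzzyWuzzy's implementation: `difflib.SequenceMatcher` stands in for the Levenshtein-based score, and the helper names (`ratio`, `tokens`, `token_sort_ratio`, `token_set_ratio`) are only chosen to mirror the list above.

```python
import string
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings as a 0-100 score (difflib-based stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def tokens(s):
    """Lower-case, replace punctuation with spaces and split into words."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return s.lower().translate(table).split()


def token_sort_ratio(a, b):
    """Compare the strings after sorting their words."""
    return ratio(" ".join(sorted(tokens(a))), " ".join(sorted(tokens(b))))


def token_set_ratio(a, b):
    """Compare the shared words with each string's word set, keep the best score."""
    ta, tb = set(tokens(a)), set(tokens(b))
    common = " ".join(sorted(ta & tb))
    full_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    full_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, full_a), ratio(common, full_b), ratio(full_a, full_b))


plain = ratio("Pinot Noir", "noir, pinot")                # word order and case hurt
sorted_score = token_sort_ratio("Pinot Noir", "noir, pinot")
set_score = token_set_ratio("Red Blend", "red blend blend")
print(plain, sorted_score, set_score)
```

With these inputs the sorted and set variants both reach 100, while the plain `ratio` stays below it.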

using APOC for Azure NLP cognitive services:  
https://neo4j.com/labs/apoc/4.1/nlp/azure/#nlp-azure-examples-entities

In [53]:
query = """
MERGE (:Article {
  uri: "https://neo4j.com/blog/pokegraph-gotta-graph-em-all/",
  body: "These days I’m rarely more than a few feet away from my Nintendo Switch and I play board games, card games and role playing games with friends at least once or twice a week. I’ve even organised lunch-time Mario Kart 8 tournaments between the Neo4j European offices!"
})
"""
graph.run(query)

query = """
MERGE (:Article {
  uri: "https://en.wikipedia.org/wiki/Nintendo_Switch",
  body: "The Nintendo Switch is a video game console developed by Nintendo, released worldwide in most regions on March 3, 2017. It is a hybrid console that can be used as a home console and portable device. The Nintendo Switch was unveiled on October 20, 2016. Nintendo offers a Joy-Con Wheel, a small steering wheel-like unit that a Joy-Con can slot into, allowing it to be used for racing games such as Mario Kart 8."
})
"""
graph.run(query)

<py2neo.database.Cursor at 0x7f52e01d2b20>

In [54]:
query = """
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.azure.entities.graph(articles, {
  key: "<your-azure-text-analytics-key>",
  url: "https://westeurope.api.cognitive.microsoft.com/",
  nodeProperty: "body",
  writeRelationshipType: "ENTITY",
  write: true
})
YIELD graph AS g
RETURN g
"""
graph.run(query).to_data_frame()

Unnamed: 0,g
0,"{'name': 'Graph', 'relationships': [{'score': 0.94}, {'score': 0.032446316016667254}, {'score': ..."


### howto datamodel
The datamodel design depends on the business questions you need to answer.  
Beyond that, it depends on how each pair of feature columns in the dataset relates to each other.  
The example shows that bridging nodes can help preserve all information, which is otherwise lost in the merge.  
Verify each combination in a 4 column dataset (A 'Wine', B 'Points', C 'Taster', D 'Price'):  
 * A versus B = related (Wine HAS_POINTS > Points)  
 * A versus C = related (Taster TASTED > Wine)  
 * A versus D = related (Wine HAS_PRICE > Price)  
 * B versus C = related (Taster GAVE_POINTS > Points)  
 * B versus D = unrelated (lose this information)  
 * C versus D = unrelated (ignore this relation)  
 
Including everything is not good practice:  
follow the business questions and model only the useful data.  

![how to include all information in the datamodel](../../image/howto_graph/howto_include_all_information.jpg)
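
The pairwise check above can be enumerated so that no combination is forgotten. A small sketch, assuming the four columns and the relationship names from the list (the `related` dictionary is just a convenient way to write them down):

```python
from itertools import combinations

columns = ["Wine", "Points", "Taster", "Price"]

# Relationship chosen for each related pair, copied from the list above.
related = {
    ("Wine", "Points"): "HAS_POINTS",
    ("Wine", "Taster"): "TASTED",
    ("Wine", "Price"): "HAS_PRICE",
    ("Points", "Taster"): "GAVE_POINTS",
}

# n columns give n * (n - 1) / 2 pairs to review.
pairs = list(combinations(columns, 2))
unrelated = [pair for pair in pairs if pair not in related]

for a, b in pairs:
    print(f"{a} versus {b}: {related.get((a, b), 'unrelated')}")
```

For 4 columns this prints 6 pairs, 2 of them unrelated (Points versus Price and Taster versus Price).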

### mesh topology
stores only unique nodes, no duplication  
preferred topology, but you will probably have to add relationship bridges for certain questions (ex: TASTED_BY)  
```cypher
CREATE CONSTRAINT ON (t:Taster) ASSERT t.name IS UNIQUE;
CREATE CONSTRAINT ON (p:Points) ASSERT p.value IS UNIQUE;
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS line
MERGE (t:Taster {name:line.taster})
MERGE (p:Points {value:line.points})
MERGE (t)-[:GAVE_POINTS]->(p);
```
**41 nodes, 326 relationships**  
![unique_mesh](../../image/howto_graph/mesh_topology.png)


### star topology
duplicated nodes surround the unique nodes, half duplication  
not preferred, although it can eliminate the need for relationship bridges (ex: TASTED_BY)
```cypher
CREATE CONSTRAINT ON (t:Taster) ASSERT t.name IS UNIQUE;
CREATE INDEX ON :Points(value);
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS line
MERGE (t:Taster {name:line.taster})
MERGE (t)-[:GAVE_POINTS]->(p:Points {value:line.points});
```
**346 nodes, 326 relationships**  
![star_topology](../../image/howto_graph/star_topology.png)  


### binary topology
pairs of nodes, full duplication  
not preferred, this creates separate chains of nodes and edges
```cypher
CREATE INDEX ON :Taster(name);
CREATE INDEX ON :Points(value);
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS line
MERGE (t:Taster {name:line.taster})-[:GAVE_POINTS]->(p:Points {value:line.points});
```
**654 nodes, 327 relationships**  
![binary_topology](../../image/howto_graph/binary_topology.png)  
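
The node and relationship counts of the three topologies follow from how `MERGE` treats a pattern: it matches the whole pattern or creates every part that is not already bound. The sketch below estimates the counts from (taster, points) rows under that reading. The function name `topology_counts` and the rows are made up, and it does not try to reproduce the exact numbers above.

```python
def topology_counts(rows):
    """Return {topology: (nodes, relationships)} for (taster, points) rows."""
    tasters = {taster for taster, _ in rows}
    points = {value for _, value in rows}
    pairs = set(rows)  # duplicate rows are merged, not created again
    return {
        # mesh: both node types are unique, one relationship per distinct pair
        "mesh": (len(tasters) + len(points), len(pairs)),
        # star: unique tasters, one Points node per distinct pair
        "star": (len(tasters) + len(pairs), len(pairs)),
        # binary: a new Taster and a new Points node per distinct pair
        "binary": (2 * len(pairs), len(pairs)),
    }


# Toy rows (made up): the last row repeats an earlier pair.
rows = [("Ann", 90), ("Ann", 91), ("Bob", 90), ("Bob", 91), ("Bob", 90)]
counts = topology_counts(rows)
print(counts)  # {'mesh': (4, 4), 'star': (6, 4), 'binary': (8, 4)}
```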

## Maintenance
### cypher
```cypher
//visualize the data model
CALL db.schema.visualization
```
```cypher
//list available procedures
CALL dbms.procedures
```
```cypher
//unique constraints ensure that no duplicate nodes can be created
CREATE CONSTRAINT ON (p:Price) ASSERT p.value IS UNIQUE
//an index speeds up lookups but still allows duplicate nodes
CREATE INDEX ON :Price(value)
```
```cypher
//check running queries
CALL dbms.listQueries()
```
```cypher
//list indexes and constraints (Neo4j Browser command)
:schema
```
```cypher
//browser visualization for setting color, size and title
:style
```
```cypher
//remove all nodes and relationships, then drop all indexes and constraints
MATCH (n) DETACH DELETE n;
CALL apoc.schema.assert({},{},true) YIELD label, key RETURN *;
```
**analyzing a query**
```cypher
// prefix a query with EXPLAIN to view its plan, the query does not run
EXPLAIN
```
```cypher
// prefix a query with PROFILE to run it and view the plan with performance metrics
PROFILE
```
### python
```python
#remove all nodes, relationships, indexes and constraints
graph.delete_all()
graph.run("""CALL apoc.schema.assert({},{},true) YIELD label, key RETURN *""")
```
```sh
#delete database (v4.x)
sudo docker-compose down
sudo rm -Rf neo4j/data/databases/neo4j
sudo rm -Rf neo4j/data/transactions/neo4j
sudo docker-compose up --build &
```
```python
#status queries
graph.run("""CALL dbms.listQueries()""").to_data_frame()[["queryId", "query", "status", "elapsedTimeMillis"]].T
```
```python
#kill a running query by its id
graph.run("""CALL dbms.killQueries(["query-3295"])""").to_data_frame()
```

In [None]:
# delete database and restart (~2min)
os.system(" cd ../.. && \
            sudo docker-compose down && \
            sudo rm -Rf neo4j/data/databases/neo4j && \
            sudo rm -Rf neo4j/data/transactions/neo4j && \
            sudo docker-compose up --build &")

In [55]:
#status queries
graph.run("""CALL dbms.listQueries()""").to_data_frame()[["queryId", "query", "status", "elapsedTimeMillis"]].T

Unnamed: 0,0
queryId,query-106
query,CALL dbms.listQueries()
status,running
elapsedTimeMillis,38


In [56]:
#kill a running query by its id
graph.run("""CALL dbms.killQueries(["query-3295"])""").to_data_frame()

Unnamed: 0,queryId,username,message
0,query-3295,,No Query found with this id


### data loading test

In [57]:
# check first 2 lines
! head -n 2 ../../neo4j/import/$file

id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year,wine_group,variety_name
0,Italy,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco  (Etna),White Blend,Nicosia,2013,Nicosia Vulkà Bianco (Etna),"['white', 'blend']"


In [58]:
# test data loading
query = """
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
WITH line
LIMIT 1
RETURN line
"""
data = graph.run(query)

In [59]:
next(data)

<Record line={'wine_group': 'Nicosia Vulkà Bianco (Etna)', 'variety_name': "['white', 'blend']", 'country': 'Italy', 'year': '2013', 'taster_name': 'Kerin O’Keefe', 'taster_twitter_handle': '@kerinokeefe', 'description': "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.", 'title': 'Nicosia 2013 Vulkà Bianco  (Etna)', 'points': '87', 'province': 'Sicily & Sardinia', 'variety': 'White Blend', 'price': None, 'designation': 'Vulkà Bianco', 'id': '0', 'winery': 'Nicosia', 'region_1': 'Etna', 'region_2': None}>
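
The record above shows that `LOAD CSV` returns every field as a string (`'points': '87'`), empty fields as `None`, and the `variety_name` list as text. A minimal sketch of converting such values on the Python side, using a few fields copied from the record (the helper `convert` is made up for this illustration). In Cypher, `toInteger()` and `toFloat()` serve the same purpose during the import.

```python
import ast

# A few fields copied from the record shown above.
line = {
    "points": "87",
    "price": None,
    "year": "2013",
    "variety_name": "['white', 'blend']",
}


def convert(value, cast):
    """Apply cast to a LOAD CSV string, keeping missing values as None."""
    return cast(value) if value is not None else None


points = convert(line["points"], int)
price = convert(line["price"], float)
year = convert(line["year"], int)
# literal_eval safely parses the list that was stored as text.
variety_names = ast.literal_eval(line["variety_name"])

print(points, price, year, variety_names)  # 87 None 2013 ['white', 'blend']
```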