Author: Kevin ALBERT  

Created: Oct 2020 

Inspiration: [git repo](https://github.com/lju-lazarevic/wine)

# environment
**cpu:** 2, **mem:** 8GB, **disk:** 150GB, **os:** ubuntu

In [None]:
# ! pip install py2neo pandas
# ! pip install pandas-profiling
# ! pip install jellyfish
# ! pip install fuzzywuzzy
# ! pip install python-Levenshtein
# ! pip install pandas-dedupe
# ! pip install -U nltk
# ! pip install pyarrow fastparquet

In [None]:
# rerun report (delete me later)
import pandas_profiling as pp
pp.ProfileReport(prep, minimal=True, correlations={"cramers": {"calculate": False}}, progress_bar=False).to_file(reportFile)

In [None]:
import dtale
d = dtale.show(entities, host="13.74.11.167", port="40000", ignore_duplicate=True, drop_index=True, reaper_on=False)
d.main_url() # show all running instances

In [None]:
d.kill() # stop webapp

In [None]:
! pip list |grep -i py2neo
! pip list |grep -i pandas

py2neo is a client library and toolkit for working with Neo4j from within Python applications.  
It is well suited for Data Science workflows and has great integration with other Python Data Science tools.  
[py2neo docs](https://py2neo.org/v4/database.html)

In [None]:
# prep = pd.read_csv("../../neo4j/import/winedata.csv")

In [2]:
from py2neo import Graph, Node, Relationship
import pandas as pd
from IPython.display import Javascript
import pandas_profiling as pp
from fuzzywuzzy import process, fuzz
# import pandas_dedupe

# import warnings
# warnings.filterwarnings('ignore')

In [3]:
neo_server = "13.74.11.167"
port = "7687"
user = "neo4j"
passw = "digityser"
file = "winedata.csv"

In [4]:
graph = Graph(host=neo_server, auth=(user, passw))

In [57]:
# check queries running:
graph.run("""CALL dbms.listQueries()""").to_data_frame()[["queryId", "query", "status", "elapsedTimeMillis"]].T

Unnamed: 0,0,1
queryId,query-12602,query-12573
query,CALL dbms.listQueries(),\nUSING PERIODIC COMMIT 1000\nLOAD CSV WITH HE...
status,running,running
elapsedTimeMillis,1,84646


In [58]:
# remove queries
graph.run("""CALL dbms.killQueries(["query-12573"])""").to_data_frame()

Unnamed: 0,queryId,username,message
0,query-12573,neo4j,Query found


**delete database neo4j (v4.x):**
```sh
sudo docker-compose down
sudo rm -Rf neo4j/data/databases/neo4j
sudo rm -Rf neo4j/data/transactions/neo4j
sudo docker-compose up --build &
```
```cypher
MATCH (n) DETACH DELETE n;
CALL apoc.schema.assert({},{},true) YIELD label, key RETURN *;
```

In [None]:
# delete all nodes and relationships
graph.delete_all()

In [None]:
# delete all indexes and constraints
graph.run("""CALL apoc.schema.assert({},{},true) YIELD label, key RETURN *""")
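
The Cypher block above holds two statements, while each `graph.run` call here carries a single one. A minimal sketch of splitting such a script and running the statements one at a time follows. `split_statements`, `run_script` and the stand-in runner are hypothetical names, and the split on `;` is naive (it assumes no semicolons inside string literals).

```python
from typing import Callable, List


def split_statements(script: str) -> List[str]:
    """Split a Cypher script on ';' and drop empty fragments (naive: ignores quoting)."""
    return [part.strip() for part in script.split(";") if part.strip()]


def run_script(script: str, run: Callable[[str], object]) -> int:
    """Send each statement to `run` in order; return how many were sent."""
    statements = split_statements(script)
    for statement in statements:
        run(statement)
    return len(statements)


script = """
MATCH (n) DETACH DELETE n;
CALL apoc.schema.assert({},{},true) YIELD label, key RETURN *;
"""

sent = []                      # stand-in for graph.run, so the sketch runs without a server
count = run_script(script, sent.append)
print(count, sent)
```

With a live connection, `run_script(script, graph.run)` would hand each statement to py2neo in turn.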

# dataReport

[link to the original dataset](https://www.kaggle.com/zynicide/wine-reviews/data)  
[link to the git repo dataset](https://github.com/lju-lazarevic/wine/tree/master/data)

In [5]:
# pre-cleaned dataset: deduplicated and cleaned twitter handles
datasetURL = "https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv"
reportFile = "../../data/report/winemag_report.html"

In [6]:
df = pd.read_csv(datasetURL)

In [None]:
%%time
pp.ProfileReport(df=df.sample(frac=1),
                 minimal=True,
                 progress_bar=False,
                 correlations={"cramers": {"calculate": False}}).to_file(reportFile)

In [None]:
# open the report (*.html)
display(Javascript('window.open("{url}");'.format(url=reportFile)))

# dataPrep
clean data prior to a load

In [7]:
prep = df.copy()

In [8]:
# replace nan
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')

In [9]:
# save file to /import
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataModel
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model2](../../image/howto_graph/model2.jpg)

# dataLoading
Multi-statement queries are only supported in Neo4j Browser and Cypher Shell.  
With py2neo, run each statement in its own call, one after the other.  
  
`MERGE` creates a node only if it does not exist yet, so duplicate values are handled  
`p` is a temporary variable name  
`Province` is the entity name defined in the data model  
`name` is the property name of the entity  
`line.province`: `line` is one record from the dataset, from which the column 'province' is selected  
  
```cypher
MERGE (p: Province {name: (line.province)})
```
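
As a rough mental model (an analogy in plain Python, not how Neo4j implements it), `MERGE` behaves like a get-or-create keyed on the label and the listed properties. The `merge_node` helper and the sample rows are made up for illustration.

```python
from typing import Dict, Tuple

# (label, name) -> node properties; stands in for the graph store
nodes: Dict[Tuple[str, str], dict] = {}


def merge_node(label: str, name: str) -> dict:
    """Return the existing node for (label, name), or create it first."""
    return nodes.setdefault((label, name), {"label": label, "name": name})


# three CSV rows, two of which share the same province (sample values)
rows = [{"province": "Bordeaux"}, {"province": "Tuscany"}, {"province": "Bordeaux"}]
for line in rows:
    merge_node("Province", line["province"])

print(len(nodes))  # 2: the repeated 'Bordeaux' row did not create a second node
```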

In [None]:
# check first 2 lines
! head -n 2 ../../neo4j/import/$file

In [None]:
# test data loading
query = """
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
WITH line
LIMIT 1
RETURN line
"""
data = graph.run(query)

In [None]:
next(data)
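
The queries in this notebook are assembled by concatenating the file name into the Cypher text. A small helper could keep that in one place; `load_csv_query` is a hypothetical name, and the sketch only builds the string (it is not a py2neo feature).

```python
def load_csv_query(file_name: str, body: str, periodic_commit: int = 0) -> str:
    """Build a LOAD CSV query string for a file saved to the /import folder."""
    prefix = f"USING PERIODIC COMMIT {periodic_commit}\n" if periodic_commit else ""
    return (
        f"{prefix}"
        f"LOAD CSV WITH HEADERS FROM 'file:///{file_name}' AS line FIELDTERMINATOR ','\n"
        f"{body.strip()}\n"
    )


query = load_csv_query("winedata.csv", "WITH line\nLIMIT 1\nRETURN line")
print(query)
```

The resulting string can then be passed to `graph.run(query)` as in the cell above.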

In [None]:
# open neo4j dashboard
display(Javascript('window.open("{url}");'.format(url="http://"+neo_server+":7474")))

## createIndex

In [10]:
%%time
graph.run("""CREATE INDEX ON :Winery(name)""")
graph.run("""CREATE INDEX ON :Province(name)""")
graph.run("""CREATE INDEX ON :Country(name)""")

CPU times: user 3.2 ms, sys: 1.9 ms, total: 5.1 ms
Wall time: 631 ms


<py2neo.database.Cursor at 0x7f0d552b2c10>

## createNodes

In [11]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (w: Winery {name: (line.winery)})
MERGE (p: Province {name: (line.province)})
MERGE (c: Country {name: (line.country)})
"""
graph.run(query)

CPU times: user 2.72 ms, sys: 579 µs, total: 3.3 ms
Wall time: 11 s


<py2neo.database.Cursor at 0x7f0d552b2430>

## createRelations

In [12]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w: Winery {name: trim(line.winery)})
MATCH (p: Province {name: trim(line.province)})
MATCH (c: Country {name: trim(line.country)})
MERGE (w)-[:FROM_PROVENCE]->(p)
MERGE (p)-[:PROVINCE_COUNTRY]->(c)
"""
graph.run(query)

CPU times: user 2.46 ms, sys: 527 µs, total: 2.99 ms
Wall time: 11.4 s


<py2neo.database.Cursor at 0x7f0d55b2f070>

```cypher
CALL db.schema.visualization
```

![CALL db.schema.visualization](../../image/howto_graph/schema2.png)

## Which 10 countries have the most wineries?
note: count each winery only once

In [None]:
%%time
query = """
MATCH (w:Winery)-[:FROM_PROVENCE]->(p:Province)-[:PROVINCE_COUNTRY]->(c:Country)
RETURN c.name AS Country, count(DISTINCT w) AS totalNrWineries
ORDER BY totalNrWineries DESC LIMIT 10
"""
graph.run(query).to_data_frame()

In [None]:
%%time
result = prep[["winery","country"]].groupby(['country'])['winery'].nunique()
result = result.rename_axis(['Country']).rename('totalNrWineries').sort_values(ascending=False).reset_index()
result.head(10)
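
Counting each winery once per country amounts to collecting the winery names into a set per country. A plain-Python sketch on made-up rows (the values are illustrative, not taken from the dataset):

```python
from collections import defaultdict

rows = [
    ("Italy", "Winery A"),
    ("Italy", "Winery A"),   # same winery reviewed twice: must count once
    ("Italy", "Winery B"),
    ("France", "Winery C"),
]

wineries_by_country = defaultdict(set)
for country, winery in rows:
    wineries_by_country[country].add(winery)

# sort by number of distinct wineries, largest first
ranking = sorted(
    ((country, len(names)) for country, names in wineries_by_country.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking[:10])
```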

## Which wineries span multiple provinces?
alt: Which provinces are associated with each winery?

In [None]:
%%time
query = """
MATCH (w:Winery)-[:FROM_PROVENCE]->(p:Province)
WITH w, COLLECT(p.name) AS Provinces, count(p) AS Total
RETURN w.name AS Winery, Provinces, Total
ORDER BY Total DESC LIMIT 10
"""
graph.run(query).to_data_frame()

In [None]:
%%time
result = prep.groupby('winery').agg({'province':[lambda x: x.unique(), lambda x: x.nunique()]}).reset_index()
result.columns = ['Winery', 'Provinces', 'Total']
result = result.sort_values(by='Total',ascending=False).reset_index(drop=True)
result.head(10)

# dataModel (expansion 1)
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model3](../../image/howto_graph/model3.jpg)

# dataPrep (expansion 1)
clean data prior to a load

In [13]:
prep = df.copy()

In [14]:
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')

In [15]:
prep['designation'] = prep['designation'].fillna('No Designation')
prep['taster_name'] = prep['taster_name'].fillna('No Taster')
prep['variety'] = prep['variety'].fillna('No Variety')
prep['title'] = prep['title'].fillna('No Title')

In [16]:
# save file
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataLoading (expansion 1)
Multi-statement queries are only supported in Neo4j Browser and Cypher Shell.  
With py2neo, run each statement in its own call, one after the other.  
  
`MERGE` creates a node only if it does not exist yet, so duplicate values are handled  
`p` is a temporary variable name  
`Province` is the entity name defined in the data model  
`name` is the property name of the entity  
`line.province`: `line` is one record from the dataset, from which the column 'province' is selected  
  
```cypher
MERGE (p: Province {name: (line.province)})
```

In [17]:
%%time
# indexes for additional data
graph.run("""CREATE INDEX ON :Wine(name)""")
graph.run("""CREATE INDEX ON :Taster(name)""")
graph.run("""CREATE INDEX ON :Variety(name)""")
graph.run("""CREATE INDEX ON :Designation(name)""")

CPU times: user 5.16 ms, sys: 319 µs, total: 5.48 ms
Wall time: 1.06 s


<py2neo.database.Cursor at 0x7f0d18458490>

In [18]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (d: Designation {name: (line.designation)})
MERGE (t: Taster {name: (line.taster_name)})
MERGE (v: Variety {name: (line.variety)})
MERGE (c: Country {name: (line.country)})
MERGE (w: Wine {name: line.title})
"""
graph.run(query)

CPU times: user 2.27 ms, sys: 401 µs, total: 2.67 ms
Wall time: 11.9 s


<py2neo.database.Cursor at 0x7f0d195115b0>

In [19]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (d: Designation {name: (line.designation)})
MATCH (t: Taster {name: (line.taster_name)})
MATCH (v: Variety {name: (line.variety)})
MATCH (w: Wine {name: (line.title)})
MATCH (win: Winery {name: (line.winery)})
MERGE (w)-[:FROM_WINERY]->(win)
MERGE (w)-[:HAS_VARIETY]->(v)
MERGE (t)-[:RATES_WINE]->(w)
MERGE (w)-[:HAS_DESIGNATION]->(d)
"""
graph.run(query)

CPU times: user 2.94 ms, sys: 519 µs, total: 3.46 ms
Wall time: 17 s


<py2neo.database.Cursor at 0x7f0d184589a0>

```cypher
CALL db.schema.visualization
```

![CALL db.schema.visualization](../../image/howto_graph/schema3.png)

## Who are the 10 most prolific wine tasters?
note: count the distinct wines tasted instead of the total number of reviews

In [None]:
%%time
query = """
MATCH (t:Taster)
WHERE t.name <> "No Taster"
MATCH (t)-[:RATES_WINE]->(w:Wine)-[:HAS_VARIETY]->(v:Variety)
WITH t, count(DISTINCT w) AS Total, COLLECT(DISTINCT v.name) AS Varieties
RETURN t.name AS Taster, Varieties, Total
ORDER BY Total DESC LIMIT 10
"""
graph.run(query).to_data_frame()

In [None]:
%%time
result = prep[prep.taster_name != "No Taster"]
result = result.groupby(['taster_name']).agg({'variety':[lambda x: list(x.unique())], 'title':[lambda x: x.nunique()]}).reset_index()
result.columns = ['Taster', 'Varieties', 'Total']
result = result.sort_values(by='Total',ascending=False).reset_index(drop=True)
result.head(10)

## Which wine varieties contain the word 'red'?

In [None]:
%%time
query = """
MATCH (v:Variety)
WHERE tolower(v.name) CONTAINS 'red'
RETURN v.name AS redVariety
ORDER BY redVariety
"""
graph.run(query).to_data_frame()

In [None]:
%%time
pd.DataFrame(sorted(prep["variety"][prep["variety"].str.contains('red', case=False)].unique()), columns=["redVariety"])
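
The same case-insensitive substring test in plain Python, on a few sample names. Note that it matches 'red' anywhere in the name, not only as a whole word; the third sample is invented to show that.

```python
varieties = ["Bordeaux-style Red Blend", "Merlot", "Tinta Predicato", "Red Blend", "Merlot"]

# a set removes duplicates, sorted() gives the same ordering as ORDER BY
red_varieties = sorted({v for v in varieties if "red" in v.lower()})
print(red_varieties)
```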

# dataModel (expansion 2)
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model4](../../image/howto_graph/model4.jpg)

# dataPrep (expansion 2)
clean data prior to a load

In [20]:
prep = df.copy()

In [21]:
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')

In [22]:
prep['designation'] = prep['designation'].fillna('No Designation')
prep['taster_name'] = prep['taster_name'].fillna('No Taster')
prep['variety'] = prep['variety'].fillna('No Variety')
prep['title'] = prep['title'].fillna('No Title')

regex generator : http://regex.inginf.units.it/  
regex checker : https://regex101.com/  
neo4j apoc text replace : https://neo4j.com/labs/apoc/4.1/misc/text-functions/#text-functions-regex  
pandas series replace : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html  
pandas series extract : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html  

In [23]:
# extract the first 4-digit year in 1970-1999, 2000-2019 or 2100-2119
prep['year'] = prep['title'].str.extract(r"(([2][0-1][0-1][0-9])|([1][9][7-9][0-9]))")[0]
prep['year'] = prep['year'].fillna('No Year')
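
A quick check of what the year pattern above accepts, using the standard `re` module instead of pandas. The titles other than the Haut-Brion one are invented test strings, and `extract_year` is a hypothetical helper.

```python
import re

YEAR_PATTERN = re.compile(r"(([2][0-1][0-1][0-9])|([1][9][7-9][0-9]))")


def extract_year(title: str) -> str:
    """Return the first matching year in the title, or 'No Year'."""
    match = YEAR_PATTERN.search(title)
    return match.group(1) if match else "No Year"


samples = [
    "Château Haut-Brion 2014 Pessac-Léognan",
    "Sample Winery 1985 Reserve",
    "Sample Winery 2020 Reserve",   # third digit is 2, outside [0-1]
    "Sample Winery NV Brut",
]
years = [extract_year(t) for t in samples]
print(years)
```

The pattern accepts 1970-1999, 2000-2019 and 2100-2119, so a 2020 vintage falls through to 'No Year'.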

In [24]:
prep['wine_group'] = prep['title'].str.replace(r"(\d{4})", '', regex=True) # remove 4-digit numbers (the year)
prep['wine_group'] = prep['wine_group'].str.replace(r"([ ]{2,})", ' ', regex=True) # collapse 2 or more spaces into 1 space
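
The two replacements above, sketched with `re.sub` on a single title so the intermediate double space is visible:

```python
import re

title = "Château Haut-Brion 2014 Pessac-Léognan"
without_year = re.sub(r"\d{4}", "", title)            # leaves two spaces where the year was
wine_group = re.sub(r"[ ]{2,}", " ", without_year)    # collapse runs of spaces
print(repr(without_year))
print(repr(wine_group))
```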

In [25]:
# save file
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataLoading (expansion 2)

In [26]:
%%time
# indexes for additional data
graph.run("""CREATE INDEX ON :WineGroup(name)""")
graph.run("""CREATE INDEX ON :Year(value)""")

CPU times: user 3.48 ms, sys: 524 µs, total: 4.01 ms
Wall time: 2.53 s


<py2neo.database.Cursor at 0x7f0d552b9070>

In [27]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (y: Year {value: (line.year)})
MERGE (wg: WineGroup {name: (line.wine_group)})
"""
graph.run(query)

CPU times: user 3.27 ms, sys: 0 ns, total: 3.27 ms
Wall time: 6.08 s


<py2neo.database.Cursor at 0x7f0d195110a0>

In [28]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w: Wine {name: (line.title)})
MATCH (y: Year {value: (line.year)})
MATCH (wg: WineGroup {name: (line.wine_group)})
MERGE (w)-[:FROM_YEAR]->(y)
MERGE (w)-[:IN_WINE_GROUP]->(wg)
"""
graph.run(query)

CPU times: user 3.48 ms, sys: 0 ns, total: 3.48 ms
Wall time: 18.4 s


<py2neo.database.Cursor at 0x7f0d19d9aa00>

```cypher
CALL db.schema.visualization
```

![CALL db.schema.visualization](../../image/howto_graph/schema4.png)

## Which year had the most wines?
note: count distinct wines

In [None]:
%%time
query = """
MATCH (w:Wine)-[:FROM_YEAR]->(y:Year)
WITH y, collect(w) AS wines
RETURN y.value AS year, size(wines) AS wines ORDER BY wines DESC LIMIT 5
"""
graph.run(query).to_data_frame()

In [None]:
%%time
result = prep[prep.year != "No Year"]
result = result.groupby(['year'])['title'].nunique().reset_index()
result.columns = ['year', 'wines']
result = result.sort_values(by='wines',ascending=False).reset_index(drop=True)
result.head(5)

## Which 5 wineries produced the most wines in a single year?
**note:** the Cypher query counts distinct wine titles

In [None]:
%%time
query = """
MATCH (wy:Winery)<-[:FROM_WINERY]-(w:Wine)-[:FROM_YEAR]->(y:Year)
WITH wy, y, COLLECT(w) AS wines
RETURN wy.name AS Winery, y.value AS Year, size(wines) AS `No of Wines`
ORDER BY `No of Wines` DESC LIMIT 5
"""
graph.run(query).to_data_frame()

In [None]:
%%time
result = prep.groupby(['winery', 'year']).agg({'title':['nunique']}).reset_index()
result.columns = ['Winery', 'Year', 'No of Wines']
result = result.sort_values(by='No of Wines',ascending=False).reset_index(drop=True)
result.head(5)

# dataModel (expansion 3)
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model expansion 5](../../image/howto_graph/model5.jpg)

https://neo4j.com/labs/apoc/4.1/misc/text-functions/#text-functions-text-similarity

clean up  
* different spellings for the same grape, e.g. Aragonez and Aragonês
* different names for the same grape, e.g. Syrah and Shiraz
* different ordering of wine blends, e.g. Cabernet-Shiraz and Shiraz-Cabernet

# dataPrep (expansion 3)
clean data prior to a load

In [29]:
prep = df.copy()
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')
prep['designation'] = prep['designation'].fillna('No Designation')
prep['taster_name'] = prep['taster_name'].fillna('No Taster')
prep['variety'] = prep['variety'].fillna('No Variety')
prep['title'] = prep['title'].fillna('No Title')
prep['year'] = prep['title'].str.extract(r"(([2][0-1][0-1][0-9])|([1][9][7-9][0-9]))")[0] # first year in 1970-1999, 2000-2019 or 2100-2119
prep['year'] = prep['year'].fillna('No Year')
prep['wine_group'] = prep['title'].str.replace(r"(\d{4})", '', regex=True) # remove 4-digit numbers (the year)
prep['wine_group'] = prep['wine_group'].str.replace(r"([ ]{2,})", ' ', regex=True) # collapse 2 or more spaces into 1 space

In [30]:
# tokenize variety: lower-case, then split on spaces and dashes
prep['variety_name'] = prep['variety'].str.lower().str.split('[ ]|[-]')
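
What the split produces for one variety, using `re.split` with the same pattern on a single string:

```python
import re

variety = "Bordeaux-style Red Blend"
tokens = re.split(r"[ ]|[-]", variety.lower())   # split on a space or a dash
print(tokens)
```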

## Apply FuzzyWuzzy to one column using the partial token sort ratio

https://stackoverflow.com/questions/31806695/when-to-use-which-fuzz-function-to-compare-2-strings  

https://github.com/thuynh323/NLP-with-Python/blob/master/FuzzyWuzzy%20-%20Ramen%20Rater%20List/Find%20similar%20strings%20with%20FuzzyWuzzy.ipynb  

Sequence matching using the Levenshtein distance  
(the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other):  
* the token methods ignore case and punctuation  
* `ratio` scores the similarity of two strings (0-100) from their edit distance  
* `token_sort_ratio` tokenizes the strings into words, sorts them in alphanumeric order, then applies `ratio`  
* `token_set_ratio` ignores duplicate words (like a set)  
* `partial_token_sort_ratio` works like `token_sort_ratio` but uses `partial_ratio` instead of `ratio`  

https://medium.com/@laxmi17sarki/string-matching-using-fuzzywuzzy-24be9e85c88d  

String preprocessing: 1. remove everything but letters and numbers, 2. trim whitespace, 3. force to lower case
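
To make the definitions above concrete, here is a small standard-library sketch: a textbook dynamic-programming Levenshtein distance, plus a token-sort comparison built on `difflib.SequenceMatcher`. It is not fuzzywuzzy's implementation and its scores will differ from `fuzz.*`; the helper names are made up.

```python
import re
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def token_sort_key(text: str) -> str:
    """Replace punctuation with spaces, lower-case, then sort the words."""
    words = re.sub(r"[\W_]+", " ", text).lower().split()
    return " ".join(sorted(words))


def token_sort_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after token sorting (difflib, not Levenshtein)."""
    return SequenceMatcher(None, token_sort_key(a), token_sort_key(b)).ratio()


print(levenshtein("Aragonez", "Aragonês"))
print(token_sort_similarity("Cabernet-Shiraz", "Shiraz-Cabernet"))
```

Sorting the tokens first is what makes differently ordered blends compare as equal, while a spelling variant still shows up as a small edit distance.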

In [31]:
# overwrite variety
list_of_strings = prep['variety'].unique().tolist()
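# for each variety, keep the single best-scoring match as (original, match, score)
# caution: partial matching can also map a name to a longer name that contains it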
score_set = [(one_string,) + i
             for one_string in list_of_strings
             for i in process.extract(one_string, list_of_strings, scorer=fuzz.partial_token_sort_ratio, limit=1)]
oldstr_newstr_map = dict([(oldstr, newstr) for oldstr, newstr, score in score_set])
# substitute each value
prep["variety"] = prep['variety'].map(oldstr_newstr_map)

In [32]:
# save file
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataLoading (expansion 3)

In [33]:
%%time
# indexes for additional data
graph.run("""CREATE INDEX ON :VarietyName(name)""")

CPU times: user 1.58 ms, sys: 43 µs, total: 1.62 ms
Wall time: 2.4 s


<py2neo.database.Cursor at 0x7f0d18458460>

In [34]:
%%time
# remove variety
graph.run("""MATCH (v:Variety) DETACH DELETE v""")

CPU times: user 2.33 ms, sys: 0 ns, total: 2.33 ms
Wall time: 13.2 s


<py2neo.database.Cursor at 0x7f0d18458eb0>

**create nodes and relationships from a list, loaded from a csv**  

In [35]:
%%time
# create wine, variety and relationships
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
UNWIND apoc.text.split(line.variety_name, ",") AS varnamelist
FOREACH (varname IN varnamelist|
 MERGE (vn:VarietyName {name:apoc.text.clean(varname)})
 MERGE (v:Variety {name:line.variety})
 MERGE (w:Wine {name:line.title})
 MERGE (w)-[:HAS_VARIETY]->(v)
 MERGE (vn)-[:IS_COMPONENT_OF]->(v)
 )
"""
graph.run(query)

CPU times: user 4.36 ms, sys: 0 ns, total: 4.36 ms
Wall time: 32.4 s


<py2neo.database.Cursor at 0x7f0d18458af0>
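
The `variety_name` column holds Python lists, which `to_csv` writes out in their string form. The Cypher above then splits that string on commas and cleans each piece. A Python approximation of the round trip follows; `clean` imitates what `apoc.text.clean` is documented to do (keep alphanumeric characters, lower-case) and is not the APOC code.

```python
import re

tokens = ["bordeaux", "style", "red", "blend"]
csv_cell = str(tokens)                      # the text that ends up in the CSV cell
print(csv_cell)


def clean(text: str) -> str:
    """Rough stand-in for apoc.text.clean: drop non-alphanumerics, lower-case."""
    return re.sub(r"[^0-9a-zA-Z]", "", text).lower()


# split on commas, then strip brackets, quotes and spaces from each piece
recovered = [clean(piece) for piece in csv_cell.split(",")]
print(recovered)
```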

## Q: Show Variety linked to VarietyName

In [None]:
%%time
query = """
MATCH (vn:VarietyName)-[:IS_COMPONENT_OF]->(v:Variety)
WITH vn, COLLECT(v.name) AS var
RETURN vn.name, var, size(var) AS s
ORDER BY s DESC LIMIT 5
"""
graph.run(query).to_data_frame()

## Q: Which VarietyName have the most Wine?

In [None]:
%%time
query = """
MATCH (vn:VarietyName)-[:IS_COMPONENT_OF]->(v:Variety)<-[:HAS_VARIETY]-(w:Wine)
WITH vn, COLLECT(w) AS wines
RETURN vn.name, size(wines) AS s
ORDER BY s DESC LIMIT 5
"""
graph.run(query).to_data_frame()

# dataModel (expansion 4) - Description data
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model expansion 6](../../image/howto_graph/model6.jpg)

# dataPrep (expansion 4) - Azure API
clean data prior to a load

In [36]:
# this data has been captured using Azure API scripts:
entities = pd.read_parquet("../../data/bronze/winegraph/description_entities.parquet")

In [37]:
entities = entities.drop(columns=['entities', 'entity_confidence_score'])
entities = entities[entities['entity_category'].isin(['Event', 'Location', 'Product'])]

In [38]:
# save file
file2 = "description_entities.csv"
entities.to_csv("../../neo4j/import/"+file2, sep=',', index=False)

# dataLoading (expansion 4)

In [39]:
%%time
# indexes for additional data
graph.run("""CREATE INDEX ON :Description(value)""")
graph.run("""CREATE INDEX ON :Event(value)""")
graph.run("""CREATE INDEX ON :Product(value)""")
graph.run("""CREATE INDEX ON :Location(value)""")

CPU times: user 3.67 ms, sys: 60 µs, total: 3.73 ms
Wall time: 3.12 s


<py2neo.database.Cursor at 0x7f0cbda342b0>

In [40]:
%%time
# create description nodes
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w:Wine {name: (line.title)})
MERGE (d:Description {value: (line.description)})
MERGE (w)-[:HAS_DESCRIPTION]->(d)
"""
display(graph.run(query).stats())

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 119955
labels_removed: 0
nodes_created: 119955
nodes_deleted: 0
properties_set: 119955
relationships_created: 119988
relationships_deleted: 0

CPU times: user 6.14 ms, sys: 0 ns, total: 6.14 ms
Wall time: 22.2 s


In [41]:
%%time
# create Nodes and Relationships
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file2+"""' AS line FIELDTERMINATOR ','
WITH line WHERE line.entity_category = 'Event'
MATCH (d:Description {value:line.document})
MERGE (de:Event {value:line.entity_text})
MERGE (de)-[:EVENT_IN]->(d)
"""
display(graph.run(query).stats())

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file2+"""' AS line FIELDTERMINATOR ','
WITH line WHERE line.entity_category = 'Location'
MATCH (d:Description {value:line.document})
MERGE (dl:Location {value:line.entity_text})
MERGE (dl)-[:LOCATION_IN]->(d)
"""
display(graph.run(query).stats())

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file2+"""' AS line FIELDTERMINATOR ','
WITH line WHERE line.entity_category = 'Product'
MATCH (d:Description {value:line.document})
MERGE (dp:Product {value:line.entity_text})
MERGE (dp)-[:PRODUCT_IN]->(d)
"""
display(graph.run(query).stats())

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 43903
labels_removed: 0
nodes_created: 43903
nodes_deleted: 0
properties_set: 43903
relationships_created: 476509
relationships_deleted: 0

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 6315
labels_removed: 0
nodes_created: 6315
nodes_deleted: 0
properties_set: 6315
relationships_created: 26215
relationships_deleted: 0

constraints_added: 0
constraints_removed: 0
contains_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 95
labels_removed: 0
nodes_created: 95
nodes_deleted: 0
properties_set: 95
relationships_created: 149
relationships_deleted: 0

CPU times: user 9.62 ms, sys: 4.23 ms, total: 13.8 ms
Wall time: 1min 26s


## alternative:  
using APOC for Azure NLP cognitive services:  
https://neo4j.com/labs/apoc/4.1/nlp/azure/#nlp-azure-examples-entities

In [None]:
query = """
MERGE (:Article {
  uri: "https://neo4j.com/blog/pokegraph-gotta-graph-em-all/",
  body: "These days I’m rarely more than a few feet away from my Nintendo Switch and I play board games, card games and role playing games with friends at least once or twice a week. I’ve even organised lunch-time Mario Kart 8 tournaments between the Neo4j European offices!"
})
"""
graph.run(query)

query = """
MERGE (:Article {
  uri: "https://en.wikipedia.org/wiki/Nintendo_Switch",
  body: "The Nintendo Switch is a video game console developed by Nintendo, released worldwide in most regions on March 3, 2017. It is a hybrid console that can be used as a home console and portable device. The Nintendo Switch was unveiled on October 20, 2016. Nintendo offers a Joy-Con Wheel, a small steering wheel-like unit that a Joy-Con can slot into, allowing it to be used for racing games such as Mario Kart 8."
})
"""
graph.run(query)

In [None]:
query = """
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.azure.entities.graph(articles, {
  key: "618f207e871d4ea3a79dd6889b8a6f7c",
  url: "https://westeurope.api.cognitive.microsoft.com/",
  nodeProperty: "body",
  writeRelationshipType: "ENTITY",
  write: true
})
YIELD graph AS g
RETURN g
"""
graph.run(query).to_data_frame()

## Q: find the popular product flavors in the 'merlot' variety

* Skip variety names that are generic words; the exclusion list was added manually, e.g. 'black', 'red', 'white', 'blend', 'style', 'other'
* Match the remaining variety names to the products found in the descriptions
* Keep only the wines whose variety contains 'merlot'
* Count how often each set of distinct products occurs across those wines

In [None]:
%%time
query = """
MATCH (vn:VarietyName)
WHERE NOT vn.name in ['black', 'red', 'white', 'blend', 'style', 'other', 'blanc', 'gris']
WITH vn
MATCH (p:Product {value:vn.name})
WITH p
MATCH (p:Product)-[:PRODUCT_IN]->(d:Description)<-[:HAS_DESCRIPTION]-(w:Wine)-[:HAS_VARIETY]->(v:Variety)
WHERE tolower(v.name) contains('merlot')
WITH w, p ORDER BY p.value
WITH w, collect(DISTINCT p.value) as grapes
RETURN grapes, count(grapes) as popularity order by popularity desc
"""
graph.run(query).to_data_frame()

```cypher
CALL db.schema.visualization
```

![CALL db.schema.visualization](../../image/howto_graph/schema6.png)

# dataModel (expansion 5) - Points
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model expansion 7](../../image/howto_graph/model7.jpg)

# dataLoading (expansion 5)

In [44]:
%%time
# remove relationships between Taster and Wine
query = """
CALL apoc.periodic.commit("
    MATCH (t:Taster)-[r:RATES_WINE]->(w:Wine)
    WITH r LIMIT $limit
    DELETE r
    RETURN COUNT(*)",
    {limit:10})
"""
graph.run(query).to_data_frame()

CPU times: user 7.75 ms, sys: 316 µs, total: 8.07 ms
Wall time: 31.4 s


Unnamed: 0,updates,executions,runtime,batches,failedBatches,batchErrors,failedCommits,commitErrors,wasTerminated
0,7660,766,31,767,0,{},0,{},False
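
`apoc.periodic.commit` reruns the statement in separate transactions until it returns 0, which is why the delete above is capped with `LIMIT $limit`. A plain-Python sketch of that loop on an in-memory list (the function names and the list are stand-ins, not APOC internals):

```python
from typing import Callable, Tuple


def periodic_commit(step: Callable[[int], int], limit: int) -> Tuple[int, int]:
    """Call step(limit) until it reports 0 changes; return (updates, batches)."""
    updates = 0
    batches = 0
    while True:
        changed = step(limit)
        batches += 1
        if changed == 0:
            return updates, batches
        updates += changed


relationships = list(range(25))     # stand-in for 25 RATES_WINE relationships


def delete_some(limit: int) -> int:
    """Delete up to `limit` items and return how many were removed."""
    removed = min(limit, len(relationships))
    del relationships[:removed]
    return removed


result = periodic_commit(delete_some, limit=10)
print(result)
```

The last batch finds nothing left to delete, which is what ends the loop.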


In [None]:
%%time
# indexes for additional data
graph.run("""CREATE INDEX ON :Points(value)""")

In [None]:
%%time
# link each Wine to the Points node given by its Taster (assumes Taster-[:GAVE_POINTS]->Points already exists)
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (t:Taster {name:line.taster_name})-[:GAVE_POINTS]->(p:Points {value:line.points})
MATCH (w:Wine {name:line.title})
MERGE (p)<-[:HAS_POINTS]-(w)
"""
graph.run(query)

## Q: which wines have 100/100 points ?

In [86]:
%%time
query = """
MATCH (p:Points {value:'100'})<-[:HAS_POINTS]-(w:Wine)-[:HAS_VARIETY]-(v:Variety)
MATCH (p)<-[:GAVE_POINTS]-(t:Taster)
RETURN t.name AS `Reviewer`,  w.name AS `Wine title`, v.name AS `Grape variety` ORDER BY `Grape variety`
"""
graph.run(query).to_data_frame()

CPU times: user 9.97 ms, sys: 66 µs, total: 10 ms
Wall time: 44.2 ms


Unnamed: 0,Reviewer,Wine title,Grape variety
0,Paul Gregutt,Château Haut-Brion 2014 Pessac-Léognan,Bordeaux-style Red Blend
1,Kerin O’Keefe,Château Haut-Brion 2014 Pessac-Léognan,Bordeaux-style Red Blend
2,Roger Voss,Château Haut-Brion 2014 Pessac-Léognan,Bordeaux-style Red Blend
3,No Taster,Château Haut-Brion 2014 Pessac-Léognan,Bordeaux-style Red Blend
4,Joe Czerwinski,Château Haut-Brion 2014 Pessac-Léognan,Bordeaux-style Red Blend
...,...,...,...
95,Paul Gregutt,Tenuta dell'Ornellaia 2007 Masseto Merlot (Tos...,Tempranillo-Merlot
96,Kerin O’Keefe,Tenuta dell'Ornellaia 2007 Masseto Merlot (Tos...,Tempranillo-Merlot
97,Roger Voss,Tenuta dell'Ornellaia 2007 Masseto Merlot (Tos...,Tempranillo-Merlot
98,No Taster,Tenuta dell'Ornellaia 2007 Masseto Merlot (Tos...,Tempranillo-Merlot
