Author: Kevin ALBERT  

Created: Oct 2020 

Inspiration: [git repo](https://github.com/lju-lazarevic/wine)

# environment
**cpu:**2, **mem:**8G, **disk:**150GB, **os:**ubuntu

In [None]:
# ! pip install py2neo pandas
# ! pip install pandas-profiling
# ! pip install jellyfish
# ! pip install fuzzywuzzy
# ! pip install python-Levenshtein
# ! pip install pandas-dedupe
# ! pip install -U nltk

In [None]:
# rerun report (delete me later)
import pandas_profiling as pp
pp.ProfileReport(prep, minimal=True, correlations={"cramers": {"calculate": False}}, progress_bar=False).to_file(reportFile)

In [None]:
import dtale
d = dtale.show(prep, host="13.74.11.167", port="40000", ignore_duplicate=True, drop_index=True, reaper_on=False)
# show all running instances
d.main_url()

In [None]:
# stop webapp
d.kill()

In [None]:
! pip list |grep -i py2neo
! pip list |grep -i pandas
# ! pip list |grep -i jellyfish

py2neo is a client library and toolkit for working with Neo4j from within Python applications.  
It is well suited for Data Science workflows and has great integration with other Python Data Science tools.  
[py2neo docs](https://py2neo.org/v4/database.html)

In [1]:
from py2neo import Graph, Node, Relationship
import pandas as pd
from IPython.display import Javascript
import pandas_profiling as pp
from fuzzywuzzy import process, fuzz
# import pandas_dedupe

# import warnings
# warnings.filterwarnings('ignore')

In [2]:
neo_server = "13.74.11.167"
user = "neo4j"
passw = "digityser"
file = "winedata.csv"

In [3]:
graph = Graph(host=neo_server, auth=(user, passw))

**delete database neo4j (v4.x):**
```sh
sudo docker-compose down
sudo rm -Rf neo4j/data/databases/neo4j
sudo rm -Rf neo4j/data/transactions/neo4j
sudo docker-compose up --build &
```
```cypher
MATCH (n) DETACH DELETE n;
CALL apoc.schema.assert({},{},true) YIELD label, key RETURN *;
```

In [None]:
# delete all nodes and relationships
graph.delete_all()

In [None]:
# delete all indexes and constraints
graph.run("""CALL apoc.schema.assert({},{},true) YIELD label, key RETURN *""")

# dataReport

[link to the original dataset](https://www.kaggle.com/zynicide/wine-reviews/data)  
[link to the git repo dataset](https://github.com/lju-lazarevic/wine/tree/master/data)

In [4]:
# pre-cleaned dataset: deduplicated and cleaned twitter handles
datasetURL = "https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv"
reportFile = "../../data/report/winemag_report.html"

In [5]:
df = pd.read_csv(datasetURL)

In [None]:
%%time
pp.ProfileReport(df=df.sample(frac=1),
                 minimal=True,
                 progress_bar=False,
                 correlations={"cramers": {"calculate": False}}).to_file(reportFile)

In [None]:
# open the report (*.html)
display(Javascript('window.open("{url}");'.format(url=reportFile)))

# dataPrep
clean data prior to a load

In [6]:
prep = df.copy()

In [7]:
# replace nan
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')

In [8]:
# save file to /import
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataModel
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model2](../../image/howto_graph/model2.jpg)

# dataLoading
Multi-statement queries are only supported in Neo4j Browser and Cypher Shell.  
With py2neo, each statement must be run separately, one after another.  
  
`MERGE` creates a node only if no matching node exists yet, so duplicate values yield a single node  
`p` is a temporary variable name  
`Province` is the entity (label) name defined in the data model  
`name` is the property name on the entity  
`line.province`: `line` is one record from the dataset, and `.province` selects its province column  
  
```cypher
MERGE (p: Province {name: (line.province)})
```
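
As a rough analogy only (plain Python, not how Neo4j implements it), `MERGE` behaves like a get-or-create keyed on the label and the matched property. The sketch below uses a hypothetical `merge_node` helper and made-up rows to show that repeated province values end up as a single node.

```python
# Plain-Python analogy for MERGE: get-or-create keyed on (label, name).
# `merge_node` and the sample rows are illustrative; they are not part of py2neo or Neo4j.
nodes = {}

def merge_node(label, name):
    """Return the existing node for (label, name), creating it on first use."""
    key = (label, name)
    if key not in nodes:
        nodes[key] = {"label": label, "name": name}
    return nodes[key]

rows = [
    {"province": "Oregon", "country": "US"},
    {"province": "Oregon", "country": "US"},       # duplicate row: no new nodes
    {"province": "Douro", "country": "Portugal"},
]

for line in rows:
    merge_node("Province", line["province"])
    merge_node("Country", line["country"])

provinces = sorted(name for (label, name) in nodes if label == "Province")
print(provinces)
```

Three input rows produce two `Province` entries and two `Country` entries, which is the deduplicating effect the notebook relies on when loading the CSV.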

In [None]:
# check first 2 lines
! head -n 2 ../../neo4j/import/$file

In [None]:
# test data loading
query = """
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
WITH line
LIMIT 1
RETURN line
"""
data = graph.run(query)

In [None]:
next(data)

In [None]:
# open neo4j dashboard
display(Javascript('window.open("{url}");'.format(url="http://"+neo_server+":7474")))

## createIndex

In [9]:
%%time
graph.run("""CREATE INDEX ON :Winery(name)""")
graph.run("""CREATE INDEX ON :Province(name)""")
graph.run("""CREATE INDEX ON :Country(name)""")

CPU times: user 5.45 ms, sys: 584 µs, total: 6.04 ms
Wall time: 1.87 s


<py2neo.database.Cursor at 0x7f27cfe8bc40>

## createNodes

In [10]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (w: Winery {name: (line.winery)})
MERGE (p: Province {name: (line.province)})
MERGE (c: Country {name: (line.country)})
"""
graph.run(query)

CPU times: user 2.94 ms, sys: 233 µs, total: 3.17 ms
Wall time: 11.9 s


<py2neo.database.Cursor at 0x7f280c4b3280>

## createRelations

In [11]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w: Winery {name: (line.winery)})
MATCH (p: Province {name: (line.province)})
MATCH (c: Country {name: (line.country)})
MERGE (w)-[:FROM_PROVENCE]->(p)
MERGE (p)-[:PROVINCE_COUNTRY]->(c)
"""
graph.run(query)

CPU times: user 659 µs, sys: 3.59 ms, total: 4.25 ms
Wall time: 12.8 s


<py2neo.database.Cursor at 0x7f27cfe9d940>

```cypher
CALL db.schema.visualization
```

![CALL db.schema.visualization](../../image/howto_graph/schema2.png)

## Which 10 countries have the most wineries ?
note: make sure to count each winery only once

In [None]:
%%time
query = """
MATCH (w:Winery)-[:FROM_PROVENCE]->(p:Province)-[:PROVINCE_COUNTRY]->(c:Country)
RETURN c.name AS Country, count(DISTINCT w) AS totalNrWineries
ORDER BY totalNrWineries DESC LIMIT 10
"""
graph.run(query).to_data_frame()

In [None]:
%%time
result = prep[["winery","country"]].groupby(['country'])['winery'].nunique()
result = result.rename_axis(['Country']).rename('totalNrWineries').sort_values(ascending=False).reset_index()
result.head(10)
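
The "count each winery only once" logic in both cells above boils down to collecting wineries into a set per country. A minimal standard-library sketch, using invented rows rather than the wine dataset:

```python
from collections import defaultdict

# Invented (country, winery) pairs; the repeated pair must be counted once.
rows = [
    ("US", "Winery A"),
    ("US", "Winery A"),
    ("US", "Winery B"),
    ("France", "Winery C"),
]

wineries_by_country = defaultdict(set)
for country, winery in rows:
    wineries_by_country[country].add(winery)   # a set ignores repeats

# Sort by the number of distinct wineries, largest first, and keep the top 10.
totals = sorted(
    ((country, len(wineries)) for country, wineries in wineries_by_country.items()),
    key=lambda item: item[1],
    reverse=True,
)[:10]
print(totals)
```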

## Which wineries are across multiple provinces ?
alt: Which provinces are associated with each winery ?

In [None]:
%%time
query = """
MATCH (w:Winery)-[:FROM_PROVENCE]->(p:Province)
WITH w, COLLECT(p.name) AS Provinces, count(p) AS Total
RETURN w.name AS Winery, Provinces, Total
ORDER BY Total DESC LIMIT 10
"""
graph.run(query).to_data_frame()

In [None]:
%%time
result = prep.groupby('winery').agg({'province':[lambda x: x.unique(), lambda x: x.nunique()]}).reset_index()
result.columns = ['Winery', 'Provinces', 'Total']
result = result.sort_values(by='Total',ascending=False).reset_index(drop=True)
result.head(10)

# dataModel (expansion 1)
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model3](../../image/howto_graph/model3.jpg)

# dataPrep (expansion 1)
clean data prior to a load

In [12]:
prep = df.copy()

In [13]:
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')

In [14]:
prep['designation'] = prep['designation'].fillna('No Designation')
prep['taster_name'] = prep['taster_name'].fillna('No Taster')
prep['variety'] = prep['variety'].fillna('No Variety')
prep['title'] = prep['title'].fillna('No Title')

In [15]:
# save file
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataLoading (expansion 1)
Multi-statement queries are only supported in Neo4j Browser and Cypher Shell.  
With py2neo, each statement must be run separately, one after another.  
  
`MERGE` creates a node only if no matching node exists yet, so duplicate values yield a single node  
`p` is a temporary variable name  
`Province` is the entity (label) name defined in the data model  
`name` is the property name on the entity  
`line.province`: `line` is one record from the dataset, and `.province` selects its province column  
  
```cypher
MERGE (p: Province {name: (line.province)})
```
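
Since each statement has to be sent on its own, one option is to split a script on semicolons and hand the pieces to a runner one at a time. The helper names below (`split_statements`, `run_all`) are hypothetical, and the naive split assumes there are no semicolons inside string literals. A stand-in runner is used so the sketch works without a database.

```python
def split_statements(script):
    """Split a Cypher script on ';' and drop empty pieces (naive: no ';' inside strings)."""
    return [part.strip() for part in script.split(";") if part.strip()]

def run_all(script, run):
    """Call `run` once per statement, in order, and return the statements that were sent."""
    statements = split_statements(script)
    for statement in statements:
        run(statement)
    return statements

script = """
CREATE INDEX ON :Wine(name);
CREATE INDEX ON :Taster(name);
"""

executed = []
run_all(script, executed.append)   # stand-in runner that just records each statement
print(executed)
```

With the `graph` object created earlier in this notebook, the stand-in would be replaced by `graph.run`, i.e. `run_all(script, graph.run)`.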

In [16]:
%%time
# indexes for additional data
graph.run("""CREATE INDEX ON :Wine(name)""")
graph.run("""CREATE INDEX ON :Taster(name)""")
graph.run("""CREATE INDEX ON :Variety(name)""")
graph.run("""CREATE INDEX ON :Designation(name)""")

CPU times: user 3.24 ms, sys: 3.8 ms, total: 7.04 ms
Wall time: 1.19 s


<py2neo.database.Cursor at 0x7f280c440850>

In [17]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (d: Designation {name: (line.designation)})
MERGE (t: Taster {name: (line.taster_name)})
MERGE (v: Variety {name: (line.variety)})
MERGE (c: Country {name: (line.country)})
MERGE (w: Wine {name: line.title})
"""
graph.run(query)

CPU times: user 3.95 ms, sys: 0 ns, total: 3.95 ms
Wall time: 15 s


<py2neo.database.Cursor at 0x7f27cfe8b940>

In [18]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (d: Designation {name: (line.designation)})
MATCH (t: Taster {name: (line.taster_name)})
MATCH (v: Variety {name: (line.variety)})
MATCH (w: Wine {name: (line.title)})
MATCH (win: Winery {name: (line.winery)})
MERGE (w)-[:FROM_WINERY]->(win)
MERGE (w)-[:HAS_VARIETY]->(v)
MERGE (t)-[:RATES_WINE]->(w)
MERGE (w)-[:HAS_DESIGNATION]->(d)
"""
graph.run(query)

CPU times: user 4.33 ms, sys: 36 µs, total: 4.37 ms
Wall time: 23 s


<py2neo.database.Cursor at 0x7f27d0723be0>

```cypher
CALL db.schema.visualization
```

![CALL db.schema.visualization](../../image/howto_graph/schema3.png)

## Who are the top 10 most prolific wine tasters ?
note: count the distinct wines tasted rather than the total number of tastings

In [None]:
%%time
query = """
MATCH (t:Taster)
WHERE t.name <> "No Taster"
MATCH (t)-[:RATES_WINE]->(w:Wine)-[:HAS_VARIETY]->(v:Variety)
WITH t, count(DISTINCT w) AS Total, COLLECT(DISTINCT v.name) AS Varieties
RETURN t.name AS Taster, Varieties, Total
ORDER BY Total DESC LIMIT 10
"""
graph.run(query).to_data_frame()

In [None]:
%%time
result = prep[prep.taster_name != "No Taster"]
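# collect distinct varieties so the result matches COLLECT(DISTINCT v.name) in the cypher query
result = result.groupby(['taster_name']).agg({'variety':[lambda x: list(x.unique())], 'title':[lambda x: x.nunique()]}).reset_index()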
result.columns = ['Taster', 'Varieties', 'Total']
result = result.sort_values(by='Total',ascending=False).reset_index(drop=True)
result.head(10)

## How many wine varieties contain the word 'red' ?

In [None]:
%%time
query = """
MATCH (v:Variety)
WHERE tolower(v.name) CONTAINS 'red'
RETURN v.name AS redVariety
ORDER BY redVariety
"""
graph.run(query).to_data_frame()

In [None]:
%%time
pd.DataFrame(sorted(prep["variety"][prep["variety"].str.contains('red', case=False)].unique()), columns=["redVariety"])

# dataModel (expansion 2)
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model4](../../image/howto_graph/model4.jpg)

# dataPrep (expansion 2)
clean data prior to a load

In [19]:
prep = df.copy()

In [20]:
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')

In [21]:
prep['designation'] = prep['designation'].fillna('No Designation')
prep['taster_name'] = prep['taster_name'].fillna('No Taster')
prep['variety'] = prep['variety'].fillna('No Variety')
prep['title'] = prep['title'].fillna('No Title')

regex generator : http://regex.inginf.units.it/  
regex checker : https://regex101.com/  
neo4j apoc text replace : https://neo4j.com/labs/apoc/4.1/misc/text-functions/#text-functions-regex  
pandas series replace : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html  
pandas series extract : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html  

In [22]:
# extract the first year in 1970-2019 (the pattern also accepts 2100-2119, but not 2020-2099)
prep['year'] = prep['title'].str.extract("(([2][0-1][0-1][0-9])|([1][9][7-9][0-9]))")[0]
prep['year'] = prep['year'].fillna('No Year')

In [23]:
prep['wine_group'] = prep['title'].str.replace(r"\d{4}", '', regex=True) # remove every 4-digit run (the year)
prep['wine_group'] = prep['wine_group'].str.replace(r"[ ]{2,}", ' ', regex=True) # collapse 2 or more spaces into 1 space
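
A small standard-library check of the two steps above, using the same patterns with Python's `re` module. The sample titles are invented, and `extract_year` and `wine_group` are illustrative helper names. It also shows which years the pattern accepts: 1970 to 2019 plus 2100 to 2119, so a 2020 vintage is not picked up.

```python
import re

# Same year pattern as in the notebook cell above.
YEAR_PATTERN = re.compile(r"(([2][0-1][0-1][0-9])|([1][9][7-9][0-9]))")

def extract_year(title):
    """Return the first matching year in the title, or 'No Year' if there is none."""
    match = YEAR_PATTERN.search(title)
    return match.group(0) if match else "No Year"

def wine_group(title):
    """Drop every 4-digit run, then collapse repeated spaces into one."""
    without_year = re.sub(r"\d{4}", "", title)
    return re.sub(r"[ ]{2,}", " ", without_year)

print(extract_year("Example Estate 2013 Pinot Noir (Oregon)"))
print(extract_year("Example Estate 2020 Pinot Noir"))
print(wine_group("Example Estate 2013 Pinot Noir (Oregon)"))
```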

In [24]:
# save file
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataLoading (expansion 2)

In [25]:
%%time
# indexes for additional data
graph.run("""CREATE INDEX ON :WineGroup(name)""")
graph.run("""CREATE INDEX ON :Year(value)""")

CPU times: user 2.86 ms, sys: 278 µs, total: 3.14 ms
Wall time: 2.07 s


<py2neo.database.Cursor at 0x7f27cdeb1e80>

In [26]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (y: Year {value: (line.year)})
MERGE (wg: WineGroup {name: (line.wine_group)})
"""
graph.run(query)

CPU times: user 4.34 ms, sys: 32 µs, total: 4.37 ms
Wall time: 8.15 s


<py2neo.database.Cursor at 0x7f27cfe9d970>

In [27]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w: Wine {name: (line.title)})
MATCH (y: Year {value: (line.year)})
MATCH (wg: WineGroup {name: (line.wine_group)})
MERGE (w)-[:FROM_YEAR]->(y)
MERGE (w)-[:IN_WINE_GROUP]->(wg)
"""
graph.run(query)

CPU times: user 3.83 ms, sys: 0 ns, total: 3.83 ms
Wall time: 14.6 s


<py2neo.database.Cursor at 0x7f27f20c3700>

```cypher
CALL db.schema.visualization
```

![CALL db.schema.visualization](../../image/howto_graph/schema4.png)

## Which Year had the most Wine?
note: count distinct (unique) wines

In [None]:
%%time
query = """
MATCH (w:Wine)-[:FROM_YEAR]->(y:Year)
WITH y, collect(w) AS wines
RETURN y.value AS year, size(wines) AS wines ORDER BY wines DESC LIMIT 5
"""
graph.run(query).to_data_frame()

In [None]:
%%time
result = prep[prep.year != "No Year"]
result = result.groupby(['year'])['title'].nunique().reset_index()
result.columns = ['year', 'wines']
result = result.sort_values(by='wines',ascending=False).reset_index(drop=True)
result.head(5)

## Which 5 wineries produced the most wines in a single year ?
**note:** the cypher query is showing the distinct or unique count of wine titles

In [None]:
%%time
query = """
MATCH (wy:Winery)<-[:FROM_WINERY]-(w:Wine)-[:FROM_YEAR]->(y:Year)
WITH wy, y, COLLECT(w) AS wines
RETURN wy.name AS Winery, y.value AS Year, size(wines) AS `No of Wines`
ORDER BY `No of Wines` DESC LIMIT 5
"""
graph.run(query).to_data_frame()

In [None]:
%%time
result = prep.groupby(['winery', 'year']).agg({'title':['nunique']}).reset_index()
result.columns = ['Winery', 'Year', 'No of Wines']
result = result.sort_values(by='No of Wines',ascending=False).reset_index(drop=True)
result.head(5)

# dataModel (expansion 3)
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model expansion 5](../../image/howto_graph/model5.jpg)

https://neo4j.com/labs/apoc/4.1/misc/text-functions/#text-functions-text-similarity

clean up  
* different spellings for the same grape, e.g. Aragonez and Aragonês
* different names for the same grape, e.g. Syrah and Shiraz
* different ordering of wine blends, e.g. Cabernet-Shiraz and Shiraz-Cabernet
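
For the third item, one possible approach (an assumption for illustration, not a step this notebook performs) is to build an order-independent key: lowercase the name, split on dashes and spaces, and sort the tokens. `blend_key` is a hypothetical helper.

```python
import re

def blend_key(variety):
    """Order-independent key: lowercase, split on dashes and spaces, sort the tokens."""
    tokens = [token for token in re.split(r"[ -]+", variety.lower()) if token]
    return "-".join(sorted(tokens))

same_blend = blend_key("Cabernet-Shiraz") == blend_key("Shiraz-Cabernet")
print(blend_key("Cabernet-Shiraz"), same_blend)
```

This only addresses ordering. Different spellings and different names for the same grape still need fuzzy matching or a lookup table.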

# dataPrep (expansion 3)
clean data prior to a load

In [28]:
prep = df.copy()
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')
prep['designation'] = prep['designation'].fillna('No Designation')
prep['taster_name'] = prep['taster_name'].fillna('No Taster')
prep['variety'] = prep['variety'].fillna('No Variety')
prep['title'] = prep['title'].fillna('No Title')
prep['year'] = prep['title'].str.extract("(([2][0-1][0-1][0-9])|([1][9][7-9][0-9]))")[0] # extract the first year in 1970-2019 (also accepts 2100-2119)
prep['year'] = prep['year'].fillna('No Year')
prep['wine_group'] = prep['title'].str.replace(r"\d{4}", '', regex=True) # remove every 4-digit run (the year)
prep['wine_group'] = prep['wine_group'].str.replace(r"[ ]{2,}", ' ', regex=True) # collapse 2 or more spaces into 1 space

In [29]:
# tokenize variety: lowercase, then split on spaces and dashes
prep['variety_name'] = prep['variety'].str.lower().str.split('[ ]|[-]')

## Apply FuzzyWuzzy in one column using token set ratio

https://stackoverflow.com/questions/31806695/when-to-use-which-fuzz-function-to-compare-2-strings  

https://github.com/thuynh323/NLP-with-Python/blob/master/FuzzyWuzzy%20-%20Ramen%20Rater%20List/Find%20similar%20strings%20with%20FuzzyWuzzy.ipynb  

FuzzyWuzzy performs sequence matching based on the Levenshtein distance  
(the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other)  
* `ratio` turns that comparison into a similarity score from 0 to 100
* the token methods ignore case and punctuation
* `token_sort_ratio` splits each string into words, sorts them alphanumerically, then applies `ratio`
* `token_set_ratio` ignores duplicate words (it treats the words as a set)
* `partial_token_sort_ratio` works like `token_sort_ratio` but uses `partial_ratio` instead of `ratio`

https://medium.com/@laxmi17sarki/string-matching-using-fuzzywuzzy-24be9e85c88d  

Before comparing, the token methods preprocess each string: 1. remove everything except letters and numbers, 2. trim whitespace, 3. force to lower case
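
To make the token-sort idea concrete, here is a rough standard-library approximation. It is not FuzzyWuzzy's implementation: it uses `difflib.SequenceMatcher` for the score, and `preprocess`, `ratio` and `token_sort_ratio` below are illustrative re-implementations written for this sketch.

```python
import re
from difflib import SequenceMatcher

def preprocess(text):
    """Replace everything except letters and digits with spaces, lowercase, and trim."""
    return re.sub(r"[\W_]+", " ", text).lower().strip()

def ratio(a, b):
    """Similarity score from 0 to 100, here based on difflib's SequenceMatcher."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort_ratio(a, b):
    """Sort the words of each preprocessed string, then compare the results."""
    sorted_a = " ".join(sorted(preprocess(a).split()))
    sorted_b = " ".join(sorted(preprocess(b).split()))
    return ratio(sorted_a, sorted_b)

# Word order no longer matters once the tokens are sorted.
print(token_sort_ratio("Cabernet-Shiraz", "Shiraz-Cabernet"))
# Without sorting, the same two names do not reach a perfect score.
print(ratio("cabernet shiraz", "shiraz cabernet") < 100)
```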

In [30]:
# overwrite variety with its best fuzzy match
list_of_strings = prep['variety'].unique().tolist()
# for each string, keep only the single best-scoring candidate (limit=1) together with its score
# note: each string is also compared with itself, so the candidates include an exact self-match
score_set = [(one_string,) + i
             for one_string in list_of_strings
             for i in process.extract(one_string, list_of_strings, scorer=fuzz.partial_token_sort_ratio, limit=1)]
# map every original string to its best match
oldstr_newstr_map = {oldstr: newstr for oldstr, newstr, score in score_set}
# substitute each value
prep["variety"] = prep['variety'].map(oldstr_newstr_map)

In [31]:
# save file
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataLoading (expansion 3)

In [32]:
%%time
# indexes for additional data
graph.run("""CREATE INDEX ON :VarietyName(name)""")

CPU times: user 2.18 ms, sys: 101 µs, total: 2.28 ms
Wall time: 2.16 s


<py2neo.database.Cursor at 0x7f27cdeb18b0>

In [45]:
%%time
# remove variety
graph.run("""MATCH (v:Variety) DETACH DELETE v""")
# graph.run("""MATCH (v:VarietyName) DETACH DELETE v""")

CPU times: user 4.04 ms, sys: 144 µs, total: 4.18 ms
Wall time: 5.85 s


<py2neo.database.Cursor at 0x7f27cedc5640>

**create nodes and relationships from a list, loaded from a csv**  
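
Because `variety_name` holds Python lists, `to_csv` writes each one in its string form, for example `['cabernet', 'sauvignon']`. The sketch below mimics in plain Python what the load query then does with that string: split on commas and clean each piece. `clean` is a stand-in written for this example, assuming `apoc.text.clean` keeps only alphanumeric characters and lowercases the result.

```python
import re

def clean(text):
    """Stand-in for apoc.text.clean (assumed behaviour): keep alphanumerics only, lowercased."""
    return re.sub(r"[^0-9a-zA-Z]", "", text).lower()

tokens = ["cabernet", "sauvignon"]
cell = str(tokens)            # how the list ends up in the CSV cell
pieces = cell.split(",")      # counterpart of apoc.text.split(line.variety_name, ",")
names = [clean(piece) for piece in pieces]

print(cell)
print(pieces)
print(names)
```

The brackets, quotes and spaces left over from the list's string form are removed by the cleaning step, which is why the query can work on the serialized column directly.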

In [46]:
%%time
# create wine, variety and relationships
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
UNWIND apoc.text.split(line.variety_name, ",") AS varnamelist
FOREACH (varname IN varnamelist|
 MERGE (vn:VarietyName {name:apoc.text.clean(varname)})
 MERGE (v:Variety {name:line.variety})
 MERGE (w:Wine {name:line.title})
 MERGE (w)-[:HAS_VARIETY]->(v)
 MERGE (vn)-[:IS_COMPONENT_OF]->(v)
 )
"""
graph.run(query)

CPU times: user 0 ns, sys: 3.69 ms, total: 3.69 ms
Wall time: 18.1 s


<py2neo.database.Cursor at 0x7f27cedc5400>