Author: Kevin ALBERT  

Created: Oct 2020 

Inspiration: [git repo](https://github.com/lju-lazarevic/wine)

# environment
**cpu:** 2, **mem:** 8G, **disk:** 150GB, **os:** Ubuntu

In [None]:
# ! pip install py2neo pandas
# ! pip install pandas-profiling

In [55]:
# rerun report (delete me later)
import pandas_profiling as pp
pp.ProfileReport(prep, minimal=True, correlations={"cramers": {"calculate": False}}, progress_bar=False).to_file(reportFile)

In [127]:
import dtale
d = dtale.show(prep, host="13.74.11.167", port="40000", ignore_duplicate=True, drop_index=True, reaper_on=False)
# show all running instances
d.main_url()
# stop webapp
# d.kill()

http://13.74.11.167:40000/dtale/main/1


In [1]:
! pip list |grep -i py2neo
! pip list |grep -i pandas

py2neo                        4.2.0
pandas                        1.1.3
pandas-profiling              2.9.0


py2neo is a client library and toolkit for working with Neo4j from within Python applications.  
It is well suited for Data Science workflows and has great integration with other Python Data Science tools.  
[py2neo docs](https://py2neo.org/v4/database.html)

In [1]:
from py2neo import Graph, Node, Relationship
import pandas as pd
from IPython.display import Javascript
import pandas_profiling as pp

In [2]:
neo_server = "13.74.11.167"
user = "neo4j"
passw = "digityser"
file = "winedata.csv"

In [3]:
graph = Graph(host=neo_server, auth=(user, passw))

**delete database neo4j (v4.x):**
```sh
sudo docker-compose down
sudo rm -Rf data/databases/neo4j
sudo rm -Rf data/transactions/neo4j
sudo docker-compose up --build &
```
```cypher
MATCH (n) DETACH DELETE n;
CALL apoc.schema.assert({},{},true) YIELD label, key RETURN *;
```

In [4]:
# delete all nodes and relationships
graph.delete_all()
# delete all indexes and constraints
graph.run("""CALL apoc.schema.assert({},{},true) YIELD label, key RETURN *""")

<py2neo.database.Cursor at 0x7f7c406e80d0>

# dataReport

[link to the original dataset](https://www.kaggle.com/zynicide/wine-reviews/data)  
[link to the git repo dataset](https://github.com/lju-lazarevic/wine/tree/master/data)

In [6]:
# pre-cleaned dataset: deduplicated and cleaned twitter handles
datasetURL = "https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv"
reportFile = "../../data/report/winemag_report.html"

In [7]:
df = pd.read_csv(datasetURL)

In [None]:
pp.ProfileReport(df=df.sample(frac=1),
                 minimal=True,
                 correlations={"cramers": {"calculate": False}}).to_file(reportFile)

In [57]:
# open the report (*.html)
display(Javascript('window.open("{url}");'.format(url=reportFile)))

<IPython.core.display.Javascript object>


# dataPrep
Clean the data before loading it.

In [8]:
prep = df.copy()

In [9]:
# replace nan
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')

In [10]:
# save file to /import
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataModel
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model2](../../image/howto_graph/model2.jpg)

# dataLoading
Multi-statement queries are only supported in Neo4j Browser and cypher-shell.  
With py2neo, each statement has to be run on its own, one after the other.  
  
`MERGE` creates a node only if no matching node exists yet, so duplicate values end up as a single node  
`p` is a temporary variable name  
`Province` is the entity name (label) defined in the data model  
`name` is the property name of the entity  
`line.province` selects the column `province` of `line`, which holds one record of the dataset  
  
```cypher
MERGE (p: Province {name: (line.province)})
```
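
The node-loading cells further down repeat the same `LOAD CSV ... MERGE` text for every label. A minimal sketch of a helper that builds that query string could look like this; `build_merge_query` and its parameters are hypothetical names (not part of py2neo), and the generated text simply mirrors the queries used in this notebook.

```python
def build_merge_query(file, label, column, prop="name", batch=1000):
    """Return a LOAD CSV query that MERGEs one node per distinct column value.

    Hypothetical helper: label, column and prop are interpolated as-is,
    so only trusted identifiers should be passed in.
    """
    return (
        f"USING PERIODIC COMMIT {batch}\n"
        f"LOAD CSV WITH HEADERS FROM 'file:///{file}' AS line FIELDTERMINATOR ','\n"
        f"MERGE (n:{label} {{{prop}: line.{column}}})\n"
    )


query = build_merge_query("winedata.csv", "Province", "province")
print(query)
```

Each generated string can then be passed to `graph.run(...)` one at a time, since py2neo runs a single statement per call.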

In [11]:
# check first 2 lines
! head -n 2 ../../neo4j/import/$file

id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco  (Etna),White Blend,Nicosia


In [12]:
# test data loading
query = """
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
WITH line
LIMIT 1
RETURN line
"""
data = graph.run(query)

In [13]:
next(data)

<Record line={'country': 'Italy', 'taster_name': 'Kerin O’Keefe', 'taster_twitter_handle': '@kerinokeefe', 'description': "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.", 'title': 'Nicosia 2013 Vulkà Bianco  (Etna)', 'points': '87', 'province': 'Sicily & Sardinia', 'variety': 'White Blend', 'price': None, 'designation': 'Vulkà Bianco', 'id': '0', 'winery': 'Nicosia', 'region_1': 'Etna', 'region_2': None}>

In [None]:
# open neo4j dashboard
display(Javascript('window.open("{url}");'.format(url="http://"+neo_server+":7474")))

## createIndex

In [14]:
graph.run("""CREATE INDEX ON :Winery(name)""")
graph.run("""CREATE INDEX ON :Province(name)""")
graph.run("""CREATE INDEX ON :Country(name)""")

<py2neo.database.Cursor at 0x7f119a5b9fa0>

## createNodes

In [15]:
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (w: Winery {name: (line.winery)})
"""
graph.run(query)

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (p: Province {name: (line.province)})
"""
graph.run(query)

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (c: Country {name: (line.country)})
"""
graph.run(query)

<py2neo.database.Cursor at 0x7f119a5c9550>

## createRelations

In [16]:
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w: Winery {name: trim(line.winery)})
MATCH (p: Province {name: trim(line.province)})
MATCH (c: Country {name: trim(line.country)})
MERGE (w)-[:FROM_PROVENCE]->(p)
MERGE (p)-[:PROVINCE_COUNTRY]->(c)
"""
graph.run(query)

<py2neo.database.Cursor at 0x7f1191249d30>
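
One way to sanity-check the load is to compare the relationship counts in Neo4j with the number of distinct pairs in the DataFrame, since `MERGE` creates each relationship only once. The sketch below uses a small made-up frame instead of `prep`; `expected_pairs` is a hypothetical helper name.

```python
import pandas as pd


def expected_pairs(frame, left, right):
    """Number of distinct (left, right) combinations, i.e. the number of
    relationships MERGE should create between the two node types."""
    return len(frame[[left, right]].drop_duplicates())


# toy data standing in for prep (values are made up)
toy = pd.DataFrame({
    "winery": ["A", "A", "B", "B"],
    "province": ["P1", "P1", "P1", "P2"],
    "country": ["X", "X", "X", "X"],
})

print(expected_pairs(toy, "winery", "province"))   # 3
print(expected_pairs(toy, "province", "country"))  # 2
```

On the real data, the same numbers can be compared with the result of `MATCH ()-[r:FROM_PROVENCE]->() RETURN count(r)`.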

## Which 10 countries have the most wineries ?
note: make sure each winery is counted only once

In [17]:
%%time
query = """
MATCH (w:Winery)-[:FROM_PROVENCE]->(p:Province)-[:PROVINCE_COUNTRY]->(c:Country)
RETURN c.name AS Country, count(DISTINCT w) AS totalNrWineries
ORDER BY totalNrWineries DESC LIMIT 10
"""
graph.run(query).to_data_frame()

CPU times: user 0 ns, sys: 7.48 ms, total: 7.48 ms
Wall time: 754 ms


Unnamed: 0,Country,totalNrWineries
0,US,5375
1,France,3864
2,Italy,2934
3,Spain,1435
4,Argentina,531
5,Australia,474
6,Portugal,430
7,Chile,317
8,New Zealand,300
9,South Africa,294


In [18]:
%%time
result = prep[["winery","country"]].groupby(['country'])['winery'].nunique()
result = result.rename_axis(['Country']).rename('totalNrWineries').sort_values(ascending=False).reset_index()
result.head(10)

CPU times: user 53.8 ms, sys: 3.65 ms, total: 57.4 ms
Wall time: 55.1 ms


Unnamed: 0,Country,totalNrWineries
0,US,5375
1,France,3864
2,Italy,2934
3,Spain,1435
4,Argentina,531
5,Australia,474
6,Portugal,430
7,Chile,317
8,New Zealand,300
9,South Africa,294


## Which wineries are across multiple provinces ?
alt: Which provinces are associated with each winery ?

In [19]:
%%time
query = """
MATCH (w:Winery)-[:FROM_PROVENCE]->(p:Province)
WITH w, COLLECT(p.name) AS Provinces, count(p) AS Total
RETURN w.name AS Winery, Provinces, Total
ORDER BY Total DESC LIMIT 10
"""
graph.run(query).to_data_frame()

CPU times: user 4.78 ms, sys: 119 µs, total: 4.9 ms
Wall time: 447 ms


Unnamed: 0,Winery,Provinces,Total
0,Undurraga,"[Colchagua Valley, Maule Valley, Maipo Valley,...",19
1,Concha y Toro,"[Colchagua Valley, Maule Valley, Maipo Valley,...",16
2,Santa Carolina,"[Colchagua Valley, Maule Valley, Maipo Valley,...",14
3,San Pedro,"[Northern Spain, Mendoza Province, Maule Valle...",12
4,Kirkland Signature,"[Northern Spain, California, Mendoza Province,...",12
5,Santa Rita,"[Colchagua Valley, Maipo Valley, Rapel Valley,...",11
6,Bacalhôa Wines of Portugal,"[Douro, Alentejano, Lisboa, Península de Setúb...",11
7,Wines & Winemakers,"[Douro, Tejo, Alentejano, Vinho Verde, Penínsu...",10
8,Tussock Jumper,"[Rheinhessen, California, Other, Colchagua Val...",10
9,Casca Wines,"[Douro, Tejo, Alentejano, Vinho Verde, Lisboa,...",10


In [20]:
%%time
result = prep.groupby('winery').agg({'province':[lambda x: x.unique(), lambda x: x.nunique()]}).reset_index()
result.columns = ['Winery', 'Provinces', 'Total']
result = result.sort_values(by='Total',ascending=False).reset_index(drop=True)
result.head(10)

CPU times: user 2.53 s, sys: 7.45 ms, total: 2.54 s
Wall time: 2.53 s


Unnamed: 0,Winery,Provinces,Total
0,Undurraga,"[Maipo Valley, Leyda Valley, Chile, Cauquenes ...",19
1,Concha y Toro,"[Chile, Central Valley, Maipo Valley, Casablan...",16
2,Santa Carolina,"[Cachapoal Valley, Colchagua Valley, Casablanc...",14
3,San Pedro,"[Lontué Valley, Cachapoal Valley, Maipo Valley...",12
4,Kirkland Signature,"[California, Washington, Bordeaux, Rhône Valle...",12
5,Santa Rita,"[Leyda Valley, Central Valley, Maipo Valley, A...",11
6,Bacalhôa Wines of Portugal,"[Douro, Setubal, Península de Setúbal, Lisboa,...",11
7,Xavier Flouret,"[Central Valley, Bordeaux, Provence, Burgundy,...",10
8,Barton & Guestier,"[France Other, No Province, Bordeaux, Burgundy...",10
9,Echeverria,"[Central Valley, Maipo Valley, Curicó Valley, ...",10


# dataModel (expanded 1)
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model3](../../image/howto_graph/model3.jpg)

![CALL db.schema.visualization](../../image/howto_graph/schema3.png)

# dataPrep (expanded 1)
Clean the data before loading it.

In [21]:
prep = df.copy()

In [22]:
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')

In [23]:
prep['designation'] = prep['designation'].fillna('No Designation')
prep['taster_name'] = prep['taster_name'].fillna('No Taster')
prep['variety'] = prep['variety'].fillna('No Variety')
prep['title'] = prep['title'].fillna('No Title')
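
The per-column `fillna` calls can also be written as a single call with a dictionary, which keeps all placeholder values in one place. A minimal sketch on a made-up frame; the names `placeholders`, `toy` and `filled` are assumptions for illustration.

```python
import pandas as pd

# column -> placeholder used when the value is missing
placeholders = {
    "winery": "No Winery",
    "province": "No Province",
    "country": "No Country",
    "designation": "No Designation",
    "taster_name": "No Taster",
    "variety": "No Variety",
    "title": "No Title",
}

# toy frame standing in for df (values are made up)
toy = pd.DataFrame({
    "winery": ["Nicosia", None],
    "country": [None, "Italy"],
    "points": [87, 90],
})

# only apply the placeholders whose column exists in the frame
present = {col: val for col, val in placeholders.items() if col in toy.columns}
filled = toy.fillna(value=present)
print(filled)
```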

In [24]:
# save file
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataLoading (expanded 1)

In [25]:
# indexes for additional data
graph.run("""CREATE INDEX ON :Wine(id)""")
graph.run("""CREATE INDEX ON :Taster(name)""")
graph.run("""CREATE INDEX ON :Variety(name)""")
graph.run("""CREATE INDEX ON :Designation(name)""")

<py2neo.database.Cursor at 0x7f119969dd30>

In [26]:
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (d: Designation {name: (line.designation)})
"""
graph.run(query)

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (t: Taster {name: (line.taster_name)})
"""
graph.run(query)

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (v: Variety {name: (line.variety)})
"""
graph.run(query)

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (c: Country {name: (line.country)})
MERGE (w: Wine {id: line.id, title: line.title})
"""
graph.run(query)

<py2neo.database.Cursor at 0x7f119969df10>

In [27]:
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (d: Designation {name: (line.designation)})
MATCH (t: Taster {name: (line.taster_name)})
MATCH (v: Variety {name: (line.variety)})
MATCH (w: Wine {id: (line.id)})
MATCH (win: Winery {name: (line.winery)})
MERGE (w)-[:FROM_WINERY]->(win)
MERGE (w)-[:HAS_VARIETY]->(v)
MERGE (t)-[:RATES_WINE]->(w)
MERGE (w)-[:HAS_DESIGNATION]->(d)
"""
graph.run(query)

<py2neo.database.Cursor at 0x7f119969d820>

## Who are the 10 most prolific wine tasters ?
note: count the total number of wines tasted, not the number of unique wines

In [28]:
%%time
query = """
MATCH (t:Taster)
WHERE t.name <> "No Taster"
//WITH t
MATCH (t)-[:RATES_WINE]->(w:Wine)-[:HAS_VARIETY]->(v:Variety)
WITH t, count(w) AS Total, COLLECT(DISTINCT v.name) AS Varieties
RETURN t.name AS Taster, Varieties, Total
ORDER BY Total DESC LIMIT 10
"""
graph.run(query).to_data_frame()

CPU times: user 9.6 ms, sys: 0 ns, total: 9.6 ms
Wall time: 1.61 s


Unnamed: 0,Taster,Varieties,Total
0,Roger Voss,"[Pinot Noir, Chenin Blanc, Gewürztraminer, Spa...",23560
1,Michael Schachner,"[Cabernet Sauvignon, Malbec, White Blend, Temp...",14046
2,Kerin O’Keefe,"[Nebbiolo, Red Blend, Sirica, Primitivo, Sangi...",9697
3,Paul Gregutt,"[Red Blend, Syrah, Cabernet Franc, Chardonnay,...",8868
4,Virginie Boone,"[Zinfandel, Rhône-style Red Blend, Chardonnay,...",8708
5,Matt Kettmann,"[Grenache, Cabernet Sauvignon, Chardonnay, San...",5730
6,Joe Czerwinski,"[Pinot Gris, Cabernet Merlot, Rhône-style Red ...",4766
7,Sean P. Sullivan,"[Merlot, Grenache, Syrah, Rosé, Malbec, Cabern...",4461
8,Anna Lee C. Iijima,"[Pinot Noir, Riesling, Sauvignon Blanc, Tokaji...",4017
9,Jim Gordon,"[Cabernet Sauvignon, Pinot Noir, Merlot, Petit...",3766


In [29]:
%%time
result = prep[prep.taster_name != "No Taster"]
result = result.groupby(['taster_name']).agg({'variety':[lambda x: list(x)], 'title':[lambda x: x.count()]}).reset_index()
result.columns = ['Taster', 'Varieties', 'Total']
result = result.sort_values(by='Total',ascending=False).reset_index(drop=True)
result.head(10)

CPU times: user 77.8 ms, sys: 7.43 ms, total: 85.3 ms
Wall time: 84.5 ms


Unnamed: 0,Taster,Varieties,Total
0,Roger Voss,"[Portuguese Red, Gewürztraminer, Pinot Gris, G...",23560
1,Michael Schachner,"[Tempranillo-Merlot, Malbec, Malbec, Tempranil...",14046
2,Kerin O’Keefe,"[White Blend, Frappato, Nerello Mascalese, Whi...",9697
3,Paul Gregutt,"[Pinot Gris, Pinot Noir, Pinot Noir, Pinot Noi...",8868
4,Virginie Boone,"[Cabernet Sauvignon, Cabernet Sauvignon, Pinot...",8708
5,Matt Kettmann,"[Chardonnay, Merlot, Sauvignon Blanc, Zinfande...",5730
6,Joe Czerwinski,"[Chardonnay, Rosé, Shiraz-Cabernet Sauvignon, ...",4766
7,Sean P. Sullivan,"[Malbec, Cabernet Franc, Bordeaux-style Red Bl...",4461
8,Anna Lee C. Iijima,"[Gewürztraminer, Riesling, Riesling, Riesling,...",4017
9,Jim Gordon,"[Red Blend, Cabernet Franc, White Blend, Grena...",3766


## How many wine varieties contain the word 'red' ?

In [30]:
%%time
query = """
MATCH (v:Variety)
WHERE tolower(v.name) CONTAINS 'red'
RETURN v.name AS redVariety
ORDER BY redVariety
"""
graph.run(query).to_data_frame()

CPU times: user 3.96 ms, sys: 385 µs, total: 4.35 ms
Wall time: 87.5 ms


Unnamed: 0,redVariety
0,Austrian Red Blend
1,Bordeaux-style Red Blend
2,Portuguese Red
3,Provence red blend
4,Red Blend
5,Rhône-style Red Blend


In [31]:
%%time
pd.DataFrame(sorted(prep["variety"][prep["variety"].str.contains('red', case=False)].unique()), columns=["redVariety"])

CPU times: user 68 ms, sys: 3.13 ms, total: 71.2 ms
Wall time: 70.2 ms


Unnamed: 0,redVariety
0,Austrian Red Blend
1,Bordeaux-style Red Blend
2,Portuguese Red
3,Provence red blend
4,Red Blend
5,Rhône-style Red Blend


regex generator : http://regex.inginf.units.it/  
regex checker : https://regex101.com/  
neo4j apoc text replace : https://neo4j.com/labs/apoc/4.1/misc/text-functions/#text-functions-regex  
pandas series replace : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html  
pandas series extract : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html  

**problem:**  
 * for regular expressions, apoc.text only offers `replace` and `regexGroups`
 * regular expressions are a poor fit for negative matching; leave that to the host language (here Python)
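
Because filtering is easier in the host language, the year can also be split from the title in plain Python. The sketch below assumes that a vintage is a 4-digit number between 1900 and 2099, which leaves numbers such as 1492 or 7200 in the title; `split_title` and that range are illustrative assumptions, not what the cells below do, and the second title is made up.

```python
import re

# assumption: a vintage is a 4-digit number between 1900 and 2099
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def split_title(title, default="No Year"):
    """Return (title without the year, year) for one wine title."""
    match = YEAR_RE.search(title)
    if match is None:
        return title, default
    without_year = title[:match.start()] + title[match.end():]
    # collapse the extra whitespace left behind by the removed year
    return " ".join(without_year.split()), match.group(0)


print(split_title("Nicosia 2013 Vulkà Bianco  (Etna)"))
# ('Nicosia Vulkà Bianco (Etna)', '2013')
print(split_title("Cuvée 1492 Rosé"))
# ('Cuvée 1492 Rosé', 'No Year')
```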

In [32]:
query = """
MATCH (w:Wine)
RETURN apoc.text.replace(w.title, '\\\\d{4}', '') AS wineTitle, apoc.text.regexGroups(w.title, '\\\\d{4}')[0][0] AS wineYear
LIMIT 5
"""
graph.run(query).to_data_frame()

Unnamed: 0,wineTitle,wineYear
0,Nicosia Vulkà Bianco (Etna),2013
1,Quinta dos Avidagos Avidagos Red (Douro),2011
2,Rainstorm Pinot Gris (Willamette Valley),2013
3,St. Julian Reserve Late Harvest Riesling (Lak...,2013
4,Sweet Cheeks Vintner's Reserve Wild Child Blo...,2012


In [33]:
result = prep["title"].str.replace(r"\d{4}", "", regex=True)
result = pd.concat([result, prep["title"].str.extract(r"(\d{4})")], axis=1)
result.columns = ["wineTitle", "wineYear"]
result.head(5)

Unnamed: 0,wineTitle,wineYear
0,Nicosia Vulkà Bianco (Etna),2013
1,Quinta dos Avidagos Avidagos Red (Douro),2011
2,Rainstorm Pinot Gris (Willamette Valley),2013
3,St. Julian Reserve Late Harvest Riesling (Lak...,2013
4,Sweet Cheeks Vintner's Reserve Wild Child Blo...,2012


In [34]:
prep['wineTitle'] = result['wineTitle']
prep['wineYear'] = result['wineYear'].fillna('No Year')

In [None]:
prep

In [35]:
# save file
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

In [76]:
prep['wineYear']

0         2013
1         2011
2         2013
3         2013
4         2012
          ... 
119983    2013
119984    2004
119985    2013
119986    2012
119987    2012
Name: wineYear, Length: 119988, dtype: object

In [90]:
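# fails: value_counts() is indexed by year, while the Series is indexed by row, so the mask cannot be aligned
(prep['wineYear'][prep['wineYear'].value_counts() >= 100])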

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

In [80]:
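# every row becomes "No Year": the condition is indexed by year, the Series by row number, so no label lines up
prep['wineYear'].where(prep['wineYear'].value_counts() >= 100, "No Year").value_counts()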

No Year    119988
Name: wineYear, dtype: int64

In [82]:
s = pd.Series(range(5))
s

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [83]:
s.where(s > 1, 10)

0    10
1    10
2     2
3     3
4     4
dtype: int64

In [85]:
# map the counts back onto the rows so that the condition is aligned with the Series
s = prep['wineYear'].where(prep['wineYear'].map(prep['wineYear'].value_counts()) >= 100)
s

In [109]:
count = prep["wineYear"].value_counts()
count

2012    14339
2013    14307
2014    13984
2011    11528
2010    11274
        ...  
3000        1
1935        1
1945        1
1070        1
1982        1
Name: wineYear, Length: 92, dtype: int64

In [110]:
prep["wineYear"]

0         2013
1         2011
2         2013
3         2013
4         2012
          ... 
119983    2013
119984    2004
119985    2013
119986    2012
119987    2012
Name: wineYear, Length: 119988, dtype: object

In [113]:
# the years are stored as strings, so the label has to be a string as well
count["2013"]

14307

In [120]:
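# inverted test: every year that occurs more than twice is relabelled, so nearly all rows become "other"
prep["wineYear"].apply(lambda x: "other" if count[x] > 2 else x)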

0         other
1         other
2         other
3         other
4         other
          ...  
119983    other
119984    other
119985    other
119986    other
119987    other
Name: wineYear, Length: 119988, dtype: object


In [95]:
# Series.apply passes one string at a time, so call map on the Series itself
prep["wineYear"].map(prep["wineYear"].value_counts()) <= 100

In [91]:
prep["wineYear"].where(prep["wineYear"].map(prep["wineYear"].value_counts()) >= 100, "other")

In [128]:
prep['wineYear'].unique()

array(['2013', '2011', '2012', '2010', '2007', '2009', '2008', '2014',
       '2015', 'No Year', '2016', '2004', '2003', '2006', '2001', '2005',
       '2002', '1887', '2000', '1999', '1991', '1997', '1996', '1877',
       '2017', '1995', '1872', '1637', '1868', '1898', '1492', '1998',
       '7200', '1852', '1994', '1992', '1840', '1929', '1912', '1875',
       '1976', '1964', '1848', '1870', '1856', '1983', '1967', '1990',
       '1988', '1827', '1860', '1850', '1000', '1980', '1987', '1989',
       '1993', '1969', '1882', '1935', '1503', '1821', '1973', '1978',
       '1965', '1968', '1947', '1963', '1070', '1985', '1927', '1904',
       '1847', '1982', '1986', '1752', '1789', '1607', '1621', '1919',
       '1957', '1966', '1984', '1961', '1845', '1952', '1150', '1941',
       '1974', '3000', '1934', '1945'], dtype=object)

In [145]:
[x for x in range(5)]

[0, 1, 2, 3, 4]

In [153]:
# dict comprehension: year -> True when the year occurs more than 100 times
{key: value for key, value in (prep['wineYear'].value_counts() > 100).to_dict().items()}

In [None]:
# found it!

In [160]:
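# start from the counts, then overwrite each count with its replacement label
fun = dict(prep['wineYear'].value_counts())
for key, value in (prep['wineYear'].value_counts() > 100).items():
    if value:
        # frequent year: keep it
        print(key, key)
        fun[key] = key
    else:
        # rare year: group it under "No Year"
        print(key, "No Year")
        fun[key] = "No Year"
print(fun)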

2012 2012
2013 2013
2014 2014
2011 2011
2010 2010
2009 2009
2015 2015
2008 2008
2007 2007
2006 2006
No Year No Year
2005 2005
2016 2016
2004 2004
2000 2000
2001 2001
1999 1999
2003 2003
1998 1998
2002 2002
1997 1997
1996 No Year
1995 No Year
1852 No Year
1994 No Year
1898 No Year
1992 No Year
7200 No Year
2017 No Year
1868 No Year
1912 No Year
1877 No Year
1875 No Year
1848 No Year
1989 No Year
1929 No Year
1492 No Year
1988 No Year
1860 No Year
1990 No Year
1985 No Year
1821 No Year
1882 No Year
1840 No Year
1991 No Year
1870 No Year
1850 No Year
1993 No Year
1986 No Year
1963 No Year
1856 No Year
1978 No Year
1980 No Year
1872 No Year
1966 No Year
1150 No Year
1827 No Year
1000 No Year
1964 No Year
1987 No Year
1984 No Year
1983 No Year
1887 No Year
1845 No Year
1637 No Year
1927 No Year
1952 No Year
1965 No Year
1969 No Year
1934 No Year
1974 No Year
1919 No Year
1947 No Year
1904 No Year
1941 No Year
1847 No Year
1973 No Year
1968 No Year
1957 No Year
1789 No Year
1503 No Year
1961

In [163]:
prep['wineYear'].map(fun)

0         2013
1         2011
2         2013
3         2013
4         2012
          ... 
119983    2013
119984    2004
119985    2013
119986    2012
119987    2012
Name: wineYear, Length: 119988, dtype: object

In [162]:
prep['wineYear'].map(fun).unique()

array(['2013', '2011', '2012', '2010', '2007', '2009', '2008', '2014',
       '2015', 'No Year', '2016', '2004', '2003', '2006', '2001', '2005',
       '2002', '2000', '1999', '1997', '1998'], dtype=object)

In [133]:
# per-row frequency of each year
prep['wineYear'].map(prep['wineYear'].value_counts())

In [None]:
# reference example of recoding categories with a dict and Series.map (not related to the wine data)
reimburse_cat_dict = {'nd':'other', 'A':'A', 'B':'B', 'Cat 1':'other', 'Cat 2 (A)':'other', 'Cat 3':'other', 'Cat 4':'other',
       'Cat 5 (D)':'other', 'C':'C', 'Cs':'C', 'Cx':'C', 'Cxg':'C', 'D':'D', 'Csg':'C', 'Ag':'A', 'Bg':'B', 'Cg':'C',
       'Forf Ant':'other', 'Nutri Par':'other', 'Br':'other', 'Ar':'other', 'Cr':'C', 'Csr':'C', 'Cxr':'C',
       'Forf Adm':'other', 'Forf BH':'forf', 'V08':'other', 'Fa':'other', 'Fb':'other', 'Forf 1-3':'other',
       'Forf 4-':'other', 'Ri-D11':'other', 'Ri-T1':'other', 'Ri-T2':'other', 'Ri-T3':'other', 'Ri-D5':'other', 'Ri-D7':'other',
       'Ri-D2':'other', 'Ri-D9':'other', 'Ri-D6':'other', 'Ri-D10':'other', 'Ri-D3':'other', 'Ri-D1':'other', 'Ri-D8':'other',
       'Ri-T4':'other', 'Ri-D4':'other', 'Forf PET':'other', '90-A':'A', '90-B':'B', '90-Fa':'other', '90-Fb':'other',
       'Ri-T5':'other', 'Ri-T6':'other', 'Ri-T7':'other', 'Ri-T8':'other', '90-C':'C', '90-Cs':'C', '90-Cx':'C'}
reimburse_cat["reimbt_crit_long"] = reimburse_cat["reimbt_crit_long"].map(reimburse_cat_dict)

In [73]:
# years that occur at least 100 times
counts = prep['wineYear'].value_counts()
counts[counts >= 100]

In [52]:
# indexes for additional data
graph.run("""CREATE INDEX ON :wineGroup(title)""")
graph.run("""CREATE INDEX ON :Year(value)""")

<py2neo.database.Cursor at 0x7f11987c5d30>

In [53]:
# one wineGroup per title (without the year), so that vintages of the same wine share a group
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (w: wineGroup {title: (line.wineTitle)})
"""
graph.run(query)

<py2neo.database.Cursor at 0x7f119a5c9640>

In [47]:
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (y: Year {value: (line.wineYear)})
"""
graph.run(query)

<py2neo.database.Cursor at 0x7f1198777280>

In [54]:
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w: Wine {id: (line.id)})
MATCH (wg: wineGroup {title: (line.wineTitle)})
MATCH (y: Year {value: (line.wineYear)})
MERGE (w)-[:FROM_YEAR]->(y)
MERGE (w)-[:IN_WINE_GROUP]->(wg)
"""
graph.run(query)

<py2neo.database.Cursor at 0x7f11987775b0>

## Which Year had the most Wine?

In [None]:
query = """
MATCH (w:Wine)-[:FROM_YEAR]->(y:Year)
RETURN y.value AS Year, count(w) AS totalNrWines
ORDER BY totalNrWines DESC
"""
graph.run(query).to_data_frame()

## Which Winery produces the most Wine for a given Year ?

In [None]:
query = """
MATCH (wy:Winery)<-[:FROM_WINERY]-(w:Wine)-[:FROM_YEAR]->(y:Year)
RETURN wy.name AS Winery, y.value AS Year, count(w) AS `No of Wines`
ORDER BY `No of Wines` DESC
"""
graph.run(query).to_data_frame()
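
As with the earlier questions, the result can be cross-checked in pandas. A sketch on a made-up frame that has the same `winery` and `wineYear` columns as `prep` (one row per wine); the values are invented for illustration.

```python
import pandas as pd

# toy data standing in for prep (values are made up)
toy = pd.DataFrame({
    "winery": ["A", "A", "A", "B", "B", "B"],
    "wineYear": ["2012", "2012", "2013", "2012", "2012", "2012"],
})

# count the rows (wines) per winery and year, largest group first
result = (
    toy.groupby(["winery", "wineYear"])
       .size()
       .sort_values(ascending=False)
       .reset_index()
)
result.columns = ["Winery", "Year", "No of Wines"]
print(result)
```

On the real data, `toy` would be replaced by `prep`, and `result.head(10)` shows the top combinations.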