Author: Kevin ALBERT  

Created: Oct 2020 

Inspiration: [git repo](https://github.com/lju-lazarevic/wine)

# environment
**cpu:** 2, **mem:** 8 GB, **disk:** 150 GB, **os:** Ubuntu

In [None]:
# ! pip install py2neo pandas
# ! pip install pandas-profiling

In [None]:
# rerun the report (scratch cell, delete later; needs `prep` and `reportFile` defined further down)
import pandas_profiling as pp
pp.ProfileReport(prep, minimal=True, correlations={"cramers": {"calculate": False}}, progress_bar=False).to_file(reportFile)

In [54]:
import dtale
d = dtale.show(prep, host="13.74.11.167", port="40000", ignore_duplicate=True, drop_index=True, reaper_on=False)
# show the URL of this D-Tale instance
d.main_url()
# stop webapp
# d.kill()

http://13.74.11.167:40000/dtale/main/1


In [None]:
! pip list |grep -i py2neo
! pip list |grep -i pandas

py2neo is a client library and toolkit for working with Neo4j from within Python applications.  
It is well suited for Data Science workflows and has great integration with other Python Data Science tools.  
[py2neo docs](https://py2neo.org/v4/database.html)

In [1]:
from py2neo import Graph, Node, Relationship
import pandas as pd
from IPython.display import Javascript
import pandas_profiling as pp

In [2]:
neo_server = "13.74.11.167"
user = "neo4j"
passw = "digityser"
file = "winedata.csv"

In [3]:
graph = Graph(host=neo_server, auth=(user, passw))

**delete database neo4j (v4.x):**
```sh
sudo docker-compose down
sudo rm -Rf data/databases/neo4j
sudo rm -Rf data/transactions/neo4j
sudo docker-compose up --build &
```
```cypher
MATCH (n) DETACH DELETE n;
CALL apoc.schema.assert({},{},true) YIELD label, key RETURN *;
```

In [None]:
# delete all nodes and relationships
graph.delete_all()

In [None]:
# delete all indexes and constraints
graph.run("""CALL apoc.schema.assert({},{},true) YIELD label, key RETURN *""")

# dataReport

[link to the original dataset](https://www.kaggle.com/zynicide/wine-reviews/data)  
[link to the git repo dataset](https://github.com/lju-lazarevic/wine/tree/master/data)

In [4]:
# pre-cleaned dataset: deduplicated and cleaned twitter handles
datasetURL = "https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv"
reportFile = "../../data/report/winemag_report.html"

In [5]:
df = pd.read_csv(datasetURL)

In [None]:
%%time
pp.ProfileReport(df=df.sample(frac=1),
                 minimal=True,
                 progress_bar=False,
                 correlations={"cramers": {"calculate": False}}).to_file(reportFile)

In [None]:
# open the report (*.html)
display(Javascript('window.open("{url}");'.format(url=reportFile)))

# dataPrep
Clean the data before loading it.

In [6]:
prep = df.copy()

In [7]:
# replace nan
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')

In [8]:
# save file to /import
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataModel
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model2](../../image/howto_graph/model2.jpg)

# dataLoading
Multi-statement queries are only supported in Neo4j Browser and cypher-shell.  
With py2neo, each statement must be run separately, one after another.  
  
`MERGE` creates a node only if no matching node exists, so duplicate values end up as a single node  
`p` is a temporary variable name  
`Province` is the node label defined in the data model  
`name` is the property name on the node  
`line.province` reads the `province` column of `line`, which is one record from the dataset  
  
```cypher
MERGE (p: Province {name: (line.province)})
```
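
The pieces described above can be assembled from Python. This is a minimal sketch rather than part of the original notebook: `build_merge_query` is a hypothetical helper that only builds the query string, following the pattern of the loading cells (without `USING PERIODIC COMMIT`). With a live connection, each string would be passed to `graph.run`.

```python
def build_merge_query(csv_file, label, column, prop="name", var="n"):
    """Build a LOAD CSV + MERGE statement for one node label (illustrative helper)."""
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_file}' AS line FIELDTERMINATOR ','\n"
        f"MERGE ({var}:{label} {{{prop}: line.{column}}})"
    )


# One statement per label, because py2neo runs them one at a time.
queries = [
    build_merge_query("winedata.csv", label, column)
    for label, column in [("Winery", "winery"), ("Province", "province"), ("Country", "country")]
]

for q in queries:
    print(q)
    # graph.run(q)  # assumption: `graph` is the py2neo Graph created earlier
```

Generating the statements from one template keeps the label, property and column names in a single place, which helps when the model grows in the later expansions.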

In [None]:
# check first 2 lines
! head -n 2 ../../neo4j/import/$file

In [None]:
# test data loading
query = """
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
WITH line
LIMIT 1
RETURN line
"""
data = graph.run(query)

In [None]:
next(data)

In [None]:
# open neo4j dashboard
display(Javascript('window.open("{url}");'.format(url="http://"+neo_server+":7474")))

## createIndex

In [9]:
%%time
graph.run("""CREATE INDEX ON :Winery(name)""")
graph.run("""CREATE INDEX ON :Province(name)""")
graph.run("""CREATE INDEX ON :Country(name)""")

CPU times: user 5.28 ms, sys: 20 µs, total: 5.3 ms
Wall time: 975 ms


<py2neo.database.Cursor at 0x7fb82d30fd30>

## createNodes

In [10]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (w: Winery {name: (line.winery)})
"""
graph.run(query)

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (p: Province {name: (line.province)})
"""
graph.run(query)

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (c: Country {name: (line.country)})
"""
graph.run(query)

CPU times: user 4.34 ms, sys: 562 µs, total: 4.9 ms
Wall time: 13.6 s


<py2neo.database.Cursor at 0x7fb8681893a0>

## createRelations

In [11]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w: Winery {name: trim(line.winery)})
MATCH (p: Province {name: trim(line.province)})
MATCH (c: Country {name: trim(line.country)})
MERGE (w)-[:FROM_PROVENCE]->(p)
MERGE (p)-[:PROVINCE_COUNTRY]->(c)
"""
graph.run(query)

CPU times: user 3.15 ms, sys: 219 µs, total: 3.37 ms
Wall time: 11.5 s


<py2neo.database.Cursor at 0x7fb82bbe1dc0>

```cypher
CALL db.schema.visualization
```

![CALL db.schema.visualization](../../image/howto_graph/schema2.png)

## Which 10 countries have the most wineries?
note: count each winery only once

In [12]:
%%time
query = """
MATCH (w:Winery)-[:FROM_PROVENCE]->(p:Province)-[:PROVINCE_COUNTRY]->(c:Country)
RETURN c.name AS Country, count(DISTINCT w) AS totalNrWineries
ORDER BY totalNrWineries DESC LIMIT 10
"""
graph.run(query).to_data_frame()

CPU times: user 4.59 ms, sys: 1.06 ms, total: 5.65 ms
Wall time: 726 ms


Unnamed: 0,Country,totalNrWineries
0,US,5375
1,France,3864
2,Italy,2934
3,Spain,1435
4,Argentina,531
5,Australia,474
6,Portugal,430
7,Chile,317
8,New Zealand,300
9,South Africa,294


In [13]:
%%time
result = prep[["winery","country"]].groupby(['country'])['winery'].nunique()
result = result.rename_axis(['Country']).rename('totalNrWineries').sort_values(ascending=False).reset_index()
result.head(10)

CPU times: user 53.9 ms, sys: 0 ns, total: 53.9 ms
Wall time: 53.1 ms


Unnamed: 0,Country,totalNrWineries
0,US,5375
1,France,3864
2,Italy,2934
3,Spain,1435
4,Argentina,531
5,Australia,474
6,Portugal,430
7,Chile,317
8,New Zealand,300
9,South Africa,294
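
Both results agree because each one counts distinct wineries (`count(DISTINCT w)` in Cypher, `nunique()` in pandas). The sketch below uses a hypothetical toy frame, not the wine dataset, to show how a plain row count would overstate the number of wineries.

```python
import pandas as pd

# Hypothetical toy data: winery "A" appears in two reviews.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "France"],
    "winery": ["A", "A", "B", "C"],
})

# count() counts rows (reviews), nunique() counts distinct wineries.
rows_per_country = toy.groupby("country")["winery"].count()
wineries_per_country = toy.groupby("country")["winery"].nunique()

print(rows_per_country.to_dict())
print(wineries_per_country.to_dict())
```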


## Which wineries are spread across multiple provinces?
alt: Which provinces are associated with each winery?

In [14]:
%%time
query = """
MATCH (w:Winery)-[:FROM_PROVENCE]->(p:Province)
WITH w, COLLECT(p.name) AS Provinces, count(p) AS Total
RETURN w.name AS Winery, Provinces, Total
ORDER BY Total DESC LIMIT 10
"""
graph.run(query).to_data_frame()

CPU times: user 3.33 ms, sys: 935 µs, total: 4.27 ms
Wall time: 360 ms


Unnamed: 0,Winery,Provinces,Total
0,Undurraga,"[Colchagua Valley, Maule Valley, Maipo Valley,...",19
1,Concha y Toro,"[Colchagua Valley, Maule Valley, Maipo Valley,...",16
2,Santa Carolina,"[Colchagua Valley, Maule Valley, Maipo Valley,...",14
3,San Pedro,"[Northern Spain, Mendoza Province, Maule Valle...",12
4,Kirkland Signature,"[Northern Spain, California, Mendoza Province,...",12
5,Santa Rita,"[Colchagua Valley, Maipo Valley, Rapel Valley,...",11
6,Bacalhôa Wines of Portugal,"[Douro, Alentejano, Lisboa, Península de Setúb...",11
7,Wines & Winemakers,"[Douro, Tejo, Alentejano, Vinho Verde, Penínsu...",10
8,Tussock Jumper,"[Rheinhessen, California, Other, Colchagua Val...",10
9,Casca Wines,"[Douro, Tejo, Alentejano, Vinho Verde, Lisboa,...",10


In [15]:
%%time
result = prep.groupby('winery').agg({'province':[lambda x: x.unique(), lambda x: x.nunique()]}).reset_index()
result.columns = ['Winery', 'Provinces', 'Total']
result = result.sort_values(by='Total',ascending=False).reset_index(drop=True)
result.head(10)

CPU times: user 2.5 s, sys: 1.09 ms, total: 2.5 s
Wall time: 2.5 s


Unnamed: 0,Winery,Provinces,Total
0,Undurraga,"[Maipo Valley, Leyda Valley, Chile, Cauquenes ...",19
1,Concha y Toro,"[Chile, Central Valley, Maipo Valley, Casablan...",16
2,Santa Carolina,"[Cachapoal Valley, Colchagua Valley, Casablanc...",14
3,San Pedro,"[Lontué Valley, Cachapoal Valley, Maipo Valley...",12
4,Kirkland Signature,"[California, Washington, Bordeaux, Rhône Valle...",12
5,Santa Rita,"[Leyda Valley, Central Valley, Maipo Valley, A...",11
6,Bacalhôa Wines of Portugal,"[Douro, Setubal, Península de Setúbal, Lisboa,...",11
7,Xavier Flouret,"[Central Valley, Bordeaux, Provence, Burgundy,...",10
8,Barton & Guestier,"[France Other, No Province, Bordeaux, Burgundy...",10
9,Echeverria,"[Central Valley, Maipo Valley, Curicó Valley, ...",10
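
The last three rows of the two tables above differ. A likely explanation, stated here as an assumption and not verified against the data, is that more than three wineries tie at 10 provinces and neither query defines an order among ties. The sketch below uses hypothetical toy values to show one way of making the ranking deterministic in pandas by adding a second sort key.

```python
import pandas as pd

# Hypothetical toy result with a three-way tie on Total.
toy = pd.DataFrame({
    "Winery": ["Delta", "Alpha", "Charlie", "Bravo"],
    "Total": [10, 12, 10, 10],
})

# Sort by Total descending, then by Winery ascending to break ties.
ranked = (
    toy.sort_values(by=["Total", "Winery"], ascending=[False, True])
       .reset_index(drop=True)
)
print(ranked.head(3))
```

On the Cypher side, the same idea would be `ORDER BY Total DESC, Winery`.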


# dataModel (expansion 1)
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model3](../../image/howto_graph/model3.jpg)

# dataPrep (expansion 1)
Clean the data before loading it.

In [16]:
prep = df.copy()

In [17]:
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')

In [18]:
prep['designation'] = prep['designation'].fillna('No Designation')
prep['taster_name'] = prep['taster_name'].fillna('No Taster')
prep['variety'] = prep['variety'].fillna('No Variety')
prep['title'] = prep['title'].fillna('No Title')
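
The per-column `fillna` calls can also be expressed as a single mapping. This is a minimal sketch under assumptions: `placeholders` and `toy` are illustrative names, and the toy frame stands in for `df` so that the example runs on its own.

```python
import pandas as pd

# Placeholder per column, taken from the cells above.
placeholders = {
    "winery": "No Winery",
    "province": "No Province",
    "country": "No Country",
    "designation": "No Designation",
    "taster_name": "No Taster",
    "variety": "No Variety",
    "title": "No Title",
}

# Hypothetical two-row frame standing in for `df`.
toy = pd.DataFrame({
    "winery": ["A", None],
    "country": [None, "US"],
    "points": [90, 88],
})

# Keep only the placeholders whose column exists, then fill them in one call.
applicable = {col: val for col, val in placeholders.items() if col in toy.columns}
filled = toy.fillna(value=applicable)
print(filled)
```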

In [19]:
# save file
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataLoading (expansion 1)
Multi-statement queries are only supported in Neo4j Browser and cypher-shell.  
With py2neo, each statement must be run separately, one after another.  
  
`MERGE` creates a node only if no matching node exists, so duplicate values end up as a single node  
`p` is a temporary variable name  
`Province` is the node label defined in the data model  
`name` is the property name on the node  
`line.province` reads the `province` column of `line`, which is one record from the dataset  
  
```cypher
MERGE (p: Province {name: (line.province)})
```
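
One way to run a multi-statement script through py2neo is to split it and submit the statements one by one. The sketch below is illustrative only: `split_statements`, `run_sequentially` and `RecordingGraph` are hypothetical names, the split is naive (it assumes no `;` inside string literals), and the fake graph object stands in for `py2neo.Graph` so the example runs without a server.

```python
def split_statements(script):
    """Naively split a Cypher script on ';' (assumes no ';' inside string literals)."""
    return [part.strip() for part in script.split(";") if part.strip()]


def run_sequentially(graph, script):
    """Run each statement on its own; `graph` needs a py2neo-style run() method."""
    return [graph.run(statement) for statement in split_statements(script)]


class RecordingGraph:
    """Stand-in for py2neo.Graph that only records what it is asked to run."""

    def __init__(self):
        self.seen = []

    def run(self, statement):
        self.seen.append(statement)
        return statement


script = """
CREATE INDEX ON :Wine(title);
CREATE INDEX ON :Taster(name);
"""
fake = RecordingGraph()
run_sequentially(fake, script)
print(fake.seen)
```

With a real connection, `run_sequentially(graph, script)` would send the same statements to the server in order.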

In [20]:
%%time
# indexes for additional data
graph.run("""CREATE INDEX ON :Wine(title)""")
graph.run("""CREATE INDEX ON :Taster(name)""")
graph.run("""CREATE INDEX ON :Variety(name)""")
graph.run("""CREATE INDEX ON :Designation(name)""")

CPU times: user 5.54 ms, sys: 309 µs, total: 5.85 ms
Wall time: 389 ms


<py2neo.database.Cursor at 0x7fb82bbe1fa0>

In [21]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (d: Designation {name: (line.designation)})
"""
graph.run(query)

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (t: Taster {name: (line.taster_name)})
"""
graph.run(query)

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (v: Variety {name: (line.variety)})
"""
graph.run(query)

query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (c: Country {name: (line.country)})
MERGE (w: Wine {title: line.title})
"""
graph.run(query)

CPU times: user 4.76 ms, sys: 61 µs, total: 4.82 ms
Wall time: 16.7 s


<py2neo.database.Cursor at 0x7fb82aaacee0>

In [22]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (d: Designation {name: (line.designation)})
MATCH (t: Taster {name: (line.taster_name)})
MATCH (v: Variety {name: (line.variety)})
MATCH (w: Wine {title: (line.title)})
MATCH (win: Winery {name: (line.winery)})
MERGE (w)-[:FROM_WINERY]->(win)
MERGE (w)-[:HAS_VARIETY]->(v)
MERGE (t)-[:RATES_WINE]->(w)
MERGE (w)-[:HAS_DESIGNATION]->(d)
"""
graph.run(query)

CPU times: user 80 µs, sys: 3.97 ms, total: 4.05 ms
Wall time: 29.1 s


<py2neo.database.Cursor at 0x7fb86817d3a0>

```cypher
CALL db.schema.visualization
```

![CALL db.schema.visualization](../../image/howto_graph/schema3.png)

## Who are the top 10 most prolific wine tasters?
note: count distinct wines rather than the total number of tastings

In [27]:
%%time
query = """
MATCH (t:Taster)
WHERE t.name <> "No Taster"
MATCH (t)-[:RATES_WINE]->(w:Wine)-[:HAS_VARIETY]->(v:Variety)
WITH t, count(DISTINCT w) AS Total, COLLECT(DISTINCT v.name) AS Varieties
RETURN t.name AS Taster, Varieties, Total
ORDER BY Total DESC LIMIT 10
"""
graph.run(query).to_data_frame()

CPU times: user 5.2 ms, sys: 3.71 ms, total: 8.91 ms
Wall time: 729 ms


Unnamed: 0,Taster,Varieties,Total
0,Roger Voss,"[Bordeaux-style Red Blend, Portuguese Red, Sau...",22973
1,Michael Schachner,"[Cabernet Sauvignon, Pinot Noir, Malbec, Red B...",13944
2,Kerin O’Keefe,"[Nero d'Avola, Sagrantino, Red Blend, Garganeg...",9662
3,Paul Gregutt,"[Merlot, Red Blend, Pinot Noir, Pinot Gris, Ch...",8856
4,Virginie Boone,"[Pinot Noir, Cabernet Sauvignon, Chardonnay, Z...",8689
5,Matt Kettmann,"[Chardonnay, Rhône-style Red Blend, Cabernet S...",5698
6,Joe Czerwinski,"[Chardonnay, Rhône-style Red Blend, Rhône-styl...",4753
7,Sean P. Sullivan,"[Merlot, Bordeaux-style Red Blend, Syrah, Red ...",4448
8,Anna Lee C. Iijima,"[White Blend, Riesling, Pinot Noir, Gewürztram...",4012
9,Jim Gordon,"[Pinot Noir, Chardonnay, Muscat Canelli, Merlo...",3750


In [28]:
%%time
result = prep[prep.taster_name != "No Taster"]
result = result.groupby(['taster_name']).agg({'variety':[lambda x: list(x)], 'title':[lambda x: x.nunique()]}).reset_index()
result.columns = ['Taster', 'Varieties', 'Total']
result = result.sort_values(by='Total',ascending=False).reset_index(drop=True)
result.head(10)

CPU times: user 109 ms, sys: 3.94 ms, total: 113 ms
Wall time: 112 ms


Unnamed: 0,Taster,Varieties,Total
0,Roger Voss,"[Portuguese Red, Gewürztraminer, Pinot Gris, G...",22973
1,Michael Schachner,"[Tempranillo-Merlot, Malbec, Malbec, Tempranil...",13944
2,Kerin O’Keefe,"[White Blend, Frappato, Nerello Mascalese, Whi...",9662
3,Paul Gregutt,"[Pinot Gris, Pinot Noir, Pinot Noir, Pinot Noi...",8856
4,Virginie Boone,"[Cabernet Sauvignon, Cabernet Sauvignon, Pinot...",8689
5,Matt Kettmann,"[Chardonnay, Merlot, Sauvignon Blanc, Zinfande...",5698
6,Joe Czerwinski,"[Chardonnay, Rosé, Shiraz-Cabernet Sauvignon, ...",4753
7,Sean P. Sullivan,"[Malbec, Cabernet Franc, Bordeaux-style Red Bl...",4448
8,Anna Lee C. Iijima,"[Gewürztraminer, Riesling, Riesling, Riesling,...",4012
9,Jim Gordon,"[Red Blend, Cabernet Franc, White Blend, Grena...",3750
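
The `Total` columns match, but the `Varieties` columns are not equivalent: the Cypher query uses `COLLECT(DISTINCT v.name)`, while the pandas version keeps repeats (for example `Malbec, Malbec` above). The sketch below shows one way to deduplicate on the pandas side. It uses a hypothetical toy frame, and sorting the varieties is a choice made here for a stable result, not something the Cypher query does.

```python
import pandas as pd

# Hypothetical toy reviews: taster T1 rated Malbec twice.
toy = pd.DataFrame({
    "taster_name": ["T1", "T1", "T1", "T2"],
    "variety": ["Malbec", "Malbec", "Merlot", "Riesling"],
    "title": ["w1", "w2", "w3", "w4"],
})

result = (
    toy.groupby("taster_name")
       .agg(Varieties=("variety", lambda x: sorted(set(x))),  # distinct varieties
            Total=("title", "nunique"))                        # distinct wines
       .reset_index()
       .rename(columns={"taster_name": "Taster"})
       .sort_values(by="Total", ascending=False)
       .reset_index(drop=True)
)
print(result)
```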


## How many wine varieties contain the word 'red'?

In [29]:
%%time
query = """
MATCH (v:Variety)
WHERE tolower(v.name) CONTAINS 'red'
RETURN v.name AS redVariety
ORDER BY redVariety
"""
graph.run(query).to_data_frame()

CPU times: user 4.58 ms, sys: 1.11 ms, total: 5.69 ms
Wall time: 50.8 ms


Unnamed: 0,redVariety
0,Austrian Red Blend
1,Bordeaux-style Red Blend
2,Portuguese Red
3,Provence red blend
4,Red Blend
5,Rhône-style Red Blend


In [30]:
%%time
pd.DataFrame(sorted(prep["variety"][prep["variety"].str.contains('red', case=False)].unique()), columns=["redVariety"])

CPU times: user 64.7 ms, sys: 0 ns, total: 64.7 ms
Wall time: 63.8 ms


Unnamed: 0,redVariety
0,Austrian Red Blend
1,Bordeaux-style Red Blend
2,Portuguese Red
3,Provence red blend
4,Red Blend
5,Rhône-style Red Blend
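
Both cells list the matching varieties, so the answer to "how many" is the number of rows. The sketch below computes the count directly. It is illustrative: the series contains the six names from the output above plus two non-matching names, not the full dataset.

```python
import pandas as pd

# Variety names from the output above, plus two names that should not match.
varieties = pd.Series([
    "Austrian Red Blend", "Bordeaux-style Red Blend", "Portuguese Red",
    "Provence red blend", "Red Blend", "Rhône-style Red Blend",
    "Chardonnay", "Pinot Noir",
])

# Case-insensitive substring match, then count the distinct names.
red_count = varieties[varieties.str.contains("red", case=False)].nunique()
print(red_count)
```

In Cypher, the same count could be returned with `RETURN count(v) AS redVarieties` in place of the `RETURN ... ORDER BY` clause.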


# dataModel (expansion 2)
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model4](../../image/howto_graph/model4.jpg)

# dataPrep (expansion 2)
Clean the data before loading it.

In [31]:
prep = df.copy()

In [32]:
prep['winery'] = prep['winery'].fillna('No Winery')
prep['province'] = prep['province'].fillna('No Province')
prep['country'] = prep['country'].fillna('No Country')

In [33]:
prep['designation'] = prep['designation'].fillna('No Designation')
prep['taster_name'] = prep['taster_name'].fillna('No Taster')
prep['variety'] = prep['variety'].fillna('No Variety')
prep['title'] = prep['title'].fillna('No Title')

regex generator : http://regex.inginf.units.it/  
regex checker : https://regex101.com/  
neo4j apoc text replace : https://neo4j.com/labs/apoc/4.1/misc/text-functions/#text-functions-regex  
pandas series replace : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html  
pandas series extract : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html  

In [34]:
# extract years 1970-2119
prep['year'] = prep['title'].str.extract("(([2][0-1][0-1][0-9])|([1][9][7-9][0-9]))")[0]
prep['year'] = prep['year'].fillna('No Year')

In [35]:
prep['wine_group'] = prep['title'].str.replace(r"\d{4}", '', regex=True) # remove every 4-digit number
prep['wine_group'] = prep['wine_group'].str.replace(r" {2,}", ' ', regex=True) # collapse 2 or more spaces into 1 space
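
To see what the two cells above do to a single title, here is a standard-library sketch of the same regular expressions. `split_title` and both sample titles are hypothetical and are not taken from the dataset.

```python
import re

# Same year range as the cell above: 1970-2119.
YEAR_PATTERN = re.compile(r"(2[0-1][0-1][0-9]|19[7-9][0-9])")


def split_title(title):
    """Return (year, wine_group) for a title; year is 'No Year' when absent."""
    match = YEAR_PATTERN.search(title)
    year = match.group(1) if match else "No Year"
    group = re.sub(r"\d{4}", "", title)   # drop every 4-digit run
    group = re.sub(r" {2,}", " ", group)  # collapse repeated spaces
    return year, group


# Hypothetical titles
print(split_title("Example Winery 2013 Reserve Red (Napa)"))
print(split_title("Old Cellar 1965 Port"))
```

The second title shows a side effect of using two different patterns: `1965` falls outside the year range, so the year is `No Year`, yet the number is still removed from the wine group.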

In [36]:
# save file
prep.to_csv("../../neo4j/import/"+file, sep=',', index=False)

# dataLoading (expansion 2)

In [41]:
%%time
# indexes for additional data
graph.run("""CREATE INDEX ON :WineGroup(title)""")
graph.run("""CREATE INDEX ON :Year(value)""")

CPU times: user 3.66 ms, sys: 0 ns, total: 3.66 ms
Wall time: 155 ms


<py2neo.database.Cursor at 0x7fb829ffce50>

In [42]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (y: Year {value: (line.year)})
"""
graph.run(query)

CPU times: user 2.72 ms, sys: 0 ns, total: 2.72 ms
Wall time: 2.82 s


<py2neo.database.Cursor at 0x7fb827699340>

In [43]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MERGE (wg: WineGroup {title: (line.wine_group)})
"""
graph.run(query)

CPU times: user 2.71 ms, sys: 155 µs, total: 2.86 ms
Wall time: 9.8 s


<py2neo.database.Cursor at 0x7fb829fe78e0>

In [44]:
%%time
query = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///"""+file+"""' AS line FIELDTERMINATOR ','
MATCH (w: Wine {title: (line.title)})
MATCH (y: Year {value: (line.year)})
MATCH (wg: WineGroup {title: (line.wine_group)})
MERGE (w)-[:FROM_YEAR]->(y)
MERGE (w)-[:IN_WINE_GROUP]->(wg)
"""
graph.run(query)

CPU times: user 3.57 ms, sys: 233 µs, total: 3.81 ms
Wall time: 22.9 s


<py2neo.database.Cursor at 0x7fb827699460>

```cypher
CALL db.schema.visualization
```

![CALL db.schema.visualization](../../image/howto_graph/schema4.png)

## Which year had the most wines?
note: count distinct wines

In [46]:
%%time
query = """
MATCH (w:Wine)-[:FROM_YEAR]->(y:Year)
WITH y, collect(w) AS wines
RETURN y.value AS year, size(wines) AS wines ORDER BY wines DESC LIMIT 5
"""
graph.run(query).to_data_frame()

CPU times: user 4.45 ms, sys: 0 ns, total: 4.45 ms
Wall time: 199 ms


Unnamed: 0,year,wines
0,2012,14302
1,2013,14261
2,2014,13914
3,2011,11504
4,2010,11228


In [47]:
%%time
result = prep[prep.year != "No Year"]
result = result.groupby(['year'])['title'].nunique().reset_index()
result.columns = ['year', 'wines']
result = result.sort_values(by='wines',ascending=False).reset_index(drop=True)
result.head(5)

CPU times: user 147 ms, sys: 12.2 ms, total: 159 ms
Wall time: 160 ms


Unnamed: 0,year,wines
0,2012,14302
1,2013,14261
2,2014,13914
3,2011,11504
4,2010,11228


## Which 5 wineries produced the most wines in a single year?
**note:** the Cypher query counts distinct wine titles

In [111]:
%%time
query = """
MATCH (wy:Winery)<-[:FROM_WINERY]-(w:Wine)-[:FROM_YEAR]->(y:Year)
WITH wy, y, COLLECT(w) AS wines
RETURN wy.name AS Winery, y.value AS Year, size(wines) AS `No of Wines`
ORDER BY `No of Wines` DESC LIMIT 5
"""
graph.run(query).to_data_frame()

CPU times: user 5.6 ms, sys: 0 ns, total: 5.6 ms
Wall time: 436 ms


Unnamed: 0,Winery,Year,No of Wines
0,Wines & Winemakers,2013,39
1,Georges Duboeuf,2015,38
2,Wines & Winemakers,2014,38
3,Georges Duboeuf,2014,37
4,Louis Latour,2014,37


In [116]:
%%time
result = prep.groupby(['winery', 'year']).agg({'title':['nunique']}).reset_index()
result.columns = ['Winery', 'Year', 'No of Wines']
result = result.sort_values(by='No of Wines',ascending=False).reset_index(drop=True)
result.head(5)

CPU times: user 166 ms, sys: 6 µs, total: 166 ms
Wall time: 164 ms


Unnamed: 0,Winery,Year,No of Wines
0,Wines & Winemakers,2013,39
1,Georges Duboeuf,2015,38
2,Wines & Winemakers,2014,38
3,Louis Latour,2014,37
4,Georges Duboeuf,2014,37


# dataModel (expansion 3)
[link to Arrows for data modelling](http://www.apcjones.com/arrows/#)

![model expansion 5](../../image/howto_graph/model5.jpg)