# Blinded in the Hedonism Wine Shop

Hedonism is a luxury wine merchant in central London, near Berkeley square.

The wines retailed there aren't  what we'd call cheap.

How can we make sense of a bottle that sells for £30 and one that sells for £3,000?

Well, most very expensive bottles are from Burgundy and highly collectible, which makes them speculative and distorts prices. That much is conventional wisdom.

However, how accurate are we to assume that such an easy criterion of high price per centiliter can be used to detect the "Burgundy" property of a wine on a list?

Let's use a decision tree and see what it can teach us on how to blindly detect a Burgundy on a wine list (ie with no knowledge of the wine name).


[NB: as this isnt a CNN, and given the explainability of the tree, we provide no model card.]

In [1]:
import numpy as np
import pandas as pd
from sklearn import tree, ensemble
from sklearn.tree import export_text

# Data source

We imported a csv from https://hedonism.co.uk/ and cleansed the dataset for a clean import into Jupyter.

More data processing below.

In [32]:
# Load an Excel file into a pandas DataFrame
df = pd.read_csv('hedonism.csv', sep=';')
df.head()

Unnamed: 0,Code,Title,Vintage,Size,ABV,Style,Country,Group,Available,Price (inc VAT),Price (ex-VAT),Link
0,HED27286,Ruppert Leroy 11 12 13,,75cl,12.0,White,France,Champagne,2,79.0,65.83,https://hedonism.co.uk/product/ruppert-leroy-1...
1,HED1936,Jura Vin Jaune 1774,1774.0,62cl,12.0,White,France,Regional France,1,42400.0,35333.33,https://hedonism.co.uk/product/jura-vin-jaune-...
2,HED94290,Constantia Sweet Wine Half 1791,1791.0,37.5cl,,White,South Africa,South Africa,1,16600.0,13833.33,https://hedonism.co.uk/product/constantia-swee...
3,HED2234,Yquem 1847,1847.0,75cl,13.0,White,France,Sauternes & Barsac,1,59900.0,49916.67,https://hedonism.co.uk/product/yquem-1847
4,HED33509,Yquem (Recorked 1996) 1873,1873.0,75cl,13.0,White,France,Sauternes & Barsac,1,10700.0,8916.67,https://hedonism.co.uk/product/yquem-recorked-...


### Wine Drops
Price inc or ex VAT contain the same information so we drop one column.

We also drop the product Code which is arbitrary and the link to the website which is destined for human consumption.

We decide to keep the "available" column which may be indicative of scarcity, and scarce collectibles are more valuable.

In [34]:
df = df.drop(['Code', 'Price (ex-VAT)', 'Link'], axis=1)
df = df.drop(['Title'], axis=1)
# The df has thousands of rows... loads to learn. Let's drop NaNs and see how many lines are left.

# Replace empty strings with NaN, then drop rows with NaN in 'Vintage'
df['Vintage'].replace('', float('NaN'), inplace=True)
df = df.dropna(subset=['Vintage'])
df['ABV'].replace('', float('NaN'), inplace=True)
df = df.dropna(subset=['ABV'])
# This gets rid of glassware items on the list.

# Cleansed dataset, normalisation

With about 7k lines left, that leaves us still plenty of drinking pleasure to look forward to.

We then normalise all prices by the bottle size to get a sterling price per cl.

In [35]:
df['Size'] = df['Size'].str.replace('cl', '').astype(float)
df['PricePerCl'] = df['Price (inc VAT)'] / df['Size']
# Drop the Price (inc VAT) column as with Size it now fully determines Normalise price. 
df = df.drop(['Price (inc VAT)'], axis=1)

# Categories

We surmise that reds are dearer than whites and that French wines are dearer than the rest, in particular if they're from Burgundy in older vintages.

Let's learn from data and see if red Burgs are indeed detectable from price.

One clear property is that they should be in France, so that much should be learnt easily from the tree.

We'll categorise Style and Country, before setting Group (aka region) to true for Burgundy and false for the rest. 

In [36]:
# Convert 'Style', 'Country', and 'Group' to dummy variables
df = pd.get_dummies(df, columns=['Style', 'Country'])
df['is_burgundy'] = df['Group'] == 'Burgundy'
df.pop('Group')

1           Regional France
3        Sauternes & Barsac
4        Sauternes & Barsac
5                   Madeira
6                    Sherry
               ...         
6994               Burgundy
6995         Northern Rhone
6996                    USA
6997    Non-Alcoholic Wines
6998                  Italy
Name: Group, Length: 6796, dtype: object

# Shuffling the data

In [37]:
# Convert all columns to numeric, coercing errors to NaN
df_cleaned = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
# Optionally, fill NaN values (you can also drop them if preferred)
df_cleaned = df_cleaned.fillna(df_cleaned.median())  # Replace NaN with median
df_cleaned

Unnamed: 0,Vintage,Size,ABV,Available,PricePerCl,Style_Red,Style_Rose,Style_White,Country_Argentina,Country_Armenia,...,Country_Slovakia,Country_Slovenia,Country_South Africa,Country_Spain,Country_Switzerland,Country_Syria,Country_Ukraine,Country_United States,Country_Uruguay,is_burgundy
1,1774.0,62.0,12.0,1,683.870968,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
3,1847.0,75.0,13.0,1,798.666667,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
4,1873.0,75.0,13.0,1,142.666667,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
5,1875.0,75.0,20.0,1,15.200000,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
6,1891.0,75.0,20.0,1,73.600000,False,False,True,False,False,...,False,False,False,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6994,2020.0,75.0,12.5,4,1.228000,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,True
6995,2021.0,75.0,13.5,4,0.452000,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6996,2021.0,75.0,13.0,11,0.400000,False,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
6997,2022.0,75.0,0.3,9,0.145333,True,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False


In [38]:
Xy=np.array(df_cleaned)
seed = np.random.seed(1)
np.random.shuffle(Xy)
X=Xy[:,:-1]
y=Xy[:,-1]

total_size = len(X)
print(total_size)
print(len(y))
# Define sizes for training, validation, and test sets as percentages
train_size_percentage = 0.50
val_size_percentage = 0.25
# Calculate the number of rows for each split
train_size = int(total_size * train_size_percentage)
val_size = int(total_size * val_size_percentage)
# Split the data
X_train = X[:train_size]
X_val = X[train_size:train_size + val_size]
X_test = X[train_size + val_size:]

y=y.astype('int')  # don't forget this !
y_train = y[:train_size]
y_val = y[train_size:train_size + val_size]
y_test = y[train_size + val_size:]

6796
6796


# Naive Benchmarks

First we assume all wines are Burgundies.

In [39]:
acc_train = np.sum(y_train) / len(y_train)
acc_val = np.sum(y_val) / len(y_val)
print ( 'Naïve guess train and validation', acc_train , acc_val)

Naïve guess train and validation 0.20806356680400234 0.20659211300765157


About 1 in 5 actually is.

Then we wonder how are we doing if we assume that any French wine above 10GBP per cl is from Burgundy?

In [40]:
condition =  (df['Country_France'] == True) & (df['PricePerCl'] > 10)
# Step 2: Compare the condition with the actual boolean array
correct_predictions = condition == (df['is_burgundy'] == True)
# Step 3: Calculate the accuracy (number of correct predictions / total)
accuracy = np.sum(correct_predictions) / len(correct_predictions)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8172454384932313


That's not too poor a guess!
Let's see if a DT adds substantial value.

In [41]:
# Define the DecisionTreeClassifier
clf = tree.DecisionTreeClassifier()
# Fit X_train and y_train
clf.fit(X_train, y_train)

In [42]:
print ( 'Full tree guess train/validation ',clf.score(X_train, y_train),clf.score(X_val, y_val))

Full tree guess train/validation  0.9997057092407299 0.8893466745144203


In [45]:
# Looks like we overfitted, let's look at the rules and see how complicated they are, 
# before we cut down the tree depth.
df2 = df.copy()
df2.drop(columns='is_burgundy', axis=1, inplace=True)
feature_names = df2.columns.tolist()
# print(feature_names)
tree_rules = export_text(clf, feature_names=feature_names)
# print(tree_rules)

Rules look super complicated, Occam wouldn't be proud.

In [46]:
bestdepth=-1
bestscore=0
max_depth = 15

for i in range(max_depth):
    clf = tree.DecisionTreeClassifier(max_depth = i+1)
    #fit the training sets
    clf.fit(X_train,y_train)
    #update trainscore
    trainscore=clf.score(X_train, y_train)
    #update valscore
    valscore=clf.score(X_val, y_val)
    print( 'Depth:', i+1, 'Train Score:', trainscore, 'Validation Score:', valscore)
    if  valscore>bestscore  :
        #update bestscore
        bestscore=valscore
        #update depth
        bestdepth=i+1
bestdepth

Depth: 1 Train Score: 0.7919364331959976 Validation Score: 0.7934078869923484
Depth: 2 Train Score: 0.8134196586227193 Validation Score: 0.8145968216597999
Depth: 3 Train Score: 0.8608004708652148 Validation Score: 0.8404944084755739
Depth: 4 Train Score: 0.8755150088287228 Validation Score: 0.8563861094761624
Depth: 5 Train Score: 0.8866980576809889 Validation Score: 0.8628605061801059
Depth: 6 Train Score: 0.898175397292525 Validation Score: 0.872277810476751
Depth: 7 Train Score: 0.9111241907004121 Validation Score: 0.8887580929958799
Depth: 8 Train Score: 0.9187757504414361 Validation Score: 0.8893466745144203
Depth: 9 Train Score: 0.9287816362566216 Validation Score: 0.8964096527369041
Depth: 10 Train Score: 0.9446733372572101 Validation Score: 0.8969982342554443
Depth: 11 Train Score: 0.9558563861094762 Validation Score: 0.8975868157739847
Depth: 12 Train Score: 0.9661565626839317 Validation Score: 0.8928781636256622
Depth: 13 Train Score: 0.9764567392583873 Validation Score: 0.8

11

Depth of 11 is optimal but 3 looks more than enough, and better than our naive approach.
We now retrain using the full training and validating data.

In [49]:
X_trainval=X[:train_size + train_size + val_size,:]
y_trainval = y[:train_size + train_size + val_size]

In [50]:
my_best_depth = 3
clf = tree.DecisionTreeClassifier(max_depth = my_best_depth)
clf.fit(X_trainval, y_trainval)

In [52]:
test_score = clf.score(X_test, y_test)
print('testing set score', test_score)

testing set score 0.8534432018834609


In [53]:
# Looks like we overfitted, let's look at the rules and see how complicated they are, 
# before we cut down the tree depth.
df2 = df.copy()
df2.drop(columns='is_burgundy', axis=1, inplace=True)
feature_names = df2.columns.tolist()
print(feature_names)
tree_rules = export_text(clf, feature_names=feature_names)
print(tree_rules)

['Vintage', 'Size', 'ABV', 'Available', 'PricePerCl', 'Style_Red', 'Style_Rose', 'Style_White', 'Country_Argentina', 'Country_Armenia', 'Country_Australia', 'Country_Austria', 'Country_Azerbaijan', 'Country_Bulgaria', 'Country_Canada', 'Country_Chile', 'Country_China', 'Country_Czech Republic', 'Country_England', 'Country_France', 'Country_Georgia', 'Country_Germany', 'Country_Greece', 'Country_Hungary', 'Country_India', 'Country_Israel', 'Country_Italy', 'Country_Japan', 'Country_Lebanon', 'Country_Luxembourg', 'Country_Mixed', 'Country_Morocco', 'Country_New Zealand', 'Country_Portugal', 'Country_Slovakia', 'Country_Slovenia', 'Country_South Africa', 'Country_Spain', 'Country_Switzerland', 'Country_Syria', 'Country_Ukraine', 'Country_United States', 'Country_Uruguay']
|--- Country_France <= 0.50
|   |--- class: 0
|--- Country_France >  0.50
|   |--- PricePerCl <= 5.09
|   |   |--- Vintage <= 2020.50
|   |   |   |--- class: 0
|   |   |--- Vintage >  2020.50
|   |   |   |--- class: 1
|

In [55]:
import matplotlib.pyplot as plt
import graphviz 
from sklearn.tree import export_graphviz
class_names = [str(cls) for cls in clf.classes_]
dot_data = export_graphviz(clf, out_file=None, 
                           feature_names=feature_names,  
                           class_names=class_names,  
                           filled=True, rounded=True,  
                           special_characters=True)  
graph = graphviz.Source(dot_data)  
graph.render("decision_tree")  # Saves the tree as a file
graph.view()  # Opens the file in a viewer

'decision_tree.pdf'

# Conclusion
The logic of testing on vintage doesnt make much sense so we should in fact be happy with a 2 level test, and the discriminating price per cl is about 5gbp.