# Scotch Whisky Exploration: Data Wrangling

In [3]:
import numpy as np
import pandas as pd
import re
import spacy

import pickle


Load the data from the JSON file created by our Scrapy spider.

In [4]:
whiskydf_raw = pd.read_json("whiskeyscraper/MoM_whiskydata.json")

In [5]:
whiskydf_raw.head(2)

Unnamed: 0,name,nose,palate,finish,description,region,style,distillery,bottler,age,alcohol,maturation,chill_filter,cask_strength
0,Singleton of Dufftown 12 Year Old,"Malty with cereal/barley sweetness, buttery to...",Orange zest spiciness perks up a malty core of...,"Oaky, rich with good length, some fruit lingers.","A straightforward, nutty and malty single malt...",Speyside Whisky,Single Malt Whisky,Dufftown,Dufftown,12 year old Whisky,40.0%,,,
1,Laphroaig 10 Year Old Sherry Oak Finish,"Smoked meats, maple syrup, BBQ lemon, charred ...","More roasted cedar and peat smoke, with a hint...",A balanced finish of sherried sweetness and sm...,Smoke and sherry here from Laphroaig! The lege...,Islay Whisky,Single Malt Whisky,Laphroaig,Laphroaig,10 year old Whisky,48.0%,,,


## Initial Data Clean
#### Dropping observations with no reviews or with missing palate, nose, or finish notes.
#### Converting/cleaning numeric columns.

In [6]:
whiskydf_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14143 entries, 0 to 14142
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           14143 non-null  object
 1   nose           8553 non-null   object
 2   palate         8552 non-null   object
 3   finish         8488 non-null   object
 4   description    14143 non-null  object
 5   region         14143 non-null  object
 6   style          14143 non-null  object
 7   distillery     14143 non-null  object
 8   bottler        14142 non-null  object
 9   age            10667 non-null  object
 10  alcohol        14142 non-null  object
 11  maturation     2376 non-null   object
 12  chill_filter   479 non-null    object
 13  cask_strength  1022 non-null   object
dtypes: object(14)
memory usage: 1.5+ MB


### Let's keep the subset of whiskeys where there are tasting notes (e.g. where nose, palate, and finish are not null)

In [7]:
tastingnote_cols = ['nose', 'palate', 'finish']

whiskydf_raw[tastingnote_cols].isna().all()

nose      False
palate    False
finish    False
dtype: bool

Take the subset where the Master of Malt bros have written tasting notes for the whisky.

In [8]:
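# .copy() gives us an independent frame, so the in-place column edits below don't act on a slice of whiskydf_raw
whisky_df = whiskydf_raw.dropna(how = "any", axis = 0, subset = tastingnote_cols).copy()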
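To make the `how = "any"` / `subset` behaviour concrete, here is a minimal sketch on a made-up three-row frame. The toy data and the names `toy` and `kept` are assumptions for illustration, not rows from the scraped set.

```python
import pandas as pd

# toy stand-in for the scraped data; all values are invented
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "nose": ["malty", None, "smoky"],
    "palate": ["sweet", "spicy", None],
    "finish": ["long", "short", "dry"],
    "age": [None, "12 year old Whisky", "10 year old Whisky"],
})

# a row is dropped if ANY of the tasting-note columns is missing;
# missing values in other columns (e.g. age) are ignored
kept = toy.dropna(how="any", axis=0, subset=["nose", "palate", "finish"])
print(kept["name"].tolist())
```

Only whisky "A" survives, even though its age is missing, because `subset` restricts the null check to the three tasting-note columns.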

In [9]:
whisky_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8485 entries, 0 to 14141
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           8485 non-null   object
 1   nose           8485 non-null   object
 2   palate         8485 non-null   object
 3   finish         8485 non-null   object
 4   description    8485 non-null   object
 5   region         8485 non-null   object
 6   style          8485 non-null   object
 7   distillery     8485 non-null   object
 8   bottler        8485 non-null   object
 9   age            6557 non-null   object
 10  alcohol        8485 non-null   object
 11  maturation     1613 non-null   object
 12  chill_filter   403 non-null    object
 13  cask_strength  550 non-null    object
dtypes: object(14)
memory usage: 994.3+ KB


Our scraper extracted a good amount of data, and it's worth checking whether there are entries that don't belong (e.g. styles that are not single malt).

In [10]:
whisky_df['style'].unique()

array(['Single Malt Whisky'], dtype=object)

Alright, so these are all single malt whiskies. Good. But since there's only one value, the column carries no information, so we'll drop it. Also, there seem to be no duplicates or misspellings among the Scotch whisky region names: apart from the catch-all 'Scotch Whisky' and 'Other Scotch Whisky' labels, each unique entry corresponds to a different Scotch whisky-making region.

In [11]:
whisky_df.region.unique()

array(['Speyside Whisky', 'Islay Whisky', 'Highland Whisky',
       'Island Whisky', 'Scotch Whisky', 'Lowland Whisky',
       'Campbeltown Whisky', 'Other Scotch Whisky'], dtype=object)

In [12]:
whisky_df = whisky_df.drop(columns = ['style'])
print(whisky_df.columns)

Index(['name', 'nose', 'palate', 'finish', 'description', 'region',
       'distillery', 'bottler', 'age', 'alcohol', 'maturation', 'chill_filter',
       'cask_strength'],
      dtype='object')


#### Age column
Now we'll tackle the age column. Whiskies are generally matured in casks for years before bottling, and the age refers to that time in the cask. But let's check whether there are other units (months, etc.) of aging buried in the data.

In [13]:
whisky_df.age.head(3)

0    12 year old Whisky
1    10 year old Whisky
2    15 year old Whisky
Name: age, dtype: object

The second word is the aging unit (year). Let's extract this second word and see if anything else pops up.

In [14]:
whisky_df.age.str.split().str[1].unique()

array(['year', nan], dtype=object)

Nope. Either the age is recorded in years or the entry corresponds to a whisky with no age statement. Let's proceed on the understanding that age is in years and convert the column to numeric.

In [15]:
#convert whiskey age to numeric.
whisky_df.age = whisky_df.age.str.split(' ').str[0].astype('float')
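A slightly more defensive variant, sketched here on invented strings, pulls the leading number out with a regular expression and flags implausible ages right away. The 100-year cutoff and the names `ages_raw`, `ages`, and `suspicious` are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# invented examples mimicking the scraped age strings
ages_raw = pd.Series(["12 year old Whisky", None, "2003 year old Whisky"])

# capture the digits that precede the word "year"; rows without a match become NaN
ages = ages_raw.str.extract(r"^(\d+)\s+year", expand=False).astype("float")

# anything above an (assumed) 100-year cutoff deserves a manual look
suspicious = ages[ages > 100]
print(suspicious)
```

A check like this would surface values such as 105 or 2003 immediately after conversion instead of by eyeballing the sorted unique values.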

In [16]:
np.sort(whisky_df.age.unique())

array([   3.,    4.,    5.,    6.,    7.,    8.,    9.,   10.,   11.,
         12.,   13.,   14.,   15.,   16.,   17.,   18.,   19.,   20.,
         21.,   22.,   23.,   24.,   25.,   26.,   27.,   28.,   29.,
         30.,   31.,   32.,   33.,   34.,   35.,   36.,   37.,   38.,
         39.,   40.,   41.,   42.,   43.,   44.,   45.,   46.,   47.,
         48.,   49.,   50.,   51.,   52.,   54.,   55.,   56.,   60.,
         62.,   64.,   65.,   71.,   78.,  105., 2003.,   nan])

The 105- and 2003-year-old whiskies look like outliers. Let's investigate more closely.

In [17]:
whisky_df[whisky_df.age == 105]

Unnamed: 0,name,nose,palate,finish,description,region,distillery,bottler,age,alcohol,maturation,chill_filter,cask_strength
9219,Aisla T'Orten 105 Year Old 1906 - Liquid Histo...,The most unique bouquet I’ve ever experienced....,Heaven is spelt “T-O-R-T-E-N”. This conjures u...,To say this was long would be an understatemen...,The world's oldest whisky. Read more about thi...,Highland Whisky,Aisla T'Orten,Master of Malt,105.0,40.7%,Sherry,,


In [18]:
whisky_df[whisky_df.age == 105].description.values

array(["The world's oldest whisky. Read more about this extraordinary liquid history over on the Master of Malt blog . OK everyone, we have to come clean! The miraculous discovery of this 105 year old whisky may have had something to do with its launch on April Fools’ Day 2011. It’s also possible that some sneaky anagrams were used in some of the names in the story: Allie Sisell (the discoverer of the cask):  Aethenias Simonvent (the distillery’s founder):  And there’s a good chance that if you rearrange the letters in Aisla T’Orten distillery you get:"],
      dtype=object)

It's an April Fools' joke! Gaaaarrrrr!!!! Remove this.

In [19]:
whisky_df = whisky_df[~(whisky_df.age == 105)]
print(whisky_df.age.unique())

[  12.   10.   15.   nan   18.   25.   14.   21.   13.   16.    8.    5.
   11.   26.   23.   43.   28.   24.    9.   30.   29.   42.   19.   27.
   20.   17.    7.   40.    6.   22.   45.    3.   37.   35.   31.   64.
   50.   46.   36.   32.   44.   56.   38.   34.   33.   62.    4.   41.
   39. 2003.   60.   54.   55.   78.   52.   49.   47.   48.   51.   71.
   65.]


In [20]:
whisky_df[whisky_df.age == 2003].description.values

array(["Batch 4 of Benromach's Origins range, bottled in 2013, is matured entirely in first fill Port pipes and peated to a level of 8ppm. The idea behind the range is to highlight how changing different factors during production results in different characteristics in the final spirit."],
      dtype=object)

This batch was started in 2003 but bottled in 2013, so the scraped 2003 is a year rather than an age. We'll change the age value to 10 years.

In [21]:
whisky_df.loc[whisky_df.age == 2003, "age"] = 10

In [22]:
# check that both outliers are gone.
whisky_df.age.unique()

array([12., 10., 15., nan, 18., 25., 14., 21., 13., 16.,  8.,  5., 11.,
       26., 23., 43., 28., 24.,  9., 30., 29., 42., 19., 27., 20., 17.,
        7., 40.,  6., 22., 45.,  3., 37., 35., 31., 64., 50., 46., 36.,
       32., 44., 56., 38., 34., 33., 62.,  4., 41., 39., 60., 54., 55.,
       78., 52., 49., 47., 48., 51., 71., 65.])

#### Alcohol column
Yes. Now we convert the alcohol column to a numeric value. We'll also rename it from alcohol to ABV (alcohol by volume), since the values are known to be percentages.

In [23]:
whisky_df.alcohol = whisky_df.alcohol.str.replace('%', '').astype('float')
#check for NaNs
print(whisky_df.alcohol.isna().any())

False
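If the column ever contained entries that are not clean percentages, `astype('float')` would raise. A hedged alternative, shown on invented values, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN so they can be counted and inspected. The `"n/a"` entry is a hypothetical bad value, not something observed in the scraped data.

```python
import pandas as pd

# invented values; "n/a" is a hypothetical malformed entry
abv_raw = pd.Series(["40.0%", "58.3%", "n/a"])

# strip the trailing percent sign, then coerce anything unparseable to NaN
abv = pd.to_numeric(abv_raw.str.rstrip("%"), errors="coerce")

# number of entries that failed to parse
print(abv.isna().sum())
```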


In [24]:
whisky_df.rename(columns = {'alcohol': 'ABV'}, inplace = True)
print(whisky_df.columns)

Index(['name', 'nose', 'palate', 'finish', 'description', 'region',
       'distillery', 'bottler', 'age', 'ABV', 'maturation', 'chill_filter',
       'cask_strength'],
      dtype='object')


#### cask_strength and chill_filter have Yes/No entries when populated. Let's convert these to Boolean.

In [25]:
print({'cask_strength_unique': whisky_df.cask_strength.unique(), 'chill_filter_unique': whisky_df.chill_filter.unique()})

{'cask_strength_unique': array([nan, 'No', 'Yes'], dtype=object), 'chill_filter_unique': array([nan, 'No', 'Yes'], dtype=object)}


In [26]:
whisky_df.loc[:, ['cask_strength', 'chill_filter']] = whisky_df[['cask_strength', 'chill_filter']].replace({'Yes': True, 'No': False})

In [27]:
print({'cask_strength_unique': whisky_df.cask_strength.unique(), 'chill_filter_unique': whisky_df.chill_filter.unique()})

{'cask_strength_unique': array([nan, False, True], dtype=object), 'chill_filter_unique': array([nan, False, True], dtype=object)}
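Note that both columns keep dtype `object` because they mix booleans with NaN. One option, sketched below on invented values, is pandas' nullable `"boolean"` dtype, which stores True/False/missing in a single typed column. Whether the rest of the pipeline benefits from this is an assumption; the notebook itself keeps the object columns.

```python
import pandas as pd

# invented values mirroring the Yes/No/missing pattern
flags_raw = pd.Series(["Yes", "No", None])

# map the strings to booleans, then cast to the nullable boolean dtype
flags = flags_raw.map({"Yes": True, "No": False}).astype("boolean")
print(flags.dtype)
```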


## Description Column: Extracting Cask Data and Filtration Process from free text

We want to extract information about the whiskies from the description column. Some of this info is concrete (is it cask strength? is it chill filtered? which barrels was it aged in?). Some of it appears in the description but was never entered into the bottle detail data on the website, so we'll need to extract it from free text and put it into the appropriate existing dataframe column.

In [28]:
# we're going to import necessary NLP libraries

import spacy
from spacy.tokens import Doc, Span, Token # for creating global objects 
from spacy.matcher import Matcher # for rule-based matching
from spacy.matcher import DependencyMatcher
from spacy.language import Language # for building custom pipeline components
from spacy.pipeline import EntityRuler 



from copy import deepcopy


#### We create our nlp object and we are going to use a rule-based scheme to get some named entities.

In [29]:
nlp = spacy.load('en_core_web_lg') # loads our NLP model

# we're going to create a manual list of stop words that will be used later down the line. "hint", "touch", and "note" and variants thereof need to be removed.

nlp.Defaults.stop_words |= {"hint", "Hint", "hints", "touch", "touches", "Touch", "note", "Note", "notes", "Notes", "little", "end", "thing", "palate", "Palate", "nose", "Nose", "whisper", "whispers"}


# creates ruler
ruler = nlp.add_pipe("entity_ruler")

# pattern_csk1 = [{"POS": "NOUN", "OP":"*"}, {"LEMMA": "cask"}]
# pattern_csk1 = [{"POS": "NOUN", "OP":"*"}, {"LEMMA": {"IN": ["cask", "octave", "pipe", "puncheon", "butt", "barrel"]} } ]
# pattern_csk2 = [{"POS": "NOUN", "OP":"*"}, {"POS": "CCONJ"}, {"POS": "NOUN", "OP":"*"}, {"LEMMA": {"IN": ["cask", "octave", "pipe", "puncheon", "butt", "hogshead"]} } ]

pattern_csk = [{"LEMMA": {"IN": ["cask", "octave", "pipe", "puncheon", "butt", "barrel", "hogshead"]} } ]

patterns = [ {"label": "CSK", "pattern": pattern_csk }]

# adds rules that will be used to parse and create entities
ruler.add_patterns(patterns)





We want to batch process all the description text. So we'll construct a pipeline. We also want to keep track of what description belongs to what whisky. This will enable us to rejoin any results from NLP into our pandas dataframe. So we will construct a custom attribute via context creation.

In [30]:
# passing context metadata into attributes via the pipeline requires the data to be in a specific form. We'll create a function to transform the data into the right form.
# then the data will be used to create a Doc object generator based off of the text and context metadata.

def doc_context_tupling(data, textcolname, *args):
    # columns to pull: the text column first, then any context columns
    col_list = [textcolname, *args]

    # data is a dataframe w/ the form of whisky_df; build one dict per row
    subset_dict_list = data[col_list].to_dict('records')

    # now let's form a (text, context_dictionary) tuple, which is what the spacy pipeline requires.
    data_context_list = [ (subset_dict.pop(textcolname), subset_dict) for subset_dict in subset_dict_list]

    # register the custom attribute once, then attach each row's context to its own Doc.
    # (setting it as the class-wide default would make every Doc share the most recent context.)
    if not Doc.has_extension('context'):
        Doc.set_extension('context', default = None)

    for doc, context in nlp.pipe(data_context_list, as_tuples = True):
        doc._.context = context
        yield doc
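To see the shape of the data that `nlp.pipe(..., as_tuples = True)` consumes, here is a small pandas-only sketch of the tupling step. The toy frame and the names `toy`, `records`, and `pairs` are invented for illustration.

```python
import pandas as pd

# toy stand-in for whisky_df; names and descriptions are invented
toy = pd.DataFrame({
    "name": ["Whisky A", "Whisky B"],
    "description": ["Aged in a sherry cask.", "Bottled at cask strength."],
})

# one dict per row, e.g. {'description': ..., 'name': ...}
records = toy[["description", "name"]].to_dict("records")

# pop the text out of each dict; what remains is the context
pairs = [(rec.pop("description"), rec) for rec in records]
print(pairs[0])
```

Each element is a `(text, context)` pair, so whatever the pipeline produces for the text can be traced back to the whisky it came from through the context dictionary.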


We are going to use some domain knowledge. It is unlikely that a whisky producer or reviewer would mention chill filtration in a description unless the whisky was NOT chill filtered. Chill filtration is a process that removes some esters and fatty acids that can become insoluble in lower ABV whisky at cooler temperatures. The process can prevent clouding, but many connoisseurs believe that it degrades the quality of the whisky and removes some of its mouthfeel and complexity. In any case, saying that your whisky is non-chill-filtered is a point of pride. We'll thus create a very simple rule-based matcher for this and not worry too heavily about the possibility that the review is saying that it IS chill filtered. The same logic goes for cask strength: a description is unlikely to mention it unless the whisky IS bottled at cask strength.

In [31]:
def update_cs_cf_info(data):

    matcher = Matcher(nlp.vocab)
    pattern1 = [{'LOWER': 'chill'}, {'IS_PUNCT': True, 'OP': "?"},  {'LEMMA': 'filtration'} ]
    pattern2 = [{'LOWER': 'chill'}, {'IS_PUNCT': True, 'OP': "?"},  {'LEMMA': 'filter'} ]
    matcher.add('CHILL_FILTER', [pattern1, pattern2])

    matcher2 = Matcher(nlp.vocab)
    pattern3 = [{'LOWER': 'cask'}, {'IS_PUNCT': True, 'OP': "?"},  {'LEMMA': 'strength'} ]
    matcher2.add('CASK_STRENGTH', [pattern3])

    descript_corpus = doc_context_tupling(data, 'description', 'name')

    new_data = deepcopy(data)

    for doc in descript_corpus:

        matches_chillf = matcher(doc)
        matches_cask = matcher2(doc)
        
        if len(matches_cask) > 0:
            new_data.loc[new_data.name == doc._.context['name'], 'cask_strength'] =  True 
        
        if len(matches_chillf) > 0:
            new_data.loc[new_data.name == doc._.context['name'], 'chill_filter'] = False

    return new_data
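For a quick sanity check of what these patterns are meant to catch, the sketch below uses plain regular expressions as a rough stand-in. It is not the spaCy `Matcher` used above: it works on raw characters rather than tokens and lemmas, and the listed suffixes and the helper name `mentions` are assumptions made for illustration.

```python
import re

# "chill", a space or hyphen, then a form of "filter" or "filtration"
CHILL_FILTER_RE = re.compile(
    r"\bchill[\s\-]+(?:filter(?:ed|ing|s)?|filtration)\b", re.IGNORECASE
)
# "cask", a space or hyphen, then "strength"
CASK_STRENGTH_RE = re.compile(r"\bcask[\s\-]+strength\b", re.IGNORECASE)

def mentions(text):
    """Report whether the text mentions chill filtration or cask strength."""
    return {
        "chill_filter": bool(CHILL_FILTER_RE.search(text)),
        "cask_strength": bool(CASK_STRENGTH_RE.search(text)),
    }

print(mentions("Bottled at cask strength and non-chill-filtered."))
```

Like the matcher, this only detects that the topic is mentioned; turning a mention into `chill_filter = False` or `cask_strength = True` still rests on the domain assumption described above.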



In [32]:
new_whisk = update_cs_cf_info(whisky_df)

In [33]:
new_whisk.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8484 entries, 0 to 14141
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           8484 non-null   object 
 1   nose           8484 non-null   object 
 2   palate         8484 non-null   object 
 3   finish         8484 non-null   object 
 4   description    8484 non-null   object 
 5   region         8484 non-null   object 
 6   distillery     8484 non-null   object 
 7   bottler        8484 non-null   object 
 8   age            6556 non-null   float64
 9   ABV            8484 non-null   float64
 10  maturation     1612 non-null   object 
 11  chill_filter   819 non-null    object 
 12  cask_strength  1643 non-null   object 
dtypes: float64(2), object(11)
memory usage: 1.2+ MB


#### Maturation column: text normalization

In [34]:
new_whisk.maturation.unique()

array([nan, 'Sherry', 'Wine Cask', 'Bourbon', 'Port Finish', 'Rum Finish',
       'Oak', 'Brandy'], dtype=object)

Let's lower-case all these. Also we can remove the word 'finish' where necessary.

In [35]:
new_whisk.maturation = new_whisk.maturation.str.lower().replace({'port finish': 'port',  'rum finish': 'rum', 'wine cask': 'wine'})
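The explicit mapping works well for the handful of values seen here. If more values of the same shape turned up, a regex that strips a trailing "finish" or "cask" would generalize; the sketch below shows that alternative on a copy of the observed values. The pattern and variable names are assumptions, not what the notebook runs.

```python
import pandas as pd

# values as they appear in the maturation column, plus one missing entry
maturation = pd.Series(["Sherry", "Wine Cask", "Port Finish", "Rum Finish", None])

# lower-case, then drop a trailing " finish" or " cask"
normalized = maturation.str.lower().str.replace(r"\s+(?:finish|cask)$", "", regex=True)
print(normalized.dropna().tolist())
```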


The description also has information about the wood / type of barrel(s) that the whisky was aged in.  We want to be able to extract that information from the free text. There are different names used for 'casks' in the corpus. Some examples are octaves, pipes, butts, puncheons, etc. There are technical differences between these (size, shape, etc) that can have effects on the whisky taste. But it'll be good to create a broad named entity class for this. We've already done this via the Entity Ruler and created a named entity class CSK.

We want characteristics of the cask that directly impact flavor profiles such as whether the cask once held sherry, bourbon, etc or the kind of wood when relevant (oak can impart strong flavors to a Scotch). My bet is that this can be parsed via syntactic dependencies. Statements regarding cask aging will typically have a structure like:

"Was aged in a sherry cask."

In such a construction the cask entity is a prepositional object (pobj), so we can look for noun modifiers attached to that pobj.



In [36]:
dc = nlp(new_whisk.iloc[450].description)
print(dc)


A 21 year old single malt from the Aberlour Distillery, independently bottled by Montgomerie's as part of the Rare Select series. This one was distilled back in February 1996 and filled into a single bourbon cask, then left alone for 21 years before being bottled in September 2017 at 46% ABV. A tasty chance to see the non-sherried side of Aberlour.


There are a lot of downsides to what I'm doing next, and a lot of dependency rule generation. But hey, this is my first NLP rodeo and it seems to work OK.

In [37]:
def parse_cask_traits(doc):

    # let's construct a dependency pattern matcher

    dep_matcher = DependencyMatcher(nlp.vocab)

    # define our cask entity as an anchor where we insist that the cask mention is a prepositional object
    # dep_pattern1 looks for modifiers like 'American oak' or 'Oloroso sherry' or single modifiers like 'sherry' or 'bourbon'

    dep_pattern1 = [{'RIGHT_ID': 'csk_ent', 'RIGHT_ATTRS': {'ENT_TYPE': 'CSK', 'DEP': 'pobj'}}, {'LEFT_ID': 'csk_ent', 'REL_OP': '>>', 'RIGHT_ID': 'cask_trait', 'RIGHT_ATTRS': {'DEP': {'IN': ['compound', 'amod']}}}]
    # this is designed to find things like "bourbon and sherry casks" or "bourbon and European oak"
    dep_pattern2 = [{'RIGHT_ID': 'csk_ent', 'RIGHT_ATTRS': {'ENT_TYPE': 'CSK', 'DEP': 'pobj'}}, {'LEFT_ID': 'csk_ent', 'REL_OP': '>', 'RIGHT_ID': 'cask_trait', 'RIGHT_ATTRS': {'DEP': 'nmod' }}, {'LEFT_ID': 'cask_trait', 'REL_OP': '>', 'RIGHT_ID': 'cask_conj', 'RIGHT_ATTRS': {'DEP': 'conj'}}]
    
    
    # this is a common misparsing that needs to be accounted for when conjunctions with compounds and noun modifiers on the cask entity are present -- first nmod sometimes gets mislabeled as a pobj. the parser mislabels the cask as a conjunction.
    dep_pattern3 = [{'RIGHT_ID': 'csk_ent', 'RIGHT_ATTRS': {'ENT_TYPE': 'CSK', 'DEP': 'conj'}}, {'LEFT_ID': 'csk_ent', 'REL_OP': '<', 'RIGHT_ID': 'cask_trait2', 'RIGHT_ATTRS': {'DEP': 'pobj'}}, {'LEFT_ID': 'csk_ent', 'REL_OP': '>>', 'RIGHT_ID': 'cask_trait', 'RIGHT_ATTRS': {'DEP': {'IN': ['compound', 'amod']}}}]
    dep_pattern4 = [{'RIGHT_ID': 'csk_ent', 'RIGHT_ATTRS': {'ENT_TYPE': 'CSK', 'DEP': 'conj'}}, {'LEFT_ID': 'csk_ent', 'REL_OP': '>>', 'RIGHT_ID': 'cask_trait', 'RIGHT_ATTRS': {'DEP': {'IN': ['compound', 'amod']}}}]

    dep_matcher.add('compoundamod', patterns = [dep_pattern1, dep_pattern2, dep_pattern3, dep_pattern4])

    dep_matches = dep_matcher(doc)
    match_dict = {}
    for _, token_ids in dep_matches:
        # token_ids lists the cask token first, then the matched modifier tokens.
        # let's collect the modifiers in a dictionary keyed by the cask index.
        csk_ind = token_ids[0]
        match_dict.setdefault(csk_ind, []).extend(token_ids[1:])

    # sort and de-duplicate the modifier indices collected for each cask
    return {csk_ind: sorted(set(inds)) for csk_ind, inds in match_dict.items()}


In [38]:

import itertools

def create_spans_from_matches(doc):

    matchdicts = parse_cask_traits(doc)

    span_list = []

    def group_ranges(L):
        # consecutive indices share the same (position - value) key, so groupby splits L into contiguous runs
        for _, run in itertools.groupby(enumerate(L), lambda pair: pair[0] - pair[1]):
            run = [ind for _, ind in run]
            yield run[0], run[-1]

    for val in matchdicts.values():
        for start, end in group_ranges(val):
            # Span end is exclusive, hence the + 1
            span_list.append(Span(doc, start, end + 1))

    return span_list
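The span construction hinges on grouping sorted token indices into runs of consecutive integers. A standalone sketch of that step is below; the function name `contiguous_runs` and the sample indices are made up for illustration.

```python
from itertools import groupby

def contiguous_runs(indices):
    """Yield (first, last) for each run of consecutive integers in a sorted list."""
    # within a run, position minus value stays constant, so it serves as the group key
    for _, run in groupby(enumerate(indices), key=lambda pair: pair[0] - pair[1]):
        run = [ind for _, ind in run]
        yield run[0], run[-1]

print(list(contiguous_runs([4, 5, 9, 12, 13, 14])))
# prints [(4, 5), (9, 9), (12, 14)]
```

Each `(first, last)` pair then maps to a span covering tokens `first` through `last`, which is why the span end is `last + 1`.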
        




Without getting more sophisticated (e.g. training an NER model), it'll be difficult to extract relevant cask descriptors and put them into the right categories (e.g. oloroso = sherry, Chardonnay = white wine). This is definitely worth doing on a second pass with the spans we have created, but for now we'll do a simple keyword match against a list built from the 'maturation' column plus a few extra descriptors.

The following function extracts the cask descriptors from each description wherever it matches our dependency patterns, then one-hot encodes the types of casks used in maturation with a multilabel binarizer.

In [39]:

def maturation_descriptor(data):

    maturation_traits = set(np.append(data.maturation.unique(), ['sauternes', 'marsala', "white wine", "rye", "red wine"]))
    #create generator from description

    descript_corpus = doc_context_tupling(data, 'description', 'name')

    #return dictionary 

    csk_doc_dict = {'name': [], 'cask_description': []}

    for doc in descript_corpus:
        match_spans = create_spans_from_matches(doc)

        # initialize the set with any value that is in the maturation column

        maturation_col_dat = data[data['name'] == doc._.context['name']].maturation.values

        if pd.isna(maturation_col_dat[0]):
            csk_descript = set()
        else:
            csk_descript = set([str(maturation_col_dat[0])])

        

        for span in match_spans:
            span_descriptors = [token.lemma_.lower() for token in span if token.lemma_.lower() in maturation_traits]
            csk_descript.update(span_descriptors)

            # this will manually construct bigrams to test against maturation_traits list  (in future iteration we can use mutual information to get words most commonly linked in bigrams) 

            csk_descript.update( [doc[token.i:token.i + 2].lemma_.lower() for token in span if doc[token.i:token.i + 2].lemma_.lower() in maturation_traits] )

        # any descriptor from the corresponding data.maturation entry in the original dataframe
        # was already added when csk_descript was initialized above.

        # now record this whisky's name together with its set of cask descriptors

        csk_doc_dict['name'].append(doc._.context['name'])
        csk_doc_dict['cask_description'].append(csk_descript)

    csk_doc_df = pd.DataFrame.from_dict(csk_doc_dict)

    # now we need to one-hot encode the cask descriptions that are in the form of a list of descriptors.
    # sklearn's multilabel binarizer will do the trick

    from sklearn.preprocessing import MultiLabelBinarizer
    mlb = MultiLabelBinarizer()

    encoded_csk_cols = pd.DataFrame(mlb.fit_transform(csk_doc_df.cask_description), columns = 'maturation_' + mlb.classes_ )

    csk_inter = csk_doc_df.drop(columns = ['cask_description']).join(encoded_csk_cols, how = 'inner')

    # rows that are still all zeros (no cask descriptor found) should be converted to all NaNs.

    mat_cols = csk_inter.columns[csk_inter.columns.str.contains('maturation')]
    csk_inter.loc[csk_inter[mat_cols].sum(axis=1) == 0, mat_cols] = np.nan

    csk_fulldf = pd.merge(left = data, right = csk_inter, how = 'inner', on = 'name').drop(columns = ['maturation'])


    return csk_fulldf
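The encoding step at the end of the function can be pictured on a tiny example. The sketch below reproduces the idea with pandas only (no `MultiLabelBinarizer`), including the conversion of all-zero rows to NaN; the sets in `cask_sets` are invented.

```python
import numpy as np
import pandas as pd

# invented cask-descriptor sets, one per whisky; the last has no information
cask_sets = [{"sherry"}, {"bourbon", "sherry"}, set()]

# every label seen anywhere becomes one indicator column
labels = sorted(set().union(*cask_sets))
onehot = pd.DataFrame(
    [[float(label in s) for label in labels] for s in cask_sets],
    columns=["maturation_" + label for label in labels],
)

# a row of all zeros means "unknown", not "matured in none of these"
onehot.loc[onehot.sum(axis=1) == 0, :] = np.nan
print(onehot)
```

Converting the empty rows to NaN keeps "no cask information found" distinct from a real negative for each cask type.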
        
    

In [40]:
new_whisk2 = maturation_descriptor(new_whisk) 
new_whisk2.head()

Unnamed: 0,name,nose,palate,finish,description,region,distillery,bottler,age,ABV,...,maturation_marsala,maturation_oak,maturation_port,maturation_red wine,maturation_rum,maturation_rye,maturation_sauternes,maturation_sherry,maturation_white wine,maturation_wine
0,Singleton of Dufftown 12 Year Old,"Malty with cereal/barley sweetness, buttery to...",Orange zest spiciness perks up a malty core of...,"Oaky, rich with good length, some fruit lingers.","A straightforward, nutty and malty single malt...",Speyside Whisky,Dufftown,Dufftown,12.0,40.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Laphroaig 10 Year Old Sherry Oak Finish,"Smoked meats, maple syrup, BBQ lemon, charred ...","More roasted cedar and peat smoke, with a hint...",A balanced finish of sherried sweetness and sm...,Smoke and sherry here from Laphroaig! The lege...,Islay Whisky,Laphroaig,Laphroaig,10.0,48.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,GlenAllachie 15 Year Old,"Peanut brittle, dates and a big scoop of choco...","Walnut, raisin, Christmas spices and fresh gin...","Coffee, Turkish delight and just a hint of fla...",A 2019 addition to the GlenAllachie core range...,Speyside Whisky,GlenAllachie,GlenAllachie,15.0,46.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,Seaweed & Aeons & Digging & Fire & Cask Streng...,Roasted fruit and driftwood notes jump out of ...,The coastal core remains solid on the palate w...,"A meaty hint lingers on the finish, with cinna...","Behold, batch 4 of Seaweed & Aeons & Digging &...",Islay Whisky,Seaweed & Aeons & Digging & Fire,Seaweed & Aeons & Digging & Fire,10.0,58.3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,Edradour 10 Year Old,"Medium, great complexity. Thoroughly fruity, s...","Cloying, seductive murkiness. Rum, barley, toa...",Any confusion is arrested: spiced fruitcake wi...,Edradour is one of Scotland's smallest distill...,Highland Whisky,Edradour,Edradour,10.0,40.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [41]:
new_whisk2.columns

Index(['name', 'nose', 'palate', 'finish', 'description', 'region',
       'distillery', 'bottler', 'age', 'ABV', 'chill_filter', 'cask_strength',
       'maturation_bourbon', 'maturation_brandy', 'maturation_marsala',
       'maturation_oak', 'maturation_port', 'maturation_red wine',
       'maturation_rum', 'maturation_rye', 'maturation_sauternes',
       'maturation_sherry', 'maturation_white wine', 'maturation_wine'],
      dtype='object')

## Palate, Nose, and Finish

These columns independently describe different aspects of the tasting process: the nose (smell), palate (taste of the whisky), and finish (aftertaste and lingering mouthfeel). We want to extract sense descriptors and taste and smell metaphors/similes. Typically these sorts of descriptors are either grouped nouns or adjectives.

As a first go, we'll build a BOW count vectorizer or TF-IDF model. The hope is that later down the line we can explore descriptor clustering or semantic/topic modeling to get a handle on the distinct sensory groups present across single malt Scotches.
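As a reminder of what the two representations contain, here is a minimal standard-library sketch on three invented tasting notes. It uses a textbook weighting, term count times log(N / document frequency); the gensim `TfidfModel` may use a different log base and normalization, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# three invented, already-tokenized tasting notes
docs = [
    ["smoke", "peat", "smoke"],
    ["honey", "vanilla"],
    ["peat", "honey"],
]

# bag of words: raw term counts per document
bow = [Counter(doc) for doc in docs]

# document frequency: in how many documents each term occurs
n_docs = len(docs)
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(counts):
    """Weight each term count by log(N / df), so rarer terms score higher."""
    return {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}

weights = [tfidf(counts) for counts in bow]
print(weights[0])
```

In the first note, "smoke" outweighs "peat" both because it occurs twice and because it appears in no other document.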

We'll use a combination of spacy and gensim for these tasks:

In [42]:
from gensim.corpora.dictionary import Dictionary
from gensim.matutils import corpus2dense
from gensim.models import Phrases
from gensim.models.tfidfmodel import TfidfModel
from hunspell import Hunspell


In [43]:
# the manual stemmer is going to take the list of tokenized documents and manually stem/lemmatize before gensim dictionary creation.
# this is due to the fact that SpaCy's lemmatizer still keeps a lot of tokens distinct that should actually be rolled into the same token (smokiness, smoky, smoke)
#NOTE: this stemmer relies on a large enough corpus that variants + the stemmed word exist in the corpus
def manual_stemmer(tokenized_list):

    word_checker = Hunspell() # we'll use this for checking whether the stemmed word is in the hunspell english dictionary

    # we also want to establish the set of unique tokens in tokenized_list (the entire corpus)
    unique_token_set = set(itertools.chain(*tokenized_list))

    final_token_list = []
    for doc in tokenized_list:
        doc_str = " ".join(doc)
        
        # on first pass, all tokens that end with "-iness", "-ied", "-iful", "-ifull", or "-ifully" should be contracted to "-y"
        regex_search_pattern1 = [r'iness\b', r'ied\b', r'iful\b', r'ifull\b', r'ifully\b']
        doc_str = re.sub("|".join(regex_search_pattern1) , 'y', doc_str)

        # all tokens ending with "-ness", "-ful", or "-full" should just have this ending chopped off.
        regex_search_pattern2 = [r'ness\b', r'ful\b', r'full\b']
        doc_str = re.sub("|".join(regex_search_pattern2) , '', doc_str)
        
        #specific word replacement rule
        regex_search_pattern3 = r'tannic'
        doc_str = re.sub(regex_search_pattern3 , 'tannin', doc_str)

        

        """
        Stemming words with -y endings can be tricky. We are going to construct a rule that depends on the existing tokens in the dataset.
        It's important to have removed stop words, etc from the corpus and have a large enough corpus before running this stemmer.   
        
        First stem. In some cases, we won't have a complete word (e.g. smoky --> smok). After stemming, we will check whether the ending follows a pattern
        like consonant-vowel-consonant after stemming. Append -e. This rule converts smok to smoke. It also converts pepper to peppere. For other cases, the 
        stemmed word doesn't match this consonant-vowel-consonant pattern. 

        At the end we check whether the stemmed/converted word is in the original corpus. If not, then do not stem the word at all and return the 
        original token. This will mess up rare words that occur only once in the corpus, but that won't matter down the line.
        
        """

        doc_list = doc_str.split()
        regex_search_pattern4 = r"y\b"
        regex_search_pattern5 = r"\w+[^aeiou][aeiou][^aeiou]\b"
        
        spac_doc_list = []
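        for token in doc_list:

            stemmed_tok = re.sub(regex_search_pattern4, "", token)

            # stem ends consonant-vowel-consonant but is not a known token: try restoring a final -e (smok --> smoke)
            if (stemmed_tok not in unique_token_set) and re.search(regex_search_pattern5, stemmed_tok):
                stemmed_tok = stemmed_tok + 'e'

            # keep the stem only if it occurs in the corpus; otherwise fall back to the unstemmed token
            if stemmed_tok in unique_token_set:
                spac_doc_list.append(stemmed_tok)
            else:
                spac_doc_list.append(token)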

        final_token_list.append(spac_doc_list)
        
    return final_token_list     
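The "-y" rule is easier to follow on a single token. The sketch below isolates it as a hypothetical helper, `stem_y`, applied to a tiny invented vocabulary; it follows the rule as described in the docstring above and is not a drop-in replacement for `manual_stemmer`.

```python
import re

# a stem that ends consonant-vowel-consonant, e.g. "smok"
CVC_END = re.compile(r"\w+[^aeiou][aeiou][^aeiou]\b")

def stem_y(token, vocab):
    """Strip a final -y, restore -e if needed, and keep the result only if it is a known token."""
    stem = re.sub(r"y\b", "", token)
    if stem not in vocab and CVC_END.search(stem):
        stem += "e"  # smok -> smoke
    return stem if stem in vocab else token

# invented vocabulary standing in for the set of unique corpus tokens
vocab = {"smoke", "smoky", "pepper", "peppery", "honey"}
print([stem_y(t, vocab) for t in ["smoky", "peppery", "honey"]])
# prints ['smoke', 'pepper', 'honey']
```

"honey" is left alone because neither "hone" nor any repaired form of it occurs in the vocabulary, which is the corpus-membership safeguard the note above depends on.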

    

In [44]:
# function that converts a set of documents into a tokenized document list with relevant bigrams. also creates a gensim dictionary object

def create_gensim_data_bigrams(data, col_to_tokenize, save_bigram = True):

    doc_contextor = doc_context_tupling(data, col_to_tokenize, 'name')
    name_list = []
    doc_list = []
    doc_generator = doc_contextor

    for current_doc in doc_generator:
        # lower-case lemmatized alphabetic words with common stop words and punctuation removed.
        # Also filter out verbs and prepositions

        token_list = [token.lemma_.lower() for token in current_doc if (not token.is_stop) and (not token.is_punct) and (token.pos_ != 'VERB') and (token.dep_ != 'prep') and token.is_alpha]
        doc_list.append(token_list)
        name_list.append(current_doc._.context['name'])

    # now let's train a phrase model that can include relevant bigrams as tokens in the tokenized docs
    phrase_mod = Phrases(doc_list)

    if save_bigram:
        phrase_mod_path = "models\\phrase_mod.pkl"
        # use a context manager so the file handle is closed after writing
        with open(phrase_mod_path, 'wb') as f:
            pickle.dump(phrase_mod, f)

    toks_with_bigrams = list(phrase_mod[doc_list]) # this creates the new set of tokenized documents with bigrams.

    # manual stemming:
    toks_with_bigrams = manual_stemmer(toks_with_bigrams)

    # let's create a gensim dictionary off of this.

    gens_dict_bigrams = Dictionary(toks_with_bigrams)

    #it's probably useful to return the document token lists as a Python dictionary with corresponding whisky names.
    token_bigram_dict = {'name': name_list, col_to_tokenize + '_tokenized': toks_with_bigrams}

    return pd.DataFrame.from_dict(token_bigram_dict), gens_dict_bigrams
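
For intuition about what the phrase model contributes, the following is a simplified stand-in written with the standard library only. It is not gensim's implementation. It scores each adjacent token pair with a collocation score of the form (count(a, b) - min_count) / (count(a) * count(b)) * vocabulary size, and merges pairs that score above a threshold into a single `a_b` token. The function name `merge_bigrams`, the toy documents, and the `min_count` and `threshold` values are illustrative assumptions.

```python
from collections import Counter

def merge_bigrams(docs, min_count=1, threshold=1.0):
    """Join frequently co-occurring adjacent tokens into 'a_b' tokens."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)

    def score(a, b):
        # high when the pair occurs together more often than its parts would suggest
        return (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * vocab_size

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and score(doc[i], doc[i + 1]) > threshold:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the token that was absorbed into the bigram
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

# hypothetical toy corpus of already-tokenized tasting notes
toy_docs = [
    ["peat", "smoke", "oak"],
    ["peat", "smoke", "honey"],
    ["peat", "smoke", "vanilla"],
    ["oak", "honey", "vanilla"],
]
merged = merge_bigrams(toy_docs)
print(merged)
```

Here "peat" and "smoke" always appear together, so they are merged into `peat_smoke`, while pairs seen only once stay as separate tokens.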



In [45]:
# this function executes the tokenization on each sense column, saves it into the dataframe and removes the corresponding free text
# also generates a Python dictionary of gensim Dictionaries, one for each sense column corpus.

def generate_tokenized_cols(data, *args):
    descriptor_cols = list(args)

    # for loop to generate bigram-tokenized list and list of dictionaries for each tasting note type (i.e, nose, palate, finish)

    tokenized_doc_list = []
    gensimDict_dict = {}

    for descriptor in descriptor_cols:


        tokenized_doc, gensim_dictionary = create_gensim_data_bigrams(data, descriptor)
        tokenized_doc_list.append(tokenized_doc.set_index('name'))

        gensimDict_dict.update({descriptor: gensim_dictionary})

    
    tokdocs_df = pd.concat(tokenized_doc_list, axis = 1).reset_index()
    new_data = pd.concat([tokdocs_df.set_index('name'), data.set_index('name')], axis = 1).reset_index().drop(columns = descriptor_cols).drop(columns = ['description'])

    
    return new_data, gensimDict_dict


In [46]:
new_whisk3, gensim_dictionary_list = generate_tokenized_cols(new_whisk2, 'palate', 'nose', 'finish')


Now we pickle the gensim dictionaries:

In [55]:
import pickle
for key, value in gensim_dictionary_list.items():
    pickle_path = "dictionaries\\gemsimdict_" + key + ".pkl"
    # use a context manager so each file handle is closed after writing
    with open(pickle_path, 'wb') as f:
        pickle.dump(value, f)
    

In [56]:
new_whisk3.head(3)

Unnamed: 0,name,palate_tokenized,nose_tokenized,finish_tokenized,region,distillery,bottler,age,ABV,chill_filter,...,maturation_marsala,maturation_oak,maturation_port,maturation_red wine,maturation_rum,maturation_rye,maturation_sauternes,maturation_sherry,maturation_white wine,maturation_wine
0,Singleton of Dufftown 12 Year Old,"[orange_zest, malty_core, nut, oak, toffee, cu...","[malt, cereal, barley, sweet, butter, toast, w...","[oak, rich, good_length, fruit]",Speyside Whisky,Dufftown,Dufftown,12.0,40.0,,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Laphroaig 10 Year Old Sherry Oak Finish,"[roasted, cedar, peat_smoke, iodine, away, dar...","[meat, maple_syrup, bbq, lemon, charred_oak, s...","[balanced, finish, sherry, sweet, smouldering,...",Islay Whisky,Laphroaig,Laphroaig,10.0,48.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,GlenAllachie 15 Year Old,"[walnut, raisin, christmas, spice, fresh, ginger]","[peanut_brittle, date, big, scoop, chocolate, ...","[coffee, turkish_delight, sea_salt]",Speyside Whisky,GlenAllachie,GlenAllachie,15.0,46.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


We have now extracted the tokens we want from the free-text tasting notes and generated one-hot encoded representations of whisky preparation (chill filtering, cask strength, and types of barrels used in maturation) from the free-text description. With that done, we've dropped all the free-text columns. The numerical columns (whisky age, ABV, etc.) have also all been cleaned.

We save this dataframe to CSV. Saving the sensory descriptor columns in this non-numerical form is important as it'll enable us to try different vector representation schemes later down the line.




In [57]:
new_whisk3.to_csv("data\\interim\\whisk_tokenized_encoded.csv")
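
One caveat worth keeping in mind when saving token lists to CSV: each list is written out as its string representation, so it comes back as a plain string when the file is read. The sketch below illustrates this round trip with the standard library only; the single example row is copied from the table above, and the in-memory buffer stands in for a file on disk. With pandas, passing `ast.literal_eval` through the `converters` argument of `pd.read_csv` is one way to get the same effect.

```python
import ast
import csv
import io

rows = [{"name": "GlenAllachie 15 Year Old",
         "finish_tokenized": ["coffee", "turkish_delight", "sea_salt"]}]

# write to an in-memory buffer; the list is stored via its str() representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "finish_tokenized"])
writer.writeheader()
writer.writerows(rows)

# read it back: the token column is now a string, not a list
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_value = loaded[0]["finish_tokenized"]

# safely parse the string back into a Python list
tokens = ast.literal_eval(raw_value)
print(type(raw_value).__name__, type(tokens).__name__)
```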

It may also be better to unify the palate, nose, and finish tokens. These are often correlated with each other anyway, and unifying the token sets may yield a BoW representation that captures correlations between the descriptors a bit better. For example, a whisky may have salt in the palate, coastal in the nose, and seaweed and iodine in the finish; we would want these words linked together. Splitting them apart might make our representation excessively sparse.

We'll thus also create and save a new dataframe with the tokens from palate, nose, and finish unified:

In [58]:
from copy import deepcopy
import itertools

col_list = ['palate_tokenized', 'nose_tokenized', 'finish_tokenized']
sense_df = new_whisk3[col_list]

whisk_unified = deepcopy(new_whisk3)
whisk_unified['token_unified'] = sense_df.apply(lambda x: list(itertools.chain(*x)), axis = 1)
whisk_unified.drop(columns = col_list, inplace=True)


whisk_unified[['name', 'token_unified']]

Unnamed: 0,name,token_unified
0,Singleton of Dufftown 12 Year Old,"[orange_zest, malty_core, nut, oak, toffee, cu..."
1,Laphroaig 10 Year Old Sherry Oak Finish,"[roasted, cedar, peat_smoke, iodine, away, dar..."
2,GlenAllachie 15 Year Old,"[walnut, raisin, christmas, spice, fresh, ging..."
3,Seaweed & Aeons & Digging & Fire & Cask Streng...,"[coastal, core, solid, seaweed, salt, slightly..."
4,Edradour 10 Year Old,"[cloying, seductive, murky, rum, barley, almon..."
...,...,...
8597,Caol Ila 12 Year Old,"[oil, tar, elegant, smoke, sweet, fresh, herba..."
8598,Bowmore 15 Year Old,"[rich, wood, pine_oil, syrup, cream, toffee, m..."
8599,anCnoc 24 Year Old,"[rich, citrus, forward, oodle, lemon, peel, ma..."
8600,Tobermory 12 Year Old,"[rounded, malt, oil, hearty_helping, stone_fru..."


In [59]:
whisk_unified['token_unified'][5]

['soft',
 'supple',
 'sherry',
 'nutty',
 'sweet',
 'malt',
 'juicy',
 'sultana',
 'slightly',
 'coastal',
 'fresh',
 'sweet',
 'seaweed',
 'malt',
 'sherry',
 'mochaccino',
 'herbal',
 'balanced',
 'salt',
 'tang']

OK this worked. Let's save the whisk_unified dataframe to file:

In [60]:
whisk_unified.to_csv("data\\interim\\whisk_unified_tokenized.csv")

It'd also be wise to create a gensim dictionary for the unified token set:

In [61]:
gensim_dict_unified = Dictionary(list(whisk_unified['token_unified']))
pickle_path_unified = "dictionaries\\gemsimdict_unified.pkl"
with open(pickle_path_unified, 'wb') as f:
    pickle.dump(gensim_dict_unified, f)

### Corpus creation

gensim_dictionary_list (a Python dict, despite its name) holds the gensim dictionaries built from the palate, nose, and finish notes. A quick look at them shows that each corpus has a lot of unique tokens.

In [62]:
for key,values in gensim_dictionary_list.items():

    print(values)

Dictionary(2743 unique tokens: ['cut_grass', 'malty_core', 'nut', 'oak', 'orange_zest']...)
Dictionary(2913 unique tokens: ['barley', 'butter', 'cereal', 'hay', 'malt']...)
Dictionary(1956 unique tokens: ['fruit', 'good_length', 'oak', 'rich', 'balanced']...)


We look at the same for the unified corpus:

In [63]:
print(gensim_dict_unified)

Dictionary(4129 unique tokens: ['barley', 'butter', 'cereal', 'cut_grass', 'fruit']...)


We have a lot of unique tokens, but in order to reduce the dimensionality of the feature set we may want to discard tokens that appear exceedingly rarely or that occur in far too many documents. For the per-sense dictionaries we set the lower cutoff at 1% (a token must appear in at least 1% of the documents) and the upper cutoff at 50%. The unified dictionary further below uses a lower cutoff of 0.6% with the same upper cutoff.

In [64]:
reduced_gensim_dict = deepcopy(gensim_dictionary_list)

for key,values in reduced_gensim_dict.items():

    n = len(new_whisk3) # size of each corpus
    values.filter_extremes(no_below=0.01*n, no_above=0.5)
    print(values)

Dictionary(146 unique tokens: ['nut', 'oak', 'toffee', 'cedar', 'dark_chocolate']...)
Dictionary(145 unique tokens: ['barley', 'butter', 'cereal', 'hay', 'malt']...)
Dictionary(80 unique tokens: ['fruit', 'good_length', 'oak', 'rich', 'finish']...)
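
Note that `filter_extremes` takes `no_below` as an absolute number of documents and `no_above` as a fraction of the corpus, which is why the lower cutoff is multiplied by `n` above. As a rough illustration of the filtering rule, here is a small standard-library sketch; it is a stand-in for the idea, not gensim's code, and the function name `filter_by_document_frequency` and the toy documents are assumptions made for this example.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents (absolute count)
    and in at most a `no_above` fraction of all documents."""
    n_docs = len(docs)
    # count each token once per document
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = int(no_above * n_docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# hypothetical toy corpus
toy_docs = [
    ["oak", "sweet", "peat"],
    ["oak", "sweet"],
    ["oak", "honey"],
    ["oak", "sweet", "honey"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

In this toy corpus "oak" is dropped for appearing in every document and "peat" for appearing in only one, leaving "honey" and "sweet".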


In [80]:
n = len(whisk_unified) # size of the unified corpus (same number of documents as new_whisk3)
gensim_dict_unified.filter_extremes(no_below=0.006*n, no_above=0.5)
print(gensim_dict_unified)
reduced_unified_dict_path = "dictionaries\\reduced_gemsimdict_unified.pkl"
with open(reduced_unified_dict_path, 'wb') as f:
    pickle.dump(gensim_dict_unified, f)

Dictionary(477 unique tokens: ['barley', 'butter', 'cereal', 'cut_grass', 'fruit']...)


There's a substantial reduction in the number of tokens. Let's see what these tokens are for the unified token set.

In [66]:
token_names = list(gensim_dict_unified.values())
print(token_names)

['barley', 'butter', 'cereal', 'cut_grass', 'fruit', 'good_length', 'hay', 'malt', 'nut', 'oak', 'orange_zest', 'rich', 'sweet', 'toast', 'toffee', 'walnut', 'away', 'balanced', 'bbq', 'cedar', 'charred_oak', 'coffee', 'dark_chocolate', 'finish', 'honey', 'iodine', 'lemon', 'maple_syrup', 'meat', 'peat', 'peat_smoke', 'roasted', 'sherry', 'smidge', 'vanilla_pod', 'big', 'chocolate', 'christmas', 'date', 'fresh', 'ginger', 'ice_cream', 'peanut_brittle', 'raisin', 'sea_salt', 'spice', 'turkish_delight', 'apricot_jam', 'cinnamon', 'coastal', 'core', 'gingerbread', 'glass', 'mocha', 'oatcake', 'peanut', 'salt', 'seaweed', 'slightly', 'sultana', 'support', 'wave', 'almond', 'fruitcake', 'great', 'medium', 'rum', 'vanilla', 'herbal', 'juicy', 'nutty', 'soft', 'supple', 'tang', 'aroma', 'backdrop', 'bitter', 'bread', 'freshly', 'golden_syrup', 'lead', 'lime', 'mixed_peel', 'oil', 'orange', 'present', 'pudding', 'rubber', 'suggestion', 'syrup', 'toasted', 'bean', 'cassia', 'cherry', 'cigar_box

These are primarily sense descriptors, and while some variants of the same word are still present in the corpus, our lemmatization/stemming/bigram-fitting seems to have done a decent job. There are of course some words that are not sense related: 'bit', 'age', 'element', 'mid', 'hand'. The hope is that our modeling/dimensionality reduction will allow us to group these words into their own set of categories distinct from the sense descriptors.

Now let's convert these tokenized documents to a Bag of Words feature set.

In [67]:
def gen_sensory_BoW_df(data, gensim_dicts):

    bow_df_list = []
    
    for key,value in gensim_dicts.items():
        sense_col = key + "_tokenized"

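        # sparse BoW: each document becomes a list of (token_id, count) pairs
        corpus = [value.doc2bow(doc) for doc in data[sense_col]]

        # prefix each token with its sense so columns from different notes stay distinct
        token_names = [key + ": " + word for word in value.values()]
        # corpus2dense returns a (terms x documents) array, so transpose to get one row per whisky
        bow_df = pd.DataFrame(corpus2dense(corpus, num_terms = len(value.token2id)).T, columns = token_names)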
        bow_df_list.append(bow_df)
    
    sensory_bow_df = pd.concat(bow_df_list, axis = 1)
    sensory_bow_df['name'] = data.name

    return sensory_bow_df
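
As a rough picture of what the function above does for a single document, the sketch below builds a sparse bag-of-words and then a dense count row using the standard library only. The helper names `doc_to_bow` and `bow_to_dense_row` and the three-token `token2id` mapping are hypothetical; in the notebook this work is done by the gensim dictionary and `corpus2dense`.

```python
from collections import Counter

def doc_to_bow(doc, token2id):
    """Sparse bag-of-words: sorted (token_id, count) pairs for known tokens only."""
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

def bow_to_dense_row(bow, num_terms):
    """Expand the sparse pairs into one dense row with a slot per dictionary token."""
    row = [0.0] * num_terms
    for token_id, count in bow:
        row[token_id] = float(count)
    return row

# hypothetical reduced dictionary; "sweet" is assumed to have been filtered out
token2id = {"malt": 0, "oak": 1, "sherry": 2}
doc = ["sherry", "malt", "sweet", "sherry"]

bow = doc_to_bow(doc, token2id)
row = bow_to_dense_row(bow, len(token2id))
print(bow, row)
```

Tokens that were removed from the dictionary simply contribute nothing to the row, which is how the frequency filtering reduces the width of the final feature set.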
    

In [68]:
term_freq_df = gen_sensory_BoW_df(new_whisk3, reduced_gensim_dict)
term_freq_df.head()

Unnamed: 0,palate: nut,palate: oak,palate: toffee,palate: cedar,palate: dark_chocolate,palate: honey,palate: peat_smoke,palate: vanilla_pod,palate: fresh,palate: ginger,...,finish: subtly,finish: peach,finish: white,finish: cedar,finish: nut,finish: floral,finish: brown_sugar,finish: butterscotch,finish: juicy,name
0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Singleton of Dufftown 12 Year Old
1,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Laphroaig 10 Year Old Sherry Oak Finish
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,GlenAllachie 15 Year Old
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Seaweed & Aeons & Digging & Fire & Cask Streng...
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Edradour 10 Year Old


We created a BoW with separate dictionaries/corpuses for the three types of tasting notes (palate, nose, and finish). This might be the way to go, so we merge it with the rest of the whisky data and save this non-unified BoW dataframe to file:

In [69]:
cols_to_drop = ["palate_tokenized", "nose_tokenized", "finish_tokenized"]
new_whisk4 = pd.concat([new_whisk3.set_index('name'), term_freq_df.set_index('name')], axis = 1).reset_index().drop(columns = cols_to_drop)

In [70]:
new_whisk4.head()

Unnamed: 0,name,region,distillery,bottler,age,ABV,chill_filter,cask_strength,maturation_bourbon,maturation_brandy,...,finish: apricot,finish: subtly,finish: peach,finish: white,finish: cedar,finish: nut,finish: floral,finish: brown_sugar,finish: butterscotch,finish: juicy
0,Singleton of Dufftown 12 Year Old,Speyside Whisky,Dufftown,Dufftown,12.0,40.0,,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Laphroaig 10 Year Old Sherry Oak Finish,Islay Whisky,Laphroaig,Laphroaig,10.0,48.0,,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,GlenAllachie 15 Year Old,Speyside Whisky,GlenAllachie,GlenAllachie,15.0,46.0,,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Seaweed & Aeons & Digging & Fire & Cask Streng...,Islay Whisky,Seaweed & Aeons & Digging & Fire,Seaweed & Aeons & Digging & Fire,10.0,58.3,,True,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Edradour 10 Year Old,Highland Whisky,Edradour,Edradour,10.0,40.0,,,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [71]:
new_whisk4.to_csv("data\\interim\\whisk_bow_encoded.csv")

In [72]:
new_whisk4.columns

Index(['name', 'region', 'distillery', 'bottler', 'age', 'ABV', 'chill_filter',
       'cask_strength', 'maturation_bourbon', 'maturation_brandy',
       ...
       'finish: apricot', 'finish: subtly', 'finish: peach', 'finish: white',
       'finish: cedar', 'finish: nut', 'finish: floral', 'finish: brown_sugar',
       'finish: butterscotch', 'finish: juicy'],
      dtype='object', length=391)

Let's also create a BoW for the unified token set where the descriptors for palate, nose, and finish were put into the same corpus.

In [73]:

corpus_unified = [gensim_dict_unified.doc2bow(doc) for doc in whisk_unified['token_unified']]
token_names = list(gensim_dict_unified.values())
bow_df_unified = pd.DataFrame(corpus2dense(corpus_unified, num_terms = len(gensim_dict_unified.token2id)).T, columns = token_names)

bow_df_unified['name'] = whisk_unified.name

cols_to_drop = ['token_unified']
bow_unified_final = pd.concat([whisk_unified.set_index('name'), bow_df_unified.set_index('name')], axis = 1).reset_index().drop(columns = cols_to_drop)


In [74]:
bow_unified_final.head()

Unnamed: 0,name,region,distillery,bottler,age,ABV,chill_filter,cask_strength,maturation_bourbon,maturation_brandy,...,exotic_spice,strawberry_jam,cracker,fizz,tangerine,porridge_oats,right,salty_butter,coriander,lightly
0,Singleton of Dufftown 12 Year Old,Speyside Whisky,Dufftown,Dufftown,12.0,40.0,,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Laphroaig 10 Year Old Sherry Oak Finish,Islay Whisky,Laphroaig,Laphroaig,10.0,48.0,,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,GlenAllachie 15 Year Old,Speyside Whisky,GlenAllachie,GlenAllachie,15.0,46.0,,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Seaweed & Aeons & Digging & Fire & Cask Streng...,Islay Whisky,Seaweed & Aeons & Digging & Fire,Seaweed & Aeons & Digging & Fire,10.0,58.3,,True,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Edradour 10 Year Old,Highland Whisky,Edradour,Edradour,10.0,40.0,,,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's save this dataframe to disk:

In [75]:
bow_unified_final.to_csv("data\\interim\\whiskunified_bow_encoded.csv")