# Cocktail Recommendation

Goal: create a cocktail recommender.

1. Scrape Reddit for usernames and the cocktails they order.
2. Create a recommender using association rules, kNN, or a clustering method.
3. Build a user interface where you select cocktails you like from a list and get recommendations back.
4. Scrape the internet for cocktail recipes and build a database of recipes.
5. Calculate the alcohol by volume (ABV) and strength category of each cocktail.
6. Create a more robust recommender that references the recipes as well as the associations.
7. Extend the user interface to save previous recommendation requests and to store whether or not you liked a cocktail, which introduces disliking a cocktail as a new feature.

## Step 1: Gathering a list of cocktails

### Step 1a: Scraping Reddit for cocktails

We will use PRAW, the Python Reddit API Wrapper, to scrape two Reddit posts and build a list of cocktails people enjoy ordering. The hard part is picking the cocktail names out of people's free-text responses.

In [1]:
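import os

import pandas as pd
import praw
from praw.models import MoreComments

# Read the API credentials from environment variables instead of hardcoding secrets in the notebook.
# The variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholders; use whatever you export.
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="scraps",
)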

# first url 
url = "https://www.reddit.com/r/cocktails/comments/178syjl/what_are_your_top_3_favorite_cocktails_and_why/"
submissions = reddit.submission(url=url)
posts = []
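for top_level_comment in submissions.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    # comments from deleted accounts have no author, so skip them to avoid an AttributeError
    if top_level_comment.author is not None:
        posts.append((top_level_comment.author.name, top_level_comment.body))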

# second url
url = "https://www.reddit.com/r/cocktails/comments/13ihzpy/rcocktails_top_50/"
submissions = reddit.submission(url=url)
for top_level_comment in submissions.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    if top_level_comment.author is not None:
        posts.append((top_level_comment.author.name, top_level_comment.body))

# convert posts to dataframe with redditor id and text.
posts = pd.DataFrame(posts,columns=["id", "body"])
# remove all deleted posts
indexNames = posts[(posts.body == '[removed]') | (posts.body == '[deleted]')].index
posts.drop(indexNames, inplace=True)
posts


Unnamed: 0,id,body
0,jaykaner411,1. Daiquiri - Especially with good rum (probit...
1,dwchambers,This is tough! And I’ve hardly tried a great n...
2,MizLucinda,"Paper plane - equal parts, all magic. I genera..."
3,thehza4,1. Sazerac. Just perfection. Love the bourbon ...
4,_makebuellerproud_,At the moment? \n\n1.) a well made dirty gin m...
...,...,...
481,aproposofnothing32,Bees Knees\n\nBlack Manhattan\n\nMargarita \n\...
482,woelajilliams,"Negroni, Mai Tai, Army & Navy, Queen's Park Sw..."
483,thecal714,Daiquiri\n\nWhiskey Sour\n\nSazerac\n\nPaper J...
484,CovfefeFan,Nergroni\nManhattan\nOld Fashioned\nMartini\nL...


In [2]:
# Preprocess the text in each post to help the NER model detect cocktails more accurately.
processedText = []
# Characters commonly used to end a sentence or start a new bullet. Each one is replaced
# with ' . ' so that numbers and special characters are separated from the cocktail
# names and the model does not mistake them for part of a name.
to_replace = ['.', '-', '\n', ',', '/']
for post in posts['body']:
    for c in to_replace:
        post = post.replace(c, ' . ')
    post = post.lower()
    processedText.append(post)

posts["preprocessed"] = processedText  # add the lowercased, separator-normalized text as a new column

posts


Unnamed: 0,id,body,preprocessed
0,jaykaner411,1. Daiquiri - Especially with good rum (probit...,1 . daiquiri . especially with good rum (pr...
1,dwchambers,This is tough! And I’ve hardly tried a great n...,this is tough! and i’ve hardly tried a great n...
2,MizLucinda,"Paper plane - equal parts, all magic. I genera...",paper plane . equal parts . all magic . i ...
3,thehza4,1. Sazerac. Just perfection. Love the bourbon ...,1 . sazerac . just perfection . love the bo...
4,_makebuellerproud_,At the moment? \n\n1.) a well made dirty gin m...,at the moment? . . 1 . ) a well made dirty g...
...,...,...,...
481,aproposofnothing32,Bees Knees\n\nBlack Manhattan\n\nMargarita \n\...,bees knees . . black manhattan . . margarita...
482,woelajilliams,"Negroni, Mai Tai, Army & Navy, Queen's Park Sw...",negroni . mai tai . army & navy . queen's p...
483,thecal714,Daiquiri\n\nWhiskey Sour\n\nSazerac\n\nPaper J...,daiquiri . . whiskey sour . . sazerac . . p...
484,CovfefeFan,Nergroni\nManhattan\nOld Fashioned\nMartini\nL...,nergroni . manhattan . old fashioned . martini...
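
The preprocessing step can be tried on a single string without the Reddit data. This is a minimal sketch: the names `SEPARATORS` and `preprocess` and the sample text are illustrative and not part of the notebook, but the replacement logic mirrors the cell above.

```python
# Same separator list as the notebook cell. '.' comes first on purpose:
# every later replacement inserts a '.', and those inserted dots are not expanded again.
SEPARATORS = ['.', '-', '\n', ',', '/']

def preprocess(text, separators=SEPARATORS):
    """Replace each separator with ' . ' and lowercase the result."""
    for sep in separators:
        text = text.replace(sep, ' . ')
    return text.lower()

sample = "Negroni, Mai Tai\nPaper Plane"  # made-up comment text
tokens = preprocess(sample).split()
print(tokens)
```

Splitting on whitespace afterwards shows that every cocktail name is now fenced off by a standalone `.` token, which is the boundary signal the NER model relies on.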


### Step 1b: Using a spaCy NER model to classify cocktails from a body of text.

In [3]:
!pip install spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
import spacy

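# "model-best" is the directory spaCy's training CLI writes for the best checkpoint;
# here it is assumed to hold the custom NER pipeline trained to tag cocktail names.
nlp = spacy.load("model-best")

# Run the NER model over each post and collect the text of every detected entity.
detected_cocktails = []
for post in posts['preprocessed']:
    doc = nlp(post)
    detected_cocktails.append([ent.text for ent in doc.ents])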

posts['processedNER'] = detected_cocktails
posts

Unnamed: 0,id,body,preprocessed,processedNER
0,jaykaner411,1. Daiquiri - Especially with good rum (probit...,1 . daiquiri . especially with good rum (pr...,"[daiquiri, manhattan, highball]"
1,dwchambers,This is tough! And I’ve hardly tried a great n...,this is tough! and i’ve hardly tried a great n...,"[jingle bird, naked and famous, sazerac, la be..."
2,MizLucinda,"Paper plane - equal parts, all magic. I genera...",paper plane . equal parts . all magic . i ...,"[paper plane, boulevardier, negroni, margarita..."
3,thehza4,1. Sazerac. Just perfection. Love the bourbon ...,1 . sazerac . just perfection . love the bo...,"[sazerac, grog, paloma, oaxacan old fashioned]"
4,_makebuellerproud_,At the moment? \n\n1.) a well made dirty gin m...,at the moment? . . 1 . ) a well made dirty g...,"[dirty gin martini, old fashioned, maple syrup..."
...,...,...,...,...
481,aproposofnothing32,Bees Knees\n\nBlack Manhattan\n\nMargarita \n\...,bees knees . . black manhattan . . margarita...,"[bees knees, black manhattan, margarita, new y..."
482,woelajilliams,"Negroni, Mai Tai, Army & Navy, Queen's Park Sw...",negroni . mai tai . army & navy . queen's p...,"[negroni, mai tai]"
483,thecal714,Daiquiri\n\nWhiskey Sour\n\nSazerac\n\nPaper J...,daiquiri . . whiskey sour . . sazerac . . p...,"[daiquiri, whiskey sour, sazerac, paper jam, c..."
484,CovfefeFan,Nergroni\nManhattan\nOld Fashioned\nMartini\nL...,nergroni . manhattan . old fashioned . martini...,"[nergroni, manhattan, old fashioned, martini, ..."


### Step 1c: Cleaning the resulting list

In [5]:
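# remove rows where the list is empty; .copy() gives an independent DataFrame so that
# adding columns later does not trigger pandas' SettingWithCopyWarning
posts = posts[posts["processedNER"].str.len() != 0].copy()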

# removing all extraneous characters from cocktail names
import re
import string
from unidecode import unidecode

# Normalize a detected cocktail name: strip accents, bullets, the word "the", and punctuation.
def replace_spec(mystring):
    # treat the bullet as punctuation too
    special_char = string.punctuation + '•'

    # unidecode converts diacritic characters to ASCII (i.e. removes accent marks)
    mystring = unidecode(mystring)

    # sometimes we have a bullet followed by a space, so remove those first
    mystring = mystring.replace('• ', '')
    mystring = mystring.replace('&', 'and')
    # drop the standalone word "the"; a plain replace('the ', '') would also cut into words ending in "the"
    mystring = re.sub(r'\bthe\s+', '', mystring)

    for c in special_char:
        mystring = mystring.replace(c, '')
    # collapse repeated whitespace and trim both ends
    return " ".join(mystring.split())

postprocessed = []
for l in posts['processedNER']:
    temp = []
    for item in l:
        cocktail = replace_spec(item)
        temp.append(cocktail)
    postprocessed.append(temp)

posts['postprocessed'] = postprocessed

#posts

# Flatten the per-post lists into one list so we can count how often each cocktail appears.
# Names that appear too few times can be removed later.
all_cocktails = []
for cocktails in posts['postprocessed']:
    for item in cocktails:
        all_cocktails.append(item)

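
The name normalization in `replace_spec` can also be approximated with the standard library alone. The sketch below is an assumption-laden alternative, not the notebook's method: `normalize_name` is a hypothetical helper, and it swaps `unidecode` for `unicodedata.normalize`, which strips accent marks but does not transliterate characters that have no decomposition (for example `ø` or `ß`).

```python
import re
import string
import unicodedata

def normalize_name(raw):
    """Lowercase a cocktail name and strip accents, bullets, 'the', and punctuation."""
    # Decompose accented characters, then drop the combining marks (é -> e).
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().replace("&", " and ")
    # Remove the standalone word "the" without touching words that merely contain it.
    text = re.sub(r"\bthe\b", " ", text)
    # Delete ASCII punctuation and the bullet character.
    text = re.sub(f"[{re.escape(string.punctuation)}•]", "", text)
    # Collapse repeated whitespace and trim both ends.
    return " ".join(text.split())

for raw in ["• the Last Word!", "Piña Colada", "Army & Navy", "Corpse Reviver #2"]:
    print(repr(raw), "->", repr(normalize_name(raw)))
```

Unlike the notebook function, this version also lowercases its input, so it can be applied to raw text as well as to the already lowercased NER output.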


In [6]:
df = pd.DataFrame(all_cocktails, columns=['name'])
# Count how often each name occurs, most frequent first (ties broken alphabetically).
ncocktails = (
    df['name']
    .value_counts()
    .rename_axis('name')
    .reset_index(name='count')
    .sort_values(by=['count', 'name'], ascending=[False, True])
    .reset_index(drop=True)
)

In [7]:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# ncocktails is sorted by count, so row i is always at least as frequent as any later row j.
for i in range(len(ncocktails)):
    for j in range(i + 1, len(ncocktails)):
        name_i = ncocktails.loc[i, 'name']
        name_j = ncocktails.loc[j, 'name']
        # if the name at j is similar to the name at i, but not identical, treat it as a misspelling
        if 0.88 < similar(name_i, name_j) < 1:
            print('dropped: ' + name_j + ' ---- count: ' + str(ncocktails.loc[j, 'count']))
            print(' and replaced with ' + name_i + ' ---- count: ' + str(ncocktails.loc[i, 'count']))
            # overwrite the rarer spelling at j with the more frequent spelling at i
            ncocktails.loc[j, 'name'] = name_i


dropped: old fashion ---- count: 3
 and replaced with old fashioned ---- count: 162
dropped: old fasioned ---- count: 1
 and replaced with old fashioned ---- count: 162
dropped: nergroni ---- count: 1
 and replaced with negroni ---- count: 158
dropped: margaritas ---- count: 2
 and replaced with margarita ---- count: 102
dropped: margharita ---- count: 1
 and replaced with margarita ---- count: 102
dropped: daquiri ---- count: 8
 and replaced with daiquiri ---- count: 101
dropped: manhatten ---- count: 1
 and replaced with manhattan ---- count: 82
dropped: jingle bird ---- count: 1
 and replaced with jungle bird ---- count: 68
dropped: boulvardier ---- count: 5
 and replaced with boulevardier ---- count: 46
dropped: boulavardier ---- count: 1
 and replaced with boulevardier ---- count: 46
dropped: boulivardier ---- count: 1
 and replaced with boulevardier ---- count: 46
dropped: naked a famous ---- count: 1
 and replaced with naked and famous ---- count: 41
dropped: naked ans famous --
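
To see why 0.88 works as a threshold, the same merge can be run on a small dictionary of counts. This is a sketch under stated assumptions: `merge_variants` is a hypothetical helper, and the counts are a hand-picked subset of the names printed above. `SequenceMatcher.ratio()` returns `2 * M / T`, where `M` is the number of matched characters and `T` is the combined length of both strings.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def merge_variants(counts, threshold=0.88):
    """Fold rare spellings into the most frequent similar name and sum their counts."""
    # Visit names from most to least frequent (ties alphabetical), as the notebook does.
    names = sorted(counts, key=lambda n: (-counts[n], n))
    merged = {}
    for name in names:
        # Reuse the first already-kept name that is similar enough, else keep this name.
        target = next((kept for kept in merged if similar(kept, name) > threshold), name)
        merged[target] = merged.get(target, 0) + counts[name]
    return merged

counts = {'old fashioned': 162, 'negroni': 158, 'old fashion': 3, 'nergroni': 1, 'old': 1}
print(merge_variants(counts))
print(round(similar('negroni', 'nergroni'), 3), round(similar('old fashioned', 'old'), 3))
```

One-letter typos such as `nergroni` score above 0.9 and get merged, while a truncated name such as `old` scores far below the threshold, which is why those cases still need the manual pass that follows.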

In [8]:
# Sum the counts of rows that now share a name, sort by count, and reset the index.
cocktail_names = (
    ncocktails.groupby('name', as_index=False)['count']
    .sum()
    .sort_values('count', ascending=False)
    .reset_index(drop=True)
)
cocktail_names

Unnamed: 0,name,count
0,old fashioned,166
1,negroni,159
2,daiquiri,109
3,margarita,105
4,last word,102
...,...,...
255,midnight marauder,1
256,midori slipper,1
257,midori sour,1
258,milk punch,1


In [34]:
# Print every cocktail name with its index so the list can be checked by hand for misdetected or misformatted names.
for index, name in enumerate(cocktail_names['name']):
    print(name + ', ' + str(index))

# Map misdetected or abbreviated names to the name we want, by hand.
manual_fixes = {
    'dark and stormy': 'dark n stormy',
    'ny sour': 'new york sour',
    'old': 'old fashioned',
    'ramos': 'ramos gin fizz',
    'oaxaca of': 'oaxaca old fashioned',
    'sazarac': 'sazerac',
    'ramos fizz': 'ramos gin fizz',
    'old fasioned': 'old fashioned',
    'coffee negroni w': 'coffee negroni',
    'a spritz': 'aperol spritz',
    'maple syrup old fashioned': 'maple old fashioned',
    'maple whiskey sours': 'maple whiskey sour',
    'mezcal marg': 'mezcal margarita',
    'hemingway daq': 'hemingway daiquiri',
    'gin style old fashioned': 'gin old fashioned',
    'gin tonic': 'gin and tonic',
    'hawaiian': 'blue hawaiian',
    'hunter s': 'hunters cocktail',
    'maguerita': 'margarita',
    'dark maple syrup in': 'toronto',
}
cocktail_names['name'] = cocktail_names['name'].replace(manual_fixes)

# Merge rows that now share a name, summing their counts, then sort and reset the index.
cocktail_names = (
    cocktail_names.groupby('name', as_index=False)['count']
    .sum()
    .sort_values('count', ascending=False)
    .reset_index(drop=True)
)

# Despite its name, cocktail_dict is a plain list of the accepted cocktail names.
cocktail_dict = cocktail_names['name'].tolist()

old fashioned, 0
negroni, 1
daiquiri, 2
margarita, 3
last word, 4
mai tai, 5
manhattan, 6
paper plane, 7
jungle bird, 8
boulevardier, 9
sazerac, 10
martini, 11
naked and famous, 12
penicillin, 13
vieux carre, 14
black manhattan, 15
painkiller, 16
corpse reviver 2, 17
paloma, 18
caipirinha, 19
whiskey sour, 20
gin and tonic, 21
french 75, 22
sidecar, 23
mojito, 24
bees knees, 25
saturn, 26
new york sour, 27
aviation, 28
amaretto sour, 29
hemingway daiquiri, 30
trinidad sour, 31
oaxaca old fashioned, 32
pina colada, 33
espresso martini, 34
gold rush, 35
ramos gin fizz, 36
la louisiane, 37
tom collins, 38
aperol spritz, 39
singapore sling, 40
industry sour, 41
clover club, 42
bijou, 43
pisco sour, 44
gimlet, 45
mezcal margarita, 46
dark n stormy, 47
kingston negroni, 48
toronto, 49
gin martini, 50
ti punch, 51
moscow mule, 52
white russian, 53
pina verde, 54
final ward, 55
cosmopolitan, 56
enzoni, 57
jet pilot, 58
earl gray marteani, 59
jack rose, 60
gin basil smash, 61
el presidente, 62


In [35]:
# Create a new dataframe with only id, body, and postprocessed, renaming postprocessed
# to recommendations. rename() returns a new DataFrame, which avoids the
# SettingWithCopyWarning raised by renaming a slice in place.
cocktail_recs = posts[['id', 'body', 'postprocessed']].rename(columns={'postprocessed': 'recommendations'})



In [11]:
!pip install pyspellchecker


In [18]:
import json

# convert the cocktail_names dataframe into a {name: count} JSON string to use as the spell checker's word-frequency dictionary
cocktail_corpus = cocktail_names.set_index('name')['count'].to_json()

# writing to cocktail_corpus.json
with open("cocktail_corpus.json", "w") as outfile:
    outfile.write(cocktail_corpus)

from spellchecker import SpellChecker

def spellchecker(words):
    # language=None starts from an empty vocabulary; we then load only our cocktail names
    spell = SpellChecker(language=None, case_sensitive=False, distance=1)
    spell.word_frequency.load_dictionary('./cocktail_corpus.json')

    new_list = []
    for word in words:
        correction = spell.correction(word)
        if correction is None:
            print(word + ' is not in our dictionary!')
            inp = input('Type the word if you want to keep it, or type what you want to replace it with: ')
            if inp in cocktail_dict:
                print('Replaced ' + word + ' with ' + inp + '.')
            else:
                cocktail_dict.append(inp)
                print(inp + ' added.')
            # keep the user's answer whether it was already known or newly added
            new_list.append(inp)
            continue
        new_list.append(correction)

    return new_list

spellchecker(['daquiri', 'black manhatten', 'hawaiian', 'daiquiri', 'ramos', 'ramos'])


hawaiian is not in our dictionary!


Replaced hawaiian with blue hawaiian.
ramos is not in our dictionary!
Replaced ramos with ramos gin fizz.
ramos is not in our dictionary!
Replaced ramos with ramos gin fizz.


['daiquiri',
 'black manhattan',
 'blue hawaiian',
 'daiquiri',
 'ramos gin fizz',
 'ramos gin fizz']

In [38]:
# Spell checker that maps a detected name onto a known cocktail and asks the user when no good match exists.
def spellcheck(mystring):
    if mystring in cocktail_dict:
        return mystring
    # if the cocktail is not in the list, replace it with the first known name that is similar enough
    for cocktail in cocktail_dict:
        if similar(mystring, cocktail) > 0.83:
            print('Replaced ' + mystring + ' with ' + cocktail + '.')
            return cocktail

    # this list holds weaker matches that might still be possibilities
    possibilities = []
    for cocktail in cocktail_dict:
        score = similar(mystring, cocktail)
        if score > 0.5:
            possibilities.append(cocktail)
            print(mystring + " corrections: " + cocktail + " ----- score: " + str(score))

    # exactly one candidate: accept it without asking
    if len(possibilities) == 1:
        print('Replaced ' + mystring + ' with ' + possibilities[0] + '.')
        return possibilities[0]

    # otherwise ask the user to correct the spelling or to add a new cocktail to the list
    print(mystring + ' is not in our dictionary! Here are some possibilities: ' + ", ".join(possibilities))
    inp = input('Type the word if you want to keep it, or type what you want to replace it with: ')
    if inp in cocktail_dict:
        print('Replaced ' + mystring + ' with ' + inp + '.')
        return inp
    cocktail_dict.append(inp)
    print(inp + ' added.')
    # return the newly added name so the caller does not receive None
    return inp

def spellcheck_list(words):
    # spell-check every name in the list, preserving order
    return [spellcheck(word) for word in words]

spellcheck_list(['daquiri', 'black manhatten', 'hawaiian', 'daiquiri', 'ramos', 'ramos'])

Replaced daquiri with daiquiri.
Replaced black manhatten with black manhattan.
hawaiian corrections: blue hawaiian ----- score: 0.7619047619047619
Replaced hawaiian with blue hawaiian.
ramos corrections: ramos gin fizz ----- score: 0.5263157894736842
ramos corrections: champs ----- score: 0.5454545454545454
ramos corrections: rosita ----- score: 0.5454545454545454
ramos is not in our dictionary! Here are some possibilities: ramos gin fizz, champs, rosita
Replaced ramos with ramos gin fizz.
ramos corrections: ramos gin fizz ----- score: 0.5263157894736842
ramos corrections: champs ----- score: 0.5454545454545454
ramos corrections: rosita ----- score: 0.5454545454545454
ramos is not in our dictionary! Here are some possibilities: ramos gin fizz, champs, rosita
Replaced ramos with ramos gin fizz.


['daiquiri',
 'black manhattan',
 'blue hawaiian',
 'daiquiri',
 'ramos gin fizz',
 'ramos gin fizz']
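
The threshold loop in `spellcheck` returns the first known name that clears 0.83, which is not necessarily the closest one. A non-interactive variant that always picks the best-scoring name can be sketched with `difflib.get_close_matches`, which ranks candidates by the same `SequenceMatcher` ratio. The helper name `best_match` and the short `known` list are assumptions for illustration only.

```python
from difflib import get_close_matches

def best_match(name, known_names, cutoff=0.83):
    """Return the known name most similar to `name`, or None if nothing reaches the cutoff."""
    if name in known_names:
        return name
    # n=1 keeps only the highest-scoring candidate at or above the cutoff.
    matches = get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

known = ['daiquiri', 'manhattan', 'black manhattan', 'ramos gin fizz']  # toy vocabulary
for word in ['daquiri', 'black manhatten', 'ramos']:
    print(word, '->', best_match(word, known))
```

A `None` result marks the cases, such as `ramos`, that would still fall through to the interactive prompt.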

In [39]:
# temporary list to hold the list of recommendations. 
recommendations = []

# go through all the recommendations column lists and perform spell check. 
for l in cocktail_recs['recommendations']:
    recommendations.append(spellcheck_list(l))
cocktail_recs['recommendations'] = recommendations

cocktail_recs



Replaced jingle bird with jungle bird.
Replaced oaxacan old fashioned with oaxaca old fashioned.
Replaced maple syrup old fashioned with maple old fashioned.
sazarac corrections: sazerac ----- score: 0.8571428571428571
Replaced sazarac with sazerac.
old is not in our dictionary! Here are some possibilities: 
Replaced old with old fashioned.
Replaced earl grey marteani with earl gray marteani.
Replaced daquiri with daiquiri.
Replaced golf rush with gold rush.
Replaced margaritas with margarita.
Replaced pain killer with painkiller.
gin tonic corrections: gin and tonic ----- score: 0.8181818181818182
gin tonic corrections: gin martini ----- score: 0.7
gin tonic corrections: enzoni ----- score: 0.5333333333333333
gin tonic corrections: gin gimlet ----- score: 0.5263157894736842
gin tonic corrections: dirty gin martini ----- score: 0.5384615384615384
gin tonic corrections: aperol tonic ----- score: 0.5714285714285714
gin tonic corrections: gin sour ----- score: 0.5882352941176471
gin tonic



Unnamed: 0,id,body,recommendations
0,jaykaner411,1. Daiquiri - Especially with good rum (probit...,"[daiquiri, manhattan, highball]"
1,dwchambers,This is tough! And I’ve hardly tried a great n...,"[jungle bird, naked and famous, sazerac, la be..."
2,MizLucinda,"Paper plane - equal parts, all magic. I genera...","[paper plane, boulevardier, negroni, margarita..."
3,thehza4,1. Sazerac. Just perfection. Love the bourbon ...,"[sazerac, grog, paloma, oaxaca old fashioned]"
4,_makebuellerproud_,At the moment? \n\n1.) a well made dirty gin m...,"[dirty gin martini, old fashioned, maple old f..."
...,...,...,...
481,aproposofnothing32,Bees Knees\n\nBlack Manhattan\n\nMargarita \n\...,"[bees knees, black manhattan, margarita, new y..."
482,woelajilliams,"Negroni, Mai Tai, Army & Navy, Queen's Park Sw...","[negroni, mai tai]"
483,thecal714,Daiquiri\n\nWhiskey Sour\n\nSazerac\n\nPaper J...,"[daiquiri, whiskey sour, sazerac, paper jam, c..."
484,CovfefeFan,Nergroni\nManhattan\nOld Fashioned\nMartini\nL...,"[negroni, manhattan, old fashioned, martini, l..."
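
Each row's `recommendations` list is effectively one "basket" of cocktails that a single user likes, which is the input an association-style recommender needs. The following is only a rough sketch of that idea, not the recommender this project ends up with: the function names and the toy baskets are made up, and the score is the rule confidence `count(a and b) / count(a)` summed over the liked cocktails.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(baskets):
    """Count how often each cocktail, and each unordered pair, appears across baskets."""
    item_counts, pair_counts = Counter(), Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate so a repeated mention counts once
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts

def recommend(liked, item_counts, pair_counts, top_n=3):
    """Rank cocktails not yet liked by their summed confidence given the liked ones."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a in liked and b not in liked:
            scores[b] += n / item_counts[a]  # confidence of the rule a -> b
        elif b in liked and a not in liked:
            scores[a] += n / item_counts[b]  # confidence of the rule b -> a
    return [name for name, _ in scores.most_common(top_n)]

# Toy baskets standing in for cocktail_recs['recommendations'].
baskets = [
    ['negroni', 'boulevardier', 'manhattan'],
    ['negroni', 'boulevardier'],
    ['daiquiri', 'mai tai'],
    ['negroni', 'daiquiri'],
]
item_counts, pair_counts = build_cooccurrence(baskets)
print(recommend({'negroni'}, item_counts, pair_counts, top_n=1))
```

With the real data, `baskets` would be `cocktail_recs['recommendations'].tolist()`.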
