# Cocktail Recommendation

Goal: Create a cocktail recommender.

1. Scrape Reddit for usernames and the cocktails each user orders.
2. Create a recommender using association rules, k-nearest neighbors, or some form of clustering.
3. Build a user interface where you select cocktails you like from a list and get recommendations back.
4. Scrape the internet for cocktail recipes and build a database of recipes.
5. Calculate the alcohol by volume (and a strength category) of each cocktail.
6. Create a more robust recommender that references the recipes as well as the associations.
7. Extend the user interface to save previous recommendation requests and to store whether or not you liked a cocktail, which introduces disliking a cocktail as a new feature.

## Step 1: Gathering a list of cocktails

### Step 1a: Scraping Reddit for cocktails

We will use PRAW, the Python Reddit API Wrapper, to scrape two Reddit posts for a list of cocktails people enjoy ordering. The hard part is picking the cocktail names out of people's free-form responses.

In [5]:
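import os

import praw
import pandas as pd
from praw.models import MoreComments

# Read the API credentials from the environment instead of hard-coding them in the notebook.
# REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholder variable names.
reddit = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],
    user_agent='scraps',
)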

# first url 
url = "https://www.reddit.com/r/cocktails/comments/178syjl/what_are_your_top_3_favorite_cocktails_and_why/"
submissions = reddit.submission(url=url)
posts = []
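for top_level_comment in submissions.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    # comments from deleted accounts have no author, so skip them to avoid an AttributeError
    if top_level_comment.author is not None:
        posts.append((top_level_comment.author.name, top_level_comment.body))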

# second url
url = "https://www.reddit.com/r/cocktails/comments/13ihzpy/rcocktails_top_50/"
submissions = reddit.submission(url=url)
for top_level_comment in submissions.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    if top_level_comment.author is not None:
        posts.append((top_level_comment.author.name, top_level_comment.body))

# convert posts to dataframe with redditor id and text.
posts = pd.DataFrame(posts,columns=["id", "body"])
# remove all deleted posts
indexNames = posts[(posts.body == '[removed]') | (posts.body == '[deleted]')].index
posts.drop(indexNames, inplace=True)
posts


Unnamed: 0,id,body
0,jaykaner411,1. Daiquiri - Especially with good rum (probit...
1,dwchambers,This is tough! And I’ve hardly tried a great n...
2,MizLucinda,"Paper plane - equal parts, all magic. I genera..."
3,thehza4,1. Sazerac. Just perfection. Love the bourbon ...
4,_makebuellerproud_,At the moment? \n\n1.) a well made dirty gin m...
...,...,...
481,aproposofnothing32,Bees Knees\n\nBlack Manhattan\n\nMargarita \n\...
482,woelajilliams,"Negroni, Mai Tai, Army & Navy, Queen's Park Sw..."
483,thecal714,Daiquiri\n\nWhiskey Sour\n\nSazerac\n\nPaper J...
484,CovfefeFan,Nergroni\nManhattan\nOld Fashioned\nMartini\nL...
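
The two scraping loops above apply the same filtering rules, so they could share one helper. The sketch below is a hypothetical refactor, not part of the original notebook: `collect_comments` is a made-up name, and it relies on duck typing (it assumes "load more" placeholders expose no `body` attribute) so that it can be tried without a Reddit connection.

```python
from types import SimpleNamespace

def collect_comments(comments):
    """Return (author name, body) pairs for usable top-level comments."""
    rows = []
    for comment in comments:
        author = getattr(comment, "author", None)
        body = getattr(comment, "body", None)
        # Skip placeholders (no body), deleted accounts (no author),
        # and comments whose text was removed or deleted.
        if author is None or body is None or body in ("[removed]", "[deleted]"):
            continue
        rows.append((author.name, body))
    return rows

# Stand-in objects that mimic the attributes used above; these are not real PRAW objects.
fake_comments = [
    SimpleNamespace(author=SimpleNamespace(name="alice"), body="Negroni, Mai Tai"),
    SimpleNamespace(author=None, body="Daiquiri"),
    SimpleNamespace(author=SimpleNamespace(name="bob"), body="[deleted]"),
    SimpleNamespace(children=["abc"]),  # shaped like a "load more" placeholder
]
print(collect_comments(fake_comments))
```

With real data, the same function would be called once per submission and the results concatenated before building the dataframe.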


In [6]:
# Preprocess the text of each post to help our NLP model detect cocktails more accurately.
processedText = []
# Characters commonly used to end a sentence or start a new list item. Each one is
# replaced with ' . ' so that cocktail names are separated by a standalone token and
# the model does not have to decide whether a digit or symbol is part of a name.
to_replace = ['.', '-', '\n', ',', '/']
for post in posts['body']:
    for c in to_replace:
        post = post.replace(c, ' . ')
    post = post.lower()
    processedText.append(post)

posts["preprocessed"] = processedText  # add the separated, lowercased text as a new column in the posts dataframe

posts


Unnamed: 0,id,body,preprocessed
0,jaykaner411,1. Daiquiri - Especially with good rum (probit...,1 . daiquiri . especially with good rum (pr...
1,dwchambers,This is tough! And I’ve hardly tried a great n...,this is tough! and i’ve hardly tried a great n...
2,MizLucinda,"Paper plane - equal parts, all magic. I genera...",paper plane . equal parts . all magic . i ...
3,thehza4,1. Sazerac. Just perfection. Love the bourbon ...,1 . sazerac . just perfection . love the bo...
4,_makebuellerproud_,At the moment? \n\n1.) a well made dirty gin m...,at the moment? . . 1 . ) a well made dirty g...
...,...,...,...
481,aproposofnothing32,Bees Knees\n\nBlack Manhattan\n\nMargarita \n\...,bees knees . . black manhattan . . margarita...
482,woelajilliams,"Negroni, Mai Tai, Army & Navy, Queen's Park Sw...",negroni . mai tai . army & navy . queen's p...
483,thecal714,Daiquiri\n\nWhiskey Sour\n\nSazerac\n\nPaper J...,daiquiri . . whiskey sour . . sazerac . . p...
484,CovfefeFan,Nergroni\nManhattan\nOld Fashioned\nMartini\nL...,nergroni . manhattan . old fashioned . martini...
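
The character-by-character replacement above can also be written as a single regular expression. This is a minimal sketch of an equivalent formulation, assuming the same five separator characters; `preprocess` is a name introduced here for illustration.

```python
import re

# Same separators as the loop above: period, hyphen, newline, comma, slash.
SEPARATORS = re.compile(r"[.\-\n,/]")

def preprocess(text):
    """Replace every separator with ' . ' and lowercase the result."""
    return SEPARATORS.sub(" . ", text).lower()

print(preprocess("Negroni, Mai Tai/Daiquiri"))
```

A single pass avoids rescanning the string once per separator, which matters little for a few hundred posts but keeps the rule in one place.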


### Step 1b: Using a spaCy NER model to classify cocktails from a body of text.

In [7]:
%pip install spacy
!python -m spacy download en_core_web_sm


Note: you may need to restart the kernel to use updated packages.


In [8]:
import spacy

nlp = spacy.load("model-best")

# Run the NER model over each post and collect the text of every detected entity.
detected_cocktails = []
for post in posts['preprocessed']:
    doc = nlp(post)
    detected_cocktails.append([ent.text for ent in doc.ents])

posts['processedNER'] = detected_cocktails
posts

Unnamed: 0,id,body,preprocessed,processedNER
0,jaykaner411,1. Daiquiri - Especially with good rum (probit...,1 . daiquiri . especially with good rum (pr...,"[daiquiri, manhattan, highball]"
1,dwchambers,This is tough! And I’ve hardly tried a great n...,this is tough! and i’ve hardly tried a great n...,"[jingle bird, naked and famous, sazerac, la be..."
2,MizLucinda,"Paper plane - equal parts, all magic. I genera...",paper plane . equal parts . all magic . i ...,"[paper plane, boulevardier, negroni, margarita..."
3,thehza4,1. Sazerac. Just perfection. Love the bourbon ...,1 . sazerac . just perfection . love the bo...,"[sazerac, grog, paloma, oaxacan old fashioned]"
4,_makebuellerproud_,At the moment? \n\n1.) a well made dirty gin m...,at the moment? . . 1 . ) a well made dirty g...,"[dirty gin martini, old fashioned, maple syrup..."
...,...,...,...,...
481,aproposofnothing32,Bees Knees\n\nBlack Manhattan\n\nMargarita \n\...,bees knees . . black manhattan . . margarita...,"[bees knees, black manhattan, margarita, new y..."
482,woelajilliams,"Negroni, Mai Tai, Army & Navy, Queen's Park Sw...",negroni . mai tai . army & navy . queen's p...,"[negroni, mai tai]"
483,thecal714,Daiquiri\n\nWhiskey Sour\n\nSazerac\n\nPaper J...,daiquiri . . whiskey sour . . sazerac . . p...,"[daiquiri, whiskey sour, sazerac, paper jam, c..."
484,CovfefeFan,Nergroni\nManhattan\nOld Fashioned\nMartini\nL...,nergroni . manhattan . old fashioned . martini...,"[nergroni, manhattan, old fashioned, martini, ..."


### Step 1c: Data cleaning our resulting list

In [9]:
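# remove rows where the list is empty; .copy() gives us an independent dataframe,
# so adding columns later does not trigger a SettingWithCopyWarning
posts = posts[posts["processedNER"].str.len() != 0].copy()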

# removing all extraneous characters from cocktail names
import re
import string
from unidecode import unidecode

# defining a function to replace special characters. 
def replace_spec(mystring):
    special_char = string.punctuation
    # add the bullet to the list of special characters(punctuations).
    special_char += '•'

    # unidecode is used to convert all diacritic characters into ascii characters (aka removes accent marks)
    mystring = unidecode(mystring)

    # sometimes, we have a bullet and a space so we remove those first. 
    mystring = mystring.replace('• ', '')
    mystring = mystring.replace('&', 'and')
    # drop the article "the" without touching words that merely end in "the"
    mystring = re.sub(r'\bthe ', '', mystring)

    for c in special_char:
        mystring = mystring.replace(c, '')
    mystring = " ".join(mystring.split())
    #mystring = mystring.strip()
    return mystring

postprocessed = []
for l in posts['processedNER']:
    temp = []
    for item in l:
        cocktail = replace_spec(item)
        temp.append(cocktail)
    postprocessed.append(temp)

posts['postprocessed'] = postprocessed

#posts

# Flatten every post's list into a single list so we can count how often each cocktail appears.
# We can choose to remove a cocktail if it appears too few times.

all_cocktails = []
for names in posts['postprocessed']:
    for item in names:
        all_cocktails.append(item)



In [11]:
df = pd.DataFrame(all_cocktails, columns=['name'])
# count how often each name appears, most frequent first (ties broken alphabetically)
ncocktails = (
    df['name'].value_counts()
    .rename_axis('name')
    .reset_index(name='count')
    .sort_values(by=['count', 'name'], ascending=[False, True])
    .reset_index(drop=True)
)

In [21]:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# ncocktails is sorted by count, so row i is always at least as frequent as row j.
for i in range(len(ncocktails)):
    for j in range(i + 1, len(ncocktails)):
        name_i = ncocktails.loc[i, 'name']
        name_j = ncocktails.loc[j, 'name']
        # if the name at j is similar to the name at i, but not the same,
        # treat it as a misspelling and rename it to the more frequent spelling
        if 0.88 < similar(name_i, name_j) < 1:
            print('dropped: ' + name_j + ' ---- count: ' + str(ncocktails.loc[j, 'count']))
            print(' and replaced with ' + name_i + ' ---- count: ' + str(ncocktails.loc[i, 'count']))
            ncocktails.loc[j, 'name'] = name_i


dropped: old fashion ---- count: 3
 and replaced with old fashioned ---- count: 162
dropped: old fasioned ---- count: 1
 and replaced with old fashioned ---- count: 162
dropped: nergroni ---- count: 1
 and replaced with negroni ---- count: 158
dropped: margaritas ---- count: 2
 and replaced with margarita ---- count: 102
dropped: margharita ---- count: 1
 and replaced with margarita ---- count: 102
dropped: daquiri ---- count: 8
 and replaced with daiquiri ---- count: 101
dropped: manhatten ---- count: 1
 and replaced with manhattan ---- count: 82
dropped: jingle bird ---- count: 1
 and replaced with jungle bird ---- count: 68
dropped: boulvardier ---- count: 5
 and replaced with boulevardier ---- count: 46
dropped: boulavardier ---- count: 1
 and replaced with boulevardier ---- count: 46
dropped: boulivardier ---- count: 1
 and replaced with boulevardier ---- count: 46
dropped: naked a famous ---- count: 1
 and replaced with naked and famous ---- count: 41
dropped: naked ans famous --
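
To see where the 0.88 threshold draws the line, it helps to compute a few ratios by hand. `SequenceMatcher.ratio()` returns 2 * M / T, where M is the number of matched characters and T is the combined length of both strings. The pairs below are chosen for illustration; two of them also appear in the log above.

```python
from difflib import SequenceMatcher

def similar(a, b):
    # ratio() = 2 * M / T: M matched characters, T total characters in both strings
    return SequenceMatcher(None, a, b).ratio()

pairs = [
    ("old fashioned", "old fashion"),
    ("jungle bird", "jingle bird"),
    ("manhattan", "black manhattan"),
]
for a, b in pairs:
    score = similar(a, b)
    verdict = "merge" if 0.88 < score < 1 else "keep separate"
    print(f"{a!r} vs {b!r}: {score:.3f} -> {verdict}")
```

A single changed letter in an 11-character name still clears the threshold, so any two legitimately different cocktails whose names differ by one letter would be merged as well. The manual review in the next cells is what catches those cases.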

In [22]:
# group by name, sum the counts of rows that now share a name, and reset the index.
cocktail_names = (
    ncocktails.groupby('name', as_index=False)['count'].sum()
    .sort_values('count', ascending=False)
    .reset_index(drop=True)
)
cocktail_names

Unnamed: 0,name,count
0,old fashioned,166
1,negroni,159
2,daiquiri,109
3,margarita,105
4,last word,102
...,...,...
255,midnight marauder,1
256,midori slipper,1
257,midori sour,1
258,milk punch,1


In [23]:
# Print each name with its index so we can manually check that our cocktail_names
# dataframe contains the proper form of each cocktail.
for index, name in enumerate(cocktail_names['name']):
    print(name + ', ' + str(index))

# Replace misdetected or abbreviated names with the name we want, by hand.
manual_fixes = {
    'dark and stormy': 'dark n stormy',
    'ny sour': 'new york sour',
    'old': 'old fashioned',
    'ramos': 'ramos gin fizz',
    'oaxaca of': 'oaxaca old fashioned',
    'sazarac': 'sazerac',
    'ramos fizz': 'ramos gin fizz',
    'old fasioned': 'old fashioned',
    'coffee negroni w': 'coffee negroni',
    'a spritz': 'aperol spritz',
    'maple syrup old fashioned': 'maple old fashioned',
    'maple whiskey sours': 'maple whiskey sour',
    'mezcal marg': 'mezcal margarita',
    'hemingway daq': 'hemingway daiquiri',
    'gin style old fashioned': 'gin old fashioned',
    'gin tonic': 'gin and tonic',
    'hawaiian': 'blue hawaiian',
    'hunter s': 'hunters cocktail',
    'maguerita': 'margarita',
    'dark maple syrup in': 'toronto',
}
cocktail_names['name'] = cocktail_names['name'].replace(manual_fixes)

# Merge rows that now share a name by summing their counts, then reset the index.
cocktail_names = (
    cocktail_names.groupby('name', as_index=False)['count'].sum()
    .sort_values('count', ascending=False)
    .reset_index(drop=True)
)

# Despite its name, cocktail_dict is a plain list of the accepted cocktail names.
cocktail_dict = cocktail_names['name'].tolist()

old fashioned, 0
negroni, 1
daiquiri, 2
margarita, 3
last word, 4
mai tai, 5
manhattan, 6
paper plane, 7
jungle bird, 8
boulevardier, 9
martini, 10
sazerac, 11
naked and famous, 12
penicillin, 13
black manhattan, 14
vieux carre, 15
painkiller, 16
corpse reviver 2, 17
paloma, 18
caipirinha, 19
whiskey sour, 20
french 75, 21
gin and tonic, 22
sidecar, 23
mojito, 24
bees knees, 25
saturn, 26
aviation, 27
amaretto sour, 28
new york sour, 29
trinidad sour, 30
hemingway daiquiri, 31
oaxaca old fashioned, 32
pina colada, 33
espresso martini, 34
la louisiane, 35
tom collins, 36
gold rush, 37
industry sour, 38
clover club, 39
singapore sling, 40
pisco sour, 41
bijou, 42
aperol spritz, 43
gimlet, 44
ramos gin fizz, 45
dark and stormy, 46
mezcal margarita, 47
kingston negroni, 48
gin martini, 49
toronto, 50
enzoni, 51
white russian, 52
final ward, 53
cosmopolitan, 54
ti punch, 55
pina verde, 56
moscow mule, 57
el presidente, 58
jet pilot, 59
gin basil smash, 60
jack rose, 61
hot toddy, 62
mezcal 

In [24]:
# create a new dataframe with only id, body, and postprocessed, renaming
# postprocessed to recommendations. This column will hold our recommendations.
cocktail_recs = posts[['id', 'body', 'postprocessed']].rename(columns={'postprocessed': 'recommendations'})


In [25]:
%pip install pyspellchecker


Note: you may need to restart the kernel to use updated packages.


In [26]:
import json

# spell checker using pyspellchecker

# convert cocktail dict into a series and then convert to a json file .
cocktail_corpus = cocktail_names.set_index('name')['count'].to_json()

# writing to cocktail_corpus.json
with open("cocktail_corpus.json", "w") as outfile:
    outfile.write(cocktail_corpus)

def spellchecker(words):
    from spellchecker import SpellChecker

    spell = SpellChecker(language=None, case_sensitive=False, distance=1)
    spell.word_frequency.load_dictionary('./cocktail_corpus.json')

    new_list = []

    for word in words:
        correction = spell.correction(word)
        if correction is not None:
            new_list.append(correction)
            continue
        print(word + ' is not in our dictionary!')
        inp = input('Type the word if you want to keep it, or type what you want to replace it with: ')
        if inp in cocktail_dict:
            print('Replaced ' + word + ' with ' + inp + '.')
        else:
            cocktail_dict.append(inp)
            print(inp + ' added.')
        # keep the user's answer in both cases so the word is not silently dropped
        new_list.append(inp)

    return new_list

# test line of code:
# spellchecker(['daquiri', 'black manhatten', 'hawaiian', 'daiquiri', 'ramos', 'ramos'])


In [27]:
# spell checker using distances. This one is more efficient. 

# define spell checker that will also allow you to change the name of incorrectly spelled words. 
def spellcheck(mystring):
    if mystring in cocktail_dict:
        return mystring
    # if our cocktail is not in the list, replace it with the first name that is similar enough.
    for cocktail in cocktail_dict:
        if similar(mystring, cocktail) > 0.83:
            print('Replaced ' + mystring + ' with ' + cocktail + '.')
            return cocktail

    # this list will hold weaker matches that might still be possibilities.
    possibilities = []
    for cocktail in cocktail_dict:
        score = similar(mystring, cocktail)
        if score > 0.5:
            possibilities.append(cocktail)
            print(mystring + " corrections: " + cocktail + " ----- score: " + str(score))

    # with exactly one candidate, accept it without asking.
    if len(possibilities) == 1:
        print('Replaced ' + mystring + ' with ' + possibilities[0] + '.')
        return possibilities[0]

    # take an input from the user to correct the spelling of the cocktail or possibly add a new cocktail to the dictionary.
    print(mystring + ' is not in our dictionary! Here are some possibilities: ' + ", ".join(possibilities))
    inp = input('Type the word if you want to keep it, or type what you want to replace it with: ')
    if inp in cocktail_dict:
        print('Replaced ' + mystring + ' with ' + inp + '.')
        return inp
    cocktail_dict.append(inp)
    print(inp + ' added.')
    # return the new name so the caller never receives None
    return inp

def spellcheck_list(words):
    return [spellcheck(word) for word in words]

#spellcheck_list(['daquiri', 'black manhatten', 'hawaiian', 'daiquiri', 'ramos', 'ramos'])
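
The standard library offers a ready-made version of the "most similar name above a threshold" lookup: `difflib.get_close_matches`, which scores candidates with the same `SequenceMatcher` ratio and returns them best first. The sketch below is an alternative, not what the notebook runs; `best_match` and the small `vocab` list are made up for illustration, and note that its cutoff is inclusive (at or above 0.83) where the function above uses a strict comparison.

```python
from difflib import get_close_matches

# Hypothetical miniature vocabulary standing in for cocktail_dict.
vocab = ["daiquiri", "manhattan", "black manhattan", "blue hawaiian", "ramos gin fizz"]

def best_match(name, vocabulary, cutoff=0.83):
    """Return the closest vocabulary entry scoring at or above cutoff, or None."""
    matches = get_close_matches(name, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for raw in ["daquiri", "black manhatten", "hawaiian"]:
    print(raw, "->", best_match(raw, vocab))
```

Unlike the loop in `spellcheck`, which returns the first entry that clears the threshold, this returns the highest-scoring one, so the result does not depend on the order of the vocabulary.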

In [28]:
# temporary list to hold the list of recommendations. 
recommendations = []

# go through all the recommendations column lists and perform spell check. 
for l in cocktail_recs['recommendations']:
    recommendations.append(spellcheck_list(l))
cocktail_recs['recommendations'] = recommendations

cocktail_recs

Replaced jingle bird with jungle bird.
Replaced oaxacan old fashioned with oaxaca old fashioned.
Replaced maple syrup old fashioned with maple old fashioned.
Replaced sazarac with sazerac.
old is not in our dictionary! Here are some possibilities: 
Replaced old with old fashioned.
Replaced earl grey marteani with earl gray marteani.
Replaced daquiri with daiquiri.
Replaced golf rush with gold rush.
Replaced margaritas with margarita.
Replaced pain killer with painkiller.
gin tonic corrections: gin and tonic ----- score: 0.8181818181818182
gin tonic corrections: gin martini ----- score: 0.7
gin tonic corrections: enzoni ----- score: 0.5333333333333333
gin tonic corrections: gin gimlet ----- score: 0.5263157894736842
gin tonic corrections: dirty gin martini ----- score: 0.5384615384615384
gin tonic corrections: aperol tonic ----- score: 0.5714285714285714
gin tonic corrections: gin sour ----- score: 0.5882352941176471
gin tonic corrections: gin daisy ----- score: 0.5555555555555556
gin t



Unnamed: 0,id,body,recommendations
0,jaykaner411,1. Daiquiri - Especially with good rum (probit...,"[daiquiri, manhattan, highball]"
1,dwchambers,This is tough! And I’ve hardly tried a great n...,"[jungle bird, naked and famous, sazerac, la be..."
2,MizLucinda,"Paper plane - equal parts, all magic. I genera...","[paper plane, boulevardier, negroni, margarita..."
3,thehza4,1. Sazerac. Just perfection. Love the bourbon ...,"[sazerac, grog, paloma, oaxaca old fashioned]"
4,_makebuellerproud_,At the moment? \n\n1.) a well made dirty gin m...,"[dirty gin martini, old fashioned, maple old f..."
...,...,...,...
481,aproposofnothing32,Bees Knees\n\nBlack Manhattan\n\nMargarita \n\...,"[bees knees, black manhattan, margarita, new y..."
482,woelajilliams,"Negroni, Mai Tai, Army & Navy, Queen's Park Sw...","[negroni, mai tai]"
483,thecal714,Daiquiri\n\nWhiskey Sour\n\nSazerac\n\nPaper J...,"[daiquiri, whiskey sour, sazerac, paper jam, c..."
484,CovfefeFan,Nergroni\nManhattan\nOld Fashioned\nMartini\nL...,"[negroni, manhattan, old fashioned, martini, l..."


In [29]:
# removing repeats
no_repeats = []
for l in cocktail_recs['recommendations']:
    temp = []
    for item in l:
        if item in temp:
            continue
        temp.append(item)
    no_repeats.append(temp)

cocktail_recs['recommendations'] = no_repeats

cocktail_recs



Unnamed: 0,id,body,recommendations
0,jaykaner411,1. Daiquiri - Especially with good rum (probit...,"[daiquiri, manhattan, highball]"
1,dwchambers,This is tough! And I’ve hardly tried a great n...,"[jungle bird, naked and famous, sazerac, la be..."
2,MizLucinda,"Paper plane - equal parts, all magic. I genera...","[paper plane, boulevardier, negroni, margarita..."
3,thehza4,1. Sazerac. Just perfection. Love the bourbon ...,"[sazerac, grog, paloma, oaxaca old fashioned]"
4,_makebuellerproud_,At the moment? \n\n1.) a well made dirty gin m...,"[dirty gin martini, old fashioned, maple old f..."
...,...,...,...
481,aproposofnothing32,Bees Knees\n\nBlack Manhattan\n\nMargarita \n\...,"[bees knees, black manhattan, margarita, new y..."
482,woelajilliams,"Negroni, Mai Tai, Army & Navy, Queen's Park Sw...","[negroni, mai tai]"
483,thecal714,Daiquiri\n\nWhiskey Sour\n\nSazerac\n\nPaper J...,"[daiquiri, whiskey sour, sazerac, paper jam, c..."
484,CovfefeFan,Nergroni\nManhattan\nOld Fashioned\nMartini\nL...,"[negroni, manhattan, old fashioned, martini, l..."
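
The nested loop that removes repeats can be condensed with `dict.fromkeys`, which keeps only the first occurrence of each key and preserves insertion order. A minimal sketch, with `dedupe` as a name chosen here:

```python
def dedupe(items):
    """Drop repeated entries while keeping the first occurrence of each."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return [*dict.fromkeys(items)]

print(dedupe(["ramos gin fizz", "daiquiri", "ramos gin fizz", "negroni", "daiquiri"]))
```

Order matters here because the lists are ranked, so a plain `set` would not be a safe substitute.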


## Step 2: Building a Recommender

For the next part of this project, I want to create a recommender. Here are some ideas that I have:

#### Frequent Pattern Growth:
To implement frequent pattern growth, we first need to one-hot encode our recommendations; then we can run an FP-growth algorithm on the encoded matrix. The tricky part with this association rules approach, as I plan to use it, is that we would input a single cocktail and use it to look up which other cocktails to try. I want a more robust recommender that can take multiple cocktails into account. However, for a one-to-one recommendation, this would be quick and simple to implement, given that the FP-growth algorithm is included in the mlxtend package.

#### User to User Recommendation: 
Since we have different lists of users, we can try to create a recommendation algorithm that matches us to other users who have rated the same cocktails and try to predict a rating. A lot of the documentation on this form of algorithm uses a rating system. People would rate different movies they have watched, and we would take our input, compare it to other users who have rated other movies similarly, and use that to predict what other items might be rated highly. However, instead of a rating system for our cocktails, we only have a ranked list of cocktails. We could either create a sparse matrix of binary values and approach the recommendation system that way, or we can decide what rating a first place, second place, third place (and so on) cocktail translates to. Fortunately for us, there are a few people who have documented their process of creating a user to user recommendation system with binary or unary data. 
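
With binary data, one common way to compare two users is the Jaccard similarity: the number of cocktails both users listed divided by the number of cocktails either of them listed. The sketch below only illustrates that idea; the users and their lists are invented, and this is not necessarily the measure used later in this notebook.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical users and lists, made up for illustration only.
users = {
    "user_a": ["negroni", "old fashioned", "daiquiri"],
    "user_b": ["negroni", "boulevardier", "old fashioned"],
    "user_c": ["mai tai", "jungle bird", "painkiller"],
}

me = ["negroni", "old fashioned"]
ranked = sorted(users, key=lambda u: jaccard(me, users[u]), reverse=True)
for u in ranked:
    print(u, round(jaccard(me, users[u]), 3))
```

The cocktails listed by the most similar users, minus the ones we already like, would then become the candidate recommendations.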

I guess we'll try implementing both and seeing what happens!

### 2a: Frequent Pattern Growth 

In [30]:
from mlxtend.frequent_patterns import fpgrowth

In [31]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# take the recommendations column and store it as a list of lists (one list per user)
dataset = cocktail_recs['recommendations'].tolist()

# transform the list of lists into a one-hot encoded dataframe.
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# return all items and itemsets with at least 4% support.
results = fpgrowth(df, min_support=0.04, use_colnames=True)

results.sort_values(by='support', ascending=False)

Unnamed: 0,support,itemsets
11,0.347193,(old fashioned)
7,0.328482,(negroni)
0,0.222453,(daiquiri)
8,0.218295,(margarita)
17,0.212058,(last word)
12,0.178794,(mai tai)
1,0.172557,(manhattan)
2,0.160083,(paper plane)
3,0.143451,(jungle bird)
36,0.112266,"(negroni, old fashioned)"
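
For intuition about what the encoder and the support column are doing, here is a plain-Python sketch on a toy dataset. The transactions are invented and the helper names `one_hot` and `support` are ours; mlxtend does the same job far more efficiently.

```python
def one_hot(transactions):
    """Return the sorted item names and one boolean row per transaction."""
    items = sorted({item for t in transactions for item in t})
    rows = [[item in t for item in items] for t in transactions]
    return items, rows

def support(rows, items, itemset):
    """Fraction of transactions that contain every item in itemset."""
    idx = [items.index(i) for i in itemset]
    hits = sum(all(row[i] for i in idx) for row in rows)
    return hits / len(rows)

# Hypothetical lists of liked cocktails, one per user.
transactions = [
    ["negroni", "old fashioned"],
    ["negroni"],
    ["daiquiri", "old fashioned"],
    ["daiquiri", "negroni", "old fashioned"],
]
items, rows = one_hot(transactions)
print(items)
print(support(rows, items, ["negroni"]))
print(support(rows, items, ["negroni", "old fashioned"]))
```

FP-growth reports every itemset whose support reaches `min_support`, without having to test each combination one by one as this sketch would.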


What we see from the frequent pattern growth results is that people's cocktail lists are too varied. Our one-hot encoded matrix is so sparse that very few combinations of cocktails reach the minimum support. What we can glean, however, is that certain cocktails do appear together. If people like one cocktail, it's possible to identify a second cocktail to recommend. For example, the itemset (negroni, old fashioned) has a support of 0.11, meaning about 11% of users listed both, which is high compared to the other pairs. We can say that a person who likes an old fashioned might like a negroni as well, although these are also the two most popular cocktails overall, so support alone cannot separate a real association from sheer popularity.
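
Support is only one of the usual association-rule measures. Confidence and lift can be worked out directly from the three support values in the table above; the snippet below is plain arithmetic on those reported numbers, using the standard definitions.

```python
# Supports copied from the fpgrowth output above.
support_old_fashioned = 0.347193
support_negroni = 0.328482
support_both = 0.112266

# confidence(A -> B) = support(A and B) / support(A)
confidence = support_both / support_old_fashioned

# lift(A, B) = support(A and B) / (support(A) * support(B));
# a lift of 1.0 means the two items co-occur as often as independence predicts.
lift = support_both / (support_old_fashioned * support_negroni)

print(f"confidence(old fashioned -> negroni) = {confidence:.3f}")
print(f"lift(old fashioned, negroni) = {lift:.3f}")
```

With these figures, roughly a third of old fashioned fans also list a negroni, but the lift comes out just under 1. The pair therefore co-occurs about as often as we would expect from two independently popular drinks, which is worth keeping in mind before reading the 0.11 support as a strong association.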

### 2b: Item-Item Recommendation

In [41]:
def normalized_one_hot(dataset):
    from mlxtend.preprocessing import TransactionEncoder

    te = TransactionEncoder()
    te_ary = te.fit(dataset).transform(dataset)
    df = pd.DataFrame(te_ary, columns=te.columns_)

    # convert all boolean values into 0/1 integers.
    df = df.astype(int)

    # divide each row by its sum (the number of cocktails that user listed),
    # so that every user's row adds up to 1.
    norm_by_row = df.sum(axis=1)
    df = df.divide(norm_by_row, axis='index')

    return df


# convert list of lists to similarity matrix
def similarity_matrix(dataset):
    df = normalized_one_hot(dataset)

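    # Pearson correlation between every pair of cocktail columns. This equals the
    # cosine similarity of the columns after each has been mean-centered; it is
    # not the same as plain cosine similarity.
    data_matrix = df.corr(method='pearson')

    return data_matrix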

# takes user input on preferred cocktails and returns the list of liked cocktails
def input_likes():
    likes = []

    while True:
        inp = input("What cocktail do you like? If you are done listing your favorite cocktails, write 'done': ")
        inp = inp.lower()

        if inp == 'done':
            break

        inp = replace_spec(inp)
        inp = spellcheck(inp)

        # skip empty results and cocktails that are already in the list
        if inp and inp not in likes:
            likes.append(inp)

        print("You like: " + str(likes))

    return likes
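
The relationship between Pearson correlation and cosine similarity mentioned in `similarity_matrix` can be checked on a toy example. Pearson correlation is the cosine similarity of the two vectors after subtracting each vector's mean, so the two measures generally give different numbers. The columns below are invented for illustration.

```python
from math import sqrt

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def pearson(x, y):
    # Pearson correlation = cosine similarity of the mean-centered vectors
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

# Two hypothetical cocktail columns over five users (1 = listed, 0 = not listed).
negroni = [1, 1, 0, 0, 0]
old_fashioned = [1, 0, 1, 0, 0]

print(round(cosine(negroni, old_fashioned), 3))
print(round(pearson(negroni, old_fashioned), 3))
```

Mean-centering is also why two cocktails that never appear together get a small negative correlation rather than zero, as seen throughout the matrix output.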

In [37]:

# take the recommendations column and store it as a list of lists
dataset = cocktail_recs['recommendations'].tolist()

# add the current user's likes as the last row of the dataset
dataset.append(input_likes())

matrix = similarity_matrix(dataset)

matrix
# nlargest function returns the indices with the highest correlation values to the given index. 
#print (data_matrix.loc['negroni'].nlargest(11))

You like: ['old fashioned']
Replaced negorni with negroni.
You like: ['old fashioned', 'negroni']
You like: ['old fashioned', 'negroni', 'blue hawaiian']
You like: ['old fashioned', 'negroni', 'blue hawaiian']
You like: ['old fashioned', 'negroni', 'blue hawaiian', 'mojito']


Unnamed: 0,20th century,amaretto sour,american trilogy,angostura colada,aperol negroni,aperol sour,aperol spritz,aperol tonic,apple martini,army,...,whiskey and soda,whiskey buck,whiskey sour,white lady,white negroni,white russian,widows kiss,wisconsin old fashioned,yellow cactus flower,yellow smash
20th century,1.000000,-0.011237,-0.002943,-0.003952,-0.002943,-0.002943,-0.008335,-0.002943,-0.002943,-0.002943,...,-0.002943,-0.002943,-0.014541,-0.004947,-0.004167,-0.006491,-0.002943,-0.004042,-0.002943,-0.002943
amaretto sour,-0.011237,1.000000,-0.007937,-0.010658,-0.007937,-0.007937,0.098614,-0.007937,-0.007937,0.201698,...,-0.007937,-0.007937,-0.000617,-0.013340,0.137152,-0.017505,-0.007937,-0.010900,0.201698,-0.007937
american trilogy,-0.002943,-0.007937,1.000000,-0.002792,-0.002079,-0.002079,-0.005888,-0.002079,-0.002079,-0.002079,...,-0.002079,-0.002079,-0.010271,-0.003494,-0.002943,-0.004585,-0.002079,-0.002855,-0.002079,-0.002079
angostura colada,-0.003952,-0.010658,-0.002792,1.000000,-0.002792,-0.002792,-0.007906,-0.002792,-0.002792,-0.002792,...,-0.002792,-0.002792,-0.013792,-0.004692,-0.003952,-0.006156,-0.002792,-0.003834,-0.002792,-0.002792
aperol negroni,-0.002943,-0.007937,-0.002079,-0.002792,1.000000,-0.002079,-0.005888,-0.002079,-0.002079,-0.002079,...,-0.002079,-0.002079,-0.010271,-0.003494,-0.002943,-0.004585,-0.002079,-0.002855,-0.002079,-0.002079
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
white russian,-0.006491,-0.017505,-0.004585,-0.006156,-0.004585,-0.004585,-0.012984,-0.004585,-0.004585,-0.004585,...,-0.004585,-0.004585,-0.022652,-0.007706,-0.006491,1.000000,-0.004585,-0.006296,-0.004585,-0.004585
widows kiss,-0.002943,-0.007937,-0.002079,-0.002792,-0.002079,-0.002079,-0.005888,-0.002079,-0.002079,-0.002079,...,-0.002079,-0.002079,-0.010271,-0.003494,-0.002943,-0.004585,1.000000,-0.002855,-0.002079,-0.002079
wisconsin old fashioned,-0.004042,-0.010900,-0.002855,-0.003834,-0.002855,-0.002855,-0.008085,-0.002855,0.857210,-0.002855,...,-0.002855,-0.002855,0.080908,-0.004798,-0.004042,-0.006296,-0.002855,1.000000,-0.002855,-0.002855
yellow cactus flower,-0.002943,0.201698,-0.002079,-0.002792,-0.002079,-0.002079,-0.005888,-0.002079,-0.002079,1.000000,...,-0.002079,-0.002079,-0.010271,-0.003494,-0.002943,-0.004585,-0.002079,-0.002855,1.000000,-0.002079


In [68]:
# what we received from our input is always the last entry of the dataset.
user_likes = dataset[-1]

# initialize normalized one hot encoded dataset:
df = normalized_one_hot(dataset)

# user vector: the last row of the dataframe.
user_vector = df.iloc[-1]

# Calculate the score: the similarity-weighted sum of the user's likes, divided by
# each cocktail's total similarity. The correlations are signed, so a row sum
# close to zero can inflate a score.
score = matrix.dot(user_vector).div(matrix.sum(axis=1))

# Remove the known likes from the recommendation.
score = score.drop(user_likes)

# Print the known likes and the top 5 recommendations.
print("You like: " + str(user_likes))

print("We recommend: \n" + str(score.nlargest(5)))

You like: ['old fashioned', 'negroni', 'blue hawaiian', 'mojito']
We recommend: 
penicillin          10.372146
dark n stormy        0.878432
whiskey sour         0.484964
naked and famous     0.443066
black cat            0.400691
dtype: float64
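
The penicillin score of 10.37 stands far apart from the rest, and one plausible cause is the denominator: a row of signed correlations can sum to something close to zero. A common variant of item-based scoring divides by the sum of absolute similarities instead. The sketch below compares the two on an invented 4x4 similarity matrix; `item_scores` and all the numbers are assumptions for illustration, not values from the data above.

```python
def item_scores(sim, user_vector, use_abs=True):
    """Score item i as sum_j sim[i][j] * u[j], divided by the row's total similarity."""
    scores = {}
    for item, row in sim.items():
        numerator = sum(row[other] * user_vector.get(other, 0.0) for other in row)
        if use_abs:
            denominator = sum(abs(v) for v in row.values())
        else:
            denominator = sum(row.values())
        scores[item] = numerator / denominator
    return scores

# Hypothetical similarity matrix; the entries of row "d" nearly cancel out.
sim = {
    "a": {"a": 1.0, "b": 0.4, "c": 0.1, "d": 0.3},
    "b": {"a": 0.4, "b": 1.0, "c": 0.2, "d": -0.6},
    "c": {"a": 0.1, "b": 0.2, "c": 1.0, "d": -0.6},
    "d": {"a": 0.3, "b": -0.6, "c": -0.6, "d": 1.0},
}
user = {"a": 1.0}  # the user likes only item "a"

signed = item_scores(sim, user, use_abs=False)
absolute = item_scores(sim, user, use_abs=True)
for item in ("b", "c", "d"):
    print(item, round(signed[item], 3), round(absolute[item], 3))
```

With the signed denominator, item "d" jumps to the top even though "b" is more similar to what the user likes; with absolute values the scores stay bounded and "b" ranks first. Whether this explains the penicillin result would need to be checked against the real matrix.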
