### Project 1
#### Summer 2021
**Authors:** GOAT Team (Estaban Aramayo, Ethan Haley, Claire Meyer, and Tyler Frankenburg)

The [Cocktail DB](https://www.thecocktaildb.com/api.php) is a database of cocktails and ingredients. In this assignment, we describe how we could use the Cocktail DB's API to generate a network of cocktails and ingredients. We can use some example data to explore how we might be able to predict outcomes from this data using centrality metrics. 

##### Loading the Data

Without digging too deeply into the intricacies of the Cocktail DB API, we can leverage [this code](https://holypython.com/api-12-cocktail-database/) as a start for grabbing some example output from the API. This code leverages 2 libraries: `requests` to make an API request, and `json` to load the JSON output from the API. We can then iterate through each cocktail output to grab the relevant components. 

We can bulid a search query to pull all cocktail names (`strDrink`), the ingredients for each, and the drink categories (`strCategory`) for each cocktail.

First, we get all drinks, by first letter of name.

In [1]:
import networkx as net
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [2]:
baseUrl = "https://www.thecocktaildb.com/api/json/v1/1/"
letterEndpoint = "search.php?f="

def searchLetter(letter):
    data = requests.get(baseUrl + letterEndpoint + letter)
    return json.loads(data.text)

In [3]:
# Build a dataframe
name = []  # drink name
ids = []   # drink ID
cat = []   # drink category
pic = []   # thumbnail url
ingr = []  # ingredients

In [4]:
# helper function to parse ingredients
def ingreds(drinkDict):
    ing = []
    for i in range(1,16):  # API has 16 fields for ingredients of each drink, most of them empty/None
        s = "strIngredient" + str(i)
        if not d[s]:
            break
        ing.append(d[s])
    return ing

In [5]:
for l in 'abcdefghijklmnopqrstuvwxyz':
    
    drinks = searchLetter(l)['drinks']
    if not drinks: continue   #(some letters have no drinks)
    for d in drinks:
        name.append(d['strDrink'])
        ids.append(d['idDrink'])
        cat.append(d['strCategory'])
        pic.append(d['strDrinkThumb'])
        ingr.append(ingreds(d))

In [6]:
drinkDF = pd.DataFrame({'name': name,
                       'id': ids,
                       'category': cat,
                       'photoURL': pic,
                       'ingredients': ingr})
drinkDF

Unnamed: 0,name,id,category,photoURL,ingredients
0,A1,17222,Cocktail,https://www.thecocktaildb.com/images/media/dri...,"[Gin, Grand Marnier, Lemon Juice, Grenadine]"
1,ABC,13501,Shot,https://www.thecocktaildb.com/images/media/dri...,"[Amaretto, Baileys irish cream, Cognac]"
2,Ace,17225,Cocktail,https://www.thecocktaildb.com/images/media/dri...,"[Gin, Grenadine, Heavy cream, Milk, Egg White]"
3,Adam,17837,Ordinary Drink,https://www.thecocktaildb.com/images/media/dri...,"[Dark rum, Lemon juice, Grenadine]"
4,AT&T,13938,Ordinary Drink,https://www.thecocktaildb.com/images/media/dri...,"[Absolut Vodka, Gin, Tonic water]"
...,...,...,...,...,...
412,Zima Blaster,17027,Ordinary Drink,https://www.thecocktaildb.com/images/media/dri...,"[Zima, Chambord raspberry liqueur]"
413,Zizi Coin-coin,14594,Punch / Party Drink,https://www.thecocktaildb.com/images/media/dri...,"[Cointreau, Lemon juice, Ice, Lemon]"
414,Zippy's Revenge,14065,Cocktail,https://www.thecocktaildb.com/images/media/dri...,"[Amaretto, Rum, Kool-Aid]"
415,Zimadori Zinger,15801,Punch / Party Drink,https://www.thecocktaildb.com/images/media/dri...,"[Midori melon liqueur, Zima]"


Save/load the df so as not to have to make 26 API calls every time notebook opens

In [7]:
#drinkDF.to_csv('drinkDF.csv')
drinkDF = pd.read_csv('drinkDF.csv', index_col=0)

One more step needed on the load, since pandas converts lists to string literals on csv storage

In [8]:
import ast
drinkDF['ingredients'] = drinkDF.ingredients.apply(ast.literal_eval)

##### Make a bipartite graph, with cocktails as one type and ingredients as the other.

First we can play with a high-level graph of the full dataset, creating a bipartite graph where drinks are one type and ingredients are the other.

In [31]:
from networkx.algorithms import bipartite

name = drinkDF.name.values
cat = drinkDF.category.values
ingr = drinkDF.ingredients.values

drinks = set(name)
ingreds = set(i for iList in ingr for i in iList)

B = net.Graph()
B.add_nodes_from(drinks, bipartite='Cocktail')
B.add_nodes_from(ingreds, bipartite='Ingredient')

for d in range(len(drinkDF)):
    B.add_node(name[d], category=cat[d])
    for ing in ingr[d]:
        B.add_edge(name[d], ing)

Then the 2 bipartite projection graphs are these:

In [32]:
D = bipartite.weighted_projected_graph(B, drinks)
I = bipartite.weighted_projected_graph(B, ingreds)

#### Calculate Centrality by Category

Then we want to create sub-dataframes of 3 key categories: 'cocktail', 'shot', and 'ordinary drink', that we can then compare key centrality measures across.

In [12]:
drinks_cocktails = drinkDF[(drinkDF['category']=="Cocktail")]
drinks_cocktails.head()

Unnamed: 0,name,id,category,photoURL,ingredients
0,A1,17222,Cocktail,https://www.thecocktaildb.com/images/media/dri...,"[Gin, Grand Marnier, Lemon Juice, Grenadine]"
2,Ace,17225,Cocktail,https://www.thecocktaildb.com/images/media/dri...,"[Gin, Grenadine, Heavy cream, Milk, Egg White]"
12,Addison,17228,Cocktail,https://www.thecocktaildb.com/images/media/dri...,"[Gin, Vermouth]"
16,Aviation,17180,Cocktail,https://www.thecocktaildb.com/images/media/dri...,"[Gin, lemon juice, maraschino liqueur]"
21,Afterglow,12560,Cocktail,https://www.thecocktaildb.com/images/media/dri...,"[Grenadine, Orange juice, Pineapple juice]"


In [34]:
drinks_ordinary = drinkDF[(drinkDF['category']=="Ordinary Drink")]
drinks_ordinary.head()

Unnamed: 0,name,id,category,photoURL,ingredients
3,Adam,17837,Ordinary Drink,https://www.thecocktaildb.com/images/media/dri...,"[Dark rum, Lemon juice, Grenadine]"
4,AT&T,13938,Ordinary Drink,https://www.thecocktaildb.com/images/media/dri...,"[Absolut Vodka, Gin, Tonic water]"
6,A. J.,17833,Ordinary Drink,https://www.thecocktaildb.com/images/media/dri...,"[Applejack, Grapefruit juice]"
7,Affair,17839,Ordinary Drink,https://www.thecocktaildb.com/images/media/dri...,"[Strawberry schnapps, Orange juice, Cranberry ..."
9,Avalon,15266,Ordinary Drink,https://www.thecocktaildb.com/images/media/dri...,"[Vodka, Pisang Ambon, Apple juice, Lemon juice..."


In [35]:
drinks_shots = drinkDF[(drinkDF['category']=="Shot")]
drinks_shots.head()

Unnamed: 0,name,id,category,photoURL,ingredients
1,ABC,13501,Shot,https://www.thecocktaildb.com/images/media/dri...,"[Amaretto, Baileys irish cream, Cognac]"
5,ACID,14610,Shot,https://www.thecocktaildb.com/images/media/dri...,"[151 proof rum, Wild Turkey]"
25,B-53,13332,Shot,https://www.thecocktaildb.com/images/media/dri...,"[Kahlua, Sambuca, Grand Marnier]"
26,B-52,15853,Shot,https://www.thecocktaildb.com/images/media/dri...,"[Baileys irish cream, Grand Marnier, Kahlua]"
29,Big Red,13222,Shot,https://www.thecocktaildb.com/images/media/dri...,"[Irish cream, Goldschlager]"


From the subsetted dataframes, create bipartite drink and ingredient graphs.

In [52]:
drinks_c = set(drinks_cocktails.name.values)
ingreds_c = set(i for iList in drinks_cocktails.ingredients.values for i in iList)

dc_graph = net.Graph()

dc_graph.add_nodes_from(drinks_c, bipartite='Drink')
dc_graph.add_nodes_from(ingreds_c,bipartite='Ingredient')

for d in drinks_cocktails.index:
    dc_graph.add_node(name[d], category=cat[d])
    for ing in ingr[d]:
        dc_graph.add_edge(name[d], ing)

Then we create a dataframe of degree and eigenvector centrality by each drink.

In [53]:
cocktail_degree = net.degree_centrality(dc_graph)
cocktail_degree = pd.DataFrame.from_dict(cocktail_degree, orient='index').reset_index()

In [59]:
cocktail_eig = net.eigenvector_centrality_numpy(dc_graph)
cocktail_eig = pd.DataFrame.from_dict(cocktail_eig, orient='index').reset_index()

We repeat these steps across the two other categories: 

In [54]:
drinks_o = set(drinks_ordinary.name.values)
ingreds_o = set(i for iList in drinks_ordinary.ingredients.values for i in iList)

do_graph = net.Graph()

do_graph.add_nodes_from(drinks_o, bipartite='Drink')
do_graph.add_nodes_from(ingreds_o,bipartite='Ingredient')

for d in drinks_ordinary.index:
    dc_graph.add_node(name[d], category=cat[d])
    for ing in ingr[d]:
        dc_graph.add_edge(name[d], ing)

In [55]:
ordinary_degree = net.degree_centrality(do_graph)
ordinary_degree = pd.DataFrame.from_dict(ordinary_degree, orient='index').reset_index()

In [60]:
ordinary_eig = net.eigenvector_centrality_numpy(do_graph)
ordinary_eig = pd.DataFrame.from_dict(ordinary_eig, orient='index').reset_index()

In [56]:
drinks_s = set(drinks_shots.name.values)
ingreds_s = set(i for iList in drinks_shots.ingredients.values for i in iList)

ds_graph = net.Graph()

ds_graph.add_nodes_from(drinks_s, bipartite='Drink')
ds_graph.add_nodes_from(ingreds_s,bipartite='Ingredient')

for d in drinks_shots.index:
    ds_graph.add_node(name[d], category=cat[d])
    for ing in ingr[d]:
        ds_graph.add_edge(name[d], ing)

In [61]:
shots_degree = net.degree_centrality(ds_graph)
shots_degree = pd.DataFrame.from_dict(shots_degree, orient='index').reset_index()

In [62]:
shots_eig = net.eigenvector_centrality_numpy(ds_graph)
shots_eig = pd.DataFrame.from_dict(shots_eig, orient='index').reset_index()

##### Compare Centrality Measures across Categories
Then we can create summary dataframes of both centrality measures, and sort by each to see top 10 cocktails, shots, and ordinary drinks by each centrality measure.

In [65]:
summary_cocktails = pd.merge(cocktail_degree, cocktail_eig, how = "inner", on = "index")
summary_cocktails = summary_cocktails.rename(columns = 
        {"index":"Name","0_x":"Degree Centrality","0_y":"Eigenvector Centrality"})
summary_cocktails.head()

Unnamed: 0,Name,Degree Centrality,Eigenvector Centrality
0,The Last Word,0.012295,0.005688
1,Zippy's Revenge,0.012295,0.003358
2,Imperial Cocktail,0.012295,0.059576
3,Empellón Cocina's Fat-Washed Mezcal,0.012295,0.000141
4,Dry Martini,0.012295,0.065821


In [66]:
summary_ordinary = pd.merge(ordinary_degree, ordinary_eig, how = "inner", on = "index")
summary_ordinary = summary_ordinary.rename(columns = 
        {"index":"Name","0_x":"Degree Centrality","0_y":"Eigenvector Centrality"})
summary_ordinary.head()

Unnamed: 0,Name,Degree Centrality,Eigenvector Centrality
0,Godfather,0.0,0.052523
1,Grasshopper,0.0,0.078433
2,Rum Cobbler,0.0,0.003691
3,Popped cherry,0.0,-0.033888
4,Rum Screwdriver,0.0,-0.052785


In [67]:
summary_shots = pd.merge(shots_degree, shots_eig, how = "inner", on = "index")
summary_shots = summary_shots.rename(columns = 
        {"index":"Name","0_x":"Degree Centrality","0_y":"Eigenvector Centrality"})
summary_shots.head()

Unnamed: 0,Name,Degree Centrality,Eigenvector Centrality
0,Shot-gun,0.034483,0.0131926
1,Jelly Bean,0.022989,-1.3761290000000001e-17
2,Damned if you do,0.022989,-1.008789e-17
3,Red Snapper,0.034483,0.1359623
4,Kool-Aid Slammer,0.022989,0.1205234


##### Digging into the Results

Then we can explore the values we see in centrality by each category. First we'll sort, to see how the highest values compare, and then we can look at means and medians.

In [68]:
summary_cocktails.sort_values(by=['Degree Centrality'],ascending=False).head()

Unnamed: 0,Name,Degree Centrality,Eigenvector Centrality
152,Gin,0.094262,0.555744
193,Lime Juice,0.053279,0.0357
117,Vodka,0.04918,0.108363
103,Grenadine,0.040984,0.157424
93,Angostura Bitters,0.032787,0.020386


In [69]:
summary_cocktails.sort_values(by=['Eigenvector Centrality'],ascending=False).head()

Unnamed: 0,Name,Degree Centrality,Eigenvector Centrality
152,Gin,0.094262,0.555744
103,Grenadine,0.040984,0.157424
163,Lemon juice,0.028689,0.142327
239,Rose,0.004098,0.111255
97,Sugar,0.02459,0.110781


In [72]:
summary_ordinary.sort_values(by=['Degree Centrality'],ascending=False).head()

Unnamed: 0,Name,Degree Centrality,Eigenvector Centrality
0,Godfather,0.0,0.052523
242,Champagne,0.0,0.032357
251,Anisette,0.0,0.023194
250,Advocaat,0.0,0.027251
249,Egg white,0.0,-0.035619


In [73]:
summary_ordinary.sort_values(by=['Eigenvector Centrality'],ascending=False).head()

Unnamed: 0,Name,Degree Centrality,Eigenvector Centrality
129,Sol Y Sombra,0.0,0.090658
143,Foxy Lady,0.0,0.090635
225,Sambuca,0.0,0.090468
244,Angostura bitters,0.0,0.090077
49,Gin Fizz,0.0,0.090007


In [74]:
summary_shots.sort_values(by=['Degree Centrality'],ascending=False).head()

Unnamed: 0,Name,Degree Centrality,Eigenvector Centrality
82,Vodka,0.091954,0.394175
68,Amaretto,0.068966,0.277725
63,Baileys irish cream,0.068966,0.255925
15,Flaming Dr. Pepper,0.057471,0.25201
71,Kahlua,0.057471,0.214798


In [75]:
summary_shots.sort_values(by=['Eigenvector Centrality'],ascending=False).head()

Unnamed: 0,Name,Degree Centrality,Eigenvector Centrality
82,Vodka,0.091954,0.394175
33,Kool-Aid Shot,0.057471,0.278812
68,Amaretto,0.068966,0.277725
63,Baileys irish cream,0.068966,0.255925
15,Flaming Dr. Pepper,0.057471,0.25201


In [77]:
cocktail_degree_mean = summary_cocktails["Degree Centrality"].mean()
cocktail_eig_mean = summary_cocktails["Eigenvector Centrality"].mean()
cocktail_degree_median = summary_cocktails["Degree Centrality"].median()
cocktail_eig_median = summary_cocktails["Eigenvector Centrality"].median()

In [78]:
ordinary_degree_mean = summary_cocktails["Degree Centrality"].mean()
ordinary_eig_mean = summary_cocktails["Eigenvector Centrality"].mean()
ordinary_degree_median = summary_cocktails["Degree Centrality"].median()
ordinary_eig_median = summary_cocktails["Eigenvector Centrality"].median()

In [79]:
shot_degree_mean = summary_shots["Degree Centrality"].mean()
shot_eig_mean = summary_shots["Eigenvector Centrality"].mean()
shot_degree_median = summary_shots["Degree Centrality"].median()
shot_eig_median = summary_shots["Eigenvector Centrality"].median()

In [80]:
print("cocktail_degree_mean: ",cocktail_degree_mean)
print("cocktail_eig_mean: ",cocktail_eig_mean)
print("cocktail_degree_median: ",cocktail_degree_median)
print("cocktail_eig_median: ",cocktail_eig_median)
print("ordinary_degree_mean: ",ordinary_degree_mean)
print("ordinary_eig_mean: ",ordinary_eig_mean)
print("ordinary_degree_median: ",ordinary_degree_median)
print("ordinary_eig_median: ",ordinary_eig_median)
print("shot_degree_mean: ",shot_degree_mean)
print("shot_eig_mean: ",shot_eig_mean)
print("shot_degree_median: ",shot_degree_median)
print("shot_eig_median: ",shot_eig_median)

cocktail_degree_mean:  0.011743057878889261
cocktail_eig_mean:  0.02075649893875311
cocktail_degree_median:  0.00819672131147541
cocktail_eig_median:  0.0066096767289867304
ordinary_degree_mean:  0.011743057878889261
ordinary_eig_mean:  0.02075649893875311
ordinary_degree_median:  0.00819672131147541
ordinary_eig_median:  0.0066096767289867304
shot_degree_mean:  0.026907001044932075
shot_eig_mean:  0.06945263613384046
shot_degree_median:  0.022988505747126436
shot_eig_median:  0.04071819967908407


### clean up, interpretation, stat sig diff?