# Analysis of a Great Cocktail

Here I look into what it takes to make a great adult beverage. Nothing against beer, wine, or even mead, but here I focus on **Cocktails**--and fancy ones at that!

The dataset contains cocktails collected by alcohol importer and distiller Hotaling & Co. Original data at: http://www.hotalingandco.com/cocktails/. 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        dpath=(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# Start by reading in the data and taking a quick look at the structure
df = pd.read_csv(dpath)
df.head()

In [None]:
df.describe()

## Which Cities, Bars, and Bartenders are prevalent in the data?

Let's visually explore the data, focussing on some demographic details first. To streamline this, we will first define a function. 



In [None]:
sns.set(style="whitegrid")
def plot_dist(data, prop, minimum = 0, title='Distribution of Values'):
    plt.figure(figsize=(10,6))
    counts = data.groupby(prop).filter(lambda x: len(x) >= minimum)
    plot = sns.countplot(
        data = counts,
        y=prop,
        order=counts[prop].value_counts().index,
        palette="deep"
    )
    plot.set_title(title)
    plt.tight_layout()
    plt.show()
    return counts

In [None]:
locations = plot_dist(df, 'Location', 0, 'Distribution of recipes by origin location')
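The `minimum` argument of `plot_dist` works through `groupby(...).filter`, which keeps only the rows whose group has at least that many members. Here is a minimal sketch of that step on its own. The helper name `keep_frequent` and the toy frame are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the cocktail dataset)
toy = pd.DataFrame(
    {"Location": ["San Francisco"] * 3 + ["New York"] * 2 + ["Boise"]}
)

def keep_frequent(data, prop, minimum=0):
    """Return only the rows whose value in `prop` occurs at least `minimum` times."""
    return data.groupby(prop).filter(lambda g: len(g) >= minimum)

# Cities with fewer than 2 drinks are dropped before plotting
frequent = keep_frequent(toy, "Location", minimum=2)
print(frequent["Location"].value_counts())
```

With the real data, raising `minimum` in the `plot_dist` call should trim the long tail of cities that appear only once or twice.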

## San Fran vs NYC
It's not entirely surprising that most of the drinks come from major US cities: San Francisco, New York, Houston, LA, New Orleans, Chicago, etc. (Well, Houston surprised me a bit at first, but I suppose it's hot and humid there--perfect weather for a cold concoction!) However, San Francisco has a HUGE advantage!

In [None]:
df.Location.loc[df.Location.isin(["San Francisco","New York"])].value_counts()

**San Francisco has 6 times as many drinks in the list!** That seems unusual, considering New Yorkers are known for loving their cocktails; plus New York is the larger city. Well, it turns out the company that generously supplied this data, Hotaling & Co, is based in San Fran. That may very well be the primary reason. 
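The ratio can also be computed directly from the counts instead of being read off the output. A small sketch, using invented counts in place of the real `df.Location` column (the numbers are chosen only to illustrate the arithmetic):

```python
import pandas as pd

# Hypothetical counts, chosen only to illustrate the calculation
toy = pd.Series(
    ["San Francisco"] * 12 + ["New York"] * 2 + ["Houston"] * 3,
    name="Location",
)

counts = toy.value_counts()
ratio = counts["San Francisco"] / counts["New York"]
print(f"San Francisco has {ratio:.1f} times as many drinks as New York")
```

With the actual notebook data, `toy` would simply be replaced by `df.Location`.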

But that brings us to an important point--selection bias. This data should **not** be taken as a representative sample of cocktail characteristics across America (though if such a dataset existed, I'd love to see it!). It's just a biased sampling of some nice drinks that the distributor has chosen to share. Considering we don't know how drinks are selected for the list, we can't really make any assumptions beyond that. 

## A Closer Look at San Francisco Cocktails

In [None]:
df_sf = df.loc[df.Location == "San Francisco"]
df_sf.describe()

In [None]:
liqours = ["gin", "vodka", "rum", "whiskey", "rye", "bourbon", "tequila"]
for liq in liqours:
    # str.find returns the index of the first match, or -1 if liq is absent
    df.loc[:, liq] = df.Ingredients.str.lower().str.find(liq)

# An index of 0 is a valid match (spirit listed first), so test for >= 0
n_gin = df.gin.loc[df.gin >= 0].count()
n_whi = df.whiskey.loc[df.whiskey >= 0].count()
n_vod = df.vodka.loc[df.vodka >= 0].count()
n_rye = df.rye.loc[df.rye >= 0].count()
n_bou = df.bourbon.loc[df.bourbon >= 0].count()
n_teq = df.tequila.loc[df.tequila >= 0].count()

# Rum is flagged above but not tallied separately, so rum drinks fall under "Other"
p = [n_gin,n_whi,n_vod,n_rye,n_bou,n_teq, len(df)-n_gin-n_whi-n_vod-n_rye-n_bou-n_teq]
labels = ["Gin","Whiskey","Vodka","Rye","Bourbon","Tequila","Other"]
    
f0, ax0 = plt.subplots(figsize=(6,6))
ax0.pie(p,labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
plt.title("Main Cocktail Spirit for all Locations")
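A quick aside on the matching step: `Series.str.find` returns the position of the first match and `-1` when there is none, so a spirit listed first in the ingredients sits at position `0`. The sketch below, on made-up ingredient strings, compares it with `Series.str.contains`, and also shows that plain substring matching picks up "gin" inside "ginger". The word-boundary pattern is one possible workaround, not something the original analysis uses.

```python
import pandas as pd

# Made-up ingredient strings for illustration
ingredients = pd.Series([
    "Gin, dry vermouth, orange bitters",  # spirit at position 0
    "2 oz rye whiskey, sweet vermouth",   # no gin
    "Pisco, lime juice, egg white",       # none of the listed spirits
    "Vodka, ginger beer, lime",           # "gin" appears only inside "ginger"
])

lowered = ingredients.str.lower()

positions = lowered.str.find("gin")                  # first index, -1 if absent
has_substring = lowered.str.contains("gin", regex=False)
has_word = lowered.str.contains(r"\bgin\b")          # whole-word match only

print(positions.tolist())      # [0, -1, -1, 7]
print(has_substring.tolist())  # [True, False, False, True]
print(has_word.tolist())       # [True, False, False, False]
```

A boolean mask from `str.contains` can be summed directly to get a count, which avoids any off-by-one question about position `0`.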

In [None]:
# Here is a simple function that tallies the gin, whiskey, vodka, rye, bourbon, and tequila drinks for a given city

def liq_find(data, city):

    # Copy the slice so adding columns does not trigger SettingWithCopyWarning
    df_city = data.loc[data.Location == city].copy()

    for liq in liqours:
        # str.find returns the index of the first match, or -1 if liq is absent
        df_city.loc[:, liq] = df_city.Ingredients.str.lower().str.find(liq)

    # An index of 0 is a valid match (spirit listed first), so test for >= 0
    n_gin = df_city.gin.loc[df_city.gin >= 0].count()
    n_whi = df_city.whiskey.loc[df_city.whiskey >= 0].count()
    n_vod = df_city.vodka.loc[df_city.vodka >= 0].count()
    n_rye = df_city.rye.loc[df_city.rye >= 0].count()
    n_bou = df_city.bourbon.loc[df_city.bourbon >= 0].count()
    n_teq = df_city.tequila.loc[df_city.tequila >= 0].count()

    # Number and ratio of drinks with "other" main spirit
    n_other = len(df_city)-n_gin-n_whi-n_vod-n_rye-n_bou-n_teq
    r_other = n_other/len(df_city)

    # Print a summary
    print("Number of gin drinks in "+city+": "+str(n_gin))
    print("Number of whiskey drinks in "+city+": "+str(n_whi))
    print("Number of vodka drinks in "+city+": "+str(n_vod))
    print("Percent of drinks with Other Spirit - "+city+": "+("%0.1f" % (100*r_other))+"%")

    p = [n_gin,n_whi,n_vod,n_rye,n_bou,n_teq,n_other]
    labels = ["Gin","Whiskey","Vodka","Rye","Bourbon","Tequila","Other"]

    f1, ax1 = plt.subplots(figsize=(6,6))
    ax1.pie(p,labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
    plt.title(city)

liq_find(df, "San Francisco")

Now let's look at New York.

In [None]:
liq_find(df, "New York")

In [None]:
liq_find(df, "Houston")

In [None]:
liq_find(df, "Los Angeles")

In [None]:
liq_find(df, "New Orleans")

So, roughly half of the drinks are not based on gin, vodka, or tequila.
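To compare cities side by side instead of one pie chart at a time, the per-city tallies can be collected into a single table of shares. The sketch below is one way to do it, assuming the `Location` and `Ingredients` columns used above. The helper name `spirit_shares`, the whole-word matching, and the toy rows are my own assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

SPIRITS = ["gin", "vodka", "rum", "whiskey", "rye", "bourbon", "tequila"]

def spirit_shares(data, spirits=SPIRITS):
    """Share of drinks per Location mentioning each spirit, plus an 'other' share."""
    lowered = data["Ingredients"].fillna("").str.lower()
    # One boolean column per spirit; \b avoids matching "gin" inside "ginger"
    flags = pd.DataFrame({s: lowered.str.contains(rf"\b{s}\b") for s in spirits})
    # A drink counts as "other" only if none of the listed spirits appear
    flags["other"] = ~flags.any(axis=1)
    # The mean of a boolean column is the fraction of True values
    return flags.groupby(data["Location"]).mean()

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Location": ["San Francisco", "San Francisco", "New York", "New York"],
    "Ingredients": [
        "Gin, lime juice",
        "Pisco, lemon juice",
        "Bourbon, bitters",
        "Rye whiskey, sweet vermouth",
    ],
})

shares = spirit_shares(toy)
print(shares.round(2))
```

Because a drink such as "rye whiskey" is flagged under both `rye` and `whiskey`, the spirit columns in a row can add up to more than 1. Only the `other` column is guaranteed to count each drink once.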