# Let's Make Some Drinks
But what?!

My task here is to create a program that receives ingredients as inputs and produces recipes that contain those ingredients.

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

cocktails = pd.read_csv('cocktails.csv')

## Cleaning & EDA
I'll start by looking at the information about the dataframe, checking the data types and the number of missing values.

In [20]:
cocktails.info()
cocktails.head()

## Looks like each column is an object data type, so probably all text entries with no numeric data. There are also lots of missing values. I won't need to impute any missing data, since the data isn't numeric and I won't be conducting any statistical analysis. Now I'll create a bar graph to visualize the missing data. Maybe not totally necessary, but good practice at least.

In [21]:
missing = cocktails.isna().sum()
complete = cocktails.notna().sum()
complete
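As a quick cross-check on the raw counts, here is a minimal sketch of the same idea expressed as a share of missing values per column. The `toy` dataframe and its values are made up for illustration (they are not from `cocktails.csv`), and the column names other than 'Ingredients' and 'Notes' are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cocktails data; values are invented for illustration.
toy = pd.DataFrame({
    'Cocktail Name': ['Daiquiri', 'Negroni', 'Mojito', 'Martini'],
    'Ingredients': ['rum, lime', 'gin, vermouth', 'rum, mint', 'gin, vermouth'],
    'Notes': [np.nan, 'anniversary drink', np.nan, np.nan],
})

# isna() gives a boolean frame; the column mean is the fraction that is missing.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting puts the emptiest columns first, which makes it easier to spot candidates for dropping.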

## Make a dataframe showing filled vs null values

In [22]:
null_v_complete = pd.concat([complete, missing], axis=1)
null_v_complete['Category'] = null_v_complete.index
null_v_complete.columns = ['Filled', 'Null', 'Category']
null_v_complete = null_v_complete[['Category', 'Filled', 'Null']].reset_index(drop=True)
null_v_complete.info()
null_v_complete

## Only two columns don't have any null values: the cocktail name and the cocktail ingredients. Thankfully, those are the most important columns. Is there any column that I really don't need at all?...

In [23]:
# Set the style before creating the figure so it applies to this plot
plt.style.use('seaborn-v0_8-dark-palette')

fig, ax = plt.subplots(figsize=(10, 6))
bar1 = ax.bar(null_v_complete['Category'], null_v_complete['Filled'], label='Filled')
bar2 = ax.bar(null_v_complete['Category'], null_v_complete['Null'], bottom=null_v_complete['Filled'], label='Null')
ax.set_title('Filled vs Null Values by Category')
ax.set_xlabel('Category')
ax.set_ylabel('Count')
# Rotate the existing tick labels instead of re-setting them
ax.tick_params(axis='x', labelrotation=60)
ax.spines[['top', 'right']].set_visible(False)
ax.bar_label(bar1, label_type='center', color='white', weight='bold')
ax.bar_label(bar2, label_type='center', color='white', weight='bold', padding=3)
ax.axhline(400, color='b', linewidth=1, linestyle='dashed')
ax.legend(loc=2)

plt.show()

### I looked through the dataset and its info, and the 'Notes' column is mostly null values. The entries that do have values aren't really pertinent: things like "this is an anniversary drink" or "credit for the photo to..", so my first task is to drop that column altogether. The 'Bar/Company' column also has lots of null values, and 'Location' is missing about half of its values. While the bar name and location would be cool, for this example I'm going to remove those columns since there are so many empty values. For the task of filtering the data to use as a recipe generator, I might as well get rid of every column except for cocktail name, ingredients, garnish, glassware, and preparation.

In [24]:
cocktails.drop(['Bartender', 'Bar/Company', 'Location', 'Notes'], axis=1, inplace=True)
cocktails.info()
cocktails.head()
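Since the goal is really "keep only these columns", another way to write the step is with a keep-list instead of a drop-list. This is just a sketch on a toy dataframe: the exact column names ('Cocktail Name', 'Garnish', 'Glassware', 'Preparation') are my assumptions about how the columns are spelled, and the rows are invented.

```python
import pandas as pd

# Toy stand-in; only some of the wanted columns exist here on purpose.
toy = pd.DataFrame({
    'Cocktail Name': ['Daiquiri'],
    'Bartender': [None],
    'Ingredients': ['rum, lime juice, sugar'],
    'Notes': [None],
})

# Assumed spellings of the columns worth keeping.
keep = ['Cocktail Name', 'Ingredients', 'Garnish', 'Glassware', 'Preparation']

# Only select names that are actually present, so a misspelling
# doesn't raise a KeyError.
present = [col for col in keep if col in toy.columns]
trimmed = toy[present]
print(trimmed.columns.tolist())
```

A keep-list has the side benefit that any new column added to the CSV later gets excluded by default.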

## Ok, now let's check for any duplicate data

In [25]:
for col in cocktails.columns:
    col_vals = cocktails[col].nunique()
    print(f'Unique values in {col} column = {col_vals}')
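Counting unique values hints at duplicates, but it doesn't say which rows repeat. A minimal sketch of a more direct check with `duplicated()` is below. The toy data is invented, and 'Cocktail Name' is an assumed column name.

```python
import pandas as pd

# Toy stand-in with one fully repeated row and one repeated name.
toy = pd.DataFrame({
    'Cocktail Name': ['Daiquiri', 'Negroni', 'Daiquiri', 'Daiquiri'],
    'Ingredients': ['rum, lime', 'gin, vermouth', 'rum, lime', 'rum, lime, mint'],
})

# Rows identical to an earlier row in every column.
full_dupes = toy.duplicated().sum()

# Rows that repeat an earlier cocktail name, whatever the recipe.
name_dupes = toy.duplicated(subset='Cocktail Name').sum()

# Drop only the fully identical rows.
deduped = toy.drop_duplicates()
print(full_dupes, name_dupes, len(deduped))
```

Repeated names with different ingredients may be legitimate variations, so I'd only drop rows that match in every column.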

## Build the Recommender
First create a variable to contain spirits. This variable will provide the input ingredients. I'm sure it won't be complete at first, and I'll add to it as time goes on.

In [26]:
cocktails['Ingredients'].str.len().describe()

## I learned the hard way that I need to convert the data types from object to string. Actually, I think the columns already held strings, but I'll leave this step in as a check.

In [27]:
cocktails = cocktails.astype(str)
cocktails.dtypes

## Create a list of individual spirits, then use a dict comprehension to build a boolean dataframe showing which spirits are in which cocktails

In [28]:
spirit_list = ['sake', 'port', 'madeira', 'sherry', 'vermouth', 'sangria', 'champagne', 'rum', 'brandy', 'cognac', 'gin', 'whisky', 'whiskey', 'bourbon', 'vodka', 'absinthe', 'mezcal', 'tequila']

spirits = pd.DataFrame({
    spirit: cocktails['Ingredients'].str.contains(spirit, case=False) 
    for spirit in spirit_list})

spirits.head()
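One thing to watch with plain substring matching is that short names can match inside longer words, for example 'gin' inside 'ginger'. Here is a minimal sketch of a stricter variant that wraps each spirit in regex word boundaries. The ingredient strings are invented for illustration, and this is one possible approach, not necessarily the right one for this dataset.

```python
import re
import pandas as pd

# Invented ingredient strings for illustration.
ingredients = pd.Series([
    '2 oz gin, 1 oz dry vermouth',
    '4 oz ginger beer, 0.5 oz lime juice',
    '2 oz white rum, 1 oz lime juice',
])

spirit_list = ['gin', 'rum', 'vermouth']

# Plain substring match: 'gin' also hits 'ginger'.
plain = ingredients.str.contains('gin', case=False)

# \b marks a word boundary, so only the whole word matches.
bounded = pd.DataFrame({
    spirit: ingredients.str.contains(
        rf'\b{re.escape(spirit)}\b', case=False, regex=True)
    for spirit in spirit_list
})
print(bounded)
```

The trade-off is that whole-word matching misses plurals and compound spellings, so the spirit list might need those variants added explicitly.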

## Input desired spirits into query function

In [29]:
selection = spirits.query('rum')
len(selection)

In [30]:
cocktails.loc[selection.index]
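Because each column of `spirits` is boolean, the query string can also combine several spirits with `and`, `or`, and `not`. A small sketch on a made-up boolean frame (the values are invented, not taken from the real data):

```python
import pandas as pd

# Made-up boolean frame shaped like the spirits dataframe above.
spirits = pd.DataFrame({
    'gin': [True, False, True],
    'rum': [False, True, False],
    'vermouth': [True, False, False],
})

both = spirits.query('gin and vermouth')   # contains both spirits
either = spirits.query('gin or rum')       # contains at least one
no_rum = spirits.query('not rum')          # excludes rum

print(both.index.tolist(), either.index.tolist(), no_rum.index.tolist())
```

The resulting index can be passed to `cocktails.loc[...]` in the same way as the single-spirit selection.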

# Mission Complete
Seems to work :)

Some spirits such as gin have several results

Next I want to work on some type of 'input' function so a user can type in their selections in a more user friendly manner

I also need to include the ingredients and instructions in addition to just the cocktail name
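As a starting point for that, here is a rough sketch of what such a function might look like. The name `find_cocktails`, its parameters, and the toy data are all hypothetical, and it uses plain substring matching, so the same 'gin' vs 'ginger' caveat applies.

```python
import pandas as pd

def find_cocktails(cocktails, wanted, column='Ingredients'):
    """Return the rows whose ingredient text mentions every spirit in `wanted`."""
    mask = pd.Series(True, index=cocktails.index)
    for spirit in wanted:
        # strip() tolerates stray spaces from typed input; na=False treats
        # missing ingredient text as "no match".
        mask &= cocktails[column].str.contains(
            spirit.strip(), case=False, regex=False, na=False)
    return cocktails[mask]

# Invented rows for illustration.
toy = pd.DataFrame({
    'Cocktail Name': ['Negroni', 'Daiquiri', 'Martini'],
    'Ingredients': ['gin, sweet vermouth, Campari',
                    'white rum, lime juice, sugar',
                    'gin, dry vermouth'],
})

# In a notebook, the list could come from something like
# input('Spirits: ').split(',')
result = find_cocktails(toy, ['Gin', ' vermouth'])
print(result)
```

Returning the full rows means the ingredients and any other remaining columns come along with the cocktail name.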

# Sources
https://app.datacamp.com/learn/tutorials/techniques-to-handle-missing-data-values

https://medium.com/@navamisunil174/exploratory-data-analysis-of-breast-cancer-survival-prediction-dataset-c423e4137e38

https://stackoverflow.com/questions/18062135/combining-two-series-into-a-dataframe-in-pandas

https://www.geeksforgeeks.org/how-to-create-a-stacked-bar-plot-in-seaborn/

https://medium.com/@jb.ranchana/easy-way-to-create-stacked-bar-graphs-from-dataframe-19cc97c86fe3

https://towardsdatascience.com/4-methods-for-changing-the-column-order-of-a-pandas-data-frame-a16cf0b58943

https://app.datacamp.com/learn/courses/introduction-to-data-visualization-with-matplotlib

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html#matplotlib.pyplot.legend

https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_label_demo.html#sphx-glr-gallery-lines-bars-and-markers-bar-label-demo-py

https://stackoverflow.com/questions/70271367/stacked-bars-are-unexpectedly-annotated-with-the-sum-of-bar-heights

https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html#pandas.Series.str.contains

McKinney, Wes. Python for Data Analysis. Chapter 7.

VanderPlas, Jake. Python Data Science Handbook. Chapters 16, 22.