__Fundamentals of Data Visualization – Final Project - Visualization of Starbucks Drinks Nutrition__
<br>__[Starbucks Drink Nutritional Information Dataset](https://www.kaggle.com/starbucks/starbucks-menu?select=starbucks_drinkMenu_expanded.csv)__

The dataset I chose to work with describes the drinks offered at Starbucks, a popular café, along with their nutritional information. The data, in CSV format, was found on Kaggle. It lists the drink categories and their sizes, along with the standard nutritional information found on most food and beverage labels in the United States:
<br>- Calories 
<br>- Total Fat
<br>- Trans Fat
<br>- Saturated Fat
<br>- Sodium
<br>- Total Carbohydrates
<br>- Cholesterol
<br>- Dietary Fiber
<br>- Sugars
<br>- Protein
<br>- Vitamins A & C
<br>- Calcium
<br>- Iron
<br>- Caffeine

My goal in working with this data is to determine the healthier and unhealthier drinks offered at Starbucks while taking a holistic view of nutrition. Kaggle already hosts some histograms illustrating the distribution of the drinks across each nutritional category. They give a good initial idea of how the drinks are distributed, but a deeper dive is required to identify specific characteristics and connections between nutritional categories, drink categories, and drink preparation styles. Some difficulties were encountered with the way the dataset classifies “Beverage_prep.” The classification is not iteration-friendly: in certain cases the field includes both the size and the milk or milk alternative added. Additionally, the dataset mentions the size only in the preparation that uses nonfat milk. The milk-alternative preparations listed under that nonfat row are assumed to be the same size, which works for a static numeric table but not so well for plotting.
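One way the “Beverage_prep” layout described above could be made easier to plot is to split it into separate size and milk columns. The sketch below is only an illustration under stated assumptions: the `sample` frame is a hypothetical miniature of the layout, the `Size` and `Milk` column names are my own, and the list of size keywords is assumed rather than taken from the full dataset.

```python
import pandas as pd

# Hypothetical miniature of the Beverage_prep layout (not the real file)
sample = pd.DataFrame({
    "Beverage": ["Caffè Latte"] * 6,
    "Beverage_prep": ["Short Nonfat Milk", "2% Milk", "Soymilk",
                      "Tall Nonfat Milk", "2% Milk", "Soymilk"],
})

# Assumed size keywords; extend the pattern if the data uses others
sizes = r"(Short|Tall|Grande|Venti)"

# Pull the size out where it is stated; rows without a size become NaN
sample["Size"] = sample["Beverage_prep"].str.extract(sizes, expand=False)

# Carry the last stated size forward within each beverage
sample["Size"] = sample.groupby("Beverage")["Size"].ffill()

# Whatever remains after removing the size is the milk choice
sample["Milk"] = (sample["Beverage_prep"]
                  .str.replace(sizes, "", regex=True)
                  .str.strip())

print(sample)
```

The forward fill relies on the rows staying in their original order, since the size of a milk-alternative row is inferred from the nonfat row listed before it.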

The tasks I want to complete with the data include presenting the most relevant nutritional information for the overall Starbucks drink menu and presenting the same information for individual drink categories. The purpose of presenting this data is to tell people which drinks to indulge in and which to avoid. Identification by drink category gives people the option to narrow down their choices without feeling restricted. The tasks will be carried out largely by organizing the data, and they seek to identify patterns that link drink categories and drink preparations to nutritional information. The tasks operate in a relative reference frame: the nutritional information of the data points is compared against each other to inform the user of the best and worst drink choices in the dataset. The tasks will be performed by the data analyst after the dataset is cleaned up and made more uniform.

In the code cell below, I perform additional tidying of the data.


In [6]:
import pandas as pd
import numpy as np
import altair as alt

data = pd.read_csv("starbucks_drinkMenu_expanded.csv")
data.head()

#remove leading and trailing spaces of column names
data.columns = data.columns.str.strip()
#find any missing data
data.isna().any()
#how many values are missing
data.isnull().sum()
#row with missing value
data[data.isnull().any(axis=1)]
#fill in missing value
data = data.fillna(125)
data.isna().any()


#change column data into numeric
#print(data["Total Fat (g)"].unique())
data.loc[data['Total Fat (g)'] == '3 2']
data["Total Fat (g)"] = data["Total Fat (g)"].str.replace('3 2','3')
data["Total Fat (g)"] = data["Total Fat (g)"].astype(float)

#print(data["Vitamin A (% DV)"].unique())
data["Vitamin A (% DV)"] = data["Vitamin A (% DV)"].str.replace('%','')
data["Vitamin A (% DV)"] = data["Vitamin A (% DV)"].astype(int)

data["Vitamin C (% DV)"] = data["Vitamin C (% DV)"].str.replace('%','')
data["Vitamin C (% DV)"] = data["Vitamin C (% DV)"].astype(int)

#print(data["Calcium (% DV)"].unique())
data["Calcium (% DV)"] = data["Calcium (% DV)"].str.replace('%','')
data["Calcium (% DV)"] = data["Calcium (% DV)"].astype(int)

#print(data["Iron (% DV)"].unique())
data["Iron (% DV)"] = data["Iron (% DV)"].str.replace('%','')
#regex=False so that '.00' is matched literally rather than as a regular expression
data["Iron (% DV)"] = data["Iron (% DV)"].str.replace('.00','', regex=False)
data["Iron (% DV)"] = data["Iron (% DV)"].astype(int)

#print(data["Caffeine (mg)"].unique())
data.loc[data["Caffeine (mg)"] == 'Varies']
data.loc[102:105] = data.loc[102:105].replace('Varies', '10')
data.loc[167:171] = data.loc[167:171].replace('Varies', '20')
data.loc[172] = data.loc[172].replace('Varies', '30')

data.loc[data["Caffeine (mg)"] == 'varies']
data["Caffeine (mg)"] = data["Caffeine (mg)"].replace('varies', '50')
data["Caffeine (mg)"] = data["Caffeine (mg)"].astype(int)

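The per-column conversions in the cell above repeat the same pattern. As a hedged alternative (not the approach used in this notebook), the percent columns could be handled in one loop with `pd.to_numeric`, and text placeholders such as 'Varies' could be turned into missing values before deciding how to fill them. The `raw` frame and `caffeine` series below are small made-up stand-ins for the real columns.

```python
import pandas as pd

# Made-up stand-in for the "(% DV)" columns of the dataset
raw = pd.DataFrame({
    "Vitamin A (% DV)": ["10%", "0%", "15%"],
    "Iron (% DV)": ["0.00%", "8%", "50%"],
})

# Convert every percent column the same way: drop '%', parse, cast to int
pct_cols = [c for c in raw.columns if c.endswith("(% DV)")]
for col in pct_cols:
    raw[col] = pd.to_numeric(raw[col].str.rstrip("%")).astype(int)

# Made-up stand-in for the caffeine column with text placeholders
caffeine = pd.Series(["175", "Varies", "varies", None, "75"])

# errors="coerce" turns anything unparseable into NaN instead of raising
caffeine_numeric = pd.to_numeric(caffeine, errors="coerce")
unresolved = int(caffeine_numeric.isna().sum())

print(raw.dtypes)
print("caffeine entries still needing a value:", unresolved)
```

Coercing to NaN keeps the unknown entries visible, so the choice of fill value (such as the 10, 20, 30, and 50 used above) remains an explicit step.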


The first attempt at visualization involved getting simple counts of the number of drinks in each Beverage_category and Beverage_prep.

In [7]:
#drink counts by category
data.Beverage_category.unique()
drinkCountchart = alt.Chart(data).mark_bar().encode(
    x = "Beverage_category",
    y = alt.Y('count():Q', title = '# of drinks'),
    tooltip = ['count():Q']
)
drinkPrepCountchart = alt.Chart(data).mark_bar().encode(
    x = "Beverage_prep",
    y = alt.Y('count():Q', title = '# of drinks'),
    tooltip = ['count():Q']
)
drinkCountchart | drinkPrepCountchart

In trying to figure out how to filter the dataset most efficiently, I found two different ways to filter. The first was to iterate through the entire dataset with a 'for' loop and 'if' statements. This proved to be inefficient and lost certain information. The second was to make separate data frames, each limited to a single value of the category column.

In [4]:
#keep Grande drinks, plus smoothies prepared with nonfat milk
filtered_data = pd.DataFrame(np.nan,index = [0], columns = data.columns)

inx = 0
for index, row in data.iterrows():
    if "Grande" in row['Beverage_prep']:
        if inx == 0:
            filtered_data.loc[0] = row
            inx = inx +1
        else:
            filtered_data.loc[inx] = row
            inx = inx + 1
    elif "Smoothies" in row['Beverage_category']:
        if "Nonfat" in row['Beverage_prep']:
            filtered_data.loc[inx] = row
            inx = inx + 1
    else: 
        inx = inx + 1
filtered_data = filtered_data.reset_index(level=0, drop=True).drop(filtered_data.index[0])

#filtered data by beverage category
keywords = ['Grande', 'Solo', 'Doppio']
CED = data[(data.Beverage_category == "Classic Espresso Drinks")].reset_index(level=0, drop=True)
COF = data[(data.Beverage_category == 'Coffee')].reset_index(level=0, drop=True)
SED = data[(data.Beverage_category == "Signature Espresso Drinks")].reset_index(level=0, drop=True)
TTD = data[(data.Beverage_category =='Tazo® Tea Drinks')].reset_index(level=0, drop=True)
SIB = data[(data.Beverage_category =='Shaken Iced Beverages')].reset_index(level=0, drop=True)
SMO = data[(data.Beverage_category =='Smoothies')].reset_index(level=0, drop=True)
FBCo = data[(data.Beverage_category =='Frappuccino® Blended Coffee')].reset_index(level=0, drop=True)
FLBC = data[(data.Beverage_category =='Frappuccino® Light Blended Coffee')].reset_index(level=0, drop=True)
FBCr = data[(data.Beverage_category =='Frappuccino® Blended Crème')].reset_index(level=0, drop=True)
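For comparison, both filters in the cell above can also be written without an explicit loop. The following is a minimal sketch, not the code used for the final visualization: `menu` is a small hypothetical frame standing in for `data`, and the dictionary `by_category` is my own substitute for the nine separately named data frames.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned `data` frame
menu = pd.DataFrame({
    "Beverage_category": ["Coffee", "Coffee", "Smoothies", "Smoothies",
                          "Classic Espresso Drinks"],
    "Beverage": ["Brewed Coffee", "Brewed Coffee",
                 "Banana Chocolate Smoothie", "Banana Chocolate Smoothie",
                 "Caffè Latte"],
    "Beverage_prep": ["Short", "Grande", "Grande Nonfat Milk", "Soymilk",
                      "Grande Nonfat Milk"],
})

# Boolean masks replace the for/if logic: Grande rows, or nonfat smoothies
is_grande = menu["Beverage_prep"].str.contains("Grande")
is_nonfat_smoothie = (menu["Beverage_category"].str.contains("Smoothies")
                      & menu["Beverage_prep"].str.contains("Nonfat"))
filtered = menu[is_grande | is_nonfat_smoothie].reset_index(drop=True)

# One data frame per category, keyed by the category name
by_category = {name: group.reset_index(drop=True)
               for name, group in menu.groupby("Beverage_category")}

print(filtered)
print(list(by_category))
```

Because the masks select rows in place, no placeholder row or manual index counter is needed, which avoids the gaps the loop can leave behind.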


I then proceeded to write up a function that would load nutritional charts displaying Calories, Total Fat, Sugars, and Caffeine content of drinks in specific categories. 

In [5]:
#nutritional charts by beverage category
def makeNUTchart(DF):
    print("Loading nutritional chart for", DF.Beverage_category.unique())
    selection = alt.selection_multi(fields=['Beverage_prep'])
    color = alt.condition(selection,
                      alt.Color('Beverage_prep:N', legend=None, scale = alt.Scale(scheme = 'set1')),
                      alt.value('lightgray'))

    chart = alt.Chart(DF).mark_square().encode(
        x = "Beverage",
        color = color)

    legend = alt.Chart(DF).mark_point().encode(
    y=alt.Y('Beverage_prep:N', axis=alt.Axis(orient='right')),
    color= color,
    ).add_selection(
    selection
    )
    return (chart.encode(y = "Calories", tooltip = ["Beverage_prep", "Calories"])
            | chart.encode(y = 'Total Fat (g)', tooltip = ["Beverage_prep", 'Total Fat (g)'])
            | chart.encode(y = "Sugars (g)", tooltip = ["Beverage_prep", "Sugars (g)"])
            | chart.encode(y = 'Caffeine (mg)', tooltip = ["Beverage_prep", 'Caffeine (mg)'])
            | legend)

makeNUTchart(CED)

Loading nutritional chart for ['Classic Espresso Drinks']


Unfortunately, I realized that this would require user input to be fed into the code itself, since the data frame for a category has to be passed to the function by hand. This again did not seem to be the most efficient way to present the data. After additional research, I was able to implement filtering with Altair's selection capabilities. I used a multi-selection on the color legend to filter the beverage preparations and a single selection bound to a dropdown menu to filter the beverage categories.

Implementation of my fine-tuned final visualization can be observed by running the cell below. I decided to only present nutritional information that included Total fat, Calories, Sugar, and Caffeine.

<br>The dropdown menu allows you to filter out beverages by Beverage categories offered at Starbucks while the legend to the right allows you to filter out the Beverage preparation (i.e. the size and/or milk/milk-alternative you want with your drink) by clicking on the colored legend icons. Additionally, hovering over individual plot points will give you nutritional information about that beverage.

In [9]:
Legendselection = alt.selection_multi(fields=['Beverage_prep'])
color = alt.condition(Legendselection,
                      alt.Color('Beverage_prep:N', legend=None),
                      alt.value('lightgray')) 

input_dropdown = alt.binding_select(options = [None, 'Coffee','Classic Espresso Drinks','Signature Espresso Drinks','Tazo® Tea Drinks','Shaken Iced Beverages','Smoothies','Frappuccino® Blended Coffee','Frappuccino® Light Blended Coffee','Frappuccino® Blended Crème'], name = "Beverage Category") 
dropDownselection = alt.selection_single(fields = ['Beverage_category'], bind = input_dropdown) 
                  
totalPlot = alt.Chart(data).mark_circle().encode(
    x = 'Beverage',
    color = color
).add_selection(
    dropDownselection
).transform_filter(
    dropDownselection
)

legend = alt.Chart(data).mark_point().encode(
    y=alt.Y('Beverage_prep:N', axis=alt.Axis(orient='right')),
    color= color
).add_selection(
    Legendselection
)

(totalPlot.encode(y = "Calories", tooltip = ["Beverage", 'Beverage_prep', "Calories"])
 | totalPlot.encode(y = 'Total Fat (g)', tooltip = ["Beverage", 'Beverage_prep', "Calories", 'Total Fat (g)'])
 | totalPlot.encode(y = 'Sugars (g)', tooltip = ["Beverage", 'Beverage_prep', "Calories", 'Sugars (g)'])
 | totalPlot.encode(y = 'Caffeine (mg)', tooltip = ["Beverage", 'Beverage_prep', "Calories", 'Caffeine (mg)'])
 | legend)

#To save the visualization as an HTML file
#assign above code to variable StarbsPlot
#uncomment the line below to save the visualization
#StarbsPlot.save('StarbsPlot.html', embed_options={'renderer':'svg'})
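The dropdown options in the cell above are typed out by hand. A small sketch of deriving them from the data instead is shown here; `menu` is a hypothetical stand-in for `data`, and the resulting `options` list is meant to be passed as the `options` argument of `alt.binding_select`, as in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned `data` frame
menu = pd.DataFrame({
    "Beverage_category": ["Coffee", "Smoothies", "Coffee", "Tazo® Tea Drinks"],
})

# None comes first so the dropdown starts with "no category selected"
options = [None] + sorted(menu["Beverage_category"].unique().tolist())

print(options)
```

Building the list this way keeps the dropdown in sync with the data if a category name changes, at the cost of an alphabetical rather than hand-picked order.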

Key elements of my design include:<br>
<br> Interaction – I wanted the visualization to be engaging and dynamic, giving Starbucks customers the flexibility to have different drink preferences at any given time. With an interactive visualization, users can narrow the view of the data to suit their needs. I followed the approach of providing an overview first (all data presented as a scatterplot), then allowing the user to zoom and filter (using the dropdown menu and legend), and then providing details on demand through Altair’s tooltip.<br>
<br> Categorization – I wanted the visualization to follow the flow of decision making that humans follow when narrowing down choices from a vast group of options. From my experience, people find it easier to narrow down a large number of choices using patterns or categories.<br>
<br> Subsets of data – I wanted the visualization to encompass all of the menu options but present them in a way that was not overwhelming to the user and that also follows a typical person’s decision-making flow. By presenting the menu options for the drink category the user selects, I can show only the data relevant to the next step in their decision-making process.<br>

I decided to take an insight-based evaluation approach to determine whether users could gain new knowledge from my visualization. The target question for this evaluation was: if one is health-conscious, how would they go about deciding what drink to order at Starbucks?
Ideally, I would have recruited health-conscious Starbucks customers to answer the target question. Because I didn’t have access to a wide sample of regular Starbucks customers, I recruited family members and friends who were familiar with the menu offerings at Starbucks. I used depth of insight to determine whether participants gained general knowledge of nutritional information from the visualization. I also used time to insight to determine whether the visualization helped users choose a drink faster than if they had just looked at the static menu. These measures helped infer, in general terms, whether the information shown to health-conscious users was useful for identifying a healthy drink. To answer the target question, I mocked up an initial draft of the visualization and had the users role-play deciding what drink to order. The participants were asked to imagine walking into Starbucks as a health-conscious customer and to describe how they would decide what to order, given the visualization. I considered the visualization successful if users found it easy to use, helpful in decision making, and relevant to the health factors of the drinks.

My initial setup plotted certain nutritional information (fat, sugar, caffeine) against calories, but I found it was not entirely helpful because the specific drinks could not be identified without hovering over each plot point. Although this visualization demonstrated a linear relationship between sugar and calories, there was not much insight to be gained from comparing calories to the other nutritional categories. The lack of insight, coupled with the difficulty of navigating point by point, made this approach unfavorable. Through the evaluation I found that people don’t look granularly at fat, sugar, and caffeine content; they look for general ‘healthy’ regions and then get more specific based on the beverages available. Using that insight, I decided to plot beverages against each nutritional category and to place the interaction in selecting the beverage category and the beverage preparation.

I also noticed that the Beverage vs. Caffeine plot ran into difficulties once different beverage preparations were included. Because a beverage’s caffeine level largely remains unchanged regardless of the size or the milk or milk alternative used, many of the data points were overlaid on top of each other on the caffeine plot. This made it difficult to identify the caffeine level of certain drinks when filtering by preparation. Data points for the chosen preparation should be highlighted in color while the rest are grayed out, but in some instances the colored points were drawn behind the unselected gray points, so it was difficult to tell where they were on the plot. In the future, I would like to refine the filtering so that selected data points stand out explicitly even when they share a value with unselected data.
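A possible first step toward that refinement is to measure how many preparations share the same plotted position. The sketch below is an assumption-laden illustration: the rows in `points` are made up rather than taken from the dataset, and the `n_overlapping` and `stack_pos` column names are my own.

```python
import pandas as pd

# Made-up rows for illustration only; not values read from the dataset
points = pd.DataFrame({
    "Beverage": ["Caffè Latte", "Caffè Latte", "Caffè Latte", "Brewed Coffee"],
    "Beverage_prep": ["Grande Nonfat Milk", "2% Milk", "Soymilk", "Grande"],
    "Caffeine (mg)": [150, 150, 150, 330],
})

coords = ["Beverage", "Caffeine (mg)"]

# Number of preparations that land on the same (Beverage, Caffeine) point
points["n_overlapping"] = (points.groupby(coords)["Beverage_prep"]
                           .transform("count"))

# Position of each row within its stack: 0 for the first, 1 for the next, ...
points["stack_pos"] = points.groupby(coords).cumcount()

print(points)
```

A column like `stack_pos` could then drive a small positional offset or a drawing order in the chart, so that a selected preparation is not hidden behind points with the same caffeine value.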

Combining two levels of filtering with tooltips that present more detailed information for each data point worked well in helping users narrow down their choice of drink. Overall, this project sought to narrow down the dataset and drill down into specific information about different beverage groups. In the future, I would like to take a different approach, looking at general patterns across the entire dataset and incorporating visualizations that bring insight in that direction.


