# Data Visualization Final

## Introduction 

The dataset, from Kaggle (https://www.kaggle.com/datasets/heitornunes/caffeine-content-of-drinks), provides the caffeine content and calories of many brands and types of beverages. Its attributes include the drink name, volume, calories, caffeine content, and drink type. I created a visualization from this data with the goal of providing insight into the caffeine content of various beverages, such as coffee, tea, and energy drinks.

## Preliminary Sketches

I came up with three ideas to showcase the relationship between volume and caffeine content. In my first sketch, I plotted the average caffeine content per mL. While this would give a sense of how much caffeine each beverage contains, I wanted to show more of the variability in the data.
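
The quantity behind the first sketch can be computed directly with pandas. The snippet below is a minimal sketch, not the code used for the sketch itself: it assumes the averages are taken per drink type, it reuses the column names from the plotting code later in this notebook (`Volume (ml)`, `Caffeine (mg)`, `type`), and the function name `mean_caffeine_per_ml` is made up for illustration.

```python
import pandas as pd


def mean_caffeine_per_ml(data: pd.DataFrame) -> pd.Series:
    """Average caffeine concentration (mg per mL) for each drink type.

    Assumes the columns "Volume (ml)", "Caffeine (mg)", and "type".
    """
    # Skip rows with a non-positive volume to avoid dividing by zero.
    valid = data[data["Volume (ml)"] > 0]
    concentration = valid["Caffeine (mg)"] / valid["Volume (ml)"]
    # Group the per-drink concentrations by drink type and average them.
    return concentration.groupby(valid["type"]).mean().sort_values(ascending=False)
```

Calling `mean_caffeine_per_ml(df)` on the loaded dataset would return one value per drink type, sorted from most to least concentrated.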

In my second sketch, I separated the different types of beverages into separate plots. While this gives a sense of how much drinks in the same category can vary, plotting them on the same graph makes it easier to compare categories.

Finally, in my third sketch, I included all the drinks, with each drink type corresponding to a different color. When hovering over each data point, you can see the beverage's name and the amount of caffeine contained in it.  This was the approach I went with for my visualization using Altair.


![image.png](attachment:image.png)


## Visualization Using Altair
Using sketch 3 as my starting point, I plotted the volume on the x-axis and the caffeine content on the y-axis. I also included a tooltip that shows which drink a point corresponds to and the caffeine content of that drink. Once I had plotted the relationship, I decided that the visualization would benefit from the ability to filter for specific types of drinks and highlight them. To do this, I used Altair's selection mechanism to allow the end user to filter by drink type by clicking either on a point on the graph or on the legend. With the drinks plotted this way, you can make out the general characteristics of some of the drink types, especially the energy shots, waters, and teas. Energy drinks and coffee tend to fall in the same range of caffeine-to-volume ratios, with coffee containing several outliers.

In [5]:
import pandas as pd
import altair as alt
# Kaggle dataset: https://www.kaggle.com/datasets/heitornunes/caffeine-content-of-drinks

# Load the dataset (expects caffeine.csv in the working directory).
df = pd.read_csv("caffeine.csv")

In [8]:
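# Multi-select on drink type, bound to the legend so that legend entries act as filters.
# Note: this is the Altair 4 API; Altair 5+ uses alt.selection_point(...) and .add_params(...).
selection = alt.selection(type='multi', fields=['type'], bind='legend')

caffeinePlot = alt.Chart(df).mark_circle().encode(
    x="Volume (ml)",
    y="Caffeine (mg)",
    color=alt.Color("type", scale=alt.Scale(scheme="spectral")),
    tooltip=["drink", "Caffeine (mg)"],
    # Selected drink types stay opaque; all others fade out.
    opacity=alt.condition(selection, alt.value(1), alt.value(0.1))
).add_selection(selection).interactive()
caffeinePlot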

## Feedback and Next Steps
I sought feedback on my visualization from family and friends. Because the group was small, I conducted a semi-structured interview with each participant.

While the participants agreed that the visualization was effective in giving a general sense of how much caffeine is in different types of beverages through the zones the data points create, they all had additional feedback on how to improve it. One recommendation was to provide a secondary graph with a histogram of the caffeine per volume, which the legend would also filter. This could provide additional insight into the concentration of caffeine in the beverages.

Another recommendation was to plot a smaller, representative sample of the data to display the relationship between volume and caffeine content, and to make each data point larger. This would make the visualization easier to comprehend.
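
A representative sample could be drawn in several ways. The sketch below assumes "representative" means keeping the same share of drinks from every drink type (a stratified sample); the function name `sample_per_type`, the default fraction, and the seed are illustrative choices, not values from the feedback.

```python
import pandas as pd


def sample_per_type(data: pd.DataFrame, frac: float = 0.25, seed: int = 0) -> pd.DataFrame:
    """Keep the same fraction of rows from each drink type.

    Assumes a "type" column; a fixed seed makes the sample reproducible.
    """
    return data.groupby("type").sample(frac=frac, random_state=seed)
```

The sampled frame can be passed to `alt.Chart(...)` in place of the full dataset, and the points can be enlarged with, for example, `mark_circle(size=100)`.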

Another suggestion was to plot a regression line for each beverage type, also on a separate graph, so that it is easier to see how caffeine concentration varies across caffeinated drinks.
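
To give a sense of what such regression lines would contain, the sketch below fits an ordinary least-squares line of caffeine against volume for each drink type using `numpy.polyfit`. The function name `fit_lines_by_type` and the output column names are hypothetical, and a simple straight-line fit is an assumption; the slope of each line can be read as that type's caffeine concentration in mg per mL.

```python
import numpy as np
import pandas as pd


def fit_lines_by_type(data: pd.DataFrame) -> pd.DataFrame:
    """Least-squares line (caffeine vs. volume) for each drink type."""
    rows = []
    for drink_type, group in data.groupby("type"):
        # A line needs at least two distinct volumes to be defined.
        if group["Volume (ml)"].nunique() < 2:
            continue
        slope, intercept = np.polyfit(
            group["Volume (ml)"], group["Caffeine (mg)"], deg=1
        )
        rows.append(
            {"type": drink_type, "slope_mg_per_ml": slope, "intercept_mg": intercept}
        )
    return pd.DataFrame(rows, columns=["type", "slope_mg_per_ml", "intercept_mg"])
```

Drink types with only one distinct volume are skipped rather than given a meaningless fit.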

Based on this feedback, I would supplement my current visualization with a histogram of caffeine per drink volume. I would also make each data point larger and remove outliers so that we can more easily draw conclusions about the caffeine content of various beverages.
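
How outliers would be identified is still an open choice. As one possible approach, the sketch below applies the common 1.5 × IQR rule to caffeine content within each drink type; the rule itself, the function name `drop_iqr_outliers`, and the parameter `k` are assumptions for illustration and not part of the feedback.

```python
import pandas as pd


def drop_iqr_outliers(
    data: pd.DataFrame, column: str = "Caffeine (mg)", k: float = 1.5
) -> pd.DataFrame:
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR] for their drink type."""
    grouped = data.groupby("type")[column]
    # Per-row quartiles of the row's own drink type.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = (data[column] >= q1 - k * iqr) & (data[column] <= q3 + k * iqr)
    return data[keep]
```

Computing the bounds per drink type keeps a strong coffee from being removed just because it is far from the waters and teas.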