Rachel K. Riggs
Carrie Cheung
Have you ever wondered where the cocoa beans in your favourite chocolate bar came from, and whether their origin affects how good it tastes?
To investigate, we needed some delicious chocolate data. We used the chocolate bar ratings dataset from Kaggle, which contains ratings by chocolate experts of 1,795 individual chocolate bars. The dataset also records where each bar's cocoa beans were grown, its cocoa percentage, and its bean variety.
Here is a snapshot of the first few rows of the dataset:
(A CSV copy of the data from Kaggle can be found in the data folder of this repository.)
Since Venezuela is one of the largest producers of the Criollo bean, which is considered a delicacy, we set out to answer the following question using the chocolate bar ratings dataset:
Do chocolate bars made from beans grown in Venezuela have a different average rating than bars made from beans grown elsewhere?
You can reproduce our analysis in one of three ways with the following steps:
- Clone/download this repository and, using the command line, navigate to the root of this project.
- Run the following command in bash (replacing PATH_ON_YOUR_COMPUTER with the absolute path to the root of this project on your computer):
docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/choc_analysis rachelkriggs/dsci_522-chocolate_ratings_analysis make -C '/home/choc_analysis' all
- To clean up the analysis:
docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/choc_analysis rachelkriggs/dsci_522-chocolate_ratings_analysis make -C '/home/choc_analysis' clean
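If you would rather not type the absolute path by hand, the two Docker commands above can be sketched as a small shell helper that fills it in for you. This is only an illustration: the function name `choc_docker_cmd` and the variable `CHOC_IMAGE` are hypothetical and not part of this repository, and the helper assumes you run it from the root of the project. It prints the command rather than running it, so you can check it first.

```shell
# Hypothetical helper (not part of this repository): prints the Docker command
# from the steps above with the project path filled in.
CHOC_IMAGE="rachelkriggs/dsci_522-chocolate_ratings_analysis"

choc_docker_cmd() {
  # $1: make target, "all" or "clean" (defaults to "all")
  # $2: absolute path to the project root (defaults to the current directory)
  target="${1:-all}"
  project_root="${2:-$PWD}"
  # The host path is quoted so that paths containing spaces survive.
  printf "docker run --rm -v \"%s\":/home/choc_analysis %s make -C '/home/choc_analysis' %s\n" \
    "$project_root" "$CHOC_IMAGE" "$target"
}

# Show the command that would run the full analysis from the current directory.
choc_docker_cmd all
```

Once the printed command looks right, copy it into your terminal, or run it directly with `eval "$(choc_docker_cmd all)"`. Pass `clean` instead of `all` to get the clean-up command.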
Note that using Make to run our analysis is more straightforward than option #3 below, which requires running multiple scripts, and is therefore recommended.
- Clone/download this repository and, using the command line, navigate to the root of this project.
- Run the following command in bash:
make all
- To clean up the analysis:
make clean
- Clone/download this repository and, using the command line, navigate to the root of this project.
- Run the following commands in bash, in the order listed:
Rscript src/01_load_choc_data.R data/flavors_of_cacao.csv data/cleaned_choc_data.csv
Rscript src/02_viz_choc_data.R data/cleaned_choc_data.csv results/choc_data_viz.png
Rscript src/03_analyze_choc_data.R data/cleaned_choc_data.csv results/summarized_choc_data.csv
Rscript src/04_analyze_result_choc_data.R data/cleaned_choc_data.csv results/choc_ratings_analysis_viz.png
Rscript -e "rmarkdown::render('doc/Report.Rmd')"
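Because each script depends on the output of the one before it, it can help to stop as soon as one step fails. A minimal sketch of that idea is shown here, chaining the same five commands with `&&`. The function name `run_pipeline` and the `RSCRIPT` override variable are assumptions made for this illustration and are not part of the repository.

```shell
# Hypothetical wrapper (not part of this repository): runs the five steps
# listed above in order and stops at the first one that fails.
# RSCRIPT is an assumed override hook; it defaults to the real Rscript.
run_pipeline() {
  rscript="${RSCRIPT:-Rscript}"
  "$rscript" src/01_load_choc_data.R data/flavors_of_cacao.csv data/cleaned_choc_data.csv &&
  "$rscript" src/02_viz_choc_data.R data/cleaned_choc_data.csv results/choc_data_viz.png &&
  "$rscript" src/03_analyze_choc_data.R data/cleaned_choc_data.csv results/summarized_choc_data.csv &&
  "$rscript" src/04_analyze_result_choc_data.R data/cleaned_choc_data.csv results/choc_ratings_analysis_viz.png &&
  "$rscript" -e "rmarkdown::render('doc/Report.Rmd')"
}
```

After pasting the function into your shell, call `run_pipeline` from the root of the project. Setting `RSCRIPT=echo` prints one line per step instead of running R, which is a quick way to check the order.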
The flowchart below shows the order in which the scripts listed in Usage are run, along with the input file(s) each step needs and the output file(s) it produces.
The report for this analysis can be viewed here.
- R & R libraries (R version 3.5.1):
tidyverse_1.2.1
knitr_1.20
here_0.1
infer_0.3.1
dplyr_0.7.7
ggplot2_3.0.0