# Exploratory data analysis of Coffee Quality Dataset 

Authors: Arlin Cherian, Kristin Bunyan, Michelle Wang, Berkay Bulut

***

## Summary of the data set 

This project uses the *[Coffee Quality Dataset](https://github.com/jldbc/coffee-quality-database)*, collected by the Coffee Quality Institute in January 2018. The data was retrieved from tidytuesday, courtesy of James LeDoux, a Data Scientist at Buzzfeed. It covers Arabica coffee beans from across the world, professionally rated on a 0-100 scale based on factors such as acidity, sweetness, fragrance and balance. The dataset also contains information about the country of origin of the coffee beans, harvesting and grading dates, colour of the beans, defects, and processing and packaging details. There are 1311 observations and 43 features (18 numeric and 25 categorical). Some features have missing values, which will be preprocessed appropriately if those features are included in the analysis model. There are no missing values in the target column, `total_cup_points`. According to the dataset description, `total_cup_points` is a rating on a 0-100 point scale, with a mean of 82.1 points, a minimum of 0.0 and a maximum of 90.6 points.

Table 1. Summary statistics of total cup points (rating on a 0-100 point scale)

| Count  | Mean   | Std  |  Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|
| 1339.0  | 82.1  | 3.5  | 0.0  |81.0 | 82.5 | 83.6 | 90.6 |
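
Statistics like those in Table 1 can be computed in a few lines of Python. The following is a minimal sketch using only the standard library. The function name `summarize` and the sample scores are illustrative assumptions, not the project's actual code or data. It uses the sample standard deviation and linear interpolation between sorted values for the quartiles.

```python
import statistics


def summarize(values):
    """Return count, mean, std, min, quartiles and max for a list of numbers."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Made-up scores for illustration only; not taken from the dataset
scores = [80.0, 82.0, 84.0, 86.0, 88.0]
summary = summarize(scores)
print(summary)
```

Applied to the `total_cup_points` column, the same function would return the eight values reported in the table.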



Table 2. Unique values of country of origin, region and colour of coffee beans

| Variable | Count   |
| --- | --- |
| Regions  | 356  | 
| Countries | 36 |
| Colour of coffee beans | 3 |

## Splitting of train and test data sets 

Before we begin visualizing the data, we will split the dataset into 80% training data and 20% test data. The test data will not be used for the exploratory analysis, and will only be used for testing the finalized model at the end of the project. Below we list the number of observations for each split.

Table 3. Partition of the data into train and test sets

| Data Split | Rows | Columns | 
|---| ---| --- |
| Train  | 1048 | 43  | 
| Test | 263 | 43 |
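
An 80/20 split of this kind can be sketched as follows. This is a hedged illustration using only the standard library, not necessarily the procedure used in the project: the function name `train_test_split_rows`, the seed value and the choice to round the test share up are all assumptions. With 1311 rows, rounding the 20% share up gives 263 test rows and 1048 training rows, which agrees with Table 3.

```python
import math
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=123):
    """Shuffle row positions reproducibly, then split into train and test lists."""
    n_test = math.ceil(len(rows) * test_fraction)  # assumption: round the test share up
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Stand-in for the 1311 observations: each "row" is just its row number
rows = list(range(1311))
train, test = train_test_split_rows(rows)
print(len(train), len(test))  # 1048 263
```

Fixing the seed matters here because the test rows must stay untouched throughout the exploratory analysis; a different shuffle on each run would leak test observations into the training set.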


## Exploratory data analysis on training set

We wanted to determine which features might be most important for predicting the coffee quality rating. For the preliminary EDA, we looked at the categorical features and selected those that intuitively seem likely to contribute to the model: `country_of_origin`, `region` and `color` of the coffee beans. There are 36 unique countries, 3 unique colours of coffee beans and 343 unique regions. For this preliminary analysis we only visualize the relationship between country of origin, colour and the target variable, because the region feature will require some preprocessing due to its large number of unique values.

Our first exploratory question was how total ratings differ between the colours of coffee beans. As shown in Figure 1, the average coffee quality rating did not differ much by bean colour. Green coffee beans were rated slightly lower than the blue-green and bluish-green groups.

In Figure 2, we looked at how coffee quality ratings differ by the beans' country of origin. The average rating varied between countries: the highest average rating was for beans from Ethiopia and the lowest was for beans from Haiti. We could also explore the relationship between other categorical features and the total coffee quality rating after some preprocessing and cleaning of the dataset.
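
The per-group averages behind comparisons like these can be sketched as below. This is a minimal standard-library illustration; the helper `mean_rating_by`, the placeholder country names and the ratings are all hypothetical and are not values from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_rating_by(records, key, target="total_cup_points"):
    """Average the target within each category of `key`, highest average first."""
    groups = defaultdict(list)
    for record in records:
        if record.get(key) is not None:  # skip rows where the category is missing
            groups[record[key]].append(record[target])
    averages = {group: mean(values) for group, values in groups.items()}
    return dict(sorted(averages.items(), key=lambda item: item[1], reverse=True))


# Made-up records for illustration only
records = [
    {"country_of_origin": "Country B", "total_cup_points": 80.0},
    {"country_of_origin": "Country A", "total_cup_points": 84.0},
    {"country_of_origin": "Country B", "total_cup_points": 82.0},
    {"country_of_origin": "Country A", "total_cup_points": 86.0},
    {"country_of_origin": None, "total_cup_points": 83.0},
]
by_country = mean_rating_by(records, "country_of_origin")
print(by_country)  # {'Country A': 85.0, 'Country B': 81.0}
```

Passing `"color"` as the key would give the per-colour averages in the same way.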

We then explored the relationship between some numerical features in the data set and the target coffee rating, as shown in Figure 3. The features aroma, flavour, aftertaste, acidity, body, balance, uniformity and sweetness were excluded from this EDA because they are summed to produce the `total_cup_points` target variable. Including these features in the model may lead to a false interpretation, so we have decided to leave them out of the analysis.
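
One way to carry out this step is to drop the component scores and then compute the Pearson correlation of each remaining numerical feature with the target. The sketch below uses only the standard library. The column names in `COMPONENT_SCORES` and in the toy data (`altitude_mean_meters`, `moisture`) are assumptions about how the columns are spelled, and all numbers are made up for illustration.

```python
import math

# Assumed column names for the scores that are summed into the target
COMPONENT_SCORES = {
    "aroma", "flavor", "aftertaste", "acidity",
    "body", "balance", "uniformity", "sweetness",
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / (spread_x * spread_y)


def correlations_with_target(columns, target="total_cup_points"):
    """Correlate every non-component numeric column with the target."""
    y = columns[target]
    return {
        name: pearson_r(values, y)
        for name, values in columns.items()
        if name != target and name not in COMPONENT_SCORES
    }


# Toy columns for illustration only
columns = {
    "total_cup_points": [80.0, 82.0, 84.0, 86.0],
    "aroma": [7.0, 7.5, 8.0, 8.5],  # component score, excluded below
    "altitude_mean_meters": [1000.0, 1200.0, 1400.0, 1600.0],
    "moisture": [0.12, 0.11, 0.10, 0.09],
}
correlations = correlations_with_target(columns)
print(sorted(correlations))  # ['altitude_mean_meters', 'moisture']
```

Excluding the component scores up front keeps them from dominating the correlation plot simply because they are part of the target's definition.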

Figure 1. Coffee quality rating by coffee bean colour

![](coffeebean_color_rating.svg)

Figure 2. Coffee quality rating by coffee bean country of origin

![](coffeebean_color_rating.svg)

Figure 3. Coffee quality rating by possible numerical predictors

![](numerical_features_corr.svg)