# Data visualization tutorial #

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Crop Yields Dataset

#### Plot quantitative variables

In [None]:
# Read the data
crops_all = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/key_crop_yields.csv')
# Keep only the United States rows
crops_us = crops_all[crops_all['Entity'] == 'United States']
# Keep only the Year, Wheat, and Rice columns
crops = crops_us.iloc[:,[2,3,4]]
# Shorten the column names
crops = crops.rename(columns={'Wheat (tonnes per hectare)':'Wheat', 'Rice (tonnes per hectare)':'Rice'})
crops.head()

**Description of dataset**: Crop yields in the USA, one row per year

- Year: Year of the observation
- Wheat: Wheat yield in tonnes per hectare
- Rice: Rice yield in tonnes per hectare

Which variables are quantitative? What types of plots correspond with quantitative variables?
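
If you're wondering what a histogram actually computes, here is a minimal sketch in plain Python. It assumes equal-width bins and uses made-up toy values (not taken from the crop dataset); `bin_counts` and `toy_yields` are hypothetical names used only for illustration.

```python
# Toy yields in tonnes per hectare (made up for illustration, not from the dataset)
toy_yields = [1.0, 1.2, 1.5, 2.1, 2.3, 2.4, 2.8, 3.2, 3.6, 4.0]

def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of a bin of its own
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

counts = bin_counts(toy_yields, n_bins=3)
print(counts)  # [3, 4, 3]
```

Each count becomes the height of one bar. Changing `n_bins` changes the shape you see, which is why it is worth trying a few bin settings on real data.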

In [None]:
### TODO: Fill in the following code cell wherever you see an ellipsis (...)
values = crops[...] # Wheat values
g = sns.displot(x=values, kde=False, rug=False) # play with the kde and rug arguments!

In [None]:
ax = sns.boxplot(x=..., data=...); # TODO: boxplot

In [None]:
ax = sns.violinplot(x=..., data=...); # TODO: violinplot

Okay, now let's try something more insightful. Try visualizing the change in the wheat yield over time.

In [None]:
### TODO
ax = sns.scatterplot(x=..., y=..., data=crops);

What conclusions can you draw from this scatterplot?

Seems like there's a pretty clear trend. Now let's add a lineplot too.
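
A lineplot only connects the observations, so it doesn't put a number on the trend. One possible way to do that is the least-squares slope, sketched below in plain Python. The data is a made-up toy series, and `ls_slope` is a hypothetical helper rather than anything seaborn provides.

```python
from statistics import fmean

# Toy (year, yield) series, made up for illustration
toy_years = [2000, 2001, 2002, 2003, 2004]
toy_wheat = [2.0, 2.1, 2.2, 2.3, 2.4]

def ls_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

slope = ls_slope(toy_years, toy_wheat)
print(f"{slope:.3f} tonnes per hectare per year")
```

A positive slope means the yield tends to rise over time. The size of the slope tells you roughly how much it rises per year.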

In [None]:
### TODO
ax = ... # scatterplot
ax = sns.lineplot(x=..., y=..., data=crops);

Look at that! Our first visualization!

Now let's explore the differences between Wheat and Rice.

**Try-it-yourself!**

Pick a type of plot which would effectively visualize the distribution of rice.

In [None]:
# TODO: distribution of rice
ax = ... 

In [None]:
### TODO
ax = ... # scatterplot of Rice
ax = ... # lineplot of Rice

This is useful, but it's annoying to scroll. Let's merge the visualizations together.

In [None]:
### TODO
sns.scatterplot(x='Year', y=..., data=crops, label='Wheat');
sns.lineplot(x='Year', y='Wheat', data=crops);
... # scatterplot of Rice (don't forget the label)
... # lineplot of Rice

It looks like Wheat and Rice yields grow together. Let's plot them against each other in a scatterplot for more information.

In [None]:
### TODO
ax = ...

We can also view the scatterplot together with each variable's histogram in a single hybrid plot.

In [None]:
ax = sns.jointplot(x='Wheat', y='Rice', data=crops)

There appears to be a positive correlation between the two variables.
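
To go from "appears to be" to a number, you could compute the Pearson correlation coefficient. The sketch below does this by hand in plain Python on made-up toy values; `pearson_r`, `toy_wheat`, and `toy_rice` are hypothetical names for illustration.

```python
from math import sqrt
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy yields (made up for illustration, not from the dataset)
toy_wheat = [2.0, 2.2, 2.5, 2.7, 3.0]
toy_rice = [5.0, 5.5, 5.4, 6.2, 6.8]

r = pearson_r(toy_wheat, toy_rice)
print(round(r, 2))
```

Values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little linear relationship. On the notebook's data, `crops['Wheat'].corr(crops['Rice'])` reports the same statistic.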

Let's try one more type of plot before exploring a more qualitative dataset. This time we'll look only at Wheat and Rice yields in 2000, but across all countries.

In [None]:
crops_2000 = crops_all[crops_all['Year']==2000]
crops_2000 = crops_2000.iloc[:,[0,2,3,4]].dropna()
crops_2000 = crops_2000.rename(columns={'Entity':'Country','Wheat (tonnes per hectare)':'Wheat', 'Rice (tonnes per hectare)':'Rice'})
crops_2000.head()

In [None]:
### TODO
plt.hist(x=crops_2000[...], alpha=0.5, label=..., color='Orange'); # histogram of Wheat values
... # histogram of Rice values
plt.legend();

Great! We now see overlapping histograms of global wheat and rice yields in 2000.

### Coffee Ratings Dataset

In [None]:
# Read data in as a pandas dataframe
coffee = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')
# Removing extra columns
coffee = coffee[['species', 'country_of_origin', 'number_of_bags','harvest_year',
                'acidity', 'balance', 'aroma', 'flavor', 'aftertaste', 'sweetness', 'uniformity', 'color']]
# Pre-processing dataset
coffee = coffee.rename(columns={
    'country_of_origin':'country',
    'number_of_bags':'num bags',
    'harvest_year':'year',
}).dropna()
coffee.head()

**Description of dataset**: Coffee ratings collected from the Coffee Quality Institute's review pages in January 2018 by BuzzFeed data scientist James LeDoux.

- species: Species of coffee bean (arabica or robusta)
- country: Where the bean came from
- num bags: Number of bags tested
- year: When the beans were harvested (year)
- color: Color of bean

The remaining columns hold the coffee's 'grade' for each attribute (acidity, balance, aroma, flavor, aftertaste, sweetness, uniformity).

#### Plot qualitative variables

Let's make a bar graph of the top 5 countries of origin for the coffee in our dataset.
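
A bar graph of a qualitative variable is just a picture of category counts. As a rough sketch of that counting step, here is the same idea in plain Python with `collections.Counter`. The country list is a made-up toy sample, not the real dataset.

```python
from collections import Counter

# Toy country-of-origin values (made up for illustration)
toy_countries = ["Mexico", "Colombia", "Mexico", "Brazil", "Colombia", "Mexico", "Kenya"]

counts = Counter(toy_countries)
top_two = counts.most_common(2)  # the two most frequent categories, largest first
print(top_two)  # [('Mexico', 3), ('Colombia', 2)]
```

pandas' `value_counts()` does the same tallying on a column and, by default, already returns the counts in descending order.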

In [None]:
top_countries = coffee['country'].value_counts().sort_values(ascending=False)
top_countries[:5]

In [None]:
### TODO
__________[:5].plot.bar();

Create one more bar graph to show the counts of each type of bean species in our dataset.

In [None]:
ax = sns.countplot(x='species', data=...) ### TODO: bar graph of bean species

In [None]:
bean_species = coffee['species'].value_counts().sort_values()
bean_species

Clearly, our graph matches what the data tells us in tabular form. Robusta coffee must be pretty rare in this dataset!

### Plot relationships ###

Let's look at the distributions of acidity for each bean color.
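
A grouped boxplot draws one box per category, and each box is built from a few summary numbers. The sketch below computes those numbers per group in plain Python. The records are made-up toy values, `five_number_summary` is a hypothetical helper, and the quartiles use `statistics.quantiles` with `method="inclusive"` (linear interpolation between data points), which is an assumption about the quartile convention.

```python
from collections import defaultdict
from statistics import quantiles

# Toy (color, acidity) records, made up for illustration
toy_records = [
    ("Green", 7.5), ("Green", 7.8), ("Green", 7.2), ("Green", 7.6), ("Green", 7.4),
    ("Blue-Green", 7.9), ("Blue-Green", 7.7), ("Blue-Green", 8.0),
    ("Blue-Green", 7.6), ("Blue-Green", 7.8),
]

# Collect the acidity values belonging to each color
by_color = defaultdict(list)
for color, acidity in toy_records:
    by_color[color].append(acidity)

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

summaries = {color: five_number_summary(vals) for color, vals in by_color.items()}
for color, summary in summaries.items():
    print(color, summary)
```

The box spans Q1 to Q3 with a line at the median. Note that by default the whiskers stop at the most extreme point within 1.5 times the interquartile range rather than at the minimum and maximum, and anything beyond them is drawn as an individual outlier point.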

In [None]:
ax = coffee.boxplot(column=..., by='color'); # TODO: plot acidity distributions for each bean color
plt.ylabel('acidity distribution'); # TODO: add label

**Try-it-yourself!**

Plot a boxplot for each bean color, using a coffee grade of your choice!

Try adding labels :)

In [None]:
ax = ... # TODO: boxplot by color for coffee grade metric
plt.ylabel(...);

### Adding labels, titles, legends ###

In [None]:
top_countries[:5] # recall the top_countries variable

In [None]:
### TODO: add axis labels and a title
ax = ... # bar plot of top 8 countries
ax.set_xlabel(...);
ax.set_ylabel('counts');
ax.set_title(...);

In [None]:
### TODO: play with font size, rotate x-axis labels
ax = ____________________________.bar(rot=45)
ax.set_title(..., fontsize=20);
ax.set_ylabel(..., fontsize=16);
ax.set_xlabel(..., fontsize=16);

Let's try grouping different countries' coffee grades together in a new dataset so we can see how they compare to each other.
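
Grouping here means: split the rows by country, then average each grade within each group. As a minimal sketch of that split-then-average idea, here it is in plain Python on made-up toy rows. `group_means` is a hypothetical helper, and only numeric columns can be averaged this way.

```python
from collections import defaultdict
from statistics import fmean

# Toy rows (made up for illustration, not from the dataset)
toy_rows = [
    {"country": "Mexico", "aroma": 7.5, "balance": 7.4},
    {"country": "Mexico", "aroma": 7.7, "balance": 7.6},
    {"country": "Kenya", "aroma": 8.0, "balance": 7.9},
]

def group_means(rows, key, columns):
    """Split rows by `key`, then average each of `columns` within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {
        name: {col: fmean([r[col] for r in members]) for col in columns}
        for name, members in groups.items()
    }

means = group_means(toy_rows, "country", ["aroma", "balance"])
for country, grades in means.items():
    print(country, grades)
```

The result has one entry per country, which is the shape you want for a grouped bar chart: countries along the x-axis and one bar per grade.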

In [None]:
coffee_grades = coffee.groupby('country').mean(numeric_only=True)
coffee_grades.head()

In [None]:
axes = coffee_grades[['sweetness', 'balance','aroma']].head().plot.bar(rot=0, subplots=False)

**Try-it-yourself!**

Pick a type of plot which would effectively visualize different coffee grades for different countries. Don't forget your labels!

In [None]:
### TODO
ax = ...