<a href="https://colab.research.google.com/github/acedesci/scanalytics/blob/master/EN/S06_Descriptive_Analytics/S6_AfterClass_Exercises_Descriptive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# S6 - AfterClass Exercise: Descriptive Analytics and Visualization
---
## Instructions:
Most of the exercises presented here allow you to practice basic Python programming for some applications in Operations Management and Logistics.

For each exercise, you have a code cell for the response underneath it, where you should write your answer between the lines containing `### start your code here ###` and `### end your code here ###`. Your code can contain one or more lines and you can execute this cell in order to complete the exercise. To execute the cell, you can type `Shift+Enter` or press the play button in the toolbar above. Your results will appear right below this response cell.

NOTE: Please pay attention to the name of the output variable requested in each question. You must use exactly that variable name so that the result can be printed out correctly.

# Analyzing and Visualizing Crops Statistics in the Americas
In this notebook, you will create visualizations of crop statistics in the Americas. The data is available in the file `Production_Crops_E_Americas.csv`, adapted from data provided by the Food and Agriculture Organization of the United Nations (FAO). The original files can be found at [this page](https://data.world/agriculture/crop-production).

This is a description of the columns of our adapted data:

| VARIABLE NAME | DESCRIPTION | 
|:----|:----|
|area_code| numeric value representing the area|
|area| name of the area (e.g., Argentina, Canada, Chile, Colombia)|
|item_code| numeric value representing the item |
|item| name of the product (e.g., Bananas, Beans, Cassava)|
|element_code|numeric value representing the element|
|element|specification of the data (e.g., Area Harvested, Yield, Production)|
|unit| unit of measure (e.g., ha - *hectare*, hg/ha - *hectogram per hectare*, and tonnes)|
|Y2000| value recorded for the year 2000|
|...|...|
|Y2014| value recorded for the year 2014|

## Data preparation:  Importing libraries and Data Set

In the code cell below, import the `pandas` library under the alias `pd`; the library `seaborn` under the alias `sns`; and the `matplotlib.pyplot` library under the alias `plt`. 

Note that we will be using the `%matplotlib inline` magic command to make sure our graphics are displayed in our Jupyter Notebook. 

**IMPORTANT:** You only need to execute the code cells below to preprocess the data until the DataFrame `df_transformed` is generated (prior to Exercise 1).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Import the data file `'Production_Crops_E_Americas.csv'` into a `DataFrame` named `df_crops`. 

**Note:** you can use the `pandas.read_csv()` function and parameter `encoding` set as `'latin-1'` to avoid errors due to special characters in the data file.

In [None]:
url = 'https://raw.githubusercontent.com/acedesci/scanalytics/master/EN/S06_Descriptive_Analytics/Production_Crops_E_Americas.csv'
df_crops = pd.read_csv(url, encoding='latin-1')  # reading data file into a DataFrame
df_crops.head()
    
# replacing missing values with 0
df_crops.fillna(0, inplace=True)  

Since the data is in a pivot structure, we transform it using the function `pd.melt(...)` to unpivot it and rearrange the years into one column. See [link](https://pandas.pydata.org/docs/reference/api/pandas.melt.html).

In [None]:
df_unpivot = pd.melt(df_crops, id_vars=['item', 'area', 'element'], value_vars=['Y20%02d'%(i) for i in range(15)], 
        var_name='year', value_name='value') 
df_unpivot
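
To see what `pd.melt(...)` does, here is a minimal sketch on a tiny made-up table. The DataFrame `wide_toy` and its values are hypothetical and only mimic the layout of `df_crops`; they are not taken from the data file.

```python
import pandas as pd

# Hypothetical wide table: one column per year, as in df_crops
wide_toy = pd.DataFrame({
    "item": ["Grapes", "Grapes"],
    "area": ["Chile", "Canada"],
    "element": ["Production", "Production"],
    "Y2000": [10.0, 1.0],
    "Y2001": [12.0, 2.0],
})

# Unpivot: the year columns become rows, with the year label in 'year'
# and the corresponding number in 'value'
long_toy = pd.melt(
    wide_toy,
    id_vars=["item", "area", "element"],
    value_vars=["Y2000", "Y2001"],
    var_name="year",
    value_name="value",
)
print(long_toy)
```

Each original row produces one output row per year column, so two rows and two year columns give four rows.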

We then pivot the data so that each element (e.g., Area Harvested, Yield, Production) is placed in its own column. 

In [None]:
df_transformed = df_unpivot.pivot(index=["item", "area", "year"], columns='element', values='value').reset_index()
df_transformed
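
The pivot step can be illustrated the same way. The sketch below uses a hypothetical long table `long_toy2` with made-up values; it assumes a pandas version that accepts a list for `index` in `DataFrame.pivot` (as the cell above does).

```python
import pandas as pd

# Hypothetical long table: one row per (item, area, year, element)
long_toy2 = pd.DataFrame({
    "item": ["Grapes"] * 4,
    "area": ["Chile"] * 4,
    "year": ["Y2000", "Y2000", "Y2001", "Y2001"],
    "element": ["Production", "Yield", "Production", "Yield"],
    "value": [10.0, 5.0, 12.0, 6.0],
})

# Spread the entries of 'element' into separate columns,
# keeping one row per (item, area, year)
tidy_toy = long_toy2.pivot(
    index=["item", "area", "year"], columns="element", values="value"
).reset_index()
print(tidy_toy)
```

The result has one row per year, with `Production` and `Yield` side by side, which is the shape the exercises below rely on.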

## Exercise 1:  Visualizations
**a)** Let's explore the production of some products in Canada from 2000 to 2014. For that, we first separate the data of interest. More specifically, we are interested only in the rows of `df_transformed` that satisfy the following conditions:

*   `area = 'Canada'` 
*   list of items (products) to analyze: `'Blueberries'`, `'Raspberries'`, and `'Strawberries'`

Please store the result in a new DataFrame object named `df_canada`.

In [None]:
### start your code here ###

### end your code here ###
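
As a reminder of how row filtering on two conditions can be written in pandas, here is a minimal sketch on a made-up DataFrame. The names `toy_filter`, `wanted_items`, and `subset` and all values are hypothetical; this is an illustration of the technique, not the expected answer.

```python
import pandas as pd

# Hypothetical data, unrelated to the exercise file
toy_filter = pd.DataFrame({
    "item": ["Apples", "Pears", "Apples", "Plums"],
    "area": ["Peru", "Peru", "Chile", "Peru"],
    "Production": [5.0, 3.0, 7.0, 1.0],
})

wanted_items = ["Apples", "Plums"]

# Combine an equality test and a membership test with '&';
# each condition needs its own parentheses
mask = (toy_filter["area"] == "Peru") & (toy_filter["item"].isin(wanted_items))
subset = toy_filter[mask].copy()
print(subset)
```

Calling `.copy()` is optional, but it avoids pandas warnings if the filtered DataFrame is modified later.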

**b)** Create a line graph from the DataFrame `df_canada` to show the production of `'Blueberries'`, `'Raspberries'` and `'Strawberries'` in Canada. Configure the aesthetics of your graph as follows.

* Set the size of the figure to `12, 6`
* Give the title `Annual Production in Canada` to the graph
* Set the labels of the `x`-axis and `y`-axis as `Years` and  `Tonnes`, respectively
* Each item should appear as a separate line in the graph using arguments `hue = "item"`,  `style = "item"` and `markers = True`
    
**Hint**:  

*   Use the function `seaborn.lineplot()` to draw a line plot with several semantic groupings (e.g., to differentiate items). Check [this page](https://seaborn.pydata.org/generated/seaborn.lineplot.html) for more information about this function. 
*   Use functions `plt.title()` and `plt.figure()` to set the title and size of the graph, respectively. 
 


In [None]:
### start your code here ###

### end your code here ###

**c)** Create a scatter plot from the DataFrame `df_canada` to show the relationship between the harvested area and the production of the selected products in Canada. Configure the aesthetics of your graph as follows.

* Set the size of the figure to `12, 6`
* Give a meaningful title to the graph
* Use the style `'white'`
* Set the labels of the `x`-axis and `y`-axis as `Tonnes` and  `Hectares`, respectively
* Each product should be differentiated using the argument `hue='item'`
    
**Hint:** 
* Use the function `seaborn.scatterplot()` to draw a scatter plot with several semantic groupings (e.g., to differentiate items). Check [this page](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) for more information about this function.  

In [None]:
### start your code here ### 

### end your code here ###
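
For reference, here is a minimal sketch of `seaborn.scatterplot()` combined with `seaborn.set_style()`. The DataFrame `toy_points`, its column names and its values are hypothetical; the sketch only illustrates the calls mentioned above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: production and harvested area for two items
toy_points = pd.DataFrame({
    "item": ["Apples"] * 3 + ["Pears"] * 3,
    "Production": [5.0, 6.0, 7.0, 3.0, 2.5, 4.0],
    "Area harvested": [50.0, 55.0, 61.0, 40.0, 38.0, 47.0],
})

# Set the style before creating the figure so that it takes effect
sns.set_style("white")
plt.figure(figsize=(12, 6))

# One point per row, colored by item
scatter_ax = sns.scatterplot(
    data=toy_points, x="Production", y="Area harvested", hue="item"
)
plt.title("Toy harvested area versus production")
plt.xlabel("Tonnes")
plt.ylabel("Hectares")
```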

## Exercise 2: Clustering algorithms

**a)** Please filter and transform the data from the transformed DataFrame `df_transformed` (which contains the data of all the countries) using the following steps

* Step 1: Filter only the item: `'Grapes'`
* Step 2: Use the function `groupby` to summarize the statistics by country (`area`) for the following variables:
  *   Average 'Production' per year
  *   Average 'Yield' per year
* Step 3: Remove rows with `NaN` (Hint: you can use `df = df.dropna(axis='rows')`. See [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).)
* Step 4: Normalize the two columns above using z-score transformation and put them in the new columns with prefix `z_`





In [None]:
### start your code here ### 

### end your code here ###
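
The grouping and z-score steps can be sketched on a small made-up table as follows. The DataFrame `toy_stats` and its values are hypothetical. The sketch also assumes the sample standard deviation (the pandas default, `ddof=1`) for the z-score; the lecture may use the population version instead.

```python
import pandas as pd

# Hypothetical data: two yearly observations for each of three areas
toy_stats = pd.DataFrame({
    "area": ["A", "A", "B", "B", "C", "C"],
    "Production": [10.0, 20.0, 30.0, 50.0, 70.0, 90.0],
    "Yield": [1.0, 3.0, 2.0, 4.0, 6.0, 8.0],
})

# Average each variable over the years, one row per area
summary = toy_stats.groupby("area")[["Production", "Yield"]].mean()
summary = summary.dropna(axis="rows")

# z-score: subtract the column mean and divide by the column standard deviation
for col in ["Production", "Yield"]:
    summary["z_" + col] = (summary[col] - summary[col].mean()) / summary[col].std()

print(summary)
```

After this transformation each `z_` column has mean 0 and standard deviation 1, so both variables contribute on a comparable scale when distances are computed.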

**Note:** For the following two questions, you can reuse the clustering code from the lecture and adapt it to this data.

**b)** Apply the K-Means method to cluster the countries based on the recently created DataFrame, using the normalized versions of the variables `['Production', 'Yield']`. Please compare the results for `K = 2, 3, ..., 9` and recommend the best number of clusters. 

In [None]:
### start your code here ### 

### end your code here ###
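
One possible way to compare several values of `K` is sketched below on synthetic points. It uses the silhouette score as the comparison criterion, which is an assumption: the lecture may rely on a different criterion, such as the elbow method on the inertia. The array `X_toy` is randomly generated and unrelated to the crop data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D points scattered around three hypothetical centers
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_toy = np.vstack([c + 0.3 * rng.standard_normal((15, 2)) for c in centers])

# Fit K-Means for each K and record the average silhouette score
scores = {}
for k in range(2, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    cluster_labels = model.fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, cluster_labels)

# Under this criterion, a higher score indicates better separated clusters
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"K={k}: silhouette={s:.3f}")
print("K with the highest silhouette score:", best_k)
```

The silhouette score requires at least two clusters and fewer clusters than data points, which is why the loop starts at `K = 2`.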

**c)** Apply the (hierarchical) agglomerative clustering method to cluster the countries based on the recently created DataFrame, using the normalized versions of the variables `['Production', 'Yield']` with `K = 2, 3, 4`. Please then explain what the resulting hierarchy of clusters looks like. More specifically, which clusters are merged into one cluster at each step, starting from `K = 4`?

In [None]:
### start your code here ### 

### end your code here ###
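
To trace which clusters are merged when `K` decreases, one option is to compare the labels obtained for consecutive values of `K`. The sketch below does this on synthetic points, assuming Ward linkage (the scikit-learn default); the lecture code may use another linkage. The helper `parent_of` and the array `X_hier` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic 2-D points scattered around four hypothetical centers
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [1.5, 0.0], [6.0, 6.0], [6.0, 9.0]])
X_hier = np.vstack([c + 0.2 * rng.standard_normal((10, 2)) for c in centers])

# Cut the same hierarchy at K = 4, 3 and 2
hier_labels = {}
for k in (4, 3, 2):
    model = AgglomerativeClustering(n_clusters=k, linkage="ward")
    hier_labels[k] = model.fit_predict(X_hier)

def parent_of(child_labels, parent_labels):
    """Map each finer cluster to the coarser cluster(s) containing its points."""
    mapping = {}
    for child in np.unique(child_labels):
        parents = np.unique(parent_labels[child_labels == child])
        mapping[int(child)] = [int(p) for p in parents]
    return mapping

print("K=4 -> K=3:", parent_of(hier_labels[4], hier_labels[3]))
print("K=3 -> K=2:", parent_of(hier_labels[3], hier_labels[2]))
```

Two finer clusters that map to the same coarser cluster are the ones merged at that step. Cluster label numbers are arbitrary and can change between values of `K`, so the mapping, not the raw labels, should be read.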