# Matplotlib

Matplotlib is a Python 2D plotting library.

The [matplotlib.org](http://matplotlib.org) project website is the primary online resource for the library's documentation. It contains:
* [Extensive documentation of PyPlot](https://matplotlib.org/api/pyplot_api.html),
* [Example galleries](https://matplotlib.org/gallery/index.html), 
* [FAQs](http://matplotlib.org/faq/index.html), 
* [API documentation](http://matplotlib.org/api/index.html),
* [Tutorials](https://matplotlib.org/tutorials/index.html).

Matplotlib has multiple "backends" that convert Matplotlib's in-memory representation of your plot into the colorful output you can look at. This is done either by writing files (e.g., PNG, SVG, PDF) that you open with an external tool, or by embedding the plot in the GUI toolkit of your choice (Qt, Tk, Wx, etc.).

To check what backend Matplotlib is currently using:

In [None]:
import matplotlib
import matplotlib.pyplot as plt

print(matplotlib.__version__)
print(matplotlib.get_backend())

If you have problems with plotting in Jupyter Notebook, change your backend to `nbagg` by uncommenting the code below.

You must do it before importing `matplotlib.pyplot`.

In [None]:
# matplotlib.use('nbagg')
print(matplotlib.get_backend())

import matplotlib.pyplot as plt
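To illustrate the file-writing side of backends, here is a minimal sketch that assumes it is run as a standalone script rather than in this notebook. It selects the non-interactive `Agg` backend and renders a figure to PNG bytes in memory. The sample data and the `buffer` and `png_bytes` names are made up for illustration.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: renders images, opens no window
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])  # illustrative data only

# Write the figure as PNG into an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()

print(matplotlib.get_backend())
print(f"Rendered {len(png_bytes)} bytes of PNG data")
plt.close(fig)
```

Passing a filename such as `"plot.png"` to `fig.savefig` would write the same image to disk.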

## Basic Plotting (Line)

### Using Pyplot

In [None]:
import numpy as np

# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

print("First five items of t and s:")
print("t:", t[:5])
print("s:", s[:5])

# Add data / axes to the Figure
plt.plot(t, s)
plt.show()

### Using Axes method

You can see the `XAxis` horizontally and the `YAxis` vertically. Each `Axes` has an `XAxis` and a `YAxis`.

The `Figure` is the top-level container in this hierarchy. It is the overall window/page that everything is drawn on. You can have multiple independent figures, and each `Figure` can contain multiple `Axes`.

A `Subplot` is an `Axes` placed on a regular grid, so the two terms are synonymous in most cases.

In [None]:
# Set up a Figure and give it a light green background (color = (R, G, B, opacity)) so you can see its extent
fig = plt.figure(facecolor=(0, 1, 0, .1))

# Set up Axes on our Figure
ax = fig.add_subplot(111)  # "111" is the 3-digit subplot specification: 1 row, 1 column, first subplot.

ax.set(title='Axes Example',
       ylabel='Y-Axis: s', 
       xlabel='X-Axis: t')

ax.plot(t, s)

plt.show()
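The Figure / Axes / Axis hierarchy described above can also be inspected directly. The following is a small sketch, not part of the original course material: it builds a Figure with two Axes and looks at how the objects reference each other.

```python
import matplotlib.pyplot as plt

fig = plt.figure()
left_ax = fig.add_subplot(121)   # 1 row, 2 columns, first subplot
right_ax = fig.add_subplot(122)  # 1 row, 2 columns, second subplot

# A Figure keeps a list of the Axes it contains
print(len(fig.axes))

# Each Axes owns one XAxis and one YAxis object
print(type(left_ax.xaxis).__name__, type(left_ax.yaxis).__name__)

# Each Axes also knows which Figure it belongs to
print(left_ax.figure is fig)
```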

The Pyplot method may seem better because the code is shorter and simpler, but it is also very implicit, which goes against the Python principle that explicit is better than implicit.

For more complicated plots, it is better to set up the Axes and Figure objects explicitly, especially when working with multiple axes in one figure.

If you're not convinced, take the time to read `The Zen of Python` by running the Easter Egg below.

In [None]:
import this

In any case, we will use the Axes method for the rest of this course.

You can also use `plt.subplots` to get a Figure with one Axes in a single line.

It means this code:

```
fig = plt.figure()
ax = fig.add_subplot(111)
```


Is equivalent to this one:

```
fig, ax = plt.subplots()
```

In [None]:
fig, ax = plt.subplots()

ax.set(title='Axes Props Customization Example',
       ylabel='Y-Axis: s', 
       xlabel='X-Axis: t')

ax.grid(True, linestyle='-.')
ax.tick_params(labelcolor='g', labelsize='medium', width=3)

ax.plot(t, s)

plt.show()

# Pandas

Pandas helps you load data from CSV, JSON, SQL databases and other formats into a `DataFrame`.

Pandas objects also work with most NumPy functions.



In [None]:
import pandas as pd

# Example data
data = {'Year': [1920,1930,1940,1950,1960,1970,1980,1990,2000,2010],
        'Unemployment_Rate': [9.8,12,8,7.2,6.9,7,6.5,6.2,5.5,6.3]
       }
  
df = pd.DataFrame(data, columns=data.keys())
print(df)
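As a quick illustration of how Pandas and NumPy interact, the sketch below applies NumPy functions to a DataFrame column. It uses its own small `rates` DataFrame (a made-up subset of the example data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

rates = pd.DataFrame({"Year": [1990, 2000, 2010],
                      "Unemployment_Rate": [6.2, 5.5, 6.3]})

# NumPy reductions accept a Series directly
mean_rate = np.mean(rates["Unemployment_Rate"])

# Element-wise NumPy functions return a new Series with the same index
log_rates = np.log(rates["Unemployment_Rate"])

# .to_numpy() gives the underlying values as a plain NumPy array
as_array = rates["Unemployment_Rate"].to_numpy()

print(mean_rate)
print(type(log_rates).__name__, type(as_array).__name__)
```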

## Basic plotting using pandas DataFrame

In [None]:
df.plot(x ='Year', y="Unemployment_Rate", kind='line')

In [None]:
df.plot(x ='Year', y="Unemployment_Rate", kind='scatter')
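`df.plot` can also draw on an Axes you created yourself, which combines the Pandas shortcut with the explicit Axes method used earlier. This is a sketch with a small made-up `rates` DataFrame. The `ax` argument is part of the Pandas plotting API, and the call returns the Axes it drew on.

```python
import matplotlib.pyplot as plt
import pandas as pd

rates = pd.DataFrame({"Year": [1990, 2000, 2010],
                      "Unemployment_Rate": [6.2, 5.5, 6.3]})

fig, ax = plt.subplots()

# Ask Pandas to draw on our Axes instead of creating a new Figure
returned_ax = rates.plot(x="Year", y="Unemployment_Rate", kind="line", ax=ax)

# The Axes can then be customized as usual
ax.set(title="Unemployment rate", ylabel="Unemployment rate")
ax.grid(True, linestyle="-.")
```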

## Bar Chart

We'll make up a simple example before using data from our DataFrame.

In [None]:
# Example values.

data = {'Animal': ["Fox", "Otter", "Red panda", "Squirrel", "Goat", "Hyena"],
        "Votes": [47, 36, 30, 27, 14, 2]
       }

To make a bar chart using Matplotlib, use `plt.bar(x, y)`, or `ax.bar(x, y)` with the Axes method.

In [None]:
fig, ax = plt.subplots()

ax.set(title="Survey results:\n" \
              "If it were possible, which animal would you like to have as a pet?",
       ylabel='Percentage of people who voted for it', 
       xlabel='Animal')

ax.bar(data["Animal"], data["Votes"])
plt.show()

You can make a horizontal bar chart using `plt.barh(x, y)`, or `ax.barh(x, y)` with the Axes method.

In [None]:
fig, ax = plt.subplots()

ax.set(title="Survey results:\n" \
              "If it were possible, which animal would you like to have as a pet?",
       xlabel='Percentage of people who voted for it', 
       ylabel='Animal')

ax.barh(data["Animal"], data["Votes"])
plt.show()

To make a bar chart using `df.plot`, specify `kind='bar'`.

In [None]:
df = pd.DataFrame(data, columns=data.keys())
df.plot(x ='Animal', y="Votes", kind='bar')

And `kind='barh'` for horizontal ones.

In [None]:
df.plot(x ='Animal', y="Votes", kind='barh')

## Pie Chart



In [None]:
data = {'Tasks': [300,500,700]}
labels = ['Tasks Pending','Tasks Ongoing','Tasks Completed']

explode = (0, 0.1, 0)  # only "explode" the 2nd slice (i.e. 'Tasks Ongoing')

fig, ax = plt.subplots()
ax.pie(data['Tasks'], explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

In [None]:
df = pd.DataFrame(data, columns=['Tasks'], index=labels)

df.plot.pie(y='Tasks', figsize=(5, 5), autopct='%1.1f%%', startangle=90)

# Open Food Facts

In real life, data is almost never in the ideal shape for us to plot.

First, we will work together step by step to overcome basic challenges. Then you will have to do this yourself on another dataset.

We will use structured data in a .csv file provided by data.gouv.fr: please download the [Open Food Facts sheet](https://www.data.gouv.fr/fr/datasets/open-food-facts-produits-alimentaires-ingredients-nutrition-labels/#) (the first one) and place it in this folder (`chapter01`).

In [None]:
df = pd.read_csv("fr.openfoodfacts.org.products.csv", delimiter="\t")

In [None]:
df.info()

In [None]:
df

In [None]:
print(df.columns)

You can see that our dataset is huge: there are more than 1 million rows and 178 columns.

It contains a lot of missing data, which Pandas parses as `NaN`.

Working with missing data is a common task in Data Science. Pandas provides the `isna()` and `notna()` functions to detect it.

In [None]:
pd.isna(df['generic_name'])
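To see what these masks look like on something smaller than the full dataset, here is a sketch on a tiny, made-up `products` DataFrame. It shows that `isna()` and `notna()` return complementary boolean masks, and that a mask can be used to filter rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: both np.nan and None are treated as missing values
products = pd.DataFrame({"product_name": ["Tea", "Jam", "Rice"],
                         "generic_name": ["Green tea", np.nan, None]})

missing_mask = pd.isna(products["generic_name"])
present_mask = pd.notna(products["generic_name"])

print(missing_mask.tolist())
print(present_mask.tolist())

# A boolean mask selects the rows where it is True
named_products = products[present_mask]
print(named_products)
```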

Using what we learnt in Lesson01, print the **number of missing values** for the specified columns.

In [None]:
# TODO: print the number of missing values for the column "generic_name".
# Hint: use Boolean Masks.
print()

# TODO: print the number of missing values for the column "energy-kcal_100g".
print()

You can use `df[column].unique()` to get the list of unique values contained in the specified `column`.

In [None]:
# TODO: print the number of unique values for the column "nutriscore_score".
print()

Great. Now let's use Boolean masks to reduce our data: we only want rows that have a value in every column we want to investigate.

In [None]:
interesting_columns =  ["product_name", "generic_name",
                        "created_datetime",
                        "main_category_fr", "countries_fr",
                        "nutriscore_score", 
                        "energy_100g", "proteins_100g", "carbohydrates_100g", "fat_100g"]

nutriscores = df[interesting_columns]
nutriscores

In [None]:
for col in interesting_columns:
    nutriscores = nutriscores[pd.notna(nutriscores[col])]
nutriscores
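Pandas also offers `DataFrame.dropna`, whose `subset` parameter does the same filtering in a single call. The sketch below uses a tiny made-up `sample` DataFrame to check that both approaches keep the same rows. It is offered as an alternative, not as the method used in this course.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in different columns
sample = pd.DataFrame({"product_name": ["Tea", "Jam", None],
                       "nutriscore_score": [1.0, np.nan, 12.0],
                       "brands": [None, None, None]})
columns_to_check = ["product_name", "nutriscore_score"]

# Approach 1: one boolean mask per column, as in the loop above
filtered_loop = sample[columns_to_check]
for col in columns_to_check:
    filtered_loop = filtered_loop[pd.notna(filtered_loop[col])]

# Approach 2: drop every row that has a missing value in any listed column
filtered_dropna = sample[columns_to_check].dropna(subset=columns_to_check)

print(filtered_loop.equals(filtered_dropna))
```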

Hooray! We reduced our rows from 1 million to 74k.

Now, we can use data visualization to answer some questions.
Let's start with the first one:

## How many products are there per country?

We assume the best way to display that is with a **Bar Chart**.

Ideally, we would like the unique countries on the x-axis and the corresponding number of products on the y-axis.

Let's start by creating our x-axis.

In [None]:
unique_countries = nutriscores["countries_fr"].unique()

print(len(unique_countries))
print(unique_countries)

We see that a single product may list several countries, separated by commas.

There are many ways to deal with this. The simplest are:

* Create a dictionary `{countryName: listOfProducts}`, then use its keys as the x-axis and the length of each value as the y-axis. To iterate over a DataFrame, we use the generator `df.iterrows()`. Downside: it will take a long time, as we will use Python's built-in `for` loops to iterate over our 74k rows.

* Make a list of all unique countries, then use boolean masks to count the products that contain each country's name in this column. This method relies on Pandas boolean indexing and should be quicker. Let's get to it!
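For comparison, the first approach (the dictionary built with `df.iterrows()`) could be sketched as follows. The `sample` DataFrame and its product names are invented for illustration. On the real 74k rows, this loop would be noticeably slower.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample mimicking the "countries_fr" column
sample = pd.DataFrame({"product_name": ["Baguette", "Instant noodles", "Lavash"],
                       "countries_fr": ["France", "France,Malaisie", "Arménie"]})

# Map each country to the list of products sold there
products_by_country = defaultdict(list)
for _, row in sample.iterrows():
    for country in row["countries_fr"].split(","):
        products_by_country[country].append(row["product_name"])

# Keys become the x-axis, list lengths become the y-axis
counts = {country: len(names) for country, names in products_by_country.items()}
print(counts)
```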

In [None]:
# Get the real set of unique countries
country_set = set()
for country_list in unique_countries:
    for country in country_list.split(","):
        country_set.add(country)

print(len(country_set))
print(country_set)

Below, we define a function `country_is_in` that takes `iterable` and `country` as **arguments** and returns a boolean mask.

In Python, type hints are optional, but they are helpful. They make the behavior of a function clearer for everyone: you, your colleagues, and the future you who, in 6 months, will have forgotten why you wrote these functions and what they do. We will dig into that later.

They are also helpful while coding:
* If you code with an IDE, it will tell you when there is a type mismatch, before you even run the code.
* Python itself does not check type hints at runtime, but a static type checker such as mypy can report mismatches instead of leaving you with puzzling errors.

In [None]:
# Create the boolean mask

from typing import List

# Without typing
def country_is_in(iterable, country):
    return [country in c for c in iterable]

# With typing
def country_is_in(iterable: List[str], country: str) -> List[bool]:
    return [country in c for c in iterable]

# Test it
print(country_is_in(["Arménie", "France", "France,Malaisie"], "France"))

country = "Arménie"
nutriscores[country_is_in(nutriscores["countries_fr"].values, country)]

In [None]:
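# Passing an int raises a TypeError at runtime. The error comes from the `in` operator, not from the type hints,
# but an IDE or static type checker would flag this mismatch before the code runs.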
print(country_is_in(["Arménie", "France", "France,Malaisie"], 666))
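To make the role of type hints concrete, the sketch below shows that Python does not enforce them at runtime. The `double` function is a made-up example: it accepts a string despite its `int` hint, while the `TypeError` from `country_is_in` is raised by the `in` operator itself.

```python
from typing import List


def country_is_in(iterable: List[str], country: str) -> List[bool]:
    return [country in c for c in iterable]


def double(value: int) -> int:
    # Hypothetical helper: the hint says int, but nothing checks it at runtime
    return value * 2


# No error here, even though a str is passed where an int is hinted
result = double("ab")
print(result)

# Here the `in` operator refuses an int on its left side when the right side is a str
try:
    country_is_in(["Arménie", "France"], 666)
except TypeError as error:
    message = str(error)
    print("TypeError:", message)
```

A static checker such as mypy would report both calls as type mismatches without running the code.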

In [None]:
country_dict = dict.fromkeys(country_set)

for country in country_set:
    country_dict[country] = nutriscores[country_is_in(nutriscores["countries_fr"].values, country)].shape[0]
    
print(country_dict)
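A possible alternative, not the method used in this course, is to let Pandas split and count the countries. The sketch below uses a made-up `countries` Series and the `Series.str.split`, `Series.explode` and `Series.value_counts` methods. It also shows a limitation of substring matching: `"Niger" in "Nigeria"` is `True`, whereas counting after splitting only matches exact names.

```python
import pandas as pd

# Hypothetical sample of the "countries_fr" column
countries = pd.Series(["France", "France,Malaisie", "Arménie", "Nigeria"],
                      name="countries_fr")

# Split on commas, put each country on its own row, then count occurrences
product_counts = countries.str.split(",").explode().value_counts()
print(product_counts.to_dict())

# Substring matching counts "Nigeria" as a match for "Niger"
substring_matches = sum("Niger" in value for value in countries)

# Exact matching after the split finds no product for "Niger"
exact_matches = int(product_counts.get("Niger", 0))

print(substring_matches, exact_matches)
```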

In [None]:
fig, ax = plt.subplots()

fig.set_size_inches(18.5, 25.5)  # Make the figure taller

ax.set(title="Distribution of products by country", 
       ylabel='Country',
       xlabel="Number of products")

ax.barh(list(country_dict.keys()), list(country_dict.values()))
plt.show()

Let's:
* Remove France and countries with 100 results or fewer, so we can see the rest of the picture better.
* Sort our dict, so the results appear more clearly.

You can change the minimum number of products required to appear on the graph.

In [None]:
mini_country_dict = country_dict.copy()
removed_countries = []

for key, value in country_dict.items():
    if value <= 100:
        mini_country_dict.pop(key)
        removed_countries.append(key)
        
mini_country_dict.pop("France")
print(f"Removed {len(removed_countries) + 1} countries. Remaining: {len(mini_country_dict)}")

mini_country_dict
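The same filtering can be written more compactly with a dict comprehension. This is a sketch on made-up counts. The `min_products` name is introduced here for illustration and plays the role of the hard-coded `100` above.

```python
# Hypothetical counts per country
country_counts = {"France": 5000, "Suisse": 450, "Belgique": 900, "Arménie": 3}
min_products = 100

# Keep countries above the threshold, and leave France out
filtered = {
    name: count
    for name, count in country_counts.items()
    if count > min_products and name != "France"
}

removed = sorted(set(country_counts) - set(filtered))
print(filtered)
print(f"Removed {len(removed)} countries. Remaining: {len(filtered)}")
```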

To sort our dict, we can use the built-in function `sorted`.

As you can see from the Documentation (accessible when you click on the function and press Shift+Tab):
> Signature: sorted(iterable, /, *, key=None, reverse=False)  
>Docstring:  
>Return a new list containing all items from the iterable in ascending order.   

Since we pass it the dictionary's items but want to sort by the values rather than the keys, we specify what to sort on with the `key` parameter.

In [None]:
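from typing import Tuple

def take_second_value(x: Tuple[str, int]) -> int:
    return x[1]

# Store the result under a new name: `sorted` returns a list of (country, count) tuples,
# and the next cell still needs `mini_country_dict` to be a dict.
sorted_countries = sorted(mini_country_dict.items(), key=take_second_value)
print(sorted_countries)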

Instead of defining a function only for this purpose, we can use a **lambda function**.

It's also called an *anonymous function*: it's a function without a name, which we define using

> lambda *arguments* : *expression*


For example

`lambda x: x + 5`

is equivalent to

`def plus_five(x):
    return x + 5`

In [None]:
mini_country_dict = sorted(mini_country_dict.items(), key=lambda x: x[1])
print(mini_country_dict)
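Two variations may be useful here, shown on made-up counts: `operator.itemgetter(1)` from the standard library does the same job as `lambda x: x[1]`, and `reverse=True` sorts from largest to smallest.

```python
from operator import itemgetter

# Hypothetical counts per country
country_counts = {"Suisse": 450, "Belgique": 900, "Allemagne": 300}

# itemgetter(1) returns the second element of each (country, count) tuple
ascending = sorted(country_counts.items(), key=itemgetter(1))

# reverse=True puts the largest count first
descending = sorted(country_counts.items(), key=lambda item: item[1], reverse=True)

# Unpack the sorted pairs into the two lists a bar chart needs
names = [name for name, _ in ascending]
values = [count for _, count in ascending]

print(ascending)
print(descending)
```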

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(18.5, 15.5)  # Make the figure taller

ax.set(title="Distribution of products by country", 
       ylabel='Country',
       xlabel="Number of products")

ax.barh([x[0] for x in mini_country_dict], [x[1] for x in mini_country_dict])

plt.show()

Next question!

## What is the product distribution for each nutriscore?

Of course, our goal is to output a **Pie Chart**.

We will compute this distribution only for French products.

Let's explore the different nutriscore values.

In [None]:
french_products = nutriscores[country_is_in(nutriscores["countries_fr"].values, "France")]

uniques = french_products['nutriscore_score'].unique()
print(len(uniques))
uniques

Let's visualize the "raw" nutriscore distribution.

In [None]:
y = [french_products[french_products['nutriscore_score'] == unique_val].shape[0] for unique_val in uniques]

fig, ax = plt.subplots()
fig.set_size_inches(18.5, 15.5)  # Make the figure taller

ax.set(title="Distribution of products by nutriscore in France", 
       ylabel='Nutriscore',
       xlabel="Number of products")

ax.barh(uniques, y)
plt.show()

Based on that, we can make a lot of different Pie Charts.

Let's say we want a pie with 4 slices (called "pies" in the code below). Each slice covers one quarter of the range between the minimum and the maximum score; if the products were evenly distributed, each slice would hold 1/4 of them.

You can change the value of `number_of_pies`.

In [None]:
score_max = max(uniques)
score_min = min(uniques)

score_range = score_max - score_min

print(score_min, score_max, score_range)

number_of_pies = 4
print(f"Percentage of each pie if products were equally distributed: {1 / number_of_pies * 100}%")

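def get_pies(number_of_pies: int):
    y = []
    labels = []
    scores = french_products['nutriscore_score']
    threshold = score_min
    for i in range(number_of_pies):
        last_threshold = threshold
        # Compute the upper bound of this slice
        threshold = score_min + ((i + 1) * score_range / number_of_pies)
        # The first slice includes its lower bound, so that score_min is counted
        lower_op = "<=" if i == 0 else "<"
        label = f"{round(last_threshold, 2)} {lower_op} Nutriscore value <= {round(threshold, 2)}"
        labels.append(label)
        # Count only the products that fall inside this slice (not all products below the threshold)
        above = scores >= last_threshold if i == 0 else scores > last_threshold
        nb = french_products[above & (scores <= threshold)].shape[0]
        y.append(nb)
    # "Explode" the largest slice
    explode = [0.1 if n == max(y) else 0 for n in y]
    return y, labels, explode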

y, labels, explode = get_pies(number_of_pies)

fig, ax = plt.subplots()

fig.set_size_inches(18.5, 10.5)

ax.pie(y, labels=labels, explode=explode, autopct='%1.1f%%', shadow=True, startangle=90)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()
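Pandas also provides `pd.cut`, which splits values into equal-width bins when given an integer number of bins. The sketch below applies it to made-up scores rather than to `french_products`. It is an alternative to a hand-written loop, shown under those assumptions.

```python
import pandas as pd

# Hypothetical nutriscore values
scores = pd.Series([-10, -5, 0, 5, 10, 15, 20, 30])
number_of_bins = 4

# pd.cut assigns each score to one of 4 equal-width intervals
binned = pd.cut(scores, bins=number_of_bins)

# Count the scores per interval, keeping the intervals in ascending order
counts = binned.value_counts().sort_index()

labels = [str(interval) for interval in counts.index]
print(labels)
print(counts.tolist())
```

`counts.tolist()` and `labels` can then be passed to `ax.pie` in the same way as `y` and `labels` above.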

# Multiple subplots

What if we wanted to display multiple pies?

You can use `plt.subplots` parameters to define your grid.

In [None]:
fig, ax = plt.subplots(2)  # asked for 2 rows

fig.set_size_inches(18.5, 10.5)
fig.suptitle("Two vertically stacked pies")

y, labels, explode = get_pies(2)
ax[0].pie(y, labels=labels, explode=explode, autopct='%1.1f%%', shadow=True, startangle=90)

y, labels, explode = get_pies(3)
ax[1].pie(y, labels=labels, explode=explode, autopct='%1.1f%%', shadow=True, startangle=90)

plt.show()

In [None]:
fig, ax = plt.subplots(1, 2)  # asked for 1 row, 2 cols

fig.set_size_inches(18.5, 10.5)
fig.suptitle("Two horizontally stacked pies")

y, labels, explode = get_pies(4)
ax[0].pie(y, labels=labels, explode=explode, autopct='%1.1f%%', shadow=True, startangle=90)

y, labels, explode = get_pies(5)
ax[1].pie(y, labels=labels, explode=explode, autopct='%1.1f%%', shadow=True, startangle=90)

plt.show()
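When both a number of rows and a number of columns greater than 1 are requested, `plt.subplots` returns a 2D NumPy array of Axes. The following sketch, with made-up pie values, builds a 2x2 grid and walks over it with `axes.flat`.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2)  # asked for 2 rows, 2 cols
print(axes.shape)

# axes.flat iterates over the grid row by row
for index, ax in enumerate(axes.flat):
    ax.pie([index + 1, 1], labels=["a", "b"])  # illustrative values only
    ax.set_title(f"Pie {index + 1}")

# A single Axes can also be reached with [row, column] indexing
bottom_right = axes[1, 1]
print(bottom_right.get_title())

fig.tight_layout()
```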

In a similar way, we can plot multiple lines in the same figure by calling `ax.plot` several times on the same Axes.

In [None]:
x = np.arange(6)

fig, ax = plt.subplots()

ax.plot(x, x, c='b', marker="^", ls='--', label='GNE', fillstyle='none')
ax.plot(x, x+1, c='g', marker=(8,2,0), ls='--', label='MMR')
ax.plot(x, (x+1)**2, c='k', ls='-', label='Rand')
ax.plot(x, (x-1)**2, c='r', marker="v", ls='-', label='GMC')
ax.plot(x, x**2-1, c='m', marker="o", ls='--', label='BSwap', fillstyle='none')
ax.plot(x, x-1, c='k', marker="+", ls=':', label='MSD')

plt.legend()  # Automatically generate a legend based on plots `label` parameter

plt.show()

To learn more about what `matplotlib.pyplot` (`plt`) can do for you, see the extensive documentation with examples [here](https://matplotlib.org/api/pyplot_api.html).