# Python coding bootcamp - Notebook 3

1. Data management and processing
2. Data visualization

&copy; Francis WOLINSKI 2024

<div class="alert alert-danger">
    <h3><i class="fa fa-plus-square"></i>  PART 1</h3>
</div>

In [None]:
import numpy as np
import pandas as pd

# display options
pd.set_option("display.min_rows", 16)
pd.set_option("display.max_columns", 30)

In [None]:
# loading the 2023 file

df2023 = pd.read_csv('data/names/yob2023.txt', names=['name', 'gender', 'births'])
name2023 = df2023.set_index("name")
name2023 = name2023.head(17_532)  # so that the index is unique

In [None]:
# Import all data in a single DataFrame

df = pd.read_pickle('names.pkl')  # load it from pickle format
df.shape

# 1. Data management and processing

![image](./images/pandas2.png)

#### Reminder

A `Series` object can behave like a dictionary, with access by labels (referring to the index), and like a list, with access by positions (referring to the underlying 1-D array).

The `.loc[]` operator is reserved for labels and the `.iloc[]` one for positions.

A `DataFrame` object can behave like a dictionary, with access by labels (referring to the index or to the columns), and like a list of lists, with access by positions (referring to the underlying 2-D array).

The `.loc[]` operator is reserved for labels and the `.iloc[]` one for positions. The first part selects rows and the second part, if any, selects columns.
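
As a minimal sketch of the reminder above, the following cell contrasts label-based and position-based access. The objects `s` and `toy` are hypothetical toy data, not part of the bootcamp dataset.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
by_label = s.loc["b"]       # access by label
by_position = s.iloc[1]     # access by position

toy = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])
cell_by_label = toy.loc["c", "y"]    # row label, then column label
cell_by_position = toy.iloc[2, 1]    # row position, then column position

rows_by_label = toy.loc["a":"b"]     # label slices include the end label
rows_by_position = toy.iloc[0:1]     # position slices exclude the end position
```

Note the difference between the two slices: `.loc[]` includes the end label whereas `.iloc[]` excludes the end position, as Python lists do.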


## 1.1 Modifying data

It is possible to use the `.loc[]` or `.iloc[]` operators to modify some values of a `Series` or a `DataFrame` object.

In [None]:
# head of name2023
name2023.head()

In [None]:
# modify a single cell
name2023.loc["Olivia", "births"] = 18000
name2023.head()

In [None]:
# modify several cells
name2023.iloc[2:4, 1] = 13000
name2023.head()

Some useful methods also modify data: e.g., `replace()` replaces a given value with another, and `clip()` trims all numeric values to given thresholds.
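
The two methods mentioned above can be sketched on a small `Series`. The `births` object below is hypothetical toy data. Both calls return a new object and leave the original one untouched.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
births = pd.Series([5, 120, 13527, 20000], index=["Ab", "Bo", "Cy", "Di"])

replaced = births.replace(13527, 14000)  # replace one given value with another
clipped = births.clip(10, 15000)         # values below 10 become 10, above 15000 become 15000
```

To keep the result, assign it back, for instance to a column of the `DataFrame`.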

In [None]:
# example with replace
name2023['births'].replace(13527, 14000)

In [None]:
# example with clip

name2023['births'].clip(10, 15000)

## 1.2 Sorting and sampling data

A few methods sort data in `Series` and `DataFrame` objects.
- The method `sort_values()` sorts a `Series` or a `DataFrame` object according to its values
- The method `sort_index()` sorts a `Series` or a `DataFrame` object according to its index

For `DataFrame` objects, the `sort_values()` method takes a column or a list of columns as argument for the sorting.

To sort values in reverse order, use the option `ascending=False`.

The sorting operation returns a copy of the initial object.
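
A minimal sketch of these sorting options, on a hypothetical toy `DataFrame`: the `ascending` option also accepts a list with one boolean per sorting column.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
toy = pd.DataFrame({"name": ["Bob", "Ann", "Cy", "Ann"],
                    "births": [5, 7, 5, 3]})

# single column, reverse order
by_births = toy.sort_values("births", ascending=False)

# ascending births, then names in reverse alphabetical order for equal births
mixed = toy.sort_values(["births", "name"], ascending=[True, False])
```

The original `toy` object keeps its initial row order, since each call returns a sorted copy.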

In [None]:
# sort df according to births
df.sort_values("births").head(10)

In [None]:
# sort df according to births in reverse order
df.sort_values("births", ascending=False).head(10)

In [None]:
# sort df according to births then names
df.sort_values(["births", "name"]).head(10)

In [None]:
# sort df by ascending births, then by names in reverse order
df.sort_values(["births", "name"], ascending=[True, False]).head(10)

In [None]:
# sort index
name2023.sort_index().head(10)

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 1 &starf;</h3>
    <ul>
        <li>Sort the large <code>DataFrame</code> by name</li>
        <li>Then sort the large <code>DataFrame</code> by name and year. Compare the results.</li>
        <li>Make a new <code>DataFrame</code> from the large one by setting its index to name.</li>
        <li>Change all Emma births to 0 and sort the new <code>DataFrame</code> object by births and year.</li>
    </ul>
</div>

In [None]:
# %load notebook3/ex_01.py

Other interesting methods exist such as `nsmallest()` and `nlargest()` which compute the n smallest and largest numerical values of columns.

In [None]:
# example with nlargest
df.nlargest(10, 'births')

In [None]:
# example with nsmallest
df.nsmallest(10, 'births')

<div class="alert alert-warning" role="alert">
    <h3><i class="fa fa-question-circle"></i> Question &starf;&starf;</h3>
    <ul>
        <li>The <code>nsmallest()</code> method can be simulated by combining 2 other methods, which ones?</li>
        <li>The <code>nlargest()</code> method can be simulated by combining 2 other methods, which ones?</li>
        <li>Compare the efficiency of the different methods.</li>
    </ul>
</div>

There is also the `sample()` method which takes a random sample of a `DataFrame`.

This method relies on the pseudo-random number generator of `numpy.random`. To reproduce the same sample, for a simulation for instance, initialize the generator with the `seed()` function and a fixed integer seed.
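
As an alternative sketch, `sample()` also accepts a `random_state` argument, which makes a single call reproducible without touching the global generator. The `toy` object is hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
toy = pd.DataFrame({"value": range(100)})

# the same random_state gives the same sample at each call
first = toy.sample(5, random_state=0)
second = toy.sample(5, random_state=0)
```

Both samples contain the same rows in the same order.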

In [None]:
# example with sample 1
# each time you run it, you get a different output
df.sample(10)

In [None]:
# example with sample and initialization of the pseudo random number generator
np.random.seed(0)
df.sample(10)

## 1.3 Casting data type

The method `astype()` casts values from `Series` or `DataFrame` objects to the specified dtype: e.g., `bool`, `int`, `float`, `str`.

Casting to `int` and `float` actually casts to the corresponding NumPy dtypes used by `pandas`: `int64` on most platforms (`int32` on some Windows setups) and `float64`.
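
A minimal sketch of `astype()` on a hypothetical toy `Series`. A dtype can also be given by name, such as `"int32"`, when a specific size is wanted.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
s = pd.Series([1, 2, 3])

as_float = s.astype(float)     # Python float maps to float64
as_int32 = s.astype("int32")   # explicit dtype given by name
as_text = s.astype(str)        # each value becomes a string
```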

In [None]:
# example of casting a Series
df['births'].astype(float).head()

In [None]:
# example of casting a DataFrame
df[['year', 'births']].astype(float).head()

It is also possible to cast a categorical column to a dedicated categorical dtype, which may be ordered or not. This is done by instantiating the `CategoricalDtype` class from **pandas**. After casting a column to a categorical dtype, the memory usage of the `DataFrame` is lower.

In [None]:
# info before casting to category
df.info()

In [None]:
# copy the DataFrame
df2 = df.copy()

# instantiate a categorical dtype
gender_dtype = pd.CategoricalDtype(categories=['F', 'M'], ordered=False)

# cast gender column to new dtype
df2['gender'] = df2['gender'].astype(gender_dtype)

# info after casting to category
df2.info()

<div class="alert alert-warning" role="alert">
    <h3><i class="fa fa-question-circle"></i> Questions &starf;&starf;</h3>
    <ul>
        <li>Why does the memory usage go down after casting the gender to a categorical type?</li>
        <li>Compare the efficiency of selecting a given gender when it is a string vs. it is a category. Why does the efficiency go up after casting the gender to a categorical type?</li>
    </ul>
</div>

<div class="alert alert-info">
    <h3><i class="fa fa-info-circle"></i>  Tips</h3>
    <ul>
        <li>Ordered categorical types are useful when the alphabetical order of the categories does not match their meaning.</li>
        <li>For instance, if a column contains risk values such as <code>Low</code>, <code>Medium</code> and <code>High</code>, sorting the risk values will use the alphabetical order (<code>High</code>, <code>Low</code>, <code>Medium</code>) and will <b>NOT</b> match the risk values.</li>
        <li>To get categorical values correctly ordered, one needs to define an ordered categorical type, for instance:</li>
        <code>risk_dtype = pd.CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True)</code>
        <li>Then, when sorting the risk values, they will be appropriately ordered.</li>
    </ul>
</div>
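
The tip above can be sketched as follows, with a hypothetical `risk` column: sorting plain strings uses the alphabetical order, whereas sorting the ordered categorical values follows the declared order.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
risk = pd.Series(["Medium", "High", "Low", "Medium"])

# plain strings: alphabetical order
alphabetical = risk.sort_values().tolist()

# ordered categorical dtype: order of the declared categories
risk_dtype = pd.CategoricalDtype(categories=["Low", "Medium", "High"], ordered=True)
risk_cat = risk.astype(risk_dtype)
ordered = risk_cat.sort_values().tolist()

# ordered categories also support comparisons
above_low = (risk_cat > "Low").tolist()
```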

## 1.4 Applying functional methods

### 1.4.1 `Series` objects

The method `map()` returns a new `Series` object by applying an element-wise correspondence (`dict` or `Series`) to each element.
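
A point worth checking on a small example: a value which has no entry in the dictionary is mapped to `NaN`. The `gender` object below is hypothetical toy data which includes an unexpected value `'X'`.

```python
import pandas as pd

# Hypothetical toy data with a value missing from the dictionary
gender = pd.Series(["F", "M", "X"])

mapped = gender.map({"F": "Female", "M": "Male"})
missing = mapped.isna()  # True where no correspondence was found
```

Counting the `True` values of `missing` is a quick way to detect values which are not covered by the correspondence.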

In [None]:
# map for Series with a dictionary
d = {'F': 'Female', 'M': 'Male'}
df['gender'].map(d)

In [None]:
# map for Series with a Series object
s = pd.Series(['Female', 'Male'], index=['F', 'M'])
s

In [None]:
# index
s.index

In [None]:
# values
s.values

In [None]:
# map for Series with a Series object
df['gender'].map(s)

<div class="alert alert-info">
    <h3><i class="fa fa-info-circle"></i>  Tips</h3>
    <ul>
        <li>This method is useful when one wants to normalize the column values according to a reference table.</li>
        <li>For instance, if one has a column with country names, this method can be used to map them to ISO codes.</li>
    </ul>
</div>

The method `apply()` returns a new `Series` object by invoking a function element-wise. The function can be a built-in Python function, a function from a module, or a user-defined function such as a lambda or a standard Python function.
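
A minimal sketch of the three kinds of functions, on a hypothetical `names` object. The helper `vowel_count()` is only an illustration and is not related to the exercises.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
names = pd.Series(["Olivia", "Emma", "Noah"])


def vowel_count(name):
    """Illustrative helper: number of vowels (a, e, i, o, u) in a name."""
    return sum(ch in "aeiou" for ch in name.lower())


with_builtin = names.apply(len)                  # built-in Python function
with_function = names.apply(vowel_count)         # standard Python function
with_lambda = names.apply(lambda n: n[::-1])     # lambda reversing each name
```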

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 2 &starf;&starf;</h3>
    <ul>
        <li>Implement a function which rounds births up when above 1000: e.g., 12345 => 12K.</li>
        <li>Apply your function on different extracts of <code>name2023['births']</code> (e.g., nlargest, sample).</li>
    </ul>
</div>

In [None]:
# %load notebook3/ex_02.py

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 3 &starf;&starf;&starf;</h3>
    <ul>
        <li>The word <code>Internationalization</code> is sometimes abbreviated into <code>I18n</code>, where 18 represents the number of letters between the first and the last letter.</li>
        <li>Implement a <code>shorten()</code> function which turns a string into a new one with the first character, the last character and the number of characters in between (strings with 2 letters or less are left unchanged),<br /> e.g. <code>Al → Al, Emma → E2a, Olivia → O4a</code>.</li>
        <li>Test your function with the word <code>Internationalization</code>.</li>
        <li>Add a column <code>shortname</code> to <code>df2023</code> by applying your function on the column <code>name</code>.</li>
        <li>Compare the number of different names and the number of different shortnames.</li>
        <li>Which shortname is the most frequent?</li>
        <li>Print the list of names in alphabetical order which correspond to this shortname.</li>
    </ul>
</div>

In [None]:
# %load notebook3/ex_03.py

<div class="alert alert-info">
    <h3><i class="fa fa-info-circle"></i>  Tips</h3>
    <ul>
        <li>Using <code>apply()</code> may be slower than using the vectorized operations available in <b>pandas</b> (or <b>NumPy</b>).</li>
    </ul>
</div>

### 1.4.2 `DataFrame` objects

The method `apply()` returns a new object by applying a function to each column (`axis=0` by default). It is possible to use the `axis=1` option to apply the function to each row instead.

<div class="alert alert-info">
    <h3><i class="fa fa-info-circle"></i>  Tips</h3>
    <ul>
        <li><code>apply(axis=0)</code> applies a function to each <b>column</b> of a <code>DataFrame</code> object. For reducing functions, the result is <b>row-like</b>.</li>
        <li><code>apply(axis=1)</code> applies a function to each <b>row</b> of a <code>DataFrame</code> object. For reducing functions, the result is <b>column-like</b>.</li>
    </ul>
</div>
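
The two axes can be sketched with a reducing function (here the range, i.e., maximum minus minimum) on a hypothetical toy `DataFrame`.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def value_range(values):
    """Illustrative reducing function: maximum minus minimum."""
    return values.max() - values.min()


per_column = toy.apply(value_range)          # axis=0: one result per column
per_row = toy.apply(value_range, axis=1)     # axis=1: one result per row
```

With a reducing function, both results are `Series` objects: `per_column` is indexed by the column labels and `per_row` by the row labels.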

## 1.5 Adding and reorganizing columns

It is possible to add a new column to a `DataFrame` object either with a single value or with a `Series` object which shares the same index (e.g., a map or a function applied to an existing column, or a combination of several existing columns).

In [None]:
# add a column with a single value
df['length'] = 0
df.head(10)

In [None]:
# add a Series as a new column
df['length'] = df['name'].str.len()

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 4 &starf;</h3>
    <p>Select the top 10 longest names in the df <code>DataFrame</code> and apply the <code>shorten()</code> function defined above.</p>
</div>

In [None]:
# %load notebook3/ex_04.py

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 5 &starf;&starf;</h3>
    <p>Add a column <code>initial</code> with the first letter of names in capital.</p>
</div>

In [None]:
# %load notebook3/ex_05.py

<div class="alert alert-warning" role="alert">
    <h3><i class="fa fa-question-circle"></i> Question &starf;</h3>
    <ul>
        <li>Try to guess which is the most and the least frequent initial in the df <code>DataFrame</code>.</li>
        <li>Compare your guess with the value counts of the initial column.</li>
    </ul>
</div>

Fancy indexing can be used to select some columns or to reorganize the full columns.

In [None]:
# head
df.head()

In [None]:
# df with columns sorted in alphabetical order
df[['births', 'gender', 'initial', 'length', 'name', 'year']].head()

The `rename()` method renames columns (and also index labels). This method returns a copy of the initial object. All column names can also be changed at once by assigning to the `columns` attribute: e.g. `df.columns = [...]`. Of course, the list of new labels must have as many elements as there are columns.
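
A minimal sketch of both approaches on a hypothetical toy `DataFrame`: `rename()` leaves the initial object untouched whereas assigning to `columns` modifies the object in place.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
toy = pd.DataFrame({"name": ["Emma"], "births": [100]})

# rename() returns a copy; columns absent from the mapping are kept as is
renamed = toy.rename(columns={"births": "number of births"})

# assigning to the columns attribute changes all labels in place
relabeled = toy.copy()
relabeled.columns = ["first name", "count"]
```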

In [None]:
# renaming columns with a mapping
df.rename(columns={'name': 'first name', 'births': 'number of births'}).head()

## 1.6 Dropping values

The `drop()` method removes index labels from `Series` or `DataFrame` objects, or columns from `DataFrame` objects with the option `axis=1`. It returns a new object.

In [None]:
# head of name2023
name2023.head(10)

### 1.6.1 Dropping rows

In [None]:
# drop index
name2023.drop('Mia').head(10)

### 1.6.2 Dropping columns

In [None]:
# drop column
# single column DataFrame (not a Series object)
name2023.drop('gender', axis=1).head(10)

This is a `DataFrame` object. There is a big difference between a `DataFrame` object with a single column and a `Series` object.

In [None]:
# Series out of a DataFrame
name2023['births']

<div class="alert alert-info">
    <h3><i class="fa fa-info-circle"></i>  Tips</h3>
    <ul>
        <li>The method <code>to_frame()</code> transforms a <code>Series</code> object into a <code>DataFrame</code> one.</li>
        <li>In Machine Learning, the <code>drop()</code> method is often used to build the features matrix from the whole dataset by dropping the target column.</li>
    </ul>
</div>

In [None]:
# Series to DataFrame
name2023['births'].to_frame()

### 1.6.3 Dropping duplicates

It is also possible to remove duplicated rows. The `duplicated()` method returns a boolean `Series` object flagging duplicated rows, and the `drop_duplicates()` method removes those duplicated rows.

In [None]:
# duplicated rows
df.duplicated(subset=['year', 'name'])

In [None]:
# value counts of duplicated rows
df.duplicated(subset=['year', 'name']).value_counts()

In [None]:
# drop duplicated rows
var = df.drop_duplicates(subset=['year', 'name'])
var.shape

<div class="alert alert-info">
    <h3><i class="fa fa-info-circle"></i>  Tips</h3>
    <ul>
        <li>The method <code>drop_duplicates()</code> returns a <code>DataFrame</code> with duplicate rows removed.</li>
        <li>If any, only columns passed in the <code>subset</code> argument are used to compare rows.</li>
        <li>In addition, the <code>keep</code> argument can be set up to decide which duplicated rows to keep:</li>
        <ul>
            <li><code>first</code>: drop duplicates except for the first occurrence.</li>
            <li><code>last</code>: drop duplicates except for the last occurrence.</li>
            <li><code>False</code>: drop all duplicates.</li>
        </ul>
    </ul>
</div>
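
The three values of the `keep` argument can be compared on a hypothetical toy `DataFrame` in which the pair (2000, Kim) appears twice.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
toy = pd.DataFrame({"year": [2000, 2000, 2001],
                    "name": ["Kim", "Kim", "Kim"],
                    "gender": ["F", "M", "F"]})

keys = ["year", "name"]
keep_first = toy.drop_duplicates(subset=keys)               # default: keep="first"
keep_last = toy.drop_duplicates(subset=keys, keep="last")
keep_none = toy.drop_duplicates(subset=keys, keep=False)    # drop every duplicated row
```

With `keep=False`, only the row for 2001 remains since both rows for 2000 are considered duplicates.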

## 1.7 Pivoting data

### 1.7.1 Cross table

The `crosstab()` function computes a simple cross-tabulation of two factors (`Series` objects sharing the same index). The `margins=True` option adds the totals of each row and column. The `normalize=...` option divides the values by a total (`True`: the grand total, `'index'`: the total of each row, `'columns'`: the total of each column).

The result is a `DataFrame` object: the index contains the different values of the first `Series` object and the columns contain the different values of the second `Series` object.

Note that it is a function of the `pandas` module and not a method that applies on an object.
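
A minimal sketch of the options described above, on hypothetical toy data with 4 rows.

```python
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
toy = pd.DataFrame({"length": [3, 3, 4, 4],
                    "gender": ["F", "M", "F", "F"]})

counts = pd.crosstab(toy["length"], toy["gender"])
with_totals = pd.crosstab(toy["length"], toy["gender"], margins=True)
by_row = pd.crosstab(toy["length"], toy["gender"], normalize="index")
```

In `counts`, the combination with no observation (length 4, gender M) is filled with 0. In `by_row`, each row sums to 1. In `with_totals`, the totals are stored under the label `All`.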

In [None]:
# counting length x gender
pd.crosstab(df['length'], df['gender'])

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 6 &starf;</h3>
    <p>Check the value of the crosstab for length = 6 and gender = 'F' by selecting in the <code>DataFrame</code> and counting the number of rows.</p>
</div>

In [None]:
# %load notebook3/ex_06.py

In [None]:
# counting length x gender + margins
pd.crosstab(df['length'], df['gender'], margins=True)

In [None]:
# counting length x gender + normalize + margins
pd.crosstab(df['length'], df['gender'], normalize=True, margins=True)

### 1.7.2 Pivot table

The `pivot_table()` method applies to a `DataFrame` object and builds a synthetic table with aggregated values of a given column organised according to the values of one or several columns. It returns a new `DataFrame` object.

The main arguments are:
- `values`: column of the `DataFrame` object whose values are to be aggregated and broken down
- `index`: column(s) of the `DataFrame` object whose values are to be used as the index of the pivot table
- `columns`: column(s) of the `DataFrame` object whose values are to be used as the columns of the pivot table
- `aggfunc`: aggregation function to apply to the set of values to be aggregated.

*Nota bene*: It is possible to use the `pivot_table()` method by specifying only one of the arguments `index` or `columns`.  In that case, the values are aggregated and broken down according to the values of the single axis that has been specified.

For instance, if we have a `DataFrame` object with 3 columns `A`, `B` and `C`, we can pivot this object in a new `DataFrame` object, in which the index will be the different values (or modalities) of column `B`, the columns will be the different values (or modalities) of column `C` and cells will be filled with an aggregate of all values of column `A` for which value of `B` and value of `C` correspond.

The possible values for computing the aggregation with `aggfunc` are: `'mean'` (by default), `'median'`, `'min'`, `'max'`, `'count'`, `'nunique'`, `'sum'` and any `lambda` or list of such values or functions.

More options are available such as `margins=True`.
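
The example with columns `A`, `B` and `C` described above can be sketched as follows, with hypothetical toy values and the sum as aggregation function.

```python
import pandas as pd

# Hypothetical toy data matching the A, B, C description above
toy = pd.DataFrame({"A": [1, 2, 3, 4],
                    "B": ["x", "x", "y", "y"],
                    "C": ["u", "u", "u", "v"]})

# index: values of B, columns: values of C, cells: sum of the matching values of A
table = toy.pivot_table(values="A", index="B", columns="C", aggfunc="sum")

# the cell (x, v) has no matching row in toy
no_match = pd.isna(table.loc["x", "v"])
```

The cell for `B = 'x'` and `C = 'u'` aggregates two rows (1 + 2), whereas the cell for `B = 'x'` and `C = 'v'` has no matching row and contains `NaN`.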

In [None]:
# pivot table with births and gender by year
df.pivot_table(values='births',
               index='year',
               columns='gender',
               aggfunc='sum')

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 7 &starf;</h3>
    <p>Check the value of the pivot table for year = 1880 and gender = 'F' by selecting in the dataframe and summing the number of births.</p>
</div>

In [None]:
# %load notebook3/ex_07.py

In [None]:
# pivot table with births by year
df.pivot_table(values='births',
               index='year',
               aggfunc='sum')

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 8 &starf;&starf;</h3>
    <ul>
        <li>Build a pivot table with first name in alphabetical order by year and gender.</li>
        <li>Build a pivot table with last name in alphabetical order by year and gender.</li>
    </ul>
</div>

In [None]:
# %load notebook3/ex_08.py

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 9 &starf;&starf;&starf;</h3>
    <ul>
    <li>Build a pivot table with diversity of names (number of different names) by year and gender.</li>
        <li>Check the value of the pivot table for year = 1880 and gender = 'F' by selecting in the DataFrame and counting the different names.</li>
        <li>Compute the difference between diversity of female names and diversity of male names over years</li>
        <li>Compute the maximum of this difference</li>
        <li>Get the year</li>
    </ul>
</div>

In [None]:
# %load notebook3/ex_09.py

<div class="alert alert-warning">
    <h3><i class="fa fa-book"></i> Further reading</h3>
    <ul>
        <li><a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html" target="_blank">Reshaping and pivot tables</a></li>
    </ul>
</div>

## 1.8 Managing missing values

The **pandas** module tackles missing values and provides different tools to handle them.

In [None]:
# pivot table of the subset of df where name equals Mary
selection = df.loc[(df['name'] == 'Mary')]
tab = selection.pivot_table(values='births',
                            index='year',
                            columns='gender')
tab

We can observe that the `NaN` value is displayed in some cells.

Contrary to the `crosstab()` function, which outputs `0` when there are no values for a combination of an index and a column, the `pivot_table()` method introduces `NaN` in such cases.

In [None]:
# crosstab of year by gender for names Mary
pd.crosstab(selection['year'], selection['gender'])

`NaN` is the `numpy.nan` special value; `NaN` means "Not a Number". It is used by **pandas** to denote missing values.

In [None]:
# a missing value
x = tab.loc[2023, 'M']
x

In [None]:
# nan is a float
type(x)

In [None]:
# any arithmetic or mathematical operation with nan returns nan
# a kind of absorbing element
x + 1

In [None]:
# equality does not work
x == x

In [None]:
# any function with nan returns nan
np.sqrt(x)

The `pandas` module contains several methods to deal with missing values in `Series` or `DataFrame` objects:
- `isna()` tests whether each value is nan
- `notna()` tests whether each value is not nan
- `dropna()` removes rows or columns with nan
- `fillna()` replaces nan values with another value

In [None]:
# testing nan
tab.isna()

In [None]:
# testing not nan
tab.notna()

One can combine these null tests with the reduction methods `any()` or `all()` to select or reject rows or columns with `NaN`.

In [None]:
# rows with at least a nan
tab.loc[tab.isna().any(axis=1)]

In [None]:
# columns with no nan
tab.loc[:, tab.notna().all(axis=0)]

In [None]:
# dropping nan
tab.dropna()

In [None]:
# dropping nan by column
tab.dropna(axis=1)

In [None]:
# replacing nan
tab.fillna(0)

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 10 &starf;&starf;&starf;</h3>
    <ul>
        <li>Implement a function which takes a name as argument and produces a pivot table with the distribution of its births across years and genders.</li>
        <li>The output must only show integers.</li>
        <li>Test it with the name Kim.</li>
        <li>For the name Kim, select the rows where there is at least a 0 (there are different ways to produce the logical condition).</li>
    </ul>
</div>

In [None]:
# %load notebook3/ex_10.py

<div class="alert alert-warning">
    <h3><i class="fa fa-book"></i> Further reading</h3>
    <ul>
        <li>Other filling options or methods exist (<code>ffill</code>, <code>bfill</code>, <code>interpolate()</code>), see: <a href='http://pandas.pydata.org/pandas-docs/stable/missing_data.html'>Working with missing data</a></li>
        <li>The <code>read_csv()</code> function has some options to deal with missing values when importing data (<code>na_values</code>, <code>keep_default_na</code>, <code>na_filter</code>), see: <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html'>Documentation pandas.read_csv</a></li>
    </ul>
</div>

<div class="alert alert-danger">
    <h3><i class="fa fa-plus-square"></i>  PART 2</h3>
</div>

In [None]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# display options
pd.set_option("display.min_rows", 16)
pd.set_option("display.max_columns", 30)

In [None]:
# loading the 2023 file
df2023 = pd.read_csv('data/names/yob2023.txt', names=['name', 'gender', 'births'])
name2023 = df2023.set_index("name")
name2023 = name2023.head(17_532)  # so that the index is unique

In [None]:
# Import all data in a single DataFrame

df = pd.read_pickle('names.pkl')  # load it from pickle format
df.shape

# 2. Data visualization

![image](./images/matplotlib.png)

A `Series` or a `DataFrame` object exposes a `plot` accessor, built on the `matplotlib.pyplot` module, which provides graphical functionalities.

Then, different kinds of graphics are available using the appropriate methods.

plotting methods| graphic
-|-
line() | line plot
bar() | vertical bar plot
barh() | horizontal bar plot
hist() | histogram
box() | boxplot
kde() | Kernel Density Estimation plot
density() | same as ‘kde’
area() | area plot
pie() | pie plot
scatter() | scatter plot
hexbin() | hexagonal binning plot

The scales of the X and Y axes are automatically adjusted to the data. By default, the X axis scale corresponds to the index of the object.

<div class="alert alert-info">
    <h3><i class="fa fa-info-circle"></i>  Tips</h3>
    <ul>
        <li>In a notebook, the last <code>matplotlib</code> instruction of a cell displays a string representing the object it returns. To prevent this string from being displayed, end the instruction with a semicolon: <code>;</code>.</li>
    </ul>
</div>

<div class="alert alert-warning">
    <h3><i class="fa fa-book"></i> Further reading</h3>
    <ul>
        <li><a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html">Visualization</a></li>
    </ul>
</div>

## 2.1 Curves

In [None]:
# plot Emma (F)

var = df.loc[(df['name'] == 'Emma') & (df['gender'] == 'F')]
var = var.set_index('year')
var.plot(title='Number of births for Emma (F)');

The `matplotlib` module handles different kinds of linestyles, markers and colors:
- 4 kinds of lines: `'-'` (solid), `'--'` (dashed), `':'` (dotted), `'-.'` (dashdot)
- 35 point marks: see the variable `matplotlib.lines.Line2D.markers`
- 8 predefined colors: `'b'` (blue), `'g'` (green), `'r'` (red), `'c'` (cyan), `'m'` (magenta), `'y'` (yellow), `'k'` (black), `'w'` (white)

For colors only, you may pass an array of colors when plotting several columns of a `DataFrame`.
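
As a hedged sketch of these options, the cell below plots two columns of a hypothetical toy `DataFrame` with one linestyle and one color per column and a common marker. The call to `matplotlib.use("Agg")` is only there to run the sketch without a display; it is not needed in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumption: no display available
import pandas as pd

# Hypothetical toy data, not the bootcamp dataset
toy = pd.DataFrame({"F": [10, 12, 9], "M": [8, 11, 13]},
                   index=[2000, 2001, 2002])

# one linestyle and one color per column, same marker for all columns
ax = toy.plot(style=["-", "--"], color=["r", "b"], marker="o",
              title="Toy births by year")
lines = ax.get_lines()
```

The styles passed here contain no color letter, so that they do not conflict with the `color` option.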

<div class="alert alert-warning">
    <h3><i class="fa fa-book"></i> Further reading</h3>
    <ul>
        <li>The matplotlib module handles many color formats: e.g., RGB and RGBA as tuples of float values or as hexadecimal strings, grey scale, X11/CSS4, xkcd, Tableau 10, CN color spec, etc.</li>
        <li>See: <a href='https://matplotlib.org/api/colors_api.html'>Documentation matplotlib.colors</a></li>
    </ul>
</div>

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 11 &starf;</h3>
    <p>Implement a function which plots the births across years for a given name and gender and test it.</p>
    <pre>
    
    def plot_name(name, gender):
        pass
    </pre>
</div>

In [None]:
# %load notebook3/ex_11.py

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 12</h3>
    <ul>
        <li>Get the pivot table of births by year and gender and plot it.</li>
    </ul>
</div>

In [None]:
# %load notebook3/ex_12.py

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 13</h3>
    <ul>
        <li>Get the pivot table of diversity of names by year and gender and plot it.</li>
    </ul>
</div>

In [None]:
# %load notebook3/ex_13.py

## 2.2 Bars

### 2.2.1 Vertical bars

In [None]:
# select the first 7 rows
var = df[['name', 'births']].head(7)
var.plot(kind='bar', x='name', y='births');

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <ul>
    <li>Use the argument <code>rot=60</code> to rotate the names.</li>
    </ul>
</div>

### 2.2.2 Horizontal bars

In [None]:
# select the first 7 rows
var = df[['name', 'births']].head(7)
var.plot(kind='barh', x='name', y='births');

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <p>Switch the color of bars to gray by using the <code>color=</code> option and display the bar for the name Emma in red.</p>
</div>

### 2.2.3 Multiple bars

In [None]:
# evolution of births by names Adam and Eve for years 1880, 1940 and 2000
var = df.loc[df['name'].isin(['Adam', 'Eve']) & (df['year'].isin([1880, 1940, 2000]))]
var = var.pivot_table(index='name', columns='year', values='births')
var.plot(kind='bar', rot=0);

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <p>Display the comparison of births by years 1880, 1940 and 2000 for names Adam and Eve.</p>
</div>

### 2.2.4 Stacked bars

In [None]:
# comparison of births by names Adam and Eve for years 1880, 1940 and 2000
var = df.loc[df['name'].isin(['Adam', 'Eve']) & (df['year'].isin([1880, 1940, 2000]))]
var = var.pivot_table(index='name', columns='year', values='births')
var.plot(kind='bar', stacked=True, rot=0);

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <p>Display the comparison of births by years 1880, 1940 and 2000 for names Adam and Eve. Tip: use <code>transpose()</code> or <code>T</code>.</p>
</div>

## 2.3 Pies

In [None]:
# names by number of letters
df['length'] = df['name'].apply(len)
var = df['length'].value_counts()
var

In [None]:
# names by number of letters
var = var[var > 10_000]
var.plot(kind='pie');

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercises</h3>
    <ul>
        <li>Display the % of values by using <code>autopct='%.1f%%'</code> option.</li>
        <li>Display the % of values by using <code>autopct='%.1f%%'</code> option and modify the appearance of the % by using the <code>textprops={}</code> option where the textprops dictionary is filled with valid <code>Text</code> properties (for instance <code>color</code>, <code>fontweight</code>, see <a href="https://matplotlib.org/3.1.3/api/text_api.html#matplotlib.text.Text">matplotlib.text.Text</a>).</li>
        <li>Test the <code>explode=()</code> option which specifies the fraction of the radius with which to offset each wedge of the pie.</li>
    </ul>
</div>

## 2.4 Distributions

Several methods display distributions.

### 2.4.1 Histograms

In [None]:
# histogram of names length

df["length"] = df["name"].str.len()
var = df["length"]
var.plot(kind='hist');

The `bins` parameter sets the number of buckets (10 by default).

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <p>Display the distribution using the appropriate number of bins for names of length between 2 and 15.</p>
    <p>Display the distribution in step mode by using the <code>histtype='step'</code> option.</p>
</div>

In [None]:
# %load notebook4/ex_05.py

### 2.4.2 Boxes

Display median, quartiles and range of values.

In [None]:
# boxplot of names length
var = df["length"]
var.plot(kind='box');

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <p>Display the boxplot without the outliers by using the <code>showfliers=False</code> option.</p>
</div>

### 2.4.3 Scatter

In [None]:
# select two years
year1 = 2022
year2 = 2023
var = df.loc[df["year"].isin([year1, year2])]
var = var.pivot_table(values="births",
                      index="name",
                      columns="year")
var.head(5)

In [None]:
# scatter plot births for year1 vs year2
var.plot(kind='scatter', x=year1, y=year2);

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercises</h3>
    <ul>
        <li>Modify the size of the points by using the <code>s=</code> option.</li>
        <li>Switch the markers of the points to crosses by using the <code>marker=</code> option.</li>
        <li>Modify the color of the points so that they are blue if the 2023 births are above the 2022 births and red otherwise (you need to build a Series object with "b" and "r" according to the comparison of the 2023 and 2022 columns).</li>
    </ul>
</div>

## 2.5 Seaborn API (optional section)

![image](./images/seaborn.png)

**seaborn** is an extension of **matplotlib** which defines about 25 dedicated statistical graphics for relational, categorical, distribution, regression and matrix plots, as well as multi-plot grids.

We present here some of the available graphics. Most of them include a large number of parameters, refer to documentation: https://seaborn.pydata.org/api.html

In [None]:
# import

import matplotlib.pyplot as plt
import seaborn as sns

### 2.5.1 countplot

Show the counts of observations in each categorical bin using bars.

In [None]:
df2023['initial'] = df2023['name'].str[0]
sns.countplot(data=df2023, x='initial');

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <p>Display the countplot for terminal letters in alphabetical order.</p>
</div>
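
The bar heights of a countplot are plain occurrence counts, so they can be checked with pandas alone. This is a small sketch on a hypothetical list of names, not on `df2023`.

```python
import pandas as pd

# hypothetical names, for illustration only
names = pd.Series(["Olivia", "Oliver", "Emma", "Ethan", "Elijah", "Liam"])

# same derived column as above: the first letter of each name
initials = names.str[0]

# number of names per initial, sorted by letter
counts = initials.value_counts().sort_index()
print(counts)
```

Each value of `counts` corresponds to the height of one bar.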

### 2.5.2 barplot

Show point estimates and confidence intervals as rectangular bars.

In [None]:
# seaborn barplot of births for the years ending in 0 (one per decade)

plt.figure(figsize=(8, 5))
var = df.loc[df['year'] % 10 == 0]
sns.barplot(data=var, x='year', y='births', palette='Blues');

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <ul>
        <li>Display the barplot split by a second variable (e.g., gender) by using the <code>hue=</code> option.</li>
        <li>You may also change the colors for each gender by using the <code>palette=</code> option with a tuple of colors.</li>
    </ul>
</div>
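
By default the height of each bar is the mean of the `y` values in the group, and the error bar is a bootstrapped 95% confidence interval of that mean. The following sketch reproduces the bar heights with pandas on a hypothetical `toy` DataFrame.

```python
import pandas as pd

# hypothetical rows: several names per year
toy = pd.DataFrame({
    "year":   [2000, 2000, 2000, 2010, 2010],
    "births": [10,   20,   60,   5,    15],
})

# what barplot draws by default: the mean of births within each year
bar_heights = toy.groupby("year")["births"].mean()

# total births per year, for comparison
totals = toy.groupby("year")["births"].sum()
print(bar_heights)
print(totals)
```

In other words, the bars above show the average number of births per row of `df`, not the total number of births in the year. Passing a different `estimator=` (for instance `estimator=sum`) should give the totals instead.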

### 2.5.3 displot

Flexibly plot a univariate distribution of observations.

In [None]:
# seaborn displot
# 'ratio' comes from Exercise 5:

# keep only the values strictly between 0.01 and 0.99
s2 = ratio[(ratio > 0.01) & (ratio < 0.99)]
sns.displot(s2, kde=False, bins=10);

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <p>You can add a rugplot to the displot by using the <code>rug=</code> option.</p>
</div>
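
To see roughly what the histogram with `bins=10` computes, the same filtering and binning can be sketched with NumPy. The `ratio` values below are hypothetical stand-ins, since the real Series comes from Exercise 5.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the 'ratio' Series of Exercise 5
ratio = pd.Series([0.0, 0.005, 0.02, 0.25, 0.5, 0.5, 0.75, 0.98, 0.995, 1.0])

# same filter as above: drop the values close to 0 and to 1
s2 = ratio[(ratio > 0.01) & (ratio < 0.99)]

# ten equal-width bins between the smallest and the largest remaining value
counts, edges = np.histogram(s2, bins=10)
print(counts)
print(edges)
```

Ten bins are delimited by eleven edges, and the counts add up to the number of values kept by the filter.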

### 2.5.4 regplot

Plot data and a linear regression model fit.

In [None]:
# select years year1 and year2
year1 = 2023
year2 = 2022
var = df.loc[df["year"].isin([year1, year2])]
var = var.pivot_table(values="births",
                      index="name",
                      columns="year")
var = var.rename(columns={year1: str(year1), year2: str(year2)})  # seaborn bug if column names are integers
sns.regplot(x=str(year1), y=str(year2), data=var);

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <ul>
        <li>Switch the markers to triangles by using the <code>marker=</code> option. Test with other markers.</li>
        <li>Change the attributes of the line by using the <code>line_kws={}</code> option where the line_kws dictionary is filled with valid Line2D properties (for instance, <code>color</code>, <code>linestyle</code>, see <a href="https://matplotlib.org/3.1.3/api/_as_gen/matplotlib.lines.Line2D.html#matplotlib.lines.Line2D">matplotlib.lines.Line2D</a>).</li>
    </ul>
</div>
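
The line drawn by `regplot` is an ordinary least-squares fit of `y` on `x`. As a sketch, the slope and the intercept can be recovered with `np.polyfit` on a hypothetical `toy` table shaped like the pivot above; rows with a missing year are dropped first, which is also what `regplot` does by default.

```python
import numpy as np
import pandas as pd

# hypothetical pivot: one row per name, one column per year
toy = pd.DataFrame({
    "2023": [10, 20, 30, 40, 5],
    "2022": [12, 19, 33, 38, np.nan],  # last name absent in 2022
})

# keep only the names present in both years
complete = toy.dropna()

# degree-1 polynomial fit: 2022 births as a linear function of 2023 births
slope, intercept = np.polyfit(complete["2023"], complete["2022"], deg=1)
print(slope, intercept)
```

The shaded band around the line in the plot is a confidence interval for the fit; it is not computed in this sketch.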

### 2.5.5 stripplot

Draw a scatterplot where one variable is categorical.

In [None]:
# stripplot
df2023['terminal'] = df2023['name'].str[-1].str.upper()  # last letter, in upper case
letters = list('NAEHIYR')
var = df2023.loc[df2023['terminal'].isin(letters)]
sns.stripplot(data=var, x='terminal', y='births');

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <ul>
        <li>Switch x and y in the graphic above.</li>
    </ul>
</div>

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 14 &starf;&starf;&starf;</h3>
    <ul>
        <li>Retrieve the <code>shorten()</code> function that we have defined in exercise 3.</li>
        <li>Make a stripplot chart with the top ten short names from df2023 and their births.</li>
    </ul>
</div>

In [None]:
# %load notebook3/ex_14.py

### 2.5.6 heatmap

Plot rectangular data as a color-encoded matrix.

In [None]:
df2023['initial'] = df2023['name'].str[0]
sns.heatmap(pd.crosstab(df2023['terminal'], df2023['initial']), cmap='Blues');
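
The matrix passed to `heatmap` is a contingency table built by `pd.crosstab`: one row per terminal letter, one column per initial letter, and the number of names in each cell. A minimal sketch on hypothetical names:

```python
import pandas as pd

# hypothetical names, for illustration only
names = pd.Series(["Anna", "Ava", "Alan", "Noah", "Nina", "Ethan"])

initial = names.str[0].rename("initial")
terminal = names.str[-1].str.upper().rename("terminal")

# rows: terminal letters, columns: initial letters, cells: number of names
table = pd.crosstab(terminal, initial)
print(table)
```

Combinations that never occur (for instance a name starting with "E" and ending with "A" here) get a count of 0, which is the lightest color of the `Blues` colormap.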

<div class="alert alert-success">
    <h3><i class="fa fa-edit"></i> Exercise 15 &starf;&starf;&starf;</h3>
    <ul>
        <li>Select the 7 most used initial letters in 2023.</li>
        <li>Select the 7 most used terminal letters in 2023.</li>
        <li>Make a crosstab counting the number of names according to their initial letters and terminal letters in the full df DataFrame, restricted to the 7 most used initial letters and the 7 most used terminal letters in 2023.</li>
        <li>Make a heatmap with Blues colormap from this crosstab.</li>
        <li>Add annotations within the cells with the <code>annot=True</code> option.</li>
        <li>Display the numbers as integers by using the <code>fmt='5d'</code> option.</li>
    </ul>
</div>
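
The `fmt` string is, as far as we can tell, used as a standard Python format specification for each annotated cell. The sketch below only illustrates what `'5d'` means in plain Python, on hypothetical cell values: an integer right-aligned in a field of width 5.

```python
# hypothetical cell values of a crosstab
cells = [7, 42, 12345]

# '5d': decimal integer, padded with spaces to a width of 5 characters
labels = [format(value, "5d") for value in cells]
print(labels)
```

The `d` code only accepts integers (a float raises a `ValueError`), which is fine here because a crosstab of counts holds integers.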

In [None]:
# %load notebook3/ex_15.py

<div class="alert alert-warning">
    <h3><i class="fa fa-edit"></i> Stylization exercise</h3>
    <ul>
        <li>Switch the Blues colormap to another and reproduce the heatmap above.</li>
    </ul>
</div>

<div class="alert alert-warning">
    <h3><i class="fa fa-book"></i> Further reading</h3>
    <ul>
        <li>The <strong>matplotlib</strong> and <strong>seaborn</strong> modules also manage colormaps, i.e. palettes of colors associated with discrete or continuous data:</li>
        <ul>
            <li><a href="https://matplotlib.org/stable/tutorials/colors/colormaps.html">Choosing Colormaps in Matplotlib</a></li>
            <li><a href="https://seaborn.pydata.org/tutorial/color_palettes.html">Choosing color palettes in seaborn</a></li>
        </ul>
        <li>These modules also support other palettes:</li>
        <ul>
            <li><a href="https://colorbrewer2.org/">ColorBrewer</a></li>
            <li><a href="https://xkcd.com/color/rgb/">XKCD Color</a>, see also the blog: <a href="https://blog.xkcd.com/2010/05/03/color-survey-results/">Color Survey Results</a></li>
        </ul>
    </ul>
</div> 