<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>Valérie Roy</span>
<span><img src="media/ensmp-25-alpha.png" /></span>
</div>

In [None]:
import numpy as np
import pandas as pd

##  VIII) pivoting tables

   - let us see an **example**

you have a **dataset** about the sinking of the **Titanic**
   - for each **passenger** you have values such as:
      - the **survival status**
      - the **sex**
      - the **class** (first, second, third)
      - the **age**

   - we **import** the dataset
   - we keep only the **columns of interest** (parameter *usecols*)

In [None]:
df = pd.read_csv('titanic.csv', usecols=['Survived', 'Pclass', 'Sex', 'Age'])

In [None]:
df.head(3)

   - the **number** of passengers in **each** class

In [None]:
df['Pclass'].value_counts()

suppose you want to know
   - the **survival rate** depending on the **sex** and the **class** 

you know
   - the **value** to be **aggregated**: here the **survival status** (column **Survived**)
   - the **aggregation** function: here the mean (*numpy.mean*), which turns the survival status into a rate
   - the **key** to use as the **index** (rows): here the column **Sex**
   - the **key** to use as the **columns**: here the column **Pclass**

   - this is done by the *pandas.DataFrame.pivot_table* **method**

In [None]:
df.pivot_table('Survived', index='Sex', columns='Pclass', aggfunc=np.mean)

   - it returns a **new** data frame 

In [None]:
df1 = df.pivot_table('Survived', index='Sex', columns='Pclass', aggfunc=np.mean)

   - **Pclass** is the name of the **columns index**
   - **Sex** is the name of the **rows index**
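
   - to make the mechanics concrete, here is a minimal sketch on a tiny **made-up** data frame (the six rows are invented for illustration, they are not taken from *titanic.csv*)
   - it suggests that a pivot table with a mean aggregation amounts to a *groupby* on both keys followed by *unstack*
   - the aggregation is given by its name `'mean'`, which pandas accepts in place of the function

```python
import pandas as pd

# toy data, invented for illustration (not the Titanic dataset)
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass':   [1, 3, 1, 3, 3, 1],
    'Survived': [1, 0, 0, 0, 1, 1],
})

# one row per Sex, one column per Pclass, mean of Survived in each cell
pivoted = toy.pivot_table('Survived', index='Sex', columns='Pclass', aggfunc='mean')

# the same computation spelled out: group on both keys, then move Pclass to the columns
grouped = toy.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack('Pclass')

print(pivoted)
print(pivoted.index.name, pivoted.columns.name)
```

   - the two **names** printed at the end are those of the rows index and of the columns index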

   - another example

   - we want to **compute** the **survival rate** by **age group**
   - but the dataset has no **age group** column
   - so we must **pack** the ages into **bins** representing **age groups**

   - we create **bins** of ages and **names** for those **bins**

In [None]:
age_groups = [0, 11, 17, 25, 35, 45, 55, 65, 100]
age_group_names = ['<11', '11-17', '17-25', '25-35', '35-45', '45-55', '55-65', '>65']

   - we create a new **column** where the **Age** is replaced by the **age group**
   - using the **function** *pandas.cut* with the **bins** and the **names**

   - we can **add** the column in our data frame

In [None]:
df['Age group'] = pd.cut(df['Age'], bins=age_groups, labels=age_group_names)
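
   - a minimal sketch of how *pandas.cut* assigns the groups, on a few **made-up** ages (the series `ages` is invented for illustration)
   - by default the intervals are **closed on the right**, so an age of exactly 11 falls in the first bin even though its label reads `'<11'`
   - a **missing** age stays missing

```python
import numpy as np
import pandas as pd

# made-up ages, including a bin edge (11) and a missing value
ages = pd.Series([5, 11, 17.5, 40, 80, np.nan])

bins = [0, 11, 17, 25, 35, 45, 55, 65, 100]
names = ['<11', '11-17', '17-25', '25-35', '35-45', '45-55', '55-65', '>65']

# each age is replaced by the label of the interval (low, high] that contains it
groups = pd.cut(ages, bins=bins, labels=names)

print(groups.tolist())
```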

In [None]:
#df.sort_values(by='Age', ascending=False)

   - we compute a new **data frame** with the **survival rate** by **age group**, for each **sex** and **class**

In [None]:
df.pivot_table('Survived', index=['Sex', 'Pclass'], columns='Age group', aggfunc=np.mean)

   - a **higher** proportion of **women** than of men was **saved** in **all** categories except **children under 11**
   - where $55 \%$ of the boys were saved against $54 \%$ of the girls

   - we **do not need** to **add** a column to the **data frame**: the result of *pandas.cut* can be passed directly as the **columns** key
   
   - here we **pass** the **number** of bins (instead of their edges) and their **names**

In [None]:
col = pd.cut(df['Age'], 3, labels=['child', 'adult', 'old'])

In [None]:
df.pivot_table('Survived', index=['Sex', 'Pclass'], columns=col, aggfunc=np.mean)
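
   - when an **integer** is passed, *pandas.cut* splits the range of the observed values into **equal-width** intervals
   - so `'child'`, `'adult'` and `'old'` name **thirds of the age range**, not conventional age limits
   - a minimal sketch on **made-up** ages; `retbins=True` also returns the computed edges

```python
import pandas as pd

# made-up ages spanning 0 to 90
ages = pd.Series([0, 10, 30, 50, 60, 90])

# 3 equal-width bins over [min, max]; retbins=True also returns the bin edges
col, edges = pd.cut(ages, 3, labels=['child', 'adult', 'old'], retbins=True)

print(edges)          # the lowest edge is pushed slightly below the minimum so that it is included
print(col.tolist())
```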

In [None]:
df = pd.read_csv('titanic.csv')

  - example of changing the **data**
  - for example, we want to **replace** the **numbers** of the classes by their **names**
  
  
  - create the **Boolean mask**
  - always assign through **loc** or **iloc**, **never** through a classical **array**-style (chained) assignment

In [None]:
mask = (df['Pclass'] == 1)  # Boolean Series: True for the rows where Pclass is 1

In [None]:
mask.head()

   - we **select** the rows where the mask is **True**
   - and replace their **Pclass** column **value** by the string **'first'**

In [None]:
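df['Pclass'] = df['Pclass'].astype(object)  # recent pandas versions warn or refuse when a string is written into an integer column
df.loc[mask, 'Pclass'] = 'first'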

In [None]:
df.loc[df['Pclass'] == 2, 'Pclass'] = 'second'
df.loc[df['Pclass'] == 3, 'Pclass'] = 'third'

In [None]:
df.head()
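
   - a minimal sketch of the same replacement on a **made-up** column (the data frame `toy` and the dictionary `names` are invented for illustration)
   - it shows the **mask + loc** approach next to one possible alternative, *Series.map* with a dictionary, which does the three replacements in one pass

```python
import pandas as pd

# made-up class numbers
toy = pd.DataFrame({'Pclass': [1, 3, 2, 1]})

# approach 1: Boolean masks and loc, one class at a time
by_mask = toy.copy()
by_mask['Pclass'] = by_mask['Pclass'].astype(object)  # let the column hold strings
by_mask.loc[by_mask['Pclass'] == 1, 'Pclass'] = 'first'
by_mask.loc[by_mask['Pclass'] == 2, 'Pclass'] = 'second'
by_mask.loc[by_mask['Pclass'] == 3, 'Pclass'] = 'third'

# approach 2: map every number to its name with a dictionary
names = {1: 'first', 2: 'second', 3: 'third'}
by_map = toy['Pclass'].map(names)

print(by_mask['Pclass'].tolist())
print(by_map.tolist())
```

   - with *map*, a value **absent** from the dictionary becomes missing, so the dictionary should cover every class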