<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>Valérie Roy</span>
<span><img src="media/ensmp-25-alpha.png" /></span>
</div>

## II) $\texttt{pandas.DataFrame}$


   - **two-dimensional** arrays
   - where **rows** and **columns** are indexed
   - **missing** values are replaced by $\texttt{numpy.nan}$
   - **scalar** values are broadcast
   
   
   
   - can be **built** in **several ways** or **read** from files (the most usual way)

In [None]:
import pandas as pd

### 1) creating $\texttt{pandas.DataFrame}$ from $\texttt{pandas.Series}$
   - the series are **aligned** on their **index** (the indexes do not need to be identical)

   - we create three **series of data**
      - **distance**, **lowest_temp** and **highest_temp** related to the solar system
   - series are **indexed by** the **names** of the planets
   - some values are **missing**
      - the **lowest** and the **highest** temperature of **neptune**, **saturn** and **uranus**
   - all planets are from the **solar system**

In [None]:
# distances to the sun, relative to Earth's (Earth = 1)
distance = pd.Series([0.387, 0.723, 30, 1., 5.203, 1.523, 9.6, 19.19],
                     index=['Mercury', 'Venus', 'Neptune', 'Earth', 'Jupiter', 'Mars', 'Saturn', 'Uranus'])

lowest_temp = pd.Series([-200.0, 446.0,  -90.0, -125.0, -140.0],
                        index=['Mercury', 'Venus', 'Earth', 'Jupiter', 'Mars'])

highest_temp = pd.Series([430.0, 490.0, 60.0, 17.0, 20.0],
                         index=['Mercury', 'Venus', 'Earth', 'Jupiter', 'Mars'])


we **group** the series using a **python dict** 
   - the **keys** of the **dict** become the **names** of the columns
   - the **series** are the **values**

In [None]:
planets = pd.DataFrame({'distance': distance,
                        'lowest temperature': lowest_temp, 
                        'highest temperature': highest_temp, 
                        'origin':'solar system'})

   - for the **column** $\texttt{'origin'}$
   - we give the **single** value $\texttt{'solar system'}$

   - we **see** the first elements

In [None]:
planets.head()

   - the **single** value is **broadcast** to the **entire column** (here *solar system*)
   - **missing** values are **replaced by** $\texttt{numpy.nan}$ (lowest and highest temperatures of Neptune, Saturn and Uranus)

you can **retrieve** columns by **name**
   - for a **single** column you obtain a **reference** to the **series**, **not** a **copy** (in recent pandas versions with Copy-on-Write it behaves like a copy)

In [None]:
planets[['distance', 'lowest temperature', 'highest temperature']]

In [None]:
planets['distance'] # relative distance where earth is 1

   - you can use a **column name** as an **attribute** (when it is a valid python identifier)
   - it **won't** work for $\texttt{lowest temperature}$ (the name contains a space)

In [None]:
planets.distance
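
A small sketch of when attribute access is and is not usable. The frame and the extra column `count` are hypothetical, added only to show a name that collides with an existing $\texttt{pandas.DataFrame}$ method.

```python
import pandas as pd

df = pd.DataFrame({'distance': [1.0, 1.523],
                   'lowest temperature': [-90.0, -140.0],
                   'count': [1, 2]},            # hypothetical column, same name as a method
                  index=['Earth', 'Mars'])

# 'distance' is a valid identifier: both forms give the same Series
same = df.distance.equals(df['distance'])

# 'lowest temperature' contains a space: only the bracket form can be written
lowest = df['lowest temperature']

# 'count' is shadowed by the DataFrame.count method: df.count is not the column
attr_is_method = callable(df.count)
```

The bracket form works in every case, which is why it is the safer choice in scripts.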

   - we can give a name to the data frame (a plain python attribute, which pandas does not carry over to derived data frames)

In [None]:
planets.name = 'planets'

 - we can give a name to the index

In [None]:
planets.index.name = 'planets names'

In [None]:
planets.head()

   - you can create a **data frame** from a **dict of dicts**

In [None]:
planets_2 = pd.DataFrame(
    {'distance': {'Mercury' : 0.387, 'Venus' : 0.723, 'Neptune' : 30, 'Earth' : 1},
     'lowest temperature': {'Mercury' : -200, 'Venus': 446, 'Earth' : -90},})
                        

In [None]:
planets_2
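
As a quick check of how a **dict of dicts** is interpreted, here is a sketch on a reduced, illustrative version of the data: the outer keys become columns, the inner keys become the index, and a missing inner key gives NaN.

```python
import pandas as pd

df = pd.DataFrame(
    {'distance': {'Mercury': 0.387, 'Earth': 1.0, 'Neptune': 30.0},
     'lowest temperature': {'Mercury': -200.0, 'Earth': -90.0}})

columns = list(df.columns)     # outer keys
rows = sorted(df.index)        # inner keys (sorted here to ignore row order)

# 'Neptune' is absent from the inner dict of 'lowest temperature'
neptune_missing = pd.isna(df.loc['Neptune', 'lowest temperature'])
```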

### 2) creating $\texttt{pandas.DataFrame}$ by specifying parameters $\texttt{data}$, $\texttt{columns}$ and $\texttt{index}$

In [None]:
planets_1 = pd.DataFrame([[1.000, -90.0, 60.0],
                          [5.203, -125.0, 17.0],
                          [1.523, -140.0, 20.0],
                          [0.387, -200.0, 430.0],
                          [30.0],
                          [9.600],
                          [ 19.190],
                          [ 0.723, 446.0, 490.0]],
                         
                         index= ['Earth', 'Jupiter', 'Mars', 'Mercury', 'Neptune', 'Saturn', 'Uranus', 'Venus'],
                         
                         columns=['distance', 'lowest temperature', 'highest temperature'])

In [None]:
planets_1.head(3)

in a **data frame**
   - the **rows** and the **columns** are **indexed**
   - the type is $\texttt{pandas.Index}$ (for short)

In [None]:
type(planets_1.index), type(planets_1.columns)

In [None]:
pd.Index

   - you can create an object **Index**
   - and pass it to the data frame **constructor**

In [None]:
index_rows = pd.Index(['Earth', 'Jupiter', 'Mars','Mercury',
                       'Neptune', 'Saturn', 'Uranus', 'Venus'])

In [None]:
index_cols = pd.Index(['distance', 'lowest temperature',
                     'highest temperature'])

In [None]:
planets_3 = pd.DataFrame([[1.000, -90.0, 60.0],
                          [5.203, -125.0, 17.0],
                          [1.523, -140.0, 20.0],
                          [0.387, -200.0, 430.0],
                          [30.0],
                          [9.600],
                          [ 19.190],
                          [ 0.723, 446.0, 490.0]],                         
                         index = index_rows,
                         columns = index_cols)

### 3) information on $\texttt{pandas.DataFrame}$

   - you can access the **index**

In [None]:
planets.index

   - you can access the **columns names**

In [None]:
planets.columns

   - you access **columns** by key, as with a dictionary

In [None]:
planets['distance']

   - you can **transpose** a $\texttt{pandas.DataFrame}$

In [None]:
planets.T # now columns are rows

   - you can access the **underlying** two-dimensional $\texttt{numpy.ndarray}$

In [None]:
planets.to_numpy()

   - you can get general **statistics** on the **numerical** columns

In [None]:
planets.describe()

   - you can access **information** on **columns**
   - *numbers of non-null elements, types, memory usage*

In [None]:
planets.info()

### 4) accessing elements in a $\texttt{pandas.DataFrame}$ using $\texttt{pandas.DataFrame.loc}$

the classical way
   - **standard** (python and numpy) **indexing operators** **[]** and attribute operator **.**
   - are **available** and **intuitive**

   
However
   - using **standard operators** has  **optimization** limits
   - for **production code** use the **optimized pandas data access methods** 
   
   
   
http://pandas.pydata.org/pandas-docs/stable/indexing.html

#### accessing elements using **labels** and $\texttt{pandas.DataFrame.loc}$
   - $\texttt{df.loc[row_label]}$
   - $\texttt{df.loc[row_label, column_label]}$
   

$\texttt{row_label}$ and $\texttt{column_label}$ can be:
   - **labels**
   - **list of labels**
   - **slices** with labels
   - **masks** (**Boolean array**) 
   

when $\texttt{row_label}$ and $\texttt{column_label}$ are **labels**
   - it returns a value

In [None]:
planets.loc['Earth', 'distance']

when only one of $\texttt{row_label}$ and $\texttt{column_label}$ is a single **label**
   - it returns a $\texttt{pandas.Series}$

In [None]:
planets.loc['Earth']

In [None]:
type(planets.loc['Earth'])

In [None]:
planets.loc[['Earth'], 'distance']

In [None]:
type(planets.loc[['Earth'], 'distance'])

In [None]:
planets.loc['Earth', ['distance']]

In [None]:
type(planets.loc['Earth', ['distance']])

when $\texttt{row_label}$ and $\texttt{column_label}$ are **lists of labels**
   - it returns a $\texttt{pandas.DataFrame}$

In [None]:
planets.loc[['Earth']]

In [None]:
type(planets.loc[['Earth']])

In [None]:
planets.loc[['Earth', 'Mars']]
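
The rules above on what $\texttt{loc}$ returns can be summarised in one cell. This is a sketch on a small illustrative frame; the variable names are chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'distance': [1.0, 5.203, 1.523],
                   'highest temperature': [60.0, 17.0, 20.0]},
                  index=['Earth', 'Jupiter', 'Mars'])

value = df.loc['Earth', 'distance']           # label, label -> scalar
row = df.loc['Earth']                         # label only   -> Series (one row)
column_part = df.loc[['Earth'], 'distance']   # list, label  -> Series
frame = df.loc[['Earth'], ['distance']]       # list, list   -> DataFrame

kinds = [type(x).__name__ for x in (row, column_part, frame)]
```

Wrapping a single label in a list is thus a simple way to keep a $\texttt{pandas.DataFrame}$ even when only one row or one column is selected.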

   - all columns $\texttt{':'}$
   - rows from $\texttt{'Earth'}$ to $\texttt{'Mars'}$, both **included**

In [None]:
planets.loc['Earth':'Mars', :]

   - all rows $\texttt{':'}$
   - columns from $\texttt{distance}$ to $\texttt{highest temperature}$ **included**

In [None]:
planets.loc[:, 'distance':'highest temperature']

   - $\texttt{planets}$ **farther** from the sun than the Earth

In [None]:
planets.loc[planets.loc[:, 'distance'] > 1]
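
Masks can also be combined and used together with a column selection. The sketch below assumes a small illustrative frame and an arbitrary threshold of 50 for the temperature.

```python
import pandas as pd

df = pd.DataFrame({'distance': [0.387, 1.0, 5.203, 1.523],
                   'highest temperature': [430.0, 60.0, 17.0, 20.0]},
                  index=['Mercury', 'Earth', 'Jupiter', 'Mars'])

farther = df['distance'] > 1            # boolean Series aligned on the index
mild = df['highest temperature'] < 50   # arbitrary threshold, for illustration

# masks are combined element-wise with & (and), | (or), ~ (not)
selected = df.loc[farther & mild, ['distance']]
```

Python's `and` / `or` keywords do not work element-wise on series, hence the use of `&` and `|`.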

### 5) accessing elements in a $\texttt{pandas.DataFrame}$ using **position** and  $\texttt{pandas.DataFrame.iloc}$

#### accessing elements using $\texttt{pandas.DataFrame.iloc}$
   - $\texttt{df.iloc[row_id]}$
   - $\texttt{df.iloc[row_id, column_id]}$
   

$\texttt{row_id}$ and $\texttt{column_id}$ can be:
   - **integer**
   - **list of integers**
   - **slices**
   - **masks** (**Boolean array**)  

In [None]:
planets_1 = pd.DataFrame([[1.000, -90.0, 60.0],
                          [5.203, -125.0, 17.0],
                          [1.523, -140.0, 20.0],
                          [0.387, -200.0, 430.0]],
                         
                         index= ['Earth', 'Jupiter', 'Mars', 'Mercury'],
                         
                         columns=['distance', 'lowest temperature', 'highest temperature'])

   - the first row

In [None]:
planets.iloc[0] # pandas.Series

   - the first and the third rows

In [None]:
planets.iloc[[0, 2]] # pandas.DataFrame

   - $\texttt{[1, 3]}$ the second and the fourth columns
   - $\texttt{[0, 2]}$ of the first and the third rows

In [None]:
planets.iloc[[0, 2], [1, 3]] # pandas.DataFrame

   - first row, second column (as a float)

In [None]:
planets.iloc[0, 1] # scalar (float)

   - first row, second column (as a $\texttt{pandas.DataFrame}$)

In [None]:
planets.iloc[[0], [1]] # pandas.DataFrame

   - the **rows** from **position** $0$ to **position** $2$ **excluded** (*python slicing rules*)
   - the **columns** from **position** $1$ to position $3$ **excluded** (*python slicing rules*)

In [None]:
planets.iloc[0:2, 1:3] # pandas.DataFrame

   - all rows $\texttt{':'}$
   - columns from $1$ to $3$ excluded

In [None]:
planets.iloc[:, 1:3]

   - all columns $\texttt{':'}$
   - rows from $0$ to $3$ excluded

In [None]:
planets.iloc[0:3, :]
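
The difference between the two slicing rules (end **included** with labels, end **excluded** with positions) is easy to overlook, so here is a side-by-side sketch on an illustrative frame.

```python
import pandas as pd

df = pd.DataFrame({'distance': [1.0, 5.203, 1.523, 0.387]},
                  index=['Earth', 'Jupiter', 'Mars', 'Mercury'])

by_label = df.loc['Earth':'Mars']   # label slice: 'Mars' is included
by_position = df.iloc[0:2]          # position slice: position 2 is excluded
```

Here `by_label` has three rows while `by_position` has two.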

   - planets farther from the sun than the Earth
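
One possible way to write this selection with $\texttt{iloc}$ is sketched below on an illustrative frame. It assumes the mask is first converted to a plain Boolean array with $\texttt{to_numpy()}$, since $\texttt{iloc}$ works on positions and, to my knowledge, rejects a Boolean $\texttt{pandas.Series}$.

```python
import pandas as pd

df = pd.DataFrame({'distance': [0.387, 1.0, 5.203, 1.523]},
                  index=['Mercury', 'Earth', 'Jupiter', 'Mars'])

# one Boolean per row, as a numpy array (no index attached)
mask = (df['distance'] > 1).to_numpy()

farther = df.iloc[mask]
```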

### 6) changing the $\texttt{pandas.DataFrame}$ $\texttt{index}$

   - $\texttt{pandas.DataFrame.set_index(new_column)}$
   - $\texttt{pandas.DataFrame.reset_index()}$
   - direct assignment

In [None]:
planets = pd.DataFrame([[1.000, -90.0, 60.0],
                          [5.203, -125.0, 17.0],],                         
                         index= ['Earth', 'Jupiter'],                         
                         columns=['distance', 'lowest temperature', 'highest temperature'])

   - with $\texttt{pandas.DataFrame.set_index}$ you **index** by **another** column

In [None]:
planets.set_index('distance')

   - with $\texttt{pandas.DataFrame.reset_index}$ the **index** becomes a **normal** $\texttt{pandas.DataFrame}$ column

In [None]:
planets.reset_index()

   - with direct assignment you create a new index

In [None]:
planets

In [None]:
planets.index = ['la terre', 'jupiter']
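
A point worth checking is which of the three approaches modify the data frame itself. The sketch below rebuilds the same two-row frame under the name `df` (a name chosen for this example): with default arguments, $\texttt{set_index}$ and $\texttt{reset_index}$ return new data frames, whereas direct assignment changes the frame in place.

```python
import pandas as pd

df = pd.DataFrame([[1.000, -90.0, 60.0],
                   [5.203, -125.0, 17.0]],
                  index=['Earth', 'Jupiter'],
                  columns=['distance', 'lowest temperature', 'highest temperature'])

by_distance = df.set_index('distance')   # new frame, 'distance' is now the index
flat = df.reset_index()                  # new frame, old index stored in column 'index'

# df is left untouched by the two calls above
unchanged = list(df.index)

# direct assignment modifies df itself
df.index = ['la terre', 'jupiter']
```

To keep the result of $\texttt{set_index}$ or $\texttt{reset_index}$, assign it to a variable, as done here.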

### 7) sorting a $\texttt{pandas.DataFrame}$ by **columns**

In [None]:
df = pd.DataFrame({ 'col1':  [19, 3, 26, 46, 4, 19],
                    'col2': ['h', 'w', 'y', 'd', 'm', 'w'],
                    'col3':  [8.45, 19.23, 89.56, 17.5, 54.76, 89.56]})

In [None]:
df.sort_values(by='col1', ascending=False)

   - **first** $\texttt{col1}$ is **sorted**
   - then, for **identical values**, $\texttt{col2}$ is sorted 

In [None]:
df.sort_values(by=['col1', 'col2'], ascending=False)
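
The sort direction can also be given per column by passing a list to $\texttt{ascending}$. The sketch below reuses the same data and sorts $\texttt{col1}$ in descending order while breaking ties on $\texttt{col2}$ in ascending order.

```python
import pandas as pd

df = pd.DataFrame({'col1': [19, 3, 26, 46, 4, 19],
                   'col2': ['h', 'w', 'y', 'd', 'm', 'w'],
                   'col3': [8.45, 19.23, 89.56, 17.5, 54.76, 89.56]})

# one Boolean per column listed in `by`
mixed = df.sort_values(by=['col1', 'col2'], ascending=[False, True])

# the two rows with col1 == 19 come out with col2 in alphabetical order
ties = mixed.loc[mixed['col1'] == 19, 'col2'].tolist()
```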

   - you can sort only a few elements ($\texttt{pandas.DataFrame.nlargest()}$, $\texttt{pandas.DataFrame.nsmallest()}$)
   - (*it might be faster on large datasets*)

In [None]:
df.nlargest(2, 'col3')

In [None]:
df.nsmallest(3, 'col1')
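
To relate these two methods to $\texttt{sort_values}$, the sketch below compares them with a full sort followed by $\texttt{head}$ on the same data. Only the values are compared, since rows with equal values may come out in a different order.

```python
import pandas as pd

df = pd.DataFrame({'col1': [19, 3, 26, 46, 4, 19],
                   'col2': ['h', 'w', 'y', 'd', 'm', 'w'],
                   'col3': [8.45, 19.23, 89.56, 17.5, 54.76, 89.56]})

top2 = df.nlargest(2, 'col3')['col3'].tolist()
top2_sorted = df.sort_values(by='col3', ascending=False).head(2)['col3'].tolist()

low3 = df.nsmallest(3, 'col1')['col1'].tolist()
low3_sorted = df.sort_values(by='col1').head(3)['col1'].tolist()
```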

### 8) applying vectorized functions to $\texttt{pandas.DataFrame}$

   - $\texttt{pandas.DataFrame}$ columns are stored in $\texttt{numpy.ndarray}$
   - NumPy **ufuncs** can be **applied** to $\texttt{pandas.Series}$
   - **row** and **column** labels are preserved

In [None]:
import numpy as np

df = pd.DataFrame(np.linspace(0, 2*np.pi, 100), columns=['angle'])

In [None]:
df.head()

In [None]:
df['sinus'] = np.sin(df['angle'])  # apply the ufunc to the column, not to the whole data frame

In [None]:
df.head(3)

In [None]:
df['cosinus'] = np.cos(df['angle'])

In [None]:
df.head(3)

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
df[['sinus', 'cosinus']].plot()

In [None]:
( np.power(df['sinus'], 2) + np.power(df['cosinus'], 2) )[0:3]
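
Rather than reading the first values by eye, the identity $\sin^2 + \cos^2 = 1$ can be checked on the whole column. This is a sketch that rebuilds the same frame and uses $\texttt{numpy.allclose}$ to allow for floating-point rounding; the names `total`, `identity_holds` and `same_index` are introduced for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.linspace(0, 2 * np.pi, 100), columns=['angle'])
df['sinus'] = np.sin(df['angle'])
df['cosinus'] = np.cos(df['angle'])

# arithmetic on columns is vectorized and returns a pandas.Series
total = df['sinus'] ** 2 + df['cosinus'] ** 2

identity_holds = np.allclose(total, 1.0)    # True up to rounding errors
same_index = total.index.equals(df.index)   # row labels are preserved
```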