# Introduction to Pandas


We have seen that NumPy is the reference library for manipulating multidimensional arrays (especially for matrix computation).


However, in data science the data we manipulate very often comes with labels, or we would at least like to attach labels to the arrays we work with.

The Pandas library was developed for this purpose.



 
Consider the following list, which contains only numeric values (integers here):


In [2]:
L = [12,13,16,11,16]
L

[12, 13, 16, 11, 16]

- Now we want to associate a label with each value, for example a first name with each entry of the list.

- In Pandas, this is done by adding an object called an index.
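
As a minimal sketch of this idea, the list `L` can be wrapped in a `Series` whose index holds one first name per value. The names used here are hypothetical and only serve as an illustration.

```python
import pandas as pd

# The raw values from the list above
L = [12, 13, 16, 11, 16]

# Hypothetical first names, one per value (not part of the original data)
names = ['ana', 'ben', 'cleo', 'dan', 'emma']

# The index attaches a label to each entry
grades = pd.Series(L, index=names)

print(grades.loc['cleo'])  # value stored under the label 'cleo'
```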

Pandas has two main data structures:
- `Series` for one-dimensional data.
- `DataFrame` for two-dimensional data.


Note that much of the complexity of Pandas comes from mastering the notion of an index.

An index is a very powerful object with two major characteristics:
- it provides optimized access to the data in an array

- it provides automatic alignment during operations. Suppose we want to add two DataFrames: the addition is only carried out for elements that share the same label, and labels present in only one operand produce missing values (NaN).
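
The alignment behaviour can be sketched with two small DataFrames that only partly overlap. The frames `left` and `right` and their contents are hypothetical, chosen so that a single cell is shared.

```python
import pandas as pd

# Two small DataFrames that share only the row label 'y' and the column 'b'
left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
right = pd.DataFrame({'b': [10, 20], 'c': [30, 40]}, index=['y', 'z'])

# Labels are aligned before adding; cells missing from either operand become NaN
total = left + right
print(total)
```

Only the cell at row `y`, column `b` exists on both sides, so it is the only one that receives a numeric result.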



In [5]:
import pandas as pd
s = pd.Series([20,30,40,50],index=['eve','bill','liz','bob'])

In [6]:
s

eve     20
bill    30
liz     40
bob     50
dtype: int64

In [7]:
s.values

array([20, 30, 40, 50])

In [8]:
s.index

Index(['eve', 'bill', 'liz', 'bob'], dtype='object')

In [9]:
s.loc['eve']

20

In [11]:
s['eve'] # Be careful with this notation: prefer the explicit .loc

20
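
One reason the plain `[]` notation deserves care: when the index itself is made of integers, a label and a position can disagree. The sketch below uses a hypothetical Series `t` and sticks to the explicit `.loc` (label) and `.iloc` (position) accessors to show the difference.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions
t = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = t.loc[0]      # the value whose label is 0
by_position = t.iloc[0]  # the first value, whatever its label

print(by_label, by_position)
```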

In [12]:
s.loc['eve':'liz'] # 'liz' is included

eve     20
bill    30
liz     40
dtype: int64

In [13]:
s = pd.Series([20,30,40,50],index=['eve','liz','bill','bob'])

In [14]:
s

eve     20
liz     30
bill    40
bob     50
dtype: int64

In [15]:
s.loc['eve':'liz']

eve    20
liz    30
dtype: int64

In [22]:
animaux = ["chien", 'chat', 'chat','chien', 'poisson']
proprio = ['eve', 'bob', 'eve', 'bill', 'liz']

In [23]:
s = pd.Series(animaux, index=proprio)

In [25]:
print(s)

eve       chien
bob        chat
eve        chat
bill      chien
liz     poisson
dtype: object


In [26]:
s.loc['eve':'liz'] # duplicated label in an unsorted index: slicing fails

KeyError: "Cannot get left slice bound for non-unique label: 'eve'"

In [27]:
# Fix: sort the index
s = s.sort_index()

In [28]:
s

bill      chien
bob        chat
eve       chien
eve        chat
liz     poisson
dtype: object

In [29]:
s.loc['eve':'liz']

eve      chien
eve       chat
liz    poisson
dtype: object
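
Before slicing by label, it can help to inspect the index. The sketch below rebuilds the same data under the hypothetical name `animals` and uses the `is_unique` and `is_monotonic_increasing` properties of the index to show what sorting changes.

```python
import pandas as pd

animals = pd.Series(['chien', 'chat', 'chat', 'chien', 'poisson'],
                    index=['eve', 'bob', 'eve', 'bill', 'liz'])

# Label slicing with a duplicated bound needs a sorted (monotonic) index
print(animals.index.is_unique)                # False: 'eve' appears twice
print(animals.index.is_monotonic_increasing)  # False: labels are not sorted

ordered = animals.sort_index()
print(ordered.index.is_monotonic_increasing)  # True after sorting
print(ordered.loc['eve':'liz'])
```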

In [32]:
s.iloc[0]
s.iloc[4]

'poisson'

In [31]:
s.iloc?

Type:        property
String form: <property object at 0x7fc86d2e62c8>
Docstring:
Purely integer-location based indexing for selection by position.

``.iloc[]`` is primarily integer position based (from ``0`` to
``length-1`` of the axis), but may also be used with a boolean
array.

Allowed inputs are:

- An integer, e.g. ``5``.
- A list or array of integers, e.g. ``[4, 3, 0]``.
- A slice object with ints, e.g. ``1:7``.
- A boolean array.
- A ``callable`` function with one argument (the calling Series, DataFrame
  or Panel) and that returns valid output for indexing (one of the above)

``.iloc`` will raise ``IndexError`` if a requested indexer is
out-of-bounds, except *slice* indexers which allow out-of-bounds
indexing (this conforms with python/numpy *slice* semantics).

See more at :ref:`Selection by Position <indexing.integer>`


In [33]:
s.iloc[1:3]

bob     chat
eve    chien
dtype: object

In [34]:
# Advanced indexing with a boolean mask
s.loc[s=='chien']

bill    chien
eve     chien
dtype: object

In [39]:
s.loc[(s == 'chien') | (s=='poisson')]

bill      chien
eve       chien
liz     poisson
dtype: object

In [40]:
s.loc[(s=='chien') | (s=='poisson')]='autre'

In [41]:
s

bill    autre
bob      chat
eve     autre
eve      chat
liz     autre
dtype: object
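
When several values have to be matched, `Series.isin` is one way to build the same kind of mask without chaining comparisons. This is a sketch on a copy of the data, under the hypothetical names `pets` and `relabelled`.

```python
import pandas as pd

pets = pd.Series(['chien', 'chat', 'chien', 'chat', 'poisson'],
                 index=['bill', 'bob', 'eve', 'eve', 'liz'])

# isin builds the same boolean mask as chaining == comparisons with |
mask = pets.isin(['chien', 'poisson'])

# Work on a copy so the original Series is left untouched
relabelled = pets.copy()
relabelled.loc[mask] = 'autre'
print(relabelled)
```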

In [56]:
# Index alignment
s1 = pd.Series([1,2,3], index=list('abc'))
s2 = pd.Series([5,6,7], index=list('acd'))

In [57]:
s1


a    1
b    2
c    3
dtype: int64

In [58]:
s2

a    5
c    6
d    7
dtype: int64

In [59]:
s1 + s2 # NaN (not a number) wherever a label is missing from one of the Series

a    6.0
b    NaN
c    9.0
d    NaN
dtype: float64

In [60]:
s1.add(s2, fill_value=50)

a     6.0
b    52.0
c     9.0
d    57.0
dtype: float64
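
To see what `fill_value` does, the same result can be rebuilt by hand: reindex both Series on the union of their labels, filling the gaps with 50, then add. This is only a sketch of the idea, not a description of how Pandas implements `add` internally.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list('abc'))
s2 = pd.Series([5, 6, 7], index=list('acd'))

# Every label that appears in at least one of the two Series
labels = s1.index.union(s2.index)

# Fill the gaps with 50 on each side, then add label by label
manual = s1.reindex(labels, fill_value=50) + s2.reindex(labels, fill_value=50)

print(manual)
```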

## DataFrames


In [61]:
import numpy as np
import pandas as pd

In [85]:
prenoms = ['liz','bob','bill','eve']
           

In [86]:
age = pd.Series([25,30,35,40],index=prenoms)
taille = pd.Series([160,175,170,180],index=prenoms)
sexe = pd.Series(list('fhhf'),index=prenoms)
df = pd.DataFrame({'age': age, 'taille':taille, 'Sexe':sexe})

In [87]:
df

      age  taille Sexe
liz    25     160    f
bob    30     175    h
bill   35     170    h
eve    40     180    f


In [88]:
df.index

Index(['liz', 'bob', 'bill', 'eve'], dtype='object')

In [89]:
df.columns

Index(['age', 'taille', 'Sexe'], dtype='object')

In [92]:
tableau = df.values
tableau

array([[25, 160, 'f'],
       [30, 175, 'h'],
       [35, 170, 'h'],
       [40, 180, 'f']], dtype=object)
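
The `dtype=object` above comes from mixing numbers and strings in a single array. The sketch below rebuilds the same table and uses `DataFrame.dtypes` and `DataFrame.to_numpy` to compare the whole table with its numeric columns only; the variable name `numeric` is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35, 40],
                   'taille': [160, 175, 170, 180],
                   'Sexe': list('fhhf')},
                  index=['liz', 'bob', 'bill', 'eve'])

# Each column keeps its own dtype inside the DataFrame
print(df.dtypes)

# Mixing numbers and strings forces a generic object array
print(df.to_numpy().dtype)

# Restricting to the numeric columns gives a numeric array
numeric = df[['age', 'taille']].to_numpy()
print(numeric.dtype)
```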

In [91]:
print(type(tableau))

<class 'numpy.ndarray'>


In [93]:
df.tail(2)

      age  taille Sexe
bill   35     170    h
eve    40     180    f


In [94]:
df.head()

      age  taille Sexe
liz    25     160    f
bob    30     175    h
bill   35     170    h
eve    40     180    f


In [96]:
df.describe()

             age      taille
count   4.000000    4.000000
mean   32.500000  171.250000
std     6.454972    8.539126
min    25.000000  160.000000
25%    28.750000  167.500000
50%    32.500000  172.500000
75%    36.250000  176.250000
max    40.000000  180.000000


In [97]:
df.loc?

Type:        property
String form: <property object at 0x7fc86d2e6368>
Docstring:
Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

      start and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- A ``callable`` function with one argument (the calling Series, DataFrame
  or Panel) and that returns valid output for indexing (one of the above)

See more at :ref:`Selection by Label <indexing.label>`

See Also
--------
DataFrame.at : Access a single value for a row/column label pair
DataFrame.iloc : Access gro
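
The input types listed in this docstring can be tried on the small table built earlier. The sketch below rebuilds it so that the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35, 40],
                   'taille': [160, 175, 170, 180],
                   'Sexe': list('fhhf')},
                  index=['liz', 'bob', 'bill', 'eve'])

# A single cell: row label, then column label
print(df.loc['bob', 'age'])

# A label slice of rows (both bounds included) and a list of columns
print(df.loc['bob':'bill', ['age', 'taille']])

# A boolean mask on the rows
print(df.loc[df['age'] > 30])
```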

In [101]:
import numpy as np
import pandas as pd

In [102]:
df1 = pd.DataFrame(np.random.randint(1,10,size=(2,2)),columns=list('ab'),index=list('xy'))

In [103]:
df2 = pd.DataFrame(np.random.randint(1,10,size=(2,2)),columns=list('ab'),index=list('zt'))

In [106]:
df1

   a  b
x  6  5
y  4  4


In [107]:
df2

   a  b
z  4  3
t  2  8


In [108]:
pd.concat([df1,df2])

   a  b
x  6  5
y  4  4
z  4  3
t  2  8



In [116]:
df1 = pd.DataFrame(np.random.randint(1,10,size=(2,2)),columns=list('ab'),index=list('xy'))
df2 = pd.DataFrame(np.random.randint(1,10,size=(2,2)),columns=list('cd'),index=list('xy'))

In [121]:
pd.concat([df1,df2],axis=1)

   a  b  c  d
x  1  3  3  9
y  9  1  7  8
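
In the example above both DataFrames have exactly the same row labels. When they only partly overlap, the `join` parameter of `pd.concat` decides which rows are kept. A sketch with hypothetical frames `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
right = pd.DataFrame({'c': [5, 6], 'd': [7, 8]}, index=['y', 'z'])

# Default (outer) join: all row labels are kept, gaps become NaN
outer = pd.concat([left, right], axis=1)

# Inner join: only the row labels present in both DataFrames are kept
inner = pd.concat([left, right], axis=1, join='inner')

print(outer)
print(inner)
```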


In [123]:
df1 = pd.DataFrame({'personnel':['Bob','Lisa','Sue'],'groupe':['SAF','R&D','RH']})
df1

  personnel groupe
0       Bob    SAF
1      Lisa    R&D
2       Sue     RH


In [127]:
df2 = pd.DataFrame({'personnel':['Lisa','Bob','Sue'],'date embauche':[2004,2008,2014]})
df2

  personnel  date embauche
0      Lisa           2004
1       Bob           2008
2       Sue           2014


In [129]:
print(df1)
df1

  personnel groupe
0       Bob    SAF
1      Lisa    R&D
2       Sue     RH


  personnel groupe
0       Bob    SAF
1      Lisa    R&D
2       Sue     RH


In [130]:
print(df2)

  personnel  date embauche
0      Lisa           2004
1       Bob           2008
2       Sue           2014


In [131]:
pd.merge(df1,df2)

  personnel groupe  date embauche
0       Bob    SAF           2008
1      Lisa    R&D           2004
2       Sue     RH           2014
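
Here `pd.merge` found the shared column `personnel` on its own. The key and the kind of join can also be stated explicitly with the `on` and `how` parameters. The sketch below uses a hypothetical hiring table in which one person is missing, to show what a left join keeps.

```python
import pandas as pd

staff = pd.DataFrame({'personnel': ['Bob', 'Lisa', 'Sue'],
                      'groupe': ['SAF', 'R&D', 'RH']})

# Hypothetical hiring table in which Sue is missing
hired = pd.DataFrame({'personnel': ['Lisa', 'Bob'],
                      'date embauche': [2004, 2008]})

# Name the key explicitly and keep every row of the left table
merged = pd.merge(staff, hired, on='personnel', how='left')
print(merged)
```

Sue is kept in the result, with a missing value (NaN) as hiring date.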


In [132]:
import seaborn as sns

In [141]:
ti = sns.load_dataset('titanic').loc[:,['survived','sex','class']]

In [142]:
ti.shape

(891, 3)

In [145]:
ti.loc[:,'survived'].mean()

0.3838383838383838

In [146]:
ti.loc[ti.loc[:,'class']=='First','survived'].mean()

0.6296296296296297

In [167]:
ti.groupby('class').median() # nuisance-column principle: columns for which a median makes no sense (e.g. 'sex') are dropped; recent pandas versions may require numeric_only=True

        survived
class           
First          1
Second         0
Third          0


In [163]:
g = ti.groupby(['class','sex']).mean()

In [164]:
g

               survived
class  sex             
First  female  0.968085
       male    0.368852
Second female  0.921053
       male    0.157407
Third  female  0.500000
       male    0.135447


In [165]:
g.index

MultiIndex(levels=[['First', 'Second', 'Third'], ['female', 'male']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['class', 'sex'])
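
Rows of a MultiIndex can be selected with `.loc`, using either a tuple of labels or a single outer label. To stay self-contained (no dataset download), the sketch below uses a tiny hypothetical table `toy` with the same column names as the Titanic extract; its values are made up.

```python
import pandas as pd

# Tiny hypothetical sample with the same columns as the Titanic extract
toy = pd.DataFrame({
    'survived': [1, 0, 1, 1, 0, 0],
    'sex': ['female', 'male', 'female', 'male', 'female', 'male'],
    'class': ['First', 'First', 'Second', 'Second', 'Third', 'Third'],
})

rates = toy.groupby(['class', 'sex']).mean()

# A tuple selects one (class, sex) combination
print(rates.loc[('First', 'female'), 'survived'])

# A single outer label selects every row of that class
print(rates.loc['First'])
```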

In [169]:
ti.pivot_table("survived",aggfunc=np.mean,index='class',columns='sex')

sex       female      male
class                     
First   0.968085  0.368852
Second  0.921053  0.157407
Third   0.500000  0.135447
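
The pivot table holds the same numbers as the grouped means, laid out in two dimensions. As a sketch on a small hypothetical table `toy` (made-up values, same column names), grouping and then calling `unstack` gives the same layout:

```python
import pandas as pd

# Tiny hypothetical sample with the same columns as the Titanic extract
toy = pd.DataFrame({
    'survived': [1, 0, 1, 1, 0, 0],
    'sex': ['female', 'male', 'female', 'male', 'female', 'male'],
    'class': ['First', 'First', 'Second', 'Second', 'Third', 'Third'],
})

pivoted = toy.pivot_table('survived', index='class', columns='sex', aggfunc='mean')

# Same numbers obtained by grouping, then moving 'sex' from the index to the columns
unstacked = toy.groupby(['class', 'sex'])['survived'].mean().unstack()

print(pivoted)
print(unstacked)
```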


## Handling dates and time series

Timestamps are encoded with nanosecond resolution.



In [172]:
np.datetime64('2018-06-30 08:35:23', 'ns')

numpy.datetime64('2018-06-30T08:35:23.000000000')

In [174]:
np.datetime64('2018-06-30 08:35:23', 's') - np.datetime64('2018-06-30 06:35:23', 's')

numpy.timedelta64(7200,'s')

- In practice, we tend to use the native Pandas types instead.
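
As a hedged illustration of those types, the sketch below repeats the subtraction above with `pd.Timestamp`, then builds a small Series indexed by dates. The variable names and the sample values are assumptions made for the example.

```python
import pandas as pd

start = pd.Timestamp('2018-06-30 06:35:23')
end = pd.Timestamp('2018-06-30 08:35:23')

# Subtracting two Timestamps gives a Timedelta
elapsed = end - start
print(elapsed.total_seconds())

# A Series indexed by dates can be sliced with date strings
days = pd.date_range('2018-06-28', periods=5, freq='D')
ts = pd.Series(range(5), index=days)
print(ts.loc['2018-06-29':'2018-06-30'])
```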

In [175]:
pd.to_datetime('10 june 1973 8h30')  # parse a free-form date string
pd.to_datetime()  # called without an argument, this raises a TypeError

TypeError: to_datetime() missing 1 required positional argument: 'arg'