# Pandas Library Introduction

Authors: Julie ORJUELA, Sébastien RAVEL, Nicolas Barrier

Formation ISI - IRD

CIBIG 2024, 2025


### About this material 

All the pandas examples in this course are taken from Patrick Fuchs and Pierre Poulain's Python course:

https://python.sdv.u-paris.fr/


# What is pandas?

pandas is a general-purpose Python library for data analysis: it makes loading, cleaning, and analyzing tabular data straightforward.

* It is built on the very popular NumPy library and provides data structures and operations for processing numerical and time series data.

* Its data structures serve as input for plotting functions in Matplotlib, statistical analysis in SciPy, and machine learning algorithms in scikit-learn.

* Data scientists use it for loading, processing, and analyzing tabular data (stored in .csv, .tsv or .xlsx format).

## Pandas installation

In [None]:
%%bash
python3 -m pip install pandas numpy

In [1]:
import numpy as np
import pandas as pd

## Dataframe

A dataframe is a two-dimensional array with labels that name the rows and columns.

<img src="img/dataframe.png" alt="dataframe" style="height: 500px; width:900px;"/>

A dataframe can be constructed from many different types of input:
* A dict of 1D ndarrays, lists, dicts, or Series
* A two-dimensional numpy.ndarray
* A structured ndarray
* A Series
* Another Dataframe
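
As a minimal sketch of three of the input types listed above, the following builds small dataframes from a 2D NumPy array, a dict of lists, and a Series. The variable names and values are illustrative only and are not part of the original course material.

```python
import numpy as np
import pandas as pd

# From a two-dimensional NumPy array: column labels are passed separately.
arr = np.array([[10, 11], [20, 21], [30, 31]])
df_from_array = pd.DataFrame(arr, columns=["a", "b"])

# From a dict of lists: the keys become the column names.
df_from_dict = pd.DataFrame({"a": [10, 20, 30], "b": [11, 21, 31]})

# From a Series: the Series name becomes the name of the single column.
s = pd.Series([10, 20, 30], name="a")
df_from_series = pd.DataFrame(s)

print(df_from_array.shape)
print(df_from_dict.columns.tolist())
print(df_from_series.columns.tolist())
```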

### Create a dataframe with pandas from a list of rows

In [2]:
df = pd.DataFrame(columns=["a", "b", "c", "d"],
                  data=[np.arange(10, 14),
                        np.arange(20, 24),
                        np.arange(30, 34)])

In [3]:
df

Unnamed: 0,a,b,c,d
0,10,11,12,13
1,20,21,22,23
2,30,31,32,33


In [4]:
df = pd.DataFrame(columns=["a", "b", "c", "d"],
                   index=["chat", "singe", "souris"],
                  data=[np.arange(10, 14),
                        np.arange(20, 24),
                        np.arange(30, 34)])


In [5]:
df

Unnamed: 0,a,b,c,d
chat,10,11,12,13
singe,20,21,22,23
souris,30,31,32,33


### A similar dataframe can be created from a dictionary

In [7]:
data = {"a": np.arange(10, 40, 10),
        "b": np.arange(11, 40, 10),
        "c": np.arange(12, 40, 10),
        "d": np.arange(13, 40, 10)}

In [8]:
data

{'a': array([10, 20, 30]),
 'b': array([11, 21, 31]),
 'c': array([12, 22, 32]),
 'd': array([13, 23, 33])}

In [9]:
df = pd.DataFrame(data)
df.index = ["chat", "singe", "souris"]
df

Unnamed: 0,a,b,c,d
chat,10,11,12,13
singe,20,21,22,23
souris,30,31,32,33


### Creation from a CSV file

A very useful feature of pandas is the ability to open tabular files (such as .csv or .tsv) very easily with `pd.read_csv`.


In [10]:
import pandas as pd
df = pd.read_csv("files/movies/movies.csv")
df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [11]:
df = pd.read_csv("files/movies/movies.csv", sep=',')
df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
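
The default separator of `pd.read_csv` is a comma, which is why the two calls above give the same result. The `sep` argument becomes necessary for other delimiters. The sketch below reads tab-separated text from an in-memory buffer. The content of `tsv_text` is a made-up stand-in for a real .tsv file.

```python
import io
import pandas as pd

# Hypothetical tab-separated content, standing in for a .tsv file on disk.
tsv_text = "movieId\ttitle\n1\tToy Story (1995)\n2\tJumanji (1995)\n"

# sep="\t" tells read_csv that columns are separated by tabs.
movies = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(movies.shape)
print(movies.columns.tolist())
```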


### Pandas dataframe properties


In [12]:
data = {"a": np.arange(10, 40, 10),
        "b": np.arange(11, 40, 10),
        "c": np.arange(12, 40, 10),
        "d": np.arange(13, 40, 10)}
df = pd.DataFrame(data)
df.index = ["chat", "singe", "souris"]
df

Unnamed: 0,a,b,c,d
chat,10,11,12,13
singe,20,21,22,23
souris,30,31,32,33


In [13]:
df.head()

Unnamed: 0,a,b,c,d
chat,10,11,12,13
singe,20,21,22,23
souris,30,31,32,33
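
With only three rows, `.head()` returns the whole dataframe. Its effect is easier to see on a longer table. The sketch below uses a hypothetical 100-row dataframe named `df_long` and also shows `.tail()`, which is not used in the original material.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with 100 rows.
df_long = pd.DataFrame({"x": np.arange(100)})

first_rows = df_long.head()     # first 5 rows by default
first_two = df_long.head(2)     # first 2 rows
last_three = df_long.tail(3)    # last 3 rows

print(len(first_rows), len(first_two), len(last_three))
```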


#### The dimensions of a dataframe are given by the .shape attribute (number of rows, number of columns)

In [14]:
df.shape

(3, 4)

#### The .dtypes attribute gives the data type of each column

In [15]:
df.dtypes 

a    int64
b    int64
c    int64
d    int64
dtype: object
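
Each column has its own type, which shows more clearly on a table that mixes numbers and text. Below is a small sketch with a hypothetical `animals` dataframe. It also unpacks `.shape` into two variables.

```python
import pandas as pd

# Hypothetical table mixing integers, floats and text.
animals = pd.DataFrame({"count": [10, 20, 30],
                        "weight_kg": [4.5, 30.0, 0.02],
                        "name": ["chat", "singe", "souris"]})

# .shape is a tuple, so it can be unpacked directly.
n_rows, n_cols = animals.shape
print(n_rows, n_cols)

# One dtype per column: integer, float, and a text type for "name".
print(animals.dtypes)
```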

#### The .columns attribute returns names and also allows columns to be renamed

In [16]:
df.columns

Index(['a', 'b', 'c', 'd'], dtype='object')

In [17]:
df.columns = ["Paris", "Lyon", "Nantes", "Pau"]

In [18]:
df

Unnamed: 0,Paris,Lyon,Nantes,Pau
chat,10,11,12,13
singe,20,21,22,23
souris,30,31,32,33
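
Assigning to `.columns` replaces every name at once. To rename only some columns, `DataFrame.rename` with a mapping is an alternative. This is a sketch on a reduced copy of the example data and is not part of the original course.

```python
import pandas as pd

df_small = pd.DataFrame({"a": [10, 20, 30], "b": [11, 21, 31]},
                        index=["chat", "singe", "souris"])

# rename returns a new dataframe; columns missing from the mapping are kept.
renamed = df_small.rename(columns={"a": "Paris"})

print(renamed.columns.tolist())
print(df_small.columns.tolist())  # the original dataframe is unchanged
```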


### Selection/Filtering a dataframe

#### Columns selection


##### A column can be selected by its label

In [19]:
df["Lyon"]

chat      11
singe     21
souris    31
Name: Lyon, dtype: int64

##### Or several columns at the same time

In [20]:
df[["Lyon", "Pau"]]

Unnamed: 0,Lyon,Pau
chat,11,13
singe,21,23
souris,31,33


#### Row selection

##### To select a row, use the .loc indexer (with square brackets) and the row label

In [21]:
df.loc["singe"]

Paris     20
Lyon      21
Nantes    22
Pau       23
Name: singe, dtype: int64

##### Here, too, you can select several rows

In [22]:
df.loc[["singe", "chat"]]

Unnamed: 0,Paris,Lyon,Nantes,Pau
singe,20,21,22,23
chat,10,11,12,13


##### Finally, you can also select rows using the .iloc indexer and the row position (the first row has position 0):

In [23]:
df.iloc[1]

Paris     20
Lyon      21
Nantes    22
Pau       23
Name: singe, dtype: int64

In [24]:
df.iloc[[1,0]]

Unnamed: 0,Paris,Lyon,Nantes,Pau
singe,20,21,22,23
chat,10,11,12,13


##### Slices can also be used (as with lists):

In [25]:
df.iloc[0:2]

Unnamed: 0,Paris,Lyon,Nantes,Pau
chat,10,11,12,13
singe,20,21,22,23
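
One difference is worth keeping in mind: a position slice with `.iloc` excludes its end, as with lists, whereas a label slice with `.loc` includes the end label. The sketch below rebuilds a reduced version of the example dataframe to illustrate this.

```python
import pandas as pd

df_small = pd.DataFrame({"Paris": [10, 20, 30], "Lyon": [11, 21, 31]},
                        index=["chat", "singe", "souris"])

# Position slice: position 2 is excluded, so rows 0 and 1 are returned.
by_position = df_small.iloc[0:2]

# Label slice: the end label "singe" is included.
by_label = df_small.loc["chat":"singe"]

print(by_position.index.tolist())
print(by_label.index.tolist())
```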


#### Row AND column selection

##### Row and column selections can be combined:

In [26]:
df.loc["souris", "Pau"]

33

In [27]:
df.loc[["singe", "souris"], ['Nantes', 'Lyon']]

Unnamed: 0,Nantes,Lyon
singe,22,21
souris,32,31
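
The same combination works by position with `.iloc`, and a bare `:` selects everything along one axis. This is a sketch on a rebuilt copy of the example dataframe. The variable names are illustrative.

```python
import pandas as pd

df_cities = pd.DataFrame({"Paris": [10, 20, 30], "Lyon": [11, 21, 31],
                          "Nantes": [12, 22, 32], "Pau": [13, 23, 33]},
                         index=["chat", "singe", "souris"])

# Single cell by position: third row, fourth column.
cell = df_cities.iloc[2, 3]

# All rows of one column, by label.
lyon_column = df_cities.loc[:, "Lyon"]

# Block by position: first two rows, second and third columns.
block = df_cities.iloc[0:2, 1:3]

print(cell)
print(block)
```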


### Examples on dataframe selection

Let's select all rows for which the count in Pau is greater than 15.

In [28]:
df[ df["Pau"]>15 ]

Unnamed: 0,Paris,Lyon,Nantes,Pau
singe,20,21,22,23
souris,30,31,32,33


From this selection, we wish to keep only the values for Lyon

In [29]:
df[ df["Pau"]>15 ]["Lyon"]

singe     21
souris    31
Name: Lyon, dtype: int64
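
The chained form above works for reading values. A single `.loc` call with the condition and the column label gives the same result, and it is the form generally recommended when the selection is also used for assignment. The sketch below runs on a rebuilt copy of the example data.

```python
import pandas as pd

df_cities = pd.DataFrame({"Paris": [10, 20, 30], "Lyon": [11, 21, 31],
                          "Nantes": [12, 22, 32], "Pau": [13, 23, 33]},
                         index=["chat", "singe", "souris"])

# Boolean mask: True for rows where Pau is greater than 15.
mask = df_cities["Pau"] > 15

# Row condition and column label in a single .loc call.
lyon_values = df_cities.loc[mask, "Lyon"]

# The same syntax can assign values; here on a copy, to keep the original intact.
updated = df_cities.copy()
updated.loc[mask, "Lyon"] = 0

print(lyon_values)
print(updated["Lyon"].tolist())
```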

You can also combine several conditions with "&" for the "and" operator. Each condition must be wrapped in parentheses:

In [32]:
df[ (df["Pau"]>15) & (df["Lyon"]>25) ]

Unnamed: 0,Paris,Lyon,Nantes,Pau
souris,30,31,32,33


and "|" for the "or" operator

In [33]:
df[ (df["Pau"]>15) | (df["Lyon"]>25) ]

Unnamed: 0,Paris,Lyon,Nantes,Pau
singe,20,21,22,23
souris,30,31,32,33
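
Two related tools, not covered in the original examples, are the `~` operator to negate a condition and `Series.isin` to test membership in a list of values. A minimal sketch on a rebuilt copy of the example data:

```python
import pandas as pd

df_cities = pd.DataFrame({"Paris": [10, 20, 30], "Lyon": [11, 21, 31],
                          "Nantes": [12, 22, 32], "Pau": [13, 23, 33]},
                         index=["chat", "singe", "souris"])

# ~ negates a boolean condition: rows where Pau is NOT greater than 15.
not_above = df_cities[~(df_cities["Pau"] > 15)]

# isin keeps the rows whose Lyon value belongs to the given list.
in_list = df_cities[df_cities["Lyon"].isin([11, 31])]

print(not_above.index.tolist())
print(in_list.index.tolist())
```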


## Combining dataframes

In [34]:
data = {"Lyon": np.arange(10, 40, 10),
         "Paris": np.arange(11, 40, 10)}
df1 = pd.DataFrame(data)
df1.index = ["chat", "singe", "souris"]
df1


Unnamed: 0,Lyon,Paris
chat,10,11
singe,20,21
souris,30,31


In [35]:
data = {"Nantes": np.arange(10, 40, 10),
         "Strasbourg": np.arange(11, 40, 10)}
df2 = pd.DataFrame(data)
df2.index = ["chat", "souris", "lapin"]
df2


Unnamed: 0,Nantes,Strasbourg
chat,10,11
souris,20,21
lapin,30,31


#### Concatenate two dataframes (rows stacked one below the other)

In [36]:
pd.concat([df1, df2])

Unnamed: 0,Lyon,Paris,Nantes,Strasbourg
chat,10.0,11.0,,
singe,20.0,21.0,,
souris,30.0,31.0,,
chat,,,10.0,11.0
souris,,,20.0,21.0
lapin,,,30.0,31.0
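
Two side effects of this row-wise concatenation are visible in the output above: the integer columns become floats because missing cells are filled with NaN, and the row labels `chat` and `souris` appear twice. The sketch below rebuilds `df1` and `df2` with plain lists to check both points.

```python
import pandas as pd

df1 = pd.DataFrame({"Lyon": [10, 20, 30], "Paris": [11, 21, 31]},
                   index=["chat", "singe", "souris"])
df2 = pd.DataFrame({"Nantes": [10, 20, 30], "Strasbourg": [11, 21, 31]},
                   index=["chat", "souris", "lapin"])

stacked = pd.concat([df1, df2])

# Columns now contain NaN, so their type is float64 instead of int64.
print(stacked.dtypes)

# The index is no longer unique: "chat" selects two rows.
print(stacked.index.is_unique)
print(stacked.loc["chat"])
```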


#### Concatenate them side by side (axis=1), aligning the rows of the two dataframes on their labels

In [37]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,Lyon,Paris,Nantes,Strasbourg
chat,10.0,11.0,10.0,11.0
singe,20.0,21.0,,
souris,30.0,31.0,20.0,21.0
lapin,,,30.0,31.0


#### Keep only rows common to both dataframes

In [42]:
pd.concat([df1, df2], axis=1, join="inner")

Unnamed: 0,Lyon,Paris,Nantes,Strasbourg
chat,10,11,10,11
souris,30,31,20,21
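
Keeping only the shared row labels corresponds to an inner join. The sketch below rebuilds `df1` and `df2` and compares `pd.concat` with `join="inner"` to `DataFrame.join` with `how="inner"`, an alternative that is not used in the original course.

```python
import pandas as pd

df1 = pd.DataFrame({"Lyon": [10, 20, 30], "Paris": [11, 21, 31]},
                   index=["chat", "singe", "souris"])
df2 = pd.DataFrame({"Nantes": [10, 20, 30], "Strasbourg": [11, 21, 31]},
                   index=["chat", "souris", "lapin"])

# Inner join on the index: only labels present in both dataframes are kept.
common_concat = pd.concat([df1, df2], axis=1, join="inner")

# Same idea with DataFrame.join, which also aligns on the index.
common_join = df1.join(df2, how="inner")

print(common_concat)
print(common_join)
```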
