# Pandas
Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Its two core structures, the Series and the DataFrame, are designed for table-like data. Pandas provides tools for data manipulation:

reshaping<br>
merging<br>
sorting<br>
slicing<br>
aggregation<br>
imputation

If you are using Anaconda, you do not have to install pandas separately, because it is included in the distribution.

### Pandas data structures are based on Series and DataFrames

A Series is a one-dimensional labeled array, which you can think of as a single column. A DataFrame is a two-dimensional table made up of a collection of Series that share an index. To create a pandas Series, pass a one-dimensional NumPy array or a Python list to `pd.Series`. Let us see an example of a Series:

In [None]:
# pip install pandas

In [11]:
# Create a Series from a Python list; np.nan marks a missing value
import pandas as pd
import numpy as np
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
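
As a complement to the cell above, here is a minimal sketch of other ways to build a Series, and of how a DataFrame column is itself a Series. The names `from_list`, `from_array`, `from_dict`, and `frame` are illustrative and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# A Series can be built from a Python list, a NumPy array, or a dict.
from_list = pd.Series([1, 3, 5])
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))
from_dict = pd.Series({"a": 10, "b": 20})  # dict keys become the index

# A DataFrame is a collection of Series that share one index.
frame = pd.DataFrame({"x": from_list, "y": from_array})

print(type(frame["x"]).__name__)  # type of a single column
print(frame.shape)                # (rows, columns)
```

Selecting one column from `frame` gives back a Series, which is why most Series methods also work on DataFrame columns.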

In [12]:
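# Define 12 month-end dates to use as the index.
# Note: pandas 2.2+ prefers the alias 'ME' for month end and deprecates 'M'.
dates = pd.date_range('20180101', periods=12, freq='M')

# Build a 12 x 5 table of standard-normal random values
df = pd.DataFrame(np.random.randn(12,5), index=dates, columns=list('ABCDE'))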
df

Unnamed: 0,A,B,C,D,E
2018-01-31,0.231315,0.708033,1.111132,2.181142,1.157375
2018-02-28,0.734025,-0.089942,-0.706523,-0.839375,-0.2457
2018-03-31,-1.292621,0.502302,0.935274,-0.38286,0.986433
2018-04-30,-1.289494,-0.092952,0.307212,0.123968,1.571922
2018-05-31,0.575059,0.744409,1.616799,1.452358,-0.728734
2018-06-30,0.146737,-0.905041,-0.224802,-0.521343,1.448092
2018-07-31,0.038899,-0.549983,0.400122,-0.784636,1.086672
2018-08-31,-0.246512,1.038037,-0.585043,-2.12651,0.688611
2018-09-30,1.078921,1.966752,-1.82029,0.738201,0.49169
2018-10-31,-0.781411,-1.467237,1.690682,0.579128,0.912026


In [3]:
# View the first 5 rows (the default for head())
df.head()

Unnamed: 0,A,B,C,D,E
2018-01-31,-0.291303,-1.250286,-0.496759,1.500387,-0.284256
2018-02-28,1.365386,-0.07982,0.011336,0.351958,0.789908
2018-03-31,-0.102763,-0.029352,0.678579,-0.515541,0.903214
2018-04-30,1.887002,0.171269,-0.573882,0.413721,2.353703
2018-05-31,-1.068683,0.354814,0.575092,0.017316,-1.824893


In [6]:
# View the first 2 rows
df.head(2)

Unnamed: 0,A,B,C,D,E
2018-01-31,-0.732388,1.022338,0.811228,-0.020765,-0.492954
2018-02-28,0.359478,0.718718,-0.369269,-0.269416,0.011059


In [14]:
df.tail(2)

Unnamed: 0,A,B,C,D,E
2018-11-30,0.172564,1.017236,-0.242889,-0.479964,0.058661
2018-12-31,-0.887499,-0.227271,1.745352,-1.06329,-0.345572


In [19]:
# Display the row index (the columns and the underlying NumPy data follow in the next cells)
df.index

DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='M')

In [20]:
# List the column labels of the dataset
df.columns

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [10]:
# Get the values of column A as a NumPy array
# (df.to_numpy() would return the whole table as a 2-D array)
y = df['A'].to_numpy()
print(y)

[-0.73238751  0.35947818 -0.76089162  2.02663129  0.42467588  0.46765918
  0.24088975 -1.62481505  0.04893174  0.25580473  0.53102533  0.02086362]


In [11]:
# describe() shows a quick statistical summary of each numeric column:
df.describe()

Unnamed: 0,A,B,C,D,E
count,12.0,12.0,12.0,12.0,12.0
mean,0.104822,0.263455,-0.046378,-0.098938,0.408452
std,0.887659,0.788949,0.829316,0.824056,0.781242
min,-1.624815,-1.167443,-1.704662,-1.290458,-1.034196
25%,-0.167449,-0.119654,-0.470625,-0.678433,-0.08281
50%,0.248347,0.343404,-0.26815,-0.145091,0.463625
75%,0.435422,0.794623,0.553713,0.350861,1.044367
max,2.026631,1.583903,1.465308,1.280658,1.514482


In [17]:
# Round the summary statistics to 2 decimal places
df.describe().round(2)

Unnamed: 0,A,B,C,D,E
count,12.0,12.0,12.0,12.0,12.0
mean,-0.13,0.22,0.35,-0.09,0.59
std,0.78,0.96,1.11,1.18,0.75
min,-1.29,-1.47,-1.82,-2.13,-0.73
25%,-0.81,-0.31,-0.33,-0.8,-0.02
50%,0.09,0.21,0.35,-0.43,0.8
75%,0.32,0.81,1.24,0.62,1.1
max,1.08,1.97,1.75,2.18,1.57


In [19]:
# Transpose the data (swap rows and columns), rounded to 2 decimal places
df.T.round(2)

Unnamed: 0,2018-01-31,2018-02-28,2018-03-31,2018-04-30,2018-05-31,2018-06-30,2018-07-31,2018-08-31,2018-09-30,2018-10-31,2018-11-30,2018-12-31
A,0.23,0.73,-1.29,-1.29,0.58,0.15,0.04,-0.25,1.08,-0.78,0.17,-0.89
B,0.71,-0.09,0.5,-0.09,0.74,-0.91,-0.55,1.04,1.97,-1.47,1.02,-0.23
C,1.11,-0.71,0.94,0.31,1.62,-0.22,0.4,-0.59,-1.82,1.69,-0.24,1.75
D,2.18,-0.84,-0.38,0.12,1.45,-0.52,-0.78,-2.13,0.74,0.58,-0.48,-1.06
E,1.16,-0.25,0.99,1.57,-0.73,1.45,1.09,0.69,0.49,0.91,0.06,-0.35


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 12 entries, 2018-01-31 to 2018-12-31
Freq: M
Data columns (total 5 columns):
A    12 non-null float64
B    12 non-null float64
C    12 non-null float64
D    12 non-null float64
E    12 non-null float64
dtypes: float64(5)
memory usage: 576.0 bytes


In [22]:
# Sort by an axis: axis=1 sorts the columns by their labels (axis=0 would sort the rows by index)
df.sort_index(axis=1, ascending=True)

Unnamed: 0,A,B,C,D,E
2018-01-31,0.231315,0.708033,1.111132,2.181142,1.157375
2018-02-28,0.734025,-0.089942,-0.706523,-0.839375,-0.2457
2018-03-31,-1.292621,0.502302,0.935274,-0.38286,0.986433
2018-04-30,-1.289494,-0.092952,0.307212,0.123968,1.571922
2018-05-31,0.575059,0.744409,1.616799,1.452358,-0.728734
2018-06-30,0.146737,-0.905041,-0.224802,-0.521343,1.448092
2018-07-31,0.038899,-0.549983,0.400122,-0.784636,1.086672
2018-08-31,-0.246512,1.038037,-0.585043,-2.12651,0.688611
2018-09-30,1.078921,1.966752,-1.82029,0.738201,0.49169
2018-10-31,-0.781411,-1.467237,1.690682,0.579128,0.912026


In [23]:
# Sort the rows by the values in column A
df.sort_values(by='A')

Unnamed: 0,A,B,C,D,E
2018-03-31,-1.292621,0.502302,0.935274,-0.38286,0.986433
2018-04-30,-1.289494,-0.092952,0.307212,0.123968,1.571922
2018-12-31,-0.887499,-0.227271,1.745352,-1.06329,-0.345572
2018-10-31,-0.781411,-1.467237,1.690682,0.579128,0.912026
2018-08-31,-0.246512,1.038037,-0.585043,-2.12651,0.688611
2018-07-31,0.038899,-0.549983,0.400122,-0.784636,1.086672
2018-06-30,0.146737,-0.905041,-0.224802,-0.521343,1.448092
2018-11-30,0.172564,1.017236,-0.242889,-0.479964,0.058661
2018-01-31,0.231315,0.708033,1.111132,2.181142,1.157375
2018-05-31,0.575059,0.744409,1.616799,1.452358,-0.728734


In [24]:
# Selecting a single column yields a Series; df['D'] is equivalent to df.D
df['D']

2018-01-31    2.181142
2018-02-28   -0.839375
2018-03-31   -0.382860
2018-04-30    0.123968
2018-05-31    1.452358
2018-06-30   -0.521343
2018-07-31   -0.784636
2018-08-31   -2.126510
2018-09-30    0.738201
2018-10-31    0.579128
2018-11-30   -0.479964
2018-12-31   -1.063290
Freq: M, Name: D, dtype: float64

In [25]:
# Slicing with [] selects rows by position (here positions 2 through 7)
df[2:8]

Unnamed: 0,A,B,C,D,E
2018-03-31,-1.292621,0.502302,0.935274,-0.38286,0.986433
2018-04-30,-1.289494,-0.092952,0.307212,0.123968,1.571922
2018-05-31,0.575059,0.744409,1.616799,1.452358,-0.728734
2018-06-30,0.146737,-0.905041,-0.224802,-0.521343,1.448092
2018-07-31,0.038899,-0.549983,0.400122,-0.784636,1.086672
2018-08-31,-0.246512,1.038037,-0.585043,-2.12651,0.688611


In [34]:
# Select rows by integer position with iloc (the end position is excluded)
df.iloc[2:5]

Unnamed: 0,A,B,C,D,E
2018-03-31,0.187674,-1.11574,-0.031174,-2.660125,2.405434
2018-04-30,0.397166,-1.682216,0.183021,1.898495,0.93099
2018-05-31,-0.62277,-1.371721,-0.256098,-0.196595,-0.475871


In [37]:
# Use integer slices for rows and columns, as in NumPy/Python
df.iloc[1:9,0:2]


Unnamed: 0,A,B
2018-02-28,-0.764791,0.581003
2018-03-31,0.187674,-1.11574
2018-04-30,0.397166,-1.682216
2018-05-31,-0.62277,-1.371721
2018-06-30,0.598582,-0.682675
2018-07-31,-0.488721,0.757552
2018-08-31,-0.793953,-0.146138
2018-09-30,1.473386,1.348848


In [26]:
# Use lists of integer positions for rows and columns, as in NumPy/Python
df.iloc[[1,5,11],[0,3]]

Unnamed: 0,A,D
2018-02-28,0.734025,-0.839375
2018-06-30,0.146737,-0.521343
2018-12-31,-0.887499,-1.06329


In [39]:
# Slice rows explicitly and keep all columns
df.iloc[1:7,:]

Unnamed: 0,A,B,C,D,E
2018-02-28,-0.764791,0.581003,-0.059589,-0.023677,0.479781
2018-03-31,0.187674,-1.11574,-0.031174,-2.660125,2.405434
2018-04-30,0.397166,-1.682216,0.183021,1.898495,0.93099
2018-05-31,-0.62277,-1.371721,-0.256098,-0.196595,-0.475871
2018-06-30,0.598582,-0.682675,-0.483396,0.619357,-0.627747
2018-07-31,-0.488721,0.757552,0.380881,0.682492,1.291066


In [27]:
# Use a condition on a single column to filter rows (boolean indexing)
df[df.B < 0]


Unnamed: 0,A,B,C,D,E
2018-02-28,0.734025,-0.089942,-0.706523,-0.839375,-0.2457
2018-04-30,-1.289494,-0.092952,0.307212,0.123968,1.571922
2018-06-30,0.146737,-0.905041,-0.224802,-0.521343,1.448092
2018-07-31,0.038899,-0.549983,0.400122,-0.784636,1.086672
2018-10-31,-0.781411,-1.467237,1.690682,0.579128,0.912026
2018-12-31,-0.887499,-0.227271,1.745352,-1.06329,-0.345572


In [28]:
df2 = df.copy()

In [29]:
# Add a new column to a DataFrame (the list needs one value per row)
df2['F'] = ['one', 'one','two','three','four','three','one','two','three','one','two','three']

In [30]:
df2

Unnamed: 0,A,B,C,D,E,F
2018-01-31,0.231315,0.708033,1.111132,2.181142,1.157375,one
2018-02-28,0.734025,-0.089942,-0.706523,-0.839375,-0.2457,one
2018-03-31,-1.292621,0.502302,0.935274,-0.38286,0.986433,two
2018-04-30,-1.289494,-0.092952,0.307212,0.123968,1.571922,three
2018-05-31,0.575059,0.744409,1.616799,1.452358,-0.728734,four
2018-06-30,0.146737,-0.905041,-0.224802,-0.521343,1.448092,three
2018-07-31,0.038899,-0.549983,0.400122,-0.784636,1.086672,one
2018-08-31,-0.246512,1.038037,-0.585043,-2.12651,0.688611,two
2018-09-30,1.078921,1.966752,-1.82029,0.738201,0.49169,three
2018-10-31,-0.781411,-1.467237,1.690682,0.579128,0.912026,one


In [31]:
# Add the month number, taken from the date index, as a new column
df2['Month'] = df2.index.month

In [32]:
df2

Unnamed: 0,A,B,C,D,E,F,Month
2018-01-31,0.231315,0.708033,1.111132,2.181142,1.157375,one,1
2018-02-28,0.734025,-0.089942,-0.706523,-0.839375,-0.2457,one,2
2018-03-31,-1.292621,0.502302,0.935274,-0.38286,0.986433,two,3
2018-04-30,-1.289494,-0.092952,0.307212,0.123968,1.571922,three,4
2018-05-31,0.575059,0.744409,1.616799,1.452358,-0.728734,four,5
2018-06-30,0.146737,-0.905041,-0.224802,-0.521343,1.448092,three,6
2018-07-31,0.038899,-0.549983,0.400122,-0.784636,1.086672,one,7
2018-08-31,-0.246512,1.038037,-0.585043,-2.12651,0.688611,two,8
2018-09-30,1.078921,1.966752,-1.82029,0.738201,0.49169,three,9
2018-10-31,-0.781411,-1.467237,1.690682,0.579128,0.912026,one,10


In [48]:
# Create a Series with its own daily date index.
# Assigning it to a DataFrame as a new column would align the values on matching index labels.
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20180102', periods=6))
s1

2018-01-02    1
2018-01-03    2
2018-01-04    3
2018-01-05    4
2018-01-06    5
2018-01-07    6
Freq: D, dtype: int64
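
To make the alignment visible, here is a minimal sketch that assigns a Series like `s1` to a small frame. The frame `frame`, its daily index `days`, and the column name `G` are assumptions made for illustration. In the notebook's `df`, whose index holds month-end dates, none of the labels of `s1` would match, so the new column would contain only NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame indexed by six consecutive days.
days = pd.date_range("20180101", periods=6)
frame = pd.DataFrame({"A": np.arange(6.0)}, index=days)

# The Series starts one day later, so it overlaps the frame on five dates.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20180102", periods=6))

# Assignment aligns on index labels: 2018-01-01 has no match and becomes NaN,
# and the value for 2018-01-07 is dropped because the frame lacks that date.
frame["G"] = s1
print(frame)
```

Because one value is missing, the new column is stored as float even though `s1` holds integers.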

In [19]:
# Replace column D by assigning a NumPy array of strings.
# df['D'] = ... swaps the whole column; recent pandas versions can warn when
# df.loc[:, 'D'] = ... writes strings into a float column.
df['D'] = np.array(['words'] * len(df))
df

Unnamed: 0,A,B,C,D,E
2018-01-31,-0.732388,1.022338,0.811228,words,-0.492954
2018-02-28,0.359478,0.718718,-0.369269,words,0.011059
2018-03-31,-0.760892,-0.131119,-0.458481,words,1.312366
2018-04-30,2.026631,1.583903,-0.507058,words,-0.364418
2018-05-31,0.424676,-0.03987,-0.542735,words,0.669956
2018-06-30,0.467659,0.328861,1.465308,words,1.06849
2018-07-31,0.24089,0.408042,-0.167031,words,0.257294
2018-08-31,-1.624815,1.04458,0.688979,words,1.514482
2018-09-30,0.048932,-0.115833,-1.704662,words,-1.034196
2018-10-31,0.255805,-1.167443,-0.441554,words,1.036326


# Missing Data

In [33]:
# Handling missing data: reindex adds a new column F filled with NaN,
# then F is set to 1 for the first two dates only
df1 = df.reindex(index=dates[0:5], columns=list(df.columns) + ['F'])
df1.loc[dates[0]:dates[1],'F'] = 1
df1

Unnamed: 0,A,B,C,D,E,F
2018-01-31,0.231315,0.708033,1.111132,2.181142,1.157375,1.0
2018-02-28,0.734025,-0.089942,-0.706523,-0.839375,-0.2457,1.0
2018-03-31,-1.292621,0.502302,0.935274,-0.38286,0.986433,
2018-04-30,-1.289494,-0.092952,0.307212,0.123968,1.571922,
2018-05-31,0.575059,0.744409,1.616799,1.452358,-0.728734,


In [59]:
# Drop every row that contains at least one missing value
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,E,F
2018-01-31,-0.537944,0.214964,-0.426093,words,-1.821859,1.0
2018-02-28,1.137935,-1.059726,0.463824,words,0.334747,1.0


In [60]:
# Fill every missing value with the median of column C
df1.fillna(value=df1['C'].median())

Unnamed: 0,A,B,C,D,E,F
2018-01-31,-0.537944,0.214964,-0.426093,words,-1.821859,1.0
2018-02-28,1.137935,-1.059726,0.463824,words,0.334747,1.0
2018-03-31,-0.237654,-0.58609,-2.261763,words,-0.607987,-0.084837
2018-04-30,-0.339678,0.431857,-0.084837,words,-0.571251,-0.084837
2018-05-31,-0.551689,-0.61467,1.226142,words,1.072403,-0.084837
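
`fillna` also accepts a dict, so different columns can receive different fill values. The sketch below uses a small hypothetical frame (`frame`, with made-up values) to fill column E with the mean of column D and column F with its own median.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two columns.
frame = pd.DataFrame({
    "D": [1.0, 2.0, 3.0, 6.0],
    "E": [0.5, np.nan, 1.5, np.nan],
    "F": [1.0, 1.0, np.nan, np.nan],
})

# A dict passed to fillna maps each column to its own fill value.
# Here E is filled with the mean of D, and F with its own median.
filled = frame.fillna({"E": frame["D"].mean(), "F": frame["F"].median()})
print(filled)
```

`fillna` returns a new DataFrame, so `frame` still contains its missing values afterwards.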


# Statistics

In [61]:
# Mean of each numeric column (column D now holds strings, so it is skipped;
# pandas 2.0+ raises an error here without numeric_only=True)
df.mean(numeric_only=True)


A   -0.099817
B   -0.193103
C   -0.226651
E   -0.385305
dtype: float64

In [62]:
# Mean of each row (axis=1), using the numeric columns only
df.mean(axis=1, numeric_only=True)

2018-01-31   -0.642733
2018-02-28    0.219195
2018-03-31   -0.923373
2018-04-30   -0.140977
2018-05-31    0.283047
2018-06-30   -0.220754
2018-07-31   -0.320195
2018-08-31   -0.040353
2018-09-30   -0.599126
2018-10-31   -0.109287
2018-11-30   -0.245468
2018-12-31    0.025401
Freq: M, dtype: float64

In [64]:
# Compute the cumulative sum of every column
# (the strings in column D are concatenated rather than added)

df.apply(np.cumsum)

Unnamed: 0,A,B,C,D,E
2018-01-31,-0.537944,0.214964,-0.426093,words,-1.821859
2018-02-28,0.59999,-0.844762,0.037731,wordswords,-1.487113
2018-03-31,0.362336,-1.430853,-2.224032,wordswordswords,-2.095099
2018-04-30,0.022658,-0.998996,-2.308869,wordswordswordswords,-2.66635
2018-05-31,-0.529031,-1.613666,-1.082727,wordswordswordswordswords,-1.593947
2018-06-30,-0.114286,-1.129737,-2.795041,wordswordswordswordswordswords,-1.663323
2018-07-31,0.273173,-0.1899,-3.293236,wordswordswordswordswordswordswords,-3.773204
2018-08-31,-0.108453,-0.018584,-3.442503,wordswordswordswordswordswordswordswords,-3.575038
2018-09-30,0.090949,-1.922741,-2.622459,wordswordswordswordswordswordswordswordswords,-5.086832
2018-10-31,-0.388499,-2.138377,-3.332301,wordswordswordswordswordswordswordswordswordsw...,-4.119055
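
If the concatenated strings are not wanted, one option is to restrict the frame to its numeric columns first. This is a minimal sketch on a hypothetical frame (`frame`), not the notebook's data.

```python
import pandas as pd

# Hypothetical frame mixing a numeric column and a text column.
frame = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "D": ["words", "words", "words"],
})

# Keep only numeric columns before accumulating, so text is left out.
running_total = frame.select_dtypes(include="number").cumsum()
print(running_total)
```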


In [48]:
# Use a lambda function to compute the range (max - min) of the first three columns

range_value = df.iloc[:,:3].apply(lambda x: x.max() - x.min())

range_value


A    2.731215
B    3.082090
C    3.717038
dtype: float64
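
The same range can be computed without `apply`, and `apply` can also work row by row. The sketch below compares the approaches on a small hypothetical frame (`frame`).

```python
import pandas as pd

# Hypothetical data chosen so the results are easy to check by hand.
frame = pd.DataFrame({"A": [1.0, 4.0, 2.0], "B": [10.0, 7.0, 9.0]})

# Column-wise range with apply and a lambda.
range_apply = frame.apply(lambda col: col.max() - col.min())

# The same result from built-in reductions, one value per column.
range_builtin = frame.max() - frame.min()

# axis=1 applies the function to each row instead of each column.
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(range_apply)
print(row_range)
```

For simple reductions such as this one, the built-in form is usually shorter to read; `apply` is more useful when no built-in method exists for the operation.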

# Data Manipulation - dealing with duplicates

In [21]:

# Build a small DataFrame in which the last row repeats the first
data = {"Name": ["James", "Alice", "Phil", "James"],
        "Age": [24, 28, 40, 24],
        "Sex": ["Male", "Female", "Male", "Male"]}
df = pd.DataFrame(data)
print(df)

    Name  Age     Sex
0  James   24    Male
1  Alice   28  Female
2   Phil   40    Male
3  James   24    Male


In [22]:
# Drop rows that are exact duplicates of an earlier row
df = df.drop_duplicates()
print(df)

    Name  Age     Sex
0  James   24    Male
1  Alice   28  Female
2   Phil   40    Male


In [23]:
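# Same data, but the second James record now has Age 25,
# so no row is an exact duplicate of another
data = {"Name": ["James", "Alice", "Phil", "James"],
        "Age": [24, 28, 40, 25],
        "Sex": ["Male", "Female", "Male", "Male"]}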
df = pd.DataFrame(data)
print(df)

    Name  Age     Sex
0  James   24    Male
1  Alice   28  Female
2   Phil   40    Male
3  James   25    Male


To remove duplicates based on only one column or a subset of columns, pass `subset` with the column name or list of column names that should be unique. To control which of the duplicate rows survives based on another column's value, first call `sort_values(colname)`, then set `keep` to either `'first'` or `'last'`.

In our example data, there are two entries for Name = James, one with Age = 24 and one with Age = 25. If we want only the record with the highest age for each person, we can sort by Age in descending order and drop duplicates of the Name column with `keep='first'`.

In [24]:
# Sort so the highest Age comes first, then keep the first row for each Name
df = df.sort_values('Age', ascending=False)
df = df.drop_duplicates(subset='Name', keep='first')
print(df)

    Name  Age     Sex
2   Phil   40    Male
1  Alice   28  Female
3  James   25    Male
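
Before dropping anything, it can help to look at the rows that are considered duplicates. The sketch below does that with `duplicated`, and also shows a `groupby` alternative for keeping the highest Age per Name. The variable names `people`, `repeated`, and `oldest` are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["James", "Alice", "Phil", "James"],
    "Age": [24, 28, 40, 25],
    "Sex": ["Male", "Female", "Male", "Male"],
})

# keep=False marks every member of a duplicate group, not just the repeats.
repeated = people[people.duplicated(subset="Name", keep=False)]
print(repeated)

# Alternative to sort + drop_duplicates: take the row label of the
# maximum Age within each Name group, then select those rows.
oldest = people.loc[people.groupby("Name")["Age"].idxmax()]
print(oldest)
```

Both approaches keep one row per name; the `groupby` version leaves the original row order of `people` untouched because it does not sort the frame in place.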


In [37]:
# Build a sample DataFrame for counting distinct or unique records
df = pd.DataFrame({
         'hID': [101, 102, 103, 101, 102, 104, 105, 101],
         'dID': [10, 11, 12, 10, 11, 10, 12, 10],
         'uID': ['James', 'Henry', 'Abe', 'James', 'Henry', 'Brian', 'Claude', 'James'],
         'mID': ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'C']
})

df

Unnamed: 0,hID,dID,uID,mID
0,101,10,James,A
1,102,11,Henry,B
2,103,12,Abe,A
3,101,10,James,B
4,102,11,Henry,A
5,104,10,Brian,A
6,105,12,Claude,A
7,101,10,James,C


In [38]:
# Count the non-missing values in hID, duplicates included
df['hID'].count()

8

In [39]:
# Count all elements in hID, including duplicates and missing values
df['hID'].size

8

In [40]:
# len() also returns the total number of elements, duplicates included
len(df['hID'])

8

In [41]:
# Count the distinct values
df['hID'].nunique()

5
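
The three counting methods above return the same number only because `hID` has no missing values. A minimal sketch with a hypothetical Series (`ids`) that contains one NaN shows where they differ.

```python
import numpy as np
import pandas as pd

# Hypothetical IDs with one missing value and one repeated value.
ids = pd.Series([101, 102, np.nan, 101])

non_missing = ids.count()   # skips NaN
total = ids.size            # counts every element, NaN included
length = len(ids)           # same as size
distinct = ids.nunique()    # distinct non-missing values

print(non_missing, total, length, distinct)
```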

In [43]:
# Use boolean indexing to restrict the counts to rows where mID is 'A':
df.loc[df['mID']=='A','hID'].agg(['nunique','count','size'])

nunique    5
count      5
size       5
Name: hID, dtype: int64

In [31]:
# OR using query:
df.query('mID == "A"')['hID'].agg(['nunique','count','size'])

nunique    5
count      5
size       5
Name: hID, dtype: int64
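
To get the same counts for every value of `mID` at once, rather than filtering one value at a time, a `groupby` can be used. This is a sketch on the same `hID` and `mID` values as above; the names `records` and `per_group` are illustrative.

```python
import pandas as pd

records = pd.DataFrame({
    "hID": [101, 102, 103, 101, 102, 104, 105, 101],
    "mID": ["A", "B", "A", "B", "A", "A", "A", "C"],
})

# One row per mID value, with the three counts side by side.
per_group = records.groupby("mID")["hID"].agg(["nunique", "count", "size"])
print(per_group)
```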

In [44]:
# Drop a column (axis=1 refers to columns)
df1 = df.drop('mID', axis=1)
df1

Unnamed: 0,hID,dID,uID
0,101,10,James
1,102,11,Henry
2,103,12,Abe
3,101,10,James
4,102,11,Henry
5,104,10,Brian
6,105,12,Claude
7,101,10,James


In [45]:
# Drop a column using the columns keyword
df2 = df1.drop(columns="dID")
df2

Unnamed: 0,hID,uID
0,101,James
1,102,Henry
2,103,Abe
3,101,James
4,102,Henry
5,104,Brian
6,105,Claude
7,101,James


In [39]:
# Drop multiple columns
df3 = df.drop(["mID","dID"], axis=1)
df3

Unnamed: 0,hID,uID
0,101,James
1,102,Henry
2,103,Abe
3,101,James
4,102,Henry
5,104,Brian
6,105,Claude
7,101,James
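
The cells above remove columns. Removing rows works with the same `drop` method, or with a boolean mask. Here is a minimal sketch on a hypothetical three-row frame (`records`).

```python
import pandas as pd

records = pd.DataFrame({
    "hID": [101, 102, 103],
    "uID": ["James", "Henry", "Abe"],
})

# Drop one row by its index label.
without_first = records.drop(index=0)

# Drop several rows at once by passing a list of labels.
only_last = records.drop(index=[0, 1])

# Drop rows by condition: keep the rows where the mask is True.
not_henry = records[records["uID"] != "Henry"]

print(without_first, only_last, not_henry, sep="\n\n")
```

In all three cases `records` itself is unchanged, because each operation returns a new DataFrame.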


In [43]:
# Write the output to an Excel file
# (to_excel needs an Excel writer engine such as openpyxl to be installed)
df3.to_excel("output_excel_file.xlsx", index=False)

In [42]:
# Write the output to a CSV file
df3.to_csv("output_filename.csv", index=False, encoding='utf-8')
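
To check that a CSV export can be read back, here is a minimal round-trip sketch. It writes to an in-memory buffer instead of a file, which is an assumption made to keep the example self-contained; the names `frame`, `buffer`, and `restored` are illustrative.

```python
import io
import pandas as pd

frame = pd.DataFrame({"hID": [101, 102], "uID": ["James", "Henry"]})

# Write CSV text to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing `index=False` keeps the row index out of the file. Without it, reading the file back produces an extra column named `Unnamed: 0`.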

### Key Differences:
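`.loc[]` - Uses labels (column names, index labels); label slices include both endpoints

`.iloc[]` - Uses integer positions (0, 1, 2, ...); slices exclude the end position

`df[]` - Simple indexing: a column name selects a column, a slice selects rows, and a boolean mask filters rows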
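
The differences can be seen side by side in a minimal sketch. The frame below, with its string index `w` to `z`, is hypothetical and chosen so that labels and positions cannot be confused.

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
    index=["w", "x", "y", "z"],
)

by_label = frame.loc["x":"y", "A"]   # label slice: both ends included
by_position = frame.iloc[1:3, 0]     # position slice: end excluded
column = frame["B"]                  # a column name selects a column
rows = frame[1:3]                    # a slice selects rows by position
masked = frame[frame["B"] > 20]      # a boolean mask filters rows

print(by_label, by_position, rows, masked, sep="\n\n")
```

Here `by_label` and `by_position` select the same two values, even though the label slice names its last row and the position slice stops one before its end.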


🎉 CONGRATULATIONS! 🎉