___

<a href='https://oxiane-institut.com/'> <img src='../oxiane.jpg' /></a>
___

## Advanced Indexing with Pandas

Advanced indexing lets you select data with complex queries that combine several logical conditions.

In [37]:
import pandas as pd
import numpy as np


pd.set_option(
    'display.max_colwidth', 100     # Default: 50
)

pd.set_option(
    'display.max_rows', 100         # Default: 15
)

# check the version
pd.__version__

'2.2.2'

In [38]:
df = pd.read_csv("data/heart.csv")

## `.loc[]` and `.iloc[]`

- `.loc[]`: selects by label, using the index of the Series/DataFrame object.
- `.iloc[]`: selects by integer position, like a usual `list[i]`.

### On a Series

In [39]:
s = pd.Series(["a","b","c","d","e","f"], index=[49, 48, 47, 0, 1, 2]) 
s

49    a
48    b
47    c
0     d
1     e
2     f
dtype: object

In [40]:
s.loc[0]

'd'

In [41]:
s.iloc[0]

'a'

In [42]:
s.loc[0:1]

0    d
1    e
dtype: object

In [43]:
s.iloc[0:1]

49    a
dtype: object
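
The difference also shows up when a label is missing, or when the slice bounds are labels rather than positions. A minimal sketch, rebuilding the same Series `s` as above (the names `by_position`, `label_found` and `label_slice` are illustrative):

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d", "e", "f"], index=[49, 48, 47, 0, 1, 2])

# Position 5 exists (the sixth element), so .iloc succeeds.
by_position = s.iloc[5]

# There is no label 5 in the index, so .loc raises a KeyError.
try:
    s.loc[5]
    label_found = True
except KeyError:
    label_found = False

# A label slice follows the order of the index and includes its end label.
label_slice = s.loc[49:47]

print(by_position, label_found, label_slice.tolist())
```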

## `.iloc[]` on a DataFrame

In [44]:
df

Unnamed: 0,age,sexe,type_douleur,pression,cholester,sucre,electro,taux_max,angine,depression,pic,vaisseau,coeur
0,70,masculin,D,130,322,A,C,109,non,24,2,D,presence
1,67,feminin,C,115,564,A,C,160,non,16,2,A,absence
2,57,masculin,B,124,261,A,A,141,non,3,1,A,presence
3,64,masculin,D,128,263,A,A,105,oui,2,2,B,absence
4,74,feminin,B,120,269,A,C,121,oui,2,1,B,absence
...,...,...,...,...,...,...,...,...,...,...,...,...,...
265,52,masculin,C,172,199,B,A,162,non,5,1,A,absence
266,44,masculin,B,120,263,A,A,173,non,0,1,A,absence
267,56,feminin,B,140,294,A,C,153,non,13,2,A,absence
268,57,masculin,D,140,192,A,A,148,non,4,2,A,absence


In [45]:
df.head(3)

Unnamed: 0,age,sexe,type_douleur,pression,cholester,sucre,electro,taux_max,angine,depression,pic,vaisseau,coeur
0,70,masculin,D,130,322,A,C,109,non,24,2,D,presence
1,67,feminin,C,115,564,A,C,160,non,16,2,A,absence
2,57,masculin,B,124,261,A,A,141,non,3,1,A,presence


In [46]:
# df.iloc[ROW, COLUMN]

df.iloc[0,0]

70

In [47]:
df.iloc[-1,0]

67

In [48]:
df.iloc[df.shape[0]-1,0]

67

In [49]:
df.iloc[0:5,:]

Unnamed: 0,age,sexe,type_douleur,pression,cholester,sucre,electro,taux_max,angine,depression,pic,vaisseau,coeur
0,70,masculin,D,130,322,A,C,109,non,24,2,D,presence
1,67,feminin,C,115,564,A,C,160,non,16,2,A,absence
2,57,masculin,B,124,261,A,A,141,non,3,1,A,presence
3,64,masculin,D,128,263,A,A,105,oui,2,2,B,absence
4,74,feminin,B,120,269,A,C,121,oui,2,1,B,absence


In [50]:
colonnes = [1, 3, 4]
df.iloc[0:5,colonnes]

Unnamed: 0,sexe,pression,cholester
0,masculin,130,322
1,feminin,115,564
2,masculin,124,261
3,masculin,128,263
4,feminin,120,269


In [51]:
df.iloc[-5:,:]

Unnamed: 0,age,sexe,type_douleur,pression,cholester,sucre,electro,taux_max,angine,depression,pic,vaisseau,coeur
265,52,masculin,C,172,199,B,A,162,non,5,1,A,absence
266,44,masculin,B,120,263,A,A,173,non,0,1,A,absence
267,56,feminin,B,140,294,A,C,153,non,13,2,A,absence
268,57,masculin,D,140,192,A,A,148,non,4,2,A,absence
269,67,masculin,D,160,286,A,C,108,oui,15,2,D,presence


In [52]:
df.iloc[0:5,0:2]

Unnamed: 0,age,sexe
0,70,masculin
1,67,feminin
2,57,masculin
3,64,masculin
4,74,feminin


In [53]:
df.iloc[0:5,[0,2,4]]

Unnamed: 0,age,type_douleur,cholester
0,70,D,322
1,67,C,564
2,57,B,261
3,64,D,263
4,74,B,269


In [54]:
df.iloc[0:5,0:5:2]

Unnamed: 0,age,type_douleur,cholester
0,70,D,322
1,67,C,564
2,57,B,261
3,64,D,263
4,74,B,269
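
When a column is known by name but `.iloc` is needed, the name can be translated to a position first. A minimal sketch, assuming a small DataFrame built from the first three rows shown above (the name `small` is illustrative); `Index.get_loc` and `Index.get_indexer` are the pandas calls that convert labels to positions:

```python
import pandas as pd

small = pd.DataFrame({
    "age": [70, 67, 57],
    "sexe": ["masculin", "feminin", "masculin"],
    "type_douleur": ["D", "C", "B"],
    "pression": [130, 115, 124],
    "cholester": [322, 564, 261],
})

# Position of a single column name.
pos = small.columns.get_loc("pression")

# Positions of several column names at once.
positions = small.columns.get_indexer(["sexe", "pression", "cholester"])

# Same selection as small.iloc[0:2, [1, 3, 4]].
subset = small.iloc[0:2, positions]

print(pos, positions.tolist())
print(subset)
```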


## `.loc[]` on a DataFrame

In [55]:
# We must index by column names

columns = ['age','sexe','coeur','taux_max']
df.loc[:10, columns]

Unnamed: 0,age,sexe,coeur,taux_max
0,70,masculin,presence,109
1,67,feminin,absence,160
2,57,masculin,presence,141
3,64,masculin,absence,105
4,74,feminin,absence,121
5,65,masculin,absence,140
6,56,masculin,presence,142
7,59,masculin,presence,142
8,60,masculin,presence,170
9,63,feminin,presence,154
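
Keep in mind that a label slice in `.loc` includes its end label, unlike a positional slice in `.iloc`. A minimal sketch, assuming a hypothetical 12-row DataFrame with the default integer index:

```python
import pandas as pd

demo = pd.DataFrame({"value": range(12)})  # hypothetical data, index 0..11

by_label = demo.loc[:10]      # labels 0 through 10, end included
by_position = demo.iloc[:10]  # positions 0 through 9, end excluded

print(len(by_label), len(by_position))
```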


## Powerful queries with `.loc[]`

In [56]:
df

Unnamed: 0,age,sexe,type_douleur,pression,cholester,sucre,electro,taux_max,angine,depression,pic,vaisseau,coeur
0,70,masculin,D,130,322,A,C,109,non,24,2,D,presence
1,67,feminin,C,115,564,A,C,160,non,16,2,A,absence
2,57,masculin,B,124,261,A,A,141,non,3,1,A,presence
3,64,masculin,D,128,263,A,A,105,oui,2,2,B,absence
4,74,feminin,B,120,269,A,C,121,oui,2,1,B,absence
...,...,...,...,...,...,...,...,...,...,...,...,...,...
265,52,masculin,C,172,199,B,A,162,non,5,1,A,absence
266,44,masculin,B,120,263,A,A,173,non,0,1,A,absence
267,56,feminin,B,140,294,A,C,153,non,13,2,A,absence
268,57,masculin,D,140,192,A,A,148,non,4,2,A,absence


In [57]:
df.loc[df['type_douleur']=="A",:]

Unnamed: 0,age,sexe,type_douleur,pression,cholester,sucre,electro,taux_max,angine,depression,pic,vaisseau,coeur
13,61,masculin,A,134,234,A,A,145,non,26,2,C,presence
18,64,masculin,A,110,211,A,C,144,oui,18,2,A,absence
19,40,masculin,A,140,199,A,A,178,oui,14,1,A,absence
37,59,masculin,A,160,273,A,C,125,non,0,1,A,presence
63,60,feminin,A,150,240,A,A,171,non,9,1,A,absence
64,63,masculin,A,145,233,B,C,150,non,23,3,A,absence
85,42,masculin,A,148,244,A,C,178,non,8,1,C,absence
87,59,masculin,A,178,270,A,C,145,non,42,3,A,absence
118,66,feminin,A,150,226,A,A,114,non,26,3,A,absence
143,51,masculin,A,125,213,A,C,125,oui,14,1,B,absence


### Let's break down the query

In [58]:
df['type_douleur']=="A"
# Gives us a boolean Series

0      False
1      False
2      False
3      False
4      False
       ...  
265    False
266    False
267    False
268    False
269    False
Name: type_douleur, Length: 270, dtype: bool

In [59]:
(df['type_douleur']=="A").value_counts()

type_douleur
False    250
True      20
Name: count, dtype: int64

In [60]:
boolean_series = df['type_douleur']=="A"

df.loc[boolean_series, :]

Unnamed: 0,age,sexe,type_douleur,pression,cholester,sucre,electro,taux_max,angine,depression,pic,vaisseau,coeur
13,61,masculin,A,134,234,A,A,145,non,26,2,C,presence
18,64,masculin,A,110,211,A,C,144,oui,18,2,A,absence
19,40,masculin,A,140,199,A,A,178,oui,14,1,A,absence
37,59,masculin,A,160,273,A,C,125,non,0,1,A,presence
63,60,feminin,A,150,240,A,A,171,non,9,1,A,absence
64,63,masculin,A,145,233,B,C,150,non,23,3,A,absence
85,42,masculin,A,148,244,A,C,178,non,8,1,C,absence
87,59,masculin,A,178,270,A,C,145,non,42,3,A,absence
118,66,feminin,A,150,226,A,A,114,non,26,3,A,absence
143,51,masculin,A,125,213,A,C,125,oui,14,1,B,absence
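
Because a boolean Series only holds `True`/`False` values, it can also be counted or negated before being passed to `.loc`. A minimal sketch on a small hypothetical DataFrame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "type_douleur": ["A", "D", "B", "A", "C"],
    "age": [61, 70, 57, 64, 67],
})

mask = toy["type_douleur"] == "A"

n_matches = int(mask.sum())   # True counts as 1, False as 0
others = toy.loc[~mask, :]    # ~ negates the mask

print(n_matches)
print(others)
```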


### Multiple conditions

In [61]:
df.loc[(df['type_douleur']=="A") & (df['angine'] =="oui"),:]

Unnamed: 0,age,sexe,type_douleur,pression,cholester,sucre,electro,taux_max,angine,depression,pic,vaisseau,coeur
18,64,masculin,A,110,211,A,C,144,oui,18,2,A,absence
19,40,masculin,A,140,199,A,A,178,oui,14,1,A,absence
143,51,masculin,A,125,213,A,C,125,oui,14,1,B,absence
160,38,masculin,A,120,231,A,A,182,oui,38,2,A,presence


In [62]:
# Warning: the parentheses are required, because & binds more tightly than ==

# df.loc[df['type_douleur']=="A" & df['angine'] =="oui",:]
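
The reason is operator precedence: in Python, `&` is evaluated before `==`, so without parentheses the expression is read as `df['type_douleur'] == ("A" & df['angine']) == "oui"`. A minimal sketch on a small hypothetical DataFrame; the exact exception type may vary between pandas versions, so it is caught broadly here:

```python
import pandas as pd

toy = pd.DataFrame({
    "type_douleur": ["A", "A", "D"],
    "angine": ["oui", "non", "oui"],
})

# Without parentheses the expression is invalid and raises an exception.
try:
    toy.loc[toy["type_douleur"] == "A" & toy["angine"] == "oui", :]
    failed = False
except Exception as err:
    failed = True
    print(type(err).__name__)

# With parentheses, each comparison is evaluated first, then combined.
result = toy.loc[(toy["type_douleur"] == "A") & (toy["angine"] == "oui"), :]
print(result)
```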

In [63]:
df.loc[(df['age'] < 45) & (df['sexe'] == "masculin") & (df['coeur'] =="presence"),:]

Unnamed: 0,age,sexe,type_douleur,pression,cholester,sucre,electro,taux_max,angine,depression,pic,vaisseau,coeur
40,40,masculin,D,152,223,A,A,181,non,0,1,A,presence
47,44,masculin,D,110,197,A,C,177,non,0,1,B,presence
50,42,masculin,D,136,315,A,A,125,oui,18,2,A,presence
81,35,masculin,D,120,198,A,A,130,oui,16,2,A,presence
147,40,masculin,D,110,167,A,C,114,oui,20,2,A,presence
160,38,masculin,A,120,231,A,A,182,oui,38,2,A,presence
182,41,masculin,D,110,172,A,C,158,non,0,1,A,presence
193,35,masculin,D,126,282,A,C,156,oui,0,1,A,presence
231,39,masculin,D,118,219,A,A,140,non,12,2,A,presence
237,43,masculin,D,120,177,A,C,120,oui,25,2,A,presence


In [64]:
columns = ['age','sexe','coeur','taux_max']

df.loc[(df['age'] < 45) & (df['sexe'] == "masculin") & (df['coeur'] =="presence"), columns]

Unnamed: 0,age,sexe,coeur,taux_max
40,40,masculin,presence,181
47,44,masculin,presence,177
50,42,masculin,presence,125
81,35,masculin,presence,130
147,40,masculin,presence,114
160,38,masculin,presence,182
182,41,masculin,presence,158
193,35,masculin,presence,156
231,39,masculin,presence,140
237,43,masculin,presence,120
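
Conditions can also be combined with `|` (or) and `~` (not), and `Series.isin` tests membership in a list of values. A minimal sketch on a small hypothetical DataFrame (the values are made up for illustration); storing each mask in a named variable keeps long queries readable:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [40, 70, 35, 52],
    "sexe": ["masculin", "feminin", "masculin", "feminin"],
    "type_douleur": ["D", "C", "A", "B"],
})

young = toy["age"] < 45
pain_a_or_b = toy["type_douleur"].isin(["A", "B"])

# Rows where at least one condition holds.
either = toy.loc[young | pain_a_or_b, :]

# Rows where the patient is young and the pain type is neither A nor B.
young_not_ab = toy.loc[young & ~pain_a_or_b, :]

print(either)
print(young_not_ab)
```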


### Modifying the value of some cells

To set the value of one or several cells in pandas, you should always use a single `.loc` call.

In [65]:
df.loc[:4, 'sexe'] = "my_new_value"
# This way you are guaranteed to change the underlying data.

df

Unnamed: 0,age,sexe,type_douleur,pression,cholester,sucre,electro,taux_max,angine,depression,pic,vaisseau,coeur
0,70,my_new_value,D,130,322,A,C,109,non,24,2,D,presence
1,67,my_new_value,C,115,564,A,C,160,non,16,2,A,absence
2,57,my_new_value,B,124,261,A,A,141,non,3,1,A,presence
3,64,my_new_value,D,128,263,A,A,105,oui,2,2,B,absence
4,74,my_new_value,B,120,269,A,C,121,oui,2,1,B,absence
...,...,...,...,...,...,...,...,...,...,...,...,...,...
265,52,masculin,C,172,199,B,A,162,non,5,1,A,absence
266,44,masculin,B,120,263,A,A,173,non,0,1,A,absence
267,56,feminin,B,140,294,A,C,153,non,13,2,A,absence
268,57,masculin,D,140,192,A,A,148,non,4,2,A,absence
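
The row selector can also be a boolean mask, which updates only the matching rows in a single step. A minimal sketch on a small hypothetical DataFrame; the column `groupe` and its values are assumptions made for the example:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [70, 40, 57, 35],
    "groupe": ["?", "?", "?", "?"],  # hypothetical column
})

# Each assignment selects rows with a mask and one column by name.
toy.loc[toy["age"] < 45, "groupe"] = "under_45"
toy.loc[toy["age"] >= 45, "groupe"] = "45_and_over"

print(toy)
```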


As explained [in the documentation](https://pandas.pydata.org/docs/user_guide/indexing.html#evaluation-order-matters), setting a value without `.loc` (chained assignment) sometimes works and sometimes does not.

It is safer never to set the value of cells in your DataFrame without `.loc`.

In [66]:
df["sexe"][:4] = "my_new_new_value"

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df["sexe"][:4] = "my_new_new_value"
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["sexe"][:4] = "my_new_ne

In [67]:
df

Unnamed: 0,age,sexe,type_douleur,pression,cholester,sucre,electro,taux_max,angine,depression,pic,vaisseau,coeur
0,70,my_new_new_value,D,130,322,A,C,109,non,24,2,D,presence
1,67,my_new_new_value,C,115,564,A,C,160,non,16,2,A,absence
2,57,my_new_new_value,B,124,261,A,A,141,non,3,1,A,presence
3,64,my_new_new_value,D,128,263,A,A,105,oui,2,2,B,absence
4,74,my_new_value,B,120,269,A,C,121,oui,2,1,B,absence
...,...,...,...,...,...,...,...,...,...,...,...,...,...
265,52,masculin,C,172,199,B,A,162,non,5,1,A,absence
266,44,masculin,B,120,263,A,A,173,non,0,1,A,absence
267,56,feminin,B,140,294,A,C,153,non,13,2,A,absence
268,57,masculin,D,140,192,A,A,148,non,4,2,A,absence
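
Note that row 4 kept its earlier value: `[:4]` is a positional slice that stops before position 4, whereas `.loc[:4]` included label 4. A minimal sketch of a single-step equivalent of `df["sexe"][:4] = ...`, on a small hypothetical DataFrame, using `toy.index[:4]` to turn the first four positions into labels for `.loc`:

```python
import pandas as pd

toy = pd.DataFrame({
    "sexe": ["masculin", "feminin", "masculin",
             "masculin", "feminin", "masculin"],
})

# First four rows by position, translated to labels, assigned in one step.
toy.loc[toy.index[:4], "sexe"] = "my_new_new_value"

print(toy)
```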
