
# Introduction to boolean indexing

In [None]:
import pandas as pd

We will create a `people` DataFrame.

In [None]:
names = ["Erika Schumacher", "Javi López", "Maria Rovira", "Ana Gromek",
         "Shekhar Biswas", "Muriel Adams", "Saira Polom", "Alex Kubiak",
         "Kit Ching", "Dog Woof"]
ages = [22, 50, 23, 29, 44, 30, 25, 71, 35, 2]
nations = ["DE", "ES", "ES", "PL", "IN", "FR", "IN", "PL", "UK", "XX"]
siblings = [2, 0, 4, 1, 1, 2, 3, 7, 0, 9]
colours = ["Red", "Yellow", "Yellow", "Blue", "Red", "Yellow", "Blue", "Blue", "Red", "Gray"]

people = pd.DataFrame({
    "name": names,
    "age": ages,
    "country": nations,
    "siblings": siblings,
    "favourite_colour": colours,
})

people.head()

|   | name | age | country | siblings | favourite_colour |
|---|---|---|---|---|---|
| 0 | Erika Schumacher | 22 | DE | 2 | Red |
| 1 | Javi López | 50 | ES | 0 | Yellow |
| 2 | Maria Rovira | 23 | ES | 4 | Yellow |
| 3 | Ana Gromek | 29 | PL | 1 | Blue |
| 4 | Shekhar Biswas | 44 | IN | 1 | Red |


## 1. Filtering data based on conditions

Let's say we want to select only the rows of those people whose favourite colour is "Yellow".

If we write just the condition `people["favourite_colour"] == "Yellow"`, pandas returns a Series of boolean values with one entry per row of the DataFrame. The value is `True` for rows where the condition is met, and `False` otherwise.

In [None]:
people["favourite_colour"] == "Yellow"

0    False
1     True
2     True
3    False
4    False
5     True
6    False
7    False
8    False
9    False
Name: favourite_colour, dtype: bool

In [None]:
yellow_lovers = people["favourite_colour"] == "Yellow"

In [None]:
yellow_lovers

0    False
1     True
2     True
3    False
4    False
5     True
6    False
7    False
8    False
9    False
Name: favourite_colour, dtype: bool

> Note: a pandas Series is like a list, but it has an index and all of its elements must share the same data type. You can think of it as a "single column DataFrame".
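To make the note above concrete, here is a minimal sketch that inspects the boolean Series. It rebuilds a trimmed copy of `people` (only the `favourite_colour` column) so that it runs on its own; the trimming is an assumption made for brevity.

```python
import pandas as pd

# Trimmed copy of `people`: only the column this sketch needs.
people = pd.DataFrame({
    "favourite_colour": ["Red", "Yellow", "Yellow", "Blue", "Red",
                         "Yellow", "Blue", "Blue", "Red", "Gray"],
})

yellow_lovers = people["favourite_colour"] == "Yellow"

print(type(yellow_lovers))                       # a pandas Series
print(yellow_lovers.dtype)                       # bool: every element shares this type
print(len(yellow_lovers) == len(people))         # one value per row
print(yellow_lovers.index.equals(people.index))  # same row labels as the DataFrame
print(yellow_lovers.sum())                       # True counts as 1, so this counts matches
```

Because `True` counts as 1, `.sum()` on a boolean Series is a quick way to check how many rows a condition will keep before filtering.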

We can pass this Series to the `.loc[]` indexer we learned earlier to select only the rows that correspond to the `True` values.

In [None]:
people.loc[yellow_lovers,:]

|   | name | age | country | siblings | favourite_colour |
|---|---|---|---|---|---|
| 1 | Javi López | 50 | ES | 0 | Yellow |
| 2 | Maria Rovira | 23 | ES | 4 | Yellow |
| 5 | Muriel Adams | 30 | FR | 2 | Yellow |


In [None]:
people.loc[people["favourite_colour"]=="Yellow", :]

|   | name | age | country | siblings | favourite_colour |
|---|---|---|---|---|---|
| 1 | Javi López | 50 | ES | 0 | Yellow |
| 2 | Maria Rovira | 23 | ES | 4 | Yellow |
| 5 | Muriel Adams | 30 | FR | 2 | Yellow |
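One detail worth noticing in the outputs above is that the filtered rows keep their original index labels (1, 2 and 5). The sketch below checks this on a trimmed copy of `people` (two columns only, an assumption made for brevity). It also shows that `.loc[]` accepts a plain list of booleans, as long as it has one entry per row.

```python
import pandas as pd

# Trimmed copy of `people`: only the columns this sketch needs.
people = pd.DataFrame({
    "name": ["Erika Schumacher", "Javi López", "Maria Rovira", "Ana Gromek",
             "Shekhar Biswas", "Muriel Adams", "Saira Polom", "Alex Kubiak",
             "Kit Ching", "Dog Woof"],
    "favourite_colour": ["Red", "Yellow", "Yellow", "Blue", "Red",
                         "Yellow", "Blue", "Blue", "Red", "Gray"],
})

yellow_lovers = people["favourite_colour"] == "Yellow"
selected = people.loc[yellow_lovers, :]

# The surviving rows keep the labels they had in the full DataFrame.
print(selected.index.tolist())

# A plain Python list of booleans (one per row) selects the same rows.
mask_as_list = yellow_lovers.tolist()
same = people.loc[mask_as_list, :]
print(selected.equals(same))
```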


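The operators for combining conditions in boolean indexing are:
* **|** for OR,
* **&** for AND, and
* **~** for NOT.

Each condition must be wrapped in parentheses, because these operators bind more tightly than comparisons such as `==` or `>`.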
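The need for parentheses can be illustrated with a small sketch. It uses a trimmed copy of `people` (numeric columns only) and an example condition, "older than 30 and more than 1 sibling", that is made up for this illustration.

```python
import pandas as pd

# Trimmed copy of `people`: only the numeric columns this sketch needs.
people = pd.DataFrame({
    "age": [22, 50, 23, 29, 44, 30, 25, 71, 35, 2],
    "siblings": [2, 0, 4, 1, 1, 2, 3, 7, 0, 9],
})

# With parentheses, each comparison produces a boolean Series first,
# and & then combines the two Series row by row.
correct = (people["age"] > 30) & (people["siblings"] > 1)
print(people.loc[correct, :])

# Without parentheses, & is evaluated before the comparisons, so Python reads
# the line as: people["age"] > (30 & people["siblings"]) > 1
try:
    wrong = people["age"] > 30 & people["siblings"] > 1
    raised = False
except (TypeError, ValueError) as err:
    raised = True
    print(type(err).__name__, err)
```

The unparenthesised version does not silently give a wrong answer here; it fails with an error, which is a helpful hint that parentheses are missing.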

In [None]:
# and  or  not   regular Python logical operators (not overloadable, so their behaviour cannot be changed)
# &    |   ~     bitwise operators (overloadable; pandas redefines them to work element-wise on Series)
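As a quick check of the difference, the sketch below tries to combine two boolean Series with `and` and then with `&`. It uses a trimmed copy of `people` (two columns only, an assumption made for brevity).

```python
import pandas as pd

# Trimmed copy of `people`: only the columns this sketch needs.
people = pd.DataFrame({
    "country": ["DE", "ES", "ES", "PL", "IN", "FR", "IN", "PL", "UK", "XX"],
    "siblings": [2, 0, 4, 1, 1, 2, 3, 7, 0, 9],
})

not_spain = people["country"] != "ES"
many_siblings = people["siblings"] > 3

# `and` asks each Series for one single True/False value, which pandas refuses to give.
try:
    not_spain and many_siblings
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# `&` compares the two Series row by row and returns a new boolean Series.
combined = not_spain & many_siblings
print(combined)
```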

Let's find out who does not come from Spain and has more than 3 siblings.

In [None]:
~(people["country"]=="ES")

0     True
1    False
2    False
3     True
4     True
5     True
6     True
7     True
8     True
9     True
Name: country, dtype: bool

In [None]:
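# rows: not from Spain and more than 3 siblings; columns: from "age" to the last column
people.loc[~(people["country"] == "ES") & (people["siblings"] > 3), "age":]

|   | age | country | siblings | favourite_colour |
|---|---|---|---|---|
| 7 | 71 | PL | 7 | Blue |
| 9 | 2 | XX | 9 | Gray |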


In [None]:
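# same rows; columns: from the first column up to and including "age"
people.loc[~(people["country"] == "ES") & (people["siblings"] > 3), :"age"]

|   | name | age |
|---|---|---|
| 7 | Alex Kubiak | 71 |
| 9 | Dog Woof | 2 |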


In [None]:
# alternative code to achieve the same result: using != instead of ~
people.loc[(people["country"]!="ES") & (people["siblings"]>3), :]

|   | name | age | country | siblings | favourite_colour |
|---|---|---|---|---|---|
| 7 | Alex Kubiak | 71 | PL | 7 | Blue |
| 9 | Dog Woof | 2 | XX | 9 | Gray |
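To confirm that the two spellings are interchangeable for this data, the following sketch compares the masks directly. It uses a trimmed copy of `people` (only the `country` column, an assumption made for brevity).

```python
import pandas as pd

# Trimmed copy of `people`: only the column this sketch needs.
people = pd.DataFrame({
    "country": ["DE", "ES", "ES", "PL", "IN", "FR", "IN", "PL", "UK", "XX"],
})

# Negating an equality test ...
not_spain_tilde = ~(people["country"] == "ES")
# ... gives the same mask as testing for inequality.
not_spain_neq = people["country"] != "ES"

print(not_spain_tilde.equals(not_spain_neq))
print(not_spain_neq.sum())  # number of people not from Spain
```

`!=` is shorter for a single comparison, while `~` is also useful for negating a mask that has no direct opposite operator.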


In [None]:
people.loc[:,["age","siblings"]]

|   | age | siblings |
|---|---|---|
| 0 | 22 | 2 |
| 1 | 50 | 0 |
| 2 | 23 | 4 |
| 3 | 29 | 1 |
| 4 | 44 | 1 |
| 5 | 30 | 2 |
| 6 | 25 | 3 |
| 7 | 71 | 7 |
| 8 | 35 | 0 |
| 9 | 2 | 9 |
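A boolean mask for the rows and a list of column names can be combined in one `.loc[]` call. The sketch below is one possible example, using a trimmed copy of `people` (two columns only); the condition "more than 3 siblings" is reused from the cells above.

```python
import pandas as pd

# Trimmed copy of `people`: only the columns this sketch needs.
people = pd.DataFrame({
    "name": ["Erika Schumacher", "Javi López", "Maria Rovira", "Ana Gromek",
             "Shekhar Biswas", "Muriel Adams", "Saira Polom", "Alex Kubiak",
             "Kit Ching", "Dog Woof"],
    "siblings": [2, 0, 4, 1, 1, 2, 3, 7, 0, 9],
})

many_siblings = people["siblings"] > 3

# First argument: which rows to keep (boolean mask).
# Second argument: which columns to keep (list of labels).
subset = people.loc[many_siblings, ["name", "siblings"]]
print(subset)
```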


Now let's select the people who come from Spain or Germany.

In [None]:
people.loc[(people["country"]=="ES") | (people["country"]=="DE")]

|   | name | age | country | siblings | favourite_colour |
|---|---|---|---|---|---|
| 0 | Erika Schumacher | 22 | DE | 2 | Red |
| 1 | Javi López | 50 | ES | 0 | Yellow |
| 2 | Maria Rovira | 23 | ES | 4 | Yellow |


In [None]:
people.loc[people["country"].isin(["ES", "DE"])]

|   | name | age | country | siblings | favourite_colour |
|---|---|---|---|---|---|
| 0 | Erika Schumacher | 22 | DE | 2 | Red |
| 1 | Javi López | 50 | ES | 0 | Yellow |
| 2 | Maria Rovira | 23 | ES | 4 | Yellow |
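The two cells above return the same rows. As a small sketch, on a trimmed copy of `people` (two columns only, an assumption made for brevity), the masks can be compared directly, and `~` can be applied to `.isin()` to get everyone from the remaining countries.

```python
import pandas as pd

# Trimmed copy of `people`: only the columns this sketch needs.
people = pd.DataFrame({
    "name": ["Erika Schumacher", "Javi López", "Maria Rovira", "Ana Gromek",
             "Shekhar Biswas", "Muriel Adams", "Saira Polom", "Alex Kubiak",
             "Kit Ching", "Dog Woof"],
    "country": ["DE", "ES", "ES", "PL", "IN", "FR", "IN", "PL", "UK", "XX"],
})

# Two equality tests joined with | ...
with_or = (people["country"] == "ES") | (people["country"] == "DE")
# ... produce the same mask as a single membership test.
with_isin = people["country"].isin(["ES", "DE"])
print(with_or.equals(with_isin))

# Negating the membership test keeps everyone who is NOT from Spain or Germany.
elsewhere = people.loc[~with_isin, "name"]
print(elsewhere.tolist())
```

`.isin()` tends to stay readable as the list of accepted values grows, whereas chaining `|` needs one more parenthesised comparison per value.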


In [None]:
# the | operator combines boolean Series; it is not defined for plain strings
"ES" | "DE"

TypeError: unsupported operand type(s) for |: 'str' and 'str'

## 2. Challenges

### Exercise 1

Filter the `people` DataFrame and keep only people from the UK.

In [None]:
# your code here

### Exercise 2

Filter the `people` DataFrame and keep only people from either India or France.

In [None]:
# your code here

### Exercise 3

Filter the `people` DataFrame and keep only people from either Poland or Germany who have 2 or more siblings.

In [None]:
# your code here