_This teaching material can be freely used, distributed and modified as per the [CC-BY-SA 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) license._   
 _Authors: Jana Lasser (jana.lasser@ds.mpg.de), Debsankha Manik (debsankha.manik@ds.mpg.de)._  
 _This teaching material is created for the project "Daten Lesen Lernen", funded by Stifterverband and the Heinz Nixdorf Stiftung._

# Exercise 3
### Some tips
$\rightarrow$ Take a look at the exercises at home and try to solve as many tasks as you can. In the tutorials you can continue with the tasks you could not finish at home and, if necessary, ask the tutors for help.

$\rightarrow$ If you encounter an error:
1. _Read_ and _understand_ the error message. 
2. Try to find the solution to the problem yourself ($\rightarrow$ this may be hard at the beginning, but it is very educational!)
3. Search for the problem using a search engine (hint: Stack Overflow) or ask your neighbour!
4. Ask the tutors.

$\rightarrow$ Look out for the sign <font color='green'>**HINT**</font>: these are hints that help you solve the task and sometimes give additional information about it.

$\rightarrow$ Tasks marked **(Optional)** are for those of you who are extra fast :-).

### Logic, conditionals and filters
1. **Logic**
  1. Write a program to check if a word contains the letter 'e' *and* is shorter than 5 letters. 
  2. Write a program to check if a number is odd or even. 
  <font color='green'>**HINT**: The modulo operator ```%``` computes the remainder of a division:</font>

In [1]:
print(11 % 2)
print(10 % 2)

1
0
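
Several checks can be combined into one with the keyword ```and```. The following is only a minimal sketch on made-up values (the variable names ```number``` and ```word``` are our own choice), not a solution to the tasks above:

```python
# a made-up number to test
number = 12

# each comparison yields either True or False
is_even = number % 2 == 0
is_large = number > 10

# "and" is True only if both checks are True
both = is_even and is_large
print(both)

# for strings, "in" checks for a letter and len() counts the letters
word = 'data'
print('a' in word)
print(len(word))
```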


2. **Conditionals**
  1. Create a list containing integers between 0 and 20. <font color='green'>**Hint**: Use the ```range()``` function!</font>
  2. Write a program that iterates over the list with a ```for``` loop and does the following:
    * multiplies all numbers less than 10 by 2 and prints the result. 
    * divides all numbers greater than or equal to 10 by 2 and prints the result. 
    * treats the number 15 specially and prints a warning message when this number is reached.  
    
  <font color='green'>**HINT:** The necessary keywords are ```if```, ```elif``` and ```else```. </font>
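
How the three keywords work together can be sketched on a smaller, made-up example (the labels and the special number 3 are our own assumptions, not part of the task). Note that the special case is checked first, because otherwise one of the more general branches would catch it:

```python
# collect the labels so we can look at them afterwards
labels = []

# range(6) yields the integers 0, 1, 2, 3, 4, 5
for number in range(6):
    if number == 3:
        # the special case has to come before the general ones
        label = 'special'
    elif number < 3:
        label = 'small'
    else:
        label = 'large'
    labels.append(label)
    print(number, label)
```

Only the first branch whose condition is ```True``` is executed; all later branches are skipped for that number.
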
3. **Filter (Titanic survivors by gender)** 
  <font color='green'>**HINT:** If you haven't gone through filtering DataFrames yet, or missed the lecture, look at the recap section below. It explains the procedure with a simple example.</font>  
  
  1. Read the Titanic dataset into a ```Pandas``` DataFrame, as you did in Exercise 2. Create a table (DataFrame) containing only the passengers who survived. How many are there in total?
  2. Out of the survivors table, create two further tables, one containing surviving women and the other containing surviving men.
  3. Find out the number of men and women originally on board and the numbers among the survivors.
  4. What was the survival probability for a man and for a woman on board the Titanic? 
  5. Can you explain why the gap in survival probability is so large between genders? Can you use your domain knowledge or knowledge of history to explain the causal relationship between gender and survival probability? In the absence of any prior knowledge, could one come up with some other explanation?
  6. **(Optional)**: Do the same analysis as above, but with class of travel rather than gender (```pclass``` or ```class```). How large are the differences?

### Recap: filtering a DataFrame
Filtering a DataFrame by a certain criterion is an important technique in data analysis.

In [2]:
# import Pandas with the usual alias pd
import pandas as pd

# we create a test dataframe containing names
# (strings) and age (numbers) 

# first create a list of names, then the list of ages
names = ['tim','olaf','nina','andrea','jana', 'miro']
ages = [31, 52, 14, 46, 28, 3]

# then create the table (dataframe) "people"
# by supplying the two lists as columns
people = pd.DataFrame({'name': names, 'age': ages})

# display the first 6 rows of the table
people.head(6)

| | age | name |
|---|---|---|
| 0 | 31 | tim |
| 1 | 52 | olaf |
| 2 | 14 | nina |
| 3 | 46 | andrea |
| 4 | 28 | jana |
| 5 | 3 | miro |


To display only some specific rows of the table, we can create a "mask" that tells pandas which rows should be included and which omitted. 

The mask is a list of boolean values (```True``` and ```False```); the length of that list must equal the number of rows of the table. 

In [3]:
# create the mask
mask = [False, True, False, True, False, True]

# use the mask to display only 2nd, 4th and 6th entries
people[mask]

| | age | name |
|---|---|---|
| 1 | 52 | olaf |
| 3 | 46 | andrea |
| 5 | 3 | miro |
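
What happens if the mask is too short? The sketch below rebuilds the same table and tries a mask with only two entries. In the pandas versions we are aware of this raises a ```ValueError```; the exact wording of the message may differ between versions. The names ```short_mask``` and ```error_raised``` are our own.

```python
import pandas as pd

# the same test table as above
names = ['tim', 'olaf', 'nina', 'andrea', 'jana', 'miro']
ages = [31, 52, 14, 46, 28, 3]
people = pd.DataFrame({'name': names, 'age': ages})

# a mask with 2 entries for a table with 6 rows
short_mask = [True, False]

try:
    people[short_mask]
    error_raised = False
except ValueError as error:
    # pandas refuses a mask whose length does not match the table
    error_raised = True
    print('pandas refused the mask:', error)
```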


In [4]:
# it's of course possible to store the filtered
# dataframe in a new variable
filtered_table = people[mask]
filtered_table.head()

| | age | name |
|---|---|---|
| 1 | 52 | olaf |
| 3 | 46 | andrea |
| 5 | 3 | miro |


The next step is to create the mask automatically. Say we want to know which people in the table are at least 30 years old. We do it with the "greater-than-or-equal-to" operator ```>=```.

In [5]:
people['age'] >= 30

0     True
1     True
2    False
3     True
4    False
5    False
Name: age, dtype: bool
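
Such a boolean column can also be used for counting, because ```True``` counts as 1 and ```False``` as 0 when summed. A minimal sketch on the same test table (the variable name ```n_matching``` is our own):

```python
import pandas as pd

# the same test table as above
names = ['tim', 'olaf', 'nina', 'andrea', 'jana', 'miro']
ages = [31, 52, 14, 46, 28, 3]
people = pd.DataFrame({'name': names, 'age': ages})

# boolean mask: True for everyone aged 30 or older
mask = people['age'] >= 30

# summing the mask counts the True entries
n_matching = mask.sum()
print(n_matching)

# equivalently, count the rows of the filtered table
print(len(people[mask]))
```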

In [6]:
# one can also save the mask as a variable
old_mask = people['age'] >= 30

# then we filter the table with the stored age mask
old_people = people[old_mask]
old_people.head()

| | age | name |
|---|---|---|
| 0 | 31 | tim |
| 1 | 52 | olaf |
| 3 | 46 | andrea |


In the same way, we can find the younger people in the table using the "less-than" operator ```<```:

In [7]:
# create the mask
young_mask = people['age'] < 30

# filter the table
young_people = people[young_mask]
young_people

| | age | name |
|---|---|---|
| 2 | 14 | nina |
| 4 | 28 | jana |
| 5 | 3 | miro |


Next, people with short names (using the "less-than-or-equal" operator ```<=```):

In [8]:
# check the length of each entry in the column 'name'
short_name_mask = people['name'].str.len() <= 4

# filter the table with the mask
short_names = people[short_name_mask]
short_names.head()

| | age | name |
|---|---|---|
| 0 | 31 | tim |
| 1 | 52 | olaf |
| 2 | 14 | nina |
| 4 | 28 | jana |
| 5 | 3 | miro |
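
The ```.str``` accessor offers more string checks than just the length. As a hedged sketch, ```.str.contains()``` builds a mask of all names that contain a given letter (the choice of the letter 'a' and the variable names are our own):

```python
import pandas as pd

# the same test table as above
names = ['tim', 'olaf', 'nina', 'andrea', 'jana', 'miro']
ages = [31, 52, 14, 46, 28, 3]
people = pd.DataFrame({'name': names, 'age': ages})

# True for every name that contains the letter 'a'
contains_a = people['name'].str.contains('a')

# filter the table with the mask
names_with_a = people[contains_a]
print(names_with_a)
```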


You can apply the same principles to any DataFrame, in particular the Titanic one. You can thus filter the data by specific properties of the passengers (e.g. class or gender).
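
Two masks can be combined with the operator ```&``` ("and") to select rows that fulfil both criteria, and dividing two row counts gives a share. The sketch below uses a made-up toy table; the column names ```group``` and ```passed``` are purely hypothetical and are not the Titanic column names:

```python
import pandas as pd

# hypothetical toy table, not the Titanic data
toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b', 'b'],
    'passed': [1, 0, 1, 1, 0],
})

# one mask per criterion
in_group_b = toy['group'] == 'b'
has_passed = toy['passed'] == 1

# rows that are in group 'b' AND have passed
b_and_passed = toy[in_group_b & has_passed]

# share of group 'b' that passed
share = len(b_and_passed) / len(toy[in_group_b])
print(share)
```

If the comparisons are written directly inside the square brackets, each one needs its own parentheses, e.g. ```toy[(toy['group'] == 'b') & (toy['passed'] == 1)]```.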