# Notes

This assignment is devoted to `pandas`. It covers indexing and filtering, and some `groupby` and `join` operations. The assignment roughly corresponds to Week 4 and the beginning of Week 5 of the course.

The main dataset you'll be using is [Titanic](https://www.kaggle.com/c/titanic). Please note that you must not rely on any specific location for the dataset, so any code like

```python
titanic_train = pd.read_csv("<location>/train.csv")
```

will fail and your notebook won't be validated and graded. Inputs to the functions are described explicitly in each case, and that's the only thing you can rely on.
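A minimal sketch of the expected pattern, assuming a hypothetical function name `count_passengers` and a made-up toy dataframe: the function works only with the dataframe it receives and never reads files itself.

```python
import pandas as pd


def count_passengers(df: pd.DataFrame) -> int:
    """Hypothetical example: rely only on the dataframe passed in."""
    return len(df)


# Toy stand-in for the real input; the grader supplies its own dataframe.
toy = pd.DataFrame(
    {"Name": ["A", "B"], "Age": [22.0, None]},
    index=pd.Index([1, 2], name="PassengerId"),
)
n_passengers = count_passengers(toy)
```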

In [2]:
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use("bmh")


In [3]:
plt.rcParams["figure.figsize"] = (6,6)

In [4]:
import numpy as np
import pandas as pd

train = pd.read_csv("../train.csv", index_col="PassengerId")
test = pd.read_csv('../test.csv', index_col="PassengerId")


# Concatenate train and test; Survived is NaN for the test rows
data = pd.concat([train, test], sort=False)





In [5]:



STUDENT = "Amir Alikulov and Ruslan Shuvalov "
ASSIGNMENT = 4
TEST = False

In [6]:
if TEST:
    import solutions
    total_grade = 0
    MAX_POINTS = 21

# Indexing and filtering

### 1. Fixing age (1 point).

There are several known mistakes in the Titanic dataset.

Namely, [Julia Florence Siegel](https://www.encyclopedia-titanica.org/titanic-survivor/julia-florence-cavendish.html) (Mrs. Tyrell William Cavendish) is mistakenly marked as being 76 years old (the age at which she actually died, many years after the Titanic sank).

You must **replace the corresponding age value in the dataframe with her actual age at the time** (25) and return the dataset. Input is **indexed** with `PassengerId` and is a **concatenation of train and test sets**. You must return a copy of the input dataframe, and not perform replacement in the original dataframe. Structure and indexing must be the same as in input.

In [7]:
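def fix_age(df):
    """Return a copy of df with Julia Florence Siegel's age set to 25."""
    fixed = df.copy()
    # regex=False: match the name as a plain substring
    fixed.loc[fixed['Name'].str.contains('Julia Florence Siegel', regex=False), 'Age'] = 25
    return fixed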
    
    

In [8]:
PROBLEM_ID = 1

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, fix_age)

### 2. Embarkation port distribution (1 point).

Find the value counts of the embarkation port (`Embarked` column) for male passengers who travelled in 3rd class and were between 20 and 30 years old (both inclusive). There is no need to treat missing values separately.

Input is **indexed** with `PassengerId` and is a **concatenation of train and test sets**. You must return a **series**, indexed with values from `Embarked`, following the semantics of the `.value_counts()` method:

```
S    <number of male passengers in 3rd class, embarked at S, 20<=Age<=30>
C    <...>
Q    <...>
Name: Embarked, dtype: int64
```
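The filter-then-count pattern can be sketched on a toy dataframe. The data below is made up for illustration, and `Series.between` is just one way to express the inclusive age range.

```python
import pandas as pd

# Made-up rows; only the column names follow the Titanic dataset.
toy = pd.DataFrame(
    {
        "Pclass": [3, 3, 3, 1, 3],
        "Sex": ["male", "male", "female", "male", "male"],
        "Age": [20.0, 30.0, 25.0, 22.0, None],
        "Embarked": ["S", "S", "C", "Q", "S"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)

# Series.between is inclusive on both ends by default; NaN ages yield False.
mask = (toy["Pclass"] == 3) & (toy["Sex"] == "male") & toy["Age"].between(20, 30)
counts = toy.loc[mask, "Embarked"].value_counts()
```

In pandas 2.0 and later the resulting series is named `count` and its index is named `Embarked`; earlier versions name the series itself `Embarked`.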

In [9]:
data.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [10]:
data.Embarked.value_counts()

Embarked
S    914
C    270
Q    123
Name: count, dtype: int64

In [11]:

def embarked_stats(df):
    """Calculate embarkment port statistics."""
    required_columns = ['Age', 'Pclass', 'Sex', 'Embarked']
    if not all(col in df.columns for col in required_columns):
        raise ValueError(f"Missing required columns. Need: {required_columns}")
    filtered_df = df.loc[(df.Age >= 20) & (df.Age <= 30) & (df.Pclass == 3) & (df.Sex == 'male')]
    return filtered_df.Embarked.value_counts()

embarked_stats(data)


Embarked
S    132
C     21
Q      7
Name: count, dtype: int64

In [12]:
PROBLEM_ID = 2

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, embarked_stats)

### 3. Fill missing age values (1 point).

Some age values are missing in the Titanic dataset. You need to calculate the average age over all passengers and use it to fill the missing values in the `Age` column.

Input is **indexed** with `PassengerId` and is a **concatenation of train and test sets**. Output must be a **new** dataframe with the same structure, but without missing values in `Age` column.

In [13]:
def fix_age(df):
    """Return a copy of df with missing ages filled by the global mean age."""
    # Redefines fix_age from problem 1, because the check cell below passes this name
    filled = df.copy()
    filled['Age'] = filled['Age'].fillna(filled['Age'].mean())
    return filled

filled = fix_age(data)

print("Original missing ages:", data['Age'].isna().sum())
print("New missing ages:", filled['Age'].isna().sum())
print("Age statistics:", filled['Age'].describe())


Original missing ages: 263
New missing ages: 0
Age statistics: count    1309.000000
mean       29.881138
std        12.883193
min         0.170000
25%        22.000000
50%        29.881138
75%        35.000000
max        80.000000
Name: Age, dtype: float64


In [14]:
PROBLEM_ID = 3

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, fix_age)

### 4. Child travelling alone (1 point).

You must find a child (`Age<10`) on board who was travelling without siblings or parents, and find the name of her nursemaid.

Input is **indexed** with `PassengerId` and is a **concatenation of train and test sets**. Output must be a **tuple** of two strings, taken from the `Name` column: the first is the child's name and the second is the nursemaid's name. It is known that there is **only one child** like this.

In [15]:
def get_nursemaid(df):
    """Return (child's name, nursemaid's name) for the only child travelling without family."""
    child = df.loc[(df.Age < 10) & (df.SibSp == 0) & (df.Parch == 0)].iloc[0]
    # Assumption: the nursemaid is the other passenger on the child's ticket
    same_ticket = df.loc[(df.Ticket == child.Ticket) & (df.Name != child.Name)]
    return child.Name, same_ticket.iloc[0].Name

get_nursemaid(pd.concat([train, test], sort=False))

In [16]:
PROBLEM_ID = 4

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, get_nursemaid)

### 5. Port with the most children embarked (1 point).

You must find which port had the largest percentage of children (`Age<10`) embarked, i.e. the number of children divided by the total number of passengers who embarked there.

Input is **indexed** with `PassengerId` and is a **concatenation of train and test sets**. Output must be a **single string** with port letter.
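One way to get the share of children per port is to average a boolean mask within each group. This is a sketch on made-up data, not necessarily the approach the grader expects.

```python
import pandas as pd

# Made-up rows; only the column names follow the Titanic dataset.
toy = pd.DataFrame(
    {
        "Age": [5.0, 40.0, 8.0, 30.0, None, 2.0],
        "Embarked": ["S", "S", "C", "C", "C", "Q"],
    }
)

# Comparisons with NaN are False, so passengers of unknown age count as non-children.
is_child = toy["Age"] < 10
# The mean of a boolean series is the fraction of True values in each group,
# i.e. children divided by all passengers of that port.
child_share = is_child.groupby(toy["Embarked"]).mean()
top_port = child_share.idxmax()
```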

In [17]:
def get_port(df):
    """Return the letter of the port with the largest share of children (Age < 10)."""
    children_per_port = df.loc[df.Age < 10].groupby('Embarked').size()
    total_per_port = df.groupby('Embarked').size()
    return (children_per_port / total_per_port).idxmax()

port_with_largest_perc_children_embarked = get_port(pd.concat([train, test], sort=False))

In [18]:
PROBLEM_ID = 5

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, get_port)

### 6. Passengers per ticket (2 points).

Calculate average and maximum number of passengers per ticket.

Input is **indexed** with `PassengerId` and is a **concatenation of train and test sets**. Output must be a **tuple** of two values - average and maximum number of passengers per ticket.
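When counting passengers per ticket, `groupby(...).size()` and `groupby(...).count()` are not interchangeable. The toy example below (made-up data) sketches the difference.

```python
import pandas as pd

# Made-up rows; only the column names follow the Titanic dataset.
toy = pd.DataFrame(
    {
        "Ticket": ["A1", "A1", "B2", "A1"],
        "Age": [30.0, None, 41.0, 7.0],
    }
)

# size() counts rows per group and returns a Series.
per_ticket = toy.groupby("Ticket").size()
# count() counts non-null values per column and returns a DataFrame.
non_null = toy.groupby("Ticket").count()

stats = (per_ticket.mean(), per_ticket.max())
```

Here ticket `A1` has three passengers according to `size()`, while `count()` reports two for `Age` because one age is missing.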

In [39]:

def get_ticket_stats(df):
    """Return (average, maximum) number of passengers per ticket."""
    ticket_size = df.groupby('Ticket').size()
    return ticket_size.mean(), ticket_size.max()

average_passengers_per_ticket, max_passengers_per_ticket = get_ticket_stats(
    pd.concat([train, test], sort=False)
)





size() method: 0.3671 seconds
count() method: 0.5111 seconds

Results identical: False

size() result memory: 14864 bytes
count() result memory: Index       7432
Survived    7432
Pclass      7432
Name        7432
Sex         7432
Age         7432
SibSp       7432
Parch       7432
Fare        7432
Cabin       7432
Embarked    7432
dtype: int64 bytes


Ticket
110152         3
110413         3
110465         2
110469         1
110489         1
              ..
W./C. 6608     5
W./C. 6609     1
W.E.P. 5734    2
W/C 14208      1
WE/P 5735      2
Length: 929, dtype: int64

In [20]:
PROBLEM_ID = 6

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, get_ticket_stats)

### 7. Fare per passenger (3 points).

The `Fare` column shows the fare for the entire ticket, not the price paid by one particular passenger on that ticket.
For each individual ticket, you should calculate the **fare per person for that ticket**, and then calculate the averages for each class. Note that you will need to apply `groupby`, and you may consider using `.first()` of the resulting `DataFrameGroupBy`. Also, carefully consider the order in which the calculations are performed.

Input is **indexed** with `PassengerId` and is a **concatenation of train and test sets**. Output must be `pd.Series` with three elements, indexed by class:

```
1    <average per person fare in class 1>
2    <...>
3    <...>
Name: Pclass, dtype: float64
```
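The order of calculations matters because averaging over tickets and averaging over passengers give different numbers. The sketch below uses made-up data and assumes each ticket belongs to a single class. Which of the two averages the grader expects is not stated here.

```python
import pandas as pd

# Made-up rows; only the column names follow the Titanic dataset.
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 3],
        "Ticket": ["T1", "T1", "T2", "T3"],
        "Fare": [100.0, 100.0, 30.0, 8.0],
    }
)

tickets = toy.groupby("Ticket")
# Every row of a ticket repeats the ticket's total fare, so take the first
# value and divide by the number of passengers on that ticket.
per_ticket = pd.DataFrame(
    {
        "Pclass": tickets["Pclass"].first(),
        "fare_per_person": tickets["Fare"].first() / tickets.size(),
    }
)

# Option A: average over tickets (one row per ticket).
avg_over_tickets = per_ticket.groupby("Pclass")["fare_per_person"].mean()

# Option B: map the per-person fare back to passengers, then average.
per_passenger = toy["Ticket"].map(per_ticket["fare_per_person"])
avg_over_passengers = per_passenger.groupby(toy["Pclass"]).mean()
```

In class 1 the two options differ: option A gives (50 + 30) / 2 = 40, while option B gives (50 + 50 + 30) / 3, because the two-person ticket is counted twice.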

In [73]:


def get_fare_per_pass(df):
    """Return the mean per-person fare for each class."""
    tickets = df.groupby('Ticket')
    # Fare is the price of the whole ticket, so divide it by the ticket's head count
    fare_per_person_per_ticket = tickets.Fare.first() / tickets.size()
    # Map every passenger to their ticket's per-person fare, then average by class
    fare_per_person = df['Ticket'].map(fare_per_person_per_ticket).rename('Fare_per_person')
    return fare_per_person.groupby(df['Pclass']).mean()

get_fare_per_pass(pd.concat([train, test], sort=False))


Pclass
1    33.910500
2    11.411010
3     7.328701
Name: Fare_per_person, dtype: float64

In [56]:
data


| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | NaN | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
| 1306 | NaN | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 1307 | NaN | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
| 1308 | NaN | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |


In [76]:
PROBLEM_ID = 7

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, get_fare_per_pass)

### 8. Fill missing age values (3 points).

In problem 3 you filled missing age values with the global average over all passengers. Now you need to fill them **according to class and sex**. For example, for a female passenger from 2nd class, a missing age value must be filled with the average age of females in 2nd class.

In this problem, you may need joins and `.apply()`, although there are several ways to get the same result.

Input is **indexed** with `PassengerId` and is a **concatenation of train and test sets**. Output must be a **new** dataframe with the same structure as input, but without missing values in `Age` column.
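The join-based route mentioned above can be sketched as follows. The data is made up, and the column name `group_mean_age` is a hypothetical helper name introduced only for this example.

```python
import pandas as pd

# Made-up rows; only the column names follow the Titanic dataset.
toy = pd.DataFrame(
    {
        "Pclass": [2, 2, 2, 3],
        "Sex": ["female", "female", "female", "male"],
        "Age": [20.0, 30.0, None, 40.0],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Mean age per (class, sex) group, as a Series with a two-level index.
group_mean = toy.groupby(["Pclass", "Sex"])["Age"].mean().rename("group_mean_age")
# Joining on the two key columns keeps the PassengerId index and the row order.
joined = toy.join(group_mean, on=["Pclass", "Sex"])
# assign returns a new dataframe, so the input stays untouched.
filled = toy.assign(Age=toy["Age"].fillna(joined["group_mean_age"]))
```

Passenger 3 receives 25.0, the mean of the two known ages of 2nd class females, while `toy` keeps its missing value.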

In [106]:
# Concatenate train and test; Survived is NaN for the test rows
df = pd.concat([train, test], sort=False)

# Mean age per (class, sex) group, used below to fill the missing ages
df.groupby(['Pclass', 'Sex'])['Age'].mean()

Pclass  Sex   
1       female    37.037594
        male      41.029272
2       female    27.499223
        male      30.815380
3       female    22.185329
        male      25.962264
Name: Age, dtype: float64

In [113]:
def fix_age_groupped(df):
    """Return a copy of df with missing ages filled by the mean age of the same class and sex."""
    filled = df.copy()
    # transform('mean') broadcasts each group's mean back to the rows of that group
    group_mean_age = filled.groupby(['Pclass', 'Sex'])['Age'].transform('mean')
    filled['Age'] = filled['Age'].fillna(group_mean_age)
    return filled

data = fix_age_groupped(df)



In [112]:
data.loc[6]

Survived                 0.0
Pclass                     3
Name        Moran, Mr. James
Sex                     male
Age                25.962264
SibSp                      0
Parch                      0
Ticket                330877
Fare                  8.4583
Cabin                    NaN
Embarked                   Q
Name: 6, dtype: object

In [108]:


df[df.Age.isna()]


| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0.0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 18 | 1.0 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
| 20 | 1.0 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
| 27 | 0.0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C |
| 29 | 1.0 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1300 | NaN | 3 | Riordan, Miss. Johanna Hannah"" | female | NaN | 0 | 0 | 334915 | 7.7208 | NaN | Q |
| 1302 | NaN | 3 | Naughton, Miss. Hannah | female | NaN | 0 | 0 | 365237 | 7.7500 | NaN | Q |
| 1305 | NaN | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
| 1308 | NaN | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |


In [105]:
data

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Average_age_per_class_per_sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | NaN |
| 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | NaN |
| 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | NaN |
| 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.000000 | 1 | 0 | 113803 | 53.1000 | C123 | S | NaN |
| 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.000000 | 0 | 0 | 373450 | 8.0500 | NaN | S | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | NaN | 3 | Spector, Mr. Woolf | male | 25.962264 | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S | NaN |
| 1306 | NaN | 1 | Oliva y Ocana, Dona. Fermina | female | 39.000000 | 0 | 0 | PC 17758 | 108.9000 | C105 | C | NaN |
| 1307 | NaN | 3 | Saether, Mr. Simon Sivertsen | male | 38.500000 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S | NaN |
| 1308 | NaN | 3 | Ware, Mr. Frederick | male | 25.962264 | 0 | 0 | 359309 | 8.0500 | NaN | S | NaN |


In [24]:
PROBLEM_ID = 8

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, fix_age_groupped)

### 9. Finding couples (3 points).

Based on the code from Lecture 5, build a dataframe of couples. Filter it by survival status: select those couples in which only one of the spouses survived, or neither of the two. Build survival statistics by class, i.e. the number of couples with partial survival or couples who died together, divided by the total number of couples in the class. If the survival status of one or both spouses is not known, it must be considered as `0`.

Input is **indexed** with `PassengerId` and is a **concatenation of train and test sets**. Output must be `Series` with three elements indexed by values from `Pclass` column (see P7 as a reference).
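The final ratio step can be sketched once a couples table exists. The table below is made up, and its column names (`Survived_husband`, `Survived_wife`) are assumptions: the Lecture 5 code may produce a different layout, and building the couples themselves is not shown here.

```python
import pandas as pd

# Hypothetical couples table, one row per couple (column names are assumptions).
couples = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 3, 3],
        "Survived_husband": [1.0, 0.0, None, 0.0, 1.0],
        "Survived_wife": [1.0, 1.0, 1.0, None, 1.0],
    }
)

# Unknown survival status counts as 0, as the task requires.
survived = couples[["Survived_husband", "Survived_wife"]].fillna(0)
# A couple qualifies when fewer than two spouses survived.
not_both = survived.sum(axis=1) < 2
# Mean of the boolean flag = qualifying couples / all couples in the class.
ratio_by_class = not_both.groupby(couples["Pclass"]).mean()
```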

In [25]:
def find_couples(df):
    """Your code here."""
    pass

In [26]:
PROBLEM_ID = 9

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, find_couples)

### 10. Lonely Passengers (2 points).

Count the number of passengers per class who were either travelling alone (no siblings/spouses/parents/children, and holding a ticket bought for one person) or had a ticket number beginning with "P", **but not both**.

Note that a passenger travelling alone with a ticket "PC 1234" **should NOT be counted**, but a passenger travelling alone with a ticket "AC 11" should be counted.

Input is **indexed** with `PassengerId` and is a **concatenation of train and test sets**.\
Output must be `Series` with three elements indexed by values from `Pclass` column.
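The "either, but not both" condition is an exclusive OR of two boolean masks. A minimal sketch on made-up data, assuming that "a ticket for one" means no other passenger shares the ticket:

```python
import pandas as pd

# Made-up rows; only the column names follow the Titanic dataset.
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 3, 3],
        "SibSp": [0, 0, 1, 0, 0],
        "Parch": [0, 0, 0, 0, 0],
        "Ticket": ["PC 1234", "AC 11", "P 77", "X 1", "X 1"],
    }
)

# Number of passengers sharing each row's ticket.
ticket_size = toy["Ticket"].map(toy["Ticket"].value_counts())
alone = (toy["SibSp"] == 0) & (toy["Parch"] == 0) & (ticket_size == 1)
p_ticket = toy["Ticket"].str.startswith("P")

# ^ is element-wise XOR on boolean series: exactly one condition holds.
either_not_both = alone ^ p_ticket
# Summing booleans per class counts the qualifying passengers.
counts = either_not_both.groupby(toy["Pclass"]).sum()
```

The lone holder of "PC 1234" satisfies both conditions and is excluded, while the lone holder of "AC 11" is counted.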

In [27]:
def lonely_or_p(df):
    """Your code here."""
    pass

In [28]:
PROBLEM_ID = 10

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, lonely_or_p)

### 11. Family Tickets (3 points).

For each class, find the proportion of family tickets, i.e. tickets where all passengers have the same last name.\
Note that by that definition, even a 1-passenger ticket is considered a family ticket.

Input is **indexed** with `PassengerId` and is a **concatenation of train and test sets**. Output must be `Series` with three elements indexed by values from `Pclass` column, with values between 0 and 1, containing the proportion of family tickets for that class.
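A possible sketch on made-up names, assuming the last name is the part of `Name` before the first comma and that every ticket belongs to a single class:

```python
import pandas as pd

# Made-up passengers; only the column names follow the Titanic dataset.
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 3, 3],
        "Ticket": ["T1", "T1", "T2", "T3", "T3"],
        "Name": [
            "Smith, Mr. John",
            "Smith, Mrs. John (Anna Lee)",
            "Brown, Miss. Eva",
            "Olsen, Mr. Karl",
            "Berg, Mr. Nils",
        ],
    }
)

# Assumption: the last name is everything before the first comma.
last_name = toy["Name"].str.split(",").str[0].str.strip()
per_ticket = pd.DataFrame(
    {
        "Pclass": toy.groupby("Ticket")["Pclass"].first(),
        # A family ticket has exactly one distinct last name.
        "is_family": last_name.groupby(toy["Ticket"]).nunique() == 1,
    }
)
# Mean of the boolean flag = family tickets / all tickets in the class.
family_share = per_ticket.groupby("Pclass")["is_family"].mean()
```

The single-passenger ticket `T2` counts as a family ticket, in line with the definition above.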

In [29]:
def family_tickets(df):
    """Your code here."""
    pass

In [30]:
PROBLEM_ID = 11

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, family_tickets)

# Your grade

In [31]:
if TEST:
    print(f"{STUDENT}: {int(100 * total_grade / MAX_POINTS)}")