# Titanic passenger dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets

## Case description 

This case study demonstrates basic techniques for data exploration, data quality profiling, and mitigation of data quality problems. The Titanic passenger data set is used to illustrate them; it is accessed through the scikit-learn library, so the code can run in any environment without further dependencies.

## Data access and exploration

The data can be accessed through the fetch_openml function, which reads it locally if it is available and downloads it from openml.org otherwise. Below we print the description that comes with the data and assign the data to a DataFrame to see a summary of it.

(In this example, this way of accessing the data is replaced further down by reading a local file, so that the notebook can run without an internet connection or if the openml.org site is no longer available. It is shown here for didactic purposes.)

In [2]:
# I access the data

bunch = datasets.fetch_openml('Titanic', version=1, as_frame=True)

# I show the description that is included with the data
# (I use print(...) instead of simply putting the variable on a line in the notebook so that
# its value is displayed, because this way the formatting characters are interpreted and the result is more readable)
# In the document linked at the beginning of the text it is possible to see a somewhat more detailed description
# of the meaning of each variable

print(bunch.DESCR)

# I access the data in the form of a Pandas DataFrame.

titanic = bunch.frame

### Data description (obtained by executing the code in the previous cell)

**Author**: Frank E. Harrell Jr., Thomas Cason  
**Source**: [Vanderbilt Biostatistics](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html)  
**Please cite**:   

The original Titanic dataset, describing the survival status of individual passengers on the Titanic. The titanic data does not contain information from the crew, but it does contain actual ages of half of the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were begun by a variety of researchers. One of the original sources is Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.

Thomas Cason of UVa has greatly updated and improved the titanic data frame using the Encyclopedia Titanica and created the dataset here. Some duplicate passengers have been dropped, many errors corrected, many missing ages filled in, and new variables created. 

For more information about how this dataset was constructed:
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt


**Attribute information**

The variables on our extracted dataset are pclass, survived, name, age, embarked, home.dest, room, ticket, boat, and sex. pclass refers to passenger class (1st, 2nd, 3rd), and is a proxy for socio-economic class. Age is in years, and some infants had fractional values. The titanic2 data frame has no missing data and includes records for the crew, but age is dichotomized at adult vs. child. These data were obtained from Robert Dawson, Saint Mary's University, E-mail. The variables are pclass, age, sex, survived. These data frames are useful for demonstrating many of the functions in Hmisc as well as demonstrating binary logistic regression analysis using the Design library. For more details and references see Simonoff, Jeffrey S (1997): The "unusual episode" and a second statistics course. J Statistics Education, Vol. 5 No. 1.

Downloaded from openml.org.

In [4]:
# I read the data from a local file

titanic = pd.read_parquet('./titanic.parquet')

In [5]:
# I show the first five lines of the DataFrame

titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,1,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [6]:
# .info() gives us information about the data type and the size of each column.
# As can be seen, numeric types have been interpreted as such, values that have only
# a small number of options are of categorical type, and those that have many options but
# are not numeric are of type “object”, so nothing needs to be modified. It is also interesting
# to see which fields have null values: the number of non-null values is shown for each column,
# and if it equals the total number of entries, given at the beginning of the output,
# the column has no null values. The .count() method of the DataFrame gives the same counts.

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1309 non-null   float64 
 1   survived   1309 non-null   category
 2   name       1309 non-null   object  
 3   sex        1309 non-null   category
 4   age        1046 non-null   float64 
 5   sibsp      1309 non-null   float64 
 6   parch      1309 non-null   float64 
 7   ticket     1309 non-null   object  
 8   fare       1308 non-null   float64 
 9   cabin      295 non-null    object  
 10  embarked   1307 non-null   category
 11  boat       486 non-null    object  
 12  body       121 non-null    float64 
 13  home.dest  745 non-null    object  
dtypes: category(3), float64(6), object(5)
memory usage: 116.7+ KB
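
The non-null counts above can also be turned directly into counts and shares of missing values. The following is a minimal sketch on a small made-up DataFrame (the values and the name `toy` are illustrative assumptions, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with missing entries in two columns.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [29.0, np.nan, 2.0, np.nan],
    "cabin": ["B5", None, None, None],
})

# Number of missing values per column.
missing_count = toy.isna().sum()

# Share of missing values per column (0.0 = complete, 1.0 = empty).
missing_share = toy.isna().mean()

print(missing_count)
print(missing_share)
```

Applied to the real data, `titanic.isna().sum()` would list the columns with gaps without having to subtract from the total by hand.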


In [7]:
# With .describe() we see a summary of the data. The information displayed differs depending on whether
# the columns are numeric (the only ones shown by default)
# or of other types (which can be included via .describe(include = 'all')).
# However, it is generally simpler to view each type of column separately.

# Description of the numeric columns...

titanic.describe()

Unnamed: 0,pclass,age,sibsp,parch,fare,body
count,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.837836,14.4135,1.041658,0.86556,51.758668,97.696922
min,1.0,0.1667,0.0,0.0,0.0,1.0
25%,2.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,39.0,1.0,0.0,31.275,256.0
max,3.0,80.0,8.0,9.0,512.3292,328.0


In [8]:
# Description of the categorical type columns
# 'count' refers to the total number of non-null cases, 'unique' to the number of possible options,
# 'top' to the most frequent option and 'freq' to the frequency of the most frequent option.


titanic.describe(include = 'category')

Unnamed: 0,survived,sex,embarked
count,1309,1309,1307
unique,2,2,3
top,0,male,S
freq,809,843,914


In [9]:
# Description of the object type columns

titanic.describe(include = 'object')

Unnamed: 0,name,ticket,cabin,boat,home.dest
count,1309,1309,295,486,745
unique,1307,929,186,27,369
top,"Kelly, Mr. James",CA. 2343,C23 C25 C27,13,"New York, NY"
freq,2,11,6,39,64


## Analysis and treatment of (some) data problems 

Next we attempt some simple analyses in which data quality problems come up, to see how to deal with them.

Looking at the description of the columns of type “object”, we see that the “name” column has 1309 non-null values (i.e. it is complete, since there are 1309 cases), but only 1307 unique values: this means that some names are repeated. Are the records duplicated? Can we use the passenger's name as a unique identifier?

In [10]:
# Show the rows of the DataFrame whose “name” field is duplicated.
# (keep = False is added to show all of them: by default, .duplicated() marks
# the first occurrence as “not duplicated” and any later identical ones as “duplicated”,
# so that selecting the records that are not “duplicated” gives one of each,
# not just the ones that have no duplicates.)

titanic[titanic.name.duplicated(keep = False)]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
725,3.0,1,"Connolly, Miss. Kate",female,22.0,0.0,0.0,370373,7.75,,Q,13.0,,Ireland
726,3.0,0,"Connolly, Miss. Kate",female,30.0,0.0,0.0,330972,7.6292,,Q,,,Ireland
924,3.0,0,"Kelly, Mr. James",male,34.5,0.0,0.0,330911,7.8292,,Q,,70.0,
925,3.0,0,"Kelly, Mr. James",male,44.0,0.0,0.0,363592,8.05,,S,,,


We see that there are two duplicated names, but they correspond to different people (one Kate Connolly survived the shipwreck and the other did not!), so they should not be deleted. The name should not be used as a unique identifier in this case (and, in general, almost never), because it can cause exactly this kind of problem. If they were not different people but a data duplication problem, you could simply do the following

    titanic.drop_duplicates(['name'])
    
to remove those cases that have the same name, although it is usually advisable to do something like

    titanic.drop_duplicates()
    
to delete only those records that have all the same data (if that is what you want, of course).
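
The difference between the two calls can be seen in a small sketch on made-up rows (the DataFrame `toy` is a hypothetical example: one pair of namesakes with different ages and one genuinely repeated record):

```python
import pandas as pd

# Hypothetical rows: two different people named Kelly, and one record entered twice.
toy = pd.DataFrame({
    "name": ["Kelly, Mr. James", "Kelly, Mr. James", "Allen, Miss. E", "Allen, Miss. E"],
    "age": [34.5, 44.0, 29.0, 29.0],
})

# All rows whose name occurs more than once (both namesakes and both copies).
same_name = toy[toy.name.duplicated(keep=False)]

# Deduplicating on the name alone silently drops the second Mr. Kelly.
by_name = toy.drop_duplicates(["name"])

# Deduplicating on all columns removes only the truly identical record.
fully_deduplicated = toy.drop_duplicates()

print(len(same_name), len(by_name), len(fully_deduplicated))
```

In this toy case the name-based version loses a real person, while the full-row version keeps both namesakes.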

As we see here, the removal of duplicates, like almost all data cleaning tasks, requires a careful analysis of the actions to take, so as not to make the data worse rather than better. In this case, we are dealing with an already standardized and heavily curated data set, so easily correctable errors such as duplicates are not to be expected. However, the treatment of data quality problems is not only about “correcting” errors in the data, but also about improving some quality characteristic that makes the data more usable.

As a next example, let's try to analyze the purchase of tickets. In the description of the object type data we see that the field “ticket” is present for all 1309 cases, but there are only 929 distinct values, which means that some tickets included several passengers.

In [11]:
# We show the data of the cases with duplicated “ticket” field.

ticket_multiple = titanic.ticket.duplicated(keep = False)
titanic[ticket_multiple]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1299,3.0,0,"Yasbeck, Mr. Antoni",male,27.0000,1.0,0.0,2659,14.4542,,C,C,,
1300,3.0,1,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15.0000,1.0,0.0,2659,14.4542,,C,,,
1303,3.0,0,"Yousseff, Mr. Gerious",male,,0.0,0.0,2627,14.4583,,C,,,
1304,3.0,0,"Zabour, Miss. Hileni",female,14.5000,1.0,0.0,2665,14.4542,,C,,328.0,


In [12]:
# For cases with duplicate “ticket” field, we select the ticket field and show how many unique values there are.

titanic[ticket_multiple].ticket.nunique()

216

We see that 596 passengers travel on shared tickets, and that there are 216 such tickets. A simple inspection of the few rows shown suggests that a whole family could be included in one ticket (according to the custom of the time, the wife took the husband's surname, although in some cases the passengers could be siblings, or the match could be a simple coincidence, as with the duplicated names).
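
One way to obtain both kinds of figures from the boolean mask is sketched below on a made-up ticket column (the ticket codes are hypothetical): the sum of the mask counts the passengers on shared tickets, and `nunique()` counts the shared tickets themselves.

```python
import pandas as pd

# Hypothetical ticket numbers: A1 is shared by two people, C3 by three.
tickets = pd.Series(["A1", "A1", "B2", "C3", "C3", "C3", "D4"], name="ticket")

# True for every passenger whose ticket appears more than once.
shared = tickets.duplicated(keep=False)

# Passengers travelling on a shared ticket (True counts as 1).
n_passengers_shared = int(shared.sum())

# Number of distinct shared tickets.
n_shared_tickets = tickets[shared].nunique()

# Size of each travelling party.
group_sizes = tickets[shared].value_counts()

print(n_passengers_shared, n_shared_tickets)
print(group_sizes)
```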

We want to analyze these shared tickets: how many members of the same family traveled together, and so on. Unfortunately, we do not have the first name and surname separately, even though the full names are there: this is a **data validity** problem, since the data is not in the format we need. We can try to solve this problem (a “problem” for us, who are interested in this particular analysis; it is not necessarily a problem of the data: the data is not wrong, it is just not in the right format for our purposes).

In [13]:
# We create two new fields in our DataFrame, “apellido” (surname) and “nombre” (given name), splitting the “name” field at the comma.

titanic[['apellido', 'nombre']] = titanic.name.str.split(',', expand = True)
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,apellido,nombre
0,1.0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO",Allen,Miss. Elisabeth Walton
1,1.0,1,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON",Allison,Master. Hudson Trevor
2,1.0,0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",Allison,Miss. Helen Loraine
3,1.0,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON",Allison,Mr. Hudson Joshua Creighton
4,1.0,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",Allison,Mrs. Hudson J C (Bessie Waldo Daniels)
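
Splitting at every comma only yields exactly two columns if no name contains a second comma. A slightly more defensive variant, sketched here on made-up names (the second name and the variable names `surname` and `given` are hypothetical), splits at the first comma only and strips the surrounding whitespace:

```python
import pandas as pd

names = pd.Series([
    "Allen, Miss. Elisabeth Walton",
    "Smith, Mr. John, Jr",  # hypothetical name containing a second comma
])

# n=1 splits at the first comma only, so the result always has two columns.
parts = names.str.split(",", n=1, expand=True)

# Remove the blank left after the comma.
surname = parts[0].str.strip()
given = parts[1].str.strip()

print(surname.tolist())
print(given.tolist())
```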


It is now possible to perform these analyses...

In [14]:
# We filter the data corresponding to multiple tickets, group by ticket and surname, compute
# a series of aggregates of some columns to put names to, and assign that result to a
# new DataFrame called “grupos”.

grupos = titanic[ticket_multiple].groupby(['ticket', 'apellido'])\
                                 .agg(nombres = pd.NamedAgg('nombre', lambda x: x.tolist()),
                                      masculino = pd.NamedAgg('sex', lambda x: (x == 'male').sum()),
                                      femenino = pd.NamedAgg('sex', lambda x: (x == 'female').sum()),
                                      edades = pd.NamedAgg('age', lambda x: x.tolist()),
                                      mayores_de_edad = pd.NamedAgg('age', lambda x: (x >= 16).sum()),
                                      menores_de_edad = pd.NamedAgg('age', lambda x: (x < 16).sum()),
                                      personas_con_apellido = pd.NamedAgg('apellido', lambda x: x.count()))\
                                 .reset_index()

# We create two new columns with relevant information: the number of different surnames in each group, and
# the number of persons in each group (since we grouped by ticket and surname at the same time,
# these values could not be easily calculated in the same statement;
# it would also have been possible to do the same calculations by first doing a
# groupby ticket, calculating some values, and then a groupby surname).
grupos['apellidos_en_grupo'] = grupos.groupby('ticket').apellido.transform(lambda x: x.count())
grupos['personas_en_grupo'] = grupos.groupby('ticket').personas_con_apellido.transform(lambda x: x.sum())

### Technical programming note: lambda functions

In the above code, lambda functions are used to calculate aggregations and transformations. A “lambda” is nothing more than a small anonymous function. Lambdas are generally used when a function is only going to be used once, so it is shorter and clearer not to define it separately. For example, the following two pieces of code (which calculate, for each class, the age that 90 percent of the passengers in that class do not exceed) are (almost) equivalent:

    # Example 1: using a “lambda”.
    titanic.groupby('pclass').age.agg(lambda x: x.quantile(0.90))

    # Example 2: defining a function
    def percentile90(series):
        return series.quantile(0.90)
    titanic.groupby('pclass').age.agg(percentile90)

The main difference between the two examples, apart from the length, is that after executing the second one, the function percentile90 remains defined from then on (in general, it is preferable not to leave variables and functions defined that will not be used anymore, mainly to avoid errors).

In the example code, the function (whether defined by a lambda or externally) is passed as a parameter to another function. This is a fairly common practice, and the Pandas library makes frequent use of it. For example, the .agg() or .transform() methods used above, take each group selected by .groupby() and apply the function provided to it (the difference between them is that .agg() generates an aggregate, for example the sum or average of all the values of each group, while .transform() transforms the data but leaves them with the same dimension, for example normalizing them or calculating some value to attach as an additional column to facilitate further calculations).

For more details on the use of lambda functions, consult the standard Python documentation or any Python programming manual.
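
The difference between .agg() and .transform() described above can be sketched on a small made-up table (the DataFrame `toy` and its values are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "ticket": ["A1", "A1", "B2", "B2", "B2"],
    "fare": [10.0, 10.0, 30.0, 30.0, 30.0],
})

# .agg() collapses each group to one value: the result has one row per ticket.
max_fare_per_ticket = toy.groupby("ticket").fare.agg(lambda x: x.max())

# .transform() keeps the original shape: the group value is repeated
# for every row of the group, ready to be attached as a new column.
party_size = toy.groupby("ticket").fare.transform(lambda x: x.count())

print(max_fare_per_ticket)
print(party_size.tolist())
```

Here the aggregated result has two rows (one per ticket), while the transformed result has five (one per passenger).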

In [15]:
# Number of different surnames depending on the size of the group traveling on the same ticket.
# As can be seen, the most frequent case is that of a group of two people with the same surname
# (most likely a married couple, or a parent with his or her offspring), but there are many different cases.

pd.crosstab(grupos.apellidos_en_grupo, grupos.personas_en_grupo)

apellidos_en_grupo \ personas_en_grupo,2,3,4,5,6,7,8,11
1,99,33,7,3,3,2,1,1
2,66,22,4,2,0,2,0,0
3,0,15,15,6,3,3,0,0
4,0,0,8,4,0,0,0,0
5,0,0,0,0,0,5,0,0
7,0,0,0,0,0,0,7,0


In [16]:
# For groups where all members have the same surname, we calculate
# how many people of each sex there are. The most frequent case is a pair
# of one man and one woman (80 cases).

grupos_1_apellido = grupos[grupos.apellidos_en_grupo == 1]
pd.crosstab(grupos_1_apellido.femenino, grupos_1_apellido.masculino)

femenino \ masculino,0,1,2,3,5,6
0,0,0,5,3,0,0
1,0,80,10,0,1,0
2,14,18,3,0,1,0
3,2,3,2,1,1,0
4,1,1,1,0,0,0
5,0,0,1,0,0,1


In [17]:
# We do the same, but this time with the number of people over and under 16 years of age...
pd.crosstab(grupos_1_apellido.menores_de_edad, grupos_1_apellido.mayores_de_edad)

menores_de_edad \ mayores_de_edad,0.0,1.0,2.0,3.0,4.0,6.0
0.0,14,5,72,4,0,1
1.0,2,13,14,0,1,0
2.0,2,9,4,0,0,0
3.0,0,2,0,0,0,0
4.0,0,1,1,0,0,0
5.0,0,1,2,1,0,0


We have found an error! The results do not make sense: there are 14 cases, for example, in which a ticket for two or more persons includes neither anyone aged 16 or over nor anyone under 16. If we go back to the initial information about the data, we see what the problem is: the “age” field of the original DataFrame has valid values for only 1046 of the 1309 cases; the rest are null. A comparison with a null value is always false, so passengers without an age are counted in neither column. This is a typical error caused by not taking a data quality problem into account.

We can easily check the error...

In [18]:
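# We count for how many rows of the DataFrame “grupos” the number of persons (which is always correct)
# matches the sum of minors and adults (which only includes the persons whose age is known).
# Note that personas_en_grupo counts everyone on the ticket, while each row of “grupos” covers a single
# surname, so rows from tickets with several surnames also fail this comparison.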
(grupos.mayores_de_edad + grupos.menores_de_edad == grupos.personas_en_grupo).value_counts()

False    185
True     126
dtype: int64

How could this problem be solved? There are different possibilities, depending on our objective. We could fill in the missing values with an estimated age, for example the average of all the ages, but in this particular case that would not be a good option: it would make all these people appear as adults and would unrealistically unbalance the data. A better option here is to leave out the groups in which at least one person has no age. In this way we see the table only for those groups for which we have complete information, which does not give us true totals but does give an idea of the proportion between the alternatives.
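
The effect of the null ages, and one way of making it visible, can be sketched on made-up data (the DataFrame `toy` and the column names `adults`, `minors`, `unknown_age` and `persons` are hypothetical). Counting the missing ages explicitly makes it possible to keep only the groups with complete information:

```python
import numpy as np
import pandas as pd

# Hypothetical groups: on ticket A1 one of the two ages is missing.
toy = pd.DataFrame({
    "ticket": ["A1", "A1", "B2", "B2"],
    "age": [40.0, np.nan, 30.0, 5.0],
})

summary = toy.groupby("ticket").agg(
    # Comparisons with NaN are False, so a missing age is counted in neither column.
    adults=pd.NamedAgg("age", lambda x: (x >= 16).sum()),
    minors=pd.NamedAgg("age", lambda x: (x < 16).sum()),
    # Explicit count of the persons whose age is unknown.
    unknown_age=pd.NamedAgg("age", lambda x: x.isna().sum()),
    # "size" counts all rows of the group, including those with a missing age.
    persons=pd.NamedAgg("age", "size"),
)

# Keep only the groups for which every age is known.
complete = summary[summary.unknown_age == 0]

print(summary)
print(complete)
```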

In [19]:
# We filter the data to keep only those for which we have complete information on the group.

g_1_a_filtrado = grupos_1_apellido[(grupos_1_apellido.mayores_de_edad + grupos_1_apellido.menores_de_edad) 
                                   == grupos_1_apellido.personas_con_apellido]

pd.crosstab(g_1_a_filtrado.menores_de_edad, g_1_a_filtrado.mayores_de_edad)

menores_de_edad \ mayores_de_edad,0.0,1.0,2.0,3.0,4.0,6.0
0.0,0,0,72,4,0,1
1.0,0,12,14,0,1,0
2.0,1,9,4,0,0,0
3.0,0,2,0,0,0,0
4.0,0,1,1,0,0,0
5.0,0,1,2,1,0,0


The results, naturally, are not perfect (we do not have enough information to make them so), but they give us useful information, if we know how to interpret it: the most common case is that of a couple of adults alone, followed by one or two adults accompanied by a child, or two or more in some cases. There is only one case of two minors traveling alone, and groups of 3 or more adults traveling with a minor are also very infrequent.

## Anomaly detection 

Let us now make a simple analysis of how the class (first, second or third) is related to the price of the ticket.

In [20]:
for clase in [1, 2, 3]:
    print(f'--- CLASE {clase} ---')
    print(titanic[titanic.pclass == clase].fare.describe())
    print(clase)

--- CLASE 1 ---
count    323.000000
mean      87.508992
std       80.447178
min        0.000000
25%       30.695800
50%       60.000000
75%      107.662500
max      512.329200
Name: fare, dtype: float64
1
--- CLASE 2 ---
count    277.000000
mean      21.179196
std       13.607122
min        0.000000
25%       13.000000
50%       15.045800
75%       26.000000
max       73.500000
Name: fare, dtype: float64
2
--- CLASE 3 ---
count    708.000000
mean      13.302889
std       11.494358
min        0.000000
25%        7.750000
50%        8.050000
75%       15.245800
max       69.550000
Name: fare, dtype: float64
3


We see that the minimum price paid, in all three cases, is zero: it seems that there were passengers traveling as guests, in all classes. Let's see how many there were, and eliminate them from the sample to get a more realistic view of the actual ticket prices.

In [21]:
titanic[titanic.fare == 0].groupby('pclass').ticket.count()

pclass
1.0    7
2.0    6
3.0    4
Name: ticket, dtype: int64

In [22]:
for clase in [1, 2, 3]:
    print(f'--- CLASE {clase} ---')
    print(titanic[(titanic.fare != 0) &(titanic.pclass == clase)].fare.describe())
    print(clase)

--- CLASE 1 ---
count    316.000000
mean      89.447482
std       80.259713
min        5.000000
25%       31.682275
50%       61.379200
75%      108.900000
max      512.329200
Name: fare, dtype: float64
1
--- CLASE 2 ---
count    271.000000
mean      21.648108
std       13.382064
min        9.687500
25%       13.000000
50%       15.050000
75%       26.000000
max       73.500000
Name: fare, dtype: float64
2
--- CLASE 3 ---
count    704.000000
mean      13.378473
std       11.483004
min        3.170800
25%        7.750000
50%        8.050000
75%       15.245800
max       69.550000
Name: fare, dtype: float64
3
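
As an alternative to looping over the classes, the same per-class summaries can be produced in one table by combining the filter with a groupby. This is a minimal sketch on made-up fares (the DataFrame `toy` is a hypothetical example, not the Titanic data):

```python
import pandas as pd

toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare": [0.0, 60.0, 13.0, 26.0, 0.0, 8.05],
})

# Drop the zero fares, then describe the remaining fares for each class at once.
paid = toy[toy.fare != 0]
summary = paid.groupby("pclass").fare.describe()

print(summary)
```

The result has one row per class and one column per statistic, which makes the classes easier to compare side by side.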


There is a very large variability in ticket prices. Let's take a look at the most extreme cases...

In [23]:
# To see the first class passengers who paid the least, we select those passengers,
# sort by price in ascending order, and see the first 10 cases that appear.

titanic[(titanic.fare != 0) & (titanic.pclass == 1)].sort_values(by = 'fare').head(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,apellido,nombre
51,1.0,0,"Carlsson, Mr. Frans Olof",male,33.0,0.0,0.0,695,5.0,B51 B53 B55,S,,,"New York, NY",Carlsson,Mr. Frans Olof
75,1.0,0,"Colley, Mr. Edward Pomeroy",male,47.0,0.0,0.0,5727,25.5875,E58,S,,,"Victoria, BC",Colley,Mr. Edward Pomeroy
79,1.0,1,"Cornell, Mrs. Robert Clifford (Malvina Helen L...",female,55.0,2.0,0.0,11770,25.7,C101,S,2.0,,"New York, NY",Cornell,Mrs. Robert Clifford (Malvina Helen Lamson)
219,1.0,1,"Omont, Mr. Alfred Fernand",male,,0.0,0.0,F.C. 12998,25.7417,,C,7.0,,"Paris, France",Omont,Mr. Alfred Fernand
15,1.0,0,"Baumann, Mr. John D",male,,0.0,0.0,PC 17318,25.925,,S,,,"New York, NY",Baumann,Mr. John D
288,1.0,1,"Swift, Mrs. Frederick Joel (Margaret Welles Ba...",female,48.0,0.0,0.0,17466,25.9292,D17,S,8.0,,"Brooklyn, NY",Swift,Mrs. Frederick Joel (Margaret Welles Barron)
181,1.0,1,"Leader, Dr. Alice (Farnham)",female,49.0,0.0,0.0,17465,25.9292,D17,S,8.0,,"New York, NY",Leader,Dr. Alice (Farnham)
171,1.0,0,"Jones, Mr. Charles Cresson",male,46.0,0.0,0.0,694,26.0,,S,,80.0,"Bennington, VT",Jones,Mr. Charles Cresson
172,1.0,0,"Julian, Mr. Henry Forbes",male,50.0,0.0,0.0,113044,26.0,E60,S,,,London,Julian,Mr. Henry Forbes
217,1.0,0,"Nicholson, Mr. Arthur Ernest",male,64.0,0.0,0.0,693,26.0,,S,,263.0,"Isle of Wight, England",Nicholson,Mr. Arthur Ernest


In [24]:
# We select class 3 passengers, sort by price in descending order, 
# and see the first 10 cases that appear

titanic[titanic.pclass == 3].sort_values(by = 'fare', ascending = False).head(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,apellido,nombre
1171,3.0,0,"Sage, Master. William Henry",male,14.5,8.0,2.0,CA. 2343,69.55,,S,,67.0,,Sage,Master. William Henry
1175,3.0,0,"Sage, Miss. Stella Anna",female,,8.0,2.0,CA. 2343,69.55,,S,,,,Sage,Miss. Stella Anna
1172,3.0,0,"Sage, Miss. Ada",female,,8.0,2.0,CA. 2343,69.55,,S,,,,Sage,Miss. Ada
1179,3.0,0,"Sage, Mr. John George",male,,1.0,9.0,CA. 2343,69.55,,S,,,,Sage,Mr. John George
1178,3.0,0,"Sage, Mr. George John Jr",male,,8.0,2.0,CA. 2343,69.55,,S,,,,Sage,Mr. George John Jr
1177,3.0,0,"Sage, Mr. Frederick",male,,8.0,2.0,CA. 2343,69.55,,S,,,,Sage,Mr. Frederick
1176,3.0,0,"Sage, Mr. Douglas Bullen",male,,8.0,2.0,CA. 2343,69.55,,S,,,,Sage,Mr. Douglas Bullen
1180,3.0,0,"Sage, Mrs. John (Annie Bullen)",female,,1.0,9.0,CA. 2343,69.55,,S,,,,Sage,Mrs. John (Annie Bullen)
1174,3.0,0,"Sage, Miss. Dorothy Edith 'Dolly'",female,,8.0,2.0,CA. 2343,69.55,,S,,,,Sage,Miss. Dorothy Edith 'Dolly'
1170,3.0,0,"Sage, Master. Thomas Henry",male,,8.0,2.0,CA. 2343,69.55,,S,,,,Sage,Master. Thomas Henry


It seems that there are many passengers who paid the price of 69.55 in third class, but if we look we see that all of them are on the same ticket. This makes us think that the price shown in the data is not the price per passenger, but the price for the full ticket. Although the documentation associated with the data does not make this clear, some evidence seems to confirm this hypothesis.

In [26]:
# For each group of passengers with the same ticket, we compute the difference between the maximum
# and the minimum fare shown for its members, and count the resulting values.
# A difference of zero for every ticket means that all passengers with the same ticket
# show the same price... which suggests that this is the group price, not the price per person.

(titanic.groupby('ticket').fare.max() - titanic.groupby('ticket').fare.min()).value_counts()

0.0    928
Name: fare, dtype: int64

In [27]:
# Let's create a new variable, “precio_imputado” (imputed price), which distributes the price of the ticket among the persons traveling on it.

titanic['precio_imputado'] = titanic.groupby('ticket').fare.transform(lambda x: x / x.count())
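
The same idea can be sketched step by step on made-up data (the DataFrame `toy` and the names `fare_is_constant`, `party_size` and `fare_per_person` are hypothetical): first check that the fare is constant within each ticket, then divide it by the number of passengers on that ticket.

```python
import pandas as pd

toy = pd.DataFrame({
    "ticket": ["A1", "A1", "A1", "B2", "C3", "C3"],
    "fare": [30.0, 30.0, 30.0, 8.05, 26.0, 26.0],
})

# The split only makes sense if every passenger on a ticket shows the same fare.
fare_is_constant = bool(toy.groupby("ticket").fare.nunique().eq(1).all())

# Number of passengers with a known fare on each ticket, repeated for every row.
party_size = toy.groupby("ticket").fare.transform("count")

# Share of the ticket price attributed to each passenger.
toy["fare_per_person"] = toy.fare / party_size

print(fare_is_constant)
print(toy)
```

Summing `fare_per_person` within a ticket gives back the ticket price, which is a quick sanity check on the split.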

In [28]:
for clase in [1, 2, 3]:
    print(f'--- CLASE {clase} ---')
    print(titanic[(titanic.precio_imputado != 0) &(titanic.pclass == clase)].precio_imputado.describe())
    print(clase)

--- CLASE 1 ---
count    316.000000
mean      34.661682
std       14.675124
min        5.000000
25%       26.550000
50%       30.000000
75%       39.133350
max      128.082300
Name: precio_imputado, dtype: float64
1
--- CLASE 2 ---
count    271.000000
mean      11.663652
std        2.031927
min        5.250000
25%       10.500000
50%       12.650000
75%       13.000000
max       16.000000
Name: precio_imputado, dtype: float64
2
--- CLASE 3 ---
count    704.000000
mean       7.370788
std        1.367423
min        3.170800
25%        7.061975
50%        7.750000
75%        7.925000
max       19.966700
Name: precio_imputado, dtype: float64
3


It seems to result in more balanced values (and very concentrated in the central area), so let's assume that the hypothesis is true. Still, there seem to be some very high values in third class and some very low values in first class.

In [29]:
# Let's see the passengers who paid the most for a third class ticket...

titanic[titanic.pclass == 3].sort_values(by = 'precio_imputado', ascending = False).head(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,apellido,nombre,precio_imputado
842,3.0,0,"Hagland, Mr. Ingvald Olai Olsen",male,,1.0,0.0,65303,19.9667,,S,,,,Hagland,Mr. Ingvald Olai Olsen,19.9667
843,3.0,0,"Hagland, Mr. Konrad Mathias Reiersen",male,,1.0,0.0,65304,19.9667,,S,,,,Hagland,Mr. Konrad Mathias Reiersen,19.9667
743,3.0,0,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0.0,0.0,7552,10.5167,,S,,,"Norrlot, Sweden Chicago, IL",Dahlberg,Miss. Gerda Ulrika,10.5167
744,3.0,0,"Dakic, Mr. Branko",male,19.0,0.0,0.0,349228,10.1708,,S,,,Austria,Dakic,Mr. Branko,10.1708
1260,3.0,1,"Turja, Miss. Anna Sofia",female,18.0,0.0,0.0,4138,9.8417,,S,15.0,,,Turja,Miss. Anna Sofia,9.8417
1227,3.0,0,"Strandberg, Miss. Ida Sofia",female,22.0,0.0,0.0,7553,9.8375,,S,,,,Strandberg,Miss. Ida Sofia,9.8375
907,3.0,0,"Jussila, Miss. Katriina",female,20.0,1.0,0.0,4136,9.825,,S,,,,Jussila,Miss. Katriina,9.825
908,3.0,0,"Jussila, Miss. Mari Aina",female,21.0,1.0,0.0,4137,9.825,,S,,,,Jussila,Miss. Mari Aina,9.825
1261,3.0,1,"Turkula, Mrs. (Hedwig)",female,63.0,0.0,0.0,4134,9.5875,,S,15.0,,,Turkula,Mrs. (Hedwig),9.5875
943,3.0,0,"Laitinen, Miss. Kristina Sofia",female,37.0,0.0,0.0,4135,9.5875,,S,,,,Laitinen,Miss. Kristina Sofia,9.5875


We see that there are two passengers, who also share the same last name, who paid 19.97 each for their ticket, while the next most expensive third-class ticket cost 10.52, roughly half that price. This makes us suspect an error: perhaps the ticket numbers are wrong and should be the same, in which case the price imputed to each passenger would be half. The fact that the price is out of range and improbable (it is almost double the next price paid, it is higher than the price paid by any second-class passenger, and even some first-class passengers traveled for less than that) makes it an **anomaly**. Whether or not it is an error depends on the context of the data and has to be determined by other means.

(Actually, in this case, it is possible to consult existing records on Titanic victims. Ingvald and Konrad Hagland, who were brothers, had each planned to travel with a companion; the companions did not embark in the end and therefore do not appear on the passenger list. The price per person at which the tickets were bought was, therefore, half that shown in our table.)

Should this data be corrected? As always, it depends on the intended use. For example, if you want to analyze what ticket prices were, the right thing to do is to correct it. If you want to add up the amounts paid by each passenger to calculate the shipping company's total revenue, the right thing to do is to leave it as it is. The anomaly is a characteristic of the data; whether it calls for a correction, or instead reveals an interesting fact, depends on the reality behind it and on the model being built.
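One way to serve both uses at once is to leave the original column untouched and put the correction in a separate column. The following is a minimal sketch under assumptions: the toy frame reproduces three rows of the table above, while the dictionary `personas_previstas` and the column `precio_por_persona` are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd

# Three rows copied from the third-class table above, keeping the original row labels.
toy = pd.DataFrame(
    {
        "name": [
            "Hagland, Mr. Ingvald Olai Olsen",
            "Hagland, Mr. Konrad Mathias Reiersen",
            "Dahlberg, Miss. Gerda Ulrika",
        ],
        "ticket": ["65303", "65304", "7552"],
        "precio_imputado": [19.9667, 19.9667, 10.5167],
    },
    index=[842, 843, 743],
)

# Hypothetical manual overrides: row label -> number of people the fare was bought for.
# The value 2 reflects the note above that each brother's ticket covered a companion.
personas_previstas = {842: 2, 843: 2}

# Rows without an override are divided by 1, i.e. left unchanged.
divisor = pd.Series(personas_previstas).reindex(toy.index).fillna(1)
toy["precio_por_persona"] = toy["precio_imputado"] / divisor

# Revenue analysis uses the uncorrected amounts actually paid...
ingresos = toy["precio_imputado"].sum()
# ...while price analysis uses the corrected per-person price.
precio_mediano = toy["precio_por_persona"].median()

print(toy[["name", "precio_imputado", "precio_por_persona"]])
print(ingresos, precio_mediano)
```

Keeping the overrides in an explicit dictionary also documents which rows were changed and why, so the correction can be audited or reverted later.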

In [30]:
titanic[titanic.pclass == 1].sort_values(by = 'precio_imputado', ascending = False).head(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,apellido,nombre,precio_imputado
50,1.0,1,"Cardeza, Mrs. James Warburton Martinez (Charlo...",female,58.0,0.0,1.0,PC 17755,512.3292,B51 B53 B55,C,3.0,,"Germantown, Philadelphia, PA",Cardeza,Mrs. James Warburton Martinez (Charlotte Ward...,128.0823
302,1.0,1,"Ward, Miss. Anna",female,35.0,0.0,0.0,PC 17755,512.3292,,C,3.0,,,Ward,Miss. Anna,128.0823
183,1.0,1,"Lesurer, Mr. Gustave J",male,35.0,0.0,0.0,PC 17755,512.3292,B101,C,3.0,,,Lesurer,Mr. Gustave J,128.0823
49,1.0,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0.0,1.0,PC 17755,512.3292,B51 B53 B55,C,3.0,,"Austria-Hungary / Germantown, Philadelphia, PA",Cardeza,Mr. Thomas Drake Martinez,128.0823
17,1.0,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0,0.0,1.0,PC 17558,247.5208,B58 B60,C,6.0,,"Montreal, PQ",Baxter,Mrs. James (Helene DeLaudeniere Chaput),82.506933
16,1.0,0,"Baxter, Mr. Quigg Edmond",male,24.0,0.0,1.0,PC 17558,247.5208,B58 B60,C,,,"Montreal, PQ",Baxter,Mr. Quigg Edmond,82.506933
97,1.0,1,"Douglas, Mrs. Frederick Charles (Mary Helene B...",female,27.0,1.0,1.0,PC 17558,247.5208,B58 B60,C,6.0,,"Montreal, PQ",Douglas,Mrs. Frederick Charles (Mary Helene Baxter),82.506933
72,1.0,1,"Clark, Mrs. Walter Miller (Virginia McDowell)",female,26.0,1.0,0.0,13508,136.7792,C89,C,4.0,,"Los Angeles, CA",Clark,Mrs. Walter Miller (Virginia McDowell),68.3896
71,1.0,0,"Clark, Mr. Walter Miller",male,27.0,1.0,0.0,13508,136.7792,C89,C,,,"Los Angeles, CA",Clark,Mr. Walter Miller,68.3896
119,1.0,1,"Frauenthal, Dr. Henry William",male,50.0,2.0,0.0,PC 17611,133.65,,S,5.0,,"New York, NY",Frauenthal,Dr. Henry William,66.825


The highest prices paid in first class also correspond to multi-passenger tickets. In this case the “cabin” field also shows that these are multi-cabin bookings, which may justify the price, so they probably should not be considered erroneous (again, the historical passenger records describe Mrs. Charlotte Cardeza as a wealthy heiress of extravagant habits who traveled with her son and her servants and carried a large amount of luggage).
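The multi-cabin observation can be checked programmatically by counting the distinct cabin codes recorded under each ticket. This is only a sketch: it assumes that cabin codes within the “cabin” field are separated by spaces, as they appear in the table above, and the names `cabinas` and `cabinas_por_ticket` are hypothetical.

```python
import pandas as pd

# A few rows copied from the first-class table above (ticket and cabin only).
toy = pd.DataFrame({
    "ticket": ["PC 17755", "PC 17755", "PC 17755", "PC 17755",
               "PC 17558", "PC 17558", "13508"],
    "cabin": ["B51 B53 B55", None, "B101", "B51 B53 B55",
              "B58 B60", "B58 B60", "C89"],
})

# Drop rows without cabin information, split each entry into individual
# cabin codes, and put every code on its own row.
cabinas = (
    toy.dropna(subset=["cabin"])
       .assign(cabin=lambda d: d["cabin"].str.split())
       .explode("cabin")
)

# Number of distinct cabins recorded for each ticket.
cabinas_por_ticket = cabinas.groupby("ticket")["cabin"].nunique()
print(cabinas_por_ticket.sort_values(ascending=False))
```

Tickets for which no passenger has a recorded cabin do not appear in the result, so a missing ticket here means "unknown", not "zero cabins".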

In [31]:
titanic[(titanic.fare != 0) & (titanic.pclass == 1)].sort_values(by = 'precio_imputado').head(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,apellido,nombre,precio_imputado
51,1.0,0,"Carlsson, Mr. Frans Olof",male,33.0,0.0,0.0,695,5.0,B51 B53 B55,S,,,"New York, NY",Carlsson,Mr. Frans Olof,5.0
46,1.0,0,"Cairns, Mr. Alexander",male,,0.0,0.0,113798,31.0,,S,,,,Cairns,Mr. Alexander,15.5
258,1.0,1,"Serepeca, Miss. Augusta",female,30.0,0.0,0.0,113798,31.0,,C,4.0,,,Serepeca,Miss. Augusta,15.5
188,1.0,1,"Lines, Mrs. Ernest H (Elizabeth Lindsey James)",female,51.0,0.0,1.0,PC 17592,39.4,D28,S,9.0,,"Paris, France",Lines,Mrs. Ernest H (Elizabeth Lindsey James),19.7
187,1.0,1,"Lines, Miss. Mary Conover",female,16.0,0.0,1.0,PC 17592,39.4,D28,S,9.0,,"Paris, France",Lines,Miss. Mary Conover,19.7
147,1.0,0,"Harrington, Mr. Charles H",male,,0.0,0.0,113796,42.4,,S,,,,Harrington,Mr. Charles H,21.2
211,1.0,0,"Moore, Mr. Clarence Bloomfield",male,47.0,0.0,0.0,113796,42.4,,S,,,"Washington, DC",Moore,Mr. Clarence Bloomfield,21.2
154,1.0,0,"Hays, Mr. Charles Melville",male,55.0,1.0,1.0,12749,93.5,B69,S,,307.0,"Montreal, PQ",Hays,Mr. Charles Melville,23.375
225,1.0,0,"Payne, Mr. Vivian Ponsonby",male,23.0,0.0,0.0,12749,93.5,B24,S,,,"Montreal, PQ",Payne,Mr. Vivian Ponsonby,23.375
155,1.0,1,"Hays, Mrs. Charles Melville (Clara Jennings Gr...",female,52.0,1.0,1.0,12749,93.5,B69,S,3.0,,"Montreal, PQ",Hays,Mrs. Charles Melville (Clara Jennings Gregg),23.375


Finally, the lowest price paid in first class (5.00) is also far below the next lowest ones (15.50 and up), and it can be treated as a data point to be corrected or not, depending on the use to be made of the data. In this case the passenger was a senior sailor who could not return to the USA on his own ship and had to travel on the Titanic, so the very low price is probably the result of some kind of agreement or courtesy between the shipping companies.

In [32]:
grupos.head()

Unnamed: 0,ticket,apellido,nombres,masculino,femenino,edades,mayores_de_edad,menores_de_edad,personas_con_apellido,apellidos_en_grupo,personas_en_grupo
0,110152,Cherry,[ Miss. Gladys],0,1,[30.0],1.0,0.0,1,3,3
1,110152,Maioni,[ Miss. Roberta],0,1,[16.0],1.0,0.0,1,3,3
2,110152,Rothes,[ the Countess. of (Lucy Noel Martha Dyer-Edwa...,0,1,[33.0],1.0,0.0,1,3,3
3,110413,Taussig,"[ Miss. Ruth, Mr. Emil, Mrs. Emil (Tillie Ma...",1,2,"[18.0, 52.0, 39.0]",3.0,0.0,3,1,3
4,110465,Clifford,[ Mr. George Quincy],1,0,[nan],0.0,0.0,1,2,2
