# Case Study - Feature Engineering on the Titanic
The Titanic dataset has a few columns from which you can extract information using string methods and regular expressions. Feature engineering means creating new columns of data from existing ones, and you will do just that in these exercises. Read in the data and then answer the following questions.

In [1]:
import pandas as pd

In [2]:
titanic = pd.read_csv('../data/tidy/titanic.csv')
titanic.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Problem 1
<span  style="color:green; font-size:16px">Extract the first character of the **`Ticket`** column and save it as a new column **`ticket_first`**. Find the total number of survivors, the total number of passengers, and the percentage of those who survived **by this column**. Next find the total survival rate for the entire dataset. Does this new column help predict who survived?</span>

### Problem 2
<span  style="color:green; font-size:16px">If you did Problem 1 correctly, you should see that only 7% of the people with tickets that began with 'A' survived. Find the survival rate for all those 'A' tickets by **`Sex`**.</span>

### Problem 3
<span  style="color:green; font-size:16px">Find the survival rate by the last letter of the ticket. Is there any predictive power here?</span>

### Problem 4
<span  style="color:green; font-size:16px">Find the length of each passenger's name and assign it to the **`name_len`** column. What are the minimum and maximum name lengths?</span>

### Problem 5
<span  style="color:green; font-size:16px">Pass the **`name_len`** column to the **`pd.cut`** function. Also, pass a list of equal-sized cut points to the **`bins`** parameter. Assign the resulting Series to the **`name_len_cat`** column. Find the frequency count of each bin in this column.</span>

### Problem 6
<span  style="color:green; font-size:16px">Is name length a good predictor of survival?</span>

### Problem 7
<span  style="color:green; font-size:16px">Why do you think people with longer names had a better chance at survival?</span>

### Problem 8
<span  style="color:green; font-size:16px">Using the Titanic dataset, do your best to extract the title from each person's name. Examples of titles are 'Mr.', 'Dr.', and 'Miss'. Save this to a column called **`title`**. Find the frequency count of the titles.</span>

### Problem 9
<span  style="color:green; font-size:16px">Does the title have good predictive value of survival?</span>

### Problem 10
<span  style="color:green; font-size:16px">Create a pivot table of survival by title and sex. Use two aggregation functions, mean and size.</span>

### Problem 11
<span  style="color:green; font-size:16px">Attempt to extract the first name of each passenger into the column **`first_name`**. Are there any males and females with the same first name?</span>

### Problem 12
<span  style="color:green; font-size:16px">The problems have been an exercise in **feature engineering**. Several new features (columns) have been created from existing columns. Come up with your own feature and test it out on survival.</span>

# Solutions

### Problem 1
<span  style="color:green; font-size:16px">Extract the first character of the **`Ticket`** column and save it as a new column **`ticket_first`**. Find the total number of survivors, the total number of passengers, and the percentage of those who survived **by this column**. Next find the total survival rate for the entire dataset. Does this new column help predict who survived?</span>

In [3]:
titanic = pd.read_csv('../data/tidy/titanic.csv')
titanic.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
titanic['ticket_first'] = titanic.Ticket.str[0]
titanic.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,ticket_first
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,A
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,P
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,3


In [5]:
ticket_first_survival = titanic.groupby('ticket_first').agg({'Survived': ['mean', 'sum', 'size']})
ticket_first_survival

,Survived,Survived,Survived
ticket_first,mean,sum,size
1,0.630137,92,146
2,0.464481,85,183
3,0.239203,72,301
4,0.2,2,10
5,0.0,0,3
6,0.166667,1,6
7,0.111111,1,9
8,0.0,0,2
9,1.0,1,1
A,0.068966,2,29


In [6]:
# overall survival rate
titanic['Survived'].mean()

0.3838383838383838

It does look like **`ticket_first`** has predictive power: 63% of passengers whose tickets begin with '1' survived, versus 24% for '3'. Only 2 of the 29 passengers with tickets beginning with 'A' survived, well below the overall rate of about 38%.
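
The same summary can be produced with flat, descriptive column names by using named aggregation. The sketch below is only an illustration: it rebuilds the five rows shown by `titanic.head()` rather than reading the file, and `survival_summary` is a hypothetical helper name, not part of the original solution.

```python
import pandas as pd

# The five rows shown by titanic.head(), restricted to the columns needed here.
sample = pd.DataFrame({
    'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450'],
    'Survived': [0, 1, 1, 1, 0],
})
sample['ticket_first'] = sample['Ticket'].str[0]


def survival_summary(df, by):
    """Hypothetical helper: survivors, passengers and survival rate per group."""
    return df.groupby(by)['Survived'].agg(
        survivors='sum',     # number of passengers who survived
        passengers='size',   # number of passengers in the group
        rate='mean',         # share of the group that survived
    )


summary = survival_summary(sample, 'ticket_first')
print(summary)
```

Named aggregation avoids the two-level column header that `agg({'Survived': [...]})` produces, which makes the result easier to sort or filter afterwards.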

### Problem 2
<span  style="color:green; font-size:16px">If you did Problem 1 correctly, you should see that only 7% of the people with tickets that began with 'A' survived. Find the survival rate for all those 'A' tickets by **`Sex`**.</span>

In [7]:
filt = titanic['ticket_first'] == 'A'
ticket_A = titanic[filt]
ticket_A.groupby('Sex').agg({'Survived': ['mean', 'size']})

,Survived,Survived
Sex,mean,size
female,0.0,2
male,0.074074,27


### Problem 3
<span  style="color:green; font-size:16px">Find the survival rate by the last letter of the ticket. Is there any predictive power here?</span>

In [8]:
titanic['ticket_last'] = titanic['Ticket'].str[-1]
ticket_last_survival = titanic.groupby('ticket_last').agg({'Survived': ['mean', 'sum', 'size']})
ticket_last_survival

,Survived,Survived,Survived
ticket_last,mean,sum,size
0,0.395062,32,81
1,0.43,43,100
2,0.364706,31,85
3,0.339286,38,112
4,0.297297,22,74
5,0.392405,31,79
6,0.419355,39,93
7,0.355556,32,90
8,0.453488,39,86
9,0.390805,34,87


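There is no obvious predictive power here. Survival rates by last character range from about 30% to 45%, all fairly close to the overall rate of 38%, with no clear pattern.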

### Problem 4
<span  style="color:green; font-size:16px">Find the length of each passenger's name and assign it to the **`name_len`** column. What are the minimum and maximum name lengths?</span>

In [9]:
titanic['name_len'] = titanic['Name'].str.len()
titanic['name_len'].min()

12

In [10]:
titanic['name_len'].max()

82

### Problem 5
<span  style="color:green; font-size:16px">Pass the **`name_len`** column to the **`pd.cut`** function. Also, pass a list of equal-sized cut points to the **`bins`** parameter. Assign the resulting Series to the **`name_len_cat`** column. Find the frequency count of each bin in this column.</span>

In [11]:
titanic['name_len_cat'] = pd.cut(titanic['name_len'], bins=[0, 20, 40, 60, 80, 100])
titanic['name_len_cat'].head()

0    (20, 40]
1    (40, 60]
2    (20, 40]
3    (40, 60]
4    (20, 40]
Name: name_len_cat, dtype: category
Categories (5, interval[int64]): [(0, 20] < (20, 40] < (40, 60] < (60, 80] < (80, 100]]

In [12]:
titanic['name_len_cat'].value_counts()

(20, 40]     558
(0, 20]      243
(40, 60]      86
(60, 80]       3
(80, 100]      1
Name: name_len_cat, dtype: int64
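
As a small sketch of how the equal-width cut points could be generated rather than typed by hand, the block below bins four names taken from the outputs in this notebook. The variable names `edges` and `counts` are illustrative choices, not part of the original solution.

```python
import pandas as pd

# Four names that appear in the outputs of this notebook.
names = pd.Series([
    'Lam, Mr. Len',
    'Braund, Mr. Owen Harris',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)',
])
name_len = names.str.len()

# Equal-width cut points: 0, 20, 40, 60, 80, 100.
edges = list(range(0, 101, 20))

# pd.cut builds right-closed intervals by default, e.g. (0, 20] includes 20.
name_len_cat = pd.cut(name_len, bins=edges)

# Sort by interval rather than by count so the bins read in order.
counts = name_len_cat.value_counts().sort_index()
print(counts)
```

Sorting the counts by index keeps the bins in their natural order, which is often easier to read than the default ordering by frequency.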

### Problem 6
<span  style="color:green; font-size:16px">Is name length a good predictor of survival?</span>

In [13]:
titanic.groupby('name_len_cat').agg({'Survived': ['mean', 'size']})

,Survived,Survived
name_len_cat,mean,size
"(0, 20]",0.230453,243
"(20, 40]",0.383513,558
"(40, 60]",0.790698,86
"(60, 80]",1.0,3
"(80, 100]",1.0,1


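Yes. The survival rate rises with name length, from 23% for names of 20 characters or fewer to 79% for names of 41 to 60 characters. The two longest bins contain only 4 passengers in total, so their 100% rates carry little weight.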

### Problem 7
<span  style="color:green; font-size:16px">Why do you think people with longer names had a better chance at survival?</span>

Let's output the 10 shortest and the 10 longest names.

In [14]:
names = titanic.sort_values(by='name_len')['Name']

In [15]:
names.head(10)

826       Lam, Mr. Len
692       Lam, Mr. Ali
74       Bing, Mr. Lee
169      Ling, Mr. Lee
509     Lang, Mr. Fang
832     Saad, Mr. Amin
210     Ali, Mr. Ahmed
694    Weir, Col. John
108    Rekic, Mr. Tido
838    Chip, Mr. Chang
Name: Name, dtype: object

In [16]:
# Names exceed pandas display settings.
# change them with pd.options.display.max_colwidth 
# or just print out values
names.tail(10).values

array(['Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)',
       'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
       'Spedden, Mrs. Frederic Oakley (Margaretta Corning Stone)',
       'Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott)',
       'Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)',
       'Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren)',
       'Brown, Mrs. Thomas William Solomon (Elizabeth Catherine Ford)',
       'Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")',
       'Phillips, Miss. Kate Florence ("Mrs Kate Louise Phillips Marshall")',
       'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'],
      dtype=object)

In [17]:
# temporarily set options in a context manager
with pd.option_context('display.max_colwidth', 100):
    print(names.tail(10))

18                                Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)
759                              Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)
319                              Spedden, Mrs. Frederic Oakley (Margaretta Corning Stone)
41                               Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott)
25                              Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)
610                             Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren)
670                         Brown, Mrs. Thomas William Solomon (Elizabeth Catherine Ford)
556                     Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")
427                   Phillips, Miss. Kate Florence ("Mrs Kate Louise Phillips Marshall")
307    Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)
Name: Name, dtype: object


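All 10 of the shortest names belong to men, and all 10 of the longest names belong to women. Many of the longest names are married women listed under their husband's name, with their own name added in parentheses. Name length is therefore partly a proxy for sex, and women survived at a much higher rate than men.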
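
One way to probe this observation is to compare name length by sex and by whether the name contains a parenthesised part. The sketch below uses four of the rows shown by `titanic.head()`; the `has_paren` column is a hypothetical feature introduced only for illustration.

```python
import pandas as pd

# Four rows from titanic.head() whose names are shown in full.
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Heikkinen, Miss. Laina',
             'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
             'Allen, Mr. William Henry'],
    'Sex': ['male', 'female', 'female', 'male'],
})
sample['name_len'] = sample['Name'].str.len()

# Hypothetical flag: True when the name contains an opening parenthesis.
sample['has_paren'] = sample['Name'].str.contains('(', regex=False)

# Average name length by sex and by the parenthesis flag.
len_by_sex = sample.groupby('Sex')['name_len'].mean()
len_by_paren = sample.groupby('has_paren')['name_len'].mean()
print(len_by_sex)
print(len_by_paren)
```

On the full dataset, the same two `groupby` calls would show how much of the name-length effect is explained by sex and by the parenthesised names.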

### Problem 8
<span  style="color:green; font-size:16px">Using the Titanic dataset, do your best to extract the title from each person's name. Examples of titles are 'Mr.', 'Dr.', and 'Miss'. Save this to a column called **`title`**. Find the frequency count of the titles.</span>

In [18]:
titanic['title'] = titanic['Name'].str.extract(r'(\w+[.])')
titanic['title'].value_counts()

Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Major.         2
Mlle.          2
Col.           2
Capt.          1
Lady.          1
Mme.           1
Countess.      1
Sir.           1
Jonkheer.      1
Ms.            1
Don.           1
Name: title, dtype: int64
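
To make the behaviour of the pattern concrete, here is a minimal sketch that applies it to four names from the outputs above. Passing `expand=False` is an optional choice made here so that `str.extract` returns a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Four names that appear earlier in this notebook.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Weir, Col. John',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")',
])

# (\w+[.]) captures the first run of word characters that is
# immediately followed by a period, which is where the title sits.
title = names.str.extract(r'(\w+[.])', expand=False)
print(title.tolist())
```

Because the surname is followed by a comma rather than a period, the first match in each name is the title, even for multi-word surnames such as 'Duff Gordon'.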

### Problem 9
<span  style="color:green; font-size:16px">Does the title have good predictive value of survival?</span>

In [19]:
titanic.groupby('title').agg({'Survived':['mean', 'size']})

,Survived,Survived
title,mean,size
Capt.,0.0,1
Col.,0.5,2
Countess.,1.0,1
Don.,0.0,1
Dr.,0.428571,7
Jonkheer.,0.0,1
Lady.,1.0,1
Major.,0.5,2
Master.,0.575,40
Miss.,0.697802,182

For the common titles, yes. Among the rows shown, 'Master.' (58%) and 'Miss.' (70%) are well above the overall survival rate of 38%. Most of the rarer titles cover only one or two passengers, so their rates are unreliable.


### Problem 10
<span  style="color:green; font-size:16px">Create a pivot table of survival by title and sex. Use two aggregation functions, mean and size.</span>

In [20]:
titanic.pivot_table(index='title', columns='Sex', 
                    values='Survived', aggfunc=['mean', 'size'])

,mean,mean,size,size
Sex,female,male,female,male
title,,,,
Capt.,,0.0,,1.0
Col.,,0.5,,2.0
Countess.,1.0,,1.0,
Don.,,0.0,,1.0
Dr.,1.0,0.333333,1.0,6.0
Jonkheer.,,0.0,,1.0
Lady.,1.0,,1.0,
Major.,,0.5,,2.0
Master.,,0.575,,40.0
Miss.,0.697802,,182.0,


### Problem 11
<span  style="color:green; font-size:16px">Attempt to extract the first name of each passenger into the column **`first_name`**. Are there any males and females with the same first name?</span>

Most first names can be found as the word immediately following the title:

In [21]:
pattern = r'\w+[.] (\w+)'
titanic['first_name'] = titanic['Name'].str.extract(pattern)

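To be more precise, we can skip over any lowercase words, spaces, and an opening parenthesis that follow the title, and capture the first capitalized word: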

In [22]:
pattern = r'\w+[.][a-z (]+([A-Z]\w+)'
titanic['first_name'] = titanic['Name'].str.extract(pattern)
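
To see how the two patterns differ, the sketch below applies both to three names that appear in the outputs above. The variable names `simple` and `precise` are illustrative only.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")',
])

# First pattern: the word right after "Title. ".
# It picks up 'of' for the Countess and finds nothing when the
# title is followed directly by an opening parenthesis.
simple = names.str.extract(r'\w+[.] (\w+)', expand=False)

# Second pattern: skip lowercase letters, spaces and '(' after the
# title, then capture the first capitalized word.
precise = names.str.extract(r'\w+[.][a-z (]+([A-Z]\w+)', expand=False)

print(pd.DataFrame({'simple': simple, 'precise': precise}))
```

The second pattern recovers a first name for all three passengers, while the first pattern returns a lowercase word for one and a missing value for another.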

In [23]:
first_name_ct = titanic.groupby('first_name').agg({'Sex': 'nunique'})
first_name_ct.head()

first_name,Sex
Abraham,1
Achille,1
Ada,1
Adele,1
Adola,1


In [24]:
filt = first_name_ct['Sex'] == 2
first_name_ct[filt].head(10)

first_name,Sex
Albert,2
Alexander,2
Amin,2
Anders,2
Antoni,2
Benjamin,2
Carl,2
Charles,2
Dickinson,2
Edgar,2


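It looks like some women are listed under their husband's or father's name, with their own first name given afterwards in parentheses. The extracted first name for these passengers is therefore a man's name.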

In [25]:
filt = titanic['first_name'] == 'Albert'
titanic.loc[filt, 'Name']

64                                 Stewart, Mr. Albert A
107                               Moss, Mr. Albert Johan
323    Caldwell, Mrs. Albert Francis (Sylvia Mae Harb...
690                              Dick, Mr. Albert Adrian
781            Dick, Mrs. Albert Adrian (Vera Gillespie)
817                                   Mallet, Mr. Albert
833                               Augustsson, Mr. Albert
Name: Name, dtype: object
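
One possible refinement, sketched below, is to prefer the first name inside the parentheses when there is one and fall back to the name after the title otherwise. This rests on the assumption that a parenthesised name is the passenger's own name, which may not hold for every row; the names `in_parens` and `own_first_name` are hypothetical.

```python
import pandas as pd

# Names that appear in the outputs of this notebook.
names = pd.Series([
    'Dick, Mr. Albert Adrian',
    'Dick, Mrs. Albert Adrian (Vera Gillespie)',
    'Heikkinen, Miss. Laina',
    'Phillips, Miss. Kate Florence ("Mrs Kate Louise Phillips Marshall")',
])

# First capitalized word after the title (the pattern used above).
after_title = names.str.extract(r'\w+[.][a-z (]+([A-Z]\w+)', expand=False)

# First capitalized word directly after an opening parenthesis, if any.
in_parens = names.str.extract(r'\(([A-Z]\w+)', expand=False)

# Assumption: a parenthesised name, when present, is the passenger's own.
own_first_name = in_parens.fillna(after_title)
print(own_first_name.tolist())
```

With this change, 'Dick, Mrs. Albert Adrian (Vera Gillespie)' is no longer counted as a female 'Albert', so the count of first names shared by both sexes should drop.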

### Problem 12
<span  style="color:green; font-size:16px">The past several problems have been an exercise in **feature engineering**. Several new features (columns) have been created from existing columns. Come up with your own feature and test it out on survival.</span>

Get the first letter of the cabin. Use 'Missing' if no cabin is recorded.

In [26]:
titanic['cabin_first'] = titanic.Cabin.str[0].fillna('Missing')

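Just having a recorded cabin is highly predictive: passengers with a missing cabin survived at a rate of 30%, while nearly all cabin letters show rates between 47% and 76%.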

In [27]:
titanic.groupby('cabin_first').agg({'Survived': ['size', 'mean']})

,Survived,Survived
cabin_first,size,mean
A,15,0.466667
B,47,0.744681
C,59,0.59322
D,33,0.757576
E,32,0.75
F,13,0.615385
G,4,0.5
Missing,687,0.299854
T,1,0.0
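
Since most of the signal appears to come from whether a cabin is recorded at all, a simpler boolean feature may be worth testing. The sketch below uses the `Cabin` and `Survived` values of the five rows shown by `titanic.head()`; the column name `has_cabin` is a hypothetical choice.

```python
import pandas as pd

# Cabin and Survived values from the five rows shown by titanic.head().
sample = pd.DataFrame({
    'Cabin': [None, 'C85', None, 'C123', None],
    'Survived': [0, 1, 1, 1, 0],
})

# Hypothetical feature: True when a cabin is recorded for the passenger.
sample['has_cabin'] = sample['Cabin'].notna()

# Group size and survival rate for passengers with and without a cabin.
by_cabin = sample.groupby('has_cabin')['Survived'].agg(['size', 'mean'])
print(by_cabin)
```

A two-level feature like this has far more passengers per group than the individual cabin letters, so its survival rates are less sensitive to small group sizes.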
