# Advanced `pandas`

This notebook covers more advanced operations in pandas:

- `split-apply-combine` pipeline,
- operations on string columns (string operations, replacement),
- joins on Pandas dataframes.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("bmh")

In [2]:
import re  # used below for regex-based replacements

import numpy as np
import pandas as pd

In [3]:
titanic_train = pd.read_csv("train.csv", index_col="PassengerId")
titanic_test = pd.read_csv("test.csv", index_col="PassengerId")
titanic = pd.concat([titanic_train, titanic_test], sort=False)

In [4]:
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Joining Pandas dataframes (`JOIN` in Pandas)

We start with a synthetic example:

In [5]:
a = pd.DataFrame(np.arange(8).reshape((4,2)),
                 columns=["a", "b"],
                 index=["a", "b", "a", "b"])
b = pd.DataFrame(10 + np.arange(4).reshape((4,-1)),
                 columns=["d"],
                 index=["d", "b", "c", "b"])

In [6]:
a

,a,b
a,0,1
b,2,3
a,4,5
b,6,7


In [7]:
b

,d
d,10
b,11
c,12
b,13


In [8]:
a.join(b) # default is left join

,a,b,d
a,0,1,
a,4,5,
b,2,3,11.0
b,2,3,13.0
b,6,7,11.0
b,6,7,13.0


In [9]:
a.join(b, how="inner")

,a,b,d
b,2,3,11
b,2,3,13
b,6,7,11
b,6,7,13


In [10]:
a

,a,b
a,0,1
b,2,3
a,4,5
b,6,7


In [11]:
b

,d
d,10
b,11
c,12
b,13


In [12]:
b.join(a, how="right")

,d,a,b
a,,0,1
b,11.0,2,3
b,13.0,2,3
a,,4,5
b,11.0,6,7
b,13.0,6,7


In [13]:
a.join(b, how="outer")

,a,b,d
a,0.0,1.0,
a,4.0,5.0,
b,2.0,3.0,11.0
b,2.0,3.0,13.0
b,6.0,7.0,11.0
b,6.0,7.0,13.0
c,,,12.0
d,,,10.0
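The row counts in the joins above follow from the duplicated index labels. The sketch below rebuilds the same toy frames, writes the index-on-index join as an explicit `merge`, and uses `indicator=True` to show which side each row of the outer join came from. The variable names (`inner_merge`, `outer`, and so on) are illustrative only.

```python
import numpy as np
import pandas as pd

# Same toy frames as above: both indexes contain the label "b" twice.
a = pd.DataFrame(np.arange(8).reshape((4, 2)),
                 columns=["a", "b"],
                 index=["a", "b", "a", "b"])
b = pd.DataFrame(10 + np.arange(4).reshape((4, -1)),
                 columns=["d"],
                 index=["d", "b", "c", "b"])

# An index-on-index join is a merge with both index flags set.
inner_join = a.join(b, how="inner")
inner_merge = a.merge(b, left_index=True, right_index=True, how="inner")

# Two "b" rows on each side pair up as 2 * 2 = 4 result rows.
n_left_b = (a.index == "b").sum()
n_right_b = (b.index == "b").sum()

# indicator=True adds a `_merge` column: left_only, right_only or both.
outer = a.merge(b, left_index=True, right_index=True,
                how="outer", indicator=True)
```

Inspecting `outer["_merge"]` is a quick way to check how many rows found a partner before deciding between a left, inner, or outer join.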


We can also perform a join on multi-indexed dataframes:

In [14]:
c = pd.DataFrame(np.arange(8).reshape((4,2)),
                 columns=["a", "b"],
                 index=pd.MultiIndex.from_tuples([("a", "A"), ("b", "E"), ("a", "Y"), ("b", "R")],
                                                 names=("lower", "upper")))

In [15]:
c

lower,upper,a,b
a,A,0,1
b,E,2,3
a,Y,4,5
b,R,6,7


In [16]:
a

,a,b
a,0,1
b,2,3
a,4,5
b,6,7


In [17]:
c.join(a, on="lower")  # This one will fail

ValueError: columns overlap but no suffix specified: Index(['a', 'b'], dtype='object')

In [20]:
c.join(a, on="lower", rsuffix="_right", lsuffix="_left")

lower,upper,a_left,b_left,a_right,b_right
a,A,0,1,0,1
a,A,0,1,4,5
b,E,2,3,2,3
b,E,2,3,6,7
a,Y,4,5,0,1
a,Y,4,5,4,5
b,R,6,7,2,3
b,R,6,7,6,7
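In the join above, `on="lower"` matches the `lower` index level of `c` against the index of `a`, and every row is doubled because `a` has duplicate labels. A minimal sketch of the more common case, where the right-hand frame is a lookup table with a unique index, is shown below. The `lookup` frame and its `weight` column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same multi-indexed frame as above.
c = pd.DataFrame(np.arange(8).reshape((4, 2)),
                 columns=["a", "b"],
                 index=pd.MultiIndex.from_tuples(
                     [("a", "A"), ("b", "E"), ("a", "Y"), ("b", "R")],
                     names=("lower", "upper")))

# Hypothetical lookup table: unique index, no column names shared with `c`.
lookup = pd.DataFrame({"weight": [10, 20]}, index=["a", "b"])

# Each row of `c` picks up exactly one row of `lookup` via its "lower" level.
# No suffixes are needed because the column names do not overlap.
joined = c.join(lookup, on="lower")
```

With a unique right-hand index the row count of the left frame is preserved, which makes this form a safe way to attach per-key attributes.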


# Joining dataframes for EDA

## Problem: get (almost) all couples on board

In [None]:
titanic[["Name", "Sex"]].head()

We start by noting the pattern: married females are listed as `<FAMILY_NAME>, Mrs. <HUSBANDS_FIRST_NAME> (<WIFES_FULL_NAME>)`. Let's play with it a bit:

In [None]:
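# Drop the parenthesised wife's name, then turn "Mrs." into "Mr." so that
# a wife's entry becomes identical to her husband's entry.
# The dot is escaped: an unescaped "." would match any character.
family_names = (titanic
                .replace(re.compile(r'\s+\(.*\)'), '')
                .replace(re.compile(r"Mrs\."), "Mr."))[["Name", "Sex"]]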

In [None]:
family_names
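The naming pattern can also be split into separate fields instead of being rewritten in place. The following is a sketch using `Series.str.extract` with named groups. The regular expression and the group names (`family`, `title`, `given`, `wife`) are assumptions made for this example, and the sample names are taken from the `head()` output above. Uncommon name formats may not match.

```python
import pandas as pd

names = pd.Series([
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
])

# <family>, <title>. <given names> (<optional name in parentheses>)
pattern = (r"^(?P<family>[^,]+),\s*(?P<title>\w+)\.\s*"
           r"(?P<given>[^(]*?)\s*(?:\((?P<wife>.*)\))?$")

# One column per named group; unmatched optional groups become NaN.
parsed = names.str.extract(pattern)
```

For a married woman, `given` then holds the husband's given names and `wife` holds her own full name, which is the information the replacement above relies on.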

Removing the wives' names that appear in parentheses:

In [None]:
titanic.replace(re.compile(r'\s+\(.*\)'), '')

In [None]:
family_names

We can now get the passenger IDs and husbands' names of all married women (note that not all of these husbands are on board!):

In [None]:
# After the replacement above, a female passenger titled "Mr." is a married woman.
family_names = family_names[(family_names.Sex=="female") & family_names.Name.str.contains(r"Mr\.")]

In [None]:
family_names.head()

In [None]:
family_names.shape[0]

We now want to join this back to the original dataframe (a very common pattern if you need some **pairs**):

In [None]:
family_names.reset_index().set_index("Name")["PassengerId"]

In [None]:
couples = (titanic.join(family_names
                        .reset_index()
                        .set_index("Name")["PassengerId"],
                        on="Name", how="inner"))
couples

Note that there is no collision on the `PassengerId` **column**, because the husband's `PassengerId` is the index!
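The mechanics are easier to see on a tiny made-up example. In this sketch the names and IDs are invented. `wives` is a Series whose index is the husband-style name and whose values are the wives' passenger IDs, mirroring the Series built above.

```python
import pandas as pd

# Made-up passengers; the index plays the role of PassengerId.
people = pd.DataFrame(
    {"Name": ["Smith, Mr. John",
              "Smith, Mrs. John (Jane Doe)",
              "Brown, Mr. Tom"]},
    index=pd.Index([10, 11, 12], name="PassengerId"))

# Key: the wife's name rewritten to look like her husband's entry.
# Value: the wife's own PassengerId.
wives = pd.Series([11], index=["Smith, Mr. John"], name="PassengerId")

# Rows of `people` whose Name appears in the index of `wives` are kept.
# The Series name becomes a new column next to the index of the same name.
pairs = people.join(wives, on="Name", how="inner")
pairs = pairs.rename(columns={"PassengerId": "PassengerId_Spouse"})
```

The result has one row per husband found on board: the index holds his ID and `PassengerId_Spouse` holds hers.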

In [None]:
couples.rename({"PassengerId":"PassengerId_Spouse"},
               axis=1, inplace=True)

In [None]:
couples.head()

In [None]:
couples = couples.join(titanic[["Name", "Age"]],
                       on="PassengerId_Spouse", rsuffix="_Spouse")

In [None]:
couples

In [None]:
titanic.Pclass.value_counts()

In [None]:
couples.Pclass.value_counts()

In [None]:
couples.Sex.value_counts()

In [None]:
((couples.Age - couples.Age_Spouse)
 .groupby(couples.Pclass)
 .agg(["min", "max", "mean", "median", "std", "count", "size"]))
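The aggregation above requests both `count` and `size` because they differ when ages are missing: `count` ignores `NaN`, while `size` is the plain group length. A small sketch with invented numbers:

```python
import numpy as np
import pandas as pd

# Invented age gaps; NaN stands for a couple with an unknown age.
age_gap = pd.Series([5.0, np.nan, -2.0, 3.0, np.nan])
pclass = pd.Series([1, 1, 2, 3, 3])

# Split by class, apply several aggregations, combine into one table.
summary = age_gap.groupby(pclass).agg(["mean", "count", "size"])
```

A gap between `size` and `count` in a group shows how many couples were dropped from the statistics for lack of data.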

In [None]:
couples[(couples.Age - couples.Age_Spouse)<0][["PassengerId_Spouse", "Name", "Age", "Name_Spouse", "Age_Spouse"]]

In [None]:
titanic.loc[742]

In [None]:
titanic.loc[988]

Although this is only a heuristic, and we may need to dig deeper (e.g., to find some uncommon naming patterns), it is already something. Think about which features you could add to describe a passenger (say, `is wife/husband on board?`, which may complement `SibSp`).

Think about how you could find entire **families**, and which features you could extract once you know them. EDA is about data-driven creativity, so play with it.
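As one possible starting point, a "spouse on board" flag can be derived from a table of pairs without any loop. The sketch below uses invented IDs, and the column name `SpouseOnBoard` is hypothetical. It assumes a `couples`-like frame indexed by one partner's ID with the other partner's ID in `PassengerId_Spouse`.

```python
import pandas as pd

# Invented passengers and a single detected couple (1, 2).
passengers = pd.DataFrame(
    {"Sex": ["male", "female", "male", "female"]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"))
couples = pd.DataFrame(
    {"PassengerId_Spouse": [2]},
    index=pd.Index([1], name="PassengerId"))

# Collect the IDs of both partners, then flag every passenger in that set.
partner_ids = couples.index.union(pd.Index(couples["PassengerId_Spouse"]))
passengers["SpouseOnBoard"] = passengers.index.isin(partner_ids)
```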

P. S. **not a single loop** above.

### Intermezzo: on self-joins

In [None]:
cabin_counts = titanic.Cabin.value_counts()
cabin_counts[cabin_counts>1]

In [None]:
cabin_counts = cabin_counts[cabin_counts>1]

In [None]:
cabins = (titanic
          .loc[titanic.Cabin.isin(cabin_counts.index),
               ["Name", "Cabin"]]
          .reset_index())

In [None]:
cabins

In [None]:
cabins.merge(cabins, on="Cabin", suffixes=("_first", "_second"))

In [None]:
companions = cabins.merge(cabins, on="Cabin", suffixes=("_first", "_second"))
companions = companions[companions.PassengerId_first < companions.PassengerId_second]

In [None]:
companions
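The `<` filter matters because a self-join pairs every passenger with everyone in the same cabin, including themselves, and lists each pair in both orders. A toy version with invented cabins makes the counts explicit:

```python
import pandas as pd

# Invented data: three passengers share cabin C1, one is alone in D2.
cabins = pd.DataFrame({"PassengerId": [1, 2, 3, 4],
                       "Cabin": ["C1", "C1", "C1", "D2"]})

# Self-join on Cabin: 3 * 3 rows for C1 plus 1 * 1 row for D2.
pairs = cabins.merge(cabins, on="Cabin", suffixes=("_first", "_second"))

# Keeping only first < second drops self-pairs and mirrored duplicates.
unique_pairs = pairs[pairs.PassengerId_first < pairs.PassengerId_second]
```

A cabin with `n` occupants yields `n * n` rows in the raw self-join and `n * (n - 1) / 2` unordered pairs after the filter.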

We can now clean this up and get another interesting source of information (`travelling with a family member in the same cabin?`, etc.).