# `pandas`

In [65]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')


Note the import convention:

In [66]:
import numpy as np
import pandas as pd

In [67]:
np.random.seed(983456)

## Creating `pd.Series`

When creating Pandas `Series` you can provide values only:

In [68]:
s = pd.Series(np.random.randn(10))
s

0   -1.187327
1    0.382796
2   -0.681736
3   -3.534783
4    0.304866
5   -0.899330
6    1.194968
7   -0.446314
8    2.598223
9   -0.795144
dtype: float64

Values and series name:

In [34]:
s = pd.Series(np.random.randn(10), name="random_series")
s

0    2.433533
1   -0.677715
2    0.871098
3    0.128585
4   -1.042224
5    0.228245
6   -0.361624
7    0.447801
8    2.045056
9   -1.291771
Name: random_series, dtype: float64

Values, index and series name:

In [35]:
s = pd.Series(np.random.randn(10), name="random_series",
              index=np.random.randint(23, size=(10,)))
s

10    0.496488
2     0.516279
20   -1.497689
0    -0.256733
1     0.149244
14   -2.084936
16    0.117294
19    1.779378
3     0.820993
3    -0.090797
Name: random_series, dtype: float64

In [36]:
s.index

Int64Index([10, 2, 20, 0, 1, 14, 16, 19, 3, 3], dtype='int64')

The index can be created explicitly (and can have its own name):

In [37]:
s = pd.Series(np.random.randn(10), name="random_series",
              index=pd.Index(np.random.randint(23, size=(10,)), name="main_index"))
s

main_index
5    -1.062838
10    0.704245
0    -0.596709
16   -0.244326
21   -0.862128
1    -0.532820
11   -0.364554
17   -2.594341
19   -0.014463
16   -0.012298
Name: random_series, dtype: float64

In [38]:
s.index

Int64Index([5, 10, 0, 16, 21, 1, 11, 17, 19, 16], dtype='int64', name='main_index')

A series can be created from a dictionary as well:

In [54]:
s = pd.Series({'a':3, 'c':6, 'b':2}, name="dict_series")
s

a    3
c    6
b    2
Name: dict_series, dtype: int64

In [55]:
s.index

Index(['a', 'c', 'b'], dtype='object')
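A minimal sketch of how a dictionary interacts with an explicit `index`, assuming standard pandas behaviour: dictionary keys are matched against the requested labels, and a label that is not a key gets `NaN`. The names `data` and `s_sel` are illustrative only.

```python
import numpy as np
import pandas as pd

data = {'a': 3, 'c': 6, 'b': 2}

# Only the labels listed in `index` are kept; 'z' is not a key, so it becomes NaN
s_sel = pd.Series(data, index=['a', 'b', 'z'], name="dict_series")
print(s_sel)
```

Because of the `NaN`, the resulting series is stored as `float64` rather than `int64`.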

## Creating `pd.DataFrame`

Again, we can provide just the values and Pandas will create an index (for both rows and columns) for us:

In [56]:
df = pd.DataFrame(np.arange(20).reshape((5,4)))
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


Easy way to access data types in a dataframe:

In [57]:
df.dtypes

0    int32
1    int32
2    int32
3    int32
dtype: object

We can provide column names:

In [58]:
df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['a', 'b', 'c', 'd'])
df

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


Values, index and column names:

In [59]:
import string
df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['a', 'b', 'c', 'd'],
                  index=np.random.choice(list(string.ascii_lowercase), 5, replace=False))
df

Unnamed: 0,a,b,c,d
o,0,1,2,3
h,4,5,6,7
e,8,9,10,11
a,12,13,14,15
j,16,17,18,19


In [60]:
df.columns

Index(['a', 'b', 'c', 'd'], dtype='object')

In [61]:
df.index

Index(['o', 'h', 'e', 'a', 'j'], dtype='object')

Can you guess what `df['a']` means?

In [62]:
df['a']

o     0
h     4
e     8
a    12
j    16
Name: a, dtype: int32

Can we access a row in the same way?

In [63]:
df['h']

KeyError: 'h'

Each column is a `pd.Series`:

In [64]:
type(df['a'])

pandas.core.series.Series
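As a small sketch of the difference between column and row lookup, the frame below reuses the same layout with a fixed (non-random) row index. `df_demo` and the result names are illustrative; `.loc` is covered in detail later in this notebook.

```python
import numpy as np
import pandas as pd

# Same layout as above, but with a fixed row index so the result is reproducible
df_demo = pd.DataFrame(np.arange(20).reshape((5, 4)),
                       columns=['a', 'b', 'c', 'd'],
                       index=['o', 'h', 'e', 'a', 'j'])

col_a = df_demo['a']      # [] searches the column labels
row_h = df_demo.loc['h']  # .loc searches the row labels
row_a = df_demo.loc['a']  # 'a' is both a row and a column label: .loc picks the row
```

Both lookups return a `pd.Series`; for a row, the series is indexed by the column labels.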

# Reading CSV files

We will use the [Titanic dataset](https://www.kaggle.com/c/titanic/data):

In [81]:
titanic_train = pd.read_csv('train.csv')

By default, Pandas creates an integer row index and reads column names from the first (`0-th`) row of the CSV file:

In [None]:
titanic_train

Glimpse into a dataframe:

In [None]:
titanic_train.head()

In [None]:
titanic_train.tail()

In [None]:
titanic_train.info()

In [None]:
titanic_train.describe()

In [None]:
titanic_train

## Basic indexing of Pandas dataframes

We can set index column in `pd.read_csv`:

In [70]:
titanic_train = pd.read_csv('train.csv', index_col='PassengerId')
titanic_test = pd.read_csv('test.csv', index_col='PassengerId')
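To try the effect of `index_col` without the data files, here is a sketch that reads a tiny in-memory CSV. The text in `csv_text` is a stand-in with three rows and a subset of the columns, not the real `train.csv`.

```python
import io
import pandas as pd

# Stand-in for train.csv: three rows, three columns
csv_text = "PassengerId,Survived,Age\n1,0,22.0\n2,1,38.0\n3,1,26.0\n"

default_idx = pd.read_csv(io.StringIO(csv_text))
id_idx = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')

print(default_idx.index.tolist())  # generated integer positions
print(id_idx.index.tolist())       # values taken from the PassengerId column
```

With `index_col`, `PassengerId` stops being a regular column and becomes the (named) row index.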

Accessing a single column:

In [71]:
titanic_train["Survived"]

PassengerId
1      0
2      1
3      1
4      1
5      0
      ..
887    0
888    1
889    0
890    1
891    0
Name: Survived, Length: 891, dtype: int64

A set of columns:

In [None]:
titanic_train[["Name", "Survived"]]

Just in case: the order in which the columns are listed usually does not matter for performance:

In [None]:
%timeit titanic_train[["Name", "Survived"]]

In [None]:
%timeit titanic_train[["Survived", "Name"]]

Integer indexing is also available with `[]` notation, but with some peculiarities:

In [None]:
titanic_train[2:4]

But:

In [49]:
titanic_train[2]

KeyError: 2

`[]` may be ambiguous, and it's better to use it only for column access. If you want to use row labels, use `.loc`:

In [None]:
titanic_train.loc[2]

In [None]:
titanic_train.head()

Note that `titanic_train.loc[...]` is label-based, not positional, even though the row labels are integers. The difference matters even more for non-monotonic indexes (both the default index and `PassengerId` are unique and monotonic).
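The distinction is easier to see on a toy series whose integer labels are shuffled. This is only a sketch; `s_demo` and the result names are made up for illustration.

```python
import pandas as pd

# Integer labels in a non-monotonic order
s_demo = pd.Series([10, 20, 30, 40], index=[3, 2, 0, 1])

by_label = s_demo.loc[2]      # the element whose label is 2
by_position = s_demo.iloc[2]  # the third element, whatever its label is

label_slice = s_demo.loc[2:1]      # from label 2 through label 1, both included
position_slice = s_demo.iloc[1:2]  # position 1 only, upper bound excluded
```

Here `by_label` is 20 and `by_position` is 30, and the two slices differ in length even though they start at the same element.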

Label-based slice (inclusive bounds):

In [82]:
titanic_train.loc[2:4]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Positional slice (exclusive upper bound):

In [51]:
titanic_train[2:4]


`.loc` indexing is very flexible and can combine row and column access in one run:

In [83]:
titanic_train.loc[2:4, "Age"]

2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [86]:
titanic_train.loc[2:4, ["Age"]]

Unnamed: 0,Age
2,26.0
3,35.0
4,35.0


This one won't work:

In [85]:
titanic_train[2:4, ["Age"]]

TypeError: '(slice(2, 4, None), ['Age'])' is an invalid key

In [53]:
titanic_train.loc[2:10:2, ["Age"]]


In [72]:
titanic_train.loc[titanic_train["Age"] < 5, ["Name", "Pclass"]]

Unnamed: 0_level_0,Name,Pclass
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1
8,"Palsson, Master. Gosta Leonard",3
11,"Sandstrom, Miss. Marguerite Rut",3
17,"Rice, Master. Eugene",3
44,"Laroche, Miss. Simonne Marie Anne Andree",2
64,"Skoog, Master. Harald",3
79,"Caldwell, Master. Alden Gates",2
120,"Andersson, Miss. Ellis Anna Maria",3
165,"Panula, Master. Eino Viljami",3
172,"Rice, Master. Arthur",3
173,"Johnson, Miss. Eleanor Ileen",3


In [27]:
titanic_train.loc[(titanic_train["Age"] < 5) & (titanic_train.Pclass == 2), "Name"]


This won't work:

In [28]:
titanic_train.loc[titanic_train["Age"] < 5 & titanic_train.Pclass == 2, "Name"]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [48]:
titanic_train["Age"] < 5 & titanic_train.Pclass

PassengerId
1      False
2      False
3      False
4      False
5      False
       ...  
887    False
888    False
889    False
890    False
891    False
Length: 891, dtype: bool

In [49]:
titanic_train["Age"] < 5 & titanic_train.Pclass == 2

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
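The error comes from Python operator precedence: `&` binds more tightly than `<` and `==`, so the unparenthesised expression is parsed as the chained comparison `Age < (5 & Pclass) == 2`. A small sketch on made-up data (`people` is a hypothetical frame, not the Titanic set):

```python
import pandas as pd

people = pd.DataFrame({'Age': [2.0, 4.0, 30.0, 3.0],
                       'Pclass': [2, 3, 2, 2]})

# Parenthesised: each comparison is evaluated first, then combined element-wise
mask = (people['Age'] < 5) & (people['Pclass'] == 2)

# Unparenthesised: a chained comparison, which needs the truth value of a Series
try:
    people['Age'] < 5 & people['Pclass'] == 2
    raised = False
except ValueError:
    raised = True
```

Wrapping every comparison in parentheses before combining with `&` or `|` avoids the problem.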

`.iloc`, in contrast, is explicitly positional and can combine both row and column positions (and upper bounds are always exclusive):

In [50]:
titanic_train.iloc[:2, 3]  # Note resulting series name: Pandas preserves column name

PassengerId
1      male
2    female
Name: Sex, dtype: object

In [51]:
titanic_train.iloc[:2, 3:5]

Unnamed: 0_level_0,Sex,Age
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,male,22.0
2,female,38.0


You cannot mix positional and label-based indexing:

In [52]:
titanic_train.iloc[:2, "Name"]

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

But you still can use filtering:

In [59]:
titanic_train.iloc[(titanic_train.Age < 10).values, 2]  # titanic_train.iloc[titanic_train.Age < 10, 2] won't work

PassengerId
8                Palsson, Master. Gosta Leonard
11              Sandstrom, Miss. Marguerite Rut
17                         Rice, Master. Eugene
25                Palsson, Miss. Torborg Danira
44     Laroche, Miss. Simonne Marie Anne Andree
                         ...                   
828                       Mallet, Master. Andre
832             Richards, Master. George Sibley
851     Andersson, Master. Sigvard Harald Elias
853                     Boulos, Miss. Nourelain
870             Johnson, Master. Harold Theodor
Name: Name, Length: 62, dtype: object
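A sketch of the same pattern on a toy frame, assuming only that `.iloc` accepts a plain NumPy boolean array. `Series.to_numpy()` is used here in place of `.values`; `kids` is a hypothetical frame.

```python
import pandas as pd

kids = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [4.0, 40.0, 8.0]},
                    index=[10, 20, 30])

mask = kids['Age'] < 10

via_iloc = kids.iloc[mask.to_numpy(), 0]  # NumPy booleans plus a column position
via_loc = kids.loc[mask, 'Name']          # boolean Series plus a column label
```

Both expressions select the same rows; `.loc` is usually the more readable choice when the mask is already a series.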

## Performance

But how is an index useful? (Note the filtering notation.)

In [54]:
titanic_train = pd.read_csv('train.csv')

In [55]:
%timeit titanic_train[titanic_train.PassengerId==400]

331 µs ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [56]:
titanic_train = pd.read_csv('train.csv', index_col='PassengerId')

In [57]:
%timeit titanic_train.loc[400]

145 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
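The two lookups are not only different in speed but also in what they return. The sketch below uses a synthetic frame (`frame`, with made-up fares) to show this; timings are left out because they depend on the machine and the pandas version.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'PassengerId': np.arange(1, 1001),
                      'Fare': np.linspace(5.0, 100.0, 1000)})

# Boolean filter: builds a mask over all rows and returns a one-row DataFrame
by_filter = frame[frame['PassengerId'] == 400]

# Index lookup: the label is resolved through the index and a Series is returned
indexed = frame.set_index('PassengerId')
by_label = indexed.loc[400]
```

`by_label` holds the same values as the single row in `by_filter`, but as a series indexed by column name.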


## Combining dataframes

In [None]:
pd.concat([titanic_train, titanic_test], ignore_index=True)

Note how Pandas filled the `Survived` column (which is not even present in `titanic_test`!) with missing values. A better way to combine dataframes when the index has actual meaning:

In [112]:
titanic = pd.concat([titanic_train, titanic_test])

In [113]:
titanic

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S
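A compact sketch of both `pd.concat` variants on two tiny frames. `train` and `test` here are hypothetical stand-ins that only mimic the structure of the Titanic files.

```python
import pandas as pd

train = pd.DataFrame({'Survived': [0, 1], 'Age': [22.0, 38.0]},
                     index=pd.Index([1, 2], name='PassengerId'))
test = pd.DataFrame({'Age': [34.5, 47.0]},
                    index=pd.Index([892, 893], name='PassengerId'))

kept = pd.concat([train, test])                      # row labels are preserved
renum = pd.concat([train, test], ignore_index=True)  # labels replaced by 0..n-1
```

In both results the rows coming from `test` have `NaN` in `Survived`, which is why that column turns into floats.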


# Indexing `pd.Series` in depth

In [60]:
np.random.seed(983456)

N_ELEMS = 20

s = pd.Series(np.random.randint(20, size=(N_ELEMS,)),
              index=list(string.ascii_lowercase)[:N_ELEMS],
              name='randint_series')
s

a     8
b    17
c    10
d    10
e    12
f    13
g    10
h    12
i     4
j    13
k    16
l     5
m    10
n    10
o     4
p     1
q    10
r     6
s     3
t    17
Name: randint_series, dtype: int32

## Indexing with `[]`

In [61]:
s

a     8
b    17
c    10
d    10
e    12
f    13
g    10
h    12
i     4
j    13
k    16
l     5
m    10
n    10
o     4
p     1
q    10
r     6
s     3
t    17
Name: randint_series, dtype: int32

In [62]:
s['i']  # But there's a caveat: it may be series or just an element

4

In [63]:
s[['i']]

i    4
Name: randint_series, dtype: int32

Slicing by label does not work the way you might expect (both bounds are inclusive):

In [64]:
s['a':'f']

a     8
b    17
c    10
d    10
e    12
f    13
Name: randint_series, dtype: int32

Indexing with a list of labels works as well:

In [65]:
s[['k', 'q', 'a', 'r']]

k    16
q    10
a     8
r     6
Name: randint_series, dtype: int32

In [66]:
s.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
       'o', 'p', 'q', 'r', 's', 't'],
      dtype='object')

Note that positional slicing works as well:

In [67]:
s[0:5]

a     8
b    17
c    10
d    10
e    12
Name: randint_series, dtype: int32

In [68]:
s[5:3:-1]

f    13
e    12
Name: randint_series, dtype: int32

## Indexing with `.loc`

In [71]:
np.random.seed(983456)

s_int_idx = pd.Series(np.random.randint(20, size=(N_ELEMS,)),
                      index=np.random.choice(N_ELEMS, N_ELEMS, replace=False),
                      name='randint_series')
s_int_idx

16     8
2     17
4     10
19    10
14    12
0     13
11    10
8     12
12     4
18    13
7     16
5      5
10    10
6     10
17     4
3      1
9     10
15     6
13     3
1     17
Name: randint_series, dtype: int32

We have an integer index. What if we use slicing here? Will it be positional or will it use the row labels?

In [72]:
s_int_idx[2:15]

4     10
19    10
14    12
0     13
11    10
8     12
12     4
18    13
7     16
5      5
10    10
6     10
17     4
Name: randint_series, dtype: int32

Surprising. But that's the way Pandas works and you'll love it over time (its API is strongly tailored to the most common operations, making them more concise).

In [73]:
s_int_idx[2]  # label

17

In [74]:
s_int_idx[2:5]  # position

4     10
19    10
14    12
Name: randint_series, dtype: int32

Boolean mask? Sure.

In [None]:
s_int_idx[s_int_idx.index.isin(range(2,6))]

In [None]:
s_int_idx

Again, `[]` may often be ambiguous. Use `.loc` or `.iloc` to make your code readable and clean:

In [None]:
s_int_idx.loc[2:15]  # label

In [None]:
s_int_idx.iloc[2:15]  # position

What if we take some random upper bound? It won't work generally:

In [None]:
s_int_idx.loc[2:456]

Because of this:

In [None]:
s_int_idx.index.is_monotonic_increasing

But we can make it work (or rather you now know when it works and when it doesn't):

In [None]:
s_int_idx.sort_index().loc[2:234]

Because:

In [None]:
s_int_idx.sort_index().index.is_monotonic_increasing
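The whole experiment fits into a few lines on a toy series. This is a sketch with made-up values; `s_demo` is not the series used above.

```python
import pandas as pd

s_demo = pd.Series([8, 17, 10, 12], index=[16, 2, 4, 0])

try:
    s_demo.loc[2:456]  # 456 is not a label and the index is unsorted
    failed = False
except KeyError:
    failed = True

s_sorted = s_demo.sort_index()
tail = s_sorted.loc[2:456]  # sorted index: every label between 2 and 456
```

On the sorted index the missing bound 456 can be placed relative to the existing labels, so the slice simply runs to the end.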

We'll see why this works a bit later. We can do complex filtering/masking/boolean indexing as well:

In [None]:
s_int_idx[s_int_idx.index!=11]

In [None]:
s_int_idx[(s_int_idx>15) | (s_int_idx<5)]

In [None]:
s_int_idx.loc[s_int_idx!=14]

# Indexing `pd.DataFrame`

In [70]:
np.random.seed(983456)

df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['d', 'c', 'b', 'a'],
                  index=np.random.choice(list(string.ascii_lowercase), 5, replace=False))
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


Ok, so `[]` (without `loc` or `iloc`) probably is positional?

In [69]:
df[2:5]

Unnamed: 0,d,c,b,a
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


In [None]:
df['o'] # Nope, it doesn't work that way

But here's the thing: **the same** `[]` notation works differently if you're using column labels:

In [None]:
df['a']

In [None]:
df

Note that this one returns a dataframe:

In [None]:
df[['b']]

... and this one returns `pd.Series`:

In [None]:
df['b']

In [None]:
df

In [None]:
df.columns

In [None]:
df.columns[2:]

In [None]:
df[df.columns[2:]]

In [None]:
df.iloc[:, 2:]

In [None]:
df

So, `[]` is positional. Is it?

In [75]:
df[:'g'] # Surprising!

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15


In [None]:
df['a':'u'] # Not really surprising

In [77]:
df.sort_index()['a':'u']

Unnamed: 0,d,c,b,a
d,16,17,18,19
g,12,13,14,15
l,4,5,6,7
o,0,1,2,3


But neither `a` nor `u` is even in the row index!

In [None]:
df

In [78]:
df['d':]

Unnamed: 0,d,c,b,a
d,16,17,18,19


In [79]:
df["x":]

Unnamed: 0,d,c,b,a
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


But:

In [80]:
df.sort_index()['d':]

Unnamed: 0,d,c,b,a
d,16,17,18,19
g,12,13,14,15
l,4,5,6,7
o,0,1,2,3
x,8,9,10,11


In [None]:
df

In [None]:
df['k':'z'] # No, that won't work

In [None]:
df.sort_index()['k':'zjyyf']

In reality, Pandas keeps track of ranking of index labels:

In [None]:
df.index.to_series().rank()

If the index is monotonic, slicing works even with bounds that are not in the index:

In [None]:
df.sort_index().index.to_series().rank()

In [None]:
df.sort_index()['b':'m']  # Pandas can unambiguously set 'b' to be less than 'd' and 'm' to be between 'l' and 'o'

In [None]:
df

In [None]:
df.loc['o':'x', 'c'] = 5

In [73]:
df_sub = df['o':'x']
df_sub['c'] = 5 # Not a very good idea

KeyError: 'x'

In [77]:
df[(df['a']>=2) | (df['b']<3)]

Unnamed: 0,a,b,c,d
o,0,1,2,3
h,4,5,6,7
e,8,9,10,11
a,12,13,14,15
j,16,17,18,19


In [None]:
df

## Indexing with `.loc`

The general rule (for readability and to avoid subtle bugs):

- use `[]` when accessing columns by label,
- use `.loc` when accessing both rows and columns by label,
- use `.iloc` for positional indexing.
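The three rules above can be sketched on a frame with the same shape as `df`. The row labels are fixed here so that the results are reproducible; `df_demo` is an illustrative name.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(20).reshape((5, 4)),
                       columns=['d', 'c', 'b', 'a'],
                       index=['o', 'l', 'x', 'g', 'd'])

cols = df_demo[['a', 'b']]             # []    : columns by label
cell = df_demo.loc['x', 'b']           # .loc  : row label and column label
block = df_demo.loc['l':'g', 'c':'b']  # .loc  : label slices include both ends
first = df_demo.iloc[:2, :2]           # .iloc : positions, upper bound excluded
```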

In [None]:
df

In [None]:
df.loc['o']

In [None]:
df.loc['o', 'b']

In [None]:
df.loc['o':, 'b']

In [None]:
df.loc['g':, 'b':]

In [None]:
df

In [None]:
df.loc['g':, 'c':'d'].shape

In [None]:
df

Column index is still an index and works in a similar manner.

In [None]:
df.columns.to_series().rank()

In [None]:
df.loc['x':, 'a'::-2]

In [None]:
df.loc['x':, 'c':'d']

In [None]:
df.sort_index(axis='columns').loc['x':, 'c':'d']

In [None]:
df.loc[:, ["a", "b"]]

`.loc` can contain a mask (Pandas will align it for you):

In [None]:
df.loc[df['c']>10, 'c']

In [None]:
df.loc[:, df.columns[2:]]

In [None]:
df.loc[[1,2], 'c'] # This won't work: .loc cannot use a mix

## `SettingWithCopyWarning`

In [None]:
df

Each indexing operation returns either a copy of the dataframe or a view into it and, in contrast to NumPy, Pandas gives no guarantee about which one you get.

In [None]:
df.loc[df['a']>10, 'c']

In [None]:
df.__setitem__?

An assignment like this works the same way as in NumPy and the original dataframe is modified (under the hood it's just a single `__setitem__` call on `df.loc`):

In [None]:
df.loc[df['a']>10, 'c'] = 10

In [None]:
df

This one, however, contains two chained `__getitem__` calls:

In [None]:
df.loc[df['a']>10]['c']

The following assignment generates a warning (it's unknown if `df.loc[df['a']>10]` is a view or a copy):

In [None]:
df.loc[df['a']>10]['c'] = 20.

In [None]:
df

Let's decompose it:

In [None]:
df_1 = df.loc[df['a']>10]

In [None]:
df_1

In [None]:
df_1['c'] = 25

In [None]:
df_1

In [None]:
df
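Two patterns sidestep the ambiguity, sketched below on a fresh frame: assign through a single `.loc` call when the original should change, and take an explicit `.copy()` when it should not. `df_demo` and `subset` are illustrative names; the outcome of the chained assignment itself is deliberately not shown, since it can depend on the pandas version.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(20).reshape((5, 4)),
                       columns=['d', 'c', 'b', 'a'])

# One .loc call selects and assigns in a single step, so df_demo itself changes
df_demo.loc[df_demo['a'] > 10, 'c'] = 10

# An explicit copy makes the intent clear: edits stay inside the subset
subset = df_demo.loc[df_demo['a'] > 10].copy()
subset['c'] = 25
```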

# Dataframe arithmetic

In [81]:
df_1 = pd.DataFrame(np.arange(40).reshape(10,4),
                    columns=['a', 'b', 'c', 'd'],
                    index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))
df_1

Unnamed: 0,a,b,c,d
f,0,1,2,3
l,4,5,6,7
h,8,9,10,11
y,12,13,14,15
j,16,17,18,19
z,20,21,22,23
t,24,25,26,27
g,28,29,30,31
e,32,33,34,35
p,36,37,38,39


In [82]:
df_2 = pd.DataFrame(np.arange(40).reshape(10,4),
                    columns=['a', 'e', 'c', 'd'],
                    index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))
df_2

Unnamed: 0,a,e,c,d
n,0,1,2,3
t,4,5,6,7
e,8,9,10,11
i,12,13,14,15
g,16,17,18,19
j,20,21,22,23
m,24,25,26,27
c,28,29,30,31
z,32,33,34,35
w,36,37,38,39


In [83]:
# A lot of missing values
df_1 + df_2

Unnamed: 0,a,b,c,d,e
c,,,,,
e,40.0,,44.0,46.0,
f,,,,,
g,44.0,,48.0,50.0,
h,,,,,
i,,,,,
j,36.0,,40.0,42.0,
l,,,,,
m,,,,,
n,,,,,


We can provide a fill value for missing **operands**:

In [86]:
df_1.add(df_2, fill_value=0)

Unnamed: 0,a,b,c,d,e
c,28.0,,30.0,31.0,29.0
e,40.0,33.0,44.0,46.0,9.0
f,0.0,1.0,2.0,3.0,
g,44.0,29.0,48.0,50.0,17.0
h,8.0,9.0,10.0,11.0,
i,12.0,,14.0,15.0,13.0
j,36.0,17.0,40.0,42.0,21.0
l,4.0,5.0,6.0,7.0,
m,24.0,,26.0,27.0,25.0
n,0.0,,2.0,3.0,1.0
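The effect of `fill_value` is easier to follow on two small frames. A sketch with made-up data (`left` and `right` are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2], 'b': [10, 20]}, index=['x', 'y'])
right = pd.DataFrame({'a': [100, 200], 'e': [5, 6]}, index=['y', 'z'])

plain = left + right                    # any missing operand gives NaN
filled = left.add(right, fill_value=0)  # a single missing operand is treated as 0
```

A cell stays `NaN` in `filled` only when it is missing from both frames, such as row `x`, column `e`.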


Operations between a dataframe and a series align the series index with the dataframe columns by default:

In [87]:
s_1 = pd.Series(np.arange(10),
                name='f',
                index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))

In [88]:
s_1

z    0
c    1
a    2
n    3
e    4
x    5
j    6
h    7
d    8
p    9
Name: f, dtype: int32

In [89]:
df_1

Unnamed: 0,a,b,c,d
f,0,1,2,3
l,4,5,6,7
h,8,9,10,11
y,12,13,14,15
j,16,17,18,19
z,20,21,22,23
t,24,25,26,27
g,28,29,30,31
e,32,33,34,35
p,36,37,38,39


In [90]:
df_1 + s_1

Unnamed: 0,a,b,c,d,e,h,j,n,p,x,z
f,2.0,,3.0,11.0,,,,,,,
l,6.0,,7.0,15.0,,,,,,,
h,10.0,,11.0,19.0,,,,,,,
y,14.0,,15.0,23.0,,,,,,,
j,18.0,,19.0,27.0,,,,,,,
z,22.0,,23.0,31.0,,,,,,,
t,26.0,,27.0,35.0,,,,,,,
g,30.0,,31.0,39.0,,,,,,,
e,34.0,,35.0,43.0,,,,,,,
p,38.0,,39.0,47.0,,,,,,,


In [91]:
s_1 + df_1

Unnamed: 0,a,b,c,d,e,h,j,n,p,x,z
f,2.0,,3.0,11.0,,,,,,,
l,6.0,,7.0,15.0,,,,,,,
h,10.0,,11.0,19.0,,,,,,,
y,14.0,,15.0,23.0,,,,,,,
j,18.0,,19.0,27.0,,,,,,,
z,22.0,,23.0,31.0,,,,,,,
t,26.0,,27.0,35.0,,,,,,,
g,30.0,,31.0,39.0,,,,,,,
e,34.0,,35.0,43.0,,,,,,,
p,38.0,,39.0,47.0,,,,,,,


The default can be changed:

In [92]:
df_1.add(s_1, axis='index')

Unnamed: 0,a,b,c,d
a,,,,
c,,,,
d,,,,
e,36.0,37.0,38.0,39.0
f,,,,
g,,,,
h,15.0,16.0,17.0,18.0
j,22.0,23.0,24.0,25.0
l,,,,
n,,,,


Such an alignment (along columns) allows many common operations to be written in a short form. For example, to standardize each column (subtract its mean and divide by its standard deviation), we just do

In [93]:
(df_1 - df_1.mean()) / df_1.std()

Unnamed: 0,a,b,c,d
f,-1.486301,-1.486301,-1.486301,-1.486301
l,-1.156012,-1.156012,-1.156012,-1.156012
h,-0.825723,-0.825723,-0.825723,-0.825723
y,-0.495434,-0.495434,-0.495434,-0.495434
j,-0.165145,-0.165145,-0.165145,-0.165145
z,0.165145,0.165145,0.165145,0.165145
t,0.495434,0.495434,0.495434,0.495434
g,0.825723,0.825723,0.825723,0.825723
e,1.156012,1.156012,1.156012,1.156012
p,1.486301,1.486301,1.486301,1.486301


In [94]:
df_1.mean()

a    18.0
b    19.0
c    20.0
d    21.0
dtype: float64
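For comparison, here is a sketch of the same standardization done per column and per row. The row-wise version needs the explicit `axis='index'` alignment described earlier; `df_demo` and the result names are illustrative.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(40).reshape(10, 4),
                       columns=['a', 'b', 'c', 'd'])

# Column-wise: df.mean() is indexed by column labels, so it aligns with the columns
by_col = (df_demo - df_demo.mean()) / df_demo.std()

# Row-wise: the statistics are indexed by row labels, so align on axis='index'
row_mean = df_demo.mean(axis=1)
row_std = df_demo.std(axis=1)
by_row = df_demo.sub(row_mean, axis='index').div(row_std, axis='index')
```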

# Applying functions to dataframes

In [None]:
df

The main entry point for applying a function over rows or columns is `apply`:

In [97]:
df.apply(lambda row: np.sqrt(row.d), axis=1)

o    0.000000
l    2.000000
x    2.828427
g    3.464102
d    4.000000
dtype: float64

This one is faster, though:

In [98]:
%timeit df['d'].apply(lambda x: np.sqrt(x))

215 µs ± 7.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Pandas allows NumPy functions to be used directly:

In [99]:
np.sqrt(df['d'])

o    0.000000
l    2.000000
x    2.828427
g    3.464102
d    4.000000
Name: d, dtype: float64

Which is faster:

In [100]:
%timeit np.sqrt(df['d'])

101 µs ± 3.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [101]:
# Better way
%timeit np.sqrt(df['d'].values)

5.98 µs ± 497 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [102]:
np.sqrt(df['d'].values)

array([0.        , 2.        , 2.82842712, 3.46410162, 4.        ])

Often, replacing `apply` altogether is the best option:

In [103]:
df_copy = df.copy()
df_copy["d_sqrt"] = np.sqrt(df['d'].values)

Note that we can often mix Pandas and NumPy:

In [104]:
df

Unnamed: 0,d,c,b,a
o,0,1,2,3
l,4,5,6,7
x,8,9,10,11
g,12,13,14,15
d,16,17,18,19


In [105]:
df.values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [106]:
df.values.sum(axis=1)

array([ 6, 22, 38, 54, 70])

In [107]:
df.apply(lambda x: x.sum(), axis=1)

o     6
l    22
x    38
g    54
d    70
dtype: int64

In [108]:
np.sum(df, axis=1)

o     6
l    22
x    38
g    54
d    70
dtype: int64

In [109]:
df.sum(axis=1)

o     6
l    22
x    38
g    54
d    70
dtype: int64

Pandas is smart enough to combine the result in a proper manner:

In [110]:
dfm = df.apply(lambda x: pd.Series({'sum': x.sum(),
                                    'sqrt': np.sqrt(x['d'])}),
               axis=1)

In [111]:
dfm

Unnamed: 0,sum,sqrt
o,6.0,0.0
l,22.0,2.0
x,38.0,2.828427
g,54.0,3.464102
d,70.0,4.0
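The same two-column result can usually be built without `apply` at all. The sketch below computes it both ways on a frame shaped like `df` (with a default index; `df_demo`, `slow` and `fast` are illustrative names) so the two can be compared.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(20).reshape((5, 4)),
                       columns=['d', 'c', 'b', 'a'])

# Row-by-row: the lambda is called once per row
slow = df_demo.apply(lambda row: pd.Series({'sum': row.sum(),
                                            'sqrt': np.sqrt(row['d'])}),
                     axis=1)

# Vectorized: each column is computed in one call
fast = pd.DataFrame({'sum': df_demo.sum(axis=1),
                     'sqrt': np.sqrt(df_demo['d'])})
```

The values agree; only the dtype of `sum` differs, since the vectorized version keeps it as an integer column.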


# Dataframe statistics

In [78]:
titanic['Pclass'].value_counts()


In [None]:
titanic.SibSp.value_counts()

In [None]:
titanic.Embarked.value_counts() # S = Southampton, C = Cherbourg, Q = Queenstown

In [79]:
titanic.Sex.value_counts()


In [None]:
print("Average age: %2.2f" % titanic['Age'].mean())
print("STD of age: %2.2f" % titanic['Age'].std())
print("Minimum age: %2.2f" % titanic['Age'].min())
print("Maximum age: %2.2f" % titanic['Age'].max())

In [119]:
print("Average number of siblings/spouse: %2.2f" % titanic['SibSp'].mean())
print("Average number of siblings/spouse in class 1: %2.2f" % titanic.loc[titanic.Pclass==1, 'SibSp'].mean())
print("Average number of siblings/spouse in class 2: %2.2f" % titanic.loc[titanic.Pclass==2, 'SibSp'].mean())
print("Average number of siblings/spouse in class 3: %2.2f" % titanic.loc[titanic.Pclass==3, 'SibSp'].mean())

Average number of siblings/spouse: 0.50
Average number of siblings/spouse in class 1: 0.44
Average number of siblings/spouse in class 2: 0.39
Average number of siblings/spouse in class 3: 0.57


In [120]:
print("Minimum age (not survived): %2.2f" % titanic.loc[titanic.Survived==0, 'Age'].min())
print("Maximum age (not survived): %2.2f" % titanic.loc[titanic.Survived==0, 'Age'].max())

Minimum age (not survived): 1.00
Maximum age (not survived): 74.00


In [121]:
print("Minimum age (survived): %2.2f" % titanic.loc[titanic.Survived==1, 'Age'].min())
print("Maximum age (survived): %2.2f" % titanic.loc[titanic.Survived==1, 'Age'].max())

Minimum age (survived): 0.42
Maximum age (survived): 80.00


# Replacing and renaming

In [None]:
titanic.replace(22, 122).head()

In [None]:
import re
titanic.replace(re.compile(r'\(.*\)'), '').head()

In [None]:
titanic.rename(lambda x: x.lower(), axis=1).head()

In [None]:
titanic.rename({'SibSp':'siblings_spouses'}, axis=1).head()

In [80]:
titanic.rename?



# String operations

In [124]:
titanic.head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [123]:
titanic.replace(re.compile(r'\(.*\)'), '').Name.str.split(",", expand=True)

Unnamed: 0_level_0,0,1
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Braund,Mr. Owen Harris
2,Cumings,Mrs. John Bradley
3,Heikkinen,Miss. Laina
4,Futrelle,Mrs. Jacques Heath
5,Allen,Mr. William Henry
...,...,...
1305,Spector,Mr. Woolf
1306,Oliva y Ocana,Dona. Fermina
1307,Saether,Mr. Simon Sivertsen
1308,Ware,Mr. Frederick


In [125]:
(titanic
 .replace(re.compile(r'\(.*\)'), '')
 .Name.str
 .split(',', expand=True)
 .rename({0:'family_name', 1:'first_name'}, axis=1)
 .head())

Unnamed: 0_level_0,family_name,first_name
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Braund,Mr. Owen Harris
2,Cumings,Mrs. John Bradley
3,Heikkinen,Miss. Laina
4,Futrelle,Mrs. Jacques Heath
5,Allen,Mr. William Henry
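A self-contained sketch of the same pipeline on two names taken from the table above. It uses `Series.str.replace` with `regex=True` instead of `DataFrame.replace`, which is an alternative rather than the approach used in the cells above, and adds a `.str.strip()` step because the split pieces keep their surrounding spaces.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris",
                   "Futrelle, Mrs. Jacques Heath (Lily May Peel)"],
                  name="Name")

parts = (names
         .str.replace(r'\(.*\)', '', regex=True)  # drop the parenthesised part
         .str.split(',', expand=True)             # one column per piece
         .rename({0: 'family_name', 1: 'first_name'}, axis=1))

# The pieces keep leading and trailing whitespace; .str.strip() removes it
parts['first_name'] = parts['first_name'].str.strip()
```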


# Cleaning data

`isnull` is a very convenient method:

In [114]:
titanic.isnull().head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,False,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,True,False


The resulting dataframe can now be used to determine if there are any missing values (by column or by row):

In [115]:
titanic.isnull().any()

Survived     True
Pclass      False
Name        False
Sex         False
Age          True
SibSp       False
Parch       False
Ticket      False
Fare         True
Cabin        True
Embarked     True
dtype: bool

In [116]:
titanic.isnull().any(axis=1).head()

PassengerId
1     True
2    False
3     True
4    False
5     True
dtype: bool

Or calculate how many missing values are in a dataframe (by row or by column):

In [117]:
titanic.isnull().sum()

Survived     418
Pclass         0
Name           0
Sex            0
Age          263
SibSp          0
Parch          0
Ticket         0
Fare           1
Cabin       1014
Embarked       2
dtype: int64

In [118]:
titanic.head(15)

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0.0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0.0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0.0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1.0,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1.0,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


Pandas is smart enough to fill missing values by column:

In [None]:
fill_values = titanic[['Age', 'Fare']].mean()

In [None]:
fill_values

In [None]:
titanic[titanic.Fare.isnull()]

In [None]:
titanic.fillna(fill_values).head(15)
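A sketch of the per-column fill on a toy frame with made-up values: the series of means is indexed by column name, and `fillna` matches it against the columns, leaving columns without an entry untouched. `demo` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Age': [20.0, np.nan, 40.0],
                     'Fare': [10.0, 30.0, np.nan],
                     'Cabin': ['C85', None, None]})

fill_values = demo[['Age', 'Fare']].mean()  # Series indexed by column name
filled = demo.fillna(fill_values)           # each column gets its own fill value
```

`Cabin` has no entry in `fill_values`, so its missing values remain.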

# Getting indicators and dummy variables

In [None]:
pd.get_dummies(titanic, columns=['Pclass', 'Sex', 'Embarked']).head()
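A sketch of what `pd.get_dummies` produces, on a three-row stand-in frame. The dtype of the indicator columns differs between pandas versions (integer or boolean), so the example converts them with `astype(int)` before inspecting them.

```python
import pandas as pd

demo = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                     'Age': [22.0, 38.0, 26.0]})

# Each listed column is replaced by one indicator column per distinct value
dummies = pd.get_dummies(demo, columns=['Sex'])
female_flags = dummies['Sex_female'].astype(int)
```

Columns that are not listed in `columns` (here `Age`) are passed through unchanged.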