# `pandas`

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')


Note the import convention:

In [2]:
import numpy as np
import pandas as pd

In [3]:
np.random.seed(983456)

## Creating `pd.Series`

When creating Pandas `Series` you can provide values only:

In [4]:
s = pd.Series(np.random.randn(10))
s

0   -1.187327
1    0.382796
2   -0.681736
3   -3.534783
4    0.304866
5   -0.899330
6    1.194968
7   -0.446314
8    2.598223
9   -0.795144
dtype: float64

Values and series name:

In [5]:
s = pd.Series(np.random.randn(10), name="random_series")
s

0    2.433533
1   -0.677715
2    0.871098
3    0.128585
4   -1.042224
5    0.228245
6   -0.361624
7    0.447801
8    2.045056
9   -1.291771
Name: random_series, dtype: float64

Values, index and series name:

In [6]:
s = pd.Series(np.random.randn(10), name="random_series",
              index=np.random.randint(23, size=(10,)))
s

10    0.496488
2     0.516279
20   -1.497689
0    -0.256733
1     0.149244
14   -2.084936
16    0.117294
19    1.779378
3     0.820993
3    -0.090797
Name: random_series, dtype: float64

In [7]:
s.index

Index([10, 2, 20, 0, 1, 14, 16, 19, 3, 3], dtype='int64')

An index can be created explicitly (and can have its own name):

In [8]:
s = pd.Series(np.random.randn(10), name="random_series",
              index=pd.Index(np.random.randint(23, size=(10,)), name="main_index"))
s

main_index
5    -1.062838
10    0.704245
0    -0.596709
16   -0.244326
21   -0.862128
1    -0.532820
11   -0.364554
17   -2.594341
19   -0.014463
16   -0.012298
Name: random_series, dtype: float64

In [9]:
s.index

Index([5, 10, 0, 16, 21, 1, 11, 17, 19, 16], dtype='int64', name='main_index')

A series can be created from a dictionary as well:

In [10]:
s = pd.Series({'a':3, 'c':6, 'b':2}, name="dict_series")
s

a    3
c    6
b    2
Name: dict_series, dtype: int64

In [11]:
s.index

Index(['a', 'c', 'b'], dtype='object')

## Creating `pd.DataFrame`

Again, we can pass just the values and Pandas will create an index (for both rows and columns) for us:

In [12]:
df = pd.DataFrame(np.arange(20).reshape((5,4)))
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


An easy way to access the data types in a dataframe:

In [13]:
df.dtypes

0    int64
1    int64
2    int64
3    int64
dtype: object

We can provide column names:

In [14]:
df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['a', 'b', 'c', 'd'])
df

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


Values, index and column names:

In [15]:
import string
df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['a', 'b', 'c', 'd'],
                  index=np.random.choice(list(string.ascii_lowercase), 5, replace=False))
df

Unnamed: 0,a,b,c,d
e,0,1,2,3
d,4,5,6,7
p,8,9,10,11
a,12,13,14,15
h,16,17,18,19


In [16]:
list(df.columns)

['a', 'b', 'c', 'd']

In [17]:
df.index

Index(['e', 'd', 'p', 'a', 'h'], dtype='object')

Can you guess what `df['a']` means?

In [18]:
df['a']

e     0
d     4
p     8
a    12
h    16
Name: a, dtype: int64

In [19]:
df.a

e     0
d     4
p     8
a    12
h    16
Name: a, dtype: int64

Can we access a row in the same way?

In [20]:
df['h']

KeyError: 'h'

Each column is a `pd.Series`:

In [None]:
type(df['a'])

# Reading CSV files

We will use the [Titanic dataset](https://www.kaggle.com/c/titanic/data):

In [None]:
titanic_train = pd.read_csv('train.csv')

By default, Pandas creates an integer row index and reads column names from the first (0th) row of a CSV file:

In [21]:
titanic_train


Glimpse into a dataframe:

In [22]:
titanic_train.head(3)


In [23]:
titanic_train.tail()


In [24]:
titanic_train.info()


In [25]:
titanic_train.describe()


## Basic indexing of Pandas dataframes

We can set the index column in `pd.read_csv`:

In [27]:
titanic_train = pd.read_csv('train.csv', index_col='PassengerId')
titanic_test = pd.read_csv('test.csv', index_col='PassengerId')

In [28]:
titanic_train.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Accessing a single column:

In [29]:
titanic_train["Survived"]

PassengerId
1      0
2      1
3      1
4      1
5      0
      ..
887    0
888    1
889    0
890    1
891    0
Name: Survived, Length: 891, dtype: int64

A set of columns:

In [30]:
titanic_train[["Name", "Survived"]]

PassengerId,Name,Survived
1,"Braund, Mr. Owen Harris",0
2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
3,"Heikkinen, Miss. Laina",1
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
5,"Allen, Mr. William Henry",0
...,...,...
887,"Montvila, Rev. Juozas",0
888,"Graham, Miss. Margaret Edith",1
889,"Johnston, Miss. Catherine Helen ""Carrie""",0
890,"Behr, Mr. Karl Howell",1


Just in case: the order in which columns are requested (usually) does not matter for performance:

In [31]:
%timeit titanic_train[["Name", "Survived"]]

71.8 μs ± 1.75 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [32]:
%timeit titanic_train[["Survived", "Name"]]

73.6 μs ± 1.61 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Integer indexing is also available with `[]` notation, but with some peculiarities:

In [33]:
titanic_train[2:4]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


But:

In [34]:
titanic_train[2]

KeyError: 2

`[]` may be ambiguous, and it's better to use it only for column access. If you want to use row labels, use `.loc`:

In [None]:
titanic_train.loc[2]

In [None]:
titanic_train.head()

Note that `titanic_train.loc[...]` is label-based, not positional, even though the row labels are integers. The difference matters even more for non-monotonic indexes (both the default index and `PassengerId` are unique and monotonic).
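To make the label/position distinction concrete, here is a minimal sketch on a made-up series (the name `toy` and its values are assumptions for illustration, not part of the Titanic data). The integer labels are deliberately shuffled so that label `0` and position `0` point at different rows.

```python
import pandas as pd

# Hypothetical toy series: integer labels in shuffled (non-monotonic) order.
toy = pd.Series([10, 20, 30, 40], index=[3, 1, 0, 2], name="toy")

by_label = toy.loc[0]      # the row whose label is 0
by_position = toy.iloc[0]  # the first row, whatever its label is

print(by_label, by_position)
```

With a default `0..n-1` index the two lookups coincide, which is exactly why the difference is easy to miss.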

Label-based slice (inclusive bounds):

In [None]:
titanic_train.loc[2:4]

Positional slice (exclusive upper bound):

In [None]:
titanic_train[2:4]

`.loc` indexing is very flexible and can combine row and column access in one run:

In [None]:
titanic_train.loc[2:4, "Age"]

In [None]:
titanic_train.loc[2:4, ["Age"]]

This one won't work:

In [None]:
titanic_train[2:4, ["Age"]]

In [None]:
titanic_train.loc[2:10:2, ["Age"]]

In [None]:
titanic_train.loc[titanic_train["Age"] < 5, ["Name", "Pclass"]]

In [None]:
titanic_train.loc[(titanic_train["Age"] < 5) & (titanic_train.Pclass == 2), "Name"]

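This won't work, because `&` binds more tightly than `<` and `==`, so Python evaluates `5 & titanic_train.Pclass` first: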

In [None]:
titanic_train.loc[titanic_train["Age"] < 5 & titanic_train.Pclass == 2, "Name"]

In [None]:
titanic_train.head()

In [None]:
5 & titanic_train.Pclass

In [None]:
(titanic_train["Age"] < 5).values
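The precedence issue can be reproduced on a tiny frame. This is a sketch with made-up data (`people` and its values are hypothetical); it shows the parenthesised mask working and the unparenthesised expression failing when Python asks for the truth value of a whole series.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic files.
people = pd.DataFrame({"Age": [3, 40, 4], "Pclass": [2, 2, 3]})

# Parenthesised: each comparison is evaluated first, then combined element-wise.
mask = (people["Age"] < 5) & (people["Pclass"] == 2)

# Without parentheses Python parses this as the chained comparison
#   people["Age"] < (5 & people["Pclass"]) == 2
# which needs bool(Series) and therefore raises ValueError.
try:
    people["Age"] < 5 & people["Pclass"] == 2
    raised = False
except ValueError:
    raised = True

print(mask.tolist(), raised)
```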

`.iloc`, in contrast, is explicitly positional and can combine both row and column positions (and upper bounds are always exclusive):

In [None]:
titanic_train.iloc[:2, 3]  # Note resulting series name: Pandas preserves column name

In [None]:
titanic_train.iloc[:2, 3:5]

You cannot mix positional and label-based indexing:

In [None]:
titanic_train.iloc[:2, "Name"]

But you still can use filtering:

In [None]:
titanic_train.iloc[(titanic_train.Age < 10).values, 2]  # titanic_train.iloc[titanic_train.Age < 10, 2] won't work
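A small sketch of the same filtering pattern, assuming a made-up frame `kids` (names and values are hypothetical). The boolean series is converted to a plain array first, so `.iloc` receives a purely positional mask with no index attached.

```python
import pandas as pd

# Hypothetical toy data with a non-default integer index.
kids = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cid"], "Age": [4, 35, 8]},
    index=[101, 102, 103],
)

mask = (kids["Age"] < 10).to_numpy()  # plain boolean array, no index attached
young_names = kids.iloc[mask, 0]      # column position 0 is "Name"

print(young_names)
```

`to_numpy()` is used here in place of `.values`; for a boolean column both return the same array.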

## Performance

But how is an index useful? (Note the filtering notation.)

In [None]:
titanic_train = pd.read_csv('train.csv')

In [None]:
%timeit titanic_train[titanic_train.PassengerId==400]

In [None]:
titanic_train = pd.read_csv('train.csv', index_col='PassengerId')

In [None]:
%timeit titanic_train.loc[400]
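The two lookups being timed return the same data in different shapes. The sketch below uses a made-up three-row frame (`flat`, `indexed` and the fare values are assumptions) and only checks equivalence; it makes no claim about timings.

```python
import pandas as pd

# Hypothetical toy data; the fares are invented.
flat = pd.DataFrame({"PassengerId": [398, 399, 400], "Fare": [1.0, 2.0, 3.0]})
indexed = flat.set_index("PassengerId")

via_mask = flat[flat["PassengerId"] == 400]  # compares every row, returns a DataFrame
via_label = indexed.loc[400]                 # index lookup, returns one row as a Series

print(via_mask["Fare"].iloc[0], via_label["Fare"])
```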

## Combining dataframes

In [None]:
pd.concat([titanic_train, titanic_test], ignore_index=True)

Note how Pandas filled the `Survived` column (which is not even present in `titanic_test`!) with missing values. A better way to combine dataframes when the index has actual meaning:

In [None]:
titanic = pd.concat([titanic_train, titanic_test])
titanic

In [None]:
titanic
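The effect of `ignore_index` and the handling of a column missing from one frame can be seen on two tiny frames. This is a sketch with invented data (`train`, `test` and their values are hypothetical stand-ins for the Titanic files).

```python
import pandas as pd

# Hypothetical stand-ins: `test` has no "Survived" column.
train = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0]}, index=[1, 2])
test = pd.DataFrame({"Age": [34.5]}, index=[3])

keep_index = pd.concat([train, test])                    # row labels 1, 2, 3 are kept
new_index = pd.concat([train, test], ignore_index=True)  # row labels become 0, 1, 2

print(keep_index)
print(new_index)
```

In both results the `Survived` value for the row coming from `test` is `NaN`, so the column is stored as floating point.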

# Indexing `pd.Series` in depth

In [None]:
np.random.seed(983456)

N_ELEMS = 20

s = pd.Series(np.random.randint(20, size=(N_ELEMS,)),
              index=list(string.ascii_lowercase)[:N_ELEMS],
              name='randint_series')
s

## Indexing with `[]`

In [None]:
s

In [None]:
s[['i']]

Label-based slicing does not work the way you might expect (both bounds are inclusive):

In [None]:
s['a':'f']
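A minimal sketch contrasting the two slice conventions on a made-up series (`letters` is a hypothetical name). The positional slice is written with `.iloc` here to keep the intent explicit.

```python
import pandas as pd

# Hypothetical toy series with string labels.
letters = pd.Series([1, 2, 3, 4], index=list("abcd"))

by_label = letters["a":"c"]      # label slice: 'c' is included
by_position = letters.iloc[0:2]  # positional slice: position 2 is excluded

print(by_label.index.tolist(), by_position.index.tolist())
```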

Indexing with a list of labels works as well:

In [None]:
s[['k', 'q', 'a', 'r']]

In [None]:
s.index

Note that positional slicing works as well:

In [None]:
s[0:5]

In [None]:
s[5:3:-1]

## Indexing with `.loc`

In [None]:
np.random.seed(983456)

s_int_idx = pd.Series(np.random.randint(20, size=(N_ELEMS,)),
                      index=np.random.choice(N_ELEMS, N_ELEMS, replace=False),
                      name='randint_series')
s_int_idx

We have an integer index. What if we use slicing here? Will it be positional or will it use the row index?

In [None]:
s_int_idx[2:15]

Surprising. But that's the way Pandas works, and you'll love it over time (its API is strongly tailored to the most common operations, making them more concise).

In [None]:
s_int_idx[2]  # label

In [None]:
s_int_idx[2:5]  # position

Boolean mask? Sure.

In [None]:
s_int_idx[s_int_idx.index.isin(range(2,6))] # based on value of an index using a boolean mask

In [None]:
s_int_idx

Again, `[]` may often be ambiguous. Use `.loc` or `.iloc` to make your code readable and clean:

In [None]:
s_int_idx.loc[2:15]  # label

In [None]:
s_int_idx.iloc[2:15]  # position

What if we use an upper bound that is not in the index? In general, it won't work:

In [None]:
s_int_idx.loc[2:456]

Because of this:

In [None]:
s_int_idx.index.is_monotonic_increasing

But we can make it work (or rather you now know when it works and when it doesn't):

In [None]:
s_int_idx.sort_index().loc[2:234]

Because:

In [None]:
s_int_idx.sort_index().index.is_monotonic_increasing
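The dependence on monotonicity can be checked on a four-element series. This is a sketch with invented values (`shuffled` is a hypothetical name): neither slice bound is a label of the index, so the slice works only once the index is sorted.

```python
import pandas as pd

# Hypothetical toy series with a unique but non-monotonic integer index.
shuffled = pd.Series([10, 20, 30, 40], index=[7, 2, 9, 4])

# Sorted index: bounds that are not labels are located by ordering.
in_range = shuffled.sort_index().loc[3:100]

# Unsorted index: the same slice cannot be resolved and raises KeyError.
try:
    shuffled.loc[3:100]
    failed = False
except KeyError:
    failed = True

print(in_range.tolist(), failed)
```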

We'll see why this works a bit later. We can do complex filtering/masking/boolean indexing as well:

In [None]:
s_int_idx[s_int_idx.index!=11]

In [None]:
s_int_idx[(s_int_idx>15) | (s_int_idx<5)]

In [None]:
s_int_idx.loc[s_int_idx!=14]

# Indexing `pd.DataFrame`

In [None]:
np.random.seed(983456)

df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['d', 'c', 'b', 'a'],
                  index=np.random.choice(list(string.ascii_lowercase), 5, replace=False))
df

OK, so `[]` (without `loc` or `iloc`) is probably positional?

In [None]:
df[2:5]

In [None]:
df['o'] # Nope, it doesn't work that way

But here's the thing: **the same** `[]` notation works differently if you're using column labels:

In [None]:
df['a']

In [None]:
df

Note that this one returns a dataframe:

In [None]:
df[['b']]

... and this one returns `pd.Series`:

In [None]:
df['b']

In [None]:
df

In [None]:
df.columns

In [None]:
l = list(df.columns)
l

In [None]:
df.columns[2:]

In [None]:
df[df.columns[2:]]

In [None]:
df.iloc[:, 2:]

In [None]:
df

So, `[]` is positional. Is it?

In [None]:
df[:'g'] # Surprising!

In [None]:
df['a':'u'] # Not really surprising

In [None]:
df.sort_index()['a':'u']

But neither `a` nor `u` is even in the row index!

In [None]:
df

In [None]:
df['d':]

In [None]:
list(df.columns)['d':'b']

In [None]:
df["x":]

But:

In [None]:
df.sort_index()['e':]

In [None]:
df

In [None]:
df['k':'z'] # No, that won't work

In [None]:
df.sort_index()['k':'zjyyf']

In reality, Pandas relies on the ordering (ranking) of index labels:

In [None]:
df.index.to_series().rank()

If the index is monotonic, you can slice with labels that are not in the index:

In [None]:
df.sort_index().index.to_series().rank()

In [None]:
df.sort_index()['b':'m']  # Pandas can unambiguously set 'b' to be less than 'd' and 'm' to be between 'l' and 'o'

In [None]:
df

In [None]:
df.loc['o':'x', 'c'] = 5

In [None]:
df

In [None]:
df_sub = df['o':'x']
df_sub['c'] = 5 # Not a very good idea

In [None]:
df[(df['a']>12) | (df['b']<3)]

In [None]:
df

## Indexing with `.loc`

The general rule (for readability and to avoid subtle bugs) is:

- use `[]` when accessing columns by label,
- use `.loc` when accessing both rows and columns by label,
- use `.iloc` for positional indexing.
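The three rules above can be sketched on a tiny frame (`grid` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with string row labels.
grid = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

col = grid["a"]            # [] : a column by label, returns a Series
cell = grid.loc["y", "b"]  # .loc : row and column by label, returns a scalar
corner = grid.iloc[0, 1]   # .iloc : row and column by position, returns a scalar

print(col.tolist(), cell, corner)
```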

In [None]:
df

In [None]:
df.loc['o']

In [None]:
df.loc['o', 'b']

In [None]:
df.loc['o':, 'b']

In [None]:
df.loc['g':, 'b':]

In [None]:
df

In [None]:
df.loc['g':, 'c':'d'].shape

In [None]:
df.loc['g':, 'c':'d']

In [None]:
df

The column index is still an index and works in a similar manner.

In [None]:
df.columns.to_series().rank()

In [None]:
df.loc['x':, 'a'::-2]

In [None]:
df.loc['x':, 'c':'d']

In [None]:
df.sort_index(axis='columns').loc['x':, 'c':'d']

In [None]:
df.loc[:, ["a", "b"]]

`.loc` can contain a mask (Pandas will align it for you):

In [None]:
df.loc[df['c']>10, 'c']

In [None]:
df.loc[:, df.columns[2:]]

In [None]:
df.loc[[1,2], 'c'] # This won't work: .loc cannot use a mix of positional and label indexing

## `SettingWithCopyWarning`

In [None]:
df

Each indexing operation returns either a copy of or a view into the dataframe and, in contrast to NumPy, Pandas provides no guarantee as to which one you get.

In [35]:
df.loc[df['a']>10, 'c']

a    14
h    18
Name: c, dtype: int64

In [36]:
df.__setitem__?

An assignment like this works the same way as in NumPy, and the original dataframe is modified (under the hood it's just a call to `df.__setitem__`):

In [37]:
df.loc[df['a']>10, 'c'] = 10

In [38]:
df

Unnamed: 0,a,b,c,d
e,0,1,2,3
d,4,5,6,7
p,8,9,10,11
a,12,13,10,15
h,16,17,10,19


This one, however, contains two chained `__getitem__` calls:

In [39]:
df.loc[df['a']>10]['c']

a    10
h    10
Name: c, dtype: int64

The following assignment generates a warning (it's unknown if `df.loc[df['a']>10]` is a view or a copy):

In [40]:
df.loc[df['a']>10]['c'] = 20.

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['a']>10]['c'] = 20.


In [41]:
df

Unnamed: 0,a,b,c,d
e,0,1,2,3
d,4,5,6,7
p,8,9,10,11
a,12,13,10,15
h,16,17,10,19


Let's decompose it:

In [42]:
df_1 = df.loc[df['a']>10]

In [43]:
df_1

Unnamed: 0,a,b,c,d
a,12,13,10,15
h,16,17,10,19


In [44]:
df_1['c'] = 25

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_1['c'] = 25


In [45]:
df_1

Unnamed: 0,a,b,c,d
a,12,13,25,15
h,16,17,25,19


In [46]:
df

Unnamed: 0,a,b,c,d
e,0,1,2,3
d,4,5,6,7
p,8,9,10,11
a,12,13,10,15
h,16,17,10,19
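Two patterns avoid the ambiguity shown above. The sketch below uses a made-up frame (`frame`, `subset` and the values are assumptions): a single `.loc` assignment when the original should change, and an explicit `.copy()` when an independent subset is wanted.

```python
import pandas as pd

# Hypothetical toy frame.
frame = pd.DataFrame({"a": [0, 12, 16], "c": [2, 14, 18]}, index=["e", "a", "h"])

# One .loc call with a row mask and a column label: modifies `frame` itself.
frame.loc[frame["a"] > 10, "c"] = 10

# Explicit copy: `subset` is independent, so assigning to it cannot touch `frame`.
subset = frame.loc[frame["a"] > 10].copy()
subset["c"] = 25

print(frame)
print(subset)
```

Being explicit about which of the two is intended is usually what the warning is asking for.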


# Dataframe arithmetic

In [47]:
df_1 = pd.DataFrame(np.arange(40).reshape(10,4),
                    columns=['a', 'b', 'c', 'd'],
                    index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))
df_1

Unnamed: 0,a,b,c,d
o,0,1,2,3
h,4,5,6,7
e,8,9,10,11
a,12,13,14,15
j,16,17,18,19
s,20,21,22,23
f,24,25,26,27
b,28,29,30,31
r,32,33,34,35
w,36,37,38,39


In [48]:
df_2 = pd.DataFrame(np.arange(40).reshape(10,4),
                    columns=['a', 'e', 'c', 'd'],
                    index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))
df_2

Unnamed: 0,a,e,c,d
s,0,1,2,3
m,4,5,6,7
u,8,9,10,11
t,12,13,14,15
f,16,17,18,19
v,20,21,22,23
q,24,25,26,27
b,28,29,30,31
k,32,33,34,35
y,36,37,38,39


In [49]:
# A lot of missing values
df_1 + df_2

Unnamed: 0,a,b,c,d,e
a,,,,,
b,56.0,,60.0,62.0,
e,,,,,
f,40.0,,44.0,46.0,
h,,,,,
j,,,,,
k,,,,,
m,,,,,
o,,,,,
q,,,,,


We can provide a fill value for missing **operands**:

In [50]:
df_1.add(df_2, fill_value=0)

Unnamed: 0,a,b,c,d,e
a,12.0,13.0,14.0,15.0,
b,56.0,29.0,60.0,62.0,29.0
e,8.0,9.0,10.0,11.0,
f,40.0,25.0,44.0,46.0,17.0
h,4.0,5.0,6.0,7.0,
j,16.0,17.0,18.0,19.0,
k,32.0,,34.0,35.0,33.0
m,4.0,,6.0,7.0,5.0
o,0.0,1.0,2.0,3.0,
q,24.0,,26.0,27.0,25.0
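A compact sketch of how `fill_value` treats missing operands, using two invented frames (`left` and `right` are hypothetical). A cell is filled only when exactly one operand is missing; when both are missing the result stays `NaN`.

```python
import pandas as pd

# Hypothetical toy frames with partially overlapping rows and columns.
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
right = pd.DataFrame({"a": [10, 20], "e": [5, 6]}, index=["y", "z"])

plain = left + right                    # any missing operand gives NaN
filled = left.add(right, fill_value=0)  # a single missing operand is treated as 0

print(plain)
print(filled)
```

For example, the cell at row `x`, column `e` exists in neither frame, so it remains `NaN` even with `fill_value=0`.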


Operations between dataframes and series are aligned along columns by default:

In [51]:
s_1 = pd.Series(np.arange(10),
                name='f',
                index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))

In [52]:
s_1

f    0
t    1
r    2
g    3
q    4
e    5
s    6
n    7
w    8
l    9
Name: f, dtype: int64

In [53]:
df_1

Unnamed: 0,a,b,c,d
o,0,1,2,3
h,4,5,6,7
e,8,9,10,11
a,12,13,14,15
j,16,17,18,19
s,20,21,22,23
f,24,25,26,27
b,28,29,30,31
r,32,33,34,35
w,36,37,38,39


In [None]:
df_1 + s_1

In [None]:
s_1 + df_1

The default can be changed:

In [None]:
df_1.add(s_1, axis='index')
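The two alignment modes are easier to compare when the series labels actually match something. In this sketch (all names and values are made up), `per_column` is labelled like the columns and `per_row` like the rows.

```python
import pandas as pd

# Hypothetical toy data.
table = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
per_column = pd.Series({"a": 10, "b": 20})
per_row = pd.Series({"x": 100, "y": 200})

by_columns = table + per_column             # series labels matched to column labels
by_rows = table.add(per_row, axis="index")  # series labels matched to row labels

print(by_columns)
print(by_rows)
```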

Such an alignment (along columns) allows many common operations to be written in a short form. For example, to normalize each column, we just do:

In [None]:
df_1.mean()

In [None]:
df_1

In [None]:
df_1 - df_1.mean()

In [None]:
(df_1 - df_1.mean()) / df_1.std()
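Since `mean()` and `std()` return one value per column, the expression above standardizes each column independently. A quick check on made-up numbers (`scores` is a hypothetical frame):

```python
import pandas as pd

# Hypothetical toy data.
scores = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

# Per-column statistics are aligned with the column labels during the arithmetic.
standardized = (scores - scores.mean()) / scores.std()

print(standardized.mean())  # approximately 0 for every column
print(standardized.std())   # approximately 1 for every column
```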

In [None]:
# If df includes both numeric and non-numeric columns, how do we apply mathematical operations only to the numeric ones?
df_1.select_dtypes(include=np.number).mean()

In [None]:
df_1['newcol'] = 'kosta'
df_1

In [None]:
df_1.mean()

In [None]:
# If df includes both numeric and non-numeric columns, how do we apply mathematical operations only to the numeric ones?
df_1.select_dtypes(include=np.number).mean()

# Applying functions to dataframes

In [None]:
df

`apply` is the main entry point for applying a function over rows or columns:

In [None]:
%timeit np.sqrt(df)

In [None]:
np.mean(df, axis=1)

In [None]:
%timeit df.apply(lambda row: np.sqrt(row))

This one is faster, though:

In [None]:
def plus1(x):
    return x+1

In [None]:
%timeit df['d'].apply(plus1)

In [None]:
%timeit plus1(df)

In [None]:
%timeit df['d'].apply(lambda x: np.sqrt(x))

In [None]:
df['k'] = df.d + 2*df.a
df

Pandas allows you to use NumPy functions directly:

In [None]:
np.sqrt(df['d'])

Which is faster:

In [None]:
%timeit np.sqrt(df['d'])

In [None]:
# Better way
%timeit np.sqrt(df['d'].values)

In [None]:
np.sqrt(df['d'].values)

Often replacing `apply` altogether is the best option:

In [None]:
df_copy = df.copy()
df_copy["d_sqrt"] = np.sqrt(df['d'].values)

In [None]:
df_copy

Note that we can often mix Pandas and NumPy:

In [None]:
df

In [None]:
df.values

In [None]:
df.values.sum(axis=1)

In [None]:
df.apply(lambda x: x.sum(), axis=1)

In [None]:
np.sum(df, axis=1)

In [None]:
df.sum(axis=1)

Pandas is smart enough to combine the result in a proper manner:

In [None]:
dfm = df.apply(lambda x: pd.Series({'sum': x.sum(),
                                    'sqrt': np.sqrt(x['d'])}),
               axis=1)

In [None]:
dfm
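The same row-wise `apply` pattern on a two-row frame makes the shape of the result easy to see. This is a sketch with invented data (`nums` and `summary` are hypothetical names): each returned series becomes one row, and its keys become the column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame.
nums = pd.DataFrame({"a": [1, 2], "d": [4, 9]})

# Each call receives one row and returns a small labelled Series.
summary = nums.apply(
    lambda row: pd.Series({"sum": row.sum(), "sqrt": np.sqrt(row["d"])}),
    axis=1,
)

print(summary)
```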

# Dataframe statistics

In [None]:
titanic['Cabin'].value_counts(dropna=False)

In [None]:
titanic.SibSp.value_counts()

In [None]:
titanic.Embarked.value_counts() # S = Southampton, C = Cherbourg, Q = Queenstown

In [None]:
titanic.Sex.value_counts()

In [None]:
print("Average age: %2.2f" % titanic['Age'].mean())
print("STD of age: %2.2f" % titanic['Age'].std())
print("Minimum age: %2.2f" % titanic['Age'].min())
print("Maximum age: %2.2f" % titanic['Age'].max())

In [None]:
print("Average number of siblings/spouse: %2.2f" % titanic['SibSp'].mean())
print("Average number of siblings/spouse in class 1: %2.2f" % titanic.loc[titanic.Pclass==1, 'SibSp'].mean())
print("Average number of siblings/spouse in class 2: %2.2f" % titanic.loc[titanic.Pclass==2, 'SibSp'].mean())
print("Average number of siblings/spouse in class 3: %2.2f" % titanic.loc[titanic.Pclass==3, 'SibSp'].mean())

In [None]:
print("Minimum age (not survived): %2.2f" % titanic.loc[titanic.Survived==0, 'Age'].min())
print("Maximum age (not survived): %2.2f" % titanic.loc[titanic.Survived==0, 'Age'].max())
print("Mean age (not survived): %2.2f" % titanic.loc[titanic.Survived==0, 'Age'].mean())

In [None]:
print("Minimum age (survived): %2.2f" % titanic.loc[titanic.Survived==1, 'Age'].min())
print("Maximum age (survived): %2.2f" % titanic.loc[titanic.Survived==1, 'Age'].max())
print("Mean age (survived): %2.2f" % titanic.loc[titanic.Survived==1, 'Age'].mean())

# Replacing and renaming

In [None]:
titanic.replace(22, 122).head()

In [None]:
titanic.head()

In [None]:
import re
titanic.replace(re.compile(r'\(.*\)'), '').head()

In [None]:
titanic.rename(lambda x: x.lower(), axis=1).head()

In [None]:
titanic.head()

In [None]:
titanic.rename({'SibSp':'siblings_spouses','Pclass':'class'}, axis=1).head()

# String operations

In [None]:
titanic.head()

In [None]:
titanic.replace(re.compile(r'\(.*\)'), '').Name.str.split(",", expand=True)

In [None]:
(titanic
 .replace(re.compile(r'\(.*\)'), '')
 .Name.str
 .split(',', expand=True)
 .rename({0:'family_name', 1:'first_name'}, axis=1)
 .head())
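A self-contained sketch of the same clean-and-split chain on two invented names (the strings are made up, not taken from the dataset). As an assumption for illustration, it swaps `DataFrame.replace` with a compiled pattern for `Series.str.replace(..., regex=True)`, which does the same job on a single column.

```python
import pandas as pd

# Hypothetical names in the "Family, Title. First (Alias)" layout.
names = pd.Series(["Doe, Mr. John (Johnny)", "Roe, Miss. Jane"], name="Name")

parts = (names
         .str.replace(r"\s*\(.*\)", "", regex=True)  # drop the parenthesised part
         .str.split(",", expand=True)                # one column per piece
         .rename({0: "family_name", 1: "first_name"}, axis=1))

print(parts)
```

Note that the second column keeps the space that followed the comma; `.str.strip()` removes it if needed.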

# Cleaning data

`isnull` is a very convenient method:

In [None]:
titanic.isnull().head()

The resulting dataframe can now be used to determine if there are any missing values (by column or by row):

In [None]:
titanic.isnull().any()

In [None]:
titanic.isnull().any(axis=1).head()

Or calculate how many missing values are in a dataframe (by row or by column):

In [None]:
titanic.isnull().sum()

In [None]:
titanic.head(15)

Pandas is smart enough to fill missing values by column:

In [None]:
fill_values = titanic[['Age', 'Fare']].mean()
fill_values

In [None]:
titanic[titanic.Fare.isnull()]

In [None]:
titanic.fillna(fill_values).head(15)
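Passing a series to `fillna` fills each column with the value whose label matches the column name. A sketch on made-up numbers (`people` is a hypothetical frame):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column.
people = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Fare": [np.nan, 10.0, 30.0]})

fill_values = people[["Age", "Fare"]].mean()  # per-column means, NaN skipped
filled = people.fillna(fill_values)           # each column gets its own mean

print(filled)
```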

# Getting indicators and dummy variables

In [None]:
pd.get_dummies(titanic, columns=['Pclass', 'Sex', 'Embarked']).head()
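To see what `get_dummies` produces without loading the dataset, here is a sketch on an invented three-row frame (`ports` and its values are assumptions). Each listed column is replaced by one indicator column per distinct value, named `<column>_<value>`; other columns are left alone.

```python
import pandas as pd

# Hypothetical toy data.
ports = pd.DataFrame({"Embarked": ["S", "C", "S"], "Fare": [1.0, 2.0, 3.0]})

dummies = pd.get_dummies(ports, columns=["Embarked"])

print(dummies)
```

The dtype of the indicator columns depends on the Pandas version (boolean in recent releases), so cast with `.astype(int)` if 0/1 integers are required.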