# Review of `pandas`

Using Spark is very similar to using `pandas`, because the Spark framework largely hides the fact that your computations are distributed. Skill with `pandas` DataFrames therefore transfers directly, since Spark DataFrames are used in much the same way.

In [1]:
import numpy as np
import pandas as pd

## Series and Data Frames

### Series objects

A `Series` is like a vector: every element has the same type, or is null.

In [2]:
s = pd.Series([1,1,2,3] + [None])
s

0    1.0
1    1.0
2    2.0
3    3.0
4    NaN
dtype: float64
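
The `float64` dtype above comes from the missing value: a plain integer array cannot hold `NaN`, so the integers are upcast to floats. As a small sketch, assuming a pandas version that provides the nullable integer dtype `'Int64'` (capital I), the integers can be kept alongside the gap:

```python
import pandas as pd

# Default behaviour: the None forces an upcast from integers to float64.
s_float = pd.Series([1, 1, 2, 3, None])

# Nullable integer dtype: values stay integers, the gap is stored as <NA>.
s_int = pd.Series([1, 1, 2, 3, None], dtype="Int64")

print(s_float.dtype)
print(s_int.dtype)
print(s_int.isna().sum())
```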

### Size

In [3]:
s.size

5

### Unique Counts

In [4]:
s.value_counts()

1.0    2
3.0    1
2.0    1
dtype: int64
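
Note that the null entry does not appear in the counts above. A minimal sketch of how to include it, using the `dropna` argument of `value_counts`, and how to count distinct non-null values with `nunique`:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, None])

counts = s.value_counts()                  # nulls are excluded by default
counts_all = s.value_counts(dropna=False)  # nulls get their own row
n_distinct = s.nunique()                   # number of distinct non-null values

print(counts_all)
print(n_distinct)
```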

### Special types of series

#### Strings

In [5]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1

0      the quick
1    quick brown
2      brown fox
3      fox jumps
4     jumps over
5       over the
6       the lazy
7       lazy dog
dtype: object

In [6]:
s1.str.upper()

0      THE QUICK
1    QUICK BROWN
2      BROWN FOX
3      FOX JUMPS
4     JUMPS OVER
5       OVER THE
6       THE LAZY
7       LAZY DOG
dtype: object

In [7]:
s1.str.split()

0      [the, quick]
1    [quick, brown]
2      [brown, fox]
3      [fox, jumps]
4     [jumps, over]
5       [over, the]
6       [the, lazy]
7       [lazy, dog]
dtype: object

In [8]:
s1.str.split().str[1]

0    quick
1    brown
2      fox
3    jumps
4     over
5      the
6     lazy
7      dog
dtype: object

### Categories

In [9]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2

0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: object

In [10]:
s2 = s2.astype('category')
s2

0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: category
Categories (4, object): [Asian, Black, Hispanic, White]

In [11]:
s2.cat.categories

Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')

In [12]:
s2.cat.codes

0    0
1    0
2    3
3    1
4    3
5    2
dtype: int8

### If you want a pre-specified order

In [13]:
s2 = s2.cat.reorder_categories(['Hispanic', 'White', 'Black', 'Asian'])

In [14]:
s2.cat.categories

Index(['Hispanic', 'White', 'Black', 'Asian'], dtype='object')

In [15]:
s2.cat.codes

0    3
1    3
2    1
3    2
4    1
5    0
dtype: int8
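
Reordering the categories changes the codes, but the categories are still unordered in the sense that `<` and `>` are not defined for them. If the order is meaningful, an ordered categorical may be more useful. The sketch below uses a made-up `sizes` series (not part of the data above) to illustrate:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical example data with a natural ordering.
sizes = pd.Series(["medium", "small", "large", "small"])
size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = sizes.astype(size_type)

in_order = sizes.sort_values().tolist()        # sorted by category order, not alphabetically
at_least_medium = (sizes >= "medium").tolist() # comparisons respect the order
largest = sizes.max()

print(in_order)
print(at_least_medium)
print(largest)
```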

### Timestamps

In [16]:
s3 = pd.date_range('now', periods = 3, freq='d')

In [17]:
s3

DatetimeIndex(['2018-11-12 10:47:03.981679', '2018-11-13 10:47:03.981679',
               '2018-11-14 10:47:03.981679'],
              dtype='datetime64[ns]', freq='D')

In [18]:
s3.year

Int64Index([2018, 2018, 2018], dtype='int64')

In [19]:
s3.month

Int64Index([11, 11, 11], dtype='int64')

In [20]:
s3.day

Int64Index([12, 13, 14], dtype='int64')

In [21]:
s3.strftime('%d-%b-%Y')

Index(['12-Nov-2018', '13-Nov-2018', '14-Nov-2018'], dtype='object')

In [22]:
(s3 + pd.Timedelta(days=1)).strftime('%d-%b-%Y')

Index(['13-Nov-2018', '14-Nov-2018', '15-Nov-2018'], dtype='object')
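
The attributes above are read directly from a `DatetimeIndex`. When the timestamps are stored in a `Series` (for example, a DataFrame column), the same attributes are reached through the `.dt` accessor. A small sketch, with a fixed start date assumed so that the result is reproducible:

```python
import pandas as pd

# Fixed start date (an assumption) instead of 'now'.
dates = pd.Series(pd.date_range("2018-11-12", periods=3, freq="D"))

years = dates.dt.year.tolist()
labels = dates.dt.strftime("%Y-%m-%d").tolist()

# Shift every timestamp by one day with an explicit Timedelta.
shifted = (dates + pd.Timedelta(days=1)).dt.strftime("%Y-%m-%d").tolist()

print(years)
print(labels)
print(shifted)
```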

### DataFrame objects

A `DataFrame` is like a matrix. Columns in a `DataFrame` are `Series`.

- Each column in a DataFrame represents a **variable**
- Each row in a DataFrame represents an **observation**
- Each cell in a DataFrame represents a **value**

In [23]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df

Unnamed: 0,num
0,1.0
1,2.0
2,3.0
3,


In [24]:
df.num

0    1.0
1    2.0
2    3.0
3    NaN
Name: num, dtype: float64

### Index

Row and column identifiers are of `Index` type.

Somewhat confusingly, index is also a synonym for the row identifiers.

In [25]:
df.index

RangeIndex(start=0, stop=4, step=1)

#### Setting a column as the row index

In [26]:
df

Unnamed: 0,num
0,1.0
1,2.0
2,3.0
3,


In [27]:
df1 = df.set_index('num')
df1

num
1.0
2.0
3.0
NaN


#### Making an index into a column

In [28]:
df1.reset_index()

Unnamed: 0,num
0,1.0
1,2.0
2,3.0
3,


### Columns

The column labels are just another `Index` object.

In [29]:
df.columns

Index(['num'], dtype='object')

### Getting raw values

Sometimes you just want a `numpy` array, and not a `pandas` object.

In [30]:
df.values

array([[ 1.],
       [ 2.],
       [ 3.],
       [nan]])
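
Newer versions of pandas also offer the `to_numpy` method, which does the same job as `.values` for this kind of data. A minimal sketch, assuming a pandas version that includes `to_numpy`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(num=[1, 2, 3, None]))

arr = df.to_numpy()          # 2-D array: one row per observation
col = df["num"].to_numpy()   # 1-D array for a single column

print(arr.shape, col.shape)
```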

## Creating Data Frames

### Manual

In [31]:
from collections import OrderedDict

In [32]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='d')
df = pd.DataFrame(OrderedDict(pid=np.random.randint(100, 999, n), 
                              weight=np.random.normal(70, 20, n),
                              height=np.random.normal(170, 15, n),
                              date=dates,
                             ))
df

Unnamed: 0,pid,weight,height,date
0,131,75.132077,160.047589,2018-11-12 10:47:04.155297
1,437,46.79341,175.417372,2018-11-13 10:47:04.155297
2,184,76.212384,152.291074,2018-11-14 10:47:04.155297
3,645,83.316648,176.507327,2018-11-15 10:47:04.155297
4,694,86.77115,164.698527,2018-11-16 10:47:04.155297


### From file

You can read in data from many different sources: plain text, JSON, spreadsheets, databases and so on. Functions to read in data look like `read_X`, where `X` is the format (for example `read_csv`, `read_json`, `read_excel` and `read_sql`).

In [33]:
%%file measures.txt
pid	weight	height	date
328	72.654347	203.560866	2018-11-11 14:16:18.148411
756	34.027679	189.847316	2018-11-12 14:16:18.148411
185	28.501914	158.646074	2018-11-13 14:16:18.148411
507	17.396343	180.795993	2018-11-14 14:16:18.148411
919	64.724301	173.564725	2018-11-15 14:16:18.148411

Overwriting measures.txt


In [34]:
df = pd.read_table('measures.txt')
df

Unnamed: 0,pid,weight,height,date
0,328,72.654347,203.560866,2018-11-11 14:16:18.148411
1,756,34.027679,189.847316,2018-11-12 14:16:18.148411
2,185,28.501914,158.646074,2018-11-13 14:16:18.148411
3,507,17.396343,180.795993,2018-11-14 14:16:18.148411
4,919,64.724301,173.564725,2018-11-15 14:16:18.148411
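
When a file is read this way, the `date` column arrives as plain strings. The sketch below shows one way to parse it while reading, using `read_csv` with a tab separator. To stay self-contained it reads from an in-memory buffer holding two of the rows above rather than from `measures.txt`:

```python
import io
import pandas as pd

# Two rows copied from measures.txt, kept in memory instead of on disk.
text = (
    "pid\tweight\theight\tdate\n"
    "328\t72.654347\t203.560866\t2018-11-11 14:16:18.148411\n"
    "756\t34.027679\t189.847316\t2018-11-12 14:16:18.148411\n"
)

# sep="\t" handles the tab-delimited layout; parse_dates converts the column.
measures = pd.read_csv(io.StringIO(text), sep="\t", parse_dates=["date"])

print(measures.dtypes)
```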


### From `numpy.ndarray`

In [35]:
vals = np.array([line.split('\t') for line in '''
649	42.942970	173.576789	2018-11-11 13:33:24.006649
533	58.421067	185.424830	2018-11-12 13:33:24.006649
918	60.209659	176.470378	2018-11-13 13:33:24.006649
590	66.595320	139.766303	2018-11-14 13:33:24.006649
112	77.112459	169.990751	2018-11-15 13:33:24.006649
'''.strip().splitlines()])
vals

array([['649', '42.942970', '173.576789', '2018-11-11 13:33:24.006649'],
       ['533', '58.421067', '185.424830', '2018-11-12 13:33:24.006649'],
       ['918', '60.209659', '176.470378', '2018-11-13 13:33:24.006649'],
       ['590', '66.595320', '139.766303', '2018-11-14 13:33:24.006649'],
       ['112', '77.112459', '169.990751', '2018-11-15 13:33:24.006649']],
      dtype='<U26')

In [36]:
df = pd.DataFrame(vals, columns=['pid', 'weight', 'height', 'date'])
df

Unnamed: 0,pid,weight,height,date
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649


## Indexing Data Frames

### Implicit defaults

If you provide a slice, it is assumed that you are asking for rows.

In [37]:
df[1:3]

Unnamed: 0,pid,weight,height,date
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649


If you provide a single value or a list, it is assumed that you are asking for columns.

In [38]:
df[['pid', 'weight']]

Unnamed: 0,pid,weight
0,649,42.94297
1,533,58.421067
2,918,60.209659
3,590,66.59532
4,112,77.112459


### Extracting a column

#### Dictionary style access

In [39]:
df['pid']

0    649
1    533
2    918
3    590
4    112
Name: pid, dtype: object

#### Property style access

This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes, or keywords).

In [40]:
df.pid

0    649
1    533
2    918
3    590
4    112
Name: pid, dtype: object
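
Property style access also fails quietly when a column name collides with an existing DataFrame attribute or method. The sketch below uses made-up column names (`count` and `body mass`) to show why dictionary style access is the safer default:

```python
import pandas as pd

# Hypothetical columns: one clashes with a method, one contains a space.
df = pd.DataFrame({"count": [1, 2, 3], "body mass": [70.5, 80.1, 65.3]})

# df.count is the DataFrame.count method, not the column.
is_method = callable(df.count)

# Dictionary style access always returns the column.
count_values = df["count"].tolist()
mass_values = df["body mass"].tolist()

print(is_method)
print(count_values)
print(mass_values)
```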

### Indexing by location

This is similar to `numpy` indexing: positions are zero-based and the end of a slice is excluded.

In [41]:
df.iloc[1:3, :]

Unnamed: 0,pid,weight,height,date
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649


In [42]:
df.iloc[1:3, [True, False, True, False]]

Unnamed: 0,pid,height
1,533,185.42483
2,918,176.470378


### Indexing by name

In [43]:
df.loc[1:3, 'weight':'height']

Unnamed: 0,weight,height
1,58.421067,185.42483
2,60.209659,176.470378
3,66.59532,139.766303


**Warning**: When using `loc`, the row slice refers to row labels, not positions, and both endpoints of the slice are included.

In [44]:
df1 = df.copy()
df1.index = df.index + 1
df1

Unnamed: 0,pid,weight,height,date
1,649,42.94297,173.576789,2018-11-11 13:33:24.006649
2,533,58.421067,185.42483,2018-11-12 13:33:24.006649
3,918,60.209659,176.470378,2018-11-13 13:33:24.006649
4,590,66.59532,139.766303,2018-11-14 13:33:24.006649
5,112,77.112459,169.990751,2018-11-15 13:33:24.006649


In [45]:
df1.loc[1:3, 'weight':'height']

Unnamed: 0,weight,height
1,42.94297,173.576789
2,58.421067,185.42483
3,60.209659,176.470378
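
The difference between the two indexers is easiest to see side by side. A minimal sketch on a small made-up frame whose row labels start at 1:

```python
import pandas as pd

# Row labels 1..5, so labels and positions differ by one.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50]}, index=[1, 2, 3, 4, 5])

by_label = df.loc[1:3]      # labels 1, 2 and 3: the end label is included
by_position = df.iloc[1:3]  # positions 1 and 2: the end position is excluded

print(by_label["x"].tolist())
print(by_position["x"].tolist())
```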


## Structure of a Data Frame

### Data types

In [46]:
df.dtypes

pid       object
weight    object
height    object
date      object
dtype: object

### Converting data types

#### Using `astype` on one column

In [47]:
df.pid = df.pid.astype('category')

#### Using `astype` on multiple columns

In [48]:
df = df.astype(dict(weight=float, height=float))

#### Using a conversion function

In [49]:
df.date = pd.to_datetime(df.date)

#### Check

In [50]:
df.dtypes

pid             category
weight           float64
height           float64
date      datetime64[ns]
dtype: object
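
`astype(float)` raises an error if any entry cannot be converted. When the raw strings may contain junk, `pd.to_numeric` with `errors='coerce'` is one alternative that turns unparseable entries into `NaN`. A sketch with made-up input values:

```python
import pandas as pd

# Hypothetical raw column with one entry that is not a number.
raw = pd.Series(["42.9", "58.4", "n/a", "66.6"])

# Unparseable strings become NaN instead of raising an error.
weights = pd.to_numeric(raw, errors="coerce")

print(weights)
```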

### Basic properties

In [51]:
df.size

20

In [52]:
df.shape

(5, 4)

In [53]:
df.describe()

Unnamed: 0,weight,height
count,5.0,5.0
mean,61.056295,169.04581
std,12.492347,17.33572
min,42.94297,139.766303
25%,58.421067,169.990751
50%,60.209659,173.576789
75%,66.59532,176.470378
max,77.112459,185.42483


### Inspection

In [54]:
df.head(n=3)

Unnamed: 0,pid,weight,height,date
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649


In [55]:
df.tail(n=3)

Unnamed: 0,pid,weight,height,date
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649


In [56]:
df.sample(n=3)

Unnamed: 0,pid,weight,height,date
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649


In [57]:
df.sample(frac=0.5)

Unnamed: 0,pid,weight,height,date
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649


## Selecting, Renaming and Removing Columns

### Selecting columns

In [58]:
df.filter(items=['pid', 'date'])

Unnamed: 0,pid,date
0,649,2018-11-11 13:33:24.006649
1,533,2018-11-12 13:33:24.006649
2,918,2018-11-13 13:33:24.006649
3,590,2018-11-14 13:33:24.006649
4,112,2018-11-15 13:33:24.006649


In [59]:
df.filter(regex='.*ght')

Unnamed: 0,weight,height
0,42.94297,173.576789
1,58.421067,185.42483
2,60.209659,176.470378
3,66.59532,139.766303
4,77.112459,169.990751


#### Note that you can also use regular string methods on the columns

In [60]:
df.loc[:, df.columns.str.contains('d')]

Unnamed: 0,pid,date
0,649,2018-11-11 13:33:24.006649
1,533,2018-11-12 13:33:24.006649
2,918,2018-11-13 13:33:24.006649
3,590,2018-11-14 13:33:24.006649
4,112,2018-11-15 13:33:24.006649


### Renaming columns

In [61]:
df.rename(dict(weight='w', height='h'), axis=1)

Unnamed: 0,pid,w,h,date
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649


In [62]:
orig_cols = df.columns 

In [63]:
df.columns = list('abcd')

In [64]:
df

Unnamed: 0,a,b,c,d
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649


In [65]:
df.columns = orig_cols

In [66]:
df

Unnamed: 0,pid,weight,height,date
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649


### Removing columns

In [67]:
df.drop(['pid', 'date'], axis=1)

Unnamed: 0,weight,height
0,42.94297,173.576789
1,58.421067,185.42483
2,60.209659,176.470378
3,66.59532,139.766303
4,77.112459,169.990751


In [68]:
df.drop(columns=['pid', 'date'])

Unnamed: 0,weight,height
0,42.94297,173.576789
1,58.421067,185.42483
2,60.209659,176.470378
3,66.59532,139.766303
4,77.112459,169.990751


In [69]:
df.drop(columns=df.columns[df.columns.str.contains('d')])

Unnamed: 0,weight,height
0,42.94297,173.576789
1,58.421067,185.42483
2,60.209659,176.470378
3,66.59532,139.766303
4,77.112459,169.990751


## Selecting, Renaming and Removing Rows

### Selecting rows

In [70]:
df[df.weight.between(60,70)]

Unnamed: 0,pid,weight,height,date
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649


In [71]:
df[(69 <= df.weight) & (df.weight < 70)]

Unnamed: 0,pid,weight,height,date


In [72]:
df[df.date.between(pd.to_datetime('2018-11-13'), 
                   pd.to_datetime('2018-11-15 23:59:59'))]

Unnamed: 0,pid,weight,height,date
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649
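
Boolean masks can also be written as a string expression with the `query` method, which some people find easier to read. A minimal sketch, using rounded copies of the weights above:

```python
import pandas as pd

df = pd.DataFrame({"pid": [649, 533, 918, 590, 112],
                   "weight": [42.94, 58.42, 60.21, 66.60, 77.11]})

# Boolean mask, as in the cells above.
by_mask = df[(df.weight >= 60) & (df.weight <= 70)]

# The same filter as a query string; column names are used directly.
by_query = df.query("weight >= 60 and weight <= 70")

print(by_query)
```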


### Renaming rows

In [73]:
df.rename({i:letter for i,letter in enumerate('abcde')})

Unnamed: 0,pid,weight,height,date
a,649,42.94297,173.576789,2018-11-11 13:33:24.006649
b,533,58.421067,185.42483,2018-11-12 13:33:24.006649
c,918,60.209659,176.470378,2018-11-13 13:33:24.006649
d,590,66.59532,139.766303,2018-11-14 13:33:24.006649
e,112,77.112459,169.990751,2018-11-15 13:33:24.006649


In [74]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumps']

In [75]:
df

Unnamed: 0,pid,weight,height,date
the,649,42.94297,173.576789,2018-11-11 13:33:24.006649
quick,533,58.421067,185.42483,2018-11-12 13:33:24.006649
brown,918,60.209659,176.470378,2018-11-13 13:33:24.006649
fox,590,66.59532,139.766303,2018-11-14 13:33:24.006649
jumps,112,77.112459,169.990751,2018-11-15 13:33:24.006649


In [76]:
df = df.reset_index(drop=True)

In [77]:
df

Unnamed: 0,pid,weight,height,date
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649


### Dropping rows

In [78]:
df.drop([1,3], axis=0)

Unnamed: 0,pid,weight,height,date
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649


## Transforming and Creating Columns

In [79]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)

Unnamed: 0,pid,weight,height,date,bmi
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415


In [80]:
df['bmi'] = df['weight'] / (df['height']/100)**2

In [81]:
df

Unnamed: 0,pid,weight,height,date,bmi
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415


In [82]:
df['something'] = [2,2,None,None,3]

In [83]:
df

Unnamed: 0,pid,weight,height,date,bmi,something
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0


### Uniqueness

In [84]:
df.something.unique()

array([ 2., nan,  3.])

In [85]:
df.loc[df.something.duplicated()]

Unnamed: 0,pid,weight,height,date,bmi,something
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,


In [86]:
df.drop_duplicates(subset='something')

Unnamed: 0,pid,weight,height,date,bmi,something
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0


### Missing data

In [87]:
df

Unnamed: 0,pid,weight,height,date,bmi,something
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0


In [88]:
df.something.fillna(0)

0    2.0
1    2.0
2    0.0
3    0.0
4    3.0
Name: something, dtype: float64

In [89]:
df.something.ffill()

0    2.0
1    2.0
2    2.0
3    2.0
4    3.0
Name: something, dtype: float64

In [90]:
df.something.bfill()

0    2.0
1    2.0
2    3.0
3    3.0
4    3.0
Name: something, dtype: float64

In [91]:
df.something.interpolate()

0    2.000000
1    2.000000
2    2.333333
3    2.666667
4    3.000000
Name: something, dtype: float64

In [92]:
df.dropna()

Unnamed: 0,pid,weight,height,date,bmi,something
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0
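
Before filling or dropping anything, it usually helps to count how much is missing in each column. The sketch below, on a reduced copy of the frame above, also shows `dropna` restricted to chosen columns and a per-column fill value (the median is just one possible choice):

```python
import pandas as pd

df = pd.DataFrame({"bmi": [14.3, 17.0, 19.3, 34.1, 26.7],
                   "something": [2, 2, None, None, 3]})

# Number of missing values in each column.
n_missing = df.isna().sum()

# Drop rows only if 'bmi' is missing; gaps in other columns are tolerated.
complete_bmi = df.dropna(subset=["bmi"])

# Fill one column with its own median, leaving other columns untouched.
filled = df.fillna({"something": df["something"].median()})

print(n_missing)
print(filled)
```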


## Sorting Data Frames

### Sort on indexes

In [93]:
df.sort_index(axis=1)

Unnamed: 0,bmi,date,height,pid,something,weight
0,14.253082,2018-11-11 13:33:24.006649,173.576789,649,2.0,42.94297
1,16.991578,2018-11-12 13:33:24.006649,185.42483,533,2.0,58.421067
2,19.334037,2018-11-13 13:33:24.006649,176.470378,918,,60.209659
3,34.090923,2018-11-14 13:33:24.006649,139.766303,590,,66.59532
4,26.685415,2018-11-15 13:33:24.006649,169.990751,112,3.0,77.112459


In [94]:
df.sort_index(axis=0, ascending=False)

Unnamed: 0,pid,weight,height,date,bmi,something
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0


### Sort on values

In [95]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])

Unnamed: 0,pid,weight,height,date,bmi,something
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,


## The `apply`, `applymap`, `transform` and `agg` methods

In [96]:
words = pd.DataFrame(OrderedDict(numbers="one two three".split(), food="cake biscuit salad".split()))

In [97]:
words

Unnamed: 0,numbers,food
0,one,cake
1,two,biscuit
2,three,salad


### Apply a function element-wise

In [98]:
words.applymap(len)

Unnamed: 0,numbers,food
0,3,4
1,3,7
2,5,5


### Apply a function along an axis

#### Column margins

In [99]:
words.applymap(len).apply(np.sum, axis=0)

numbers    11
food       16
dtype: int64

#### Row margins

In [100]:
words.applymap(len).apply(np.sum, axis=1)

0     7
1    10
2    10
dtype: int64

### Apply a transformation to an N-dimensional array

In [101]:
words.applymap(len).transform(lambda x: x/x.sum())

Unnamed: 0,numbers,food
0,0.272727,0.25
1,0.272727,0.4375
2,0.454545,0.3125


### Apply an aggregation function

In [102]:
words.applymap(len).agg(np.sum)

numbers    11
food       16
dtype: int64

In [103]:
words.applymap(len).agg(['count', np.sum, np.mean])

Unnamed: 0,numbers,food
count,3.0,3.0
sum,11.0,16.0
mean,3.666667,5.333333
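
In recent pandas releases `applymap` has been renamed to `DataFrame.map`, and the old name is deprecated. The sketch below is written to run under either name, and also shows the same element-wise result obtained column by column with the vectorised `.str` accessor:

```python
import pandas as pd

words = pd.DataFrame({"numbers": "one two three".split(),
                      "food": "cake biscuit salad".split()})

# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = words.map if hasattr(words, "map") else words.applymap
lengths = elementwise(len)

# Same result with apply plus the string accessor on each column.
lengths_str = words.apply(lambda col: col.str.len())

print(lengths)
```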


## Split-Apply-Combine

We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with `groupby` followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.
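
The three conceptual steps can be spelled out by hand. The following sketch uses a small made-up frame and is only meant to illustrate the idea; in practice the single `groupby` expression at the end is what you would write:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"treatment": list("ababa"),
                   "weight": [42.9, 58.4, 60.2, 66.6, 77.1]})

# Split: iterate over one sub-frame per treatment.
# Apply: compute the mean weight of each piece.
pieces = {key: group["weight"].mean() for key, group in df.groupby("treatment")}

# Combine: gather the per-group results into a single Series.
manual = pd.Series(pieces, name="weight").rename_axis("treatment")

# The same computation as one groupby expression.
direct = df.groupby("treatment")["weight"].mean()

print(manual)
print(np.allclose(manual, direct))
```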

In [104]:
df['treatment'] = list('ababa')

In [105]:
df

Unnamed: 0,pid,weight,height,date,bmi,something,treatment
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0,a
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0,b
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,,a
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,,b
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0,a


In [106]:
grouped = df.groupby('treatment')

In [107]:
grouped.get_group('a')

Unnamed: 0,pid,weight,height,date,bmi,something,treatment
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0,a
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,,a
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0,a


In [108]:
grouped.mean(numeric_only=True)

Unnamed: 0_level_0,weight,height,bmi,something
treatment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a,60.088363,173.345973,20.090845,2.5
b,62.508194,162.595566,25.54125,2.0


### Using `agg` with `groupby`

In [109]:
grouped.agg('mean')

Unnamed: 0_level_0,weight,height,bmi,something
treatment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a,60.088363,173.345973,20.090845,2.5
b,62.508194,162.595566,25.54125,2.0


In [110]:
grouped.agg(['mean', 'std'])

Unnamed: 0_level_0,weight,weight,height,height,bmi,bmi,something,something
Unnamed: 0_level_1,mean,std,mean,std,mean,std,mean,std
treatment,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
a,60.088363,17.085067,173.345973,3.245974,20.090845,6.250624,2.5,0.707107
b,62.508194,5.78007,162.595566,32.285454,25.54125,12.091063,2.0,


In [111]:
grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})

Unnamed: 0_level_0,weight,weight,height,height,bmi
Unnamed: 0_level_1,mean,std,min,max,<lambda>
treatment,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
a,60.088363,17.085067,169.990751,176.470378,1289.066704
b,62.508194,5.78007,139.766303,185.42483,1450.904717


### Using `transform` with `groupby`

In [112]:
g_mean = grouped[['weight', 'height']].transform('mean')
g_mean

Unnamed: 0,weight,height
0,60.088363,173.345973
1,62.508194,162.595566
2,60.088363,173.345973
3,62.508194,162.595566
4,60.088363,173.345973


In [113]:
g_std = grouped[['weight', 'height']].transform('std')
g_std

Unnamed: 0,weight,height
0,17.085067,3.245974
1,5.78007,32.285454
2,17.085067,3.245974
3,5.78007,32.285454
4,17.085067,3.245974


In [114]:
(df[['weight', 'height']] - g_mean)/g_std

Unnamed: 0,weight,height
0,-1.003531,0.071108
1,-0.707107,0.707107
2,0.0071,0.962548
3,0.707107,-0.707107
4,0.996431,-1.033656
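
The within-group standardization can also be done in a single `transform` call by passing a function that receives each group's values. A sketch on a reduced, rounded copy of the data:

```python
import pandas as pd

df = pd.DataFrame({"treatment": list("ababa"),
                   "weight": [42.9, 58.4, 60.2, 66.6, 77.1]})

# The lambda is applied to each group; the result keeps the original row order.
z = df.groupby("treatment")["weight"].transform(lambda x: (x - x.mean()) / x.std())

print(z)
```

Because `transform` returns one value per original row, `z` can be assigned straight back as a new column.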


## Window Functions

A window function is very similar to `groupby`, except that the groups are (possibly overlapping) windows of consecutive rows and one result is returned for every row of the original data.

In [115]:
x = pd.DataFrame({'n': range(6)})

In [116]:
x

Unnamed: 0,n
0,0
1,1
2,2
3,3
4,4
5,5


In [117]:
x.rolling(window=3).sum()

Unnamed: 0,n
0,
1,
2,3.0
3,6.0
4,9.0
5,12.0


In [118]:
x.expanding().sum()

Unnamed: 0,n
0,0.0
1,1.0
2,3.0
3,6.0
4,10.0
5,15.0
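
The leading `NaN` values in the rolling sum appear because the first windows contain fewer than three rows. A minimal sketch of the `min_periods` argument, which allows partial windows:

```python
import pandas as pd

x = pd.DataFrame({"n": range(6)})

# Default: a result is produced only once the window holds 3 rows.
full = x["n"].rolling(window=3).sum()

# min_periods=1: partial windows at the start are summed as well.
partial = x["n"].rolling(window=3, min_periods=1).sum()

print(full.tolist())
print(partial.tolist())
```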


## Combining Data Frames

In [119]:
df

Unnamed: 0,pid,weight,height,date,bmi,something,treatment
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0,a
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0,b
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,,a
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,,b
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0,a


In [120]:
df1 =  df.iloc[3:].copy()

In [121]:
df1.drop('something', axis=1, inplace=True)
df1

Unnamed: 0,pid,weight,height,date,bmi,treatment
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,b
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,a


### Adding rows

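Note that `pandas` aligns by column indexes automatically. `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so prefer `pd.concat` in current code.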

In [122]:
df.append(df1, sort=False)

Unnamed: 0,pid,weight,height,date,bmi,something,treatment
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0,a
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0,b
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,,a
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,,b
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0,a
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,,b
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,,a


In [123]:
pd.concat([df, df1], sort=False)

Unnamed: 0,pid,weight,height,date,bmi,something,treatment
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0,a
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0,b
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,,a
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,,b
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0,a
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,,b
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,,a
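
Notice that the row labels of the appended rows are repeated in the result. If the original labels carry no meaning, `ignore_index=True` renumbers the rows. A sketch with two small made-up frames:

```python
import pandas as pd

top = pd.DataFrame({"pid": [649, 533], "something": [2.0, 2.0]})
bottom = pd.DataFrame({"pid": [590, 112]})  # has no 'something' column

# Row labels are carried over, so 0 and 1 each appear twice.
stacked = pd.concat([top, bottom], sort=False)

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([top, bottom], ignore_index=True, sort=False)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```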


### Adding columns

In [124]:
df.pid

0    649
1    533
2    918
3    590
4    112
Name: pid, dtype: category
Categories (5, object): [112, 533, 590, 649, 918]

In [125]:
df2 = pd.DataFrame(OrderedDict(pid=[649, 533, 400, 600], age=[23,34,45,56]))

In [126]:
df2.pid

0    649
1    533
2    400
3    600
Name: pid, dtype: int64

In [127]:
df.pid = df.pid.astype('int')

In [128]:
pd.merge(df, df2, on='pid', how='inner')

Unnamed: 0,pid,weight,height,date,bmi,something,treatment,age
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0,a,23
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0,b,34


In [129]:
pd.merge(df, df2, on='pid', how='left')

Unnamed: 0,pid,weight,height,date,bmi,something,treatment,age
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0,a,23.0
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0,b,34.0
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,,a,
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,,b,
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0,a,


In [130]:
pd.merge(df, df2, on='pid', how='right')

Unnamed: 0,pid,weight,height,date,bmi,something,treatment,age
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0,a,23
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0,b,34
2,400,,,NaT,,,,45
3,600,,,NaT,,,,56


In [131]:
pd.merge(df, df2, on='pid', how='outer')

Unnamed: 0,pid,weight,height,date,bmi,something,treatment,age
0,649,42.94297,173.576789,2018-11-11 13:33:24.006649,14.253082,2.0,a,23.0
1,533,58.421067,185.42483,2018-11-12 13:33:24.006649,16.991578,2.0,b,34.0
2,918,60.209659,176.470378,2018-11-13 13:33:24.006649,19.334037,,a,
3,590,66.59532,139.766303,2018-11-14 13:33:24.006649,34.090923,,b,
4,112,77.112459,169.990751,2018-11-15 13:33:24.006649,26.685415,3.0,a,
5,400,,,NaT,,,,45.0
6,600,,,NaT,,,,56.0
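
With outer joins it is often useful to know which table each row came from. The `indicator=True` argument of `merge` adds a `_merge` column with that information. A sketch on reduced, made-up versions of the two frames:

```python
import pandas as pd

left = pd.DataFrame({"pid": [649, 533, 918], "weight": [42.9, 58.4, 60.2]})
right = pd.DataFrame({"pid": [649, 533, 400], "age": [23, 34, 45]})

# _merge is 'both', 'left_only' or 'right_only' for each row.
merged = pd.merge(left, right, on="pid", how="outer", indicator=True)

# Map each pid to its source for easy inspection.
sources = dict(zip(merged["pid"].tolist(), merged["_merge"].astype(str)))

print(merged)
```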


### Merging on the index

In [132]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))

In [133]:
df1

Unnamed: 0,x
a,1
b,2
c,3


In [134]:
df2

Unnamed: 0,y
a,4
b,5
c,6


In [135]:
df3

Unnamed: 0,z
a,7
b,8
c,9


In [136]:
df1.join([df2, df3])

Unnamed: 0,x,y,z
a,1,4,7
b,2,5,8
c,3,6,9


## Fixing common DataFrame issues

### Multiple variables in a column

In [137]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df

Unnamed: 0,pid_treat
0,A-1
1,B-2
2,C-1
3,D-2


In [138]:
df.pid_treat.str.split('-')

0    [A, 1]
1    [B, 2]
2    [C, 1]
3    [D, 2]
Name: pid_treat, dtype: object

In [139]:
df_ = pd.DataFrame(df.pid_treat.str.split('-').apply(pd.Series))
df_.columns = ['pid', 'treat']
df_

Unnamed: 0,pid,treat
0,A,1
1,B,2
2,C,1
3,D,2
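
An alternative that avoids `apply(pd.Series)` is the `expand=True` argument of `str.split`, which returns the pieces as separate columns directly. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame(dict(pid_treat=["A-1", "B-2", "C-1", "D-2"]))

# expand=True returns a DataFrame with one column per piece.
parts = df["pid_treat"].str.split("-", expand=True)
parts.columns = ["pid", "treat"]

# The pieces are strings; convert the treatment code to an integer.
parts["treat"] = parts["treat"].astype(int)

print(parts)
```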


### Multiple values in a cell

In [140]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))
df

Unnamed: 0,pid,vals
0,a,"(1, 2, 3)"
1,b,"(4, 5, 6)"
2,c,"(7, 8, 9)"


In [141]:
df[['t1', 't2', 't3']]  = df.vals.apply(pd.Series)
df

Unnamed: 0,pid,vals,t1,t2,t3
0,a,"(1, 2, 3)",1,2,3
1,b,"(4, 5, 6)",4,5,6
2,c,"(7, 8, 9)",7,8,9


In [142]:
df.drop('vals', axis=1, inplace=True)

In [143]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)

Unnamed: 0,pid,vals
0,a,1
1,b,4
2,c,7
3,a,2
4,b,5
5,c,8
6,a,3
7,b,6
8,c,9
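
Newer versions of pandas provide `explode`, which turns each element of a list-like cell into its own row in one step. A sketch, assuming a pandas version that includes `DataFrame.explode`; note that the rows come out grouped by `pid` rather than in the order produced by `melt`:

```python
import pandas as pd

df = pd.DataFrame(dict(pid=["a", "b", "c"],
                       vals=[(1, 2, 3), (4, 5, 6), (7, 8, 9)]))

# Each tuple element becomes its own row; pid is repeated as needed.
long_df = df.explode("vals").reset_index(drop=True)

print(long_df)
```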


## Reshaping Data Frames

Sometimes we need to make rows into columns or vice versa.

In [144]:
df = pd.DataFrame(OrderedDict(
    pid = [100, 101, 102, 100, 101, 102],
    treat = np.repeat(list('AB'), 3),
    x = [1,2,3,4,5,6],
    )
)

In [145]:
df

Unnamed: 0,pid,treat,x
0,100,A,1
1,101,A,2
2,102,A,3
3,100,B,4
4,101,B,5
5,102,B,6
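
As a small sketch of what "rows into columns" means for this frame, `pivot` spreads the two treatments into separate columns, and `melt` undoes the operation:

```python
import pandas as pd

df = pd.DataFrame({"pid": [100, 101, 102, 100, 101, 102],
                   "treat": list("AAABBB"),
                   "x": [1, 2, 3, 4, 5, 6]})

# Long to wide: one row per pid, one column per treatment.
wide = df.pivot(index="pid", columns="treat", values="x")

# Wide back to long: treatment labels return to a single column.
back = wide.reset_index().melt(id_vars="pid", var_name="treat", value_name="x")

print(wide)
print(back)
```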


### Converting multiple columns into a single column

This is often useful if you need to condition on some variable.

In [146]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)

In [147]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [148]:
iris.shape

(150, 5)

In [149]:
df_iris = pd.melt(iris, id_vars='species')

In [150]:
df_iris.sample(10)

Unnamed: 0,species,variable,value
430,virginica,petal_length,6.1
48,setosa,sepal_length,5.3
162,setosa,sepal_width,3.0
409,virginica,petal_length,6.1
84,versicolor,sepal_length,5.4
263,virginica,sepal_width,2.5
594,virginica,petal_width,2.5
575,virginica,petal_width,1.8
487,setosa,petal_width,0.1
183,setosa,sepal_width,4.2


### Hierarchical indexes

In [151]:
df_iris1 = df_iris.groupby(['species', 'variable']).mean()
df_iris1

Unnamed: 0_level_0,Unnamed: 1_level_0,value
species,variable,Unnamed: 2_level_1
setosa,petal_length,1.464
setosa,petal_width,0.244
setosa,sepal_length,5.006
setosa,sepal_width,3.418
versicolor,petal_length,4.26
versicolor,petal_width,1.326
versicolor,sepal_length,5.936
versicolor,sepal_width,2.77
virginica,petal_length,5.552
virginica,petal_width,2.026
virginica,sepal_length,6.588
virginica,sepal_width,2.974


In [152]:
df_iris1.index

MultiIndex(levels=[['setosa', 'versicolor', 'virginica'], ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]],
           names=['species', 'variable'])

If hierarchical indexes are confusing, you can always reset the index to turn its levels back into ordinary columns.

In [153]:
df_iris1.reset_index()

Unnamed: 0,species,variable,value
0,setosa,petal_length,1.464
1,setosa,petal_width,0.244
2,setosa,sepal_length,5.006
3,setosa,sepal_width,3.418
4,versicolor,petal_length,4.26
5,versicolor,petal_width,1.326
6,versicolor,sepal_length,5.936
7,versicolor,sepal_width,2.77
8,virginica,petal_length,5.552
9,virginica,petal_width,2.026
10,virginica,sepal_length,6.588
11,virginica,sepal_width,2.974


### Stack and unstack

`stack` takes a level of the column index and moves it to the rows. `unstack` does the reverse, moving a level of the row index to the columns.

In [154]:
df_iris1.unstack(0)

Unnamed: 0_level_0,value,value,value
species,setosa,versicolor,virginica
variable,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
petal_length,1.464,4.26,5.552
petal_width,0.244,1.326,2.026
sepal_length,5.006,5.936,6.588
sepal_width,3.418,2.77,2.974


In [155]:
df_iris1.unstack(1)

Unnamed: 0_level_0,value,value,value,value
variable,petal_length,petal_width,sepal_length,sepal_width
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
setosa,1.464,0.244,5.006,3.418
versicolor,4.26,1.326,5.936,2.77
virginica,5.552,2.026,6.588,2.974


### Pivot tables

In [156]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [157]:
pd.pivot_table(iris, index='species', aggfunc=['mean', 'median'])

Unnamed: 0_level_0,mean,mean,mean,mean,median,median,median,median
Unnamed: 0_level_1,petal_length,petal_width,sepal_length,sepal_width,petal_length,petal_width,sepal_length,sepal_width
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
setosa,1.464,0.244,5.006,3.418,1.5,0.2,5.0,3.4
versicolor,4.26,1.326,5.936,2.77,4.35,1.3,5.9,2.8
virginica,5.552,2.026,6.588,2.974,5.55,2.0,6.5,3.0


In [158]:
pd.pivot_table(iris, columns='species'.split(), 
               aggfunc=['mean', 'median'])

Unnamed: 0_level_0,mean,mean,mean,median,median,median
species,setosa,versicolor,virginica,setosa,versicolor,virginica
petal_length,1.464,4.26,5.552,1.5,4.35,5.55
petal_width,0.244,1.326,2.026,0.2,1.3,2.0
sepal_length,5.006,5.936,6.588,5.0,5.9,6.5
sepal_width,3.418,2.77,2.974,3.4,2.8,3.0


## Chaining commands

Sometimes you see this functional style of method chaining that avoids the need for temporary intermediate variables.

In [159]:
(
    iris.
    sample(frac=0.2).
    filter(regex='s.*').
    assign(both=iris.sepal_length + iris.sepal_length).
    groupby('species').agg(['mean', 'sum']).
    pipe(lambda x: np.around(x, 1))
)

Unnamed: 0_level_0,sepal_length,sepal_length,sepal_width,sepal_width,both,both
Unnamed: 0_level_1,mean,sum,mean,sum,mean,sum
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
setosa,4.9,49.2,3.3,32.8,9.8,98.4
versicolor,6.1,54.7,2.9,26.4,12.2,109.4
virginica,6.6,72.7,3.0,32.6,13.2,145.4
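
In the chain above, `assign` refers to the outer `iris` variable, which works only because pandas aligns on the row index. Passing a function to `assign` instead makes each step depend only on the intermediate result. A sketch on a small made-up frame (the `flowers` data and the filter threshold are illustrative assumptions):

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris data.
flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
})

summary = (
    flowers
    .query("sepal_length > 5.0")
    # The lambda receives the filtered frame, not the original one.
    .assign(both=lambda d: d.sepal_length + d.sepal_width)
    .groupby("species")
    .agg(["mean", "sum"])
    .round(1)
)

print(summary)
```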
