pandas (http://pandas.pydata.org/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

This guide borrows heavily from 10 Minutes to pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html#reshaping

It's common to see pandas, numpy and matplotlib imported in the following way. We also have to specify that we would like generated images to be presented on this page. 

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Jupyter can be configured to do this automatically, which is useful if your notebooks will be used for similar types of data analysis.

First let's revisit the data we gathered earlier. We created a list of lists which pair up an IP address and how many times that IP address was seen in an nginx access log file.

In [32]:
ip_count = !cat access.log | awk '{print $1}' | sort | uniq -c | sort -rn
ip_count = [line.strip() for line in ip_count]
ip_count = [line.split() for line in ip_count][:10]
ip_count

[['206', '64.134.25.220'],
 ['138', '70.114.7.38'],
 ['115', '70.125.133.107'],
 ['109', '61.219.149.7'],
 ['93', '70.114.8.49'],
 ['80', '24.153.162.178'],
 ['50', '72.32.146.52'],
 ['47', '72.3.128.84'],
 ['46', '50.56.228.100'],
 ['46', '38.103.208.94']]
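
The same tally can also be sketched in plain Python, which may help if the shell pipeline is unfamiliar. This is a minimal sketch, assuming the IP address is the first whitespace-separated field of each log line (which is what the awk step above relies on). The 'sample_lines' list is made-up stand-in data, not the contents of the real access.log, and 'ip_count_py' is a hypothetical name.

```python
from collections import Counter

# Made-up stand-in for the lines of access.log
sample_lines = [
    '64.134.25.220 - - [01/Jan/2015:00:00:01 +0000] "GET / HTTP/1.1" 200 612',
    '70.114.7.38 - - [01/Jan/2015:00:00:02 +0000] "GET /a HTTP/1.1" 200 100',
    '64.134.25.220 - - [01/Jan/2015:00:00:03 +0000] "GET /b HTTP/1.1" 404 50',
]

# Count the first field of every non-empty line (what awk '{print $1}' extracts)
counts = Counter(line.split()[0] for line in sample_lines if line.strip())

# Match the shape of ip_count: [count-as-string, ip], most frequent first
ip_count_py = [[str(n), ip] for ip, n in counts.most_common(10)]
print(ip_count_py)
```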

Now we want to take this data and have pandas be able to do something with it. We begin by creating a "DataFrame" from the 'ip_count' variable. DataFrames (DF from here on) are essentially spreadsheets that pandas can do some work on.

We can use the 'head' and 'tail' methods to get a quick peek at the first or last rows of the DF without printing the entire thing (especially useful if your DF is large).

In [33]:
df = pd.DataFrame(ip_count, columns=['count', 'IP'])
df.head()

   count              IP
0    206   64.134.25.220
1    138     70.114.7.38
2    115  70.125.133.107
3    109    61.219.149.7
4     93     70.114.8.49


In [34]:
df.tail()

   count              IP
5     80  24.153.162.178
6     50    72.32.146.52
7     47     72.3.128.84
8     46   50.56.228.100
9     46   38.103.208.94
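
Both methods take an optional row count (the default is 5). A small sketch, using a cut-down copy of the data above under the hypothetical name 'df_demo':

```python
import pandas as pd

# Cut-down copy of the ip_count data, for illustration only
df_demo = pd.DataFrame(
    [['206', '64.134.25.220'], ['138', '70.114.7.38'],
     ['115', '70.125.133.107'], ['109', '61.219.149.7']],
    columns=['count', 'IP'])

first_two = df_demo.head(2)   # first 2 rows
last_one = df_demo.tail(1)    # last row only
print(first_two)
print(last_one)
```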


***Note***: these DataFrames are styled using HTML/CSS. Brandon Rhodes gave an interesting presentation at PyCon 2015 which shows how to modify IPython's core CSS to style the DF: https://github.com/brandon-rhodes/pycon-pandas-tutorial I don't understand it well enough to explain it, so I won't be using it for this presentation.

We can have pandas tell us some information about the DF, like what type of objects it's comprised of.

In [35]:
df.dtypes

count    object
IP       object
dtype: object

Uh-oh. We won't be able to do useful work unless pandas recognizes the 'count' column as a numeric type. 

# dtypes
from http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes

pandas understands several data types (dtypes). In this example we can see a couple of things: creating a DF from a dict, and the various dtypes pandas is aware of.

In [36]:
from pandas import Timestamp, Series
dft = pd.DataFrame(dict( A = np.random.rand(3),
                         B = 1,
                         C = 'foo',
                         D = Timestamp('20010102'),
                         E = Series([1.0]*3).astype('float32'),
                         F = False,
                         G = Series([1]*3,dtype='int8')))
dft.head()

          A  B    C           D  E      F  G
0  0.644437  1  foo  2001-01-02  1  False  1
1  0.505068  1  foo  2001-01-02  1  False  1
2  0.635613  1  foo  2001-01-02  1  False  1


Columns with string data are represented as the 'object' dtype (column 'C'). We'll need to coerce the 'count' values in our DF to integers to work with them further.

In [37]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

To convert the 'count' column to integers we can 'apply' a function to it:

In [38]:
df['count'] = df['count'].apply(int)
df.dtypes

count     int64
IP       object
dtype: object
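
'apply(int)' calls Python's 'int' once per value. pandas also has column-level conversions that should give the same result here. The sketch below shows 'astype' and 'pd.to_numeric' on a small stand-in Series (the name 'counts' is made up for this example).

```python
import pandas as pd

# Stand-in for the string-valued 'count' column
counts = pd.Series(['206', '138', '115'], name='count')

as_int = counts.astype(int)      # convert the whole column in one call
as_num = pd.to_numeric(counts)   # infers a numeric dtype from the strings

print(as_int.dtype, as_num.dtype)
```

'pd.to_numeric' also accepts errors='coerce', which turns values it cannot parse into NaN instead of raising an error.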

We can also make use of lambda functions here:

In [39]:
df['count']

0    206
1    138
2    115
3    109
4     93
5     80
6     50
7     47
8     46
9     46
Name: count, dtype: int64

In [40]:
df['count'].apply(lambda x: x**2)

0    42436
1    19044
2    13225
3    11881
4     8649
5     6400
6     2500
7     2209
8     2116
9     2116
Name: count, dtype: int64
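
For simple arithmetic like this, operators can also be applied to the column directly, without 'apply' or a lambda. A short sketch on stand-in data (the variable names are made up for the example):

```python
import pandas as pd

counts = pd.Series([206, 138, 115], name='count')  # stand-in data

squared_apply = counts.apply(lambda x: x**2)  # one Python call per element
squared_vec = counts ** 2                     # whole column at once

print(squared_vec.tolist())
```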

# Selecting

We can select columns by using '[]' after the DF:

In [41]:
df['count']

0    206
1    138
2    115
3    109
4     93
5     80
6     50
7     47
8     46
9     46
Name: count, dtype: int64

But if we try to slice the DF we get a selection of rows:

In [42]:
df[2:5]

   count              IP
2    115  70.125.133.107
3    109    61.219.149.7
4     93     70.114.8.49
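
Because '[]' means columns when given a label and rows when given a slice, it can be clearer to say which one is meant. A sketch using 'iloc' (by position) and 'loc' (by label) on stand-in data; 'df_demo' is a hypothetical name:

```python
import pandas as pd

df_demo = pd.DataFrame({'count': [206, 138, 115, 109, 93],
                        'IP': ['64.134.25.220', '70.114.7.38', '70.125.133.107',
                               '61.219.149.7', '70.114.8.49']})

by_position = df_demo.iloc[2:5]          # rows at positions 2, 3, 4 (end excluded)
by_label = df_demo.loc[2:4, ['count']]   # labels 2 through 4 (end included), one column

print(by_position)
print(by_label)
```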


Now that the values in 'count' are numeric we can use comparison operators to build a boolean mask and select data:

In [44]:
df['count'] > 100
#df[df['count'] > 100]

0     True
1     True
2     True
3     True
4    False
5    False
6    False
7    False
8    False
9    False
Name: count, dtype: bool
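
The result is a boolean Series (a 'mask'). Passing it back into '[]' keeps only the rows where it is True, which is what the commented-out line in the cell above does. A sketch on stand-in data ('df_demo', 'mask' and 'busy' are made-up names):

```python
import pandas as pd

df_demo = pd.DataFrame({'count': [206, 138, 93, 80],
                        'IP': ['64.134.25.220', '70.114.7.38',
                               '70.114.8.49', '24.153.162.178']})

mask = df_demo['count'] > 100        # boolean Series, one value per row
busy = df_demo[mask]                 # rows where mask is True
busy_ips = df_demo.loc[mask, 'IP']   # same rows, just the IP column

print(busy)
```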

If we want to combine boolean checks we need to wrap each one in parens, since the '&' operator takes precedence over the '>' and '<' operators. Without the parens Python would evaluate '100 & df['count']' first and then attempt a chained comparison, and pandas would complain that the truth value of a Series is ambiguous.

In [13]:
df[(df['count'] > 100) & (df['count'] < 200)]

   count              IP
1    138     70.114.7.38
2    115  70.125.133.107
3    109    61.219.149.7
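
To see the precedence problem without breaking the notebook, the unparenthesized form can be wrapped in a try/except. This is a sketch on a stand-in Series; the exact wording of the error message may differ between pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

counts = pd.Series([206, 138, 115, 93], name='count')  # stand-in data

# Without parens Python reads this as: counts > (100 & counts) < 200,
# a chained comparison that needs a single True/False from a Series.
try:
    counts[counts > 100 & counts < 200]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# With parens each comparison yields a mask, and '&' combines them row by row
selected = counts[(counts > 100) & (counts < 200)]

print(error_message)
print(selected.tolist())
```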


# Reshaping a DataFrame

http://pandas.pydata.org/pandas-docs/stable/10min.html#reshaping

There are several ways for us to reorganize our data. Here are some examples:

In [48]:
# from http://pandas.pydata.org/pandas-docs/stable/10min.html
dates = pd.date_range('20150101',periods=3)
df = pd.DataFrame(np.random.randn(3,4),index=dates,columns=list('ABCD'))
df

                   A         B         C         D
2015-01-01 -1.529098 -0.768819 -0.621662 -0.091351
2015-01-02 -0.220833  0.397869 -0.639245  1.210969
2015-01-03 -0.071623 -1.196707  1.110405  0.026047


The first example is to transpose the data, which swaps the columns and rows:

In [49]:
df.transpose()

   2015-01-01 00:00:00  2015-01-02 00:00:00  2015-01-03 00:00:00
A            -1.529098            -0.220833            -0.071623
B            -0.768819             0.397869            -1.196707
C            -0.621662            -0.639245             1.110405
D            -0.091351             1.210969             0.026047


pandas supports hierarchical (multi-level) indexes, so we can control how we'd like our data organized. We can use the 'stack' and 'unstack' methods to move index levels down into the rows and up into the columns.

In this example we bring the columns 'down' using stack and each of the previous rows gets indexed under a date. Our top-level index becomes the date.

In [53]:
df_tmp = pd.DataFrame(df.stack())
df_tmp

                     0
2015-01-01 A -1.529098
           B -0.768819
           C -0.621662
           D -0.091351
2015-01-02 A -0.220833
           B  0.397869
           C -0.639245
           D  1.210969
2015-01-03 A -0.071623
           B -1.196707
           C  1.110405
           D  0.026047


Going the other way, we can bring the rows 'up' by using the unstack method. In this case our top-level index is the letter, with each date and its data repeated for each letter.

In [57]:
df_tmp = pd.DataFrame(df.unstack())
df_tmp

                     0
A 2015-01-01 -1.529098
  2015-01-02 -0.220833
  2015-01-03 -0.071623
B 2015-01-01 -0.768819
  2015-01-02  0.397869
  2015-01-03 -1.196707
C 2015-01-01 -0.621662
  2015-01-02 -0.639245
  2015-01-03  1.110405
D 2015-01-01 -0.091351
  2015-01-02  1.210969
  2015-01-03  0.026047


'unstack' defaults to operating on the ***last level***. Levels are numbered from the left, starting at 0, and our index has two levels (0 and 1). So in this case the following two operations are identical:

In [60]:
df_tmp.unstack()

           0
  2015-01-01 2015-01-02 2015-01-03
A  -1.529098  -0.220833  -0.071623
B  -0.768819   0.397869  -1.196707
C  -0.621662  -0.639245   1.110405
D  -0.091351   1.210969   0.026047


Here we name the last level (1) explicitly and get the same result:

In [65]:
df_tmp.unstack(1)

           0
  2015-01-01 2015-01-02 2015-01-03
A  -1.529098  -0.220833  -0.071623
B  -0.768819   0.397869  -1.196707
C  -0.621662  -0.639245   1.110405
D  -0.091351   1.210969   0.026047


versus unstacking level 0 (the ***first level***):

In [67]:
df_tmp.unstack(0)

                   0
                   A         B         C         D
2015-01-01 -1.529098 -0.768819 -0.621662 -0.091351
2015-01-02 -0.220833  0.397869 -0.639245  1.210969
2015-01-03 -0.071623 -1.196707  1.110405  0.026047
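
A tiny frame with fixed values makes the level movement easier to check by eye. This sketch works on the stacked Series directly rather than wrapping it in a DataFrame as the cells above do; 'df_demo' and the other names are made up for the example.

```python
import pandas as pd

# Small fixed frame so the result is easy to verify
df_demo = pd.DataFrame([[1, 2], [3, 4]],
                       index=['r1', 'r2'], columns=['A', 'B'])

stacked = df_demo.stack()      # Series indexed by (row label, column label)
restored = stacked.unstack()   # last level (the column labels) moves back up
flipped = stacked.unstack(0)   # first level (the row labels) moves up instead

print(stacked)
print(restored)
print(flipped)
```

Unstacking the last level undoes the stack, while unstacking level 0 gives the transpose of the original frame.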


# DB-like operations

http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

With relational data we can perform operations as we would with a traditional RDBMS.

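In pandas we can 'join' two DFs the way we would tables from a database. In this example we'll create a second DF that relates IP addresses to domain names. Note that 'df' below is the IP-count DF from the first part of this guide; the reshaping examples reassigned 'df', so re-create it first if you are running the cells in order.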

In [22]:
df2 = pd.DataFrame(dict(IP = ['64.134.25.220', '70.114.7.38', '70.125.133.107'], 
                        domain = ['example.com', 'example.net', 'example.org']))
df2

               IP       domain
0   64.134.25.220  example.com
1     70.114.7.38  example.net
2  70.125.133.107  example.org


By default pandas will perform an 'inner' join and keep only the rows whose key exists in both DFs. The 'shape' of the resulting DF is (3,3): three rows and three columns.

In [27]:
df.merge(df2)

   count              IP       domain
0    206   64.134.25.220  example.com
1    138     70.114.7.38  example.net
2    115  70.125.133.107  example.org
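
With no arguments, 'merge' joins on the columns the two DFs have in common, here 'IP'. Naming the key with 'on' makes that explicit. A sketch with small stand-in frames ('hits' and 'names' are made-up names):

```python
import pandas as pd

hits = pd.DataFrame({'count': [206, 138, 109],
                     'IP': ['64.134.25.220', '70.114.7.38', '61.219.149.7']})
names = pd.DataFrame({'IP': ['64.134.25.220', '70.114.7.38'],
                      'domain': ['example.com', 'example.net']})

inner = hits.merge(names, on='IP')   # how='inner' is the default
print(inner)
print(inner.shape)                   # (rows, columns)
```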


But if we perform an 'outer' join, rows from either DF are kept and the missing data is filled in with NaNs:

In [34]:
df.merge(df2, how='outer')

   count              IP       domain
0    206   64.134.25.220  example.com
1    138     70.114.7.38  example.net
2    115  70.125.133.107  example.org
3    109    61.219.149.7          NaN
4     93     70.114.8.49          NaN
5     80  24.153.162.178          NaN
6     50    72.32.146.52          NaN
7     47     72.3.128.84          NaN
8     46   50.56.228.100          NaN
9     46   38.103.208.94          NaN
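
When the goal is to keep every row of one DF, a 'left' join states that directly, and 'indicator=True' adds a '_merge' column recording where each row came from. A sketch on stand-in frames ('hits' and 'names' are made-up names); the row order of an outer join may vary between pandas versions, so nothing here depends on it.

```python
import pandas as pd

hits = pd.DataFrame({'count': [206, 138, 109],
                     'IP': ['64.134.25.220', '70.114.7.38', '61.219.149.7']})
names = pd.DataFrame({'IP': ['64.134.25.220', '70.114.7.38'],
                      'domain': ['example.com', 'example.net']})

left = hits.merge(names, on='IP', how='left')                     # keep every row of hits
outer = hits.merge(names, on='IP', how='outer', indicator=True)   # adds a '_merge' column

missing = int(left['domain'].isna().sum())   # rows with no matching domain
print(outer)
print(missing)
```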


# Timeseries

http://pandas.pydata.org/pandas-docs/stable/10min.html#time-series

> pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. 

In this first example, try changing the freq to another 'Offset Alias'; there are [many](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) to choose from. You can also prefix an alias with a multiple, such as ***5S***.

In [130]:
rng = pd.date_range('1/1/2015', periods=100, freq='S')
rng

<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01 00:00:00, ..., 2015-01-01 00:01:39]
Length: 100, Freq: S, Timezone: None

In [134]:
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.head(5)

2015-01-01 00:00:00    410
2015-01-01 00:00:01    274
2015-01-01 00:00:02    233
2015-01-01 00:00:03     27
2015-01-01 00:00:04     37
Freq: S, dtype: int64

In [135]:
ts.head(5).sum()

981

In [136]:
# newer pandas releases dropped the how= argument; call the aggregation as a method
ts.resample('5S').sum()

2015-01-01 00:00:00     981
2015-01-01 00:00:05    1339
2015-01-01 00:00:10    1292
2015-01-01 00:00:15     932
2015-01-01 00:00:20    1153
2015-01-01 00:00:25     775
2015-01-01 00:00:30    1510
2015-01-01 00:00:35     936
2015-01-01 00:00:40    1605
2015-01-01 00:00:45    1581
2015-01-01 00:00:50     965
2015-01-01 00:00:55    1360
2015-01-01 00:01:00     443
2015-01-01 00:01:05    1111
2015-01-01 00:01:10    1400
2015-01-01 00:01:15    1542
2015-01-01 00:01:20    1158
2015-01-01 00:01:25    1024
2015-01-01 00:01:30    1064
2015-01-01 00:01:35    1051
Freq: 5S, dtype: int64
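
Resampling groups the values into fixed-width time bins and then aggregates each bin. In current pandas releases the aggregation is written as a method on the result of 'resample', for example '.sum()' or '.mean()'. The sketch below uses fixed data so the bins can be checked by hand. It spells the seconds alias in lowercase ('s'), which is an assumption about the installed pandas version; older releases document it as 'S'.

```python
import pandas as pd

# Ten seconds of fixed data: 0, 1, ..., 9
rng_demo = pd.date_range('2015-01-01', periods=10, freq='s')
ts_demo = pd.Series(range(10), index=rng_demo)

sums = ts_demo.resample('5s').sum()     # 0+1+2+3+4 and 5+6+7+8+9
means = ts_demo.resample('5s').mean()   # average of each 5-second bin

print(sums.tolist())
print(means.tolist())
```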

We also have a lot of flexibility when selecting from a timeseries:

In [140]:
rng = pd.date_range('1/1/2015', periods=100, freq='D')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.head()

2015-01-01    110
2015-01-02    427
2015-01-03    424
2015-01-04    372
2015-01-05    392
Freq: D, dtype: int64

Here we select all the days in Feb:

In [151]:
ts['2015-02']

2015-02-01    278
2015-02-02    344
2015-02-03    414
2015-02-04    136
2015-02-05    409
2015-02-06     17
2015-02-07    196
2015-02-08    249
2015-02-09    354
2015-02-10    256
2015-02-11    302
2015-02-12    469
2015-02-13     53
2015-02-14     44
2015-02-15    425
2015-02-16    240
2015-02-17    343
2015-02-18    262
2015-02-19    440
2015-02-20    247
2015-02-21    436
2015-02-22    357
2015-02-23    293
2015-02-24     38
2015-02-25    323
2015-02-26    391
2015-02-27    348
2015-02-28    277
Freq: D, dtype: int64
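
The same partial-date strings also work with 'loc', including as slice endpoints. A sketch on a stand-in series with fixed values ('ts_demo' is a made-up name); note that both ends of a date slice are included.

```python
import pandas as pd

rng_demo = pd.date_range('2015-01-01', periods=100, freq='D')
ts_demo = pd.Series(range(100), index=rng_demo)   # fixed stand-in values

february = ts_demo.loc['2015-02']                    # every day in February
few_days = ts_demo.loc['2015-02-10':'2015-02-12']    # both endpoints included

print(len(february))
print(few_days)
```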

But we can also select the 13th of each month within the timeseries like this:

In [149]:
ts[ts.index.day == 13]

2015-01-13    253
2015-02-13     53
2015-03-13    437
dtype: int64
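
Other attributes of the index can be used in the same way. A sketch, again on a stand-in series with fixed values, that selects every Monday and every day in March:

```python
import pandas as pd

rng_demo = pd.date_range('2015-01-01', periods=100, freq='D')
ts_demo = pd.Series(range(100), index=rng_demo)   # fixed stand-in values

mondays = ts_demo[ts_demo.index.dayofweek == 0]   # Monday is 0, Sunday is 6
march = ts_demo[ts_demo.index.month == 3]

print(len(mondays), len(march))
```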