pandas (http://pandas.pydata.org/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

This guide borrows heavily from 10 Minutes to pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html#reshaping

It's common to see pandas, numpy and matplotlib imported in the following way. We also have to specify that we would like generated images to be presented on this page. 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

There are configuration options for Jupyter to do this automatically. This is useful if your notebooks will be used for similar types of data analysis.

First let's revisit the data we gathered earlier. We created a list of lists, each of which pairs an IP address with the number of times that IP address was seen in an nginx access log file.

In [None]:
ip_count = !cat access.log | awk '{print $1}' | sort | uniq -c | sort -rn
ip_count = [line.strip() for line in ip_count]
ip_count = [line.split() for line in ip_count][:10]
ip_count
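
If the shell pipeline is not available, roughly the same list can be built in pure Python. This is only a sketch: the log lines below are made up, and 'sample_log' and 'sample_ip_count' are hypothetical names, not part of the original notebook.

```python
from collections import Counter

# Hypothetical access-log lines; only the first field (the client IP) matters here.
sample_log = [
    '64.134.25.220 - - [01/Jan/2015:00:00:01] "GET / HTTP/1.1" 200 612',
    '70.114.7.38 - - [01/Jan/2015:00:00:02] "GET / HTTP/1.1" 200 612',
    '64.134.25.220 - - [01/Jan/2015:00:00:03] "GET /about HTTP/1.1" 200 845',
]

# Count the first whitespace-separated field of each line (the awk + uniq -c step).
counts = Counter(line.split()[0] for line in sample_log)

# most_common() orders by descending count (the sort -rn step). Counts are kept
# as strings to mirror what the shell pipeline hands back.
sample_ip_count = [[str(n), ip] for ip, n in counts.most_common(10)]
print(sample_ip_count)
```

Like the shell version, this yields [count, IP] pairs in which the count is still a string.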

Now we want to take this data and have pandas be able to do something with it. We begin by creating a "DataFrame" from the 'ip_count' variable. DataFrames (DF from here on) are essentially spreadsheets that pandas can do some work on.

We can use the 'head' and 'tail' methods to get a quick peek at the first or last few rows of the DF without having to display the entire thing (especially useful if your DF is large).

In [None]:
df = pd.DataFrame(ip_count, columns=['count', 'IP'])
df.head()

In [None]:
df.tail()
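
Both methods accept the number of rows to show. Here is a small sketch using a made-up stand-in for 'ip_count' (the values and the name 'sample_df' are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the real 'ip_count' list (counts are strings, as from the shell).
sample_ip_count = [['250', '64.134.25.220'], ['150', '70.114.7.38'],
                   ['90', '70.125.133.107'], ['12', '10.0.0.1']]
sample_df = pd.DataFrame(sample_ip_count, columns=['count', 'IP'])

print(sample_df.shape)     # (rows, columns) of the full DF
print(sample_df.head(2))   # first two rows
print(sample_df.tail(1))   # last row only
```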

***Note***: These DataFrames are styled using HTML/CSS. Brandon Rhodes gave an interesting presentation at PyCon 2015 that shows how to modify IPython's core CSS to style the DF (https://github.com/brandon-rhodes/pycon-pandas-tutorial). I don't understand it well enough to explain it, so I won't be using it for this presentation.

We can have pandas tell us some information about the DF, like what type of objects it's comprised of.

In [None]:
df.dtypes

Uh-oh. We won't be able to do useful work unless pandas recognizes the 'count' column as a numeric type. 

# dtypes
from http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes

pandas understands several data types (dtypes). In this example we can see a couple of things: creating a DF from a dict, and the various dtypes pandas is aware of.

In [None]:
from pandas import Timestamp, Series
dft = pd.DataFrame(dict( A = np.random.rand(3),
                         B = 1,
                         C = 'foo',
                         D = Timestamp('20010102'),
                         E = Series([1.0]*3).astype('float32'),
                         F = False,
                         G = Series([1]*3,dtype='int8')))
dft.head()

Columns with string data are represented as the 'object' dtype (column 'C'). We'll need to coerce the strings in our 'count' column to integers to work with them further.

In [None]:
dft.dtypes

To get column 'count' to integers we can 'apply' a function to a column:

In [None]:
df['count'] = df['count'].apply(int)
df.dtypes
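
Two other ways to do the conversion, sketched on made-up data: 'astype' converts a whole column at once, and 'pd.to_numeric' with errors='coerce' replaces values it cannot parse with NaN instead of raising. The 'n/a' entry and the variable names are assumptions for illustration.

```python
import pandas as pd

# A clean column of numeric strings converts directly.
clean = pd.Series(['250', '150']).astype(int)

# Hypothetical messy data: one count is not a number.
sample_df = pd.DataFrame([['250', '64.134.25.220'], ['n/a', '70.114.7.38']],
                         columns=['count', 'IP'])

# errors='coerce' turns values that cannot be parsed into NaN instead of raising.
sample_df['count'] = pd.to_numeric(sample_df['count'], errors='coerce')
print(sample_df.dtypes)
```

Note that a column containing NaN ends up as a floating point dtype rather than an integer one.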

We can also make use of lambda functions here:

In [None]:
df['count']

In [None]:
df['count'].apply(lambda x: x**2)
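
For simple arithmetic like this, the same result can usually be had by operating on the whole column directly. A minimal sketch with made-up counts:

```python
import pandas as pd

counts = pd.Series([250, 150, 90], name='count')

squared_apply = counts.apply(lambda x: x**2)   # calls the lambda once per element
squared_vector = counts ** 2                   # same values, computed on the whole column

print((squared_apply == squared_vector).all())
```

'apply' remains the more general tool when the per-element logic cannot be written as column arithmetic.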

# Selecting

We can select columns by using '[]' after the DF:

In [None]:
df['count']

But if we try to slice the DF we get a selection of rows:

In [None]:
df[2:5]
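
Because '[]' means different things for a label and for a slice, the explicit '.iloc' (position-based) and '.loc' (label-based) indexers can make the intent clearer. A sketch on made-up data; 'sample_df' is a hypothetical name:

```python
import pandas as pd

sample_df = pd.DataFrame({'count': [250, 150, 90, 12],
                          'IP': ['64.134.25.220', '70.114.7.38',
                                 '70.125.133.107', '10.0.0.1']})

col = sample_df['count']              # a label inside [] selects a column (a Series)
rows = sample_df[1:3]                 # a slice inside [] selects rows by position
rows_iloc = sample_df.iloc[1:3]       # the same rows, stated explicitly by position
block = sample_df.loc[1:2, ['IP']]    # .loc is label-based and includes the end label
```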

Now that the values in 'count' are numeric we can use boolean operators to select data:

In [None]:
df['count'] > 100
#df[df['count'] > 100]

If we want to chain boolean checks we need to wrap each one in parens, since the '&' operator takes precedence over the '>' and '<' operators. Without the parens Python would evaluate the '&' first and then try to chain the comparisons, and pandas would complain that the "truth" of a Series is ambiguous.

In [None]:
df[(df['count'] > 100) & (df['count'] < 200)]
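
To see the difference, here is a sketch on made-up data that builds the mask with parens and then catches the error raised without them:

```python
import pandas as pd

sample_df = pd.DataFrame({'count': [250, 150, 90],
                          'IP': ['64.134.25.220', '70.114.7.38', '70.125.133.107']})

# With parens: each comparison yields a boolean Series, and '&' combines them element-wise.
mask = (sample_df['count'] > 100) & (sample_df['count'] < 200)
selected = sample_df[mask]

# Without parens: Python parses this as  count > (100 & count) < 200,
# a chained comparison that needs a single True/False from a Series.
try:
    sample_df[sample_df['count'] > 100 & sample_df['count'] < 200]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(selected)
print(error_message)
```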

# Reshaping a DataFrame

http://pandas.pydata.org/pandas-docs/stable/10min.html#reshaping

There are several ways for us to reorganize our data. Here are some examples:

In [None]:
# from http://pandas.pydata.org/pandas-docs/stable/10min.html
dates = pd.date_range('20150101',periods=3)
df = pd.DataFrame(np.random.randn(3,4),index=dates,columns=list('ABCD'))
df

The first example is to transpose the data, which swaps the columns and rows:

In [None]:
df.transpose()

pandas supports multiple indexes so we can control how we'd like our data organized. We can use the stack and unstack methods to move indexes up and down.

In this example we bring the columns 'down' using stack and each of the previous rows gets indexed under a date. Our top-level index becomes the date.

In [None]:
df_tmp = pd.DataFrame(df.stack())
df_tmp

Going the other way, we can bring the rows 'up' by using the unstack method. In this case our top-level index is the letter, with each date and that date's data repeated for each letter.

In [None]:
df_tmp = pd.DataFrame(df.unstack())
df_tmp

Unstack defaults to operating on the ***last level***. Levels are numbered from the left, starting at 0, so with a two-level index the last level is level 1. In this case the following two operations are identical:

In [None]:
df_tmp.unstack()

But we can specify which level we'd like unstacked:

In [None]:
df_tmp.unstack(1)

versus using the 0th index (***first level***):

In [None]:
df_tmp.unstack(0)
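
The level numbering is easier to follow with predictable values. This sketch uses integers instead of random numbers, and the names 'wide', 'stacked', 'by_letter' and 'by_date' are made up for illustration:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20150101', periods=3)
wide = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list('ABCD'))

stacked = wide.stack()             # Series with a two-level index: (date, letter)

by_letter = stacked.unstack()      # default: the last level (letter) becomes the columns
by_letter_1 = stacked.unstack(1)   # level 1 is that same last level
by_date = stacked.unstack(0)       # level 0 (date) becomes the columns instead

print(by_letter.shape, by_date.shape)
```

Unstacking level 0 here gives the same layout as transposing the original DF.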

# DB-like operations

http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

With relational data we can perform operations as we would with a traditional RDBMS.

In pandas we can 'join' two DFs the way we would tables from a database. In this example we'll create a second DF that relates IP addresses to domain names.

In [None]:
df2 = pd.DataFrame(dict(IP = ['64.134.25.220', '70.114.7.38', '70.125.133.107'], 
                        domain = ['example.com', 'example.net', 'example.org']))
df2

By default pandas will perform an 'inner' join and pull data which exists in both DFs. The 'shape' of the resulting DF is (3,3).

In [None]:
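df = pd.DataFrame(ip_count, columns=['count', 'IP']).astype({'count': int})  # rebuild: 'df' was reassigned in the reshaping section
df.merge(df2)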

But if we perform an 'outer' join, the missing data will be filled in with NaNs:

In [None]:
df.merge(df2, how='outer')
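
The two join types can be compared side by side on a small, self-contained example. The counts and the extra '10.0.0.1' address are made up; naming the key with on='IP' is optional here but makes the join column explicit.

```python
import pandas as pd

counts = pd.DataFrame({'count': [250, 150, 90, 12],
                       'IP': ['64.134.25.220', '70.114.7.38',
                              '70.125.133.107', '10.0.0.1']})
domains = pd.DataFrame({'IP': ['64.134.25.220', '70.114.7.38', '70.125.133.107'],
                        'domain': ['example.com', 'example.net', 'example.org']})

inner = counts.merge(domains, on='IP')               # only IPs present in both DFs
outer = counts.merge(domains, on='IP', how='outer')  # every IP; a missing domain becomes NaN

print(inner.shape, outer.shape)
```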

# Timeseries

http://pandas.pydata.org/pandas-docs/stable/10min.html#time-series

> pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. 

In this first example, try changing the freq to another 'Offset Alias'; there are [many](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) to choose from. You can also use a multiple of an alias, such as ***5S***.

In [None]:
rng = pd.date_range('1/1/2015', periods=100, freq='S')
rng

In [None]:
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.head(5)

In [None]:
ts.head(5).sum()

In [None]:
ts.resample('5S').sum()  # the old how='sum' argument is no longer accepted by current pandas
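
To check what resampling does, here is a sketch with predictable values and a different offset alias ('min' for minutes, an assumption chosen for illustration). The first bin should equal the sum of the first five samples.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2015', periods=20, freq='min')   # one sample per minute
ts = pd.Series(np.arange(20), index=rng)

five_min = ts.resample('5min').sum()   # four 5-minute bins, each summing five samples

print(five_min)
print(five_min.iloc[0] == ts.head(5).sum())
```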

We also have a lot of flexibility when selecting from a timeseries:

In [None]:
rng = pd.date_range('1/1/2015', periods=100, freq='D')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.head()

Here we select all the days in Feb:

In [None]:
ts['2015-02']

But we can also select the 13th of each month within the timeseries like this:

In [None]:
ts[ts.index.day == 13]
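
Both selections can be verified on a series with known values. In this sketch the data is simply 0 to 99, and '.loc' is used for the month selection as an explicit alternative to plain '[]':

```python
import pandas as pd

rng = pd.date_range('1/1/2015', periods=100, freq='D')
ts = pd.Series(range(100), index=rng)

feb = ts.loc['2015-02']                # partial-string indexing: every day in February
thirteenths = ts[ts.index.day == 13]   # boolean mask built from the index's day attribute

print(len(feb), len(thirteenths))
```

The 100 days run from January 1 to April 10, so February contributes 28 rows and there are three 13ths (January, February and March).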