# Pandas - Data Analysis Library for Python - Part 1 Basics

Original by Wes McKinney 
        
Modified by Clayton Miller (miller.clayton@arch.ethz.ch)

`Pandas` is one of the most important libraries available to data analysts. It is especially valuable for processing time-series output data.

In [None]:
%matplotlib inline

In [None]:
from IPython.core.display import Image
Image(filename='pandaslogo.jpg')

## Pandas video tutorial
Wes McKinney created the Pandas library, and this notebook was used in a live tutorial that can be found on YouTube:

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("w26x-z-BdWQ")

In [None]:
from IPython.core.display import Image
Image(filename='pandasbook.jpg')

The first step in any analysis is to load the necessary libraries -- in this case `pandas` and `numpy`.

In [None]:
from pandas import *
import pandas
import numpy as np



# plt.rc('figure', figsize=(10, 6))
# pandas.set_option('notebook_repr_html',False)

Pandas Series Object: 1 dimensional data container
======

A `Series` is a data container for vectors. It carries an index, so values can be looked up by label as well as by position.

In [None]:
s = Series(np.random.randn(5))
s

In [None]:
labels = ['a', 'b', 'c', 'd', 'e']
s = Series(np.random.randn(5), index = labels)
s

In [None]:
'b' in s

In [None]:
s['b']

In [None]:
s

In [None]:
mapping = s.to_dict()
mapping

In [None]:
s = Series(mapping)
s

In [None]:
s[:3]

In [None]:
s.index

DataFrame: 2D collection of Series
==================================

In [None]:
df = DataFrame({'a': np.random.randn(6),
                'b': ['foo', 'bar'] * 3,
                'c': np.random.randn(6)})
df.info()

In [None]:
df.index

In [None]:
df

In [None]:
df = DataFrame({'a': np.random.randn(6),
                'b': ['foo', 'bar'] * 3,
                'c': np.random.randn(6)},
               index = date_range('1/1/2000', periods=6))
df

In [None]:
df = DataFrame({'a': np.random.randn(6),
                'b': ['foo', 'bar'] * 3,
                'c': np.random.randn(6)},
               columns=['a', 'b', 'c', 'd'])
df
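
The `columns` argument both orders the columns and selects them. A name with no matching key in the dict, like `'d'` above, yields a column filled with NaN. The sketch below shows this on a tiny frame; the variable name `frame` and the values are illustrative only, and `pandas` is imported as `pd` rather than with a star import.

```python
import pandas as pd

# 'd' is requested as a column but has no data behind it.
frame = pd.DataFrame({'a': [1.0, 2.0], 'b': ['foo', 'bar']},
                     columns=['a', 'b', 'd'])

print(frame)
print(frame['d'].isna().all())  # every entry of 'd' is missing
```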

Creation from nested dicts
--------------------------

These arise naturally in Python code. The outer keys become columns and the inner keys become row labels.

In [None]:
data = {}
for col in ['foo', 'bar', 'baz']:
    for row in ['a', 'b', 'c', 'd']:
        data.setdefault(col, {})[row] = np.random.randn()
data

In [None]:
DataFrame(data)
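
When the inner dicts do not share the same keys, the row index is the union of all inner keys and the gaps are filled with NaN. A minimal sketch, using made-up values and the illustrative names `nested` and `frame`:

```python
import pandas as pd

# Outer keys -> columns, inner keys -> row labels.
nested = {'foo': {'a': 1.0, 'b': 2.0},
          'bar': {'b': 3.0, 'c': 4.0}}
frame = pd.DataFrame(nested)

print(frame)
# 'bar' has no entry for row 'a', so that cell is NaN.
print(pd.isna(frame.loc['a', 'bar']))
```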

Data alignment
==============

In [None]:
close_px = read_csv('stock_data.csv', index_col=0, parse_dates=True)

In [None]:
close_px

In [None]:
s1 = close_px['AAPL'][-20:]
s2 = close_px['AAPL'][-25:-10]
s1

In [None]:
s2

In [None]:
s1 + s2
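
Arithmetic between two Series matches values by index label, not by position, and labels present in only one operand produce NaN. The sketch below reproduces the idea with synthetic data instead of `stock_data.csv`; the names `left`, `right`, and `total` are illustrative.

```python
import pandas as pd

idx = pd.date_range('2000-01-01', periods=5)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

left = x[:4]    # first four days
right = x[2:]   # last three days

# The result covers the union of both indexes; only the two
# overlapping days (Jan 3 and Jan 4) have a value.
total = left + right
print(total)
```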

In [None]:
df = close_px.iloc[-10:, :3]
df

In [None]:
b, c  = s1.align(s2, join='inner')
b

In [None]:
c

In [None]:
b, c  = s1.align(s2, join='outer')
b

In [None]:
b, c  = s1.align(s2, join='right')
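
`align` returns both objects reindexed to a common index, and `join` decides which labels survive. A small sketch with string labels (the Series `left` and `right` are invented for illustration) makes the four modes easy to compare:

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

labels = {}
for how in ['inner', 'outer', 'left', 'right']:
    l, r = left.align(right, join=how)
    # Both returned Series always share the same index.
    labels[how] = list(l.index)
    print(how, labels[how])
```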


In [None]:
# .ix was removed from pandas: select columns by label, then rows by position
df = close_px[['AAPL', 'IBM', 'MSFT']].iloc[-10:]
df

In [None]:
# every second row, two of the three columns
df2 = df[['IBM', 'MSFT']].iloc[::2]
df2

In [None]:
df + df2

In [None]:
b, c = df.align(df2, join='inner')

## Truncation - clipping a datetime indexed object

In [None]:
df

In [None]:
df.truncate(before='2011-10-05')

In [None]:
df.truncate(before='2011-10-05',after='2011-10-12')
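
Both bounds of `truncate` are inclusive. The following sketch checks this on a synthetic daily index (the frame and its column `x` are assumptions for illustration, not the stock data):

```python
import pandas as pd

idx = pd.date_range('2011-10-01', periods=14)
frame = pd.DataFrame({'x': range(14)}, index=idx)

clipped = frame.truncate(before='2011-10-05', after='2011-10-12')

# Rows for Oct 5 through Oct 12 are kept, endpoints included.
print(clipped.index[0], clipped.index[-1], len(clipped))
```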

## Resampling - Useful time series aggregation

In [None]:
df

In [None]:
df.resample('M').mean()

In [None]:
df.resample('5D').mean()

In [None]:
df.resample('5D').max()
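
`resample` groups rows into fixed time bins, and the method called afterwards decides how each bin is reduced. A minimal sketch on ten days of synthetic data (the Series `s` is illustrative):

```python
import pandas as pd

idx = pd.date_range('2000-01-01', periods=10)
s = pd.Series(range(10), index=idx, dtype=float)

# Two bins of five days each: values 0..4 and 5..9.
five_day_mean = s.resample('5D').mean()
five_day_max = s.resample('5D').max()

print(five_day_mean)
print(five_day_max)
```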

## Missing Data - Filling in the Gaps

In [None]:
dfgaps = df.resample('D').mean()
dfgaps

In [None]:
dfgaps.dropna()

In [None]:
# back-fill: each gap takes the next valid value (replaces fillna(method='bfill'))
dfgaps.bfill()
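
Upsampling to a daily frequency creates empty bins for days without data, and those bins can be dropped, back-filled, or forward-filled. A sketch with a three-point Series that skips one day (names and values are illustrative):

```python
import pandas as pd

idx = pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-04'])
s = pd.Series([1.0, 3.0, 4.0], index=idx)

daily = s.resample('D').mean()   # 2000-01-02 has no data, so it is NaN
dropped = daily.dropna()         # remove the gap
back = daily.bfill()             # fill the gap from the next valid value
forward = daily.ffill()          # fill the gap from the previous valid value

print(daily)
print(back)
print(forward)
```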

Transposing: no copy if all columns are same type
-------------------------------------------------

In [None]:
df[:5].T

Columns can be any type
-----------------------

In [None]:
n = 10
foo = DataFrame(index=range(n))
foo['floats'] = np.random.randn(n)
foo['ints'] = np.arange(n)
foo['strings'] = ['foo', 'bar'] * (n // 2)  # integer division: list repetition needs an int
foo['bools'] = foo['floats'] > 0
foo['objects'] = date_range('1/1/2000', periods=n)
foo

In [None]:
foo.dtypes

N.B. Transposing does not round-trip in this case. A DataFrame is column-oriented, so transposing mixed column types upcasts everything to `object` and the original dtypes are lost.

In [None]:
foo.T.T

In [None]:
foo.T.T.dtypes
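
The dtype loss is easy to see on a two-column frame. The sketch below also calls `infer_objects()`, which attempts a soft conversion of object columns back to a more specific dtype; treat it as one possible recovery route under these assumptions, not a general fix. The frame `mixed` is illustrative.

```python
import pandas as pd

mixed = pd.DataFrame({'floats': [0.5, 1.5], 'strings': ['foo', 'bar']})

round_trip = mixed.T.T              # values survive, dtypes do not
recovered = round_trip.infer_objects()

print(mixed.dtypes)
print(round_trip.dtypes)
print(recovered.dtypes)
```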

## Function application

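You can apply arbitrary functions to the rows or columns of a DataFrame. By default the function receives one column at a time; with `axis=1` it receives one row at a time.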

In [None]:
df.apply(np.mean)

In [None]:
df.apply(np.mean, axis=1)
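
The effect of `axis` is clearest on a frame small enough to check by hand. In this sketch the frame and the names `col_means` and `row_means` are illustrative:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, 2.0, 3.0],
                      'y': [10.0, 20.0, 30.0]})

col_means = frame.apply(np.mean)           # one result per column
row_means = frame.apply(np.mean, axis=1)   # one result per row

print(col_means)
print(row_means)
```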

You can get as fancy as you want

In [None]:
close_px

In [None]:
df.apply(lambda x: x.max() - x.min())  # range of each column, same idea as np.ptp

In [None]:
np.log(close_px)

## Plotting

Series and DataFrame have basic plotting integration with matplotlib.

In [None]:
close_px[['AAPL', 'IBM', 'MSFT', 'XOM']].plot();

Hierarchical indexing
---------------------

In [None]:
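# `codes` (called `labels` in old pandas releases) gives, for each row,
# the position of its value within the corresponding level
index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                           ['one', 'two', 'three']],
                   codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                          [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]])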
hdf = DataFrame(np.random.randn(10, 3), index=index,
                columns=['A', 'B', 'C'])
hdf

In [None]:
hdf.loc['foo']

In [None]:
hdf.loc['foo'] = 0
hdf

In [None]:
hdf.loc['foo', 'three']
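
Selecting with only the outer label returns a sub-frame indexed by the inner level, while a full `(outer, inner)` tuple returns a single row. A small sketch built with `MultiIndex.from_product` (the frame and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']])
frame = pd.DataFrame(np.arange(8.0).reshape(4, 2),
                     index=idx, columns=['A', 'B'])

outer = frame.loc['foo']          # sub-frame, indexed by 'one' / 'two'
row = frame.loc[('bar', 'two')]   # single row as a Series

print(outer)
print(row)
```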

Stacking and unstacking
-----------------------

In [None]:
# zip returns an iterator in Python 3, so materialize it as a list
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = MultiIndex.from_tuples(tuples)
columns = MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'),
                                  ('B', 'cat'), ('A', 'dog')])
df = DataFrame(np.random.randn(8, 4), index=index, columns=columns)
df

In [None]:
df2 = df.iloc[[0, 1, 2, 4, 5, 7]]
df2

In [None]:
df.unstack()['B']
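
`unstack` moves the innermost row level into the columns, and `stack` moves the innermost column level back into the rows. A minimal sketch on a four-element Series (names and values are illustrative):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

wide = s.unstack()    # rows: bar/foo, columns: one/two
long = wide.stack()   # back to a Series with a two-level index

print(wide)
print(long)
```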

## GroupBy


In [None]:
df = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                       'foo', 'bar', 'foo', 'foo'],
                'B' : ['one', 'one', 'two', 'three',
                       'two', 'two', 'one', 'three'],
                'C' : np.random.randn(8),
                'D' : np.random.randn(8)})
df

In [None]:
for key, group in df.groupby('A'):
    print(key)
    print(group)

In [None]:
df.groupby('A')['C'].describe().T

In [None]:
# select the numeric columns first; the string column 'B' has no mean
df.groupby('A')[['C', 'D']].mean()

In [None]:
for key, group in df.groupby('A'):
    print(key)
    print(group)

In [None]:
df.groupby(['A', 'B']).mean()

In [None]:
df.groupby(['A', 'B'], as_index=False).mean()
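
Grouping by two keys puts the keys into a hierarchical row index by default, whereas `as_index=False` keeps them as ordinary columns. The sketch below uses fixed values so the group means can be verified by hand; the frame is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'foo'],
                      'B': ['one', 'one', 'one', 'two', 'two'],
                      'C': [1.0, 3.0, 5.0, 7.0, 9.0]})

indexed = frame.groupby(['A', 'B']).mean()               # keys in a MultiIndex
flat = frame.groupby(['A', 'B'], as_index=False).mean()  # keys as columns

print(indexed)
print(flat)
```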

GroupBy with hierarchical indexing
----------------------------------

In [None]:
# zip returns an iterator in Python 3, so materialize it as a list
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = MultiIndex.from_tuples(tuples)
columns = MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'),
                                  ('B', 'cat'), ('A', 'dog')])
df = DataFrame(np.random.randn(8, 4), index=index, columns=columns)
df

In [None]:
df.groupby(level=0).mean()  # rows are the default grouping axis

In [None]:
df.stack()

In [None]:
df.stack().mean(1).unstack()

In [None]:
# could also have grouped the columns by their second level;
# transposing avoids the deprecated axis=1 argument
df.T.groupby(level=1).mean().T
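
The two routes to a mean per inner column label can be compared on a frame with known values. This sketch groups the transposed frame rather than passing `axis=1`, since recent pandas releases deprecate that argument; the frame, its row labels `x` and `y`, and the result names are assumptions for illustration.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'),
                                  ('B', 'cat'), ('A', 'dog')])
frame = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                      [5.0, 6.0, 7.0, 8.0]],
                     index=['x', 'y'], columns=cols)

# Route 1: move the inner column level into the rows, average, move it back.
via_stack = frame.stack().mean(axis=1).unstack()

# Route 2: group the columns by their second level through a transpose.
by_animal = frame.T.groupby(level=1).mean().T

print(via_stack)
print(by_animal)
```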