This is a short introduction to pandas, geared mainly for new users.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Create a `Series` by passing a list of values, letting pandas create a default integer index:

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
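
To make the role of the missing value concrete, here is a minimal sketch (the name `s_int` is illustrative and not part of the tutorial). It suggests that the `np.nan` entry is what pushes the dtype to `float64`, and that the default index is a `RangeIndex`:

```python
import numpy as np
import pandas as pd

# Same values as above: the missing value forces a floating-point dtype.
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s.dtype)   # float64
print(s.index)   # RangeIndex(start=0, stop=6, step=1)

# Without a missing value, pandas infers an integer dtype instead.
s_int = pd.Series([1, 3, 5, 6, 8])
print(s_int.dtype)

# isna() flags the missing entry; count() skips it.
print(s.isna().sum(), s.count())  # 1 5
```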

Create a `DataFrame` by passing a NumPy array, with a datetime index and labeled columns:

In [5]:
dates = pd.date_range('20130101', periods=6)
print(dates)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D
2013-01-01 -0.451762  0.068150 -2.195064 -0.848344
2013-01-02  1.802462 -0.364465  2.120440 -1.389254
2013-01-03 -0.798670 -0.422135 -1.043356 -0.637629
2013-01-04 -1.553972  0.360354  1.197589 -1.437590
2013-01-05 -1.357458  0.204698 -0.660713 -0.935658
2013-01-06  0.384385 -0.000226  0.782146 -0.533268
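
Because `np.random.randn` draws new numbers on every run, the values printed above will differ each time the cell is executed. One way to get a repeatable frame is a seeded generator. This is a sketch only: the seed value and the use of `np.random.default_rng` are assumptions, not something the tutorial itself does.

```python
import numpy as np
import pandas as pd

# Seeded generator: the seed (0) is arbitrary and only makes runs repeatable.
rng = np.random.default_rng(seed=0)

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df.shape)          # (6, 4)
print(list(df.columns))  # ['A', 'B', 'C', 'D']

# Re-creating the generator with the same seed reproduces the same values.
again = np.random.default_rng(seed=0).standard_normal((6, 4))
print(np.array_equal(df.to_numpy(), again))  # True
```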


Create a `DataFrame` by passing a dict of objects that can be converted to a series-like structure:

In [6]:
df2 = pd.DataFrame({
    'A' : 1,
    'B' : pd.Timestamp('20130102'),
    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
    'D' : np.array([3] * 4, dtype='int32'),
    'E' : pd.Categorical(["test", "train", "test", "train"]),
    'F' : 'foo' })

df2

   A          B    C  D      E    F
0  1 2013-01-02  1.0  3   test  foo
1  1 2013-01-02  1.0  3  train  foo
2  1 2013-01-02  1.0  3   test  foo
3  1 2013-01-02  1.0  3  train  foo


The columns of the resulting `DataFrame` have different dtypes.

In [7]:
df2.dtypes

A             int64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
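
Mixed dtypes matter in practice because many operations apply only to some columns. As a hedged illustration, the sketch below rebuilds the same frame and uses `select_dtypes` to keep the numeric columns. The variable name `numeric` is illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

# Keep only the numeric columns (A, C and D in this frame).
numeric = df2.select_dtypes(include="number")
print(list(numeric.columns))  # ['A', 'C', 'D']

# The categorical column stores its distinct values as categories.
print(list(df2["E"].cat.categories))  # ['test', 'train']
```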

# Viewing Data

Here is how to view the top and bottom rows of the frame:

In [10]:
df.head()

                   A         B         C         D
2013-01-01 -0.451762  0.068150 -2.195064 -0.848344
2013-01-02  1.802462 -0.364465  2.120440 -1.389254
2013-01-03 -0.798670 -0.422135 -1.043356 -0.637629
2013-01-04 -1.553972  0.360354  1.197589 -1.437590
2013-01-05 -1.357458  0.204698 -0.660713 -0.935658


In [12]:
df.tail(3)

                   A         B         C         D
2013-01-04 -1.553972  0.360354  1.197589 -1.437590
2013-01-05 -1.357458  0.204698 -0.660713 -0.935658
2013-01-06  0.384385 -0.000226  0.782146 -0.533268
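
Besides `head()` and `tail()`, the row labels, column labels, and underlying values can be inspected directly. The following is a small sketch on a frame with made-up values (0 to 23), not the random data shown above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Illustrative data: 0..23 arranged in 6 rows and 4 columns.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

print(len(df.head()))       # 5: head() defaults to the first five rows
print(len(df.tail(3)))      # 3: tail(3) returns the last three rows
print(df.index[0], df.index[-1])
print(list(df.columns))     # ['A', 'B', 'C', 'D']
print(df.to_numpy().shape)  # (6, 4)
```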


`describe()` shows a quick statistical summary of your data:

In [13]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.329169 -0.025604  0.033507 -0.963624
std    1.252676  0.310966  1.604926  0.377188
min   -1.553972 -0.422135 -2.195064 -1.437590
25%   -1.217761 -0.273405 -0.947696 -1.275855
50%   -0.625216  0.033962  0.060717 -0.892001
75%    0.175348  0.170561  1.093728 -0.690308
max    1.802462  0.360354  2.120440 -0.533268
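
The rows of the summary correspond to ordinary column statistics. As a rough check on a tiny frame with hand-picked values (an assumption made so the numbers are easy to verify by eye), the `50%` row is the median and `std` is the sample standard deviation:

```python
import numpy as np
import pandas as pd

# Hand-picked values so the statistics are easy to verify.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]})
summary = df.describe()

print(summary.loc["mean", "A"])  # 2.5
print(summary.loc["50%", "B"])   # 25.0

# describe() reports the sample standard deviation (ddof=1).
print(np.isclose(summary.loc["std", "A"], df["A"].std(ddof=1)))  # True
```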


Transposing your data:

In [14]:
df.T

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.451762    1.802462   -0.798670   -1.553972   -1.357458    0.384385
B    0.068150   -0.364465   -0.422135    0.360354    0.204698   -0.000226
C   -2.195064    2.120440   -1.043356    1.197589   -0.660713    0.782146
D   -0.848344   -1.389254   -0.637629   -1.437590   -0.935658   -0.533268


Sorting by an axis:

In [15]:
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2013-01-01 -0.848344 -2.195064  0.068150 -0.451762
2013-01-02 -1.389254  2.120440 -0.364465  1.802462
2013-01-03 -0.637629 -1.043356 -0.422135 -0.798670
2013-01-04 -1.437590  1.197589  0.360354 -1.553972
2013-01-05 -0.935658 -0.660713  0.204698 -1.357458
2013-01-06 -0.533268  0.782146 -0.000226  0.384385


Sorting by values:

In [16]:
df.sort_values(by='B')

                   A         B         C         D
2013-01-03 -0.798670 -0.422135 -1.043356 -0.637629
2013-01-02  1.802462 -0.364465  2.120440 -1.389254
2013-01-06  0.384385 -0.000226  0.782146 -0.533268
2013-01-01 -0.451762  0.068150 -2.195064 -0.848344
2013-01-05 -1.357458  0.204698 -0.660713 -0.935658
2013-01-04 -1.553972  0.360354  1.197589 -1.437590
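
Both sorting methods return a new object and leave the original frame untouched. A short sketch with a three-row frame (the labels `x`, `y`, `z` and the values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2], "B": [0.5, -1.0, 0.0]}, index=["x", "y", "z"])

# Sort rows by column B, largest value first.
by_b_desc = df.sort_values(by="B", ascending=False)
print(list(by_b_desc.index))  # ['x', 'z', 'y']

# The original frame keeps its row order.
print(list(df.index))         # ['x', 'y', 'z']

# axis=1 sorts the column labels; ascending=False reverses them.
print(list(df.sort_index(axis=1, ascending=False).columns))  # ['B', 'A']
```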


# Selection

*NOTE: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods `.at`, `.iat`, `.loc`, and `.iloc`.*

## Getting

Selecting a single column, which yields a `Series`, equivalent to `df.A`:

In [17]:
df['A']

2013-01-01   -0.451762
2013-01-02    1.802462
2013-01-03   -0.798670
2013-01-04   -1.553972
2013-01-05   -1.357458
2013-01-06    0.384385
Freq: D, Name: A, dtype: float64
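
Attribute access is a convenience that works only when the column name does not clash with an existing `DataFrame` attribute. The sketch below uses a hypothetical column named `count` to show where the two forms diverge:

```python
import pandas as pd

# 'count' is a made-up column name chosen because DataFrame.count is a method.
df = pd.DataFrame({"A": [1, 2, 3], "count": [10, 20, 30]})

# Both forms return the same Series for a simple column name.
print(df["A"].equals(df.A))  # True

# Attribute access finds the method first, so it does not return the column.
print(callable(df.count))    # True

# Bracket access always returns the column.
print(df["count"].tolist())  # [10, 20, 30]
```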

Selecting via `[]` with a slice, which slices the rows:

In [18]:
df[0:3]

                   A         B         C         D
2013-01-01 -0.451762  0.068150 -2.195064 -0.848344
2013-01-02  1.802462 -0.364465  2.120440 -1.389254
2013-01-03 -0.798670 -0.422135 -1.043356 -0.637629


In [19]:
df['20130102':'20130104']

                   A         B         C         D
2013-01-02  1.802462 -0.364465  2.120440 -1.389254
2013-01-03 -0.798670 -0.422135 -1.043356 -0.637629
2013-01-04 -1.553972  0.360354  1.197589 -1.437590
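
The two slices above follow different endpoint rules: an integer slice excludes its stop position, while a label slice includes both endpoints. A minimal sketch on a frame with made-up values shows the difference:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Illustrative data: 0..23 arranged in 6 rows and 4 columns.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Positional slice: rows 0, 1 and 2 (the stop position is excluded).
by_position = df[0:3]
print(len(by_position))  # 3

# Label slice on the DatetimeIndex: both endpoints are included.
by_label = df["20130102":"20130104"]
print(len(by_label))     # 3
print(by_label.index[0].date(), by_label.index[-1].date())  # 2013-01-02 2013-01-04
```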


## Selection by Label

For getting a cross section using a label:

In [20]:
df.loc[dates[0]]

A   -0.451762
B    0.068150
C   -2.195064
D   -0.848344
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on multiple axes by label:

In [21]:
df.loc[:, ['A', 'B']]

                   A         B
2013-01-01 -0.451762  0.068150
2013-01-02  1.802462 -0.364465
2013-01-03 -0.798670 -0.422135
2013-01-04 -1.553972  0.360354
2013-01-05 -1.357458  0.204698
2013-01-06  0.384385 -0.000226


With label slicing, both endpoints are *included*:

In [22]:
df.loc['20130102':'20130104', ['A', 'B']]

                   A         B
2013-01-02  1.802462 -0.364465
2013-01-03 -0.798670 -0.422135
2013-01-04 -1.553972  0.360354


Passing a single row label instead of a slice reduces the dimensions of the returned object:

In [23]:
df.loc['20130102', ['A', 'B']]

A    1.802462
B   -0.364465
Name: 2013-01-02 00:00:00, dtype: float64
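
Whether `.loc` returns a `DataFrame`, a `Series`, or a scalar depends on whether each axis receives a single label or a list of labels. The sketch below, on a frame with made-up values, walks through the three cases. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Illustrative data: 0..23 arranged in 6 rows and 4 columns.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# List of row labels and list of column labels: result stays a DataFrame.
frame = df.loc[[dates[1]], ["A", "B"]]
print(type(frame).__name__, frame.shape)  # DataFrame (1, 2)

# Single row label and list of column labels: result is a Series.
row = df.loc[dates[1], ["A", "B"]]
print(type(row).__name__, row.name)

# Single row label and single column label: result is a scalar.
value = df.loc[dates[1], "A"]
print(value)  # 4.0
```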

For getting a scalar value:

In [24]:
df.loc[dates[0], 'A']

-0.451761818568139

For getting fast access to a scalar (equivalent to the prior method):

In [26]:
df.at[dates[0], 'A']

-0.451761818568139
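
To see that the two accessors agree, and to get a rough feel for the speed difference, something like the following can be run. It is a sketch on made-up data. The timings depend on the machine and the pandas version, so no particular ratio is claimed here.

```python
import timeit

import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Illustrative data: 0..23 arranged in 6 rows and 4 columns.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Both accessors return the same scalar.
value_loc = df.loc[dates[0], "A"]
value_at = df.at[dates[0], "A"]
print(value_loc == value_at)  # True

# Rough timing of 1000 lookups each; results vary between runs.
t_loc = timeit.timeit(lambda: df.loc[dates[0], "A"], number=1000)
t_at = timeit.timeit(lambda: df.at[dates[0], "A"], number=1000)
print(f".loc: {t_loc:.4f}s  .at: {t_at:.4f}s")
```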

## Selection by Position