This is a short introduction to pandas, geared mainly toward new users. You can find more complex recipes in the Cookbook.

Customarily, we import as follows:

In [1]:
import numpy as np
import pandas as pd

Object creation
See the Data Structure Intro section.

Creating a Series by passing a list of values, letting pandas create a default integer index:


In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
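
The float64 dtype above follows from the np.nan entry, since NaN is a floating-point value. A minimal sketch of that effect, using two illustrative Series (the names ints and with_nan are not part of the original walkthrough):

```python
import numpy as np
import pandas as pd

# A list of plain integers is stored with an integer dtype.
ints = pd.Series([1, 3, 5])

# Adding np.nan forces a floating-point dtype, because NaN is a float.
with_nan = pd.Series([1, 3, 5, np.nan])

print(ints.dtype)
print(with_nan.dtype)
```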

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:



In [3]:
dates = pd.date_range('20130101', periods=6)
dates


DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [4]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,1.003178,1.348943,-1.605178,0.823144
2013-01-02,0.892376,0.073171,-1.195883,-2.482828
2013-01-03,1.300199,1.124934,-0.165094,-1.131265
2013-01-04,-0.087728,0.096098,-0.191023,1.174944
2013-01-05,1.61769,0.635123,-0.199219,-0.65648
2013-01-06,1.253653,0.633224,-0.410232,-0.420364


Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure:

In [5]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting DataFrame have different dtypes.



In [6]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
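
In the dict constructor, scalar entries are repeated to match the length of the series-like entries. The sketch below illustrates this with a reduced, hypothetical version of the dict (only the keys A, C and F are kept; the variable name frame is an assumption):

```python
import pandas as pd

# The scalars under 'A' and 'F' are repeated to match the length of the
# series-like entry 'C', which also supplies the index 0..3.
frame = pd.DataFrame({
    "A": 1.0,
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "F": "foo",
})

print(frame.shape)
print(frame.dtypes)
```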

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

df2.<TAB>
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.columns
df2.align              df2.copy
df2.all                df2.count
df2.any                df2.combine
df2.append             df2.D
df2.apply              df2.describe
df2.applymap           df2.diff
df2.B                  df2.duplicated

As you can see, the columns A, B, C, and D are automatically tab completed. E and F are there as well; the rest of the attributes have been truncated for brevity.

Viewing data
See the Basics section.

Here is how to view the top and bottom rows of the frame:

In [7]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,1.003178,1.348943,-1.605178,0.823144
2013-01-02,0.892376,0.073171,-1.195883,-2.482828
2013-01-03,1.300199,1.124934,-0.165094,-1.131265
2013-01-04,-0.087728,0.096098,-0.191023,1.174944
2013-01-05,1.61769,0.635123,-0.199219,-0.65648


In [8]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,-0.087728,0.096098,-0.191023,1.174944
2013-01-05,1.61769,0.635123,-0.199219,-0.65648
2013-01-06,1.253653,0.633224,-0.410232,-0.420364


In [9]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [10]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data:

In [11]:
df.to_numpy()

array([[ 1.00317807,  1.34894305, -1.60517802,  0.8231439 ],
       [ 0.89237614,  0.0731713 , -1.19588326, -2.48282834],
       [ 1.30019868,  1.12493409, -0.16509418, -1.1312651 ],
       [-0.08772759,  0.09609812, -0.19102311,  1.17494399],
       [ 1.61768988,  0.63512339, -0.19921894, -0.65647957],
       [ 1.2536528 ,  0.63322353, -0.41023218, -0.4203644 ]])

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.

In [12]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
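
The difference between the two calls shows up in the dtype of the resulting array. A small sketch with two hypothetical frames (floats and mixed are illustrative names, not objects from this walkthrough):

```python
import pandas as pd

# All columns share one dtype, so the array keeps float64.
floats = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# Floats and strings have no common numeric dtype, so the result is object.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

float_array = floats.to_numpy()
mixed_array = mixed.to_numpy()
print(float_array.dtype)
print(mixed_array.dtype)
```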

Note

DataFrame.to_numpy() does not include the index or column labels in the output.

describe() shows a quick statistical summary of your data:

In [13]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.996561,0.651916,-0.627772,-0.448808
std,0.588332,0.520404,0.618688,1.334268
min,-0.087728,0.073171,-1.605178,-2.482828
25%,0.920077,0.230379,-0.99947,-1.012569
50%,1.128415,0.634173,-0.304726,-0.538422
75%,1.288562,1.002481,-0.193072,0.512267
max,1.61769,1.348943,-0.165094,1.174944


Transposing your data:

In [14]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,1.003178,0.892376,1.300199,-0.087728,1.61769,1.253653
B,1.348943,0.073171,1.124934,0.096098,0.635123,0.633224
C,-1.605178,-1.195883,-0.165094,-0.191023,-0.199219,-0.410232
D,0.823144,-2.482828,-1.131265,1.174944,-0.65648,-0.420364


Sorting by an axis:



In [15]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,0.823144,-1.605178,1.348943,1.003178
2013-01-02,-2.482828,-1.195883,0.073171,0.892376
2013-01-03,-1.131265,-0.165094,1.124934,1.300199
2013-01-04,1.174944,-0.191023,0.096098,-0.087728
2013-01-05,-0.65648,-0.199219,0.635123,1.61769
2013-01-06,-0.420364,-0.410232,0.633224,1.253653


Sorting by values:


In [16]:

df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-02,0.892376,0.073171,-1.195883,-2.482828
2013-01-04,-0.087728,0.096098,-0.191023,1.174944
2013-01-06,1.253653,0.633224,-0.410232,-0.420364
2013-01-05,1.61769,0.635123,-0.199219,-0.65648
2013-01-03,1.300199,1.124934,-0.165094,-1.131265
2013-01-01,1.003178,1.348943,-1.605178,0.823144


Selection
Note

While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, we recommend the optimized pandas data access methods .at, .iat, .loc and .iloc for production code.

Selecting a single column, which yields a Series, equivalent to df.A:



In [17]:
df['A']

2013-01-01    1.003178
2013-01-02    0.892376
2013-01-03    1.300199
2013-01-04   -0.087728
2013-01-05    1.617690
2013-01-06    1.253653
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows:



In [18]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,1.003178,1.348943,-1.605178,0.823144
2013-01-02,0.892376,0.073171,-1.195883,-2.482828
2013-01-03,1.300199,1.124934,-0.165094,-1.131265


In [19]:
df['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,0.892376,0.073171,-1.195883,-2.482828
2013-01-03,1.300199,1.124934,-0.165094,-1.131265
2013-01-04,-0.087728,0.096098,-0.191023,1.174944
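
The two slices above behave differently at the end point: an integer slice excludes it, while a label slice includes it. A minimal sketch on a hypothetical frame filled with np.arange values (the names frame, by_position and by_label are assumptions):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Integer slice: rows at positions 0, 1 and 2 (position 3 is excluded).
by_position = frame[0:3]

# Label slice: 2013-01-02 through 2013-01-04, with both endpoints included.
by_label = frame["20130102":"20130104"]

print(by_position.index[-1], by_label.index[-1])
```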


Selection by label
See more in Selection by Label.

For getting a cross section using a label:



In [20]:
df.loc[dates[0]]

A    1.003178
B    1.348943
C   -1.605178
D    0.823144
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:



In [21]:
df.loc[:, ['A', 'B']]

Unnamed: 0,A,B
2013-01-01,1.003178,1.348943
2013-01-02,0.892376,0.073171
2013-01-03,1.300199,1.124934
2013-01-04,-0.087728,0.096098
2013-01-05,1.61769,0.635123
2013-01-06,1.253653,0.633224


Showing label slicing; both endpoints are included:



In [22]:
df.loc['20130102':'20130104', ['A', 'B']]

Unnamed: 0,A,B
2013-01-02,0.892376,0.073171
2013-01-03,1.300199,1.124934
2013-01-04,-0.087728,0.096098


Reduction in the dimensions of the returned object:



In [23]:
df.loc['20130102', ['A', 'B']]

A    0.892376
B    0.073171
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value:



In [24]:
df.loc[dates[0], 'A']

1.003178068633914

For getting fast access to a scalar (equivalent to the prior method):



In [25]:
df.at[dates[0], 'A']

1.003178068633914

Selection by position
See more in Selection by Position.

Select via the position of the passed integers:



In [26]:
df.iloc[3]

A   -0.087728
B    0.096098
C   -0.191023
D    1.174944
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similarly to NumPy/Python:



In [27]:
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2013-01-04,-0.087728,0.096098
2013-01-05,1.61769,0.635123


By lists of integer position locations, similar to the NumPy/Python style:



In [28]:
df.iloc[[1, 2, 4], [0, 2]]

Unnamed: 0,A,C
2013-01-02,0.892376,-1.195883
2013-01-03,1.300199,-0.165094
2013-01-05,1.61769,-0.199219


For slicing rows explicitly:



In [29]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2013-01-02,0.892376,0.073171,-1.195883,-2.482828
2013-01-03,1.300199,1.124934,-0.165094,-1.131265


For slicing columns explicitly:



In [30]:
df.iloc[:, 1:3]

Unnamed: 0,B,C
2013-01-01,1.348943,-1.605178
2013-01-02,0.073171,-1.195883
2013-01-03,1.124934,-0.165094
2013-01-04,0.096098,-0.191023
2013-01-05,0.635123,-0.199219
2013-01-06,0.633224,-0.410232


For getting a value explicitly:



In [31]:
df.iloc[1, 1]

0.07317130485532075

For getting fast access to a scalar (equivalent to the prior method):


In [32]:
df.iat[1, 1]

0.07317130485532075
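
The four accessors can reach the same cell, either by label (.loc, .at) or by integer position (.iloc, .iat). The sketch below checks this on a hypothetical frame of consecutive integers; the frame and variable names are illustrative only:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based access to the cell in row 2013-01-02, column 'B'.
by_loc = frame.loc[dates[1], "B"]
by_at = frame.at[dates[1], "B"]

# Position-based access to the same cell (row 1, column 1).
by_iloc = frame.iloc[1, 1]
by_iat = frame.iat[1, 1]

print(by_loc, by_at, by_iloc, by_iat)
```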

Boolean indexing
Using a single column’s values to select data:



In [33]:
df[df['A'] > 0]

Unnamed: 0,A,B,C,D
2013-01-01,1.003178,1.348943,-1.605178,0.823144
2013-01-02,0.892376,0.073171,-1.195883,-2.482828
2013-01-03,1.300199,1.124934,-0.165094,-1.131265
2013-01-05,1.61769,0.635123,-0.199219,-0.65648
2013-01-06,1.253653,0.633224,-0.410232,-0.420364


Selecting values from a DataFrame where a boolean condition is met:


In [34]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,1.003178,1.348943,,0.823144
2013-01-02,0.892376,0.073171,,
2013-01-03,1.300199,1.124934,,
2013-01-04,,0.096098,,1.174944
2013-01-05,1.61769,0.635123,,
2013-01-06,1.253653,0.633224,,
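
Indexing with a boolean DataFrame keeps the shape of the frame and writes NaN wherever the condition is False. As far as this sketch assumes, that matches DataFrame.where with the same condition; the small frame below is hypothetical:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, 5.0, -6.0]})

# Entries that fail the condition become NaN; the shape is unchanged.
masked = frame[frame > 0]

# where() with the same condition gives the same result.
via_where = frame.where(frame > 0)

print(masked)
```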


Using the isin() method for filtering:

In [35]:
df2 = df.copy()



In [36]:
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

df2

Unnamed: 0,A,B,C,D,E
2013-01-01,1.003178,1.348943,-1.605178,0.823144,one
2013-01-02,0.892376,0.073171,-1.195883,-2.482828,one
2013-01-03,1.300199,1.124934,-0.165094,-1.131265,two
2013-01-04,-0.087728,0.096098,-0.191023,1.174944,three
2013-01-05,1.61769,0.635123,-0.199219,-0.65648,four
2013-01-06,1.253653,0.633224,-0.410232,-0.420364,three


In [37]:
df2[df2['E'].isin(['two', 'four'])]

Unnamed: 0,A,B,C,D,E
2013-01-03,1.300199,1.124934,-0.165094,-1.131265,two
2013-01-05,1.61769,0.635123,-0.199219,-0.65648,four


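Setting
Setting a new column automatically aligns the data by the indexes. The Series below starts one day later than df, so its labels only partly overlap with the frame's index: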



In [38]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
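
To see the alignment itself, a Series like s1 can be assigned as a new column. The sketch below does this on a separate, hypothetical frame so that df is left untouched; the column name F and the all-zero column A are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": np.zeros(6)}, index=dates)

# s1 starts one day later than the frame's index.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

# Assignment aligns on the index: 2013-01-01 has no match and becomes NaN,
# and the 2013-01-07 entry of s1 is dropped because the frame has no such row.
frame["F"] = s1

print(frame)
```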

Setting values by label:



In [39]:
df.at[dates[0], 'A'] = 0

Setting values by position:



In [40]:
df.iat[0, 1] = 0

Setting by assigning with a NumPy array:



In [41]:
df.loc[:, 'D'] = np.array([5] * len(df))

In [42]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0.0,0.0,-1.605178,5
2013-01-02,0.892376,0.073171,-1.195883,5
2013-01-03,1.300199,1.124934,-0.165094,5
2013-01-04,-0.087728,0.096098,-0.191023,5
2013-01-05,1.61769,0.635123,-0.199219,5
2013-01-06,1.253653,0.633224,-0.410232,5


A where operation with setting:



In [43]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2


Unnamed: 0,A,B,C,D
2013-01-01,0.0,0.0,-1.605178,-5
2013-01-02,-0.892376,-0.073171,-1.195883,-5
2013-01-03,-1.300199,-1.124934,-0.165094,-5
2013-01-04,-0.087728,-0.096098,-0.191023,-5
2013-01-05,-1.61769,-0.635123,-0.199219,-5
2013-01-06,-1.253653,-0.633224,-0.410232,-5
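
The same replacement can also be written without in-place assignment. The sketch below compares the boolean-assignment form with DataFrame.mask on a small hypothetical frame; presenting mask() as an alternative is a suggestion of this example, not something the walkthrough itself uses:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0, 0.0], "B": [4.0, 5.0, -6.0]})

# In-place form: overwrite the positive entries with their negation.
flipped = frame.copy()
flipped[flipped > 0] = -flipped

# mask() returns a new frame with values replaced where the condition holds.
via_mask = frame.mask(frame > 0, -frame)

print(flipped)
```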


Missing data
pandas primarily uses the value np.nan to represent missing data. By default, it is not included in computations. See the Missing Data section.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [44]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

df1.loc[dates[0]:dates[1], 'E'] = 1

df1

Unnamed: 0,A,B,C,D,E
2013-01-01,0.0,0.0,-1.605178,5,1.0
2013-01-02,0.892376,0.073171,-1.195883,5,1.0
2013-01-03,1.300199,1.124934,-0.165094,5,
2013-01-04,-0.087728,0.096098,-0.191023,5,


To drop any rows that have missing data:



In [45]:
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,E
2013-01-01,0.0,0.0,-1.605178,5,1.0
2013-01-02,0.892376,0.073171,-1.195883,5,1.0


Filling missing data:



In [46]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,E
2013-01-01,0.0,0.0,-1.605178,5,1.0
2013-01-02,0.892376,0.073171,-1.195883,5,1.0
2013-01-03,1.300199,1.124934,-0.165094,5,5.0
2013-01-04,-0.087728,0.096098,-0.191023,5,5.0


To get the boolean mask where values are NaN:



In [47]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,E
2013-01-01,False,False,False,False,False
2013-01-02,False,False,False,False,False
2013-01-03,False,False,False,False,True
2013-01-04,False,False,False,False,True
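
The three operations above all return new objects and leave the original frame unchanged. A compact sketch on a hypothetical two-column frame, which also counts the missing values per column by summing the boolean mask:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [0.0, 1.0, 2.0, 3.0], "E": [1.0, 1.0, np.nan, np.nan]})

# True counts as 1 when summed, so this gives the number of NaNs per column.
missing_per_column = frame.isna().sum()

# Both calls return new frames; 'frame' itself still contains the NaNs.
complete_rows = frame.dropna(how="any")
filled = frame.fillna(value=5)

print(missing_per_column)
```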


Stats
Operations in general exclude missing data.

Performing a descriptive statistic:


In [48]:
df.mean()

A    0.829365
B    0.427092
C   -0.627772
D    5.000000
dtype: float64

Same operation on the other axis:



In [49]:
df.mean(axis=1)

2013-01-01    0.848705
2013-01-02    1.192416
2013-01-03    1.815010
2013-01-04    1.204337
2013-01-05    1.763399
2013-01-06    1.619161
Freq: D, dtype: float64
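
The exclusion of missing data is controlled by the skipna parameter of the reduction methods. A minimal sketch on a hypothetical three-element Series:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# By default the NaN is skipped, so the mean is taken over 1.0 and 3.0.
default_mean = values.mean()

# With skipna=False the NaN propagates into the result.
strict_mean = values.mean(skipna=False)

print(default_mean, strict_mean)
```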

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.


In [50]:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [51]:
df.sub(s, axis='index')

Unnamed: 0,A,B,C,D
2013-01-01,,,,
2013-01-02,,,,
2013-01-03,0.300199,0.124934,-1.165094,4.0
2013-01-04,-3.087728,-2.903902,-3.191023,2.0
2013-01-05,-3.38231,-4.364877,-5.199219,0.0
2013-01-06,,,,
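
The subtraction above aligns the Series with the row index and applies each Series value across the whole row. The sketch below repeats the idea on a smaller hypothetical frame where the arithmetic is easy to check by hand:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
frame = pd.DataFrame({"A": [10.0, 20.0, 30.0], "B": [1.0, 2.0, 3.0]}, index=dates)
offsets = pd.Series([1.0, np.nan, 3.0], index=dates)

# axis='index' matches the Series against the row labels; each value is
# subtracted from every column of its row, and NaN propagates.
result = frame.sub(offsets, axis="index")

print(result)
```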


Applying functions to the data:



In [52]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D
2013-01-01,0.0,0.0,-1.605178,5
2013-01-02,0.892376,0.073171,-2.801061,10
2013-01-03,2.192575,1.198105,-2.966155,15
2013-01-04,2.104847,1.294204,-3.157179,20
2013-01-05,3.722537,1.929327,-3.356398,25
2013-01-06,4.97619,2.56255,-3.76663,30


In [53]:
df.apply(lambda x: x.max() - x.min())

A    1.705417
B    1.124934
C    1.440084
D    0.000000
dtype: float64

Counting the occurrences of each value with value_counts():

In [54]:
s = pd.Series(np.random.randint(0, 7, size=10))
s

0    6
1    2
2    0
3    2
4    5
5    3
6    0
7    2
8    3
9    6
dtype: int32

In [55]:
s.value_counts()

2    3
6    2
3    2
0    2
5    1
dtype: int64
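
Because the Series above is random, the counts change from run to run. A reproducible sketch that hard-codes the ten values shown in the output above (the names rolls and counts are illustrative):

```python
import pandas as pd

# Fixed copy of the ten values printed above.
rolls = pd.Series([6, 2, 0, 2, 5, 3, 0, 2, 3, 6])

# The result is indexed by the distinct values, most frequent first.
counts = rolls.value_counts()

print(counts)
```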

String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.



In [56]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])



In [57]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
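
The regular-expression behaviour mentioned above can be seen with str.contains, which treats its pattern as a regex by default. A minimal sketch on a shortened, hypothetical Series; the na=False argument is a choice made here so that the missing entry yields False:

```python
import numpy as np
import pandas as pd

words = pd.Series(["A", "Aaba", "Baca", np.nan, "dog"])

# Element-wise string length; the missing entry stays missing.
lengths = words.str.len()

# '^a' is a regular expression anchored at the start of each string.
# case=False ignores case, and na=False maps the missing entry to False.
starts_with_a = words.str.contains("^a", case=False, na=False)

print(starts_with_a)
```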

Concat
pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

See the Merging section.

Concatenating pandas objects together with concat():



In [58]:
df = pd.DataFrame(np.random.randn(10, 4))
df

Unnamed: 0,0,1,2,3
0,-1.091708,0.447032,-1.674618,-0.334687
1,-0.611181,0.133549,0.559816,-0.046679
2,-1.246322,-0.938716,1.385511,1.114058
3,3.015657,-0.258077,-0.421383,-1.367218
4,-0.249009,0.305996,0.481604,-0.829249
5,-0.272111,0.218038,1.417141,-0.153859
6,1.388396,-0.848745,1.353974,-1.136229
7,0.148057,1.590238,-1.805817,0.376834
8,-0.9122,0.259384,0.359636,-1.375815
9,0.722756,0.800308,0.04944,0.76701


In [59]:
df = pd.DataFrame(np.random.randn(10, 4))

Break it into pieces:


In [60]:
pieces = [df[:3], df[3:7], df[7:]]

In [61]:
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,0.140503,0.013098,0.563224,1.17825
1,-1.10635,-0.558391,0.777771,-0.899834
2,0.149987,0.241101,-0.528885,1.063399
3,0.879816,-0.187687,-0.48372,0.590984
4,-0.29686,2.312014,0.266848,0.749708
5,-1.193362,0.244293,-1.371026,0.313959
6,-0.173203,0.108693,1.139996,-1.507359
7,-1.260896,0.805682,0.378835,0.469593
8,-0.798206,-0.217795,-1.773928,-0.455626
9,-1.030738,1.515614,0.271076,0.231126
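
Concatenating row slices in their original order should reproduce the starting frame. The sketch below checks this on a hypothetical frame of consecutive integers instead of random data, so the result is deterministic:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(40).reshape(10, 4))

# Split into three row blocks of 3, 4 and 3 rows.
pieces = [frame[:3], frame[3:7], frame[7:]]

# concat() stacks the blocks along the rows and keeps their index labels.
rebuilt = pd.concat(pieces)

print(rebuilt.equals(frame))
```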
