***Identifies data*** (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display
* Enables automatic and explicit data alignment
* Allows intuitive getting and setting of subsets of the data set

***Different Choices of Indexing***
* .loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. 
* .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics).
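
The error behaviour described in the two bullets above can be checked with a small sketch. The Series `letters` and its values are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Illustrative data (an assumption, not from the notebook).
letters = pd.Series([10, 20, 30], index=list("abc"))

# .loc is label based: a label that is not in the index raises KeyError.
try:
    letters.loc["z"]
    loc_error = None
except KeyError as exc:
    loc_error = type(exc).__name__

# .iloc is position based: a single position past the end raises IndexError.
try:
    letters.iloc[10]
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

# Slices are the exception: an out-of-bounds slice is simply truncated.
tail = letters.iloc[1:10]

print(loc_error, iloc_error, tail.tolist())
```

Catching the exception types separately makes it easy to see which indexer is responsible for which error.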

In [2]:
import numpy as np
import pandas as pd


In [3]:
dates = pd.date_range('1/1/2000', periods = 8)

In [4]:
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

In [5]:
df

Unnamed: 0,A,B,C,D
2000-01-01,-0.567391,-1.357776,0.294897,-0.218608
2000-01-02,0.816161,0.870177,2.78357,-1.209912
2000-01-03,0.805313,-1.957947,0.284881,1.12467
2000-01-04,-1.820819,-1.70017,-2.144157,0.951319
2000-01-05,-0.320835,0.694682,-1.516754,-0.055426
2000-01-06,-0.251408,0.806735,-0.403726,-0.038725
2000-01-07,-0.821472,0.774747,1.504485,0.749717
2000-01-08,0.532733,0.472449,-0.351474,-1.162957


In [6]:
panel = pd.Panel({'one' : df, 'two' : df - df.mean()})

In [7]:
panel

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 8 (major_axis) x 4 (minor_axis)
Items axis: one to two
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-08 00:00:00
Minor_axis axis: A to D

In [8]:
s = df['A']

In [9]:
s[dates[5]]

-0.25140795648575831

In [10]:
panel['two']

Unnamed: 0,A,B,C,D
2000-01-01,-0.363926,-1.183138,0.238432,-0.236118
2000-01-02,1.019626,1.044815,2.727105,-1.227421
2000-01-03,1.008778,-1.783309,0.228416,1.10716
2000-01-04,-1.617355,-1.525532,-2.200622,0.933809
2000-01-05,-0.11737,0.86932,-1.573219,-0.072935
2000-01-06,-0.047943,0.981372,-0.460191,-0.056235
2000-01-07,-0.618008,0.949385,1.448019,0.732207
2000-01-08,0.736198,0.647087,-0.407939,-1.180466


In [11]:
df

Unnamed: 0,A,B,C,D
2000-01-01,-0.567391,-1.357776,0.294897,-0.218608
2000-01-02,0.816161,0.870177,2.78357,-1.209912
2000-01-03,0.805313,-1.957947,0.284881,1.12467
2000-01-04,-1.820819,-1.70017,-2.144157,0.951319
2000-01-05,-0.320835,0.694682,-1.516754,-0.055426
2000-01-06,-0.251408,0.806735,-0.403726,-0.038725
2000-01-07,-0.821472,0.774747,1.504485,0.749717
2000-01-08,0.532733,0.472449,-0.351474,-1.162957


In [12]:
df[['B', 'A']] = df[['A', 'B']]

In [13]:
df

Unnamed: 0,A,B,C,D
2000-01-01,-1.357776,-0.567391,0.294897,-0.218608
2000-01-02,0.870177,0.816161,2.78357,-1.209912
2000-01-03,-1.957947,0.805313,0.284881,1.12467
2000-01-04,-1.70017,-1.820819,-2.144157,0.951319
2000-01-05,0.694682,-0.320835,-1.516754,-0.055426
2000-01-06,0.806735,-0.251408,-0.403726,-0.038725
2000-01-07,0.774747,-0.821472,1.504485,0.749717
2000-01-08,0.472449,0.532733,-0.351474,-1.162957


***Note***
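pandas aligns all axes when setting a Series or DataFrame through .loc and .iloc. The .loc assignment below will not modify df, because column alignment happens before value assignment.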

In [14]:
df[['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-1.357776,-0.567391
2000-01-02,0.870177,0.816161
2000-01-03,-1.957947,0.805313
2000-01-04,-1.70017,-1.820819
2000-01-05,0.694682,-0.320835
2000-01-06,0.806735,-0.251408
2000-01-07,0.774747,-0.821472
2000-01-08,0.472449,0.532733


In [15]:
df.loc[:,['B', 'A']] = df[['A', 'B']]

In [16]:
df[['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-1.357776,-0.567391
2000-01-02,0.870177,0.816161
2000-01-03,-1.957947,0.805313
2000-01-04,-1.70017,-1.820819
2000-01-05,0.694682,-0.320835
2000-01-06,0.806735,-0.251408
2000-01-07,0.774747,-0.821472
2000-01-08,0.472449,0.532733


**The correct way is to use raw values**

In [17]:
df.loc[:,['B', 'A']] = df[['A', 'B']].values

In [18]:
df[['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-0.567391,-1.357776
2000-01-02,0.816161,0.870177
2000-01-03,0.805313,-1.957947
2000-01-04,-1.820819,-1.70017
2000-01-05,-0.320835,0.694682
2000-01-06,-0.251408,0.806735
2000-01-07,-0.821472,0.774747
2000-01-08,0.532733,0.472449
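
The difference between the aligned assignment and the raw-values assignment may be easier to see on a tiny frame. This is a minimal sketch; the frame `base` and its integer values are assumptions chosen so the result is easy to read.

```python
import pandas as pd

# Small illustrative frame (not from the notebook).
base = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Assigning a DataFrame through .loc aligns on column labels first,
# so each column receives its own values back and nothing changes.
aligned = base.copy()
aligned.loc[:, ["B", "A"]] = aligned[["A", "B"]]

# Assigning the underlying array skips label alignment,
# so the values are written positionally and the columns swap.
swapped = base.copy()
swapped.loc[:, ["B", "A"]] = swapped[["A", "B"]].values

print(aligned)
print(swapped)
```

Stripping the labels with `.values` is what turns the assignment into a purely positional one.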


***Attribute Access***
You may access an index on a Series, column on a DataFrame, and an item on a Panel directly as an attribute:

In [19]:
sa = pd.Series([1,2,3], index = list('abc'))

In [20]:
dfa = df.copy()

In [21]:
sa.a

1

In [22]:
dfa.A

2000-01-01   -0.567391
2000-01-02    0.816161
2000-01-03    0.805313
2000-01-04   -1.820819
2000-01-05   -0.320835
2000-01-06   -0.251408
2000-01-07   -0.821472
2000-01-08    0.532733
Freq: D, Name: A, dtype: float64

In [23]:
panel.one

Unnamed: 0,A,B,C,D
2000-01-01,-0.567391,-1.357776,0.294897,-0.218608
2000-01-02,0.816161,0.870177,2.78357,-1.209912
2000-01-03,0.805313,-1.957947,0.284881,1.12467
2000-01-04,-1.820819,-1.70017,-2.144157,0.951319
2000-01-05,-0.320835,0.694682,-1.516754,-0.055426
2000-01-06,-0.251408,0.806735,-0.403726,-0.038725
2000-01-07,-0.821472,0.774747,1.504485,0.749717
2000-01-08,0.532733,0.472449,-0.351474,-1.162957


In [24]:
# Change the value stored at label 'a'
sa.a = 6

In [25]:
sa

a    6
b    2
c    3
dtype: int64

In [26]:
dfa.A = list(range(len(dfa.index))) 

In [27]:
dfa

Unnamed: 0,A,B,C,D
2000-01-01,0,-1.357776,0.294897,-0.218608
2000-01-02,1,0.870177,2.78357,-1.209912
2000-01-03,2,-1.957947,0.284881,1.12467
2000-01-04,3,-1.70017,-2.144157,0.951319
2000-01-05,4,0.694682,-1.516754,-0.055426
2000-01-06,5,0.806735,-0.403726,-0.038725
2000-01-07,6,0.774747,1.504485,0.749717
2000-01-08,7,0.472449,-0.351474,-1.162957


In [28]:
dfa['A'] = list(range(len(dfa.index))) # use this form to create a new column

In [29]:
dfa

Unnamed: 0,A,B,C,D
2000-01-01,0,-1.357776,0.294897,-0.218608
2000-01-02,1,0.870177,2.78357,-1.209912
2000-01-03,2,-1.957947,0.284881,1.12467
2000-01-04,3,-1.70017,-2.144157,0.951319
2000-01-05,4,0.694682,-1.516754,-0.055426
2000-01-06,5,0.806735,-0.403726,-0.038725
2000-01-07,6,0.774747,1.504485,0.749717
2000-01-08,7,0.472449,-0.351474,-1.162957


****Warning****
* You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed. See the Python language reference for an explanation of valid identifiers.
* The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
* Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
* In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
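
As a rough illustration of the restrictions listed above, the sketch below builds a Series whose labels deliberately collide with an existing method and attribute. The Series `labelled` and its labels are assumptions made for this example.

```python
import pandas as pd

# Labels chosen to collide with Series.min and Series.index (illustrative).
labelled = pd.Series([1, 2, 3], index=["min", "index", "ok"])

# Attribute access resolves to the existing method, not to the element.
min_is_method = callable(labelled.min)

# Standard indexing still reaches the elements.
min_value = labelled["min"]
index_value = labelled["index"]

# A label with no conflict is available as an attribute.
ok_value = labelled.ok

print(min_is_method, min_value, index_value, ok_value)
```

Bracket indexing is the form that works regardless of what the label happens to be called.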

*If you are using the IPython environment, you may also use tab-completion to see these accessible attributes.*

In [30]:
# You can also assign a dict to a row of a DataFrame:
x = pd.DataFrame({'x': [1,2,3], 'y' : [3, 4, 5]})

In [31]:
x.iloc[1] = dict(x=9, y=99)

In [32]:
x

Unnamed: 0,x,y
0,1,3
1,9,99
2,3,5


You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; *if you try to use attribute access to create a new column, it creates a new attribute rather than a new column. In 0.21.0 and later, this will raise a UserWarning:*

In [33]:
df = pd.DataFrame({'one' : [1., 2., 3.]})


In [34]:
df.two = [4, 5, 6]

In [35]:
df

Unnamed: 0,one
0,1.0
1,2.0
2,3.0
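
The sketch below repeats the experiment on a separate frame and then creates the column with bracket assignment. The name `frame` is an assumption; the warning is captured with the standard `warnings` module only to keep the output quiet.

```python
import warnings

import pandas as pd

frame = pd.DataFrame({"one": [1.0, 2.0, 3.0]})

# Attribute assignment with a new name sets a plain Python attribute.
with warnings.catch_warnings(record=True):
    warnings.simplefilter("always")
    frame.two = [4, 5, 6]

# No column was created by the attribute assignment.
column_missing = "two" not in frame.columns

# Bracket assignment is the form that actually adds a column.
frame["two"] = [4, 5, 6]

print(column_missing, list(frame.columns))
```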


***Slicing ranges***

The most robust and consistent way of slicing ranges along arbitrary axes is described in the Selection by Position section detailing the .iloc method. For now, we explain the semantics of slicing using the [] operator.
With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels:

In [36]:
s[:5]

2000-01-01   -0.567391
2000-01-02    0.816161
2000-01-03    0.805313
2000-01-04   -1.820819
2000-01-05   -0.320835
Freq: D, Name: A, dtype: float64

In [37]:
s[::2]

2000-01-01   -0.567391
2000-01-03    0.805313
2000-01-05   -0.320835
2000-01-07   -0.821472
Freq: 2D, Name: A, dtype: float64

In [38]:
s[::-1]

2000-01-08    0.532733
2000-01-07   -0.821472
2000-01-06   -0.251408
2000-01-05   -0.320835
2000-01-04   -1.820819
2000-01-03    0.805313
2000-01-02    0.816161
2000-01-01   -0.567391
Freq: -1D, Name: A, dtype: float64

In [39]:
s2 = s.copy()

In [40]:
s2[:5] = 0

In [41]:
s2

2000-01-01    0.000000
2000-01-02    0.000000
2000-01-03    0.000000
2000-01-04    0.000000
2000-01-05    0.000000
2000-01-06   -0.251408
2000-01-07   -0.821472
2000-01-08    0.532733
Freq: D, Name: A, dtype: float64

In [42]:
df[:3]

Unnamed: 0,one
0,1.0
1,2.0
2,3.0


In [43]:
df[::-1]

Unnamed: 0,one
2,3.0
1,2.0
0,1.0


***Selecting by label***

In [44]:
dfl = pd.DataFrame(np.random.randn(5,4), columns = list('ABCD'), index=pd.date_range('20130101', periods=5))

In [45]:
dfl

Unnamed: 0,A,B,C,D
2013-01-01,-0.382775,-2.13445,0.45915,0.208656
2013-01-02,-1.181642,0.296588,-1.021073,0.667625
2013-01-03,-0.790631,1.106876,-1.880172,-0.539125
2013-01-04,-0.20654,0.833635,0.615072,1.092144
2013-01-05,-0.358504,1.413733,2.050281,0.150685


In [46]:
# dfl.loc[2:3] would raise a TypeError: integers are not valid slice labels for a DatetimeIndex
dfl.loc['20130102' : '20130104']

Unnamed: 0,A,B,C,D
2013-01-02,-1.181642,0.296588,-1.021073,0.667625
2013-01-03,-0.790631,1.106876,-1.880172,-0.539125
2013-01-04,-0.20654,0.833635,0.615072,1.092144


In [47]:
s1 = pd.Series(np.random.randn(6),index=list('abcdef'))

In [48]:
s1

a    0.247720
b   -1.009995
c   -0.296896
d    1.628449
e   -0.818756
f   -0.546160
dtype: float64

In [49]:
s1.loc['c':]

c   -0.296896
d    1.628449
e   -0.818756
f   -0.546160
dtype: float64

In [50]:
s1.loc['b']

-1.0099945835356392

In [51]:
# Note setting works as well:
s1.loc['c':] = 0

In [52]:
s1

a    0.247720
b   -1.009995
c    0.000000
d    0.000000
e    0.000000
f    0.000000
dtype: float64

In [53]:
# Slicing with labels

In [54]:
s = pd.Series(list('abcde'), index=[0,3,2,5,4])

In [55]:
s.loc[0:]

0    a
3    b
2    c
5    d
4    e
dtype: object

In [56]:
s.sort_index()

0    a
2    c
3    b
4    e
5    d
dtype: object

In [57]:
s.sort_index().loc[1:6]

2    c
3    b
4    e
5    d
dtype: object

In [58]:
# Selection by position

In [59]:
s1 = pd.Series(np.random.randn(5), index=list(range(0,10,2)))

In [60]:
s1

0    0.303660
2   -0.737870
4    0.149851
6    0.173955
8   -1.220589
dtype: float64

In [61]:
s1.iloc[:3] # slice first three rows

0    0.303660
2   -0.737870
4    0.149851
dtype: float64

In [62]:
s1.loc[:3] # slice up to and including label 3

0    0.30366
2   -0.73787
dtype: float64

In [63]:
s1.iloc[:3] = 0

In [64]:
s1

0    0.000000
2    0.000000
4    0.000000
6    0.173955
8   -1.220589
dtype: float64

In [65]:
df1 = pd.DataFrame(np.random.randn(6,4),
                      index=list(range(0,12,2)),
                      columns=list(range(0,8,2)))

In [66]:
df1

Unnamed: 0,0,2,4,6
0,-0.144366,-0.727775,-0.471009,-1.395643
2,0.423155,-0.810784,0.311376,1.624982
4,2.990786,0.868528,0.253802,1.799782
6,0.730784,-0.622619,-0.653717,-0.327788
8,-1.323964,-0.282376,0.792994,-1.616825
10,-0.975323,-0.484656,1.681544,-2.957533


In [67]:
df1.iloc[:3]

Unnamed: 0,0,2,4,6
0,-0.144366,-0.727775,-0.471009,-1.395643
2,0.423155,-0.810784,0.311376,1.624982
4,2.990786,0.868528,0.253802,1.799782


In [68]:
df1.iloc[1:5, 2:4]

Unnamed: 0,4,6
2,0.311376,1.624982
4,0.253802,1.799782
6,-0.653717,-0.327788
8,0.792994,-1.616825


In [69]:
df1.iloc[1]

0    0.423155
2   -0.810784
4    0.311376
6    1.624982
Name: 2, dtype: float64

In [70]:
# Out of range slice indexes are handled gracefully just as in Python/Numpy.

In [71]:
x = list('abcdef')

In [72]:
x

['a', 'b', 'c', 'd', 'e', 'f']

In [73]:
x[0:]

['a', 'b', 'c', 'd', 'e', 'f']

In [74]:
x[4:10]

['e', 'f']

In [75]:
s = pd.Series(x)

In [76]:
s

0    a
1    b
2    c
3    d
4    e
5    f
dtype: object

In [77]:
s.iloc[4:10]

4    e
5    f
dtype: object

In [78]:
s.iloc[8:10]

Series([], dtype: object)

In [79]:
s.loc[4:]

4    e
5    f
dtype: object

In [80]:
# Selection By Callable

In [81]:
df1 = pd.DataFrame(np.random.randn(6,4),
                      index=list('abcdef'),
                      columns=list('ABCD'))

In [82]:
df1

Unnamed: 0,A,B,C,D
a,-0.66173,1.81262,1.348707,1.894245
b,1.03307,-1.486896,-0.817916,-1.434451
c,-2.704058,1.228246,0.851208,-1.18943
d,0.155367,0.574928,-0.118079,-0.847567
e,-0.418014,0.348902,-0.861325,0.385646
f,0.554963,-1.394015,0.594366,-0.485053


In [83]:
df1.loc[lambda df: df.A > 0, :]

Unnamed: 0,A,B,C,D
b,1.03307,-1.486896,-0.817916,-1.434451
d,0.155367,0.574928,-0.118079,-0.847567
f,0.554963,-1.394015,0.594366,-0.485053


In [84]:
df1.loc[:, lambda df: ['A', 'B']]

Unnamed: 0,A,B
a,-0.66173,1.81262
b,1.03307,-1.486896
c,-2.704058,1.228246
d,0.155367,0.574928
e,-0.418014,0.348902
f,0.554963,-1.394015


In [85]:
df1.loc[:, ['A', 'B']] # same result as the callable form above

Unnamed: 0,A,B
a,-0.66173,1.81262
b,1.03307,-1.486896
c,-2.704058,1.228246
d,0.155367,0.574928
e,-0.418014,0.348902
f,0.554963,-1.394015


In [86]:
df1.iloc[:, lambda df: [0, 1]]

Unnamed: 0,A,B
a,-0.66173,1.81262
b,1.03307,-1.486896
c,-2.704058,1.228246
d,0.155367,0.574928
e,-0.418014,0.348902
f,0.554963,-1.394015


In [87]:
df1[lambda df: df.columns[0]]

a   -0.661730
b    1.033070
c   -2.704058
d    0.155367
e   -0.418014
f    0.554963
Name: A, dtype: float64

In [88]:
# You can use callable indexing in Series.
df1.A.loc[lambda s: s > 0]

b    1.033070
d    0.155367
f    0.554963
Name: A, dtype: float64

In [89]:
# Using these methods / indexers, you can chain data selection operations without using a temporary variable.
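
A minimal sketch of such a chain follows. The frame `scores`, its columns, and the threshold are made up for illustration.

```python
import pandas as pd

# Illustrative data (an assumption, not from the notebook).
scores = pd.DataFrame({"team": ["x", "x", "y", "y"],
                       "points": [3, 8, 5, 1]})

# The callable receives the intermediate result of groupby().sum(),
# so the filter can be applied without naming that intermediate frame.
high_scoring = (
    scores.groupby("team")
          .sum()
          .loc[lambda totals: totals.points > 7]
)

print(high_scoring)
```

Without the callable, the grouped totals would have to be stored in a variable first so that the boolean mask could refer to them.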

In [90]:
# The .ix indexer is deprecated (and was removed in pandas 1.0)

The recommended methods of indexing are:

* .loc if you want to label index
* .iloc if you want to positionally index.

In [91]:
dfd = pd.DataFrame({'A': [1, 2, 3],
                    'B': [4, 5, 6]},
                      index=list('abc'))

In [92]:
dfd

Unnamed: 0,A,B
a,1,4
b,2,5
c,3,6


In [94]:
# I want to get the 0th and the 2nd elements from the index in the 'A' column.
dfd.ix[[0, 2], 'A']

a    1
c    3
Name: A, dtype: int64

Using .loc. Here we will select the appropriate indexes from the index, then use label indexing.

In [95]:
dfd.loc[dfd.index[[0, 2]], 'A']

a    1
c    3
Name: A, dtype: int64

This can also be expressed using .iloc, by explicitly getting locations on the indexers, and using positional indexing to select things.

In [97]:
dfd.iloc[[0, 2], dfd.columns.get_loc('A')]

a    1
c    3
Name: A, dtype: int64

To get multiple indexers, use .get_indexer:

In [98]:
dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]

Unnamed: 0,A,B
a,1,4
c,3,6


In [None]:
# Indexing with list with missing labels is Deprecated

In [99]:
s = pd.Series([1, 2, 3])

In [100]:
s

0    1
1    2
2    3
dtype: int64

In [101]:
s.loc[[1, 2]]

1    2
2    3
dtype: int64

In [102]:
s.loc[[1, 2, 3]]

1    2.0
2    3.0
3    NaN
dtype: float64

Passing list-likes to .loc with any non-matching elements will raise
KeyError in the future; you can use .reindex() as an alternative.

**Reindexing**

*The idiomatic way to select potentially not-found elements is via .reindex(). See also the section on reindexing.*

In [103]:
s.reindex([1, 2, 3])

1    2.0
2    3.0
3    NaN
dtype: float64

In [104]:
labels = [1, 2, 3]

In [106]:
s.loc[s.index.intersection(labels)]

1    2
2    3
dtype: int64

Having a duplicated index will raise for a .reindex():

In [108]:
s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])

In [109]:
labels = ['c', 'd']

s.reindex(labels)


ValueError: cannot reindex from a duplicate axis

Generally, you can intersect the desired labels with the current axis, and then reindex.

In [111]:
s.loc[s.index.intersection(labels)].reindex(labels)

c    3.0
d    NaN
dtype: float64
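
The intersect-then-reindex pattern can be wrapped in a small function. `select_with_missing` is a hypothetical helper name introduced here for illustration; it is not a pandas API.

```python
import numpy as np
import pandas as pd


def select_with_missing(series, labels):
    """Hypothetical helper: select the labels that exist, then reindex
    so that labels absent from the series show up as NaN."""
    present = series.index.intersection(labels)
    return series.loc[present].reindex(labels)


# Same shape of data as the duplicated-index example above.
dup = pd.Series(np.arange(4), index=["a", "a", "b", "c"])
result = select_with_missing(dup, ["c", "d"])

print(result)
```

This sketch only sidesteps duplicates that are not requested. If the requested labels include a duplicated label such as 'a', the intermediate selection still has a duplicate axis, and the reindex step should be expected to fail in the same way.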

----

**Selecting Random Samples**

A random selection of rows or columns from a Series, DataFrame, or Panel can be made with the sample() method. The method will sample rows by default, and accepts a specific number of rows/columns to return, or a fraction of rows.

In [112]:
s = pd.Series([0,1,2,3,4,5])

In [142]:
s.sample()

0    0
dtype: int64

In [162]:
s.sample(n=3)

0    0
3    3
2    2
dtype: int64

In [185]:
# or a fraction of the rows:
s.sample(frac=0.5)

0    0
5    5
2    2
dtype: int64

By default, *sample* will return each row at most once, but one can also sample with replacement using the replace option:

In [186]:
s = pd.Series([0,1,2,3,4,5])

In [194]:
s.sample(n=6, replace=False)

0    0
2    2
1    1
5    5
4    4
3    3
dtype: int64

In [199]:
s.sample(n=6, replace=True)

3    3
5    5
4    4
5    5
5    5
0    0
dtype: int64

By default, each row has an equal probability of being selected, but if you want rows to have different probabilities, you can pass the sample function sampling weights as weights. These weights can be a list, a numpy array, or a Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight of zero, and inf values are not allowed. If weights do not sum to 1, they will be re-normalized by dividing all weights by the sum of the weights. For example:

In [None]:
s = pd.Series([0,1,2,3,4,5])

In [200]:
example_weights = [0, 0, 0.1, 0.2, 0.2, 0.4]

In [210]:
s.sample(n=3, weights=example_weights)

5    5
4    4
2    2
dtype: int64

In [211]:
# Weights will be re-normalized automatically
example_weights2 = [0.5, 0, 0, 0, 0, 0]

In [222]:
s.sample(n=1, weights=example_weights2)

0    0
dtype: int64
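
One way to check the zero-weight and re-normalization rules is to draw many values with replacement and inspect what comes back. The weights below are arbitrary illustrative numbers that intentionally do not sum to 1, and the fixed `random_state` is only there to make the sketch repeatable.

```python
import pandas as pd

values = pd.Series([0, 1, 2, 3, 4, 5])

# Unnormalized weights: the first two rows have weight zero.
raw_weights = [0, 0, 1, 2, 2, 4]

# Draw with replacement so that many samples can be taken.
drawn = values.sample(n=200, replace=True, weights=raw_weights, random_state=0)

# Rows with zero weight should never appear in the sample.
zero_weight_drawn = drawn.isin([0, 1]).any()

print(zero_weight_drawn, sorted(drawn.unique().tolist()))
```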

When applied to a **DataFrame**, you can use a column of the DataFrame as sampling weights (provided you are sampling rows and not columns) by simply passing the name of the column as a string.

In [224]:
df2 = pd.DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.2, 0]})

In [241]:
df2.sample(n = 3, weights = 'weight_column')

Unnamed: 0,col1,weight_column
0,9,0.5
1,8,0.4
2,7,0.2


*sample* also allows users to sample columns instead of rows using the axis argument.

In [242]:
df3 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

In [244]:
df3.sample(n=1, axis=1)

Unnamed: 0,col1
0,1
1,2
2,3


Finally, one can also set a seed for sample's random number generator using the random_state argument, which will accept either an integer (as a seed) or a numpy RandomState object.



In [245]:
df4 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

In [271]:
# With a given seed, the sample will always draw the same rows
df4.sample(n=2, random_state=2)

Unnamed: 0,col1,col2
2,3,4
1,2,3


In [272]:
df4.sample(n=2, random_state=2)

Unnamed: 0,col1,col2
2,3,4
1,2,3
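
The RandomState form of the random_state argument is not shown above, so here is a minimal sketch of it. The frame `frame` mirrors df4, and the seed value 2 is reused from the cells above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})

# Two generators created from the same seed produce the same draw.
first = frame.sample(n=2, random_state=np.random.RandomState(2))
second = frame.sample(n=2, random_state=np.random.RandomState(2))

same_rows = first.equals(second)

print(same_rows)
```

Passing a RandomState object can be convenient when one generator is shared across several sampling steps. In that case successive calls would consume its state and return different rows.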


----

***Setting With Enlargement***