# MultiIndex / advanced indexing
This section covers indexing with a MultiIndex and other advanced indexing features.

See Indexing and Selecting Data for general indexing documentation.

>Warning
>
>Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.

See the cookbook for some advanced strategies.

## Hierarchical indexing (MultiIndex)
Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.

Changed in version 0.24.0: MultiIndex.labels has been renamed to MultiIndex.codes and MultiIndex.set_labels to MultiIndex.set_codes.

### Creating a MultiIndex (hierarchical index) object
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [1]:
import pandas as pd
import numpy as np

In [2]:
arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
tuples = list(zip(*arrays))
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [3]:
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
index

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [4]:
s = pd.Series(np.random.randn(8), index=index)
s

first  second
bar    one      -0.560502
       two       0.712816
baz    one       0.536323
       two      -0.111789
foo    one      -0.555251
       two      -2.159646
qux    one      -0.877132
       two       2.555101
dtype: float64

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [5]:
iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
pd.MultiIndex.from_product(iterables, names=["first", "second"])

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().

*New in version 0.24.0.*

In [6]:
df = pd.DataFrame(
    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
    columns=["first", "second"],
)
pd.MultiIndex.from_frame(df)

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
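
The examples above cover from_tuples(), from_product(), and from_frame(). As a minimal sketch of the two remaining routes named at the start of this section, the following builds an index with MultiIndex.from_arrays() and by passing a list of tuples to the Index constructor. The variable names `by_arrays` and `by_index` are illustrative only.

```python
import pandas as pd

arrays = [
    ["bar", "bar", "baz", "baz"],
    ["one", "two", "one", "two"],
]

# One array per level; the names argument is optional.
by_arrays = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

# The plain Index constructor returns a MultiIndex for a list of tuples.
by_index = pd.Index(list(zip(*arrays)))

print(type(by_index).__name__)
# equals() compares the labels and ignores the level names.
print(by_arrays.equals(by_index))
```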

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [7]:
arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
s = pd.Series(np.random.randn(8), index=arrays)
s

bar  one   -0.144658
     two   -0.666538
baz  one    0.354465
     two   -0.771971
foo  one    1.095892
     two    0.835358
qux  one   -0.525660
     two   -0.410560
dtype: float64

In [8]:
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df

                0         1         2         3
bar one -1.084033 -0.082323  0.562574 -0.117555
    two  1.395254 -0.590103  0.914462  0.439275
baz one -2.074609  0.281913  0.908905  0.683893
    two -0.540291 -0.105286 -0.069525  1.504813
foo one  0.461842 -0.192962 -1.462920  0.128575
    two  2.332838  0.666876  0.095966 -1.270526
qux one  0.726911 -0.576219  0.353949  1.421193
    two -0.649242  0.202796  0.225368 -0.471435


All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [9]:
df.index.names

FrozenList([None, None])

This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [10]:
df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)
df

first        bar                 baz                 foo                 qux
second       one       two       one       two       one       two       one       two
A       1.272169 -0.284021 -0.003127  0.187370  1.351286 -0.753089  0.545699 -0.720463
B       0.923790 -0.534629 -0.000299 -0.744271  1.187827 -2.411475 -0.471519 -0.061372
C      -1.566738 -2.231442 -1.000591  0.852539  1.496268 -0.663856 -0.097075 -0.533184


In [11]:
pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])

first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one    -1.280923  1.336167 -0.010618 -0.131904  0.845112 -0.391942
      two    -0.342458  0.425480 -0.687120  0.313391  1.181827  1.692985
baz   one     0.525066  0.100396  0.235063 -0.918427 -1.345773 -0.106401
      two    -0.645827 -0.654009 -0.815013  0.563947  0.486787 -0.471484
foo   one     0.757025 -0.388679  2.548522  0.676742  0.992243 -0.150709
      two     0.273364  0.100220 -0.979245  1.756606 -2.997441  0.089973


We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the multi_sparse option in pandas.set_option():

In [12]:
with pd.option_context("display.multi_sparse", False):
    # A bare expression inside a with block is not echoed, so print explicitly.
    print(df)

It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

In [13]:
pd.Series(np.random.randn(8), index=tuples)

(bar, one)   -0.761363
(bar, two)   -0.669149
(baz, one)    0.384573
(baz, two)   -2.213157
(foo, one)    2.102375
(foo, two)    0.342983
(qux, one)   -0.827745
(qux, two)    0.897601
dtype: float64

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.

### Reconstructing the level labels
The method get_level_values() will return a vector of the labels for each location at a particular level:

In [14]:
index.get_level_values(0)

Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [15]:
index.get_level_values("second")

Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

### Basic indexing on axis with MultiIndex
One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [16]:
df["bar"]

second       one       two
A       1.272169 -0.284021
B       0.923790 -0.534629
C      -1.566738 -2.231442


In [17]:
df["bar", "one"]

A    1.272169
B    0.923790
C   -1.566738
Name: (bar, one), dtype: float64

In [18]:
df["bar"]["one"]

A    1.272169
B    0.923790
C   -1.566738
Name: one, dtype: float64

In [19]:
s["qux"]

one   -0.52566
two   -0.41056
dtype: float64

See Cross-section with hierarchical index for how to select on a deeper level.

### Defined levels
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

In [20]:
df.columns.levels  # original MultiIndex

FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [21]:
df[["foo","qux"]].columns.levels  # sliced

FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.

In [22]:
df[["foo", "qux"]].columns.to_numpy()

array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

In [23]:
# for a specific level
df[["foo", "qux"]].columns.get_level_values(0)

Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

In [24]:
new_mi = df[["foo", "qux"]].columns.remove_unused_levels()
new_mi.levels

FrozenList([['foo', 'qux'], ['one', 'two']])

### Data alignment and using reindex
Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [25]:
s + s[:-2]

bar  one   -0.289317
     two   -1.333076
baz  one    0.708930
     two   -1.543943
foo  one    2.191785
     two    1.670715
qux  one         NaN
     two         NaN
dtype: float64

In [26]:
s + s[::2]

bar  one   -0.289317
     two         NaN
baz  one    0.708930
     two         NaN
foo  one    2.191785
     two         NaN
qux  one   -1.051320
     two         NaN
dtype: float64

The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

In [27]:
s.reindex(index[:3])

first  second
bar    one      -0.144658
       two      -0.666538
baz    one       0.354465
dtype: float64

In [28]:
s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])

foo  two    0.835358
bar  one   -0.144658
qux  one   -0.525660
baz  one    0.354465
dtype: float64

## Advanced indexing with hierarchical index
Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [29]:
df = df.T
df

                     A         B         C
first second
bar   one     1.272169  0.923790 -1.566738
      two    -0.284021 -0.534629 -2.231442
baz   one    -0.003127 -0.000299 -1.000591
      two     0.187370 -0.744271  0.852539
foo   one     1.351286  1.187827  1.496268
      two    -0.753089 -2.411475 -0.663856
qux   one     0.545699 -0.471519 -0.097075
      two    -0.720463 -0.061372 -0.533184


In [30]:
df.loc[("bar", "two")]

A   -0.284021
B   -0.534629
C   -2.231442
Name: (bar, two), dtype: float64

Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.

If you also want to index a specific column with .loc, you must use a tuple like this:

In [31]:
df.loc[("bar", "two"), "A"]

-0.2840208459366906

You don’t have to specify all levels of the MultiIndex; passing only the first elements of the tuple is enough. For example, you can use “partial” indexing to get all elements with bar in the first level as follows:

In [32]:
df.loc["bar"]

               A         B         C
second
one     1.272169  0.923790 -1.566738
two    -0.284021 -0.534629 -2.231442


This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).

“Partial” slicing also works quite nicely.

In [33]:
df.loc["baz":"foo"]

                     A         B         C
first second
baz   one    -0.003127 -0.000299 -1.000591
      two     0.187370 -0.744271  0.852539
foo   one     1.351286  1.187827  1.496268
      two    -0.753089 -2.411475 -0.663856


You can slice with a ‘range’ of values, by providing a slice of tuples.

In [34]:
df.loc[("baz", "two"):("qux", "one")]

                     A         B         C
first second
baz   two     0.187370 -0.744271  0.852539
foo   one     1.351286  1.187827  1.496268
      two    -0.753089 -2.411475 -0.663856
qux   one     0.545699 -0.471519 -0.097075


In [35]:
df.loc[("baz", "two"):"foo"]

                     A         B         C
first second
baz   two     0.187370 -0.744271  0.852539
foo   one     1.351286  1.187827  1.496268
      two    -0.753089 -2.411475 -0.663856


Passing a list of labels or tuples works similarly to reindexing:

In [36]:
df.loc[[("bar", "two"), ("qux", "one")]]

                     A         B         C
first second
bar   two    -0.284021 -0.534629 -2.231442
qux   one     0.545699 -0.471519 -0.097075


>Note
>
>It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:

In [37]:
s = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
)
s.loc[[("A", "c"), ("B", "d")]]  # list of tuples

A  c    1
B  d    5
dtype: int64

In [38]:
s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists

A  c    1
   d    2
B  c    4
   d    5
dtype: int64

### Using slicers
You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.
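
To illustrate the inclusive endpoints, here is a minimal sketch on a small hypothetical frame (the labels and the `value` column are made up for this example). Both axes are specified in the `.loc` call.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["A0", "A1", "A2", "A3"], ["B0", "B1"]])
df_small = pd.DataFrame({"value": range(8)}, index=mi)

# Label slices include both endpoints, so "A1" and "A2" are both selected.
sel = df_small.loc[(slice("A1", "A2"), slice(None)), :]

first_level = sel.index.get_level_values(0).unique().tolist()
print(first_level)
print(len(sel))
```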

>Warning
>
>You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.
>
>You should do this:
>
>```python
>df.loc[(slice("A1", "A3"), ...), :]
>```
>
>You should not do this:
>
>```python
>df.loc[(slice("A1", "A3"), ...)]
>```


In [39]:
def mklbl(prefix, n):
    return [f"{prefix}{i}" for i in range(n)]

miindex = pd.MultiIndex.from_product(
    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
)

micolumns = pd.MultiIndex.from_tuples(
    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
)

dfmi = (
    pd.DataFrame(
        np.arange(len(miindex) * len(micolumns)).reshape(
            (len(miindex), len(micolumns))
        ),
        index=miindex,
        columns=micolumns,
    )
    .sort_index()
    .sort_index(axis=1)
)

dfmi

lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250


Basic MultiIndex slicing using slices, lists, and labels.

In [40]:
dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]

lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
         D1  125  124  127  126
A2 B0 C1 D0  137  136  139  138
         D1  141  140  143  142


You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).

In [41]:
idx = pd.IndexSlice
dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]

lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
         D1   60   62
A1 B0 C1 D0   72   74
         D1   76   78


It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [42]:
dfmi.loc["A1", (slice(None), "foo")]

lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
      D1   84   86
   C3 D0   88   90
      D1   92   94
B1 C0 D0   96   98
      D1  100  102


In [43]:
dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]

lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
         D1   60   62
A1 B0 C1 D0   72   74
         D1   76   78


Using a boolean indexer you can provide selection related to the values.

In [44]:
mask = dfmi[("a", "foo")] > 200
dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]

lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254


You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [45]:
dfmi.loc(axis=0)[:, :, ["C1", "C3"]]

lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
         D1   45   44   47   46
      C3 D0   57   56   59   58
         D1   61   60   63   62
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78


Furthermore, you can set the values using the following methods.

In [46]:
df2 = dfmi.copy()
df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10
df2

lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10


The right-hand side of the assignment can also be an alignable object.

In [47]:
df2 = dfmi.copy()
df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000
df2

lvl0              a               b
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000


### Cross-section
The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [48]:
df

                     A         B         C
first second
bar   one     1.272169  0.923790 -1.566738
      two    -0.284021 -0.534629 -2.231442
baz   one    -0.003127 -0.000299 -1.000591
      two     0.187370 -0.744271  0.852539
foo   one     1.351286  1.187827  1.496268
      two    -0.753089 -2.411475 -0.663856
qux   one     0.545699 -0.471519 -0.097075
      two    -0.720463 -0.061372 -0.533184


In [49]:
df.xs("one", level="second")

              A         B         C
first
bar    1.272169  0.923790 -1.566738
baz   -0.003127 -0.000299 -1.000591
foo    1.351286  1.187827  1.496268
qux    0.545699 -0.471519 -0.097075


In [50]:
# using the slicers
df.loc[(slice(None), "one"), :]

                     A         B         C
first second
bar   one     1.272169  0.923790 -1.566738
baz   one    -0.003127 -0.000299 -1.000591
foo   one     1.351286  1.187827  1.496268
qux   one     0.545699 -0.471519 -0.097075


You can also select on the columns with xs, by providing the axis argument.

In [51]:
df = df.T
df.xs("one", level="second", axis=1)

first       bar       baz       foo       qux
A      1.272169 -0.003127  1.351286  0.545699
B      0.923790 -0.000299  1.187827 -0.471519
C     -1.566738 -1.000591  1.496268 -0.097075


In [52]:
# using the slicers
df.loc[:, (slice(None), "one")]

first        bar       baz       foo       qux
second       one       one       one       one
A       1.272169 -0.003127  1.351286  0.545699
B       0.923790 -0.000299  1.187827 -0.471519
C      -1.566738 -1.000591  1.496268 -0.097075


xs also allows selection with multiple keys.

In [53]:
df.xs(("one", "bar"), level=("second", "first"), axis=1)

first        bar
second       one
A       1.272169
B       0.923790
C      -1.566738


In [54]:
# using the slicers
df.loc[:, ("bar", "one")]

A    1.272169
B    0.923790
C   -1.566738
Name: (bar, one), dtype: float64

You can pass drop_level=False to xs to retain the level that was selected.

In [55]:
df.xs("one", level="second", axis=1, drop_level=False)

first        bar       baz       foo       qux
second       one       one       one       one
A       1.272169 -0.003127  1.351286  0.545699
B       0.923790 -0.000299  1.187827 -0.471519
C      -1.566738 -1.000591  1.496268 -0.097075


Compare the above with the result using drop_level=True (the default value).

In [56]:
df.xs("one", level="second", axis=1, drop_level=True)

first       bar       baz       foo       qux
A      1.272169 -0.003127  1.351286  0.545699
B      0.923790 -0.000299  1.187827 -0.471519
C     -1.566738 -1.000591  1.496268 -0.097075


### Advanced reindexing and alignment
Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:

In [57]:
midx = pd.MultiIndex(
    levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
)
df = pd.DataFrame(np.random.randn(4, 2), index=midx)
df

               0         1
one  y -2.503694  0.850641
     x  0.976478  0.839555
zero y -0.485034  0.853295
     x  0.694438 -2.062692


In [58]:
df2 = df.groupby(level=0).mean()
df2

             0         1
one  -0.763608  0.845098
zero  0.104702 -0.604698


In [59]:
df2.reindex(df.index, level=0)

               0         1
one  y -0.763608  0.845098
     x -0.763608  0.845098
zero y  0.104702 -0.604698
     x  0.104702 -0.604698


In [60]:
# aligning
df_aligned, df2_aligned = df.align(df2, level=0)
df_aligned

               0         1
one  y -2.503694  0.850641
     x  0.976478  0.839555
zero y -0.485034  0.853295
     x  0.694438 -2.062692


In [61]:
df2_aligned

               0         1
one  y -0.763608  0.845098
     x -0.763608  0.845098
zero y  0.104702 -0.604698
     x  0.104702 -0.604698
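
One common use of this broadcasting is subtracting a group-level statistic from every row of the group. The following is a minimal sketch with made-up, deterministic values; the frame, the column name `v`, and the variable names are hypothetical and not part of the examples above.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([["one", "zero"], ["x", "y"]])
data = pd.DataFrame({"v": [1.0, 3.0, 10.0, 20.0]}, index=midx)

# One mean per label of the first level.
means = data.groupby(level=0).mean()

# Broadcast each group's mean to every row belonging to that group.
broadcast = means.reindex(data.index, level=0)

# Both frames now share the same MultiIndex, so subtraction aligns row by row.
demeaned = data - broadcast
print(demeaned)
```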


### Swapping levels with swaplevel
The swaplevel() method can switch the order of two levels:

In [62]:
df[:5]

               0         1
one  y -2.503694  0.850641
     x  0.976478  0.839555
zero y -0.485034  0.853295
     x  0.694438 -2.062692


In [63]:
df[:5].swaplevel(0, 1, axis=0)

               0         1
y one  -2.503694  0.850641
x one   0.976478  0.839555
y zero -0.485034  0.853295
x zero  0.694438 -2.062692


### Reordering levels with reorder_levels
The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:

In [64]:
df[:5].reorder_levels([1, 0], axis=0)

               0         1
y one  -2.503694  0.850641
x one   0.976478  0.839555
y zero -0.485034  0.853295
x zero  0.694438 -2.062692
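
With only two levels, reorder_levels and swaplevel give the same result. The difference shows with three or more levels, where a full permutation would otherwise need several swaps. A minimal sketch, assuming a hypothetical three-level Series with level names `outer`, `middle`, and `inner`:

```python
import pandas as pd

mi = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2], ["x", "y"]], names=["outer", "middle", "inner"]
)
s3 = pd.Series(range(8), index=mi)

# Rotate all three levels in a single call; levels may be given by name
# or by position.
reordered = s3.reorder_levels(["inner", "outer", "middle"])

print(list(reordered.index.names))
print(reordered.index[-1])
```

Note that reorder_levels only rearranges the levels; the row order is unchanged, so the result is generally no longer sorted.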


### Renaming names of an Index or MultiIndex
The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.

In [65]:
df.rename(columns={0: "col0", 1: "col1"})

            col0      col1
one  y -2.503694  0.850641
     x  0.976478  0.839555
zero y -0.485034  0.853295
     x  0.694438 -2.062692


This method can also be used to rename specific labels of the main index of the DataFrame.

In [66]:
df.rename(index={"one": "two", "y": "z"})

               0         1
two  z -2.503694  0.850641
     x  0.976478  0.839555
zero z -0.485034  0.853295
     x  0.694438 -2.062692


The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.

In [67]:
df.rename_axis(index=["abc", "def"])

                 0         1
abc  def
one  y   -2.503694  0.850641
     x    0.976478  0.839555
zero y   -0.485034  0.853295
     x    0.694438 -2.062692


Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.

In [68]:
df.rename_axis(columns="Cols").columns

RangeIndex(start=0, stop=2, step=1, name='Cols')

Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.
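
As a minimal sketch of the mapping forms, the following passes a function to rename and both a dictionary and a function to rename_axis. The frame and its level names are hypothetical, chosen only for this illustration.

```python
import pandas as pd

mi = pd.MultiIndex.from_product(
    [["one", "zero"], ["x", "y"]], names=["abc", "def"]
)
frame = pd.DataFrame({"col": range(4)}, index=mi)

# A function passed to rename is applied to the labels of every level.
upper_labels = frame.rename(index=str.upper)

# A dictionary passed to rename_axis maps old level names to new ones;
# names missing from the dictionary are left unchanged.
renamed_dict = frame.rename_axis(index={"abc": "first"})

# A function passed to rename_axis is applied to each level name.
renamed_func = frame.rename_axis(index=str.upper)

print(upper_labels.index[0])
print(list(renamed_dict.index.names))
print(list(renamed_func.index.names))
```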

When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names.

In [69]:
mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])
mi.names

FrozenList(['x', 'y'])

In [70]:
mi2 = mi.rename("new name", level=0)
mi2

MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])

You cannot set the names of the MultiIndex via a level.

```python
mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-100-35d32a9a5218> in <module>
----> 1 mi.levels[0].name = "name via level"

/pandas/pandas/core/indexes/base.py in name(self, value)
   1241         if self._no_setting_name:
   1242             # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1243             raise RuntimeError(
   1244                 "Cannot set name on a level of a MultiIndex. Use "
   1245                 "'MultiIndex.set_names' instead."

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.
```
Use Index.set_names() instead.
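
A minimal sketch of the supported route, using set_names with and without the level argument (the new names are arbitrary placeholders):

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

# Rename a single level; set_names returns a new index by default.
mi_one = mi.set_names("name via set_names", level=0)

# Rename all levels at once by passing one name per level.
mi_all = mi.set_names(["p", "q"])

print(list(mi_one.names))
print(list(mi_all.names))
print(list(mi.names))  # the original index is left untouched
```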

## Sorting a MultiIndex
For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().

In [71]:
import random
random.shuffle(tuples)
s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
s

bar  one    1.187945
foo  one   -1.088132
qux  two   -0.235409
foo  two   -1.075064
qux  one   -2.714405
bar  two    0.140647
baz  two   -1.412481
     one   -0.105408
dtype: float64

In [72]:
s.sort_index()

bar  one    1.187945
     two    0.140647
baz  one   -0.105408
     two   -1.412481
foo  one   -1.088132
     two   -1.075064
qux  one   -2.714405
     two   -0.235409
dtype: float64

In [73]:
s.sort_index(level=0)

bar  one    1.187945
     two    0.140647
baz  one   -0.105408
     two   -1.412481
foo  one   -1.088132
     two   -1.075064
qux  one   -2.714405
     two   -0.235409
dtype: float64

In [74]:
s.sort_index(level=1)

bar  one    1.187945
baz  one   -0.105408
foo  one   -1.088132
qux  one   -2.714405
bar  two    0.140647
baz  two   -1.412481
foo  two   -1.075064
qux  two   -0.235409
dtype: float64

You may also pass a level name to sort_index if the MultiIndex levels are named.

In [75]:
s.index.set_names(["L1", "L2"], inplace=True)
s.sort_index(level="L1")

L1   L2 
bar  one    1.187945
     two    0.140647
baz  one   -0.105408
     two   -1.412481
foo  one   -1.088132
     two   -1.075064
qux  one   -2.714405
     two   -0.235409
dtype: float64

In [76]:
s.sort_index(level="L2")

L1   L2 
bar  one    1.187945
baz  one   -0.105408
foo  one   -1.088132
qux  one   -2.714405
bar  two    0.140647
baz  two   -1.412481
foo  two   -1.075064
qux  two   -0.235409
dtype: float64

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [77]:
df.T.sort_index(level=1, axis=1)

        one      zero       one      zero
          x         x         y         y
0  0.976478  0.694438 -2.503694 -0.485034
1  0.839555 -2.062692  0.850641  0.853295


Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

In [78]:
dfm = pd.DataFrame(
    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
)
dfm = dfm.set_index(["jim", "joe"])
dfm

            jolie
jim joe
0   x    0.684468
    x    0.592105
1   z    0.318557
    y    0.312839


In [79]:
dfm.loc[(1, 'z')]

PerformanceWarning: indexing past lexsort depth may impact performance.
  dfm.loc[(1, 'z')]


            jolie
jim joe
1   z    0.318557


Furthermore, if you try to index something that is not fully lexsorted, this can raise:

```python
dfm.loc[(0, 'y'):(1, 'z')]
# UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
```
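
The failure and its fix can be sketched in a self-contained way as follows. The `jolie` values here are fixed placeholders rather than the random numbers used above, and the variable names are illustrative.

```python
import pandas as pd

unsorted = pd.DataFrame(
    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": [0.1, 0.2, 0.3, 0.4]}
).set_index(["jim", "joe"])

raised = False
try:
    # Slicing with full-length tuple keys needs a fully lexsorted index.
    unsorted.loc[(0, "y"):(1, "z")]
except pd.errors.UnsortedIndexError:
    raised = True

# Sorting the index first makes the same slice valid.
result = unsorted.sort_index().loc[(0, "y"):(1, "z")]

print(raised)
print(result)
```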

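The is_lexsorted() method on a MultiIndex shows whether the index is lexicographically sorted, and the lexsort_depth property returns the sort depth. (From pandas 1.3 onward, is_lexsorted() is deprecated in favor of is_monotonic_increasing.)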

In [80]:
dfm.index.is_lexsorted()

False

In [81]:
dfm.index.lexsort_depth

1

In [82]:
dfm = dfm.sort_index()
dfm

            jolie
jim joe
0   x    0.684468
    x    0.592105
1   y    0.312839
    z    0.318557


In [83]:
dfm.index.is_lexsorted()

True

In [84]:
dfm.index.lexsort_depth

2

And now selection works as expected.

In [85]:
dfm.loc[(0, "y"):(1, "z")]

            jolie
jim joe
1   y    0.312839
    z    0.318557


## Take methods
Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method, which retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.

In [86]:
index = pd.Index(np.random.randint(0, 1000, 10))
index

Int64Index([308, 375, 393, 965, 510, 527, 733, 876, 786, 315], dtype='int64')

In [87]:
positions = [0, 9, 3]
index[positions]

Int64Index([308, 315, 965], dtype='int64')

In [88]:
index.take(positions)

Int64Index([308, 315, 965], dtype='int64')

In [89]:
ser = pd.Series(np.random.randn(10))
ser.iloc[positions]

0   -1.324203
9    2.065381
3   -1.406986
dtype: float64

In [90]:
ser.take(positions)

0   -1.324203
9    2.065381
3   -1.406986
dtype: float64

For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

In [91]:
frm = pd.DataFrame(np.random.randn(5, 3))
frm.take([1, 4, 3])

Unnamed: 0,0,1,2
1,1.068843,-0.669149,1.481131
4,0.79298,-0.214509,-0.483197
3,-0.682371,-0.198516,0.524203


In [92]:
frm.take([0, 2], axis=1)

Unnamed: 0,0,2
0,-0.129224,-0.427613
1,1.068843,1.481131
2,0.478187,0.889799
3,-0.682371,0.524203
4,0.79298,-0.483197
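
Negative positions, mentioned at the start of this section, are not shown in the examples above. A minimal sketch with made-up data (the names last_two and last_col are illustrative):

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

# -1 refers to the last element, -2 to the one before it.
last_two = ser.take([-1, -2])

frame = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})

# Negative positions also work along the column axis.
last_col = frame.take([-1], axis=1)

print(last_two)
print(last_col)
```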


It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.

In [93]:
arr = np.random.randn(10)
arr.take([False, False, True, True])

array([ 0.66446529,  0.66446529, -0.95594464, -0.95594464])

In [94]:
arr[[0, 1]]

array([ 0.66446529, -0.95594464])

In [95]:
ser = pd.Series(np.random.randn(10))
ser.take([False, False, True, True])

0   -0.368315
0   -0.368315
1   -0.903296
1   -0.903296
dtype: float64
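
If a boolean mask is what you have, one option is to convert it to integer positions before calling take. The sketch below uses np.flatnonzero for that conversion; the data and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.5, 2.5, 3.5, 4.5])
mask = np.array([False, False, True, True])

# Convert the mask to the integer positions of its True entries.
positions = np.flatnonzero(mask)
picked = ser.take(positions)

print(picked)
```

The result matches ordinary boolean indexing with ser[mask].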

Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.

In [96]:
arr = np.random.randn(10000, 5)
indexer = np.arange(10000)
np.random.shuffle(indexer)

%timeit arr[indexer]
%timeit arr.take(indexer, axis=0)

159 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
69.1 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [97]:
ser = pd.Series(arr[:, 0])

%timeit ser.iloc[indexer]
%timeit ser.take(indexer)

182 µs ± 8.93 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
174 µs ± 3.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


## Index types
We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex is found here.

In the following sub-sections we will highlight some other index types.

### CategoricalIndex
CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

In [98]:
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
df

Unnamed: 0,A,B
0,0,a
1,1,a
2,2,b
3,3,b
4,4,c
5,5,a


In [99]:
df.dtypes

A       int64
B    category
dtype: object

In [100]:
df["B"].cat.categories

Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex.

In [101]:
df2 = df.set_index("B")
df2.index

CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the category or the operation will raise a KeyError.

In [102]:
df2.loc["a"]

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
a,0
a,1
a,5


The CategoricalIndex is preserved after indexing:

In [103]:
df2.loc["a"].index

CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
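
The KeyError mentioned above for labels outside the categories can be seen in a short sketch. It rebuilds the same frame as df2; the label "z" and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
df2 = df.set_index("B")

try:
    df2.loc["z"]  # "z" is not one of the categories c, a, b
    missing_raises = False
except KeyError:
    missing_raises = True

# A label that is a category selects every matching row.
rows_for_a = df2.loc["a", "A"].tolist()

print(missing_raises, rows_for_a)
```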

Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

In [104]:
df2.sort_index()

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
c,4
a,0
a,1
a,5
b,2
b,3


Groupby operations on the index will preserve the index nature as well.

In [105]:
df2.groupby(level=0).sum()

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
c,4
a,6
b,5


In [106]:
df2.groupby(level=0).sum().index

CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

In [107]:
df3 = pd.DataFrame(
    {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
)
df3 = df3.set_index("B")
df3

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
a,0
b,1
c,2


In [108]:
df3.reindex(["a", "e"])

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
a,0.0
e,


In [109]:
df3.reindex(["a", "e"]).index

Index(['a', 'e'], dtype='object', name='B')

In [110]:
df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
a,0.0
e,


In [111]:
df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index

CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, name='B', dtype='category')

>Warning
>
>Reshaping and Comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised, for example:
>
>TypeError: categories must match existing categories when appending

In [112]:
df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})
df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))
df4 = df4.set_index("B")
df4.index

CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, name='B', dtype='category')

In [113]:
df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})
df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))
df5 = df5.set_index("B")
df5.index

CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, name='B', dtype='category')

In [114]:
pd.concat([df4, df5])

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
b,0
a,1
b,0
c,1
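
The comparison case from the warning can be checked directly. This is a sketch assuming a reasonably recent pandas version; the index values are illustrative.

```python
import pandas as pd

left = pd.CategoricalIndex(["b", "a"], categories=["a", "b"])
right = pd.CategoricalIndex(["b", "c"], categories=["b", "c"])

try:
    left == right  # categories differ, so the comparison is rejected
    comparison_raises = False
except TypeError:
    comparison_raises = True

# Comparing against an index with identical categories is fine.
same = pd.CategoricalIndex(["b", "b"], categories=["a", "b"])
matches = (left == same).tolist()

print(comparison_raises, matches)
```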


### Int64Index and RangeIndex
Int64Index is a fundamental index type in pandas. It is an immutable array implementing an ordered, sliceable set.

RangeIndex is a sub-class of Int64Index that provides the default index for all NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are analogous to Python range types.
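
A short sketch of the analogy with Python range, using illustrative values:

```python
import pandas as pd

# A Series created without an explicit index gets a RangeIndex.
ser = pd.Series(["a", "b", "c"])
default_index = ser.index

# A RangeIndex is described by start, stop, and step, like a Python range.
explicit = pd.RangeIndex(start=0, stop=10, step=2)

print(default_index)
print(list(explicit))
```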

### Float64Index
By default a Float64Index will be automatically created when passing floating, or mixed-integer-floating values in index creation. This enables a pure label-based slicing paradigm that makes [] and .loc work exactly the same for scalar indexing and slicing.

In [115]:
indexf = pd.Index([1.5, 2, 3, 4.5, 5])
indexf

Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')

In [116]:
sf = pd.Series(range(5), index=indexf)
sf

1.5    0
2.0    1
3.0    2
4.5    3
5.0    4
dtype: int64

Scalar selection for [],.loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0).

In [117]:
sf[3]

2

In [118]:
sf[3.0]

2

In [119]:
sf.loc[3]

2

In [120]:
sf.loc[3.0]

2

The only positional indexing is via iloc.

In [121]:
sf.iloc[3]

3
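
The difference between label-based and positional access on a float index can be summarized in one self-contained sketch (same data as sf above; the variable names are illustrative):

```python
import pandas as pd

sf = pd.Series(range(5), index=[1.5, 2, 3, 4.5, 5])

by_int_label = sf.loc[3]      # label 3 matches the float label 3.0
by_float_label = sf.loc[3.0]  # same element
by_position = sf.iloc[3]      # fourth element, whose label is 4.5

print(by_int_label, by_float_label, by_position)
```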

A scalar index that is not found will raise a KeyError. Slicing is primarily on the values of the index when using [] and .loc, and always positional when using iloc. The exception is when the slice is boolean, in which case it will always be positional.

In [122]:
sf[2:4]

2.0    1
3.0    2
dtype: int64

In [123]:
sf.loc[2:4]

2.0    1
3.0    2
dtype: int64

In [124]:
sf.iloc[2:4]

3.0    2
4.5    3
dtype: int64

In float indexes, slicing using floats is allowed.

In [125]:
sf[2.1:4.6]

3.0    2
4.5    3
dtype: int64

In [126]:
sf.loc[2.1:4.6]

3.0    2
4.5    3
dtype: int64

In non-float indexes, slicing using floats will raise a TypeError.

```python
pd.Series(range(5))[3.5]
# TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

pd.Series(range(5))[3.5:4.5]
# TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
```

Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could, for example, be millisecond offsets.

In [127]:
dfir = pd.concat(
    [
        pd.DataFrame(
            np.random.randn(5, 2), index=np.arange(5) * 250.0, columns=list("AB")
        ),
        pd.DataFrame(
            np.random.randn(6, 2),
            index=np.arange(4, 10) * 250.1,
            columns=list("AB"),
        ),
    ]
)

dfir

Unnamed: 0,A,B
0.0,0.671263,1.549384
250.0,0.829938,-1.526808
500.0,0.293795,-1.536554
750.0,1.251101,0.779655
1000.0,0.516474,-0.133739
1000.4,0.253315,0.389848
1250.5,-0.329006,-0.541991
1500.6,-0.289555,-0.12521
1750.7,-0.193879,0.239935
2000.8,-0.705685,0.32961


Selection operations then will always work on a value basis, for all selection operators.

In [128]:
dfir[0:1000.4]

Unnamed: 0,A,B
0.0,0.671263,1.549384
250.0,0.829938,-1.526808
500.0,0.293795,-1.536554
750.0,1.251101,0.779655
1000.0,0.516474,-0.133739
1000.4,0.253315,0.389848


In [129]:
dfir.loc[0:1001, "A"]

0.0       0.671263
250.0     0.829938
500.0     0.293795
750.0     1.251101
1000.0    0.516474
1000.4    0.253315
Name: A, dtype: float64

In [130]:
dfir.loc[1000.4]

A    0.253315
B    0.389848
Name: 1000.4, dtype: float64

You could retrieve the first 1 second (1000 ms) of data as such:

In [131]:
dfir[0:1000]

Unnamed: 0,A,B
0.0,0.671263,1.549384
250.0,0.829938,-1.526808
500.0,0.293795,-1.536554
750.0,1.251101,0.779655
1000.0,0.516474,-0.133739


If you need integer based selection, you should use iloc:

In [132]:
dfir.iloc[0:5]

Unnamed: 0,A,B
0.0,0.671263,1.549384
250.0,0.829938,-1.526808
500.0,0.293795,-1.536554
750.0,1.251101,0.779655
1000.0,0.516474,-0.133739
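
The same contrast can be reproduced with fixed data. In this sketch the millisecond offsets and the column values are made up for illustration.

```python
import pandas as pd

offsets_ms = [0.0, 250.0, 500.0, 750.0, 1000.0, 1000.4, 1250.5]
readings = pd.DataFrame({"A": range(7)}, index=offsets_ms)

first_second = readings.loc[0:1000]  # label based, 1000.0 is included
first_five = readings.iloc[0:5]      # position based, stop is excluded

print(first_second)
```

Both selections return the same five rows here, but only the label-based one keeps meaning "the first second" if the sampling rate changes.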


### IntervalIndex
IntervalIndex, together with its own dtype, IntervalDtype, as well as the Interval scalar type, allows first-class support in pandas for interval notation.

The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

#### Indexing with an IntervalIndex
An IntervalIndex can be used in Series and in DataFrame as the index.

In [133]:
df = pd.DataFrame(
    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
)

df

Unnamed: 0,A
"(0, 1]",1
"(1, 2]",2
"(2, 3]",3
"(3, 4]",4


Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [134]:
df.loc[2]

A    2
Name: (1, 2], dtype: int64

In [135]:
df.loc[[2, 3]]

Unnamed: 0,A
"(1, 2]",2
"(2, 3]",3


If you select a label contained within an interval, this will also select the interval.

In [136]:
df.loc[2.5]

A    3
Name: (2, 3], dtype: int64

In [137]:
df.loc[[2.5, 3.5]]

Unnamed: 0,A
"(2, 3]",3
"(3, 4]",4

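As a compact sketch of the two cases above, the following rebuilds the same frame and looks up one edge label and one interior label. The variable names are illustrative, and get_loc is used only to show which position a label resolves to.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
)

edge_value = df.loc[2, "A"]       # 2 is the closed right edge of (1, 2]
inner_value = df.loc[2.5, "A"]    # 2.5 lies inside (2, 3]
position = df.index.get_loc(2.5)  # integer position of the containing interval

print(edge_value, inner_value, position)
```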

Selecting using an Interval will only return exact matches (starting from pandas 0.25.0).

In [138]:
df.loc[pd.Interval(1, 2)]

A    2
Name: (1, 2], dtype: int64

Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.
```python
df.loc[pd.Interval(0.5, 2.5)]
# KeyError: Interval(0.5, 2.5, closed='right')
```

Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.

In [139]:
idxr = df.index.overlaps(pd.Interval(0.5, 2.5))
idxr

array([ True,  True,  True, False])

In [140]:
df[idxr]

Unnamed: 0,A
"(0, 1]",1
"(1, 2]",2
"(2, 3]",3

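Exact matching and overlap matching can be contrasted in one self-contained sketch (same frame as above; the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
)
query = pd.Interval(0.5, 2.5)

try:
    df.loc[query]  # no interval in the index equals (0.5, 2.5]
    exact_match_found = True
except KeyError:
    exact_match_found = False

# overlaps() returns a boolean array that can be used as a row mask.
overlapping = df[df.index.overlaps(query)]

print(exact_match_found)
print(overlapping)
```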

### Binning data with cut and qcut
cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.

In [141]:
c = pd.cut(range(4), bins=2)
c

[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

In [142]:
c.categories

IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
              closed='right',
              dtype='interval[float64]')

cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.

In [143]:
pd.cut([0, 3, 5, 1], bins=c.categories)

[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

Any value which falls outside all bins will be assigned a NaN value.
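
The bin-reuse idiom can be sketched end to end as follows. The variable names are illustrative, and the codes attribute is used only to make the out-of-range value easy to spot (it is -1 for NaN entries).

```python
import pandas as pd

# Step 1: let cut() derive two equal-width bins from reference data.
reference = pd.cut([0, 1, 2, 3], bins=2)
bins = reference.categories  # an IntervalIndex

# Step 2: reuse exactly the same bins for new data.
binned = pd.cut([0, 3, 5, 1], bins=bins)

print(binned)
print(binned.codes)
```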

### Generating ranges of intervals
If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and calendar day for datetime-like intervals:

In [144]:
pd.interval_range(start=0, end=5)

IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              closed='right',
              dtype='interval[int64]')

In [145]:
pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)

IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04], (2017-01-04, 2017-01-05]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [146]:
pd.interval_range(end=pd.Timedelta("3 days"), periods=3)

IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]],
              closed='right',
              dtype='interval[timedelta64[ns]]')

The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

In [147]:
pd.interval_range(start=0, periods=5, freq=1.5)

IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]],
              closed='right',
              dtype='interval[float64]')

In [148]:
pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")

IntervalIndex([(2017-01-01, 2017-01-08], (2017-01-08, 2017-01-15], (2017-01-15, 2017-01-22], (2017-01-22, 2017-01-29]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [149]:
pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9H")

IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]],
              closed='right',
              dtype='interval[timedelta64[ns]]')

Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

In [150]:
pd.interval_range(start=0, end=4, closed="both")

IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]],
              closed='both',
              dtype='interval[int64]')

In [151]:
pd.interval_range(start=0, end=4, closed="neither")

IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)],
              closed='neither',
              dtype='interval[int64]')

Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

In [152]:
pd.interval_range(start=0, end=6, periods=4)

IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]],
              closed='right',
              dtype='interval[float64]')

In [153]:
pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)

IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]],
              closed='right',
              dtype='interval[datetime64[ns]]')
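
To see what the periods and closed parameters mean for the resulting intervals, the endpoints and membership can be inspected directly. A small sketch with illustrative variable names:

```python
import pandas as pd

# Four evenly spaced intervals between 0 and 6.
even = pd.interval_range(start=0, end=6, periods=4)
print(even.left.tolist(), even.right.tolist())

# Right-closed intervals (the default) exclude their left edge.
left_edge_included = 0 in even[0]

# With closed="both", each interval contains both of its endpoints.
closed_both = pd.interval_range(start=0, end=4, closed="both")
right_edge_included = 4 in closed_both[-1]

print(left_edge_included, right_edge_included)
```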

## Miscellaneous indexing FAQ
### Integer indexing
Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. The following code will generate exceptions:

```python
s = pd.Series(range(5))
s[-1]
# ValueError                                Traceback (most recent call last)
```

In [154]:
df = pd.DataFrame(np.random.randn(5, 4))
df

Unnamed: 0,0,1,2,3
0,-0.12087,-1.683649,0.07444,1.253468
1,1.473856,0.41547,0.109394,1.074921
2,0.137779,-0.455789,-1.492801,-0.642501
3,0.277223,-0.181083,-0.306003,0.028183
4,-0.438279,-0.595009,-0.733418,-0.745774


In [155]:
df.loc[-2:]

Unnamed: 0,0,1,2,3
0,-0.12087,-1.683649,0.07444,1.253468
1,1.473856,0.41547,0.109394,1.074921
2,0.137779,-0.455789,-1.492801,-0.642501
3,0.277223,-0.181083,-0.306003,0.028183
4,-0.438279,-0.595009,-0.733418,-0.745774


This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
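
A minimal sketch of the label-versus-position distinction on an integer index, using illustrative variable names:

```python
import pandas as pd

s = pd.Series(range(5))

has_label = -1 in s.index  # -1 is not one of the labels 0..4
last_value = s.iloc[-1]    # positional access to the last element
label_slice = s.loc[-2:]   # label slice: every label >= -2, i.e. everything

print(has_label, last_value, len(label_slice))
```

Using iloc for positions and loc for labels keeps the intent explicit regardless of what the integer labels happen to be.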

### Non-monotonic indexes require exact matches
If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing and is_monotonic_decreasing attributes.

In [156]:
df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))
df.index.is_monotonic_increasing

True

In [157]:
df.loc[0:4, :]

Unnamed: 0,data
2,0
3,1
3,2
4,3


In [158]:
# slice bounds are outside the index, so an empty DataFrame is returned
df.loc[13:15, :]

Unnamed: 0,data


On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.

In [159]:
df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))
df.index.is_monotonic_increasing

False

In [160]:
# OK because 2 and 4 are in the index
df.loc[2:4, :]

Unnamed: 0,data
2,0
3,1
1,2
4,3


```python
# 0 is not in the index
df.loc[0:4, :]
# KeyError: 0

# 3 is not a unique label
df.loc[2:3, :]
# KeyError: 'Cannot get right slice bound for non-unique label: 3'
```
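
The monotonic and non-monotonic cases can be put side by side in a runnable sketch. The frames mirror the ones above; the helper name slice_raises is a hypothetical convenience.

```python
import pandas as pd

mono = pd.DataFrame({"data": range(5)}, index=[2, 3, 3, 4, 5])
non_mono = pd.DataFrame({"data": range(6)}, index=[2, 3, 1, 4, 3, 5])


def slice_raises(df, start, stop):
    """Return True if the label slice start:stop raises KeyError."""
    try:
        df.loc[start:stop, :]
    except KeyError:
        return True
    return False


# Monotonic index: bounds may be absent from, or outside, the index.
in_range = mono.loc[0:4, :]
out_of_range = mono.loc[13:15, :]

# Non-monotonic index: bounds must be present and unique.
missing_bound = slice_raises(non_mono, 0, 4)
duplicated_bound = slice_raises(non_mono, 2, 3)
valid_bounds = slice_raises(non_mono, 2, 4)

print(len(in_range), len(out_of_range), missing_bound, duplicated_bound, valid_bounds)
```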

Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique attribute.

In [163]:
weakly_monotonic = pd.Index(["a", "b", "c", "c"])
weakly_monotonic

Index(['a', 'b', 'c', 'c'], dtype='object')

In [164]:
weakly_monotonic.is_monotonic_increasing

True

In [165]:
weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique

False
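
The combined check can be wrapped in a small helper. The function name is_strictly_increasing is hypothetical and not part of pandas.

```python
import pandas as pd


def is_strictly_increasing(index):
    """Weak monotonicity plus uniqueness implies strict monotonicity."""
    return bool(index.is_monotonic_increasing and index.is_unique)


weak = is_strictly_increasing(pd.Index(["a", "b", "c", "c"]))
strict = is_strictly_increasing(pd.Index(["a", "b", "c"]))

print(weak, strict)
```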

### Endpoints are inclusive
Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index. For example, consider the following Series:

In [166]:
s = pd.Series(np.random.randn(6), index=list("abcdef"))
s

a    0.859615
b    0.968770
c   -0.005287
d    0.340507
e   -0.093823
f    1.229598
dtype: float64

Suppose we wished to slice from c to e. Using integers, this would be accomplished as such:

In [167]:
s[2:5]

c   -0.005287
d    0.340507
e   -0.093823
dtype: float64

However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

```python
s.loc['c':'e' + 1]
```

A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:

In [169]:
s.loc["c":"e"]

c   -0.005287
d    0.340507
e   -0.093823
dtype: float64
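
The time series use case mentioned above can be sketched as follows. The dates and values are made up for illustration.

```python
import pandas as pd

ts = pd.Series(range(5), index=pd.date_range("2021-01-01", periods=5, freq="D"))

# Label-based slicing keeps both endpoints.
by_label = ts.loc["2021-01-02":"2021-01-04"]

# The equivalent positional slice needs a stop one past the last element.
by_position = ts.iloc[1:4]

print(by_label)
```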

This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.

### Indexing potentially changes underlying Series dtype
Different indexing operations can potentially change the dtype of a Series.

In [170]:
series1 = pd.Series([1, 2, 3])
series1.dtype

dtype('int64')

In [171]:
res = series1.reindex([0, 4])
res.dtype

dtype('float64')

In [172]:
res

0    1.0
4    NaN
dtype: float64

In [173]:
series2 = pd.Series([True])
series2.dtype

dtype('bool')

In [174]:
res = series2.reindex_like(series1)
res.dtype

dtype('O')

In [175]:
res

0    True
1     NaN
2     NaN
dtype: object

This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.
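
One way to avoid the upcast, assuming a sentinel value is acceptable for the missing labels, is to pass fill_value to reindex. The choice of 0 as the sentinel below is an illustrative assumption.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3])

with_nan = series1.reindex([0, 4])                 # NaN inserted, dtype becomes float64
with_fill = series1.reindex([0, 4], fill_value=0)  # no NaN, integer dtype is kept

print(with_nan.dtype, with_fill.dtype)
```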

See this old issue for a more detailed discussion.