# Select and filter data

Indexing a Series (`obj[...]`) works analogously to indexing NumPy arrays, except that you can use the index values of the Series instead of just integers. Here are some examples:

In [1]:
import numpy as np
import pandas as pd

In [2]:
idx = pd.date_range("2022-02-02", periods=7)
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7), index=idx)

In [3]:
s

2022-02-02    0.002127
2022-02-03    1.655759
2022-02-04   -1.552128
2022-02-05   -1.581026
2022-02-06   -0.992316
2022-02-07    1.490786
2022-02-08   -1.542455
Freq: D, dtype: float64

In [4]:
s["2022-02-03"]

1.655759430268265

In [5]:
s[1]

1.655759430268265

In [6]:
s[2:4]

2022-02-04   -1.552128
2022-02-05   -1.581026
Freq: D, dtype: float64

In [7]:
s[["2022-02-04", "2022-02-03", "2022-02-02"]]

2022-02-04   -1.552128
2022-02-03    1.655759
2022-02-02    0.002127
dtype: float64

In [8]:
s[[1, 3]]

2022-02-03    1.655759
2022-02-05   -1.581026
Freq: 2D, dtype: float64

In [9]:
s[s > 0]

2022-02-02    0.002127
2022-02-03    1.655759
2022-02-07    1.490786
dtype: float64

While you can select data by label in this way, the preferred method for selecting index values is the `loc` operator:

In [10]:
s.loc[["2022-02-04", "2022-02-03", "2022-02-02"]]

2022-02-04   -1.552128
2022-02-03    1.655759
2022-02-02    0.002127
dtype: float64

`loc` is preferred because `[]` treats integers inconsistently: if the index contains integers, `[]` interprets integers as labels; otherwise it falls back to positions, so the behaviour depends on the data type of the index. Recent pandas versions (2.1 and later) also emit a `FutureWarning` for positional access such as `s[1]` and recommend `s.iloc[1]` instead. `loc`, by contrast, always interprets its argument as labels. In our example, the expression `s.loc[[3, 2, 1]]` therefore fails because the index does not contain integers:

In [11]:
s.loc[[3, 2, 1]]

KeyError: "None of [Index([3, 2, 1], dtype='int64')] are in the [index]"
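
The ambiguity is easiest to see on a Series whose index labels are themselves integers. The following is a minimal sketch; the Series `t` and its values are made up for illustration and are not part of the example above:

```python
import pandas as pd

# Hypothetical Series with integer labels stored in descending order.
t = pd.Series([10.0, 20.0, 30.0], index=[3, 2, 1])

by_bracket = t[1]        # the index holds integers, so 1 is read as a label
by_label = t.loc[1]      # loc always looks up the label 1
by_position = t.iloc[1]  # iloc always takes the second element
```

Here `by_bracket` and `by_label` both refer to the element labelled `1` (the last one), while `by_position` refers to the second element, which is why being explicit with `loc` and `iloc` avoids surprises.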

While the `loc` operator exclusively indexes labels, the `iloc` operator exclusively indexes with integers:

In [12]:
s.iloc[[3, 2, 1]]

2022-02-05   -1.581026
2022-02-04   -1.552128
2022-02-03    1.655759
Freq: -1D, dtype: float64

You can also slice with labels, but this works differently from normal Python slicing because the endpoint is included:

In [13]:
s.loc["2022-02-03":"2022-02-04"]

2022-02-03    1.655759
2022-02-04   -1.552128
Freq: D, dtype: float64
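
To make the difference in endpoint handling concrete, the sketch below compares a label slice with the positional slice that starts at the same element. It rebuilds a Series with the same date index but simple integer values, which is an assumption made only to keep the result predictable:

```python
import pandas as pd

idx = pd.date_range("2022-02-02", periods=7)
s_demo = pd.Series(range(7), index=idx)

# Label slice: both "2022-02-03" and "2022-02-04" are included.
label_slice = s_demo.loc["2022-02-03":"2022-02-04"]

# Positional slice: position 1 is included, position 2 is excluded.
position_slice = s_demo.iloc[1:2]
```

The label slice contains two elements, the positional slice only one.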

Assigning values with these methods modifies the corresponding section of the Series:

In [14]:
s.loc["2022-02-03":"2022-02-04"] = 0

s

2022-02-02    0.002127
2022-02-03    0.000000
2022-02-04    0.000000
2022-02-05   -1.581026
2022-02-06   -0.992316
2022-02-07    1.490786
2022-02-08   -1.542455
Freq: D, dtype: float64
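
Assignment works the same way with `iloc` and with Boolean masks. The following sketch uses a small, hypothetical Series and works on a copy so that the original values stay available for comparison:

```python
import pandas as pd

original = pd.Series([0.5, -1.0, 2.0, -3.0], index=list("abcd"))

# Work on a copy so the original Series is left untouched.
clipped = original.copy()
clipped.iloc[0] = 0.0        # assign by integer position
clipped[clipped < 0] = 0.0   # assign wherever the Boolean mask is True
```

After these two assignments only the value at label `c` is non-zero in `clipped`, while `original` is unchanged.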

Indexing a DataFrame with a single value or a sequence retrieves one or more columns:

In [15]:
data = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Decimal": [0, 1, 2, 3, 4, 5],
    "Octal": ["001", "002", "003", "004", "004", "005"],
    "Key": ["NUL", "Ctrl-A", "Ctrl-B", "Ctrl-C", "Ctrl-D", "Ctrl-E"],
}

df = pd.DataFrame(data)
df = pd.DataFrame(data, columns=["Decimal", "Octal", "Key"], index=df["Code"])

df

        Decimal  Octal     Key
Code
U+0000        0    001     NUL
U+0001        1    002  Ctrl-A
U+0002        2    003  Ctrl-B
U+0003        3    004  Ctrl-C
U+0004        4    004  Ctrl-D
U+0005        5    005  Ctrl-E


In [16]:
df["Key"]

Code
U+0000       NUL
U+0001    Ctrl-A
U+0002    Ctrl-B
U+0003    Ctrl-C
U+0004    Ctrl-D
U+0005    Ctrl-E
Name: Key, dtype: object

In [17]:
df[["Decimal", "Key"]]

        Decimal     Key
Code
U+0000        0     NUL
U+0001        1  Ctrl-A
U+0002        2  Ctrl-B
U+0003        3  Ctrl-C
U+0004        4  Ctrl-D
U+0005        5  Ctrl-E


In [18]:
df[:2]

        Decimal  Octal     Key
Code
U+0000        0    001     NUL
U+0001        1    002  Ctrl-A


In [19]:
df[df["Decimal"] > 2]

        Decimal  Octal     Key
Code
U+0003        3    004  Ctrl-C
U+0004        4    004  Ctrl-D
U+0005        5    005  Ctrl-E


The row selection syntax `df[:2]` is provided for convenience. Passing a single item or a list to the `[]` operator selects columns.
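
The following sketch contrasts the two behaviours of `[]` on a DataFrame. It uses a reduced, hypothetical version of the table with only two columns:

```python
import pandas as pd

small = pd.DataFrame(
    {"Decimal": [0, 1, 2], "Key": ["NUL", "Ctrl-A", "Ctrl-B"]},
    index=pd.Index(["U+0000", "U+0001", "U+0002"], name="Code"),
)

first_rows = small[:2]                 # a slice selects rows
one_column = small["Key"]              # a single label selects a column (Series)
two_columns = small[["Decimal", "Key"]]  # a list of labels selects columns

# The explicit positional equivalent of the row slice.
same_rows = small.iloc[:2]
```

Writing `small.iloc[:2]` instead of `small[:2]` states the intent more clearly, at the cost of a few extra characters.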

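Another use case is indexing with a Boolean Series, such as one produced by a scalar comparison on a column. Used on the left-hand side of an assignment, such a mask overwrites every column of the matching rows, for example: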

In [19]:
df["Decimal"] > 2

Code
U+0000    False
U+0001    False
U+0002    False
U+0003     True
U+0004     True
U+0005     True
Name: Decimal, dtype: bool

In [20]:
df[df["Decimal"] > 2] = "NA"

df

        Decimal  Octal     Key
Code
U+0000        0    001     NUL
U+0001        1    002  Ctrl-A
U+0002        2    003  Ctrl-B
U+0003       NA     NA      NA
U+0004       NA     NA      NA
U+0005       NA     NA      NA
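
If only one column should change, the mask can be combined with a column label in `loc`. This is a sketch on a small, hypothetical frame, not a replacement for the cell above; it leaves the numeric column and its dtype untouched:

```python
import pandas as pd

codes = pd.DataFrame(
    {
        "Decimal": [0, 1, 2, 3],
        "Key": ["NUL", "Ctrl-A", "Ctrl-B", "Ctrl-C"],
    },
    index=pd.Index(["U+0000", "U+0001", "U+0002", "U+0003"], name="Code"),
)

mask = codes["Decimal"] > 2

# Rows come from the Boolean mask, the column from its label,
# so only the "Key" entries of the matching rows are overwritten.
codes.loc[mask, "Key"] = "NA"
```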


Like Series, DataFrame has special operators `loc` and `iloc` for label-based and integer indexing respectively. Since DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either axis labels (`loc`) or integers (`iloc`).

In [21]:
df.loc["U+0002", ["Decimal", "Key"]]

Decimal         2
Key        Ctrl-B
Name: U+0002, dtype: object

In [22]:
df.iloc[[2], [1, 2]]

        Octal     Key
Code
U+0002    003  Ctrl-B


In [23]:
df.iloc[[0, 1], [1, 2]]

        Octal     Key
Code
U+0000    001     NUL
U+0001    002  Ctrl-A


Both indexing functions work with slices in addition to individual labels or lists of labels:

In [24]:
df.loc[:"U+0003", "Key"]

Code
U+0000       NUL
U+0001    Ctrl-A
U+0002    Ctrl-B
U+0003        NA
Name: Key, dtype: object

In [25]:
df.iloc[:3, :3]

        Decimal  Octal     Key
Code
U+0000        0    001     NUL
U+0001        1    002  Ctrl-A
U+0002        2    003  Ctrl-B
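
Slices can be used on both axes at once. The sketch below uses a hypothetical frame with a made-up `Hex` column to show that label slices include their endpoints on rows and columns alike, whereas positional slices exclude the stop position:

```python
import pandas as pd

table = pd.DataFrame(
    {
        "Decimal": [0, 1, 2, 3],
        "Hex": ["0x0", "0x1", "0x2", "0x3"],  # hypothetical column
        "Key": ["NUL", "Ctrl-A", "Ctrl-B", "Ctrl-C"],
    },
    index=pd.Index(["U+0000", "U+0001", "U+0002", "U+0003"], name="Code"),
)

# Label slices: "U+0003" and "Hex" are both included.
by_label = table.loc["U+0001":"U+0003", "Decimal":"Hex"]

# Positional slices: row 3 and column 2 are both excluded.
by_position = table.iloc[1:3, 0:2]
```

`by_label` has three rows and two columns, `by_position` has two rows and two columns.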


There are thus many ways to select and rearrange the data contained in a pandas object. The following table briefly summarises most of these options for DataFrames:

Syntax | Description
:--- | :---
`df[LABEL]` | selects a single column or a sequence of columns from the DataFrame
`df.loc[LABEL]` | selects a single row or a subset of rows from the DataFrame by label
`df.loc[:, LABEL]` | selects a single column or a subset of columns from the DataFrame by label
`df.loc[LABEL1, LABEL2]` | selects both rows and columns by label
`df.iloc[INTEGER]` | selects a single row or a subset of rows from the DataFrame by integer position
`df.iloc[:, INTEGER]` | selects a single column or a subset of columns by integer position
`df.iloc[INTEGER1, INTEGER2]` | selects both rows and columns by integer position
`df.at[LABEL1, LABEL2]` | selects a single scalar value by row and column label
`df.iat[INTEGER1, INTEGER2]` | selects a single scalar value by row and column position (integers)
`df.reindex(NEW_INDEX)` | selects rows or columns by label
`get_value`, `set_value` | deprecated since version 0.21.0 and removed in pandas 1.0; use `.at[]` or `.iat[]` instead
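
The scalar accessors and `reindex` from the summary are not used in the cells above. The following is a minimal sketch on a small, hypothetical frame; the label `U+9999` is deliberately absent from the index to show how `reindex` handles missing labels:

```python
import pandas as pd

chars = pd.DataFrame(
    {"Decimal": [0, 1, 2], "Key": ["NUL", "Ctrl-A", "Ctrl-B"]},
    index=pd.Index(["U+0000", "U+0001", "U+0002"], name="Code"),
)

value_by_label = chars.at["U+0002", "Key"]  # scalar by row and column label
value_by_position = chars.iat[2, 1]         # scalar by row and column position

# reindex returns a new frame in the requested order;
# labels that do not exist yield a row of missing values.
reordered = chars.reindex(["U+0002", "U+0000", "U+9999"])
```

Unlike `loc` with a list of labels, `reindex` does not raise an error for labels that are missing from the index.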