# Indexing `DataFrame`s

A `DataFrame` can be indexed in several different ways.  Let's use a small example `DataFrame` to walk through them.

In [1]:
import pandas as pd

# One way to manually construct a dataframe: call pd.DataFrame() on
# a dictionary.  The dictionary should be in the form: {"column name": [column values]}.
# The `pd.DataFrame()` function has a LOT of ways it can construct dataframes,
# so read the documentation to see other ways you can do this.
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x2": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    "x3": ["a", "b", "a", "b", "a", "b", "a", "b", "a", "b"]
})
print(df)

   x1  x2 x3
0   1  11  a
1   2  12  b
2   3  13  a
3   4  14  b
4   5  15  a
5   6  16  b
6   7  17  a
7   8  18  b
8   9  19  a
9  10  20  b


In [2]:
# Select a single column
print(df["x1"])

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
Name: x1, dtype: int64


In [3]:
# Select multiple columns: index with a list (or array) of column names.
# Don't use a tuple here: pandas treats a tuple as a single column key.
print(df[["x1", "x2"]])

   x1  x2
0   1  11
1   2  12
2   3  13
3   4  14
4   5  15
5   6  16
6   7  17
7   8  18
8   9  19
9  10  20
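
One detail worth seeing next to the two cells above: a string key returns a `Series`, while a list key returns a `DataFrame`, even when the list holds only one name.  A minimal sketch, using a smaller made-up frame than the one above:

```python
import pandas as pd

# A smaller, illustrative frame (not the one used in the cells above).
df = pd.DataFrame({"x1": [1, 2, 3], "x2": [11, 12, 13]})

single = df["x1"]      # string key: returns a Series
as_frame = df[["x1"]]  # one-element list: returns a one-column DataFrame

print(type(single).__name__)    # Series
print(type(as_frame).__name__)  # DataFrame
print(as_frame.shape)           # (3, 1)
```

This matters when a later step expects a two-dimensional table rather than a single column.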


In [4]:
# Filter rows by indexing with a list/array of booleans.
# It needs to be exactly as long as the dataframe is tall.
# This works much like indexing a NumPy array with an
# array of booleans: rows marked True are kept.
# This example keeps every other row (0, 2, 4, ...).
print(df[
    [True, False, True, False, True, False, True, False, True, False]
])

   x1  x2 x3
0   1  11  a
2   3  13  a
4   5  15  a
6   7  17  a
8   9  19  a
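
Notice in the output above that the surviving rows keep their original index labels (0, 2, 4, ...).  A small sketch of how you might renumber them with `reset_index(drop=True)`, using a shortened made-up frame:

```python
import pandas as pd

# A shortened, illustrative frame (not the one used in the cells above).
df = pd.DataFrame({"x1": [1, 2, 3, 4], "x3": ["a", "b", "a", "b"]})

# Boolean filtering keeps the original index labels of the surviving rows.
filtered = df[[True, False, True, False]]
print(list(filtered.index))    # [0, 2]

# reset_index(drop=True) replaces them with a fresh 0..n-1 index.
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```

Whether to renumber is a judgment call: the original labels are handy for tracing rows back to the source data.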


In [5]:
# There are also the .loc and .iloc indexers on dataframes,
# which select rows and columns explicitly by label or by position.
# But you don't need to worry about them for a while (I very rarely
# use them).
#
# df.iloc lets you select rows and columns *by integer position.*
# Syntax: df.iloc[row_indexer, column_indexer].
# column_indexer is optional; all columns are selected if it's omitted.
# Use `:` to select all rows/columns explicitly.
print(df.iloc[:, [0, 1]])

# df.loc lets you select *by label* (index values and column names).
# In this dataframe, the index is numeric, so it'll look like
# selecting rows by position.  So let's set the index to x2 first.
print(df.set_index("x2").loc[[11, 12, 13], :])

   x1  x2
0   1  11
1   2  12
2   3  13
3   4  14
4   5  15
5   6  16
6   7  17
7   8  18
8   9  19
9  10  20
    x1 x3
x2       
11   1  a
12   2  b
13   3  a
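
To make the label-versus-position distinction concrete, here is a minimal sketch on a shortened made-up frame.  It also shows one difference that often surprises people: a slice with `.loc` includes its end label, while a slice with `.iloc` excludes its end position, like ordinary Python slicing.

```python
import pandas as pd

# A shortened, illustrative frame, indexed by x2 as in the cell above.
df = pd.DataFrame({
    "x1": [1, 2, 3],
    "x2": [11, 12, 13],
    "x3": ["a", "b", "a"],
})
by_label = df.set_index("x2")

# Same cell, two ways: by label (row 12, column "x1") and by position (row 1, column 0).
print(by_label.loc[12, "x1"])  # 2
print(by_label.iloc[1, 0])     # 2

# Label slices include the end label; position slices exclude the end position.
print(len(by_label.loc[11:12]))  # 2
print(len(by_label.iloc[0:1]))   # 1
```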


Now, let's combine a few of these ideas and create something called a *subquery* (so named because it looks a lot like a subquery in SQL).  A subquery is a filtering step where you keep rows *based on some condition* that is checked for each row.  E.g., "keep all rows where the value of x1 is even."

In [6]:
# Operations are vectorized, so this returns a Series of booleans.
where = df["x1"] % 2 == 0
print(where)

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
9     True
Name: x1, dtype: bool


In [7]:
# Now we index df with that.
# print(df[where])

# More commonly you'll see both of these steps combined:
print(
    df[
        df["x1"] % 2 == 0
    ]
)

   x1  x2 x3
1   2  12  b
3   4  14  b
5   6  16  b
7   8  18  b
9  10  20  b
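
Conditions can also be combined.  The sketch below rebuilds the example frame and uses `&` (and), `|` (or), and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `==` and `>`.  The particular conditions are made up for illustration.

```python
import pandas as pd

# Same values as the example dataframe at the top of this section.
df = pd.DataFrame({
    "x1": list(range(1, 11)),
    "x2": list(range(11, 21)),
    "x3": ["a", "b"] * 5,
})

# AND: x1 is even and x2 is greater than 15.
both = df[(df["x1"] % 2 == 0) & (df["x2"] > 15)]
print(list(both["x1"]))     # [6, 8, 10]

# OR: x3 is "a" or x1 is greater than 8.
either = df[(df["x3"] == "a") | (df["x1"] > 8)]
print(list(either["x1"]))   # [1, 3, 5, 7, 9, 10]

# NOT: every row where x3 is not "a".
negated = df[~(df["x3"] == "a")]
print(list(negated["x1"]))  # [2, 4, 6, 8, 10]
```

Python's `and`, `or`, and `not` keywords don't work here, because they try to reduce a whole Series to a single True/False value.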
