# Indexing, Selecting, and Filtering

## Setup

In [None]:
import pandas as pd

## Creation

In [None]:
data = {
    "Capital": {
        "Spain": "Madrid",
        "Belgium": "Brussels",
        "France": "Paris",
        "Italy": "Roma",
        "Germany": "Berlin",
        "Portugal": "Lisbon",
        "Norway": "Oslo",
        "Greece": "Athens",
    },
    "Population": {
        "Spain": 46733038,
        "Belgium": 11449656,
        "France": 67076000,
        "Italy": 60390560,
        "Germany": 83122889,
        "Portugal": 10295909,
        "Norway": 5391369,
        "Greece": 10718565,
    },
    "Monarch": {
        "Spain": "Felipe VI",
        "Belgium": "Philippe",
        "Norway": "Harald V",
    },
    "Area": {
        "Spain": 505990,
        "Belgium": 30688,
        "France": 640679,
        "Italy": 301340,
        "Germany": 357022,
        "Portugal": 92212,
        "Norway": 385207,
        "Greece": 131957,
    },
}

In [None]:
# For now, let's forget about these steps:
df = pd.DataFrame(data)
df["Capital"] = df["Capital"].astype("string")
df["Monarch"] = df["Monarch"].astype("string")

In [None]:
df

Apple stock data, taken from the [`matplotlib` sample datasets](https://github.com/matplotlib/sample_data/blob/master/aapl.csv):

In [None]:
# For now, let's forget about these steps:
apple = pd.read_csv("AAPL.csv")
apple["Date"] = apple["Date"].astype("datetime64[ns]")
apple = apple.set_index("Date")
apple = apple.sort_index()

In [None]:
apple.head()

## `.loc[]` and `.iloc[]`

```python
df.loc[..., ...]
df.loc[rows, columns]

df.iloc[..., ...]
df.iloc[rows, columns]

```

In [None]:
df

`.loc[]` expects labels (from the columns or from the index):

In [None]:
df.loc["Germany":"Norway"]

`.iloc[]` expects integers:

In [None]:
df.iloc[-3:]
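
The difference between labels and positions is easiest to see when the index itself holds integers. A minimal sketch, assuming a small hypothetical DataFrame (`toy` is an illustrative name, not part of the notebook data):

```python
import pandas as pd

# Hypothetical toy data: the index labels are integers that do NOT match the positions.
toy = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = toy.loc[10, "value"]   # row labelled 10, column labelled "value"
by_position = toy.iloc[0, 0]      # first row, first column

print(by_label, by_position)
```

Both expressions reach the same cell, but `toy.loc[0, "value"]` would fail here, because no row is labelled `0`.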

## Demo 1: Selecting one column (as a `Series`)

In [None]:
df

Select one column:

In [None]:
df["Capital"]

In [None]:
df.loc[:, "Capital"]

Check the type of the object returned:

In [None]:
type(df.loc[:, "Capital"])

## Exercise 1

In [None]:
apple.head()

Select the "Volume" column of the `apple` DataFrame:

Check the type of the object returned:

## Demo 2: Selecting several columns

In [None]:
df

Select several columns:

In [None]:
df.loc[:, ["Capital", "Area"]]

Check the type of the object returned:

In [None]:
type(df.loc[:, ["Capital", "Area"]])

## Exercise 2

In [None]:
apple.head()

Select the "Open" and "Close" columns of the `apple` DataFrame:

Check the type of the object returned:

## Demo 3: Selecting one column (as a `DataFrame`)

In [None]:
df

Select one column, and return a DataFrame:

In [None]:
df.loc[:, ["Monarch"]]

Check the type of the object returned:

In [None]:
type(df.loc[:, ["Monarch"]])
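
The only difference between the two forms is whether the column is given as a single label or as a one-element list. A minimal sketch on a hypothetical two-row DataFrame (`toy`, `as_series` and `as_frame` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data, a subset of the countries used above.
toy = pd.DataFrame(
    {"Capital": ["Madrid", "Paris"], "Area": [505990, 640679]},
    index=["Spain", "France"],
)

as_series = toy.loc[:, "Capital"]     # single label -> Series (one dimension)
as_frame = toy.loc[:, ["Capital"]]    # list with one label -> DataFrame (two dimensions)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```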

## Exercise 3

In [None]:
apple.head()

Select the "Adj Close" column of the `apple` DataFrame, and return a DataFrame:

Check the type of the object returned:

## Demo 4: Slicing rows (using the index)

In [None]:
df

Slice the rows from the start up to and including "Italy":

In [None]:
df.loc[:"Italy"]

Check the shape:

In [None]:
df.loc[:"Italy"].shape

<div class="alert alert-warning">

<b>Beware:</b> Unlike in <code>Python</code>, <b>the end point is included</b> when slicing in <code>pandas</code> <b>using the index!</b>

</div>
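
To make the contrast concrete, here is a minimal sketch that compares a plain Python list slice with a label slice. The names `countries` and `toy` are hypothetical and only serve the illustration:

```python
import pandas as pd

countries = ["Spain", "Belgium", "France", "Italy"]
toy = pd.DataFrame({"rank": [1, 2, 3, 4]}, index=countries)

list_slice = countries[:2]          # plain Python: the end position 2 is excluded
label_slice = toy.loc[:"France"]    # pandas labels: the end label "France" is included

print(list_slice)
print(list(label_slice.index))
```

A label slice has no natural "one past the end" value, which is one plausible reason for the inclusive behaviour.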

Slice the last few rows, starting from "Italy":

In [None]:
df.loc["Italy":]

Check the shape:

In [None]:
df.loc["Italy":].shape

Slice the rows from "Belgium" up to and including "Germany":

In [None]:
df.loc["Belgium":"Germany"]

Check the shape:

In [None]:
df.loc["Belgium":"Germany"].shape

<div class="alert alert-warning">

<b>Beware:</b> Unlike in <code>Python</code>, <b>the end point is included</b> when slicing in <code>pandas</code> <b>using the index!</b>

</div>

## Exercise 4

In [None]:
apple.head()

Slice the rows of the `apple` DataFrame from the start up to and including 14 September 1984:

Check the shape:

Slice the last few rows of the `apple` DataFrame, starting from 1 October 2008:

Check the shape:

Slice the rows of the `apple` DataFrame for the month of February 2000:

Check the shape:

## Demo 5: Slicing rows (using integers)

In [None]:
df

Slice the first 4 rows:

In [None]:
df.iloc[:4]

Check the shape:

In [None]:
df.iloc[:4].shape

<div class="alert alert-warning">

<b>Beware:</b> As in <code>Python</code>, <b>the end point is NOT included</b> when slicing in <code>pandas</code> <b>using integers!</b>

</div>
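
Integer slices can be checked against a plain Python list. A minimal sketch, assuming a hypothetical five-row DataFrame (`values` and `toy` are illustrative names):

```python
import pandas as pd

values = [10, 20, 30, 40, 50]
toy = pd.DataFrame({"value": values})

list_slice = values[1:3]      # positions 1 and 2; position 3 is excluded
iloc_slice = toy.iloc[1:3]    # rows at positions 1 and 2; position 3 is excluded

print(list_slice)
print(iloc_slice["value"].tolist())
```

Both slices hold the same two values, so `.iloc[]` follows the half-open convention of Python sequences.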

Slice the last 3 rows:

In [None]:
df.iloc[-3:]

Check the shape:

In [None]:
df.iloc[-3:].shape

Slice the rows from the third until the fifth:

In [None]:
df

In [None]:
df.iloc[2:5]

Check the shape:

In [None]:
df.iloc[2:5].shape

<div class="alert alert-warning">

<b>Beware:</b> As in <code>Python</code>, <b>the end point is NOT included</b> when slicing in <code>pandas</code> <b>using integers!</b>

</div>

## Exercise 5

In [None]:
apple.head()

Slice the first 3 rows of the `apple` DataFrame:

Check the shape:

Slice the last 6 rows of the `apple` DataFrame:

Check the shape:

Slice the second to the fourth rows of the `apple` DataFrame:

Check the shape:

## Demo 6: Selecting data with a boolean array

In [None]:
df

In [None]:
df["Population"]

Comparisons return boolean arrays:

In [None]:
df["Population"] < 15_000_000

Select the rows for which the population is less than 15 million people:

In [None]:
df.loc[df["Population"] < 15_000_000]

Check the shape:

In [None]:
df.loc[df["Population"] < 15_000_000].shape

Select the rows for which the area is greater than or equal to 400 thousand square km:

In [None]:
df["Area"]

In [None]:
df["Area"] >= 400_000

In [None]:
df.loc[df["Area"] >= 400_000]

Check the shape:

In [None]:
df.loc[df["Area"] >= 400_000].shape

Select the rows for which the area is smaller than 400 thousand square km:

In [None]:
df.loc[df["Area"] < 400_000]

Check the shape:

In [None]:
df.loc[df["Area"] < 400_000].shape

The original DataFrame has been split into two parts:

In [None]:
df.shape
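
The two selections above are complements of each other, which can also be expressed by negating the boolean array with the `~` operator. A minimal sketch on a hypothetical four-row DataFrame (`toy`, `mask`, `large` and `small` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data, a subset of the areas used above.
toy = pd.DataFrame(
    {"Area": [505990, 30688, 640679, 301340]},
    index=["Spain", "Belgium", "France", "Italy"],
)

mask = toy["Area"] >= 400_000   # boolean Series, aligned on the index
large = toy.loc[mask]           # rows where the mask is True
small = toy.loc[~mask]          # rows where the mask is False

print(large.shape, small.shape, toy.shape)
```

Every row lands in exactly one of the two parts, so the row counts add up to the row count of the original.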

Select the rows for which the capital is "Roma":

In [None]:
df["Capital"] == "Roma"

In [None]:
df.loc[df["Capital"] == "Roma"]

Check the shape:

In [None]:
df.loc[df["Capital"] == "Roma"].shape

## Exercise 6

In [None]:
apple.head()

Select the rows of the `apple` DataFrame for which the "Open" column was less than or equal to 26.50:

Check the shape:

Select the rows of the `apple` DataFrame for which the "Volume" column was greater than 100_000_000:

Check the shape:

Using the `apple` DataFrame, find out how many days the "Close" value was exactly 14.00:

Check the shape:

<div class="alert alert-info">

<b>Note:</b> Up to this point, the <code>.loc[]</code> and <code>.iloc[]</code> indexers offered no new functionality compared to plain <code>df[]</code>; below are examples of their power!

</div>

## Demo 7: Selecting one or more columns (using integers)

In [None]:
df

Select one column:

In [None]:
df["Capital"]

In [None]:
df.loc[:, "Capital"]

Select one column using integers:

In [None]:
# Raises a KeyError, because df[...] with a single value looks up a column label, and no column is labelled 0:
df[0]

In [None]:
df.iloc[:, 0]

Select several columns using integers:

In [None]:
df.iloc[:, [1, 3]]

In [None]:
df.iloc[:, 1:]

## Exercise 7

In [None]:
apple.head()

Select the second column of the `apple` DataFrame:

Select the first and fourth columns of the `apple` DataFrame:

Select the first to fourth columns of the `apple` DataFrame:

## Demo 8: Selecting a slice of columns

In [None]:
df

Select a slice of columns:

In [None]:
df[["Population", "Monarch", "Area"]]

In [None]:
# Raises a KeyError, because a slice inside df[...] refers to rows, and "Population" is not a row label:
df["Population":"Area"]

In [None]:
df.loc[:, "Population":"Area"]
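
Column slices by label include both end points, just like row slices by label. A minimal sketch comparing the label slice with the equivalent integer slice, on a hypothetical one-row DataFrame (`toy`, `by_label` and `by_position` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data with the same column order as the DataFrame above.
toy = pd.DataFrame(
    {
        "Capital": ["Madrid"],
        "Population": [46733038],
        "Monarch": ["Felipe VI"],
        "Area": [505990],
    },
    index=["Spain"],
)

by_label = toy.loc[:, "Population":"Area"]   # end label "Area" is included
by_position = toy.iloc[:, 1:4]               # end position 4 is excluded

print(list(by_label.columns))
print(list(by_position.columns))
```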

## Exercise 8

In [None]:
apple.head()

Select the slice of columns from "Open" to "Close" of the `apple` DataFrame:

## Demo 9: Selecting specific rows by labels

In [None]:
df

Select specific rows:

In [None]:
df["France":"Portugal"]

In [None]:
# Raises a KeyError, because a list inside df[...] refers to columns, and "France" and "Germany" are not column labels:
df[["France", "Germany"]]

In [None]:
df.loc[["France", "Germany"]]
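
A list of labels does not have to follow the order of the index: the rows come back in the order in which they are requested. A minimal sketch on a hypothetical DataFrame (`toy` and `picked` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data, a subset of the areas used above.
toy = pd.DataFrame(
    {"Area": [505990, 30688, 640679, 357022]},
    index=["Spain", "Belgium", "France", "Germany"],
)

# Request "Germany" before "France", the reverse of their order in the index.
picked = toy.loc[["Germany", "France"]]

print(list(picked.index))
```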

## Exercise 9

In [None]:
apple.head()

Select the rows for 18 May 2000 and 18 May 2001 of the `apple` DataFrame using `.loc[]`:

## Demo 10: Selecting on both rows and columns

In [None]:
df

Select one or more columns:

In [None]:
df[["Capital", "Area"]]

In [None]:
df.loc[:, ["Capital", "Area"]]

Select rows:

In [None]:
df["Belgium":"Norway"]

In [None]:
df.loc["Belgium":"Norway"]

Select on both rows and columns by chaining single selections:

In [None]:
df[["Capital", "Area"]]["Belgium":"Norway"]

In [None]:
df["Belgium":"Norway"][["Capital", "Area"]]

Select on both rows and columns, using the `.loc[]`/`.iloc[]` methods:

In [None]:
df.loc["Belgium":"Norway", ["Capital", "Area"]]

<div class="alert alert-info">

<b>Note:</b> The <code>.ix[]</code> indexer used to allow selecting with a mix of labels and integers, but it was deprecated and then removed in pandas 1.0; use <code>.loc[]</code> / <code>.iloc[]</code> instead.

</div>

<div class="alert alert-success">

<b>Best Practice:</b> Use <code>.loc[]</code> / <code>.iloc[]</code> when selecting on both rows and columns to <b>view values</b>.

</div>

<div class="alert alert-danger">

<b>Warning:</b> Always use <code>.loc[]</code> / <code>.iloc[]</code> when selecting on both rows and columns to <b>assign values</b>! (See <code>SettingWithCopyWarning</code>)

</div>
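
A minimal sketch of an assignment through a single `.loc[]` call, on a hypothetical three-row DataFrame (`toy` is an illustrative name, and the value `0` is arbitrary):

```python
import pandas as pd

# Hypothetical toy data, a subset of the countries used above.
toy = pd.DataFrame(
    {"Capital": ["Madrid", "Brussels", "Paris"], "Area": [505990, 30688, 640679]},
    index=["Spain", "Belgium", "France"],
)

# Rows and columns are resolved in ONE indexing operation,
# so the assignment is applied to `toy` itself.
toy.loc["Belgium":"France", "Area"] = 0

print(toy)
```

With a chained selection such as `toy["Area"]["Belgium":"France"] = 0`, the assignment may instead act on a temporary object, and whether `toy` changes depends on the pandas version and settings.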

## Exercise 10

In [None]:
apple.head()

Select the "Open" and "Close" columns of the `apple` DataFrame using `.loc[]`:

Select the rows for the month of February 2000 of the `apple` DataFrame using `.loc[]`:

Combine both selections using `.loc[]`:

## Demo 11: Selecting values with `.at[]` and `.iat[]`

In [None]:
df

Select a specific value giving both the row and the column, using the `.at[]`/`.iat[]` accessors:

In [None]:
df.at["Belgium", "Capital"]

In [None]:
df.iat[1, 0]
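
For a single cell, `.at[]` and `.iat[]` return the same value as `.loc[]` and `.iloc[]`; they are restricted to scalar access. A minimal sketch on a hypothetical two-row DataFrame (`toy` and the four result names are illustrative):

```python
import pandas as pd

# Hypothetical toy data, a subset of the countries used above.
toy = pd.DataFrame(
    {"Capital": ["Madrid", "Brussels"], "Area": [505990, 30688]},
    index=["Spain", "Belgium"],
)

with_at = toy.at["Belgium", "Capital"]     # scalar access by labels
with_loc = toy.loc["Belgium", "Capital"]   # same cell through .loc[]
with_iat = toy.iat[1, 0]                   # scalar access by positions
with_iloc = toy.iloc[1, 0]                 # same cell through .iloc[]

print(with_at, with_loc, with_iat, with_iloc)
```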

## Exercise 11

In [None]:
apple.head()

Select the "Close" value for 18 May 2000 of the `apple` DataFrame using `.at[]`:

Select the "Open" value for 7 September 1984 of the `apple` DataFrame using `.at[]`:

## Summary

```python
df.loc[..., ...]
df.loc[rows, columns]

df.iloc[..., ...]
df.iloc[rows, columns]

```

 Command                         | Result
:--------------------------------|:------------------------------------------------------
`df["Column"]`                   | Selects one column, and returns a `Series`
`df[["Column_1", "Column_2"]]`   | Selects several columns, and returns a `DataFrame`
`df[["Column"]]`                 | Selects one column, and returns a `DataFrame`
`df[:"Spain"]`                   | Slices rows using the index, and returns a `DataFrame`
`df[:10]`                        | Slices rows using integers, and returns a `DataFrame`
`df[df["Column"] > 0]`           | Selects rows, and returns a `DataFrame`
                                 |
`df.loc[..., ...]`               | Selects on both rows and columns (using labels)
`df.iloc[..., ...]`              | Selects on both rows and columns (using integers)
                                 |
`df.at[..., ...]`                | Selects value at a specific position (using labels)
`df.iat[..., ...]`               | Selects value at a specific position (using integers)
                                 |
`df.loc["Spain"]`                | Selects one row, and returns a `Series`
`df.loc[["Spain", "Belgium"]]`   | Selects several rows, and returns a `DataFrame`
`df.loc[["Spain"]]`              | Selects one row, and returns a `DataFrame`
`df.loc["Spain":"Germany"]`      | Selects a slice of rows, and returns a `DataFrame`
                                 |
`df.loc[:, "Capital"]`           | Selects one column, and returns a `Series`
`df.loc[:, ["Capital", "Area"]]` | Selects several columns, and returns a `DataFrame`
`df.loc[:, ["Capital"]]`         | Selects one column, and returns a `DataFrame`
`df.loc[:, "Capital":"Area"]`    | Selects a slice of columns, and returns a `DataFrame`
