# Session 11

[![Open and Execute in Google Colaboratory](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/astrojuanlu/ie-mbd-python-data-analysis-i/blob/main/sessions/Session%2011.ipynb)

- The index in pandas
- `.loc` vs `.iloc`
- Concatenating several dataframes

## The index in pandas

> There is just one core concept that brings together almost all of the pandas API [...]: the index and index alignment.

James Powell ([source](https://youtu.be/pjq3QOxl9Ok?si=B-wOGZJ7XvO70zmk))

The index is a property of pandas DataFrames that allows you to refer to specific rows by _label_, rather than by _position_. You have already used it implicitly, but it can do much more.
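As a minimal sketch of the difference between label and position, consider a small made-up DataFrame (the `df_toy` name, its values, and its weekday labels are illustrative assumptions, not course data):

```python
import pandas as pd

# Hypothetical toy data, not part of the course datasets
df_toy = pd.DataFrame(
    {"cases": [10, 20, 30]},
    index=["mon", "tue", "wed"],
)

# Same cell, reached in two different ways
by_label = df_toy.loc["tue", "cases"]  # row labelled "tue"
by_position = df_toy.iloc[1, 0]  # second row, first column
```

Both lookups reach the same cell here, but only the first one keeps working if the rows are reordered.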

In [None]:
import pandas as pd

df_covid = pd.read_csv(
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/raw/main/"
    "data/national_covid19.csv"
)
df_covid.head()

You can visually distinguish the index on the left-hand side because its values are highlighted in bold. By default, pandas uses an auto-incrementing integer index, as in this case.
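To see what that default looks like outside the rendered table, here is a short sketch on a made-up DataFrame (`df_default` and its values are hypothetical). When no index is given, pandas should create a `RangeIndex` starting at 0:

```python
import pandas as pd

# Hypothetical toy data; no index is specified on purpose
df_default = pd.DataFrame({"value": [1.5, 2.5, 3.5]})

index_type = type(df_default.index).__name__  # name of the index class
index_labels = list(df_default.index)  # the labels themselves
```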

Remember that you can use `.loc` to index and slice a DataFrame by label:

In [None]:
df_covid.loc[0:3]

However, things start to get interesting when you use some other column as an index. In this case, let's use the date:

In [None]:
df_covid["date"].is_unique  # Good sanity check, not mandatory

In [None]:
df_covid_r = df_covid.set_index("date")
df_covid_r.head()

Notice that `date` is no longer a column!

In [None]:
df_covid_r["date"]  # Raises KeyError: "date" is now the index, not a column
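A small sketch of what happens to a column after `set_index`, using made-up data (`df_small` and its values are hypothetical). The values are not lost: they remain available through `.index`, and `reset_index()` should turn the index back into a regular column:

```python
import pandas as pd

# Hypothetical toy data, not the COVID dataset
df_small = pd.DataFrame(
    {"date": ["2020-04-01", "2020-04-02"], "cases": [5, 8]}
).set_index("date")

# Column-style access no longer finds "date"
try:
    df_small["date"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

dates_from_index = list(df_small.index)  # the values live in the index now
df_restored = df_small.reset_index()  # "date" becomes a column again
```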

But more importantly, you can now select and slice rows by date:

In [None]:
df_covid_r.loc["2020-04-01":"2020-04-05"]
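The same kind of label slice can be sketched on a tiny made-up DataFrame (`df_dates` and its values are hypothetical). The example assumes the dates are stored as ISO-formatted strings and are sorted; in that format, alphabetical order matches chronological order, so the slice behaves like a date range:

```python
import pandas as pd

# Hypothetical toy data with ISO-formatted date strings, already sorted
df_dates = pd.DataFrame(
    {
        "date": ["2020-03-31", "2020-04-01", "2020-04-02", "2020-04-03"],
        "cases": [1, 2, 3, 4],
    }
).set_index("date")

# Label-based slice: both the start and the end label are included
selected = df_dates.loc["2020-04-01":"2020-04-02"]
selected_dates = list(selected.index)
```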

## `.loc` vs `.iloc`

Now, `.loc` is very powerful because it allows you to use labels, rather than positions. But what if you want positions for some reason?

Well, that's what `.iloc` is for:

In [None]:
df_covid_r.iloc[:3]

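Notice that `.iloc` follows the Python semantics of slicing (the end is not included), unlike `.loc`, which includes the end label. On the original DataFrame with its default integer index, the following returns four rows (labels 0 to 3):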

In [None]:
df_covid.loc[:3]
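The difference can be checked on a small made-up DataFrame (`df_range` and its values are hypothetical). The sketch also reverses the rows to show that labels travel with the rows while positions do not:

```python
import pandas as pd

# Hypothetical toy data with the default integer index (labels 0 to 5)
df_range = pd.DataFrame({"value": list("abcdef")})

by_position = df_range.iloc[:3]  # positions 0, 1, 2 (end excluded)
by_label = df_range.loc[:3]  # labels 0, 1, 2, 3 (end included)

# After reversing the rows, the index labels become 5, 4, ..., 0
df_reversed = df_range.iloc[::-1]
first_by_position = df_reversed.iloc[0, 0]  # first row as displayed
row_labelled_zero = df_reversed.loc[0, "value"]  # row whose label is 0
```

With the default index the two accessors look interchangeable; once the rows are reordered or another column becomes the index, they no longer are.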

## Exercises

### 1. Wildfires in Spain

Read the wildfire data available at `FIRES_URL` (defined below) into a pandas DataFrame.

How many fires were there in the year 2018? Compute that in 2 ways:
- Filtering on the `fecha` column
- Setting `fecha` as the index

Verify that the result is exactly the same.

In [None]:
FIRES_URL = (
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/raw/main/"
    "data/fires-subset.csv"
)
FIRES_URL

## Concatenating several dataframes

`pd.concat` adds one dataframe after another, either vertically (along rows, the default) or horizontally (along columns). It is useful when you have several dataframes that relate to the same data and you want to combine them into one, for example paginated results.
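As a minimal sketch of the paginated-results case, assume two made-up "pages" with the same columns (`page_1`, `page_2`, and their contents are hypothetical):

```python
import pandas as pd

# Two hypothetical pages of results with identical columns
page_1 = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
page_2 = pd.DataFrame({"id": [3, 4], "name": ["c", "d"]})

# Default: stack along rows and keep each page's original index labels
stacked = pd.concat([page_1, page_2])

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index
stacked_fresh = pd.concat([page_1, page_2], ignore_index=True)
```

Because each page starts its own index at 0, the plain concatenation ends up with repeated labels, which is why `ignore_index=True` is often convenient in this situation.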

In [None]:
df_madrid = pd.read_csv(
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/raw/main/data/grandes-tenedores-madrid.csv"
)
df_madrid.head()

In [None]:
len(df_madrid)

Let's artificially split the dataframe using `.iloc[]`. It's like `.loc[]`, but works by position instead of by label.

The first 100 rows will go in one dataset, the remaining rows into another:

In [None]:
df_madrid_1 = df_madrid.iloc[:100]
df_madrid_1.head(1)

In [None]:
df_madrid_2 = df_madrid.iloc[100:]
df_madrid_2.head(1)

In [None]:
len(df_madrid_1) + len(df_madrid_2) == len(df_madrid)

In [None]:
df_madrid_concat = pd.concat([df_madrid_1, df_madrid_2])

In [None]:
df_madrid_concat.equals(df_madrid)
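The check above succeeds because the pieces keep their original index labels and are concatenated in the original order. A small sketch on made-up data (`df_full` and the other names are hypothetical) suggests what happens when the order is swapped, and how sorting by index recovers the original:

```python
import pandas as pd

# Hypothetical toy data with the default integer index
df_full = pd.DataFrame({"x": range(5)})
top, bottom = df_full.iloc[:2], df_full.iloc[2:]

# Concatenating in the wrong order yields the same rows in a different order
wrong_order = pd.concat([bottom, top])
same_as_original = wrong_order.equals(df_full)

# The labels were preserved, so sorting by index restores the original order
same_after_sorting = wrong_order.sort_index().equals(df_full)
```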

Alternatively, we can split by columns:

In [None]:
df_madrid_left = df_madrid.iloc[:, :11]
df_madrid_left.head(1)

In [None]:
df_madrid_right = df_madrid.iloc[:, 11:]
df_madrid_right.head(1)

In [None]:
df_madrid_concat_cols = pd.concat([df_madrid_left, df_madrid_right], axis="columns")

In [None]:
df_madrid_concat_cols.equals(df_madrid)
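Concatenating along columns matches rows by index label, not by position. The following sketch uses two made-up DataFrames whose rows are in different orders (`left`, `right`, and their values are hypothetical) to illustrate that alignment:

```python
import pandas as pd

# Hypothetical toy data: same labels, different row order
left = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"b": [30, 10, 20]}, index=["z", "x", "y"])

# Rows are paired up by label, so "x" gets a=1 and b=10
combined = pd.concat([left, right], axis="columns")

row_x = (combined.loc["x", "a"], combined.loc["x", "b"])
row_z = (combined.loc["z", "a"], combined.loc["z", "b"])
```

This is the index alignment mentioned in the quote at the start of the session: the split-and-reassemble check works because both halves share the same index.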

## Exercises

### 2. More split and concat

Using the same Rick & Morty data, split the data into five datasets, one per season. Then concatenate them again. Finally, check that the reassembled dataframe and the original one are the same.