# Session 09

[![Open and Execute in Google Colaboratory](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/astrojuanlu/ie-mbd-python-data-analysis-i/blob/main/sessions/Session%2009.ipynb)

- Tabular data with pandas DataFrames
- Slicing and indexing pandas DataFrames
- Filtering pandas DataFrames
- 1-dimensional data: pandas Series

## Tabular data with pandas `DataFrame`s

pandas is the most widely used Python library to read and manipulate tabular data. Its most important object is the `DataFrame`, which represents a rectangular table of rows and columns.

To start using pandas, you need to import it first. The most common way to import pandas is this:

In [None]:
import pandas as pd

All the pandas functions will be accessible under the `pd` alias:

In [None]:
pd.__version__

`DataFrame` objects are rectangular tables formed by a number of columns. Each column can have a different `dtype`, but all of them must have the same number of rows.

`DataFrame` objects can be created with the `DataFrame()` initializer, but in most cases they will appear as a result of other function or method calls, for example `pandas.read_csv`:

In [None]:
df = pd.read_csv(
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/"
    "raw/main/data/titanic.csv"
)
# df = pd.read_csv("national_covid19.csv", index_col="date")  # You can specify the index column
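For completeness, here is a minimal sketch of the `DataFrame()` initializer mentioned above. The column names and values are hypothetical toy data, not part of the Titanic dataset:

```python
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column
toy = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Marta"],
        "age": [31, 45, 27],
        "height_m": [1.68, 1.80, 1.75],
    }
)

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # one dtype per column
```

All the lists must have the same length, since every column needs the same number of rows.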

It's a good practice to display the first few rows of the `DataFrame` right after opening it:

In [None]:
df.head()

Notice that you can retrieve the `dtypes` of the `DataFrame` (it's no longer a single `dtype`!) as well as a few other important properties:

In [None]:
df.dtypes

In [None]:
df.shape

Some clarifications:

- The `object` dtype means pandas could not map the column to a more specific type. The values could be strings, strings representing dates, or something completely different. In most cases it means `str`, but inspect such columns carefully when they appear.
- `float64` is the floating point counterpart of `int64`: each value is a floating point number stored in 64 bits.
- `NaN` stands for "Not a Number" and in general it marks a missing value. In following sessions you will see how to handle missing values effectively, and what they mean.
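The behavior of `NaN` can be sketched on a small, hypothetical column (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
ages = pd.Series([22.0, np.nan, 35.0])

print(ages.dtype)        # float64: NaN is itself a floating point value
print(ages.isnull())     # True only where the value is missing
print(ages.mean())       # missing values are skipped by default
print(np.nan == np.nan)  # NaN never compares equal, so use .isnull() to find it
```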

`DataFrame` objects have a large number of methods as well:

In [None]:
df.describe()

Notice that, by default, only numerical columns are summarized, so `object` columns are excluded. To include all the columns:

In [None]:
df.describe(include="all")

## Exercises

### 1. CSV reading

Read the `grandes-tenedores-madrid.csv` file and display:
- number of rows,
- number of columns,
- dtypes of the columns,
- and the first five rows

Tip: The path is `https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/raw/main/data/grandes-tenedores-madrid.csv`

Data source: https://datos.civio.es/dataset/megatenedores-de-la-comunidad-de-madrid/

## Indexing

To extract a single column, use the indexing syntax, like you would do with a dictionary:

In [None]:
df["Name"]

For anything else (retrieving a slice of columns, slicing both rows and columns), use the `.loc` accessor instead:

In [None]:
df.loc[:, "Age":]  # All rows, from the "Age" column to the last one

In [None]:
df.loc[:5, ["SibSp", "Parch"]]  # Rows up to the one with the index 5 inclusive, columns "SibSp" and "Parch" only

In [None]:
df.loc[
    [100, 110],  # Rows labeled 100 and 110
    ["Fare", "Cabin"]
]

Notice that these methods work for _any_ `DataFrame`. For example, the result of `.describe()` is _also_ a `DataFrame`!

In [None]:
stats = df.describe()
stats

In [None]:
type(stats)

In [None]:
stats.loc["min":"max", ["Age", "Pclass"]]
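Label-based slicing is easier to see when the row labels are not numbers. The following sketch uses a hypothetical table with string row labels (all the names and values are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical toy table with string row labels
scores = pd.DataFrame(
    {"math": [7, 9, 5], "physics": [6, 8, 10], "art": [9, 4, 7]},
    index=["ana", "luis", "marta"],
)

# Slices of labels include both ends, for rows and for columns
subset = scores.loc["ana":"luis", "physics":]
print(subset)

# Lists select exactly the labels they contain
print(scores.loc[["marta"], ["math", "art"]])
```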

## Selection

Filtering a `DataFrame` according to some condition is also done with the `.loc` accessor. For this, you can leverage the fact that operations between pandas `DataFrame` or `Series` objects and Python scalars are broadcast automatically. In other words: the operation is applied element-wise, to every element of the pandas object.

In [None]:
3 > 100  # Returns False

In [None]:
df["Fare"] > 100  # Returns a Series of boolean values

Does the previous `Series` contain any `True` value? You can check with the `.any()` method:

In [None]:
fare_filter = df["Fare"] > 100
fare_filter.any()

In [None]:
fare_filter.dtype
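Besides `.any()`, a boolean `Series` supports `.all()` and `.sum()`, which are handy for checking how many rows a filter will keep. A minimal sketch with hypothetical values:

```python
import pandas as pd

fares = pd.Series([7.25, 120.0, 512.3, 30.0])  # hypothetical values
mask = fares > 100

print(mask.any())  # is at least one element True?
print(mask.all())  # are all the elements True?
print(mask.sum())  # True counts as 1, so this is the number of matches
```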

You can use this filter or mask inside `.loc` to select a subset of rows:

In [None]:
df.loc[fare_filter]

Combining filters requires some care, because the usual boolean Python operators don't work:

In [None]:
(3 > 100) and (3 < 1_000)

In [None]:
(df["Fare"] > 100) and (df["Fare"] < 1_000)  # Fails with ValueError: the truth value of a Series is ambiguous

To combine boolean pandas `Series`, you need to use the bitwise operators instead:

- Bitwise OR `|`
- Bitwise AND `&`
- Bitwise NOT `~`

In [None]:
# Select rows with "Fare" between 100 and 1000
fare_filter = (100 < df["Fare"]) & (df["Fare"] < 1_000)
df.loc[fare_filter]

In [None]:
# Select rows where "Age" is not null, equivalent to
# df.loc[df["Age"].notnull()]
df.loc[~df["Age"].isnull()]
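The remaining operator, `|`, works the same way. The sketch below uses hypothetical toy data that only borrows the column names of the Titanic table. The parentheses around each comparison are needed because the bitwise operators bind more tightly than `<` and `>`:

```python
import pandas as pd

# Hypothetical toy data reusing two Titanic column names
toy = pd.DataFrame({"Fare": [7.25, 120.0, 512.3, 30.0], "Pclass": [3, 1, 1, 2]})

# Rows that are either very cheap OR very expensive
cheap_or_expensive = (toy["Fare"] < 10) | (toy["Fare"] > 500)
print(toy.loc[cheap_or_expensive])

# Rows that are NOT in first class
not_first = ~(toy["Pclass"] == 1)
print(toy.loc[not_first])

# `and` asks for the truth value of a whole Series, which is ambiguous
try:
    (toy["Fare"] > 100) and (toy["Fare"] < 1_000)
except ValueError as exc:
    print("ValueError:", exc)
```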

## Exercises

### 2. Selecting rows and columns

From the `grandes-tenedores-madrid.csv` dataset, select the rows where the column "Inmuebles" is larger than 300 and "Subsector" is not "Fondos de inversión".

Then, keep only the columns "NIF", "Matriz", "Sede empresa matriz", and "Sector".

## 1-dimensional data: pandas `Series`

Each of the columns of a `DataFrame` is a `Series` object. A `Series` is mutable (hence you can add, remove, and replace elements) and homogeneous (all its elements must have the same type). As you will see, a `Series` can be indexed by numerical position (like a list) or by label (like a dictionary).

For example, let's create a `Series` from a Python `list`:

In [None]:
my_list = list(range(10, 15))
my_list

In [None]:
ser = pd.Series(my_list)
ser

In [None]:
type(ser)

`Series` have an important property, the `dtype`, that holds the type of its individual elements. As you can see in the string representation of the `Series`, the `dtype` of `ser` is `int64`, which means that `ser` is a `Series` of integers.

In [None]:
ser.dtype

<div class="alert alert-info">The <code>64</code> part in <code>int64</code> alludes to how much memory pandas uses to store each of those integers, which is fixed to 64 bits. This means that, contrary to Python integers, there is a maximum integer pandas can store in an <code>int64</code>. But you don't need to worry about that now.</div>
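If you are curious, the limits of `int64` can be inspected with NumPy, the library pandas builds on. This is only a sketch for illustration:

```python
import numpy as np
import pandas as pd

limits = np.iinfo(np.int64)
print(limits.min, limits.max)  # smallest and largest values an int64 can hold

# The maximum still fits in an int64 Series
big = pd.Series([limits.max], dtype="int64")
print(big)

# A plain Python integer has no such limit
print(2**63)
```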

If `pd.Series` receives heterogeneous input, it will fall back to the `object` dtype and the `Series` will be less useful (many numerical operations will be slower or will not work as expected):

In [None]:
pd.Series([True, True, False])  # dtype: bool

In [None]:
pd.Series([True, True, False, 2])  # dtype: object
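To see what the `object` dtype implies, compare a mixed `Series` with a purely numerical one. The values are hypothetical:

```python
import pandas as pd

mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)               # object
print([type(x) for x in mixed])  # each element keeps its own Python type

numbers = pd.Series([1, 2, 3.0])
print(numbers.dtype)             # float64: integers and floats share one numerical dtype
print(numbers * 2)
```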

`Series` objects have a large number of methods, including some typical statistical and aggregation ones:

In [None]:
ser.max(), ser.min()

In [None]:
ser.mean(), ser.median(), ser.std()

And some more sophisticated methods:

In [None]:
ser.describe()
# ser.describe([.33, .66, .99])  # notice that you can specify the percentiles!

Series made of non-numerical dtypes will have a different result for `.describe()`:

In [None]:
pd.Series([1, 2, 2, 2, 3.0, "five"]).describe()

A `Series` has some `values` and an `index`. The `index` will become very important when working with `DataFrame`s.

In [None]:
ser

In [None]:
ser.values

In [None]:
ser.index

`Series` objects can also be created from dictionaries, in which case the index will change:

In [None]:
pd.Series({"a": 1, "b": 2})

You can refer to values by index using the `.loc` accessor:

In [None]:
ser.loc[0]

<div class="alert alert-warning">Notice that we don't call <code>loc</code> a <em>method</em> because it is not called! It is a special object that "abuses" the indexing/slicing syntax. If you call it with parentheses, you will get an error.</div>

The slicing semantics of `.loc` are slightly different from those of plain Python. Most importantly, the end of the slice is included!

In [None]:
ser.loc[1:3]
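The difference is easiest to see side by side with a plain list. This sketch rebuilds `my_list` and `ser` so that it runs on its own:

```python
import pandas as pd

my_list = list(range(10, 15))
ser = pd.Series(my_list)

print(my_list[1:3])           # [11, 12]: plain Python slicing excludes the end
print(ser.loc[1:3].tolist())  # [11, 12, 13]: label slicing includes the end
```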

To access items by numerical position instead, use `.iloc`:

In [None]:
ser2 = pd.Series({"a": 1, "b": 2})
ser2

In [None]:
ser2.loc["a"]

In [None]:
ser2.loc["b"]

In [None]:
ser2.iloc[0]

In [None]:
ser2.iloc[1]
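The distinction matters most when the labels are integers that do not match the positions. A minimal sketch, where `ser3` is a hypothetical `Series` created only for this example:

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
ser3 = pd.Series([10, 20, 30], index=[2, 0, 1])

print(ser3.loc[0])              # 20: the value whose label is 0
print(ser3.iloc[0])             # 10: the value in the first position
print(ser3.iloc[0:2].tolist())  # [10, 20]: position slices exclude the end, like lists
```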

## Exercises

### 3. Advanced: Managing semi-structured data

Load the `rick-and-morty.json` data into a Python object, store the list of episodes in a variable `episodes`, and pass it to the method `pandas.DataFrame.from_records` to turn it into a `DataFrame`.

Then, answer the same questions from session 6, using exclusively pandas methods (no comprehensions or loops).

In [None]:
import requests

DATA_URL = (
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/"
    "raw/main/data/rick-and-morty.json"
)

data = requests.get(DATA_URL).json()
print(type(data), len(data))

In [None]:
df = pd.DataFrame.from_records(data["_embedded"]["episodes"])
df.head()
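If the network is not available, the same idea can be tried on a small list of dictionaries. The records below are hypothetical and their fields are assumptions, not the actual structure of the JSON file:

```python
import pandas as pd

# Hypothetical records, shaped like a list of JSON objects
records = [
    {"id": 1, "name": "Episode A", "season": 1},
    {"id": 2, "name": "Episode B", "season": 1},
    {"id": 12, "name": "Episode C", "season": 2},
]

episodes_df = pd.DataFrame.from_records(records)
print(episodes_df.columns.tolist())  # the dictionary keys become the columns

# Filter rows and pick one column in a single .loc call
first_season = episodes_df.loc[episodes_df["season"] == 1, "name"]
print(first_season)
```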