# Session 13

[![Open and Execute in Google Colaboratory](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/astrojuanlu/ie-mbd-python-data-analysis-i/blob/main/sessions/Session%2013.ipynb)

- Dealing with missing data
- Defining outliers and finding them
- Practice: Given a dataset, solve business questions

In [None]:
import pandas as pd

## Dealing with missing data

As you have already noticed, many pandas operations produce missing data in the form of `NaN`.

Filtering this data with an ordinary equality comparison doesn't work because, following the IEEE 754 standard, `NaN` compares unequal to every value, including itself:

In [None]:
float("nan") == float("nan")

pandas DataFrames have special methods to deal with missing data.
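
As a minimal sketch of the difference, the toy Series below (made up for illustration, not part of the course data) shows that an equality mask never matches `NaN`, whereas `isna()` does:

```python
import numpy as np
import pandas as pd

# Toy data, invented for this illustration
s = pd.Series([1.0, np.nan, 3.0])

# Equality never matches NaN, so this mask is all False
mask_eq = s == np.nan

# isna() is the NaN-aware test
mask_isna = s.isna()

print(mask_eq.tolist())
print(mask_isna.tolist())
```

The same idea applies to DataFrame columns: build the boolean mask with `isna()` (or `notna()` for the opposite) and pass it to `.loc`.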

First, let's rebuild the augmented real estate dataset:

In [None]:
df_madrid = pd.read_csv(
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/raw/main/"
    "data/grandes-tenedores-madrid.csv"
)
df_madrid.head()

In [None]:
df_spain = pd.read_csv(
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/raw/main/"
    "data/megatenedores_estatal_2024.csv"
)
df_spain.head()

In [None]:
df_madrid_company_data_simple = (
    df_madrid.loc[:, ["NIF", "Filial propietaria directa", "Matriz"]]
    .drop_duplicates(subset="Matriz")
)

In [None]:
df_spain_augmented = (
    df_spain
    .merge(
        df_madrid_company_data_simple,
        how="left",
        on="Matriz",
    )
)
df_spain_augmented.head()
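
A left merge keeps every row of the left table, so rows without a match on the right end up with `NaN` in the new columns. The sketch below illustrates this on two small, hypothetical tables (the `viviendas` column and all values are invented); it uses the `indicator=True` option of `merge` to label where each row came from:

```python
import pandas as pd

# Hypothetical toy tables standing in for the real datasets
left = pd.DataFrame({"Matriz": ["A", "B", "C"], "viviendas": [10, 20, 30]})
right = pd.DataFrame({"Matriz": ["A", "C"], "NIF": ["X1", "X3"]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
merged = left.merge(right, how="left", on="Matriz", indicator=True)

# "left_only" rows had no match on the right, so their NIF is missing
match_counts = merged["_merge"].value_counts()

print(merged)
print(match_counts)
```

Checking the `_merge` counts right after a merge is one way to see how much missing data the merge itself introduced.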

Filtering rows with missing data:

In [None]:
df_spain_augmented.loc[
    df_spain_augmented["NIF"].isna()
].head()
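
Before deciding what to do with missing values, it often helps to know how many there are. A possible sketch on a hypothetical toy DataFrame (names and values are invented): summing the boolean mask counts the missing entries, and taking its mean gives the share of missing rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real augmented dataset
df_toy = pd.DataFrame(
    {
        "Matriz": ["A", "B", "C", "D"],
        "NIF": ["X1", np.nan, "X3", np.nan],
    }
)

# True counts as 1, so sum() gives the number of missing values per column
missing_per_column = df_toy.isna().sum()

# mean() of a boolean mask is the fraction of True values
missing_share = df_toy["NIF"].isna().mean()

print(missing_per_column)
print(missing_share)
```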

Dropping rows with missing data:

In [None]:
df_spain_augmented.dropna(subset="NIF").head()

Filling missing data:

In [None]:
df_spain_augmented.fillna({"NIF": "<UNKNOWN>"}).head()
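
Note that both `dropna` and `fillna` return a new DataFrame and leave the original untouched unless the result is assigned back. A small sketch on hypothetical toy data (invented for illustration) makes the effect of each method visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real augmented dataset
df_toy = pd.DataFrame(
    {
        "Matriz": ["A", "B", "C", "D"],
        "NIF": ["X1", np.nan, "X3", np.nan],
    }
)

# Dropping removes the rows where NIF is missing
dropped = df_toy.dropna(subset=["NIF"])

# Filling keeps every row and replaces the missing NIF values
filled = df_toy.fillna({"NIF": "<UNKNOWN>"})

print(len(df_toy), len(dropped), len(filled))
print(filled["NIF"].tolist())
```

Dropping shrinks the dataset, while filling preserves its size at the cost of introducing a placeholder value; which one is appropriate depends on the question being asked.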

## Exercises

### 1. Missing location and puzzling columns

- Observe that, in the European Commission dataset, the `location` column is sometimes null. Inspect how many rows have this property. What would you do with those? (Open question)
- There are some trailing columns in the dataset with some null values. How many non-null values do they have? What would you do with those? (Open question with a "more correct" answer)

## Defining outliers and finding them

When performing exploratory data analysis on a dataset, it's often important to understand the distribution of the numerical variables and to spot outliers, if any. Take the wildfires dataset, for example:

In [None]:
import pandas as pd

In [None]:
df_wf = pd.read_csv("../data/fires-subset.csv")
df_wf.head()

In [None]:
df_wf["superficie"].describe()

In [None]:
ax = df_wf["superficie"].plot.hist(bins=100)
ax.set_yscale("log")

As you can see (note the logarithmic y-axis), the vast majority of wildfires are relatively small, but a few of them are disproportionately large.

There are different ways to spot outliers. Two simple methods are:

- Z-score cutoffs
- Percentile / quantile cutoffs

The z-score is "the number of standard deviations by which the value of a raw score is above or below the mean value", that is, $z = (x - \mu) / \sigma$, where $\mu$ is the mean and $\sigma$ the standard deviation.

In [None]:
# 3 standard deviations above the mean
df_wf.loc[
    df_wf["superficie"]
    > (df_wf["superficie"].mean() + df_wf["superficie"].std() * 3)
]
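
The same cutoff can be expressed by computing the z-score explicitly, which is sometimes easier to read and reuse. A minimal sketch, assuming a hypothetical Series of burned areas (the values are invented, not taken from the wildfires dataset):

```python
import pandas as pd

# Hypothetical burned areas: 19 small fires and one very large one
areas = pd.Series([1.0] * 19 + [100.0], name="superficie")

# pandas' std() uses the sample standard deviation (ddof=1) by default
z_scores = (areas - areas.mean()) / areas.std()

# Rows more than 3 standard deviations above the mean
outliers = areas[z_scores > 3]

print(outliers)
```

Keep in mind that the mean and the standard deviation are themselves affected by extreme values, so a few very large observations can raise the cutoff.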

In [None]:
# above the 99th percentile
df_wf.loc[
    df_wf["superficie"]
    > df_wf["superficie"].quantile(0.99)
]
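
The two methods do not necessarily agree. A quantile cutoff flags roughly the same share of rows by construction, even when no value is extreme, whereas a z-score cutoff may flag none. The sketch below illustrates this on hypothetical, evenly spread data (the integers 1 to 200, invented for this comparison):

```python
import pandas as pd

# Hypothetical, evenly spread values with no real outliers
values = pd.Series(range(1, 201), dtype=float)

# z-score cutoff: 3 standard deviations above the mean
z_cutoff = values.mean() + 3 * values.std()
n_z = (values > z_cutoff).sum()

# quantile cutoff: above the 99th percentile
q_cutoff = values.quantile(0.99)
n_quantile = (values > q_cutoff).sum()

print(n_z, n_quantile)
```

For this reason, rows above a percentile are better read as "the largest 1 %" than as proof that something unusual happened.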

## Exercises

### 3. Extra intense lobbyists

In the European Commission dataset, are there any lobbyists who meet far more frequently than others?

To find out, compute the number of meetings by lobbyist, and then apply some of the methods above to find outliers.

## Practice

Continue exploring the European Commission dataset:

- Extract the year from the `date` field. How many meetings per year happened? What is the year with the largest number of meetings?
- What are the most common locations for the meetings?
- Split the dataset before and after the onset of the COVID-19 pandemic. How do the most common locations differ between the two parts? Do the most common subjects change?