# Session 11

[![Open and Execute in Google Colaboratory](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/astrojuanlu/ie-mbd-python-data-analysis-i/blob/main/sessions/Session%2011.ipynb)

- String methods on pandas columns
- Combining several dataframes
    - Concat
    - Merge
    - Join

## String methods on pandas columns

In [None]:
# BLUESKY_DATA_URL = "../data/bluesky_more_5000_likes_filtered.json"
BLUESKY_DATA_URL = (
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/"
    "raw/main/data/bluesky_more_5000_likes_filtered.json"
)

In [None]:
import pandas as pd

In [None]:
df = pd.read_json(BLUESKY_DATA_URL)
df.head()

In [None]:
df["instance"].str.split(".")

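The `.str` accessor is not limited to strings: some of its methods, such as `.str.get()` and `.str.len()`, also work element-wise on columns that hold lists, tuples, or dictionaries. Here each row is a list produced by `split`, and `.str.get(0)` picks its first item: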

In [None]:
df["instance"].str.split(".").str.get(0)
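As a minimal sketch of the same idea on toy data (the handles below are made up and are not taken from the Bluesky dataset), the following shows `.str` methods applied to a column of lists:

```python
import pandas as pd

# Hypothetical values, used only for illustration
handles = pd.Series(["alice.bsky.social", "bob.example.com"])

parts = handles.str.split(".")  # each element is now a Python list
first = parts.str.get(0)        # element-wise indexing into each list
first_short = parts.str[0]      # shorthand for .str.get(0)
n_parts = parts.str.len()       # element-wise len() of each list
```

`.str[0]` and `.str.get(0)` are interchangeable here; `.str.get()` is the more explicit spelling when chaining several calls.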

## Exercises

### 1. Columns of dictionaries

Read the Rick & Morty data. The goal is to compute the average rating per season.

For that, notice that the rating column contains a dictionary with 1 key. Extract the numeric value and place it under a new column called "rating_num". Then, use groupby operations to compute the average rating per season.

## Combining several dataframes

### Concat

`pd.concat` adds one dataframe after another, either vertically (along rows, the default) or horizontally (along columns). It is useful when you have several dataframes that relate to the same data and you want to combine them into one, for example paginated results.

In [None]:
df_madrid = pd.read_csv(
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/raw/main/data/grandes-tenedores-madrid.csv"
)
df_madrid.head()

In [None]:
len(df_madrid)

Let's artificially split the dataframe using `.iloc[]`. It's like `.loc[]`, but it selects by integer position (0, 1, 2, ...) instead of by index label.
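The difference between position-based and label-based selection is easiest to see when the index labels do not match the row positions. The sketch below uses a small made-up frame (`toy` is a hypothetical name, not part of the Madrid data):

```python
import pandas as pd

# Toy frame whose index labels (2, 0, 1) differ from the row positions (0, 1, 2)
toy = pd.DataFrame({"value": [10, 20, 30]}, index=[2, 0, 1])

by_position = toy.iloc[0]  # first row in the frame, whatever its label
by_label = toy.loc[0]      # row whose index label is 0

# Slices differ too: .iloc excludes the end position, .loc includes the end label
first_row_only = toy.iloc[:1]  # position 0 only
up_to_label_1 = toy.loc[:1]    # every row up to and including label 1
```

With the default `RangeIndex`, as in `df_madrid`, positions and labels coincide, so the distinction only starts to matter after filtering, sorting, or setting a custom index.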

The first 100 rows will go in one dataset, the remaining rows into another:

In [None]:
df_madrid_1 = df_madrid.iloc[:100]
df_madrid_1.head(1)

In [None]:
df_madrid_2 = df_madrid.iloc[100:]
df_madrid_2.head(1)

In [None]:
len(df_madrid_1) + len(df_madrid_2) == len(df_madrid)

In [None]:
df_madrid_concat = pd.concat([df_madrid_1, df_madrid_2])

In [None]:
df_madrid_concat.equals(df_madrid)

Alternatively, we can split by columns:

In [None]:
df_madrid_left = df_madrid.iloc[:, :11]
df_madrid_left.head(1)

In [None]:
df_madrid_right = df_madrid.iloc[:, 11:]
df_madrid_right.head(1)

In [None]:
df_madrid_concat_cols = pd.concat([df_madrid_left, df_madrid_right], axis="columns")

In [None]:
df_madrid_concat_cols.equals(df_madrid)
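Both checks above succeed because the `.iloc` slices keep their original index labels, so `pd.concat` can put every row and column back in place. When the pieces come from independent sources, the index deserves some attention. The following is a sketch on made-up frames, assuming nothing about the Madrid data:

```python
import pandas as pd

top = pd.DataFrame({"name": ["a", "b"], "n": [1, 2]})
bottom = pd.DataFrame({"name": ["c"], "n": [3]})

# Row-wise concat keeps each piece's own labels, so labels can repeat
stacked = pd.concat([top, bottom])
# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([top, bottom], ignore_index=True)

# Column-wise concat matches rows by index label, not by position
left = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"y": [10, 20]}, index=[1, 2])
side_by_side = pd.concat([left, right], axis="columns")  # missing cells become NaN
```

For paginated results that each start counting at 0, `ignore_index=True` is usually the safer choice.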

## Exercises

### 2. More split and concat

Using the same Rick & Morty data, split the data into five datasets, one per season. Then, concatenate them all again. At the end, check that the assembled dataframe and the original one are the same.

### Merge

In [None]:
df_spain = pd.read_csv(
    # "../data/megatenedores_estatal_2024.csv"
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/raw/main/data/megatenedores_estatal_2024.csv"
)
df_spain.head()

In [None]:
df_madrid_company_data = df_madrid.loc[:, ["NIF", "Filial propietaria directa", "Matriz"]]
df_madrid_company_data.head()

In [None]:
len(df_madrid_company_data)

In [None]:
df_madrid_company_data["Matriz"].nunique()
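The cells above prepare two tables that share a `Matriz` column, which is the typical setup for `pd.merge`: rows from both tables are paired whenever their key values match. Since the columns of `df_spain` are not shown here, the sketch below uses made-up frames; every column name other than `Matriz` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real tables; "homes" and "country" are invented columns
properties = pd.DataFrame(
    {"Matriz": ["A", "A", "B", "C"], "homes": [10, 5, 7, 2]}
)
parents = pd.DataFrame(
    {"Matriz": ["A", "B", "D"], "country": ["ES", "US", "FR"]}
)

# Default how="inner": only keys present in both tables survive
inner = pd.merge(properties, parents, on="Matriz")

# how="left" keeps every row of the left table; indicator=True adds a
# "_merge" column saying where each row came from
left = pd.merge(properties, parents, on="Matriz", how="left", indicator=True)
```

Checking `len()` before and after a merge, or inspecting the `_merge` column, is a quick way to spot keys that failed to match.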

## Exercises

### 3. European Commission lobbyists

Below you can find some data coming from Civio on lobby meetings at the European Commission.

In [None]:
df_euco = pd.read_csv(
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/raw/main/data/commission-lobbists-meetings.csv"
)
df_euco.head()

In [None]:
df_euco.loc[df_euco["nr"] > 10]

However, we know very little about the lobbyists themselves. Below, you can find some extra (AI-generated) information on some of them.

In [None]:
lobby_data = pd.read_csv(
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/raw/main/data/"
    "lobby_data_ai.csv"
)
lobby_data.head()

Combine both datasets to answer these questions:

- What are the top 10 lobbyists in terms of number of meetings?
- How many meetings happened with US-based vs EU-based lobbyists?
- What was the most common policy orientation of the meetings?
- Now, answer the same questions, but using only meetings with a "high" number of representatives (let's use `nr > mean(nr)`)

### Join

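The `.join` method is like `.merge`, but it matches rows on the index by default and performs a left join unless told otherwise. This makes it more convenient when the key is already the index of both dataframes.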

In [None]:
df_madrid_company_data.head(1)

In [None]:
df_spain.head(1)

In [None]:
df_spain.set_index("Matriz")

In [None]:
df_spain.set_index("Matriz").join(df_madrid_company_data.set_index("Matriz"))
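To make the relationship between `.join` and `.merge` concrete, here is a sketch on made-up frames (again, every column other than `Matriz` is hypothetical). Under these assumptions, joining on the index should give the same table as a left merge on the key column followed by `set_index`:

```python
import pandas as pd

left = pd.DataFrame({"Matriz": ["A", "B", "C"], "homes": [10, 7, 2]})
right = pd.DataFrame({"Matriz": ["A", "B"], "country": ["ES", "US"]})

# .join aligns on the index and defaults to how="left"
joined = left.set_index("Matriz").join(right.set_index("Matriz"))

# The equivalent spelled out with .merge on the key column
merged = left.merge(right, on="Matriz", how="left").set_index("Matriz")

same = joined.equals(merged)
```

If the two frames share a non-key column name, `.join` raises an error unless `lsuffix` or `rsuffix` is passed, whereas `.merge` appends `_x` and `_y` suffixes automatically.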