# Session 10

[![Open and Execute in Google Colaboratory](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/astrojuanlu/ie-mbd-python-data-analysis-i/blob/main/sessions/Session%2010.ipynb)

- Reading semi-structured data into pandas
- String methods on pandas columns
- The "group by and aggregate" operation
- The `agg` method

## Reading semi-structured data into pandas

pandas DataFrames are always table-like objects, but that doesn't mean that they're limited to CSV data. In fact, you can read many other formats using the different `pandas.read_*` functions.

For example, it's possible to read semi-structured data into a pandas DataFrame. Let's do an example with JSON:

In [None]:
# BLUESKY_DATA_URL = "../data/bluesky_more_5000_likes_filtered.json"
BLUESKY_DATA_URL = (
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/"
    "raw/main/data/bluesky_more_5000_likes_filtered.json"
)

In [None]:
import pandas as pd

df = pd.read_json(BLUESKY_DATA_URL)
df.head()

Notice several things:

(1) Both the `text` and `langs` columns have dtype `object`, even though the former is made of strings and the latter is made of lists!

In [None]:
df.dtypes

(2) If a given field is present in at least one record, the records that don't have it will hold a `NaN` value. You will learn more about handling missing data in the next session.
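
Both observations can be reproduced offline on a tiny example. The sketch below uses two hypothetical records invented for illustration (they are not taken from the Bluesky file) and reads them from an in-memory buffer instead of a URL. Depending on your pandas version, the string column may be reported with a dedicated string dtype rather than `object`.

```python
import io

import pandas as pd

# Hypothetical records invented for this illustration (not from the Bluesky file).
raw = io.StringIO(
    '[{"text": "hello", "langs": ["en"], "like_count": 10},'
    ' {"text": "hola", "langs": ["es", "en"]}]'
)
toy = pd.read_json(raw)

# The column of lists is stored with the generic `object` dtype
print(toy.dtypes)
# Each cell of `langs` is still a regular Python list
print(toy["langs"].map(type).unique())
# The second record has no "like_count" key, so it holds a missing value
print(toy["like_count"].isna().tolist())
```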

---

Alternatively, you could also read this data using the `requests` package into a native Python object:

In [None]:
# import json
import requests

# data = json.load(open(BLUESKY_DATA_URL))
data = requests.get(BLUESKY_DATA_URL).json()
type(data), len(data)

And then use one of the `pandas.DataFrame.from_*` class methods (notice that these live in a different namespace than the `pandas.read_*` functions):

In [None]:
pd.DataFrame.from_records(data).head(2)
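
The same two-step route can be tried without a network connection. This is a minimal sketch that assumes a hypothetical JSON string (`payload`, with made-up values) in place of the downloaded response: `json.loads` turns it into a plain list of dictionaries, and `from_records` builds the table.

```python
import json

import pandas as pd

# Hypothetical JSON text standing in for a downloaded response body.
payload = (
    '[{"uri": "post-1", "like_count": 5200},'
    ' {"uri": "post-2", "like_count": 7100, "reply_count": 12}]'
)
records = json.loads(payload)  # a plain list of dicts

toy = pd.DataFrame.from_records(records)
print(type(records).__name__, len(records))
print(toy)
```

Keys that are missing from a record, such as `reply_count` in the first one, show up as missing values in the resulting DataFrame.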

## Exercises

### 1. JSON reading

The `rick-and-morty.json` data is not so easy to read directly as JSON. Go ahead and try it. What's the error?

Use the alternative method: load the `rick-and-morty.json` data to a Python object, then store the episodes list in a variable `episodes`, then pass it to the method `pandas.DataFrame.from_records` to turn the list of episodes into a `DataFrame`.

List the first 5 rows to verify that it was correctly loaded. **Notice that some columns contain dictionaries**.

In [None]:
import requests

RICK_MORTY_DATA_URL = (
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/"
    "raw/main/data/rick-and-morty.json"
)

rm_data = requests.get(RICK_MORTY_DATA_URL).json()
print(type(rm_data), len(rm_data))

In [None]:
import pandas as pd

In [None]:
# Use rm_data (the Rick & Morty payload), not the Bluesky `data` loaded earlier
rm_df = pd.DataFrame.from_records(rm_data["_embedded"]["episodes"])
rm_df.head()

## The "group by and aggregate" operation

Group by operations in pandas are essential to perform advanced aggregations. The concept and the syntax are directly borrowed from SQL, and follow a similar "split-apply-combine" procedure. At a very high level, this is what happens:

![Group by and aggregate](../img/group-by-agg.png)

Group by operations are initiated by calling the `groupby` method of a DataFrame. But notice that this returns an intermediate object:

In [None]:
df.head(1)

In [None]:
df.groupby("instance")

To effectively use this object, you have to finalize the operation by calling an aggregation. The result will be another pandas object, with the index containing each of the distinct values of the column you are grouping by.
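
To make the "split-apply-combine" idea concrete, here is a rough sketch on a hypothetical toy DataFrame (the instance names and counts are made up). It does the three steps by hand, iterating over the intermediate object, and compares the outcome with the equivalent one-line aggregation.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
toy = pd.DataFrame(
    {
        "instance": ["a.example", "b.example", "a.example", "b.example", "a.example"],
        "like_count": [10, 20, 30, 40, 50],
    }
)

grouped = toy.groupby("instance")
print(type(grouped).__name__)  # the intermediate object, not a DataFrame

# Split: iterating yields (group name, sub-DataFrame) pairs.
# Apply: sum the likes of each piece.
pieces = {name: group["like_count"].sum() for name, group in grouped}
# Combine: collect the per-group results into a single Series.
manual = pd.Series(pieces)

# The aggregation does all three steps at once
result = grouped["like_count"].sum()
print(result)
```

In practice you would only write the last form; the manual loop is there to show what the aggregation does for you.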

For example, to know how many posts there are per instance:

In [None]:
df.groupby("instance").size()

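And to extract the first post of each instance (the earliest one only if the rows are sorted chronologically; note that `first` returns the first non-null value of each column):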

In [None]:
df.groupby("instance").first()
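
One detail worth seeing on a small example: `first` works column by column and skips missing values, so the row it returns for a group may mix values from different original rows. The sketch below uses a hypothetical toy DataFrame (made-up values) where the first post of one instance has no text.

```python
import pandas as pd

# Hypothetical toy data; the first row of "a.example" has a missing text.
toy = pd.DataFrame(
    {
        "instance": ["a.example", "a.example", "b.example"],
        "text": [None, "second post", "only post"],
        "like_count": [10, 30, 20],
    }
)

sizes = toy.groupby("instance").size()     # number of rows per group
firsts = toy.groupby("instance").first()   # first non-null value per column

print(sizes)
print(firsts)
```

For `a.example`, the `text` comes from the second row while `like_count` comes from the first one.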

Some aggregations only work with specific data types. For example, you might be interested in average statistics of some numerical columns:

In [None]:
(
    df.loc[:, ["instance", "like_count", "reply_count"]]
    .groupby("instance")
    .mean()
)
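
Selecting the numerical columns first matters because, depending on the pandas version, calling `.mean()` on groups that still contain text columns may raise an error. As a sketch on hypothetical toy data (made-up values), these two spellings give the same result: selecting the columns after grouping, or passing `numeric_only=True`.

```python
import pandas as pd

# Hypothetical toy data with one text column and two numerical ones.
toy = pd.DataFrame(
    {
        "instance": ["a.example", "a.example", "b.example"],
        "text": ["hi", "hello", "hey"],
        "like_count": [10, 30, 20],
        "reply_count": [1, 3, 4],
    }
)

# Option 1: pick the columns to aggregate after grouping
selected = toy.groupby("instance")[["like_count", "reply_count"]].mean()

# Option 2: let pandas drop the non-numerical columns
numeric = toy.groupby("instance").mean(numeric_only=True)

print(selected)
```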

## Exercises

### 2. Analyzing semi-structured data

Answer the remaining questions about the Rick & Morty data from Session 06, using exclusively pandas methods (no comprehensions or loops).

## The `agg` method

Sometimes you want to apply more complex aggregations, or stack several of them, or apply different aggregations to different columns. The `.agg` method of the `DataFrameGroupBy` object allows you to do all that.

In [None]:
(
    df.loc[:, ["instance", "like_count", "reply_count"]]
    .groupby("instance")
    .agg(["mean", "std"])  # A list of aggregation functions
)
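
When a list of functions is passed, the result has two levels of column labels: the original column name and the aggregation name. A minimal sketch on hypothetical toy data (made-up values) shows how the labels look and how to pick one of the resulting columns with a tuple.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
toy = pd.DataFrame(
    {
        "instance": ["a.example", "a.example", "b.example", "b.example"],
        "like_count": [10, 30, 20, 40],
        "reply_count": [1, 3, 4, 6],
    }
)

stats = toy.groupby("instance").agg(["mean", "std"])

# Each column label is a (column, aggregation) pair
print(stats.columns.tolist())

# Select a single aggregated column with a tuple
mean_likes = stats[("like_count", "mean")]
print(mean_likes)
```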

In [None]:
(
    df.loc[:, ["instance", "like_count", "reply_count"]]
    .groupby("instance")
    # A dictionary/mapping of column names to aggregation functions
    .agg({"like_count": "mean", "reply_count": ["sum", "std"]})
)
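
If the nested column labels are inconvenient, `agg` also accepts keyword arguments of the form `new_name=(column, function)`, sometimes called named aggregation, which produces flat column names. The sketch below uses hypothetical toy data, and the output names `avg_likes` and `total_replies` are arbitrary choices made for this example.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
toy = pd.DataFrame(
    {
        "instance": ["a.example", "a.example", "b.example", "b.example"],
        "like_count": [10, 30, 20, 40],
        "reply_count": [1, 3, 4, 6],
    }
)

summary = toy.groupby("instance").agg(
    avg_likes=("like_count", "mean"),       # output name = (column, function)
    total_replies=("reply_count", "sum"),
)
print(summary)
```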

## Exercises

### 3. More analysis of semi-structured data

Load the Reddit data from Session 06 and answer the questions there, using exclusively pandas methods (no comprehensions or loops).

In [None]:
REDDIT_DATA_URL = (
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/"
    "raw/main/data/reddit_popular.json"
)