# Session 06

[![Open and Execute in Google Colaboratory](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/astrojuanlu/ie-mbd-python-data-analysis-i/blob/main/sessions/Session%2006.ipynb)

- List, dictionary, and set comprehensions
- Filtering comprehensions
- Dealing with missing keys
- Exercises
    - Dealing with information in several containers
    - Learning about unique values
    - Storing information and mutating it
    - Debugging techniques

## List, dictionary, and set comprehensions

Python provides a compact way of creating containers by iterating over the elements of another container: comprehensions.

Most simple data structures seen so far support comprehensions. For example, for lists:

In [None]:
numbers = list(range(10))
print(numbers)

In [None]:
square_numbers = [num ** 2 for num in numbers]
square_numbers
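
A comprehension can be read as shorthand for a loop that appends to a container. As a minimal sketch, the following builds the same list with an explicit `for` loop (the name `square_numbers_loop` is only used here for illustration):

```python
numbers = list(range(10))

# Explicit loop: start with an empty list and append one item at a time
square_numbers_loop = []
for num in numbers:
    square_numbers_loop.append(num ** 2)

# The comprehension expresses the same idea in a single expression
square_numbers = [num ** 2 for num in numbers]

print(square_numbers_loop == square_numbers)  # True
```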

A similar syntax works for dictionaries and sets:

In [None]:
{
    word.lower()
    for word in
    "This is a complete sentence, and that is another complete sentence".split()
}

In [None]:
{
    word.lower(): len(word)
    for word in
    "This is a complete sentence, and that is another complete sentence".split()
}
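
Sets and dictionaries discard duplicates, so these comprehensions can produce fewer items than the input has. The sketch below uses the same sentence; the variable names are introduced only for illustration. Note that `split()` does not remove punctuation, so `"sentence,"` and `"sentence"` count as different words:

```python
sentence = "This is a complete sentence, and that is another complete sentence"
words = sentence.split()

unique_words = {word.lower() for word in words}
word_lengths = {word.lower(): len(word) for word in words}

print(len(words))         # 11 words in the sentence
print(len(unique_words))  # 9 distinct lowercase words
print(len(word_lengths))  # 9 keys, one per distinct word
```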

<div class="alert alert-warning">Wrapping the same syntax in parentheses does not create a "tuple comprehension". It creates a different kind of object: a generator expression. We will not cover generators in this course.

Remember: it is commas, not parentheses, that create tuples!</div>

In [None]:
(number ** 2 for number in range(5))
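
If a tuple is what you actually need, one option is to pass the same expression to `tuple()`. This is only a small sketch; the names `squares` and `squares_tuple` are illustrative:

```python
# Parentheses alone give a generator object, not a tuple
squares = (number ** 2 for number in range(5))
print(type(squares).__name__)  # generator

# Passing the expression to tuple() builds the tuple
squares_tuple = tuple(number ** 2 for number in range(5))
print(squares_tuple)  # (0, 1, 4, 9, 16)
```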

## Filtering comprehensions

A handy feature of comprehensions is that they can be _filtered_ by adding an `if` part at the end:

In [None]:
numbers = list(range(10))
print(numbers)

In [None]:
even_numbers = [
    num
    for num in
    numbers
    if num % 2 == 0
]
even_numbers

In [None]:
long_words = {
    word.lower(): len(word)
    for word in
    "This is a complete sentence, and that is another complete sentence".split()
    if len(word) > 5
}
long_words
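
A trailing `if` is easy to confuse with a conditional expression (`x if condition else y`) placed at the start of the comprehension. The sketch below contrasts the two; the `labels` list is an illustrative addition:

```python
numbers = list(range(10))

# Filter: the trailing `if` decides which items are kept
even_numbers = [num for num in numbers if num % 2 == 0]

# Conditional expression: every item is kept, but its value depends on the test
labels = ["even" if num % 2 == 0 else "odd" for num in numbers]

print(even_numbers)  # [0, 2, 4, 6, 8]
print(len(labels))   # 10
```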

## Dealing with missing keys

Sometimes, looking up a key with square brackets inside a loop or comprehension is not robust: if any of the dictionaries lacks the key, Python raises a `KeyError`:

In [None]:
data = [
    {"text": "ABC"},
    {"text": "DEF"},
    {"text": "GHI"},
    {},  # This is empty!
]
[row["text"] for row in data]  # Raises KeyError: the last row has no "text" key

For those cases, it's often better to use the `.get` method, which returns a fallback value if the key is not present:

In [None]:
{"key": "present"}.get("_MISSING_KEY", "MISSING")

In [None]:
[row.get("text", "<empty>") for row in data]
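
Two related variations, sketched on the same data: calling `.get` without a second argument falls back to `None`, and a filtered comprehension can skip the incomplete rows altogether. The variable names are illustrative:

```python
data = [{"text": "ABC"}, {"text": "DEF"}, {"text": "GHI"}, {}]

# Without a second argument, .get returns None for a missing key
texts_or_none = [row.get("text") for row in data]
print(texts_or_none)  # ['ABC', 'DEF', 'GHI', None]

# Alternatively, keep only the rows that contain the key
texts_present = [row["text"] for row in data if "text" in row]
print(texts_present)  # ['ABC', 'DEF', 'GHI']
```

Which one fits better depends on whether later steps need one output item per input row.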

## Exercises

### Rick & Morty shows data

In [None]:
import requests

DATA_URL = (
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/"
    "raw/main/data/rick-and-morty.json"
)

data = requests.get(DATA_URL).json()
print(type(data), len(data))

Basic aggregates:

- How many episodes were there in the show in total?
- How many episodes did each season have?
- When were the earliest and latest episodes aired?
- At what times were episodes aired?
- What are the mean and standard deviation of the average ratings of all the episodes?
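
As a hint for the counting and statistics questions, here is a minimal sketch on made-up records. The keys `season` and `average_rating` and all the values are assumptions for illustration, and the real data may be nested differently. It counts items per group with a dictionary and `.get`, and uses the standard library `statistics` module:

```python
import statistics

# Hypothetical records, NOT taken from the dataset
episodes = [
    {"season": 1, "average_rating": 8.0},
    {"season": 1, "average_rating": 9.0},
    {"season": 2, "average_rating": 7.0},
    {"season": 2, "average_rating": 8.0},
]

# Count episodes per season, starting each counter at 0 with .get
episodes_per_season = {}
for episode in episodes:
    season = episode["season"]
    episodes_per_season[season] = episodes_per_season.get(season, 0) + 1

ratings = [episode["average_rating"] for episode in episodes]
mean_rating = statistics.mean(ratings)
std_rating = statistics.pstdev(ratings)  # population standard deviation

print(episodes_per_season)  # {1: 2, 2: 2}
print(mean_rating)          # 8.0
```

`statistics.pstdev` treats the values as the whole population; `statistics.stdev` would compute the sample standard deviation instead.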

Derived data:

- Create a simplified version of the data with only these keys:

```
{
  'id': ...,
  'name': ...,
  'season': ...,
  'number': ...,
  'airdate': ...,
  'airtime': ...,
  'runtime': ...,
  'average_rating': ...,
  'summary': ...,
}
```

- Add a "top" field to each episode, which is `True` if the episode's average rating is above the overall average and `False` otherwise
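
One possible approach to adding a field, sketched on made-up records (the names and ratings are hypothetical), is to compute the overall average first and then mutate each dictionary in a loop:

```python
# Hypothetical records, NOT taken from the dataset
episodes = [
    {"name": "A", "average_rating": 8.0},
    {"name": "B", "average_rating": 9.0},
    {"name": "C", "average_rating": 7.0},
]

ratings = [episode["average_rating"] for episode in episodes]
overall_average = sum(ratings) / len(ratings)

# Assigning to a new key mutates each dictionary in place
for episode in episodes:
    episode["top"] = episode["average_rating"] > overall_average

print([episode["top"] for episode in episodes])  # [False, True, False]
```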

Text processing:

- How many times are "Rick" and "Morty" mentioned across all episode summaries? Who is mentioned the most?
- What are the 15 most common words across all summaries?
- Same as above, but excluding common English stopwords ("a", "to", "and", "the", "of", ...)
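
For counting words, `collections.Counter` from the standard library may be handy. The sketch below runs on two made-up summaries and a tiny stopword set, both of which are assumptions. The real summaries may also contain punctuation or markup that needs cleaning first:

```python
from collections import Counter

# Hypothetical summaries, NOT taken from the dataset
summaries = [
    "Rick takes Morty to another dimension",
    "Morty and Rick go to the market",
]
stopwords = {"a", "to", "and", "the", "of"}

# Collect every lowercase word from every summary
words = []
for summary in summaries:
    words.extend(summary.lower().split())

word_counts = Counter(words)
filtered_counts = Counter([word for word in words if word not in stopwords])

print(word_counts["rick"], word_counts["morty"])  # 2 2
print(filtered_counts.most_common(2))
```

`most_common(n)` returns the `n` most frequent items as `(word, count)` pairs.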

### Reddit data

In [None]:
import requests

DATA_URL = (
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/"
    "raw/main/data/reddit_popular.json"
)

reddit_all = requests.get(DATA_URL).json()
print(type(reddit_all), len(reddit_all))

Basic aggregates:

- How many Reddit posts are there?
- How many different "kinds" are there?
- Are all the post IDs unique?
- Does any post have a nonzero number of downvotes?
- How many different subreddits are represented?
- How many unique values for "flair text" are present?
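
Questions about unique values usually come down to building a set. The sketch below uses made-up posts with flat `id` and `subreddit` keys, which is an assumption: the real Reddit data is probably nested and needs to be explored first.

```python
# Hypothetical posts, NOT taken from the dataset
posts = [
    {"id": "a1", "subreddit": "MadeMeSmile"},
    {"id": "b2", "subreddit": "mildlyinfuriating"},
    {"id": "c3", "subreddit": "MadeMeSmile"},
]

post_ids = [post["id"] for post in posts]

# If converting to a set loses no items, every ID is unique
all_ids_unique = len(set(post_ids)) == len(post_ids)

# A set comprehension collects the distinct subreddits
subreddits = {post["subreddit"] for post in posts}

print(all_ids_unique)   # True
print(len(subreddits))  # 2
```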

Data mining:

- What is the "edited" field?
- How are the "is_self" and "domain" fields related?

Advanced:

- There are a few posts from "positive" subreddits (r/MadeMeSmile) and a few from "negative" subreddits (r/mildlyinfuriating, r/Wellthatsucks). Add a "sentiment" key to those posts, with the value "1" or "-1" for positive or negative.
- Compute the average upvote ratio of positive vs negative posts
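
A lookup dictionary combined with `.get` is one way to label only the posts that belong to known subreddits. The sketch below uses made-up posts with a flat `subreddit` key and stores the sentiment as an integer. Both choices are assumptions, and the name `SENTIMENT_BY_SUBREDDIT` is hypothetical:

```python
SENTIMENT_BY_SUBREDDIT = {
    "MadeMeSmile": 1,
    "mildlyinfuriating": -1,
    "Wellthatsucks": -1,
}

# Hypothetical posts, NOT taken from the dataset
posts = [
    {"subreddit": "MadeMeSmile"},
    {"subreddit": "Wellthatsucks"},
    {"subreddit": "AskReddit"},
]

for post in posts:
    # .get returns None for subreddits that are neither positive nor negative
    sentiment = SENTIMENT_BY_SUBREDDIT.get(post["subreddit"])
    if sentiment is not None:
        post["sentiment"] = sentiment

print([post.get("sentiment") for post in posts])  # [1, -1, None]
```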