# Session 04

[![Open and Execute in Google Colaboratory](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/astrojuanlu/ie-mbd-python-data-analysis-i/blob/main/sessions/Session%2004.ipynb)

- Sets, properties and methods
- Converting objects
- Special sequences: Ranges
- Dictionaries, properties and methods
- Basic import statements

## Sets

Sets are less common than lists or tuples, but very handy for some use cases: they are unordered, mutable collections of unique elements. This means that they can't contain duplicates:

In [None]:
my_set = {"a", "b", "c", "c", "c", "c", "d", "d"}
my_set

Since they are unordered, they cannot be indexed:

In [None]:
my_set[0]

But they can be mutated:

In [None]:
my_set.add("e")
my_set

And combined with each other using set operations:

![Venn diagram](../img/venn-diagram.png)

In [None]:
# Union
{"a", "b"} | {"b", "c"}

In [None]:
# Intersection
{"a", "b"} & {"b", "c"}

In [None]:
# Difference
{"a", "b"} - {"b", "c"}

In [None]:
# Symmetric difference (exclusive disjunction)
{"a", "b"} ^ {"b", "c"}
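Each of these operators also has an equivalent method, and membership can be tested with `in`. A minimal sketch, where the names `left` and `right` are chosen only for illustration:

```python
# Illustrative sets; the variable names are assumptions, not part of the session
left = {"a", "b"}
right = {"b", "c"}

# Method equivalents of the operators |, &, - and ^
union = left.union(right)
intersection = left.intersection(right)
difference = left.difference(right)
symmetric_difference = left.symmetric_difference(right)

# Membership test
has_a = "a" in left

print(union, intersection, difference, symmetric_difference, has_a)
```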

## Converting objects

In some cases you can convert one object into another. To do this, use the corresponding built-in function (for example `list`, `tuple`, `str`, or `int`):

In [None]:
0.1

In [None]:
str(0.1)  # Notice the quotes: this is a string!

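In [None]:
my_tuple = (1, 2, 3)  # Example tuple, defined here so the cell runs on its own
list(my_tuple)  # Converts the tuple into a list, notice the square brackets!

In [None]:
my_list = [1, 2, 3]  # Example list, defined here so the cell runs on its own
tuple(my_list)  # Converts the list into a tuple, notice the parentheses!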

In [None]:
list({"a", "b"} | {"b", "c"})

<div class="alert alert-warning">Remember: sets are unordered! Iterating over them yields every element, but the order is not guaranteed.</div>
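If a predictable order matters, one option is to sort the elements while converting. A small sketch, assuming the elements can be compared with each other (as strings can):

```python
letters = {"a", "b"} | {"b", "c"}

# sorted() accepts any iterable and always returns a new list
ordered = sorted(letters)
print(ordered)  # ['a', 'b', 'c']
```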

At other times, the conversion succeeds but loses information:

In [None]:
int(1.5)
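What gets lost here is the fractional part: `int` truncates towards zero rather than rounding. A short sketch to compare it with the built-in `round` (the variable names are illustrative):

```python
truncated_positive = int(1.5)   # 1: the fractional part is dropped
truncated_negative = int(-1.5)  # -1: truncation goes towards zero, not down
rounded = round(1.6)            # 2: round() picks the nearest integer instead

print(truncated_positive, truncated_negative, rounded)
```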

Or it fails outright:

In [None]:
int("hello")
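A failed conversion like this raises a `ValueError`. One possible way to deal with it is to catch the exception; the helper `to_int_or_none` below is hypothetical and only sketches the idea:

```python
def to_int_or_none(text):
    """Hypothetical helper: return int(text), or None if the conversion fails."""
    try:
        return int(text)
    except ValueError:
        return None


converted = to_int_or_none("42")
failed = to_int_or_none("hello")
print(converted, failed)  # 42 None
```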

## Special sequences

Python has some special sequences that are very useful, such as ranges:

In [None]:
range(1, 10)

That doesn't say much, but see what happens if you convert it to a list:

In [None]:
list(range(1, 10))
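Notice that the end value is excluded. Ranges also accept an optional step and support `len`, indexing, and membership tests without being converted first. A brief sketch with illustrative names:

```python
evens = list(range(0, 10, 2))      # start, stop (excluded), step
countdown = list(range(5, 0, -1))  # a negative step counts backwards
first_ten = range(10)              # a single argument means range(0, 10)

size = len(first_ten)
third = first_ten[2]
contains_ten = 10 in first_ten     # the stop value is never included

print(evens, countdown, size, third, contains_ten)
```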

## Dictionaries

After covering tuples, lists, and sets, the last essential data structure is the dictionary. Dictionaries are mappings between keys and values:

In [None]:
my_dictionary = {
    "key_1": "A",
    "key_2": "B",
    "key_3": "C",
}
my_dictionary

The keys and the values of the dictionary can be extracted separately:

In [None]:
my_dictionary.keys()

In [None]:
my_dictionary.values()

And `.items()` can be used to extract pairs of `(key, value)`:

In [None]:
my_dictionary.items()

These are special view objects that can be iterated over and converted to, say, lists:

In [None]:
list(my_dictionary.keys())
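These objects can also be looped over directly. A minimal sketch that unpacks each `(key, value)` pair from `.items()`, using a separate example dictionary (the name `letters_by_key` is an assumption made for illustration):

```python
letters_by_key = {"key_1": "A", "key_2": "B", "key_3": "C"}

pairs = []
for key, value in letters_by_key.items():
    # Each item is a (key, value) tuple, unpacked here into two names
    pairs.append(f"{key} -> {value}")

print(pairs)
```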

While sequences are indexed by position, dictionaries are indexed by key:

In [None]:
my_dictionary["key_1"]
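Indexing with a key that does not exist raises a `KeyError`. A short sketch of the usual alternatives, the `in` operator and the `.get()` method, on an illustrative dictionary named `example`:

```python
example = {"key_1": "A", "key_2": "B"}

has_key = "key_9" in example          # membership is tested against the keys
missing = example.get("key_9")        # .get() returns None for a missing key
fallback = example.get("key_9", "?")  # or the default value given as second argument

try:
    example["key_9"]
except KeyError as error:
    message = f"Missing key: {error}"

print(has_key, missing, fallback, message)
```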

Dictionaries are mutable, which means that you can add new keys or mutate existing values:

In [None]:
my_dictionary["key_4"] = "D"
my_dictionary["key_1"] = "a"
my_dictionary

Only immutable (actually: [hashable](https://docs.python.org/3/glossary.html#term-hashable)) objects can be keys:

In [None]:
my_dictionary[[0]] = "Zero"
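A tuple of hashable elements, on the other hand, is a valid key. The sketch below uses a hypothetical `coordinates` dictionary to contrast both cases:

```python
coordinates = {}
coordinates[(0, 0)] = "origin"  # a tuple of integers is hashable
coordinates[(1, 0)] = "east"

try:
    coordinates[[0, 1]] = "north"  # a list is mutable, hence not hashable
except TypeError as error:
    list_error = str(error)

print(coordinates[(0, 0)], "|", list_error)
```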

Dictionaries, lists, and tuples can be arbitrarily nested. For example:

In [None]:
data = [
    {
        "created_at": "Wed Apr 22 06:04:57 +0000 2020",
        "entities": {
            "hashtags": [
                {
                    "text": "balboaisland",
                    "indices": [
                        21,
                        34
                    ]
                },
                {
                    "text": "newportbeach",
                    "indices": [
                        35,
                        48
                    ]
                },
            ]
        }
    }
]

In [None]:
data[0]["entities"]["hashtags"][0]["indices"]

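Reading the expression from right to left, that is: a list (`indices`), inside a dictionary (one hashtag), inside a list (`hashtags`), inside a dictionary (`entities`), inside a dictionary (the first element of `data`), inside a list (`data`).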
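Nested structures are usually traversed by combining indexing with a loop. A minimal sketch that collects the text of every hashtag, using a trimmed-down copy of the structure above (the name `post` is an assumption for illustration):

```python
post = {
    "entities": {
        "hashtags": [
            {"text": "balboaisland", "indices": [21, 34]},
            {"text": "newportbeach", "indices": [35, 48]},
        ]
    }
}

hashtag_texts = []
for hashtag in post["entities"]["hashtags"]:
    # Each element of the list is itself a dictionary
    hashtag_texts.append(hashtag["text"])

print(hashtag_texts)
```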

## Basic import statements

In [None]:
from statistics import mean  # belongs to the standard library (stdlib)

In [None]:
mean([1, 3, 4])

In [None]:
import requests  # Not part of the stdlib, requires installing a package

You can also `import` your own Python modules (more on that in a few weeks).
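Import statements come in a few common forms. The sketch below sticks to standard library modules so that it runs without installing anything; the alias `st` is an arbitrary choice:

```python
import math                           # import the whole module
import statistics as st               # import the module under a shorter alias
from statistics import mean, median   # import specific names only

root = math.sqrt(16)
average = st.mean([1, 3, 4])
same_average = mean([1, 3, 4])
middle = median([1, 3, 4])

print(root, average, same_average, middle)
```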

## Exercises

### 1. Basic questions about Bluesky data

The file `bluesky_more_5000_likes_filtered.json` contains a subset of real Bluesky posts obtained from Failla, A. & Rossetti, G., 2024. “I’m in the Bluesky Tonight”: Insights from a year worth of social data. F. Saracco, ed. _PLOS ONE_, 19(11), p.e0310330. Available at: https://doi.org/10.1371/journal.pone.0310330.

The data was filtered by running the following command over an incomplete download:

```
$ cat *.jsonl | jq -c 'select(.like_count > 5000)' | jq -c -s '.' > bluesky_more_5000_likes_compact.json
```

It was then further filtered to remove profanity using the word list at https://github.com/zautumnz/profane-words/blob/c7fe112/words.json.

These are the first 25 lines:

```
$ jq < data/bluesky_more_5000_likes_filtered.json | head -n25
[
  {
    "post_id": 193115731,
    "user_id": 2648,
    "instance": "ilovecitr.us",
    "date": 202403110052,
    "text": "Wins for Oppenheimer and Godzilla Minus One mean this is the first time a movie and its sequel both won Oscars the same year.",
    "langs": [
      "eng"
    ],
    "like_count": 6300,
    "reply_count": 50,
    "repost_count": 2138,
    "reply_to": null,
    "replied_author": null,
    "thread_root": null,
    "thread_root_author": null,
    "repost_from": 65676,
    "reposted_author": 20990,
    "quotes": null,
    "quoted_author": null,
    "labels": null,
    "sent_label": 2,
    "sent_score": 0.807
  },
```

You can load it as a list of dictionaries using the code below.

In [None]:
# import json
import requests

# DATA_URL = "../data/bluesky_more_5000_likes_filtered.json"
DATA_URL = (
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/"
    "raw/main/data/bluesky_more_5000_likes_filtered.json"
)

# data = json.load(open(DATA_URL))
data = requests.get(DATA_URL).json()
type(data), len(data)

- What kind of object is `data`?
- How many posts are in the dataset?
- Does the text of the 34th post contain the word "twitter" in any capitalization?
- Does the 76th post have more than 6 000 likes?
- How many words does the text of the 111th post have?
- How many times does the newline character (`\n`) appear in the text of the 189th post?
- How many languages appear in the 289th post?
- Of the last 3 posts of the dataset, which one has the highest like count?
- Are the last 3 posts of the dataset in chronological order? _(Tip: save each of them in a different variable, and then use a chained comparison)_
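As a reminder of the technique named in the tip, a chained comparison checks several conditions in a single expression. The values below are hypothetical and only mimic the integer format of the `date` field:

```python
# Hypothetical dates in the same YYYYMMDDHHMM integer format as the dataset
first_date = 202403110052
second_date = 202403112215
third_date = 202403120101

# Equivalent to: first_date <= second_date and second_date <= third_date
in_order = first_date <= second_date <= third_date
out_of_order = third_date <= first_date <= second_date

print(in_order, out_of_order)  # True False
```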

In [None]:
data[-3:]

### 2. Advanced: Aggregate questions about Bluesky data

- What are the languages of the most liked post?
- What is the text of the most liked English-written post?
- How many unique values of `instance` exist?
- Excluding `bsky.social`, what's the instance with the largest number of posts?
- On what weekday are posts most frequent?
- Filter out posts with low confidence on the sentiment analysis (`sent_score < 0.8`) and plot word clouds for positive posts (`sent_label == 2`) and negative ones (`sent_label == 0`)