# Session 06

## Dictionaries

After covering tuples and lists, the last essential data structure is the dictionary. A dictionary is a mapping between keys and values:

In [1]:
my_dictionary = {
    "key_1": "A",
    "key_2": "B",
    "key_3": "C",
}
my_dictionary

{'key_1': 'A', 'key_2': 'B', 'key_3': 'C'}

The keys and the values of the dictionary can be extracted separately:

In [2]:
my_dictionary.keys()

dict_keys(['key_1', 'key_2', 'key_3'])

In [3]:
my_dictionary.values()

dict_values(['A', 'B', 'C'])

And `.items()` can be used to extract pairs of `(key, value)`:

In [4]:
my_dictionary.items()

dict_items([('key_1', 'A'), ('key_2', 'B'), ('key_3', 'C')])

While sequences are indexed by position, dictionaries are indexed by key:

In [5]:
my_dictionary["key_1"]

'A'
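
Looking up a key that does not exist raises a `KeyError`. As a minimal sketch (the `inventory` dictionary and its keys are made up for illustration), the `in` operator and the `.get()` method are two common ways to deal with keys that might be missing:

```python
# Hypothetical dictionary, for illustration only
inventory = {"apples": 3, "pears": 0}

# `in` checks whether a key is present
has_apples = "apples" in inventory
has_kiwis = "kiwis" in inventory

# `.get()` returns the given default instead of raising KeyError
kiwi_count = inventory.get("kiwis", 0)
apple_count = inventory.get("apples", 0)

print(has_apples, has_kiwis, kiwi_count, apple_count)  # True False 0 3
```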

Dictionaries are mutable, which means that you can add new keys or replace the values of existing ones:

In [6]:
my_dictionary["key_4"] = "D"
my_dictionary["key_1"] = "a"
my_dictionary

{'key_1': 'a', 'key_2': 'B', 'key_3': 'C', 'key_4': 'D'}

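Only hashable objects can be keys. In practice, this means immutable objects such as strings, numbers, and tuples; mutable containers like lists are rejected: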

In [7]:
my_dictionary[[0]] = "Zero"

TypeError: unhashable type: 'list'
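
By contrast, a tuple can be used as a key, as long as everything inside it is hashable too. The sketch below uses a made-up `grid` dictionary keyed by `(row, column)` pairs to illustrate the difference:

```python
# Hypothetical example: (row, column) tuples as dictionary keys
grid = {}
grid[(0, 0)] = "origin"
grid[(2, 3)] = "treasure"

print(grid[(2, 3)])  # treasure

# A tuple that contains a list is not hashable, so it is rejected as a key
try:
    grid[(0, [1])] = "invalid"
except TypeError as error:
    print("TypeError:", error)
```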

There are three ways to iterate over a dictionary: over its keys, over its values, or over both at once:

In [8]:
for key in my_dictionary:
    print(key)

key_1
key_2
key_3
key_4


In [9]:
for value in my_dictionary.values():
    print(value)

a
B
C
D


In [10]:
for key, value in my_dictionary.items():
    print(key, ":", value)

key_1 : a
key_2 : B
key_3 : C
key_4 : D


Dictionaries, lists, and tuples can be arbitrarily nested. For example, this is an excerpt of a tweet represented as JSON data (see `data/twitter_data.json`):

In [11]:
data = [
    {
        "created_at": "Wed Apr 22 06:04:57 +0000 2020",
        "entities": {
            "hashtags": [
                {
                    "text": "balboaisland",
                    "indices": [
                        21,
                        34
                    ]
                },
                {
                    "text": "newportbeach",
                    "indices": [
                        35,
                        48
                    ]
                },
            ]
        }
    }
]

In [12]:
data[0]["entities"]["hashtags"][0]["indices"]

[21, 34]

_That is a list inside a dictionary inside a list inside a dictionary inside a dictionary inside a list._
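
Nested structures are usually traversed with nested loops, one per level of nesting. The following sketch uses an invented `courses` list (not the tweet data) to show the general pattern:

```python
# Hypothetical nested data: a list of dictionaries that contain lists
courses = [
    {"name": "Python", "sessions": [{"topic": "Lists"}, {"topic": "Dictionaries"}]},
    {"name": "Statistics", "sessions": [{"topic": "Sampling"}]},
]

topics = []
for course in courses:  # outer list
    for session in course["sessions"]:  # inner list, reached through a key
        topics.append(session["topic"])

print(topics)  # ['Lists', 'Dictionaries', 'Sampling']
```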

## Sets

Sets are less common, but very handy for some use cases: a set is an unordered collection of unique elements. This means that it can't contain duplicates:

In [13]:
my_set = {"a", "b", "c", "c", "c", "c", "d", "d"}
my_set

{'a', 'b', 'c', 'd'}

Since they are unordered, they cannot be indexed:

In [14]:
my_set[0]

TypeError: 'set' object is not subscriptable

But sets can be combined using the usual set operations:

![Venn diagram](../img/venn-diagram.png)

In [15]:
# Union
{"a", "b"} | {"b", "c"}

{'a', 'b', 'c'}

In [16]:
# Intersection
{"a", "b"} & {"b", "c"}

{'b'}

In [17]:
# Difference
{"a", "b"} - {"b", "c"}

{'a'}

In [18]:
# Symmetric difference (exclusive disjunction)
{"a", "b"} ^ {"b", "c"}

{'a', 'c'}
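
Each of these operators also has an equivalent method. As a small sketch (the variable names `left` and `right` are chosen only for this example), the methods give the same results, with the difference that they accept any iterable as argument, not just another set:

```python
left = {"a", "b"}
right = {"b", "c"}

# Each operator has an equivalent method; every line prints True
print(left.union(right) == (left | right))
print(left.intersection(right) == (left & right))
print(left.difference(right) == (left - right))
print(left.symmetric_difference(right) == (left ^ right))

# Unlike the operators, the methods also accept other iterables, such as a list
combined = left.union(["c", "d"])
print(sorted(combined))  # ['a', 'b', 'c', 'd']
```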

Sets can also be iterated over:

In [19]:
for element in {"a", "b"} | {"b", "c"}:
    print(element)

a
b
c


<div class="alert alert-warning">Remember: sets are unordered! So iterating them will yield the elements, but the order can't be guaranteed.</div>
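
When a predictable order is needed, one option is to pass the set to the built-in `sorted()`, which returns a new sorted list. A minimal sketch, using a made-up `letters` set:

```python
letters = {"d", "a", "c", "b"}

# sorted() accepts any iterable and returns a new list in ascending order
ordered = sorted(letters)
print(ordered)  # ['a', 'b', 'c', 'd']
```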

## List, dictionary, and set comprehensions

Python provides a compact way of creating containers without using the usual `for` loops covered so far: comprehensions. This list comprehension:

In [20]:
my_list = [number ** 2 for number in range(5)]
my_list

[0, 1, 4, 9, 16]

Is equivalent to this code block:

In [21]:
my_list = []
for number in range(5):
    my_list.append(number ** 2)
my_list

[0, 1, 4, 9, 16]

A similar syntax works for dictionaries and sets:

In [22]:
{letter: number for number, letter in enumerate("abcde")}

{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

In [23]:
{word.lower() for word in "This is a complete sentence".split()}

{'a', 'complete', 'is', 'sentence', 'this'}
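
Comprehensions can also filter elements by adding an `if` clause at the end. The following is an illustrative sketch, with names such as `even_squares` and `word_lengths` invented for the example:

```python
# Keep only the squares of the even numbers
even_squares = [number ** 2 for number in range(10) if number % 2 == 0]
print(even_squares)  # [0, 4, 16, 36, 64]

# The same kind of filter works in dictionary and set comprehensions
words = ["a", "complete", "sentence"]
word_lengths = {word: len(word) for word in words if len(word) > 1}
print(word_lengths)  # {'complete': 8, 'sentence': 8}
```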

<div class="alert alert-warning">Using parentheses does not generate a "tuple comprehension", but a totally different thing: a generator expression. We will not cover generators in this course.</div>

In [24]:
(number ** 2 for number in range(5))

<generator object <genexpr> at 0x7f44c53bd040>
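
Without going into how generators work, it is still worth knowing that a generator expression can be handed to a function that consumes it. For instance, as a brief sketch, `tuple()` builds an actual tuple from it and `sum()` adds up its elements:

```python
# Passing the generator expression to tuple() produces an actual tuple
squares = tuple(number ** 2 for number in range(5))
print(squares)  # (0, 1, 4, 9, 16)

# Functions such as sum() can consume a generator expression directly
total = sum(number ** 2 for number in range(5))
print(total)  # 30
```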

## Exercises

### Twitter data

The file `twitter_data.json` contains a subset of real tweets obtained from http://covid19research.site/geo-tagged_twitter_datasets/, with full metadata as retrieved by the Twitter API. These are the first 10 lines:

```
[
  {
    "created_at": "Wed Apr 22 06:04:57 +0000 2020",
    "id": 1252840795737997317,
    "id_str": "1252840795737997317",
    "text": "Tennis a la Balboa.\n\n#balboaisland #newportbeach #tennis #covid_19 #coronavirus #orangecounty #california\u2026 https://t.co/px1GCH1bgZ",
    "truncated": true,
    "entities": {
      "hashtags": [
        {
```

You can load it as a list of dictionaries using the code below.

In [25]:
import requests

DATA_URL = (
    "https://github.com/astrojuanlu/ie-mbd-python-data-analysis-i/"
    "raw/main/data/twitter_data.json"
)

data = requests.get(DATA_URL).json()
print(type(data), len(data))

<class 'list'> 9
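
The `.json()` call parses the body of the response as JSON and returns the corresponding Python objects. To get a feel for that mapping without downloading anything, the standard library `json` module can parse a JSON string directly. In this sketch, the `raw` string is a made-up fragment that only imitates the shape of the tweet data:

```python
import json

# Hypothetical JSON text, shaped like a tiny piece of the tweet data
raw = '[{"id": 1, "truncated": true, "entities": {"hashtags": []}}]'

parsed = json.loads(raw)

# JSON arrays become lists, objects become dictionaries, true becomes True
print(type(parsed), type(parsed[0]), parsed[0]["truncated"])
```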


### 1. Print hashtags

Save the first tweet of the list in a separate variable called `first_tweet`, and print its ID (stored in the `id` key) and all its hashtags (stored in the `entities` key).

Expected result:

```
1252840795737997317
#balboaisland
#newportbeach
#tennis
#covid_19
#coronavirus
#orangecounty
#california
```

### 2. Function `print_tweet`

Wrap the logic from the previous exercise in a function `print_tweet(tweet)` that takes a dictionary representing a single tweet. Then call it for all the tweets in `data`.

In [27]:
for tweet in data:
    print_tweet(tweet)

1252840795737997317
#balboaisland
#newportbeach
#tennis
#covid_19
#coronavirus
#orangecounty
#california
1256259052461592580
#Repost
1256459430129885184
#ThisWeekend
#la
#LosAngeles
#activities
#lockdown
1256459484521615362
1258334628345044992
#COVID19
#Protesters
1253472356476940290
1253472297144299520
1254774730529243136
1258172892077912064


### 3. Check hashtags

Write a function `has_hashtags` that takes a single tweet and returns `True` if its list of hashtags is non-empty (has length greater than zero), and `False` otherwise. Then iterate over all the tweets in `data` and call `has_hashtags` on each of them.

In [29]:
for tweet in data:
    print(tweet["id"], has_hashtags(tweet))

1252840795737997317 True
1256259052461592580 True
1256459430129885184 True
1256459484521615362 False
1258334628345044992 True
1253472356476940290 False
1253472297144299520 False
1254774730529243136 False
1258172892077912064 False


### 4. Improve `print_tweet`

Modify `print_tweet` so that, if no hashtags are found, it prints `(no hashtags)`. Use the `has_hashtags` function above.

In [31]:
print_tweet(data[4])

1258334628345044992
#COVID19
#Protesters


In [32]:
print_tweet(data[5])

1253472356476940290
(no hashtags)


### 5. Trimming tweets

Write a function `trim_tweet(tweet)` that takes a dictionary representing a single tweet and returns a simpler dictionary, with keys `('id', 'text', 'favorite_count', 'retweet_count', 'has_hashtags', 'username', 'user_followers')`. Try to guess which fields of the original tweet you should use for each output key.

In [34]:
trim_tweet(data[0])

{'id': 1252840795737997317,
 'text': 'Tennis a la Balboa.\n\n#balboaisland #newportbeach #tennis #covid_19 #coronavirus #orangecounty #california… https://t.co/px1GCH1bgZ',
 'favorite_count': 0,
 'retweet_count': 0,
 'has_hashtags': True,
 'username': 'JoesNews_',
 'user_followers': 46}

### 6. Trimming the data

Write a function `trim_data(tweet_list)` that receives a list of tweets (like the original data) and returns another list of simplified tweets, by applying `trim_tweet` to each tweet in the list.

In [36]:
trim_data(data[:3])

[{'id': 1252840795737997317,
  'text': 'Tennis a la Balboa.\n\n#balboaisland #newportbeach #tennis #covid_19 #coronavirus #orangecounty #california… https://t.co/px1GCH1bgZ',
  'favorite_count': 0,
  'retweet_count': 0,
  'has_hashtags': True,
  'username': 'JoesNews_',
  'user_followers': 46},
 {'id': 1256259052461592580,
  'text': '#Repost bandcamp\n• • • • • •\nTo keep supporting musicians during the Covid-19 pandemic, we are waiving our revenue… https://t.co/W7mJY3QPH0',
  'favorite_count': 0,
  'retweet_count': 0,
  'has_hashtags': True,
  'username': 'MagdaVegaSLC',
  'user_followers': 84},
 {'id': 1256459430129885184,
  'text': 'Things to Do (Inside) in L.A. #ThisWeekend https://t.co/DnsuwtVXbb via @LAMag\n#la #LosAngeles #activities #lockdown… https://t.co/QvFjrEAfFj',
  'favorite_count': 0,
  'retweet_count': 0,
  'has_hashtags': True,
  'username': 'alistsocialent',
  'user_followers': 68}]

### 7. Simple analytics on Twitter data

Apply the `trim_data` function to the data and find **the mean favorite count of the tweets without hashtags written by people with more than 500 followers**.

_Hint: The answer is around 16_