# 🐍 Step 1 - Python Introduction 

## 📚 Course 1: Python for Data Science: Fundamentals 

## 5️⃣ Dictionaries and Frequency Tables 

---

👦 [Anh-Thi DINH](https://dinhanhthi.com) — 🔥 [dataquest-aio](https://github.com/dinhanhthi/dataquest-aio) on Github.

⚡ **Note**: Some errors in this notebook are intentional; they illustrate what happens when you run wrong commands.

❓ Are you running this notebook on Google Colab? If so, replace `0` with `1` in the cell below and run it first.

In [6]:
use_colab = 0

## 📝 Mission 314

⏬ Download the takeaway for this mission from the folder `/takeaways/` [on Github](https://github.com/dinhanhthi/dataquest-aio/tree/master/takeaways). [Source](https://app.dataquest.io/m/314/dictionaries-and-frequency-tables) of this mission.

👉 Before diving into the exercises, let's recall the Mobile App Store data set (Ramanathan Perumal) ([source](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)).

In [7]:
# You don't need to understand the code in this cell yet (that comes later); it is only used to display the table.
# But you do need to run it to see the dataset

import pandas as pd

if use_colab:
    dataquest_aio = 'https://raw.githubusercontent.com/dinhanhthi/dataquest-aio/master/step-1-python-introduction/'
    dataset_url = dataquest_aio + 'course-1-python-for-ds-fundamentals/data/AppleStore.csv'
else:
    dataset_url = './data/AppleStore.csv' # if you use localhost
    
df = pd.read_csv(dataset_url, encoding="utf8")

In [8]:
from csv import reader
from urllib.request import urlopen

if use_colab: # running on Google Colab: read the file from the URL
    opened_file = urlopen(dataset_url).read().decode('utf-8')
    read_file = reader(opened_file.splitlines())
else: # running on localhost: read the file from disk
    opened_file = open(dataset_url, encoding="utf8")
    read_file = reader(opened_file)

apps_data = list(read_file)

In [5]:
# you don't have to understand this (it's used for displaying the dataset)
df.head(5) # only show the first 5 rows of the dataset

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


The `cont_rating` column offers information about the content rating of each app. The content rating of an app (also known as the maturity rating) represents the age required to use that app. The table below shows the unique content ratings in our data set, along with the number of apps specific to each rating:

| Content rating | Number of apps |
|----------------|----------------|
| 4+ | 4,433 |
| 9+ | 987 |
| 12+ | 1,155 |
| 17+ | 622 |

From the table above, we can see that:

- Most apps (4,433 apps) have a content rating of 4+ (only people aged four or older are allowed to use these apps).
- Apps with a content rating of 17+ are the fewest (622 apps).
- In the middle, we have the 9+ and 12+ apps — 987 apps have a content rating of 9+, and 1,155 apps have a rating of 12+.

If we wanted to save the data from the table above, we could use two lists or maybe a *list of lists*.

In [14]:
# Two lists
content_ratings = ['4+', '9+', '12+', '17+']
numbers = [4433, 987, 1155, 622]

# A list of lists
content_rating_numbers = [['4+', '9+', '12+', '17+'],
                          [4433, 987, 1155, 622]]

print(content_rating_numbers)

[['4+', '9+', '12+', '17+'], [4433, 987, 1155, 622]]
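
To see why this layout gets awkward, here is a minimal sketch of looking up the count for a single label with the two parallel lists. The names `position` and `count_9_plus` are only illustrative.

```python
content_ratings = ['4+', '9+', '12+', '17+']
numbers = [4433, 987, 1155, 622]

# Step 1: find the position of the label in the first list
position = content_ratings.index('9+')

# Step 2: use that position to read the count from the second list
count_9_plus = numbers[position]

print(count_9_plus)  # 987
```

Every lookup needs these two steps, and the two lists must always stay in the same order.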


However, with a *list of lists* the "label-value" pairs are not always easy to read. We would like a type of variable where looking at the label "4+" tells us right away that its value is 4433. That is what a **dictionary** is for.

In [15]:
content_ratings = {'4+': 4433, '9+': 987, '12+': 1155, '17+': 622}
print(content_ratings)

{'4+': 4433, '9+': 987, '12+': 1155, '17+': 622}


In [16]:
# Explore dictionary
over_9 = content_ratings['9+'] # the key here is a string, not an integer index as in a list
over_17 = content_ratings['17+']

print(over_9)
print(over_17)

987
622


In [17]:
# empty dictionary
content_ratings = {}
print(content_ratings)

# add key:value to dictionary
content_ratings["4+"] = 4433
print(content_ratings)

{}
{'4+': 4433}


In [18]:
# A dictionary with the keys having different types
d = {5: 'int', 
     '5': 'string',
     3.5: 'float',
     False: 'Boolean'}
print(d)

{5: 'int', '5': 'string', 3.5: 'float', False: 'Boolean'}


In [19]:
# But you can't use a list as a key
d2 = {[1, 2]: 'list'}

TypeError: unhashable type: 'list'

In [20]:
# Nor a dict
d3 = {{1: 2}: 'dict'}

TypeError: unhashable type: 'dict'

❓ Why CAN'T we use a `list` or a `dict` as a key? Let's take a look at the types we CAN use (`int`, `string`, `float`, `Boolean`).

💡 When we populate a dictionary, Python converts each key to an integer in the background (even if the key is not an integer itself). It does this conversion with the `hash()` function. Note that the hash of a string generally changes from one Python session to the next, so your value for `'5'` will likely differ from the one shown here:

In [21]:
print('If key is an int: ', hash(5))
print('If key is a string: ', hash('5'))
print('If key is a float: ', hash(3.5))
print('If key is a Boolean: ', hash(False))

If key is an int:  5
If key is a string:  6075207366409610028
If key is a float:  1152921504606846979
If key is a Boolean:  0


In [22]:
# But a list cannot be hashed
hash([1, 2]) # if key is a list

TypeError: unhashable type: 'list'

In [23]:
# Nor can a dict
hash({1: 2}) # if key is a dict

TypeError: unhashable type: 'dict'

💡 Look! We get the same `TypeError` as in the definitions of `d2` and `d3`.
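
If you need something list-like as a key, an immutable tuple is one option, because a tuple of hashable items is itself hashable. A small sketch (the dictionary `d4` is made up for illustration):

```python
# A tuple can be hashed, so it can serve as a dictionary key
print(isinstance(hash((1, 2)), int))  # True

d4 = {(1, 2): 'tuple'}
print(d4[(1, 2)])  # tuple

# A list still cannot be hashed
try:
    hash([1, 2])
except TypeError as error:
    print('TypeError:', error)  # TypeError: unhashable type: 'list'
```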

In [24]:
# If the same key appears several times, Python keeps only the last value
d = {'key_1': 1,
     'key_2': 2,
     'key_1': 3,
     'key_3': 4}
print(d)

{'key_1': 3, 'key_2': 2, 'key_3': 4}


In [25]:
# Python treats 1 and True (and 0 and False) as the same key
d_1 = {1: 'one', True: 'Boolean'} # 1 and True are equal, so the first key (1) is kept,
                                  #  but its value is replaced by the last one ('Boolean')
d_2 = {False: 'Bool', 0: 'zero'} # False occurs first, so it is kept as the key (with the last value)
d_3 = {0: 'zero', 1: 'one', 2: 'two', True: 'true', False: 'false'}

print("d_1: ", d_1)
print("d_2: ", d_2)
print("d_3: ", d_3)

d_1:  {1: 'Boolean'}
d_2:  {False: 'zero'}
d_3:  {0: 'false', 1: 'true', 2: 'two'}
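
The behaviour above follows from two facts: `1 == True`, and both have the same hash, so Python treats them as one key. A quick check (the name `d_4` is only for illustration):

```python
print(1 == True, hash(1) == hash(True))    # True True
print(0 == False, hash(0) == hash(False))  # True True

d_4 = {}
d_4[1] = 'one'         # the key 1 is stored
d_4[True] = 'Boolean'  # True equals 1, so only the value is replaced
print(d_4)  # {1: 'Boolean'}
```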


In [26]:
# Check if a key is already in a dictionary
d = {1: 'one', 2: 'two', 3: 'three', '4': 'four'}
print(1 in d)
print(4 in d)
print('4' in d)

True
False
True
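
Note that `in` looks only at the keys. As a small sketch of two related tools: checking against `d.values()` tests for a value, and `d.get()` reads a key that may be missing without raising a `KeyError`.

```python
d = {1: 'one', 2: 'two', 3: 'three', '4': 'four'}

print('one' in d)           # False: 'one' is a value, not a key
print('one' in d.values())  # True

print(d.get(4))             # None: the key 4 (an int) is missing
print(d.get(4, 'missing'))  # missing
```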


❓ **Question**: Back to the dataset `apps_data`: how can we build the table of content ratings shown below?

| Content rating | Number of apps |
|----------------|----------------|
| 4+ | 4,433 |
| 9+ | 987 |
| 12+ | 1,155 |
| 17+ | 622 |

💡 **Hint**: Go through the column `cont_rating` (index 10) and count the number of apps for each rating.

In [27]:
# take a look at the dataset
df.head() # for now, you don't need to understand this line

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


In [28]:
content_ratings = {'4+': 0, '9+': 0, '12+': 0, '17+': 0} # initial settings (start counting from 0)

for row in apps_data[1:]: # go through each app
    c_rating = row[10] # get the rating
    if c_rating in content_ratings: # is THAT rating a key of the dict?
        content_ratings[c_rating] += 1 # if so, count 1 more
        
print(content_ratings)

{'4+': 4433, '9+': 987, '12+': 1155, '17+': 622}


❗ You may wonder how we can know beforehand which unique values to count (in order to create the initial `content_ratings` dict).

❓ **Question**: Same question as above, except that you don't know the unique values of `cont_rating` in advance.

💡 **Hint**: While going through the rows, besides counting each rating, you also need to check whether its key already exists.

In [29]:
content_ratings = {} # we start with an empty dictionary

for row in apps_data[1:]:
    c_rating = row[10] 
    if c_rating in content_ratings: # check if this rating key already exists
        content_ratings[c_rating] += 1 # count 1 more
    else:
        content_ratings[c_rating] = 1 # first time we see this rating
        
print(content_ratings)

{'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}
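
The `if`/`else` pattern above can be shortened. The sketch below is an alternative, not the method used in this mission: it first uses `dict.get()` with a default of 0, and then `collections.Counter` from the standard library. The list `sample_ratings` is a made-up stand-in for the `cont_rating` column.

```python
from collections import Counter

sample_ratings = ['4+', '12+', '4+', '9+', '4+', '17+', '12+']  # hypothetical data

# Option 1: dict.get() returns 0 when the key is not there yet
freq = {}
for rating in sample_ratings:
    freq[rating] = freq.get(rating, 0) + 1
print(freq)  # {'4+': 3, '12+': 2, '9+': 1, '17+': 1}

# Option 2: Counter does the same counting in one call
counter = Counter(sample_ratings)
print(dict(counter) == freq)  # True
```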


❓ **Question** (practice): Count the number of times each unique genre occurs.

💡 **Hint**: Consider column `prime_genre` (index 11)

In [30]:
genre_counting = {}

for row in apps_data[1:]:
    genre = row[11]    
    if genre in genre_counting:
        genre_counting[genre] += 1
    else:
        genre_counting[genre] = 1
        
print(genre_counting)

{'Social Networking': 167, 'Photo & Video': 349, 'Games': 3862, 'Music': 138, 'Reference': 64, 'Health & Fitness': 180, 'Weather': 72, 'Utilities': 248, 'Travel': 81, 'Shopping': 122, 'News': 75, 'Navigation': 46, 'Lifestyle': 144, 'Entertainment': 535, 'Food & Drink': 63, 'Sports': 114, 'Book': 112, 'Finance': 104, 'Education': 453, 'Productivity': 178, 'Business': 57, 'Catalogs': 10, 'Medical': 23}


❓ **Question**: What's the most common app genre? (Don't count by hand!)

In [31]:
print("The most common app genre: ", max(genre_counting, key=genre_counting.get))
print("It has: {} apps.".format(max(genre_counting.values()))) 

The most common app genre:  Games
It has: 3862 apps.


💡 Now you know a new way to insert a value into a string when printing output! Check other ways in [my note page](https://note.dinhanhthi.com/python-input-output).

In [32]:
max(genre_counting) # NOTE: this returns the largest key (in alphabetical order), not the key with the largest value!

'Weather'
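
To make the difference concrete, here is a small sketch on a hypothetical three-genre dictionary (`sample_counts` is invented for illustration). It also shows one way to list all genres from most to least common with `sorted()`.

```python
sample_counts = {'Games': 3, 'Weather': 1, 'Music': 2}  # hypothetical counts

print(max(sample_counts))                         # Weather (largest key, alphabetically)
print(max(sample_counts, key=sample_counts.get))  # Games (key with the largest value)

# Sort (genre, count) pairs by count, largest first
ranked = sorted(sample_counts.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)  # [('Games', 3), ('Music', 2), ('Weather', 1)]
```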

❓ **Question**: What percentage of apps have a content rating of `17+`? What percentage of apps can a 15-year-old download?


In [33]:
# you can use the previous methods to build this content_ratings dict
content_ratings = {'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}

total_number_of_apps = sum(content_ratings.values()) # sum of all values in dict content_ratings

for rating in content_ratings:
    content_ratings[rating] /= total_number_of_apps
    content_ratings[rating] *= 100
    
percentage_17_plus = content_ratings['17+']
percentage_15_allowed = content_ratings['4+'] + content_ratings['9+'] + content_ratings['12+']

print("The percentage of apps with a content rating of 17+: ", percentage_17_plus)
print("The percentage of apps a 15-year-old can download: ", percentage_15_allowed)

The percentage of apps with a content rating of 17+:  8.642489926358204
The percentage of apps a 15-year-old can download:  91.35751007364179


❓ **Question**: Transform the frequencies inside `content_ratings` to proportions and percentages while creating separate dictionaries for each.

In [34]:
# you can use the previous methods to build this content_ratings dict
content_ratings = {'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}

total_number_of_apps = sum(content_ratings.values()) # sum of all values in dict content_ratings

c_ratings_proportions = {}
c_ratings_percentages = {}

for key in content_ratings:
    proportion = content_ratings[key] / total_number_of_apps
    percentage = proportion * 100
    
    c_ratings_proportions[key] = proportion
    c_ratings_percentages[key] = percentage
    
print(content_ratings)
print(c_ratings_proportions)
print(c_ratings_percentages)

{'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}
{'4+': 0.6159510907322495, '12+': 0.16048353480616923, '9+': 0.13714047519799916, '17+': 0.08642489926358204}
{'4+': 61.595109073224954, '12+': 16.04835348061692, '9+': 13.714047519799916, '17+': 8.642489926358204}
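
The same two dictionaries can also be built with dictionary comprehensions. This is only an alternative sketch; the names `proportions` and `percentages` are illustrative, and the rounding to two decimals is an added assumption for readability, not part of the original exercise.

```python
content_ratings = {'4+': 4433, '12+': 1155, '9+': 987, '17+': 622}
total_number_of_apps = sum(content_ratings.values())

# proportion = count / total, for every key
proportions = {key: count / total_number_of_apps
               for key, count in content_ratings.items()}

# percentage = proportion * 100, rounded to two decimals
percentages = {key: round(value * 100, 2)
               for key, value in proportions.items()}

print(percentages)
```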


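❓ **Question**: Create a frequency table over value ranges. For file sizes (column `size_bytes`), such a table would look like this:

|Data size (bytes)|Frequency|
|--- |--- |
|0 - 10,000,000 (0 - 10 MB)|?|
|10,000,000 - 50,000,000 (10 - 50 MB)|?|
|50,000,000 - 100,000,000 (50 - 100 MB)|?|
|100,000,000 - 500,000,000 (100 - 500 MB)|?|
|500,000,000+ (500+ MB)|?|

The cell below applies the same idea to the number of user ratings (column `rating_count_tot`, index 5).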


In [35]:
n_user_ratings = []
for row in apps_data[1:]:
    n_user_ratings.append(int(row[5]))
    
ratings_max = max(n_user_ratings)
ratings_min = min(n_user_ratings)

user_ratings_freq = {'0 - 10000': 0, '10000 - 100000': 0, '100000 - 500000': 0,
                    '500000 - 1000000': 0, '1000000+': 0}

for row in apps_data[1:]:
    user_ratings = int(row[5])
    
    if user_ratings <= 10000:
        user_ratings_freq['0 - 10000'] += 1
        
    elif 10000 < user_ratings <= 100000:
        user_ratings_freq['10000 - 100000'] += 1
        
    elif 100000 < user_ratings <= 500000:
        user_ratings_freq['100000 - 500000'] += 1
        
    elif 500000 < user_ratings <= 1000000:
        user_ratings_freq['500000 - 1000000'] += 1
        
    elif user_ratings > 1000000:
        user_ratings_freq['1000000+'] += 1


print(user_ratings_freq)

{'0 - 10000': 6181, '10000 - 100000': 798, '100000 - 500000': 196, '500000 - 1000000': 16, '1000000+': 6}
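
The chain of `elif` branches can also be driven by a list of upper bounds. The following is a sketch under assumptions: `sample_counts` is made-up data standing in for column 5, and the helper `count_into_ranges` is hypothetical, not part of the mission.

```python
def count_into_ranges(values, upper_bounds, labels):
    """Count how many values fall into each range.

    upper_bounds[i] is the inclusive upper limit of labels[i];
    the last label catches everything above the final bound.
    """
    freq = {label: 0 for label in labels}
    for value in values:
        for bound, label in zip(upper_bounds, labels):
            if value <= bound:
                freq[label] += 1
                break
        else:
            # no bound matched, so the value belongs to the open-ended range
            freq[labels[-1]] += 1
    return freq


sample_counts = [0, 9500, 10000, 10001, 250000, 2000000]  # hypothetical data
labels = ['0 - 10000', '10000 - 100000', '100000+']
result = count_into_ranges(sample_counts, [10000, 100000], labels)
print(result)  # {'0 - 10000': 3, '10000 - 100000': 1, '100000+': 2}
```

With this layout, changing the ranges means editing two lists rather than rewriting the branches.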
