# Fun with Lego

In this folder there is a CSV file with all Lego sets from 1950 to 2017. Your homework is to use this file to answer the questions below. Just as for the in-class exercise, you should be able to solve everything without any non-standard libraries (except for the optional matplotlib exercise), but you are more than welcome to use any other libraries.

Answer the following questions and put your answer in a dictionary, with the first word of each line (e.g. 'all_pieces' or 'year_most') as the key and your answer as the value. Your homework must also include the code that you used to find the answers.

- 'all_pieces': If you had one of each of the sets, how many pieces of Lego would you have then?
- 'year_most': In which year was the highest number of sets released?
- 'average_pieces': What is the average number of pieces in all sets, rounded to 1 decimal?
- 'most_used_word': Which word is used most often in the names of the sets?

You can find more information about the dataset here:
https://www.kaggle.com/rtatman/lego-database

Optional matplotlib exercise:
- Plot the years from 1950-2017 on the x-axis and the median number of pieces of a set of the given year on the y-axis.

Optional themes exercises (for these, you will also need the themes.csv file):
Each set is part of a theme, and each theme is also part of one or more parent themes. For example, the set 60141-1 is part of theme 80 (Police), which in turn is part of theme 67 (Classic Town), which again in turn is part of theme 50 (Town). Theme 50, however, is a parent theme, and therefore there are no other themes 'above' it.
- Create a dictionary with all parent themes as keys, and a list of all their sub-themes as values. Here, you should only discern between a parent theme and any subtheme. Thus, theme 50 would be a parent theme, and both theme 80 and 67 should be listed on the same level.
- Create a dictionary with all parent themes as keys and the number of the sets that are part of it. Here, you have to make sure that each set is only counted once!



In [351]:
from collections import defaultdict
from csv import reader

file = open("legosets.csv", "r")
csv_reader = reader(file)

keys = next(csv_reader)

legosets = defaultdict(dict) # index by set_num
for line_values in csv_reader:
    set_id = line_values[0]  # first row is always id 
    legosets[set_id] = dict(zip(keys[1:], line_values[1:]))

In [352]:
results = dict(all_pieces=sum([int(set_dict["num_parts"]) for set_dict in legosets.values()]))

In [353]:
legosets[list(legosets.keys())[9]]

{'name': 'Weetabix Promotional House 2',
 'year': '1976',
 'theme_id': '413',
 'num_parts': '149'}

In [354]:
sets_per_year = defaultdict(list)

for set_id, set_dict in legosets.items():
    sets_per_year[set_dict["year"]].append(set_id)
    
set_count = [len(sets) for sets in sets_per_year.values()]

results["year_most"] = list(sets_per_year.keys())[set_count.index(max(set_count))]

In [355]:
num_parts_per_set = [int(set_dict["num_parts"]) for set_dict in legosets.values()]
results["average_pieces"] = round(sum(num_parts_per_set) / len(num_parts_per_set))

In [377]:
all_words = list()

chars_to_remove = [",", "[", "]", "(", ")"]
for set_id, set_dict in legosets.items():
    name = set_dict["name"]
    _ = [name.replace("[", " ") for _char in chars_to_remove]
    for char in chars_to_remove:
        name = name.replace(char, " ")
    all_words += name.split(" ")


unique_words = list(set(all_words))
for word in unique_words:
    if len(word) <= 1:
        unique_words.remove(word)
        
# unique_words.remove("")
unique_words_count = [all_words.count(word) for word in unique_words]
# _ = unique_words_count.remove("")
results["most_used_word"] = unique_words[unique_words_count.index(max(unique_words_count))]

In [378]:
results

{'all_pieces': 1894089,
 'year_most': '2014',
 'average_pieces': 162,
 'most_used_word': 'Set'}