**Lego Dataset: Data Analysis in Python**


Legos have been a staple of children’s entertainment for decades and continue to thrive today, with hundreds of thousands of children actively enjoying the product. Having started with wooden toys in the 1930s, The Lego Group is now estimated to be worth over 11 billion dollars, with hundreds of different products readily available. While many competitors have come and gone with their own replicas, Lego has established itself as a category of its own in the toy industry. Consumers buy Legos by set; each set contains pieces of different shapes and sizes that all fit together. Sets come in many different themes, including partnerships with entertainment franchises such as Star Wars, Jurassic Park, and Harry Potter.

The dataset we used is a public dataset posted on Kaggle. It was created by user 'MattieTerzolo' out of curiosity about the world of Legos. The data is stored in a lego_sets.csv file with 14 columns, including recommended ages, set name, price, piece count, and review ratings.

**Dataset Used: https://www.kaggle.com/datasets/mterzolo/lego-sets?select=lego_sets.csv**

**Questions**


1. What features might contribute to a higher star rating?
2. How many products fall within the different price ranges? (We categorize prices as low, medium, high, and very high.)
3. Piece count of Lego sets based on themes. Which themes have the highest and lowest piece counts on average?
4. What is the correlation between the piece count and the price of each set?
5. How does list price vary with review difficulty? (Box plot)
6. How does the most expensive set compare with the cheapest set?
7. What age range is the most popular among Lego sets?


**Data Wrangling**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

lego = pd.read_csv("datasets/lego_dataset.csv")
lego = lego.sort_values(by = 'list_price', ascending = False)
lego.describe()

The info() method summarizes the dataset: the column names, the number of non-null values in each column, and each column's data type.

In [None]:
lego.info()

The line below shows the names of all the themes of the various Lego sets in the dataset.

In [None]:
uniqueThemeNames = (lego["theme_name"].unique())
uniqueThemeNames

The code shows the unique countries in the dataset.

In [None]:
print(lego["country"].unique())

The code below keeps one row per set name. Sorting by list price first means that, when a set appears more than once, the row with the highest list price is the one kept.

In [None]:
# Sort by price (highest first) so drop_duplicates keeps the most expensive listing of each set
sortedByPrice = lego.sort_values(by = 'list_price', ascending = False)
distinctSets = sortedByPrice.drop_duplicates(subset = 'set_name')

distinctSets[['set_name','list_price']].head(10)
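
The sort-then-deduplicate pattern can be checked on a small example. The sketch below uses a made-up table (the set names, countries, and prices are hypothetical, not taken from the Kaggle file) and relies on drop_duplicates keeping the first row it sees for each set name.

```python
import pandas as pd

# Hypothetical rows: the same set listed in two countries at different prices
toy = pd.DataFrame({
    "set_name": ["Castle", "Castle", "Race Car", "Race Car"],
    "country": ["US", "NO", "US", "NO"],
    "list_price": [99.99, 129.99, 19.99, 24.99],
})

# Sort so the most expensive listing comes first, then keep the first row per set
highest_per_set = (
    toy.sort_values(by="list_price", ascending=False)
       .drop_duplicates(subset="set_name", keep="first")
       .reset_index(drop=True)
)
print(highest_per_set)
```

Reversing the two steps would keep whichever listing happened to appear first in the file, which is not necessarily the most expensive one.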

This code lists each theme once, alongside the piece count of the largest set in that theme.

In [None]:
# Sort by piece count first so drop_duplicates keeps the largest set in each theme
sortedByCount = lego.sort_values(by = 'piece_count', ascending = False)

distinctThemes = sortedByCount.drop_duplicates(subset = 'theme_name')

distinctThemes[['theme_name','piece_count']].head(10)


**Question 1: What features might contribute to a higher star rating?**

*The star rating in the Lego dataset aggregates customer reviews of each product. These ratings give potential buyers a baseline of what other consumers thought of the product.*

*We began by selecting the columns that could plausibly affect a product's star rating: the price, the number of reviews, and the piece count. We then used the “group by” function to average these columns for each star rating, with the ratings listed in ascending order. From here, we were able to draw multiple conclusions.*


In [None]:
##QUESTION 1

stardf = lego[['list_price', 'num_reviews','piece_count','star_rating']]
stardf = stardf.groupby('star_rating').mean().reset_index()

stardf


*On average, the products with the worst reviews had a high price for a low number of pieces. One plausible explanation is that customers did not feel they were getting value for what they paid. Products with higher star ratings tended to include more pieces, which suggests better perceived value for money.*
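
One way to make the "value for money" idea concrete is to compute a price per piece and average it at each star rating. The sketch below is only an illustration on made-up numbers; the price_per_piece column is our own addition and does not exist in the dataset. On the real data, sets with a piece count of zero would need to be filtered out first.

```python
import pandas as pd

# Hypothetical sample; the notebook itself would use the lego DataFrame
toy = pd.DataFrame({
    "list_price": [20.0, 50.0, 30.0, 100.0],
    "piece_count": [100, 500, 60, 1000],
    "star_rating": [4.0, 4.5, 3.0, 4.5],
})

# Price per piece as a rough measure of value for money (lower is better value)
toy["price_per_piece"] = toy["list_price"] / toy["piece_count"]

# Average price per piece at each star rating
value_by_rating = toy.groupby("star_rating")["price_per_piece"].mean()
print(value_by_rating)
```

If low ratings really go with poor value, the average price per piece should fall as the star rating rises, as it does in this constructed sample.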


**Question 2: How many products fall within the different price ranges? (We categorize prices as low, medium, high, and very high.)**

In [None]:
# Price bands: low (under 25), medium (25 to under 50), high (50 to under 75), very high (75 and up)
low = len(distinctSets.loc[distinctSets['list_price'] < 25])
medium = len(distinctSets.loc[distinctSets['list_price'] < 50]) - low
# Subtract both lower bands; subtracting only medium would count the low-priced sets again
high = len(distinctSets.loc[distinctSets['list_price'] < 75]) - low - medium
max_high = len(distinctSets.loc[distinctSets['list_price'] >= 75])

print("There are " + str(low)+ " low priced lego sets" )
print("There are " + str(medium) + " medium priced lego sets" )
print("There are " + str(high) + " high priced lego sets" )
print("There are " + str(max_high) + " very high priced lego sets" )
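
The same banding can also be expressed with pd.cut, which assigns each price to a labeled bin in one step and avoids the running subtraction. This is a sketch on made-up prices, using the band edges 25, 50, and 75 from the cell above; the variable names are our own.

```python
import pandas as pd

# Hypothetical prices, including values that sit exactly on a band edge
prices = pd.Series([9.99, 24.99, 25.0, 49.99, 60.0, 75.0, 199.99])

# right=False makes each bin include its lower edge: [0, 25), [25, 50), [50, 75), [75, inf)
bins = [0, 25, 50, 75, float("inf")]
labels = ["low", "medium", "high", "very high"]
price_range = pd.cut(prices, bins=bins, labels=labels, right=False)

# Count how many prices landed in each band
counts = {label: int((price_range == label).sum()) for label in labels}
print(counts)
```

With right=False a set priced at exactly 25 counts as medium, matching the strict less-than comparisons used above.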

**Question 3: Piece count of lego sets based on themes. Which themes have the highest and lowest piece counts on average?**

In [None]:
#Question 3
# Average piece count per theme, computed over distinct sets
theme_grouped = distinctSets.groupby("theme_name")
pieceByTheme = theme_grouped["piece_count"]
distinctMean = pieceByTheme.mean()

# Ten themes with the highest average piece count
top10 = distinctMean.nlargest(10).reset_index()

plt.figure(figsize=(25, 8))
bar_colors = np.random.rand(len(top10), 3)  # one random RGB color per bar
plt.bar(top10["theme_name"], top10["piece_count"], color=bar_colors)
plt.title("Highest Average Piece Count by Theme")
plt.xlabel("Theme", fontsize = 16)
plt.ylabel("Average Piece Count", fontsize = 16)
plt.show()
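
The question asks about both ends of the ranking, and the lowest averages can be read off the same grouped means with nsmallest. A minimal sketch, assuming made-up theme names and piece counts:

```python
import pandas as pd

# Hypothetical themes and piece counts
toy = pd.DataFrame({
    "theme_name": ["Theme A", "Theme A", "Theme B", "Theme B", "Theme C"],
    "piece_count": [100, 300, 1000, 2000, 50],
})

# Mean piece count per theme
avg_pieces = toy.groupby("theme_name")["piece_count"].mean()

highest = avg_pieces.nlargest(2)   # themes with the largest average piece count
lowest = avg_pieces.nsmallest(2)   # themes with the smallest average piece count
print(highest)
print(lowest)
```

On the notebook's data the same two calls could be applied to distinctMean, with 10 in place of 2.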

**Question 4: What is the correlation between the piece count and the price of each set?**

*For the manufacturer, we would expect the number of pieces in a product to be closely tied to its price for the consumer: a product that requires more pieces costs more to manufacture. This code plots each set’s piece count against its list price to see whether the two are correlated.*


In [None]:
palette = sns.color_palette("husl", n_colors=len(uniqueThemeNames))

# Create a scatter plot for each theme
for i, theme in enumerate(uniqueThemeNames):
    theme_data = distinctSets[distinctSets['theme_name'] == theme]
    plt.scatter(theme_data['piece_count'], theme_data['list_price'], label=theme, color=palette[i])
    
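# Least-squares line through all sets: list_price = x * piece_count + y
# (x is the slope, y is the intercept; rows with missing values are dropped first)
fit_data = distinctSets[['piece_count', 'list_price']].dropna()
x, y = np.polyfit(fit_data['piece_count'], fit_data['list_price'], 1)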

plt.title('Piece Count vs List Price for Lego Sets')
plt.xlabel('Piece Count')
plt.ylabel('List Price')
plt.legend(title='Theme', bbox_to_anchor=(1.05, 1), loc='upper left')  # Add a legend
plt.show()



*Our code loops over every theme in the dataset. For each theme, it draws that theme's sets on the scatter plot in a distinct color. Based on the visualization, there is an overall positive correlation between the price and the number of pieces regardless of theme.*

In [None]:
x = round(x,3)
y = round(y, 3)
bestFit = ('y = ', x, 'x + ', y )
print(''.join(map(str, bestFit)))
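
To put a number on the correlation, one option is the Pearson coefficient from Series.corr together with a least-squares line from np.polyfit. The sketch below uses made-up points constructed to lie exactly on a line, so the result is easy to check by hand; real set data will be noisier and give a coefficient below 1.

```python
import numpy as np
import pandas as pd

# Hypothetical sets constructed so that list_price = 0.1 * piece_count + 5
toy = pd.DataFrame({
    "piece_count": [100, 200, 300, 400, 500],
    "list_price": [15.0, 25.0, 35.0, 45.0, 55.0],
})

# Pearson correlation coefficient between piece count and price
r = toy["piece_count"].corr(toy["list_price"])

# Degree-1 polynomial fit returns the slope first, then the intercept
slope, intercept = np.polyfit(toy["piece_count"], toy["list_price"], 1)

print("r =", round(r, 3))
print("y =", round(slope, 3), "x +", round(intercept, 3))
```

A coefficient near 1 indicates a strong positive linear relationship, while the slope estimates how much the price rises per additional piece.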

**Question 5: How does list price vary with review difficulty?**

In [None]:
distinctSets.boxplot(column = 'list_price', by = 'review_difficulty', patch_artist = True, vert = False, showfliers = False, boxprops = dict(facecolor = 'lightblue'))
plt.xlabel('Price of Individual Lego Sets')
plt.ylabel('Assembly Difficulty Reviews')
plt.title('Prices of Lego Sets by Review Difficulty')
plt.suptitle('')

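**Question 6: How does the most expensive set compare with the cheapest set?**

In [None]:
#Question 6

# Select by price rather than by row position, so the result does not depend on sort order
mostExpensiveSet = distinctSets.nlargest(1, 'list_price')
leastExpensiveSet = distinctSets.nsmallest(1, 'list_price')

# A notebook cell only displays its last expression, so show both rows in one table
pd.concat([mostExpensiveSet, leastExpensiveSet])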

**Question 7: What age range is the most popular among Lego sets?**

*When a Lego product is released, the label usually recommends the set for a specific age range. This varies from set to set and often reflects how difficult the set is to build. While the recommendation matters little to older builders, it can be important for consumers buying for a person of a certain age. Parents, for example, may not want to buy their child a toy that is too difficult to assemble.*


In [None]:
colors = ['lightblue', 'blue', 'green', 'purple', 'orange', 'yellow', 'pink', 'cyan', 'lightgreen', 'red']
variousAges = distinctSets.groupby(['ages']).size().sort_values(ascending = True)

# Multiply by 100 so the values are percentages, matching the axis label
percentage = variousAges / len(distinctSets) * 100

percentage.tail(10).plot.barh(color = colors)
plt.xlabel('Percentage of Sets') 
plt.ylabel('Age Range')
plt.title('Percentage of Lego Sets within Given Age Ranges')


*This code analyzes the dataset to find which recommended age range is most common. It begins by using the “groupby” function to group the sets by age range and count them, sorted in ascending order. It then divides each count by the total number of sets to get the share of the dataset that each age range holds. The result is visualized as a bar graph, which shows that the 7-14 age range is the most common among Lego sets.*
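
The group, count, and divide steps described above can also be written as a single value_counts call with normalize=True. A minimal sketch on made-up age labels (the values below are hypothetical, not counts from the dataset):

```python
import pandas as pd

# Hypothetical age-range labels for six sets
ages = pd.Series(["7-14", "7-14", "6-12", "7-14", "5-12", "6-12"])

# normalize=True returns each label's share of the total; multiply by 100 for percentages
share = ages.value_counts(normalize=True) * 100
print(share.round(1))
```

value_counts sorts from most to least common, so the first entry is the most popular age range.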