### Data Cleaning

# LEGO Set Metadata

The table below describes the 14 columns of the raw LEGO set data.

| Field           | Description                                                 |
|-----------------|-------------------------------------------------------------|
| `set_id`        | Official LEGO item number                                   |
| `name`          | Name of the LEGO set                                        |
| `year`          | Release year                                                |
| `theme`         | LEGO theme the set belongs to                               |
| `subtheme`      | Subtheme within the theme                                   |
| `themeGroup`    | Overall group the theme belongs to                          |
| `category`      | Type of set                                                 |
| `pieces`        | Number of pieces in the set                                 |
| `minifigs`      | Number of minifigures included in the set                   |
| `agerange_min`  | Minimum age recommended                                     |
| `US_retailPrice`| US retail price at launch                                   |
| `bricksetURL`   | URL for the set on brickset.com                              |
| `thumbnailURL`  | Small image of the set                                      |
| `imageURL`      | Full size image of the set                                  |


In [57]:
import pandas as pd

In [58]:
import sys
import os

sys.path.append(os.path.abspath(os.path.join('..', 'utils')))

import data_processing_functions as dpf

In [59]:
lego_sets = pd.read_csv('../data/raw/lego_sets.csv')

### Basic Information

In [60]:
dpf.show_basic_info(lego_sets)


DataFrame Shape: (18457, 14)
Number of Rows: 18457
Number of Columns: 14

Data Types of Columns:
set_id             object
name               object
year                int64
theme              object
subtheme           object
themeGroup         object
category           object
pieces            float64
minifigs          float64
agerange_min      float64
US_retailPrice    float64
bricksetURL        object
thumbnailURL       object
imageURL           object
dtype: object

Missing Values per Column:
set_id                0
name                  0
year                  0
theme                 0
subtheme           3556
themeGroup            2
category              0
pieces             3924
minifigs          10058
agerange_min      11670
US_retailPrice    11475
bricksetURL           0
thumbnailURL       1006
imageURL           1006
dtype: int64

First 5 Rows of Data:
  set_id                     name  year      theme     subtheme themeGroup  \
0    1-8          Small house set  1970  Minit

In [61]:
dpf.show_data_types(lego_sets)

Data Types of Columns:
set_id             object
name               object
year                int64
theme              object
subtheme           object
themeGroup         object
category           object
pieces            float64
minifigs          float64
agerange_min      float64
US_retailPrice    float64
bricksetURL        object
thumbnailURL       object
imageURL           object
dtype: object


In [62]:
dpf.show_missing_values(lego_sets)


Missing Values in Columns:
set_id                0
name                  0
year                  0
theme                 0
subtheme           3556
themeGroup            2
category              0
pieces             3924
minifigs          10058
agerange_min      11670
US_retailPrice    11475
bricksetURL           0
thumbnailURL       1006
imageURL           1006
dtype: int64


In [63]:
dpf.show_null_percentage(lego_sets)


Percentage of Missing Values in Each Column:
set_id             0.000000
name               0.000000
year               0.000000
theme              0.000000
subtheme          19.266403
themeGroup         0.010836
category           0.000000
pieces            21.260226
minifigs          54.494230
agerange_min      63.228044
US_retailPrice    62.171534
bricksetURL        0.000000
thumbnailURL       5.450507
imageURL           5.450507
dtype: float64
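
The source of `dpf.show_null_percentage` is not shown in this notebook. A minimal sketch of an equivalent calculation in plain pandas, assuming the helper simply reports the share of nulls per column (the function name `null_percentage` and the toy frame are hypothetical):

```python
import pandas as pd


def null_percentage(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column (hypothetical helper)."""
    # isna() gives a boolean frame; the column mean of booleans is the null fraction.
    return df.isna().mean() * 100


# Toy data for illustration only, not the LEGO data set.
toy = pd.DataFrame({
    "pieces": [67.0, None, 158.0, None],
    "name": ["a", "b", "c", "d"],
})
pct = null_percentage(toy)
print(pct)
```

Columns such as `agerange_min` and `US_retailPrice`, which are more than 60% empty above, are the ones that drive the row-dropping decisions later in the notebook.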


#### Check for Duplicates

In [64]:
dpf.check_for_duplicates(lego_sets)


No duplicate rows found in the DataFrame.
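
For reference, a duplicate check like this can be written directly with pandas. The sketch below assumes `dpf.check_for_duplicates` compares whole rows; `count_duplicate_rows` and the toy frame are hypothetical names used only for illustration:

```python
import pandas as pd


def count_duplicate_rows(df: pd.DataFrame) -> int:
    """Count rows that exactly repeat an earlier row (hypothetical helper)."""
    # duplicated() marks every repeat of an earlier row as True.
    return int(df.duplicated().sum())


# Toy data for illustration only.
toy = pd.DataFrame({
    "set_id": ["1-8", "2-8", "1-8"],
    "year": [1970, 1970, 1970],
})
n_dupes = count_duplicate_rows(toy)

# If duplicates were present, they could be removed like this.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```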


#### Rename Columns

In [65]:
column_rename_dict = {
    'set_id': 'set_id',                    # Official LEGO item number
    'name': 'set_name',                    # Name of the LEGO set
    'year': 'release_year',                # Release year
    'theme': 'theme_name',                 # LEGO theme the set belongs to
    'subtheme': 'subtheme_name',           # Subtheme within the theme
    'themeGroup': 'theme_group',           # Overall group the theme belongs to
    'category': 'set_category',            # Type of set (e.g., Normal, Extended, Book)
    'pieces': 'num_pieces',                # Number of pieces in the set
    'minifigs': 'num_minifigs',            # Number of mini figures included
    'agerange_min': 'min_age_recommended', # Minimum age recommended
    'US_retailPrice': 'us_retail_price',   # US retail price at launch
    'bricksetURL': 'brickset_url',         # URL for the set on brickset.com
    'thumbnailURL': 'thumbnail_url',       # Small image of the set
    'imageURL': 'image_url'                # Full size image of the set
}

lego_sets = dpf.rename_columns(lego_sets, column_rename_dict)

In [66]:
lego_sets.head()

|   | set_id | set_name | release_year | theme_name | subtheme_name | theme_group | set_category | num_pieces | num_minifigs | min_age_recommended | us_retail_price | brickset_url | thumbnail_url | image_url |
|---|--------|----------|--------------|------------|---------------|-------------|--------------|------------|--------------|---------------------|-----------------|--------------|---------------|-----------|
| 0 | 1-8 | Small house set | 1970 | Minitalia |  | Vintage | Normal | 67.0 |  |  |  | https://brickset.com/sets/1-8 | https://images.brickset.com/sets/small/1-8.jpg | https://images.brickset.com/sets/images/1-8.jpg |
| 1 | 2-8 | Medium house set | 1970 | Minitalia |  | Vintage | Normal | 109.0 |  |  |  | https://brickset.com/sets/2-8 | https://images.brickset.com/sets/small/2-8.jpg | https://images.brickset.com/sets/images/2-8.jpg |
| 2 | 3-6 | Medium house set | 1970 | Minitalia |  | Vintage | Normal | 158.0 |  |  |  | https://brickset.com/sets/3-6 | https://images.brickset.com/sets/small/3-6.jpg | https://images.brickset.com/sets/images/3-6.jpg |
| 3 | 4-4 | Large house set | 1970 | Minitalia |  | Vintage | Normal | 233.0 |  |  |  | https://brickset.com/sets/4-4 | https://images.brickset.com/sets/small/4-4.jpg | https://images.brickset.com/sets/images/4-4.jpg |
| 4 | 4-6 | Mini House and Vehicles | 1970 | Samsonite | Model Maker | Vintage | Normal |  |  |  |  | https://brickset.com/sets/4-6 |  |  |


#### Drop rows with missing values in specific columns
- `us_retail_price`
- `num_pieces`

In [67]:
lego_sets = dpf.drop_empty_rows_from_column(lego_sets, 'us_retail_price')
lego_sets = dpf.drop_empty_rows_from_column(lego_sets, 'num_pieces')

Number of rows deleted: 11475
Number of rows deleted: 1660
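
The second call removes 1660 rows rather than the 3924 nulls reported for `pieces` earlier, because the rest of those rows had already been dropped for a missing price. That leaves 18457 − 11475 − 1660 = 5322 rows. A minimal pandas sketch of this step, assuming the `dpf` helper behaves like `dropna` on a single column (`drop_rows_missing` and the toy frame are hypothetical):

```python
import pandas as pd


def drop_rows_missing(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop rows where `column` is null and report how many were removed."""
    before = len(df)
    result = df.dropna(subset=[column])
    print(f"Number of rows deleted: {before - len(result)}")
    return result


# Toy data for illustration only.
toy = pd.DataFrame({
    "us_retail_price": [12.99, None, 27.99, None],
    "num_pieces": [8.0, None, None, 3.0],
})
toy = drop_rows_missing(toy, "us_retail_price")  # removes 2 rows
toy = drop_rows_missing(toy, "num_pieces")       # removes 1 more row
```

As in the notebook, the order of the two calls changes the per-call counts but not the final set of rows.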


### Unique Values

#### Unique values in string columns

In [68]:
dpf.show_column_value_counts(lego_sets, 'set_name')
dpf.show_column_value_counts(lego_sets, 'set_category')
dpf.show_column_value_counts(lego_sets, 'theme_name')
dpf.show_column_value_counts(lego_sets, 'subtheme_name')
dpf.show_column_value_counts(lego_sets, 'theme_group')


Value counts for column set_name:
set_name
City Advent Calendar                   16
Star Wars Advent Calendar              12
Friends Advent Calendar                10
Fire Station                            9
Fire Truck                              9
                                       ..
AT-RT                                   1
Republic Troopers vs. Sith Troopers     1
Clone Troopers vs. Droidekas            1
Mountain Climber                        1
BARC Speeder with Sidecar               1
Name: count, Length: 4883, dtype: int64

Value counts for column set_category:
set_category
Normal        5164
Extended       118
Collection      15
Other           12
Book            11
Random           2
Name: count, dtype: int64

Value counts for column theme_name:
theme_name
City               504
Star Wars          457
Duplo              437
Friends            338
Ninjago            269
                  ... 
Life of George       2
Ghostbusters         2
Vikings              1
Boost  

#### Convert string columns to lowercase
- Standardize small spelling differences (`&` to `and`, ` / ` to `/`, `vs.` to `vs`)

In [69]:
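columns_to_lowercase = ['set_name', 'set_category', 'theme_name', 'subtheme_name', 'theme_group']
for column in columns_to_lowercase:
    # Lowercase the column (the helper appears to modify lego_sets in place).
    dpf.convert_strings_to_lowercase(lego_sets, column)
    # NOTE: clean_text receives only the column name and its result is discarded,
    # so this call may have no effect on lego_sets; check the helper's signature.
    dpf.clean_text(column)
    # Standardize small spelling differences; regex=False matches each pattern literally.
    lego_sets[column] = lego_sets[column].str.replace('&', 'and', regex=False)
    lego_sets[column] = lego_sets[column].str.replace(' / ', '/', regex=False)
    lego_sets[column] = lego_sets[column].str.replace('vs.', 'vs', regex=False)
    # Print the distinct values to verify the cleanup.
    print(f"\n Unique values in '{column}' column: {lego_sets[column].unique()}\n")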


 Unique values in 'set_name' column: ['straight rails' 'curved rails' 'manual points with track' ...
 'the heavenly realms' 'lunar new year traditions'
 'lunar new year ice festival']


 Unique values in 'set_category' column: ['normal' 'other' 'collection' 'extended' 'book' 'random']


 Unique values in 'theme_name' column: ['trains' 'duplo' 'basic' 'dacta' 'town' 'bulk bricks' 'star wars'
 'seasonal' 'belville' 'quatro' 'city' 'technic' 'racers' 'bionicle'
 'castle' 'education' 'spongebob squarepants' 'creator' 'make and create'
 'vikings' 'exo-force' 'batman' 'mindstorms' 'advanced models'
 'harry potter' 'space' 'aqua raiders' 'bricks and more' 'indiana jones'
 'power functions' 'agents' 'factory' 'architecture' 'miscellaneous'
 'games' 'pirates' 'power miners' 'books' 'hero factory'
 'prince of persia' 'toy story' 'atlantis' 'ben 10: alien force'
 'collectable minifigures' 'world racers' 'serious play' 'ninjago'
 'pirates of the caribbean' "pharaoh's quest" 'cars'
 'master builde
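
The same normalization can be tried on a small Series to see what each replacement does. This is a sketch in plain pandas with made-up sample values; it does not use the `dpf` helpers, whose implementation is not shown here:

```python
import pandas as pd

# Toy values for illustration only.
names = pd.Series([
    "Bricks & More",
    "Action / Adventure",
    "Clone Troopers vs. Droidekas",
    None,
])

cleaned = (
    names.str.lower()
    .str.strip()
    .str.replace("&", "and", regex=False)   # '&' -> 'and'
    .str.replace(" / ", "/", regex=False)   # ' / ' -> '/'
    .str.replace("vs.", "vs", regex=False)  # literal 'vs.' -> 'vs'
)
print(cleaned.tolist())
```

Passing `regex=False` matters for `'vs.'`: as a regular expression the dot would match any character. Missing values pass through the `.str` methods unchanged, which is why `subtheme_name` still has nulls after this step.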

#### Clean numerical columns
- Convert count and year columns to integers (nullable `Int64`, so missing values stay as `<NA>`)

In [70]:
columns_to_int = ['min_age_recommended', 'num_minifigs', 'num_pieces', 'release_year']
for column in columns_to_int:
    dpf.convert_columns_to_int(lego_sets, [column])
    print(f"\n Unique values in {column} column: {lego_sets[column].unique()}\n")


 Unique values in min_age_recommended column: <IntegerArray>
[<NA>, 1, 4, 16, 7, 5, 12, 8, 6, 2, 9, 3, 11, 10, 14, 18]
Length: 16, dtype: Int64


 Unique values in num_minifigs column: <IntegerArray>
[<NA>,    1,    3,    2,    5,    6,   31,   11,    4,    8,    9,    7,   10,
   12,   21,   13,   16,   24,   15,   22,   17,   14,   20,   27,   19,   28,
   25]
Length: 27, dtype: Int64


 Unique values in num_pieces column: <IntegerArray>
[   8,    6,    3,    1,    2,    9,   50,  100,   25,   87,
 ...
  671,  777, 1083, 1222, 2187,  880, 1406, 2433, 1066, 1519]
Length: 1238, dtype: Int64


 Unique values in release_year column: <IntegerArray>
[1991, 1992, 1993, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018,
 2019, 2020, 2021, 2022]
Length: 30, dtype: Int64
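
The `<IntegerArray>` output with `<NA>` entries indicates pandas' nullable `Int64` dtype. A minimal sketch of such a conversion, assuming `dpf.convert_columns_to_int` does something similar to `astype("Int64")` (the toy frame is hypothetical):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "num_minifigs": [1.0, None, 3.0],     # float64 because of the missing value
    "release_year": [1991, 1992, 2022],   # already int64
})

for column in ["num_minifigs", "release_year"]:
    # Capital-I "Int64" is the nullable integer dtype; missing values become <NA>.
    toy[column] = toy[column].astype("Int64")

print(toy.dtypes)
```

The plain NumPy `int64` dtype cannot hold missing values, so a column with nulls would raise an error there instead of converting.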



In [71]:
dpf.show_missing_values(lego_sets)


Missing Values in Columns:
set_id                    0
set_name                  0
release_year              0
theme_name                0
subtheme_name           915
theme_group               0
set_category              0
num_pieces                0
num_minifigs           1618
min_age_recommended     936
us_retail_price           0
brickset_url              0
thumbnail_url             1
image_url                 1
dtype: int64


#### Replace empty values
- with 0 in numerical columns
- with 'unknown' in subtheme column

In [72]:
lego_sets['num_pieces'] = lego_sets['num_pieces'].fillna(0)
lego_sets['min_age_recommended'] = lego_sets['min_age_recommended'].fillna(0)
lego_sets['num_minifigs'] = lego_sets['num_minifigs'].fillna(0)
lego_sets['subtheme_name'] = lego_sets['subtheme_name'].fillna('unknown')
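
The same fills can also be expressed as a single `fillna` call with a per-column mapping. A sketch on hypothetical toy data, keeping the notebook's choice of 0 and `'unknown'` as fill values:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "num_minifigs": pd.array([1, None, 3], dtype="Int64"),
    "min_age_recommended": pd.array([None, 4, 16], dtype="Int64"),
    "subtheme_name": ["9v", None, "season 3"],
})

# One fill value per column; columns not listed are left untouched.
fill_values = {
    "num_minifigs": 0,
    "min_age_recommended": 0,
    "subtheme_name": "unknown",
}
toy = toy.fillna(fill_values)
print(toy)
```

Note that after this step a 0 in `min_age_recommended` or `num_minifigs` can mean "not recorded" as well as a true zero, which is worth keeping in mind when these columns are aggregated later.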

In [73]:
display(lego_sets)

|   | set_id | set_name | release_year | theme_name | subtheme_name | theme_group | set_category | num_pieces | num_minifigs | min_age_recommended | us_retail_price | brickset_url | thumbnail_url | image_url |
|---|--------|----------|--------------|------------|---------------|-------------|--------------|------------|--------------|---------------------|-----------------|--------------|---------------|-----------|
| 1986 | 4515-1 | straight rails | 1991 | trains | 9v | modern day | normal | 8 | 0 | 0 | 12.99 | https://brickset.com/sets/4515-1 | https://images.brickset.com/sets/small/4515-1.jpg | https://images.brickset.com/sets/images/4515-1... |
| 1987 | 4520-1 | curved rails | 1991 | trains | 9v | modern day | normal | 8 | 0 | 0 | 12.99 | https://brickset.com/sets/4520-1 | https://images.brickset.com/sets/small/4520-1.jpg | https://images.brickset.com/sets/images/4520-1... |
| 1988 | 4531-1 | manual points with track | 1991 | trains | 9v | modern day | normal | 6 | 0 | 0 | 27.99 | https://brickset.com/sets/4531-1 | https://images.brickset.com/sets/small/4531-1.jpg | https://images.brickset.com/sets/images/4531-1... |
| 1993 | 4548-1 | transformer and speed regulator | 1991 | trains | 9v | modern day | normal | 3 | 0 | 0 | 41.99 | https://brickset.com/sets/4548-1 | https://images.brickset.com/sets/small/4548-1.jpg | https://images.brickset.com/sets/images/4548-1... |
| 2111 | 2304-1 | large building plate | 1992 | duplo | unknown | pre-school | normal | 1 | 0 | 1 | 14.99 | https://brickset.com/sets/2304-1 | https://images.brickset.com/sets/small/2304-1.jpg | https://images.brickset.com/sets/images/2304-1... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18025 | 80037-1 | dragon of the east | 2022 | monkie kid | season 3 | action/adventure | normal | 880 | 4 | 8 | 79.99 | https://brickset.com/sets/80037-1 | https://images.brickset.com/sets/small/80037-1... | https://images.brickset.com/sets/images/80037-... |
| 18026 | 80038-1 | monkie kid's team van | 2022 | monkie kid | season 3 | action/adventure | normal | 1406 | 6 | 9 | 129.99 | https://brickset.com/sets/80038-1 | https://images.brickset.com/sets/small/80038-1... | https://images.brickset.com/sets/images/80038-... |
| 18027 | 80039-1 | the heavenly realms | 2022 | monkie kid | season 3 | action/adventure | normal | 2433 | 8 | 10 | 189.99 | https://brickset.com/sets/80039-1 | https://images.brickset.com/sets/small/80039-1... | https://images.brickset.com/sets/images/80039-... |
| 18028 | 80108-1 | lunar new year traditions | 2022 | seasonal | chinese traditional festivals | miscellaneous | normal | 1066 | 12 | 8 | 79.99 | https://brickset.com/sets/80108-1 | https://images.brickset.com/sets/small/80108-1... | https://images.brickset.com/sets/images/80108-... |
