### Data Cleaning

# LEGO Set Metadata

The table below describes the 14 columns of the raw LEGO sets dataset.

| Field           | Description                                                 |
|-----------------|-------------------------------------------------------------|
| `set_id`        | Official LEGO item number                                   |
| `name`          | Name of the LEGO set                                        |
| `year`          | Release year                                                |
| `theme`         | LEGO theme the set belongs to                               |
| `subtheme`      | Subtheme within the theme                                   |
| `themeGroup`    | Overall group the theme belongs to                          |
| `category`      | Type of set                                                 |
| `pieces`        | Number of pieces in the set                                 |
| `minifigs`      | Number of mini figures included in the set                  |
| `agerange_min`  | Minimum age recommended                                     |
| `US_retailPrice`| US retail price at launch                                   |
| `bricksetURL`   | URL for the set on brickset.com                             |
| `thumbnailURL`  | Small image of the set                                      |
| `imageURL`      | Full size image of the set                                  |


In [21]:
import pandas as pd

In [22]:
import sys
import os

sys.path.append(os.path.abspath(os.path.join('..', 'utils')))

import data_processing_functions as dpf

In [23]:
lego_sets = pd.read_csv('../data/raw/lego_sets.csv')

### Basic Information

In [24]:
dpf.show_basic_info(lego_sets)


DataFrame Shape: (18457, 14)
Number of Rows: 18457
Number of Columns: 14

Data Types of Columns:
set_id             object
name               object
year                int64
theme              object
subtheme           object
themeGroup         object
category           object
pieces            float64
minifigs          float64
agerange_min      float64
US_retailPrice    float64
bricksetURL        object
thumbnailURL       object
imageURL           object
dtype: object

Missing Values per Column:
set_id                0
name                  0
year                  0
theme                 0
subtheme           3556
themeGroup            2
category              0
pieces             3924
minifigs          10058
agerange_min      11670
US_retailPrice    11475
bricksetURL           0
thumbnailURL       1006
imageURL           1006
dtype: int64

First 5 Rows of Data:
  set_id                     name  year      theme     subtheme themeGroup  \
0    1-8          Small house set  1970  Minit

In [25]:
dpf.show_data_types(lego_sets)

Data Types of Columns:
set_id             object
name               object
year                int64
theme              object
subtheme           object
themeGroup         object
category           object
pieces            float64
minifigs          float64
agerange_min      float64
US_retailPrice    float64
bricksetURL        object
thumbnailURL       object
imageURL           object
dtype: object


In [26]:
dpf.show_missing_values(lego_sets)


Missing Values in Columns:
set_id                0
name                  0
year                  0
theme                 0
subtheme           3556
themeGroup            2
category              0
pieces             3924
minifigs          10058
agerange_min      11670
US_retailPrice    11475
bricksetURL           0
thumbnailURL       1006
imageURL           1006
dtype: int64


In [27]:
dpf.show_null_percentage(lego_sets)


Percentage of Missing Values in Each Column:
set_id             0.000000
name               0.000000
year               0.000000
theme              0.000000
subtheme          19.266403
themeGroup         0.010836
category           0.000000
pieces            21.260226
minifigs          54.494230
agerange_min      63.228044
US_retailPrice    62.171534
bricksetURL        0.000000
thumbnailURL       5.450507
imageURL           5.450507
dtype: float64
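
The implementation of `dpf.show_null_percentage` is not shown in this notebook. As a minimal sketch, assuming it simply reports the share of missing entries per column, the same figures can be computed with plain pandas. The function name `null_percentage` and the toy data are hypothetical and only illustrate the calculation.

```python
import numpy as np
import pandas as pd


def null_percentage(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values in each column."""
    # isna() marks missing cells as True; the mean of a boolean column is
    # the fraction of True values, so multiplying by 100 gives a percentage.
    return df.isna().mean() * 100


# Toy frame standing in for lego_sets (values are illustrative only).
toy = pd.DataFrame({
    "set_id": ["1-8", "2-8", "3-6", "4-4"],
    "pieces": [67.0, 109.0, np.nan, 233.0],
    "minifigs": [np.nan, np.nan, np.nan, 2.0],
})

null_pct = null_percentage(toy)
print(null_pct)
```

Applied to the real data, this calculation is consistent with the output above: for example, 3556 missing `subtheme` values out of 18457 rows is about 19.27%.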


#### Check for Duplicates

In [28]:
dpf.check_for_duplicates(lego_sets)


No duplicate rows found in the DataFrame.
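
How `dpf.check_for_duplicates` works internally is not shown. A minimal sketch of a comparable check, assuming it counts fully identical rows, uses `DataFrame.duplicated`. The helper name `count_duplicate_rows` and the toy data are hypothetical.

```python
import pandas as pd


def count_duplicate_rows(df: pd.DataFrame, subset=None) -> int:
    """Count rows that repeat an earlier row (optionally on a column subset)."""
    # duplicated() flags every repeat of an earlier row as True and leaves
    # the first occurrence as False, so the sum is the number of extra copies.
    return int(df.duplicated(subset=subset).sum())


# Toy frame: rows 1 and 2 are identical, and one name appears three times.
toy = pd.DataFrame({
    "set_id": ["1-8", "2-8", "2-8", "3-6"],
    "name": ["Small house set", "Medium house set",
             "Medium house set", "Medium house set"],
})

full_dupes = count_duplicate_rows(toy)                   # identical rows
name_dupes = count_duplicate_rows(toy, subset=["name"])  # repeated names only
print(full_dupes, name_dupes)
```

Checking a subset of columns can be useful here because set names repeat across the dataset even when no two full rows are identical.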


#### Rename Columns

In [29]:
column_rename_dict = {
    'set_id': 'set_id',                    # Official LEGO item number
    'name': 'set_name',                    # Name of the LEGO set
    'year': 'release_year',                # Release year
    'theme': 'theme_name',                 # LEGO theme the set belongs to
    'subtheme': 'subtheme_name',           # Subtheme within the theme
    'themeGroup': 'theme_group',           # Overall group the theme belongs to
    'category': 'set_category',            # Type of set (e.g., playset, vehicle)
    'pieces': 'num_pieces',                # Number of pieces in the set
    'minifigs': 'num_minifigs',            # Number of mini figures included
    'agerange_min': 'min_age_recommended', # Minimum age recommended
    'US_retailPrice': 'us_retail_price',   # US retail price at launch
    'bricksetURL': 'brickset_url',         # URL for the set on brickset.com
    'thumbnailURL': 'thumbnail_url',       # Small image of the set
    'imageURL': 'image_url'                # Full size image of the set
}

lego_sets = dpf.rename_columns(lego_sets, column_rename_dict)
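
The body of `dpf.rename_columns` is not shown. A minimal sketch of one way to do the same thing, assuming it wraps `DataFrame.rename`, is given below. The helper name `rename_columns_strict` is hypothetical. It passes `errors="raise"` so that a misspelled key in the mapping fails loudly instead of being skipped silently, which is the pandas default.

```python
import pandas as pd


def rename_columns_strict(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Rename columns, raising KeyError if a mapping key is not a column."""
    # rename() returns a new DataFrame; the input frame is left unchanged.
    return df.rename(columns=mapping, errors="raise")


# Toy frame with two of the original column names.
toy = pd.DataFrame({"name": ["Small house set"], "themeGroup": ["Vintage"]})
renamed = rename_columns_strict(
    toy, {"name": "set_name", "themeGroup": "theme_group"}
)

# A key that does not match any column is reported instead of ignored.
try:
    rename_columns_strict(toy, {"theme_group": "group"})
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(renamed.columns), typo_caught)
```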


In [30]:
lego_sets.head()

|   | set_id | set_name | release_year | theme_name | subtheme_name | theme_group | set_category | num_pieces | num_minifigs | min_age_recommended | us_retail_price | brickset_url | thumbnail_url | image_url |
|---|--------|----------|--------------|------------|---------------|-------------|--------------|------------|--------------|---------------------|-----------------|--------------|---------------|-----------|
| 0 | 1-8 | Small house set | 1970 | Minitalia | NaN | Vintage | Normal | 67.0 | NaN | NaN | NaN | https://brickset.com/sets/1-8 | https://images.brickset.com/sets/small/1-8.jpg | https://images.brickset.com/sets/images/1-8.jpg |
| 1 | 2-8 | Medium house set | 1970 | Minitalia | NaN | Vintage | Normal | 109.0 | NaN | NaN | NaN | https://brickset.com/sets/2-8 | https://images.brickset.com/sets/small/2-8.jpg | https://images.brickset.com/sets/images/2-8.jpg |
| 2 | 3-6 | Medium house set | 1970 | Minitalia | NaN | Vintage | Normal | 158.0 | NaN | NaN | NaN | https://brickset.com/sets/3-6 | https://images.brickset.com/sets/small/3-6.jpg | https://images.brickset.com/sets/images/3-6.jpg |
| 3 | 4-4 | Large house set | 1970 | Minitalia | NaN | Vintage | Normal | 233.0 | NaN | NaN | NaN | https://brickset.com/sets/4-4 | https://images.brickset.com/sets/small/4-4.jpg | https://images.brickset.com/sets/images/4-4.jpg |
| 4 | 4-6 | Mini House and Vehicles | 1970 | Samsonite | Model Maker | Vintage | Normal | NaN | NaN | NaN | NaN | https://brickset.com/sets/4-6 | NaN | NaN |


### Unique Values

In [31]:
dpf.show_column_summary(lego_sets)


Summary Statistics for All Columns:
        set_id          set_name  release_year theme_name  subtheme_name  \
count    18457             18457  18457.000000      18457          14901   
unique   18457             15374           NaN        154            895   
top     YOTT-1  Bonus/Value Pack           NaN       Gear  Magazine Gift   
freq         1               165           NaN       2832            453   
mean       NaN               NaN   2007.960611        NaN            NaN   
std        NaN               NaN     11.948666        NaN            NaN   
min        NaN               NaN   1970.000000        NaN            NaN   
25%        NaN               NaN   2001.000000        NaN            NaN   
50%        NaN               NaN   2011.000000        NaN            NaN   
75%        NaN               NaN   2017.000000        NaN            NaN   
max        NaN               NaN   2022.000000        NaN            NaN   

          theme_group set_category    num_pieces  

#### Unique values in string columns

In [32]:
dpf.show_column_value_counts(lego_sets, 'set_name')
dpf.show_column_value_counts(lego_sets, 'set_category')
dpf.show_column_value_counts(lego_sets, 'theme_name')
dpf.show_column_value_counts(lego_sets, 'subtheme_name')
dpf.show_column_value_counts(lego_sets, 'theme_group')


Value counts for column set_name:
set_name
Bonus/Value Pack                              165
Helicopter                                     36
Basic Building Set, 5+                         32
Basic Building Set, 3+                         32
Basic Set                                      25
                                             ... 
BrickJournal Issue 77                           1
{Friends 10th anniversary golden minidoll}      1
The LEGO Ideas Book (New Edition)               1
Star Wars Awesome Vehicles                      1
Meet the Minifigures                            1
Name: count, Length: 15374, dtype: int64

Value counts for column set_category:
set_category
Normal        12757
Gear           2832
Other          1094
Book            631
Collection      578
Extended        501
Random           64
Name: count, dtype: int64

Value counts for column theme_name:
theme_name
Gear                       2832
Duplo                      1275
Star Wars                   863
Cit

#### Convert string columns to lowercase

In [33]:
columns_to_lowercase = ['set_name', 'set_category', 'theme_name', 'subtheme_name', 'theme_group']
for column in columns_to_lowercase:
    dpf.convert_strings_to_lowercase(lego_sets, column)
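    # NOTE: this call receives only the column name (a string) and its return
    # value is discarded, so as written it appears to have no effect on lego_sets.
    dpf.clean_text(column)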
    lego_sets[column] = lego_sets[column].str.replace('&', 'and', regex=False)
    lego_sets[column] = lego_sets[column].str.replace(' / ', '/', regex=False)
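
The helpers `dpf.convert_strings_to_lowercase` and `dpf.clean_text` are not shown, so their exact behavior is unknown. As a minimal sketch, assuming the goal is lowercase text with `&` spelled out and ` / ` collapsed to `/`, the same normalization can be written with pandas string methods alone. The function name `normalize_text_column` and the toy values are hypothetical.

```python
import pandas as pd


def normalize_text_column(series: pd.Series) -> pd.Series:
    """Lowercase a text column and unify '&' and ' / ' spellings."""
    return (
        series.str.lower()
        # regex=False treats the pattern as a literal string.
        .str.replace("&", "and", regex=False)
        .str.replace(" / ", "/", regex=False)
    )


# Toy values: two spellings of the same name, plus a missing entry.
toy = pd.Series(["Art & Crafts", "Bonus / Value Pack", "Bonus/Value Pack", None])
cleaned = normalize_text_column(toy)
print(cleaned)
```

Missing values stay missing, and variants that differ only in case or spacing collapse into one value. This kind of merging is why the number of distinct `set_name` values can drop after normalization (15374 before, 15309 after in the outputs of this notebook).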
    

Check the unique values after conversion:

In [34]:
dpf.show_column_value_counts(lego_sets, 'set_name')
dpf.show_column_value_counts(lego_sets, 'set_category')
dpf.show_column_value_counts(lego_sets, 'theme_name')
dpf.show_column_value_counts(lego_sets, 'subtheme_name')
dpf.show_column_value_counts(lego_sets, 'theme_group')


Value counts for column set_name:
set_name
bonus/value pack          167
helicopter                 36
basic building set, 3+     32
basic building set, 5+     32
basic set                  26
                         ... 
takadox mask                1
sword takadox               1
picnic tea set              1
princess crown              1
city police megaphone       1
Name: count, Length: 15309, dtype: int64

Value counts for column set_category:
set_category
normal        12757
gear           2832
other          1094
book            631
collection      578
extended        501
random           64
Name: count, dtype: int64

Value counts for column theme_name:
theme_name
gear                       2832
duplo                      1275
star wars                   863
city                        798
collectable minifigures     773
                           ... 
life of george                2
the powerpuff girls           2
system                        1
boost                         1

In [35]:
print(lego_sets['set_name'].unique())

['small house set' 'medium house set' 'large house set' ...
 'buildable 2 x 2 red brick' 'sls duck' 'summer wave']


In [36]:
print(lego_sets['set_category'].unique())


['normal' 'book' 'other' 'gear' 'collection' 'extended' 'random']


In [37]:
print(lego_sets['theme_name'].unique())


['minitalia' 'samsonite' 'trains' 'books' 'legoland' 'duplo'
 'universal building set' 'system' 'homemaker' 'gear' 'basic' 'dacta'
 'building set with people' 'preschool' 'hobby set' 'technic'
 'service packs' 'promotional' 'town' 'castle' 'space' 'fabuland' 'scala'
 'education' 'boats' 'model team' 'assorted' 'pirates' 'belville'
 'creator' 'freestyle' 'primo' 'aquazone' 'time cruisers' 'western'
 'classic' 'adventurers' 'znap' 'racers' 'mindstorms' 'seasonal'
 'rock raiders' 'star wars' 'studios' 'baby' 'action wheelers' 'sports'
 'bulk bricks' 'advanced models' 'mickey mouse' 'bionicle' 'jack stone'
 'harry potter' 'dinosaurs' 'alpha team' 'explore' 'spybotics' 'galidor'
 'miscellaneous' 'island xtreme stunts' 'clikits' 'world city' '4 juniors'
 'spider-man' 'discovery' 'quatro' 'make and create' 'city' 'factory'
 'vikings' 'dino 2010' 'dino attack' 'batman' 'spongebob squarepants'
 'avatar the last airbender' 'exo-force' 'bricks and more' 'aqua raiders'
 'serious play' 'indiana jon

In [38]:
print(lego_sets['subtheme_name'].unique())

[nan 'model maker' 'supplemental' 'supplemental/4.5v' '4.5v' 'lego'
 'vehicle' 'building' 'basic set' 'supplemental/12v' '12v' 'gears'
 'promotional' 'storage' '4.5/12v' 'jumbo bricks' 'town' 'large vehicle'
 'boats' 'supplementaries' 'model' 'supplementary set' 'seasonal'
 'supplementary' 'trains' 'ferries' 'accessories' 'classic' 'technic'
 'special' 'bags' 'minifigure pack' 'jewellery' 'educational'
 'shops and services' 'medical' 'racing' 'vehicles' 'maintenance'
 'construction' 'fire' 'key chains/promotional' 'leisure' 'police' 'space'
 'postal' 'universal' 'bonus/value pack' 'baby' 'product collection'
 'lion knights' 'black falcons' 'universal building set' 'flight'
 'emergency' 'arctic' 'duplo' 'forestmen' 'futuron' 'blacktron'
 'brick kicks' 'harry n. abrams inc.' 'black knights' 'miscellaneous'
 'furniture' 'farm' 'monorail' 'castle' 'imperial guards' 'coastguard'
 'rescue' 'space police' 'lego preschool' 'pirates' 'm-tron' 'blacktron 2'
 '9v' 'toolo' 'wolfpack' 'paradisa' 's

In [39]:
print(lego_sets['theme_group'].unique())


['vintage' 'modern day' 'miscellaneous' 'pre-school' 'basic' 'educational'
 'technical' 'historical' 'action/adventure' 'junior' 'girls'
 'model making' 'racing' 'licensed' 'constraction' nan 'art and crafts']
