### Data Cleaning

# LEGO Set Metadata

The table below describes the 14 fields recorded for each LEGO set in the dataset.

| Field           | Description                                                 |
|-----------------|-------------------------------------------------------------|
| `set_id`        | Official LEGO item number                                   |
| `name`          | Name of the LEGO set                                        |
| `year`          | Release year                                                |
| `theme`         | LEGO theme the set belongs to                               |
| `subtheme`      | Subtheme within the theme                                   |
| `themeGroup`    | Overall group the theme belongs to                          |
| `category`      | Type of set                                                 |
| `pieces`        | Number of pieces in the set                                 |
| `minifigs`      | Number of mini figures included in the set                  |
| `agerange_min`  | Minimum age recommended                                     |
| `US_retailPrice`| US retail price at launch                                   |
| `bricksetURL`   | URL for the set on brickset.com                             |
| `thumbnailURL`  | Small image of the set                                      |
| `imageURL`      | Full size image of the set                                  |


In [1]:
import pandas as pd

In [2]:
import sys
import os

# Add the sibling utils directory (../utils) to the module search path
sys.path.append(os.path.abspath(os.path.join('..', 'utils')))

import data_processing_functions as dpf

In [3]:
lego_sets = pd.read_csv('../data/raw/lego_sets.csv')

In [4]:
dpf.show_basic_info(lego_sets)


DataFrame Shape: (18457, 14)
Number of Rows: 18457
Number of Columns: 14

Data Types of Columns:
set_id             object
name               object
year                int64
theme              object
subtheme           object
themeGroup         object
category           object
pieces            float64
minifigs          float64
agerange_min      float64
US_retailPrice    float64
bricksetURL        object
thumbnailURL       object
imageURL           object
dtype: object

Missing Values per Column:
set_id                0
name                  0
year                  0
theme                 0
subtheme           3556
themeGroup            2
category              0
pieces             3924
minifigs          10058
agerange_min      11670
US_retailPrice    11475
bricksetURL           0
thumbnailURL       1006
imageURL           1006
dtype: int64

First 5 Rows of Data:
  set_id                     name  year      theme     subtheme themeGroup  \
0    1-8          Small house set  1970  Minit

In [5]:
dpf.show_data_types(lego_sets)

Data Types of Columns:
set_id             object
name               object
year                int64
theme              object
subtheme           object
themeGroup         object
category           object
pieces            float64
minifigs          float64
agerange_min      float64
US_retailPrice    float64
bricksetURL        object
thumbnailURL       object
imageURL           object
dtype: object
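
The count-like columns (`pieces`, `minifigs`, `agerange_min`) load as `float64` because a NumPy integer column cannot hold missing values, so pandas upcasts it when gaps are present. The following is a minimal sketch, on a made-up toy frame rather than the real data, of casting such columns to the nullable `Int64` dtype. Whether the notebook needs this step is an assumption; it only illustrates the option.

```python
import pandas as pd

# Toy frame (made-up values) mimicking count columns that contain gaps.
toy = pd.DataFrame({
    "pieces": [67.0, 109.0, None],
    "minifigs": [None, 2.0, None],
})

# Missing values force what are really integer counts into float64.
print(toy.dtypes)

# Cast to the nullable integer dtype: counts stay integers, gaps become <NA>.
toy_int = toy.astype({"pieces": "Int64", "minifigs": "Int64"})
print(toy_int.dtypes)
```

The cast only succeeds when every non-missing value is a whole number, which is worth confirming on the real columns first.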


In [6]:
dpf.show_missing_values(lego_sets)


Missing Values in Columns:
set_id                0
name                  0
year                  0
theme                 0
subtheme           3556
themeGroup            2
category              0
pieces             3924
minifigs          10058
agerange_min      11670
US_retailPrice    11475
bricksetURL           0
thumbnailURL       1006
imageURL           1006
dtype: int64


In [7]:
dpf.show_null_percentage(lego_sets)


Percentage of Missing Values in Each Column:
set_id             0.000000
name               0.000000
year               0.000000
theme              0.000000
subtheme          19.266403
themeGroup         0.010836
category           0.000000
pieces            21.260226
minifigs          54.494230
agerange_min      63.228044
US_retailPrice    62.171534
bricksetURL        0.000000
thumbnailURL       5.450507
imageURL           5.450507
dtype: float64
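
The source of `dpf.show_null_percentage` is not shown here, but the output is consistent with the share of missing values per column multiplied by 100 (for example, 3556 / 18457 ≈ 19.27% for `subtheme`). A minimal sketch of that calculation in plain pandas follows; the function name `null_percentage` and the toy data are hypothetical.

```python
import pandas as pd

def null_percentage(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values in each column."""
    # isna() gives a boolean frame; the mean of booleans is the missing share.
    return df.isna().mean() * 100

# Toy frame (made-up values): 3 of 4 subthemes are missing.
toy = pd.DataFrame({
    "subtheme": ["Model Maker", None, None, None],
    "year": [1970, 1970, 1971, 1972],
})

pct = null_percentage(toy)
print(pct)
```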


#### Duplicates

In [8]:
dpf.check_for_duplicates(lego_sets)


No duplicate rows found in the DataFrame.
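
A check for fully duplicated rows does not rule out repeated identifiers, since two rows can share a `set_id` while differing in another column. The sketch below shows both checks with plain pandas on a made-up toy frame. It is an illustration only and not the implementation of `dpf.check_for_duplicates`.

```python
import pandas as pd

# Toy frame (made-up values): the last two rows share a set_id but differ in name.
toy = pd.DataFrame({
    "set_id": ["1-8", "2-8", "2-8"],
    "name": ["Small house set", "Medium house set", "Medium house set (variant)"],
})

# Rows that repeat an earlier row in every column.
full_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier row's identifier only.
key_dupes = int(toy.duplicated(subset="set_id").sum())

print(f"Fully duplicated rows: {full_dupes}")
print(f"Rows with a repeated set_id: {key_dupes}")
```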


#### Rename Columns

In [10]:
column_rename_dict = {
    'set_id': 'set_id',                    # Official LEGO item number
    'name': 'set_name',                    # Name of the LEGO set
    'year': 'release_year',                # Release year
    'theme': 'theme_name',                 # LEGO theme the set belongs to
    'subtheme': 'subtheme_name',           # Subtheme within the theme
    'themeGroup': 'theme_group',           # Overall group the theme belongs to
    'category': 'set_category',            # Type of set (e.g., playset, vehicle)
    'pieces': 'num_pieces',                # Number of pieces in the set
    'minifigs': 'num_minifigs',            # Number of mini figures included
    'agerange_min': 'min_age_recommended', # Minimum age recommended
    'US_retailPrice': 'us_retail_price',   # US retail price at launch
    'bricksetURL': 'brickset_url',         # URL for the set on brickset.com
    'thumbnailURL': 'thumbnail_url',       # Small image of the set
    'imageURL': 'image_url'                # Full size image of the set
}

lego_sets = dpf.rename_columns(lego_sets, column_rename_dict)
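
For readers without access to the `utils` module, a rename helper of this kind can be sketched with `DataFrame.rename`. The version below is an assumption about what such a helper might do, not the actual `dpf.rename_columns`. It additionally raises an error on mapping keys that are absent from the frame, because `rename` silently ignores them by default.

```python
import pandas as pd

def rename_columns(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Return a copy of df with columns renamed according to mapping."""
    # Fail loudly on typos in the mapping instead of skipping them.
    missing = set(mapping) - set(df.columns)
    if missing:
        raise KeyError(f"Columns not found: {sorted(missing)}")
    return df.rename(columns=mapping)

# Toy frame (made-up subset of the real columns).
toy = pd.DataFrame({"name": ["Small house set"], "year": [1970]})
renamed = rename_columns(toy, {"name": "set_name", "year": "release_year"})
print(list(renamed.columns))
```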


In [11]:
lego_sets.head()

|   | set_id | set_name | release_year | theme_name | subtheme_name | theme_group | set_category | num_pieces | num_minifigs | min_age_recommended | us_retail_price | brickset_url | thumbnail_url | image_url |
|---|--------|----------|--------------|------------|---------------|-------------|--------------|------------|--------------|---------------------|-----------------|--------------|---------------|-----------|
| 0 | 1-8 | Small house set | 1970 | Minitalia | NaN | Vintage | Normal | 67.0 | NaN | NaN | NaN | https://brickset.com/sets/1-8 | https://images.brickset.com/sets/small/1-8.jpg | https://images.brickset.com/sets/images/1-8.jpg |
| 1 | 2-8 | Medium house set | 1970 | Minitalia | NaN | Vintage | Normal | 109.0 | NaN | NaN | NaN | https://brickset.com/sets/2-8 | https://images.brickset.com/sets/small/2-8.jpg | https://images.brickset.com/sets/images/2-8.jpg |
| 2 | 3-6 | Medium house set | 1970 | Minitalia | NaN | Vintage | Normal | 158.0 | NaN | NaN | NaN | https://brickset.com/sets/3-6 | https://images.brickset.com/sets/small/3-6.jpg | https://images.brickset.com/sets/images/3-6.jpg |
| 3 | 4-4 | Large house set | 1970 | Minitalia | NaN | Vintage | Normal | 233.0 | NaN | NaN | NaN | https://brickset.com/sets/4-4 | https://images.brickset.com/sets/small/4-4.jpg | https://images.brickset.com/sets/images/4-4.jpg |
| 4 | 4-6 | Mini House and Vehicles | 1970 | Samsonite | Model Maker | Vintage | Normal | NaN | NaN | NaN | NaN | https://brickset.com/sets/4-6 | NaN | NaN |
