
# 📝 Product Data Preparation
### Author: Dzastin Januzi

## 🎯 Goal
The goal of this notebook is to **clean, standardize, and audit the product dataset** to ensure it is ready for analysis.  
Key objectives include:
- Standardizing column names for consistency  
- Converting `listing_date` to proper datetime format  
- Rounding numeric metrics (`number_of_views`, `merchant_rating`) to two decimals  
- Identifying and removing rows with missing values and duplicated rows  
- Producing an audit-ready preview of the cleaned dataset

## 📑 Columns Description

| Column             | Description                                                          |
|--------------------|----------------------------------------------------------------------|
| 🆔 product_id      | Unique identifier for each product                                   |
| 🏷️ product_title   | Name/title of the product                                            |
| 🏪 merchant_id     | Unique identifier for the merchant                                   |
| 📂 category_label  | Category under which the product is listed                           |
| 🔢 product_code    | Internal product code (may overlap with product_id)                  |
| 👁️ number_of_views | Number of times the product listing has been viewed (numeric, float) |
| ⭐ merchant_rating | Rating score of the merchant (numeric, float, typically 1–5 scale)   |
| 📅 listing_date    | Date when the product was listed (datetime, formatted as YYYY-MM-DD) |
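
A quick way to confirm that a cleaned DataFrame matches the table above is to compare its columns with the expected names. The sketch below is only illustrative: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical helpers that are not part of this notebook, and the one-row frame is made-up sample data.

```python
import pandas as pd

# Column names taken from the description table above.
EXPECTED_COLUMNS = [
    "product_id", "product_title", "merchant_id", "category_label",
    "product_code", "number_of_views", "merchant_rating", "listing_date",
]

def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from `frame` (hypothetical helper)."""
    return [name for name in EXPECTED_COLUMNS if name not in frame.columns]

# Made-up one-row frame that deliberately lacks `listing_date`.
sample = pd.DataFrame({name: [None] for name in EXPECTED_COLUMNS[:-1]})
print(missing_columns(sample))
```

An empty list would mean every described column is present.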

## 📥 1. Load Data & Preview

We begin by loading the raw product dataset and inspecting the first few rows.  
This helps verify the structure, column names, and initial data quality.

In [78]:
import pandas as pd
from IPython.display import display

# Load the CSV file into a DataFrame from data folder
df = pd.read_csv('../data/IMLP4_TASK_03-products.csv')

# Display the first 5 rows of the DataFrame
display(df.head(5).style.set_caption("FIRST 5 ROWS"))

|   | product ID | Product Title | Merchant ID | Category Label | _Product Code | Number_of_Views | Merchant Rating | Listing Date |
|---|------------|---------------|-------------|----------------|---------------|-----------------|-----------------|--------------|
| 0 | 1 | apple iphone 8 plus 64gb silver | 1 | Mobile Phones | QA-2276-XC | 860.0 | 2.5 | 5/10/2024 |
| 1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | Mobile Phones | KA-2501-QO | 3772.0 | 4.8 | 12/31/2024 |
| 2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim free smartphone in gold | 3 | Mobile Phones | FP-8086-IE | 3092.0 | 3.9 | 11/10/2024 |
| 3 | 4 | apple iphone 8 plus 64gb space grey | 4 | Mobile Phones | YI-0086-US | 466.0 | 3.4 | 5/2/2022 |
| 4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked sim free | 5 | Mobile Phones | NZ-3586-WP | 4426.0 | 1.6 | 4/12/2023 |


## 🔍 2. Initial Audit

The purpose of this section is to **inspect the raw dataset** before any cleaning or transformation.  
We want to understand its structure, identify missing values, and check for duplicates.

### Steps:
- 📊 Display basic DataFrame information (`df.info()`)
- ⚠️ Count missing values per column
- 🔁 Check for duplicated rows

In [79]:
# Display the information of the DataFrame
df.info()

# Count missing values per column
df_nan_counts = df.isnull().sum().to_frame(name='Count')
display(df_nan_counts.style.set_caption("Missing Values per Column"))

# Count fully duplicated rows in the raw data (0 means every row is unique)
count_duplicated_rows = df.duplicated().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35311 non-null  int64  
 1   Product Title    35139 non-null  object 
 2   Merchant ID      35311 non-null  int64  
 3    Category Label  35267 non-null  object 
 4   _Product Code    35216 non-null  object 
 5   Number_of_Views  35297 non-null  float64
 6   Merchant Rating  35141 non-null  float64
 7    Listing Date    35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB


| Column          | Count |
|-----------------|-------|
| product ID      | 0     |
| Product Title   | 172   |
| Merchant ID     | 0     |
| Category Label  | 44    |
| _Product Code   | 95    |
| Number_of_Views | 14    |
| Merchant Rating | 170   |
| Listing Date    | 59    |

## 🧹 3. Cleaning & Transformation

In this section we apply transformations to standardize and clean the dataset.  
The goal is to ensure consistent column naming, correct data types, and removal of invalid rows.

### Steps:
- 📝 Standardize column names (lowercase, underscores, remove special characters)
- 📅 Convert `listing_date` to proper datetime format
- 🔢 Round numeric columns (`number_of_views`, `merchant_rating`) to 2 decimals
- ⚠️ Remove rows with missing values
- 🔁 Remove duplicated rows
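
The renaming step can be tried in isolation on the raw headers reported by `df.info()` in the audit, including their stray leading spaces and the leading underscore. This is a sketch of the idea only; `raw` and `clean` are names used here for illustration.

```python
import pandas as pd

# Raw headers as reported by df.info(), including stray leading spaces.
raw = pd.Index([
    "product ID", "Product Title", "Merchant ID", " Category Label",
    "_Product Code", "Number_of_Views", "Merchant Rating", " Listing Date",
])

clean = (
    raw.str.strip()                              # drop leading/trailing spaces
       .str.lower()                              # lowercase everything
       .str.replace(r"^_", "", regex=True)       # drop a leading underscore
       .str.replace(" ", "_", regex=False)       # spaces become underscores
       .str.replace(r"[^\w]", "", regex=True)    # drop remaining special characters
)

print(list(clean))
# ['product_id', 'product_title', 'merchant_id', 'category_label',
#  'product_code', 'number_of_views', 'merchant_rating', 'listing_date']
```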

In [80]:
# Standardize column names
df.columns = (
    df.columns
    .str.strip()                            # remove leading/trailing spaces
    .str.lower()                            # convert to lowercase
    .str.replace(r'^_', '', regex=True)     # remove an underscore only at the start
    .str.replace(' ', '_', regex=False)     # replace spaces with underscores (literal match)
    .str.replace(r'[^\w]', '', regex=True)  # remove special characters (\w already covers "_")
)

# Convert 'listing_date' to datetime format; values that cannot be parsed become NaT
# and are removed later together with the other missing values
df['listing_date'] = pd.to_datetime(df['listing_date'], errors='coerce', dayfirst=False)

# Round 'number_of_views' and 'merchant_rating' to 2 decimal places.
# Note: round() changes the stored values, but a Styler shown with display() applies its own
# float and datetime formatting, so the rendered preview may not show exactly two decimals.
# display() is still used here for better visualization in Jupyter Notebooks.
df['number_of_views'] = df['number_of_views'].round(2)
df['merchant_rating'] = df['merchant_rating'].round(2)


# Select the rows that contain at least one missing value (kept only for reporting)
rows_with_nan = df[df.isnull().any(axis=1)]

# Count the number of rows before removing missing values
rows_before = len(df)

# Remove rows with any missing values
df = df.dropna()

# Count the number of rows after removing missing values
rows_after = len(df)
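
Because `errors='coerce'` turns unparseable dates into `NaT`, such rows are counted as missing and removed by `dropna()` along with rows that were already empty in the source. A minimal sketch on made-up values: the `toy` frame is an assumption for illustration, and the explicit `format` is an addition of this sketch, based on the M/D/YYYY values seen in the raw preview.

```python
import pandas as pd

# Made-up rows: one valid pair, one unparseable date, one empty date, one empty rating.
toy = pd.DataFrame({
    "listing_date": ["5/10/2024", "not a date", None, "12/31/2024"],
    "merchant_rating": [2.5, 4.8, 3.9, None],
})

# Unparseable or missing dates become NaT instead of raising an error.
toy["listing_date"] = pd.to_datetime(
    toy["listing_date"], format="%m/%d/%Y", errors="coerce"
)

rows_before = len(toy)
rows_with_nan = toy[toy.isnull().any(axis=1)]  # includes the coerced NaT rows
cleaned = toy.dropna()

print(rows_before, len(rows_with_nan), len(cleaned))  # 4 3 1
```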

## 📊 4. Statistics & Preview

In this section we summarize the results of the cleaning process and preview the cleaned dataset.  
This provides a clear before/after comparison and confirms that the dataset is ready for analysis.

### Steps:
- 📈 Report number of rows before and after cleaning
- ⚠️ Show how many rows contained missing values
- 🔁 Report number of duplicates removed
- 👀 Preview the first 5 rows of the cleaned dataset
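
One way to report duplicates is to compare row counts around `drop_duplicates()`, which ties the reported number to the rows actually removed. A minimal sketch on made-up data (the product codes and ratings below are invented for illustration):

```python
import pandas as pd

# Made-up frame: rows 1 and 3 repeat row 0 exactly.
toy = pd.DataFrame({
    "product_code": ["QA-1", "QA-1", "KA-2", "QA-1"],
    "merchant_rating": [2.5, 2.5, 4.8, 2.5],
})

rows_before = len(toy)
deduped = toy.drop_duplicates()                  # keeps the first occurrence
duplicates_removed = rows_before - len(deduped)  # rows actually dropped

print(duplicates_removed)  # 2
```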

In [81]:
# Print the statistics
print("\n=== Prepare Data Statistics ===")
print("\n✅ Column names have been standardized.")
print("✅ 'listing_date' column has been converted to datetime format.")

# Remove duplicates from the cleaned data and report the rows actually dropped
rows_before_dedup = len(df)
df = df.drop_duplicates() # keep first occurrence
duplicates_removed = rows_before_dedup - len(df)
if duplicates_removed > 0:
    print(f"✅ Number of duplicated rows removed: {duplicates_removed}.")
else:
    print("ℹ️ No duplicated rows found.")

print(f"✅ Number of rows before removing missing values: {rows_before}")
print(f"✅ Number of rows with missing values: {len(rows_with_nan)}")
print(f"✅ Number of rows after removing missing values: {rows_after}")

# Display the first 5 rows of the cleaned DataFrame
display(df.head(5).style.set_caption("FIRST 5 ROWS"))



=== Prepare Data Statistics ===

✅ Column names have been standardized.
✅ 'listing_date' column has been converted to datetime format.
ℹ️ No duplicated rows found.
✅ Number of rows before removing missing values: 35311
✅ Number of rows with missing values: 551
✅ Number of rows after removing missing values: 34760


|   | product_id | product_title | merchant_id | category_label | product_code | number_of_views | merchant_rating | listing_date |
|---|------------|---------------|-------------|----------------|--------------|-----------------|-----------------|--------------|
| 0 | 1 | apple iphone 8 plus 64gb silver | 1 | Mobile Phones | QA-2276-XC | 860.0 | 2.5 | 2024-05-10 00:00:00 |
| 1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | Mobile Phones | KA-2501-QO | 3772.0 | 4.8 | 2024-12-31 00:00:00 |
| 2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim free smartphone in gold | 3 | Mobile Phones | FP-8086-IE | 3092.0 | 3.9 | 2024-11-10 00:00:00 |
| 3 | 4 | apple iphone 8 plus 64gb space grey | 4 | Mobile Phones | YI-0086-US | 466.0 | 3.4 | 2022-05-02 00:00:00 |
| 4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked sim free | 5 | Mobile Phones | NZ-3586-WP | 4426.0 | 1.6 | 2023-04-12 00:00:00 |
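
The preview shows `listing_date` with a `00:00:00` time component, while the column description calls for `YYYY-MM-DD`. One possible way to get a date-only, two-decimal preview without touching the stored values is to format a copy for display. This is a sketch on two rows copied from the preview; `cleaned` and `preview` are hypothetical names, and `preview` is meant for display only.

```python
import pandas as pd

# Two rows copied from the cleaned preview above (subset of columns).
cleaned = pd.DataFrame({
    "product_id": [1, 2],
    "number_of_views": [860.0, 3772.0],
    "merchant_rating": [2.5, 4.8],
    "listing_date": pd.to_datetime(["2024-05-10", "2024-12-31"]),
})

# Build a display-only copy; `cleaned` keeps its numeric and datetime dtypes.
preview = cleaned.assign(
    number_of_views=cleaned["number_of_views"].map("{:.2f}".format),
    merchant_rating=cleaned["merchant_rating"].map("{:.2f}".format),
    listing_date=cleaned["listing_date"].dt.strftime("%Y-%m-%d"),
)

print(preview.to_string(index=False))
```

In a notebook, recent pandas versions also allow `cleaned.style.format(precision=2)`, which changes only the rendering and leaves the frame untouched.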
