# Loading and Inspecting the Dataset

Before diving into analysis, we first need to load the dataset and take a look at its structure.

In this step, we will:

- Load the CSV file directly from GitHub using `pandas.read_csv`
- Check how many rows and columns the dataset contains with `df.shape`
- Review the column names to understand what variables are available
- Display the first few rows with `df.head()` to get a quick sense of the data
- Optionally inspect data types and basic metadata to confirm everything is correctly parsed

This initial inspection helps us ensure the dataset is correctly loaded and ready for further exploration.

In [23]:
# Load dataset from GitHub
import pandas as pd 

url = "https://raw.githubusercontent.com/Uvirgil/products-category-prediction/main/data/products.csv"
df = pd.read_csv(url)


# Quick structure
print(df.shape)
print(df.columns.tolist())
df.head(5)

(35311, 8)
['product ID', 'Product Title', 'Merchant ID', ' Category Label', '_Product Code', 'Number_of_Views', 'Merchant Rating', ' Listing Date  ']


| | product ID | Product Title | Merchant ID | Category Label | _Product Code | Number_of_Views | Merchant Rating | Listing Date |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | apple iphone 8 plus 64gb silver | 1 | Mobile Phones | QA-2276-XC | 860.0 | 2.5 | 5/10/2024 |
| 1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | Mobile Phones | KA-2501-QO | 3772.0 | 4.8 | 12/31/2024 |
| 2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | 3 | Mobile Phones | FP-8086-IE | 3092.0 | 3.9 | 11/10/2024 |
| 3 | 4 | apple iphone 8 plus 64gb space grey | 4 | Mobile Phones | YI-0086-US | 466.0 | 3.4 | 5/2/2022 |
| 4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | 5 | Mobile Phones | NZ-3586-WP | 4426.0 | 1.6 | 4/12/2023 |
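
The optional data type check from the list above can be sketched on a small in-memory sample. The header is copied from the printed column list and the two rows from the `df.head()` output; `raw_csv`, `sample`, and `numeric_cols` are illustrative names that do not appear in the notebook.

```python
import io

import pandas as pd

# Two rows copied from the preview, with the raw header exactly as printed
# (note the stray spaces around some column names).
raw_csv = io.StringIO(
    "product ID,Product Title,Merchant ID, Category Label,_Product Code,"
    "Number_of_Views,Merchant Rating, Listing Date  \n"
    "1,apple iphone 8 plus 64gb silver,1,Mobile Phones,QA-2276-XC,860.0,2.5,5/10/2024\n"
    "2,apple iphone 8 plus 64 gb spacegrau,2,Mobile Phones,KA-2501-QO,3772.0,4.8,12/31/2024\n"
)
sample = pd.read_csv(raw_csv)

# Data type that pandas inferred for each column
print(sample.dtypes)

# Columns that were parsed as numbers
numeric_cols = [
    col for col in sample.columns
    if pd.api.types.is_numeric_dtype(sample[col])
]
print(numeric_cols)

# repr() makes leading and trailing whitespace in the names visible
print([repr(col) for col in sample.columns])
```

On this sample the listing date is read as text rather than as a date, since `read_csv` only parses dates when asked to.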


# Cleaning and Inspecting Column Names

After loading the dataset, it’s important to standardize the column names to make them easier to work with in code. Inconsistent naming (extra spaces, uppercase letters, or leading underscores) can cause errors later when selecting or transforming columns.

In this step, we:

- **Strip whitespace** from column names using `.str.strip()`
- **Remove leading underscores** with `.str.lstrip('_')`
- **Replace spaces with underscores** using `.str.replace(' ', '_')`
- **Convert all names to lowercase** with `.str.lower()`

This ensures all column names follow a consistent, Python‑friendly format.
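
As a quick check, the same chain can be applied to the raw column names printed earlier, without touching the DataFrame. This is a minimal sketch; `raw_names` and `cleaned` are illustrative names, and `regex=False` is passed only to make the literal replacement explicit.

```python
import pandas as pd

# Raw column names, copied from the printed column list
raw_names = pd.Index([
    "product ID", "Product Title", "Merchant ID", " Category Label",
    "_Product Code", "Number_of_Views", "Merchant Rating", " Listing Date  ",
])

cleaned = (
    raw_names
    .str.strip()                         # drop leading/trailing whitespace
    .str.lstrip("_")                     # drop leading underscores
    .str.replace(" ", "_", regex=False)  # inner spaces become underscores
    .str.lower()                         # lowercase everything
)
print(cleaned.tolist())

# Sanity checks: names are unique and usable as Python attribute names
assert cleaned.is_unique
assert all(name.isidentifier() for name in cleaned)
```

Because every cleaned name is a valid identifier, attribute access such as `df.category_label` works alongside `df["category_label"]`.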

Next, we perform a quick inspection:

- `df.head()` shows the first 5 rows with the renamed columns
- `df.info()` provides metadata: number of entries, column names, data types, and non‑null counts
- `df.category_label.value_counts()` displays the frequency of each category, helping us understand the distribution of target labels

Together, these checks confirm that the column names are normalized and show each column's data type, its non‑null count, and how the target labels are distributed, which is what we need to know before cleaning the values themselves.
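
The inspection calls behave slightly differently from one another, which a toy frame makes easy to see. In this sketch only the "Mobile Phones" label comes from the dataset preview; the "TVs" label, the titles, and the names `toy`, `counts`, and `shares` are made up for illustration.

```python
import pandas as pd

# Toy data: "TVs" and the titles are placeholders, not values from the dataset
toy = pd.DataFrame({
    "product_title": ["phone a", "phone b", "phone c", "tv a"],
    "category_label": ["Mobile Phones", "Mobile Phones", "Mobile Phones", "TVs"],
})

# info() writes its report to stdout itself and returns None,
# so wrapping it in print() only adds a stray "None" line.
result = toy.info()
print(result)

# Absolute frequency of each label, most frequent first
counts = toy["category_label"].value_counts()
print(counts)

# Relative frequency, handy for spotting class imbalance
shares = toy["category_label"].value_counts(normalize=True)
print(shares)
```

Relative frequencies are often easier to read than raw counts when the number of categories is large.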

In [24]:
df.columns = (
    df.columns
    .str.strip()
    .str.lstrip('_')
    .str.replace(' ', '_')
    .str.lower()
)

print(df.head())
df.info()  # info() prints its report directly and returns None
print(df.category_label.value_counts())


   product_id                                      product_title  merchant_id  \
0           1                    apple iphone 8 plus 64gb silver            1   
1           2                apple iphone 8 plus 64 gb spacegrau            2   
2           3  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...            3   
3           4                apple iphone 8 plus 64gb space grey            4   
4           5  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...            5   

  category_label product_code  number_of_views  merchant_rating listing_date  
0  Mobile Phones   QA-2276-XC            860.0              2.5    5/10/2024  
1  Mobile Phones   KA-2501-QO           3772.0              4.8   12/31/2024  
2  Mobile Phones   FP-8086-IE           3092.0              3.9   11/10/2024  
3  Mobile Phones   YI-0086-US            466.0              3.4     5/2/2022  
4  Mobile Phones   NZ-3586-WP           4426.0              1.6    4/12/2023  
<class 'pandas.core.frame.DataFrame'>
...

# Cleaning Product Titles and Handling Missing Data

Raw text fields often contain inconsistencies such as extra spaces, mixed casing, or missing values.  
To prepare the dataset for modeling, we need to standardize product titles and remove incomplete rows.

In this step, we:

- **Define a helper function `clean_title(text)`**:
  - If the value is missing (`NaN`), return an empty string
  - Convert the text to lowercase
  - Strip leading and trailing spaces
  - Collapse multiple spaces into a single space

- **Apply the cleaning function** to the `product_title` column so all titles follow a consistent format

- **Check the initial dataset shape** with `df.shape` to see how many rows and columns we start with

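- **Drop rows with missing values**:
  - First, drop rows where `product_title` or `category_label` is `NaN`. Because `clean_title` has already turned missing titles into empty strings, this check catches missing labels but not empty titles
  - Then, remove any remaining rows with missing values in other columns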

- **Print the new dataset shape** to confirm how many rows remain after cleaning

- **Display missing values per column** using `df.isna().sum()` to verify that the dataset is now complete

This gives us normalized product titles and no `NaN` values in any column, which makes the dataset more reliable for feature engineering and model training.

In [25]:
def clean_title(text):
    if pd.isna(text):
        return ""
    t = str(text).lower().strip()
    return " ".join(t.split())

df["product_title"] = df["product_title"].apply(clean_title)
print("Initial dataset shape:", df.shape)

df = df.dropna(subset=["product_title", "category_label"])
df = df.dropna()
print("New dataset shape:", df.shape)

print("Missing values per column:")
print(df.isna().sum())

Initial dataset shape: (35311, 8)
New dataset shape: (34929, 8)
Missing values per column:
product_id         0
product_title      0
merchant_id        0
category_label     0
product_code       0
number_of_views    0
merchant_rating    0
listing_date       0
dtype: int64
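
One caveat is worth checking: `clean_title` returns an empty string for a missing title, and `dropna` does not treat `""` as missing. The sketch below reproduces this on toy data and shows one possible way to drop such rows as well. The frame `toy` and the names `after_dropna` and `strict` are illustrative, and whether the real dataset contains any empty titles is not known from the output above.

```python
import pandas as pd


def clean_title(text):
    # Same helper as in the notebook cell above
    if pd.isna(text):
        return ""
    t = str(text).lower().strip()
    return " ".join(t.split())


# Toy rows: a normal title, a missing title, and a whitespace-only title
toy = pd.DataFrame({
    "product_title": ["  Apple  iPhone 8 ", None, "   "],
    "category_label": ["Mobile Phones", "Mobile Phones", None],
})
toy["product_title"] = toy["product_title"].apply(clean_title)

# The missing title is now "", so only the row with a missing label is dropped
after_dropna = toy.dropna(subset=["product_title", "category_label"])
print(len(after_dropna))

# Possible stricter variant: turn empty titles back into NaN before dropping
strict = toy.copy()
strict["product_title"] = strict["product_title"].mask(strict["product_title"] == "")
strict = strict.dropna(subset=["product_title", "category_label"])
print(strict["product_title"].tolist())
```

A quick `(df["product_title"] == "").sum()` on the real DataFrame would show whether this case occurs at all.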


# Saving the Cleaned DataFrame to `products_modified.csv`

In [26]:
# index=False avoids writing the row index as an extra unnamed column
df.to_csv("../data/products_modified.csv", index=False)
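
Whether the row index ends up in the file depends on the `index` argument of `to_csv`, which defaults to `True`. The round trip below is a small sketch of the difference, using a temporary directory and a made-up two-row frame; the file names and the variables `toy`, `cols_with_index`, and `cols_without_index` are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the cleaned dataset
toy = pd.DataFrame({"product_id": [1, 2], "product_title": ["a", "b"]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                  # default: index is written
    toy.to_csv(without_index, index=False)  # index is left out

    # Reading the files back shows the extra column in the first case
    cols_with_index = pd.read_csv(with_index).columns.tolist()
    cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

After rows have been dropped, the index is no longer contiguous, so writing it out rarely carries useful information.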