<a href="https://colab.research.google.com/github/almastupac/ml-automatic-classification-of-products-by-category/blob/main/notebook/classification_of_products_by_category.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 📥 Loading and inspecting the dataset
Before diving into the analysis, we first need to load the dataset and take a look at its structure.
In this step, we will:
- Load the CSV file from GitHub
- Check how many rows and columns we have
- Display the first few rows
- Review data types and basic metadata for each column

This will help us ensure the dataset is correctly loaded and ready for further exploration.

In [1]:
import pandas as pd

# load dataset from GitHub
url = "https://raw.githubusercontent.com/almastupac/ml-automatic-classification-of-products-by-category/main/data/products.csv"
df = pd.read_csv(url)

# Print shape (number of rows and columns)
print("Dataset shape (rows, columns): ", df.shape)

# Show first 5 rows
print("\nFirst 5 rows:")
display(df.head())

# Show column data types and non-null counts
print("\nDataset info:")
df.info()

Dataset shape (rows, columns):  (35311, 8)

First 5 rows:


| | product ID | Product Title | Merchant ID | Category Label | _Product Code | Number_of_Views | Merchant Rating | Listing Date |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | apple iphone 8 plus 64gb silver | 1 | Mobile Phones | QA-2276-XC | 860.0 | 2.5 | 5/10/2024 |
| 1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | Mobile Phones | KA-2501-QO | 3772.0 | 4.8 | 12/31/2024 |
| 2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | 3 | Mobile Phones | FP-8086-IE | 3092.0 | 3.9 | 11/10/2024 |
| 3 | 4 | apple iphone 8 plus 64gb space grey | 4 | Mobile Phones | YI-0086-US | 466.0 | 3.4 | 5/2/2022 |
| 4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | 5 | Mobile Phones | NZ-3586-WP | 4426.0 | 1.6 | 4/12/2023 |



Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35311 non-null  int64  
 1   Product Title    35139 non-null  object 
 2   Merchant ID      35311 non-null  int64  
 3    Category Label  35267 non-null  object 
 4   _Product Code    35216 non-null  object 
 5   Number_of_Views  35297 non-null  float64
 6   Merchant Rating  35141 non-null  float64
 7    Listing Date    35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB
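
The `df.info()` output above shows that several columns have fewer than 35311 non-null values (for example, `Product Title` has 35139, so 172 titles are missing). A minimal sketch of how we could tally this per column follows. The `missing_summary` helper and the small `toy` DataFrame are hypothetical and only stand in for the real dataset; neither is part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (assumption for illustration)
toy = pd.DataFrame({
    "product_title": ["apple iphone 8 plus 64gb silver", None, "apple iphone 8 plus 64gb space grey"],
    "merchant_rating": [2.5, 4.8, None],
    "merchant_id": [1, 2, 4],
})


def missing_summary(frame):
    """Return missing-value counts and percentages per column, most-missing first."""
    counts = frame.isna().sum()  # number of NaN/None values in each column
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


print(missing_summary(toy))
```

Calling `missing_summary(df)` on the loaded dataset should give the same kind of table for the real columns.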


## 🧹 Cleaning Column Names
Before further processing, we clean and standardize the column names by:
- stripping leading and trailing whitespace
- converting to lowercase
- removing leading underscores
- replacing spaces and hyphens with underscores
- removing any remaining special characters

In [4]:
# Clean column names
df.columns = (
    df.columns
    .str.strip()  # remove leading/trailing spaces
    .str.lower()  # lowercase
    .str.lstrip("_")  # remove leading underscore
    .str.replace(" ", "_")  # replace spaces
    .str.replace("-", "_")  # replace hyphens
    .str.replace(r"[^a-zA-Z0-9_]", "", regex=True)  # remove special characters
)

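# Confirm the cleaned column names
# (a bare df.head() here would not be displayed, since it is not the last expression in the cell)
df.info()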

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product_id       35311 non-null  int64  
 1   product_title    35139 non-null  object 
 2   merchant_id      35311 non-null  int64  
 3   category_label   35267 non-null  object 
 4   product_code     35216 non-null  object 
 5   number_of_views  35297 non-null  float64
 6   merchant_rating  35141 non-null  float64
 7   listing_date     35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB
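
To make the cleaning step reusable and easy to check, the same chain of string operations could be wrapped in a function. This is only a sketch: `clean_column_names` is a hypothetical helper name, and the empty DataFrame below is built just to carry the raw column names reported by the first `df.info()` call (including the leading spaces and the leading underscore).

```python
import pandas as pd


def clean_column_names(frame):
    """Return a copy of `frame` with standardized column names."""
    cleaned = (
        frame.columns
        .str.strip()                                   # remove leading/trailing spaces
        .str.lower()                                   # lowercase
        .str.lstrip("_")                               # remove leading underscores
        .str.replace(" ", "_", regex=False)            # replace spaces
        .str.replace("-", "_", regex=False)            # replace hyphens
        .str.replace(r"[^a-z0-9_]", "", regex=True)    # remove special characters
    )
    out = frame.copy()
    out.columns = cleaned
    return out


# Raw column names as they appear in the first df.info() output
raw_columns = [
    "product ID", "Product Title", "Merchant ID", " Category Label",
    "_Product Code", "Number_of_Views", "Merchant Rating", " Listing Date",
]
empty = pd.DataFrame(columns=raw_columns)  # no rows needed to test the renaming

cleaned = clean_column_names(empty)
print(list(cleaned.columns))

# Cleaning can make two different raw names collide, so check uniqueness
assert cleaned.columns.is_unique
```

Returning a copy instead of assigning to `df.columns` in place keeps the original frame untouched, which makes the cell safe to re-run.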
