## **Final Task** - Category prediction based on 'products.csv' data

##### **Author**: Danilo Jelovac
---
##### >>. **Goal**:
The goal is to choose the best-performing model and train it to predict a
product's category from the product data with high precision.

This notebook covers data analysis and preprocessing: an in-depth look at the
data, followed by preparing it for the ML process.


In [28]:
# ------------------------------------------
# Importing libraries required for this task:
# ------------------------------------------

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# --Confirmation message:
print(">. If you see this message - the libraries were imported successfully!\n")

>. If you see this message - the libraries were imported successfully!



In [43]:
# ----------------------------------------------
# Loading dataset, printing out samples and data:
# ----------------------------------------------


# ------------------------
FOLDER_NAME = "ml_data"
FILE_NAME = "products.csv"
# ------------------------


# --Loading the dataframe:

df = pd.read_csv(f"../{FOLDER_NAME}/{FILE_NAME}")

# --Samples presentation:

print("==== SAMPLE PRESENTATION ====")
print("-" * 30, "\n")

print(f">. Dataframe shape (rows, columns) -> {df.shape}")
print(f">. Dataframe column names -> {df.columns.to_list()}")

print("\n>. Sample of 5 rows:\n------")
display(df.head())

print("\n>. Information:\n------")
print(df.info())

print("\n>. Number of NaNs:\n------")
print(f"""- product ID -> {df['product ID'].isna().sum()}
- Product Title -> {df['Product Title'].isna().sum()}
- Merchant ID -> {df['Merchant ID'].isna().sum()}
- Category Label -> {df[' Category Label'].isna().sum()}
- Product Code -> {df['_Product Code'].isna().sum()}
- Number of Views -> {df['Number_of_Views'].isna().sum()}
- Merchant Rating -> {df['Merchant Rating'].isna().sum()}
- Listing Date -> {df[' Listing Date  '].isna().sum()}  
      """)

print("\n>. Checking on data types...\n------")
print(df.dtypes)

print("\n", "-" * 30, "\n")

==== SAMPLE PRESENTATION ====
------------------------------ 

>. Dataframe shape (rows, columns) -> (35311, 8)
>. Dataframe column names -> ['product ID', 'Product Title', 'Merchant ID', ' Category Label', '_Product Code', 'Number_of_Views', 'Merchant Rating', ' Listing Date  ']

>. Sample of 5 rows:
------


|   | product ID | Product Title | Merchant ID | Category Label | _Product Code | Number_of_Views | Merchant Rating | Listing Date |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | apple iphone 8 plus 64gb silver | 1 | Mobile Phones | QA-2276-XC | 860.0 | 2.5 | 5/10/2024 |
| 1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | Mobile Phones | KA-2501-QO | 3772.0 | 4.8 | 12/31/2024 |
| 2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | 3 | Mobile Phones | FP-8086-IE | 3092.0 | 3.9 | 11/10/2024 |
| 3 | 4 | apple iphone 8 plus 64gb space grey | 4 | Mobile Phones | YI-0086-US | 466.0 | 3.4 | 5/2/2022 |
| 4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | 5 | Mobile Phones | NZ-3586-WP | 4426.0 | 1.6 | 4/12/2023 |



>. Information:
------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35311 non-null  int64  
 1   Product Title    35139 non-null  object 
 2   Merchant ID      35311 non-null  int64  
 3    Category Label  35267 non-null  object 
 4   _Product Code    35216 non-null  object 
 5   Number_of_Views  35297 non-null  float64
 6   Merchant Rating  35141 non-null  float64
 7    Listing Date    35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB
None

>. Number of NaNs:
------
- product ID -> 0
- Product Title -> 172
- Merchant ID -> 0
- Category Label -> 44
- Product Code -> 95
- Number of Views -> 14
- Merchant Rating -> 170
- Listing Date -> 59  
      

>. Checking on data types...
------
product ID           int64
Product Title       object
Merchant ID          int64
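 Category Label     object
_Product Code       object
Number_of_Views    float64
Merchant Rating    float64
 Listing Date       object
dtype: object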
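
The per-column NaN listing above spells out every column name by hand, including the stray spaces. As a minimal sketch of an alternative, `isna().sum()` can be applied to the whole dataframe at once, and `isna().any(axis=1)` counts the rows that a later `dropna()` would remove. The `toy` frame and its values are illustrative stand-ins, not rows from `products.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (illustrative values, not taken from products.csv).
toy = pd.DataFrame({
    "product ID": [1, 2, 3, 4],
    "Product Title": ["apple iphone 8", None, "sony tv", "bosch fridge"],
    "Merchant Rating": [2.5, 4.8, np.nan, 3.4],
})

# NaN count and share per column, without typing any column name.
nan_summary = pd.DataFrame({
    "nan_count": toy.isna().sum(),
    "nan_share_pct": (toy.isna().mean() * 100).round(2),
})
print(nan_summary)

# Rows with at least one NaN: exactly the rows dropna() would discard.
rows_with_nan = int(toy.isna().any(axis=1).sum())
print(f"rows with at least one NaN: {rows_with_nan} of {len(toy)}")
```

Run against the real `df`, the same two expressions would give the exact number of rows lost to cleaning, which is a tighter figure than the sum of the per-column counts.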

In [46]:
# -------------------------------------------
# Cleaning, fixing, preparing df for analysis:
# -------------------------------------------


# --Cleaning the NaNs:
#   The dataframe is large (~35k rows) and the number of NaNs is small
#   (the per-column counts above sum to 554, so at most ~1.6% of rows are
#   affected), so we drop those rows rather than let them confuse the model.

cleaned_df = df.dropna()
print("\n>. Number of NaNs check:\n------")
print(cleaned_df.isna().sum())

# --Fixing column names:
#   They are inconsistent, with stray spaces and unnecessary characters.

cleaned_df.columns = (cleaned_df.columns.str.replace("_"," ").str.strip().str.lower().str.replace(" ", "_"))
print(f"\n>. Columns fix check -> {cleaned_df.columns.tolist()}")

print("\n", "-" * 30, "\n")


>. Number of NaNs check:
------
product ID         0
Product Title      0
Merchant ID        0
 Category Label    0
_Product Code      0
Number_of_Views    0
Merchant Rating    0
 Listing Date      0
dtype: int64

>. Columns fix check -> ['product_id', 'product_title', 'merchant_id', 'category_label', 'product_code', 'number_of_views', 'merchant_rating', 'listing_date']

 ------------------------------ 
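
The column-name fix above chains four string operations, and their order matters: underscores are turned into spaces first so that a leading underscore (as in `_Product Code`) is removed by `strip()`, and only then are the remaining spaces turned back into underscores. The sketch below wraps the same chain in a helper to make each step visible. The name `normalize_columns` is hypothetical, and the helper assumes labels contain only single spaces between words.

```python
import pandas as pd


def normalize_columns(columns: pd.Index) -> pd.Index:
    """Hypothetical helper mirroring the chain used in the cleaning cell."""
    return (
        columns.str.replace("_", " ", regex=False)   # "_Product Code" -> " Product Code"
        .str.strip()                                 # drop leading/trailing whitespace
        .str.lower()                                 # "Product Code" -> "product code"
        .str.replace(" ", "_", regex=False)          # "product code" -> "product_code"
    )


# A few of the raw labels reported by df.columns earlier in the notebook.
raw = pd.Index(["product ID", " Category Label", "_Product Code",
                "Number_of_Views", " Listing Date  "])
normalized = normalize_columns(raw)
print(normalized.tolist())
```

A label with two consecutive spaces would end up with a double underscore under this scheme, so a different rule would be needed if such labels appear.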



In [None]:
# -----------------------------------------------
# Data analysis through visualization of the data:
# -----------------------------------------------


#