# Data Preprocessing


> During data preprocessing for the Amazon product categorization project, the raw product names were cleaned in four steps: removing English stop words, stripping punctuation, dropping tokens that contain digits (such as model numbers and product IDs), and removing leftover single-character tokens. The dataset was then reduced to the two columns needed for modeling, "name" and "main_category", and saved for later analysis. These steps standardize the product names so that feature extraction and classification work from descriptive words rather than formatting noise.
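
The four cleaning steps can be sketched on a single string without any external dependencies. This is a minimal illustration, not the notebook's implementation: `clean_name` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word list is a tiny stand-in for NLTK's English list, and tokens are split on whitespace rather than with `word_tokenize`. Because of that last assumption, results can differ slightly from the notebook (for example, "Men's" becomes "Mens" here instead of "Men").

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOP_WORDS = {"for", "and", "of", "the", "by", "a", "an", "in", "with"}


def clean_name(name, stop_words=DEMO_STOP_WORDS):
    """Apply the four cleaning steps to one product name, in notebook order."""
    # 1. Drop stop words (case-insensitive), using simple whitespace tokens.
    tokens = [t for t in name.split() if t.lower() not in stop_words]
    text = " ".join(tokens)
    # 2. Delete every character that is not an ASCII letter, digit, or whitespace.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # 3. Drop tokens that contain a digit (model numbers, product IDs).
    tokens = [t for t in text.split() if not re.search(r"\d", t)]
    # 4. Drop single-character tokens left over from the earlier steps.
    tokens = [t for t in tokens if len(t) > 1]
    return " ".join(tokens)


print(clean_name("Fastrack Brown Leather Men's Wallet (C0408LBR01)"))
# Fastrack Brown Leather Mens Wallet
print(clean_name("Astroghar Evil Eye Pendant for Protection & Prosperity for Men and Women"))
# Astroghar Evil Eye Pendant Protection Prosperity Men Women
```

Working on token lists and joining once at the end keeps the output free of repeated spaces, which is convenient when the cleaned names are later split on whitespace.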

In [1]:
import re
import warnings

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Stop-word list used by remove_stop_words below
nltk.download('stopwords')
# Note: word_tokenize also needs NLTK's "punkt" tokenizer data
# ("punkt_tab" on newer NLTK releases) if it is not already installed

# notebook configurations
pd.options.display.max_colwidth = 1000
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/bzekeria/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
df = pd.read_csv("data/amazon_products_sampled_raw.csv")
df

Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
0,Caprese ZOLA women's Satchel (YELLOW),accessories,Bags & Luggage,https://m.media-amazon.com/images/I/717z0jxPDmL._AC_UL320_.jpg,https://www.amazon.in/Caprese-SLZOLLGYLW-Womens-Western-Yellow/dp/B0B3YRYFP7/ref=sr_1_8081?qid=1679144163&s=luggage&sr=1-8081,5.0,1,"₹1,244.71","₹4,599"
1,Fastrack Brown Leather Men's Wallet (C0408LBR01),accessories,Bags & Luggage,https://m.media-amazon.com/images/I/81WgnqcRnzL._AC_UL320_.jpg,https://www.amazon.in/Fastrack-Brown-Mens-Wallet-C0408LBR01/dp/B07BKYB2DX/ref=sr_1_553?qid=1679143906&s=luggage&sr=1-553,4.3,2162,₹821,"₹1,095"
2,TEKZIE Butterfly Colourful Combo Set of - 3 Watch for Girls & Women.,accessories,Watches,https://m.media-amazon.com/images/I/41XT3Ck+IhL._AC_UL320_.jpg,https://www.amazon.in/TEKZIE-Butterfly-Colourful-Combo-Set/dp/B09NGJ8ZG5/ref=sr_1_9743?qid=1679155977&s=watches&sr=1-9743,,,₹323,₹999
3,Sorellaz Womens Rose Gold Open Branch Ring: SR/FAJEWLK21-L80/1,accessories,Fashion & Silver Jewellery,https://m.media-amazon.com/images/I/51KbDbULuDL._AC_UL320_.jpg,https://www.amazon.in/Sorellaz-Womens-Rose-Gold-Branch/dp/B0B3RR7CLH/ref=sr_1_17546?qid=1679160557&s=jewelry&sr=1-17546,3.4,3,₹189,₹730
4,Astroghar Evil Eye Pendant for Protection & Prosperity for Men and Women,accessories,Fashion & Silver Jewellery,https://m.media-amazon.com/images/I/51Ug3-OyR+L._AC_UL320_.jpg,https://www.amazon.in/ASTROGHAR-Protection-Multicolour-Pendant-Prosperity/dp/B08FF3DDVL/ref=sr_1_14583?qid=1679160450&s=jewelry&sr=1-14583,3.4,38,₹240,₹600
...,...,...,...,...,...,...,...,...,...
77179,Cleo by Khadim's Synthetic PVC Sole Blue Decorative Sandal for Women,women's shoes,Fashion Sandals,https://m.media-amazon.com/images/I/61lOsS1GH2L._AC_UL320_.jpg,https://www.amazon.in/Cleo-Khadims-Synthetic-Decorative-Sandal/dp/B092WVLSZH/ref=sr_1_7901?qid=1679211724&s=shoes&sr=1-7901,2.0,1,₹389,₹649
77180,"Walky Wear Dashing Bellies, Ballet Flat Belly for Womens and Girl's",women's shoes,Ballerinas,https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T1/images/I/6130NBC92YL._AC_UL320_.jpg,https://www.amazon.in/Walky-Wear-Dashing-Bellies-Ballet/dp/B08F37V6ZT/ref=sr_1_697?qid=1679211836&s=shoes&sr=1-697,3.7,7,₹499,₹999
77181,BEREALSlate Black Sandal Women,women's shoes,Shoes,https://m.media-amazon.com/images/I/51VeJPTxggL._AC_UL320_.jpg,https://www.amazon.in/BEREALSlate-Black-Sandal-Women-numeric_5/dp/B09GNKFFPZ/ref=sr_1_5725?qid=1679211540&s=shoes&sr=1-5725,3.1,3,₹959,"₹1,599"
77182,Stride girls Anok Floaters,women's shoes,Fashion Sandals,https://m.media-amazon.com/images/I/71-IgueZKNL._AC_UL320_.jpg,https://www.amazon.in/Stride-Womens-Black-Floaters-5-16119093/dp/B07XKGSNMP/ref=sr_1_6034?qid=1679211702&s=shoes&sr=1-6034,5.0,1,₹373.10,"₹2,999"


In [3]:
df.columns

Index(['name', 'main_category', 'sub_category', 'image', 'link', 'ratings',
       'no_of_ratings', 'discount_price', 'actual_price'],
      dtype='object')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77184 entries, 0 to 77183
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            77184 non-null  object
 1   main_category   77184 non-null  object
 2   sub_category    77184 non-null  object
 3   image           77184 non-null  object
 4   link            77184 non-null  object
 5   ratings         54133 non-null  object
 6   no_of_ratings   54133 non-null  object
 7   discount_price  66992 non-null  object
 8   actual_price    75006 non-null  object
dtypes: object(9)
memory usage: 5.3+ MB


In [5]:
df["main_category"].unique()

array(['accessories', 'appliances', 'bags & luggage', 'beauty & health',
       'car & motorbike', 'grocery & gourmet foods', 'home & kitchen',
       'home, kitchen, pets', 'industrial supplies', "kids' fashion",
       "men's clothing", "men's shoes", 'music', 'pet supplies',
       'sports & fitness', 'stores', 'toys & baby products',
       'tv, audio & cameras', "women's clothing", "women's shoes"],
      dtype=object)

In [6]:
name_null_values = df["name"].isnull().sum()

# Print the number of null values
print("Number of null values in name:", name_null_values)

Number of null values in name: 0


In [7]:
main_category_null_values = df["main_category"].isnull().sum()

# Print the number of null values
print("Number of null values in main_category:", main_category_null_values)

Number of null values in main_category: 0


In [8]:
# Build the English stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words("english"))

def remove_stop_words(input_string):
    """Return input_string with English stop words removed."""
    # Tokenize the string (this also splits off punctuation and clitics such as "'s")
    tokens = word_tokenize(input_string)

    # Remove stop words, comparing case-insensitively
    filtered_tokens = [token for token in tokens if token.lower() not in STOP_WORDS]

    # Reconstruct the string, joining tokens with single spaces
    return " ".join(filtered_tokens)

In [9]:
# Apply to column "name"
df['name'] = df['name'].apply(remove_stop_words)

In [10]:
df.name

0                                          Caprese ZOLA women 's Satchel ( YELLOW )
1                               Fastrack Brown Leather Men 's Wallet ( C0408LBR01 )
2                    TEKZIE Butterfly Colourful Combo Set - 3 Watch Girls & Women .
3                   Sorellaz Womens Rose Gold Open Branch Ring : SR/FAJEWLK21-L80/1
4                      Astroghar Evil Eye Pendant Protection & Prosperity Men Women
                                            ...                                    
77179                Cleo Khadim 's Synthetic PVC Sole Blue Decorative Sandal Women
77180                 Walky Wear Dashing Bellies , Ballet Flat Belly Womens Girl 's
77181                                                BEREALSlate Black Sandal Women
77182                                                    Stride girls Anok Floaters
77183    Punjabi Juttis Women Ethnic Flat Traditional Mojari Handmade Stylish Shoes
Name: name, Length: 77184, dtype: object

In [11]:
df[["name"]]

Unnamed: 0,name
0,Caprese ZOLA women 's Satchel ( YELLOW )
1,Fastrack Brown Leather Men 's Wallet ( C0408LBR01 )
2,TEKZIE Butterfly Colourful Combo Set - 3 Watch Girls & Women .
3,Sorellaz Womens Rose Gold Open Branch Ring : SR/FAJEWLK21-L80/1
4,Astroghar Evil Eye Pendant Protection & Prosperity Men Women
...,...
77179,Cleo Khadim 's Synthetic PVC Sole Blue Decorative Sandal Women
77180,"Walky Wear Dashing Bellies , Ballet Flat Belly Womens Girl 's"
77181,BEREALSlate Black Sandal Women
77182,Stride girls Anok Floaters


In [12]:
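# Remove punctuation: delete every character that is not an ASCII letter, digit, or whitespace
# (this also deletes non-ASCII letters such as accented characters)
df["name"] = df["name"].apply(lambda x: re.sub(r"[^a-zA-Z0-9\s]", "", str(x)))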

In [13]:
df[["name"]]

Unnamed: 0,name
0,Caprese ZOLA women s Satchel YELLOW
1,Fastrack Brown Leather Men s Wallet C0408LBR01
2,TEKZIE Butterfly Colourful Combo Set 3 Watch Girls Women
3,Sorellaz Womens Rose Gold Open Branch Ring SRFAJEWLK21L801
4,Astroghar Evil Eye Pendant Protection Prosperity Men Women
...,...
77179,Cleo Khadim s Synthetic PVC Sole Blue Decorative Sandal Women
77180,Walky Wear Dashing Bellies Ballet Flat Belly Womens Girl s
77181,BEREALSlate Black Sandal Women
77182,Stride girls Anok Floaters
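
The character class `[^a-zA-Z0-9\s]` deletes anything outside ASCII letters, digits, and whitespace, so accented letters disappear along with punctuation. The sketch below shows this on a made-up product name (not from the dataset) and contrasts it with a Unicode-aware pattern. The alternative is an assumption offered for comparison, not something the notebook uses.

```python
import re

# Made-up example name: "Café Crème Men's T-Shirt"
name = "Caf\u00e9 Cr\u00e8me Men's T-Shirt"

# Pattern used in the notebook: keeps only ASCII letters, digits, whitespace.
ascii_only = re.sub(r"[^a-zA-Z0-9\s]", "", name)

# Alternative (an assumption, not the notebook's choice): in Python 3 str
# patterns, \w matches Unicode letters and digits, so accented letters survive.
# \w also matches "_", so underscores are removed explicitly.
unicode_aware = re.sub(r"[^\w\s]|_", "", name)

print(ascii_only)     # Caf Crme Mens TShirt
print(unicode_aware)  # Café Crème Mens TShirt
```

For a dataset of mostly English product names the ASCII pattern is a reasonable simplification. The Unicode-aware variant may be worth considering if brand names with accented characters turn out to matter for classification.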


In [14]:
# Remove words containing a number (brand numbers, product id, etc.)
df["name"] = df["name"].apply(lambda x: ' '.join([word for word in x.split() if not re.search(r'\d', word)]))
df[["name"]]

Unnamed: 0,name
0,Caprese ZOLA women s Satchel YELLOW
1,Fastrack Brown Leather Men s Wallet
2,TEKZIE Butterfly Colourful Combo Set Watch Girls Women
3,Sorellaz Womens Rose Gold Open Branch Ring
4,Astroghar Evil Eye Pendant Protection Prosperity Men Women
...,...
77179,Cleo Khadim s Synthetic PVC Sole Blue Decorative Sandal Women
77180,Walky Wear Dashing Bellies Ballet Flat Belly Womens Girl s
77181,BEREALSlate Black Sandal Women
77182,Stride girls Anok Floaters
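
Whether punctuation is deleted or replaced with a space changes what the digit filter can remove. The sketch below compares the two on the product code from row 3 and on a made-up hyphenated name. The helper `clean` and its `replacement` parameter are hypothetical and exist only for this comparison.

```python
import re


def clean(text, replacement):
    """Strip punctuation using `replacement`, then drop tokens containing digits."""
    no_punct = re.sub(r"[^a-zA-Z0-9\s]", replacement, text)
    return " ".join(w for w in no_punct.split() if not re.search(r"\d", w))


sku = "Open Branch Ring: SR/FAJEWLK21-L80/1"
print(clean(sku, ""))   # Open Branch Ring
print(clean(sku, " "))  # Open Branch Ring SR

hyphenated = "Anti-Slip Yoga Mat"  # made-up name, not from the dataset
print(clean(hyphenated, ""))   # AntiSlip Yoga Mat
print(clean(hyphenated, " "))  # Anti Slip Yoga Mat
```

Deleting punctuation, as the notebook does, glues a product code into one token so the digit filter removes it whole, at the cost of fusing hyphenated words. Replacing punctuation with a space keeps words separate but can leave code fragments such as "SR" behind.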


In [15]:
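# Remove single-character tokens (e.g. the "s" left over from "'s"),
# then collapse the extra whitespace they leave behind
df["name"] = df["name"].str.replace(r"\b\w\b", "", regex=True)
df["name"] = df["name"].str.replace(r"\s+", " ", regex=True).str.strip()
df[["name"]]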

Unnamed: 0,name
0,Caprese ZOLA women Satchel YELLOW
1,Fastrack Brown Leather Men Wallet
2,TEKZIE Butterfly Colourful Combo Set Watch Girls Women
3,Sorellaz Womens Rose Gold Open Branch Ring
4,Astroghar Evil Eye Pendant Protection Prosperity Men Women
...,...
77179,Cleo Khadim Synthetic PVC Sole Blue Decorative Sandal Women
77180,Walky Wear Dashing Bellies Ballet Flat Belly Womens Girl
77181,BEREALSlate Black Sandal Women
77182,Stride girls Anok Floaters
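
Deleting a token with `\b\w\b` removes the character but not the spaces on either side of it, and rendered tables tend to hide repeated spaces. A minimal check on a plain string, using `repr` to make the whitespace visible:

```python
import re

name = "Caprese ZOLA women s Satchel YELLOW"

# Pattern from the notebook: delete any word character that stands alone.
stripped = re.sub(r"\b\w\b", "", name)
print(repr(stripped))    # 'Caprese ZOLA women  Satchel YELLOW'

# Collapse runs of whitespace and trim the ends.
normalized = re.sub(r"\s+", " ", stripped).strip()
print(repr(normalized))  # 'Caprese ZOLA women Satchel YELLOW'
```

Tokenizers that split on arbitrary whitespace are unaffected by the double space, but exact string comparisons and deduplication are not, so normalizing before saving is the safer default.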


In [16]:
# Subset the data to columns of interest
df = df[["name", "main_category"]]

In [17]:
df

Unnamed: 0,name,main_category
0,Caprese ZOLA women Satchel YELLOW,accessories
1,Fastrack Brown Leather Men Wallet,accessories
2,TEKZIE Butterfly Colourful Combo Set Watch Girls Women,accessories
3,Sorellaz Womens Rose Gold Open Branch Ring,accessories
4,Astroghar Evil Eye Pendant Protection Prosperity Men Women,accessories
...,...,...
77179,Cleo Khadim Synthetic PVC Sole Blue Decorative Sandal Women,women's shoes
77180,Walky Wear Dashing Bellies Ballet Flat Belly Womens Girl,women's shoes
77181,BEREALSlate Black Sandal Women,women's shoes
77182,Stride girls Anok Floaters,women's shoes


In [18]:
df.to_csv("data/amazon_products_sampled_cleaned.csv", index = False)
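
A name made up entirely of stop words, punctuation, digits, or single characters is reduced to an empty string by the steps above, and an empty string written to CSV is read back by pandas as a missing value. The sketch below reproduces this with a two-row, in-memory stand-in for the cleaned data. The frame contents and variable names are illustrative assumptions, not taken from the dataset.

```python
import io

import pandas as pd

# Illustrative stand-in for the cleaned data; the second name is empty,
# as if every token had been removed during cleaning.
cleaned = pd.DataFrame(
    {
        "name": ["Fastrack Brown Leather Men Wallet", ""],
        "main_category": ["accessories", "accessories"],
    }
)

# Round-trip through CSV in memory instead of touching the file system.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty name comes back as NaN with read_csv's default settings.
n_missing = int(reloaded["name"].isna().sum())
print(n_missing)  # 1

# One option: drop such rows before feature extraction.
usable = reloaded.dropna(subset=["name"])
print(len(usable))  # 1
```

Running the same `isna().sum()` check on the saved file would show whether any names in the real dataset were emptied by cleaning.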