# <div style = 'background-color:skyblue'> <center> Data cleaning (Step 1) </center> </div>

---
### Project Overview
This project is divided into <b>three parts</b>, each focused on one main task and implemented in a <b>separate</b> Jupyter Notebook. Jupyter variables can be easily toggled in the menu above. To pass data between these notebooks, values are stored using `%store` and retrieved as required using `%store -r`. My approach to the challenge consisted of the following sub-tasks, in order:

### Sections
1. Data cleaning (looking for accuracy and consistency) -  [`clean_string_data.ipynb` (link for path here)](./clean_string_data.ipynb)
2. Filtering similarities using different algorithms on significant columns to identify potential duplicate entries - [ `filter_data_similarity.ipynb` (link for path here)](./filter_data_similarity.ipynb)
3. Merging the similar pairs retrieved at the previous step by creating groups and consolidating them into single enriched entities - [`group_duplicates_consolidate_groups.ipynb` (link for path here)](./group_duplicates_consolidate_groups.ipynb)
---

### Table of Contents

- [Analysing the dataset](#analysing-the-dataset)
- [Normalization (lowercasing + removing any characters that are not alphanumeric)](#normalization)
- [Tokenization (and normalization: removing stop words)](#tokenization)

### Analysing the dataset
To open the given dataset, which has a `.parquet` extension, I imported `pandas`. It handles this file format well and will also be useful later in the project for transforming matrices into data frames when analysing the concept of sparsity.

In [1]:
import pandas as pd

# A forward slash in the path works on Windows as well as on Linux and macOS
df = pd.read_parquet('data/veridion_product_deduplication_challenge.snappy.parquet')
df

Unnamed: 0,unspsc,root_domain,page_url,product_title,product_summary,product_name,product_identifier,brand,intended_industries,applicability,...,form,size,color,purity,energy_efficiency,pressure_rating,power_rating,quality_standards_and_certifications,miscellaneous_features,description
0,Sewing and stitchery and weaving equipment and...,studio-atcoat.com,https://studio-atcoat.com/1372696759/?idx=510,Glimakra Warping Board (8m),The Glimakra Warping Board is designed for use...,Warping Board,[],,[Textile],[use with floor looms],...,[],"[{'dimension': 'Length', 'qualitative': False,...",[],[],,[],[],[],[],"The ""Warping Board"" is designed for use with f..."
1,Electric alternating current AC motors,worm-gears.net,https://worm-gears.net/tag/worm-gear-box/,NMRV Worm Gearbox Motor,The NMRV Worm Gearbox Motor is a high-efficien...,Worm Gearbox Motor,[],,[Industrial],[industrial applications],...,[],[],"[{'original': 'Blue', 'simple': 'Blue'}, {'ori...",[],,[],"[{'qualitative': False, 'type': 'min', 'unit':...",[],"[Omnibearing installation, High radiation effi...","The ""Worm Gearbox Motor"" is a high-efficiency ..."
2,Vehicle trim and exterior covering,customcarcoverco.com,https://customcarcoverco.com/collections/vendo...,Nissan R33 GTR Car Cover,A custom car cover designed for the Nissan R33...,Car Cover,[],,[Automotive],[protecting vehicles from the elements],...,[],[],[],[],,[],[],[],"[Personalization with custom brand logos, grap...","The ""Car Cover"" is a custom-designed cover tai..."
3,Pipe connectors,plumbmaster.com,https://www.plumbmaster.com/search?q=wolverine...,Flexible Fittings,"Flexible fittings for plumbing applications, a...",Flexible Fittings,[],,[Plumbing],[plumbing installations],...,[],[],[],[],,[],[],[],"[allows for movement, flexibility in installat...","""Flexible Fittings"" are designed for plumbing ..."
4,Doors,sogno.in,http://www.sogno.in/product-detail-CST-HGD-331...,CST-HGD-33103 Hinged Closet Door,The CST-HGD-33103 Hinged Closet Door is a meti...,Hinged Closet Door,[],CST,"[Home Appliances, Construction]",[Closet Storage],...,[],[],[],[],,[],[],[],"[Italian craftsmanship, German engineering, Sm...","The ""Hinged Closet Door"" is a storage solution..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21941,Other,dsbridal.com,https://www.dsbridal.com/index.php/sale/veils....,15/16 Accessories,"Accessories designed for Sweet 15/16, availabl...",Accessories,[],,[Retail],[accessories for Sweet 15/16],...,[],"[{'dimension': 'Diameter', 'qualitative': Fals...",[],[],,[],[],[],[part of the sale collection],"""Accessories"" are designed for use with Sweet ..."
21942,Processed and synthetic rubber,50735-in.all.biz,https://50735-in.all.biz/group-goods,General Mechanical Rubber Goods,A category of rubber goods designed for genera...,Rubber Goods,[],,[Manufacturing],[],...,[],[],[],[],,[],[],[],[],"""Rubber Goods"" are designed for general mechan..."
21943,Fresh cut rose bouquets,lilyofthevalley.uk,https://www.lilyofthevalley.uk/product/luxurio...,Luxurious Rose Garden,The Luxurious Rose Garden is a stunning floral...,Floral Arrangement,[],Lily Of The Valley Florist,"[Retail, Gifts]","[gifting, decorative purposes]",...,[],"[{'dimension': 'Width', 'qualitative': False, ...",[],[],,[],[],[],[Product images available in various resolutions],"""The 'Floral Arrangement' offered by Lily Of T..."
21944,Vision correction or cosmetic eyewear and rela...,getcontactlensesonline.com.au,https://getcontactlensesonline.com.au/brand/al...,Dailies AquaComfort Plus Multifocal (30 Pack),A pack of 30 Dailies AquaComfort Plus Multifoc...,Multifocal Contact Lenses,[],Dailies,[Healthcare],[vision correction],...,[],[],[],[],,[],[],[],[],"""Multifocal Contact Lenses"" are designed for d..."


We can note that there are **21946** product entries, containing diverse columns such as: 
- `unspsc` (classification system meant to categorize products)
- `root_domain` (doesn't include the `www.` subdomain or `https://`/ `http://` protocols)
- `page_url` (the full path, contains `root_domain`)
- `product_title`
- `product_summary`
- `product_name` (less detailed than `product_title`) 

In addition to these key columns, there are **25** other columns that are less relevant (for example, `description` is essentially `product_name` and `product_summary` combined and slightly modified, which makes it redundant).

---
The `describe()` method summarizes numeric columns only. Here it shows that `manufacturing_year` (int32), the single numeric column, is `-1` in every row (mean `-1`, standard deviation `0`), which indicates a placeholder or corrupted data rather than real years.

In [2]:
df.describe()

Unnamed: 0,manufacturing_year
count,21946.0
mean,-1.0
std,0.0
min,-1.0
25%,-1.0
50%,-1.0
75%,-1.0
max,-1.0
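
A quick way to confirm that a numeric column holds nothing but a sentinel is to compare it against the suspected placeholder. The sketch below uses a small made-up frame rather than the real dataset, and it assumes `-1` is the placeholder, as the summary above suggests.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({"manufacturing_year": [-1, -1, -1]}, dtype="int32")

# True when every row holds the assumed placeholder value -1.
all_placeholder = bool((toy["manufacturing_year"] == -1).all())

# A column with a single distinct value carries no information for deduplication.
n_distinct = int(toy["manufacturing_year"].nunique())

print(all_placeholder, n_distinct)
```

If both checks hold on the real column, it can probably be ignored for the purpose of finding duplicates.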


The `info()` method complements this by listing the non-null count and data type of every column.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21946 entries, 0 to 21945
Data columns (total 31 columns):
 #   Column                                Non-Null Count  Dtype 
---  ------                                --------------  ----- 
 0   unspsc                                21946 non-null  object
 1   root_domain                           21946 non-null  object
 2   page_url                              21946 non-null  object
 3   product_title                         21946 non-null  object
 4   product_summary                       20761 non-null  object
 5   product_name                          21910 non-null  object
 6   product_identifier                    21946 non-null  object
 7   brand                                 7062 non-null   object
 8   intended_industries                   21946 non-null  object
 9   applicability                         21946 non-null  object
 10  eco_friendly                          969 non-null    object
 11  ethical_and_sustainability_p

Out of the **31** columns, `product_summary`, `product_name` and `description` are missing only a small share of their entries, whereas `brand`, `eco_friendly` and `energy_efficiency` are mostly empty and therefore almost irrelevant for our purpose of eliminating product duplicates. `brand` is an important attribute in principle, but `root_domain` and `page_url` should at least identify the merchants.

The remaining columns may appear to have no missing data, but that can be misleading: a value consisting only of whitespace still counts as non-null, so the non-null counts can overstate how much information these columns actually hold.
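
To make the point about "non-null but empty" values concrete, here is a minimal sketch on a made-up column (the values are illustrative, not taken from the dataset). It compares the plain non-null count with a count that ignores blank strings.

```python
import pandas as pd

# Toy column: a real value, an empty string, a whitespace-only string, and a true null.
s = pd.Series(["glimakra", "", "   ", None])

# notna() counts "" and "   " as present.
non_null = int(s.notna().sum())

# Flag values that are strings but contain nothing except whitespace.
blank = s.map(lambda v: isinstance(v, str) and v.strip() == "")

# Values that are neither null nor blank.
effectively_present = int((s.notna() & ~blank).sum())

print(non_null, effectively_present)
```

The gap between the two counts is the number of entries that look filled in to `info()` but carry no usable text.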

### Normalization
<i><b>(lowercasing + removing any characters that are not alphanumeric)</b></i>

#### Idea⭐
The conclusion is that there is no reason to call `df.dropna()` on the sparsely populated columns, since they are not relevant here. Instead, I should normalize the string values: convert all letters to lowercase, remove spaces at the beginning or end of a value (or spaces that are not followed by any other character), and remove conjunctions.

---

Here is some code to further investigate how the non-null values should be treated.
> As mentioned above, converting to lowercase and eliminating stray spaces is a priority

A copy of the original `df` is created and used for the rest of the project (`cleaned_df` has its own memory allocation thanks to the `copy()` method, so changes to it cannot affect the values stored in `df`). I then select the columns with the `object` data type (all except `manufacturing_year`) and iterate over them. For each column, just like in an **SQL** query, the number of null values is counted to gauge how relevant the column is.

Afterwards, the preliminary cleaning consists of:
> `fillna('')`: replaces all missing values in the column with an empty string <br>
> `fillna('')` is called before `str.lower()` so that every value is a string when it is lowercased, since null values have no lowercase version <br>
> `r'^\s*$'` matches strings that are completely empty or contain only whitespace characters; these matches are replaced with null values (`np.nan` from the **numpy** module) so that blank entries are consistently marked as missing
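
The three steps can be followed one at a time on a small made-up column. This is only a sketch; the names `step1`, `step2` and `step3` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy column: two real values, a whitespace-only string, and a missing value.
s = pd.Series(["Warping Board", "   ", None, "CST"])

step1 = s.fillna("")                                   # missing value -> ""
step2 = step1.str.lower()                              # lowercase every string
step3 = step2.replace(r"^\s*$", np.nan, regex=True)    # "" and "   " -> NaN

print(step3.tolist())
```

After the last step, the blank string and the originally missing value are both represented by the same null marker.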


In [4]:
import numpy as np

cleaned_df = df.copy()
string_columns = cleaned_df.select_dtypes(include=['object']).columns

for col in string_columns:
    if cleaned_df[col].isnull().sum() > 0:
        print(f'Column {col} has {cleaned_df[col].isnull().sum()} missing values')
    
    cleaned_df[col] = cleaned_df[col].fillna('').str.lower()
    cleaned_df[col] = cleaned_df[col].replace(r'^\s*$', np.nan, regex=True)
cleaned_df

Column product_summary has 1185 missing values
Column product_name has 36 missing values
Column brand has 14884 missing values
Column eco_friendly has 20977 missing values
Column energy_efficiency has 21769 missing values
Column description has 1171 missing values


Unnamed: 0,unspsc,root_domain,page_url,product_title,product_summary,product_name,product_identifier,brand,intended_industries,applicability,...,form,size,color,purity,energy_efficiency,pressure_rating,power_rating,quality_standards_and_certifications,miscellaneous_features,description
0,sewing and stitchery and weaving equipment and...,studio-atcoat.com,https://studio-atcoat.com/1372696759/?idx=510,glimakra warping board (8m),the glimakra warping board is designed for use...,warping board,,,,,...,,,,,,,,,,"the ""warping board"" is designed for use with f..."
1,electric alternating current ac motors,worm-gears.net,https://worm-gears.net/tag/worm-gear-box/,nmrv worm gearbox motor,the nmrv worm gearbox motor is a high-efficien...,worm gearbox motor,,,,,...,,,,,,,,,,"the ""worm gearbox motor"" is a high-efficiency ..."
2,vehicle trim and exterior covering,customcarcoverco.com,https://customcarcoverco.com/collections/vendo...,nissan r33 gtr car cover,a custom car cover designed for the nissan r33...,car cover,,,,,...,,,,,,,,,,"the ""car cover"" is a custom-designed cover tai..."
3,pipe connectors,plumbmaster.com,https://www.plumbmaster.com/search?q=wolverine...,flexible fittings,"flexible fittings for plumbing applications, a...",flexible fittings,,,,,...,,,,,,,,,,"""flexible fittings"" are designed for plumbing ..."
4,doors,sogno.in,http://www.sogno.in/product-detail-cst-hgd-331...,cst-hgd-33103 hinged closet door,the cst-hgd-33103 hinged closet door is a meti...,hinged closet door,,cst,,,...,,,,,,,,,,"the ""hinged closet door"" is a storage solution..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21941,other,dsbridal.com,https://www.dsbridal.com/index.php/sale/veils....,15/16 accessories,"accessories designed for sweet 15/16, availabl...",accessories,,,,,...,,,,,,,,,,"""accessories"" are designed for use with sweet ..."
21942,processed and synthetic rubber,50735-in.all.biz,https://50735-in.all.biz/group-goods,general mechanical rubber goods,a category of rubber goods designed for genera...,rubber goods,,,,,...,,,,,,,,,,"""rubber goods"" are designed for general mechan..."
21943,fresh cut rose bouquets,lilyofthevalley.uk,https://www.lilyofthevalley.uk/product/luxurio...,luxurious rose garden,the luxurious rose garden is a stunning floral...,floral arrangement,,lily of the valley florist,,,...,,,,,,,,,,"""the 'floral arrangement' offered by lily of t..."
21944,vision correction or cosmetic eyewear and rela...,getcontactlensesonline.com.au,https://getcontactlensesonline.com.au/brand/al...,dailies aquacomfort plus multifocal (30 pack),a pack of 30 dailies aquacomfort plus multifoc...,multifocal contact lenses,,dailies,,,...,,,,,,,,,,"""multifocal contact lenses"" are designed for d..."


In [5]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21946 entries, 0 to 21945
Data columns (total 31 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   unspsc                                21946 non-null  object 
 1   root_domain                           21946 non-null  object 
 2   page_url                              21946 non-null  object 
 3   product_title                         21946 non-null  object 
 4   product_summary                       20761 non-null  object 
 5   product_name                          21910 non-null  object 
 6   product_identifier                    0 non-null      float64
 7   brand                                 7062 non-null   object 
 8   intended_industries                   0 non-null      float64
 9   applicability                         0 non-null      float64
 10  eco_friendly                          0 non-null      float64
 11  ethical_and_sus

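> Interesting difference between data types in `df` and `cleaned_df` <br>
I assume the regex replacement caused some **object** columns to become **float64**. A contributing factor is probably `.str.lower()`: it returns NaN for values that are not strings (such as the list-valued entries visible in the first table), which would explain why these columns now show 0 non-null values.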
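
The behaviour of `.str.lower()` on non-string values can be reproduced on toy data. The sketch below is an illustration under the assumption that the affected columns hold lists; `lower_if_str` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd

# Toy object column mixing a string, a list and a missing value.
s = pd.Series(["Blue", ["Textile"], None], dtype="object")

# The .str accessor yields NaN for the list, because it is not a string.
via_accessor = s.str.lower()

def lower_if_str(value):
    """Hypothetical helper: lowercase strings, leave every other value untouched."""
    return value.lower() if isinstance(value, str) else value

# Element-wise mapping keeps the list instead of discarding it.
via_helper = s.map(lower_if_str)

print(via_accessor.isna().tolist())
print(via_helper.tolist())
```

Applying a helper like this only to the columns that matter would keep list-valued columns intact, at the cost of leaving their contents in the original case.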

---

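Since `.duplicated()` can be restricted to a subset of columns that I find relevant, it is straightforward to identify exactly **16** rows that share the same values in those columns. With `keep=False`, every member of a duplicate group is counted, not just the repeats.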

In [6]:
duplicates = cleaned_df[cleaned_df.duplicated(subset=['unspsc', 'product_title', 'page_url', 'root_domain'], keep=False)].sort_values(by='product_title')
print(f'Total duplicate rows based on the 4 main identifiers: {duplicates.shape[0]}')
duplicates

Total duplicate rows based on the 4 main identifiers: 16


Unnamed: 0,unspsc,root_domain,page_url,product_title,product_summary,product_name,product_identifier,brand,intended_industries,applicability,...,form,size,color,purity,energy_efficiency,pressure_rating,power_rating,quality_standards_and_certifications,miscellaneous_features,description
6266,domestic kitchen tools and utensils,millargb.com,https://millargb.com/en/,accessories,a selection of kitchen accessories to compleme...,kitchen accessories,,,,,...,,,,,,,,,,"""kitchen accessories"" are a selection of kitch..."
9604,domestic kitchen tools and utensils,millargb.com,https://millargb.com/en/,accessories,a category of kitchen accessories to complemen...,kitchen accessories,,,,,...,,,,,,,,,,"""kitchen accessories"" are designed to compleme..."
701,personal paper products,babysoft.co.za,https://www.babysoft.co.za/,baby soft® fresh moist toilet tissue,a fresh moist toilet tissue product from baby ...,fresh moist toilet tissue,,baby soft®,,,...,,,,,,,,,,"""baby soft fresh moist toilet tissue"" is a toi..."
6816,personal paper products,babysoft.co.za,https://www.babysoft.co.za/,baby soft® fresh moist toilet tissue,a fresh moist toilet tissue product that offer...,fresh moist toilet tissue,,baby soft,,,...,,,,,,,,,,"""baby soft fresh moist toilet tissue"" is a fre..."
3103,automation control connectivity devices,trombetta.co,https://trombetta.co/,canopen devices,dc power products designed for a variety of ap...,canopen devices,,,,,...,,,,,,,,,,"""canopen devices"" are dc power products design..."
5069,automation control connectivity devices,trombetta.co,https://trombetta.co/,canopen devices,trombetta offers a full range of products that...,canopen devices,,trombetta,,,...,,,,,,,,,,"""canopen devices"" manufactured by trombetta ar..."
6285,infant foods and beverages,kendalnutricare.com,https://kendalnutricare.com/,kendamil baby milk,kendamil baby milk is a world-class nutrition ...,baby milk,,kendamil,,,...,,,,,,,,,,"""baby milk"" is a baby formula produced in kend..."
6384,infant foods and beverages,kendalnutricare.com,https://kendalnutricare.com/,kendamil baby milk,"kendamil baby milk is a british-made product, ...",baby milk,,kendal nutricare,,,...,,,,,,,,,,"""baby milk"" is a british-made product crafted ..."
384,agricultural machinery for harvesting,fxforagerparts.co.uk,http://www.fxforagerparts.co.uk/,new holland forage harvester parts for nh fx s...,new holland forage harvester parts for nh fx s...,forage harvester parts,,new holland,,,...,,,,,,,,,,"""forage harvester parts"" are essential mainten..."
4028,agricultural machinery for harvesting,fxforagerparts.co.uk,http://www.fxforagerparts.co.uk/,new holland forage harvester parts for nh fx s...,new holland forage harvester parts for nh fx s...,forage harvester parts,,new holland,,,...,,,,,,,,,,"""forage harvester parts"" are essential mainten..."


In the next cell I tried dropping the repeated matches, since the first entry of each duplicate group among these **16** rows could remain in the data frame. The code is left commented out.

In [7]:
# cleaned_df.drop_duplicates(subset=['unspsc', 'product_title', 'page_url', 'root_domain'], keep='first', inplace=True)
# print(f'Number of cleaned rows: {df.shape[0]-cleaned_df.shape[0]}')
# cleaned_df
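
The difference between `keep=False` and `keep='first'` is easy to miss, so here is a minimal sketch on a made-up frame. The column values are invented; only the column names echo the dataset.

```python
import pandas as pd

# Toy frame: one pair of duplicates, one unique row, one group of three duplicates.
toy = pd.DataFrame({
    "page_url": ["a.com", "a.com", "b.com", "c.com", "c.com", "c.com"],
    "product_title": ["x", "x", "y", "z", "z", "z"],
})
keys = ["page_url", "product_title"]

# keep=False flags every row that belongs to a duplicate group.
all_members = int(toy.duplicated(subset=keys, keep=False).sum())

# keep='first' flags only the rows after the first one in each group.
extra_copies = int(toy.duplicated(subset=keys, keep="first").sum())

# drop_duplicates with keep='first' removes exactly those extra copies.
deduped = toy.drop_duplicates(subset=keys, keep="first")

print(all_members, extra_copies, len(deduped))
```

So the number reported with `keep=False` is the size of all duplicate groups together, while the number of rows that would actually be dropped is smaller.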

---

The purpose of cleaning the data in `cleaned_df` is to remove all anomalies, specifically characters that are not letters, digits or spaces. To achieve this, a regex pattern replaces every character outside the allowed set (**[a-zA-Z0-9 ]**) with an empty string. Before modifying anything, the next cells do the opposite and remove the allowed characters, in order to inspect which unwanted characters actually occur.

In [8]:
cleaned_df['product_title'].str.replace("[a-zA-Z0-9 ]", "", regex=True).unique()

array(['()', '', '--', ..., '––/', '()--/', '|石斑鱼'],
      shape=(1686,), dtype=object)

In [9]:
subset_columns = ['product_name', 'product_title', 'product_summary']

# Work on a copy: these cells only display the unwanted characters, so cleaned_df is not modified yet
aux_df = cleaned_df[subset_columns].copy()
for col in subset_columns:
    aux_df[col] = aux_df[col].str.replace("[a-zA-Z0-9 ]", "", regex=True)
aux_df

Unnamed: 0,product_name,product_title,product_summary
0,,(),",.,원."
1,,,"-..,.,.,.,.,.,.,.,.,.,.,,.,.,,.,,,,,,,,,,,,,,,..."
2,,,".,,."
3,,,",."
4,,--,"--,.,-.--,,.,.,.,.,,,.',."
...,...,...,...
21941,,/,"/,/"".."
21942,,,",."
21943,,,",:.£.,£...,.-.,.,,,.,-.,."
21944,,(),",.,.$.."


In [10]:
for col in subset_columns:
    print(f"Unique values in '{col}': {aux_df[col].unique()}")

Unique values in 'product_name': ['' '-' '---' '--' '/' '-#-' ',-' '/"' '-/' '&' ',' "'" '.' ',,,,' nan
 '-..-' '°' '--/' '–' ',&' '+' '®®' ',,,' '.-' '..' "'-" '™' '&-' "-''"
 '()' '&-&' '/-' "-'-" '----' 'สวสัสวัสวัส' ',()--' ',,' 'é' '"' '°---'
 '|-' '-/"-' ':' '-----' '()/()-' '()–' '---.-' '--.' '’' '..-' '®-' '%'
 '/()' ',"' '-,' 'สวสัสวัส' '-()' '®™' '-(),,&' '----//' ',---' '"-' '//'
 '(---)' '-,,,-' '&&' '-..--' '-(+),-' '++' '--...' '®' "'--" '-&' '.:/'
 '--;' '(-©)' '°-' '------' '/.' '_/-' '...' ',°-' 'µ()' '®’' '--™'
 "--'---" '-/"(±")' '_+_' '&&//.&' '-,&' '--*-' '[()--]' '#-' '-°---' '."'
 '-*' '--++' '.%' "'&" '-+' ',...:' '&/' '+.' '.(.)' '&.' '(-)' '()-'
 '×–-' ',,,,,,' '(/)' '---(-----)' '″()' '+×' '-,-' '////' '///' '硫压片' 'ó'
 '-,,(:)()' 'ô.' '’-' '--()' '-.' 'â' ',ó,,' ':,' '-#' '/--' 'â-é' 'ö'
 '&()' "''" ',-()' '£' '!!' '--+' '.....' 'ابراتلداتنامداتنامداتنامداتن'
 '--[]' '-./' '-,,' ',--' '&’' '--!' '.---.' '(,)' "+''/////" '-ó' './'
 ',,&' '|' '’&' '-α--' '-(‐-

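When `^` is used inside square brackets (`[^a-zA-Z0-9 ]`), it negates the character class, so the same logic can be applied to `cleaned_df` to remove the unwanted characters instead of the allowed ones. Note that removed characters are not replaced by a space, so hyphenated or slash-separated tokens are joined together (in the output below, `high-efficiency` becomes `highefficiency` and `15/16` becomes `1516`).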
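
The effect of the negated class can be checked on a single string with the standard `re` module. The title below is made up for illustration. The last line shows a variant that substitutes a space instead of an empty string; that variant is an assumption about what one might prefer, not what the notebook does.

```python
import re

# Illustrative title, not a row from the dataset.
title = "cst-hgd-33103 hinged closet door (8m)"

# Without ^: the allowed characters are removed, leaving only the unwanted ones.
allowed_removed = re.sub(r"[a-zA-Z0-9 ]", "", title)

# With ^: the unwanted characters are removed, leaving letters, digits and spaces.
unwanted_removed = re.sub(r"[^a-zA-Z0-9 ]", "", title)

# Variant: replace unwanted characters with a space, then collapse repeated spaces.
spaced = re.sub(r"\s+", " ", re.sub(r"[^a-zA-Z0-9 ]", " ", title)).strip()

print(allowed_removed)
print(unwanted_removed)
print(spaced)
```

The variant keeps `cst`, `hgd` and `33103` as separate tokens, which may or may not be desirable depending on how product codes should be compared.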

In [11]:
for col in subset_columns:
    cleaned_df[col] = cleaned_df[col].str.replace("[^a-zA-Z0-9 ]", "", regex=True)
cleaned_df

Unnamed: 0,unspsc,root_domain,page_url,product_title,product_summary,product_name,product_identifier,brand,intended_industries,applicability,...,form,size,color,purity,energy_efficiency,pressure_rating,power_rating,quality_standards_and_certifications,miscellaneous_features,description
0,sewing and stitchery and weaving equipment and...,studio-atcoat.com,https://studio-atcoat.com/1372696759/?idx=510,glimakra warping board 8m,the glimakra warping board is designed for use...,warping board,,,,,...,,,,,,,,,,"the ""warping board"" is designed for use with f..."
1,electric alternating current ac motors,worm-gears.net,https://worm-gears.net/tag/worm-gear-box/,nmrv worm gearbox motor,the nmrv worm gearbox motor is a highefficienc...,worm gearbox motor,,,,,...,,,,,,,,,,"the ""worm gearbox motor"" is a high-efficiency ..."
2,vehicle trim and exterior covering,customcarcoverco.com,https://customcarcoverco.com/collections/vendo...,nissan r33 gtr car cover,a custom car cover designed for the nissan r33...,car cover,,,,,...,,,,,,,,,,"the ""car cover"" is a custom-designed cover tai..."
3,pipe connectors,plumbmaster.com,https://www.plumbmaster.com/search?q=wolverine...,flexible fittings,flexible fittings for plumbing applications al...,flexible fittings,,,,,...,,,,,,,,,,"""flexible fittings"" are designed for plumbing ..."
4,doors,sogno.in,http://www.sogno.in/product-detail-cst-hgd-331...,csthgd33103 hinged closet door,the csthgd33103 hinged closet door is a meticu...,hinged closet door,,cst,,,...,,,,,,,,,,"the ""hinged closet door"" is a storage solution..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21941,other,dsbridal.com,https://www.dsbridal.com/index.php/sale/veils....,1516 accessories,accessories designed for sweet 1516 available ...,accessories,,,,,...,,,,,,,,,,"""accessories"" are designed for use with sweet ..."
21942,processed and synthetic rubber,50735-in.all.biz,https://50735-in.all.biz/group-goods,general mechanical rubber goods,a category of rubber goods designed for genera...,rubber goods,,,,,...,,,,,,,,,,"""rubber goods"" are designed for general mechan..."
21943,fresh cut rose bouquets,lilyofthevalley.uk,https://www.lilyofthevalley.uk/product/luxurio...,luxurious rose garden,the luxurious rose garden is a stunning floral...,floral arrangement,,lily of the valley florist,,,...,,,,,,,,,,"""the 'floral arrangement' offered by lily of t..."
21944,vision correction or cosmetic eyewear and rela...,getcontactlensesonline.com.au,https://getcontactlensesonline.com.au/brand/al...,dailies aquacomfort plus multifocal 30 pack,a pack of 30 dailies aquacomfort plus multifoc...,multifocal contact lenses,,dailies,,,...,,,,,,,,,,"""multifocal contact lenses"" are designed for d..."


---

### Tokenization 
<i><b>(and normalization: removing stop words)</b></i>

The last part of the `clean_string_data.ipynb` notebook focuses on tokenizing the text in the `product_title` and `product_name` columns. It also includes a step to remove stop words, which I have manually curated into a small set. 

Later in the process, I found out about `TfidfVectorizer`, which converts sentences into numerical vectors based on weighted word frequencies and performs its own tokenization. It can also remove stop words: setting the `stop_words` parameter to `'english'` drops a large set of common English words that contribute little to the meaning of the text and might even distort the similarity computed between sentences.

Therefore, I decided not to tackle the problem in this primitive manner in the [`deduplication_script.py` (link for path here)](./deduplication_script.py) file containing the final version of the project.

In [12]:
def tokenize(text):
    """Split a string on whitespace; return an empty list for non-string values such as NaN."""
    if isinstance(text, str):
        return text.split()
    return []

cleaned_df['product_name_tokens'] = cleaned_df['product_name'].apply(tokenize)
cleaned_df['product_title_tokens'] = cleaned_df['product_title'].apply(tokenize)
cleaned_df

Unnamed: 0,unspsc,root_domain,page_url,product_title,product_summary,product_name,product_identifier,brand,intended_industries,applicability,...,color,purity,energy_efficiency,pressure_rating,power_rating,quality_standards_and_certifications,miscellaneous_features,description,product_name_tokens,product_title_tokens
0,sewing and stitchery and weaving equipment and...,studio-atcoat.com,https://studio-atcoat.com/1372696759/?idx=510,glimakra warping board 8m,the glimakra warping board is designed for use...,warping board,,,,,...,,,,,,,,"the ""warping board"" is designed for use with f...","[warping, board]","[glimakra, warping, board, 8m]"
1,electric alternating current ac motors,worm-gears.net,https://worm-gears.net/tag/worm-gear-box/,nmrv worm gearbox motor,the nmrv worm gearbox motor is a highefficienc...,worm gearbox motor,,,,,...,,,,,,,,"the ""worm gearbox motor"" is a high-efficiency ...","[worm, gearbox, motor]","[nmrv, worm, gearbox, motor]"
2,vehicle trim and exterior covering,customcarcoverco.com,https://customcarcoverco.com/collections/vendo...,nissan r33 gtr car cover,a custom car cover designed for the nissan r33...,car cover,,,,,...,,,,,,,,"the ""car cover"" is a custom-designed cover tai...","[car, cover]","[nissan, r33, gtr, car, cover]"
3,pipe connectors,plumbmaster.com,https://www.plumbmaster.com/search?q=wolverine...,flexible fittings,flexible fittings for plumbing applications al...,flexible fittings,,,,,...,,,,,,,,"""flexible fittings"" are designed for plumbing ...","[flexible, fittings]","[flexible, fittings]"
4,doors,sogno.in,http://www.sogno.in/product-detail-cst-hgd-331...,csthgd33103 hinged closet door,the csthgd33103 hinged closet door is a meticu...,hinged closet door,,cst,,,...,,,,,,,,"the ""hinged closet door"" is a storage solution...","[hinged, closet, door]","[csthgd33103, hinged, closet, door]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21941,other,dsbridal.com,https://www.dsbridal.com/index.php/sale/veils....,1516 accessories,accessories designed for sweet 1516 available ...,accessories,,,,,...,,,,,,,,"""accessories"" are designed for use with sweet ...",[accessories],"[1516, accessories]"
21942,processed and synthetic rubber,50735-in.all.biz,https://50735-in.all.biz/group-goods,general mechanical rubber goods,a category of rubber goods designed for genera...,rubber goods,,,,,...,,,,,,,,"""rubber goods"" are designed for general mechan...","[rubber, goods]","[general, mechanical, rubber, goods]"
21943,fresh cut rose bouquets,lilyofthevalley.uk,https://www.lilyofthevalley.uk/product/luxurio...,luxurious rose garden,the luxurious rose garden is a stunning floral...,floral arrangement,,lily of the valley florist,,,...,,,,,,,,"""the 'floral arrangement' offered by lily of t...","[floral, arrangement]","[luxurious, rose, garden]"
21944,vision correction or cosmetic eyewear and rela...,getcontactlensesonline.com.au,https://getcontactlensesonline.com.au/brand/al...,dailies aquacomfort plus multifocal 30 pack,a pack of 30 dailies aquacomfort plus multifoc...,multifocal contact lenses,,dailies,,,...,,,,,,,,"""multifocal contact lenses"" are designed for d...","[multifocal, contact, lenses]","[dailies, aquacomfort, plus, multifocal, 30, p..."


In [13]:
stop_words = {'product', 'products', 'make', 'makes', 'use', 'allowing', 'allows', 'available', 'a', 'an', 'and', 'is', 'the', 'of', 'for', 'plus', 'with', 'to', 'in', 'on', 'at', 'by', 'as', 'from', 'that', 'this', 'these', 'those', 'or', 'but', 'not', 'over', 'under', 'above', 'below', 'between', 'among', 'through', 'into', 'onto', 'after', 'before', 'since', 'during', 'while', 'if', 'then', 'else', 'when', 'where', 'why', 'how', 'all', 'any', 'each', 'every', 'other', 'another', 'such', 'own', 'same', 'different', 'more', 'less', 'few', 'many', 'most', 'some', 'several', 'fewer'}

def remove_stop_words(text):
    if not isinstance(text, str):
        return text 
    tokens = text.split()
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    return ' '.join(filtered_tokens)

cleaned_df['product_summary_stop_words_removed'] = cleaned_df['product_summary'].apply(remove_stop_words)
cleaned_df

Unnamed: 0,unspsc,root_domain,page_url,product_title,product_summary,product_name,product_identifier,brand,intended_industries,applicability,...,purity,energy_efficiency,pressure_rating,power_rating,quality_standards_and_certifications,miscellaneous_features,description,product_name_tokens,product_title_tokens,product_summary_stop_words_removed
0,sewing and stitchery and weaving equipment and...,studio-atcoat.com,https://studio-atcoat.com/1372696759/?idx=510,glimakra warping board 8m,the glimakra warping board is designed for use...,warping board,,,,,...,,,,,,,"the ""warping board"" is designed for use with f...","[warping, board]","[glimakra, warping, board, 8m]",glimakra warping board designed floor looms pr...
1,electric alternating current ac motors,worm-gears.net,https://worm-gears.net/tag/worm-gear-box/,nmrv worm gearbox motor,the nmrv worm gearbox motor is a highefficienc...,worm gearbox motor,,,,,...,,,,,,,"the ""worm gearbox motor"" is a high-efficiency ...","[worm, gearbox, motor]","[nmrv, worm, gearbox, motor]",nmrv worm gearbox motor highefficiency gear bo...
2,vehicle trim and exterior covering,customcarcoverco.com,https://customcarcoverco.com/collections/vendo...,nissan r33 gtr car cover,a custom car cover designed for the nissan r33...,car cover,,,,,...,,,,,,,"the ""car cover"" is a custom-designed cover tai...","[car, cover]","[nissan, r33, gtr, car, cover]",custom car cover designed nissan r33 gtr model...
3,pipe connectors,plumbmaster.com,https://www.plumbmaster.com/search?q=wolverine...,flexible fittings,flexible fittings for plumbing applications al...,flexible fittings,,,,,...,,,,,,,"""flexible fittings"" are designed for plumbing ...","[flexible, fittings]","[flexible, fittings]",flexible fittings plumbing applications moveme...
4,doors,sogno.in,http://www.sogno.in/product-detail-cst-hgd-331...,csthgd33103 hinged closet door,the csthgd33103 hinged closet door is a meticu...,hinged closet door,,cst,,,...,,,,,,,"the ""hinged closet door"" is a storage solution...","[hinged, closet, door]","[csthgd33103, hinged, closet, door]",csthgd33103 hinged closet door meticulously de...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21941,other,dsbridal.com,https://www.dsbridal.com/index.php/sale/veils....,1516 accessories,accessories designed for sweet 1516 available ...,accessories,,,,,...,,,,,,,"""accessories"" are designed for use with sweet ...",[accessories],"[1516, accessories]",accessories designed sweet 1516 various sizes ...
21942,processed and synthetic rubber,50735-in.all.biz,https://50735-in.all.biz/group-goods,general mechanical rubber goods,a category of rubber goods designed for genera...,rubber goods,,,,,...,,,,,,,"""rubber goods"" are designed for general mechan...","[rubber, goods]","[general, mechanical, rubber, goods]",category rubber goods designed general mechani...
21943,fresh cut rose bouquets,lilyofthevalley.uk,https://www.lilyofthevalley.uk/product/luxurio...,luxurious rose garden,the luxurious rose garden is a stunning floral...,floral arrangement,,lily of the valley florist,,,...,,,,,,,"""the 'floral arrangement' offered by lily of t...","[floral, arrangement]","[luxurious, rose, garden]",luxurious rose garden stunning floral arrangem...
21944,vision correction or cosmetic eyewear and rela...,getcontactlensesonline.com.au,https://getcontactlensesonline.com.au/brand/al...,dailies aquacomfort plus multifocal 30 pack,a pack of 30 dailies aquacomfort plus multifoc...,multifocal contact lenses,,dailies,,,...,,,,,,,"""multifocal contact lenses"" are designed for d...","[multifocal, contact, lenses]","[dailies, aquacomfort, plus, multifocal, 30, p...",pack 30 dailies aquacomfort multifocal contact...


In [14]:
%store cleaned_df

Stored 'cleaned_df' (DataFrame)
