# <div style = 'background-color:orange'> <center> Group duplicates and consolidate groups (Step 3) </center> </div>

---
### Sections
1. Data cleaning (looking for accuracy and consistency) -  [`clean_string_data.ipynb` (link for path here)](./clean_string_data.ipynb)
2. Filtering similarities using different algorithms on significant columns to identify potential duplicate entries - [`filter_data_similarity.ipynb` (link for path here)](./filter_data_similarity.ipynb)
3. Merging the similar pairs retrieved at the previous step by creating groups and consolidating them into single enriched entities - [`group_duplicates_consolidate_groups.ipynb` (link for path here)](./group_duplicates_consolidate_groups.ipynb)
---

This part groups the matched pairs into batches of duplicates and then combines all the data in each group into one enriched entity.

In [13]:
%store -r validated_duplicates
%store -r check_duplicates
%store -r cleaned_df

---
### Grouping using DFS

As a first approach, I represent the duplicate relationships as an undirected graph and retrieve its connected components with a depth-first search (DFS). A key aspect is that this handles indirect duplicates through transitivity: if A matches B and B matches C, all three end up in the same group even though A and C were never paired directly.

In [14]:
from collections import defaultdict

graph = defaultdict(list)
for i, j, _, _, _ in validated_duplicates:
    graph[i].append(j)
    graph[j].append(i)

def find_connected_components(graph):
    visited = set()
    components = []
    for node in graph:
        if node not in visited:
            stack = [node]
            component = []
            while stack:
                current = stack.pop()
                if current not in visited:
                    visited.add(current)
                    component.append(current)
                    stack.extend(graph[current])
            components.append(component)
    return components

duplicate_groups = find_connected_components(graph)
duplicate_groups = sorted(duplicate_groups, key=len, reverse=True)
duplicate_groups

[[99,
  11529,
  11450,
  10330,
  9079,
  8981,
  8861,
  8695,
  8687,
  8566,
  8551,
  7674,
  7622,
  7576,
  7502,
  7411,
  6988,
  6984,
  6120,
  5753,
  5342,
  4570,
  4464,
  3678,
  3539,
  3362,
  3237,
  2950,
  2557,
  2552,
  1950,
  1756,
  1396,
  734,
  477,
  362,
  286,
  232],
 [482,
  11527,
  11373,
  10445,
  10221,
  9954,
  9368,
  8245,
  7802,
  7521,
  7513,
  7161,
  6195,
  5855,
  5809,
  5608,
  5298,
  4994,
  4500,
  3731,
  3140,
  2457,
  2182,
  1521,
  870,
  826,
  811,
  736],
 [137,
  11591,
  10371,
  9918,
  9347,
  9274,
  9199,
  9013,
  8654,
  8282,
  8173,
  7081,
  7017,
  6889,
  6870,
  6579,
  6218,
  6137,
  6089,
  4436,
  3268,
  3259,
  2692,
  2644,
  1538,
  1035,
  841],
 [1551,
  11289,
  10472,
  10402,
  8548,
  8160,
  7516,
  7073,
  6862,
  6756,
  6340,
  6319,
  5930,
  5779,
  5422,
  5093,
  5083,
  2996,
  2559,
  2219],
 [128,
  11665,
  10640,
  8357,
  8302,
  6247,
  6056,
  5563,
  5548,
  5055,
  4461,
  415

In [23]:
len(duplicate_groups)

1151
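As a cross-check on the DFS grouping, the same connected components can be computed with a union-find (disjoint-set) structure. The sketch below is not part of the original pipeline; `group_pairs` and `toy_pairs` are hypothetical names, and the toy input stands in for the `(i, j)` part of each entry in `validated_duplicates`.

```python
def group_pairs(pairs):
    """Group items linked by (a, b) pairs using union-find with path compression."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the walked path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    # Largest groups first, mirroring the sort used for duplicate_groups.
    return sorted((sorted(g) for g in groups.values()), key=len, reverse=True)


# 1-2 and 2-3 are direct matches; 1-3 is only an indirect (transitive) match.
toy_pairs = [(1, 2), (2, 3), (7, 8)]
print(group_pairs(toy_pairs))  # [[1, 2, 3], [7, 8]]
```

On the real data, something like `group_pairs((i, j) for i, j, *_ in validated_duplicates)` should yield the same number of groups as the DFS version, although the order of members inside each group may differ.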

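Out of curiosity, I also checked whether adding the pairs from the `root_domain` + `product_summary` approach (stored in `check_duplicates`) would change much. Given the very few pairs that approach returned earlier, the effect is small: 3 more groups were found (1154 instead of 1151).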

In [24]:
unique_pairs = set()

for pair in validated_duplicates:
    unique_pairs.add(frozenset((pair[0], pair[1])))

for pair in check_duplicates:
    unique_pairs.add(frozenset((pair[0], pair[1])))

final_duplicates = [tuple(pair) for pair in unique_pairs]
graph_url = defaultdict(list)
for i, j in final_duplicates:
    graph_url[i].append(j)
    graph_url[j].append(i)

duplicate_groups_url = find_connected_components(graph_url)
duplicate_groups_url = sorted(duplicate_groups_url, key=len, reverse=True)

print("Number of duplicate groups:", len(duplicate_groups_url))
print("Duplicate Groups:")
for group in duplicate_groups_url:
    print(group)

Number of duplicate groups: 1154
Duplicate Groups:
[99, 3362, 3539, 3237, 10330, 7622, 232, 286, 7411, 7502, 1396, 2552, 4570, 11529, 8861, 362, 1756, 9079, 8551, 6988, 3678, 8566, 5342, 8687, 8695, 4464, 5753, 6984, 6120, 2950, 7674, 477, 2557, 7576, 11450, 734, 1950, 8981]
[5298, 870, 482, 7521, 4994, 10221, 3731, 5809, 1521, 9954, 6195, 7513, 11527, 7802, 10445, 11373, 9368, 5608, 811, 4500, 2182, 8245, 2457, 3140, 7161, 826, 736, 5855]
[3259, 1035, 6137, 6218, 6089, 8282, 7017, 9347, 1538, 841, 137, 9918, 9274, 9199, 2644, 6870, 6889, 10371, 8173, 11591, 7081, 6579, 2692, 3268, 8654, 9013, 4436]
[2996, 2219, 5083, 5422, 6340, 8548, 5930, 6319, 6862, 7516, 11289, 7073, 8160, 5093, 5779, 1551, 2559, 10472, 10402, 6756]
[1784, 4461, 4155, 8357, 6247, 5563, 8302, 1522, 6056, 1226, 3570, 5055, 11665, 10640, 128, 3499, 3715, 5548, 1960]
[5304, 4280, 11612, 4281, 3800, 10315, 4501, 6585, 4715, 4283, 11762, 1012, 8468, 8019, 9588, 11532]
[2786, 11142, 6343, 1158, 5921, 4934, 1291, 2468, 11
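One detail of the merge above is worth illustrating: `frozenset((i, j))` makes a pair order-insensitive, but a self-pair such as `(4, 4)` collapses to a single element, and unpacking it as `i, j` would then fail. The sketch below is an alternative under that assumption, not the notebook's own code; `normalise_pairs` and the toy lists are hypothetical.

```python
def normalise_pairs(*pair_lists):
    """Merge pair lists into sorted, order-insensitive (i, j) tuples, dropping self-pairs."""
    unique = set()
    for pairs in pair_lists:
        for pair in pairs:
            i, j = pair[0], pair[1]  # extra fields (e.g. scores) are ignored
            if i != j:
                unique.add((min(i, j), max(i, j)))
    return sorted(unique)


validated = [(1, 2, 0.9), (2, 3, 0.8)]   # toy stand-in for validated_duplicates
checked = [(2, 1), (4, 4), (5, 6)]       # toy stand-in for check_duplicates
print(normalise_pairs(validated, checked))  # [(1, 2), (2, 3), (5, 6)]
```

Sorting the result also makes the order of the pairs reproducible between runs, which iterating over a set of frozensets does not guarantee.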

> The number of groups is not yet the number of entries that have to be dropped from the data frame. The following cells compute it: first the total number of rows that belong to any group, then the number of rows removed when one entry is kept per group.

In [25]:
aux_no_duplicates = 0
for group in duplicate_groups_url:
    if len(group) > 1:
        aux_no_duplicates += len(group)
aux_no_duplicates

3574

In [18]:
no_duplicates = 0
for group in duplicate_groups_url:
    if len(group) > 1:
        no_duplicates += len(group[1:])
no_duplicates

2420
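The two counts above are related by simple arithmetic: keeping one entry per group removes `rows in groups - number of groups` rows. A minimal sketch, with `duplicate_counts` and `toy_groups` as hypothetical names, followed by a check against the figures reported in this notebook:

```python
def duplicate_counts(groups):
    """Return (rows inside groups, rows removed when one entry is kept per group)."""
    rows_in_groups = sum(len(g) for g in groups)
    rows_removed = rows_in_groups - len(groups)
    return rows_in_groups, rows_removed


toy_groups = [[1, 2, 3], [7, 8]]
print(duplicate_counts(toy_groups))  # (5, 3)

# Figures from the cells above: 3574 rows spread over 1154 groups.
assert 3574 - 1154 == 2420
```

This relies on every group having at least two members, which holds here because groups are built from pairs.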

In [19]:
all_duplicate_indices = set()
for group in duplicate_groups_url:
    all_duplicate_indices.update(group)

non_duplicate_indices = set(cleaned_df.index) - all_duplicate_indices
print(all_duplicate_indices, non_duplicate_indices, sep="\n")

# Sort the labels so the row order does not depend on set iteration order.
non_duplicates_df = cleaned_df.loc[sorted(non_duplicate_indices)]
non_duplicates_df

{3, 8196, 8, 8208, 8209, 16401, 8210, 8218, 26, 8220, 8221, 8222, 32, 35, 38, 8232, 8236, 45, 8239, 48, 8242, 8244, 8245, 8246, 8249, 8251, 60, 8253, 62, 61, 59, 65, 8258, 67, 8260, 8261, 70, 68, 8264, 8266, 8268, 8269, 79, 8273, 83, 8276, 84, 86, 88, 8282, 8287, 16480, 97, 99, 8293, 8294, 104, 8297, 106, 8300, 109, 8302, 8301, 112, 8306, 16500, 117, 119, 8311, 8312, 8314, 16504, 124, 125, 8318, 8316, 128, 130, 16515, 8322, 8325, 8324, 137, 138, 139, 8332, 141, 144, 16529, 8336, 147, 148, 149, 150, 8343, 152, 8344, 151, 155, 8348, 16538, 8349, 8351, 159, 8353, 8354, 161, 157, 8357, 167, 169, 170, 171, 8364, 172, 8366, 8369, 8370, 179, 177, 8373, 182, 183, 184, 8374, 8381, 16573, 192, 8385, 197, 8393, 201, 8394, 8396, 205, 16588, 207, 209, 211, 8404, 213, 215, 216, 8408, 218, 219, 220, 222, 8415, 224, 226, 227, 8421, 231, 232, 233, 8424, 238, 8430, 239, 243, 8438, 8442, 252, 8445, 8447, 8450, 8453, 261, 8455, 262, 8457, 265, 268, 8462, 271, 8464, 8465, 8466, 8468, 8476, 286, 8479, 287, 

Unnamed: 0,unspsc,root_domain,page_url,product_title,product_summary,product_name,product_identifier,brand,intended_industries,applicability,...,energy_efficiency,pressure_rating,power_rating,quality_standards_and_certifications,miscellaneous_features,description,product_name_tokens,product_title_tokens,product_summary_stop_words_removed,key_identifiers
0,sewing and stitchery and weaving equipment and...,studio-atcoat,https://studio-atcoat.com/1372696759/?idx=510,glimakra warping board 8m,the glimakra warping board is designed for use...,warping board,,,,,...,,,,,,"the ""warping board"" is designed for use with f...","[warping, board]","[glimakra, warping, board, 8m]",glimakra warping board designed floor looms pr...,"[8m, glimakra, warping]"
1,electric alternating current ac motors,worm-gears,https://worm-gears.net/tag/worm-gear-box/,nmrv worm gearbox motor,the nmrv worm gearbox motor is a highefficienc...,worm gearbox motor,,,,,...,,,,,,"the ""worm gearbox motor"" is a high-efficiency ...","[worm, gearbox, motor]","[nmrv, worm, gearbox, motor]",nmrv worm gearbox motor highefficiency gear bo...,[nmrv]
2,vehicle trim and exterior covering,customcarcoverco,https://customcarcoverco.com/collections/vendo...,nissan r33 gtr car cover,a custom car cover designed for the nissan r33...,car cover,,,,,...,,,,,,"the ""car cover"" is a custom-designed cover tai...","[car, cover]","[nissan, r33, gtr, car, cover]",custom car cover designed nissan r33 gtr model...,"[gtr, r33]"
4,doors,sogno,http://www.sogno.in/product-detail-cst-hgd-331...,csthgd33103 hinged closet door,the csthgd33103 hinged closet door is a meticu...,hinged closet door,,cst,,,...,,,,,,"the ""hinged closet door"" is a storage solution...","[hinged, closet, door]","[csthgd33103, hinged, closet, door]",csthgd33103 hinged closet door meticulously de...,"[closet, csthgd33103]"
5,faucets or taps,plumbmaster,https://www.plumbmaster.com/search?q=wolverine...,deep faucets,faucets with a deep design providing a secure ...,deep faucets,,,,,...,,,,,,"""deep faucets"" are designed with a deep design...","[deep, faucets]","[deep, faucets]",faucets deep design providing secure stable co...,"[deep, faucets]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21941,other,dsbridal,https://www.dsbridal.com/index.php/sale/veils....,1516 accessories,accessories designed for sweet 1516 available ...,accessories,,,,,...,,,,,,"""accessories"" are designed for use with sweet ...",[accessories],"[1516, accessories]",accessories designed sweet 1516 various sizes ...,"[1516, accessories]"
21942,processed and synthetic rubber,50735-in,https://50735-in.all.biz/group-goods,general mechanical rubber goods,a category of rubber goods designed for genera...,rubber goods,,,,,...,,,,,,"""rubber goods"" are designed for general mechan...","[rubber, goods]","[general, mechanical, rubber, goods]",category rubber goods designed general mechani...,[goods]
21943,fresh cut rose bouquets,lilyofthevalley,https://www.lilyofthevalley.uk/product/luxurio...,luxurious rose garden,the luxurious rose garden is a stunning floral...,floral arrangement,,lily of the valley florist,,,...,,,,,,"""the 'floral arrangement' offered by lily of t...","[floral, arrangement]","[luxurious, rose, garden]",luxurious rose garden stunning floral arrangem...,"[garden, luxurious, rose]"
21944,vision correction or cosmetic eyewear and rela...,getcontactlensesonline,https://getcontactlensesonline.com.au/brand/al...,dailies aquacomfort plus multifocal 30 pack,a pack of 30 dailies aquacomfort plus multifoc...,multifocal contact lenses,,dailies,,,...,,,,,,"""multifocal contact lenses"" are designed for d...","[multifocal, contact, lenses]","[dailies, aquacomfort, plus, multifocal, 30, p...",pack 30 dailies aquacomfort multifocal contact...,[]


`non_duplicates_df` now holds every row whose index does not appear in any group of duplicates. Next, I create one enriched entity per group, collect these entities in `representatives_df`, and concatenate it with `non_duplicates_df` before writing the result to a parquet output file.

In [20]:
import pandas as pd

enriched_entries = []
# Replace missing text with empty strings so that joining and mode() work for every group.
cleaned_df['product_summary'] = cleaned_df['product_summary'].fillna('')
cleaned_df['description'] = cleaned_df['description'].fillna('')

for group in duplicate_groups_url:
    if len(group) > 1:
        group_rows = cleaned_df.loc[group]

        enriched_entry = {
            # Single-valued fields: keep the most frequent value in the group.
            # Note: mode()[0] raises if the column is entirely NaN within the group.
            'unspsc': group_rows['unspsc'].mode()[0],
            'product_title': group_rows['product_title'].mode()[0],
            'product_name': group_rows['product_name'].mode()[0],
            # Summary: concatenate every distinct summary found in the group.
            'product_summary': ' '.join(group_rows['product_summary'].dropna().unique()),
            'root_domain': group_rows['root_domain'].mode()[0],
            'page_url': group_rows['page_url'].mode()[0],
            'description': group_rows['description'].mode()[0],
            # Token columns: union of the token lists of all rows in the group.
            'product_name_tokens': list(set([keyword for sublist in group_rows['product_name_tokens'] for keyword in sublist])),
            'product_title_tokens': list(set([keyword for sublist in group_rows['product_title_tokens'] for keyword in sublist])),
            'key_identifiers': list(set([keyword for sublist in group_rows['key_identifiers'] for keyword in sublist]))
        }
        enriched_entries.append(enriched_entry)

representatives_df = pd.DataFrame(enriched_entries)
representatives_df

Unnamed: 0,unspsc,product_title,product_name,product_summary,root_domain,page_url,description,product_name_tokens,product_title_tokens,key_identifiers
0,string instruments,c6 fr deluxe guitar,guitar,a guitar model c6 fr deluxe available in vario...,schecterguitars,https://www.schecterguitars.com/guitars/6-stri...,"the ""guitar"" is a 6-string guitar available in...",[guitar],"[deluxe, c6, fr, guitar]","[deluxe, c6, fr]"
1,string instruments,avenger exotic guitar,guitar,a guitar model named avenger exotic available ...,schecterguitars,https://www.schecterguitars.com/guitars/6-stri...,"the ""guitar"" is a 6-string guitar designed for...",[guitar],"[lh, avenger, guitar, exotic]","[avenger, exotic]"
2,string instruments,sunset6 extreme lh guitar,guitar,a guitar model sunset6 extreme available in va...,schecterguitars,https://www.schecterguitars.com/guitars/6-stri...,"the ""guitar"" is a left-handed electric guitar ...",[guitar],"[sunset6, extreme, lh, guitar]","[sunset6, extreme]"
3,vehicle trim and exterior covering,custom car cover company aston martin vehicle ...,vehicle cover,a custom vehicle cover specifically designed f...,customcarcoverco,https://customcarcoverco.com/collections/vendo...,"""the 'vehicle cover' is a tailored cover desig...","[cover, vehicle]","[martin, cars, alfa, mazda, car, mitsubishi, c...",[cover]
4,dental materials,mono implants,dental implants,mono implants are a type of dental implant off...,shopshatkinfirst,https://shopshatkinfirst.com/collections/all/i...,"""dental implants"" are advanced course implants...","[implants, dental]","[mono, implants]","[mono, implants]"
...,...,...,...,...,...,...,...,...,...,...
1149,medication dispensing and measuring devices an...,preinked c stamp,preinked c stamp,the preinked c stamp is a specialized tool des...,apothecaryproducts,https://shop.apothecaryproducts.com/products/p...,"the ""pre-inked 'c' stamp"" manufactured by apot...","[c, stamp, preinked]","[c, stamp, preinked]","[stamp, preinked]"
1150,actuators,hardware,hardware,a product manufactured by krishna machine tool...,hayleywindows,https://krishnamtc.com/about-us,"""hardware"" manufactured by krishna machine too...",[hardware],[hardware],[hardware]
1151,pipe nipples,nipple,nipple,a variety of nipples designed for plumbing app...,plumbmaster,https://www.plumbmaster.com/search?q=wolverine...,"""nipple"" is designed for plumbing applications...",[nipple],[nipple],[nipple]
1152,aluminum based alloys,aenodised,aenodised,aenodised is a product category offered by ali...,alifbapk,https://alifbapk.com/index.php/product-categor...,"""aenodised"" by alif-ba aluminium export qualit...",[aenodised],[aenodised],[aenodised]
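The consolidation cell has two fragile spots: `mode()[0]` fails when a column is entirely missing within a group, and the token comprehensions fail on a cell that is not a list. The helpers below are a hedged, library-free sketch of more defensive versions; `is_missing`, `safe_mode` and `union_tokens` are hypothetical names, not part of the notebook. They accept any iterable, so a pandas column could be passed in directly.

```python
from collections import Counter


def is_missing(value):
    """True for None and for NaN (NaN is the only value that differs from itself)."""
    return value is None or value != value


def safe_mode(values, default=None):
    """Most frequent non-missing value, or `default` when nothing is left."""
    counts = Counter(v for v in values if not is_missing(v))
    return counts.most_common(1)[0][0] if counts else default


def union_tokens(cells):
    """Sorted union of all token lists, skipping cells that are not collections."""
    tokens = set()
    for cell in cells:
        if isinstance(cell, (list, tuple, set)):
            tokens.update(cell)
    return sorted(tokens)


domains = ["plumbmaster", float("nan"), "plumbmaster", "other"]
print(safe_mode(domains))                                       # plumbmaster
print(safe_mode([None, float("nan")]))                          # None
print(union_tokens([["deep", "faucets"], None, ["faucets"]]))   # ['deep', 'faucets']
```

Two caveats. On ties, `safe_mode` returns the value seen first, whereas pandas `mode()[0]` returns the smallest one, so results can differ. Also, taking the mode of each column independently can combine values from different rows; row 1150 above, where `root_domain` is `hayleywindows` but `page_url` points to `krishnamtc.com`, may be an instance of this. Token cells stored as NumPy arrays would be skipped by `union_tokens` as written.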


In [21]:
deduplicated_df = pd.concat([non_duplicates_df, representatives_df], ignore_index=True)
deduplicated_df

Unnamed: 0,unspsc,root_domain,page_url,product_title,product_summary,product_name,product_identifier,brand,intended_industries,applicability,...,energy_efficiency,pressure_rating,power_rating,quality_standards_and_certifications,miscellaneous_features,description,product_name_tokens,product_title_tokens,product_summary_stop_words_removed,key_identifiers
0,sewing and stitchery and weaving equipment and...,studio-atcoat,https://studio-atcoat.com/1372696759/?idx=510,glimakra warping board 8m,the glimakra warping board is designed for use...,warping board,,,,,...,,,,,,"the ""warping board"" is designed for use with f...","[warping, board]","[glimakra, warping, board, 8m]",glimakra warping board designed floor looms pr...,"[8m, glimakra, warping]"
1,electric alternating current ac motors,worm-gears,https://worm-gears.net/tag/worm-gear-box/,nmrv worm gearbox motor,the nmrv worm gearbox motor is a highefficienc...,worm gearbox motor,,,,,...,,,,,,"the ""worm gearbox motor"" is a high-efficiency ...","[worm, gearbox, motor]","[nmrv, worm, gearbox, motor]",nmrv worm gearbox motor highefficiency gear bo...,[nmrv]
2,vehicle trim and exterior covering,customcarcoverco,https://customcarcoverco.com/collections/vendo...,nissan r33 gtr car cover,a custom car cover designed for the nissan r33...,car cover,,,,,...,,,,,,"the ""car cover"" is a custom-designed cover tai...","[car, cover]","[nissan, r33, gtr, car, cover]",custom car cover designed nissan r33 gtr model...,"[gtr, r33]"
3,doors,sogno,http://www.sogno.in/product-detail-cst-hgd-331...,csthgd33103 hinged closet door,the csthgd33103 hinged closet door is a meticu...,hinged closet door,,cst,,,...,,,,,,"the ""hinged closet door"" is a storage solution...","[hinged, closet, door]","[csthgd33103, hinged, closet, door]",csthgd33103 hinged closet door meticulously de...,"[closet, csthgd33103]"
4,faucets or taps,plumbmaster,https://www.plumbmaster.com/search?q=wolverine...,deep faucets,faucets with a deep design providing a secure ...,deep faucets,,,,,...,,,,,,"""deep faucets"" are designed with a deep design...","[deep, faucets]","[deep, faucets]",faucets deep design providing secure stable co...,"[deep, faucets]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19521,medication dispensing and measuring devices an...,apothecaryproducts,https://shop.apothecaryproducts.com/products/p...,preinked c stamp,the preinked c stamp is a specialized tool des...,preinked c stamp,,,,,...,,,,,,"the ""pre-inked 'c' stamp"" manufactured by apot...","[c, stamp, preinked]","[c, stamp, preinked]",,"[stamp, preinked]"
19522,actuators,hayleywindows,https://krishnamtc.com/about-us,hardware,a product manufactured by krishna machine tool...,hardware,,,,,...,,,,,,"""hardware"" manufactured by krishna machine too...",[hardware],[hardware],,[hardware]
19523,pipe nipples,plumbmaster,https://www.plumbmaster.com/search?q=wolverine...,nipple,a variety of nipples designed for plumbing app...,nipple,,,,,...,,,,,,"""nipple"" is designed for plumbing applications...",[nipple],[nipple],,[nipple]
19524,aluminum based alloys,alifbapk,https://alifbapk.com/index.php/product-categor...,aenodised,aenodised is a product category offered by ali...,aenodised,,,,,...,,,,,,"""aenodised"" by alif-ba aluminium export qualit...",[aenodised],[aenodised],,[aenodised]
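The size of `deduplicated_df` can be reconciled with the earlier counts. The sketch below assumes that `cleaned_df` has a contiguous 0-based index, so that the last displayed label 21944 implies 21945 rows; that row count is an inference, not a figure printed in the notebook. `expected_rows` is a hypothetical helper.

```python
def expected_rows(total_rows, rows_in_groups, n_groups):
    """Rows left after replacing every duplicate group with one enriched entity."""
    return total_rows - rows_in_groups + n_groups


# 21945 rows assumed in cleaned_df, 3574 rows in groups, 1154 groups.
print(expected_rows(21945, 3574, 1154))  # 19525
assert expected_rows(21945, 3574, 1154) == 19525
```

The result matches the last index shown above (19524, with `ignore_index=True`), which suggests no rows were lost or duplicated during the concatenation.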


In [22]:
# A forward slash works on Windows as well as on Linux and macOS.
deduplicated_df.to_parquet('data/deduplicated_products.parquet', engine='pyarrow')