## Preparing the dataset

Following the instructions, you should have checked out the project and started Jupyter Notebook in the parent folder:
```
uni-sofia-entity-linking-magellan    <= "jupyter notebook" started here
 |- dataset
 |- notebooks
    |- entity_match_electronics.ipynb
```
 

In [4]:
# Extract the dataset CSV files (amazon.csv, best_buy.csv) from the zip archive
# -o overwrites existing files without prompting; -d sets the target directory
!unzip -o ../dataset/dataset_electronics_ID_7.zip -d ../dataset/

Archive:  ../dataset/dataset_electronics_ID_7.zip
  inflating: ../dataset/amazon.csv   
  inflating: ../dataset/best_buy.csv  


In [5]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

In [6]:
# Read the CSV files and set 'ID' as the key attribute
A = em.read_csv_metadata("../dataset/amazon.csv", key='ID')
B = em.read_csv_metadata("../dataset/best_buy.csv", key='ID')

Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.
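
Setting `key='ID'` only makes sense if `ID` identifies each row. As far as we understand, *py_entitymatching* expects a key attribute to be unique and free of missing values. The following is a minimal pure-Python sketch of that check on a hypothetical three-row sample (the helper `is_valid_key` and the sample data are illustrative, not part of the library):

```python
import csv
import io

# Hypothetical three-row sample; the real tables have thousands of rows.
SAMPLE_CSV = """ID,Brand,Name
1,Asus,Asus 11.6 Laptop Intel Atom 2GB Memory 32GB Flash Storage Blue
2,HP,HP 15.6 TouchScreen Laptop Intel Core i3 6GB Memory 750GB Hard Drive
3,Asus,Asus 2in1 13.3 TouchScreen Laptop Intel Core i5 6GB Memory 1TB Hard Drive
"""


def is_valid_key(rows, key):
    """Return True if every row has a non-empty, unique value for `key`."""
    values = [row.get(key) for row in rows]
    if any(value is None or value == "" for value in values):
        return False
    return len(set(values)) == len(values)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(is_valid_key(rows, "ID"))     # True: 1, 2, 3 are all distinct
print(is_valid_key(rows, "Brand"))  # False: "Asus" appears twice
```

A column such as `Brand` fails the check because its values repeat, which is why it can serve as a blocking attribute but not as a key.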


Let's have a look at the loaded data frames. Notice that `A.Original_Price` can be null.

In [7]:
A.head(3)

Unnamed: 0,ID,Brand,Name,Amazon_Price,Original_Price,Features
0,1,Asus,"ASUS X205TA 11.6 Inch Laptop (Intel Atom, 2 GB, 32GB SSD, Gold) - Free Upgrade to Windows 10",$199.00,,Intel Atom 1.33 GHz Processor. 2 GB DDR3 RAM. 32GB SSD Storage; No Optical Drive. 11.6 inches 13...
1,2,Other,AmazonBasics 11.6-Inch Laptop Sleeve,$9.99,,Form-fitting sleeve with quick top-loading access for Chromebooks and MacBook Air laptops. Preci...
2,3,Lenovo,Lenovo G50 Entertainment Laptop - Black: DOORBUSTER - Intel Core i7-5500U (2.4GHz / 3.0 GHz Turb...,$799.77,$999.99,"5th Generation Intel Core i7-5500U Processor (2.4 GHz Turbo / 3.0 GHz Base, 1600MHz 4MB). 15.6\ ..."


In [8]:
B.head(3)

Unnamed: 0,ID,Brand,Name,Price,Description,Features
0,1,Asus,Asus 11.6 Laptop Intel Atom 2GB Memory 32GB Flash Storage Blue X205TA-SATM0404G,$189.99,"11.6&#34; Laptop - Intel Atom - 2GB Memory - 32GB Flash Storage, Read customer reviews and buy o...","Microsoft Windows 8.1 operating system preinstalled,Intel?? Atom??? processor Z3735F,2GB DDR3L m..."
1,2,HP,HP 15.6 TouchScreen Laptop Intel Core i3 6GB Memory 750GB Hard Drive Black 15-r264dx,$379.99,"15.6&#34; Touch-Screen Laptop - Intel Core i3 - 6GB Memory - 750GB Hard Drive, Read customer rev...","Microsoft Windows 8.1 operating system preinstalled,5th Gen Intel?? Core??? i3-5010U processor,I..."
2,3,Asus,Asus 2in1 13.3 TouchScreen Laptop Intel Core i5 6GB Memory 1TB Hard Drive Black Q302LA-BBI5T19,$749.99,"2-in-1 13.3&#34; Touch-Screen Laptop - Intel Core i5 - 6GB Memory - 1TB Hard Drive, Read custome...","Microsoft Windows 10 operating system,13.3 TFT-LCD touch screen for hands-on control,5th Gen Int..."


In [12]:
print(f"len(A): {len(A)}")
print(f"len(B): {len(B)}")
print(f"len(A) * len(B): {len(A) * len(B)}")

len(A): 4259
len(B): 5001
len(A) * len(B): 21299259


# Block Tables and Make Set of Candidates

The cross product of `A` and `B` contains 21,299,259 pairs, which is far too many to compare one by one. Most of these pairs are obviously non-matching, so the next step is to discard them cheaply. This process is called blocking tables `A` and `B`. We use two of the blocking mechanisms provided by *py_entitymatching*:
 - attribute equivalence
 - overlap

Two listings of the same electronics product should have the same `Brand`, so we keep every pair whose brands are equal. Because the brand may contain an error or a typo, we also keep pairs whose `Name` values share enough tokens, using the overlap blocker. Here is the blocking plan:

In [None]:
# Blocking plan

# A, B -- AttrEquivalence blocker [Brand]-------------|
#                                                     |---> candidate set
# A, B -- Overlap blocker [Name]----------------------|

In [13]:
# Create attribute equivalence blocker
ab = em.AttrEquivalenceBlocker()
# Block tables on the 'Brand' attribute: pairs with the same brand go into the candidate set
C1 = ab.block_tables(A, B, 'Brand', 'Brand', 
                     l_output_attrs=['Brand','Name','Amazon_Price','Original_Price','Features'],
                     r_output_attrs=['Brand','Name','Price','Description','Features']
                    )
len(C1)

4439971
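
To make the idea behind attribute equivalence blocking concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the function `attr_equivalence_block` and the tiny sample tables are hypothetical, and rows with a missing brand are simply skipped, which is an assumption of this sketch.

```python
from collections import defaultdict


def attr_equivalence_block(left, right, l_attr, r_attr):
    """Return (left_id, right_id) pairs whose blocking attributes are equal."""
    # Index the right table by attribute value so each left row
    # only looks at right rows with the same value.
    index = defaultdict(list)
    for r_row in right:
        if r_row[r_attr] is not None:
            index[r_row[r_attr]].append(r_row["ID"])

    pairs = []
    for l_row in left:
        if l_row[l_attr] is None:
            continue  # assumption: rows with a missing value are dropped
        for r_id in index.get(l_row[l_attr], []):
            pairs.append((l_row["ID"], r_id))
    return pairs


# Hypothetical samples modelled on the first rows of A and B.
sample_a = [
    {"ID": 1, "Brand": "Asus"},
    {"ID": 2, "Brand": "Other"},
    {"ID": 3, "Brand": "Lenovo"},
]
sample_b = [
    {"ID": 1, "Brand": "Asus"},
    {"ID": 2, "Brand": "HP"},
    {"ID": 3, "Brand": "Asus"},
]

brand_pairs = attr_equivalence_block(sample_a, sample_b, "Brand", "Brand")
print(brand_pairs)  # [(1, 1), (1, 3)]
```

Only the Asus row of the left sample finds partners, which mirrors the first two rows of `C1`, where `ltable_ID` 1 is paired with `rtable_ID` 1 and 3.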

In [15]:
C1.head(2)

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_Brand,ltable_Name,ltable_Amazon_Price,ltable_Original_Price,ltable_Features,rtable_Brand,rtable_Name,rtable_Price,rtable_Description,rtable_Features
0,0,1,1,Asus,"ASUS X205TA 11.6 Inch Laptop (Intel Atom, 2 GB, 32GB SSD, Gold) - Free Upgrade to Windows 10",$199.00,,Intel Atom 1.33 GHz Processor. 2 GB DDR3 RAM. 32GB SSD Storage; No Optical Drive. 11.6 inches 13...,Asus,Asus 11.6 Laptop Intel Atom 2GB Memory 32GB Flash Storage Blue X205TA-SATM0404G,$189.99,"11.6&#34; Laptop - Intel Atom - 2GB Memory - 32GB Flash Storage, Read customer reviews and buy o...","Microsoft Windows 8.1 operating system preinstalled,Intel?? Atom??? processor Z3735F,2GB DDR3L m..."
1,1,1,3,Asus,"ASUS X205TA 11.6 Inch Laptop (Intel Atom, 2 GB, 32GB SSD, Gold) - Free Upgrade to Windows 10",$199.00,,Intel Atom 1.33 GHz Processor. 2 GB DDR3 RAM. 32GB SSD Storage; No Optical Drive. 11.6 inches 13...,Asus,Asus 2in1 13.3 TouchScreen Laptop Intel Core i5 6GB Memory 1TB Hard Drive Black Q302LA-BBI5T19,$749.99,"2-in-1 13.3&#34; Touch-Screen Laptop - Intel Core i5 - 6GB Memory - 1TB Hard Drive, Read custome...","Microsoft Windows 10 operating system,13.3 TFT-LCD touch screen for hands-on control,5th Gen Int..."


In [19]:
# Initialize overlap blocker
ob = em.OverlapBlocker()
# Block on the 'Name' attribute: keep pairs whose names share at least 3 tokens
C2 = ob.block_tables(A, B, 'Name', 'Name', show_progress=True, overlap_size=3)
len(C2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  l_df[l_dummy_overlap_attr] = l_df[l_overlap_attr]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  r_df[r_dummy_overlap_attr] = r_df[r_overlap_attr]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  table[overlap_attr] = values

1246333
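
The overlap criterion can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's algorithm: the helper names are hypothetical, the tokenizer lowercases the text and strips punctuation before splitting on whitespace, and the nested loop is deliberately naive.

```python
import re


def name_tokens(text):
    """Lowercase, strip punctuation, and split on whitespace (sketch-only normalization)."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return set(cleaned.split())


def overlap_block(left, right, attr, overlap_size):
    """Return (left_id, right_id) pairs sharing at least `overlap_size` tokens in `attr`."""
    pairs = []
    for l_row in left:
        l_tokens = name_tokens(l_row[attr])
        for r_row in right:
            shared = l_tokens & name_tokens(r_row[attr])
            if len(shared) >= overlap_size:
                pairs.append((l_row["ID"], r_row["ID"]))
    return pairs


# Product names copied from the table previews in this notebook.
names_a = [
    {"ID": 1, "Name": "ASUS X205TA 11.6 Inch Laptop (Intel Atom, 2 GB, 32GB SSD, Gold) "
                      "- Free Upgrade to Windows 10"},
    {"ID": 2, "Name": "AmazonBasics 11.6-Inch Laptop Sleeve"},
]
names_b = [
    {"ID": 1, "Name": "Asus 11.6 Laptop Intel Atom 2GB Memory 32GB Flash Storage Blue "
                      "X205TA-SATM0404G"},
    {"ID": 2, "Name": "HP 15.6 TouchScreen Laptop Intel Core i3 6GB Memory 750GB Hard Drive "
                      "Black 15-r264dx"},
    {"ID": 9, "Name": "Samsung 11.6 Chromebook 2 Intel Celeron 2GB Memory 16GB Flash Memory "
                      "Silver XE500C12-K01US"},
]

name_pairs = overlap_block(names_a, names_b, "Name", overlap_size=3)
print(name_pairs)  # [(1, 1), (1, 9)]
```

With this normalization the Asus laptop from `A` shares exactly three tokens (`116`, `2`, `intel`) with the Samsung Chromebook from `B`, so the pair survives blocking even though the brands differ. That also shows how low an `overlap_size` of 3 is as a threshold.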

In [20]:
# Combine the outputs from attr. equivalence blocker and overlap blocker
C = em.combine_blocker_outputs_via_union([C1, C2])
len(C)

5314261
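
Conceptually, the union keeps every pair that at least one blocker produced and stores each pair only once. The sketch below shows this on hypothetical pair lists (the function `union_candidates` is illustrative, not a library call), and then applies inclusion-exclusion to the sizes reported above, assuming neither blocker output contains internal duplicates.

```python
def union_candidates(*candidate_sets):
    """Merge lists of (left_id, right_id) pairs, dropping duplicates, keeping first-seen order."""
    seen = set()
    combined = []
    for pairs in candidate_sets:
        for pair in pairs:
            if pair not in seen:
                seen.add(pair)
                combined.append(pair)
    return combined


# Hypothetical blocker outputs for a handful of rows.
c1_pairs = [(1, 1), (1, 3)]  # same Brand
c2_pairs = [(1, 1), (1, 9)]  # at least 3 shared Name tokens
c_pairs = union_candidates(c1_pairs, c2_pairs)
print(c_pairs)  # [(1, 1), (1, 3), (1, 9)]

# Sizes reported by the notebook: |C1|, |C2| and |C1 union C2|.
len_c1, len_c2, len_c = 4439971, 1246333, 5314261
shared_by_both = len_c1 + len_c2 - len_c
print(f"pairs produced by both blockers: {shared_by_both}")  # 372043
```

Under that assumption, about 372 thousand pairs were found by both blockers, so most of the overlap blocker's output consists of pairs that brand equality alone would have missed.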

In [21]:
C.head(3)

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_Brand,ltable_Name,ltable_Amazon_Price,ltable_Original_Price,ltable_Features,rtable_Brand,rtable_Name,rtable_Price,rtable_Description,rtable_Features
0,0,1,1,Asus,"ASUS X205TA 11.6 Inch Laptop (Intel Atom, 2 GB, 32GB SSD, Gold) - Free Upgrade to Windows 10",$199.00,,Intel Atom 1.33 GHz Processor. 2 GB DDR3 RAM. 32GB SSD Storage; No Optical Drive. 11.6 inches 13...,Asus,Asus 11.6 Laptop Intel Atom 2GB Memory 32GB Flash Storage Blue X205TA-SATM0404G,$189.99,"11.6&#34; Laptop - Intel Atom - 2GB Memory - 32GB Flash Storage, Read customer reviews and buy o...","Microsoft Windows 8.1 operating system preinstalled,Intel?? Atom??? processor Z3735F,2GB DDR3L m..."
1,1,1,3,Asus,"ASUS X205TA 11.6 Inch Laptop (Intel Atom, 2 GB, 32GB SSD, Gold) - Free Upgrade to Windows 10",$199.00,,Intel Atom 1.33 GHz Processor. 2 GB DDR3 RAM. 32GB SSD Storage; No Optical Drive. 11.6 inches 13...,Asus,Asus 2in1 13.3 TouchScreen Laptop Intel Core i5 6GB Memory 1TB Hard Drive Black Q302LA-BBI5T19,$749.99,"2-in-1 13.3&#34; Touch-Screen Laptop - Intel Core i5 - 6GB Memory - 1TB Hard Drive, Read custome...","Microsoft Windows 10 operating system,13.3 TFT-LCD touch screen for hands-on control,5th Gen Int..."
2,2,1,9,Asus,"ASUS X205TA 11.6 Inch Laptop (Intel Atom, 2 GB, 32GB SSD, Gold) - Free Upgrade to Windows 10",$199.00,,Intel Atom 1.33 GHz Processor. 2 GB DDR3 RAM. 32GB SSD Storage; No Optical Drive. 11.6 inches 13...,Samsung,Samsung 11.6 Chromebook 2 Intel Celeron 2GB Memory 16GB Flash Memory Silver XE500C12-K01US,$219.99,"11.6&#34; Chromebook 2 - Intel Celeron - 2GB Memory - 16GB Flash Memory, Read customer reviews a...","11.6 display,Intel?? Celeron?? processor N2840,2GB system memory,16GB eMMC flash memory,Built-in..."
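
To put the size of the combined candidate set `C` in perspective, one can compare it with the full cross product. The share of pairs that blocking removed is often called the reduction ratio; the short calculation below uses only the sizes printed earlier in this notebook.

```python
# Sizes reported by the notebook cells above.
n_a = 4259               # len(A)
n_b = 5001               # len(B)
n_candidates = 5314261   # len(C)

cross_product = n_a * n_b
reduction_ratio = 1 - n_candidates / cross_product

print(f"cross product:   {cross_product}")
print(f"candidate pairs: {n_candidates}")
print(f"reduction ratio: {reduction_ratio:.3f}")
```

Blocking removes roughly three quarters of all possible pairs, but more than five million candidates remain for the later matching steps.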
