<div style="text-align:center">
    <h2>Record Linkage: Exploratory Data Analysis</h2>
</div>


As noted in the project's README, the first two steps, "blocking" and "block processing," have already been completed, since the goal of this project is to perform entity resolution and clustering. In this data exploration we therefore focus on selecting the most useful column for the comparison step.

Import libraries

In [1]:
import pandas as pd
import numpy as np

First, we read the .csv files for retailers A and B.

In [2]:
retA = pd.read_csv('retailerA.csv')
retB = pd.read_csv('retailerB.csv')

**DataFrame Size**

We check the size of both dataframes.

In [13]:
print(f'[Retailer A] Rows: {retA.shape[0]} - Columns: {retA.shape[1]}')
print(f'[Retailer B] Rows: {retB.shape[0]} - Columns: {retB.shape[1]}')

[Retailer A] Rows: 1081 - Columns: 4
[Retailer B] Rows: 1092 - Columns: 4


Retailer B has 11 more rows than Retailer A (1092 vs. 1081).

**Data Type**

| Variable | Description | Datatype |
|----------|----------|----------|
| unique_id | Identifier | int64 |
| title | Name of the article | string |
| description | Description of the article | string |
| price | Price in dollars ($) | string |
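
As a rough check of the types listed above, the sketch below reads two made-up rows (hypothetical values, not taken from the retailer files) and shows why a price such as `$44.00` is loaded as a string rather than a number. The names `csv_text`, `sample`, and `price_num` are illustrative only.

```python
import io
import pandas as pd

# Two hypothetical rows with the same four columns as the retailer files.
csv_text = """unique_id,title,description,price
1,Example 8-Port Switch,Example 8-Port Switch with power adapter,$44.00
2,Example VGA Cable,Example VGA Cable 10ft,
"""
sample = pd.read_csv(io.StringIO(csv_text))

# unique_id is parsed as an integer; price keeps its "$" and stays text.
print(sample.dtypes)

# If a numeric price were ever needed, the "$" would have to be stripped first.
price_num = pd.to_numeric(sample['price'].str.replace('$', '', regex=False))
print(price_num)
```

The missing price in the second row becomes `NaN` after conversion, which is the same kind of gap counted in the missing-values section.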

**Data Inspection**

In [16]:
retA.head(5)

Unnamed: 0,unique_id,title,description,price
0,1,Linksys EtherFast 8-Port 10/100 Switch - EZXS88W,Linksys EtherFast 8-Port 10/100 Switch - EZXS8...,$44.00
1,2,Linksys EtherFast10/100 5-Port Auto-Sensing Sw...,Linksys EtherFast10/100 5-Port Auto-Sensing Sw...,$29.00
2,3,Netgear ProSafe 5 Port 10/100 Desktop Switch -...,Netgear ProSafe 5 Port 10/100 Desktop Switch -...,$40.00
3,4,Belkin F3H982-10 Pro Series High Integrity 10 ...,Belkin F3H982-10 Pro Series High Integrity 10 ...,
4,5,Netgear Prosafe 16 Port 10/100 Rackmount Switc...,Netgear Prosafe 16 Port 10/100 Rackmount Switc...,$131.00


In [17]:
retB.head(5)

Unnamed: 0,unique_id,title,description,price
0,1,Linksys EtherFast EZXS88W Ethernet Switch - EZ...,Linksys EtherFast 8-Port 10/100 Switch (New/Wo...,
1,2,Linksys EtherFast EZXS55W Ethernet Switch,5 x 10/100Base-TX LAN,
2,3,Netgear ProSafe FS105 Ethernet Switch - FS105NA,NETGEAR FS105 Prosafe 5 Port 10/100 Desktop Sw...,
3,4,Belkin Pro Series High Integrity VGA/SVGA Moni...,1 x HD-15 - 1 x HD-15 - 10ft - Beige,
4,5,Netgear ProSafe JFS516 Ethernet Switch,Netgear ProSafe 16 Port 10/100 Rackmount Switc...,


**Missing Values**

In [18]:
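# Count null values per column in retailer A ("vacios" = missing), largest first
vacios = pd.DataFrame(retA.isnull().sum()).sort_values(0,ascending=False)
vacios.columns = ['vacios']
# Share of missing values per column, as a percentage of all rows
vacios['vacios%'] = round(vacios['vacios']/retA.shape[0], 2)*100
vacios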

Unnamed: 0,vacios,vacios%
price,663,61.0
unique_id,0,0.0
title,0,0.0
description,0,0.0


In [19]:
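# Count null values per column in retailer B ("vacios" = missing), largest first
vacios = pd.DataFrame(retB.isnull().sum()).sort_values(0,ascending=False)
vacios.columns = ['vacios']
# Share of missing values per column, as a percentage of all rows
vacios['vacios%'] = round(vacios['vacios']/retB.shape[0], 2)*100
vacios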

Unnamed: 0,vacios,vacios%
price,502,46.0
description,446,41.0
unique_id,0,0.0
title,0,0.0
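
The two cells above repeat the same steps for each retailer. One possible way to avoid the duplication is a small helper; this is only a sketch, and the function name `missing_summary` and the toy frame are assumptions, not part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and null percentages per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'vacios': counts,
        'vacios%': (counts / len(df) * 100).round(1),
    })
    return summary.sort_values('vacios', ascending=False)

# Toy frame for illustration only (not the retailer data).
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'description': ['x', None, 'z', None],
    'price': [None, None, None, '$1.00'],
})
print(missing_summary(toy))
```

With the real data, the same function could be called as `missing_summary(retA)` and `missing_summary(retB)`. Note that it rounds the percentage to one decimal place, so the figures may differ slightly from the whole-number percentages shown above.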


**Conclusions**

- The price column is a poor basis for a similarity comparison: a monetary value does not describe a product, and the column is largely null (61% in retailer A, 46% in retailer B).
- The description column could be useful for comparison, but in retailer B it has too many null values (41%), so we discard it.
- In both dataframes, the title column is complete and contains the product names. On inspection, these names are quite descriptive, so we consider the title column useful for matching.
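
To illustrate why descriptive titles help with matching, the sketch below scores three titles taken from the `head()` outputs above using `difflib.SequenceMatcher` from the Python standard library. This is a stand-in chosen for illustration, not necessarily the similarity measure used later in the project, and the helper name `title_similarity` is an assumption.

```python
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two lowercased titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Titles copied from the head() outputs above (only untruncated ones).
a_title = "Linksys EtherFast 8-Port 10/100 Switch - EZXS88W"   # retailer A, row 0
b_linksys = "Linksys EtherFast EZXS55W Ethernet Switch"        # retailer B, row 1
b_netgear = "Netgear ProSafe JFS516 Ethernet Switch"           # retailer B, row 4

# The title sharing the brand and product line should score higher.
print(title_similarity(a_title, b_linksys))
print(title_similarity(a_title, b_netgear))
```

The Linksys pair scores higher than the Linksys/Netgear pair because the titles share the brand and product line. The two Linksys titles still name different models (EZXS88W and EZXS55W), so a high title score alone does not confirm a match.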