# Project Stage 3: Entity Matching (EM)

Problem Description:
We have extracted two tables, A and B, that share the same schema. We now perform entity matching on them with Magellan, an EM tool (imported below as py_entitymatching).

- Goal: find the entities across the two tables (A, B) that match.


In [1]:
import py_entitymatching as em
import pandas as pd
import os, sys
import numpy as np

# Read CSV

- Read each CSV file from disk as a table, add an ID column, and set that column as the table's key (its metadata).

In [2]:
metacriticData = pd.read_csv('data/metacritic.csv')
wikiData = pd.read_csv('data/wikiData.csv')

In [3]:
# add ID column to each dataset
metacriticID = ["a" + str(num) for num in np.arange(1, len(metacriticData.index)+1)]
wikiID = ["b" + str(num) for num in np.arange(1, len(wikiData.index)+1)]

col_idx = 0
metacriticData.insert(loc = col_idx, column = 'ID', value = metacriticID)
wikiData.insert(loc = col_idx, column = 'ID', value = wikiID)

In [4]:
metacriticData.head()

Unnamed: 0,ID,Album,Artist,Genre,Label,Producer,Release Date,Meta Score
0,a1,Wrong Creatures,Black Rebel Motorcycle Club,['Pop/Rock'],['Vagrant Records'],,Jan 12 2018,69
1,a2,No Cross No Crown,Corrosion of Conformity,['Pop/Rock'],['Nuclear Blast'],,Jan 12 2018,77
2,a3,Encore,Anderson East,['Singer-Songwriter'],['Low Country Sound'],,Jan 12 2018,74
3,a4,A Day With The Homies [EP],Panda Bear,['Pop/Rock'],['Domino'],,Jan 12 2018,74
4,a5,Four Stones,Dean McPhee,['Alternative'],['Hood Faire'],,Jan 12 2018,84


In [5]:
wikiData.head()

Unnamed: 0,ID,Album,Artist,Genre,Label,Producer,Release Date,Meta Score
0,b1,Be Calm,Air Dubai,"['Hip hop', ' pop']",['Hopeless'],"['Dwight A. Baker', ' Colin Munroe']",Jul 1 2014,
1,b2,From Parts Unknown,Every Time I Die,"['Metalcore', ' hardcore punk', ' mathcore', ' sludge metal']",['Epitaph'],['Kurt Ballou'],Jul 1 2014,
2,b3,"I'm Almost Happy Here, But I Never Feel At Home",Hotel Books,"['Spoken word', ' indie rock', ' emo']",['inVogue'],"['Jay Maas', ' Hiram Hernandez']",Jul 1 2014,
3,b4,Paula,Robin Thicke,['R&B'],"['Star Trak', ' Interscope']","['Robin Thicke', ' Pro Jay']",Jul 1 2014,
4,b5,Isolate and Medicate,Seether,"['Post-grunge', ' hard rock', ' alternative metal']","['The Bicycle Music Company', ' Concord Bicycle', ' Spinefarm']","[""Brendan O'Brien""]",Jul 1 2014,


In [6]:
# set metadata
em.set_key(wikiData, 'ID')
em.set_key(metacriticData, 'ID')

True
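
The True above indicates that the ID column was accepted as the key. As a rough illustration of what a key column is expected to satisfy (distinct values, none missing), here is a plain-Python sketch. The helpers make_ids and is_valid_key are hypothetical names introduced for this example and are not part of py_entitymatching; the check is an assumption about the usual key requirements, not a copy of the library's logic.

```python
def make_ids(prefix, n):
    """Build IDs such as a1, a2, ..., mirroring the ID columns created above."""
    return [prefix + str(i) for i in range(1, n + 1)]


def is_valid_key(values):
    """Return True if no value is missing and all values are distinct."""
    if any(v is None for v in values):
        return False
    return len(set(values)) == len(values)


ids = make_ids("a", 5)
print(ids)
print(is_valid_key(ids))                  # sequential IDs are distinct
print(is_valid_key(["a1", "a1", "a2"]))   # a repeated ID is rejected
```

Generating IDs from a running counter, as the notebook does, satisfies both conditions by construction.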

# Blocking via overlap

In [36]:
ob = em.OverlapBlocker()

#at least 1 word of artist
oc = ob.block_tables(metacriticData, wikiData,'Artist','Artist',word_level=True,overlap_size=1,
                   l_output_attrs=["Album","Artist","Release Date"],
                   r_output_attrs=["Album","Artist","Release Date"],
                    show_progress=True)

print(len(oc))
# file_name = 'overlap_results.csv'
# oc.to_csv(file_name, sep=',')
# oc.head(1000)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


96621
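
To make the word-level overlap rule concrete, here is a simplified plain-Python sketch of the idea: split each Artist value into words and keep a pair when the two word sets share at least overlap_size words. This is not Magellan's implementation. The lowercasing step and the helper names (tokens, overlap_block) are assumptions of this sketch, and the row with ID x1 is a made-up record used only to produce a match.

```python
def tokens(value):
    """Lowercase a string and split it on whitespace into a set of words."""
    return set(str(value).lower().split())


def overlap_block(left, right, attr, overlap_size=1):
    """Keep (left_id, right_id) pairs whose attr values share enough words."""
    pairs = []
    for l_row in left:
        l_tok = tokens(l_row[attr])
        for r_row in right:
            if len(l_tok & tokens(r_row[attr])) >= overlap_size:
                pairs.append((l_row["ID"], r_row["ID"]))
    return pairs


# Two artists taken from the metacritic preview above.
table_a = [
    {"ID": "a1", "Artist": "Black Rebel Motorcycle Club"},
    {"ID": "a4", "Artist": "Panda Bear"},
]
# One artist from the wiki preview plus one hypothetical row (x1).
table_b = [
    {"ID": "b5", "Artist": "Seether"},
    {"ID": "x1", "Artist": "Club Bear"},
]

pairs = overlap_block(table_a, table_b, "Artist", overlap_size=1)
print(pairs)
```

Here the hypothetical artist pairs with both a1 (shared word "club") and a4 (shared word "bear"), although it matches neither. A one-word threshold is permissive in this way, which is probably why the candidate set is still large after this first step and why further blocking follows.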


In [37]:
#at least 1 word of title
ob2 = em.OverlapBlocker()
oc2 = ob2.block_candset(oc, 'Album', 'Album', word_level=True,overlap_size=1,show_progress = True)
print(len(oc2))
#oc2.head(50)

#2 of the 3 (month day year) of release date
ob3 = em.OverlapBlocker()
oc3 = ob3.block_candset(oc2, 'Release Date', 'Release Date', word_level=True,overlap_size=2,show_progress = True)
print(len(oc3))
oc3.head()

file_name = 'overlap_results.csv'
oc3.to_csv(file_name, sep=',')


0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00

6975

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00

1149
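
The release-date rule (at least 2 of the 3 words month, day, year) can be checked by hand on a few date strings. The sketch below is illustrative only: shared_tokens and passes are hypothetical helper names, and the dates other than those in the table previews are made up.

```python
def shared_tokens(left_date, right_date):
    """Return the words that two date strings have in common."""
    return set(left_date.split()) & set(right_date.split())


def passes(left_date, right_date, overlap_size=2):
    """True if the two dates share at least overlap_size words."""
    return len(shared_tokens(left_date, right_date)) >= overlap_size


same_month_year = passes("Jan 12 2018", "Jan 19 2018")  # shares Jan and 2018
different_dates = passes("Jan 12 2018", "Jul 1 2014")   # shares nothing
same_day_year = passes("Jan 12 2018", "Dec 12 2018")    # shares 12 and 2018

print(same_month_year, different_dates, same_day_year)
```

The first and third pairs pass and the second does not. The third case shows a side effect of treating the date as a bag of words: dates eleven months apart survive because the day and year happen to coincide.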


# Blocking via attribute equivalence blocker

- Apply multiple blockers in sequence to produce a candidate set of tuple pairs.
- First, assume that two albums with different release dates do not refer to the same real-world album, so block the two tables on Release Date. Pairs with a missing release date are kept (allow_missing=True).
- Then assume that two albums with different album names do not refer to the same real-world album, so apply attribute equivalence blocking on Album to the resulting candidate set.
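
The assumptions in the list above amount to an equality join on the blocking attribute. The following plain-Python sketch shows the idea, including the effect of keeping pairs with a missing value. It is a simplified illustration rather than Magellan's implementation; attr_equiv_block is a hypothetical helper, missing values are represented as None by assumption, and the toy rows are made up.

```python
from collections import defaultdict


def attr_equiv_block(left, right, attr, allow_missing=False):
    """Pair rows with equal attr values; optionally keep pairs with a missing value."""
    index = defaultdict(list)   # attr value -> right-side IDs
    right_missing = []          # right-side IDs whose attr is missing
    for r_row in right:
        if r_row[attr] is None:
            right_missing.append(r_row["ID"])
        else:
            index[r_row[attr]].append(r_row["ID"])

    pairs = []
    for l_row in left:
        if l_row[attr] is None:
            # A missing left value can only survive when allow_missing is set.
            if allow_missing:
                pairs.extend((l_row["ID"], r_row["ID"]) for r_row in right)
            continue
        pairs.extend((l_row["ID"], rid) for rid in index.get(l_row[attr], []))
        if allow_missing:
            pairs.extend((l_row["ID"], rid) for rid in right_missing)
    return pairs


# Hypothetical toy rows.
left = [
    {"ID": "a1", "Release Date": "Jan 12 2018"},
    {"ID": "a2", "Release Date": None},
]
right = [
    {"ID": "b1", "Release Date": "Jan 12 2018"},
    {"ID": "b2", "Release Date": "Jul 1 2014"},
    {"ID": "b3", "Release Date": None},
]

strict = attr_equiv_block(left, right, "Release Date")
lenient = attr_equiv_block(left, right, "Release Date", allow_missing=True)
print(len(strict), len(lenient))
```

Keeping missing values trades a larger candidate set for not discarding a true match just because one source lacks a release date.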

In [28]:
# Block with attribute equivalence blocker object
ab1 = em.AttrEquivalenceBlocker()

# block using release date
C1 = ab1.block_tables(metacriticData, wikiData, 
                   l_block_attr='Release Date', r_block_attr='Release Date', 
                    l_output_attrs=['Album', 'Artist', 'Release Date'],
                    r_output_attrs=['Album', 'Artist', 'Release Date'],
                    l_output_prefix='l_', r_output_prefix='r_', allow_missing=True)

# Number of candidate pairs after blocking on release date
len(C1)

51115

In [30]:
# Instantiate attribute equivalence blocker object
ab2 = em.AttrEquivalenceBlocker()

# Use block_candset to apply blocking on Album over the candidate set C1.
C2 = ab2.block_candset(C1, 'Album', 'Album', show_progress = False)

In [31]:
# Display the candidate set of tuple pairs
len(C2)

706

In [32]:
file_name = 'results.csv'
C2.to_csv(file_name, sep=',')