# Enhancing Entity Resolution and Data Profiling Techniques: A Comprehensive Approach

## Profiling relational data


For this task, download and read the paper about profiling relational data, select a set of summary statistics about the data (minimum of 10 different values) and write Python code to compute these quantities for a dataset of your choice. Preferably, you can use one of the csv files from the road safety dataset. Explain the importance of each summary statistic that you selected in understanding the characteristics of the dataset.

Link to the dataset selected: https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-historical-revisions-data.csv

In [None]:
import pandas as pd
import numpy as np
import collections
import matplotlib.pyplot as plt

dataset = pd.read_csv("dft-road-casualty-statistics-historical-revisions-data.csv")
dataset

#### A. Cardinalities


In [None]:
# 1. Number of rows: this statistic shows how many data points or observations are in the dataset.
#It provides a sense of the dataset's size and scope, critical for understanding the overall context of the data.

num_rows = len(dataset)

# 2. Value lengths (minimum, maximum, median, and average): Measuring the length of the values in a column shows how the data is distributed and how much it varies.
#It helps to identify potential outliers or anomalies and assess whether the data conforms to expected patterns.

value_lengths = dataset['variable'].str.len()
min_value_length = value_lengths.min()
max_value_length = value_lengths.max()
median_value_length = value_lengths.median()
average_value_length = value_lengths.mean()

# 3. Null values (number and percentage): Identifying the number or percentage of null values is essential for data quality assessment.
#Handling missing data appropriately is vital for ensuring the accuracy and reliability of any analysis or modeling.

null_values = dataset.isnull().sum()
percentage_null_values = (null_values / num_rows) * 100

# 4. Distinct values (i.e. cardinality): This statistic reveals the diversity and uniqueness of values within a column.
#It is particularly valuable for categorical variables, as it helps to understand the number of different categories or classes present in the data.

distinct_values = dataset['accident_index'].nunique()

# 5. Uniqueness: it is the ratio of distinct values to the total number of rows, which provides insights into the relative diversity of the data.
#A low uniqueness indicates that a few values dominate the dataset, which might impact analysis outcomes.

uniqueness = distinct_values / num_rows

In [None]:
print("A. Cardinalities:")
print("1. Number of Rows:", num_rows)
print("2. Value Lengths - Minimum:", min_value_length)
print("   Value Lengths - Maximum:", max_value_length)
print("   Value Lengths - Median:", median_value_length)
print("   Value Lengths - Average:", average_value_length)
print("3. Null Values - Number:", null_values)
print("   Null Values - Percentage:", percentage_null_values)
print("4. Distinct Values (Cardinality):", distinct_values)
print("5. Uniqueness:", uniqueness)
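The statistics above are computed for one column at a time. As a minimal sketch, assuming a small made-up DataFrame (the `toy` frame and the `profile_columns` helper are illustrative and not part of the assignment), the same cardinality measures can be gathered for every column at once:

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of cardinality statistics per column (hypothetical helper)."""
    n_rows = len(df)
    return pd.DataFrame({
        "nulls": df.isnull().sum(),
        "null_pct": df.isnull().mean() * 100,
        "distinct": df.nunique(),              # NaN is not counted as a value
        "uniqueness": df.nunique() / n_rows,
    })

# Made-up data for illustration only
toy = pd.DataFrame({
    "accident_year": [2019, 2019, 2020, None],
    "variable": ["speed_limit", "weather", "speed_limit", "light"],
})
profile = profile_columns(toy)
print(profile)
```

A table like this makes it easy to spot columns that are mostly null or that behave like keys (uniqueness close to 1).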

#### B. Value Distributions

In [None]:
# 6. Histogram (equi-width):  Frequency histograms offer a visual representation of the distribution of values within a column.
#They make it possible to assess data patterns, such as skewness, central tendency, and the presence of multiple modes.

plt.hist(value_lengths, bins='auto', edgecolor='k')
plt.xlabel("Value Length")
plt.ylabel("Frequency")
plt.title("Value Lengths Histogram (Equi-Width)")
plt.show()

# 7. Constancy (Frequency of the most frequent value): Constancy measures how frequently the most common value occurs within a column.
#It helps identify whether a single value dominates the data, potentially highlighting anomalies or issues with data collection.

most_frequent_value = dataset['accident_year'].mode().values[0]
constancy = (dataset['accident_year'] == most_frequent_value).sum() / num_rows

# 8. Quartiles: Quartiles divide numerical data into four equal parts, each containing 25% of the data points.
#They are crucial for understanding the spread and central tendency of numeric values, making them valuable for data analysis and visualization.

quartiles = np.percentile(dataset['accident_year'].dropna().astype(float), [25, 50, 75])

# 9. First digit distribution (Benford's Law):  Analyzing the distribution of the first digit in numeric values can be used to detect anomalies or fraud.
#It is a statistical test based on Benford's Law, which states the expected distribution of first digits in naturally occurring data.
def first_digit_distribution(data_series):
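    # regex=False removes literal dots only; as a regular expression "." would match every character
    first_digits = data_series.astype(str).str.replace(".", "", regex=False).str.strip().str[0]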
    return dict(collections.Counter(first_digits))

first_digit_dist = first_digit_distribution(dataset['accident_year'].dropna().astype(float))

In [None]:
print("6. Histogram (Equi-Width):")
# Histogram already displayed
print("7. Constancy (Frequency of Most Frequent Value):", constancy)
print("8. Quartiles:", quartiles)
print("9. First Digit Distribution (Benford's Law):")
for digit, count in first_digit_dist.items():
    print(f"   Digit {digit}: {count} occurrences")
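Benford's Law predicts that the leading digit d occurs with probability log10(1 + 1/d). The following sketch, which uses made-up values and hypothetical helper names, shows how observed leading-digit shares could be compared with that expectation:

```python
import math
from collections import Counter

def benford_expected() -> dict:
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_shares(values) -> dict:
    """Observed share of each leading digit, ignoring zeros and signs."""
    digits = []
    for v in values:
        s = f"{abs(v):.15e}"   # scientific notation: the first character is the leading digit
        if s[0] != "0":
            digits.append(int(s[0]))
    counts = Counter(digits)
    total = len(digits) or 1   # avoid dividing by zero on empty input
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

expected = benford_expected()
observed = leading_digit_shares([2019, 2020, 2021, 135, 17.5, 0.042])  # made-up values
for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

A column such as `accident_year` covers a narrow range of values, so its leading digits are not expected to follow Benford's Law; the comparison is more informative for quantities that span several orders of magnitude.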

#### C. Patterns, Data Types, and Domains

In [None]:
# 10. Data type (assuming varchar as a generic DBMS-specific type): Knowing the concrete database management system (DBMS)-specific data type (e.g., varchar, timestamp) is crucial for compatibility, data storage, and querying.
data_type = 'varchar'

# 11. Size (maximum number of digits in numeric values): It is important for assessing precision and deciding how to handle numerical values in calculations or aggregations.
size2 = dataset['accident_year'].apply(lambda x: len(str(x)))
size3 = size2.value_counts()

# 12. Decimals (maximum number of decimals in numeric values): Knowing the maximum number of decimals helps maintain precision during calculations and conversions.
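import re
# Count the digits after the decimal point; trailing zeros are ignored, so 2019.0 has 0 decimals
def count_decimals(x):
    match = re.search(r'\.(\d+)', str(x))
    return len(match.group(1).rstrip('0')) if match else 0
decimal_places = dataset['accident_year'].apply(count_decimals)
max_decimal_places = decimal_places.max()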

# 13. Patterns (histogram of value patterns): It provides insights into the structure of the data, particularly for alphanumeric or complex data. It can help identify regular expressions or formats within the data.
pattern_lengths = dataset['accident_year'].astype(str).apply(lambda x: len(set(x)))

plt.hist(pattern_lengths, bins='auto', edgecolor='k')
plt.xlabel("Pattern Length")
plt.ylabel("Frequency")
plt.title("Pattern Lengths Histogram")
plt.show()

# 14. Data class: Determining the semantic, generic data type (e.g., code, indicator, text, date/time) helps in understanding the purpose and meaning of the data within a column.
data_class = 'generic'

# 15. Domain (classification of semantic domain): Classifying data into semantic domains (e.g., credit card, first name, city, phenotype) provides context and informs data validation, data cleansing, and data transformation processes.
domain = 'uncategorized'

In [None]:
print("10. Data Type:", data_type)
print("11. Size (Maximum Number of Digits):", size3.index[0])
print("12. Decimals (Maximum Number of Decimals):", max_decimal_places)
print("13. Patterns (Pattern Lengths Histogram):") # Histogram already displayed
print("14. Data Class:", data_class)
print("15. Domain:", domain)
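A histogram of value patterns is usually built by generalising each value into a character-class pattern and counting how often each pattern occurs. The sketch below is one possible way to do this; the sample values and the `value_pattern` helper are made up for illustration:

```python
import re
from collections import Counter

def value_pattern(value) -> str:
    """Generalise a value: letters become 'A', digits become '9', other characters are kept."""
    s = str(value)
    s = re.sub(r"[A-Za-z]", "A", s)
    s = re.sub(r"[0-9]", "9", s)
    return s

samples = ["2019010012345", "2020-01-15", "E01000001", "2021-12-03"]  # made-up values
pattern_counts = Counter(value_pattern(v) for v in samples)
print(pattern_counts.most_common())
```

Values whose pattern is rare compared with the dominant pattern of a column are good candidates for formatting errors.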

## Entity resolution

In [None]:
import pandas as pd
import numpy as np

!pip install recordlinkage
import recordlinkage as rl

!pip install py_stringmatching
from py_stringmatching import Levenshtein
from py_stringmatching import Jaro
from py_stringmatching import Affine
from sklearn.preprocessing import MinMaxScaler as mms
lev = Levenshtein()
jaro = Jaro()
aff = Affine()

acm_df = pd.read_csv("ACM.csv")
dblp_df = pd.read_csv("DBLP2.csv", encoding="ISO-8859-1")

Compare every record in the dataset (ACM.csv) with all the records in (DBLP2.csv) and find the similar records (records that represent the same publication). ACM contains the following attributes: id,"title","authors","venue","year". DBLP2 contains the same attributes. To compare two records, follow these steps:

A) Ignore the pub_id.

B) Change all alphabetical characters into lowercase.

C) Convert multiple spaces to one.

In [None]:
#a:
#pub_id is ignored automatically in the similarity functions, because only the specific ['column'] is used in the dataframe

#b:
def to_lower(df):
  #Convert every column to lowercase strings, one whole column at a time
  for col in df.columns:
    df[col] = df[col].astype(str).str.lower()
  return df

acm_df = to_lower(acm_df)
dblp_df = to_lower(dblp_df)

#c:
import re
def replace_spaces(df):
  #Collapse runs of spaces into a single space, one whole column at a time
  #(the chained form df.iloc[i][j] = ... does not reliably write back to the dataframe)
  for col in df.columns:
    df[col] = df[col].astype(str).str.replace(' +', ' ', regex=True)
  return df

acm_df = replace_spaces(acm_df)
dblp_df = replace_spaces(dblp_df)

D) Use Levenshtein similarity for comparing the values in the title attribute and compute the score (st): st = 1 - MED(S1, S2) / max(|S1|, |S2|), where MED refers to the minimum edit distance and |Si| is the number of characters in string Si.

In [None]:
#d:
def levenshtein_sim():
  lev_scores = []

  #Iterate over the datasets, compute the minimal edit distance using the py_stringmatching library and manually convert to the Levenshtein similarity
  for i in range(len(acm_df)):
      for j in range(len(dblp_df)):
          min_edit_dist = lev.get_raw_score(acm_df['title'][i], dblp_df['title'][j])
          max_length = max(len(acm_df['title'][i]), len(dblp_df['title'][j]))
          comp_lev = 1 - min_edit_dist/max_length
          lev_scores.append((acm_df['title'][i], dblp_df['title'][j], comp_lev))

  #St is a list with all the computed Levenshtein scores and for clarity lev_df is a dataset where you can see the compared titles and their similarity score
  lev_df = pd.DataFrame(lev_scores, columns=['ACM Title', 'DBLP Title', 'Levenshtein Score'])
  St = lev_df['Levenshtein Score']
  return lev_df, St

lev_df, St = levenshtein_sim()
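To make the score concrete, the following self-contained sketch computes the minimum edit distance with the standard dynamic-programming recurrence and converts it into a similarity by dividing by the length of the longer string. It is a plain-Python illustration with hypothetical function names, not the `py_stringmatching` implementation:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum edit distance (insert, delete, substitute; each costs 1)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def levenshtein_similarity(s1: str, s2: str) -> float:
    """1 - MED / length of the longer string; two empty strings count as identical."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - edit_distance(s1, s2) / longest

print(edit_distance("kitten", "sitting"))           # 3
print(levenshtein_similarity("kitten", "sitting"))  # 1 - 3/7
```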

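E) Use Jaro similarity for comparing the values in the authors attribute and compute the score (sa).

In [None]:
#e: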
def jaro_sim():
  jaro_scores = []

  #Iterate over the datasets and compute the Jaro score using the py_stringmatching library
  for i in range(len(acm_df)):
      for j in range(len(dblp_df)):
          comp_jaro = jaro.get_raw_score(acm_df['authors'][i], dblp_df['authors'][j])
          jaro_scores.append((acm_df['authors'][i], dblp_df['authors'][j], comp_jaro))

  #Sa is a list with all the computed Jaro scores and for clarity jaro_df is a dataset where you can see the compared authors and their similarity score
  jaro_df = pd.DataFrame(jaro_scores, columns=['ACM Authors', 'DBLP Authors', 'Jaro Score'])
  Sa = jaro_df['Jaro Score']
  return jaro_df, Sa

jaro_df, Sa = jaro_sim()

F) Use a modified version of the affine similarity that is scaled to the interval [0, 1] for the venue attribute (Sc).

In [None]:
#f:
def aff_sim():

  aff_scores = []

  #Iterate over the datasets and compute the Affine gap score using the py_stringmatching library
  for i in range(len(acm_df)):
    for j in range(len(dblp_df)):
        score = aff.get_raw_score(acm_df.iloc[i]['venue'], dblp_df.iloc[j]['venue'])
        aff_scores.append(score)

  #Scale the scores to [0, 1] range using MinMaxScaler()
  scaler = mms()
  sc_scaled = scaler.fit_transform(np.array(aff_scores).reshape(-1, 1)).flatten() #The reshape makes sure it is a 2D array and flatten takes away the brackets []
  aff_scores_scaled = []

  #Now add the scaled scores to the records
  index = 0
  for i in range(len(acm_df)):
    for j in range(len(dblp_df)):
      aff_scores_scaled.append((acm_df.iloc[i]['venue'], dblp_df.iloc[j]['venue'], sc_scaled[index]))
      index += 1

  #Sc is a list with all the scaled Affine scores and for clarity aff_df is a dataset where you can see the compared venues and their similarity score
  aff_df = pd.DataFrame(aff_scores_scaled, columns=['ACM Venue', 'DBLP Venue', 'Scaled Affine Score'])
  Sc = aff_df['Scaled Affine Score']
  return aff_df, Sc

aff_df, Sc = aff_sim()

G) Use Match (1) / Mismatch (0) for the year (Sy).





In [None]:
#g:
def compute_matches():
  match_year = []

  #Iterate over the datasets and compute the match/mismatch score (either 1 or 0)
  for i in range(len(acm_df)):
    for j in range(len(dblp_df)):
          if acm_df['year'][i] == dblp_df['year'][j]:
            match_year.append((acm_df['year'][i], dblp_df['year'][j], 1))
          else:
            match_year.append((acm_df['year'][i], dblp_df['year'][j], 0))

  #Sy is a list with all the match/mismatch scores and for clarity match_df is a dataset where you can see the compared years and their similarity score
  match_df = pd.DataFrame(match_year, columns=['ACM year', 'DBLP year', 'Match (1) or Mismatch (0)'])
  Sy = match_df['Match (1) or Mismatch (0)']
  return match_df, Sy

match_df, Sy = compute_matches()

H) Use the formula rec_sim = w1 * st + w2 * sa + w3 * sc + w4 * sy to combine the scores and compute the final score, where the four weights sum to 1.

In [None]:
#h:
def rec_sim():
  w1 = 0.25
  w2 = 0.25
  w3 = 0.25
  w4 = 0.25
  rec_sim = w1 * St + w2 * Sa + w3 * Sc + w4 * Sy
  return rec_sim

rec_sim()
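The weights do not have to be equal. As a small sketch with made-up scores and arbitrarily chosen weights (the `combine_scores` helper is hypothetical), the combination can be written so that it rejects weights that do not sum to 1:

```python
import numpy as np

def combine_scores(scores: dict, weights: dict) -> np.ndarray:
    """Weighted sum of per-attribute similarity arrays; the weights must sum to 1."""
    if not np.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    total = np.zeros(len(next(iter(scores.values()))))
    for name, weight in weights.items():
        total += weight * np.asarray(scores[name], dtype=float)
    return total

# Made-up similarity scores for two record pairs
scores = {"title": [0.9, 0.2], "authors": [0.8, 0.4], "venue": [1.0, 0.1], "year": [1, 0]}
weights = {"title": 0.4, "authors": 0.3, "venue": 0.1, "year": 0.2}  # illustrative choice
rec = combine_scores(scores, weights)
print(rec, rec > 0.7)
```

Giving the title a larger weight reflects the assumption that titles discriminate between publications better than venues or years do; whether that helps on this dataset would need to be checked against the perfect mapping.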

I) Report the records with rec_sim > 0.7 as duplicate records by storing the ids of both records in a list.

In [None]:
def duplicate_records():
  rec_scores = rec_sim()
  duplicate_records = []

  #Iterate over the datasets and note down both id's and their combined similarity score
  for i in range(len(acm_df)):
    for j in range(len(dblp_df)):
      duplicate_records.append((acm_df.iloc[i]['id'], dblp_df.iloc[j]['id'], rec_scores[i * len(dblp_df) + j]))

  #Create a new dataframe containing both id's and their combined similarity scores, then filter to show only similarity scores larger than 0.7
  combined_df = pd.DataFrame(duplicate_records, columns=['ACM_ID', 'DBLP_ID', 'Rec_Sim_Score'])
  filtered_df = combined_df[combined_df['Rec_Sim_Score'] > 0.7]
  return filtered_df

duplicate_records()


J) In the table DBLP-ACM_perfectMapping.csv, you can find the actual mappings (the ids of the correct duplicate records). Compute the precision of this method by counting the number of duplicate records that you discovered correctly. That is, among all the reported similar records by your method, how many pairs exist in the file DBLP-ACM_perfectMapping.csv.

In [None]:
def compute_precision():
  #Read in the perfect mapping csv and normalise it with the functions from b and c above
  dblp_perf = pd.read_csv('DBLP-ACM_perfectMapping.csv')
  dblp_perf = to_lower(dblp_perf)
  dblp_perf = replace_spaces(dblp_perf)

  duplicate_df = duplicate_records()
  total_duplicates = len(duplicate_df)

  #A reported pair is correct only if the same (ACM id, DBLP id) pair appears in the perfect mapping;
  #checking the two ids independently would also accept ids that belong to different true pairs
  true_pairs = set(zip(dblp_perf['idACM'], dblp_perf['idDBLP']))
  reported_pairs = zip(duplicate_df['ACM_ID'].astype(str), duplicate_df['DBLP_ID'].astype(str))
  correct_duplicates = sum(pair in true_pairs for pair in reported_pairs)

  precision = correct_duplicates/total_duplicates if total_duplicates else 0.0
  print("The precision of the record similarity measure is:", str(precision))

compute_precision()
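Precision has to be computed on pairs of ids rather than on the ids individually. The toy example below (made-up ids, hypothetical helper) illustrates the difference: the reported pair `("a2", "d3")` consists of two ids that both occur in the ground truth, yet the pair itself is wrong.

```python
def precision_recall(reported: set, truth: set) -> tuple:
    """Precision and recall of reported id pairs against the true id pairs."""
    true_pos = len(reported & truth)
    precision = true_pos / len(reported) if reported else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall

truth = {("a1", "d1"), ("a2", "d2"), ("a3", "d3")}
reported = {("a1", "d1"), ("a2", "d3"), ("a3", "d3"), ("a4", "d4")}
precision, recall = precision_recall(reported, truth)
print(precision, recall)
```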

K) Record the running time of the method. You can observe that the program takes a long time to get the results. What can you do to reduce the running time? (Just provide clear discussion – no need for implementing the ideas.)

In [None]:
import timeit
import random
starttime = timeit.default_timer()
levenshtein_sim()
time1 = timeit.default_timer() - starttime

starttime2 = timeit.default_timer()
jaro_sim()
time2 = timeit.default_timer() - starttime2

starttime3 = timeit.default_timer()
aff_sim()
time3 = timeit.default_timer() - starttime3

starttime4 = timeit.default_timer()
compute_matches()
time4 = timeit.default_timer() - starttime4

starttime5 = timeit.default_timer()
rec_sim()
time5 = timeit.default_timer() - starttime5

starttime6 = timeit.default_timer()
duplicate_records()
time6 = timeit.default_timer() - starttime6

starttime7 = timeit.default_timer()
compute_precision()
time7 = timeit.default_timer() - starttime7

total_runtime = time1+time2+time3+time4+time5+time6+time7
print("Total runtime:", str(total_runtime))

To reduce the running time, we could consider the following strategies:

1. **Early Exit Optimization**: Add an early exit condition that stops the calculation as soon as the edit distance exceeds a predefined threshold.

2. **Memoization**: Implement memoization (caching) to store the results of previously computed Levenshtein distances for pairs of strings.

3. **Parallel Processing**: If there is a large number of string comparisons to make, you can parallelize the calculations using multi-threading or multiprocessing.  

4. **Indexing and Filtering**: When comparing a large dataset against a smaller set of potential matches, indexing and filtering the potential matches first based on some criteria may reduce the number of expensive distance calculations.

5. **Library Optimizations**: Use specialized libraries or functions that are optimized for string similarity calculations.

6. **Limit String Length**: Limit the length of the strings being compared by truncating or preprocessing long strings, which makes the similarity calculations cheaper.
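The indexing and filtering idea is often called blocking. As a rough sketch, assuming that the year is recorded reliably in both sources (an assumption that would need checking), only records that share a blocking key are compared. The records and helper names below are made up for illustration:

```python
from collections import defaultdict

def block_by_key(records: list, key: str) -> dict:
    """Group record indices by the value of the blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[key]].append(idx)
    return blocks

def blocked_pairs(left: list, right: list, key: str) -> list:
    """Return only the (left index, right index) pairs that share the blocking key."""
    right_blocks = block_by_key(right, key)
    return [(i, j) for i, rec in enumerate(left) for j in right_blocks.get(rec[key], [])]

# Made-up records
acm = [{"title": "a", "year": 1999}, {"title": "b", "year": 2000}, {"title": "c", "year": 2001}]
dblp = [{"title": "a", "year": 1999}, {"title": "x", "year": 1999},
        {"title": "c", "year": 2001}, {"title": "y", "year": 2003}]
pairs = blocked_pairs(acm, dblp, "year")
print(len(pairs), "comparisons instead of", len(acm) * len(dblp))
```

The expensive string similarities are then computed only for the surviving pairs. The price is that a true match with a wrong or missing year is never compared.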



## Enhanced Entity Resolution Method using Shingling, MinHash, and Locality Sensitive Hashing (LSH)

Concatenating the values in each record into one single string

In [None]:
def concat_records(df):
  records = []
  for i in range(len(df)):
    row = df.iloc[i, :]
    row = row.str.cat(sep=' ')
    records.append(row)
  return records

acm_records = concat_records(acm_df)
dblp_records = concat_records(dblp_df)

Changing all alphabetical characters into lowercase.


In [None]:
def to_lower(records):
  for i in range(len(records)):
    lower = str(records[i]).lower()
    records[i] = lower
  return records

acm_records = to_lower(acm_records)
dblp_records = to_lower(dblp_records)

Convert multiple spaces to one.


In [None]:
def replace_spaces(records):
  for i in range(len(records)):
    record = records[i]
    result = re.sub(' +', ' ', record)
    records[i] = result
  return records

acm_records = replace_spaces(acm_records)
dblp_records = replace_spaces(dblp_records)

Combine the records from both tables into one big list as we did during the lab.

In [None]:
combined_records = acm_records + dblp_records
print(combined_records)

Use the functions in the tutorials from lab 5 to compute the shingles, the minhash signature and the similarity.

In [None]:
def shingle(text: str, k: int)->set:
    """
    Create a set of 'shingles' from the input text using k-shingling.

    Parameters:
        text (str): The input text to be converted into shingles.
        k (int): The length of the shingles (substring size).

    Returns:
        set: A set containing the shingles extracted from the input text.
    """
    shingle_set = []
    for i in range(len(text) - k+1):
        shingle_set.append(text[i:i+k])
    return set(shingle_set)

def build_vocab(shingle_sets: list)->dict:
    """
    Constructs a vocabulary dictionary from a list of shingle sets.

    This function takes a list of shingle sets and creates a unified vocabulary
    dictionary. Each unique shingle across all sets is assigned a unique integer
    identifier.

    Parameters:
    - shingle_sets (list of set): A list containing sets of shingles.

    Returns:
    - dict: A vocabulary dictionary where keys are the unique shingles and values
      are their corresponding unique integer identifiers.

    Example:
    sets = [{"apple", "banana"}, {"banana", "cherry"}]
    build_vocab(sets)
    {'apple': 0, 'cherry': 1, 'banana': 2}  # The exact order might vary due to set behavior
    """
    full_set = {item for set_ in shingle_sets for item in set_}
    vocab = {}
    for i, shingle in enumerate(list(full_set)):
        vocab[shingle] = i
    return vocab

def one_hot(shingles: set, vocab: dict):
    vec = np.zeros(len(vocab))
    for shingle in shingles:
        idx = vocab[shingle]
        vec[idx] = 1
    return vec


def get_shingles_2(): #code block turned into function to record run time
  k = 3
  sentences = combined_records
  shingles = []
  for sentence in sentences:
      shingles.append(shingle(sentence,k))
  vocab = build_vocab(shingles)
  shingles_1hot = []
  for shingle_set in shingles:
      shingles_1hot.append(one_hot(shingle_set,vocab))
  shingles_1hot = np.stack(shingles_1hot)
  return vocab, shingles_1hot, shingles

vocab, shingles_1hot, shingles = get_shingles_2()  # call once and unpack the three results

def get_minhash_arr(num_hashes:int,vocab:dict):
    """
    Generates a MinHash array for the given vocabulary.

    This function creates an array where each row represents a hash function and
    each column corresponds to a word in the vocabulary. The values are permutations
    of integers representing the hashed value of each word for that particular hash function.

    Parameters:
    - num_hashes (int): The number of hash functions (rows) to generate for the MinHash array.
    - vocab (dict): The vocabulary where keys are words and values can be any data
      (only keys are used in this function).

    Returns:
    - np.ndarray: The generated MinHash array with `num_hashes` rows and columns equal
      to the size of the vocabulary. Each cell contains the hashed value of the corresponding
      word for the respective hash function.

    Example:
    vocab = {'apple': 1, 'banana': 2}
    get_minhash_arr(2, vocab)
    # Possible output:
    # array([[1, 2],
    #        [2, 1]])
    """
    length = len(vocab.keys())
    arr = np.zeros((num_hashes,length))
    for i in range(num_hashes):
        permutation = np.random.permutation(len(vocab.keys())) + 1
        arr[i,:] = permutation.copy()
    return arr.astype(int)

def get_signature(minhash:np.ndarray, vector:np.ndarray):
    """
    Computes the signature of a given vector using the provided MinHash matrix.

    The function finds the nonzero indices of the vector, extracts the corresponding
    columns from the MinHash matrix, and computes the signature as the minimum value
    across those columns for each row of the MinHash matrix.

    Parameters:
    - minhash (np.ndarray): The MinHash matrix where each column represents a shingle
      and each row represents a hash function.
    - vector (np.ndarray): A vector representing the presence (non-zero values) or
      absence (zero values) of shingles.

    Returns:
    - np.ndarray: The signature vector derived from the MinHash matrix for the provided vector.

    Example:
    minhash = np.array([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
    vector = np.array([0, 1, 0])
    get_signature(minhash, vector)
    output:array([3, 6, 9])
    """
    idx = np.nonzero(vector)[0].tolist()
    shingles = minhash[:,idx]
    signature = np.min(shingles,axis=1)
    return signature

def jaccard_similarity(set1, set2):
    intersection_size = len(set1.intersection(set2))
    union_size = len(set1.union(set2))
    return intersection_size / union_size if union_size != 0 else 0.0

def compute_signature_similarity(signature_1, signature_2):
    """
    Calculate the similarity between two signature matrices using MinHash.

    Parameters:
    - signature_1: First signature matrix as a numpy array.
    - signature_2: Second signature matrix as a numpy array.

    Returns:
    - Estimated Jaccard similarity.
    """
    # Ensure the matrices have the same shape
    if signature_1.shape != signature_2.shape:
        raise ValueError("Both signature matrices must have the same shape.")
    # Count the number of rows where the two matrices agree
    agreement_count = np.sum(signature_1 == signature_2)
    # Calculate the similarity
    similarity = agreement_count / signature_2.shape[0]

    return similarity


def get_sigs(): #turned into function to compute run time later
  minhash_arr =  get_minhash_arr(100,vocab)
  signatures = []
  for vector in shingles_1hot:
    signatures.append(get_signature(minhash_arr,vector))
  signatures = np.stack(signatures)
  signatures.shape
  return signatures

signatures = get_sigs()


print(compute_signature_similarity(signatures[2],signatures[3]))
print(jaccard_similarity(shingles[3],shingles[2]))
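The two printed numbers should be close because the fraction of agreeing signature positions is an estimate of the Jaccard similarity of the underlying shingle sets. The following self-contained sketch shows this on two made-up strings; the 200 permutations, the fixed seed and the helper names are assumptions chosen for illustration:

```python
import numpy as np

def shingles_of(text: str, k: int = 3) -> set:
    """Set of all character k-shingles of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set: set, vocab: dict, perms: np.ndarray) -> np.ndarray:
    """For each permutation, the smallest permuted index among the set's shingles."""
    idx = [vocab[s] for s in shingle_set]
    return perms[:, idx].min(axis=1)

a = shingles_of("efficient entity resolution for large data")
b = shingles_of("efficient entity resolution on large datasets")
vocab = {s: i for i, s in enumerate(sorted(a | b))}

rng = np.random.default_rng(0)
num_hashes = 200
perms = np.stack([rng.permutation(len(vocab)) for _ in range(num_hashes)])

sig_a = minhash_signature(a, vocab, perms)
sig_b = minhash_signature(b, vocab, perms)
estimate = float(np.mean(sig_a == sig_b))   # share of hash functions that agree
exact = len(a & b) / len(a | b)             # true Jaccard similarity
print(estimate, exact)
```

More hash functions reduce the variance of the estimate at the cost of longer signatures.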

Extract the top 2224 candidates from the LSH algorithm, compare them to the actual mappings in the file DBLP-ACM_perfectMapping.csv and compute the precision of the method

In [None]:
import requests
from io import StringIO
from IPython.display import display
from random import shuffle
from itertools import combinations
class LSH:
    """
    Implements the Locality Sensitive Hashing (LSH) technique for approximate
    nearest neighbor search.
    """
    def __init__(self, b: int):
        """
        Initializes the LSH instance with a specified number of bands.

        Parameters:
        - b (int): The number of bands to divide the signature into.
        """
        self.b = b
        # Keep the state on the instance: class-level attributes would be shared by every LSH object
        self.buckets = [{} for _ in range(b)]
        self.counter = 0

    def make_subvecs(self, signature: np.ndarray) -> np.ndarray:
        """
        Divides a given signature into subvectors based on the number of bands.

        Parameters:
        - signature (np.ndarray): The MinHash signature to be divided.

        Returns:
        - np.ndarray: A stacked array where each row is a subvector of the signature.
        """
        l = len(signature)
        assert l % self.b == 0
        r = int(l / self.b)
        subvecs = []
        for i in range(0, l, r):
            subvecs.append(signature[i:i+r])
        return np.stack(subvecs)

    def add_hash(self, signature: np.ndarray):
        """
        Adds a signature to the appropriate LSH buckets based on its subvectors.

        Parameters:
        - signature (np.ndarray): The MinHash signature to be hashed and added.
        """
        subvecs = self.make_subvecs(signature).astype(str)
        for i, subvec in enumerate(subvecs):
            subvec = ','.join(subvec)
            if subvec not in self.buckets[i].keys():
                self.buckets[i][subvec] = []
            self.buckets[i][subvec].append(self.counter)
        self.counter += 1

    def check_candidates(self) -> set:
        """
        Identifies candidate pairs from the LSH buckets that could be potential near duplicates.

        Returns:
        - set: A set of tuple pairs representing the indices of candidate signatures.
        """
        candidates = []
        for bucket_band in self.buckets:
            keys = bucket_band.keys()
            for bucket in keys:
                hits = bucket_band[bucket]
                if len(hits) > 1:
                    candidates.extend(combinations(hits, 2))
        return set(candidates)


def apply_lsh():
  b = 10   # number of bands (100 hashes / 10 bands = 10 rows per band)
  lsh = LSH(b)
  for signature in signatures:
      lsh.add_hash(signature)
  candidate_pairs = lsh.check_candidates()
  len(candidate_pairs)
  return candidate_pairs

candidate_pairs = apply_lsh()
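# A set has no meaningful order, so rank the candidates by estimated similarity before taking the top 2224
ranked_candidates = sorted(candidate_pairs, key=lambda pair: compute_signature_similarity(signatures[pair[0]], signatures[pair[1]]), reverse=True)
top_candidates = ranked_candidates[:2224]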
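With b bands of r rows each, two records with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^r)^b. The sketch below evaluates this for the setting used here (100 hash functions split into b = 10 bands, hence r = 10); the function name is hypothetical:

```python
def candidate_probability(s: float, b: int, r: int) -> float:
    """Probability that two records with Jaccard similarity s collide in at least one band."""
    return 1 - (1 - s ** r) ** b

b, r = 10, 10
for s in (0.5, 0.7, 0.8, 0.9):
    print(s, round(candidate_probability(s, b, r), 3))

# Common rule of thumb for where the S-curve rises most steeply
threshold = (1 / b) ** (1 / r)
print("approximate threshold:", round(threshold, 3))
```

Pairs well below the threshold are rarely reported as candidates. Using more bands with fewer rows per band lowers the threshold, which raises recall but produces more false candidates.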

In [None]:
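# Compare the top candidates with the correct mappings.
# Precision = true positives / (true positives + false positives).
dblp_key = pd.read_csv('DBLP-ACM_perfectMapping.csv')

def compute_precision(top_candidates):
  # Candidates are index pairs into combined_records: ACM records come first, DBLP records after them
  n_acm = len(acm_records)
  true_pairs = set(zip(dblp_key['idACM'].astype(str).str.lower(), dblp_key['idDBLP'].astype(str).str.lower()))
  correct_count = 0
  for a, b in top_candidates:
    a, b = min(a, b), max(a, b)
    # Only a pair with one ACM record and one DBLP record can be a true match
    # (note that `x in series` would test the index of a pandas Series, not its values)
    if a < n_acm <= b and (str(acm_df['id'][a]), str(dblp_df['id'][b - n_acm])) in true_pairs:
      correct_count += 1
  precision = correct_count / len(top_candidates) if top_candidates else 0.0
  print("Precision:", str(precision))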
compute_precision(top_candidates)


Record the running time of the method

In [None]:
import timeit
import random
starttime = timeit.default_timer()
get_shingles_2()
time1 = timeit.default_timer() - starttime

starttime2 = timeit.default_timer()
get_sigs()
time2 = timeit.default_timer() - starttime2

starttime3 = timeit.default_timer()
apply_lsh()
time3 = timeit.default_timer() - starttime3

starttime4 = timeit.default_timer()
compute_precision(top_candidates)
time4 = timeit.default_timer() - starttime4

total_runtime = time1+time2+time3+time4
print("Total runtime:", str(total_runtime))

The method in Part 2 reached a much higher precision than the method in Part 1, and it also ran much faster. Overall, this makes the method in Part 2 the better choice for entity resolution.

## Data preparation

1. Computing the correlation between the different columns after removing the outcome column.

2. Removing the disguised values from the table. We need to remove the values equal to 0 from the columns BloodPressure, SkinThickness and BMI, as these are missing values that have been replaced by the value 0. Remove the value but keep the record (i.e., change the value to null).

3. Filling the cells with null using the mean values of the records that have the same class label.

4. Computing the correlation between the different columns.

5. Comparing the values from this step with the values in the first step (just mention the most important changes (if any)) and comment on your findings.



In [None]:
df = pd.read_csv('diabetes.csv')

In [None]:
#1)
df2 = df.drop(['Outcome'], axis=1)
print(df2.corr())

In [None]:
#2
import numpy as np
cols = ['BloodPressure', 'SkinThickness', 'BMI']
# Only these columns use 0 as a disguised missing value; a 0 in Outcome or Pregnancies is legitimate
for col in cols:
  df[col] = df[col].replace(0, np.nan)
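It matters which columns the replacement touches. The toy frame below (made-up values) sketches the effect of restricting the replacement to the columns where 0 cannot be a real measurement, so that legitimate zeros such as a class label of 0 survive:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BloodPressure": [72, 0, 64, 0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 0, 1, 0],   # here 0 is a legitimate value
    "Outcome": [1, 0, 1, 0],       # here 0 is a legitimate class label
})
disguised_cols = ["BloodPressure", "BMI"]   # illustrative subset
toy[disguised_cols] = toy[disguised_cols].replace(0, np.nan)
print(toy.isna().sum())
```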

In [None]:
#3
for i in range(len(df)):
 for j in range(len(df.columns)):
   if pd.isna(df.iloc[i, j]):
     # Select the records that share this record's class label
     if df.loc[i, 'Outcome']==1.0:
       cls = df[df['Outcome']==1.0]
     else:
       cls = df[df['Outcome']!=1.0]
     # Fill the missing cell with that class's column mean
     df.iloc[i, j] = cls.iloc[:, j].mean()
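The same class-conditional imputation can be expressed without explicit loops. A minimal sketch on made-up data, using `groupby(...).transform("mean")` to obtain each record's class mean:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BMI": [30.0, np.nan, 34.0, 22.0, np.nan, 26.0],
    "Outcome": [1, 1, 1, 0, 0, 0],
})
# For every row, the mean BMI of the rows with the same Outcome (NaN values are skipped)
class_means = toy.groupby("Outcome")["BMI"].transform("mean")
toy["BMI"] = toy["BMI"].fillna(class_means)
print(toy)
```

Here the missing value in class 1 is filled with 32.0 and the one in class 0 with 24.0, the means of the observed values in each class.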


In [None]:
#4
# The outcome column is kept this time, so the matrix also shows how each predictor correlates with the class label
correlation_matrix = df.corr()
print(correlation_matrix)
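Two correlation matrices are easier to compare when the largest changes are listed explicitly. The helper below is a hypothetical sketch on made-up data; it takes the absolute difference of the two matrices and reports the column pairs that changed most:

```python
import numpy as np
import pandas as pd

def largest_corr_changes(before: pd.DataFrame, after: pd.DataFrame, top: int = 3) -> pd.Series:
    """Column pairs with the largest absolute change in correlation."""
    diff = (after - before).abs()
    upper = np.triu(np.ones(diff.shape, dtype=bool), k=1)  # keep each pair once, skip the diagonal
    pairs = diff.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(top)

# Made-up data: only column "c" is modified
before_df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 1, 4, 3, 5], "c": [5, 3, 4, 1, 2]})
after_df = before_df.copy()
after_df["c"] = [1, 2, 3, 4, 6]

changes = largest_corr_changes(before_df.corr(), after_df.corr(), top=2)
print(changes)
```

Pairs that do not involve a modified column show a change of zero, which gives a quick check that the cleaning step touched only the intended columns.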

5: The main difference between the correlations in part 4 and part 1 is that part 1 was computed with the disguised zeros still in place, whereas in part 4 those zeros had been set to null and then filled with the mean of the records that have the same class label. The correlations that do not involve the blood pressure, skin thickness or BMI columns are the same in both parts. For those three columns the correlations remain broadly similar, but they are slightly higher in part 4 because the artificial zeros no longer distort them.


*   Step 1 focuses on understanding **independent variable relationships and potential multicollinearity**.
*   In contrast, Step 4 provides a **broader view of correlations and aids variable selection for predictive modeling**.
*   The main difference is that Step 4 includes **correlations between predictors and the outcome**.