# Python For Data Analysis Group 2
GitHub Repository: https://github.com/educated-fool/entity-resolution-group2

## Module 1: RecordLinkage Using Features Weighted Average
Author: Sixuan Li

- **Targeted Field Comparisons**: This module utilizes a set of specified fields (name, address, city, state, postal code) for individual comparisons.

- **Weighted Scoring**: Each field comparison contributes to the overall score according to a predefined weight, so the importance of each field in determining a match can be tuned.

- **Jaro-Winkler Thresholding**: Comparisons use the Jaro-Winkler similarity with a threshold, which can prevent lower-quality matches from being considered.

- **Resulting Matches**: The approach is geared towards high precision in matches, potentially at the cost of missing some valid matches if the field weights or thresholds are not optimally set.
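
The weighted scoring described above can be sketched with the standard library alone. This is a minimal illustration, not the module's code: the weights are hypothetical, `difflib.SequenceMatcher` stands in for Jaro-Winkler, and treating each field as an agree/disagree flag at the threshold is an assumption.

```python
from difflib import SequenceMatcher

# Hypothetical weights for illustration; the module's actual values are not
# stated in this README.
WEIGHTS = {
    "name": 0.4,
    "address": 0.3,
    "city": 0.1,
    "state": 0.1,
    "postal_code": 0.1,
}


def similarity(a, b) -> float:
    """Stand-in for Jaro-Winkler: a 0-1 similarity from the standard library."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def weighted_score(left: dict, right: dict, threshold: float = 0.85) -> float:
    """Weighted average of thresholded field agreements.

    A field contributes its full weight only when its similarity reaches
    the threshold (an assumed rule); otherwise it contributes nothing.
    """
    agreed = sum(
        weight
        for field, weight in WEIGHTS.items()
        if similarity(left[field], right[field]) >= threshold
    )
    return agreed / sum(WEIGHTS.values())


left = {
    "name": "quick lube plus",
    "address": "6014 e hillsborough ave ste c",
    "city": "tampa",
    "state": "FL",
    "postal_code": "33610",
}
right = dict(left, name="quick lube plus tires inc")

print(weighted_score(left, right))
```

In this sketch the two names fall below the threshold, so the pair loses the whole name weight even though every address field agrees. That is the precision-oriented behavior the bullets describe.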

In [3]:
# Importing the necessary libraries
import pandas as pd

# Reading the left and right datasets from CSV files
left_df = pd.read_csv('./data/left_dataset.csv')
right_df = pd.read_csv('./data/right_dataset.csv')

In [4]:
from src.module1_recordlinkage import record_linkage_pipeline

In [5]:
record_linkage_pipeline(left_df, right_df, filename='./data/recordlinkage_submission.csv', threshold=0.85)

Unnamed: 0,Unnamed: 1,left_dataset,right_dataset,confidence_score
7,84020,8,84021,1.00
56,32736,57,32737,1.00
59,39158,60,39159,0.92
59,39236,60,39237,0.92
66,77382,67,77383,1.00
...,...,...,...,...
94474,165,94475,166,1.00
94474,2372,94475,2373,1.00
94538,79362,94539,79363,1.00
94545,83309,94546,83310,0.92


## Module 2: RecordLinkage Using Features Mean
Author: Sixuan Li

- **Consolidated Address Matching**: This module creates a `full_address` field by concatenating address components, allowing for a holistic comparison of the entire address as one entity rather than separate parts.

- **Uniform Scoring**: The final match score is the average of the scores from the compared fields (name and full address), treating all aspects equally. This lack of weighting simplifies the model but also assumes all fields are equally important.

- **Full Jaro-Winkler Comparison**: It applies the Jaro-Winkler method without a threshold for individual field comparisons, which could result in considering more potential matches.

- **Resulting Matches**: Expected to yield a higher number of matches due to the comprehensive address comparison and the inclusive scoring system, potentially at the risk of including more false positives.
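
A rough sketch of the `full_address` concatenation and the unweighted mean follows. It assumes the address components are joined with single spaces, and it uses `difflib.SequenceMatcher` as a stand-in for Jaro-Winkler; `mean_score` and `ADDRESS_PARTS` are illustrative names, not the module's.

```python
from difflib import SequenceMatcher
from statistics import mean

ADDRESS_PARTS = ("address", "city", "state", "postal_code")


def similarity(a, b) -> float:
    """Stand-in for Jaro-Winkler: a 0-1 similarity from the standard library."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def full_address(record: dict) -> str:
    """Concatenate the address components into one comparison string."""
    return " ".join(str(record[part]).strip() for part in ADDRESS_PARTS)


def mean_score(left: dict, right: dict) -> float:
    """Unweighted mean of the name score and the full-address score."""
    return mean(
        [
            similarity(left["name"], right["name"]),
            similarity(full_address(left), full_address(right)),
        ]
    )


left = {
    "name": "dammanns garden company",
    "address": "5129 s emerson ave",
    "city": "indianapolis",
    "state": "IN",
    "postal_code": "46237",
}
right = dict(left, name="dammanns garden company llc")

print(full_address(left))
print(mean_score(left, right))
```

Because no per-field threshold is applied, a partially similar name still adds to the score instead of being zeroed out, which is why this approach tends to admit more candidate matches.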

In [6]:
import pandas as pd
left_df = pd.read_csv('./data/left_dataset.csv')
right_df = pd.read_csv('./data/right_dataset.csv')

In [4]:
from src.module2_recordlinkage import record_linkage_pipeline_full_address

In [7]:
record_linkage_pipeline_full_address(left_df, right_df, filename='./data/recordlinkage2_submission.csv')

Unnamed: 0,Unnamed: 1,left_dataset,right_dataset,confidence_score
1,75686,2,75687,0.81
7,81911,8,81912,0.83
7,84020,8,84021,0.90
14,51941,15,51942,0.81
19,87105,20,87106,0.81
...,...,...,...,...
94559,74490,94560,74491,0.82
94570,72779,94571,72780,0.81
94570,75425,94571,75426,0.83
94578,80356,94579,80357,0.97


Key Differences Between Module 1 and Module 2:

- **Precision vs. Recall**: Module 1 might have higher precision due to its weighted and threshold-based approach, whereas Module 2 is likely to have higher recall, identifying more potential matches by utilizing a full address match and averaging comparison scores.

- **Complexity in Scoring**: The first module's weighted score allows for nuanced scoring reflective of the relative importance of data fields. The second module's equal weighting is simpler but less tailored to specific field significances.

- **Field Utilization**: Module 2's use of a single `full_address` field for matching might capture nuances in address variations better than separate field comparisons in Module 1, which might miss matches due to discrepancies in individual address components.

## Module 3: TheFuzz
Author: Xueni Wang

**Library Preparation**:
- TheFuzz Python library provides tools for string matching and fuzzy comparison.

**Data Preprocessing**:
- Checks and imputes missing values.
- Standardizes postal codes (zip codes) and other features such as name, address, city, and state.

**Data Matching and Analysis**:
- Creates a “combined” feature that merges multiple columns into a single string per record, which simplifies comparison.
- Uses a block key to reduce the comparison space, making the matching process more efficient.
- Performs matching with the “extractOne” function, keeping only entries with a high degree of similarity.
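
The combine, block, and best-match steps might look roughly like the sketch below. It does not call TheFuzz: `extract_one` is a simplified standard-library stand-in for `extractOne` (its return shape differs from the library's), and the block key of state plus postal-code prefix is a hypothetical choice.

```python
from collections import defaultdict
from difflib import SequenceMatcher

FIELDS = ("name", "address", "city", "state", "postal_code")


def combined(record: dict) -> str:
    """Merge the compared columns into one lowercase string."""
    return " ".join(str(record[field]).lower().strip() for field in FIELDS)


def block_key(record: dict) -> str:
    # Hypothetical key: state plus the first three postal-code characters.
    return f"{str(record['state']).lower()}{str(record['postal_code'])[:3]}"


def extract_one(query: str, choices: dict, score_cutoff: int = 80):
    """Simplified stand-in for extractOne: best (choice_id, score) or None.

    Scores are on a 0-100 scale; choices below score_cutoff are ignored.
    """
    best = None
    for choice_id, text in choices.items():
        score = round(100 * SequenceMatcher(None, query, text).ratio())
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice_id, score)
    return best


def match(left: dict, right: dict, score_cutoff: int = 80) -> list:
    """Return (left_id, right_id, confidence) for each left record's best hit."""
    blocks = defaultdict(dict)
    for right_id, record in right.items():
        blocks[block_key(record)][right_id] = combined(record)

    matches = []
    for left_id, record in left.items():
        candidates = blocks.get(block_key(record), {})
        hit = extract_one(combined(record), candidates, score_cutoff)
        if hit is not None:
            matches.append((left_id, hit[0], hit[1] / 100))
    return matches


left = {
    1: {"name": "quick lube plus", "address": "6014 e hillsborough ave ste c",
        "city": "tampa", "state": "FL", "postal_code": "33610"},
    2: {"name": "dammanns garden company", "address": "5129 s emerson ave",
        "city": "indianapolis", "state": "IN", "postal_code": "46237"},
}
right = {
    10: {"name": "quick lube plus tires inc", "address": "6014 e hillsborough ave ste c",
         "city": "tampa", "state": "FL", "postal_code": "33610"},
    11: {"name": "tampa tire depot", "address": "100 main st",
         "city": "tampa", "state": "FL", "postal_code": "33612"},
}

print(match(left, right))
```

Only records that share a block key are ever compared, so the second left record, which has no right-hand candidates in its block, is skipped without any string comparison.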

In [1]:
import pandas as pd
left_df = pd.read_csv('./data/left_dataset.csv')
right_df = pd.read_csv('./data/right_dataset.csv')

In [2]:
from src.module3_thefuzz import thefuzz_pipeline

In [3]:
thefuzz_pipeline(left_df, right_df, output_csv = "./data/thefuzz_submission.csv")

Unnamed: 0,left_dataset,right_dataset,confidence_score
0,60,39237,0.81
1,534,42420,0.89
2,1337,39545,0.83
3,2651,49857,0.86
4,3214,49285,0.91
...,...,...,...
14927,58215,88924,0.85
14928,16504,84508,0.84
14929,38004,84508,0.91
14930,72169,91487,0.89


## Module 4: fnmatch + textdistance
Author: Margaret Ma

- **Data Cleaning**: Standardizes postal codes and cleans text fields to ensure uniformity, facilitating accurate comparisons.

- **Blocking Keys Creation**: Forms a block_key for each entry using the initial characters of the address, name, state, and postal code. This key reduces the comparison scope by grouping similar entries together.

- **Matching and Scoring**: Employs fnmatch to find potential name matches and uses textdistance to calculate similarity scores. Matches are confirmed and scored higher if the names are very similar, and a secondary similarity measure is used for others.

- **Output**: Entries with a confidence score above 0.8 are identified as high-confidence matches and saved to a CSV file.
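
One possible reading of the blocking and two-stage scoring is sketched below. The prefix lengths in `block_key`, the use of `fnmatch` as a containment test, and the fixed score of 1.0 for a wildcard hit are all assumptions, and `difflib.SequenceMatcher` stands in for the textdistance measure.

```python
import fnmatch
import re
from difflib import SequenceMatcher


def clean(text) -> str:
    """Lowercase and keep only letters, digits, and single spaces."""
    stripped = re.sub(r"[^a-z0-9 ]", "", str(text).lower())
    return re.sub(r"\s+", " ", stripped).strip()


def block_key(record: dict) -> str:
    # Hypothetical prefix lengths; the README only says "initial characters".
    return "|".join(
        [
            clean(record["address"])[:2],
            clean(record["name"])[:2],
            clean(record["state"]),
            clean(record["postal_code"])[:3],
        ]
    )


def score_pair(left: dict, right: dict) -> float:
    """Score 1.0 when one cleaned name contains the other, else a similarity."""
    l_name, r_name = clean(left["name"]), clean(right["name"])
    # clean() removes glob metacharacters, so the names are safe inside a pattern.
    if l_name and r_name and (
        fnmatch.fnmatchcase(r_name, f"*{l_name}*")
        or fnmatch.fnmatchcase(l_name, f"*{r_name}*")
    ):
        return 1.0
    # Secondary measure: difflib stands in for a textdistance similarity.
    return SequenceMatcher(None, l_name, r_name).ratio()


left = {"name": "Dammanns Garden Company", "address": "5129 S Emerson Ave",
        "state": "IN", "postal_code": "46237"}
right = {"name": "Dammanns Garden Company, LLC", "address": "5129 S Emerson Ave",
         "state": "IN", "postal_code": "46237"}

if block_key(left) == block_key(right):
    print(score_pair(left, right))
```

Pairs would then be kept only when the score exceeds the 0.8 threshold mentioned above.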

In [1]:
import pandas as pd
left_df = pd.read_csv('./data/left_dataset.csv')
right_df = pd.read_csv('./data/right_dataset.csv')

In [2]:
from src.module4_fnmatch_textdistance import fnmatch_textdistance_pipeline

In [3]:
fnmatch_textdistance_pipeline(left_df, right_df, filename = './data/fnmatch_textdistance_submission.csv', threshold = 0.8)

Unnamed: 0,left_dataset,right_dataset,confidence_score
251,8,81912,1.000
293,8,84021,1.000
950,15,57146,0.885
3561,57,32737,1.000
3707,60,39159,1.000
...,...,...,...
6837734,94475,166,1.000
6837742,94475,2373,1.000
6842155,94539,79363,1.000
6842431,94546,83310,1.000


## Module 5: Dedupe
Author: Xinyi Yu

**Data Cleaning**
- Renames the column zip_code in right_df to postal_code for consistency between the two DataFrames.
- Runs a preprocessing routine, the preprocess_data function, which cleans and standardizes the postal_code, state, name, address, and city fields.
- Converts the state, name, address, and city fields to lowercase and strips extra spaces.
- Replaces missing values with the string ‘unknown’.
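
The cleaning steps listed above, together with the conversion to index-keyed dictionaries that the matching step expects, can be sketched as follows. This is not the `preprocess_data` function itself: the helper names are hypothetical, and the postal-code rule (keep the first five digits) is an assumption.

```python
import math
import re

TEXT_FIELDS = ("state", "name", "address", "city")


def is_missing(value) -> bool:
    """True for None, NaN, or blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def standardize_postal(value) -> str:
    # Assumed rule: drop a float suffix such as ".0", keep the first five digits.
    digits = re.sub(r"\D", "", str(value).split(".")[0])
    return digits[:5] if digits else "unknown"


def preprocess_record(raw: dict) -> dict:
    """Clean one row: rename zip_code, normalize text, fill missing values."""
    record = dict(raw)
    # Align the right dataset's column name with the left dataset's.
    if "zip_code" in record:
        record["postal_code"] = record.pop("zip_code")

    clean = {}
    for field in TEXT_FIELDS:
        value = record.get(field)
        # Lowercase and collapse runs of whitespace; fill gaps with 'unknown'.
        clean[field] = "unknown" if is_missing(value) else " ".join(str(value).lower().split())

    postal = record.get("postal_code")
    clean["postal_code"] = "unknown" if is_missing(postal) else standardize_postal(postal)
    return clean


def to_indexed_dict(rows: list) -> dict:
    """Key each cleaned record by its position, like DataFrame.to_dict('index')."""
    return {index: preprocess_record(row) for index, row in enumerate(rows)}


rows = [
    {"name": "  Dammanns  Garden Company LLC ", "address": "5129 S Emerson Ave",
     "city": float("nan"), "state": "IN", "zip_code": 46237.0},
]
print(to_indexed_dict(rows))
```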


**Data Matching**
- Data Conversion: Both DataFrames are converted into dictionaries by their indices for processing.
- Setup for Dedupe: The function initializes the RecordLink tool from the dedupe library, specifying fields like name, address, city, state, and postal_code for string-based record linkage.
- Training the Dedupe Model: Using the dedupe library, the model is trained on manually labeled data to identify likely matches.
- Matching Records: The model applies the trained criteria to find and validate matches between the datasets, using a confidence threshold of 0.8 to ensure high relevance.

**Scoring and Output**
- Generating Output: Matched records are compiled into a list of tuples that include the indices and confidence scores.
- Exporting Results: This data is formatted into a DataFrame with columns for left_dataset, right_dataset, and confidence_score, which is then saved to a CSV file at the specified filepath.

In [1]:
import pandas as pd
left_df = pd.read_csv('./data/left_dataset.csv')
right_df = pd.read_csv('./data/right_dataset.csv')

In [2]:
from src.module5_dedupe import dedupe_pipeline 

In [3]:
dedupe_pipeline(left_df, right_df, filepath = './data/dedupe_submission.csv')

name : dammanns garden company
address : 5129 s emerson ave
city : indianapolis
state : IN
postal_code : 46237

name : dammanns garden company llc
address : 5129 s emerson ave
city : indianapolis
state : IN
postal_code : 46237

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
name : the ultimate bake shoppe of ardmore
address : 120 coulter ave
city : ardmore
state : PA
postal_code : 19003

name : the ultimate bake shoppe of ardmore llc
address : 120 coulter ave suite 3
city : ardmore
state : PA
postal_code : 19003

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
name : quick lube plus
address : 6014 e hillsborough ave ste c
city : tampa
state : FL
postal_code : 33610

name : quick lube plus tires inc
address : 6014 e hillsborough ave ste c
city : tampa
state : FL
postal_code : 33610

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y

Unnamed: 0,left_dataset,right_dataset,confidence_score
0,17919,74720,0.975422
1,64766,753,0.975422
2,49662,10025,0.975422
3,77310,73888,0.975422
4,25853,20450,0.975422
...,...,...,...
5251,72760,37980,0.800886
5252,9831,3532,0.800503
5253,50011,58100,0.800399
5254,42587,67573,0.800152


## Module 6: difflib
Author: Shaoze Li

**Data Preprocessing**:
- Read the CSV files into pandas DataFrames.
- Convert all names and addresses to lowercase strings.
- Remove punctuation from postal codes and zip codes.
- Create a new column called block_key that combines the first two characters of the name, the first character of the address, the state, and the first three characters of the postal/zip code.

**Entity Resolution**:
- Merge the two datasets on the block_key column using an inner join, creating a DataFrame called merged.
- Calculate the similarity between the names and addresses of the matched entities in merged using the SequenceMatcher from difflib.
- Calculate a confidence score for each match based on the average similarity between names and addresses.

**Filtering Results**:
- Filter the merged DataFrame to include only matches with a confidence score greater than a specified threshold (default is 0.8).
- Print the filtered results, including the entity_id from the left dataset, the business_id from the right dataset, and the confidence score.

Overall, this code aims to identify potential matches between entities in two datasets based on similarities in their names and addresses, providing a confidence score to assess the quality of the matches.
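
The block-then-compare flow above can be sketched without pandas. The inner join on `block_key` is replaced here by a dictionary lookup, and the row layout (dictionaries with `entity_id`, `business_id`, `postal_code`, and `zip_code` keys) is an assumption about the input; `resolve` is an illustrative name, not the module's function.

```python
import string
from collections import defaultdict
from difflib import SequenceMatcher

PUNCTUATION = str.maketrans("", "", string.punctuation)


def block_key(name: str, address: str, state: str, postal) -> str:
    """First 2 chars of name + first char of address + state + first 3 of postal."""
    postal = str(postal).translate(PUNCTUATION)
    return f"{name[:2]}{address[:1]}{state}{postal[:3]}"


def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def resolve(left_rows: list, right_rows: list, threshold: float = 0.8) -> list:
    """Return (entity_id, business_id, confidence) for pairs above the threshold."""
    # Group right-hand rows by block key; this plays the role of the inner join.
    right_blocks = defaultdict(list)
    for row in right_rows:
        name, address = str(row["name"]).lower(), str(row["address"]).lower()
        key = block_key(name, address, row["state"], row["zip_code"])
        right_blocks[key].append((row["business_id"], name, address))

    matches = []
    for row in left_rows:
        name, address = str(row["name"]).lower(), str(row["address"]).lower()
        key = block_key(name, address, row["state"], row["postal_code"])
        for business_id, r_name, r_address in right_blocks.get(key, []):
            # Confidence is the mean of the name and address similarities.
            confidence = (ratio(name, r_name) + ratio(address, r_address)) / 2
            if confidence > threshold:
                matches.append((row["entity_id"], business_id, confidence))
    return matches


left_rows = [
    {"entity_id": 1, "name": "The Ultimate Bake Shoppe of Ardmore",
     "address": "120 Coulter Ave", "state": "PA", "postal_code": "19003"},
]
right_rows = [
    {"business_id": 100, "name": "The Ultimate Bake Shoppe of Ardmore LLC",
     "address": "120 Coulter Ave Suite 3", "state": "PA", "zip_code": "19003"},
    {"business_id": 101, "name": "Thai Kitchen",
     "address": "1 Market St", "state": "PA", "zip_code": "19003"},
]

print(resolve(left_rows, right_rows))
```

Both right-hand rows share the left row's block key, but only the first clears the threshold. Blocking limits which pairs are scored; the threshold decides which scored pairs are kept.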

In [1]:
from src.module6_difflib import difflib

In [2]:
def main():
    left_data_path = './data/left_dataset.csv'
    right_data_path = './data/right_dataset.csv'
    difflib(left_data_path, right_data_path)

if __name__ == "__main__":
    main()

         entity_id  business_id  confidence
581             57        32737    0.900000
749             67        77383    0.954545
907             77        28232    0.908163
976             79        62397    0.966667
1028            82        68048    0.827586
...            ...          ...         ...
1132961      94475         2373    0.935897
1133016      94480        77995    0.800680
1133713      94539        79363    0.875000
1133752      94546        83310    0.937500
1134271      94579        80357    0.866944

[7537 rows x 3 columns]


## Module 7: Splink
Author: Riley Xiong

- **Data Cleaning**: Standardizes postal codes and cleanses text fields such as names, addresses, and city names to ensure uniformity across datasets, facilitating more accurate and reliable comparisons.

- **Blocking Keys Creation**: Implements strategic blocking keys by extracting initial characters of address, name, state, and postal code. This method effectively reduces the computational load by narrowing down potential comparisons to groups of similar entries, enhancing processing efficiency.

- **Matching and Scoring**: Utilizes a combination of deterministic rules and probabilistic scoring based on text similarity measures including Levenshtein, Jaro-Winkler, and Jaccard indexes. These methods prioritize entries with higher similarity scores, focusing on detailed comparisons within the blocked groups.

- **Model Training and Prediction**: Runs Expectation Maximization to refine match probabilities, then attempts to apply a low match-probability threshold (0.10) to capture a broad range of potential matches. However, the number of comparisons generated exceeds the available processing capacity, which prevents the results from being assessed and applied efficiently.

- **Output**: Because of the very large number of comparisons, the system struggles to process and confirm high-confidence matches efficiently. The blocking and scoring settings need to be adjusted to reduce the computational load before the output can be used.
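
One way to judge whether a blocking rule is tractable before running the comparison step is to count the candidate pairs it would generate. The sketch below uses only the standard library and is not Splink's API; the two blocking keys and the sample data are hypothetical.

```python
from collections import Counter


def candidate_pairs(left_keys, right_keys) -> int:
    """Number of left-right pairs that share a blocking key."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each block contributes (left rows in block) * (right rows in block).
    return sum(count * right_counts[key] for key, count in left_counts.items())


# (state, postal_code) for a handful of illustrative records.
left = [("FL", "33610"), ("FL", "33612"), ("PA", "19003"), ("IN", "46237")]
right = [("FL", "33610"), ("FL", "33610"), ("FL", "33755"), ("PA", "19003")]

by_state = candidate_pairs(
    [state for state, _ in left],
    [state for state, _ in right],
)
by_state_zip3 = candidate_pairs(
    [state + postal[:3] for state, postal in left],
    [state + postal[:3] for state, postal in right],
)

print(len(left) * len(right), by_state, by_state_zip3)
```

Running the same count on the full datasets for each proposed rule would show which rules keep the pair count within what the machine can handle.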