# Record Linkage - Sorted Neighborhood Index

In this notebook, we use Record Linkage to match the JobPostings and Orbis datasets using only a Sorted Neighborhood Index.

The notebook is organized as follows:

0. Import libraries and define constants
1. Upload parts of JobPostings dataset
2. Upload parts of Orbis dataset
3. Records to match
4. Sorted Neighbourhood Index with addresses
5. Sorted Neighbourhood Index without addresses
6. Save processed data

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
import pandas as pd
import recordlinkage
# Importing jellyfish.cjellyfish checks that the C version of the
# string comparison used by recordlinkage is installed
import jellyfish.cjellyfish

from linkage.model.utils import save_dataframe, read_dataframe
from linkage.model.record_matching import Linking, print_matched_counts, print_unmatched_counts
from linkage.model.record_linkage_utils import CompareZipCodes, CompareString
from linkage.model.examine_dataframe import print_dataframe_length

In [None]:
# Two types of data: all of it, or only the first part (part01.rar)
# part01 is used during implementation
# to check that everything is working as it should
TYPE = 'all'  # 'all' or 'part01'

# 'std' for standardized, 'std_dict_40k' for dictionary cleaning with the 40k most common words
NOTE = 'std'

In [None]:
# Specify paths to data directories
PROCESSED_JP_DIR = f"../data/processed/jobpostings"
PROCESSED_ORBIS_DIR = f"../data/processed/orbis/{TYPE}"
PROCESSED_DATA_DIR = f"../data/processed/linkage/{TYPE}"

# Specify file names to read from
JP_FILE = f'jobpostings_test_sample_std_dict_40k.csv'
ORBIS_NAME_FILE = f'orbis_german_bvid_name_processed_{TYPE}_{NOTE}.csv'
ORBIS_ADDR_FILE = f'orbis_german_all_addresses_processed_{TYPE}_{NOTE}.csv' #'orbis_german_all_addresses_clean.csv'

LINKED_DF = "linked_matches_sni.csv"

# Columns
# JobPostings
JP_INDEX = 'jobposting_id'
JP_COMPANY_NAME, JP_COMPANY_NAME_STANDARDIZED, JP_COMPANY_NAME_DICT_CLEANED = 'company', 'company_standard', 'company_dict_clean'
JP_COMPANY_CITY, JP_COMPANY_ZIP, JP_COMPANY_STATE = 'company_city', 'company_zipcode', 'company_state'
JP_JOB_CITY, JP_JOB_ZIP, JP_JOB_STATE = 'job_city', 'job_zipcode', 'job_state'

# Orbis
ORBIS_INDEX = 'BvD ID number'
ORBIS_COMPANY_NAME, ORBIS_COMPANY_NAME_STANDARDIZED, ORBIS_COMPANY_NAME_DICT_CLEANED = 'NAME', 'NAME_standard', 'NAME_dict_clean'
ORBIS_COMPANY_CITY, ORBIS_COMPANY_ZIP, ORBIS_COMPANY_STATE = 'City (native)', 'Postcode', 'Region in country'

# Files for the partial results 
SORTED_NN_MATCHING_STD_COMPANY = f"linked_matches_sni_std_company_{TYPE}_{NOTE}.csv"
SORTED_NN_MATCHING_STD_JOB = f"linked_matches_sni_std_job_{TYPE}_{NOTE}.csv"

SORTED_NN_MATCHING_ORG_COMPANY = f"linked_matches_sni_org_company_{TYPE}_{NOTE}.csv"
SORTED_NN_MATCHING_ORG_JOB = f"linked_matches_sni_org_job_{TYPE}_{NOTE}.csv"

SORTED_NN_MATCHING_STD = f"linked_matches_sni_std_{TYPE}_{NOTE}.csv"
# The second pass gets its own file name so it does not overwrite the first pass
SORTED_NN_MATCHING_STD_2 = f"linked_matches_sni_std_2_{TYPE}_{NOTE}.csv"

SORTED_NN_MATCHING_ORG = f"linked_matches_sni_org_{TYPE}_{NOTE}.csv"
SORTED_NN_MATCHING_ORG_2 = f"linked_matches_sni_org_2_{TYPE}_{NOTE}.csv"

NOT_MATCHED = "not_matched_sni.txt"


## 1. Upload parts of JobPostings dataset

The preprocessed JobPostings dataset is stored under the path:
```python
../data/processed/jobpostings/
```

The data is read into a pandas **DataFrame**.



In [None]:
df_jp = read_dataframe(PROCESSED_JP_DIR, JP_FILE, JP_INDEX)
df_jp.head()

## 2. Upload parts of Orbis dataset

The preprocessed Orbis dataset is stored under the path:
```python
../data/processed/orbis/
```

The data is read into pandas **DataFrames**.



### Read the company name dataframe

We read the file containing Orbis company names.

In [None]:
df_orbis_name = read_dataframe(PROCESSED_ORBIS_DIR, ORBIS_NAME_FILE)
df_orbis_name.head()

### Read the company addresses dataframe

We read the file containing Orbis company addresses.

In [None]:
df_orbis_addresses = read_dataframe(PROCESSED_ORBIS_DIR, ORBIS_ADDR_FILE)
df_orbis_addresses.head()

### Join the Orbis dataframes

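We join the Orbis name and address dataframes on the BvD ID number to create one dataframe.

Note: the BvD ID number in the addresses dataframe is not unique, so the join can return several rows for one company.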
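The effect of a non-unique join key can be pictured with a small plain-Python sketch. The IDs, names, and cities below are made up for illustration and do not come from Orbis. The list comprehension stands in for an inner join.

```python
# Hypothetical toy data; the IDs and values are invented for illustration.
names = {"DE001": "acme gmbh", "DE002": "beta ag"}
addresses = [
    ("DE001", "Berlin"),
    ("DE001", "Hamburg"),  # second address for the same ID
    ("DE002", "Munich"),
]

# Inner join on the ID: one output row per matching (name, address) combination.
joined = [
    (bvd_id, names[bvd_id], city)
    for bvd_id, city in addresses
    if bvd_id in names
]

for row in joined:
    print(row)
```

The company with two addresses appears twice in the result, which is why the ID alone cannot identify a row after the join.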

In [None]:
df_orbis = df_orbis_name.merge(df_orbis_addresses, on=ORBIS_INDEX, how='inner')
df_orbis.head()

### Check the dataframe

We check some values of the dataframes.

In [None]:
print_dataframe_length(df_orbis)

In [None]:
# TODO: do in orbis-name notebook
df_orbis.rename(columns={"company_standard": "NAME_standard", "company_dict_clean": "NAME_dict_clean"}, inplace=True)

In [None]:
# Check the states in Orbis
df_orbis[ORBIS_COMPANY_STATE].unique()

In [None]:
# Check the states in JobPostings
df_jp[JP_COMPANY_STATE].unique()

In [None]:
#df_orbis = df_orbis.head(1000)

### Orbis index

Rename the Orbis index. The _BvD ID_ cannot serve as the index because it is not unique.

In [None]:
# Name the index for joining
# The JP dataset has a unique index, so it is set when the .csv file is read
df_orbis.index.name = 'orbis_index'

## 3. Records to match

Print the number of unmatched records and initialize a linking class.

In [None]:
print_unmatched_counts(df_jp, JP_COMPANY_NAME_STANDARDIZED)

In [None]:
# Initialize class containing methods for record linkage
linking = Linking(JP_INDEX, JP_COMPANY_NAME, JP_COMPANY_NAME_STANDARDIZED, JP_COMPANY_NAME_DICT_CLEANED,
                  JP_COMPANY_CITY, JP_COMPANY_ZIP, JP_COMPANY_STATE,
                  JP_JOB_CITY, JP_JOB_ZIP, JP_JOB_STATE,
                  ORBIS_INDEX, ORBIS_COMPANY_NAME, ORBIS_COMPANY_NAME_STANDARDIZED, ORBIS_COMPANY_NAME_DICT_CLEANED,
                  ORBIS_COMPANY_CITY, ORBIS_COMPANY_ZIP, ORBIS_COMPANY_STATE)


## 4. Sorted Neighbourhood Index with addresses

Sorted Neighborhood Index using addresses for attribute comparison.
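As a rough illustration of the general sorted neighbourhood idea, the sketch below ranks the distinct sorting keys of both datasets in one sorted order and pairs records whose ranks are close. It is a naive plain-Python version written for this explanation, not the implementation used by `recordlinkage`. The function name and the toy records are assumptions.

```python
def sorted_neighbourhood_pairs(left, right, window=3):
    """Return candidate (left_id, right_id) pairs.

    left, right: dicts mapping a record id to its sorting key (a string).
    window: odd integer; keys whose ranks differ by at most
            (window - 1) // 2 are treated as neighbours.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = (window - 1) // 2

    # Rank every distinct key from both datasets in one sorted order.
    keys = sorted(set(left.values()) | set(right.values()))
    rank = {key: position for position, key in enumerate(keys)}

    # Naive quadratic scan; fine for a toy example only.
    pairs = []
    for left_id, left_key in left.items():
        for right_id, right_key in right.items():
            if abs(rank[left_key] - rank[right_key]) <= half:
                pairs.append((left_id, right_id))
    return pairs


# Hypothetical toy records (ids and names are made up).
jobpostings = {"jp1": "acme gmbh", "jp2": "zeta ag"}
orbis = {0: "acme gmbh", 1: "acne gmbh", 2: "beta kg", 3: "zeta ag berlin"}

print(sorted_neighbourhood_pairs(jobpostings, orbis, window=3))
```

Here `jp2` is paired with `beta kg` only because the two keys are adjacent in the sorted order. Candidate pairs are therefore not matches yet, and a comparison step has to follow.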

We index on:

1. Standardized company name and company address
1. Standardized company name and job address
1. Original company name and company address
1. Original company name and job address

### Standardized company name and company addresses

Create SNI on standardized company name and filter matches using company addresses.
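The internals of `compare_similar_records` and `merge_dataframes_on_linkage_result` are not shown in this notebook. The sketch below only illustrates the general idea of scoring candidate pairs on name and address and keeping the high-scoring ones. The scoring rule, the thresholds, and the use of `difflib` are assumptions made for this example.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def score_pair(jp_record, orbis_record, name_threshold=0.85):
    """Count how many of three features agree (hypothetical scoring rule)."""
    name_ok = name_similarity(jp_record["name"], orbis_record["name"]) >= name_threshold
    zip_ok = jp_record["zip"] == orbis_record["zip"]
    city_ok = jp_record["city"] == orbis_record["city"]
    return int(name_ok) + int(zip_ok) + int(city_ok)


# Hypothetical toy records (values are made up).
jp = {"name": "acme gmbh", "zip": "10115", "city": "berlin"}
candidates = {
    0: {"name": "acme gmbh", "zip": "10115", "city": "berlin"},
    1: {"name": "acne gmbh", "zip": "20095", "city": "hamburg"},
}

# Keep candidates where at least two of the three features agree.
matches = [orbis_id for orbis_id, rec in candidates.items() if score_pair(jp, rec) >= 2]
print(matches)
```

The second candidate has a very similar name but a different zip code and city, so the address comparison filters it out.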

In [None]:
# Create index
indexer = recordlinkage.SortedNeighbourhoodIndex(JP_COMPANY_NAME_STANDARDIZED, ORBIS_COMPANY_NAME, window=7) 

# Make record pairs
candidate_links = indexer.index(df_jp, df_orbis)

print(f'Num of candidates: {len(candidate_links)}\n')

In [None]:
# Compare fields of candidate pairs
features_name = linking.compare_similar_records(df_jp, df_orbis, candidate_links, addr_type='company')

# Filter candidate pairs
df_merge_name_result = linking.merge_dataframes_on_linkage_result(features_name, df_jp, df_orbis, addr_type='company')

In [None]:
df_merge_name_result.head()

In [None]:
# Save dataframe to a csv file
save_dataframe(df_merge_name_result, PROCESSED_DATA_DIR, SORTED_NN_MATCHING_STD_COMPANY)

In [None]:
# Process matched and not matched records
# Add matches to a new df
matched_df = df_merge_name_result.copy()
matched_df.set_index([JP_INDEX, ORBIS_INDEX], inplace=True)

# Remove matches from old JobPostings dataframe
df_jp.drop(df_merge_name_result[JP_INDEX], axis=0, inplace=True)

print_matched_counts(matched_df, JP_COMPANY_NAME_STANDARDIZED)
print_unmatched_counts(df_jp, JP_COMPANY_NAME_STANDARDIZED)

### Standardized company name and job addresses

Create SNI on standardized company name and filter matches using job addresses.

In [None]:
# Create index
indexer = recordlinkage.SortedNeighbourhoodIndex(JP_COMPANY_NAME_STANDARDIZED, ORBIS_COMPANY_NAME, window=7)

# Make record pairs
candidate_links = indexer.index(df_jp, df_orbis)

print(f'Num of candidates: {len(candidate_links)}\n')

In [None]:
# Compare fields of candidate pairs
features_name = linking.compare_similar_records(df_jp, df_orbis, candidate_links, addr_type='job')

# Filter candidate pairs
df_merge_name_result = linking.merge_dataframes_on_linkage_result(features_name, df_jp, df_orbis, addr_type='job')

In [None]:
df_merge_name_result.head()

In [None]:
# Save dataframe to a csv file
save_dataframe(df_merge_name_result, PROCESSED_DATA_DIR, SORTED_NN_MATCHING_STD_JOB)

In [None]:
# Process matched and not matched records
linking.process_matched(df_jp, matched_df, df_merge_name_result, JP_COMPANY_NAME_STANDARDIZED)

### Original company name and company addresses

Create SNI on original company name and filter matches using company addresses.

In [None]:
# Create index
indexer = recordlinkage.SortedNeighbourhoodIndex(JP_COMPANY_NAME, ORBIS_COMPANY_NAME, window=7) 

# Make record pairs
candidate_links = indexer.index(df_jp, df_orbis)

print(f'Num of candidates: {len(candidate_links)}\n')

In [None]:
# Compare fields of candidate pairs
features_name = linking.compare_similar_records(df_jp, df_orbis, candidate_links, addr_type='company')

# Filter candidate pairs
df_merge_name_result = linking.merge_dataframes_on_linkage_result(features_name, df_jp, df_orbis, addr_type='company')

In [None]:
df_merge_name_result.head()

In [None]:
# Save dataframe to a csv file
save_dataframe(df_merge_name_result, PROCESSED_DATA_DIR, SORTED_NN_MATCHING_ORG_COMPANY)

In [None]:
# Process matched and not matched records
linking.process_matched(df_jp, matched_df, df_merge_name_result, JP_COMPANY_NAME_STANDARDIZED)

### Original company name and job addresses

Create SNI on original company name and filter matches using job addresses.

In [None]:
# Create index
indexer = recordlinkage.SortedNeighbourhoodIndex(JP_COMPANY_NAME, ORBIS_COMPANY_NAME, window=7)

# Make record pairs
candidate_links = indexer.index(df_jp, df_orbis)

print(f'Num of candidates: {len(candidate_links)}\n')

In [None]:
# Compare fields of candidate pairs
features_name = linking.compare_similar_records(df_jp, df_orbis, candidate_links, addr_type='job')

# Filter candidate pairs
df_merge_name_result = linking.merge_dataframes_on_linkage_result(features_name, df_jp, df_orbis, addr_type='job')

In [None]:
df_merge_name_result.head()

In [None]:
# Save dataframe to a csv file
save_dataframe(df_merge_name_result, PROCESSED_DATA_DIR, SORTED_NN_MATCHING_ORG_JOB)

In [None]:
# Process matched and not matched records
linking.process_matched(df_jp, matched_df, df_merge_name_result, JP_COMPANY_NAME_STANDARDIZED)

## 5. Sorted Neighbourhood Index without addresses

Sorted Neighborhood Index not using addresses for attribute comparison.

We index on:

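1. Standardized company name (first pass)
1. Standardized company name (second pass)
1. Original company name (first pass)
1. Original company name (second pass)

We run each index twice to get more matches. Records matched in the first pass are removed from the JobPostings dataframe, so in the second pass other company names can fall inside the sorting window and become candidate pairs.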
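One way in which a repeated pass can surface new candidates is sketched below in plain Python. The pairing function is a naive stand-in for sorted neighbourhood indexing, and the records are made up. It also assumes that every first-pass candidate is accepted as a match, which the real comparison step need not do.

```python
def candidate_pairs(left, right, window=3):
    """Naive sorted-neighbourhood pairing on rank distance of the sorting key."""
    half = (window - 1) // 2
    keys = sorted(set(left.values()) | set(right.values()))
    rank = {key: position for position, key in enumerate(keys)}
    return [
        (left_id, right_id)
        for left_id, left_key in left.items()
        for right_id, right_key in right.items()
        if abs(rank[left_key] - rank[right_key]) <= half
    ]


# Hypothetical toy records (ids and names are made up).
jobpostings = {"jp1": "acme ag berlin", "jp2": "acme ag berlin mitte"}
orbis = {0: "acme ag"}

first_pass = candidate_pairs(jobpostings, orbis)
print(first_pass)

# Assume the comparison step accepted every first-pass candidate;
# drop those JobPostings records before the next pass.
matched_ids = {left_id for left_id, _ in first_pass}
remaining = {k: v for k, v in jobpostings.items() if k not in matched_ids}

second_pass = candidate_pairs(remaining, orbis)
print(second_pass)
```

In the first pass `jp1` sits between `jp2` and the Orbis record in the sorted order, so `jp2` is outside the window. Once `jp1` is removed, `jp2` becomes a direct neighbour of the Orbis record and is paired with it.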

### Standardized company name 

Create SNI on standardized company name without using addresses.

In [None]:
# Create index
indexer = recordlinkage.SortedNeighbourhoodIndex(JP_COMPANY_NAME_STANDARDIZED, ORBIS_COMPANY_NAME, window=7) 

# Make record pairs
candidate_links = indexer.index(df_jp, df_orbis)

print(f'Num of candidates: {len(candidate_links)}\n')

In [None]:
# Compare fields of candidate pairs
features_name = linking.compare_similar_records(df_jp, df_orbis, candidate_links)

# Filter candidate pairs
df_merge_name_result = linking.merge_dataframes_on_linkage_result(features_name, df_jp, df_orbis, score_threshold=1)

In [None]:
df_merge_name_result.head()

In [None]:
# Save dataframe to a csv file
save_dataframe(df_merge_name_result, PROCESSED_DATA_DIR, SORTED_NN_MATCHING_STD)

In [None]:
# Process matched and not matched records
linking.process_matched(df_jp, matched_df, df_merge_name_result, JP_COMPANY_NAME_STANDARDIZED)

### Standardized company name (second pass)

Create SNI on standardized company name without using addresses.

In [None]:
# Create index
indexer = recordlinkage.SortedNeighbourhoodIndex(JP_COMPANY_NAME_STANDARDIZED, ORBIS_COMPANY_NAME, window=7)

# Make record pairs
candidate_links = indexer.index(df_jp, df_orbis)

print(f'Num of candidates: {len(candidate_links)}\n')

In [None]:
# Compare fields of candidate pairs
features_name = linking.compare_similar_records(df_jp, df_orbis, candidate_links)

# Filter candidate pairs
df_merge_name_result = linking.merge_dataframes_on_linkage_result(features_name, df_jp, df_orbis, score_threshold=1)

In [None]:
df_merge_name_result.head()

In [None]:
# Save dataframe to a csv file
save_dataframe(df_merge_name_result, PROCESSED_DATA_DIR, SORTED_NN_MATCHING_STD_2)

In [None]:
# Process matched and not matched records
linking.process_matched(df_jp, matched_df, df_merge_name_result, JP_COMPANY_NAME_STANDARDIZED)

### Original company name

Create SNI on original company name without using addresses.

In [None]:
# Create index
indexer = recordlinkage.SortedNeighbourhoodIndex(JP_COMPANY_NAME, ORBIS_COMPANY_NAME, window=7) 

# Make record pairs
candidate_links = indexer.index(df_jp, df_orbis)

print(f'Num of candidates: {len(candidate_links)}\n')

In [None]:
# Compare fields of candidate pairs
features_name = linking.compare_similar_records(df_jp, df_orbis, candidate_links)

# Filter candidate pairs
df_merge_name_result = linking.merge_dataframes_on_linkage_result(features_name, df_jp, df_orbis, score_threshold=1)

In [None]:
df_merge_name_result.head()

In [None]:
# Save dataframe to a csv file
save_dataframe(df_merge_name_result, PROCESSED_DATA_DIR, SORTED_NN_MATCHING_ORG)

In [None]:
# Process matched and not matched records
linking.process_matched(df_jp, matched_df, df_merge_name_result, JP_COMPANY_NAME_STANDARDIZED)

### Original company name (second pass)

Create SNI on original company name without using addresses.

In [None]:
# Create index
indexer = recordlinkage.SortedNeighbourhoodIndex(JP_COMPANY_NAME, ORBIS_COMPANY_NAME, window=7)

# Make record pairs
candidate_links = indexer.index(df_jp, df_orbis)

print(f'Num of candidates: {len(candidate_links)}\n')

In [None]:
# Compare fields of candidate pairs
features_name = linking.compare_similar_records(df_jp, df_orbis, candidate_links)

# Filter candidate pairs
df_merge_name_result = linking.merge_dataframes_on_linkage_result(features_name, df_jp, df_orbis, score_threshold=1)

In [None]:
df_merge_name_result.head()

In [None]:
# Save dataframe to a csv file
save_dataframe(df_merge_name_result, PROCESSED_DATA_DIR, SORTED_NN_MATCHING_ORG_2)

In [None]:
# Process matched and not matched records
linking.process_matched(df_jp, matched_df, df_merge_name_result, JP_COMPANY_NAME_STANDARDIZED)

## 6. Save processed data

The processed data is stored in csv files under the path:
```python
../data/processed/linkage/
```

### Save matched

In [None]:
save_dataframe(matched_df, PROCESSED_DATA_DIR, LINKED_DF)

### Save not-matched

In [None]:
save_dataframe(pd.DataFrame(df_jp[JP_COMPANY_NAME].unique()), PROCESSED_DATA_DIR, NOT_MATCHED)