<a href="https://colab.research.google.com/github/cj2001/senzing_occrp_mapping_demo/blob/main/eda1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# From Raw Data to Resolved Identities: Transforming Your Data for Senzing Entity Resolution
## Step-by-Step Strategies to Prepare Your Data for Accurate, Scalable Identity Matching
#### Written by: Clair J. Sullivan (clair@clairsullivan.com)
#### January 23, 2025

## Introduction

Entity resolution is all about untangling messy data to match records that refer to the same real-world entity, like spotting duplicates of a customer with slightly different names or addresses. It’s the secret sauce for fixing data quality issues, linking information across datasets, and getting a clear, 360-degree view of your customers. Senzing is a plug-and-play AI solution designed to make entity resolution fast, easy, and scalable. It works in real time to uncover connections in your data, and its simple SDK-like interface gives you a clear view of every record linked to a person, company, or other entity type.

In my [previous blog post](https://senzing.com/knowledge-graphs-graph-rag/), I showed the importance of entity resolution for creating entity-resolved knowledge graphs (ERKGs). It built on the [introduction to ERKGs](https://senzing.com/entity-resolved-knowledge-graphs/) by Paco Nathan. While these two posts showed why entity resolution matters when creating knowledge graphs, they only briefly demonstrated how to map real-world data into Senzing. In this post, I will use real-world data to show how to take a CSV file in Python and generate files that Senzing can read and analyze.

As usual, all data and code used in this blog post can be found on my [GitHub profile](https://github.com/cj2001/senzing_occrp_mapping_demo).


In [None]:
# prompt: connect to a subfolder

from google.colab import drive
drive.mount('/content/drive')

# Change into the Drive subfolder that holds the data files
import os
os.chdir('/content/drive/MyDrive/Senzing/data')  # Replace with the path to your own subfolder

# Now you are connected to the subfolder and can perform operations within it
print(os.getcwd())  # Print the current working directory to confirm

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Senzing/data


In [None]:
# prompt: print the contents of this subfolder

for filename in os.listdir(os.getcwd()):
    print(filename)

17000 OCCRP Data - Original Format.csv
17000 Data Records on Open Sanctions Watch List.xlsx
rapidsai-csp-utils
cufile.log
Company Insights and Comments.xlsx
17000 OCCRP Data - Original Format.gsheet
eda1.ipynb


In [None]:
!nvidia-smi

Mon Oct 28 16:52:33 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# Colab warns and provides remediation steps if the GPU is not compatible with RAPIDS.

!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
Installing RAPIDS remaining 24.6.* libraries
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cuda-python<13.0a0,>=12.0 (from cudf-cu12==24.6.*)
  Downloading cuda_python-12.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading cuda_python-12.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.2/24.2 MB 73.4 MB/s eta 0:00:00
Installing collected packages: cuda-python
  Attempting uninstall: cuda-python
    Found existing installation: cuda-python 11.8.3
    Uninstalling cuda-python-11.8.3:
      Successfully uninstalled cuda-python-11.8.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pylibcudf-cu12 24.10.1 requires rmm-cu12==24.10.*, b

In [None]:
import cudf

--------------------------------------------------------------------------------

  CuPy may not function correctly because multiple CuPy packages are installed
  in your environment:

    cupy-cuda11x, cupy-cuda12x

  Follow these steps to resolve this issue:

    1. For all packages listed above, run the following command to remove all
       existing CuPy installations:

         $ pip uninstall <package_name>

      If you previously installed CuPy via conda, also run the following:

         $ conda uninstall cupy

    2. Install the appropriate CuPy package.
       Refer to the Installation Guide for detailed instructions.

         https://docs.cupy.dev/en/stable/install.html

--------------------------------------------------------------------------------



In [None]:
occrp_df = cudf.read_csv('17000 OCCRP Data - Original Format.csv')
occrp_df.head()

In [None]:
import pandas as pd

temp_df = pd.read_excel('17000 Data Records on Open Sanctions Watch List.xlsx')
# print(temp_df.dtypes)
temp_df = temp_df.astype(str)   # Cast every column to str: all are object dtype except one float64
os_df = cudf.DataFrame.from_pandas(temp_df)
os_df.head()

Unnamed: 0,query_name,query_country,result_score,result_name,result_dob,result_country,result_risks,result_sources,result_criteria,result_url
0,UNIVAR,TR,0.9,Necdet Ünüvar,1960-06-06 00:00:00,tr,role.pep,everypolitician|wd_peps|wikidata,person_name_jaro_winkler=0.91|person_name_phon...,https://www.opensanctions.org/entities/Q1973627/
1,DANSKE BANK A/S EESTI FILIAAL,EE,1.0,DANSKE BANK A/S EESTI FILIAAL,,ee,fin.bank,iso9362_bic,name_fingerprint_levenshtein=1.00|name_literal...,https://www.opensanctions.org/entities/bic-FOR...
2,DANSKE BANK A/S EESTI FILIAAL,EE,1.0,DANSKE BANK A/S EESTI FILIAAL,,ee,fin.bank,iso9362_bic,name_fingerprint_levenshtein=1.00|name_literal...,https://www.opensanctions.org/entities/bic-FOR...
3,DANSKE BANK A/S EESTI FILIAAL,EE,0.74,AS CITADELE BANKA EESTI FILIAAL,,ee,fin.bank,iso9362_bic,name_fingerprint_levenshtein=0.82,https://www.opensanctions.org/entities/NK-7Bba...
4,VELASCO INTERNATIONAL INC.,VG,0.76,MANSACO INTERNATIONAL INC.,,ch|vg,corp.offshore,ext_icij_offshoreleaks,name_fingerprint_levenshtein=0.85,https://www.opensanctions.org/entities/icijol-...


In [None]:
temp_df = pd.read_excel('Company Insights and Comments.xlsx')
cic_df = cudf.DataFrame.from_pandas(temp_df)
cic_df.head()

Unnamed: 0,UK Company,Address,Office Location Type,Status,Ownership Type,Parent Name,Parnet Country,Jurisdiction Risk Ranking,Comments
0,Hilux Services LP,"Suite 1105 111 West George Street, Glasgow, G2...",Mail Box Location,,Officer,Solberg Business Ltd,BVI,High Risk,"Scottish limited partnerships (SLPs), structur..."
1,Hilux Services LP,"Suite 1105 111 West George Street, Glasgow, G2...",Mail Box Location,,Officer,Akron Resources Corp,BVI,Medium Risk,The money was moved through the Glasgow-based ...
2,Polux Management LP,"Suite 1098 111 West George Street, Glasgow, G2...",Mail Box Location,,Parent,Solberg Business Ltd,BVI,High Risk,Hilux Services LP and Polux Management LP were...
3,Polux Management LP,"Suite 1098 111 West George Street, Glasgow, G2...",Mail Box Location,,Parent,Akron Resources Corp,BVI,Medium Risk,
4,LCM Alliance LLP,"175 Darkes Lane, Suite B, 2nd Floor, Potters B...",Flex Office Space Location,,Officer,Astrocom AG,"1st, Floor Dekk House, Zippora Street Providen...",Medium Risk,No one answers the door when you press the buz...


In [None]:
import cugraph

In [None]:
occrp_df.dtypes

Unnamed: 0,0
payer_name,object
payer_jurisdiction,object
payer_account,object
source_file,object
amount_orig,float64
id,int64
beneficiary_type,object
beneficiary_core,bool
amount_orig_currency,object
beneficiary_name,object


In [None]:
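# Build a payment graph: payer -> beneficiary, weighted by the original amount.
# Note: cugraph.Graph() is undirected by default; cugraph.Graph(directed=True) would keep payment direction.
oc_G = cugraph.Graph()
oc_G.from_cudf_edgelist(occrp_df, source='payer_name', destination='beneficiary_name', weight='amount_orig')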
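
Two properties of this edge list are worth a look before ranking: the same payer/beneficiary pair may appear in many rows, and `amount_orig` is a float column that this notebook does not check for missing, zero, or negative values. The pure-Python sketch below shows one way to aggregate and sanity-check such edges. The helper `aggregate_edges` and the sample rows are hypothetical, and dropping non-positive amounts is an assumption (they could be legitimate reversals), so treat this as an illustration rather than the method used here.

```python
from collections import defaultdict

def aggregate_edges(rows):
    """Sum amounts per (payer, beneficiary) pair, keeping only positive, non-missing amounts."""
    totals = defaultdict(float)
    skipped = 0
    for payer, beneficiary, amount in rows:
        # amount != amount is True only for NaN
        if amount is None or amount != amount or amount <= 0:
            skipped += 1  # Assumption: non-positive or missing amounts are excluded
            continue
        totals[(payer, beneficiary)] += amount
    return dict(totals), skipped

# Made-up rows for illustration only
rows = [
    ("A LTD", "B LLP", 100.0),
    ("A LTD", "B LLP", 50.0),
    ("C INC", "B LLP", -20.0),
    ("C INC", "A LTD", float("nan")),
]
edges, skipped = aggregate_edges(rows)
print(edges, skipped)
```

The count of skipped rows is returned so that excluded transactions can be inspected instead of disappearing silently.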



In [None]:
pagerank_scores = cugraph.pagerank(oc_G)
type(pagerank_scores)



In [None]:
# prompt: print pagerank_scores ordered by highest score

# Sort pagerank_scores by 'pagerank' column in descending order
sorted_pagerank_scores = pagerank_scores.sort_values('pagerank', ascending=False)

# Print the sorted scores
print(sorted_pagerank_scores)

      pagerank                          vertex
2922  0.140979                LCM ALLIANCE LLP
2920  0.134917             METASTAR INVEST LLP
2921  0.130821               HILUX SERVICES LP
2923  0.057363             POLUX MANAGEMENT LP
2925  0.014787                 KG COMMERCE LLP
...        ...                             ...
1852 -0.001543                      OOO PRODOS
1816 -0.001566  BAYBURT GROUP CONSTRUCTION LTD
896  -0.005208                          MODIAR
942  -0.005964                  SECURO LIMITED
2905 -0.012555                     SECURO LTD.

[3880 rows x 2 columns]
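
The bottom of this ranking already hints at why entity resolution matters: `SECURO LIMITED` and `SECURO LTD.` appear as separate vertices, so if they are the same company, its score is split across two nodes. The sketch below groups vertex names by a crude normalized key to surface such candidates. The helper names and the tiny suffix table are hypothetical, and this string cleanup only helps spot the problem; it is not a substitute for what Senzing does.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny table of legal-suffix spellings
SUFFIXES = {"LIMITED": "LTD", "LTD": "LTD", "INCORPORATED": "INC", "INC": "INC"}

def name_key(name):
    """Uppercase, replace punctuation with spaces, and canonicalize a trailing legal suffix."""
    tokens = re.sub(r"[^\w\s]", " ", name.upper()).split()
    if tokens and tokens[-1] in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1]]
    return " ".join(tokens)

def candidate_duplicates(names):
    """Return {key: sorted names} for every key shared by more than one distinct name."""
    groups = defaultdict(set)
    for n in names:
        groups[name_key(n)].add(n)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

# Vertex names taken from the PageRank output above
vertices = ["SECURO LIMITED", "SECURO LTD.", "MODIAR", "OOO PRODOS"]
print(candidate_duplicates(vertices))
```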


In [None]:
occrp_df.shape, os_df.shape, cic_df.shape

((16940, 23), (1183, 10), (740, 9))

In [None]:
# prompt: get number of distinct values of occrp_df['beneficiary_type'] and occrp_df['id']

print(f"Number of distinct beneficiary_type values: {occrp_df['beneficiary_type'].nunique()}")
print(f"Number of distinct id values: {occrp_df['id'].nunique()}")

Number of distinct beneficiary_type values: 3
Number of distinct id values: 16940


In [None]:
# prompt: print distinct values of occrp_df['beneficiary_type']

print(occrp_df['beneficiary_type'].unique())

0    Company
1     Person
2    Invalid
Name: beneficiary_type, dtype: object
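
The three `beneficiary_type` values suggest how these rows could be turned into the one-JSON-record-per-line files that Senzing reads. The following is a minimal sketch under stated assumptions: the helpers `beneficiary_to_senzing` and `to_json_lines` and the data source code `OCCRP` are hypothetical, the attribute names (`DATA_SOURCE`, `RECORD_ID`, `RECORD_TYPE`, `NAME_ORG`, `NAME_FULL`) follow my reading of the Senzing entity specification, the record ID built from the transaction `id` is an illustrative choice, and skipping `Invalid` rows is an assumption about how to treat them.

```python
import json

# Hypothetical data source code for these records
DATA_SOURCE = "OCCRP"

# Assumed mapping from beneficiary_type to (RECORD_TYPE, name attribute)
TYPE_MAP = {
    "Company": ("ORGANIZATION", "NAME_ORG"),
    "Person": ("PERSON", "NAME_FULL"),
}

def beneficiary_to_senzing(row):
    """Return a Senzing-style dict for one row, or None if the row is skipped."""
    mapping = TYPE_MAP.get(row.get("beneficiary_type"))
    name = (row.get("beneficiary_name") or "").strip()
    if mapping is None or not name:
        return None  # Assumption: skip 'Invalid' types and rows without a name
    record_type, name_attr = mapping
    return {
        "DATA_SOURCE": DATA_SOURCE,
        "RECORD_ID": f"{row['id']}-beneficiary",  # Illustrative ID scheme
        "RECORD_TYPE": record_type,
        name_attr: name,
    }

def to_json_lines(rows):
    """Serialize the mappable rows as JSON Lines, one record per line."""
    records = (beneficiary_to_senzing(r) for r in rows)
    return "\n".join(json.dumps(rec, ensure_ascii=False) for rec in records if rec is not None)

# Made-up sample rows for illustration only
sample_rows = [
    {"id": 1, "beneficiary_type": "Company", "beneficiary_name": "HILUX SERVICES LP"},
    {"id": 2, "beneficiary_type": "Invalid", "beneficiary_name": "???"},
]
print(to_json_lines(sample_rows))
```

Keeping the type-to-attribute mapping in one dictionary makes it easy to adjust if the `Invalid` rows turn out to be recoverable.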
