# RegTech Session: KYC and sanctions

This session presents an example focusing on the development of sanctions and their place in the modern regulatory industry. We cover the following topics:

* KYC and its intersection with sanctions lists (covered in the presentation)
* Landscape of sanctions: individual vs. sectoral
* Relevant databases of sanctioned entities: OFAC, etc.
* Solution 1: Sanctions on entities
    * Building identity matching software
    * Choosing the intervention threshold
    * Limitations and challenges
* Solution 2: Sectoral sanctions
    * Building sector classification software
    * Clustering
    * ...




In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import random

#from sklearn.cluster import KMeans
#from sklearn.decomposition import PCA


## Example 1: KYC and sanction lists

### Aim and description of our exercise

In this example, we download a list of sanctioned entities (persons, companies, organizations) and implement a simple matching function that returns string similarity scores calculated by several methods.

### Relevant sources of sanctions lists

* [Office of Foreign Assets Control (OFAC)](https://sanctionssearch.ofac.treas.gov/Details.aspx?id=13087)
* [EU Sanctions Map](https://www.sanctionsmap.eu)
* [UK Sanctions List](https://www.gov.uk/government/publications/the-uk-sanctions-list)
* Private sources, e.g., [www.opensanctions.org](https://www.opensanctions.org/)

Let us use one of these sources to download a list of sanctioned entities.

See `names.txt` on [the OpenSanctions default dataset page](https://www.opensanctions.org/datasets/default/).

In [3]:
! wget https://data.opensanctions.org/datasets/20250130/default/names.txt?v=20250130065302-gpf

'wget' is not recognized as an internal or external command,
operable program or batch file.


In [16]:
target_url = 'https://data.opensanctions.org/datasets/20250130/default/names.txt?v=20250130065302-gpf'


In [50]:
import urllib3

http = urllib3.PoolManager()
response = http.request('GET', target_url)
data = response.data.decode('utf-8')

In [53]:
type(data)

str

In [52]:
data[:10]

'ООО "ЗЕЛИН'

In [46]:
import urllib.request
data = urllib.request.urlopen(target_url)

In [None]:
lines = []
for line in data:
    lines.append(line)

In [49]:
lines[:10]

[b'\xd0\x9e\xd0\x9e\xd0\x9e "\xd0\x97\xd0\x95\xd0\x9b\xd0\x98\xd0\x9d\xd0\xa1\xd0\x9a\xd0\x98\xd0\x99 \xd0\x93\xd0\xa0\xd0\xa3\xd0\x9f\xd0\x9f"\n',
 b'Tovarystvo z obmezhenoiu vidpovidalnistiu "Zelinskyi Hrupp"\n',
 b'\xd0\xa2\xd0\xbe\xd0\xb2\xd0\xb0\xd1\x80\xd0\xb8\xd1\x81\xd1\x82\xd0\xb2\xd0\xbe \xd0\xb7 \xd0\xbe\xd0\xb1\xd0\xbc\xd0\xb5\xd0\xb6\xd0\xb5\xd0\xbd\xd0\xbe\xd1\x8e \xd0\xb2\xd1\x96\xd0\xb4\xd0\xbf\xd0\xbe\xd0\xb2\xd1\x96\xd0\xb4\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x96\xd1\x81\xd1\x82\xd1\x8e "\xd0\x97\xd0\xb5\xd0\xbb\xd1\x96\xd0\xbd\xd1\x81\xd1\x8c\xd0\xba\xd0\xb8\xd0\xb9 \xd0\x93\xd1\x80\xd1\x83\xd0\xbf\xd0\xbf"\n',
 b'Limited Liability Company "Zelinsky Group"\n',
 b'\xd0\x9e\xd0\xb1\xd1\x89\xd0\xb5\xd1\x81\xd1\x82\xd0\xb2\xd0\xbe \xd1\x81 \xd0\xbe\xd0\xb3\xd1\x80\xd0\xb0\xd0\xbd\xd0\xb8\xd1\x87\xd0\xb5\xd0\xbd\xd0\xbd\xd0\xbe\xd0\xb9 \xd0\xbe\xd1\x82\xd0\xb2\xd0\xb5\xd1\x82\xd1\x81\xd1\x82\xd0\xb2\xd0\xb5\xd0\xbd\xd0\xbd\xd0\xbe\xd1\x81\xd1\x82\xd1\x8c\xd1\x8e "\xd0\x97

In [45]:
import urllib.request  # urllib2 exists only in Python 2; urllib.request replaces it in Python 3
lines = []

data = urllib.request.urlopen(target_url)  # file-like object that yields bytes
for line in data:  # files are iterable
    lines.append(line.decode('utf-8').rstrip('\n'))  # decode each line explicitly as UTF-8

In [None]:
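import requests
response = requests.get(target_url)
# NOTE: requests appears to decode this response as ISO-8859-1 rather than UTF-8,
# so response.text comes back garbled for non-Latin names.
data = response.text
lines = []
for line in data.split('\n'):
    lines.append(line)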

In [37]:
lines[:15]

['Ð\x9eÐ\x9eÐ\x9e "Ð\x97Ð\x95Ð\x9bÐ\x98Ð\x9dÐ¡Ð\x9aÐ\x98Ð\x99 Ð\x93Ð\xa0Ð£Ð\x9fÐ\x9f"',
 'Tovarystvo z obmezhenoiu vidpovidalnistiu "Zelinskyi Hrupp"',
 'Ð¢Ð¾Ð²Ð°Ñ\x80Ð¸Ñ\x81Ñ\x82Ð²Ð¾ Ð· Ð¾Ð±Ð¼ÐµÐ¶ÐµÐ½Ð¾Ñ\x8e Ð²Ñ\x96Ð´Ð¿Ð¾Ð²Ñ\x96Ð´Ð°Ð»Ñ\x8cÐ½Ñ\x96Ñ\x81Ñ\x82Ñ\x8e "Ð\x97ÐµÐ»Ñ\x96Ð½Ñ\x81Ñ\x8cÐºÐ¸Ð¹ Ð\x93Ñ\x80Ñ\x83Ð¿Ð¿"',
 'Limited Liability Company "Zelinsky Group"',
 'Ð\x9eÐ±Ñ\x89ÐµÑ\x81Ñ\x82Ð²Ð¾ Ñ\x81 Ð¾Ð³Ñ\x80Ð°Ð½Ð¸Ñ\x87ÐµÐ½Ð½Ð¾Ð¹ Ð¾Ñ\x82Ð²ÐµÑ\x82Ñ\x81Ñ\x82Ð²ÐµÐ½Ð½Ð¾Ñ\x81Ñ\x82Ñ\x8cÑ\x8e "Ð\x97ÐµÐ»Ð¸Ð½Ñ\x81ÐºÐ¸Ð¹ Ð\x93Ñ\x80Ñ\x83Ð¿Ð¿"',
 'SANAVBARI NIKITENKO',
 'Ð\x9eÑ\x82ÐºÑ\x80Ñ\x8bÑ\x82Ð¾Ðµ Ð°ÐºÑ\x86Ð¸Ð¾Ð½ÐµÑ\x80Ð½Ð¾Ðµ Ð¾Ð±Ñ\x89ÐµÑ\x81Ñ\x82Ð²Ð¾ "Ð\xadÐ»ÐµÐºÑ\x82Ñ\x80Ð¾Ñ\x81Ñ\x82Ð°Ð»Ñ\x8cÑ\x81ÐºÐ¸Ð¹ Ñ\x85Ð¸Ð¼Ð¸ÐºÐ¾-Ð¼ÐµÑ\x85Ð°Ð½Ð¸Ñ\x87ÐµÑ\x81ÐºÐ¸Ð¹ Ð·Ð°Ð²Ð¾Ð´ Ð¸Ð¼ÐµÐ½Ð¸ Ð\x9d.Ð\x94.Ð\x97ÐµÐ»Ð¸Ð½Ñ\x81ÐºÐ¾Ð³Ð¾"',
 'Open Joint-Stock Company "Elektrostal ChemicalMechanical Plant named after N.D.Zelinsky"',
 'Ð\x9eÑ\x82ÐºÑ\x80Ñ\x8bÑ\x82Ð¾Ðµ Ð°ÐºÑ\x86Ð¸Ð¾Ð½ÐµÑ\x80Ð½Ð¾Ðµ Ð¾Ð±Ñ\x89ÐµÑ\x81Ñ\x82

In [44]:
line.encode('cp1251')

b''

In [43]:
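import chardet

# `line` appears to be the empty string left after the final newline by split('\n'),
# so there is nothing to detect; use the first entry instead.
s = lines[0]
# chardet needs bytes: recover the raw bytes that were mis-decoded as ISO-8859-1
bs = s.encode("latin-1")
print(chardet.detect(bs)["encoding"])  # chardet's guess; may be None or wrong for short inputs

# the file is UTF-8 (the urllib3 cell decodes it that way);
# avoid naming the result `str`, which would shadow the built-in type
fixed = bs.decode("utf-8")

print(fixed)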

In [41]:
chardet.detect(line.encode("latin-1"))["encoding"]  # detect() expects bytes, not str

In [33]:
import codecs

line

'Ð\x9eÐ\x9eÐ\x9e "Ð\x97Ð\x95Ð\x9bÐ\x98Ð\x9dÐ¡Ð\x9aÐ\x98Ð\x99 Ð\x93Ð\xa0Ð£Ð\x9fÐ\x9f"'

In [None]:




line.encode('latin-1').decode('utf-8')  # str has no .decode(); recover the original bytes first
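
The garbled strings above look like UTF-8 bytes that were decoded as ISO-8859-1 (Latin-1), which is what `requests` falls back to for text responses that declare no charset. Below is a minimal sketch of undoing this, assuming that diagnosis is right. `fix_mojibake` is a hypothetical helper, not part of any library.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged if it does not apply."""
    try:
        # Latin-1 maps code points 0-255 one-to-one onto bytes, so this recovers the raw bytes
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # the text was not produced by this particular mix-up; leave it alone
        return text

# first entry of the garbled output shown earlier
sample = 'Ð\x9eÐ\x9eÐ\x9e "Ð\x97Ð\x95Ð\x9bÐ\x98Ð\x9dÐ¡Ð\x9aÐ\x98Ð\x99 Ð\x93Ð\xa0Ð£Ð\x9fÐ\x9f"'
repaired = fix_mojibake(sample)
print(repaired)
```

Applied to the whole list this would be `lines = [fix_mojibake(l) for l in lines if l]`. Setting `response.encoding = 'utf-8'` before reading `response.text` should avoid the problem at the source.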

### Identity matching software

In [None]:
!pip install fuzzywuzzy



In [None]:
from fuzzywuzzy import fuzz

def check_name(name, risk_treshold = 85):
  """Return 1.0 on an exact match, else a list of (entity, score, method) above the threshold."""
  suspected_entities = []

  # exact match against the downloaded sanctions list
  if name in lines:
    return 1.0
  else:
    for line in lines:

      # similarity ratio of the raw strings (0-100)
      levenshtein_ratio = fuzz.ratio(name, line)
      if levenshtein_ratio > risk_treshold:
        suspected_entities.append((line,levenshtein_ratio,'lev-ratio'))

      # same ratio after sorting the words, so word order does not matter
      token_sort_ratio = fuzz.token_sort_ratio(name, line, full_process=False)
      if token_sort_ratio > risk_treshold:
        suspected_entities.append((line,token_sort_ratio,'tok-ratio'))

    return suspected_entities

In [None]:
# enter the name of a person likely to appear on a sanctions list
check_name('Kim Jong Un')

[('Jong Man Kim', 87, 'tok-ratio'),
 ('Kim Jong Man', 87, 'lev-ratio'),
 ('Kim Jong Man', 87, 'tok-ratio'),
 ('Kim Jong Eun', 87, 'lev-ratio'),
 ('Kim Džong Un', 87, 'lev-ratio'),
 ('Kim Džong Un', 87, 'tok-ratio'),
 ('Kim Dzong Un', 87, 'lev-ratio'),
 ('Kim Dzong Un', 87, 'tok-ratio'),
 ('Kim Yong Un', 91, 'lev-ratio'),
 ('Jong Un Kim', 100, 'tok-ratio'),
 ('Kim Jong Gun', 87, 'lev-ratio'),
 ('Kim Un Jon', 95, 'tok-ratio'),
 ('Kim Un Jong', 100, 'tok-ratio'),
 ('Kim Jung Un', 91, 'lev-ratio'),
 ('Kim Jung Un', 91, 'tok-ratio'),
 ('Un Jong Kim', 100, 'tok-ratio'),
 ('Un Gyong Kim', 87, 'tok-ratio')]
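
The scores of 100 for reordered names such as 'Jong Un Kim' come from the token-sort method, which sorts the words of both strings before comparing them. The sketch below illustrates the idea with the standard library's `difflib` instead of `fuzzywuzzy`, so its numbers may not match the scores above exactly. `plain_score` and `token_sort_score` are hypothetical names.

```python
from difflib import SequenceMatcher

def plain_score(a, b):
    """Similarity of the raw strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def sort_tokens(s):
    """Split on whitespace, sort the words, and join them again."""
    return " ".join(sorted(s.split()))

def token_sort_score(a, b):
    """Similarity after sorting the words, so word order is ignored."""
    return plain_score(sort_tokens(a), sort_tokens(b))

# word order lowers the plain score but not the token-sort score
print(plain_score("Kim Jong Un", "Jong Un Kim"))
print(token_sort_score("Kim Jong Un", "Jong Un Kim"))  # 100: both sort to 'Jong Kim Un'
```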

In [None]:
# enter the name of someone who is probably not on a sanctions list
check_name('Peter Fratric')

[]

### Setting the decision threshold

In [None]:
# lower the risk threshold and observe that more matches are returned
check_name('Peter Fratric', risk_treshold = 75)

[('Peter Frick', 83, 'lev-ratio'),
 ('Peter Frick', 83, 'tok-ratio'),
 ('Peter Frølich', 77, 'lev-ratio'),
 ('Peter Frølich', 77, 'tok-ratio'),
 ('Frantisek Peter', 79, 'tok-ratio'),
 ('František Peter', 79, 'tok-ratio'),
 ('Peter Friedrich', 79, 'lev-ratio'),
 ('Peter Friedrich', 79, 'tok-ratio'),
 ('Friedrich, Peter', 76, 'tok-ratio'),
 ('Peter Frolich', 77, 'lev-ratio'),
 ('Peter Frolich', 77, 'tok-ratio'),
 ('Peter Forster', 77, 'lev-ratio'),
 ('Peter Forster', 77, 'tok-ratio'),
 ('Peter Ferraro', 77, 'lev-ratio'),
 ('Peter Ferraro', 77, 'tok-ratio'),
 ('Peter Francis', 77, 'lev-ratio'),
 ('Peter Francis', 77, 'tok-ratio'),
 ('Peter Ferrara', 77, 'lev-ratio'),
 ('Peter Ferrara', 77, 'tok-ratio'),
 ('Peter Fitzpatrick', 80, 'lev-ratio'),
 ('Peter Fitzpatrick', 80, 'tok-ratio'),
 ('Fitzpatrick, Peter', 77, 'tok-ratio'),
 ('Peter Gration', 77, 'lev-ratio'),
 ('Peter Gration', 77, 'tok-ratio')]
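
Most names in the output above appear twice, once per scoring method. Below is a minimal sketch of collapsing such a result to one row per name, keeping the best score and the methods that fired. `best_per_name` is a hypothetical helper and `hits` is a small excerpt of the output above.

```python
def best_per_name(hits):
    """Collapse (name, score, method) tuples to one row per name with its highest score."""
    best = {}
    for name, score, method in hits:
        entry = best.setdefault(name, {"score": score, "methods": []})
        entry["score"] = max(entry["score"], score)
        entry["methods"].append(method)
    rows = [(name, e["score"], e["methods"]) for name, e in best.items()]
    # highest score first, ties broken alphabetically
    return sorted(rows, key=lambda row: (-row[1], row[0]))

hits = [('Peter Frick', 83, 'lev-ratio'),
        ('Peter Frick', 83, 'tok-ratio'),
        ('Frantisek Peter', 79, 'tok-ratio'),
        ('Peter Fitzpatrick', 80, 'lev-ratio'),
        ('Peter Fitzpatrick', 80, 'tok-ratio'),
        ('Fitzpatrick, Peter', 77, 'tok-ratio')]

for row in best_per_name(hits):
    print(row)
```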

Question to think about: How would you set the intervention threshold? What data would you use? Explain the steps you would take.
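
One possible starting point for this question is to score a labelled sample of name pairs and watch how precision and recall move with the threshold. The sketch below uses invented scores and labels purely for illustration. `labelled` and `precision_recall` are hypothetical names, and a real evaluation would need matches reviewed by an analyst.

```python
# (similarity score, is it really the same entity?) -- invented illustration data
labelled = [(100, True), (95, True), (91, True), (87, True),
            (87, False), (83, False), (79, False), (77, False)]

def precision_recall(pairs, threshold):
    """Precision and recall when every score strictly above `threshold` is flagged."""
    flagged = [is_match for score, is_match in pairs if score > threshold]
    true_positives = sum(flagged)
    total_matches = sum(is_match for _, is_match in pairs)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_matches if total_matches else 0.0
    return precision, recall

for threshold in (75, 85, 90):
    print(threshold, precision_recall(labelled, threshold))
```

On this toy sample, raising the threshold removes false alerts but starts to miss true matches, which is the trade-off the threshold has to balance.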

### Limitations and challenges

Let us think about the following questions:
 * What sort of data could we obtain to improve our entity matching?
 * Does the frequency of list updates cause any issues?
 * Can sanctioned entities still get their hands on assets by obscuring their involvement?

## Example 2: Sectoral sanctions





TODO: sectoral sanctions

* sector membership classification (fuzzy by default)
* similar to nuclear whitelisting
* dataset verification on a sample
* suptech (maybe my future research)


