# CPV Classifier POC

## 1 - Get Data

This is a simple proof of concept for a transformer-based classifier, built on data from TheyBuyForYou.

See https://theybuyforyou.eu/ for background on TheyBuyForYou and http://data.tbfy.eu/ for information on the Knowledge Graph (KG) data that was created as part of this project. Data from the knowledge graph used in this proof of concept is made available under the following license terms, and therefore the same license applies to the code and data in this repository.

> The KG data is provided under the Creative Commons BY-NC-SA 4.0 License, which allows you to use, share and adapt the data for non-commercial uses as long as you give appropriate credit and share any adapted data under the same license as the original. If you wish to use the data for commercial uses please contact the TheyBuyForYou project.

In [1]:
import os
import random
import json
from tbfy_util import *
from typing import List
import pandas as pd
from collections import Counter
import shelve
import nltk

## Load data

The data comes from the TheyBuyForYou Knowledge Graph (KG), which is available via http://dump.tbfy.eu and hosted on Zenodo. The zip file is approximately 5 GB, so it is not included in this repo. To re-create these steps from scratch, do the following:

1. Install zenodo_get (see https://zenodo.org/record/1261813#.YXMpDZvTVH4). This utility takes away a lot of headaches when downloading large Zenodo files.
2. Download the JSON dump file (via http://dump.tbfy.eu or https://zenodo.org/record/5546616#.YXMoOZvTVH4) to the `data` directory.
3. Unzip it so that the `data/3_JSON_Enriched` directory read by the next cell exists.
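
Step 3 can also be done from Python. The following is a minimal sketch using the standard-library `zipfile` module. `extract_dump` is a hypothetical helper, and the archive name in the usage comment is a placeholder rather than the real file name of the dump.

```python
import zipfile
from pathlib import Path


def extract_dump(zip_path: str, dest_dir: str = "data") -> int:
    """Extract every member of zip_path into dest_dir and return the member count."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return len(archive.namelist())


# Example call (the archive name is a placeholder for whatever file you downloaded):
# extract_dump("data/<downloaded-archive>.zip", "data")
```

Extracting a 5 GB archive this way needs enough free disk space for both the zip file and its contents.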

In [2]:
files = [os.path.join(dp, f) for dp, dn, fn in os.walk(os.path.expanduser("./data/3_JSON_Enriched")) for f in fn if "release" in f]
l = len(files)
print(f"{l} files found")

2211031 files found


## Have a look at some examples

In [3]:
idx = random.randrange(0,l)
f = files[idx]
# Use a context manager so the file handle is closed after reading
with open(f, 'r') as fh:
    j = json.load(fh)
j

{'uri': 'https://openopps.com/tenders/ocds-0c46vo-0049-e5543718-8753-4bb3-8015-590d58b7f93e/?format=json',
 'license': 'https://opendatacommons.org/licenses/odbl/',
 'releases': [{'id': 'e5543718-8753-4bb3-8015-590d58b7f93e',
   'tag': ['tender'],
   'date': '2019-08-01T00:00:00+00:00',
   'ocid': 'ocds-0c46vo-0049-e5543718-8753-4bb3-8015-590d58b7f93e',
   'buyer': {'address': {'countryName': 'Germany'}},
   'tender': {'id': 'e5543718-8753-4bb3-8015-590d58b7f93e',
    'title': 'Roh\xadbau\xadar\xadbei\xadten, Ver\xadblend\xadmau\xader\xadwerk: Grund\xadschu\xadle Süd\xadstadt, Oer\xadling\xadhau\xadsen',
    'status': 'active',
    'documents': [{'id': 'tender_url',
      'url': 'https://www.service.bund.de/IMPORTE/Ausschreibungen/healyhudson/2019/08/e5543718-8753-4bb3-8015-590d58b7f93e.html;jsessionid=F2A7F85653A12235DC3B51EA48D3847C.1_cid296?nn=4641482&type=0&searchResult=true',
      'format': 'text/html',
      'language': 'en',
      'documentType': 'tenderNotice'}],
    'descript

In [4]:
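# Print 10 randomly chosen files; the loop variable is unused, so it no longer shadows idx
for _ in range(10):
    idx = random.randrange(0,l)
    with open(files[idx],'r') as fh:
        j = json.load(fh)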
    for release in j['releases']:
        lang = release.get("language")
        tender = release['tender']
        title = get_title(tender)
        desc = get_description(tender)
        classif,scheme = get_classification(tender)
        print("**************************************************")
        print(f"File id: {idx}, Language: {lang}")
        print(f"Title: {title}")
        print(f"Description (up to 500 chars): {desc[:500]}")
        print(f"Classification code {classif} and scheme {scheme}")
        print()

**************************************************
File id: 1543770, Language: fr
Title: Marché de mandat de maîtrise d'ouvrage dans le cadre de la construction d'un bâtiment à usage de résidence d'hébergement, activités de formation et d'un logement sur le site CREPS à Essey-lès-Nancy
Description (up to 500 chars): Mission de mandat de maîtrise d'ouvrage dans le cadre de la construction d'un bâtiment.
Classification code 79933000 and scheme CPV

**************************************************
File id: 1893919, Language: ro
Title: „Furnizare echipamente informatice și licențe software”
Description (up to 500 chars): Achiziția „Furnizare echipamente informatice si licente software” divizata in 14 loturi.Numărul de zile până la care se pot solicita clarificări înainte de data-limită de depunere a ofertelor/candidaturilor: 19 zile. Autoritatea contractanta va raspunde in mod clar si complet tuturor solicitarilor de clarificari/informatii suplimentare in a 11-a zi inainte de data-limita

## Store relevant data in a pandas dataframe



In [5]:
def file_to_rows(f: str, idx: int) -> List[list]:
    # Build one row per English-language release that has a CPV classification
    rows = []
    with open(f, 'r') as fh:
        j = json.load(fh)
    for release in j["releases"]:
        if release.get("language") == 'en' and release.get("tender"):
            tender = release["tender"]
            title = get_title(tender)
            description = get_description(tender)
            classif,scheme = get_classification(tender)

            if scheme and "CPV" in scheme.upper():
                # Keep the 8-digit code and derive its 6-, 4- and 2-digit parents, zero-padded
                class8 = classif[:8]
                if len(class8) < 8:
                    print(f"row {idx} has classification {classif} that is less than 8 chars")
                class6 = class8[:6] + "00"
                class4 = class8[:4] + "0000"
                class2 = class8[:2] + "000000"
                rows.append([f,title,description,class8,class6,class4,class2])
    return rows

In [6]:
all_rows = []
for idx, x in enumerate(files):
    all_rows.extend(file_to_rows(x, idx))
    # Report progress every 100,000 files (count is 1-based)
    if (idx + 1) % 100_000 == 0:
        print(f"Processing file #{idx + 1}")

print(f"Got {len(all_rows)} rows")

Processing file #100000
Processing file #200000
Processing file #300000
Processing file #400000
Processing file #500000
Processing file #600000
Processing file #700000
Processing file #800000
Processing file #900000
Processing file #1000000
Processing file #1100000
Processing file #1200000
Processing file #1300000
Processing file #1400000
Processing file #1500000
Processing file #1600000
Processing file #1700000
Processing file #1800000
Processing file #1900000
Processing file #2000000
Processing file #2100000
Processing file #2200000
Got 245884 rows


In [7]:
df = pd.DataFrame(columns=["filename","title","description","class8","class6","class4","class2"],
                  data=all_rows)
df

| | filename | title | description | class8 | class6 | class4 | class2 |
|---|---|---|---|---|---|---|---|
| 0 | ./data/3_JSON_Enriched/2019-08-21/ocds-0c46vo-... | APMS Services for Maple Surgery, Cambridge for... | Maple Surgery is located at Hanover Close, Cam... | 85120000 | 85120000 | 85120000 | 85000000 |
| 1 | ./data/3_JSON_Enriched/2019-08-21/ocds-0c46vo-... | Provision of Compliance Auditing Services | Dublin Bus is seeking submissions from suitabl... | 79212000 | 79212000 | 79210000 | 79000000 |
| 2 | ./data/3_JSON_Enriched/2019-08-21/ocds-0c46vo-... | DBC (SF) Pilot of Robotic Process Automation | Dacorum Borough Council requires support from ... | 72222000 | 72222000 | 72220000 | 72000000 |
| 3 | ./data/3_JSON_Enriched/2019-08-21/ocds-0c46vo-... | Passenger Transport for 8 Passengers or Less w... | Passenger transport for 8 passengers or less w... | 60000000 | 60000000 | 60000000 | 60000000 |
| 4 | ./data/3_JSON_Enriched/2019-08-21/ocds-0c46vo-... | Denbighshire Schools ICT Network Framework | The aims of this contract is to provide a fram... | 48000000 | 48000000 | 48000000 | 48000000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 245879 | ./data/3_JSON_Enriched/2019-04-23/ocds-0c46vo-... | Delivered Ready Prepared Meals | The range is for selection of standard ready p... | 15894220 | 15894200 | 15890000 | 15000000 |
| 245880 | ./data/3_JSON_Enriched/2019-04-23/ocds-0c46vo-... | Traffic Signals Planned and Unplanned Inspecti... | Renfrewshire Council require a suitably qualif... | 50232000 | 50232000 | 50230000 | 50000000 |
| 245881 | ./data/3_JSON_Enriched/2019-04-23/ocds-0c46vo-... | The Supply for the Development of Dudley Counc... | Dudley Council invites providers to submit a q... | 73000000 | 73000000 | 73000000 | 73000000 |
| 245882 | ./data/3_JSON_Enriched/2019-04-23/ocds-0c46vo-... | LIFE Welsh Raised Bogs — Framework for Peat Re... | Lot 2: Removal of invasive Scrub:NRW is intend... | 16600000 | 16600000 | 16600000 | 16000000 |


In [8]:
df.class8.describe()

count       245884
unique        5512
top       45000000
freq          9599
Name: class8, dtype: object

In [9]:
cpv_counter = Counter(df.class8)
print(f"Most common classifications: {cpv_counter.most_common(10)}")
# Note: the slice [:-10:-1] returns the 9 least common entries, not 10
print(f"Least common classifications: {cpv_counter.most_common()[:-10:-1]}")

Most common classifications: [('45000000', 9599), ('85000000', 8787), ('85100000', 8786), ('60000000', 7677), ('85312000', 5613), ('72000000', 5003), ('80000000', 4242), ('85300000', 3727), ('48000000', 3695), ('38000000', 3198)]
Least common classifications: [('35322400', 1), ('24311800', 1), ('77231800', 1), ('30213500', 1), ('45243400', 1), ('71631490', 1), ('48625000', 1), ('33620000', 1), ('03222320', 1)]


## Bucketize classifications

About 5,500 distinct classification codes across about 246k examples works out to roughly 45 examples per code on average. The most common entries are mostly high-level CPV codes, and many codes have just 1 example.

To give the model a better chance of being accurate, group the classifications into buckets so that, where possible, each bucket has more than N examples: a code with too few examples is replaced by its parent code at the next level up.
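
The grouping relies on the way CPV codes are truncated in this notebook: keeping the first 6, 4 or 2 digits of an 8-digit code and padding with zeros gives its parent at each higher level. As an illustration only (`cpv_ancestors` is a hypothetical helper, not part of `tbfy_util`), the candidate buckets for a single code can be listed like this:

```python
from typing import List


def cpv_ancestors(code: str) -> List[str]:
    """Return the code followed by its 6-, 4- and 2-digit parents, zero-padded to 8 digits."""
    return [code[:n].ljust(8, "0") for n in (8, 6, 4, 2)]


print(cpv_ancestors("15894220"))
# ['15894220', '15894200', '15890000', '15000000']
```

Bucketizing then amounts to walking this list from left to right and stopping at the first entry with enough examples.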

In [10]:
def bucketize(df: pd.DataFrame, minv: int = 100) -> dict:
    '''
        Map each 8-digit CPV code to the most specific level (8, 6, 4 digits)
        that has more than minv examples. Codes that fail every check fall
        back to their 2-digit parent, whatever its count.
    '''
    c8 = Counter(df["class8"])
    c6 = Counter(df["class6"])
    c4 = Counter(df["class4"])

    cpv_mapping = {}
    for k,v in c8.items():
        cpv_6 = k[:6] + "00"
        cpv_4 = k[:4] + "0000"
        if v > minv:
            cpv_mapping[k] = k
        elif c6[cpv_6] > minv:
            cpv_mapping[k] = cpv_6
        elif c4[cpv_4] > minv:
            cpv_mapping[k] = cpv_4
        else:
            cpv_mapping[k] = k[:2] + "000000"
    return cpv_mapping

In [11]:
cpv_mapping = bucketize(df,300)

In [12]:
print(f"Number of classifications after bucketizing: {len(set(cpv_mapping.values()))}")

Number of classifications after bucketizing: 225


In [13]:
# Look up each row's bucket; every class8 value is a key of cpv_mapping
df['mapped_cpv'] = df['class8'].map(cpv_mapping)
df.mapped_cpv.describe()

count       245884
unique         225
top       45000000
freq         10596
Name: mapped_cpv, dtype: object

In [14]:
mapped_cpv_counter = Counter(df.mapped_cpv)
print(f"Most common classifications: {mapped_cpv_counter.most_common(10)}")
# Note: the slice [:-10:-1] returns the 9 least common entries, not 10
print(f"Least common classifications: {mapped_cpv_counter.most_common()[:-10:-1]}")

Most common classifications: [('45000000', 10596), ('85000000', 9217), ('60000000', 8866), ('85100000', 8786), ('72000000', 6820), ('48000000', 6676), ('80000000', 6401), ('50000000', 5990), ('85312000', 5613), ('38000000', 5411)]
Least common classifications: [('11000000', 1), ('01000000', 1), ('40000000', 1), ('93000000', 1), ('78000000', 1), ('21000000', 1), ('23000000', 1), ('95000000', 1), ('20000000', 2)]


Some of the least common classifications still have only 1 or 2 entries. These are already top-level (2-digit) CPV classifications, so the bucketize function cannot group them any further.
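
One way to see how much data sits in these leftover buckets is to count them directly. The sketch below uses toy labels standing in for `df.mapped_cpv` and a hypothetical helper, `small_buckets`. Whether to drop or merge such rows is a modelling decision this notebook does not make.

```python
from collections import Counter
from typing import Dict, Iterable


def small_buckets(labels: Iterable[str], minv: int) -> Dict[str, int]:
    """Return the labels whose example count is not greater than minv, with their counts."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n <= minv}


# Toy labels standing in for df.mapped_cpv
toy_labels = ["45000000"] * 5 + ["11000000"] + ["20000000"] * 2
print(small_buckets(toy_labels, minv=2))
# {'11000000': 1, '20000000': 2}
```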

## Prepare data for training

Create one series for the text and one for the classifications.

Because the descriptions vary widely in length, only the title plus the first sentence of each tender description is used.
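
As a rough illustration of the idea, the sketch below joins a title and the first sentence of a description into one string. It is an assumption about how the two fields might be combined, not something this notebook does. To stay self-contained it uses a naive regex splitter as a stand-in for `nltk.sent_tokenize`; `first_sentence` and `build_model_input` are hypothetical helpers, and the description is toy text.

```python
import re


def first_sentence(text: str) -> str:
    """Naive splitter: return the text up to the first '.', '!' or '?' followed by whitespace."""
    if not text:
        return ""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


def build_model_input(title: str, description: str) -> str:
    """Join the title and the first sentence of the description into one string."""
    parts = [title.strip() if title else "", first_sentence(description)]
    return " ".join(p for p in parts if p)


print(build_model_input(
    "Provision of Compliance Auditing Services",
    "Audit services are required. Bids close in May.",
))
# Provision of Compliance Auditing Services Audit services are required.
```

Unlike indexing the tokenizer output with `[0]`, this version returns an empty string for an empty description instead of raising an error.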

In [15]:
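# Requires the NLTK Punkt tokenizer data to be installed (e.g. via nltk.download)
# Note: an empty description yields no sentences, so [0] would raise IndexError
df['sent1'] = df['description'].apply(lambda d: nltk.sent_tokenize(d)[0])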

In [16]:
df_for_training = df.drop(columns=['filename','description','class8','class6','class4','class2'])
df_for_training

| | title | mapped_cpv | sent1 |
|---|---|---|---|
| 0 | APMS Services for Maple Surgery, Cambridge for... | 85120000 | Maple Surgery is located at Hanover Close, Cam... |
| 1 | Provision of Compliance Auditing Services | 79212000 | Dublin Bus is seeking submissions from suitabl... |
| 2 | DBC (SF) Pilot of Robotic Process Automation | 72220000 | Dacorum Borough Council requires support from ... |
| 3 | Passenger Transport for 8 Passengers or Less w... | 60000000 | Passenger transport for 8 passengers or less w... |
| 4 | Denbighshire Schools ICT Network Framework | 48000000 | The aims of this contract is to provide a fram... |
| ... | ... | ... | ... |
| 245879 | Delivered Ready Prepared Meals | 15890000 | The range is for selection of standard ready p... |
| 245880 | Traffic Signals Planned and Unplanned Inspecti... | 50230000 | Renfrewshire Council require a suitably qualif... |
| 245881 | The Supply for the Development of Dudley Counc... | 73000000 | Dudley Council invites providers to submit a q... |
| 245882 | LIFE Welsh Raised Bogs — Framework for Peat Re... | 16000000 | Lot 2: Removal of invasive Scrub:NRW is intend... |


In [17]:
df_for_training.to_json("data/training_data.json")
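
When this file is read back, it may be worth checking that CPV codes with a leading zero (such as `03222320` above) are still strings, since `pd.read_json` tries to infer column types by default. The sketch below is a suggestion rather than part of the original workflow: it round-trips a toy frame with the same column names through a temporary file and requests `str` for `mapped_cpv`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same columns as df_for_training
toy = pd.DataFrame({
    "title": ["Example tender"],
    "mapped_cpv": ["03000000"],
    "sent1": ["An example first sentence."],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training_data.json")
    toy.to_json(path)
    # Ask for mapped_cpv as str so the leading zero is not lost to numeric inference
    loaded = pd.read_json(path, dtype={"mapped_cpv": str})

print(loaded["mapped_cpv"].tolist())
# ['03000000']
```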