# Introduction

### Overview

This notebook describes how EITI data is mapped to the Beneficial Ownership Data Standard, [BODS 0.4](https://github.com/openownership/data-standard/tree/main/schema). It is structured in six sections: 
0. Prerequisites
1. Statement Mapping
2. Entity Mapping
3. Relationship mapping
4. Final matching
5. Declaration export

### Mapping process

The mapping sections are broadly structured in 3 parts: 
1. Schema and dictionary definition
2. Mapping function
3. Output verification 

They rely on a [mapping reference](https://docs.google.com/spreadsheets/d/1CPeZ_5FiqIRCmHGHh7Gz1McpxmwN1EoBwkMYtRqFWFo/edit?pli=1#gid=134387124) made possible by [flattening the BODS json schema](https://github.com/civicliteracies/EITI_SDT_data_verification_and_validation/blob/sqlite/4_clean/3_bods_mapping/02_schema_flattening.ipynb) files.

The mapping process uses dictionaries to hold the target data structures, and the data is transformed using the following logic: 

`<bods_object>_schema` serves as a blueprint for the `<bods_object>_json` instances that are populated with data from `df_<dataset>`, assigned a unique identifier `<bods_object>_dict_key`, and stored as JSON strings in the `<bods_object>_dict` dictionary.

where `<bods_object>` can be either statement, entity or relationship.
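
The logic above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual mapping: the `toy_*` names and the two sample rows are hypothetical, and plain dictionaries stand in for the rows of `df_<dataset>`.

```python
import copy
import json

# Hypothetical rows standing in for df_<dataset>
toy_rows = [
    {"toy_id": "a1", "toy_name": "Alpha"},
    {"toy_id": "b2", "toy_name": "Beta"},
]

# <bods_object>_schema: the blueprint
toy_schema = {"name": "", "identifiers": []}

# <bods_object>_dict: holds the JSON strings
toy_dict = {}

for row in toy_rows:
    # <bods_object>_json: a fresh instance of the blueprint
    toy_json = copy.deepcopy(toy_schema)
    toy_json["name"] = row["toy_name"]
    toy_json["identifiers"].append({"id": row["toy_id"]})

    # <bods_object>_dict_key: the unique identifier
    toy_dict_key = row["toy_id"]
    toy_dict[toy_dict_key] = json.dumps(toy_json, indent=2, ensure_ascii=False)

print(f"The dictionary has {len(toy_dict)} items")
```

Storing JSON strings rather than live dictionaries means each stored item is a snapshot that later edits to the template or to another instance cannot change.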


# Part 0 - Prerequisites

### Overview
1. Import libraries
2. Import the data as dataframes
3. Define utility functions and variables to be used across the code


0.1. We import the appropriate libraries

In [1]:
import pandas as pd
import json
import random
import copy
import uuid
from collections import Counter, defaultdict, OrderedDict


0.2. We import the relevant datasets directly from GitHub to facilitate replication.

In [2]:
url_part1=('https://raw.githubusercontent.com/civicliteracies/EITI_SDT_data_verification_and_validation/sqlite/4_clean/2_data_editing/output/eiti-data_part1_1.3.csv')
url_part5=('https://raw.githubusercontent.com/civicliteracies/EITI_SDT_data_verification_and_validation/sqlite/4_clean/2_data_editing/output/eiti-data_part5-0.11.8.csv')

df_part1 = pd.read_csv(url_part1)
df_part5 = pd.read_csv(url_part5, low_memory=False)

0.3. We define the utility functions needed later: 
* a UUID3 function, used in Part 4 to create the `statementId` of each statement and the `recordId` of relationship statements
* a function to print 2 random items from a dictionary

In [3]:
# generate UUID3
def generate_uuid3(*args, namespace=uuid.UUID('00000000-0000-0000-0000-000000000000')):
    name = "-".join(map(str, args))  # Concatenate arguments into a single string
    return str(uuid.uuid3(namespace, name))

# Print a sample of 2 random items from the dictionary containing JSON strings
def print_random_keys(dictionary, num_keys=2):
    separator = "-" * 40
    random_keys = random.sample(list(dictionary.keys()), num_keys)
    
    for random_key in random_keys:
        print(f"{random_key}: {dictionary[random_key]}\n{separator}\n")
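
A short sketch of how `generate_uuid3` behaves, repeated here so that it runs on its own. UUID version 3 is a hash of the namespace and the name, so identical inputs always give the same identifier. The sample values are illustrative only.

```python
import uuid

# Same definition as generate_uuid3 above, repeated so the sketch is self-contained
def generate_uuid3(*args, namespace=uuid.UUID('00000000-0000-0000-0000-000000000000')):
    name = "-".join(map(str, args))
    return str(uuid.uuid3(namespace, name))

first = generate_uuid3("FMC K", "LR")
second = generate_uuid3("FMC K", "LR")
other = generate_uuid3("FMC K", "TD")

print(first == second)  # True: same inputs, same identifier
print(first == other)   # False: different inputs

# Caveat: arguments are joined with "-", so both calls below hash the name "a-b-c"
print(generate_uuid3("a-b", "c") == generate_uuid3("a", "b-c"))  # True
```

The last line shows a limitation of joining on `-`: two different argument lists can produce the same name and therefore the same identifier.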


# Part 1 - Generating statements

### Overview

1. Schema and dictionary definition
2. Mapping
3. Output verification

### Logic

`statement_schema` serves as a blueprint for creating `statement_json` instances, which are populated with data from `df_part1`, assigned unique identifiers `statement_dict_key`, and stored as JSON strings in the `statement_dict` dictionary.
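
One point to keep in mind when instantiating a template that contains nested dictionaries: `dict.copy()` is shallow, so the nested objects stay shared with the template, while `copy.deepcopy` duplicates them. The sketch below uses a reduced, hypothetical template to show the difference.

```python
import copy

# Shallow copy: the nested dictionary is shared with the template
template_a = {"publicationDetails": {"bodsVersion": ""}}
shallow = template_a.copy()
shallow["publicationDetails"]["bodsVersion"] = "0.4"
print(repr(template_a["publicationDetails"]["bodsVersion"]))  # '0.4': the template changed

# Deep copy: the nested dictionary is duplicated
template_b = {"publicationDetails": {"bodsVersion": ""}}
deep = copy.deepcopy(template_b)
deep["publicationDetails"]["bodsVersion"] = "0.4"
print(repr(template_b["publicationDetails"]["bodsVersion"]))  # '': the template is untouched
```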

1.1 We define the schema and create the dictionary to hold the mapped JSONs.

In [4]:
# BODS statement structure template
statement_schema = {
    "statementId": "",
    "statementDate": "",
    "publicationDetails": {
        "publicationDate": "",
        "bodsVersion": "",
        "license": "",
        "publisher": {
            "name": "",
            "url": ""
        }
    },
    "source": {
        "type": [],
        "description": "",
        "url": "",
        "retrievedAt": "",
        "assertedBy": [
            {
                "name": "",
                "uri": ""
            }
        ]
    },
    "declaration": "",
    "declarationSubject": "",
    "recordId": "",
    "recordType": "",
    "recordDetails": {}
}

# Dictionary to hold the JSON strings
statement_dict = {}

1.2. We loop through part1 data to generate the JSON based on the mapping rules and we print the number of created JSONs for verification.

In [5]:

# Iterate over each row in df_part1
for index, row in df_part1.iterrows():
    # Deep copy so the nested template dictionaries are not shared across rows
    statement_json = copy.deepcopy(statement_schema)

    # Fill the statement_json with data from the row
    statement_json["statementId"] = ''
    statement_json["statementDate"] = row['eiti_data_publication_date']
    statement_json["publicationDetails"]["publicationDate"] = row['end_date']
    statement_json["publicationDetails"]["bodsVersion"] = '0.4'
    statement_json["publicationDetails"]["license"] = 'http://opendatacommons.org/licenses/pddl/1.0/'
    statement_json["publicationDetails"]["publisher"]["name"] = 'Extractive Industries Transparency Initiative'
    statement_json["publicationDetails"]["publisher"]["url"] = 'https://eiti.org/open-data'
    statement_json["source"]["type"] = ['officialRegister', 'verified']
    statement_json["source"]["url"] = 'https://eiti.portaljs.com'
    statement_json["source"]["retrievedAt"] = pd.Timestamp('today').strftime('%Y-%m-%d')
    statement_json["source"]["assertedBy"][0]["name"] = row['submitter_name']
    statement_json["source"]["assertedBy"][0]["uri"] = row['submitter_email']
    statement_json["declaration"] = f"{row['iso_alpha2_code']}-{row['start_date'].replace('-', '')}-{row['end_date'].replace('-', '')}"
    statement_json["declarationSubject"] = row['iso_alpha2_code']
    statement_json["recordId"] = ''
    statement_json["recordType"] = ''
    
    # Create a key based on the statement identifier
    statement_dict_key = row['eiti_id_declaration']
    
    # Save the JSON string in the dictionary
    statement_dict[statement_dict_key] = json.dumps(statement_json, indent=2, ensure_ascii=False)



print(f"The dictionary has {len(statement_dict.keys())} items")

The dictionary has 73 items
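
Rows with a missing `eiti_data_publication_date` carry a float `NaN`, which `json.dumps` writes as the bare token `NaN` (visible in the samples printed in 1.3). That token is not part of the JSON specification, so strict parsers may reject it. The sketch below shows one possible way to emit `null` instead; `nan_to_none` is a hypothetical helper that is not part of the notebook.

```python
import json
import math

def nan_to_none(value):
    """Recursively replace float NaN with None so json.dumps emits null."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {k: nan_to_none(v) for k, v in value.items()}
    if isinstance(value, list):
        return [nan_to_none(v) for v in value]
    return value

record = {"statementDate": float("nan"), "declaration": "ML-20180101-20181231"}

# allow_nan=False makes json.dumps raise if any NaN is left behind
cleaned = json.dumps(nan_to_none(record), allow_nan=False)
print(cleaned)  # {"statementDate": null, "declaration": "ML-20180101-20181231"}
```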


1.3. We verify the output by printing 2 random statement_dict entries.

In [6]:
print_random_keys(statement_dict)

919aa363-4086-3318-8c0d-69fc2c399736: {
  "statementId": "",
  "statementDate": NaN,
  "publicationDetails": {
    "publicationDate": "2018-12-31",
    "bodsVersion": "0.4",
    "license": "http://opendatacommons.org/licenses/pddl/1.0/",
    "publisher": {
      "name": "Extractive Industries Transparency Initiative",
      "url": "https://eiti.org/open-data"
    }
  },
  "source": {
    "type": [
      "officialRegister",
      "verified"
    ],
    "description": "",
    "url": "https://eiti.portaljs.com",
    "retrievedAt": "2024-06-03",
    "assertedBy": [
      {
        "name": "Ghazi Khiari",
        "uri": "Ghazi.khiari@bdo-ifi.com"
      }
    ]
  },
  "declaration": "ML-20180101-20181231",
  "declarationSubject": "ML",
  "recordId": "",
  "recordType": "",
  "recordDetails": {}
}
----------------------------------------

e7853b05-d03e-3848-93ec-d21f6cdf39ea: {
  "statementId": "",
  "statementDate": NaN,
  "publicationDetails": {
    "publicationDate": "2018-12-31",
    "bods

# Part 2 - Generating entities

### Overview

1. Entity data preparation
2. Schema and dictionary definition
3. Mapping
4. Output verification

### Logic

`entity_schema` serves as a blueprint for creating `entity_json` instances, which are populated with data from `df_entities` (built from `df_part5`), assigned unique identifiers `entity_dict_key`, and stored as JSON strings in the `entity_dict` dictionary.

2.1. We create a dataframe that holds only the unique values for each type of entity (companies, projects, government entities) while assigning them the proper label in the `entity_type` column. 

In [7]:
# Extract unique entities (within each declaration) and add entity type
unique_companies = df_part5[['company_name', 'original_company_name', 'eiti_id_company', 'company_id', 'iso_alpha2_code', 'country', 'company_public_listing_or_website', 'start_date', 'end_date', 'eiti_id_declaration']].dropna(subset=['eiti_id_company']).drop_duplicates().assign(entity_type='registeredEntity')
unique_projects = df_part5[['project_name', 'eiti_id_project', 'iso_alpha2_code', 'country', 'start_date', 'end_date', 'eiti_id_declaration']].dropna(subset=['eiti_id_project']).drop_duplicates().assign(entity_type='arrangement')
unique_government = df_part5[['government_entity', 'eiti_id_government', 'iso_alpha2_code', 'country', 'start_date', 'end_date', 'eiti_id_declaration']].dropna(subset=['eiti_id_government']).drop_duplicates().assign(entity_type='stateBody')


# Combine into a single DataFrame
df_entities = pd.concat([unique_companies, unique_projects, unique_government], ignore_index=True)

print(f"The dataframe has {len(df_entities.index)} rows\n")

The dataframe has 8242 rows
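
The extraction pattern used above (`dropna`, `drop_duplicates`, `assign`, then `concat`) can be sketched on a small, hypothetical dataframe. The column names mirror `df_part5`, but the values are made up.

```python
import pandas as pd

# Hypothetical rows: one company reported twice, one row without a project
toy = pd.DataFrame({
    "company_name": ["ACME", "ACME", "BETA"],
    "eiti_id_company": ["c-1", "c-1", "c-2"],
    "project_name": ["Mine A", "Mine A", None],
    "eiti_id_project": ["p-1", "p-1", None],
})

companies = (
    toy[["company_name", "eiti_id_company"]]
    .dropna(subset=["eiti_id_company"])
    .drop_duplicates()
    .assign(entity_type="registeredEntity")
)
projects = (
    toy[["project_name", "eiti_id_project"]]
    .dropna(subset=["eiti_id_project"])
    .drop_duplicates()
    .assign(entity_type="arrangement")
)

toy_entities = pd.concat([companies, projects], ignore_index=True)
print(len(toy_entities))  # 3: two companies and one project
```

Because the frames do not share all columns, `concat` fills the missing ones with `NaN`. In this toy case the project row has no `company_name`, which is why the mapping step has to test columns with `pd.notna` before using them.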



2.2. We define the schema and create the dictionary to hold the mapped JSONs.

In [8]:
# Define the entity schema
entity_schema = {
    "isComponent": False,
    "entityType": {
        "type": "",
        "subtype": ""
    },
    "name": "",
    "alternateNames": [],
    "jurisdiction": {
        "name": "",
        "code": ""
    },
    "identifiers": [],
    "uri": "",
}

# Create the entity dictionary
entity_dict = {}

2.3. We loop through `df_entities` to generate the mapped entity JSONs before storing them in `entity_dict`. The size of `entity_dict` should match the number of rows of `df_entities`.

In [9]:

# Iterate over each row in df_entities to create JSON files
for index, row in df_entities.iterrows():

    # Deep copy: nested keys are edited and deleted below, so the template must not be shared
    entity_json = copy.deepcopy(entity_schema)

    entity_json["entityType"]["type"] = row['entity_type']
    entity_json["entityType"]["subtype"] = (
        'governmentDepartment' if row['entity_type'] == 'stateBody' and 'minist' in str(row['government_entity']).lower() else
        'stateAgency' if row['entity_type'] == 'stateBody' else ''
    )

    if row['entity_type'] == 'registeredEntity':
        del entity_json["entityType"]["subtype"]

    entity_json["name"] = (
        row['company_name'] if row['entity_type'] == 'registeredEntity' else
        row['project_name'] if row['entity_type'] == 'arrangement' else
        row['government_entity']
    )
    
    if pd.notna(row['original_company_name']):
        entity_json["alternateNames"] = [row['original_company_name']]  # kept as a list, like the template
    else:
        del entity_json["alternateNames"]

    entity_json["jurisdiction"]["name"] = row['country']
    entity_json["jurisdiction"]["code"] = row['iso_alpha2_code']
    entity_json["identifiers"] = [{
        "id": (
            row['eiti_id_company'] if row['entity_type'] == 'registeredEntity' else
            row['eiti_id_project'] if row['entity_type'] == 'arrangement' else
            row['eiti_id_government']
        ),
        "scheme": "XI-EITI",
        "schemeName": "Extractive Industries Transparency Initiative",
        "uri": f"/entity_statement/{row['eiti_id_company'] if row['entity_type'] == 'registeredEntity' else row['eiti_id_project'] if row['entity_type'] == 'arrangement' else row['eiti_id_government']}"
    }]
    entity_json["uri"] = row['company_public_listing_or_website']
    
    if row['entity_type'] == 'registeredEntity' and pd.notna(row['company_id']):
        entity_json["identifiers"].append({
            "id": row['company_id'],
            "scheme": "n/a",
            "schemeName": "Local ID",
            "uri": "n/a"
        })

    # Create the dictionary key
    entity_dict_key = (index, row['eiti_id_declaration'])

    # Insert entity JSONs in the dictionary alongside their matching keys
    entity_dict[entity_dict_key] = json.dumps(entity_json, indent=2, ensure_ascii=False)

# Clear process status with a final message
print(f"The dictionary has {len(entity_dict.keys())} items")


The dictionary has 8242 items


2.4. We verify the output by printing 2 random `entity_dict` entries.

In [10]:
# Display 2 random items for quality check
print_random_keys(entity_dict)

(8174, 'f11d93c6-1ca8-3b49-b082-f4bd12e7c5cb'): {
  "isComponent": false,
  "entityType": {
    "type": "stateBody",
    "subtype": "stateAgency"
  },
  "name": "DIRECTION GÉNÉRALE TECHNIQUE DES MINES (DGTM)",
  "jurisdiction": {
    "name": "Chad",
    "code": "TD"
  },
  "identifiers": [
    {
      "id": "78a66aa1-2112-4d26-adf2-ac0ee22e19cc",
      "scheme": "XI-EITI",
      "schemeName": "Extractive Industries Transparency Initiative",
      "uri": "/entity_statement/78a66aa1-2112-4d26-adf2-ac0ee22e19cc"
    }
  ],
  "uri": NaN
}
----------------------------------------

(6444, '822e6097-313c-32d0-9ca4-4db85c24052d'): {
  "isComponent": false,
  "entityType": {
    "type": "arrangement",
    "subtype": ""
  },
  "name": "FMC K",
  "jurisdiction": {
    "name": "Liberia",
    "code": "LR"
  },
  "identifiers": [
    {
      "id": "a08dc542-645b-4882-923d-763fbe9d4b28",
      "scheme": "XI-EITI",
      "schemeName": "Extractive Industries Transparency Initiative",
      "uri": "/ent

# Part 3 - Relationships

## Overview

1. Schema and dictionary definition
2. Mapping to the different relationship schemas
3. Output verification
4. Consolidation

## Logic 

### Core mapping

EITI data describes multiple relationships, which requires several schemas. We defined 5 types of relationships and assigned them the following attributes:

| InterestedParty | Subject | directOrIndirect | descriptor |
| ---- | ---- | ---- | ---- |
| Country | Government Agency | direct | controlByLegalFramework |
| Government Agency | Company (SOE) | direct | controlByLegalFramework, rightsToProfitOrIncome |
| Government Agency | Company (Private) | direct | rightsToProfitOrIncome |
| Company | Project | direct | rightsGrantedByContract |
| Government Agency | Project | indirect | controlByLegalFramework |

These attributes are encoded in the five `relationship_schemas`. 

The `populate_relationships` function uses the matching entry of `relationship_schemas` as a template to create `relationship_json` instances, which are populated with data from `df_part5`. Each `relationship_json` is then stored as a JSON string in the `relationship_dicts` dictionary under the corresponding `relationship_type` inner dictionary, using a tuple of the row index and `eiti_id_declaration` as the unique key.

### Schema extension

In the context of EITI data, the interests linking an interested party (government entity or company) to a subject (company or project) refer to the monetary value or in-kind amount of taxes paid to a government entity, whether directly or in relation to a specific project. BODS does not have a specific mechanism to add arbitrary interests, so we added them in the `interests[].details` property by changing the expected value from a string to an array of objects. This allows us to add the relevant information while minimising additional nesting, in line with the BODS design philosophy.
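
As an illustration of the extension, the sketch below builds a single interest whose `details` property holds a list of objects. The field names are the ones used by the mapping function in 3.2 and the values are sample data only.

```python
import json

# Interest shape after the extension: "details" is a list of objects
# rather than the string that BODS expects.
interest = {
    "type": "rightsToProfitOrIncome",
    "directOrIndirect": "direct",
    "beneficialOwnershipOrControl": False,
    "details": [
        {
            "revenue_stream_name": "Company Income Tax",
            "revenue_value": 211097.61,
            "reporting_currency": "USD",
        }
    ],
}

print(json.dumps(interest, indent=2, ensure_ascii=False))
```

Keeping each revenue line as one flat object inside `details` adds a single level of nesting to the interest.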

3.1. We define the five possible schemas as a single dictionary, as well as five separate dictionaries to hold the JSON files mapped to each schema. 

In [11]:
relationship_schemas = {
    "country_government": {
        "subject": "",
        "interestedParty": "",
        "interests": [{
            "type": "controlByLegalFramework",
            "directOrIndirect": "direct",
            "beneficialOwnershipOrControl": False,
        }],
        "isComponent": False
    },
    "government_soe": {
        "subject": "",
        "interestedParty": "",
        "interests": [
            {
                "type": "controlByLegalFramework",
                "directOrIndirect": "direct",
                "beneficialOwnershipOrControl": False,
            },
            {
                "type": "rightsToProfitOrIncome",
                "directOrIndirect": "direct",
                "beneficialOwnershipOrControl": False,
                "details": []
            }
        ],
        "isComponent": True
    },
    "government_company": {
        "subject": "",
        "interestedParty": "",
        "interests": [{
            "type": "rightsToProfitOrIncome",
            "directOrIndirect": "direct",
            "beneficialOwnershipOrControl": False,
            "details": []
        }],
        "isComponent": True
    },
    "company_project": {
        "subject": "",
        "interestedParty": "",
        "interests": [{
            "type": "rightsGrantedByContract",
            "directOrIndirect": "direct",
            "beneficialOwnershipOrControl": False,
            "details": []
        }],
        "isComponent": True
    },
    "government_project": {
        "subject": "",
        "interestedParty": "",
        "interests": [{
            "type": "controlByLegalFramework",
            "directOrIndirect": "indirect",
            "beneficialOwnershipOrControl": False,
        }],
        "isComponent": False,
        "componentRecords": []
    }
}

relationship_dicts = {
    "country_government": {},
    "government_soe": {},
    "government_company": {},
    "company_project": {},
    "government_project": {},
}

3.2. We define a function that processes `df_part5` to generate the relationship JSONs. They are then stored in their matching inner dictionary inside `relationship_dicts`. 

TODO: store row value temporarily to be able to fill componentRecords array later by using the row as a match. Maybe in the key tuple.

In [12]:
def populate_relationships(df, relationship_type, schema, subject_col, interested_party_col, start_date_col):
    relationship_dicts[relationship_type] = {}

    for index, row in df.iterrows():

        if pd.notna(row[subject_col]) and pd.notna(row[interested_party_col]):
            relationship_json = copy.deepcopy(schema)
            relationship_json["subject"] = row[subject_col]
            relationship_json["interestedParty"] = row[interested_party_col]
            
            for interest in relationship_json["interests"]:
                interest["startDate"] = row[start_date_col]
                if "details" in interest:
                    detail = {
                        "revenue_stream_name": row["revenue_stream_name"],
                        "revenue_value": row["revenue_value"],
                        "reporting_currency": row["reporting_currency"]
                    }
                    if pd.notna(row["in_kind_volume"]):
                        detail["in_kind_volume"] = row["in_kind_volume"]
                    if pd.notna(row["in_kind_unit"]):
                        detail["in_kind_unit"] = row["in_kind_unit"]
                    interest["details"].append(detail)
            
            relationship_dicts[relationship_type][(index, row['eiti_id_declaration'])] = json.dumps(relationship_json, indent=2, ensure_ascii=False)

# Pre-filter DataFrame to avoid repetitive filtering
df_soes = df_part5[df_part5['company_type'] == "State-owned enterprises & public corporations"]
df_private = df_part5[df_part5['company_type'] == "Private"]

# Populate relationships
populate_relationships(df_part5, "country_government", relationship_schemas["country_government"], "government_entity", "iso_alpha2_code", "start_date")
populate_relationships(df_soes, "government_soe", relationship_schemas["government_soe"], "company_name", "government_entity", "start_date")
populate_relationships(df_private, "government_company", relationship_schemas["government_company"], "company_name","government_entity", "start_date")
populate_relationships(df_part5, "company_project", relationship_schemas["company_project"], "project_name", "company_name", "start_date")
populate_relationships(df_part5, "government_project", relationship_schemas["government_project"], "project_name", "government_entity", "start_date")

# Print the number of items in each dictionary
for relationship_type, relationships in relationship_dicts.items():
    print(f"{relationship_type}: {len(relationships)} items")

total_relationships = sum(len(relationships) for relationships in relationship_dicts.values())
print(f"\nfor a total of: {total_relationships} items")


country_government: 31826 items
government_soe: 2611 items
government_company: 28889 items
company_project: 12320 items
government_project: 11832 items

for a total of: 87478 items


3.3. We verify the output by printing 1 random entry from each inner dictionary of relationship_dicts

In [13]:
# function to print random samples from each relationship dictionary
def relationship_sample(relationship_dicts, num_keys=1):
    for relationship_type, relationships in relationship_dicts.items():
        print(f"Samples from {relationship_type}:")
        print_random_keys(relationships, num_keys=num_keys)

relationship_sample(relationship_dicts, num_keys=1)

Samples from country_government:
(15244, 'bde10cb7-34f9-3d15-8e13-65b1899ba250'): {
  "subject": "FONDO MEXICANO DEL PETRÓLEO",
  "interestedParty": "MX",
  "interests": [
    {
      "type": "controlByLegalFramework",
      "directOrIndirect": "direct",
      "beneficialOwnershipOrControl": false,
      "startDate": "2017-01-01"
    }
  ],
  "isComponent": false
}
----------------------------------------

Samples from government_soe:
(30154, '54bec788-8c5c-3c77-af56-09cfcb43830a'): {
  "subject": "UKRGAZVYDOBUVANNYA JSC",
  "interestedParty": "STATE TAX SERVICE OF UKRAINE",
  "interests": [
    {
      "type": "controlByLegalFramework",
      "directOrIndirect": "direct",
      "beneficialOwnershipOrControl": false,
      "startDate": "2018-01-01"
    },
    {
      "type": "rightsToProfitOrIncome",
      "directOrIndirect": "direct",
      "beneficialOwnershipOrControl": false,
      "details": [
        {
          "revenue_stream_name": "Production royalty",
          "revenue_valu

3.4. We combine the relationship dictionaries into one: 
* global_index is used to ensure that all relationships are inserted as unique entities in the new dictionary
* index is kept in order to populate the componentRecords array later as all related relationships have the same index (row)

In [14]:
relationship_dict = {}
global_index = 0

for relationship_type, relationships in relationship_dicts.items():
    for (index, eiti_id_declaration), value in relationships.items():
        # Create a new global key using the global index, original index, and eiti_id_declaration
        new_key = (global_index, index, eiti_id_declaration)
        relationship_dict[new_key] = value
        global_index += 1

# Print the total number of relationship entities
print(f"Number of relationship entities: {len(relationship_dict)}")

Number of relationship entities: 87478


3.5. We verify the output by printing 2 random `relationship_dict` entries.

In [15]:
print_random_keys(relationship_dict)

(6106, 6153, 'e821dd0c-7660-3334-a55a-732ab12351d7'): {
  "subject": "ALBANIAN CUSTOMS ADMINISTRATE",
  "interestedParty": "AL",
  "interests": [
    {
      "type": "controlByLegalFramework",
      "directOrIndirect": "direct",
      "beneficialOwnershipOrControl": false,
      "startDate": "2018-01-01"
    }
  ],
  "isComponent": false
}
----------------------------------------

(55275, 22550, '4a310016-9552-3539-bc27-fa55ce8f2f49'): {
  "subject": "DANTATA & SAWOE CONSTRUCTION COMPANY (NIGERIA) LTD",
  "interestedParty": "FEDERAL INLAND REVENUE SERVICE",
  "interests": [
    {
      "type": "rightsToProfitOrIncome",
      "directOrIndirect": "direct",
      "beneficialOwnershipOrControl": false,
      "details": [
        {
          "revenue_stream_name": "Company Income Tax",
          "revenue_value": 211097.61,
          "reporting_currency": "USD",
          "in_kind_unit": "n/v"
        }
      ],
      "startDate": "2017-01-01"
    }
  ],
  "isComponent": true
}
-------------

# Part 4 - Final matching

## Overview

1. Matching entities with statements
2. Matching relationships with statements
3. Grouping all statements

4.1 Matching entities with statements. The size of entity_statement_dict should be equal to entity_dict. 

In [16]:
entity_statement_dict = {}

for (index, eiti_id_declaration) in entity_dict.keys():

    if eiti_id_declaration in statement_dict:
        statement = json.loads(statement_dict[eiti_id_declaration])
        entity = json.loads(entity_dict[(index, eiti_id_declaration)])
        statement["recordDetails"] = entity

        # Set recordId and recordType in statement_dict
        statement["recordId"] = entity["identifiers"][0]["id"]
        statement["recordType"] = 'entity'
        
        statement["statementId"] = generate_uuid3(
                                                statement["recordDetails"]["entityType"]["type"],
                                                statement["recordDetails"]["name"],
                                                statement["recordDetails"]["jurisdiction"]["name"],
                                                statement["recordDetails"]["jurisdiction"]["code"],
                                                json.dumps(statement["recordDetails"]["identifiers"], sort_keys=True))

        entity_statement_dict[index] = json.dumps(statement, indent=2, ensure_ascii=False)

# Print the length of the combined dictionary

print(f"entity_statement_dict: {len(entity_statement_dict)} items")
print(f"entity_dict: {len(entity_dict)} items")

entity_statement_dict: 8242 items
entity_dict: 8242 items


4.2. We verify the output by printing 2 random entity_statement_dict entries.

In [17]:
print_random_keys(entity_statement_dict)

6242: {
  "statementId": "9461c774-6719-32d4-91e8-0ab16e7a6506",
  "statementDate": NaN,
  "publicationDetails": {
    "publicationDate": "2018-12-31",
    "bodsVersion": "0.4",
    "license": "http://opendatacommons.org/licenses/pddl/1.0/",
    "publisher": {
      "name": "Extractive Industries Transparency Initiative",
      "url": "https://eiti.org/open-data"
    }
  },
  "source": {
    "type": [
      "officialRegister",
      "verified"
    ],
    "description": "",
    "url": "https://eiti.portaljs.com",
    "retrievedAt": "2024-06-03",
    "assertedBy": [
      {
        "name": "Freda Effah Bortier",
        "uri": "fredarry121@yahoo.com; fredabortier@gmail.com"
      }
    ]
  },
  "declaration": "GH-20180101-20181231",
  "declarationSubject": "GH",
  "recordId": "b290ee2e-6bbe-4abb-b9e0-71dad047b6b1",
  "recordType": "entity",
  "recordDetails": {
    "isComponent": false,
    "entityType": {
      "type": "arrangement",
      "subtype": ""
    },
    "name": "RL7/2",
    "

4.3. Matching relationships with statements. The size of relationship_statement_dict should be equal to relationship_dict

In [18]:
relationship_statement_dict = {}
component_records_dict = {}
relationships_list = []
primary_relationships_set = set()

# First Loop: Process relationships and generate unique recordId
for (global_index, index, eiti_id_declaration) in relationship_dict.keys():
    if eiti_id_declaration in statement_dict:
        statement = json.loads(statement_dict[eiti_id_declaration])
        relationship = json.loads(relationship_dict[(global_index, index, eiti_id_declaration)])
        
        statement["statementId"] = generate_uuid3(
            relationship["subject"], 
            relationship["interestedParty"], 
            json.dumps(relationship["interests"], sort_keys=True)
        )

        # Generate a unique recordId for the relationship
        record_id = generate_uuid3(
            relationship["subject"], 
            relationship["interestedParty"]
        )
        statement["recordId"] = record_id
        statement["recordDetails"] = relationship
        
        # Set recordType in statement
        statement["recordType"] = 'relationship'

        # Check if the relationship is a primary relationship
        if not relationship.get("isComponent") and relationship["interests"][0].get("directOrIndirect") == "indirect":
            primary_relationships_set.add((index, eiti_id_declaration))
        
        # Store component relationships
        if relationship.get("isComponent"):
            key = (index, eiti_id_declaration)
            if key not in component_records_dict:
                component_records_dict[key] = []
            component_records_dict[key].append(record_id)
        
        # Add the statement along with its keys to the list
        relationships_list.append((index, relationship.get("isComponent"), relationship["interests"][0].get("directOrIndirect") == "indirect", global_index, statement, eiti_id_declaration))

# Second Loop: Update isComponent flag for primary relationships and prepare componentRecords
for i, (index, is_component, is_indirect, global_index, statement, eiti_id_declaration) in enumerate(relationships_list):
    relationship = statement["recordDetails"]

    # Update isComponent flag
    if (index, eiti_id_declaration) in primary_relationships_set:
        if not is_indirect:
            relationships_list[i] = (index, True, is_indirect, global_index, statement, eiti_id_declaration)
            relationship["isComponent"] = True
        else:
            relationship["isComponent"] = False

    # Check if the relationship is the primary relationship (indirect)
    if is_indirect:
        key = (index, eiti_id_declaration)
        if key in component_records_dict:
            relationship["componentRecords"] = component_records_dict[key]
        else:
            relationship["componentRecords"] = []

# Sort the relationships_list
relationships_list.sort(key=lambda x: (x[0], x[1], not x[2], x[3]))

# Third Loop: Insert the sorted relationships into the final dictionary
for _, is_component, is_indirect, global_index, statement, eiti_id_declaration in relationships_list:
    relationship_statement_dict[global_index] = json.dumps(statement, indent=2, ensure_ascii=False)

# Print the length of the combined dictionary
print(f"relationship_statement_dict: {len(relationship_statement_dict)} items")
print(f"relationship_dict: {len(relationship_dict)} items")


relationship_statement_dict: 87478 items
relationship_dict: 87478 items
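
The sort key used above, `(x[0], x[1], not x[2], x[3])`, can be checked on a few toy tuples. The sketch keeps only the first four positions of each tuple (`index`, `is_component`, `is_indirect`, `global_index`) and the values are hypothetical.

```python
# Toy tuples: (index, is_component, is_indirect, global_index)
rows = [
    (7, True, False, 30),   # component relationship
    (7, False, True, 40),   # primary (indirect) relationship
    (3, True, False, 10),
    (7, True, False, 20),
]

# Same key as in the cell above: group by row index, put non-component
# relationships first, indirect ones first, then order by global_index.
rows.sort(key=lambda x: (x[0], x[1], not x[2], x[3]))

print([r[3] for r in rows])  # [10, 40, 20, 30]
```

Within each row index, the primary (indirect) relationship therefore comes before its component relationships, which are ordered by `global_index`.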


In [19]:
print_random_keys(relationship_statement_dict)

55990: {
  "statementId": "374a82f5-accd-337f-b2e0-f64f4f0c8a6b",
  "statementDate": "2020-03-16",
  "publicationDetails": {
    "publicationDate": "2018-12-31",
    "bodsVersion": "0.4",
    "license": "http://opendatacommons.org/licenses/pddl/1.0/",
    "publisher": {
      "name": "Extractive Industries Transparency Initiative",
      "url": "https://eiti.org/open-data"
    }
  },
  "source": {
    "type": [
      "officialRegister",
      "verified"
    ],
    "description": "",
    "url": "https://eiti.portaljs.com",
    "retrievedAt": "2024-06-03",
    "assertedBy": [
      {
        "name": "Badejo Tajudeen Olaposi; Deji Adeshile",
        "uri": "audit@tbc.ng, tbadejo@tbc.ng, tajudeenolaposi@gmail.com; info@adeshileandco.com; adeshileandco@gmail.com"
      }
    ]
  },
  "declaration": "NG-20180101-20181231",
  "declarationSubject": "NG",
  "recordId": "59e53d5a-a394-3465-983f-808055a39020",
  "recordType": "relationship",
  "recordDetails": {
    "subject": "ATLAS PETROLEUM NI

Validate that groups are properly created.

In [20]:
# Check a random group of relationships
def check_random_group(relationships_list, num_groups=1):
    grouped_relationships = {}
    for relationship in relationships_list:
        index = relationship[0]
        if index not in grouped_relationships:
            grouped_relationships[index] = []
        grouped_relationships[index].append(relationship)

    random_groups = random.sample(list(grouped_relationships.keys()), num_groups)
    for group_index in random_groups:
        print(f"Relationships for index {group_index}:")
        for rel in grouped_relationships[group_index]:
            print(json.dumps(rel[4], indent=2, ensure_ascii=False))
        print("\n")

# Check one random group of relationships
check_random_group(relationships_list, num_groups=1)

Relationships for index 31760:
{
  "statementId": "4e3ae4b6-64f2-3ced-88f0-8720a0397a82",
  "statementDate": "2022-02-01",
  "publicationDetails": {
    "publicationDate": "2020-12-31",
    "bodsVersion": "0.4",
    "license": "http://opendatacommons.org/licenses/pddl/1.0/",
    "publisher": {
      "name": "Extractive Industries Transparency Initiative",
      "url": "https://eiti.org/open-data"
    }
  },
  "source": {
    "type": [
      "officialRegister",
      "verified"
    ],
    "description": "",
    "url": "https://eiti.portaljs.com",
    "retrievedAt": "2024-06-03",
    "assertedBy": [
      {
        "name": "Andrii Kitura",
        "uri": "andrii.kitura@ua.ey.com"
      }
    ]
  },
  "declaration": "UA-20200101-20201231",
  "declarationSubject": "UA",
  "recordId": "3c295d2a-ef7b-30e1-9948-61b74e021773",
  "recordType": "relationship",
  "recordDetails": {
    "subject": "SUBSOIL USE SPECIAL PERMIT NO. 592, DATED 08.05.1996",
    "interestedParty": "STATE TAX SERVICE OF 

Validate the number of relationships per group.

In [21]:
# Group relationships by index and count the number of relationships per index
grouped_relationships = {}
for relationship in relationships_list:
    index = relationship[0]
    if index not in grouped_relationships:
        grouped_relationships[index] = []
    grouped_relationships[index].append(relationship)

# Count the number of indices with 0, 1, 2, 3, or 4 relationships
relationship_counts = Counter(len(rels) for rels in grouped_relationships.values())

# Include indices with 0 relationships if needed
max_index = max(grouped_relationships.keys())
all_indices = set(range(max_index + 1))
existing_indices = set(grouped_relationships.keys())
zero_relationship_indices = all_indices - existing_indices
relationship_counts[0] = len(zero_relationship_indices)

# Calculate percentages
total_indices = len(all_indices)
relationship_percentages = {k: (v / total_indices) * 100 for k, v in relationship_counts.items()}

# Print results
print("Number of indices with 0, 1, 2, 3, or 4 relationships:")
for k in range(5):
    print(f"{k} relationships: {relationship_counts.get(k, 0)} ({relationship_percentages.get(k, 0):.2f}%)")


Number of indices with 0, 1, 2, 3, or 4 relationships:
0 relationships: 269 (0.82%)
1 relationships: 814 (2.47%)
2 relationships: 20320 (61.75%)
3 relationships: 0 (0.00%)
4 relationships: 11506 (34.96%)
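
The grouping-and-counting logic above can be tried on a small hand-made list before it is run on the full data. The sketch below is illustrative only: `count_group_sizes` and `toy_relationships` are hypothetical names, and the toy tuples mimic only the first element (the row index) of the entries in `relationships_list`.

```python
from collections import Counter, defaultdict

def count_group_sizes(relationships, total_indices=None):
    """Return a Counter mapping group size -> number of row indices of that size.

    `relationships` is an iterable of tuples whose first element is the row index.
    If `total_indices` is given, indices in range(total_indices) that never
    appear are counted under size 0.
    """
    groups = defaultdict(list)
    for rel in relationships:
        groups[rel[0]].append(rel)

    sizes = Counter(len(rels) for rels in groups.values())
    if total_indices is not None:
        missing = set(range(total_indices)) - set(groups)
        sizes[0] = len(missing)
    return sizes

# Hypothetical toy data: index 0 has two relationships,
# index 2 has four, and index 1 has none.
toy_relationships = [(0, "a"), (0, "b"), (2, "c"), (2, "d"), (2, "e"), (2, "f")]
toy_sizes = count_group_sizes(toy_relationships, total_indices=3)
print(dict(toy_sizes))
```

Because a `Counter` returns 0 for missing keys, a group size that never occurs (such as 3 in the output above) can be read without a `.get` fallback.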


We check that `componentRecords` have been assigned properly.

In [22]:
# Third Loop: Insert the sorted relationships into the final dictionary
for _, is_component, is_indirect, global_index, statement, eiti_id_declaration in relationships_list:
    relationship_statement_dict[global_index] = json.dumps(statement, indent=2, ensure_ascii=False)

# Counting Mechanism
total_with_component_records = 0
zero_elements = 0
one_element = 0
two_elements = 0

for statement_json in relationship_statement_dict.values():
    statement = json.loads(statement_json)
    relationship = statement["recordDetails"]
    if "componentRecords" in relationship:
        total_with_component_records += 1
        num_elements = len(relationship["componentRecords"])
        if num_elements == 0:
            zero_elements += 1
        elif num_elements == 1:
            one_element += 1
        elif num_elements == 2:
            two_elements += 1

# Print the results
print(f"Total relationships with componentRecords: {total_with_component_records}")
print(f"Number of relationships with 0 componentRecords: {zero_elements}")
print(f"Number of relationships with 1 componentRecords: {one_element}")
print(f"Number of relationships with 2 componentRecords: {two_elements}")


Total relationships with componentRecords: 11832
Number of relationships with 0 componentRecords: 326
Number of relationships with 1 componentRecords: 0
Number of relationships with 2 componentRecords: 11506
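
The counters above only track lengths 0, 1 and 2, so a statement with three or more `componentRecords` would go unnoticed. As a minimal sketch, assuming the same JSON-string layout as `relationship_statement_dict`, every length that occurs can be tallied instead. The function name `component_record_lengths` and the toy statements are hypothetical.

```python
import json
from collections import Counter

def component_record_lengths(statement_dict):
    """Tally len(recordDetails['componentRecords']) over JSON-string statements.

    Statements without a 'componentRecords' key are tallied under None.
    """
    lengths = Counter()
    for statement_json in statement_dict.values():
        details = json.loads(statement_json)["recordDetails"]
        if "componentRecords" in details:
            lengths[len(details["componentRecords"])] += 1
        else:
            lengths[None] += 1
    return lengths

# Hypothetical toy statements covering each case once
toy_statements = {
    0: json.dumps({"recordDetails": {"componentRecords": []}}),
    1: json.dumps({"recordDetails": {"componentRecords": ["r1", "r2"]}}),
    2: json.dumps({"recordDetails": {"componentRecords": ["r3", "r4", "r5"]}}),
    3: json.dumps({"recordDetails": {}}),
}
toy_lengths = component_record_lengths(toy_statements)
print(toy_lengths)
```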


We join `entity_statement_dict` and `relationship_statement_dict` into a single dictionary in which each key is an `eiti_id_declaration`, to facilitate verification and export.

In [23]:
# Create a combined dictionary of dictionaries using OrderedDict
combined_dict = defaultdict(OrderedDict)

# Add entity statements first and then add corresponding relationship statements
for index, entity_statement_json in entity_statement_dict.items():
    entity_statement = json.loads(entity_statement_json)
    eiti_id_declaration = entity_statement["declaration"]

    # Initialize the dictionary for this eiti_id_declaration if it doesn't exist
    if eiti_id_declaration not in combined_dict:
        combined_dict[eiti_id_declaration] = OrderedDict()

    # Add the entity statement to the combined dictionary
    combined_dict[eiti_id_declaration][index] = entity_statement_json

# Add relationship statements next
for global_index, statement_json in relationship_statement_dict.items():
    relationship_statement = json.loads(statement_json)
    eiti_id_declaration = relationship_statement["declaration"]

    # Initialize the dictionary for this eiti_id_declaration if it doesn't exist
    if eiti_id_declaration not in combined_dict:
        combined_dict[eiti_id_declaration] = OrderedDict()

    # Add the relationship statement to the combined dictionary
    combined_dict[eiti_id_declaration][global_index] = statement_json
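
The two loops above follow the same pattern, and `defaultdict(OrderedDict)` already creates the inner dictionary on first access. If an entity key and a relationship key were ever identical within one declaration, the later write would silently replace the earlier one; whether that can happen depends on how the keys were generated earlier in the notebook. The sketch below is one possible way to make such a collision visible. The helper name `combine_by_declaration` and the toy data are hypothetical.

```python
import json
from collections import OrderedDict, defaultdict

def combine_by_declaration(*statement_dicts):
    """Merge {key: json_string} dicts into {declaration: OrderedDict(key -> json_string)}.

    Raises KeyError if the same key would be written twice for one declaration.
    """
    combined = defaultdict(OrderedDict)  # inner OrderedDict is created on first access
    for statement_dict in statement_dicts:
        for key, statement_json in statement_dict.items():
            declaration = json.loads(statement_json)["declaration"]
            if key in combined[declaration]:
                raise KeyError(f"duplicate key {key!r} in declaration {declaration!r}")
            combined[declaration][key] = statement_json
    return combined

# Hypothetical toy statements: two entities and one relationship
toy_entity_statements = {
    "e1": json.dumps({"declaration": "XX-1", "recordType": "entity"}),
    "e2": json.dumps({"declaration": "YY-1", "recordType": "entity"}),
}
toy_relationship_statements = {
    "r1": json.dumps({"declaration": "XX-1", "recordType": "relationship"}),
}
toy_combined = combine_by_declaration(toy_entity_statements, toy_relationship_statements)
print({decl: list(stmts) for decl, stmts in toy_combined.items()})
```

Passing the entity dictionary first preserves the ordering used above: entity statements come before relationship statements within each declaration.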


# Part 5 - Verification and export

### Overview

Here we check that the generated dataset matches our expectations.

To do so, we summarise the most meaningful features in a table: for each declaration, the country, the number of entity statements, the number of relationship statements, and the ratio between the two.

In [26]:
# Initialize counters for entities and relationships
results = []

# Iterate over combined_dict to count entity and relationship statements
for eiti_id_declaration, statements in combined_dict.items():
    entity_count = 0
    relationship_count = 0
    country = None

    for key, statement_json in statements.items():
        statement = json.loads(statement_json)
        record_type = statement["recordType"]
        if record_type == "entity":
            entity_count += 1
            if country is None:
                country = statement["recordDetails"].get("iso_alpha2_code", "Unknown")
        elif record_type == "relationship":
            relationship_count += 1
            if country is None:
                country = statement["recordDetails"].get("iso_alpha2_code", "Unknown")

    ratio = round(entity_count / relationship_count, 2) if relationship_count > 0 else 0
    results.append([eiti_id_declaration, country, entity_count, relationship_count, ratio])

# Create the DataFrame
results_df = pd.DataFrame(results, columns=['eiti_id_declaration', 'country', 'number of entity statements', 'number of relationship statements', 'ratio entity:relationship'])

# Print the DataFrame
display(results_df)


| | eiti_id_declaration | country | number of entity statements | number of relationship statements | ratio entity:relationship |
|---|---|---|---|---|---|
| 0 | AF-20171221-20181220 | Unknown | 77 | 2630 | 0.03 |
| 1 | AF-20181221-20191220 | Unknown | 272 | 7876 | 0.03 |
| 2 | AL-20170101-20171231 | Unknown | 130 | 1454 | 0.09 |
| 3 | AL-20180101-20181231 | Unknown | 148 | 1941 | 0.08 |
| 4 | AR-20180101-20181231 | Unknown | 28 | 290 | 0.10 |
| ... | ... | ... | ... | ... | ... |
| 68 | UA-20190101-20191231 | Unknown | 381 | 2280 | 0.17 |
| 69 | UA-20200101-20201231 | Unknown | 505 | 4172 | 0.12 |
| 70 | ZM-20170101-20171231 | Unknown | 28 | 298 | 0.09 |
| 71 | ZM-20180101-20181231 | Unknown | 30 | 357 | 0.08 |
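
The country column in the table above reads `Unknown` for every row, which suggests that `iso_alpha2_code` is not a key of `recordDetails` in the generated statements. A possible fallback, sketched below, reads the code from `declarationSubject` instead. This rests on the assumption that `declarationSubject` holds the ISO alpha-2 code, as it does in the sample relationship statements printed earlier (`"NG"`, `"UA"`). The helper name `country_of` is hypothetical.

```python
import json

def country_of(statement):
    """Return a country code for a statement dict, trying several locations.

    Assumption: 'declarationSubject' holds the ISO alpha-2 code, as in the
    sample relationship statements printed earlier in the notebook.
    """
    details = statement.get("recordDetails", {})
    return (
        details.get("iso_alpha2_code")
        or statement.get("declarationSubject")
        or "Unknown"
    )

# Hypothetical toy statement shaped like the relationship statements above
toy_statement = json.loads(
    '{"declaration": "UA-20200101-20201231", "declarationSubject": "UA", '
    '"recordType": "relationship", "recordDetails": {"subject": "x"}}'
)
print(country_of(toy_statement))
```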


We double-check these figures by computing the same counts directly from the source dataframe `df_part5`.

In [25]:
df_check2 = df_part5[['eiti_id_company','eiti_id_project','eiti_id_government','iso_alpha2_code','eiti_id_declaration']]

# Initialize counters for entities and relationships
entity_counts = defaultdict(lambda: {'count': 0, 'country': None})
relationship_counts = defaultdict(lambda: {'count': 0, 'country': None})

# Iterate over all rows directly
for index, row in df_check2.iterrows():
    declaration = row['eiti_id_declaration']
    country = row['iso_alpha2_code']

    # Set the country for the declaration if not already set
    if entity_counts[declaration]['country'] is None:
        entity_counts[declaration]['country'] = country
    if relationship_counts[declaration]['country'] is None:
        relationship_counts[declaration]['country'] = country

    # Count entities
    if pd.notna(row['eiti_id_company']):
        entity_counts[declaration]['count'] += 1
    if pd.notna(row['eiti_id_project']):
        entity_counts[declaration]['count'] += 1
    if pd.notna(row['eiti_id_government']):
        entity_counts[declaration]['count'] += 1

    # Count relationships
    if pd.notna(row['iso_alpha2_code']) and pd.notna(row['eiti_id_government']):
        relationship_counts[declaration]['count'] += 1
    if pd.notna(row['eiti_id_government']) and pd.notna(row['eiti_id_company']):
        relationship_counts[declaration]['count'] += 1
    if pd.notna(row['eiti_id_company']) and pd.notna(row['eiti_id_project']):
        relationship_counts[declaration]['count'] += 1
    if pd.notna(row['eiti_id_government']) and pd.notna(row['eiti_id_project']):
        relationship_counts[declaration]['count'] += 1

# Prepare data for the DataFrame
results = []

for declaration in entity_counts:
    entity_count = entity_counts[declaration]['count']
    relationship_count = relationship_counts[declaration]['count']
    country = entity_counts[declaration]['country']
    ratio = round(entity_count / relationship_count, 2) if relationship_count > 0 else 0
    results.append([declaration, country, entity_count, relationship_count, ratio])

# Create the DataFrame
results_df = pd.DataFrame(results, columns=['eiti_id_declaration', 'country', 'number of entity statements', 'number of relationship statements', 'ratio entity:relationship'])

# Print the DataFrame
display(results_df)

| | eiti_id_declaration | country | number of entity statements | number of relationship statements | ratio entity:relationship |
|---|---|---|---|---|---|
| 0 | 1f61bd83-c1cd-3658-8fab-29ba86d584a7 | AF | 2467 | 2620 | 0.94 |
| 1 | c40776d5-a273-3f9d-b805-075e804b9f3e | AF | 7408 | 7658 | 0.97 |
| 2 | fb121867-d9c5-3112-90f5-f798ed67c49d | AL | 1461 | 1452 | 1.01 |
| 3 | e821dd0c-7660-3334-a55a-732ab12351d7 | AL | 1834 | 1938 | 0.95 |
| 4 | 2df5c073-4216-3f74-a2cd-23aa44dd3c9c | AR | 245 | 214 | 1.14 |
| ... | ... | ... | ... | ... | ... |
| 68 | f1a966c7-d6b9-3cb7-a10a-b9cb0f63a4c7 | UA | 1945 | 2280 | 0.85 |
| 69 | fef32215-a021-3118-bdc3-a44079a72bdd | UA | 3305 | 4172 | 0.79 |
| 70 | acb80970-c46e-3169-8ba6-d85c21d30aee | ZM | 213 | 239 | 0.89 |
| 71 | 10d0408e-ef95-35ca-bb4e-07f9aa1c2979 | ZM | 257 | 289 | 0.89 |
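
The row-by-row loop above can also be written with vectorised pandas operations, which may be easier to audit on a dataframe of this size. The sketch below is one possible formulation, not the notebook's method. It applies the same counting rules to a hypothetical toy dataframe; `count_per_declaration`, `ID_COLS` and `toy_df` are illustrative names. Note that `groupby` drops rows whose `eiti_id_declaration` is missing, whereas the loop would count them under a NaN key.

```python
import pandas as pd

ID_COLS = ["eiti_id_company", "eiti_id_project", "eiti_id_government"]

def count_per_declaration(df):
    """Count entities and relationships per declaration, using the loop's rules."""
    present = df[ID_COLS].notna()
    has_country = df["iso_alpha2_code"].notna()

    # One entity per non-null identifier in the row
    entity_rows = present.sum(axis=1)

    # One relationship per pair of fields that are both present
    relationship_rows = (
        (has_country & present["eiti_id_government"]).astype(int)
        + (present["eiti_id_government"] & present["eiti_id_company"]).astype(int)
        + (present["eiti_id_company"] & present["eiti_id_project"]).astype(int)
        + (present["eiti_id_government"] & present["eiti_id_project"]).astype(int)
    )

    counts = pd.DataFrame({
        "eiti_id_declaration": df["eiti_id_declaration"],
        "entities": entity_rows,
        "relationships": relationship_rows,
    })
    return counts.groupby("eiti_id_declaration", sort=False).sum()

# Hypothetical toy rows: a full row, a government-only row, a company-only row
toy_df = pd.DataFrame({
    "eiti_id_company": ["c1", None, "c2"],
    "eiti_id_project": ["p1", None, None],
    "eiti_id_government": ["g1", "g2", None],
    "iso_alpha2_code": ["XX", "XX", "YY"],
    "eiti_id_declaration": ["d1", "d1", "d2"],
})
toy_counts = count_per_declaration(toy_df)
print(toy_counts)
```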
