# Introduction

### Overview

This notebook describes the mapping process of EITI data to the Beneficial Ownership Data Standard [BODS 0.4](https://github.com/openownership/data-standard/tree/main/schema. It is structured in 5 sections: 
0. Prerequisites
1. Statement Mapping
2. Entity Mapping
3. Relationship mapping
4. Final matching
5. Declaration export

### Mapping process

The mapping sections are broadly structured in 3 parts: 
1. Schema and dictionary definition
2. Mapping function
3. Output verification 

They rely on a [mapping reference](https://docs.google.com/spreadsheets/d/1CPeZ_5FiqIRCmHGHh7Gz1McpxmwN1EoBwkMYtRqFWFo/edit?pli=1#gid=134387124) made possible by [flattening the BODS json schema](https://github.com/civicliteracies/EITI_SDT_data_verification_and_validation/blob/sqlite/4_clean/3_bods_mapping/02_schema_flattening.ipynb) files.

The mapping process uses dictionaries to hold the target data structures, and the data is transformed using the following logic: 

`<bods_object>_schema` serves as a blueprint for the `<bods_object>_json` instances that are populated with data from `df_<dataset>`, assigned a unique identifier `<bods_object>_dict_key`, and stored as JSON strings in the `<bods_object>_dict` dictionary.

where `<bods_object>` can be either statement, entity or relationship.


# Part 0 - Prerequisites

### Overview
1. Import libraries
3. Import the data as dataframes
2. Define utility functions and variables to be used across the code


0.1. We import the appropriate libraries

In [49]:
import pandas as pd
import json
import random
import copy
import uuid

0.2. We import the relevant datasets directly from Github to facilitate replication.

In [31]:
url_part1=('https://raw.githubusercontent.com/civicliteracies/EITI_SDT_data_verification_and_validation/sqlite/4_clean/2_data_editing/output/eiti-data_part1_1.3.csv')
url_part5=('https://raw.githubusercontent.com/civicliteracies/EITI_SDT_data_verification_and_validation/sqlite/4_clean/2_data_editing/output/eiti-data_part5-0.11.8.csv')

df_part1 = pd.read_csv(url_part1)
df_part5 = pd.read_csv(url_part5, low_memory=False)

0.3. we define the various utility functions needed for later: 
* a uuid3 function used to create a recordID for relationship entities
* a uuid4 function used to create a statementID for relationship entities
* a function to print 2 random items from a dictionary

In [32]:
# generate UUID3 based on subject and interestedParty
def generate_uuid3(subject, interestedParty):
    namespace = uuid.UUID('00000000-0000-0000-0000-000000000000')
    name = f"{subject}-{interestedParty}"
    return str(uuid.uuid3(namespace, name))

# Generate UUID4 for statementId
def generate_uuid4():
    return str(uuid.uuid4())

# Print a sample of 2 random items from the dictionary containing JSON strings
def print_random_keys(dictionary, num_keys=2):
    separator = "-" * 40
    random_keys = random.sample(list(dictionary.keys()), num_keys)
    
    for random_key in random_keys:
        print(f"{random_key}: {dictionary[random_key]}\n{separator}\n")


# Part 1 - Generating statements

### Overview

1. Schema and dictionary definition
2. Mapping
3. Output verification

### Logic

`statement_schema` serves as a blueprint for creating `statement_json` instances, which are populated with data from `df_part1`, assigned unique identifiers `statement_dict_key`, and stored as JSON strings in the `statement_dict` dictionary.

1.1 We define the schema and create the dictionary to hold the mappped JSONs.

In [35]:
# BODS statement structure template
statement_schema = {
    "statementId": "",
    "statementDate": "",
    "annotations": [],
    "publicationDetails": {
        "publicationDate": "",
        "bodsVersion": "",
        "license": "",
        "publisher": {
            "name": "",
            "url": ""
        }
    },
    "source": {
        "type": [],
        "description": "",
        "url": "",
        "retrievedAt": "",
        "assertedBy": [
            {
                "name": "",
                "uri": ""
            }
        ]
    },
    "declaration": "",
    "declarationSubject": "",
    "recordId": "",
    "recordType": "",
    "recordDetails": {}
}

# Dictionary to hold the JSON strings
statement_dict = {}

1.2. We loop through part1 data to generate the JSON based on the mapping rules and we print the number of created JSONs for verification.

In [37]:

# Iterate over each row in df_part1
for index, row in df_part1.iterrows():
    statement_json = statement_schema.copy()

    # Fill the statement_json with data from the row
    statement_json["statementId"] = ''
    statement_json["statementDate"] = row['eiti_data_publication_date']
    statement_json["publicationDetails"]["publicationDate"] = row['end_date']
    statement_json["publicationDetails"]["bodsVersion"] = '0.4'
    statement_json["publicationDetails"]["license"] = 'http://opendatacommons.org/licenses/pddl/1.0/'
    statement_json["publicationDetails"]["publisher"]["name"] = 'Extractive Industries Transparency Initiative'
    statement_json["publicationDetails"]["publisher"]["url"] = 'https://eiti.org/open-data'
    statement_json["source"]["type"] = ['officialRegister', 'verified']
    statement_json["source"]["url"] = 'https://eiti.portaljs.com'
    statement_json["source"]["retrievedAt"] = pd.Timestamp('today').strftime('%Y-%m-%d')
    statement_json["source"]["assertedBy"][0]["name"] = row['submitter_name']
    statement_json["source"]["assertedBy"][0]["uri"] = row['submitter_email']
    statement_json["declaration"] = f"{row['iso_alpha2_code']}-{row['start_date'].replace('-', '')}-{row['end_date'].replace('-', '')}"
    statement_json["declarationSubject"] = row['iso_alpha2_code']
    statement_json["recordId"] = ''
    statement_json["recordType"] = ''
    
    # Create a key based on the statement identifier
    statement_dict_key = row['eiti_id_declaration']
    
    # Save the JSON string in the dictionary
    statement_dict[statement_dict_key] = json.dumps(statement_json, indent=2, ensure_ascii=False)



print(f"The dictionnary has {len(statement_dict.keys())} items")

The dictionnary has 73 items


1.3. We verify the output by printing 2 random statement_dict entries.

In [33]:
print_random_keys(statement_dict)

4c603c65-856f-307e-9317-a0aafd609fd9: {
  "statementId": "",
  "statementDate": "2020-09-14",
  "annotations": [],
  "publicationDetails": {
    "publicationDate": "2018-12-31",
    "bodsVersion": "0.4",
    "license": "http://opendatacommons.org/licenses/pddl/1.0/",
    "publisher": {
      "name": "Extractive Industries Transparency Initiative",
      "url": "https://eiti.org/open-data"
    }
  },
  "source": {
    "type": [
      "officialRegister",
      "verified"
    ],
    "description": "",
    "url": "https://eiti.portaljs.com",
    "retrievedAt": "2024-05-30",
    "assertedBy": [
      {
        "name": "Heghine Ghukasyan",
        "uri": "heghine.ghukasyan@am.ey.com"
      }
    ]
  },
  "declaration": "AM-20180101-20181231",
  "declarationSubject": "AM",
  "recordId": "",
  "recordType": "",
  "recordDetails": {}
}
----------------------------------------

bfa84e89-dbd8-3bc8-b038-9a9b3de9e663: {
  "statementId": "",
  "statementDate": "2019-12-31",
  "annotations": [],
  "p

# Part 2. Generating Entities

### Overview

1. Entity data preparation
2. Schema and dictionary definition
3. Mapping
4. Output verification

### Logic

`entity_schema` serves as a blueprint for creating `entity_json` instances, which are populated with data from `df_part1`, assigned unique identifiers `entity_dict_key`, and stored as JSON strings in the `entity_dict` dictionary.

2.1. We create a dataframe that holds only the unique values for each type of entity (companies, projects, government entities) while assigning them the proper label in the `entity_type` column. 

In [38]:
# Extract unique entities and add entity type
unique_companies = df_part5[['company_name', 'eiti_id_company', 'iso_alpha2_code', 'country', 'company_public_listing_or_website', 'start_date', 'end_date', 'eiti_id_declaration']].dropna(subset=['eiti_id_company']).drop_duplicates().assign(entity_type='registeredEntity')
unique_projects = df_part5[['project_name', 'eiti_id_project', 'iso_alpha2_code', 'country', 'start_date', 'end_date', 'eiti_id_declaration']].dropna(subset=['eiti_id_project']).drop_duplicates().assign(entity_type='arrangement')
unique_government = df_part5[['government_entity', 'eiti_id_government', 'iso_alpha2_code', 'country', 'start_date', 'end_date', 'eiti_id_declaration']].dropna(subset=['eiti_id_government']).drop_duplicates().assign(entity_type='stateBody')


# Combine into a single DataFrame
df_entities = pd.concat([unique_companies, unique_projects, unique_government], ignore_index=True)

print(f"The dataframe has {len(df_entities.index)} rows\n")

The dataframe has 8242 rows



2.2. We define the schema and create the dictionary to hold the mappped JSONs.

In [48]:
# Define the entity schema
entity_schema = {
    "isComponent": False,
    "entityType": {
        "type": "",
        "subtype": "",
        "details": ""
    },
    "name": "",
    "jurisdiction": {
        "name": "",
        "code": ""
    },
    "identifiers": [],
    "addresses": [],
    "uri": "",
    "publicListing": None,
    "formedByStatute": None
}

# Create the entity dictionary
entity_dict = {}

2.3. We loop through `df_entities` to generate the mapped entity JSONs before stroing them in `entity_dict`.  The size of the `entity_dict` should match the number of rows of `df_entities`

In [None]:

# Iterate over each row in df_entities to create JSON files
for index, row in df_entities.iterrows():

    entity_json = entity_schema.copy()

    entity_json["isComponent"] = False
    entity_json["entityType"]["type"] = row['entity_type']
    entity_json["entityType"]["subtype"] = (
        'governmentDepartment' if row['entity_type'] == 'stateBody' and 'minist' in str(row['government_entity']).lower() else
        'stateAgency' if row['entity_type'] == 'stateBody' else ''
    )
    entity_json["name"] = (
        row['company_name'] if row['entity_type'] == 'registeredEntity' else
        row['project_name'] if row['entity_type'] == 'arrangement' else
        row['government_entity']
    )
    entity_json["jurisdiction"]["name"] = row['country']
    entity_json["jurisdiction"]["code"] = row['iso_alpha2_code']
    entity_json["identifiers"] = [{
        "id": (
            row['eiti_id_company'] if row['entity_type'] == 'registeredEntity' else
            row['eiti_id_project'] if row['entity_type'] == 'arrangement' else
            row['eiti_id_government']
        ),
        "scheme": "XI-EITI",
        "schemeName": "Extractive Industries Transparency Initiative",
        "uri": f"/entity_statement/{row['eiti_id_company'] if row['entity_type'] == 'registeredEntity' else row['eiti_id_project'] if row['entity_type'] == 'arrangement' else row['eiti_id_government']}"
    }]
    entity_json["uri"] = row['company_public_listing_or_website']
    
    # Create the dictionary key
    entity_dict_key = (index, row['eiti_id_declaration'])

    # Insert entity JSONs in the dictionary alongside their matching keys
    entity_dict[entity_dict_key] = json.dumps(entity_json, indent=2, ensure_ascii=False)

# Clear process status with a final message
print(f"The dictionnary has {len(entity_dict.keys())} items")


2.4. We verify the output by printing 2 random statement_dict entries.

In [47]:
# Display 2 random items for quality check
print_random_keys(entity_dict)

(8128, '4a310016-9552-3539-bc27-fa55ce8f2f49'): {
  "isComponent": false,
  "entityType": {
    "type": "stateBody",
    "subtype": "stateAgency",
    "details": ""
  },
  "name": "NIGER DELTA DEVELOPMENT COMMISSION",
  "jurisdiction": {
    "name": "Nigeria",
    "code": "NG"
  },
  "identifiers": [
    {
      "id": "af5ac417-4623-437b-84e4-692b0ff135bc",
      "scheme": "XI-EITI",
      "schemeName": "Extractive Industries Transparency Initiative",
      "uri": "/entity_statement/af5ac417-4623-437b-84e4-692b0ff135bc"
    }
  ],
  "addresses": [],
  "uri": NaN,
  "publicListing": null,
  "formedByStatute": null
}
----------------------------------------

(8179, 'e7c2170f-97e5-3330-9cdf-3e27e3dc7ca3'): {
  "isComponent": false,
  "entityType": {
    "type": "stateBody",
    "subtype": "stateAgency",
    "details": ""
  },
  "name": "DIRECTION GÉNÉRALE DU TRÉSOR ET DE LA COMPTABILITÉ PUBLIQUE (DGTCP)",
  "jurisdiction": {
    "name": "Chad",
    "code": "TD"
  },
  "identifiers": [
   

# Part 3 - Relationships

## Overview

1. Schema and dictionary definition
2. Mapping to the different relationship schemas
3. Output verification
4. Consolidation

## Logic 

### Core mapping

EITI data describes multiple relationships, requiring the definition of several schemas. We defined 5 types of relationships and assigned the following attributes

| InterestedParty | Subject | directOrIndirect | descriptor |
| ---- | ---- | ---- | ---- |
| Country | Government Agency | direct | controlByLegalFramework |
| Government Agency | Company (SOE) | direct | controlByLegalFramework, rightsToProfitOrIncome |
| Government Agency | Company (Private) | direct | rightsToProfitOrIncome |
| Company | Project | direct | rightsGrantedByContract |
| Government Agency | Project | indirect | controlByLegalFramework |

Those are used in the five different `relationship_schemas`. 

The `populate_relationships` function uses `relationship_schemas` as a template to create `relationship_json` instances, which are populated with data from the `df_part5`. Each `relationship_json` is then stored as a JSON string in the `relationship_dicts` dictionary under the corresponding `relationship_type` inner dictionary, using a tuple of the row index and `eiti_id_declaration` as the unique key.

### Schema extension

In the context of EITI data, the interests linking an InterestedParty (government entity/company) to a subject (company, project) refer to the monetary value or in-kind amount of taxes paid to a government entity, whether directly or in relation to a specific project. BODS does not have a specific mechanism to add arbitrary interests, so we added them in interests[].details property by transforming the expected value from a string to an array of objects. This allow us to add the relevant information while minisming the additional nesting level, following BODS design philosophy.

3.1. We define the five possible schemas as a single dictionary, as well as five separate dictionaries to hold the JSON files mapped to each schema. 

In [51]:
relationship_schemas = {
    "country_government": {
        "subject": "",
        "interestedParty": "",
        "interests": [{
            "type": "controlByLegalFramework",
            "directOrIndirect": "direct",
            "beneficialOwnershipOrControl": False,
        }],
        "isComponent": False
    },
    "government_soe": {
        "subject": "",
        "interestedParty": "",
        "interests": [
            {
                "type": "controlByLegalFramework",
                "directOrIndirect": "direct",
                "beneficialOwnershipOrControl": False,
            },
            {
                "type": "rightsToProfitOrIncome",
                "directOrIndirect": "direct",
                "beneficialOwnershipOrControl": False,
                "details": []
            }
        ],
        "isComponent": True
    },
    "government_company": {
        "subject": "",
        "interestedParty": "",
        "interests": [{
            "type": "rightsToProfitOrIncome",
            "directOrIndirect": "direct",
            "beneficialOwnershipOrControl": False,
            "details": []
        }],
        "isComponent": True
    },
    "company_project": {
        "subject": "",
        "interestedParty": "",
        "interests": [{
            "type": "rightsGrantedByContract",
            "directOrIndirect": "direct",
            "beneficialOwnershipOrControl": False,
            "details": []
        }],
        "isComponent": True
    },
    "government_project": {
        "subject": "",
        "interestedParty": "",
        "interests": [{
            "type": "controlByLegalFramework",
            "directOrIndirect": "indirect",
            "beneficialOwnershipOrControl": False,
        }],
        "isComponent": False,
        "componentRecords": []
    }
}

relationship_dicts = {
    "country_government": {},
    "government_soe": {},
    "government_company": {},
    "company_project": {},
    "government_project": {},
}

3.2. we define a function to map and process df_part5 to generate the relationship JSONs. They are then stored within their matching inner dictionary inside of relationship_dicts. 

In [59]:
def populate_relationships(df, relationship_type, schema, subject_col, interested_party_col, start_date_col):
    relationship_dicts[relationship_type] = {}

    for index, row in df.iterrows():

        if pd.notna(row[subject_col]) and pd.notna(row[interested_party_col]):
            relationship_json = copy.deepcopy(schema)
            relationship_json["subject"] = row[subject_col]
            relationship_json["interestedParty"] = row[interested_party_col]
            
            for interest in relationship_json["interests"]:
                interest["startDate"] = row[start_date_col]
                if "details" in interest:
                    detail = {
                        "revenue_stream_name": row["revenue_stream_name"],
                        "revenue_value": row["revenue_value"],
                        "reporting_currency": row["reporting_currency"]
                    }
                    if pd.notna(row["in_kind_volume"]):
                        detail["in_kind_volume"] = row["in_kind_volume"]
                    if pd.notna(row["in_kind_unit"]):
                        detail["in_kind_unit"] = row["in_kind_unit"]
                    interest["details"].append(detail)
            
            relationship_dicts[relationship_type][(index, row['eiti_id_declaration'])] = json.dumps(relationship_json, indent=2, ensure_ascii=False)

# Pre-filter DataFrame to avoid repetitive filtering
df_soes = df_part5[df_part5['company_type'] == "State-owned enterprises & public corporations"]
df_private = df_part5[df_part5['company_type'] == "Private"]

# Populate relationships
populate_relationships(df_part5, "country_government", relationship_schemas["country_government"], "government_entity", "iso_alpha2_code", "start_date")
populate_relationships(df_soes, "government_soe", relationship_schemas["government_soe"], "company_name", "government_entity", "start_date")
populate_relationships(df_private, "government_company", relationship_schemas["government_company"], "company_name","government_entity", "start_date")
populate_relationships(df_part5, "company_project", relationship_schemas["company_project"], "project_name", "company_name", "start_date")
populate_relationships(df_part5, "government_project", relationship_schemas["government_project"], "project_name", "government_entity", "start_date")

# Print the number of items in each dictionary
for relationship_type, relationships in relationship_dicts.items():
    print(f"{relationship_type}: {len(relationships)} items")


country_government: 31826 items
government_soe: 2611 items
government_company: 28889 items
company_project: 12320 items
government_project: 11832 items


3.3. We verify the output by printing 1 random entry from each inner dictionary of relationship_dicts

In [60]:
# function to print random samples from each relationship dictionary
def relationship_sample(relationship_dicts, num_keys=1):
    for relationship_type, relationships in relationship_dicts.items():
        print(f"Samples from {relationship_type}:")
        print_random_keys(relationships, num_keys=num_keys)

relationship_sample(relationship_dicts, num_keys=1)

Samples from country_government:
(5985, 'e821dd0c-7660-3334-a55a-732ab12351d7'): {
  "subject": "ALBANIAN CUSTOMS ADMINISTRATE",
  "interestedParty": "AL",
  "interests": [
    {
      "type": "controlByLegalFramework",
      "directOrIndirect": "direct",
      "beneficialOwnershipOrControl": false,
      "startDate": "2018-01-01"
    }
  ],
  "isComponent": false
}
----------------------------------------

Samples from government_soe:
(32384, 'fef32215-a021-3118-bdc3-a44079a72bdd'): {
  "subject": "UKRNAFTA PJSC",
  "interestedParty": "STATE TAX SERVICE OF UKRAINE",
  "interests": [
    {
      "type": "controlByLegalFramework",
      "directOrIndirect": "direct",
      "beneficialOwnershipOrControl": false,
      "startDate": "2020-01-01"
    },
    {
      "type": "rightsToProfitOrIncome",
      "directOrIndirect": "direct",
      "beneficialOwnershipOrControl": false,
      "details": [
        {
          "revenue_stream_name": "Production royalty",
          "revenue_value": NaN,

3.4. We combine the relationship dictionaries into one. 

In [63]:
relationship_dict = {}
index = 0

for relationship_type, relationships in relationship_dicts.items():
    for key, value in relationships.items():
        # Create a new global key using the global index
        new_key = (index, eiti_id_declaration)
        relationship_dict[new_key] = value
        index += 1

# Print the total number of relationship entities
print(f"Number of relationship entities: {len(relationship_dict)}")

Number of relationship entities: 87478


3.5. We verify the output by printing 2 random statement_dict entries.

In [67]:
print_random_keys(relationship_dict)

(77219, '5abb2996-ea0c-36b0-a728-8e9de6fe4f97'): {
  "subject": "LIANZI - NEMBA",
  "interestedParty": "DIRECTION GÉNÉRALE DES IMPÔTS ET DES DOMAINES (DGID)",
  "interests": [
    {
      "type": "controlByLegalFramework",
      "directOrIndirect": "indirect",
      "beneficialOwnershipOrControl": false,
      "startDate": "2017-01-01"
    }
  ],
  "isComponent": false,
  "componentRecords": []
}
----------------------------------------

(84658, '5abb2996-ea0c-36b0-a728-8e9de6fe4f97'): {
  "subject": "TUBAY NICKEL-COBALT PROJECT",
  "interestedParty": "BUREAU OF INTERNAL REVENUE (BIR)",
  "interests": [
    {
      "type": "controlByLegalFramework",
      "directOrIndirect": "indirect",
      "beneficialOwnershipOrControl": false,
      "startDate": "2018-01-01"
    }
  ],
  "isComponent": false,
  "componentRecords": []
}
----------------------------------------



# Part 4 - Final matching

## Overview

1. Matching entities with statements
2. Matching relationships with statements
3. Grouping all statements

4.1 Matching entities with statements.

In [68]:
entity_statement_dict = {}

for (index, eiti_id_declaration) in entity_dict.keys():
    if eiti_id_declaration in statement_dict:
        statement = json.loads(statement_dict[eiti_id_declaration])
        entity = json.loads(entity_dict[(index, eiti_id_declaration)])
        statement["recordDetails"] = entity

        # Set recordId and recordType in statement_dict
        statement["recordId"] = entity["identifiers"][0]["id"]
        statement["recordType"] = 'entity'
        
        entity_statement_dict[index] = json.dumps(statement, indent=2, ensure_ascii=False)

# Print the length of the combined dictionary
print(f"Number of combined entries: {len(entity_statement_dict)}")

Number of combined entries: 0


In [None]:
# Assuming relationship_dicts and statement_dict are already defined

combined_relationships_dict = {}

for relationship_type, relationships in relationship_dicts.items():
    for (index, eiti_id_declaration), relationship in relationships.items():
        if eiti_id_declaration in statement_dict:
            statement = json.loads(statement_dict[eiti_id_declaration])
            relationship_data = json.loads(relationship)
            
            # Add relationship data to the statement
            statement["recordDetails"] = relationship_data
            
            # Set recordId and recordType in statement_dict
            statement["recordId"] = generate_uuid3(relationship_data["subject"], relationship_data["interestedParty"])  # Updated line
            statement["recordType"] = 'relationship'
            
            combined_relationships_dict[(relationship_type, index)] = json.dumps(statement, indent=2, ensure_ascii=False)

# Print the length of the combined dictionary
print(f"Number of combined relationship entries: {len(combined_relationships_dict)}")



In [None]:
# Function to sample a random item from a flat dictionary
def sample_relationships(flat_dict):
    random_key = random.choice(list(flat_dict.keys()))
    return {random_key: flat_dict[random_key]}

# Sample a random item from the combined relationship dictionary
sampled_relationship = sample_relationships(combined_relationships_dict)

# Separator for clarity
separator = "-" * 40

# Print the sampled relationships
for (relationship_type, index), sample in sampled_relationship.items():
    print(f"Sample from {relationship_type} (index {index}):")
    print(f"Value: {sample}\n{separator}")

In [None]:
# Create unified dictionary
unified_dict = {}

# Update statementId and add to unified dictionary using original index
for key, value in combined_dict.items():
    statement = json.loads(value)
    statement["statementId"] = generate_uuid4()
    unified_dict[key] = statement

# Prepare to sort relationships
relationships_list = []
for relationship_type, relationships in relationship_dicts.items():
    for (index, eiti_id_declaration), relationship in relationships.items():
        relationship_data = json.loads(relationship)
        statement = json.loads(statement_dict[eiti_id_declaration])
        statement["recordDetails"] = relationship_data
        statement["statementId"] = generate_uuid4()
        statement["recordType"] = 'relationship'
        relationships_list.append((relationship_type, statement, relationship_data["interests"][0]["startDate"], eiti_id_declaration, index))

# Sort relationships by start_date and eiti_id_relationship
relationships_list.sort(key=lambda x: (x[2], x[3]))

# Create grouped relationships dictionary
grouped_relationships = {}
for relationship_type, statement, _, eiti_id_declaration, index in relationships_list:
    if eiti_id_declaration not in grouped_relationships:
        grouped_relationships[eiti_id_declaration] = []
    grouped_relationships[eiti_id_declaration].append((relationship_type, statement, index))

# Order and update componentRecords for government_project items
for eiti_id_declaration, relations in grouped_relationships.items():
    sorted_relations = sorted(relations, key=lambda x: ['country_government', 'government_company', 'government_soe', 'company_project', 'government_project'].index(x[0]))
    for relationship_type, statement, index in sorted_relations:
        unified_dict[index] = statement
        if relationship_type == 'government_project':
            component_records = [s for t, s, idx in sorted_relations if t in ['government_company', 'government_soe', 'company_project']]
            if component_records:
                # this needs fixing, it's not clear how it's filling this value
                unified_dict[index]["recordDetails"]["componentRecords"] = [r["recordDetails"] for r in component_records] 


# Print the number of combined entries
print(f"Number of combined entries: {len(unified_dict)}")

In [None]:
# Function to sample a random item from each type
def sample_random_items(unified_dict):
    samples = {
        "companies": [],
        "soe": [],
        "gov_agency": [],
        "relationship": []
    }

    for key, value in unified_dict.items():
        record_details = value.get("recordDetails", {})
        entity_type = record_details.get("entityType", {}).get("type", "")
        record_type = value.get("recordType", "")

        if entity_type == "registeredEntity" and record_type == "entity":
            samples["companies"].append((key, value))
        elif entity_type == "stateOwnedEntity" and record_type == "entity":
            samples["soe"].append((key, value))
        elif entity_type == "stateBody" and record_type == "entity":
            samples["gov_agency"].append((key, value))
        elif record_type == "relationship":
            samples["relationship"].append((key, value))

    return {type_: random.choice(items) if items else None for type_, items in samples.items()}

# Get random samples
random_samples = sample_random_items(unified_dict)

# Print the samples with separators
for entity_type, sample in random_samples.items():
    if sample:
        key, value = sample
        print(f"Sample from {entity_type}:")
        print(f"Key: {key}")
        print(f"Value: {json.dumps(value, indent=2, ensure_ascii=False)}")
        print(separator)

In [None]:
# Merge the relationship dictionaries ensuring unique (index, eiti_id_declaration) keys
merged_relationships = {}

for relationship_type, relationships in relationship_dicts.items():
    for key, relationship_json in relationships.items():
        index, eiti_id_declaration = key
        if eiti_id_declaration not in merged_relationships:
            merged_relationships[eiti_id_declaration] = []
        merged_relationships[eiti_id_declaration].append((index, relationship_json))


print(f"The merged dictionary has {len(merged_relationships)} items")



In [None]:

# Create the unified dictionary
unified_dict = {}

for eiti_id_declaration, relationship_entries in merged_relationships.items():
    if eiti_id_declaration in statement_dict:
        statement_json = json.loads(statement_dict[eiti_id_declaration])
        for index, relationship_json in relationship_entries:
            statement_copy = copy.deepcopy(statement_json)
            statement_copy['recordDetails'] = json.loads(relationship_json)
            unified_key = (index, eiti_id_declaration)
            unified_dict[unified_key] = json.dumps(statement_copy, indent=2, ensure_ascii=False)

# Print the number of items in the unified dictionary
print(f"The unified dictionary has {len(unified_dict)} items")

# Print a sample of 2 random items from the unified dictionary
separator = "-" * 40
random_keys = random.sample(list(unified_dict.keys()), 2)

for random_key in random_keys:
    print(f"{random_key}: {unified_dict[random_key]}\n{separator}\n")


In [None]:
# Function to ensure proper UTF-8 encoding
def ensure_utf8(value):
    if isinstance(value, str):
        return value.encode('utf-8', errors='replace').decode('utf-8')
    return value

# Select a random eiti_id_declaration from unified_dict
random_declaration = random.choice([value['declaration'] for value in unified_dict.values()])

# Filter the unified_dict for entries matching the selected eiti_id_declaration
filtered_entries = [ensure_utf8(value) for value in unified_dict.values() if value.get('declaration') == random_declaration]

# Print the number of filtered entries
print(f"Number of entries for eiti_id_declaration '{random_declaration}': {len(filtered_entries)}")

# Output the filtered entries as a single JSON array
output_file = f"filtered_entries_{random_declaration}.json"
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(filtered_entries, f, ensure_ascii=False, indent=2)

# Print a message confirming the file creation
print(f"Filtered entries saved to {output_file}")

In [None]:
import random

# Filter unified_dict to get only government_project items
government_project_items = {key: value for key, value in unified_dict.items() if value["recordType"] == "relationship" and value["recordDetails"]["isComponent"] == False}

# Ensure there are government_project items
if government_project_items:
    # Select a random item
    random_key = random.choice(list(government_project_items.keys()))
    random_item = government_project_items[random_key]

    # Print the random government_project item
    print(f"Random government_project item (Key: {random_key}):")
    print(json.dumps(random_item, indent=2, ensure_ascii=False))
else:
    print("No government_project items found.")
