# EITI consolidated Summary Data files verification and validation
   
## Purpose
The purpose of this notebook is to verify assumptions that should hold true due to the nature of EITI's data and validate the usefulness of the data for different use-cases or business goals.

## Data
1. Consolidated summary data files of all countries and years divided into individual sheets (datasets).
   - [Part 1 - About](data/consolidated/Part%201%20-%20About.csv)
   - [Part 3 - Reporting companies' list](data/consolidated/Part%203%20-%20Reporting%20companies'%20list.csv)
   - Part 3 - Reporting government entities list
   - Part 3 - Reporting projects' list
   - Part 4 - Government revenues
   - Part 5 - Company data

## Methodology
1. [Import the libraries](#Import-Libraries)
2. [Import/load the data](#Import-the-data)
3. Verify the assumptions
4. Validate use-cases
5. Perform analysis and show simple statistics/visualizations about the data

## Assumptions to verify
1. [**COMPLETENESS AND CORRECTNESS**](#Completeness-and-Correctness)
    1. [Cell values are complete and/or correct (i.e. no unexpected or incorrect values)](#Completeness-and-Correctness)
2. [**UNIQUENESS AND NON-DUPLICATION**](#Uniqueness-and-Non-duplication)
    1. [No duplicate company/agency names in the same year and country](#No-duplicate-company/agency-names-in-the-same-year-and-country)
    2. [No duplicate project names in the same year and country](#No-duplicate-project-names-in-the-same-year-and-country)
    3. [No duplicate company/agency IDs in the same year and country](#No-duplicate-company-ids-in-the-same-year-and-country)
    4. [No duplicate company-project pair in the same year and country](#No-duplicate-company-project-pair-in-the-same-year-and-country)
3. [**CONSISTENCY**](#Consistency)
    1. [All companies or agencies in company data (Part 5) exist in the Reporting companies' agencies (Part 3a) and Reporting government entities list (Part 3b)](#All-companies-or-agencies-in-company-data-(Part-5)-exist-in-the-Reporting-companies'-agencies-(Part-3a)-and-Reporting-government-entities-list-(Part-3b))
    2. Consistent company name-company ID pair over time
    3. Consistent project name over time and across the datasets
    4. Consistent company-project pair over time and across the datasets
    5. Consistent company-company type pair over time and across the datasets
    6. Consistency of country names and ISO codes across the datasets
    7. Consistency of country name-ISO code pair across the datasets
    8. Consistency of commodities, commodities-project pair, commodities-unit pair

## Use-cases to validate (examples)
1. How has the extractives industry evolved over time in country X? What volume is extracted over time? How much revenue is generated?
2. How much tax do SOEs (SOE = state-owned enterprises company type) pay in each country? How does this compare with private companies?
3. What market share do SOEs hold in each country, based on volume extracted, on revenue, or on taxes paid?

Add more use-cases. A use case should demonstrate whether a reuser with a specific question, relevant to a real-life need, can answer that question from the data.

## Statistics and graphs
1. [How many years of data does each company/agency have?](#How-many-years-of-data-does-each-company/agency-have?)
2. [How many (and what companies/agencies) have changed their company type (i.e. from SOE to private and vice versa)?](#How-many-(and-what-companies/agencies)-have-changed-their-company-type-(i.e.-from-SOE-to-private-and-vice-versa)?)
3. [How many (and which) companies/agencies have had their names changed?](#How-many-(and-which)-companies/agencies-have-had-their-names-changed?)
4. [How many (and which) companies/agencies have had their IDs changed?](#How-many-(and-which)-companies/agencies-have-had-their-IDs-changed?)
5. [Instances of company/agency names being inputted incorrectly](#Instances-of-company/agency-names-being-inputted-incorrectly)

Show statistics and graphs for the assumptions verified and use-cases validated.

Also include:
- comparison among countries (e.g. the most companies, the most consistent data, the most issues, etc.)
- comparison among years (e.g. the most companies, the most consistent data, the most issues, etc.)
- statistics only for SOE agencies/companies (SOE = state-owned enterprises company type)

### Import Libraries

In [1]:
# import libraries

import pandas as pd
from os import path

### Import the data

In [2]:
file_dir = "data/consolidated/"

# load the csvs into data frames
df_part_1 = pd.read_csv(path.join(file_dir, "Part 1 - About.csv"))
df_part_3a = pd.read_csv(path.join(file_dir, "Part 3 - Reporting companies' list.csv"))
df_part_3b = pd.read_csv(path.join(file_dir, "Part 3 - Reporting government entities list.csv"))
df_part_3c = pd.read_csv(path.join(file_dir, "Part 3 - Reporting projects' list.csv"))
df_part_4 = pd.read_csv(path.join(file_dir, "Part 4 - Government revenues.csv"))
# df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv")) # results in warning about columns having mixed types
df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"), low_memory=False)

df_list = [df_part_1, df_part_3a, df_part_3b, df_part_3c, df_part_4, df_part_5]
df_dict = {"Part 1 - About.csv": df_part_1,
           "Part 3 - Reporting companies' list.csv": df_part_3a,
           "Part 3 - Reporting government entities list.csv": df_part_3b,
           "Part 3 - Reporting projects' list.csv": df_part_3c,
           "Part 4 - Government revenues.csv": df_part_4,
           "Part 5 - Company data.csv": df_part_5
          }

In [3]:
# Checking mixed type columns
mixed_type_columns = df_part_5.map(type).nunique() > 1

# Print or display the columns with mixed types
print("Columns with mixed types:")
print(mixed_type_columns[mixed_type_columns].index.tolist())

Columns with mixed types:
['Company', 'Government entity', 'Revenue stream name', 'Levied on project (Y/N)', 'Reported by project (Y/N)', 'Project name', 'Reporting currency', 'Revenue value', 'Payment made in-kind (Y/N)', 'In-kind volume (if applicable)', 'Unit (if applicable)', 'Comments', 'Country', 'ISO Code', 'Year', 'Start Date', 'End Date']
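
To see why a column is flagged as mixed, it can help to count the Python types it actually holds. The following is a minimal sketch on a toy column, not on the real Part 5 data; `type_breakdown` is a hypothetical helper introduced here for illustration.

```python
import pandas as pd


def type_breakdown(series):
    """Count how many cells of each Python type a column holds."""
    return series.map(lambda cell: type(cell).__name__).value_counts()


# Toy column (an assumption, not real data): strings plus one missing value.
# A missing value is stored as NaN, which is a float, so the column is "mixed".
toy_column = pd.Series(["100", "#ERROR!", float("nan"), "250.5"], dtype=object)

breakdown = type_breakdown(toy_column)
print(breakdown)
```

In many of the columns listed above, the second type is likely just the float `NaN` that pandas uses for empty cells, which would make this check overlap with the NULL check in the next section.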


## Assumptions to verify

### Completeness and Correctness

It is assumed that the data received are complete and correct. In this section, we check for any unexpected or incorrect values in the tables, such as instances of 'no data', 'NULL', or 'NaN'.

#### Expectations
- We expect that the data is complete and correct.
- There are no 'no data', 'NULL', or 'NaN' values in the dataframes
  
#### Assumptions
- The dataset is assumed to have undergone rigorous data validation checks to identify and address any missing or erroneous values.
- Data entry procedures have been consistently followed, minimizing the likelihood of unexpected or incorrect values.
- Cell values are expected to fall within valid and meaningful ranges.


#### Results

In [4]:
# Functions to find and count NULL, '#ERROR!', empty-string, and whitespace-only values

def columns_with_null(df):
    return df.columns[df.isnull().any()]

def count_null_per_column(df):
    return df.isnull().sum()    

def columns_with_error(df):
    return df.columns[df.eq('#ERROR!').any()]

def count_error_per_column(df):
    return df.apply(lambda col: (col == '#ERROR!').sum())

def columns_with_empty_strings(df):
    return df.columns[df.eq('').any()]

def count_empty_strings_per_column(df):
    return df.apply(lambda col: (col == '').sum())

def columns_with_only_space(df):
    return df.columns[df.map(lambda cell: isinstance(cell, str) and cell.isspace()).any()]

def count_only_space_per_column(df):
    return df.map(lambda cell: str(cell).isspace()).sum()
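
The helper functions above each answer one question per call. As a hedged alternative, the counts can be gathered into a single table per dataset, which is easier to compare across tabs. The sketch below uses a toy frame; `completeness_summary` and its `error_token` parameter are hypothetical names, and "blank" here covers both empty and whitespace-only strings.

```python
import pandas as pd


def completeness_summary(df, error_token="#ERROR!"):
    """Return one row per column with counts of NULL, error-token and blank cells."""
    # A cell is blank if it is a string that is empty or contains only whitespace.
    is_blank = df.apply(
        lambda col: col.map(lambda cell: isinstance(cell, str) and cell.strip() == "")
    )
    return pd.DataFrame({
        "null": df.isnull().sum(),
        "error": df.eq(error_token).sum(),
        "blank": is_blank.sum(),
    })


# Toy data (an assumption): one error token, one missing value, one whitespace-only cell.
toy = pd.DataFrame({
    "Country": ["Ghana", "Iraq", "Norway"],
    "Total reported": ["#ERROR!", None, " "],
}, dtype=object)

summary = completeness_summary(toy)
print(summary)
```

Applied to the notebook's data, this could be called as `completeness_summary(df_dict[tab])` inside the same loop used in the cells below.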

**The results below show that there are columns in the data that contain NULL values**

In [5]:
for tab in df_dict:
    print("Columns with NULL values for {}:".format(tab))
    for column in columns_with_null(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with NULL: {}/{}".format(len(columns_with_null(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of NULL per column:")
    print(count_null_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with NULL values for Part 1 - About.csv:
- ISO Alpha-3 Code
- National currency name
- National currency ISO-4217
- Start Date
- End Date
- Has an EITI Report been prepared by an Independent Administrator?
- What is the name of the company?
- Date that the EITI Report was made public
- URL, EITI Report
- Does the government systematically disclose EITI data at a single location?
- Publication date of the EITI data
- Website link (URL) to EITI data
- Are there other files of relevance?
- Date that other file was made public
- URL
- Does the government have an open data policy?
- Open data portal / files
- Oil
- Gas
- Mining (incl. Quarrying)
- Other, non-upstream sectors
- If yes, please specify name (insert new rows if multiple)
- Number of reporting government entities (incl SOEs if recipient)
- Number of reporting companies (incl SOEs if payer)
- Reporting currency (ISO-4217 currency codes)
- Exchange rate used: 1 USD = 
- Exchange rate source (URL,…)
- … by revenue stream
...

**The results below show that there are columns in the data that contain '#ERROR!'**

In [6]:
for tab in df_dict:
    print("Columns with '#ERROR!' values for {}:".format(tab))
    for column in columns_with_error(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with '#ERROR!': {}/{}".format(len(columns_with_error(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of '#ERROR!' per column:")
    print(count_error_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with '#ERROR!' values for Part 1 - About.csv:
- ISO Alpha-3 Code
- National currency name
- National currency ISO-4217

Number of columns with '#ERROR!': 3/39

Number of '#ERROR!' per column:
Country or area name                                                            0
ISO Alpha-3 Code                                                               20
National currency name                                                         20
National currency ISO-4217                                                     19
Start Date                                                                      0
End Date                                                                        0
Has an EITI Report been prepared by an Independent Administrator?               0
What is the name of the company?                                                0
Date that the EITI Report was made public                                       0
URL, EITI Report                                              

**The results below check whether any columns in the data contain empty strings ('')**

In [7]:
for tab in df_dict:
    print("Columns with '' values for {}:".format(tab))
    for column in columns_with_empty_strings(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with '': {}/{}".format(len(columns_with_empty_strings(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of '' per column:")
    print(count_empty_strings_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with '' values for Part 1 - About.csv:

Number of columns with '': 0/39

Number of '' per column:
Country or area name                                                           0
ISO Alpha-3 Code                                                               0
National currency name                                                         0
National currency ISO-4217                                                     0
Start Date                                                                     0
End Date                                                                       0
Has an EITI Report been prepared by an Independent Administrator?              0
What is the name of the company?                                               0
Date that the EITI Report was made public                                      0
URL, EITI Report                                                               0
Does the government systematically disclose EITI data at a single location?    0
...

**The results below check whether any columns in the data contain cells with only white spaces**

In [8]:
for tab in df_dict:
    print("Columns with only white spaces for {}:".format(tab))
    for column in columns_with_only_space(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with only white spaces: {}/{}".format(len(columns_with_only_space(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of white-space-only cells per column:")
    print(count_only_space_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with only white spaces for Part 1 - About.csv:

Number of columns with only white spaces: 0/39

Number of white-space-only cells per column:
Country or area name                                                           0
ISO Alpha-3 Code                                                               0
National currency name                                                         0
National currency ISO-4217                                                     0
Start Date                                                                     0
End Date                                                                       0
Has an EITI Report been prepared by an Independent Administrator?              0
What is the name of the company?                                               0
Date that the EITI Report was made public                                      0
URL, EITI Report                                                               0
Does the government systematically disclose EITI data at a single location? 

### Uniqueness and Non-duplication

In [9]:
def count_duplicates(main_df, duplicate_df, field, name):
    return main_df[duplicate_df].groupby(field).size().reset_index(name=name)

### No duplicate company/agency names in the same year and country

#### Expectations
- A company or agency name will appear only once per year per country.
- We expect that within the provided dataset, each combination of a company or agency name in a given year and country is unique.
- This expectation is grounded in the assumption that the dataset is designed to maintain distinct company or agency identities for a given year and country.

#### Assumptions
- The company or agency is only reported once per year regardless of sector/commodity
- Company and agency names are assumed to be uniquely associated with specific entities within a given year and country.
- Any duplicate entries would be considered as potential data entry errors or inconsistencies.

#### Results

In [10]:
def get_duplicates(df, field_to_check, *constraint_fields):
    # Return every row whose (field_to_check, *constraint_fields) combination
    # occurs more than once, sorted by field_to_check.
    # Note: DataFrame.duplicated treats NaN values as equal, so rows where
    # field_to_check is missing are also flagged as duplicates of each other.
    return df[df.duplicated([field_to_check, *constraint_fields], keep=False)].sort_values(by=[field_to_check])

In [11]:
get_duplicates(df_part_3a, 'Full company name', 'Country', 'Year')

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
1549,Al Waha Petroleum Co. Ltd.,Private,,Oil & Gas,"Oil, Gas, Condensates",https://www.petroalwaha.com/,,628712909.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1585,Al Waha Petroleum Co. Ltd.,Private,,Oil & Gas,"Oil, Gas, Condensates",,,628712909.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1552,CNOOC IRAQ LIMITED,Private,,Oil & Gas,"Oil, Gas, Condensates",https://cnoocinternational.com/operations/midd...,,1098601809.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1586,CNOOC IRAQ LIMITED,Private,,Oil & Gas,"Oil, Gas, Condensates",,,1098601809.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1564,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1588,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
497,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Nickel,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,864690546.00,Philippines,PHL,2018,2018-01-01,2018-12-31
519,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Limestone,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,,Philippines,PHL,2018,2018-01-01,2018-12-31
173,شرکت برادران خالد عزیز,Private,1019984010,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
174,شرکت برادران خالد عزیز,Private,1019984010,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20


#### Results
- There are duplicate listings of companies in the same year in Iraq and Afghanistan
- The duplicate listing in the Philippines may be attributed to the difference in commodities
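
To tell exact duplicates apart from rows that share a name but differ elsewhere (as the Philippines rows appear to), one option is to list, for each duplicated key, the other columns whose values vary. This is a minimal sketch on toy data; `differing_columns` is a hypothetical helper, and on the real frames any row-index column would need to be dropped first because it always differs.

```python
import pandas as pd


def differing_columns(df, key_cols):
    """For each duplicated key, list the non-key columns whose values differ."""
    dupes = df[df.duplicated(key_cols, keep=False)]
    report = {}
    for key, group in dupes.groupby(key_cols, dropna=False):
        others = group.drop(columns=key_cols)
        # dropna=False counts NaN as a value, so "filled vs. empty" is a difference.
        varying = others.columns[others.nunique(dropna=False) > 1].tolist()
        report[key] = varying
    return report


# Toy data (an assumption): "A Co" differs by commodity, "B Co" is an exact duplicate.
toy = pd.DataFrame({
    "Full company name": ["A Co", "A Co", "B Co", "B Co", "C Co"],
    "Country": ["X", "X", "Y", "Y", "Y"],
    "Year": [2018, 2018, 2018, 2018, 2018],
    "Commodities": ["Nickel", "Limestone", "Oil", "Oil", "Gas"],
})

report = differing_columns(toy, ["Full company name", "Country", "Year"])
by_name = {key[0]: cols for key, cols in report.items()}
print(by_name)
```

An empty list marks a group of identical rows, which is the more likely sign of a data entry error; a non-empty list points to the columns worth inspecting.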

In [12]:
get_duplicates(df_part_3b, 'Full name of agency', 'Country', 'Year')

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
238,Mpohor Wassa East,Local government,,#ERROR!,Ghana,GHA,2019.0,2019-01-01,2019-12-31
252,Mpohor Wassa East,Local government,,,Ghana,GHA,2019.0,2019-01-01,2019-12-31
247,Obuasi Municipal Assembly,Local government,,#ERROR!,Ghana,GHA,2019.0,2019-01-01,2019-12-31
259,Obuasi Municipal Assembly,Local government,,,Ghana,GHA,2019.0,2019-01-01,2019-12-31


#### Results
- There are duplicate listings of agencies for Ghana in 2019

### No duplicate project names in the same year and country

#### Expectations
- A project name will appear only once per year per country.
- Each project name within the dataset is unique for a given year and country. This expectation is based on the assumption that project names are assigned in a manner that avoids duplicates for a given year and country.

#### Assumptions
- A project is only reported once per year, country, and commodity
- Project names are assumed to be distinct identifiers for different projects within the same year and country.
- Duplicate project names within the same year and country would be treated as anomalies.

#### Results

In [13]:
duplicate_projects = get_duplicates(df_part_3c,
                                    'Full project name',
                                    'Country', 'Year', 'Commodities (one commodity/row)')

duplicate_projects

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
1239,0,,MCCM,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1109,0,,Bayan erdes,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1168,0,,Mon lid trade,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1185,0,,STBL,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1196,0,,Terguun service,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1400,,,,,,,,,,,,,,
1402,,,,,,,,,,,,,,
1404,,,,,,,,,,,,,,
1406,,,,,,,,,,,,,,


In [38]:
duplicate_project_names = duplicate_projects['Full project name'].unique()

print("Number of duplicate project names: {}\n".format(len(duplicate_project_names.tolist())))

# duplicate_project_counts = duplicate_projects['Full project name'].value_counts()
# # print(duplicate_project_counts)

# print("Number of duplicates per project")
# for name, count in duplicate_project_counts.items():
#     if count > 1:
#         duplicates = duplicate_projects[duplicate_projects['Full project name'] == name]
#         print(f"Name: {name}, Count: {count}")

duplicate_projects_count = duplicate_projects.groupby('Full project name').size().reset_index(name='number of times duplicated')
duplicate_projects_count

Number of duplicate project names: 198



Unnamed: 0,Full project name,number of times duplicated
0,0,8
1,2025257,2
2,ALKANE ENERGY UK LIMITED (03128509),4
3,"ALKANE ENERGY UK LIMITED (03128509), EGDON RES...",2
4,ALPHA PETROLEUM RESOURCES LIMITED (03949599),4
...,...,...
192,TOTALENERGIES E&P NORTH SEA UK LIMITED (03682299),2
193,TOTALENERGIES E&P UK LIMITED (00811900),20
194,VICTOR,2
195,WINTERSHALL NOORDZEE B.V. (FC027567),3


In [15]:
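# `name` was only defined in an earlier, now commented-out loop; set it explicitly
name = 'Wharf, Rvr Usk nr Caldicot Level'
duplicates = duplicate_projects[duplicate_projects['Full project name'] == name]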
duplicates

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
6150,"Wharf, Rvr Usk nr Caldicot Level",Licence,Hanson Limited,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6149,"Wharf, Rvr Usk nr Caldicot Level",Licence,Hanson Limited,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31


#### Implications
- There is a significant number of duplicate project names across many of the countries.
- The duplicates need to be counted per country to see whether they should be expected
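
Raw duplicate counts favour countries that report many projects, so a per-country share may be a fairer basis for comparison. The sketch below is one possible way to compute it on toy data; `duplicate_share_by_country` is a hypothetical helper and the column names are assumptions taken from the tables above.

```python
import pandas as pd


def duplicate_share_by_country(df, key_cols, country_col="Country"):
    """Return per-country row counts, duplicated-row counts and their ratio."""
    flagged = df.assign(is_duplicate=df.duplicated(key_cols, keep=False))
    summary = flagged.groupby(country_col)["is_duplicate"].agg(
        rows="size", duplicated="sum"
    )
    summary["share"] = summary["duplicated"] / summary["rows"]
    return summary.sort_values("share", ascending=False)


# Toy data (an assumption): country X has 2 of 4 rows duplicated, Y has 2 of 2.
toy = pd.DataFrame({
    "Full project name": ["P1", "P1", "P2", "P3", "P4", "P4"],
    "Country": ["X", "X", "X", "X", "Y", "Y"],
    "Year": [2020] * 6,
})

share = duplicate_share_by_country(toy, ["Full project name", "Country", "Year"])
print(share)
```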

In [39]:
# pd.set_option('display.max_rows', None)

project_counts_per_country = duplicate_projects.groupby('Country')['Full project name'].value_counts()

duplicates_per_country = project_counts_per_country[project_counts_per_country > 1]

print("Number of duplicate project names per country:")
print(duplicates_per_country)

# pd.reset_option('display.max_rows')

Number of duplicate project names per country:
Country         Full project name                                                             
Afghanistan     SSML-Kabu 13/2012                                                                 6
                SSML-Kabu 42/2014                                                                 4
                SSML-Kabu 2/2008                                                                  4
                SSML-Kabu 19/2016                                                                 4
                SSML-Kabu 10/2016                                                                 4
                                                                                                 ..
United Kingdom  EGDON RESOURCES U.K. LIMITED (03424561), REGENT PARK ENERGY LIMITED (04557422)    2
                Boulby Offshore Mine                                                              2
                Claymore                                                                          2

In [41]:
duplicate_projects.groupby('Country').size().reset_index(name='duplicates per country')

Unnamed: 0,Country,duplicates per country
0,Afghanistan,26
1,Armenia,3
2,Burkina Faso,5
3,Cameroon,4
4,Cote d'Ivoire,2
5,Country,7
6,Country,9
7,Dominican Republic,2
8,Germany,45
9,Ghana,22


In [17]:
df_part_3c[df_part_3c['Full project name'] == 'CIRQUE ENERGY (UK) LIMITED (03080778)']

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
6001,CIRQUE ENERGY (UK) LIMITED (03080778),PEDL324,"CIRQUE ENERGY (UK) LIMITED (03080778), STELINM...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6008,CIRQUE ENERGY (UK) LIMITED (03080778),PEDL348,"CIRQUE ENERGY (UK) LIMITED (03080778), STELINM...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31


### No duplicate company ids in the same year and country

#### Expectations
- A company ID will appear only once per year per country
- The company or agency identification numbers (IDs) within the dataset are unique for a given year and country.
- The dataset has been structured to ensure a one-to-one mapping between company or agency IDs and the entities they identify.

#### Assumptions
- 1 company ID = 1 company in a country
- Each company or agency ID is assumed to uniquely represent a specific entity for the year and country.
- Duplicate company or agency IDs within the same year and country are considered deviations from the expected dataset structure.
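
The "1 company ID = 1 company" assumption can also be tested directly by counting distinct names per ID within each country and year. This is a minimal sketch; `ids_with_multiple_names` is a hypothetical helper, and the toy rows reuse the two Burkina Faso company names that share an ID in the results below.

```python
import pandas as pd


def ids_with_multiple_names(df, id_col, name_col, scope_cols):
    """Return IDs that map to more than one distinct name within each scope."""
    known = df.dropna(subset=[id_col])  # missing IDs cannot be compared
    name_counts = known.groupby([*scope_cols, id_col])[name_col].nunique()
    return name_counts[name_counts > 1]


# Toy data (an assumption): two names share one ID in the same country and year.
toy = pd.DataFrame({
    "Full company name": ["HOUNDE GOLD OPERATION SA", "HOUNDE EXPLORATION BF SARL", "Other Co"],
    "Company ID number": ["00064526S", "00064526S", "123"],
    "Country": ["Burkina Faso"] * 3,
    "Year": [2019] * 3,
})

conflicts = ids_with_multiple_names(
    toy, "Company ID number", "Full company name", ["Country", "Year"]
)
print(conflicts)
```

Unlike a plain duplicate check, this ignores rows that repeat the same name under the same ID, so it isolates the cases where one ID appears to stand for different entities.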

#### Results

In [18]:
duplicate_ids = get_duplicates(df_part_3a,
                               'Company ID number',
                               'Country', 'Year')

duplicate_ids

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
519,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Limestone,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,,Philippines,PHL,2018,2018-01-01,2018-12-31
497,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Nickel,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,864690546.00,Philippines,PHL,2018,2018-01-01,2018-12-31
878,HOUNDE GOLD OPERATION SA,Private,00064526S,Mining,Or,Indisponible,,39636048842.00,Burkina Faso,BFA,2019,1/1/2019,12/31/2019
877,HOUNDE EXPLORATION BF SARL,Private,00064526S,Mining,Or,https://www.endeavourmining.com/our-portfolio/...,,7879789.00,Burkina Faso,BFA,2019,1/1/2019,12/31/2019
872,HOUNDE EXPLORATION BF SARL,Private,00064526S,Mining,Gold,https://www.endeavourmining.com/our-portfolio/...,,13130988.00,Burkina Faso,BFA,2018,1/1/2018,12/31/2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3216,GX Technology,,,Oil & Gas,,,,,Seychelles,SYC,2018,2018-01-01,2018-12-31
3217,JOGMEX,,,Oil & Gas,,,,,Seychelles,SYC,2018,2018-01-01,2018-12-31
3242,GRAYSON BANDA,Private,,Mining,Gold,Not available,Not available,3126118498.00,Tanzania,TZA,2018,2017-07-01,2018-06-30
3256,TANZANIA PORTLAND CEMENT PUBLIC LIMITED COMPANY,Private,,Mining,"Limeston, Sandstone",Not available,Not available,24750760174.02,Tanzania,TZA,2018,2017-07-01,2018-06-30


In [59]:
duplicate_company_ids_1 = df_part_3a[df_part_3a.duplicated(['Country', 'Year', 'Company ID number'], keep=False)]
duplicate_company_ids_1

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
0,Amin Karimzai Campany,Private,1007815085,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
1,Habib Shahab Talc and Marble exploitation and ...,Private,1013655012,Other,"Talc, Construction stone",Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
14,Amania Mining Company,Private,9000197187,Other,Fluoride,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
18,Bilal Musazai Company Limited,Private,1009592088,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
30,Habib Shahab Talc and Marble Extraction and Pr...,Private,1013655012,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3216,GX Technology,,,Oil & Gas,,,,,Seychelles,SYC,2018,2018-01-01,2018-12-31
3217,JOGMEX,,,Oil & Gas,,,,,Seychelles,SYC,2018,2018-01-01,2018-12-31
3242,GRAYSON BANDA,Private,,Mining,Gold,Not available,Not available,3126118498.00,Tanzania,TZA,2018,2017-07-01,2018-06-30
3256,TANZANIA PORTLAND CEMENT PUBLIC LIMITED COMPANY,Private,,Mining,"Limeston, Sandstone",Not available,Not available,24750760174.02,Tanzania,TZA,2018,2017-07-01,2018-06-30


In [54]:
duplicate_ids.groupby('Company ID number').size().reset_index(name='duplicates')

Unnamed: 0,Company ID number,duplicates
0,000-142-665-000,2
1,00064526S,4
2,1007815085,4
3,1008627083,3
4,1009592088,4
5,1010360087,4
6,1013655012,5
7,1016525014,4
8,1019984010,2
9,103947189,2


In [42]:
duplicate_id_names = duplicate_ids['Company ID number'].unique()

print("Number of duplicate IDs: {}\n".format(len(duplicate_id_names.tolist())))

# value_counts() skips NaN, so rows with a missing ID are not listed below
duplicate_id_counts = duplicate_ids['Company ID number'].value_counts()

print("Number of duplicate IDS")
for company_id, count in duplicate_id_counts.items():  # avoid shadowing the built-in id()
    if count > 1:
        print(f"Name: {company_id}, Count: {count}")

Number of duplicate IDs: 36

Number of duplicate IDS
Name: Not applicable, Count: 81
Name: Not available, Count: 40
Name: NC, Count: 37
Name: Not communicated, Count: 19
Name: Not Available, Count: 8
Name: 1013655012, Count: 5
Name: 983426417, Count: 5
Name: 1009592088, Count: 4
Name: 1010360087, Count: 4
Name: 1016525014, Count: 4
Name: 9001044669, Count: 4
Name: 2700773, Count: 4
Name: 9001204305, Count: 4
Name: 1007815085, Count: 4
Name: 9000197187, Count: 4
Name: 00064526S, Count: 4
Name: 1008627083, Count: 3
Name: 931713671, Count: 2
Name: 000-142-665-000, Count: 2
Name: Nc, Count: 2
Name: 9003329902, Count: 2
Name: RC321517, Count: 2
Name: 919160675, Count: 2
Name: 9000647140, Count: 2
Name: 9002301225, Count: 2
Name: 9001505461, Count: 2
Name: 9001353375, Count: 2
Name: 9000036856, Count: 2
Name: 761,1998-1999, Count: 2
Name: 1384,1998-1999, Count: 2
Name: 108744308, Count: 2
Name: 1044142014, Count: 2
Name: 103947189, Count: 2
Name: 1019984010, Count: 2
Name: RC322270, Count: 2

In [46]:
duplicate_ids_agency = get_duplicates(df_part_3b,
                                      'ID number (if applicable)',
                                      'Country', 'Year')

duplicate_ids_agency

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
424,Norwegian Central Bank,Central goverment,937884117,-,Norway,NOR,2019.0,1/1/2019,12/31/2019
425,Petoro,Central goverment,937884117,96478000000.00,Norway,NOR,2019.0,1/1/2019,12/31/2019
265,Superintendencia de Administración Tributaria ...,Central government,<Use Legal Entity Identifier if available>,149007097.60,Guatemala,GTM,2017.0,2017-01-01,2017-12-01
266,Municipalidades,Local government,<Use Legal Entity Identifier if available>,63301988.42,Guatemala,GTM,2017.0,2017-01-01,2017-12-01
262,Ministerio de Energia y Minas (MEM),Central government,<Use Legal Entity Identifier if available>,228058363.73,Guatemala,GTM,2017.0,2017-01-01,2017-12-01
...,...,...,...,...,...,...,...,...,...
544,Ministry of Mines and Minerals Development,Central goverment,,48229839.93,Zambia,ZMB,2018.0,2018-01-01,2018-12-31
545,Environmental Protection Fund,Other,,23330601.78,Zambia,ZMB,2018.0,2018-01-01,2018-12-31
546,Ministry of Lands,Central goverment,,1756688.47,Zambia,ZMB,2018.0,2018-01-01,2018-12-31
547,IDC,Other,,69205641.67,Zambia,ZMB,2018.0,2018-01-01,2018-12-31


In [56]:
duplicate_ids_agency.groupby('ID number (if applicable)').size().reset_index(name='duplicates')

Unnamed: 0,ID number (if applicable),duplicates
0,937884117,2
1,<Use Legal Entity Identifier if available>,5
2,Mining,3
3,No applicable,7
4,Non applicable,45
5,Not applicable,59
6,Not available,9
7,Oil & Gas,6


In [58]:
duplicate_ids_agency.groupby('Country').size().reset_index(name='duplicates')

Unnamed: 0,Country,duplicates
0,Afghanistan,10
1,Albania,22
2,Argentina,4
3,Armenia,6
4,Burkina Faso,36
5,Cameroon,8
6,Chad,21
7,Cote d'Ivoire,18
8,Democratic Republic of Congo,10
9,Ethiopia,3


#### Implications
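- There is a significant number of duplicate company ID values (450 flagged rows).
- Most are placeholders rather than real identifiers: values such as `Not applicable`, `Not available`, `NC`, and `Not communicated` indicate that the information was not available or does not exist in the dataset.
- The agency IDs show the same pattern, with placeholders such as `Not applicable` and `Non applicable` accounting for most of the duplicates.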
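
Because the placeholder strings are not real identifiers, it may help to mask them before checking for duplicates. The following is a minimal sketch on toy data, assuming the placeholders seen in the output above are the relevant ones; `PLACEHOLDER_IDS` and `mask_placeholder_ids` are hypothetical names that are not part of this notebook.

```python
import pandas as pd

# Hypothetical placeholder list, taken from the values seen in the output above;
# extend it after inspecting the real data.
PLACEHOLDER_IDS = {"not applicable", "not available", "nc", "not communicated"}


def mask_placeholder_ids(ids: pd.Series) -> pd.Series:
    """Return a copy of `ids` with placeholder strings replaced by missing values."""
    normalized = ids.astype(str).str.strip().str.casefold()
    return ids.mask(normalized.isin(PLACEHOLDER_IDS))


# Toy data standing in for the Part 3a dataset
toy = pd.DataFrame({
    "Company ID number": ["Not applicable", "NC", "Nc", "1013655012", "1013655012", "Not Available"],
    "Country": ["Iraq", "Chad", "Chad", "Mexico", "Mexico", "Iraq"],
    "Year": [2018, 2018, 2018, 2019, 2019, 2018],
})
toy["clean_id"] = mask_placeholder_ids(toy["Company ID number"])

# Only rows with a real ID can be genuine duplicates
real_ids = toy.dropna(subset=["clean_id"])
genuine_duplicates = real_ids[
    real_ids.duplicated(subset=["clean_id", "Country", "Year"], keep=False)
]
print(genuine_duplicates)
```

Matching is case-insensitive so that variants such as `Nc` and `Not Available` are treated the same as `NC` and `Not available`.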

In [44]:
duplicate_ids.groupby('Country').size().reset_index(name='duplicates')

Unnamed: 0,Country,duplicates
0,Afghanistan,55
1,Armenia,4
2,Burkina Faso,4
3,Chad,48
4,Cote d'Ivoire,6
5,Democratic Republic of Congo,4
6,Guatemala,11
7,Iraq,141
8,Mauritania,5
9,Mexico,10


In [45]:
duplicate_ids.groupby('Country').size().sum()

450

### No duplicate company-project and commodity pair in the same year and country

#### Expectations
- We anticipate that within the provided dataset, each combination of a company and project in a given year and country should be unique.
- This expectation is grounded in the assumption that projects are distinctly identified by the combination of the involved company, the project name, and the associated year and country.

#### Assumptions
- Projects are uniquely identified by the combination of the company, project name, year, and country.
- The dataset has been curated to prevent duplicate entries for the same project within the same year and country.

#### Results

In [20]:
company_project_duplicates = df_part_3c.duplicated(subset=['Affiliated companies, start with Operator', 'Full project name', 'Year', 'Country'], keep=False)

duplicate_rows = df_part_3c[company_project_duplicates]
duplicate_rows

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
21,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
22,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
37,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
38,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
41,SSML-Kabu 13/2012,Small-scale mining license,شرکت ساختمانی فاروق استانکزی,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6254,"Area 461, 1,000,000 tpa, Median Deep",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6255,Kansanshi Mine,7057 HQ LML,Kansanshi Mining Plc,Copper (2603),Production,250860,Tonnes,1547.48,USD,Zambia,ZMB,2017,2017-01-01,2017-12-31
6256,Kansanshi Mine,17019-HQ-LEL,Kansanshi Mining Plc,Gold (7108),Production,4.565,Tonnes,183.31,USD,Zambia,ZMB,2017,2017-01-01,2017-12-31
6268,Kansanshi Mine,7057 HQ LML,KANSANSHI MINING PLC,Copper (2603),Production,251517.19,Tonnes,1640609303.36,USD,Zambia,ZMB,2018,2018-01-01,2018-12-31


In [21]:
print("Duplicates per country:")
print(count_duplicates(df_part_3c, company_project_duplicates, 'Country', 'duplicates_per_country'))

Duplicates per country:
                Country  duplicates_per_country
0           Afghanistan                      26
1               Albania                      16
2             Argentina                      15
3               Armenia                       9
4          Burkina Faso                      30
5              Cameroon                       4
6         Cote d'Ivoire                      10
7               Country                       7
8              Country                        9
9    Dominican Republic                      29
10                Ghana                      26
11            Guatemala                       3
12              Liberia                      17
13           Madagascar                      13
14           Mauritania                       2
15               Mexico                      19
16             Mongolia                      12
17           Mozambique                       2
18              Myanmar                       2
19          Phil

In [22]:
print("Duplicates per year:")
print(count_duplicates(df_part_3c, company_project_duplicates, 'Year', 'duplicates_per_year'))

Duplicates per year:
   Year  duplicates_per_year
0  2017                   41
1  2018                  651
2  2019                  526
3  2020                  725
4  2021                  439
5  Year                   16
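
The `Country` and `Year` values in the outputs above suggest that some header rows were carried into the consolidated file as data rows. One possible cause, which is an assumption, is that the header was repeated when the individual files were concatenated. The sketch below, on toy data, flags rows in which a cell equals its own column name; `find_header_rows` is a hypothetical helper, not part of this notebook.

```python
import pandas as pd


def find_header_rows(df, columns):
    """Boolean mask of rows where any of `columns` holds its own column name."""
    mask = pd.Series(False, index=df.index)
    for col in columns:
        # strip() also catches variants with stray whitespace, such as "Country "
        mask |= df[col].astype(str).str.strip().eq(col)
    return mask


# Toy data standing in for the Part 3 projects dataset
toy = pd.DataFrame({
    "Full project name": ["Kansanshi Mine", "Full project name", "EXPL 3/2012"],
    "Country": ["Zambia", "Country ", "Afghanistan"],
    "Year": ["2017", "Year", "2018"],
})

header_mask = find_header_rows(toy, ["Country", "Year"])
cleaned = toy[~header_mask].reset_index(drop=True)
print(f"Dropped {header_mask.sum()} stray header row(s)")
```

Dropping these rows before running the duplicate checks would keep them from being counted as duplicates of each other.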


Next, let's look at duplicates that share the same project name, affiliated companies, country, year, and commodity.

In [23]:
company_project_comm_duplicates = df_part_3c.duplicated(subset=['Affiliated companies, start with Operator', 'Full project name', 'Year', 'Country', 'Commodities (one commodity/row)'], keep=False)

duplicate_rows_2 = df_part_3c[company_project_comm_duplicates]
duplicate_rows_2

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
21,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
22,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
37,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
38,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
41,SSML-Kabu 13/2012,Small-scale mining license,شرکت ساختمانی فاروق استانکزی,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6250,"Area 340, 500,000 tpa, Nab",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6251,"Area 351, 500,000 tpa, SE Isle of Wight",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6252,"Area 351, 500,000 tpa, SE Isle of Wight",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6253,"Area 461, 1,000,000 tpa, Median Deep",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31


In [24]:
print("Duplicates per country:")
print(count_duplicates(df_part_3c, company_project_comm_duplicates, 'Country', 'duplicates_per_country'))

Duplicates per country:
               Country  duplicates_per_country
0          Afghanistan                      26
1             Cameroon                       4
2        Cote d'Ivoire                       2
3              Country                       7
4             Country                        9
5   Dominican Republic                       2
6                Ghana                      14
7              Liberia                      17
8           Madagascar                       9
9           Mauritania                       2
10              Mexico                      19
11            Mongolia                      12
12         Philippines                       4
13             Senegal                       2
14             Ukraine                      14
15      United Kingdom                     449


In [25]:
print("Duplicates per year:")
print(count_duplicates(df_part_3c, company_project_comm_duplicates, 'Year', 'duplicates_per_year'))

Duplicates per year:
   Year  duplicates_per_year
0  2017                   10
1  2018                   82
2  2019                   27
3  2020                   18
4  2021                  439
5  Year                   16


In [26]:
print("Duplicates per commodity:")
print(count_duplicates(df_part_3c, company_project_comm_duplicates, 'Commodities (one commodity/row)', 'duplicates_per_commodity'))

Duplicates per commodity:
    Commodities (one commodity/row)  duplicates_per_commodity
0             Coal, Technical water                         6
1   Commodities (one commodity/row)                        16
2                        Condensate                         2
3                  Crude oil (2709)                        23
4                       Gold (7108)                         4
5                  Limestone (2521)                         4
6                Natural gas (2711)                         4
7                     Nickel (2604)                         2
8                    Non applicable                         2
9                    Not applicable                        29
10                    Not available                       441
11                     Not reported                         8
12                     Other (2617)                        11


#### Implications
- There is a significant number of duplicates (2416) that have the same Full project name and Affiliated companies in the data.
- Even when accounting for the commodity reported in the project, there is still a significant number of duplicates (610) that have the same Full project name, Affiliated companies, and commodity.
- Most of these duplicates (464 of 610) may be attributed to the commodity not being part of the information provided.
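
Rows that share a key can be either exact copies of each other or distinct records that differ in other columns such as the commodity. The following sketch, on toy data, separates the two cases. It assumes that any exported index column (for example `Unnamed: 0`) has been dropped first, since it would make every row unique.

```python
import pandas as pd

KEY_COLS = ["Affiliated companies, start with Operator", "Full project name", "Year", "Country"]

# Toy data standing in for the Part 3 projects dataset ("A" is a made-up company)
toy = pd.DataFrame({
    "Full project name": ["EXPL 3/2012", "EXPL 3/2012", "Kansanshi Mine", "Kansanshi Mine"],
    "Affiliated companies, start with Operator": ["A", "A", "Kansanshi Mining Plc", "Kansanshi Mining Plc"],
    "Commodities (one commodity/row)": [None, None, "Copper (2603)", "Gold (7108)"],
    "Year": [2018, 2018, 2017, 2017],
    "Country": ["Afghanistan", "Afghanistan", "Zambia", "Zambia"],
})

# Compute both masks before adding the label column
key_dup = toy.duplicated(subset=KEY_COLS, keep=False)   # same company, project, year, country
exact_dup = toy.duplicated(keep=False)                  # identical in every column

toy["duplicate_kind"] = "unique"
toy.loc[key_dup & ~exact_dup, "duplicate_kind"] = "same key, different details"
toy.loc[exact_dup, "duplicate_kind"] = "exact copy"
print(toy["duplicate_kind"].value_counts())
```

Exact copies are candidates for removal, while rows that share a key but differ in detail may be legitimate (for example one row per commodity).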

### Consistency

### All companies or agencies in company data (Part 5) exist in the Reporting companies' agencies (Part 3a) and Reporting government entities list (Part 3b)

#### Expectations
- We expect that every company or agency mentioned in the company dataset should have a corresponding entry in either the reporting companies dataset or the reporting government entities dataset.
- This is based on the assumption that the company data are associated with existing companies and/or government entities and any discrepancies could indicate data inconsistencies or missing information.

#### Assumptions
- The company and agency datasets are comprehensive and contain information about all relevant companies and government entities.
- The naming conventions for companies and agencies in the company report match those in the company and government entities tab/dataset.
- Each entry in the company report corresponds to a valid and existing company or agency.

#### Results

In [27]:
def count_missing(main_df, missing_df, field, name):
    # missing_df is a boolean mask over main_df; count the masked rows per `field`
    return main_df[missing_df].groupby(field).size().reset_index(name=name)

In [28]:
# Compute the membership mask once and reuse it below
in_part_3a = df_part_5['Company'].isin(df_part_3a['Full company name'])

all_companies_exist = in_part_3a.all()
if all_companies_exist:
    print("All companies in ledger exist in companies.")
else:
    print("Not all companies in ledger exist in companies.")

missing_companies = df_part_5[~in_part_3a]

# Display the result
print("Companies in ledger that don't exist in companies:")
print(missing_companies[['Company', 'Country', 'Year']])

print(missing_companies['Company'].unique())

print('\n')
print(count_missing(df_part_5, ~in_part_3a, 'Country', 'missing_per_country'))

print('\n')
print(count_missing(df_part_5, ~in_part_3a, 'Year', 'missing_per_year'))

Not all companies in ledger exist in companies.
Companies in ledger that don't exist in companies:
                                                Company         Country  Year
29                           Abid Hassan Zadran Limited     Afghanistan  2018
30                           Abid Hassan Zadran Limited     Afghanistan  2018
31                           Abid Hassan Zadran Limited     Afghanistan  2018
32                           Abid Hassan Zadran Limited     Afghanistan  2018
37     Afghan Shiinink, Mines Extraction and Processing     Afghanistan  2018
...                                                 ...             ...   ...
32506                             Royal Dutch Shell plc  United Kingdom  2021
32507                             Royal Dutch Shell plc  United Kingdom  2021
32508                             Royal Dutch Shell plc  United Kingdom  2021
32509                             Royal Dutch Shell plc  United Kingdom  2021
32510                             Royal Dut

In [29]:
all_in_companies_agencies = (
    df_part_5['Company'].isin(df_part_3a['Full company name']) | 
    df_part_5['Company'].isin(df_part_3b['Full name of agency'])
)

if all_in_companies_agencies.all():
    print("All company names in ledger can be found in either full_company_name or full_agency_name.")
else:
    print("Not all company names in ledger can be found in either full_company_name or full_agency_name.")

not_in_companies_agencies = (
    ~df_part_5['Company'].isin(df_part_3a['Full company name']) & 
    ~df_part_5['Company'].isin(df_part_3b['Full name of agency'])
)

# Display the result
print("Companies or agencies in the company data that don't exist in companies and agencies list:")
print(df_part_5[not_in_companies_agencies][['Company', 'Country', 'Year']])

print(df_part_5[not_in_companies_agencies]['Company'].unique())

print('\n')
print(count_missing(df_part_5, not_in_companies_agencies, 'Country', 'missing_per_country'))

print('\n')
print(count_missing(df_part_5, not_in_companies_agencies, 'Year', 'missing_per_year'))

Not all company names in ledger can be found in either full_company_name or full_agency_name.
Companies or agencies in the company data that don't exist in companies and agencies list:
                                                Company         Country  Year
29                           Abid Hassan Zadran Limited     Afghanistan  2018
30                           Abid Hassan Zadran Limited     Afghanistan  2018
31                           Abid Hassan Zadran Limited     Afghanistan  2018
32                           Abid Hassan Zadran Limited     Afghanistan  2018
37     Afghan Shiinink, Mines Extraction and Processing     Afghanistan  2018
...                                                 ...             ...   ...
32506                             Royal Dutch Shell plc  United Kingdom  2021
32507                             Royal Dutch Shell plc  United Kingdom  2021
32508                             Royal Dutch Shell plc  United Kingdom  2021
32509                             R
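
The `isin` checks above compare raw names across all countries and years, so a name that differs only in letter case (for example `Kansanshi Mining Plc` and `KANSANSHI MINING PLC`) counts as missing, and a name listed for a different country or year counts as present. The sketch below, on toy data, matches on a normalized name together with country and year. `normalize_name` is a hypothetical helper, and the normalization rules are an assumption that should be tuned against the real data.

```python
import pandas as pd


def normalize_name(names: pd.Series) -> pd.Series:
    """Case-fold, trim, and collapse internal whitespace (a hypothetical normalization)."""
    return names.astype(str).str.casefold().str.strip().str.replace(r"\s+", " ", regex=True)


# Toy frames standing in for df_part_5, df_part_3a and df_part_3b
part_5 = pd.DataFrame({
    "Company": ["KANSANSHI MINING PLC", "Royal Dutch Shell plc", "Petoro"],
    "Country": ["Zambia", "United Kingdom", "Norway"],
    "Year": [2018, 2021, 2019],
})
part_3a = pd.DataFrame({
    "Full company name": ["Kansanshi Mining Plc"],
    "Country": ["Zambia"],
    "Year": [2018],
})
part_3b = pd.DataFrame({
    "Full name of agency": ["Petoro"],
    "Country": ["Norway"],
    "Year": [2019],
})

# One reference table of (name_key, Country, Year) built from both Part 3 lists
reference = pd.concat([
    part_3a.rename(columns={"Full company name": "name"}),
    part_3b.rename(columns={"Full name of agency": "name"}),
])
reference["name_key"] = normalize_name(reference["name"])
reference = reference[["name_key", "Country", "Year"]].drop_duplicates()

part_5["name_key"] = normalize_name(part_5["Company"])
matched = part_5.merge(reference, on=["name_key", "Country", "Year"], how="left", indicator=True)
missing = matched[matched["_merge"] == "left_only"]
print(missing[["Company", "Country", "Year"]])
```

For this to work on the real datasets, the `Year` columns would need a common type, since Part 3b shows years as floats (`2019.0`) in the output earlier in this notebook.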

### Consistent company name-company ID pair over time

#### Expectations
- We expect that the pairing of company names with their corresponding company IDs remain consistent over time.
- This expectation is founded on the assumption that company names and IDs serve as stable and reliable identifiers for companies, ensuring their continuity and uniqueness over time.

#### Assumptions
- Each company name is assumed to be consistently associated with a specific company ID across different years.
- Company IDs are assumed to be unique and not reassigned to different companies over time.
  
#### Results


In [30]:
def check_consistency_year_country(df, name_col, id_col):
    # Check for consistency within each group of year and country
    consistency_check = df.groupby(['Year', 'Country', name_col])[id_col].nunique().eq(1).all()

    if consistency_check:
        print(f"The {name_col} and {id_col} pairs are consistent over time.")
    else:
        print(f"There are inconsistencies in {name_col} and {id_col} pairs over time.")


# Function to check consistency of name-id pairs over time and list inconsistent pairs
def check_and_list_inconsistencies(df, name_col, id_col):
    group_cols = ['Year', 'Country', name_col]
    # Row-level mask: True where the row's group has exactly one distinct non-null ID
    is_consistent = df.groupby(group_cols)[id_col].transform('nunique').eq(1)

    # List inconsistent pairs
    inconsistent_pairs = df[~is_consistent].groupby(group_cols)[id_col].apply(list).reset_index()
    print(inconsistent_pairs)

In [31]:
check_consistency_year_country(df_part_3a, 'Full company name', 'Company ID number')

There are inconsistencies in Full company name and Company ID number pairs over time.


In [32]:
# consistency_check = df_part_3a.groupby(['Year', 'Country', 'Full company name'])['Company ID number'].nunique().eq(1).all()

inconsistent_pairs = df_part_3a[~df_part_3a.groupby(['Year', 'Country', 'Full company name'])['Company ID number'].transform('nunique').eq(1)]

inconsistent_pairs

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
111,تصدی کود برق بلخ Kod-e-barq Balk,State-owned enterprises & public corporations,,Fertliser & power,,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
529,Aggregated,Private,,Mining,,,,36533218.00,Albania,ALB,2017,1/1/2017,12/31/2017
729,Aggregated,Private,,Mining,,,,39681505.00,Albania,ALB,2018,1/1/2018,12/31/2018
998,EXXON MOBIL,Private,,Oil & Gas,"Oil, gas, condensate",No,,697026000.00,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
1005,ENI IVORY COAST LIMITED,Private,,Oil & Gas,"Oil, gas, condensate",No,,2807961535.00,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3216,GX Technology,,,Oil & Gas,,,,,Seychelles,SYC,2018,2018-01-01,2018-12-31
3217,JOGMEX,,,Oil & Gas,,,,,Seychelles,SYC,2018,2018-01-01,2018-12-31
3242,GRAYSON BANDA,Private,,Mining,Gold,Not available,Not available,3126118498.00,Tanzania,TZA,2018,2017-07-01,2018-06-30
3256,TANZANIA PORTLAND CEMENT PUBLIC LIMITED COMPANY,Private,,Mining,"Limeston, Sandstone",Not available,Not available,24750760174.02,Tanzania,TZA,2018,2017-07-01,2018-06-30


In [33]:
print(inconsistent_pairs['Full company name'].unique())
len(inconsistent_pairs['Full company name'].unique())

['تصدی کود برق بلخ Kod-e-barq Balk' 'Aggregated' 'EXXON MOBIL'
 'ENI IVORY COAST LIMITED' "OPHIR Cote d'Ivoire"
 "ENERGIE DE Cote d'Ivoire" 'SVENSKA PETROL AKTIEBOLAG'
 'HALLA CORPORATION' 'BAI JIE' 'COMFORCE' 'EPPM' 'HUAYOU' 'Dovemco'
 'Entre Mares de Guatemala S.A.' 'Guaxilan S. A.' 'Maya Tradición'
 'Mayaníquel S.A.' 'Minera San Rafael S.A.'
 'Montana Exploradora de Guatemala S.A.' 'Peña Rubia S.A.'
 'Procesamiento de Materias Primas, Sílice y Derivados de Centroamérica S.A.'
 'City Peten S. de R. L.' 'Empresa Petrolea del Itsmo S.A.'
 'Perenco Guatemala Limited' 'BHARAT PETROLEUM CORPORATION LTD.'
 'CANAL COMPANIES' 'CHINA NATIONAL PETROLEUM CORPORATION'
 'CHINA OFFSHORE OIL (SINGAPORE) INTERNATIONAL PTE. LTD.'
 'EMIRATES NATIONAL OIL COMPANY' 'IRAQI OIL TANKERS COMPANY (IOTC)'
 'IRAQ STAR OIL SERVICES COMPANY' 'JXTG NIPPON OIL & ENERGY CORPORATION'
 'LITASCO MIDDLE EAST DMCC'
 'MOL HUNGARIAN OIL AND GAS PUBLIC LIMITED COMPANY'
 'MOTOR OIL (HELLAS) CORINTH REFINERIES S.A' 'NAYARA E

176

In [34]:
print(inconsistent_pairs['Company ID number'].unique())
len(inconsistent_pairs['Company ID number'].unique())

[nan]


1

In [35]:
check_consistency_year_country(df_part_3b, 'Full name of agency', 'ID number (if applicable)')

There are inconsistencies in Full name of agency and ID number (if applicable) pairs over time.
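
Two points about the check above are worth keeping in mind. First, `nunique()` ignores missing values by default, so a group whose IDs are all missing has zero distinct IDs and fails the `eq(1)` test. The `[nan]` result above is consistent with this: the flagged company rows appear to have missing IDs rather than conflicting ones. Second, grouping by `Year` only tests consistency within a single year, not over time. The sketch below, on toy data, labels each group as missing, consistent, or conflicting, and can group with or without `Year`. `classify_id_consistency` is a hypothetical helper, and `Acme` and `Beta` are made-up company names.

```python
import pandas as pd


def classify_id_consistency(df, name_col, id_col, group_cols):
    """Per group: count distinct non-null IDs and label the group."""
    distinct_ids = df.groupby(group_cols + [name_col])[id_col].nunique()
    labels = distinct_ids.map(
        lambda n: "missing ID" if n == 0 else ("consistent" if n == 1 else "conflicting IDs")
    )
    return labels.rename("status").reset_index()


# Toy data standing in for df_part_3a
toy = pd.DataFrame({
    "Full company name": ["EXXON MOBIL", "Acme", "Acme", "Beta", "Beta"],
    "Company ID number": [None, "111", "222", "333", "333"],
    "Country": ["Cote d'Ivoire", "Ghana", "Ghana", "Chad", "Chad"],
    "Year": [2017, 2018, 2019, 2018, 2019],
})

# Within a year and country (the grouping used above)
print(classify_id_consistency(toy, "Full company name", "Company ID number", ["Year", "Country"]))

# Across years within a country: this surfaces IDs that change over time
across_years = classify_id_consistency(toy, "Full company name", "Company ID number", ["Country"])
print(across_years)
```

In the toy data, `Acme` looks consistent within each year but is flagged as conflicting once the years are compared with each other.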


## Use-cases to validate

## Statistics and graphs

### How many years of data does each company/agency have?
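
One way this question could be answered is to count the distinct reporting years per company within each country. The sketch below uses toy data and is only an illustration under that assumption; it is not an analysis that has been run on the EITI datasets.

```python
import pandas as pd

# Toy data standing in for df_part_3a
toy = pd.DataFrame({
    "Full company name": ["Kansanshi Mining Plc", "Kansanshi Mining Plc", "Volker Dredging Ltd"],
    "Country": ["Zambia", "Zambia", "United Kingdom"],
    "Year": [2017, 2018, 2021],
})

years_per_company = (
    toy.groupby(["Country", "Full company name"])["Year"]
    .agg(n_years="nunique", first_year="min", last_year="max")
    .reset_index()
)
print(years_per_company)
```

Name variants such as `KANSANSHI MINING PLC` would be counted as separate companies unless the names are normalized first.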

### How many (and which) companies/agencies have changed their company type (e.g. from SOE to private or vice versa)?

### How many (and which) companies/agencies have had their names changed?

### How many (and which) companies/agencies have had their IDs changed?

### Instances of company/agency names being entered incorrectly