# EITI consolidated Summary Data files verification and validation
   
## Purpose
The purpose of this notebook is to verify assumptions that should hold true due to the nature of EITI's data and validate the usefulness of the data for different use-cases or business goals.

## Findings
1. There are a number of formatting errors and mistakes in the summary data files, such as:
    - Spelling mistakes
    - Using placeholder values
    - Leading or trailing white space around text values
2. There is a lot of duplication in the summary data files, such as:
    - Duplicate listings of companies/agencies in the same year (Part 3 - Reporting companies' list & Part 3 - Reporting government entities list)
        - possible reason/s:
           - listing the same company/agency for different commodities 
    - Duplicate project names in the same country and year (Part 3 - Reporting projects list)
        - possible reason/s:
            - no project name is given or a placeholder is used
            - same project names for different commodity values but no commodity value is provided
    - Duplicate company/agency ids in the same country and year (Part 3 - Reporting companies' list & Part 3 - Reporting government entities list)
        - possible reason/s:
            - a majority of the duplication is due to the information not being available or not existing in the dataset (N/A, NC, etc.)
    - Duplicate company-project and commodity pairs in the same year and country
        - possible reason/s:
            - information on commodities is neither reported nor available
3. There are inconsistencies in the list of projects, companies, and agencies, such as:
    - There are companies and agencies listed in Part 5 - Company data that are not in Part 3 - Reporting companies' list
        - possible reason/s:
            - mistakes in spelling the name of the same entity
                - Royal Dutch Shell PLC VS Royal Dutch Shell plc
                - Afghan Shiinink, Mines Extraction and Processing VS Afghan Shinink, Mines Extraction and Processing
            - companies not being listed in the Reporting companies' list
    - Company IDs for the same company change over time
        - possible reason/s:
            - typographical errors
            - reporting change (from Not communicated to Not available, etc.)
            - actual change in the ID
    - Not all reported government entities appear in both the Government revenues (Part 4) and the Reporting government entities list
        - possible reason/s:
            - NaN or NULL values or entity names not being provided correctly
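
Some of the naming problems listed above (stray white space, case-only differences such as `PLC` vs `plc`) can be surfaced by comparing names on a normalized key. The following is a minimal sketch on toy data, not part of the original analysis; `normalize_name` is a hypothetical helper, and the toy series only reuses the Shell spellings quoted above.

```python
import pandas as pd

def normalize_name(name):
    """Return a comparison key: trimmed, whitespace-collapsed, case-folded."""
    if not isinstance(name, str):
        return name  # leave NaN and other non-string values untouched
    return " ".join(name.split()).casefold()

# Toy data (assumption): three spellings of one company plus an unrelated one.
names = pd.Series([
    "Royal Dutch Shell PLC",
    "Royal Dutch Shell plc",
    "  Royal Dutch Shell plc ",
    "CNOOC IRAQ LIMITED",
])

keys = names.map(normalize_name)

# Count how many distinct raw spellings share each normalized key.
n_spellings = names.groupby(keys).nunique()
ambiguous = n_spellings[n_spellings > 1].index.tolist()
print(ambiguous)
```

A key like this does not catch single-character typos such as `Shiinink` vs `Shinink`; those would need a fuzzy string comparison, which is outside this sketch.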

## Data
1. Consolidated summary data files of all countries and years divided into individual sheets (datasets).
   - [Part 1 - About](data/consolidated/Part%201%20-%20About.csv)
   - [Part 3 - Reporting companies' list](data/consolidated/Part%203%20-%20Reporting%20companies'%20list.csv)
   - [Part 3 - Reporting government entities list](data/consolidated/Part%203%20-%20Reporting%20government%20entities%20list.csv)
   - [Part 3 - Reporting projects' list](data/consolidated/Part%203%20-%20Reporting%20projects'%20list.csv)
   - [Part 4 - Government revenues](data/consolidated/Part%204%20-%20Government%20revenues.csv)
   - [Part 5 - Company data](data/consolidated/Part%205%20-%20Company%20data.csv)

## Methodology
1. [Import the libraries](#Import-Libraries)
2. [Import/load the data](#Import-the-data)
3. Verify the assumptions
4. Validate use-cases
5. Perform analysis and show simple statistics/visualizations about the data

## Assumptions to verify
1. [**COMPLETENESS AND CORRECTNESS**](#Completeness-and-Correctness)
    1. [Cell values are complete and/or correct (i.e. no unexpected or incorrect values)](#Completeness-and-Correctness)
2. [**UNIQUENESS AND NON-DUPLICATION**](#Uniqueness-and-Non-duplication)
    1. [No duplicate company/agency names in the same year and country](#No-duplicate-company/agency-names-in-the-same-year-and-country)
    2. [No duplicate project names for the same commodity in the same year and country](#No-duplicate-project-names-for-the-same-commodity-in-the-same-year-and-country)
    3. [No duplicate company/agency IDs in the same year and country](#No-duplicate-company-ids-in-the-same-year-and-country)
    4. [No duplicate company-project pair in the same year and country](#No-duplicate-company-project-pair-in-the-same-year-and-country)
3. [**CONSISTENCY**](#Consistency)
    1. [All companies or agencies in company data (Part 5) exist in the Reporting companies' agencies list (Part 3a)](#All-companies-or-agencies-in-company-data-(Part-5)-exist-in-the-Reporting-companies'-agencies-list-(Part-3a))
    2. [Consistent company name-company ID pair over time](#Consistent-company-name-company-ID-pair-over-time)
    3. [Government entities that appear in the Government revenues (Part 4) must also appear in the Reporting government entities list (Part 3b)](#Government-entities-that-appear-in-the-Government-revenues-(Part-4)-must-also-appear-in-the-Reporting-government-entities-list-(Part-3b))
    4. Companies that appear as affiliated companies in Reporting projects list (Part 3c) should also appear in the Company data (Part 5)
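
The consistency checks above all ask the same question: which names in one table have no match in a reference table for the same country and year? One way to sketch this is a left merge with `indicator=True`. The helper name `missing_from_reference` and the toy frames are illustrative assumptions, and the sketch assumes that Part 5 carries a `Year` column alongside `Company` and `Country`.

```python
import pandas as pd

# Toy stand-ins (assumption) for Part 5 and Part 3a, reduced to the columns used.
part_5 = pd.DataFrame({
    "Company": ["Alpha Mining", "Beta Oil", "Gamma Gas"],
    "Country": ["X", "X", "Y"],
    "Year": [2018, 2018, 2019],
})
part_3a = pd.DataFrame({
    "Full company name": ["Alpha Mining", "Gamma Gas"],
    "Country": ["X", "Y"],
    "Year": [2018, 2019],
})

def missing_from_reference(data, reference, data_name_col, ref_name_col,
                           keys=("Country", "Year")):
    """Return distinct (name, *keys) rows of `data` with no match in `reference`."""
    left = data[[data_name_col, *keys]].drop_duplicates()
    right = (reference[[ref_name_col, *keys]]
             .drop_duplicates()
             .rename(columns={ref_name_col: data_name_col}))
    merged = left.merge(right, on=[data_name_col, *keys],
                        how="left", indicator=True)
    # "left_only" marks rows that exist in `data` but not in `reference`.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

unlisted = missing_from_reference(part_5, part_3a,
                                  "Company", "Full company name")
print(unlisted)
```

The match is exact, so spelling variants of the same entity will show up as missing unless the names are normalized first.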

## Use-cases to validate (examples)
1. [How has the extractives industry evolved over time in country X? How much is the volume extracted over time? How much revenue is generated?](#How-has-the-extractives-industry-evolved-over-time-in-country-X?-How-much-is-the-volume-extracted-over-time?-How-much-revenue-is-generated?)
2. How much tax is paid by SOEs (state-owned enterprises, a company type) in each country? How does this compare with private companies?
3. What percentage of the market share do SOEs take in each country? Based on volume extracted? Based on revenue? Based on taxes paid?

Add more use-cases. A use case should demonstrate whether a reuser with a specific question tied to a real-life need can answer that question from the data.

## Statistics and graphs
1. [How many years of data does each country have?](#How-many-years-of-data-does-each-country-have?)

### Import Libraries

In [1]:
# import libraries

import pandas as pd
from os import path

### Import the data

In [121]:
file_dir = "data/consolidated/"

# load the csvs into data frames
df_part_1 = pd.read_csv(path.join(file_dir, "Part 1 - About.csv"))
df_part_3a = pd.read_csv(path.join(file_dir, "Part 3 - Reporting companies' list.csv"))
df_part_3b = pd.read_csv(path.join(file_dir, "Part 3 - Reporting government entities list.csv"))
df_part_3c = pd.read_csv(path.join(file_dir, "Part 3 - Reporting projects' list.csv"))
df_part_4 = pd.read_csv(path.join(file_dir, "Part 4 - Government revenues.csv"))
# df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv")) # results in warning about columns having mixed types
df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"), low_memory=False)

df_list = [df_part_1, df_part_3a, df_part_3b, df_part_3c, df_part_4, df_part_5]
df_dict = {"Part 1 - About.csv": df_part_1,
           "Part 3 - Reporting companies' list.csv": df_part_3a,
           "Part 3 - Reporting government entities list.csv": df_part_3b,
           "Part 3 - Reporting projects' list.csv": df_part_3c,
           "Part 4 - Government revenues.csv": df_part_4,
           "Part 5 - Company data.csv": df_part_5
          }

In [3]:
# Checking mixed type columns
mixed_type_columns = df_part_5.map(type).nunique() > 1

# Print or display the columns with mixed types
print("Columns with mixed types:")
print(mixed_type_columns[mixed_type_columns].index.tolist())

Columns with mixed types:
['Company', 'Government entity', 'Revenue stream name', 'Levied on project (Y/N)', 'Reported by project (Y/N)', 'Project name', 'Reporting currency', 'Payment made in-kind (Y/N)', 'In-kind volume (if applicable)', 'Unit (if applicable)', 'Comments', 'Country', 'ISO Code', 'Start Date', 'End Date']
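
The list above only says which columns hold more than one Python type, not which types. A small sketch like the one below (toy data, hypothetical helper `type_breakdown`) counts the types per column. Note that a missing value is stored as a float `NaN`, so any text column with gaps is reported as mixed even if every real value is a string.

```python
import pandas as pd

# Toy frame (assumption): a text column with a gap and a numeric-looking entry.
df = pd.DataFrame({
    "Project name": ["Alpha", float("nan"), "Beta", 2025257],
    "Year": [2018, 2019, 2020, 2021],
})

def type_breakdown(frame):
    """Map each column name to a {type name: cell count} dictionary."""
    result = {}
    for col in frame.columns:
        type_names = frame[col].map(lambda v: type(v).__name__)
        result[col] = type_names.value_counts().to_dict()
    return result

breakdown = type_breakdown(df)
for col, counts in breakdown.items():
    print(col, counts)
```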


## Assumptions to verify

### Completeness and Correctness

It is assumed that the data received are complete and correct. In this section, we check for any unexpected or incorrect values in the tables, such as instances of 'no data', 'NULL', or 'NaN'.

#### Expectations
- We expect that the data is complete and correct.
- There are no 'no data', 'NULL', or 'NaN' values in the dataframes
  
#### Assumptions
- The dataset is assumed to have undergone rigorous data validation checks to identify and address any missing or erroneous values.
- Data entry procedures have been consistently followed, minimizing the likelihood of unexpected or incorrect values.
- Cell values are expected to fall within valid and meaningful ranges.


#### Results

In [4]:
# Functions to find and count NULL, '#ERROR!', empty-string, and whitespace-only values

def columns_with_null(df):
    return df.columns[df.isnull().any()]

def count_null_per_column(df):
    return df.isnull().sum()    

def columns_with_error(df):
    return df.columns[df.eq('#ERROR!').any()]

def count_error_per_column(df):
    return df.apply(lambda col: (col == '#ERROR!').sum())

def columns_with_empty_strings(df):
    return df.columns[df.eq('').any()]

def count_empty_strings_per_column(df):
    return df.apply(lambda col: (col == '').sum())

def columns_with_only_space(df):
    return df.columns[df.map(lambda cell: isinstance(cell, str) and cell.isspace()).any()]

def count_only_space_per_column(df):
    # same string-only test as columns_with_only_space, so both functions agree
    return df.map(lambda cell: isinstance(cell, str) and cell.isspace()).sum()
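
The four pairs of helpers above can also be collapsed into one table per dataset, which makes the checks easier to compare side by side. This is a sketch under assumptions, not the notebook's own method: `sentinel_summary` is a hypothetical name and the toy frame is invented for illustration.

```python
import pandas as pd

def sentinel_summary(df, error_token="#ERROR!"):
    """One row per column: counts of NULL, error-token, empty and whitespace-only cells."""
    def count(col, predicate):
        return int(col.map(predicate).sum())

    return pd.DataFrame({
        "null": df.isnull().sum(),
        "error": df.apply(lambda c: count(c, lambda v: v == error_token)),
        "empty": df.apply(lambda c: count(c, lambda v: v == "")),
        "whitespace_only": df.apply(
            lambda c: count(c, lambda v: isinstance(v, str) and v.isspace())),
    })

# Toy data (assumption) covering each kind of problematic cell.
toy = pd.DataFrame({
    "name": ["Alpha", " ", None, "#ERROR!"],
    "code": ["A1", "", "#ERROR!", "#ERROR!"],
})
summary = sentinel_summary(toy)
print(summary)
```

When the data is loaded with `pd.read_csv` and default settings, empty fields are read as `NaN`, so the `empty` column is expected to be zero and those cells are counted under `null` instead.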

**The results below show that some columns in the data contain NULL values**

In [5]:
for tab in df_dict:
    print("Columns with NULL values for {}:".format(tab))
    for column in columns_with_null(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with NULL: {}/{}".format(len(columns_with_null(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of NULL per column:")
    print(count_null_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with NULL values for Part 1 - About.csv:
- ISO Alpha-3 Code
- National currency name
- National currency ISO-4217
- Start Date
- End Date
- Has an EITI Report been prepared by an Independent Administrator?
- What is the name of the company?
- Date that the EITI Report was made public
- URL, EITI Report
- Does the government systematically disclose EITI data at a single location?
- Publication date of the EITI data
- Website link (URL) to EITI data
- Are there other files of relevance?
- Date that other file was made public
- URL
- Does the government have an open data policy?
- Open data portal / files
- Oil
- Gas
- Mining (incl. Quarrying)
- Other, non-upstream sectors
- If yes, please specify name (insert new rows if multiple)
- Number of reporting government entities (incl SOEs if recipient)
- Number of reporting companies (incl SOEs if payer)
- Reporting currency (ISO-4217 currency codes)
- Exchange rate used: 1 USD = 
- Exchange rate source (URL,…)
- … by revenue stream
- 

**The results below show that some columns in the data contain '#ERROR!' values**

In [6]:
for tab in df_dict:
    print("Columns with '#ERROR!' values for {}:".format(tab))
    for column in columns_with_error(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with '#ERROR!': {}/{}".format(len(columns_with_error(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of '#ERROR!' per column:")
    print(count_error_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with '#ERROR!' values for Part 1 - About.csv:
- ISO Alpha-3 Code
- National currency name
- National currency ISO-4217

Number of columns with '#ERROR!': 3/39

Number of '#ERROR!' per column:
Country or area name                                                            0
ISO Alpha-3 Code                                                               20
National currency name                                                         20
National currency ISO-4217                                                     19
Start Date                                                                      0
End Date                                                                        0
Has an EITI Report been prepared by an Independent Administrator?               0
What is the name of the company?                                                0
Date that the EITI Report was made public                                       0
URL, EITI Report                                              

**The check below looks for columns that contain empty strings (''); none are found in Part 1 - About**

In [7]:
for tab in df_dict:
    print("Columns with '' values for {}:".format(tab))
    for column in columns_with_empty_strings(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with '': {}/{}".format(len(columns_with_empty_strings(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of '' per column:")
    print(count_empty_strings_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with '' values for Part 1 - About.csv:

Number of columns with '': 0/39

Number of '' per column:
Country or area name                                                           0
ISO Alpha-3 Code                                                               0
National currency name                                                         0
National currency ISO-4217                                                     0
Start Date                                                                     0
End Date                                                                       0
Has an EITI Report been prepared by an Independent Administrator?              0
What is the name of the company?                                               0
Date that the EITI Report was made public                                      0
URL, EITI Report                                                               0
Does the government systematically disclose EITI data at a single location?    0
Pub

**The check below looks for columns that contain cells consisting only of white space; none are found in Part 1 - About**

In [8]:
for tab in df_dict:
    print("Columns with only white spaces for {}:".format(tab))
    for column in columns_with_only_space(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with only white spaces: {}/{}".format(len(columns_with_only_space(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of only white spaces per column:")
    print(count_only_space_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with only white spaces for Part 1 - About.csv:

Number of columns with only white spaces: 0/39

Number of only white spaces per column:
Country or area name                                                           0
ISO Alpha-3 Code                                                               0
National currency name                                                         0
National currency ISO-4217                                                     0
Start Date                                                                     0
End Date                                                                       0
Has an EITI Report been prepared by an Independent Administrator?              0
What is the name of the company?                                               0
Date that the EITI Report was made public                                      0
URL, EITI Report                                                               0
Does the government systematically disclose EI

### Uniqueness and Non-duplication

In [9]:
def count_duplicates(main_df, duplicate_df, field, name):
    return main_df[duplicate_df].groupby(field).size().reset_index(name=name)

### No duplicate company/agency names in the same year and country

#### Expectations
- A company or agency name will appear only once per year per country.
- We expect that within the provided dataset, each combination of a company or agency name in a given year and country is unique.
- This expectation is grounded in the assumption that the dataset is designed to maintain distinct company or agency identities for a given year and country.

#### Assumptions
- The company or agency is only reported once per year regardless of sector/commodity
- Company and agency names are assumed to be uniquely associated with specific entities within a given year and country.
- Any duplicate entries would be considered as potential data entry errors or inconsistencies.

#### Results

In [10]:
# def get_duplicate_names(df, field):
#     return df[df.duplicated([field, 'Country', 'Year'], keep=False)].sort_values(by=[field])

def get_duplicates(df, field_to_check, *constraint_fields):
    return df[df.duplicated([field_to_check, *constraint_fields], keep=False)].sort_values(by=[field_to_check])
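
`get_duplicates` relies on `keep=False`, which marks every member of a duplicate group rather than only the repeats. The toy example below (invented data, same column names as Part 3a) is a small sketch of the difference between `keep=False` and the default `keep="first"`.

```python
import pandas as pd

# Toy data (assumption): "A Co" is listed twice for X/2018 and once for X/2019.
toy = pd.DataFrame({
    "Full company name": ["A Co", "A Co", "A Co", "B Co"],
    "Country": ["X", "X", "X", "Y"],
    "Year": [2018, 2018, 2019, 2018],
})
key = ["Full company name", "Country", "Year"]

# keep=False flags every row that belongs to a duplicated key.
all_copies = toy[toy.duplicated(key, keep=False)]

# keep="first" (the default) flags only the second and later occurrences.
extras_only = toy[toy.duplicated(key, keep="first")]

print(all_copies.index.tolist(), extras_only.index.tolist())
```

Showing all copies is useful for inspection, while the `keep="first"` view gives the number of rows that would be dropped by de-duplication.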

In [100]:
df_part_3a[df_part_3a.duplicated(subset=['Country', 'Year', 'Full company name'], keep=False)]

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
173,شرکت برادران خالد عزیز,Private,1019984010,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
174,شرکت برادران خالد عزیز,Private,1019984010,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
497,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Nickel,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,864690546.00,Philippines,PHL,2018,2018-01-01,2018-12-31
519,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Limestone,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,,Philippines,PHL,2018,2018-01-01,2018-12-31
1549,Al Waha Petroleum Co. Ltd.,Private,,Oil & Gas,"Oil, Gas, Condensates",https://www.petroalwaha.com/,,628712909.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1552,CNOOC IRAQ LIMITED,Private,,Oil & Gas,"Oil, Gas, Condensates",https://cnoocinternational.com/operations/midd...,,1098601809.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1564,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1585,Al Waha Petroleum Co. Ltd.,Private,,Oil & Gas,"Oil, Gas, Condensates",,,628712909.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1586,CNOOC IRAQ LIMITED,Private,,Oil & Gas,"Oil, Gas, Condensates",,,1098601809.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1588,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018


In [94]:
get_duplicates(df_part_3a, 'Full company name', 'Country', 'Year')

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
1549,Al Waha Petroleum Co. Ltd.,Private,,Oil & Gas,"Oil, Gas, Condensates",https://www.petroalwaha.com/,,628712909.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1585,Al Waha Petroleum Co. Ltd.,Private,,Oil & Gas,"Oil, Gas, Condensates",,,628712909.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1552,CNOOC IRAQ LIMITED,Private,,Oil & Gas,"Oil, Gas, Condensates",https://cnoocinternational.com/operations/midd...,,1098601809.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1586,CNOOC IRAQ LIMITED,Private,,Oil & Gas,"Oil, Gas, Condensates",,,1098601809.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1564,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1588,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
497,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Nickel,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,864690546.00,Philippines,PHL,2018,2018-01-01,2018-12-31
519,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Limestone,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,,Philippines,PHL,2018,2018-01-01,2018-12-31
173,شرکت برادران خالد عزیز,Private,1019984010,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
174,شرکت برادران خالد عزیز,Private,1019984010,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20


#### Results
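- There are duplicate listings of companies in the same year in Iraq and Afghanistan. The Afghan rows are identical; the Iraqi rows differ at most in the website column.
- The duplicate listing in the Philippines (Rio Tuba Nickel Mining Corporation) is likely due to separate rows for different commodities (Nickel and Limestone)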
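
To separate exact repeats from listings that differ in some other field, one option is to count the distinct values of each non-key column within every duplicate group. The sketch below uses invented data that mimics the three patterns seen above; `differing_columns` is a hypothetical helper and the shortened column names are assumptions.

```python
import pandas as pd

# Toy data (assumption): an exact repeat, a commodity split, and a missing website.
toy = pd.DataFrame({
    "Full company name": ["Aziz Co", "Aziz Co", "Rio Co", "Rio Co",
                          "Waha Co", "Waha Co"],
    "Commodities": ["Coal", "Coal", "Nickel", "Limestone", "Oil", "Oil"],
    "Website": ["Not applicable", "Not applicable", "rio.example",
                "rio.example", "waha.example", None],
    "Country": ["AFG", "AFG", "PHL", "PHL", "IRQ", "IRQ"],
    "Year": [2018] * 6,
})
key = ["Full company name", "Country", "Year"]

def differing_columns(df, key):
    """For each duplicated key, list the non-key columns whose values differ."""
    dupes = df[df.duplicated(key, keep=False)]
    # dropna=False counts a missing value as its own distinct value.
    n_distinct = dupes.groupby(key).nunique(dropna=False)
    return {group: [col for col in n_distinct.columns if row[col] > 1]
            for group, row in n_distinct.iterrows()}

result = differing_columns(toy, key)
for group, cols in result.items():
    print(group, cols)
```

An empty list means the rows in that group are exact repeats; a non-empty list names the fields that explain why the entity was listed more than once.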

In [12]:
get_duplicates(df_part_3b, 'Full name of agency', 'Country', 'Year')

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
238,Mpohor Wassa East,Local government,,#ERROR!,Ghana,GHA,2019.0,2019-01-01,2019-12-31
252,Mpohor Wassa East,Local government,,,Ghana,GHA,2019.0,2019-01-01,2019-12-31
247,Obuasi Municipal Assembly,Local government,,#ERROR!,Ghana,GHA,2019.0,2019-01-01,2019-12-31
259,Obuasi Municipal Assembly,Local government,,,Ghana,GHA,2019.0,2019-01-01,2019-12-31


#### Results
- There are duplicate listings of agencies for Ghana in 2019

### No duplicate project names for the same commodity in the same year and country

#### Expectations
- A project name will appear only once per year per country per commodity.
- Each project name within the dataset is unique for a given year, country, and commodity. This expectation is based on the assumption that project names are assigned in a manner that avoids duplicates for a given year, country, and commodity.

#### Assumptions
- A project is only reported once per year, country, and commodity
- Project names are assumed to be distinct identifiers for different projects within the same year, country, and commodity.
- Duplicate project names within the same year, country, and commodity would be treated as anomalies.

#### Results

In [101]:
duplicate_projects = get_duplicates(df_part_3c, 
                                    'Full project name', 
                                    *['Country', 'Year', 'Commodities (one commodity/row)', 'Affiliated companies, start with Operator'])

duplicate_projects

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
6082,2025257,Licence,Breedon Group Plc,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6081,2025257,Licence,Breedon Group Plc,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5418,ALKANE ENERGY UK LIMITED (03128509),AL10,ALKANE ENERGY UK LIMITED (03128509),Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5995,ALKANE ENERGY UK LIMITED (03128509),PEDL279,ALKANE ENERGY UK LIMITED (03128509),Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5965,ALKANE ENERGY UK LIMITED (03128509),PEDL130,ALKANE ENERGY UK LIMITED (03128509),Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1400,,,,,,,,,,,,,,
1402,,,,,,,,,,,,,,
1404,,,,,,,,,,,,,,
1406,,,,,,,,,,,,,,


In [14]:
# unique() keeps NaN as a distinct value, so rows with a missing project name add one to this count
duplicate_project_names = duplicate_projects['Full project name'].unique()

print("Number of duplicate project names: {}\n".format(len(duplicate_project_names)))

duplicate_projects_count = duplicate_projects.groupby('Full project name').size().reset_index(name='number of times duplicated')
duplicate_projects_count

Number of duplicate project names: 159



Unnamed: 0,Full project name,number of times duplicated
0,2025257,2
1,ALKANE ENERGY UK LIMITED (03128509),3
2,"ALKANE ENERGY UK LIMITED (03128509), EGDON RES...",2
3,ALPHA PETROLEUM RESOURCES LIMITED (03949599),2
4,ANASURIA HIBISCUS UK LIMITED (09696268),2
...,...,...
153,THIRD ENERGY UK GAS LIMITED (01421481),5
154,TOTALENERGIES E&P UK LIMITED (00811900),17
155,VICTOR,2
156,WINTERSHALL NOORDZEE B.V. (FC027567),2


In [15]:
duplicate_project_check = duplicate_projects[duplicate_projects['Full project name'] == 'TOTALENERGIES E&P UK LIMITED (00811900)']
duplicate_project_check

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
5925,TOTALENERGIES E&P UK LIMITED (00811900),P724,TOTALENERGIES E&P UK LIMITED (00811900),Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5876,TOTALENERGIES E&P UK LIMITED (00811900),P362,"E. F. OIL AND GAS LIMITED (03430228), ENI ELGI...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5914,TOTALENERGIES E&P UK LIMITED (00811900),P666,"E. F. OIL AND GAS LIMITED (03430228), ENI ELGI...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5842,TOTALENERGIES E&P UK LIMITED (00811900),P281,TOTALENERGIES E&P UK LIMITED (00811900),Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5839,TOTALENERGIES E&P UK LIMITED (00811900),P268,TOTALENERGIES E&P UK LIMITED (00811900),Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5843,TOTALENERGIES E&P UK LIMITED (00811900),P284,TOTALENERGIES E&P UK LIMITED (00811900),Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5477,TOTALENERGIES E&P UK LIMITED (00811900),P1159,"INEOS E&P (UK) LIMITED (04376184), KISTOS ENER...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5485,TOTALENERGIES E&P UK LIMITED (00811900),P1195,"INEOS E&P (UK) LIMITED (04376184), KISTOS ENER...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5947,TOTALENERGIES E&P UK LIMITED (00811900),P911,"INEOS E&P (UK) LIMITED (04376184), KISTOS ENER...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
5528,TOTALENERGIES E&P UK LIMITED (00811900),P1678,"INEOS E&P (UK) LIMITED (04376184), KISTOS ENER...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31


In [109]:
duplicate_project_check = duplicate_projects[duplicate_projects['Country'] == 'Afghanistan']
duplicate_project_check

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
21,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
22,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
101,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2019,2018-12-21,2019-12-20
102,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2019,2018-12-21,2019-12-20
37,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
118,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
117,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
38,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
121,SSML-Kabu 13/2012,Small-scale mining license,شرکت ساختمانی فاروق استانکزی,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
42,SSML-Kabu 13/2012,Small-scale mining license,شرکت ساختمانی فاروق استانکزی,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20


In [108]:
duplicate_project_names

array(['2025257', 'ALKANE ENERGY UK LIMITED (03128509)',
       'ALKANE ENERGY UK LIMITED (03128509), EGDON RESOURCES U.K. LIMITED (03424561)',
       'ALPHA PETROLEUM RESOURCES LIMITED (03949599)',
       'ANASURIA HIBISCUS UK LIMITED (09696268)',
       'ANGUS ENERGY WEALD BASIN NO.3 LIMITED (SC055329)',
       'APACHE BERYL I LIMITED (FC005975)',
       'AURORA ENERGY RESOURCES LIMITED (SC335749)', 'Altan tsagaan ovoo',
       'Ambatovy - AMBATOVY MINERALS S.A',
       'Ambatovy - AMBATOVY MINERALS S.A.',
       'Area 106/400, 1,500,000 tpa, North Dowsing',
       'Area 127, 200,000 tpa, SW Needles',
       'Area 137, 1,000,000 tpa, Area A',
       'Area 197, 300,000 tpa, Protector Overfalls',
       'Area 228, 1,500,000 tpa (mix), Off Great Yarmouth',
       'Area 240, 1,500,000 tpa, Cross Sands',
       'Area 242/361, 325,000 tpa, Lowestoft',
       'Area 254, 500,000 tpa, Off Great Yarmouth',
       'Area 340, 500,000 tpa, Nab',
       'Area 351, 500,000 tpa, SE Isle of Wight',
 

#### Implications
- There is a significant number of duplicate project names across many of the countries.
- One likely cause is missing or placeholder values (e.g. 'Not available') in the commodities field
- There is a need to check/count the number of duplicates per country to see if this should be expected
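
One way to test the missing-commodity explanation is to measure, per country, what share of the duplicated rows has no usable commodity value. The following is a sketch on invented data; the `PLACEHOLDERS` set and the helper `commodity_missing` are assumptions and would need to be matched to the placeholder spellings actually present in the files.

```python
import pandas as pd

# Assumed placeholder spellings, compared after trimming and case-folding.
PLACEHOLDERS = {"not available", "not applicable", "n/a", "nc", ""}

def commodity_missing(value):
    """True when the commodity cell is NaN/None or an assumed placeholder."""
    if pd.isna(value):
        return True
    return str(value).strip().casefold() in PLACEHOLDERS

commodity_col = "Commodities (one commodity/row)"

# Toy data (assumption): UK duplicates lack a commodity, Ghana duplicates have one.
toy = pd.DataFrame({
    "Full project name": ["P1", "P1", "P2", "P2", "P3"],
    commodity_col: ["Not available", "Not available", "Gold", "Gold", None],
    "Country": ["United Kingdom", "United Kingdom", "Ghana", "Ghana",
                "Afghanistan"],
    "Year": [2021, 2021, 2019, 2019, 2018],
})
key = ["Full project name", "Country", "Year", commodity_col]

dupes = toy[toy.duplicated(key, keep=False)].copy()
dupes["commodity missing"] = dupes[commodity_col].map(commodity_missing)

# Share of duplicated rows per country without a usable commodity value.
share = dupes.groupby("Country")["commodity missing"].mean()
print(share)
```

A share close to 1 for a country suggests its duplicates stem from unreported commodities, while a share close to 0 points to genuine repeated listings.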

In [16]:
# pd.set_option('display.max_rows', None)

project_counts_per_country = duplicate_projects.groupby('Country')['Full project name'].value_counts()

duplicates_per_country = project_counts_per_country[project_counts_per_country > 1]

print("Number of duplicate project names per country:")
print(duplicates_per_country)

# pd.reset_option('display.max_rows')

Number of duplicate project names per country:
Country         Full project name                             
Afghanistan     SSML-Kabu 13/2012                                 6
                EXPL 3/2012                                       4
                SSML-Kabu 10/2016                                 4
                SSML-Kabu 19/2016                                 4
                SSML-Kabu 2/2008                                  4
                                                                 ..
United Kingdom  SPIRIT ENERGY PRODUCTION UK LIMITED (03115179)    2
                Stowe Hill Cert/Surf Rnt & Roy                    2
                VICTOR                                            2
                WINTERSHALL NOORDZEE B.V. (FC027567)              2
                Wharf, Rvr Usk nr Caldicot Level                  2
Name: count, Length: 158, dtype: int64


In [17]:
duplicate_projects.groupby('Country').size().reset_index(name='duplicates per country')

Unnamed: 0,Country,duplicates per country
0,Afghanistan,26
1,Cameroon,4
2,Cote d'Ivoire,2
3,Country,7
4,Country,9
5,Dominican Republic,2
6,Ghana,14
7,Liberia,17
8,Madagascar,9
9,Mauritania,2


In [18]:
df_part_3c[df_part_3c['Full project name'] == 'CIRQUE ENERGY (UK) LIMITED (03080778)']

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
6001,CIRQUE ENERGY (UK) LIMITED (03080778),PEDL324,"CIRQUE ENERGY (UK) LIMITED (03080778), STELINM...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6008,CIRQUE ENERGY (UK) LIMITED (03080778),PEDL348,"CIRQUE ENERGY (UK) LIMITED (03080778), STELINM...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31


In [92]:
duplicate_projects[duplicate_projects['Country']=='Afghanistan']

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
21,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
22,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
101,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2019,2018-12-21,2019-12-20
102,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2019,2018-12-21,2019-12-20
37,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
118,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
117,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
38,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
121,SSML-Kabu 13/2012,Small-scale mining license,شرکت ساختمانی فاروق استانکزی,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
42,SSML-Kabu 13/2012,Small-scale mining license,شرکت ساختمانی فاروق استانکزی,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20


In [106]:
duplicate_projects.groupby('Full project name').size().reset_index(name='duplicates per project name')

Unnamed: 0,Full project name,duplicates per project name
0,2025257,2
1,ALKANE ENERGY UK LIMITED (03128509),3
2,"ALKANE ENERGY UK LIMITED (03128509), EGDON RES...",2
3,ALPHA PETROLEUM RESOURCES LIMITED (03949599),2
4,ANASURIA HIBISCUS UK LIMITED (09696268),2
...,...,...
153,THIRD ENERGY UK GAS LIMITED (01421481),5
154,TOTALENERGIES E&P UK LIMITED (00811900),17
155,VICTOR,2
156,WINTERSHALL NOORDZEE B.V. (FC027567),2


In [122]:
duplicate_projects2 = get_duplicates(df_part_3c, 
                                    'Full project name', 
                                    *['Country', 'Year', 'Commodities (one commodity/row)', 'Affiliated companies, start with Operator', 
                                      'Legal agreement reference number(s): contract, licence, lease, concession, …',
                                     'Production (volume)', 'Status', 'Unit', 'Production (value)'])

duplicate_projects2

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
6048,2025257,Licence,Breedon Group Plc,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31
6047,2025257,Licence,Breedon Group Plc,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31
6117,"Area 106/400, 1,500,000 tpa, North Dowsing",Licence,Hanson Limited,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31
6130,"Area 106/400, 1,500,000 tpa, North Dowsing",Licence,Hanson Limited,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31
6129,"Area 106/400, 1,500,000 tpa, North Dowsing",Licence,Hanson Limited,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,"Subsoil use special permit No. 4294, dated 29....",4294,Naftogaz of Ukraine NJSC,,Production,,,,,Ukraine,UKR,2018.0,2018-01-01,2018-12-31
4800,VICTOR,Not available,Not available,Not available,Not available,Not available,,,,United Kingdom,GBR,2019.0,2019-01-01,2019-12-31
4801,VICTOR,Not available,Not available,Not available,Not available,Not available,,,,United Kingdom,GBR,2019.0,2019-01-01,2019-12-31
6116,"Wharf, Rvr Usk nr Caldicot Level",Licence,Hanson Limited,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31


In [123]:
duplicate_projects2.groupby('Country').size().reset_index(name='duplicates per country')

Unnamed: 0,Country,duplicates per country
0,Afghanistan,26
1,Cote d'Ivoire,2
2,Ghana,8
3,Liberia,17
4,Mauritania,2
5,Mexico,19
6,Ukraine,6
7,United Kingdom,189


In [124]:
duplicate_projects2['Full project name'].nunique()

84

### No duplicate company ids in the same year and country

#### Expectations
- A company ID will appear only once per year per country
- The company or agency identification numbers (IDs) within the dataset are unique for a given year and country.
- The dataset has been structured to ensure a one-to-one mapping between company or agency IDs and the entities they identify.

#### Assumptions
- 1 company ID = 1 company in a country
- Each company or agency ID is assumed to uniquely represent a specific entity for the year and country.
- Duplicate company or agency IDs within the same year and country are considered deviations from the expected dataset structure.
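The assumption "1 company ID = 1 company in a country" can also be tested directly by counting distinct names per ID. The sketch below assumes the Part 3a column names used in this notebook; the `names_per_id` helper and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

def names_per_id(df, id_col="Company ID number", name_col="Full company name"):
    """Return (Country, ID) pairs that map to more than one distinct company name."""
    counts = (
        df.dropna(subset=[id_col])
        .groupby(["Country", id_col])[name_col]
        .nunique()
        .reset_index(name="distinct_names")
    )
    return counts[counts["distinct_names"] > 1].reset_index(drop=True)

# Toy stand-in for `df_part_3a` (hypothetical rows)
toy = pd.DataFrame({
    "Country": ["Burkina Faso", "Burkina Faso", "Philippines", "Philippines"],
    "Company ID number": ["00064526S", "00064526S", "000-142-665-000", "000-142-665-000"],
    "Full company name": [
        "HOUNDE GOLD OPERATION SA", "HOUNDE EXPLORATION BF SARL",
        "Rio Tuba Nickel Mining Corporation", "Rio Tuba Nickel Mining Corporation",
    ],
})
violations = names_per_id(toy)
print(violations)
```

Unlike a plain duplicate check, this ignores rows where the same company is legitimately listed more than once (for example, once per commodity) and flags only IDs shared by different names.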

#### Results

In [19]:
duplicate_ids = get_duplicates(df_part_3a, 
                                    'Company ID number', 
                                    *['Country', 'Year'])

duplicate_ids

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
519,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Limestone,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,,Philippines,PHL,2018,2018-01-01,2018-12-31
497,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Nickel,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,864690546.00,Philippines,PHL,2018,2018-01-01,2018-12-31
878,HOUNDE GOLD OPERATION SA,Private,00064526S,Mining,Or,Indisponible,,39636048842.00,Burkina Faso,BFA,2019,1/1/2019,12/31/2019
877,HOUNDE EXPLORATION BF SARL,Private,00064526S,Mining,Or,https://www.endeavourmining.com/our-portfolio/...,,7879789.00,Burkina Faso,BFA,2019,1/1/2019,12/31/2019
872,HOUNDE EXPLORATION BF SARL,Private,00064526S,Mining,Gold,https://www.endeavourmining.com/our-portfolio/...,,13130988.00,Burkina Faso,BFA,2018,1/1/2018,12/31/2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3216,GX Technology,,,Oil & Gas,,,,,Seychelles,SYC,2018,2018-01-01,2018-12-31
3217,JOGMEX,,,Oil & Gas,,,,,Seychelles,SYC,2018,2018-01-01,2018-12-31
3242,GRAYSON BANDA,Private,,Mining,Gold,Not available,Not available,3126118498.00,Tanzania,TZA,2018,2017-07-01,2018-06-30
3256,TANZANIA PORTLAND CEMENT PUBLIC LIMITED COMPANY,Private,,Mining,"Limeston, Sandstone",Not available,Not available,24750760174.02,Tanzania,TZA,2018,2017-07-01,2018-06-30


In [20]:
# duplicate_company_ids_1 = df_part_3a[df_part_3a.duplicated(['Country', 'Year', 'Company ID number'], keep=False)]

In [132]:
duplicate_ids = get_duplicates(df_part_3a, 
                                    'Company ID number', 
                                    *['Country', 'Year', 'Full company name', 
                                      'Sector', 'Company type','Stock exchange listing or company website'])

duplicate_ids





Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
497,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Nickel,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,864690546.00,Philippines,PHL,2018,2018-01-01,2018-12-31
519,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Limestone,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,,Philippines,PHL,2018,2018-01-01,2018-12-31
173,شرکت برادران خالد عزیز,Private,1019984010,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
174,شرکت برادران خالد عزیز,Private,1019984010,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
1564,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1588,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018


In [146]:
duplicate_ids2 = get_duplicates(df_part_3a, 
                                    'Company ID number', 
                                    *['Full company name', 'Company type', 'Sector', 
                                      'Commodities (comma-seperated)', 'Stock exchange listing or company website', 
                                      'Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)',
                                      'Payments to Governments Report', 'Country', 'Year'])

duplicate_ids2

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
173,شرکت برادران خالد عزیز,Private,1019984010.0,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
174,شرکت برادران خالد عزیز,Private,1019984010.0,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
1564,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1588,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018


In [21]:
duplicate_ids.groupby('Company ID number').size().reset_index(name='duplicates')

Unnamed: 0,Company ID number,duplicates
0,000-142-665-000,2
1,00064526S,4
2,1007815085,4
3,1008627083,3
4,1009592088,4
5,1010360087,4
6,1013655012,5
7,1016525014,4
8,1019984010,2
9,103947189,2


In [22]:
# duplicate_id_names = duplicate_ids['Company ID number'].unique()

# print("Number of duplicate IDs: {}\n".format(len(duplicate_id_names.tolist())))

# duplicate_id_counts = duplicate_ids['Company ID number'].value_counts()
# # print(duplicate_project_counts)

# print("Number of duplicate IDS")
# for id, count in duplicate_id_counts.items():
#     if count > 1:
#         duplicates = duplicate_ids[duplicate_ids['Company ID number'] == id]
#         print(f"Name: {id}, Count: {count}")

In [23]:
duplicate_ids.groupby('Country').size().reset_index(name='duplicates')

Unnamed: 0,Country,duplicates
0,Afghanistan,55
1,Armenia,4
2,Burkina Faso,4
3,Chad,48
4,Cote d'Ivoire,6
5,Democratic Republic of Congo,4
6,Guatemala,11
7,Iraq,141
8,Mauritania,5
9,Mexico,10


**Let's check the data for Afghanistan**

In [24]:
duplicate_ids[duplicate_ids['Country'] == 'Afghanistan'].sort_values(by=['Company ID number'])

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
356,د امین کریمزی د شوکانو او مرمرو داستخراج اوپرو...,Private,1007815085,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2019,2018-12-21,2019-12-20
234,Ameen Karimzai Talc and Marble Extraction and ...,Private,1007815085,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2019,2018-12-21,2019-12-20
116,د امین کریمزی د شوکانو او مرمرو داستخراج اوپرو...,Private,1007815085,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
0,Amin Karimzai Campany,Private,1007815085,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
272,Mati Sami Limited,Private,1008627083,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2019,2018-12-21,2019-12-20
230,Ahmad Shah Talc Extraction and Processing,Private,1008627083,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2019,2018-12-21,2019-12-20
229,Ahmad shah Maidanwall Co,Private,1008627083,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2019,2018-12-21,2019-12-20
107,بلال موسی زی دمعدنی ډبرو استخراج او پروسس شرکت...,Private,1009592088,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
18,Bilal Musazai Company Limited,Private,1009592088,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
347,بلال موسی زی دمعدنی ډبرو استخراج او پروسس شرکت...,Private,1009592088,Other,,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2019,2018-12-21,2019-12-20


##### Now let's look at government entities

In [25]:
duplicate_ids_agency = get_duplicates(df_part_3b, 
                                    'ID number (if applicable)', 
                                    *['Country', 'Year'])

duplicate_ids_agency

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
424,Norwegian Central Bank,Central goverment,937884117,-,Norway,NOR,2019.0,1/1/2019,12/31/2019
425,Petoro,Central goverment,937884117,96478000000.00,Norway,NOR,2019.0,1/1/2019,12/31/2019
265,Superintendencia de Administración Tributaria ...,Central government,<Use Legal Entity Identifier if available>,149007097.60,Guatemala,GTM,2017.0,2017-01-01,2017-12-01
266,Municipalidades,Local government,<Use Legal Entity Identifier if available>,63301988.42,Guatemala,GTM,2017.0,2017-01-01,2017-12-01
262,Ministerio de Energia y Minas (MEM),Central government,<Use Legal Entity Identifier if available>,228058363.73,Guatemala,GTM,2017.0,2017-01-01,2017-12-01
...,...,...,...,...,...,...,...,...,...
544,Ministry of Mines and Minerals Development,Central goverment,,48229839.93,Zambia,ZMB,2018.0,2018-01-01,2018-12-31
545,Environmental Protection Fund,Other,,23330601.78,Zambia,ZMB,2018.0,2018-01-01,2018-12-31
546,Ministry of Lands,Central goverment,,1756688.47,Zambia,ZMB,2018.0,2018-01-01,2018-12-31
547,IDC,Other,,69205641.67,Zambia,ZMB,2018.0,2018-01-01,2018-12-31


In [26]:
duplicate_ids_agency.groupby('Country').size().reset_index(name='duplicates')

Unnamed: 0,Country,duplicates
0,Afghanistan,10
1,Albania,22
2,Argentina,4
3,Armenia,6
4,Burkina Faso,36
5,Cameroon,8
6,Chad,21
7,Cote d'Ivoire,18
8,Democratic Republic of Congo,10
9,Ethiopia,3


In [27]:
duplicate_ids_agency.groupby('ID number (if applicable)').size().reset_index(name='duplicates')

Unnamed: 0,ID number (if applicable),duplicates
0,937884117,2
1,<Use Legal Entity Identifier if available>,5
2,Mining,3
3,No applicable,7
4,Non applicable,45
5,Not applicable,59
6,Not available,9
7,Oil & Gas,6


In [28]:
duplicate_ids_agency[duplicate_ids_agency['Country'] == 'Afghanistan'].sort_values(by=['ID number (if applicable)'])

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
9,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2019.0,2018-12-21,2019-12-20
8,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2019.0,2018-12-21,2019-12-20
7,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2019.0,2018-12-21,2019-12-20
6,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2019.0,2018-12-21,2019-12-20
5,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2019.0,2018-12-21,2019-12-20
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24
3,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21


#### Implications
- There is a significant number of duplicate company ID values.
- Most are due to the information not being available or not existing in the dataset (empty cells or placeholders such as "Not applicable" and "Not available").
- Several others are due to the same IDs appearing multiple times with different names in the name field.
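Because placeholders dominate the duplicate IDs, it may help to filter them out before looking for real collisions. The sketch below uses only the standard library; `PLACEHOLDER_IDS` and `is_placeholder_id` are hypothetical names, and the spellings are assumptions based on values seen in the ID columns above.

```python
# Assumed placeholder spellings (lower-cased); extend as new variants turn up.
PLACEHOLDER_IDS = {
    "", "n/a", "nc", "not applicable", "non applicable", "no applicable",
    "not available", "<use legal entity identifier if available>",
}

def is_placeholder_id(value):
    """Return True when an ID cell holds no real identifier."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is the only float not equal to itself
        return True
    return str(value).strip().lower() in PLACEHOLDER_IDS

samples = ["937884117", "Not applicable", " Non applicable ", None, float("nan"), "00064526S"]
real_ids = [s for s in samples if not is_placeholder_id(s)]
print(real_ids)
```

With pandas, the same function could be applied through `Series.map` to build a mask and drop placeholder rows before the duplicate check. Values such as `Mining` or `Oil & Gas` in the agency ID column look like misplaced sector values and would need separate handling.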

### No duplicate company-project-commodity combination in the same year and country

#### Expectations
- We anticipate that within the provided dataset, each combination of a company and project in a given year and country should be unique.
- This expectation is grounded in the assumption that projects are distinctly identified by the combination of the involved company, the project name, and the associated year and country.

#### Assumptions
- Projects are uniquely identified by the combination of the company, project name, year, and country.
- The dataset has been curated to prevent duplicate entries for the same project within the same year and country.

#### Results

In [29]:
company_project_duplicates = df_part_3c.duplicated(subset=['Affiliated companies, start with Operator', 'Full project name', 'Year', 'Country'], keep=False)

duplicate_rows = df_part_3c[company_project_duplicates]
duplicate_rows

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
21,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
22,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
37,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
38,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
41,SSML-Kabu 13/2012,Small-scale mining license,شرکت ساختمانی فاروق استانکزی,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6254,"Area 461, 1,000,000 tpa, Median Deep",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6255,Kansanshi Mine,7057 HQ LML,Kansanshi Mining Plc,Copper (2603),Production,250860,Tonnes,1547.48,USD,Zambia,ZMB,2017,2017-01-01,2017-12-31
6256,Kansanshi Mine,17019-HQ-LEL,Kansanshi Mining Plc,Gold (7108),Production,4.565,Tonnes,183.31,USD,Zambia,ZMB,2017,2017-01-01,2017-12-31
6268,Kansanshi Mine,7057 HQ LML,KANSANSHI MINING PLC,Copper (2603),Production,251517.19,Tonnes,1640609303.36,USD,Zambia,ZMB,2018,2018-01-01,2018-12-31


In [30]:
print("Duplicates per country:")
dup_country = count_duplicates(df_part_3c, company_project_duplicates, 'Country', 'duplicates_per_country')
dup_country

Duplicates per country:


Unnamed: 0,Country,duplicates_per_country
0,Afghanistan,26
1,Albania,16
2,Argentina,15
3,Armenia,9
4,Burkina Faso,30
5,Cameroon,4
6,Cote d'Ivoire,10
7,Country,7
8,Country,9
9,Dominican Republic,29


In [31]:
print("Duplicates per year:")
dup_year = count_duplicates(df_part_3c, company_project_duplicates, 'Year', 'duplicates_per_year')
dup_year

Duplicates per year:


Unnamed: 0,Year,duplicates_per_year
0,2017,41
1,2018,651
2,2019,526
3,2020,725
4,2021,439
5,Year,16


Next, let's look at duplicates that share the same project name, affiliated companies, country, year, and commodity.

In [32]:
company_project_comm_duplicates = df_part_3c.duplicated(subset=['Affiliated companies, start with Operator', 'Full project name', 'Year', 'Country', 'Commodities (one commodity/row)'], keep=False)

duplicate_rows_2 = df_part_3c[company_project_comm_duplicates]
duplicate_rows_2

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
21,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
22,EXPL 3/2012,Mining Exploration License,شرکت برادران خالد عزیز,,,Not applicable,Not applicable,Not applicable,Not applicable,Afghanistan,AFG,2018,2017-12-21,2018-12-20
37,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
38,SSML-Kabu 10/2016,Small-scale mining license,Hakim Jan,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
41,SSML-Kabu 13/2012,Small-scale mining license,شرکت ساختمانی فاروق استانکزی,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6250,"Area 340, 500,000 tpa, Nab",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6251,"Area 351, 500,000 tpa, SE Isle of Wight",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6252,"Area 351, 500,000 tpa, SE Isle of Wight",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6253,"Area 461, 1,000,000 tpa, Median Deep",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31


In [33]:
print("Duplicates per country with the same project name, affiliated companies, year, and commodity:")
dup_country_aff = count_duplicates(df_part_3c, company_project_comm_duplicates, 'Country', 'duplicates_per_country')
dup_country_aff

Duplicates per country with the same project name, affiliated companies, year, and commodity:


Unnamed: 0,Country,duplicates_per_country
0,Afghanistan,26
1,Cameroon,4
2,Cote d'Ivoire,2
3,Country,7
4,Country,9
5,Dominican Republic,2
6,Ghana,14
7,Liberia,17
8,Madagascar,9
9,Mauritania,2


In [34]:
duplicate_rows_2[duplicate_rows_2['Country'] == 'Mexico'].sort_values(by=['Full project name'])

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
1069,CNH-M5- MIQUETLA/2018,CNH-M5- MIQUETLA/2018,Pemex Exploración y Producción,Crude oil (2709),Production,Not available,,,,Mexico,MEX,2018,1/1/2018,12/31/2018
1059,CNH-M5- MIQUETLA/2018,CNH-M5- MIQUETLA/2018,Pemex Exploración y Producción,Crude oil (2709),Production,Not available,,,,Mexico,MEX,2018,1/1/2018,12/31/2018
1057,CNH-R02-L02- A10.CS/2017,CNH-R02-L02- A10.CS/2017,Pantera,Crude oil (2709),Production,Not available,,,,Mexico,MEX,2018,1/1/2018,12/31/2018
1066,CNH-R02-L02- A10.CS/2017,CNH-R02-L02- A10.CS/2017,Pantera,Crude oil (2709),Production,Not available,,,,Mexico,MEX,2018,1/1/2018,12/31/2018
1055,CNH-R02-L02- A4.BG/2017,CNH-R02-L02- A4.BG/2017,Pantera,Crude oil (2709),Production,Not available,,,,Mexico,MEX,2018,1/1/2018,12/31/2018
1064,CNH-R02-L02- A4.BG/2017,CNH-R02-L02- A4.BG/2017,Pantera,Crude oil (2709),Production,Not available,,,,Mexico,MEX,2018,1/1/2018,12/31/2018
1056,CNH-R02-L02- A5.BG/2017,CNH-R02-L02- A5.BG/2017,Pantera,Crude oil (2709),Production,Not available,,,,Mexico,MEX,2018,1/1/2018,12/31/2018
1065,CNH-R02-L02- A5.BG/2017,CNH-R02-L02- A5.BG/2017,Pantera,Crude oil (2709),Production,Not available,,,,Mexico,MEX,2018,1/1/2018,12/31/2018
1053,CNH-R02-L03-CS01/2017,CNH-R02-L03-CS01/2017,Jaguar Exploración Y Producción De Hidrocarburos,Crude oil (2709),Production,Not available,,,,Mexico,MEX,2018,1/1/2018,12/31/2018
1061,CNH-R02-L03-CS01/2017,CNH-R02-L03-CS01/2017,Jaguar Exploración Y Producción De Hidrocarburos,Crude oil (2709),Production,Not available,,,,Mexico,MEX,2018,1/1/2018,12/31/2018


In [35]:
print("Duplicates per year with the same project name, affiliated companies, country, and commodity:")
dup_year_aff = count_duplicates(df_part_3c, company_project_comm_duplicates, 'Year', 'duplicates_per_year')
dup_year_aff

Duplicates per year with the same project name, affiliated companies, country, and commodity:


Unnamed: 0,Year,duplicates_per_year
0,2017,10
1,2018,82
2,2019,27
3,2020,18
4,2021,439
5,Year,16


In [36]:
duplicate_rows_2[duplicate_rows_2['Year'] == '2020'].sort_values(by=['Full project name'])

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
1187,Altan tsagaan ovoo,T/19-11-11,Steppe Gold,Not applicable,Production,426.44,,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1186,Altan tsagaan ovoo,T/19-11-11,Steppe Gold,Not applicable,Production,987.02,,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1213,Baruun noyon uul,Т/19-08-02,Tsagaan uvuljuu,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1214,Baruun noyon uul,Т/20-12-08,Tsagaan uvuljuu,Not applicable,Production,160627.0,,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1103,Feasibility study for open pit mining and heap...,Т/18-12-15,Bayan airag exploration,Not applicable,Production,3147.0,Kg,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1104,Feasibility study for open pit mining and heap...,Т/18-12-15,Bayan airag exploration,Not applicable,Production,1233.0,Kg,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1217,MV-000296,ХХ-11-12,Tsarig Skhonkhor,Not applicable,Production,1.05,,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1218,MV-000296,ХХ-07-11,Tsarig Skhonkhor,Not applicable,Production,9.1,,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1209,"Nalaikh mine's eastern wing, usage of mud mining",Д/03-06,Khualyan,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1210,"Nalaikh mine's eastern wing, usage of mud mining",Т/18-12-01,Khualyan,Not applicable,Production,661863.0,,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020


In [37]:
print("Duplicates per commodity:")
dup_comm_aff = count_duplicates(df_part_3c, company_project_comm_duplicates, 'Commodities (one commodity/row)', 'duplicates_per_commodity')
dup_comm_aff

Duplicates per commodity:


Unnamed: 0,Commodities (one commodity/row),duplicates_per_commodity
0,"Coal, Technical water",6
1,Commodities (one commodity/row),16
2,Condensate,2
3,Crude oil (2709),23
4,Gold (7108),4
5,Limestone (2521),4
6,Natural gas (2711),4
7,Nickel (2604),2
8,Non applicable,2
9,Not applicable,29


#### Implications
- There is a significant number of duplicates (2416) that have the same Full project name and Affiliated companies.
- Even when the commodity reported for the project is taken into account, a significant number of duplicates (610) still share the same Full project name, Affiliated companies, and commodity.
- Most of these duplicates (464 of 610) may be attributed to the commodity not being part of the information provided.
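The `Country` and `Year` rows in the counts above suggest that header rows from the source sheets leaked into `df_part_3c`, and the two separate `Country` groups hint at a whitespace variant. That reading is an assumption. If it holds, a cleaning step along these lines could run before the duplicate checks; `drop_leaked_headers` is a hypothetical helper shown on toy data.

```python
import pandas as pd

def drop_leaked_headers(df):
    """Strip whitespace from string cells and drop rows that repeat the column names."""
    cleaned = df.copy()
    for col in cleaned.columns:
        # Non-string cells (numbers, NaN) pass through unchanged.
        cleaned[col] = cleaned[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    # A leaked header row has cells equal to their own column names.
    is_header = cleaned["Country"].eq("Country") & cleaned["Year"].astype(str).eq("Year")
    return cleaned[~is_header].reset_index(drop=True)

# Toy stand-in for `df_part_3c` (hypothetical rows)
toy = pd.DataFrame({
    "Country": ["Zambia", "Country", "Country ", " Zambia"],
    "Year": ["2017", "Year", "Year", "2018"],
})
cleaned = drop_leaked_headers(toy)
print(cleaned)
```

Removing such rows first keeps them from inflating the duplicate totals reported per country and per year.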

### Consistency

### All companies or agencies in company data (Part 5) exist in the Reporting companies' list (Part 3a) or the Reporting government entities list (Part 3b)
#### Expectations
- We expect that every company or agency mentioned in the company dataset should have a corresponding entry in either the reporting companies dataset or the reporting government entities dataset.
- This is based on the assumption that the company data are associated with existing companies and/or government entities and any discrepancies could indicate data inconsistencies or missing information.

#### Assumptions
- The company and agency datasets are comprehensive and contain information about all relevant companies and government entities.
- The naming conventions for companies and agencies in the company report match those in the company and government entities tab/dataset.
- Each entry in the company report corresponds to a valid and existing company or agency.
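The naming-convention assumption is the most fragile of the three, since an exact string comparison treats "Royal Dutch Shell PLC" and "Royal Dutch Shell plc" as different entities. The sketch below shows one possible way to compare names more tolerantly using only the standard library. `normalize_name`, `closest_match`, and the `cutoff` of 0.9 are illustrative assumptions, not part of the notebook, and fuzzy matches should be reviewed by hand before being accepted.

```python
import difflib
import unicodedata

def normalize_name(name):
    """Apply Unicode NFKC, case-fold, and collapse whitespace so trivial variants compare equal."""
    text = unicodedata.normalize("NFKC", str(name)).casefold()
    return " ".join(text.split())

def closest_match(name, candidates, cutoff=0.9):
    """Return the candidate most similar to `name` after normalization, or None."""
    lookup = {normalize_name(c): c for c in candidates}
    hits = difflib.get_close_matches(normalize_name(name), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

# Names as they might appear in the Reporting companies' list
reporting = ["Royal Dutch Shell plc", "Afghan Shinink, Mines Extraction and Processing"]

print(closest_match("Royal Dutch Shell PLC", reporting))
print(closest_match("Afghan Shiinink, Mines Extraction and Processing", reporting))
print(closest_match("Abid Hassan Zadran Limited", reporting))
```

Normalization handles case and whitespace differences, while the similarity cutoff catches single-character spelling mistakes; names with no close counterpart still come back as missing.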

#### Results

In [38]:
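def count_missing(main_df, missing_df, field, name):
    # `missing_df` is a boolean mask over `main_df`; count the selected rows per
    # value of `field` and store the counts in a column called `name`.
    return main_df[missing_df].groupby(field).size().reset_index(name=name)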

In [39]:
# Boolean masks: which Part 5 names appear in Part 3a / Part 3b, and vice versa
all_companies = df_part_5['Company'].isin(df_part_3a['Full company name'])
all_companies_reverse = df_part_3a['Full company name'].isin(df_part_5['Company'])
all_agencies = df_part_5['Company'].isin(df_part_3b['Full name of agency'])
all_agencies_reverse = df_part_3b['Full name of agency'].isin(df_part_5['Company'])

# A Part 5 name is accounted for if it appears in either list
all_names = all_companies | all_agencies

In [40]:
# Check whether all companies in the Company data are in the Reporting companies list, and vice versa
print(f'All company names in the Company data are present in the Reporting companies list: {all_companies.all()}')
print(f'All company names in the Reporting companies list are present in the Company data: {all_companies_reverse.all()}')

All company names in the Company data are present in the Reporting companies list: False
All company names in the Reporting companies list are present in the Company data: False


In [41]:
# Get the rows of the Company data whose company is not in the Reporting companies list
missing_companies = df_part_5[~all_companies].reset_index(drop=True)
missing_companies

Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
0,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,1192667.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
1,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Fixed Tax on Exports,No,No,,AFN,143121.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
2,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Other Fee on Exports,No,No,,AFN,25954.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
3,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Penalty on Exports,No,No,,AFN,6350.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
4,"Afghan Shiinink, Mines Extraction and Processing",Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,35222.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1910,Royal Dutch Shell plc,Oil & Gas Authority (OGA),Oil & Gas Authority (OGA) Levy,Yes,Yes,P84,GBP,99220.0,Not applicable,Not applicable,Not applicable,,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31
1911,Royal Dutch Shell plc,Oil & Gas Authority (OGA),Oil & Gas Authority (OGA) Levy,Yes,Yes,P88,GBP,99220.0,Not applicable,Not applicable,Not applicable,,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31
1912,Royal Dutch Shell plc,Oil & Gas Authority (OGA),Oil & Gas Authority (OGA) Levy,Yes,Yes,P886,GBP,99220.0,Not applicable,Not applicable,Not applicable,,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31
1913,Royal Dutch Shell plc,Oil & Gas Authority (OGA),Oil & Gas Authority (OGA) Levy,Yes,Yes,P96,GBP,99220.0,Not applicable,Not applicable,Not applicable,,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31


In [42]:
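# Get the rows of the Reporting companies list whose company does not appear in the Company data
# NOTE: all_companies_reverse was computed on df_part_3a, so it is df_part_3a that this mask should filter;
# applying it to df_part_3b misaligns the mask, and the output below lists agencies rather than companies
missing_companies_reverse = df_part_3b[~all_companies_reverse].reset_index(drop=True)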
missing_companies_reverse



Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
0,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2019.0,2018-12-21,2019-12-20
1,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2019.0,2018-12-21,2019-12-20
2,Société Nationale d’Investissement du Cameroun...,State-owned enterprises & public corporations,,539200395.00,Cameroon,CMR,2017.0,1/1/2017,12/31/2017
3,Direction Générale des Hydrocarbures (DGH),Central goverment,Non applicable,2674420060.00,Cote d'Ivoire,CIV,2018.0,2018-01-01,2018-12-31
4,Localité de Bondoukou,Local government,Non applicable,19449399.00,Cote d'Ivoire,CIV,2018.0,2018-01-01,2018-12-31
5,Ghana National Petroleum Company,Central goverment,,#ERROR!,Ghana,GHA,2017.0,2017-01-01,2017-12-31
6,Other,Other,,51191621914.04,Mongolia,MNG,2019.0,1/1/2019,12/31/2019
7,Internal Revenue Department (IRD),Central goverment,Not applicable,353671807113.44,Myanmar,MMR,2018.0,4/1/2017,3/31/2018
8,Department of Trading (DoT),Central goverment,Not applicable,4559476231.00,Myanmar,MMR,2018.0,4/1/2017,3/31/2018
9,Direction Générale des Douanes et des Droits I...,Central government,,Voir partie 4,Republic of the Congo,COG,2017.0,2017-01-01,2017-12-31


In [43]:
# Count number of instances the missing companies appear in the Company data per year
missing_companies_count_pt = missing_companies.pivot_table(index=['Company', 'Country'], columns='Year', values='ISO Code', aggfunc='count', fill_value=0).reset_index()
missing_companies_count_pt

Year,Company,Country,2017.0,2018.0,2019.0,2020.0,2021.0
0,Abid Hassan Zadran Limited,Afghanistan,0,4,4,0,0
1,"Add new rows as necessary, right click the row...",Argentina,0,1,0,0,0
2,"Afghan Shiinink, Mines Extraction and Processing",Afghanistan,0,4,0,0,0
3,"Afghan Shinink, Mines Extraction and Processing",Afghanistan,0,4,0,0,0
4,Afghan Talc Limited Joint Venture,Afghanistan,0,13,0,0,0
...,...,...,...,...,...,...,...
81,"محمد عیسی ولد آغا گل, محمد عیسی",Afghanistan,0,0,2,0,0
82,"محمد یونس ولد محمد عیسی, محمد یونس",Afghanistan,0,3,1,0,0
83,نصب ومنتاژیک دستگاه ریگریشن محمد سمیع Mohammad...,Afghanistan,0,1,0,0,0
84,"واثق ولد ملنگ خان, واثق",Afghanistan,0,0,2,0,0


In [44]:
missing_companies_count_grp = missing_companies.groupby(['Company', 'Country', 'Year'])['Company'].count().reset_index(name='Number of Instances')
missing_companies_count_grp

Unnamed: 0,Company,Country,Year,Number of Instances
0,Abid Hassan Zadran Limited,Afghanistan,2018.0,4
1,Abid Hassan Zadran Limited,Afghanistan,2019.0,4
2,"Add new rows as necessary, right click the row...",Argentina,2018.0,1
3,"Afghan Shiinink, Mines Extraction and Processing",Afghanistan,2018.0,4
4,"Afghan Shinink, Mines Extraction and Processing",Afghanistan,2018.0,4
...,...,...,...,...
94,"محمد یونس ولد محمد عیسی, محمد یونس",Afghanistan,2018.0,3
95,"محمد یونس ولد محمد عیسی, محمد یونس",Afghanistan,2019.0,1
96,نصب ومنتاژیک دستگاه ریگریشن محمد سمیع Mohammad...,Afghanistan,2018.0,1
97,"واثق ولد ملنگ خان, واثق",Afghanistan,2019.0,2


In [45]:
# Check the number of instances the missing companies appear in the Company data per year for Afghanistan
missing_companies_count_pt[missing_companies_count_pt['Country'] == 'Afghanistan']

Year,Company,Country,2017.0,2018.0,2019.0,2020.0,2021.0
0,Abid Hassan Zadran Limited,Afghanistan,0,4,4,0,0
2,"Afghan Shiinink, Mines Extraction and Processing",Afghanistan,0,4,0,0,0
3,"Afghan Shinink, Mines Extraction and Processing",Afghanistan,0,4,0,0,0
4,Afghan Talc Limited Joint Venture,Afghanistan,0,13,0,0,0
5,Amania Mining,Afghanistan,0,0,804,0,0
...,...,...,...,...,...,...,...
81,"محمد عیسی ولد آغا گل, محمد عیسی",Afghanistan,0,0,2,0,0
82,"محمد یونس ولد محمد عیسی, محمد یونس",Afghanistan,0,3,1,0,0
83,نصب ومنتاژیک دستگاه ریگریشن محمد سمیع Mohammad...,Afghanistan,0,1,0,0,0
84,"واثق ولد ملنگ خان, واثق",Afghanistan,0,0,2,0,0


In [46]:
missing_companies_count_grp[missing_companies_count_grp['Country'] == 'Afghanistan']

Unnamed: 0,Company,Country,Year,Number of Instances
0,Abid Hassan Zadran Limited,Afghanistan,2018.0,4
1,Abid Hassan Zadran Limited,Afghanistan,2019.0,4
3,"Afghan Shiinink, Mines Extraction and Processing",Afghanistan,2018.0,4
4,"Afghan Shinink, Mines Extraction and Processing",Afghanistan,2018.0,4
5,Afghan Talc Limited Joint Venture,Afghanistan,2018.0,13
...,...,...,...,...
94,"محمد یونس ولد محمد عیسی, محمد یونس",Afghanistan,2018.0,3
95,"محمد یونس ولد محمد عیسی, محمد یونس",Afghanistan,2019.0,1
96,نصب ومنتاژیک دستگاه ریگریشن محمد سمیع Mohammad...,Afghanistan,2018.0,1
97,"واثق ولد ملنگ خان, واثق",Afghanistan,2019.0,2


In [47]:
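# Check the number of instances the missing companies appear in the Company data per year for Afghanistan in 2019
# NOTE: Year is stored as a float (2019.0), so comparing against the string '2019' matches nothing and the result
# is empty; this filter also does not restrict the rows to Afghanistan
missing_companies_count_grp[missing_companies_count_grp['Year'] == '2019']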

Unnamed: 0,Company,Country,Year,Number of Instances
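
The empty result above is consistent with comparing a float `Year` column to a string. The sketch below uses a toy frame with the same column names to contrast the string comparison with a numeric one combined with a country filter. The data values are illustrative only.

```python
import pandas as pd

# Toy stand-in for missing_companies_count_grp; Year is float, as in the outputs above
counts = pd.DataFrame({
    'Company': ['A', 'A', 'B'],
    'Country': ['Afghanistan', 'Afghanistan', 'Albania'],
    'Year': [2018.0, 2019.0, 2019.0],
    'Number of Instances': [4, 4, 1],
})

# String vs float: no row compares equal, so the selection is empty
empty = counts[counts['Year'] == '2019']

# Numeric comparison plus the country condition gives the intended rows
afg_2019 = counts[(counts['Country'] == 'Afghanistan') & (counts['Year'] == 2019)]
print(len(empty), afg_2019['Company'].tolist())
```

The same consideration applies to the later cells that filter on `"2017"`.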


In [48]:
# Missing company instances per country
missing_companies.groupby('Country').size().reset_index(name='Instances per country')

Unnamed: 0,Country,Instances per country
0,Afghanistan,1745
1,Albania,23
2,Argentina,1
3,Cameroon,1
4,Cote d'Ivoire,7
5,Germany,14
6,Mongolia,2
7,United Kingdom,92


In [49]:
# Missing company instances per year
missing_companies.groupby('Year').size().reset_index(name='Instances per year')

Unnamed: 0,Year,Instances per year
0,2017.0,23
1,2018.0,148
2,2019.0,1609
3,2020.0,1
4,2021.0,92


In [50]:
# Missing company instances per country and year
print("Missing company instances per country and year")
missing_companies.pivot_table(index=['Country'], columns='Year', values='Company', aggfunc='count', fill_value=0).reset_index()

Missing company instances per country and year


Year,Country,2017.0,2018.0,2019.0,2020.0,2021.0
0,Afghanistan,0,129,1601,0,0
1,Albania,11,12,0,0,0
2,Argentina,0,1,0,0,0
3,Cameroon,1,0,0,0,0
4,Cote d'Ivoire,7,0,0,0,0
5,Germany,4,4,5,1,0
6,Mongolia,0,2,0,0,0
7,United Kingdom,0,0,0,0,92


In [51]:
# Now let's look at unique companies
missing_companies_unique = missing_companies.drop_duplicates(subset=['Company', 'Country']).reset_index(drop=True)
missing_companies_unique

Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
0,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,1192667.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
1,"Afghan Shiinink, Mines Extraction and Processing",Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,35222.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
2,"Afghan Shinink, Mines Extraction and Processing",Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,61336.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
3,Afghan Talc Limited Joint Venture,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,316251.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
4,Arif Shahaab Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,1085044.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,Any additional information that is not eligibl...,,,,,,,,,,,,,,,,
90,Unilateral Disclosure for Other Minerals sector,,,,MCD,Customs Duties,604935287,,,,,,,,,,
91,Unilateral Disclosure for Gems and Jade sector,,,,MCD,Customs Duties,424622051,,,,,,,,,,
92,Social contributions - not received by governm...,,,,,,,,,,,,,,,,


In [52]:
# Count number of instances the unique missing companies appear in the Company data per year
# (computed from missing_companies, so this is the same table as missing_companies_count_grp)
print("Number of instances the unique missing companies appear in the Company data per year")
missing_companies_unique_count_grp = missing_companies.groupby(['Company', 'Country', 'Year'])['Company'].count().reset_index(name='Number of Instances')
missing_companies_unique_count_grp

Number of instances the unique missing companies appear in the Company data per year


Unnamed: 0,Company,Country,Year,Number of Instances
0,Abid Hassan Zadran Limited,Afghanistan,2018.0,4
1,Abid Hassan Zadran Limited,Afghanistan,2019.0,4
2,"Add new rows as necessary, right click the row...",Argentina,2018.0,1
3,"Afghan Shiinink, Mines Extraction and Processing",Afghanistan,2018.0,4
4,"Afghan Shinink, Mines Extraction and Processing",Afghanistan,2018.0,4
...,...,...,...,...
94,"محمد یونس ولد محمد عیسی, محمد یونس",Afghanistan,2018.0,3
95,"محمد یونس ولد محمد عیسی, محمد یونس",Afghanistan,2019.0,1
96,نصب ومنتاژیک دستگاه ریگریشن محمد سمیع Mohammad...,Afghanistan,2018.0,1
97,"واثق ولد ملنگ خان, واثق",Afghanistan,2019.0,2


In [53]:
# Number of unique companies per country that appear in the Company data but not in the Reporting companies list 
missing_companies_unique_count_grp.pivot_table(index=['Country'], values='Company', aggfunc='nunique', fill_value=0, dropna=False)

Unnamed: 0_level_0,Company
Country,Unnamed: 1_level_1
Afghanistan,72
Albania,5
Argentina,1
Cameroon,1
Cote d'Ivoire,3
Germany,2
Mongolia,1
United Kingdom,1


In [54]:
missing_companies[missing_companies['Country'] == "Cote d'Ivoire"]['Company'].nunique(dropna=False)

3

In [55]:
missing_companies_unique_count_grp[missing_companies_unique_count_grp['Country'] == "Cote d'Ivoire"]

Unnamed: 0,Company,Country,Year,Number of Instances
12,Autres acheteurs,Cote d'Ivoire,2017.0,1
13,Autres sociétés non incluses dans le périmètre...,Cote d'Ivoire,2017.0,2
30,LA MANCHA CI (France),Cote d'Ivoire,2017.0,4


In [56]:
# Number of unique companies per year that appear in the Company data but not in the Reporting companies list 
missing_companies_unique_count_grp.pivot_table(index=['Year'], values='Company', aggfunc='nunique', fill_value=0, dropna=False)

Unnamed: 0_level_0,Company
Year,Unnamed: 1_level_1
2017.0,9
2018.0,33
2019.0,55
2020.0,1
2021.0,1


In [57]:
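# NOTE: Year is stored as a float (2017.0), so comparing against the string "2017" matches nothing; compare against 2017 instead
missing_companies[missing_companies['Year'] == "2017"]['Company'].nunique(dropna=False)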

0

In [58]:
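# NOTE: as in the previous cell, the string "2017" never equals the float Year values, so the result is empty
missing_companies_unique_count_grp[missing_companies_unique_count_grp['Year'] == "2017"]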

Unnamed: 0,Company,Country,Year,Number of Instances


### Consistent company name-company ID pair over time

#### Expectations
- We expect that the pairing of company names with their corresponding company IDs remains consistent over time.
- This expectation is founded on the assumption that company names and IDs serve as stable and reliable identifiers for companies, ensuring their continuity and uniqueness over time.

#### Assumptions
- Each company name is assumed to be consistently associated with a specific company ID across different years.
- Company IDs are assumed to be unique and not reassigned to different companies over time.
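
One way to test these assumptions while ignoring purely cosmetic differences is to clean the IDs first and then count distinct IDs per company. The sketch below is an assumption-laden illustration: the helper `clean_id` and the `PLACEHOLDERS` set are hypothetical, and the sample rows are a small subset modeled on values that appear in the results of this section.

```python
import pandas as pd

# Hypothetical set of placeholder strings to treat as "no ID"
PLACEHOLDERS = {'not available', 'not communicated', 'not applicable', 'n/a', ''}

def clean_id(ids: pd.Series) -> pd.Series:
    """Remove all whitespace from IDs and blank out placeholder values."""
    as_text = ids.astype('string')
    is_placeholder = as_text.str.strip().str.casefold().isin(PLACEHOLDERS)
    return as_text.str.replace(r'\s+', '', regex=True).mask(is_placeholder)

companies = pd.DataFrame({
    'Full company name': ['AGBAOU GOLD OPERATIONS'] * 2 + ['CapeOmega AS'] * 2 + ['CGCOC Group'] * 2,
    'Country': ["Cote d'Ivoire"] * 2 + ['Norway'] * 2 + ['Chad'] * 2,
    'Year': [2017, 2018, 2018, 2019, 2017, 2018],
    'Company ID number': ['1273929 F', '1273929F', '913776712', '995152142',
                          'Not communicated', 'Not available'],
})

companies['clean_id'] = clean_id(companies['Company ID number'])

# nunique ignores missing values, so placeholder-only companies count as 0 distinct IDs
ids_per_company = companies.groupby(['Full company name', 'Country'])['clean_id'].nunique()
changed = ids_per_company[ids_per_company > 1].index.get_level_values('Full company name').tolist()
print(changed)
```

With this cleaning, a pair such as `1273929 F` and `1273929F` counts as one ID, and placeholder-only rows are not reported as changes, leaving only genuine ID changes to review.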
  
#### Results


In [59]:
# List company names per country with the ID reported for each year (0 where no ID is listed for that year)

pt_name_id_over_time = df_part_3a.pivot_table(index=['Full company name','Country'], columns='Year', values='Company ID number', aggfunc='first', fill_value=0)

In [60]:
print("List company names per country with the ids they have for the year")
pt_name_id_over_time

List company names per country with the ids they have for the year


Unnamed: 0_level_0,Year,2017,2018,2019,2020,2021
Full company name,Country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"""BEAT GENERATION"" SHPK",Albania,0,L42423012I,0,0,0
"""Balkan Green Energy"" sh.p.k( ish ESEGEI)",Albania,0,K71624026M,0,0,0
"""Chaarat Kapan"" CJSC",Armenia,0,9416902,9416902,0,0
"""D & A"" Sh.P.K",Albania,0,K11829502V,0,0,0
"""DITEKO"" sh.p.k",Albania,0,K92108022E,0,0,0
...,...,...,...,...,...,...
“Teghout” CJSC,Armenia,0,2700773,2700773,0,0
“Vardani Zartonk” LLC,Armenia,0,9414399,9414399,0,0
“Vayk Gold” LLC,Armenia,0,114369,114369,0,0
“Zangezur Copper-Molybdenum Combine” CJSC,Armenia,0,9400818,9400818,0,0


In [61]:
# Define a function to check if all non-zero values are the same in a row
def check_same_nonzero(row):
    nonzero_values = row[row != 0]
    return len(nonzero_values) == 0 or len(set(nonzero_values)) == 1

# Apply the function row-wise to check if values are the same
result = pt_name_id_over_time.apply(check_same_nonzero, axis=1)

# Filter rows where the result is False
rows_with_false_result = pt_name_id_over_time[~result]

print('Companies with changing/different IDs over time')
rows_with_false_result.sort_values(by=['Full company name', 'Country'])


Companies with changing/different IDs over time


Unnamed: 0_level_0,Year,2017,2018,2019,2020,2021
Full company name,Country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ABOURACHID Mining,Chad,Not communicated,Not available,0,0,0
AGBAOU GOLD OPERATIONS,Cote d'Ivoire,1273929 F,1273929F,0,0,0
Apache Corporation,United Kingdom,0,4614761,4614761,4614761,07720972\r\n04614761\r\nFC005975
Belema Oil Producing Limited,Nigeria,Not available,1087839,0,0,0
Breedon Group PLC,United Kingdom,0,0,0,98465,Jersey 98465
CADERAC. SA,Cote d'Ivoire,9910850 P,9910850P,0,0,0
CGCOC Group,Chad,Not communicated,Not available,0,0,0
CapeOmega AS,Norway,913776712,913776712,995152142,0,0
Cemex UK Materials Ltd,United Kingdom,0,4895833,4895833,4895833,658390
Chad construction Materials S.A,Chad,Not communicated,Not available,0,0,0


### Government entities that appear in the Government revenues (Part 4) must also appear in the Reporting government entities list (Part 3b)

#### Expectations
- Government entities in the Government revenues list (Part 4) must appear in the Reporting government entities list (Part 3b), and vice versa.

#### Assumptions
- Government entities that report their revenues must have corresponding entries in the Reporting government entities list
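
Because the expectation runs in both directions, a single outer merge with `indicator=True` can surface both kinds of mismatch at once. The following is a minimal sketch on toy data, offered as an alternative to separate left and right merges. The sample rows are illustrative.

```python
import pandas as pd

# Toy stand-ins for the de-duplicated entity lists from Part 4 and Part 3b
revenues = pd.DataFrame({
    'Government entity': ['Ministry of Lands', 'Autres'],
    'Country': ['Zambia', "Cote d'Ivoire"],
})
report = pd.DataFrame({
    'Full name of agency': ['Ministry of Lands', 'IDC'],
    'Country': ['Zambia', 'Zambia'],
})

merged = pd.merge(
    revenues,
    report,
    left_on=['Government entity', 'Country'],
    right_on=['Full name of agency', 'Country'],
    how='outer',
    indicator=True,
)

# left_only: in the revenues list but not in the reporting list; right_only: the reverse
only_in_revenues = merged.loc[merged['_merge'] == 'left_only', 'Government entity'].tolist()
only_in_report = merged.loc[merged['_merge'] == 'right_only', 'Full name of agency'].tolist()
print(only_in_revenues, only_in_report)
```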
  
#### Results

In [62]:
# Get list of unique government entities in the Government revenues and Reporting government entities list

govt_entities_in_revenues = df_part_4[['Government entity', 'Country']].drop_duplicates().reset_index(drop=True)
govt_entities_in_revenues

Unnamed: 0,Government entity,Country
0,National Environmental Protection Agency,Afghanistan
1,Ministry of Finance (Customs Department),Afghanistan
2,Ministry of Mines and Petroleum (Revenue Depar...,Afghanistan
3,Ministry of Finance (Revenue Department),Afghanistan
4,Bureau of Internal Revenue (BIR),Philippines
...,...,...
283,Ministry of Lands,Zambia
284,Environmental Protection Fund,Zambia
285,ZCCM- IH,Zambia
286,IDC,Zambia


In [63]:
govt_entities_in_report = df_part_3b[['Full name of agency', 'Country']].drop_duplicates().reset_index(drop=True)
govt_entities_in_report

Unnamed: 0,Full name of agency,Country
0,Ministry of Finance (Revenue Department),Afghanistan
1,Ministry of Finance (Customs Department),Afghanistan
2,Ministry of Mines and Petroleum (Revenue Depar...,Afghanistan
3,National Environmental Protection Agency,Afghanistan
4,Ministry of Industry and Commerce,Afghanistan
...,...,...
334,Environmental Protection Fund,Zambia
335,Ministry of Lands,Zambia
336,ZCCM- IH,Zambia
337,IDC,Zambia


In [64]:
govt_entities_merged_left = pd.merge(govt_entities_in_revenues, govt_entities_in_report, left_on=['Government entity', 'Country'], right_on=['Full name of agency', 'Country'], how='left', indicator=True)
govt_entities_merged_left

Unnamed: 0,Government entity,Country,Full name of agency,_merge
0,National Environmental Protection Agency,Afghanistan,National Environmental Protection Agency,both
1,Ministry of Finance (Customs Department),Afghanistan,Ministry of Finance (Customs Department),both
2,Ministry of Mines and Petroleum (Revenue Depar...,Afghanistan,Ministry of Mines and Petroleum (Revenue Depar...,both
3,Ministry of Finance (Revenue Department),Afghanistan,Ministry of Finance (Revenue Department),both
4,Bureau of Internal Revenue (BIR),Philippines,Bureau of Internal Revenue (BIR),both
...,...,...,...,...
283,Ministry of Lands,Zambia,Ministry of Lands,both
284,Environmental Protection Fund,Zambia,Environmental Protection Fund,both
285,ZCCM- IH,Zambia,ZCCM- IH,both
286,IDC,Zambia,IDC,both


In [65]:
print("Entities that are in the Government revenues list but not in the Government entities list")
govt_entities_not_in_report = govt_entities_merged_left[govt_entities_merged_left['_merge'] == 'left_only'].drop(columns=['_merge'])
govt_entities_not_in_report

Entities that are in the Government revenues list but not in the Government entities list


Unnamed: 0,Government entity,Country,Full name of agency
21,,Albania,
25,,Argentina,
59,Direction Générale des impôts (DGI),Cote d'Ivoire,
65,Autres,Cote d'Ivoire,
73,Direction des Recettes Provinciales,Democratic Republic of Congo,
83,Ministerio de Energía y Minas (MEM),Dominican Republic,
237,Tax Revenue Authority,Seychelles,
259,Les delegations speciales des communes et pref...,Togo,


In [66]:
govt_entities_merged_right = pd.merge(govt_entities_in_revenues, govt_entities_in_report, left_on=['Government entity', 'Country'], right_on=['Full name of agency', 'Country'], how='right', indicator=True)
govt_entities_merged_right

Unnamed: 0,Government entity,Country,Full name of agency,_merge
0,Ministry of Finance (Revenue Department),Afghanistan,Ministry of Finance (Revenue Department),both
1,Ministry of Finance (Customs Department),Afghanistan,Ministry of Finance (Customs Department),both
2,Ministry of Mines and Petroleum (Revenue Depar...,Afghanistan,Ministry of Mines and Petroleum (Revenue Depar...,both
3,National Environmental Protection Agency,Afghanistan,National Environmental Protection Agency,both
4,,Afghanistan,Ministry of Industry and Commerce,right_only
...,...,...,...,...
334,Environmental Protection Fund,Zambia,Environmental Protection Fund,both
335,Ministry of Lands,Zambia,Ministry of Lands,both
336,ZCCM- IH,Zambia,ZCCM- IH,both
337,IDC,Zambia,IDC,both


In [67]:
print("Entities that are in the Government entities list but not in the Government revenues list")
govt_entities_not_in_revenues = govt_entities_merged_right[govt_entities_merged_right['_merge'] == 'right_only'].drop(columns=['_merge']).reset_index(drop=True)
govt_entities_not_in_revenues

Entities that are in the Government entities list but not in the Government revenues list


Unnamed: 0,Government entity,Country,Full name of agency
0,,Afghanistan,Ministry of Industry and Commerce
1,,Philippines,Department of Budget and Management (DBM)
2,,Philippines,Philippine Natioanl Oil Company (PNOC)
3,,Philippines,Philippine Minding Development Corporation (PDMC)
4,,Albania,Electric energy distribution system operator (...
5,,Albania,Electric Energy Distribution System Operator (...
6,,Argentina,"Secretaría de Minería (SEMIN), Ministerio de D..."
7,,Burkina Faso,Autres bénéficiaires (Paiements sociaux)
8,,Burkina Faso,Agence de l'eau
9,,Chad,Direction Générale des Services de Douanes et ...


In [68]:
df_part_3b[df_part_3b['Full name of agency']=='Department of Budget and Management (DBM)']

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
13,Department of Budget and Management (DBM),Central goverment,000-449-457-000,,Philippines,PHL,2018.0,2018-01-01,2018-12-31


In [69]:
df_part_4[df_part_4['Government entity']=='Department of Budget and Management (DBM)']

Unnamed: 0,GFS Classification,Sector,Revenue stream name,Government entity,Revenue value,Currency,Country,ISO Code,Year,Start Date,End Date


In [70]:
print("Number of Government entities not in the Government revenues list per country")
govt_entities_not_in_revenues.groupby('Country')['Full name of agency'].count().reset_index(name='Number of entities')

Number of Government entities not in the Government revenues list per country


Unnamed: 0,Country,Number of entities
0,Afghanistan,1
1,Albania,2
2,Argentina,1
3,Burkina Faso,2
4,Chad,4
5,Cote d'Ivoire,3
6,Democratic Republic of Congo,4
7,Dominican Republic,1
8,Ethiopia,1
9,Ghana,13


## Use-cases to validate

### How has the extractives industry evolved over time in country X? How much is the volume extracted over time? How much revenue is generated?

### IMPORTANT: Some countries (e.g. Norway) report blank revenue values as `-`. The field is then not read as a number, which makes working with the data difficult.
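
A possible way to handle such values is to treat `-` as an explicit blank before converting, and to list anything else that still fails to parse. The sketch below uses toy values; the thousands separators and the `#ERROR!` entry are assumptions for illustration.

```python
import pandas as pd

# Toy revenue values: a thousands separator, a '-' blank marker, padding, a true blank, and junk
raw = pd.Series(['1,192,667', '-', ' 25954.5 ', None, '#ERROR!'], dtype=object)

stripped = raw.str.strip().str.replace(',', '', regex=False)

# Treat '-' as "no value reported" rather than as a parsing failure
is_blank_marker = stripped.eq('-')
revenue = pd.to_numeric(stripped.mask(is_blank_marker), errors='coerce')

# Values that were present, were not the blank marker, and still failed to parse
unexpected = stripped[revenue.isna() & stripped.notna() & ~is_blank_marker]
print(revenue.tolist(), unexpected.tolist())
```

Keeping the unparsed values in a separate list makes it possible to review them instead of silently turning them into missing numbers.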

In [71]:
# Revenue per country based on Company data
df_part_5_revenues = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"))

# Convert revenue values to numbers; errors='raise' fails loudly on any non-numeric value
df_part_5_revenues['Revenue value'] = pd.to_numeric(df_part_5_revenues['Revenue value'], errors='raise')

# Confirm that the column is numeric
df_part_5_revenues['Revenue value'].dtype

dtype('float64')

In [72]:
df_part_5_revenues

Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
0,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,442801100.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
1,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,386169944.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
2,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,336623658.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
3,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,300000000.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
4,Habib Shahab Talc and Marble exploitation and ...,Ministry of Mines and Petroleum (Revenue Depar...,Penalties of Late Payment,,,,AFN,18.0,,,,,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33039,MAAMBA COLLIERIES LIMITED,Zambian Revenue Authority (ZRA),Excise Duty - Electrical Energy,No,No,,ZMW,66946749.0,No,,,,Zambia,ZMB,2018.0,2018-01-01,2018-12-31
33040,MAAMBA COLLIERIES LIMITED,Local Councils,Annual Business Fees,No,No,,ZMW,311691.0,No,,,,Zambia,ZMB,2018.0,2018-01-01,2018-12-31
33041,MAAMBA COLLIERIES LIMITED,Local Councils,Property Rates,No,No,,ZMW,1200199.0,No,,,,Zambia,ZMB,2018.0,2018-01-01,2018-12-31
33042,MAAMBA COLLIERIES LIMITED,Ministry of Lands,Ground Rent,No,No,,ZMW,86562.0,No,,,,Zambia,ZMB,2018.0,2018-01-01,2018-12-31


In [90]:
total_revenue_per_country_per_year = df_part_5_revenues.groupby(['Country', 'Year', 'Reporting currency'])['Revenue value'].sum().reset_index()

# Convert the Year column to a string and drop the trailing '.0'
total_revenue_per_country_per_year['Year'] = total_revenue_per_country_per_year['Year'].astype(str).str.replace(r'\.0$', '', regex=True)

total_revenue_per_country_per_year 

Unnamed: 0,Country,Year,Reporting currency,Revenue value
0,Afghanistan,2018,AFN,4.804506e+09
1,Afghanistan,2019,AFN,4.265284e+09
2,Albania,2017,ALL,1.614886e+10
3,Albania,2017,USD,2.189620e+05
4,Albania,2018,ALL,1.822660e+10
...,...,...,...,...
70,United Kingdom,2019,GBP,1.503146e+09
71,United Kingdom,2020,GBP,2.550391e+08
72,United Kingdom,2021,GBP,9.898815e+08
73,Zambia,2017,ZMW,8.600124e+09


In [91]:
total_revenue_per_country_per_year_pt = total_revenue_per_country_per_year.pivot_table(index=['Country', 'Reporting currency'], columns='Year', values='Revenue value', aggfunc='sum', fill_value=0).style.format('{:,.2f}')
total_revenue_per_country_per_year_pt

Unnamed: 0_level_0,Year,2017,2018,2019,2020,2021
Country,Reporting currency,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Afghanistan,AFN,0.0,4804506352.0,4265283832.0,0.0,0.0
Albania,ALL,16148860065.0,18226600760.0,0.0,0.0,0.0
Albania,USD,218962.0,0.0,0.0,0.0,0.0
Argentina,ARS,0.0,52123232635.87,0.0,0.0,0.0
Armenia,AMD,0.0,79528828359.34,85427456909.0,0.0,0.0
Burkina Faso,XOF,143446403692.0,108352706471.0,100310250892.0,178994800000.0,0.0
Cameroon,XAF,428111382954.0,0.0,0.0,0.0,0.0
Chad,USD,547448408.0,1003136734.0,0.0,0.0,0.0
Cote d'Ivoire,XOF,279785538194.0,193151957509.0,0.0,0.0,0.0
Democratic Republic of Congo,USD,1665052954.0,0.0,0.0,0.0,0.0


## Statistics

### How many years of data does each country have?

In [75]:
df_part_5.groupby('Country')['Year'].nunique().reset_index(name='Years of data')

Unnamed: 0,Country,Years of data
0,Afghanistan,2
1,Albania,2
2,Argentina,1
3,Armenia,2
4,Burkina Faso,4
5,Cameroon,1
6,Chad,2
7,Cote d'Ivoire,2
8,Democratic Republic of Congo,1
9,Dominican Republic,3


In [76]:
df_part_5.groupby('Country')['Year'].nunique().reset_index(name='Years of data').sort_values(by=['Years of data'], ascending=False)

Unnamed: 0,Country,Years of data
33,United Kingdom,4
32,Ukraine,4
4,Burkina Faso,4
11,Germany,4
24,Norway,3
9,Dominican Republic,3
12,Ghana,3
20,Mongolia,3
0,Afghanistan,2
16,Madagascar,2


In [77]:
df_part_5_a = df_part_5.copy()

df_part_5_a['data_exists'] = 'x'

# List of years from 2017 to 2022
years = list(range(2017, 2023))

# Create a table that marks with 'x' each country and year for which data exists (blank otherwise)
data_exists = df_part_5_a.pivot_table(index='Country', columns='Year', values='data_exists', aggfunc='max', fill_value='')
data_exists = data_exists.reindex(columns=years, fill_value='')

data_exists

Year,2017,2018,2019,2020,2021,2022
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Afghanistan,,x,x,,,
Albania,x,x,,,,
Argentina,,x,,,,
Armenia,,x,x,,,
Burkina Faso,x,x,x,x,,
Cameroon,x,,,,,
Chad,x,x,,,,
Cote d'Ivoire,x,x,,,,
Democratic Republic of Congo,x,,,,,
Dominican Republic,,x,x,x,,
