# Duplication and coherence

## Overall purpose and objective
The cleaning and verification process prepares the data for conversion into a SQLite database (Datasette). The data should therefore follow database best practices.

## Specific purpose of this notebook
This notebook checks the data for duplicates and for coherence across tables. In particular, we check:
- Possibly problematic (duplicated) rows in each table.
- Whether the companies/agencies in the company and agency lists correspond to the companies/agencies in the company data.

## Assumptions
- Some combinations of fields should be unique
- The values should be coherent across tables
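
As a minimal sketch of the first assumption, the check below flags rows that share a value combination expected to be unique. The toy table and the choice of key columns are illustrative assumptions, not taken from the EITI data.

```python
import pandas as pd

# Toy table (hypothetical values) in which agency name + country + year
# is assumed to identify a row uniquely.
toy = pd.DataFrame({
    "Full name of agency": ["Agency A", "Agency A", "Agency B"],
    "Country": ["X", "X", "X"],
    "Year": [2018, 2018, 2018],
    "Total reported": [100, 250, 75],
})

key_columns = ["Full name of agency", "Country", "Year"]

# keep=False marks every member of a duplicated group, not only the repeats.
violations = toy[toy.duplicated(subset=key_columns, keep=False)]
print(violations)
```

Here the two `Agency A` rows violate the assumed key even though their totals differ, which is the kind of row this notebook treats as problematic.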

## Why this matters
- Inserting the data into a proper database and assigning EITI IDs requires high confidence in the data quality to avoid downstream issues. Duplicates and incoherence in the data reduce that confidence.

## Findings


## Analysis

### Problematic rows due to possible duplication

In [1]:
# import libraries and data

import pandas as pd
import numpy as np
from os import path
from functools import reduce
from pprint import pprint
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from itertools import combinations

file_dir = "data/consolidated/"
file_dir_old = "data/consolidated/backup/old"

# load the csvs into data frames
df_part_1 = pd.read_csv(path.join(file_dir, "Part 1 - About.csv"))
df_part_3a = pd.read_csv(path.join(file_dir, "Part 3 - Reporting companies' list.csv"))
df_part_3b = pd.read_csv(path.join(file_dir, "Part 3 - Reporting government entities list.csv"))
df_part_3c = pd.read_csv(path.join(file_dir, "Part 3 - Reporting projects' list.csv"))
df_part_4 = pd.read_csv(path.join(file_dir, "Part 4 - Government revenues.csv"))
df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"))
# df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"), low_memory=False)

df_list = [df_part_1, df_part_3a, df_part_3b, df_part_3c, df_part_4, df_part_5]
df_dict = {"Part 1 - About.csv": df_part_1,
           "Part 3 - Reporting companies' list.csv": df_part_3a,
           "Part 3 - Reporting government entities list.csv": df_part_3b,
           "Part 3 - Reporting projects' list.csv": df_part_3c,
           "Part 4 - Government revenues.csv": df_part_4,
           "Part 5 - Company data.csv": df_part_5
          }

# OPTIONAL COLUMNS
part_3a_opt = ["Stock exchange listing or company website", 
               "Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)"]
part_3b_opt = ["ID number (if applicable)"]
part_5_opt = ["In-kind volume (if applicable)", "Unit (if applicable)", "Comments"]

# only include fields that are non-optional
df_part_1_non_opt = df_part_1.copy()
df_part_3a_non_opt = df_part_3a.copy().drop(columns=part_3a_opt)               
df_part_3b_non_opt = df_part_3b.copy().drop(columns=part_3b_opt)
df_part_3c_non_opt = df_part_3c.copy()
df_part_4_non_opt = df_part_4.copy()
df_part_5_non_opt = df_part_5.copy().drop(columns=part_5_opt)

df_list_non_opt = [df_part_1_non_opt, df_part_3a_non_opt, df_part_3b_non_opt, df_part_3c_non_opt, df_part_4_non_opt, df_part_5_non_opt]
df_dict_non_opt = {"Part 1 - About.csv": df_part_1_non_opt,
           "Part 3 - Reporting companies' list.csv": df_part_3a_non_opt,
           "Part 3 - Reporting government entities list.csv": df_part_3b_non_opt,
           "Part 3 - Reporting projects' list.csv": df_part_3c_non_opt,
           "Part 4 - Government revenues.csv": df_part_4_non_opt,
           "Part 5 - Company data.csv": df_part_5_non_opt
          }

In [2]:
def get_column_combinations(columns, min_cols_diff):
    '''
    Get unique column combinations that will be used for determining problematic rows (i.e. possible duplicates).

    Parameters:
    - columns (list or iterable): A list of column names or identifiers to be used for forming combinations.
    - min_cols_diff (int): The minimum number of columns in a combination; every size from this value up to len(columns) is generated.

    Returns:
    list: A list of tuples representing unique combinations of columns.

    Example:
    >>> columns = ['col1', 'col2', 'col3']
    >>> min_cols_diff = 2
    >>> get_column_combinations(columns, min_cols_diff)
    [('col1', 'col2'), ('col1', 'col3'), ('col2', 'col3'), ('col1', 'col2', 'col3')]
    '''

    all_combinations = []
    for x in range(min_cols_diff, len(columns) + 1):
        all_combinations.extend(combinations(columns, x))

    return all_combinations

# TEST: Uncomment lines below and run
# OUTPUT: 4 unique column combinations
# columns = ["Full name of agency", "Agency type", "Total reported"]
# column_combinations = get_column_combinations(columns, 2)
# print(f'There are {len(column_combinations)} combinations:')
# pprint(column_combinations)

In [3]:
def add_rowid(df):
    '''
    Add a row identifier (rowid) column to the DataFrame.

    Parameters:
    - df (pandas.DataFrame): The input DataFrame.

    Returns:
    pandas.DataFrame: A new DataFrame with an additional 'rowid' column.

    Example:
    >>> data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
    >>> df = pd.DataFrame(data)
    >>> add_rowid(df)
       col1 col2  rowid
    0     1    a      0
    1     2    b      1
    2     3    c      2
    '''

    df_rowid = df.copy()
    df_rowid["rowid"] = range(len(df_rowid))

    return df_rowid    

In [4]:
def get_problematic_rows(df, columns, min_cols_diff):
    '''
    Find problematic rows (possible duplicates): rows that share values on any combination of the specified columns.

    Parameters:
    - df (pandas.DataFrame): The input DataFrame.
    - columns (list or iterable): A list of column names or identifiers to be used for forming combinations.
    - min_cols_diff (int): The minimum number of columns in a combination. Rows that agree on any combination of at least this many columns are flagged.

    Returns:
    pandas.DataFrame: The flagged rows, each listed once, with an added 'rowid' column.

    Example:
    >>> data = {'col1': [1, 2, 2, 3, 4], 'col2': ['a', 'b', 'b', 'c', 'd']}
    >>> df = pd.DataFrame(data)
    >>> columns_for_combinations = ['col1', 'col2']
    >>> min_cols_diff = 2
    >>> get_problematic_rows(df, columns_for_combinations, min_cols_diff)
       col1 col2  rowid
    1     2    b      1
    2     2    b      2
    '''

    # Step 1: Create a copy of the DataFrame and add rowid
    df_copy = df.copy()
    df_copy['rowid'] = range(len(df_copy))

    # Step 2: Get column combinations
    all_combinations = []
    for x in range(min_cols_diff, len(columns) + 1):
        all_combinations.extend(combinations(columns, x))

    # Step 3: Find problematic rows for each column combination
    # (collect the pieces first, then concatenate once)
    flagged = [df_copy[df_copy.duplicated(subset=list(combination), keep=False)]
               for combination in all_combinations]
    problematic_rows = pd.concat(flagged, ignore_index=False) if flagged else pd.DataFrame()

    # Step 4: Get unique rows among problematic rows
    unique_problematic_rows = problematic_rows.drop_duplicates()

    return unique_problematic_rows
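
To make the effect of `min_cols_diff` concrete, here is a self-contained sketch on a toy table. The helper name `flag_partial_duplicates` and the data are hypothetical; the helper mirrors the logic above without the `rowid` column.

```python
from itertools import combinations

import pandas as pd


def flag_partial_duplicates(df, columns, min_cols):
    """Return rows that repeat on any combination of at least `min_cols` columns."""
    flagged = pd.Series(False, index=df.index)
    for size in range(min_cols, len(columns) + 1):
        for subset in combinations(columns, size):
            # Mark every row whose values on this subset occur more than once.
            flagged |= df.duplicated(subset=list(subset), keep=False)
    return df[flagged]


toy = pd.DataFrame({
    "name": ["A", "A", "B", "C"],
    "type": ["central", "central", "local", "local"],
    "total": [10, 20, 30, 40],
})

# Rows 0 and 1 agree on (name, type) but not on total, so they are flagged
# even though the full rows differ.
flagged_rows = flag_partial_duplicates(toy, ["name", "type", "total"], 2)
print(flagged_rows)
```

With a threshold of 3, no rows in this toy table would be flagged, because no two rows agree on all three columns.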

#### Part 3 - Reporting government entities list

In [5]:
columns = ["Full name of agency", "Agency type", "Total reported"]

redflags_part_3b = get_problematic_rows(df_part_3b, columns, 2)
display(redflags_part_3b)

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date,rowid
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,1441453501,Afghanistan,AFG,2018,2017-12-21,2018-12-20,0
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,1367722942,Afghanistan,AFG,2018,2017-12-21,2018-12-21,1
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,1478934310.56,Afghanistan,AFG,2018,2017-12-21,2018-12-22,2
3,National Environmental Protection Agency,Central goverment,Not applicable,34000,Afghanistan,AFG,2018,2017-12-21,2018-12-23,3
4,Ministry of Industry and Commerce,Central goverment,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...
460,Electric energy distribution system operator (...,State-owned enterprises & public corporations,,,Albania,ALB,2017,2017-01-01,2017-12-31,460
471,Electric Energy Distribution System Operator (...,State-owned enterprises & public corporations,,,Albania,ALB,2018,2018-01-01,2018-12-31,471
474,"Secretaría de Minería (SEMIN), Ministerio de D...",Central government,,,Argentina,ARG,2018,2018-01-01,2018-12-31,474
524,CAPAM,Other,,,Cameroon,CMR,2017,2017-01-01,2017-12-31,524


In [6]:
redflags_part_3b.groupby('Country').size().reset_index(name='Number of Rows').sort_values(by="Number of Rows", ascending=False)

Unnamed: 0,Country,Number of Rows
12,Ghana,80
4,Burkina Faso,28
16,Madagascar,24
1,Albania,22
34,Zambia,18
20,Mongolia,18
6,Chad,17
32,Ukraine,15
11,Germany,15
7,Cote d'Ivoire,14


In [7]:
redflags_part_3b.groupby('Year').size().reset_index(name='Number of Rows').sort_values(by="Number of Rows", ascending=False)

Unnamed: 0,Year,Number of Rows
1,2018,128
0,2017,121
2,2019,93
3,2020,31


In [8]:
redflags_part_3b.groupby(["Country", "Year"]).size().reset_index(name='Number of Rows').sort_values(by="Number of Rows", ascending=False).head(15)

Unnamed: 0,Country,Year,Number of Rows
27,Ghana,2019,36
25,Ghana,2017,23
26,Ghana,2018,21
32,Madagascar,2017,12
33,Madagascar,2018,12
2,Albania,2017,11
3,Albania,2018,11
8,Burkina Faso,2018,9
9,Burkina Faso,2019,9
13,Chad,2018,9


### Coherence of companies in Part 3 - Reporting companies' list, Part 3 - Reporting government entities list, and Part 3 - Reporting projects' list with Part 5 - Company data

In [9]:
def compare_tables(df1, df2, common_columns_df1, common_columns_df2):
    '''
    Compare two tables based on specified columns.

    Parameters:
    - df1 (pandas.DataFrame): The first DataFrame.
    - df2 (pandas.DataFrame): The second DataFrame.
    - common_columns_df1 (list): Columns used in df1 to find common rows.
    - common_columns_df2 (list): Columns used in df2 to find common rows.

    Returns:
    dict: Three DataFrames keyed by
        "in table 1 but not in table 2" (rows of df1 with no match in df2),
        "in table 2 but not in table 1" (rows of df2 with no match in df1) and
        "in both tables" (the inner merge of df1 and df2 on the given columns).

    Example:
    >>> df1 = pd.DataFrame({'Company': ['A', 'B', 'C'], 'Project name': ['P1', 'P2', 'P3'], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2022]})
    >>> df2 = pd.DataFrame({'Full company name': ['A Corp', 'B Corp', 'D Corp'], 'Company type': ['Type1', 'Type2', 'Type3'], 'Company ID number': [101, 102, 103], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2023]})
    >>> common_cols_df1 = ['Country', 'Year']
    >>> common_cols_df2 = ['Country', 'Year']
    >>> result = compare_tables(df1, df2, common_cols_df1, common_cols_df2)
    >>> result["in both tables"]['Country'].tolist()
    ['X', 'Y']
    >>> result["in table 1 but not in table 2"]['Company'].tolist()
    ['C']
    >>> result["in table 2 but not in table 1"]['Full company name'].tolist()
    ['D Corp']
    '''

    # Make copies of the input dataframes
    df1_copy = df1.copy()
    df2_copy = df2.copy()

    # Find common rows
    common_rows = pd.merge(df1_copy, df2_copy, left_on=common_columns_df1, right_on=common_columns_df2, how='inner')

    # Find unique rows in df1
    unique_rows_df1 = df1_copy[~df1_copy.set_index(common_columns_df1).index.isin(common_rows.set_index(common_columns_df1).index)]

    # Find unique rows in df2
    unique_rows_df2 = df2_copy[~df2_copy.set_index(common_columns_df2).index.isin(common_rows.set_index(common_columns_df2).index)]

    return {"in table 1 but not in table 2": unique_rows_df1, 
            "in table 2 but not in table 1": unique_rows_df2,
           "in both tables": common_rows}

In [10]:
def compare_tables_drop_duplicates(df1, df2, common_columns_df1, common_columns_df2):
    '''
    Compare two tables based on specified columns and drop duplicates.

    Parameters:
    - df1 (pandas.DataFrame): The first DataFrame.
    - df2 (pandas.DataFrame): The second DataFrame.
    - common_columns_df1 (list): Columns used in df1 to find common rows.
    - common_columns_df2 (list): Columns used in df2 to find common rows.

    Returns:
    dict: Three DataFrames, each with duplicates on the comparison columns dropped, keyed by
        "in table 1 but not in table 2" (rows of df1 with no match in df2),
        "in table 2 but not in table 1" (rows of df2 with no match in df1) and
        "in both tables" (the inner merge of df1 and df2 on the given columns).

    Example:
    >>> df1 = pd.DataFrame({'Company': ['A', 'B', 'C'], 'Project name': ['P1', 'P2', 'P3'], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2022]})
    >>> df2 = pd.DataFrame({'Full company name': ['A Corp', 'B Corp', 'D Corp'], 'Company type': ['Type1', 'Type2', 'Type3'], 'Company ID number': [101, 102, 103], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2023]})
    >>> common_cols_df1 = ['Country', 'Year']
    >>> common_cols_df2 = ['Country', 'Year']
    >>> result = compare_tables_drop_duplicates(df1, df2, common_cols_df1, common_cols_df2)
    >>> result["in both tables"]['Country'].tolist()
    ['X', 'Y']
    >>> result["in table 1 but not in table 2"]['Company'].tolist()
    ['C']
    >>> result["in table 2 but not in table 1"]['Full company name'].tolist()
    ['D Corp']
    '''

    # Find common rows
    common_rows = pd.merge(df1, df2, left_on=common_columns_df1, right_on=common_columns_df2, how='inner')

    # Drop duplicates in common rows
    common_rows = common_rows.drop_duplicates(subset=common_columns_df1)

    # Drop duplicates in unique rows in df1
    unique_rows_df1 = df1[~df1.set_index(common_columns_df1).index.isin(common_rows.set_index(common_columns_df1).index)]
    unique_rows_df1 = unique_rows_df1.drop_duplicates(subset=common_columns_df1)

    # Drop duplicates in unique rows in df2
    unique_rows_df2 = df2[~df2.set_index(common_columns_df2).index.isin(common_rows.set_index(common_columns_df2).index)]
    unique_rows_df2 = unique_rows_df2.drop_duplicates(subset=common_columns_df2)

    return {"in table 1 but not in table 2": unique_rows_df1, 
            "in table 2 but not in table 1": unique_rows_df2,
            "in both tables": common_rows}
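
An alternative way to obtain the same three groups is a single outer merge with pandas' `indicator=True` option, which adds a `_merge` column labelling each row as `left_only`, `right_only` or `both`. The sketch below is not the approach used in this notebook; the function name `split_by_membership` and the toy frames are assumptions for illustration.

```python
import pandas as pd


def split_by_membership(df1, df2, keys_df1, keys_df2):
    """Classify distinct key combinations by the table(s) they appear in."""
    # Compare distinct keys only, so repeated rows do not multiply the result.
    left = df1[keys_df1].drop_duplicates()
    right = df2[keys_df2].drop_duplicates()
    merged = left.merge(right, left_on=keys_df1, right_on=keys_df2,
                        how="outer", indicator=True)
    return {
        "in table 1 but not in table 2": merged[merged["_merge"] == "left_only"],
        "in table 2 but not in table 1": merged[merged["_merge"] == "right_only"],
        "in both tables": merged[merged["_merge"] == "both"],
    }


# Hypothetical stand-ins for the companies' list and the company data.
companies = pd.DataFrame({"Full company name": ["A", "B", "C"]})
payments = pd.DataFrame({"Company": ["A", "A", "B", "D"]})

groups = split_by_membership(companies, payments, ["Full company name"], ["Company"])
for label, rows in groups.items():
    print(label, len(rows))
```

Unlike the functions above, this sketch returns only the key columns, which keeps the output small when the goal is just to list unmatched names.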


#### Part 3 - Reporting companies' list and Part 5 - Company data

In [11]:
common_columns_3a5 = ["Full company name"]
common_columns_53a = ["Company"]

for key, data in compare_tables(df_part_3a, df_part_5, common_columns_3a5, common_columns_53a).items():
    print(key)
    display(data)

in table 1 but not in table 2


Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
7,Abed Hasan Zadran Limited,Private,9005801197,Other,Coal,,Not available,,Afghanistan,AFG,2018,2017-12-21,2018-12-24
9,Afghan Shinink Mines Extraction and Processing,Private,9002202316,Other,Talc,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-24
95,"احمد علی ولد خداداد, احمدعلی Ahamd Ali Son of ...",Private,9001263814,Other,Construction stone,Not applicable,Not available,-,Afghanistan,AFG,2018,2017-12-21,2018-12-24
131,شرکت استخراج معادن افغان اکتیف لمیتد Afghan Ac...,Private,9001353375,Other,Chromite,Not applicable,Not available,-,Afghanistan,AFG,2018,2017-12-21,2018-12-24
138,شرکت استخراج معادن ذغال سنک افراسیاب Afrasyab ...,Private,9001505461,Other,Coal,Not applicable,Not available,-,Afghanistan,AFG,2018,2017-12-21,2018-12-24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3763,Glencore Exploration (DOB/DOI),Private,Not available,Oil,Oil,https://investir.lesechos.fr/cours/action-glen...,Not available,,Chad,TCD,2018,2018-01-01,2018-12-31
3771,Société Nationale de Ciment (SONACIM),State-owned enterprises & public corporations,Not available,Mining,Quarrying,Not available,Not available,,Chad,TCD,2018,2018-01-01,2018-12-31
3778,ARAB CONTRACTORS,Private,600008358,Mining,BTP,https://www.arabcont.com/english/,Not available,,Chad,TCD,2018,2018-01-01,2018-12-31
3781,Chad construction Materials S.A,Private,Not available,Mining,BTP,Not available,Not available,,Chad,TCD,2018,2018-01-01,2018-12-31


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
29,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,1192667,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
30,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Fixed Tax on Exports,No,No,,AFN,143121,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
31,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Other Fee on Exports,No,No,,AFN,25954,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
32,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Penalty on Exports,No,No,,AFN,6350,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
37,"Afghan Shiinink, Mines Extraction and Processing",Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,35222,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31566,HOUNDE GOLD OPERATION SA,Direction Générale du Trésor et de la Comptabi...,Frais de dossier,Yes,No,Nc,XOF,10000,,,,TRUE,Burkina Faso,BFA,2020,2020-12-01,2020-12-31
31569,ROXGOLD SANU SA,Direction Générale du Trésor et de la Comptabi...,Remboursements de crédit de TVA (remboursement...,No,No,,XOF,-3144990000,,,,TRUE,Burkina Faso,BFA,2020,2020-12-01,2020-12-31
31572,HOUNDE GOLD OPERATION SA,Direction Générale du Trésor et de la Comptabi...,Remboursements de crédit de TVA (remboursement...,No,No,,XOF,-7884500000,,,,TRUE,Burkina Faso,BFA,2020,2020-12-01,2020-12-31
31575,BISSA GOLD SA,Direction Générale du Trésor et de la Comptabi...,Remboursements de crédit de TVA (remboursement...,No,No,,XOF,-16687770000,,,,TRUE,Burkina Faso,BFA,2020,2020-12-01,2020-12-31


in both tables


Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country_x,ISO Code_x,...,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country_y,ISO Code_y,Year_y,Start Date_y,End Date_y
0,Amin Karimzai Campany,Private,1007815085,Other,Talc,Not applicable,Not available,365746,Afghanistan,AFG,...,365746,No,Not applicable,Not applicable,2018-08-28,Afghanistan,AFG,2018,2017-12-21,2018-12-20
1,Habib Shahab Talc and Marble exploitation and ...,Private,1013655012,Other,"Talc, Construction stone",Not applicable,Not available,427491,Afghanistan,AFG,...,18,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
2,Habib Shahab Talc and Marble exploitation and ...,Private,1013655012,Other,"Talc, Construction stone",Not applicable,Not available,427491,Afghanistan,AFG,...,365746,No,Not applicable,Not applicable,2018-08-28,Afghanistan,AFG,2018,2017-12-21,2018-12-20
3,Habib Shahab Talc and Marble exploitation and ...,Private,1013655012,Other,"Talc, Construction stone",Not applicable,Not available,427491,Afghanistan,AFG,...,53675,No,Not applicable,Not applicable,2018-08-01,Afghanistan,AFG,2018,2017-12-21,2018-12-20
4,Habib Shahab Talc and Marble exploitation and ...,Private,1013655012,Other,"Talc, Construction stone",Not applicable,Not available,427491,Afghanistan,AFG,...,8052,No,Not applicable,Not applicable,2018-08-01,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47432,Société de Raffinage de N’Djamena (SRN),Private,600008474,Other,Refinery,Not available,Not available,189977765,Chad,TCD,...,34438339,No,,,,Chad,TCD,2017,2017-01-01,2017-12-31
47433,Société de Raffinage de N’Djamena (SRN),Private,600008474,Other,Refinery,Not available,Not available,189977765,Chad,TCD,...,120595199,Yes,2574070,Barrels,Sales of the State's share collected by SHT to...,Chad,TCD,2017,2017-01-01,2017-12-31
47434,Société de Raffinage de N’Djamena (SRN),Private,600008474,Other,Refinery,Not available,Not available,189977765,Chad,TCD,...,187400000,Yes,4000000,Barrels,Ventes à la raffinerie des quotes-part huile d...,Chad,TCD,2018,2018-01-01,2018-12-31
47435,Société de Raffinage de N’Djamena (SRN),Private,600008474,Other,Refinery,Not available,Not available,189977765,Chad,TCD,...,2068759,No,Non applicable,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31


In [12]:
print("With duplicate rows removed")

common_columns_3a5 = ["Full company name"]
common_columns_53a = ["Company"]

for key, data in compare_tables_drop_duplicates(df_part_3a, df_part_5, common_columns_3a5, common_columns_53a).items():
    print(key)
    display(data)

With duplicate rows removed
in table 1 but not in table 2


Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
7,Abed Hasan Zadran Limited,Private,9005801197,Other,Coal,,Not available,,Afghanistan,AFG,2018,2017-12-21,2018-12-24
9,Afghan Shinink Mines Extraction and Processing,Private,9002202316,Other,Talc,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-24
95,"احمد علی ولد خداداد, احمدعلی Ahamd Ali Son of ...",Private,9001263814,Other,Construction stone,Not applicable,Not available,-,Afghanistan,AFG,2018,2017-12-21,2018-12-24
131,شرکت استخراج معادن افغان اکتیف لمیتد Afghan Ac...,Private,9001353375,Other,Chromite,Not applicable,Not available,-,Afghanistan,AFG,2018,2017-12-21,2018-12-24
138,شرکت استخراج معادن ذغال سنک افراسیاب Afrasyab ...,Private,9001505461,Other,Coal,Not applicable,Not available,-,Afghanistan,AFG,2018,2017-12-21,2018-12-24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3745,ARAB CONTRACTORS,Private,600008358,Mining,Construction,https://www.arabcont.com/english/,Not available,,Chad,TCD,2017,2017-01-01,2017-12-31
3748,Chad construction Materials S.A,Private,Not communicated,Mining,Construction,Not available,Not available,,Chad,TCD,2017,2017-01-01,2017-12-31
3754,Société des Hydrocarbures du Tchad Petroleum C...,State-owned enterprises & public corporations,Not available,Oil,Oil,Not available,Not available,,Chad,TCD,2018,2018-01-01,2018-12-31
3771,Société Nationale de Ciment (SONACIM),State-owned enterprises & public corporations,Not available,Mining,Quarrying,Not available,Not available,,Chad,TCD,2018,2018-01-01,2018-12-31


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
29,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,1192667,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
37,"Afghan Shiinink, Mines Extraction and Processing",Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,35222,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
41,"Afghan Shinink, Mines Extraction and Processing",Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,61336,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
46,Afghan Talc Limited Joint Venture,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,316251,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
92,Arif Shahaab Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,1085044,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30591,ESTELAR RESOURCES LIMITED S.A,Administración Federal de Ingresos Públicos (A...,Derechos de Exportación,No,No,,ARS,323735644.74,No,,,,Argentina,ARG,2018,2018-01-01,2018-12-31
31078,BISSA GOLD SA,Direction Générale des Impôts (DGI),Acomptes Provisionnels sur IS (AP - IS),Yes,Yes,Bissa Gold SA,XOF,9608091475,No,,,,Burkina Faso,BFA,2018,2018-01-01,2018-12-31
31104,HOUNDE GOLD OPERATION SA,Direction Générale des Douanes (DGD),Droits de Douane et taxes assimilées,Yes,Yes,Hounde Gold,XOF,6273544278,No,,,,Burkina Faso,BFA,2018,2018-01-01,2018-12-31
31108,ROXGOLD SANU SA,Direction Générale des Douanes (DGD),Droits de Douane et taxes assimilées,Yes,Yes,Yaramoko,XOF,940971519,No,,,,Burkina Faso,BFA,2018,2018-01-01,2018-12-31


in both tables


Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country_x,ISO Code_x,...,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country_y,ISO Code_y,Year_y,Start Date_y,End Date_y
0,Amin Karimzai Campany,Private,1007815085,Other,Talc,Not applicable,Not available,365746,Afghanistan,AFG,...,365746,No,Not applicable,Not applicable,2018-08-28,Afghanistan,AFG,2018,2017-12-21,2018-12-20
1,Habib Shahab Talc and Marble exploitation and ...,Private,1013655012,Other,"Talc, Construction stone",Not applicable,Not available,427491,Afghanistan,AFG,...,18,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
5,Abaan Rayan Limited,Private,9004655032,Other,Coal,,Not available,10256884,Afghanistan,AFG,...,8953331,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
9,Abas Ghaznavi Limited,Private,9001935742,Other,Coal,,Not available,36244514,Afghanistan,AFG,...,23473826,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
17,Abdul Fatah,Private,9001846329,Other,Construction stone,Not applicable,Not available,139422,Afghanistan,AFG,...,89598,No,Not applicable,Not applicable,2018-07-31,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47415,MIREDEX,Private,Not available,Mining,Gold,Not available,Not available,9833,Chad,TCD,...,9571,No,Non applicable,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31
47417,Société Nationale de Développement de Minérale...,Private,Not available,Mining,Gold,Not available,Not available,7196,Chad,TCD,...,7196,No,Non applicable,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31
47418,Société GMIA,Private,Not available,Mining,Gold,Not available,Not available,6866,Chad,TCD,...,4167,No,Non applicable,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31
47420,Tchad Oil Transportation Company (TOTCO),Private,600010746,Other,Oil transport,Not available,Not available,40463701.62,Chad,TCD,...,30718669.41,No,Non applicable,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31
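
Several names in the unmatched groups above appear to differ only in spelling, for example `Abed Hasan Zadran Limited` in the companies' list and `Abid Hassan Zadran Limited` in the company data. As a rough sketch, the standard library's `difflib.get_close_matches` can propose candidate pairs for manual review. The 0.8 cutoff and the helper name `suggest_matches` are assumptions, and any suggestion would still need to be confirmed by hand.

```python
from difflib import get_close_matches


def suggest_matches(names, candidates, cutoff=0.8):
    """Map each name to the candidate spellings that look similar to it."""
    suggestions = {}
    for name in names:
        close = get_close_matches(name, candidates, n=3, cutoff=cutoff)
        if close:
            suggestions[name] = close
    return suggestions


# Names copied from the outputs above.
only_in_part_3a = ["Abed Hasan Zadran Limited"]
only_in_part_5 = ["Abid Hassan Zadran Limited", "Arif Shahaab Limited"]

suggested = suggest_matches(only_in_part_3a, only_in_part_5)
print(suggested)
```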


#### Part 3 - Reporting government entities list and Part 5 - Company data

In [13]:
print("With duplicate rows")

common_columns_3b5 = ["Full name of agency", "Country", "Year"]
common_columns_53b = ["Government entity", "Country", "Year"]

for key, data in compare_tables(df_part_3b, df_part_5, common_columns_3b5, common_columns_53b).items():
    print(key)
    display(data)

With duplicate rows
in table 1 but not in table 2


Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
4,Ministry of Industry and Commerce,Central goverment,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-24
9,Ministry of Industry and Commerce,Central goverment,Not applicable,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
10,Direction Générale des Impôts (DGI),Central goverment,No applicable,108131178092,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
14,Direction des Participations et de la Privatis...,Central goverment,No applicable,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
18,Others,Central goverment,No applicable,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
...,...,...,...,...,...,...,...,...,...
542,Autorité de Régulation du Secteur pétrolier Av...,Central goverment,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31
543,Société Nationale des Mines et de la Géologie ...,State-owned enterprises & public corporations,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31
544,Ministère des Finances,Central goverment,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31
545,Commune de Doba,Local government,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
4728,ENI IVORY COAST LIMITED,Direction Générale des impôts (DGI),Bonus de production,No,No,,XOF,2454000000,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
4729,TULLOW CI,Direction Générale des impôts (DGI),Bonus de signature,No,No,,XOF,819375000,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
4746,Société Nationale d'Opérations Petrolière de C...,Direction Générale des impôts (DGI),Contribution des patentes,No,No,,XOF,498167552,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
4747,societe ivoiro-suisse abidjanaise de granit (S...,Direction Générale des impôts (DGI),Contribution des patentes,No,No,,XOF,16338189,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
4748,CADERAC. SA,Direction Générale des impôts (DGI),Contribution des patentes,No,No,,XOF,8760605,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31069,Kiaka Gold,Direction Générale des Douanes (DGD),Pénalités (DGD),No,No,,XOF,697364,No,,,,Burkina Faso,BFA,2017,2017-01-01,2017-12-31
31070,Kiaka Gold,Direction Générale des Impôts (DGI),Taxe sur la Valeur Ajoutée (TVA),No,No,,XOF,3900,No,,,,Burkina Faso,BFA,2017,2017-01-01,2017-12-31
31071,Kiaka Gold,Direction Générale des Impôts (DGI),Pénalités (DGI),No,No,,XOF,2100,No,,,,Burkina Faso,BFA,2017,2017-01-01,2017-12-31
31074,Kiaka Gold,Frais de prestation BUMIGEB,Frais de prestation BUMIGEB,No,No,,XOF,17132152,No,,,,Burkina Faso,BFA,2017,2017-01-01,2017-12-31


in both tables


Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code_x,Year,Start Date_x,End Date_x,Company,...,Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,ISO Code_y,Start Date_y,End Date_y
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,1441453501,Afghanistan,AFG,2018,2017-12-21,2018-12-20,Afghan Gas Enterprise,...,,AFN,24493134,No,Not applicable,Not applicable,2018-12-21,AFG,2017-12-21,2018-12-20
1,Ministry of Finance (Revenue Department),Central goverment,Not applicable,1441453501,Afghanistan,AFG,2018,2017-12-21,2018-12-20,Afghan Gas Enterprise,...,,AFN,21656316,No,Not applicable,Not applicable,2018-12-21,AFG,2017-12-21,2018-12-20
2,Ministry of Finance (Revenue Department),Central goverment,Not applicable,1441453501,Afghanistan,AFG,2018,2017-12-21,2018-12-20,Afghan Gas Enterprise,...,,AFN,12325103,No,Not applicable,Not applicable,2018-12-21,AFG,2017-12-21,2018-12-20
3,Ministry of Finance (Revenue Department),Central goverment,Not applicable,1441453501,Afghanistan,AFG,2018,2017-12-21,2018-12-20,Afghan Gas Enterprise,...,,AFN,1047484,No,Not applicable,Not applicable,2018-12-21,AFG,2017-12-21,2018-12-20
4,Ministry of Finance (Revenue Department),Central goverment,Not applicable,1441453501,Afghanistan,AFG,2018,2017-12-21,2018-12-20,CNPCI Watan Oil and Gas Afghanistan limited,...,,AFN,556996,No,Not applicable,Not applicable,2018-12-21,AFG,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30196,Direction Générale Technique des Mines (DGTM),Central goverment,Non applicable,96283,Chad,TCD,2018,2018-01-01,2018-12-31,SOGEM,...,Non applicable,USD,12953,No,Non applicable,Non applicable,,TCD,2018-01-01,2018-12-31
30197,Direction Générale Technique des Mines (DGTM),Central goverment,Non applicable,96283,Chad,TCD,2018,2018-01-01,2018-12-31,ABOURACHID Mining,...,Non applicable,USD,9895,No,Non applicable,Non applicable,,TCD,2018-01-01,2018-12-31
30198,Direction Générale Technique des Mines (DGTM),Central goverment,Non applicable,96283,Chad,TCD,2018,2018-01-01,2018-12-31,MIREDEX,...,Non applicable,USD,9571,No,Non applicable,Non applicable,,TCD,2018-01-01,2018-12-31
30199,Direction Générale Technique des Mines (DGTM),Central goverment,Non applicable,96283,Chad,TCD,2018,2018-01-01,2018-12-31,Manejem Company Ltd,...,Non applicable,USD,4498,No,Non applicable,Non applicable,,TCD,2018-01-01,2018-12-31


In [14]:
print("With duplicate rows removed")

for key, data in compare_tables_drop_duplicates(df_part_3b, df_part_5, common_columns_3b5, common_columns_53b).items():
    print(key)
    display(data)

With duplicate rows removed
in table 1 but not in table 2


Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
4,Ministry of Industry and Commerce,Central goverment,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-24
9,Ministry of Industry and Commerce,Central goverment,Not applicable,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
10,Direction Générale des Impôts (DGI),Central goverment,No applicable,108131178092,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
14,Direction des Participations et de la Privatis...,Central goverment,No applicable,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
18,Others,Central goverment,No applicable,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
...,...,...,...,...,...,...,...,...,...
542,Autorité de Régulation du Secteur pétrolier Av...,Central goverment,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31
543,Société Nationale des Mines et de la Géologie ...,State-owned enterprises & public corporations,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31
544,Ministère des Finances,Central goverment,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31
545,Commune de Doba,Local government,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
4728,ENI IVORY COAST LIMITED,Direction Générale des impôts (DGI),Bonus de production,No,No,,XOF,2454000000,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
4885,SOCIETE DES MINES D'ITY (S M I),Autres,Versements au compte de réhabilitation pour l'...,No,No,,XOF,75657609,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
5081,AMUR MUGOTE & FRERES,Direction des Recettes Provinciales,Autorisation de transport de minerais,No,No,,USD,80710,No,Non applicable,Non applicable,,Democratic Republic of Congo,COD,2017,2017-01-01,2017-12-31
6015,Barrick Pueblo Viejo Dominican Corporation,Ministerio de Energía y Minas (MEM),Tasa por Servicios,No,Yes,,DOP,4000,No,,,1,Dominican Republic,DOM,2018,2018-01-01,2018-12-31
6109,BEB Erdgas und Erdöl GmbH & Co. KG,Municipalities,Trade Tax,No,No,,EUR,19466797,,,,,Germany,DEU,2017,2017-01-01,2017-12-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30531,YPF S.A.,Secretaría de Energía,Regalias,Yes,Yes,Magallanes AM 6,ARS,37731968,No,,,Oil,Argentina,ARG,2018,2018-01-01,2018-12-31
30537,YPF S.A.,Secretaría de Medio Ambiente,Tasa Ambiental Anual,No,No,,ARS,73217,No,,,,Argentina,ARG,2018,2018-01-01,2018-12-31
30928,IAMGOLD Essakane SA,Direction Générale des Douanes (DGD),Droits de Douane et taxes assimilées,No,No,,XOF,17778937001,No,,,,Burkina Faso,BFA,2017,2017-01-01,2017-12-31
30930,IAMGOLD Essakane SA,Direction Générale des Impôts (DGI),Acomptes Provisionnels sur IS (AP - IS),No,No,,XOF,5874863793,No,,,,Burkina Faso,BFA,2017,2017-01-01,2017-12-31


in both tables


Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code_x,Year,Start Date_x,End Date_x,Company,...,Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,ISO Code_y,Start Date_y,End Date_y
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,1441453501,Afghanistan,AFG,2018,2017-12-21,2018-12-20,Afghan Gas Enterprise,...,,AFN,24493134,No,Not applicable,Not applicable,2018-12-21,AFG,2017-12-21,2018-12-20
181,Ministry of Finance (Customs Department),Central goverment,Not applicable,1367722942,Afghanistan,AFG,2018,2017-12-21,2018-12-21,Abaan Rayan Limited,...,,AFN,8953331,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
834,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,1478934310.56,Afghanistan,AFG,2018,2017-12-21,2018-12-22,North Coal Enterprise (NCE),...,EXP 1/2014,AFN,442801100,No,Not applicable,Not applicable,2018-09-22,AFG,2017-12-21,2018-12-20
1144,National Environmental Protection Agency,Central goverment,Not applicable,34000,Afghanistan,AFG,2018,2017-12-21,2018-12-23,استخراج 10000مترمکعب ریگ وجغل توسط شرکت ساختما...,...,,AFN,1000,No,Not applicable,Not applicable,2018-01-11,AFG,2017-12-21,2018-12-20
1153,Ministry of Finance (Revenue Department),Central goverment,Not applicable,1441453501,Afghanistan,AFG,2019,2018-12-21,2019-12-20,Afghan Gas Enterprise,...,,AFN,20000000,Not applicable,Not applicable,Not applicable,,AFG,2018-12-21,2019-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30117,Direction Générale Technique des Mines (DGTM),Central government,Non applicable,51575,Chad,TCD,2017,2017-01-01,2017-12-31,TEKTON MINERAL,...,Non applicable,USD,16332,No,,,,TCD,2017-01-01,2017-12-31
30121,Direction Générale du Trésor et de la Comptabi...,Central goverment,Non applicable,316949335.62,Chad,TCD,2018,2018-01-01,2018-12-31,Petronas,...,Non applicable,USD,103169362,No,Non applicable,Non applicable,,TCD,2018-01-01,2018-12-31
30170,Société des Hydrocarbures du Tchad (SHT),State-owned enterprises & public corporations,Non applicable,680650556.22,Chad,TCD,2018,2018-01-01,2018-12-31,Glencore Energy UK Limited,...,Non applicable,USD,259445592.69,Yes,3803345,Barrels,Ventes (export) des quotes-part huile de l'Eta...,TCD,2018-01-01,2018-12-31
30180,Direction Générale Technique de Pétrole (DGTP),Central goverment,Non applicable,5440559,Chad,TCD,2018,2018-01-01,2018-12-31,Petrochad Mangara,...,Non applicable,USD,1772626,No,Non applicable,Non applicable,,TCD,2018-01-01,2018-12-31


#### Part 3 - Reporting projects' list and Part 5 - Company data

In [15]:
print("With duplicate rows")

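# The project name column is labelled differently in each table, so the matching columns are listed per table
common_columns_3c5 = ["Full project name", "Country", "Year"]
common_columns_53c = ["Project name", "Country", "Year"]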

for key, data in compare_tables(df_part_3c, df_part_5, common_columns_3c5, common_columns_53c).items():
    print(key)
    display(data)

With duplicate rows
in table 1 but not in table 2


Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
80,APL-EP-57,,Jabul Siraj Consortium,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
81,APL-EP-58,,Core Drillers,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
82,APL-EP-59,,Amin Karimzai Campany,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
83,APL-EP-60,,Afghan Talc Limited,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
84,APL-EP-61,,Nabi Afghan Company,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4974,Champs Nya,Concession d'exploitation NYA du 20/07/2017 (C...,"Esso, Petronas,SHT",Crude oil (2709),Production,638135,Barrels,34993372,USD,Chad,TCD,2017,2017-01-01,2017-12-31
4975,Champs Maikeri,Concession d'exploitation Maikeri du 20/07/201...,"Esso, Petronas,SHT",Crude oil (2709),Production,550314,Barrels,30177537,USD,Chad,TCD,2017,2017-01-01,2017-12-31
4976,Champs Timbré,Concession d'exploitation Timbré du 20/07/2017...,"Esso, Petronas,SHT",Crude oil (2709),Production,598515,Barrels,32820733,USD,Chad,TCD,2017,2017-01-01,2017-12-31
4977,Champs MANGARA,"Autorisation Exclusive d'Exploitation, MANGARA...",PCM/Glencore/SHT,Crude oil (2709),Production,2181729,Barrels,119639347,USD,Chad,TCD,2017,2017-01-01,2017-12-31


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
4,Habib Shahab Talc and Marble exploitation and ...,Ministry of Mines and Petroleum (Revenue Depar...,Penalties of Late Payment,,,,AFN,18,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
5,Abaan Rayan Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,8953331,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
6,Abaan Rayan Limited,Ministry of Finance (Customs Department),Fixed Tax on Exports,No,No,,AFN,1074419,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
7,Abaan Rayan Limited,Ministry of Finance (Customs Department),Other Fee on Exports,No,No,,AFN,198084,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
8,Abaan Rayan Limited,Ministry of Finance (Customs Department),Penalty on Exports,No,No,,AFN,31050,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31877,China National Petroleum Corporation Internati...,Société des Hydrocarbures du Tchad (SHT),Redevance sur production collecté par la SHT,No,No,Non applicable,USD,-,Yes,3543915,Barrels,Quotes-parts de l'Etat (Redevance sur producti...,Chad,TCD,2018,2018-01-01,2018-12-31
31878,China National Petroleum Corporation Internati...,Société des Hydrocarbures du Tchad (SHT),Profit Oil collecté par la SHT,No,No,Non applicable,USD,-,Yes,2541955,Barrels,Quotes-parts de l'Etat (Profit Oil SHT- 10%) d...,Chad,TCD,2018,2018-01-01,2018-12-31
31879,Petrochad Mangara,Société des Hydrocarbures du Tchad (SHT),Redevance sur production collecté par la SHT,No,No,Non applicable,USD,-,Yes,545318,Barrels,Quotes-parts de l'Etat (Redevance sur producti...,Chad,TCD,2018,2018-01-01,2018-12-31
31880,Petrochad Mangara,Société des Hydrocarbures du Tchad (SHT),Tax Oil collecté par la SHT,No,No,Non applicable,USD,-,Yes,393777,Barrels,Quotes-parts de l'Etat (Tax Oil SHT) dans le c...,Chad,TCD,2018,2018-01-01,2018-12-31


in both tables


Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,...,Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,ISO Code_y,Start Date_y,End Date_y
0,APL-EP-57,,Jabul Siraj Consortium,,,,,,,Afghanistan,...,APL-EP-57,AFN,378109,No,Not applicable,Not applicable,2018-09-18,AFG,2017-12-21,2018-12-20
1,APL-EP-58,,Core Drillers,,,,,,,Afghanistan,...,APL-EP-58,AFN,367260,No,Not applicable,Not applicable,2018-08-29,AFG,2017-12-21,2018-12-20
2,APL-EP-59,,Amin Karimzai Campany,,,,,,,Afghanistan,...,APL-EP-59,AFN,365746,No,Not applicable,Not applicable,2018-08-28,AFG,2017-12-21,2018-12-20
3,APL-EP-60,,Afghan Talc Limited,,,,,,,Afghanistan,...,APL-EP-60,AFN,366791,No,Not applicable,Not applicable,2018-08-29,AFG,2017-12-21,2018-12-20
4,APL-EP-61,,Nabi Afghan Company,,,,,,,Afghanistan,...,APL-EP-61,AFN,365746,No,Not applicable,Not applicable,2018-08-28,AFG,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23963,Yaramoko,,Roxgold Sanu SA,Gold (7108),Production,4.21,Tonnes,137920000000,XOF,Burkina Faso,...,Yaramoko,XOF,1100000,,,,TRUE,BFA,2020-12-01,2020-12-31
23964,Yaramoko,,Roxgold Sanu SA,Silver (7106),Production,446.49,Kg,1870000000,XOF,Burkina Faso,...,Yaramoko,XOF,6964860000,,,,TRUE,BFA,2020-12-01,2020-12-31
23965,Yaramoko,,Roxgold Sanu SA,Silver (7106),Production,446.49,Kg,1870000000,XOF,Burkina Faso,...,Yaramoko,XOF,2666660000,,,,TRUE,BFA,2020-12-01,2020-12-31
23966,Yaramoko,,Roxgold Sanu SA,Silver (7106),Production,446.49,Kg,1870000000,XOF,Burkina Faso,...,Yaramoko,XOF,224310000,,,,TRUE,BFA,2020-12-01,2020-12-31


In [16]:
print("With duplicate rows removed")

# Reuses common_columns_3c5 and common_columns_53c defined in the previous cell

for key, data in compare_tables_drop_duplicates(df_part_3c, df_part_5, common_columns_3c5, common_columns_53c).items():
    print(key)
    display(data)

With duplicate rows removed
in table 1 but not in table 2


Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
80,APL-EP-57,,Jabul Siraj Consortium,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
81,APL-EP-58,,Core Drillers,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
82,APL-EP-59,,Amin Karimzai Campany,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
83,APL-EP-60,,Afghan Talc Limited,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
84,APL-EP-61,,Nabi Afghan Company,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4974,Champs Nya,Concession d'exploitation NYA du 20/07/2017 (C...,"Esso, Petronas,SHT",Crude oil (2709),Production,638135,Barrels,34993372,USD,Chad,TCD,2017,2017-01-01,2017-12-31
4975,Champs Maikeri,Concession d'exploitation Maikeri du 20/07/201...,"Esso, Petronas,SHT",Crude oil (2709),Production,550314,Barrels,30177537,USD,Chad,TCD,2017,2017-01-01,2017-12-31
4976,Champs Timbré,Concession d'exploitation Timbré du 20/07/2017...,"Esso, Petronas,SHT",Crude oil (2709),Production,598515,Barrels,32820733,USD,Chad,TCD,2017,2017-01-01,2017-12-31
4977,Champs MANGARA,"Autorisation Exclusive d'Exploitation, MANGARA...",PCM/Glencore/SHT,Crude oil (2709),Production,2181729,Barrels,119639347,USD,Chad,TCD,2017,2017-01-01,2017-12-31


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
4,Habib Shahab Talc and Marble exploitation and ...,Ministry of Mines and Petroleum (Revenue Depar...,Penalties of Late Payment,,,,AFN,18,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
384,WESTCO INTERNATIONAL,Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXPL 3/2013,AFN,710962,No,Not applicable,Not applicable,2018-04-22,Afghanistan,AFG,2018,2017-12-21,2018-12-20
453,Hayat Khan,Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,SSML-Kabu 48/2014,AFN,100568,No,Not applicable,Not applicable,2018-07-23,Afghanistan,AFG,2018,2017-12-21,2018-12-20
1153,Aazam Khan Wafa Sherzad Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,110944,Not applicable,Not applicable,Not applicable,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
1175,Abdul Rahman Baba Steel & Iron Company,Ministry of Mines and Petroleum (Revenue Depar...,Surface Rent,Yes,Yes,EXPL 2/2014,AFN,1352198,Not applicable,Not applicable,Not applicable,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31495,NETIANA MINING COMPANY(NMC),Fonds d'Intervention pour l'Environnement (FIE),Versements au Fonds de réhabilitation et de fe...,Yes,No,Nc,XOF,120000000,,,,TRUE,Burkina Faso,BFA,2020,2020-12-01,2020-12-31
31506,GRYPHON MINERALS BURKINA FASO SARL,Direction Générale du Trésor et de la Comptabi...,Taxes superficiaires,Yes,Yes,Nogbele Sud Dierisso Nianka Zeguedougou,XOF,24070000,,,,TRUE,Burkina Faso,BFA,2020,2020-12-01,2020-12-31
31577,SNH,Société Nationale des Hydrocarbures (SNH),Parts d'huile de la SNH-Etat (Petrole),,,,XAF,,Yes,960,Barrels,The value of this in-kind payment form part o...,Cameroon,CMR,2017,2017-01-01,2017-12-31
31716,TEKTON MINERAL,Direction Générale Technique des Mines (DGTM),Appui Institutionnel,No,No,Non applicable,USD,16332,No,,,,Chad,TCD,2017,2017-01-01,2017-12-31


in both tables


Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,...,Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,ISO Code_y,Start Date_y,End Date_y
0,APL-EP-57,,Jabul Siraj Consortium,,,,,,,Afghanistan,...,APL-EP-57,AFN,378109,No,Not applicable,Not applicable,2018-09-18,AFG,2017-12-21,2018-12-20
1,APL-EP-58,,Core Drillers,,,,,,,Afghanistan,...,APL-EP-58,AFN,367260,No,Not applicable,Not applicable,2018-08-29,AFG,2017-12-21,2018-12-20
2,APL-EP-59,,Amin Karimzai Campany,,,,,,,Afghanistan,...,APL-EP-59,AFN,365746,No,Not applicable,Not applicable,2018-08-28,AFG,2017-12-21,2018-12-20
3,APL-EP-60,,Afghan Talc Limited,,,,,,,Afghanistan,...,APL-EP-60,AFN,366791,No,Not applicable,Not applicable,2018-08-29,AFG,2017-12-21,2018-12-20
4,APL-EP-61,,Nabi Afghan Company,,,,,,,Afghanistan,...,APL-EP-61,AFN,365746,No,Not applicable,Not applicable,2018-08-28,AFG,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23926,SEMAFO,,SEMAFO Burkina,Silver (7106),Production,1327.56,Kg,520000000,XOF,Burkina Faso,...,SEMAFO,XOF,9793090000,,,,TRUE,BFA,2020-12-01,2020-12-31
23932,SEMAFO Boungou,,SEMAFO Boungou,Silver (7106),Production,437.16,Kg,170000000,XOF,Burkina Faso,...,SEMAFO Boungou,XOF,7119140000,,,,TRUE,BFA,2020-12-01,2020-12-31
23938,TAPARKO,,SOMITA,Gold (7108),Production,2.92,Tonnes,94480000000,XOF,Burkina Faso,...,TAPARKO,XOF,4781340000,,,,TRUE,BFA,2020-12-01,2020-12-31
23946,WAHGNION GOLD OP SA,,WAHGNION GOLD OP SA,Gold (7108),Production,5.62,Tonnes,181820000000,XOF,Burkina Faso,...,WAHGNION GOLD OP SA,XOF,8578370000,,,,TRUE,BFA,2020-12-01,2020-12-31


### Tests

In [17]:
# Sample test: run get_problematic_rows on a small hand-made table
sample = pd.DataFrame({
    "Full name of agency": ["A", "A", "B", "B", "C", "D", "E"],
    "Agency type": ["pri", "pri", "pri", "soe", "soe", "soe", "soe"],
    "Total reported": [0, 0, 1, 1, 2, 2, 5]
})

columns = ["Full name of agency", "Agency type", "Total reported"]

get_problematic_rows(sample, columns, 2)

Unnamed: 0,Full name of agency,Agency type,Total reported,rowid
0,A,pri,0,0
1,A,pri,0,1
2,B,pri,1,2
3,B,soe,1,3
4,C,soe,2,4
5,D,soe,2,5
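
The table above is consistent with one reading of the check: a row is flagged when it shares its values with another row on at least one 2-column combination, which is why row `E` is left out. The block below is a minimal sketch under that assumption. `flag_combination_duplicates` is a hypothetical name introduced for illustration; it is not the notebook's `get_problematic_rows`, and it does not add the `rowid` column.

```python
from itertools import combinations

import pandas as pd


def flag_combination_duplicates(df, columns, size):
    """Return rows that repeat another row on at least one `size`-column combination.

    Hypothetical helper for illustration, not the notebook's get_problematic_rows.
    """
    flagged = pd.Series(False, index=df.index)
    for combo in combinations(columns, size):
        # keep=False marks every member of a duplicated group, not only the repeats
        flagged |= df.duplicated(subset=list(combo), keep=False)
    return df[flagged]


toy = pd.DataFrame({
    "Full name of agency": ["A", "A", "B", "B", "C", "D", "E"],
    "Agency type": ["pri", "pri", "pri", "soe", "soe", "soe", "soe"],
    "Total reported": [0, 0, 1, 1, 2, 2, 5],
})

flagged_rows = flag_combination_duplicates(toy, list(toy.columns), 2)
print(flagged_rows)
```

Rows `A` repeat on every pair of columns, rows `B` only on name and total, and rows `C` and `D` only on type and total, so all six are returned while `E` is not.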


In [18]:
# Sample data for sample_companies_list
sample_companies_list_data = {
    "Full company name": ["A Corp", "B Corp", "C Corp", "D Corp", "E Corp", "B Corp", "F Corp"],
    "Company type": ["Type1", "Type2", "Type3", "Type1", "Type2", "Type2", "Type3"],
    "Company ID number": [101, 102, 103, 104, 105, 102, 106],
    "Country": ["X", "Y", "Z", "X", "Y", "Y", "Z"],
    "Year": [2020, 2021, 2022, 2023, 2020, 2021, 2023]
}

sample_company_list = pd.DataFrame(sample_companies_list_data)

# Sample data for sample_company_data
sample_company_data_data = {
    "Company": ["A Corp", "B Corp", "C Corp", "E Corp", "F Corp", "C Corp", "G Corp"],
    "Project name": ["P1", "P2", "P3", "P4", "P5", "P3", "P6"],
    "Country": ["X", "Y", "Z", "X", "Y", "Z", "X"],
    "Year": [2020, 2021, 2022, 2023, 2022, 2023, 2021]
}

sample_company_data = pd.DataFrame(sample_company_data_data)

display(sample_company_list)
display(sample_company_data)

Unnamed: 0,Full company name,Company type,Company ID number,Country,Year
0,A Corp,Type1,101,X,2020
1,B Corp,Type2,102,Y,2021
2,C Corp,Type3,103,Z,2022
3,D Corp,Type1,104,X,2023
4,E Corp,Type2,105,Y,2020
5,B Corp,Type2,102,Y,2021
6,F Corp,Type3,106,Z,2023


Unnamed: 0,Company,Project name,Country,Year
0,A Corp,P1,X,2020
1,B Corp,P2,Y,2021
2,C Corp,P3,Z,2022
3,E Corp,P4,X,2023
4,F Corp,P5,Y,2022
5,C Corp,P3,Z,2023
6,G Corp,P6,X,2021


In [19]:
common_columns_df1 = ["Full company name", "Year"]
common_columns_df2 = ["Company", "Year"]

for key, data in compare_tables(sample_company_list, sample_company_data, common_columns_df1, common_columns_df2).items():
    print(key)
    display(data)

in table 1 but not in table 2


Unnamed: 0,Full company name,Company type,Company ID number,Country,Year
3,D Corp,Type1,104,X,2023
4,E Corp,Type2,105,Y,2020
6,F Corp,Type3,106,Z,2023


in table 2 but not in table 1


Unnamed: 0,Company,Project name,Country,Year
3,E Corp,P4,X,2023
4,F Corp,P5,Y,2022
5,C Corp,P3,Z,2023
6,G Corp,P6,X,2021


in both tables


Unnamed: 0,Full company name,Company type,Company ID number,Country_x,Year,Company,Project name,Country_y
0,A Corp,Type1,101,X,2020,A Corp,P1,X
1,B Corp,Type2,102,Y,2021,B Corp,P2,Y
2,C Corp,Type3,103,Z,2022,C Corp,P3,Z
3,B Corp,Type2,102,Y,2021,B Corp,P2,Y


In [20]:
for key, data in compare_tables_drop_duplicates(sample_company_list, sample_company_data, common_columns_df1, common_columns_df2).items():
    print(key)
    display(data)

in table 1 but not in table 2


Unnamed: 0,Full company name,Company type,Company ID number,Country,Year
3,D Corp,Type1,104,X,2023
4,E Corp,Type2,105,Y,2020
6,F Corp,Type3,106,Z,2023


in table 2 but not in table 1


Unnamed: 0,Company,Project name,Country,Year
3,E Corp,P4,X,2023
4,F Corp,P5,Y,2022
5,C Corp,P3,Z,2023
6,G Corp,P6,X,2021


in both tables


Unnamed: 0,Full company name,Company type,Company ID number,Country_x,Year,Company,Project name,Country_y
0,A Corp,Type1,101,X,2020,A Corp,P1,X
1,B Corp,Type2,102,Y,2021,B Corp,P2,Y
2,C Corp,Type3,103,Z,2022,C Corp,P3,Z


In [21]:
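# Compare government revenues (Part 4) with company payments (Part 5), per country
rev_part_4 = df_part_4.copy()
rev_part_5 = df_part_5.copy()

# Strip commas from the Part 4 values so they can be parsed as numbers
rev_part_4["Revenue value"] = rev_part_4["Revenue value"].str.replace(",", "", regex=False)

# Entries that cannot be parsed (e.g. "-") become NaN and are skipped by sum()
rev_part_4["Revenue value"] = pd.to_numeric(rev_part_4["Revenue value"], errors="coerce")
rev_part_5["Revenue value"] = pd.to_numeric(rev_part_5["Revenue value"], errors="coerce")

# Overall gap across all countries (kept for reference):
# 100*(rev_part_4["Revenue value"].sum() - rev_part_5["Revenue value"].sum())/rev_part_4["Revenue value"].sum()

rev_part_4_sum = rev_part_4.groupby(["Country"])["Revenue value"].sum()
rev_part_5_sum = rev_part_5.groupby(["Country"])["Revenue value"].sum()

# Percentage of the Part 4 total that is not matched by Part 5, largest gap first
(100*(rev_part_4_sum - rev_part_5_sum)/rev_part_4_sum).sort_values(ascending=False).reset_index().style.format({'Revenue value': '{:.2f}'})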

Unnamed: 0,Country,Revenue value
0,Albania,77.74
1,Nigeria,57.13
2,Argentina,42.85
3,Suriname,37.46
4,Guyana,33.19
5,Liberia,17.03
6,Zambia,15.99
7,Burkina Faso,15.36
8,Mali,12.83
9,Togo,9.38
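
The figure in each row is `100 * (Part 4 total - Part 5 total) / Part 4 total` for that country. The block below is a small worked example of the same calculation on made-up numbers. The toy frames and the helper name `total_by_country` are assumptions for illustration only, not values or functions from the notebook.

```python
import pandas as pd

# Hypothetical toy data: revenue values stored as text, as in the source CSVs
part_4 = pd.DataFrame({
    "Country": ["X", "X", "Y"],
    "Revenue value": ["1,000", "1,000", "500"],
})
part_5 = pd.DataFrame({
    "Country": ["X", "X", "Y"],
    "Revenue value": ["1,500", "-", "500"],
})


def total_by_country(df):
    """Sum the revenue per country after cleaning the text values."""
    values = df["Revenue value"].astype(str).str.replace(",", "", regex=False)
    # "-" cannot be parsed, becomes NaN, and is then skipped by sum()
    numeric = pd.to_numeric(values, errors="coerce")
    return numeric.groupby(df["Country"]).sum()


gov_total = total_by_country(part_4)      # X: 2000, Y: 500
company_total = total_by_country(part_5)  # X: 1500, Y: 500

pct_gap = 100 * (gov_total - company_total) / gov_total
print(pct_gap.sort_values(ascending=False))
```

For country `X` the gap is `100 * (2000 - 1500) / 2000 = 25`, and for `Y` it is 0. Because unparseable entries are dropped silently, a large gap can reflect missing or non-numeric values as well as a real reporting difference.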


In [22]:
# Absolute difference (Part 4 minus Part 5) for a single country, here Norway
rev_part_4[rev_part_4["Country"]=="Norway"]["Revenue value"].sum() - rev_part_5[rev_part_5["Country"]=="Norway"]["Revenue value"].sum()

5550000.0

In [23]:
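# Look for rows in the companies' list that contain the "#ERROR!" marker in any column
err_part_3a = df_part_3a.copy()

# Keep the original row position so flagged rows can be traced back to the CSV
err_part_3a["rowid"] = range(len(err_part_3a))

# regex=False matches "#ERROR!" as a literal substring
error_rows = err_part_3a[err_part_3a.apply(lambda row: row.astype(str).str.contains('#ERROR!', regex=False).any(), axis=1)]

display(error_rows)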

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date,rowid
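
The empty table above means no row matched. To confirm that such a search does flag rows when the marker is present, the sketch below runs a column-wise variant on a small made-up table. The toy values are hypothetical and not taken from the data.

```python
import pandas as pd

# Hypothetical toy table with the marker in two different columns
toy = pd.DataFrame({
    "Full company name": ["A Corp", "#ERROR!", "C Corp"],
    "Company ID number": [101, 102, "#ERROR!"],
    "Year": [2020, 2021, 2022],
})

# Cast every cell to text, test each column for the literal marker,
# then keep the rows where at least one column matched
has_error = toy.astype(str).apply(
    lambda col: col.str.contains("#ERROR!", regex=False)
).any(axis=1)

error_rows = toy[has_error]
print(error_rows)
```

Working column by column avoids a Python-level call per row, which may be noticeably faster on the larger tables.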
