# Duplication and coherence

## Overall purpose and objective
The overall purpose and objective of the cleaning and verification process is to prepare the data for conversion into a SQLite database (Datasette). As such, the data should follow database best practices.

## Specific purpose of this notebook
This notebook checks the data for duplicates and for coherence across tables. In particular, we want to check:
- Possibly problematic rows (potential duplicates) in each table.
- Whether the companies/agencies in the company and agency lists correspond to the companies/agencies in the company data.

## Assumptions
- Some combinations of fields should be unique.
- The values should be coherent across tables.
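
The two assumptions above can each be phrased as a small check. The following is a minimal sketch on toy data, not the notebook's actual method; the helper names `is_unique` and `missing_references` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd


def is_unique(df, key_columns):
    """Return True when no two rows share the same values in key_columns."""
    return not df.duplicated(subset=list(key_columns)).any()


def missing_references(child, child_col, parent, parent_col):
    """List the values of child[child_col] that never occur in parent[parent_col]."""
    mask = ~child[child_col].isin(parent[parent_col])
    return sorted(child.loc[mask, child_col].dropna().unique())


# Toy data (hypothetical), shaped like the agency list and the company data.
agencies = pd.DataFrame({
    "Full name of agency": ["A", "B", "B"],
    "Year": [2018, 2018, 2019],
})
payments = pd.DataFrame({
    "Government entity": ["A", "C"],
    "Year": [2018, 2018],
})

# Assumption 1: some combinations of fields should be unique.
print(is_unique(agencies, ["Full name of agency"]))
print(is_unique(agencies, ["Full name of agency", "Year"]))

# Assumption 2: values should be coherent across tables.
print(missing_references(payments, "Government entity", agencies, "Full name of agency"))
```

In this toy case the agency name alone is not unique, the name plus year is, and entity "C" is referenced in the payments without appearing in the agency list.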

## Why this matters
- Inserting the data into a proper database and assigning EITI IDs requires high confidence in the data quality to avoid downstream issues. Duplicates and incoherence in the data reduce that confidence.

## Findings


## Analysis

### Problematic rows due to possible duplication

In [1]:
# import libraries and data

import pandas as pd
import numpy as np
from os import path
from functools import reduce
from pprint import pprint
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from itertools import combinations

file_dir = "data/consolidated/"

# load the csvs into data frames
df_part_1 = pd.read_csv(path.join(file_dir, "Part 1 - About.csv"))
df_part_3a = pd.read_csv(path.join(file_dir, "Part 3 - Reporting companies' list.csv"))
df_part_3b = pd.read_csv(path.join(file_dir, "Part 3 - Reporting government entities list.csv"))
df_part_3c = pd.read_csv(path.join(file_dir, "Part 3 - Reporting projects' list.csv"))
df_part_4 = pd.read_csv(path.join(file_dir, "Part 4 - Government revenues.csv"))
df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"))
# df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"), low_memory=False)

df_list = [df_part_1, df_part_3a, df_part_3b, df_part_3c, df_part_4, df_part_5]
df_dict = {"Part 1 - About.csv": df_part_1,
           "Part 3 - Reporting companies' list.csv": df_part_3a,
           "Part 3 - Reporting government entities list.csv": df_part_3b,
           "Part 3 - Reporting projects' list.csv": df_part_3c,
           "Part 4 - Government revenues.csv": df_part_4,
           "Part 5 - Company data.csv": df_part_5
          }

# OPTIONAL COLUMNS
part_3a_opt = ["Stock exchange listing or company website", 
               "Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)"]
part_3b_opt = ["ID number (if applicable)"]
part_5_opt = ["In-kind volume (if applicable)", "Unit (if applicable)", "Comments"]

# only include fields that are non-optional
df_part_1_non_opt = df_part_1.copy()
df_part_3a_non_opt = df_part_3a.copy().drop(columns=part_3a_opt)               
df_part_3b_non_opt = df_part_3b.copy().drop(columns=part_3b_opt)
df_part_3c_non_opt = df_part_3c.copy()
df_part_4_non_opt = df_part_4.copy()
df_part_5_non_opt = df_part_5.copy().drop(columns=part_5_opt)

df_list_non_opt = [df_part_1_non_opt, df_part_3a_non_opt, df_part_3b_non_opt, df_part_3c_non_opt, df_part_4_non_opt, df_part_5_non_opt]
df_dict_non_opt = {"Part 1 - About.csv": df_part_1_non_opt,
           "Part 3 - Reporting companies' list.csv": df_part_3a_non_opt,
           "Part 3 - Reporting government entities list.csv": df_part_3b_non_opt,
           "Part 3 - Reporting projects' list.csv": df_part_3c_non_opt,
           "Part 4 - Government revenues.csv": df_part_4_non_opt,
           "Part 5 - Company data.csv": df_part_5_non_opt
          }

In [2]:
def get_column_combinations(columns, min_cols_diff):
    '''
    Get unique column combinations that will be used for determining problematic rows (i.e. possible duplicates).

    Parameters:
    - columns (list or iterable): A list of column names or identifiers to be used for forming combinations.
    - min_cols_diff (int): The minimum number of columns in a combination. Every combination size from this value up to len(columns) is generated.

    Returns:
    list: A list of tuples representing unique combinations of columns.

    Example:
    >>> columns = ['col1', 'col2', 'col3']
    >>> min_cols_diff = 2
    >>> get_column_combinations(columns, min_cols_diff)
    [('col1', 'col2'), ('col1', 'col3'), ('col2', 'col3'), ('col1', 'col2', 'col3')]
    '''

    all_combinations = []
    for x in range(min_cols_diff, len(columns) + 1):
        all_combinations.extend(combinations(columns, x))

    return all_combinations

# TEST: Uncomment lines below and run
# OUTPUT: 4 unique column combinations
# columns = ["Full name of agency", "Agency type", "Total reported"]
# column_combinations = get_column_combinations(columns, 2)
# print(f'There are {len(column_combinations)} combinations:')
# pprint(column_combinations)

In [3]:
def add_rowid(df):
    '''
    Add a row identifier (rowid) column to the DataFrame.

    Parameters:
    - df (pandas.DataFrame): The input DataFrame.

    Returns:
    pandas.DataFrame: A new DataFrame with an additional 'rowid' column.

    Example:
    >>> data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
    >>> df = pd.DataFrame(data)
    >>> add_rowid(df)
       col1 col2  rowid
    0     1    a      0
    1     2    b      1
    2     3    c      2
    '''

    df_rowid = df.copy()
    df_rowid["rowid"] = range(len(df_rowid))

    return df_rowid    

In [4]:
def get_problematic_rows(df, columns, min_cols_diff):
    '''
    Get unique column combinations and find problematic rows (duplicates) based on specified columns.

    Parameters:
    - df (pandas.DataFrame): The input DataFrame.
    - columns (list or iterable): A list of column names or identifiers to be used for forming combinations.
    - min_cols_diff (int): The minimum number of columns to form unique combinations (i.e. minimum numbers of columns of difference to be considered unique).

    Returns:
    pandas.DataFrame: The rows flagged as duplicates by at least one column combination, each listed once, with an added 'rowid' column.

    Example:
    >>> data = {'col1': [1, 2, 2, 3, 4], 'col2': ['a', 'b', 'b', 'c', 'd']}
    >>> df = pd.DataFrame(data)
    >>> columns_for_combinations = ['col1', 'col2']
    >>> min_cols_diff = 2
    >>> get_problematic_rows(df, columns_for_combinations, min_cols_diff)
       col1 col2  rowid
    1     2    b      1
    2     2    b      2
    '''

    # Step 1: Create a copy of the DataFrame and add rowid
    df_copy = df.copy()
    df_copy['rowid'] = range(len(df_copy))

    # Step 2: Get column combinations
    all_combinations = []
    for x in range(min_cols_diff, len(columns) + 1):
        all_combinations.extend(combinations(columns, x))

    # Step 3: Find problematic rows for each column combination
    problematic_rows = pd.DataFrame()
    for combination in all_combinations:
        duplicated_rows = df_copy[df_copy.duplicated(subset=list(combination), keep=False)]
        problematic_rows = pd.concat([problematic_rows, duplicated_rows], ignore_index=False)

    # Step 4: Get unique rows among problematic rows
    unique_problematic_rows = problematic_rows.drop_duplicates()

    return unique_problematic_rows

In [5]:
# SAMPLE TEST
sample = pd.DataFrame({
    "Full name of agency": ["A", "A", "B", "B", "C", "D", "E"],
    "Agency type": ["pri", "pri", "pri", "soe", "soe", "soe", "soe"],
    "Total reported": [0, 0, 1, 1, 2, 2, 5]
})

columns = ["Full name of agency", "Agency type", "Total reported"]

get_problematic_rows(sample, columns, 2)

Unnamed: 0,Full name of agency,Agency type,Total reported,rowid
0,A,pri,0,0
1,A,pri,0,1
2,B,pri,1,2
3,B,soe,1,3
4,C,soe,2,4
5,D,soe,2,5
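
The sample output above lists rows 0 to 5 but does not say which column combination flagged each row. A minimal sketch of one way to expose that, reusing the same sample data; the helper name `flag_by_combination` is hypothetical and not part of the notebook.

```python
from itertools import combinations

import pandas as pd


def flag_by_combination(df, columns, min_cols):
    """Map each column combination to the index labels it flags as duplicated."""
    flagged = {}
    for size in range(min_cols, len(columns) + 1):
        for combo in combinations(columns, size):
            # keep=False marks every member of a duplicated group, not just the repeats.
            mask = df.duplicated(subset=list(combo), keep=False)
            if mask.any():
                flagged[combo] = df.index[mask].tolist()
    return flagged


sample = pd.DataFrame({
    "Full name of agency": ["A", "A", "B", "B", "C", "D", "E"],
    "Agency type": ["pri", "pri", "pri", "soe", "soe", "soe", "soe"],
    "Total reported": [0, 0, 1, 1, 2, 2, 5],
})
columns = ["Full name of agency", "Agency type", "Total reported"]

report = flag_by_combination(sample, columns, 2)
for combo, rows in report.items():
    print(combo, rows)
```

Row 6 (agency "E") never shares two or more of these fields with another row, which is why it is absent from the output above.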


#### Part 3 - Reporting government entities list

In [6]:
columns = ["Full name of agency", "Agency type", "Total reported"]

get_problematic_rows(df_part_3b, columns, 2)

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date,rowid
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,0
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21,1
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22,2
3,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23,3
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...
479,Other Govt. Agency,Other,,,Tanzania,TZA,2018.0,2017-07-01,2018-06-30,479
488,Les delegations speciales des communes et pref...,Local government,Not applicable,,Togo,TGO,2017.0,2017-01-01,2017-12-31,488
493,Agence Nationale de Gestion de l'Environnement...,Central government,,,Togo,TGO,2018.0,2018-01-01,2018-12-31,493
495,Togolaise des Eaux (TdE),Central government,,,Togo,TGO,2018.0,2018-01-01,2018-12-31,495


### Coherence of companies in Part 3 - Reporting companies' list, Part 3 - Reporting government entities list, and Part 3 - Reporting projects' list with Part 5 - Company data

In [7]:
def compare_tables(df1, df2, common_columns_df1, common_columns_df2):
    '''
    Compare two tables based on specified columns.

    Parameters:
    - df1 (pandas.DataFrame): The first DataFrame.
    - df2 (pandas.DataFrame): The second DataFrame.
    - common_columns_df1 (list): Columns used in df1 to find common rows.
    - common_columns_df2 (list): Columns used in df2 to find common rows.

    Returns:
    dict: A dictionary of pandas.DataFrame objects with three keys:
    - "in table 1 but not in table 2": Rows of df1 whose key values do not appear in df2.
    - "in table 2 but not in table 1": Rows of df2 whose key values do not appear in df1.
    - "in both tables": The inner merge of df1 and df2 on the key columns.

    Example:
    >>> df1 = pd.DataFrame({'Company': ['A', 'B', 'C'], 'Project name': ['P1', 'P2', 'P3'], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2022]})
    >>> df2 = pd.DataFrame({'Full company name': ['A Corp', 'B Corp', 'D Corp'], 'Company type': ['Type1', 'Type2', 'Type3'], 'Company ID number': [101, 102, 103], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2023]})
    >>> common_cols_df1 = ['Country', 'Year']
    >>> common_cols_df2 = ['Country', 'Year']
    >>> result = compare_tables(df1, df2, common_cols_df1, common_cols_df2)
    >>> result['in table 1 but not in table 2']['Company'].tolist()
    ['C']
    >>> result['in table 2 but not in table 1']['Full company name'].tolist()
    ['D Corp']
    >>> len(result['in both tables'])
    2
    '''

    # Make copies of the input dataframes
    df1_copy = df1.copy()
    df2_copy = df2.copy()

    # Find common rows
    common_rows = pd.merge(df1_copy, df2_copy, left_on=common_columns_df1, right_on=common_columns_df2, how='inner')

    # Find unique rows in df1
    unique_rows_df1 = df1_copy[~df1_copy.set_index(common_columns_df1).index.isin(common_rows.set_index(common_columns_df1).index)]

    # Find unique rows in df2
    unique_rows_df2 = df2_copy[~df2_copy.set_index(common_columns_df2).index.isin(common_rows.set_index(common_columns_df2).index)]

    return {"in table 1 but not in table 2": unique_rows_df1,
            "in table 2 but not in table 1": unique_rows_df2,
            "in both tables": common_rows}

In [8]:
def compare_tables_drop_duplicates(df1, df2, common_columns_df1, common_columns_df2):
    '''
    Compare two tables based on specified columns and drop duplicates.

    Parameters:
    - df1 (pandas.DataFrame): The first DataFrame.
    - df2 (pandas.DataFrame): The second DataFrame.
    - common_columns_df1 (list): Columns used in df1 to find common rows.
    - common_columns_df2 (list): Columns used in df2 to find common rows.

    Returns:
    dict: A dictionary of pandas.DataFrame objects with three keys:
    - "in table 1 but not in table 2": Rows of df1 whose key values do not appear in df2, one row per distinct key.
    - "in table 2 but not in table 1": Rows of df2 whose key values do not appear in df1, one row per distinct key.
    - "in both tables": The inner merge of df1 and df2 on the key columns, one row per distinct key.

    Example:
    >>> df1 = pd.DataFrame({'Company': ['A', 'B', 'C'], 'Project name': ['P1', 'P2', 'P3'], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2022]})
    >>> df2 = pd.DataFrame({'Full company name': ['A Corp', 'B Corp', 'D Corp'], 'Company type': ['Type1', 'Type2', 'Type3'], 'Company ID number': [101, 102, 103], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2023]})
    >>> common_cols_df1 = ['Country', 'Year']
    >>> common_cols_df2 = ['Country', 'Year']
    >>> result = compare_tables_drop_duplicates(df1, df2, common_cols_df1, common_cols_df2)
    >>> result['in table 1 but not in table 2']['Company'].tolist()
    ['C']
    >>> result['in table 2 but not in table 1']['Full company name'].tolist()
    ['D Corp']
    >>> len(result['in both tables'])
    2
    '''

    # Find common rows
    common_rows = pd.merge(df1, df2, left_on=common_columns_df1, right_on=common_columns_df2, how='inner')

    # Drop duplicates in common rows
    common_rows = common_rows.drop_duplicates(subset=common_columns_df1)

    # Drop duplicates in unique rows in df1
    unique_rows_df1 = df1[~df1.set_index(common_columns_df1).index.isin(common_rows.set_index(common_columns_df1).index)]
    unique_rows_df1 = unique_rows_df1.drop_duplicates(subset=common_columns_df1)

    # Drop duplicates in unique rows in df2
    unique_rows_df2 = df2[~df2.set_index(common_columns_df2).index.isin(common_rows.set_index(common_columns_df2).index)]
    unique_rows_df2 = unique_rows_df2.drop_duplicates(subset=common_columns_df2)

    return {"in table 1 but not in table 2": unique_rows_df1, 
            "in table 2 but not in table 1": unique_rows_df2,
            "in both tables": common_rows}


In [9]:
# Sample data for sample_companies_list
sample_companies_list_data = {
    "Full company name": ["A Corp", "B Corp", "C Corp", "D Corp", "E Corp", "B Corp", "F Corp"],
    "Company type": ["Type1", "Type2", "Type3", "Type1", "Type2", "Type2", "Type3"],
    "Company ID number": [101, 102, 103, 104, 105, 102, 106],
    "Country": ["X", "Y", "Z", "X", "Y", "Y", "Z"],
    "Year": [2020, 2021, 2022, 2023, 2020, 2021, 2023]
}

sample_company_list = pd.DataFrame(sample_companies_list_data)

# Sample data for sample_company_data
sample_company_data_data = {
    "Company": ["A Corp", "B Corp", "C Corp", "E Corp", "F Corp", "C Corp", "G Corp"],
    "Project name": ["P1", "P2", "P3", "P4", "P5", "P3", "P6"],
    "Country": ["X", "Y", "Z", "X", "Y", "Z", "X"],
    "Year": [2020, 2021, 2022, 2023, 2022, 2023, 2021]
}

sample_company_data = pd.DataFrame(sample_company_data_data)

display(sample_company_list)
display(sample_company_data)

Unnamed: 0,Full company name,Company type,Company ID number,Country,Year
0,A Corp,Type1,101,X,2020
1,B Corp,Type2,102,Y,2021
2,C Corp,Type3,103,Z,2022
3,D Corp,Type1,104,X,2023
4,E Corp,Type2,105,Y,2020
5,B Corp,Type2,102,Y,2021
6,F Corp,Type3,106,Z,2023


Unnamed: 0,Company,Project name,Country,Year
0,A Corp,P1,X,2020
1,B Corp,P2,Y,2021
2,C Corp,P3,Z,2022
3,E Corp,P4,X,2023
4,F Corp,P5,Y,2022
5,C Corp,P3,Z,2023
6,G Corp,P6,X,2021


In [10]:
common_columns_df1 = ["Full company name", "Year"]
common_columns_df2 = ["Company", "Year"]

for key, data in compare_tables(sample_company_list, sample_company_data, common_columns_df1, common_columns_df2).items():
    print(key)
    display(data)

in table 1 but not in table 2


Unnamed: 0,Full company name,Company type,Company ID number,Country,Year
3,D Corp,Type1,104,X,2023
4,E Corp,Type2,105,Y,2020
6,F Corp,Type3,106,Z,2023


in table 2 but not in table 1


Unnamed: 0,Company,Project name,Country,Year
3,E Corp,P4,X,2023
4,F Corp,P5,Y,2022
5,C Corp,P3,Z,2023
6,G Corp,P6,X,2021


in both tables


Unnamed: 0,Full company name,Company type,Company ID number,Country_x,Year,Company,Project name,Country_y
0,A Corp,Type1,101,X,2020,A Corp,P1,X
1,B Corp,Type2,102,Y,2021,B Corp,P2,Y
2,C Corp,Type3,103,Z,2022,C Corp,P3,Z
3,B Corp,Type2,102,Y,2021,B Corp,P2,Y


In [11]:
for key, data in compare_tables_drop_duplicates(sample_company_list, sample_company_data, common_columns_df1, common_columns_df2).items():
    print(key)
    display(data)

in table 1 but not in table 2


Unnamed: 0,Full company name,Company type,Company ID number,Country,Year
3,D Corp,Type1,104,X,2023
4,E Corp,Type2,105,Y,2020
6,F Corp,Type3,106,Z,2023


in table 2 but not in table 1


Unnamed: 0,Company,Project name,Country,Year
3,E Corp,P4,X,2023
4,F Corp,P5,Y,2022
5,C Corp,P3,Z,2023
6,G Corp,P6,X,2021


in both tables


Unnamed: 0,Full company name,Company type,Company ID number,Country_x,Year,Company,Project name,Country_y
0,A Corp,Type1,101,X,2020,A Corp,P1,X
1,B Corp,Type2,102,Y,2021,B Corp,P2,Y
2,C Corp,Type3,103,Z,2022,C Corp,P3,Z
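
The three-way split shown in the sample outputs above can also be sketched with the `indicator` argument of `pd.merge`, which labels every key as `left_only`, `right_only`, or `both` in a single outer merge. This is an alternative sketch, not the notebook's implementation; the helper name `split_keys` is hypothetical, and it returns only the distinct key values rather than full rows.

```python
import pandas as pd


def split_keys(df1, df2, keys_df1, keys_df2):
    """Classify each distinct key tuple as left-only, right-only, or shared."""
    left = df1[keys_df1].drop_duplicates()
    # Rename the right-hand key columns so both sides share the same names.
    right = df2[keys_df2].drop_duplicates().rename(columns=dict(zip(keys_df2, keys_df1)))
    merged = left.merge(right, on=keys_df1, how="outer", indicator=True)
    labels = {
        "left_only": "in table 1 but not in table 2",
        "right_only": "in table 2 but not in table 1",
        "both": "in both tables",
    }
    return {
        label: merged.loc[merged["_merge"] == flag, keys_df1].reset_index(drop=True)
        for flag, label in labels.items()
    }


# Same sample data as in the cells above.
sample_company_list = pd.DataFrame({
    "Full company name": ["A Corp", "B Corp", "C Corp", "D Corp", "E Corp", "B Corp", "F Corp"],
    "Year": [2020, 2021, 2022, 2023, 2020, 2021, 2023],
})
sample_company_data = pd.DataFrame({
    "Company": ["A Corp", "B Corp", "C Corp", "E Corp", "F Corp", "C Corp", "G Corp"],
    "Year": [2020, 2021, 2022, 2023, 2022, 2023, 2021],
})

result = split_keys(sample_company_list, sample_company_data,
                    ["Full company name", "Year"], ["Company", "Year"])
for key, data in result.items():
    print(key)
    print(data)
```

Working on distinct keys keeps the merge small, which may matter when one side has many rows per key, as the company data does.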


#### Part 3 - Reporting companies' list and Part 5 - Company data

In [12]:
common_columns_3a5 = ["Full company name"]
common_columns_53a = ["Company"]

for key, data in compare_tables(df_part_3a, df_part_5, common_columns_3a5, common_columns_53a).items():
    print(key)
    display(data)

in table 1 but not in table 2


Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
7,Abed Hasan Zadran Limited,Private,9005801197,Other,Coal,,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
9,Afghan Shinink Mines Extraction and Processing,Private,9002202316,Other,Talc,,,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
95,"احمد علی ولد خداداد, احمدعلی Ahamd Ali Son of ...",Private,9001263814,Other,Construction stone,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
131,شرکت استخراج معادن افغان اکتیف لمیتد Afghan Ac...,Private,9001353375,Other,Chromite,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
138,شرکت استخراج معادن ذغال سنک افراسیاب Afrasyab ...,Private,9001505461,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3377,Centrica North Sea Oil Ltd (NCMA4),Private,100027000-1,Oil & Gas,"Oil, Gas",,,,Trinidad and Tobago,TTO,2018,2017-10-01,2018-09-30
3378,Centrica Resources Ltd (BLK22),Private,100006133-9,Oil & Gas,"Oil, Gas",,,,Trinidad and Tobago,TTO,2018,2017-10-01,2018-09-30
3388,NGC E&P Investments (Netherlands) B.V.,State-owned enterprises & public corporations,115137-2,Oil & Gas,"Oil, Gas",,,,Trinidad and Tobago,TTO,2018,2017-10-01,2018-09-30
3396,Repsol Angostura Limited,Private,100040582-6,Oil & Gas,"Oil, Gas",,,,Trinidad and Tobago,TTO,2018,2017-10-01,2018-09-30


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
29,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,1192667.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
30,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Fixed Tax on Exports,No,No,,AFN,143121.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
31,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Other Fee on Exports,No,No,,AFN,25954.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
32,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Penalty on Exports,No,No,,AFN,6350.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
37,"Afghan Shiinink, Mines Extraction and Processing",Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,35222.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32428,Royal Dutch Shell plc,Oil & Gas Authority (OGA),Oil & Gas Authority (OGA) Levy,Yes,Yes,P84,GBP,99220.0,Not applicable,Not applicable,Not applicable,,United Kingdom,GBR,2021,2021-01-01,2021-12-31
32429,Royal Dutch Shell plc,Oil & Gas Authority (OGA),Oil & Gas Authority (OGA) Levy,Yes,Yes,P88,GBP,99220.0,Not applicable,Not applicable,Not applicable,,United Kingdom,GBR,2021,2021-01-01,2021-12-31
32430,Royal Dutch Shell plc,Oil & Gas Authority (OGA),Oil & Gas Authority (OGA) Levy,Yes,Yes,P886,GBP,99220.0,Not applicable,Not applicable,Not applicable,,United Kingdom,GBR,2021,2021-01-01,2021-12-31
32431,Royal Dutch Shell plc,Oil & Gas Authority (OGA),Oil & Gas Authority (OGA) Levy,Yes,Yes,P96,GBP,99220.0,Not applicable,Not applicable,Not applicable,,United Kingdom,GBR,2021,2021-01-01,2021-12-31


in both tables


Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country_x,ISO Code_x,...,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country_y,ISO Code_y,Year_y,Start Date_y,End Date_y
0,Amin Karimzai Campany,Private,1007815085,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,...,365746.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
1,Habib Shahab Talc and Marble exploitation and ...,Private,1013655012,Other,"Talc, Construction stone",Not applicable,Not available,#ERROR!,Afghanistan,AFG,...,18.0,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
2,Habib Shahab Talc and Marble exploitation and ...,Private,1013655012,Other,"Talc, Construction stone",Not applicable,Not available,#ERROR!,Afghanistan,AFG,...,365746.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
3,Habib Shahab Talc and Marble exploitation and ...,Private,1013655012,Other,"Talc, Construction stone",Not applicable,Not available,#ERROR!,Afghanistan,AFG,...,53675.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
4,Habib Shahab Talc and Marble exploitation and ...,Private,1013655012,Other,"Talc, Construction stone",Not applicable,Not available,#ERROR!,Afghanistan,AFG,...,8052.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52555,ZCCM INVESTMENTS HOLDINGS PLC,State-owned enterprises & public corporations,1001761145,Mining,"Ag, AQM, Au, Be3Al2(SiO3)6, Co, Cu, GRT, LST, ...",www.zccm-ih.com.zm/,,95082467.74,Zambia,ZMB,...,719894.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
52556,ZCCM INVESTMENTS HOLDINGS PLC,State-owned enterprises & public corporations,1001761145,Mining,"Ag, AQM, Au, Be3Al2(SiO3)6, Co, Cu, GRT, LST, ...",www.zccm-ih.com.zm/,,95082467.74,Zambia,ZMB,...,7500.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
52557,ZCCM INVESTMENTS HOLDINGS PLC,State-owned enterprises & public corporations,1001761145,Mining,"Ag, AQM, Au, Be3Al2(SiO3)6, Co, Cu, GRT, LST, ...",www.zccm-ih.com.zm/,,95082467.74,Zambia,ZMB,...,32781.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
52558,ZCCM INVESTMENTS HOLDINGS PLC,State-owned enterprises & public corporations,1001761145,Mining,"Ag, AQM, Au, Be3Al2(SiO3)6, Co, Cu, GRT, LST, ...",www.zccm-ih.com.zm/,,95082467.74,Zambia,ZMB,...,23178862.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31


In [13]:
print("With duplicate rows removed")

common_columns_3a5 = ["Full company name"]
common_columns_53a = ["Company"]

for key, data in compare_tables_drop_duplicates(df_part_3a, df_part_5, common_columns_3a5, common_columns_53a).items():
    print(key)
    display(data)

With duplicate rows removed
in table 1 but not in table 2


Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
7,Abed Hasan Zadran Limited,Private,9005801197,Other,Coal,,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
9,Afghan Shinink Mines Extraction and Processing,Private,9002202316,Other,Talc,,,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
95,"احمد علی ولد خداداد, احمدعلی Ahamd Ali Son of ...",Private,9001263814,Other,Construction stone,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
131,شرکت استخراج معادن افغان اکتیف لمیتد Afghan Ac...,Private,9001353375,Other,Chromite,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
138,شرکت استخراج معادن ذغال سنک افراسیاب Afrasyab ...,Private,9001505461,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3328,Centrica North Sea Oil Ltd (NCMA4),Private,100027000-1,Oil & Gas,"Oil, Gas",,,,Trinidad and Tobago,TTO,2017,2016-10-01,2017-09-30
3329,Centrica Resources Ltd (BLK22),Private,100006133-9,Oil & Gas,"Oil, Gas",,,,Trinidad and Tobago,TTO,2017,2016-10-01,2017-09-30
3339,NGC E&P Investments (Netherlands) B.V.,State-owned enterprises & public corporations,115137-2,Oil & Gas,"Oil, Gas",,,,Trinidad and Tobago,TTO,2017,2016-10-01,2017-09-30
3347,Repsol Angostura Limited,Private,100040582-6,Oil & Gas,"Oil, Gas",,,,Trinidad and Tobago,TTO,2017,2016-10-01,2017-09-30


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
29,Abid Hassan Zadran Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,1.192667e+06,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
37,"Afghan Shiinink, Mines Extraction and Processing",Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,3.522200e+04,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
41,"Afghan Shinink, Mines Extraction and Processing",Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,6.133600e+04,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
46,Afghan Talc Limited Joint Venture,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,3.162510e+05,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
92,Arif Shahaab Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,1.085044e+06,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9062,Autres acheteurs,Société Nationale d'Opérations Pétrolières de ...,Profit Oil et Cost Oil Etat Associé,No,No,,XOF,2.660975e+10,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
10346,Neptune Energy Deutschland GmbH\n(former: Engi...,Tax Offices,Corporation Tax,No,No,,EUR,4.272750e+05,,,,,Germany,DEU,2017,2017-01-01,2017-12-31
10360,Wintershall GmbH\n(now: Wintershall DEA Deutsc...,Mining Authorities,Mining and Extraction Royalties,No,No,,EUR,5.002564e+07,,,,,Germany,DEU,2017,2017-01-01,2017-12-31
15874,OEL,Mineral Resources and Petroleum Authority,License fee for exploitation and exploration o...,Yes,No,,MNT,2.499000e+06,No,0,,,Mongolia,MNG,2018,1/1/2018,12/31/2018


in both tables


Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country_x,ISO Code_x,...,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country_y,ISO Code_y,Year_y,Start Date_y,End Date_y
0,Amin Karimzai Campany,Private,1007815085,Other,Talc,Not applicable,Not available,#ERROR!,Afghanistan,AFG,...,365746.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
1,Habib Shahab Talc and Marble exploitation and ...,Private,1013655012,Other,"Talc, Construction stone",Not applicable,Not available,#ERROR!,Afghanistan,AFG,...,18.0,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
5,Abaan Rayan Limited,Private,9004655032,Other,Coal,,Not available,#ERROR!,Afghanistan,AFG,...,8953331.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
9,Abas Ghaznavi Limited,Private,9001935742,Other,Coal,,Not available,#ERROR!,Afghanistan,AFG,...,23473826.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
17,Abdul Fatah,Private,9001846329,Other,Construction stone,Not applicable,Not available,#ERROR!,Afghanistan,AFG,...,89598.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52512,FIRST QUANTUM MINING AND OPERATIONS LTD BM M S,Private,1001656040,Mining,"Cu, SLF",https://www.first-quantum.com/English/our-oper...,,316715805.31,Zambia,ZMB,...,425101.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
52519,MOPANI COPPER MINES PLC,Private,1001630233,Mining,Cu,,,686978877.11,Zambia,ZMB,...,169357.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
52532,CHAMBISHI COPPER SMELTER LIMITED,Private,1001831030,Mining,Cu,,,1178403060.04,Zambia,ZMB,...,8577.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
52541,MAAMBA COLLIERIES LIMITED,Private,1001594184,Mining,"COA, PYR",http://www.maambacoal.com/,,154183994.89,Zambia,ZMB,...,59452.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
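
Some names that appear on only one side in the outputs above look like spelling variants of the same company, for example "Abed Hasan Zadran Limited" in the companies' list and "Abid Hassan Zadran Limited" in the company data. A minimal sketch of how such near matches could be surfaced for manual review with the standard-library `difflib`; the helper name `suggest_matches` and the 0.8 cutoff are assumptions, and the names are copied from the outputs above.

```python
from difflib import get_close_matches


def suggest_matches(names_left, names_right, cutoff=0.8):
    """For each left-hand name, list right-hand names that are close in spelling."""
    # Compare case-insensitively, but report the original spelling.
    right_by_key = {}
    for name in names_right:
        right_by_key.setdefault(name.casefold().strip(), name)

    suggestions = {}
    for name in names_left:
        close = get_close_matches(name.casefold().strip(), list(right_by_key),
                                  n=3, cutoff=cutoff)
        if close:
            suggestions[name] = [right_by_key[key] for key in close]
    return suggestions


only_in_list = [
    "Abed Hasan Zadran Limited",
    "Afghan Shinink Mines Extraction and Processing",
    "Repsol Angostura Limited",
]
only_in_data = [
    "Abid Hassan Zadran Limited",
    "Afghan Shiinink, Mines Extraction and Processing",
    "Afghan Shinink, Mines Extraction and Processing",
    "Royal Dutch Shell plc",
]

suggestions = suggest_matches(only_in_list, only_in_data)
for name, candidates in suggestions.items():
    print(name, "->", candidates)
```

The suggestions are candidates only. A similarity score cannot confirm that two names refer to the same legal entity, so each pair would still need checking against the company ID number or another identifier.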


#### Part 3 - Reporting government entities list and Part 5 - Company data

In [14]:
print("With duplicate rows")

common_columns_3b5 = ["Full name of agency", "Country", "Year"]
common_columns_53b = ["Government entity", "Country", "Year"]

for key, data in compare_tables(df_part_3b, df_part_5, common_columns_3b5, common_columns_53b).items():
    print(key)
    display(data)

With duplicate rows
in table 1 but not in table 2


Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24
9,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2019.0,2018-12-21,2019-12-20
13,Department of Budget and Management (DBM),Central goverment,000-449-457-000,,Philippines,PHL,2018.0,2018-01-01,2018-12-31
17,Philippine Natioanl Oil Company (PNOC),State-owned enterprises & public corporations,000-169-576-000,,Philippines,PHL,2018.0,2018-01-01,2018-12-31
18,Philippine Minding Development Corporation (PDMC),State-owned enterprises & public corporations,225-860-806-000,,Philippines,PHL,2018.0,2018-01-01,2018-12-31
...,...,...,...,...,...,...,...,...,...
497,Communes et préfectures des localités minières,Local government,,,Togo,TGO,2018.0,2018-01-01,2018-12-31
514,"Ministry for Development of Economy, Trade and...",State government,37508596,#ERROR!,Ukraine,UKR,2019.0,2019-01-01,2019-12-31
518,"Ministry for Development of Economy, Trade and...",State government,37508596,#ERROR!,Ukraine,UKR,2020.0,2020-01-01,2020-12-31
539,Environmental Protection Fund,Other,,2276581.33,Zambia,ZMB,2017.0,2017-01-01,2017-12-31


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
7589,YPF S.A.,Secretaría de Energía,Regalias,Yes,Yes,Magallanes AM 6,ARS,3.773197e+07,No,,,Oil,Argentina,ARG,2018,1/1/2018,12/31/2018
7590,YPF S.A.,Secretaría de Energía,Regalias,Yes,Yes,Magallanes AM 6,ARS,4.649846e+06,No,,,Gasolina,Argentina,ARG,2018,1/1/2018,12/31/2018
7591,YPF S.A.,Secretaría de Energía,Regalias,Yes,Yes,Magallanes AM 6,ARS,4.506175e+07,No,,,Gas,Argentina,ARG,2018,1/1/2018,12/31/2018
7592,YPF S.A.,Secretaría de Energía,Canon de Permisos de Explotacion offshore,Yes,Yes,Magallanes AM 6,ARS,6.546600e+04,No,,,,Argentina,ARG,2018,1/1/2018,12/31/2018
7593,YPF S.A.,Secretaría de Energía,Canon de Permisos de Explotacion offshore,Yes,Yes,Enarsa E1,ARS,1.368305e+07,No,,,Pago retroactivo 2012,Argentina,ARG,2018,1/1/2018,12/31/2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22180,Mercury Mining Investment Ltd,Mining Inspectorate Division,Mining titles(s) annual service fees,No,No,,USD,8.183300e+02,No,,,,Nigeria,NGA,2018,1/1/2018,12/31/2018
22181,Mercury Mining Investment Ltd,Mining Inspectorate Division,Royalty,No,No,,USD,3.848066e+04,No,,,,Nigeria,NGA,2018,1/1/2018,12/31/2018
22186,Mothercat Limited,Mining Inspectorate Division,Mining titles(s) annual service fees,No,No,,USD,2.945990e+03,No,,,,Nigeria,NGA,2018,1/1/2018,12/31/2018
22187,Mothercat Limited,Mining Inspectorate Division,Royalty,No,No,,USD,9.751630e+04,No,,,,Nigeria,NGA,2018,1/1/2018,12/31/2018


in both tables


Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code_x,Year,Start Date_x,End Date_x,Company,...,Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,ISO Code_y,Start Date_y,End Date_y
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,Afghan Gas Enterprise,...,,AFN,24493134.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
1,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,Afghan Gas Enterprise,...,,AFN,21656316.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
2,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,Afghan Gas Enterprise,...,,AFN,12325103.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
3,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,Afghan Gas Enterprise,...,,AFN,1047484.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
4,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,CNPCI Watan Oil and Gas Afghanistan limited,...,,AFN,556996.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32361,Ministry of Lands,Central goverment,,1756688.47,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,CHAMBISHI COPPER SMELTER LIMITED,...,,ZMW,583.0,No,,,,ZMB,2018-01-01,2018-12-31
32362,Ministry of Lands,Central goverment,,1756688.47,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,MAAMBA COLLIERIES LIMITED,...,,ZMW,86562.0,No,,,,ZMB,2018-01-01,2018-12-31
32363,Ministry of Lands,Central goverment,,1756688.47,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,MAAMBA COLLIERIES LIMITED,...,,ZMW,1069732.0,No,,,,ZMB,2018-01-01,2018-12-31
32364,IDC,Other,,69205641.67,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,ZCCM INVESTMENTS HOLDINGS PLC,...,,ZMW,69205642.0,No,,,,ZMB,2018-01-01,2018-12-31
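
The definition of `compare_tables` is not shown in this section. For readers landing here directly, the three-way split it reports can be approximated with pandas merges. The sketch below is an assumption about the general technique, not the notebook's actual implementation: `compare_tables_sketch` and `_anti_join` are hypothetical names, and the toy frames are illustrative rather than taken from the EITI files.

```python
import pandas as pd


def _anti_join(left, right, left_keys, right_keys):
    """Rows of `left` whose key combination never appears in `right` (hypothetical helper)."""
    # Distinct key combinations from the right table, renamed to the left key names.
    lookup = right[right_keys].drop_duplicates().rename(
        columns=dict(zip(right_keys, left_keys))
    )
    merged = left.merge(lookup, on=left_keys, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


def compare_tables_sketch(df1, df2, keys1, keys2):
    """Hypothetical stand-in for compare_tables: split rows by key membership."""
    return {
        "in table 1 but not in table 2": _anti_join(df1, df2, keys1, keys2),
        "in table 2 but not in table 1": _anti_join(df2, df1, keys2, keys1),
        "in both tables": df1.merge(df2, left_on=keys1, right_on=keys2, how="inner"),
    }


# Toy data only: Year is float on one side and int on the other, as in the output above.
agencies = pd.DataFrame({
    "Full name of agency": ["Agency A", "Agency B"],
    "Country": ["X", "X"],
    "Year": [2018.0, 2018.0],
})
payments = pd.DataFrame({
    "Government entity": ["Agency A", "Agency A", "Agency C"],
    "Country": ["X", "X", "X"],
    "Year": [2018, 2018, 2018],
    "Revenue value": [1.0, 2.0, 3.0],
})

result = compare_tables_sketch(
    agencies, payments,
    ["Full name of agency", "Country", "Year"],
    ["Government entity", "Country", "Year"],
)
for label, frame in result.items():
    print(label, len(frame))
```

Unlike the output above, this sketch does not preserve the original row index, because `merge` returns a freshly numbered frame.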


In [15]:
print("With duplicate rows removed")

for key, data in compare_tables_drop_duplicates(df_part_3b, df_part_5, common_columns_3b5, common_columns_53b).items():
    print(key)
    display(data)

With duplicate rows removed
in table 1 but not in table 2


Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24
9,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2019.0,2018-12-21,2019-12-20
13,Department of Budget and Management (DBM),Central goverment,000-449-457-000,,Philippines,PHL,2018.0,2018-01-01,2018-12-31
17,Philippine Natioanl Oil Company (PNOC),State-owned enterprises & public corporations,000-169-576-000,,Philippines,PHL,2018.0,2018-01-01,2018-12-31
18,Philippine Minding Development Corporation (PDMC),State-owned enterprises & public corporations,225-860-806-000,,Philippines,PHL,2018.0,2018-01-01,2018-12-31
...,...,...,...,...,...,...,...,...,...
497,Communes et préfectures des localités minières,Local government,,,Togo,TGO,2018.0,2018-01-01,2018-12-31
514,"Ministry for Development of Economy, Trade and...",State government,37508596,#ERROR!,Ukraine,UKR,2019.0,2019-01-01,2019-12-31
518,"Ministry for Development of Economy, Trade and...",State government,37508596,#ERROR!,Ukraine,UKR,2020.0,2020-01-01,2020-12-31
539,Environmental Protection Fund,Other,,2276581.33,Zambia,ZMB,2017.0,2017-01-01,2017-12-31


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
7589,YPF S.A.,Secretaría de Energía,Regalias,Yes,Yes,Magallanes AM 6,ARS,37731970.0,No,,,Oil,Argentina,ARG,2018,1/1/2018,12/31/2018
7595,YPF S.A.,Secretaría de Medio Ambiente,Tasa Ambiental Anual,No,No,,ARS,73217.0,No,,,,Argentina,ARG,2018,1/1/2018,12/31/2018
7707,"Add new rows as necessary, right click the row...",,,,,,,,,,,,Argentina,ARG,2018,1/1/2018,12/31/2018
7987,IAMGOLD Essakane SA,Direction Générale des Douanes (DGD),Droits de Douane et taxes assimilées,No,No,,XOF,17778940000.0,No,,,,Burkina Faso,BFA,2017,1/1/2017,12/31/2017
7989,IAMGOLD Essakane SA,Direction Générale des Impôts (DGI),Acomptes Provisionnels sur IS (AP - IS),No,No,,XOF,5874864000.0,No,,,,Burkina Faso,BFA,2017,1/1/2017,12/31/2017
8000,IAMGOLD Essakane SA,Frais de prestation BUMIGEB,Frais de prestation BUMIGEB,No,No,,XOF,1235000.0,No,,,,Burkina Faso,BFA,2017,1/1/2017,12/31/2017
8941,CNR INTERNATIONAL,Société Nationale d'Opérations Pétrolières de ...,Besoins nationaux,No,No,,XOF,7556037000.0,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
8942,ENI IVORY COAST LIMITED,Direction Générale des impôts (DGI),Bonus de production,No,No,,XOF,2454000000.0,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
8944,Autres sociétés non incluses dans le périmètre...,Société pour le Développement Minier de la Cot...,Cession des parts de la SODEMI dans SMI,No,No,,XOF,32057000000.0,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31
8946,TOTAL E & P,Direction Générale des Mines et de la Géologie...,Contribution à la formation,No,No,,XOF,3073151000.0,,,,,Cote d'Ivoire,CIV,2017,2017-01-01,2017-12-31


in both tables


Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code_x,Year,Start Date_x,End Date_x,Company,...,Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,ISO Code_y,Start Date_y,End Date_y
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,Afghan Gas Enterprise,...,,AFN,24493134.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
181,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21,Abaan Rayan Limited,...,,AFN,8953331.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
834,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22,North Coal Enterprise (NCE),...,EXP 1/2014,AFN,442801100.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
1144,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23,استخراج 10000مترمکعب ریگ وجغل توسط شرکت ساختما...,...,,AFN,1000.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
1153,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2019.0,2018-12-21,2019-12-20,Afghan Gas Enterprise,...,,AFN,20000000.0,Not applicable,Not applicable,Not applicable,,AFG,2018-12-21,2019-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32322,Local Councils,Local government,,225046694.88,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,KANSANSHI MINING PLC,...,,ZMW,25324000.0,No,,,,ZMB,2018-01-01,2018-12-31
32334,Ministry of Mines and Minerals Development,Central goverment,,48229839.93,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,KANSANSHI MINING PLC,...,7057 HQ LML,ZMW,421032.0,No,,,,ZMB,2018-01-01,2018-12-31
32359,Ministry of Lands,Central goverment,,1756688.47,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,KANSANSHI MINING PLC,...,,ZMW,573104.0,No,,,,ZMB,2018-01-01,2018-12-31
32364,IDC,Other,,69205641.67,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,ZCCM INVESTMENTS HOLDINGS PLC,...,,ZMW,69205642.0,No,,,,ZMB,2018-01-01,2018-12-31
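
The outputs above contain values that look like spreadsheet residue rather than data: `#ERROR!` in `Total reported`, and a `Company` value beginning with "Add new rows as necessary", which reads like a template instruction. A minimal sketch for flagging such rows before comparing tables follows. The function name `flag_placeholders` and the prefix list are assumptions for illustration, and the sample frame is toy data.

```python
import pandas as pd


def flag_placeholders(df, checks):
    """Boolean mask: True where any checked column starts with a listed prefix."""
    mask = pd.Series(False, index=df.index)
    for column, prefixes in checks.items():
        # astype(str) turns missing values into the text "nan", which matches no prefix here.
        text = df[column].astype(str)
        for prefix in prefixes:
            mask = mask | text.str.startswith(prefix)
    return mask


# Toy frame combining both kinds of residue seen in the outputs above.
sample = pd.DataFrame({
    "Company": ["YPF S.A.", "Add new rows as necessary, right click the row...", None],
    "Total reported": ["1756688.47", None, "#ERROR!"],
})
checks = {
    "Company": ["Add new rows as necessary"],
    "Total reported": ["#ERROR!"],
}

flags = flag_placeholders(sample, checks)
print(sample[flags])
```

Flagged rows can then be reviewed by hand or excluded before the comparison is rerun.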


#### Part 3 - Reporting projects' list and Part 5 - Company data

In [19]:
print("With duplicate rows")

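# Key columns used to match projects across the two tables.
# The project name column differs between tables; Country and Year are shared.
common_columns_3c5 = ["Full project name", "Country", "Year"]  # keys in Part 3 (projects list)
common_columns_53c = ["Project name", "Country", "Year"]       # keys in Part 5 (company data)

# compare_tables returns a mapping of label -> DataFrame; show each subset in turn.
for key, data in compare_tables(df_part_3c, df_part_5, common_columns_3c5, common_columns_53c).items():
    print(key)
    display(data)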

With duplicate rows
in table 1 but not in table 2


Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
80,APL-EP-57,,Jabul Siraj Consortium,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
81,APL-EP-58,,Core Drillers,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
82,APL-EP-59,,Amin Karimzai Campany,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
83,APL-EP-60,,Afghan Talc Limited,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
84,APL-EP-61,,Nabi Afghan Company,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6237,"Luanshya, Copperbelt",21996-HQ-SEL,LUANSHYA COPPER MINE,Copper (2603),Production,50362.9,Tonnes,328509713.17,USD,Zambia,ZMB,2018,2018-01-01,2018-12-31
6238,Chambishi Mine,7069-HQ-LML,NFC AFRICA MINING PLC,Copper (2603),Production,27644.02,Tonnes,180317831.79,USD,Zambia,ZMB,2018,2018-01-01,2018-12-31
6239,"Copperbelt, Chililabombwe",21996-HQ-SEL,LUBAMBE COPPER MINE LTD,Copper (2603),Production,22074.49,Tonnes,143988648.92,USD,Zambia,ZMB,2018,2018-01-01,2018-12-31
6240,Chibuluma Mine,7064-HQ-LML \n7065-HQ-LML,CHIBULUMA MINES PLC,Copper (2603),Production,11258.52,Tonnes,73437660.05,USD,Zambia,ZMB,2018,2018-01-01,2018-12-31


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
4,Habib Shahab Talc and Marble exploitation and ...,Ministry of Mines and Petroleum (Revenue Depar...,Penalties of Late Payment,,,,AFN,18.0,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
5,Abaan Rayan Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,8953331.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
6,Abaan Rayan Limited,Ministry of Finance (Customs Department),Fixed Tax on Exports,No,No,,AFN,1074419.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
7,Abaan Rayan Limited,Ministry of Finance (Customs Department),Other Fee on Exports,No,No,,AFN,198084.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
8,Abaan Rayan Limited,Ministry of Finance (Customs Department),Penalty on Exports,No,No,,AFN,31050.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32928,KALUMBILA MINERALS LIMITED,Ministry of Mines and Minerals Development,Area Charges,Yes,Yes,15869 HQ LML,ZMW,409396.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
32929,KALUMBILA MINERALS LIMITED,Ministry of Mines and Minerals Development,Area Charges,Yes,Yes,15870 HQ LML,ZMW,403032.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
32930,KALUMBILA MINERALS LIMITED,Ministry of Mines and Minerals Development,Area Charges,Yes,Yes,15871 HQ LML,ZMW,125291.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
32931,KALUMBILA MINERALS LIMITED,Ministry of Mines and Minerals Development,Area Charges,Yes,Yes,15872 HQ LML,ZMW,236910.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31


in both tables


Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,...,Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,ISO Code,Start Date_y,End Date_y
0,APL-EP-57,,Jabul Siraj Consortium,,,,,,,Afghanistan,...,APL-EP-57,AFN,378109.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
1,APL-EP-58,,Core Drillers,,,,,,,Afghanistan,...,APL-EP-58,AFN,367260.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
2,APL-EP-59,,Amin Karimzai Campany,,,,,,,Afghanistan,...,APL-EP-59,AFN,365746.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
3,APL-EP-60,,Afghan Talc Limited,,,,,,,Afghanistan,...,APL-EP-60,AFN,366791.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
4,APL-EP-61,,Nabi Afghan Company,,,,,,,Afghanistan,...,APL-EP-61,AFN,365746.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25402,,,Small scale,Copper (2603),Production,10859,Tonnes,70831646.25,USD,Zambia,...,,ZMW,66946749.0,No,,,,ZMB,2018-01-01,2018-12-31
25403,,,Small scale,Copper (2603),Production,10859,Tonnes,70831646.25,USD,Zambia,...,,ZMW,311691.0,No,,,,ZMB,2018-01-01,2018-12-31
25404,,,Small scale,Copper (2603),Production,10859,Tonnes,70831646.25,USD,Zambia,...,,ZMW,1200199.0,No,,,,ZMB,2018-01-01,2018-12-31
25405,,,Small scale,Copper (2603),Production,10859,Tonnes,70831646.25,USD,Zambia,...,,ZMW,86562.0,No,,,,ZMB,2018-01-01,2018-12-31
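
The last rows of the "in both tables" output above have an empty `Full project name` and an empty `Project name`. This is consistent with pandas pairing missing keys with each other during a merge, which differs from SQL join behaviour. If `compare_tables` relies on `merge` (an assumption, since its definition is not shown here), those rows are matches on "no project" rather than on a shared project. The sketch below uses toy data to show the effect and one possible guard: dropping rows whose name key is missing before merging.

```python
import numpy as np
import pandas as pd

keys_projects = ["Full project name", "Country", "Year"]
keys_payments = ["Project name", "Country", "Year"]

# Toy data only, not taken from the EITI files.
projects = pd.DataFrame({
    "Full project name": ["Project A", np.nan],
    "Country": ["X", "X"],
    "Year": [2018, 2018],
})
payments = pd.DataFrame({
    "Project name": ["Project A", np.nan, np.nan],
    "Country": ["X", "X", "X"],
    "Year": [2018, 2018, 2018],
    "Revenue value": [10.0, 20.0, 30.0],
})

# Naive merge: the project row with a missing name pairs with every payment row
# that also has a missing name.
naive = projects.merge(
    payments, left_on=keys_projects, right_on=keys_payments, how="inner"
)

# Stricter variant: require a populated project name on both sides first.
strict = projects.dropna(subset=["Full project name"]).merge(
    payments.dropna(subset=["Project name"]),
    left_on=keys_projects, right_on=keys_payments, how="inner",
)

print(len(naive), len(strict))
```

Whether rows without a project name should be excluded or reported separately is a judgment call for the cleaning process; the sketch only makes the difference visible.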


In [18]:
print("With duplicate rows removed")

# The key columns common_columns_3c5 and common_columns_53c are reused from the cell above.

for key, data in compare_tables_drop_duplicates(df_part_3c, df_part_5, common_columns_3c5, common_columns_53c).items():
    print(key)
    display(data)

With duplicate rows removed
in table 1 but not in table 2


Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
80,APL-EP-57,,Jabul Siraj Consortium,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
81,APL-EP-58,,Core Drillers,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
82,APL-EP-59,,Amin Karimzai Campany,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
83,APL-EP-60,,Afghan Talc Limited,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
84,APL-EP-61,,Nabi Afghan Company,,,,,,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6237,"Luanshya, Copperbelt",21996-HQ-SEL,LUANSHYA COPPER MINE,Copper (2603),Production,50362.9,Tonnes,328509713.17,USD,Zambia,ZMB,2018,2018-01-01,2018-12-31
6238,Chambishi Mine,7069-HQ-LML,NFC AFRICA MINING PLC,Copper (2603),Production,27644.02,Tonnes,180317831.79,USD,Zambia,ZMB,2018,2018-01-01,2018-12-31
6239,"Copperbelt, Chililabombwe",21996-HQ-SEL,LUBAMBE COPPER MINE LTD,Copper (2603),Production,22074.49,Tonnes,143988648.92,USD,Zambia,ZMB,2018,2018-01-01,2018-12-31
6240,Chibuluma Mine,7064-HQ-LML \n7065-HQ-LML,CHIBULUMA MINES PLC,Copper (2603),Production,11258.52,Tonnes,73437660.05,USD,Zambia,ZMB,2018,2018-01-01,2018-12-31


in table 2 but not in table 1


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
4,Habib Shahab Talc and Marble exploitation and ...,Ministry of Mines and Petroleum (Revenue Depar...,Penalties of Late Payment,,,,AFN,18.0,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
384,WESTCO INTERNATIONAL,Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXPL 3/2013,AFN,710962.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
453,Hayat Khan,Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,SSML-Kabu 48/2014,AFN,100568.0,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20
1153,Aazam Khan Wafa Sherzad Limited,Ministry of Finance (Customs Department),Export Duty,No,No,,AFN,110944.0,Not applicable,Not applicable,Not applicable,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
1175,Abdul Rahman Baba Steel & Iron Company,Ministry of Mines and Petroleum (Revenue Depar...,Surface Rent,Yes,Yes,EXPL 2/2014,AFN,1352198.0,Not applicable,Not applicable,Not applicable,,Afghanistan,AFG,2019,2018-12-21,2019-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32928,KALUMBILA MINERALS LIMITED,Ministry of Mines and Minerals Development,Area Charges,Yes,Yes,15869 HQ LML,ZMW,409396.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
32929,KALUMBILA MINERALS LIMITED,Ministry of Mines and Minerals Development,Area Charges,Yes,Yes,15870 HQ LML,ZMW,403032.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
32930,KALUMBILA MINERALS LIMITED,Ministry of Mines and Minerals Development,Area Charges,Yes,Yes,15871 HQ LML,ZMW,125291.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31
32931,KALUMBILA MINERALS LIMITED,Ministry of Mines and Minerals Development,Area Charges,Yes,Yes,15872 HQ LML,ZMW,236910.0,No,,,,Zambia,ZMB,2018,2018-01-01,2018-12-31


in both tables


Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,...,Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,ISO Code,Start Date_y,End Date_y
0,APL-EP-57,,Jabul Siraj Consortium,,,,,,,Afghanistan,...,APL-EP-57,AFN,378109.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
1,APL-EP-58,,Core Drillers,,,,,,,Afghanistan,...,APL-EP-58,AFN,367260.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
2,APL-EP-59,,Amin Karimzai Campany,,,,,,,Afghanistan,...,APL-EP-59,AFN,365746.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
3,APL-EP-60,,Afghan Talc Limited,,,,,,,Afghanistan,...,APL-EP-60,AFN,366791.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
4,APL-EP-61,,Nabi Afghan Company,,,,,,,Afghanistan,...,APL-EP-61,AFN,365746.0,No,Not applicable,Not applicable,,AFG,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25245,"Area 481, 375,000 tpa, Inner Dowsing",Licence,Van Oord,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,...,"Area 481, 375,000 tpa, Inner Dowsing",GBP,40788.0,Not applicable,Not applicable,Not applicable,Licence,GBR,2021-01-01,2021-12-31
25265,"Area 228, 1,500,000 tpa (mix), Off Great Yarmouth",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,...,"Area 228, 1,500,000 tpa (mix), Off Great Yarmouth",GBP,78024.0,Not applicable,Not applicable,Not applicable,Dead Rent,GBR,2021-01-01,2021-12-31
25275,"Area 351, 500,000 tpa, SE Isle of Wight",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,...,"Area 351, 500,000 tpa, SE Isle of Wight",GBP,48768.0,Not applicable,Not applicable,Not applicable,Dead Rent,GBR,2021-01-01,2021-12-31
25279,"Area 461, 1,000,000 tpa, Median Deep",Licence,Volker Dredging Ltd,Not available,Not applicable,Not applicable,,,GBP,United Kingdom,...,"Area 461, 1,000,000 tpa, Median Deep",GBP,119710.0,Not applicable,Not applicable,Not applicable,Dead Rent,GBR,2021-01-01,2021-12-31
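
Exact string matching is sensitive to case, punctuation and spacing. In the outputs above, some Part 5 `Project name` values for Zambia look like licence numbers written with spaces (for example `15869 HQ LML`), while the projects list shows similarly shaped identifiers written with hyphens (for example `7069-HQ-LML`). Agency names also appear with different capitalization ("impôts" and "Impôts"). Part of the mismatch may therefore be formatting rather than missing entities. The sketch below shows one possible normalization step; `normalize_name` is a hypothetical helper, and whether such loosened matching is acceptable depends on the project's data rules.

```python
import re

import pandas as pd


def normalize_name(value):
    """Casefold and collapse punctuation/whitespace so near-identical names compare equal."""
    if pd.isna(value):
        return None
    # Replace every run of non-alphanumeric characters (and underscores) with one space.
    return re.sub(r"[\W_]+", " ", str(value).casefold()).strip()


# Example values copied from the outputs above.
print(normalize_name("7069-HQ-LML"))
print(normalize_name("Direction Générale des Impôts (DGI)"))
```

A normalized copy of each key column could be added to both tables and used for the comparison, leaving the original values untouched for later review.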
