# Red flags and coherence

## Overall purpose and objective
The overall objective of the cleaning and verification process is to prepare the data for conversion into a SQLite database (to be served with Datasette). As such, the data should follow database best practices.

## Specific purpose of this notebook
This notebook checks the data for duplicates and for coherence across tables. In particular, we want to check that:
- Possible red flag rows are identified for each table.
- The companies/agencies in the company and agency lists correspond to the companies/agencies in the company data.
- The reported revenues are consistent between the Government revenues and Company data tables.

## Assumptions
- Some combinations of fields should be unique.
- The values should be coherent across tables.
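
To illustrate the first assumption in database terms, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `reporting_company` and its columns are hypothetical and are not the schema planned for the conversion; the sketch only shows how a uniqueness rule over a combination of fields behaves once it is enforced by the database.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE reporting_company (
        company_name TEXT NOT NULL,
        country      TEXT NOT NULL,
        year         INTEGER NOT NULL,
        UNIQUE (company_name, country, year)
    )
    """
)

insert = "INSERT INTO reporting_company VALUES (?, ?, ?)"
conn.execute(insert, ("Example Mining Ltd", "Zambia", 2018))
conn.execute(insert, ("Example Mining Ltd", "Zambia", 2019))  # different year: accepted

try:
    # Same company, country and year as the first row: violates the UNIQUE rule
    conn.execute(insert, ("Example Mining Ltd", "Zambia", 2018))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM reporting_company").fetchone()[0]
print(rejected, row_count)
```

Note that SQLite treats `NULL` values as distinct from each other inside a `UNIQUE` constraint, so rows with missing key fields would not be rejected; the `NOT NULL` declarations in the sketch are there for that reason.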

## Why this matters
- Inserting the data into a proper database and assigning EITI IDs requires high confidence in the data quality to avoid downstream issues. Duplicates and incoherent data decrease this level of confidence.

## Findings
- A significant share (more than 40%) of the rows in every table except Part 1 - About are possible red flags (duplicates).
- Most of these are due to missing values in the columns that are used to check for uniqueness/duplication.
- Most should be easy to reconcile by differentiating the rows using the Country and Year columns.
- There are several rows (i.e. companies, agencies, projects) in the Part 3 tables that are not in Part 5 - Company data and vice versa.
- 70% of the reports do not have the same computed revenue values based on Part 4 - Government revenues and Part 5 - Company data

More specific findings below.
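
The last finding above rests on comparing per-report totals between two tables. As a minimal sketch of how such a share can be computed, the example below sums `Revenue value` per `Country` and `Year` in two toy tables and counts the reports whose totals differ. The data values, the helper `report_totals`, and the 1% relative tolerance are assumptions made for illustration, not the method or threshold behind the 70% figure.

```python
import numpy as np
import pandas as pd


def report_totals(df):
    """Sum the numeric revenue per (Country, Year); non-numeric entries become NaN and are skipped."""
    values = pd.to_numeric(df["Revenue value"], errors="coerce")
    return values.groupby([df["Country"], df["Year"]]).sum()


# Toy data (hypothetical values)
gov = pd.DataFrame({
    "Country": ["A", "A", "B", "C"],
    "Year": [2018, 2018, 2018, 2019],
    "Revenue value": ["100", "50", "80", "200"],
})
comp = pd.DataFrame({
    "Country": ["A", "B", "B", "C"],
    "Year": [2018, 2018, 2018, 2019],
    "Revenue value": ["150", "30", "-", "200"],
})

totals = pd.concat(
    [report_totals(gov).rename("government"), report_totals(comp).rename("company")],
    axis=1,
)

# A report mismatches if its two totals differ by more than 1% (assumed tolerance).
# np.isclose returns False when either total is NaN, so reports missing from
# one table are counted as mismatches too.
mismatch = ~np.isclose(totals["government"], totals["company"], rtol=0.01)
share = mismatch.mean()
print(f"{share:.0%} of reports have different totals")
```

Using a tolerance rather than strict equality avoids flagging reports that differ only by rounding; whether that is appropriate here depends on how the totals are meant to be reconciled.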

## Analysis

### Red flags in the tables due to possible duplication

In this part, we look for red flags in the tables by looking at how similar rows are based on a combination of columns that, when taken together, should form a unique identifier for the row.

In [1]:
# import libraries and data

import pandas as pd
import numpy as np
from os import path
from functools import reduce
from pprint import pprint
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from itertools import combinations

file_dir = "data/consolidated/"
file_dir_old = "data/consolidated/backup/old"

# load the csvs into data frames
df_part_1 = pd.read_csv(path.join(file_dir, "Part 1 - About.csv"))
df_part_3a = pd.read_csv(path.join(file_dir, "Part 3 - Reporting companies' list.csv"))
df_part_3b = pd.read_csv(path.join(file_dir, "Part 3 - Reporting government entities list.csv"))
df_part_3c = pd.read_csv(path.join(file_dir, "Part 3 - Reporting projects' list.csv"))
df_part_4 = pd.read_csv(path.join(file_dir, "Part 4 - Government revenues.csv"))
df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"))
# df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"), low_memory=False)

df_list = [df_part_1, df_part_3a, df_part_3b, df_part_3c, df_part_4, df_part_5]
df_dict = {"Part 1 - About.csv": df_part_1,
           "Part 3 - Reporting companies' list.csv": df_part_3a,
           "Part 3 - Reporting government entities list.csv": df_part_3b,
           "Part 3 - Reporting projects' list.csv": df_part_3c,
           "Part 4 - Government revenues.csv": df_part_4,
           "Part 5 - Company data.csv": df_part_5
          }

# OPTIONAL COLUMNS
part_3a_opt = ["Stock exchange listing or company website", 
               "Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)"]
part_3b_opt = ["ID number (if applicable)"]
part_5_opt = ["In-kind volume (if applicable)", "Unit (if applicable)", "Comments"]

# only include fields that are non-optional
df_part_1_non_opt = df_part_1.copy()
df_part_3a_non_opt = df_part_3a.copy().drop(columns=part_3a_opt)               
df_part_3b_non_opt = df_part_3b.copy().drop(columns=part_3b_opt)
df_part_3c_non_opt = df_part_3c.copy()
df_part_4_non_opt = df_part_4.copy()
df_part_5_non_opt = df_part_5.copy().drop(columns=part_5_opt)

df_list_non_opt = [df_part_1_non_opt, df_part_3a_non_opt, df_part_3b_non_opt, df_part_3c_non_opt, df_part_4_non_opt, df_part_5_non_opt]
df_dict_non_opt = {"Part 1 - About.csv": df_part_1_non_opt,
           "Part 3 - Reporting companies' list.csv": df_part_3a_non_opt,
           "Part 3 - Reporting government entities list.csv": df_part_3b_non_opt,
           "Part 3 - Reporting projects' list.csv": df_part_3c_non_opt,
           "Part 4 - Government revenues.csv": df_part_4_non_opt,
           "Part 5 - Company data.csv": df_part_5_non_opt
          }

In [2]:
def get_column_combinations(columns, min_cols_diff):
    '''
    Get unique column combinations that will be used for determining problematic rows (i.e. possible duplicates).

    Parameters:
    - columns (list or iterable): A list of column names or identifiers to be used for forming combinations.
    - min_cols_diff (int): The minimum number of columns to form unique combinations.

    Returns:
    list: A list of tuples representing unique combinations of columns.

    Example:
    >>> columns = ['col1', 'col2', 'col3']
    >>> min_cols_diff = 2
    >>> get_column_combinations(columns, min_cols_diff)
    [('col1', 'col2'), ('col1', 'col3'), ('col2', 'col3'), ('col1', 'col2', 'col3')]
    '''

    all_combinations = []
    for x in range(min_cols_diff, len(columns) + 1):
        all_combinations.extend(combinations(columns, x))

    return all_combinations

# TEST: Uncomment lines below and run
# OUTPUT: 4 unique column combinations
# columns = ["Full name of agency", "Agency type", "Total reported"]
# column_combinations = get_column_combinations(columns, 2)
# print(f'There are {len(column_combinations)} combinations:')
# pprint(column_combinations)
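
As a quick check of how many combinations the function above produces, the count is the sum of the binomial coefficients C(n, k) for k from `min_cols_diff` to n. The sketch below computes it for the column counts and thresholds used for the tables later in this notebook; the helper name `count_column_combinations` is hypothetical.

```python
from math import comb


def count_column_combinations(n_columns, min_cols_diff):
    """Number of column subsets with at least min_cols_diff columns."""
    return sum(comb(n_columns, k) for k in range(min_cols_diff, n_columns + 1))


# (number of columns checked, min_cols_diff) as used in this notebook
settings = {
    "Part 1": (2, 2),
    "Part 3 companies / government entities": (3, 2),
    "Part 3 projects": (4, 2),
    "Part 4 / Part 5": (4, 3),
}
counts = {name: count_column_combinations(*args) for name, args in settings.items()}
print(counts)
```

The more combinations are checked, the more chances a row has to be flagged, which is worth keeping in mind when comparing red flag percentages across tables.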

In [3]:
def add_rowid(df):
    '''
    Add a row identifier (rowid) column to the DataFrame.

    Parameters:
    - df (pandas.DataFrame): The input DataFrame.

    Returns:
    pandas.DataFrame: A new DataFrame with an additional 'rowid' column.

    Example:
    >>> data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
    >>> df = pd.DataFrame(data)
    >>> add_rowid(df)
       col1 col2  rowid
    0     1    a      0
    1     2    b      1
    2     3    c      2
    '''

    df_rowid = df.copy()
    df_rowid["rowid"] = range(len(df_rowid))

    return df_rowid    

In [4]:
def get_redflags(df, columns, min_cols_diff):
    '''
    Get unique column combinations and find problematic rows (duplicates) based on specified columns.

    Parameters:
    - df (pandas.DataFrame): The input DataFrame.
    - columns (list or iterable): A list of column names or identifiers to be used for forming combinations.
    - min_cols_diff (int): The minimum number of columns to form unique combinations (i.e. minimum numbers of columns of difference to be considered unique).

    Returns:
    pandas.DataFrame: Every row that shares its values with at least one other row on some column combination, each listed once, with an added 'rowid' column.

    Example:
    >>> data = {'col1': [1, 2, 2, 3, 4], 'col2': ['a', 'b', 'b', 'c', 'd']}
    >>> df = pd.DataFrame(data)
    >>> columns_for_combinations = ['col1', 'col2']
    >>> min_cols_diff = 2
    >>> get_redflags(df, columns_for_combinations, min_cols_diff)
       col1 col2  rowid
    1     2    b      1
    2     2    b      2
    '''

    # Step 1: Create a copy of the DataFrame and add rowid
    df_copy = df.copy()
    df_copy['rowid'] = range(len(df_copy))

    # Step 2: Get column combinations
    all_combinations = []
    for x in range(min_cols_diff, len(columns) + 1):
        all_combinations.extend(combinations(columns, x))

    # Step 3: Find problematic rows for each column combination
    problematic_rows = pd.DataFrame()
    for combination in all_combinations:
        duplicated_rows = df_copy[df_copy.duplicated(subset=list(combination), keep=False)]
        problematic_rows = pd.concat([problematic_rows, duplicated_rows], ignore_index=False)

    # Step 4: Get unique rows among problematic rows
    unique_problematic_rows = problematic_rows.drop_duplicates()

    return unique_problematic_rows
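
One property of this approach is worth spelling out: rows that are duplicated on a set of columns are also duplicated on every subset of that set. The set of flagged rows is therefore already determined by the combinations of exactly `min_cols_diff` columns. The sketch below checks this on hypothetical toy data; `flagged_index` is an illustrative helper, not part of the notebook.

```python
from itertools import combinations

import pandas as pd


def flagged_index(df, columns, sizes):
    """Index labels of rows duplicated on any column subset of the given sizes."""
    flagged = pd.Series(False, index=df.index)
    for size in sizes:
        for subset in combinations(columns, size):
            flagged |= df.duplicated(subset=list(subset), keep=False)
    return set(df.index[flagged.to_numpy()].tolist())


# Toy data (hypothetical values, not taken from the EITI tables)
toy = pd.DataFrame({
    "name": ["A", "A", "B", "C", "C"],
    "id": [1, 1, 2, 3, 4],
    "report": ["x", "y", "z", "w", "w"],
})
cols = ["name", "id", "report"]

all_sizes = flagged_index(toy, cols, range(2, len(cols) + 1))  # sizes 2 and 3
smallest_only = flagged_index(toy, cols, [2])                  # size 2 only

print(all_sizes == smallest_only)
print(sorted(all_sizes))
```

In practice this means the larger combinations add computation but no additional flagged rows.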

In [5]:
def table_groupby(df, column, ascending=True):
    '''Count the rows of df per value of column, sorted by that count.'''

    return df.groupby(column).size().reset_index(name='Number of Rows').sort_values(by="Number of Rows", ascending=ascending)

In [6]:
def num_rf_rows(df1, df2):
    '''Print how many rows of df1 (the full table) are in df2 (its red flag rows).'''

    perc = 100 * df2.shape[0] / df1.shape[0]
    print(f'{df2.shape[0]} out of {df1.shape[0]} ({perc:.0f}%) rows are possible red flags')

In [7]:
def num_compare_rows(df1, df2):
    '''Print the number of rows in df2 as a count and a percentage of the rows in df1.'''

    perc = 100 * df2.shape[0] / df1.shape[0]
    print(f'{df2.shape[0]} out of {df1.shape[0]} ({perc:.0f}%) rows')

#### Part 1 - About

Columns to check:
- Country or area name
- Start Date

***Flagged if rows are similar in at least 2 columns.***

#### Results
- 0 out of 73 (0%) rows are possible red flags

In [8]:
columns_1 = ["Country or area name", "Start Date"]

redflags_part_1 = get_redflags(df_part_1, columns_1, 2)
num_rf_rows(df_part_1, redflags_part_1)
display(redflags_part_1)

0 out of 73 (0%) rows are possible red flags


Unnamed: 0,Country or area name,ISO Alpha-3 Code,National currency name,National currency ISO-4217,Start Date,End Date,Has an EITI Report been prepared by an Independent Administrator?,What is the name of the company?,Date that the EITI Report was made public,"URL, EITI Report",...,… by company,… by project,Systematically disclosed,Through EITI Reporting,Not applicable,Not available,Name,Organisation,Email address,rowid


#### Part 3 - Reporting companies' list

Columns to check:
- Full company name
- Company ID number
- Payments to Governments Report

***Flagged if rows are similar in at least 2 columns.***

#### Results
- 1597 out of 3792 (42%) rows are possible red flags
- A lot of the red flags are caused by rows not having a value under the Company ID number and Payments to Governments Report columns
- Most should easily be reconciled by differentiating them using the Country and Year columns
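
The role of missing values can be illustrated with a small sketch. pandas' `duplicated` treats missing values as equal to each other, so two unrelated rows that both lack an ID and a report value are flagged as duplicates on those two columns. The toy data and the `PLACEHOLDERS` list below are assumptions for illustration; the placeholder strings actually present in the tables would need to be confirmed.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Full company name": ["Alpha Ltd", "Beta Ltd", "Gamma Ltd", "Gamma Ltd"],
    "Company ID number": ["Not available", "Not available", "123", "123"],
    "Payments to Governments Report": [np.nan, np.nan, "555", "555"],
})
key = ["Company ID number", "Payments to Governments Report"]

# Naive check: rows 0 and 1 are flagged only because both lack real values
naive = toy.duplicated(subset=key, keep=False)

# Stricter check: only compare rows where every key column holds a real value.
# PLACEHOLDERS is an assumed list of strings that stand in for missing data.
PLACEHOLDERS = ["Not available", "Not applicable"]
missing = toy[key].isna() | toy[key].isin(PLACEHOLDERS)
complete = ~missing.any(axis=1)
strict = naive & complete

print(naive.tolist())
print(strict.tolist())
```

Separating the two cases makes it easier to tell rows that merely lack identifiers from rows that genuinely repeat the same identifiers.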

In [45]:
columns_3a = ["Full company name", "Company ID number", "Payments to Governments Report"]
              
redflags_part_3a = get_redflags(df_part_3a, columns_3a, 2)
num_rf_rows(df_part_3a, redflags_part_3a)
display(redflags_part_3a)

1597 out of 3792 (42%) rows are possible red flags


Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date,rowid
4,Abdul Fatah,Private,9001846329,Other,Construction stone,Not applicable,Not available,139422,Afghanistan,AFG,2018,2017-12-21,2018-12-24,4
8,Afghan Gas Enterprise,Private,Not available,Other,Natural Gas,Not applicable,Not available,,Afghanistan,AFG,2018,2017-12-21,2018-12-24,8
11,Ahmad Rafi Oraaz Limited,Private,9003746923,Other,Coal,,,1124164,Afghanistan,AFG,2018,2017-12-21,2018-12-24,11
14,Amania Mining Company,Private,9000197187,Other,Fluoride,Not applicable,Not available,10612599,Afghanistan,AFG,2018,2017-12-21,2018-12-24,14
17,Bakhtar Afghan Marble Company,Private,1003112081,Other,Talc,Not applicable,Not available,210111,Afghanistan,AFG,2018,2017-12-21,2018-12-24,17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3765,MASHAK PETROLEUM,Private,Not available,Oil,Oil,Not available,Not available,,Chad,TCD,2018,2018-01-01,2018-12-31,3765
3771,Société Nationale de Ciment (SONACIM),State-owned enterprises & public corporations,Not available,Mining,Quarrying,Not available,Not available,,Chad,TCD,2018,2018-01-01,2018-12-31,3771
3775,DTP,Private,Not available,Mining,Granulates,Not available,Not available,,Chad,TCD,2018,2018-01-01,2018-12-31,3775
3776,ETEP,Private,Not available,Mining,BTP,Not available,Not available,,Chad,TCD,2018,2018-01-01,2018-12-31,3776


In [44]:
redflags_part_3a.sort_values(by=["Country", "Full company name"])

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date,rowid
220,Aazam Khan Wafa Sherzad Limited,Private,9001881367,Other,Talc,Not applicable,Not available,164759,Afghanistan,AFG,2019,2018-12-21,2019-12-20,220
2,Abaan Rayan Limited,Private,9004655032,Other,Coal,,Not available,10256884,Afghanistan,AFG,2018,2017-12-21,2018-12-22,2
3,Abas Ghaznavi Limited,Private,9001935742,Other,Coal,,Not available,36244514,Afghanistan,AFG,2018,2017-12-21,2018-12-23,3
221,Abbas Ghaznavi Limited,Private,9001935742,Other,coal,Not applicable,Not available,19008124,Afghanistan,AFG,2019,2018-12-21,2019-12-20,221
4,Abdul Fatah,Private,9001846329,Other,Construction stone,Not applicable,Not available,139422,Afghanistan,AFG,2018,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3319,NFC AFRICA MINING PLC,Private,,Mining,Au,http://www.cnmc.com.cn/indexen.jsp,,255956974.17,Zambia,ZMB,2019,2019-01-01,2019-12-31,3319
3320,Sino Metals,Private,,Mining,,,,100747457.08,Zambia,ZMB,2019,2019-01-01,2019-12-31,3320
3305,ZCCM INVESTMENTS HOLDINGS PLC,State-owned enterprises & public corporations,1001761145,Mining,"Ag, AQM, Au, Be3Al2(SiO3)6, Co, Cu, GRT, LST, ...",www.zccm-ih.com.zm/,,95082467.74,Zambia,ZMB,2018,2018-01-01,2018-12-31,3305
3321,ZCCM INVESTMENTS HOLDINGS PLC (ZCCM- IH),State-owned enterprises & public corporations,1001761145,Mining,"Ag, AQM, Au, Be3Al2(SiO3)6, Co, Cu, GRT, LST, ...",www.zccm-ih.com.zm/,,18793590.49,Zambia,ZMB,2019,2019-01-01,2019-12-31,3321


In [10]:
table_groupby(redflags_part_3a, 'Year', False)

Unnamed: 0,Year,Number of Rows
1,2018,705
2,2019,396
0,2017,395
3,2020,101


In [11]:
table_groupby(redflags_part_3a, 'Country', False).head(10)

Unnamed: 0,Country,Number of Rows
1,Albania,192
0,Afghanistan,179
18,Mongolia,170
21,Norway,154
27,Ukraine,143
28,United Kingdom,122
26,Trinidad and Tobago,98
9,Germany,68
10,Ghana,68
2,Armenia,52


#### Part 3 - Reporting government entities list

Columns to check:
- Full name of agency
- Agency type
- Total reported

***Flagged if rows are similar in at least 2 columns.***

#### Results
- 373 out of 547 (68%) rows are possible red flags
- A lot of the red flags are caused by rows not having a value under the Total reported column
- Most should easily be reconciled by differentiating them using the Country and Year columns

In [12]:
columns_3b = ["Full name of agency", "Agency type", "Total reported"]

redflags_part_3b = get_redflags(df_part_3b, columns_3b, 2)
num_rf_rows(df_part_3b, redflags_part_3b)
display(redflags_part_3b)

373 out of 547 (68%) rows are possible red flags


Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date,rowid
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,1441453501,Afghanistan,AFG,2018,2017-12-21,2018-12-20,0
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,1367722942,Afghanistan,AFG,2018,2017-12-21,2018-12-21,1
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,1478934310.56,Afghanistan,AFG,2018,2017-12-21,2018-12-22,2
3,National Environmental Protection Agency,Central goverment,Not applicable,34000,Afghanistan,AFG,2018,2017-12-21,2018-12-23,3
4,Ministry of Industry and Commerce,Central goverment,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...
460,Electric energy distribution system operator (...,State-owned enterprises & public corporations,,,Albania,ALB,2017,2017-01-01,2017-12-31,460
471,Electric Energy Distribution System Operator (...,State-owned enterprises & public corporations,,,Albania,ALB,2018,2018-01-01,2018-12-31,471
474,"Secretaría de Minería (SEMIN), Ministerio de D...",Central government,,,Argentina,ARG,2018,2018-01-01,2018-12-31,474
524,CAPAM,Other,,,Cameroon,CMR,2017,2017-01-01,2017-12-31,524


In [13]:
table_groupby(redflags_part_3b, 'Year', False)

Unnamed: 0,Year,Number of Rows
1,2018,128
0,2017,121
2,2019,93
3,2020,31


In [14]:
table_groupby(redflags_part_3b, 'Country', False).head(10)

Unnamed: 0,Country,Number of Rows
12,Ghana,80
4,Burkina Faso,28
16,Madagascar,24
1,Albania,22
34,Zambia,18
20,Mongolia,18
6,Chad,17
32,Ukraine,15
11,Germany,15
7,Cote d'Ivoire,14


#### Part 3 - Reporting projects' list

Columns to check:
- Full project name
- Legal agreement reference number(s): contract, licence, lease, concession, …
- Affiliated companies, start with Operator
- Commodities (one commodity/row)

***Flagged if rows are similar in at least 2 columns.***
  
#### Results
- 4707 out of 4979 (95%) rows are possible red flags
- A lot of the red flags are caused by rows not having a value under the Legal agreement reference number(s): contract, licence, lease, concession, … and Commodities (one commodity/row) columns
- Most should easily be reconciled by differentiating them using the Country and Year columns

In [15]:
columns_3c = ["Full project name", "Legal agreement reference number(s): contract, licence, lease, concession, …", "Affiliated companies, start with Operator", "Commodities (one commodity/row)"]

redflags_part_3c = get_redflags(df_part_3c, columns_3c, 2)
num_rf_rows(df_part_3c, redflags_part_3c)
display(redflags_part_3c)

4707 out of 4979 (95%) rows are possible red flags


Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date,rowid
0,APL-EP-57,,Jabul Siraj Consortium,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20,0
1,APL-EP-58,,Core Drillers,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20,1
2,APL-EP-59,,Amin Karimzai Campany,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20,2
3,APL-EP-60,,Afghan Talc Limited,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20,3
4,APL-EP-61,,Nabi Afghan Company,,,,,,,Afghanistan,AFG,2018,2017-12-21,2018-12-20,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4974,Champs Nya,Concession d'exploitation NYA du 20/07/2017 (C...,"Esso, Petronas,SHT",Crude oil (2709),Production,638135,Barrels,34993372,USD,Chad,TCD,2017,2017-01-01,2017-12-31,4974
4975,Champs Maikeri,Concession d'exploitation Maikeri du 20/07/201...,"Esso, Petronas,SHT",Crude oil (2709),Production,550314,Barrels,30177537,USD,Chad,TCD,2017,2017-01-01,2017-12-31,4975
4976,Champs Timbré,Concession d'exploitation Timbré du 20/07/2017...,"Esso, Petronas,SHT",Crude oil (2709),Production,598515,Barrels,32820733,USD,Chad,TCD,2017,2017-01-01,2017-12-31,4976
4977,Champs MANGARA,"Autorisation Exclusive d'Exploitation, MANGARA...",PCM/Glencore/SHT,Crude oil (2709),Production,2181729,Barrels,119639347,USD,Chad,TCD,2017,2017-01-01,2017-12-31,4977


In [16]:
table_groupby(redflags_part_3c, 'Year', False)

Unnamed: 0,Year,Number of Rows
3,2020,1563
2,2019,1465
1,2018,1416
0,2017,263


In [17]:
table_groupby(redflags_part_3c, 'Country', False).head(10)

Unnamed: 0,Country,Number of Rows
28,United Kingdom,1743
27,Ukraine,1703
10,Ghana,168
0,Afghanistan,160
4,Burkina Faso,123
26,Trinidad and Tobago,113
13,Iraq,81
9,Germany,63
24,Senegal,62
18,Mongolia,62


#### Part 4 - Government revenues

Columns to check:
- Sector
- Revenue stream name
- Government entity
- Revenue value

***Flagged if rows are similar in at least 3 columns.***
  
#### Results
- 1194 out of 2319 (51%) rows are possible red flags
- Most should easily be reconciled by differentiating them using the Country and Year columns

In [18]:
columns_4 = ["Sector", "Government entity", "Revenue value", "Revenue stream name"]

redflags_part_4 = get_redflags(df_part_4, columns_4, 3)
num_rf_rows(df_part_4, redflags_part_4)
display(redflags_part_4)

1194 out of 2319 (51%) rows are possible red flags


Unnamed: 0,GFS Classification,Sector,Revenue stream name,Government entity,Revenue value,Currency,Country,ISO Code,Year,Start Date,End Date,rowid
244,Other taxes payable by natural resource compan...,,Impuesto de Superficie,Dirección General de Impuestos Internos (DGII),28490,DOP,Dominican Republic,DOM,2019,2019-01-01,2019-12-31,244
254,Licence fees (114521E),,Solicitud de Exploración,Dirección General de Minería (DGM),,DOP,Dominican Republic,DOM,2019,2019-01-01,2019-12-31,254
263,Other taxes payable by natural resource compan...,,Impuesto de Superficie,Dirección General de Impuestos Internos (DGII),28490,DOP,Dominican Republic,DOM,2020,2020-01-01,2020-12-31,263
265,Licence fees (114521E),,Solicitud de exploración,Dirección General de Minería (DGM),,DOP,Dominican Republic,DOM,2020,2020-01-01,2020-12-31,265
306,Other taxes payable by natural resource compan...,Oil & Gas,Drilling &Well Designation Permit (Per Well),Petroleum Commission,,GHS,Ghana,GHA,2017,2017-01-01,2017-12-31,306
...,...,...,...,...,...,...,...,...,...,...,...,...
2318,Delivered/paid to state-owned enterprise(s) (1...,Oil,Vente du pétrole collectés par SHT PCCL,Société des Hydrocarbures du Tchad (SHT),233804963.54,USD,Chad,TCD,2018,2018-01-01,2018-12-31,2318
1882,Other taxes payable by natural resource compan...,Other,Other payments to the State and local governme...,Municipality of Lushnjë,-,ALL,Albania,ALB,2017,2017-01-01,2017-12-31,1882
1883,Other taxes payable by natural resource compan...,Other,Other payments to the State and local governme...,General Directorate of Taxes,-,ALL,Albania,ALB,2017,2017-01-01,2017-12-31,1883
471,Licence fees (114521E),Mining,"Tax on sand, gravel and other quarry resources",Bureau of Local Government and Finance (BLGF),,PHP,Philippines,PHL,2018,2018-01-01,2018-12-31,471


In [19]:
table_groupby(redflags_part_4, 'Year', False)

Unnamed: 0,Year,Number of Rows
1,2018,512
0,2017,409
2,2019,195
3,2020,78


In [20]:
table_groupby(redflags_part_4, 'Country', False).head(10)

Unnamed: 0,Country,Number of Rows
12,Madagascar,175
10,Ghana,122
26,Ukraine,90
5,Chad,88
3,Burkina Faso,85
18,Nigeria,65
1,Albania,59
13,Mali,56
25,Trinidad and Tobago,55
28,Zambia,51


#### Part 5 - Company data

Columns to check:
- Company
- Project name
- Revenue value
- Revenue stream name

***Flagged if rows are similar in at least 3 columns.***
  
#### Results
- 18129 out of 31882 (57%) rows are possible red flags
- Most should easily be reconciled by differentiating them using the Country and Year columns

In [21]:
columns_5 = ["Company", "Project name", "Revenue value", "Revenue stream name"]

redflags_part_5 = get_redflags(df_part_5, columns_5, 3)
num_rf_rows(df_part_5, redflags_part_5)
display(redflags_part_5)

18129 out of 31882 (57%) rows are possible red flags


Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date,rowid
3,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,300000000,No,Not applicable,Not applicable,2018-11-19,Afghanistan,AFG,2018,2017-12-21,2018-12-20,3
18,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,100000000,No,Not applicable,Not applicable,2018-11-21,Afghanistan,AFG,2018,2017-12-21,2018-12-20,18
19,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,100000000,No,Not applicable,Not applicable,2018-11-19,Afghanistan,AFG,2018,2017-12-21,2018-12-20,19
148,Core Drillers,Ministry of Finance (Customs Department),Other Fee on Exports,No,No,,AFN,110,No,Not applicable,Not applicable,,Afghanistan,AFG,2018,2017-12-21,2018-12-20,148
209,Khoshak Brothers Co,Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXPL 1/2006,AFN,5000000,No,Not applicable,Not applicable,2018-07-25,Afghanistan,AFG,2018,2017-12-21,2018-12-20,209
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31619,APCL,Direction Générale des Impôts (DGI),Droits Fixes (y compris droits pour attributio...,,,,XAF,6000000,,,,,Cameroon,CMR,2017,2017-01-01,2017-12-31,31619
31647,NOBLE ENERGY CAM LIMITED,Société Nationale des Hydrocarbures (SNH),Bonus de signature,,,,XAF,572336000,,,,,Cameroon,CMR,2017,2017-01-01,2017-12-31,31647
31655,NEW AGE,Direction Générale des Impôts (DGI),Droits Fixes (y compris droits pour attributio...,,,,XAF,6000000,,,,,Cameroon,CMR,2017,2017-01-01,2017-12-31,31655
31669,TOWER RESOURCES,Société Nationale des Hydrocarbures (SNH),Bonus de signature,,,,XAF,572336000,,,,,Cameroon,CMR,2017,2017-01-01,2017-12-31,31669


In [22]:
table_groupby(redflags_part_5, 'Year', False)

Unnamed: 0,Year,Number of Rows
1,2018,8258
2,2019,5453
0,2017,2309
3,2020,2109


In [23]:
table_groupby(redflags_part_5, 'Country', False).head(10)

Unnamed: 0,Country,Number of Rows
0,Afghanistan,4122
31,Ukraine,2974
32,United Kingdom,1711
1,Albania,1259
20,Mongolia,1070
28,Tanzania,961
23,Nigeria,950
25,Philippines,894
12,Ghana,688
15,Liberia,557


### Coherence of Part 3 - Reporting companies' list, Part 3 - Reporting government entities list, and Part 3 - Reporting projects' list with Part 5 - Company data

In [24]:
def compare_tables(df1, df2, common_columns_df1, common_columns_df2):
    '''
    Compare two tables based on specified columns.

    Parameters:
    - df1 (pandas.DataFrame): The first DataFrame.
    - df2 (pandas.DataFrame): The second DataFrame.
    - common_columns_df1 (list): Columns used in df1 to find common rows.
    - common_columns_df2 (list): Columns used in df2 to find common rows.

    Returns:
    dict: Three pandas.DataFrame values keyed by
    - "in table 1 but not in table 2": rows of df1 with no match in df2.
    - "in table 2 but not in table 1": rows of df2 with no match in df1.
    - "in both tables": the inner join of df1 and df2 on the given columns.

    Example:
    >>> df1 = pd.DataFrame({'Company': ['A', 'B', 'C'], 'Project name': ['P1', 'P2', 'P3'], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2022]})
    >>> df2 = pd.DataFrame({'Full company name': ['A Corp', 'B Corp', 'D Corp'], 'Company type': ['Type1', 'Type2', 'Type3'], 'Company ID number': [101, 102, 103], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2023]})
    >>> common_cols_df1 = ['Country', 'Year']
    >>> common_cols_df2 = ['Country', 'Year']
    >>> result = compare_tables(df1, df2, common_cols_df1, common_cols_df2)
    >>> {key: data.shape[0] for key, data in result.items()}
    {'in table 1 but not in table 2': 1, 'in table 2 but not in table 1': 1, 'in both tables': 2}
    '''

    # Make copies of the input dataframes
    df1_copy = df1.copy()
    df2_copy = df2.copy()

    # Find common rows
    common_rows = pd.merge(df1_copy, df2_copy, left_on=common_columns_df1, right_on=common_columns_df2, how='inner')

    # Find unique rows in df1
    unique_rows_df1 = df1_copy[~df1_copy.set_index(common_columns_df1).index.isin(common_rows.set_index(common_columns_df1).index)]

    # Find unique rows in df2
    unique_rows_df2 = df2_copy[~df2_copy.set_index(common_columns_df2).index.isin(common_rows.set_index(common_columns_df2).index)]

    return {"in table 1 but not in table 2": unique_rows_df1, 
            "in table 2 but not in table 1": unique_rows_df2,
           "in both tables": common_rows}

In [25]:
def compare_tables_drop_duplicates(df1, df2, common_columns_df1, common_columns_df2):
    '''
    Compare two tables based on specified columns and drop duplicates.

    Parameters:
    - df1 (pandas.DataFrame): The first DataFrame.
    - df2 (pandas.DataFrame): The second DataFrame.
    - common_columns_df1 (list): Columns used in df1 to find common rows.
    - common_columns_df2 (list): Columns used in df2 to find common rows.

    Returns:
    dict: Three pandas.DataFrame values, each with duplicates on the comparison columns dropped, keyed by
    - "in table 1 but not in table 2": rows of df1 with no match in df2.
    - "in table 2 but not in table 1": rows of df2 with no match in df1.
    - "in both tables": the inner join of df1 and df2 on the given columns.

    Example:
    >>> df1 = pd.DataFrame({'Company': ['A', 'B', 'C'], 'Project name': ['P1', 'P2', 'P3'], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2022]})
    >>> df2 = pd.DataFrame({'Full company name': ['A Corp', 'B Corp', 'D Corp'], 'Company type': ['Type1', 'Type2', 'Type3'], 'Company ID number': [101, 102, 103], 'Country': ['X', 'Y', 'Z'], 'Year': [2020, 2021, 2023]})
    >>> common_cols_df1 = ['Country', 'Year']
    >>> common_cols_df2 = ['Country', 'Year']
    >>> result = compare_tables_drop_duplicates(df1, df2, common_cols_df1, common_cols_df2)
    >>> {key: data.shape[0] for key, data in result.items()}
    {'in table 1 but not in table 2': 1, 'in table 2 but not in table 1': 1, 'in both tables': 2}
    '''

    # Find common rows
    common_rows = pd.merge(df1, df2, left_on=common_columns_df1, right_on=common_columns_df2, how='inner')

    # Drop duplicates in common rows
    common_rows = common_rows.drop_duplicates(subset=common_columns_df1)

    # Drop duplicates in unique rows in df1
    unique_rows_df1 = df1[~df1.set_index(common_columns_df1).index.isin(common_rows.set_index(common_columns_df1).index)]
    unique_rows_df1 = unique_rows_df1.drop_duplicates(subset=common_columns_df1)

    # Drop duplicates in unique rows in df2
    unique_rows_df2 = df2[~df2.set_index(common_columns_df2).index.isin(common_rows.set_index(common_columns_df2).index)]
    unique_rows_df2 = unique_rows_df2.drop_duplicates(subset=common_columns_df2)

    return {"in table 1 but not in table 2": unique_rows_df1, 
            "in table 2 but not in table 1": unique_rows_df2,
            "in both tables": common_rows}


#### Part 3 - Reporting companies' list and Part 5 - Company data

#### Results
- After removing duplicates, 262 rows are in Part 3 - Reporting companies' list but not in Part 5 - Company data
- After removing duplicates, 105 rows are in Part 5 - Company data but not in Part 3 - Reporting companies' list

In [26]:
common_columns_3a5 = ["Full company name"]
common_columns_53a = ["Company"]

# for key, data in compare_tables(df_part_3a, df_part_5, common_columns_3a5, common_columns_53a).items():
#     print(key)
#     display(data)

compare_3a5 = compare_tables_drop_duplicates(df_part_3a, df_part_5, common_columns_3a5, common_columns_53a)

print("Duplicate rows removed")
for key, data in compare_3a5.items():
    print(f'{key}: {data.shape[0]} rows')

Duplicate rows removed
in table 1 but not in table 2: 262 rows
in table 2 but not in table 1: 105 rows
in both tables: 2641 rows


#### Part 3 - Reporting government entities list and Part 5 - Company data

#### Results
- After removing duplicates, 164 rows are in Part 3 - Reporting government entities list but not in Part 5 - Company data
- After removing duplicates, 147 rows are in Part 5 - Company data but not in Part 3 - Reporting government entities list

In [27]:
common_columns_3b5 = ["Full name of agency", "Country", "Year"]
common_columns_53b = ["Government entity", "Country", "Year"]

# for key, data in compare_tables(df_part_3b, df_part_5, common_columns_3b5, common_columns_53b).items():
#     print(key)
#     display(data)

compare_3b5 = compare_tables_drop_duplicates(df_part_3b, df_part_5, common_columns_3b5, common_columns_53b)

print("Duplicate rows removed")
for key, data in compare_3b5.items():
    print(f'{key}: {data.shape[0]} rows')

Duplicate rows removed
in table 1 but not in table 2: 164 rows
in table 2 but not in table 1: 147 rows
in both tables: 379 rows


#### Part 3 - Reporting projects' list and Part 5 - Company data

#### Results
- After removing duplicates, 767 rows are in Part 3 - Reporting projects' list but not in Part 5 - Company data
- After removing duplicates, 556 rows are in Part 5 - Company data but not in Part 3 - Reporting projects' list

In [28]:
common_columns_3c5 = ["Full project name", "Country", "Year"]
common_columns_53c = ["Project name", "Country", "Year"]

# for key, data in compare_tables(df_part_3c, df_part_5, common_columns_3c5, common_columns_53c).items():
#     print(key)
#     display(data)

compare_3c5 = compare_tables_drop_duplicates(df_part_3c, df_part_5, common_columns_3c5, common_columns_53c)

print("Duplicate rows removed")
for key, data in compare_3c5.items():
    print(f'{key}: {data.shape[0]} rows')

Duplicate rows removed
in table 1 but not in table 2: 767 rows
in table 2 but not in table 1: 556 rows
in both tables: 3146 rows


### Compare revenue totals between Part 4 - Government revenues and Part 5 - Company data

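#### RESULTS
- 70% of the reports do not have the same computed revenue values in Part 4 - Government revenues and Part 5 - Company data.
- The table below lists the difference per report (Country and Year) as a percentage of the Part 4 total, sorted from largest to smallest.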

In [29]:
rev_part_4 = df_part_4.copy()
rev_part_5 = df_part_5.copy()

# Convert "Revenue value" to numeric
rev_part_4["Revenue value"] = pd.to_numeric(rev_part_4["Revenue value"], errors="coerce")
rev_part_5["Revenue value"] = pd.to_numeric(rev_part_5["Revenue value"], errors="coerce")

# Group and sum
# rev_part_4_sum = rev_part_4.groupby(["Country"])["Revenue value"].sum()
# rev_part_5_sum = rev_part_5.groupby(["Country"])["Revenue value"].sum()

rev_part_4_sum = rev_part_4.groupby(["Country", "Year"])["Revenue value"].sum()
rev_part_5_sum = rev_part_5.groupby(["Country", "Year"])["Revenue value"].sum()

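# Difference as a percentage of the Part 4 total, sorted from largest to smallest.
# Country/Year groups present in only one of the two tables come out as NaN.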
difference = (100 * (rev_part_4_sum - rev_part_5_sum) / rev_part_4_sum).sort_values(ascending=False).reset_index()

# Rename the column and display it with two decimals
diff_col = "Difference in computed revenue (%)<br>[Government revenues - Company data]"
diff_perc = difference.rename(columns={"Revenue value": diff_col}).style.format({diff_col: "{:.2f}"})

diff_perc

Unnamed: 0,Country,Year,Difference in computed revenue (%) [Government revenues - Company data]
0,Albania,2018,100.0
1,Nigeria,2018,67.84
2,Albania,2017,53.61
3,Argentina,2018,42.85
4,Suriname,2017,37.46
5,Guyana,2018,33.19
6,Burkina Faso,2020,25.97
7,Zambia,2018,22.45
8,Zambia,2017,19.16
9,Liberia,2018,18.04
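
Two pandas behaviours affect how this table should be read. The sketch below uses made-up data (the country names and all figures are hypothetical, and `total_by_report` is a helper introduced only for this illustration) to show that non-numeric revenue values coerced to NaN are skipped by `sum()`, so a report whose Part 5 values are all text can show a 100% difference, and that a report missing from one table yields NaN rather than a percentage. It also shows one possible way to compute the share of mismatched reports, assuming a 1% tolerance that is not taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data: none of these figures come from the EITI tables.
part_4 = pd.DataFrame({
    "Country": ["Aland", "Aland", "Borland", "Corland", "Dorland"],
    "Year": [2018, 2018, 2018, 2018, 2018],
    "Revenue value": ["60", "40", "200", "50", "30"],
})
part_5 = pd.DataFrame({
    "Country": ["Aland", "Aland", "Borland", "Dorland"],
    "Year": [2018, 2018, 2018, 2018],
    "Revenue value": ["Not applicable", "n/a", "150", "30"],
})


def total_by_report(df):
    """Sum "Revenue value" per (Country, Year), coercing non-numeric text to NaN."""
    values = pd.to_numeric(df["Revenue value"], errors="coerce")
    return values.groupby([df["Country"], df["Year"]]).sum()


sum_4 = total_by_report(part_4)
sum_5 = total_by_report(part_5)

# Same formula as in the cell above: difference as a percentage of the Part 4 total.
# The subtraction aligns on (Country, Year); Corland is absent from part_5, so it becomes NaN.
diff_pct = 100 * (sum_4 - sum_5) / sum_4

# Share of comparable reports whose totals differ by more than an assumed 1% tolerance.
comparable = diff_pct.dropna()
share_mismatched = (comparable.abs() > 1.0).mean()

print(diff_pct)
print(f"{share_mismatched:.0%} of comparable reports differ by more than 1%")
```

In this toy case Aland shows a 100% difference only because both of its Part 5 values are text, which may be worth checking for rows such as Albania 2018 before treating them as genuine discrepancies.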


### TESTS

In [40]:
sample = pd.DataFrame({
    "Full name of agency": ["A", "A", "B", "B", "C", "D", "E"],
    "Agency type": ["pri", "pri", "pri", "soe", "soe", "soe", "soe"],
    "Total reported": [0, 0, 1, 1, 2, 2, 5]
})

sample

Unnamed: 0,Full name of agency,Agency type,Total reported
0,A,pri,0
1,A,pri,0
2,B,pri,1
3,B,soe,1
4,C,soe,2
5,D,soe,2
6,E,soe,5


In [30]:
# SAMPLE TEST
sample = pd.DataFrame({
    "Full name of agency": ["A", "A", "B", "B", "C", "D", "E"],
    "Agency type": ["pri", "pri", "pri", "soe", "soe", "soe", "soe"],
    "Total reported": [0, 0, 1, 1, 2, 2, 5]
})

columns = ["Full name of agency", "Agency type", "Total reported"]

get_redflags(sample, columns, 2)

Unnamed: 0,Full name of agency,Agency type,Total reported,rowid
0,A,pri,0,0
1,A,pri,0,1
2,B,pri,1,2
3,B,soe,1,3
4,C,soe,2,4
5,D,soe,2,5


In [31]:
# Sample data for sample_companies_list
sample_companies_list_data = {
    "Full company name": ["A Corp", "B Corp", "C Corp", "D Corp", "E Corp", "B Corp", "F Corp"],
    "Company type": ["Type1", "Type2", "Type3", "Type1", "Type2", "Type2", "Type3"],
    "Company ID number": [101, 102, 103, 104, 105, 102, 106],
    "Country": ["X", "Y", "Z", "X", "Y", "Y", "Z"],
    "Year": [2020, 2021, 2022, 2023, 2020, 2021, 2023]
}

sample_company_list = pd.DataFrame(sample_companies_list_data)

# Sample data for sample_company_data
sample_company_data_data = {
    "Company": ["A Corp", "B Corp", "C Corp", "E Corp", "F Corp", "C Corp", "G Corp"],
    "Project name": ["P1", "P2", "P3", "P4", "P5", "P3", "P6"],
    "Country": ["X", "Y", "Z", "X", "Y", "Z", "X"],
    "Year": [2020, 2021, 2022, 2023, 2022, 2023, 2021]
}

sample_company_data = pd.DataFrame(sample_company_data_data)

display(sample_company_list)
display(sample_company_data)

Unnamed: 0,Full company name,Company type,Company ID number,Country,Year
0,A Corp,Type1,101,X,2020
1,B Corp,Type2,102,Y,2021
2,C Corp,Type3,103,Z,2022
3,D Corp,Type1,104,X,2023
4,E Corp,Type2,105,Y,2020
5,B Corp,Type2,102,Y,2021
6,F Corp,Type3,106,Z,2023


Unnamed: 0,Company,Project name,Country,Year
0,A Corp,P1,X,2020
1,B Corp,P2,Y,2021
2,C Corp,P3,Z,2022
3,E Corp,P4,X,2023
4,F Corp,P5,Y,2022
5,C Corp,P3,Z,2023
6,G Corp,P6,X,2021


In [32]:
common_columns_df1 = ["Full company name", "Year"]
common_columns_df2 = ["Company", "Year"]

for key, data in compare_tables(sample_company_list, sample_company_data, common_columns_df1, common_columns_df2).items():
    print(key)
    display(data)

in table 1 but not in table 2


Unnamed: 0,Full company name,Company type,Company ID number,Country,Year
3,D Corp,Type1,104,X,2023
4,E Corp,Type2,105,Y,2020
6,F Corp,Type3,106,Z,2023


in table 2 but not in table 1


Unnamed: 0,Company,Project name,Country,Year
3,E Corp,P4,X,2023
4,F Corp,P5,Y,2022
5,C Corp,P3,Z,2023
6,G Corp,P6,X,2021


in both tables


Unnamed: 0,Full company name,Company type,Company ID number,Country_x,Year,Company,Project name,Country_y
0,A Corp,Type1,101,X,2020,A Corp,P1,X
1,B Corp,Type2,102,Y,2021,B Corp,P2,Y
2,C Corp,Type3,103,Z,2022,C Corp,P3,Z
3,B Corp,Type2,102,Y,2021,B Corp,P2,Y


In [33]:
for key, data in compare_tables_drop_duplicates(sample_company_list, sample_company_data, common_columns_df1, common_columns_df2).items():
    print(key)
    display(data)

in table 1 but not in table 2


Unnamed: 0,Full company name,Company type,Company ID number,Country,Year
3,D Corp,Type1,104,X,2023
4,E Corp,Type2,105,Y,2020
6,F Corp,Type3,106,Z,2023


in table 2 but not in table 1


Unnamed: 0,Company,Project name,Country,Year
3,E Corp,P4,X,2023
4,F Corp,P5,Y,2022
5,C Corp,P3,Z,2023
6,G Corp,P6,X,2021


in both tables


Unnamed: 0,Full company name,Company type,Company ID number,Country_x,Year,Company,Project name,Country_y
0,A Corp,Type1,101,X,2020,A Corp,P1,X
1,B Corp,Type2,102,Y,2021,B Corp,P2,Y
2,C Corp,Type3,103,Z,2022,C Corp,P3,Z
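
The only difference between the two test outputs is the "in both tables" result: 4 rows without de-duplication, 3 rows with it. The sketch below reproduces that effect with plain `DataFrame.merge` calls on reduced copies of the sample tables. It is an independent illustration, not the notebook's `compare_tables` helpers, and the variable names are chosen here for the example.

```python
import pandas as pd

# Reduced copies of the sample tables above (only the columns needed for matching).
company_list = pd.DataFrame({
    "Full company name": ["A Corp", "B Corp", "C Corp", "D Corp", "E Corp", "B Corp", "F Corp"],
    "Year": [2020, 2021, 2022, 2023, 2020, 2021, 2023],
})
company_data = pd.DataFrame({
    "Company": ["A Corp", "B Corp", "C Corp", "E Corp", "F Corp", "C Corp", "G Corp"],
    "Year": [2020, 2021, 2022, 2023, 2022, 2023, 2021],
})

left_keys = ["Full company name", "Year"]
right_keys = ["Company", "Year"]

# Plain inner join: the repeated ("B Corp", 2021) key in the list matches twice.
both_raw = company_list.merge(company_data, left_on=left_keys, right_on=right_keys)

# De-duplicate each side on its key columns before joining.
list_unique = company_list.drop_duplicates(subset=left_keys)
data_unique = company_data.drop_duplicates(subset=right_keys)
both_unique = list_unique.merge(data_unique, left_on=left_keys, right_on=right_keys)

# Anti-join: rows of the list whose key has no counterpart in the data.
flagged = list_unique.merge(
    data_unique, left_on=left_keys, right_on=right_keys, how="left", indicator=True
)
only_in_list = flagged[flagged["_merge"] == "left_only"]

print(len(both_raw), len(both_unique), len(only_in_list))
```

The three lengths correspond to the 4-row and 3-row "in both tables" outputs and to the three rows (D Corp, E Corp, F Corp) reported as "in table 1 but not in table 2".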


In [None]:
# For each row, compare its values column by column with every other row
# This gives True (match) or False (mismatch) for each cell
# Flag the rows that have at most 2 mismatches (False values)

In [70]:
import pandas as pd

# Toy table for quick experiments: iterate over `sample` instead of `df_part_5` below to test the logic
sample = pd.DataFrame({
    "Full name of agency": ["A", "A", "B", "B", "C", "D", "E"],
    "Agency type": ["pri", "pri", "pri", "soe", "soe", "soe", "soe"],
    "Total reported": [0, 0, 1, 1, 2, 2, 5],
    "rowid": [0, 1, 2, 3, 4, 5, 6]
})

MAX_DIFFERENCES = 2  # a row is flagged if it differs from another row in at most this many cells

near_duplicates = []

for index, row in df_part_5.iterrows():
    other_rows = df_part_5.drop(index)  # every row except the current one
    # Boolean matrix: True where a cell equals the current row's value in that column.
    # Note: NaN never equals NaN, so missing values always count as differences.
    matches = (other_rows.values == row.values)
    # Keep the rows that differ from the current row in at most MAX_DIFFERENCES cells
    near_duplicates.append(other_rows[(~matches).sum(axis=1) <= MAX_DIFFERENCES])

problematic_rows = pd.concat(near_duplicates, ignore_index=False)

# A row can be flagged several times (once per near-duplicate); keep each flagged row once
unique_problematic_rows = problematic_rows.drop_duplicates()

display(unique_problematic_rows)
unique_problematic_rows.shape

Unnamed: 0,Company,Government entity,Revenue stream name,Levied on project (Y/N),Reported by project (Y/N),Project name,Reporting currency,Revenue value,Payment made in-kind (Y/N),In-kind volume (if applicable),Unit (if applicable),Comments,Country,ISO Code,Year,Start Date,End Date
1,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,386169944,No,Not applicable,Not applicable,2018-06-24,Afghanistan,AFG,2018,2017-12-21,2018-12-20
2,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,336623658,No,Not applicable,Not applicable,2018-04-18,Afghanistan,AFG,2018,2017-12-21,2018-12-20
3,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,300000000,No,Not applicable,Not applicable,2018-11-19,Afghanistan,AFG,2018,2017-12-21,2018-12-20
17,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,200000000,No,Not applicable,Not applicable,2018-12-18,Afghanistan,AFG,2018,2017-12-21,2018-12-20
18,North Coal Enterprise (NCE),Ministry of Mines and Petroleum (Revenue Depar...,Royalties,Yes,Yes,EXP 1/2014,AFN,100000000,No,Not applicable,Not applicable,2018-11-21,Afghanistan,AFG,2018,2017-12-21,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31375,ROXGOLD BURKINA FASO SARL,Direction Générale du Trésor et de la Comptabi...,Frais de dossier,Yes,Yes,ROXGOLD BURKINA FASO,XOF,40000,No,Non applicable,Non applicable,Non applicable,Burkina Faso,BFA,2019,2019-01-01,2019-12-31
31374,ROXGOLD BURKINA FASO SARL,Direction Générale du Trésor et de la Comptabi...,Taxe Superficiaire,Yes,Yes,ROXGOLD BURKINA FASO,XOF,10858296,No,Non applicable,Non applicable,Non applicable,Burkina Faso,BFA,2019,2019-01-01,2019-12-31
31837,Griffiths Energy DOH,Direction Générale Technique de Pétrole (DGTP),Frais de présentation du rapport annuel,No,No,Non applicable,USD,74332,No,Non applicable,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31
31838,Griffiths Energy CHAD,Direction Générale Technique de Pétrole (DGTP),Frais de présentation du rapport annuel,No,No,Non applicable,USD,74332,No,Non applicable,Non applicable,,Chad,TCD,2018,2018-01-01,2018-12-31


(1583, 17)

In [66]:
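# How to use the results
# For each flagged row:
#   1. Check whether the row has already been corrected. If yes, go to the next row.
#   2. If not, find the row it duplicates and the column(s) in which the two differ.
#   3. Correct the duplicate (resolve the inconsistency or mark a row for deletion)
#      and note that the duplicate with that row has been corrected.
#   4. Repeat with the next row.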

490
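
The review steps above need to know which row each flagged row duplicates and in which columns the two differ, which the flat list of problematic rows does not record. A minimal sketch of one way to get that information is shown here on the toy agency table. The function `near_duplicate_pairs` and its output column names are hypothetical, and unlike the loop over `df_part_5` it treats two missing values as equal. It compares every pair of rows in pure Python, so it is only suited to small tables or to the already-flagged subset.

```python
from itertools import combinations

import pandas as pd

sample = pd.DataFrame({
    "Full name of agency": ["A", "A", "B", "B", "C", "D", "E"],
    "Agency type": ["pri", "pri", "pri", "soe", "soe", "soe", "soe"],
    "Total reported": [0, 0, 1, 1, 2, 2, 5],
})


def near_duplicate_pairs(df, max_differences):
    """Return one record per pair of rows differing in at most `max_differences` cells."""
    records = []
    for (i, row_i), (j, row_j) in combinations(df.iterrows(), 2):
        # Treat two missing values as equal, unlike a plain `!=` comparison.
        differs = (row_i != row_j) & ~(row_i.isna() & row_j.isna())
        if differs.sum() <= max_differences:
            records.append({
                "row": i,
                "near-duplicate of": j,
                "differing columns": list(differs[differs].index),
            })
    return pd.DataFrame(records)


# With 3 columns, allowing 1 difference means at least 2 columns must match.
pairs = near_duplicate_pairs(sample, max_differences=1)
print(pairs)
```

On this table the sketch pairs rows 0 and 1 (identical), rows 2 and 3 (different `Agency type`) and rows 4 and 5 (different `Full name of agency`), so each pair can be reviewed and marked as corrected once.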