# EITI consolidated Summary Data files verification and validation
   
## Purpose
The purpose of this notebook is to verify assumptions that should hold true due to the nature of EITI's data and validate the usefulness of the data for different use-cases or business goals.

## Data
1. Consolidated summary data files of all countries and years divided into individual sheets (datasets).
   - [Part 1 - About](data/consolidated/Part%201%20-%20About.csv)
   - [Part 3 - Reporting companies' list](data/consolidated/Part%203%20-%20Reporting%20companies'%20list.csv)
   - Part 3 - Reporting government entities list
   - Part 3 - Reporting projects' list
   - Part 4 - Government revenues
   - Part 5 - Company data

## Methodology
1. [Import the libraries](#Import-Libraries)
2. [Import/load the data](#Import-the-data)
3. Verify the assumptions
4. Validate use-cases
5. Perform analysis and show simple statistics/visualizations about the data

## Assumptions to verify
1. [**COMPLETENESS AND CORRECTNESS**](#Completeness-and-Correctness)
    1. Cell values are complete and/or correct (i.e. no unexpected or incorrect values)
2. [**UNIQUENESS AND NON-DUPLICATION**](#Uniqueness-and-Non-duplication)
    1. No duplicate company/agency names in the same year and country
    2. No duplicate project names in the same year and country
    3. No duplicate company/agency IDs in the same year and country
    4. No duplicate company-project pair in the same year and country
    5. Unique (single) company ID per year and country    
3. [**CONSISTENCY**](#Consistency)
    1. Consistent company name over time and across the datasets
    2. Consistent company ID over time and across the datasets
    3. Consistent company name-company ID pair over time and across the datasets
    4. Consistent project name over time and across the datasets
    5. Consistent company-project pair over time and across the datasets
    6. Consistent company-company type pair over time and across the datasets
    7. Consistency of country names and ISO codes across the datasets
    8. Consistency of country name-ISO code pair across the datasets
    9. Consistency of commodities, commodities-project pair, commodities-unit pair

## Use-cases to validate (examples)
1. How has the extractives industry evolved over time in country X? What volume has been extracted over time? How much revenue has been generated?
2. How much tax do SOEs (state-owned enterprises, a company type) pay in each country? How does this compare with private companies?
3. What share of the market do SOEs hold in each country, measured by volume extracted, by revenue, and by taxes paid?

Add more use-cases. A use-case should demonstrate whether a reuser, coming with a specific question relevant to a real-life need, can answer that question using the data.

## Statistics and graphs
1. [How many years of data does each company/agency have?](#How-many-years-of-data-does-each-company/agency-have?)
2. [How many (and what companies/agencies) have changed their company type (i.e. from SOE to private and vice versa)?](#How-many-(and-what-companies/agencies)-have-changed-their-company-type-(i.e.-from-SOE-to-private-and-vice-versa)?)
3. [How many (and which) companies/agencies have had their names changed?](#How-many-(and-what)-companies/agencies-have-had-their-names-changed?)
4. [How many (and which) companies/agencies have had their IDs changed?](#How-many-(and-which)-companies/agencies-have-had-their-IDs-changed?)
5. [Instances of company/agency names being inputted incorrectly](#Instances-of-company/agency-names-being-inputted-incorrectly)

Show statistics and graphs for the assumptions verified and use-cases validated.
Also include:
- comparison among countries (e.g. the most companies, the most consistent data, the most number of issues, etc.)
- comparison among years (e.g. the most companies, the most consistent data, the most number of issues, etc.)
- statistics only for SOE agencies/companies (SOE = state-owned enterprises company type)

### Import Libraries

In [1]:
# import libraries

import pandas as pd
from os import path

### Import the data

In [20]:
file_dir = "data/consolidated/"

# load the csvs into data frames
df_part_1 = pd.read_csv(path.join(file_dir, "Part 1 - About.csv"))
df_part_3a = pd.read_csv(path.join(file_dir, "Part 3 - Reporting companies' list.csv"))
df_part_3b = pd.read_csv(path.join(file_dir, "Part 3 - Reporting government entities list.csv"))
df_part_3c = pd.read_csv(path.join(file_dir, "Part 3 - Reporting projects' list.csv"))
df_part_4 = pd.read_csv(path.join(file_dir, "Part 4 - Government revenues.csv"))
# df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv")) # results in warning about columns having mixed types
df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"), low_memory=False)

df_list = [df_part_1, df_part_3a, df_part_3b, df_part_3c, df_part_4, df_part_5]
df_dict = {"Part 1 - About.csv": df_part_1,
           "Part 3 - Reporting companies' list.csv": df_part_3a,
           "Part 3 - Reporting government entities list.csv": df_part_3b,
           "Part 3 - Reporting projects' list.csv": df_part_3c,
           "Part 4 - Government revenues.csv": df_part_4,
           "Part 5 - Company data.csv": df_part_5
          }

In [3]:
# Checking mixed type columns
mixed_type_columns = df_part_5.map(type).nunique() > 1

# Print or display the columns with mixed types
print("Columns with mixed types:")
print(mixed_type_columns[mixed_type_columns].index.tolist())

Columns with mixed types:
['Company', 'Government entity', 'Revenue stream name', 'Levied on project (Y/N)', 'Reported by project (Y/N)', 'Project name', 'Reporting currency', 'Revenue value', 'Payment made in-kind (Y/N)', 'In-kind volume (if applicable)', 'Unit (if applicable)', 'Comments', 'Country', 'ISO Code', 'Year', 'Start Date', 'End Date']
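To see which types are actually mixed in a flagged column, the cell types can be counted per column. The following is a minimal sketch on a toy frame; `toy` and `type_counts` are hypothetical names, and the values are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Toy column standing in for one column of df_part_5 (illustrative values only)
toy = pd.DataFrame({"Revenue value": ["100", 250.5, float("nan")]})

def type_counts(column):
    """Count how many cells of each Python type a column holds."""
    return column.map(lambda cell: type(cell).__name__).value_counts().to_dict()

# One dictionary of type counts per column
type_summary = {name: type_counts(toy[name]) for name in toy.columns}
print(type_summary)
```

Because pandas represents missing cells as float `NaN`, a column flagged as mixed may simply hold strings plus missing values, so a breakdown like this helps separate that case from columns that mix strings and numbers.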


## Assumptions to verify

### Completeness and Correctness

It is assumed that the data received are complete and correct. In this section, we check the tables for unexpected or incorrect values, such as instances of 'no data', 'NULL', or 'NaN'.

#### Expectations
- the data is complete

#### Assumptions
- there are no 'no data', 'NULL', or 'NaN' values in the dataframes

#### Results

In [4]:
# Functions to find and count NULL, '#ERROR!', empty-string, and whitespace-only values

def columns_with_null(df):
    return df.columns[df.isnull().any()]

def count_null_per_column(df):
    return df.isnull().sum()    

def columns_with_error(df):
    return df.columns[df.eq('#ERROR!').any()]

def count_error_per_column(df):
    return df.apply(lambda col: (col == '#ERROR!').sum())

def columns_with_empty_strings(df):
    return df.columns[df.eq('').any()]

def count_empty_strings_per_column(df):
    return df.apply(lambda col: (col == '').sum())

def columns_with_only_space(df):
    return df.columns[df.map(lambda cell: isinstance(cell, str) and cell.isspace()).any()]

def count_only_space_per_column(df):
    return df.map(lambda cell: isinstance(cell, str) and cell.isspace()).sum()
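As a quick sanity check, the four kinds of problem value can be counted side by side on a small frame. This is only a sketch: `quality_summary` and `toy` are hypothetical names that are not part of the notebook, and the data is made up.

```python
import pandas as pd

# Toy data standing in for one consolidated sheet (illustrative values only)
toy = pd.DataFrame({
    "Country": ["Ghana", "Iraq", "Iraq", None],
    "ISO Code": ["GHA", "#ERROR!", " ", ""],
})

def quality_summary(df):
    """Return one row per column with counts of each problem value."""
    # True where a cell is a string made up only of whitespace characters
    whitespace_only = df.apply(
        lambda col: col.map(lambda cell: isinstance(cell, str) and cell.isspace())
    )
    return pd.DataFrame({
        "null": df.isnull().sum(),
        "error": df.eq("#ERROR!").sum(),
        "empty": df.eq("").sum(),
        "whitespace": whitespace_only.sum(),
    })

summary = quality_summary(toy)
print(summary)
```

Note that an empty string is not counted as whitespace-only, because `"".isspace()` is `False`, so the two checks do not overlap.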

**The results below show that there are columns in the data that have NULL values**

In [5]:
for tab in df_dict:
    print("Columns with NULL values for {}:".format(tab))
    for column in columns_with_null(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with NULL: {}/{}".format(len(columns_with_null(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of NULL per column:")
    print(count_null_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with NULL values for Part 1 - About.csv:
- ISO Alpha-3 Code
- National currency name
- National currency ISO-4217
- Start Date
- End Date
- Has an EITI Report been prepared by an Independent Administrator?
- What is the name of the company?
- Date that the EITI Report was made public
- URL, EITI Report
- Does the government systematically disclose EITI data at a single location?
- Publication date of the EITI data
- Website link (URL) to EITI data
- Are there other files of relevance?
- Date that other file was made public
- URL
- Does the government have an open data policy?
- Open data portal / files
- Oil
- Gas
- Mining (incl. Quarrying)
- Other, non-upstream sectors
- If yes, please specify name (insert new rows if multiple)
- Number of reporting government entities (incl SOEs if recipient)
- Number of reporting companies (incl SOEs if payer)
- Reporting currency (ISO-4217 currency codes)
- Exchange rate used: 1 USD = 
- Exchange rate source (URL,…)
- … by revenue stream
- 

**The results below show that there are columns in the data that have '#ERROR!' values**

In [6]:
for tab in df_dict:
    print("Columns with '#ERROR!' values for {}:".format(tab))
    for column in columns_with_error(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with '#ERROR!': {}/{}".format(len(columns_with_error(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of '#ERROR!' per column:")
    print(count_error_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with '#ERROR!' values for Part 1 - About.csv:
- ISO Alpha-3 Code
- National currency name
- National currency ISO-4217

Number of columns with '#ERROR!': 3/39

Number of '#ERROR!' per column:
Country or area name                                                            0
ISO Alpha-3 Code                                                               20
National currency name                                                         20
National currency ISO-4217                                                     19
Start Date                                                                      0
End Date                                                                        0
Has an EITI Report been prepared by an Independent Administrator?               0
What is the name of the company?                                                0
Date that the EITI Report was made public                                       0
URL, EITI Report                                              

**The results below show that there are columns in the data that have empty strings ('')**

In [7]:
for tab in df_dict:
    print("Columns with '' values for {}:".format(tab))
    for column in columns_with_empty_strings(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with '': {}/{}".format(len(columns_with_empty_strings(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of '' per column:")
    print(count_empty_strings_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with '' values for Part 1 - About.csv:

Number of columns with '': 0/39

Number of '' per column:
Country or area name                                                           0
ISO Alpha-3 Code                                                               0
National currency name                                                         0
National currency ISO-4217                                                     0
Start Date                                                                     0
End Date                                                                       0
Has an EITI Report been prepared by an Independent Administrator?              0
What is the name of the company?                                               0
Date that the EITI Report was made public                                      0
URL, EITI Report                                                               0
Does the government systematically disclose EITI data at a single location?    0
Pub

**The results below show that there are columns in the data that contain whitespace-only values**

In [8]:
for tab in df_dict:
    print("Columns with only white spaces for {}:".format(tab))
    for column in columns_with_only_space(df_dict[tab]):
        print("- {}".format(column))
        
    print("\nNumber of columns with '': {}/{}".format(len(columns_with_only_space(df_dict[tab])), df_dict[tab].shape[1]))
    print("\nNumber of '' per column:")
    print(count_only_space_per_column(df_dict[tab]))
    print("\nTotal number of rows/observations: {}".format(df_dict[tab].shape[0]))
    print("\n------\n")
    

Columns with only white spaces for Part 1 - About.csv:

Number of columns with '': 0/39

Number of '' per column:
Country or area name                                                           0
ISO Alpha-3 Code                                                               0
National currency name                                                         0
National currency ISO-4217                                                     0
Start Date                                                                     0
End Date                                                                       0
Has an EITI Report been prepared by an Independent Administrator?              0
What is the name of the company?                                               0
Date that the EITI Report was made public                                      0
URL, EITI Report                                                               0
Does the government systematically disclose EITI data at a single location? 

### Uniqueness and Non-duplication

### No duplicate company/agency names in the same year and country

#### Expectations
- A company or agency name will appear only once per year per country

#### Assumptions
- The company or agency is only reported once per year regardless of sector/commodity

#### Results

In [44]:
# def get_duplicate_names(df, field):
#     return df[df.duplicated([field, 'Country', 'Year'], keep=False)].sort_values(by=[field])

def get_duplicates(df, field_to_check, *constraint_fields):
    return df[df.duplicated([field_to_check, *constraint_fields], keep=False)].sort_values(by=[field_to_check])

In [46]:
get_duplicates(df_part_3a, 'Full company name', *['Country', 'Year'])

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
1549,Al Waha Petroleum Co. Ltd.,Private,,Oil & Gas,"Oil, Gas, Condensates",https://www.petroalwaha.com/,,628712909.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1585,Al Waha Petroleum Co. Ltd.,Private,,Oil & Gas,"Oil, Gas, Condensates",,,628712909.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1552,CNOOC IRAQ LIMITED,Private,,Oil & Gas,"Oil, Gas, Condensates",https://cnoocinternational.com/operations/midd...,,1098601809.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1586,CNOOC IRAQ LIMITED,Private,,Oil & Gas,"Oil, Gas, Condensates",,,1098601809.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1564,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
1588,PT PERTAMINA IRAK,Private,,Oil & Gas,"Oil, Gas, Condensates",,,990135616.00,Iraq,IRQ,2018,1/1/2018,12/31/2018
497,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Nickel,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,864690546.00,Philippines,PHL,2018,2018-01-01,2018-12-31
519,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Limestone,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,,Philippines,PHL,2018,2018-01-01,2018-12-31
173,شرکت برادران خالد عزیز,Private,1019984010,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20
174,شرکت برادران خالد عزیز,Private,1019984010,Other,Coal,Not applicable,Not available,#ERROR!,Afghanistan,AFG,2018,2017-12-21,2018-12-20


#### Implications
- There are duplicate listings of companies in the same year in Iraq and Afghanistan
- The duplicate listing in the Philippines may be attributed to the difference in commodities
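One way to separate the two cases above is to count the distinct commodities behind each repeated name. The sketch below uses toy rows modelled on the output (names shortened); the variable names are hypothetical, and reading a single commodity as a "true" duplicate is an assumption.

```python
import pandas as pd

# Toy rows modelled on the output above (illustrative, shortened values)
toy = pd.DataFrame({
    "Full company name": ["Al Waha", "Al Waha", "Rio Tuba", "Rio Tuba"],
    "Commodities (comma-seperated)": ["Oil", "Oil", "Nickel", "Limestone"],
    "Country": ["Iraq", "Iraq", "Philippines", "Philippines"],
    "Year": [2018, 2018, 2018, 2018],
})

key = ["Full company name", "Country", "Year"]

# Rows whose name repeats within the same country and year
name_dupes = toy[toy.duplicated(key, keep=False)]

# Number of distinct commodities behind each repeated name
commodity_counts = name_dupes.groupby(key)["Commodities (comma-seperated)"].nunique()

# A single commodity suggests the same listing was entered more than once
true_dupes = commodity_counts[commodity_counts == 1]
# Several commodities suggest one listing per commodity
per_commodity = commodity_counts[commodity_counts > 1]

print(true_dupes)
print(per_commodity)
```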

In [47]:
get_duplicates(df_part_3b, 'Full name of agency', *['Country', 'Year'])

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
238,Mpohor Wassa East,Local government,,#ERROR!,Ghana,GHA,2019.0,2019-01-01,2019-12-31
252,Mpohor Wassa East,Local government,,,Ghana,GHA,2019.0,2019-01-01,2019-12-31
247,Obuasi Municipal Assembly,Local government,,#ERROR!,Ghana,GHA,2019.0,2019-01-01,2019-12-31
259,Obuasi Municipal Assembly,Local government,,,Ghana,GHA,2019.0,2019-01-01,2019-12-31


#### Implications
- There are duplicate listings of agencies for Ghana in 2019

### No duplicate project names in the same year and country

#### Expectations
- A project name will appear only once per year per country.

#### Assumptions
- A project is reported only once per year, country, and commodity

#### Results

In [52]:
duplicate_projects = get_duplicates(df_part_3c, 
                                    'Full project name', 
                                    *['Country', 'Year', 'Commodities (one commodity/row)'])

duplicate_projects

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
1239,0,,MCCM,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1109,0,,Bayan erdes,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1168,0,,Mon lid trade,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1185,0,,STBL,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
1196,0,,Terguun service,Not applicable,Not applicable,,Not applicable,Not applicable,Not applicable,Mongolia,MNG,2020,1/1/2020,12/31/2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1400,,,,,,,,,,,,,,
1402,,,,,,,,,,,,,,
1404,,,,,,,,,,,,,,
1406,,,,,,,,,,,,,,


In [83]:
duplicate_project_names = duplicate_projects['Full project name'].unique()

print("Number of duplicate project names: {}\n".format(len(duplicate_project_names.tolist())))

duplicate_project_counts = duplicate_projects['Full project name'].value_counts()
# print(duplicate_project_counts)

print("Number of duplicates per project")
for name, count in duplicate_project_counts.items():
    if count > 1:
        print(f"Name: {name}, Count: {count}")

Number of duplicate project names: 198

Number of duplicates per project
Name: NO OPERATOR, Count: 83
Name: Landesamt für Bergbau, Energie und Geologie (LBEG) Hannover, Lower Saxony, Count: 23
Name: TOTALENERGIES E&P UK LIMITED (00811900), Count: 20
Name: SHELL U.K. LIMITED (00140141), Count: 20
Name: REPSOL SINOPEC RESOURCES UK LIMITED (00825828), Count: 18
Name: Not applicable, Count: 17
Name: Full project name, Count: 16
Name: EQUINOR UK LIMITED (01285743), Count: 15
Name: ISLAND GAS LIMITED (04962079), Count: 13
Name: PERENCO UK LIMITED (04653066), Count: 12
Name: TAQA BRATANI LIMITED (05975475), Count: 11
Name: Non Applicable, Count: 11
Name: Area 526, 666,666 tpa, Culver Extension, Count: 10
Name: BP EXPLORATION OPERATING COMPANY LIMITED (00305943), Count: 9
Name: EGDON RESOURCES U.K. LIMITED (03424561), Count: 9
Name: ENQUEST HEATHER LIMITED (02748866), Count: 9
Name: Landesamt für Geologie und Bergbau, Mainz-Hechtsheim, Rhineland Palatinate, Count: 8
Name: PL-5/126 (Ntronang 2)

#### Implications
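- There is a significant number of duplicate project names across many of the countries.
- Several of the most frequent values are placeholders rather than project names (e.g. 'NO OPERATOR', 'Not applicable', 'Non Applicable'), and the value 'Full project name' suggests that header rows were ingested as data.
- There is a need to count the duplicates per country to see whether they should be expected.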
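Placeholder values can inflate the duplicate count, so it may help to drop them before checking. The following is a minimal sketch on toy data: the `PLACEHOLDERS` set is an assumption based on the output above, and `is_placeholder` is a hypothetical helper.

```python
import pandas as pd

# Values treated as placeholders are an assumption based on the output above
PLACEHOLDERS = {"no operator", "not applicable", "non applicable", "full project name", "0"}

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Full project name": ["NO OPERATOR", "NO OPERATOR", "Claymore", "Claymore",
                          "Not applicable", None],
    "Country": ["United Kingdom"] * 4 + ["Mongolia", "Mongolia"],
    "Year": [2021, 2021, 2021, 2021, 2020, 2020],
})

def is_placeholder(name):
    """True for missing names and for strings in the placeholder set (case-insensitive)."""
    return not isinstance(name, str) or name.strip().lower() in PLACEHOLDERS

# Keep only rows with a real project name
placeholder_mask = toy["Full project name"].map(is_placeholder).astype(bool)
real_names = toy[~placeholder_mask]

# Duplicates among the remaining names, within the same country and year
real_dupes = real_names[
    real_names.duplicated(["Full project name", "Country", "Year"], keep=False)
]
print(real_dupes)
```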

In [90]:
# pd.set_option('display.max_rows', None)

project_counts_per_country = duplicate_projects.groupby('Country')['Full project name'].value_counts()

duplicates_per_country = project_counts_per_country[project_counts_per_country > 1]

print("Number of duplicate project names per country:")
print(duplicates_per_country)

# pd.reset_option('display.max_rows')

Number of duplicate project names per country:
Country         Full project name                                                             
Afghanistan     SSML-Kabu 13/2012                                                                 6
                SSML-Kabu 42/2014                                                                 4
                SSML-Kabu 2/2008                                                                  4
                SSML-Kabu 19/2016                                                                 4
                SSML-Kabu 10/2016                                                                 4
                                                                                                 ..
United Kingdom  EGDON RESOURCES U.K. LIMITED (03424561), REGENT PARK ENERGY LIMITED (04557422)    2
                Boulby Offshore Mine                                                              2
                Claymore                                  

In [93]:
df_part_3c[df_part_3c['Full project name'] == 'CIRQUE ENERGY (UK) LIMITED (03080778)']

Unnamed: 0,Full project name,"Legal agreement reference number(s): contract, licence, lease, concession, …","Affiliated companies, start with Operator",Commodities (one commodity/row),Status,Production (volume),Unit,Production (value),Currency,Country,ISO Code,Year,Start Date,End Date
6001,CIRQUE ENERGY (UK) LIMITED (03080778),PEDL324,"CIRQUE ENERGY (UK) LIMITED (03080778), STELINM...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31
6008,CIRQUE ENERGY (UK) LIMITED (03080778),PEDL348,"CIRQUE ENERGY (UK) LIMITED (03080778), STELINM...",Not available,Not applicable,Not applicable,,,GBP,United Kingdom,GBR,2021,2021-01-01,2021-12-31


### No duplicate company/agency IDs in the same year and country

#### Expectations
- A company ID will appear only once per year per country.

#### Assumptions
- 1 company ID = 1 company in a country

#### Results

In [94]:
duplicate_ids = get_duplicates(df_part_3a,
                               'Company ID number',
                               *['Country', 'Year'])

duplicate_ids

Unnamed: 0,Full company name,Company type,Company ID number,Sector,Commodities (comma-seperated),Stock exchange listing or company website,"Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)",Payments to Governments Report,Country,ISO Code,Year,Start Date,End Date
519,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Limestone,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,,Philippines,PHL,2018,2018-01-01,2018-12-31
497,Rio Tuba Nickel Mining Corporation,Private,000-142-665-000,Mining,Nickel,nickelasia.com/subsidiaries/rio-tuba-nickel-mi...,nickelasia.com/investor-relations/financial-re...,864690546.00,Philippines,PHL,2018,2018-01-01,2018-12-31
878,HOUNDE GOLD OPERATION SA,Private,00064526S,Mining,Or,Indisponible,,39636048842.00,Burkina Faso,BFA,2019,1/1/2019,12/31/2019
877,HOUNDE EXPLORATION BF SARL,Private,00064526S,Mining,Or,https://www.endeavourmining.com/our-portfolio/...,,7879789.00,Burkina Faso,BFA,2019,1/1/2019,12/31/2019
872,HOUNDE EXPLORATION BF SARL,Private,00064526S,Mining,Gold,https://www.endeavourmining.com/our-portfolio/...,,13130988.00,Burkina Faso,BFA,2018,1/1/2018,12/31/2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3216,GX Technology,,,Oil & Gas,,,,,Seychelles,SYC,2018,2018-01-01,2018-12-31
3217,JOGMEX,,,Oil & Gas,,,,,Seychelles,SYC,2018,2018-01-01,2018-12-31
3242,GRAYSON BANDA,Private,,Mining,Gold,Not available,Not available,3126118498.00,Tanzania,TZA,2018,2017-07-01,2018-06-30
3256,TANZANIA PORTLAND CEMENT PUBLIC LIMITED COMPANY,Private,,Mining,"Limeston, Sandstone",Not available,Not available,24750760174.02,Tanzania,TZA,2018,2017-07-01,2018-06-30


In [95]:
duplicate_id_names = duplicate_ids['Company ID number'].unique()

print("Number of duplicate IDs: {}\n".format(len(duplicate_id_names.tolist())))

duplicate_id_counts = duplicate_ids['Company ID number'].value_counts()
# print(duplicate_id_counts)

print("Number of duplicate IDS")
for company_id, count in duplicate_id_counts.items():
    if count > 1:
        print(f"Name: {company_id}, Count: {count}")

Number of duplicate IDs: 36

Number of duplicate IDS
Name: Not applicable, Count: 81
Name: Not available, Count: 40
Name: NC, Count: 37
Name: Not communicated, Count: 19
Name: Not Available, Count: 8
Name: 1013655012, Count: 5
Name: 983426417, Count: 5
Name: 1009592088, Count: 4
Name: 1010360087, Count: 4
Name: 1016525014, Count: 4
Name: 9001044669, Count: 4
Name: 2700773, Count: 4
Name: 9001204305, Count: 4
Name: 1007815085, Count: 4
Name: 9000197187, Count: 4
Name: 00064526S, Count: 4
Name: 1008627083, Count: 3
Name: 931713671, Count: 2
Name: 000-142-665-000, Count: 2
Name: Nc, Count: 2
Name: 9003329902, Count: 2
Name: RC321517, Count: 2
Name: 919160675, Count: 2
Name: 9000647140, Count: 2
Name: 9002301225, Count: 2
Name: 9001505461, Count: 2
Name: 9001353375, Count: 2
Name: 9000036856, Count: 2
Name: 761,1998-1999, Count: 2
Name: 1384,1998-1999, Count: 2
Name: 108744308, Count: 2
Name: 1044142014, Count: 2
Name: 103947189, Count: 2
Name: 1019984010, Count: 2
Name: RC322270, Count: 2

#### Implications
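- There is a significant number of duplicate company ID values
- Most are placeholder values entered where an ID is unavailable (e.g. 'Not applicable', 'Not available', 'NC', 'Not communicated'), or missing values, rather than genuine repeated IDs
- Some genuine IDs are shared by differently named companies (e.g. 00064526S is listed for both HOUNDE GOLD OPERATION SA and HOUNDE EXPLORATION BF SARL in Burkina Faso in 2019)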
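To focus on genuine ID clashes, placeholder IDs can be blanked out before counting how many company names share each ID. This is a sketch under stated assumptions: the `ID_PLACEHOLDERS` set is taken from the output above, treating those spellings as "no ID" is an assumption, and the rows for Company A, B, and C are invented.

```python
import pandas as pd

# Placeholder spellings from the output above; treating them as "no ID" is an assumption
ID_PLACEHOLDERS = {"not applicable", "not available", "nc", "not communicated"}

toy = pd.DataFrame({
    "Full company name": ["HOUNDE GOLD OPERATION SA", "HOUNDE EXPLORATION BF SARL",
                          "Company A", "Company B", "Company C"],
    "Company ID number": ["00064526S", "00064526S", "NC", "Nc", None],
    "Country": ["Burkina Faso"] * 5,
    "Year": [2019] * 5,
})

# Strip whitespace, then replace placeholder IDs (case-insensitive) with NaN
ids = toy["Company ID number"].str.strip()
clean = toy.assign(**{"Company ID number": ids.mask(ids.str.lower().isin(ID_PLACEHOLDERS))})

# Count distinct company names behind each real ID within a country and year
names_per_id = (
    clean.dropna(subset=["Company ID number"])
         .groupby(["Country", "Year", "Company ID number"])["Full company name"]
         .nunique()
)

# IDs used by more than one company name
shared_ids = names_per_id[names_per_id > 1]
print(shared_ids)
```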

### Consistency
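The consistency assumptions listed at the top of this notebook could be checked along the following lines. This is only a sketch for the company name-company ID pair, using invented toy rows; the same grouping pattern would apply to the other pairs.

```python
import pandas as pd

# Toy data (invented values) using the column names of the reporting companies' list
toy = pd.DataFrame({
    "Full company name": ["Acme Mining", "ACME Mining Ltd", "Beta Oil", "Beta Oil"],
    "Company ID number": ["A-1", "A-1", "B-7", "B-7"],
    "Country": ["Ghana", "Ghana", "Iraq", "Iraq"],
    "Year": [2018, 2019, 2018, 2019],
})

# For each ID within a country, collect the distinct names used across all years
names_by_id = toy.groupby(["Country", "Company ID number"])["Full company name"].unique()

# IDs reported under more than one name break the name-ID consistency assumption
inconsistent = names_by_id[names_by_id.map(len) > 1]
print(inconsistent)
```

Spelling variants such as the two "Acme" names above would be flagged here, so the result is a list of candidates to review rather than confirmed errors.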

## Use-cases to validate
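As an illustration of the SOE use-cases, the share of reported revenue by company type could be computed by joining the company data to the reporting companies' list. The sketch below uses invented toy frames in place of `df_part_5` and `df_part_3a`; the `"SOE"` label is a hypothetical value, since the actual spelling used in the `Company type` column is not shown in this notebook.

```python
import pandas as pd

# Toy stand-ins for df_part_3a and df_part_5; the "SOE" label is a hypothetical value
companies = pd.DataFrame({
    "Full company name": ["National Oil Co", "Private Drilling Ltd"],
    "Company type": ["SOE", "Private"],
    "Country": ["Iraq", "Iraq"],
    "Year": [2018, 2018],
})
payments = pd.DataFrame({
    "Company": ["National Oil Co", "Private Drilling Ltd", "Private Drilling Ltd"],
    "Revenue value": ["300", "100", "#ERROR!"],
    "Country": ["Iraq", "Iraq", "Iraq"],
    "Year": [2018, 2018, 2018],
})

# Coerce revenue to numbers; unparseable cells such as '#ERROR!' become NaN
payments = payments.assign(revenue=pd.to_numeric(payments["Revenue value"], errors="coerce"))

# Attach the company type to each payment row
merged = payments.merge(
    companies,
    left_on=["Company", "Country", "Year"],
    right_on=["Full company name", "Country", "Year"],
    how="left",
)

# Total revenue per company type, then each type's share within its country and year
totals = merged.groupby(["Country", "Year", "Company type"])["revenue"].sum()
share = totals / totals.groupby(level=["Country", "Year"]).transform("sum")
print(share)
```

Shares are computed within each country and year only, because the `Reporting currency` column indicates that values are not reported in a single currency across countries.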

## Statistics and graphs

### How many years of data does each company/agency have?
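A possible starting point is to count the distinct years in which each company appears within a country. This is a minimal sketch on invented toy rows; it assumes a company is identified by its name, which the consistency checks may show to be unreliable.

```python
import pandas as pd

# Toy data (invented values); the second 2019 row mimics a duplicate listing
toy = pd.DataFrame({
    "Full company name": ["Acme Mining", "Acme Mining", "Acme Mining", "Beta Oil"],
    "Country": ["Ghana", "Ghana", "Ghana", "Iraq"],
    "Year": [2018, 2019, 2019, 2020],
})

# Distinct reporting years per company, so duplicate listings are not double-counted
years_per_company = (
    toy.groupby(["Country", "Full company name"])["Year"]
       .nunique()
       .sort_values(ascending=False)
)
print(years_per_company)
```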

### How many (and what companies/agencies) have changed their company type (i.e. from SOE to private and vice versa)?
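Companies whose type changed could be found by counting the distinct `Company type` values recorded for each company. The sketch below uses invented toy rows, and the `"SOE"` label is a hypothetical value rather than one confirmed in the data.

```python
import pandas as pd

# Toy data (invented values); "SOE" is a hypothetical company type label
toy = pd.DataFrame({
    "Full company name": ["Acme Mining", "Acme Mining", "Beta Oil", "Beta Oil"],
    "Company type": ["SOE", "Private", "Private", "Private"],
    "Country": ["Ghana", "Ghana", "Iraq", "Iraq"],
    "Year": [2018, 2019, 2018, 2019],
})

# Number of distinct company types recorded for each company within a country
type_counts = toy.groupby(["Country", "Full company name"])["Company type"].nunique()

# More than one distinct type means the recorded type changed at some point
changed = type_counts[type_counts > 1]
print(changed)
```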

### How many (and what) companies/agencies have had their names changed?

### How many (and which) companies/agencies have had their IDs changed?

### Instances of company/agency names being inputted incorrectly