# Duplication and Uniqueness

## Overall purpose and objective
The cleaning and verification process prepares the data for conversion into a SQLite database (Datasette). As such, the data should follow database best practices.

## Specific purpose of this notebook
This notebook checks the data for duplicates. In particular, we want to check for:
- Duplicate instances in data tables
- Duplicate instances of company and agency names and/or IDs
- Duplicate instances of projects

## Assumptions
- Companies, agencies, and projects should be uniquely identifiable in the data using a combination of fields such as name, ID, year, and country
- A company or agency is only reported once per year in the Reporting companies and Reporting government entities lists
- A project is only reported once per year, country, and commodity in the Reporting projects' list
- A company/agency should only have 1 ID and each ID should only pertain to 1 company/agency
- Duplicate entries would be considered as potential data entry errors or inconsistencies
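
The one-ID-per-entity assumption can be tested in both directions. The following is a minimal sketch on made-up data; the column names `name` and `id` and the helper `one_to_many` are placeholders for illustration, not the real headers or functions used in this notebook.

```python
import pandas as pd

# Hypothetical sample; "name" and "id" are placeholder column names,
# not the real headers of the reporting lists.
entities = pd.DataFrame({
    "name": ["Alpha Mining", "Alpha Mining", "Beta Oil", "Gamma Gas", "Delta Co"],
    "id":   ["001", "002", "003", "003", "004"],
})

def one_to_many(df, key, value):
    """Return the `key` values that map to more than one distinct `value`."""
    counts = df.dropna(subset=[key, value]).groupby(key)[value].nunique()
    return counts[counts > 1].index.tolist()

# Entities that appear with several IDs
names_with_many_ids = one_to_many(entities, "name", "id")
# IDs that are shared by several entities
ids_with_many_names = one_to_many(entities, "id", "name")

print(names_with_many_ids)
print(ids_with_many_names)
```

Rows with a missing name or ID are skipped here, on the assumption that a missing ID should be reviewed separately rather than counted as a conflict.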

## Why this matters 

## Findings
In general:


More specific findings are discussed below.

### 3A
- Rows that have 0-2 different columns

In [2]:
# import libraries and data

import pandas as pd
import numpy as np
from os import path
from functools import reduce
from pprint import pprint
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from itertools import combinations

file_dir = "data/consolidated/"

# load the csvs into data frames
df_part_1 = pd.read_csv(path.join(file_dir, "Part 1 - About.csv"))
df_part_3a = pd.read_csv(path.join(file_dir, "Part 3 - Reporting companies' list.csv"))
df_part_3b = pd.read_csv(path.join(file_dir, "Part 3 - Reporting government entities list.csv"))
df_part_3c = pd.read_csv(path.join(file_dir, "Part 3 - Reporting projects' list.csv"))
df_part_4 = pd.read_csv(path.join(file_dir, "Part 4 - Government revenues.csv"))
df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"))
# df_part_5 = pd.read_csv(path.join(file_dir, "Part 5 - Company data.csv"), low_memory=False)

df_list = [df_part_1, df_part_3a, df_part_3b, df_part_3c, df_part_4, df_part_5]
df_dict = {"Part 1 - About.csv": df_part_1,
           "Part 3 - Reporting companies' list.csv": df_part_3a,
           "Part 3 - Reporting government entities list.csv": df_part_3b,
           "Part 3 - Reporting projects' list.csv": df_part_3c,
           "Part 4 - Government revenues.csv": df_part_4,
           "Part 5 - Company data.csv": df_part_5
          }

# OPTIONAL COLUMNS
part_3a_opt = ["Stock exchange listing or company website", 
               "Audited financial statement (or balance sheet, cash flows, profit/loss statement if unavailable)"]
part_3b_opt = ["ID number (if applicable)"]
part_5_opt = ["In-kind volume (if applicable)", "Unit (if applicable)", "Comments"]

# only include fields that are non-optional
df_part_1_non_opt = df_part_1.copy()
df_part_3a_non_opt = df_part_3a.drop(columns=part_3a_opt)  # drop() already returns a new frame
df_part_3b_non_opt = df_part_3b.drop(columns=part_3b_opt)
df_part_3c_non_opt = df_part_3c.copy()
df_part_4_non_opt = df_part_4.copy()
df_part_5_non_opt = df_part_5.drop(columns=part_5_opt)

df_list_non_opt = [df_part_1_non_opt, df_part_3a_non_opt, df_part_3b_non_opt, df_part_3c_non_opt, df_part_4_non_opt, df_part_5_non_opt]
df_dict_non_opt = {"Part 1 - About.csv": df_part_1_non_opt,
           "Part 3 - Reporting companies' list.csv": df_part_3a_non_opt,
           "Part 3 - Reporting government entities list.csv": df_part_3b_non_opt,
           "Part 3 - Reporting projects' list.csv": df_part_3c_non_opt,
           "Part 4 - Government revenues.csv": df_part_4_non_opt,
           "Part 5 - Company data.csv": df_part_5_non_opt
          }

In [10]:
# Usable columns for telling rows apart
columns = ["Full name of agency", "Agency type", "Total reported"]
# columns = df_part_3b.columns

# Print every pair of usable columns
for combo in combinations(columns, 2):
    print(combo)

('Full name of agency', 'Agency type')
('Full name of agency', 'Total reported')
('Agency type', 'Total reported')


## Workflow

1. Declare usable columns
2. Declare the minimum number of columns in which any 2 rows must differ
3. Translate the assumption into a logical analysis
4. Implement the analysis
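
The four steps above could be wrapped into one helper. This is a sketch under stated assumptions, not the notebook's actual implementation: the function name `rows_too_similar`, its parameters, and the toy table are all hypothetical. It relies on the observation that two rows differing in fewer than `min_diff` of the usable columns must agree on at least `len(usable_cols) - min_diff + 1` of them.

```python
from itertools import combinations

import pandas as pd

def rows_too_similar(df, usable_cols, min_diff):
    """Flag rows that differ from some other row in fewer than `min_diff`
    of the usable columns (hypothetical helper for illustration)."""
    # Rows that differ in fewer than min_diff columns agree on at least this many.
    n_same = len(usable_cols) - min_diff + 1
    mask = pd.Series(False, index=df.index)
    for subset in combinations(usable_cols, n_same):
        # keep=False marks every member of a duplicate group, not just repeats.
        mask |= df.duplicated(subset=list(subset), keep=False)
    return df[mask]

# Toy data with made-up values
toy = pd.DataFrame({
    "name":  ["X", "X", "Y", "Z"],
    "type":  ["a", "a", "a", "b"],
    "total": [1, 2, 3, 3],
})
cols = ["name", "type", "total"]

flagged = rows_too_similar(toy, cols, min_diff=2)
print(flagged)
```

With `min_diff=1` the same helper reduces to a plain full-duplicate check on the usable columns.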

In [None]:
# 1. This table has 3 usable columns for differentiating rows (see above).
# 2. In this case, "Full name of agency" and "Total reported" should be unique.
#    The assumption is that any 2 rows should differ in at least 2 columns.
# 3. The first check is for rows that differ in fewer than 2 columns (0 or 1).

# STEP A. Check for full duplicates (0 columns of difference).

# STEP B. Check for 1 column of difference, i.e. duplicates based on 2 columns:
# 2 out of 3 columns are the same.
# Check every combination of 2 columns that could be the same.

In [98]:
df_part_3b[df_part_3b["Total reported"].isna()]

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
13,Department of Budget and Management (DBM),Central goverment,000-449-457-000,,Philippines,PHL,2018.0,2018-01-01,2018-12-31
17,Philippine Natioanl Oil Company (PNOC),State-owned enterprises & public corporations,000-169-576-000,,Philippines,PHL,2018.0,2018-01-01,2018-12-31
18,Philippine Minding Development Corporation (PDMC),State-owned enterprises & public corporations,225-860-806-000,,Philippines,PHL,2018.0,2018-01-01,2018-12-31
30,Electric energy distribution system operator (...,State-owned enterprises & public corporations,,,Albania,ALB,2017.0,1/1/2017,12/31/2017
41,Electric Energy Distribution System Operator (...,State-owned enterprises & public corporations,,,Albania,ALB,2018.0,1/1/2018,12/31/2018
...,...,...,...,...,...,...,...,...,...
479,Other Govt. Agency,Other,,,Tanzania,TZA,2018.0,2017-07-01,2018-06-30
488,Les delegations speciales des communes et pref...,Local government,Not applicable,,Togo,TGO,2017.0,2017-01-01,2017-12-31
493,Agence Nationale de Gestion de l'Environnement...,Central government,,,Togo,TGO,2018.0,2018-01-01,2018-12-31
495,Togolaise des Eaux (TdE),Central government,,,Togo,TGO,2018.0,2018-01-01,2018-12-31


In [99]:
df_part_3b[df_part_3b["Total reported"] == "#ERROR!"]

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22
3,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24
...,...,...,...,...,...,...,...,...,...
530,Crown Estate Scotland (CES),State government,Not applicable,#ERROR!,United Kingdom,GBR,2020.0,2020-01-01,2020-12-31
531,Her Majesty’s Revenue and Customs (HMRC),Central goverment,Not applicable,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31
532,Oil & Gas Authority (OGA),Other,9666504,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31
533,The Crown Estate (TCE),State government,Not applicable,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31


In [78]:
table_with_rowid = df_part_3b.copy()
table_with_rowid["rowid"] = range(0, len(table_with_rowid))

table_with_rowid

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date,rowid
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,0
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21,1
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22,2
3,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23,3
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...
544,Ministry of Mines and Minerals Development,Central goverment,,48229839.93,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,544
545,Environmental Protection Fund,Other,,23330601.78,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,545
546,Ministry of Lands,Central goverment,,1756688.47,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,546
547,IDC,Other,,69205641.67,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,547


In [79]:
table_copy = table_with_rowid.copy()

three_cols_dup = table_copy.duplicated(subset=["Full name of agency", "Agency type", "Total reported"], keep=False)
dup3 = table_copy[three_cols_dup]
dup3

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date,rowid
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,0
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21,1
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22,2
3,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23,3
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...
530,Crown Estate Scotland (CES),State government,Not applicable,#ERROR!,United Kingdom,GBR,2020.0,2020-01-01,2020-12-31,530
531,Her Majesty’s Revenue and Customs (HMRC),Central goverment,Not applicable,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,531
532,Oil & Gas Authority (OGA),Other,9666504,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,532
533,The Crown Estate (TCE),State government,Not applicable,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,533


In [47]:
dup3["Total reported"].nunique()

2
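
The outputs above suggest that many flagged rows share a placeholder (`#ERROR!`) or a missing value in `Total reported` rather than a real amount. One possible way to keep such rows from dominating the duplicate check is to treat placeholders as missing and compare only rows with a known amount. The sketch below uses a toy table with made-up values; the `PLACEHOLDERS` list is an assumption about which strings mark missing data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the government entities list (hypothetical values).
toy = pd.DataFrame({
    "Full name of agency": ["Agency A", "Agency B", "Agency C", "Agency D"],
    "Agency type": ["Other", "Other", "Other", "Other"],
    "Total reported": ["#ERROR!", "#ERROR!", np.nan, "100.5"],
})

# Assumption: these strings mark missing values, not real amounts.
PLACEHOLDERS = ["#ERROR!"]

cleaned = toy.copy()
is_placeholder = cleaned["Total reported"].isin(PLACEHOLDERS)
# Blank out placeholders, then convert the remaining strings to numbers.
cleaned["Total reported"] = pd.to_numeric(
    cleaned["Total reported"].mask(is_placeholder), errors="coerce"
)

subset = ["Agency type", "Total reported"]
# Only compare rows where the amount is actually known.
known = cleaned.dropna(subset=["Total reported"])

raw_hits = int(toy.duplicated(subset=subset, keep=False).sum())
clean_hits = int(known.duplicated(subset=subset, keep=False).sum())
print(raw_hits, clean_hits)
```

Rows with placeholder or missing amounts would still need their own review; this only separates them from genuine value collisions.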

In [81]:
two_cols_dup1 = table_copy.duplicated(subset=["Full name of agency", "Agency type"], keep=False)
dup21 = table_copy[two_cols_dup1]

In [89]:
dup21

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date,rowid
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,0
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21,1
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22,2
3,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23,3
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...
541,Zambian Revenue Authority (ZRA),Central goverment,,13273065885.75,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,541
543,Local Councils,Local government,,225046694.88,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,543
544,Ministry of Mines and Minerals Development,Central goverment,,48229839.93,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,544
545,Environmental Protection Fund,Other,,23330601.78,Zambia,ZMB,2018.0,2018-01-01,2018-12-31,545


In [88]:
two_cols_dup2 = table_copy.duplicated(subset=["Full name of agency", "Total reported"], keep=False)
dup22 = table_copy[two_cols_dup2]
dup22

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date,rowid
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,0
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21,1
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22,2
3,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23,3
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...
530,Crown Estate Scotland (CES),State government,Not applicable,#ERROR!,United Kingdom,GBR,2020.0,2020-01-01,2020-12-31,530
531,Her Majesty’s Revenue and Customs (HMRC),Central goverment,Not applicable,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,531
532,Oil & Gas Authority (OGA),Other,9666504,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,532
533,The Crown Estate (TCE),State government,Not applicable,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,533


In [87]:
two_cols_dup3 = table_copy.duplicated(subset=["Agency type", "Total reported"], keep=False)
dup23 = table_copy[two_cols_dup3]
dup23

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date,rowid
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,0
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21,1
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22,2
3,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23,3
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...
530,Crown Estate Scotland (CES),State government,Not applicable,#ERROR!,United Kingdom,GBR,2020.0,2020-01-01,2020-12-31,530
531,Her Majesty’s Revenue and Customs (HMRC),Central goverment,Not applicable,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,531
532,Oil & Gas Authority (OGA),Other,9666504,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,532
533,The Crown Estate (TCE),State government,Not applicable,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,533


In [85]:
combined = pd.concat([dup3, dup21, dup22, dup23], ignore_index=False)
combined

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date,rowid
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,0
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21,1
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22,2
3,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23,3
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...
530,Crown Estate Scotland (CES),State government,Not applicable,#ERROR!,United Kingdom,GBR,2020.0,2020-01-01,2020-12-31,530
531,Her Majesty’s Revenue and Customs (HMRC),Central goverment,Not applicable,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,531
532,Oil & Gas Authority (OGA),Other,9666504,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,532
533,The Crown Estate (TCE),State government,Not applicable,#ERROR!,United Kingdom,GBR,2021.0,2021-01-01,2021-12-31,533


In [90]:
prob_rows = combined.drop_duplicates()
prob_rows

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date,rowid
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,0
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21,1
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22,2
3,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23,3
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...
479,Other Govt. Agency,Other,,,Tanzania,TZA,2018.0,2017-07-01,2018-06-30,479
488,Les delegations speciales des communes et pref...,Local government,Not applicable,,Togo,TGO,2017.0,2017-01-01,2017-12-31,488
493,Agence Nationale de Gestion de l'Environnement...,Central government,,,Togo,TGO,2018.0,2018-01-01,2018-12-31,493
495,Togolaise des Eaux (TdE),Central government,,,Togo,TGO,2018.0,2018-01-01,2018-12-31,495


In [93]:
yc = prob_rows.duplicated(["Full name of agency", "Agency type", "Total reported", "Year", "Country"])
prob_rows[yc]

Unnamed: 0,Full name of agency,Agency type,ID number (if applicable),Total reported,Country,ISO Code,Year,Start Date,End Date,rowid
0,Ministry of Finance (Revenue Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-20,0
1,Ministry of Finance (Customs Department),Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-21,1
2,Ministry of Mines and Petroleum (Revenue Depar...,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-22,2
3,National Environmental Protection Agency,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-23,3
4,Ministry of Industry and Commerce,Central goverment,Not applicable,#ERROR!,Afghanistan,AFG,2018.0,2017-12-21,2018-12-24,4
...,...,...,...,...,...,...,...,...,...,...
479,Other Govt. Agency,Other,,,Tanzania,TZA,2018.0,2017-07-01,2018-06-30,479
488,Les delegations speciales des communes et pref...,Local government,Not applicable,,Togo,TGO,2017.0,2017-01-01,2017-12-31,488
493,Agence Nationale de Gestion de l'Environnement...,Central government,,,Togo,TGO,2018.0,2018-01-01,2018-12-31,493
495,Togolaise des Eaux (TdE),Central government,,,Togo,TGO,2018.0,2018-01-01,2018-12-31,495
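
Note that `DataFrame.duplicated` defaults to `keep="first"`, which marks only the second and later occurrences of each key, whereas the earlier cells pass `keep=False` to mark every member of a duplicate group. A small sketch on made-up rows shows the difference when year and country are part of the key:

```python
import pandas as pd

# Toy rows with made-up agencies; only the key columns are included.
toy = pd.DataFrame({
    "Full name of agency": ["Agency A", "Agency A", "Agency A", "Agency B"],
    "Year": [2018, 2018, 2019, 2018],
    "Country": ["X", "X", "X", "X"],
})
key = ["Full name of agency", "Year", "Country"]

# Default keep="first": the first occurrence of each key is NOT flagged.
later_only = toy.duplicated(subset=key)
# keep=False: every row that shares its key with another row is flagged.
all_members = toy.duplicated(subset=key, keep=False)

print(toy[later_only])
print(toy[all_members])
```

Which variant fits depends on the goal: `keep=False` is convenient for manual review of whole groups, while the default is convenient for deciding which rows to drop.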


In [71]:
synth = pd.DataFrame({
    "Full name of agency": ["A", "A", "B", "B", "C", "D", "E"],
    "Agency type": ["pri", "pri", "pri", "soe", "soe", "soe", "soe"],
    "Total reported": [0, 0, 1, 1, 2, 2, 5],
    "RowID": [0, 1, 2, 3, 4, 5, 6]
})

synth

Unnamed: 0,Full name of agency,Agency type,Total reported,RowID
0,A,pri,0,0
1,A,pri,0,1
2,B,pri,1,2
3,B,soe,1,3
4,C,soe,2,4
5,D,soe,2,5
6,E,soe,5,6


In [72]:
synth_table = synth.copy()

three_cols_dup = synth_table.duplicated(subset=["Full name of agency", "Agency type", "Total reported"], keep=False)
dup3 = synth_table[three_cols_dup]
dup3

Unnamed: 0,Full name of agency,Agency type,Total reported,RowID
0,A,pri,0,0
1,A,pri,0,1


In [73]:
synth_table = synth.copy()

two_cols_dup1 = synth_table.duplicated(subset=["Full name of agency", "Agency type"], keep=False)
dup21 = synth_table[two_cols_dup1]
dup21

Unnamed: 0,Full name of agency,Agency type,Total reported,RowID
0,A,pri,0,0
1,A,pri,0,1


In [74]:
synth_table = synth.copy()

two_cols_dup2 = synth_table.duplicated(subset=["Agency type", "Total reported"], keep=False)
dup22 = synth_table[two_cols_dup2]
dup22

Unnamed: 0,Full name of agency,Agency type,Total reported,RowID
0,A,pri,0,0
1,A,pri,0,1
4,C,soe,2,4
5,D,soe,2,5


In [75]:
synth_table = synth.copy()

two_cols_dup3 = synth_table.duplicated(subset=["Full name of agency", "Total reported"], keep=False)
dup23 = synth_table[two_cols_dup3]
dup23

Unnamed: 0,Full name of agency,Agency type,Total reported,RowID
0,A,pri,0,0
1,A,pri,0,1
2,B,pri,1,2
3,B,soe,1,3


In [76]:
combined = pd.concat([dup3, dup21, dup22, dup23], ignore_index=False)
combined

Unnamed: 0,Full name of agency,Agency type,Total reported,RowID
0,A,pri,0,0
1,A,pri,0,1
0,A,pri,0,0
1,A,pri,0,1
0,A,pri,0,0
1,A,pri,0,1
4,C,soe,2,4
5,D,soe,2,5
0,A,pri,0,0
1,A,pri,0,1
2,B,pri,1,2
3,B,soe,1,3


In [77]:
combined.drop_duplicates()

Unnamed: 0,Full name of agency,Agency type,Total reported,RowID
0,A,pri,0,0
1,A,pri,0,1
4,C,soe,2,4
5,D,soe,2,5
2,B,pri,1,2
3,B,soe,1,3
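
The concatenate-then-deduplicate result above loses the information about which column pair caused each row to be flagged. As an alternative sketch (not the notebook's method), one boolean column per pair can be built and joined back onto the flagged rows. It reuses the synthetic table defined above; the names `flags` and `report` are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Same synthetic table as above
synth = pd.DataFrame({
    "Full name of agency": ["A", "A", "B", "B", "C", "D", "E"],
    "Agency type": ["pri", "pri", "pri", "soe", "soe", "soe", "soe"],
    "Total reported": [0, 0, 1, 1, 2, 2, 5],
    "RowID": [0, 1, 2, 3, 4, 5, 6],
})
cols = ["Full name of agency", "Agency type", "Total reported"]

# One boolean column per pair: True where the row shares both values
# with at least one other row.
flags = pd.DataFrame({
    " + ".join(pair): synth.duplicated(subset=list(pair), keep=False)
    for pair in combinations(cols, 2)
})

# Keep rows flagged by at least one pair and show which pairs matched.
report = synth[flags.any(axis=1)].join(flags)
print(report)
```

Because the masks are combined on the original index, the flagged rows stay in their original order and no separate `drop_duplicates` step is needed.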
