Using the cleaned CSV files for Xeno Canto (bird data), World Bank Data, and World Bank Country (Country Region), we identify the unpaired entities in the database, that is, countries that appear in one table but not in another. The database has since changed, including additional entity resolution, so we work from the earlier versions of the data tables exported from previous Jupyter notebooks and cross-check them against earlier code and notes.

In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive

# mount Google Drive so the cleaned CSV files can be read from it
prefix = '/content/drive'
drive.mount(prefix, force_remount=True)

Mounted at /content/drive


In [None]:
# paste the file paths to the cleaned CSV files between the quotation marks below
xeno_canto_path = ''
world_bank_path = ''
country_region_path = ''

In [None]:
xeno_canto = pd.read_csv(xeno_canto_path)
xeno_canto.columns

Index(['id', 'genus', 'scientificName', 'vernacularName', 'longitudeDecimal',
       'latitudeDecimal', 'country', 'locality', 'accessURI'],
      dtype='object')

In [None]:
world_bank = pd.read_csv(world_bank_path)
world_bank.columns

Index(['Country Name', 'Indicator Code', 'Year', 'value'], dtype='object')

In [None]:
country_region = pd.read_csv(country_region_path)
country_region.columns

Index(['Country Name', 'Region'], dtype='object')

In [None]:
xc_countries = xeno_canto['country'].unique().tolist()
wb_countries = world_bank['Country Name'].unique().tolist()
cr_countries = country_region['Country Name'].unique().tolist()

In [None]:
xc_countries = set(xc_countries)
wb_countries = set(wb_countries)
cr_countries = set(cr_countries)

In [None]:
cr_wb_diff = cr_countries.difference(wb_countries)
cr_wb_diff
# there are no countries in Country Region that are not in World Bank Data

set()

In [None]:
wb_cr_diff = wb_countries.difference(cr_countries)
wb_cr_diff
# there are no countries in World Bank Data that are not in Country Region

set()

In [None]:
xc_cr_diff = xc_countries.difference(cr_countries)
xc_cr_diff

# these are the countries in Xeno Canto (bird data) that are not in Country Region
# Cape Verde, East Timor, St Lucia, and Swaziland have since been resolved in the database,
# so Antarctica, French Guiana, Macedonia, and Taiwan remain unpaired

{'Antarctica',
 'Cape Verde',
 'East Timor',
 'French Guiana',
 'Macedonia',
 'St Lucia',
 'Swaziland',
 'Taiwan'}
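
The resolution described in the comments above can be checked on the notebook side with a small lookup table. This is a minimal sketch, not the code that was run against the database: `NAME_FIXES`, `resolve_countries`, and `cr_sample` are hypothetical names, and the four pairings are assumed from the spellings that appear in the Country Region table.

```python
# Hypothetical lookup: Xeno Canto spelling -> Country Region spelling.
# Only the four names the notes describe as resolved are included.
NAME_FIXES = {
    'Cape Verde': 'Cabo Verde',
    'East Timor': 'Timor-Leste',
    'St Lucia': 'St. Lucia',
    'Swaziland': 'Eswatini',
}


def resolve_countries(names, fixes=NAME_FIXES):
    """Return the names as a set, with known alternate spellings replaced."""
    return {fixes.get(name, name) for name in names}


# The eight names reported by xc_countries.difference(cr_countries) above.
xc_only = {'Antarctica', 'Cape Verde', 'East Timor', 'French Guiana',
           'Macedonia', 'St Lucia', 'Swaziland', 'Taiwan'}

# Stand-in for cr_countries, reduced to the entries that matter here.
cr_sample = {'Cabo Verde', 'Timor-Leste', 'St. Lucia', 'Eswatini'}

still_unpaired = resolve_countries(xc_only) - cr_sample
print(sorted(still_unpaired))
# ['Antarctica', 'French Guiana', 'Macedonia', 'Taiwan']
```

With the real data, `cr_sample` would be replaced by `cr_countries` and `xc_only` by `xc_countries`.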

In [None]:
cr_xc_diff = cr_countries.difference(xc_countries)
cr_xc_diff

# these are the countries in Country Region (and World Bank Data) that are not in Xeno Canto
# some were unpaired only because they needed entity resolution and have since been resolved
# (e.g. Cabo Verde, Eswatini, St. Lucia, Timor-Leste)
# the rest simply do not appear in Xeno Canto

{'American Samoa',
 'Aruba',
 'Bermuda',
 'British Virgin Islands',
 'Cabo Verde',
 'Cayman Islands',
 'Channel Islands',
 'Curacao',
 'Eritrea',
 'Eswatini',
 'Faroe Islands',
 'French Polynesia',
 'Gibraltar',
 'Greenland',
 'Guam',
 'Haiti',
 'Hong Kong SAR, China',
 'Isle of Man',
 "Korea, Dem. People's Rep.",
 'Kosovo',
 'Macao SAR, China',
 'Marshall Islands',
 'New Caledonia',
 'North Macedonia',
 'Northern Mariana Islands',
 'San Marino',
 'Sint Maarten (Dutch part)',
 'St. Kitts and Nevis',
 'St. Lucia',
 'St. Martin (French part)',
 'Sudan',
 'Timor-Leste',
 'Turks and Caicos Islands',
 'Tuvalu',
 'Virgin Islands (U.S.)',
 'West Bank and Gaza'}
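
As an alternative to computing two set differences, both directions can be read off a single outer merge. The sketch below uses toy frames rather than the cleaned files; the sample rows are chosen for illustration only, and `xc`, `cr`, and `merged` are hypothetical names. It relies on the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`.

```python
import pandas as pd

# Toy stand-ins for the country columns of xeno_canto and country_region.
xc = pd.DataFrame({'country': ['Taiwan', 'Peru', 'Macedonia', 'Peru']})
cr = pd.DataFrame({'Country Name': ['Peru', 'Haiti', 'North Macedonia']})

# Outer merge on the country name; duplicates are dropped first so each
# country contributes one row.
merged = xc.drop_duplicates().merge(
    cr.drop_duplicates(),
    left_on='country',
    right_on='Country Name',
    how='outer',
    indicator=True,
)

# Countries present on only one side of the merge.
xc_not_in_cr = set(merged.loc[merged['_merge'] == 'left_only', 'country'])
cr_not_in_xc = set(merged.loc[merged['_merge'] == 'right_only', 'Country Name'])

print(sorted(xc_not_in_cr))
print(sorted(cr_not_in_xc))
```

The merge is an exact string match, just like the set difference, so differently spelled names for the same country still show up on both sides until they are resolved.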

To close the gap between the conceptual design of our app and the actual database, we want the birdData table to have a foreign key referencing countryRegion(countryName). That way, countryRegion will represent every country in the database. (The cleaned World Bank Data table contains no countries that are missing from the cleaned World Bank Country (Country Region) table, and in the database worldBankData already has a foreign key referencing countryRegion(countryName).) We can implement this change in two steps: first add to countryRegion the countries that are in birdData but not in countryRegion, then add the foreign key constraint.
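
The two steps can be sketched with Python's built-in `sqlite3` module against an in-memory database. Several things here are assumptions made for illustration: the database engine, the `country` column on birdData, the `region` column on countryRegion, and the toy rows. SQLite cannot add a foreign key to an existing table, so the sketch rebuilds birdData; an engine that supports `ALTER TABLE ... ADD CONSTRAINT` would not need the rebuild.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # SQLite enforces FKs only when enabled

# Toy schema and rows; column names other than countryName are assumed.
conn.executescript("""
CREATE TABLE countryRegion (countryName TEXT PRIMARY KEY, region TEXT);
CREATE TABLE birdData (id INTEGER PRIMARY KEY, country TEXT);
INSERT INTO countryRegion VALUES ('Peru', 'placeholder region');
INSERT INTO birdData VALUES (1, 'Peru'), (2, 'Taiwan'), (3, 'Antarctica');
""")

# Step 1: add the countries that are in birdData but not in countryRegion.
# The region is left NULL because these entities have no region on record.
conn.execute("""
    INSERT INTO countryRegion (countryName, region)
    SELECT DISTINCT country, NULL
    FROM birdData
    WHERE country NOT IN (SELECT countryName FROM countryRegion)
""")

# Step 2: rebuild birdData with the foreign key (SQLite-specific workaround).
conn.executescript("""
CREATE TABLE birdData_new (
    id INTEGER PRIMARY KEY,
    country TEXT REFERENCES countryRegion (countryName)
);
INSERT INTO birdData_new SELECT id, country FROM birdData;
DROP TABLE birdData;
ALTER TABLE birdData_new RENAME TO birdData;
""")

# The constraint now rejects a country that is missing from countryRegion.
try:
    conn.execute("INSERT INTO birdData VALUES (4, 'Atlantis')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print('unknown country rejected:', rejected)
```

Doing step 1 before step 2 matters: if the constraint were added first, the existing birdData rows for the unpaired countries would violate it.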