# While adding constraints in our MySQL database, the FOREIGN KEY (indicatorCode) REFERENCES worldBankIndicators(indicatorCode) constraint for the worldBankData table failed.

We will try to find out what is causing the issue and whether we can resolve it.

In [None]:
import pandas as pd
import numpy as np

In [None]:
prefix = '/content/drive'
from google.colab import drive
drive.mount(prefix, force_remount=True)

Mounted at /content/drive


In [None]:
# copy and paste the file path between the quotation marks below
world_bank_data_path = ''
world_bank_data = pd.read_csv(world_bank_data_path)
world_bank_data.columns

Index(['Country Name', 'Indicator Code', 'Year', 'value'], dtype='object')

In [None]:
# copy and paste the file path between the quotation marks below
world_bank_indicators_path = ''
world_bank_indicators = pd.read_csv(world_bank_indicators_path, usecols=['Series Code', 'Topic', 'Indicator Name', 'Long definition'])
world_bank_indicators.columns

Index(['Series Code', 'Topic', 'Indicator Name', 'Long definition'], dtype='object')

We'll check the original World Bank Series data file too in case the issue is due to rows deleted during data cleaning, although that shouldn't be the case.

In [None]:
# copy and paste the file path between the quotation marks below
# just in case the rows exist but were deleted during data cleaning
world_bank_indicators_original_path = ''
world_bank_indicators_original = pd.read_csv(world_bank_indicators_original_path, usecols=['Series Code', 'Topic', 'Indicator Name', 'Long definition'])
world_bank_indicators_original.columns

Index(['Series Code', 'Topic', 'Indicator Name', 'Long definition'], dtype='object')

The following left join keeps every row of worldBankData and attaches the corresponding row of worldBankIndicators where one exists.

In [None]:
left_joined_on_indicatorCode = world_bank_data.join(world_bank_indicators.set_index('Series Code'), on='Indicator Code')
left_joined_on_indicatorCode.columns

Index(['Country Name', 'Indicator Code', 'Year', 'value', 'Topic',
       'Indicator Name', 'Long definition'],
      dtype='object')

In [None]:
left_joined_on_indicatorCode.shape

(4961021, 7)

In [None]:
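# Same shape with the original Series file. Note that a left join keeps every row of
# worldBankData whether or not it matched, so the shape alone cannot rule out
# rows of worldBankIndicators deleted in data cleaning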
left_joined_on_indicatorCode_original = world_bank_data.join(world_bank_indicators_original.set_index('Series Code'), on='Indicator Code')
left_joined_on_indicatorCode_original.shape

(4961021, 7)

The following inner join keeps only the rows of worldBankData that have a match in worldBankIndicators.

In [None]:
inner_joined_on_indicatorCode = world_bank_data.join(world_bank_indicators.set_index('Series Code'), on='Indicator Code', how='inner')
inner_joined_on_indicatorCode.shape

(4956426, 7)

The inner join has 4595 fewer rows than the left join (4961021 - 4956426), so some rows in worldBankData have no match in worldBankIndicators.
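The same comparison can be reproduced on a small scale. The sketch below uses made-up frames (`data`, `indicators`, and the codes `A.1`, `B.2`, `C.3` are hypothetical, not taken from the World Bank files) and `merge` with `indicator=True` as an alternative way to flag rows that found no match:

```python
import pandas as pd

# Toy stand-ins for worldBankData and worldBankIndicators (hypothetical values)
data = pd.DataFrame({
    'Country Name': ['X', 'X', 'Y', 'Y'],
    'Indicator Code': ['A.1', 'B.2', 'A.1', 'C.3'],
    'value': [1.0, 2.0, 3.0, 4.0],
})
indicators = pd.DataFrame({
    'Series Code': ['A.1', 'B.2'],
    'Indicator Name': ['Indicator A', 'Indicator B'],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
left = data.merge(indicators, how='left', left_on='Indicator Code',
                  right_on='Series Code', indicator=True)
inner = data.merge(indicators, how='inner', left_on='Indicator Code',
                   right_on='Series Code')

# Rows of the data table that have no matching Series Code
orphans = left[left['_merge'] == 'left_only']

print(len(left) - len(inner))
print(orphans['Indicator Code'].tolist())
```

The difference in row counts equals the number of unmatched rows only when `Series Code` is unique in the indicators frame, which is assumed here.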

We will find out which rows in worldBankData did not have matches in worldBankIndicators.

In [None]:
no_indicator_code_match = world_bank_data[~world_bank_data['Indicator Code'].isin(world_bank_indicators['Series Code'])]
no_indicator_code_match

,Country Name,Indicator Code,Year,value
323647,Colombia,DT.TDS.DPPF.XP.ZS,1970,21.229743
326460,Dominican Republic,DT.TDS.DPPF.XP.ZS,1970,7.141586
367576,Colombia,DT.TDS.DPPF.XP.ZS,1971,18.229302
370705,Dominican Republic,DT.TDS.DPPF.XP.ZS,1971,8.002051
377091,Haiti,DT.TDS.DPPF.XP.ZS,1971,13.254290
...,...,...,...,...
4902818,Vietnam,DT.DOD.PVLX.EX.ZS,2021,12.804994
4903817,Zambia,DT.TDS.DPPF.XP.ZS,2021,2.087835
4904091,Zambia,DT.DOD.PVLX.EX.ZS,2021,99.747269
4904265,Zimbabwe,DT.TDS.DPPF.XP.ZS,2021,1.313011


Let's get a list of the unique indicator codes from World Bank Data that do not have a match in World Bank Series to investigate.

In [None]:
no_indicator_code_match['Indicator Code'].unique()

array(['DT.TDS.DPPF.XP.ZS', 'DT.DOD.PVLX.EX.ZS'], dtype=object)

Inspecting the rows of the raw data files with the above two indicator codes shows the cause: the indicator names for these codes differ between the Data file and the Series file. During data cleaning, rows of both World Bank files whose indicator name was not in our list of indicator names were removed. That included rows whose indicator name was intended to be the same as a name in our list but was slightly different. As a result, rows in worldBankData with these two indicator codes cannot be joined to a row in worldBankIndicators on worldBankData.indicatorCode = worldBankIndicators.indicatorCode.

In the raw World Bank Data file, the rows with indicator code 'DT.TDS.DPPF.XP.ZS' have indicator name 'Debt service (PPG and IMF only, % of exports of goods, services and primary income)'. That is the indicator name on the World Bank website too, so we'll use that.

In the raw World Bank Series file, the row with that series code has indicator name 'Debt service to exports (%)'.

In the raw World Bank Data file, the rows with indicator code 'DT.DOD.PVLX.EX.ZS' have indicator name 'Present value of external debt (% of exports of goods, services and primary income)'. That is the indicator name on the World Bank website too, so we'll use that.

In the raw World Bank Series file, the row with that series code has indicator name 'Present value of external debt (% of exports of goods, services and income)'.
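One possible way to apply the two corrections is a code-to-name mapping applied to the raw Series table before the name-based filtering step. The sketch below assumes a frame shaped like the raw Series file, reduced to the `Series Code` and `Indicator Name` columns. The names `series_raw` and `name_corrections` and the third row (`XX.EXAMPLE`) are illustrative and not part of the original notebook or data.

```python
import pandas as pd

# Toy version of the raw Series file; the last row is a made-up, unaffected entry
series_raw = pd.DataFrame({
    'Series Code': ['DT.TDS.DPPF.XP.ZS', 'DT.DOD.PVLX.EX.ZS', 'XX.EXAMPLE'],
    'Indicator Name': [
        'Debt service to exports (%)',
        'Present value of external debt (% of exports of goods, services and income)',
        'Unchanged example indicator',
    ],
})

# Names as they appear in the raw Data file and on the World Bank website
name_corrections = {
    'DT.TDS.DPPF.XP.ZS': 'Debt service (PPG and IMF only, % of exports of goods, services and primary income)',
    'DT.DOD.PVLX.EX.ZS': 'Present value of external debt (% of exports of goods, services and primary income)',
}

# map() yields NaN for codes without a correction; fillna() keeps their existing name
corrected = series_raw['Series Code'].map(name_corrections)
series_raw['Indicator Name'] = corrected.fillna(series_raw['Indicator Name'])

print(series_raw['Indicator Name'].tolist())
```

Keying the corrections on the series code rather than on the old name avoids touching any other row that happens to share a similar name.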

Let's check that all of the rows in the raw World Bank Data file have the same indicator name whenever they have the same indicator code (the cleaned World Bank data file does not have an indicator name column).

In [None]:
# copy and paste the file path between the quotation marks below
world_bank_data_original_path = ''
world_bank_data_original = pd.read_csv(world_bank_data_original_path)
world_bank_data_original.columns

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022'],
      dtype='object')

In [None]:
len(world_bank_data_original.groupby(['Indicator Code', 'Indicator Name'])) == len(world_bank_data_original['Indicator Code'].unique())

True

In [None]:
len(world_bank_data_original.groupby(['Indicator Code', 'Indicator Name'])) == len(world_bank_data_original['Indicator Name'].unique())

True

Yes, all of the rows of the raw World Bank data file have the same indicator name whenever they have the same indicator code, and each indicator name belongs to only one indicator code.
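If one of these checks had returned False, it would not say which codes were affected. A variant that counts distinct names per code (and distinct codes per name) points directly at any offenders. This is a minimal sketch on made-up data; the frame `raw` and its values are hypothetical.

```python
import pandas as pd

# Toy frame in which code 'B.2' deliberately carries two different names
raw = pd.DataFrame({
    'Indicator Code': ['A.1', 'A.1', 'B.2', 'B.2'],
    'Indicator Name': ['Indicator A', 'Indicator A', 'Indicator B', 'Indicator B (old)'],
})

# Number of distinct names attached to each code, and distinct codes per name
names_per_code = raw.groupby('Indicator Code')['Indicator Name'].nunique()
codes_per_name = raw.groupby('Indicator Name')['Indicator Code'].nunique()

# Codes that map to more than one name
inconsistent_codes = names_per_code[names_per_code > 1].index.tolist()

print(inconsistent_codes)
print((codes_per_name > 1).any())
```

On the real file, an empty `inconsistent_codes` list would correspond to the `True` results above.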

After editing the indicator names in the World Bank Series/Indicators table as described above, we should be able to add the foreign key constraint on worldBankData referencing worldBankIndicators.
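Before adding the constraint again, the cleaned files can be checked for remaining orphans. The sketch below uses toy frames; `data_fixed` and `indicators_fixed` are hypothetical names standing in for the re-cleaned tables. It also assumes that indicatorCode is meant to be unique in worldBankIndicators (for example as its primary key), which the notebook does not state.

```python
import pandas as pd

# Toy stand-ins for the re-cleaned tables (hypothetical variable names)
indicators_fixed = pd.DataFrame({
    'Series Code': ['DT.TDS.DPPF.XP.ZS', 'DT.DOD.PVLX.EX.ZS'],
})
data_fixed = pd.DataFrame({
    'Country Name': ['Colombia', 'Vietnam'],
    'Indicator Code': ['DT.TDS.DPPF.XP.ZS', 'DT.DOD.PVLX.EX.ZS'],
})

# Codes used in the data table that are absent from the indicators table
missing = set(data_fixed['Indicator Code']) - set(indicators_fixed['Series Code'])

# Every child code must exist in the parent, and parent codes are assumed unique
fk_ready = len(missing) == 0 and indicators_fixed['Series Code'].is_unique

print(missing)
print(fk_ready)
```

If `missing` is not empty, its contents are the indicator codes that would still cause the constraint to fail.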