# Data Analysis & Visualization CA - Index Generation and Visualization

## 1. Theoretical Framework
The composite index I am creating is intended to indicate: “What makes a country more or less attractive to live in?”

I am referring to a handbook on constructing composite indicators to guide the development of this index (Dunn 2020).

This would include sub-indices for different groupings of factors, which I expect to be labelled after areas like "healthcare", "transport" or "economy". The resulting composite index can be compared to an existing indicator in the World Bank's open database: "Net Migration" (World Bank Group 2025), which is the number of people who move into a country minus the number of people who move out of it.
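One possible way to make that comparison concrete is a rank correlation between the composite index and net migration. The sketch below is illustrative only: the country labels and numbers are made up, and using a rank (Spearman-style) correlation is an assumption rather than a method taken from the handbook.

```python
import pandas as pd

# Hypothetical values for illustration only; not real index scores or migration data.
scores = pd.DataFrame(
    {
        "composite_index": [0.82, 0.35, 0.60, 0.15],
        "net_migration": [120_000, -40_000, 15_000, -90_000],
    },
    index=["Country A", "Country B", "Country C", "Country D"],
)


def net_migration(immigrants: int, emigrants: int) -> int:
    """People moving into a country minus people moving out of it."""
    return immigrants - emigrants


def rank_correlation(a: pd.Series, b: pd.Series) -> float:
    """Pearson correlation of the ranks, i.e. a Spearman-style rank correlation."""
    return a.rank().corr(b.rank())


agreement = rank_correlation(scores["composite_index"], scores["net_migration"])
print(f"Rank agreement with net migration: {agreement:.2f}")
```

A value near 1 would mean the index orders countries in much the same way as net migration does, while a value near 0 would mean the two orderings are unrelated.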



### Expert Opinion
Expert opinion is invaluable for a project such as this. Experts' feedback comes from experience within their area of expertise, giving a strong starting point for potentially significant features to include.

For this, I have considered the following contacts in an attempt to find such opinion(s):
- Irish Department of Foreign Affairs
- Embassies

Unfortunately, I have not received any response from these sources. I was anticipating this, as a college project by someone with whom they have no prior affiliation is unlikely to be addressed amidst their other work.

### Personal Research
I intend to conduct my own research, which is aimed at discovering and justifying features that may prove useful in the creation of the composite index.

Initial Steps:
1. Ask an LLM for advice on factors that may be suitable for producing an index on country migration attractiveness.
    - I asked Gemini what features might be best to include, as well as great open sources to find such information on countries. I have included a link to the chat in the references section (Gemini 2025).
2. Create and observe a post on Reddit, preferably on a subreddit about migration, which asks for reasons why people have moved to/from countries.
    - I created a post on 3 subreddits, asking what factors would impact how much a country draws in or pushes away people, through the lens of what country they are choosing to live in.
        - r/immigration
        - r/migration
        - r/expats
3. Review findings from both sources, picking out the most frequently occurring factors.

#### Literature Review in Place of Reddit Posts
The Reddit posts at step 2 did not work out, due to their restrictions on surveys. However, one comment from user cris-cris-cris brought up the idea of performing a literature review. This is a good idea, as it allows me to get the expert opinions I require even without having connections with those people.

<img src="images/lit_review_suggestion.png"/>

(this screenshot from my Reddit notifications was all I could refer to, as I was unfortunately banned from r/immigration due to my post being considered a survey, which I was unaware broke the rules)

To get an idea of the sub-indices I would use, I searched Google Scholar for the term “migration factors”. From there, I picked out literature which contained distinguishable perspectives on migration, which could be turned into sub-indices.

I observed the following individual factors: Cultural, Economic, Social, Political, Crime, and Environmental.

Sub-indices selected
-	Social (included in 6/6 studies observed)
-	Economic (included in 6/6 studies observed)
-	Cultural (included in 3/6 studies observed)
-	Political (included in 3/6 studies observed)
-	~~Crime~~ (to be considered under the “Social” category, due to assumed importance, yet infrequency as a key category of its own)
-	~~Environmental~~ (removed due to <50% frequency) (2/6)

<table>
  <thead>
    <tr>
      <th>Study Name</th>
      <th>Focused Factors for Migration</th>
      <th>Other Factors for Migration</th>
      <th>Study Link</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Push and Pull Factors of Migration (Parkins 2010)</td>
      <td>
        <ul>
          <li>Economic</li>
          <li>Crime (could be considered under the category of social?)</li>
          <li>Social</li>
        </ul>
      </td>
      <td></td>
      <td><a href="https://arpejournal.com/article/119/galley/114/view/">https://arpejournal.com/article/119/galley/114/view/</a></td>
    </tr>
    <tr>
      <td>The environmental factor in migration dynamics – a review of African case studies (Jónsson 2010)</td>
      <td>
        <ul>
          <li>Environmental</li>
        </ul>
      </td>
      <td>
        <ul>
          <li>Political</li>
          <li>Economic</li>
          <li>Social</li>
          <li>Cultural</li>
        </ul>
      </td>
      <td><a href="https://ora.ox.ac.uk/objects/uuid:cece31bd-0118-4481-acc2-e9ca05f9a763/files/m201a6e7e1e22129ba8734247dff9dbb0">https://ora.ox.ac.uk/objects/uuid:cece31bd-0118-4481-acc2-e9ca05f9a763/files/m201a6e7e1e22129ba8734247dff9dbb0</a></td>
    </tr>
    <tr>
      <td>Comparing Push and Pull Factors Affecting Migration (Urbański 2022)</td>
      <td>
        <ul>
          <li>Economic</li>
          <li>Social</li>
          <li>Political</li>
        </ul>
      </td>
      <td></td>
      <td><a href="https://www.mdpi.com/2227-7099/10/1/21">https://www.mdpi.com/2227-7099/10/1/21</a></td>
    </tr>
    <tr>
      <td>The Influence of Factors of Migration on the Migration Status of Rural-Urban Migrants in Dhaka, Bangladesh (Ishtiaque and Ullah 2013)</td>
      <td>
        <ul>
          <li>Social</li>
        </ul>
      </td>
      <td>
        <ul>
          <li>Economic</li>
        </ul>
      </td>
      <td><a href="https://www.researchgate.net/profile/Asif-Ishtiaque/publication/258847945_The_Influence_of_Factors_of_Migration_on_the_Migration_Status_of_Rural-Urban_Migrants_in_Dhaka_Bangladesh/links/0deec5293b7fc7bcdb000000/The-Influence-of-Factors-of-Migration-on-the-Migration-Status-of-Rural-Urban-Migrants-in-Dhaka-Bangladesh.pdf">https://www.researchgate.net/profile/Asif-Ishtiaque/publication/258847945_The_Influence_of_Factors_of_Migration_on_the_Migration_Status_of_Rural-Urban_Migrants_in_Dhaka_Bangladesh/links/0deec5293b7fc7bcdb000000/The-Influence-of-Factors-of-Migration-on-the-Migration-Status-of-Rural-Urban-Migrants-in-Dhaka-Bangladesh.pdf</a></td>
    </tr>
    <tr>
      <td>Socio-Economic Factors Associated with Urban-Rural Migration in Nigeria: A Case Study of Oyo State, Nigeria (Adewale 2005)</td>
      <td>
        <ul>
          <li>Socio-economic
            <ul>
              <li>Social</li>
              <li>Economic</li>
            </ul>
          </li>
        </ul>
      </td>
      <td>
        <ul>
          <li>Cultural</li>
          <li>Environmental</li>
          <li>Political</li>
        </ul>
      </td>
      <td><a href="https://www.researchgate.net/profile/Jacob-Adewale/publication/267716974_Socio-Economic_Factors_Associated_with_Urban-Rural_Migration_in_Nigeria_A_Case_Study_of_Oyo_State_Nigeria/links/61dffae74e4aff4a643bb5b4/Socio-Economic-Factors-Associated-with-Urban-Rural-Migration-in-Nigeria-A-Case-Study-of-Oyo-State-Nigeria.pdf">https://www.researchgate.net/profile/Jacob-Adewale/publication/267716974_Socio-Economic_Factors_Associated_with_Urban-Rural_Migration_in_Nigeria_A_Case_Study_of_Oyo_State_Nigeria/links/61dffae74e4aff4a643bb5b4/Socio-Economic-Factors-Associated-with-Urban-Rural-Migration-in-Nigeria-A-Case-Study-of-Oyo-State-Nigeria.pdf</a></td>
    </tr>
    <tr>
      <td>Factors determining international and regional Migration in Europe. (Fouarge and Ester 2007)</td>
      <td>
        <ul>
          <li>Social</li>
          <li>Cultural</li>
          <li>Economic</li>
        </ul>
      </td>
      <td></td>
      <td><a href="https://cris.maastrichtuniversity.nl/ws/portalfiles/portal/913803/guid-bc2ecf8e-2d2e-4747-b70a-bf0bd8e1d9b6-ASSET1.0.pdf">https://cris.maastrichtuniversity.nl/ws/portalfiles/portal/913803/guid-bc2ecf8e-2d2e-4747-b70a-bf0bd8e1d9b6-ASSET1.0.pdf</a></td>
    </tr>
  </tbody>
</table>

From this literature review, I have found heavy reference to Social and Economic concepts in relation to migratory factors. As well as these, Cultural and Political factors were also of note. My next step is to dive deeper, and find more focused and measurable factors that fit under these four categories. 
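The frequency counts behind the sub-index selection can be reproduced with a short tally. This is a minimal sketch that transcribes the factors from the table above (focused and other factors combined, with Crime folded into Social); the "at least half of the studies" cut-off mirrors the 50% rule used in the list of selected sub-indices.

```python
from collections import Counter

# Factors mentioned (as focused or other factors) by each study in the table above.
# Crime is counted under Social, as described in the sub-index list.
studies = {
    "Parkins 2010": {"Economic", "Social"},
    "Jónsson 2010": {"Environmental", "Political", "Economic", "Social", "Cultural"},
    "Urbański 2022": {"Economic", "Social", "Political"},
    "Ishtiaque and Ullah 2013": {"Social", "Economic"},
    "Adewale 2005": {"Social", "Economic", "Cultural", "Environmental", "Political"},
    "Fouarge and Ester 2007": {"Social", "Cultural", "Economic"},
}

# Count in how many studies each factor appears.
counts = Counter(factor for factors in studies.values() for factor in factors)

# Keep a factor when it appears in at least half of the studies.
threshold = len(studies) / 2
selected = sorted(factor for factor, n in counts.items() if n >= threshold)

for factor, n in counts.most_common():
    print(f"{factor}: {n}/{len(studies)}")
print("Selected sub-indices:", selected)
```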

##### Economic Factors
From my [Economic Factor Research](Research/EconomicFactorResearch.ipynb) using (Gemini 2025).

<table>
	<tr>
		<th>Factor Name</th>
		<th>Potential Sources (To be searched and cross referenced with initial Gemini query)</th>
	</tr>
	<tr>
		<td>Income / wages</td>
		<td>
			<ul>
				<li>World Bank</li>
				<li>International Labour Organization (ILO)</li>
				<li>OECD</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Education</td>
		<td>
			<ul>
				<li>UNESCO Institute for Statistics (UIS)</li>
				<li>World Bank</li>
				<li>OECD</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Welfare</td>
		<td>
			<ul>
				<li>World Bank</li>
				<li>OECD</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Taxes</td>
		<td>
			<ul>
				<li>OECD</li>
				<li>IMF</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Population density</td>
		<td>
			<ul>
				<li>World Bank</li>
				<li>UN Population Division</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Growth (Specifically GDP per head in PPS)</td>
		<td>
			<ul>
				<li>World Bank</li>
				<li>IMF</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Housing Prices</td>
		<td>
			<ul>
				<li>OECD</li>
				<li>IMF</li>
				<li>UN-Habitat</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Home ownership</td>
		<td>
			<ul>
				<li>World Bank</li>
				<li>Trading Economics</li>
			</ul>
		</td>
	</tr>
</table>

##### Social Factors
From my [Social Factor Research](Research/SocialFactorResearch.ipynb) using (Gemini 2025).

<table>
	<tr>
		<th>Factor Name</th>
		<th>Potential Sources (To be searched and cross referenced with initial Gemini query)</th>
	</tr>
	<tr>
		<td>Marriage rate</td>
		<td>
			<ul>
				<li>World Bank</li>
				<li>United Nations (UN) Data</li>
				<li>OECD</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Measures of authoritarianism, or political rights of a country in general</td>
		<td>
			<ul>
				<li>Freedom House</li>
				<li>Economist Intelligence Unit (EIU)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Access to electricity</td>
		<td>
			<ul>
				<li>World Bank</li>
				<li>International Energy Agency (IEA)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Quality of healthcare</td>
		<td>
			<ul>
				<li>Numbeo</li>
				<li>Lancet (Healthcare Access and Quality Index)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Measure of discrimination</td>
		<td>
			<ul>
				<li>SDG Indicator 10.3.1</li>
				<li>Gallup World Poll</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Population growth (births only, to avoid changes due to migration)</td>
		<td>
			<ul>
				<li>World Bank</li>
				<li>UN Data</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>percentage of the population that are in some younger age group</td>
		<td>
			<ul>
				<li>World Bank</li>
				<li>UN Data</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>quality of education</td>
		<td>
			<ul>
				<li>World Economic Forum</li>
				<li>OECD (PISA)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Social media usage</td>
		<td>
			<ul>
				<li>Statista</li>
				<li>Our World in Data</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Mobile/internet Network coverage</td>
		<td>
			<ul>
				<li>GSMA</li>
				<li>International Telecommunication Union (ITU)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Happiness measure</td>
		<td>
			<ul>
				<li>World Happiness Report</li>
				<li>Gallup World Poll</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Crime rate</td>
		<td>
			<ul>
				<li>World Bank</li>
				<li>UN Office on Drugs and Crime (UNODC)</li>
			</ul>
		</td>
	</tr>
</table>

##### Political Factors
From my [Political Factor Research](Research/PoliticalFactorResearch.ipynb) using (ChatGPT 2025).
- Note: I fully moved "warfare" from social to political, as my perspective on its categorization has changed based on the political nature of war.

<table>
	<tr>
		<th>Factor Name</th>
		<th>Potential Sources (To be searched and cross referenced with initial Gemini query)</th>
	</tr>
	<tr>
		<td>Warfare/conflict</td>
		<td>
			<ul>
				<li>Uppsala Conflict Data Program (UCDP)</li>
				<li>Armed Conflict Location & Event Data Project (ACLED)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Institutional Trust</td>
		<td>
			<ul>
				<li>World Values Survey (WVS)</li>
				<li>Edelman Trust Barometer</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Political Stability</td>
		<td>
			<ul>
				<li>Worldwide Governance Indicators (WGI)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Type of Political Regime (Left/Right/Centre Leaning)</td>
		<td>
			<ul>
				<li>Varieties of Democracy (V-Dem) Dataset</li>
				<li>Bertelsmann Transformation Index (BTI)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Terrorism</td>
		<td>
			<ul>
				<li>Global Terrorism Index (GTI)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Legal System Fairness</td>
		<td>
			<ul>
				<li>World Justice Project (WJP) Rule of Law Index</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Government Effectiveness</td>
		<td>
			<ul>
				<li>Worldwide Governance Indicators (WGI)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Regulatory Quality</td>
		<td>
			<ul>
				<li>Worldwide Governance Indicators (WGI)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Percentage of Workers with Union Representation</td>
		<td>
			<ul>
				<li>International Labour Organization (ILO) Statistics</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Measure of Basic Rights</td>
		<td>
			<ul>
				<li>CIRI Human Rights Data Project</li>
				<li>World Justice Project (WJP) Rule of Law Index</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Corruption</td>
		<td>
			<ul>
				<li>Transparency International's Corruption Perceptions Index (CPI)</li>
				<li>Worldwide Governance Indicators (WGI)</li>
			</ul>
		</td>
	</tr>
	<tr>
		<td>Wealth of a Country (GDP)</td>
		<td>
			<ul>
				<li>World Bank's World Development Indicators</li>
			</ul>
		</td>
	</tr>
</table>

## 2. Data Selection
I carried out extraction of the data marked as "ready" below, with the help of ChatGPT to build a web scraper able to work with the Rule of Law Index (ChatGPT 2025).

### Political Factor Data Selection
<table>
	<tr>
		<th>Status</th>
		<th>Name</th>
		<th>Source</th>
		<th>Reference</th>
		<th>Country Coverage</th>
		<th>Notes</th>
	</tr>
	<tr>
		<td>Ready</td>
		<td>Percentage of Workers with Union Representation</td>
		<td>International Labour Organization (ILO) Statistics</td>
		<td>trade_union.csv (International Labour Organization 2020)</td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Ready</td>
		<td>Warfare/conflict</td>
		<td>Uppsala Conflict Data Program (UCDP) </td>
		<td>UCDP.csv (UCDP 2023)</td>
		<td></td>
		<td>State-based violence (type_of_violence = 1)// (country) // (deaths_a + deaths_b + deaths_civilians + deaths_unknown)</td>
	</tr>
	<tr>
		<td>Ready</td>
		<td>Wealth of a Country (GDP)</td>
		<td>World Bank's World Development Indicators</td>
		<td>world_bank_group.csv (World Bank Group 2025)</td>
		<td></td>
		<td>Using "GDP per capita (current US$)"</td>
	</tr>
	<tr>
		<td>Ready</td>
		<td>Legal System Fairness</td>
		<td>World Justice Project (WJP) Rule of Law Index</td>
		<td>rule_of_law_index.csv (World Justice Project 2024)</td>
		<td></td>
		<td>Uses a combination of the civil justice and criminal justice factors</td>
	</tr>
	<tr>
		<td>Ready</td>
		<td>Measure of Basic Rights</td>
		<td>World Justice Project (WJP) Rule of Law Index</td>
		<td>rule_of_law_index.csv (World Justice Project 2024)</td>
		<td></td>
		<td>Fundamental Rights</td>
	</tr>
	<tr>
		<td>Ready</td>
		<td>Institutional Trust</td>
		<td>World Values Survey (WVS)</td>
		<td>Confidence_The_Government (1).xls (World Values Survey Association 2022)</td>
		<td>
		</td>
		<td>
			<ol>
				<li>2017-2022 study. I was able to find Q71.- Confidence: The Government, which helps answer what I set out to determine. I will take the percentages of the answers "A great deal", "Quite a lot", "Not very much" and "None at all", assigning 4/3/2/1 points to each respectively. For each country listed, I can multiply each answer's points by its percentage, then sum the results for a final score.</li>
				<li>This may be dropped due to its extremely low country coverage</li>
			</ol>
		</td>
	</tr>
	<tr>
		<td>Ready</td>
		<td>Political Stability</td>
		<td>Worldwide Governance Indicators (WGI)</td>
		<td>wgidataset.xlsx (World Bank Group 2024)</td>
		<td></td>
		<td>
			<ol>
				<li>Using the code (pv). This includes the measurement of the absence of violence and terrorism, which overlaps with the "Terrorism" factor. I am considering dropping this feature, as it may be better to focus on narrower factors than this combination of different things.</li>
			</ol>
		</td>
	</tr>
	<tr>
		<td>Ready</td>
		<td>Government Effectiveness</td>
		<td>Worldwide Governance Indicators (WGI)</td>
		<td>wgidataset.xlsx (World Bank Group 2024)</td>
		<td></td>
		<td>Using the code (ge)</td>
	</tr>
	<tr>
		<td>Ready</td>
		<td>Regulatory Quality</td>
		<td>Worldwide Governance Indicators (WGI)</td>
		<td>wgidataset.xlsx (World Bank Group 2024)</td>
		<td></td>
		<td>Using the code (rq)</td>
	</tr>
	<tr>
		<td>Ready</td>
		<td>Corruption</td>
		<td>Worldwide Governance Indicators (WGI)</td>
		<td>wgidataset.xlsx (World Bank Group 2024)</td>
		<td></td>
		<td>Using the code (cc)</td>
	</tr>
	<tr>
		<td>Dropped</td>
		<td>Type of Political Regime (Left/Right/Centre Leaning)</td>
		<td>
			<ol>
				<li>Varieties of Democracy (V-Dem) Dataset</li>
				<li>Bertelsmann Transformation Index (BTI)</li>
			</ol>
		</td>
		<td></td>
		<td></td>
		<td>Both of these have proven quite difficult to extract information from relating to type of political regime. I may drop this</td>
	</tr>
	<tr>
		<td>Dropped</td>
		<td>Terrorism</td>
		<td>Global Terrorism Index (GTI)</td>
		<td></td>
		<td></td>
		<td>A request must be sent to fetch this data. I have sent this request, but I expect the data will not be given in time.</td>
	</tr>
</table>
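Two notes in the table above describe small calculations: the Institutional Trust score (points multiplied by answer percentages, then summed) and the Warfare/conflict figure (deaths from state-based violence summed per country). The sketch below shows both on made-up inputs. The function names are hypothetical; the column names are the ones quoted in the notes.

```python
import pandas as pd

# Points assigned to each WVS answer, as described in the Institutional Trust note.
ANSWER_POINTS = {"A great deal": 4, "Quite a lot": 3, "Not very much": 2, "None at all": 1}


def trust_score(answer_percentages: dict) -> float:
    """Sum of (points x percentage) over the four answers."""
    return sum(
        points * answer_percentages.get(answer, 0.0)
        for answer, points in ANSWER_POINTS.items()
    )


def state_based_deaths(events: pd.DataFrame) -> pd.Series:
    """Total deaths per country from state-based violence (type_of_violence == 1)."""
    death_cols = ["deaths_a", "deaths_b", "deaths_civilians", "deaths_unknown"]
    state_based = events[events["type_of_violence"] == 1]
    return state_based.groupby("country")[death_cols].sum().sum(axis=1)


# Made-up inputs, for illustration only.
example_answers = {
    "A great deal": 10.0,
    "Quite a lot": 30.0,
    "Not very much": 40.0,
    "None at all": 20.0,
}
example_events = pd.DataFrame(
    {
        "country": ["X", "X", "Y"],
        "type_of_violence": [1, 2, 1],
        "deaths_a": [3, 9, 1],
        "deaths_b": [2, 9, 0],
        "deaths_civilians": [5, 9, 0],
        "deaths_unknown": [0, 9, 1],
    }
)

score = trust_score(example_answers)
deaths = state_based_deaths(example_events)
print(score)
print(deaths)
```

When the four percentages sum to 100, the trust score falls between 100 (everyone answers "None at all") and 400 (everyone answers "A great deal"), so it can be rescaled later alongside the other factors.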

### Economic Factor Data Selection
<table>
	<tr>
		<th>Status</th>
		<th>Name</th>
		<th>Source</th>
		<th>Reference</th>
		<th>Country Coverage</th>
		<th>Notes</th>
	</tr>
	<tr>
		<td class="status">Ready</td>
		<td class="name">Income / wages</td>
		<td class="source">International Labour Organization (ILO)</td>
		<td class="reference">income_wages.csv (International Labour Organization 2025)</td>
		<td class="country coverage"></td>
		<td class="notes"></td>
	</tr>
	<tr>
		<td class="status">Ready</td>
		<td class="name">Education</td>
		<td class="source">World Bank</td>
		<td class="reference">world_bank_group.csv (World Bank Group 2025)</td>
		<td class="country coverage"></td>
		<td class="notes">Compulsory education, duration (years)</td>
	</tr>
	<tr>
		<td class="status">Ready</td>
		<td class="name">Taxes</td>
		<td class="source">IMF</td>
		<td class="reference">world-imf2024.xlsx (International Monetary Fund 2024)</td>
		<td class="country coverage"></td>
		<td class="notes">TaxRev "Tax revenue, percent of GDP"</td>
	</tr>
	<tr>
		<td class="status">Ready</td>
		<td class="name">Population density</td>
		<td class="source">World Bank</td>
		<td class="reference">world_bank_group.csv (World Bank Group 2025)</td>
		<td class="country coverage"></td>
		<td class="notes">Population density (people per sq. km of land area)</td>
	</tr>
	<tr>
		<td class="status">Ready</td>
		<td class="name">Growth (Specifically GDP per head in PPS)</td>
		<td class="source">World Bank</td>
		<td class="reference">world_bank_group.csv (World Bank Group 2025)</td>
		<td class="country coverage"></td>
		<td class="notes">GDP per capita growth (annual %)</td>
	</tr>
	<tr>
		<td class="status">Dropped</td>
		<td class="name">Welfare</td>
		<td class="source">
			<ul>
				<li>World Bank</li>
				<li>OECD</li>
			</ul>
		</td>
		<td class="reference"></td>
		<td class="country coverage"></td>
		<td class="notes">For relevant categories in the World Bank database, no data was found for many countries. OECD is also quite limited in terms of countries addressed. I may drop this because of how sparse the data is</td>
	</tr>
	<tr>
		<td class="status">Dropped</td>
		<td class="name">Housing Prices</td>
		<td class="source">
			<ul>
				<li>OECD</li>
				<li>IMF</li>
				<li>UN-Habitat</li>
			</ul>
		</td>
		<td class="reference"></td>
		<td class="country coverage"></td>
		<td class="notes">There does not seem to be widely available data on housing prices here</td>
	</tr>
	<tr>
		<td class="status">Dropped</td>
		<td class="name">Home ownership</td>
		<td class="source">
			<ul>
				<li>World Bank</li>
				<li>Trading Economics</li>
			</ul>
		</td>
		<td class="reference"></td>
		<td class="country coverage"></td>
		<td class="notes">Information on this is also sparse, with only 47 countries considered in (TRADING ECONOMICS 2024). It would be ideal to cluster by region for missing values, but the only region that seems to have an adequate number for this is Europe.</td>
	</tr>
</table>

### Social Factor Data Selection
<table>
	<tr>
		<th>Status</th>
		<th>Name</th>
		<th>Source</th>
		<th>Reference</th>
		<th>Country Coverage</th>
		<th>Notes</th>
	</tr>
	<tr>
		<td class="status">Ready</td>
		<td class="name">Access to electricity</td>
		<td class="source">World Bank</td>
		<td class="reference">world_bank_group.csv (World Bank Group 2025)</td>
		<td class="country coverage"></td>
		<td class="notes">Access to electricity (% of population)</td>
	</tr>
	<tr>
		<td class="status">Ready</td>
		<td class="name">Quality of healthcare</td>
		<td class="source">Numbeo</td>
		<td class="reference">health_care_index.csv (marcelobatalhah 2025)</td>
		<td class="country coverage"></td>
		<td class="notes"></td>
	</tr>
	<tr>
		<td class="status">Ready</td>
		<td class="name">Population growth (births only, to avoid changes due to migration)</td>
		<td class="source">World Bank</td>
		<td class="reference">world_bank_group.csv (World Bank Group 2025)</td>
		<td class="country coverage"></td>
		<td class="notes">Birth rate, crude (per 1,000 people)</td>
	</tr>
	<tr>
		<td class="status">Ready</td>
		<td class="name">percentage of the population that are in some younger age group</td>
		<td class="source">World Bank</td>
		<td class="reference">world_bank_group.csv (World Bank Group 2025)</td>
		<td class="country coverage"></td>
		<td class="notes">("Population ages 20-24, male (% of male population)" + "Population ages 20-24, female (% of female population)")</td>
	</tr>
	<tr>
		<td class="status">Ready</td>
		<td class="name">Mobile/internet Network coverage</td>
		<td class="source">International Telecommunication Union (ITU)</td>
		<td class="reference">population-coverage-by-mobile-network-technology.csv (International Telecommunication Union 2024)</td>
		<td class="country coverage"></td>
		<td class="notes"></td>
	</tr>
	<tr>
		<td class="status">Ready</td>
		<td class="name">Crime rate</td>
		<td class="source">World Population Review</td>
		<td class="reference">crime-rate-by-country-2025.csv (World Population Review 2025)</td>
		<td class="country coverage"></td>
		<td class="notes"></td>
	</tr>
	<tr>
		<td class="status">Dropped</td>
		<td class="name">Measure of discrimination</td>
		<td class="source">
			<ul>
				<li>SDG Indicator 10.3.1</li>
				<li>Gallup World Poll</li>
			</ul>
		</td>
		<td class="reference"></td>
		<td class="country coverage"></td>
		<td class="notes">It has been difficult to find a viable measure of discrimination within or without the references that have been linked, so I would probably leave this in favour of others</td>
	</tr>
	<tr>
		<td class="status">Dropped</td>
		<td class="name">quality of education</td>
		<td class="source">
			<ul>
				<li>World Economic Forum</li>
				<li>OECD (PISA)</li>
			</ul>
		</td>
		<td class="reference"></td>
		<td class="country coverage"></td>
		<td class="notes">Difficult to find a deeper insight into the quality of education. An alternative would have been Primary Completion Rate from (World Bank Group 2024), however, I noted that certain percentages were above 100, and therefore decided to abandon that dataset</td>
	</tr>
	<tr>
		<td class="status">Dropped</td>
		<td class="name">Marriage rate</td>
		<td class="source">
			<ul>
				<li>World Bank</li>
				<li>United Nations (UN) Data</li>
				<li>OECD</li>
			</ul>
		</td>
		<td class="reference"></td>
		<td class="country coverage"></td>
		<td class="notes"></td>
	</tr>
	<tr>
		<td class="status">Unsuitable for Social</td>
		<td class="name">Measures of authoritarianism, or political rights of a country in general</td>
		<td class="source">
			<ul>
				<li>Freedom House</li>
				<li>Economist Intelligence Unit (EIU)</li>
			</ul>
		</td>
		<td class="reference"></td>
		<td class="country coverage"></td>
		<td class="notes"></td>
	</tr>
	<tr>
		<td class="status">Dropped</td>
		<td class="name">Social media usage</td>
		<td class="source">
			<ul>
				<li>Statista</li>
				<li>Our World in Data</li>
			</ul>
		</td>
		<td class="reference"></td>
		<td class="country coverage"></td>
		<td class="notes">It is hard to get the specific number/percentage of the population that is actively using social media</td>
	</tr>
	<tr>
		<td class="status">Dropped</td>
		<td class="name">Happiness measure</td>
		<td class="source">
			<ul>
				<li>World Happiness Report</li>
				<li>Gallup World Poll</li>
			</ul>
		</td>
		<td class="reference"></td>
		<td class="country coverage"></td>
		<td class="notes">It would be best not to use this, as social support, GDP per capita and perceptions of corruption are part of it. All of these overlap with previously discovered factors, which is something I want to avoid</td>
	</tr>
</table>
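The younger-age-group factor in the table above is described as the male 20-24 share plus the female 20-24 share. A minimal sketch of that combination follows, using made-up values and a hypothetical helper name; ".." is treated as missing, matching the placeholder used in the World Bank export.

```python
import pandas as pd

MALE = "Population ages 20-24, male (% of male population)"
FEMALE = "Population ages 20-24, female (% of female population)"


def young_population_share(df: pd.DataFrame) -> pd.Series:
    """Sum of the male and female 20-24 shares; missing inputs give a missing result."""
    male = pd.to_numeric(df[MALE], errors="coerce")
    female = pd.to_numeric(df[FEMALE], errors="coerce")
    return male + female


# Made-up values for illustration only.
example = pd.DataFrame(
    {
        "Country Name": ["A", "B"],
        MALE: ["8.0", ".."],
        FEMALE: ["7.5", "6.0"],
    }
)
example["Young population (20-24), male + female shares"] = young_population_share(example)
print(example)
```

Because the two series are shares of different bases (the male population and the female population), their sum is not itself a percentage of the total population. It still works as a relative measure for comparing countries.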

### Importing the data

#### Previous work: Initializing With World Development Indicator
import pandas as pd
import numpy as np

\# Forward slashes are used in the paths so that no part of a folder name is read as a backslash escape sequence
world_development_indicator_df = pd.read_csv("2. Data Extraction/Multi-Category Factor Data (Added)/world_bank_group.csv")
income_wages_df = pd.read_csv("2. Data Extraction/Economic Factor Data (Added, With Name Mismatches)/income_wages.csv")
\# Solution for selecting a specific sheet of an excel file found from the answer of Vaibhav Jadhav at https://stackoverflow.com/questions/71527992/pandas-dataframe-to-specific-sheet-in-a-excel-file-without-losing-formatting
taxes_df = pd.read_excel("2. Data Extraction/Economic Factor Data (Added, With Name Mismatches)/world-imf2024.xlsx", sheet_name="Data")
rule_of_law_df = pd.read_csv("2. Data Extraction/Political Factor Data/rule_of_law_index.csv")



country_name = world_development_indicator_df["Country Name"].unique()
gdp_growth = []
pop_density = []
education_years = []
gdp_per_capita = []
electricity_access = []
birth_rate = []
young_pop_percentage_male = []
young_pop_percentage_female = []


\# the use of "index" below is from a question on the use of iterrows() in a for loop, answered by waitingkuo https://stackoverflow.com/questions/16476924/how-can-i-iterate-over-rows-in-a-pandas-dataframe
for index, wdi_df_row in world_development_indicator_df.iterrows():
	output_value = ""

	if wdi_df_row["Series Name"] != "Population density (people per sq. km of land area)" and wdi_df_row["2023 [YR2023]"] == ".." : output_value = None
	elif wdi_df_row["Series Name"] == "Population density (people per sq. km of land area)" and wdi_df_row["2022 [YR2022]"] == ".." : output_value = None
	else: 
		if wdi_df_row["Series Name"] != "Population density (people per sq. km of land area)": output_value = wdi_df_row["2023 [YR2023]"]
		else: output_value = wdi_df_row["2022 [YR2022]"]

	if wdi_df_row["Series Name"] == "GDP per capita growth (annual %)": gdp_growth.append(output_value)
	elif wdi_df_row["Series Name"] == "Population density (people per sq. km of land area)": pop_density.append(output_value) # Best set as 2022, since there is no data for any country in 2023
	elif wdi_df_row["Series Name"] == "Compulsory education, duration (years)": education_years.append(output_value)
	elif wdi_df_row["Series Name"] == "GDP per capita (current US$)": gdp_per_capita.append(output_value)
	elif wdi_df_row["Series Name"] == "Access to electricity (% of population)": electricity_access.append(output_value)
	elif wdi_df_row["Series Name"] == "Birth rate, crude (per 1,000 people)": birth_rate.append(output_value)
	elif wdi_df_row["Series Name"] == "Population ages 20-24, male (% of male population)": young_pop_percentage_male.append(output_value)
	elif wdi_df_row["Series Name"] == "Population ages 20-24, female (% of female population)": young_pop_percentage_female.append(output_value)


\# Figured out how to make an empty list with help from jL4's answer at https://stackoverflow.com/questions/43336837/making-equal-size-lists-in-python
avg_hourly_earnings = [None] * len(country_name)
taxes_percentage = [None] * len(country_name)
fundamental_rights = [None] * len(country_name)
legal_system_fairness = [None] * len(country_name)

df = pd.DataFrame({
	"Country Name":country_name,
	"GDP per capita growth (annual %)":gdp_growth,
	"Population density (people per sq. km of land area)":pop_density,
	"Compulsory education, duration (years)":education_years,
	"GDP per capita (current US$)":gdp_per_capita,
	"Access to electricity (% of population)":electricity_access,
	"Birth rate, crude (per 1,000 people)":birth_rate,
	"Population ages 20-24, male (% of male population)":young_pop_percentage_male,
	"Population ages 20-24, female (% of female population)":young_pop_percentage_female,
	"Average hourly earnings of employees":avg_hourly_earnings,
	"Tax revenue, percent of GDP":taxes_percentage,
	"Fundamental Rights":fundamental_rights,
	"Legal System Fairness":legal_system_fairness
	})
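The row-by-row construction above only lines up if every country has every series, in the same order as `country_name`. As a hedged alternative, the same wide table could be built with a pivot. The sketch below assumes the long-format columns used above ("Country Name", "Series Name" and the two year columns) and runs on a made-up sample; `wdi_to_wide` is a hypothetical helper name.

```python
import pandas as pd

DENSITY = "Population density (people per sq. km of land area)"


def wdi_to_wide(long_df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per country and one column per series."""
    # ".." marks missing data in the export; coerce it (and any other text) to NaN.
    year_2023 = pd.to_numeric(long_df["2023 [YR2023]"], errors="coerce")
    year_2022 = pd.to_numeric(long_df["2022 [YR2022]"], errors="coerce")

    # Population density uses 2022; every other series uses 2023.
    use_2022 = long_df["Series Name"] == DENSITY
    tidy = long_df[["Country Name", "Series Name"]].copy()
    tidy["value"] = year_2023.where(~use_2022, year_2022)

    # pivot raises a ValueError if a country/series pair appears more than once.
    wide = tidy.pivot(index="Country Name", columns="Series Name", values="value")
    wide.columns.name = None
    return wide.reset_index()


# Made-up sample in the same shape as the export, for illustration only.
sample = pd.DataFrame(
    {
        "Country Name": ["A", "A", "B", "B"],
        "Series Name": [DENSITY, "GDP per capita (current US$)"] * 2,
        "2022 [YR2022]": ["10", "100", "20", "200"],
        "2023 [YR2023]": ["..", "110", "..", ".."],
    }
)
wide = wdi_to_wide(sample)
print(wide)
```

With this shape, a series that is missing for a country simply becomes a missing value in that country's row instead of shifting the values of the countries after it.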

#### Previous work: Adding Average Hourly Earnings
\# ILO country labels that differ from the World Bank names used in df. Hardcoding was required for these countries, as they failed to match by name
ilo_to_world_bank_names = {
	"Egypt": "Egypt, Arab Rep.",
	"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
	"Republic of Korea": "Korea, Rep.", \# determined to be south korea https://history.state.gov/countries/korea-south#:~:text=Republic%20of%20Korea%20(South%20Korea,Countries%20%2D%20Office%20of%20the%20Historian
	"Republic of Moldova": "Moldova",
	"Slovakia": "Slovak Republic",
	"Türkiye": "Turkiye",
	"Tanzania, United Republic of": "Tanzania",
	"United States of America": "United States",
}

for index, iw_df_row in income_wages_df.iterrows():
	label = iw_df_row["ref_area.label"]

	\# Solution for checking existence of matching country names in the main dataframe found from Akram's answer at https://stackoverflow.com/questions/21319929/how-to-determine-whether-a-pandas-column-contains-a-particular-value
	if label in df["Country Name"].values: country = label
	else: country = ilo_to_world_bank_names.get(label)

	if country is None:
		print(label)
		print("I do not exist")
	else:
		\# Assign through df.loc: rows yielded by iterrows() may be copies, so writing to them is not guaranteed to update df
		df.loc[df["Country Name"] == country, "Average hourly earnings of employees"] = iw_df_row["obs_value"]
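The name matching above could also be expressed as a left merge with a name-correction map, which reports the unmatched names in one pass. This is a sketch on made-up data, not the approach used in the project: `attach_by_country` is a hypothetical helper, "Atlantis" is a deliberately fake label, and the map shows only a subset of the corrections listed above.

```python
import pandas as pd

# Subset of the name fixes used above (ILO label -> World Bank name).
NAME_FIXES = {
    "Egypt": "Egypt, Arab Rep.",
    "Republic of Korea": "Korea, Rep.",
    "Slovakia": "Slovak Republic",
}


def attach_by_country(main, other, other_name_col, value_col, name_fixes):
    """Left-join value_col onto main by country name; also return unmatched names."""
    lookup = other[[other_name_col, value_col]].copy()
    # Translate source labels to the names used in main; unknown labels pass through.
    lookup["Country Name"] = lookup[other_name_col].map(lambda n: name_fixes.get(n, n))
    unmatched = sorted(set(lookup["Country Name"]) - set(main["Country Name"]))
    # If a country appears more than once, keep its last row.
    lookup = lookup.drop_duplicates("Country Name", keep="last")
    merged = main.merge(lookup[["Country Name", value_col]], on="Country Name", how="left")
    return merged, unmatched


# Made-up data for illustration only.
main = pd.DataFrame({"Country Name": ["Egypt, Arab Rep.", "Korea, Rep.", "Ireland"]})
ilo = pd.DataFrame(
    {
        "ref_area.label": ["Egypt", "Republic of Korea", "Atlantis"],
        "obs_value": [1.5, 20.0, 3.0],
    }
)
merged, unmatched = attach_by_country(main, ilo, "ref_area.label", "obs_value", NAME_FIXES)
print(merged)
print("No match for:", unmatched)
```

This assumes `main` does not already contain a column named like `value_col`; otherwise pandas adds `_x`/`_y` suffixes to the two columns.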

#### Previous work: Adding Tax Revenue
count = 0
for index, tax_df_row in taxes_df.iterrows():
	\# Only the 2019 figures are used
	if tax_df_row["year"] != 2019: continue

	country = tax_df_row["property_ShortForm_en_displayNam"]
	if country in df["Country Name"].values:
		\# Assign through df.loc: rows yielded by iterrows() may be copies, so writing to them is not guaranteed to update df
		df.loc[df["Country Name"] == country, "Tax revenue, percent of GDP"] = tax_df_row["TaxRev"]
	else:
		print(country)
		print("I do not exist")
		count += 1

\# Number of 2019 rows whose country name has no match in df
print(count)

#### Previous work: Adding Rule Of Law Index
count = 0
for index, rol_df_row in rule_of_law_df.iterrows():
	if rol_df_row["Country"] in df["Country Name"].values:
		for index, df_row in df.iterrows():
			if df_row["Country Name"] == rol_df_row["Country"]:
				df_row["Fundamental Rights"] = rol_df_row["Fundamental Rights"]
				df_row["Legal System Fairness"] = rol_df_row["Civil Justice + Criminal Justice"]
				break
	else:
		print(rol_df_row["Country"])
		print("I do not exist")
		count += 1

print(count)

#### Previous work: Checking Null Values
df.isnull().sum()
\# NULL VALUES:
\#  Access to electricity (% of population)                   2
\#  GDP per capita (current US$)                              22
\#  Compulsory education, duration (years)                    20
\#  Population density (people per sq. km of land area)        7
\#  GDP per capita growth (annual %)                          22
\#  Average hourly earnings of employees                      164
\#  Tax revenue, percent of GDP                                43

#### A change of plans
It was at this point that I started realizing how complex the project was getting. I had to accommodate data from many different sources, which often did not share a uniform way of naming or identifying countries. Standardized identifiers did not help much either, because different sources use different identifier schemes.
I also realized that data on some country-level cultural migration factors may be too difficult to obtain.
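The naming problem described above can be handled with one lookup table instead of a chain of `elif` branches. The following is a minimal sketch rather than the code used in this project: the mapping pairs are the mismatches hardcoded in the hourly-earnings cell earlier, while `NAME_FIXES`, `harmonise_names` and the sample rows (including "Atlantis") are hypothetical names introduced for illustration.

```python
import pandas as pd

# Labels used by another source -> World Bank "Country Name" values.
# The pairs are the mismatches hardcoded in the earlier hourly-earnings cell.
NAME_FIXES = {
    "Egypt": "Egypt, Arab Rep.",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "Republic of Korea": "Korea, Rep.",
    "Republic of Moldova": "Moldova",
    "Slovakia": "Slovak Republic",
    "Türkiye": "Turkiye",
    "Tanzania, United Republic of": "Tanzania",
    "United States of America": "United States",
}


def harmonise_names(labels: pd.Series) -> pd.Series:
    """Rewrite known mismatches; labels without an entry pass through unchanged."""
    return labels.replace(NAME_FIXES)


# Hypothetical sample rows, for illustration only.
other_source = pd.DataFrame({
    "ref_area.label": ["Egypt", "Ireland", "Republic of Korea", "Atlantis"],
    "obs_value": [1.0, 2.0, 3.0, 4.0],
})
world_bank_names = pd.Series(["Egypt, Arab Rep.", "Ireland", "Korea, Rep."])

other_source["Country Name"] = harmonise_names(other_source["ref_area.label"])

# Anything still unmatched needs a new entry in NAME_FIXES (or is genuinely absent).
is_known = other_source["Country Name"].isin(world_bank_names)
unmatched = other_source.loc[~is_known, "Country Name"].tolist()
print(unmatched)
```

Keeping the fixes in a dictionary means each new source only needs new entries, and the `unmatched` list plays the role of the "I do not exist" printout.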

In what I believe to be the best interest of the project, I am now going to scale back, reducing the number of factors I must tackle and centering on the common data structure of the World Bank Group.
I will work within four of the categories laid out in my literature review: Social, Economic, Environmental and Political.
These are my two sources:

##### P_Data_Extract_From_World_Development_Indicators.xlsx https://databank.worldbank.org/source/world-development-indicators# (World Bank Group 2024)
###### Social
- Access to electricity (% of population)
- Birth rate, crude (per 1,000 people) 
- Population density (people per sq. km of land area)
- Unemployment, total (% of total labor force) (national estimate)
- Population ages 20-24, male (% of male population)
- Population ages 20-24, female (% of female population)

###### Economic
- Current health expenditure (% of GDP) 
- Tax revenue (% of GDP)
- Military expenditure (% of GDP) 
- GDP per capita (current US$)
- GDP per capita growth (annual %)

###### Environmental
- PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)
- Fossil fuel energy consumption (% of total)


##### wgidataset.xlsx https://www.worldbank.org/en/publication/worldwide-governance-indicators (World Bank Group 2024)
###### Political
- Political Stability and Absence of Violence/Terrorism (pv)
- Government Effectiveness (ge)
- Regulatory Quality (rq)
- Control of Corruption (cc)
- Rule of Law (rl)

In [174]:
import pandas as pd
import numpy as np

# Solution for selecting a specific sheet of an excel file found from the answer of Vaibhav Jadhav at https://stackoverflow.com/questions/71527992/pandas-dataframe-to-specific-sheet-in-a-excel-file-without-losing-formatting
# Raw strings (r"...") stop the backslashes in these Windows-style paths from being read as escape sequences
world_development_indicator_df = pd.read_excel(r"2. Data Extraction\P_Data_Extract_From_World_Development_Indicators.xlsx",sheet_name="Data")
worldwide_governance_indicator_df = pd.read_excel(r"2. Data Extraction\wgidataset.xlsx",sheet_name="Sheet1")


country_name = world_development_indicator_df["Country Name"].unique()
s_electricity_access = []
s_birth_rate = []
s_pop_density = []
s_unemployment = []
s_young_pop_percentage_male = []
s_young_pop_percentage_female = []

ec_current_health_expenditure = []
ec_tax_revenue = []
ec_military_expenditure = []
ec_gdp_per_capita = []
ec_gdp_growth_per_capita = []

en_air_pollution = []
en_fossil_fuel = []


# the use of "index" below is from a question on the use of iterrows() in a for loop, answered by waitingkuo https://stackoverflow.com/questions/16476924/how-can-i-iterate-over-rows-in-a-pandas-dataframe
for index, wdi_df_row in world_development_indicator_df.iterrows():
	output_value = ""

	if wdi_df_row["2020 [YR2020]"] == ".." : output_value = None
	else: output_value = wdi_df_row["2020 [YR2020]"]

	if wdi_df_row["Series Name"] == "Access to electricity (% of population)": s_electricity_access.append(output_value)
	elif wdi_df_row["Series Name"] == "Birth rate, crude (per 1,000 people)": s_birth_rate.append(output_value)
	elif wdi_df_row["Series Name"] == "Population density (people per sq. km of land area)": s_pop_density.append(output_value)
	elif wdi_df_row["Series Name"] == "Unemployment, total (% of total labor force) (national estimate)": s_unemployment.append(output_value)
	elif wdi_df_row["Series Name"] == "Population ages 20-24, male (% of male population)": s_young_pop_percentage_male.append(output_value)
	elif wdi_df_row["Series Name"] == "Population ages 20-24, female (% of female population)": s_young_pop_percentage_female.append(output_value)

	elif wdi_df_row["Series Name"] == "Current health expenditure (% of GDP)": ec_current_health_expenditure.append(output_value)
	elif wdi_df_row["Series Name"] == "Tax revenue (% of GDP)": ec_tax_revenue.append(output_value)
	elif wdi_df_row["Series Name"] == "Military expenditure (% of GDP)": ec_military_expenditure.append(output_value)
	elif wdi_df_row["Series Name"] == "GDP per capita (current US$)": ec_gdp_per_capita.append(output_value)
	elif wdi_df_row["Series Name"] == "GDP per capita growth (annual %)": ec_gdp_growth_per_capita.append(output_value)

	elif wdi_df_row["Series Name"] == "PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)": en_air_pollution.append(output_value)
	elif wdi_df_row["Series Name"] == "Fossil fuel energy consumption (% of total)": en_fossil_fuel.append(output_value)


# Figured out how to make an empty list with help from jL4's answer at https://stackoverflow.com/questions/43336837/making-equal-size-lists-in-python
p_political_stability = []
p_political_stability.extend([None] * len(country_name))

p_government_effectiveness = []
p_government_effectiveness.extend([None] * len(country_name))

p_regulatory_quality = []
p_regulatory_quality.extend([None] * len(country_name))

p_control_of_corruption = []
p_control_of_corruption.extend([None] * len(country_name))

p_rule_of_law = []
p_rule_of_law.extend([None] * len(country_name))

df = pd.DataFrame({
	"Country Name":country_name,

	"Access to electricity (% of population)":s_electricity_access,
	"Birth rate, crude (per 1,000 people)":s_birth_rate,
	"Population density (people per sq. km of land area)":s_pop_density,
	"Unemployment, total (% of total labor force) (national estimate)":s_unemployment,
	"Population ages 20-24, male (% of male population)":s_young_pop_percentage_male,
	"Population ages 20-24, female (% of female population)":s_young_pop_percentage_female,

	"Current health expenditure (% of GDP)":ec_current_health_expenditure,
	"Tax revenue (% of GDP)":ec_tax_revenue,
	"Military expenditure (% of GDP)":ec_military_expenditure,
	"GDP per capita (current US$)":ec_gdp_per_capita,
	"GDP per capita growth (annual %)":ec_gdp_growth_per_capita,

	"PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)":en_air_pollution,
	"Fossil fuel energy consumption (% of total)":en_fossil_fuel,

	"Political Stability and Absence of Violence/Terrorism (pv)":p_political_stability,
	"Government Effectiveness (ge)":p_government_effectiveness,
	"Regulatory Quality (rq)":p_regulatory_quality,
	"Control of Corruption (cc)":p_control_of_corruption,
	"Rule of Law (rl)":p_rule_of_law
	})
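
The cell above builds one list per series and relies on every country having a row for every series, in the same order. A possible alternative, shown here only as a sketch on a hypothetical miniature of the "Data" sheet, is to pivot the long table into one column per series. The column names match the extract; `wdi_long` and `wdi_wide` are illustrative names. Note that `errors="coerce"` turns any non-numeric text into NaN, not only the ".." marker.

```python
import pandas as pd

# Hypothetical miniature of the "Data" sheet: one row per country and series.
wdi_long = pd.DataFrame({
    "Country Name": ["Albania", "Albania", "Algeria", "Algeria"],
    "Series Name": [
        "Access to electricity (% of population)",
        "Tax revenue (% of GDP)",
        "Access to electricity (% of population)",
        "Tax revenue (% of GDP)",
    ],
    "2020 [YR2020]": [100.0, 16.9, 99.7, ".."],
})

# ".." marks a missing observation; coercing to numeric turns it into NaN.
wdi_long["2020 [YR2020]"] = pd.to_numeric(wdi_long["2020 [YR2020]"], errors="coerce")

# One row per country, one column per series.
wdi_wide = wdi_long.pivot(
    index="Country Name", columns="Series Name", values="2020 [YR2020]"
).reset_index()
wdi_wide.columns.name = None  # drop the leftover "Series Name" axis label

print(wdi_wide)
```

Because the pivot aligns values by country name, a country that lacks a series ends up with NaN in that column instead of shifting every later value up by one position.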


In [175]:
for index, wgi_df_row in worldwide_governance_indicator_df.iterrows():

	# Solution for checking existence of matching country names in the main dataframe found from Akram's answer at https://stackoverflow.com/questions/21319929/how-to-determine-whether-a-pandas-column-contains-a-particular-value
	if wgi_df_row["countryname"] in df["Country Name"].values and wgi_df_row["year"] == 2020:
		# Fix for rows not updating has been found using https://chatgpt.com/share/68143640-b6b8-800c-9ff5-0f44f2d2b684
		country_mask = df["Country Name"] == wgi_df_row["countryname"]
		if wgi_df_row["indicator"] == "pv":
			df.loc[country_mask, "Political Stability and Absence of Violence/Terrorism (pv)"] = wgi_df_row["estimate"]
		elif wgi_df_row["indicator"] == "ge":
			df.loc[country_mask, "Government Effectiveness (ge)"] = wgi_df_row["estimate"]
		elif wgi_df_row["indicator"] == "rq":
			df.loc[country_mask, "Regulatory Quality (rq)"] = wgi_df_row["estimate"]
		elif wgi_df_row["indicator"] == "cc":
			df.loc[country_mask, "Control of Corruption (cc)"] = wgi_df_row["estimate"]
		elif wgi_df_row["indicator"] == "rl":
			df.loc[country_mask, "Rule of Law (rl)"] = wgi_df_row["estimate"]
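
The row-by-row update above can also be written as a filter, a pivot and a left merge. This is a sketch on hypothetical sample rows, assuming the sheet has the `countryname`, `year`, `indicator` and `estimate` columns used in the loop. `COLUMN_NAMES`, `wgi_long` and `df_demo` are illustrative names, and only two of the five indicators are listed to keep the example short.

```python
import pandas as pd

# Hypothetical miniature of the WGI sheet (long format).
wgi_long = pd.DataFrame({
    "countryname": ["Albania", "Albania", "Albania", "Zambia"],
    "year": [2020, 2020, 2019, 2020],
    "indicator": ["pv", "ge", "pv", "pv"],
    "estimate": [0.09, -0.16, 0.11, -0.13],
})
df_demo = pd.DataFrame({"Country Name": ["Albania", "Algeria", "Zambia"]})

# Indicator code -> column name used in the main dataframe (two shown here).
COLUMN_NAMES = {
    "pv": "Political Stability and Absence of Violence/Terrorism (pv)",
    "ge": "Government Effectiveness (ge)",
}

# Keep only 2020 rows for the indicators of interest.
keep = (wgi_long["year"] == 2020) & wgi_long["indicator"].isin(list(COLUMN_NAMES))
wgi_wide = (
    wgi_long[keep]
    .pivot(index="countryname", columns="indicator", values="estimate")
    .rename(columns=COLUMN_NAMES)
    .reset_index()
    .rename(columns={"countryname": "Country Name"})
)
wgi_wide.columns.name = None

# A left merge keeps every country in df_demo; unmatched countries get NaN.
merged = df_demo.merge(wgi_wide, on="Country Name", how="left")
print(merged)
```

The left merge makes unmatched country names visible as NaN rows, which is easier to audit than a loop that silently skips them.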


In [176]:
df

,Country Name,Access to electricity (% of population),"Birth rate, crude (per 1,000 people)",Population density (people per sq. km of land area),"Unemployment, total (% of total labor force) (national estimate)","Population ages 20-24, male (% of male population)","Population ages 20-24, female (% of female population)",Current health expenditure (% of GDP),Tax revenue (% of GDP),Military expenditure (% of GDP),GDP per capita (current US$),GDP per capita growth (annual %),"PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)",Fossil fuel energy consumption (% of total),Political Stability and Absence of Violence/Terrorism (pv),Government Effectiveness (ge),Regulatory Quality (rq),Control of Corruption (cc),Rule of Law (rl)
0,Afghanistan,97.7,36.601,59.900616,11.710,10.052936,9.895550,15.533614,,1.358857,510.787063,-5.382515,46.087094,,-2.702721,-1.611539,-1.389163,-1.493361,-1.831407
1,Albania,100.0,10.536,103.571131,11.639,8.297450,7.874271,7.503894,16.895541,1.295836,5370.778623,-2.756940,15.707004,57.07,0.088613,-0.155121,0.221967,-0.573539,-0.378028
2,Algeria,99.7,22.430,18.491553,,6.780634,6.707111,5.638317,,6.658711,3743.541952,-6.612475,25.552656,99.89,-0.84782,-0.573643,-1.3553,-0.666827,-0.798028
3,American Samoa,,15.658,248.805000,,6.775924,6.764532,,,,14489.258656,5.351787,6.715147,,1.090209,0.632649,0.525311,1.265454,1.118488
4,Andorra,100.0,6.831,164.638298,,5.834561,5.267937,8.786739,,,37361.090067,-12.223838,9.080281,,1.588675,1.749241,1.335421,1.265454,1.615182
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212,Virgin Islands (U.S.),100.0,12.100,303.685714,,4.989642,4.766906,,,,39787.374165,-1.264125,8.633428,,1.003442,0.632649,1.335421,-0.001319,0.870141
213,West Bank and Gaza,100.0,28.877,797.885216,25.895,9.200434,9.306387,10.276102,20.851777,,3233.568638,-13.496387,26.363626,,-2.018239,-0.677145,0.054561,-0.57627,-0.473232
214,"Yemen, Rep.",73.9,35.895,68.441129,,9.374350,9.313259,5.772708,,,559.564673,,34.832360,93.71,-2.647733,-2.362375,-1.857522,-1.711301,-1.773078
215,Zambia,44.6,34.408,25.638487,6.032,9.377990,9.327864,6.306744,16.418006,1.170496,951.644317,-5.567735,24.308592,18.45,-0.130979,-0.824682,-0.687794,-0.731882,-0.648217


## 3. Imputation of Missing Data

In [177]:
df.isnull().sum()

Country Name                                                               0
Access to electricity (% of population)                                    2
Birth rate, crude (per 1,000 people)                                       0
Population density (people per sq. km of land area)                        1
Unemployment, total (% of total labor force) (national estimate)          99
Population ages 20-24, male (% of male population)                         0
Population ages 20-24, female (% of female population)                     0
Current health expenditure (% of GDP)                                     25
Tax revenue (% of GDP)                                                    82
Military expenditure (% of GDP)                                           67
GDP per capita (current US$)                                               7
GDP per capita growth (annual %)                                           8
PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)    17
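
Raw counts are easier to judge as a share of all rows, since a column that is mostly missing is a weaker candidate for simple imputation. A minimal sketch on a hypothetical frame (`demo` and `missing_share` are illustrative names):

```python
import pandas as pd

# Hypothetical frame, for illustration only.
demo = pd.DataFrame({
    "Country Name": ["A", "B", "C", "D"],
    "Tax revenue (% of GDP)": [16.0, None, None, 21.0],
    "Birth rate, crude (per 1,000 people)": [10.5, 22.4, 15.6, 6.8],
})

# isnull() gives True/False per cell; the mean of booleans is the missing share.
missing_share = demo.isnull().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_share)
```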

In [178]:
import math
s_electricity_access_avg = math.floor(df['Access to electricity (% of population)'].mean())
df["Access to electricity (% of population)"] = df["Access to electricity (% of population)"].fillna(s_electricity_access_avg)

In [179]:
s_pop_density_avg = math.floor(df['Population density (people per sq. km of land area)'].mean())
df["Population density (people per sq. km of land area)"] = df["Population density (people per sq. km of land area)"].fillna(s_pop_density_avg)

In [180]:
s_unemployment_avg = math.floor(df['Unemployment, total (% of total labor force) (national estimate)'].mean())
df["Unemployment, total (% of total labor force) (national estimate)"] = df["Unemployment, total (% of total labor force) (national estimate)"].fillna(s_unemployment_avg)

In [181]:
ec_current_health_expenditure_avg = math.floor(df['Current health expenditure (% of GDP)'].mean())
df["Current health expenditure (% of GDP)"] = df["Current health expenditure (% of GDP)"].fillna(ec_current_health_expenditure_avg)

In [182]:
ec_tax_revenue_avg = math.floor(df['Tax revenue (% of GDP)'].mean())
df["Tax revenue (% of GDP)"] = df["Tax revenue (% of GDP)"].fillna(ec_tax_revenue_avg)

In [183]:
ec_military_expenditure_avg = math.floor(df['Military expenditure (% of GDP)'].mean())
df["Military expenditure (% of GDP)"] = df["Military expenditure (% of GDP)"].fillna(ec_military_expenditure_avg)

In [184]:
ec_gdp_per_capita_avg = math.floor(df['GDP per capita (current US$)'].mean())
df["GDP per capita (current US$)"] = df["GDP per capita (current US$)"].fillna(ec_gdp_per_capita_avg)

In [185]:
ec_gdp_growth_per_capita_avg = math.floor(df['GDP per capita growth (annual %)'].mean())
df["GDP per capita growth (annual %)"] = df["GDP per capita growth (annual %)"].fillna(ec_gdp_growth_per_capita_avg)

In [186]:
en_air_pollution_avg = math.floor(df['PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)'].mean())
df["PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)"] = df["PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)"].fillna(en_air_pollution_avg)

In [187]:
en_fossil_fuel_avg = math.floor(df['Fossil fuel energy consumption (% of total)'].mean())
df["Fossil fuel energy consumption (% of total)"] = df["Fossil fuel energy consumption (% of total)"].fillna(en_fossil_fuel_avg)

In [188]:
p_political_stability_avg = math.floor(df['Political Stability and Absence of Violence/Terrorism (pv)'].mean())
df["Political Stability and Absence of Violence/Terrorism (pv)"] = df["Political Stability and Absence of Violence/Terrorism (pv)"].fillna(p_political_stability_avg)

In [189]:
p_government_effectiveness_avg = math.floor(df['Government Effectiveness (ge)'].mean())
df["Government Effectiveness (ge)"] = df["Government Effectiveness (ge)"].fillna(p_government_effectiveness_avg)

In [190]:
p_regulatory_quality_avg = math.floor(df['Regulatory Quality (rq)'].mean())
df["Regulatory Quality (rq)"] = df["Regulatory Quality (rq)"].fillna(p_regulatory_quality_avg)

In [191]:
p_control_of_corruption_avg = math.floor(df['Control of Corruption (cc)'].mean())
df["Control of Corruption (cc)"] = df["Control of Corruption (cc)"].fillna(p_control_of_corruption_avg)

In [None]:
p_rule_of_law_avg = math.floor(df['Rule of Law (rl)'].mean())
df["Rule of Law (rl)"] = df["Rule of Law (rl)"].fillna(p_rule_of_law_avg)
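
The cells above repeat the same two lines for each column, so they could be collapsed into a single call. The sketch below uses a hypothetical frame and assumes the columns already have a numeric dtype (object columns would first need `pd.to_numeric`). It fills with the unrounded column mean, which is a different choice from the `math.floor` used above: flooring a negative mean rounds it away from zero, which matters for the governance estimates because their values lie close to zero.

```python
import math

import pandas as pd

# Hypothetical frame, for illustration only.
demo = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "Rule of Law (rl)": [-0.5, 0.25, None],
    "Tax revenue (% of GDP)": [16.0, None, 21.0],
})

# Mean of every numeric column (NaN values are skipped by default).
numeric_cols = demo.select_dtypes(include="number").columns
column_means = demo[numeric_cols].mean()

# Passing a Series to fillna fills each column with its own mean.
imputed = demo.copy()
imputed[numeric_cols] = demo[numeric_cols].fillna(column_means)

# For comparison: what flooring would have used for the rule-of-law column.
floored = math.floor(column_means["Rule of Law (rl)"])
print(imputed)
print(column_means["Rule of Law (rl)"], floored)
```

In this toy example the rule-of-law mean is -0.125 while its floor is -1, so the two approaches would impute noticeably different values.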

In [192]:
df.isnull().sum()

Country Name                                                               0
Access to electricity (% of population)                                    0
Birth rate, crude (per 1,000 people)                                       0
Population density (people per sq. km of land area)                        0
Unemployment, total (% of total labor force) (national estimate)           0
Population ages 20-24, male (% of male population)                         0
Population ages 20-24, female (% of female population)                     0
Current health expenditure (% of GDP)                                      0
Tax revenue (% of GDP)                                                     0
Military expenditure (% of GDP)                                            0
GDP per capita (current US$)                                               0
GDP per capita growth (annual %)                                           0
PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)     0

## 4. Multivariate Analysis

## 5. Normalisation

## 6. Weighting and Aggregation

## 7. Links to other indicators

## 8. Visualisation of the results