Humanitarian Unrest Classifier
Project McNulty: Team Saving the World
Welcome to Project McNulty. Our goal is to identify areas in the world that may be susceptible to humanitarian unrest, specifically focusing on crises that might prompt a mass migration of refugees.
The number of refugees leaving a country will be standardized by that country's population, and each country will be assigned a label of 'at stress' (1) or 'not at stress' (0).
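As a minimal sketch of that labeling step, the snippet below divides refugee outflow by population and applies a cutoff. The function name `stress_label`, the `threshold` value, and the example figures are all hypothetical; the project description does not state the actual cutoff.

```python
def stress_label(refugees_out, population, threshold=0.001):
    """Return 1 ('at stress') or 0 ('not at stress') for one country-year.

    threshold is an assumed cutoff (refugees per resident), chosen only
    for illustration.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    rate = refugees_out / population  # refugees per resident
    return 1 if rate >= threshold else 0


# Illustrative, made-up figures (not real data): (refugees leaving, population)
examples = {
    "Country A": (50_000, 10_000_000),
    "Country B": (2_000, 40_000_000),
}
labels = {name: stress_label(r, p) for name, (r, p) in examples.items()}
print(labels)
```

Dividing by population keeps large countries from being flagged simply because their absolute numbers are larger.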
Our team will create supervised machine learning models to examine key factors that may lead to humanitarian unrest: economic, environmental and climatic, terrorism and national security, political and population demographics, and food. We will use pre-processing methods, choose an appropriate model, and use the output of individual predictions as features for a final model.
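The last step, feeding each factor model's prediction into a final model, is often called stacking. The sketch below is one possible shape for it, not the team's implementation: it assumes each factor model emits a probability, and it assumes a small logistic regression as the final model. The helper names and all numbers are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit a logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Each row holds made-up probabilities from five hypothetical factor models:
# economic, environmental, security, political/demographic, food.
meta_X = [
    [0.9, 0.7, 0.8, 0.6, 0.9],
    [0.8, 0.6, 0.9, 0.7, 0.7],
    [0.2, 0.3, 0.1, 0.4, 0.2],
    [0.1, 0.2, 0.3, 0.2, 0.1],
]
meta_y = [1, 1, 0, 0]  # 'at stress' labels for the same rows

weights, bias = fit_logistic(meta_X, meta_y)
print(predict(weights, bias, [0.85, 0.5, 0.7, 0.6, 0.8]))
```

In practice the factor-model predictions used to train the final model would come from held-out data, so the final model does not learn from predictions the factor models made on their own training rows.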
This project will be useful to anyone with an interest in geopolitics. We hope to package the project as an easy-to-navigate interactive infographic that both firms and individuals with international interests could use to make educated business, travel, and life decisions.
We share our data in a single database on one EC2 server. Each of us connects locally to this shared PostgreSQL database through Postico, a graphical client.
Here is a link to our presentation.
The idea is simple: if you are in poverty, you aren't happy about it. Also, if something bad happens (war, natural disaster, change in leadership), poverty exacerbates the situation. The World Bank keeps detailed records of various economic factors that can contribute to poverty. The UN Refugee Agency’s (UNHCR) eighth annual High Commissioner’s Dialogue on Protection Challenges described several of these factors: high unemployment, especially among youth, uneven development, lack of access to international markets, and income inequality. To measure these horrible things, I'm looking at data recorded by the World Bank and using the following measures as features:
- The Gini Coefficient: the most commonly used way of measuring wealth distribution. The larger the wealth gap, the more restless the poorer residents.
- Unemployment of youth, and also general unemployment.
- GDP per capita, based on purchasing power parity (PPP) and expressed in current international dollars.
- GDP per capita growth from the previous year for the bottom 40% of the population.
- Food imports as a percentage of merchandise imports.
- Total labor force participation.

With these economic factors accounted for, our model will get a clearer look at the health of a country on some specific measures.
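For readers unfamiliar with the Gini coefficient listed above, here is a small worked sketch of how it can be computed from a list of incomes using the standard rank-weighted formula. The project itself takes the value directly from the World Bank (which reports it on a 0-100 scale), so this function is illustrative only and the inputs are made up.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # where x is sorted ascending and the rank i starts at 1.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([10, 10, 10, 10]))  # everyone earns the same
print(gini([0, 0, 0, 40]))     # one person holds everything
```

With a finite sample of n people the maximum value is (n - 1) / n rather than exactly 1, which is why the second call stays below 1.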
Data for the environmental algorithm comes primarily from the International Disasters Database (EM-DAT), which contains a record of 7000 natural disasters since 1980, along with the number of people displaced or affected. That data will be tied to a record of all major earthquakes (magnitude 5.3+) since 1988 from the United States Geological Survey (USGS), and to other data that measure the severity of other disasters (floods, storms, droughts, etc.). The final model will use the severity of an event and other economic and political data to predict whether or not the disaster will displace a significant number of people.
Foreseeable challenges include: tying the earthquake records (each event is a lat-long point) to their EM-DAT records (which are organized by country), finding severity data for other types of disasters, and selecting enough features that this global model does not over-generalize.
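To make the first challenge concrete, the sketch below assigns a lat-long point to candidate countries using very rough bounding boxes. This is a deliberately crude stand-in: the boxes are approximate values chosen for illustration, the names `ROUGH_BOXES` and `countries_for_point` are hypothetical, and a real pipeline would more likely test points against proper country boundary polygons.

```python
# Very rough bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# Approximate, for illustration only; real borders are not rectangles.
ROUGH_BOXES = {
    "Japan": (24.0, 46.0, 123.0, 146.0),
    "Chile": (-56.0, -17.0, -76.0, -66.0),
    "Nepal": (26.0, 31.0, 80.0, 89.0),
}


def countries_for_point(lat, lon, boxes=ROUGH_BOXES):
    """Return every country whose box contains the point (zero, one, or several)."""
    return [
        name
        for name, (min_lat, max_lat, min_lon, max_lon) in boxes.items()
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
    ]


print(countries_for_point(28.2, 84.7))  # falls inside the Nepal box
print(countries_for_point(0.0, 0.0))    # open ocean: no match
```

Returning a list rather than a single name keeps the ambiguous cases visible: offshore epicenters match nothing, and points near borders can match more than one box, so both need a follow-up rule before joining to the country-level disaster records.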
This feature is built from a number of indices published by think tanks and nonprofits such as the Institute for Economics and Peace (IEP) and the Council on Foreign Relations (CFR) that cover acts of terrorism, insurgencies, and border disputes. IEP's Global Peace Index (GPI) is also included in this feature. The data is sourced from a wide range of respected sources, including the International Institute for Strategic Studies, The World Bank, various UN agencies, peace institutes and the EIU. These indices are combined in a weighted model to produce an overall Security & Terrorism score.
- Council on Foreign Relations (CFR), Invisible Armies Insurgency Tracker
- Institute for Economics and Peace (IEP), Global Peace Index 2015, Global Terrorism Index 2015
- Uppsala Universitet, Department of Peace and Conflict Research, Conflict Data Program - UCDP/PRIO Armed Conflict Dataset v.4-2015, 1946 – 2014
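One way the weighted combination described above could work is to rescale each index to a common 0-1 range and then take a weighted average per country. The sketch below assumes exactly that; the index names, values, and weights are made up, since the actual weighting is not stated here.

```python
def min_max(values):
    """Rescale a dict of country -> raw value to the 0-1 range."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 0.0 for k in values}
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}


def weighted_score(indices, weights):
    """Weighted average of min-max-scaled indices for countries present in all of them."""
    total_w = sum(weights.values())
    scaled = {name: min_max(vals) for name, vals in indices.items()}
    countries = set.intersection(*(set(vals) for vals in indices.values()))
    return {
        c: sum(weights[name] * scaled[name][c] for name in indices) / total_w
        for c in countries
    }


# Made-up index values and weights, for illustration only.
indices = {
    "peace_index": {"A": 1.0, "B": 2.0, "C": 3.0},
    "terrorism_index": {"A": 0.0, "B": 5.0, "C": 10.0},
}
weights = {"peace_index": 0.6, "terrorism_index": 0.4}

scores = weighted_score(indices, weights)
print(sorted(scores.items()))
```

Rescaling first matters because the source indices use different ranges; without it, the index with the largest raw numbers would dominate regardless of its weight.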
The following features were examined in my model:
- Corruption Perceptions Index [1]
- Civil Liberties [2]
- Political Rights [2]
- Freedom Status [2]
- Ratio of Female Legislators [3]
- Gender Ratio [3]
- Population Growth [3]
- Age <5 Mortality [3]
- Life Expectancy [3]
- Population Age 0-14 [3]
- Population Age 15-64 [3]
- Population Age 65+ [3]
Since 1995, Transparency International has published its Corruption Perceptions Index for countries around the world. According to their website, this index's purpose is to 'score countries on how corrupt their public sectors are seen to be.' Fortunately for us, someone else has already compiled all of this data into a single file. The biggest challenge with this data is that the indexing method changed after 2010, so scores from before and after the change are not directly comparable. The most likely options are either to exclude the latest data or to normalize the scores within each year.
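The second option, normalizing within each year, could look like the sketch below, which converts every score to a z-score relative to the other countries in the same year. Z-scoring is an assumption on my part (min-max scaling or ranks would also fit the description), and the rows are made-up numbers on two different scales.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_year(rows):
    """rows: list of (country, year, score).

    Returns {(country, year): z-score of that score within its year}.
    """
    by_year = defaultdict(list)
    for _, year, score in rows:
        by_year[year].append(score)

    out = {}
    for country, year, score in rows:
        mu = mean(by_year[year])
        sigma = pstdev(by_year[year])  # population standard deviation
        out[(country, year)] = 0.0 if sigma == 0 else (score - mu) / sigma
    return out


# Made-up scores: one year on a small scale, another on a larger scale.
rows = [
    ("A", 2009, 2.0), ("B", 2009, 5.0), ("C", 2009, 8.0),
    ("A", 2013, 20.0), ("B", 2013, 50.0), ("C", 2013, 80.0),
]
z = zscore_by_year(rows)
print(z[("A", 2009)], z[("A", 2013)])
```

After normalization a country's value says how it compares with its peers in that year, so a change of scale between years no longer shows up as a jump in the feature.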
This political data will be combined with data from Freedom House, which keeps an index of the political rights, civil liberties, and freedom status of countries around the world.
Age and gender demographics data from The World Bank will also be analyzed. I used the dataset available from Kaggle, which is slightly transformed from the original data. From these sets of data we will try to see whether it is possible to predict a country's risk of a humanitarian crisis from its corruption perception and its population's demographics. For example, if a population has a higher percentage of males or a higher percentage of younger people, is it more likely to enter a crisis when its people perceive their country as corrupt?
The data for the food algorithm comes primarily from the World Food Programme's global food prices database. This 580,000-row database contains monthly price information on many food items in markets across the globe. The advantage of this data is that it is precise to the city level (country, region and city) as well as to the month level. The prediction algorithm will be chosen for how well it handles time-series data at that granularity. Some foreseeable roadblocks include prices that may be reported in local currencies. Another consideration is the purchasing power of a citizen in each region. Further exploratory analysis may be required.