
# UNHCR Forcibly Displaced People Capstone Project

### Importing, Cleaning and Munging

---

#### Background


- A real-world dataset provided by the United Nations Refugee Agency (UNHCR).
- The dataset describes the demographics of forcibly displaced people across the world.
- Data collection is ongoing, spanning from 2001 to the present (November 2021).

#### Hypothesis

The age and gender of a displaced individual have the greatest influence on the type of accommodation they will be allocated.

---

# Importing the data

In [107]:
import pandas as pd
import numpy as np
from pprint import pprint
import requests
import re
import warnings

warnings.filterwarnings('ignore')

In [108]:
#Loaded the data into a dataframe. The data was already in a CSV file.

demo_global = pd.read_csv('/Users/dayosangowawa/Downloads/demographics_residing_world.csv')
demo_global

Unnamed: 0,Year,Country of Origin Code,Country of Asylum Code,Country of Origin Name,Country of Asylum Name,Population Type,location,urbanRural,accommodationType,Female 0-4,...,Female Unknown,Female Total,Male 0-4,Male 5-11,Male 12-17,Male 18-59,Male 60 or more,Male Unknown,Male Total,Total
0,#date+year,#country+code+origin,#country+code+asylum,#country+name+origin,#country+name+asylum,#indicator+population_type,,,,#affected+f+infants+age_0_4,...,#affected+f+unknown_age,#affected+f+total,#affected+m+infants+age_0_4,#affected+m+children+age_5_11,#affected+m+adolescents+age_12_17,#affected+m+adults+age_18_59,#affected+m+elderly+age_60,#affected+m+unknown_age,#affected+m+total,#affected+all+total
1,2001,AFG,AFG,Afghanistan,Afghanistan,IDP,Central,C,U,0,...,0,0,0,0,0,0,0,0,0,380000
2,2001,AFG,AFG,Afghanistan,Afghanistan,IDP,Dispersed in the country / territory,C,U,0,...,0,0,0,0,0,0,0,0,0,220000
3,2001,AFG,AFG,Afghanistan,Afghanistan,IDP,North,C,U,0,...,0,0,0,0,0,0,0,0,0,300000
4,2001,AFG,AFG,Afghanistan,Afghanistan,IDP,West,C,U,0,...,0,0,0,0,0,0,0,0,0,300000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251621,2020,UGA,ZWE,Uganda,Zimbabwe,REF,Tongogara : Point,R,P,0,...,0,0,0,0,0,9,0,0,9,9
251622,2020,ZMB,ZWE,Zambia,Zimbabwe,OOC,Harare : City,U,I,0,...,0,5,0,0,0,0,0,0,0,5
251623,2020,ZWE,ZWE,Zimbabwe,Zimbabwe,OOC,Harare : City,U,I,5,...,0,26,5,6,5,0,0,0,16,42
251624,2020,ZWE,ZWE,Zimbabwe,Zimbabwe,OOC,Tongogara : Point,R,P,6,...,0,68,5,12,5,0,0,0,22,90


# Cleaning

### Initial Exploration

In [109]:
#Visualized the column names

demo_global.columns

Index(['Year', 'Country of Origin Code', 'Country of Asylum Code',
       'Country of Origin Name', 'Country of Asylum Name', 'Population Type',
       'location', 'urbanRural', 'accommodationType', 'Female 0-4',
       'Female 5-11', 'Female 12-17', 'Female 18-59', 'Female 60 or more',
       'Female Unknown', 'Female Total', 'Male 0-4', 'Male 5-11', 'Male 12-17',
       'Male 18-59', 'Male 60 or more', 'Male Unknown', 'Male Total', 'Total'],
      dtype='object')

In [110]:
#Got rid of the first row as it did not contain data. It simply explained the column names in more detail. 

demo_global = demo_global.iloc[1:]
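
As an alternative to slicing the tag row off after loading, the row can be skipped while reading the file. The sketch below uses a small made-up sample (the three columns and values are illustrative, not the real file) to show that `skiprows=[1]` keeps the header and lets pandas infer numeric dtypes. Whether this also removes the mixed string/integer years seen later in this notebook is an assumption that would need checking against the real file.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the file layout: header, tag row, then data.
sample_csv = io.StringIO(
    "Year,Country of Origin Code,Total\n"
    "#date+year,#country+code+origin,#affected+all+total\n"
    "2001,AFG,380000\n"
    "2002,AFG,220000\n"
)

# skiprows=[1] skips the second physical line (the tag row) but keeps the header.
sample = pd.read_csv(sample_csv, skiprows=[1])
print(sample.dtypes)
```

With the tag row gone before parsing, the numeric columns in the sample are read as integers rather than `object`.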

In [111]:
#Checked the null values - there were zero.

demo_global.isnull().sum()

Year                      0
Country of Origin Code    0
Country of Asylum Code    0
Country of Origin Name    0
Country of Asylum Name    0
Population Type           0
location                  0
urbanRural                0
accommodationType         0
Female 0-4                0
Female 5-11               0
Female 12-17              0
Female 18-59              0
Female 60 or more         0
Female Unknown            0
Female Total              0
Male 0-4                  0
Male 5-11                 0
Male 12-17                0
Male 18-59                0
Male 60 or more           0
Male Unknown              0
Male Total                0
Total                     0
dtype: int64

In [112]:
demo_global['Country of Asylum Name'].nunique()

201

In [113]:
demo_global['Country of Origin Name'].nunique()

221

In [114]:
demo_global['Year'].unique()

array(['2001', '2002', '2003', '2004', '2005', 2005, 2006, 2007, 2008,
       2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019,
       2020], dtype=object)
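
The output above lists 2005 twice, once as the string `'2005'` and once as the integer `2005`, so the column mixes types. A minimal sketch of one way to normalise such a column, using a made-up Series rather than the real data:

```python
import pandas as pd

# Hypothetical object column mixing strings and integers, like the output above.
year = pd.Series(['2001', '2005', 2005, 2020], dtype=object)

# pd.to_numeric raises on values it cannot parse, so bad entries surface early.
year_clean = pd.to_numeric(year, errors='raise').astype('int64')
print(sorted(year_clean.unique()))
```

After conversion, `'2005'` and `2005` collapse into a single value, so counts of unique years are no longer inflated.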

### Post-2012 subset

According to the data dictionary that was provided, data on the accommodation type was only collected from 2012 onwards. As a result, observations collected prior to 2012 had been encoded as 'U' (Unknown).

For consistency, I decided to proceed with only the observations collected from 2012 onwards.

This was important because accommodation type was my target variable.

In [115]:
#Checked the dtypes
demo_global.dtypes

Year                      object
Country of Origin Code    object
Country of Asylum Code    object
Country of Origin Name    object
Country of Asylum Name    object
Population Type           object
location                  object
urbanRural                object
accommodationType         object
Female 0-4                object
Female 5-11               object
Female 12-17              object
Female 18-59              object
Female 60 or more         object
Female Unknown            object
Female Total              object
Male 0-4                  object
Male 5-11                 object
Male 12-17                object
Male 18-59                object
Male 60 or more           object
Male Unknown              object
Male Total                object
Total                     object
dtype: object

In [116]:
#First had to change the dtype of Year column to 'integer'
demo_global['Year'] = demo_global['Year'].astype(int)

In [117]:
#Checked how many observations I would be left with, post-2012. Still 146958 rows.

demo_global[demo_global['Year'] >= 2012]

Unnamed: 0,Year,Country of Origin Code,Country of Asylum Code,Country of Origin Name,Country of Asylum Name,Population Type,location,urbanRural,accommodationType,Female 0-4,...,Female Unknown,Female Total,Male 0-4,Male 5-11,Male 12-17,Male 18-59,Male 60 or more,Male Unknown,Male Total,Total
104668,2012,COL,ABW,Colombia,Aruba,ASY,Oranjestad : Point,U,I,0,...,0,0,0,0,0,5,0,0,5,5
104669,2012,CUB,ABW,Cuba,Aruba,ASY,Oranjestad : Point,U,I,0,...,0,0,0,0,0,0,0,0,0,0
104670,2012,AFG,AFG,Afghanistan,Afghanistan,IDP,Badakhshan : Wilayat - Province,R,I,0,...,0,67,0,15,17,38,5,0,75,142
104671,2012,AFG,AFG,Afghanistan,Afghanistan,IDP,Badghis : Wilayat - Province,U,S,74,...,0,3881,75,840,921,2278,206,0,4320,8201
104672,2012,AFG,AFG,Afghanistan,Afghanistan,IDP,Baghlan : Wilayat - Province,U,S,12,...,0,622,12,135,147,365,33,0,692,1314
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251621,2020,UGA,ZWE,Uganda,Zimbabwe,REF,Tongogara : Point,R,P,0,...,0,0,0,0,0,9,0,0,9,9
251622,2020,ZMB,ZWE,Zambia,Zimbabwe,OOC,Harare : City,U,I,0,...,0,5,0,0,0,0,0,0,0,5
251623,2020,ZWE,ZWE,Zimbabwe,Zimbabwe,OOC,Harare : City,U,I,5,...,0,26,5,6,5,0,0,0,16,42
251624,2020,ZWE,ZWE,Zimbabwe,Zimbabwe,OOC,Tongogara : Point,R,P,6,...,0,68,5,12,5,0,0,0,22,90


In [118]:
#Confirmed that accommodation type was not collected before 2012 ('U' is unknown).

demo_global[demo_global['Year'] < 2012]['accommodationType'].value_counts()

U    104667
Name: accommodationType, dtype: int64

In [119]:
#Checking the value counts of each accommodation type post-2012.

demo_global[demo_global['Year'] >= 2012]['accommodationType'].value_counts()

I    73267
U    60677
P     6870
S     2396
R     1932
C     1816
Name: accommodationType, dtype: int64

In [120]:
#Assigning the post-2012 subset to a new dataframe.

post2012subset = demo_global[demo_global['Year'] >= 2012]
post2012subset.head()

Unnamed: 0,Year,Country of Origin Code,Country of Asylum Code,Country of Origin Name,Country of Asylum Name,Population Type,location,urbanRural,accommodationType,Female 0-4,...,Female Unknown,Female Total,Male 0-4,Male 5-11,Male 12-17,Male 18-59,Male 60 or more,Male Unknown,Male Total,Total
104668,2012,COL,ABW,Colombia,Aruba,ASY,Oranjestad : Point,U,I,0,...,0,0,0,0,0,5,0,0,5,5
104669,2012,CUB,ABW,Cuba,Aruba,ASY,Oranjestad : Point,U,I,0,...,0,0,0,0,0,0,0,0,0,0
104670,2012,AFG,AFG,Afghanistan,Afghanistan,IDP,Badakhshan : Wilayat - Province,R,I,0,...,0,67,0,15,17,38,5,0,75,142
104671,2012,AFG,AFG,Afghanistan,Afghanistan,IDP,Badghis : Wilayat - Province,U,S,74,...,0,3881,75,840,921,2278,206,0,4320,8201
104672,2012,AFG,AFG,Afghanistan,Afghanistan,IDP,Baghlan : Wilayat - Province,U,S,12,...,0,622,12,135,147,365,33,0,692,1314


In [121]:
# Reset the index, as it was no longer contiguous after taking a subset of the original data.

post2012subset = post2012subset.reset_index()

In [122]:
post2012subset.drop(columns = ['index'], inplace=True)

In [123]:
post2012subset.head()

Unnamed: 0,Year,Country of Origin Code,Country of Asylum Code,Country of Origin Name,Country of Asylum Name,Population Type,location,urbanRural,accommodationType,Female 0-4,...,Female Unknown,Female Total,Male 0-4,Male 5-11,Male 12-17,Male 18-59,Male 60 or more,Male Unknown,Male Total,Total
0,2012,COL,ABW,Colombia,Aruba,ASY,Oranjestad : Point,U,I,0,...,0,0,0,0,0,5,0,0,5,5
1,2012,CUB,ABW,Cuba,Aruba,ASY,Oranjestad : Point,U,I,0,...,0,0,0,0,0,0,0,0,0,0
2,2012,AFG,AFG,Afghanistan,Afghanistan,IDP,Badakhshan : Wilayat - Province,R,I,0,...,0,67,0,15,17,38,5,0,75,142
3,2012,AFG,AFG,Afghanistan,Afghanistan,IDP,Badghis : Wilayat - Province,U,S,74,...,0,3881,75,840,921,2278,206,0,4320,8201
4,2012,AFG,AFG,Afghanistan,Afghanistan,IDP,Baghlan : Wilayat - Province,U,S,12,...,0,622,12,135,147,365,33,0,692,1314


### Dropping columns

I dropped the following columns as they did not contribute to what I was trying to achieve:

- Year
- Country of Origin Code
- Country of Asylum Code
- Female Total
- Male Unknown
- Male Total
- Female Unknown
- Total
- Location

*A potential future exploration idea is to keep the Year column and treat the dataset as a time series, to investigate trends over the years.*

In [124]:
post2012subset.drop(columns = ['Year', 'Country of Origin Code', 'Country of Asylum Code',
                               'Female Total', 'Male Unknown', 'Male Total', 'Female Unknown', 
                               'Total', 'location'], inplace=True)

### Renaming the column names

I cleaned all of the column names by replacing white spaces with underscores and changing all of the characters to lowercase, to make the later analysis steps easier.

In [125]:
post2012subset.head()

Unnamed: 0,Country of Origin Name,Country of Asylum Name,Population Type,urbanRural,accommodationType,Female 0-4,Female 5-11,Female 12-17,Female 18-59,Female 60 or more,Male 0-4,Male 5-11,Male 12-17,Male 18-59,Male 60 or more
0,Colombia,Aruba,ASY,U,I,0,0,0,0,0,0,0,0,5,0
1,Cuba,Aruba,ASY,U,I,0,0,0,0,0,0,0,0,0,0
2,Afghanistan,Afghanistan,IDP,R,I,0,14,16,37,0,0,15,17,38,5
3,Afghanistan,Afghanistan,IDP,U,S,74,810,853,2004,140,75,840,921,2278,206
4,Afghanistan,Afghanistan,IDP,U,S,12,130,138,321,21,12,135,147,365,33


In [126]:
post2012subset.columns = post2012subset.columns.str.lower()

In [127]:
post2012subset.columns = post2012subset.columns.str.replace(' ', '_')

In [128]:
post2012subset.columns

Index(['country_of_origin_name', 'country_of_asylum_name', 'population_type',
       'urbanrural', 'accommodationtype', 'female_0-4', 'female_5-11',
       'female_12-17', 'female_18-59', 'female_60_or_more', 'male_0-4',
       'male_5-11', 'male_12-17', 'male_18-59', 'male_60_or_more'],
      dtype='object')

In [129]:
post2012subset.rename(columns = {'urbanrural':'urban_or_rural_location', 'accommodationtype': 'accommodation_type',
                                 'female_0-4': 'female_aged_0-4 years', 'female_5-11': 'female_aged_5-11 years', 
                                 'female_12-17':'female_aged_12-17 years', 'female_18-59':'female_aged_18-59 years',
                                'female_60_or_more': 'female_aged_over_60_years', 'male_0-4': 'male_aged_0-4_years',
                                'male_5-11': 'male_aged_5-11_years', 'male_12-17': 'male_aged_12-17 years',
                                'male_18-59': 'male_aged_18-59_years', 'male_60_or_more':'male_aged_over_60_years',
                                'country_of_origin_name':'country_of_origin', 'country_of_asylum_name': 'country_of_asylum'}, 
                                  inplace=True)

### Changing the dtypes

I changed the dtypes of all the count columns to integers.

In [130]:
post2012subset = post2012subset.astype({'female_aged_0-4 years':'int', 
                                        'female_aged_5-11 years':'int',
                                        'female_aged_12-17 years':'int',
                                        'female_aged_18-59 years':'int', 
                                        'female_aged_over_60_years': 'int', 
                                        'male_aged_0-4_years':'int', 
                                        'male_aged_5-11_years':'int', 
                                        'male_aged_12-17 years':'int',
                                        'male_aged_18-59_years':'int', 
                                        'male_aged_over_60_years':'int'})

In [131]:
post2012subset.dtypes

country_of_origin            object
country_of_asylum            object
population_type              object
urban_or_rural_location      object
accommodation_type           object
female_aged_0-4 years         int64
female_aged_5-11 years        int64
female_aged_12-17 years       int64
female_aged_18-59 years       int64
female_aged_over_60_years     int64
male_aged_0-4_years           int64
male_aged_5-11_years          int64
male_aged_12-17 years         int64
male_aged_18-59_years         int64
male_aged_over_60_years       int64
dtype: object
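
The listing above shows that some of the renamed columns still contain a space (for example `female_aged_0-4 years`) while others use underscores throughout (`male_aged_0-4_years`). If fully consistent names are wanted, one option is to normalise them in a single pass. This is a sketch on a stand-in frame; `to_snake_case` is a hypothetical helper, not part of the notebook.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: lower-case a label and turn runs of whitespace into underscores."""
    return re.sub(r'\s+', '_', name.strip().lower())

# Small stand-in frame using two of the labels shown in the listing above.
demo = pd.DataFrame(columns=['female_aged_0-4 years', 'male_aged_0-4_years'])
demo = demo.rename(columns=to_snake_case)
print(list(demo.columns))
```

Any later cell that refers to the old labels would need to be updated to match.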

### Tidying the column values

- Wrote the Urban/Rural location abbreviations in full to make my plots easier to interpret during the analysis steps.
- Removed special characters, such as brackets and forward slashes, from the Country of Origin and Country of Asylum values so they would not cause any issues in the later stages of exploring and modeling my data.
- Dropped the observations with accommodation type encoded as 'U' (unknown), as they carry no information about the target variable that I was trying to predict. Unfortunately there was a high number of unknown accommodation types, so this is something to investigate further in the future.
- Also dropped the observations that had the urban or rural location encoded as 'V' (unknown), as these will not be helpful in the later stages of exploratory data analysis and modeling.
- Merged the population types VDA (Venezuelans Displaced Abroad) and STA (Stateless persons) into the population type OOC (Others of Concern), as they had very few observations. *Reminder: Others of Concern are those that do not fall into any other category.*

In [132]:
post2012subset['urban_or_rural_location'] = post2012subset['urban_or_rural_location'].replace({'U': 'Urban', 'R': 'Rural', 'V': 'Unknown'})

In [133]:
#Checking the value counts of the different classes of urban or rural locations

post2012subset['urban_or_rural_location'].value_counts()

Urban      77226
Unknown    55775
Rural      13957
Name: urban_or_rural_location, dtype: int64

In [134]:
#Dropping the locations that were classed as 'unknown'

post2012subset = post2012subset[post2012subset['urban_or_rural_location'] != 'Unknown']

In [135]:
post2012subset.replace({'country_of_origin':{'Iran (Islamic Republic of)': 'Iran', 
                                             'Bolivia (Plurinational State of)':'Bolivia',
                                            'Various / unknown': 'Unknown', 
                                            'Venezuela (Bolivarian Republic of)': 'Venezuela',
                                            'Czechia':'Czech Republic',
                                            "Democratic People's Republic of Korea" : "North Korea", 
                                            "Lao People's Democratic Republic": 'Laos',
                                            'Republic of North Macedonia': 'North Macedonia', 
                                            'Republic of Moldova': 'Moldova', 
                                            'Micronesia (Federated States of)':'Federated States of Micronesia',
                                            'Saint Martin (French part)':'Saint Martin France', 
                                            'Sint Maarten (Dutch part)': 'Sint Maarten Dutch',
                                            'Viet Nam': 'Vietnam', 'Timor-Leste': 'East Timor'}}, inplace=True)

In [136]:
post2012subset.replace({'country_of_asylum':{'Iran (Islamic Republic of)': 'Iran', 
                                            'Bolivia (Plurinational State of)':'Bolivia',
                                            'Various / unknown': 'Unknown', 
                                            'Venezuela (Bolivarian Republic of)': 'Venezuela',
                                            'Czechia':'Czech Republic',
                                            "Democratic People's Republic of Korea" : "North Korea", 
                                            "Lao People's Democratic Republic": 'Laos',
                                            'Republic of North Macedonia': 'North Macedonia', 
                                            'Republic of Moldova': 'Moldova', 
                                            'Micronesia (Federated States of)':'Federated States of Micronesia',
                                            'Saint Martin (French part)':'Saint Martin France', 
                                            'Sint Maarten (Dutch part)': 'Sint Maarten Dutch',
                                            'Viet Nam': 'Vietnam', 'Timor-Leste': 'East Timor'}}, inplace=True)
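
The two cells above apply the same mapping to both country columns. A possible way to avoid repeating it is to define the dictionary once and loop over the columns. The sketch below uses an abbreviated mapping and a toy frame standing in for `post2012subset`; the variable names `country_names`, `toy` and `country_cols` are illustrative.

```python
import pandas as pd

# Abbreviated, illustrative subset of the mapping used in the two cells above.
country_names = {
    'Iran (Islamic Republic of)': 'Iran',
    'Viet Nam': 'Vietnam',
    'Various / unknown': 'Unknown',
}

# Toy frame standing in for post2012subset.
toy = pd.DataFrame({
    'country_of_origin': ['Viet Nam', 'Cuba'],
    'country_of_asylum': ['Iran (Islamic Republic of)', 'Various / unknown'],
})

# Apply the same mapping to each country column; unmapped values are left as-is.
country_cols = ['country_of_origin', 'country_of_asylum']
for col in country_cols:
    toy[col] = toy[col].replace(country_names)

print(toy)
```

Keeping a single dictionary means a new country name only has to be added in one place.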

In [138]:
post2012subset['population_type'].value_counts()

REF    43248
ASY    41673
OOC     3130
IDP     1761
RET      923
RDP      357
STA       68
VDA       23
Name: population_type, dtype: int64

In [144]:
#Converting the population types with fewer values (i.e. VDA and STA) to OOC population type.

post2012subset['population_type'] = post2012subset['population_type'].replace({'VDA': 'OOC', 'STA': 'OOC'})

In [146]:
post2012subset['population_type'].value_counts()

REF    43248
ASY    41673
OOC     3221
IDP     1761
RET      923
RDP      357
Name: population_type, dtype: int64

In [147]:
post2012subset['accommodation_type'].value_counts()

I    68263
U    10847
P     6697
S     2213
R     1795
C     1368
Name: accommodation_type, dtype: int64

In [148]:
#Removing the observations with accommodation type 'unknown'.

post2012subset = post2012subset[post2012subset['accommodation_type'] != 'U']

In [149]:
post2012subset['accommodation_type'].value_counts()

I    68263
P     6697
S     2213
R     1795
C     1368
Name: accommodation_type, dtype: int64

### Binarizing the target variable - Accommodation Type

I decided to binarize the target variable so that the classification becomes one-vs-rest: individual accommodation against all other types.

This was because, as identified above, there was a severe class imbalance amongst the five classes. After binarizing, class imbalance is still present but is less severe. This is important as class imbalance can strongly affect how a classification model performs.

The original five accommodation type classes are: 

- I - Individual accommodation
- S - Self-settled camp
- P - Planned/Managed camp
- C - Collective centre
- R - Reception/Transit camp

They will now be condensed to: 

- I - Individual accommodation
- Q - Not individual accommodation (Other)
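
To make the effect of binarizing concrete, the arithmetic can be worked through with the counts reported in the `value_counts()` output above (after the unknowns were removed). This is only a quick check of the class sizes, not a statement about model performance.

```python
# Counts copied from the value_counts() output above (unknowns already removed).
counts = {'I': 68263, 'P': 6697, 'S': 2213, 'R': 1795, 'C': 1368}
total = sum(counts.values())

# Collapse every class other than 'I' into the single class 'Q'.
binary = {'I': counts['I'], 'Q': total - counts['I']}

# Ratio of the largest class to the smallest class, before and after.
ratio_before = max(counts.values()) / min(counts.values())
ratio_after = max(binary.values()) / min(binary.values())

print(f"largest/smallest class before: {ratio_before:.1f}")
print(f"largest/smallest class after:  {ratio_after:.1f}")
print(f"share of 'I' in both cases:    {counts['I'] / total:.3f}")
```

The ratio between the largest and smallest class drops from roughly 50:1 to under 6:1. The share of 'I' itself stays at roughly 85% either way, so the majority class remains dominant.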


In [150]:
post2012subset['accommodation_type'] = post2012subset['accommodation_type'].map(lambda x: 'I' if x == 'I' else 'Q')

In [151]:
#Still imbalanced but it will have a reduced effect.

post2012subset['accommodation_type'].value_counts()

I    68263
Q    12073
Name: accommodation_type, dtype: int64

In [152]:
#Visualizing the final clean dataframe. 

post2012subset.head()

Unnamed: 0,country_of_origin,country_of_asylum,population_type,urban_or_rural_location,accommodation_type,female_aged_0-4 years,female_aged_5-11 years,female_aged_12-17 years,female_aged_18-59 years,female_aged_over_60_years,male_aged_0-4_years,male_aged_5-11_years,male_aged_12-17 years,male_aged_18-59_years,male_aged_over_60_years
0,Colombia,Aruba,ASY,Urban,I,0,0,0,0,0,0,0,0,5,0
1,Cuba,Aruba,ASY,Urban,I,0,0,0,0,0,0,0,0,0,0
2,Afghanistan,Afghanistan,IDP,Rural,I,0,14,16,37,0,0,15,17,38,5
3,Afghanistan,Afghanistan,IDP,Urban,Q,74,810,853,2004,140,75,840,921,2278,206
4,Afghanistan,Afghanistan,IDP,Urban,Q,12,130,138,321,21,12,135,147,365,33


In [153]:
post2012subset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80336 entries, 0 to 146957
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   country_of_origin          80336 non-null  object
 1   country_of_asylum          80336 non-null  object
 2   population_type            80336 non-null  object
 3   urban_or_rural_location    80336 non-null  object
 4   accommodation_type         80336 non-null  object
 5   female_aged_0-4 years      80336 non-null  int64 
 6   female_aged_5-11 years     80336 non-null  int64 
 7   female_aged_12-17 years    80336 non-null  int64 
 8   female_aged_18-59 years    80336 non-null  int64 
 9   female_aged_over_60_years  80336 non-null  int64 
 10  male_aged_0-4_years        80336 non-null  int64 
 11  male_aged_5-11_years       80336 non-null  int64 
 12  male_aged_12-17 years      80336 non-null  int64 
 13  male_aged_18-59_years      80336 non-null  int64 
 14  male_

In [154]:
#Shape of the final cleaned dataset

post2012subset.shape

(80336, 15)

### Converting dataframe to CSV file

In [155]:
post2012subset.to_csv('cleaned_unhcrdf_final.csv')
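
As written, `to_csv` also writes the dataframe index as an unlabelled first column, which pandas names `Unnamed: 0` when the file is read back. Since the index of the cleaned frame is no longer contiguous (80336 entries running from 0 to 146957), it may be preferable to leave it out. A minimal sketch on a toy frame, writing to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, like the filtered post2012subset.
toy = pd.DataFrame({'accommodation_type': ['I', 'Q']}, index=[0, 3])

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # omit the index column from the output
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```

If the original row positions are needed later, keeping the index (the current behaviour) is the safer choice.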