## Why did I choose to add these additional data sets into my data model?

In this notebook, we combine a number of additional data sets that contain indicators related to the tourism, economics, finances and demographics of countries. The purpose of using this data is to see whether a reviewer's nationality influences the type of reviews they write, and to compare indicators for the reviewer's home country with those for the country of the hotel they are reviewing. There are a number of interesting questions that can be answered using this data.

Do general indicators of a country's financial/economic well-being (GDP per capita, GNI per capita, Human Development Index, mobile phone subscriptions, internet coverage, urban population percentage, exchange rates) tell us anything about how positively or negatively reviewers from that country review hotels? Is there a similar pattern among reviewers from countries with similar economic levels (national income, for example)?

It would be interesting to see if reviewers from countries with a similar (relatively low) GDP per capita/GNI per capita tend to write more positive reviews of a hotel in a country with a higher GDP/GNI. Do a country's tourism expenditure and its number of tourist arrivals affect how people from that country view other tourists and touristic activity, and therefore have some effect on their reviews?

I also include political and human rights data, which can be used to answer some more interesting questions. Does the political system a reviewer lives in play any role in how open their reviews are? For example, are people from democracies more likely to write negative reviews, and people from authoritarian states less likely to do so? Or might people from authoritarian states be more open and expressive in their reviews when they are outside their country?

These are just a sample of the type of questions that I want to enable people to ask using this data set.

## So, what are these additional data sets?
The following table gives a short description of the additional data sets, along with their data sources and other details. Note that the Num_rows column gives the number of rows before and after performing the operations in this notebook.

| Data set        | Num_rows           |   Data Source | Comment | Description | Year |
| ------------- |:-------------:| -----:| ----------:| ----------------------:| ------:|
| [Country List ISO](https://datahub.io/core/country-list#resource-data)     | 249 | datahub.io| -| Contains a list of countries along with their 2-letter ISO code. |- |
| [Tourist-Visitors Arrival and Expenditure](http://data.un.org/)     | 2246 (whittled down to 220) | UNWTO | Found under 'Tourism and transport' after following the link | Data related to different countries' spending on tourism and the no. of inbound visitors/tourists |2018 |
| [Exchange rates](http://data.un.org/)     | 3408 (whittled down to 234) | IMF | Found under 'Finance' after following the link | Data related to exchange rates at the end of 2018 | 2018 |
| [GNI Per Capita](http://hdr.undp.org/en/data)     | 191 | UNDP | Found under dimension='Income/composition of resources' after following the link | Gives the Gross National Income per capita in dollars (2011 PPP) | 2018 |
| [GDP Per Capita](http://hdr.undp.org/en/data)     | 192 | UNDP | Found under dimension='Income/composition of resources' after following the link | Gives the Gross Domestic Product per capita in dollars (2011 PPP) | 2018 |
| [Internet Users As Percentage of Population](http://hdr.undp.org/en/data)     | 195 | UNDP | Found under dimension='Mobility and Communication' after following the link | Gives the percentage of the total population who are internet users | 2018 |
| [Mobile Phone Subscriptions](http://hdr.undp.org/en/data)     | 195 | UNDP | Found under dimension='Income/composition of resources' after following the link | Gives the mobile phone subscriptions per 100 people (>100: people have >1 mobile connection on average) | 2018 |
| [Net Migration Rate](http://hdr.undp.org/en/data)     | 191 | UNDP | Found under dimension='Income/composition of resources' after following the link | Gives the net migration rate (per 1000 people) | 2020 |
| [Population](http://hdr.undp.org/en/data)     | 195 | UNDP | Found under dimension='Demography' after following the link | Gives the total population (in millions) | 2018 |
| [Urban Population Percentage](http://hdr.undp.org/en/data)     | 195 | UNDP | Found under dimension='Human Development Index' after following the link | Gives the urban population as a percentage of the total population | 2018 |
| [Human Development Index (HDI)](http://hdr.undp.org/en/data)     | 189 | UNDP | Found under dimension='Income/composition of resources' after following the link | Gives the Human Development Index and the corresponding rank in 2018 | 2018 |
| [2020_Country_and_Territory_Ratings_and_Statuses_FIW2020](https://freedomhouse.org/report/freedom-world)     | 205 (whittled down to 195) | Freedom House | I have included only the latest data, not all the data from 1973-2020| Gives 2 indicators of freedom: Political Rights and Civil Liberties, both of which are scored on a 1-7 scale. A column called Status has values corresponding to 'Free', 'Not Free', 'Partly Free'. | 2020 |
| [2020_List_of_Electoral_Democracies_FIW_2020](https://freedomhouse.org/report/freedom-world)     | 195 | Freedom House | I have included only the latest data| Gives a list of countries and whether or not they are democracies: Yes or No | 2020 |
| [human-rights-score-vs-political-regime-type](https://ourworldindata.org/democracy)     | 35333 (whittled down to 196) | Our World in Data| -| Gives a list of countries along with their political regime type (score) and human rights protection score. The political regime score ranges from -10 (autocracy) to +10 (full democracy). The Human Rights Scores (the higher the better) were first developed by Schnakenberg and Fariss (2014) and subsequently updated by Fariss (2019). |2015 |

In [61]:
import pandas as pd
import os
import numpy as np
from pyspark.sql import SparkSession

In [65]:
input_path = 'Data/Original/'
output_path = 'Data/Cleaned/'

In [13]:
os.listdir(input_path)

['Internet users, total (% of population).csv',
 'human-rights-score-vs-political-regime-type.csv',
 '2020_Country_and_Territory_Ratings_and_Statuses_FIW2020.xlsx',
 'Net migration rate (per 1,000 people).csv',
 'Gross national income (GNI) per capita (2011 PPP$).csv',
 'SYB63_176_202003_Tourist-Visitors Arrival and Expenditure.csv',
 'Gross domestic product (GDP) per capita (2011 PPP $).csv',
 'Mobile phone subscriptions (per 100 people).csv',
 'places.original.json',
 '.ipynb_checkpoints',
 'Human Development Index (HDI).csv',
 '2020_List_of_Electoral_Democracies_FIW_2020.xlsx',
 'Hotel_Reviews.csv',
 'Population, total (millions).csv',
 'Population, urban (%).csv',
 'airport-codes_csv.csv',
 'countryiso.csv',
 'SYB62_130_201907_Exchange Rates.csv']

### Part 0 : [ISO Codes Country Mapping](https://datahub.io/core/country-list#resource-data)
First, load a list of countries and ISO codes. This will be the left table against which we join everything else.

In [14]:
countrycodes_df  = pd.read_csv(os.path.join(input_path, 'countryiso.csv'))
countrycodes_df = countrycodes_df.rename(columns={'Name': 'Country', 'Code': 'ISOCode'})

In [15]:
print(countrycodes_df.shape)
countrycodes_df.head()

(249, 2)


Unnamed: 0,Country,ISOCode
0,Afghanistan,AF
1,Åland Islands,AX
2,Albania,AL
3,Algeria,DZ
4,American Samoa,AS


### Part 1: [UN data](http://data.un.org)
- Tourists-visitors: from UNWTO, i.e., the World Tourism Organization
- Exchange rates: from the IMF


In [16]:
tourists_visitors_df  = pd.read_csv(os.path.join(input_path, 'SYB63_176_202003_Tourist-Visitors Arrival and Expenditure.csv'))
#consumer_price_index_df  = pd.read_csv(os.path.join(path1, 'SYB62_128_201907_Consumer Price Index.csv'))
exchange_rates_df = pd.read_csv(os.path.join(input_path, 'SYB62_130_201907_Exchange Rates.csv'))

In [17]:
print(tourists_visitors_df.shape)
tourists_visitors_df.head()

(2246, 9)


Unnamed: 0,T33,Region/Country/Area,Year,Series,Tourism arrivals series type,Tourism arrivals series type footnote,Value,Footnotes,Source
0,4,Afghanistan,2010,Tourism expenditure (millions of US dollars),,,147.0,,"World Tourism Organization (UNWTO), Madrid, th..."
1,4,Afghanistan,2016,Tourism expenditure (millions of US dollars),,,62.0,,"World Tourism Organization (UNWTO), Madrid, th..."
2,4,Afghanistan,2017,Tourism expenditure (millions of US dollars),,,16.0,,"World Tourism Organization (UNWTO), Madrid, th..."
3,4,Afghanistan,2018,Tourism expenditure (millions of US dollars),,,50.0,,"World Tourism Organization (UNWTO), Madrid, th..."
4,8,Albania,2010,Tourist/visitor arrivals (thousands),TF,,2191.0,Excluding nationals residing abroad.,"World Tourism Organization (UNWTO), Madrid, th..."


In [18]:
# Keep only the latest value. There are 2 separate fields in 'Series': Tourism expenditure and Tourist/visitor arrivals
# Year is the most recent year available. It's usually 2017/2018
tourists_visitors_df = tourists_visitors_df[~tourists_visitors_df.duplicated(subset=['Region/Country/Area', 'Series'], keep='last')]
tourists_visitors_df = tourists_visitors_df.drop(['T33', 'Year', 'Tourism arrivals series type', 'Tourism arrivals series type footnote', 'Footnotes', 'Source'], axis=1)
tourists_visitors_df = tourists_visitors_df.rename(columns={'Region/Country/Area': 'Country'})
tourists_visitors_df.head()

Unnamed: 0,Country,Series,Value
3,Afghanistan,Tourism expenditure (millions of US dollars),50.0
7,Albania,Tourist/visitor arrivals (thousands),5340.0
13,Albania,Tourism expenditure (millions of US dollars),2306.0
19,Algeria,Tourist/visitor arrivals (thousands),2657.0
23,Algeria,Tourism expenditure (millions of US dollars),172.0
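
The `keep='last'` filter above picks the most recent year only if the rows for each country and series are already in ascending year order. A minimal sketch, on made-up toy rows rather than the real UN file, of making that assumption explicit by sorting on `Year` first:

```python
import pandas as pd

# Toy rows (made-up values) in the UN data layout, deliberately out of year order
toy = pd.DataFrame({
    'Region/Country/Area': ['Albania', 'Albania', 'Albania', 'Algeria'],
    'Year': [2018, 2010, 2017, 2016],
    'Series': ['Arrivals'] * 4,
    'Value': [5340.0, 2191.0, 4643.0, 2039.0],
})

# Sort by year so that "last" really means "most recent",
# then keep one row per (country, series) pair
latest = (toy.sort_values('Year')
             .drop_duplicates(subset=['Region/Country/Area', 'Series'], keep='last')
             .sort_values('Region/Country/Area')
             .reset_index(drop=True))
print(latest)
```

If the source file is already sorted, this gives the same result as the cell above; the extra sort only guards against a differently ordered download.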


In [19]:
# Pivot so that the values in 'Series' become columns
tourists_pivot = tourists_visitors_df.pivot(index='Country', columns='Series', values='Value')
tourists_pivot.head()

Series,Tourism expenditure (millions of US dollars),Tourist/visitor arrivals (thousands)
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,50.0,
Albania,2306.0,5340.0
Algeria,172.0,2657.0
American Samoa,22.0,20.2
Andorra,,3042.0
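
To illustrate what the pivot does, and why the de-duplication step has to come first, here is a small sketch on toy data (the shortened series names are placeholders, not the real labels). `DataFrame.pivot` needs each (index, column) pair to be unique and raises a `ValueError` otherwise:

```python
import pandas as pd

# Long format: one row per (country, series) pair
long_df = pd.DataFrame({
    'Country': ['Albania', 'Albania', 'Andorra'],
    'Series': ['Expenditure', 'Arrivals', 'Arrivals'],
    'Value': [2306.0, 5340.0, 3042.0],
})

# Wide format: one row per country, one column per series; missing pairs become NaN
wide_df = long_df.pivot(index='Country', columns='Series', values='Value')
print(wide_df)

# With a duplicated (Country, Series) pair, pivot cannot reshape the data
dup_df = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)
try:
    dup_df.pivot(index='Country', columns='Series', values='Value')
    pivot_failed = False
except ValueError:
    pivot_failed = True
print("pivot failed on duplicates:", pivot_failed)
```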


In [20]:
tourists_visitors_df = tourists_pivot.reset_index()
#tourists_visitors_df = tourists_visitors_df.set_index('Country')
tourists_visitors_df.head()

Series,Country,Tourism expenditure (millions of US dollars),Tourist/visitor arrivals (thousands)
0,Afghanistan,50.0,
1,Albania,2306.0,5340.0
2,Algeria,172.0,2657.0
3,American Samoa,22.0,20.2
4,Andorra,,3042.0


In [21]:
tourists_visitors_df.shape

(220, 3)

In [22]:
print(exchange_rates_df.shape) # Latest year =2018
exchange_rates_df.head()

(3408, 9)


Unnamed: 0,T16,Region/Country/Area,Year,Series,National currency,National currency footnote,Value,Footnotes,Source
0,4,Afghanistan,1985,Exchange rates: end of period (national curren...,Afghani (AFN),,42.8228,,"International Monetary Fund (IMF), Washington,..."
1,4,Afghanistan,1995,Exchange rates: end of period (national curren...,Afghani (AFN),,47.5,,"International Monetary Fund (IMF), Washington,..."
2,4,Afghanistan,2005,Exchange rates: end of period (national curren...,Afghani (AFN),,50.41,,"International Monetary Fund (IMF), Washington,..."
3,4,Afghanistan,2010,Exchange rates: end of period (national curren...,Afghani (AFN),,45.27,,"International Monetary Fund (IMF), Washington,..."
4,4,Afghanistan,2015,Exchange rates: end of period (national curren...,Afghani (AFN),,68.05,,"International Monetary Fund (IMF), Washington,..."


In [23]:
# Keep only the latest value. Series has the exchange rate for end of period and period average (in USD)
# Year is the most recent year available. It's 2018
exchange_rates_df = exchange_rates_df[~exchange_rates_df.duplicated(subset=['Region/Country/Area', 'Series'], keep='last')]
exchange_rates_df = exchange_rates_df.drop(['T16', 'Year', 'National currency footnote', 'Footnotes', 'Source'], axis=1)
exchange_rates_df = exchange_rates_df.rename(columns={'Region/Country/Area': 'Country'})
exchange_rates_df.head()

Unnamed: 0,Country,Series,National currency,Value
7,Afghanistan,Exchange rates: end of period (national curren...,Afghani (AFN),74.9556
15,Afghanistan,Exchange rates: period average (national curre...,Afghani (AFN),72.0832
21,Åland Islands,Exchange rates: end of period (national curren...,Euro (EUR),0.8734
27,Åland Islands,Exchange rates: period average (national curre...,Euro (EUR),0.8468
34,Albania,Exchange rates: end of period (national curren...,Lek (ALL),107.82


In [24]:
# Now keep only the first value in series for each country: the 2nd value is an average
exchange_rates_df = exchange_rates_df[~exchange_rates_df.duplicated(subset=['Country'], keep='first')]
exchange_rates_df = exchange_rates_df.rename(columns={'Value': 'ExchangeRateEndOfPeriod'})
exchange_rates_df = exchange_rates_df.drop(['Series'], axis=1)
print(exchange_rates_df.shape)
exchange_rates_df.head()

(234, 3)


Unnamed: 0,Country,National currency,ExchangeRateEndOfPeriod
7,Afghanistan,Afghani (AFN),74.9556
21,Åland Islands,Euro (EUR),0.8734
34,Albania,Lek (ALL),107.82
49,Algeria,Algerian Dinar (DZD),118.2906
63,Andorra,Euro (EUR),0.8734
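
Keeping the first row per country works because the end-of-period series happens to come first for each country. A hedged alternative is to select the series by its label instead of by position. The sketch below uses toy rows with made-up values and only the label prefix, since the full `Series` text is truncated in the output above:

```python
import pandas as pd

# Toy rows: the series order differs between the two countries on purpose
toy = pd.DataFrame({
    'Country': ['Albania', 'Albania', 'Zambia', 'Zambia'],
    'Series': [
        'Exchange rates: period average',
        'Exchange rates: end of period',
        'Exchange rates: end of period',
        'Exchange rates: period average',
    ],
    'Value': [107.99, 107.82, 11.92, 10.46],
})

# Select by label prefix rather than relying on row order
is_end_of_period = toy['Series'].str.startswith('Exchange rates: end of period')
end_of_period = (toy[is_end_of_period]
                 .drop(columns='Series')
                 .rename(columns={'Value': 'ExchangeRateEndOfPeriod'})
                 .reset_index(drop=True))
print(end_of_period)
```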


In [25]:
exchange_rates_df.tail()

Unnamed: 0,Country,National currency,ExchangeRateEndOfPeriod
3349,Wallis and Futuna Islands,CFP Franc (XPF),104.2198
3365,Western Sahara,Moroccan Dirham (MAD),9.5655
3381,Zambia,Zambian Kwacha (ZMW),11.9238
3392,Zimbabwe,Zimbabwe Dollar (ZWL),80.7744
3401,Euro Area,Euro (EUR),0.8734


Merge countrycodes first with tourists_visitors_df and then with exchange_rates_df

In [26]:
merged_df = countrycodes_df.merge(tourists_visitors_df, on='Country', how='left')
print(merged_df.shape)
merged_df.head()

(249, 4)


Unnamed: 0,Country,ISOCode,Tourism expenditure (millions of US dollars),Tourist/visitor arrivals (thousands)
0,Afghanistan,AF,50.0,
1,Åland Islands,AX,,
2,Albania,AL,2306.0,5340.0
3,Algeria,DZ,172.0,2657.0
4,American Samoa,AS,22.0,20.2
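
A left join on `Country` silently leaves NaNs wherever the two sources spell a country name differently. The following sketch shows one way to list such mismatches using the `indicator` parameter of `merge`. The two spellings in the toy frames are illustrative only and are not taken from the files used here:

```python
import pandas as pd

# Toy frames with one matching and one differently spelled country name
codes = pd.DataFrame({
    'Country': ['Albania', 'Bolivia, Plurinational State of'],
    'ISOCode': ['AL', 'BO'],
})
indicators = pd.DataFrame({
    'Country': ['Albania', 'Bolivia (Plurin. State of)'],
    'Value': [1.0, 2.0],
})

# An outer join with indicator=True adds a '_merge' column saying
# whether each row came from the left frame, the right frame, or both
check = codes.merge(indicators, on='Country', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Country', '_merge']])
```

Rows marked `left_only` or `right_only` are candidates for a manual name mapping before the real join.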


In [27]:
merged_df = merged_df.merge(exchange_rates_df, on='Country', how='left')
#merged_df = merged_df[~merged_df.duplicated()]
print(merged_df.shape)
merged_df.head()

(249, 6)


Unnamed: 0,Country,ISOCode,Tourism expenditure (millions of US dollars),Tourist/visitor arrivals (thousands),National currency,ExchangeRateEndOfPeriod
0,Afghanistan,AF,50.0,,Afghani (AFN),74.9556
1,Åland Islands,AX,,,Euro (EUR),0.8734
2,Albania,AL,2306.0,5340.0,Lek (ALL),107.82
3,Algeria,DZ,172.0,2657.0,Algerian Dinar (DZD),118.2906
4,American Samoa,AS,22.0,20.2,,


In [28]:
merged_df = merged_df.rename(columns={'Tourism expenditure (millions of US dollars)': "TourismExpenditureMillions",
                                     'Tourist/visitor arrivals (thousands)': 'TouristArrivalsThousands',
                                     'National currency': 'Currency'})
merged_df.head()

Unnamed: 0,Country,ISOCode,TourismExpenditureMillions,TouristArrivalsThousands,Currency,ExchangeRateEndOfPeriod
0,Afghanistan,AF,50.0,,Afghani (AFN),74.9556
1,Åland Islands,AX,,,Euro (EUR),0.8734
2,Albania,AL,2306.0,5340.0,Lek (ALL),107.82
3,Algeria,DZ,172.0,2657.0,Algerian Dinar (DZD),118.2906
4,American Samoa,AS,22.0,20.2,,


### Part 2: [UNDP Data](http://hdr.undp.org/en/data)
Now, it's time to get data from UNDP. NOTE: As with the previous 2 data sets, all the data is from 2018 where available; the exception is the net migration rate, which uses the 2020 column.

In [29]:
gni_percapita_df = pd.read_csv(os.path.join(input_path, 'Gross national income (GNI) per capita (2011 PPP$).csv'))
gdp_percapita_df = pd.read_csv(os.path.join(input_path, 'Gross domestic product (GDP) per capita (2011 PPP $).csv'))
mobile_phone_subscriptions_df = pd.read_csv(os.path.join(input_path, 'Mobile phone subscriptions (per 100 people).csv'))
net_migration_rate_df = pd.read_csv(os.path.join(input_path, 'Net migration rate (per 1,000 people).csv'))
population_millions_df = pd.read_csv(os.path.join(input_path, 'Population, total (millions).csv'))
urban_population_percent_df = pd.read_csv(os.path.join(input_path, 'Population, urban (%).csv'))
hdi_df = pd.read_csv(os.path.join(input_path, 'Human Development Index (HDI).csv'))
intenet_users_percent_df = pd.read_csv(os.path.join(input_path, 'Internet users, total (% of population).csv'))


In the data sets from UNDP, NULLS are represented by '..'
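
As a possible shortcut, `pd.read_csv` can treat `'..'` as missing at load time through its `na_values` parameter, so the column arrives as a numeric dtype straight away. A minimal sketch on an in-memory toy CSV (the layout is assumed to resemble the UNDP files, with one column per year):

```python
import io
import pandas as pd

# Toy CSV text standing in for a UNDP file
csv_text = "Country,2018\nAlgeria,59.6\nAndorra,..\n"

# na_values tells the parser to read '..' as NaN
df = pd.read_csv(io.StringIO(csv_text), na_values='..')
print(df.dtypes)

# Count missing rows with isna().sum(), which counts rows rather than cells
n_missing = int(df['2018'].isna().sum())
print("Missing values:", n_missing)
```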

In [30]:
gni_percapita_df = gni_percapita_df[['Country', '2018']]
print(gni_percapita_df.shape)
gni_percapita_df = gni_percapita_df.rename(columns={'2018': 'GNIPerCapita'})
# Count rows, not cells: DataFrame.size would return rows x columns
print("No. of countries with NULL GNI: ", gni_percapita_df['GNIPerCapita'].isnull().sum())
# '..' can only be present if the column was read as strings (object dtype)
if gni_percapita_df.GNIPerCapita.dtypes == 'object':
    print("No. of countries with '..' GNI: ", (gni_percapita_df['GNIPerCapita'] == '..').sum())
    # Replace '..' with NaN, then convert to a numeric dtype
    gni_percapita_df.GNIPerCapita = gni_percapita_df.GNIPerCapita.replace('..', np.nan)
    gni_percapita_df.GNIPerCapita = pd.to_numeric(gni_percapita_df.GNIPerCapita)
    print("After replacing .. with NULL, no. of countries with NULL GNI: ", gni_percapita_df['GNIPerCapita'].isnull().sum())

gni_percapita_df.head()

(191, 2)
No. of countries with NULL GNI:  0


Unnamed: 0,Country,GNIPerCapita
0,Afghanistan,1746
1,Albania,12300
2,Algeria,13639
3,Andorra,48641
4,Angola,5555


In [31]:
gdp_percapita_df = gdp_percapita_df[['Country', '2018']]
print(gdp_percapita_df.shape)
gdp_percapita_df = gdp_percapita_df.rename(columns={'2018': 'GDPPerCapita'})
# Count rows, not cells: DataFrame.size would return rows x columns
print("No. of countries with NULL GDP: ", gdp_percapita_df['GDPPerCapita'].isnull().sum())
# '..' can only be present if the column was read as strings (object dtype)
if gdp_percapita_df.GDPPerCapita.dtypes == 'object':
    print("No. of countries with '..' GDP: ", (gdp_percapita_df['GDPPerCapita'] == '..').sum())
    # Replace '..' with NaN, then convert to float64
    gdp_percapita_df.GDPPerCapita = gdp_percapita_df.GDPPerCapita.replace('..', np.nan)
    gdp_percapita_df.GDPPerCapita = pd.to_numeric(gdp_percapita_df.GDPPerCapita)
    print("After replacing .. with NULL, no. of countries with NULL GDP: ", gdp_percapita_df['GDPPerCapita'].isnull().sum())
gdp_percapita_df.head()

(192, 2)
No. of countries with NULL GDP:  0
No. of countries with '..' GDP:  6
After replacing .. with NULL, no. of countries with NULL GDP:  6


Unnamed: 0,Country,GDPPerCapita
0,Afghanistan,1735.0
1,Albania,12306.0
2,Algeria,13886.0
3,Angola,5725.0
4,Antigua and Barbuda,23768.0


In [27]:
# https://stackoverflow.com/questions/54426845/how-to-check-if-a-pandas-dataframe-contains-only-numeric-column-wise/54427157
pd.to_numeric(gdp_percapita_df['GDPPerCapita'], errors='coerce').notnull().all()

False

In [32]:
intenet_users_percent_df = intenet_users_percent_df[['Country', '2018']]
print(intenet_users_percent_df.shape)
intenet_users_percent_df = intenet_users_percent_df.rename(columns={'2018': 'InternetUsersPercent'})
# Count rows, not cells: DataFrame.size would return rows x columns
print("No. of countries with NULL internet users percentage: ", intenet_users_percent_df['InternetUsersPercent'].isnull().sum())
# '..' can only be present if the column was read as strings (object dtype)
if intenet_users_percent_df.InternetUsersPercent.dtypes == 'object':
    print("No. of countries with '..' internet users percentage: ", (intenet_users_percent_df['InternetUsersPercent'] == '..').sum())
    # Replace '..' with NaN, then convert to float64
    intenet_users_percent_df.InternetUsersPercent = intenet_users_percent_df.InternetUsersPercent.replace('..', np.nan)
    intenet_users_percent_df.InternetUsersPercent = pd.to_numeric(intenet_users_percent_df.InternetUsersPercent)
    print("After replacing .. with NULL, no. of countries with NULL internet users percentage: ", intenet_users_percent_df['InternetUsersPercent'].isnull().sum())
intenet_users_percent_df.head()

(195, 2)
No. of countries with NULL internet users percentage:  0
No. of countries with '..' internet users percentage:  124
After replacing .. with NULL, no. of countries with NULL internet users percentage:  124


Unnamed: 0,Country,InternetUsersPercent
0,Afghanistan,
1,Albania,
2,Algeria,59.6
3,Andorra,
4,Angola,


In [33]:
mobile_phone_subscriptions_df = mobile_phone_subscriptions_df[['Country', '2018']]
print(mobile_phone_subscriptions_df.shape)
mobile_phone_subscriptions_df = mobile_phone_subscriptions_df.rename(columns={'2018': 'MobilePhoneSubscriptions'})
# Count rows, not cells: DataFrame.size would return rows x columns
print("No. of countries with NULL mobile subscriptions: ", mobile_phone_subscriptions_df['MobilePhoneSubscriptions'].isnull().sum())
# '..' can only be present if the column was read as strings (object dtype)
if mobile_phone_subscriptions_df.MobilePhoneSubscriptions.dtypes == 'object':
    print("No. of countries with '..' Mobile subscriptions: ", (mobile_phone_subscriptions_df['MobilePhoneSubscriptions'] == '..').sum())
    # Replace '..' with NaN, then convert to float64
    mobile_phone_subscriptions_df.MobilePhoneSubscriptions = mobile_phone_subscriptions_df.MobilePhoneSubscriptions.replace('..', np.nan)
    mobile_phone_subscriptions_df.MobilePhoneSubscriptions = pd.to_numeric(mobile_phone_subscriptions_df.MobilePhoneSubscriptions)
    print("After replacing .. with NULL, no. of countries with NULL Mobile subscriptions: ", mobile_phone_subscriptions_df['MobilePhoneSubscriptions'].isnull().sum())
mobile_phone_subscriptions_df.head()

(195, 2)
No. of countries with NULL mobile subscriptions:  0
No. of countries with '..' Mobile subscriptions:  35
After replacing .. with NULL, no. of countries with NULL Mobile subscriptions:  35


Unnamed: 0,Country,MobilePhoneSubscriptions
0,Afghanistan,59.1
1,Albania,94.2
2,Algeria,121.9
3,Andorra,107.3
4,Angola,43.1


In [34]:
net_migration_rate_df = net_migration_rate_df[['Country', '2020']]
print(net_migration_rate_df.shape)
net_migration_rate_df = net_migration_rate_df.rename(columns={'2020': 'NetMigrationRate'})
# Count rows, not cells: DataFrame.size would return rows x columns
print("No. of countries with NULL net migration rate: ", net_migration_rate_df['NetMigrationRate'].isnull().sum())
# '..' can only be present if the column was read as strings (object dtype)
if net_migration_rate_df.NetMigrationRate.dtypes == 'object':
    print("No. of countries with '..' net migration rate: ", (net_migration_rate_df['NetMigrationRate'] == '..').sum())
    # Replace '..' with NaN, then convert to float64
    net_migration_rate_df.NetMigrationRate = net_migration_rate_df.NetMigrationRate.replace('..', np.nan)
    net_migration_rate_df.NetMigrationRate = pd.to_numeric(net_migration_rate_df.NetMigrationRate)
    print("After replacing .. with NULL, no. of countries with NULL Migration Rate: ", net_migration_rate_df['NetMigrationRate'].isnull().sum())
net_migration_rate_df.head()

(191, 2)
No. of countries with NULL net migration rate:  0
No. of countries with '..' net migration rate:  6
After replacing .. with NULL, no. of countries with NULL Migration Rate:  6


Unnamed: 0,Country,NetMigrationRate
0,Afghanistan,-1.7
1,Albania,-4.9
2,Algeria,-0.2
3,Angola,0.2
4,Antigua and Barbuda,0.0


In [35]:
population_millions_df = population_millions_df[['Country', '2018']]
print(population_millions_df.shape)
population_millions_df = population_millions_df.rename(columns={'2018': 'Population'})
# Count rows, not cells: DataFrame.size would return rows x columns
print("No. of countries with NULL population: ", population_millions_df['Population'].isnull().sum())
# '..' can only be present if the column was read as strings (object dtype)
if population_millions_df.Population.dtypes == 'object':
    print("No. of countries with '..' population: ", (population_millions_df['Population'] == '..').sum())
    # Replace '..' with NaN, then convert to float64
    population_millions_df.Population = population_millions_df.Population.replace('..', np.nan)
    population_millions_df.Population = pd.to_numeric(population_millions_df.Population)
    print("After replacing .. with NULL, no. of countries with NULL population: ", population_millions_df['Population'].isnull().sum())
population_millions_df.head()

(195, 2)
No. of countries with NULL population:  0


Unnamed: 0,Country,Population
0,Afghanistan,37.2
1,Albania,2.9
2,Algeria,42.2
3,Andorra,0.1
4,Angola,30.8


In [36]:
urban_population_percent_df = urban_population_percent_df[['Country', '2018']]
print(urban_population_percent_df.shape)
urban_population_percent_df = urban_population_percent_df.rename(columns={'2018': 'UrbanPopulationPercent'})
# Count rows, not cells: DataFrame.size would return rows x columns
print("No. of countries with NULL Urban Population Percentage: ", urban_population_percent_df['UrbanPopulationPercent'].isnull().sum())
# '..' can only be present if the column was read as strings (object dtype)
if urban_population_percent_df.UrbanPopulationPercent.dtypes == 'object':
    print("No. of countries with '..' Urban Population Percentage: ", (urban_population_percent_df['UrbanPopulationPercent'] == '..').sum())
    # Replace '..' with NaN, then convert to float64
    urban_population_percent_df.UrbanPopulationPercent = urban_population_percent_df.UrbanPopulationPercent.replace('..', np.nan)
    urban_population_percent_df.UrbanPopulationPercent = pd.to_numeric(urban_population_percent_df.UrbanPopulationPercent)
    print("After replacing .. with NULL, no. of countries with NULL Urban Population Percentage: ", urban_population_percent_df['UrbanPopulationPercent'].isnull().sum())
urban_population_percent_df.head()

(195, 2)
No. of countries with NULL Urban Population Percentage:  0


Unnamed: 0,Country,UrbanPopulationPercent
0,Afghanistan,25.5
1,Albania,60.3
2,Algeria,72.6
3,Andorra,88.1
4,Angola,65.5


In [37]:
hdi_df = hdi_df[['Country', 'HDI Rank (2018)','2018']]
print(hdi_df.shape)
hdi_df = hdi_df.rename(columns={'2018': 'HDI', 'HDI Rank (2018)': 'HDIRank'})
# Count rows, not cells: DataFrame.size would return rows x columns
print("No. of countries with NULL HDI: ", hdi_df['HDI'].isnull().sum())
# '..' can only be present if the column was read as strings (object dtype)
if hdi_df.HDI.dtypes == 'object':
    print("No. of countries with '..' HDI: ", (hdi_df['HDI'] == '..').sum())
    # Replace '..' with NaN, then convert to float64
    hdi_df.HDI = hdi_df.HDI.replace('..', np.nan)
    hdi_df.HDI = pd.to_numeric(hdi_df.HDI)
    print("After replacing .. with NULL, no. of countries with NULL HDI: ", hdi_df['HDI'].isnull().sum())
hdi_df.head()

(189, 3)
No. of countries with NULL HDI:  0


Unnamed: 0,Country,HDIRank,HDI
0,Afghanistan,170,0.496
1,Albania,69,0.791
2,Algeria,82,0.759
3,Andorra,36,0.857
4,Angola,149,0.574
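
The UNDP cells above all repeat the same steps: select a year column, rename it, replace `'..'` and convert to numeric. One way to factor this out is a small helper. `extract_indicator` is a hypothetical name introduced for this sketch and is not part of the notebook. The toy frame uses made-up values apart from Algeria's 2018 figure:

```python
import numpy as np
import pandas as pd

def extract_indicator(df, year_col, new_name):
    """Hypothetical helper: keep Country plus one year column, renamed and numeric."""
    out = df[['Country', year_col]].rename(columns={year_col: new_name})
    # '..' marks a missing value in the UNDP files; replace it explicitly so that
    # any other unexpected string still raises an error in pd.to_numeric
    out[new_name] = pd.to_numeric(out[new_name].replace('..', np.nan))
    return out

# Toy frame in the assumed UNDP layout (one column per year, values read as strings)
raw = pd.DataFrame({
    'Country': ['Algeria', 'Andorra'],
    '2017': ['47.7', '91.6'],
    '2018': ['59.6', '..'],
})

internet = extract_indicator(raw, '2018', 'InternetUsersPercent')
n_missing = int(internet['InternetUsersPercent'].isna().sum())
print(internet)
print("Missing:", n_missing)
```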


In [38]:
gdp_percapita_df.dtypes

Country          object
GDPPerCapita    float64
dtype: object

Join UNDP data to merged DF

In [39]:
merged_df = merged_df.merge(gni_percapita_df, on='Country', how='left')
merged_df = merged_df.merge(gdp_percapita_df, on='Country', how='left')
merged_df = merged_df.merge(mobile_phone_subscriptions_df, on='Country', how='left')
merged_df = merged_df.merge(net_migration_rate_df, on='Country', how='left')
merged_df = merged_df.merge(population_millions_df, on='Country', how='left')
merged_df = merged_df.merge(urban_population_percent_df, on='Country', how='left')
merged_df = merged_df.merge(hdi_df, on='Country', how='left')
merged_df = merged_df.merge(intenet_users_percent_df, on='Country', how='left')
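
The chain of left joins above could also be written as a fold over a list of frames. The sketch below uses `functools.reduce` on toy frames and adds `validate='one_to_one'`, which makes `merge` raise an error if a country name is duplicated on either side. That check is an addition for illustration, not something the notebook does:

```python
from functools import reduce
import pandas as pd

# Toy stand-ins for the base table and two indicator tables
base = pd.DataFrame({'Country': ['Albania', 'Algeria', 'Åland Islands']})
gni = pd.DataFrame({'Country': ['Albania', 'Algeria'], 'GNIPerCapita': [12300, 13639]})
hdi = pd.DataFrame({'Country': ['Albania', 'Algeria'], 'HDI': [0.791, 0.759]})

# Left-join every indicator table onto the base table, one after another
merged = reduce(
    lambda left, right: left.merge(right, on='Country', how='left', validate='one_to_one'),
    [gni, hdi],
    base,
)
print(merged)
```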

In [40]:
merged_df.head()

Unnamed: 0,Country,ISOCode,TourismExpenditureMillions,TouristArrivalsThousands,Currency,ExchangeRateEndOfPeriod,GNIPerCapita,GDPPerCapita,MobilePhoneSubscriptions,NetMigrationRate,Population,UrbanPopulationPercent,HDIRank,HDI,InternetUsersPercent
0,Afghanistan,AF,50.0,,Afghani (AFN),74.9556,1746.0,1735.0,59.1,-1.7,37.2,25.5,170.0,0.496,
1,Åland Islands,AX,,,Euro (EUR),0.8734,,,,,,,,,
2,Albania,AL,2306.0,5340.0,Lek (ALL),107.82,12300.0,12306.0,94.2,-4.9,2.9,60.3,69.0,0.791,
3,Algeria,DZ,172.0,2657.0,Algerian Dinar (DZD),118.2906,13639.0,13886.0,121.9,-0.2,42.2,72.6,82.0,0.759,59.6
4,American Samoa,AS,22.0,20.2,,,,,,,,,,,


In [41]:
print(merged_df.shape)

(249, 15)


### Part 3: [Freedom House Political Data](https://freedomhouse.org/report/freedom-world)
Now, it's time to read in the Freedom House political data.

In [43]:
#human-rights-score-vs-political-regime-type
electoral_democracies_df = pd.read_excel(os.path.join(input_path, '2020_List_of_Electoral_Democracies_FIW_2020.xlsx'), skiprows=1)
country_freedom_ratings_df = pd.read_excel(os.path.join(input_path, '2020_Country_and_Territory_Ratings_and_Statuses_FIW2020.xlsx'),
                                                       sheet_name=1)

In [44]:
# In these ratings, a value of '-' in a column indicates that the country doesn't exist any longer (e.g., USSR)
country_freedom_ratings_df = country_freedom_ratings_df.rename(columns={'PR': 'PoliticalRightsFreedomScore', 
                                                                       'CL': 'CivilLibertiesFreedomScore',
                                                                       'Status': 'FreedomStatus'})
# Both Freedom scores are measured on a 1-7 scale.
# Expand NF, PF and F in freedom status; any value not in the mapping becomes NaN
status_expansion = {'NF': 'Not Free', 'F': 'Free', 'PF': 'Partly Free'}
country_freedom_ratings_df.FreedomStatus = country_freedom_ratings_df.FreedomStatus.map(status_expansion)
print(country_freedom_ratings_df.shape)
country_freedom_ratings_df = country_freedom_ratings_df.replace('..', np.nan)
# We can safely drop all rows with NaNs now
country_freedom_ratings_df = country_freedom_ratings_df.dropna()
# Convert score columns to int
cols_to_cast = ['PoliticalRightsFreedomScore', 'CivilLibertiesFreedomScore']
country_freedom_ratings_df = country_freedom_ratings_df.astype({col: 'int32' for col in cols_to_cast})
print(country_freedom_ratings_df.shape)
country_freedom_ratings_df.head()

(205, 4)
(195, 4)


Unnamed: 0,Country,PoliticalRightsFreedomScore,CivilLibertiesFreedomScore,FreedomStatus
0,Afghanistan,5,6,Not Free
1,Albania,3,3,Partly Free
2,Algeria,6,5,Not Free
3,Andorra,1,1,Free
4,Angola,6,5,Not Free


In [45]:
print(electoral_democracies_df.shape)
electoral_democracies_df = electoral_democracies_df.rename(columns={'Electoral Democracy Designation in FIW 2020': 'DemocracyOrNot'})
yes_no_to_boolean = {'Yes': True, 'No': False}
electoral_democracies_df.DemocracyOrNot = electoral_democracies_df.DemocracyOrNot.map(yes_no_to_boolean)
electoral_democracies_df.head()

(195, 2)


Unnamed: 0,Country,DemocracyOrNot
0,Afghanistan,False
1,Albania,True
2,Algeria,False
3,Andorra,True
4,Angola,False


Merge Freedom house data with merged_df.

In [46]:
merged_df = merged_df.merge(country_freedom_ratings_df, on='Country', how='left')
merged_df = merged_df.merge(electoral_democracies_df, on='Country', how='left')
merged_df.head()

Unnamed: 0,Country,ISOCode,TourismExpenditureMillions,TouristArrivalsThousands,Currency,ExchangeRateEndOfPeriod,GNIPerCapita,GDPPerCapita,MobilePhoneSubscriptions,NetMigrationRate,Population,UrbanPopulationPercent,HDIRank,HDI,InternetUsersPercent,PoliticalRightsFreedomScore,CivilLibertiesFreedomScore,FreedomStatus,DemocracyOrNot
0,Afghanistan,AF,50.0,,Afghani (AFN),74.9556,1746.0,1735.0,59.1,-1.7,37.2,25.5,170.0,0.496,,5.0,6.0,Not Free,False
1,Åland Islands,AX,,,Euro (EUR),0.8734,,,,,,,,,,,,,
2,Albania,AL,2306.0,5340.0,Lek (ALL),107.82,12300.0,12306.0,94.2,-4.9,2.9,60.3,69.0,0.791,,3.0,3.0,Partly Free,True
3,Algeria,DZ,172.0,2657.0,Algerian Dinar (DZD),118.2906,13639.0,13886.0,121.9,-0.2,42.2,72.6,82.0,0.759,59.6,6.0,5.0,Not Free,False
4,American Samoa,AS,22.0,20.2,,,,,,,,,,,,,,,


### Part 4: [OurWorldInData Human Rights vs Political Regime Type data](https://ourworldindata.org/democracy) 
Now, let's read in the final data set from Our World in Data. This contains human rights scores and a political regime type score.

In [51]:
polregime_humanrights_df = pd.read_csv(os.path.join(input_path, 'human-rights-score-vs-political-regime-type.csv'))
polregime_humanrights_df.head()

Unnamed: 0,Entity,Code,Year,Political regime type (Score),Human rights protection score,Total population (Gapminder)
0,Afghanistan,AFG,1800,,,3280000.0
1,Afghanistan,AFG,1816,-6.0,,
2,Afghanistan,AFG,1817,-6.0,,
3,Afghanistan,AFG,1818,-6.0,,
4,Afghanistan,AFG,1819,-6.0,,


Let's restrict the data to the most recent year which has values for both scores: 2015.

In [52]:
polregime_humanrights_df = polregime_humanrights_df[polregime_humanrights_df.Year==2015]
polregime_humanrights_df.head()

Unnamed: 0,Entity,Code,Year,Political regime type (Score),Human rights protection score,Total population (Gapminder)
200,Afghanistan,AFG,2015,-1.0,-2.20941,
404,Albania,ALB,2015,9.0,0.770309,
608,Algeria,DZA,2015,2.0,0.221587,
750,Andorra,AND,2015,,4.226664,
953,Angola,AGO,2015,-2.0,-0.537383,
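
Rather than hard-coding 2015, the year could be derived from the data. The sketch below, on toy rows with made-up scores, finds the most recent year in which both scores are present. It also shows a per-country fallback that keeps each country's latest complete year, which would retain countries that lack one of the scores in 2015:

```python
import pandas as pd

# Toy rows: entity B has no regime score in 2015, entity A has no human rights score in 2016
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'A', 'B', 'B'],
    'Year': [2014, 2015, 2016, 2014, 2015],
    'Political regime type (Score)': [1.0, 2.0, 3.0, -2.0, None],
    'Human rights protection score': [0.1, 0.2, None, -0.5, -0.4],
})
score_cols = ['Political regime type (Score)', 'Human rights protection score']

# Rows where both scores are present
complete = toy.dropna(subset=score_cols)

# Option 1: a single year for everyone, the latest with any complete row
latest_common_year = int(complete['Year'].max())

# Option 2: the latest complete year per country
per_country = (complete.sort_values('Year')
                       .drop_duplicates(subset='Entity', keep='last')
                       .reset_index(drop=True))
print(latest_common_year)
print(per_country[['Entity', 'Year']])
```

Option 2 mixes years across countries, so whether it is appropriate depends on how comparable the scores are over time.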


Rename the columns and drop the total population column, since we already have population data in our merged_df data frame.

In [53]:
polregime_humanrights_df = polregime_humanrights_df.rename(columns={'Political regime type (Score)':'PoliticalRegimeTypeScore',
                                                                    'Human rights protection score':'HumanRightsScore', 
                                                                    'Entity': 'Country'})
polregime_humanrights_df = polregime_humanrights_df.drop(['Code', 'Year', 'Total population (Gapminder)'], axis=1)
polregime_humanrights_df.head()
 

Unnamed: 0,Country,PoliticalRegimeTypeScore,HumanRightsScore
200,Afghanistan,-1.0,-2.20941
404,Albania,9.0,0.770309
608,Algeria,2.0,0.221587
750,Andorra,,4.226664
953,Angola,-2.0,-0.537383


In [54]:
merged_df = merged_df.merge(polregime_humanrights_df, on='Country', how='left')
merged_df.head()

Unnamed: 0,Country,ISOCode,TourismExpenditureMillions,TouristArrivalsThousands,Currency,ExchangeRateEndOfPeriod,GNIPerCapita,GDPPerCapita,MobilePhoneSubscriptions,NetMigrationRate,...,UrbanPopulationPercent,HDIRank,HDI,InternetUsersPercent,PoliticalRightsFreedomScore,CivilLibertiesFreedomScore,FreedomStatus,DemocracyOrNot,PoliticalRegimeTypeScore,HumanRightsScore
0,Afghanistan,AF,50.0,,Afghani (AFN),74.9556,1746.0,1735.0,59.1,-1.7,...,25.5,170.0,0.496,,5.0,6.0,Not Free,False,-1.0,-2.20941
1,Åland Islands,AX,,,Euro (EUR),0.8734,,,,,...,,,,,,,,,,
2,Albania,AL,2306.0,5340.0,Lek (ALL),107.82,12300.0,12306.0,94.2,-4.9,...,60.3,69.0,0.791,,3.0,3.0,Partly Free,True,9.0,0.770309
3,Algeria,DZ,172.0,2657.0,Algerian Dinar (DZD),118.2906,13639.0,13886.0,121.9,-0.2,...,72.6,82.0,0.759,59.6,6.0,5.0,Not Free,False,2.0,0.221587
4,American Samoa,AS,22.0,20.2,,,,,,,...,,,,,,,,,,


In [55]:
merged_df.dtypes

Country                         object
ISOCode                         object
TourismExpenditureMillions     float64
TouristArrivalsThousands       float64
Currency                        object
ExchangeRateEndOfPeriod         object
GNIPerCapita                   float64
GDPPerCapita                   float64
MobilePhoneSubscriptions       float64
NetMigrationRate               float64
Population                     float64
UrbanPopulationPercent         float64
HDIRank                        float64
HDI                            float64
InternetUsersPercent           float64
PoliticalRightsFreedomScore    float64
CivilLibertiesFreedomScore     float64
FreedomStatus                   object
DemocracyOrNot                  object
PoliticalRegimeTypeScore       float64
HumanRightsScore               float64
dtype: object
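
Note that `ExchangeRateEndOfPeriod` is listed as `object` rather than `float64`, which means at least some of its values are stored as strings. One plausible cause, not verified here, is thousands separators in the source file. A hedged sketch of converting such a column and flagging values that still fail to parse, using made-up values except for the first two:

```python
import pandas as pd

# Toy column: plain numbers as strings, one value with a thousands separator, one missing
rates = pd.Series(['74.9556', '0.8734', '22,602.05', None], name='ExchangeRateEndOfPeriod')

# Remove commas, then convert; anything unparseable becomes NaN instead of raising
cleaned = pd.to_numeric(rates.astype(str).str.replace(',', '', regex=False), errors='coerce')

# Values that were present before but are NaN after conversion need a manual look
failed = rates[cleaned.isna() & rates.notna()]
print(cleaned)
print("Unparseable values:", len(failed))
```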

Write to CSV. This will eventually be written to S3 in Parquet format, but the conversion to Parquet format is done in the PrepareDataSetsForS3 notebook.

In [64]:
#merged_df.to_csv(os.path.join(output_path, 'country_indicators.csv'), na_rep='NULL')
merged_df.to_csv(os.path.join(output_path, 'CountryIndicators.csv'))
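
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If that column is not wanted downstream, `index=False` avoids it. A small round-trip sketch on a toy frame, written to a temporary directory rather than `output_path`:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Country': ['Albania', 'Åland Islands'], 'HDI': [0.791, None]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, 'with_index.csv')
    without_index = os.path.join(tmp_dir, 'without_index.csv')

    df.to_csv(with_index)                   # index written as an unnamed column
    df.to_csv(without_index, index=False)   # only the data columns

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```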