### 1. Collect the data

In [None]:
import pandas as pd

linkWiki="https://en.wikipedia.org/wiki/Democracy_Index"
democracy=pd.read_html(linkWiki, header=0,attrs={"class":"wikitable sortable"})[4]

linkmil="https://www.cia.gov/the-world-factbook/field/military-expenditures/country-comparison"

milimoney=pd.read_html(linkmil)[0]

linkHDI="https://github.com/UW-eScience-WinterSchool/Python_Session/raw/main/countryCodesHDI.xlsx"
hdidata=pd.read_excel(linkHDI)

### 2. Check column names

In [None]:
democracy.columns

In [None]:
milimoney.columns

In [None]:
hdidata.columns

We check the column names to identify the key column for the merge and to spot columns that may cause trouble.

In [None]:
# this renaming will make merge easier
hdidata.rename(columns={'NAME':"Country"},inplace=True)

In [None]:
# with no 'on' argument, merge uses every column name the data frames share as a key, so drop "Rank" first:
democracy.drop(columns=["Rank"],inplace=True)
milimoney.drop(columns=["Rank"],inplace=True)
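To see why a shared "Rank" column is a problem, here is a minimal sketch with toy data (the values are hypothetical, not taken from the sources above). When `merge` is called without `on`, every column name present in both data frames becomes part of the key.

```python
import pandas as pd

# Toy data (hypothetical values): the two sources rank the countries differently
left = pd.DataFrame({"Rank": [1, 2, 3],
                     "Country": ["Norway", "Iceland", "Sweden"],
                     "Score": [9.8, 9.5, 9.3]})
right = pd.DataFrame({"Rank": [3, 1, 2],
                      "Country": ["Norway", "Iceland", "Sweden"],
                      "Spending": [1.6, 0.1, 1.3]})

# No 'on' argument: both shared columns (Rank and Country) become keys,
# and no row agrees on both, so nothing is matched
both_keys = left.merge(right)
print(len(both_keys))

# After dropping Rank, Country is the only shared column
country_only = left.drop(columns=["Rank"]).merge(right.drop(columns=["Rank"]))
print(country_only)
```

Passing `on="Country"` explicitly would also work, but it would keep two suffixed rank columns in the result.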

### 3. Fuzzy merge

Check how many countries are present:

In [None]:
DemoCountryAll=set(democracy.Country)
len(DemoCountryAll)

In [None]:
MiliCountryAll=set(milimoney.Country)
len(MiliCountryAll)

In [None]:
HdiCountryAll=set(hdidata.Country)
len(HdiCountryAll)

We should aim to keep close to 166 countries after merging. However, the three data sets currently have only this many country names in common:

In [None]:
len(DemoCountryAll.intersection(MiliCountryAll).intersection(HdiCountryAll))

Let's compare _MiliCountryAll_ and _DemoCountryAll_:

In [None]:
# in MiliCountryAll but not in DemoCountryAll
MiliYes_DemoNo=MiliCountryAll.difference(DemoCountryAll)
MiliYes_DemoNo

In [None]:
# in DemoCountryAll but not in MiliCountryAll
DemoYes_MiliNo=DemoCountryAll.difference(MiliCountryAll)
DemoYes_MiliNo
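The set operations used so far can be sketched on a toy example (the two small sets below are hypothetical stand-ins for the real country sets):

```python
# Hypothetical country names as spelled by two different sources
demo = {"Cape Verde", "Ivory Coast", "Norway", "Peru"}
mili = {"Cabo Verde", "Cote d'Ivoire", "Norway", "Peru"}

common = demo & mili      # same as demo.intersection(mili)
only_mili = mili - demo   # same as mili.difference(demo)
only_demo = demo - mili   # same as demo.difference(mili)

print(sorted(common))
print(sorted(only_mili))
print(sorted(only_demo))
```

The two difference sets hold the names that an exact merge would lose, so they are the candidates for fuzzy matching.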

Time to try matches:

In [None]:
#!pip install thefuzz[speedup]

In [None]:
from thefuzz import process
{mili:process.extractOne(mili,DemoYes_MiliNo)[0] for mili in sorted(MiliYes_DemoNo)}
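The comprehension above keeps only the best-matching name and discards the similarity score. Looking at the score can help decide which matches to trust. The sketch below illustrates the idea with the standard library's `difflib` as a stand-in for `thefuzz` (its scores will differ from those of `thefuzz`); the helper `best_match`, the toy names, and the cutoff of 50 are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def best_match(name, choices):
    """Return (closest choice, similarity score from 0 to 100)."""
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in sorted(choices)]
    score, match = max(scored)
    return match, round(100 * score)

# Hypothetical unmatched names from each source
unmatched_mili = ["Czechia", "Kosovo"]
unmatched_demo = {"Czech Republic", "Gambia", "Cape Verde"}

candidates = {name: best_match(name, unmatched_demo) for name in unmatched_mili}

# Keep only candidates above an arbitrary cutoff; they still need a manual review
accepted = {name: match for name, (match, score) in candidates.items() if score >= 50}
print(candidates)
print(accepted)
```

A cutoff only reduces the list to review. As the next cell shows, some good matches still have to be added by hand and some suggestions rejected.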

Some of these matches are correct and can be recovered. Let's keep them in a dict (the wrong matches are commented out):

In [None]:
goodDemo={
    #'Bahamas, The': 'Democratic Republic of the Congo',
# 'Barbados': 'Gambia',
# 'Belize': 'Czech Republic',
# 'Brunei': 'Bhutan',
# 'Burma': 'Bhutan',
 'Cabo Verde': 'Cape Verde',
 'Congo, Democratic Republic of the': 'Democratic Republic of the Congo',
 'Congo, Republic of the': 'Republic of the Congo', #manual
 "Cote d'Ivoire": 'Ivory Coast',
 'Czechia': 'Czech Republic',
 'Gambia, The': 'Gambia',
 'Korea, South': 'South Korea'
# 'Kosovo': 'Comoros',
# 'Seychelles': 'Czech Republic',
# 'Somalia': 'Libya',
# 'South Sudan': 'South Korea'
}

In [None]:
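# update the names in milimoney (assign the result back: an in-place replace on
# a selected column may not modify the data frame in recent pandas versions)
milimoney["Country"]=milimoney["Country"].replace(goodDemo)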
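As a minimal sketch of how a dict-based `replace` behaves (toy names, not the real column): values that appear as keys are swapped, everything else is left untouched, and the original Series is unchanged unless the result is assigned back.

```python
import pandas as pd

names = pd.Series(["Czechia", "Norway", "Korea, South"], name="Country")
fixes = {"Czechia": "Czech Republic", "Korea, South": "South Korea"}

# replace returns a new Series; 'names' itself is not modified
fixed = names.replace(fixes)
print(fixed.tolist())
print(names.tolist())
```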

We can do the first merge:

In [None]:
allmerged=democracy.merge(milimoney)
allmerged
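A default merge is an inner join, so rows without a partner disappear silently. One way to audit this, sketched here on toy data with hypothetical values, is an outer merge with `indicator=True`, which adds a `_merge` column telling where each row came from.

```python
import pandas as pd

demo = pd.DataFrame({"Country": ["Norway", "Ivory Coast", "Peru"],
                     "Score": [9.8, 4.2, 6.1]})
mili = pd.DataFrame({"Country": ["Norway", "Cote d'Ivoire", "Peru"],
                     "Spending": [1.6, 1.1, 1.2]})

# Outer join keeps every row; _merge is 'both', 'left_only' or 'right_only'
check = demo.merge(mili, how="outer", indicator=True)

# Rows that an inner merge would have dropped
unmatched = check[check["_merge"] != "both"]
print(unmatched)
```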

With the country names harmonized, let's turn to the HDI data frame:

In [None]:
allmergedCountryAll=set(allmerged.Country)



Let's redo the previous analysis:

In [None]:
# in allmerged that are not in HdiCountryAll
allmergedYes_HdiNo=allmergedCountryAll.difference(HdiCountryAll)
# in HdiCountryAll that are not in allmerged
HdiYes_allmergedNo=HdiCountryAll.difference(allmergedCountryAll)

In [None]:
#matches
{allmr:process.extractOne(allmr,HdiYes_allmergedNo)[0] for allmr in sorted(allmergedYes_HdiNo)}

In [None]:
#selecting
hdiGood={'Eswatini': 'Swaziland',
 'Iran': 'Iran (Islamic Republic of)',
 'Ivory Coast': "Cote d'Ivoire",
 'Laos': "Lao People's Democratic Republic", # manual
 'Moldova': 'Republic of Moldova',
 'North Macedonia': 'The former Yugoslav Republic of Macedonia',
 'Republic of the Congo': 'Congo',
 'South Korea': 'Korea, Republic of',#manual
 'Tanzania': 'United Republic of Tanzania',
 'Vietnam': 'Viet Nam'}

It is better to apply these changes to the HDI data, so we invert the dict:

In [None]:
hdiGood_Change={v:k for k,v in hdiGood.items()}
hdiGood_Change

Then,

In [None]:
hdidata["Country"]=hdidata["Country"].replace(hdiGood_Change)


Final merge:

In [None]:
allmerged=allmerged.merge(hdidata)
allmerged

### 4. Preprocessing

* _Check the strings in column names_:

In [None]:
allmerged.columns.to_list()

* _Clean strings_:

In [None]:
# replace '%' with "share"
allmerged.columns=allmerged.columns.str.replace(r"%","share",regex=True)
# replace each whitespace character with "_"
allmerged.columns=allmerged.columns.str.replace(r"\s","_",regex=True)
# remove any remaining non-word character (anything other than letters, digits and "_")
allmerged.columns=allmerged.columns.str.replace(r"\W","",regex=True)
#current names
allmerged.columns.to_list()
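The effect of the three replacements can be checked on a few made-up column names (hypothetical examples, not necessarily the ones in `allmerged`):

```python
import pandas as pd

cols = pd.Index(["% of GDP", "Overall score", "HDI (2018)",
                 "Electoral process and pluralism"])

cols = cols.str.replace(r"%", "share", regex=True)   # '%' becomes 'share'
cols = cols.str.replace(r"\s", "_", regex=True)      # whitespace becomes '_'
cols = cols.str.replace(r"\W", "", regex=True)       # drop other symbols
print(cols.to_list())
```

The order matters: `%` has to be replaced before the last step, which would otherwise simply delete it.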

* _Drop MORE unneeded columns_:

In [None]:
#take a look:
allmerged.columns[allmerged.columns.str.contains("Rank|Date|Δ",regex=True)]

In [None]:
#save column to drop
toDrop=allmerged.columns[allmerged.columns.str.contains("Rank|Date|Δ",regex=True)]
# drop them
allmerged.drop(columns=toDrop,inplace=True)
# see result
allmerged

* _Look for missing values and check for wrong data types_:

In [None]:
allmerged.info()

a. Some missing values can be corrected, others cannot:

In [None]:
allmerged[allmerged.isnull().any(axis=1)]

We cannot use Taiwan, but Namibia can be kept: its ISO2 code, 'NA', was read as a missing value.

In [None]:
allmerged.loc[pd.isnull(allmerged.ISO2),'ISO2']='NA'
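The Namibia problem comes from the reader: by default pandas treats the string "NA" as a missing value. The sketch below reproduces this with a tiny in-memory CSV (made-up content); `pd.read_excel` accepts the same `keep_default_na` and `na_values` parameters, so the issue could also be avoided when the file is loaded.

```python
import io
import pandas as pd

raw = "Country,ISO2\nNamibia,NA\nNorway,NO\n"

# Default behaviour: 'NA' is parsed as a missing value
default = pd.read_csv(io.StringIO(raw))
print(default["ISO2"].isna().tolist())

# Only treat empty cells as missing, so 'NA' survives as text
kept = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=[""])
print(kept["ISO2"].tolist())
```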

Dropping rows with missing values:

In [None]:
allmerged.dropna(inplace=True)
allmerged.reset_index(drop=True,inplace=True)
allmerged

b. Convert strings to numeric values:

In [None]:
toNumeric=['Overall_score',
 'Electoral_process_and_pluralism', 
 'Functioning_of_government',
 'Political_participation', 'Political_culture', 'Civil_liberties']
# assign whole columns so that the new numeric dtype is kept
allmerged[toNumeric]=allmerged[toNumeric].apply(pd.to_numeric)
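`pd.to_numeric` raises an error when a cell cannot be parsed. If that happens, one option is `errors="coerce"`, which turns unparseable cells into missing values that can be inspected afterwards. A minimal sketch with made-up values:

```python
import pandas as pd

scores = pd.Series(["9.81", "7.5", "n/a"])

# Unparseable strings become NaN instead of raising an error
numeric = pd.to_numeric(scores, errors="coerce")
print(numeric.dtype)
print(numeric.isna().tolist())
```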

c. Check ordinal variable:

In [None]:
#check levels:
set(allmerged.Regime_type)

In [None]:
levels=['Authoritarian', 'Hybrid regime','Flawed democracy', 'Full democracy']
allmerged.Regime_type=pd.Categorical(allmerged.Regime_type,
                                     categories=levels,
                                     ordered=True)

In [None]:
allmerged.Regime_type
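Declaring the variable as an ordered categorical makes comparisons and sorting follow the regime order rather than the alphabet. A small sketch on a toy Series (the three values are made up):

```python
import pandas as pd

levels = ['Authoritarian', 'Hybrid regime', 'Flawed democracy', 'Full democracy']
regime = pd.Series(pd.Categorical(["Full democracy", "Authoritarian", "Hybrid regime"],
                                  categories=levels, ordered=True))

# Comparisons use the declared order of the levels
at_least_flawed = regime >= "Flawed democracy"
print(at_least_flawed.tolist())

# Sorting and min/max also follow that order
sorted_regime = regime.sort_values().tolist()
print(sorted_regime, regime.min())
```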

In [None]:
# review:
allmerged.info()

Let's give better names to the main indices:

In [None]:
LastNamesChanges={'Overall_score':'DemoIndex','share_of_GDP':'DefenseIndex','HDI':"HDIndex"}
allmerged.rename(columns=LastNamesChanges,inplace=True)

d. Let's explore those variables:

In [None]:
allmerged.loc[:,['DemoIndex','DefenseIndex',"HDIndex"]].describe()

In [None]:
#!pip install seaborn

In [None]:
import seaborn as sbn

sbn.boxplot(data=allmerged.loc[:,['DemoIndex','DefenseIndex',"HDIndex"]], orient="h", palette="Set2")

Rescale HDI so that its range is comparable to that of the democracy index:

In [None]:
allmerged.HDIndex=10*allmerged.HDIndex
sbn.boxplot(data=allmerged.loc[:,['DemoIndex','DefenseIndex',"HDIndex"]], orient="h", palette="Set2")

Verify the correlations:

In [None]:
sbn.pairplot(allmerged.loc[:,['DemoIndex','DefenseIndex',"HDIndex"]], corner=True)

In [None]:
allmerged.loc[:,['DemoIndex','DefenseIndex',"HDIndex"]].corr()

Make the three indices point in the same direction by reversing the defense index:

In [None]:
# reflect the values: the old maximum becomes the minimum and vice versa (new = max + min - old)
allmerged.DefenseIndex=allmerged.DefenseIndex.max()+allmerged.DefenseIndex.min()-allmerged.DefenseIndex
allmerged.DefenseIndex.describe()

In [None]:
allmerged.loc[:,['DemoIndex','DefenseIndex',"HDIndex"]].corr()
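The reversal used above amounts to computing `max + min - x`. The sketch below, with made-up numbers, shows that this keeps the range of the variable, reverses the order of the values, and only flips the sign of its correlation with another variable.

```python
import pandas as pd

# Toy values (hypothetical)
defense = pd.Series([0.5, 1.0, 2.0, 4.5])
demo = pd.Series([9.0, 8.0, 5.0, 2.0])

# Reflection: same minimum and maximum, opposite ordering
reflected = defense.max() + defense.min() - defense
print(reflected.tolist())

# The correlation keeps its magnitude and changes sign
print(defense.corr(demo), reflected.corr(demo))
```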

### 5. Save your work

* _Save data for R_:

In [None]:
from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
base.saveRDS(allmerged,file="allmerged_new.rds")