<br> 
<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>


### Prof. José Manuel Magallanes, PhD

* Associate Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

* Visiting Associate Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, [magajm@uw.edu](mailto:magajm@uw.edu)
_____


# Introduction to Python

### Using Python for Pre Processing

In this session we will use Python to:

1. Collect data tables into Python:
    * from a file
    * from a web table


2. Preprocess both tables:
    * Clean cell values
    * Format data types


3. Merge both tables


4. Prepare a file for further analysis



## 1. Collect data tables into Python

* **From a file**:

In [1]:
# Location of data file
linkFile="https://github.com/eScienceWinterSchool/data/raw/master/HDI_2018.xlsx"

We read the table from a file using pandas. Since it is an Excel file, pandas requires **openpyxl**:

In [None]:
# available?
!pip show openpyxl

If not available, please go to Anaconda and install it. Once installed, or if available, continue:

In [2]:
# choose the right function:
import pandas as pd

hdiRaw=pd.read_excel(linkFile)

* **From a table on the web**:

In [3]:
linkWeb = "https://en.wikipedia.org/wiki/Democracy_Index"

Reading in a table from the web using pandas requires **lxml**:

In [None]:
# available?
!pip show lxml

If not available, please go to Anaconda and install it. Once installed, or if available, continue:

In [4]:
demo=pd.read_html(linkWeb,
                  header=0,
                  attrs={'class': 'wikitable sortable'})

Notice that **demo** is not a data frame:

In [5]:
type(demo)

list

In [6]:
#how many?
len(demo)

3

In [7]:
# is it the last one?
demo[2]

Unnamed: 0,Rank,Δ Rank,Country,Regime type,Overall score,Δ Score,Electoral process and pluralism,Functioning of government,Political participation,Political culture,Civil liberties
0,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies
1,1,,Norway,Full democracy,9.75,0.06,10.00,9.64,10.00,10.00,9.12
2,2,2,New Zealand,Full democracy,9.37,0.12,10.00,8.93,9.44,8.75,9.71
3,3,3,Finland,Full democracy,9.27,0.07,10.00,9.29,8.89,8.75,9.41
4,4,1,Sweden,Full democracy,9.26,,9.58,9.29,8.33,10.00,9.12
...,...,...,...,...,...,...,...,...,...,...,...
166,162,2,Central African Republic,Authoritarian,1.43,0.11,1.25,0.00,1.67,1.88,2.35
167,164,2,Democratic Republic of the Congo,Authoritarian,1.40,0.27,0.75,0.00,2.22,3.13,0.88
168,165,2,North Korea,Authoritarian,1.08,,0.00,2.50,1.67,1.25,0.00
169,166,31,Myanmar,Authoritarian,1.02,2.04,0.00,0.00,1.67,3.13,0.29


In [8]:
demoRaw=demo[2].copy()
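
Picking the table by position (`demo[2]`) works only as long as the page keeps the same layout. As a minimal sketch, assuming you know some column names of the table you want, you could search the list instead. The toy frames and the helper `pick_table` below are hypothetical and only stand in for what `pd.read_html` returns:

```python
import pandas as pd

# Toy stand-ins for the list returned by pd.read_html (hypothetical data)
tables = [
    pd.DataFrame({"Year": [2020, 2021]}),
    pd.DataFrame({"Region": ["Asia"], "Score": [5.6]}),
    pd.DataFrame({"Country": ["Norway", "Finland"],
                  "Overall score": [9.75, 9.27]}),
]

def pick_table(frames, required_columns):
    """Return a copy of the first frame that has every required column."""
    for frame in frames:
        if set(required_columns).issubset(frame.columns):
            return frame.copy()
    raise ValueError("no table has the required columns")

picked = pick_table(tables, ["Country", "Overall score"])
print(picked.shape)
```

This keeps the selection readable and fails loudly if the expected table is not found.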

## Pre Processing

### Cleaning cell values

* Checking data head

In [9]:
hdiRaw.head(10)

Unnamed: 0.1,Unnamed: 0,Table 1. Human Development Index and its components,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
0,,,,,,,,,,,,,,,
1,,,,,SDG3,,SDG4.3,,SDG4.6,,SDG8.5,,,,
2,,,,,,,,,,,,,,,
3,,,Human development index (HDI),,Life expectancy at birth,,Expected years of schooling,,Mean years of schooling,,Gross national income (GNI) per capita,,GNI per capita rank minus HDI rank,,HDI rank
4,HDI rank,Country,(index value),,(years),,(years),,(years),,(2011 PPP $),,,,
5,,,2018,,2018,,2018,a,2018,a,2018,,2018,,2017
6,,VERY HIGH HUMAN DEVELOPMENT,,,,,,,,,,,,,
7,1,Norway,0.953688,,82.271,,18.06082,b,12.566818,,68058.61613,,5,,1
8,2,Switzerland,0.945936,,83.63,,16.20882,,13.380812,,59374.73403,,8,,2
9,3,Ireland,0.942473,,82.103,,18.79326,b,12.526295,c,55659.67902,,9,,3


In [10]:
hdiRaw.tail(65)

Unnamed: 0.1,Unnamed: 0,Table 1. Human Development Index and its components,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
204,..,Somalia,..,,57.068,,..,,..,,..,,..,,..
205,..,Tuvalu,..,,..,,12.30895,,..,,5408.949247,,..,,..
206,,,,,,,,,,,,,,,
207,,Human development groups,,,,,,,,,,,,,
208,,Very high human development,0.891786,,79.505042,,16.361804,,12.043619,,40111.566426,,—,,—
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
264,,Column 2: UNDESA (2019b).,,,,,,,,,,,,,
265,,Column 3: UNESCO Institute for Statistics (201...,,,,,,,,,,,,,
266,,Column 4: UNESCO Institute for Statistics (201...,,,,,,,,,,,,,
267,,"Column 5: World Bank (2019a), IMF (2019) and U...",,,,,,,,,,,,,


In [11]:
hdiGood=hdiRaw.iloc[7:206,:].copy()
hdiGood

Unnamed: 0.1,Unnamed: 0,Table 1. Human Development Index and its components,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
7,1,Norway,0.953688,,82.271,,18.06082,b,12.566818,,68058.61613,,5,,1
8,2,Switzerland,0.945936,,83.63,,16.20882,,13.380812,,59374.73403,,8,,2
9,3,Ireland,0.942473,,82.103,,18.79326,b,12.526295,c,55659.67902,,9,,3
10,4,Germany,0.938785,,81.18,,17.09638,,14.13215,,46945.9499,,15,,4
11,4,"Hong Kong, China (SAR)",0.938809,,84.687,,16.51223,,12.03813,,60220.79676,,5,,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201,..,Monaco,..,,..,,..,,..,,..,,..,,..
202,..,Nauru,..,,..,,11.25936,e,..,,17312.58793,,..,,..
203,..,San Marino,..,,..,,15.1112,,..,,..,,..,,..
204,..,Somalia,..,,57.068,,..,,..,,..,,..,,..


Two problems:

* The headers are spread over two different rows.
* The data starts several rows below the top of the sheet.

In [24]:
RealHeaders1=hdiRaw.iloc[4,:2].to_list()
RealHeaders1

['HDI rank', 'Country']

In [25]:
RealHeaders2=hdiRaw.iloc[3,2:].to_list()
RealHeaders2

['Human development index (HDI) ',
 nan,
 'Life expectancy at birth',
 nan,
 'Expected years of schooling',
 nan,
 'Mean years of schooling',
 nan,
 'Gross national income (GNI) per capita',
 nan,
 'GNI per capita rank minus HDI rank',
 nan,
 'HDI rank']

In [26]:
RealHeaders=RealHeaders1+RealHeaders2
RealHeaders

['HDI rank',
 'Country',
 'Human development index (HDI) ',
 nan,
 'Life expectancy at birth',
 nan,
 'Expected years of schooling',
 nan,
 'Mean years of schooling',
 nan,
 'Gross national income (GNI) per capita',
 nan,
 'GNI per capita rank minus HDI rank',
 nan,
 'HDI rank']

Notice that **HDI rank** appears twice in the headers. Let's avoid that by dropping the first column:

In [19]:
hdiGood=hdiGood.iloc[:,1:]
hdiGood

Unnamed: 0,Table 1. Human Development Index and its components,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
7,Norway,0.953688,,82.271,,18.06082,b,12.566818,,68058.61613,,5,,1
8,Switzerland,0.945936,,83.63,,16.20882,,13.380812,,59374.73403,,8,,2
9,Ireland,0.942473,,82.103,,18.79326,b,12.526295,c,55659.67902,,9,,3
10,Germany,0.938785,,81.18,,17.09638,,14.13215,,46945.9499,,15,,4
11,"Hong Kong, China (SAR)",0.938809,,84.687,,16.51223,,12.03813,,60220.79676,,5,,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201,Monaco,..,,..,,..,,..,,..,,..,,..
202,Nauru,..,,..,,11.25936,e,..,,17312.58793,,..,,..
203,San Marino,..,,..,,15.1112,,..,,..,,..,,..
204,Somalia,..,,57.068,,..,,..,,..,,..,,..


In [21]:
hdiGood.columns=RealHeaders[1:]
hdiGood

Unnamed: 0,Country,Human development index (HDI),NaN,Life expectancy at birth,NaN.1,Expected years of schooling,NaN.2,Mean years of schooling,NaN.3,Gross national income (GNI) per capita,NaN.4,GNI per capita rank minus HDI rank,NaN.5,HDI rank
7,Norway,0.953688,,82.271,,18.06082,b,12.566818,,68058.61613,,5,,1
8,Switzerland,0.945936,,83.63,,16.20882,,13.380812,,59374.73403,,8,,2
9,Ireland,0.942473,,82.103,,18.79326,b,12.526295,c,55659.67902,,9,,3
10,Germany,0.938785,,81.18,,17.09638,,14.13215,,46945.9499,,15,,4
11,"Hong Kong, China (SAR)",0.938809,,84.687,,16.51223,,12.03813,,60220.79676,,5,,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201,Monaco,..,,..,,..,,..,,..,,..,,..
202,Nauru,..,,..,,11.25936,e,..,,17312.58793,,..,,..
203,San Marino,..,,..,,15.1112,,..,,..,,..,,..
204,Somalia,..,,57.068,,..,,..,,..,,..,,..


In [23]:
hdiGood.columns.dropna().to_list()

['Country',
 'Human development index (HDI) ',
 'Life expectancy at birth',
 'Expected years of schooling',
 'Mean years of schooling',
 'Gross national income (GNI) per capita',
 'GNI per capita rank minus HDI rank',
 'HDI rank']

In [27]:
BetterHeaders=hdiGood.columns.drop_duplicates().dropna().to_list()

In [29]:
hdiGood=hdiGood.loc[:,BetterHeaders]
hdiGood

Unnamed: 0,Country,Human development index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita,GNI per capita rank minus HDI rank,HDI rank
7,Norway,0.953688,82.271,18.06082,12.566818,68058.61613,5,1
8,Switzerland,0.945936,83.63,16.20882,13.380812,59374.73403,8,2
9,Ireland,0.942473,82.103,18.79326,12.526295,55659.67902,9,3
10,Germany,0.938785,81.18,17.09638,14.13215,46945.9499,15,4
11,"Hong Kong, China (SAR)",0.938809,84.687,16.51223,12.03813,60220.79676,5,6
...,...,...,...,...,...,...,...,...
201,Monaco,..,..,..,..,..,..,..
202,Nauru,..,..,11.25936,..,17312.58793,..,..
203,San Marino,..,..,15.1112,..,..,..,..
204,Somalia,..,57.068,..,..,..,..,..


Notice the index values. After so many changes they are messy. Let's solve that.

In [30]:
hdiGood.reset_index(drop=True, inplace=True)
hdiGood

Unnamed: 0,Country,Human development index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita,GNI per capita rank minus HDI rank,HDI rank
0,Norway,0.953688,82.271,18.06082,12.566818,68058.61613,5,1
1,Switzerland,0.945936,83.63,16.20882,13.380812,59374.73403,8,2
2,Ireland,0.942473,82.103,18.79326,12.526295,55659.67902,9,3
3,Germany,0.938785,81.18,17.09638,14.13215,46945.9499,15,4
4,"Hong Kong, China (SAR)",0.938809,84.687,16.51223,12.03813,60220.79676,5,6
...,...,...,...,...,...,...,...,...
194,Monaco,..,..,..,..,..,..,..
195,Nauru,..,..,11.25936,..,17312.58793,..,..
196,San Marino,..,..,15.1112,..,..,..,..
197,Somalia,..,57.068,..,..,..,..,..
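
The header repair and index reset done so far can be sketched on a small frame. This is only an illustration under assumed data: the frame `raw` and its row positions are hypothetical, not the HDI file.

```python
import numpy as np
import pandas as pd

# Hypothetical raw sheet: headers split over rows 0 and 1, data from row 2
raw = pd.DataFrame([
    [np.nan, np.nan, "Index", np.nan],        # names of the value columns
    ["Rank", "Country", "(value)", np.nan],   # names of the first two columns
    [1, "Norway", 0.95, "a"],
    [2, "Ireland", 0.94, np.nan],
])

# Combine the two header rows into one list of names
headers = raw.iloc[1, :2].to_list() + raw.iloc[0, 2:].to_list()

# Keep only the data rows and attach the names
body = raw.iloc[2:, :].copy()
body.columns = headers

# Keep the columns whose header is not missing, then renumber the rows
keep = [h for h in headers if pd.notna(h)]
body = body.loc[:, keep].reset_index(drop=True)
print(body)
```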


In [55]:
HDI_BadSymbols=[] # list for bad symbols

NumericColumns=hdiGood.iloc[:,1:].columns # save names of columns with numeric data

for columnName in NumericColumns:# visit every column name
    for cell in hdiGood.loc[:,columnName]:# visit every cell for that column
        try:
            float(cell) # try this
        except (TypeError, ValueError): # if not possible:
            if cell not in HDI_BadSymbols:# if cell is not in the list                
                HDI_BadSymbols.append(cell)# add it to the list

# you get:
HDI_BadSymbols

['..']
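
As an alternative to the explicit loop, a vectorized sketch can find the same kind of symbols with `pd.to_numeric(..., errors="coerce")`, which turns anything that is not a number into a missing value. The frame `toy` and its symbols are hypothetical:

```python
import pandas as pd

# Hypothetical data: '..' and 'n/a' hide missing values
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "score": ["0.9", "..", "0.7", "n/a"],
    "years": [12.5, "..", 10, 9],
})

numeric_cols = toy.columns[1:]
bad_symbols = set()
for col in numeric_cols:
    converted = pd.to_numeric(toy[col], errors="coerce")
    # cells that had a value but could not be converted to a number
    failed = toy.loc[converted.isna() & toy[col].notna(), col]
    bad_symbols.update(failed.unique())

print(sorted(bad_symbols))
```

The loop now runs once per column instead of once per cell, which matters for larger tables.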

In [56]:
hdiGood.replace(to_replace=HDI_BadSymbols,
                value=None,inplace=True)
hdiGood

Unnamed: 0,Country,Human development index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita,GNI per capita rank minus HDI rank,HDI rank
0,Norway,0.953688,82.271,18.06082,12.566818,68058.61613,5,1
1,Switzerland,0.945936,83.63,16.20882,13.380812,59374.73403,8,2
2,Ireland,0.942473,82.103,18.79326,12.526295,55659.67902,9,3
3,Germany,0.938785,81.18,17.09638,14.13215,46945.9499,15,4
4,"Hong Kong, China (SAR)",0.938809,84.687,16.51223,12.03813,60220.79676,5,6
...,...,...,...,...,...,...,...,...
194,Monaco,,,,,,,
195,Nauru,,,11.25936,,17312.58793,,
196,San Marino,,,15.1112,,,,
197,Somalia,,57.068,,,,,


In [57]:
hdiGood[hdiGood.iloc[:,1:].isnull().all(axis=1)] #ALL

Unnamed: 0,Country,Human development index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita,GNI per capita rank minus HDI rank,HDI rank
62,HIGH HUMAN DEVELOPMENT,,,,,,,
117,MEDIUM HUMAN DEVELOPMENT,,,,,,,
155,LOW HUMAN DEVELOPMENT,,,,,,,
192,OTHER COUNTRIES OR TERRITORIES,,,,,,,
194,Monaco,,,,,,,


In [59]:
hdiGood=hdiGood[~hdiGood.iloc[:,1:].isnull().all(axis=1)] #ALL

In [61]:
hdiGood[hdiGood.iloc[:,1:].isnull().any(axis=1)] #ANY

Unnamed: 0,Country,Human development index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita,GNI per capita rank minus HDI rank,HDI rank
193,Korea (Democratic People's Rep. of),,72.095,10.840575,,,,
195,Nauru,,,11.25936,,17312.58793,,
196,San Marino,,,15.1112,,,,
197,Somalia,,57.068,,,,,
198,Tuvalu,,,12.30895,,5408.949247,,


In [65]:
hdiGood=hdiGood[~hdiGood.iloc[:,1:].isnull().any(axis=1)] #ANY

In [66]:
hdiGood

Unnamed: 0,Country,Human development index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita,GNI per capita rank minus HDI rank,HDI rank
0,Norway,0.953688,82.271,18.06082,12.566818,68058.61613,5,1
1,Switzerland,0.945936,83.63,16.20882,13.380812,59374.73403,8,2
2,Ireland,0.942473,82.103,18.79326,12.526295,55659.67902,9,3
3,Germany,0.938785,81.18,17.09638,14.13215,46945.9499,15,4
4,"Hong Kong, China (SAR)",0.938809,84.687,16.51223,12.03813,60220.79676,5,6
...,...,...,...,...,...,...,...,...
187,Burundi,0.422882,61.247,11.30463,3.124365,659.732263,4,185
188,South Sudan,0.41277,57.604,5.00038,4.84913,1455.229886,-7,186
189,Chad,0.401176,53.977,7.465364,2.409497,1715.568235,-15,187
190,Central African Republic,0.380662,52.805,7.56836,4.282,776.675996,0,188


#### * Check the column names

The column **HDI rank** appears twice because it appears twice in the data. There is also another rank column, **GNI per capita rank minus HDI rank**. Ranks are redundant because we already have the scores, so you can always get rid of them in this and similar situations:

In [None]:
FinalHeaders=[header for header in BetterHeaders if 'rank' not in header]
# then
FinalHeaders

In [None]:
# remember loc works with names, not with positions:

hdiGood.loc[:,FinalHeaders]

Saving previous result:

In [None]:
hdi=hdiGood.loc[:,FinalHeaders].copy()

In [None]:
# you have:
hdi.head()

The remaining column names need improvement:

In [None]:
# replace the "blanks" with '' (empty):
hdi.columns.str.replace(r"\s","",regex=True)

In [None]:
# replace consecutive word characters in parentheses with '' (empty):
hdi.columns.str.replace(r"\(\w+\)","",regex=True)

In [None]:
# or all combined
hdi.columns.str.replace(r"\s+|\(\w+\)","",regex=True)

Saving result:

In [None]:
hdi.columns=hdi.columns.str.replace(r"\s+|\(\w+\)","",regex=True)

In [None]:
hdi.head()
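
To see what the combined pattern does, here is a minimal sketch on a stand-alone `pd.Index` that copies three of the header names shown earlier. Note that `regex=True` is passed explicitly, since recent pandas versions treat the pattern as literal text by default:

```python
import pandas as pd

cols = pd.Index(["Human development index (HDI) ",
                 "Life expectancy at birth",
                 "Gross national income (GNI) per capita"])

# remove runs of whitespace, and word characters wrapped in parentheses
clean = cols.str.replace(r"\s+|\(\w+\)", "", regex=True)
print(clean.to_list())
```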

#### * Check the cell values

Find cells where all numeric data is missing, that is, a country with no data:

In [None]:
# check empty cells from second to last
hdi[hdi.iloc[:,1:].isnull().all(axis=1)] #ALL

These are not countries. They were subtitles for groups of countries.

You need the complementary rows, those where every numeric cell has a value:

In [None]:
# the opposite
hdi[hdi.iloc[:,1:].notnull().all(axis=1)]

Keeping what you need:

In [None]:
hdi=hdi[hdi.iloc[:,1:].notnull().all(axis=1)]

The last code deleted rows, so the index needs to be reset:

In [None]:
hdi.reset_index(drop=True, inplace=True)

So far:

In [None]:
hdi

#### * Check the cell values

Above you see some characters that are not numbers. Are they missing values?

In [None]:
hdi[hdi.iloc[:,1:].isnull().any(axis=1)] #ANY

There are symbols that represent missing values, but they are not recognized as such. We need to check which values in those **numeric** columns are not numbers:

In [None]:
badHDISymbols=[] # list for bad symbols

NumericColNames=hdi.iloc[:,1:].columns # save names of columns with numeric data

for columnName in NumericColNames:# visit every column name
    for cell in hdi[columnName]:# visit every cell for that column
        try:
            float(cell) # try this
        except (TypeError, ValueError): # if not possible:
            if cell not in badHDISymbols:# if cell is not in the list                
                badHDISymbols.append(cell)# add it to the list

# you get:
badHDISymbols

You need to replace that value in the numeric columns:

In [None]:
# you have:
hdi
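
The replacement step mentioned above can be sketched on a toy frame. This is an illustration under assumptions: the frame `toy` and the list `bad` are hypothetical, and `np.nan` is used as the replacement value so that the result does not depend on how a given pandas version treats `value=None`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where '..' stands for a missing value
toy = pd.DataFrame({"Country": ["Monaco", "Norway"],
                    "hdi": ["..", 0.95],
                    "years": ["..", 12.6]})

bad = [".."]                      # symbols found in the previous step
numeric_cols = toy.columns[1:]    # leave the Country column untouched

# replace the symbols only in the numeric columns
toy[numeric_cols] = toy[numeric_cols].replace(bad, np.nan)
print(toy)
```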

When you have numbers in your columns, you can request statistical summaries:

In [None]:
hdi.describe(include='all')

You are not getting numeric summaries (mean, quartiles) because the columns are still stored as text, not as numbers.

### -  Formatting

Check the data types:

In [None]:
hdi.dtypes

Country can remain as an object (text), but not the rest.

* **Formatting into numeric type**:

In [None]:
# as easy as:

hdi[NumericColNames]=hdi.loc[:,NumericColNames].apply(pd.to_numeric)

In [None]:
#recheck
hdi.dtypes

In [None]:
# recheck
hdi.describe(include='all')

Some more information:

In [None]:
hdi.info()
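
The effect of the conversion can be sketched on a toy frame with hypothetical values: numbers stored as text cannot be averaged, but after `pd.to_numeric` they can.

```python
import pandas as pd

# Hypothetical frame with numbers stored as text
toy = pd.DataFrame({"Country": ["Norway", "Chad"],
                    "hdi": ["0.95", "0.40"],
                    "gni": ["68058.6", "1715.5"]})

numeric_cols = toy.columns[1:]
print(toy.dtypes)    # before: text columns

# convert every numeric column at once
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric)
print(toy.dtypes)    # after: float columns
print(toy.describe())
```

If a column may still contain stray symbols, `pd.to_numeric` raises an error by default; passing `errors="coerce"` would turn those cells into missing values instead.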

## CASE 2: DEMOCRACY INDEX

Link to use: the Wikipedia page stored in `linkWeb` above.

## - Cleaning

* First rows:

In [None]:
# what's on top?
# names? weird symbols? more links?
demodex.head(10)

* Last rows:

In [None]:
# what's at the bottom?
# note? credits? extra info?

demodex.tail(10)


From what I saw, I will get rid of the _Score_ and the _Rank_ columns, and also of the last row (it is a repetition of the headers):

In [None]:
#bye row 167, and two columns
demClean=demodex.drop(index=167,columns=['Rank','Score']).copy()

In [None]:
demClean

Since there are only a few column names, we can replace them with shorter ones:

In [None]:
newNames=['pluralism','effectiveness','participation','culture','liberties']

# names from the second and before the last one '[1:-1]':
newMapper={old:new for old,new in zip(demClean.columns[1:-1],newNames)}

demClean.rename(columns=newMapper,inplace=True)

In [None]:
# this is what we have so far:
demClean.head()

Let me check if there is any odd character among the numbers:

In [None]:
badDemoSymbols=[] # list for bad symbols

NumericColNames=demClean.iloc[:,1:-2].columns # save names of columns with numeric data

for columnName in NumericColNames:# visit every column name
    for cell in demClean[columnName]:# visit every cell for that column
        try:
            float(cell) # try this
        except (TypeError, ValueError): # if not possible:
            if cell not in badDemoSymbols:# if cell is not in the list                
                badDemoSymbols.append(cell)# add it to the list

# you get:
badDemoSymbols

The last two columns are categories, so we should check whether the categories are consistent:

In [None]:
# get unique values in each column:
demClean.iloc[:,-2:].apply(set)

In [None]:
# easier to see:
demClean.iloc[:,-2:].apply(set).to_list()

It looks good so far. Let's go to formatting.

## - Formatting


In [None]:
# checking data types:
demClean.dtypes

* **Formatting into numeric**

In [None]:
demClean[NumericColNames]=demClean[NumericColNames].apply(pd.to_numeric)

* **Formatting into nominal**

In [None]:
demClean.Continent=pd.Categorical(demClean.Continent)

* **Formatting into ordinal**

In [None]:
# Check current levels:
pd.unique(demClean.Regimetype).tolist()

In [None]:
#rewrite the levels in order:
correctLevels=['Authoritarian', 'Hybrid regime', 'Flawed democracy','Full democracy']

In [None]:
#format as ordinal:
demClean.Regimetype=pd.Categorical(demClean.Regimetype,categories=correctLevels,ordered=True)

The data types have changed:

In [None]:
#then
demClean.dtypes

For more detail:

In [None]:
demClean.Regimetype.cat.ordered

In [None]:
# checking missing values
demClean.info()
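
What the ordinal format buys you can be sketched with a small series. The levels are the ones defined above, while the three values are just an example:

```python
import pandas as pd

levels = ["Authoritarian", "Hybrid regime", "Flawed democracy", "Full democracy"]

regime = pd.Series(pd.Categorical(
    ["Full democracy", "Authoritarian", "Flawed democracy"],
    categories=levels, ordered=True))

# ordered categories support min, max and comparisons
print(regime.min(), "|", regime.max())
print((regime >= "Flawed democracy").to_list())
```

With a nominal (unordered) categorical, the same comparison would raise an error, which is why the order of the levels must be declared.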

## Case 3: Integrating and Saving

This should be an easy step:

In [None]:
hdi.merge(demClean)

Notice you have lost countries:

In [None]:
len(demClean),len(hdi)

In [None]:
hdi.merge(demClean,how='outer',indicator=True) # see last column

Let me save previous result:

In [None]:
test=hdi.merge(demClean,how='outer',indicator=True) 

In [None]:
# in hdi only
test.loc[test['_merge']=='left_only',"Country"]

In [None]:
# in demClean only
test.loc[test['_merge']=='right_only',"Country"]

In [None]:
# you need to change the original

#dictionary of replacements:
replacements={'South Korea[n 1]': 'Korea (Republic of)', 
              'Cape Verde':'Cabo Verde',
              'Czech Republic':'Czechia',
              'Hong Kong':'Hong Kong, China (SAR)',
              'Moldova':'Moldova (Republic of)',
              'Bolivia':'Bolivia (Plurinational State of)',
              'Tanzania':'Tanzania (United Republic of)',
              'Palestine':'Palestine, State of',
              'Ivory Coast':"Côte d'Ivoire",
              'Republic of the Congo':'Congo',
              'Venezuela':'Venezuela (Bolivarian Republic of)',
              'Vietnam':'Viet Nam',
              'Eswatini':'Eswatini (Kingdom of)',              
              'Russia':'Russian Federation',
              'Iran':'Iran (Islamic Republic of)',
              'Laos':"Lao People's Democratic Republic",
              'Democratic Republic of the Congo':'Congo (Democratic Republic of the)',
              'Syria':'Syrian Arab Republic',
              'North Korea': "Korea (Democratic People's Rep. of)" #check ""
             }

# replacing
demClean['Country']=demClean['Country'].replace(replacements)

The merge should now keep more rows:

In [None]:
hdidem=hdi.merge(demClean)
# result:
hdidem
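
The diagnose-then-fix pattern used above can be sketched with two toy frames. The numbers are made up for illustration; only the name mismatch ('Vietnam' vs 'Viet Nam') comes from the replacements dictionary:

```python
import pandas as pd

# Toy frames with made-up values and one mismatched country name
left = pd.DataFrame({"Country": ["Norway", "Viet Nam"], "hdi": [0.9, 0.6]})
right = pd.DataFrame({"Country": ["Norway", "Vietnam"], "score": [9.0, 3.0]})

# 1. diagnose: an outer merge with indicator shows who did not match
check = left.merge(right, how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["Country", "_merge"]]
print(unmatched)

# 2. fix: harmonize the names in one of the frames
right["Country"] = right["Country"].replace({"Vietnam": "Viet Nam"})

# 3. merge again: the default inner merge now keeps both countries
merged = left.merge(right)
print(merged)
```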

____
____


### <font color="red">Saving File to Disk</font>

#### For future use in Python:

In [None]:
hdidem.to_pickle("hdidemocia.pkl")
# you will need: DF=pd.read_pickle("hdidemocia.pkl")
# or:
# from urllib.request import urlopen
# DF=pd.read_pickle(urlopen("https://...../hdidemocia.pkl"),compression=None)
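
One reason to prefer a pickle file for later use in Python is that it keeps the data types you formatted, such as ordered categories. A minimal round-trip sketch, using a hypothetical one-row frame and a temporary folder:

```python
import os
import tempfile

import pandas as pd

levels = ["Authoritarian", "Hybrid regime", "Flawed democracy", "Full democracy"]
df = pd.DataFrame({
    "Country": ["Norway"],
    "regime": pd.Categorical(["Full democracy"], categories=levels, ordered=True),
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy.pkl")   # hypothetical file name
    df.to_pickle(path)                       # save
    back = pd.read_pickle(path)              # read again

print(back.dtypes)
print(back["regime"].cat.ordered)
```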

#### For future  use in R:

In [None]:
from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
base.saveRDS(hdidem,file="hdidem.RDS")

# In R, you read it with: DF = readRDS("hdidem.RDS")
# or, if you read it from the cloud: DF = readRDS(url("https://...../hdidem.RDS"))