<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>


### **Prof. José Manuel Magallanes, PhD**

* Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

* Visiting Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, [magajm@uw.edu](mailto:magajm@uw.edu)
_____


<a id='home'></a>

# Introduction to Python

### Using Python for Pre Processing

In this session we will use Python to:

1. Collect data as dataframes into Python

2. Preprocess a data frame:
    * [Subset data](#subset)
    * [Fix column names](#fixcolnames)
    * [Look for non-standard missing values](#lookfornas)
    * [Clean cell values](#cleancellvalues)
    * [Format data types](#formatdtypes)


3. Merge both tables:
    * [Basic merge](#merging)
    * [Fuzzy merge](#fuzzmerging)


4. Prepare a file for further analysis
    * [Scaling](#scaling)
    * [Exporting](#exporting)



## 1. Collect data tables into Python

In [None]:
# Location of data file
linkFile="https://hdr.undp.org/sites/default/files/2021-22_HDR/HDR21-22_Statistical_Annex_HDI_Table.xlsx"

We will read the table from the file using pandas. Since it is an Excel file, pandas requires **openpyxl**:

In [None]:
# available on my computer?
!pip show openpyxl

If not available, please go to Anaconda and install it. Once installed, or if available, continue:

In [None]:
# choose the right function:
import pandas as pd

hdiFile=pd.read_excel(linkFile) # you might get an error!

In [None]:
# you may need to add user-agent

storage_options = {'User-Agent': 'Mozilla/5.0'}
hdiFile = pd.read_excel(linkFile, storage_options=storage_options)

[Home](#home)
______

## 2. Pre Processing

<a id='subset'></a>

### Subset data

The object **hdiFile** holds the information as a data frame. We will keep just what we need; let's check the data head and tail:

In [None]:
hdiFile.head(10)

In [None]:
hdiFile.tail()

Above, you see countries neither in the first rows nor in the last ones. There are rows at the beginning and at the end that are not needed.

Let's find the row indexes that have the data:

In [None]:
# the problem is at the tail:
hdiFile.tail(72)

In [None]:
# is this better?
hdiFile.iloc[7:206,:]

In [None]:
# subsetting as new DF (copy)
hdiRaw=hdiFile.iloc[7:206,:].copy()
hdiRaw
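
Instead of reading the row positions off the screen, you could also compute them. The sketch below uses a small made-up table (not the HDR file) and assumes that a data row is one whose last column holds a number; the toy values and that rule are assumptions for illustration only.

```python
import pandas as pd

# toy table: title rows on top, notes at the bottom (made-up values)
toy = pd.DataFrame({
    0: ["Table 1", None, "Rank", 1, 2, 3, None, "Notes"],
    1: [None, None, "Country", "Aland", "Borduria", "Carpania", None, "a. estimate"],
    2: [None, None, "Value", 0.95, 0.80, 0.62, None, None],
})

# assumed rule: a data row is one whose last column can be read as a number
isData = pd.to_numeric(toy[2], errors="coerce").notna()

# positions of the first and last data rows
positions = [i for i, flag in enumerate(isData) if flag]
first, last = positions[0], positions[-1]

# iloc excludes the end position, so add 1
toyRaw = toy.iloc[first:last + 1, :].copy()
print(first, last)
print(toyRaw)
```

With a real file you would still want to look at the head and tail, as done above, to confirm that the rule picks the right rows.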

[Home](#home)
______

<a id='fixcolnames'></a>

### Fix column names

Notice that **hdiRaw** does not have the right column names, so we need to recover them from **hdiFile**:

In [None]:
hdiFile.iloc[[3,4],:]

As you see, the column names are spread over two rows:

In [None]:
# and
hdiFile.iloc[4,:2].to_list()

In [None]:
hdiFile.iloc[3,2:].to_list()

In [None]:
# save column names 
RealHeaders=hdiFile.iloc[4,:2].to_list()+hdiFile.iloc[3,2:].to_list()
# these are:
RealHeaders

Let's avoid all the "ranks":

In [None]:
RealHeaders[1:-3]

In [None]:
# keep just those columns
hdiRaw=hdiRaw.iloc[:,1:-3]
hdiRaw

Now, put the right names:

In [None]:
#renaming
hdiRaw.columns=RealHeaders[1:-3]
#result:
hdiRaw

Some column names are still missing values (NaN); let's get rid of those:

In [None]:
BetterHeaders=hdiRaw.columns.dropna().to_list()
#result
BetterHeaders

In [None]:
#subsetting again
hdiRaw=hdiRaw.loc[:,BetterHeaders]
hdiRaw.head()

Notice above that the columns:
* Have acronyms in parentheses.
* Have spaces between words.

Let's see what can be done:

In [None]:
# bye anything between parentheses (raw string avoids invalid-escape warnings)
hdiRaw.columns.str.replace(r'\(.+\)',"", regex=True)

In [None]:
# bye anything between parentheses, bye leading-trailing spaces
hdiRaw.columns.str.replace(r'\(.+\)',"", regex=True).str.strip()

In [None]:
# bye anything between parentheses, bye leading-trailing spaces, title case
hdiRaw.columns.str.replace(r'\(.+\)',"", regex=True).\
                          str.strip().\
                          str.title()

Let's keep this last one for a while:

In [None]:
#changing column names
hdiRaw.columns=hdiRaw.columns.str.replace(r'\(.+\)',"", regex=True).\
                          str.strip().\
                          str.title()
#so
hdiRaw
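
A detail worth knowing about the pattern used above: `.+` is greedy, so it runs from the first opening parenthesis to the last closing one. The sketch below shows the difference with a non-greedy `.+?`, using two made-up headers (they are assumptions, not necessarily the names in the HDR file).

```python
import pandas as pd

# made-up headers, the second one has two parenthesised chunks
cols = pd.Index(["Human Development Index (HDI) ",
                 "Gross national income (GNI) per capita (2017 PPP $)"])

# greedy: everything from the first "(" to the last ")" is removed
greedy = cols.str.replace(r"\(.+\)", "", regex=True).str.strip()

# non-greedy: each parenthesised chunk is removed on its own
lazy = cols.str.replace(r"\(.+?\)", "", regex=True).str.strip()

print(greedy.to_list())
print(lazy.to_list())
```

Which one you want depends on the headers: the greedy version drops the words between two acronyms, while the non-greedy version keeps them (and may leave a double space behind).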

Now, it is time to decide what we want as the shorter column name:

* Same title without spaces:

In [None]:
hdiRaw.columns.str.replace(" ",'',regex=False)

* Some acronyms: Let's do this step by step.

In [None]:
# each column name split into words:
[name.split() for name in hdiRaw.columns]

In [None]:
# first letter of each word
[[word[0] for word in name.split()] for name in hdiRaw.columns]

In [None]:
# final result
[''.join([word[0] for word in name.split()]) for name in hdiRaw.columns]

Let's keep the first alternative:

In [None]:
hdiRaw.columns=hdiRaw.columns.str.replace(" ",'',regex=False)

Finally...

In [None]:
hdiRaw

[Home](#home)
______

<a id='lookfornas'></a>

### Look for non-standard missing values

First, check whether a column has cells made up entirely of non-word characters (neither letters, digits, nor underscores):

In [None]:
# full match!
[hdiRaw.iloc[:,1].str.fullmatch(r"\W+",na=False)]

The result above tells you, row by row, whether there is a full match (True/False). You can use it to keep only the rows where it is _True_:

In [None]:
# a quick look...
hdiRaw.iloc[:,1][hdiRaw.iloc[:,1].str.fullmatch(r"\W+",na=False)]

Let's do this for every column:

In [None]:
i=0
hdiRaw.iloc[:,i][hdiRaw.iloc[:,i].str.fullmatch(r"\W+",na=False)]

In [None]:
i=1
hdiRaw.iloc[:,i][hdiRaw.iloc[:,i].str.fullmatch(r"\W+",na=False)]

Using **for** loop:

In [None]:
for i in range(hdiRaw.shape[1]):
    print(hdiRaw.iloc[:,i][hdiRaw.iloc[:,i].str.fullmatch(r"\W+",na=False)])
# you might get an error!

Using **try**:

In [None]:
for i in range(hdiRaw.shape[1]):
    try:
        print(hdiRaw.iloc[:,i][hdiRaw.iloc[:,i].str.fullmatch(r"\W+",na=False)])
    except AttributeError: # raised when .str is used on a non-text column
        pass

This means that the people who created this data set used ".." to represent **missing values**. Let's replace those values:

In [None]:
# replacing !
hdiRaw.replace(to_replace=[".."],
               value=None,
               inplace=True)

#result
hdiRaw
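
The search-and-replace above can also be written without **try**, by visiting only the columns that pandas stores as **object**. This is a minimal sketch on made-up data; the toy table is an assumption, and it replaces with NaN (instead of None, as done above) only to keep the example short.

```python
import pandas as pd

# toy table (made-up values)
toy = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Carpania"],
    "Index": [0.95, "..", 0.62],     # mixed column: stored as object
    "Years": [12.5, 10.1, 8.3],      # purely numeric column
})

# collect every cell made up only of non-word characters
symbols = set()
for name in toy.select_dtypes(include="object").columns:
    col = toy[name]
    hits = col[col.str.fullmatch(r"\W+", na=False)]
    symbols.update(hits)

print(symbols)

# replace whatever was found with NaN
toyClean = toy.replace(to_replace=list(symbols), value=float("nan"))
print(toyClean["Index"].isna().sum())
```

Collecting the symbols in a set first lets you review them before replacing anything.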

[Home](#home)
______

<a id='cleancellvalues'></a>

### Cleaning cell values

Do the current cell values have issues?

* Keeping complete data

In [None]:
# with all missing (after the first column)
hdiRaw[hdiRaw.iloc[:,1:].isna().all(axis=1)]

In [None]:
# with at least one missing (after the first column)
hdiRaw[hdiRaw.iloc[:,1:].isna().any(axis=1)]

The next code will keep only the rows with complete data, and save them as a new data frame:

In [None]:
hdiComplete=hdiRaw[~hdiRaw.iloc[:,1:].isna().any(axis=1)].copy()
#
hdiComplete
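
The same filter can be expressed with **dropna**. The sketch below, on a made-up table (an assumption for illustration), checks that both ways return the same rows.

```python
import pandas as pd

# toy table (made-up values)
toy = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Carpania"],
    "Index": [0.95, None, 0.62],
    "Years": [12.5, 10.1, None],
})

# every column after the first one
valueCols = list(toy.columns[1:])

# boolean-mask version, as in the cell above
byMask = toy[~toy.iloc[:, 1:].isna().any(axis=1)].copy()

# dropna version: drop a row when any of the value columns is missing
byDropna = toy.dropna(subset=valueCols, how="any").copy()

print(byMask.equals(byDropna))
```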

* Making sure columns of _text_ are clean:

In [None]:
# get rid of leading and trailing spaces in text cells
hdiComplete.Country=hdiComplete.Country.str.strip()

* Checking  numeric columns

In [None]:
hdiComplete.iloc[:,1:].info()

Numbers have been recognised as **object** type. This happens when a column holds at least one non-numeric value, or when it **held** one at the time the file was read (like the ".." we replaced).

In [None]:
# can you apply math?
hdiRaw.iloc[:,1:].max()

You just need to convert these columns to a numeric type.

In [None]:
hdiClean=hdiComplete.copy()

[Home](#home)
______

<a id='formatdtypes'></a>

### Formatting

From above, we just need to format the numeric columns:

* **Formatting into numeric type**:

In [None]:
# as easy as:
hdiClean[hdiClean.columns[1:]]=hdiClean.iloc[:,1:].apply(pd.to_numeric)

In [None]:
#recheck
hdiClean.info()

That was easy!
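
It will not always be that easy: **pd.to_numeric** raises an error as soon as it meets a cell it cannot read. A minimal sketch of how you could locate such cells, using `errors="coerce"` on a made-up column (the values are assumptions for illustration):

```python
import pandas as pd

# toy column (made-up values): two cells are not valid numbers
col = pd.Series(["0.962", "0.805", "n/a", "1,250"], name="toy")

# errors="coerce" turns anything unreadable into NaN instead of raising
asNumber = pd.to_numeric(col, errors="coerce")

# cells that were present but failed to convert
stragglers = col[asNumber.isna() & col.notna()]
print(stragglers.to_list())
```

Once you know which cells fail, you can clean them (as in the previous sections) and convert again.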

In [None]:
hdiFormat=hdiClean.copy()

[Home](#home)
______


## 3. Integrating

<a id='merging'></a>

### Basic merging

Our data is clean and formatted (to the best of our knowledge). Let's bring in the second table, stored as a pickle file:

In [None]:
demoFormat=pd.read_pickle("demoFormat.pkl")

Take a look at the column names:

In [None]:
demoFormat.columns

In [None]:
hdiFormat.columns

If we are confident we did a good cleaning and formatting, this step should be easy:

In [None]:
# left_on= / right_on NOT NEEDED (only when column names differ)
hdiFormat.merge(demoFormat,left_on='Country', right_on='Country')

Notice the number of rows **returned above**, and compare it with the number of rows in each data frame:

In [None]:
len(hdiFormat),len(demoFormat)

The smaller of these two numbers is the most rows you can expect from this merge (as long as each country appears only once per table). Let's check the key values that were not matched:


In [None]:
onlyHDI=set(hdiFormat.Country)-set(demoFormat.Country)
onlyDEMO=set(demoFormat.Country)-set(hdiFormat.Country)

In [None]:
onlyHDI

In [None]:
onlyDEMO
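
pandas can also report the unmatched keys during the merge itself. This is a small sketch with two made-up tables (the names and values are assumptions), using an outer merge with `indicator=True`:

```python
import pandas as pd

# toy tables (made-up values)
left = pd.DataFrame({"Country": ["Aland", "Borduria", "Carpania"],
                     "hdi": [0.9, 0.8, 0.6]})
right = pd.DataFrame({"Country": ["Aland", "Borduria", "Dalmatia"],
                      "demo": [9.1, 5.5, 7.0]})

# outer merge keeps every key; _merge tells where each row came from
both = left.merge(right, on="Country", how="outer", indicator=True)

onlyLeft = set(both.loc[both["_merge"] == "left_only", "Country"])
onlyRight = set(both.loc[both["_merge"] == "right_only", "Country"])

print(onlyLeft, onlyRight)
```

The result is the same as the set differences computed above; the indicator column is just convenient when you want to inspect the unmatched rows, not only their keys.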

[Home](#home)
______


<a id='fuzzmerging'></a>

### Fuzzy Merge

The previous objects (onlyDEMO, onlyHDI) hold the values of each data frame that found no match in the other. 
If you want to recover some of these values, you may follow these steps (you may need to install **thefuzz**):

In [None]:
from thefuzz import process as fz

# take a country from onlyDEMO
# and get the country that matches the most in onlyHDI

[(fz.extractOne(demo, onlyHDI),demo) for demo in sorted(onlyDEMO)]

Not every proposed match will be right, so you just need to keep the 'safe' matches (here, those with a score of 90 or more):

In [None]:
[(fz.extractOne(demo, onlyHDI),demo) for demo in sorted(onlyDEMO) \
 if fz.extractOne(demo, onlyHDI)[1]>=90]

The next step is to replace the cell values in one of the data frames.
For that, you need to create a **dictionary of changes**:

In [None]:
# this dictionary is prepared for HDI data:
{fz.extractOne(demo, onlyHDI)[0]:demo for demo in sorted(onlyDEMO) \
 if fz.extractOne(demo, onlyHDI)[1]>=90}

In [None]:
# NOW create the dict and make the changes
changesHDI={fz.extractOne(demo, onlyHDI)[0]:demo \
            for demo in sorted(onlyDEMO) \
            if fz.extractOne(demo, onlyHDI)[1]>=90}

# replace in HDI (assigning back; inplace on a selected column may not update the data frame)

hdiFormat['Country']=hdiFormat.Country.replace(to_replace=changesHDI)

In [None]:
# did you get more rows?
hdiFormat.merge(demoFormat)
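
If you cannot install **thefuzz**, the idea can be tried with **difflib**, which comes with Python. This is only a sketch with a swapped-in tool: difflib scores go from 0 to 1 (not 0 to 100) and are computed differently, so the matches will not be identical to the ones above. The country sets below are made up for illustration.

```python
from difflib import get_close_matches

# toy sets of unmatched names (assumed values)
onlyHDI = {"Côte d'Ivoire", "Korea (Republic of)", "Türkiye"}
onlyDEMO = {"Cote d'Ivoire", "South Korea", "Turkey"}

# dictionary of changes prepared for the HDI names
changes = {}
for demo in sorted(onlyDEMO):
    # best candidate in onlyHDI, kept only if similarity is at least 0.9
    best = get_close_matches(demo, sorted(onlyHDI), n=1, cutoff=0.9)
    if best:
        changes[best[0]] = demo

print(changes)

# applying the changes to a list of HDI names
hdiCountries = ["Côte d'Ivoire", "Türkiye"]
renamed = [changes.get(name, name) for name in hdiCountries]
print(renamed)
```

The names that stay below the cutoff still need a manual decision, just as with thefuzz.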

If you redo this process, you may recover more rows. I will not do it here, but you are welcome to. 

In [None]:
# hint: start with these two lines!
onlyHDI=set(hdiFormat.Country)-set(demoFormat.Country)
onlyDEMO=set(demoFormat.Country)-set(hdiFormat.Country)

As we stop after one fuzzy-merge iteration, the data frame to use from now on will be:

In [None]:
hdidem=hdiFormat.merge(demoFormat)

The format should still be good:

In [None]:
hdidem.info()

[Home](#home)
______


## 4. Prepare file for further work

<a id='scaling'></a>

###  Scaling

It would be good to check the range of values of your numeric data. You can simply use **describe** (just requesting _min_ and _max_):

In [None]:
hdidem.describe().loc[['min','max']].T #T for transposing

As the ranges differ a lot, it would be good to request a **boxplot** (make sure to install **matplotlib** if not previously installed):

In [None]:
import matplotlib.pyplot as plt

hdidem.plot(kind='box', rot=90,fontsize=5)
plt.semilogy();

Notice that our concern is the numeric data. Scaling is rarely a concern for categorical data, although some cases might need some thinking.

Let me get the column names of the numeric columns:

In [None]:
import numpy as np

colsToScale = hdidem.select_dtypes([np.number]).columns

colsToScale

Time to produce new ranges (make sure you have previously installed **scikit-learn**):

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df_minmax = scaler.fit_transform(hdidem.loc[:,colsToScale].to_numpy())
# reuse the index of hdidem so rows line up when concatenating later
df_scaled = pd.DataFrame(df_minmax, columns=colsToScale, index=hdidem.index)

Let's explore the result:

In [None]:

df_scaled.describe().loc[['min','max']].T 


In [None]:
df_scaled.plot(kind='box', rot=90,fontsize=5);
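
To see what this min-max scaling does, here is the same computation by hand: each value becomes (x - min) / (max - min) of its column. This is a sketch on a made-up table (the values are assumptions), not a replacement for the scaler.

```python
import pandas as pd

# toy table (made-up values)
toy = pd.DataFrame({"a": [10.0, 20.0, 30.0],
                    "b": [0.2, 0.4, 1.0]})

# (x - min) / (max - min), column by column
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
print(scaled.describe().loc[["min", "max"]].T)
```

Every column now goes from 0 to 1. Notice that, done by hand, a column with a single repeated value would divide zero by zero and return missing values.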

Let's add a suffix to the column names:

In [None]:
df_scaled.columns=df_scaled.columns+"_mM"

In [None]:
# concat to the right (instead of bottom) with axis=1
pd.concat([hdidem,df_scaled],axis=1)

So this is our last version:

In [None]:
hdidem_plus=pd.concat([hdidem,df_scaled],axis=1)
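
Keep in mind that **pd.concat** with `axis=1` pairs rows by their index, not by their position. The sketch below, with made-up data frames (assumptions for illustration), shows what happens when the indexes differ.

```python
import pandas as pd

# toy data frame whose index does not start at 0
base = pd.DataFrame({"x": [1, 2, 3]}, index=[10, 11, 12])

# same values, default index (0, 1, 2) versus the index of base
extraDefault = pd.DataFrame({"x_mM": [0.0, 0.5, 1.0]})
extraAligned = pd.DataFrame({"x_mM": [0.0, 0.5, 1.0]}, index=base.index)

misaligned = pd.concat([base, extraDefault], axis=1)
aligned = pd.concat([base, extraAligned], axis=1)

print(misaligned.shape, aligned.shape)
```

When the indexes do not coincide you get extra rows full of missing values, so it is worth checking the shape of the result after concatenating.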

[Home](#home)
______


<a id='exporting'></a>

### Exporting

#### For future use in Python:

In [None]:
hdidem_plus.to_pickle("hdidem_plus.pkl")
# you will need: DF=pd.read_pickle("hdidem_plus.pkl")
# or:
# from urllib.request import urlopen
# DF=pd.read_pickle(urlopen("https://...../hdidem_plus.pkl"),compression=None)
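
A pickle file keeps the data types you worked so hard to get. Here is a minimal sketch that checks it with a made-up data frame written to a temporary folder (the table and file name are assumptions for illustration):

```python
import os
import tempfile
import pandas as pd

# toy data frame with an ordered categorical column (made-up values)
toy = pd.DataFrame({
    "Country": ["Aland", "Borduria"],
    "Index": [0.95, 0.80],
    "Level": pd.Categorical(["high", "low"],
                            categories=["low", "high"], ordered=True),
})

# write and read back inside a temporary folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy.pkl")
    toy.to_pickle(path)
    back = pd.read_pickle(path)

print(back.dtypes)
print(back.equals(toy))
```

Remember to read pickle files only from sources you trust, since unpickling can run arbitrary code.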

#### For future use in R:

In [None]:
!pip show rpy2

In [None]:
from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
base.saveRDS(hdidem_plus,file="hdidem_plus.RDS")

#In R, you call it with: DF = readRDS("hdidem_plus.RDS")
#or, if read from cloud: DF = readRDS(url("https://...../hdidem_plus.RDS"))