<br> 
<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>


### **Prof. José Manuel Magallanes, PhD**

* Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

* Visiting Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, [magajm@uw.edu](mailto:magajm@uw.edu)
_____


## Collect data tables into Python from the web

* **From a table on the web**:

In [None]:
linkWeb = "https://en.wikipedia.org/wiki/Democracy_Index"

Reading in a table from the web using pandas requires **lxml**:

In [None]:
# available?
!pip show lxml

If it is not available, please go to Anaconda and install it. Once it is installed, or if it was already available, continue:

In [None]:
import pandas as pd

demoWeb=pd.read_html(linkWeb,
                  header=0,
                  attrs={'class': 'wikitable sortable'})

Notice that **demoWeb** is not a data frame, but a list of data frames (one per table found):

In [None]:
type(demoWeb)

In [None]:
#how many?
len(demoWeb)

In [None]:
# is it the last one?
demoWeb[2]

As we found the data frame, we create a new object, _demoRaw_, to keep it:

In [None]:
demoRaw=demoWeb[2].copy()
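
Picking the table by position works only while the page keeps its tables in the same order. As a minimal sketch, the table could instead be located by the columns it is expected to have. The list `tables`, its contents, and the helper `find_table` are invented for this illustration; they stand in for the list that `pd.read_html` returns.

```python
import pandas as pd

# Invented stand-ins for the list returned by pd.read_html:
# each element is one table found on the page.
tables = [
    pd.DataFrame({"Type of regime": ["Group 1"], "Countries": [10]}),
    pd.DataFrame({"Year": [1, 2], "World": [0.5, 0.6]}),
    pd.DataFrame({"Country": ["A", "B"],
                  "Regime type": ["Full democracy", "Hybrid regime"]}),
]

def find_table(frames, required_columns):
    """Return the position of the first data frame that has all required columns."""
    for position, frame in enumerate(frames):
        if set(required_columns).issubset(frame.columns):
            return position
    raise ValueError("No table has the columns: " + ", ".join(required_columns))

position = find_table(tables, ["Country", "Regime type"])
chosen = tables[position].copy()
print(position, chosen.shape)
```

If no table matches, the helper raises an error instead of silently returning the wrong table.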

## Pre Processing

### Subset data

Just keep what you need; let's check the data head and tail:

In [None]:
demoRaw.head()

In [None]:
demoRaw.tail()

The data seems ok so far.

### Fix column names



In [None]:
demoRaw.columns

Notice above that the columns:
* Have weird names.
* Have spaces between words.
* A couple of them represent rankings.

Let's see what can be done:

* Weird symbols and rank:

In [None]:
import re
[header for header in demoRaw.columns if re.search('Rank|rank|Δ',header)]

In [None]:
# getting rid of those:
byHeaders=[header for header in demoRaw.columns if re.search('Rank|rank|Δ',header)]
# then
demoRaw.drop(columns=byHeaders, inplace=True)
# now
demoRaw
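
The same filter can be tried on a small frame to see exactly which headers it selects. This is only a sketch: the frame `toy` and its values are invented, and `re.IGNORECASE` is used here in place of listing `Rank|rank`, so it is slightly broader than the pattern above (it would also match `RANK`).

```python
import re
import pandas as pd

# Invented headers and values, only to illustrate the filter.
toy = pd.DataFrame({"Rank": [1, 2],
                    "Country": ["A", "B"],
                    "Δ Rank": [0, 1],
                    "Overall score": [9.1, 3.2]})

# Compile once; match 'rank' in any letter case, or the delta symbol.
pattern = re.compile(r"rank|Δ", flags=re.IGNORECASE)
to_drop = [header for header in toy.columns if pattern.search(header)]

# Without inplace=True, drop returns a new frame and leaves 'toy' untouched.
kept = toy.drop(columns=to_drop)
print(to_drop)
print(kept.columns.to_list())
```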


Some weird symbols remain:

In [None]:
# we still have problems
demoRaw.columns.to_list()

In [None]:
# \W: any character that is not a letter, a digit, or an underscore
demoRaw.columns.str.replace(r'\W',"",regex=True).to_list()

In [None]:
# then
demoRaw.columns=demoRaw.columns.str.replace(r'\W',"",regex=True)
demoRaw
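
To see what `\W` removes and what it keeps, here is a small sketch on invented headers. They include a space, a soft hyphen (written as `\xad`), square brackets, and an underscore; these names are assumptions for illustration, not the actual headers of the table.

```python
import pandas as pd

# Invented headers: a space, a soft hyphen (\xad), brackets, and an underscore.
headers = pd.Index(["Regime type", "Elec\xadtoral process",
                    "Score[a]", "civil_liberties"])

# \W removes everything that is not a letter, a digit, or an underscore.
cleaned = headers.str.replace(r"\W", "", regex=True)
print(cleaned.to_list())
```

Notice that the underscore survives, because it counts as a word character.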

### Look for non-standard missing values

First, look for cells made up entirely of non-word characters (symbols such as dashes or question marks):

In [None]:
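for i in range(demoRaw.shape[1]):
    try:
        # rows whose cell in column i has only non-word characters
        print(demoRaw.iloc[:,i][demoRaw.iloc[:,i].str.fullmatch(r"\W+",na=False)])
    except AttributeError:
        # non-text columns have no .str accessor: skip them
        pass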

No weird symbols in the cells!
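
The same scan can be sketched without `try`/`except` by converting each column to text first. The frame `toy` and the helper `symbol_only_cells` are invented for this illustration; the helper returns a dictionary so that the result can be inspected instead of only printed.

```python
import pandas as pd

# Invented data: two cells hold only symbols ("--" and "?").
toy = pd.DataFrame({
    "Country": ["A", "--", "C"],
    "Score": ["6.5", "?", "8.1"],
    "Year": [2020, 2021, 2022],
})

def symbol_only_cells(frame):
    """Map each column name to its cells made up only of non-word characters."""
    found = {}
    for name in frame.columns:
        column = frame[name]
        # Converting to text makes .str available for every column;
        # numbers such as 2020 become "2020" and never match \W+.
        mask = column.astype("string").str.fullmatch(r"\W+", na=False)
        if mask.any():
            found[name] = column[mask].to_list()
    return found

suspicious = symbol_only_cells(toy)
print(suspicious)
```

An empty dictionary would mean that no cell consists only of symbols.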

### Cleaning cell values

In [None]:
# text column
demoRaw.Country=demoRaw.Country.str.strip()

In [None]:
# categorical column!
demoRaw.Regimetype.value_counts()

The categorical column has wrong levels. If you look at the website, you will see they were titles inside the table. Let's get rid of them:

In [None]:
FreqTable=demoRaw.Regimetype.value_counts()
#then
FreqTable[FreqTable==1]

In [None]:
# these are the wrong levels
wrongLevels=FreqTable[FreqTable==1].index
#
wrongLevels

In [None]:
demoRaw[demoRaw.Regimetype.isin(wrongLevels)]

In [None]:
#keeping what you need
demoRaw=demoRaw[~demoRaw.Regimetype.isin(wrongLevels)]
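
Dropping every level that appears only once assumes that no valid regime type has a single country. As a hedged alternative, the rows could be kept only when their level belongs to a list of expected levels. In this sketch the frame `toy` is invented, and `expected_levels` assumes the four regime types are the only valid ones.

```python
import pandas as pd

# Invented data: one row is a title that slipped into the table.
toy = pd.DataFrame({
    "Country": ["A", "Group title", "B", "C", "D"],
    "Regimetype": ["Full democracy", "Group title", "Hybrid regime",
                   "Full democracy", "Hybrid regime"],
})

# Assumption: these four levels are the only valid ones.
expected_levels = ["Authoritarian", "Hybrid regime",
                   "Flawed democracy", "Full democracy"]

is_valid = toy.Regimetype.isin(expected_levels)
rejected = toy[~is_valid]                      # inspect before discarding
kept = toy[is_valid].reset_index(drop=True)    # tidy index after filtering
print(rejected)
print(kept)
```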

* Check the numeric columns:

In [None]:
# anything that does not look like a number (digits, optionally with a decimal point)?
demoRaw[~demoRaw.iloc[:,2].str.fullmatch(r"\d+\.?\d*",na=False)]

There seems to be no problem there.
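
A regular expression is one way to spot text that is not a number. Another option, sketched here on an invented Series, is to let `pd.to_numeric` try the conversion with `errors="coerce"` and then look at the values that could not be converted.

```python
import pandas as pd

# Invented values: two of them are not valid numbers, one is missing.
scores = pd.Series(["7.50", "n/a", "3", "2.1a", None])

# errors="coerce" turns anything that cannot be parsed into NaN.
converted = pd.to_numeric(scores, errors="coerce")

# Cells that had a value but became NaN are the problematic ones.
not_numeric = scores[converted.isna() & scores.notna()]
print(not_numeric.to_list())
```

The original missing value is not reported, since it was already missing before the conversion.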

Will this data need formatting?

In [None]:
demoRaw.info()

Data is clean, but needs formatting:

In [None]:
demoClean=demoRaw.copy()

### Formatting

As we saw above, _Country_ can remain as an object (text), but not the rest.

* **Formatting into numeric type**:

In [None]:
# as easy as:
demoClean[demoClean.columns[2:]]=demoClean.iloc[:,2:].apply(pd.to_numeric)

In [None]:
#recheck
demoClean.info()
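
Selecting the columns by position works as long as the column order does not change. As a sketch of an alternative, the text columns can be named and everything else converted. The frame `toy`, its values, and its column names are invented for this illustration.

```python
import pandas as pd

# Invented data: every column arrives as text.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Regimetype": ["Full democracy", "Hybrid regime"],
    "Overallscore": ["9.1", "5.4"],
    "Civilliberties": ["9.0", "6"],
})

# Name the columns that must stay as text; convert the others.
text_columns = ["Country", "Regimetype"]
numeric_columns = [name for name in toy.columns if name not in text_columns]

toy[numeric_columns] = toy[numeric_columns].apply(pd.to_numeric)
print(toy.dtypes)
```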

* **Formatting into ordinal**

In [None]:
# Check current levels:
pd.unique(demoClean.Regimetype).tolist()

In [None]:
#rewrite the levels in ascending order:
correctLevels=['Authoritarian', 'Hybrid regime', 'Flawed democracy','Full democracy']

In [None]:
#format as ordinal:
demoClean.Regimetype=pd.Categorical(demoClean.Regimetype,categories=correctLevels,ordered=True)

The data types have changed:

In [None]:
#then
demoClean.info()

For more detail:

In [None]:
demoClean.Regimetype.cat.ordered

In [None]:
demoClean.Regimetype.head()
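
What the ordering allows can be seen in a short sketch. The Series `regime` is invented; it includes a label that is not among the levels on purpose, to show that such values turn into missing values when the categorical is built.

```python
import pandas as pd

levels = ["Authoritarian", "Hybrid regime", "Flawed democracy", "Full democracy"]

# Invented values; "Unknown label" is not one of the levels.
regime = pd.Series(pd.Categorical(
    ["Full democracy", "Authoritarian", "Hybrid regime", "Unknown label"],
    categories=levels, ordered=True))

n_missing = regime.isna().sum()             # labels outside the levels become NaN
at_least_hybrid = regime >= "Hybrid regime" # comparisons follow the level order
lowest = regime.min()                       # min/max are defined for ordinals
sorted_values = regime.sort_values().to_list()

print(n_missing, lowest)
print(at_least_hybrid.to_list())
print(sorted_values)
```

This is why it helps to check the current levels before formatting: a misspelled level would silently become a missing value.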

In [None]:
demoFormat=demoClean.copy()

## Saving

#### For future use in Python:

In [None]:
demoFormat.to_pickle("demoFormat.pkl")
# you will need: DF=pd.read_pickle("demoFormat.pkl")
# or:
# from urllib.request import urlopen
# DF=pd.read_pickle(urlopen("https://...../demoFormat.pkl"),compression=None)
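
A pickle file keeps the formatting done above, while a plain CSV file does not. The following sketch compares both on an invented frame, writing to a temporary folder so that no file is left behind; the file names are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

levels = ["Authoritarian", "Hybrid regime", "Flawed democracy", "Full democracy"]

# Invented data with an ordinal column.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Regimetype": pd.Categorical(["Full democracy", "Authoritarian"],
                                 categories=levels, ordered=True),
    "Overallscore": [9.1, 2.3],
})

with tempfile.TemporaryDirectory() as folder:
    pkl_path = os.path.join(folder, "toy.pkl")
    csv_path = os.path.join(folder, "toy.csv")
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle keeps the ordinal; the CSV brings it back as plain text.
print(from_pickle.Regimetype.cat.ordered)
print(isinstance(from_csv.Regimetype.dtype, pd.CategoricalDtype))
```

Keep in mind that pickle files should only be read when they come from a source you trust.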