<br> 
<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>


### Prof. José Manuel Magallanes, PhD

* Visiting Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, [magajm@uw.edu](mailto:magajm@uw.edu)

* Associate Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

_____

_____

<a id='home'></a>

# Introduction to Python

* DATA STRUCTURES
    - [Basic Data Structures](#basic_ds).

    - [Complex Data Structures](#complex_ds).

* DATA PREPROCESSING
    - [Data cleaning and formatting](#cleanformat).

    - [Integration and Saving](#integratesave).


<a id='basic_ds'></a>

# Basic Data Structures

Python has basic native structures, like lists, tuples and dictionaries.

## a.  **LISTS** 

Lists are the most flexible structure to save or contain data elements.

In [192]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
ages=[32,33,28,30,29]
country=["China", "Senegal", "España", "Norway","Korea"]
education=["Bach", "Bach", "Master", "PhD","PhD"]

Above we have created some lists. Lists can contain any values. Lists support different operations:

* **Accessing**:

Keep in mind that positions in Python start at **0**.

In [193]:
# one element
ages[0]

32

In [194]:
# several, using slices:
ages[1:-1] # from the second to the next-to-last

[33, 28, 30]

In [195]:
# several, using slices:
ages[:-2] # all but the last two

[32, 33, 28]

In [196]:
# non consecutive
from operator import itemgetter
list(itemgetter(0,2,3)(ages))

[32, 28, 30]

In [197]:
# difficult to understand?
ages[0:4:2] + [ages[3]]

[32, 28, 30]
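
The slice notation used above follows the pattern `start:stop:step`, where `stop` is excluded and negative numbers count from the end. The following is a minimal sketch on a list with the same values as `ages`; the variable names are only illustrative.

```python
# Same values as the 'ages' list created above
ages = [32, 33, 28, 30, 29]

# list[start:stop:step] -> 'stop' is excluded
first_three = ages[:3]       # positions 0, 1, 2
every_other = ages[0:4:2]    # positions 0 and 2
last_two = ages[-2:]         # negative numbers count from the end

# arbitrary positions, using a list comprehension instead of itemgetter
positions = [0, 2, 3]
picked = [ages[i] for i in positions]

print(first_three, every_other, last_two, picked)
```

The list comprehension is one alternative to `itemgetter` when the positions are not consecutive; it needs no import and returns a list directly.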

* **Modifying**:

In [198]:
# by position
country[2]="Spain"

# list changed:
country

['China', 'Senegal', 'Spain', 'Norway', 'Korea']

In [199]:
# by value
country=["PR China" if x == "China" else x for x in country]

# list changed:
country

['PR China', 'Senegal', 'Spain', 'Norway', 'Korea']

* **Deleting**

In [200]:
# by position
del country[-1] #last value

# list changed:
country

['PR China', 'Senegal', 'Spain', 'Norway']

In [201]:
# by position
names.pop() #last value by default

# list changed:
names

['Qing', 'Françoise', 'Raúl', 'Bjork']

In [202]:
# only 'del' works for several positions

lista=[1,2,3,4,5,6]
del lista[1:3]

#now:
lista

[1, 4, 5, 6]

In [203]:
# by value
ages.remove(29) 

# list changed:
ages # just the first occurrence of the value!!

[32, 33, 28, 30]

In [204]:
# by value
education.remove('PhD') 

# list changed:
education # just the first occurrence!!

['Bach', 'Bach', 'Master', 'PhD']

In [205]:
# deleting every occurrence of a value:

lista=[1,'a',45,'b','a']
lista=[x for x in lista if x!='a']

# you get:
lista

[1, 45, 'b']
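
As a recap of the deleting operations, here is a small sketch that contrasts `remove`, `pop` and a list comprehension on a copy of the `education` values. The value "Doctorate" is a made-up example of something that is not in the list.

```python
education = ["Bach", "Bach", "Master", "PhD", "PhD"]

# remove() deletes only the first match
education.remove("PhD")

# remove() raises ValueError if the value is absent, so check first
if "Doctorate" in education:
    education.remove("Doctorate")

# pop() deletes by position AND returns the deleted value
removed = education.pop(2)

# a list comprehension builds a NEW list without any 'Bach'
no_bach = [x for x in education if x != "Bach"]

print(education, removed, no_bach)
```

Note that `remove` and `pop` change the list in place, while the list comprehension leaves the original list untouched unless you reassign the name.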

* **Inserting values**

In [206]:
# at the end
lista.append("abc")
lista

[1, 45, 'b', 'abc']

In [207]:
# PART ONE:
# first delete a position
education.pop(2)
education

['Bach', 'Bach', 'PhD']

In [208]:
# PART TWO:
# now insert in that position
education.insert(2,"Master")
education

['Bach', 'Bach', 'Master', 'PhD']
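
Besides `append` and `insert`, lists also have `extend`, which is easy to confuse with `append`. A short sketch of the difference, using toy values:

```python
lista = [1, 45, "b"]

lista.append("abc")        # adds ONE element at the end
lista.extend([7, 8])       # adds EACH element of another list at the end
lista.insert(0, "first")   # adds one element at the given position

nested = [1, 2]
nested.append([3, 4])      # append adds the whole list as a single element

print(lista)
print(nested)
```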

## b.  **TUPLES**

Tuples are immutable structures in Python. They look like lists but share little of their functionality:

In [209]:
# new tuple:
weekend=("Friday", "Saturday", "Sunday")

You can access:

In [210]:
weekend[0]

'Friday'

But you cannot modify a tuple: changing, deleting, or inserting elements is not allowed.

Python itself uses tuples as output of some important functions:

In [211]:
zip(names,ages)

<zip at 0x7fdb4eb04140>

The **zip** function creates tuples by combining the elements of its inputs in parallel. You can see them if you turn the result into a list:

In [212]:
list(zip(names,ages))  # a list of tuples

[('Qing', 32), ('Françoise', 33), ('Raúl', 28), ('Bjork', 30)]
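
The sketch below illustrates what immutability means in practice, and how the tuples produced by `zip` are usually unpacked in a loop. The flag `changed` and the list `labels` are names introduced only for this example.

```python
weekend = ("Friday", "Saturday", "Sunday")

# reading works as in lists
first_day = weekend[0]
last_two = weekend[1:]

# modifying raises TypeError
try:
    weekend[0] = "Thursday"
    changed = True
except TypeError:
    changed = False

# each tuple from zip can be unpacked into two names
names = ["Qing", "Françoise"]
ages = [32, 33]
labels = [f"{name} ({age})" for name, age in zip(names, ages)]

print(changed, labels)
```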

[home](#home)

______


<a id='complex_ds'></a>

# Complex Data Structures

## a. **DICTIONARIES**  

*Dicts* work in a more sophisticated way, as they have a **'key'**:**'value'** structure:

In [213]:
classroom={'student':names,'age':ages,'edu':education}
# see it:

classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork'],
 'age': [32, 33, 28, 30],
 'edu': ['Bach', 'Bach', 'Master', 'PhD']}

Dicts do not use indexes to access values:

In [214]:
#classroom[0]  # uncommenting this line raises KeyError, because 0 is not a key

Dicts use keys:

In [215]:
classroom['student']

['Qing', 'Françoise', 'Raúl', 'Bjork']

Notice I created a dictionary where the value is not ONE but a LIST of values.

Once you access a value, you can modify it. You can also use _pop_ or _del_ with the **keys**. But you cannot use _append_ to add a new key; you need **update** (or a direct assignment such as `classroom['country']=country`):

In [216]:
classroom.update({'country':country})
# now:
classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork'],
 'age': [32, 33, 28, 30],
 'edu': ['Bach', 'Bach', 'Master', 'PhD'],
 'country': ['PR China', 'Senegal', 'Spain', 'Norway']}
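
The dictionary operations mentioned in this section can be sketched together on a smaller toy dictionary. The default text "no such key" is just an illustrative choice.

```python
classroom = {"student": ["Qing", "Raúl"], "age": [32, 28]}

# positional access fails: 0 is not a key
try:
    classroom[0]
    found = True
except KeyError:
    found = False

# get() returns a default instead of raising an error
edu = classroom.get("edu", "no such key")

# assignment adds (or replaces) a key, like update() with one pair
classroom["country"] = ["PR China", "Spain"]

# pop() removes a key and returns its value
ages = classroom.pop("age")

# the value here is a list, so list methods work on it
classroom["student"].append("Marie")

print(classroom)
```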

## b. DATA FRAMES

**Data frames**  are more complex containers of values. The most common analogy is a spreadsheet. To create a data frame, we need to call **pandas**:

In [217]:
import pandas

We can prepare a data frame from a dictionary immediately, but ONLY if every list representing a column has the same number of elements.

In [218]:
# our data frame:
students=pandas.DataFrame(classroom)
## see it:
students

Unnamed: 0,student,age,edu,country
0,Qing,32,Bach,PR China
1,Françoise,33,Bach,Senegal
2,Raúl,28,Master,Spain
3,Bjork,30,PhD,Norway


But, let me update the dictionary with: 

In [219]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
#
classroom.update({'student':names})
#
classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie'],
 'age': [32, 33, 28, 30],
 'edu': ['Bach', 'Bach', 'Master', 'PhD'],
 'country': ['PR China', 'Senegal', 'Spain', 'Norway']}

We have five students, but data for only four of them. So this does not work:

In [220]:
#pandas.DataFrame(classroom)  # uncommenting this line raises ValueError (lists of different length)

In that case, you need this:

In [221]:
#then
students=pandas.DataFrame({key:pandas.Series(value) for key, value in classroom.items()})

# seeing it:
students

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,Spain
3,Bjork,30.0,PhD,Norway
4,Marie,,,
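
Notice that the ages now appear as `32.0` instead of `32`: the missing value is stored as `NaN`, which is a float, so the whole column becomes `float64`. A minimal sketch of both behaviors on a smaller toy dictionary (the flag `built` is only for illustration):

```python
import pandas as pd

classroom = {"student": ["Qing", "Françoise", "Raúl"], "age": [32, 33]}

# lists of different length: the plain constructor raises ValueError
try:
    pd.DataFrame(classroom)
    built = True
except ValueError:
    built = False

# wrapping each list in a Series lets pandas pad the short column with NaN
students = pd.DataFrame({key: pd.Series(value) for key, value in classroom.items()})

print(built)
print(students.dtypes)
print(students["age"].isna().sum())  # number of missing ages
```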


Most pandas users import the library under the alias **pd**, so you will often see code like this:

In [222]:
import pandas as pd # renaming the library

students=pd.DataFrame({key:pd.Series(value) for key, value in classroom.items()})
students

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,Spain
3,Bjork,30.0,PhD,Norway
4,Marie,,,


### Data frame basic operations

In [223]:
# type of structure: list? tuple? data frame?
type(students)

pandas.core.frame.DataFrame

In [224]:
# type of data in data frame column
students.dtypes

student     object
age        float64
edu         object
country     object
dtype: object

In [225]:
# details of data frame
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   student  5 non-null      object 
 1   age      4 non-null      float64
 2   edu      4 non-null      object 
 3   country  4 non-null      object 
dtypes: float64(1), object(3)
memory usage: 288.0+ bytes


In [226]:
# number of rows and columns
students.shape 

(5, 4)

In [227]:
# number of rows:
len(students) 

5

In [228]:
# first rows
students.head(2) # compare with: students.tail(2)

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal


In [229]:
# name of columns
students.columns

Index(['student', 'age', 'edu', 'country'], dtype='object')

If you need the column names as a list:

In [230]:
students.columns.tolist()# or simply: list(students)

['student', 'age', 'edu', 'country']

If you need the values of a column as a list:

In [231]:
students.age.tolist()# or: list(students.age)

[32.0, 33.0, 28.0, 30.0, nan]

### Accessing elements in DF:

The data frames in pandas behave much like in R:

In [232]:
#one particular column
students.student

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [233]:
# or
students['student'] 

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [234]:
# it is not the same as: 
students[['student']] # a data frame, not a column (or series)

Unnamed: 0,student
0,Qing
1,Françoise
2,Raúl
3,Bjork
4,Marie


In [235]:
# this is also a DF
students[['country','student']]

Unnamed: 0,country,student
0,PR China,Qing
1,Senegal,Françoise
2,Spain,Raúl
3,Norway,Bjork
4,,Marie


In [236]:
# and this, using loc:
columnNames=['country','student']
students.loc[:,columnNames]

Unnamed: 0,country,student
0,PR China,Qing
1,Senegal,Françoise
2,Spain,Raúl
3,Norway,Bjork
4,,Marie


In [237]:
## Using positions is very common:
columnPositions=[1,3,0]
students.iloc[:,columnPositions] 

Unnamed: 0,age,country,student
0,32.0,PR China,Qing
1,33.0,Senegal,Françoise
2,28.0,Spain,Raúl
3,30.0,Norway,Bjork
4,,,Marie
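
The difference between `loc` (labels) and `iloc` (positions) is easier to see when the row labels are not 0, 1, 2. The sketch below uses a toy data frame whose index values (10, 20, 30) are hypothetical and chosen only to differ from the positions.

```python
import pandas as pd

students = pd.DataFrame(
    {"student": ["Qing", "Raúl", "Bjork"], "age": [32, 28, 30]},
    index=[10, 20, 30],  # hypothetical labels
)

by_label = students.loc[20, "student"]   # row LABELLED 20
by_position = students.iloc[1, 0]        # SECOND row, FIRST column

# loc also accepts a condition for the rows and a list of column names
older = students.loc[students["age"] > 29, ["student"]]

print(by_label, by_position)
print(older)
```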


### Changing values

If you have a position, you can update values:

In [238]:
students.iloc[4,1]=23 # change is immediate! (no warning)
students

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,Spain
3,Bjork,30.0,PhD,Norway
4,Marie,23.0,,
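
Positions may change if rows are sorted or deleted, so a common alternative is to locate the cell with a condition and a column name using `loc`. A minimal sketch on a toy data frame:

```python
import pandas as pd

students = pd.DataFrame(
    {"student": ["Qing", "Marie"], "age": [32.0, None]}
)

# select the row by condition and the column by name, then assign
students.loc[students["student"] == "Marie", "age"] = 23

print(students)
```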


### Deleting columns

You can modify any values in a data frame, but let me create a **deep** copy of this data frame to play with:

In [239]:
studentsCopy=students.copy()
studentsCopy

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,Spain
3,Bjork,30.0,PhD,Norway
4,Marie,23.0,,


In [240]:
# This is what you want to get rid of:
byeColumns=['edu'] # you can delete more than one

#this is the result
studentsCopy.drop(columns=byeColumns)

Unnamed: 0,student,age,country
0,Qing,32.0,PR China
1,Françoise,33.0,Senegal
2,Raúl,28.0,Spain
3,Bjork,30.0,Norway
4,Marie,23.0,


Notice that the previous result was not saved:

In [241]:
studentsCopy

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,Spain
3,Bjork,30.0,PhD,Norway
4,Marie,23.0,,


In [242]:
#NOW we do
studentsCopy.drop(columns=byeColumns,inplace=True)

In [243]:
#then:
studentsCopy

Unnamed: 0,student,age,country
0,Qing,32.0,PR China
1,Françoise,33.0,Senegal
2,Raúl,28.0,Spain
3,Bjork,30.0,Norway
4,Marie,23.0,
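
An alternative to `inplace=True` is to assign the result of `drop` to a name, which keeps the original data frame available. A short sketch, where `trimmed` is a hypothetical name:

```python
import pandas as pd

students = pd.DataFrame(
    {"student": ["Qing", "Raúl"], "age": [32, 28], "edu": ["Bach", "Master"]}
)

# drop() returns a new data frame; 'students' is not modified
trimmed = students.drop(columns=["edu"])

print(trimmed.columns.tolist())
print(students.columns.tolist())
```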


### Deleting a row

Let me delete a row:

In [244]:
# 'index' selects the row to delete by its label
studentsCopy.drop(index=2,inplace=True) 
studentsCopy

Unnamed: 0,student,age,country
0,Qing,32.0,PR China
1,Françoise,33.0,Senegal
3,Bjork,30.0,Norway
4,Marie,23.0,


As you see, index 2 disappeared. You should then reset the index:

In [245]:
studentsCopy.reset_index(drop=True,inplace=True)
studentsCopy

Unnamed: 0,student,age,country
0,Qing,32.0,PR China
1,Françoise,33.0,Senegal
2,Bjork,30.0,Norway
3,Marie,23.0,
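
The argument `drop=True` in `reset_index` discards the old labels. Without it, pandas keeps them in a new column named `index`. A small sketch of both options on a toy data frame:

```python
import pandas as pd

students = pd.DataFrame({"student": ["Qing", "Françoise", "Raúl", "Bjork"]})

kept = students.drop(index=2)            # labels are now 0, 1, 3

renumbered = kept.reset_index(drop=True) # labels 0, 1, 2; old labels discarded
with_old = kept.reset_index()            # old labels saved in a column 'index'

print(kept.index.tolist())
print(renumbered.index.tolist())
print(with_old.columns.tolist())
```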



[home](#home)
_____

<a id='part2'></a>


## 2.  Data Preprocessing

<a id='beginning'></a>

Preprocessing includes three stages:

a. **Cleaning**: Cleaning requires that every cell has the right value, and that the dataframe has only the contents needed. Having a clean data frame means:

    1. Verify that headers are at the top of data frame, and well written.

    2. Verify that categories levels are well written.
    
<br>
    
b. **Formatting**: Formatting requires:
    
    1. Verifying data types
    
    2. Correcting data types: Numerical, categorical, text, date.
    
<br>
    
c. **Integrating and Saving**: It is the process of combining several data frames into one, and saving the result into a file that can be the input of future processes.


_____

<a id='cleanformat'></a>

## a. CLEANING

Let's start by bringing in a table from Wikipedia:

In [246]:
LINK_to_WIKIPAGE="https://en.wikipedia.org/wiki/Democracy_Index"

democracy=pd.read_html(LINK_to_WIKIPAGE)

The object `democracy` is not a data frame, but a list of data frames (one per table found in the page):

In [247]:
type(democracy)

list

This list has:

In [248]:
len(democracy)

7

We can shorten the list of results by specifying the kind of table we want:

In [249]:
democracy=pd.read_html(LINK_to_WIKIPAGE, attrs={"class":"wikitable sortable"})
len(democracy)

2

Then, our table is:

In [250]:
democracy[0]

Unnamed: 0,Rank,Country,Score,Electoral processand pluralism,Functio­ning ofgovern­ment,Politicalpartici­pation,Politicalculture,Civilliberties,Regimetype,Region[n 1],Changes fromlast year
0,1,Norway,9.87,10.00,9.64,10.00,10.00,9.71,Full democracy,Western Europe,Score: Rank:
1,2,Iceland,9.58,10.00,9.29,8.89,10.00,9.71,Full democracy,Western Europe,Score: Rank:
2,3,Sweden,9.39,9.58,9.64,8.33,10.00,9.41,Full democracy,Western Europe,Score: Rank:
3,4,New Zealand,9.26,10.00,9.29,8.89,8.13,10.00,Full democracy,Asia & Australasia,Score: Rank:
4,5,Finland,9.25,10.00,8.93,8.89,8.75,9.71,Full democracy,Western Europe,Score: 0.11Rank: 3
...,...,...,...,...,...,...,...,...,...,...,...
163,164,Syria,1.43,0.00,0.00,2.78,4.38,0.00,Authoritarian,Middle East & North Africa,Score: Rank: 2
164,165,Central African Republic,1.32,1.25,0.00,1.11,1.88,2.35,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1
165,166,Democratic Republic of the Congo,1.13,0.00,0.00,1.67,3.13,0.88,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1
166,167,North Korea,1.08,0.00,2.50,1.67,1.25,0.00,Authoritarian,Asia & Australasia,Score: Rank:


In [251]:
demodata=democracy[0]

1. **Verify that headers are at the top of data frame, and well written**.

In [252]:
# headers are in the right position
# are they well written?
demodata.columns.to_list()

['Rank',
 'Country',
 'Score',
 'Electoral processand pluralism',
 'Functio\xadning ofgovern\xadment',
 'Politicalpartici\xadpation',
 'Politicalculture',
 'Civilliberties',
 'Regimetype',
 'Region[n 1]',
 'Changes fromlast year']

In [253]:
old=demodata.columns.to_list()

In [254]:
import re

# keep the text before the first whitespace (raw string avoids escape warnings)
old_good=[re.split(r"\s",x)[0] for x in old]
old_good

['Rank',
 'Country',
 'Score',
 'Electoral',
 'Functio\xadning',
 'Politicalpartici\xadpation',
 'Politicalculture',
 'Civilliberties',
 'Regimetype',
 'Region[n',
 'Changes']

In [255]:
# keep the text before the first '['
old_better=[re.split(r"\[",x)[0] for x in old_good]
old_better

['Rank',
 'Country',
 'Score',
 'Electoral',
 'Functio\xadning',
 'Politicalpartici\xadpation',
 'Politicalculture',
 'Civilliberties',
 'Regimetype',
 'Region',
 'Changes']

In [256]:
# both steps at once: split at the first whitespace OR '['
old_better=[re.split(r"\s|\[",x)[0] for x in old]
old_better

['Rank',
 'Country',
 'Score',
 'Electoral',
 'Functio\xadning',
 'Politicalpartici\xadpation',
 'Politicalculture',
 'Civilliberties',
 'Regimetype',
 'Region',
 'Changes']

In [257]:
allOK=[re.sub(r'\W+', '', x) for x in old_better]
allOK

['Rank',
 'Country',
 'Score',
 'Electoral',
 'Functioning',
 'Politicalparticipation',
 'Politicalculture',
 'Civilliberties',
 'Regimetype',
 'Region',
 'Changes']
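
The two regular expression steps above can be wrapped in one small function. This is only a sketch; the function name `clean_header` is hypothetical, and the sample headers are copied from the table above (`\xad` is the invisible soft hyphen that came with the web page).

```python
import re

def clean_header(name):
    """Keep the first word of a header and strip non-alphanumeric characters."""
    first_word = re.split(r"\s|\[", name)[0]  # cut at the first whitespace or '['
    return re.sub(r"\W+", "", first_word)     # drop soft hyphens and other symbols

raw = [
    "Electoral processand pluralism",
    "Functio\xadning ofgovern\xadment",
    "Region[n 1]",
    "Changes fromlast year",
]
cleaned = [clean_header(x) for x in raw]
print(cleaned)
```

Keeping only the first word is a choice made for this particular table; with other tables it could produce duplicated names, so it is worth checking the result before renaming.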

In [258]:
allOK[2:]

['Score',
 'Electoral',
 'Functioning',
 'Politicalparticipation',
 'Politicalculture',
 'Civilliberties',
 'Regimetype',
 'Region',
 'Changes']

In [259]:
demodata.columns[2:]

Index(['Score', 'Electoral processand pluralism', 'Functio­ning ofgovern­ment',
       'Politicalpartici­pation', 'Politicalculture', 'Civilliberties',
       'Regimetype', 'Region[n 1]', 'Changes fromlast year'],
      dtype='object')

In [260]:
dict(zip(demodata.columns[2:],allOK[2:]))

{'Score': 'Score',
 'Electoral processand pluralism': 'Electoral',
 'Functio\xadning ofgovern\xadment': 'Functioning',
 'Politicalpartici\xadpation': 'Politicalparticipation',
 'Politicalculture': 'Politicalculture',
 'Civilliberties': 'Civilliberties',
 'Regimetype': 'Regimetype',
 'Region[n 1]': 'Region',
 'Changes fromlast year': 'Changes'}

In [261]:
changes=dict(zip(demodata.columns[2:],allOK[2:]))
demodata.rename(columns=changes)

Unnamed: 0,Rank,Country,Score,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Region,Changes
0,1,Norway,9.87,10.00,9.64,10.00,10.00,9.71,Full democracy,Western Europe,Score: Rank:
1,2,Iceland,9.58,10.00,9.29,8.89,10.00,9.71,Full democracy,Western Europe,Score: Rank:
2,3,Sweden,9.39,9.58,9.64,8.33,10.00,9.41,Full democracy,Western Europe,Score: Rank:
3,4,New Zealand,9.26,10.00,9.29,8.89,8.13,10.00,Full democracy,Asia & Australasia,Score: Rank:
4,5,Finland,9.25,10.00,8.93,8.89,8.75,9.71,Full democracy,Western Europe,Score: 0.11Rank: 3
...,...,...,...,...,...,...,...,...,...,...,...
163,164,Syria,1.43,0.00,0.00,2.78,4.38,0.00,Authoritarian,Middle East & North Africa,Score: Rank: 2
164,165,Central African Republic,1.32,1.25,0.00,1.11,1.88,2.35,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1
165,166,Democratic Republic of the Congo,1.13,0.00,0.00,1.67,3.13,0.88,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1
166,167,North Korea,1.08,0.00,2.50,1.67,1.25,0.00,Authoritarian,Asia & Australasia,Score: Rank:


In [262]:
demodata

Unnamed: 0,Rank,Country,Score,Electoral processand pluralism,Functio­ning ofgovern­ment,Politicalpartici­pation,Politicalculture,Civilliberties,Regimetype,Region[n 1],Changes fromlast year
0,1,Norway,9.87,10.00,9.64,10.00,10.00,9.71,Full democracy,Western Europe,Score: Rank:
1,2,Iceland,9.58,10.00,9.29,8.89,10.00,9.71,Full democracy,Western Europe,Score: Rank:
2,3,Sweden,9.39,9.58,9.64,8.33,10.00,9.41,Full democracy,Western Europe,Score: Rank:
3,4,New Zealand,9.26,10.00,9.29,8.89,8.13,10.00,Full democracy,Asia & Australasia,Score: Rank:
4,5,Finland,9.25,10.00,8.93,8.89,8.75,9.71,Full democracy,Western Europe,Score: 0.11Rank: 3
...,...,...,...,...,...,...,...,...,...,...,...
163,164,Syria,1.43,0.00,0.00,2.78,4.38,0.00,Authoritarian,Middle East & North Africa,Score: Rank: 2
164,165,Central African Republic,1.32,1.25,0.00,1.11,1.88,2.35,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1
165,166,Democratic Republic of the Congo,1.13,0.00,0.00,1.67,3.13,0.88,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1
166,167,North Korea,1.08,0.00,2.50,1.67,1.25,0.00,Authoritarian,Asia & Australasia,Score: Rank:


In [263]:
demodata.rename(columns=changes,inplace=True)

In [264]:
demodata

Unnamed: 0,Rank,Country,Score,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Region,Changes
0,1,Norway,9.87,10.00,9.64,10.00,10.00,9.71,Full democracy,Western Europe,Score: Rank:
1,2,Iceland,9.58,10.00,9.29,8.89,10.00,9.71,Full democracy,Western Europe,Score: Rank:
2,3,Sweden,9.39,9.58,9.64,8.33,10.00,9.41,Full democracy,Western Europe,Score: Rank:
3,4,New Zealand,9.26,10.00,9.29,8.89,8.13,10.00,Full democracy,Asia & Australasia,Score: Rank:
4,5,Finland,9.25,10.00,8.93,8.89,8.75,9.71,Full democracy,Western Europe,Score: 0.11Rank: 3
...,...,...,...,...,...,...,...,...,...,...,...
163,164,Syria,1.43,0.00,0.00,2.78,4.38,0.00,Authoritarian,Middle East & North Africa,Score: Rank: 2
164,165,Central African Republic,1.32,1.25,0.00,1.11,1.88,2.35,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1
165,166,Democratic Republic of the Congo,1.13,0.00,0.00,1.67,3.13,0.88,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1
166,167,North Korea,1.08,0.00,2.50,1.67,1.25,0.00,Authoritarian,Asia & Australasia,Score: Rank:


2. **Verify that categories levels are well written**.

This requires getting frequency tables:

In [265]:
demodata.Regimetype.value_counts()

Flawed democracy    54
Authoritarian       54
Hybrid regime       37
Full democracy      22
Regimetype           1
Name: Regimetype, dtype: int64

You just identified a weird level **Regimetype**, that might signal a **wrong row**:

In [266]:
demodata[demodata.Regimetype=="Regimetype"]

Unnamed: 0,Rank,Country,Score,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Region,Changes
167,Rank,Country,Score,Electoral processand pluralism,Functio­ning ofgovern­ment,Politicalpartici­pation,Politicalculture,Civilliberties,Regimetype,Region,Changes fromlast year


You found a row that did not follow the pattern of the rest. The last step before formatting will be to remove these rows and the columns you do not need.

In [267]:
# strategy 1: subsetting
demodataRowsOK=demodata[demodata.Regimetype!="Regimetype"]
demodataRowsOK

Unnamed: 0,Rank,Country,Score,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Region,Changes
0,1,Norway,9.87,10.00,9.64,10.00,10.00,9.71,Full democracy,Western Europe,Score: Rank:
1,2,Iceland,9.58,10.00,9.29,8.89,10.00,9.71,Full democracy,Western Europe,Score: Rank:
2,3,Sweden,9.39,9.58,9.64,8.33,10.00,9.41,Full democracy,Western Europe,Score: Rank:
3,4,New Zealand,9.26,10.00,9.29,8.89,8.13,10.00,Full democracy,Asia & Australasia,Score: Rank:
4,5,Finland,9.25,10.00,8.93,8.89,8.75,9.71,Full democracy,Western Europe,Score: 0.11Rank: 3
...,...,...,...,...,...,...,...,...,...,...,...
162,163,Chad,1.61,0.00,0.00,1.67,3.75,2.65,Authoritarian,Sub-Saharan Africa,Score: Rank:
163,164,Syria,1.43,0.00,0.00,2.78,4.38,0.00,Authoritarian,Middle East & North Africa,Score: Rank: 2
164,165,Central African Republic,1.32,1.25,0.00,1.11,1.88,2.35,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1
165,166,Democratic Republic of the Congo,1.13,0.00,0.00,1.67,3.13,0.88,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1


In [268]:
# strategy 2: dropping by index
demodata.drop(index=167)

Unnamed: 0,Rank,Country,Score,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Region,Changes
0,1,Norway,9.87,10.00,9.64,10.00,10.00,9.71,Full democracy,Western Europe,Score: Rank:
1,2,Iceland,9.58,10.00,9.29,8.89,10.00,9.71,Full democracy,Western Europe,Score: Rank:
2,3,Sweden,9.39,9.58,9.64,8.33,10.00,9.41,Full democracy,Western Europe,Score: Rank:
3,4,New Zealand,9.26,10.00,9.29,8.89,8.13,10.00,Full democracy,Asia & Australasia,Score: Rank:
4,5,Finland,9.25,10.00,8.93,8.89,8.75,9.71,Full democracy,Western Europe,Score: 0.11Rank: 3
...,...,...,...,...,...,...,...,...,...,...,...
162,163,Chad,1.61,0.00,0.00,1.67,3.75,2.65,Authoritarian,Sub-Saharan Africa,Score: Rank:
163,164,Syria,1.43,0.00,0.00,2.78,4.38,0.00,Authoritarian,Middle East & North Africa,Score: Rank: 2
164,165,Central African Republic,1.32,1.25,0.00,1.11,1.88,2.35,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1
165,166,Democratic Republic of the Congo,1.13,0.00,0.00,1.67,3.13,0.88,Authoritarian,Sub-Saharan Africa,Score: 0.2Rank: 1
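
When there may be more than one wrong row, a possible variation of strategy 1 is to keep only the rows whose category belongs to a list of valid levels. The following sketch uses a toy table with a repeated header row; the names `valid`, `bad_rows` and `clean` are introduced only for this example.

```python
import pandas as pd

# toy table with a repeated header row, as may happen in scraped tables
demo = pd.DataFrame({
    "Country": ["Norway", "Chad", "Country"],
    "Regimetype": ["Full democracy", "Authoritarian", "Regimetype"],
})

valid = ["Full democracy", "Flawed democracy", "Hybrid regime", "Authoritarian"]

is_valid = demo["Regimetype"].isin(valid)
bad_rows = demo[~is_valid]                     # rows to inspect
clean = demo[is_valid].reset_index(drop=True)  # rows to keep

print(bad_rows)
print(clean)
```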


Let's keep the result of strategy 1, and now get rid of the unneeded columns:

In [269]:
[1]+ list(range(3,9) )

[1, 3, 4, 5, 6, 7, 8]

In [270]:
# strategy 1: subsetting
keepPositions=[1]+ list(range(3,9) )
demodataRowsOK.iloc[:,keepPositions]

Unnamed: 0,Country,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype
0,Norway,10.00,9.64,10.00,10.00,9.71,Full democracy
1,Iceland,10.00,9.29,8.89,10.00,9.71,Full democracy
2,Sweden,9.58,9.64,8.33,10.00,9.41,Full democracy
3,New Zealand,10.00,9.29,8.89,8.13,10.00,Full democracy
4,Finland,10.00,8.93,8.89,8.75,9.71,Full democracy
...,...,...,...,...,...,...,...
162,Chad,0.00,0.00,1.67,3.75,2.65,Authoritarian
163,Syria,0.00,0.00,2.78,4.38,0.00,Authoritarian
164,Central African Republic,1.25,0.00,1.11,1.88,2.35,Authoritarian
165,Democratic Republic of the Congo,0.00,0.00,1.67,3.13,0.88,Authoritarian


In [271]:
# strategy 2: dropping
bye=['Rank','Score','Changes']
demodataRowsOK.drop(columns=bye)

Unnamed: 0,Country,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Region
0,Norway,10.00,9.64,10.00,10.00,9.71,Full democracy,Western Europe
1,Iceland,10.00,9.29,8.89,10.00,9.71,Full democracy,Western Europe
2,Sweden,9.58,9.64,8.33,10.00,9.41,Full democracy,Western Europe
3,New Zealand,10.00,9.29,8.89,8.13,10.00,Full democracy,Asia & Australasia
4,Finland,10.00,8.93,8.89,8.75,9.71,Full democracy,Western Europe
...,...,...,...,...,...,...,...,...
162,Chad,0.00,0.00,1.67,3.75,2.65,Authoritarian,Sub-Saharan Africa
163,Syria,0.00,0.00,2.78,4.38,0.00,Authoritarian,Middle East & North Africa
164,Central African Republic,1.25,0.00,1.11,1.88,2.35,Authoritarian,Sub-Saharan Africa
165,Democratic Republic of the Congo,0.00,0.00,1.67,3.13,0.88,Authoritarian,Sub-Saharan Africa


Let's use strategy 2 this time:

In [272]:
demodataok=demodataRowsOK.drop(columns=bye)
demodataok

Unnamed: 0,Country,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Region
0,Norway,10.00,9.64,10.00,10.00,9.71,Full democracy,Western Europe
1,Iceland,10.00,9.29,8.89,10.00,9.71,Full democracy,Western Europe
2,Sweden,9.58,9.64,8.33,10.00,9.41,Full democracy,Western Europe
3,New Zealand,10.00,9.29,8.89,8.13,10.00,Full democracy,Asia & Australasia
4,Finland,10.00,8.93,8.89,8.75,9.71,Full democracy,Western Europe
...,...,...,...,...,...,...,...,...
162,Chad,0.00,0.00,1.67,3.75,2.65,Authoritarian,Sub-Saharan Africa
163,Syria,0.00,0.00,2.78,4.38,0.00,Authoritarian,Middle East & North Africa
164,Central African Republic,1.25,0.00,1.11,1.88,2.35,Authoritarian,Sub-Saharan Africa
165,Democratic Republic of the Congo,0.00,0.00,1.67,3.13,0.88,Authoritarian,Sub-Saharan Africa


## b. FORMATTING

1. Verifying data types

In [273]:
demodataok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 167 entries, 0 to 166
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Country                 167 non-null    object
 1   Electoral               167 non-null    object
 2   Functioning             167 non-null    object
 3   Politicalparticipation  167 non-null    object
 4   Politicalculture        167 non-null    object
 5   Civilliberties          167 non-null    object
 6   Regimetype              167 non-null    object
 7   Region                  167 non-null    object
dtypes: object(8)
memory usage: 11.7+ KB


The column **Country** is text, so the **object** data type is fine. 
The column **Regimetype** is ordinal, **Region** is nominal, and the rest are **numerical**: we need to make those changes.

Let's start with **Regimetype**:

In [274]:
demodataok.Regimetype.value_counts()

Flawed democracy    54
Authoritarian       54
Hybrid regime       37
Full democracy      22
Name: Regimetype, dtype: int64

In [275]:
from pandas.api.types import CategoricalDtype

# create data type
levelsRegime=CategoricalDtype(categories=["Authoritarian","Hybrid regime","Flawed democracy","Full democracy"],ordered=True)
# make the change:
demodataok.Regimetype=demodataok.Regimetype.astype(levelsRegime)

See the difference:

In [276]:
demodataok.Regimetype

0      Full democracy
1      Full democracy
2      Full democracy
3      Full democracy
4      Full democracy
            ...      
162     Authoritarian
163     Authoritarian
164     Authoritarian
165     Authoritarian
166     Authoritarian
Name: Regimetype, Length: 167, dtype: category
Categories (4, object): ['Authoritarian' < 'Hybrid regime' < 'Flawed democracy' < 'Full democracy']
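
Declaring the order matters because it enables comparisons, sorting and functions like `min` or `max` that follow the levels instead of the alphabet. A minimal sketch on a toy series:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = CategoricalDtype(
    categories=["Authoritarian", "Hybrid regime", "Flawed democracy", "Full democracy"],
    ordered=True,
)

regime = pd.Series(["Full democracy", "Authoritarian", "Hybrid regime"]).astype(levels)

at_least_hybrid = regime >= "Hybrid regime"  # comparison follows the declared order
lowest = regime.min()
ranked = regime.sort_values().tolist()

print(at_least_hybrid.tolist())
print(lowest)
print(ranked)
```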

Now **Region**:

In [277]:
demodataok.Region.value_counts()

Sub-Saharan Africa            44
Eastern Europe                28
Asia & Australasia            28
Latin America                 24
Western Europe                21
Middle East & North Africa    20
North America                  2
Name: Region, dtype: int64

In [278]:
demodataok.Region=demodataok.Region.astype('category')

In [279]:
demodataok.Region

0                  Western Europe
1                  Western Europe
2                  Western Europe
3              Asia & Australasia
4                  Western Europe
                  ...            
162            Sub-Saharan Africa
163    Middle East & North Africa
164            Sub-Saharan Africa
165            Sub-Saharan Africa
166            Asia & Australasia
Name: Region, Length: 167, dtype: category
Categories (7, object): ['Asia & Australasia', 'Eastern Europe', 'Latin America', 'Middle East & North Africa', 'North America', 'Sub-Saharan Africa', 'Western Europe']

Now the numerical data:

In [280]:
demodataok.iloc[:,1:6]

Unnamed: 0,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties
0,10.00,9.64,10.00,10.00,9.71
1,10.00,9.29,8.89,10.00,9.71
2,9.58,9.64,8.33,10.00,9.41
3,10.00,9.29,8.89,8.13,10.00
4,10.00,8.93,8.89,8.75,9.71
...,...,...,...,...,...
162,0.00,0.00,1.67,3.75,2.65
163,0.00,0.00,2.78,4.38,0.00
164,1.25,0.00,1.11,1.88,2.35
165,0.00,0.00,1.67,3.13,0.88


In [281]:
demodataok.iloc[:,1:6].astype('float')

Unnamed: 0,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties
0,10.00,9.64,10.00,10.00,9.71
1,10.00,9.29,8.89,10.00,9.71
2,9.58,9.64,8.33,10.00,9.41
3,10.00,9.29,8.89,8.13,10.00
4,10.00,8.93,8.89,8.75,9.71
...,...,...,...,...,...
162,0.00,0.00,1.67,3.75,2.65
163,0.00,0.00,2.78,4.38,0.00
164,1.25,0.00,1.11,1.88,2.35
165,0.00,0.00,1.67,3.13,0.88


2. Correcting data types: Numerical, categorical, text, date.

In [282]:
demodataok.iloc[:,1:6].apply(pd.to_numeric, errors='coerce')

Unnamed: 0,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties
0,10.00,9.64,10.00,10.00,9.71
1,10.00,9.29,8.89,10.00,9.71
2,9.58,9.64,8.33,10.00,9.41
3,10.00,9.29,8.89,8.13,10.00
4,10.00,8.93,8.89,8.75,9.71
...,...,...,...,...,...
162,0.00,0.00,1.67,3.75,2.65
163,0.00,0.00,2.78,4.38,0.00
164,1.25,0.00,1.11,1.88,2.35
165,0.00,0.00,1.67,3.13,0.88


In [283]:
namesToNumeric=demodataok.iloc[:,1:6].columns
namesToNumeric

Index(['Electoral', 'Functioning', 'Politicalparticipation',
       'Politicalculture', 'Civilliberties'],
      dtype='object')

In [284]:
demodataok[namesToNumeric]=demodataok.iloc[:,1:6].apply(pd.to_numeric, errors='coerce')

In [285]:
demodataok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 167 entries, 0 to 166
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Country                 167 non-null    object  
 1   Electoral               167 non-null    float64 
 2   Functioning             167 non-null    float64 
 3   Politicalparticipation  167 non-null    float64 
 4   Politicalculture        167 non-null    float64 
 5   Civilliberties          167 non-null    float64 
 6   Regimetype              167 non-null    category
 7   Region                  167 non-null    category
dtypes: category(2), float64(5), object(1)
memory usage: 10.0+ KB
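
In this table `astype('float')` and `pd.to_numeric` give the same result because every cell is a valid number. They differ when a cell cannot be converted, as this sketch shows; the value "unknown" is a made-up example of a bad cell.

```python
import pandas as pd

scores = pd.Series(["9.87", "10.00", "unknown"])

# astype stops with an error at the first bad value
try:
    scores.astype("float")
    converted = True
except ValueError:
    converted = False

# to_numeric with errors='coerce' turns bad values into NaN
numeric = pd.to_numeric(scores, errors="coerce")
n_missing = int(numeric.isna().sum())

print(converted, n_missing)
print(numeric)
```

Since `errors='coerce'` hides the problem, it is a good idea to count the missing values before and after the conversion.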


[home](#home)

______

<a id='integratesave'></a>

## c. INTEGRATING AND SAVING

In [286]:
codes=pd.read_csv("countryCodes.csv")
codes

Unnamed: 0,FIPS,ISO2,ISO3,NAME
0,AC,AG,ATG,Antigua and Barbuda
1,AG,DZ,DZA,Algeria
2,AJ,AZ,AZE,Azerbaijan
3,AL,AL,ALB,Albania
4,AM,AM,ARM,Armenia
...,...,...,...,...
241,TB,BL,BLM,Saint Barthelemy
242,GK,GG,GGY,Guernsey
243,JE,JE,JEY,Jersey
244,SX,GS,SGS,South Georgia South Sandwich Islands


In [287]:
demomerged=demodataok.merge(codes,left_on='Country', right_on="NAME",how='outer',indicator=True)
demomerged

Unnamed: 0,Country,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Region,FIPS,ISO2,ISO3,NAME,_merge
0,Norway,10.00,9.64,10.00,10.00,9.71,Full democracy,Western Europe,NO,NO,NOR,Norway,both
1,Iceland,10.00,9.29,8.89,10.00,9.71,Full democracy,Western Europe,IC,IS,ISL,Iceland,both
2,Sweden,9.58,9.64,8.33,10.00,9.41,Full democracy,Western Europe,SW,SE,SWE,Sweden,both
3,New Zealand,10.00,9.29,8.89,8.13,10.00,Full democracy,Asia & Australasia,NZ,NZ,NZL,New Zealand,both
4,Finland,10.00,8.93,8.89,8.75,9.71,Full democracy,Western Europe,FI,FI,FIN,Finland,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...
255,,,,,,,,,RN,MF,MAF,Saint Martin,right_only
256,,,,,,,,,TB,BL,BLM,Saint Barthelemy,right_only
257,,,,,,,,,GK,GG,GGY,Guernsey,right_only
258,,,,,,,,,JE,JE,JEY,Jersey,right_only


What both had in common:

In [288]:
demomerged[demomerged._merge=='both']

Unnamed: 0,Country,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Region,FIPS,ISO2,ISO3,NAME,_merge
0,Norway,10.00,9.64,10.00,10.00,9.71,Full democracy,Western Europe,NO,NO,NOR,Norway,both
1,Iceland,10.00,9.29,8.89,10.00,9.71,Full democracy,Western Europe,IC,IS,ISL,Iceland,both
2,Sweden,9.58,9.64,8.33,10.00,9.41,Full democracy,Western Europe,SW,SE,SWE,Sweden,both
3,New Zealand,10.00,9.29,8.89,8.13,10.00,Full democracy,Asia & Australasia,NZ,NZ,NZL,New Zealand,both
4,Finland,10.00,8.93,8.89,8.75,9.71,Full democracy,Western Europe,FI,FI,FIN,Finland,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...
160,Equatorial Guinea,0.00,0.43,3.33,4.38,1.47,Authoritarian,Sub-Saharan Africa,EK,GQ,GNQ,Equatorial Guinea,both
161,Turkmenistan,0.00,0.79,2.22,5.00,0.59,Authoritarian,Eastern Europe,TX,TM,TKM,Turkmenistan,both
162,Chad,0.00,0.00,1.67,3.75,2.65,Authoritarian,Sub-Saharan Africa,CD,TD,TCD,Chad,both
164,Central African Republic,1.25,0.00,1.11,1.88,2.35,Authoritarian,Sub-Saharan Africa,CT,CF,CAF,Central African Republic,both


The rows from the democracy data that did not find a match:

In [289]:
demomerged[demomerged._merge=='left_only']

Unnamed: 0,Country,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Region,FIPS,ISO2,ISO3,NAME,_merge
22,South Korea[n 2],9.17,7.86,7.22,7.5,8.24,Flawed democracy,Asia & Australasia,,,,,left_only
76,North Macedonia,7.0,5.36,6.67,3.75,7.06,Hybrid regime,Eastern Europe,,,,,left_only
82,Moldova,6.58,4.64,6.11,4.38,7.06,Hybrid regime,Eastern Europe,,,,,left_only
94,Tanzania,5.75,5.0,5.0,5.63,4.41,Hybrid regime,Sub-Saharan Africa,,,,,left_only
110,Ivory Coast,4.33,2.86,3.33,5.63,4.12,Hybrid regime,Sub-Saharan Africa,,,,,left_only
121,Myanmar,3.08,3.93,2.78,5.63,2.35,Authoritarian,Asia & Australasia,,,,,left_only
131,Eswatini,0.92,2.86,2.78,5.63,3.53,Authoritarian,Sub-Saharan Africa,,,,,left_only
133,Republic of the Congo,2.17,2.5,3.89,3.75,3.24,Authoritarian,Sub-Saharan Africa,,,,,left_only
135,Vietnam,0.0,3.21,3.89,5.63,2.65,Authoritarian,Asia & Australasia,,,,,left_only
150,Iran,0.0,2.86,4.44,3.13,1.47,Authoritarian,Middle East & North Africa,,,,,left_only

These countries may appear under a different name in the codes table, so we search it using fragments of their names:


In [290]:
codes[codes.NAME.str.contains('Korea|Macedonia|Moldo|Tanza|Ivo|Burm|Swazi|Congo|Iran|Viet|Lao|Liby|Syr')]

Unnamed: 0,FIPS,ISO2,ISO3,NAME
17,BM,MM,MMR,Burma
26,CF,CG,COG,Congo
27,CG,CD,COD,Democratic Republic of the Congo
83,IR,IR,IRN,Iran (Islamic Republic of)
86,IV,CI,CIV,Cote d'Ivoire
93,KN,KP,PRK,"Korea, Democratic People's Republic of"
95,KS,KR,KOR,"Korea, Republic of"
98,LA,LA,LAO,Lao People's Democratic Republic
106,LY,LY,LBY,Libyan Arab Jamahiriya
111,MK,MK,MKD,The former Yugoslav Republic of Macedonia
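
In `str.contains`, the pattern is read as a regular expression by default, so `|` means "or": a name is kept if it contains any of the fragments. A small sketch on a toy Series (the names `names`, `pattern` and `found` are only for illustration):

```python
import pandas as pd

# toy Series of names (hypothetical subset)
names = pd.Series(["Burma", "Congo", "Norway", "Viet Nam", "Cote d'Ivoire"])

# fragments separated by '|' : match any of them
pattern = "Burm|Congo|Viet|Ivo"

# boolean mask used as a filter
found = names[names.str.contains(pattern)]
print(found.to_list())
```

Short fragments are convenient because they still match when the two sources spell the full name differently.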


In [291]:
# names in the democracy data that found no match
currentNamesDemo=demomerged[demomerged._merge=='left_only'].Country.to_list()
# their equivalents in the codes table, written in the SAME order
currentNamesCodes=['Korea, Republic of','The former Yugoslav Republic of Macedonia','Republic of Moldova',
                  'United Republic of Tanzania',"Cote d'Ivoire",'Burma','Swaziland','Congo','Viet Nam',
                  'Iran (Islamic Republic of)',"Lao People's Democratic Republic",
                  'Libyan Arab Jamahiriya','Syrian Arab Republic',"Korea, Democratic People's Republic of"]

In [292]:
dict(zip(currentNamesDemo,currentNamesCodes))

{'South Korea[n 2]': 'Korea, Republic of',
 'North Macedonia': 'The former Yugoslav Republic of Macedonia',
 'Moldova': 'Republic of Moldova',
 'Tanzania': 'United Republic of Tanzania',
 'Ivory Coast': "Cote d'Ivoire",
 'Myanmar': 'Burma',
 'Eswatini': 'Swaziland',
 'Republic of the Congo': 'Congo',
 'Vietnam': 'Viet Nam',
 'Iran': 'Iran (Islamic Republic of)',
 'Laos': "Lao People's Democratic Republic",
 'Libya': 'Libyan Arab Jamahiriya',
 'Syria': 'Syrian Arab Republic',
 'North Korea': "Korea, Democratic People's Republic of"}
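
Keep in mind that `zip` pairs elements by position and silently stops at the shorter list, so both lists must have the same length and the same order. A minimal sketch with short, made-up lists (the names `old_names`, `new_names` and `replacements` are hypothetical):

```python
# two lists that must be aligned by position
old_names = ["Myanmar", "Eswatini", "Vietnam"]
new_names = ["Burma", "Swaziland", "Viet Nam"]

# zip would silently drop the extra elements of a longer list,
# so check the lengths before building the dictionary
assert len(old_names) == len(new_names)

replacements = dict(zip(old_names, new_names))
print(replacements)
```

Printing the dictionary, as done above, is the easiest way to confirm that every pair is the intended one.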

In [293]:
###dictionary of replacements:
replacementsForDemo=dict(zip(currentNamesDemo,currentNamesCodes))

### replacing (assigning the result back is safer than inplace=True on a column)
demodataok['Country']=demodataok.Country.replace(replacementsForDemo)
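
When `replace` receives a dictionary, each value that appears as a key is swapped for the corresponding dictionary value, and everything else is left untouched. A small sketch on a toy data frame (the names `df` and `mapping` are made up for illustration):

```python
import pandas as pd

# toy data frame (hypothetical data)
df = pd.DataFrame({"Country": ["Norway", "Myanmar", "Vietnam"]})

# keys are the current values, values are the new ones
mapping = {"Myanmar": "Burma", "Vietnam": "Viet Nam"}

# values not present in the dictionary (Norway) stay as they are
df["Country"] = df["Country"].replace(mapping)
print(df)
```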

In [294]:
demomerged=demodataok.merge(codes,left_on='Country', right_on="NAME")
demomerged

Unnamed: 0,Country,Electoral,Functioning,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Region,FIPS,ISO2,ISO3,NAME
0,Norway,10.00,9.64,10.00,10.00,9.71,Full democracy,Western Europe,NO,NO,NOR,Norway
1,Iceland,10.00,9.29,8.89,10.00,9.71,Full democracy,Western Europe,IC,IS,ISL,Iceland
2,Sweden,9.58,9.64,8.33,10.00,9.41,Full democracy,Western Europe,SW,SE,SWE,Sweden
3,New Zealand,10.00,9.29,8.89,8.13,10.00,Full democracy,Asia & Australasia,NZ,NZ,NZL,New Zealand
4,Finland,10.00,8.93,8.89,8.75,9.71,Full democracy,Western Europe,FI,FI,FIN,Finland
...,...,...,...,...,...,...,...,...,...,...,...,...
162,Chad,0.00,0.00,1.67,3.75,2.65,Authoritarian,Sub-Saharan Africa,CD,TD,TCD,Chad
163,Syrian Arab Republic,0.00,0.00,2.78,4.38,0.00,Authoritarian,Middle East & North Africa,SY,SY,SYR,Syrian Arab Republic
164,Central African Republic,1.25,0.00,1.11,1.88,2.35,Authoritarian,Sub-Saharan Africa,CT,CF,CAF,Central African Republic
165,Democratic Republic of the Congo,0.00,0.00,1.67,3.13,0.88,Authoritarian,Sub-Saharan Africa,CG,CD,COD,Democratic Republic of the Congo
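
The default merge (`how='inner'`) silently drops the rows without a match, so it is worth comparing the result against the original table. The sketch below uses toy tables (the data and the names `left`, `right` and `missing` are hypothetical) and the `validate` argument, which raises an error if a key is repeated in either table:

```python
import pandas as pd

# toy tables (hypothetical data)
left = pd.DataFrame({"Country": ["Norway", "Chad", "Kosovo"]})
right = pd.DataFrame({"NAME": ["Norway", "Chad"],
                      "ISO3": ["NOR", "TCD"]})

# inner merge; validate checks that keys are unique on both sides
merged = left.merge(right, left_on="Country", right_on="NAME",
                    validate="one_to_one")

# countries present in the original table but lost in the merge
missing = sorted(set(left["Country"]) - set(merged["Country"]))
print(len(left), len(merged), missing)
```

If `missing` is not empty, those names still need a replacement or are simply absent from the codes table.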


In [295]:
demomerged.drop(columns="NAME",inplace=True)

In [296]:
demomerged.to_pickle("demomerged.pkl")
# you will need: DF=pd.read_pickle("demomerged.pkl")
# or:
# from urllib.request import urlopen
# DF=pd.read_pickle(urlopen("https://..../demomerged.pkl"),compression=None)
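
A pickle file keeps the data types of the data frame, including the categorical columns, which a plain CSV file would not. A minimal round-trip sketch, assuming a toy data frame and a temporary folder (the names `df`, `path` and `restored` are only for illustration):

```python
import os
import tempfile

import pandas as pd

# toy data frame with a categorical column (hypothetical data)
df = pd.DataFrame({
    "Country": ["Norway", "Chad"],
    "Regimetype": pd.Categorical(["Full democracy", "Authoritarian"]),
})

# save and read back inside a temporary folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# the values and the data types survive the round trip
print(restored.equals(df))
print(restored.dtypes)
```

Only read pickle files from sources you trust, since unpickling can execute arbitrary code.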

In [297]:
from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
base.saveRDS(demomerged,file="demomerged.RDS")

#In R, you read it with: DF = readRDS("demomerged.RDS")
#or, if you read it from the cloud: DF = readRDS(url("https://..../demomerged.RDS"))



<rpy2.rinterface_lib.sexp.NULLType object at 0x7fdb51bb05a0> [RTYPES.NILSXP]