<br> 
<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>


### Prof. José Manuel Magallanes, PhD

* Visiting Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, [magajm@uw.edu](mailto:magajm@uw.edu)

* Associate Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

_____


<a id='home'></a>

# Introduction to Python

* DATA STRUCTURES
    - [Basic Data Structures](#basic_ds).

    - [Complex Data Structures](#complex_ds).

* DATA PREPROCESSING
    - [Data cleaning and formatting](#cleanformat).

    - [Integration and Saving](#integratesave).


<a id='basic_ds'></a>

# Basic Data Structures

Python has basic native structures, like lists, tuples and dictionaries.

## a.  **LISTS** 

Lists are the most flexible structure to save or contain data elements.

In [1]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
ages=[32,33,28,30,29]
country=["China", "Senegal", "España", "Norway","Korea"]
education=["Bach", "Bach", "Master", "PhD","PhD"]

Above we have created some lists. Lists can contain any values. Lists support different operations:

* **Accessing**:

Keep in mind that positions in Python start at **0**.

In [2]:
# one element
ages[0]

32

In [3]:
# several, using slices:
ages[1:-1] # from the second to the next-to-last

[33, 28, 30]

In [4]:
# several, using slices:
ages[:-2] #all but two last ones

[32, 33, 28]

In [5]:
# non consecutive
from operator import itemgetter
list(itemgetter(0,2,3)(ages))

[32, 28, 30]

In [6]:
# difficult to understand?
ages[0:4:2] + [ages[3]]

[32, 28, 30]
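Slices follow the pattern `start:stop:step`, where `stop` is excluded and negative numbers count from the end. The sketch below re-creates the `ages` list so it can run on its own; every other variable name is only illustrative.

```python
ages = [32, 33, 28, 30, 29]

# start:stop:step, with stop excluded
every_other = ages[::2]        # positions 0, 2, 4
last_two = ages[-2:]           # negative positions count from the end
backwards = ages[::-1]         # a negative step walks the list in reverse

# a list comprehension is a readable alternative to itemgetter
positions = [0, 2, 3]
picked = [ages[i] for i in positions]

print(every_other, last_two, backwards, picked)
```

A comprehension over a list of positions is often easier to read than combining slices with `+`.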

* **Modifying**:

In [7]:
# by position
country[2]="Spain"

# list changed:
country

['China', 'Senegal', 'Spain', 'Norway', 'Korea']

In [8]:
# by value
country=["PR China" if x == "China" else x for x in country]

# list changed:
country

['PR China', 'Senegal', 'Spain', 'Norway', 'Korea']

* **Deleting**

In [9]:
# by position
del country[-1] #last value

# list changed:
country

['PR China', 'Senegal', 'Spain', 'Norway']

In [10]:
# by position
names.pop() #last value by default

# list changed:
names

['Qing', 'Françoise', 'Raúl', 'Bjork']

In [11]:
# only 'del' works for several positions

lista=[1,2,3,4,5,6]
del lista[1:3]

#now:
lista

[1, 4, 5, 6]

In [12]:
# by value
ages.remove(29) 

# list changed:
ages # just the first occurrence of the value!!

[32, 33, 28, 30]

In [13]:
# by value
education.remove('PhD') 

# list changed:
education # just the first occurrence!!

['Bach', 'Bach', 'Master', 'PhD']

In [14]:
# deleting every occurrence of a value:

lista=[1,'a',45,'b','a']
lista=[x for x in lista if x!='a']

# you get:
lista

[1, 45, 'b']

* **Inserting values**

In [15]:
# at the end
lista.append("abc")
lista

[1, 45, 'b', 'abc']

In [16]:
# PART ONE:
# first delete a position
education.pop(2)
education

['Bach', 'Bach', 'PhD']

In [17]:
# PART TWO:
# now insert in that position
education.insert(2,"Master")
education

['Bach', 'Bach', 'Master', 'PhD']

## b.  **TUPLES**

Tuples are immutable structures in Python. They look like lists but do not share much of their functionality:

In [18]:
# new tuple:
weekend=("Friday", "Saturday", "Sunday")

You can access:

In [19]:
weekend[0]

'Friday'

But deleting, modifying a value, and inserting a value IS NOT PERMITTED.
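A minimal sketch of what "not permitted" means in practice: Python raises a `TypeError` when you try to change a tuple, so the usual workaround is to build a new tuple. The name `long_weekend` is only illustrative.

```python
weekend = ("Friday", "Saturday", "Sunday")

try:
    weekend[0] = "Thursday"          # tuples do not allow item assignment
except TypeError as error:
    message = str(error)
    print("Python refused:", message)

# the workaround: create a NEW tuple from the old one
long_weekend = ("Thursday",) + weekend
print(long_weekend)
```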

Python itself uses tuples as output of some important functions:

In [20]:
zip(names,ages)

<zip at 0x7fca947bb1e0>

The **zip** function creates tuples by combining containers in parallel. You can see it if you turn the result into a list:

In [21]:
list(zip(names,ages))  # a list of tuples

[('Qing', 32), ('Françoise', 33), ('Raúl', 28), ('Bjork', 30)]
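Two details of `zip` are worth knowing: it stops at the shortest container, and the same function can undo the pairing. The sketch below re-creates the two lists as they stand at this point; the other names are illustrative.

```python
names = ["Qing", "Françoise", "Raúl", "Bjork"]
ages = [32, 33, 28, 30]

pairs = list(zip(names, ages))

# "unzipping": the * operator passes each tuple as a separate argument
names_again, ages_again = zip(*pairs)

# zip silently stops when the shortest container is exhausted
truncated = list(zip(names, [1, 2]))

print(names_again, ages_again, truncated)
```

Because of that silent truncation, it is a good idea to compare lengths before zipping containers that should match.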

[home](#home)

______


<a id='complex_ds'></a>

# Complex Data Structures

## a. **DICTIONARIES**  

*Dicts* work in a more sophisticated way, as they have a **'key'**:**'value'** structure:

In [22]:
classroom={'student':names,'age':ages,'edu':education}
# see it:

classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork'],
 'age': [32, 33, 28, 30],
 'edu': ['Bach', 'Bach', 'Master', 'PhD']}

Dicts do not use indexes to access values:

In [23]:
#classroom[0]

Dicts use keys:

In [24]:
classroom['student']

['Qing', 'Françoise', 'Raúl', 'Bjork']

Notice that I created a dictionary where the value is not ONE but a LIST of values.

Once you access a value, you can modify it. You can also use _pop_ or _del_ with the **keys**. But you cannot use _append_ to add a new key; you need **update**:

In [25]:
classroom.update({'country':country})
# now:
classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork'],
 'age': [32, 33, 28, 30],
 'edu': ['Bach', 'Bach', 'Master', 'PhD'],
 'country': ['PR China', 'Senegal', 'Spain', 'Norway']}
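The operations just mentioned can be sketched on a smaller dict. This is a stand-alone illustration with a shortened `classroom`; it also shows plain key assignment, which does the same job as `update` for a single key.

```python
classroom = {"student": ["Qing", "Raúl"], "age": [32, 28]}

# adding a key by assignment (same effect as update() with one key)
classroom["edu"] = ["Bach", "Master"]

# modifying a value reached through its key
classroom["age"][0] = 33

# pop removes the key and returns its value; del just removes it
removed_ages = classroom.pop("age")
del classroom["edu"]

# get() returns a default instead of raising KeyError for a missing key
missing = classroom.get("country", "no such key")

print(classroom, removed_ages, missing)
```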

Notice that the previous dicts could be represented as a table, but dicts are not limited to table-like structures:

In [26]:
student1={'names':'Peter','language':['english','spanish'],'age':14}
student2={'names':'Mary','language':['english','spanish','french'],'age':16}
student3={'names':'John','language':['english'],'age':15}

You can create this dict:

In [27]:
class1={'students':[student1,student2,student3]}
class1

{'students': [{'names': 'Peter',
   'language': ['english', 'spanish'],
   'age': 14},
  {'names': 'Mary', 'language': ['english', 'spanish', 'french'], 'age': 16},
  {'names': 'John', 'language': ['english'], 'age': 15}]}
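Reaching a value inside a nested structure like this one means chaining keys and positions. A short sketch, re-creating the same dicts; the names `marys_last_language` and `languages_per_student` are illustrative.

```python
student1 = {'names': 'Peter', 'language': ['english', 'spanish'], 'age': 14}
student2 = {'names': 'Mary', 'language': ['english', 'spanish', 'french'], 'age': 16}
student3 = {'names': 'John', 'language': ['english'], 'age': 15}
class1 = {'students': [student1, student2, student3]}

# key -> position in list -> key -> position in list
marys_last_language = class1['students'][1]['language'][-1]

# looping over the list of dicts to summarize it
languages_per_student = {s['names']: len(s['language']) for s in class1['students']}

print(marys_last_language, languages_per_student)
```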

## b. DATA FRAMES

A **data frame** is a complex container of values. The most common analogy is a spreadsheet. To create a data frame, we need to call **pandas**:

In [28]:
import pandas

We can prepare a data frame from a dictionary immediately, but ONLY if you have the same number of elements in each list representing a column.

In [29]:
# our data frame:
students=pandas.DataFrame(classroom)
## see it:
students

Unnamed: 0,student,age,edu,country
0,Qing,32,Bach,PR China
1,Françoise,33,Bach,Senegal
2,Raúl,28,Master,Spain
3,Bjork,30,PhD,Norway


But, let me update the dictionary with: 

In [30]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
#
classroom.update({'student':names})
#
classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie'],
 'age': [32, 33, 28, 30],
 'edu': ['Bach', 'Bach', 'Master', 'PhD'],
 'country': ['PR China', 'Senegal', 'Spain', 'Norway']}

We now have five students but data for only four of them, so the lists have different lengths and this does not work:

In [31]:
#pandas.DataFrame(classroom)
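If you want to see the failure without stopping the notebook, the call can be wrapped in `try`/`except`. A minimal sketch with a shortened dict; pandas raises a `ValueError` here, and the exact wording of the message depends on the pandas version.

```python
import pandas

classroom = {'student': ["Qing", "Françoise", "Raúl", "Bjork", "Marie"],
             'age': [32, 33, 28, 30]}          # one value fewer than 'student'

try:
    pandas.DataFrame(classroom)
    failed = False
except ValueError as error:
    failed = True
    print("pandas refused:", error)
```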

In that case, you need this:

In [32]:
#then
students=pandas.DataFrame({key:pandas.Series(value) for key, value in classroom.items()})

# seeing it:
students

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,Spain
3,Bjork,30.0,PhD,Norway
4,Marie,,,


Python users usually import pandas under the alias **pd**, like this:

In [33]:
import pandas as pd # renaming the library

students=pd.DataFrame({key:pd.Series(value) for key, value in classroom.items()})
students

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,Spain
3,Bjork,30.0,PhD,Norway
4,Marie,,,


### Data frame basic operations

In [34]:
# type of structure: list? tuple? data frame?
type(students)

pandas.core.frame.DataFrame

In [35]:
# type of data in data frame column
students.dtypes

student     object
age        float64
edu         object
country     object
dtype: object
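The `age` column is `float64` rather than an integer type because the missing value for Marie is stored as `NaN`, which is a float. The sketch below uses a stand-alone series (not the `students` frame) to show this, along with the nullable `Int64` type that pandas offers as an alternative.

```python
import pandas as pd

ages = pd.Series([32, 33, 28, 30])           # integers, no gaps
ages_with_gap = ages.reindex(range(5))       # position 4 has no value, so it becomes NaN

# the nullable integer type keeps whole numbers and marks the gap as <NA>
nullable_ages = ages_with_gap.astype("Int64")

print(ages.dtype, ages_with_gap.dtype, nullable_ages.dtype)
```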

In [36]:
# details of data frame
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   student  5 non-null      object 
 1   age      4 non-null      float64
 2   edu      4 non-null      object 
 3   country  4 non-null      object 
dtypes: float64(1), object(3)
memory usage: 288.0+ bytes


In [37]:
# number of rows and columns
students.shape 

(5, 4)

In [38]:
# number of rows:
len(students) 

5

In [39]:
# first rows
students.head(2) # compare with: students.tail(2)

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal


In [40]:
# name of columns
students.columns

Index(['student', 'age', 'edu', 'country'], dtype='object')

If you needed the column names as a list:

In [41]:
students.columns.tolist()# or simply: list(students)

['student', 'age', 'edu', 'country']

If you needed a column values as a list:

In [42]:
students.age.tolist() # or: list(students.age)

[32.0, 33.0, 28.0, 30.0, nan]

### Accessing elements in DF:

The data frames in pandas behave much like in R:

In [43]:
#one particular column
students.student

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [44]:
# or
students['student'] 

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [45]:
# it is not the same as: 
students[['student']] # a data frame, not a column (or series)

Unnamed: 0,student
0,Qing
1,Françoise
2,Raúl
3,Bjork
4,Marie


In [46]:
# this is also a DF
students[['country','student']]

Unnamed: 0,country,student
0,PR China,Qing
1,Senegal,Françoise
2,Spain,Raúl
3,Norway,Bjork
4,,Marie


In [47]:
# and this, using loc:
columnNames=['country','student']
students.loc[:,columnNames]

Unnamed: 0,country,student
0,PR China,Qing
1,Senegal,Françoise
2,Spain,Raúl
3,Norway,Bjork
4,,Marie


In [48]:
## Using positions is very common:
columnPositions=[1,3,0]
students.iloc[:,columnPositions] 

Unnamed: 0,age,country,student
0,32.0,PR China,Qing
1,33.0,Senegal,Françoise
2,28.0,Spain,Raúl
3,30.0,Norway,Bjork
4,,,Marie
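Both `loc` and `iloc` take a row selector before the comma, so rows and columns can be filtered in one step. A small sketch, re-creating part of the `students` frame so it runs on its own; `older` and `first_two` are illustrative names.

```python
import pandas as pd

students = pd.DataFrame({
    "student": ["Qing", "Françoise", "Raúl", "Bjork", "Marie"],
    "age": [32.0, 33.0, 28.0, 30.0, None],
    "country": ["PR China", "Senegal", "Spain", "Norway", None],
})

# loc: rows by condition, columns by name
older = students.loc[students["age"] > 29, ["student", "age"]]

# iloc: rows and columns by position
first_two = students.iloc[:2, [0, 2]]

print(older)
print(first_two)
```

Rows with a missing age are not selected, since a comparison with `NaN` is never true.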


### Changing values

If you have a position, you can update values:

In [49]:
students.iloc[4,1]=23 # change is immediate! (no warning)
students

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,Spain
3,Bjork,30.0,PhD,Norway
4,Marie,23.0,,
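Positions can shift when rows are sorted or dropped, so updating by label or by condition is often safer. A sketch with a two-row frame; the value 31 is made up for the illustration.

```python
import pandas as pd

students = pd.DataFrame({"student": ["Bjork", "Marie"], "age": [30.0, None]})

# update by condition on the rows and by column name
students.loc[students["student"] == "Marie", "age"] = 23

# .at is a shortcut for one cell: row label, column name
students.at[0, "age"] = 31

print(students)
```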


### Deleting columns

You can modify any values in a data frame, but let me create a **deep** copy of this data frame to play with:

In [50]:
studentsCopy=students.copy()
studentsCopy

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,Spain
3,Bjork,30.0,PhD,Norway
4,Marie,23.0,,


In [51]:
# This is what you want to get rid of:
byeColumns=['edu'] # you can delete more than one

#this is the result
studentsCopy.drop(columns=byeColumns)

Unnamed: 0,student,age,country
0,Qing,32.0,PR China
1,Françoise,33.0,Senegal
2,Raúl,28.0,Spain
3,Bjork,30.0,Norway
4,Marie,23.0,


Notice that the previous result was not saved:

In [52]:
studentsCopy

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,Spain
3,Bjork,30.0,PhD,Norway
4,Marie,23.0,,


In [53]:
#NOW we do
studentsCopy.drop(columns=byeColumns,inplace=True)

In [54]:
#then:
studentsCopy

Unnamed: 0,student,age,country
0,Qing,32.0,PR China
1,Françoise,33.0,Senegal
2,Raúl,28.0,Spain
3,Bjork,30.0,Norway
4,Marie,23.0,
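An alternative to `inplace=True` is to assign the result of `drop` to a name, which leaves the original frame untouched. A minimal sketch with a shortened frame; `without_edu` is an illustrative name.

```python
import pandas as pd

students = pd.DataFrame({"student": ["Qing", "Raúl"],
                         "age": [32, 28],
                         "edu": ["Bach", "Master"]})

# drop returns a NEW data frame; the original keeps all its columns
without_edu = students.drop(columns=["edu"])

print(list(students.columns), list(without_edu.columns))
```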


### Deleting a row

Let me delete a row:

In [55]:
# index=2 deletes the row whose index label is 2
studentsCopy.drop(index=2,inplace=True) 
studentsCopy

Unnamed: 0,student,age,country
0,Qing,32.0,PR China
1,Françoise,33.0,Senegal
3,Bjork,30.0,Norway
4,Marie,23.0,


As you can see, index label 2 disappeared. You should then reset the index:

In [56]:
studentsCopy.reset_index(drop=True,inplace=True)
studentsCopy

Unnamed: 0,student,age,country
0,Qing,32.0,PR China
1,Françoise,33.0,Senegal
2,Bjork,30.0,Norway
3,Marie,23.0,
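Resetting matters because, after a row is dropped, labels and positions no longer agree. A sketch with a one-column frame showing the mismatch and how resetting removes it; the variable names are illustrative.

```python
import pandas as pd

students = pd.DataFrame({"student": ["Qing", "Françoise", "Raúl", "Bjork"]})
dropped = students.drop(index=2)

label_exists = 2 in dropped.index                 # the label 2 is gone
third_by_position = dropped.iloc[2]["student"]    # but there is still a third row

tidy = dropped.reset_index(drop=True)
third_by_label = tidy.loc[2, "student"]           # labels and positions agree again

print(label_exists, third_by_position, third_by_label)
```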



[home](#home)
_____

<a id='part2'></a>


## 2.  Data Preprocessing

<a id='beginning'></a>

Preprocessing includes three stages:

a. **Cleaning**: Cleaning requires that every cell has the right value, and that the dataframe has only the contents needed. Having a clean data frame means:

    1. Verify that headers are at the top of data frame, and well written.

    2. Verify that category levels are well written.
    
<br>
    
b. **Formatting**: Formatting requires:
    
    1. Verifying data types
    
    2. Correcting data types: Numerical, categorical, text, date.
    
<br>
    
c. **Integrating and Saving**: This is the process of combining several dataframes into one, and saving it into a file that can be the input of future processes.


_____

<a id='cleanformat'></a>

## a. CLEANING

Let's start by bringing in a table from Wikipedia:

In [57]:
LINK_to_WIKIPAGE="https://en.wikipedia.org/wiki/Democracy_Index"

democracy=pd.read_html(LINK_to_WIKIPAGE)

The object **democracy** is not a data frame but a list of data frames:

In [58]:
type(democracy)

list

This list has this amount of elements:

In [59]:
len(democracy)

10

We can shorten the list of results by telling **read_html** what kind of table we want:

In [84]:
democracy=pd.read_html(LINK_to_WIKIPAGE, attrs={"class":"wikitable sortable","style":"text-align:center;"})
len(democracy)

4

Then, our table is:

In [86]:
democracy[3]

Unnamed: 0,Rank,Δ Rank,Country,Regime type,Elec­toral pro­cess and plura­lism,Func­tioning of govern­ment,Poli­tical partici­pation,Poli­tical cul­ture,Civil liber­ties,Overall score,Δ Score
0,1,,Norway,Full democracy,10.00,9.64,10.00,10.00,9.71,9.87,
1,2,,Iceland,Full democracy,10.00,9.29,8.89,10.00,9.71,9.58,
2,3,,Sweden,Full democracy,9.58,9.64,8.33,10.00,9.41,9.39,
3,4,,New Zealand,Full democracy,10.00,9.29,8.89,8.13,10.00,9.26,
4,5,3.0,Finland,Full democracy,10.00,8.93,8.89,8.75,9.71,9.25,0.11
...,...,...,...,...,...,...,...,...,...,...,...
162,163,,Chad,Authoritarian,0.00,0.00,1.67,3.75,2.65,1.61,
163,164,2.0,Syria,Authoritarian,0.00,0.00,2.78,4.38,0.00,1.43,
164,165,1.0,Central African Republic,Authoritarian,1.25,0.00,1.11,1.88,2.35,1.32,0.20
165,166,1.0,Democratic Republic of the Congo,Authoritarian,0.00,0.00,1.67,3.13,0.88,1.13,0.20


Let's create an object for that data frame:

In [87]:
demodata=democracy[3]

1. **Verify that headers are at the top of data frame, and well written**.

In [97]:
# headers are in the right position
# are they well written?
demodata.columns.to_list()

['Rank',
 'Δ Rank',
 'Country',
 'Regime type',
 'Elec\xadtoral pro\xadcess and plura\xadlism',
 'Func\xadtioning of govern\xadment',
 'Poli\xadtical partici\xadpation',
 'Poli\xadtical cul\xadture',
 'Civil liber\xadties',
 'Overall score',
 'Δ Score']

Notice the presence of weird symbols. Let me save these original column names:

In [89]:
old=demodata.columns.to_list()

You could fix the names with the **brute-force** approach (retyping them without coding), but let's use **regular expressions** (regex):

* Replacing:

In [102]:
import re

# \s matches any whitespace character
# replace (substitute) each whitespace character with an underscore
old_good=[re.sub(r"\s","_",name) for name in old]
old_good

['Rank',
 'Δ_Rank',
 'Country',
 'Regime_type',
 'Elec\xadtoral_pro\xadcess_and_plura\xadlism',
 'Func\xadtioning_of_govern\xadment',
 'Poli\xadtical_partici\xadpation',
 'Poli\xadtical_cul\xadture',
 'Civil_liber\xadties',
 'Overall_score',
 'Δ_Score']

In [105]:
# \W matches anything that is not a letter, digit or underscore (here, the soft hyphen \xad)
old_better=[re.sub(r'\W+', '', element) for element in old_good]
old_better

['Rank',
 'Δ_Rank',
 'Country',
 'Regime_type',
 'Electoral_process_and_pluralism',
 'Functioning_of_government',
 'Political_participation',
 'Political_culture',
 'Civil_liberties',
 'Overall_score',
 'Δ_Score']
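The two substitutions can be wrapped in one small function, which makes them easy to reuse on other tables. This is only a sketch: `clean_header` is a hypothetical helper name, and the three headers are copied from the list above.

```python
import re

def clean_header(name):
    """Replace runs of whitespace with '_' and then drop non-word characters."""
    with_underscores = re.sub(r"\s+", "_", name.strip())
    return re.sub(r"\W+", "", with_underscores)

raw = ["Δ Rank", "Elec\xadtoral pro\xadcess and plura\xadlism", "Overall score"]
cleaned = [clean_header(header) for header in raw]
print(cleaned)
```

The order matters: if `\W+` ran first it would also delete the spaces, and the words would be glued together without underscores.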

These are the original ones:

In [110]:
demodata.columns

Index(['Rank', 'Δ Rank', 'Country', 'Regime type',
       'Elec­toral pro­cess and plura­lism', 'Func­tioning of govern­ment',
       'Poli­tical partici­pation', 'Poli­tical cul­ture', 'Civil liber­ties',
       'Overall score', 'Δ Score'],
      dtype='object')

Let's make a dict with the zip:

In [111]:
dict(zip(demodata.columns,old_better))

{'Rank': 'Rank',
 'Δ Rank': 'Δ_Rank',
 'Country': 'Country',
 'Regime type': 'Regime_type',
 'Elec\xadtoral pro\xadcess and plura\xadlism': 'Electoral_process_and_pluralism',
 'Func\xadtioning of govern\xadment': 'Functioning_of_government',
 'Poli\xadtical partici\xadpation': 'Political_participation',
 'Poli\xadtical cul\xadture': 'Political_culture',
 'Civil liber\xadties': 'Civil_liberties',
 'Overall score': 'Overall_score',
 'Δ Score': 'Δ_Score'}

The previous dict can serve to tell Python how to rename some columns:

In [112]:
# just saving the dict
changes=dict(zip(demodata.columns,old_better))

#using the dict
demodata.rename(columns=changes)

Unnamed: 0,Rank,Δ_Rank,Country,Regime_type,Electoral_process_and_pluralism,Functioning_of_government,Political_participation,Political_culture,Civil_liberties,Overall_score,Δ_Score
0,1,,Norway,Full democracy,10.00,9.64,10.00,10.00,9.71,9.87,
1,2,,Iceland,Full democracy,10.00,9.29,8.89,10.00,9.71,9.58,
2,3,,Sweden,Full democracy,9.58,9.64,8.33,10.00,9.41,9.39,
3,4,,New Zealand,Full democracy,10.00,9.29,8.89,8.13,10.00,9.26,
4,5,3.0,Finland,Full democracy,10.00,8.93,8.89,8.75,9.71,9.25,0.11
...,...,...,...,...,...,...,...,...,...,...,...
162,163,,Chad,Authoritarian,0.00,0.00,1.67,3.75,2.65,1.61,
163,164,2.0,Syria,Authoritarian,0.00,0.00,2.78,4.38,0.00,1.43,
164,165,1.0,Central African Republic,Authoritarian,1.25,0.00,1.11,1.88,2.35,1.32,0.20
165,166,1.0,Democratic Republic of the Congo,Authoritarian,0.00,0.00,1.67,3.13,0.88,1.13,0.20


The previous view is the result, but no changes have been made:

In [113]:
demodata

Unnamed: 0,Rank,Δ Rank,Country,Regime type,Elec­toral pro­cess and plura­lism,Func­tioning of govern­ment,Poli­tical partici­pation,Poli­tical cul­ture,Civil liber­ties,Overall score,Δ Score
0,1,,Norway,Full democracy,10.00,9.64,10.00,10.00,9.71,9.87,
1,2,,Iceland,Full democracy,10.00,9.29,8.89,10.00,9.71,9.58,
2,3,,Sweden,Full democracy,9.58,9.64,8.33,10.00,9.41,9.39,
3,4,,New Zealand,Full democracy,10.00,9.29,8.89,8.13,10.00,9.26,
4,5,3.0,Finland,Full democracy,10.00,8.93,8.89,8.75,9.71,9.25,0.11
...,...,...,...,...,...,...,...,...,...,...,...
162,163,,Chad,Authoritarian,0.00,0.00,1.67,3.75,2.65,1.61,
163,164,2.0,Syria,Authoritarian,0.00,0.00,2.78,4.38,0.00,1.43,
164,165,1.0,Central African Republic,Authoritarian,1.25,0.00,1.11,1.88,2.35,1.32,0.20
165,166,1.0,Democratic Republic of the Congo,Authoritarian,0.00,0.00,1.67,3.13,0.88,1.13,0.20


You can make those changes using the **inplace=True** argument:

In [114]:
demodata.rename(columns=changes,inplace=True)

In [115]:
# you see:
demodata

Unnamed: 0,Rank,Δ_Rank,Country,Regime_type,Electoral_process_and_pluralism,Functioning_of_government,Political_participation,Political_culture,Civil_liberties,Overall_score,Δ_Score
0,1,,Norway,Full democracy,10.00,9.64,10.00,10.00,9.71,9.87,
1,2,,Iceland,Full democracy,10.00,9.29,8.89,10.00,9.71,9.58,
2,3,,Sweden,Full democracy,9.58,9.64,8.33,10.00,9.41,9.39,
3,4,,New Zealand,Full democracy,10.00,9.29,8.89,8.13,10.00,9.26,
4,5,3.0,Finland,Full democracy,10.00,8.93,8.89,8.75,9.71,9.25,0.11
...,...,...,...,...,...,...,...,...,...,...,...
162,163,,Chad,Authoritarian,0.00,0.00,1.67,3.75,2.65,1.61,
163,164,2.0,Syria,Authoritarian,0.00,0.00,2.78,4.38,0.00,1.43,
164,165,1.0,Central African Republic,Authoritarian,1.25,0.00,1.11,1.88,2.35,1.32,0.20
165,166,1.0,Democratic Republic of the Congo,Authoritarian,0.00,0.00,1.67,3.13,0.88,1.13,0.20
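When every column needs the same cleaning, the dict can be skipped by applying the regex directly to the column index through its `.str` methods. A sketch on a one-row frame, assuming the same two patterns used above; the two values are Norway's scores from the table.

```python
import pandas as pd

scores = pd.DataFrame({"Overall score": [9.87], "Civil liber\xadties": [9.71]})

cleaned = scores.copy()
cleaned.columns = (cleaned.columns
                   .str.replace(r"\s+", "_", regex=True)   # whitespace to underscore
                   .str.replace(r"\W+", "", regex=True))    # drop soft hyphens and the like

print(list(cleaned.columns))
```

The dict-based `rename` remains the better choice when only some columns change or the new names do not follow a pattern.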


2. **Verify that category levels are well written**.

This requires getting frequency tables:

In [117]:
demodata.Regime_type.value_counts()

Flawed democracy    54
Authoritarian       54
Hybrid regime       37
Full democracy      22
Name: Regime_type, dtype: int64

The names are well written.
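When the names are not well written, the frequency table reveals it, because a misspelled level shows up as an extra category. A sketch with a made-up series containing two deliberate errors (a lowercase initial and a trailing space), which is not the case in the Wikipedia data.

```python
import pandas as pd

# hypothetical dirty column: same two levels, written inconsistently
regimes = pd.Series(["Full democracy", "full democracy ", "Authoritarian", "Authoritarian"])
levels_before = regimes.nunique()

# strip surrounding blanks, then force "First letter uppercase, rest lowercase"
cleaned = regimes.str.strip().str.capitalize()
levels_after = cleaned.nunique()

print(cleaned.value_counts())
```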

At this point, you should be ready to delete some columns (or keep just the ones you need). I want to keep these positions:

In [120]:
list(range(2,11))

[2, 3, 4, 5, 6, 7, 8, 9, 10]

In [133]:
# strategy 1: subsetting
keepPositions=list(range(2,11))
demodata.iloc[:,keepPositions]

Unnamed: 0,Country,Regime_type,Electoral_process_and_pluralism,Functioning_of_government,Political_participation,Political_culture,Civil_liberties,Overall_score,Δ_Score
0,Norway,Full democracy,10.00,9.64,10.00,10.00,9.71,9.87,
1,Iceland,Full democracy,10.00,9.29,8.89,10.00,9.71,9.58,
2,Sweden,Full democracy,9.58,9.64,8.33,10.00,9.41,9.39,
3,New Zealand,Full democracy,10.00,9.29,8.89,8.13,10.00,9.26,
4,Finland,Full democracy,10.00,8.93,8.89,8.75,9.71,9.25,0.11
...,...,...,...,...,...,...,...,...,...
162,Chad,Authoritarian,0.00,0.00,1.67,3.75,2.65,1.61,
163,Syria,Authoritarian,0.00,0.00,2.78,4.38,0.00,1.43,
164,Central African Republic,Authoritarian,1.25,0.00,1.11,1.88,2.35,1.32,0.20
165,Democratic Republic of the Congo,Authoritarian,0.00,0.00,1.67,3.13,0.88,1.13,0.20


In [132]:
demodata.columns

Index(['Rank', 'Δ_Rank', 'Country', 'Regime_type',
       'Electoral_process_and_pluralism', 'Functioning_of_government',
       'Political_participation', 'Political_culture', 'Civil_liberties',
       'Overall_score', 'Δ_Score'],
      dtype='object')

In [134]:
# strategy 2: dropping
bye=['Rank','Δ_Rank','Δ_Score']
demodata.drop(columns=bye)

Unnamed: 0,Country,Regime_type,Electoral_process_and_pluralism,Functioning_of_government,Political_participation,Political_culture,Civil_liberties,Overall_score
0,Norway,Full democracy,10.00,9.64,10.00,10.00,9.71,9.87
1,Iceland,Full democracy,10.00,9.29,8.89,10.00,9.71,9.58
2,Sweden,Full democracy,9.58,9.64,8.33,10.00,9.41,9.39
3,New Zealand,Full democracy,10.00,9.29,8.89,8.13,10.00,9.26
4,Finland,Full democracy,10.00,8.93,8.89,8.75,9.71,9.25
...,...,...,...,...,...,...,...,...
162,Chad,Authoritarian,0.00,0.00,1.67,3.75,2.65,1.61
163,Syria,Authoritarian,0.00,0.00,2.78,4.38,0.00,1.43
164,Central African Republic,Authoritarian,1.25,0.00,1.11,1.88,2.35,1.32
165,Democratic Republic of the Congo,Authoritarian,0.00,0.00,1.67,3.13,0.88,1.13


Let's use strategy 2 this time:

In [135]:
demodataok=demodata.drop(columns=bye)
demodataok

Unnamed: 0,Country,Regime_type,Electoral_process_and_pluralism,Functioning_of_government,Political_participation,Political_culture,Civil_liberties,Overall_score
0,Norway,Full democracy,10.00,9.64,10.00,10.00,9.71,9.87
1,Iceland,Full democracy,10.00,9.29,8.89,10.00,9.71,9.58
2,Sweden,Full democracy,9.58,9.64,8.33,10.00,9.41,9.39
3,New Zealand,Full democracy,10.00,9.29,8.89,8.13,10.00,9.26
4,Finland,Full democracy,10.00,8.93,8.89,8.75,9.71,9.25
...,...,...,...,...,...,...,...,...
162,Chad,Authoritarian,0.00,0.00,1.67,3.75,2.65,1.61
163,Syria,Authoritarian,0.00,0.00,2.78,4.38,0.00,1.43
164,Central African Republic,Authoritarian,1.25,0.00,1.11,1.88,2.35,1.32
165,Democratic Republic of the Congo,Authoritarian,0.00,0.00,1.67,3.13,0.88,1.13


## b. FORMATTING

1. **Verifying data types**

In [136]:
demodataok.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 8 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          167 non-null    object 
 1   Regime_type                      167 non-null    object 
 2   Electoral_process_and_pluralism  167 non-null    float64
 3   Functioning_of_government        167 non-null    float64
 4   Political_participation          167 non-null    float64
 5   Political_culture                167 non-null    float64
 6   Civil_liberties                  167 non-null    float64
 7   Overall_score                    167 non-null    float64
dtypes: float64(6), object(2)
memory usage: 10.6+ KB


The column **Country** is text, so the **object** data type is OK. 
The column **Regime_type** is ordinal but is still stored as **object**, and the rest are **numerical**.

2. **Correcting data types**

Let's turn **Regime_type** into ordinal:

In [137]:
demodataok.Regime_type.value_counts()

Flawed democracy    54
Authoritarian       54
Hybrid regime       37
Full democracy      22
Name: Regime_type, dtype: int64

For the case of ordinal data, these are the right steps:

In [139]:
from pandas.api.types import CategoricalDtype

# right order (ascending)
levels=["Authoritarian","Hybrid regime","Flawed democracy","Full democracy"]

# create ordinal data type
levelsRegime=CategoricalDtype(categories=levels,ordered=True)

# make the change:
demodataok.Regime_type=demodataok.Regime_type.astype(levelsRegime)

See the difference:

In [141]:
demodataok.Regime_type

0      Full democracy
1      Full democracy
2      Full democracy
3      Full democracy
4      Full democracy
            ...      
162     Authoritarian
163     Authoritarian
164     Authoritarian
165     Authoritarian
166     Authoritarian
Name: Regime_type, Length: 167, dtype: category
Categories (4, object): ['Authoritarian' < 'Hybrid regime' < 'Flawed democracy' < 'Full democracy']
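What the ordering buys you is that comparisons, minimum/maximum and sorting follow the levels instead of the alphabet. A sketch on a three-value series that uses the same levels; the variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = ["Authoritarian", "Hybrid regime", "Flawed democracy", "Full democracy"]
levelsRegime = CategoricalDtype(categories=levels, ordered=True)

regimes = pd.Series(["Full democracy", "Authoritarian", "Hybrid regime"]).astype(levelsRegime)

at_least_flawed = regimes >= "Flawed democracy"   # comparison follows the level order
lowest = regimes.min()
in_order = regimes.sort_values().tolist()

print(at_least_flawed.tolist(), lowest, in_order)
```

With plain text, sorting would put "Full democracy" before "Hybrid regime"; with the ordered category it comes last.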

Notice you will now see the change:

In [142]:
demodataok.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 8 columns):
 #   Column                           Non-Null Count  Dtype   
---  ------                           --------------  -----   
 0   Country                          167 non-null    object  
 1   Regime_type                      167 non-null    category
 2   Electoral_process_and_pluralism  167 non-null    float64 
 3   Functioning_of_government        167 non-null    float64 
 4   Political_participation          167 non-null    float64 
 5   Political_culture                167 non-null    float64 
 6   Civil_liberties                  167 non-null    float64 
 7   Overall_score                    167 non-null    float64 
dtypes: category(1), float64(6), object(1)
memory usage: 9.6+ KB


[home](#home)

______

<a id='integratesave'></a>

## c. INTEGRATING AND SAVING

Let me open this file:

In [None]:
codes=pd.read_csv("countryCodes.csv")
codes

My goal is to add the codes to  **demodataok**. Let's try this:

In [None]:
# how='outer' will use all the rows from both data frames
# indicator=True will tell you when you had matches or not.

demomerged=demodataok.merge(codes,left_on='Country', right_on="NAME",how='outer',indicator=True)
demomerged

What both had in common:

In [None]:
demomerged[demomerged._merge=='both'] # using the indicator!

Let's see what countries in **demodataok** did not find a match in **codes**:

In [None]:
demomerged[demomerged._merge=='left_only']
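Since the contents of **countryCodes.csv** are not shown here, the behavior of `how='outer'` with `indicator=True` can be sketched on two tiny made-up frames. The column `ISO3`, its values and the name "South Korea" are assumptions for the illustration only.

```python
import pandas as pd

# hypothetical miniature versions of demodataok and codes
demo = pd.DataFrame({"Country": ["Norway", "South Korea"]})
codes = pd.DataFrame({"NAME": ["Norway", "Korea, Republic of"],
                      "ISO3": ["NOR", "KOR"]})

merged = demo.merge(codes, left_on="Country", right_on="NAME",
                    how="outer", indicator=True)

# one row matched; one exists only on the left; one only on the right
unmatched_left = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
unmatched_right = merged.loc[merged["_merge"] == "right_only", "NAME"].tolist()

print(merged)
```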

Let's see if those countries exist in **codes** but with different name:

In [None]:
codes[codes.NAME.str.contains('Korea|Macedonia|Moldo|Tanza|Ivo|Burm|Swazi|Congo|Iran|Viet|Lao|Liby|Syr')]

The function **str.contains** checks whether a pattern appears anywhere in each string (it does not need to match the whole name). From here, the matching is manual:

In [None]:
# the names in demomerged that did not find a match
currentNamesDemo=demomerged[demomerged._merge=='left_only'].Country.to_list()

# the names from codes that will replace them (they must be in the same order as currentNamesDemo)
currentNamesCodes=['Korea, Republic of','The former Yugoslav Republic of Macedonia','Republic of Moldova',
                  'United Republic of Tanzania',"Cote d'Ivoire",'Burma','Swaziland','Congo','Viet Nam',
                  'Iran (Islamic Republic of)',"Lao People's Democratic Republic",
                  'Libyan Arab Jamahiriya','Syrian Arab Republic',"Korea, Democratic People's Republic of"]

We will replace names. Let's see the dict of replacements:

In [None]:
dict(zip(currentNamesDemo,currentNamesCodes))

Let's replace:

In [None]:
###dictionary of replacements:
replacementsForDemo=dict(zip(currentNamesDemo,currentNamesCodes))

### replacing (assigning the result back, instead of relying on inplace=True on a single column)
demodataok['Country']=demodataok.Country.replace(replacementsForDemo)

We have altered names of countries in **demodataok**, let's redo the merge:

In [None]:
demomerged=demodataok.merge(codes,left_on='Country', right_on="NAME")
demomerged
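This second merge uses the default `how='inner'`, which silently drops any country that still has no match, so it is worth counting rows before and after. A sketch on made-up miniature frames; the `ISO3` column and the name "South Korea" are assumptions, not the real contents of the files.

```python
import pandas as pd

demo = pd.DataFrame({"Country": ["Norway", "South Korea"]})
codes = pd.DataFrame({"NAME": ["Norway", "Korea, Republic of"],
                      "ISO3": ["NOR", "KOR"]})

# an explicit mapping does not depend on the order of two separate lists
replacements = {"South Korea": "Korea, Republic of"}
demo["Country"] = demo["Country"].replace(replacements)

merged = demo.merge(codes, left_on="Country", right_on="NAME")   # inner merge by default
rows_lost = len(demo) - len(merged)

print(rows_lost)   # anything above zero means some names still did not match
```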

In [None]:
#you can drop the repeated column:
demomerged.drop(columns="NAME",inplace=True)

We are done!....let's save for future Python use, and for future R use:

* For future Python use:

In [None]:
demomerged.to_pickle("demomerged.pkl")
# you will open with: DF=pd.read_pickle("demomerged.pkl")
# or, if you have a link:
# from urllib.request import urlopen
# DF=pd.read_pickle(urlopen("https://..../demomerged.pkl"),compression=None)
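The reason to prefer a pickle over a CSV for future Python use is that it keeps the data types, including the ordered categories created earlier. A sketch that writes a two-row frame to a temporary folder and reads it back both ways; the file names are illustrative.

```python
import os
import tempfile

import pandas as pd
from pandas.api.types import CategoricalDtype

levels = ["Authoritarian", "Hybrid regime", "Flawed democracy", "Full democracy"]
frame = pd.DataFrame({"Country": ["Norway", "Chad"],
                      "Regime_type": ["Full democracy", "Authoritarian"]})
frame["Regime_type"] = frame["Regime_type"].astype(CategoricalDtype(levels, ordered=True))

with tempfile.TemporaryDirectory() as folder:
    pickle_path = os.path.join(folder, "demo.pkl")
    csv_path = os.path.join(folder, "demo.csv")
    frame.to_pickle(pickle_path)
    frame.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pickle_path)   # dtypes preserved
    from_csv = pd.read_csv(csv_path)            # the category becomes plain text again

print(from_pickle["Regime_type"].dtype, from_csv["Regime_type"].dtype)
```

Keep in mind that pickle files should only be opened when they come from a source you trust.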

* For future use in R:

In [None]:
from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
base.saveRDS(demomerged,file="demomerged.RDS")

# In R, you can open it with: DF = readRDS("demomerged.RDS")
# or, if you read from a link: DF = readRDS(url("https://..../demomerged.RDS"))