<br> 
<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>


### Prof. José Manuel Magallanes, PhD

* Visiting Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, [magajm@uw.edu](mailto:magajm@uw.edu)

* Associate Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

_____

_____

<a id='home'></a>

# Introduction to Python

* DATA STRUCTURES
    - [Basic Data Structures](#basic_ds).

    - [Complex Data Structures](#complex_ds).

* DATA PREPROCESSING
    - [Data cleaning and formatting](#cleanformat).

    - [Integration and Saving](#integratesave).


<a id='basic_ds'></a>

# Basic Data Structures

Python has basic native structures, like lists, tuples and dictionaries.

## a.  **LISTS** 

Lists are the most flexible structure to save or contain data elements.

In [None]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
ages=[32,33,28,30,29]
country=["China", "Senegal", "España", "Norway","Korea"]
education=["Bach", "Bach", "Master", "PhD","PhD"]

Above we have created some lists. Lists can contain any values. Lists support different operations:

* **Accessing**:

Keep in mind that positions in Python start at **0**.

In [None]:
# one element
ages[0]

In [None]:
# several, using slices:
ages[1:-1] # from the second to the next-to-last

In [None]:
# several, using slices:
ages[:-2] #all but two last ones

In [None]:
# non consecutive
from operator import itemgetter
list(itemgetter(0,2,3)(ages))

In [None]:
# difficult to understand?
ages[0:4:2] + [ages[3]]
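
The slices used above follow the pattern `start:stop:step`, where the `stop` position is excluded. A minimal sketch of that pattern, using a hypothetical list `letters` that is not part of the lesson data:

```python
# hypothetical list, only for illustration
letters = ["a", "b", "c", "d", "e"]

first_three = letters[0:3]    # positions 0, 1 and 2 (position 3 is excluded)
every_other = letters[0:4:2]  # positions 0 and 2 (step of 2)
picked = [letters[i] for i in (0, 2, 3)]  # non-consecutive positions

print(first_three, every_other, picked)
```

The list comprehension in `picked` is one readable alternative to `itemgetter` when the positions are not consecutive.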

* **Modifying**:

In [None]:
# by position
country[2]="Spain"

# list changed:
country

In [None]:
# by value
country=["PR China" if x == "China" else x for x in country]

# list changed:
country

* **Deleting**

In [None]:
# by position
del country[-1] #last value

# list changed:
country

In [None]:
# by position
names.pop() #last value by default

# list changed:
names

In [None]:
# only 'del' works for several positions

lista=[1,2,3,4,5,6]
del lista[1:3]

#now:
lista

In [None]:
# by value
ages.remove(29) 

# list changed:
ages # just the first occurrence of the value!!

In [None]:
# by value
education.remove('PhD') 

# list changed:
education # just the first occurrence!!

In [None]:
# deleting every occurrence of a value:

lista=[1,'a',45,'b','a']
lista=[x for x in lista if x!='a']

# you get:
lista
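
The three ways of deleting shown above differ in what they need (a position or a value) and in what they give back. A small side-by-side sketch, with a hypothetical list `values`:

```python
# hypothetical list, only for illustration
values = [10, 20, 30, 20, 40]

last = values.pop()   # removes AND returns the last element
del values[0]         # removes by position, returns nothing
values.remove(20)     # removes only the first 20 it finds

print(last, values)
```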

* **Inserting values**

In [None]:
# at the end
lista.append("abc")
lista

In [None]:
# PART ONE:
# first delete a position
education.pop(2)
education

In [None]:
# PART TWO:
# now insert in that position
education.insert(2,"Master")
education

## b.  **TUPLES**

Tuples are immutable structures in Python. They look like lists, but they do not share much of their functionality:

In [None]:
# new tuple:
weekend=("Friday", "Saturday", "Sunday")

You can access:

In [None]:
weekend[0]

But deleting, modifying a value, and inserting a value IS NOT PERMITTED.
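
A short sketch of what "not permitted" looks like in practice, assuming a hypothetical tuple `days`: the assignment raises a `TypeError`, and the usual workaround is to build a new tuple.

```python
# hypothetical tuple, only for illustration
days = ("Friday", "Saturday", "Sunday")

try:
    days[0] = "Monday"      # item assignment is not allowed on tuples
    outcome = "modified"
except TypeError:
    outcome = "not permitted"

# workaround: copy into a list, change it, and build a NEW tuple
as_list = list(days)
as_list[0] = "Monday"
new_days = tuple(as_list)

print(outcome, new_days)
```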

Python itself uses tuples as output of some important functions:

In [None]:
zip(names,ages)

The **zip** function creates tuples by combining containers in parallel. You can see it if you turn the result into a list:

In [None]:
list(zip(names,ages))  # a list of tuples
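
Two details of **zip** are worth keeping in mind: it stops at the shortest container, and the pairing can be reversed. A sketch with hypothetical lists `people` and `years`:

```python
people = ["Ana", "Luis", "Mei"]  # hypothetical values
years = [31, 27]                 # one element fewer, on purpose

pairs = list(zip(people, years))      # zip stops at the shortest input
back_names, back_years = zip(*pairs)  # "unzipping" the pairs

print(pairs, back_names, back_years)
```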

[home](#home)

______


<a id='complex_ds'></a>

# Complex Data Structures

## a. **DICTIONARIES**  

*Dicts* work in a more sophisticated way, as they have a **'key'**:**'value'** structure:

In [None]:
classroom={'student':names,'age':ages,'edu':education}
# see it:

classroom

Dicts do not use indexes to access values:

In [None]:
#classroom[0] # uncommenting this raises a KeyError: there is no key 0

Dicts use keys:

In [None]:
classroom['student']

Notice that I created a dictionary where the value is not ONE but a LIST of values.

Once you access a value, you can modify it. You can also use _pop_ or _del_ with the **keys**. But you cannot use _append_ to add a new key; you can use **update**:

In [None]:
classroom.update({'country':country})
# now:
classroom
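
The operations just mentioned can be sketched together on a small hypothetical dict named `room`. Plain assignment to a new key is shown as an alternative to **update**:

```python
# hypothetical dict, only for illustration
room = {"student": ["Ana", "Luis"], "age": [31, 27]}

room["age"][0] = 32                    # modify a value reached through its key
room.update({"edu": ["PhD", "Bach"]})  # add a new key with update...
room["country"] = ["Peru", "Chile"]    # ...or by plain assignment
removed = room.pop("edu")              # pop returns the removed value
del room["country"]                    # del just removes the key

print(room, removed)
```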

Notice that the previous dict could be represented as a table, but dicts are not limited to table-like structures:

In [None]:
student1={'names':'Peter','language':['english','spanish'],'age':14}
student2={'names':'Mary','language':['english','spanish','french'],'age':16}
student3={'names':'John','language':['english'],'age':15}

You can create this dict:

In [None]:
class1={'students':[student1,student2,student3]}
class1

## b. DATA FRAMES

A **data frame** is a complex container of values. The most common analogy is a spreadsheet. To create a data frame, we need to call **pandas**:

In [None]:
import pandas

We can prepare a data frame from a dictionary immediately, but ONLY if you have the same number of elements in each list representing a column.

In [None]:
# our data frame:
students=pandas.DataFrame(classroom)
## see it:
students

But, let me update the dictionary with: 

In [None]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
#
classroom.update({'student':names})
#
classroom

We have five students, but only data for four of them, so this does not work:

In [None]:
#pandas.DataFrame(classroom) # uncommenting this raises a ValueError (columns of different lengths)

In that case, you need this:

In [None]:
#then
students=pandas.DataFrame({key:pandas.Series(value) for key, value in classroom.items()})

# seeing it:
students
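
Why does this work? Each list becomes a **Series**, and pandas aligns the Series by their index, filling the missing positions with `NaN`. A minimal sketch with a hypothetical dict `uneven`:

```python
import pandas

# hypothetical columns with different lengths
uneven = {"student": ["Ana", "Luis", "Mei"], "age": [31, 27]}

# each list becomes a Series; the shorter one gets NaN where it has no value
padded = pandas.DataFrame({key: pandas.Series(value) for key, value in uneven.items()})

print(padded)
```

Note that the padded column becomes a float column, because `NaN` is a float value.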

Sometimes, Python users code like this:

In [None]:
import pandas as pd # renaming the library

students=pd.DataFrame({key:pd.Series(value) for key, value in classroom.items()})
students

### Data frame basic operations

In [None]:
# type of structure: list? tuple? data frame?
type(students)

In [None]:
# type of data in data frame column
students.dtypes

In [None]:
# details of data frame
students.info()

In [None]:
# number of rows and columns
students.shape 

In [None]:
# number of rows:
len(students) 

In [None]:
# first rows
students.head(2) # compare with: students.tail(2)

In [None]:
# name of columns
students.columns

If you needed the column names as a list:

In [None]:
students.columns.tolist()# or simply: list(students)

If you needed a column values as a list:

In [None]:
students.age.tolist() # or: list(students.age)

### Accessing elements in DF:

The data frames in pandas behave much like in R:

In [None]:
#one particular column
students.student

In [None]:
# or
students['student'] 

In [None]:
# it is not the same as: 
students[['student']] # a data frame, not a column (or series)

In [None]:
# this is also a DF
students[['country','student']]

In [None]:
# and this, using loc:
columnNames=['country','student']
students.loc[:,columnNames]

In [None]:
## Using positions is very common:
columnPositions=[1,3,0]
students.iloc[:,columnPositions] 
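
The difference between **loc** (labels) and **iloc** (positions) is easier to see when the row labels are not 0, 1, 2... A sketch with a hypothetical data frame `small`:

```python
import pandas as pd

# hypothetical data frame with non-default row labels
small = pd.DataFrame({"student": ["Ana", "Luis", "Mei"], "age": [31, 27, 40]},
                     index=[10, 20, 30])

by_label = small.loc[20, "age"]  # row LABEL 20, column NAME 'age'
by_position = small.iloc[1, 1]   # second row, second column

print(by_label, by_position)
```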

### Changing values

If you have a position, you can update values:

In [None]:
students.iloc[4,1]=23 # change is immediate! (no warning)
students

### Deleting columns

You can modify any values in a data frame, but let me create a **deep** copy of this data frame to play with:

In [None]:
studentsCopy=students.copy()
studentsCopy

In [None]:
# This is what you want to get rid of:
byeColumns=['edu'] # you can delete more than one

#this is the result
studentsCopy.drop(columns=byeColumns)

Notice that the previous result was not saved:

In [None]:
studentsCopy

In [None]:
#NOW we do
studentsCopy.drop(columns=byeColumns,inplace=True)

In [None]:
#then:
studentsCopy

### Deleting a row

Let me delete a row:

In [None]:
# index=2 deletes the row whose label is 2
studentsCopy.drop(index=2,inplace=True) 
studentsCopy

As you see, the index label 2 disappeared. Then, you should reset the indexes:

In [None]:
studentsCopy.reset_index(drop=True,inplace=True)
studentsCopy


[home](#home)
_____

<a id='part2'></a>


## 2.  Data Preprocessing

<a id='beginning'></a>

Preprocessing includes three stages:

a. **Cleaning**: Cleaning requires that every cell has the right value, and that the dataframe has only the contents needed. Having a clean data frame means:

    1. Verify that headers are at the top of data frame, and well written.

    2. Verify that category levels are well written.
    
<br>
    
b. **Formatting**: Formatting requires:
    
    1. Verifying data types
    
    2. Correcting data types: Numerical, categorical, text, date.
    
<br>
    
c. **Integrating and Saving**: It is the process of combining several data frames into one, and saving it into a file that can be the input of future processes.


_____

<a id='cleanformat'></a>

## a. CLEANING

Let's start by bringing in a table from Wikipedia:

In [None]:
LINK_to_WIKIPAGE="https://en.wikipedia.org/wiki/Democracy_Index"

democracy=pd.read_html(LINK_to_WIKIPAGE)

The object **democracy** is not a data frame, but a list of data frames:

In [None]:
type(democracy)

This list has this many elements:

In [None]:
len(democracy)

We could shorten the list of results by telling pandas what kind of table we want:

In [None]:
democracy=pd.read_html(LINK_to_WIKIPAGE, attrs={"class":"wikitable sortable"})
len(democracy)

Then, our table is:

In [None]:
democracy[0]

Let's create an object for that data frame:

In [None]:
demodata=democracy[0]

1. **Verify that headers are at the top of data frame, and well written**.

In [None]:
# headers are in the right position
# are they well written?
demodata.columns.to_list()

Notice the presence of weird symbols. Let me save these original column names:

In [None]:
old=demodata.columns.to_list()

You could change the names using a **brute-force** approach (typing each new name by hand), but let's use **regular expressions** (regex):

* Splitting:

In [None]:
import re

# \s matches any whitespace character (blank, tab, ...)
# "split every name in old wherever you find a blank, and keep the first part (left)"
old_good=[re.split(r"\s",name)[0] for name in old]
old_good

In [None]:
# \[ is just the opening bracket
# "split every name in old_good wherever you find a '[', and keep the first part (left)"
old_better=[re.split(r"\[",name)[0] for name in old_good]
old_better

You can combine two **regex** patterns using **|**:

In [None]:
old_better=[re.split(r"\s|\[",x)[0] for x in old]
old_better

* Replacing:

In [None]:
# \W matches every character that is not a letter, a digit or the _ (underscore)
# each run of those characters will be substituted by '' (nothing)

allOK=[re.sub(r'\W+', '', element) for element in old_better]
allOK
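
The splitting and replacing steps can also be wrapped in one small function. This is only a sketch: `clean_header` is a hypothetical helper, and the headers in `raw` are made up to imitate the kind of noise found above.

```python
import re

def clean_header(name):
    """Keep the text before the first blank or '[', then drop non-word characters."""
    first_part = re.split(r"\s|\[", name)[0]
    return re.sub(r"\W+", "", first_part)

# hypothetical headers, not the real Wikipedia ones
raw = ["Rank", "Score[a]", "Regime type [b]", "Civil-liberties"]
cleaned = [clean_header(header) for header in raw]

print(cleaned)
```

Raw strings (`r"..."`) keep Python from interpreting the backslashes before the regex engine sees them.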

Let me keep the ones that were modified:

In [None]:
allOK[2:]

These are the original ones:

In [None]:
demodata.columns[2:]

Let's make a dict with the zip:

In [None]:
dict(zip(demodata.columns[2:],allOK[2:]))

The previous dict can serve to tell python how to rename some columns:

In [None]:
# just saving the dict
changes=dict(zip(demodata.columns[2:],allOK[2:]))

#using the dict
demodata.rename(columns=changes)

The previous view is the result, but no changes have been made:

In [None]:
demodata

You can make those changes using the **inplace=True** argument:

In [None]:
demodata.rename(columns=changes,inplace=True)

In [None]:
# you see:
demodata

2. **Verify that category levels are well written**.

This requires getting frequency tables:

In [None]:
demodata.Regimetype.value_counts()

You just identified a weird level **Regimetype**, that might signal a **wrong row**:

In [None]:
demodata[demodata.Regimetype=="Regimetype"]

You found a row that did not follow the pattern of the rest. The last step before formatting will be to remove these rows and the columns you do not need.

In [None]:
# strategy 1: subsetting
demodata[demodata.Regimetype!="Regimetype"]

In [None]:
# strategy 2: dropping by index
demodata.drop(index=167)

Let's keep strategy 1:

In [None]:
demodataRowsOK=demodata[demodata.Regimetype!="Regimetype"]
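
Both strategies rest on the same idea: a comparison on a column returns one True/False per row. A sketch on a hypothetical table `toy`, which also shows how to recover the labels of the wrong rows instead of typing a label such as 167 by hand:

```python
import pandas as pd

# hypothetical table where a header row was repeated inside the data
toy = pd.DataFrame({"Country": ["A", "Country", "B"],
                    "Regimetype": ["Full democracy", "Regimetype", "Authoritarian"]})

mask = toy.Regimetype != "Regimetype"       # Series of True/False, one per row
rows_ok = toy[mask].reset_index(drop=True)  # strategy 1: subsetting
bad_labels = toy[~mask].index.tolist()      # labels you could pass to drop(index=...)

print(rows_ok, bad_labels)
```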

At this point, you should be ready to delete some columns (or keep just the ones you need). I want to keep these positions:

In [None]:
[1]+ list(range(3,9) )

In [None]:
# strategy 1: subsetting
keepPositions=[1]+ list(range(3,9) )
demodataRowsOK.iloc[:,keepPositions]

In [None]:
# strategy 2: dropping
bye=['Rank','Score','Changes']
demodataRowsOK.drop(columns=bye)

Let's use strategy 2 this time:

In [None]:
demodataok=demodataRowsOK.drop(columns=bye)
demodataok

## b. FORMATTING

1. Verifying data types

In [None]:
demodataok.info()

The column **Country** is text, so it is OK that it has the **object** data type. 
The column **Regimetype** is ordinal, **Region** is nominal, and the rest are **numerical**: we need to make those changes.

2. **Correcting data types: Numerical, categorical, text, date**.

Let's start with **Regimetype**:

In [None]:
demodataok.Regimetype.value_counts()

For the case of ordinal data, these are the right steps:

In [None]:
from pandas.api.types import CategoricalDtype
# right order (ascending)
levels=["Authoritarian","Hybrid regime","Flawed democracy","Full democracy"]

# create ordinal data type
levelsRegime=CategoricalDtype(categories=levels,ordered=True)

# make the change:
demodataok.Regimetype=demodataok.Regimetype.astype(levelsRegime)

See the difference:

In [None]:
demodataok.Regimetype
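
What do you gain from the ordinal type? Comparisons and functions like `max` follow the order of the levels instead of the alphabet. A sketch with a hypothetical Series `regimes`:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = ["Authoritarian", "Hybrid regime", "Flawed democracy", "Full democracy"]
regime_type = CategoricalDtype(categories=levels, ordered=True)

# hypothetical values, only for illustration
regimes = pd.Series(["Full democracy", "Authoritarian", "Hybrid regime"]).astype(regime_type)

highest = regimes.max()                               # order-aware maximum
above_hybrid = (regimes > "Hybrid regime").tolist()   # comparison follows 'levels'

print(highest, above_hybrid)
```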

Now **Region**:

In [None]:
demodataok.Region.value_counts()

When the levels are nominal, this is simpler:

In [None]:
demodataok.Region=demodataok.Region.astype('category')

In [None]:
#verify:
demodataok.Region

Now the numerical data:

In [None]:
demodataok.iloc[:,1:6]

When you have several columns, you can try this:

In [None]:
# assigning by column name (not through iloc) lets pandas store the new data type
demodataok[demodataok.columns[1:6]]=demodataok.iloc[:,1:6].astype('float')

In [None]:
demodataok.info()

Or you can **apply** a function:

In [None]:
demodataok[demodataok.columns[1:6]]=demodataok.iloc[:,1:6].apply(pd.to_numeric, errors='coerce')

In [None]:
demodataok.info()
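
The two approaches differ when a cell cannot be read as a number: `astype` stops with an error, while `pd.to_numeric` with `errors='coerce'` turns that cell into `NaN`. A sketch with a hypothetical text column `raw`:

```python
import pandas as pd

raw = pd.Series(["7.5", "8.1", "n/a"])  # hypothetical text column

try:
    raw.astype("float")   # fails: 'n/a' is not a number
    strict_ok = True
except ValueError:
    strict_ok = False

coerced = pd.to_numeric(raw, errors="coerce")  # 'n/a' becomes NaN

print(strict_ok, coerced.tolist())
```

Coercing is convenient, but it is worth counting the `NaN` values afterwards so that no bad cell goes unnoticed.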

[home](#home)

______

<a id='integratesave'></a>

## INTEGRATING AND SAVING

Let me open this file:

In [None]:
codes=pd.read_csv("countryCodes.csv")
codes

My goal is to add the codes to  **demodataok**. Let's try this:

In [None]:
# how='outer' will use all the rows from both data frames
# indicator=True will tell you when you had matches or not.

demomerged=demodataok.merge(codes,left_on='Country', right_on="NAME",how='outer',indicator=True)
demomerged

What both had in common:

In [None]:
demomerged[demomerged._merge=='both'] # using the indicator!

Let's see what countries in **demodataok** did not find a match in **codes**:

In [None]:
demomerged[demomerged._merge=='left_only']
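
The three values of the indicator can be seen in a tiny example. Everything here is hypothetical (the data frames `left` and `right`, the scores, and the column `ISO3`); only the merge arguments mirror the ones used above:

```python
import pandas as pd

# hypothetical data frames, only for illustration
left = pd.DataFrame({"Country": ["Peru", "South Korea"], "Score": [6.5, 8.0]})
right = pd.DataFrame({"NAME": ["Peru", "Korea, Republic of"], "ISO3": ["PER", "KOR"]})

merged = left.merge(right, left_on="Country", right_on="NAME",
                    how="outer", indicator=True)

counts = merged._merge.value_counts().to_dict()
unmatched_left = merged.loc[merged._merge == "left_only", "Country"].tolist()

print(counts, unmatched_left)
```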

Let's see if those countries exist in **codes** but with different name:

In [None]:
codes[codes.NAME.str.contains('Korea|Macedonia|Moldo|Tanza|Ivo|Burm|Swazi|Congo|Iran|Viet|Lao|Liby|Syr')]
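
A minimal sketch of how such a pattern behaves, using a hypothetical Series `country_names` (not the contents of `countryCodes.csv`):

```python
import pandas as pd

# hypothetical names, including a missing value
country_names = pd.Series(["Viet Nam", "Republic of Moldova", "Peru", None])

# '|' means OR; matching is on substrings; na=False treats missing names as "no match"
hits = country_names[country_names.str.contains("Viet|Moldo", na=False)]

print(hits.tolist())
```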

The function **str.contains** verifies whether some string (a part of the name, not necessarily the whole word) exists in each value. Now you go manual:

In [None]:
# the names in demomerged that did not find a match
currentNamesDemo=demomerged[demomerged._merge=='left_only'].Country.to_list()

# the names that will replace the ones in demomerged.
currentNamesCodes=['Korea, Republic of','The former Yugoslav Republic of Macedonia','Republic of Moldova',
                  'United Republic of Tanzania',"Cote d'Ivoire",'Burma','Swaziland','Congo','Viet Nam',
                  'Iran (Islamic Republic of)',"Lao People's Democratic Republic",
                  'Libyan Arab Jamahiriya','Syrian Arab Republic',"Korea, Democratic People's Republic of"]

We will replace names. Let's see the dict of replacements:

In [None]:
dict(zip(currentNamesDemo,currentNamesCodes))

Let's replace:

In [None]:
###dictionary of replacements:
replacementsForDemo=dict(zip(currentNamesDemo,currentNamesCodes))

### replacing
demodataok['Country']=demodataok.Country.replace(replacementsForDemo)

We have altered names of countries in **demodataok**, let's redo the merge:

In [None]:
demomerged=demodataok.merge(codes,left_on='Country', right_on="NAME")
demomerged

In [None]:
#you can drop the repeated column:
demomerged.drop(columns="NAME",inplace=True)

We are done!....let's save for future Python use, and for future R use:

* For future Python use:

In [None]:
demomerged.to_pickle("demomerged.pkl")
# you will open with: DF=pd.read_pickle("demomerged.pkl")
# or, if you have link:
# from urllib.request import urlopen
# DF=pd.read_pickle(urlopen("https://..../demomerged.pkl"),compression=None)
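
Why a pickle and not a CSV? The pickle keeps the data types you set during formatting, while a CSV stores only text. A sketch with a hypothetical data frame `toy`, written to a temporary folder:

```python
import os
import tempfile
import pandas as pd

# hypothetical data with a categorical column
toy = pd.DataFrame({"Region": pd.Series(["Asia", "Europe"], dtype="category"),
                    "Score": [7.1, 8.3]})

with tempfile.TemporaryDirectory() as folder:
    pkl_path = os.path.join(folder, "toy.pkl")
    csv_path = os.path.join(folder, "toy.csv")
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# the category type survives the pickle, but not the CSV
print(from_pickle["Region"].dtype, from_csv["Region"].dtype)
```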

* For future use in R:

In [None]:
from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
base.saveRDS(demomerged,file="demomerged.RDS")

#In R, you can open it with: DF = readRDS("demomerged.RDS")
#or, if you read from a link: DF = readRDS(url("https://..../demomerged.RDS"))