<br> 
<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>


### Prof. José Manuel Magallanes, PhD

* Visiting Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, [magajm@uw.edu](mailto:magajm@uw.edu)

* Associate Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

_____

_____

<a id='home'></a>

# Introduction to Python

* DATA STRUCTURES
    - [Basic Data Structures](#basic_ds).

    - [Complex Data Structures](#complex_ds).

* DATA PREPROCESSING
    - [Data cleaning and formatting](#cleanformat).

    - [Integration and Saving](#integratesave).


<a id='basic_ds'></a>

# 1. Basic Data Structures

Python has basic native structures, like lists, tuples and dictionaries.

## a.  **LISTS** 

Lists are the most flexible structure to save or contain data elements.

In [1]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
ages=[32,33,28,30,29]
country=["China", "Senegal", "España", "Norway","Korea"]
education=["Bach", "Bach", "Master", "PhD","PhD"]

Above we have created some lists. Lists can contain any values. Lists support different operations:

* **Accessing**:

Keep in mind that positions in Python start at **0**.

In [2]:
# one element
ages[0]

32

In [3]:
# several elements, using slices:
ages[1:-1] #second until the one before last

[33, 28, 30]

In [4]:
# several elements, using slices:
ages[:-2] #all but two last ones

[32, 33, 28]

In [5]:
# non consecutive
from operator import itemgetter
list(itemgetter(0,2,3)(ages))

[32, 28, 30]

In [6]:
# difficult to understand?
ages[0:4:2] + [ages[3]]

[32, 28, 30]
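The slice in the last cell uses the `start:stop:step` form. Here is a minimal sketch with a stand-in list (`ages_demo` is a hypothetical name that copies the values of `ages`), showing what each part does and a more readable alternative for arbitrary positions:

```python
# stand-in list with the same values as `ages` (hypothetical name)
ages_demo = [32, 33, 28, 30, 29]

# start:stop:step -> from position 0 up to (not including) 4, jumping by 2
every_other = ages_demo[0:4:2]

# a list comprehension reads more directly for arbitrary positions
picked = [ages_demo[i] for i in (0, 2, 3)]

print(every_other)  # [32, 28]
print(picked)       # [32, 28, 30]
```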

* **Modifying**:

In [None]:
# by position
country[2]="Spain"

# list changed:
country

In [None]:
# by value
country=["PR China" if x == "China" else x for x in country]

# list changed:
country

* **Deleting**

In [None]:
# by position
del country[-1] #last value

# list changed:
country

In [None]:
# by position
names.pop() #last value by default

# list changed:
names

In [None]:
# only 'del' works for several positions

lista=[1,2,3,4,5,6]
del lista[1:3]

#now:
lista

In [None]:
# by value
ages.remove(29) 

# list changed:
ages # just the first occurrence of the value!!

In [None]:
# by value
education.remove('PhD') 

# list changed:
education # just the first occurrence!!

In [7]:
# deleting every occurrence of a value:

lista=[1,'a',45,'b','a']
lista=[x for x in lista if x!='a']

# you get:
lista

[1, 45, 'b']
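To compare both ways of deleting by value side by side, here is a small sketch with a hypothetical list (`degrees` is not part of the data above):

```python
# hypothetical list with a repeated value
degrees = ["PhD", "Bach", "PhD", "Master"]

# remove() deletes only the first match and changes the list in place
first_only = degrees.copy()
first_only.remove("PhD")

# a comprehension builds a new list without any match
none_left = [d for d in degrees if d != "PhD"]

print(first_only)  # ['Bach', 'PhD', 'Master']
print(none_left)   # ['Bach', 'Master']
```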

* **Inserting values**

In [8]:
# at the end
lista.append("abc")
lista

[1, 45, 'b', 'abc']

In [9]:
# In a particular  position
education.insert(2,"PhD")
education

['Bach', 'Bach', 'PhD', 'Master', 'PhD', 'PhD']

## b.  **TUPLES**

Tuples are immutable structures in Python. They look like lists but do not share much of their functionality:

In [10]:
# new tuple:
weekend=("Friday", "Saturday", "Sunday")

You can access:

In [None]:
weekend[0]

But deleting, modifying a value, and inserting a value IS NOT PERMITTED.
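A short sketch of what "not permitted" means in practice (`weekend_demo` is a hypothetical copy of the tuple above): the assignment raises a **TypeError**, and the usual workaround is to build a new tuple.

```python
# hypothetical copy of the tuple above
weekend_demo = ("Friday", "Saturday", "Sunday")

try:
    weekend_demo[0] = "Thursday"   # tuples do not support item assignment
    outcome = "modified"
except TypeError as error:
    outcome = "TypeError"
    print(error)

# workaround: create a new tuple instead of changing the old one
longer = weekend_demo + ("Monday",)
print(longer)
```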

Python itself uses tuples as output of some important functions:

In [None]:
zip(names,ages)

The **zip** function creates tuples by combining containers in parallel. You can see it if you turn the result into a list:

In [None]:
list(zip(names,ages))  # a list of tuples
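One detail worth keeping in mind: when the containers have different lengths, **zip** stops at the shortest one without any warning. A minimal sketch with hypothetical lists:

```python
# hypothetical lists of different length
names_demo = ["Qing", "Raúl", "Marie"]
ages_demo = [32, 28]

# zip pairs elements by position and stops at the shortest input
pairs = list(zip(names_demo, ages_demo))

print(pairs)  # [('Qing', 32), ('Raúl', 28)]
```

If you deleted elements from only one of the lists, check their lengths with `len` before zipping them.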

[home](#home)

______


<a id='complex_ds'></a>

# Complex Data Structures

## a. **DICTIONARIES**  

*Dicts* work in a more sophisticated way, as they have a **'key'**:**'value'** structure:

In [11]:
classroom={'student':names,'age':ages,'edu':education}
# see it:

classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie'],
 'age': [32, 33, 28, 30, 29],
 'edu': ['Bach', 'Bach', 'PhD', 'Master', 'PhD', 'PhD']}

Dicts do not use indexes to access values:

In [12]:
classroom[0]

KeyError: 0

Dicts use keys:

In [13]:
classroom['student']

['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie']

Notice that I created a dictionary where the value is not ONE but a LIST of values.

Once you access a value, you can modify it. You can also use _pop_ or _del_ with the **keys**. But you cannot use _append_ to add a new key; you need **update**:

In [14]:
classroom.update({'country':country})
# now:
classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie'],
 'age': [32, 33, 28, 30, 29],
 'edu': ['Bach', 'Bach', 'PhD', 'Master', 'PhD', 'PhD'],
 'country': ['China', 'Senegal', 'España', 'Norway', 'Korea']}
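The operations just mentioned can be sketched with a small hypothetical dict (`room` and its values are made up for illustration). Besides **update**, assigning to a new key also adds an element:

```python
# hypothetical dict of lists
room = {"student": ["Ana", "Luis"], "age": [20, 22]}

# two ways of adding a key
room["country"] = ["Peru", "Chile"]
room.update({"edu": ["Bach", "PhD"]})

# two ways of deleting a key
removed = room.pop("age")   # pop returns the value it deletes
del room["edu"]

print(removed)      # [20, 22]
print(list(room))   # ['student', 'country']
```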

Notice that the previous dicts could be represented as a table, but dicts are not limited to table-like structures:

In [15]:
student1={'names':'Peter','language':['english','spanish'],'age':14}
student2={'names':'Mary','language':['english','spanish','french'],'age':16}
student3={'names':'John','language':['english'],'age':15}

You can create this dict:

In [16]:
class1={'students':[student1,student2,student3]}
class1

{'students': [{'names': 'Peter',
   'language': ['english', 'spanish'],
   'age': 14},
  {'names': 'Mary', 'language': ['english', 'spanish', 'french'], 'age': 16},
  {'names': 'John', 'language': ['english'], 'age': 15}]}
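To read a value inside a nested structure like this one, you chain keys and positions from the outside in. A minimal sketch with a shortened, hypothetical copy of the dict:

```python
# shortened, hypothetical copy of the nested dict above
class_demo = {"students": [
    {"names": "Peter", "language": ["english", "spanish"], "age": 14},
    {"names": "Mary", "language": ["english", "spanish", "french"], "age": 16},
]}

# key -> position in the list -> key -> position in the inner list
second_language = class_demo["students"][0]["language"][1]

# one value per student, using a comprehension
all_names = [student["names"] for student in class_demo["students"]]

print(second_language)  # spanish
print(all_names)        # ['Peter', 'Mary']
```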

## b. DATA FRAMES

A **Data frame** is a complex container of values. The most common analogy is a spreadsheet. To create a data frame, we need to call **pandas**:

In [17]:
import pandas

We can prepare a data frame from a dictionary immediately, but ONLY if each list representing a column has the same number of elements.

In [18]:
# this dict can not be used for a data frame:
classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie'],
 'age': [32, 33, 28, 30, 29],
 'edu': ['Bach', 'Bach', 'PhD', 'Master', 'PhD', 'PhD'],
 'country': ['China', 'Senegal', 'España', 'Norway', 'Korea']}

In [19]:
# our data frame:
students=pandas.DataFrame(classroom)
## see it:
students

ValueError: arrays must all be same length

In that case, you need this:

In [20]:
#then
students=pandas.DataFrame({key:pandas.Series(value) for key, value in classroom.items()})

# seeing it:
students

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,PhD,España
3,Bjork,30.0,Master,Norway
4,Marie,29.0,PhD,Korea
5,,,PhD,
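Why does this work, and why did **age** become decimal? A small sketch with a hypothetical dict (`uneven`) suggests the reason: each list becomes a Series, pandas aligns the Series by index, and the gaps are filled with NaN, which forces integer columns to float.

```python
import pandas as pd

# hypothetical dict whose lists have different lengths
uneven = {"student": ["Ana", "Luis"],
          "age": [20, 22],
          "edu": ["Bach", "PhD", "Master"]}

# each list becomes a Series; pandas aligns them by index and pads with NaN
frame = pd.DataFrame({key: pd.Series(value) for key, value in uneven.items()})

print(frame.shape)         # (3, 3)
print(frame["age"].dtype)  # float64, because NaN cannot be stored as an integer
```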


Python users usually shorten the library name when importing it:

In [21]:
import pandas as pd # renaming the library

students=pd.DataFrame({key:pd.Series(value) for key, value in classroom.items()})
students

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,PhD,España
3,Bjork,30.0,Master,Norway
4,Marie,29.0,PhD,Korea
5,,,PhD,


### Data frame basic operations

In [None]:
# type of structure: list? tuple? dataframe?
type(students)

In [None]:
# type of data in data frame column
students.dtypes

In [None]:
# details of data frame
students.info()

In [None]:
# number of rows and columns
students.shape 

In [None]:
# number of rows:
len(students) 

In [None]:
# first rows
students.head(2) # compare with: students.tail(2)

In [None]:
# name of columns
students.columns

If you needed the column names as a list:

In [None]:
students.columns.tolist()# or simply: list(students)

If you needed a column values as a list:

In [None]:
students.age.tolist() # or: list(students.age)

### Accessing elements in DF:

The data frames in pandas behave much like in R:

In [None]:
#one particular column
students.student

In [None]:
# or
students['student'] 

In [None]:
# it is not the same as: 
students[['student']] # a data frame, not a column (or series)

In [None]:
# this is also a DF
students[['country','student']]

In [None]:
# and this, using loc:
columnNames=['country','student']
students.loc[:,columnNames]

In [None]:
## Using positions is very common:
columnPositions=[1,3,0]
students.iloc[:,columnPositions] 
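The difference between single and double brackets, and between **loc** and **iloc**, can be checked on a small hypothetical data frame (`df` and its values are made up for illustration):

```python
import pandas as pd

# hypothetical data frame
df = pd.DataFrame({"student": ["Ana", "Luis"],
                   "age": [20, 22],
                   "country": ["Peru", "Chile"]})

as_series = df["student"]     # single brackets: a Series
as_frame = df[["student"]]    # double brackets: a data frame with one column

by_label = df.loc[:, ["country", "student"]]   # columns by name
by_position = df.iloc[:, [2, 0]]               # same columns by position

print(type(as_series).__name__, type(as_frame).__name__)  # Series DataFrame
print(by_label.equals(by_position))                        # True
```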

### Changing values

If you have a position, you can update values:

In [None]:
students.iloc[4,1]=23 # change is immediate! (no warning)
students

### Deleting columns

You can modify any values in a data frame, but let me create a **deep** copy of this data frame to play with:

In [None]:
studentsCopy=students.copy()
studentsCopy

In [None]:
# This is what you want to get rid of:
byeColumns=['edu'] # you can delete more than one

#this is the result
studentsCopy.drop(columns=byeColumns)

Notice that the previous result was not saved:

In [None]:
studentsCopy

In [None]:
#NOW we do, using 'inplace'
studentsCopy.drop(columns=byeColumns,inplace=True)

In [None]:
#then:
studentsCopy

### Deleting a row

Let me delete a row:

In [None]:
studentsCopy.drop(index=2,inplace=True) 
studentsCopy

As you see, that index value disappeared. Then, you should reset the indexes:

In [None]:
studentsCopy.reset_index(drop=True,inplace=True)
studentsCopy
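If you prefer not to use **inplace**, the same two steps can be written by saving each result in a new object. A minimal sketch with a hypothetical data frame:

```python
import pandas as pd

# hypothetical data frame
df = pd.DataFrame({"student": ["Ana", "Luis", "Eva"], "age": [20, 22, 21]})

# drop returns a new data frame; df itself is untouched
trimmed = df.drop(index=1)

# the row labels keep a gap until you reset them
renumbered = trimmed.reset_index(drop=True)

print(list(trimmed.index))     # [0, 2]
print(list(renumbered.index))  # [0, 1]
print(len(df))                 # 3
```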


[home](#home)
_____

<a id='part2'></a>


# 2.  Data Preprocessing

<a id='beginning'></a>

Preprocessing includes three stages:

a. **Cleaning**: Cleaning requires that every cell has the right value, and that the dataframe has only the contents needed. Having a clean data frame means:

    1. Verify that headers are at the top of data frame, and well written.

    2. Verify that category levels are well written.
    
<br>
    
b. **Formatting**: Formatting requires:
    
    1. Verifying data types
    
    2. Correcting data types: Numerical, categorical, text, date.
    
<br>
    
c. **Integrating and Saving**: It is the process of combining several dataframes into one, and saving it into a file that can be the input of future processes.


_____

<a id='cleanformat'></a>

## a. CLEANING

Let's start by bringing in a table from Wikipedia:

In [22]:
# you need to have 'lxml' and 'beautiful soup' previously installed.

LINK_to_WIKIPAGE="https://en.wikipedia.org/wiki/Democracy_Index"

democracy=pd.read_html(LINK_to_WIKIPAGE)

The object **democracy** is not a data frame; **read_html** returns a list of data frames, one per table found on the page:

In [23]:
type(democracy)

list

This list has this amount of elements:

In [24]:
len(democracy)

10

We can shorten the list of results by telling **read_html** what kind of table we want:

In [26]:
democracy=pd.read_html(LINK_to_WIKIPAGE, header=0,attrs={"class":"wikitable sortable"})
len(democracy)

5

Then, our table is:

In [28]:
democracy[4]

Unnamed: 0,Rank,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Δ Rank,Country,Regime type,Overall score,Δ Score,Elec­toral pro­cess and plura­lism,Func­tioning of govern­ment,Poli­tical partici­pation,Poli­tical cul­ture,Civil liber­ties
0,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies
1,1,,Norway,Full democracy,9.81,0.06,10.00,9.64,10.00,10.00,9.41
2,2,,Iceland,Full democracy,9.37,0.21,10.00,8.57,8.89,10.00,9.41
3,3,,Sweden,Full democracy,9.26,0.13,9.58,9.29,8.33,10.00,9.12
4,4,,New Zealand,Full democracy,9.25,0.01,10.00,8.93,8.89,8.75,9.71
...,...,...,...,...,...,...,...,...,...,...,...
166,163,,Chad,Authoritarian,1.55,0.06,0.00,0.00,1.67,3.75,2.35
167,164,,Syria,Authoritarian,1.43,,0.00,0.00,2.78,4.38,0.00
168,165,,Central African Republic,Authoritarian,1.32,,1.25,0.00,1.11,1.88,2.35
169,166,,Democratic Republic of the Congo,Authoritarian,1.13,,0.00,0.00,1.67,3.13,0.88


Let's create an object for that data frame:

In [30]:
demodata=democracy[4].copy()

1. **Verify that headers are at the top of data frame, and well written**.

In [36]:
# headers are in the right position
# are they well written?
demodata.columns#.to_list()

Index(['Rank',
       '.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Δ Rank',
       'Country', 'Regime type', 'Overall score', 'Δ Score',
       'Elec­toral pro­cess and plura­lism', 'Func­tioning of govern­ment',
       'Poli­tical partici­pation', 'Poli­tical cul­ture', 'Civil liber­ties'],
      dtype='object')

Notice the presence of weird symbols. Let me save these original column names:

In [None]:
old=demodata.columns.to_list()

You could change the names with a **brute-force** approach (editing them by hand, without coding), but let's use **regular expressions** (regex):

* Replacing:

In [None]:
import re

# \s matches any whitespace character (space, tab, newline)
# replace (substitute) each whitespace by an underscore
old_good=[re.sub(r"\s","_",name) for name in old]
old_good

In [None]:
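# first drop leaked CSS (everything up to the closing brace), then any non-word character
old_better=[re.sub(r'\W+', '', re.sub(r'^.*\}', '', element)) for element in old_good]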
old_better
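To see what each pattern does, here is a small sketch on hypothetical header names. The second one contains soft hyphens (written as `\u00ad`), an invisible character like the one hiding in the Wikipedia headers:

```python
import re

# hypothetical headers; \u00ad is an invisible soft hyphen
raw_headers = ["Overall score", "Elec\u00adtoral pro\u00adcess", "Δ Score"]

# step 1: whitespace becomes underscore
step1 = [re.sub(r"\s", "_", header) for header in raw_headers]

# step 2: \W+ removes runs of characters that are not letters, digits or underscore
step2 = [re.sub(r"\W+", "", header) for header in step1]

print(step2)  # ['Overall_score', 'Electoral_process', 'Δ_Score']
```

Notice that the underscore and the Greek letter survive, because both count as "word" characters for **\W**.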

These are the original ones:

In [None]:
demodata.columns

Let's make a dict with the zip:

In [None]:
dict(zip(demodata.columns,old_better))

The previous dict can serve to tell python how to rename all columns:

In [None]:
# just saving the dict
changes=dict(zip(demodata.columns,old_better))

#using the dict
demodata.rename(columns=changes)

The previous output shows the result of renaming, but **demodata** itself has not changed:

In [None]:
demodata

You can make those changes using the **inplace=True** argument:

In [None]:
demodata.rename(columns=changes,inplace=True)

In [None]:
# you see:
demodata

You can change just one column name by doing:

In [None]:
demodata.rename(columns={'Overall_score':'DemoIndex'},inplace=True)
demodata

* Getting rid of columns unneeded:

At this point, you should be ready to delete some columns (or keep just the ones you need). I want to keep these positions:

In [None]:
list(range(2,11)) # positions 2 through 10 (the end, 11, is not included)

In [None]:
# strategy 1: subsetting
keepPositions=list(range(2,11)) # same positions shown above
demodata.iloc[:,keepPositions]

In [None]:
# strategy 2: dropping

#remember the names
demodata.columns

In [None]:
#use names to drop
bye=['Rank','Δ_Rank','Δ_Score']
demodata.drop(columns=bye)

Let's use strategy 2 this time:

In [None]:
# result is a new data frame ('inplace' not needed)
demodataok=demodata.drop(columns=bye)
demodataok

2. **Verify that category levels are well written**.

This requires getting frequency tables:

In [None]:
demodata.Regime_type.value_counts()

The **level names** are not OK. The original table is divided into sections, and each section starts with a row that repeats the section title in every cell (as you can see in the first row). I will solve this during formatting.

## b. FORMATTING

1. Verifying data types

In [None]:
demodataok.info()

The column **Country** is text, so the **object** data type is OK. 
The column **Regime_type** should be ordinal, and the rest should be **numerical**; but they are not.

2. **Correcting data types**:

Let's turn **Regime_type** into ordinal:

In [None]:
demodataok.Regime_type.value_counts()

For the case of ordinal data, these are the right steps:

In [None]:
from pandas.api.types import CategoricalDtype

# create a list with the right order (ascending)
levels=["Authoritarian","Hybrid regime","Flawed democracy","Full democracy"]

# create levels as a 'customized' ordinal data type
levelsRegime=CategoricalDtype(categories=levels,ordered=True)

# make the change:
demodataok.Regime_type=demodataok.Regime_type.astype(levelsRegime)

See the difference:

In [None]:
demodataok.Regime_type

Notice that the first row is now a missing value. Since the wrong values are not part of the **levels**, they were turned into NaN. The data type for that column is now OK:

In [None]:
demodataok.info()
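What you gain with an ordinal type can be sketched with hypothetical levels (`sizes` and `raw` are made up for illustration): values outside the levels become missing, and comparisons follow the order you declared.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# hypothetical levels, listed from lowest to highest
sizes = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)

# "Sizes" plays the role of a section-title row: it is not one of the levels
raw = pd.Series(["medium", "Sizes", "large", "small"])
ordinal = raw.astype(sizes)

print(ordinal.isna().tolist())       # [False, True, False, False]
print((ordinal > "small").tolist())  # [True, False, True, False]
print(ordinal.max())                 # large
```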

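Let me turn the other columns to numeric. Pandas has a function **to_numeric**, but it works on ONE column at a time, so giving it several columns raises an error:


In [None]:
# this call fails: to_numeric does not accept a data frame
pd.to_numeric(demodataok.iloc[:,2:])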

Then we need to **apply** this function to several columns:

In [None]:
demodataok.iloc[:,2:].apply(pd.to_numeric, errors='coerce')

Notice the argument **errors='coerce'**: if the function cannot turn a value into a number, it returns a missing value (NaN) instead of raising an error.
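A minimal sketch of this behavior on a hypothetical data frame of text columns (`raw` is made up for illustration):

```python
import pandas as pd

# hypothetical data frame where numbers arrived as text
raw = pd.DataFrame({"a": ["1.5", "x", "3"],
                    "b": ["10", "20", "n/a"]})

# apply runs to_numeric column by column; extra arguments are passed along
numeric = raw.apply(pd.to_numeric, errors="coerce")

print(numeric["a"].isna().tolist())  # [False, True, False]
print(numeric["b"].tolist())         # [10.0, 20.0, nan]
```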

Let's save the result now:

In [None]:
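numCols=demodataok.columns[2:] # assigning by column names replaces the columns, so the new data type is kept
demodataok[numCols]=demodataok[numCols].apply(pd.to_numeric, errors='coerce')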

In [None]:
# now you get:

demodataok.info()

Remember that we have rows with  missing values:

In [None]:
demodataok

Here, we need to get rid of the rows that have missing values:

In [None]:
# Let's use "dropna" with "inplace" (how='any' drops a row if at least one cell is missing):
demodataok.dropna(how='any',inplace=True)

# you get:
demodataok

Every time you drop rows, you should **reset** the row indexes:

In [None]:
demodataok.reset_index(drop=True,inplace=True)

Let's get one more data table on Military expenditure:

In [None]:
linkmil="https://www.cia.gov/the-world-factbook/field/military-expenditures/country-comparison"
milmoney=pd.read_html(linkmil)

In [None]:
milmoney[0]

This is an almost clean data frame:

In [None]:
#make new data frame
mildata=milmoney[0].copy()

#get rid of columns unneeded
mildata.drop(columns=['Rank','Date of Information'],inplace=True)

# rename one column
mildata.rename(columns={'% of GDP':'mil_expend'},inplace=True)

# check data types
mildata.info()

This last dataframe is done!

[home](#home)

______

<a id='integratesave'></a>

## c. INTEGRATING AND SAVING

Let me open this file:

In [None]:
codesHDI=pd.read_excel("countryCodesHDI.xlsx")
codesHDI

My goal is to add the info from **codesHDI** to  **demodataok**. Let's try this:

In [None]:
# how='outer' will use all the rows from both data frames
# indicator=True will tell you when you had matches or not.

demohdi=demodataok.merge(codesHDI,left_on='Country', right_on="NAME",how='outer',indicator=True)
demohdi

The last column **_merge** informs if there were matches or not; and also gives details where there was not a match. Let's see what both have in common:

In [None]:
demohdi[demohdi._merge=='both'] # using the indicator!

Let's see what countries in **demodataok** did not find a match in **codesHDI**:

In [None]:
demohdi[demohdi._merge=='left_only']
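The whole strategy (outer merge, inspect the indicator, fix names, merge again) can be sketched with two tiny hypothetical tables. The names `scores` and `codes`, and their values, are assumptions for illustration only:

```python
import pandas as pd

# hypothetical tables whose country names do not fully agree
scores = pd.DataFrame({"Country": ["Peru", "South Korea"], "score": [6.0, 8.0]})
codes = pd.DataFrame({"NAME": ["Peru", "Korea, Republic of"], "iso3": ["PER", "KOR"]})

# outer merge keeps every row; the indicator says where each row came from
merged = scores.merge(codes, left_on="Country", right_on="NAME",
                      how="outer", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
print(unmatched)  # ['South Korea']

# fix the name on the left table and merge again (inner merge by default)
fixed = scores.assign(Country=scores["Country"].replace(
    {"South Korea": "Korea, Republic of"}))
matched = fixed.merge(codes, left_on="Country", right_on="NAME")
print(len(matched))  # 2
```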

Let's see if those countries exist in **codesHDI** but with different name:

In [None]:
codesHDI[codesHDI.NAME.str.contains('Korea|Macedonia|Moldo|Tanza|Ivo|Burm|Swazi|Congo|Iran|Viet|Lao|Liby|Syr')]
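The pattern in that cell is a regular expression where the vertical bar means "or". A minimal sketch on a hypothetical Series of names:

```python
import pandas as pd

# hypothetical country names
names = pd.Series(["Viet Nam", "Republic of Moldova", "Peru", "Congo"])

# True when any of the alternatives appears anywhere in the text
mask = names.str.contains("Viet|Moldo|Congo")

print(names[mask].tolist())  # ['Viet Nam', 'Republic of Moldova', 'Congo']
```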

The function **str.contains** checks whether a pattern appears anywhere in each string (it need not be the whole value); the vertical bar (|) separates alternative patterns. Now you go manual:

In [None]:
# the names in demohdi that did not find a match
currentNamesLeft=demohdi[demohdi._merge=='left_only'].Country.to_list()

# the names that will replace the ones in demohdi.
currentNamesRight=['Korea, Republic of','The former Yugoslav Republic of Macedonia','Republic of Moldova',
                  'United Republic of Tanzania',"Cote d'Ivoire",'Burma','Swaziland','Congo','Viet Nam',
                  'Iran (Islamic Republic of)',"Lao People's Democratic Republic",
                  'Libyan Arab Jamahiriya','Syrian Arab Republic',"Korea, Democratic People's Republic of"]

We will replace names. Since **zip** pairs elements by position, check in the dict of replacements that every name got the right partner:

In [None]:
dict(zip(currentNamesLeft,currentNamesRight))

Let's replace in **demodataok**:

In [None]:
###dictionary of replacements:
replacementsForDemo=dict(zip(currentNamesLeft,currentNamesRight))

### replacing
# assigning the result back is safer than using 'inplace' on a single column
demodataok['Country']=demodataok.Country.replace(replacementsForDemo)

We have altered names of countries in **demodataok**. Let's redo the merge:

In [None]:
demohdi=demodataok.merge(codesHDI,left_on='Country', right_on="NAME")
demohdi

In [None]:
#you can drop the repeated column:
demohdi.drop(columns="NAME",inplace=True)

Let me add the data on military expenses...

In [None]:
demohdimil=demohdi.merge(mildata) #key columns have the same name
demohdimil

You see again that we lost some rows in the merge. 

Exercise: Try to recover some rows.

Now, let's save the current **demohdimil** for future Python use, and for future R use:

* For future Python use:

In [None]:
demohdimil.to_pickle("demohdimil.pkl")
# you will open with: DF=pd.read_pickle("demohdimil.pkl")
# or, if you have link:
# from urllib.request import urlopen
# DF=pd.read_pickle(urlopen("https://..../demohdimil.pkl"),compression=None)
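One reason to prefer a pickle file for future Python use is that it keeps the data types you worked on. The sketch below is an illustration under assumptions (a tiny hypothetical data frame, written to a temporary folder); it compares a pickle round trip with a CSV round trip:

```python
import os
import tempfile

import pandas as pd
from pandas.api.types import CategoricalDtype

# hypothetical data frame with an ordinal column
regime_type = CategoricalDtype(categories=["Authoritarian", "Full democracy"],
                               ordered=True)
small = pd.DataFrame({
    "Country": ["Chad", "Norway"],
    "Regime_type": pd.Series(["Authoritarian", "Full democracy"]).astype(regime_type),
})

# write and read both formats inside a temporary folder
with tempfile.TemporaryDirectory() as folder:
    pkl_path = os.path.join(folder, "small.pkl")
    csv_path = os.path.join(folder, "small.csv")
    small.to_pickle(pkl_path)
    small.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# the pickle keeps the ordinal type; the CSV version comes back as plain text
print(from_pickle["Regime_type"].dtype == regime_type)
print(isinstance(from_csv["Regime_type"].dtype, CategoricalDtype))
```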

* For future use in R (install rpy2):

In [None]:
#import os
#os.environ['R_HOME'] = '/Library/Frameworks/R.framework/Resources'

from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
base.saveRDS(demohdimil,file="demohdimil.RDS")

#In R, you can open it with: DF = readRDS("demohdimil.RDS")
#or, if you read from a link: DF = readRDS(url("https://..../demohdimil.RDS"))

* An Excel version (install *openpyxl* and *xlrd*)

In [None]:
demohdimil.to_excel("demohdimil.xlsx",index=False)

FINALLY, let me use Python to alter a map in shapefile format (compressed). You need to install **geopandas**. Then: 

In [None]:
import geopandas as gpd

#zipfile from github
mapl='https://github.com/UW-eScience-WinterSchool/Python_Session/raw/main/TM_WORLD_BORDERS_SIMPL-0.3.zip'

#reading file
mymap=gpd.read_file(mapl)


You can see the result here:

In [None]:
mymap

Maps have metadata, you want to know the projection information:

In [None]:
mymap.crs

Let me export the **geoDataFrame** as a **geojson** file:

In [None]:
mymap.to_file("mymap.geojson", driver='GeoJSON')

The error tells you that _somewhere_ you have a **bytes** data type, which is not allowed in geojson. You do not see it here:

In [None]:
mymap.info()

Then I need to make a loop to find where it is:

In [None]:
for column in mymap:
    for cell in mymap[column]:
        if type(cell)==bytes:
            print(column,cell)

Once you find the column:

In [None]:
#change type to string
mymap.NAME = mymap.NAME.astype(str)
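A possible alternative, sketched here on a plain pandas data frame instead of the map: in Python, calling `str()` on a bytes value keeps the `b'...'` wrapper in the text, so you may prefer to decode only the bytes cells. The table below is hypothetical, and UTF-8 is an assumption about the encoding of the file:

```python
import pandas as pd

# hypothetical table mixing bytes and str in one column
table = pd.DataFrame({"NAME": [b"Peru", "Chile"], "AREA": [1285, 756]})

# columns that hold at least one bytes cell
bytes_columns = [col for col in table.columns
                 if table[col].map(lambda cell: isinstance(cell, bytes)).any()]

# decode only the bytes cells; UTF-8 is an assumption about the encoding
for col in bytes_columns:
    table[col] = table[col].map(
        lambda cell: cell.decode("utf-8", errors="replace")
        if isinstance(cell, bytes) else cell)

print(bytes_columns)           # ['NAME']
print(table["NAME"].tolist())  # ['Peru', 'Chile']
```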

You should not have further problems:

In [None]:
mymap.to_file("mymap.geojson", driver='GeoJSON')