<br> 
<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>

# [Session 1: Introduction to Python](https://github.com/eScienceWinterSchool/RRStudioSession)


_____

### Prof. José Manuel Magallanes, PhD

* Associate Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

* Visiting Associate Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, [magajm@uw.edu](mailto:magajm@uw.edu)
_____



# 1.  Data Structures

Python has basic native structures, like lists, tuples and dictionaries.

## A.  **LISTS** 

Lists are the most flexible structure to save or contain data elements.

In [None]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
ages=[32,33,28,30,29]
country=["China", "Senegal", "Spain", "Norway","Korea"]
education=["Bach", "Bach", "Master", "PhD","PhD"]

Above we have created some lists. Lists can contain any values. Lists support different operations:

* **Accessing**:

In [None]:
# one element
ages[0]

In [None]:
# several, using slices:
ages[1:-1] #second to before last

In [None]:
# several, using slices:
ages[:-2] #all but two last ones
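
A slice `start:stop` includes `start` and excludes `stop`, and negative positions count from the end. A minimal sketch with a throwaway list (the name `demo_ages` is only for illustration, so the lists above stay untouched):

```python
# a separate list, so the lists above are left untouched
demo_ages = [32, 33, 28, 30, 29]

first = demo_ages[0]           # positions start at 0
middle = demo_ages[1:-1]       # from the second element up to, not including, the last
all_but_two = demo_ages[:-2]   # everything except the last two
last_two = demo_ages[-2:]      # only the last two

print(first, middle, all_but_two, last_two)
```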

* **Modifying**:

In [None]:
# by position
country[2]="España"

# list changed:
country

In [None]:
# by value
country=["PR China" if x == "China" else x for x in country]

# list changed:
country

* **Deleting**

In [None]:
# by position
del country[-1]
# list changed:
country

In [None]:
# by position
names.pop(-1)
# list changed:
names

In [None]:
# only 'del' works for several positions

lista=[1,2,3,4,5,6]
del lista[1:3]
lista

In [None]:
# by value
ages.remove(29) 
# list changed:
ages # just first occurrence!!

In [None]:
# by value
education.remove('PhD') 
# list changed:
education # just first occurrence!!

In [None]:
# all values:

lista=[1,'a',45,'b','a']
lista=[x for x in lista if x!='a']
lista
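
To see the difference between deleting the first match and deleting every match side by side, here is a small sketch (the names `demo`, `first_only` and `none_left` are made up for this example):

```python
demo = [1, 'a', 45, 'b', 'a']

# remove() deletes only the first match and modifies the list in place
first_only = demo.copy()
first_only.remove('a')

# a list comprehension builds a new list without any match
none_left = [x for x in demo if x != 'a']

print(first_only)
print(none_left)
```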

* **Inserting values**

In [None]:
# at the end
lista.append("abc")
lista

In [None]:
# insert in some other place
# first delete
education.pop(2)
education

In [None]:
# now insert
education.insert(2,"Master")
education

## B.  **TUPLES**

Tuples are immutable structures in Python. They look like lists but share little of their functionality:

In [None]:
# new tuple:
weekend=("Friday", "Saturday", "Sunday")

You can access:

In [None]:
weekend[0]

But no operation that changes the tuple (modifying, deleting or inserting elements) is allowed.
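
A quick sketch of what happens when you try to change a tuple, using a hypothetical `demo_weekend` so the tuple above is not affected:

```python
demo_weekend = ("Friday", "Saturday", "Sunday")

try:
    demo_weekend[0] = "Thursday"   # tuples do not support item assignment
    changed = True
except TypeError as error:
    changed = False
    print("Not allowed:", error)

# to "change" a tuple you build a new one
longer_weekend = ("Thursday",) + demo_weekend
print(longer_weekend)
```

The original tuple is still the same after this; only the new name holds the longer version.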

Python itself uses tuples as output of some important functions:

In [None]:
zip(names,ages)

The **zip** function creates tuples by combining the lists in parallel. You can see it if you turn the result into a list:

In [None]:
list(zip(names,ages))  # a list of tuples
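
Two details of **zip** are worth a small sketch: it stops at the shortest input, and it can be reversed with `zip(*...)`. The lists `some_names` and `some_ages` are invented for this example:

```python
some_names = ["Qing", "Françoise", "Raúl"]
some_ages = [32, 33]           # one element shorter on purpose

pairs = list(zip(some_names, some_ages))   # zip stops at the shortest input
print(pairs)

# zip(*pairs) goes the other way: from pairs back to parallel tuples
names_back, ages_back = zip(*pairs)
print(names_back, ages_back)
```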

## C. **DICTIONARIES**  

*Dicts* work in a more sophisticated way, as they have a **'key'**:**'value'** structure:

In [105]:
classroom={'student':names,'age':ages,'edu':education}
# see it:

classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie'],
 'age': [32, 33, 28, 30],
 'edu': ['Bach', 'Bach', 'Master', 'PhD']}

Dictionaries do not use indexes to access values:

In [None]:
#classroom[0]

They use keys:

In [None]:
classroom['student']

Notice I created a dictionary where the value is not ONE but a LIST of values.

Once you access a value, you can modify it. You can also use _pop_ or _del_. But you cannot use _append_ to add a new key; you need **update**:

In [107]:
classroom.update({'country':country})
# now:
classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie'],
 'age': [32, 33, 28, 30],
 'edu': ['Bach', 'Bach', 'Master', 'PhD'],
 'country': ['PR China', 'Senegal', 'España', 'Norway']}
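
The operations just mentioned can be sketched on a small dictionary. Note that plain assignment with a new key has the same effect as **update** for a single key; `demo_class` is a made-up dictionary, not the `classroom` above:

```python
demo_class = {'student': ['Qing', 'Raúl'], 'age': [32, 28]}

# adding a key: plain assignment does the same as update() for one key
demo_class['edu'] = ['Bach', 'Master']

# modifying a value: here the value is a list, so list methods apply
demo_class['age'].append(30)
ages_now = list(demo_class['age'])

# deleting: pop() returns the removed value, del just removes it
removed = demo_class.pop('edu')
del demo_class['age']

print(demo_class, removed, ages_now)
```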

## D. DATA FRAMES

**Data frames**  are more complex containers of values. The most common analogy is a spreadsheet. To create a data frame, we need to call **pandas**:

In [None]:
import pandas

We can prepare a data frame from a dictionary immediately, but ONLY if you have the same number of elements in each list representing a column.

In [110]:
# so in this case you get an error:
#students=pandas.DataFrame(classroom)

In our case, we had to be more explicit:

In [111]:
#then
students=pandas.DataFrame({key:pandas.Series(value) for key, value in classroom.items()})

# seeing it:
students

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,España
3,Bjork,30.0,PhD,Norway
4,Marie,,,
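
Here is a small, self-contained sketch of both situations: the error with lists of different lengths, and the padding you get when each list is wrapped in a `pandas.Series`. The dictionary `uneven` is invented for the example:

```python
import pandas as pd

uneven = {'student': ['Qing', 'Raúl', 'Marie'], 'age': [32, 28]}

try:
    pd.DataFrame(uneven)       # lists of different lengths
    built = True
except ValueError as error:
    built = False
    print("pandas complains:", error)

# wrapping each list in a Series aligns them by position and pads with NaN
padded = pd.DataFrame({key: pd.Series(value) for key, value in uneven.items()})
print(padded)
```

The padding also explains why `age` shows up as a float column: the missing value `NaN` is a float.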


Sometimes, Python users code like this:

In [112]:
import pandas as pd # renaming the library

students=pd.DataFrame({key:pd.Series(value) for key, value in classroom.items()})
students

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,España
3,Bjork,30.0,PhD,Norway
4,Marie,,,


### Data frame basic operations

In [115]:
# type of structure: list? tuple? dataframe?
type(students)

pandas.core.frame.DataFrame

In [114]:
# type of data in data frame column
students.dtypes

student     object
age        float64
edu         object
country     object
dtype: object

In [116]:
# details of data frame
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
student    5 non-null object
age        4 non-null float64
edu        4 non-null object
country    4 non-null object
dtypes: float64(1), object(3)
memory usage: 288.0+ bytes


In [117]:
# number of rows and columns
students.shape 

(5, 4)

In [119]:
# number of rows:
len(students) 

5

In [120]:
# first rows
students.head(2) # compare with: students.tail(2)

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal


In [121]:
# name of columns
students.columns

Index(['student', 'age', 'edu', 'country'], dtype='object')

If you needed the column names as a list:

In [122]:
students.columns.tolist()# or simply: list(students)

['student', 'age', 'edu', 'country']

If you needed a column's values as a list:

In [124]:
students.age.tolist()# or: list(students.age)

[32.0, 33.0, 28.0, 30.0, nan]

### Accessing elements in DF:

The data frames in pandas behave much like in R:

In [125]:
#one particular column
students.student

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [126]:
# or
students['student'] 

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [127]:
# it is not the same as: 
students[['student']] # a data frame, not a column (or series)

Unnamed: 0,student
0,Qing
1,Françoise
2,Raúl
3,Bjork
4,Marie


In [129]:
# this is also a DF
students[['country','student']]

Unnamed: 0,country,student
0,PR China,Qing
1,Senegal,Françoise
2,España,Raúl
3,Norway,Bjork
4,,Marie


In [130]:
## Using positions is the best way to get several columns:
students.iloc[:,1:4]

Unnamed: 0,age,edu,country
0,32.0,Bach,PR China
1,33.0,Bach,Senegal
2,28.0,Master,España
3,30.0,PhD,Norway
4,,,
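
Positions are not the only option: `loc` accepts a slice of column labels. The sketch below, on a made-up `demo` data frame, compares both and also recaps the single versus double bracket difference. Keep in mind that a label slice includes both ends, while a position slice excludes the stop:

```python
import pandas as pd

demo = pd.DataFrame({'student': ['Qing', 'Raúl'],
                     'age': [32, 28],
                     'edu': ['Bach', 'Master'],
                     'country': ['China', 'Spain']})

by_position = demo.iloc[:, 1:4]           # positions 1, 2, 3 (4 is excluded)
by_label = demo.loc[:, 'age':'country']   # label slices include both ends

one_column = demo['student']      # single brackets: a Series
one_col_df = demo[['student']]    # double brackets: a DataFrame

print(by_position.equals(by_label), type(one_column).__name__, type(one_col_df).__name__)
```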


### Changing values

You can modify any values in a data frame, but let me first create a **deep** copy of this data frame to play with:

In [131]:
studentsCopy=students.copy()
studentsCopy

Unnamed: 0,student,age,edu,country
0,Qing,32.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,España
3,Bjork,30.0,PhD,Norway
4,Marie,,,


If you have a position, you can update values:

In [133]:
studentsCopy.iloc[0,1]=23 # change is immediate! (no warning)
studentsCopy

Unnamed: 0,student,age,edu,country
0,Qing,23.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,España
3,Bjork,30.0,PhD,Norway
4,Marie,,,


### Deleting columns

Use **drop** to delete columns:

In [135]:
# This is what you want to get rid of:
byeColumns=['edu'] # you can delete more than one

# with inplace=False (the default) the original is NOT changed;
# drop returns a new data frame instead
# axis=1 means delete by column
studentsCopy.drop(byeColumns,axis=1,inplace=False)

# the original still has the 'edu' column:
studentsCopy

Unnamed: 0,student,age,edu,country
0,Qing,23.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Raúl,28.0,Master,España
3,Bjork,30.0,PhD,Norway
4,Marie,,,
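
Since `drop` without `inplace=True` returns a new data frame, you need to save the result if you want to keep it. A minimal sketch on a made-up `demo` data frame:

```python
import pandas as pd

demo = pd.DataFrame({'student': ['Qing', 'Raúl'], 'age': [32, 28], 'edu': ['Bach', 'Master']})

# without inplace=True, drop() leaves 'demo' alone and returns a new data frame
smaller = demo.drop(columns=['edu'])

print(demo.columns.tolist())      # still has 'edu'
print(smaller.columns.tolist())   # 'edu' is gone
```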


### Deleting a row

Let me delete a row:

In [136]:
# axis 0 is delete by row
studentsCopy.drop(index=2,axis=0,inplace=True) 
studentsCopy

Unnamed: 0,student,age,edu,country
0,Qing,23.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
3,Bjork,30.0,PhD,Norway
4,Marie,,,


As you see, index 2 disappeared with the row. You should then reset the index:

In [137]:
studentsCopy.reset_index(drop=True,inplace=True)
studentsCopy

Unnamed: 0,student,age,edu,country
0,Qing,23.0,Bach,PR China
1,Françoise,33.0,Bach,Senegal
2,Bjork,30.0,PhD,Norway
3,Marie,,,


----
_____

<a id='part2'></a>


# 2.  Data Preprocessing

<a id='beginning'></a>

Preprocessing includes three stages:

1. [Cleaning](#part2) 
2. [Formatting](#part3) 
3. [Integrating](#part4) 


<a id='part2'></a>

## Cleaning

Cleaning requires two strategies:

* Detect patterns via REGEX approach
* Divide-and-conquer approach

For this part, I will use two data sets:

1. An Excel file from the United Nations on the Human Development Index per country, available at [https://github.com/eScienceWinterSchool/data/raw/master/HDI_2018.xlsx](https://github.com/eScienceWinterSchool/data/raw/master/HDI_2018.xlsx).

2. A data table from CIA on CO2 emissions per country available at [https://www.cia.gov/library/publications/the-world-factbook/fields/274.html](https://www.cia.gov/library/publications/the-world-factbook/fields/274.html).

Having a clean data frame means:

a. Verify that headers are well read, well written and are at the top of data frame.

b. Verify that the last lines of the data frame are just data.

c. Verify that every row is about a unit of analysis.

d. Verify that each cell has a category or a number well written.


Let's start with the Excel data from UN:

In [426]:
link1="https://github.com/eScienceWinterSchool/data/raw/master/HDI_2018.xlsx"
hdi=pd.read_excel(link1)

In [427]:
# Is the header in the right place:
hdi.head(10)

Unnamed: 0.1,Unnamed: 0,Table 1. Human Development Index and its components,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
0,,,,,,,,,,,,,,,
1,,,,,SDG3,,SDG4.3,,SDG4.6,,SDG8.5,,,,
2,,,,,,,,,,,,,,,
3,,,Human development index (HDI),,Life expectancy at birth,,Expected years of schooling,,Mean years of schooling,,Gross national income (GNI) per capita,,GNI per capita rank minus HDI rank,,HDI rank
4,HDI rank,Country,(index value),,(years),,(years),,(years),,(2011 PPP $),,,,
5,,,2018,,2018,,2018,a,2018,a,2018,,2018,,2017
6,,VERY HIGH HUMAN DEVELOPMENT,,,,,,,,,,,,,
7,1,Norway,0.953688,,82.271,,18.0608,b,12.5668,,68058.6,,5,,1
8,2,Switzerland,0.945936,,83.63,,16.2088,,13.3808,,59374.7,,8,,2
9,3,Ireland,0.942473,,82.103,,18.7933,b,12.5263,c,55659.7,,9,,3


Solving header problems:

In [428]:
# saving headers:

GoodHeaders=hdi.iloc[4,:2].tolist()+hdi.iloc[3,2:].tolist()
#
GoodHeaders

['HDI rank',
 'Country',
 'Human development index (HDI) ',
 nan,
 'Life expectancy at birth',
 nan,
 'Expected years of schooling',
 nan,
 'Mean years of schooling',
 nan,
 'Gross national income (GNI) per capita',
 nan,
 'GNI per capita rank minus HDI rank',
 nan,
 'HDI rank']

In [429]:
# deleting rows:
hdi.drop(index=range(0,7),axis=0)

Unnamed: 0.1,Unnamed: 0,Table 1. Human Development Index and its components,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
7,1,Norway,0.953688,,82.271,,18.0608,b,12.5668,,68058.6,,5,,1
8,2,Switzerland,0.945936,,83.63,,16.2088,,13.3808,,59374.7,,8,,2
9,3,Ireland,0.942473,,82.103,,18.7933,b,12.5263,c,55659.7,,9,,3
10,4,Germany,0.938785,,81.18,,17.0964,,14.1321,,46945.9,,15,,4
11,4,"Hong Kong, China (SAR)",0.938809,,84.687,,16.5122,,12.0381,,60220.8,,5,,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
264,,Column 2: UNDESA (2019b).,,,,,,,,,,,,,
265,,Column 3: UNESCO Institute for Statistics (201...,,,,,,,,,,,,,
266,,Column 4: UNESCO Institute for Statistics (201...,,,,,,,,,,,,,
267,,"Column 5: World Bank (2019a), IMF (2019) and U...",,,,,,,,,,,,,


In [430]:
# doing it
hdi.drop(index=range(0,7),axis=0,inplace=True)

In [431]:
# rename columns
hdi.columns=GoodHeaders
#reset indexes
hdi.reset_index(drop=True,inplace=True)

In [432]:
hdi.head()

Unnamed: 0,HDI rank,Country,Human development index (HDI),NaN,Life expectancy at birth,NaN.1,Expected years of schooling,NaN.2,Mean years of schooling,NaN.3,Gross national income (GNI) per capita,NaN.4,GNI per capita rank minus HDI rank,NaN.5,HDI rank.1
0,1,Norway,0.953688,,82.271,,18.0608,b,12.5668,,68058.6,,5,,1
1,2,Switzerland,0.945936,,83.63,,16.2088,,13.3808,,59374.7,,8,,2
2,3,Ireland,0.942473,,82.103,,18.7933,b,12.5263,c,55659.7,,9,,3
3,4,Germany,0.938785,,81.18,,17.0964,,14.1321,,46945.9,,15,,4
4,4,"Hong Kong, China (SAR)",0.938809,,84.687,,16.5122,,12.0381,,60220.8,,5,,6
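
The header repair above can be rehearsed on a tiny stand-in for the spreadsheet. This is only a sketch with invented values (`messy` is not the UN file), where the real header sits in the second row:

```python
import pandas as pd

# a tiny stand-in for a spreadsheet whose real header sits in row 1
messy = pd.DataFrame([['Table 1', None, None],
                      ['Country', 'HDI', 'Life expectancy'],
                      ['Norway', 0.95, 82.3],
                      ['Niger', 0.38, 62.0]])

tidy = messy.drop(index=range(0, 2))      # drop the title row and the header row
tidy.columns = messy.iloc[1, :].tolist()  # reuse the header row as column names
tidy = tidy.reset_index(drop=True)        # renumber rows from 0

print(tidy)
```

`pd.read_excel` also accepts `skiprows` and `header` arguments, which may let you skip such rows at reading time when the layout is simple enough.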


Now check the tail:

In [433]:
hdi.tail(72)

Unnamed: 0,HDI rank,Country,Human development index (HDI),NaN,Life expectancy at birth,NaN.1,Expected years of schooling,NaN.2,Mean years of schooling,NaN.3,Gross national income (GNI) per capita,NaN.4,GNI per capita rank minus HDI rank,NaN.5,HDI rank.1
190,188,Central African Republic,0.380662,,52.805,,7.56836,e,4.282,i,776.676,,0,,188
191,189,Niger,0.376591,,62.024,,6.47145,,2.02905,e,912.042,,-3,,189
192,,OTHER COUNTRIES OR TERRITORIES,,,,,,,,,,,,,
193,..,Korea (Democratic People's Rep. of),..,,72.095,,10.8406,e,..,,..,,..,,..
194,..,Monaco,..,,..,,..,,..,,..,,..,,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
257,,Column 2: UNDESA (2019b).,,,,,,,,,,,,,
258,,Column 3: UNESCO Institute for Statistics (201...,,,,,,,,,,,,,
259,,Column 4: UNESCO Institute for Statistics (201...,,,,,,,,,,,,,
260,,"Column 5: World Bank (2019a), IMF (2019) and U...",,,,,,,,,,,,,


In [434]:
# deleting: preview
hdi.drop(index=range(192,262),axis=0)

Unnamed: 0,HDI rank,Country,Human development index (HDI),NaN,Life expectancy at birth,NaN.1,Expected years of schooling,NaN.2,Mean years of schooling,NaN.3,Gross national income (GNI) per capita,NaN.4,GNI per capita rank minus HDI rank,NaN.5,HDI rank.1
0,1,Norway,0.953688,,82.271,,18.0608,b,12.5668,,68058.6,,5,,1
1,2,Switzerland,0.945936,,83.63,,16.2088,,13.3808,,59374.7,,8,,2
2,3,Ireland,0.942473,,82.103,,18.7933,b,12.5263,c,55659.7,,9,,3
3,4,Germany,0.938785,,81.18,,17.0964,,14.1321,,46945.9,,15,,4
4,4,"Hong Kong, China (SAR)",0.938809,,84.687,,16.5122,,12.0381,,60220.8,,5,,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
187,185,Burundi,0.422882,,61.247,,11.3046,,3.12437,q,659.732,,4,,185
188,186,South Sudan,0.41277,,57.604,,5.00038,e,4.84913,,1455.23,u,-7,,186
189,187,Chad,0.401176,,53.977,,7.46536,e,2.4095,q,1715.57,,-15,,187
190,188,Central African Republic,0.380662,,52.805,,7.56836,e,4.282,i,776.676,,0,,188


In [435]:
hdi.drop(index=range(192,262),axis=0,inplace=True)
#reset indexes
hdi.reset_index(drop=True,inplace=True)
#
hdi

Unnamed: 0,HDI rank,Country,Human development index (HDI),NaN,Life expectancy at birth,NaN.1,Expected years of schooling,NaN.2,Mean years of schooling,NaN.3,Gross national income (GNI) per capita,NaN.4,GNI per capita rank minus HDI rank,NaN.5,HDI rank.1
0,1,Norway,0.953688,,82.271,,18.0608,b,12.5668,,68058.6,,5,,1
1,2,Switzerland,0.945936,,83.63,,16.2088,,13.3808,,59374.7,,8,,2
2,3,Ireland,0.942473,,82.103,,18.7933,b,12.5263,c,55659.7,,9,,3
3,4,Germany,0.938785,,81.18,,17.0964,,14.1321,,46945.9,,15,,4
4,4,"Hong Kong, China (SAR)",0.938809,,84.687,,16.5122,,12.0381,,60220.8,,5,,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
187,185,Burundi,0.422882,,61.247,,11.3046,,3.12437,q,659.732,,4,,185
188,186,South Sudan,0.41277,,57.604,,5.00038,e,4.84913,,1455.23,u,-7,,186
189,187,Chad,0.401176,,53.977,,7.46536,e,2.4095,q,1715.57,,-15,,187
190,188,Central African Republic,0.380662,,52.805,,7.56836,e,4.282,i,776.676,,0,,188


Get rid of empty columns:

In [436]:
# Get good columns

Headers=pd.unique([x for x in GoodHeaders if str(x) != 'nan'])
#
Headers

array(['HDI rank', 'Country', 'Human development index (HDI) ',
       'Life expectancy at birth', 'Expected years of schooling',
       'Mean years of schooling',
       'Gross national income (GNI) per capita',
       'GNI per capita rank minus HDI rank'], dtype=object)

In [437]:
hdi.loc[:,Headers[1:-1]]

Unnamed: 0,Country,Human development index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita
0,Norway,0.953688,82.271,18.0608,12.5668,68058.6
1,Switzerland,0.945936,83.63,16.2088,13.3808,59374.7
2,Ireland,0.942473,82.103,18.7933,12.5263,55659.7
3,Germany,0.938785,81.18,17.0964,14.1321,46945.9
4,"Hong Kong, China (SAR)",0.938809,84.687,16.5122,12.0381,60220.8
...,...,...,...,...,...,...
187,Burundi,0.422882,61.247,11.3046,3.12437,659.732
188,South Sudan,0.41277,57.604,5.00038,4.84913,1455.23
189,Chad,0.401176,53.977,7.46536,2.4095,1715.57
190,Central African Republic,0.380662,52.805,7.56836,4.282,776.676


In [438]:
hdi=hdi.loc[:,Headers[1:-1]]
hdi.head(30)

Unnamed: 0,Country,Human development index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita
0,Norway,0.953688,82.271,18.0608,12.5668,68058.6
1,Switzerland,0.945936,83.63,16.2088,13.3808,59374.7
2,Ireland,0.942473,82.103,18.7933,12.5263,55659.7
3,Germany,0.938785,81.18,17.0964,14.1321,46945.9
4,"Hong Kong, China (SAR)",0.938809,84.687,16.5122,12.0381,60220.8
5,Australia,0.938379,83.281,22.1037,12.683,44097.0
6,Iceland,0.938474,82.855,19.1745,12.5367,47566.5
7,Sweden,0.936628,82.654,18.8322,12.426,47955.4
8,Singapore,0.934819,83.458,16.3282,11.4965,83792.7
9,Netherlands,0.933495,82.143,18.0448,12.19,50012.6


In [439]:
# check empty cells
hdi[hdi.isnull().any(axis=1)]

Unnamed: 0,Country,Human development index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita
62,HIGH HUMAN DEVELOPMENT,,,,,
117,MEDIUM HUMAN DEVELOPMENT,,,,,
155,LOW HUMAN DEVELOPMENT,,,,,


In [440]:
# the opposite
hdi[~hdi.isnull().any(axis=1)]

Unnamed: 0,Country,Human development index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita
0,Norway,0.953688,82.271,18.0608,12.5668,68058.6
1,Switzerland,0.945936,83.63,16.2088,13.3808,59374.7
2,Ireland,0.942473,82.103,18.7933,12.5263,55659.7
3,Germany,0.938785,81.18,17.0964,14.1321,46945.9
4,"Hong Kong, China (SAR)",0.938809,84.687,16.5122,12.0381,60220.8
...,...,...,...,...,...,...
187,Burundi,0.422882,61.247,11.3046,3.12437,659.732
188,South Sudan,0.41277,57.604,5.00038,4.84913,1455.23
189,Chad,0.401176,53.977,7.46536,2.4095,1715.57
190,Central African Republic,0.380662,52.805,7.56836,4.282,776.676


In [441]:
hdi=hdi[~hdi.isnull().any(axis=1)]
hdi.reset_index(drop=True, inplace=True)
hdi

Unnamed: 0,Country,Human development index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita
0,Norway,0.953688,82.271,18.0608,12.5668,68058.6
1,Switzerland,0.945936,83.63,16.2088,13.3808,59374.7
2,Ireland,0.942473,82.103,18.7933,12.5263,55659.7
3,Germany,0.938785,81.18,17.0964,14.1321,46945.9
4,"Hong Kong, China (SAR)",0.938809,84.687,16.5122,12.0381,60220.8
...,...,...,...,...,...,...
184,Burundi,0.422882,61.247,11.3046,3.12437,659.732
185,South Sudan,0.41277,57.604,5.00038,4.84913,1455.23
186,Chad,0.401176,53.977,7.46536,2.4095,1715.57
187,Central African Republic,0.380662,52.805,7.56836,4.282,776.676
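
The filter used above keeps the rows without any empty cell. As a sketch on invented data, this is how the boolean mask looks, and how `dropna()` gives the same result in one step:

```python
import pandas as pd

demo = pd.DataFrame({'Country': ['Norway', 'HIGH HUMAN DEVELOPMENT', 'Niger'],
                     'HDI': [0.95, None, 0.38]})

has_missing = demo.isnull().any(axis=1)   # True for rows with at least one empty cell
kept = demo[~has_missing].reset_index(drop=True)

# dropna() is a shortcut for the same filter
also_kept = demo.dropna().reset_index(drop=True)

print(has_missing.tolist())
print(kept.equals(also_kept))
```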


In [442]:
import re 

hdi.columns

Index(['Country', 'Human development index (HDI) ', 'Life expectancy at birth',
       'Expected years of schooling', 'Mean years of schooling',
       'Gross national income (GNI) per capita'],
      dtype='object')

In [443]:
[x.split(' (') for x in hdi.columns]

[['Country'],
 ['Human development index', 'HDI) '],
 ['Life expectancy at birth'],
 ['Expected years of schooling'],
 ['Mean years of schooling'],
 ['Gross national income', 'GNI) per capita']]

In [444]:
[x.split(' (')[0] for x in hdi.columns]

['Country',
 'Human development index',
 'Life expectancy at birth',
 'Expected years of schooling',
 'Mean years of schooling',
 'Gross national income']

In [445]:
newColNames=[x.split(' (')[0] for x in hdi.columns]

In [446]:
import re  # part of the standard library, no installation needed

# find blanks: \\s+

pattern='\\s+'
nothing=''
testString='Human development index'
re.sub(pattern,nothing,testString)

'Humandevelopmentindex'

In [447]:
newColNames=[re.sub(pattern,nothing,x) for x in newColNames]
newColNames

['Country',
 'Humandevelopmentindex',
 'Lifeexpectancyatbirth',
 'Expectedyearsofschooling',
 'Meanyearsofschooling',
 'Grossnationalincome']
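
Notice that splitting at `' ('` also throws away whatever comes after the parenthesis ("per capita" in the last column). If you wanted to keep it, one alternative is to delete only the parenthesis with a regular expression. This is a sketch of that option, not what the cells above do:

```python
import re

raw_names = ['Human development index (HDI) ',
             'Gross national income (GNI) per capita']

# option used above: keep only what comes before ' ('
cut = [name.split(' (')[0] for name in raw_names]

# alternative: delete just the parenthesis, then remove every blank
no_parens = [re.sub(r'\s*\([^)]*\)', '', name) for name in raw_names]
compact = [re.sub(r'\s+', '', name) for name in no_parens]

print(cut)
print(compact)
```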

In [448]:
hdi.columns=newColNames

In [449]:
hdi

Unnamed: 0,Country,Humandevelopmentindex,Lifeexpectancyatbirth,Expectedyearsofschooling,Meanyearsofschooling,Grossnationalincome
0,Norway,0.953688,82.271,18.0608,12.5668,68058.6
1,Switzerland,0.945936,83.63,16.2088,13.3808,59374.7
2,Ireland,0.942473,82.103,18.7933,12.5263,55659.7
3,Germany,0.938785,81.18,17.0964,14.1321,46945.9
4,"Hong Kong, China (SAR)",0.938809,84.687,16.5122,12.0381,60220.8
...,...,...,...,...,...,...
184,Burundi,0.422882,61.247,11.3046,3.12437,659.732
185,South Sudan,0.41277,57.604,5.00038,4.84913,1455.23
186,Chad,0.401176,53.977,7.46536,2.4095,1715.57
187,Central African Republic,0.380662,52.805,7.56836,4.282,776.676


In [450]:
link2="https://www.cia.gov/library/publications/the-world-factbook/fields/274.html"

Let's try to get the sortable table using pandas:

In [356]:
import pandas as pd

dataco2=pd.read_html(link2,header=0,attrs={'id': 'fieldListing'})

I tried to get all those tables. I might have more than one:

In [357]:
# What do I have? / How many?
type(dataco2), len(dataco2) 

(list, 1)

I need to recover the first table from the list (the only one).

In [358]:
co2=dataco2[0]

#what is it?
type(co2)

pandas.core.frame.DataFrame

Great!...we have a data frame; then:

In [359]:
co2.head()

Unnamed: 0,Country,Carbon dioxide emissions from consumption of energy
0,Afghanistan,9.067 million Mt (2017 est.)
1,Albania,4.5 million Mt (2017 est.)
2,Algeria,135.9 million Mt (2017 est.)
3,American Samoa,"361,100 Mt (2017 est.)"
4,Angola,20.95 million Mt (2017 est.)


In [360]:
co2.columns

Index(['Country', 'Carbon dioxide emissions from consumption of energy'], dtype='object')

In [361]:
co2.rename(columns={co2.columns[1]:'CO2'},inplace=True)

Making sure no blanks in country names:

In [362]:
co2.Country=co2.Country.str.strip()

Cleaning the CO2 Emissions Column

In [363]:
co2.CO2.str.split(pat=' Mt')

0      [9.067 million,   (2017 est.)]
1        [4.5 million,   (2017 est.)]
2      [135.9 million,   (2017 est.)]
3            [361,100,   (2017 est.)]
4      [20.95 million,   (2017 est.)]
                    ...              
211          [268,400,   (2017 est.)]
212    [33.62 billion,   (2013 est.)]
213    [13.68 million,   (2017 est.)]
214    [3.777 million,   (2017 est.)]
215    [12.06 million,   (2017 est.)]
Name: CO2, Length: 216, dtype: object

In [364]:
co2.CO2.str.split(' Mt',expand=True)

Unnamed: 0,0,1
0,9.067 million,(2017 est.)
1,4.5 million,(2017 est.)
2,135.9 million,(2017 est.)
3,361100,(2017 est.)
4,20.95 million,(2017 est.)
...,...,...
211,268400,(2017 est.)
212,33.62 billion,(2013 est.)
213,13.68 million,(2017 est.)
214,3.777 million,(2017 est.)


In [365]:
co2.CO2.str.split(' Mt',expand=True)[0]

0      9.067 million
1        4.5 million
2      135.9 million
3            361,100
4      20.95 million
           ...      
211          268,400
212    33.62 billion
213    13.68 million
214    3.777 million
215    12.06 million
Name: 0, Length: 216, dtype: object

In [366]:
co2.CO2=co2.CO2.str.split(' Mt',expand=True)[0]

In [367]:
co2.head()

Unnamed: 0,Country,CO2
0,Afghanistan,9.067 million
1,Albania,4.5 million
2,Algeria,135.9 million
3,American Samoa,361100
4,Angola,20.95 million


In [368]:
# -?  with or without negative
# \\d+  one or more digits
# \\.?  with or without a dot
# \\,?  with or without a comma
# \\d*  with zero or more digits

co2.CO2.str.extract('(-?\\d+\\.?\\,?\\d*)')

Unnamed: 0,0
0,9.067
1,4.5
2,135.9
3,361100
4,20.95
...,...
211,268400
212,33.62
213,13.68
214,3.777


In [369]:
# space before a sequence of non digits
# \\s before \\D+ 
co2.CO2.str.extract('\\s(\\D+)')

Unnamed: 0,0
0,million
1,million
2,million
3,
4,million
...,...
211,
212,billion
213,million
214,million


In [370]:
# simultaneously
co2.CO2.str.extract('(-?\\d+\\.?\\,?\\d*)\\s(\\D+)')

Unnamed: 0,0,1
0,9.067,million
1,4.5,million
2,135.9,million
3,,
4,20.95,million
...,...,...
211,,
212,33.62,billion
213,13.68,million
214,3.777,million


Rows 3 and 211 have a problem: they have no unit, so the pattern finds no match. Making the blank and the unit optional solves it:

In [371]:
co2.CO2.str.extract('(-?\\d+\\.?\\,?\\d*)\\s?(\\D+)?')

Unnamed: 0,0,1
0,9.067,million
1,4.5,million
2,135.9,million
3,361100,
4,20.95,million
...,...,...
211,268400,
212,33.62,billion
213,13.68,million
214,3.777,million


In [372]:
co2.CO2.str.extract('(?P<values>-?\\d+\\.?\\,?\\d*)\\s?(?P<unit>\\D+)?')

Unnamed: 0,values,unit
0,9.067,million
1,4.5,million
2,135.9,million
3,361100,
4,20.95,million
...,...,...
211,268400,
212,33.62,billion
213,13.68,million
214,3.777,million
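
The same pattern can be tried outside pandas with the `re` module, which makes it easier to see what each named group captures. The three strings are copied from the column above:

```python
import re

pattern = re.compile(r'(?P<number>-?\d+\.?,?\d*)\s?(?P<unit>\D+)?')

samples = ['9.067 million', '361,100', '33.62 billion']
parsed = [pattern.match(text).groupdict() for text in samples]

for text, groups in zip(samples, parsed):
    print(text, '->', groups)
```

A group that matches nothing comes back as `None` (pandas shows it as `NaN`). The pattern allows only one comma, so a value like `1,234,567` would not be captured completely.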


In [373]:
co2.head()

Unnamed: 0,Country,CO2
0,Afghanistan,9.067 million
1,Albania,4.5 million
2,Algeria,135.9 million
3,American Samoa,361100
4,Angola,20.95 million


In [374]:
newDF=co2.CO2.str.extract('(?P<number>-?\\d+\\.?\\,?\\d*)\\s?(?P<unit>\\D+)?')

co2New=pd.concat([co2,newDF],axis=1)

co2New.head()

Unnamed: 0,Country,CO2,number,unit
0,Afghanistan,9.067 million,9.067,million
1,Albania,4.5 million,4.5,million
2,Algeria,135.9 million,135.9,million
3,American Samoa,361100,361100.0,
4,Angola,20.95 million,20.95,million


In [375]:
co2New.number=co2New.number.str.replace(",","")

In [376]:
co2New.unit=co2New.unit.fillna(1)

In [377]:
co2New.unit.replace({'million': 10**6, "billion": 10**9})

0         1000000
1         1000000
2         1000000
3               1
4         1000000
          ...    
211             1
212    1000000000
213       1000000
214       1000000
215       1000000
Name: unit, Length: 216, dtype: int64

In [378]:
co2New.unit.replace({'million': 10**6, "billion": 10**9},inplace=True)

In [379]:
co2New

Unnamed: 0,Country,CO2,number,unit
0,Afghanistan,9.067 million,9.067,1000000
1,Albania,4.5 million,4.5,1000000
2,Algeria,135.9 million,135.9,1000000
3,American Samoa,361100,361100,1
4,Angola,20.95 million,20.95,1000000
...,...,...,...,...
211,Western Sahara,268400,268400,1
212,World,33.62 billion,33.62,1000000000
213,Yemen,13.68 million,13.68,1000000
214,Zambia,3.777 million,3.777,1000000


In [380]:
co2New.drop(columns=['CO2'],inplace=True)

## Formatting

In [451]:
co2New.dtypes

Country     object
CO2ok      float64
dtype: object

In [452]:
hdi.dtypes

Country                     object
Humandevelopmentindex       object
Lifeexpectancyatbirth       object
Expectedyearsofschooling    object
Meanyearsofschooling        object
Grossnationalincome         object
dtype: object

In [453]:
co2New.number=co2New.number.astype('float')


In [385]:
co2New=co2New.assign(CO2ok=co2New.number*co2New.unit/10**6)

In [386]:
co2New

Unnamed: 0,Country,number,unit,CO2ok
0,Afghanistan,9.067,1000000,9.0670
1,Albania,4.500,1000000,4.5000
2,Algeria,135.900,1000000,135.9000
3,American Samoa,361100.000,1,0.3611
4,Angola,20.950,1000000,20.9500
...,...,...,...,...
211,Western Sahara,268400.000,1,0.2684
212,World,33.620,1000000000,33620.0000
213,Yemen,13.680,1000000,13.6800
214,Zambia,3.777,1000000,3.7770
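
The whole path from text to a number in million Mt can be condensed into a few lines. This sketch uses three values from the table and `map` instead of `replace`; the names `raw`, `parts`, `factor` and `co2_million_mt` are only for illustration:

```python
import pandas as pd

raw = pd.Series(['9.067 million', '361,100', '33.62 billion'])

# split each text into a number and an optional unit
parts = raw.str.extract(r'(?P<number>-?\d+\.?,?\d*)\s?(?P<unit>\D+)?')

# '361,100' -> 361100.0
number = parts['number'].str.replace(',', '', regex=False).astype(float)

# unit words become multipliers; no unit means the value is already in Mt
factor = parts['unit'].map({'million': 10**6, 'billion': 10**9}).fillna(1)

co2_million_mt = number * factor / 10**6
print(co2_million_mt)
```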


In [387]:
co2New.drop(columns=['number','unit'],inplace=True)

In [455]:
headerNames=hdi.columns
hdi[headerNames[1:]]=hdi[headerNames[1:]].astype(float)
hdi.dtypes

Country                      object
Humandevelopmentindex       float64
Lifeexpectancyatbirth       float64
Expectedyearsofschooling    float64
Meanyearsofschooling        float64
Grossnationalincome         float64
dtype: object
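
`astype(float)` fails if a cell holds text that is not a number, such as the `..` the UN file uses for missing values. When such cells are still present, `pd.to_numeric` with `errors='coerce'` is one way to convert them into `NaN`. A sketch with invented rows:

```python
import pandas as pd

demo = pd.DataFrame({'Country': ['Norway', 'Monaco'],
                     'HDI': ['0.953688', '..'],
                     'Lifeexpectancy': ['82.271', '..']})

value_columns = demo.columns[1:]
# errors='coerce' turns anything that is not a number into NaN instead of raising
demo[value_columns] = demo[value_columns].apply(pd.to_numeric, errors='coerce')

print(demo.dtypes)
```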

In [456]:
hdi.describe()

Unnamed: 0,Humandevelopmentindex,Lifeexpectancyatbirth,Expectedyearsofschooling,Meanyearsofschooling,Grossnationalincome
count,189.0,189.0,189.0,189.0,189.0
mean,0.713449,72.497042,13.229548,8.613036,18442.223653
std,0.150802,7.46556,2.953921,3.08229,19708.537587
min,0.376591,52.805,5.00038,1.585574,659.732263
25%,0.59567,67.341,11.336769,6.348,3961.615683
50%,0.727787,73.861,13.104833,9.017541,11610.90818
75%,0.830096,77.77,15.23057,11.288111,26770.07268
max,0.953688,84.687,22.10372,14.13215,110488.7358


In [458]:
co2New.describe()

Unnamed: 0,CO2ok
count,216.0
mean,345.592597
std,2455.826005
min,0.000105
25%,1.732
50%,10.045
75%,65.1825
max,33620.0


Let me clean a similar data set from Wikipedia, about the Democracy Index:

In [459]:
import pandas as pd
# location:
demoLink = "https://en.wikipedia.org/wiki/Democracy_Index" 

#collection
demodex=pd.read_html(demoLink,header=0,flavor='bs4',attrs={'class': 'wikitable sortable'})[0]

1. Looking for messiness:

In [460]:
# what's on top?
# names? weird symbols? more links?
demodex.head(10)

Unnamed: 0,Rank,Country,Score,Electoral processand pluralism,Functioning ofgovernment,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Continent
0,1,Norway,9.87,10.0,9.64,10.0,10.0,9.71,Full democracy,Europe
1,2,Iceland,9.58,10.0,9.29,8.89,10.0,9.71,Full democracy,Europe
2,3,Sweden,9.39,9.58,9.64,8.33,10.0,9.41,Full democracy,Europe
3,4,New Zealand,9.26,10.0,9.29,8.89,8.13,10.0,Full democracy,Oceania
4,5,Denmark,9.22,10.0,9.29,8.33,9.38,9.12,Full democracy,Europe
5,6,Ireland,9.15,9.58,7.86,8.33,10.0,10.0,Full democracy,Europe
6,6,Canada,9.15,9.58,9.64,7.78,8.75,10.0,Full democracy,North America
7,8,Finland,9.14,10.0,8.93,8.33,8.75,9.71,Full democracy,Europe
8,9,Australia,9.09,10.0,8.93,7.78,8.75,10.0,Full democracy,Oceania
9,10,Switzerland,9.03,9.58,9.29,7.78,9.38,9.12,Full democracy,Europe


In [461]:
# what's at the bottom?
# note? credits? extra info?

demodex.tail(10)

Unnamed: 0,Rank,Country,Score,Electoral processand pluralism,Functioning ofgovernment,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Continent
158,159,Saudi Arabia,1.93,0.00,2.86,2.22,3.13,1.47,Authoritarian,Asia
159,159,Tajikistan,1.93,0.08,0.79,1.67,6.25,0.88,Authoritarian,Asia
160,161,Equatorial Guinea,1.92,0.00,0.43,3.33,4.38,1.47,Authoritarian,Africa
161,162,Turkmenistan,1.72,0.00,0.79,2.22,5.00,0.59,Authoritarian,Asia
162,163,Chad,1.61,0.00,0.00,1.67,3.75,2.65,Authoritarian,Africa
163,164,Central African Republic,1.52,2.25,0.00,1.11,1.88,2.35,Authoritarian,Africa
164,165,Democratic Republic of the Congo,1.49,0.50,0.71,2.22,3.13,0.88,Authoritarian,Africa
165,166,Syria,1.43,0.00,0.00,2.78,4.38,0.00,Authoritarian,Asia
166,167,North Korea,1.08,0.00,2.50,1.67,1.25,0.00,Authoritarian,Asia
167,Rank,Country,Score,Electoral processand pluralism,Functioning ofgovernment,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Continent


First, we see a column with some messiness (the symbol "=" in _Rank_), but it can be deleted as its information is not relevant. Let me also get rid of _Score_, as it is just the mean of the other indices. The last row is a repetition of the headers, so that one should go, too:

In [462]:
#bye row 167, and two columns
demodexClean=demodex.drop(index=167,columns=['Rank','Score'])

In [463]:
demodexClean

Unnamed: 0,Country,Electoral processand pluralism,Functioning ofgovernment,Politicalparticipation,Politicalculture,Civilliberties,Regimetype,Continent
0,Norway,10.00,9.64,10.00,10.00,9.71,Full democracy,Europe
1,Iceland,10.00,9.29,8.89,10.00,9.71,Full democracy,Europe
2,Sweden,9.58,9.64,8.33,10.00,9.41,Full democracy,Europe
3,New Zealand,10.00,9.29,8.89,8.13,10.00,Full democracy,Oceania
4,Denmark,10.00,9.29,8.33,9.38,9.12,Full democracy,Europe
...,...,...,...,...,...,...,...,...
162,Chad,0.00,0.00,1.67,3.75,2.65,Authoritarian,Africa
163,Central African Republic,2.25,0.00,1.11,1.88,2.35,Authoritarian,Africa
164,Democratic Republic of the Congo,0.50,0.71,2.22,3.13,0.88,Authoritarian,Africa
165,Syria,0.00,0.00,2.78,4.38,0.00,Authoritarian,Asia


As there are few column names, we can replace them with shorter ones:

In [464]:
newNames=['pluralism','effectiveness','participation','culture','liberties']

# '[1:-1]' goes from the second column to the one before the last;
# zip stops after the five new names, so 'Regimetype' keeps its name:
newMapper={old:new for old,new in zip(demodexClean.columns[1:-1],newNames)}

demodexClean.rename(columns=newMapper,inplace=True)

In [465]:
# this is what we have so far:
demodexClean.head()

Unnamed: 0,Country,pluralism,effectiveness,participation,culture,liberties,Regimetype,Continent
0,Norway,10.0,9.64,10.0,10.0,9.71,Full democracy,Europe
1,Iceland,10.0,9.29,8.89,10.0,9.71,Full democracy,Europe
2,Sweden,9.58,9.64,8.33,10.0,9.41,Full democracy,Europe
3,New Zealand,10.0,9.29,8.89,8.13,10.0,Full democracy,Oceania
4,Denmark,10.0,9.29,8.33,9.38,9.12,Full democracy,Europe


It looks good so far. Let's go to formatting.

2. Giving the right format:

In [468]:
# checking data types:
demodexClean.dtypes

Country           object
pluralism        float64
effectiveness    float64
participation    float64
culture          float64
liberties        float64
Regimetype        object
Continent         object
dtype: object

The five indices must be numeric; if any of them had been read as text (`object`), this converts them:

In [467]:
demodexClean[newNames]=demodexClean[newNames].apply(pd.to_numeric)

The last one is a categorical variable:

In [470]:
demodexClean.Continent.value_counts()

Africa           50
Asia             44
Europe           38
North America    14
South America    12
Europe/Asia       5
Oceania           4
Name: Continent, dtype: int64

When you have text, you could get the unique values of a column like this:

In [472]:
pd.unique(demodexClean.Continent).tolist()

['Europe',
 'Oceania',
 'North America',
 'South America',
 'Africa',
 'Asia',
 'Europe/Asia']

Then, you can prepare the map to recode the values:

In [None]:
# the table is sorted by rank, so the first regime type found gets the highest value
oldValues=pd.unique(demodexClean.Regimetype).tolist()
newValues=[4,3,2,1]
mapNewOld={old:new for old,new in zip(oldValues,newValues)}
mapNewOld

You can do it in this way:

In [None]:
demodexClean.Regimetype.replace(mapNewOld,inplace=True)

In [None]:
# or this one:
# demodexClean.Regimetype=demodexClean.Regimetype.replace(mapNewOld)

You can save it as a category, but that will be lost if sent to R:

In [None]:
demodexClean.Regimetype=demodexClean.Regimetype.astype('category')

In [None]:
demodexClean['Regimetype'].cat.categories

In [None]:
# checking missing values

demodexClean.info()

This data is now ready for R.


____

* [Go to page beginning](#beginning)
* [Go to REPO in Github](https://github.com/eScienceWinterSchool/PythonSession)
* [Go to WinterSchool Repos](https://github.com/eScienceWinterSchool)