<br> 
<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>


### Prof. José Manuel Magallanes, PhD

* Visiting Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, [magajm@uw.edu](mailto:magajm@uw.edu)

* Associate Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

_____

_____

<a id='home'></a>

# Introduction to Python

* DATA STRUCTURES
    - [Native Data Structures](#basic_ds).

    - [NonNative Data Structures](#complex_ds).

* DATA PRE PROCESSING
    - [Data cleaning and formatting](#cleanformat).

    - [Integration and Saving](#integratesave).


# 1. Basic Data Structures

Python has basic native structures, like lists, tuples and dictionaries.

## a.  **LISTS** 

Lists are the most flexible structure to save or contain data elements.

In [1]:
ages=[32,33,28,30,29]

Above we have created a list. Lists can contain values of any type, and they support different operations:

* **Accessing**:

Keep in mind that positions in Python start at **0**.

In [2]:
# one element
ages[0]

32

In [3]:
# last one
ages[-1] 

29

In [4]:
# all but the last two
ages[:-2] 

[32, 33, 28]

In [5]:
#non consecutive
# I want: [32, 28] from [32,33,28,30,29]
ages[0:4:2] # start:end:step (open at end)

[32, 28]

* **Adding elements**:

In [6]:
# I want: [32, 28, 30] from [32,33,28,30,29]
# difficult to understand? you are adding two lists into one
ages[0:4:2] + [ages[3]]

[32, 28, 30]

In [7]:
# this will not work: you want to add a number to the list
ages[0:4:2] + ages[3]

TypeError: can only concatenate list (not "int") to list

Always check the data types when trying to add elements:

In [8]:
type(ages[0:4:2])

list

In [9]:
type(ages[3])

int

Using the function **append**:

In [10]:
aList=ages[0:4:2]
aList

[32, 28]

In [11]:
aList.append(ages[3])

In [12]:
aList

[32, 28, 30]

Using the function **extend**:

In [13]:
list1=[1,2,3]
list2=[4,5,6]
list1.extend(list2)
list1

[1, 2, 3, 4, 5, 6]

Both *append* and *extend* add elements to a list. The former adds its argument as a single element, while the latter adds, one by one, the elements of a container (such as another list).
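
The difference is easier to see when the argument is itself a list. A minimal sketch (the names `base_append` and `base_extend` are only for illustration):

```python
# append adds its argument as ONE element, even if it is a list
base_append = [1, 2, 3]
base_append.append([4, 5])
print(base_append)   # the last element is the list [4, 5]

# extend adds each element of the iterable, one by one
base_extend = [1, 2, 3]
base_extend.extend([4, 5])
print(base_extend)

# extend needs an iterable: a plain number raises TypeError
try:
    base_extend.extend(6)
except TypeError as error:
    print("TypeError:", error)
```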

* **Modifying**:

In [15]:
country=["China", "Senegal", "España", "Norway","Korea"]


# by position
country[2]="Spain"

# list changed:
country

['China', 'Senegal', 'Spain', 'Norway', 'Korea']

In [16]:
# by value
country=["PR China" if x == "China" else x for x in country]

# list changed:
country

['PR China', 'Senegal', 'Spain', 'Norway', 'Korea']

* **Deleting**

In [17]:
# by position
del country[-1] #last value

# list changed:
country

['PR China', 'Senegal', 'Spain', 'Norway']

In [18]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]

# by position
names.pop() #last value by default

# list changed:
names

['Qing', 'Françoise', 'Raúl', 'Bjork']

In [19]:
# only 'del' works for several positions

lista=[1,2,3,4,5,6]
del lista[1:3]

#now:
lista

[1, 4, 5, 6]

In [20]:
lista=[5,7,13,13,29,29,33]

# by value
lista.remove(29) 

# list changed:
lista # just the first occurrence of the value!!

[5, 7, 13, 13, 29, 33]

In [21]:
# deleting every occurrence of a value:

lista=[1,'a',45,'b','a']
lista=[x for x in lista if x!='a']

# you get:
lista

[1, 45, 'b']

## b.  **TUPLES**

Tuples are immutable structures in Python. They look like lists but do not share much of their functionality:

In [22]:
# a tuple:
weekend=("Friday", "Saturday", "Sunday")

You can access:

In [23]:
weekend[0]

'Friday'

But deleting, modifying, and inserting a value is **NOT PERMITTED**.
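
A short sketch of what that restriction looks like in practice, and of one possible workaround (building a new tuple from a list). The names `asList` and `longWeekend` are only for illustration:

```python
weekend = ("Friday", "Saturday", "Sunday")

# trying to modify a tuple raises TypeError
try:
    weekend[0] = "Thursday"
except TypeError as error:
    print("TypeError:", error)

# workaround: convert to a list, modify it, and build a NEW tuple
asList = list(weekend)
asList[0] = "Thursday"
longWeekend = tuple(asList)

print(weekend)      # the original tuple is unchanged
print(longWeekend)
```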




## c. **DICTIONARIES**  

*Dicts* work in a more sophisticated way, as they have a **'key'**:**'value'** structure:

In [24]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
ages=[32,33,28,30,29]
education=["Bach", "Bach", "Master", "PhD","PhD"]


classroom={'student':names,'age':ages,'edu':education}
# see it:

classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie'],
 'age': [32, 33, 28, 30, 29],
 'edu': ['Bach', 'Bach', 'Master', 'PhD', 'PhD']}

Dicts do not use indexes to access values:

In [26]:
# this will give error:
classroom[0]

KeyError: 0

Dicts use keys:

In [27]:
classroom['student']

['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie']

To see all the keys:

In [28]:
classroom.keys()

dict_keys(['student', 'age', 'edu'])

Notice that I created a dictionary where the value is not ONE but a LIST of values.

If you wish to see all the keys and values:

In [29]:
classroom.items()

dict_items([('student', ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie']), ('age', [32, 33, 28, 30, 29]), ('edu', ['Bach', 'Bach', 'Master', 'PhD', 'PhD'])])

To visit each one:

In [30]:
#wrong:
[(key,value) for key, value in classroom]

ValueError: too many values to unpack (expected 2)

In [31]:
#this will only give the keys (you cannot get the values):
[element for element in classroom]

['student', 'age', 'edu']

In [32]:
# right approach to get both:

[(key,value) for key, value in classroom.items()]

[('student', ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie']),
 ('age', [32, 33, 28, 30, 29]),
 ('edu', ['Bach', 'Bach', 'Master', 'PhD', 'PhD'])]

Notice that the previous dict could be represented as a table, but dicts are not limited to table-like structures:

In [33]:
student1={'names':'Peter','language':['english','spanish'],'age':14,'sex':"F"}
student2={'names':'Mary','language':['english','spanish','french'],'age':16,'sex':"M"}
student3={'names':'John','language':['english'],'age':15,'sex':None}

You can create this dict:

In [34]:
class1={'students':[student1,student2,student3]}
class1

{'students': [{'names': 'Peter',
   'language': ['english', 'spanish'],
   'age': 14,
   'sex': 'F'},
  {'names': 'Mary',
   'language': ['english', 'spanish', 'french'],
   'age': 16,
   'sex': 'M'},
  {'names': 'John', 'language': ['english'], 'age': 15, 'sex': None}]}

If you have two lists:

In [35]:
list1=['a','b','c']
list2=['A','B','C']

You can create a dict like this:

In [36]:
{key:value for key,value in zip(list1,list2)}

{'a': 'A', 'b': 'B', 'c': 'C'}
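
An equivalent and shorter option is to pass the pairs from `zip` straight to `dict`. The sketch below also shows a detail worth keeping in mind: `zip` stops at the shortest list. The names `fromZip` and `shorter` are only for illustration:

```python
list1 = ['a', 'b', 'c']
list2 = ['A', 'B', 'C']

# dict() accepts the pairs produced by zip directly
fromZip = dict(zip(list1, list2))
print(fromZip)

# zip stops at the SHORTEST list: 'c' gets no pair here
shorter = dict(zip(list1, ['A', 'B']))
print(shorter)
```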

Once you access a value, you can modify it:

In [37]:
classroom['student']

['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie']

In [38]:
classroom['student'][2]='Raul'
classroom['student']

['Qing', 'Françoise', 'Raul', 'Bjork', 'Marie']

You can delete a key and its elements:

In [39]:
del classroom['age'] 
classroom

{'student': ['Qing', 'Françoise', 'Raul', 'Bjork', 'Marie'],
 'edu': ['Bach', 'Bach', 'Master', 'PhD', 'PhD']}

You can also use pop:

In [40]:
classroom.pop('edu')
classroom # just 'student' will remain

{'student': ['Qing', 'Françoise', 'Raul', 'Bjork', 'Marie']}

You cannot use _append_ to add an element to a dict; you can use **update**:

In [41]:
countries=["China", "Senegal", "España", "Norway","Korea"]

classroom.update({'country':countries,'age':ages})
# now:
classroom

{'student': ['Qing', 'Françoise', 'Raul', 'Bjork', 'Marie'],
 'country': ['China', 'Senegal', 'España', 'Norway', 'Korea'],
 'age': [32, 33, 28, 30, 29]}
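
For a single key, plain assignment by key is another option. A minimal sketch with a small dict made up for this example (`classroomDemo` is not part of the session data):

```python
classroomDemo = {'student': ['Qing', 'Marie']}

# assigning to a key that does not exist creates it
classroomDemo['age'] = [32, 29]

# assigning to an existing key overwrites its value
classroomDemo['age'] = [33, 30]

# update can add or overwrite several keys in one call
classroomDemo.update({'edu': ['Bach', 'PhD'], 'age': [32, 29]})
print(classroomDemo)
```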

[home](#home)

______

<a id='complex_ds'></a>


# Non Native Data Structures

## DATA FRAMES

A **Data frame**  is a complex container of values. The most common analogy is a spreadsheet. To create a data frame, we need to call **[pandas](https://pandas.pydata.org/docs/index.html)**:

In [44]:
# do you have pandas?
!pip show pandas

Name: pandas
Version: 1.2.1
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: /Users/JoseManuel/anaconda3/envs/WinterSchool/lib/python3.8/site-packages
Requires: python-dateutil, pytz, numpy
Required-by: geopandas


In [45]:
import pandas

We can prepare a data frame from a dictionary immediately, but ONLY if you have the same number of elements in each list representing a column.

In [46]:
# this dict can be used for a data frame:
classroom

{'student': ['Qing', 'Françoise', 'Raul', 'Bjork', 'Marie'],
 'country': ['China', 'Senegal', 'España', 'Norway', 'Korea'],
 'age': [32, 33, 28, 30, 29]}

In [47]:
# our data frame:
students=pandas.DataFrame(classroom)
## see it:
students

Unnamed: 0,student,country,age
0,Qing,China,32
1,Françoise,Senegal,33
2,Raul,España,28
3,Bjork,Norway,30
4,Marie,Korea,29


In case one of the lists in the dictionary were shorter than the others, you need this:

In [48]:
#then
students=pandas.DataFrame({key:pandas.Series(value) for key, value in classroom.items()})

# seeing it:
students

Unnamed: 0,student,country,age
0,Qing,China,32
1,Françoise,Senegal,33
2,Raul,España,28
3,Bjork,Norway,30
4,Marie,Korea,29
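
Since the lists in `classroom` all have the same length, the result above does not show the effect. A small sketch with made-up data (`uneven` is hypothetical) where one list really is shorter:

```python
import pandas as pd

# hypothetical data: 'age' has one value fewer than 'student'
uneven = {'student': ['Qing', 'Raul', 'Marie'],
          'age': [32, 28]}

# pd.DataFrame(uneven) would raise a ValueError here;
# wrapping each list in a Series pads the short column with NaN
unevenDF = pd.DataFrame({key: pd.Series(value) for key, value in uneven.items()})
print(unevenDF)
```

Notice that the padded column becomes a float column, because the missing value (NaN) is a float.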


Sometimes, Python users code like this:

In [49]:
import pandas as pd # renaming the library

students=pd.DataFrame(classroom)
students

Unnamed: 0,student,country,age
0,Qing,China,32
1,Françoise,Senegal,33
2,Raul,España,28
3,Bjork,Norway,30
4,Marie,Korea,29


### Information on the Data frame 

In [50]:
# type of structure: list? tuple? data frame?
type(students)

pandas.core.frame.DataFrame

In [51]:
# type of structure: list? tuple? data frame?
type(students.age)#this is a SERIES (a column)

pandas.core.series.Series

In [52]:
# type of data in data frame column
students.dtypes

student    object
country    object
age         int64
dtype: object

In [53]:
# type of data and further details
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   student  5 non-null      object
 1   country  5 non-null      object
 2   age      5 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes


In [54]:
# number of rows and columns
students.shape 

(5, 3)

In [55]:
# number of rows:
len(students) 

5

In [56]:
# first rows
students.head(2) # compare with: students.tail(2)

Unnamed: 0,student,country,age
0,Qing,China,32
1,Françoise,Senegal,33


In [57]:
# name of columns
students.columns

Index(['student', 'country', 'age'], dtype='object')

If you needed the column names as a list:

In [58]:
students.columns.tolist()# better to see hidden characters

['student', 'country', 'age']

### OPERATIONS on DF:

### a. Accessing

See a column:

In [59]:
#one particular column
students.student

0         Qing
1    Françoise
2         Raul
3        Bjork
4        Marie
Name: student, dtype: object

In [60]:
# or
students['student'] 

0         Qing
1    Françoise
2         Raul
3        Bjork
4        Marie
Name: student, dtype: object

This looks like a column, but it is not:

In [61]:
students[['student']] # a data frame

Unnamed: 0,student
0,Qing
1,Françoise
2,Raul
3,Bjork
4,Marie


In [62]:
# this is also a DF
students[['country','student']]

Unnamed: 0,country,student
0,China,Qing
1,Senegal,Françoise
2,España,Raul
3,Norway,Bjork
4,Korea,Marie


A more _Python_ way:

In [63]:
# using loc (names): 
columnNames=['country','student']
students.loc[:,columnNames]

Unnamed: 0,country,student
0,China,Qing
1,Senegal,Françoise
2,España,Raul
3,Norway,Bjork
4,Korea,Marie


In [64]:
rowNames=[0,1]
students.loc[rowNames,:]

Unnamed: 0,student,country,age
0,Qing,China,32
1,Françoise,Senegal,33


In [65]:
## Using iloc:
columnPositions=[1,2,0]
students.iloc[:,columnPositions] 

Unnamed: 0,country,age,student
0,China,32,Qing
1,Senegal,33,Françoise
2,España,28,Raul
3,Norway,30,Bjork
4,Korea,29,Marie


### Changing values

If you have a position, you can update values:

In [66]:
students.iloc[4,2]=23 # change is immediate! (no warning)
students

Unnamed: 0,student,country,age
0,Qing,China,32
1,Françoise,Senegal,33
2,Raul,España,28
3,Bjork,Norway,30
4,Marie,Korea,23


In [67]:
students.loc[2,'country']='Spain'
students

Unnamed: 0,student,country,age
0,Qing,China,32
1,Françoise,Senegal,33
2,Raul,Spain,28
3,Bjork,Norway,30
4,Marie,Korea,23


The same principle can be applied to change column names:

In [68]:
students.columns=['studentNames','countryBorn','ageCurrent']
students

Unnamed: 0,studentNames,countryBorn,ageCurrent
0,Qing,China,32
1,Françoise,Senegal,33
2,Raul,Spain,28
3,Bjork,Norway,30
4,Marie,Korea,23


In [69]:
#some names:
changesDictCol={'ageCurrent':'currentAge'}
students.rename(columns=changesDictCol) # returns a new data frame - a lasting change requires 'inplace=True' or reassignment

Unnamed: 0,studentNames,countryBorn,currentAge
0,Qing,China,32
1,Françoise,Senegal,33
2,Raul,Spain,28
3,Bjork,Norway,30
4,Marie,Korea,23


In [70]:
changesDictRow={0:'First'}
students.rename(index=changesDictRow) # returns a new data frame - a lasting change requires 'inplace=True' or reassignment

Unnamed: 0,studentNames,countryBorn,ageCurrent
First,Qing,China,32
1,Françoise,Senegal,33
2,Raul,Spain,28
3,Bjork,Norway,30
4,Marie,Korea,23


### Adding a Column

In [71]:
students['education']=education
students

Unnamed: 0,studentNames,countryBorn,ageCurrent,education
0,Qing,China,32,Bach
1,Françoise,Senegal,33,Bach
2,Raul,Spain,28,Master
3,Bjork,Norway,30,PhD
4,Marie,Korea,23,PhD


In [72]:
students['sex']=None
students

Unnamed: 0,studentNames,countryBorn,ageCurrent,education,sex
0,Qing,China,32,Bach,
1,Françoise,Senegal,33,Bach,
2,Raul,Spain,28,Master,
3,Bjork,Norway,30,PhD,
4,Marie,Korea,23,PhD,


In [73]:
students['city']='Seattle'
students

Unnamed: 0,studentNames,countryBorn,ageCurrent,education,sex,city
0,Qing,China,32,Bach,,Seattle
1,Françoise,Senegal,33,Bach,,Seattle
2,Raul,Spain,28,Master,,Seattle
3,Bjork,Norway,30,PhD,,Seattle
4,Marie,Korea,23,PhD,,Seattle


### Deleting columns

You can modify any values in a data frame, but let me create a **copy** of this data frame to play with:

In [74]:
studentsCopy1=students.copy()
studentsCopy2=studentsCopy1

In [75]:
# This is what you want to get rid of:
byeColumns=['sex'] # you can delete more than one

#this is the result
studentsCopy1.drop(columns=byeColumns,inplace=True)
studentsCopy1

Unnamed: 0,studentNames,countryBorn,ageCurrent,education,city
0,Qing,China,32,Bach,Seattle
1,Françoise,Senegal,33,Bach,Seattle
2,Raul,Spain,28,Master,Seattle
3,Bjork,Norway,30,PhD,Seattle
4,Marie,Korea,23,PhD,Seattle


In [76]:
# you did not touch this, but:
studentsCopy2

Unnamed: 0,studentNames,countryBorn,ageCurrent,education,city
0,Qing,China,32,Bach,Seattle
1,Françoise,Senegal,33,Bach,Seattle
2,Raul,Spain,28,Master,Seattle
3,Bjork,Norway,30,PhD,Seattle
4,Marie,Korea,23,PhD,Seattle


In [77]:
# any changes?
students

Unnamed: 0,studentNames,countryBorn,ageCurrent,education,sex,city
0,Qing,China,32,Bach,,Seattle
1,Françoise,Senegal,33,Bach,,Seattle
2,Raul,Spain,28,Master,,Seattle
3,Bjork,Norway,30,PhD,,Seattle
4,Marie,Korea,23,PhD,,Seattle
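
What happened: `copy()` builds an independent data frame, while a plain assignment only gives a second name to the same object. A minimal sketch with a tiny made-up data frame (`original`, `realCopy` and `alias` are names chosen for this example):

```python
import pandas as pd

original = pd.DataFrame({'name': ['Qing', 'Marie'], 'sex': [None, None]})

realCopy = original.copy()   # independent data frame
alias = realCopy             # just another name for realCopy

realCopy.drop(columns=['sex'], inplace=True)

print(alias is realCopy)          # True: both names point to the same object
print(alias.columns.tolist())     # 'sex' is gone here too
print(original.columns.tolist())  # the original keeps 'sex'
```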


### Deleting rows

Let me delete a row:

In [78]:
studentsCopy1.drop(index=2,inplace=True) #index says what row(s)
studentsCopy1

Unnamed: 0,studentNames,countryBorn,ageCurrent,education,city
0,Qing,China,32,Bach,Seattle
1,Françoise,Senegal,33,Bach,Seattle
3,Bjork,Norway,30,PhD,Seattle
4,Marie,Korea,23,PhD,Seattle


As you see, the index label 2 disappeared. Then, _iloc_ (positions) and _loc_ (labels) may give different results:

In [79]:
# this is OK
studentsCopy1.loc[0,'studentNames']==studentsCopy1.iloc[0,0]

True

In [80]:
# this is not OK
studentsCopy1.loc[3,'studentNames']==studentsCopy1.iloc[3,0]

False

In [81]:
studentsCopy1.loc[3,'studentNames']

'Bjork'

In [82]:
studentsCopy1.iloc[3,0]

'Marie'

In [83]:
# this is worse
studentsCopy1.loc[2,'studentNames']==studentsCopy1.iloc[2,0]

KeyError: 2

After deleting rows, I recommend **resetting the index**:

In [84]:
# returns a new data frame (the old index becomes a column)
studentsCopy1.reset_index()

Unnamed: 0,index,studentNames,countryBorn,ageCurrent,education,city
0,0,Qing,China,32,Bach,Seattle
1,1,Françoise,Senegal,33,Bach,Seattle
2,3,Bjork,Norway,30,PhD,Seattle
3,4,Marie,Korea,23,PhD,Seattle


In [85]:
# drop OLD indexes
studentsCopy1.reset_index(drop=True) # this just returns a new data frame (no change was made)

Unnamed: 0,studentNames,countryBorn,ageCurrent,education,city
0,Qing,China,32,Bach,Seattle
1,Françoise,Senegal,33,Bach,Seattle
2,Bjork,Norway,30,PhD,Seattle
3,Marie,Korea,23,PhD,Seattle


In [87]:
studentsCopy1.reset_index(drop=True, inplace=True) # actual change
studentsCopy1

Unnamed: 0,studentNames,countryBorn,ageCurrent,education,city
0,Qing,China,32,Bach,Seattle
1,Françoise,Senegal,33,Bach,Seattle
2,Bjork,Norway,30,PhD,Seattle
3,Marie,Korea,23,PhD,Seattle
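
An alternative to `inplace` is to reassign the result to the same name. The sketch below uses a small made-up data frame (`demo`) and checks that, once the index is reset, labels and positions match again:

```python
import pandas as pd

demo = pd.DataFrame({'name': ['Qing', 'Raul', 'Bjork', 'Marie']})
demo = demo.drop(index=1)   # labels left: 0, 2, 3

# reassigning keeps the result of reset_index
demo = demo.reset_index(drop=True)   # labels are 0, 1, 2 again

# now labels and positions match, so loc and iloc agree
print(demo.loc[2, 'name'] == demo.iloc[2, 0])
```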


##  Merging

Let me make two data frames from dicts:

In [88]:
dict1={'names':["Qing", "Françoise", "Raúl", "Bjork","Marie"],
'ages':[32,33,28,30,29],
'countries':["China", "Senegal", "España", "Norway","Korea"]}

dict2={'names':["Qing", "Françoise", "Raúl", "Bjork","Marie"],
'education':["Bach", "Bach", "Master", "PhD","PhD"]}

df1=pd.DataFrame(dict1)
df2=pd.DataFrame(dict2)

Merging requires a *key*: a column (or several) present in both data frames, ideally with unique values.

In [89]:
df12=df1.merge(df2) # df1 is "left"/df2 is "right"
df12

Unnamed: 0,names,ages,countries,education
0,Qing,32,China,Bach
1,Françoise,33,Senegal,Bach
2,Raúl,28,España,Master
3,Bjork,30,Norway,PhD
4,Marie,29,Korea,PhD


Let's make a little change, so that the _key_ columns have a different name:

In [90]:
# different column name for same key

dict3={'NAME':["Qing", "Françoise", "Raúl", "Bjork","Marie"],
'ages':[32,33,28,30,29],
'countries':["China", "Senegal", "España", "Norway","Korea"]}

dict4={'names':["Qing", "Françoise", "Raúl", "Bjork","Marie"],
'education':["Bach", "Bach", "Master", "PhD","PhD"]}

df3=pd.DataFrame(dict3)
df4=pd.DataFrame(dict4)
df34=df3.merge(df4,left_on="NAME",right_on="names") # df3 is "left"/df4 is "right"
df34

Unnamed: 0,NAME,ages,countries,names,education
0,Qing,32,China,Qing,Bach
1,Françoise,33,Senegal,Françoise,Bach
2,Raúl,28,España,Raúl,Master
3,Bjork,30,Norway,Bjork,PhD
4,Marie,29,Korea,Marie,PhD
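
In the examples above every key finds its match. When that is not the case, the default merge silently drops the unmatched rows. A possible way to detect them is the `how` and `indicator` arguments of `merge`; the sketch uses two small made-up data frames (`left` and `right`) whose keys differ in one accent:

```python
import pandas as pd

left = pd.DataFrame({'names': ['Qing', 'Raúl', 'Marie'],
                     'ages': [32, 28, 29]})
right = pd.DataFrame({'names': ['Qing', 'Raul', 'Marie'],   # 'Raul' without accent
                      'education': ['Bach', 'Master', 'PhD']})

# the default merge keeps only the keys present in BOTH data frames
print(len(left.merge(right)))   # 'Raúl' and 'Raul' do not match

# how='outer' keeps every row; indicator=True adds the column '_merge'
check = left.merge(right, how='outer', indicator=True)
print(check[check['_merge'] != 'both'])
```

The rows flagged as `left_only` or `right_only` are the ones that need cleaning before merging again.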


### Ordinal scales

Let's check the information from _df12_:

In [91]:
df12.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   names      5 non-null      object
 1   ages       5 non-null      int64 
 2   countries  5 non-null      object
 3   education  5 non-null      object
dtypes: int64(1), object(3)
memory usage: 200.0+ bytes


The column *education* can be considered ordinal. Let's make a change:

In [92]:
from pandas.api.types import CategoricalDtype

# create a list with the right order (ascending)
levels=["Bach","Master","PhD"]

# make the change:
df12.education=pd.Categorical(df12.education,
                              categories=levels,
                              ordered=True)
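
Once a column is an ordered categorical, operations that depend on the order become available. A small sketch with a made-up series (`edu`), assuming the same three levels:

```python
import pandas as pd

levels = ["Bach", "Master", "PhD"]
edu = pd.Series(pd.Categorical(["PhD", "Bach", "Master", "Bach"],
                               categories=levels,
                               ordered=True))

print(edu.max())                     # highest level present
print((edu >= "Master").tolist())    # comparisons follow the order of 'levels'
print(edu.sort_values().tolist())    # sorted following 'levels'
```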

### String operations

Let me create a new data frame:

In [168]:
someData= {"country%":['Perú',"chile","paraguay","mexicO"],
           "team":['Incas',"Mapuches","guaraníes","Aztecas"],
           "points 2021":[99,'ninety','93',None]}
someDF=pd.DataFrame(someData)
someDF

Unnamed: 0,country%,team,points 2021
0,Perú,Incas,99
1,chile,Mapuches,ninety
2,paraguay,guaraníes,93
3,mexicO,Aztecas,


In [169]:
someDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   country%     4 non-null      object
 1   team         4 non-null      object
 2   points 2021  3 non-null      object
dtypes: object(3)
memory usage: 224.0+ bytes


Check if text has some string(s), using [regular expressions](https://docs.python.org/3/library/re.html):

In [170]:
#number
someDF.columns.str.contains(r"\d",regex=True) # raw string (r"...") keeps the backslash for the regex

array([False, False,  True])

In [171]:
# then
someDF.columns[someDF.columns.str.contains(r"\d",regex=True)]

Index(['points 2021'], dtype='object')

In [172]:
#number or at least one blank
someDF.columns.str.contains(r"\d|\s+",regex=True)

array([False, False,  True])

Replacing text:

In [173]:
#if the '%' symbol appears, replace it
someDF.columns.str.replace("%","",regex=False)

Index(['country', 'team', 'points 2021'], dtype='object')

In [174]:
someDF.columns.str.replace(r"\d","",regex=True)

Index(['country%', 'team', 'points '], dtype='object')

In [175]:
someDF.columns.str.replace(r"%|\d","",regex=True)

Index(['country', 'team', 'points '], dtype='object')

The str.replace method has no "inplace" argument, so you need to reassign the result:

In [176]:
someDF.columns=someDF.columns.str.replace(r"%|\d|\s+","",regex=True)

#you get:
someDF.columns

Index(['country', 'team', 'points'], dtype='object')
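
The same pattern can be tried outside pandas with the standard `re` module, which is handy for testing a regular expression on a few strings first. A minimal sketch (`rawNames` and `cleanNames` are names chosen for this example); the `r` prefix makes a raw string, so Python leaves the backslashes for the regex engine:

```python
import re

rawNames = ['country%', 'team', 'points 2021']

# a '%' symbol, a digit, or one or more blanks
pattern = re.compile(r"%|\d|\s+")

cleanNames = [pattern.sub("", name) for name in rawNames]
print(cleanNames)
```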

These might be useful:

In [177]:
someDF.country.str.capitalize()

0        Perú
1       Chile
2    Paraguay
3      Mexico
Name: country, dtype: object

In [178]:
someDF.country.str.lower()

0        perú
1       chile
2    paraguay
3      mexico
Name: country, dtype: object

In [179]:
someDF.country.str.upper()

0        PERÚ
1       CHILE
2    PARAGUAY
3      MEXICO
Name: country, dtype: object

Some functions use ONLY a particular data structure as input:

In [180]:
# the .str methods work for ONE column (a series), not for a data frame
someDF.loc[:,['country','team']].str.upper()

AttributeError: 'DataFrame' object has no attribute 'str'

So, you need to *apply* a function column by column:

In [181]:
for column in someDF.loc[:,['country','team']]:
    print(someDF[column].str.upper())
    

0        PERÚ
1       CHILE
2    PARAGUAY
3      MEXICO
Name: country, dtype: object
0        INCAS
1     MAPUCHES
2    GUARANÍES
3      AZTECAS
Name: team, dtype: object


In [182]:
# what about this?
[someDF[column].str.upper() for column in someDF.loc[:,['country','team']]]


[0        PERÚ
 1       CHILE
 2    PARAGUAY
 3      MEXICO
 Name: country, dtype: object,
 0        INCAS
 1     MAPUCHES
 2    GUARANÍES
 3      AZTECAS
 Name: team, dtype: object]

In [183]:
#maybe not
pd.Series(someDF[column].str.upper() for column in someDF.loc[:,['country','team']])

0    0        PERÚ
1       CHILE
2    PARAGUAY
3   ...
1    0        INCAS
1     MAPUCHES
2    GUARANÍES
3...
dtype: object

In [184]:
# maybe not...
pd.DataFrame(someDF[column].str.upper() for column in someDF.loc[:,['country','team']])

Unnamed: 0,0,1,2,3
country,PERÚ,CHILE,PARAGUAY,MEXICO
team,INCAS,MAPUCHES,GUARANÍES,AZTECAS


In [185]:
# works but...
pd.DataFrame([someDF[column].str.upper() for column in someDF.loc[:,['country','team']]]).T #'T' added to previous

Unnamed: 0,country,team
0,PERÚ,INCAS
1,CHILE,MAPUCHES
2,PARAGUAY,GUARANÍES
3,MEXICO,AZTECAS


You may need to use *apply*:

In [186]:
# applying function
someDF.loc[:,['country','team']].apply(lambda x: x.str.upper())

Unnamed: 0,country,team
0,PERÚ,INCAS
1,CHILE,MAPUCHES
2,PARAGUAY,GUARANÍES
3,MEXICO,AZTECAS


Maybe you need a function cell by cell:

In [187]:
# this one requires unidecode installation

from  unidecode import unidecode

unidecode('á')

'a'

In [188]:
#but this does not work
unidecode(someDF.country)

AttributeError: 'Series' object has no attribute 'encode'

In [189]:
#This does work
[unidecode(cell) for cell in someDF.country]


['Peru', 'chile', 'paraguay', 'mexicO']

So, on a series, *apply* goes to each element (cell) of the structure:

In [190]:
someDF.country.apply(unidecode)

0        Peru
1       chile
2    paraguay
3      mexicO
Name: country, dtype: object

On a data frame, however, *apply* sends a whole column (a series) to the function. That is why this will not work:

In [191]:
someDF.loc[:,['country','team']].apply(unidecode)

AttributeError: 'Series' object has no attribute 'encode'

In this situation, to get to each cell of a data frame, you need **applymap** (pandas 2.1 and later deprecate it in favor of `DataFrame.map`):

In [192]:
someDF.loc[:,['country','team']].applymap(unidecode)

Unnamed: 0,country,team
0,Peru,Incas
1,chile,Mapuches
2,paraguay,guaranies
3,mexicO,Aztecas


In [193]:
someDF.loc[:,['country','team']]=someDF.loc[:,['country','team']].applymap(unidecode)
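
If _unidecode_ is not installed, or if your pandas version warns about _applymap_, the same cell-by-cell idea can be sketched with the standard library. This is only a stand-in under stated assumptions: `stripAccents` is a hypothetical helper that removes accent marks, and it does not cover everything _unidecode_ does. It combines `DataFrame.apply` (one column at a time) with `Series.map` (one cell at a time):

```python
import unicodedata
import pandas as pd

def stripAccents(text):
    # hypothetical helper: decompose characters, then drop the accent marks
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(char for char in decomposed if not unicodedata.combining(char))

teams = pd.DataFrame({'country': ['Perú', 'chile'],
                      'team': ['Incas', 'guaraníes']})

# Series.map works cell by cell; DataFrame.apply sends one column at a time
cleaned = teams.apply(lambda column: column.map(stripAccents))
print(cleaned)
```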

**Checking data types**:

The column points is "assumed" to be numeric, but is an _object_ type. Let's change the type with *to_numeric*:

In [194]:
pd.to_numeric(someDF.points)

ValueError: Unable to parse string "ninety" at position 1

By default, to_numeric raises an error when it finds text it cannot parse. With errors='coerce', those values become missing values instead:

In [195]:
pd.to_numeric(someDF.points,errors='coerce')

0    99.0
1     NaN
2    93.0
3     NaN
Name: points, dtype: float64

Let's make the change and revisit data types:

In [196]:
someDF.points=pd.to_numeric(someDF.points,errors='coerce')
someDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  4 non-null      object 
 1   team     4 non-null      object 
 2   points   2 non-null      float64
dtypes: float64(1), object(2)
memory usage: 224.0+ bytes
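
Coercion hides which values could not be parsed. Before overwriting a column, it may help to list them. A possible sketch with a made-up series (`points`) holding the same kind of values:

```python
import pandas as pd

points = pd.Series([99, 'ninety', '93', None])

converted = pd.to_numeric(points, errors='coerce')

# values that were present but became NaN are the ones that failed to parse
failed = points[converted.isna() & points.notna()]
print(failed)
```

Here only 'ninety' is reported, since the last value was already missing before the conversion.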


**Checking for missing data**:

We got some missing data:

In [197]:
someDF

Unnamed: 0,country,team,points
0,Peru,Incas,99.0
1,chile,Mapuches,
2,paraguay,guaranies,93.0
3,mexicO,Aztecas,


In [200]:
someDF[someDF.isnull().any(axis=1)]

Unnamed: 0,country,team,points
1,chile,Mapuches,
3,mexicO,Aztecas,


In [201]:
someDF.dropna()

Unnamed: 0,country,team,points
0,Peru,Incas,99.0
2,paraguay,guaranies,93.0


In [202]:
someDF.dropna(axis='columns') #dropna has inplace.

Unnamed: 0,country,team
0,Peru,Incas
1,chile,Mapuches
2,paraguay,guaranies
3,mexicO,Aztecas


You could try replacing missing data:

In [203]:
someDF.fillna(95) # anywhere

Unnamed: 0,country,team,points
0,Peru,Incas,99.0
1,chile,Mapuches,95.0
2,paraguay,guaranies,93.0
3,mexicO,Aztecas,95.0


In [204]:
someDF.points.fillna(95)#in a particular column

0    99.0
1    95.0
2    93.0
3    95.0
Name: points, dtype: float64

In [205]:
someDF.fillna({'points':95},inplace=True) #this function has 'inplace'; the dict limits the change to one column
someDF

Unnamed: 0,country,team,points
0,Peru,Incas,99.0
1,chile,Mapuches,95.0
2,paraguay,guaranies,93.0
3,mexicO,Aztecas,95.0
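
The value 95 above is arbitrary. One common choice, shown here only as an illustration and not as a recommendation for every case, is the mean of the observed values. The data frame `scores` is made up for this sketch:

```python
import pandas as pd

scores = pd.DataFrame({'team': ['Incas', 'Mapuches', 'guaranies', 'Aztecas'],
                       'points': [99.0, None, 93.0, None]})

# mean() skips missing values by default
meanPoints = scores['points'].mean()

# reassigning the column avoids relying on 'inplace'
scores['points'] = scores['points'].fillna(meanPoints)
print(scores)
```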



[home](#home)
_____

<a id='part2'></a>


# Exercise.  Data Pre processing

Get all these data frames and merge them.

* Democracy Index from the Economist (available in [wikipedia](https://en.wikipedia.org/wiki/Democracy_Index))

In [None]:
linkWiki="https://en.wikipedia.org/wiki/Democracy_Index"
democracy=pd.read_html(linkWiki, header=0,attrs={"class":"wikitable sortable"})[4]
democracy.head()

Keep Country, overall score and its five components, and regime type.

* Military expenditures as Share of GDP from [CIA Factbook](https://www.cia.gov/the-world-factbook/field/military-expenditures/country-comparison).

In [None]:
linkmil="https://www.cia.gov/the-world-factbook/field/military-expenditures/country-comparison"
milimoney=pd.read_html(linkmil)[0]
milimoney.head()

Just keep country and share of gdp.

* Data on Human Development Index 2019.

In [None]:
# the file is in Excel format; reading it requires the library xlrd (for xls)
# or the library openpyxl (for xlsx). Just check that you have them.

In [None]:
linkHDI="https://github.com/UW-eScience-WinterSchool/Python_Session/raw/main/countryCodesHDI.xlsx"
hdidata=pd.read_excel(linkHDI)
hdidata.head()

The main file should be the one with the largest number of rows. Then prepare a data set for R:

In [None]:
from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
#base.saveRDS(allmerged,file="allmerged.rds")

#In R, you can open it with: DF = readRDS("allmerged.rds")
#or, if you read from a link: DF = readRDS(url("https://..../allmerged.rds"))

Do you have **rpy2**?

In [None]:
!pip show rpy2
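
If rpy2 is not available, a plain CSV file is a simpler alternative that R can open with `read.csv`. This is a different route from the RDS file above, and a CSV does not keep details such as ordered categories. In this sketch, `allmergedDemo` is a made-up stand-in for your merged data frame, and the file goes to a temporary folder:

```python
import os
import tempfile
import pandas as pd

# hypothetical stand-in for the merged data frame
allmergedDemo = pd.DataFrame({'Country': ['Peru', 'Chile'], 'score': [6.5, 8.2]})

# index=False avoids writing the row labels as an extra column
csvPath = os.path.join(tempfile.mkdtemp(), "allmerged.csv")
allmergedDemo.to_csv(csvPath, index=False)

# reading it back to confirm the columns and rows survived
readBack = pd.read_csv(csvPath)
print(readBack)
```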