<center><img src="http://alacip.org/wp-content/uploads/2014/03/logoEscalacip1.png" width="500"></center>


<center> <h1>Course: Introduction to Python</h1> </center>

<br>

* Instructor:  <a href="http://www.pucp.edu.pe/profesor/jose-manuel-magallanes/" target="_blank">Dr. José Manuel Magallanes, PhD</a> ([jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe))<br>
    - Professor at the **Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú**.<br>
    - Senior Data Scientist at the **eScience Institute** and Visiting Professor at the **Evans School of Public Policy and Governance, University of Washington**.<br>
    - Fellow Catalyst, **Berkeley Initiative for Transparency in Social Sciences, UC Berkeley**.


## Part 4:  Data Cleaning in Python

Data preprocessing is the most tedious part of the research process.

This first part exposes several problems that come up with real-world data published on the web, such as the table shown below:

In [1]:
from IPython.display import HTML

wikiLink = "https://en.wikipedia.org/wiki/List_of_freedom_indices"
iframe = '<iframe src="' + wikiLink + '" width="700" height="350"></iframe>'
HTML(iframe)

Remember to inspect the table's HTML to find an attribute that can be used to select it for download. Then continue.

In [2]:
# first install 'beautifulsoup4' and 'html5lib'
# you may need to close and reopen the notebook afterwards

import pandas as pd

wikiTables=pd.read_html(wikiLink,header=0,flavor='bs4',attrs={'class': 'wikitable sortable'})
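
The `attrs` argument only works if you already know an attribute of the target table. As a rough sketch of how those attributes could be listed without the browser inspector, the block below walks an HTML fragment with the standard library's `html.parser` and records the `class` and `id` of every `<table>`. The fragment `SAMPLE_HTML` and the class `TableAttrLister` are hypothetical names made up for this illustration; a real page would have to be downloaded first.

```python
from html.parser import HTMLParser

# Hypothetical page fragment used only for illustration.
SAMPLE_HTML = """
<table class="infobox"><tr><td>side box</td></tr></table>
<table class="wikitable sortable"><tr><th>Country</th></tr></table>
<table id="fieldListing"><tr><th>Country</th></tr></table>
"""


class TableAttrLister(HTMLParser):
    """Collect the class and id attributes of every <table> tag."""

    def __init__(self):
        super().__init__()
        self.tables = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "table":
            attr_map = dict(attrs)
            self.tables.append(
                {"class": attr_map.get("class"), "id": attr_map.get("id")}
            )


lister = TableAttrLister()
lister.feed(SAMPLE_HTML)
for found in lister.tables:
    print(found)
```

Any attribute that identifies exactly one table in this listing is a reasonable candidate for `attrs`.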

In [3]:
# how many tables did we get?
len(wikiTables)

1

So far, so good. Since there is only one table, I extract it and start checking for 'dirt'.

In [4]:
DF=wikiTables[0]

# first look
DF.head()

| | Country | Freedom in the World 2019[10] | 2019 Index of Economic Freedom[11] | 2019 Press Freedom Index[3] | 2018 Democracy Index[13] |
|---|---|---|---|---|---|
| 0 | Afghanistan | not free | mostly unfree | difficult situation | authoritarian regime |
| 1 | Albania | partly free | moderately free | noticeable problems | hybrid regime |
| 2 | Algeria | not free | repressed | difficult situation | authoritarian regime |
| 3 | Andorra | free | | satisfactory situation | |
| 4 | Angola | not free | mostly unfree | noticeable problems | authoritarian regime |


Cleaning requires a strategy. The first thing that stands out is the _footnotes_ in the column titles:

In [5]:
DF.columns

Index(['Country', 'Freedom in the World 2019[10]',
       '2019 Index of Economic Freedom[11]', '2019 Press Freedom Index[3]',
       '2018 Democracy Index[13]'],
      dtype='object')

In [6]:
# see what happens when I split each column name on the '[' character
[element.split('[') for element in DF.columns]

[['Country'],
 ['Freedom in the World 2019', '10]'],
 ['2019 Index of Economic Freedom', '11]'],
 ['2019 Press Freedom Index', '3]'],
 ['2018 Democracy Index', '13]']]

In [7]:
# notice that you can keep just the first element of each split:
[element.split('[')[0] for element in DF.columns]

['Country',
 'Freedom in the World 2019',
 '2019 Index of Economic Freedom',
 '2019 Press Freedom Index',
 '2018 Democracy Index']

Whitespace should be removed as well:

In [8]:
outSymbol=' ' 
inSymbol=''
[element.split('[')[0].replace(outSymbol,inSymbol) for element in DF.columns]

['Country',
 'FreedomintheWorld2019',
 '2019IndexofEconomicFreedom',
 '2019PressFreedomIndex',
 '2018DemocracyIndex']

The numbers are also a nuisance, but they appear in different positions. Let's try regular expressions instead:

In [9]:
import re  # part of the Python standard library, nothing to install

# whitespace: \s+
# one or more digits: \d+
# opening bracket: \[
# closing bracket: \]

pattern=r'\s+|\d+|\[|\]'
nothing=''

# replacing every match of 'pattern' with 'nothing':
[re.sub(pattern,nothing,element) for element in DF.columns]

['Country',
 'FreedomintheWorld',
 'IndexofEconomicFreedom',
 'PressFreedomIndex',
 'DemocracyIndex']
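
One caveat of the pattern above is that it deletes every digit, including digits that might be part of a legitimate name. The sketch below compares it with a narrower alternative that removes only footnote markers such as `[10]`, standalone four-digit years, and whitespace. The narrower pattern and the `G20 Members 2019[4]` header are assumptions made for illustration; they are not part of the course material or the Wikipedia table.

```python
import re

# Pattern used above: removes every whitespace run, digit run and bracket.
broad = re.compile(r"\s+|\d+|\[|\]")

# Narrower alternative (an assumption, not the course's pattern): removes only
# footnote markers like [10], standalone four-digit years, and whitespace.
narrow = re.compile(r"\[\d+\]|\b\d{4}\b|\s+")

headers = [
    "Freedom in the World 2019[10]",
    "2019 Press Freedom Index[3]",
    "G20 Members 2019[4]",  # hypothetical header, not in the table
]

# Print both results side by side for each header.
for header in headers:
    print(broad.sub("", header), "|", narrow.sub("", header))
```

For the headers in this table both patterns give the same result, so the simpler one is enough here.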

Now I have new column titles (headers)!

In [10]:
newHeaders=[re.sub(pattern,nothing,element) for element in DF.columns]

Let's prepare the changes:

In [11]:
list(zip(DF.columns,newHeaders))

[('Country', 'Country'),
 ('Freedom in the World 2019[10]', 'FreedomintheWorld'),
 ('2019 Index of Economic Freedom[11]', 'IndexofEconomicFreedom'),
 ('2019 Press Freedom Index[3]', 'PressFreedomIndex'),
 ('2018 Democracy Index[13]', 'DemocracyIndex')]

In [12]:
# let's see the changes as a dict:
{old:new for old,new in zip(DF.columns,newHeaders)}

{'2018 Democracy Index[13]': 'DemocracyIndex',
 '2019 Index of Economic Freedom[11]': 'IndexofEconomicFreedom',
 '2019 Press Freedom Index[3]': 'PressFreedomIndex',
 'Country': 'Country',
 'Freedom in the World 2019[10]': 'FreedomintheWorld'}

I use a dict because the same approach works if you only want to rename some of the columns:

In [13]:
changes={old:new for old,new in zip(DF.columns,newHeaders)}
DF.rename(columns=changes,inplace=True)
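
To illustrate the partial-rename case, here is a minimal sketch on a made-up two-row frame. The frame `toy` and the mapping `partial_changes` are hypothetical; the only assumption about pandas is that `rename` leaves columns that are missing from the mapping untouched and, without `inplace=True`, returns a new frame.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the original headers.
toy = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Albania"],
        "Freedom in the World 2019[10]": ["not free", "partly free"],
        "2018 Democracy Index[13]": ["authoritarian regime", "hybrid regime"],
    }
)

# Only one key in the mapping, so only that column is renamed.
partial_changes = {"2018 Democracy Index[13]": "DemocracyIndex"}
renamed = toy.rename(columns=partial_changes)

print(list(renamed.columns))
```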

In [14]:
# now we have:
DF.head()

| | Country | FreedomintheWorld | IndexofEconomicFreedom | PressFreedomIndex | DemocracyIndex |
|---|---|---|---|---|---|
| 0 | Afghanistan | not free | mostly unfree | difficult situation | authoritarian regime |
| 1 | Albania | partly free | moderately free | noticeable problems | hybrid regime |
| 2 | Algeria | not free | repressed | difficult situation | authoritarian regime |
| 3 | Andorra | free | | satisfactory situation | |
| 4 | Angola | not free | mostly unfree | noticeable problems | authoritarian regime |


The columns hold categories, so let's check whether all the values are spelled correctly:

In [15]:
DF.FreedomintheWorld.value_counts()

free           87
partly free    62
not free       55
Name: FreedomintheWorld, dtype: int64

In [16]:
DF.IndexofEconomicFreedom.value_counts()

mostly unfree      64
moderately free    59
mostly free        29
repressed          22
free                6
Name: IndexofEconomicFreedom, dtype: int64

In [17]:
DF.PressFreedomIndex.value_counts()

noticeable problems       73
difficult situation       53
satisfactory situation    28
very serious situation    19
good situation            15
mostly free                1
Name: PressFreedomIndex, dtype: int64

In [18]:
DF.DemocracyIndex.value_counts()

flawed democracy        55
authoritarian regime    53
hybrid regime           39
full democracy          20
Name: DemocracyIndex, dtype: int64
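
In the counts above, `PressFreedomIndex` contains a single `mostly free`, a label that otherwise appears only in the economic freedom column and so looks misplaced. A possible way to flag such values is to compare the column against the set of levels you expect. The sketch below does this on a small made-up series; the list `expected_levels` is an assumption read off the counts above, not an official definition of the index.

```python
import pandas as pd

# Hypothetical sample of the column, including one missing value.
press = pd.Series(
    ["noticeable problems", "difficult situation", "mostly free",
     "good situation", None],
    name="PressFreedomIndex",
)

# Assumed scale, taken from the value counts shown above.
expected_levels = [
    "good situation",
    "satisfactory situation",
    "noticeable problems",
    "difficult situation",
    "very serious situation",
]

# Values that are present but do not belong to the expected scale.
unexpected = press[press.notna() & ~press.isin(expected_levels)]
print(unexpected.tolist())
```

Missing values are excluded on purpose, so that only genuinely unexpected labels are reported.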

### Exercise

Fetch and clean the table found at this [link](https://www.cia.gov/library/publications/resources/the-world-factbook/fields/349.html)

In [21]:
import IPython
ciaLink="https://www.cia.gov/library/publications/resources/the-world-factbook/fields/349.html" 
ciaTables=pd.read_html(ciaLink,header=0,flavor='bs4',attrs={'id': 'fieldListing'})
ciaTable=ciaTables[0]

In [23]:
ciaTable.head(3)

| | Country | Urbanization |
|---|---|---|
| 0 | Afghanistan | urban population: 25.5% of total population ... |
| 1 | Albania | urban population: 60.3% of total population ... |
| 2 | Algeria | urban population: 72.6% of total population ... |


In [33]:
# urban share: text before 'rate of ' -> text before the first '%' -> text after ':  '
ciaTable['urbPop']=[v.split('rate of ')[0].split('%')[0].split(':  ')[1] for v in ciaTable.Urbanization]

In [40]:
# urbanization rate: text after 'rate of ' -> text before the first '%' -> text after ':  '
ciaTable['rateUrb']=[v.split('rate of ')[1].split('%')[0].split(':  ')[1] for v in ciaTable.Urbanization]
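
The chained `split` calls depend on the exact spacing and wording of each cell, and they leave the results as text. As a hedged alternative, the sketch below pulls out the two percentages with a regular expression and converts them to `float`. The strings in `samples` are hypothetical reconstructions of the cell format (the real cells are truncated in the output shown here), and `two_percentages` is a helper name invented for this example.

```python
import re

# Hypothetical strings in the shape suggested by the Urbanization column.
samples = [
    "urban population: 25.5% of total population (2018) "
    "rate of urbanization: 3.37% annual rate of change",
    "urban population: 88.1% of total population (2018) "
    "rate of urbanization: -0.31% annual rate of change",
    "urban population: 100% of total population (2018) "
    "rate of urbanization: 0.9% annual rate of change",
]

# A number with an optional minus sign and decimal part, followed by '%'.
PERCENT = re.compile(r"(-?\d+(?:\.\d+)?)\s*%")


def two_percentages(text):
    """Return the first two percentages in text as floats, or (None, None)."""
    found = PERCENT.findall(text)
    if len(found) < 2:
        return (None, None)
    return (float(found[0]), float(found[1]))


parsed = [two_percentages(s) for s in samples]
print(parsed)
```

Returning `(None, None)` for cells that do not contain two percentages avoids the `IndexError` that a chained `split` would raise on an unexpected format.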

In [41]:
ciaTable

| | Country | Urbanization | urbPop | rateUrb |
|---|---|---|---|---|
| 0 | Afghanistan | urban population: 25.5% of total population ... | 25.5 | 3.37 |
| 1 | Albania | urban population: 60.3% of total population ... | 60.3 | 1.69 |
| 2 | Algeria | urban population: 72.6% of total population ... | 72.6 | 2.46 |
| 3 | American Samoa | urban population: 87.2% of total population ... | 87.2 | 0.07 |
| 4 | Andorra | urban population: 88.1% of total population ... | 88.1 | -0.31 |
| 5 | Angola | urban population: 65.5% of total population ... | 65.5 | 4.32 |
| 6 | Anguilla | urban population: 100% of total population (... | 100 | 0.9 |
| 7 | Antigua and Barbuda | urban population: 24.6% of total population ... | 24.6 | 0.55 |
| 8 | Argentina | urban population: 91.9% of total population ... | 91.9 | 1.07 |
| 9 | Armenia | urban population: 63.1% of total population ... | 63.1 | 0.22 |


____

[Back to top](#beginning)

_____

**SPONSORSHIP**:

* The development of these materials was made possible by a grant from the Berkeley Initiative for Transparency in the Social Sciences (BITSS) at the Center for Effective Global Action (CEGA) at the University of California, Berkeley.


<center>
<img src="https://www.bitss.org/wp-content/uploads/2015/07/bitss-55a55026v1_site_icon.png" style="width: 200px;"/>
</center>

* This course is sponsored by:


<center>
<img src="https://www.python.org/static/img/psf-logo@2x.png" style="width: 500px;"/>
</center>



**ACKNOWLEDGMENTS**


Dr. Magallanes thanks the Pontificia Universidad Católica del Perú for supporting his participation in the Escuela ALACIP.

<center>
<img src="https://dci.pucp.edu.pe/wp-content/uploads/2014/02/Logotipo_colores-290x145.jpg" style="width: 400px;"/>
</center>


The author acknowledges the support that the eScience Institute at the University of Washington has provided since 2015 for his research in Data Science.

<center>
<img src="https://escience.washington.edu/wp-content/uploads/2015/10/eScience_Logo_HR.png" style="width: 500px;"/>
</center>

<br>
<br>