# Processing data with Pandas

## Agenda:


* Why Pandas?

* Series and DataFrames.

* Indexing.

* Reading and saving data.

* Missing data.

* Basic operations.

* Plots.

* Performance tips.


 (Loading data into a DataFrame has an overhead. It pays off when a lot of processing is needed, but loading data into pandas for a single calculation is not advisable.) Mention alternatives?

TODO: add a QR code linking to the repository?

## Why Pandas?


- Open-source Python library (about 11 years old).

(It is an open-source Python library, first released about 11 years ago.)

- It has become almost a standard for data analysis in Python.

(It is practically a standard for doing data analysis quickly in Python.)

- It is built on top of NumPy.

(NumPy is another Python library that implements multidimensional arrays efficiently and is the basis for many of Python's scientific packages.)

- It has high-level data structures.

(These are Series and DataFrames, which make the tool more pleasant to work with.)

## Series and DataFrames

In [23]:
python_list = [1, 2, 3, 4]

python_list[2]

3

In [24]:
python_list[1:-1]

[2, 3]

In [80]:
python_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

python_dict['c']

3

- Series: one-dimensional labeled array

In [111]:
import pandas
serie = pandas.Series([1, 2, 3, 4],
                       index=['a', 'b', 'c', 'd'])
serie

a    1
b    2
c    3
d    4
dtype: int64

In [112]:
serie[1:-1]

b    2
c    3
dtype: int64

In [113]:
serie['c']

3

In [114]:
serie = serie + 1
serie

a    2
b    3
c    4
d    5
dtype: int64

In [116]:
serie + pandas.Series([100, 100, 200, 200], index=['c', 'd', 'a', 'b'])

a    202
b    203
c    104
d    105
dtype: int64
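
The addition above matches values by label rather than by position. As a minimal sketch of what happens when the labels do not fully overlap (the series `left` and `right` are made up for this illustration):

```python
import pandas

left = pandas.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pandas.Series([10, 20, 30], index=['c', 'b', 'z'])

# Values are matched by label, not by position.
# Labels present on only one side ('a' and 'z') produce NaN.
total = left + right

# add() with fill_value treats a label missing on one side as 0.
total_filled = left.add(right, fill_value=0)

print(total)
print(total_filled)
```

The result index is the union of both indexes, which is why the dtype becomes float when NaN values appear.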

In [117]:
serie.mean()

3.5

In [118]:
serie.std()

1.2909944487358056

In [87]:
print(type(serie.values))
serie.values

<class 'numpy.ndarray'>


array([2, 3, 4, 5])

- DataFrame: two-dimensional labeled data structure with columns of potentially different types

In [275]:
df = pandas.DataFrame({'name': ['John', 'Peter', 'Anna', 'David'],
                       'number': serie,
                       'birthdate': ['2010-01-02', float('NaN'), '2011-05-23', 0.]})
df

,name,number,birthdate
a,John,2,2010-01-02
b,Peter,3,
c,Anna,4,2011-05-23
d,David,5,0


In [276]:
#df.reset_index(drop=True)

## Indexing

- By position

In [277]:
df.iloc[2]

name               Anna
number                4
birthdate    2011-05-23
Name: c, dtype: object

- By label

In [278]:
df.loc['c']

name               Anna
number                4
birthdate    2011-05-23
Name: c, dtype: object

- Boolean indexing

In [281]:
df[df['name'] == 'Anna']

,name,number,birthdate
c,Anna,4,2011-05-23
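
The three access styles above can also be combined. The sketch below rebuilds a small frame similar to the one used above (the name `people` and its values are illustrative) and selects rows with a boolean mask and columns by label in a single `.loc` call.

```python
import pandas

# Small frame mirroring the example above (values are illustrative).
people = pandas.DataFrame(
    {'name': ['John', 'Peter', 'Anna', 'David'],
     'number': [2, 3, 4, 5]},
    index=['a', 'b', 'c', 'd'])

# By position: third row, first column.
by_position = people.iloc[2, 0]

# By label: row 'c', column 'name'.
by_label = people.loc['c', 'name']

# Boolean mask for the rows plus a list of column labels, in one .loc call.
big_numbers = people.loc[people['number'] > 3, ['name']]

print(by_position, by_label)
print(big_numbers)
```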


## Reading and saving data

(Pandas can read and write many data formats out of the box, including CSV, JSON, Excel, HDF5, pickle, SQL, and several others.)

<table class="colwidths-given docutils" border="1">
<colgroup>
<col width="12%">
<col width="40%">
<col width="24%">
<col width="24%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Format Type</th>
<th class="head">Data Description</th>
<th class="head">Reader</th>
<th class="head">Writer</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>text</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a></td>
<td><a class="reference internal" href="#io-read-csv-table"><span class="std std-ref">read_csv</span></a></td>
<td><a class="reference internal" href="#io-store-in-csv"><span class="std std-ref">to_csv</span></a></td>
</tr>
<tr class="row-odd"><td>text</td>
<td><a class="reference external" href="https://www.json.org/">JSON</a></td>
<td><a class="reference internal" href="#io-json-reader"><span class="std std-ref">read_json</span></a></td>
<td><a class="reference internal" href="#io-json-writer"><span class="std std-ref">to_json</span></a></td>
</tr>
<tr class="row-even"><td>text</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/HTML">HTML</a></td>
<td><a class="reference internal" href="#io-read-html"><span class="std std-ref">read_html</span></a></td>
<td><a class="reference internal" href="#io-html"><span class="std std-ref">to_html</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Microsoft_Excel">MS Excel</a></td>
<td><a class="reference internal" href="#io-excel-reader"><span class="std std-ref">read_excel</span></a></td>
<td><a class="reference internal" href="#io-excel-writer"><span class="std std-ref">to_excel</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://support.hdfgroup.org/HDF5/whatishdf5.html">HDF5 Format</a></td>
<td><a class="reference internal" href="#io-hdf5"><span class="std std-ref">read_hdf</span></a></td>
<td><a class="reference internal" href="#io-hdf5"><span class="std std-ref">to_hdf</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://github.com/wesm/feather">Feather Format</a></td>
<td><a class="reference internal" href="#io-feather"><span class="std std-ref">read_feather</span></a></td>
<td><a class="reference internal" href="#io-feather"><span class="std std-ref">to_feather</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://parquet.apache.org/">Parquet Format</a></td>
<td><a class="reference internal" href="#io-parquet"><span class="std std-ref">read_parquet</span></a></td>
<td><a class="reference internal" href="#io-parquet"><span class="std std-ref">to_parquet</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://msgpack.org/index.html">Msgpack</a></td>
<td><a class="reference internal" href="#io-msgpack"><span class="std std-ref">read_msgpack</span></a></td>
<td><a class="reference internal" href="#io-msgpack"><span class="std std-ref">to_msgpack</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Stata">Stata</a></td>
<td><a class="reference internal" href="#io-stata-reader"><span class="std std-ref">read_stata</span></a></td>
<td><a class="reference internal" href="#io-stata-writer"><span class="std std-ref">to_stata</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://docs.python.org/3/library/pickle.html">Python Pickle Format</a></td>
<td><a class="reference internal" href="#io-pickle"><span class="std std-ref">read_pickle</span></a></td>
<td><a class="reference internal" href="#io-pickle"><span class="std std-ref">to_pickle</span></a></td>
</tr>
<tr class="row-even"><td>SQL</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/SQL">SQL</a></td>
<td><a class="reference internal" href="#io-sql"><span class="std std-ref">read_sql</span></a></td>
<td><a class="reference internal" href="#io-sql"><span class="std std-ref">to_sql</span></a></td>
</tr>
<tr class="row-odd"><td>SQL</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/BigQuery">Google Big Query</a></td>
<td><a class="reference internal" href="#io-bigquery"><span class="std std-ref">read_gbq</span></a></td>
<td><a class="reference internal" href="#io-bigquery"><span class="std std-ref">to_gbq</span></a></td>
</tr>
</tbody>
</table>
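
Each reader in the table has a matching writer. As a rough sketch of a round trip through two of the text formats, the example below writes a small frame to CSV and JSON in memory and reads it back (the frame `frame` and its values are made up; `io.StringIO` stands in for a real file).

```python
import io
import pandas

# Illustrative data, not taken from any real dataset.
frame = pandas.DataFrame({'country': ['Austria', 'Brazil'],
                          'price_index': [101.5, 98.2]})

# Writer: to_csv returns the CSV text when no path is given.
csv_text = frame.to_csv(index=False)

# Reader: read_csv accepts any file-like object, not only paths.
from_csv = pandas.read_csv(io.StringIO(csv_text))

# Same round trip through JSON, one record per row.
json_text = frame.to_json(orient='records')
from_json = pandas.read_json(io.StringIO(json_text), orient='records')

print(from_csv)
print(from_json)
```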

Some useful parameters of the *read_csv()* function:

- filepath_or_buffer: required. Any valid string path is accepted, and the string can also be a URL.
- parse_dates: columns to parse as dates.
- date_parser: function used to parse dates (deprecated since pandas 2.0 in favor of date_format).
- usecols: return only a subset of the columns.
- dtype: data type for the whole dataset or for individual columns.
- na_values: additional strings to recognize as NA/NaN.
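
A minimal sketch of how these parameters fit together. The CSV content, the `comment` column, and the `unknown` marker are hypothetical and exist only for this illustration; `io.StringIO` stands in for a file path or URL.

```python
import io
import pandas

# Hypothetical CSV content standing in for a real file or URL.
raw = io.StringIO(
    "date,country,price_index,comment\n"
    "1966-03-31,South Africa,56.13,ok\n"
    "1966-03-31,Austria,unknown,no data\n"
)

sample = pandas.read_csv(
    raw,                                          # filepath_or_buffer
    usecols=['date', 'country', 'price_index'],   # skip the 'comment' column
    parse_dates=['date'],                         # parse this column as dates
    dtype={'country': 'category'},                # per-column data type
    na_values=['unknown'],                        # extra marker for missing values
)

print(sample.dtypes)
print(sample)
```

Because `unknown` is declared as a missing-value marker, the `price_index` column can still be read as a numeric column.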

Example:

### *Housing global price indices*


- The dataset comes from the [Bank for International Settlements (BIS)](http://www.bis.org/statistics/pp.htm).

In [283]:
df = pandas.read_csv('housing_global_price_indices.csv', parse_dates=['date'])

## Missing data

In [284]:
print(df.shape)
df.head()

(12627, 3)


,date,country,price_index
0,1966-03-31,Emerging market economies,
1,1966-03-31,Advanced economies,
2,1966-03-31,United Arab Emirates,
3,1966-03-31,Austria,
4,1966-03-31,Australia,


In [285]:
df = df.dropna()
df.shape

(4561, 3)

Explain that deciding how to handle missing values is not trivial and depends heavily on the problem.


Instead of dropping the rows:

- Fill with a constant: fillna(0).

- Fill with a per-year statistic (mean, median, or mode) via fillna.

- Fit a regression per year and per country.
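
The first two alternatives can be sketched as follows. The frame `prices` and its values are made up for the illustration, and the regression option is left out because it depends on the model chosen.

```python
import pandas

# Tiny illustrative frame; the values are made up for the example.
prices = pandas.DataFrame({
    'year': [2000, 2000, 2001, 2001],
    'country': ['A', 'B', 'A', 'B'],
    'price_index': [80.0, None, 90.0, 100.0],
})

# Option 1: drop incomplete rows (what the notebook does).
dropped = prices.dropna()

# Option 2: replace missing values with a constant.
zero_filled = prices.fillna({'price_index': 0})

# Option 3: replace missing values with the mean of the same year.
# transform('mean') returns one value per original row, aligned by index.
year_mean = prices.groupby('year')['price_index'].transform('mean')
mean_filled = prices.assign(
    price_index=prices['price_index'].fillna(year_mean))

print(dropped)
print(zero_filled)
print(mean_filled)
```

Each option changes the statistics computed later in a different way, which is why the choice depends on the problem.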



In [286]:
df.head(10)

,date,country,price_index
60,1966-03-31,South Africa,56.13
121,1966-06-30,South Africa,56.81
182,1966-09-30,South Africa,55.86
243,1966-12-31,South Africa,57.05
304,1967-03-31,South Africa,63.09
365,1967-06-30,South Africa,64.43
426,1967-09-30,South Africa,65.9
487,1967-12-31,South Africa,66.2
548,1968-03-31,South Africa,64.87
570,1968-06-30,United Kingdom,22.5


In [248]:
df.describe()

,price_index,year
count,4561.0,4561.0
mean,90.835694,2004.19009
std,26.251676,11.023502
min,21.87,1966.0
25%,76.35,1999.0
50%,94.7,2008.0
75%,103.52,2012.0
max,218.09,2017.0


In [249]:
df['country'].value_counts()[:10]

South Africa      207
United Kingdom    197
Switzerland       191
Canada            190
United States     167
New Zealand       151
Hong Kong SAR     151
Euro area         150
Korea             127
Sweden            126
Name: country, dtype: int64

In [250]:
df = df.assign(year=df['date'].dt.year)  # assign returns a new DataFrame, which avoids SettingWithCopyWarning


In [251]:
gb = df.groupby(df['date'].dt.year)

In [258]:
gb.price_index.mean()

date
1966     56.462500
1967     64.905000
1968     47.728571
1969     47.007500
1970     53.925625
1971     55.425000
1972     58.808125
1973     64.435000
1974     62.726250
1975     59.027647
1976     59.304000
1977     58.899500
1978     59.504500
1979     59.962727
1980     58.925937
1981     61.500313
1982     59.028750
1983     58.933437
1984     58.532500
1985     56.706250
1986     61.858750
1987     64.629000
1988     72.750227
1989     77.945000
1990     77.297083
1991     76.655417
1992     71.687115
1993     69.443654
1994     72.246786
1995     69.698833
1996     68.804375
1997     70.938125
1998     69.865616
1999     70.749000
2000     76.654783
2001     72.520700
2002     76.968850
2003     80.720635
2004     85.873750
2005     92.909512
2006    101.144891
2007    108.679848
2008    107.030264
2009     99.767797
2010    100.003607
2011     98.721885
2012     98.596025
2013    100.355697
2014    103.363361
2015    106.569713
2016    110.237705
2017    113.580397
Name: price_index, dtype: float64

In [240]:
df

,date,country,price_index,year
0,1966-03-31,Emerging market economies,,1966
1,1966-03-31,Advanced economies,,1966
2,1966-03-31,United Arab Emirates,,1966
3,1966-03-31,Austria,,1966
4,1966-03-31,Australia,,1966
5,1966-03-31,Belgium,,1966
6,1966-03-31,Bulgaria,,1966
7,1966-03-31,Brazil,,1966
8,1966-03-31,Canada,,1966
9,1966-03-31,Switzerland,,1966


In [226]:
ts = df.iloc[0]['date']

In [227]:
ts.year

1966

In [60]:
gdp = (pandas.DataFrame([('USA', 'Americas', 19_390_604, 20_035_183, 47., 49.),
                        ('China', 'Asia', 12_237_700, 15_554_902, 46.5, 48.),
                        ('Japan', 'Asia', 4_872_137, 4_893_502, 37.9, 36.2),
                        ('Germany', 'Europe', 3_677_439, 3_732_192, 27., 27.3),
                        ('UK', 'Europe', 2_622_434, 2_591_883, 32.4, 34.1),
                        ('India', 'Asia', 2_597_491, 2_712_658, 35.1, 37.9)],
                       columns=['country', 'continent', None, None, None, None])
             .set_index(['continent', 'country']))
gdp.columns = [['gdp', 'gdp', 'gini', 'gini'], [2017, 2018] * 2]
gdp

,,gdp,gdp,gini,gini
,,2017,2018,2017,2018
continent,country,,,,
Americas,USA,19390604,20035183,47.0,49.0
Asia,China,12237700,15554902,46.5,48.0
Asia,Japan,4872137,4893502,37.9,36.2
Europe,Germany,3677439,3732192,27.0,27.3
Europe,UK,2622434,2591883,32.4,34.1
Asia,India,2597491,2712658,35.1,37.9


In [67]:
gdp.loc['Asia']

,gdp,gdp,gini,gini
,2017,2018,2017,2018
country,,,,
China,12237700,15554902,46.5,48.0
Japan,4872137,4893502,37.9,36.2
India,2597491,2712658,35.1,37.9


In [68]:
gdp.loc['Asia', 'gdp']

,2017,2018
country,,
China,12237700,15554902
Japan,4872137,4893502
India,2597491,2712658


In [72]:
gdp

,,gdp,gdp,gini,gini
,,2017,2018,2017,2018
continent,country,,,,
Americas,USA,19390604,20035183,47.0,49.0
Asia,China,12237700,15554902,46.5,48.0
Asia,Japan,4872137,4893502,37.9,36.2
Europe,Germany,3677439,3732192,27.0,27.3
Europe,UK,2622434,2591883,32.4,34.1
Asia,India,2597491,2712658,35.1,37.9


In [74]:
gdp.xs(2018, axis='columns', level=1) # Returns a cross-section

,,gdp,gini
continent,country,,
Americas,USA,20035183,49.0
Asia,China,15554902,48.0
Asia,Japan,4893502,36.2
Europe,Germany,3732192,27.3
Europe,UK,2591883,34.1
Asia,India,2712658,37.9
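
As a rough sketch of what the cross-section does, the example below builds a small frame with two column levels (the name `table` and its values are illustrative) and compares `xs` with an equivalent `.loc` selection.

```python
import pandas

# Columns with two levels: indicator and year (values are illustrative).
columns = pandas.MultiIndex.from_product([['gdp', 'gini'], [2017, 2018]])
table = pandas.DataFrame([[1, 2, 30.0, 31.0],
                          [3, 4, 40.0, 41.0]],
                         index=['X', 'Y'], columns=columns)

# xs picks one value of the inner column level and drops that level.
only_2018 = table.xs(2018, axis='columns', level=1)

# The equivalent .loc selection keeps both levels, so the year level
# is dropped explicitly afterwards.
same_via_loc = (table.loc[:, (slice(None), 2018)]
                     .droplevel(1, axis='columns'))

print(only_2018)
print(same_via_loc)
```

`xs` is the shorter spelling when only one value of a level is needed; `.loc` with slices is more flexible when several values are selected at once.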


In [77]:
countries = pandas.read_csv('https://github.com/datapythonista/pandas_ecosystem/raw/master/data/countries.csv.gz',
                            sep=';')
countries

,Country (en),Country (de),Country (local),Country code,Continent,Capital,Population,Area,Coastline,Government form,Currency,Currency code,Dialing prefix,Birthrate,Deathrate,Url
0,Albania,Albanien,Shqipëria,AL,Europe,Tirana,2873457,28750,362,Parliamentary republic,Lek,ALL,355,11.8,7.4,https://www.laenderdaten.info/Europa/Albanien/...
1,Angola,Angola,Angola,AO,Africa,Luanda,29784193,1246700,1600,Presidential republic,Kwanza,AOA,244,41.8,8.5,https://www.laenderdaten.info/Afrika/Angola/in...
2,Equatorial Guinea,Äquatorialguinea,Guinea Ecuatorial,GQ,Africa,Malabo,1267689,28050,296,Presidential republic,Central African Franc,XAF,240,34.1,10.2,https://www.laenderdaten.info/Afrika/Aequatori...
3,Azerbaijan,Aserbaidschan,Azärbaycan,AZ,Asia,Baku,9862429,86600,0,Presidential republic,Manat,AZN,994,16.3,5.8,https://www.laenderdaten.info/Asien/Aserbaidsc...
