# pandas ecosystem 2019

@datapythonista

## About me

Marc Garcia

**@datapythonista** (Twitter, GitHub, LinkedIn,...)

- pandas core dev
- Python fellow
- 13 years of experience with Python
- Contractor data scientist
- **NumFOCUS ambassador**

![](img/pandas_ecosystem.jpg)

## Agenda

- pandas components
- hardware and numpy
- data structures
- indexing
- functionality
- pandas API
- pandas alternatives
- distributing pandas

![](img/software.png)

![](img/components_01_numpy.png)

In [None]:
2 + 2

In [None]:
class Object:
    def __add__(self, other):
        # add numbers
        # concatenate strings or arrays
        # add date to delta
        return do_something(self, other)
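
To make the dispatch concrete: the same `+` operator resolves to a different `__add__` depending on the operand type. A minimal sketch using only the standard library (variable names are illustrative, not from the talk):

```python
import datetime

# The same operator dispatches to a type-specific __add__ implementation.
int_sum = 2 + 2                      # integer addition
text = 'pan' + 'das'                 # string concatenation
items = [1, 2] + [3]                 # list concatenation
next_day = datetime.date(2019, 5, 3) + datetime.timedelta(days=1)  # date + delta

# For built-in types, `a + b` roughly means `type(a).__add__(a, b)`.
explicit = int.__add__(2, 2)
```

Python has to look up the right implementation at run time for every single addition, which is part of the cost discussed next.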

**PyObject**

- type
- reference count (garbage collection)
- value
  - can be a complex structure (e.g. arbitrary-precision integers)
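
These fields can be inspected from Python itself. The sketch below assumes CPython; exact sizes and reference counts vary between versions and platforms, so no concrete numbers are claimed:

```python
import sys

small = 1
huge = 2 ** 100  # still an `int`: Python integers have arbitrary precision

type_of_small = type(small)        # every object carries its type
size_small = sys.getsizeof(small)  # bytes for the whole object, not just the value
size_huge = sys.getsizeof(huge)    # grows with the magnitude of the number
refs = sys.getrefcount(huge)       # reference count used for memory management
```

Even a small integer takes noticeably more memory than the 4 or 8 bytes a C integer would need.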

![](img/cpu_memory_speed.jpg)

Python lists at scale (e.g. adding 1,000,000 integers)

- Inefficient **representation** when we can assume homogeneous types
- **Transferring** a lot of extra information from memory to CPU
- Poor use of CPU **caches**
- **Slow** compared to operations with homogeneous types in C

More efficient representation: `array` module

In [1]:
import array

array.array('I', [1, 2, 3, 4])

array('I', [1, 2, 3, 4])
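
A rough way to see the difference in representation is to compare memory use. This is a sketch assuming CPython; `sys.getsizeof` only approximates the real footprint, and the variable names are illustrative:

```python
import array
import sys

values = list(range(1_000, 2_000))  # 1,000 distinct int objects
packed = array.array('I', values)   # the same numbers as raw unsigned ints

# A list stores pointers; each element is a separate Python object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# An array stores the values contiguously, `itemsize` bytes each.
array_bytes = packed.itemsize * len(packed)
```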

In [2]:
import random

big_list = [random.randint(0, 255) for i in range(10_000_000)]

big_array = array.array('I', big_list)

In [3]:
%timeit sum(big_list)

44.8 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [4]:
%timeit sum(big_array)

53.9 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [5]:
import numpy

big_numpy = numpy.array(big_array)

In [6]:
%timeit big_numpy.sum()

8.36 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
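
One plausible reading of these timings: the built-in `sum` iterates element by element, so each value stored in the `array` has to be turned back into a Python `int` object first, while `numpy` runs the whole loop in compiled code over the raw buffer. A small sketch of that difference (variable names are illustrative):

```python
import array

import numpy

values = array.array('I', [1, 2, 3, 4])

# Iterating over an array yields regular Python int objects, one per element.
element_types = {type(v) for v in values}

# numpy keeps the data in a typed buffer and sums it without creating
# a Python object per element.
np_values = numpy.array(values)
total = np_values.sum()
```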


In [7]:
numpy.array([(1, 2, 3),
             (4, 5, 6)],
            dtype=numpy.uint8)

array([[1, 2, 3],
       [4, 5, 6]], dtype=uint8)

![](img/components_02_data_structures.png)

In [8]:
python_list = [1, 2, 3, 4]

python_list[2]

3

In [9]:
python_list[1:-1]

[2, 3]

In [10]:
python_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

python_dict['b']

2

In [11]:
import pandas

series = pandas.Series([1, 2, 3, 4],
                       index=['a', 'b', 'c', 'd'])

In [12]:
series[1:-1]

b    2
c    3
dtype: int64

In [13]:
series['b']

2

In [14]:
series.dtype

dtype('int64')

In [15]:
series + 1

a    2
b    3
c    4
d    5
dtype: int64

In [16]:
series.mean()

2.5

In [17]:
series.values

array([1, 2, 3, 4])

In [18]:
df = pandas.DataFrame({'countries': ['it', 'pt', 'es', 'gr'],
                       'numbers': series,
                       'floats': [3.141592, 2.718281828, float('NaN'), 0.]})
df

|   | countries | numbers | floats   |
|---|-----------|---------|----------|
| a | it        | 1       | 3.141592 |
| b | pt        | 2       | 2.718282 |
| c | es        | 3       | NaN      |
| d | gr        | 4       | 0.0      |


![](img/arrow_web.png)

![](img/apache_memory.png)

**pandas backend requirements:**
    
1. efficient data representation
2. fast memory access
3. fast operations (e.g. sum)

Apache Arrow is mature in 1 and 2. Work is still needed for 3 (Gandiva is an effort in this direction).

Wes is hiring full-time Apache Arrow developers (C++). Check Ursa Labs.

![](img/components_arrow.png)

![](img/components_backends.png)

In practice: **Extension arrays**

- Backend still mainly based on numpy
- Columns can use other libraries
  - Example: **Fletcher** (Arrow strings with the current backend)
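
Fletcher itself is not shown here. As a stand-in, the sketch below uses the nullable integer dtype `Int64` that ships with pandas (0.24 or later), which is implemented as an extension array:

```python
import pandas

# 'Int64' (capital I) is a pandas extension dtype, not a numpy dtype.
nullable = pandas.Series([1, 2, None], dtype='Int64')

# The column is backed by an ExtensionArray instead of a plain numpy array.
is_extension = isinstance(nullable.array, pandas.api.extensions.ExtensionArray)

# The missing value does not force the column to be cast to float.
missing = int(nullable.isna().sum())
```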

![](img/components_03_indexing.png)

**Labels**

- Key feature of pandas
- Access data by names (rows or columns)
- Support for multi-level indexes (`MultiIndex`)

![](img/stuck_pandas.jpg)

![](img/pandas_indexing.png)

In [19]:
gdp = (pandas.DataFrame([('USA', 'Americas', 19_390_604, 20_035_183, 47., 49.),
                        ('China', 'Asia', 12_237_700, 15_554_902, 46.5, 48.),
                        ('Japan', 'Asia', 4_872_137, 4_893_502, 37.9, 36.2),
                        ('Germany', 'Europe', 3_677_439, 3_732_192, 27., 27.3),
                        ('UK', 'Europe', 2_622_434, 2_591_883, 32.4, 34.1),
                        ('India', 'Asia', 2_597_491, 2_712_658, 35.1, 37.9)],
                       columns=['country', 'continent', None, None, None, None])
             .set_index(['continent', 'country']))
gdp.columns = [['gdp', 'gdp', 'gini', 'gini'], [2017, 2018] * 2]
gdp

| continent | country | gdp, 2017 | gdp, 2018 | gini, 2017 | gini, 2018 |
|-----------|---------|-----------|-----------|------------|------------|
| Americas  | USA     | 19390604  | 20035183  | 47.0       | 49.0       |
| Asia      | China   | 12237700  | 15554902  | 46.5       | 48.0       |
| Asia      | Japan   | 4872137   | 4893502   | 37.9       | 36.2       |
| Europe    | Germany | 3677439   | 3732192   | 27.0       | 27.3       |
| Europe    | UK      | 2622434   | 2591883   | 32.4       | 34.1       |
| Asia      | India   | 2597491   | 2712658   | 35.1       | 37.9       |


In [20]:
gdp.loc['Asia']

| country | gdp, 2017 | gdp, 2018 | gini, 2017 | gini, 2018 |
|---------|-----------|-----------|------------|------------|
| China   | 12237700  | 15554902  | 46.5       | 48.0       |
| Japan   | 4872137   | 4893502   | 37.9       | 36.2       |
| India   | 2597491   | 2712658   | 35.1       | 37.9       |


In [21]:
gdp.loc['Asia', 'gdp']

| country | 2017     | 2018     |
|---------|----------|----------|
| China   | 12237700 | 15554902 |
| Japan   | 4872137  | 4893502  |
| India   | 2597491  | 2712658  |


In [22]:
gdp.xs(2018, axis='columns', level=1)

| continent | country | gdp      | gini |
|-----------|---------|----------|------|
| Americas  | USA     | 20035183 | 49.0 |
| Asia      | China   | 15554902 | 48.0 |
| Asia      | Japan   | 4893502  | 36.2 |
| Europe    | Germany | 3732192  | 27.3 |
| Europe    | UK      | 2591883  | 34.1 |
| Asia      | India   | 2712658  | 37.9 |


The pandas indexing system is reused by other projects, mainly **xarray**.

xarray data structures are N-dimensional labelled arrays (based on numpy).

![](img/xarray.png)

In [23]:
import xarray

gdp_xarray = xarray.Dataset.from_dataframe(gdp)
gdp_xarray

<xarray.Dataset>
Dimensions:         (continent: 3, country: 6)
Coordinates:
  * continent       (continent) object 'Americas' 'Asia' 'Europe'
  * country         (country) object 'China' 'Germany' 'India' ... 'UK' 'USA'
Data variables:
    ('gdp', 2017)   (continent, country) float64 nan nan nan ... 2.622e+06 nan
    ('gdp', 2018)   (continent, country) float64 nan nan nan ... 2.592e+06 nan
    ('gini', 2017)  (continent, country) float64 nan nan nan ... nan 32.4 nan
    ('gini', 2018)  (continent, country) float64 nan nan nan ... nan 34.1 nan

In [24]:
gdp_xarray.loc[{'continent': 'Asia'}]

<xarray.Dataset>
Dimensions:         (country: 6)
Coordinates:
    continent       <U4 'Asia'
  * country         (country) object 'China' 'Germany' 'India' ... 'UK' 'USA'
Data variables:
    ('gdp', 2017)   (country) float64 1.224e+07 nan 2.597e+06 4.872e+06 nan nan
    ('gdp', 2018)   (country) float64 1.555e+07 nan 2.713e+06 4.894e+06 nan nan
    ('gini', 2017)  (country) float64 46.5 nan 35.1 37.9 nan nan
    ('gini', 2018)  (country) float64 48.0 nan 37.9 36.2 nan nan

## pandas functionality

![](img/components_04_io.png)

<table class="colwidths-given docutils" border="1">
<colgroup>
<col width="12%">
<col width="40%">
<col width="24%">
<col width="24%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Format Type</th>
<th class="head">Data Description</th>
<th class="head">Reader</th>
<th class="head">Writer</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>text</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a></td>
<td><a class="reference internal" href="#io-read-csv-table"><span class="std std-ref">read_csv</span></a></td>
<td><a class="reference internal" href="#io-store-in-csv"><span class="std std-ref">to_csv</span></a></td>
</tr>
<tr class="row-odd"><td>text</td>
<td><a class="reference external" href="https://www.json.org/">JSON</a></td>
<td><a class="reference internal" href="#io-json-reader"><span class="std std-ref">read_json</span></a></td>
<td><a class="reference internal" href="#io-json-writer"><span class="std std-ref">to_json</span></a></td>
</tr>
<tr class="row-even"><td>text</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/HTML">HTML</a></td>
<td><a class="reference internal" href="#io-read-html"><span class="std std-ref">read_html</span></a></td>
<td><a class="reference internal" href="#io-html"><span class="std std-ref">to_html</span></a></td>
</tr>
<tr class="row-odd"><td>text</td>
<td>Local clipboard</td>
<td><a class="reference internal" href="#io-clipboard"><span class="std std-ref">read_clipboard</span></a></td>
<td><a class="reference internal" href="#io-clipboard"><span class="std std-ref">to_clipboard</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Microsoft_Excel">MS Excel</a></td>
<td><a class="reference internal" href="#io-excel-reader"><span class="std std-ref">read_excel</span></a></td>
<td><a class="reference internal" href="#io-excel-writer"><span class="std std-ref">to_excel</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://support.hdfgroup.org/HDF5/whatishdf5.html">HDF5 Format</a></td>
<td><a class="reference internal" href="#io-hdf5"><span class="std std-ref">read_hdf</span></a></td>
<td><a class="reference internal" href="#io-hdf5"><span class="std std-ref">to_hdf</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://github.com/wesm/feather">Feather Format</a></td>
<td><a class="reference internal" href="#io-feather"><span class="std std-ref">read_feather</span></a></td>
<td><a class="reference internal" href="#io-feather"><span class="std std-ref">to_feather</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://parquet.apache.org/">Parquet Format</a></td>
<td><a class="reference internal" href="#io-parquet"><span class="std std-ref">read_parquet</span></a></td>
<td><a class="reference internal" href="#io-parquet"><span class="std std-ref">to_parquet</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://msgpack.org/index.html">Msgpack</a></td>
<td><a class="reference internal" href="#io-msgpack"><span class="std std-ref">read_msgpack</span></a></td>
<td><a class="reference internal" href="#io-msgpack"><span class="std std-ref">to_msgpack</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Stata">Stata</a></td>
<td><a class="reference internal" href="#io-stata-reader"><span class="std std-ref">read_stata</span></a></td>
<td><a class="reference internal" href="#io-stata-writer"><span class="std std-ref">to_stata</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/SAS_(software)">SAS</a></td>
<td><a class="reference internal" href="#io-sas-reader"><span class="std std-ref">read_sas</span></a></td>
<td>&nbsp;</td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://docs.python.org/3/library/pickle.html">Python Pickle Format</a></td>
<td><a class="reference internal" href="#io-pickle"><span class="std std-ref">read_pickle</span></a></td>
<td><a class="reference internal" href="#io-pickle"><span class="std std-ref">to_pickle</span></a></td>
</tr>
<tr class="row-even"><td>SQL</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/SQL">SQL</a></td>
<td><a class="reference internal" href="#io-sql"><span class="std std-ref">read_sql</span></a></td>
<td><a class="reference internal" href="#io-sql"><span class="std std-ref">to_sql</span></a></td>
</tr>
<tr class="row-odd"><td>SQL</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/BigQuery">Google Big Query</a></td>
<td><a class="reference internal" href="#io-bigquery"><span class="std std-ref">read_gbq</span></a></td>
<td><a class="reference internal" href="#io-bigquery"><span class="std std-ref">to_gbq</span></a></td>
</tr>
</tbody>
</table>

In [25]:
import pandas

countries = pandas.read_csv('https://github.com/datapythonista/pandas_ecosystem/raw/master/data/countries.csv.gz',
                            sep=';')
countries

|   | Country (en) | Country (de) | Country (local) | Country code | Continent | Capital | Population | Area | Coastline | Government form | Currency | Currency code | Dialing prefix | Birthrate | Deathrate | Url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | Albanien | Shqipëria | AL | Europe | Tirana | 2873457 | 28750 | 362 | Parliamentary republic | Lek | ALL | 355 | 11.8 | 7.4 | https://www.laenderdaten.info/Europa/Albanien/... |
| 1 | Angola | Angola | Angola | AO | Africa | Luanda | 29784193 | 1246700 | 1600 | Presidential republic | Kwanza | AOA | 244 | 41.8 | 8.5 | https://www.laenderdaten.info/Afrika/Angola/in... |
| 2 | Equatorial Guinea | Äquatorialguinea | Guinea Ecuatorial | GQ | Africa | Malabo | 1267689 | 28050 | 296 | Presidential republic | Central African Franc | XAF | 240 | 34.1 | 10.2 | https://www.laenderdaten.info/Afrika/Aequatori... |
| 3 | Azerbaijan | Aserbaidschan | Azärbaycan | AZ | Asia | Baku | 9862429 | 86600 | 0 | Presidential republic | Manat | AZN | 994 | 16.3 | 5.8 | https://www.laenderdaten.info/Asien/Aserbaidsc... |


![](img/components_05_joins.png)

**Joins**

![](img/pandas_join.png)
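
A minimal sketch of two join types with `DataFrame.merge`. The frames and their values are toy data made up for illustration:

```python
import pandas

left = pandas.DataFrame({'country': ['it', 'pt', 'es'], 'x': [1, 2, 3]})
right = pandas.DataFrame({'country': ['it', 'es', 'gr'], 'y': [10, 30, 40]})

# Inner join: only keys present in both frames ('it' and 'es').
inner = left.merge(right, on='country', how='inner')

# Left join: every row of `left`; 'pt' has no match, so its `y` is NaN.
left_join = left.merge(right, on='country', how='left')
```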

**Concatenate**

![](img/pandas_concat.png)
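
A short sketch of `pandas.concat` along both axes, again with made-up toy frames:

```python
import pandas

first = pandas.DataFrame({'a': [1, 2]}, index=['x', 'y'])
second = pandas.DataFrame({'a': [3]}, index=['z'])
other = pandas.DataFrame({'b': [10, 20]}, index=['x', 'y'])

# Default axis=0: stack the rows of the frames on top of each other.
stacked = pandas.concat([first, second])

# axis=1: align on the index and place the columns side by side.
side_by_side = pandas.concat([first, other], axis=1)
```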

![](img/components_06_reshape.png)

![](img/pandas_reshape.png)
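
As one possible illustration of reshaping, the sketch below moves toy data (values made up) from long to wide format with `pivot` and back with `melt`:

```python
import pandas

long_format = pandas.DataFrame({'country': ['it', 'it', 'es', 'es'],
                                'year': [2017, 2018, 2017, 2018],
                                'value': [1.0, 2.0, 3.0, 4.0]})

# Long to wide: one row per country, one column per year.
wide = long_format.pivot(index='country', columns='year', values='value')

# Wide to long again: one row per (country, year) pair.
back = wide.reset_index().melt(id_vars='country', var_name='year', value_name='value')
```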

![](img/components_07_groupby.png)

In [26]:
gdp.groupby('continent').mean()

| continent | gdp, 2017  | gdp, 2018  | gini, 2017 | gini, 2018 |
|-----------|------------|------------|------------|------------|
| Americas  | 19390600.0 | 20035183.0 | 47.0       | 49.0       |
| Asia      | 6569109.0  | 7720354.0  | 39.833333  | 40.7       |
| Europe    | 3149936.0  | 3162037.5  | 29.7       | 30.7       |


In [27]:
gdp.groupby(axis='columns', level=1).mean()

| continent | country | 2017       | 2018       |
|-----------|---------|------------|------------|
| Americas  | USA     | 9695325.5  | 10017616.0 |
| Asia      | China   | 6118873.25 | 7777475.0  |
| Asia      | Japan   | 2436087.45 | 2446769.1  |
| Europe    | Germany | 1838733.0  | 1866109.65 |
| Europe    | UK      | 1311233.2  | 1295958.55 |
| Asia      | India   | 1298763.05 | 1356347.95 |


![](img/components_08_window.png)

`.rolling()` (e.g. moving average)

![](img/moving_average.png)
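
A minimal moving-average sketch with a window of three observations (the series values are made up):

```python
import pandas

prices = pandas.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each result is the mean of the current value and the two before it.
# The first two positions have no full window, so they are NaN.
moving_average = prices.rolling(window=3).mean()
```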

![](img/components_09_stats.png)

In [28]:
gdp.describe()

|       | gdp, 2017  | gdp, 2018  | gini, 2017 | gini, 2018 |
|-------|------------|------------|------------|------------|
| count | 6.0        | 6.0        | 6.0        | 6.0        |
| mean  | 7566301.0  | 8253387.0  | 37.65      | 38.75      |
| std   | 6828598.0  | 7571066.0  | 7.915744   | 8.37347    |
| min   | 2597491.0  | 2591883.0  | 27.0       | 27.3       |
| 25%   | 2886185.0  | 2967542.0  | 33.075     | 34.625     |
| 50%   | 4274788.0  | 4312847.0  | 36.5       | 37.05      |
| 75%   | 10396310.0 | 12889550.0 | 44.35      | 45.475     |
| max   | 19390600.0 | 20035180.0 | 47.0       | 49.0       |


![](img/components_10_ts.png)

In [29]:
time_series = pandas.Series(range(5),
                            index=pandas.date_range('2019-05-03', periods=5, freq='H'))
time_series

2019-05-03 00:00:00    0
2019-05-03 01:00:00    1
2019-05-03 02:00:00    2
2019-05-03 03:00:00    3
2019-05-03 04:00:00    4
Freq: H, dtype: int64

In [30]:
time_series.resample('30min').mean()

2019-05-03 00:00:00    0.0
2019-05-03 00:30:00    NaN
2019-05-03 01:00:00    1.0
2019-05-03 01:30:00    NaN
2019-05-03 02:00:00    2.0
2019-05-03 02:30:00    NaN
2019-05-03 03:00:00    3.0
2019-05-03 03:30:00    NaN
2019-05-03 04:00:00    4.0
Freq: 30T, dtype: float64

In [31]:
pandas.date_range('2019-05-03',
                  periods=5,
                  freq='H',
                  tz='Europe/Rome')

DatetimeIndex(['2019-05-03 00:00:00+02:00', '2019-05-03 01:00:00+02:00',
               '2019-05-03 02:00:00+02:00', '2019-05-03 03:00:00+02:00',
               '2019-05-03 04:00:00+02:00'],
              dtype='datetime64[ns, Europe/Rome]', freq='H')

In [32]:
time_series.index + pandas.offsets.BDay(n=1)

DatetimeIndex(['2019-05-06 00:00:00', '2019-05-06 01:00:00',
               '2019-05-06 02:00:00', '2019-05-06 03:00:00',
               '2019-05-06 04:00:00'],
              dtype='datetime64[ns]', freq='H')

![](img/components_11_str.png)

In [33]:
cities = pandas.Series(['florence', 'milano', 'napoli', 'rome'])
cities

0    florence
1      milano
2      napoli
3        rome
dtype: object

In [34]:
cities.str.title()

0    Florence
1      Milano
2      Napoli
3        Rome
dtype: object

In [35]:
(cities.str[::-1]
       .str.title())

0    Ecnerolf
1      Onalim
2      Ilopan
3        Emor
dtype: object

**Accessors**

- `.str`: strings
- `.dt`: datetimes
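
The `.str` accessor is shown in the cells above; a comparable sketch for `.dt` could look like this (variable names are illustrative):

```python
import pandas

dates = pandas.Series(pandas.date_range('2019-05-03', periods=3, freq='D'))

# Datetime-specific attributes and methods live under the `.dt` namespace.
years = dates.dt.year
day_names = dates.dt.day_name()
```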

Custom accessors to extend pandas (e.g. `.ip` in cyberpandas)
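
A minimal sketch of registering a custom accessor with `pandas.api.extensions.register_series_accessor`. The accessor name `demo` and its `value_range` method are hypothetical, invented for this example; they are not part of pandas or cyberpandas:

```python
import pandas


@pandas.api.extensions.register_series_accessor('demo')  # hypothetical name
class DemoAccessor:
    def __init__(self, series):
        # pandas passes the Series the accessor is called on.
        self._series = series

    def value_range(self):
        """Difference between the largest and the smallest value."""
        return self._series.max() - self._series.min()


value_range = pandas.Series([3, 10, 7]).demo.value_range()
```

After registration, every `Series` exposes the new namespace, in the same way `.str` and `.dt` are exposed.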

Would it make sense for "everything" in pandas to be a _plugin_?

Examples:

```python
df['gdp'].stats.mean()
df.io.to_csv()
```

Trade-off: more verbose for users, and deciding how to group methods is not trivial, _vs_ a more organised and scalable API.

In [36]:
dir(pandas.Series())

['T',
 '_AXIS_ALIASES',
 '_AXIS_IALIASES',
 '_AXIS_LEN',
 '_AXIS_NAMES',
 '_AXIS_NUMBERS',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_SLICEMAP',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_prepare__',
 '__array_priority__',
 '__array_wrap__',
 '__bool__',
 '__bytes__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__long__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 

![](img/components_12_nan.png)

In [37]:
import numpy

where_am_i = pandas.Series(['London', 'Florence', numpy.nan, numpy.nan, 'Milano'],
                           index=pandas.date_range('2019-05-02', periods=5))
where_am_i

2019-05-02      London
2019-05-03    Florence
2019-05-04         NaN
2019-05-05         NaN
2019-05-06      Milano
Freq: D, dtype: object

In [38]:
where_am_i.fillna('Florence')

2019-05-02      London
2019-05-03    Florence
2019-05-04    Florence
2019-05-05    Florence
2019-05-06      Milano
Freq: D, dtype: object

In [39]:
where_am_i.fillna(method='ffill')

2019-05-02      London
2019-05-03    Florence
2019-05-04    Florence
2019-05-05    Florence
2019-05-06      Milano
Freq: D, dtype: object

In [None]:
where_am_i.dropna()

![](img/components_13_plots.png)

In [None]:
%matplotlib inline

In [None]:
stocks = pandas.DataFrame({'MSFT': numpy.random.random(10),
                           'AMZN': numpy.random.random(10)},
                          index=pandas.date_range('2019-05-03', periods=10))
stocks.plot();

Extending pandas plots:

- hvplot: https://hvplot.pyviz.org/
- pandas-bokeh: https://github.com/PatrikHlobil/Pandas-Bokeh
- ...

In [None]:
import hvplot.pandas

stocks.hvplot()

In [None]:
import pandas_bokeh

stocks.plot_bokeh(kind="line");

Ideally, pandas should support this:
    
```python
pandas.set_option('plotting.backend', 'bokeh')

df.plot()
```

On the "roadmap" for 3 years: https://github.com/pandas-dev/pandas/issues/14130

![](img/pizza.jpg)

![](img/components_14_api.png)

The **pandas API** is huge.

`Series` alone has more than 300 methods and attributes.

More than 1,500 public objects.

`pandas.read_csv` has around 50 parameters.
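
These figures can be checked roughly against whichever pandas version is installed. The sketch below counts names that do not start with an underscore, which is only an approximation of "public", and the results will differ between releases:

```python
import inspect

import pandas

# Attributes and methods of Series whose names do not start with '_'.
public_series_names = [name for name in dir(pandas.Series)
                       if not name.startswith('_')]

# Parameters accepted by read_csv, taken from its signature.
read_csv_parameters = inspect.signature(pandas.read_csv).parameters

print(len(public_series_names), len(read_csv_parameters))
```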

## pandas alternatives

- Vaex: https://github.com/vaexio/vaex
- cudf: https://github.com/rapidsai/cudf
- Static-frame: https://github.com/InvestmentSystems/static-frame

Distributing pandas:

- Dask: https://github.com/dask/dask
- pandas on Ray: https://github.com/modin-project/modin

In [None]:
import vaex

df = vaex.example()
df.head()

In [None]:
df['new_col'] = df['x'] + df['y']

In [None]:
df['new_col']

In [None]:
df['new_col'].mean()

In [None]:
import cudf

# Read with cudf (not pandas) so the data is loaded into GPU memory
df = cudf.read_csv('data/countries.csv')

In [None]:
import static_frame

df = static_frame.Frame.from_json_url('https://jsonplaceholder.typicode.com/photos')

df['id'].mean()

**Dask**

Similar to PySpark, but:

- With pandas API
- Without a JVM

![](img/dask.gif)

## Thank you

Questions and lunch time

![](img/pandas_eating.gif)

Contact online: **@datapythonista** (LinkedIn, Twitter, GitHub,...)