# Vectorization: Pandas and GeoPandas

Pandas is built on NumPy, so it supports vectorized computation as well.

&rarr; [Pandas User Guide: Accelerated Operations](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#accelerated-operations)
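To make the idea concrete, here is a minimal sketch that converts a column of temperatures from Celsius to Fahrenheit twice: once row by row in a Python loop and once as a single vectorized expression. The data, the column name `temp_c` and both function names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 10,000 temperature readings in degrees Celsius
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"temp_c": rng.uniform(-10.0, 35.0, size=10_000)})


def to_fahrenheit_loop(frame):
    """Convert one row at a time in a Python-level loop (slow)."""
    return pd.Series(
        [row.temp_c * 9 / 5 + 32 for row in frame.itertuples()],
        index=frame.index,
    )


def to_fahrenheit_vectorized(frame):
    """Convert the whole column in one expression (fast, runs in compiled code)."""
    return frame["temp_c"] * 9 / 5 + 32


loop_result = to_fahrenheit_loop(df)
vectorized_result = to_fahrenheit_vectorized(df)
```

Both functions return the same values. In a notebook you could compare them with `%timeit to_fahrenheit_loop(df)` and `%timeit to_fahrenheit_vectorized(df)`; the vectorized version is usually much faster, but the exact factor depends on your machine and data size.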

## 1. Vectorization in Pandas

To take a closer look at vectorized computation in Pandas, we will work through __Sofia Heisler's repository [PyCon 2017: Optimizing Pandas Code for Performance](https://github.com/s-heisler/pycon2017-optimizing-pandas)__. The repository contains the material for the talk she gave at PyCon 2017.

&rarr; Watch her talk on [YouTube](https://www.youtube.com/watch?v=HN5d490_KKk). I really recommend it (especially if you like panda GIFs).

&rarr; Read her [blog post](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6)

### What we will do

Conference talks and tutorials on GitHub are a great way to stay up to date with current developments in the scientific Python world and a great resource for learning. Therefore, we will:

1. Fork her GitHub Repository, so that we have our own copy of it.
2. Clone our copied GitHub Repository to our computer. 
3. Work through the notebook to learn about Pandas. 

## 2. Vectorization in GeoPandas

### GeoPandas is great ... but still a bit slow

GeoPandas makes spatial analysis in Python a lot easier, but it has a bottleneck: geometric operations are performed using shapely, which, as we have seen, is not the fastest. In addition, operations on a series of shapely objects cannot be vectorized in Python.

Do you know why? 

__Answer:__
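As a hint, the following sketch compares a NumPy array of plain floats with a NumPy array of Python objects. The class `Point2D` is a hypothetical stand-in for a geometry object such as a shapely point; it is not part of shapely or GeoPandas.

```python
import numpy as np


class Point2D:
    """Hypothetical stand-in for a geometry object such as a shapely Point."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def shifted(self, dx):
        """Return a new point moved by dx along the x axis."""
        return Point2D(self.x + dx, self.y)


xs = np.array([1.0, 2.0, 3.0])
points = np.array([Point2D(x, 0.0) for x in xs], dtype=object)

# Floats are stored directly in the array, so NumPy can shift all of them
# in one compiled loop.
shifted_xs = xs + 10.0

# The object array only stores references to Python objects, so every
# element needs its own Python-level method call.
shifted_points = np.array([p.shifted(10.0) for p in points], dtype=object)

print(xs.dtype, points.dtype)
```

Compare the two printed dtypes and think about what NumPy can and cannot do with each kind of array.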

### But the GeoPandas developers also found a solution for this problem: Yet another package - PyGEOS

The PyGEOS package provides vectorized geometric calculations based on GEOS, a C/C++ geometry library. It is pretty new and still under development.

&rarr; Take a look at the [PyGEOS User Guide](https://pygeos.readthedocs.io/en/latest/#)
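A minimal sketch of what "vectorized" means here, assuming pygeos is installed: a whole array of points is created, buffered and measured with one function call each, without a Python loop. The coordinates and the helper name `buffered_areas` are made up for illustration; if pygeos cannot be imported, the helper simply returns `None`.

```python
import numpy as np

try:
    import pygeos  # optional dependency; may not be installed
except ImportError:
    pygeos = None


def buffered_areas(xs, ys, radius):
    """Buffer all points at once and return the buffer areas.

    Returns None if pygeos is not available.
    """
    if pygeos is None:
        return None
    points = pygeos.points(xs, ys)           # array of point geometries
    buffers = pygeos.buffer(points, radius)  # buffers every point in one call
    return pygeos.area(buffers)              # NumPy array of areas


# Hypothetical coordinates in a projected CRS (units: meters)
areas = buffered_areas(
    np.array([0.0, 10.0, 20.0]),
    np.array([0.0, 5.0, 10.0]),
    100.0,
)
print(areas)
```

Each pygeos function takes an array of geometries and loops over it in compiled code, which is where the speed-up over calling shapely once per geometry comes from.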

### PyGEOS integration in GeoPandas
Support for PyGEOS in GeoPandas is already partly implemented. If pygeos is installed in your Python environment, you can enable it. __But beware, it is still in the development phase!__

&rarr; Take a look at the [changes in the geopandas code](https://github.com/geopandas/geopandas/pull/1154/commits/e0658280a54e8f8ad1e9023952671553c756230a)

&rarr; Follow the development on GitHub: 
* [GeoPandas performance: optimizing vectorized operations](https://github.com/geopandas/geopandas/issues/430) 
* [Integrating pygeos in GeoPandas for vectorized array operations](https://github.com/geopandas/geopandas/issues/1155)

### Compare GeoPandas with shapely vs. pygeos

To use the pygeos support in GeoPandas, you need to install pygeos. Unfortunately, there are some conflicts with packages in our _advgeo_ environment. To execute the following code, you therefore need to set up a new conda environment that has pygeos and geopandas installed.

#### 1. Set up pygeos environment

Set up a new conda environment and install the packages pygeos, geopandas and jupyter.


#### 2. Start the jupyter notebook in this new environment and open this notebook again. 

#### 3. Run the comparison

In [50]:
import geopandas as gpd

### Scenario: Buffering locations of the DWD temperature stations

As a case study we will perform a simple geometric operation: buffering the points of the DWD temperature measurement stations. 

In [51]:
file_path = "./data/DWD_temperature_shp/DWD_temperature.shp"

#### Buffering using shapely
In order to use shapely to perform the geometric operations, we need to disable the pygeos support of GeoPandas. 

In [53]:
gpd.options.use_pygeos = False

In [54]:
data_shapely = gpd.read_file(file_path)

In [55]:
# Reproject to UTM zone 32N (EPSG:32632) so the buffer distance is in meters
data_shapely.to_crs(epsg=32632, inplace=True)

In [57]:
%%timeit
data_shapely.buffer(100)

34.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


#### Buffering using PyGEOS

In [58]:
gpd.options.use_pygeos = True

In [59]:
file_path = "./data/DWD_temperature_shp/DWD_temperature.shp"

In [60]:
data_pygeos = gpd.read_file(file_path)

In [62]:
# Reproject to UTM zone 32N (EPSG:32632) so the buffer distance is in meters
data_pygeos.to_crs(epsg=32632, inplace=True)

In [63]:
%%timeit
data_pygeos.buffer(100)

19.9 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
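To put the two measurements side by side, the sketch below computes the speed-up from the mean times reported above. It also defines a small helper, `best_time_ms` (a made-up name), that can time any callable outside a notebook using the standard `timeit` module. The numbers come from one run on one machine, so your results will differ.

```python
import timeit

# Mean times per loop taken from the two %%timeit outputs above
shapely_ms = 34.7
pygeos_ms = 19.9

speedup = shapely_ms / pygeos_ms
print(f"pygeos was about {speedup:.2f}x faster than shapely in this run")


def best_time_ms(func, repeat=7, number=10):
    """Return the best average time per call in milliseconds.

    Mirrors the '7 runs, 10 loops each' setup used by %%timeit above.
    """
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number * 1000
```

With the GeoDataFrames from the cells above, you could call for example `best_time_ms(lambda: data_pygeos.buffer(100))`. Note that the helper reports the best run, whereas `%%timeit` reports the mean and standard deviation.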


### Resources

[Introducing pygeos](https://caspervdw.github.io/Introducing-Pygeos/)

[PyGEOS Documentation](https://pygeos.readthedocs.io/en/latest/)
   

[Cythonize Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html)

[Performance of Pandas apply vs np.vectorize to create new column from existing columns](https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c)

[Running Cython (PDF)](http://homepages.math.uic.edu/~jan/mcs275/running_cython.pdf)
