<img align="right" style="display:inline;" width="400" src="https://camo.githubusercontent.com/c288679ac2172d1804d8c73e0bb79e066f57e358/68747470733a2f2f676973742e6769746875622e636f6d2f646576696e2d7065746572736f686e2f66343234643966623535373961393635303763373039613336643438376632342f7261772f343936333164333739623662363364613566313833383962346661363061643433363465373764352f726973656c61622d61742d75632d6265726b656c65792e6a7067">

<br><br><br><br><br><br><br><br><br>
<b><font size="7">Modin (Pandas on Ray)</font></b>

<h3>Accelerate your pandas workflows by changing one line of code</h3>

<br><br>

##### Devin Petersohn

# An anecdote to get started
My background: Genomics and Computational Biology Data Science
<br><br>

**Comments from a Data Scientist who runs production genomics workloads:** 
- "Data is too large to use in pandas (tens of GB to TB)"
- "I want to interact with my data"
- "I end up using Big Data tools to trim the data down and use pandas to analyze it"

<h1><center>"Why can't we use the same tools for Kilo- and Megabyte-scale data as we do for Terabyte-scale data (and vice versa)?"</center></h1>

<font color="navy"><h1>Current Data Science Landscape</h1></font>
<br><br>

![](mbTools.png)

<font color="navy"><h1>Current Data Science Landscape</h1></font>
<br><br>

![](tbTools.png)

<font color="navy"><h1>Current Data Science Landscape</h1></font>
<br><br>

![](toolsNoArrow.png)

<font color="navy"><h1>Current Data Science Landscape</h1></font>
<br><br>

![](toolsFinal.png)

<font color="navy"><h1>Current Data Science Landscape (1TB+)</h1></font>
<br><br>

- New Frameworks = New APIs
    - Many also support SQL - This is good!
    - However, many pandas operations are not covered by SQL (e.g. `iloc`)
- Expose distributed computing concepts to users
    - Partitioning and shuffling
    - Tuning crucial for performance
- Many are too heavyweight for good performance at smaller scale
- Optimized for batching, not necessarily optimized for 2D matrix operations
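
To make the `iloc` point concrete, here is a minimal sketch in plain pandas (the small frame is invented for illustration). Positional selection depends on row order, which a SQL table does not define unless you add an explicit ordering column.

```python
import numpy as np
import pandas as pd

# A small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("abcd"))

# Select rows 1-2 and every other column purely by position.
block = df.iloc[1:3, ::2]

# Select the last row by position, regardless of its index label.
last_row = df.iloc[-1]
```

Here `block` holds columns `a` and `c` for the second and third rows, chosen without referring to any label or key.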

<center><h1>What is it about extracting value from data that requires expertise in distributed computing?</h1></center>
<br><br><br><br>
<center><h3>Nothing!</h3></center>

<center><h1>With Modin, we are trying to bridge the gap between analytics at MBs and TBs+</h1></center>

# Modin (모든)

### Accelerate your pandas workflows by changing one line of code
<br><br>

In [1]:
```python
# import pandas as pd
import modin.pandas as pd
```

Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:25501 to respond...
Waiting for redis server at 127.0.0.1:17414 to respond...
Starting the Plasma object store with 27.00 GB memory.


<center><h4>This notebook was run on a 2013 iMac with 4 cores and 32GB RAM</h4></center>

In [2]:
```python
import numpy as np

# Build a 2D numpy array filled with random integers
frame_data = np.random.randint(0, 100, size=(2**16, 2**4))
# Put the new random data into a dataframe
df = pd.DataFrame(frame_data)
# Add a prefix to each column for simplicity
df = df.add_prefix("col_")
type(df)
```

modin.pandas.dataframe.DataFrame

In [3]:
```python
# Print the first 10 lines of the DataFrame
df.head(10)
```

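| | col_0 | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 | col_10 | col_11 | col_12 | col_13 | col_14 | col_15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 8 | 90 | 34 | 14 | 37 | 92 | 73 | 81 | 27 | 28 | 74 | 57 | 55 | 30 | 63 |
| 1 | 97 | 0 | 28 | 53 | 19 | 89 | 39 | 7 | 97 | 94 | 17 | 55 | 85 | 92 | 61 | 97 |
| 2 | 59 | 91 | 27 | 83 | 60 | 84 | 70 | 27 | 8 | 61 | 58 | 19 | 87 | 31 | 9 | 68 |
| 3 | 80 | 12 | 55 | 28 | 98 | 32 | 66 | 80 | 75 | 64 | 91 | 66 | 67 | 80 | 50 | 31 |
| 4 | 18 | 3 | 25 | 69 | 67 | 73 | 59 | 86 | 55 | 16 | 45 | 42 | 12 | 53 | 44 | 66 |
| 5 | 62 | 60 | 64 | 56 | 44 | 14 | 75 | 5 | 75 | 69 | 30 | 64 | 34 | 1 | 83 | 52 |
| 6 | 41 | 45 | 50 | 44 | 50 | 62 | 56 | 11 | 29 | 40 | 32 | 49 | 17 | 38 | 13 | 87 |
| 7 | 54 | 2 | 15 | 60 | 1 | 11 | 15 | 95 | 80 | 62 | 97 | 74 | 87 | 33 | 84 | 24 |
| 8 | 56 | 71 | 21 | 66 | 12 | 41 | 77 | 97 | 56 | 28 | 13 | 94 | 87 | 65 | 78 | 71 |
| 9 | 1 | 40 | 57 | 83 | 56 | 63 | 3 | 24 | 47 | 45 | 8 | 29 | 59 | 94 | 54 | 20 |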


<center><h1>Modin manages the data partitioning and shuffling, so you can focus on extracting value from your data</h1></center>
<br><br><br><br>
<center><h3>After all, isn't that the purpose of Data Science?</h3></center>

# <center><code>pd.read_csv</code></center>
<br>

In [4]:
```python
import pandas
```

#### pandas

In [5]:
```python
%%time
# pandas `read_csv`
pandas_csv_data = pandas.read_csv("800MB.csv")
```

CPU times: user 26.3 s, sys: 3.14 s, total: 29.4 s
Wall time: 29.5 s


#### Modin

In [6]:
```python
%%time
# Modin `read_csv`
csv_data = pd.read_csv("800MB.csv")
```

CPU times: user 76.7 ms, sys: 5.08 ms, total: 81.8 ms
Wall time: 7.6 s
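
Outside a notebook, the same comparison can be sketched with a small timing helper. `time_call` and the temporary CSV are hypothetical names for this illustration. The sketch uses plain pandas so that it runs anywhere; swapping the import for `modin.pandas` is the one-line change described earlier.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd  # swap for `import modin.pandas as pd` to compare


def time_call(func, *args, **kwargs):
    """Return (result, elapsed wall-clock seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Write a small random CSV to a temporary file (far smaller than 800 MB).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    data = np.random.randint(0, 100, size=(1000, 8))
    pd.DataFrame(data).add_prefix("col_").to_csv(path, index=False)

    # Time one read of the file we just wrote.
    csv_df, seconds = time_call(pd.read_csv, path)
```

One caveat when reading the notebook timings: `%%time` reports CPU time for the notebook's own process, so work done in separate worker processes would show up in wall time but not in CPU time. That would be consistent with Modin's 81.8 ms of CPU time against 7.6 s of wall time, though the source does not state this.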


# <center><code>df.groupby</code></center>
<br><br>

#### pandas

In [7]:
```python
%%time
# pandas `groupby`
_ = pandas_csv_data.groupby(by=pandas_csv_data.col_1).sum()
```

CPU times: user 5.98 s, sys: 1.77 s, total: 7.75 s
Wall time: 7.74 s


#### Modin

In [8]:
```python
%%time
# Modin `groupby`
results = csv_data.groupby(by=csv_data.col_1).sum()
```

CPU times: user 3.18 s, sys: 42.2 ms, total: 3.23 s
Wall time: 7.3 s
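
For readers unfamiliar with the call, here is a minimal sketch of what `groupby(...).sum()` computes, using a tiny made-up frame in plain pandas. The column names mirror the `col_` prefix used above, but the values are invented.

```python
import pandas as pd

# A tiny illustrative frame; col_1 plays the role of the grouping key.
df = pd.DataFrame({
    "col_0": [1, 2, 3, 4],
    "col_1": [0, 1, 0, 1],
    "col_2": [10, 20, 30, 40],
})

# Group rows by the values of col_1, then sum each column within each group.
sums = df.groupby(by=df.col_1).sum()
```

The result has one row per distinct value of `col_1`, which is why the benchmark output above has only 100 groups for keys drawn from 0 to 99.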


# <center><code>df.T</code></center>

#### Modin

In [9]:
```python
%%time
# Modin transpose
results_transpose = results.T
```

CPU times: user 36 µs, sys: 5 µs, total: 41 µs
Wall time: 44.1 µs


In [10]:
```python
results_transpose.tail()
```

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| col_251 | 510687 | 532665 | 531041 | 518312 | 514216 | 512767 | 517570 | 522186 | 512149 | 518341 | ... | 516180 | 516238 | 518686 | 522299 | 530929 | 518363 | 515769 | 519990 | 520853 | 516718 |
| col_252 | 506608 | 532956 | 527241 | 527219 | 517928 | 512462 | 518954 | 518623 | 510923 | 523518 | ... | 517377 | 513713 | 517771 | 529987 | 534633 | 513980 | 511530 | 518929 | 523025 | 514076 |
| col_253 | 507078 | 530762 | 535021 | 519887 | 520246 | 515980 | 516653 | 524330 | 510801 | 520746 | ... | 516373 | 521599 | 512115 | 528876 | 529613 | 513792 | 521189 | 515032 | 522286 | 515301 |
| col_254 | 506274 | 527389 | 534761 | 516021 | 517681 | 515188 | 517922 | 522157 | 514011 | 515284 | ... | 513980 | 516344 | 514599 | 529063 | 531735 | 515256 | 509626 | 518098 | 522041 | 527306 |
| col_255 | 511399 | 527227 | 531609 | 517430 | 521920 | 509612 | 518645 | 525048 | 510239 | 517014 | ... | 517766 | 519192 | 513554 | 532277 | 530755 | 515562 | 521128 | 519129 | 520482 | 520019 |


<center><h3>All partitioning and shuffling is handled for you!</h3></center>

<center><h1>How do we get this speedup?</h1></center>
<br><br>
<center><h3>Modern laptop</h3></center>

<center><img src="multicore_start.png"></center>

<center><h3>pandas on your laptop</h3></center>

<center><img src="pandas_multicore.png"></center>

<center><h3>Modin on your laptop</h3></center>

<center><img src="modin_multicore.png"></center>

<center><h3>pandas on a large machine</h3></center>

<center><img src="pandas_multicore_lots.png"></center>

<center><h3>Modin on a large machine</h3></center>

<center><img src="modin_multicore_lots.png"></center>
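
The idea behind these pictures can be sketched in a few lines: split the rows into partitions, compute on each partition concurrently, then combine the partial results. This is an illustration only, not Modin's implementation. `parallel_column_sums` is a hypothetical helper, and it uses a standard-library thread pool where Modin uses Ray.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def parallel_column_sums(df, n_partitions=4):
    """Sum each column by summing row partitions concurrently.

    Illustrative only: Modin schedules work with Ray, not this thread pool.
    """
    # Split the row positions into roughly equal contiguous chunks.
    bounds = np.linspace(0, len(df), n_partitions + 1, dtype=int)
    partitions = [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

    # Compute one partial sum per partition, one task per partition.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(lambda part: part.sum(), partitions))

    # Combine the partial results into the final per-column totals.
    return pd.concat(partials, axis=1).sum(axis=1)


frame = pd.DataFrame(np.random.randint(0, 100, size=(2**10, 2**4))).add_prefix("col_")
totals = parallel_column_sums(frame)
```

The caller never sees the partitions: `totals` matches what `frame.sum()` returns, which is the property the pandas-compatible API relies on.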

<center><h1>Performance of <code>read_csv</code> on various amounts of data using 144 cores</h1></center>
<br><br>
<center><img src="read_csv_plot.png"></center>

<center><h1>Modin is an early-stage, multi-process DataFrame library with an identical API to pandas.</h1></center>

<br><br><br><br>

<center><h3>Cluster support coming soon!</h3></center>

<center><h1>The pandas API is massive!</h1></center>
<br><br><br>

**`pd.DataFrame`**
- 280+ methods

**`pd.Series`**
- 280+ methods

**Other operations (`pd.concat`, etc.)**
- 40+ APIs


<center><h3>Where do you even start?</h3></center>

<center><h1>What do people use in the pandas API?</h1></center>

![](https://docs.google.com/spreadsheets/d/e/2PACX-1vSJAqz2lmMe2yxUEV1BDYYJcb7F_javeq1mwW_uoiqOi8WuXQBnDIBAOkeF_WJ9iOtxuJxgvr_8PzFv/pubchart?oid=108581991&format=image)

<center><h3>We can implement and optimize in the order of popularity!</h3></center>

# Implementing and optimizing the pandas API

### Rank-ordered by popularity

<br><br>

- Currently, we support 71.77% of the pandas API in `modin.pandas`
- This represents >93% of usage based on our study
- `pd.Series` is not yet distributed
    - Distributing it will help us optimize many operations (e.g. `df.groupby`)
- `pd.MultiIndex` has preliminary support
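
The gap between the two percentages comes from weighting by usage. A minimal sketch of the arithmetic, with invented method names and counts (these are not the figures from the study):

```python
# Hypothetical usage counts per method; these numbers are invented.
usage_counts = {
    "read_csv": 500,
    "head": 300,
    "groupby": 150,
    "cov": 30,
    "to_clipboard": 20,
}
# Hypothetical set of methods that are already implemented.
implemented = {"read_csv", "head", "groupby"}

# Share of the API surface that is implemented (each method counts once).
api_coverage = len(implemented) / len(usage_counts)

# Share of observed usage that is implemented (each call counts once).
covered_calls = sum(n for name, n in usage_counts.items() if name in implemented)
usage_coverage = covered_calls / sum(usage_counts.values())
```

With these made-up numbers, 60% of the methods account for 95% of the calls, which is the same shape of result as the coverage figures above.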

<br><br>

<center><h3>What about the rest?</h3></center>

<center><h1>Defaulting to pandas in Modin</h1></center>

<br><br>

In [11]:
```python
# Covariance not yet implemented, but you can still use it in Modin!
cov_df = df.cov()
```



In [12]:
```python
# Print the first 5 lines of the result. NOTE: This is a Modin DataFrame
cov_df.head()
```

| | col_0 | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 | col_10 | col_11 | col_12 | col_13 | col_14 | col_15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| col_0 | 831.706909 | 1.22218 | 5.27837 | 8.006473 | 3.134061 | -3.397139 | -2.293764 | 0.659789 | -3.398005 | 0.004688 | 4.408962 | 1.620187 | 4.296154 | 4.771234 | -2.450248 | 5.105487 |
| col_1 | 1.22218 | 830.09423 | -1.444885 | -1.991529 | 2.250838 | -1.946475 | 2.780284 | 3.171092 | 2.968332 | -0.836585 | 0.524496 | 4.90024 | 2.757136 | -3.748321 | -4.062521 | 0.605903 |
| col_2 | 5.27837 | -1.444885 | 835.962652 | -5.779689 | 4.582706 | 2.638338 | -0.810444 | -1.991004 | -3.114369 | -1.170409 | 6.523343 | -3.767481 | 3.899383 | 1.691931 | 2.586309 | 2.19262 |
| col_3 | 8.006473 | -1.991529 | -5.779689 | 836.230366 | -1.174508 | -4.259046 | 4.488693 | -1.133241 | 4.502229 | -0.827306 | 5.196258 | 3.731033 | -0.117715 | 4.618731 | -3.299195 | 4.648533 |
| col_4 | 3.134061 | 2.250838 | 4.582706 | -1.174508 | 838.977389 | 3.415512 | -6.104585 | 0.76851 | -1.915474 | 3.626733 | -3.178418 | 3.807477 | 2.452864 | -2.718751 | 3.946541 | 0.711834 |


<center><h1>Defaulting to pandas in Modin</h1></center>

<br><br>

![](convert_to_pandas.png)
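
The fallback pattern can be sketched with a toy class: gather the partitions into one pandas DataFrame, call the pandas method, and re-partition the result. `FallbackFrame` and its methods are hypothetical names for this illustration and are not Modin's internals.

```python
import numpy as np
import pandas as pd


class FallbackFrame:
    """Toy stand-in for a partitioned frame (hypothetical, not Modin's class)."""

    def __init__(self, partitions):
        # Row partitions held as ordinary pandas DataFrames.
        self.partitions = list(partitions)

    def to_pandas(self):
        # Gather every partition into a single pandas DataFrame.
        return pd.concat(self.partitions)

    @classmethod
    def from_pandas(cls, df, n_partitions=2):
        # Re-split a pandas DataFrame into contiguous row partitions.
        bounds = np.linspace(0, len(df), n_partitions + 1, dtype=int)
        return cls(df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:]))

    def __getattr__(self, name):
        # Reached only for attributes this class does not define itself.
        if name.startswith("__"):
            raise AttributeError(name)

        def default_to_pandas(*args, **kwargs):
            result = getattr(self.to_pandas(), name)(*args, **kwargs)
            # Wrap DataFrame results so the caller stays in the toy API.
            if isinstance(result, pd.DataFrame):
                return FallbackFrame.from_pandas(result)
            return result

        return default_to_pandas


frame = FallbackFrame.from_pandas(pd.DataFrame(np.random.rand(100, 3), columns=list("abc")))
cov = frame.cov()  # not defined on FallbackFrame, so pandas computes it
```

The trade-off is that the fallback path gives up parallelism and pays for the gather and re-split, so it buys API completeness rather than speed.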

<center><h1>Now you can do almost anything in Modin that you could normally do in pandas!</h1></center>

<br><br>



# Modin: Behind the Scenes

![](allNoBox.png)

# Modin: Behind the Scenes

![](pandasAPI.png)

# Modin: Behind the Scenes

![](queryCompiler.png)

# Modin: Behind the Scenes

![](partitionManager.png)

# Modin: Behind the Scenes

![](partitions.png)

# Modin: Behind the Scenes

![](all.png)

# What is Ray?

# Ray: A task-parallel, low-latency execution framework

<br><br><br><br>

![](ray_architecture_diagram.jpg)

# Ray: A system for parallel and distributed Python that unifies the ML ecosystem.

<br><br><br>

![](ray_overview.jpg)

# Ray

- Ray is open source at https://github.com/ray-project/ray
- Visit Ray's documentation for more: http://ray.readthedocs.io/en/latest/index.html

<center><h1>Demo!</h1></center>

# Conclusion

### Modin: Accelerate your pandas workflows by changing one line of code

<br><br>

- API is identical to pandas
- No distributed computing knowledge required!
- Brief architecture overview
- Introduction to Ray

# Modin moving forward

- Memory management
    - Out-of-core execution for DataFrames that exceed memory
- Query planning (exciting research to be done here!)
- Better partitioning planning
- More API coverage
- Distributed `pd.Series`
- Currently supports the most recent stable release of the pandas API (0.23.4)
    - Continue supporting pandas API changes in the future
- Modin is only 8 months old!

<br><br>

# Interested in contributing? We'd love to have you!

- Email the mailing list (found in the documentation)
- Pick an issue!
- Come find me @ PyData!

# Modin (모든)

### Accelerate your pandas workflows by changing one line of code
<br>

```python
# import pandas as pd
import modin.pandas as pd
```

### Modin is open source at https://github.com/modin-project/modin
### Install with `pip install modin`
### Documentation: https://modin.readthedocs.io/en/latest/

<br>

##### Devin Petersohn: devin@eecs.berkeley.edu
<center><h1>Thank you</h1></center>