<a href="https://colab.research.google.com/github/drshahizan/Python_Tutorial/blob/main/big%20data/modin/lab_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![LOGO](https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/img/MODIN_ver2_hrz.png?raw=True)

<center><h2>Scale your pandas workflows by changing one line of code</h2>


# Lab 1: How to use Modin

**GOAL**: Learn how to import Modin to accelerate and scale pandas workflows.

Modin is a drop-in replacement for pandas that distributes the computation 
across all of the cores in your machine or in a cluster.
In practical terms, this means that you can continue using the same pandas scripts
as before and expect the behavior and results to be the same. The only thing that needs
to change is the import statement. Normally, you would change:

```python
import pandas as pd
```

to:

```python
import modin.pandas as pd
```

Changing this line of code will allow you to use all of the cores in your machine to do computation on your data. One of the major performance bottlenecks of pandas is that it only uses a single core for any given computation. Modin exposes an API that is identical to pandas, allowing you to continue interacting with your data as you would with pandas. There are no additional commands required to use Modin locally. Partitioning, scheduling, data transfer, and other related concerns are all handled by Modin under the hood.
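
To make "partitioning" and "scheduling" more concrete, here is a minimal conceptual sketch of the split, apply, and combine pattern using only the Python standard library. It is not Modin's implementation: the helper names (`split_rows`, `column_sums`, `parallel_column_sums`) are hypothetical, and threads are used only to keep the example short, so it illustrates the structure of the work rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def split_rows(rows, n_parts):
    """Split a list of rows into at most n_parts contiguous chunks."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def column_sums(chunk, n_cols):
    """Sum each column within a single chunk (the per-partition work)."""
    return [sum(row[c] for row in chunk) for c in range(n_cols)]


def parallel_column_sums(rows, n_workers=None):
    """Reduce every chunk independently, then combine the partial results."""
    n_workers = n_workers or os.cpu_count() or 1
    n_cols = len(rows[0])  # assumes a non-empty, rectangular table
    chunks = split_rows(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda ch: column_sums(ch, n_cols), chunks))
    return [sum(p[c] for p in partials) for c in range(n_cols)]


# A toy 1000 x 4 table where cell (r, c) holds r + c.
rows = [[r + c for c in range(4)] for r in range(1000)]
totals = parallel_column_sums(rows, n_workers=4)
print(totals)
```

With Modin, none of this bookkeeping appears in user code; the same column sums would simply be `df.sum()`.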

<p style="text-align:left;">
        <h1>pandas on a multicore laptop
    <span style="float:right;">
        Modin on a multicore laptop
    </span>

<div>
<img align="left" src="https://raw.githubusercontent.com/modin-project/modin/ff477202978de7649b40559469e18338763d4efc/examples/tutorial/jupyter/img/pandas_multicore.png"><img src="https://raw.githubusercontent.com/modin-project/modin/ff477202978de7649b40559469e18338763d4efc/examples/tutorial/jupyter/img/modin_multicore.png">
</div>

### Concept for Exercise: DataFrame Constructor

Often when playing around in pandas, it is useful to create a DataFrame with the constructor. That is where we will start.

```python
import numpy as np
import pandas as pd

frame_data = np.random.randint(0, 100, size=(2**10, 2**5))
df = pd.DataFrame(frame_data)
```

When you create a DataFrame from a non-distributed object such as a NumPy array, Modin needs extra time to partition the data. While this is happening, you will see this message:

```
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
```
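
If this message becomes noisy, Python's standard `warnings` module can filter it by its text. The sketch below is only an illustration: `simulate_distribution_warning` is a hypothetical stand-in that emits the same message, so the example runs without Modin installed. In a real session you would add the `filterwarnings` call before constructing the DataFrame.

```python
import warnings


def simulate_distribution_warning():
    """Hypothetical stand-in that emits the message shown above."""
    warnings.warn(
        "Distributing <class 'numpy.ndarray'> object. This may take some time.",
        UserWarning,
    )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The message pattern is a regex matched against the start of the warning text.
    warnings.filterwarnings(
        "ignore", message="Distributing .* object", category=UserWarning
    )
    simulate_distribution_warning()                    # filtered out
    warnings.warn("some other warning", UserWarning)   # still recorded

remaining = [str(w.message) for w in caught]
print(remaining)
```

Filtering on the message text keeps other `UserWarning`s visible, which is usually safer than silencing the whole category.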

In [None]:
!pip install "modin[all]"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting modin[all]
  Downloading modin-0.18.0-py3-none-any.whl (970 kB)
Collecting pandas==1.5.2
  Downloading pandas-1.5.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Collecting unidist[mpi]>=0.2.1
  Downloading unidist-0.2.1-py3-none-any.whl (102 kB)
Collecting ray[default]>=1.13.0
  Downloading ray-2.2.0-cp38-cp38-manylinux2014_x86_64.whl (57.4 MB)
Collecting modin-spreadsheet>=0.1.0
  Downloading modin_spreadsheet-0.1.2-py2.py3-none-any.whl (1.8 MB)
Collecting rpyc==4.1.5
  Downloading rpyc-4.1.5-py3-none-any.whl (68 kB)

In [None]:
# Note: Do not change this code!
import numpy as np
import pandas
import sys
import modin

In [None]:
pandas.__version__

'1.3.5'

In [None]:
modin.__version__

'0.18.0'
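
The `__version__` attribute reports what the running kernel has imported, which can differ from what is installed on disk if a package was upgraded after the kernel started. That is one possible explanation, not a confirmed one, for a reported pandas version that differs from the one pip installed. As a hedged sketch, the standard-library `importlib.metadata` module can report the installed versions for comparison; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Compare these values with pandas.__version__ and modin.__version__.
versions = {name: installed_version(name) for name in ("pandas", "modin")}
print(versions)
```

If the two sources disagree, restarting the runtime and re-running the import cell typically brings them back in sync.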

In [None]:
# Implement your answer here. You are also free to play with the size
# and shape of the DataFrame, but beware of exceeding your memory!

import modin.pandas as pd

frame_data = np.random.randint(0, 100, size=(2**10, 2**5))
df = pd.DataFrame(frame_data)

# ***** Do not change the code below! It verifies that
# ***** the exercise has been done correctly. *****

try:
    assert df is not None
    assert frame_data is not None
    assert isinstance(frame_data, np.ndarray)
except:
    raise AssertionError("Don't change too much of the original code!")
assert "modin.pandas" in sys.modules, "Not quite correct. Remember the single line of code change (See above)"

import modin.pandas
assert pd == modin.pandas, "Remember the single line of code change (See above)"
assert hasattr(df, "_query_compiler"), "Make sure that `df` is a modin.pandas DataFrame."

print("Success! You only need to change one line of code!")

Success! You only need to change one line of code!

Now that we have created a toy example for playing around with the DataFrame, let's print it out in different ways.

### Concept for Exercise: Data Interaction and Printing

When interacting with data, it is very important to look at different parts of the data (e.g. `df.head()`). Here we will show that you can print a modin.pandas DataFrame in the same ways you would print a pandas DataFrame.

In [None]:
# Print the first 10 rows.
df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,83,79,71,19,37,76,8,22,39,55,...,45,68,16,60,46,43,13,84,57,77
1,95,32,85,83,2,8,89,16,69,51,...,97,85,0,29,96,47,44,59,73,38
2,54,91,56,75,50,17,71,72,5,60,...,36,15,43,80,41,29,93,14,81,81
3,84,4,73,70,96,58,79,70,54,47,...,6,94,16,84,33,40,31,64,2,96
4,65,1,77,57,98,12,29,2,91,61,...,57,16,29,44,13,71,8,51,24,58
5,73,43,47,54,82,37,71,71,92,57,...,6,62,40,88,68,12,93,91,81,61
6,3,4,43,0,56,52,23,32,99,89,...,48,45,60,9,64,63,50,85,2,51
7,85,78,24,11,20,34,98,48,42,63,...,78,65,82,91,7,91,43,22,34,95
8,72,39,72,67,74,84,79,7,18,56,...,2,23,71,73,96,89,15,32,21,61
9,38,82,50,42,84,20,62,71,0,1,...,32,58,72,2,39,42,57,24,79,62


In [None]:
# Print the DataFrame.
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,83,79,71,19,37,76,8,22,39,55,...,45,68,16,60,46,43,13,84,57,77
1,95,32,85,83,2,8,89,16,69,51,...,97,85,0,29,96,47,44,59,73,38
2,54,91,56,75,50,17,71,72,5,60,...,36,15,43,80,41,29,93,14,81,81
3,84,4,73,70,96,58,79,70,54,47,...,6,94,16,84,33,40,31,64,2,96
4,65,1,77,57,98,12,29,2,91,61,...,57,16,29,44,13,71,8,51,24,58
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1019,90,71,42,46,99,18,45,59,70,18,...,58,0,38,57,88,15,70,87,34,11
1020,44,60,71,71,21,51,48,43,26,93,...,70,10,63,37,16,79,3,47,34,14
1021,74,88,1,23,96,15,4,50,21,23,...,57,55,42,90,7,72,92,50,15,28
1022,11,36,25,59,50,31,30,79,39,68,...,42,10,29,79,49,24,51,51,80,51


In [None]:
# Free cell for custom interaction (Play around here!)
df.add_prefix("col")

Unnamed: 0,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,...,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31
0,83,79,71,19,37,76,8,22,39,55,...,45,68,16,60,46,43,13,84,57,77
1,95,32,85,83,2,8,89,16,69,51,...,97,85,0,29,96,47,44,59,73,38
2,54,91,56,75,50,17,71,72,5,60,...,36,15,43,80,41,29,93,14,81,81
3,84,4,73,70,96,58,79,70,54,47,...,6,94,16,84,33,40,31,64,2,96
4,65,1,77,57,98,12,29,2,91,61,...,57,16,29,44,13,71,8,51,24,58
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1019,90,71,42,46,99,18,45,59,70,18,...,58,0,38,57,88,15,70,87,34,11
1020,44,60,71,71,21,51,48,43,26,93,...,70,10,63,37,16,79,3,47,34,14
1021,74,88,1,23,96,15,4,50,21,23,...,57,55,42,90,7,72,92,50,15,28
1022,11,36,25,59,50,31,30,79,39,68,...,42,10,29,79,49,24,51,51,80,51


In [None]:
df.count()

0     1024
1     1024
2     1024
3     1024
4     1024
5     1024
6     1024
7     1024
8     1024
9     1024
10    1024
11    1024
12    1024
13    1024
14    1024
15    1024
16    1024
17    1024
18    1024
19    1024
20    1024
21    1024
22    1024
23    1024
24    1024
25    1024
26    1024
27    1024
28    1024
29    1024
30    1024
31    1024
dtype: int64