<a href="https://colab.research.google.com/github/drshahizan/Python_Tutorial/blob/main/big%20data/modin/lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![LOGO](https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/img/MODIN_ver2_hrz.png?raw=True)

<center><h2>Scale your pandas workflows by changing one line of code</h2></center>


# Lab 2: Speed improvements

**GOAL**: Learn about common functionality that Modin speeds up by using all of your machine's cores.

## Concept for exercise: `read_csv` speedups

CSV files are the most commonly used data ingestion format in pandas (link to pandas survey). This exercise gives an idea of the kinds of speedups possible, even on a non-distributed filesystem. Modin also supports parallel and distributed reads of other file formats, which are listed in the documentation.

![](https://raw.githubusercontent.com/modin-project/modin/ff477202978de7649b40559469e18338763d4efc/examples/tutorial/jupyter/img/read_csv_perf.png)

We will import both Modin and pandas so that the speedups are evident.

**Note: Rerunning the `read_csv` cells many times may result in degraded performance, depending on the memory of the machine**
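
Every comparison in this lab follows the same pattern: record a start time, run one operation, record the end time, and divide the pandas duration by the Modin duration. A minimal sketch of that pattern as reusable helpers is shown here. The names `timed` and `speedup` are hypothetical and not part of the notebook, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock, suited to short intervals
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate ran than the baseline."""
    return round(baseline_seconds / candidate_seconds, 2)


# Toy usage with a built-in so the sketch runs without any dataset.
total, seconds = timed(sum, range(1_000_000))
print("Time to sum: {} seconds".format(round(seconds, 3)))
```

In the cells of this lab, `timed(pandas.read_csv, path)` would stand in for the explicit `start`/`end` bookkeeping.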

In [None]:
!pip install modin[all] 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting modin[all]
  Downloading modin-0.18.0-py3-none-any.whl (970 kB)
Collecting pandas==1.5.2
  Downloading pandas-1.5.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Collecting boto3
  Downloading boto3-1.26.34-py3-none-any.whl (132 kB)
Collecting modin-spreadsheet>=0.1.0
  Downloading modin_spreadsheet-0.1.2-py2.py3-none-any.whl (1.8 MB)
Collecting ray[default]>=1.13.0
  Downloading ray-2.2.0-cp38-cp38-manylinux2014_x86_64.whl (57.4 MB)
Collecting unidist[mpi]>=0.2.1
  Downloading unidist-0.2.1-py3-none-any.whl (102 kB)

In [None]:

import modin.pandas as pd
import pandas
import time
from IPython.display import Markdown, display

def printmd(string):
    display(Markdown(string))

### Dataset: 2015 NYC taxi trip data

We will be using a version of this data already in S3, originally posted in this blog post: https://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes

**Size: ~1.8GB**

In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/My Drive/Colab Notebooks/dataset/

/content/drive/My Drive/Colab Notebooks/dataset


In [None]:
df = pd.read_csv("yellow_tripdata_2015-01.csv")


    import ray
    ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})

2022-12-21 04:58:35,395	INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265


In [None]:
path = "yellow_tripdata_2015-01.csv"

**Optional:** The dataset takes a while to download. To speed things up, you can download the file once and set `path` to the local copy.
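
A minimal sketch of a download-once helper, using only the standard library. The function name `download_once` is hypothetical, and the dataset URL is deliberately left as a placeholder because the notebook does not state it.

```python
import os
import urllib.request


def download_once(url, local_path):
    """Fetch url to local_path unless the file is already there; return local_path."""
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path


# Hypothetical usage; substitute the real dataset location for the placeholder:
# path = download_once("<dataset URL>", "yellow_tripdata_2015-01.csv")
```

Because the helper checks for the local file first, rerunning the cell does not trigger another download.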

## `pandas.read_csv`

In [None]:
start = time.time()

pandas_df = pandas.read_csv(path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)

end = time.time()
pandas_duration = end - start
print("Time to read with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to read with pandas: 6.631 seconds


### Expect pandas to take >3 minutes on EC2, longer locally

This is a good time to chat with your neighbor.

Discussion topics:
- Do you work with a large amount of data daily?
- How big is your data?
- What’s the common use case of your data?
- Do you use any big data analytics tools?
- Do you use any interactive analytics tools?
- What are some drawbacks of your current interactive analytics tools?

## `modin.pandas.read_csv`

In [None]:
start = time.time()

modin_df = pd.read_csv(path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)

end = time.time()
modin_duration = end - start
print("Time to read with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `read_csv`!".format(round(pandas_duration / modin_duration, 2)))

Time to read with Modin: 5.308 seconds


### Modin is 1.25x faster than pandas at `read_csv`!

## Are they equal?
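
Displaying both frames allows a visual comparison; a programmatic check is stricter. The sketch below uses two small plain pandas frames with made-up values so that it stays self-contained. It assumes that a Modin frame would first be converted to pandas before being compared this way.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Illustrative data only; column names are borrowed from the taxi dataset.
left = pd.DataFrame({"VendorID": [1, 2], "tip_amount": [1.4, None]})
right = left.copy()

# DataFrame.equals treats NaNs in the same positions as equal.
frames_match = left.equals(right)

# assert_frame_equal raises an AssertionError describing the first mismatch.
assert_frame_equal(left, right)

# Changing a single cell is enough to make the frames differ.
right.loc[0, "tip_amount"] = 9.9
frames_differ = not left.equals(right)

print(frames_match, frames_differ)
```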

In [None]:
pandas_df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,...,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,0,0,1,2015-01-01 00:11:33,2015-01-01 00:16:48,1,1.00,1,N,41,...,1,5.7,0.5,0.5,1.40,0.0,0.0,8.40,,
1,1,1,1,2015-01-01 00:18:24,2015-01-01 00:24:20,1,0.90,1,N,166,...,3,6.0,0.5,0.5,0.00,0.0,0.0,7.30,,
2,2,2,1,2015-01-01 00:26:19,2015-01-01 00:41:06,1,3.50,1,N,238,...,1,13.2,0.5,0.5,2.90,0.0,0.0,17.40,,
3,3,3,1,2015-01-01 00:45:26,2015-01-01 00:53:20,1,2.10,1,N,162,...,1,8.2,0.5,0.5,2.37,0.0,0.0,11.87,,
4,4,4,1,2015-01-01 00:59:21,2015-01-01 01:05:24,1,1.00,1,N,236,...,3,6.0,0.5,0.5,0.00,0.0,0.0,7.30,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1933502,1933502,1933502,2,2015-01-06 10:28:20,2015-01-06 10:40:50,1,1.13,1,N,211,...,2,8.5,0.0,0.5,0.00,0.0,0.3,9.30,,
1933503,1933503,1933503,2,2015-01-06 10:54:19,2015-01-06 11:21:37,1,5.39,1,N,87,...,1,20.0,0.0,0.5,4.00,0.0,0.3,24.80,,
1933504,1933504,1933504,2,2015-01-06 10:19:54,2015-01-06 10:24:05,2,0.83,1,N,239,...,1,5.0,0.0,0.5,1.00,0.0,0.3,6.80,,
1933505,1933505,1933505,2,2015-01-06 10:26:46,2015-01-06 10:42:05,3,2.29,1,N,143,...,1,11.0,0.0,0.5,2.20,0.0,0.3,14.00,,


In [None]:
modin_df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,...,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,0,0,1,2015-01-01 00:11:33,2015-01-01 00:16:48,1,1.00,1,N,41,...,1,5.7,0.5,0.5,1.40,0.0,0.0,8.40,,
1,1,1,1,2015-01-01 00:18:24,2015-01-01 00:24:20,1,0.90,1,N,166,...,3,6.0,0.5,0.5,0.00,0.0,0.0,7.30,,
2,2,2,1,2015-01-01 00:26:19,2015-01-01 00:41:06,1,3.50,1,N,238,...,1,13.2,0.5,0.5,2.90,0.0,0.0,17.40,,
3,3,3,1,2015-01-01 00:45:26,2015-01-01 00:53:20,1,2.10,1,N,162,...,1,8.2,0.5,0.5,2.37,0.0,0.0,11.87,,
4,4,4,1,2015-01-01 00:59:21,2015-01-01 01:05:24,1,1.00,1,N,236,...,3,6.0,0.5,0.5,0.00,0.0,0.0,7.30,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1933502,1933502,1933502,2,2015-01-06 10:28:20,2015-01-06 10:40:50,1,1.13,1,N,211,...,2,8.5,0.0,0.5,0.00,0.0,0.3,9.30,,
1933503,1933503,1933503,2,2015-01-06 10:54:19,2015-01-06 11:21:37,1,5.39,1,N,87,...,1,20.0,0.0,0.5,4.00,0.0,0.3,24.80,,
1933504,1933504,1933504,2,2015-01-06 10:19:54,2015-01-06 10:24:05,2,0.83,1,N,239,...,1,5.0,0.0,0.5,1.00,0.0,0.3,6.80,,
1933505,1933505,1933505,2,2015-01-06 10:26:46,2015-01-06 10:42:05,3,2.29,1,N,143,...,1,11.0,0.0,0.5,2.20,0.0,0.3,14.00,,


## Concept for exercise: Reduces

In pandas, a reduce is an operation such as `sum` or `count` that collapses each row or column into a single summary value. We will be using `count`, which returns the number of non-null entries per column.
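
To make the idea concrete before timing it on the full dataset, here is a small sketch on a toy frame. The values are made up for illustration; only the column names echo the taxi data.

```python
import pandas as pd

# Three rows, one missing fare.
toy = pd.DataFrame({"fare_amount": [5.7, 6.0, None], "tip_amount": [1.4, 0.0, 2.9]})

# Reduce over rows: one value per column.
per_column_count = toy.count()        # non-null entries in each column
per_column_sum = toy.sum()            # NaN is skipped by default

# Reduce over columns: one value per row.
per_row_count = toy.count(axis=1)

print(per_column_count)
print(per_row_count)
```

Either way, the result has one dimension fewer than the input: a frame becomes a Series.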

In [None]:
start = time.time()

pandas_count = pandas_df.count()

end = time.time()
pandas_duration = end - start

print("Time to count with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to count with pandas: 0.386 seconds


In [None]:
start = time.time()

modin_count = modin_df.count()

end = time.time()
modin_duration = end - start
print("Time to count with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `count`!".format(round(pandas_duration / modin_duration, 2)))

Time to count with Modin: 0.043 seconds


### Modin is 9.05x faster than pandas at `count`!

## Are they equal?

In [None]:
pandas_count

Unnamed: 0.1             1933507
Unnamed: 0               1933507
VendorID                 1933507
tpep_pickup_datetime     1933507
tpep_dropoff_datetime    1933507
passenger_count          1933507
trip_distance            1933507
RatecodeID               1933507
store_and_fwd_flag       1933507
PULocationID             1933507
DOLocationID             1933507
payment_type             1933507
fare_amount              1933507
extra                    1933507
mta_tax                  1933507
tip_amount               1933507
tolls_amount             1933507
improvement_surcharge    1933504
total_amount             1933507
congestion_surcharge           0
airport_fee                    0
dtype: int64

In [None]:
modin_count

Unnamed: 0.1             1933507
Unnamed: 0               1933507
VendorID                 1933507
tpep_pickup_datetime     1933507
tpep_dropoff_datetime    1933507
passenger_count          1933507
trip_distance            1933507
RatecodeID               1933507
store_and_fwd_flag       1933507
PULocationID             1933507
DOLocationID             1933507
payment_type             1933507
fare_amount              1933507
extra                    1933507
mta_tax                  1933507
tip_amount               1933507
tolls_amount             1933507
improvement_surcharge    1933504
total_amount             1933507
congestion_surcharge           0
airport_fee                    0
dtype: int64

## Concept for exercise: Map operations

In pandas, map operations make a single pass over the data and return a result with the same shape as the input. `isnull` and `applymap` (renamed `DataFrame.map` in pandas 2.1) are examples. We will be using `isnull`.
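
A small sketch of the shape-preserving property, using a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({"fare_amount": [5.7, None], "airport_fee": [None, None]})

# isnull visits every cell once and returns a boolean frame of the same shape.
mask = toy.isnull()

print(mask)
print(mask.shape == toy.shape)
```

Each output cell depends only on the matching input cell, which is what lets a map run on separate partitions of the data independently.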

In [None]:
start = time.time()

pandas_isnull = pandas_df.isnull()

end = time.time()
pandas_duration = end - start

print("Time to isnull with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to isnull with pandas: 0.308 seconds


In [None]:
start = time.time()

modin_isnull = modin_df.isnull()

end = time.time()
modin_duration = end - start
print("Time to isnull with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `isnull`!".format(round(pandas_duration / modin_duration, 2)))

Time to isnull with Modin: 0.014 seconds


### Modin is 21.66x faster than pandas at `isnull`!

## Are they equal?

In [None]:
pandas_isnull

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,...,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1933502,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
1933503,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
1933504,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
1933505,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True


In [None]:
modin_isnull

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,...,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1933502,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
1933503,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
1933504,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
1933505,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True


## Concept for exercise: Apply over a single column

Sometimes we want to apply a function to every value in a single column of our dataset. Here we round each `trip_distance` to the nearest integer with Python's built-in `round`.
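
As a quick sketch of what the timed cells compute, the first four `trip_distance` values from the dataset preview can be rounded on their own:

```python
import pandas as pd

# First four trip distances shown in the dataframe preview.
distances = pd.Series([1.00, 0.90, 3.50, 2.10], name="trip_distance")

# Series.apply calls the function once per element.
rounded = distances.apply(round)

print(rounded.tolist())
```

The result keeps the index and name of the original column, which matters when it is later assigned back to the frame.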

In [None]:
start = time.time()
rounded_trip_distance_pandas = pandas_df["trip_distance"].apply(round)

end = time.time()
pandas_duration = end - start
print("Time to apply on one column with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to apply on one column with pandas: 0.752 seconds


In [None]:
start = time.time()

rounded_trip_distance_modin = modin_df["trip_distance"].apply(round)

end = time.time()
modin_duration = end - start
print("Time to apply on one column with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `apply` on one column!".format(round(pandas_duration / modin_duration, 2)))

Time to apply on one column with Modin: 0.075 seconds


### Modin is 10.04x faster than pandas at `apply` on one column!

## Are they equal?

In [None]:
rounded_trip_distance_pandas

0          1
1          1
2          4
3          2
4          1
          ..
1933502    1
1933503    5
1933504    1
1933505    2
1933506    2
Name: trip_distance, Length: 1933507, dtype: int64

In [None]:
rounded_trip_distance_modin

0          1
1          1
2          4
3          2
4          1
          ..
1933502    1
1933503    5
1933504    1
1933505    2
1933506    2
Name: trip_distance, Length: 1933507, dtype: int64

## Concept for exercise: Add a column

It is common to need to add a new column to an existing dataframe. Here we show that Modin does this faster, thanks to its metadata management and an efficient zero-copy implementation.
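
One detail worth keeping in mind when assigning a Series as a new column is that pandas matches values by index label, not by position. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({"trip_distance": [1.0, 0.9, 3.5]})

# The Series is deliberately listed in reverse index order.
rounded = pd.Series([4, 1, 1], index=[2, 1, 0])

# Assignment aligns on the index, so label 2 receives 4 and label 0 receives 1.
toy["rounded_trip_distance"] = rounded

print(toy)
```

In the cells of this lab the Series comes from the same frame, so the indexes already match and no reordering takes place.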

In [None]:
start = time.time()
pandas_df["rounded_trip_distance"] = rounded_trip_distance_pandas

end = time.time()
pandas_duration = end - start
print("Time to add a column with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to add a column with pandas: 0.008 seconds


In [None]:
start = time.time()

modin_df["rounded_trip_distance"] = rounded_trip_distance_modin

end = time.time()
modin_duration = end - start
print("Time to add a column with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at adding a column!".format(round(pandas_duration / modin_duration, 2)))

Time to add a column with Modin: 0.004 seconds


### Modin is 2.09x faster than pandas at adding a column!

## Are they equal?

In [None]:
pandas_df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,...,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,rounded_trip_distance
0,0,0,1,2015-01-01 00:11:33,2015-01-01 00:16:48,1,1.00,1,N,41,...,5.7,0.5,0.5,1.40,0.0,0.0,8.40,,,1
1,1,1,1,2015-01-01 00:18:24,2015-01-01 00:24:20,1,0.90,1,N,166,...,6.0,0.5,0.5,0.00,0.0,0.0,7.30,,,1
2,2,2,1,2015-01-01 00:26:19,2015-01-01 00:41:06,1,3.50,1,N,238,...,13.2,0.5,0.5,2.90,0.0,0.0,17.40,,,4
3,3,3,1,2015-01-01 00:45:26,2015-01-01 00:53:20,1,2.10,1,N,162,...,8.2,0.5,0.5,2.37,0.0,0.0,11.87,,,2
4,4,4,1,2015-01-01 00:59:21,2015-01-01 01:05:24,1,1.00,1,N,236,...,6.0,0.5,0.5,0.00,0.0,0.0,7.30,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1933502,1933502,1933502,2,2015-01-06 10:28:20,2015-01-06 10:40:50,1,1.13,1,N,211,...,8.5,0.0,0.5,0.00,0.0,0.3,9.30,,,1
1933503,1933503,1933503,2,2015-01-06 10:54:19,2015-01-06 11:21:37,1,5.39,1,N,87,...,20.0,0.0,0.5,4.00,0.0,0.3,24.80,,,5
1933504,1933504,1933504,2,2015-01-06 10:19:54,2015-01-06 10:24:05,2,0.83,1,N,239,...,5.0,0.0,0.5,1.00,0.0,0.3,6.80,,,1
1933505,1933505,1933505,2,2015-01-06 10:26:46,2015-01-06 10:42:05,3,2.29,1,N,143,...,11.0,0.0,0.5,2.20,0.0,0.3,14.00,,,2


In [None]:
modin_df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,...,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,rounded_trip_distance
0,0,0,1,2015-01-01 00:11:33,2015-01-01 00:16:48,1,1.00,1,N,41,...,5.7,0.5,0.5,1.40,0.0,0.0,8.40,,,1
1,1,1,1,2015-01-01 00:18:24,2015-01-01 00:24:20,1,0.90,1,N,166,...,6.0,0.5,0.5,0.00,0.0,0.0,7.30,,,1
2,2,2,1,2015-01-01 00:26:19,2015-01-01 00:41:06,1,3.50,1,N,238,...,13.2,0.5,0.5,2.90,0.0,0.0,17.40,,,4
3,3,3,1,2015-01-01 00:45:26,2015-01-01 00:53:20,1,2.10,1,N,162,...,8.2,0.5,0.5,2.37,0.0,0.0,11.87,,,2
4,4,4,1,2015-01-01 00:59:21,2015-01-01 01:05:24,1,1.00,1,N,236,...,6.0,0.5,0.5,0.00,0.0,0.0,7.30,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1933502,1933502,1933502,2,2015-01-06 10:28:20,2015-01-06 10:40:50,1,1.13,1,N,211,...,8.5,0.0,0.5,0.00,0.0,0.3,9.30,,,1
1933503,1933503,1933503,2,2015-01-06 10:54:19,2015-01-06 11:21:37,1,5.39,1,N,87,...,20.0,0.0,0.5,4.00,0.0,0.3,24.80,,,5
1933504,1933504,1933504,2,2015-01-06 10:19:54,2015-01-06 10:24:05,2,0.83,1,N,239,...,5.0,0.0,0.5,1.00,0.0,0.3,6.80,,,1
1933505,1933505,1933505,2,2015-01-06 10:26:46,2015-01-06 10:42:05,3,2.29,1,N,143,...,11.0,0.0,0.5,2.20,0.0,0.3,14.00,,,2
