<a href="https://colab.research.google.com/github/cagBRT/PerformanceEnhancement/blob/main/Pandas_Performance_Enhancement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook covers some techniques for speeding up Pandas operations on large datasets (over 1 million rows).

In [1]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/Intro-to-Pandas.git cloned-repo
%cd cloned-repo
!ls

Cloning into 'cloned-repo'...
remote: Enumerating objects: 243, done.
remote: Counting objects: 100% (98/98), done.
remote: Compressing objects: 100% (98/98), done.
remote: Total 243 (delta 60), reused 0 (delta 0), pack-reused 145
Receiving objects: 100% (243/243), 6.35 MiB | 5.50 MiB/s, done.
Resolving deltas: 100% (130/130), done.
/content/cloned-repo
 adult.csv	        P2_Pandas_Data_Prep.ipynb	       titanic.csv
 deniro.csv	        P3_Pandas.ipynb			       train.csv
 letter_frequency.csv  'Pandas(1).png'			       trainTitanic.csv
 oscarNoErrors.csv      Pandas_Performance_Enhancement.ipynb   UFO.csv
 oscar_winners.csv      README.md			       xP1_Intro_to_Pandas.ipynb
 P1_Pandas_ES.ipynb     test.csv
 P1_Pandas.ipynb        testTitanic.csv


1. Download the dataset here:

In [2]:
#https://www.kaggle.com/competitions/tabular-playground-series-sep-2021/rules

2. It will be in zip format. <br>

In [3]:
import zipfile

In [4]:
!unzip "/content/tabular-playground-series-sep-2021 (3).zip"

unzip:  cannot find or open /content/tabular-playground-series-sep-2021 (3).zip, /content/tabular-playground-series-sep-2021 (3).zip.zip or /content/tabular-playground-series-sep-2021 (3).zip.ZIP.


3. Upload the zipped file:<br>
>tabular-playground-series-sep-2021.zip


4. Wait for the upload to finish. <br>
This can take about 25 minutes.

You can continue with the notebook in the meantime; this file is not needed for a little while.

In [5]:
import pandas as pd
import numpy as np

## Use replace to replace specific values

While speed is the first benefit of replace, the second is its flexibility.

We can replace all question marks with NaN in a single call, an operation that would take multiple calls with index-based replacement.

Nested replacement helps when you want to change values only in specific columns. Later in this section, we use it to replace values in just the education and income columns.
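The two ideas above can be tried on a tiny frame before touching the real data. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and only mimic a few columns of the adult income dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names only mimic the adult income dataset.
toy = pd.DataFrame(
    {
        "workclass": ["Private", "?", "State-gov"],
        "education": ["HS-grad", "12th", "Bachelors"],
        "gender": ["Male", "Female", "Male"],
    }
)

# One call turns every "?" in the whole frame into NaN.
toy = toy.replace("?", np.nan)

# Nested dictionary: outer keys are column names, inner dicts map old -> new.
toy = toy.replace(
    {
        "education": {"HS-grad": "High school", "12th": "High school"},
        "gender": {"Male": "M", "Female": "F"},
    }
)
print(toy)
```

Because the outer keys name columns, a value such as "12th" appearing in any other column would be left alone.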



In [6]:
s = pd.Series([1, 2, 3, 4, 5])
s.replace(1, 5)

0    5
1    2
2    3
3    4
4    5
dtype: int64

In [7]:
# Note: the method argument of replace is deprecated in recent pandas releases (2.1+).
s.replace([1, 2], method='bfill')

0    3
1    3
2    3
3    4
4    5
dtype: int64

In [8]:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df.replace(0, 5)

   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e


In [9]:
df.replace([0, 1, 2, 3], 4)

   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e


In [10]:
df.replace([0, 1, 2, 3], [4, 3, 2, 1])

   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e


In [11]:
adult_income = pd.read_csv("adult.csv")

In [12]:
adult_income.shape

(48842, 15)

In [13]:
adult_income.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
       'marital-status', 'occupation', 'relationship', 'race', 'gender',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

In [14]:
adult_income.isna().sum()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64

In [15]:
adult_income.replace(to_replace="?", value=np.nan, inplace=True)

In [16]:
adult_income.isna().sum()

age                   0
workclass          2799
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2809
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      857
income                0
dtype: int64

replace allows using lists or dictionaries to change multiple values simultaneously:



In [17]:
adult_income.gender.value_counts()

Male      32650
Female    16192
Name: gender, dtype: int64

In [18]:
adult_income.replace(["Male", "Female"], ["M", "F"], inplace=True)

In [19]:
adult_income.gender.value_counts()

M    32650
F    16192
Name: gender, dtype: int64

When replacing a list of values with another list, the two lists are matched one-to-one by position.
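As a small check of that positional pairing, the sketch below (with a made-up `genders` Series) compares the list form against the equivalent dictionary form.

```python
import pandas as pd

# Hypothetical sample values for illustration.
genders = pd.Series(["Male", "Female", "Female", "Male"])

# Lists are paired by position: "Male" -> "M", "Female" -> "F".
by_lists = genders.replace(["Male", "Female"], ["M", "F"])

# The same mapping written as a dictionary.
by_dict = genders.replace({"Male": "M", "Female": "F"})

print(by_lists.equals(by_dict))
```

The dictionary form is often easier to read once the mapping grows beyond a couple of values, since each old value sits next to its replacement.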

In [20]:
adult_income["native-country"].value_counts()

United-States                 43832
Mexico                          951
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Greece                           49
Nicaragua                        49
Peru                             46
Ecuador                     

In [21]:
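# Keys must match the stored values exactly. The column spells the country
# "United-States" (with a hyphen), so these keys match nothing and the
# counts in the next cell are unchanged.
adult_income.replace({"United States": "USA", "US": "USA"}, inplace=True)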

In [22]:
adult_income["native-country"].value_counts()

United-States                 43832
Mexico                          951
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Greece                           49
Nicaragua                        49
Peru                             46
Ecuador                     

In [25]:
adult_income.education.value_counts()

HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
Name: education, dtype: int64

In [26]:
adult_income.income.value_counts()

<=50K    37155
>50K     11687
Name: income, dtype: int64

In [27]:
adult_income.replace(
    {
        "education": {"HS-grad": "High school", "Some-college": "College", "12th":"High school"},
        "income": {"<=50K": 0, ">50K": 1},
    },
    inplace=True,
)

In [28]:
adult_income.education.value_counts()

High school    16441
College        10878
Bachelors       8025
Masters         2657
Assoc-voc       2061
11th            1812
Assoc-acdm      1601
10th            1389
7th-8th          955
Prof-school      834
9th              756
Doctorate        594
5th-6th          509
1st-4th          247
Preschool         83
Name: education, dtype: int64

In [29]:
adult_income.income.value_counts()

0    37155
1    11687
Name: income, dtype: int64

# Iterating efficiently

### **The golden rule for applying operations on entire columns or data frames is to never use loops**

Think about arrays as vectors and the whole data frame as a matrix.

If you want to perform any mathematical operation on one or more columns, there is a good chance that the operation is vectorized in Pandas.

For example, the built-in Python operators like +, -, *, /, ** work element-wise on whole columns, just as they do on vectors.
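To make the contrast concrete, here is a minimal sketch on a made-up `prices` frame (the column names are illustrative, not from the notebook's datasets). It computes the same totals once with a loop and once with column-wise operators.

```python
import pandas as pd

# Hypothetical data for illustration.
prices = pd.DataFrame({"unit_price": [2.0, 3.5, 4.0], "quantity": [10, 4, 5]})

# Loop version (avoid): builds the result one row at a time.
totals_loop = [row.unit_price * row.quantity for row in prices.itertuples()]

# Vectorized version: the * and ** operators act on whole columns at once.
prices["total"] = prices["unit_price"] * prices["quantity"]
prices["total_squared"] = prices["total"] ** 2

print(prices)
```

Both versions give the same totals; the vectorized one pushes the loop down into compiled code, which is where the speed comes from on large frames.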

To get a taste of vectorization, let’s perform some operations on a massive dataset. We will use the roughly 1M-row dataset from the old Kaggle TPS September 2021 competition.

# Datatable

Datatable is a Python library for manipulating tabular data.

It supports out-of-memory datasets, multi-threaded data processing, and a flexible API.

The datatable module emphasizes speed and big data support (an area that pandas struggles with); it also has an expressive and concise syntax, which makes datatable useful for small datasets as well.

Note: in pandas, there are two fundamental data structures: Series and DataFrame. In datatable, there is only one: the Frame.

In [30]:
!python3 -m pip install -U pip
!python3 -m pip install -U datatable

Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.3.2
Collecting datatable
  Downloading datatable-1.1.0-cp310-cp310-manylinux_2_35_x86_64.whl.metadata (1.8 kB)
Downloading datatable-1.1.0-cp310-cp310-manylinux_2_35_x86_64.whl (82.0 MB)
Installing collected packages: datatable
Successfully installed datatable-1.1.0

In [31]:
import datatable as dt

Only run this part once the tabular-playground-series-sep-2021 upload has finished.

In [None]:
tps = dt.fread("/content/cloned-repo/train.csv").to_pandas()
tps.shape

The fastest built-in iterator of Pandas is apply.

In [None]:
def crazy_function(col1, col2, col3):
    return np.sqrt(col1 ** 3 + col2 ** 2 + col3 * 10)

Time the crazy_function on three columns using apply:

In [None]:
%time tps['f1000'] = tps.apply(lambda row: crazy_function(row['f1'], row['f56'], row['f44']), axis=1)

Watch what happens when we pass columns as vectors rather than scalars. No need to modify the function:

In [None]:
%time tps['f1001'] = crazy_function(tps['f1'], tps['f56'], tps['f44'])
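If the TPS file is not available yet, the same comparison can be tried on synthetic data. The sketch below is an illustration under assumptions: the `demo` frame is random data that only borrows the column names f1, f56, and f44, and timings on it will differ from the real dataset.

```python
import time

import numpy as np
import pandas as pd


def crazy_function(col1, col2, col3):
    return np.sqrt(col1 ** 3 + col2 ** 2 + col3 * 10)


# Synthetic stand-in for the TPS data (assumption: non-negative floats).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((20_000, 3)), columns=["f1", "f56", "f44"])

# Row-by-row: apply calls the function once per row with scalar inputs.
start = time.perf_counter()
by_apply = demo.apply(
    lambda row: crazy_function(row["f1"], row["f56"], row["f44"]), axis=1
)
apply_seconds = time.perf_counter() - start

# Vectorized: one call with whole columns as inputs.
start = time.perf_counter()
by_vector = crazy_function(demo["f1"], demo["f56"], demo["f44"])
vector_seconds = time.perf_counter() - start

print(f"apply: {apply_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
print("same result:", np.allclose(by_apply, by_vector))
```

The exact ratio depends on the machine and the row count, so treat the printed timings as indicative only.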

This is about 600 times faster than apply, the fastest iterator. But we can do even better: vectorization is faster still when used on NumPy arrays.

Add .values to get the underlying NumPy ndarray of a Pandas Series.

NumPy arrays are faster because they don't perform the additional calls that Pandas makes for indexing and data type checks.

In [None]:
%time tps['f1001'] = crazy_function(tps['f1'].values, tps['f56'].values, tps['f44'].values)
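For reference, the pandas documentation recommends the to_numpy() method over the .values attribute for getting the underlying array. A minimal sketch on a made-up Series shows that both return a plain ndarray here:

```python
import numpy as np
import pandas as pd

# Hypothetical column for illustration.
col = pd.Series([1.0, 4.0, 9.0], name="f1")

as_values = col.values      # attribute used in this notebook
as_numpy = col.to_numpy()   # method recommended by the pandas docs

print(type(as_values), type(as_numpy))
print(np.sqrt(as_numpy))
```

Note that the result of a function applied to plain arrays carries no index, so assigning it back to a DataFrame column matches rows by position.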

Pandas has a few more tricks up its sleeve.

**Fair warning, though: these won’t benefit you much unless you have upwards of 1M rows.**

In [None]:
massive_df = pd.concat([tps.drop(["f1000", "f1001"], axis=1)] * 10)
massive_df.shape

In [None]:
memory_usage = massive_df.memory_usage(deep=True)
memory_usage_in_mbs = np.sum(memory_usage / 1024 ** 2)
memory_usage_in_mbs

Using our crazy_function, we start with NumPy vectorization as a baseline.

It takes about 0.3 seconds on the 10M-row dataset.

In [None]:
%%time

massive_df["f1001"] = crazy_function(
    massive_df["f1"].values, massive_df["f56"].values, massive_df["f44"].values
)

Let’s improve the runtime even more.

The first candidate is Numba.

We install it via pip (pip install numba) and import it. Then, we decorate our crazy_function with its jit function. JIT stands for just-in-time compilation: it translates pure Python and NumPy code to native machine instructions, which can give massive speed-ups.

In [None]:
!pip install numba

In [None]:
import numba

@numba.jit
def crazy_function(col1, col2, col3):
    return (col1 ** 3 + col2 ** 2 + col3 * 10) ** 0.5

In [None]:
%%time

massive_df["f1001"] = crazy_function(massive_df["f1"].values, massive_df["f56"].values, massive_df["f44"].values)


We achieved a speed-up of about 1.5 times.

**Note that Numba works best with functions that involve many native Python loops, a lot of math, and, even better, NumPy functions and arrays.**
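To illustrate the loop-heavy case mentioned in the note, here is a minimal sketch of the same formula written as an explicit loop and compiled with Numba's njit decorator. The function name `crazy_loop` and the random inputs are made up for this example, and the try/except fallback is only there so the sketch still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (slowly) without Numba installed.
    def njit(func):
        return func


@njit
def crazy_loop(col1, col2, col3):
    # Explicit element-by-element loop: slow in plain Python, fast once compiled.
    out = np.empty(col1.shape[0])
    for i in range(col1.shape[0]):
        out[i] = (col1[i] ** 3 + col2[i] ** 2 + col3[i] * 10) ** 0.5
    return out


# Synthetic inputs (assumption: non-negative floats).
rng = np.random.default_rng(0)
a, b, c = rng.random(10_000), rng.random(10_000), rng.random(10_000)

result = crazy_loop(a, b, c)  # the first call includes compilation time
print(np.allclose(result, np.sqrt(a ** 3 + b ** 2 + c * 10)))
```

When timing a compiled function, call it once beforehand so that the one-off compilation cost is not counted.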

# The eval function of Pandas

There are two versions:
> pd.eval (higher-level)

> df.eval (in the context of DataFrames).

**As with Numba, you should have at least 10,000 rows in the DataFrame to see improvements. But once you do, you will see sizeable benefits in speed.**

Let’s run our crazy_function in the context of df.eval:

In [None]:
%%time

massive_df.eval("f1001 = (f1 ** 3 + f56 ** 2 + f44 * 10) ** 0.5", inplace=True)


It's not as fast as vectorization or Numba, but it has several benefits. First, you write much less code by avoiding references to the DataFrame name. Next, it significantly speeds up non-math operations on DataFrames like boolean indexing, comparisons, and many more.
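As a sketch of the boolean-indexing use mentioned above, the example below filters a synthetic `demo` frame (random data that only borrows the column names) the usual way and through df.eval, then checks that the two agree.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((1_000, 3)), columns=["f1", "f56", "f44"])

# Boolean indexing written the usual way...
plain = demo[(demo["f1"] > 0.5) & (demo["f56"] < 0.2)]

# ...and the same filter via df.eval, which returns a boolean mask.
mask = demo.eval("(f1 > 0.5) & (f56 < 0.2)")
via_eval = demo[mask]

print(plain.equals(via_eval))
```

On a frame this small there is no speed benefit to expect; the point is only that the expression string refers to columns by name, without repeating the DataFrame name.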



### **When you are not doing mathematical manipulation, evaluate your statements in pd.eval.**