<a href="https://colab.research.google.com/github/cagBRT/PerformanceEnhancement/blob/main/Pandas_Performance_Enhancement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook covers some techniques for speeding up Pandas operations on large datasets (over 1 million rows).

In [1]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/Intro-to-Pandas.git cloned-repo
%cd cloned-repo
!ls

fatal: destination path 'cloned-repo' already exists and is not an empty directory.
/content/cloned-repo
 adult.csv	        P2_Pandas_Data_Prep.ipynb	       testTitanic.csv
 deniro.csv	        P3_Pandas.ipynb			       titanic.csv
 letter_frequency.csv  'Pandas(1).png'			       train.csv
 oscarNoErrors.csv      Pandas_Performance_Enhancement.ipynb   trainTitanic.csv
 oscar_winners.csv      README.md			       UFO.csv
 P1_Pandas_ES.ipynb     sample_solution.csv		       xP1_Intro_to_Pandas.ipynb
 P1_Pandas.ipynb        test.csv


1. Download the dataset here:

In [2]:
#https://www.kaggle.com/competitions/tabular-playground-series-sep-2021/rules

2. It will be in zip format. <br>

In [3]:
import zipfile

You can continue with the notebook; this file is not needed for a little while.



---




In [4]:
import pandas as pd
import numpy as np

## Use replace to replace specific values

While speed is the first benefit of replace, the second is its flexibility.

We can replace all question marks with NaN - an operation that would take multiple calls with index-based replacement.

Nested replacement helps when you only want to affect the values of specific columns. Later in this section, we replace values only in the education and income columns.



In [5]:
s = pd.Series([1, 2, 3, 4, 5])
s.replace(1, 5)

0    5
1    2
2    3
3    4
4    5
dtype: int64

In [6]:
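# Back-fill the matched values (1 and 2) from the next unmatched value.
# Note: the `method` argument of replace is deprecated since pandas 2.1.
s.replace([1, 2], method='bfill')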

0    3
1    3
2    3
3    4
4    5
dtype: int64

In [7]:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df.replace(0, 5)

Unnamed: 0,A,B,C
0,5,5,a
1,1,6,b
2,2,7,c
3,3,8,d
4,4,9,e


In [8]:
df.replace([0, 1, 2, 3], 4)

Unnamed: 0,A,B,C
0,4,5,a
1,4,6,b
2,4,7,c
3,4,8,d
4,4,9,e


In [9]:
df.replace([0, 1, 2, 3], [4, 3, 2, 1])

Unnamed: 0,A,B,C
0,4,5,a
1,3,6,b
2,2,7,c
3,1,8,d
4,4,9,e


In [10]:
adult_income = pd.read_csv("adult.csv")

In [11]:
adult_income.shape

(48842, 15)

In [12]:
adult_income.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
       'marital-status', 'occupation', 'relationship', 'race', 'gender',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

In [13]:
adult_income.isna().sum()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64

In [14]:
adult_income.replace(to_replace="?", value=np.nan, inplace=True)

In [15]:
adult_income.isna().sum()

age                   0
workclass          2799
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2809
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      857
income                0
dtype: int64

replace allows using lists or dictionaries to change multiple values simultaneously:



In [16]:
adult_income.gender.value_counts()

Male      32650
Female    16192
Name: gender, dtype: int64

In [17]:
adult_income.replace(["Male", "Female"], ["M", "F"], inplace=True)

In [18]:
adult_income.gender.value_counts()

M    32650
F    16192
Name: gender, dtype: int64

When replacing a list of values with another, they will have a one-to-one, index-to-index mapping.
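
The position-by-position pairing can be checked on a tiny example. This is a minimal sketch on made-up data (the `toy` frame is hypothetical, not the adult.csv file); it shows that two parallel lists behave like the equivalent dictionary.

```python
import pandas as pd

# Hypothetical toy data, not the adult.csv file used in this notebook.
toy = pd.DataFrame({"gender": ["Male", "Female", "Male"],
                    "income": ["<=50K", ">50K", "<=50K"]})

# Two parallel lists: "Male" -> "M", "Female" -> "F" (paired by position).
by_lists = toy.replace(["Male", "Female"], ["M", "F"])

# The same mapping written as a dictionary.
by_dict = toy.replace({"Male": "M", "Female": "F"})

print(by_lists.equals(by_dict))
```

The dictionary form is often easier to read once the mapping grows, because each old value sits next to its replacement.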

In [19]:
adult_income["native-country"].value_counts()

United-States                 43832
Mexico                          951
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Greece                           49
Nicaragua                        49
Peru                             46
Ecuador                     

In [20]:
adult_income.replace({"United States": "USA", "US": "USA"}, inplace=True)

In [21]:
adult_income["native-country"].value_counts()

United-States                 43832
Mexico                          951
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Greece                           49
Nicaragua                        49
Peru                             46
Ecuador                     
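
The counts above are unchanged: a plain dictionary passed to replace matches whole cell values exactly, and this column stores United-States with a hyphen, so neither "United States" nor "US" matches anything. The sketch below illustrates this on a made-up Series (`countries` is hypothetical and only mimics the spelling in adult.csv).

```python
import pandas as pd

# Hypothetical toy column that mimics the spelling used in adult.csv.
countries = pd.Series(["United-States", "Mexico", "United-States"])

# Keys that do not match any value exactly leave the data untouched.
unchanged = countries.replace({"United States": "USA", "US": "USA"})

# A key that matches the stored spelling is replaced.
changed = countries.replace({"United-States": "USA"})

print(unchanged.tolist())
print(changed.tolist())
```

Checking value_counts() before writing the mapping, as the notebook does, is a simple way to catch spelling mismatches like this one.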

In [24]:
adult_income.education.value_counts()

HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
Name: education, dtype: int64

In [25]:
adult_income.income.value_counts()

<=50K    37155
>50K     11687
Name: income, dtype: int64

In [26]:
adult_income.replace(
    {
        "education": {"HS-grad": "High school", "Some-college": "College", "12th":"High school"},
        "income": {"<=50K": 0, ">50K": 1},
    },
    inplace=True,
)

In [27]:
adult_income.education.value_counts()

High school    16441
College        10878
Bachelors       8025
Masters         2657
Assoc-voc       2061
11th            1812
Assoc-acdm      1601
10th            1389
7th-8th          955
Prof-school      834
9th              756
Doctorate        594
5th-6th          509
1st-4th          247
Preschool         83
Name: education, dtype: int64

In [33]:
adult_income.income.value_counts()

0    37155
1    11687
Name: income, dtype: int64

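For selecting one or more rows by position, iloc is generally a little faster than loc, although in the single run below the gap is small (4.45 ms versus 4.9 ms of wall time):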

In [34]:
%time adult_income.iloc[range(10000)]
%time adult_income.loc[range(10000)]

CPU times: user 4.32 ms, sys: 0 ns, total: 4.32 ms
Wall time: 4.45 ms
CPU times: user 4.46 ms, sys: 0 ns, total: 4.46 ms
Wall time: 4.9 ms


Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,M,0,0,40,United-States,0
1,38,Private,89814,High school,9,Married-civ-spouse,Farming-fishing,Husband,White,M,0,0,50,United-States,0
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,M,0,0,40,United-States,1
3,44,Private,160323,College,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,M,7688,0,40,United-States,1
4,18,,103497,College,10,Never-married,,Own-child,White,F,0,0,30,United-States,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,66,Self-emp-not-inc,176315,Bachelors,13,Divorced,Sales,Not-in-family,White,M,401,0,20,United-States,0
9996,35,Private,187167,High school,9,Never-married,Adm-clerical,Unmarried,White,F,0,0,40,United-States,0
9997,24,Private,241582,College,10,Never-married,Sales,Not-in-family,White,M,0,0,33,United-States,0
9998,31,Private,247328,11th,7,Married-civ-spouse,Protective-serv,Husband,White,M,0,0,40,United-States,0
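
A single %time run is noisy at the millisecond scale, so it can help to repeat the measurement. The following is a rough sketch using the standard-library timeit module on a synthetic frame (`demo` and `rows` are made-up names, and the data is random rather than adult.csv); the numbers it prints will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic frame with a default RangeIndex, so labels equal positions.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 100, size=(50_000, 5)), columns=list("abcde"))
rows = list(range(10_000))

# Both selections return the same rows here because labels and positions coincide.
by_position = demo.iloc[rows]
by_label = demo.loc[rows]

# Take the best of several repeats to reduce the effect of background noise.
t_iloc = min(timeit.repeat(lambda: demo.iloc[rows], number=20, repeat=5))
t_loc = min(timeit.repeat(lambda: demo.loc[rows], number=20, repeat=5))
print(f"iloc: {t_iloc / 20 * 1e3:.3f} ms per call")
print(f"loc:  {t_loc / 20 * 1e3:.3f} ms per call")
```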


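The same applies when choosing columns; in the run below iloc took about a third of the wall time of loc (1.63 ms versus 5.27 ms):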

In [40]:
%time adult_income.loc[:,["workclass","occupation","gender"]]
%time adult_income.iloc[:,[1,6,9]]

CPU times: user 3.25 ms, sys: 4 µs, total: 3.26 ms
Wall time: 5.27 ms
CPU times: user 1.62 ms, sys: 0 ns, total: 1.62 ms
Wall time: 1.63 ms


Unnamed: 0,workclass,occupation,gender
0,Private,Machine-op-inspct,M
1,Private,Farming-fishing,M
2,Local-gov,Protective-serv,M
3,Private,Machine-op-inspct,M
4,,,F
...,...,...,...
48837,Private,Tech-support,F
48838,Private,Machine-op-inspct,M
48839,Private,Adm-clerical,F
48840,Private,Adm-clerical,M


In [42]:
%time adult_income.sample(7,axis=0)

CPU times: user 2.05 ms, sys: 3 µs, total: 2.05 ms
Wall time: 2.06 ms


Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
20149,23,Self-emp-not-inc,216129,Assoc-acdm,12,Never-married,Craft-repair,Not-in-family,White,M,0,0,30,United-States,0
29928,44,Self-emp-not-inc,195486,High school,9,Married-civ-spouse,Sales,Husband,Black,M,0,0,70,Jamaica,0
35310,32,Private,101709,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,M,0,0,40,United-States,1
44769,30,Private,167832,High school,9,Married-civ-spouse,Machine-op-inspct,Husband,White,M,0,0,40,United-States,0
15981,32,Private,259425,College,10,Divorced,Craft-repair,Not-in-family,White,M,0,0,40,United-States,0
40446,19,Private,205830,High school,9,Never-married,Other-service,Own-child,White,F,0,0,40,El-Salvador,0
24590,46,Private,148254,High school,9,Divorced,Exec-managerial,Not-in-family,White,F,0,0,60,United-States,0


In [43]:
%time adult_income.sample(5, axis=1).sample(7, axis=0)

CPU times: user 6.23 ms, sys: 1.03 ms, total: 7.26 ms
Wall time: 7.01 ms


Unnamed: 0,gender,race,educational-num,marital-status,education
43151,M,White,12,Divorced,Assoc-acdm
9098,M,White,11,Married-civ-spouse,Assoc-voc
29222,F,White,7,Never-married,11th
35202,F,White,6,Never-married,10th
39144,M,White,13,Married-civ-spouse,Bachelors
3506,M,White,10,Never-married,College
21593,M,White,9,Married-civ-spouse,High school


# Iterating efficiently

### **The golden rule for applying operations on entire columns or data frames is to never use loops**

Think of columns as vectors and of the whole DataFrame as a matrix.

If you want to perform a mathematical operation on one or more columns, there is a good chance that the operation is already vectorized in Pandas.

For example, the built-in Python operators +, -, *, / and ** apply element-wise to whole columns, just as they would to vectors.
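
As a small illustration of element-wise operators, the sketch below computes the same expression once with a vectorized expression and once with an explicit loop. The frame `df_small` and its column names are made up for this example.

```python
import pandas as pd

# Hypothetical toy columns; the names are made up for illustration.
df_small = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# One expression operates on every row at once.
vectorized = df_small["x"] ** 2 + df_small["y"] / 2

# The equivalent explicit loop, shown only for comparison.
looped = pd.Series([x ** 2 + y / 2 for x, y in zip(df_small["x"], df_small["y"])])

print(vectorized.tolist())  # [3.0, 6.5, 12.0]
```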

To get a taste of vectorization, let’s perform some operations on a massive dataset. We will use the roughly 1M-row dataset from the Kaggle Tabular Playground Series (TPS) September 2021 competition:

# Datatable

Datatable is a Python library for manipulating tabular data.


It supports out-of-memory datasets and multi-threaded data processing, and it has a flexible API.



The datatable module emphasizes speed and big data support (an area that pandas struggles with); it also has an expressive and concise syntax, which makes it useful for small datasets as well.

Note: in pandas, there are two fundamental data structures: Series and DataFrame. In datatable, there is only one: the Frame.

In [30]:
!python3 -m pip install -U pip
!python3 -m pip install -U datatable

Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.3.2
Collecting datatable
  Downloading datatable-1.1.0-cp310-cp310-manylinux_2_35_x86_64.whl.metadata (1.8 kB)
Downloading datatable-1.1.0-cp310-cp310-manylinux_2_35_x86_64.whl (82.0 MB)
Installing collected packages: datatable
Successfully installed datatable-1.1.0

In [31]:
import datatable as dt

In [35]:
!unzip "/content/tabular-playground-series-sep-2021.zip"

Archive:  /content/tabular-playground-series-sep-2021.zip
  inflating: sample_solution.csv     
replace test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: test.csv                
  inflating: train.csv               


Only run this part once tabular-playground-series-sep-2021.zip has finished uploading.<br>

In [36]:
tps = dt.fread("/content/cloned-repo/train.csv").to_pandas()
tps.shape

(957919, 120)

The shape of train.csv should be (957919, 120)

The fastest built-in iterator of Pandas is apply.
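
Which row-wise option is fastest can depend on the pandas version and on the function being applied, so it is worth measuring on your own data. The sketch below times apply, itertuples and iterrows on a small synthetic frame; `demo`, `combine` and the column names are hypothetical and chosen only for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data; column names are made up for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((2_000, 3)), columns=["a", "b", "c"])

def combine(a, b, c):
    """Scalar function applied to one row at a time."""
    return a + b * c

def with_apply():
    return demo.apply(lambda row: combine(row["a"], row["b"], row["c"]), axis=1)

def with_itertuples():
    return pd.Series([combine(t.a, t.b, t.c) for t in demo.itertuples(index=False)])

def with_iterrows():
    return pd.Series([combine(r["a"], r["b"], r["c"]) for _, r in demo.iterrows()])

# Report the best of three runs for each row-wise approach.
for fn in (with_apply, with_itertuples, with_iterrows):
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms")
```

All three produce the same values; only the per-row overhead differs.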

In [37]:
def crazy_function(col1, col2, col3):
    return np.sqrt(col1 ** 3 + col2 ** 2 + col3 * 10)

Time the crazy_function on three columns using apply

In [38]:
%time tps['f1000'] = tps.apply(lambda row: crazy_function(row['f1'], row['f56'], row['f44']), axis=1)

CPU times: user 18.7 s, sys: 4.73 s, total: 23.5 s
Wall time: 23.6 s


Watch what happens when we pass columns as vectors rather than scalars. No need to modify the function:

In [39]:
%time tps['f1001'] = crazy_function(tps['f1'], tps['f56'], tps['f44'])

CPU times: user 53.4 ms, sys: 3.73 ms, total: 57.1 ms
Wall time: 59.9 ms



That is roughly 400 times faster than apply (59.9 ms versus 23.6 s of wall time). Vectorization can be faster still when applied to NumPy arrays, although in the run below the gain is small (51.5 ms versus 57.1 ms of CPU time):

Add .values to get the underlying NumPy ndarray of a Pandas Series.

NumPy arrays are faster because they skip the extra work Pandas does for index alignment and data type checks.
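
The sketch below shows what dropping down to NumPy looks like on a made-up column (`col` is hypothetical). It also uses `.to_numpy()`, which the pandas documentation recommends over `.values`; both return the same array here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column for illustration.
col = pd.Series([1.0, 4.0, 9.0], name="f")

arr = col.to_numpy()   # accessor recommended in the pandas documentation
same = col.values      # older spelling used in this notebook

# A NumPy function returns a plain ndarray for ndarray input...
roots_from_array = np.sqrt(arr)
# ...and a Series (with its index) for Series input.
roots_from_series = np.sqrt(col)

print(type(roots_from_array).__name__, type(roots_from_series).__name__)
```

Assigning either result back to a DataFrame column works the same way, as long as the array has one value per row.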

In [40]:
%time tps['f1001'] = crazy_function(tps['f1'].values, tps['f56'].values, tps['f44'].values)

CPU times: user 51.5 ms, sys: 0 ns, total: 51.5 ms
Wall time: 65.7 ms




Pandas has a few more tricks up its sleeve.

**Fair warning, though: these won’t benefit you much unless you have more than 1M rows.**

In [41]:
massive_df = pd.concat([tps.drop(["f1000", "f1001"], axis=1)] * 10)
massive_df.shape

(9579190, 120)

In [42]:
memory_usage = massive_df.memory_usage(deep=True)
memory_usage_in_mbs = np.sum(memory_usage / 1024 ** 2)
memory_usage_in_mbs

8742.604093551636

Using our crazy_function, start with NumPy vectorization as a baseline.


It takes about 0.4 seconds on this roughly 10M-row dataset.

In [43]:
%%time

massive_df["f1001"] = crazy_function(
    massive_df["f1"].values, massive_df["f56"].values, massive_df["f44"].values
)

CPU times: user 357 ms, sys: 88.8 ms, total: 446 ms
Wall time: 442 ms




Let’s improve the runtime even more.

The first candidate is Numba.

We install it via pip (pip install numba) and import it. Then we decorate our crazy_function with its jit function. JIT stands for just in time: Numba translates pure Python and NumPy code to native machine instructions, which can give large speed-ups.

In [44]:
!pip install numba

In [45]:
import numba

@numba.jit
def crazy_function(col1, col2, col3):
    return (col1 ** 3 + col2 ** 2 + col3 * 10) ** 0.5


In [46]:
%%time

massive_df["f1001"] = crazy_function(massive_df["f1"].values, massive_df["f56"].values, massive_df["f44"].values)


CPU times: user 1.2 s, sys: 223 ms, total: 1.42 s
Wall time: 1.85 s


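The 1.85 s above includes one-time JIT compilation, so this first call is slower than the 442 ms NumPy baseline. Run the cell a second time to measure the compiled function on its own; that is where a speed-up of about 1.5 times can show up.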

**Note that Numba works best with functions that involve many native Python loops, a lot of math, and, even better, NumPy functions and arrays.**
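
The first call to a jitted function also pays for compilation, so it is useful to time a second call separately. Below is a minimal sketch of that warm-up pattern with an explicit loop, the kind of code Numba targets. The function name `root_of_powers` and the synthetic arrays are made up, and the `try/except` fallback is an assumption added only so the sketch still runs (without any speed-up) where Numba is not installed.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without Numba (no speed-up in that case).
    def njit(func):
        return func

@njit
def root_of_powers(a, b, c):
    # Explicit element-by-element loop, compiled to machine code by Numba.
    out = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        out[i] = (a[i] ** 3 + b[i] ** 2 + c[i] * 10) ** 0.5
    return out

# Hypothetical synthetic inputs (non-negative, so the square root is defined).
rng = np.random.default_rng(0)
a, b, c = rng.random(100_000), rng.random(100_000), rng.random(100_000)

start = time.perf_counter()
first = root_of_powers(a, b, c)    # includes compilation when Numba is installed
first_call = time.perf_counter() - start

start = time.perf_counter()
second = root_of_powers(a, b, c)   # reuses the already compiled code
second_call = time.perf_counter() - start

print(f"first call:  {first_call * 1e3:.2f} ms")
print(f"second call: {second_call * 1e3:.2f} ms")
```

In a notebook, the same effect can be had by simply running the timed cell twice and reading the second result.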

# The eval function of Pandas

There are two versions -
> pd.eval (higher-level)

> df.eval (in the context of DataFrames).

**As with Numba, you should have at least 10,000 samples in the DataFrame to see improvements. But once you do, you will see sizeable benefits in speed.**

Let’s run our crazy_function in the context of df.eval:

In [47]:
%%time

massive_df.eval("f1001 = (f1 ** 3 + f56 ** 2 + f44 * 10) ** 0.5", inplace=True)

CPU times: user 255 ms, sys: 152 ms, total: 406 ms
Wall time: 455 ms


In this run it is about as fast as the NumPy baseline (455 ms versus 442 ms), and it has several other benefits. First, you write much less code by avoiding references to the DataFrame name. Next, it can speed up non-math operations on DataFrames, such as boolean indexing and comparisons.
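
To make the non-math case concrete, here is a small sketch that pairs df.eval with the plain pandas expression and df.query with an explicit boolean mask. The frame `demo` and its column names are made up; the example checks only that the results agree and makes no claim about speed at this size.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic frame; column names are made up for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((1_000, 3)), columns=["a", "b", "c"])

# Arithmetic: df.eval versus the plain pandas expression.
via_eval = demo.eval("(a ** 3 + b ** 2 + c * 10) ** 0.5")
via_pandas = (demo["a"] ** 3 + demo["b"] ** 2 + demo["c"] * 10) ** 0.5

# Boolean indexing: df.query versus an explicit mask.
via_query = demo.query("a > 0.5 and b < c")
via_mask = demo[(demo["a"] > 0.5) & (demo["b"] < demo["c"])]

print(np.allclose(via_eval, via_pandas))
print(via_query.equals(via_mask))
```

The string form avoids repeating the DataFrame name in every term, which is the readability benefit mentioned above.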



**Even though the DataFrame has almost 10 million rows, the NumPy and eval versions each finished in under half a second.**



### **When you are not doing mathematical manipulation, evaluate your statements in pd.eval.**