# Parallel Feature Engineering: Comparing Sequential and Multiprocessing Approaches

Created by Gaurav Kaushik

In this notebook, we explore how to speed up feature engineering by using **parallel processing**.

We build a **large version** of a Spotify tracks dataset by concatenating it 50 times (1,641,650 rows and 23 columns) to approximate a large-data workload.

We will:
- Create 6 new engineered features
- Apply feature engineering sequentially
- Apply feature engineering using multiprocessing
- Compare the time taken by both approaches

This shows when **multiprocessing** helps scale feature engineering, and when its overhead outweighs the gain.


## 1. Overview

Feature engineering is one of the most important steps in building machine learning models.

As datasets grow larger, sequential feature engineering becomes slow and inefficient.

In this notebook, I will show:
- How to build features sequentially
- How to parallelize feature generation
- How much time we can save using multiprocessing


## 2. Problem Setup

- **Dataset**: Spotify tracks (`spotify.csv`), concatenated 50 times to enlarge it
- **Size**: 1,641,650 rows and 23 columns
- **Task**: Create 6 new features
- **Goal**: Compare sequential and parallel feature engineering performance


Let's load the dataset.

In [5]:
import pandas as pd

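duplication_factor = 50  # 50 stacked copies of the original rows (~1.6 million rows in total)

# Load the original Spotify dataset
df_spotify = pd.read_csv("spotify.csv")

# Enlarge it by concatenating duplication_factor copies
df_spotify_large = pd.concat([df_spotify] * duplication_factor, ignore_index=True)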

# Check the shape
print(f"Dataset shape: {df_spotify_large.shape}")
df_spotify_large.head()

Dataset shape: (1641650, 23)


|   | track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | ... | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6f807x0ima9a1j3VPbc7VN | I Don't Care (with Justin Bieber) - Loud Luxur... | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don't Care (with Justin Bieber) [Loud Luxury... | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 6 | -2.634 | 1 | 0.0583 | 0.102 | 0.0 | 0.0653 | 0.518 | 122.036 | 194754 |
| 1 | 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 0.00421 | 0.357 | 0.693 | 99.972 | 162600 |
| 2 | 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.3e-05 | 0.11 | 0.613 | 124.008 | 176616 |
| 3 | 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 7 | -3.778 | 1 | 0.102 | 0.0287 | 9e-06 | 0.204 | 0.277 | 121.956 | 169093 |
| 4 | 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.0 | 0.0833 | 0.725 | 123.976 | 189052 |


## 3. Features to Create

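We will engineer the following six feature groups (nine new columns in total):

- **CleanTrackNameComplex**: `track_name` lowercased, with punctuation removed and whitespace collapsed
- **DeepHashArtistID**: SHA-512 hash of `track_artist`, reduced modulo 10^8
- **RollingLoudness500**: rolling mean of `loudness` over a 500-row window
- **EnergyOutlierComplex**: 1 if `energy` falls below the 1st or above the 99th percentile, else 0
- **ReleaseYear**, **ReleaseMonth**, **ReleaseDay**: date parts of `track_album_release_date`
- **DanceEnergy**, **EnergyLoudness**: the products `danceability × energy` and `energy × loudness`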


## 4. Approach 1: Sequential Feature Engineering

- Create all 6 features one-by-one.
- No parallelism involved.
- Measure and record total time taken.


Let's create features sequentially.


In [6]:
import re
import hashlib
import numpy as np
import pandas as pd

# 1. Complex text cleaning with regex + replacements
def clean_track_name_complex(df):
    df["CleanTrackNameComplex"] = (
        df["track_name"]
        .str.lower()
        .str.replace(r"[^\w\s]", "", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    return df[["CleanTrackNameComplex"]]

# 2. SHA-512 hash of the artist name, reduced to an integer ID below 10**8 (a hash, not encryption)
def deep_hash_artist_id(df):
    df["DeepHashArtistID"] = df["track_artist"].apply(lambda x: int(hashlib.sha512(str(x).encode()).hexdigest(), 16) % 10**8)
    return df[["DeepHashArtistID"]]

# 3. Rolling mean of loudness with huge window
def rolling_loudness(df):
    df["RollingLoudness500"] = df["loudness"].rolling(window=500, min_periods=1).mean()
    return df[["RollingLoudness500"]]

# 4. Outlier detection on energy (apply complex rule)
def complex_energy_outlier(df):
    q_low = df["energy"].quantile(0.01)
    q_high = df["energy"].quantile(0.99)
    df["EnergyOutlierComplex"] = df["energy"].apply(lambda x: 1 if x < q_low or x > q_high else 0)
    return df[["EnergyOutlierComplex"]]

# 5. Extract year, month, day separately
def extract_date_parts(df):
    df["track_album_release_date"] = pd.to_datetime(df["track_album_release_date"], errors="coerce")
    df["ReleaseYear"] = df["track_album_release_date"].dt.year
    df["ReleaseMonth"] = df["track_album_release_date"].dt.month
    df["ReleaseDay"] = df["track_album_release_date"].dt.day
    return df[["ReleaseYear", "ReleaseMonth", "ReleaseDay"]]

# 6. Multiply multiple numeric columns (simulate feature crossing)
def create_feature_cross(df):
    df["DanceEnergy"] = df["danceability"] * df["energy"]
    df["EnergyLoudness"] = df["energy"] * df["loudness"]
    return df[["DanceEnergy", "EnergyLoudness"]]
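The text-cleaning and hashing steps can be checked on a single string before running them over a whole column. The snippet below is a minimal sketch using only the standard library; `clean_text`, `hash_bucket`, and the sample strings are hypothetical names and values chosen for illustration, not part of the notebook.

```python
import hashlib
import re


def clean_text(name: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", "", name)  # remove anything that is not a word character or whitespace
    name = re.sub(r"\s+", " ", name)     # collapse repeated whitespace into one space
    return name.strip()


def hash_bucket(value: str, buckets: int = 10**8) -> int:
    """Map a string to a stable integer in [0, buckets) using SHA-512."""
    digest = hashlib.sha512(str(value).encode()).hexdigest()
    return int(digest, 16) % buckets


sample_title = "Hello,   World! (Remix) - 2019"  # hypothetical track title
print(clean_text(sample_title))    # hello world remix 2019
print(hash_bucket("Some Artist"))  # a stable integer below 10**8
```

Unlike Python's built-in `hash()` for strings, which is salted per process by default, a SHA-512 digest is identical in every worker process, so the resulting IDs stay consistent when the work is later spread across processes.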


In [7]:
import time

# Start timer
start = time.time()

# Work on a fresh copy of the original frame (df_spotify, before the 50x duplication)
df_spotify_seq = df_spotify.copy()

# Feature 1: Clean track name (complex)
df_spotify_seq["CleanTrackNameComplex"] = (
    df_spotify_seq["track_name"]
    .str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Feature 2: Deep hash artist ID
df_spotify_seq["DeepHashArtistID"] = df_spotify_seq["track_artist"].apply(
    lambda x: int(hashlib.sha512(str(x).encode()).hexdigest(), 16) % 10**8
)

# Feature 3: Rolling mean of loudness (large window)
df_spotify_seq["RollingLoudness500"] = df_spotify_seq["loudness"].rolling(window=500, min_periods=1).mean()

# Feature 4: Complex energy outlier detection
q_low = df_spotify_seq["energy"].quantile(0.01)
q_high = df_spotify_seq["energy"].quantile(0.99)
df_spotify_seq["EnergyOutlierComplex"] = df_spotify_seq["energy"].apply(lambda x: 1 if x < q_low or x > q_high else 0)

# Feature 5: Extract year, month, day
df_spotify_seq["track_album_release_date"] = pd.to_datetime(df_spotify_seq["track_album_release_date"], errors="coerce")
df_spotify_seq["ReleaseYear"] = df_spotify_seq["track_album_release_date"].dt.year
df_spotify_seq["ReleaseMonth"] = df_spotify_seq["track_album_release_date"].dt.month
df_spotify_seq["ReleaseDay"] = df_spotify_seq["track_album_release_date"].dt.day

# Feature 6: Feature crossing - Danceability and Energy/Loudness
df_spotify_seq["DanceEnergy"] = df_spotify_seq["danceability"] * df_spotify_seq["energy"]
df_spotify_seq["EnergyLoudness"] = df_spotify_seq["energy"] * df_spotify_seq["loudness"]

# End timer
end = time.time()

print(f"🐢 Sequential Heavy Feature Engineering Time: {round(end - start, 2)} seconds")

# Preview result
df_spotify_seq[
    ["CleanTrackNameComplex", "DeepHashArtistID", "RollingLoudness500", 
     "EnergyOutlierComplex", "ReleaseYear", "ReleaseMonth", "ReleaseDay",
     "DanceEnergy", "EnergyLoudness"]
].head()


🐢 Sequential Heavy Feature Engineering Time: 0.24 seconds


|   | CleanTrackNameComplex | DeepHashArtistID | RollingLoudness500 | EnergyOutlierComplex | ReleaseYear | ReleaseMonth | ReleaseDay | DanceEnergy | EnergyLoudness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | i dont care with justin bieber loud luxury remix | 24080414 | -2.634 | 0 | 2019.0 | 6.0 | 14.0 | 0.685168 | -2.412744 |
| 1 | memories dillon francis remix | 66469999 | -3.8015 | 0 | 2019.0 | 12.0 | 13.0 | 0.59169 | -4.049735 |
| 2 | all the time don diablo remix | 36410968 | -3.678333 | 0 | 2019.0 | 7.0 | 5.0 | 0.628425 | -3.195192 |
| 3 | call you mine keanu silva remix | 85966740 | -3.70325 | 0 | 2019.0 | 7.0 | 19.0 | 0.66774 | -3.51354 |
| 4 | someone you loved future humans remix | 53123681 | -3.897 | 0 | 2019.0 | 3.0 | 5.0 | 0.54145 | -3.891776 |


## 5. Approach 2: Parallel Feature Engineering (Multiprocessing)

Instead of creating features one-by-one sequentially, we now:

- Define each feature creation as an independent function.
- Use joblib's `Parallel` and `delayed` (process-based workers by default) to run multiple functions in parallel.
- Merge the results back into a single dataframe.
- Measure and record total time taken.
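The merge step relies on `pd.concat(..., axis=1)` aligning the partial results on their row index. A minimal sketch of that step, assuming each feature function returns a DataFrame that keeps the input index; the toy frame and the `LoudnessRank` feature are hypothetical:

```python
import pandas as pd

# Toy input frame (hypothetical values).
base = pd.DataFrame({"energy": [0.9, 0.5, 0.7], "loudness": [-3.0, -6.0, -4.0]})


def energy_loudness(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the new column, indexed like the input."""
    return pd.DataFrame({"EnergyLoudness": df["energy"] * df["loudness"]})


def loudness_rank(df: pd.DataFrame) -> pd.DataFrame:
    """A second, independent feature (hypothetical)."""
    return pd.DataFrame({"LoudnessRank": df["loudness"].rank()})


# Each function yields a small frame; concat stitches them side by side by index.
parts = [func(base) for func in (energy_loudness, loudness_rank)]
merged = pd.concat(parts, axis=1)
print(merged.shape)  # (3, 2)
```

Because alignment happens on the index, a feature function that resets or reorders the index would produce misaligned or NaN-padded rows after the merge.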

---
### 🔄 Why Not Multithreading? Why Multiprocessing?

Before choosing how to parallelize our feature engineering, we considered:

#### ❓ Are the tasks I/O-bound or CPU-bound?
- **I/O-bound** tasks spend most of their time waiting (e.g., reading files, calling APIs). These benefit from **multithreading**.
- **CPU-bound** tasks spend most of their time computing (e.g., math, transformations). These benefit from **multiprocessing**.

#### 💥 What's the issue with threads in Python?

CPython has a **Global Interpreter Lock (GIL)**.

> 🔒 The GIL allows **only one thread to execute Python bytecode at a time**, even on multi-core machines.

So even if you use `threading`:
- There is no real parallelism for CPU-bound Python code
- Threads end up waiting on each other for the lock

That's why multithreading doesn't speed up heavy pure-Python computations.
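The effect can be observed with a small experiment. This is a sketch, assuming a standard CPython build with the GIL enabled (free-threaded builds behave differently); `count_primes` is a hypothetical CPU-bound function, and the timings will vary by machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit: int) -> int:
    """Pure-Python, CPU-bound work: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


limits = [20_000] * 4

# Run the four jobs one after another.
start = time.perf_counter()
sequential = [count_primes(n) for n in limits]
t_seq = time.perf_counter() - start

# Run the same four jobs on four threads.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(count_primes, limits))
t_thr = time.perf_counter() - start

print(f"sequential: {t_seq:.2f}s, threads: {t_thr:.2f}s")
```

On a GIL-enabled interpreter the two timings are usually close, because the threads take turns executing bytecode instead of running side by side.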

---

#### ✅ Why Multiprocessing Works

Multiprocessing creates **separate processes**, each with:
- Its own Python interpreter
- Its own memory space
- The ability to run on its own CPU core

This bypasses the GIL and allows true parallel execution on multi-core systems, which suits CPU-heavy feature generation. The cost is that input data must be serialized and copied to each worker, and the results copied back.

✅ So we chose **multiprocessing** because our tasks are:
- Independent
- CPU-intensive
- Suitable for parallel processing across multiple cores
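The same idea can be sketched with only the standard library. This is not the notebook's implementation; it is a minimal illustration using `concurrent.futures.ProcessPoolExecutor`, where `run_tasks`, `TASKS`, and the toy inputs are hypothetical. Each independent task is submitted to a worker process and the results are gathered by name.

```python
import hashlib
from concurrent.futures import Executor, ProcessPoolExecutor


def hash_bucket(values):
    """Hash every value with SHA-512 and reduce it to an integer below 10**8."""
    return [int(hashlib.sha512(str(v).encode()).hexdigest(), 16) % 10**8 for v in values]


def elementwise_product(pairs):
    """Multiply the two numbers in each pair."""
    return [a * b for a, b in pairs]


def run_tasks(executor: Executor, tasks):
    """Submit each (name, func, data) task, then gather results keyed by name."""
    futures = {name: executor.submit(func, data) for name, func, data in tasks}
    return {name: future.result() for name, future in futures.items()}


# Toy inputs (hypothetical values); each task only receives the data it needs.
TASKS = [
    ("ArtistBucket", hash_bucket, ["Artist A", "Artist B", "Artist A"]),
    ("DanceEnergy", elementwise_product, [(0.5, 0.8), (0.25, 0.4), (1.0, 0.5)]),
]

if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import.
    with ProcessPoolExecutor(max_workers=2) as pool:
        features = run_tasks(pool, TASKS)
    print(features["DanceEnergy"])
```

Functions and arguments are pickled to reach the workers, so they need to be defined at module level. Inside a notebook, where functions live in the interactive session, joblib is often the more convenient choice because its default backend can serialize interactively defined functions.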

Let's write the code.


In [8]:
from joblib import Parallel, delayed
import time

# Functions to columns mapping
heavy_features_spotify = [
    (clean_track_name_complex, ["track_name"]),
    (deep_hash_artist_id, ["track_artist"]),
    (rolling_loudness, ["loudness"]),
    (complex_energy_outlier, ["energy"]),
    (extract_date_parts, ["track_album_release_date"]),
    (create_feature_cross, ["danceability", "energy", "loudness"])
]

# Build joblib tasks; each task receives a copy of only the columns it needs (taken from the original df_spotify)
tasks = [delayed(func)(df_spotify[cols].copy()) for func, cols in heavy_features_spotify]

# Run and time
start = time.time()
results = Parallel(n_jobs=-1)(tasks)
df_spotify_heavy_parallel = pd.concat(results, axis=1)
end = time.time()

print(f"⚡ Heavy Spotify Features (Parallel) Time: {round(end - start, 2)} seconds")
df_spotify_heavy_parallel.head()


⚡ Heavy Spotify Features (Parallel) Time: 1.96 seconds


|   | CleanTrackNameComplex | DeepHashArtistID | RollingLoudness500 | EnergyOutlierComplex | ReleaseYear | ReleaseMonth | ReleaseDay | DanceEnergy | EnergyLoudness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | i dont care with justin bieber loud luxury remix | 24080414 | -2.634 | 0 | 2019.0 | 6.0 | 14.0 | 0.685168 | -2.412744 |
| 1 | memories dillon francis remix | 66469999 | -3.8015 | 0 | 2019.0 | 12.0 | 13.0 | 0.59169 | -4.049735 |
| 2 | all the time don diablo remix | 36410968 | -3.678333 | 0 | 2019.0 | 7.0 | 5.0 | 0.628425 | -3.195192 |
| 3 | call you mine keanu silva remix | 85966740 | -3.70325 | 0 | 2019.0 | 7.0 | 19.0 | 0.66774 | -3.51354 |
| 4 | someone you loved future humans remix | 53123681 | -3.897 | 0 | 2019.0 | 3.0 | 5.0 | 0.54145 | -3.891776 |


## 6. Performance Comparison

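| Method                   | Time Taken   |
|--------------------------|--------------|
| Sequential               | 0.24 seconds |
| Multiprocessing (joblib) | 1.96 seconds |

- In this run both timed cells operate on `df_spotify` (the original frame, before the 50x duplication), and the parallel version was slower. The most likely reason is that starting worker processes and copying data to and from them cost more than the feature computation itself.
- Multiprocessing is expected to pay off only when the compute time per task clearly exceeds that overhead, which typically requires a larger dataset or heavier feature functions.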
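The comparison can be made concrete with the two timings printed by the cells above (0.24 s sequential, 1.96 s parallel). The short calculation below only restates those recorded numbers as a ratio; the variable names are illustrative.

```python
t_sequential = 0.24  # seconds, from the sequential cell output
t_parallel = 1.96    # seconds, from the parallel cell output

speedup = t_sequential / t_parallel   # values below 1 mean the parallel run was slower
slowdown = t_parallel / t_sequential

print(f"speedup: {speedup:.2f}x, i.e. the parallel run was {slowdown:.1f}x slower")
# speedup: 0.12x, i.e. the parallel run was 8.2x slower
```

A speedup below 1 on a single run like this points to fixed overhead dominating the work, so it is worth repeating the measurement on the enlarged frame before drawing conclusions.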


## 7. Key Learnings

- **Multiprocessing** can accelerate CPU-bound tasks like feature engineering.
- Ideal when:
  - Feature functions are independent
  - Dataset is large enough to offset parallelization overhead
- **Multithreading** is NOT effective for CPU-bound pure-Python code (due to the GIL).
- Always measure the time gain when switching to parallel approaches.
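One way to make such measurements less noisy is to repeat each run and keep the fastest time. The helper below is a small sketch, not part of the notebook; `best_of` is a hypothetical name, and it uses `time.perf_counter`, a monotonic high-resolution clock better suited to benchmarking than `time.time`.

```python
import time


def best_of(func, *args, repeats: int = 3):
    """Run func(*args) `repeats` times; return (last_result, fastest_seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


# Example: time a cheap built-in call three times and keep the fastest run.
total, seconds = best_of(sum, range(1_000_000))
print(f"sum={total}, best of 3: {seconds:.4f}s")
```

The same helper can wrap both the sequential and the parallel pipeline, so that the two are compared under identical conditions and on the same input frame.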
