Python String Processing Benchmark

This repository contains code to benchmark the performance and memory usage of different string processing frameworks in Python, including:

- Python lists and strings (native Python)
- NumPy arrays
- Pandas Series with object dtype
- Pandas Series with string dtype (Python backend)
- Pandas Series with Arrow string dtype (string[pyarrow])
- Native PyArrow arrays

Overview

The benchmark performs a series of string operations on a dataset of 1,000,000 random strings, each 100 characters long. The operations include:

Modify Operations:
- uppercase: Convert strings to uppercase.
- replace: Replace substrings.
- pad: Pad strings to a certain length.
- strip: Remove leading and trailing whitespace.

Non-modify Operations:
- find_substring: Check for the presence of a substring.
- count_char: Count occurrences of a character.
- startswith: Check if strings start with a specific prefix.
- sort: Sort the array of strings alphabetically.

Complex Operations:
- delete_sequence: Remove strings containing a specific sequence.
- custom_hash: Apply a Python hash function to each string.
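
For orientation, here is a minimal sketch of what the operations above might look like in the native Python baseline. The helper `make_strings`, the alphabet, the substring `"ab"`, the replacement pair, and the pad width of 120 are all assumptions chosen for this example, not values taken from benchmark.py.

```python
import random
import string


def make_strings(n, length, seed=0):
    """Return n random strings of the given length (letters, digits, spaces)."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + " "
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(n)]


# Small sample for illustration; the benchmark uses 1,000,000 strings of 100 characters.
data = make_strings(n=1_000, length=100)

# Modify operations: each builds a new list of strings.
uppercased = [s.upper() for s in data]
replaced = [s.replace("a", "b") for s in data]
padded = [s.ljust(120) for s in data]  # assumed: right-pad with spaces
stripped = [s.strip() for s in data]

# Non-modify operations: each produces booleans, counts, or a reordering.
has_substring = ["ab" in s for s in data]
char_counts = [s.count("a") for s in data]
starts = [s.startswith("a") for s in data]
ordered = sorted(data)

# Complex operations.
kept = [s for s in data if "ab" not in s]  # delete_sequence
hashes = [hash(s) for s in data]  # custom_hash, assumed to be the built-in hash
```

The vectorized frameworks expose similar operations through their own APIs (for example the pandas `.str` accessor), which is the comparison the benchmark is built around.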

The benchmark measures the execution time and memory usage of each operation across the different frameworks.
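
One way to collect both measurements for a single operation is sketched below. It uses only the standard library (`time.perf_counter` and `tracemalloc`) rather than `memory_profiler`, which the requirements list and which the script presumably relies on; the `measure` helper and the sample operation are hypothetical.

```python
import time
import tracemalloc


def measure(func, *args):
    """Run func(*args) and return (result, seconds, peak_mib)."""
    # Timed run without tracing, since tracemalloc adds overhead.
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start

    # Second run with tracing to record the peak of Python-level allocations.
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / (1024 * 1024)


def uppercase(strings):
    """Sample operation: uppercase every string in a list."""
    return [s.upper() for s in strings]


sample = ["hello world"] * 10_000
result, seconds, peak_mib = measure(uppercase, sample)
print(f"uppercase: {seconds:.4f} s, peak {peak_mib:.2f} MiB")
```

Note that `tracemalloc` only sees allocations made through Python's memory allocators, so buffers allocated natively (for instance by Arrow) may not be counted. A process-level measurement avoids that blind spot.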

Requirements

- Python 3.x
- NumPy
- Pandas
- PyArrow
- memory_profiler

You can install the required packages using:

pip install -r requirements.txt

Running the Benchmark

To run the benchmark, execute the benchmark.py script:

python benchmark.py

Note: With 1,000,000 strings of 100 characters each, the benchmark can consume significant memory and CPU time and may take a while to complete.

Understanding the Results

The benchmark outputs three tables:

- Processing Times: Shows the execution time (in seconds) of each operation for each framework.
- Speedup over Python list: Shows how many times faster each framework is than the native Python list implementation for the same operation.
- Memory Usage: Shows the memory consumed (in MB) during each operation for each framework.

Use these results to compare the performance and memory efficiency of different string processing options in Python.
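
To make the speedup table concrete, the sketch below assumes it is computed as the Python list time divided by the framework time for the same operation. The timings are placeholder numbers invented for this example, not measured results, and the `speedups` helper and key names are hypothetical.

```python
# Placeholder timings in seconds (not measurements), keyed by operation then framework.
times = {
    "uppercase": {"python_list": 0.20, "numpy": 0.40, "pyarrow": 0.05},
    "strip": {"python_list": 0.10, "numpy": 0.25, "pyarrow": 0.02},
}


def speedups(times, baseline="python_list"):
    """Return baseline_time / framework_time for every operation and framework."""
    return {
        op: {fw: per_fw[baseline] / t for fw, t in per_fw.items()}
        for op, per_fw in times.items()
    }


table = speedups(times)
for op, row in table.items():
    print(op, {fw: round(x, 2) for fw, x in row.items()})
```

Under this definition a value above 1 means the framework was faster than the Python list baseline, and a value below 1 means it was slower.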

License

This project is licensed under the MIT License.
