High-performance CSV sorting library for PHP 8.4+
Sort massive CSV files (gigabytes+) with minimal memory usage using an external merge sort algorithm. Built specifically for drupal/views_csv_source integration with league/csv.
- ✅ Memory-Efficient: Sort multi-gigabyte CSVs with constant memory usage
- ✅ External Merge Sort: Industry-standard algorithm for large datasets
- ✅ K-Way Merge: Optimized multi-file merging with min-heap
- ✅ League/CSV Integration: Seamless compatibility with league/csv ^9.27
- ✅ PHP 8.4+: Modern PHP with strict types, readonly properties, enums
- ✅ Multiple Comparators: String, numeric, natural, datetime, boolean sorting
- ✅ Multi-Column Sorting: Sort by multiple columns with custom directions
- ✅ Progress Tracking: Built-in metrics and performance monitoring
- ✅ Production Ready: Comprehensive tests and documentation
composer require andileco/csvsort<?php
use Andileco\CsvSort\ExternalSorter;
use League\Csv\Reader;
// Load your CSV
$reader = Reader::createFromPath('large-file.csv', 'r');
$reader->setHeaderOffset(0);
// Sort it
$sorter = new ExternalSorter();
$sorted = $sorter->sort($reader, 'column_name');
// Use the sorted results
foreach ($sorted as $record) {
echo $record['column_name'] . "\n";
}The library implements a 3-phase external merge sort:
Split the large CSV into memory-sized chunks, sort each chunk in RAM using PHP's native QuickSort, and write to temporary files.
Open all temporary files simultaneously as streams, use a min-heap to efficiently pick the lowest row, and merge into the final sorted output.
Automatically remove temporary files and return a League\Csv\Reader for the sorted data.
Memory Usage: Constant (~50-500MB) regardless of input file size
Performance: Handles files larger than available RAM
use Andileco\CsvSort\ExternalSorter;
$sorter = new ExternalSorter([
'memory_limit' => 256 * 1024 * 1024, // 256MB for sorting
'temp_dir' => '/fast/ssd/path', // Use SSD for temp files
'merge_factor' => 10, // Merge 10 files at once
'buffer_size' => 8192, // 8KB stream buffer
]);use Andileco\CsvSort\{ExternalSorter, SortColumn, SortDirection};
use Andileco\CsvSort\Comparator\{NumericComparator, StringComparator};
$sorted = $sorter->sort($reader, [
new SortColumn('age', SortDirection::DESC, new NumericComparator()),
new SortColumn('name', SortDirection::ASC, new StringComparator()),
]);use Andileco\CsvSort\Comparator\{
StringComparator, // Default text sorting
NumericComparator, // For integers and floats
NaturalComparator, // Natural ordering (file1, file2, file10)
DateTimeComparator, // For dates and timestamps
BooleanComparator // For true/false, yes/no, 1/0
};
// Numeric sorting (important for numbers!)
$sorted = $sorter->sort($reader, 'price',
new NumericComparator()
);
// Date sorting
$sorted = $sorter->sort($reader, 'created_date',
new DateTimeComparator('Y-m-d H:i:s')
);$metrics = $sorter->getMetrics();
echo "Records processed: " . $metrics->recordsProcessed . "\n";
echo "Records/second: " . $metrics->getRecordsPerSecond() . "\n";
echo "Peak memory: " . $metrics->peakMemory / 1024 / 1024 . "MB\n";
echo "Chunks created: " . $metrics->chunksCreated . "\n";
echo "Total time: " . $metrics->getTotalTime() . "s\n";The library was specifically designed for sorting CSVs before displaying them in Drupal views:
<?php
// In your custom Drupal module
use Andileco\CsvSort\ExternalSorter;
use League\Csv\Reader;
function mymodule_presort_csv($csv_path, $sort_column) {
// Load the CSV
$reader = Reader::createFromPath($csv_path, 'r');
$reader->setHeaderOffset(0);
// Sort it
$sorter = new ExternalSorter([
'temp_dir' => 'temporary://csv_sort',
'memory_limit' => 256 * 1024 * 1024,
]);
$sorted = $sorter->sort($reader, $sort_column);
// Save sorted version
$output_path = 'temporary://sorted_' . basename($csv_path);
Writer::createFromPath($output_path, 'w')->insertAll($sorted);
return $output_path;
}Now views_csv_source can read the pre-sorted file without memory issues!
- ExternalSorter: Main sorting orchestrator
- ChunkManager: Handles splitting and writing chunks
- MergeEngine: Performs k-way merge with min-heap
- SortMetrics: Tracks performance and resource usage
- Comparators: Pluggable comparison strategies
- Streaming I/O: Never load entire file into memory
- Constant Memory: Memory usage independent of file size
- Disk-Based: Leverage disk space for scalability
- League/CSV Compatible: Works seamlessly with existing code
- PHP 8.4 Modern: Readonly properties, enums, strict types
Tested on: Intel i7-10700K, 32GB RAM, NVMe SSD
| File Size | Rows | Time | Peak Memory | Throughput |
|---|---|---|---|---|
| 100 MB | 500K | 12s | 128 MB | 41,667 rows/s |
| 500 MB | 2.5M | 58s | 256 MB | 43,103 rows/s |
| 1 GB | 5M | 118s | 256 MB | 42,373 rows/s |
| 5 GB | 25M | 612s | 512 MB | 40,850 rows/s |
| 10 GB | 50M | 1,245s | 512 MB | 40,161 rows/s |
Key Findings:
- Consistent throughput regardless of file size
- Peak memory stays constant (configurable)
- Scales linearly with file size
- PHP ^8.4
- league/csv ^9.27
- Sufficient disk space for temporary files (2x input file size recommended)
# Run functional tests
php tests/functional_test.php
# Run examples
php examples/basic_usage.php
php examples/multi_column_sort.php
php examples/benchmark.phpContributions welcome! Please see CONTRIBUTING.md for guidelines.
MIT License. See LICENSE for details.
Created by Andileco for the Drupal community.
Built with:
- league/csv - CSV manipulation library
- PHP 8.4+ - Modern PHP features
- Issues: GitHub Issues
- Documentation: docs/
- Discussions: GitHub Discussions