BigCsvReader

Package bigcsvreader offers a multi-threaded approach to reading a large CSV file, in order to reduce the time spent reading and processing it.
It spawns multiple goroutines, each reading a piece of the file.
The rows that are read are put into channels, one per spawned goroutine, so that the processing of those rows can be parallelized as well.
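The idea described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the package's actual implementation: the helper names splitAtLineBreaks and readParallel are hypothetical, the data is held in memory instead of being read from a file by offsets, and rows are assumed never to contain embedded line breaks.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"sync"
)

// splitAtLineBreaks (hypothetical helper) divides data into at most n
// contiguous byte ranges. Every range ends right after a '\n' (or at the
// end of data), so no row is cut in half.
func splitAtLineBreaks(data []byte, n int) [][2]int {
	var ranges [][2]int
	size, start := len(data), 0
	for i := 1; i <= n && start < size; i++ {
		end := size
		if i < n {
			end = size * i / n
			if end < start {
				end = start
			}
			// Move the boundary forward to just past the next line break.
			if idx := bytes.IndexByte(data[end:], '\n'); idx >= 0 {
				end += idx + 1
			} else {
				end = size
			}
		}
		if end > start {
			ranges = append(ranges, [2]int{start, end})
			start = end
		}
	}
	return ranges
}

// readParallel (hypothetical helper) starts one goroutine per range and
// returns one channel per goroutine. Each channel is closed when its
// piece has been fully read; a parse error simply ends that piece here.
func readParallel(data []byte, n int) []<-chan []string {
	ranges := splitAtLineBreaks(data, n)
	chans := make([]<-chan []string, 0, len(ranges))
	for _, r := range ranges {
		ch := make(chan []string)
		chans = append(chans, ch)
		go func(chunk []byte) {
			defer close(ch)
			rd := csv.NewReader(bytes.NewReader(chunk))
			for {
				row, err := rd.Read()
				if err != nil { // io.EOF or a parse error
					return
				}
				ch <- row
			}
		}(data[r[0]:r[1]])
	}
	return chans
}

func main() {
	data := []byte("1,a\n2,b\n3,c\n4,d\n5,e\n")

	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	// One consumer per channel, so processing runs in parallel too.
	for i, ch := range readParallel(data, 2) {
		wg.Add(1)
		go func(id int, rows <-chan []string) {
			defer wg.Done()
			for row := range rows {
				mu.Lock()
				total++
				mu.Unlock()
				fmt.Println("consumer", id, "got", row)
			}
		}(i, ch)
	}
	wg.Wait()
	fmt.Println("rows:", total)
}
```

The order of the "consumer" lines varies between runs, since each piece is read and consumed independently. For a real file, the same boundaries could be applied with byte offsets, for example through io.NewSectionReader, rather than by loading everything into memory.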

Installation

$ go get github.com/actforgood/bigcsvreader

Example

Please refer to the example in example_test.go.

How it is designed to work

Benchmarks

go version go1.22.1 darwin/amd64
go test -timeout=15m -benchmem -benchtime=2x -bench . 
goos: darwin
goarch: amd64
pkg: github.com/actforgood/bigcsvreader
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Benchmark50000Rows_50Mb_withBigCsvReader-8                                 2    8076491568 ns/op   61744680 B/op   100269 allocs/op
Benchmark50000Rows_50Mb_withStdGoCsvReaderReadAll-8                        2   65237799108 ns/op   67924264 B/op   100043 allocs/op
Benchmark50000Rows_50Mb_withStdGoCsvReaderReadOneByOneAndReuseRecord-8     2   66750849960 ns/op   57606432 B/op    50020 allocs/op
Benchmark50000Rows_50Mb_withStdGoCsvReaderReadOneByOneProcessParalell-8    2    8184433872 ns/op   61607624 B/op   100040 allocs/op

Benchmarks were run against a file of ~50 MB (50,000 rows), with a simulated processing time of 1ms per row.
bigcsvreader was launched with 8 goroutines.
The other benchmarks use Go's encoding/csv package directly.
As you can see, bigcsvreader reads and processes all rows in ~8s.
The Go standard csv package reads and processes all rows in ~65s (sequentially).
Reading with the Go standard csv package and processing the rows in parallel gives a timing comparable to that of bigcsvreader (so this strategy is a good alternative to this package).
The ReadAll API has the disadvantage of keeping all rows in memory.
Reading rows one by one with the ReuseRecord flag set has the advantage of fewer allocations, but has the cost of reading rows sequentially.

Note: It is a coincidence that the timing of the parallelized version was approximately equal to the sequential timing divided by the number of goroutines started. You should not take this as a rule.

Below are some process stats captured with the Unix top command while running each benchmark.

Bench                                                                  %CPU   MEM
Benchmark50000Rows_50Mb_withBigCsvReader                               17.3   9652K
Benchmark50000Rows_50Mb_withStdGoCsvReaderReadAll                      5.8    66M
Benchmark50000Rows_50Mb_withStdGoCsvReaderReadOneByOneAndReuseRecord   11.3   6908K

(!) Known issue: This package does not work as expected with multiline columns (quoted fields that contain line breaks).
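To make the term concrete, the sketch below shows a CSV input in which one quoted field spans two physical lines. The helper name countRecordsAndLines is hypothetical. The standard encoding/csv reader treats such a field as part of a single record, so the number of records differs from the number of line breaks; any approach that divides a file at line breaks would be affected by this, although whether that is the exact cause of the issue in this package is an assumption.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// countRecordsAndLines (hypothetical helper) returns how many CSV records
// encoding/csv finds in data and how many '\n' characters data contains.
func countRecordsAndLines(data string) (records, lines int, err error) {
	all, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return 0, 0, err
	}
	return len(all), strings.Count(data, "\n"), nil
}

func main() {
	// The second field of the first record contains a line break.
	data := "1,\"first line\nsecond line\"\n2,plain\n"

	records, lines, err := countRecordsAndLines(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("records: %d, physical lines: %d\n", records, lines)
	// prints "records: 2, physical lines: 3"
}
```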

License

This package is released under the MIT license. See LICENSE.
