streambyter
Take a byte out of Streams!


Zero-dependency micro-library (3.3 kB)


Why

The primary purpose of streambyter is to efficiently run regular expressions against a large number of files or streams. I created this library out of a need to quickly scan thousands of large JSON files and extract a single piece of text from each. The brute-force approach is to download each JSON or text file, run a regex on it, and return the results. The problems with that approach are:

  • Each file must be downloaded in full (speed, bandwidth, and memory cost)
  • The regex runs against the entire file (speed cost)

This is where streambyter comes in. You can use it to efficiently execute a regex (testing or extracting named groups) against many files, locally or in the cloud. The library doesn't care where a stream comes from, as long as it's a stream.
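
Under the hood the technique is straightforward: read the stream chunk by chunk, run the regex against the text buffered so far, and destroy the stream as soon as a match is found, so the rest of the file is never read or downloaded. Here is a minimal sketch of that idea using only Node.js built-ins (firstMatch is an illustrative name, not part of the streambyter API):

import { createReadStream } from 'fs';

// Illustrative only: scan a file incrementally, stopping at the first match.
function firstMatch(path: string, regex: RegExp): Promise<RegExpExecArray | null> {
  return new Promise((resolve, reject) => {
    const stream = createReadStream(path, { encoding: 'utf8', highWaterMark: 512 });
    let buffered = '';
    stream.on('data', (chunk) => {
      buffered += chunk;
      const match = regex.exec(buffered);
      if (match) {
        stream.destroy(); // close early; the rest of the file is never read
        resolve(match);
      }
    });
    stream.on('end', () => resolve(null)); // reached EOF without a match
    stream.on('error', reject);
  });
}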

Install

$ npm i streambyter

Usage

Local Filesystem

In the example below a file path is provided along with a regex containing named capture groups, and those groups are extracted as a dictionary.

import { regexGroupPathReader } from 'streambyter';

// Assume there is a `file` with contents: '{"foo":"Hello","bar":"World", /* more content */}'

const filePath = '/path/to/some.json';
const regex = /"foo":"(?<foo>.*?)","bar":"(?<bar>.*?)"/;
const result = await regexGroupPathReader({ path: filePath }, regex);

console.log(result); // prints { path: '/path/to/some.json', result: { foo: "Hello", bar: "World" }}

In this example a stream is provided instead of a path. You have to create the stream yourself, but in return you get full control over the stream's options, such as the highWaterMark. Why might you want this? If you know for a fact that the data you want is in the first 100 bytes of the JSON, you can set the highWaterMark to 100, since streambyter closes the stream after the first match. Note that regexGroupPathReader above creates its stream with { highWaterMark: 512 } by default.

import { regexGroupStreamReader } from 'streambyter';
import { createReadStream } from 'fs';

// Assume there is a `file` with contents: '{"foo":"Hello","bar":"World", /* more content */}'

const filePath = '/path/to/some.json';
const regex = /"foo":"(?<foo>.*?)","bar":"(?<bar>.*?)"/;
const result = await regexGroupStreamReader({ path: filePath, stream: createReadStream(filePath, { highWaterMark: 100 }) }, regex);

console.log(result); // prints { path: '/path/to/some.json', result: { foo: "Hello", bar: "World" }}

In this example an array of file paths is provided along with a regex containing named capture groups, and those groups are extracted as an array of dictionaries.

import { regexGroupPathsReader } from 'streambyter';

// Assume each file has contents like: '{"foo":"Hello1","bar":"World1", /* more content */}'
const filePaths = ['/path/to/some1.json', '/path/to/some2.json'];
const regex = /"foo":"(?<foo>.*?)","bar":"(?<bar>.*?)"/;

const objs = filePaths.map((p) => ({ path: p }));
const results = await regexGroupPathsReader(objs, regex);

console.log(results); // prints [{ path: '/path/to/some1.json', result: { foo: "Hello1", bar: "World1" }}, { path: '/path/to/some2.json', result: { foo: "Hello2", bar: "World2" }}]

In this example an array of streams is provided along with a regex containing named capture groups, and those groups are extracted as an array of dictionaries.

import { regexGroupStreamsReader } from 'streambyter';
import { createReadStream } from 'fs';

// Assume each file has contents like: '{"foo":"Hello1","bar":"World1", /* more content */}'
const filePaths = ['/path/to/some1.json', '/path/to/some2.json'];
const regex = /"foo":"(?<foo>.*?)","bar":"(?<bar>.*?)"/;

const objs = filePaths.map((p) => ({ path: p, stream: createReadStream(p, { highWaterMark: 100 }) }));
const results = await regexGroupStreamsReader(objs, regex);

console.log(results); // prints [{ path: '/path/to/some1.json', result: { foo: "Hello1", bar: "World1" }}, { path: '/path/to/some2.json', result: { foo: "Hello2", bar: "World2" }}]

Cloud

When dealing with the cloud, the SDK you are using should be able to return a stream. For example, with the Azure Storage SDK, await blockBlobClient.download(0) returns a response whose readableStreamBody is a stream over the blob's contents. This is efficient because nothing is actually downloaded up front; you only get a stream that can be read and closed as desired.

Let's say we want to list the blobs in an Azure Blob Storage container and replicate one of the examples above.

import { BlobServiceClient, ContainerClient, StorageSharedKeyCredential } from '@azure/storage-blob';
import { regexGroupStreamsReader } from 'streambyter';

async function downloadBlobAsStream(containerClient: ContainerClient, blobName: string): Promise<NodeJS.ReadableStream> {
  const blockBlobClient = containerClient.getBlockBlobClient(blobName);
  const downloadBlockBlobResponse = await blockBlobClient.download(0);
  // readableStreamBody is only defined in Node.js (it is undefined in the browser)
  return downloadBlockBlobResponse.readableStreamBody!;
}

const account = 'someaccountname';
const sharedKeyCredential = new StorageSharedKeyCredential(account, '...');

const client = new BlobServiceClient(`https://${account}.blob.core.windows.net`, sharedKeyCredential);
const containerClient = client.getContainerClient('somecontainer');

// Assume there is a list of blobs in the root of the container
const blobs = containerClient.listBlobsByHierarchy('/');

const objs: { name: string; stream: NodeJS.ReadableStream }[] = [];
// Iterate each blob returned and construct an array of objects containing the stream reference for each blob
for await (const blob of blobs) {
  objs.push({ name: blob.name, stream: await downloadBlobAsStream(containerClient, blob.name) });
}

const regex = /"foo":"(?<foo>.*?)","bar":"(?<bar>.*?)"/;
const results = await regexGroupStreamsReader(objs, regex);

console.log(results); // prints [{ name: 'blob1.json', result: { foo: "Hello1", bar: "World1" }}, { name: 'blob2.json', result: { foo: "Hello2", bar: "World2" }}]

You'll notice that objects are passed in instead of just a path or a stream alone. Why? So you can map each result back to its source. For example, await regexGroupPathsReader([{ path: '/path/to/a.txt' }, { path: '/path/to/b.txt' }], regex) might return: [{ path: '/path/to/a.txt', result: { someMatch: '1' }}, { path: '/path/to/b.txt', result: { someMatch: '2' }}]
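
Because each input object is echoed back with its result attached, mapping results to their sources is straightforward. A minimal sketch, assuming the output shape described above (the paths and regex here are hypothetical):

import { regexGroupPathsReader } from 'streambyter';

const results = await regexGroupPathsReader(
  [{ path: '/path/to/a.txt' }, { path: '/path/to/b.txt' }],
  /(?<someMatch>\d+)/,
);

// Build a lookup from source path to its extracted groups.
const byPath = new Map();
for (const { path, result } of results) {
  byPath.set(path, result);
}
console.log(byPath.get('/path/to/a.txt')); // e.g. { someMatch: '1' }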

See the *.spec.ts files in the ./test directory for a great reference on using the library.

Note that the library is built with rollup.js, targets CommonJS, and is intended for use with Node.js.
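
In practice that means the package can also be consumed with plain require:

// CommonJS consumption of the library
const { regexGroupPathReader } = require('streambyter');

regexGroupPathReader({ path: '/path/to/some.json' }, /"foo":"(?<foo>.*?)"/)
  .then((result) => console.log(result));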

Testing

npm run test

30 passing (3s)

----------|---------|----------|---------|---------|-------------------
File      | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s
----------|---------|----------|---------|---------|-------------------
All files |     100 |      100 |     100 |     100 |
 index.ts |     100 |      100 |     100 |     100 |
----------|---------|----------|---------|---------|-------------------

Contributing

  • npm i
  • make code changes
  • npm run test
  • npm run lint
  • npm run build

Publishing

  • Bump the package.json version
  • npm publish --access public
  • git tag vx.y.z
  • git push origin --tags

License

MIT