This repo illustrates how to stream a large file from S3 and split it into separate S3 files, removing the prior output files first. The demo covers how to:
- Parse a large file without loading the whole file into memory
- Remove old data when new data arrives
- Wait for all of the secondary streams to finish uploading to S3
  - Writing to S3 is slow; you must ensure you wait until the S3 upload is complete
  - We can't start writing to S3 until all of the old files are deleted
  - We don't know how many output files will be created, so we must wait until the input file has finished processing before waiting for the outputs to finish
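That last point can be sketched with plain promises: register each upload's promise as it starts, and only await the full set once the input has ended. This is an illustrative stand-in, not the repo's actual code — a real version would push the promise returned by `s3.upload(...).promise()` instead of a timer.

```javascript
// Collect one promise per output file as each upload begins. The array's
// final length is only known after the input finishes, so Promise.all can
// only be called at the very end.
const uploadPromises = [];

function startUpload(key) {
  // Stand-in for s3.upload({ Key: key, Body: passThruStream }).promise()
  const promise = new Promise((resolve) =>
    setTimeout(() => resolve(key), 10)
  );
  uploadPromises.push(promise);
  return promise;
}

// Called once the input stream has ended and every output stream is closed.
function waitForAllUploads() {
  return Promise.all(uploadPromises);
}
```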
- A school district central computer uploads all the grades for the district for a semester
- The data file has the following headers:
  `School,Semester,Grade,Subject,Class,Student Name,Score`
- Process the uploaded file, splitting it into the following folder structure:
  - `Semester/School/Grade`
  - Create a file called `Subject-Class.csv` with all the grades for that class
- For this simulation, the central computer can update an entire Semester by uploading a new file. This could work differently in another application: for instance, if the central computer could upload the grades for a specific Semester + School, then the delete criteria could be revised to only clear that block of data
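The key layout above can be sketched as a pure function, assuming each row is parsed into an object keyed by the CSV headers (the function names here are illustrative, not the repo's actual API):

```javascript
// Build the output key for one CSV row, following the layout
// Semester/School/Grade/Subject-Class.csv
function outputKeyForRow(row) {
  const { Semester, School, Grade, Subject, Class: className } = row;
  return `${Semester}/${School}/${Grade}/${Subject}-${className}.csv`;
}

// The prefix that gets cleared when a new file for a Semester arrives
function semesterPrefix(row) {
  return `${row.Semester}/`;
}
```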
Here's the general outline of the demo program flow:
- Open the S3 file as a stream (`readStream`)
- Create a `csvStream` from the input `readStream`
- Pipe `readStream` to `csvStream`
- While we have new lines:
  - Is this line for a new school (i.e. a new CSV file)?
    - Start a PassThrough stream (`passThruStream`)
    - Does this line start a new Semester (the top-level folder we're replacing) in S3?
      - Start deleting the S3 folder
    - Once all the old files are deleted:
      - Use `s3.upload` with `Body: passThruStream` to upload the file
  - Write the line to the `passThruStream`
- Loop through all the `passThruStream` streams and close/end them
- Wait for all the `passThruStream` streams to finish writing to S3
Set the bucket name first: `BUCKET=(your s3 bucket name)`
- `yarn build:test`: Build fake CSV data in `fixtures/`
- `yarn test`: Run a local test, outputting files to `/tmp/output` instead of S3
- `yarn deploy:dev`: Run `serverless deploy` (stage=dev) to deploy the function to AWS Lambda
- `yarn deploy:prod`: Run `serverless deploy --stage prod` to deploy the function to AWS Lambda
- `yarn logs:dev`: Pull the AWS CloudWatch logs for the latest stage=dev run
- `yarn logs:prod`: Pull the AWS CloudWatch logs for the latest stage=prod run
- `yarn upload:tiny:dev`: Upload `fixtures/master-data-tiny.csv` to S3 `${BUCKET}/dev/uploads`
- `yarn upload:small:dev`: Upload `fixtures/master-data-small.csv` to S3 `${BUCKET}/dev/uploads`
- `yarn upload:medium:dev`: Upload `fixtures/master-data-medium.csv` to S3 `${BUCKET}/dev/uploads`
- `yarn upload:large:dev`: Upload `fixtures/master-data-large.csv` to S3 `${BUCKET}/dev/uploads`
- `yarn upload:tiny:prod`: Upload `fixtures/master-data-tiny.csv` to S3 `${BUCKET}/prod/uploads`
- `yarn upload:small:prod`: Upload `fixtures/master-data-small.csv` to S3 `${BUCKET}/prod/uploads`
- `yarn upload:medium:prod`: Upload `fixtures/master-data-medium.csv` to S3 `${BUCKET}/prod/uploads`
- `yarn upload:large:prod`: Upload `fixtures/master-data-large.csv` to S3 `${BUCKET}/prod/uploads`
The following commands will download the processed files from S3 and run the same validations as `yarn test`:
- NOTE: This assumes you've already run `yarn upload:small`
```shell
mkdir -p /tmp/s3files
aws s3 cp s3://${BUCKET}/dev/processed /tmp/s3files --recursive
ts-node test/fileValidators.ts fixtures/master-data-small.csv /tmp/s3files/
```