Lambda CSV Processor

This project implements an AWS Lambda function that processes a CSV file uploaded to an S3 bucket and updates a customer master database table. The implementation uses Java 11, Spring Boot, AWS SDK v2, and OpenCSV for parsing.

Project Structure

  • src/main/java/com/example/ - Main application code
    • config/ - AWS and application configuration
    • function/ - Lambda function handler for S3 events
    • model/ - JPA entity classes
    • repository/ - Spring Data JPA repositories
    • service/ - Business logic for CSV processing
  • template.yaml - AWS SAM template for deployment

Features

  • Streaming CSV processing for memory efficiency with large files (300,000+ rows); see the sketch after this list
  • Batch inserts for performance (configurable batch size, 10,000 by default)
  • Transaction management for atomicity
  • Comprehensive error handling and logging
  • Custom Lambda handler implementation for AWS Lambda compatibility
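The streaming-plus-batching loop lives in the service/ package. A minimal sketch of its shape, assuming hypothetical Customer and CustomerMasterRepository names (the project's actual classes may differ):

import com.opencsv.CSVReader;
import org.springframework.stereotype.Service;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

@Service
public class CsvProcessingService {

    private static final int BATCH_SIZE = 10_000;

    private final S3Client s3Client = S3Client.create();
    private final CustomerMasterRepository repository; // assumed name

    public CsvProcessingService(CustomerMasterRepository repository) {
        this.repository = repository;
    }

    public void processCsvFile(String bucket, String key) throws Exception {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .build();
        // Stream the object body so a 300,000-row file never has to
        // fit in memory all at once.
        try (CSVReader reader = new CSVReader(new InputStreamReader(
                s3Client.getObject(request), StandardCharsets.UTF_8))) {
            List<Customer> batch = new ArrayList<>();
            String[] row;
            while ((row = reader.readNext()) != null) {
                batch.add(new Customer(row[0], row[1]));
                if (batch.size() >= BATCH_SIZE) {
                    repository.saveAll(batch); // one batched round trip
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                repository.saveAll(batch); // remaining partial batch
            }
        }
    }
}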

Prerequisites

  • Java 11
  • Maven
  • AWS SAM CLI
  • AWS CLI
  • AWS Account with appropriate permissions

Set Up the AWS CLI

Run aws configure and enter your access key, secret key, default region, and output format when prompted. Example:
aws configure
AWS Access Key ID [None]: YOUR_ACCESS_KEY
AWS Secret Access Key [None]: YOUR_SECRET_KEY
Default region name [None]: ap-northeast-1
Default output format [None]: json

Build & Deploy

1. Build the Lambda Function

mvn clean package

This creates a fat JAR at target/lambda-csv-processor-1.0.0.jar.

2. Deploy to AWS using SAM

sam deploy --guided

During the guided deployment, you'll be prompted for parameters including:

  • Stack Name - Choose a name for your CloudFormation stack
  • AWS Region - The AWS region to deploy to
  • Parameter BucketName - Name of the S3 bucket to use (existing or to be created)
  • Parameter DatabaseUrl - JDBC URL for your PostgreSQL database
  • Parameter DatabaseUsername - Username for database access
  • Parameter DatabasePassword - Password for database access (this will be securely handled and not displayed)
  • Confirm changes before deploy - Recommended to set to "Y" to review changes
  • Allow SAM CLI IAM role creation - Required so SAM can create the IAM roles the Lambda function needs
  • Disable rollback - Whether to disable rollback if errors occur

This approach keeps your database credentials secure by avoiding hardcoded values in your source code or template files.
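
At the end of the guided deploy, SAM offers to save your answers to samconfig.toml so later deployments can run as plain sam deploy. The file looks roughly like this (all values are illustrative; keep secrets such as the database password out of any file you commit):

version = 0.1
[default.deploy.parameters]
stack_name = "sam-customers-csv"
region = "ap-northeast-1"
confirm_changeset = true
capabilities = "CAPABILITY_IAM"
parameter_overrides = "BucketName=\"my-csv-bucket\" DatabaseUrl=\"jdbc:postgresql://HOST:5432/DB\" DatabaseUsername=\"USER\""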

3. Lambda Configuration

The SAM template automatically configures the Lambda function with the parameters you provide during deployment. The environment variables are set as follows:

Environment:
  Variables:
    SPRING_PROFILES_ACTIVE: default
    SPRING_DATASOURCE_URL: !Ref DatabaseUrl
    SPRING_DATASOURCE_USERNAME: !Ref DatabaseUsername
    SPRING_DATASOURCE_PASSWORD: !Ref DatabasePassword
    CUSTOM_REGION: !Ref AwsRegion
    SPRING_CLOUD_FUNCTION_DEFINITION: processCsvFile

The parameter values are not echoed during deployment. Note, however, that Lambda environment variables are visible in the function's configuration to anyone with read access to it, so restrict that access accordingly.
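
Spring Boot's relaxed binding maps these environment variables onto the usual configuration keys, so no extra wiring is needed; SPRING_DATASOURCE_URL, for example, overrides spring.datasource.url. The equivalent application.properties entries would be (HOST, DB, USER, and SECRET are placeholders):

spring.datasource.url=jdbc:postgresql://HOST:5432/DB
spring.datasource.username=USER
spring.datasource.password=SECRET
spring.cloud.function.definition=processCsvFile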

4. Configure S3 Bucket Notifications

For Existing Buckets

If you're using an existing bucket, CloudFormation cannot directly configure S3 event notifications on it. After deploying the stack, you need to manually configure the bucket to trigger the Lambda function when CSV files are uploaded:

Using AWS CLI:

aws s3api put-bucket-notification-configuration \
  --bucket YOUR_BUCKET_NAME \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [
      {
        "LambdaFunctionArn": "YOUR_LAMBDA_FUNCTION_ARN",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {
          "Key": {
            "FilterRules": [
              {
                "Name": "suffix",
                "Value": ".csv"
              }
            ]
          }
        }
      }
    ]
  }'

Example:

aws s3api put-bucket-notification-configuration \
  --bucket my-csv-bucket-6-6 \
  --notification-configuration '{"LambdaFunctionConfigurations":[{"LambdaFunctionArn":"arn:aws:lambda:ap-northeast-1:446556758604:function:sam-csv-test-CsvProcessorFunction-QeEHqTHO9oIn","Events":["s3:ObjectCreated:*"],"Filter":{"Key":{"FilterRules":[{"Name":"suffix","Value":".csv"}]}}}]}'

Replace:

  • YOUR_BUCKET_NAME with the name of your existing S3 bucket
  • YOUR_LAMBDA_FUNCTION_ARN with the ARN of the deployed Lambda function (available in the CloudFormation stack outputs)
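
Note that S3 can only invoke a Lambda function that has granted it permission. SAM usually creates this resource-based permission automatically for buckets declared in the template; for a pre-existing bucket, you may need to add it yourself before the notification configuration takes effect:

aws lambda add-permission \
  --function-name YOUR_LAMBDA_FUNCTION_ARN \
  --statement-id s3-invoke \
  --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::YOUR_BUCKET_NAME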

Using AWS Management Console:

  1. Navigate to Amazon S3 in the AWS Console
  2. Select your bucket and go to "Properties"
  3. Scroll down to "Event notifications" and click "Create event notification"
  4. Configure the event:
    • Enter a name for the event
    • Select "All object create events" under "Event types"
    • Under "Destination", select "Lambda function"
    • Choose your deployed Lambda function
    • Under "Filter", enter ".csv" as a suffix
  5. Click "Save changes"

5. Test the Lambda Function in AWS

After deployment and notification configuration, you can test the function by uploading a CSV file to the configured S3 bucket. The Lambda function will be triggered automatically.
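
For example, assuming a local file named sample.csv:

aws s3 cp sample.csv s3://YOUR_BUCKET_NAME/

Processing output then appears in the function's CloudWatch logs (see Monitoring and Logs below).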

Performance Considerations

  • The application processes CSV files in streaming mode to handle large files
  • Batch inserts (10,000 records per batch) reduce database round trips
  • Hibernate is configured for JDBC batching; see the example after this list
  • For very large files, the Lambda function's memory allocation can be increased in the SAM template
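
The Hibernate batching mentioned above is typically enabled with properties along these lines (these are the standard Spring/Hibernate keys; the exact values in this project may differ):

spring.jpa.properties.hibernate.jdbc.batch_size=10000
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true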

AWS Lambda Configuration

Important configurations in the Lambda function:

  1. Custom Handler: A custom handler class, com.example.function.LambdaHandler, processes S3 events (see the sketch after this list)

  2. Memory Allocation: Set to 1024 MB by default; this may need to be increased for large CSV files

  3. Timeout: Set to 15 minutes (900 seconds) to ensure enough time for processing large files

  4. IAM Roles: The Lambda function needs permissions to:

    • Read from S3
    • Write to CloudWatch logs
    • Connect to your RDS database
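
A minimal sketch of such a handler, assuming a hypothetical Application entry class and the CsvProcessingService bean sketched earlier (names are illustrative, not the project's actual ones):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import org.springframework.boot.SpringApplication;
import org.springframework.context.ConfigurableApplicationContext;

public class LambdaHandler implements RequestHandler<S3Event, String> {

    // Static initialization runs once per Lambda container, so warm
    // invocations reuse the same Spring context.
    private static final ConfigurableApplicationContext CONTEXT =
            SpringApplication.run(Application.class);

    @Override
    public String handleRequest(S3Event event, Context context) {
        CsvProcessingService service = CONTEXT.getBean(CsvProcessingService.class);
        event.getRecords().forEach(record -> {
            // Note: object keys arrive URL-encoded in S3 events.
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey();
            try {
                service.processCsvFile(bucket, key);
            } catch (Exception e) {
                // Rethrow so the invocation fails visibly in CloudWatch.
                throw new RuntimeException(
                        "Failed to process s3://" + bucket + "/" + key, e);
            }
        });
        return "OK";
    }
}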

Monitoring and Logs

To monitor your Lambda function's performance and debug issues:

# Tail the function's logs using the SAM CLI (substitute your stack name)
sam logs -n CsvProcessorFunction --stack-name YOUR_STACK_NAME --tail

Best Practices

  • IAM Roles: Use the principle of least privilege for Lambda IAM roles
  • Database Connection: Use connection pooling and consider database proxies for high-volume scenarios
  • Error Handling: Implement comprehensive error handling and notifications
  • Monitoring: Set up CloudWatch alarms for errors and performance thresholds
  • Security: Ensure all sensitive data in environment variables is encrypted
  • Cost Optimization: Review Lambda execution time and memory usage to optimize costs

Additional Configuration

Database Schema

The Lambda function expects a PostgreSQL database with the following schema:

CREATE TABLE customer_master (
  customer_code VARCHAR(4) PRIMARY KEY,
  customer_name VARCHAR(50) NOT NULL
);
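
The JPA entity in model/ presumably maps this table along these lines (a sketch; class and field names are assumptions):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
@Table(name = "customer_master")
public class Customer {

    @Id
    @Column(name = "customer_code", length = 4)
    private String customerCode;

    @Column(name = "customer_name", length = 50, nullable = false)
    private String customerName;

    protected Customer() {
        // No-arg constructor required by JPA
    }

    public Customer(String customerCode, String customerName) {
        this.customerCode = customerCode;
        this.customerName = customerName;
    }

    // Getters and setters omitted for brevity
}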

IAM Permissions

The Lambda function uses these AWS managed policies in the SAM template:

  • AmazonS3ReadOnlyAccess - Provides read access to S3 buckets for CSV file processing
  • AmazonRDSFullAccess - Grants broad RDS access for the customer data updates; per the least-privilege practice above, consider scoping this down

These managed policies are applied in the template.yaml file.

Additional Notes

  • The application handles large CSV files (300,000+ rows) by streaming and batch processing
  • Batch processing occurs every 10,000 records for optimal performance
  • Transactions ensure all-or-nothing updates to maintain data integrity
  • Error handling captures and logs issues with CSV parsing or database operations
  • The custom Lambda handler initializes the Spring context once per container lifecycle
