Apache Flink connector for HTTP-based lookups with comprehensive caching support, enabling efficient data enrichment in streaming applications.
- 🚀 High Performance: Full cache loading with configurable refresh intervals
- 🔄 Automatic Refresh: Configurable cache refresh strategies
- 🛡️ Fault Tolerant: Built-in retry mechanisms and error handling
- 🎯 Easy Integration: Simple SQL DDL configuration
- ⚡ Low Latency: In-memory caching for sub-millisecond lookups
<dependency>
    <groupId>com.datanutshell.flink</groupId>
    <artifactId>flink-http-lookup-connector</artifactId>
    <version>1.0.0</version>
</dependency>
implementation 'com.datanutshell.flink:flink-http-lookup-connector:1.0.0'
Create a lookup table using SQL DDL:
CREATE TABLE user_lookup (
id INT,
name STRING,
email STRING,
PRIMARY KEY (id) NOT ENFORCED
) WITH (
'connector' = 'http-lookup-full-cache',
'url' = 'https://api.example.com/users',
'cache.refresh-interval' = 'PT10M',
'method' = 'GET'
);
Use it in a lookup join:
SELECT
e.user_id,
e.event_type,
u.name,
u.email
FROM user_events e
LEFT JOIN user_lookup FOR SYSTEM_TIME AS OF e.proc_time AS u
ON e.user_id = u.id;
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `connector` | String | Yes | - | Must be `http-lookup-full-cache` |
| `url` | String | Yes | - | HTTP endpoint URL |
| `method` | String | No | `GET` | HTTP method (GET, POST, etc.) |
| `cache.refresh-interval` | Duration | No | `PT1H` | Cache refresh interval (ISO-8601 duration) |
| `xpath` | String | No | (empty) | XPath expression for data extraction |
| `connect.timeout.seconds` | Integer | No | `10` | Connection timeout in seconds |
| `read.timeout.seconds` | Integer | No | `30` | Read timeout in seconds |
| `max.retries` | Integer | No | `3` | Maximum number of retries |
| `retry.delay.ms` | Long | No | `1000` | Delay between retries in milliseconds |
CREATE TABLE user_lookup (
id INT,
name STRING,
username STRING,
email STRING,
phone STRING,
website STRING,
PRIMARY KEY (id) NOT ENFORCED
) WITH (
'connector' = 'http-lookup-full-cache',
'url' = 'https://jsonplaceholder.typicode.com/users',
'cache.refresh-interval' = 'PT10M',
'method' = 'GET',
'connect.timeout.seconds' = '10',
'read.timeout.seconds' = '30',
'max.retries' = '3',
'retry.delay.ms' = '1000'
);
-- Source table with events
CREATE TABLE user_events (
user_id INT,
event_type STRING,
event_time TIMESTAMP(3),
proc_time AS PROCTIME()
) WITH (
'connector' = 'kafka',
'topic' = 'user-events',
'properties.bootstrap.servers' = 'localhost:9092',
'format' = 'json'
);
-- Enriched output
INSERT INTO enriched_events
SELECT
e.user_id,
e.event_type,
e.event_time,
u.name,
u.email
FROM user_events e
LEFT JOIN user_lookup FOR SYSTEM_TIME AS OF e.proc_time AS u
ON e.user_id = u.id;
The connector implements a full-cache strategy where:
- Initial Load: All data is loaded from the HTTP endpoint at startup
- Periodic Refresh: Cache is refreshed at configurable intervals
- In-Memory Storage: Data is stored in memory for fast lookups
- Fault Tolerance: Automatic retries and error handling
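The full-cache strategy above can be sketched in plain Java to show the moving parts. This is a minimal, hypothetical illustration, not the connector's actual implementation: `FullCacheSketch`, `fetchAll`, and `lookup` are made-up names, and the stand-in `Supplier` replaces the real HTTP call.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Sketch of the full-cache pattern: load everything once, then atomically
// swap in a fresh snapshot on each periodic refresh. Lookups never do I/O.
class FullCacheSketch<K, V> {
    private final AtomicReference<Map<K, V>> cache = new AtomicReference<>(Map.of());
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r);
                t.setDaemon(true); // don't keep the JVM alive for the refresh thread
                return t;
            });

    FullCacheSketch(Supplier<Map<K, V>> fetchAll, long refreshSeconds) {
        cache.set(fetchAll.get()); // initial load at startup
        scheduler.scheduleAtFixedRate(
                () -> cache.set(fetchAll.get()), // periodic refresh replaces the whole map
                refreshSeconds, refreshSeconds, TimeUnit.SECONDS);
    }

    V lookup(K key) {
        return cache.get().get(key); // in-memory read on the hot path
    }

    public static void main(String[] args) {
        FullCacheSketch<Integer, String> users =
                new FullCacheSketch<>(() -> Map.of(1, "alice", 2, "bob"), 600);
        System.out.println(users.lookup(1)); // alice
    }
}
```

Swapping an immutable snapshot via `AtomicReference` keeps lookups lock-free while a refresh is in flight.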
┌─────────────────┐    HTTP Request    ┌─────────────────┐
│                 │ ──────────────────▶│                 │
│    Flink Job    │                    │  HTTP Endpoint  │
│                 │ ◀──────────────────│                 │
└─────────────────┘    JSON Response   └─────────────────┘
         │                                      ▲
         ▼                                      │
┌─────────────────┐                             │
│    In-Memory    │                             │
│      Cache      │                             │
│                 │                             │
└─────────────────┘                             │
         │                    Periodic Refresh  │
         ▼                                      │
┌─────────────────┐                             │
│   Lookup Join   │ ────────────────────────────┘
│    Operation    │
└─────────────────┘
- Java 11 or higher
- Apache Flink 1.17+
- Gradle 8.0+
# Clone the repository
git clone https://github.com/dataengnutshell/flink-http-full-cache-connector.git
cd flink-http-full-cache-connector
# Build the project
./gradlew build
# Run tests
./gradlew test
# Run integration tests
./gradlew integrationTest
cd example
./gradlew run
This will start a Flink job that demonstrates the HTTP lookup connector using the JSONPlaceholder API.
The connector provides comprehensive metrics for monitoring:
- Cache Hit Rate: Percentage of successful cache lookups
- Cache Refresh Duration: Time taken to refresh the cache
- HTTP Request Metrics: Success/failure rates, response times
- Error Rates: Retry attempts and failure counts
Access these metrics through Flink's metrics system and your monitoring infrastructure.
- Monitor memory usage when caching large datasets
- Consider the trade-off between refresh frequency and data freshness
- Use appropriate JVM heap sizing for your cache requirements
- Set appropriate timeout values for your network conditions
- Configure retry strategies based on endpoint reliability
- Consider using connection pooling for high-throughput scenarios
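To make the `max.retries` / `retry.delay.ms` semantics concrete, here is a hedged sketch of a bounded retry loop with a fixed delay. This is illustrative only: `RetrySketch` and `withRetries` are invented names, and the connector's internal retry logic may differ (e.g. it could use backoff).

```java
import java.util.concurrent.Callable;

class RetrySketch {
    // Attempt the call once, then up to maxRetries more times,
    // sleeping delayMs between attempts; rethrow the last failure.
    static <T> T withRetries(Callable<T> call, int maxRetries, long delayMs) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxRetries) Thread.sleep(delayMs);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int value = withRetries(() -> 42, 3, 1000);
        System.out.println(value); // 42
    }
}
```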
-- Frequent updates for critical data
'cache.refresh-interval' = 'PT1M' -- Every minute
-- Balanced approach for most use cases
'cache.refresh-interval' = 'PT10M' -- Every 10 minutes
-- Infrequent updates for static data
'cache.refresh-interval' = 'PT1H' -- Every hour
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
This project uses Scalafmt for code formatting:
./gradlew scalafmtAll
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- 📖 Documentation: Check the docs directory
- 🐛 Issues: Report bugs on GitHub Issues
- 💬 Discussions: Join the conversation in GitHub Discussions
Data Nutshell