The main value Kafka provides to data pipelines is its ability to serve as a very large, reliable buffer between pipeline stages, effectively decoupling the producers and consumers of data within the pipeline.

## Considerations When Building Data Pipelines

### Timeliness

- Good data integration systems support different timeliness requirements for different pipelines, and also make it easier to migrate between timetables as business requirements change
- Kafka, as a streaming data platform with scalable and reliable storage, can support anything from near-real-time pipelines to hourly batches
    - Producers can write to Kafka as frequently or infrequently as needed
    - Consumers can read and deliver the latest events as they arrive, or work in batches
    - Kafka can act as a giant buffer that decouples the time-sensitivity requirements between producers and consumers
    - Kafka itself applies back-pressure on producers by delaying acks when needed, since consumption rate is driven entirely by the consumers
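
The buffering role described above can be illustrated with a toy in-memory log. This is only a sketch under stated assumptions: `TopicLog` and `Reader` are hypothetical names invented for this example, not Kafka client classes, and a plain Python list stands in for a single partition. It shows one reader consuming each event as it arrives while another picks up everything in a single batch later, each tracking its own offset.

```python
class TopicLog:
    """Append-only list standing in for one Kafka partition (illustration only)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Producers only append; they never wait for any reader.
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records):
        # Readers fetch from their own position, at their own pace.
        return self._records[offset:offset + max_records]


class Reader:
    """Hypothetical consumer that remembers how far it has read."""

    def __init__(self, log, max_records):
        self.log = log
        self.max_records = max_records
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset, self.max_records)
        self.offset += len(batch)
        return batch


log = TopicLog()
streaming_reader = Reader(log, max_records=1)   # near-real-time consumer
batch_reader = Reader(log, max_records=100)     # periodic batch consumer

for i in range(5):
    log.append(f"event-{i}")
    streaming_reader.poll()                     # keeps up event by event

hourly_batch = batch_reader.poll()              # reads everything at once, later
```

Because each reader keeps its own offset, neither one affects how quickly the other (or the producer) proceeds.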

### Reliability

- It is desirable to avoid single points of failure and allow for fast and automatic recovery for all sorts of failure events
- Data pipelines are often the way data arrives at business-critical systems, so a failure lasting more than a short period can be disruptive
- Delivery guarantees also matter: most of the time there is a requirement for *at-least-once* delivery semantics, and in some cases even an *exactly-once* requirement
    - Kafka can provide at-least-once on its own, and exactly-once when combined with an external data store that has a transactional model or unique keys
    - Kafka's Connect APIs make it simple for connectors to build an end-to-end exactly-once pipeline by providing APIs for integrating with the external systems when handling offsets
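
As a rough illustration of the "unique keys" route to exactly-once, the sketch below writes a batch twice, simulating a redelivery under at-least-once semantics, into a store keyed by topic, partition, and offset. It uses SQLite from the Python standard library purely as a stand-in for an external data store; the table layout and the `write_batch` helper are assumptions made for this example, not part of Kafka or Connect.

```python
import sqlite3


def write_batch(conn, records):
    """Upsert records keyed by (topic, partition, offset) in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO events (topic, part, off, payload) "
            "VALUES (?, ?, ?, ?)",
            records,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "topic TEXT, part INTEGER, off INTEGER, payload TEXT, "
    "PRIMARY KEY (topic, part, off))"
)

batch = [("orders", 0, 0, "first"), ("orders", 0, 1, "second")]
write_batch(conn, batch)
write_batch(conn, batch)  # same records redelivered, e.g. after a crash

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The second write overwrites the same rows instead of adding new ones, so the redelivery leaves the store unchanged.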
    
### High and Varying Throughput

- Modern data systems are often required to scale to very high throughput and to adapt if throughput suddenly increases
- Kafka, as a high-throughput distributed system, helps decouple the throughput of the consumers from that of the producers, so that they can scale independently of each other
- The Kafka Connect API focuses on parallelizing the work and allows data sources and sinks to split it between multiple threads of execution

### Data Formats

- Reconciling different data formats and data types
    - One may be loading relational data into Kafka, using Avro within Kafka, and then need to convert the data to JSON when writing to Elasticsearch, to Parquet when writing to HDFS, to CSV when writing to S3, etc.


- Kafka and Kafka Connect API are completely agnostic about data formats
    - Producers and consumers can use any serializer to represent data in any format


- Many data sources and sinks have a schema; we can read the schema from the source along with the data, store it, and use it to validate compatibility or even update the schema in the sinks


- A generic data integration framework should also handle differences in behavior between various sources and sinks, e.g., Syslog is a source that pushes data, whereas relational databases require the framework to pull data out

### Transformations

- ETL vs ELT: the choice depends on where transformations should happen, in the data pipeline or in the target storage system


- *Extract-Transform-Load*: the data pipeline is responsible for making modifications to the data as it passes through
    - Benefit: saves time and storage
    - Drawback: transformations applied to the data in the pipeline limit what downstream consumers can do with it


- *Extract-Load-Transform*: the data pipeline does minimal transformation (mostly around data type conversion), with the goal of making sure the data that arrives at the target is as similar as possible to the source data
    - Also called high-fidelity pipelines or data-lake architecture
    - The target system collects "raw data" and all required processing is done at the target system, providing maximum flexibility and simpler troubleshooting
    

### Security

- Kafka allows encrypting data on the wire, as it is piped from sources to Kafka and from Kafka to sinks
- Kafka supports authentication (via SASL) and authorization 
- Kafka provides an audit log to track access

### Failure Handling

- Do not assume perfect data, plan for failure handling
    - Try to prevent faulty records from ever making it into the pipeline
    - Try to recover from records that cannot be parsed
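
One common way to act on both points is to validate records at the edge of the pipeline and set unparseable ones aside instead of failing the whole batch. The following is a minimal sketch, assuming JSON-encoded input lines; the `split_valid_and_faulty` helper and the shape of the set-aside entries are hypothetical, chosen only for illustration.

```python
import json


def split_valid_and_faulty(raw_lines):
    """Parse each line; keep good records, set aside the ones that fail."""
    valid, set_aside = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            # Keep the raw input and the reason so it can be inspected later.
            set_aside.append({"raw": line, "error": str(err)})
            continue
        valid.append(record)
    return valid, set_aside


valid, set_aside = split_valid_and_faulty(['{"id": 1}', "not-json", '{"id": 2}'])
```

Here the two well-formed records continue through the pipeline, while the malformed line is retained with its parse error for later troubleshooting.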

### Coupling and Agility

- The data pipeline should decouple data sources from data targets; unwanted coupling can creep in through the following

#### Ad-hoc pipelines

- Building a custom pipeline for each pair of applications that need to be connected, e.g., Logstash to dump logs to Elasticsearch, Flume to dump logs to HDFS, etc.
- This tightly couples the data pipeline to the specific end points and creates a mess of integration points that requires significant effort to deploy, maintain, and monitor
- Also increases the cost of adopting new technology since adoption will require building additional pipelines

#### Loss of metadata

- If the data pipeline doesn't preserve schema metadata and does not allow for schema evolution, you end up tightly coupling the system producing the data at the source and the system that uses it at the destination

#### Extreme processing

- A certain level of data processing is required in any data pipeline (e.g., type conversion, formatting)
- Too much processing ties all the downstream systems to the decisions made when building the pipelines 
- The more agile way is to preserve as much of the raw data as possible and allow downstream apps to make their own decisions regarding data processing and aggregation



## Kafka Connect vs Producer+Consumer

- Use Kafka clients (that can be embedded in application code) when the application code can be modified, and when you want to push data into Kafka or pull data from Kafka in the application
- Kafka Connect can be used to pull data from an external datastore into Kafka or push data from Kafka to an external store; use Connect to connect Kafka to datastores that you did not write and whose code cannot be modified


## Kafka Connect

- Connect provides a scalable and reliable way to move data between Kafka and other data stores
- Connect provides an API and a runtime to develop and run *connector plugins*, which are libraries that Connect executes and that are responsible for moving the data
- Kafka Connect runs as a cluster of *worker processes*, and connector plugins are installed on the workers; a REST API is used to configure and manage *connectors*, each of which runs with a specific configuration
- *Connectors* start additional *tasks* to move large amounts of data in parallel and use the available resources on the worker nodes more efficiently


- Source connector tasks read data from the source system and provide Connect data objects to the worker processes
- Sink connector tasks get connector data objects from the workers and are responsible for writing them to the target data system


- Connect uses *converters* to support storing data objects in Kafka in different formats



```bash
# Make sure connect is up and running by checking its REST API
curl http://connect:28083
```

```text
{"version":"5.4.1-ccs","commit":"fd1e543386b47352","kafka_cluster_id":"3GmGd3dlT3KPmDrfxVCB8Q"}
```



```bash
# Check available connector plugins
curl http://connect:28083/connector-plugins
```

```text
[{"class":"io.confluent.connect.activemq.ActiveMQSourceConnector","type":"source","version":"5.4.1"},{"class":"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector","type":"sink","version":"5.4.1"},{"class":"io.confluent.connect.gcs.GcsSinkConnector","type":"sink","version":"5.4.1"},{"class":"io.confluent.connect.ibm.mq.IbmMQSourceConnector","type":"source","version":"5.4.1"},{"class":"io.confluent.connect.jdbc.JdbcSinkConnector","type":"sink","version":"5.4.1"},{"class":"io.confluent.connect.jdbc.JdbcSourceConnector","type":"source","version":"5.4.1"},{"class":"io.confluent.connect.jms.JmsSourceConnector","type":"source","version":"5.4.1"},{"class":"io.confluent.co
```

### Connector Example: File Source and File Sink

- To create a connector, POST a JSON document to the REST API containing the connector's name and a configuration map, which includes the connector class and its relevant parameters:

```bash
echo '{"name": "load-kafka-config", "config": {"connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector", "file": "config/server.properties", "topic": "kafka-config-topic"}}' | \
curl -X POST -H "Content-Type: application/json" -d @- http://connect:28083/connectors
```

- This connector pipes the contents of the `config/server.properties` file into Kafka line by line, using the converter provided in the configuration, which defaults to the JSON converter

- A file sink connector can be used to dump the contents of the topic into a file:

```bash
echo '{"name": "dump-kafka-config", "config": {"connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector", "file": "copy-of-server-properties", "topics": "kafka-config-topic"}}' | \
curl -X POST -H "Content-Type: application/json" -d @- http://connect:28083/connectors
```

- Note the `topics` (plural) parameter: the `FileStreamSinkConnector` can write multiple topics into one file, whereas the `FileStreamSourceConnector` can only write to one topic
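
The same request body can also be assembled programmatically, which avoids hand-escaping JSON on the command line. The sketch below builds the POST request for the file source connector with Python's standard library but does not send it; `build_create_request` is a hypothetical helper, and the host and port are simply the ones used in the curl examples above.

```python
import json
import urllib.request

CONNECT_URL = "http://connect:28083"  # same host and port as the curl examples


def build_create_request(name, config):
    """Build (but do not send) a POST request that creates a connector."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{CONNECT_URL}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request(
    "load-kafka-config",
    {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "file": "config/server.properties",
        "topic": "kafka-config-topic",
    },
)
# To actually submit it against a running worker:
# urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the worker.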

## Connect Internals

- To use Connect, you run a cluster of workers and start/stop connectors

### Connectors and Tasks

- Connector plugins implement the connector API, which includes *Connectors* and *Tasks*


- The connector is responsible for
    - Determining how many tasks will run for the connector
    - Deciding how to split the data-copying work between the tasks
    - Getting configurations for the tasks from the workers and passing them along
    
- Tasks are responsible for actually getting the data in and out of Kafka
    - All tasks are initialized by receiving a context from the worker
    - Source context includes an object that allows the source task to store the offsets of source records
    - Sink context includes a method that allows the connector to control the records it receives from Kafka
    - After initialization, the tasks are started with the configuration passed by the Connector
    - Source tasks poll an external system and return lists of records that the worker sends to Kafka brokers
    - Sink tasks receive records from Kafka through the worker and write the records to an external system
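
To make the connector's side of this division of labor concrete, here is a small sketch of how a connector might decide on the number of tasks and split the work between them. It is written in Python for brevity and is not the Java Connect API: the `task_configs` function, the `tables` key, and the example connection string are all assumptions made for illustration.

```python
def task_configs(tables, max_tasks, base_config):
    """Split a list of tables round-robin into at most max_tasks task configs."""
    num_tasks = min(max_tasks, len(tables))  # never start idle tasks
    groups = [tables[i::num_tasks] for i in range(num_tasks)]
    # Each task gets the shared settings plus its own slice of the work.
    return [{**base_config, "tables": ",".join(group)} for group in groups]


configs = task_configs(
    tables=["a", "b", "c", "d", "e"],
    max_tasks=2,
    base_config={"connection.url": "jdbc:sqlite:example.db"},  # hypothetical
)
```

With five tables and at most two tasks, the first task is assigned `a,c,e` and the second `b,d`; each resulting map is the kind of configuration a task would be started with.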

### Workers

- The connectors and tasks are responsible for moving data while the workers are responsible for the REST API, configuration management, reliability, high availability, scaling, and load balancing
- Connect's worker processes are the "container" processes that execute the connectors and tasks
- Workers handle the HTTP requests that define connectors and their configurations, store the configurations, and start the connectors and their tasks with the configurations received
- Workers also automatically commit offsets for both source and sink connectors, and handle retries when tasks throw errors

### Converters and Connect's data model

- Connect includes a Data API which contains both data objects and a schema that describes the data
- Source connectors read an event/row from the source system and generate a pair of *Schema* and *Value*
- Sink connectors get a *Schema* and *Value* pair and use the *Schema* to parse the values and insert them into the target system


- When the source connector returns a Data API record to the worker, the worker uses the configured converter (Avro, JSON, or string) to convert the record to either an Avro object, a JSON object, or a string, and the result is then stored into Kafka
- When the Connect worker reads a record from Kafka, it uses the configured converter to convert the record from the format in Kafka (i.e., Avro, JSON, or string) to the Connect Data API record and then passes it to the sink connector, which inserts it into the destination system
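
The round trip through a converter can be sketched as a pair of functions: one that serializes a schema and value pair before it is stored in Kafka, and one that restores the pair on the way to a sink. The envelope below, with `schema` and `payload` keys, is loosely modeled on what the JSON converter writes when schemas are enabled; the exact schema layout and the function names are assumptions for illustration, not the real converter implementation.

```python
import json


def to_kafka_bytes(schema, value):
    """Source side: serialize a (schema, value) pair into bytes for Kafka."""
    return json.dumps({"schema": schema, "payload": value}).encode("utf-8")


def from_kafka_bytes(data):
    """Sink side: turn stored bytes back into a (schema, value) pair."""
    envelope = json.loads(data.decode("utf-8"))
    return envelope["schema"], envelope["payload"]


schema = {
    "type": "struct",
    "fields": [
        {"field": "id", "type": "int64"},
        {"field": "name", "type": "string"},
    ],
}
value = {"id": 42, "name": "alice"}

stored = to_kafka_bytes(schema, value)
restored_schema, restored_value = from_kafka_bytes(stored)
```

Because only these two functions know the on-disk format, swapping JSON for another format would not require changes to the connectors themselves.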

### Offset management

- Connectors need to know which data they have already processed
- Workers manage logical partitions and offsets on behalf of connectors


- For source connectors, the records the connector returns to the workers include a logical partition and a logical offset; these are not Kafka partitions or offsets, but whatever partitions and offsets make sense in the source system (e.g., a file name as the partition and a position in the file as the offset)
- Sink connectors read Kafka records which include a topic, partition, and offset identifiers
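
The source-side bookkeeping can be sketched as follows. A file is treated as the logical partition and a line number as the logical offset, and the reader resumes from the last stored offset. `OffsetStore` and `read_lines` are hypothetical names for this illustration; in a real deployment the worker, not the connector, persists these offsets.

```python
class OffsetStore:
    """In-memory stand-in for the offset storage managed by the workers."""

    def __init__(self):
        self._offsets = {}

    def commit(self, partition, offset):
        # Dicts are not hashable, so key on a sorted tuple of their items.
        self._offsets[tuple(sorted(partition.items()))] = offset

    def lookup(self, partition):
        return self._offsets.get(tuple(sorted(partition.items())))


def read_lines(lines, partition, store, max_records):
    """Return up to max_records records, starting after the stored offset."""
    stored = store.lookup(partition)
    start = stored["line"] if stored else 0
    records = []
    for number, line in enumerate(lines[start:start + max_records], start=start):
        records.append({
            "source_partition": partition,
            "source_offset": {"line": number + 1},  # next line to read
            "value": line,
        })
    return records


lines = ["broker.id=0", "num.partitions=1", "log.dirs=/tmp/kafka-logs"]
partition = {"filename": "server.properties"}
store = OffsetStore()

first = read_lines(lines, partition, store, max_records=2)
store.commit(partition, first[-1]["source_offset"])
second = read_lines(lines, partition, store, max_records=2)  # resumes at line 3
```

After the commit, a restarted reader picks up at the third line rather than re-reading the file from the beginning.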

