secure-spark-streaming-kafka

This demo is built to count the occurrences of each word in every batch interval fetched from a secured Kafka cluster using this Spark Streaming application.

Note: This demo is built for CDS (Spark) version 2.3.0.cloudera3 and CDK (Kafka) version 4.0.0.

For Kerberos-enabled Kafka, you need CDS 2.1+ and CDK 2.1+ to make the combination work: https://blog.cloudera.com/blog/2017/05/reading-data-securely-from-apache-kafka-to-apache-spark/

Requirements:

- Cloudera Distribution of Spark 2.1 release 1 or higher
- Cloudera Distribution of Kafka 2.1.0 or higher

The application also needs the Kafka 0-10 integration (run export SPARK_KAFKA_VERSION=0.10 before launching spark2-submit).
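
For orientation, the core of DirectKafkaWordCount looks roughly like the sketch below. This is a minimal sketch assuming the standard Spark Streaming Kafka 0-10 direct-stream API; the group id, batch interval, and offset handling are illustrative and may differ from the actual source in this repo.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    // Arguments match the spark2-submit invocation shown later:
    // <broker_hostname>:9092 <topic_name> <ssl_enabled>
    val Array(brokers, topic, sslEnabled) = args

    val ssc = new StreamingContext(
      new SparkConf().setAppName("DirectKafkaWordCount"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> brokers,
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "spark-streaming-demo",
      "auto.offset.reset"  -> "latest",
      // SASL_SSL when SSL is enabled on the brokers, SASL_PLAINTEXT otherwise
      "security.protocol"  -> (if (sslEnabled.toBoolean) "SASL_SSL" else "SASL_PLAINTEXT")
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set(topic), kafkaParams))

    // Count words per batch and print the counts to the driver's stdout.
    stream.map(_.value)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}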

Please follow the steps below to build and execute this sample Spark Streaming application against a secure Kafka cluster.

Download & Build the Demo App:

  1. Download or clone this Git repository.

  2. Build this application using Maven.

$ cd secure-spark-streaming-kafka

$ mvn clean package

  3. If your application is built successfully, you will see output like the following.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 24.061 s
  4. After building your application, copy the generated uber jar from target/secure-spark-streaming-kafka-1.0-SNAPSHOT-jar-with-dependencies.jar to the Spark client node (the node from which you will launch the application).

Create the required files & permissions:

  1. Once you have the uber jar available on the client node, the next step is to create a jaas.conf file containing the JAAS configuration for Kerberos access. As per the Apache Kafka documentation, Kafka uses the Java Authentication and Authorization Service (JAAS) for SASL configuration.

Sample jaas.conf:

KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    storeKey=true
    keyTab="./user.kafka.keytab"
    useTicketCache=false
    serviceName="kafka"
    principal="user@MY.DOMAIN.COM";
};

Modify keytab & principal value to your actual value. KafkaServer is the section name in the JAAS file used by each KafkaServer/Broker. This section provides SASL configuration options for the broker including any SASL client connections made by the broker for inter-broker communication.

  2. Also, check the permissions on the created jaas.conf and keytab files to avoid file-permission exceptions.

Execute the Spark Application:

  1. Once all the above steps are completed successfully, execute the spark2-submit command shown below.
SPARK_KAFKA_VERSION=0.10 spark2-submit \
--num-executors 2 \
--master yarn \
--deploy-mode client \
--files <local_path>/jaas.conf,<local_path>/user.kafka.keytab \
--driver-java-options "-Djava.security.auth.login.config=./jaas.conf" \
--class com.github.ankit.spark.example.DirectKafkaWordCount \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
<local_path>/secure-spark-streaming-kafka-1.0-SNAPSHOT-jar-with-dependencies.jar \
<broker_hostname>:9092 \
<topic_name> \
false

Note: Please replace the below-mentioned parameters with actual values.

<broker_hostname> :: Broker hostname. <local_path> :: Actual file path. <topic_name> :: Kafka topic to read from. Change the last parameter to “true” if you have SSL enabled and change the port accordingly.

  2. While your Spark Streaming application is running, start a producer in another terminal that publishes to the topic using the “kafka-console-producer” client. Note that against a Kerberized cluster the console producer also needs its own JAAS configuration and SASL client settings.

  3. You can observe the counts of the various words in every batch interval in your Spark Streaming driver’s stdout.
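
With print(), each batch appears in the driver’s stdout roughly as follows (the timestamp and word counts here are illustrative):

-------------------------------------------
Time: 1540000000000 ms
-------------------------------------------
(hello,3)
(kafka,1)
(world,2)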
