## Set Up the AWS Environment
### We need an AWS environment for storage and for the AWS-managed Kafka service (Amazon MSK)



### A. Setting Up Amazon MSK (Managed Streaming for Apache Kafka)

1. Ensure AWS credentials are securely stored and accessed. They must have permission to create and access the MSK cluster.
2. Validate the provided subnet IDs and security group ID to ensure they exist and are correct.
    -- Use the same subnets and security group ID as the workspace. This makes network connectivity much easier.
3. Handle potential exceptions beyond NoCredentialsError, such as boto3.exceptions.Boto3Error.
4. Log responses and errors appropriately for monitoring and debugging.
5. Ensure the Kafka cluster configuration (e.g., KafkaVersion, InstanceType)...
6. Consider using environment variables or a secrets manager for sensitive information.
7. Verify the region_name is correct and matches the intended AWS region for deployment.
8. Ensure the number of broker nodes is appropriate for the expected load and redundancy requirements.
    -- It should be a multiple of the number of Availability Zones spanned by the subnets.

In [0]:
# Install the kafka-python library for Kafka client
%pip install kafka-python

# Install the AWS MSK IAM SASL Signer for authentication with AWS MSK
%pip install aws-msk-iam-sasl-signer-python

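# Install the `dlt` package from PyPI.
# Note: the PyPI package named `dlt` is a separate project (the dltHub data load tool),
# not Databricks Delta Live Tables; inside a DLT pipeline the `dlt` module is provided by the runtime.
%pip install dlt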

### Configuration Code Explanation

This section explains the configuration code used to set up and manage the Amazon MSK (Managed Streaming for Apache Kafka) environment: what each part of the code does and what to check before running it.

1. **AWS Credentials**: Ensure that AWS credentials are securely stored and accessed. They should have the necessary permissions to create and manage the MSK cluster.
2. **Subnet and Security Group Validation**: Validate the provided subnet IDs and security group ID to ensure they exist and are correct. Using the same subnet and security group ID as the workspace simplifies the setup.
3. **Exception Handling**: Handle potential exceptions beyond `NoCredentialsError`, such as `boto3.exceptions.Boto3Error`, to ensure robust error management.
4. **Logging**: Log responses and errors appropriately for monitoring and debugging purposes.
5. **Kafka Cluster Configuration**: Ensure the Kafka cluster configuration (e.g., KafkaVersion, InstanceType) is correctly specified.
6. **Sensitive Information Management**: Consider using environment variables or a secrets manager for handling sensitive information securely.
7. **Region Verification**: Verify that the `region_name` is correct and matches the intended AWS region for deployment.
8. **Broker Nodes**: Ensure the number of broker nodes is appropriate for the expected load and redundancy requirements. The number of broker nodes should be a multiple of the number of Availability Zones spanned by the subnets.
9. **Mounting File Paths**: Define the DBFS mount point and the file paths beneath it (data-generator files, snapshot folder, DLT pipeline storage) so that later cells can locate them.
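
The broker-count rule in item 8 can be checked before any cluster is created. The following is a minimal sketch, assuming each subnet sits in a different Availability Zone; the function name `validate_broker_count` and the placeholder subnet IDs are illustrative and not part of the original notebook.

```python
def validate_broker_count(number_of_broker_nodes, subnet_ids):
    """Return True when the broker count is a positive multiple of the subnet count.

    Assumption: every subnet belongs to a different Availability Zone,
    so the number of distinct subnets stands in for the number of zones.
    """
    zone_count = len(set(subnet_ids))
    if zone_count == 0:
        raise ValueError("At least one subnet ID is required")
    return number_of_broker_nodes > 0 and number_of_broker_nodes % zone_count == 0


# Same shape as the notebook's settings: 4 brokers spread over 2 subnets
print(validate_broker_count(4, ['subnet-a', 'subnet-b']))
```

Running this check against `config['aws']['number_of_broker_nodes']` and `config['aws']['subnets']` before calling `create_cluster` catches an uneven broker layout early.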

In [0]:
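# DBTITLE 1,Initialize Config Settings
import os

# Create the shared config dictionary if an earlier cell has not already defined it
if 'config' not in locals() or not isinstance(config, dict):
    config = {}

# AWS configuration details. Credentials are read from environment variables
# rather than hard-coded in the notebook; a secrets manager works as well.
config['aws'] = {
    'access_key_id': os.environ.get('AWS_ACCESS_KEY_ID'),
    'secret_access_key': os.environ.get('AWS_SECRET_ACCESS_KEY'),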
    'region_name': 'us-west-2',
    'subnets': [
        'subnet-0682160c6cf0f7d33',  # SubnetID-1 
        'subnet-0ef26ca8237f37fae'   # SubnetID-2
    ],
    'security_group': 'sg-098d6eca4e4744f3f',  # Security group ID
    'cluster_name': 'real-time-pos-msk',  # Unique cluster name
    'kafka_version': '2.8.1',
    'number_of_broker_nodes': 4,
    'instance_type': 'kafka.m5.large',
    'cluster_arn': 'arn:aws:kafka:us-west-2:759713897121:cluster/real-time-pos-msk/d494f0b6-6c35-4d6e-ab46-60da511fcafd-11'
}

# Config settings for the DBFS mount point (no trailing slash, so the paths built below do not contain '//')
config['dbfs_mount_name'] = '/mnt/real-time-pos'

# Store the filenames for the data files into Config
config['inventory_change_store001_filename'] = config['dbfs_mount_name'] + '/data-generator/inventory_change_store001.txt'
config['inventory_change_online_filename'] = config['dbfs_mount_name'] + '/data-generator/inventory_change_online.txt'
 
# snapshot data files
config['inventory_snapshot_store001_filename'] = config['dbfs_mount_name'] + '/data-generator/inventory_snapshot_store001.txt'
config['inventory_snapshot_online_filename'] = config['dbfs_mount_name'] + '/data-generator/inventory_snapshot_online.txt'
 
# static data files
config['stores_filename'] = config['dbfs_mount_name'] + '/data-generator/store.txt'
config['items_filename'] = config['dbfs_mount_name'] + '/data-generator/item.txt'
config['change_types_filename'] = config['dbfs_mount_name'] + '/data-generator/inventory_change_type.txt'

# Config Settings for Checkpoint Files
config['inventory_snapshot_path'] = config['dbfs_mount_name'] + '/inventory_snapshots/'
# Config Settings for DLT Data
config['dlt_pipeline'] = config['dbfs_mount_name'] + '/dlt_pipeline_pos'

# Identify the database for data objects and store its name in the config
database_name = 'pos_dlt'
config['database'] = database_name




In [0]:
from kafka import KafkaAdminClient, KafkaProducer
from kafka.errors import NoBrokersAvailable, KafkaTimeoutError
import boto3
import socket
# ClientError is caught by the cluster listing and describe helpers in a later cell
from botocore.exceptions import ClientError, NoCredentialsError



### Creating the MSK Cluster Using Boto3


In [0]:

try:
    # Initialize the Amazon MSK client (boto3 service name 'kafka') with AWS credentials
    client = boto3.client(
        'kafka',
        region_name=config['aws']['region_name'],
        aws_access_key_id=config['aws']['access_key_id'],
        aws_secret_access_key=config['aws']['secret_access_key']
    )

    try:
        # Create the Kafka cluster
        response = client.create_cluster(
            ClusterName=config['aws']['cluster_name'],
            KafkaVersion=config['aws']['kafka_version'],
            NumberOfBrokerNodes=config['aws']['number_of_broker_nodes'],
            BrokerNodeGroupInfo={
                'InstanceType': config['aws']['instance_type'],
                'ClientSubnets': config['aws']['subnets'],
                'SecurityGroups': [config['aws']['security_group']]
            }
        )

        # Retrieve and print the cluster ARN
        cluster_arn = response['ClusterArn']
        print(f"Cluster ARN: {cluster_arn}")

    except client.exceptions.ConflictException:
        # Handle the case where the cluster already exists
        print(f"Cluster {config['aws']['cluster_name']} already exists")

except NoCredentialsError:
    # Handle the case where AWS credentials are not available
    print("Credentials not available")
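
`create_cluster` returns as soon as the request is accepted, while the cluster is still in the `CREATING` state, so later cells that need broker endpoints have to wait until it is `ACTIVE`. The helper below is a hedged sketch of such a wait loop built on `describe_cluster`; the function name, the polling interval, and the attempt limit are assumptions chosen for illustration.

```python
import time


def wait_for_cluster_active(msk_client, cluster_arn, poll_seconds=30,
                            max_attempts=60, sleep=time.sleep):
    """Poll describe_cluster until the cluster reports the ACTIVE state.

    Returns True once the state is ACTIVE, and False if the state becomes
    FAILED or the attempts run out. `sleep` is injectable so the loop can
    be exercised without real waiting.
    """
    for _ in range(max_attempts):
        response = msk_client.describe_cluster(ClusterArn=cluster_arn)
        state = response['ClusterInfo']['State']
        print(f"Cluster state: {state}")
        if state == 'ACTIVE':
            return True
        if state == 'FAILED':
            return False
        sleep(poll_seconds)
    return False


# Usage in this notebook (requires AWS access, so it is left commented out):
# wait_for_cluster_active(client, cluster_arn)
```

Passing the client in as an argument keeps the loop independent of how the session was created.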

### Details for the MSK Cluster

1. List all available MSK clusters in your AWS account.
2. Retrieve detailed information about a specific MSK cluster, including its configuration and status.
3. Monitor the health and performance metrics of the MSK cluster.
4. Explore the topics and partitions within the MSK cluster.
5. Check the connectivity and security settings of the MSK cluster.
6. Review the broker nodes and their distribution across availability zones.
7. Analyze the logs and events related to the MSK cluster for troubleshooting.

In [0]:
def initialize_msk_session(config):
    """
    Initialize a session using Amazon MSK with provided AWS credentials and region.
    """
    try:
        session = boto3.Session(
            aws_access_key_id=config['aws']['access_key_id'],
            aws_secret_access_key=config['aws']['secret_access_key'],
            region_name=config['aws']['region_name']
        )
        return session
    except NoCredentialsError:
        print("AWS credentials not available.")
        return None

def list_msk_clusters(msk_client):
    """
    List all MSK clusters using the provided MSK client.
    """
    try:
        clusters = msk_client.list_clusters()
        return clusters
    except ClientError as e:
        print(f"Failed to list clusters: {e}")
        return None

def describe_msk_cluster(msk_client, cluster_arn):
    """
    Describe a specific MSK cluster using its ARN.
    """
    try:
        cluster_info = msk_client.describe_cluster(ClusterArn=cluster_arn)
        return cluster_info
    except ClientError as e:
        print(f"Failed to describe cluster: {e}")
        return None

# Initialize a session using Amazon MSK with configuration
session = initialize_msk_session(config)

if session:
    # Create an MSK client
    msk_client = session.client('kafka')

    # List all clusters
    clusters = list_msk_clusters(msk_client)

    if clusters and clusters.get('ClusterInfoList'):
        # Select a specific cluster (for example, the first one in the list)
        selected_cluster_arn = clusters['ClusterInfoList'][0]['ClusterArn']
        cluster_info = describe_msk_cluster(msk_client, selected_cluster_arn)
        if cluster_info:
            display(cluster_info)
    else:
        print("No clusters available or failed to retrieve clusters.")
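
The full `describe_cluster` response is large, so it can help to pull out only the fields that matter for the checks listed above. This is a small sketch under the assumption that the response follows the usual `ClusterInfo` layout; the helper name `summarize_cluster` and the sample values are illustrative.

```python
def summarize_cluster(cluster_info):
    """Extract a few key fields from a describe_cluster response dictionary."""
    info = cluster_info.get('ClusterInfo', {})
    broker_group = info.get('BrokerNodeGroupInfo', {})
    return {
        'name': info.get('ClusterName'),
        'state': info.get('State'),
        'kafka_version': info.get('CurrentBrokerSoftwareInfo', {}).get('KafkaVersion'),
        'broker_nodes': info.get('NumberOfBrokerNodes'),
        'instance_type': broker_group.get('InstanceType'),
        'subnets': broker_group.get('ClientSubnets', []),
    }


# Hand-written sample in the shape of a describe_cluster response (illustrative values only)
sample_response = {
    'ClusterInfo': {
        'ClusterName': 'real-time-pos-msk',
        'State': 'ACTIVE',
        'NumberOfBrokerNodes': 4,
        'CurrentBrokerSoftwareInfo': {'KafkaVersion': '2.8.1'},
        'BrokerNodeGroupInfo': {
            'InstanceType': 'kafka.m5.large',
            'ClientSubnets': ['subnet-a', 'subnet-b'],
        },
    }
}
print(summarize_cluster(sample_response))
```

In the notebook, `summarize_cluster(cluster_info)` could be displayed in place of the raw response.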

### MSK Cluster Communication Ports

- **Plaintext Communication**: Use port `9092` to communicate with brokers in plaintext.
- **TLS Encryption**:
  - Use port `9094` for access from within AWS.
  - Use port `9194` for public access.
- **SASL/SCRAM Authentication**:
  - Use port `9096` for access from within AWS.
  - Use port `9196` for public access.
- **IAM Access Control**:
  - Use port `9098` for access from within AWS.
  - Use port `9198` for public access.
- **Apache ZooKeeper Communication**:
  - Use port `2182` for TLS encryption.
  - Default port is `2181`.
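
The broker port list above can be kept in one lookup table so that client code does not hard-code port numbers. The sketch below simply encodes the ports from this section; the names `MSK_BROKER_PORTS` and `broker_port` and the mode labels are illustrative choices.

```python
# Broker ports from the list above, keyed by (authentication mode, access type)
MSK_BROKER_PORTS = {
    ('plaintext', 'private'): 9092,
    ('tls', 'private'): 9094,
    ('tls', 'public'): 9194,
    ('sasl_scram', 'private'): 9096,
    ('sasl_scram', 'public'): 9196,
    ('iam', 'private'): 9098,
    ('iam', 'public'): 9198,
}


def broker_port(auth_mode, access='private'):
    """Return the broker port for an authentication mode and access type."""
    try:
        return MSK_BROKER_PORTS[(auth_mode, access)]
    except KeyError:
        raise ValueError(
            f"No port listed for auth_mode={auth_mode!r}, access={access!r}"
        ) from None


print(broker_port('iam'))
print(broker_port('tls', 'public'))
```

Combinations that the list does not mention, such as public plaintext access, raise a `ValueError` instead of returning a guess.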

#### Example Commands
```sh
nc -vz b-1.realtimepos.qk3h2c.c11.kafka.us-west-2.amazonaws.com 9094
nc -vz b-2.realtimepos.qk3h2c.c11.kafka.us-west-2.amazonaws.com 9094
```

In [0]:
%sh

# To communicate with brokers in plaintext, use port 9092.

# To communicate with brokers with TLS encryption, use port 9094 for access from within AWS and port 9194 for public access.

# To communicate with brokers with SASL/SCRAM, use port 9096 for access from within AWS and port 9196 for public access.

# To communicate with brokers in a cluster that is set up to use IAM access control, use port 9098 for access from within AWS and port 9198 for public access.

# IAM authentication
nc -vz b-1.realtimeposmsk.dhq8fv.c11.kafka.us-west-2.amazonaws.com 9098
nc -vz b-2.realtimeposmsk.dhq8fv.c11.kafka.us-west-2.amazonaws.com 9098

# SASL/SCRAM Authentication
nc -vz b-1.realtimeposmsk.dhq8fv.c11.kafka.us-west-2.amazonaws.com 9096
nc -vz b-2.realtimeposmsk.dhq8fv.c11.kafka.us-west-2.amazonaws.com 9096

# Plaintext (unencrypted, no authentication)
nc -vz b-1.realtimeposmsk.dhq8fv.c11.kafka.us-west-2.amazonaws.com 9092
nc -vz b-2.realtimeposmsk.dhq8fv.c11.kafka.us-west-2.amazonaws.com 9092
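
The same reachability test can be run from Python when `nc` is not available on the cluster. The following is a minimal sketch using only the standard library; the function name `can_connect` and the default timeout are assumptions, and a successful TCP connection only shows that the port is reachable, not that authentication will succeed.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures
        return False


# Usage (requires network access to the brokers, so it is left commented out):
# for host in ["<broker-1-hostname>", "<broker-2-hostname>"]:
#     print(host, can_connect(host, 9098))
```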

In [0]:
session = boto3.Session(
    aws_access_key_id=config['aws']['access_key_id'],
    aws_secret_access_key=config['aws']['secret_access_key'],
    region_name=config['aws']['region_name']
)

In [0]:


# Create an MSK client from the session defined in the previous cell
msk_client = session.client('kafka')

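try:
    # Get the bootstrap brokers. 'BootstrapBrokerString' holds the plaintext
    # endpoints (port 9092) used by the PLAINTEXT clients below.
    response = msk_client.get_bootstrap_brokers(
        ClusterArn=config['aws']['cluster_arn'],
    )
    bootstrap_servers = response['BootstrapBrokerString']
except NoCredentialsError:
    print("No credentials available. Please check your AWS credentials.")
    raise  # Stop here: the clients below cannot run without bootstrap_servers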

# List all topics in the cluster
try:
    admin_client = KafkaAdminClient(
        bootstrap_servers=bootstrap_servers,
        security_protocol='PLAINTEXT',
        client_id=socket.gethostname(),
    )
    topics = admin_client.list_topics()
    print("Topics in the cluster:", topics)
except NoBrokersAvailable:
    print("No brokers available. Please check the broker endpoint and network connectivity.")
except KafkaTimeoutError:
    print("Kafka timeout error. Failed to update metadata after 60.0 secs. Please check the broker endpoint and network connectivity.")

# Create a Kafka producer using PLAINTEXT connection
try:
    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        security_protocol='PLAINTEXT',
        client_id=socket.gethostname(),
    )
    print("Producer initialized successfully.")
except NoBrokersAvailable:
    print("No brokers available. Please check the broker endpoint and network connectivity.")
except KafkaTimeoutError:
    print("Kafka timeout error. Failed to update metadata after 60.0 secs. Please check the broker endpoint and network connectivity.")
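
The bootstrap broker string is a comma-separated list of `host:port` entries. When individual brokers need to be inspected or tested, it can be split into pairs first. This is a small sketch; the helper name `parse_bootstrap_brokers` and the sample hostnames are made up for illustration.

```python
def parse_bootstrap_brokers(bootstrap_string):
    """Split a comma-separated 'host:port' string into (host, port) tuples."""
    brokers = []
    for entry in bootstrap_string.split(','):
        entry = entry.strip()
        if not entry:
            continue  # Skip empty pieces such as a trailing comma
        host, _, port = entry.rpartition(':')
        if not host or not port.isdigit():
            raise ValueError(f"Malformed broker entry: {entry!r}")
        brokers.append((host, int(port)))
    return brokers


# Illustrative hostnames, not real broker endpoints
sample = "b-1.example.internal:9092,b-2.example.internal:9092"
print(parse_bootstrap_brokers(sample))
```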

## Create an S3 Bucket and Mount It to DBFS


In [0]:
# Define the S3 bucket and mount point
s3_bucket = "s3a://real-time-pos-msk/"
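# Strip any trailing slash: dbutils.fs.mounts() typically lists mount points without one,
# so a trailing slash would make the "already mounted" check below miss an existing mount.
mount_point = config['dbfs_mount_name'].rstrip('/')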

# Set AWS credentials
aws_access_key_id = config['aws']['access_key_id']
aws_secret_access_key = config['aws']['secret_access_key']
region = config['aws']['region_name']

# Configure Spark to use the AWS credentials
spark.conf.set("fs.s3a.access.key", aws_access_key_id)
spark.conf.set("fs.s3a.secret.key", aws_secret_access_key)

# Check whether the mount point already exists
mounts = [mount.mountPoint for mount in dbutils.fs.mounts()]
if mount_point in mounts:
    print(f"{mount_point} is already mounted")
else:
    # Mount the S3 bucket
    dbutils.fs.mount(
      source=s3_bucket,
      mount_point=mount_point,
      extra_configs={"fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"}
    )

In [0]:
%fs
ls dbfs:/mnt/real-time-pos/data-generator