# **TOPIC: Docker**

1. Scenario: You are building a microservices-based application using Docker. Design a Docker Compose file that sets up three containers: a web server container, a database container, and a cache container. Ensure that the containers can communicate with each other properly.

In [None]:
version: '3'
services:
  web:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - 80:80
    depends_on:
      - database
      - cache

  database:
    image: mysql:latest
    environment:
      MYSQL_ROOT_PASSWORD: example
      MYSQL_DATABASE: myapp
      MYSQL_USER: myuser
      MYSQL_PASSWORD: mypassword

  cache:
    image: redis:latest



# From the directory that contains the Compose file, start all three containers with:
docker-compose up
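
Compose places all three services on a shared default network, where each container can reach the others by service name (`database`, `cache`). As a minimal sketch of what that means for the web application, the helper below builds connection URLs from those names. The `DB_HOST` and `CACHE_HOST` variable names and the `service_urls` function are hypothetical, and the ports are assumed to be the MySQL and Redis defaults (3306 and 6379).

```python
import os


def service_urls(env=None):
    """Build connection URLs the web container could use to reach its peers.

    Hostnames default to the Compose service names; DB_HOST and CACHE_HOST
    are hypothetical override variables, not something Compose defines.
    """
    env = os.environ if env is None else env
    db_host = env.get("DB_HOST", "database")    # Compose service name
    cache_host = env.get("CACHE_HOST", "cache")  # Compose service name
    return {
        # Credentials and database name mirror the Compose file above
        "database": f"mysql://myuser:mypassword@{db_host}:3306/myapp",
        "cache": f"redis://{cache_host}:6379/0",
    }


if __name__ == "__main__":
    for name, url in service_urls().items():
        print(f"{name}: {url}")
```

Because the hostnames resolve inside the Compose network, the database and cache do not need published `ports` entries for the web container to reach them.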


2. Scenario: You want to scale your Docker containers dynamically based on the incoming traffic. Write a Python script that utilizes Docker SDK to monitor the CPU usage of a container and automatically scales the number of replicas based on a threshold.

In [None]:
import time

import docker

# CPU usage threshold, in percent
cpu_threshold = 80
# Seconds to wait between monitoring passes
poll_interval = 10

# Docker client setup
client = docker.from_env()

def get_container_cpu_usage(container):
    """Get the CPU usage of a container as a percentage"""
    container_stats = container.stats(stream=False)
    cpu_stats = container_stats['cpu_stats']
    # 'precpu_stats' (the previous sample) is a top-level key, not nested in 'cpu_stats'
    precpu_stats = container_stats['precpu_stats']
    cpu_delta = cpu_stats['cpu_usage']['total_usage'] - precpu_stats['cpu_usage']['total_usage']
    system_delta = cpu_stats.get('system_cpu_usage', 0) - precpu_stats.get('system_cpu_usage', 0)
    if system_delta <= 0:
        return 0.0
    # 'percpu_usage' can be missing (for example on cgroup v2 hosts), so prefer 'online_cpus'
    num_cpus = cpu_stats.get('online_cpus') or len(cpu_stats['cpu_usage'].get('percpu_usage') or []) or 1
    return (cpu_delta / system_delta) * num_cpus * 100

def scale_containers(service_name, replicas):
    """Scale a Swarm service to the given number of replicas"""
    # Service.scale() is a Swarm operation, so the service must run in Swarm mode
    service = client.services.get(service_name)
    service.scale(replicas)

def monitor_containers():
    """Monitor the CPU usage of containers and scale if necessary"""
    for container in client.containers.list():
        cpu_percent = get_container_cpu_usage(container)
        print(f"Container {container.name} - CPU Usage: {cpu_percent:.1f}%")
        # Swarm sets this label on its task containers; other containers are skipped
        service_name = container.labels.get('com.docker.swarm.service.name')
        if service_name and cpu_percent > cpu_threshold:
            scale_containers(service_name, 2)
            print(f"Scaling up {service_name} due to high CPU usage.")

# Main script
if __name__ == "__main__":
    while True:
        monitor_containers()
        time.sleep(poll_interval)
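
The script above always scales to a fixed count of 2 when the threshold is crossed. One possible refinement, sketched here under assumed values, is to compute the target replica count from the current count and the measured CPU usage, with separate scale-up and scale-down thresholds so the count does not flap. The function name `desired_replicas` and the default limits are illustrative assumptions, not part of the Docker SDK.

```python
def desired_replicas(current, cpu_percent,
                     scale_up_at=80.0, scale_down_at=30.0,
                     min_replicas=1, max_replicas=5):
    """Return the replica count to aim for, given the current count and CPU usage.

    All thresholds and limits are assumed defaults chosen for illustration.
    """
    if cpu_percent > scale_up_at:
        # Add one replica, but never exceed the upper bound
        return min(current + 1, max_replicas)
    if cpu_percent < scale_down_at:
        # Remove one replica, but never drop below the lower bound
        return max(current - 1, min_replicas)
    # Between the two thresholds: leave the service as it is
    return current


if __name__ == "__main__":
    for cpu in (95.0, 50.0, 10.0):
        print(cpu, "->", desired_replicas(2, cpu))
```

The returned value could be passed as the `replicas` argument of `scale_containers`. Keeping the decision in a pure function makes it testable without a running Docker daemon.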


3. Scenario: You have a Docker image stored on a private registry. Develop a script in Bash that authenticates with the registry, pulls the latest version of the image, and runs a container based on that image.

In [None]:
#!/bin/bash

# Stop at the first failing command, unset variable, or failed pipeline stage
set -euo pipefail

# Private registry credentials
registry_username="your-registry-username"
registry_password="your-registry-password"

# Docker registry URL
registry_url="your-registry-url"

# Image details
image_name="your-image-name"
image_tag="latest"

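# Authenticate with the private registry
# (--password-stdin keeps the password out of the process list and shell history)
echo "$registry_password" | docker login -u "$registry_username" --password-stdin "$registry_url"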

# Pull the latest version of the image
docker pull "$registry_url/$image_name:$image_tag"

# Run a container based on the pulled image
docker run -d "$registry_url/$image_name:$image_tag"
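
For comparison, the same login, pull, and run sequence can be sketched with the Docker SDK for Python. This is an illustration rather than a replacement for the Bash script: the function names and the `REGISTRY_USERNAME` and `REGISTRY_PASSWORD` environment variables are assumptions, and the credentials are read from the environment so they stay out of the source file.

```python
import os


def image_reference(registry_url, image_name, image_tag="latest"):
    """Return the fully qualified image reference, e.g. registry/name:tag."""
    return f"{registry_url}/{image_name}:{image_tag}"


def pull_and_run(registry_url, image_name, image_tag="latest"):
    """Log in to the registry, pull the image, and start a detached container."""
    # Imported lazily so image_reference() works without the SDK installed
    import docker

    client = docker.from_env()
    client.login(
        username=os.environ["REGISTRY_USERNAME"],  # hypothetical variable name
        password=os.environ["REGISTRY_PASSWORD"],  # hypothetical variable name
        registry=registry_url,
    )
    client.images.pull(f"{registry_url}/{image_name}", tag=image_tag)
    return client.containers.run(
        image_reference(registry_url, image_name, image_tag), detach=True
    )


if __name__ == "__main__":
    print(image_reference("your-registry-url", "your-image-name"))
```

Calling `pull_and_run("your-registry-url", "your-image-name")` would return a container object, the SDK counterpart of the container ID that `docker run -d` prints.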


# **TOPIC: Airflow**

1. Scenario: You have a data pipeline that requires executing a shell command as part of a task. Create an Airflow DAG that includes a BashOperator to execute a specific shell command.

In [None]:
from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path
from datetime import datetime, timedelta

default_args = {
    'start_date': datetime(2023, 7, 12),
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'shell_command_dag',
    default_args=default_args,
    description='DAG for executing a shell command',
    schedule_interval='0 0 * * *',  # Run daily at midnight
)

execute_shell_command = BashOperator(
    task_id='execute_shell_command',
    bash_command='your_shell_command_here',
    dag=dag,
)

execute_shell_command


2. Scenario: You want to create dynamic tasks in Airflow based on a list of inputs. Design an Airflow DAG that generates tasks dynamically using PythonOperator, where each task processes an element from the input list.

In [None]:
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path
from datetime import datetime, timedelta

default_args = {
    'start_date': datetime(2023, 7, 12),
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'dynamic_task_dag',
    default_args=default_args,
    description='DAG for generating dynamic tasks',
    schedule_interval=None,  # Set to None to disable automatic scheduling
)

def process_element(element):
    # Add your processing logic here
    print(f"Processing element: {element}")

input_list = ['element1', 'element2', 'element3']  # Replace with your actual input list

for element in input_list:
    task_id = f"process_{element}"
    process_task = PythonOperator(
        task_id=task_id,
        python_callable=process_element,
        op_args=[element],
        dag=dag,
    )

    process_task
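
The loop above derives each `task_id` directly from the list element, which works for names like `element1` but can fail for real inputs that contain spaces or slashes, or that repeat. The sketch below shows one way to normalise elements into safe, unique IDs. It assumes Airflow's rule that task IDs may contain only alphanumeric characters, dashes, dots, and underscores, and are at most 250 characters long. The helper names are hypothetical.

```python
import re


def make_task_id(element, prefix="process_"):
    """Turn an arbitrary input element into a string usable as a task_id."""
    # Replace every character outside [A-Za-z0-9_.-] with an underscore
    cleaned = re.sub(r"[^\w.-]", "_", str(element))
    # Truncate to the assumed 250-character limit
    return f"{prefix}{cleaned}"[:250]


def make_unique_task_ids(elements):
    """Return one task_id per element, adding a numeric suffix to repeats."""
    seen = {}
    task_ids = []
    for element in elements:
        base = make_task_id(element)
        count = seen.get(base, 0)
        seen[base] = count + 1
        task_ids.append(base if count == 0 else f"{base}_{count}")
    return task_ids


if __name__ == "__main__":
    print(make_unique_task_ids(["sales 2023/Q1", "sales 2023 Q1", "inventory"]))
```

In the DAG, `task_id = make_task_id(element)` could replace the f-string. The simple suffix scheme does not guard against an input that already ends in a matching `_1`, so treat it as a starting point.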

3. Scenario: You need to set up a complex task dependency in Airflow, where Task B should start only if Task A has successfully completed. Implement this dependency using the "TriggerDagRunOperator" in Airflow.

In [None]:
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator  # Airflow 2.x import path
from airflow.sensors.external_task import ExternalTaskSensor  # Airflow 2.x import path
from datetime import datetime, timedelta

default_args = {
    'start_date': datetime(2023, 7, 12),
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'complex_dependency_dag',
    default_args=default_args,
    description='DAG with complex task dependency',
    schedule_interval=None,  # Set to None to disable automatic scheduling
)

task_a = ExternalTaskSensor(
    task_id='task_a',
    external_dag_id='your_dag_id',
    external_task_id='task_a',
    dag=dag,
)

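task_b = TriggerDagRunOperator(
    task_id='task_b',
    # DAG that contains Task B; use a different DAG from external_dag_id above,
    # since a run with this execution date already exists for that one
    trigger_dag_id='your_target_dag_id',
    execution_date="{{ execution_date }}",
    dag=dag,
)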

task_a >> task_b


# **TOPIC: Sqoop**

1. Scenario: You want to import data from an Oracle database into Hadoop using Sqoop, but you only need to import specific columns from a specific table. Write a Sqoop command that performs the import, including the necessary arguments for column selection and table mapping.

In [None]:
sqoop import \
--connect jdbc:oracle:thin:@<database_host>:<port>:<database_name> \
--username <username> \
--password <password> \
--table <table_name> \
--columns "<column1>,<column2>,<column3>" \
--target-dir <target_directory> \
--as-textfile \
--num-mappers <num_mappers>

2. Scenario: You have a requirement to perform an incremental import of data from a MySQL database into Hadoop using Sqoop. Design a Sqoop command that imports only the new or updated records since the last import.

In [None]:
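# "lastmodified" picks up updated rows as well as new ones ("append" only sees new rows);
# --check-column must be a timestamp column, and --merge-key merges updated rows into the existing data
sqoop import \
--connect jdbc:mysql://<database_host>:<port>/<database_name> \
--username <username> \
--password <password> \
--table <table_name> \
--target-dir <target_directory> \
--as-textfile \
--incremental lastmodified \
--check-column <last_updated_timestamp_column> \
--last-value <last_imported_value> \
--merge-key <primary_key_column> \
--num-mappers <num_mappers>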
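
An incremental import has to be given the previous run's high-water mark as `--last-value` each time it runs. As a minimal sketch of automating that, the helper below assembles the argument list from a stored value. The function name, its parameters, and the example host and paths are hypothetical. It uses Sqoop's `--password-file` option instead of an inline password, which is a substitution made here for scripted use.

```python
import shlex


def build_incremental_import(connect, username, password_file, table,
                             target_dir, check_column, last_value,
                             mode="lastmodified", merge_key=None, num_mappers=1):
    """Assemble a `sqoop import` argument list for an incremental run."""
    if mode not in ("append", "lastmodified"):
        raise ValueError("mode must be 'append' or 'lastmodified'")
    args = [
        "sqoop", "import",
        "--connect", connect,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        "--as-textfile",
        "--incremental", mode,
        "--check-column", check_column,
        "--last-value", str(last_value),
    ]
    if mode == "lastmodified" and merge_key:
        # Merge updated rows into existing data instead of duplicating them
        args += ["--merge-key", merge_key]
    args += ["--num-mappers", str(num_mappers)]
    return args


if __name__ == "__main__":
    cmd = build_incremental_import(
        connect="jdbc:mysql://db.example.com:3306/sales",  # example values only
        username="etl", password_file="/user/etl/.sqoop_pw",
        table="orders", target_dir="/data/orders",
        check_column="updated_at", last_value="2023-07-12 00:00:00",
        merge_key="order_id",
    )
    print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a host where Sqoop is installed. A Sqoop saved job (`sqoop job --create`) is an alternative that records the last value between runs.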

3. Scenario: You need to export data from Hadoop to a Microsoft SQL Server database using Sqoop. Develop a Sqoop command that exports the data, considering factors like database connection details, table mapping, and appropriate data types.


In [None]:
sqoop export \
--connect "jdbc:sqlserver://<database_host>:<port>;database=<database_name>" \
--username <username> \
--password <password> \
--table <table_name> \
--export-dir <export_directory> \
--input-fields-terminated-by ',' \
--input-lines-terminated-by '\n' \
--input-null-string '\\N' \
--input-null-non-string '\\N' \
--columns "<column1>,<column2>,<column3>" \
--batch \
--num-mappers <num_mappers>