## Step 1: Generating Synthetic Customer Support Data
This step uses SDS to generate synthetic customer support questions and analytics on those questions. The payload specifies parameters for the CAII model (e.g., Llama-3.1-70b) and includes a structured prompt instructing the model to:  
1. Create realistic customer comments.  
2. Automatically answer **17 questions** about each comment (e.g., technical details, urgency, presence of bugs).  

**Key Insights**  
- **Structured Output**: The prompt must give clear instructions about the expected questions and solutions, along with examples of how to structure them.
- **Customization**: Domain-specific seed topics (e.g., Cloudera AI, Data Engineering) keep the output relevant to the target use case and make the synthetic data more diverse.
- **Temperature**: Use a `temperature` of 1.0 to balance generation accuracy and creativity.
- **Top_k**: Use a large `top_k` value (250) so that a wide range of generation paths is explored.
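
The "Number. answer" output format requested by the prompt is what makes the generated completions machine-readable. As a minimal sketch (the helper name `parse_answers` is hypothetical and not part of SDS), a completion could be parsed into a dictionary like this, assuming each answer sits on its own line and the free-text summary for question 17 fits on a single line:

```python
import re

# Matches lines such as "3. 4" or "17. Customer reports ..."
ANSWER_LINE = re.compile(r"^\s*(\d+)\.\s*(.*)$")


def parse_answers(completion: str, expected: int = 17) -> dict:
    """Parse 'Number. answer' lines into {question_number: answer}.

    Hypothetical helper: numeric answers (questions 1-16) become ints,
    the final answer (the summary) is kept as text.
    """
    answers = {}
    for line in completion.splitlines():
        match = ANSWER_LINE.match(line)
        if not match:
            continue  # skip blank lines and anything not in the expected format
        number = int(match.group(1))
        text = match.group(2).strip()
        if number < expected and text.isdigit():
            answers[number] = int(text)
        else:
            answers[number] = text
    missing = [n for n in range(1, expected + 1) if n not in answers]
    if missing:
        raise ValueError(f"missing answers for questions: {missing}")
    return answers


# Usage with a completion in the same shape as the few-shot examples
sample = "\n".join(f"{i}. 0" for i in range(1, 17)) + "\n17. Short summary."
parsed = parse_answers(sample)
print(parsed[1], parsed[17])
```

A summary that spans several lines would be cut off at the first line break by this sketch, so a stricter parser may be needed for production use.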


In [5]:
%%time
import requests
import os

# Get API key from environment variable if within CDSW app/session
api_key = os.environ.get('CDSW_APIV2_KEY')


# URL for synthesis
url = 'https://synthetic-data-generator-oaqw25.ai-workbench.eng-ml-l.vnu8-sqze.cloudera.site/synthesis/generate'


# Add the API key to headers with proper Authorization format
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

# If API key exists, add it to the headers
if api_key:
    headers['Authorization'] = f'Bearer {api_key}'
else:
    print("Warning: No API key provided")

# Payload for data synthesis
payload = {
  #Use CAII models
  "inference_type": "CAII",
  "caii_endpoint": "https://caii-prod-long-running.eng-ml-l.vnu8-sqze.cloudera.site/namespaces/serving-default/endpoints/llama-31-70b-instruct-8xl40s/v1/chat/completions",
  "model_id": "meta/llama-3.1-70b-instruct",
  #Use AWS Bedrock models
  #"inference_type": "aws_bedrock",
  #"model_id": "us.anthropic.claude-3-5-sonnet-20241022-v2:0",

  "is_demo": False,
  "num_questions": 2,
  "custom_prompt": """
You are a customer of Cloudera and you are reaching out to Cloudera customer support. Generate a comment to send to Cloudera customer support. Write clear, helpful, and friendly responses that address common customer problems, provide accurate information.


After you generate the response answer the following 17 questions about the response. 


There are 17 possible questions you can ask about a generated customer comment. Here is the list:




1. Does this comment discuss any technical information? (answer 0 for no, 1 for yes)
2. Does this comment relate to a customer complaint? (answer 0 for no, 1 for yes)
3. Customer complaint temperature or a frustration level (if there is a complaint give 1 for lowest, 4 for highest and 2,3 for in between. If there is no complaint give a score of 0).
4. Score the severity of the issue based on comment content (SCORE 1-4, give 1 for lowest, 4 for highest and 2,3 for in between)
5. Score the urgency of the issue based on the comment content (SCORE 1-4, give 1 for lowest, 4 for highest and 2,3 for in between)
6. Is this a request from a customer for an update? (answer 0 for no, 1 for yes)
7. Is there a strictly explicit and NOT an implied request from a customer for a call, meeting or a screenshare (zoom/webex/teams etc.)? Do not answer yes unless wording explicitly asks for a call. (BOOL:0/1)
8. Did the customer request an escalation? (answer 0 for no, 1 for yes)
9. Did the customer request a priority change?  To what level? (If there is a priority change, give score 1 to indicate highest priority (indicated by S1) and 4  to indicate the lowest priority (Indicated by S4). If there is no priority change give a score of 0).
10. Did the customer request a transfer to another Customer Operations Engineer? (answer 0 for no, 1 for yes)
11. Did the customer request to speak to a manager or supervisor? (answer 0 for no, 1 for yes)
12. Did the customer request a Subject Matter Expert or expert? (answer 0 for no, 1 for yes)
13. Does this comment discuss a bug in Cloudera software? (answer 0 for no, 1 for yes)
14. Does the comment include a non-Cloudera Apache JIRA link (e.g. an Apache JIRA link with issues.apache.org domain name)? (answer 0 for no, 1 for yes)
15. Does the comment have a link to Cloudera Documentation or Community article? (answer 0 for no, 1 for yes)
16. Does the comment have any other type of hyperlink? (answer 0 for no, 1 for yes)
17. Summarize the case comment condensing it as much as possible but without losing important technical details. Omit including any meeting invite information. (TEXT)
Generate each answer in a new line using the format "Number. answer" (e.g. 1. 0)


""",
  "model_params": {
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 250,
    "max_tokens": 8192
  },
  "technique": "sft",
  "use_case": "custom",
  "topics":[
    "Cloudera",
    "Cloudera AI",
    "Cloudera Machine Learning",
    "Cloudera Data Engineering",
    "Cloudera Data Warehouse",
    "Cloudera Data Platform"
  ],
  "generate_file_name":"CSH_Generation.json",
  "input_key":"Prompt",
  "output_value":"Completion",
  "output_key":"Prompt",
  "examples":
    [
      {
      "question": 

"""Hi, 
    After updating the configuration, we scaled down and then scaled up the ingress-controller. However, even after the ingress-controller replicas became ready, they did not transition to the available state. 
    Please take this as a high priority and schedule a call immediately.
  Thanks!""",

      "solution":
"""1. 1
2. 1
3. 4
4. 4
5. 4
6. 1
7. 1
8. 1
9. 1
10. 0
11. 0
12. 0
13. 1
14. 0
15. 0
16. 0
17. Customer reports that after updating the configuration and scaling down/up the ingress-controller, its replicas are ready but not transitioning to the available state. They have escalated the issue and requested an immediate call with a manager."""
      },
      {
      "question": 
"""Create a new machine user. After I have executed a synchronized user, it takes a long time for Linux to recognize or cannot find the user.
sudo su - srv_xxxx get error 
su: user srv_xxxx does not exist
""",
      "solution":
"""1. 1
2. 0
3. 1
4. 2
5. 2
6. 0
7. 0
8. 0
9. 0
10. 0
11. 0
12. 0
13. 0
14. 0
15. 0
16. 0
17. Issue with new machine user recognition in Linux after synchronization, resulting in "user does not exist" error."""
      },
      {
      "question": 
"""Hi Team,

We (Bigdata engineering team) are working on rolling out spark3.3.2 to our dev cdp cluster, Technical questions 1)is spark2 vs spark3 history server role can be assigned to same host/server  2)is ports changes required if we use spark3/spark2 history & gateway roles on same  host/server or its default to support both spark3/spark2 on same host 

Thanks
John
""",
      "solution":
"""1. 1
2. 0
3. 0
4. 2
5. 1
6. 0
7. 0
8. 0
9. 0
10. 0
11. 0
12. 0
13. 0
14. 0
15. 0
16. 0
17. Bigdata engineering team rolling out spark3.3.2 on dev cdp cluster. Technical questions: 1) Can spark2 and spark3 history server roles be assigned to the same host/server? 2) Are port changes required if using spark3/spark2 history & gateway roles on the same host/server, or are the default ports sufficient to support both spark3/spark2 on the same host?"""
      },
      {
      "question": 
"""Hi team,
As a part of ASA audit being conducted by 3rd party on our track, they have flagged port 1111 on one of the servers as a vulnerable port and recommend closing it but we already gave them a screenshot showing that this port is being used by IIP as a local repo server. 
However, auditors need vendor endorsement for it not just our track justification/screenshot to strike off this vulnerability finding therefore we request Cloudera to kindly endorse this so that it can be shared with auditors for their understanding. 
Regards,
Jennifer
""",
      "solution":
"""1. 1
2. 0
3. 1
4. 2
5. 2
6. 0
7. 0
8. 0
9. 0
10. 0
11. 0
12. 0
13. 0
14. 0
15. 0
16. 0
17. Cloudera support needed to endorse port 1111 usage by IIP as a local repo server for ASA audit."""
      },
      {
      "question": 
"""Hi Mario,
How about we connect zoom early at 3 PM jakarta time
Thank you.
""",
      "solution":
"""1. 0
2. 0
3. 0
4. 1
5. 2
6. 0
7. 1
8. 0
9. 0
10. 0
11. 0
12. 0
13. 0
14. 0
15. 0
16. 0
17. Customer requests to reschedule Zoom call to 3 PM Jakarta time."""
      }
    ]

}



# Make the POST request
response = requests.post(url, headers=headers, json=payload)

# Display the response
print(response.status_code)
print(response.json())

200
{'job_name': 'synth_job_5537', 'job_id': 'wrhb-b089-oi11-roa2'}
CPU times: user 39.6 ms, sys: 0 ns, total: 39.6 ms
Wall time: 2.57 s


## Step 2: Validating Synthetic Data Structure
This step reads the generated synthetic data from the output file and prints its contents. Each entry contains a `Prompt` (the generated customer comment), a `Completion` (the answers to the 17 questions), and `Seeds` (the topic the entry was generated from).  

**Key Insight**  
- **Input file**: Set `InputFile` to the output file generated by the previous step.
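
Printing the file is a visual check only. A programmatic check can catch structural problems before the evaluation step. The sketch below is an illustration, not part of SDS: the function name `validate_records` is hypothetical, and it assumes the file is a list of dictionaries with the `Seeds`, `Prompt`, and `Completion` keys described above. It flags missing keys, completions that do not contain 17 numbered answers, and repeated prompts.

```python
REQUIRED_KEYS = {"Seeds", "Prompt", "Completion"}


def validate_records(records, expected_answers=17):
    """Return a list of (index, problem) tuples; an empty list means no problems found."""
    problems = []
    first_seen = {}  # prompt text -> index of its first occurrence
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
            continue
        # Count lines of the form "<number>. <answer>" in the completion.
        numbered = 0
        for line in rec["Completion"].splitlines():
            head, sep, _ = line.partition(".")
            if sep and head.strip().isdigit():
                numbered += 1
        if numbered != expected_answers:
            problems.append((i, f"expected {expected_answers} answers, found {numbered}"))
        prompt = rec["Prompt"]
        if prompt in first_seen:
            problems.append((i, f"duplicate prompt, first seen at record {first_seen[prompt]}"))
        else:
            first_seen[prompt] = i
    return problems


# Usage with small in-memory records (made-up data for illustration)
good_completion = "\n".join(f"{n}. 0" for n in range(1, 17)) + "\n17. Summary."
records = [
    {"Seeds": "Cloudera", "Prompt": "Hi Support, ...", "Completion": good_completion},
    {"Seeds": "Cloudera", "Prompt": "Hi Support, ...", "Completion": good_completion},
]
for index, problem in validate_records(records):
    print(index, problem)
```

With the notebook's variables, the same function could be called as `validate_records(data_j)` once the file has been loaded.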


In [10]:
import pandas as pd
import json
##############
#Replace this with the output of the data generation call
InputFile='qa_pairs_llama_20250417T005225175_final.json'
###############
with open(InputFile,'r') as f:
  data_j = json.load(f)
print(json.dumps(data_j,indent=4))


[
    {
        "Seeds": "Cloudera",
        "Prompt": "Hi Cloudera Support,\nOur team is experiencing issues with our CDH cluster. When we try to launch a MapReduce job, it takes an unusually long time to complete. We suspect this may be due to a configuration issue. Can you please assist us with troubleshooting this problem?\nThank you,\n[Your Name]",
        "Completion": "1. 1\n2. 1\n3. 2\n4. 2\n5. 2\n6. 0\n7. 0\n8. 0\n9. 0\n10. 0\n11. 0\n12. 0\n13. 0\n14. 0\n15. 0\n16. 0\n17. Customer requires assistance with troubleshooting a CDH cluster issue where MapReduce jobs are taking an unusually long time to complete, suspected to be due to a configuration issue."
    },
    {
        "Seeds": "Cloudera",
        "Prompt": "Hi Cloudera Support,\nOur team is experiencing issues with our CDH cluster. When we try to launch a MapReduce job, it takes an unusually long time to complete. We suspect this may be due to a configuration issue. Can you please assist us with troubleshooting this prob

## Step 3: Automated Evaluation via LLM "Judge"
This step evaluates the quality of the synthetic data using an LLM-as-a-judge. The payload below selects Claude 3.5 Sonnet through AWS Bedrock; the CAII configuration from Step 1 is left commented out as an alternative. The evaluation prompt instructs the model to:  
1. Evaluate based on the same guidelines used in the data generation step.  
2. Deduct points for errors (e.g., -2 for wrong answers, -1 for partial correctness).  
3. Assign a final score from 1 to 5.  
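
The deduction rule in items 2 and 3 amounts to simple arithmetic. The sketch below spells it out under one assumption that the prompt implies but does not state outright: deductions start from the maximum score of 5. The function name `judge_score` is hypothetical and only illustrates the rubric; the actual score is produced by the LLM.

```python
def judge_score(wrong: int, partially_wrong: int,
                max_score: int = 5, min_score: int = 1) -> int:
    """Apply the rubric: -2 per wrong answer, -1 per partially wrong answer.

    Assumption: scoring starts at max_score (5) and is floored at min_score (1).
    """
    penalty = 2 * wrong + partially_wrong
    return max(min_score, max_score - penalty)


# One wrong answer and one partially wrong answer: 5 - 2 - 1
print(judge_score(wrong=1, partially_wrong=1))  # 2
# Three wrong answers would subtract 6 points, so the floor of 1 applies
print(judge_score(wrong=3, partially_wrong=0))  # 1
```

Because the floor is reached after only two wrong answers plus one partial error, a score of 1 does not distinguish between a sample with a few mistakes and one that is wrong throughout.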

**Key Insights**  
- **Scalable Quality Control**: Avoids manual review by automating evaluation with an LLM, which can handle large datasets quickly.  
- **Parameter Optimization**: A low `temperature` (0.1) makes the judge's evaluations more consistent and repeatable, rather than creative but inconsistent.  
- **Filtering**: The per-sample LLM-as-a-judge score makes it possible to filter out low-quality samples.  
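
As a minimal sketch of the filtering idea, the function below keeps only samples whose judge score meets a threshold. It assumes the evaluation output has the shape printed in Step 4 (a dictionary keyed by topic, each with an `evaluated_pairs` list whose items carry `question`, `solution`, and `evaluation.score`). The function name `filter_by_score` and the threshold of 4 are illustrative assumptions, not SDS defaults.

```python
def filter_by_score(results: dict, min_score: int = 4) -> list:
    """Return the samples whose judge score is at least min_score."""
    kept = []
    for topic, topic_result in results.items():
        if not isinstance(topic_result, dict):
            continue  # ignore any top-level entries that are not per-topic results
        for pair in topic_result.get("evaluated_pairs", []):
            score = pair.get("evaluation", {}).get("score")
            if isinstance(score, (int, float)) and score >= min_score:
                kept.append({
                    "Seeds": topic,
                    "Prompt": pair["question"],
                    "Completion": pair["solution"],
                })
    return kept


# Usage with made-up data in the assumed shape
results = {
    "Cloudera AI": {
        "evaluated_pairs": [
            {"question": "q1", "solution": "s1", "evaluation": {"score": 5}},
            {"question": "q2", "solution": "s2", "evaluation": {"score": 2}},
        ]
    }
}
print(len(filter_by_score(results)))  # 1
```

The kept records reuse the `Seeds`, `Prompt`, and `Completion` keys of the generated file so that the filtered set can be saved in the same format.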


In [15]:
import requests
import os
#********************Accessing Application**************************
# Get API key from environment variable if within CDSW app/session.
# To get your API key for using outside CDSW app/session follow given link.
# https://docs.cloudera.com/machine-learning/cloud/api/topics/ml-api-v2.html
api_key = os.environ.get('CDSW_APIV2_KEY')


# Below is your application API URL. See the Swagger documentation for all existing endpoints of the current application.
# https://<application-subdomain>.<workbench-domain>/docs --> takes the user to the Swagger documentation
# Link to application can be found on application details page within CAI Workbench.


# URL for evaluation
url = 'https://synthetic-data-generator-oaqw25.ai-workbench.eng-ml-l.vnu8-sqze.cloudera.site/synthesis/evaluate'

# Add the API key to headers with proper Authorization format
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {api_key}'  # Format as specified in the documentation
}   

# The prompt for evaluation
custom_prompt = """You are given a comment (question field) and 17 answers (solution field) to the following questions:


1. Does this comment discuss any technical information? (answer 0 for no, 1 for yes)
2. Does this comment relate to a customer complaint? (answer 0 for no, 1 for yes)
3. Customer complaint temperature or a frustration level (if there is a complaint give 1 for lowest, 4 for highest and 2,3 for in between. If there is no complaint give a score of 0).
4. Score the severity of the issue based on comment content (SCORE 1-4, give 1 for lowest, 4 for highest and 2,3 for in between)
5. Score the urgency of the issue based on the comment content (SCORE 1-4, give 1 for lowest, 4 for highest and 2,3 for in between)
6. Is this a request from a customer for an update? (answer 0 for no, 1 for yes)
7. Is there a strictly explicit and NOT an implied request from a customer for a call, meeting or a screenshare (zoom/webex/teams etc.)? Do not answer yes unless wording explicitly asks for a call. (BOOL:0/1)
8. Did the customer request an escalation? (answer 0 for no, 1 for yes)
9. Did the customer request a priority change?  To what level? (If there is a priority change, give score 1 to indicate highest priority (indicated by S1) and 4  to indicate the lowest priority (Indicated by S4). If there is no priority change give a score of 0).
10. Did the customer request a transfer to another Customer Operations Engineer? (answer 0 for no, 1 for yes)
11. Did the customer request to speak to a manager or supervisor? (answer 0 for no, 1 for yes)
12. Did the customer request a Subject Matter Expert or expert? (answer 0 for no, 1 for yes)
13. Does this comment discuss a bug in Cloudera software? (answer 0 for no, 1 for yes)
14. Does the comment include a non-Cloudera Apache JIRA link (e.g. an Apache JIRA link with issues.apache.org domain name)? (answer 0 for no, 1 for yes)
15. Does the comment have a link to Cloudera Documentation or Community article? (answer 0 for no, 1 for yes)
16. Does the comment have any other type of hyperlink? (answer 0 for no, 1 for yes)
17. Summarize the case comment condensing it as much as possible but without losing important technical details. Omit including any meeting invite information. (TEXT)


Rate whether the answers to all 17 questions are correct. Subtract 2 points for each question answered wrong and 1 point for each question answered partially wrong.


Give a rating from 1 to 5. If more than 4 points would be subtracted, use 1 as the lowest rating.
"""


# Model parameters
model_params = {
    "temperature": 0.1,
    "top_p": 1.0,
    "top_k": 100,
    "max_tokens": 4096
}

payload = {
    "export_type": "local",
    "display_name": "Customer Support Data",
    "import_path": InputFile,
    "import_type": "local",
    #Use CAII models
    #"inference_type": "CAII",
    #"caii_endpoint": "https://caii-prod-long-running.eng-ml-l.vnu8-sqze.cloudera.site/namespaces/serving-default/endpoints/llama-31-70b-instruct-8xl40s/v1/chat/completions",
    #"model_id": "meta/llama-3.1-70b-instruct",
    #Use AWS Bedrock models
    "inference_type": "aws_bedrock",
    "model_id": "us.anthropic.claude-3-5-sonnet-20241022-v2:0",

    "use_case": "custom",
    "is_demo": False,
    "custom_prompt": custom_prompt,

    "model_params": model_params
}

response = requests.post(url, headers=headers, json=payload)

# Print the response
print(response.status_code)
print(response.json())

200
{'job_name': 'Customer Support Data_9f43', 'job_id': 'cl2a-6o6c-s98a-19rl'}


## Step 4: Analyzing Evaluation Results
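This step reads the evaluation results from the previous step. The results are grouped by topic; each topic reports an `average_score`, `min_score`, and `max_score`, followed by the `evaluated_pairs` with a per-sample `score` and `justification`.

**Key Insight**  
- **Input file**: Set `LLMJUDGEOUT` to the output file generated by the previous step.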
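
Beyond printing the raw JSON, a compact per-topic summary makes weak topics easier to spot. The sketch below recomputes the statistics from the individual `evaluated_pairs` scores. It assumes the file is a dictionary keyed by topic, as in the output of this step; the function name `summarize_scores` is hypothetical.

```python
from statistics import mean


def summarize_scores(results: dict) -> dict:
    """Recompute count, average, min, and max of the judge scores per topic."""
    summary = {}
    for topic, topic_result in results.items():
        if not isinstance(topic_result, dict):
            continue  # ignore any top-level entries that are not per-topic results
        scores = []
        for pair in topic_result.get("evaluated_pairs", []):
            score = pair.get("evaluation", {}).get("score")
            if isinstance(score, (int, float)):
                scores.append(score)
        if scores:
            summary[topic] = {
                "count": len(scores),
                "average_score": mean(scores),
                "min_score": min(scores),
                "max_score": max(scores),
            }
    return summary


# Usage with made-up data in the assumed shape
results = {
    "Cloudera AI": {
        "evaluated_pairs": [
            {"question": "q1", "solution": "s1", "evaluation": {"score": 5}},
            {"question": "q2", "solution": "s2", "evaluation": {"score": 3}},
        ]
    }
}
for topic, stats in summarize_scores(results).items():
    print(topic, stats)
```

With the notebook's variables, the same function could be called as `summarize_scores(data_j)` after the file has been loaded.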


In [16]:
import pandas as pd
import json
##############
#Replace this with the output of the LLM-as-a-judge step
LLMJUDGEOUT='qa_pairs_claude_20250417T012641143_evaluated.json'
###############
with open(LLMJUDGEOUT,'r') as f:
  data_j = json.load(f)
print(json.dumps(data_j,indent=4))


{
    "Cloudera AI": {
        "average_score": 5.0,
        "min_score": 5,
        "max_score": 5,
        "evaluated_pairs": [
            {
                "question": "Hi, I am reaching out to request assistance with Cloudera AI, specifically with integrating it with our existing CDP deployment. We are experiencing some technical difficulties and would appreciate any guidance or troubleshooting tips your team can provide. Thank you!",
                "solution": "1. 1\n2. 0\n3. 0\n4. 2\n5. 2\n6. 0\n7. 0\n8. 0\n9. 0\n10. 0\n11. 0\n12. 0\n13. 0\n14. 0\n15. 0\n16. 0\n17. Customer requests assistance with integrating Cloudera AI with existing CDP deployment and is experiencing technical difficulties.",
                "evaluation": {
                    "score": 5,
                    "justification": "The answers provided are highly accurate and appropriate for all 17 questions. The technical nature (Q1), severity and urgency levels (Q4,Q5), and lack of complaint (Q2,Q3) are correctl