### Step 0: Specify the Free-form API URLs


In [1]:
#Free form SDS API url
url='https://synthetic-data-generator-oaqw25.ai-workbench.eng-ml-l.vnu8-sqze.cloudera.site/synthesis/freeform' 
#Free form eval SDS API url
url_eval = 'https://synthetic-data-generator-oaqw25.ai-workbench.eng-ml-l.vnu8-sqze.cloudera.site/synthesis/evaluate_freeform'


### Step 1: Real Data Reference
**Purpose**: Load sample real-world lending data to understand structure for synthetic generation  
- Used as reference for generating realistic synthetic data patterns
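
The day-offset columns in the application data are easier to read once converted. The sketch below uses toy rows shaped like `application_record.csv`. The 365.25 days-per-year divisor and the derived column names `AGE_YEARS` and `IS_EMPLOYED` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy rows shaped like application_record.csv (values are illustrative).
apps = pd.DataFrame({
    "ID": [1, 2],
    "DAYS_BIRTH": [-12005, -22717],
    "DAYS_EMPLOYED": [-4542, 365243],
})

# DAYS_BIRTH counts backwards from today, so negate it to get age in days.
# Assumption: 365.25 days per year is accurate enough for a quick look.
apps["AGE_YEARS"] = (-apps["DAYS_BIRTH"] / 365.25).round(1)

# A negative DAYS_EMPLOYED means currently employed; a positive value
# (such as 365243 in the pensioner rows of the sample) means not employed.
apps["IS_EMPLOYED"] = apps["DAYS_EMPLOYED"] < 0

print(apps)
```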


In [8]:
# Load the application records of the lending dataset and show what they look like
import pandas as pd
OutputFile='application_record.csv'
data = pd.read_csv(OutputFile)
data

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
438552,6840104,M,N,Y,0,135000.0,Pensioner,Secondary / secondary special,Separated,House / apartment,-22717,365243,1,0,0,0,,1.0
438553,6840222,F,N,N,0,103500.0,Working,Secondary / secondary special,Single / not married,House / apartment,-15939,-3007,1,0,0,0,Laborers,1.0
438554,6841878,F,N,N,0,54000.0,Commercial associate,Higher education,Single / not married,With parents,-8169,-372,1,1,0,0,Sales staff,1.0
438555,6842765,F,N,Y,0,72000.0,Pensioner,Secondary / secondary special,Married,House / apartment,-21673,365243,1,0,0,0,,2.0


In [36]:
# Load the credit records of the lending dataset and show the history of one client
import pandas as pd
OutputFile='credit_record.csv'
Records = pd.read_csv(OutputFile)

Records.loc[Records['ID']==5008806,:]


Unnamed: 0,ID,MONTHS_BALANCE,STATUS
92969,5008806,0,C
92970,5008806,-1,C
92971,5008806,-2,C
92972,5008806,-3,C
92973,5008806,-4,C
92974,5008806,-5,C
92975,5008806,-6,C
92976,5008806,-7,X
92977,5008806,-8,0
92978,5008806,-9,0


### Step 2a: Synthetic Data Generation
**Key Parameters**:
- **Model**: Meta Llama 3.2 90B Vision Instruct (`meta/llama-3.2-90b-vision-instruct`) running on CAII; a Llama-3.1-70B configuration is left commented out in the payload  
- **Temperature 1.0**: High creativity for diverse outputs
- **max_tokens 8192**: Enough tokens to generate each output fully.
- **Examples**: Uses `ExamplesRowDependencies.json` for pattern learning (`ExamplesLoanData.json` is left commented out as an alternative)
  -  Fields must match the fields in the prompt, in the same sequence.
  -  Redact PII.
  -  Use mock examples.
  -  Consistency in categories: ensure fields with predefined values use only those values.  
- **Prompt Structure**:  
  - Specify all fields with precise definitions.  
  - Explain each field clearly in the expected sequence.
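
The guidelines for the examples file can be checked before a request is sent. This is a minimal sketch, assuming each example is a JSON object whose keys follow the field order of the generation prompt. The structure of the examples file is not shown in this notebook, so `mock_example`, `ALLOWED`, and `check_example` are hypothetical names used only for illustration.

```python
# Field order as listed in the generation prompt of this notebook.
PROMPT_FIELDS = [
    "ID", "CODE_GENDER", "FLAG_OWN_CAR", "FLAG_OWN_REALTY", "CNT_CHILDREN",
    "AMT_INCOME_TOTAL", "NAME_INCOME_TYPE", "NAME_EDUCATION_TYPE",
    "NAME_FAMILY_STATUS", "NAME_HOUSING_TYPE", "DAYS_BIRTH", "DAYS_EMPLOYED",
    "FLAG_MOBIL", "FLAG_WORK_PHONE", "FLAG_PHONE", "FLAG_EMAIL",
    "OCCUPATION_TYPE", "CNT_FAM_MEMBERS", "credit_records",
]

# Categorical fields with predefined values (subset, taken from the prompt).
ALLOWED = {
    "CODE_GENDER": {"F", "M"},
    "FLAG_OWN_CAR": {"Y", "N"},
    "FLAG_OWN_REALTY": {"Y", "N"},
}

def check_example(example):
    """Return a list of problems found in one example record."""
    problems = []
    if list(example.keys()) != PROMPT_FIELDS:
        problems.append("field names or order differ from the prompt")
    for field, allowed in ALLOWED.items():
        if example.get(field) not in allowed:
            problems.append(f"{field} has unexpected value {example.get(field)!r}")
    return problems

# A mock (non-real) example record, as recommended above.
mock_example = {
    "ID": 100001, "CODE_GENDER": "F", "FLAG_OWN_CAR": "N",
    "FLAG_OWN_REALTY": "Y", "CNT_CHILDREN": 1, "AMT_INCOME_TOTAL": 90000,
    "NAME_INCOME_TYPE": "State servant", "NAME_EDUCATION_TYPE": "Higher education",
    "NAME_FAMILY_STATUS": "Married", "NAME_HOUSING_TYPE": "House / apartment",
    "DAYS_BIRTH": -14000, "DAYS_EMPLOYED": -3000, "FLAG_MOBIL": "Y",
    "FLAG_WORK_PHONE": "N", "FLAG_PHONE": "N", "FLAG_EMAIL": "Y",
    "OCCUPATION_TYPE": "Sales staff", "CNT_FAM_MEMBERS": 3,
    "credit_records": [{"ID": 100001, "MONTHS_BALANCE": 0, "STATUS": "C"}],
}

print(check_example(mock_example))
```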


In [181]:
%%time
import requests
import os

# Get API key from environment variable if within CDSW app/session
api_key = os.environ.get('CDSW_APIV2_KEY')


# URL for synthesis: `url` is defined in Step 0


# Add the API key to headers with proper Authorization format
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

# If API key exists, add it to the headers
if api_key:
    headers['Authorization'] = f'Bearer {api_key}'
else:
    print("Warning: No API key provided")

# Payload for data synthesis
payload = {
  #Use CAII models
  #"inference_type": "CAII",
  #"caii_endpoint": "https://caii-prod-long-running.eng-ml-l.vnu8-sqze.cloudera.site/namespaces/serving-default/endpoints/llama-31-70b-instruct-8xl40s/v1/chat/completions",
  #"model_id": "meta/llama-3.1-70b-instruct",
    "inference_type": "CAII",
    "caii_endpoint": "https://caii-prod-long-running.eng-ml-l.vnu8-sqze.cloudera.site/namespaces/serving-default/endpoints/llama3-2-90b-8xl40s/v1/chat/completions",
    "model_id": "meta/llama-3.2-90b-vision-instruct",

  #Use AWS Bedrock models
  #"inference_type": "aws_bedrock",
  #"model_id": "us.anthropic.claude-3-5-sonnet-20241022-v2:0",

  "is_demo": False,
  "num_questions": 20,
  "custom_prompt": """

Generate synthetic data for a credit card dataset. Here is the context about the dataset:

Credit score cards are a common risk control method in the financial industry. It uses personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings. The bank is able to decide whether to issue a credit card to the applicant. Credit scores can objectively quantify the magnitude of risk.
Generally speaking, credit score cards are based on historical data. Once encountering large economic fluctuations. Past models may lose their original predictive power. Logistic model is a common method for credit scoring. Because Logistic is suitable for binary classification tasks and can calculate the coefficients of each feature. In order to facilitate understanding and operation, the score card will multiply the logistic regression coefficient by a certain value (such as 100) and round it.
At present, with the development of machine learning algorithms. More predictive methods such as Boosting, Random Forest, and Support Vector Machines have been introduced into credit card scoring. However, these methods often do not have good transparency. It may be difficult to provide customers and regulators with a reason for rejection or acceptance.


The dataset consists of two tables: `User Records` and `Credit Records`, merged by `ID`. The output must create field values with the following specifications:

User Records Fields (static per user):
- ID: Unique client number (e.g., 100001, 100002).
- CODE_GENDER: Gender ('F' or 'M').
- FLAG_OWN_CAR: Car ownership ('Y' or 'N').
- FLAG_OWN_REALTY: Property ownership ('Y' or 'N').
- CNT_CHILDREN: Number of children (0 or more).
- AMT_INCOME_TOTAL: Annual income.
- NAME_INCOME_TYPE: Income category (e.g., 'Commercial associate', 'State servant').
- NAME_EDUCATION_TYPE: Education level (e.g., 'Higher education', 'Secondary').
- NAME_FAMILY_STATUS: Marital status (e.g., 'Married', 'Single').
- NAME_HOUSING_TYPE: Way of living.
- DAYS_BIRTH: Birthday. Count backwards from current day (0); -1 means yesterday.
- DAYS_EMPLOYED: Start date of employment. Count backwards from current day (0). Negative for employed; a positive value means the person is currently unemployed.
- FLAG_MOBIL: Is there a mobile phone ('Y'/'N')
- FLAG_WORK_PHONE: Is there a work phone ('Y'/'N')
- FLAG_PHONE: Is there a phone ('Y'/'N')
- FLAG_EMAIL: Is there an email ('Y'/'N')
- OCCUPATION_TYPE: Occupation (e.g., 'Manager', 'Sales staff').
- CNT_FAM_MEMBERS: Family size (1 or more).

Credit records Fields (nested array):
- ID: needs to be the same as the User Records Fields ID.
- MONTHS_BALANCE: Record month. The month of the extracted data is the starting point, counting backwards: 0 is the current month, -1 is the previous month, and so on.
- STATUS: 
    Must be one of ['0', '1', '2', '3', '4', '5', 'C', 'X'].
    Values description:
      0: 1-29 days past due
      1: 30-59 days past due
      2: 60-89 days overdue
      3: 90-119 days overdue
      4: 120-149 days overdue
      5: Overdue or bad debts, write-offs for more than 150 days
      C: paid off that month
      X: No loan for the month
    

Requirements:
- Consistency: Ensure `ID` consistency between the application and its nested credit records.
- Avoid real personal data (use synthetic values).
- Format output as three separate JSON objects, each with the structure shown in the examples.

When generating the data, make sure to adhere to the following guidelines:

Privacy guidelines:
- Avoid real PII.
- Ensure examples are not leaked into the synthetic data

Formatting guidelines:
- `CNT_CHILDREN`, `AMT_INCOME_TOTAL`, `DAYS_BIRTH`, `DAYS_EMPLOYED`, etc., must be integers.  
- `MONTHS_BALANCE` must be an integer 0 or less.
- Ensure no other formatting problems or inconsistencies appear that are not listed above. 

Cross-row entries guidelines (applies to Credit Records):
- Entries must be ordered from oldest (`MONTHS_BALANCE=-60`) to newest (`MONTHS_BALANCE=0`).  
- No duplicate `MONTHS_BALANCE` values for a single client.
- If the most recent entry has `MONTHS_BALANCE` of 0, its `STATUS` should be "X" (no loan) or "C" (paid off).  
- The time-series credit record entries need to be logical and consistent when read in the correct sequence (e.g., delinquencies can appear in progression as "0" → "1" → "2" as months progress from "-2" → "-1" → "0").  
- Ensure no other Credit Records inconsistencies appear that are not listed above.


Cross-Column guidelines:  
- Check cross-column inconsistencies such as:
    If `FLAG_OWN_REALTY="Y"`, `NAME_HOUSING_TYPE` must **not** be "Rented apartment".  
    If `DAYS_EMPLOYED > 0` (unemployed), `AMT_INCOME_TOTAL` should be lower (e.g., ≤ $50,000).  
    `OCCUPATION_TYPE` must align with `NAME_INCOME_TYPE` (e.g., "Pensioner" cannot have "Manager" as occupation).  
    `CNT_FAM_MEMBERS` ≥ `CNT_CHILDREN` + 1 (accounting for at least one parent).  
- Ensure no other cross-field inconsistencies appear that are not listed above.


""",
  "model_params": {
    "temperature": 1.0, # typically in the range 0-2; low temperature favors accuracy, high temperature favors diversity
    "top_p": 1.0,
    "top_k": 250,       
    "max_tokens": 8192
  },
  "use_case": "custom",
  "topics": [
    "High income person",
    "Low income person",
    "Four-person family",
    "Three-person family",
    "Two-person family",
    "Five-person family",
    "more than 10 credit records",
    "more than 20 credit records"

  ]
  ,
  #"example_path": "ExamplesLoanData.json"
  "example_path": "ExamplesRowDependencies.json"
}

# Make the POST request
response = requests.post(url, headers=headers, json=payload)

# Display the response
print(response.status_code)
#print(response.json())


200
CPU times: user 41.3 ms, sys: 0 ns, total: 41.3 ms
Wall time: 1.36 s


### Step 2b: Filter data based on expected columns
**Filtering**:  
- Check if all samples have the exact columns as expected
- Remove those with different column names than expected
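
Beyond column names, some value-level rules from the generation prompt can also be checked deterministically before the LLM-based evaluation. The following is a minimal sketch under that assumption; `credit_record_problems` is a hypothetical helper and is not part of the filtering cell in this step.

```python
# STATUS codes allowed by the generation prompt.
VALID_STATUS = {"0", "1", "2", "3", "4", "5", "C", "X"}

def credit_record_problems(records):
    """Return a list of rule violations for one client's credit records."""
    problems = []
    months = [r.get("MONTHS_BALANCE") for r in records]

    # MONTHS_BALANCE must be an integer that is 0 or less.
    if any(not isinstance(m, int) or isinstance(m, bool) or m > 0 for m in months):
        problems.append("MONTHS_BALANCE must be an integer <= 0")

    # No duplicate months for a single client.
    if len(set(months)) != len(months):
        problems.append("duplicate MONTHS_BALANCE")

    # STATUS must be one of the predefined codes.
    if any(str(r.get("STATUS")) not in VALID_STATUS for r in records):
        problems.append("invalid STATUS")

    return problems

good = [
    {"ID": 100001, "MONTHS_BALANCE": -1, "STATUS": "0"},
    {"ID": 100001, "MONTHS_BALANCE": 0, "STATUS": "C"},
]
bad = [
    {"ID": 100001, "MONTHS_BALANCE": 0, "STATUS": "C"},
    {"ID": 100001, "MONTHS_BALANCE": 0, "STATUS": "Z"},
]
print(credit_record_problems(good))
print(credit_record_problems(bad))
```

A sample with a non-empty problem list could be dropped in the same loop that checks the column names.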


In [91]:
import pandas as pd
import json
InputFile='freeform_data_llama_20250425T072941600_final.json'

ExpectedKeys=[ 'Seeds', 'ID', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'credit_records']
CreditRecordsKeys=['ID', 'MONTHS_BALANCE', 'STATUS']
StartID=100001

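with open(InputFile, 'r') as f:
    data_j = json.load(f)

FilteredData = []
for sample in data_j:
    # Keep a sample only if its keys match the expected names and order exactly
    if list(sample.keys()) != ExpectedKeys:
        continue
    # Every nested credit record must also have exactly the expected keys
    if not all(list(rec.keys()) == CreditRecordsKeys for rec in sample['credit_records']):
        continue
    # Renumber IDs sequentially so the sample and its credit records stay consistent
    for rec in sample['credit_records']:
        rec['ID'] = StartID
    sample['ID'] = StartID
    FilteredData.append(sample)
    StartID += 1

# Write the filtered samples; InputFile now points to the filtered file used for evaluation
InputFile = 'Filtered_' + InputFile
with open(InputFile, 'w') as f:
    json.dump(FilteredData, f)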


### Step 2c: Synthetic Data Inspection
**Sample Output Analysis**:  
- All fields of the original data are generated
- Co-dependent variables vary together as expected (e.g., `CNT_CHILDREN` and `CNT_FAM_MEMBERS`)
- Contains plausible occupations 


In [92]:
df=pd.DataFrame(FilteredData)
df

Unnamed: 0,Seeds,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,credit_records
0,High income person,100001,M,Y,Y,2,200000,Commercial associate,Higher education,Married,House,-25000,-20000,Y,Y,Y,Y,Manager,4,"[{'ID': 100001, 'MONTHS_BALANCE': 0, 'STATUS':..."
1,High income person,100002,F,N,N,1,180000,State servant,Secondary / secondary special,Civil marriage,Rented apartment,-28000,-15000,Y,N,Y,Y,Sales staff,3,"[{'ID': 100002, 'MONTHS_BALANCE': 0, 'STATUS':..."
2,High income person,100003,M,Y,Y,2,250000,Commercial associate,Higher education,Married,House,-30000,-25000,Y,Y,Y,Y,Manager,4,"[{'ID': 100003, 'MONTHS_BALANCE': 0, 'STATUS':..."
3,High income person,100004,F,N,Y,1,220000,Pensioner,Incomplete higher,Separated,Municipal apartment,-32000,-20000,Y,N,Y,Y,Driver,2,"[{'ID': 100004, 'MONTHS_BALANCE': 0, 'STATUS':..."
4,High income person,100005,M,Y,Y,0,280000,Commercial associate,Higher education,Married,House,-35000,-30000,Y,Y,Y,Y,Manager,2,"[{'ID': 100005, 'MONTHS_BALANCE': 0, 'STATUS':..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,Five-person family,100116,M,Y,Y,2,200000,Commercial associate,Higher education,Married,House,-25000,-20000,Y,Y,Y,Y,Manager,4,"[{'ID': 100116, 'MONTHS_BALANCE': 0, 'STATUS':..."
116,Five-person family,100117,F,N,N,1,180000,State servant,Secondary / secondary special,Civil marriage,Rented apartment,-28000,-15000,Y,N,Y,Y,Sales staff,3,"[{'ID': 100117, 'MONTHS_BALANCE': 0, 'STATUS':..."
117,Five-person family,100118,M,Y,Y,1,220000,Commercial associate,Higher education,Married,House,-30000,-25000,Y,Y,Y,Y,Manager,3,"[{'ID': 100118, 'MONTHS_BALANCE': 0, 'STATUS':..."
118,Five-person family,100119,F,N,N,0,100000,Pensioner,Incomplete higher,Separated,Municipal apartment,-32000,-20000,Y,N,Y,Y,Driver,1,"[{'ID': 100119, 'MONTHS_BALANCE': 0, 'STATUS':..."


In [94]:
pd.DataFrame.from_dict(df.iloc[20,:]['credit_records'])

Unnamed: 0,ID,MONTHS_BALANCE,STATUS
0,100021,0,X
1,100021,-1,0
2,100021,-2,C
3,100021,-3,C
4,100021,-4,C


### Step 3a: LLM-Based Evaluation
**Evaluation Process**:  
1. **Prompt**:
   - Provide all required definitions and checks needed for evaluation.
   - Ensure consistency of definitions with the generation prompt
   - Explain clearly how the LLM will score
   - Give the LLM examples of what to check. For example, asking the LLM specifically to check for data co-dependencies, temporal consistency, and field validity improves evaluation quality.

2. **Scoring**:
   - Specify a scale (1-10 in this example).
   - Define penalties for mistakes (e.g., field name mismatches, inconsistent field relationships).


3. **Parameter tuning**: Use low `temperature` (e.g., 0.0) to prioritize accuracy over diversity.  
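
The deduction-based scoring can be written out as plain arithmetic, which helps when sanity-checking the scores the judge returns. The sketch below uses the penalty values from the evaluation prompt in this notebook. Deducting once per violated category and returning 1 for any critical error are one reading of that prompt, and `rubric_score` is a hypothetical helper; the evaluation itself is done by the LLM.

```python
# Penalty per violated guideline category, taken from the evaluation prompt.
PENALTIES = {
    "privacy": 2,
    "formatting": 1,
    "cross_column": 1,
    "cross_row": 4,
    "other": 2,
}

def rubric_score(violations, critical=False):
    """Start at 10, deduct once per violated category, never go below 1."""
    if critical:
        # Assumption: a critical error (e.g., invalid STATUS) forces the minimum.
        return 1
    score = 10 - sum(PENALTIES[v] for v in set(violations))
    return max(score, 1)

print(rubric_score([]))
print(rubric_score(["cross_row", "formatting"]))
print(rubric_score(list(PENALTIES)))
```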


In [191]:
import requests
import os
#********************Accessing Application**************************
# Get the API key from the environment variable if within a CDSW app/session.
# To get an API key for use outside a CDSW app/session, follow the link below.
# https://docs.cloudera.com/machine-learning/cloud/api/topics/ml-api-v2.html
api_key = os.environ.get('CDSW_APIV2_KEY')


# The application API URLs are set in Step 0. The Swagger documentation lists all
# existing endpoints of the current application:
# https://<application-subdomain>.<workbench-domain>/docs
# The link to the application can be found on the application details page within CAI Workbench.


# URL for evaluation: `url_eval` is defined in Step 0

# Add the API key to headers with proper Authorization format
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {api_key}'  # Format as specified in the documentation
}   

# The prompt for evaluation
custom_prompt = """
Evaluate the quality of the provided synthetic credit data and return a score between 1 and 10. The score should reflect how well the data adheres to the following criteria:

Here is the context about the dataset:

Credit score cards are a common risk control method in the financial industry. It uses personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings. The bank is able to decide whether to issue a credit card to the applicant. Credit scores can objectively quantify the magnitude of risk.
Generally speaking, credit score cards are based on historical data. Once encountering large economic fluctuations. Past models may lose their original predictive power. Logistic model is a common method for credit scoring. Because Logistic is suitable for binary classification tasks and can calculate the coefficients of each feature. In order to facilitate understanding and operation, the score card will multiply the logistic regression coefficient by a certain value (such as 100) and round it.
At present, with the development of machine learning algorithms. More predictive methods such as Boosting, Random Forest, and Support Vector Machines have been introduced into credit card scoring. However, these methods often do not have good transparency. It may be difficult to provide customers and regulators with a reason for rejection or acceptance.


The dataset consists of two tables: `User Records` and `Credit Records`, merged by `ID`. The output must create field values with the following specifications:

User Records Fields (static per user):
- ID: Unique client number (e.g., 100001, 100002).
- CODE_GENDER: Gender ('F' or 'M').
- FLAG_OWN_CAR: Car ownership ('Y' or 'N').
- FLAG_OWN_REALTY: Property ownership ('Y' or 'N').
- CNT_CHILDREN: Number of children (0 or more).
- AMT_INCOME_TOTAL: Annual income.
- NAME_INCOME_TYPE: Income category (e.g., 'Commercial associate', 'State servant').
- NAME_EDUCATION_TYPE: Education level (e.g., 'Higher education', 'Secondary').
- NAME_FAMILY_STATUS: Marital status (e.g., 'Married', 'Single').
- NAME_HOUSING_TYPE: Way of living.
- DAYS_BIRTH: Birthday. Count backwards from current day (0); -1 means yesterday.
- DAYS_EMPLOYED: Start date of employment. Count backwards from current day (0). Negative for employed; a positive value means the person is currently unemployed.
- FLAG_MOBIL: Is there a mobile phone ('Y'/'N')
- FLAG_WORK_PHONE: Is there a work phone ('Y'/'N')
- FLAG_PHONE: Is there a phone ('Y'/'N')
- FLAG_EMAIL: Is there an email ('Y'/'N')
- OCCUPATION_TYPE: Occupation (e.g., 'Manager', 'Sales staff').
- CNT_FAM_MEMBERS: Family size (1 or more).

Credit Records Fields (nested array):
- ID: needs to be the same as the User Records Fields ID.
- MONTHS_BALANCE: Record month. The month of the extracted data is the starting point, counting backwards: 0 is the current month, -1 is the previous month, and so on.
- STATUS: 
    Must be one of ['0', '1', '2', '3', '4', '5', 'C', 'X'].
    Values description:
      0: 1-29 days past due
      1: 30-59 days past due
      2: 60-89 days overdue
      3: 90-119 days overdue
      4: 120-149 days overdue
      5: Overdue or bad debts, write-offs for more than 150 days
      C: paid off that month
      X: No loan for the month
    

Evaluate whether the data adhere to the following guidelines:

Privacy guidelines:
- Allow fictitious PII-like entries as long as they do not leak real PII.

Formatting guidelines:
- `CNT_CHILDREN`, `AMT_INCOME_TOTAL`, `DAYS_BIRTH`, `DAYS_EMPLOYED`, etc., must be integers.  
- `MONTHS_BALANCE` must be an integer 0 or less.
- Ensure no other formatting problems or inconsistencies appear that are not listed above. 

Cross-row entries guidelines (applies to Credit Records):
- Entries must be ordered from oldest (e.g. `MONTHS_BALANCE=-60`) to newest (`MONTHS_BALANCE=0`).  
- No duplicate `MONTHS_BALANCE` values for a single client.
- Consecutive STATUS=C is allowed, since it indicates that each monthly payment and the amount owed is paid off.
- The time-series credit record entries need to be logical and consistent when read in the correct sequence as months progress from negative to 0.
- Ensure the records do not start from delinquency 2 but rather from 0, C or X.
- Ensure no other Credit Records inconsistencies appear that are not listed above.


Cross-Column guidelines:  
- Check cross-column inconsistencies such as:
    If `FLAG_OWN_REALTY="Y"`, `NAME_HOUSING_TYPE` must **not** be "Rented apartment".  
    If `DAYS_EMPLOYED > 0` (unemployed), `AMT_INCOME_TOTAL` should be lower (e.g., ≤ $50,000).  
    `OCCUPATION_TYPE` must align with `NAME_INCOME_TYPE` (e.g., "Pensioner" cannot have "Manager" as occupation).  
    `CNT_FAM_MEMBERS` ≥ `CNT_CHILDREN` + 1 (accounting for at least one parent).  
    DAYS_BIRTH, DAYS_EMPLOYED, OCCUPATION_TYPE and other variables are reasonable when considered together. 
- Ensure no other cross-field inconsistencies appear that are not listed above.


Scoring Workflow:
  Start at 10, deduct points for violations:  
  Subtract 2 points for any Privacy guidelines violations.
  Subtract 1 point for any formatting guidelines violations.
  Subtract 1 point for any cross-column violations.
  Subtract 4 points for any Cross-row guidelines violations.
  Subtract 2 points for any other problem with the generated data not listed above.
  Set the score to the minimum of 1 if there are any critical errors (e.g., missing `ID`, PII, or invalid `STATUS`).  


Give a score from 1 to 10 for the given data. If more than 9 points would be subtracted, use 1 as the absolute minimum score. List all justifications as a list.
"""




# Model parameters
model_params = {
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": 250,
    "max_tokens": 4096
}

payload = {
    "export_type": "local",
    "display_name": "LendingData",
    "import_path": InputFile,
    "import_type": "local",
    #Use CAII models
    #"inference_type": "CAII",
    #"caii_endpoint": "https://caii-prod-long-running.eng-ml-l.vnu8-sqze.cloudera.site/namespaces/serving-default/endpoints/llama-31-70b-instruct-8xl40s/v1/chat/completions",
    #"model_id": "meta/llama-3.1-70b-instruct",
    "inference_type": "CAII",
    "caii_endpoint": "https://caii-prod-long-running.eng-ml-l.vnu8-sqze.cloudera.site/namespaces/serving-default/endpoints/llama3-2-90b-8xl40s/v1/chat/completions",
    "model_id": "meta/llama-3.2-90b-vision-instruct",

    #Use AWS Bedrock models
    #"inference_type": "aws_bedrock",
    #"model_id": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    #"inference_type": "aws_bedrock",
    #"model_id": "us.anthropic.claude-3-5-sonnet-20241022-v2:0",

    "examples": [
        {
            "score": 10,
            "justification": """- No privacy violations detected (no PII leakage).  
- All fields adhere to formatting requirements (integers where required, valid `MONTHS_BALANCE`, etc.).  
- Cross-row entries are ordered correctly, no duplicates, and statuses progress logically (e.g., "0" → "1" → "2").  
- Cross-column consistency:  
  - `FLAG_OWN_REALTY="Y"` aligns with `NAME_HOUSING_TYPE`.  
  - Unemployed (`DAYS_EMPLOYED > 0`) have lower incomes.  
  - `OCCUPATION_TYPE` matches `NAME_INCOME_TYPE`.  
  - `CNT_FAM_MEMBERS` ≥ `CNT_CHILDREN` + 1.  
- No other critical errors.  
"""

        }
    ],
    "use_case": "custom",
    "is_demo": False,
    "custom_prompt": custom_prompt,
    "model_params": model_params
}

responseEval = requests.post(url_eval, headers=headers, json=payload)

# Print the response
print(responseEval.status_code)
print(responseEval.json())


200
{'job_name': 'LendingData_7e85', 'job_id': 'hzsi-gxus-t2e7-4uo0'}


### Step 3b: Example LLM-as-a-judge output
- Shows the sample in question.  
- Provides a score and a justification for the score which can be used for further filtering.  


In [203]:
import pandas as pd
import json
##############
#Replace this with the output of the LLM-as-a-judge step
LLMJUDGEOUT='row_data_llama_20250425T112214462_evaluated.json'
###############
with open(LLMJUDGEOUT,'r') as f:
  data_j = json.load(f)


### Step 3c: Filtering and Conversion to Tabular Data (High quality examples)
- Filter low-quality rows and convert valid data into a DataFrame.  
- Choose a score threshold (e.g., >9) to balance quality and quantity based on business needs.  
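
A score threshold can be applied with a short filter over the judge output. This is a minimal sketch, assuming the evaluated file has the shape used in this notebook (`evaluated_rows`, each with a `row` and an `evaluation` holding a numeric `score`). The function name `keep_high_quality`, the toy data, and the default threshold of 9 are illustrative.

```python
import pandas as pd

# Toy judge output with the same keys the notebook reads from the evaluated file.
evaluated = {
    "evaluated_rows": [
        {"row": {"Seeds": "High income person", "ID": 100001, "CNT_CHILDREN": 2},
         "evaluation": {"score": 10, "justification": "No violations."}},
        {"row": {"Seeds": "Low income person", "ID": 100002, "CNT_CHILDREN": 0},
         "evaluation": {"score": 6, "justification": "Cross-row violation."}},
    ]
}

def keep_high_quality(evaluated, threshold=9):
    """Return a DataFrame with the rows whose score is above the threshold."""
    kept = []
    for item in evaluated["evaluated_rows"]:
        if item["evaluation"]["score"] > threshold:
            # Drop the seed topic so only data fields remain.
            kept.append({k: v for k, v in item["row"].items() if k != "Seeds"})
    return pd.DataFrame.from_records(kept)

high_quality = keep_high_quality(evaluated)
print(high_quality)
```

Lowering `threshold` keeps more rows at the cost of admitting samples with known violations.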


In [204]:
AllRows = []
# Inspect a few of the last evaluated rows (the slice [-5:-1] skips the final row)
for i in data_j['evaluated_rows'][-5:-1]:
    del i['row']['Seeds']  # drop the seed topic so only data fields remain
    #if i['evaluation']['score'] > 0:
    AllRows.append(i['row'])
    print(json.dumps(i['row'], indent=4))
    print("Score: " + str(i['evaluation']['score']))
    print("Justification: " + i['evaluation']['justification'])
    print("\n\n\n================\n\n\n")


# Convert the collected rows into a DataFrame
df = pd.DataFrame.from_records(AllRows)
#df


{
    "ID": 100116,
    "CODE_GENDER": "M",
    "FLAG_OWN_CAR": "Y",
    "FLAG_OWN_REALTY": "Y",
    "CNT_CHILDREN": 2,
    "AMT_INCOME_TOTAL": 200000,
    "NAME_INCOME_TYPE": "Commercial associate",
    "NAME_EDUCATION_TYPE": "Higher education",
    "NAME_FAMILY_STATUS": "Married",
    "NAME_HOUSING_TYPE": "House",
    "DAYS_BIRTH": -25000,
    "DAYS_EMPLOYED": -20000,
    "FLAG_MOBIL": "Y",
    "FLAG_WORK_PHONE": "Y",
    "FLAG_PHONE": "Y",
    "FLAG_EMAIL": "Y",
    "OCCUPATION_TYPE": "Manager",
    "CNT_FAM_MEMBERS": 4,
    "credit_records": [
        {
            "ID": 100116,
            "MONTHS_BALANCE": 0,
            "STATUS": "C"
        },
        {
            "ID": 100116,
            "MONTHS_BALANCE": -1,
            "STATUS": "C"
        },
        {
            "ID": 100116,
            "MONTHS_BALANCE": -2,
            "STATUS": "C"
        },
        {
            "ID": 100116,
            "MONTHS_BALANCE": -3,
            "STATUS": "0"
        },
        {
         

### Step 4: Saving to CSV
- **Export**: Save the cleaned data to a CSV file for downstream use.  
- **Format consistency**: Use a delimiter like `\t` to avoid conflicts with commas inside field values (e.g., the nested `credit_records` column).  
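
Because `credit_records` holds a nested list, it is written to the file as text. One option, shown here as a sketch and not as the notebook's own export step, is to serialize that column as JSON before saving so it can be parsed back after loading. The frame `df_demo` and the in-memory buffer are illustrative stand-ins for the real DataFrame and output file.

```python
import io
import json

import pandas as pd

# Toy frame with a nested column, shaped like the filtered synthetic data.
df_demo = pd.DataFrame({
    "ID": [100001],
    "NAME_HOUSING_TYPE": ["House / apartment"],
    "credit_records": [[{"ID": 100001, "MONTHS_BALANCE": 0, "STATUS": "C"}]],
})

# Serialize the nested column as JSON text so it survives the round trip.
out = df_demo.copy()
out["credit_records"] = out["credit_records"].apply(json.dumps)

# Write tab-separated text (a StringIO buffer stands in for a file here).
buffer = io.StringIO()
out.to_csv(buffer, sep="\t", index=False)

# Read it back and restore the nested structure.
buffer.seek(0)
back = pd.read_csv(buffer, sep="\t")
back["credit_records"] = back["credit_records"].apply(json.loads)

print(back)
```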


In [205]:
OutputFile='data_tab_separated.csv'
df.to_csv(OutputFile, sep='\t',index=False)
