# Synthetic data generation

In this exercise you will generate synthetic data for credit card fraud detection.



### Part-1 Use an LLM to generate the synthetic data

Follow the instructions in the section **Fine-Tuning/Project..** of the course guide, and refer to Part-1 there for the prompt. At the end of Part-1, you should have the following file in the folder for this notebook:

**./synthetic-credit-card-fraud/credit-card-fraud-chatgpt.json**

### Part-2 Pre-process data to JSON Lines format & split

In [1]:
import json

# This file holds the synthetic data as a single JSON array ([ ... ])
j_file = "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt.json"

with open(j_file) as f:
    dat = json.load(f)

# Write a list of records in JSON Lines format (one JSON object per line)
def write_jsonl_file(dat_subset, file_name):
    with open(file_name, "w") as f:
        for rec in dat_subset:
            f.write(json.dumps(rec) + "\n")

    print(file_name, "# of lines : ", len(dat_subset))

# Training split: records 0-55 (fixed indices, taken in file order)
output_file_prefix = "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-"
file_name = output_file_prefix + "train.jsonl"
write_jsonl_file(dat[0:56], file_name)

# Validation split: records 56-69
file_name = output_file_prefix + "validate.jsonl"
write_jsonl_file(dat[56:70], file_name)

# Test split: all remaining records (70 onward)
file_name = output_file_prefix + "test.jsonl"
write_jsonl_file(dat[70:], file_name)


./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-train.jsonl # of lines :  56
./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-validate.jsonl # of lines :  14
./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-test.jsonl # of lines :  14
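
The split above takes records in file order using fixed indices. If the LLM happened to emit records grouped by label, the three splits could end up with different label mixes. One possible alternative is a shuffled, ratio-based split, sketched below. The function name `split_dataset`, the fractions, and the seed are assumptions chosen for illustration and are not part of the course material.

```python
import random

def split_dataset(records, train_frac=0.67, validate_frac=0.17, seed=42):
    """Shuffle a copy of `records` and split it into train/validate/test lists."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits

    n_train = int(len(shuffled) * train_frac)
    n_validate = int(len(shuffled) * validate_frac)

    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]  # whatever is left over
    return train, validate, test

# Stand-in records; in the notebook this would be the `dat` list
sample = [{"id": i} for i in range(84)]
train, validate, test = split_dataset(sample)
print(len(train), len(validate), len(test))
```

With 84 records, these fractions give the same 56/14/14 sizes as the fixed indices, but the records in each split are drawn at random.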


### Part-3 Check data distribution of training set

#### Step-1 Check if dataset is balanced
* Get the count of fraud & not_fraud records
* Check whether the two counts are roughly equal

In [2]:
# Get the counts for fraud & not_fraud examples
training_file_name = "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-train.jsonl"

# Count fraud vs not fraud examples in the training file
def get_training_dataset_distribution(data_file_name):
    fraud_count = 0
    not_fraud_count = 0
    with open(data_file_name) as f:
        for line in f:
            if json.loads(line)["transaction_label"] == "fraud":
                fraud_count = fraud_count + 1
            else:
                not_fraud_count = not_fraud_count + 1
    
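    # Guard against an empty file to avoid dividing by zero
    total_count = fraud_count + not_fraud_count
    if total_count == 0:
        print("No examples found in", data_file_name)
        return fraud_count, not_fraud_count

    # Calculate % of examples labeled as fraud (truncated to a whole number)
    fraud_pct = int(fraud_count*100/total_count)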
    
    print("Fraud labels : ", fraud_pct, "%  ")
    print("Not_Fraud labels : ", (100-fraud_pct), "%  ")

    return fraud_count, not_fraud_count

# Check the balance
fraud_count, not_fraud_count = get_training_dataset_distribution(training_file_name)

Fraud labels :  35 %  
Not_Fraud labels :  65 %  
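
The function above treats every label other than `"fraud"` as not fraud, so a misspelled or unexpected label would be silently counted as not fraud. A minimal sketch that tallies each distinct label with `collections.Counter` is shown below. The helper name `label_distribution` and the `"not_fraud"` label value in the sample lines are assumptions; only the `transaction_label` key and the `"fraud"` value appear in the notebook code.

```python
import json
from collections import Counter

def label_distribution(jsonl_lines, label_key="transaction_label"):
    """Count every distinct label and return (counts, percentages)."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:          # skip blank lines instead of failing on them
            continue
        counts[json.loads(line)[label_key]] += 1

    total = sum(counts.values())
    percentages = {}
    if total:                 # avoid dividing by zero on empty input
        percentages = {label: round(100 * n / total, 1)
                       for label, n in counts.items()}
    return counts, percentages

# Stand-in lines; with a real file, pass the open file object instead
sample_lines = [
    '{"transaction_label": "fraud"}',
    '{"transaction_label": "not_fraud"}',
    '{"transaction_label": "not_fraud"}',
    '',
]
counts, percentages = label_distribution(sample_lines)
print(dict(counts), percentages)
```

Any label that shows up in the counts besides the two expected ones points to a record worth inspecting before fine-tuning.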


In [3]:
# Check the number of additional examples to be generated
count_diff = fraud_count - not_fraud_count
if count_diff > 0:
    print("Augmentation suggested. Add examples for 'Not Fraud':", count_diff)
elif count_diff < 0:
    print("Augmentation suggested. Add examples for 'Fraud':", -count_diff)
else:
    print("Dataset is balanced")

Augmentation suggested. Add examples for 'Fraud': 16


#### Step-2 Generate suggested number of examples

* Reuse the earlier prompt to generate the suggested number of records with ChatGPT (or another LLM)
* Save the examples to the file **./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-train-additional.json**
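
LLM output does not always match the request, so it may be worth checking the generated records before merging them. The sketch below checks the record count and that every record carries the expected label. The helper name `check_additional_examples` is hypothetical, and the sample records are stand-ins rather than real generated data.

```python
def check_additional_examples(records, expected_count,
                              label_key="transaction_label",
                              expected_label="fraud"):
    """Return a list of problems found; an empty list means the records look OK."""
    problems = []
    if len(records) != expected_count:
        problems.append(f"expected {expected_count} records, got {len(records)}")

    for i, rec in enumerate(records):
        # A record that is not a JSON object has no label at all
        label = rec.get(label_key) if isinstance(rec, dict) else None
        if label != expected_label:
            problems.append(
                f"record {i}: label is {label!r}, expected {expected_label!r}")
    return problems

# Stand-in records; real ones would come from json.load() on the saved file
sample = [
    {"transaction_label": "fraud"},
    {"transaction_label": "not_fraud"},
]
for problem in check_additional_examples(sample, expected_count=2):
    print(problem)
```

In this exercise the call would use `expected_count=16` and the label of the under-represented class, which here is `"fraud"`.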

#### Step-3 Balance the training dataset

* Convert the additional examples to JSON Lines format
* Read the original training dataset
* Write the original examples followed by the additional examples to a new, augmented training file

In [4]:
# JSON file with additional examples
j_file_additional = './synthetic-credit-card-fraud/credit-card-fraud-chatgpt-train-additional.json'

# Open the file and read the JSON array data
with open(j_file_additional) as f:
    additional_dat = json.load(f)

# Print the count of additional examples as a sanity check
print("# of additional examples : ", len(additional_dat))

# Convert the JSON array to JSON Lines (one record per line)
jsonl = "".join(json.dumps(rec) + "\n" for rec in additional_dat)


# of additional examples :  16


In [5]:
# Read credit-card-fraud-chatgpt-train.jsonl and write it, followed by the additional examples, to a new augmented file
training_file_name = "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-train.jsonl"

with open(training_file_name) as training_file:
    original_training_dat = training_file.read()

output_train_file = './synthetic-credit-card-fraud/credit-card-fraud-chatgpt-train-augmented.jsonl'
with open(output_train_file, "w") as f:
    f.write(original_training_dat)
    f.write(jsonl)

In [6]:
get_training_dataset_distribution(output_train_file)

Fraud labels :  50 %  
Not_Fraud labels :  50 %  


(36, 36)
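
The augmented file now has equal counts, but all 16 added fraud examples sit at the end of the file. Whether record order matters depends on the fine-tuning setup, which this notebook does not specify. If it does, a seeded shuffle of the lines is one option. The sketch below works on the file contents as a string; the helper name `shuffle_jsonl_text` and the seed are assumptions for illustration.

```python
import random

def shuffle_jsonl_text(jsonl_text, seed=42):
    """Return the JSON Lines text with its non-blank lines in shuffled order."""
    lines = [line for line in jsonl_text.splitlines() if line.strip()]
    random.Random(seed).shuffle(lines)   # seeded so the result is reproducible
    return "".join(line + "\n" for line in lines)

# Stand-in content; in the notebook this would be the text read from
# the augmented training file
sample_text = (
    '{"transaction_label": "not_fraud"}\n'
    '{"transaction_label": "not_fraud"}\n'
    '{"transaction_label": "fraud"}\n'
    '{"transaction_label": "fraud"}\n'
)
print(shuffle_jsonl_text(sample_text), end="")
```

Shuffling changes only the order of the lines, so the 36/36 label counts stay the same.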