<a href="https://colab.research.google.com/github/dojian/mental_health_chatbot/blob/dongjian/Generate_Gretel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install gretel-client

In [None]:
from gretel_client import Gretel

gretel = Gretel(project_name="topic-gen", api_key="prompt", validate=True)

Gretel API Key: ··········
Using endpoint https://api.gretel.cloud
Logged in as an.dong90@berkeley.edu ✅
Project URL: https://console.gretel.ai/proj_2qYnvSmNBbeVjJnnOIErWBYeWZs


In [None]:
# Read the dataset (JSON Lines, one record per line)
import pandas as pd

processed = pd.read_json('https://gretel-public-website.s3.amazonaws.com/datasets/evaluation/processed-data.json', lines=True)

In [None]:
processed.head()

Unnamed: 0,topic,question,excerpt
0,android,Why live wallpapers use phone call info,"When I install any live wallpaper, I am shown ..."
1,scifi,Where is this missing part to the Tron: Legacy...,I finally bought the Tron: Legacy Original Sou...
2,scifi,Was the tracking bug actually inserted into Ne...,"When Neo is detained by the Agents, they place..."
3,electronics,Are there strong but insulative screws?,Common through-hole power semiconductor device...
4,scifi,Guys with a “boomstick”,"Here's a tough one, I'm looking for a novel in..."


In [None]:
processed['full_text'] = processed['topic'] + ', Question:' + processed['question'] + ', Excerpt:' + processed['excerpt']
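
One thing to watch with the concatenation above: in pandas, adding strings column by column yields a missing value for any row where one of the inputs is missing. The sketch below illustrates this on a toy frame and drops incomplete rows first. The toy data and the choice to drop (rather than fill) are assumptions for illustration, not something the notebook does.

```python
import pandas as pd

# Toy stand-in for `processed`; the second row has no question (illustrative data only)
toy = pd.DataFrame({
    "topic": ["scifi", "gis"],
    "question": ["Guys with a boomstick", None],
    "excerpt": ["Here's a tough one", "Some text"],
})

# Plain concatenation: a missing value in any column makes the whole result missing
naive = toy["topic"] + ", Question:" + toy["question"] + ", Excerpt:" + toy["excerpt"]

# Assumed remedy: keep only rows where all three fields are present, then concatenate
complete = toy.dropna(subset=["topic", "question", "excerpt"])
full_text = (
    complete["topic"] + ", Question:" + complete["question"] + ", Excerpt:" + complete["excerpt"]
)

print(naive.isna().tolist())
print(full_text.tolist())
```

Running `processed['full_text'].isna().sum()` after the cell above would show whether any training rows are affected.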

In [None]:
processed.head()

Unnamed: 0,topic,question,excerpt,full_text
0,android,Why live wallpapers use phone call info,"When I install any live wallpaper, I am shown ...","android, Question:Why live wallpapers use phon..."
1,scifi,Where is this missing part to the Tron: Legacy...,I finally bought the Tron: Legacy Original Sou...,"scifi, Question:Where is this missing part to ..."
2,scifi,Was the tracking bug actually inserted into Ne...,"When Neo is detained by the Agents, they place...","scifi, Question:Was the tracking bug actually ..."
3,electronics,Are there strong but insulative screws?,Common through-hole power semiconductor device...,"electronics, Question:Are there strong but ins..."
4,scifi,Guys with a “boomstick”,"Here's a tough one, I'm looking for a novel in...","scifi, Question:Guys with a “boomstick”, Excer..."


In [None]:
# Check for duplicate rows based on all columns
duplicates = processed.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")

# Remove duplicate rows
processed = processed.drop_duplicates()
print(f"Data after removing duplicates: {processed.shape}")

Number of duplicate rows: 4
Data after removing duplicates: (12127, 4)


In [None]:
trained = gretel.submit_train(
    base_config="natural-language",
    data_source=processed,
    column_name="full_text",
    params={"batch_size": 16, "steps": 608, "validation": 2400},
    generate={"num_records": 100, "temperature": 0.8}
)

Submitting GPT training job...
Model Docs: https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-gpt
Console URL: https://console.gretel.ai/proj_2qYnvSmNBbeVjJnnOIErWBYeWZs/models/6767c13d790c644a1c7ef7d5/activity
Model ID: 6767c13d790c644a1c7ef7d5
Analyzing input data and checking for auto-params... 
Resolved revision for model revision 5f0b02c75b57c5855da9ae460ce51323ea669d8a, model meta-llama/Meta-Llama-3-8B-Instruct
Parameter efficient fine tuning (PEFT) methods will be used, which greatly reduce the number of trainable parameters. 
Starting GPT model training... num_train_steps 608
Fine-tuning 'meta-llama/Meta-Llama-3-8B-Instruct' with provided dataset! 
Downloading model from remote source. Depending on the size of the model, this may take a few minutes. 
Model download 59% complete, ETA 41s (9536686709/16071953557 bytes downloaded) 
Model download 100% complete (16071953557 bytes downloaded). Loading model onto GPU ... 
Still loading model ... 
Still loading mode

In [None]:
# view the text data quality scores
print(trained.report)

GretelDataQualityReport(
    synthetic_data_quality_score: 36
    semantic_similarity: 28
    structure_similarity: 52
    membership_inference_attack_score: 93.8
    data_privacy_score: 93.8
)



In [None]:
# display the full report within this notebook
trained.report.display_in_notebook()

0,1,2,3,4,5
How to interpret the Text SQS,Excellent,Good,Moderate,Poor,Very Poor
Demo environments or mock data,,,,,
Pre-production testing environments,,,,,
Suitable for statistical analysis,,,,,
Augment machine learning data sources,,,,,
Improve your model using our tips and advice,,,,,

Unnamed: 0,Training Data,Synthetic Data
Row Count,80,80.0
Column Count,1,1.0
Training Lines Duplicated,-,0.0
Missing Values,0,0.0
Unique Values,80,80.0
Average Words Per Sentence,8.81,9.12
Average Characters Per Word,4.48,4.13
Average Sentence Count,5.66,8.96


In [None]:
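def extract_question_excerpt(row):
    """Split a 'topic, Question:..., Excerpt:...' string into its Question and Excerpt parts."""
    # Generated rows can be missing or malformed; return empty fields instead of raising
    if not isinstance(row, str):
        return pd.Series({"Question": None, "Excerpt": None})

    # Split once on each marker; a missing marker leaves a single-element list
    question_part = row.split(", Question:", maxsplit=1)
    excerpt_part = row.split(", Excerpt:", maxsplit=1)

    # The question ends where the excerpt marker begins
    question = question_part[1].split(", Excerpt:")[0].strip() if len(question_part) > 1 else None
    excerpt = excerpt_part[1].strip() if len(excerpt_part) > 1 else None

    return pd.Series({"Question": question, "Excerpt": excerpt})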
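
To see how this kind of parsing behaves on well-formed and malformed rows, here is a small self-contained sketch. `split_full_text` is a hypothetical variant written with `str.partition`, not the notebook's function, and the sample strings are made up. It differs slightly from the function above in that it only looks for the excerpt after the question marker has been found.

```python
import pandas as pd

def split_full_text(text):
    """Hypothetical helper: split 'topic, Question:..., Excerpt:...' into two fields."""
    empty = pd.Series({"Question": None, "Excerpt": None})
    if not isinstance(text, str):
        return empty
    # partition() returns (before, separator, after); separator is '' when not found
    _, q_sep, rest = text.partition(", Question:")
    if not q_sep:
        return empty
    question, e_sep, excerpt = rest.partition(", Excerpt:")
    return pd.Series({
        "Question": question.strip(),
        "Excerpt": excerpt.strip() if e_sep else None,
    })

# Made-up rows: one well-formed, one without markers, one missing entirely
sample = pd.Series([
    "gis, Question:How to create a grid from a set of points?, Excerpt:How can I create a grid",
    "gis, no markers here",
    None,
])
parsed = sample.apply(split_full_text)
print(parsed)
```

Rows that come back with missing fields are candidates for dropping or regenerating before the synthetic data is used.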

In [None]:
# Per-topic deficit counts: 0 means no extra records are needed, negative values are the shortfall
topics_short = {
    "scifi": 0,
    "gis": -9,
    "android": -29,
    "apple": -114,
    "electronics": -122,
    "unix": -147,
    "wordpress": -156,
    "photo": -192,
    "security": -217,
    "mathematica": -453
}

# Convert to a DataFrame and keep only topics with a shortfall, as positive counts
missing_counts = pd.DataFrame.from_dict(topics_short, orient='index', columns=['deficit'])
missing_counts = missing_counts[missing_counts['deficit'] < 0].abs()

# Generate data for each topic
generated_data = []

for topic, deficit in missing_counts['deficit'].items():
    seed_data = pd.DataFrame([topic] * deficit, columns=['text'])  # Seed data for the topic
    prompted = gretel.submit_generate(trained.model_id, seed_data=seed_data)  # Generate data
    generated_set = prompted.synthetic_data
    generated_set[['Question', 'Excerpt']] = generated_set['text'].apply(extract_question_excerpt)
    generated_set['topic'] = topic
    generated_data.append(generated_set[['topic', 'Question', 'Excerpt']])  # Append the result

Submitting GPT generate job...
Model Docs: https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-gpt
Console URL: https://console.gretel.ai/proj_2qYnvSmNBbeVjJnnOIErWBYeWZs/models/6767c13d790c644a1c7ef7d5/data
Loading model to worker 
Sampling 9 records using conditioning input... 
Using device 'cuda' 
Generating records... num_records 9
Successfully generated 9 records 
Uploading artifacts to Gretel Cloud... 
Upload to Gretel Cloud is completed. 
Submitting GPT generate job...
Model Docs: https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-gpt
Console URL: https://console.gretel.ai/proj_2qYnvSmNBbeVjJnnOIErWBYeWZs/models/6767c13d790c644a1c7ef7d5/data
Loading model to worker 
Sampling 29 records using conditioning input... 
Using device 'cuda' 
Generating records... num_records 29
Successfully generated 29 records 
Uploading artifacts to Gretel Cloud... 
Upload to Gretel Cloud is completed. 
Submitting GPT generate job...
Model Docs: https://docs.grete
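
The `topics_short` values are hard-coded in the cell above. They look like shortfalls relative to the largest topic (`scifi` sits at 0), so one plausible way to derive such numbers is from per-topic row counts. The sketch below does that on toy data and builds one seed frame per topic in the same shape as `seed_data`. That the notebook's deficits were computed this way is an assumption, and the toy counts are invented.

```python
import pandas as pd

# Toy stand-in for the training data: 5 scifi rows, 3 gis rows, 1 android row
toy = pd.DataFrame({"topic": ["scifi"] * 5 + ["gis"] * 3 + ["android"] * 1})

# Rows per topic, then the gap to the largest topic (0 for the largest, negative otherwise)
counts = toy["topic"].value_counts()
deficits = (counts - counts.max()).to_dict()

# Keep only topics that are short, as positive numbers of records to generate
to_generate = {topic: -gap for topic, gap in deficits.items() if gap < 0}

# One seed row per record to generate, using a 'text' column as in the loop above
seed_frames = {
    topic: pd.DataFrame({"text": [topic] * n}) for topic, n in to_generate.items()
}
print(to_generate)
```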

In [None]:
generated_combined = pd.concat(generated_data, ignore_index=True)

In [None]:
generated_combined

Unnamed: 0,topic,Question,Excerpt
0,gis,How to create a custom shapefile (.shp) in Pyt...,I have been using ArcGIS Desktop and would lik...
1,gis,How to create a grid from a set of points?,How can I create a grid from a set of points i...
2,gis,How to create a new raster layer from an exist...,I have a raster layer in ArcGIS 9.2. It's a sa...
3,gis,Calculate the distance from a point to a line,"This question is related to this one,..."
4,gis,How to create a custom legend for a raster map...,I have a raster map of a flood zone in ArcGIS....
...,...,...,...
1434,mathematica,Finding a solution to a differential equation,I have a differential equation that I'd like t...
1435,mathematica,What is the best way to solve a linear equatio...,I have the following equation:\n\nx+y+z = 1\n\...
1436,mathematica,How to calculate the integral $\int_0^1\frac{\...,I have tried to calculate the integral $\int_0^
1437,mathematica,How to get the derivative of a list of equations,I have a list of equations in the form y[i] = ...


In [None]:
generated_combined.to_csv("generated_data_with_topics.csv", index=False)

In [None]:
from google.colab import files

In [None]:
files.download('/content/generated_data_with_topics.csv')
