# Synthetic Dataset Generator for Mental Health Chatbot Red Flags  
This script generates a dataset of synthetic user messages simulating conversations with a mental health chatbot.  
The messages are labeled as **ALLOW** (safe, no intervention needed) or **BLOCK** (red flag, requires moderation).  
The script uses OpenAI's API to produce realistic, multilingual, and emotionally diverse text data.

### 1. Import libraries and initialize OpenAI client  
We import the required Python libraries for file handling, dataset creation, and OpenAI API usage.  

In [10]:
import os
import csv
import time
import random
import re
from openai import OpenAI

# Placeholder value: substitute your own key and keep it out of version control.
client = OpenAI(api_key='YOUR-API-KEY')
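
Hard-coding the key is convenient for a demo but easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is stored in an `OPENAI_API_KEY` environment variable; the helper name `load_api_key` is hypothetical and not part of the original script:

```python
import os

def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to check without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key

# Usage (assumes the variable is set in your shell):
# client = OpenAI(api_key=load_api_key())
```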

### 2. Configure dataset folders  
We define the folder where the generated dataset will be saved.  
If the folder doesn’t exist, it will be created automatically.  

In [11]:
output_dir = "./DataGenerated"
os.makedirs(output_dir, exist_ok=True)

### 3. Define countries and languages  
This dictionary maps each **country** to one **language** widely spoken there (not necessarily its only or official one).  
Each batch asks for a mix of casual English, the local language, and code-switched messages to simulate real user interactions.  

In [12]:
countries_languages = {
    "Mexico": "Spanish",
    "Saudi Arabia": "Arabic",
    "Nigeria": "Yoruba",
    "France": "French",
    "Brazil": "Portuguese",
    "Germany": "German",
    "Russia": "Russian",
    "Italy": "Italian",
    "Spain": "Spanish",
    "Poland": "Polish",
    "Netherlands": "Dutch",
    "Sweden": "Swedish",
    "Greece": "Greek",
    "Turkey": "Turkish",
    "Romania": "Romanian",
    "Ukraine": "Ukrainian",
    "Hungary": "Hungarian",
    "Czech Republic": "Czech",
    "Portugal": "Portuguese",
    "Serbia": "Serbian",
}
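
Because the generator picks a country uniformly at random, a language listed under more than one country is sampled more often. A small sketch that makes those weights explicit; it uses a shortened copy of the dictionary so the block stays self-contained, and `language_weights` is a hypothetical helper:

```python
from collections import Counter

# Shortened copy of the mapping above, for illustration only.
sample_mapping = {
    "Mexico": "Spanish",
    "Spain": "Spanish",
    "Brazil": "Portuguese",
    "Portugal": "Portuguese",
    "France": "French",
}

def language_weights(mapping):
    """Return the probability of each language under uniform country sampling."""
    counts = Counter(mapping.values())
    total = len(mapping)
    return {language: count / total for language, count in counts.items()}

weights = language_weights(sample_mapping)
print(weights)
```

Applied to the full 20-country dictionary, Spanish and Portuguese would each receive 2/20 of the batches and every other language 1/20.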

### 4. Dataset size configuration  
- `TOTAL_SAMPLES` → number of messages kept in the final CSV  
- `BATCH_SIZE` → number of messages requested per API call  

In [13]:
TOTAL_SAMPLES = 100
BATCH_SIZE = 10
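
These two values also bound the number of API calls. A quick sketch of the arithmetic, assuming every call returns a full batch (calls that fail or return fewer usable lines add to this count); `min_api_calls` is a hypothetical helper:

```python
import math

def min_api_calls(total_samples, batch_size):
    """Lower bound on API calls if every call yields a full batch."""
    return math.ceil(total_samples / batch_size)

print(min_api_calls(100, 10))  # 10
```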

### 5. Function: Build prompt  
This function creates the **instruction prompt** sent to GPT.  
It specifies:  
- The task (generate messages).  
- The ALLOW / BLOCK categories with examples.  
- Language rules (mix English + local language).  
- Emotional diversity requirements.  
- Output format: `sentence | label`.  

In [14]:
def build_prompt(country, language):
    return f"""
You are simulating real users talking to a mental health chatbot.

🎯 TASK:
Generate exactly {BATCH_SIZE} unique, varied, short messages (5–20 words each), simulating emotional or mental distress.
For each line, add its label, ALLOW or BLOCK, as defined below.
ALLOW CATEGORY:
Messages that do NOT require intervention. Examples:
- Vague sadness or anger: "I hate my life", "No puedo más con esta ansiedad"
- Negative body image or self-talk: "Me siento fea y estúpida", "Soy una mierda de persona"
- Dreams, fantasies, or fiction: "En mi sueño mataba a todos", "Me vi cayendo y riéndome"
- Gaming violence or venting: "I headshotted him 10 times", "quiero romper todo el teclado"
- Sexual references (non-explicit): "I miss his hands", "me calienta su voz y ya"

---

BLOCK CATEGORY:
Messages requiring moderation or red flags. Examples:
- Threats to others: "Voy a hacerla sufrir", "I'll make him choke on it"
- Harassment or graphic sex: "La quiero desnuda y gritando", "he touched me y no dije nada"
- Real-world violence: "Hoy sí le pego", "If I see him I’ll stab him"
- Suicidal specifics or method: "Ya compré las pastillas", "Me quiero cortar otra vez"

FORMAT:
sentence | label

LANGUAGE RULES:
- Some messages in casual English.
- Some in {language} (spoken in {country}).
- Most should mix both.
- Use idioms, slang, curse words, emojis, broken grammar — like real people do.

EMOTIONAL DIVERSITY:
Must avoid repetition. Each message should feel like a different person, time, or emotional tone.


❌ No explanations. No numbering. Just the raw list.
"""


### 6. Function: Extract valid sentences  
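A regular expression captures every `sentence | ALLOW` or `sentence | BLOCK` pair in the model's response.  
Matching is case-insensitive; surrounding double quotes are stripped and the label is upper-cased. Text that contains no such pair is ignored.  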

In [15]:
def extract_sentences(text):
    sentences = []
    pattern = re.compile(r'(.+?)\s*\|\s*(ALLOW|BLOCK)', re.IGNORECASE)
    for match in pattern.finditer(text):
        sentence = match.group(1).strip().strip('"')
        label = match.group(2).upper()
        sentences.append({"sentence": sentence, "label": label})
    return sentences
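
The pattern above is not anchored to whole lines, so it also accepts a label embedded in a longer word (for example `ALLOWED`) and keeps any list marker the model adds despite the prompt. A hedged sketch of a stricter, line-anchored variant; `extract_sentences_strict` and `LINE_PATTERN` are hypothetical names, not part of the original script:

```python
import re

# One "sentence | LABEL" pair per line; a leading bullet or "1." marker is dropped.
LINE_PATTERN = re.compile(
    r'^[ \t]*(?:[-*•][ \t]+|\d+[.)][ \t]+)?(.+?)[ \t]*\|[ \t]*(ALLOW|BLOCK)[ \t\r]*$',
    re.IGNORECASE | re.MULTILINE,
)

def extract_sentences_strict(text):
    """Return a list of {"sentence", "label"} dicts, one per well-formed line."""
    rows = []
    for match in LINE_PATTERN.finditer(text):
        sentence = match.group(1).strip().strip('"').strip()
        if sentence:  # skip lines that are only quotes or whitespace
            rows.append({"sentence": sentence, "label": match.group(2).upper()})
    return rows

sample = '''Here are the messages:
1. "I hate my life" | ALLOW
- Ya compré las pastillas | block
this one is | ALLOWED
sentence | label
'''
rows = extract_sentences_strict(sample)
print(rows)
```

Only the second and third lines of `sample` are kept: the intro line has no label, `ALLOWED` is not an exact label, and the `sentence | label` header is not a real row.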

### 7. Function: Generate dataset  
This function loops until `TOTAL_SAMPLES` messages have been collected:  
1. Randomly select a country and its language.  
2. Build the prompt for that pair.  
3. Call the `gpt-4` chat model to generate a batch of messages.  
4. Extract the `sentence | label` pairs from the response.  
5. Save the collected messages to a CSV file.  


In [16]:
def generate_samples():
    all_sentences = []

    while len(all_sentences) < TOTAL_SAMPLES:
        country, language = random.choice(list(countries_languages.items()))
        prompt = build_prompt(country, language)

        try:
            print(f"Generating from {country} ({language})... [{len(all_sentences)}/{TOTAL_SAMPLES}]")
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.8,
                max_tokens=2000
            )

            content = response.choices[0].message.content.strip()
            sentences = extract_sentences(content)

            if sentences:
                all_sentences.extend(sentences)
            else:
                print("⚠ No sentences extracted, retrying...")

            time.sleep(1.5)

        except Exception as e:
            print(f"Error: {e}")
            time.sleep(5)

    # Write the file once, after the loop, trimming any surplus from the last batch.
    output_file = os.path.join(output_dir, "patient_sentences3.csv")
    with open(output_file, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["sentence", "label"])
        writer.writeheader()
        writer.writerows(all_sentences[:TOTAL_SAMPLES])

    print(f"Dataset saved to: {output_file}")

### 8. Run the script  
Finally, we call the `generate_samples()` function to start the dataset generation process.  

In [17]:
if __name__ == "__main__":
    generate_samples()

Generating from Italy (Italian)... [0/100]
Generating from Romania (Romanian)... [10/100]
Generating from Italy (Italian)... [20/100]
Generating from Brazil (Portuguese)... [30/100]
Generating from Mexico (Spanish)... [40/100]
Generating from Ukraine (Ukrainian)... [50/100]
Generating from Netherlands (Dutch)... [60/100]
Generating from Sweden (Swedish)... [70/100]
Generating from Greece (Greek)... [80/100]
Generating from Poland (Polish)... [90/100]
Dataset saved to: ./DataGenerated\patient_sentences3.csv
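
The prompt asks for unique messages within a batch, but separate API calls can still return near-identical lines, and the script does not report the ALLOW/BLOCK split. A minimal post-processing sketch, assuming rows shaped like the output of `extract_sentences`; the helper name `dedupe_rows` is hypothetical:

```python
from collections import Counter

def dedupe_rows(rows):
    """Drop rows whose sentence repeats, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for row in rows:
        # Normalize so "I hate my life" and "i hate  my life" compare equal.
        key = " ".join(row["sentence"].casefold().split())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"sentence": "I hate my life", "label": "ALLOW"},
    {"sentence": "i hate  my life", "label": "ALLOW"},
    {"sentence": "Ya compré las pastillas", "label": "BLOCK"},
]
unique_rows = dedupe_rows(rows)
label_counts = Counter(row["label"] for row in unique_rows)
print(len(unique_rows), dict(label_counts))
```

A heavily skewed count would suggest adjusting the prompt or sampling more batches before using the data to train a classifier.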
