# Synthetic Churn Dataset Generator

Generate synthetic customer churn data from a **voice** or **text** request. Audio is transcribed with **Whisper** (through the Hugging Face `pipeline`); the chat model named in `MODEL` then generates a markdown table, which is streamed to the Gradio UI with **TextIteratorStreamer**.

Tested on a PC with an RTX 2050 GPU.

In [None]:
# Install dependencies (run once)
! uv pip install -q torch transformers bitsandbytes accelerate sentencepiece gradio python-dotenv

In [None]:
import os
import threading
import torch
import gradio as gr
from huggingface_hub import login
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TextIteratorStreamer,
    BitsAndBytesConfig,
    pipeline,
)
# python-dotenv reads HF_TOKEN from a .env file (login is needed only for gated models)
from dotenv import load_dotenv


In [None]:
load_dotenv()
HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("HUGGING_FACE_HUB_TOKEN")
if HF_TOKEN:
    login(token=HF_TOKEN, add_to_git_credential=False)
    print("Logged in to Hugging Face.")
else:
    print("HF_TOKEN not set; continuing without login (fine for public models such as the defaults used here).")

In [None]:
# Config
MODEL = "Qwen/Qwen1.5-0.5B-Chat"                            # not a gated model
WHISPER_MODEL = "openai/whisper-base"                       # not a gated model
MAX_TOKENS = 2048
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
def load_model(model_name):
    """Load tokenizer and model; apply 4-bit NF4 quantization when a CUDA GPU is available."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    if getattr(tokenizer, "pad_token", None) is None:
        tokenizer.pad_token = tokenizer.eos_token
    model_kwargs = {"device_map": "auto", "trust_remote_code": True}
    if torch.cuda.is_available():
        # bitsandbytes 4-bit loading is primarily supported on CUDA GPUs, so skip it on CPU
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
        )
    model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
    return tokenizer, model

tokenizer, model = load_model(MODEL)
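
As a rough sanity check on why 4-bit loading helps on a small GPU, the sketch below estimates weight memory from parameter count and bits per parameter. The helper name `estimate_weight_memory_gib` and the nominal 0.5 billion parameter count (read off the model name) are assumptions for illustration; the estimate ignores quantization constants, layers kept in higher precision, activations, and the KV cache, so real usage will be higher.

```python
def estimate_weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Assumption: nominal parameter count taken from the "0.5B" in the model name.
NOMINAL_PARAMS = 0.5e9

fp16_gib = estimate_weight_memory_gib(NOMINAL_PARAMS, 16)
nf4_gib = estimate_weight_memory_gib(NOMINAL_PARAMS, 4)

print(f"fp16 weights: ~{fp16_gib:.2f} GiB")
print(f"4-bit weights: ~{nf4_gib:.2f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the fp16 footprint, which leaves more headroom for Whisper and the generation cache on the same card.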

In [None]:
# Whisper for speech-to-text (Hugging Face)
whisper_pipeline = pipeline(
    "automatic-speech-recognition",
    model=WHISPER_MODEL,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device=device,
)

In [None]:
# System prompt: elaborate synthetic churn data generation for subscription services (markdown table only, no tools)
CHURN_SYSTEM_PROMPT = """You are an expert synthetic data generator specializing in customer churn datasets for subscription-based businesses (SaaS, streaming, membership, or recurring-revenue services). Your task is to produce realistic, analysis-ready tabular data suitable for training churn prediction models or exploratory analysis.

## Schema (use exactly these columns in this order)
- subscriber_id: unique identifier (e.g. SUB-001, or numeric id). No PII.
- tenure_months: integer, months as a paying subscriber (0–120). New subscribers can be 0–3; long-term often 12+.
- plan_tier: exactly one of: Basic, Standard, Premium, Enterprise.
- billing_cycle: exactly one of: Monthly, Quarterly, Annual.
- monthly_revenue: decimal (e.g. 9.99, 49.00). Must be positive. Vary by plan_tier (Basic lowest, Enterprise highest).
- total_revenue: decimal, cumulative revenue (tenure_months * monthly_revenue, with some variation). Must be >= monthly_revenue.
- num_logins_90d: integer, logins in last 90 days (0–500). Lower engagement often correlates with churn.
- support_tickets: integer, tickets in last 12 months (0–30). Very high tickets can correlate with churn.
- churned: exactly "Yes" or "No". Interpret the user's requested row count; produce a realistic mix (e.g. 15–40% churned unless asked otherwise).
- cancel_reason: when churned is Yes, use exactly one of: Price, Competitor, Not using, Missing features, Support issues, Other. When churned is No, use "-" or leave blank.

## Realism and correlations
- Make data internally consistent: e.g. longer tenure usually implies higher total_revenue; Annual billing often has lower monthly_revenue per unit; Enterprise plans have higher revenue.
- Churned subscribers tend to have lower num_logins_90d, and sometimes higher support_tickets or shorter tenure. Do not make it deterministic; add variety so the dataset is useful for ML.
- Vary numeric values (revenue, logins, tickets) so distributions look plausible—include some outliers and edge cases (e.g. one or two high-engagement churners, or low-engagement non-churners).

## Output format
- Your entire reply must be ONLY a markdown table: optionally one short header line (e.g. "Subscription churn dataset (N rows)") immediately followed by the table.
- Use standard markdown table syntax with a header row and pipe separators. No code blocks, no explanations, no extra text before or after the table.
- Respect the user's requested number of rows when they specify it; otherwise default to a small table (e.g. 10–15 rows).

Output nothing else."""
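
A small model will not always follow the schema above, so it can help to check the generated table before using it. The sketch below parses a markdown table into row dictionaries and tests each row against the constraints stated in the prompt. `parse_markdown_table` and `validate_row` are hypothetical helpers, not part of the notebook, and the checks cover only the rules that are written explicitly in the schema.

```python
# Hypothetical helpers, not part of the original notebook.
COLUMNS = [
    "subscriber_id", "tenure_months", "plan_tier", "billing_cycle",
    "monthly_revenue", "total_revenue", "num_logins_90d",
    "support_tickets", "churned", "cancel_reason",
]
PLAN_TIERS = {"Basic", "Standard", "Premium", "Enterprise"}
BILLING_CYCLES = {"Monthly", "Quarterly", "Annual"}
CANCEL_REASONS = {
    "Price", "Competitor", "Not using", "Missing features",
    "Support issues", "Other",
}


def parse_markdown_table(text):
    """Return a list of row dicts from the pipe-delimited lines in text."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the optional title line and any stray text
        if line.endswith("|"):
            line = line[:-1]
        cells = [cell.strip() for cell in line[1:].split("|")]
        if header is None:
            header = cells
        elif all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # separator row such as |---|---|
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def validate_row(row):
    """Return a list of problems found in one row; an empty list means it passes."""
    missing = [col for col in COLUMNS if col not in row]
    if missing:
        return [f"missing columns: {missing}"]
    try:
        tenure = int(row["tenure_months"])
        monthly = float(row["monthly_revenue"])
        total = float(row["total_revenue"])
        logins = int(row["num_logins_90d"])
        tickets = int(row["support_tickets"])
    except ValueError as exc:
        return [f"non-numeric value: {exc}"]

    problems = []
    if not 0 <= tenure <= 120:
        problems.append("tenure_months out of range 0-120")
    if row["plan_tier"] not in PLAN_TIERS:
        problems.append(f"unknown plan_tier {row['plan_tier']!r}")
    if row["billing_cycle"] not in BILLING_CYCLES:
        problems.append(f"unknown billing_cycle {row['billing_cycle']!r}")
    if monthly <= 0:
        problems.append("monthly_revenue must be positive")
    if total < monthly:
        problems.append("total_revenue must be >= monthly_revenue")
    if not 0 <= logins <= 500:
        problems.append("num_logins_90d out of range 0-500")
    if not 0 <= tickets <= 30:
        problems.append("support_tickets out of range 0-30")

    churned, reason = row["churned"], row["cancel_reason"]
    if churned == "Yes":
        if reason not in CANCEL_REASONS:
            problems.append(f"cancel_reason {reason!r} not allowed for churned=Yes")
    elif churned == "No":
        if reason not in ("", "-"):
            problems.append("cancel_reason must be '-' or blank for churned=No")
    else:
        problems.append(f"churned must be 'Yes' or 'No', got {churned!r}")
    return problems
```

A typical use would be to run `parse_markdown_table` on the final streamed text and keep only the rows for which `validate_row` returns an empty list.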

In [None]:
def transcribe_audio(audio_path):
    """Transcribe an audio file with Whisper (Hugging Face). Returns the text, or an empty string on failure."""
    if not audio_path:
        return ""
    try:
        out = whisper_pipeline(audio_path)
        return (out.get("text") or "").strip()
    except Exception as e:
        # Log the error instead of returning it, so it is never sent to the model as a request
        print(f"Transcription error: {e}")
        return ""

In [None]:
def generate_churn_stream(user_text):
    """Generate synthetic churn data with MODEL; stream via TextIteratorStreamer to Gradio Markdown."""
    user_text = (user_text or "").strip()
    messages = [
        {"role": "system", "content": CHURN_SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

    # For instruct/chat models: return_dict=True yields input_ids and attention_mask
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
    ).to(model.device)

    # # For base models
    # prompt = f"{CHURN_SYSTEM_PROMPT}\n\nUser: {user_text}\nAssistant:"
    # inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Extra keyword arguments given to the streamer are forwarded to tokenizer.decode
    streamer = TextIteratorStreamer(
        tokenizer,
        skip_prompt=True,
        skip_special_tokens=True,
    )
    # Run generation in a background thread so this generator can consume the stream
    thread = threading.Thread(
        target=model.generate,
        kwargs={**inputs, "max_new_tokens": MAX_TOKENS, "streamer": streamer},
    )
    thread.start()
    accumulated = ""
    for text_chunk in streamer:
        # Fallback in case an end-of-turn marker still slips through as plain text
        for marker in ("<|eot_id|>", "<|endoftext|>", "<|im_end|>"):
            text_chunk = text_chunk.replace(marker, "")
        accumulated += text_chunk
        yield accumulated
    thread.join()
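
The streaming pattern used above (a worker thread produces text while a generator yields the growing result) can be tried without loading a model. The sketch below is a stdlib-only analogy, not the transformers implementation: `fake_generate` stands in for `model.generate`, a `queue.Queue` stands in for the streamer, and both names are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def fake_generate(chunks, out_queue):
    """Stand-in for model.generate: push text chunks, then the sentinel."""
    for chunk in chunks:
        out_queue.put(chunk)
    out_queue.put(_DONE)


def stream_accumulated(chunks):
    """Yield the text accumulated so far each time a new chunk arrives."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(chunks, out_queue))
    worker.start()
    accumulated = ""
    while True:
        item = out_queue.get()  # blocks until the worker emits something
        if item is _DONE:
            break
        accumulated += item
        yield accumulated
    worker.join()


snapshots = list(stream_accumulated(["| id ", "| churned |", "\n"]))
print(snapshots)
```

Each yielded value is the full text so far rather than just the latest chunk, which matches how a Gradio output component replaces its content on every yield.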

In [None]:
def stream_from_voice(audio_path):
    """Transcribe audio, then stream the churn table into the Markdown output."""
    text = transcribe_audio(audio_path)
    # An empty transcript is passed through; generate_churn_stream handles empty input
    yield from generate_churn_stream(text)

In [None]:
with gr.Blocks(title="Synthetic Churn Dataset Generator") as demo:
    gr.Markdown("## Synthetic Churn Dataset Generator")
    gr.Markdown("Use **voice** or **text** to request a markdown table. Output streams in real time.")
    with gr.Row():
        with gr.Column():
            audio_input = gr.Audio(
                sources=["microphone"],
                type="filepath",
                label="Speak your request (e.g. 'Generate 20 rows of churn data')",
            )
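            btn_voice = gr.Button("Generate from voice")
        with gr.Column():
            # Text path promised in the description: typed requests skip transcription
            text_input = gr.Textbox(
                lines=3,
                label="Type your request (e.g. 'Generate 20 rows of churn data')",
            )
            btn_text = gr.Button("Generate from text")
    markdown_output = gr.Markdown(label="Generated churn data", min_height=200)

    btn_voice.click(fn=stream_from_voice, inputs=[audio_input], outputs=[markdown_output])
    btn_text.click(fn=generate_churn_stream, inputs=[text_input], outputs=[markdown_output])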

demo.launch(inbrowser=True)