# Synthetic Data Generator - Week 3 Assignment

Submitted by: Bharat Puri

## ✅ Summary
- Implemented a **synthetic data generator** that works with a **transformer model directly**, handling each inference step by hand.
- Used `AutoTokenizer` and `AutoModelForCausalLM` for manual inference.
- Demonstrated core transformer flow: Tokenize → Generate → Decode.
- Wrapped the logic in a **Gradio UI** for usability.
- Used a relatively small model (`gpt2-medium`) so that it runs on a free Colab CPU or GPU runtime.
- Fully aligned with Week 3 challenge: *“Write models that generate datasets and explore model APIs.”*




Basic pip installations

In [1]:
!pip install -q transformers gradio torch

Validate Google Colab T4 instance

In [2]:
# Let's check the GPU - it should be a Tesla T4

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if 'failed' in gpu_info:
  print('Not connected to a GPU')
else:
  print(gpu_info)
  if 'Tesla T4' in gpu_info:
    print("Success - Connected to a T4")
  else:
    print("NOT CONNECTED TO A T4")

Wed Oct 22 08:18:18 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
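The `!nvidia-smi` line in the check above is IPython shell syntax, so it only works inside a notebook. As a rough sketch, the same check can be written with the standard library so it also runs as a plain script. The function name `gpu_status` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def gpu_status() -> str:
    """Return nvidia-smi output, or a short message if no GPU tooling responds."""
    # No nvidia-smi binary on PATH means there is no NVIDIA driver to query.
    if shutil.which("nvidia-smi") is None:
        return "Not connected to a GPU"
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # A non-zero exit code or a "failed" message means the driver is not usable.
    if result.returncode != 0 or "failed" in result.stdout:
        return "Not connected to a GPU"
    return result.stdout


status = gpu_status()
print("Success - Connected to a T4" if "Tesla T4" in status else status)
```

On a machine without an NVIDIA driver this prints the "Not connected" message instead of raising an error.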
                                                

Import required Python libraries

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import gradio as gr

## Connecting to Hugging Face

You'll need to log in to the Hugging Face Hub if you haven't done so before.

1. If you haven't already, create a **free** Hugging Face account at https://huggingface.co and open Settings from the user menu at the top right. Then create a new API token with write permissions.  

**IMPORTANT:** When you create your Hugging Face API token, select WRITE permissions by clicking the WRITE tab; otherwise you may run into problems later. Choose "write", not "fine-grained".

2. Back here in Colab, click the "key" icon in the side panel on the left and add a new secret:  
  In the name field, enter `HF_TOKEN`  
  In the value field, enter your actual token: `hf_...`  
  Make sure the notebook access switch is turned ON.

3. Execute the cell below to log in. You'll need to do this in each of your Colab notebooks. Colab secrets are a convenient way to manage credentials without typing them into the notebook.

In [4]:
from huggingface_hub import login
from google.colab import userdata


hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)
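The login cell above depends on `google.colab`, which is only importable inside Colab. As a minimal sketch for running the notebook elsewhere, the token lookup could fall back to an environment variable. The helper name `resolve_hf_token` and the environment-variable fallback are assumptions for illustration, not part of the original notebook.

```python
import os


def resolve_hf_token(name="HF_TOKEN"):
    """Return the token from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        # Outside Colab: read the same name from the process environment.
        return os.environ.get(name)
    return userdata.get(name)


token = resolve_hf_token()
print("Token found" if token else "No token configured")
```

The returned value could then be passed to `login(...)` exactly as in the cell above.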

## Load Model and Tokenizer

We’ll use a relatively small model (`gpt2-medium`) so it stays light and fast, but we’ll handle every step manually, just like a full transformer workflow.

In [5]:
# Load lightweight model and tokenizer
model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

## Build a Prompt
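We create a simple function to structure the generation task. Note that `generate_names` further down builds its own few-shot prompt inline, so this helper is a simpler, example-free variant.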

In [6]:
def build_prompt(region, count):
    return (
        f"Generate {count} unique Indian names from the {region} region. "
        f"Include both male and female names. "
        f"Return the list numbered 1 to {count}."
    )
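To see exactly what string this helper produces, a quick check like the following may be useful. The function is repeated so the snippet runs on its own, and the region and count values are arbitrary examples.

```python
def build_prompt(region, count):
    return (
        f"Generate {count} unique Indian names from the {region} region. "
        f"Include both male and female names. "
        f"Return the list numbered 1 to {count}."
    )


# Adjacent f-strings are concatenated into one single-line prompt.
prompt = build_prompt("South India", 5)
print(prompt)
```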

## Tokenize → Generate → Decode

Here’s the key “transformer logic”:

1. Tokenize the input (convert text → tensor).
2. Generate tokens using the model.
3. Decode the tokens back to text.

In [7]:
def generate_names(region, count):
    # gr.Number may pass a float; the prompt text and the slice below need an int
    count = int(count)
    # Few-shot example prompt to guide GPT-2
    prompt = f"""
Generate {count} unique Indian names from the {region} region.
Each name should be realistic and common in that region.
Include both male and female names.
Here are some examples:

1. Arjun Kumar
2. Priya Sharma
3. Karthik Reddy
4. Meena Devi
5. Suresh Babu

Now continue with more names:
"""

    print("Prompt sent to model:\n", prompt)

    # --- Reuse the model and tokenizer loaded in the earlier cell ---
    # (reloading gpt2-medium on every button click would be slow)

    # --- Encode input ---
    inputs = tokenizer(prompt, return_tensors="pt")

    # --- Generate ---
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

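    # --- Decode only the newly generated tokens, not the echoed prompt ---
    # (otherwise the few-shot example names would be returned as "generated")
    prompt_len = inputs["input_ids"].shape[1]
    text = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)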

    # --- Extract possible names ---
    lines = text.split("\n")
    names = []
    for line in lines:
        if any(ch.isalpha() for ch in line):
            clean = line.strip()
            if "." in clean:
                clean = clean.split(".", 1)[1].strip()
            # skip empty leftovers (e.g. a sentence ending in "."), keep short name-like lines
            if clean and len(clean.split()) <= 3 and not clean.lower().startswith("generate"):
                names.append(clean)
    # remove duplicates and limit
    names = list(dict.fromkeys(names))[:count]

    if not names:
        names = ["Model didn't generate recognizable names. Try again."]

    return "\n".join(names)
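The name-extraction step can be tried without loading a model. The sketch below is a simplified, standalone variant of the filtering logic above, run on a hand-written sample string. The helper name `extract_names` and the sample text are hypothetical and are not output from the model.

```python
def extract_names(text, count):
    """Pull up to `count` unique short lines that look like names out of raw text."""
    names = []
    for line in text.split("\n"):
        clean = line.strip()
        # Drop everything up to the first ".", which removes a list number such as "3."
        if "." in clean:
            clean = clean.split(".", 1)[1].strip()
        # Keep non-empty lines of at most three words that contain a letter
        if clean and len(clean.split()) <= 3 and any(ch.isalpha() for ch in clean):
            names.append(clean)
    # dict.fromkeys removes duplicates while preserving first-seen order
    return list(dict.fromkeys(names))[:int(count)]


sample = (
    "6. Ravi Shankar\n"
    "7. Lakshmi Nair\n"
    "\n"
    "8. Ravi Shankar\n"
    "This line is far too long to be a name"
)
print(extract_names(sample, 5))
```

The duplicate entry and the long sentence are dropped, so only the two distinct names remain.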


## Gradio Interface

In [8]:
def run_app():
    with gr.Blocks() as demo:
        gr.Markdown("# 🇮🇳 Indian Name Generator using Transformers (Week 3 Assignment)")
        gr.Markdown("Generates synthetic Indian names using Hugging Face Transformers with manual tokenization and decoding.")

        region = gr.Dropdown(
            ["North India", "South India", "East India", "West India"],
            label="Select Region",
            value="North India"
        )
        count = gr.Number(label="Number of Names", value=10, precision=0)
        output = gr.Textbox(label="Generated Indian Names", lines=10)
        generate_btn = gr.Button("Generate Names")

        generate_btn.click(fn=generate_names, inputs=[region, count], outputs=output)
    demo.launch()

## Run App

In [9]:
run_app()

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://0876ef599f401ea674.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
