### Prompt Engineering Playground 🚀

This notebook demonstrates **prompt engineering** using **Hugging Face LLMs** and **Gradio**. You'll learn how to:

✅ Explore different prompt types  
✅ Control LLM outputs with parameters (temperature, top-p, max tokens)  
✅ Build an interactive **Gradio app** for live prompt experimentation  

By the end of this notebook, you'll have an **interactive playground** you can deploy on **Hugging Face Spaces**!
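To make the sampling parameters listed above concrete, here is a minimal, dependency-free sketch of what temperature and top-p do to a next-token distribution. The logits are made-up values and the helper names (`apply_temperature`, `top_p_filter`) are hypothetical. This illustrates the idea only; it is not the code path `transformers` uses internally.

```python
import math


def apply_temperature(logits, temperature):
    """Turn logits into probabilities with a softmax over logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Hypothetical scores for four candidate tokens
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.5, 1.0, 1.5):
    probs = apply_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")

print("top_p=0.9 keeps:", top_p_filter(apply_temperature(logits, 1.0), 0.9))
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution. Top-p then discards the low-probability tail before sampling. Max tokens only caps how many tokens are generated.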


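import os

import gradio as gr
import torch
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM


# =====================================
# STEP 0: Hugging Face Login (Token Read From the Environment)
# =====================================
# Never hard-code an access token in a notebook: anyone who can read the file can use it.
# Set HF_TOKEN in your shell (or as a Space secret) before running this cell.
HUGGINGFACE_ACCESS_TOKEN = os.environ.get("HF_TOKEN")

# Log in only if a token is available; gpt2 and microsoft/phi-2 are public models
if HUGGINGFACE_ACCESS_TOKEN:
    login(token=HUGGINGFACE_ACCESS_TOKEN)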

# =====================================
# STEP 1: Define Available Models
# =====================================
MODEL_NAMES = {
    "GPT-2 Small (124M)": "gpt2",                # OpenAI GPT-2 Small
    "Phi-2 (2.7B)": "microsoft/phi-2"            # Microsoft Phi-2
}

# Cache for loaded models
LOADED_MODELS = {}

# =====================================
# STEP 2: Load Models Before Gradio Starts
# =====================================
def preload_models():
    """Load all models and tokenizers onto CPU and cache them."""
    for model_name, model_id in MODEL_NAMES.items():
        print(f"\n🔄 Loading {model_name} ({model_id}) on CPU...")

        tokenizer = AutoTokenizer.from_pretrained(model_id)

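        # float16 halves memory compared with float32. If your PyTorch CPU build
        # lacks float16 kernels for some ops, switch to torch.float32.
        # device_map already places every weight on CPU, so no extra .to("cpu") is needed.
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,       # Reduce memory usage
            low_cpu_mem_usage=True,          # Lower peak RAM while loading weights
            device_map={"": "cpu"}           # Force to CPU
        )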

        LOADED_MODELS[model_name] = (tokenizer, model)

        print(f"✅ {model_name} loaded successfully on CPU.")

# Preload the models before starting Gradio
preload_models()

# =====================================
# STEP 3: Chat Function (No Loading Inside!)
# =====================================
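def chat_with_model(message, history, model_name, max_tokens, temperature, top_p):
    """Generate a reply to `message` and return the updated history twice
    (once for the Chatbot display, once for the State)."""

    # Retrieve tokenizer & model from preloaded cache
    tokenizer, model = LOADED_MODELS[model_name]

    # Build conversation context from history
    context = ""
    for user_msg, bot_msg in history:
        context += f"User: {user_msg}\nAssistant: {bot_msg}\n"

    # Add current message
    prompt = context + f"User: {message}\nAssistant:"

    # Tokenize prompt input (tensors are created on CPU, where the model lives)
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate response from model
    outputs = model.generate(
        **inputs,
        max_new_tokens=int(max_tokens),       # Slider values may arrive as floats
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id   # Set explicitly to silence the pad_token_id notice
    )

    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) is fragile because decoding may not reproduce the prompt exactly.
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

    # Add exchange to history (build a new list instead of mutating the old state)
    history = history + [(message, response)]

    return history, history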

# =====================================
# STEP 4: Gradio UI
# =====================================
with gr.Blocks() as demo:
    
    gr.Markdown("## 🤖 GPT-2 Small & Phi-2 Chatbot (Preloaded on CPU)")

    # Model selector dropdown
    model_selector = gr.Dropdown(
        choices=list(MODEL_NAMES.keys()),
        value="GPT-2 Small (124M)",
        label="Choose Model"
    )
    
    # Chatbot history window
    chatbot = gr.Chatbot(label="Chat History")
    
    # User input field (Enter to submit)
    user_input = gr.Textbox(
        label="Your Message",
        placeholder="Ask me anything...",
        lines=1  # Enter submits on single-line textboxes in Gradio
    )
    
    # Generation sliders
    with gr.Row():
        max_tokens_slider = gr.Slider(10, 512, value=200, step=10, label="Max Tokens")
        temperature_slider = gr.Slider(0.1, 1.5, value=0.7, step=0.1, label="Temperature")
        top_p_slider = gr.Slider(0.1, 1.0, value=0.9, step=0.1, label="Top-p")

    # Clear chat button
    clear_button = gr.Button("Clear Chat")
    
    # Chat history state (conversation memory)
    state = gr.State([])

    # Submit message event (user presses Enter)
    user_input.submit(
        chat_with_model,
        inputs=[
            user_input,
            state,
            model_selector,
            max_tokens_slider,
            temperature_slider,
            top_p_slider
        ],
        outputs=[chatbot, state]
    )

    # Clear chat event
    clear_button.click(lambda: ([], []), None, [chatbot, state])

# Launch Gradio interface
demo.launch()



🔄 Loading GPT-2 Small (124M) (gpt2) on CPU...


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


✅ GPT-2 Small (124M) loaded successfully on CPU.

🔄 Loading Phi-2 (2.7B) (microsoft/phi-2) on CPU...


✅ Phi-2 (2.7B) loaded successfully on CPU.


* Running on local URL:  http://127.0.0.1:7862

To create a public link, set `share=True` in `launch()`.




Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
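
The chat function above concatenates the entire history into every prompt, so long conversations will eventually exceed the model's context window (1024 tokens for GPT-2). Base models such as GPT-2 also tend to keep writing further `User:` turns after their reply. The sketch below shows one way to handle both. The helper names and the `max_chars` budget are hypothetical, and a character count is only a rough stand-in for a real token count.

```python
def build_prompt(history, message, max_chars=2000):
    """Format (user, assistant) pairs plus the new message, dropping the
    oldest turns until the prompt fits within max_chars."""
    tail = f"User: {message}\nAssistant:"
    turns = [f"User: {u}\nAssistant: {b}\n" for u, b in history]
    # max_chars is an assumed budget; a token-based budget would be more precise
    while turns and sum(len(t) for t in turns) + len(tail) > max_chars:
        turns.pop(0)  # discard the oldest exchange first
    return "".join(turns) + tail


def first_reply_only(generated_text):
    """Cut the generated text where the model starts inventing a new 'User:' turn."""
    return generated_text.split("\nUser:", 1)[0].strip()


history = [("hi", "hello"), ("how are you", "fine")]
print(build_prompt(history, "bye"))
print(build_prompt(history, "bye", max_chars=60))
print(first_reply_only(" Sure!\nUser: next question"))
```

In `chat_with_model`, `build_prompt` could replace the manual context loop, and `first_reply_only` could be applied to the decoded response before it is appended to the history.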
