# Module 10: Introduction to Hugging Face Transformers

This notebook introduces Hugging Face's ecosystem and demonstrates how to:
1. Set up the environment for local models
2. Load your first model
3. Perform basic inference
4. Compare local models with API-based models

## 1. Install Required Libraries

First, let's install the necessary libraries:

In [None]:
# Uncomment and run this cell if you need to install the libraries
# !pip install torch transformers accelerate gradio python-dotenv
# !pip install bitsandbytes  # Optional: for 4-bit quantization
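
Before moving on, it can help to confirm which of these packages are actually installed. The sketch below is not part of the original notebook: it relies only on the standard library's `importlib.metadata`, and the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip commands above
PACKAGES = ["torch", "transformers", "accelerate", "gradio", "python-dotenv", "bitsandbytes"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:15} {ver or 'not installed'}")
```

A missing `bitsandbytes` entry is expected if you skipped the optional install.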

## 2. Introduction to Hugging Face's Ecosystem

In [None]:
def introduction_to_huggingface():
    """Print an introduction to Hugging Face's ecosystem"""
    print("\n" + "="*80)
    print("INTRODUCTION TO HUGGING FACE TRANSFORMERS".center(80))
    print("="*80)
    
    print("""
Hugging Face is an AI community and platform that provides:

1. 🤗 Model Hub: A repository of hundreds of thousands of pre-trained models for NLP,
   computer vision, audio processing, and more.

2. 🔧 Transformers Library: A Python library that provides APIs and tools to easily 
   download and train state-of-the-art pretrained models.

3. 📚 Datasets: A library and platform for easily sharing and accessing datasets.

4. 🧪 Spaces: A platform for hosting ML demo apps.

5. 🧠 AutoTrain: Tools for training models without writing code.

Key advantages of using Hugging Face for local models:
- Run models on your own hardware without API costs
- Full control over model parameters and behavior
- Privacy - data doesn't leave your machine
- Ability to fine-tune models for specific use cases
- No internet connection required for inference once the model is downloaded
    """)
    print("="*80 + "\n")

introduction_to_huggingface()

## 3. Setting Up the Environment

In [None]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Check if CUDA is available
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")

# Print PyTorch version
print(f"PyTorch version: {torch.__version__}")

# Print GPU info if available
if DEVICE == "cuda":
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Available GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
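
The `load_dotenv()` call above only matters if a `.env` file defines something the libraries read. One common case, assumed here rather than taken from this notebook, is an access token for gated models, which `huggingface_hub` looks for in the `HF_TOKEN` environment variable. A minimal sketch with a hypothetical helper `token_status` that reports whether a token is set without printing it:

```python
import os

def token_status(env=None):
    """Describe whether HF_TOKEN is set, without revealing its value."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        return "HF_TOKEN not set (fine for public models such as TinyLlama)"
    return f"HF_TOKEN set ({len(token)} characters)"

print(token_status())
```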

## 4. Loading a Model

Let's load a small model first. We'll use TinyLlama, a 1.1B-parameter chat model that is small enough to run on most consumer hardware.
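
To see why a model of this size fits on modest hardware, a rough estimate of the memory needed for the weights alone is the parameter count times the bytes per parameter. The sketch below is an approximation: it ignores activations, the KV cache, and framework overhead, and the helper name `weight_memory_gb` is illustrative.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "4-bit": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

TINYLLAMA_PARAMS = 1.1e9  # parameter count quoted above

for precision in BYTES_PER_PARAM:
    print(f"{precision:8} ~{weight_memory_gb(TINYLLAMA_PARAMS, precision):.2f} GB")
```

Actual usage will be higher than these figures, so treat them as a lower bound when comparing against the GPU memory printed earlier.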

In [None]:
def load_model(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_4bit=True):
    """Load a model from Hugging Face Hub"""
    print(f"\nLoading model: {model_name}")
    print("This may take a few moments depending on your internet connection and the model size...")
    
    try:
        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        
        # Load model with quantization if requested
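        if use_4bit and DEVICE == "cuda":
            # Requires the optional bitsandbytes package. A BitsAndBytesConfig
            # replaces the deprecated load_in_4bit=True keyword argument.
            from transformers import BitsAndBytesConfig

            model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="auto",
                torch_dtype=torch.float16,
                quantization_config=BitsAndBytesConfig(
                    load_in_4bit=True,
                    bnb_4bit_compute_dtype=torch.float16
                )
            )
            print("Model loaded with 4-bit quantization")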
        else:
            model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="auto" if DEVICE == "cuda" else None,
                torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32
            )
            print(f"Model loaded in {'16-bit' if DEVICE == 'cuda' else '32-bit'} precision")
        
        return model, tokenizer
    
    except Exception as e:
        print(f"Error loading model: {str(e)}")
        print("\nTroubleshooting tips:")
        print("1. Check your internet connection")
        print("2. Verify the model name is correct")
        print("3. Try a smaller model if you're running out of memory")
        print("4. Make sure you have the latest transformers library")
        return None, None

# Load a small model
model, tokenizer = load_model("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_4bit=True)

## 5. Basic Inference

Now let's perform basic inference with our loaded model.

In [None]:
def basic_inference(model, tokenizer, prompt, max_length=512, temperature=0.7):
    """Perform basic inference with a loaded model"""
    if model is None or tokenizer is None:
        return "Model or tokenizer not loaded correctly."
    
    try:
        # Format the prompt based on model type
        if "llama" in model.config.architectures[0].lower():
            # Zephyr-style chat format used by TinyLlama-Chat; other Llama-family
            # chat models may expect a different template
            formatted_prompt = f"<|user|>\n{prompt}</s>\n<|assistant|>\n"
        elif "mistral" in model.config.architectures[0].lower():
            # Format for Mistral models
            formatted_prompt = f"[INST] {prompt} [/INST]"
        elif "phi" in model.config.architectures[0].lower():
            # Format for Phi models
            formatted_prompt = f"User: {prompt}\nAssistant:"
        else:
            # Default format
            formatted_prompt = prompt
        
        # Tokenize the prompt
        inputs = tokenizer(formatted_prompt, return_tensors="pt")
        
        # Move inputs to the device the model was loaded on
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        
        # Generate text. Passing **inputs forwards the attention mask as well.
        # Note: max_length counts prompt tokens plus generated tokens.
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=max_length,
                temperature=temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        
        # Decode the generated text
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Extract only the assistant's response
        if "<|assistant|>" in formatted_prompt:
            response = generated_text.split("<|assistant|>")[-1].strip()
        elif "[/INST]" in formatted_prompt:
            response = generated_text.split("[/INST]")[-1].strip()
        elif "Assistant:" in generated_text:
            response = generated_text.split("Assistant:")[-1].strip()
        else:
            response = generated_text.replace(prompt, "").strip()
        
        return response
    
    except Exception as e:
        return f"Error during inference: {str(e)}"

# Test the model with a few prompts
test_prompts = [
    "What are the main features of Python?",
    "Write a short poem about artificial intelligence.",
    "Explain quantum computing to a 10-year-old."
]

for i, prompt in enumerate(test_prompts):
    print(f"\n\nPrompt {i+1}: {prompt}")
    print("-" * 50)
    response = basic_inference(model, tokenizer, prompt)
    print(f"Response: {response}")
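
The hand-written prompt formats in `basic_inference` can drift from what a given checkpoint was trained on. Tokenizers in recent versions of `transformers` expose `apply_chat_template`, which renders a list of messages with the template shipped alongside the model. A minimal sketch, assuming the tokenizer provides such a template; `build_prompt` is a hypothetical helper, not part of the notebook:

```python
def build_prompt(tokenizer, user_message):
    """Render a single-turn prompt using the tokenizer's own chat template."""
    messages = [{"role": "user", "content": user_message}]
    # Base (non-chat) checkpoints may ship without a template; fall back to raw text
    if getattr(tokenizer, "chat_template", None) is None:
        return user_message
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return a string rather than token ids
        add_generation_prompt=True,  # end with the assistant-turn marker
    )

# Usage, assuming `tokenizer` was returned by load_model above:
# print(build_prompt(tokenizer, "What are the main features of Python?"))
```

Printing the rendered prompt is a quick way to check it against the format strings used in `basic_inference`.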

## 6. Experimenting with Generation Parameters

Let's see how different parameters affect the generation:

In [None]:
# Test with different temperatures
prompt = "Write a creative story about a robot who discovers emotions."
print(f"Prompt: {prompt}\n")

temperatures = [0.3, 0.7, 1.2]
for temp in temperatures:
    print(f"\nTemperature: {temp}")
    print("-" * 50)
    response = basic_inference(model, tokenizer, prompt, temperature=temp)
    print(f"Response: {response}")
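
Temperature works by dividing the model's logits before they are turned into probabilities, so low values concentrate probability on the top token and high values flatten the distribution. The sketch below illustrates the arithmetic with made-up logits in plain Python; it is a simplified illustration of the idea rather than the library's own code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for temp in [0.3, 0.7, 1.2]:
    probs = softmax_with_temperature(example_logits, temp)
    print(f"T={temp}: " + ", ".join(f"{p:.3f}" for p in probs))
```

The first probability shrinks as the temperature rises, which is why higher settings produce more varied text.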

## 7. Comparing Local Models vs. API-Based Models

In [None]:
def compare_local_vs_api():
    """Print a comparison between local models and API-based models"""
    print("\n" + "="*80)
    print("LOCAL MODELS VS. API-BASED MODELS".center(80))
    print("="*80)
    
    print("""
┌─────────────────────┬─────────────────────────┬─────────────────────────┐
│                     │ Local Models            │ API-Based Models        │
├─────────────────────┼─────────────────────────┼─────────────────────────┤
│ Cost                │ One-time hardware cost  │ Pay per token/request   │
│ Privacy             │ Data stays on device    │ Data sent to servers    │
│ Setup Complexity    │ Higher                  │ Lower                   │
│ Maintenance         │ Manual updates needed   │ Automatic updates       │
│ Performance         │ Depends on hardware     │ Consistent              │
│ Customization       │ Full control            │ Limited by API          │
│ Scaling             │ Limited by hardware     │ Easy to scale           │
│ Offline Usage       │ Yes                     │ No                      │
│ Model Size Options  │ Limited by hardware     │ Wide range available    │
│ Latency             │ No network round trip   │ Adds network round trip │
└─────────────────────┴─────────────────────────┴─────────────────────────┘

When to use local models:
- Privacy-sensitive applications
- Offline environments
- Cost-sensitive long-running applications
- When you need full control over the model

When to use API-based models:
- Quick prototyping
- Limited local hardware
- Need for state-of-the-art large models
- Simplicity is prioritized over customization
    """)
    print("="*80 + "\n")

compare_local_vs_api()
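
The cost row can be made concrete with a break-even estimate: the number of tokens at which a one-time hardware purchase equals cumulative API spend. All figures below are hypothetical placeholders, and the sketch ignores electricity, maintenance, and differences in model quality.

```python
def break_even_tokens(hardware_cost, api_price_per_million_tokens):
    """Tokens at which cumulative API spend equals the hardware cost."""
    if api_price_per_million_tokens <= 0:
        raise ValueError("API price must be positive")
    return hardware_cost * 1_000_000 / api_price_per_million_tokens

# Hypothetical numbers: a $1,500 GPU versus an API charging $2 per million tokens
tokens = break_even_tokens(hardware_cost=1500, api_price_per_million_tokens=2.0)
print(f"Break-even at about {tokens / 1e6:,.0f} million tokens")
```

Below the break-even volume the API is cheaper on these assumptions; above it, local hardware starts to pay for itself.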

## 8. Creating a Simple Gradio Interface (Optional)

If you want to create a user interface for your model, you can use Gradio:

In [None]:
import gradio as gr

def create_gradio_interface():
    """Create a simple Gradio interface for the model"""
    def generate(prompt, temperature=0.7, max_length=512):
        # Sliders may deliver floats; generation expects an integer length
        return basic_inference(model, tokenizer, prompt, max_length=int(max_length), temperature=temperature)
    
    demo = gr.Interface(
        fn=generate,
        inputs=[
            gr.Textbox(lines=4, placeholder="Enter your prompt here...", label="Prompt"),
            gr.Slider(0.1, 1.5, value=0.7, label="Temperature"),
            gr.Slider(64, 1024, value=512, step=64, label="Max Length")
        ],
        outputs=gr.Textbox(label="Generated Text"),
        title="Local LLM Demo",
        description="Generate text using a local language model"
    )
    return demo

# Uncomment to create and launch the interface
# interface = create_gradio_interface()
# interface.launch()

## 9. Conclusion

In this notebook, we've explored:

1. Hugging Face's ecosystem and its components
2. How to set up the environment for local models
3. Loading a model from Hugging Face Hub
4. Performing basic inference with the model
5. Experimenting with different generation parameters
6. Comparing local models with API-based models

Next steps:
- Try different models from the Hugging Face Hub
- Experiment with fine-tuning models on your own data
- Explore more advanced inference parameters
- Integrate local models into your applications