# 🚀 HuggingFace LLM Setup Guide for Course Notebooks

## Overview

This notebook shows how to call hosted models through the **HuggingFace Inference API**, using LangChain's `HuggingFaceEndpoint` class, instead of running models locally or calling OpenAI.

### Why HuggingFace Inference API?

- ✅ **No Local GPU Required** - Runs on any machine
- ✅ **Free Tier Available** - Great for learning
- ✅ **Production Ready** - Scalable and reliable
- ✅ **Many Models** - Llama-2, Mistral, Flan-T5, and more

### Prerequisites

1. HuggingFace Account: [Sign up here](https://huggingface.co/join)
2. HuggingFace Token: [Create here](https://huggingface.co/settings/tokens)
3. Add the token to a `.env` file in the project root (`HUGGINGFACE_TOKEN=hf_...`) and keep that file out of version control

---

## 1️⃣ Setup: Load Environment Variables

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Get HuggingFace token
HF_TOKEN = os.getenv("HUGGINGFACE_TOKEN")

if not HF_TOKEN:
    raise ValueError(
        "❌ HUGGINGFACE_TOKEN not found!\n"
        "Please ensure .env file exists in project root with:\n"
        "HUGGINGFACE_TOKEN=hf_your_token_here"
    )

print("✅ HuggingFace token loaded successfully!")
# Show only the last 4 characters: notebook output is saved with the file and may be shared
print(f"   Token: ...{HF_TOKEN[-4:]}")

## 2️⃣ Method 1: Using LangChain HuggingFaceEndpoint (Recommended)

In [None]:
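# Install required packages if not already installed
# (chromadb and sentence-transformers are needed by the embeddings and RAG cells below)
# !pip install langchain-huggingface langchain-community python-dotenv chromadb sentence-transformers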

In [None]:
from langchain_huggingface import HuggingFaceEndpoint

# Initialize HuggingFace LLM
llm = HuggingFaceEndpoint(
    repo_id=os.getenv("HF_LLM_MODEL", "meta-llama/Llama-2-7b-chat-hf"),
    huggingfacehub_api_token=HF_TOKEN,
    temperature=float(os.getenv("HF_TEMPERATURE", "0.7")),
    max_new_tokens=int(os.getenv("HF_MAX_TOKENS", "512")),
    timeout=int(os.getenv("HF_REQUEST_TIMEOUT", "60")),
)

print("✅ HuggingFace LLM initialized!")
print(f"   Model: {llm.repo_id}")
print(f"   Temperature: {llm.temperature}")
print(f"   Max tokens: {llm.max_new_tokens}")

## 3️⃣ Test the LLM

In [None]:
# Test with a simple prompt
response = llm.invoke("What is machine learning? Explain in one sentence.")
print("🤖 LLM Response:")
print(response)

## 4️⃣ Using with LangChain Chains

In [None]:
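# Import from langchain_core: the older langchain.prompts / langchain.schema paths are deprecated
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser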

# Create a prompt template
prompt = ChatPromptTemplate.from_template(
    "You are a helpful manufacturing expert. Answer this question: {question}"
)

# Create a chain
chain = prompt | llm | StrOutputParser()

# Test the chain
result = chain.invoke({"question": "What are common CNC machine maintenance tasks?"})
print("🔗 Chain Response:")
print(result)

## 5️⃣ Alternative Models You Can Try

Update your `.env` file with different `HF_LLM_MODEL` values:

### Recommended Models:

1. **Llama-2-7b-chat-hf** (Default)
   - Model: `meta-llama/Llama-2-7b-chat-hf`
   - Best for: General conversation and Q&A
   - Requires: HuggingFace account agreement to terms

2. **Mistral-7B-Instruct-v0.2**
   - Model: `mistralai/Mistral-7B-Instruct-v0.2`
   - Best for: Instruction following, coding
   - Faster inference times

3. **Flan-T5-Large**
   - Model: `google/flan-t5-large`
   - Best for: Quick responses, lightweight
   - No special access needed

4. **Zephyr-7B-Beta**
   - Model: `HuggingFaceH4/zephyr-7b-beta`
   - Best for: High-quality responses
   - Great alternative to Llama-2
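
The environment variables used in this notebook map directly onto `HuggingFaceEndpoint` arguments. The following is a minimal sketch of reading them in one place, using the same defaults as the initialization cell in section 2. The names `read_llm_settings` and `DEFAULTS` are hypothetical helpers introduced for illustration; they are not part of LangChain.

```python
import os

# Defaults mirror the values used when the LLM is initialized in this notebook.
DEFAULTS = {
    "HF_LLM_MODEL": "meta-llama/Llama-2-7b-chat-hf",
    "HF_TEMPERATURE": "0.7",
    "HF_MAX_TOKENS": "512",
    "HF_REQUEST_TIMEOUT": "60",
}


def read_llm_settings(env=None):
    """Return keyword arguments for HuggingFaceEndpoint built from env vars.

    `env` is any mapping of strings; it defaults to os.environ.
    """
    env = os.environ if env is None else env

    def get(key):
        return env.get(key, DEFAULTS[key])

    settings = {
        "repo_id": get("HF_LLM_MODEL"),
        "temperature": float(get("HF_TEMPERATURE")),
        "max_new_tokens": int(get("HF_MAX_TOKENS")),
        "timeout": int(get("HF_REQUEST_TIMEOUT")),
    }
    if settings["max_new_tokens"] <= 0:
        raise ValueError("HF_MAX_TOKENS must be a positive integer")
    return settings


# Example: switch to Mistral by overriding a single variable.
print(read_llm_settings({"HF_LLM_MODEL": "mistralai/Mistral-7B-Instruct-v0.2"}))
```

With the token loaded earlier, the result could be passed on as `HuggingFaceEndpoint(huggingfacehub_api_token=HF_TOKEN, **read_llm_settings())`, so that changing models only requires editing `.env`.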

In [None]:
# Example: Try Mistral model
mistral_llm = HuggingFaceEndpoint(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    huggingfacehub_api_token=HF_TOKEN,
    temperature=0.7,
    max_new_tokens=256,
)

response = mistral_llm.invoke("Explain what RAG is in AI.")
print("🔮 Mistral Response:")
print(response)

## 6️⃣ For RAG: HuggingFace Embeddings

In [None]:
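# HuggingFaceEmbeddings runs the model locally via sentence-transformers;
# it does not call the Inference API. The langchain_community import path is deprecated.
from langchain_huggingface import HuggingFaceEmbeddings

# Initialize embeddings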
embeddings = HuggingFaceEmbeddings(
    model_name=os.getenv("HF_EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
    model_kwargs={'device': 'cpu'}  # Works on CPU!
)

# Test embeddings
test_text = "This is a test sentence for embeddings."
embedding = embeddings.embed_query(test_text)

print("✅ Embeddings initialized!")
print(f"   Model: {embeddings.model_name}")
print(f"   Embedding dimension: {len(embedding)}")
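
For retrieval, what matters is how close two embedding vectors are to each other. A minimal sketch of cosine similarity in plain Python follows. `cosine_similarity` is a helper written for this illustration, the toy vectors are made up, and a vector store such as Chroma may use a different distance metric by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-D vectors stand in for real embeddings here.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

With the `embeddings` object from the cell above, the same helper could compare two real sentences, for example `cosine_similarity(embeddings.embed_query("coolant check"), embeddings.embed_query("tool calibration"))`.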

## 7️⃣ Complete RAG Example

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document  # langchain.schema is a deprecated import path

# Sample documents
docs = [
    Document(page_content="CNC machines require regular maintenance including coolant checks and tool calibration."),
    Document(page_content="Safety protocols require all operators to wear protective equipment when operating machinery."),
    Document(page_content="Quality control involves visual inspection and dimensional measurements of manufactured parts."),
]

# Create vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name="test_rag"
)

# Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# RAG prompt
rag_prompt = ChatPromptTemplate.from_template("""
Answer the question based on the following context:

Context: {context}

Question: {question}

Answer:
""")

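# RAG chain
from langchain_core.runnables import RunnablePassthrough

def format_docs(retrieved_docs):
    # Join page contents so the prompt receives plain text, not Document reprs
    return "\n\n".join(doc.page_content for doc in retrieved_docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)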

# Test RAG
question = "What maintenance is needed for CNC machines?"
answer = rag_chain.invoke(question)

print(f"❓ Question: {question}")
print(f"\n✅ RAG Answer:\n{answer}")

## 8️⃣ Migration from ChatOpenAI

### ❌ Old Code (OpenAI):
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.7,
    openai_api_key=os.getenv("OPENAI_API_KEY")
)
```

### ✅ New Code (HuggingFace):
```python
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    temperature=0.7,
    max_new_tokens=512,
    huggingfacehub_api_token=os.getenv("HUGGINGFACE_TOKEN")
)
```

### Key Differences:

1. **Import**: `langchain_openai` → `langchain_huggingface`
2. **Class**: `ChatOpenAI` → `HuggingFaceEndpoint`
3. **Parameter**: `model` → `repo_id`
4. **Parameter**: `openai_api_key` → `huggingfacehub_api_token`
5. **Parameter**: Add `max_new_tokens` (replaces `max_tokens`)

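**Most chain code stays the same.** LCEL pipelines such as `prompt | llm | StrOutputParser()` work unchanged. One caveat: `ChatOpenAI` is a chat model whose `invoke` returns a message object, while `HuggingFaceEndpoint` is a text-completion LLM whose `invoke` returns a plain string. Code that reads `response.content`, or agents that rely on tool calling, may need the `ChatHuggingFace` wrapper from `langchain_huggingface`.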
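
When `invoke` is called directly during a migration, the result type depends on the model class: a chat model returns a message object with a `.content` attribute, while a completion LLM returns a plain `str`. Inside a chain, `StrOutputParser` already smooths this over. For direct calls, a small helper along the following lines is one option. `to_text` and `FakeMessage` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


def to_text(result):
    """Return the text of an LLM result, whether it is a str or a message-like object."""
    if isinstance(result, str):
        return result
    content = getattr(result, "content", None)
    if isinstance(content, str):
        return content
    raise TypeError(f"cannot extract text from {type(result).__name__}")


@dataclass
class FakeMessage:
    """Stand-in for a chat message object; used only to demonstrate the helper."""
    content: str


print(to_text("plain completion"))
print(to_text(FakeMessage(content="chat reply")))
```

Wrapping calls as `to_text(llm.invoke(prompt))` lets the surrounding code stay the same while the model class is swapped.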

## 9️⃣ Troubleshooting

### Issue: "Token not found"
**Solution**: Ensure `.env` file exists in project root with `HUGGINGFACE_TOKEN=hf_...`

### Issue: "Model loading timeout"
**Solution**: The first request can take 30-60 seconds while the model loads. Increase the `timeout` parameter, or wait and send the request again.

### Issue: "Rate limit exceeded"
**Solution**: The HuggingFace free tier is rate-limited. Wait a few minutes before retrying, or upgrade to Pro for higher limits.
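
Load timeouts and rate-limit errors are usually transient, so retrying after a pause often succeeds. The following is a minimal sketch of retrying with exponential backoff. `invoke_with_retry` is a hypothetical helper, not a LangChain API, and the demo uses a fake call and a fake `sleep` so it runs instantly and offline.

```python
import time


def invoke_with_retry(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `call()`; on failure wait base_delay * 2**attempt seconds and retry.

    The last failure is re-raised once `attempts` calls have been made.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:  # broad on purpose for the sketch; narrow this in real code
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo with a fake endpoint that fails twice, and a fake sleep that records delays.
outcomes = iter([TimeoutError("model loading"), TimeoutError("model loading"), "ok"])
delays = []


def flaky_call():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = invoke_with_retry(flaky_call, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In a notebook, the real call would be wrapped as `invoke_with_retry(lambda: llm.invoke("What is machine learning?"))`, leaving `sleep` at its default.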

### Issue: "Model requires agreement to terms"
**Solution**: For Llama-2, go to the model page and accept the license terms while logged in with the account that owns your token:
https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

### Issue: "Slow inference"
**Solution**: Try smaller/faster models like `google/flan-t5-large` or `mistralai/Mistral-7B-Instruct-v0.2`

## 🔟 Best Practices

1. **Start with smaller models** - `flan-t5-large` for testing
2. **Use timeouts** - Set reasonable timeout values (30-60 seconds)
3. **Cache results** - Store expensive LLM calls when possible
4. **Monitor usage** - Check HuggingFace dashboard for API usage
5. **Error handling** - Wrap LLM calls in try-except blocks
6. **Temperature tuning** - Lower (0.3-0.5) for factual, higher (0.7-0.9) for creative
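
The caching advice above can be sketched with a small in-memory wrapper. `PromptCache` is a hypothetical class written for this illustration, and a fake generator stands in for `llm.invoke` so the example runs offline. Note that with a non-zero temperature, caching freezes whichever sample was generated first.

```python
class PromptCache:
    """In-memory cache: each distinct prompt is sent to `generate` only once."""

    def __init__(self, generate):
        self._generate = generate  # any callable: prompt (str) -> response (str)
        self._store = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, prompt):
        if prompt in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[prompt] = self._generate(prompt)
        return self._store[prompt]


# A fake generator stands in for llm.invoke and records how often it is called.
calls = []


def fake_generate(prompt):
    calls.append(prompt)
    return f"answer to: {prompt}"


cached = PromptCache(fake_generate)
cached("What is RAG?")
cached("What is RAG?")
print(cached.hits, cached.misses, len(calls))  # 1 1 1
```

In a notebook this could be used as `cached_llm = PromptCache(llm.invoke)`. LangChain also provides its own LLM cache through `set_llm_cache` in `langchain_core.globals`, which may be a better fit once the cache needs to be shared across chains.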

---

## ✅ Summary

You now know how to:
- ✅ Set up HuggingFace tokens
- ✅ Use HuggingFaceEndpoint in LangChain
- ✅ Create chains and RAG systems
- ✅ Migrate from OpenAI to HuggingFace
- ✅ Troubleshoot common issues

**Use this pattern in all course notebooks!** 🚀