<a href="https://colab.research.google.com/github/amir3x0/LLM-mini_projects/blob/main/Hugging_Face_HowToUse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- Hugging Face: https://huggingface.co/models

# INSTALL KEY LIBRARIES, OBTAIN HF ACCESS TOKENS, & GPU CHECK

**Installing Libraries:**

*   `transformers`: The core Hugging Face library for models and tokenizers.
*   `accelerate`: Helps run models efficiently across different hardware (like GPUs) and use less memory.
*   `bitsandbytes`: Enables model quantization (loading weights in 4-bit or 8-bit), which sharply reduces memory usage. This is what lets multi-billion-parameter models fit on free Colab GPUs.
*   `torch`: The underlying deep learning framework (PyTorch).
*   `pypdf`: A library to easily extract text from PDF files.
*   `gradio`: A library for building the web UI used at the end of this notebook.


In [None]:
print("Installing necessary libraries...")
!pip install -q transformers accelerate bitsandbytes torch pypdf gradio
print("Libraries installed successfully!")

Installing necessary libraries...
Libraries installed successfully!


In [None]:
import torch  # PyTorch, the backend for transformers
import pypdf  # For reading PDFs
import gradio as gr  # For building the UI
from IPython.display import display, Markdown  # For nicer printing in notebooks
print("Core libraries imported.")

Core libraries imported.


**Hugging Face Hub Login Steps:**

1.  Go to [huggingface.co](https://huggingface.co/).
2.  Sign up or log in.
3.  Click your profile picture (top right), then go to Settings -> Access Tokens.
4.  Create a new token (a 'read' role is usually sufficient).
5.  Copy the generated token. **Treat this like a password!**
6.  **Log in within Colab:** Use the `notebook_login()` helper from the `huggingface_hub` library and paste the token when prompted.

In [None]:
import os
from huggingface_hub import login, notebook_login
print("Attempting Hugging Face login...")

notebook_login()
print("Login successful (or token already present)!")

Attempting Hugging Face login...


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Login successful (or token already present)!
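
For runs where nobody is available to paste a token into the widget, the token can instead be read from an environment variable. The following is a minimal sketch, not part of the original notebook: the helper names `get_hf_token` and `login_from_env` are hypothetical, and it assumes the token was stored beforehand under the name `HF_TOKEN`.

```python
import os
from typing import Optional


def get_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in the given environment variable, or None if unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_from_env(env_var: str = "HF_TOKEN") -> bool:
    """Log in with a token taken from the environment; return False when no token is set."""
    token = get_hf_token(env_var)
    if token is None:
        return False
    # Imported lazily so get_hf_token() stays usable without huggingface_hub installed.
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_from_env()` in place of `notebook_login()` keeps the token out of the notebook source, which matters if the notebook is shared.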


In [None]:
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
    # Set default device to GPU
    torch.set_default_device("cuda")
    print("PyTorch default device set to CUDA (GPU).")
else:
    print("WARNING: No GPU detected. Running these models on CPU will be extremely slow!")
    print("Make sure 'GPU' is selected in Runtime > Change runtime type.")

GPU detected: Tesla T4
PyTorch default device set to CUDA (GPU).


In [None]:
# Helper function for markdown display
def print_markdown(text):
    """Displays text as Markdown in Colab"""
    display(Markdown(text))

In [None]:
# pipelines - easy way to use models for inference.
# These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks
# Those tasks include Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering.
from transformers import pipeline

# Load a sentiment classifier fine-tuned on financial news data
# Check the model here: https://huggingface.co/ProsusAI/finbert
pipe = pipeline("text-classification", model = "ProsusAI/finbert")
pipe("Apple lost 10 Million dollars today due to US tariffs")

In [None]:
# A tokenizer converts text into numerical IDs that the model understands
# Check a demo for OpenAI's Tokenizers here: https://platform.openai.com/tokenizer
from transformers import AutoTokenizer

# Load tokenizer for GPT-2
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Encode text to token IDs
tokens = tokenizer("Hello, I am studying LLM and AI Agent more in depth.")
print(tokens['input_ids'])

# HUGGING FACE TRANSFORMERS LIBRARY: AutoModelForCausalLM

AutoModelForCausalLM is a Hugging Face class that automatically loads a pretrained model for causal (left-to-right) language modeling, such as GPT, LLaMA, or Gemma.

Let's get hands-on and load a model! We'll start with a relatively small but capable model that should fit comfortably in Colab's free tier GPU memory, thanks to quantization.

**Key Steps:**

1.  **Choose a Model ID:** We need the unique identifier from the Hugging Face Hub (e.g., `"google/gemma-2b-it"` or `"microsoft/Phi-3-mini-4k-instruct"`).
2.  **Load the Tokenizer:** Use `AutoTokenizer.from_pretrained(model_id)` to get the specific tokenizer for that model.
3.  **Load the Model:** Use `AutoModelForCausalLM.from_pretrained(...)` with crucial arguments:
    *   `model_id`: The identifier.
    *   `torch_dtype=torch.float16` (or `bfloat16`): Loads the model using 16-bit floating point numbers instead of 32-bit, saving memory.
    *   `load_in_4bit=True` or `load_in_8bit=True`: This is **quantization** via `bitsandbytes`. It further reduces memory by representing model weights with fewer bits (4 or 8 instead of 16/32), which is what makes these models fit on free Colab. Recent `transformers` releases expect these flags inside a `BitsAndBytesConfig` object passed as `quantization_config`. 4-bit saves more memory than 8-bit but can cost slightly more quality.
    *   `device_map="auto"`: Tells `accelerate` to automatically figure out how to spread the model across available devices (primarily the GPU in our case).
4.  **Combine Tokenizer and Model (Optional but common):** Using the `pipeline` function is often simpler for basic text generation. It handles tokenization, model inference, and decoding back to text for you.
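
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It counts the weights only (no activations, KV cache, or quantization metadata), and the parameter count is a hypothetical value chosen for illustration, so treat the numbers as a lower bound rather than a measurement.

```python
def estimate_weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Hypothetical parameter count, used only for illustration.
NUM_PARAMS = 3_800_000_000

for label, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{estimate_weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Each halving of the bit width halves the weight footprint, which is why the 4-bit setting leaves the most headroom on a small GPU.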


In [None]:
!pip install -U bitsandbytes

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

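# Alternative models to try (might need login/agreement):
# Caution: a "-GGUF" repo holds llama.cpp-format weights and may not load through
# AutoModelForCausalLM with bitsandbytes; a standard checkpoint may be needed instead.
# model_id = "unsloth/gemma-3-4b-it-GGUF"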
# model_id = "Qwen/Qwen2.5-3B-Instruct"
model_id = "microsoft/Phi-4-mini-instruct"
# model_id = "unsloth/Llama-3.2-3B-Instruct"

In [None]:
# load the Tokenizer
# The tokenizer prepares text input for the model
# trust_remote_code=True is sometimes needed for newer models with custom code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code = True)
print("Tokenizer loaded successfully.")

In [None]:
# Let's Load the Model with Quantization

print(f"Loading model: {model_id}")
print("This might take a few minutes, especially the first time...")

# Create BitsAndBytesConfig for 4-bit quantization
quantization_config = BitsAndBytesConfig(load_in_4bit = True,
                                         bnb_4bit_compute_dtype = torch.float16,  # or torch.bfloat16 if available
                                         bnb_4bit_quant_type = "nf4",  # normal float 4 quantization
                                         bnb_4bit_use_double_quant = True  # use nested quantization for more efficient memory usage
                                         )

# Load the model with the quantization config
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config = quantization_config,
                                             device_map = "auto",
                                             trust_remote_code = True)


In [None]:
prompt = "Explain how Electric Vehicles work in a funny way!"

In [None]:
prompt = "What is the capital of France?"

In [None]:
# Method 1: test the model and Tokenizer using the .generate() method!

# encode the input first
inputs = tokenizer(prompt, return_tensors = "pt").to(model.device)  # Keep inputs on the same device as the model

# generate the output
outputs = model.generate(**inputs, max_new_tokens = 1000)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print_markdown(response)

In [None]:
# Method 2: create a pipeline that includes your model and tokenizer
# The pipeline wraps tokenization, generation, and decoding

pipe = pipeline("text-generation",
                model = model,  # Already loaded, quantized, and placed on the GPU above
                tokenizer = tokenizer
                )


outputs = pipe(prompt,
               max_new_tokens = 1000, # max_new_tokens limits the length of the generated response.
               temperature = 1, # temperature controls randomness when sampling is enabled (lower = more focused).
               )

# Print the generated text
print_markdown(outputs[0]['generated_text'])

# READ PDF DOCUMENTS & EXTRACT TEXT USING PYPDF LIBRARY

**Steps:**
1.  **Get the PDF:** Download it or specify the path if uploaded.
2.  **Open the PDF:** Use `pypdf.PdfReader`.
3.  **Iterate Through Pages:** Loop through each page in the PDF.
4.  **Extract Text:** Use `page.extract_text()`.
5.  **Combine Text:** Join the text from all pages into a single string.

In [None]:
import requests
from pathlib import Path

# --- Get the PDF File ---
pdf_url = "https://abc.xyz/assets/66/ae/c94682fc4137b5fb90a5d709ac4b/2025-q1-earnings-transcript.pdf"
pdf_filename = "google_earning_transcript.pdf"
pdf_path = Path(pdf_filename)

# Download the file if it doesn't exist
if not pdf_path.exists():
    response = requests.get(pdf_url, timeout = 30)  # Fail instead of hanging if the server does not respond
    response.raise_for_status()  # Check for download errors
    pdf_path.write_bytes(response.content)
    print(f"PDF downloaded successfully to {pdf_path}")
else:
    print(f"PDF file already exists at {pdf_path}")


# --- Read Text from PDF using pypdf ---
pdf_text = ""

print(f"Reading text from {pdf_path}...")
reader = pypdf.PdfReader(pdf_path)
num_pages = len(reader.pages)
print(f"PDF has {num_pages} pages.")

# Extract text from each page
all_pages_text = []
for i, page in enumerate(reader.pages):

    page_text = page.extract_text()
    if page_text:  # Only add if text extraction was successful
        all_pages_text.append(page_text)
    # print(f"Read page {i+1}/{num_pages}") # Uncomment for progress

# Join the text from all pages
pdf_text = "\n".join(all_pages_text)
print(f"Successfully extracted text. Total characters: {len(pdf_text)}")


In [None]:
# Display a small snippet of the PDF
print("\n--- Snippet of Extracted Text ---")
print_markdown(f"{pdf_text[:1000]}")



--- Snippet of Extracted Text ---


 
This transcript is provided for the convenience of investors only, for a full recording please see the Q1 2025 Earnings Call webcast.

Operator: Welcome, everyone. Thank you for standing by for the Alphabet First Quarter 2025 Earnings conference call. At this time, all participants are in a listen-only mode. After the speaker presentations, there will be a question-and-answer session. To ask a question during the session, you will need to press *1 on your telephone. I would now like to hand the conference over to your speaker today, Jim Friedland, Senior Director of Investor Relations. Please go ahead.

Jim Friedland, Senior Director, Investor Relations: Thank you. Good afternoon, everyone, and welcome to Alphabet's First Quarter 2025 Earnings Conference Call. With us today are Sundar Pichai, Philipp Schin

# BUILD THE Q&A LOGIC & PROMPT THE MODEL

We now have two key ingredients:
1.  A loaded open-source LLM (and its tokenizer/pipeline).
2.  The text content extracted from our PDF document.

We combine these to answer user questions. The core idea is **prompt engineering**: we create a prompt that includes both the user's question and the relevant document context, instructing the model to answer based only on that context.

**Steps:**
1.  **Define a Prompt Template:** Create a string that structures the input for the LLM. This typically includes placeholders for the context (PDF text) and the question.
2.  **Create an Answering Function:** Write a Python function that takes the PDF text, the user question, and the model/tokenizer (or pipeline) as input.
3.  **Format the Prompt:** Inside the function, fill the template with the actual PDF text and question.
4.  **Handle Context Length:** LLMs have a maximum context window (how much text they can read at once). Our sample PDF might be too long! For simplicity now, we might just truncate the PDF text if it's excessive. More advanced techniques involve chunking the document and retrieving only relevant parts, but we'll keep it basic here.
5.  **Run Inference:** Send the formatted prompt to the model pipeline.
6.  **Extract the Answer:** Process the model's output to get just the answer part.
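
Step 4 mentions chunking and retrieval as the more advanced alternative to truncation. The sketch below shows one simple way that could look, under stated assumptions: fixed-size character chunks with overlap, ranked by naive keyword overlap with the question. The function names and default sizes are hypothetical, and a real system would more likely rank chunks with embeddings.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbours."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This chunk already reaches the end of the text
    return chunks


def keyword_score(chunk: str, question: str) -> int:
    """Count how many distinct question words also appear in the chunk."""
    chunk_words = set(chunk.lower().split())
    question_words = set(question.lower().split())
    return len(chunk_words & question_words)


def select_context(text: str, question: str, top_k: int = 2, **chunk_kwargs) -> str:
    """Return the top_k best-scoring chunks, kept in document order."""
    chunks = chunk_text(text, **chunk_kwargs)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: keyword_score(chunks[i], question),
                    reverse=True)
    chosen = sorted(ranked[:top_k])  # Restore document order for readability
    return "\n...\n".join(chunks[i] for i in chosen)
```

The string returned by `select_context(pdf_text, question)` could then be passed as the document text in place of a blind truncation, so that a question about a later page still sees the relevant passage.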


In [None]:
# Define a limit for the context length to avoid overwhelming the model

MAX_CONTEXT_CHARS = 6000

def answer_question_from_pdf(document_text, question, llm_pipeline):
    """
    Answers a question based on the provided document text using the loaded LLM pipeline.

    Args:
        document_text (str): The text extracted from the PDF.
        question (str): The user's question.
        llm_pipeline (transformers.pipeline): The initialized text-generation pipeline.

    Returns:
        str: The model's generated answer.
    """
    # Truncate context if necessary
    if len(document_text) > MAX_CONTEXT_CHARS:
        print(f"Warning: Document text ({len(document_text)} chars) exceeds limit ({MAX_CONTEXT_CHARS} chars). Truncating.")
        context = document_text[:MAX_CONTEXT_CHARS] + "..."
    else:
        context = document_text

    # Prompt Template
    # instruct the model to use only the provided document.
    # <|system|> provides context/instructions, <|user|> is the question.
    # Note: Different models might prefer different prompt structures.
    prompt_template = f"""<|system|>
    You are an AI assistant. Answer the following question based *only* on the provided document text. If the answer is not found in the document, say "The document does not contain information on this topic." Do not use any prior knowledge.

    Document Text:
    ---
    {context}
    ---
    <|end|>
    <|user|>
    Question: {question}<|end|>
    <|assistant|>
    Answer:""" # prompt the model to start generating the answer

    print(f"\n--- Generating Answer for: '{question}' ---")

    # Run Inference on the chosen model
    outputs = llm_pipeline(prompt_template,
                           max_new_tokens = 500,  # Limit answer length
                           do_sample = True,
                           temperature = 0.2,   # Lower temperature for more factual Q&A
                           top_p = 0.9,
                           return_full_text = False)  # Return only the newly generated text, not the prompt

    # Let's extract the answer
    # With return_full_text=False the output no longer echoes the prompt, which also avoids
    # matching an "Answer:" string that happens to appear inside the document text.
    raw_answer = outputs[0]['generated_text'].strip()

    # Sometimes the model might trail off or emit an end-of-turn marker.
    # Basic cleanup: cut at the end-of-sequence token if present, otherwise return the raw text.
    # Phi-3 uses <|end|> or <|im_end|>
    end_token = "<|end|>"
    if end_token in raw_answer:
        raw_answer = raw_answer.split(end_token)[0].strip()

    print("--- Generation Complete ---")
    return raw_answer


In [None]:
# Let's test the function
test_question = "What is this document about?"
generated_answer = answer_question_from_pdf(pdf_text, test_question, pipe)

print("\nTest Question:")
print_markdown(f"**Q:** {test_question}")
print("\nGenerated Answer:")
print_markdown(f"**A:** {generated_answer}")


--- Generating Answer for: 'What is this document about?' ---
--- Generation Complete ---

Test Question:


**Q:** What is this document about?


Generated Answer:


**A:** This document is about Alphabet's First Quarter 2025 Earnings Conference Call, where Jim Friedland, Senior Director of Investor Relations, presents the company's financial performance and forward-looking statements. CEO Sundar Pichai discusses the company's growth in various business areas, including AI and cloud services, and highlights the progress in AI infrastructure and research. The document also mentions the release of Gemini 2.5 Pro, a state-of-the-art AI model, and the introduction of 2.5 Flash for developers.

# SWITCH MODELS & BUILD GRADIO INTERFACE

We load one model at a time, based on the user's selection in the Gradio interface. This means unloading the previous model before loading the new one. Switching models therefore introduces a loading delay, but it is necessary to stay within GPU memory limits.

**Steps:**

1.  **Define Model Choices:** Create a dictionary mapping user-friendly names (e.g., "Microsoft Phi-4 Mini") to their Hugging Face model IDs. Include models known to work in Colab free tier with 4-bit quantization.
2.  **Global State:** Keep track of the currently loaded model and tokenizer globally (or using Gradio's `State`).
3.  **Model Loading Function:** Create a function `load_llm_model(model_name)` that handles unloading the old model (if any) and loading the new tokenizer and quantized model. It should return a status message and the new `pipeline`.
4.  **Gradio Interface:**
    *   Use `gr.Blocks` for more layout control.
    *   Add a `gr.Dropdown` for the user to select the desired model.
    *   Add a `gr.Textbox` for the user's question.
    *   Add a `gr.Textbox` (or `gr.Markdown`) for the output answer.
    *   Add a `gr.Button` to submit the question.
5.  **Event Handling:**
    *   When the dropdown selection changes, trigger the `load_llm_model` function. Show a loading indicator.
    *   When the submit button is clicked, call our `answer_question_from_pdf` function, passing the current PDF text, the question, and the currently loaded pipeline.


In [None]:
# Make sure we have the pdf_text
# Configuration: Models available for selection
# Use models known to fit in Colab free tier with 4-bit quantization

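available_models = {
    "Llama 3.2": "unsloth/Llama-3.2-3B-Instruct",
    "Microsoft Phi-4 Mini": "microsoft/Phi-4-mini-instruct",
    # Caution: a "-GGUF" repo holds llama.cpp-format weights and may not load through
    # AutoModelForCausalLM with bitsandbytes; a standard checkpoint may be needed instead.
    "Google Gemma 3": "unsloth/gemma-3-4b-it-GGUF",
    "Qwen 2.5": "Qwen/Qwen2.5-3B-Instruct",
    }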

In [None]:
# --- Global State (or use gr.State in Blocks) ---
# To keep track of the currently loaded model/pipeline
current_model_id = None
current_pipeline = None
print(f"Models available for selection: {list(available_models.keys())}")


# Define a function to Load/Switch Models
def load_llm_model(model_name):
    """Loads the selected LLM, unloading the previous one."""
    global current_model_id, current_pipeline, tokenizer, model, pipe

    new_model_id = available_models.get(model_name)
    if not new_model_id:
        return "Invalid model selected.", None  # Return error message and None pipeline

    if new_model_id == current_model_id and current_pipeline is not None:
        print(f"Model {model_name} is already loaded.")
        # Indicate success but don't reload
        return f"{model_name} already loaded.", current_pipeline

    print(f"Switching to model: {model_name} ({new_model_id})...")

    # Unload previous model (important for memory)
    # model, tokenizer, and pipe are notebook globals, so a locals() check never finds them.
    # Rebinding them to None drops the references that keep the old weights alive.
    import gc

    current_pipeline = None
    model = None
    tokenizer = None
    pipe = None  # The pipeline from the earlier cell also holds the old model
    gc.collect()  # Collect the released Python objects first
    torch.cuda.empty_cache()  # Then release cached GPU memory
    print("Previous model unloaded (if any).")

    # --- Load the new model ---
    loading_message = f"Loading {model_name}..."
    try:
        # Load Tokenizer
        tokenizer = AutoTokenizer.from_pretrained(new_model_id, trust_remote_code = True)

        # Load Model (Quantized)
        # Quantization goes through BitsAndBytesConfig; passing load_in_4bit directly is deprecated
        model = AutoModelForCausalLM.from_pretrained(new_model_id,
                                                     torch_dtype = "auto",  # Or torch.float16 / torch.bfloat16
                                                     quantization_config = BitsAndBytesConfig(load_in_4bit = True),
                                                     device_map = "auto",
                                                     trust_remote_code = True)

        # Create Pipeline
        # The model is already loaded and placed on its device, so no dtype or device_map is passed here
        loaded_pipeline = pipeline("text-generation", model = model, tokenizer = tokenizer)

        print(f"Model {model_name} loaded successfully!")
        current_model_id = new_model_id
        current_pipeline = loaded_pipeline  # Update global state
        # Returning the pipeline lets a caller store it in gr.State instead of relying on globals
        return f"{model_name} loaded successfully!", loaded_pipeline  # Status message and the pipeline object

    except Exception as e:
        print(f"Error loading model {model_name}: {e}")
        current_model_id = None
        current_pipeline = None
        return f"Error loading {model_name}: {e}", None  # Error message and None pipeline

Models available for selection: ['Llama 3.2', 'Microsoft Phi-4 Mini', 'Google Gemma 3', 'Qwen 2.5']


In [None]:
# --- Function to handle Q&A Submission ---
# This function now relies on the globally managed 'current_pipeline'
# In a more robust Gradio app, you'd pass the pipeline via gr.State
def handle_submit(question):
    """Handles the user submitting a question."""
    if not current_pipeline:
        return "Error: No model is currently loaded. Please select a model."
    if not pdf_text:
        return "Error: PDF text is not loaded. Please run the PDF extraction cell first."
    if not question:
        return "Please enter a question."

    print(f"Handling submission for question: '{question}' using {current_model_id}")
    # Call the Q&A function defined earlier in this notebook
    answer = answer_question_from_pdf(pdf_text, question, current_pipeline)
    return answer



In [None]:

# --- Build Gradio Interface using Blocks ---
import gradio as gr  # Re-imported so this cell also runs on its own after a runtime restart
print("Building Gradio interface...")
with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.Markdown(
        f"""
    # PDF Q&A Bot Using Hugging Face Open-Source Models
    Ask questions about the document ('{pdf_filename}' if loaded, {len(pdf_text)} chars).
    Select an open-source LLM to answer your question.
    **Note:** Switching models takes time as the new model needs to be downloaded and loaded into the GPU.
    """
    )

    # Store the pipeline in Gradio state for better practice (optional for this simple version)
    # llm_pipeline_state = gr.State(None)

    with gr.Row():
        model_dropdown = gr.Dropdown(
            choices=list(available_models.keys()),
            label="🤖 Select LLM Model",
            value=list(available_models.keys())[0],  # Default to the first model
        )
        status_textbox = gr.Textbox(label="Model Status", interactive=False)

    question_textbox = gr.Textbox(
        label="❓ Your Question", lines=2, placeholder="Enter your question about the document here..."
    )
    submit_button = gr.Button("Submit Question", variant="primary")
    answer_textbox = gr.Textbox(label="💡 Answer", lines=5, interactive=False)

    # --- Event Handlers ---
    # When the dropdown changes, load the selected model
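    model_dropdown.change(
        # load_llm_model returns (status, pipeline); the status box needs only the status string
        fn = lambda model_name: load_llm_model(model_name)[0],
        inputs = [model_dropdown],
        outputs = [status_textbox],  # Update status text. Ideally also update a gr.State for the pipeline
        # outputs=[status_textbox, llm_pipeline_state] # If using gr.State (then pass load_llm_model itself)
    )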

    # When the button is clicked, call the submit handler
    submit_button.click(
        fn = handle_submit,
        inputs = [question_textbox],
        outputs = [answer_textbox],
        # inputs=[question_textbox, llm_pipeline_state], # Pass state if using it
    )

    # --- Initial Model Load ---
    # Easier: Manually load first model *before* launching Gradio for simplicity here
    initial_model_name = list(available_models.keys())[0]
    print(f"Performing initial load of default model: {initial_model_name}...")
    status, _ = load_llm_model(initial_model_name)
    status_textbox.value = status  # Set initial status
    print("Initial load complete.")


# --- Launch the Gradio App ---
print("Launching Gradio demo...")
demo.launch(debug=True)  # debug=True provides more detailed logs

Building Gradio interface...

