# 📂 Upload-Driven RAG

This notebook walks through **every step**, from configuration to the final answer, of building an **Upload-Driven RAG** app in Streamlit. Users upload a CSV or PDF file, and either an LLM or a set of hybrid rules answers questions based on that file. Do not run the code as-is: this notebook is intended for explanation only. The steps are:

1. Ingest the uploaded data and show a preview
2. Convert both file types to JSON
3. Chunk the text and enforce token limits so the LLM does not stop abruptly after being given too much input
4. Build a FAISS index to compute similarity distances
5. Set up the LLM and the hybrid rules that route each question (Pandas or a small JSON snippet for number-based questions, the LLM for semantic questions)
6. Review the settings that can change the answer by controlling how much context is used and how strict the matching is

## Step 1: Data Ingestion & Preview  
**Concept**: Read a CSV or PDF, keep at most the first `MAX_ROWS` rows of a CSV, and show a preview so users can confirm the upload worked and the app does not freeze on huge files.

In [None]:
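import streamlit as st
import pandas as pd, chardet
from io import BytesIO
from pypdf import PdfReader  # assumption: the source never imports PdfReader; PyPDF2 exposes the same class name
import tabula 

# Sidebar slider: the maximum number of rows to read from an uploaded CSV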
st.sidebar.header("⚙️ Settings")
MAX_ROWS = st.sidebar.slider("🔢 Max CSV rows to ingest", 100, 20000, 1000, step=100)

uploaded = st.file_uploader("Upload CSV or PDF", type=["csv","pdf"])
if not uploaded:
    st.stop()

# For CSVs, preview the first 5 rows in the Streamlit interface
if uploaded.type == "text/csv":
    raw = uploaded.read()
    enc = chardet.detect(raw)["encoding"] or "utf-8"  # chardet may return None
    df  = pd.read_csv(BytesIO(raw), encoding=enc, nrows=MAX_ROWS)
    st.subheader("📊 CSV Preview")
    st.dataframe(df.head(5))

# For PDFs, preview the first 500 characters of the first page
else:
    st.subheader("📄 PDF Preview (first page snippet)")
    raw_pdf = uploaded.read()
    reader = PdfReader(BytesIO(raw_pdf))
    pages = [p.extract_text() or "" for p in reader.pages]
    st.text((pages[0][:500] if pages else "") + "…")

    # OPTIONAL: use tabula to extract any tables inside the PDF
    dfs = []  # stays empty when no tables are found or extraction fails
    try:
        with open("temp.pdf", "wb") as f:
            f.write(raw_pdf)
        dfs = tabula.read_pdf("temp.pdf", pages="all", multiple_tables=True)
        st.write(f"Detected {len(dfs)} tables")
    except Exception as e:
        # tabula needs a Java runtime; report the failure and continue with page text only
        st.warning(f"Table extraction skipped: {e}")
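The capped-ingestion idea can be tried without Streamlit, pandas, or chardet. The sketch below is a standard-library stand-in, not the app's code: `read_csv_head` and its fallback encoding list are hypothetical names chosen for illustration, and it tries a fixed list of encodings instead of detecting one.

```python
import csv
import io
from itertools import islice

def read_csv_head(raw: bytes, max_rows: int, encodings=("utf-8", "latin-1")):
    """Decode raw CSV bytes and return (header, rows) with at most max_rows data rows."""
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    else:
        raise ValueError("could not decode the upload with any candidate encoding")
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])              # first line is treated as the header
    rows = list(islice(reader, max_rows))  # stop parsing after max_rows data rows
    return header, rows

sample = "name,score\nana,1\nbudi,2\ncitra,3\n".encode("utf-8")
header, rows = read_csv_head(sample, max_rows=2)
```

Here `rows` holds only the first two data rows even though the upload has three, which mirrors what `nrows=MAX_ROWS` does in the cell above.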

## Step 2: JSON Conversion for Table-QA 
**Concept**: Convert the entire table (CSV or PDF-extracted) into JSON so numeric/date queries can be handled exactly (via Pandas) or via a small JSON snippet to the LLM.

In [None]:
import json

texts, metas = [], []
table_json = None 

# for CSV:
table_json = json.dumps(df.to_dict("records"), indent=2)
with st.expander("🔎 View CSV as JSON"):
    st.code(table_json, language="json")

# for PDF with tables:
if dfs:
    full_tables = {}
    for ti, tdf in enumerate(dfs):
        # show head
        st.subheader(f"📋 Table {ti+1} Preview")
        st.dataframe(tdf.head(5))
        # JSON: build the records once and reuse them for display and storage
        records = tdf.to_dict(orient="records")
        full_tables[f"table_{ti}"] = records
        with st.expander(f"🔎 View Table {ti+1} as JSON"):
            st.code(json.dumps(records, indent=2), language="json")
        # flatten each row into a "col: value; ..." string for retrieval
        for i, row in tdf.iterrows():
            txt = "; ".join(f"{c}: {row[c]}" for c in tdf.columns)
            texts.append(txt)
            metas.append(("pdf_table", i))
    # serialize all tables once, after the loop
    table_json = json.dumps(full_tables, indent=2)

# for PDF page text (the JSON falls back to page text when no tables were found):
for idx, pg in enumerate(pages):
    if idx == 0:
        st.text(pg[:500] + "…")
    texts.append(pg)
    metas.append(("pdf_page", idx))
if table_json is None:
    pages_list = [{"page": i+1, "text": pg[:2000]} for i, pg in enumerate(pages)]
    table_json = json.dumps({"pages": pages_list}, indent=2)
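The cell above flattens PDF table rows into `texts` and `metas` but does not show the same step for CSV rows. A minimal sketch of how that could look follows, using plain dictionaries in place of a DataFrame. The helper `flatten_records` and the `"csv_row"` label are assumptions made for illustration; they do not appear in the original app.

```python
import json

def flatten_records(records, kind="csv_row"):
    """Turn a list of row dicts into 'col: value; ...' strings plus (kind, row_index) metadata."""
    texts, metas = [], []
    for i, row in enumerate(records):
        texts.append("; ".join(f"{col}: {val}" for col, val in row.items()))
        metas.append((kind, i))
    return texts, metas

# Same shape as df.to_dict("records") would produce for a two-row table
records = [{"city": "Jakarta", "sales": 120}, {"city": "Bandung", "sales": 95}]
table_json = json.dumps(records, indent=2)   # full table kept for exact numeric queries
texts, metas = flatten_records(records)      # row strings kept for semantic retrieval
```

Keeping both forms is the point of this step: the JSON preserves exact values for numeric queries, while the flattened strings are what get chunked and embedded.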

## Step 3: Chunking & Token-Limit Enforcement
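**Concept**: Split each row or page of text into chunks of at most `chunk_size` tokens (200 here, and never more than the tokenizer's 512-token limit minus 2 special tokens) so the embedding and RAG steps never overflow.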

In [None]:
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer, CrossEncoder

# model init (cached)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", use_fast=True)
embedder  = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker  = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

def chunk_text(txt, chunk_size=200):
    toks = tokenizer.tokenize(txt)
    max_len = tokenizer.model_max_length - 2
    step = min(chunk_size, max_len)
    for i in range(0, len(toks), step):
        yield tokenizer.convert_tokens_to_string(toks[i:i+step])

# flatten all rows/pages into chunks
# texts and metas were built in Step 2; do not reset them here, or no chunks are produced
chunks, chunk_meta = [], []
for (kind, idx), doc in zip(metas, texts):
    for c in chunk_text(doc, chunk_size=200):
        chunks.append(c)
        chunk_meta.append((kind, idx))
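The windowing logic in `chunk_text` can be checked without downloading a tokenizer. The sketch below applies the same fixed-step slicing to a plain list of tokens; `chunk_tokens` is a hypothetical helper, and `max_len=510` assumes the 512-token limit minus 2 special tokens mentioned above.

```python
import math

def chunk_tokens(tokens, chunk_size=200, max_len=510):
    """Split a token list into consecutive windows of at most min(chunk_size, max_len) tokens."""
    step = min(chunk_size, max_len)
    return [tokens[i:i + step] for i in range(0, len(tokens), step)]

tokens = [f"w{i}" for i in range(450)]       # stand-in for tokenizer.tokenize(txt)
chunks = chunk_tokens(tokens, chunk_size=200)
sizes = [len(c) for c in chunks]             # two full windows and one remainder
expected_count = math.ceil(len(tokens) / 200)
```

A requested `chunk_size` above the model limit is silently capped, so `chunk_tokens(tokens, chunk_size=1000)` still yields windows of at most 510 tokens. The windows do not overlap, which means a sentence that straddles a boundary is split across two chunks.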

## Step 4: FAISS Index Construction
**Concept**: Embed every chunk, then choose an index type by collection size. **IndexFlatL2** is exact: it compares the query against every stored vector. **IndexIVFFlat** is approximate: it clusters the vectors around `nlist` centroids and searches only the `nprobe` nearest clusters, which is faster on large sets. IVF needs enough training data to place its centroids, so the code uses it only when there are at least 39 vectors per centroid and falls back to the flat index otherwise.

In [None]:
import faiss, numpy as np

embs = embedder.encode(chunks, convert_to_numpy=True).astype("float32")
dim, n = embs.shape[1], embs.shape[0]
nlist = min(100, n)

# If too few vectors, use exact FLAT index
if n < 39 * nlist:
    index = faiss.IndexFlatL2(dim)
else:
    quant = faiss.IndexFlatL2(dim)
    index = faiss.IndexIVFFlat(quant, dim, nlist, faiss.METRIC_L2)
    index.train(embs); index.nprobe = min(10, nlist)

index.add(embs)
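The selection rule in the cell above is plain arithmetic, so it can be checked without FAISS. The sketch below reproduces only the decision, not the index itself; `choose_index_kind` and its parameter names are hypothetical.

```python
def choose_index_kind(n_vectors, max_nlist=100, min_points_per_centroid=39):
    """Return ("flat" or "ivf", nlist) using the same rule as the cell above."""
    nlist = min(max_nlist, n_vectors)
    if n_vectors < min_points_per_centroid * nlist:
        return "flat", nlist   # too few vectors to train nlist centroids reliably
    return "ivf", nlist

small = choose_index_kind(1000)    # 1000 < 39 * 100, so the exact index is used
large = choose_index_kind(5000)    # 5000 >= 3900, so IVF is used
```

Because `nlist` is capped at the number of vectors, any collection with fewer than 3,900 chunks ends up on the exact flat index under these defaults.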

## Step 5: Hybrid Retrieval
**Concept**: After instantiating a free LLM through OpenRouter (here, the free Mistral Small 24B model), we route each query by type using the rules below.
### Numeric/date queries 
- **CSV path**: finds all numeric columns in the table, picks the last one, computes its maximum, and prints it (only `max` queries are answered this way).
- **Fallback path**: takes the first 20 rows of your table as JSON and asks the AI: “Here are 20 rows. What is your answer?”
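The routing decision between the numeric path and the semantic path can be sketched on its own. The version below is a variant, not the app's rule: the app checks whether a keyword appears anywhere in the lowercased query, while this sketch matches whole words with a regular expression. `route` and `NUMERIC_KEYWORDS` are hypothetical names.

```python
import re

NUMERIC_KEYWORDS = ("max", "min", "sum", "earliest")
# \b anchors make each keyword match only as a whole word
_KEYWORD_RE = re.compile(r"\b(" + "|".join(NUMERIC_KEYWORDS) + r")\b", re.IGNORECASE)

def route(query: str) -> str:
    """Return "numeric" when the query contains a whole-word keyword, else "semantic"."""
    return "numeric" if _KEYWORD_RE.search(query) else "semantic"

a = route("What is the max price?")
b = route("Give me a summary of the admin duties")
```

Whole-word matching avoids sending "summary" or "admin" down the numeric path, at the cost of missing longer forms such as "maximum", which a substring check would catch.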

### Free-text queries 
1. **Search**: We embed your question into the same vector space as the indexed chunks, then retrieve the 20 nearest candidates.

2. **Fallback guess**: If the best of those 20 is not “good enough” (its similarity falls below the HyDE threshold), we let the AI draft a short answer and search again with it to improve recall.

3. **Rerank**: A second, lightweight model (a cross-encoder) scores how well each of those 20 snippets matches your question, and we keep the top K.

4. **Show context**: Those top K snippets are passed to the AI as context.

5. **Answer**: The LLM answers based on those top K snippets.
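The rerank step reduces to "keep the K highest-scoring candidates, best first". A minimal sketch with made-up scores follows; `top_k_by_score` is a hypothetical helper, and in the app the scores would come from the cross-encoder rather than a hand-written list.

```python
def top_k_by_score(candidates, scores, k):
    """Return up to k candidates ordered from highest to lowest score."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in ranked[:k]]

cands = ["chunk a", "chunk b", "chunk c", "chunk d"]
scores = [0.1, 2.5, -1.0, 0.9]          # illustrative relevance scores
best_two = top_k_by_score(cands, scores, k=2)
```

Asking for more snippets than exist simply returns all of them, which is the same effect as `min(TOP_K, len(cands))` in the retrieval cell.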

In [None]:
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
api_key = os.getenv("OPENROUTER_API_KEY","").strip()
assert api_key, "❗️ Set OPENROUTER_API_KEY in .env or sidebar"

router = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=api_key
)

In [None]:
query = st.text_input("❓ Ask your question:")
if st.button("Retrieve & Answer") and query:
    lowq = query.lower()
    # A) Numeric/Date
    if any(k in lowq for k in ("max","min","sum","earliest")):
        # exact with Pandas
        if uploaded.type=="text/csv":
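            numcols = df.select_dtypes(include=[np.number]).columns
            if "max" in lowq and len(numcols) > 0:
                col = numcols[-1]; mv = df[col].max()
                st.write(f"Max {col} = {mv}"); st.stop()
        # Otherwise, grab the first 20 rows as JSON and ask the AI directly
        data = json.loads(table_json)
        if isinstance(data, dict):  # PDF JSON maps table/"pages" keys to lists; merge them
            data = [row for rows in data.values() for row in rows]
        sample = data[:20]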
        prompt = f"Here are 20 rows as JSON:\n{json.dumps(sample,indent=2)}\nQ: {query}"
        resp = router.chat.completions.create(
            model="mistralai/mistral-small-3.2-24b-instruct:free",
            messages=[{"role":"user","content":prompt}], temperature=0
        )
        st.write(resp.choices[0].message.content); st.stop()

    # B) Semantic RAG
    q_emb = embedder.encode([query], convert_to_numpy=True).astype("float32")
    D,I   = index.search(q_emb,20)
    
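    # If nothing was similar enough, ask the AI to draft a quick answer (“HyDE”)
    # FAISS returns squared L2 distances (smaller = closer). Assuming unit-length
    # embeddings, cosine similarity = 1 - d²/2, which is what HYDE_THRESH is compared to.
    best_sim = 1 - D[0].min() / 2
    if best_sim < HYDE_THRESH: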
        pseudo = router.chat.completions.create(
            model="mistralai/mistral-small-3.2-24b-instruct:free",
            messages=[{"role":"user","content":f"Answer briefly: {query}"}],
            temperature=0.7,max_tokens=100
        ).choices[0].message.content
        q_emb = embedder.encode([pseudo], convert_to_numpy=True).astype("float32")
        D,I   = index.search(q_emb,20)
    
    # Rerank the candidates with the cross-encoder (skip -1 ids, which FAISS returns when fewer than 20 hits exist)
    cands  = [chunks[i] for i in I[0] if i >= 0]
    scores = reranker.predict([[query,c] for c in cands])
    
    # Pick the top-K best snippets (highest score first) and show them in a collapsible box
    k = min(TOP_K, len(cands))
    top  = [cands[i] for i in np.argsort(scores)[::-1][:k]]
    with st.expander("🔍 Retrieved Chunks", expanded=False):
        for t in top: st.markdown(f"> {t}")

    # Send those top-K snippets + your question back to the AI for an actual answer
    final_prompt = f"Context:\n{chr(10).join(top)}\n\nQ: {query}"
    gen = router.chat.completions.create(
        model="mistralai/mistral-small-3.2-24b-instruct:free",
        messages=[{"role":"user","content":final_prompt}],
        temperature=0.7, max_tokens=300
    )
    st.subheader("💬 Answer"); st.write(gen.choices[0].message.content)

# Optional Info

We add sliders so users can tune **speed vs. accuracy** at three key stages: chunk size, how many snippets reach the LLM, and how readily we fall back to an AI guess.

## Chunk size (tokens)
Controls how big each piece of text is when we break your document up:

- Smaller means more pieces and more precise search, but more work to process.

- Larger means fewer, longer pieces, which process faster but can miss fine details.

## Top-K texts
How many of the best-matching pieces we show to the AI for the final answer.

- Higher means more context, but may include irrelevant bits.

- Lower keeps the AI focused, but might leave out useful info.

## HyDE threshold
A “picky” number between 0 and 1 that says how good a match has to be before we skip the AI-guess step.

- Lower threshold means we’re easy to please, so we rarely ask the AI to draft a pseudo-answer.

- Higher threshold means we’re more demanding, so we fall back to a quick AI guess more often to widen our search.

In [None]:
CHUNK_SIZE  = st.sidebar.slider(
    "📄 Chunk size (tokens)", 100, 500, 200,
    help="Max number of tokens per text chunk. Smaller values mean finer-grained chunks (more precise retrieval) but a larger index."
)
st.sidebar.markdown("*(How many tokens each chunk can have. Smaller → more chunks, finer retrieval.)*")

TOP_K       = st.sidebar.slider(
    "🔍 top-K texts", 1, 10, 5,
    help="Number of top-retrieved chunks shown to the LLM. Higher values give more context but may include less relevant passages."
)
st.sidebar.markdown("*(How many retrieved snippets to pass to the model.)*")

HYDE_THRESH = st.sidebar.slider(
    "🤖 HyDE threshold", 0.1, 0.9, 0.3, step=0.05,
    help="Recall similarity cutoff below which a 'pseudo-answer' (HyDE) is generated to improve retrieval. Lower → fewer HyDE calls."
)
st.sidebar.markdown("*(Cosine similarity below which we ask the LLM to draft a hypothetical answer.)*")

# More Optional Info: Temperature vs. HyDE Threshold
## Temperature
- When it applies: Only during the generation step, when you ask the LLM to produce text (HyDE pseudo-answers or the final answer).

- What it does: Think of it like “how creative” you want the AI to be.

  - Low (0–0.3) → very predictable, repeats the most likely phrasing.

  - Medium (0.4–0.7) → a bit more variety, can reword or add small surprises.

  - High (0.8–1.0) → lots of creativity, but also more risk of going off-topic or making stuff up.
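The "creativity" effect has a simple numerical form. In the textbook definition, the model's raw scores (logits) are divided by the temperature before the softmax, so a low temperature sharpens the distribution and a high one flattens it. The sketch below illustrates that definition on three made-up logits; how the hosted model applies temperature internally is not described in this notebook, so treat it as an illustration only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature (must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                         # illustrative scores for three tokens
cold = softmax_with_temperature(logits, 0.2)     # sharply peaked on the first token
warm = softmax_with_temperature(logits, 1.0)     # probability spread across all three
```

With the lower temperature the top token takes almost all of the probability mass, which is why low settings read as predictable and high settings as more varied.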

## HyDE Threshold
- When it applies: In the retrieval stage, before generation.

- What it does: Measures how “close” your question is to anything in the index.

- After doing the first pass of similarity search, you look at the best match score. If that best match is below your HyDE threshold, it means “we’re not confident anything really matches,” so you ask the LLM to draft a pseudo-answer. You then re-search using that pseudo-answer to pull in more relevant chunks.
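The sidebar describes the threshold as a cosine similarity, while an L2 FAISS index returns squared distances where smaller means closer. For unit-length vectors the two are linked by `cos = 1 - d²/2`, so the comparison can be made on one scale. The sketch below assumes the embeddings are normalized to length 1 (an assumption, not something this notebook verifies), and both helper names are hypothetical.

```python
def cosine_from_squared_l2(d2):
    """For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - d2 / 2."""
    return 1.0 - d2 / 2.0

def needs_hyde(squared_distances, threshold):
    """True when even the closest hit falls below the similarity threshold."""
    best_sim = cosine_from_squared_l2(min(squared_distances))
    return best_sim < threshold

# Check the identity on two unit vectors: a = (1, 0), b = (0.6, 0.8), cosine = 0.6
d2 = (1 - 0.6) ** 2 + (0 - 0.8) ** 2
recovered = cosine_from_squared_l2(d2)

confident = needs_hyde([0.4, 0.9, 1.5], threshold=0.3)   # best similarity is 0.8
uncertain = needs_hyde([1.6, 1.8], threshold=0.3)        # best similarity is 0.2
```

Only the second query would trigger a pseudo-answer, since its closest chunk has a similarity of about 0.2, below the 0.3 default.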