# Customized ChatGPT with RAG

This notebook combines data ingestion and RAG chat into a single step-by-step workflow for building a local RAG system with LangChain, ChromaDB, and SentenceTransformers.

## 1. Data Ingestion & Vector Database Creation
We load PDF files, extract their text, split it into chunks, and store the chunks in a Chroma vector database.
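
To make the chunking step concrete, here is a minimal sketch of overlapping character windows, using the `chunk_size=350` and `chunk_overlap=30` values from the ingestion cell. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, which this sketch does not do. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 350, chunk_overlap: int = 30) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between two window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 1000
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # [350, 350, 350, 40]
```

Each chunk repeats the last 30 characters of the previous one, so a sentence cut at a chunk boundary still appears with some surrounding context in the next chunk.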

In [None]:

import sys
print(sys.executable)



In [1]:
import os
import shutil
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from sentence_transformers import SentenceTransformer
import torch


# Base directory containing one folder per company (USERPROFILE is only set on Windows)
basisordner = os.path.join(
    os.environ["USERPROFILE"],
    "OneDrive", "Desktop", "KI", "github", "customized_chatgpt_rag_", "Dokumentenbasis"
)

# Initialize the embedding model; use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

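class MyEmbeddingFunction:
    """Adapter exposing the embed_documents/embed_query methods that LangChain vector stores call."""

    def __init__(self, model):
        self.model = model

    def embed_documents(self, texts):
        # encode() returns a 2-D NumPy array; tolist() yields plain lists of Python floats
        return self.model.encode(texts, convert_to_tensor=False).tolist()

    def embed_query(self, text):
        # A single string yields a 1-D array, i.e. one embedding vector
        return self.model.encode(text, convert_to_tensor=False).tolist()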

embedding_function = MyEmbeddingFunction(model)

# Process each company folder
for unternehmen in os.listdir(basisordner):
    ordnerpfad = os.path.join(basisordner, unternehmen)
    if not os.path.isdir(ordnerpfad):
        continue

    dokumente = []
    for datei in os.listdir(ordnerpfad):
        if datei.endswith(".pdf"):
            pfad = os.path.join(ordnerpfad, datei)
            loader = PyPDFLoader(pfad)
            geladene_dokumente = loader.load()
            for doc in geladene_dokumente:
                doc.metadata["source"] = datei
            dokumente.extend(geladene_dokumente)

    if not dokumente:
        continue

    # Split the loaded pages into chunks
    splitter = RecursiveCharacterTextSplitter(chunk_size=350, chunk_overlap=30)
    chunks = splitter.split_documents(dokumente)

    # Store the chunks in the vector database; an existing store is deleted first so re-runs do not duplicate chunks
    vektorordner = os.path.join(ordnerpfad, "vectordb")
    if os.path.exists(vektorordner):
        shutil.rmtree(vektorordner)

    vectordb = Chroma.from_documents(
        chunks,
        embedding_function,
        persist_directory=vektorordner
    )
    

    # Print a path relative to the base directory so the user name is not shown
    relativer_pfad = os.path.relpath(vektorordner, start=basisordner)
    print(f"{unternehmen}: {len(chunks)} Chunks gespeichert in Dokumentenbasis/{relativer_pfad}")


Deutsche Bank: 1517 Chunks gespeichert in Dokumentenbasis/Deutsche Bank\vectordb


Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 10 0 (offset 0)
Ignoring wrong pointing object 74 0 (offset 0)
Ignoring wrong pointing object 76 0 (offset 0)


Deutsche Boerse: 290 Chunks gespeichert in Dokumentenbasis/Deutsche Boerse\vectordb


Could not reliably determine page label for 1.
[... same warning repeated for pages 2 to 21; output truncated ...]

GEA: 460 Chunks gespeichert in Dokumentenbasis/GEA\vectordb


Could not reliably determine page label for 33.
[... same warning repeated for pages 34 to 39 and 1 to 14; output truncated ...]

Telekom: 1194 Chunks gespeichert in Dokumentenbasis/Telekom\vectordb
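
The ingestion loop relies on a folder layout of one subfolder per company, each holding that company's PDFs. The following standard-library sketch reproduces just the discovery step on a temporary directory so it can be tried without any real documents. The function name `find_pdfs_per_company` and the file names are hypothetical, and unlike the cell above the extension check here is case-insensitive.

```python
import tempfile
from pathlib import Path


def find_pdfs_per_company(base_dir: Path) -> dict[str, list[str]]:
    """Map each company folder name to the PDF file names it contains."""
    result = {}
    for company_dir in sorted(base_dir.iterdir()):
        if not company_dir.is_dir():
            continue  # skip loose files in the base directory
        pdfs = sorted(
            p.name for p in company_dir.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
        if pdfs:  # companies without PDFs are skipped, as in the ingestion loop
            result[company_dir.name] = pdfs
    return result


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "GEA").mkdir()
    (base / "GEA" / "report.PDF").touch()
    (base / "GEA" / "notes.txt").touch()
    (base / "Telekom").mkdir()          # no PDFs, so it will not appear
    (base / "readme.md").touch()        # not a folder, so it is ignored
    layout = find_pdfs_per_company(base)
    print(layout)  # {'GEA': ['report.PDF']}
```

A `vectordb` subfolder inside a company folder is ignored by this check because it is a directory, not a file.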


## 2. RAG Chat with a Local Model (e.g. Mistral via Ollama)
Select a company, load its vector database, ask a question, and get back an answer generated by the local LLM.
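
Conceptually, the retriever embeds the question and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force cosine-similarity search over hand-written toy vectors. It is an illustration only: the vectors and chunk texts are made up, and Chroma's actual index and distance metric may differ from what is shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k chunk texts most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy 3-dimensional "embeddings" (real ones from all-MiniLM-L6-v2 have 384 dimensions)
toy_chunks = {
    "revenue grew": [0.9, 0.1, 0.0],
    "new CEO appointed": [0.1, 0.9, 0.0],
    "revenue outlook": [0.8, 0.2, 0.1],
    "office relocation": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
print(top_k(query, toy_chunks, k=2))  # ['revenue grew', 'revenue outlook']
```

With `k=2`, as in the cell below, only two chunks of at most 350 characters reach the model, which keeps the prompt short but can miss relevant passages; raising `k` trades prompt length for recall.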

In [2]:
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama

# 📂 Company selection
unternehmen = [name for name in os.listdir(basisordner) if os.path.isdir(os.path.join(basisordner, name))]
print("Unternehmen:", unternehmen)

wahl = input(" Unternehmen wählen: ")

frage = input(" Frage stellen: ")

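# Only continue for an existing company folder, so a mistyped name does not open a new, empty store
if frage and wahl in unternehmen: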
    vektorordner = os.path.join(basisordner, wahl, "vectordb")
    vectordb = Chroma(
        persist_directory=vektorordner,
        embedding_function=embedding_function
    )
    retriever = vectordb.as_retriever(search_kwargs={"k": 2})
    llm = Ollama(model="mistral", temperature=0)

    qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
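    # invoke() replaces the deprecated run(); the answer text is under the "result" key
    antwort = qa_chain.invoke({"query": frage})["result"]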

    print("\n📘 Antwort:")
    print(antwort)

Unternehmen: ['Deutsche Bank', 'Deutsche Boerse', 'GEA', 'Telekom']
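
`RetrievalQA.from_chain_type` uses the "stuff" chain type by default, meaning the retrieved chunks are placed together into a single prompt along with the question. The sketch below shows that idea in plain Python. The template wording and the function name `build_stuffed_prompt` are assumptions for illustration and are not LangChain's actual prompt.

```python
def build_stuffed_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks and question
prompt = build_stuffed_prompt(
    "How did revenue develop?",
    ["Revenue grew compared to the previous year.", "The outlook for revenue remains stable."],
)
print(prompt)
```

Because everything ends up in one prompt, the number of retrieved chunks times the chunk size has to fit into the model's context window.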
