# Visualizing vector data

This notebook walks you through the following steps:

- Install the required packages.
- Use the Sitemap Loader to crawl deepshore.de for knowledge articles.
- Send the documents to an embeddings model, turn them into vectors, and store them in the index.
- Visualize the data locally with chromaviz (a web application).
- Visualize the data online with Nomic AI Atlas.

You need:

- Python 3.10
- A Jupyter Notebook server
- An OpenAI API key


## Installation

In [None]:
%pip install llama_index=="0.6.38"
%pip install llama_hub=="0.0.5"
%pip install langchain=="0.0.222"
%pip install chromadb=="0.3.26"
%pip install git+https://github.com/selamanse/chromaviz/
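
Because the cell above pins exact versions, it can help to confirm what actually ended up in the environment before continuing. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_pins` are made up for this illustration.

```python
from importlib import metadata

# Versions pinned in the install cell above
PINNED = {
    "llama_index": "0.6.38",
    "llama_hub": "0.0.5",
    "langchain": "0.0.222",
    "chromadb": "0.3.26",
}

def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_pins(pins):
    """Map each package whose installed version differs to (expected, installed)."""
    mismatches = {}
    for package, expected in pins.items():
        found = installed_version(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches

# Print one line per package that is missing or has a different version
for package, (expected, found) in check_pins(PINNED).items():
    print(f"{package}: expected {expected}, found {found}")
```

If nothing is printed, all four pinned packages are present in the expected versions.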

## OpenAI configuration

In [None]:
import os
import logging
import sys
import getpass

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Loading the documents

- The documents are loaded with the [Sitemap Loader](https://llama-hub-ui.vercel.app/l/web-sitemap), which crawls the deepshore.de website for knowledge articles.

In [None]:
from llama_hub.web.sitemap.base import SitemapReader

import nest_asyncio
nest_asyncio.apply()

loader = SitemapReader(html_to_text=True)
documents = loader.load_data(sitemap_url='https://deepshore.de/sitemap.xml', filter='https://deepshore.de/knowledge')

print(len(documents))
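
To make the role of the `filter` argument more concrete, here is a simplified sketch of the idea: read the `<loc>` entries of a sitemap and keep only the URLs that contain a given fragment. This is not the llama_hub implementation. The function `filter_sitemap_urls`, the substring matching rule, and the example.org URLs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemap protocol (sitemaps.org)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def filter_sitemap_urls(sitemap_xml, fragment):
    """Return all <loc> URLs in the sitemap that contain the given fragment."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]
    return [url for url in urls if fragment in url]

# Hypothetical sitemap content, for illustration only
example_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/knowledge/post-a</loc></url>
  <url><loc>https://example.org/about</loc></url>
  <url><loc>https://example.org/knowledge/post-b</loc></url>
</urlset>"""

knowledge_urls = filter_sitemap_urls(example_sitemap, "https://example.org/knowledge")
print(knowledge_urls)
```

Only the two URLs under `/knowledge` survive the filter. In the notebook, the loader additionally downloads each remaining page and converts its HTML to text.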

## Sending the documents to an embeddings model

- The documents are turned into vectors and stored in the index.

In [None]:
import chromadb
from chromadb.config import Settings
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Opt out of telemetry: https://docs.trychroma.com/telemetry#opting-out
chromadb_settings = Settings(anonymized_telemetry=False, persist_directory="./chroma", chroma_db_impl="duckdb+parquet")
chromadb_client = chromadb.Client(chromadb_settings)
chroma_client = Chroma(collection_name='deepshore-sitemap', client=chromadb_client, embedding_function=OpenAIEmbeddings())

# Convert the llama_index documents into langchain documents
langchain_documents = [d.to_langchain_format() for d in documents]

vectordb = chroma_client.from_documents(langchain_documents, OpenAIEmbeddings(), collection_name='deepshore-sitemap', client_settings=chromadb_settings, persist_directory="./chroma")

vectordb.persist()
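
To illustrate what it means to store vectors in an index and look them up again, here is a toy in-memory sketch. It does not reflect how Chroma works internally. The class `ToyVectorIndex`, the choice of cosine similarity, the document ids, and the three-dimensional vectors are all assumptions for illustration. Real OpenAI embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class ToyVectorIndex:
    """Keeps (id, vector) pairs in memory and ranks them against a query vector."""

    def __init__(self):
        self._items = []

    def add(self, doc_id, vector):
        self._items.append((doc_id, vector))

    def query(self, vector, k=1):
        """Return the ids of the k stored vectors most similar to the query."""
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(vector, item[1]),
            reverse=True,
        )
        return [doc_id for doc_id, _ in ranked[:k]]

# Made-up vectors standing in for document embeddings
index = ToyVectorIndex()
index.add("kafka-post", [0.9, 0.1, 0.0])
index.add("kubernetes-post", [0.1, 0.9, 0.1])
index.add("archive-post", [0.0, 0.2, 0.9])

nearest = index.query([0.8, 0.2, 0.1], k=2)
print(nearest)
```

Documents whose vectors point in a similar direction to the query rank first. This neighborhood structure is what the visualizations in the following sections try to make visible.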

# Visualizing the vector data (offline method)

- The high-dimensional data is reduced with [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) so that it can be displayed.


In [None]:
from chromaviz import visualize_collection

visualize_collection(col=vectordb._collection)
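
As background on t-SNE: its first step turns distances between high-dimensional points into neighbor probabilities, and it then searches for a low-dimensional layout whose neighbor probabilities match as closely as possible. The sketch below shows only that first step and is not the code chromaviz runs. The fixed `sigma` is a simplifying assumption (t-SNE itself tunes a bandwidth per point from the perplexity setting), and the points are made up.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def neighbor_probabilities(points, i, sigma=1.0):
    """Conditional probabilities p(j|i) that point i picks point j as its neighbor."""
    weights = [
        0.0 if j == i else math.exp(-squared_distance(points[i], p) / (2 * sigma ** 2))
        for j, p in enumerate(points)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up 4-dimensional points: the first two are close, the third is far away
points = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]

probs = neighbor_probabilities(points, 0)
print(probs)
```

Almost all of the probability mass for the first point goes to its close neighbor, which is why nearby documents tend to end up in the same cluster of the plot.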

# Visualizing the vector data (online method)

Using [Atlas](https://atlas.nomic.ai/). Registration is required.

In [None]:
"""
Visualizing your Chroma collection in Atlas
"""
import numpy as np
from nomic import atlas
import nomic

nomic.login(getpass.getpass("Nomic API Key:"))

# Pull ids, metadata, texts and embeddings out of the Chroma collection in one call
vectors = vectordb._collection.get(include=['metadatas', 'documents', 'embeddings'])

info_jsons = []
for doc_id, metadata, text in zip(vectors['ids'], vectors['metadatas'], vectors['documents']):
    # Use the characters after the first '### ' heading marker as a short preview
    marker = '\n### '
    idx = text.find(marker)
    start = idx + len(marker) if idx != -1 else 0
    info_jsons.append({'id': doc_id, 'Source': metadata['Source'], 'Document': text[start:start + 44]})

embeddings = np.array(vectors['embeddings'])

atlas.map_embeddings(embeddings=embeddings, data=info_jsons, id_field='id')
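
The `Document` field sent to Atlas is a short preview cut from the page text after the first `### ` marker. The helper below is a sketch of a variation on that step: it stops at the end of the heading line and falls back to the start of the text when no marker is found. The function name `heading_preview` and the sample strings are made up for illustration.

```python
def heading_preview(text, marker="\n### ", length=44):
    """Return up to `length` characters of the line following the first marker.

    Falls back to the first line of the text when the marker is absent.
    """
    idx = text.find(marker)
    start = idx + len(marker) if idx != -1 else 0
    first_line = text[start:].split("\n", 1)[0]
    return first_line[:length]

# Made-up page text in the shape the preview logic expects
sample = "Site header\n\n### Visualizing vector data\n\nSome body text."

print(heading_preview(sample))
print(heading_preview("No heading here"))
```

Keeping the preview short keeps the labels readable when hovering over points in the map.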
