### Let's Code a Semantic Search
In this example, we use vector embeddings to convert the text of the Porsche 911 Wikipedia page into numerical representations. We then ingest these embeddings into a vector database so that users can run similarity searches and find the passages of the page most relevant to a question.

In [None]:
pip install langchain lxml chromadb sentence-transformers

### Load the dataset
Save the [Wikipedia page for the Porsche 911](https://en.wikipedia.org/wiki/Porsche_911) as a local HTML file. This simplified example loads a single page, but the same approach works for multiple pages. The cell below also splits the page into sections at its HTML headers.

In [2]:
from langchain.text_splitter import HTMLHeaderTextSplitter

# Load the dataset (raw string, so backslashes in the Windows path are not read as escapes)
file_path = r".\data\Wikipedia 911\Porsche 911 - Wikipedia.html"
with open(file_path, encoding='utf-8') as html_file:
    html_text = html_file.read()

# Split the page into sections at each header level
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
    ("h5", "Header 5"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_text)
print("Loaded {} sections".format(len(html_header_splits)))

Loaded 44 sections


### Split documents into chunks
Next, we split the header-based sections into smaller chunks of at most 1,500 characters, with a 150-character overlap between neighboring chunks so that context is not lost at chunk boundaries. Feel free to use other splitting criteria.

In [3]:
# Define the Text Splitter 
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

# Split the header-based sections into chunks
splits = text_splitter.split_documents(html_header_splits)
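
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of fixed-window chunking in plain Python. It is a simplification and an assumption on our part, not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraph and line boundaries before falling back to raw characters. The helper name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of at most chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample: 25 digits, so the overlap is easy to see by eye
sample = "".join(str(i % 10) for i in range(25))
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. In the notebook, the same idea applies at a larger scale: 150 characters are shared between neighboring chunks of up to 1,500 characters.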

### Create and store embeddings in a vector database
Next, we embed every chunk and store the resulting vectors in ChromaDB. The embedding function uses a transformer model from the Sentence-Transformers (SBERT) library, here `all-MiniLM-L6-v2`. You can experiment with other models; see the Pretrained Models section of the Sentence-Transformers documentation at [sbert.net](https://www.sbert.net/).

In [4]:
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

# Sentence-embedding function backed by an SBERT model
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Create the vector store
vectordb = Chroma.from_documents(
    documents=splits,             # documents
    embedding=embedding_function, # embeddings
)

print(f'{vectordb._collection.count()} Embeddings are loaded in the Vector Database')

97 Embeddings are loaded in the Vector Database
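
Conceptually, the vector store now holds one vector per chunk, and a query is answered by embedding the question and ranking the stored vectors by how close they are to it. The sketch below shows that idea as a brute-force cosine-similarity search in plain Python. It is an illustration under stated assumptions, not what ChromaDB does internally: Chroma uses an approximate nearest-neighbor index, and the three-dimensional vectors, the texts, and the helper names `cosine_similarity` and `top_k` are made up for this example (real `all-MiniLM-L6-v2` embeddings have 384 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Score every stored vector against the query and return the k best (score, text) pairs."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; the numbers are invented for illustration
toy_store = [
    ("top speed of the GT3", [0.9, 0.1, 0.0]),
    ("interior trim options", [0.0, 0.2, 0.9]),
    ("acceleration figures", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]  # stands in for the embedded question

results = top_k(query, toy_store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The chunk whose vector points most nearly in the same direction as the query ranks first. A brute-force scan like this is fine for a few dozen chunks; an index becomes worthwhile as the collection grows.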


### Semantic Search
Here we bring everything together to find the sections of the document that best answer our question.

In [5]:
# Question
question = "Which Porsche has the best top speed?"

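# Semantic similarity search; higher relevance scores mean closer matches
# (the underlying distance metric depends on how the Chroma collection is configured)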
docs_similarity = vectordb.similarity_search_with_relevance_scores(question, k=5)

# Helper to print the results in a readable form
def make_docs_readable(docs, truncate=None):
    for doc, score in docs:
        print(doc.page_content[:truncate])
        print("Score: {}".format(score))
        print(doc.metadata)
        print("================================")

make_docs_readable(docs_similarity, truncate=250) 

The 911 GT3 was added to the 997 lineage on 23 February 2006. Performance figures include a 0–100 kilometres per hour (0–62 mph) acceleration time of 4.1 seconds and a top speed of 310 km/h (193 mph), almost as fast as the Turbo. Porsche's factory re
Score: 0.5464710815783989
{'Header 1': 'Porsche 911', 'Header 2': 'Water-cooled engines (1998–present)[edit]', 'Header 3': '997 (2004–2013)[edit]', 'Header 4': '997 GT3[edit]'}
The Porsche 911 GT1 is a race car that was developed in 1996 for the GT1 class in the 24 Hours of Le Mans. In order to qualify for GT racing, 25 road-going models were built to achieve type homologation. The engine in the GT1 is rated at 608 PS (447 
Score: 0.5167648869298593
{'Header 1': 'Porsche 911', 'Header 2': '911 GT1[edit]'}
In 2016, Porsche unveiled a limited production 911 R based on the GT3 RS. Production was limited to 991 units worldwide.[62] It has an overall weight of 1,370 kg (3,020 lb), a high-revving 4.0 L six-cylinder naturally aspirated engine fro