# Core Semantic Search Concepts

This notebook covers the fundamental concepts of semantic search, including vectors, cosine similarity, vector databases, Pinecone implementation, and hybrid search.

## Introduction to Vectors

Vectors are mathematical representations of data: essentially, lists of numbers. This representation lets us express different types of data in a unified way and apply mathematical tools to analyze and manipulate them. For this project, the crucial part is the ability to measure 'similarity' between vectors and, consequently, between the pieces of information they represent. For beginners, let's break this down:

### What are Vectors?

It's useful to think about vectors as points in space (or as arrows, for more physics-oriented readers). For example:
- A 2D vector [3, 4] represents a point in a flat space, 3 units along the x-axis and 4 units along the y-axis
- A 3D vector [1, 2, 3] represents a point in 3D space

A visual format helps us understand the relationship between vectors. When looking at lists of numbers, it's not obvious whether two vectors are 'similar' to each other. However, once they are turned into points in space, we can simply check how close they are to each other. This 'closeness' represents similarity.

While we can easily visualize 2D and 3D vectors as points (or arrows) in space, higher-dimensional vectors are harder to picture. However, the same principles apply - each number in the vector represents a position along a different dimension.
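To make 'closeness' concrete, here is a minimal sketch that measures the straight-line (Euclidean) distance between a few 2D points. Only the point [3, 4] comes from the text above; the other two points and the helper name `euclidean_distance` are made up for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


point = [3, 4]      # the 2D example from the text
nearby = [3, 5]     # illustrative: one unit away
far_away = [10, 0]  # illustrative: much further away

print(euclidean_distance(point, nearby))    # 1.0
print(euclidean_distance(point, far_away))  # sqrt(49 + 16) = sqrt(65), about 8.06
```

The same function works unchanged for vectors with hundreds of dimensions, which is why the geometric intuition carries over to embeddings.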

### Why Vectors Matter in Semantic Search

The critical feature of a vector is that it can represent "meaning". Text, images, and sound waves can all be converted (with the proper tools) into vectors that preserve the meaning of the original data.

Disclaimer: a vector is just a way to represent data. It can store anything that can be turned into numbers. For text, vectors are often used to represent the 'structure' of a text (syntax), the 'meaning' of a text (semantics), and the presence of keywords.

Vectors allow us to:
1. **Represent meaning mathematically** - Convert words, sentences, images into numbers
2. **Measure similarity** - Similar meanings have vectors that are close to each other
3. **Perform efficient searches** - Find information based on meaning rather than just keywords

## Vector Representations of Text

Text can be converted into vectors using embedding models, which capture semantic meaning. This process is fundamental to semantic search, so let's explore it in detail.

### What are Text Embeddings?

Text embeddings are vector representations of words, sentences, or documents. These vectors capture the meaning of the text in a way that computers can process.

For example, the word "king" might be represented as a vector like [0.1, -0.2, 0.5, ...], where each number captures some aspect of its meaning. Words with similar meanings will have similar vectors.

A very intuitive embedding method is **Bag of Words** (BoW), which represents a document as a vector of word frequencies. Each dimension in the vector corresponds to a word in the vocabulary, and the value represents the frequency of that word in the document.
The order of words in the document is ignored; only how often each word occurs matters.
BoW does not require any training data; it simply counts the frequency of each word in the document.

![image](../bow.png)
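The counting described above can be sketched in a few lines of plain Python. The two example documents and the helper names `build_vocabulary` and `bag_of_words` are illustrative assumptions, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of every distinct lowercase word in the documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


docs = ["the cat sat on the mat", "the dog sat on the log"]  # illustrative documents
vocab = build_vocabulary(docs)

print(vocab)                         # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(bag_of_words(docs[0], vocab))  # [1, 0, 0, 1, 1, 1, 2]
print(bag_of_words(docs[1], vocab))  # [0, 1, 1, 0, 1, 1, 2]
```

Because only counts are kept, "the mat sat on the cat" produces exactly the same vector as the first document, which shows how word order is lost.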

While Bag of Words is a very simple embedding method, the most popular embeddings are generated by neural networks. Language models such as BERT, RoBERTa, or ada-002 generate contextualized embeddings that capture the nuances of language. These embeddings are learned during pre-training and can be fine-tuned for specific downstream tasks. They are highly effective at capturing complex semantic relationships and have achieved state-of-the-art results in many NLP tasks.

### Why Do We Need Embeddings?

1. **Computers don't understand text directly** - They need numerical representations to process language
2. **Semantic relationships become mathematical** - Words like "king" and "queen" will be close in vector space
3. **Enable similarity calculations** - We can use vector operations to find similar content

### How Embedding Models Work (Simplified)

1. **Neural networks learn from vast amounts of text** - They read billions of documents
2. **They identify patterns in how words are used together** - Words that appear in similar contexts get similar vectors
3. **The resulting model can convert any text to a fixed-length vector** - Usually hundreds of dimensions
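The idea behind step 2 can be illustrated without a neural network. The sketch below swaps in a much simpler technique, plain co-occurrence counting, on a made-up four-sentence corpus. It is not how BERT-style models are trained; it only shows that words used in similar contexts end up with similar vectors.

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative corpus; real models learn from billions of documents.
corpus = [
    "the cat eats food",
    "the dog eats food",
    "the stock price falls",
    "the bond price falls",
]

# For each word, count the other words that share a sentence with it.
contexts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j, other in enumerate(words):
            if i != j:
                contexts[word][other] += 1

vocabulary = sorted(contexts)


def context_vector(word):
    """Fixed-length vector of co-occurrence counts for one word."""
    return [contexts[word][other] for other in vocabulary]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


# "cat" and "dog" appear in identical contexts; "cat" and "stock" share only "the".
print(cosine(context_vector("cat"), context_vector("dog")))    # 1.0
print(cosine(context_vector("cat"), context_vector("stock")))  # about 0.33
```

Every word gets a vector of the same length (one entry per vocabulary word), which mirrors the fixed-length property in step 3.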

### Remarkable Properties of Embeddings

- **Semantic relationships**: Words with similar meanings have vectors that are close together
- **Analogy solving**: Famous example: "king" - "man" + "woman" ≈ "queen"
- **Cross-lingual capabilities**: Some models can relate similar concepts across languages

Let's see a practical example of generating text embeddings:

In [None]:
# Example of generating text embeddings
from langchain_huggingface import HuggingFaceEmbeddings
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained embedding model
model_name = "sentence-transformers/all-mpnet-base-v2"
model = HuggingFaceEmbeddings(model_name=model_name)

# Let's define some example sentences with semantic relationships
sentences = [
    "Dogs are wonderful pets.",
    "I love my canine companion.",  # Similar meaning to the first sentence
    "Cats make great household pets.",  # Related but different animal
    "Artificial intelligence is changing technology.",  # Completely different topic
    "Machine learning algorithms can solve complex problems.",  # Similar to AI sentence
    "The weather is beautiful today.",  # Unrelated topic
]

# Generate embeddings for each sentence
embeddings = model.embed_documents(sentences)

# Basic embedding information
print(f"Number of embeddings: {len(embeddings)}")
print(f"Number of dimensions in a single embedding: {len(embeddings[0])}")
print(f"\nFirst 5 values of first embedding: \n{embeddings[0][:5]}")
print(
    "\nNotice these are just numbers - the meaning is distributed across all dimensions!"
)


## Cosine Similarity

Cosine similarity measures the angle between two vectors, providing a metric for similarity. This is a fundamental concept in semantic search, so let's break it down for beginners.

### What is Cosine Similarity?

Cosine similarity measures how similar two vectors are by calculating the cosine of the angle between them. It ranges from -1 (completely opposite) to 1 (exactly the same), with 0 indicating no relationship.
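The three landmark values can be checked with a small sketch that takes the definition literally: find the direction of each 2D vector as an angle, then take the cosine of the difference. The helper name `cosine_from_angle` and the example vectors are illustrative, and the approach only works for non-zero 2D vectors.

```python
import math


def cosine_from_angle(a, b):
    """Cosine of the angle between two non-zero 2D vectors, computed from their directions."""
    angle_a = math.atan2(a[1], a[0])  # direction of a, in radians
    angle_b = math.atan2(b[1], b[0])  # direction of b, in radians
    return math.cos(angle_a - angle_b)


print(round(cosine_from_angle([1, 0], [2, 0]), 2))   # 1.0   same direction, different length
print(round(cosine_from_angle([1, 0], [0, 3]), 2))   # 0.0   perpendicular
print(round(cosine_from_angle([1, 0], [-1, 0]), 2))  # -1.0  opposite direction
```

Note that the first pair scores 1.0 even though the vectors have different lengths: only the direction matters.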

### Cosine Similarity Formula

Mathematically, cosine similarity is calculated as (**feel free to skip this section**):

$$\text{Cosine Similarity} = \frac{A \cdot B}{||A|| \times ||B||} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$$

Where:
- $A \cdot B$ is the dot product of vectors A and B
- $||A||$ and $||B||$ are the magnitudes (lengths) of vectors A and B
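For readers who prefer code to notation, the formula translates almost term by term into plain Python. The function is named `cosine_similarity_manual` (an illustrative name) so that it does not shadow the scikit-learn `cosine_similarity` imported earlier, and the vectors A and B are made-up examples.

```python
import math


def cosine_similarity_manual(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))  # A . B
    magnitude_a = math.sqrt(sum(x * x for x in a))  # ||A||
    magnitude_b = math.sqrt(sum(y * y for y in b))  # ||B||
    return dot_product / (magnitude_a * magnitude_b)


A = [3, 4]  # illustrative vectors
B = [4, 3]

# dot product = 3*4 + 4*3 = 24, ||A|| = ||B|| = 5, so the result is 24 / 25
print(cosine_similarity_manual(A, B))  # 0.96
```

Scaling a vector does not change the result: [3, 4] and [6, 8] point in the same direction, so their cosine similarity is 1.0.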

In [None]:
# Calculate similarity matrix between all sentence pairs
similarity_matrix = cosine_similarity(embeddings)

# Display similarities
print("\n==== Semantic Similarity Between Sentences ====\n")
for i, sentence1 in enumerate(sentences):
    print(f"Sentence {i + 1}: {sentence1}")

print("\nSimilarity Matrix (Cosine Similarity):")
print(np.round(similarity_matrix, 2))

# Let's highlight specific relationships
print("\nObserve the similarities:\n")
print(
    f"Pet sentences (1 & 2): {similarity_matrix[0][1]:.2f} - High similarity as they're about the same concept"
)
print(
    f"Pet sentences (1 & 3): {similarity_matrix[0][2]:.2f} - Moderate similarity (both about pets, but different animals)"
)
print(
    f"Unrelated topics (1 & 6): {similarity_matrix[0][5]:.2f} - Low similarity as they're about unrelated topics"
)

## Vector Databases

To compare texts, we first need to translate them into embeddings (vectors). Then we can find the similar ones using cosine similarity.

To store vectors, we use vector databases, which are optimized for vector operations.

### What are Vector Databases?

Vector databases are purpose-built systems that:
- Store high-dimensional vectors (like our text embeddings)
- Create special indexes for fast similarity searches
- Allow retrieval of the most similar vectors to a query vector
- Store additional metadata alongside the vectors
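To see what such a system has to do, here is a minimal in-memory sketch of the retrieval step: it scores every stored vector against the query and keeps the best `k`. The record ids, vectors, and metadata are made up, and `top_k` is a hypothetical helper. Scanning every record like this does not scale, which is why real vector databases build special indexes instead.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


# Each record holds an id, a vector, and metadata (all values are illustrative).
records = [
    {"id": "doc-1", "vector": [0.9, 0.1, 0.0], "metadata": {"source": "tweet"}},
    {"id": "doc-2", "vector": [0.1, 0.9, 0.0], "metadata": {"source": "news"}},
    {"id": "doc-3", "vector": [0.8, 0.2, 0.1], "metadata": {"source": "news"}},
]


def top_k(query_vector, records, k=2):
    """Return the k records most similar to the query, scanning every record."""
    scored = [(cosine(query_vector, r["vector"]), r) for r in records]
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [(round(score, 3), r["id"], r["metadata"]) for score, r in best]


# The query points along the first axis, so doc-1 and doc-3 rank highest.
for hit in top_k([1.0, 0.0, 0.0], records):
    print(hit)
```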

## Pinecone Implementation

Pinecone is a popular managed vector database service that provides scalable and efficient vector search capabilities for production environments. Let's explore how it works and how to implement it:

### What is Pinecone?

Pinecone is a cloud-based vector database that specializes in:
- **Storing and searching large-scale vector embeddings** - Billions of vectors at production scale
- **Low-latency retrieval** - Sub-second query times even with large datasets
- **Horizontal scaling** - Easy to grow as your data volume increases
- **Managed infrastructure** - No need to manage your own servers or instances

### How Pinecone Works

At a high level, Pinecone follows these steps:

1. **Create an index** - A specialized data structure for storing vectors
2. **Upload vectors** - Along with optional metadata (like the original text or other attributes)
3. **Query the index** - Find the most similar vectors to a query vector
4. **Retrieve results** - Get back the most similar vectors and their metadata

### Key Concepts in Pinecone

- **Index**: A collection of vectors with a specific dimensionality and distance metric
- **Vector**: An array of floating-point numbers (your embeddings)
- **Metadata**: Additional information about each vector (e.g., the original text, URL, category)
- **Namespace**: An optional way to partition your vectors for better organization
- **Query**: A request to find vectors similar to a given vector
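How these concepts fit together can be sketched with a toy in-memory class. `ToyIndex` and all of its method and parameter names are hypothetical: this is not the Pinecone client API, only an illustration of an index with a fixed dimensionality, namespaces, metadata, and a filtered query.

```python
import math


class ToyIndex:
    """In-memory stand-in for the concepts above. This is NOT the Pinecone API."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.namespaces = {}  # namespace name -> {vector id: (values, metadata)}

    def upsert(self, vector_id, values, metadata=None, namespace="default"):
        # An index only accepts vectors of its own dimensionality.
        if len(values) != self.dimension:
            raise ValueError(f"expected {self.dimension} values, got {len(values)}")
        self.namespaces.setdefault(namespace, {})[vector_id] = (values, metadata or {})

    def query(self, values, top_k=1, namespace="default", metadata_filter=None):
        """Return up to top_k (score, id, metadata) tuples, best match first."""
        hits = []
        for vector_id, (stored, metadata) in self.namespaces.get(namespace, {}).items():
            if metadata_filter and any(metadata.get(k) != v for k, v in metadata_filter.items()):
                continue  # skip vectors whose metadata does not match the filter
            dot = sum(x * y for x, y in zip(values, stored))
            norm = math.sqrt(sum(x * x for x in values) * sum(y * y for y in stored))
            hits.append((dot / norm, vector_id, metadata))
        return sorted(hits, key=lambda hit: hit[0], reverse=True)[:top_k]


index_sketch = ToyIndex(dimension=2)
index_sketch.upsert("a", [1.0, 0.0], {"source": "tweet"})
index_sketch.upsert("b", [0.0, 1.0], {"source": "news"})
index_sketch.upsert("c", [1.0, 1.0], {"source": "news"}, namespace="archive")

# Without a filter the closest vector is "a"; restricting to news leaves only "b".
print(index_sketch.query([1.0, 0.1], top_k=1))
print(index_sketch.query([1.0, 0.1], top_k=1, metadata_filter={"source": "news"}))
```

Vector "c" never appears in these results because it lives in a different namespace, which is the partitioning behaviour described above.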

## Pinecone Setup

Please visit `http://pinecone.io`, log in with your email, and generate an API key. **Share this API key with your teammates so that all of you can access the same database.**

Create a `.env` file and paste the Pinecone API key into it:

`PINECONE_API_KEY=your_api_key`

Then continue with the code below.

In [None]:
from dotenv import load_dotenv

# Load environment variables (particularly PINECONE_API_KEY)
load_dotenv()

In [None]:
import os

# Example of Pinecone implementation
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone with your API key
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))


# Function to create an index (if needed) and connect to it
def setup_pinecone_index(index_name, dimension=768):
    """
    Create a Pinecone index if it does not exist yet and return a handle to it.

    Args:
        index_name (str): Name of the index to create or connect to.
        dimension (int): Dimension of the embedding model's vectors. Check the model's
            HuggingFace description to find it (768 for all-mpnet-base-v2).
    """
    # Check if the index already exists
    existing_indexes = pc.list_indexes()
    if index_name not in [
        existing_index.get("name") for existing_index in existing_indexes
    ]:
        # Create a new index
        pc.create_index(
            name=index_name,
            dimension=dimension,
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )

    # Connect to the index
    index = pc.Index(index_name)

    return index


index = setup_pinecone_index("test2")

In [None]:
from langchain_pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(index=index, embedding=model)

In [None]:
from langchain_core.documents import Document
from uuid import uuid4

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]
# Pass the generated IDs so each document is stored under a known identifier
vector_store.add_documents(documents=documents, ids=uuids)

In [None]:
# load data from csv file
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv("../data/example_csv.csv")
# Display the first few rows of the DataFrame
display(df.head())

In [None]:
vector_store.add_documents(
    documents=[
        Document(page_content=row["text"], metadata={"source": row["source"]})
        for _, row in df.iterrows()
    ],
    ids=[str(uuid4()) for _ in range(len(df))],
)

In [None]:
results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter={"source": "tweet"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

In [None]:
results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?", k=1, filter={"source": "news"}
)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]")

In [None]:
results = vector_store.similarity_search_with_score("Welcome in excel world?", k=4)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]")

## Conclusion: Putting It All Together

### Key Concepts Reviewed

In this notebook, we've explored the fundamental concepts behind semantic search:

1. **Vectors**: The mathematical backbone of semantic search
   - Lists of numbers representing points in multi-dimensional space
   - Allow us to convert text, images, and other data into mathematical representations
   - Enable similarity calculations between different pieces of information

2. **Text Embeddings**: How we transform text into vectors
   - Neural networks create high-dimensional vectors that capture semantic meaning
   - Similar concepts have similar vector representations, regardless of the exact words used
   - Modern embedding models encode context and nuance, not just keywords

3. **Cosine Similarity**: How we measure similarity between vectors
   - Based on the angle between vectors, not their magnitude
   - Values close to 1 indicate high similarity; values close to 0 indicate low similarity
   - Works well in high-dimensional spaces used for text embeddings

4. **Vector Databases**: How we store and retrieve vectors efficiently
   - Specialized for handling high-dimensional vector data
   - Enable fast similarity search across millions or billions of vectors
   - Examples include Pinecone, FAISS, and Milvus

5. **Semantic Search**: Search that understands meaning and handles synonyms and related concepts

### Semantic Search Workflow

The typical semantic search workflow consists of these steps:

1. **Indexing Phase**:
   - Convert documents/content into vector embeddings using a neural network
   - Store these embeddings in a vector database (optionally with metadata)
   - Create efficient indexes for fast retrieval

2. **Query Phase**:
   - Convert the search query into a vector embedding using the same model
   - Find the closest vectors to the query vector using cosine similarity
   - Return the documents/content associated with the most similar vectors
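The two phases can be sketched end to end in plain Python. To stay self-contained, this sketch swaps the neural embedding model for simple word counts, so it matches shared words rather than meaning. The documents, the query, and the helper names `embed` and `search` are illustrative assumptions.

```python
import math
from collections import Counter


def embed(text, vocabulary):
    """Toy embedding: word counts over a fixed vocabulary (stand-in for a neural model)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


# Indexing phase: embed every document once and keep the vectors.
toy_documents = [
    "the weather forecast for tomorrow is cloudy",
    "the stock market is down today",
    "top soccer players in the world",
]
vocabulary = sorted({w for doc in toy_documents for w in doc.lower().split()})
stored = [(doc, embed(doc, vocabulary)) for doc in toy_documents]


# Query phase: embed the query with the same function, then rank by cosine similarity.
def search(query, k=1):
    query_vector = embed(query, vocabulary)
    ranked = sorted(stored, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


print(search("weather tomorrow"))  # ['the weather forecast for tomorrow is cloudy']
```

With word counts, a query that shares no words with any document scores 0 against all of them. Replacing `embed` with a neural embedding model is what lets the same workflow match on meaning instead.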

   
By understanding these core concepts, you now have the foundation to implement and work with semantic search systems in various applications. The power of semantic search comes from its ability to understand meaning rather than just matching keywords, making it an essential technology for processing and retrieving information in our increasingly data-rich world.

# Next steps

Now you can link the materials found by the LLM with the matching ones from ecoinvent.