In [1]:
# Import libraries
import openai
import faiss
import numpy as np
import requests
from bs4 import BeautifulSoup
import os
import time

## **1. Choose an LLM**

### **Overview**

- **Selected model:** OpenAI's `text-embedding-ada-002` embedding model.
- **Purpose:** Generate high-quality embeddings for text data.

### **Implementation**

- Utilized OpenAI's API to create embeddings.
- Ensured secure handling of the API key using environment variables.

In [23]:
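# Read the key from an environment variable instead of hard-coding it in the notebook
openai.api_key = os.getenv("OPENAI_API_KEY")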
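
Keeping the key in an environment variable only helps if a missing key is caught early. The sketch below is one possible way to fail fast; the helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions made for illustration, not something this notebook defines.

```python
import os

def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper testable.
    Both the helper and the variable name are illustrative assumptions.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook.")
    return key

# Possible usage with the `openai` module imported above:
# openai.api_key = load_api_key()
```

Raising immediately gives a clearer message than an authentication error from the first embedding request.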

## **2. Select a Vector Database**

### **Overview**

- **Chosen Database:** FAISS (Facebook AI Similarity Search).
- **Reason:** Efficient for large-scale similarity search and clustering of dense vectors.

### **Implementation**

- Initialized a flat (exact-search) FAISS inner-product index. Once embeddings are normalized to unit length, the inner product equals cosine similarity.

In [3]:
# Dimension - For 'text-embedding-ada-002', it's 1536
embedding_dim = 1536

# Initialize a FAISS index
index = faiss.IndexFlatIP(embedding_dim)
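
The reason an inner-product index can serve cosine similarity is that the two quantities coincide for unit-length vectors. The following dependency-free sketch checks this on two toy vectors; the helper names are hypothetical and the vectors are made up for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine_similarity(a, b)                        # cosine on the raw vectors
ip_normalized = dot(l2_normalize(a), l2_normalize(b))   # inner product after normalization

print(cos_ab, ip_normalized)  # the two values agree up to floating-point error
```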

## **3. Data Preparation**

### **Overview**

- **Dataset:** Collected text documents related to Quantum Computing from five URLs.
- **Process:** Fetched and extracted textual content from the specified URLs.

### **Implementation**

- Utilized `requests` and `BeautifulSoup` to scrape and parse content.
- Ensured all documents have valid, non-empty content before processing.

In [4]:
# List of URLs to fetch data from
urls = [
    "https://medium.com/@vignesh2659/quantum-computing-abd85aa5da9d",
    "https://www.ibm.com/topics/quantum-computing",
    "https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-quantum-computing",
]

def fetch_content(url):
    try:
        response = requests.get(url, timeout=10)  # avoid hanging on an unresponsive server
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            # Extract text from paragraphs
            paragraphs = soup.find_all('p')
            text = ' '.join([para.get_text() for para in paragraphs])
            return text
        else:
            print(f"Failed to fetch {url}: Status code {response.status_code}")
            return ""
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return ""

# Fetch content from all URLs
documents = []
for url in urls:
    content = fetch_content(url)
    documents.append({
        'url': url,
        'content': content
    })

In [5]:
# Additional URLs
additional_urls = [
    "https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-quantum-computing/",
    "https://www.techtarget.com/whatis/definition/quantum-computing"
]

# Fetch content from additional URLs
for url in additional_urls:
    content = fetch_content(url)
    documents.append({
        'url': url,
        'content': content
    })

# Display fetched documents
for idx, doc in enumerate(documents, 1):
    print(f"Document {idx}: {doc['url']}\nContent Length: {len(doc['content'])} characters\n")


Document 1: https://medium.com/@vignesh2659/quantum-computing-abd85aa5da9d
Content Length: 7057 characters

Document 2: https://www.ibm.com/topics/quantum-computing
Content Length: 24955 characters

Document 3: https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-quantum-computing
Content Length: 9321 characters

Document 4: https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-quantum-computing/
Content Length: 5239 characters

Document 5: https://www.techtarget.com/whatis/definition/quantum-computing
Content Length: 13233 characters
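
Collecting every `<p>` tag also picks up navigation text; the Medium snippets in the search outputs of this notebook begin with "Sign up Sign in". One possible mitigation, sketched below, is to drop very short fragments and exact repeats before joining the paragraphs. The helper `clean_paragraphs` and the `min_words` threshold are assumptions, and the threshold may also discard legitimate short paragraphs.

```python
def clean_paragraphs(paragraphs, min_words=3):
    """Join paragraph strings, dropping very short fragments and exact repeats.

    `min_words` is an illustrative threshold, not a tuned value.
    """
    seen = set()
    kept = []
    for para in paragraphs:
        text = " ".join(para.split())      # collapse runs of whitespace
        if len(text.split()) < min_words:
            continue                       # likely a button or menu label
        if text in seen:
            continue                       # skip verbatim duplicates
        seen.add(text)
        kept.append(text)
    return " ".join(kept)

sample = [
    "Sign up", "Sign in", "Sign up",
    "Qubits can exist  in superposition.",
    "Qubits can exist in superposition.",
    "Follow",
]
print(clean_paragraphs(sample))
```

Inside `fetch_content`, this could be applied to the list of `para.get_text()` strings in place of the plain `' '.join(...)`.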



In [6]:
def get_embeddings(texts, batch_size=10, delay=1, max_retries=3):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        # Filter out any empty or non-string texts
        batch = [text for text in batch if isinstance(text, str) and text.strip() != '']
        if not batch:
            print(f"Skipping empty batch at indices {i} to {i+batch_size}")
            continue  # Skip empty batches
        for attempt in range(1, max_retries + 1):
            try:
                response = openai.Embedding.create(
                    input=batch,
                    model="text-embedding-ada-002"
                )
                batch_embeddings = [data['embedding'] for data in response['data']]
                embeddings.extend(batch_embeddings)
                print(f"Processed batch {i//batch_size +1}: Generated {len(batch_embeddings)} embeddings.")
                break
            except openai.error.RateLimitError as e:
                # Retry the same batch instead of moving on, so no texts are silently dropped
                print(f"RateLimitError (attempt {attempt}/{max_retries}): {e}. Sleeping for {delay} seconds.")
                time.sleep(delay)
            except openai.error.InvalidRequestError as e:
                print(f"InvalidRequestError for batch starting at index {i}: {e}")
                print("Batch content:", batch)
                break
            except Exception as e:
                print(f"An unexpected error occurred for batch starting at index {i}: {e}")
                print("Batch content:", batch)
                break
        else:
            # Reached only when every attempt hit the rate limit
            print(f"Giving up on batch starting at index {i} after {max_retries} attempts.")
        time.sleep(delay)  # Respect rate limits
    return embeddings
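
Rate-limit handling is often written as retry with exponential backoff, where the wait doubles after each failed attempt. The sketch below shows that pattern in a generic form that is not tied to the OpenAI client; `retry_with_backoff` and all of its parameters are hypothetical names chosen for illustration.

```python
import time

def retry_with_backoff(fn, retry_on=(Exception,), max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt and try again.

    Re-raises the last error once max_retries retries have been used.
    `sleep` is injectable so the helper can be tested without real waiting.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise                          # out of retries: surface the error
            sleep(base_delay * (2 ** attempt)) # 1x, 2x, 4x, ... the base delay
```

With the legacy client used in this notebook, a call might look like `retry_with_backoff(lambda: openai.Embedding.create(input=batch, model="text-embedding-ada-002"), retry_on=(openai.error.RateLimitError,))`, assuming `batch` is a list of non-empty strings.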


In [7]:
# Extract contents for embedding
texts = [doc['content'] for doc in documents]

# Checks
valid_texts = [text for text in texts if isinstance(text, str) and text.strip() != '']

print(f"Total documents: {len(texts)}")
print(f"Valid texts for embedding: {len(valid_texts)}")


Total documents: 5
Valid texts for embedding: 5


In [8]:
# Generate embeddings
embeddings = get_embeddings(valid_texts, batch_size=5, delay=1)

# Verify embeddings
print(f"Generated {len(embeddings)} embeddings.")


Processed batch 1: Generated 5 embeddings.
Generated 5 embeddings.


## **4. Database Setup**

### **Overview**

- **Objective:** Set up the FAISS vector database to store and manage embeddings.
- **Steps:** Normalized embeddings and added them to the FAISS index.

### **Implementation**

- Converted embeddings to a NumPy array.
- Normalized embeddings to unit length to enable cosine similarity via inner product.
- Added embeddings to the FAISS index.

In [9]:
# Convert embeddings to numpy array
embeddings_np = np.array(embeddings).astype('float32')

# Normalize the embeddings to unit length for cosine similarity
faiss.normalize_L2(embeddings_np)

# Add embeddings to the index
index.add(embeddings_np)

print(f"Number of vectors in the index: {index.ntotal}")


Number of vectors in the index: 5
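
`IndexFlatIP` is an exact index: a query is scored against every stored vector by inner product and the highest-scoring entries are returned. The plain-Python sketch below mimics that behaviour on toy unit vectors to make the ranking explicit. `flat_ip_search` is a hypothetical helper, not a FAISS function, and the vectors are made up.

```python
def flat_ip_search(vectors, query, top_k=3):
    """Return (scores, ids) of the top_k stored vectors by inner product with query."""
    scored = [
        (sum(q * x for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    top = scored[:top_k]
    return [score for score, _ in top], [idx for _, idx in top]

stored = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]  # toy vectors, already unit length
scores, ids = flat_ip_search(stored, [0.8, 0.6], top_k=2)
print(ids, scores)
```

The returned IDs are positions in insertion order, which is why a separate ID-to-document mapping is needed to recover the text.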


## **5. Embedding Storage**

### **Overview**

- **Goal:** Maintain a mapping between each embedding and its corresponding text document.
- **Method:** Created a dictionary to map FAISS index IDs to document details.

### **Implementation**

- Mapped each FAISS index ID to its document's URL and content.
- Ensured the mapping aligns correctly with the stored embeddings.

In [10]:
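# Dictionary mapping FAISS index IDs to documents.
# FAISS assigns IDs in insertion order, so enumerate only the documents that were
# actually embedded (the same filter used to build valid_texts).
valid_docs = [
    doc for doc in documents
    if isinstance(doc['content'], str) and doc['content'].strip() != ''
]
id_to_doc = {
    idx: {'url': doc['url'], 'content': doc['content']}
    for idx, doc in enumerate(valid_docs)
}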
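
A flat FAISS index numbers vectors 0, 1, 2, ... in the order they are added, so the mapping stays aligned only if IDs are assigned after empty documents have been filtered out. The sketch below illustrates this with a failed fetch in the middle of the list; `build_id_mapping` and the `example.com` URLs are hypothetical.

```python
def build_id_mapping(documents):
    """Return (valid_texts, id_to_doc) such that valid_texts[i] belongs to id_to_doc[i]."""
    valid_docs = [
        d for d in documents
        if isinstance(d.get("content"), str) and d["content"].strip()
    ]
    valid_texts = [d["content"] for d in valid_docs]
    id_to_doc = {
        i: {"url": d["url"], "content": d["content"]}
        for i, d in enumerate(valid_docs)
    }
    return valid_texts, id_to_doc

docs = [
    {"url": "https://example.com/a", "content": "first"},
    {"url": "https://example.com/b", "content": ""},       # simulated failed fetch
    {"url": "https://example.com/c", "content": "third"},
]
texts, mapping = build_id_mapping(docs)
print(mapping[1]["url"])  # the second embedded text belongs to document "c", not "b"
```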


## **6. Implement Semantic Search**

### **Overview**

- **Purpose:** Enable semantic search functionality using the generated embeddings and FAISS index.
- **Process:** Generate query embeddings, perform similarity search, and retrieve top matching documents.

### **Implementation**

- Defined a `semantic_search` function that handles embedding generation, normalization, and FAISS search.
- Retrieved and displayed relevant documents based on similarity scores.

In [11]:
def semantic_search(query, top_k=3):
    try:
        # Generate embeddings for the query
        response = openai.Embedding.create(
            input=[query],
            model="text-embedding-ada-002"
        )
        query_embedding = np.array(response['data'][0]['embedding'], dtype='float32').reshape(1, -1)
        
        # Normalize the query embedding in place (FAISS expects a 2D float32 array)
        faiss.normalize_L2(query_embedding)
        
        # Perform similarity search
        D, I = index.search(query_embedding, top_k)
        
        # Collect the top matching documents
        results = []
        for score, idx in zip(D[0], I[0]):
            doc = id_to_doc.get(idx, {})
            if doc:
                results.append({
                    'score': score,
                    'url': doc.get('url', ''),
                    'content': doc.get('content', '')[:500]  # Show first 500 characters
                })
        return results
    except Exception as e:
        print(f"Error during semantic search: {e}")
        return []
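
When `top_k` exceeds the number of stored vectors, FAISS pads the returned ID array with `-1`. The sketch below shows one way to turn raw search output into result dictionaries while skipping those placeholders; `collect_results` is a hypothetical helper and the inputs are toy values rather than real search output.

```python
def collect_results(scores, ids, id_to_doc, snippet_len=500):
    """Pair scores with documents, skipping padded (-1) or unmapped IDs."""
    results = []
    for score, idx in zip(scores, ids):
        idx = int(idx)              # NumPy integers become plain ints
        if idx < 0:
            continue                # -1 marks "no result" in FAISS output
        doc = id_to_doc.get(idx)
        if doc is None:
            continue                # ID without a known document
        results.append({
            "score": float(score),
            "url": doc["url"],
            "content": doc["content"][:snippet_len],
        })
    return results

toy_mapping = {1: {"url": "https://example.com/doc", "content": "abcdefgh"}}
toy_results = collect_results([0.91, 0.85, -3.4e38], [1, 5, -1], toy_mapping, snippet_len=4)
print(toy_results)
```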


In [12]:
# Example queries
queries = [
    "What is the significance of quantum entanglement in computing?",
    "Explain quantum superposition and its applications.",
    "How do quantum algorithms differ from classical algorithms?"
]

# Perform semantic search for each query
for query in queries:
    print(f"\nQuery: {query}")
    results = semantic_search(query, top_k=3)
    for idx, res in enumerate(results, 1):
        print(f"\nResult {idx}:")
        print(f"Score: {res['score']:.4f}")
        print(f"URL: {res['url']}")
        print(f"Content Snippet: {res['content']}\n")



Query: What is the significance of quantum entanglement in computing?

Result 1:
Score: 0.8750
URL: https://www.techtarget.com/whatis/definition/quantum-computing
Content Snippet: Quantum computing is an area of computer science focused on the development of computers based on the principles of quantum theory. Quantum computing uses the unique behaviors of quantum physics to solve problems that are too complex for classical computing. Quantum computers work by taking advantage of quantum mechanical properties like superposition and quantum interference. They use special hardware and algorithms that can take advantage of these quantum effects. The development of quantum co


Result 2:
Score: 0.8689
URL: https://medium.com/@vignesh2659/quantum-computing-abd85aa5da9d
Content Snippet: Sign up Sign in Sign up Sign in Vignesh R Follow -- Listen Share Quantum Realm in the MARVEL multiverse, our very own Ant-man has one of the coolest superpowers out there. He becomes either too tiny or too h

## **7. Testing and Evaluation**

### **Overview**

- **Objective:** Validate the effectiveness of the semantic search system.
- **Methods:** Conducted automated tests with diverse queries and evaluated the relevance of results.

### **Implementation**

- Defined a set of test queries covering various aspects of Quantum Computing.
- Analyzed search results to assess relevance and accuracy.
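
One way to make the relevance assessment measurable is a hit rate: the fraction of test queries whose top results include at least one URL judged relevant in advance. The sketch below assumes such hand-labelled expectations exist; this notebook does not define them, and `hit_rate_at_k` as well as the sample data are hypothetical.

```python
def hit_rate_at_k(results_by_query, expected_by_query, k=3):
    """Fraction of queries with at least one expected URL among the top k results."""
    if not expected_by_query:
        return 0.0
    hits = 0
    for query, expected_urls in expected_by_query.items():
        top_urls = [r["url"] for r in results_by_query.get(query, [])[:k]]
        if any(url in expected_urls for url in top_urls):
            hits += 1
    return hits / len(expected_by_query)

# Toy data: hand-labelled expectations and search output shaped like semantic_search results
expected = {
    "q1": {"https://example.com/a"},
    "q2": {"https://example.com/b"},
}
found = {
    "q1": [{"url": "https://example.com/a"}, {"url": "https://example.com/c"}],
    "q2": [{"url": "https://example.com/c"}],
}
rate = hit_rate_at_k(found, expected, k=3)
print(rate)
```

With only five indexed documents and `top_k=3`, a hit is easy to achieve by chance, so a smaller `k` gives a more informative signal.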

In [13]:
# Define test queries
test_queries = [
    "Benefits of quantum computing in cryptography",
    "Applications of quantum machine learning",
    "Challenges in building quantum computers",
    "What is a Grover operator and what are qubits"
]

# Perform and display results for test queries
for query in test_queries:
    print(f"\n---\nQuery: {query}")
    results = semantic_search(query, top_k=3)
    for idx, res in enumerate(results, 1):
        print(f"\nResult {idx}:")
        print(f"Score: {res['score']:.4f}")
        print(f"URL: {res['url']}")
        print(f"Content Snippet: {res['content']}\n")



---
Query: Benefits of quantum computing in cryptography

Result 1:
Score: 0.8601
URL: https://www.techtarget.com/whatis/definition/quantum-computing
Content Snippet: Quantum computing is an area of computer science focused on the development of computers based on the principles of quantum theory. Quantum computing uses the unique behaviors of quantum physics to solve problems that are too complex for classical computing. Quantum computers work by taking advantage of quantum mechanical properties like superposition and quantum interference. They use special hardware and algorithms that can take advantage of these quantum effects. The development of quantum co


Result 2:
Score: 0.8497
URL: https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-quantum-computing
Content Snippet: Flip a coin. Heads or tails, right? Sure, once we see how the coin lands. But while the coin is still spinning in the air, it’s neither heads nor tails. It’s some probability of both.  This 

## **Bonus: Experiment with Different Similarity Search Algorithms**

### **Overview**

- **Objective:** Compare performance between cosine similarity and Euclidean distance for semantic search.
- **Approach:** Implemented both similarity measures using FAISS and compared results.

### **Implementation**

- Initialized a separate FAISS index for Euclidean distance.
- Defined a `semantic_search_euclidean` function.
- Conducted comparative analysis using a sample query.

#### Initialize FAISS Index for Euclidean Distance

In [14]:
# FAISS index for Euclidean distance
index_euclidean = faiss.IndexFlatL2(embedding_dim)

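# embeddings_np was normalized in place above, so rebuild the original
# (non-normalized) vectors from the raw embeddings before adding them
raw_embeddings_np = np.array(embeddings).astype('float32')
index_euclidean.add(raw_embeddings_np)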

print(f"Number of vectors in the Euclidean index: {index_euclidean.ntotal}")


Number of vectors in the Euclidean index: 5


#### Modify Semantic Search for Euclidean Distance

In [15]:
def semantic_search_euclidean(query, top_k=3):
    try:
        # Generate embeddings for the query
        response = openai.Embedding.create(
            input=[query],
            model="text-embedding-ada-002"
        )
        query_embedding = np.array(response['data'][0]['embedding']).astype('float32')
        
        # Perform similarity search using Euclidean distance
        D, I = index_euclidean.search(query_embedding.reshape(1, -1), top_k)
        
        # Collect the top matching documents
        results = []
        for distance, idx in zip(D[0], I[0]):
            doc = id_to_doc.get(idx, {})
            if doc:
                results.append({
                    'distance': distance,
                    'url': doc.get('url', ''),
                    'content': doc.get('content', '')[:500]  # Show first 500 characters
                })
        return results
    except Exception as e:
        print(f"Error during Euclidean semantic search: {e}")
        return []
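
`IndexFlatL2` reports squared Euclidean distances. For unit-length vectors the squared distance and the cosine similarity are tied by the identity d² = 2 − 2·cos, so the two measures produce the same ranking in that case (smaller distance corresponds to larger similarity). OpenAI's documentation describes its embeddings as normalized to length 1, so the two searches in this notebook are expected to agree closely, though that is an assumption worth checking on real output. The sketch below verifies the identity on toy vectors with hypothetical helper names.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

u = l2_normalize([3.0, 4.0, 0.0])
v = l2_normalize([1.0, 2.0, 2.0])

cos_uv = dot(u, v)          # cosine similarity of unit vectors
dist_sq = squared_l2(u, v)  # squared L2 distance of the same vectors

print(dist_sq, 2 - 2 * cos_uv)  # equal up to floating-point error
```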


#### Compare Similarity Measures

In [16]:
# Query to compare similarity measures
comparison_query = "Explain the role of qubits in quantum computing."

# Semantic search using cosine similarity (inner product)
cosine_results = semantic_search(comparison_query, top_k=3)

# Semantic search using Euclidean distance
euclidean_results = semantic_search_euclidean(comparison_query, top_k=3)

# Display results
print(f"\nQuery: {comparison_query}\n")

print("Cosine Similarity Results:")
for idx, res in enumerate(cosine_results, 1):
    print(f"\nResult {idx}:")
    print(f"Score: {res['score']:.4f}")
    print(f"URL: {res['url']}")
    print(f"Content Snippet: {res['content']}\n")

print("Euclidean Distance Results:")
for idx, res in enumerate(euclidean_results, 1):
    print(f"\nResult {idx}:")
    print(f"Distance: {res['distance']:.4f}")
    print(f"URL: {res['url']}")
    print(f"Content Snippet: {res['content']}\n")



Query: Explain the role of qubits in quantum computing.

Cosine Similarity Results:

Result 1:
Score: 0.8894
URL: https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-quantum-computing/
Content Snippet: It's the use of quantum mechanics to run calculations on specialized hardware. To fully define quantum computing, we need to define some key terms first. The quantum in "quantum computing" refers to the quantum mechanics that the system uses to calculate outputs. In physics, a quantum is the smallest possible discrete unit of any physical property. It usually refers to properties of atomic or subatomic particles, such as electrons, neutrinos, and photons. A qubit is the basic unit of information


Result 2:
Score: 0.8789
URL: https://medium.com/@vignesh2659/quantum-computing-abd85aa5da9d
Content Snippet: Sign up Sign in Sign up Sign in Vignesh R Follow -- Listen Share Quantum Realm in the MARVEL multiverse, our very own Ant-man has one of the coolest superpowe

# **Assignment Requirements Verification**

The provided solution satisfies all the specified requirements:

1. **Choose an LLM:**
   - Utilized OpenAI's `text-embedding-ada-002` model to generate embeddings.

2. **Select a Vector Database:**
   - Chose FAISS for storing and managing high-dimensional vectors.

3. **Data Preparation:**
   - Collected text documents from five open-access URLs related to Quantum Computing.
   - Used OpenAI's API to convert text documents into embeddings.

4. **Database Setup:**
   - Installed and set up FAISS.
   - Created and configured FAISS indices to store embeddings.

5. **Embedding Storage:**
   - Inserted generated embeddings into the FAISS index.
   - Maintained a mapping between each embedding and its corresponding text document.

6. **Implement Semantic Search:**
   - Defined functions to generate query embeddings and perform similarity searches using FAISS.
   - Retrieved and displayed corresponding text documents for top matches.

7. **Testing and Evaluation:**
   - Conducted automated testing with various queries.
   - Evaluated the relevance of search results through example queries.
   - Suggested potential improvements based on testing outcomes.

**Bonus:**

- Experimented with different similarity search algorithms (cosine similarity vs. Euclidean distance) and compared their performance.
- The structure allows for the addition of features like filtering, ranking, or grouping based on relevance scores.