### Information Retrieval (w/ BERT)

Information Retrieval (IR) is the process of obtaining relevant information from large collections, typically documents or databases, in response to a query. The main goal of an IR system is to find and return the items that best match the user’s needs.

#### Key Concepts in Information Retrieval

1. **Document**: A unit of information that is stored in the system, such as a web page, article, or any piece of data.
2. **Query**: The search term or question provided by the user to retrieve relevant documents.
3. **Ranking**: The process of ordering retrieved documents based on their relevance to the query.
4. **Relevance**: A measure of how well a document satisfies the user's query.

#### Types of Information Retrieval

1. **Boolean Retrieval**: This model retrieves documents based on exact matches to the query terms. The query is typically represented using Boolean operators (AND, OR, NOT).

   - Example: A query "apple AND orange" retrieves documents that contain both "apple" and "orange."

2. **Vector Space Model (VSM)**: In VSM, documents and queries are represented as vectors in a multi-dimensional space. Relevance is typically measured by the cosine similarity between the query vector and each document vector.

   - Example: A query "apple fruit" is compared to every document vector, and the documents are ranked by similarity, with the closest match first.

3. **Probabilistic Retrieval**: Probabilistic models estimate the likelihood that a document is relevant to a given query. A well-known example is BM25 (Best Matching 25), which ranks documents based on term frequency, inverse document frequency, and document length.
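
The Boolean and BM25 models described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the toy corpus, the helper names `boolean_and` and `bm25_scores`, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for the example, and the IDF formula is one common smoothed variant among several in use.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "apple and orange juice",
    "apple pie recipe",
    "orange marmalade recipe with orange peel",
]
tokenized = [d.split() for d in docs]

# Boolean retrieval: inverted index mapping each term to the set of
# document ids that contain it.
index = {}
for doc_id, tokens in enumerate(tokenized):
    for term in tokens:
        index.setdefault(term, set()).add(doc_id)


def boolean_and(*terms):
    """Return the sorted ids of documents that contain every term."""
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


def bm25_scores(query_terms, k1=1.5, b=0.75):
    """Score every document against the query with a BM25 variant."""
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            df = len(index.get(term, ()))  # document frequency
            # Smoothed inverse document frequency (always non-negative)
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Term frequency, dampened by k1 and normalized by length via b
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


print(boolean_and("apple", "orange"))  # [0]
print(bm25_scores(["orange"]))  # the second document lacks "orange" and scores 0.0
```

Boolean retrieval returns an unranked set of matches, while BM25 assigns each document a score that can be sorted, which is the practical difference between the two models.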

#### Information Retrieval Example

- **Query**: "Best smartphones 2025"
- **Documents**: Articles, blogs, and product reviews about smartphones for the year 2025.
- **Retrieved Documents**: The system returns the documents it scores as most relevant to the query, ordered from most to least relevant.

#### Challenges in Information Retrieval

- **Ambiguity**: Queries may have multiple meanings or interpretations, which makes it hard to match documents.
- **Relevance Ranking**: Determining the most relevant documents among a large collection can be complex.
- **Handling Noise**: Some irrelevant or low-quality documents may be retrieved along with relevant ones, requiring better filtering techniques.

#### Modern Approaches in IR

- **Search Engines**: Modern search engines such as Google and Bing build on sophisticated IR models that incorporate machine learning, natural language processing, and deep learning techniques to rank and retrieve documents more accurately.
- **Neural IR**: Deep learning models, such as those based on BERT or T5, have been applied to IR tasks to improve relevance ranking by modeling the context of queries and documents.


---


In [2]:
import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertModel, BertTokenizer

# BERT Model and Tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
model.eval()  # inference only: make sure dropout is disabled

# Documents and query
documents = [
    "Machine learning is a field of artificial intelligence.",
    "Banking system involves understanding human language.",
    "Calculus encompasses probability",
    "Data science combines statistics, data analysis, and machine learning.",
]
query = "What is machine learning?"


# Function to get embeddings
def get_embedding(text):
    """Return a (1, hidden_size) NumPy embedding for a single string."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    # Gradients are not needed for inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract the last hidden state: shape (1, seq_len, hidden_size)
    last_hidden_state = outputs.last_hidden_state

    # Mean-pool over the token dimension (this includes [CLS] and [SEP]).
    # This is only valid for one text at a time; a padded batch would need
    # the attention mask to exclude padding tokens from the mean.
    embedding = last_hidden_state.mean(dim=1)

    return embedding.numpy()


# Get embedding vectors for documents and query
doc_embeddings = np.vstack([get_embedding(doc) for doc in documents])
query_embedding = get_embedding(query)

# Cosine similarity calculation
similarities = cosine_similarity(query_embedding, doc_embeddings)

# Print the similarity scores for each document
for i, score in enumerate(similarities[0]):
    print(f"Document {i+1}: {score}")

# Get the most similar document (similarities has shape (1, n_documents))
most_similar_index = similarities[0].argmax()
print("\nMost similar document")
print(documents[most_similar_index])

Document 1: 0.7235309481620789
Document 2: 0.6327658891677856
Document 3: 0.698382556438446
Document 4: 0.6562472581863403

Most similar document
Machine learning is a field of artificial intelligence.
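
The cell above reports only the top match. As a small sketch of the ranking step, the scores printed above can be sorted into a full ranked list. The scores below are copied from that output and rounded to four decimals; the variable names are illustrative.

```python
# Documents and scores taken from the cell and output above
documents = [
    "Machine learning is a field of artificial intelligence.",
    "Banking system involves understanding human language.",
    "Calculus encompasses probability",
    "Data science combines statistics, data analysis, and machine learning.",
]
scores = [0.7235, 0.6328, 0.6984, 0.6562]

# Sort document indices by score, highest first
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. ({scores[idx]:.4f}) {documents[idx]}")
```

All four scores fall within a narrow band (roughly 0.63 to 0.72), and the calculus sentence ranks above the data science sentence even though only the latter mentions machine learning. This suggests that cosine similarity over mean-pooled `bert-base-uncased` embeddings separates relevant from irrelevant text only weakly in this example, so the ranking is best read as a demonstration of the mechanics rather than a tuned retrieval system.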
