# Information Retrieval

<div style="text-align:justify">
Information Retrieval (IR) is the field of computer science concerned with finding relevant information in large data collections. A user submits a query, and the system retrieves and ranks matching documents using indexing, ranking models, and user feedback. Challenges include the scale of the data, ambiguous user intent, and multilingual content. IR underpins search engines, digital libraries, and similar systems that help users find and use information effectively.
</div>

#### Dependencies

In [1]:
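import string  # string.punctuation, used to strip punctuation
import numpy  # dot product for the similarity score
from numpy.linalg import norm  # Euclidean vector norm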

#### Documents

In [2]:
docs = [
    'I love machine learning.',
    'The weather is great today.',
    'Machine learning is fun.',
    'I enjoy learning new things.'
]

docs

['I love machine learning.',
 'The weather is great today.',
 'Machine learning is fun.',
 'I enjoy learning new things.']

#### Pre-processing

In [3]:
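# Strip punctuation and lowercase each document
processed_docs = [''.join([char for char in doc if char not in string.punctuation]).lower() for doc in docs]
# Split each document into a list of word tokens
tokens = [doc.split() for doc in processed_docs]
# Vocabulary: every unique word in the collection (a set, so its order is arbitrary)
all_words = {word for doc_tokens in tokens for word in doc_tokens}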

print(all_words)

{'things', 'weather', 'machine', 'is', 'today', 'new', 'enjoy', 'love', 'learning', 'i', 'fun', 'the', 'great'}


#### One-hot Encoding

In [4]:
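def one_hot_encode(doc):
    # doc is a list of word tokens; the result has one position per vocabulary word
    vocab = list(all_words)  # Fix the set's iteration order once per call
    encoding = [0] * len(vocab)  # Zero vector with length = total unique words
    for word in doc:
        if word in all_words:  # Words outside the vocabulary are ignored
            encoding[vocab.index(word)] = 1
    return encoding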
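
Because `all_words` is a set, the position assigned to each word can differ from one Python session to the next, and the function above marks every word that occurs in a document, so the result is a binary bag-of-words vector rather than a strict one-hot vector. As a minimal sketch of one way to get a reproducible layout, the vocabulary can be sorted and stored as a word-to-index dictionary. The names `build_vocab` and `encode` are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical helpers for illustration; not part of the notebook's own cells.
def build_vocab(token_lists):
    """Map each unique word to a fixed column index, in alphabetical order."""
    words = sorted({word for tokens in token_lists for word in tokens})
    return {word: index for index, word in enumerate(words)}


def encode(tokens, vocab):
    """Binary bag-of-words vector: 1 where a vocabulary word occurs in tokens."""
    vector = [0] * len(vocab)
    for word in tokens:
        index = vocab.get(word)  # None for out-of-vocabulary words
        if index is not None:
            vector[index] = 1
    return vector


sample_tokens = [
    ['i', 'love', 'machine', 'learning'],
    ['machine', 'learning', 'is', 'fun'],
]
sample_vocab = build_vocab(sample_tokens)
print(sample_vocab)
# {'fun': 0, 'i': 1, 'is': 2, 'learning': 3, 'love': 4, 'machine': 5}
print(encode(['machine', 'learning', 'rocks'], sample_vocab))
# [0, 0, 0, 1, 0, 1]
```

The dictionary lookup also avoids rebuilding a list and scanning it for every word, which matters once the vocabulary grows beyond a toy example.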

#### Search Engine

<div style="text-align:justify">
Search engine techniques range from simple keyword-based approaches to more complex methods. Simple techniques involve matching keywords or using boolean operators. More advanced techniques include vector space models, probabilistic models, and latent semantic analysis. Modern approaches involve machine learning, NLP, semantic understanding, neural networks, cross-modal search, and entity-driven search. These techniques progressively enhance the search engine's ability to understand user intent and retrieve relevant information from diverse sources.
<br><br>
We will use the most straightforward vector-based method: the query and each document are represented as binary word vectors, and each document is scored by the cosine similarity between its vector and the query vector.
</div>
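
For contrast with the vector approach, the simplest technique mentioned above, keyword matching with a boolean AND, might look roughly like the following. This is a sketch under stated assumptions: `tokenize` and `boolean_and_search` are hypothetical names, and the sample documents are copied from the Documents section so the block runs on its own.

```python
import string


def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = ''.join(ch for ch in text if ch not in string.punctuation)
    return cleaned.lower().split()


def boolean_and_search(query, documents):
    """Return the documents that contain every query term (boolean AND)."""
    terms = set(tokenize(query))
    return [doc for doc in documents if terms <= set(tokenize(doc))]


sample_docs = [
    'I love machine learning.',
    'The weather is great today.',
    'Machine learning is fun.',
    'I enjoy learning new things.',
]
print(boolean_and_search('machine learning', sample_docs))
# ['I love machine learning.', 'Machine learning is fun.']
```

Boolean matching only answers yes or no for each document; it gives no score, which is why the rest of this notebook uses a similarity measure instead.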

In [5]:
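def search(A, B):
    # Cosine similarity between vectors A and B
    denom = norm(A) * norm(B)
    return numpy.dot(A, B) / denom if denom else 0.0  # All-zero vector: score 0 instead of NaN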
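
The `search` function computes the cosine similarity between two vectors, that is, their dot product divided by the product of their Euclidean norms. To make the arithmetic visible, here is a small pure-Python sketch of the same formula on toy vectors. The function name `cosine_similarity` and the five-word vocabulary order are assumptions chosen for illustration only.

```python
import math


def cosine_similarity(a, b):
    """dot(a, b) / (||a|| * ||b||), or 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumed toy vocabulary order: [i, love, machine, learning, enjoy]
query_vec = [0, 0, 1, 1, 0]  # "machine learning"
doc_a = [1, 1, 1, 1, 0]      # "i love machine learning"
doc_b = [1, 0, 0, 1, 1]      # "i enjoy learning"
doc_c = [0, 0, 0, 0, 0]      # no vocabulary words at all

print(round(cosine_similarity(query_vec, doc_a), 4))  # 2 / (sqrt(2) * sqrt(4)) = 0.7071
print(round(cosine_similarity(query_vec, doc_b), 4))  # 1 / (sqrt(2) * sqrt(3)) = 0.4082
print(cosine_similarity(query_vec, doc_c))            # 0.0
```

For binary vectors the dot product is simply the number of shared words, so longer documents with the same overlap score lower.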

#### Encoding Documents

In [6]:
encoded_docs = [one_hot_encode(doc) for doc in tokens]
encoded_docs

[[0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
 [0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
 [0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0],
 [1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0]]

#### Query

In [7]:
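query = 'machine learning'
# Apply the same pre-processing as the documents (strip punctuation, lowercase)
processed_query = ''.join([char for char in query if char not in string.punctuation]).lower()
# From here on, query holds the encoded vector rather than the raw string
query = one_hot_encode(processed_query.split())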

query

[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

#### Ranking

In [8]:
# Score every document against the query
scores = [search(query, doc) for doc in encoded_docs]

# Keep only the documents whose score reaches the threshold
threshold = 0.5
matched_docs = [docs[i] for i, score in enumerate(scores) if score >= threshold]

print('Documents matching the query:')
for doc in matched_docs:
    print('-', doc)

Documents matching the query:
- I love machine learning.
- Machine learning is fun.
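
The cell above filters documents by a threshold but prints them in their original order. As a minimal sketch of an actual ranking step, the scores can be sorted in descending order before display. The helper names `cosine` and `rank` are hypothetical, the vectors are copied from the Encoding Documents and Query outputs above so the block runs on its own, and the lower threshold of 0.3 is an assumption chosen to show a third, weaker match.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / denom if denom else 0.0


def rank(query_vec, doc_vecs, texts, threshold=0.0):
    """Return (score, text) pairs at or above threshold, best match first."""
    scored = [(cosine(query_vec, vec), text) for vec, text in zip(doc_vecs, texts)]
    kept = [pair for pair in scored if pair[0] >= threshold]
    # sorted() is stable, so tied documents keep their original order
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


texts = [
    'I love machine learning.',
    'The weather is great today.',
    'Machine learning is fun.',
    'I enjoy learning new things.',
]
doc_vecs = [
    [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0],
]
query_vec = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

ranked = rank(query_vec, doc_vecs, texts, threshold=0.3)
for score, text in ranked:
    print(f'{score:.3f}  {text}')
# 0.707  I love machine learning.
# 0.707  Machine learning is fun.
# 0.316  I enjoy learning new things.
```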
