# Boolean Retrieval

**Why linear scanning with grep is not enough:**

1. To process large document collections quickly. The amount of online data
has grown at least as quickly as the speed of computers, and we would
now like to be able to search collections that total on the order of billions
to trillions of words.
2. To allow more flexible matching operations. For example, it is impractical
to perform the query Romans NEAR countrymen with grep, where NEAR
might be defined as “within 5 words” or “within the same sentence”.
3. To allow ranked retrieval: in many cases you want the best answer to an
information need among many documents that contain certain words.
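
To make the second point concrete, here is a minimal sketch of a proximity check. It assumes NEAR means "within `k` words" and uses plain whitespace tokenization; the helper `near` and the sample sentence are hypothetical and only illustrate why word positions are needed.

```python
def near(text, term_a, term_b, k=5):
    """Return True if term_a and term_b occur within k words of each other.

    Hypothetical helper for illustration: tokenization is lowercase
    whitespace splitting, so punctuation handling is deliberately naive.
    """
    words = text.lower().split()
    # Record every position at which each term occurs.
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    # The terms are "near" if any pair of positions is at most k apart.
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

sample = "friends romans countrymen lend me your ears"
print(near(sample, "romans", "countrymen"))   # True: adjacent words
print(near(sample, "friends", "ears", k=5))   # False: 6 positions apart
```

A line-oriented scan has no notion of word positions, which is why a check like this is awkward to express with grep.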

## 1. Intersection

In [1]:
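def intersection(A, B):
    """Merge-intersect two sorted postings lists in O(len(A) + len(B)) time."""
    a, b = len(A), len(B)
    i, j = 0, 0
    result = []
    while i < a and j < b:
        if A[i] == B[j]:
            # Doc ID occurs in both lists: keep it and advance both pointers.
            result.append(A[i])
            i += 1
            j += 1
        elif A[i] < B[j]:
            # Advance the pointer that sits on the smaller doc ID.
            i += 1
        else:
            j += 1
    return result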

A = [1, 2, 4, 11, 31, 45, 173, 174]
B = [2, 31, 54, 101]

print(intersection(A, B))

[2, 31]


## Exercises

### Exercise 1.1: Draw the inverted index that would be built for the following document collection.
---
**Doc 1** new home sales top forecasts

**Doc 2** home sales rise in july

**Doc 3** increase in home sales in july

**Doc 4** july new home sales rise

In [2]:
tokens, inverted_index = [], {}

def tokenizer(doc, doc_id):
    list_of_tokens = doc.split(' ')
    for token in list_of_tokens:
        if (token, doc_id) not in tokens:
            tokens.append((token, doc_id))

doc1 = 'new home sales top forecasts'
doc2 = 'home sales rise in july'
doc3 = 'increase in home sales in july'
doc4 = 'july new home sales rise'

tokenizer(doc1, 1)
tokenizer(doc2, 2)
tokenizer(doc3, 3)
tokenizer(doc4, 4)

tokens = sorted(tokens)

# tokens is sorted by (term, doc_id) and holds no duplicate pairs, so a single
# pass yields each postings list already in ascending doc ID order.
for term, doc_id in tokens:
    inverted_index.setdefault(term, []).append(doc_id)
print(tokens)
print()
for key in inverted_index.keys():
    print(key, '->', inverted_index[key], '->', len(inverted_index[key]))

[('forecasts', 1), ('home', 1), ('home', 2), ('home', 3), ('home', 4), ('in', 2), ('in', 3), ('increase', 3), ('july', 2), ('july', 3), ('july', 4), ('new', 1), ('new', 4), ('rise', 2), ('rise', 4), ('sales', 1), ('sales', 2), ('sales', 3), ('sales', 4), ('top', 1)]

forecasts -> [1] -> 1
home -> [1, 2, 3, 4] -> 4
in -> [2, 3] -> 2
increase -> [3] -> 1
july -> [2, 3, 4] -> 3
new -> [1, 4] -> 2
rise -> [2, 4] -> 2
sales -> [1, 2, 3, 4] -> 4
top -> [1] -> 1
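
The postings lists above can be combined with the merge from Section 1 to answer conjunctive queries. The following is a minimal sketch: the postings are copied from the output above, `and_query` is a hypothetical helper, and processing the shortest list first is a common heuristic chosen here rather than something the exercise requires.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted postings lists."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Subset of the inverted index built in Exercise 1.1.
index = {
    'home': [1, 2, 3, 4],
    'july': [2, 3, 4],
    'new': [1, 4],
    'rise': [2, 4],
    'sales': [1, 2, 3, 4],
}

def and_query(index, *terms):
    """Hypothetical helper: AND together the postings of all given terms."""
    # Unknown terms get an empty postings list, which empties the result.
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings:
        return []
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result

print(and_query(index, 'new', 'july'))            # [4]
print(and_query(index, 'home', 'sales', 'rise'))  # [2, 4]
```

Starting from the shortest list keeps every intermediate result no longer than that list.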


### Exercise 1.2: Draw the term-document incidence matrix for the following document collection.
---

**Doc 1** breakthrough drug for schizophrenia

**Doc 2** new schizophrenia drug

**Doc 3** new approach for treatment of schizophrenia

**Doc 4** new hopes for schizophrenia patients


In [3]:
import numpy as np
import pandas as pd

In [4]:
tokens = []

doc1 = "breakthrough drug for schizophrenia"
doc2 = "new schizophrenia drug"
doc3 = "new approach for treatment of schizophrenia"
doc4 = "new hopes for schizophrenia patients"

tokenizer(doc1, 1)
tokenizer(doc2, 2)
tokenizer(doc3, 3)
tokenizer(doc4, 4)

tokens = sorted(tokens)
unique_tokens = []
for token in tokens:
    if token[0] not in unique_tokens:
        unique_tokens.append(token[0])
print(unique_tokens)

['approach', 'breakthrough', 'drug', 'for', 'hopes', 'new', 'of', 'patients', 'schizophrenia', 'treatment']


In [5]:
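incidence_matrix = np.zeros(shape=(len(unique_tokens), 4), dtype=int)
df = pd.DataFrame(incidence_matrix, index=unique_tokens)

# Column k holds Doc k+1. Each document is tested independently (an if/elif
# chain would stop at the first matching document), and terms are compared
# against whole tokens rather than substrings of the raw text.
for col, doc in enumerate([doc1, doc2, doc3, doc4]):
    doc_tokens = doc.split(' ')
    for word in unique_tokens:
        if word in doc_tokens:
            df.loc[word, col] = 1

df

| term | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| approach | 0 | 0 | 1 | 0 |
| breakthrough | 1 | 0 | 0 | 0 |
| drug | 1 | 1 | 0 | 0 |
| for | 1 | 0 | 1 | 1 |
| hopes | 0 | 0 | 0 | 1 |
| new | 0 | 1 | 1 | 1 |
| of | 0 | 0 | 1 | 0 |
| patients | 0 | 0 | 0 | 1 |
| schizophrenia | 1 | 1 | 1 | 1 |
| treatment | 0 | 0 | 1 | 0 |


### Exercise 1.3: Evaluate the query schizophrenia AND drug on the collection from Exercise 1.2.

In [6]:
# Build one character per document: '1' if both terms occur in it, else '0'.
answer = ''
for col in df.columns:
    if df.loc['schizophrenia', col] == 1 and df.loc['drug', col] == 1:
        answer += '1'
    else:
        answer += '0'
answer

'1100'

Columns 0 and 1 are set, so the query returns Doc 1 and Doc 2.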
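
The same query can also be checked with a bitwise AND over incidence vectors. This is a self-contained sketch: the `docs` dictionary and the helper `incidence_vector` are hypothetical names introduced here, and terms are matched against whitespace-split tokens.

```python
import numpy as np

# Document collection from Exercise 1.2, keyed by doc ID.
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}
doc_ids = sorted(docs)

def incidence_vector(term):
    """Hypothetical helper: one boolean per document, True if term occurs."""
    return np.array([term in docs[d].split() for d in doc_ids], dtype=bool)

# Boolean AND of the two term rows gives the matching documents.
hits = incidence_vector('schizophrenia') & incidence_vector('drug')

print(''.join('1' if h else '0' for h in hits))   # 1100
print([d for d, h in zip(doc_ids, hits) if h])    # [1, 2]
```

Only Doc 1 and Doc 2 contain both terms, which is why the first two bits are set.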