# Assignment
### 1. Draw the term-document incidence matrix and inverted index representation for this collection:
- **Doc 1:** breakthrough drug for schizophrenia
- **Doc 2:** new schizophrenia drug
- **Doc 3:** new approach for treatment of schizophrenia
- **Doc 4:** new hopes for schizophrenia patients

### 2. What are the returned results for these queries?
- **schizophrenia AND drug**
- **drug OR approach**

In [1]:
import pandas as pd
import numpy as np

In [2]:
Doc_1= 'breakthrough drug for schizophrenia'
Doc_2= 'new schizophrenia drug'
Doc_3= 'new approach for treatment of schizophrenia'
Doc_4= 'new hopes for schizophrenia patients'

In [3]:
# Collect the documents into a list
docs = [Doc_1, Doc_2, Doc_3, Doc_4]
docs

['breakthrough drug for schizophrenia',
 'new schizophrenia drug',
 'new approach for treatment of schizophrenia',
 'new hopes for schizophrenia patients']

In [4]:
# Gather the set of all unique terms
unique_terms = {term for doc in docs for term in doc.split()}
unique_terms

{'approach',
 'breakthrough',
 'drug',
 'for',
 'hopes',
 'new',
 'of',
 'patients',
 'schizophrenia',
 'treatment'}

### 1.(a) Term-Document Matrix

In [5]:
# Build the term-document incidence matrix as a Python dictionary

term_doc_matrix = {}

for term in unique_terms:
    # Each unique term is a key; its value holds one 0/1 entry per document
    term_doc_matrix[term] = []

    for doc in docs:
        # Append 1 if the term occurs as a whole token in the doc, else 0.
        # Testing against doc.split() avoids substring hits such as 'or' in 'for'.
        if term in doc.split():
            term_doc_matrix[term].append(1)
        else:
            term_doc_matrix[term].append(0)

term_doc_matrix

{'of': [0, 0, 1, 0],
 'drug': [1, 1, 0, 0],
 'for': [1, 0, 1, 1],
 'treatment': [0, 0, 1, 0],
 'patients': [0, 0, 0, 1],
 'breakthrough': [1, 0, 0, 0],
 'approach': [0, 0, 1, 0],
 'schizophrenia': [1, 1, 1, 1],
 'hopes': [0, 0, 0, 1],
 'new': [0, 1, 1, 1]}
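Because `unique_terms` is a set, the row order of the dictionary above can differ between runs. A minimal sketch of laying the same matrix out as a table with alphabetically sorted rows follows; the helper name `incidence_rows` is illustrative and not part of the assignment, and the documents are repeated so the cell runs on its own.

```python
docs = [
    'breakthrough drug for schizophrenia',
    'new schizophrenia drug',
    'new approach for treatment of schizophrenia',
    'new hopes for schizophrenia patients',
]


def incidence_rows(docs):
    """Return (term, [0/1 per document]) pairs, sorted by term.

    Hypothetical helper: terms are matched as whole whitespace-separated tokens.
    """
    tokenized = [set(doc.split()) for doc in docs]
    vocabulary = sorted(set().union(*tokenized))
    return [(term, [int(term in tokens) for tokens in tokenized])
            for term in vocabulary]


rows = incidence_rows(docs)

# Header row with one column per document, then one row per term
print(f"{'term':<14}" + " ".join(f"Doc{i}" for i in range(1, len(docs) + 1)))
for term, row in rows:
    print(f"{term:<14}" + " ".join(f"{bit:>4}" for bit in row))
```

Sorting the vocabulary only changes the presentation; the 0/1 entries are the same as in the dictionary.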

- **The query to find all documents containing "schizophrenia" AND "drug"**

In [6]:
docs_array = np.array(docs, dtype='object')

v1 = np.array(term_doc_matrix['schizophrenia'])    
v2 = np.array(term_doc_matrix['drug'])
print(v1)
print(v2)
print('---------')
q1 = v1 & v2
print(q1)

[1 1 1 1]
[1 1 0 0]
---------
[1 1 0 0]


In [7]:
# Multiplying the 0/1 result by the object array keeps each matching document
# and turns every non-matching one into '', which `if doc` then filters out
[doc for doc in q1 * docs_array if doc]

['breakthrough drug for schizophrenia', 'new schizophrenia drug']

- **The query to find all documents containing "drug" OR "approach"**

In [8]:
v3 = np.array(term_doc_matrix['drug'])    
v4 = np.array(term_doc_matrix['approach'])
print(v3)
print(v4)
print('---------')
q2 = v3 | v4
print(q2)

[1 1 0 0]
[0 0 1 0]
---------
[1 1 1 0]


In [9]:
# As before, non-matching documents become '' and are filtered out by `if doc`
[doc for doc in q2 * docs_array if doc]

['breakthrough drug for schizophrenia',
 'new schizophrenia drug',
 'new approach for treatment of schizophrenia']

### 1.(b) Inverted Index

In [10]:
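# Construct an inverted index as a dictionary that maps each term to the set
# of 0-based document indices containing it (index 0 is Doc 1, and so on)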

inverted_index = {}

for i, doc in enumerate(docs):
    for term in doc.split():
        if term in inverted_index:
            inverted_index[term].add(i)
        else:
            inverted_index[term] = {i}

inverted_index

{'breakthrough': {0},
 'drug': {0, 1},
 'for': {0, 2, 3},
 'schizophrenia': {0, 1, 2, 3},
 'new': {1, 2, 3},
 'approach': {2},
 'treatment': {2},
 'of': {2},
 'hopes': {3},
 'patients': {3}}

Now we can look up the postings list for any term, for example:

In [11]:
posting_list = inverted_index['schizophrenia']
posting_list

{0, 1, 2, 3}
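Python sets carry no guaranteed iteration order, whereas merge-style Boolean operations on postings expect each list to be sorted by document ID. One possible variant, sketched here under the assumption that 0-based IDs are kept, stores every postings list as a sorted list from the start. The name `build_sorted_index` is hypothetical.

```python
from collections import defaultdict

docs = [
    'breakthrough drug for schizophrenia',
    'new schizophrenia drug',
    'new approach for treatment of schizophrenia',
    'new hopes for schizophrenia patients',
]


def build_sorted_index(docs):
    """Map each term to an ascending list of 0-based document IDs.

    Hypothetical helper: documents are visited in increasing ID order and each
    term is appended at most once per document, so every list stays sorted.
    """
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term in set(doc.split()):  # set() avoids duplicate IDs per document
            index[term].append(doc_id)
    return dict(index)


sorted_index = build_sorted_index(docs)
print(sorted_index['schizophrenia'])
print(sorted_index['for'])
```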

In [12]:
# Merge-based Boolean operations on postings lists.
# Both functions assume each input is a list sorted by document ID.

def OR_postings(posting1, posting2):
    p1 = 0
    p2 = 0
    result = list()
    while p1 < len(posting1) and p2 < len(posting2):
        if posting1[p1] == posting2[p2]:
            result.append(posting1[p1])
            p1 += 1
            p2 += 1
        elif posting1[p1] > posting2[p2]:
            result.append(posting2[p2])
            p2 += 1
        else:
            result.append(posting1[p1])
            p1 += 1
    while p1 < len(posting1):
        result.append(posting1[p1])
        p1 += 1
    while p2 < len(posting2):
        result.append(posting2[p2])
        p2 += 1
    return result


def AND_postings(posting1, posting2):
    p1 = 0
    p2 = 0
    result = list()
    while p1 < len(posting1) and p2 < len(posting2):
        if posting1[p1] == posting2[p2]:
            result.append(posting1[p1])
            p1 += 1
            p2 += 1
        elif posting1[p1] > posting2[p2]:
            p2 += 1
        else:
            p1 += 1
    return result
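The two functions above walk both lists in a single pass, so they depend on each list being sorted by document ID. Since the index built earlier stores postings as sets, the same queries can also be answered with Python's built-in set operators, which makes for a quick cross-check. This is a minimal sketch with the index hard-coded; the helper names `and_query` and `or_query` are illustrative and not part of the assignment.

```python
inverted_index = {
    'breakthrough': {0},
    'drug': {0, 1},
    'for': {0, 2, 3},
    'schizophrenia': {0, 1, 2, 3},
    'new': {1, 2, 3},
    'approach': {2},
    'treatment': {2},
    'of': {2},
    'hopes': {3},
    'patients': {3},
}


def and_query(index, term1, term2):
    """Sorted IDs of documents containing both terms (hypothetical helper).

    A term missing from the index is treated as matching no documents.
    """
    return sorted(index.get(term1, set()) & index.get(term2, set()))


def or_query(index, term1, term2):
    """Sorted IDs of documents containing at least one of the terms."""
    return sorted(index.get(term1, set()) | index.get(term2, set()))


print(and_query(inverted_index, 'schizophrenia', 'drug'))  # [0, 1]
print(or_query(inverted_index, 'drug', 'approach'))        # [0, 1, 2]
```

The set version is shorter, while the merge version mirrors how postings lists are intersected when they are stored as sorted sequences.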

- **The query to find all documents containing "schizophrenia" AND "drug"**

In [13]:
# Sets are unordered, so sort the postings before merging them
pl_1 = sorted(inverted_index['schizophrenia'])
pl_2 = sorted(inverted_index['drug'])
AND_postings(pl_1, pl_2)

[0, 1]

In [14]:
# We can now get the matching documents with the result

[docs[i] for i in AND_postings(pl_1, pl_2)]

['breakthrough drug for schizophrenia', 'new schizophrenia drug']

- **The query to find all documents containing "drug" OR "approach"**

In [15]:
# Sets are unordered, so sort the postings before merging them
pl_3 = sorted(inverted_index['drug'])
pl_4 = sorted(inverted_index['approach'])
OR_postings(pl_3, pl_4)

[0, 1, 2]

In [16]:
# We can now get the matching documents with the result

[docs[i] for i in OR_postings(pl_3, pl_4)]

['breakthrough drug for schizophrenia',
 'new schizophrenia drug',
 'new approach for treatment of schizophrenia']