## COMP6714 Project

$\textbf{Note:}$ This project will take quite some time to complete, so we strongly recommend that you start working as early as possible. Read the specification carefully at least two or three times before you start coding.

* $\textbf{Submission deadline for the Project is 20:59:59 on 19th Nov, 2021}$
* $\textbf{LATE PENALTY: 10% on day-1 and 30% on each subsequent day.}$

# Project Specification

## Instructions
1. This notebook contains instructions for $\textbf{COMP6714-Project}$.

* You are required to complete your implementation for part-1 in a file `project.py` provided along with this notebook. Please $\textbf{DO NOT ALTER}$ the name of the file.

* Do not print unnecessary output. We will not consider anything printed to the screen. All results should be returned in the appropriate data structures by the corresponding functions.

* You can submit your implementation for **Project** via `give`.

* For each question, we have provided detailed instructions along with question headings. If you run into problems, you can post your question on Piazza.

* You are allowed to add other functions and/or import modules (you may have to for this project), but you are not allowed to define global variables. **Only functions are allowed** in `project.py`.

* You should not import unnecessary or non-standard modules/libraries. Loading such libraries at test time will lead to errors and hence a mark of 0 for your project. If you are not sure, please ask on Piazza.

### Allowed Libraries:

You are required to write your implementation for the project using `Python 3.6.5`. You are allowed to use any module from the Python [standard library](https://docs.python.org/3.6/library/).

## Part One - Group Varint Encoding

### Input Format:
The function `encode()` should receive **One** argument:<br>
`posting_list` which is a `list` of integers, where each integer represents a document ID (all the document IDs are sorted).

### Output Format:
Your output should be a bytearray, which is the group varint encoding for `posting_list`.
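
To make the byte layout concrete, here is a minimal sketch of how a single group of four document IDs might be encoded. It assumes the same conventions as the `encode()` cell in this notebook: gaps between consecutive IDs are stored, each gap is written least-significant byte first, and the tag byte holds four 2-bit fields (byte length minus one) with the first gap in the highest bits. The helper name `encode_group` is hypothetical and not part of the required interface.

```python
def encode_group(doc_ids, prev=0):
    """Encode exactly four sorted doc IDs as one group (illustrative sketch)."""
    assert len(doc_ids) == 4
    tag = 0
    body = bytearray()
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        # Bytes needed for this gap: at least 1, at most 4.
        n_bytes = max(1, (gap.bit_length() + 7) // 8)
        assert 1 <= n_bytes <= 4, "gap does not fit in 4 bytes"
        # Each 2-bit field stores n_bytes - 1; the first gap ends up in the top bits.
        tag = (tag << 2) | (n_bytes - 1)
        # Assumption: little-endian byte order within a gap.
        body += gap.to_bytes(n_bytes, byteorder='little')
    return bytearray([tag]) + body


group = encode_group([1, 16, 527, 131598])  # gaps are 1, 15, 511, 131071
print(group.hex())  # 06010fff01ffff01
```

The gaps need 1, 1, 2 and 3 bytes, so the tag fields are `00 00 01 10`, which is the byte `0x06`.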

In [25]:
def encode(posting_list):
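    # Work on a copy so the caller's list is not modified, then pad with
    # zeros until the length is a multiple of 4 (one group holds 4 gaps).
    posting_list = list(posting_list)
    append_zero = (4 - len(posting_list) % 4) % 4
    posting_list.extend([0] * append_zero)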
    
    start = 0
    end = 4
    res = ""
    prev = 0
    while(end <= len(posting_list)):
        tag = ""
        ans = ""
        for posting in posting_list[start:end]:
            curr = posting - prev
            prev = posting
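            count = 0  # number of bytes used by this gap
            # A zero entry is treated as padding and encoded as a zero gap.
            if(posting == 0):
                curr = 0
            binStr = '{0:b}'.format(curr)
            # Emit the gap 8 bits at a time, least-significant byte first.
            while(len(binStr) > 8):
                ans += binStr[-8:]
                binStr = binStr[0:-8]
                count += 1
            # Left-pad the remaining (most significant) bits to a full byte.
            if(len(binStr) > 0):
                temp = (8-len(binStr))*'0'
                ans += temp + binStr
                count += 1
            # Each 2-bit tag field stores (byte count - 1).
            tag += '{0:02b}'.format(count-1)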
        ans = tag + ans
        start = end
        end += 4
        res += ans

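    # An empty posting list yields an empty result; int('', 2) would raise ValueError.
    if not res:
        return bytearray()
    # The specification asks for a bytearray, so wrap the bytes object.
    return bytearray(int(res, 2).to_bytes((len(res) + 7) // 8, byteorder='big'))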

### Toy Example for Illustration 

Here, we provide a small toy example for this part: <br>
Let `posting_list` be:

In [2]:
posting_list = [1, 16, 527, 131598]

In [27]:
posting_list = [1, 2, 3, 4, 4294967299]

In [28]:
encoded_list = encode(posting_list)

In [29]:
[bin(code)[2:].zfill(8)for code in encoded_list]

['00000000',
 '00000001',
 '00000001',
 '00000001',
 '00000001',
 '11000000',
 '11111111',
 '11111111',
 '11111111',
 '11111111',
 '00000000',
 '00000000',
 '00000000']

## Part Two - Group Varint Decoding

### Input Format:
The function `decode()` should receive **One** argument:<br>
`encoded_list` is a `bytearray` that holds the encoded binary sequence.

### Output Format:
Your output should be a `list` of integers, where each integer represents a document ID that is decoded from the encoded list.
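
One way to read a tag byte is with shifts and masks rather than string slicing. The sketch below assumes the layout used by this notebook's `encode()`: four 2-bit fields per tag byte (first gap in the highest bits, each field storing the byte length minus one) and gap bytes stored least-significant byte first. The helper names `tag_to_lengths` and `decode_group` are hypothetical, and the sketch deliberately ignores the zero padding of a final, partially filled group.

```python
def tag_to_lengths(tag):
    """Split one tag byte into the byte lengths of its four gaps (sketch)."""
    # Fields are read from the highest two bits down; each stores length - 1.
    return [((tag >> shift) & 0b11) + 1 for shift in (6, 4, 2, 0)]


def decode_group(block, prev=0):
    """Decode one group (tag byte followed by gap bytes) into four doc IDs."""
    pos = 1
    doc_ids = []
    for n_bytes in tag_to_lengths(block[0]):
        # Assumption: little-endian byte order within a gap.
        gap = int.from_bytes(block[pos:pos + n_bytes], byteorder='little')
        prev += gap  # gaps accumulate back into absolute doc IDs
        doc_ids.append(prev)
        pos += n_bytes
    return doc_ids


block = bytearray(b'\x06\x01\x0f\xff\x01\xff\xff\x01')
print(tag_to_lengths(block[0]))  # [1, 1, 2, 3]
print(decode_group(block))       # [1, 16, 527, 131598]
```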

In [30]:
def decode(encoded_list):
    decode_str = ''.join(format(byte, '08b') for byte in encoded_list)
    res = []
    reading_start = 0
    early_stop = False
    prev = 0
    while(reading_start < len(decode_str)):
        tag = decode_str[reading_start:reading_start+8]
        i = 0
        reading_start = reading_start+8
        
        while(i <= 6):
            reading_len = int(tag[i:i+2],2)+1
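            gap_b = decode_str[reading_start: reading_start+reading_len*8]
            # Gap bytes are stored least-significant byte first, so reverse
            # the 8-bit chunks before converting to an integer.
            correct_order = "".join(reversed([gap_b[j:j+8] for j in range(0, len(gap_b), 8)]))
            gap_int = int(correct_order,2)
            ans = gap_int + prev
            res.append(ans)
            # A zero gap anywhere but the very first value marks padding:
            # stop decoding and drop the value that was just appended.
            if(ans == prev and reading_start != 8):
                early_stop = True
                break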
            prev = ans
            reading_start = reading_start+reading_len*8
            i += 2
        if(early_stop):
            res = res[0:-1]
            break
    return res

### Toy Example for Illustration 

Here, we provide a small toy example for this part: <br>
Let `encoded_list` be:

In [7]:
# encoded_list = bytearray(b'\x06\x01\x0f\xff\x01\xff\xff\x01')

In [31]:
decoded_list = decode(encoded_list)

In [32]:
decoded_list

[1, 2, 3, 4, 4294967299]

## Part Three - Evaluation

In this part, you need to implement a function that computes the F1 score and MAP from the given information.

### Input Format:
The function `evaluation()` should receive **two** arguments:<br>
`rel_list` is a list of 0s and 1s, where 0 indicates that the corresponding document is irrelevant and 1 indicates that it is relevant.<br>
`total_rel_doc` is an integer giving the total number of documents relevant to the query.

### Output Format:
Your output should be two float numbers, where the first one is the F1 score, and the second one is the MAP.
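
For reference, the standard textbook definitions behind these two numbers can be written as follows, on the assumption that the project uses them unchanged. Here $R$ is the number of retrieved documents (`len(rel_list)`), $r_k$ is the $k$-th entry of `rel_list` (0 or 1), and $N$ is `total_rel_doc`. With a single query, MAP reduces to the average precision (AP) of that query.

$$
P = \frac{1}{R}\sum_{k=1}^{R} r_k, \qquad
\mathrm{Rec} = \frac{1}{N}\sum_{k=1}^{R} r_k, \qquad
F_1 = \frac{2 \cdot P \cdot \mathrm{Rec}}{P + \mathrm{Rec}}
$$

$$
\mathrm{AP} = \frac{1}{N}\sum_{k=1}^{R} r_k \cdot P@k, \qquad
P@k = \frac{1}{k}\sum_{j=1}^{k} r_j
$$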

In [10]:
def evaluation(rel_list, total_rel_doc):
    retrieved_docs = len(rel_list)
    rel_retrieved = rel_list.count(1)
    # Assumption: with no relevant document retrieved (or none existing),
    # both scores are taken to be 0 instead of dividing by zero.
    if rel_retrieved == 0 or total_rel_doc == 0:
        return 0.0, 0.0

    recall = rel_retrieved / total_rel_doc
    precision = rel_retrieved / retrieved_docs
    f1 = 2*(recall*precision) / (recall+precision)

    # Average precision: sum the precision at every rank that holds a
    # relevant document, then divide by the total number of relevant documents.
    sum_AP = 0
    count_rel_retrieved = 0
    for rank, indicator in enumerate(rel_list, start=1):
        if indicator == 1:
            count_rel_retrieved += 1
            sum_AP += count_rel_retrieved / rank
    mAP = sum_AP / total_rel_doc

    return f1, mAP


### Toy Example for Illustration 

Here, we provide a small toy example for this part: <br>
Let `rel_list` and `total_rel_doc` be:

In [11]:
rel_list = [1,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1]
total_rel_doc = 8

In [12]:
F1_score, MAP = evaluation(rel_list, total_rel_doc)

In [13]:
F1_score

0.4285714285714285

In [14]:
MAP

0.4162878787878788
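
The two values above can be checked by hand. The following sketch redoes the arithmetic for the toy example with exact fractions, assuming the usual definitions of precision, recall, F1 and average precision; the variable names are illustrative only.

```python
from fractions import Fraction

rel_list = [1,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1]
total_rel_doc = 8

# Ranks (1-based) at which a relevant document appears.
relevant_ranks = [k for k, r in enumerate(rel_list, start=1) if r == 1]

# Precision at each of those ranks: (relevant documents seen so far) / rank.
precisions = [Fraction(i, k) for i, k in enumerate(relevant_ranks, start=1)]

precision = Fraction(len(relevant_ranks), len(rel_list))  # 6/20
recall = Fraction(len(relevant_ranks), total_rel_doc)     # 6/8
f1 = 2 * precision * recall / (precision + recall)
average_precision = sum(precisions) / total_rel_doc

print(relevant_ranks)     # [1, 2, 9, 11, 15, 20]
print(f1)                 # 3/7
print(average_precision)  # 1099/2640
```

As decimals, 3/7 and 1099/2640 agree with the F1 score and MAP printed above.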

## Project Submission and Feedback

For project submission, you are required to submit a Python file named `project.py` via `give`.

You can submit the file by `give cs6714 proj1 project.py`. The file size is limited to 1MB.