# ɣ code

This week's exercise is on ɣ-codes (book paragraph 5.3.2).

You'll implement encoding and decoding of ɣ-codes. The first part covers encoding and decoding a single number. The second part requires you to write a function that decodes a (non-delimited) sequence of codes.

We provide the following function, which returns the index of the most significant bit (MSB) in the binary representation of `num`, counting from the right and starting at 0:

In [1]:
import sys
sys.path.append("../../")

In [2]:
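# Set this to True to get some debug output
debug = False

def msb_index(num):
    # Index of the highest set bit, counted from the right starting at 0.
    # int.bit_length() works on exact integers, so there is no
    # floating-point rounding near large powers of two.
    if num == 0:
        return 0
    return num.bit_length() - 1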

print(msb_index(1))
print(msb_index(2))
print(msb_index(3))
print(msb_index(16))
print(msb_index(17))

0
1
1
4
4


Implement the function `gamma_encode` that encodes `number` as a ɣ code:

In [3]:
def gamma_encode(number):
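    if number <= 0:
        raise ValueError("Cannot gamma-encode number < 1")
    # Handle 1 specially: its code is the single bit 0
    if number == 1:
        return 0

    m = msb_index(number)
    # Unary length part: m ones followed by a terminating zero
    length = ((1 << m) - 1) << 1
    # Offset: number with its leading 1 bit removed (m bits wide)
    offset = number ^ (1 << m)
    # Concatenate length and offset
    gcode = (length << m) | offset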
    if debug:
        print("number = {0}, length = 0b{1:b}, offset = {2}, gcode = {3:b}" \
                .format(number, length, offset, gcode))
    return gcode
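
To make the bit layout concrete, here is a minimal sketch that builds the same code as a string of `'0'`/`'1'` characters: the offset is the binary representation without its leading 1, and the length part is the offset's length in unary. The name `gamma_encode_bits` is a hypothetical helper for illustration, not part of the exercise.

```python
def gamma_encode_bits(number):
    """Return the gamma code of `number` as a string of '0'/'1' characters."""
    if number < 1:
        raise ValueError("gamma codes are only defined for integers >= 1")
    # Offset: binary representation with the leading 1 removed.
    # bin(13) == '0b1101', so slicing from index 3 leaves '101'.
    offset = bin(number)[3:]
    # Length part: len(offset) in unary, i.e. that many 1s closed by a 0.
    length = "1" * len(offset) + "0"
    return length + offset

for n in (1, 2, 3, 4, 9, 13):
    print(n, gamma_encode_bits(n))
```

Unlike the integer form, the string keeps every bit explicit, so the code for 1 is visibly the one-character string `"0"`.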

Here are some tests.

In [4]:
# From book, table 5.5
# (number, gamma code, length)
values = \
[
    (1, 0b0, 1),
    (2, 0b100, 3),
    (3, 0b101, 3),
    (4, 0b11000, 5),
    (9, 0b1110001, 7),
    (13, 0b1110101, 7),
    (24, 0b111101000, 9),
    (511 , 0b11111111011111111, 17),
    (1025, 0b111111111100000000001, 21)
]

for n, g, b in values:
    assert(gamma_encode(n) == g)
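
The third column of the table (the code length) is not checked by the loop above. As a rough sketch, the length can be computed without building the code at all: a number whose MSB index is m has an m-bit offset and an (m+1)-bit unary length part. The helper name `gamma_code_length` is an assumption made for this example.

```python
def gamma_code_length(number):
    """Number of bits in the gamma code of `number` (assumes number >= 1)."""
    offset_bits = number.bit_length() - 1   # bits after the leading 1
    return 2 * offset_bits + 1              # unary length part + offset

# (number, expected length) pairs taken from the table above
for n, bits in [(1, 1), (2, 3), (4, 5), (13, 7), (24, 9), (511, 17), (1025, 21)]:
    assert gamma_code_length(n) == bits
```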

Now implement the function `gamma_decode` that decodes a ɣ-encoded number. The function should accept a sequence of ɣ-codes, decode the first one and return `(decoded number, number of bits decoded)`.
For example `gamma_decode(0b1110101)` should return `(13, 7)` because the number 13 is encoded with 7 bits in ɣ-code.

In [5]:
def gamma_decode(gcode):
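    if gcode < 0:
        raise ValueError("gcode must be >= 0")
    # Handle gamma code 0 specially: it decodes to 1 and uses one bit
    if gcode == 0:
        return (1, 1)

    # Count the leading 1 bits (unary length part), stopping at the first 0
    length = 0
    m = msb_index(gcode)
    for k in range(m, -1, -1):
        bdigit = (gcode >> k) & 0x1
        if bdigit == 0:
            break
        length += 1
    bitsconsumed = 2*length + 1
    # Drop any bits that follow the first code, then mask out the offset
    gcode = gcode >> (m-bitsconsumed+1)
    offset = gcode & ((1 << length) - 1)
    # Restore the implicit leading 1 in front of the offset
    number = offset | (1 << length)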
    if debug:
        print("gcode = 0b{0:b}, length = {1}, offset = {2}, num = {3}" \
                .format(gcode, length, offset, number))
    return (number, bitsconsumed)

Tests:

In [6]:
for n, g, b in values:
    assert(gamma_decode(g) == (n, b))

To make the last part of the exercise easier, `gamma_decode` should support decoding the prefix of a bit string, i.e. the most significant bits of a bit string, as a ɣ-code:

In [7]:
assert(gamma_decode(0b100101110001111010000) == (2, 3))

In this test, the first ɣ-code in the bitstring should be decoded (2) and the length of its encoding returned (3 bits).
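
The same prefix decoding can be sketched on a string of bits, which may be easier to follow than the shifts and masks. This is only an illustration under the assumption that the input is a `'0'`/`'1'` string; `gamma_decode_bits` is a hypothetical name.

```python
def gamma_decode_bits(bits):
    """Decode the first gamma code in a '0'/'1' string.

    Returns (decoded number, number of bits consumed).
    """
    # The unary length part ends at the first 0, so its position equals
    # the number of leading 1s. str.index raises ValueError if no 0 exists.
    length = bits.index("0")
    consumed = 2 * length + 1
    if consumed > len(bits):
        raise ValueError("truncated gamma code")
    offset = bits[length + 1:consumed]
    # Put the implicit leading 1 back in front of the offset.
    return int("1" + offset, 2), consumed

print(gamma_decode_bits("100101110001111010000"))  # (2, 3)
```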

We can also verify that numbers correctly round-trip through a pair of encode and decode operations:

In [8]:
for x in range(1, 2049):
    assert(gamma_decode(gamma_encode(x))[0] == x)

Now that we have a function to decode a single ɣ-code, we can decode a bitstring containing a sequence of codes.

Implement the following function that accepts a bitstring with a sequence of codes and its length in bits; it should return a list of the decoded numbers.

In [9]:
def gamma_decode_stream(gcodes, slen):
    # Note: this method currently does not handle trailing garbage. By design
    # there can't be leading garbage, as any bitstring starts with a valid
    # gamma code.
    if debug:
        print("####")
        print("decode stream: 0b{0:0{l}b}, slen: {1}".format(gcodes, slen, l=slen))
    numbers = []
    rlen = slen
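    # A stream of only 0 bits encodes the number 1 once per bit
    if gcodes == 0 and rlen > 0:
        return [ 1 ] * (rlen)

    # Leading 0 bits are not visible in the integer itself: each one is the
    # code 0 (number 1), detected by comparing the MSB index with rlen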
    while rlen > 0 and msb_index(gcodes) < rlen-1:
        if debug:
            print("> decoded 0b0->1")
        numbers.append(1)
        rlen -= 1

    while rlen > 0:
        n, b = gamma_decode(gcodes)
        m = msb_index(gcodes)
        mask = (1 << (m-b + 1)) - 1
        gc = gcodes >> (m-b+1)
        if debug:
            print("> decoded 0b{0:b}->{1}, mask 0b{2:b}, rlen {3}, remainder 0b{4:0{l}b}" \
                    .format(gc, n, mask, rlen-b, gcodes & mask, l=rlen-b))
        gcodes = gcodes & mask
        rlen -= b
        numbers.append(n)
        # Same check for 0 codes (number 1) that follow the code just decoded
        while rlen > 0 and msb_index(gcodes) < rlen-1:
            if debug:
                print("> decoded 0b0->1")
            numbers.append(1)
            rlen -= 1
            if debug:
                print("> rlen {}, msb_index({:b}) == {}".format(rlen, gcodes, msb_index(gcodes)))

    if debug:
        print("Decoded:", numbers)
        print("####")
    return numbers
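
Most of the bookkeeping above exists because an integer cannot represent leading 0 bits, which is why the length has to be passed separately. As a point of comparison, here is a hedged sketch of the same loop over a `'0'`/`'1'` string, where every bit is explicit. `gamma_decode_stream_bits` is a hypothetical name, and the string input is an assumption of this example rather than the interface the exercise asks for.

```python
def gamma_decode_stream_bits(bits):
    """Decode a '0'/'1' string holding back-to-back gamma codes."""
    numbers = []
    pos = 0
    while pos < len(bits):
        # Unary part: count the 1s up to the terminating 0.
        # str.index raises ValueError if the remaining bits are all 1s.
        length = bits.index("0", pos) - pos
        end = pos + 2 * length + 1
        if end > len(bits):
            raise ValueError("trailing bits do not form a complete gamma code")
        # Offset bits follow the terminating 0; prepend the implicit 1.
        numbers.append(int("1" + bits[pos + length + 1:end], 2))
        pos = end
    return numbers

print(gamma_decode_stream_bits("100101110001111010000"))  # [2, 3, 4, 24, 1]
print(gamma_decode_stream_bits("0000101"))                # [1, 1, 1, 1, 3]
```

In this form, trailing garbage surfaces as a `ValueError` instead of being silently decoded.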

Tests for decoding a bitstring:

In [10]:
assert(gamma_decode_stream(0b100101110001111010000, 21) == [ 2, 3, 4, 24, 1 ])
assert(gamma_decode_stream(0b11111111011111111000101, 23) == [ 511, 1, 1, 1, 3 ])
assert(gamma_decode_stream(0b0000, 4) == [ 1, 1, 1, 1 ])
assert(gamma_decode_stream(0b0000101, 7) == [ 1, 1, 1, 1, 3 ])
assert(gamma_decode_stream(0b111101000001011001110001, 24) == [ 24, 1, 1, 3, 2, 9 ])

In [11]:
# Construct a posting list from gamma-decoded gaps.
# The first element of the decoded list is a document ID; the rest are gaps.
# Each document ID is the sum of the first ID and all gaps up to that position.
def construct_posting_list(gamma_decoded_gaps_list):
    posting_list = [sum(gamma_decoded_gaps_list[:i+1]) for i in range(len(gamma_decoded_gaps_list))]
    return posting_list

In [12]:
assert(construct_posting_list(gamma_decode_stream(0b11111111011111111000101, 23)) == [511, 512, 513, 514, 517])
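
Summing a growing prefix for every position does quadratic work. One possible alternative, sketched here with the standard library's `itertools.accumulate`, computes the same running sum in a single pass. The function name `construct_posting_list_linear` is made up for this example.

```python
from itertools import accumulate

def construct_posting_list_linear(gaps):
    """Running sum of the gaps; the first entry is the first document ID."""
    return list(accumulate(gaps))

print(construct_posting_list_linear([511, 1, 1, 1, 3]))
# [511, 512, 513, 514, 517]
```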

## Solutions to Moodle Programming Assignment Based Questions

### Exercise 5 - Q7
Match the γ codes for gap list <777, 17743, 294068, 31251336>

In [13]:
hw_list = [555, 16789, 17, 78854432, 234190]
for num in hw_list:
    print(num, bin(gamma_encode(num)))

555 0b1111111110000101011
16789 0b11111111111111000000110010101
17 0b111100001
78854432 0b11111111111111111111111111000101100110011100100100000
234190 0b11111111111111111011001001011001110


### Exercise 5 - Q8

The following questions are based on the programming assignment.

Consider the following posting list:

[1044, 1765, 2117, 2814, 29273, 31817, 32584, 34936, 38435, 40050, 41777, 45017, 56469, 58884, 67206, 69481, 75047, 87590, 92877, 98267]

Which of the following numbers is NOT part of the gap list corresponding to the posting list?

* 1044
* 3499
* 3240
* 3975

What is the number of bits in the γ code stream for the gap list corresponding to the posting list?

* 460
* 514
* 454
* 476

In [14]:
# Construct the gap list from a posting list:
# gap_list[0] = posting_list[0] (docID of the first posting)
# gap_list[i] = posting_list[i] - posting_list[i-1] (for i != 0)
def construct_gap_list(posting_list):
    gap_list = [posting_list[0]]
    for i in range(1, len(posting_list)):
        gap_list.append(posting_list[i] - posting_list[i-1])
    return gap_list

# Generate the encoded bit string by concatenating the binary
# representation of every gamma code in gap_encoding.
# Returns [bit string prefixed with '0b', number of bits].
def generate_encoded_string(gap_encoding):
    # format(x, "b") yields the binary digits without the "0b" prefix.
    # No leading zeros are lost: every gamma code except 0 (the code of 1)
    # starts with a 1 bit.
    encoded_str = "".join(format(x, "b") for x in gap_encoding)
    length_enc_string = len(encoded_str)
    return ['0b' + encoded_str, length_enc_string]

In [15]:
#Given posting list
posting_list = [1044, 1765, 2117, 2814, 29273, 31817, 32584, 34936, 38435, 40050, 41777, 45017, 56469, 58884, 67206, 69481, 75047, 87590, 92877, 98267]

#Get gap list from posting list
gap_list = construct_gap_list(posting_list)
print("Gap list:", gap_list)

#Encode the numbers in the gap list using gamma encoding
gap_encoding = [gamma_encode(gap) for gap in gap_list]

encoded_stream,length_enc_stream = generate_encoded_string(gap_encoding)
#print("Encoded Stream:", encoded_stream)
print("Length of encoded stream:", length_enc_stream)

Gap list: [1044, 721, 352, 697, 26459, 2544, 767, 2352, 3499, 1615, 1727, 3240, 11452, 2415, 8322, 2275, 5566, 12543, 5287, 5390]
Length of encoded stream: 460


#### Answers:
3975 does not appear in the gap list.

The encoded stream is 460 bits long.
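
Both answers can be cross-checked without building the encoded string. The sketch below recomputes the gaps and uses the fact that the γ code of a gap g takes 2·⌊log₂ g⌋ + 1 bits. The variable names are illustrative only.

```python
posting_list = [1044, 1765, 2117, 2814, 29273, 31817, 32584, 34936, 38435, 40050,
                41777, 45017, 56469, 58884, 67206, 69481, 75047, 87590, 92877, 98267]

# The first entry is the docID itself, every later entry is a difference.
gaps = [posting_list[0]] + [b - a for a, b in zip(posting_list, posting_list[1:])]

# Answer options from the first question.
options = [1044, 3499, 3240, 3975]
not_in_gaps = [x for x in options if x not in gaps]

# A gamma code for g uses 2 * floor(log2(g)) + 1 bits.
total_bits = sum(2 * (g.bit_length() - 1) + 1 for g in gaps)

print(not_in_gaps)  # [3975]
print(total_bits)   # 460
```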

### Exercise 5 - Q9
For the following γ code stream 

0b11111101110111111111111111111111010010111011111110101111111111111111110111001111001111001111111110110001001111111111110101011110010

(The stream is 131 bits long.)

Which of the following numbers is NOT part of the decoded list?

In [16]:
# Running gamma decode stream on the encoded stream we get
print(gamma_decode_stream(0b11111101110111111111111111111111010010111011111110101111111111111111110111001111001111001111111110110001001111111111110101011110010,131))

[123, 834554, 499321, 452, 6898]


#### Answers:
499320 is not in the decoded list
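
As a sanity check, the code lengths of the decoded numbers should add up to the stated stream length, which shows that no bits were left undecoded. A small sketch, with `gamma_code_bits` as a hypothetical helper:

```python
decoded = [123, 834554, 499321, 452, 6898]  # output of the cell above

def gamma_code_bits(number):
    # 2 * floor(log2(number)) + 1 bits, for number >= 1
    return 2 * (number.bit_length() - 1) + 1

total_bits = sum(gamma_code_bits(n) for n in decoded)
print(total_bits)         # 131
print(499320 in decoded)  # False
```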

### Exercise 5 - Q10
The following γ coded stream encodes the numbers in the gap list of a term 

0b1111111100101110011111111110001100111011111111110110010001111111111111111111110000000010001000011111111111111111111011101100110011001

(Length of stream is 131)

Which of the following numbers is NOT part of the posting list for the term?

* 529776
* 1578
* 782189
* 3405

In [17]:
# Getting the posting list from the gap list
# 1. decode the stream using gamma_decode_stream() to get the gap list
# 2. construct the posting list from the gap list using construct_posting_list()
construct_posting_list(gamma_decode_stream(0b1111111100101110011111111110001100111011111111110110010001111111111111111111110000000010001000011111111111111111111011101100110011001,131))

[348, 1578, 3405, 529876, 782189]

#### Answer:

529776 is not part of the posting list
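
The comparison against the answer options can also be done programmatically. A minimal sketch, assuming the posting list printed above and the four options from the question:

```python
posting_list = [348, 1578, 3405, 529876, 782189]  # output of the cell above
options = [529776, 1578, 782189, 3405]

# Keep the options that do not occur in the posting list.
posting_set = set(posting_list)
missing = [x for x in options if x not in posting_set]
print(missing)  # [529776]
```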