# 11.9 Takeaway - Find Missing IP in Huge File


This problem was very tricky; even just reading the answer was confusing. Still, this is a really good problem because it forces us to search under hardware constraints.

The question is: with just a few MB of RAM, but unlimited disk space, how can we find an IP that is missing from a huge file?

DISCLAIMER: The EPI Judge test cases use only numbers from [1-999] rather than actual IPs, which made testing very confusing. Their test cases really just simulate finding the first missing element.

The actual problem asks us to look for a missing 32-bit IP address in a file with billions of IPs, potentially every possible IP except one. For simplicity's sake, we can just return the first IP that doesn't appear.

## All Available IPs
Example IP:

`255.255.255.255`

OR in Binary Representation

`11111111.11111111.11111111.11111111`
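The dotted and binary forms above are two views of the same 32-bit integer, which is why the rest of these notes treat IPs as plain numbers. A minimal sketch using Python's standard `ipaddress` module (the variable names are just for illustration):

```python
import ipaddress

# Parse the dotted form and view it as a single 32-bit integer
ip = ipaddress.IPv4Address("255.255.255.255")
as_int = int(ip)
print(as_int)                   # 4294967295, i.e. 2^32 - 1
print(format(as_int, "032b"))   # 32 ones

# Going the other way: an integer back to dotted form
back_to_dotted = str(ipaddress.IPv4Address(2))
print(back_to_dotted)           # 0.0.0.2
```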


This shows there are 32 bits total, which means there are 2^32 possibilities. Tracking each one with a single bit would take 2^32 bits / 8 = 2^29 bytes, or about 0.5 GB. This is far more RAM than we have available.

## Using Our Available RAM

Instead of looking at the whole IP, let's first look at just the upper half of the IP. This way we only have to worry about 2^16 possibilities.

A bit array with 2^16 entries takes 2^16 / 8 bytes = 8 KB, and even 2^16 32-bit counters take only 256 KB, which is < 1 MB. We have plenty of RAM to handle this case.
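To double-check the arithmetic, here is a quick sketch that computes the size of a one-bit-per-entry bitmap for both cases. The variable names are made up for this example, and it assumes an ideal bitmap rather than a Python list, which has extra per-element overhead.

```python
BITS_PER_BYTE = 8

# One bit for every possible 32-bit IP
full_bitmap_bytes = (1 << 32) // BITS_PER_BYTE
# One bit for every possible 16-bit half
half_bitmap_bytes = (1 << 16) // BITS_PER_BYTE

print(full_bitmap_bytes / 2**20)  # 512.0 (MiB)
print(half_bitmap_bytes / 2**10)  # 8.0 (KiB)
```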

## Implementation

### 16 MSB (Most Significant bits)
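First we want to create "buckets": a `counter` array with 2^16 entries, one for each possible value of the 16 most significant bits. Each entry counts how many IPs in the file start with that prefix. A bucket can hold at most 2^16 distinct IPs, so (assuming the file has no duplicates) we are looking for the first bucket whose count is less than that capacity.

The way we will traverse the file is by using an iterator in Python to stream the file, one IP at a time.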
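For a real file, the iterator could be a generator that reads one line at a time. This is only a sketch: the helper name `stream_ips` is hypothetical, and it assumes the file stores one IP per line as a plain unsigned integer, which the original problem does not specify. A small temporary file stands in for the huge one.

```python
import os
import tempfile


def stream_ips(path):
    """Yield one integer IP per line without loading the whole file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield int(line)


# Demo: write the same five IPs used in these notes to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0\n1\n222200\n4294967295\n112\n")
    demo_path = tmp.name

demo_ips = list(stream_ips(demo_path))
os.remove(demo_path)
print(demo_ips)  # [0, 1, 222200, 4294967295, 112]
```

Because the generator holds only the current line, calling `stream_ips(path)` again gives a fresh pass over the file at no extra memory cost.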

In [None]:

import itertools

# Each of these integers represents a different IP
# Ex: 0 is 0.0.0.0, 1 is 0.0.0.1, 4294967295 is 255.255.255.255
stream = iter([0, 1, 222200, 4294967295, 112])

num_bucket = 1 << 16
counter = [0] * num_bucket
# Keep a second copy of the stream for the later pass over the 16 LSB
stream, stream_copy = itertools.tee(stream)
# Right shifting the IP by 16 bits leaves its 16 most significant bits,
# which is the index of the bucket the IP belongs to
for x in stream:
    upper_part_x = x >> 16
    counter[upper_part_x] += 1

print(counter)

Since our test case is so small, every bucket is under capacity, so the first candidate bucket is simply the first entry in the `counter` array.

In [21]:
# Return the first bucket index that holds fewer IPs than its capacity
bucket_capacity = 1 << 16
candidate_bucket = next(i for i, c in enumerate(counter) if c < bucket_capacity)
print(candidate_bucket)

0


### 16 LSB (Least Significant bits)

When we find a "bucket" that contains fewer entries than its potential capacity, we can deduce that at least one IP with that 16-bit prefix MUST be missing, so all that's left is to find its 16 least significant bits.

We can drill down to find the first missing element in that bucket. Let's also reset the stream so we can iterate through our test case/list of IPs again.

In [None]:
candidates = [0] * bucket_capacity
stream = stream_copy

# ((1 << 16) - 1) is 1111111111111111 in binary; & that with x to get the 16 LSB value
low_mask = (1 << 16) - 1
for x in stream:
    # Only IPs that fall in the candidate bucket tell us anything about it
    if x >> 16 == candidate_bucket:
        lower_part_x = low_mask & x
        candidates[lower_part_x] = 1

print(candidates)

Now that we've marked every element in that bucket that appears in the stream, one at a time and using only 2^16 entries of memory, we can go through the candidates and select the first one that's still 0 as our missing IP.

We just need to shift our candidate bucket index left by 16 bits to restore the IP's `16 MSB`, then do an OR operation with the candidate index, which holds the IP's `16 LSB`. The OR combines the two halves into the value we're looking for.

In [27]:
for i in range(len(candidates)):
    if candidates[i] == 0:
        print((candidate_bucket << 16) | i)
        break


2


This is correct! `2` or `0.0.0.2` is the first IP that we don't have in our test case. PASS!
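Putting both passes together, here is a sketch of the whole approach as one function. The name `find_missing_ip` and the `make_stream` parameter are assumptions for this example: `make_stream` is a zero-argument callable that returns a fresh iterator each time. That is used instead of `itertools.tee` because `tee` buffers everything one copy has consumed until the other copy catches up, which on a real file would pull the whole stream into memory. Like the steps above, it assumes the input has no duplicate IPs.

```python
def find_missing_ip(make_stream):
    """Return one 32-bit value absent from the stream, or None if none is found."""
    num_buckets = 1 << 16
    bucket_capacity = 1 << 16

    # Pass 1: count how many IPs share each 16-bit prefix
    counter = [0] * num_buckets
    for x in make_stream():
        counter[x >> 16] += 1

    # First bucket that holds fewer IPs than it could
    candidate_bucket = next(
        (i for i, c in enumerate(counter) if c < bucket_capacity), None
    )
    if candidate_bucket is None:
        return None

    # Pass 2: mark which 16-bit suffixes appear inside that bucket only
    seen = bytearray(bucket_capacity)
    low_mask = bucket_capacity - 1
    for x in make_stream():
        if x >> 16 == candidate_bucket:
            seen[x & low_mask] = 1

    # The first unmarked suffix, recombined with the prefix, is missing
    missing_low = seen.index(0)
    return (candidate_bucket << 16) | missing_low


ips = [0, 1, 222200, 4294967295, 112]
result = find_missing_ip(lambda: iter(ips))
print(result)  # 2
```

With a real file, `make_stream` would reopen the file on each call, so memory stays bounded by the two 2^16-entry arrays.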