In [1]:
import os

# Instruction

The **objective** is to take the path of the PDF file to analyze as an argument and print the byte offset of "%%EOF" in decimal on standard output (-1 if there is no "%%EOF").

**Run**:

To run the code, open it in Jupyter Notebook or run it from the command line with "ipython netheos_amir.ipynb" (this requires **ipython** to be installed).

**Solution**:
* 1) Instead of reading the whole file, I read backwards from the end of the file, one byte at a time, until I reach "\n" or "\r", which marks the start of the last line. This is efficient, especially for large files, because the work does not depend on the file size.
* 2) The *max_reading_bytes* variable sets the maximum number of bytes read backwards from the end of the file. The file is opened in binary mode, so each read returns exactly 1 byte regardless of how characters are encoded. If more than 30 bytes are read without reaching a line break, the last line is assumed not to contain "%%EOF" and the search stops. This is checked with the variable *sum_of_read_bytes*.
* 3) Finally, I read the line at that position (the last line) and check whether it contains "%%EOF".
* 4) From the size of the file and the number of bytes read from the end, the offset of the start of the last line follows directly (using *size_of_file* and *sum_of_read_bytes*).
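
As a point of comparison, the same idea (look only at the end of the file) can be sketched with a single read of the file's tail followed by `bytes.rfind`. This is an alternative sketch, not the method used in this notebook: the function name `find_eof_offset` and the `tail_size` default of 1024 bytes are assumptions made for illustration. It also behaves differently, since it accepts "%%EOF" anywhere in the tail rather than only on the last line, which means files ending in "\r\n" are handled too.

```python
import os


def find_eof_offset(path, tail_size=1024):
    """Return the byte offset of the last b"%%EOF" found in the file's tail, or -1.

    tail_size is a hypothetical limit chosen for this sketch.
    """
    size = os.path.getsize(path)
    start = max(0, size - tail_size)  # never seek before the start of the file
    with open(path, "rb") as f:
        f.seek(start)
        tail = f.read()  # reads at most tail_size bytes
    index = tail.rfind(b"%%EOF")  # position inside the tail, -1 if absent
    return -1 if index == -1 else start + index
```

Because the position returned by `rfind` is relative to the tail, adding `start` converts it back to an offset from the beginning of the file.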

In [2]:
file_path = input("File Path: ")

File Path: C:\Users\amirz\PycharmProjects\Netheos\Data\test_1.pdf


In [3]:
# size of the file (we use it to find the offset)
size_of_file = os.path.getsize(file_path)

# maximum number of bytes to be read
max_reading_bytes = 30


In [4]:
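def find_offset(path):
    sum_of_read_bytes = 0  # defined before the try so it exists even if the first seek fails
    reached_start = False  # becomes True when the scan runs into the start of the file

    with open(path, 'rb') as f:
        try:  # catch OSError in case of a one line file
            f.seek(-2, os.SEEK_END)

            read_char = f.read(1)
            while read_char != b'\n' and read_char != b'\r' and sum_of_read_bytes <= max_reading_bytes:
                f.seek(-2, os.SEEK_CUR)
                read_char = f.read(1)
                sum_of_read_bytes += 1  # binary mode: each read is exactly 1 byte

        except OSError:
            f.seek(0)
            reached_start = True

        # errors="replace" avoids a UnicodeDecodeError if the line holds binary data
        last_line = f.readline().decode(errors="replace")

    if sum_of_read_bytes >= max_reading_bytes or "%%EOF" not in last_line:
        offset = -1
    elif reached_start:
        offset = 0  # the only line of the file starts at byte 0
    else:
        offset = size_of_file - (sum_of_read_bytes + 1)

    return offset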

In [5]:
print("offset: ", find_offset(file_path))

offset:  4568
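
A quick way to sanity-check a reported offset is to seek to it and compare the bytes found there with the marker. The helper below is a minimal sketch; the name `marker_at` and its `marker` parameter are assumptions introduced here and are not part of the notebook above.

```python
def marker_at(path, offset, marker=b"%%EOF"):
    """Return True if the bytes starting at `offset` in the file equal `marker`."""
    if offset < 0:  # -1 means "not found", so there is nothing to check
        return False
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(len(marker)) == marker
```

Calling `marker_at(file_path, find_offset(file_path))` would then be expected to return `True` whenever the offset printed above points at the start of "%%EOF".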
