# Inverted File Test

This notebook benchmarks an InvertedFile. Below, we evaluate the efficiency of our system by measuring:
- the space taken on disc by an inverted file of arbitrary size
- the time taken to write an inverted file of arbitrary size to disc
- the time taken to merge two inverted files of arbitrary size on disc
  
Constants:  
  
LATIMES_PATH: string, the path to the XML files to read  

In [None]:
from pyscripts.inverted_file import InvertedFile
from pyscripts.formatted_document import FormattedDocument
import glob
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError
import matplotlib.pyplot as plt
import numpy as np

LATIMES_PATH = '../latimes'

## Define utility functions

In [None]:
def read_files(paths, n=-1):
    """
    Read up to n files from an iterable of paths and parse each one into an XML tree.
    A root node <RAC> is wrapped around the content of every file to avoid some ParseError.
    parameters :
        - paths : iterable of strings, the paths of the XML files to read
        - n : maximum number of files to parse; if -1, every readable file is parsed
    return :
        - a list of XML trees, one per successfully parsed file, of length at most n
          (or at most the number of paths if n == -1)
    """
    output = []
    for path in paths:
        try:
            # The context manager closes the file even if parsing fails
            with open(path, 'r') as f:
                txt = f.read()
            output.append(ET.fromstring('<RAC>' + txt + '</RAC>'))
            n -= 1
            print('Successfully parsed document <{}>'.format(path))
        except ParseError:
            print('Can\'t parse document <{}>. Doesn\'t matter, skip'.format(path))
        except IsADirectoryError:
            print('Can\'t parse directory <{}>. Doesn\'t matter, skip'.format(path))
        if n == 0:
            return output
    return output

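def score(token, document):
    """
    Basic score function to make the inverted files work: it counts the occurrences
    of token in the text and the title of document.
    It has no particular computational interest.
    """
    # Build a new list instead of appending to document['text'], so that
    # repeated calls do not keep adding the title to the document itself
    paragraph_tokens = list(document['text']) + [document['title']]
    token_count = 0
    for paragraph in paragraph_tokens:
        for word in paragraph:
            if word == token:
                token_count += 1
    return token_count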
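
The docstring of `read_files` says that a `<RAC>` root node is added to avoid some `ParseError`. A likely reason, which is an assumption here because the content of the LA Times files is not shown, is that a file holds several top-level elements, which is not well-formed XML on its own. The sketch below illustrates this with a hypothetical two-document sample (the `<DOC>` and `<DOCNO>` tag names are made up for the illustration):

```python
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError

# Hypothetical sample with two top-level elements (tag names are assumptions)
sample = '<DOC><DOCNO>1</DOCNO></DOC>\n<DOC><DOCNO>2</DOCNO></DOC>'


def parses(text):
    """Return True if text is well-formed XML, False if it raises ParseError."""
    try:
        ET.fromstring(text)
        return True
    except ParseError:
        return False


# Without a single root element the parser rejects the second <DOC>
raw_ok = parses(sample)

# Wrapping the content in one root node makes it well-formed
root = ET.fromstring('<RAC>' + sample + '</RAC>')
doc_numbers = [doc.findtext('DOCNO') for doc in root.findall('DOC')]

print(raw_ok, root.tag, doc_numbers)
```

The wrapped tree exposes every original top-level element as a direct child of `<RAC>`, so later code can iterate over them.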

## Define Benchmarking functions

In [None]:
def space_on_disc(number_of_document):
    """
    Estimate how the space taken on disc evolves with the number of
    documents put in an inverted file.
    The estimate only covers small datasets, for which the whole inverted file fits in memory.
    :param number_of_document : integer, how many documents should be saved on disc (-1 for all of them)
    return : integer, the size (in bytes) the file would have if it were saved
    """
    inverted_file = InvertedFile(score)
    files = glob.iglob(LATIMES_PATH + '/*')
    xml_files = read_files(files, number_of_document)
    for f in xml_files:
        document = FormattedDocument(f)
        for article in document.matches:
            inverted_file.add_document(article)
            
    return len(inverted_file.get_object_as_array())
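
`space_on_disc` never writes anything: it takes the length of the in-memory array returned by `get_object_as_array()` as the size the file would have. The same idea can be sketched on a toy index. This is only an illustration under stated assumptions: `build_toy_index` and `serialized_size` are hypothetical helpers, and `pickle` is a stand-in, since the actual serialization used by `InvertedFile` is not shown here.

```python
import pickle
from collections import defaultdict


def build_toy_index(documents):
    """Map each token to {doc_id: occurrence count} (toy stand-in for InvertedFile)."""
    index = defaultdict(dict)
    for doc_id, tokens in documents.items():
        for token in tokens:
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)


def serialized_size(obj):
    """Size in bytes of the pickled object, measured without touching the disc."""
    return len(pickle.dumps(obj))


# Hypothetical tokenised documents
docs = {1: ['a', 'b', 'a'], 2: ['b', 'c']}

small = build_toy_index({1: docs[1]})
large = build_toy_index(docs)

print(serialized_size(small), serialized_size(large))
```

Measuring the serialized bytes in memory keeps the benchmark independent of disc speed, at the cost of requiring the whole structure to fit in memory.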
    

## Space on disc benchmarking

In [None]:
space_taken = []
for n in [1, 10, 100, -1]:
    cur_space = space_on_disc(n)
    print(cur_space)
    space_taken.append(cur_space)

## Observation

Between 1 and 10 documents saved, the size per document is approximately halved; beyond that, it remains stable. From there on, the file size grows linearly with the number of documents.
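
One way to check such a linear trend is an ordinary least-squares line fit, together with the size per document relative to the single-document case. The sketch below uses synthetic placeholder values, not measurements from this benchmark: they follow `size = 1000 * n + 1000` exactly, only to show what the check returns. `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic placeholder values, NOT measured sizes
counts = [1, 10, 100]
sizes = [1000 * n + 1000 for n in counts]

slope, intercept = fit_line(counts, sizes)

# Size per document, relative to the single-document case
per_doc = [s / n for s, n in zip(sizes, counts)]
relative = [p / per_doc[0] for p in per_doc]

# Largest deviation from the fitted line; close to 0 means the growth is linear
max_residual = max(abs(s - (slope * n + intercept)) for s, n in zip(sizes, counts))

print(slope, intercept, relative, max_residual)
```

With real measurements, a small residual compared with the sizes themselves would support the linear-growth reading, and the slope would give the marginal cost of one document in bytes.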

In [None]:
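space_taken = np.asarray(space_taken)
print(space_taken)
# 730 stands for the n == -1 run (presumably the number of files read from LATIMES_PATH)
number_of_docs = np.asarray([1, 10, 100, 730])
# Size per document, normalised by the size of the single-document inverted file
unique_doc_space = space_taken / number_of_docs
unique_doc_space /= unique_doc_space[0]
print(unique_doc_space)
plt.plot(number_of_docs, space_taken)
plt.xlabel('Number of documents')
plt.ylabel('Inverted file size (bytes)')
plt.show()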