# Memory Mapped files (mmap files) in Python


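Memory mapping lets a program access the contents of a file as if they were already in memory: the operating system maps the file into the address space of the process and loads its pages on demand. It can speed up file I/O compared with regular read and write calls.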

To understand memory mapping, one needs a basic understanding of the different types of computer memory:

- **Physical memory**: The volatile memory (RAM) available to your programs while they run. This is not "storage": once a program is shut down, its memory is freed.


- **Virtual memory**: A layer of abstraction over physical memory, so that the programmer does not need to worry about accessing concrete parts of the physical memory, and programs can use more memory than is physically available. To do this, operating systems map virtual memory addresses to physical ones using a data structure called a page table.

    - **mmap** uses virtual memory to make it appear that the program has loaded a whole file, even one larger than what fits in physical memory.


- **Shared memory**: A technique provided by the operating system that allows multiple programs (processes) to access the same region of memory, even when they run on different cores.
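
To make the notions above a bit more concrete, here is a minimal sketch that prints the page size reported by the `mmap` module and creates an anonymous mapping (passing `-1` as the file descriptor asks for memory that is not backed by any file). The variable names are only illustrative.

```python
import mmap

# Size of one memory page on this machine (often 4096 bytes, but not guaranteed)
print("page size:", mmap.PAGESIZE)

# fileno=-1 requests an anonymous mapping: plain memory, no file behind it
with mmap.mmap(-1, mmap.PAGESIZE) as anon_map:
    anon_map[0:5] = b"hello"       # write through slice assignment
    first_bytes = anon_map[0:5]    # read the same bytes back
    map_length = len(anon_map)     # the mapping spans exactly one page

print(first_bytes, map_length)
```

The rest of this notebook uses file-backed mappings instead, where the bytes come from (and go back to) a file on disk.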

## 1) Create, read, append a mmap file

A mmap file must be backed by an existing file. Let us first create a regular file whose first line contains "VOCABULARY"; we will memory-map it afterwards.

In [556]:
def create_vocabulary_file(filepath):
    # create the file object using the open function call;
    # the `with` block closes the file automatically on exit
    with open(filepath, mode="w", encoding="utf8") as file_object:
        file_object.write('VOCABULARY\ncastaña\nthe\ncat\nwas\nnot\na\ncaterpilar\n')

# define filepath
filepath="./vocab.txt"
create_vocabulary_file(filepath)

In [557]:
cat vocab.txt

VOCABULARY
castaña
the
cat
was
not
a
caterpilar


Now we want to keep adding words to the mmap file.

In [558]:
file_object = open(filepath, mode="r+", encoding="utf8") 

In [559]:
# import module
import mmap

# create an mmap object using the mmap function call (length=0 maps the whole file)
mmap_object = mmap.mmap(file_object.fileno(), length=0, access=mmap.ACCESS_WRITE, offset=0)
 
#read data from mmap object
txt = mmap_object.read()
 
#print the data
print("Data read from file in byte format is:")
print(txt)
print("Text data is:")
print(txt.decode())

Data read from file in byte format is:
b'VOCABULARY\ncasta\xc3\xb1a\nthe\ncat\nwas\nnot\na\ncaterpilar\n'
Text data is:
VOCABULARY
castaña
the
cat
was
not
a
caterpilar
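
Note that `read()` consumes the data like a regular file object does: it moves the position to the end of the map, so a second `read()` returns nothing until we rewind. A small self-contained sketch, assuming a throwaway file called `demo_vocab.txt` in a temporary directory (not the `vocab.txt` used in this notebook):

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_vocab.txt")  # hypothetical demo file
    with open(demo_path, "wb") as f:
        f.write("VOCABULARY\ncastaña\n".encode("utf8"))

    with open(demo_path, "rb") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
            first_read = mm.read()       # reads everything, position moves to the end
            pos_after_read = mm.tell()   # current position in bytes
            second_read = mm.read()      # nothing left to read
            mm.seek(0)                   # rewind to the start of the map
            after_seek = mm.read(10)     # read only the first 10 bytes

print(pos_after_read, second_read, after_seek)
```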



### Get a slice of the mmap file

We can get a slice of the mmap file as if it were an array of bytes.

In [560]:
mmap_object[0:4], mmap_object[0:]

(b'VOCA', b'VOCABULARY\ncasta\xc3\xb1a\nthe\ncat\nwas\nnot\na\ncaterpilar\n')

In [561]:
print(mmap_object[11:20])
print(mmap_object[11:20].decode('utf8'))

b'casta\xc3\xb1a\n'
castaña
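
One thing to keep in mind is that the indices are byte offsets, not character offsets. The "ñ" above takes two bytes in UTF-8, so a slice that ends in the middle of it cannot be decoded. The sketch below illustrates this on a throwaway file (the file name is an assumption made for the example):

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_vocab.txt")  # hypothetical demo file
    with open(demo_path, "wb") as f:
        f.write("VOCABULARY\ncastaña\n".encode("utf8"))

    with open(demo_path, "rb") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
            whole_word = mm[11:19]   # the 8 bytes of "castaña" ("ñ" takes 2 bytes)
            cut_word = mm[11:17]     # stops in the middle of the 2-byte "ñ"

print(whole_word.decode("utf8"))

try:
    cut_word.decode("utf8")
    decode_failed = False
except UnicodeDecodeError:
    # the slice ends with only the first byte of "ñ"
    decode_failed = True
print(decode_failed)
```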



### Read a mmap file line by line


In [562]:
# Load mmap object
mmap_object= mmap.mmap(file_object.fileno(), length=0, access=mmap.ACCESS_READ, offset=0)
# Note `pos=0` is the initial pointer position to the data in the file
print(mmap_object)

<mmap.mmap closed=False, access=ACCESS_READ, length=49, pos=0, offset=0>


We can read a mmap file line by line, iterating over the lines until `readline()` returns an empty bytes object at the end of the file.

In [563]:
def print_line_by_line_mmap(mmap_object):
    line = True
    while line:
        print('\t', mmap_object)
        line = mmap_object.readline()
        print(line)

Note that the `pos` field shown in the representation of the mmap object keeps track of the current position in the data (it is also available through `mmap_object.tell()`).

In [564]:
mmap_object= mmap.mmap(file_object.fileno(), length=0, access=mmap.ACCESS_READ, offset=0)
print_line_by_line_mmap(mmap_object)

	 <mmap.mmap closed=False, access=ACCESS_READ, length=49, pos=0, offset=0>
b'VOCABULARY\n'
	 <mmap.mmap closed=False, access=ACCESS_READ, length=49, pos=11, offset=0>
b'casta\xc3\xb1a\n'
	 <mmap.mmap closed=False, access=ACCESS_READ, length=49, pos=20, offset=0>
b'the\n'
	 <mmap.mmap closed=False, access=ACCESS_READ, length=49, pos=24, offset=0>
b'cat\n'
	 <mmap.mmap closed=False, access=ACCESS_READ, length=49, pos=28, offset=0>
b'was\n'
	 <mmap.mmap closed=False, access=ACCESS_READ, length=49, pos=32, offset=0>
b'not\n'
	 <mmap.mmap closed=False, access=ACCESS_READ, length=49, pos=36, offset=0>
b'a\n'
	 <mmap.mmap closed=False, access=ACCESS_READ, length=49, pos=38, offset=0>
b'caterpilar\n'
	 <mmap.mmap closed=False, access=ACCESS_READ, length=49, pos=49, offset=0>
b''
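
If we only need the lines and not the intermediate positions, a more compact way to write the same loop is the two-argument form of `iter`, which keeps calling `readline` until it returns the sentinel `b""`. This is a sketch on a throwaway file whose name is assumed for the example:

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_vocab.txt")  # hypothetical demo file
    with open(demo_path, "wb") as f:
        f.write("VOCABULARY\ncastaña\nthe\ncat\n".encode("utf8"))

    with open(demo_path, "rb") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
            # iter(callable, sentinel) stops as soon as readline() returns b""
            lines = [line.decode("utf8").rstrip("\n") for line in iter(mm.readline, b"")]

print(lines)
```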


### Get number of lines of a mmap file


In [565]:
def get_len(mmap_object):
    # Count lines until readline() returns b'' at the end of the file
    # (the final empty read is not counted as a line)
    n_lines = 0
    while mmap_object.readline():
        n_lines += 1
    return n_lines

In [566]:
mmap_object = mmap.mmap(file_object.fileno(), length=0, access=mmap.ACCESS_READ, offset=0)
get_len(mmap_object)

8

### Search the first position at which a particular substring appears

In [567]:
word = b'cat'
mmap_object = mmap.mmap(file_object.fileno(), length=0, access=mmap.ACCESS_READ, offset=0)
start_position_word = mmap_object.find(word)

print(f'start position for {word} is {start_position_word}')
print(mmap_object[start_position_word:start_position_word+3])

start position for b'cat' is 24
b'cat'
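
Keep in mind that `find` looks for a substring, so `b'cat'` would also match the beginning of `caterpilar` if that word came first in the file. One possible way to look for a whole line is to include the surrounding newlines in the search. The helper `contains_word` below is hypothetical, and it assumes that every word sits on its own line terminated by `\n`:

```python
import mmap
import os
import tempfile

def contains_word(mm, word):
    """Return True if `word` occupies a whole line of the mapped file (hypothetical helper)."""
    target = word.encode("utf8")
    # The first line has no newline before it, so check it separately
    if mm[:len(target) + 1] == target + b"\n":
        return True
    # Any other line is preceded and followed by a newline
    return mm.find(b"\n" + target + b"\n") != -1

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_vocab.txt")  # hypothetical demo file
    with open(demo_path, "wb") as f:
        f.write(b"VOCABULARY\ncaterpilar\ncat\n")

    with open(demo_path, "rb") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
            substring_hit = mm.find(b"cat")       # matches inside "caterpilar"
            missing = mm.find(b"dog")             # find returns -1 when absent
            has_cat = contains_word(mm, "cat")
            has_cater = contains_word(mm, "cater")
            has_header = contains_word(mm, "VOCABULARY")

print(substring_hit, missing, has_cat, has_cater, has_header)
```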


In [568]:
mmap_object

<mmap.mmap closed=False, access=ACCESS_READ, length=49, pos=0, offset=0>

### Update a slice of a mmap file

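We can update a slice of the mmap file as if it were an array. The replacement must have exactly the same length as the slice it replaces, since slice assignment cannot change the size of the mapping.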

In [569]:
mmap_object = mmap.mmap(file_object.fileno(), length=0, access=mmap.ACCESS_WRITE, offset=0)
mmap_object[0:3] = b"XXX"
print(mmap_object[0:])
mmap_object.flush()

b'XXXABULARY\ncasta\xc3\xb1a\nthe\ncat\nwas\nnot\na\ncaterpilar\n'


We can see that the update is present in the file:

In [570]:
cat vocab.txt

XXXABULARY
castaña
the
cat
was
not
a
caterpilar
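
The sketch below shows both behaviours on a throwaway file (its name is an assumption for the example): a same-length assignment reaches the disk after `flush()`, while an assignment of a different length is rejected with an exception. Both `IndexError` and `ValueError` are caught here so as not to depend on which one a given Python version raises.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_vocab.txt")  # hypothetical demo file
    with open(demo_path, "wb") as f:
        f.write("VOCABULARY\ncastaña\n".encode("utf8"))

    with open(demo_path, "r+b") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_WRITE) as mm:
            mm[0:3] = b"XXX"              # same length: accepted
            try:
                mm[0:3] = b"TOO LONG"     # 8 bytes into a 3-byte slice: rejected
                size_mismatch_rejected = False
            except (IndexError, ValueError):
                size_mismatch_rejected = True
            mm.flush()                    # push the changes to the file

    # Read the file back as a regular file to confirm the update is on disk
    with open(demo_path, "rb") as f:
        contents_on_disk = f.read()

print(size_mismatch_rejected, contents_on_disk)
```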


### Update a line of a mmap file

Consider the case where you want to overwrite the beginning of a line in place.

In [571]:
mmap_object = mmap.mmap(file_object.fileno(), length=0, access=mmap.ACCESS_WRITE, offset=0)
mmap_object[0:3] = b"XXX"
print(mmap_object[0:])

b'XXXABULARY\ncasta\xc3\xb1a\nthe\ncat\nwas\nnot\na\ncaterpilar\n'
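
To replace the content of a whole line, one option is to locate the line with `find` and overwrite it with a word of the same byte length, padding with spaces when the new word is shorter. This is only a sketch: the helper `overwrite_line` is hypothetical, the padding convention is an assumption, and a longer word cannot be written this way without rewriting the rest of the file.

```python
import mmap
import os
import tempfile

def overwrite_line(mm, old_word, new_word):
    """Overwrite the first line equal to old_word with new_word (hypothetical helper).

    The new word is padded with trailing spaces so the line keeps its byte length.
    """
    old = old_word.encode("utf8")
    new = new_word.encode("utf8")
    if len(new) > len(old):
        raise ValueError("new word does not fit in the space of the old line")
    if mm[:len(old) + 1] == old + b"\n":
        start = 0                          # the match is the first line of the file
    else:
        hit = mm.find(b"\n" + old + b"\n")
        if hit == -1:
            return False                   # no such line
        start = hit + 1                    # skip the newline before the word
    mm[start:start + len(old)] = new.ljust(len(old), b" ")
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_vocab.txt")  # hypothetical demo file
    with open(demo_path, "wb") as f:
        f.write("VOCABULARY\ncastaña\nthe\n".encode("utf8"))

    with open(demo_path, "r+b") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_WRITE) as mm:
            replaced = overwrite_line(mm, "castaña", "nuez")
            not_found = overwrite_line(mm, "dog", "cat")
            updated_bytes = mm[:]

print(replaced, not_found, updated_bytes)
```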


### Append to a file

We can append to a file as follows (here the file is opened as a regular file, not as a mmap file).

In [572]:
file_object = open(filepath, mode="a", encoding="utf8") 

In [573]:
with open(filepath, mode="a", encoding="utf8") as file_object:
    new_words = ['dog', 'house','sheep','conjuración']
    for w in new_words:
        file_object.write(w+'\n')

In [574]:
cat vocab.txt

XXXABULARY
castaña
the
cat
was
not
a
caterpilar
dog
house
sheep
conjuración


### Modify a file without opening and closing it many times

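Note that if a file has to be modified constantly, for example by adding new words to it, it might be worth keeping it open to avoid paying the cost of opening and closing it every time. In the two measurements below, however, both cells open the file on every loop (the second one only skips the close), and their timings are within noise of each other.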

In [592]:
%%timeit
with open(filepath, mode="a", encoding="utf8") as file_object:
    new_words = ['dog', 'house','sheep','conjuración']
    for w in new_words:
        file_object.write(w+'\n')

30.9 µs ± 2.52 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [595]:
%%timeit
file_object = open(filepath, mode="a", encoding="utf8")
new_words = ['dog', 'house','sheep','conjuración']
for w in new_words:
    file_object.write(w+'\n')

31.9 µs ± 752 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
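
To actually measure the effect of keeping the file open, the `open` call has to be moved out of the timed code. A possible sketch with the standard `timeit` module is shown below. The function names and the throwaway file are assumptions for the example, and no timings are reported here since they depend on the machine and the file system.

```python
import os
import tempfile
import timeit

new_words = ["dog", "house", "sheep", "conjuración"]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_vocab.txt")  # hypothetical demo file

    def append_reopening():
        # open and close the file on every call
        with open(demo_path, mode="a", encoding="utf8") as f:
            for w in new_words:
                f.write(w + "\n")

    t_reopen = timeit.timeit(append_reopening, number=1000)

    # open the file once and reuse the same handle for every call
    with open(demo_path, mode="a", encoding="utf8") as shared_file:
        def append_keeping_open():
            for w in new_words:
                shared_file.write(w + "\n")

        t_keep = timeit.timeit(append_keeping_open, number=1000)

print(f"reopening each time: {t_reopen:.4f} s for 1000 calls")
print(f"keeping it open:     {t_keep:.4f} s for 1000 calls")
```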


### Compare string presence: file VS mmap file

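Imagine you want to write to disk all the words you find in a huge corpus, and you can't keep in memory a set containing all the visited terms. You then need to check whether a string is already present in the file. Below we compare reading the whole file as a regular file against searching the mmap file.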

In [596]:
file_object = open(filepath, mode="r", encoding="utf8")

In [597]:
%%time
'cat' in file_object.read()

CPU times: user 5.85 ms, sys: 6.71 ms, total: 12.6 ms
Wall time: 13.2 ms


True

In [619]:
word = b'cat'
mmap_object = mmap.mmap(file_object.fileno(), length=0, access=mmap.ACCESS_READ, offset=0)

%time start_position_word = mmap_object.find(word)

CPU times: user 12 µs, sys: 16 µs, total: 28 µs
Wall time: 33.1 µs


### Working with open files and mmapped files at the same time

One might want to append data to a file and at the same time check if the file contains some information.
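
A mapping created with `length=0` covers the file as it was when the map was created, so data appended later is not visible through an older map. One simple (not necessarily the fastest) approach is to flush the file object and create a fresh read-only map before each check. The helper `add_if_missing` below is hypothetical and assumes one word per line, each terminated by `\n`:

```python
import mmap
import os
import tempfile

def add_if_missing(file_object, word):
    """Append `word` as a new line unless it is already a whole line (hypothetical helper).

    `file_object` is expected to be opened in "a+b" mode.
    """
    encoded = word.encode("utf8")
    file_object.flush()  # make earlier buffered writes visible to the new map
    size = os.fstat(file_object.fileno()).st_size
    if size > 0:  # an empty file cannot be memory-mapped
        with mmap.mmap(file_object.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
            is_first_line = mm[:len(encoded) + 1] == encoded + b"\n"
            if is_first_line or mm.find(b"\n" + encoded + b"\n") != -1:
                return False
    file_object.write(encoded + b"\n")  # append mode always writes at the end
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_vocab.txt")  # hypothetical demo file
    with open(demo_path, "a+b") as f:
        results = [add_if_missing(f, w) for w in ["cat", "dog", "cat", "caterpilar", "dog"]]
    with open(demo_path, "rb") as f:
        final_contents = f.read()

print(results)
print(final_contents)
```

Creating a new map on every call keeps the sketch short; for a file that grows a lot, one might prefer to remap only from time to time and keep the most recent words in a small in-memory set.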