# Data Reader
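The pickled object is stored in shared physical memory, so `read_from_shared_memory` returns a copy of the pandas object that the `loader` process saved there. Pickling does not scale well to large data, because the reader holds both the pickled bytes and the unpickled copy in memory at the same time. However, we aren't short of RAM, and **this approach is 135% faster on a 2.5 GB dataset** (14 s from shared memory versus 32.9 s from disk).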
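The round trip described above can be sketched end to end. The loader's own code is not shown in this notebook, so this is only an illustration under assumptions: it uses the standard library's `multiprocessing.shared_memory` as a stand-in for `posix_ipc`, a plain dict as a stand-in for the DataFrame, and the hypothetical helpers `write_pickled` and `read_pickled`.

```python
import os
import pickle
from multiprocessing import shared_memory


def write_pickled(name, obj):
    """Loader side (hypothetical): pickle obj into a new shared memory block."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    shm = shared_memory.SharedMemory(name=name, create=True, size=len(data))
    shm.buf[:len(data)] = data
    return shm  # the owner keeps the block alive and unlinks it when done


def read_pickled(name):
    """Reader side: attach to the block, unpickle a private copy, detach."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # bytes(...) copies the block; pickle.loads stops at the end of the
        # pickle stream, so any padding after the data is ignored.
        return pickle.loads(bytes(shm.buf))
    finally:
        shm.close()


# Stand-in payload and an assumed, process-unique block name.
payload = {"subreddit": "todayilearned", "rows": [1, 2, 3]}
block_name = f"demo-bucket-{os.getpid()}"

owner = write_pickled(block_name, payload)
try:
    restored = read_pickled(block_name)
finally:
    owner.close()
    owner.unlink()  # remove the block once no reader needs it
```

`restored` compares equal to `payload` but is a separate object, which is the "copy" behaviour the paragraph refers to.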

In [1]:
import mmap
import pickle
import posix_ipc as ipc
import pandas as pd

In [2]:
# Shared memory segment name and input data file
shared_memory_name = "/shared-memory-bucket"
data_file = "data-01.txt"

## Description

In [3]:
def get_file_size(path):
    # Map the file read-only and return the size of the underlying file in bytes.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, mmap.MAP_PRIVATE, mmap.PROT_READ) as mm:
            return mm.size()

In [4]:
print("File:   ", data_file)
print("Type:   ", "text/csv")
print("Size:   ", get_file_size(data_file) / 10**6, "MB")

File:    data-01.txt
Type:    text/csv
Size:    2504.015418 MB
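As a possible alternative, the size can be read from the file's metadata without mapping the file at all. The sketch below assumes a hypothetical helper `file_size_mb` and runs against a throwaway temporary file, so it does not need `data-01.txt`.

```python
import os
import tempfile


def file_size_mb(path):
    """Return the file size in megabytes (10**6 bytes), taken from file metadata."""
    return os.path.getsize(path) / 10**6


# Create a 2,500,000-byte demo file, measure it, then clean up.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2_500_000)
    demo_path = tmp.name

demo_size_mb = file_size_mb(demo_path)
os.remove(demo_path)

print("Size:   ", demo_size_mb, "MB")
```

Both approaches report the size in bytes before the division; `os.path.getsize` simply avoids creating a memory map for a value the file system already stores.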


## 1. Read from the shared memory created by the loader

In [5]:
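def read_from_shared_memory(name):
    # Attach to the existing POSIX shared memory segment created by the loader.
    memory = ipc.SharedMemory(name=name)
    try:
        # Map the whole segment read-only and copy its bytes into this process.
        with mmap.mmap(memory.fd, memory.size, mmap.MAP_SHARED, mmap.PROT_READ) as mm:
            bytes_obj = mm.read()
    finally:
        # Release the file descriptor even if mapping or reading fails.
        memory.close_fd()
    # Unpickling builds a private copy of the stored object.
    return pickle.loads(bytes_obj)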

In [6]:
%%timeit
df = read_from_shared_memory(shared_memory_name)

14 s ± 830 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
df = read_from_shared_memory(shared_memory_name)
df.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | ed0a74c730b1a8470d8dd371714a8b9182d4d860319097... | todayilearned | eug0v | en.wikipedia.org | `<title> jeannette rankin - wikipedia , the fre...` |
| 1 | 5e67754718366a3da48080bd0f7a9b3ed5f2102ab74218... | todayilearned | eug0v | en.wikipedia.org | `<h1> jeannette rankin </h1>` |
| 2 | 142dc63bcecab761cd8273741aeb55c5cfced225ade3d0... | todayilearned | eug0v | en.wikipedia.org | from wikipedia , the free encyclopedia |
| 3 | 18e7c9d00a8ba512e750606f8287236caf94196b102f56... | todayilearned | eug0v | en.wikipedia.org | jump to : navigation , search |
| 4 | 25b5f520a7ff719a45d47c1ca714ba632608d180b9368d... | todayilearned | eug0v | en.wikipedia.org | jeannette rankin |


## 2. Read from disk

In [8]:
%%timeit
pd.read_table(data_file, header=None)

32.9 s ± 259 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
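
The "135% faster" figure follows from the two mean timings reported by `%%timeit` (14 s from shared memory, 32.9 s from disk). A short check of that arithmetic, with variable names chosen here for illustration:

```python
# Mean run times copied from the two %%timeit outputs in this notebook.
shared_memory_s = 14.0
disk_s = 32.9

# Ratio of run times: how many times as fast the shared memory read is.
speedup = disk_s / shared_memory_s

# Relative improvement, expressed as "percent faster".
percent_faster = (disk_s - shared_memory_s) / shared_memory_s * 100

print(f"{speedup:.2f}x as fast, {percent_faster:.0f}% faster")
```

These are single-machine means with standard deviations of 830 ms and 259 ms respectively, so the ratio is best read as an approximate figure.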
