Francesc Dantí (fdanti@ub.edu)
# Cloud Computing (part 2)

In this second part of the lesson we will focus on how to work with **big files** and how to avoid **slowdowns** caused by the Operating System (OS) **memory system**.

## 1.- Memory usage

Memory usage is an important consideration in our programming model.  

Memory is where our CPU stores all the instructions and data it needs to operate. In a computer, memory is a hierarchical structure. The three levels that matter here are:  

+ Internal. Processor registers and cache. They are "as fast as expensive".
+ Main. Outside the CPU, we have RAM. It's cheaper and pretty fast; in most cases we have a few GB in our computers.
+ On-line mass storage. Some operating systems include a swap space on the HDD.

The computer will store the most used data "as near as possible" to the CPU. If the nearest level is full, it will use the next level, and so on.  
So, overloading our computer's RAM means that the OS is going to use swap, which means a significant slowdown of the system. 

How can we work with multi-GB files on a computer with only a few GB of memory? We must change our programming model to be able to work with **Big Data**!

First of all, let's see how we work with small files.

With small files, we can load them into memory at once.
Let's load a CSV file containing a few thousand crime records from Sacramento. With the Pandas read_csv() function, we are able to load an entire CSV file into a DataFrame:

In [None]:
import urllib.request
import time as tm
import pandas as pd

#We use a sample CSV file from the samplecsvs website.
# It contains a few thousand records (~700 KB)

#Get the file from internet
fileURL = "http://samplecsvs.s3.amazonaws.com/SacramentocrimeJanuary2006.csv"
fileName = "SacramentocrimeJanuary2006.csv"
urllib.request.urlretrieve(fileURL,fileName)

csvFile = "SacramentocrimeJanuary2006.csv"

#Let's store a timestamp
t0 = tm.time() 

#We put all data in a DataFrame. That is: all rows are loaded into memory
data = pd.read_csv(csvFile)

#And we store the timestamp again:
t1 = tm.time()

#How long does it take to load the data?
print("Time loading: {:.3f}s".format(t1 - t0))

#How many records do we have in DataFrame?
print("Total number of registers loaded into dataFrame: {0}".format(len(data)))

#What kind of data do we have in DataFrame?
data[:2]

What we've done is load 7584 records from a ~700 KB file into a DataFrame.  

If you try to load a multi-GB file into memory at once, even in a well-configured OS your notebook will hang because of problems allocating memory.
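One way to guard against this is to compare the file size with the physical RAM before loading. The following is only a minimal sketch: the helper names `physical_ram_bytes` and `fits_in_memory` and the 50% threshold are assumptions made for illustration, and `os.sysconf` is only available on Unix-like systems. Keep in mind that the in-memory representation of a file is often larger than the file itself.

```python
import os
import tempfile

def physical_ram_bytes():
    """Return total physical RAM in bytes, or None if it cannot be determined."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are not available on this platform
        return None

def fits_in_memory(path, max_fraction=0.5):
    """Hypothetical check: is the file smaller than a fraction of the RAM?"""
    ram = physical_ram_bytes()
    if ram is None:
        return None  # unknown, the caller has to decide
    return os.path.getsize(path) <= ram * max_fraction

# Try it with a tiny temporary file
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b,c\n1,2,3\n")
    tmp_path = tmp.name

ram = physical_ram_bytes()
small_ok = fits_in_memory(tmp_path)
os.remove(tmp_path)

print("Physical RAM (bytes):", ram)
print("Small file fits in memory:", small_ok)
```

If the check returns False, loading the file at once is likely to push the system into swap, and one of the chunked approaches in the following sections is a safer choice.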

## 2.- Create a big file

We need a method for creating a big file of a chosen size whose contents are reproducible rather than random. 
In Jesse Noller's blog [1], there is code that does what we need. It uses collections, a module that provides a very fast container datatype (deque).

This code generates a file with "reproducible-random" words:

In [None]:
import collections
import os
import time
import urllib.request

#Fixed (non-random) seed. It lets us generate the same file every time
seed = "578945241245768521425"

#Get a file with some sample words.
lorem_URL = "https://raw.githubusercontent.com/mxw/grmr/master/src/inputs/lorem.txt"
lorem_file = "lorem.txt"
urllib.request.urlretrieve(lorem_URL,lorem_file)

# Replace newlines with spaces, strip commas and dots, and split into words
with open(lorem_file, "r") as lorem:
    words = lorem.read().replace("\n",' ').replace(",",'').replace(".",'').split() #Get the words in lorem.txt

#Generator that provides non-random words as needed.
def fdata():
    # A deque is a container that implements fast appends and pops.
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        yield '\n '.join(list(a)[0:1024])
        a.rotate(int(b[0]))
        b.rotate(1)

# Instantiate the generator
g = fdata()

# Set the size and the file you want to create.
#size = 17179869184 #16GB (~5minutes)
#size = 4294967296 #4GB (~1.2minutes)
size = 419430400 #400MB (~6s)

#Define output file and its fileHandler
output_path = "/tmp/bigFile1.out"
output_file = open(output_path, 'w')

#Add words to the file until it is bigger than size:
t0 = time.time()
while os.path.getsize(output_path) < size:
    output_file.write(next(g)) #in python 2, is g.next()
    
t1 = time.time()

output_file.close()

print("Time generating file: {:.2f}m".format((t1-t0)/60))
print("Size of generated file: {0}bytes".format(os.path.getsize(output_path)))
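To see why the generator above always produces the same output, here is a small sketch of the same rotate-and-yield idea on a toy word list. The function name `toy_chunks`, the five sample words and the seed "21" are assumptions chosen only to keep the output short. Each digit of the seed decides how far the word deque is rotated before the next chunk is emitted, so the same seed always gives the same sequence.

```python
import collections
import itertools

def toy_chunks(words, seed, chunk_len=3):
    """Yield chunks of words; the seed digits drive the rotation."""
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        # Emit the first chunk_len words of the current arrangement
        yield " ".join(itertools.islice(a, chunk_len))
        a.rotate(int(b[0]))  # shift the words right by the current seed digit
        b.rotate(1)          # move on to the next seed digit

toy_words = ["alpha", "beta", "gamma", "delta", "epsilon"]

first_run = list(itertools.islice(toy_chunks(toy_words, "21"), 4))
second_run = list(itertools.islice(toy_chunks(toy_words, "21"), 4))

for chunk in first_run:
    print(chunk)
print("Same output on both runs:", first_run == second_run)
```

The generator is infinite, so `itertools.islice` is used to take only the first four chunks.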

## 3.- Count the words

How many words are in the file /tmp/bigFile1.out?  

### Loading the entire file into memory 
This code loads the entire file into memory and counts the number of words.
With a big file, can you guess the result? If you want to try it, first of all, save your work!

In [None]:
file_path = "/tmp/bigFile1.out"

t0 = tm.time()

count = 0
with open(file_path, "r") as f: #Use the with statement to avoid calling close() method
    data = f.read()            #Loads file contents into memory!
    count = len(data.split())
    
t1 = tm.time()

print("We have {0} words in file {1}".format(count, file_path))
print("Time: ", t1 - t0)

We are not able to load a big file into memory this way. With a 40 GB file, the read() method will hang our computer.

### Reading the file by lines

Python solves this problem using **Iterators** in a transparent way. Iterables are objects whose elements can be read piece by piece: strings, lists, dicts, **files** and so on:

In [None]:
#Iterable objects: a string, a list and (once opened) a file
a_string = 'Hello'
a_list = [23,4,76,1]
a_file = "/tmp/bigFile1.out" #path only; open(a_file) returns the iterable file object

The **for** statement reads each element of an iterator:

In [None]:
for letter in a_string:
    print(letter)

In [None]:
for element in a_list:
    print(element)

The **for** statement receives an iterable, creates the iterator and uses it to operate over each element.  
By default, strings are iterated by characters, lists by elements, dicts by keys and files by lines.
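What the **for** statement does behind the scenes can be sketched with the built-in functions `iter()` and `next()`. This is only an illustration with small in-memory objects; the variable names are chosen for the example.

```python
a_list = [23, 4, 76, 1]

# Roughly what "for element in a_list" does:
iterator = iter(a_list)           # 1. create the iterator from the iterable
collected = []
while True:
    try:
        element = next(iterator)  # 2. ask for the next element
    except StopIteration:         # 3. stop when there are no more elements
        break
    collected.append(element)

print(collected)

# The unit of iteration depends on the type of the object
chars = list(iter("Hello"))            # a string yields characters
keys = list(iter({"x": 1, "y": 2}))    # a dict yields its keys
print(chars)
print(keys)
```

Only one element is requested at a time, which is why iterating over a file never needs the whole file in memory.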

Let's see how to read our bigFile line by line using a for loop:

In [None]:
#Count words in a file line by line
file_path = "/tmp/bigFile1.out"

t0 = tm.time()
with open(file_path, "r") as f:
    count = 0
    # Creates an iterator over the file, reading line by line
    for line in f:
        count = count + len(line.split())
            
t1 = tm.time()

print("We have {0} words in file {1}".format(count, file_path))
print("Time: ", t1 - t0)

In the above code, we have split the file into lines. This gives us the chance to operate on a very large file, and we can modify our code to use multiple engines to operate on each "chunk".
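The memory benefit of reading by lines can be checked on a small scale with the standard `tracemalloc` module. The sketch below is an illustration under assumptions: it uses a temporary file of about 1.3 MB instead of the big file, and the helper names (`peak_bytes`, `count_with_read`, `count_by_lines`) are made up for the example. The exact byte counts will vary between systems.

```python
import os
import tempfile
import tracemalloc

def peak_bytes(func, path):
    """Run func(path) and return its result and the peak traced memory."""
    tracemalloc.start()
    result = func(path)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak

def count_with_read(path):
    with open(path, "r") as f:
        return len(f.read().split())    # whole file in memory

def count_by_lines(path):
    count = 0
    with open(path, "r") as f:
        for line in f:                  # one line in memory at a time
            count += len(line.split())
    return count

# Build a small sample file: 50000 lines of 5 words each
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for _ in range(50000):
        tmp.write("lorem ipsum dolor sit amet\n")
    sample_path = tmp.name

words_read, peak_read = peak_bytes(count_with_read, sample_path)
words_lines, peak_lines = peak_bytes(count_by_lines, sample_path)
os.remove(sample_path)

print("Words:", words_read, words_lines)
print("Peak memory with read():  ", peak_read, "bytes")
print("Peak memory line by line: ", peak_lines, "bytes")
```

Both functions return the same count, but the peak memory of the read() version grows with the file size, while the line-by-line version only depends on the length of the longest line.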

However, this method has some disadvantages:
+ It is slower than read() (with 400 MB, 22 s vs 4 s)
+ What happens if our big file has only a few lines? Each line would be very long, and we would have to create a **second for** loop to iterate over each line.
+ What happens if our lines have **different lengths**? It would be difficult to predict the time spent by each task in parallel programming, so it would be difficult to optimize our code.

### Binary split

To overcome this, we can use a binary split. That is, cutting the file into fixed-size blocks.  

In [None]:
#from: http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python

import time as tm

def read_in_chunks(file, size=1024):
    """Generator that returns contents of a file in chunks of fixed size (default = 1k)"""
    while True:
        data = file.read(size)
        if not data:
            break
        yield data
        
def words_in_file(fname):
    """Counts words in the file located at path fname"""
    with open(fname, "r") as f:
        count = 0
        for chunk in read_in_chunks(f):
            count = count + len(chunk.split())
    return count

file_path = "/tmp/bigFile1.out"
t0 = tm.time()
num_words = words_in_file(file_path)
t1 = tm.time()

print("We have {0} words in file {1}".format(num_words, file_path))
print(t1 - t0)

Even without parallelization, the code is fast: 400 MB in ~2.8 seconds.  
But wait, is the result correct? **Binary split counts more words**, because a word cut at a chunk boundary is counted once in each of the two chunks.
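The overcount is easy to reproduce on a short string. This sketch reuses the `read_in_chunks` generator with an in-memory `io.StringIO` object instead of a real file; the sample text and the chunk size of 8 characters are assumptions picked so that several words are cut in the middle. It only shows the problem, not a solution.

```python
import io

def read_in_chunks(file, size=1024):
    """Generator that returns contents of a file in chunks of fixed size"""
    while True:
        data = file.read(size)
        if not data:
            break
        yield data

text = "lorem ipsum dolor sit amet"

# Reference value: count the words of the whole text at once
exact_count = len(text.split())

# Count the words chunk by chunk, 8 characters at a time
chunks = list(read_in_chunks(io.StringIO(text), size=8))
chunked_count = sum(len(chunk.split()) for chunk in chunks)

print(chunks)
print("Exact count:  ", exact_count)
print("Chunked count:", chunked_count)
```

Here three chunk boundaries fall inside a word ("ip|sum", "dolo|r", "am|et"), so the chunked count is three words higher than the exact one.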

## 4.- Exercise: think of ways to get better results when using binary splitting.

Your solution must be parallelizable, must not overload RAM, and must use binary split.  

In [None]:
# Your solution here

Sources
[1] http://jessenoller.com/blog/2009/02/27/generating-re-creatable-random-files