# Part 3: Batching

<hr>

In [1]:
import importlib
import gensim
import nltk
import json
from materials.code import utils
importlib.reload(utils)
import matplotlib.pyplot as plt

# IMPORT SOME BASIC TOOLS:
from pprint import pprint
import pyarrow


### Batch Processing Data
In the previous two portions of the tutorial (and the previous assignments for that matter) we've followed a standard procedure when we want to use text data for classification:

0. Load our text data into memory.
1. Format the text data into a set of tokenized sentences (e.g. list of lists).
2. Format the labels we want to predict as a list. (e.g. "Positive" or "Negative")
3. Generate a numerical representation of the text data (e.g. bag-of-words, or word vectors).
4. Generate a numerical representation of the labels (e.g. Positive = `1`, Negative = `0`).
5. Train a model that maps the numerical representation of the text to the numerical representation of the labels.

This procedure is going to be roughly the same no matter what kind of data we're working with, or what kind of labels we want to predict. But one fatal flaw in our execution of this general procedure has been the foundational "Step 0", where we've been loading **our entire dataset into memory, all at once** for tokenization, formatting, splitting into training and testing sets, modeling, etc.

We've been able to get away with this because our datasets have (intentionally) been on the smaller side; the Rotten Tomatoes movie dataset, for instance, was a puny 10,000 reviews. But when you are training models in the real world, it's often not practical (or even possible) to pull the entire dataset into memory at once. Perhaps you faced a painful memory error in the previous learning exercises if you dared not to store the data in a sparse array or dict, **and that data wasn't even very large by NLP standards!** Imagine if you had to represent all of Wikipedia as a bag of words, or train a model to predict poorly written Wikipedia articles.

One solution is to quit graduate school, and go work for a company with giant super-computers; another (and I think better) solution is to find ways to do your processing in manageable chunks, or **batches**. Let's go through some practical examples of batching with a larger version of the Rotten Tomatoes movie review dataset that I [downloaded from the web](https://drive.google.com/file/d/1N8WCMci_jpDHwCVgSED-B9yts-q9_Bb5/view). You can see the data at `materials/data/rotten_tomatoes_reviews_raw.csv`. This version has 480,000 reviews, so it's getting to the point where any kind of advanced processing on the whole thing at once might be painful.

So, instead of processing the data all at once, let's open the data file and read in only the first three lines, each time placing (and replacing) what we see in the variable `line`, so that we don't abuse our poor computer's memory.

In [4]:
# Location of the rotten tomatoes review data.
data_path = 'materials/data/rotten_tomatoes_reviews_raw.csv'

# Open the data file
with open(data_path) as fp:
    
    # read the first line and print
    line = fp.readline()
    print('Line 1:', line)
    
    # read the second line and print
    line = fp.readline()
    print('Line 2:', line)

    # read the third line and print
    line = fp.readline()
    print('Line 3:', line)
    
    # Shall I go on?
    

Line 1: Freshness,Review

Line 2: 1," Manakamana doesn't answer any questions, yet makes its point: Nepal, like the rest of our planet, is a picturesque but far from peaceable kingdom."

Line 3: 1," Wilfully offensive and powered by a chest-thumping machismo, but it's good clean fun."



<br><br> As we can tell from line 1, this is a simple CSV file where the first field indicates the `Freshness` (1 = positive) and the second field holds the free-text `Review`. If we parse this file one line at a time, we'll probably want to split each line on the `,` character, but note from the examples above that there are commas all over the reviews too! That makes splitting the text up neatly (with `.split(',')`, for instance) a little less straightforward.
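
To make the comma problem concrete, here is a small sketch that compares a naive `.split(',')` with Python's built-in `csv` module on one of the lines printed above. The `sample` string is just a copy of the header and the third line; nothing here touches the real data file, and it is not necessarily how `utils.py` parses the file.

```python
import csv
import io

# Header plus one review, copied from the printout above.
sample = (
    'Freshness,Review\n'
    '1," Wilfully offensive and powered by a chest-thumping machismo, '
    'but it\'s good clean fun."\n'
)

# Naive splitting also cuts the review at its internal comma.
naive_fields = sample.splitlines()[1].split(',')

# csv.reader respects the double quotes around the review text.
rows = list(csv.reader(io.StringIO(sample)))
header, first_row = rows[0], rows[1]

print(len(naive_fields))    # 3 pieces: the review was cut in two
print(len(first_row))       # 2 fields: Freshness and Review
print(first_row[1].strip())
```

Because the review is wrapped in double quotes, the `csv` module treats the comma inside it as part of the text rather than as a field separator.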

If you are collecting a custom dataset, or working with publicly available sources, you will probably have to work with data that is in CSV format at some point. However, especially for text data, it is often useful to convert your data to JSON lines format, which [has several advantages over CSV format](https://jsonlines.org/examples/), especially when dealing with sparse datasets.

So without further delay, let's write a memory-friendly function that converts the larger Rotten Tomatoes movie review data from a CSV file to a JSON lines file. Let's do 100,000 lines at a time, so I'll set `batch_size = 100000`. Note that the code for the `csvToJsonl` function used below is in `utils.py` in case you want to take a look.

In [5]:
importlib.reload(utils)
utils.csvToJsonl(source_file      = 'materials/data/rotten_tomatoes_reviews_raw.csv', 
                 destination_file = 'materials/data/rt_reviews/re_reviews.jsonl', 
                 batch_size       = 100000, 
                 verbose          = True)

Processed Batch 0 :  100001 / 480001 Lines
Processed Batch 1 :  200001 / 480001 Lines
Processed Batch 2 :  300001 / 480001 Lines
Processed Batch 3 :  400001 / 480001 Lines
Processed Batch 4 :  480001 / 480001 Lines


<br><br> The file has 480,000 data rows (+1 header row), and we converted it in batches of 100,000 rows, so the processing took 5 batches, as shown in the handy printout above.
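
If you'd like a feel for what a batched conversion can look like without opening `utils.py`, here is a minimal sketch. The function `csv_to_jsonl_sketch` is a hypothetical stand-in that I wrote for illustration; it is not the actual `csvToJsonl` code, and it runs on a tiny throwaway file rather than the real dataset.

```python
import csv
import json
import os
import tempfile

def csv_to_jsonl_sketch(source_file, destination_file, batch_size=100000):
    """Hypothetical stand-in for utils.csvToJsonl: stream a CSV file to
    JSON lines, holding at most `batch_size` rows in memory at a time."""
    n_batches = 0
    with open(source_file, newline='') as src, open(destination_file, 'w') as dst:
        reader = csv.DictReader(src)   # uses the header row for the keys
        batch = []
        for row in reader:
            batch.append(json.dumps(row))
            if len(batch) == batch_size:
                dst.write('\n'.join(batch) + '\n')
                batch = []
                n_batches += 1
        if batch:                      # write the final, smaller batch
            dst.write('\n'.join(batch) + '\n')
            n_batches += 1
    return n_batches

# Tiny demonstration on a throwaway file (not the real dataset).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'toy.csv')
jsonl_path = os.path.join(tmp_dir, 'toy.jsonl')
with open(csv_path, 'w', newline='') as f:
    f.write('Freshness,Review\n')
    for i in range(7):
        f.write('1,"Review number %d, with a comma."\n' % i)

n_batches = csv_to_jsonl_sketch(csv_path, jsonl_path, batch_size=3)
with open(jsonl_path) as f:
    records = [json.loads(line) for line in f]
print(n_batches, len(records), records[0])
```

With 7 rows and a batch size of 3, the sketch writes batches of 3, 3 and 1 rows, the same pattern as the 5 batches above, where the last one is smaller than the rest.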

<br> So now we have our data in `jsonl` format, but how are we going to use it to train a model? Well, if we're going to train a neural network (or some other) classifier, we don't *have to* update our model parameters using all the data at once. Instead, we can update our model parameters using a **subset** of the data at each step; that is, we can train the model in multiple **batches** over multiple epochs, and given enough batches and epochs, we should converge to parameter values close to those we would have obtained if the batch size were as large as the entire dataset!
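
Here is a toy illustration of that idea, assuming a much simpler model than a neural network: a straight line fit to made-up data with mini-batch gradient descent. All of the names and settings (`learning_rate`, `batch_size`, `n_epochs`, the data itself) are chosen just for this sketch. Each update only ever looks at 10 points, yet the parameters still end up at the values that generated the data.

```python
import random

random.seed(0)

# Toy data: y = 3x + 1 exactly, so the best parameters are known.
xs = [random.uniform(-1, 1) for _ in range(200)]
ys = [3.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0            # model parameters
learning_rate = 0.1
batch_size = 10
n_epochs = 50

indices = list(range(len(xs)))
for epoch in range(n_epochs):
    random.shuffle(indices)                      # new batch order each epoch
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        # Gradient of the mean squared error on this batch only
        grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
        grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

The printed values should be very close to the true slope of 3 and intercept of 1, even though no single update ever saw the whole dataset.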

If we're going to train our models using batches, we'll need a function that pulls only the current batch into memory, so we can leave the rest of the data on disk until we need it! To help with this, I've written a function that gets a batch from the `.jsonl` data we just generated (you can find the code for this in `materials/code/utils.py`). Let's use this function to get `batch_number` 89, assuming a `batch_size` of 5.

In [7]:
batch, end_flag = utils.getBatch(data_path    = 'materials/data/rt_reviews/re_reviews.jsonl', 
                                      batch_size   = 5,
                                      batch_number = 89, 
                                      random_seed  = 10)

# Print out the batch
pprint(batch)

[{'Freshness': '0',
  'Review': 'Only two scenes do more than hint at the poetic potential of the '
            'premise.'},
 {'Freshness': '0',
  'Review': "[T]he movie's inability to make up its mind on an approach sinks "
            'it.'},
 {'Freshness': '0',
  'Review': 'Hiistorical events referenced in the first two installments of '
            'the Underworld franchise are brought to life in this passionless, '
            'idea-deprived prequel."'},
 {'Freshness': '1',
  'Review': 'Incendiary material treated in a (thankfully) non-incendiary '
            'manner.'},
 {'Freshness': '1',
  'Review': "Hand the Oscar to Jeff Bridges right now, and let's be done with "
            'it."'}]


<br><br> As we can see, we get 5 rows of JSON-style data back. Here's another example of how to pull the first and the last batches of size 100,000 from the `.jsonl` file we created.

In [17]:
importlib.reload(utils)

#-----------------------------------------------------
# Read in the first batch of the data, assuming a batch size of 100,000
#-----------------------------------------------------
batch, end_flag = utils.getBatch(data_path    = 'materials/data/rt_reviews/re_reviews.jsonl', 
                                      total_lines  = 480000,
                                      batch_size   = 100000,
                                      batch_number = 0, 
                                      random_seed  = 1)

print('The Size of the first batch (batch "0") is:', len(batch))
print('Was this the final batch:', end_flag)

#-----------------------------------------------------
# Read in the fifth batch of the data, assuming a batch size of 100,000
#-----------------------------------------------------
batch, end_flag = utils.getBatch(data_path    = 'materials/data/rt_reviews/re_reviews.jsonl', 
                                      total_lines  = 480000,
                                      batch_size   = 100000,
                                      batch_number = 4, 
                                      random_seed  = 1)

print('\nThe Size of the fifth batch (batch "4") is:',len(batch))
print('Was this the final batch:', end_flag)


The Size of the first batch (batch "0") is: 100000
Was this the final batch: False

The Size of the fifth batch (batch "4") is: 80000
Was this the final batch: True


<br><br> Notice that the fifth and final batch (batch number 4) has fewer than 100,000 rows; this is because the size of the data is not a perfect multiple of 100,000 (it's 480,000, remember?). The function also returns a helpful indicator that lets us know when we've reached the final batch: the `end_flag`.
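
For the curious, here is one way a batch reader with an end-of-data flag could be written. This is a hypothetical sketch (`get_batch_sketch` is my name for it), not the actual `getBatch` code in `utils.py`: it reads batches in file order and ignores `random_seed`, so it will not reproduce the batches shown above.

```python
import itertools
import json
import os
import tempfile

def get_batch_sketch(data_path, batch_size, batch_number, total_lines):
    """Hypothetical stand-in for utils.getBatch (no shuffling): return the
    records in one batch plus a flag that is True for the final batch."""
    start = batch_size * batch_number
    stop = start + batch_size
    with open(data_path) as fp:
        # islice skips the lines before `start` without storing them
        batch = [json.loads(line) for line in itertools.islice(fp, start, stop)]
    end_flag = stop >= total_lines
    return batch, end_flag

# Toy file with 8 records, read in batches of 3 (sizes 3, 3, 2).
toy_path = os.path.join(tempfile.mkdtemp(), 'toy.jsonl')
with open(toy_path, 'w') as fp:
    for i in range(8):
        fp.write(json.dumps({'Freshness': str(i % 2), 'Review': 'review %d' % i}) + '\n')

first, first_flag = get_batch_sketch(toy_path, batch_size=3, batch_number=0, total_lines=8)
last, last_flag = get_batch_sketch(toy_path, batch_size=3, batch_number=2, total_lines=8)
print(len(first), first_flag)   # 3 False
print(len(last), last_flag)     # 2 True
```

Just like the real data, the toy file is not a perfect multiple of the batch size, so the final batch comes back smaller and with the flag set.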

So now we have `jsonl` data, and we can collect batches from it, but we're not done yet. We still need to generate a `training`, `validation`, and `testing` set. In the spirit of protecting our precious memory, let's generate those datasets in batches as well!

In [21]:
importlib.reload(utils)
utils.splitFile( file       = 'materials/data/rt_reviews/re_reviews.jsonl',
                 splits     = {'train'     :{'percentage':60},
                               'validation':{'percentage':20},
                               'test'      :{'percentage':20}},
                batch_size  = 100000,
                random_seed = 99,
                verbose     = True
               )

Initializing:  materials/data/rt_reviews/train_re_reviews.jsonl
Initializing:  materials/data/rt_reviews/validation_re_reviews.jsonl
Initializing:  materials/data/rt_reviews/test_re_reviews.jsonl
Processed Batch 1 :  100000 / 480000 Lines
Processed Batch 2 :  200000 / 480000 Lines
Processed Batch 3 :  300000 / 480000 Lines
Processed Batch 4 :  400000 / 480000 Lines
Processed Batch 5 :  480000 / 480000 Lines


<br><br> The `splitFile` function (again, available for your review in `utils`) works through the data in batches and splits it into partitions according to the arguments provided in the `splits` dictionary. Importantly, we don't have to call the partitions of the data `train`, `validation` and `test`; we can call them whatever we want, for instance:

In [24]:
importlib.reload(utils)
utils.splitFile( file       = 'materials/data/rt_reviews/re_reviews.jsonl',
                 splits     = {'1'     :{'percentage':25},
                               '2'     :{'percentage':25},
                               '3'     :{'percentage':25},
                               '4'     :{'percentage':25}
                              },
                batch_size  = 50000,
                random_seed = 99,
                verbose     = True
               )

Initializing:  materials/data/rt_reviews/1_re_reviews.jsonl
Initializing:  materials/data/rt_reviews/2_re_reviews.jsonl
Initializing:  materials/data/rt_reviews/3_re_reviews.jsonl
Initializing:  materials/data/rt_reviews/4_re_reviews.jsonl
Processed Batch 1 :  50000 / 480000 Lines
Processed Batch 2 :  100000 / 480000 Lines
Processed Batch 3 :  150000 / 480000 Lines
Processed Batch 4 :  200000 / 480000 Lines
Processed Batch 5 :  250000 / 480000 Lines
Processed Batch 6 :  300000 / 480000 Lines
Processed Batch 7 :  350000 / 480000 Lines
Processed Batch 8 :  400000 / 480000 Lines
Processed Batch 9 :  450000 / 480000 Lines
Processed Batch 10 :  480000 / 480000 Lines
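
As a rough idea of how a memory-friendly split can work, here is a sketch that streams a file one line at a time and sends each line to a randomly chosen partition. The function `split_file_sketch` is hypothetical and is not the code in `utils.py`; in particular, with this approach the percentages are only approximately respected, and the real `splitFile` may do things differently.

```python
import os
import random
import tempfile

def split_file_sketch(file, splits, random_seed=99):
    """Hypothetical stand-in for utils.splitFile: stream `file` line by line
    and send each line to one partition, chosen at random in proportion to
    the requested percentages. Returns the number of lines per partition."""
    rng = random.Random(random_seed)
    dirname, filename = os.path.split(file)
    names = list(splits)
    weights = [splits[name]['percentage'] for name in names]
    counts = {name: 0 for name in names}
    outputs = {name: open(os.path.join(dirname, name + '_' + filename), 'w')
               for name in names}
    try:
        with open(file) as source:
            for line in source:
                name = rng.choices(names, weights=weights)[0]
                outputs[name].write(line)
                counts[name] += 1
    finally:
        for handle in outputs.values():
            handle.close()
    return counts

# Toy file with 1,000 lines split 60/20/20.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy.jsonl')
with open(toy_path, 'w') as fp:
    for i in range(1000):
        fp.write('{"Review": "review %d"}\n' % i)

counts = split_file_sketch(toy_path, {'train': {'percentage': 60},
                                      'validation': {'percentage': 20},
                                      'test': {'percentage': 20}})
print(counts)
```

Only one line is ever held in memory, and every line ends up in exactly one of the output files, which are named with the partition name as a prefix, as in the printouts above.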


<hr> 

## Learning Exercise 3: 
#### Worth 1/5 Points
#### A. Generating Vocabulary in Batches
The following code was written to naively extract the vocabulary (i.e. the set of distinct tokens) from the larger Rotten Tomatoes dataset in `materials/data/rt_reviews/re_reviews.jsonl`. The code reads in the file one line at a time, appends all the data together, does some simple pre-processing and then saves the tokens and their counts in a regular JSON (not JSON lines) file, `vocabulary.json`. Please update the code block below so that the vocabulary and the frequency count of each word are generated in batches; that is, we would like to obtain the same vocabulary and word counts but without loading everything into memory!

After you have implemented your function, compare its memory usage and run time against the original implementation below (if you're using Unix, you can see memory use by running `top` in the command line). Comment on any differences you see, and discuss why those differences might exist.

In [68]:
import json
import nltk
import re 
  
#Counts the frequency of terms in a list
def countFrequency(my_list): 
    # Creating an empty dictionary  
    freq = {} 
    for item in my_list: 
        if (item in freq): 
            freq[item] += 1
        else: 
            freq[item] = 1
    return freq

#-----------------------------------------------------
# Inputs
#-----------------------------------------------------
file, text_key    = 'materials/data/rt_reviews/re_reviews.jsonl', 'Review'
batch_size        = 100000

#-----------------------------------------------------
# House-keeping variables
#-----------------------------------------------------
line, vocabulary  = True, set([])
reviews           = ''
line_num          = 0
n                 = utils.file_len(file)

#-----------------------------------------------------
# Reading in the data
#-----------------------------------------------------
with open(file) as read_file: 
    while line:   

        # If we have already read every line, stop
        if line_num == n:
            break
        
        # Read the line and keep count of how many we have read
        line               = read_file.readline()
        line_num          += 1
        
        # Process the line; skip it if it is not valid JSON
        try:
            processed_line     = json.loads(line)
        except json.JSONDecodeError:
            print(line)
            continue
        
        # Append to the reviews
        reviews           += ' ' + processed_line[text_key]
        

#-----------------------------------------------------        
# Some Very Simple Pre-processing
#-----------------------------------------------------
# Basic Tokenization with NLTK
vocabulary = list(nltk.word_tokenize(gensim.utils.to_unicode(reviews.lower())))       

# Counting Frequency
vocabulary = countFrequency(vocabulary)

#-----------------------------------------------------
# Save the results as a regular JSON, not json lines
#-----------------------------------------------------
dirname, filename  = '/'.join(file.split('/')[:-1]), file.split('/')[-1]       
savedir            = dirname + '/' + 'vocabulary.json'

with open(savedir, 'w') as outfile:
    json.dump(vocabulary, outfile)
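
If you'd rather measure memory and run time from inside Python than watch `top`, here is one possible sketch using the standard library's `tracemalloc` and `time` modules. The helper `measure` and the two toy functions are hypothetical names made up for this example. Note that `tracemalloc` only tracks memory allocated through Python, so its numbers will not match what `top` reports for the whole process.

```python
import time
import tracemalloc

def measure(func, *args, **kwargs):
    """Run func and return (result, seconds elapsed, peak bytes allocated
    by Python while it ran)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Toy comparison: build a full list versus summing a generator lazily.
def all_at_once(n):
    return sum([i * i for i in range(n)])

def streaming(n):
    return sum(i * i for i in range(n))

total_a, time_a, peak_a = measure(all_at_once, 200000)
total_b, time_b, peak_b = measure(streaming, 200000)
print('all at once: %.3fs, peak %d bytes' % (time_a, peak_a))
print('streaming:   %.3fs, peak %d bytes' % (time_b, peak_b))
```

To use it for this exercise, you could wrap the original implementation and your batched version in two functions and pass each one to `measure`.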




<span style="color:red"> INSERT AN INTERPRETATION OF YOUR RESULTS HERE </span>

<hr>
<h1><span style="color:red"> Self Assessment </span></h1>
Please provide an assessment of how successfully you accomplished the learning exercises in this assignment according to the instructions provided; do not assign yourself points for effort. This self assessment will be used as a starting point when I grade your assignments. Please note that if you overestimate your grade on a given learning exercise, you will face a 50% penalty on the total points granted for that exercise. If you underestimate your grade, there will be no penalty.

* Learning Exercise: 
    * <span style="color:red">X</span>/1 points