<a href="https://colab.research.google.com/github/duchaba/aud3_augmentation_data_deep_dive/blob/main/AUD3_augmentation_data_deep_dive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# OPTIONAL
# For Google Colab, (1) Open the "console", e.g. right-click and inspect, (2) Copy the scripts below (from line #10 to #17) and run them.
#
# If you know how to hack a Google Colab Jupyter notebook and run the "javascripts" below as-is,
# i.e., without the need for opening up the console, please share it with me.
#
# The Javascript highlights the code cells' input and output, and the code-cells you have executed.
#
%%js
var head = document.head || document.getElementsByTagName("head")[0];
var style = document.createElement("style");
var css = ".inputarea.code{border-left: 4px solid #20c997;}.cell.focused .inputarea.code{border-left: 4px solid #d63384;}.cell .output{border-left: 4px solid #ffc107;}";
css = css + ":root { --colab-fresh-execution-count-color: #d63384;}";
css = css + ".markdown blockquote {border-left: 10px solid #fd7e14 !important;border-radius: 10px 0 0 10px;padding: 1em 1em 1em 1em;border-bottom: 1px solid #343845}"
css = css + " h1,h2,h3,h4,h5 {font-family:serif !important;}"
css = css + "h1{color:#e83e8c !important;;} h2{color:#20c997 !important;font-size:120%;} h3{color:#fd7e14 !important;font-size:120%;} h4{color:#6610f2 !important;}"
head.appendChild(style);
style.type = "text/css";
style.appendChild(document.createTextNode(css));


<IPython.core.display.Javascript object>

# 1.0 |- Introduction | Augmentation Data Deep Dive (AUD3)

Welcome to the “Augmentation Data Deep Dive” (AUD3) project. It is another journey in the “Demystify AI” series. 

The only thing these journeys have in common is that the problems are taken from real-world Artificial Neural Network (ANN) projects. I have worked on them, coded them, cried over them, and sometimes had a flash-of-insight or an original thought about them.

The journeys are a fun and fascinating insight into the daily work of an AI Scientist and Big-Data Scientist. They are for colleagues and AI students, but I hope that you, a gentle reader, will enjoy them too. 

The logic behind data augmentation is uncomplicated. You need more pictures to increase the ANN model accuracy, and data augmentation gives you more images. 

The AUD3 is a hackable, step-by-step Jupyter Notebook. It is for learning about data augmentation and selecting the correct parameters in the ANN image-classification and image-segmentation projects. You can skip ahead to the result-cells and not read the math and the code-cells. The math is elementary, and coding is straightforward logic, so try it out. You might enjoy “hacking” along the journey. 

Data augmentation increases the number of training images by a factor of 2 to 16 or more. For ANN, that means the model can achieve better accuracy with more epochs and without over-fitting. 

For example, I was working on an AI project for a large casino hotel. The goal was to classify every customer into one of 16 categories as they walk through the door. In other words, it is not to identify that the guy walking through the door is “Duc Haba,” but to classify him in the “whale (A-1)” category, i.e., a big spender. 

As you have guessed, the first and most significant problem is the lack of labeled pictures. I need millions of tagged photos because of human diversity in race, ethnicity, and clothing, different camera angles, and so on. 

ANN is not a rule-based expert system. A rule-based system would say, for example, that a person wearing a Rolex watch is an “A-1”, or a guy with no shoes and no shirt is a “D-4” category. ANN does not use rules, so it needs millions of labeled images to train and to generalize, so that the ANN model can correctly classify a guy who enters the casino for the first time. In ANN lingo, it means the ANN model is not over-fitting. 

I classify the AUD3 as a "sandbox" project. In other words, it is a fun, experimental project focusing on solving one problem.






><center><h2><i>So if you are ready, let's take a collective calming breath …  … and begin.</i></h2></center>

# 2.0 |- The Journey




- As with the previous journey, the first step is to choose and name your dog companion. With the project code name AUD3, the natural name has to be “Wallaby.” 

- Typically a dog name is chosen, e.g., "Lefty," "Roosty," or "Eggna," but be warned, don't name it after your cat because a "cat" will not follow any commands.

- If you are serious about learning data augmentation, start hacking by changing the companion name to your preference. Jupyter notebook allows you to add notes, add new code, and hack Wallaby’s code. 

- If you are a friend tagging along, you will like Wallaby. He is a friendly, helpful dog. He will do all the tedious work in good spirits, and he likes to hop around. 

- As a good little programmer, Wallaby (or insert your companion name here) starts by creating an object or class.

![wallaby](https://github.com/duchaba/aud3_augmentation_data_deep_dive/blob/main/wallaby3.jpg?raw=true)

## 2.1 | Wallaby's "River" Coding Style

- Wallaby uses the "river" coding style.

- The style uses the full library name, then the sub-module or class, followed by the function name. Jupyter notebook has auto-complete, so Wallaby would not misspell long variable and function names. 

- Wallaby is NOT using the global-space, such as "from numpy import *", or using a shortened name like "import matplotlib.pyplot as plt.” Instead, he uses the full {river} name, as in “numpy.random.randint().”

- Furthermore, Wallaby shies away from using Python language-specific syntax shorthand, such as the “assigned if” statement construct. He likes to write the Python code using standard Python libraries, without relying on exotic packages. 

- The primary reason for using the “river” coding style coupled with a descriptive naming convention is that it is easier to read, hack, and translate to Swift or Javascript.

- For any sandbox project, Wallaby is in the exploration mode, and therefore, he will refactor and optimize the code afterward. When Wallaby uses [Atom IDE](https://atom.io) to upload the code to GitHub, he may refactor the code to make it run faster, but not for syntax compaction or syntax shorthand. 
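
As a small illustration of the "river" style described above, here is a sketch using only standard libraries. It assumes that the “assigned if” mentioned above refers to Python's one-line conditional expression; the file name and the `label` variable are made up for the example.

```python
import os
import pathlib

# "River" style: the full name flows from the library down to the function.
data_path = pathlib.Path("data")
zip_path = data_path.joinpath("faces.zip")
file_stem, file_ext = os.path.splitext(zip_path.name)

# An explicit if/else block instead of the one-line shorthand:
#   label = "archive" if file_ext == ".zip" else "other"
if file_ext == ".zip":
  label = "archive"
else:
  label = "other"

print("%40s : %s" % (str(zip_path), label))
```

The long names cost a few keystrokes, but a reader never has to guess which library a function comes from.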


## 2.2 | Wallaby Class

In [2]:
# import standard libraries
import numpy
import pathlib
import os
import pandas
import matplotlib

In [3]:
class AUD3(object):
  #
  # initialize the object
  def __init__(self, name="Wallaby"):
    self.author = "Duc Haba"
    self.name = name
    self._ph()
    self._pp("Hello from", self.__class__.__name__)
    self._pp("Code name", self.name)
    self._pp("Author is", self.author)
    self._ph()
    return
  #
  # pretty print output name-value line
  def _pp(self, a, b):
    print("%40s : %s" % (str(a), str(b)))
    return
  #
  # pretty print the header or footer lines
  def _ph(self):
    print("-" * 40, ":", "-" * 40)
    return
  #
  # dance
  def dance_happy(self):
    char = "        _=,_\n    o_/6 /#\\\n    \\__ |##/\n     ='|--\\\n       /   #'-.\n"
    char = char + "       \\#|_   _'-. /\n        |/ \\_( # |\" \n       C/ ,--___/\n"
    print(char)
    self._ph()
    return
# ---end of AUD3 class

In [4]:
# Let's start
wallaby = AUD3("Wallaby")

---------------------------------------- : ----------------------------------------
                              Hello from : AUD3
                               Code name : Wallaby
                               Author is : Duc Haba
---------------------------------------- : ----------------------------------------


In [5]:
# dance baby dance
wallaby.dance_happy()

        _=,_
    o_/6 /#\
    \__ |##/
     ='|--\
       /   #'-.
       \#|_   _'-. /
        |/ \_( # |" 
       C/ ,--___/

---------------------------------------- : ----------------------------------------


- The following is a clean version. Wallaby cleans up the trial-and-error cells, but please don't let it stop you from inserting your code-cells and notes as we make this journey together. 

- When copying the code into the Atom project, Wallaby would add the methods inside the class definition, but in a notebook, he will hack it and add new functions as needed. 

In [6]:
# Hack it!
# add_method() is copied from Michael Garod's blog, 
# https://medium.com/@mgarod/dynamically-add-a-method-to-a-class-in-python-c49204b85bd6
# AND correction by: Филя Усков
#
import functools
def add_method(cls):
  def decorator(func):
    @functools.wraps(func) 
    def wrapper(self, *args, **kwargs): 
      return func(self,*args, **kwargs)
    setattr(cls, func.__name__, wrapper)
    return func # returning func means func can still be used normally
  return decorator
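
To see what the decorator buys us, here is a minimal, self-contained sketch. The `Puppy` class and the `speak` method are hypothetical stand-ins for `AUD3` and its methods; `add_method` is repeated so the cell runs on its own.

```python
import functools

def add_method(cls):
  def decorator(func):
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
      return func(self, *args, **kwargs)
    setattr(cls, func.__name__, wrapper)
    return func
  return decorator

# "Puppy" is a hypothetical stand-in for the AUD3 class.
class Puppy(object):
  def __init__(self, name):
    self.name = name

# Create the instance BEFORE the method exists.
pup = Puppy("Lefty")

@add_method(Puppy)
def speak(self, sound="Woof"):
  return self.name + " says " + sound

# Python looks up methods on the class at call time,
# so the instance created earlier can use the new method.
print(pup.speak())
print(pup.speak(sound="Arf"))
```

This is why the notebook can create `wallaby` once near the top and keep teaching him new tricks in later cells.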

## 2.3 | Detour to Find Our Friend Monty

- Monty is like Wallaby. He is a Python class refactored in the Atom project and stored in GitHub. 

- Monty is an alpha-dog, and therefore, he follows the same methodology: hack it in a Jupyter notebook, then copy and refactor it in a Python Atom project.

- Monty is not a public GitHub project at this stage. However, Monty's code exists in many of Duc Haba's sandbox projects on GitHub.

- Monty uses the "[fast.ai](https://www.fast.ai)" library version 1.0.62.x from Jeremy Howard, Rachel Thomas, and Sylvain Gugger. The Fast.ai library uses PyTorch version 1.6.x and Python 3.6.9.

- For this journey, Monty's ability to draw 2D and 3D graphs and to clean images will be handy. These abilities come from previous journeys, the "[Demystify Python 2D Charts](https://www.linkedin.com/pulse/demystify-python-charts-hackable-step-by-step-jupyter-duc-haba/)," and the "[3D Visualization](https://www.linkedin.com/pulse/python-3d-visualization-hackable-step-by-step-jupyter-duc-haba/)" sandbox projects.

In [None]:
%%capture out_1
# load in fastai and pytorch. It is optional if you are coding on your laptop
# load in fastai at the May 1 2020 version
!pip install --upgrade git+https://github.com/duchaba/fastai.git
# !pip install --upgrade git+https://github.com/duchaba/monty_NOT_AVAILABLE

# import Monty and create a monty instance. The preference is NOT using global space
import d0hz.fastai_util 
monty = d0hz.fastai_util.base_monty()

In [9]:
# double checked
monty.print.sys_info()

---------------------------------------- : ----------------------------------------
                             System time : 2020/12/10 21:56
                                Platform : linux
                          Python version : 3.6.9 (default, Oct  8 2020, 12:12:24) 
[GCC 8.4.0]
                         PyTorch version : 1.7.0+cu101
                     Fastai version is:  : 1.0.62.dev0
                           Monty version : 0.6.0
                               CPU count : 4
                              *CPU speed : NOT available
                               RAM total : 25.51 Gb
                                RAM free : 24.46 Gb, 95.9%
                                GPU-Cuda : True
                        Disk space total : 147.15 Gb
                         Disk space free : 114.95 Gb, 78.1%
                      Current directory: : /content
             Python import packages path : Full path below...
                                       + : 
                     

## 2.4 | Fetch Images Data

- Wallaby has a companion named Monty. He will do all the dirty work that does not directly pertain to this journey. If we spend time teaching Wallaby, then it will distract from the AUD3 journey.

- Wallaby encourages you to hack the notebook, and use your image data set.

- His first task is as follows.

1. Fetch the farm animal image set.

2. Fetch the city image set.

3. Fetch the people faces image set.

4. Fetch the satellite image set.

- Wallaby randomly pulls the images from "Google" or "Bing" image-searches. He uses the Chrome extension "abc" to aid in downloading the images, and then packs them into a zip-file. 

- Wallaby claims <u>no rights</u> on these pictures. He uses them only for research purposes. 

In [57]:
import requests
import zipfile
# fetch data from url
@add_method(AUD3)
def _fetch_external_file(self, src_url, dst_path):
  ext_file = requests.get(src_url, allow_redirects=True)
  self._pp("Response Status Code " + str(ext_file.status_code), ext_file.reason)
  # the "with" block closes the file even if the write fails
  with open(dst_path, mode="wb") as local_file:
    local_file.write(ext_file.content)
  return dst_path
#
#
# unzip file
@add_method(AUD3)
def _unpack_zipfile(self, src,dst):
  with zipfile.ZipFile(src, "r") as zip_ref:
    zip_ref.extractall(path=dst)
  return
#
#
# fetch data
@add_method(AUD3)
def fetch_data(self):
  #set up
  self.img_ext_faces_url = "https://github.com/duchaba/aud3_augmentation_data_deep_dive/blob/main/scratch/faces.zip?raw=True"
  # create the "data" directory
  self.data_path = pathlib.Path("data")
  if not os.path.isdir(self.data_path):
    os.mkdir(self.data_path)
  dst = self.data_path.joinpath("faces.zip")
  # header
  self._ph()
  self._pp("Fetch", "Image data sets")
  self._pp("Destination: "+str(dst), "Source: " + self.img_ext_faces_url)
  # fetch faces
  self.img_path = self.data_path.joinpath("img")
  self._fetch_external_file(self.img_ext_faces_url,dst)
  self._unpack_zipfile(dst,self.img_path)
  self._pp("Unpack "+str(dst) + " at", self.img_path)
  # fetch cityscape
  self.img_ext_cityscape_url = "https://github.com/duchaba/aud3_augmentation_data_deep_dive/blob/main/scratch/cityscape.zip?raw=True"
  dst = self.data_path.joinpath("cityscape.zip")
  self._pp("Destination: "+str(dst), "Source: " + self.img_ext_cityscape_url)
  self._fetch_external_file(self.img_ext_cityscape_url,dst)
  self._unpack_zipfile(dst,self.img_path)
  self._pp("Unpack "+str(dst) + " at", self.img_path)
  #
  # fetch landscape
  self.img_ext_landscape_url = "https://github.com/duchaba/aud3_augmentation_data_deep_dive/blob/main/scratch/landscape.zip?raw=True"
  dst = self.data_path.joinpath("landscape.zip")
  self._pp("Destination: "+str(dst), "Source: " + self.img_ext_landscape_url)
  self._fetch_external_file(self.img_ext_landscape_url,dst)
  self._unpack_zipfile(dst,self.img_path)
  self._pp("Unpack "+str(dst) + " at", self.img_path)
  #
  self._ph()
  return 
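
The fetch cell above unpacks several archives into the same "data/img" folder. The sketch below rehearses the unpack step offline, so you can hack it without a network connection. It builds a tiny stand-in zip file in a temporary folder; the file name and the bytes inside it are made up for the example.

```python
import pathlib
import tempfile
import zipfile

def unpack_zipfile(src, dst):
  with zipfile.ZipFile(src, "r") as zip_ref:
    zip_ref.extractall(path=dst)
  return

with tempfile.TemporaryDirectory() as tmp:
  root = pathlib.Path(tmp)
  src = root.joinpath("faces.zip")
  # build a tiny stand-in archive; the name and content are made up
  with zipfile.ZipFile(src, "w") as zip_ref:
    zip_ref.writestr("faces/sample_01.jpg", b"not a real image")
  # the "img" folder does not exist yet; extractall() creates it
  dst = root.joinpath("img")
  unpack_zipfile(src, dst)
  unpacked = sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())

print(unpacked)
```

The folders stored inside the archive are kept, which is how one shared destination folder can hold several image sets side by side.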

In [58]:
# do it
wallaby.fetch_data()

---------------------------------------- : ----------------------------------------
                                   Fetch : Image data sets
             Destination: data/faces.zip : Source: https://github.com/duchaba/aud3_augmentation_data_deep_dive/blob/main/scratch/faces.zip?raw=True
                Response Status Code 200 : OK
                Unpack data/faces.zip at : data/img
         Destination: data/cityscape.zip : Source: https://github.com/duchaba/aud3_augmentation_data_deep_dive/blob/main/scratch/cityscape.zip?raw=True
                Response Status Code 200 : OK
            Unpack data/cityscape.zip at : data/img
         Destination: data/landscape.zip : Source: https://github.com/duchaba/aud3_augmentation_data_deep_dive/blob/main/scratch/landscape.zip?raw=True
                Response Status Code 200 : OK
            Unpack data/landscape.zip at : data/img
---------------------------------------- : ----------------------------------------


- That was easy-peasy-lemon-squeezy. Wallaby is a good fetching dog. Imagine if you chose a cat as your companion on this journey. First, a cat will not listen to the command "fetch," and if a miracle-of-miracles happens, a cat will most likely come back with a dead bird. It is because a cat always does what he wants to do.

In [None]:
@add_method(AUD3)
def print_readme(self):
  self.readme = pathlib.Path("data/imdb/README")
  monty.print.text_file(self.readme, max_line_display=100) # all of it
  return

In [None]:
wallaby.print_readme()

- Before diving back in, note that the IMDB data is one of the better-organized data sets. Most real-world projects are not that lucky.

- The "README" file confirmed Wallaby’s initial observations. Wallaby also learned a few more tidbits. Which is to say, if you have a "README" file, then read it first.

- There is a rating number for each review embedded in the filename. It ranges from 1 to 10, i.e., 0.5 to 5 stars. For example, rating three is 1.5 stars, nine is 4.5 stars, and ten is a five-star rating.

- It is a fantastic insight because after Wallaby completes the NLP sentiment identification model, "positive and negative," he can train the NLP model to rate the movie reviews. Wallaby has a chance to do an off-book project.

- The "unsup" is the "unsupervised" reviews, i.e., movie reviews without a label.

- When training NLP using a Convolutional Neural Network (CNN), two data-bunches are required. The first training session teaches the CNN model to predict the next word, and it includes the files in the "unsup" directory. The second training session instructs the CNN model to identify the "positive and negative" sentiment.

- The third NLP training session is Wallaby’s “product rating review” off-book idea. In other words, he will train the NLP model to predict one to five stars for movie reviews.

- In doing the next-word NLP training session, he will create the vocabulary file, and therefore, he does not need the "imdb.vocab" and the "*.feat" files.

- Most of Wallaby’s friends jump head-first into creating the data-bunch at this stage in the journey. They start the training session and fuss over the hyper-parameters.
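
The rating-to-stars mapping mentioned above is simple enough to sketch. The snippet assumes the filename looks like "<id>_<rating>.txt", for example "127_9.txt". That pattern is an assumption for illustration, so check the "README" file before relying on it. `fetch_star_rating` is a hypothetical helper, not part of Monty.

```python
import pathlib

def fetch_star_rating(file_name):
  # ASSUMPTION: the stem ends with "_<rating>", e.g. "127_9.txt" -> rating 9
  stem = pathlib.PurePath(file_name).stem
  rating = int(stem.rsplit("_", 1)[1])
  # a 1-to-10 rating maps to half as many stars
  return rating, rating / 2.0

for name in ["12_3.txt", "127_9.txt", "4_10.txt"]:
  rating, stars = fetch_star_rating(name)
  print("%40s : rating %d = %.1f stars" % (name, rating, stars))
```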



In [None]:
wallaby._ph()
wallaby._pp("Wallaby", "Slowing down and drawing graphs.")
wallaby._ph()

## 2.5 | Clean the Input-data

- Before coding and tokenizing the data-bunch, Wallaby will clean up the data.

1. Move all unnecessary files and directories into a “scratch” directory. In other words, keep only the “*.txt” files.

2. Remove all HTML-tags or unprintable characters from the movie reviews.

3. Remove any movie reviews that have more than 1,100 words. It could be contentious, but from the graphs, there are about 50 reviews in each category that were too long, and we have 25,000 files in the “train” and “test” directories and 50,000 in the “unsup” directory. Furthermore, the average word count is 236, so chop away.

- Normally, Wallaby uses the 80-20 rule, i.e., 80% of data for training and 20% for validation. We have 50% of data for the “test,” and that is too much. Wallaby would like to split 75-15-10, i.e., 75% for train-set, 15% for validation-set, and 10% for test-set. There is a total of 50,000 labeled movie reviews.

- Before you cry foul, you are not allowed to change the “test” set in a competition, so by doing this, the competition might disqualify Wallaby, but who could be mean to a happy, tail-waggy dog? You can’t say “no” to Wallaby.

- The AUD3 journey is about demystifying NLP data and not so much about Kaggle's competition. Therefore, we will follow Wallaby’s 75-15-10 suggestion. 

- It is a logical split, and it does not violate the rule. The train-set has a separate validation-set, and the test-set is unused in the training session.


In [None]:
# clean files
@add_method(AUD3)
def clean_files(self):
  # create scratch dir
  self.scratch_dir = pathlib.Path("data/scratch")
  if not os.path.isdir(self.scratch_dir):
    os.mkdir(self.scratch_dir)
  # move the tmp dirs; no leading "/" in the name, or pathlib would drop scratch_dir
  src = pathlib.Path("data/imdb/tmp_lm")
  dst = pathlib.Path(self.scratch_dir, "tmp_lm")
  os.rename(src,dst)
  src = pathlib.Path("data/imdb/tmp_clas")
  dst = pathlib.Path(self.scratch_dir, "tmp_clas")
  os.rename(src,dst)
  # ask Monty to move non-*.txt files to scratch
  src = pathlib.Path("data/imdb")
  dst = self.scratch_dir
  monty.clean.unwanted_files(src, ".txt", scratch_dir=dst, is_inversed=True)
  # ask Monty to clean html-tag
  monty.clean.html_tags_all(src)
  return
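
The destination paths in `clean_files` are built by passing two parts to `pathlib.Path`. One detail of that constructor is worth a quick hack: a part that starts with "/" is treated as absolute, and everything before it is thrown away. The sketch uses `PurePosixPath` so that it prints the same result on any operating system.

```python
import pathlib

scratch_dir = pathlib.PurePosixPath("data/scratch")

# a leading "/" makes the second part absolute, so the first part is discarded
absolute_join = pathlib.PurePosixPath(scratch_dir, "/tmp_lm")
# without the leading "/", the parts are joined as expected
relative_join = pathlib.PurePosixPath(scratch_dir, "tmp_lm")

print("%40s : %s" % ("With leading slash", absolute_join))
print("%40s : %s" % ("Without leading slash", relative_join))
```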

In [None]:
wallaby._ph()
wallaby._pp("Clean files", "Remove non-text files, remove HTML-tag and non-printable char.")
wallaby.clean_files()

In [None]:
# clean long files
@add_method(AUD3)
def clean_long_files(self,max_wc=1100):
  src = pathlib.Path("data/imdb")
  i = 0
  for root, dirs, files in os.walk(src):  # @UnusedVariable
    for name in files:
      a = pathlib.Path(root, name)
      wc = self._fetch_words_in_file(a)
      if (wc > max_wc):
        dst = pathlib.Path(self.scratch_dir, name)
        os.rename(a,dst)
        i = i + 1
  self._ph()
  self._pp("Clean files", "Remove movie reviews longer than " + str(max_wc) + " words.")
  self._pp("Total moved file count", i)
  self._ph()
  return
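
The cell above calls `_fetch_words_in_file`, which is not shown in this notebook. Below is a minimal sketch of what such a helper might look like, assuming a "word" is anything separated by whitespace. It is a hypothetical stand-in, written as a plain function, and not Wallaby's actual method.

```python
import pathlib
import tempfile

# HYPOTHETICAL helper: the notebook's own _fetch_words_in_file is not shown
def fetch_words_in_file(file_path):
  with open(file_path, mode="r", encoding="utf-8") as text_file:
    text = text_file.read()
  # str.split() with no argument splits on any run of whitespace
  return len(text.split())

with tempfile.TemporaryDirectory() as tmp:
  # a made-up movie review for the demo
  review = pathlib.Path(tmp).joinpath("0_9.txt")
  review.write_text("A fun movie.\nWallaby would watch it again.", encoding="utf-8")
  word_count = fetch_words_in_file(review)

print("%40s : %s" % ("Word count", word_count))
```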

In [None]:
# do it
wallaby.clean_long_files()

In [None]:
# clean 75-15-10
@add_method(AUD3)
def clean_75_15_10(self):
  # set up the "valid" directory and its "pos" and "neg" sub-directories
  self.valid_dir = pathlib.Path("data/imdb/valid")
  # define pos and neg outside the "if" so they exist when the cell is re-run
  pos = self.valid_dir.joinpath("pos")
  neg = self.valid_dir.joinpath("neg")
  if not os.path.isdir(self.valid_dir):
    os.mkdir(self.valid_dir)
    os.mkdir(pos)
    os.mkdir(neg)
  self.train_dir = pathlib.Path("data/imdb/train")
  self.test_dir = pathlib.Path("data/imdb/test")
  # ask Monty to do the heavy lifting
  # valid 15% total or 30% files in "test" dir
  monty.clean.split_files_by_perc(self.test_dir.joinpath("pos"),pos, perc=0.30)
  monty.clean.split_files_by_perc(self.test_dir.joinpath("neg"),neg, perc=0.30)
  # test is 10% total or 20% files in "test" dir. 
  # quick math shows that we need to move 0.7143 of the files remaining in "test" to train
  monty.clean.split_files_by_perc(self.test_dir.joinpath("pos"), self.train_dir.joinpath("pos"), perc=0.7143, head="t")
  monty.clean.split_files_by_perc(self.test_dir.joinpath("neg"), self.train_dir.joinpath("neg"), perc=0.7143, head="t")
  return
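
The "quick math" behind the two percentages in the cell above can be checked in a few lines. The sketch assumes the round numbers quoted in the text, i.e., 50,000 labeled reviews split evenly between "train" and "test," and it ignores the handful of long reviews that were moved to scratch.

```python
total = 50000
train_count = 25000
test_count = 25000

# targets for the 75-15-10 split (integer math avoids rounding surprises)
valid_target = total * 15 // 100
test_target = total * 10 // 100

# step 1: move files from "test" to "valid"
perc_valid = valid_target / test_count
remaining = test_count - valid_target

# step 2: move files from what remains in "test" to "train"
move_to_train = remaining - test_target
perc_train = move_to_train / remaining
final_train = train_count + move_to_train

print("%40s : %s" % ("Share of test moved to valid", perc_valid))
print("%40s : %s" % ("Share of remaining moved to train", round(perc_train, 4)))
print("%40s : %s" % ("Final train, valid, test", (final_train, valid_target, test_target)))
```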

In [None]:
wallaby._ph()
wallaby._pp("Clean files", "The 75-15-10 ratio")
wallaby.clean_75_15_10()

In [None]:
# double checked
monty.print.dir_tree("data", max_file_display=2)

- Wallaby loves to draw, so humor him and ask him to draw a few bar charts.

In [None]:
# count files
@add_method(AUD3)
def _fetch_count_files(self, src_dir):
  i = 0
  for root, dirs, files in os.walk(src_dir):  # @UnusedVariable
      for name in files:
        i += 1
  return i
#
#
# draw the train, test, valid data set
@add_method(AUD3)
def draw_data_set(self):
  # set up
  mx_data = numpy.ones((3,2))
  mx_data[:,0] = numpy.arange(1,4)
  mx_data[0,1] = self._fetch_count_files(self.train_dir)
  mx_data[1,1] = self._fetch_count_files(self.valid_dir)
  mx_data[2,1] = self._fetch_count_files(self.test_dir)
  # draw bar chart
  frame, pic = monty.fetch.graph_canvas()
  monty.draw.graph_bar(pic,mx_data, is_hand=True,color=monty.bag.color.pink)
  monty.draw.graph_label(pic,xlabel="Train, Validate, and Test Data Set", ylabel="File Count", head="The 75-15-10 Split")
  frame.show()
  return

In [None]:
# Wallaby, draw away
wallaby.draw_data_set()

## 2.6 | The Fast.ai Standard Six Steps for Creating the "Next Word Prediction" Data-bunch

- The hand-drawn “75-15-10 split” bar chart is beautiful. 

- Moving ahead, Wallaby creates the data-bunch for "next word prediction" and "sentiment prediction."

- The process is that Wallaby does one step and then inspects the result. The process is the same for NLP as for image classification. The six steps are as follows.

1. Read (or input) the data and inspect it.

2. Split the “train” and “valid” set and inspect it.

3. Label the data and inspect it. For NLP, this step includes the tokenizer.

4. Add data augmentation and inspect it. Wallaby will skip this step for NLP.

5. Fetch the batch-size, the data-bunch, and inspect it.

6. Normalize the input-data for the selected base-architecture. Wallaby will skip this step for NLP.

  - Wallaby relies on the Fast.ai library and Monty to do the heavy lifting. If we dive deep into writing the code from scratch, it will distract from the “AUD3” journey. Moreover, Wallaby thinks the Fast.ai library is the best artificial neural network (ANN) library, far superior to TensorFlow or Keras.

  - Incidentally, Wallaby stumbled on a salient factor for using the Jupyter notebook. If you disagree, you can “hack it.”

  - If you think TensorFlow or any other ANN library is better, then you must hack this notebook. Wallaby and I welcome and encourage you to hack the notebook.

  - **If you can’t hack-it, you won’t make-it. :-)**

### 2.6.1 Read (or input) the data and inspect it.

In [None]:
import fastai
import fastai.text
#
@add_method(AUD3)
def fetch_1of6_read_data(self):
  self.data_path = pathlib.Path("data/imdb")
  self._nlp_1of6 = fastai.text.TextList.from_folder(self.data_path)
  monty.print.inspect_fastai_textlist(self._nlp_1of6)
  return

In [None]:
# do it
wallaby.fetch_1of6_read_data()

- That looks too easy. The folks from Fast.ai deserve big applause to “make AI cool again.”

- Wallaby read in almost one hundred thousand movie reviews, 99,944, which is correct because he decided to remove the "too long" reviews. Also, the movie reviews are clear of HTML tags and non-printing characters.  

### 2.6.2 Split the “train” and “valid” set and inspect it.

In [None]:
# split 85/15
@add_method(AUD3)
def fetch_2of6_split_data(self):
  self._nlp_2of6 = self._nlp_1of6.split_by_rand_pct(0.15)
  monty.print.inspect_fastai_itemlists(self._nlp_2of6)
  return

In [None]:
# do it
wallaby.fetch_2of6_split_data()

- So why does Wallaby NOT split by the “train” and “valid” directories? It is because during the first training session, he is trying to predict the next word, and he uses the “unsupervised” data set. In the second data-bunch, Wallaby will split by the “train” and “valid” directories.

- Once again, the code looks too easy. It is because the Fast.ai libraries do extensive work under the hood.


### 2.6.3 Label and tokenize the data and inspect it.

In [None]:
# label for language model
@add_method(AUD3)
def fetch_3of6_label_data(self):
  self._nlp_3of6 = self._nlp_2of6.label_for_lm()
  monty.print.inspect_fastai_labellists(self._nlp_3of6 )
  return


In [None]:
# do it
wallaby.fetch_3of6_label_data()

- Wallaby gives too much information to digest, so he will take it one section at a time.

1. In the output "Section #1" above, the 85-15 split between “train” (84,953 files) and “valid” (14,991 files) is good.

2. In "Section #2", the movie reviews are converted to tokens.

3. At first glance, the first 20 tokens of a file are correct.

4. In "Section #3", the total token count is 163,207. That is too many tokens because, from the IMDB vocab file, there should be 89,527 tokens. It is almost twice as many tokens. What’s going on?

5. Wallaby shows the first 20 tokens. The “xx-” tokens are special tokens.

6. In "Section #4", Wallaby shows the last 20 tokens, which are marked as “zero” or “xxunk,” meaning unknown. They are misspelled words or made-up words, but we have to dive deep into the unused words.

7. In "Section #5 and #6", Fast.ai defaults to 60,000 tokens. That means the system discards 103,207 words, which throws out 63% of the unique words found in the movie reviews. It’s not good. Wallaby should rework the data or deep dive into the Fast.ai library to fix it.

8. There is no easy way to do this or any fancy graphs that would shorten the time. Wallaby has to dig deep. It is good that Wallaby is a dog and not a cat. :-)
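
The numbers in the list above are easy to double-check. The sketch below only redoes the arithmetic with the counts quoted in the text; nothing is read from the Fast.ai objects.

```python
# counts quoted in the sections above
total_tokens = 163207   # unique tokens found by the tokenizer
max_vocab = 60000       # Fast.ai default limit
imdb_vocab = 89527      # size of the IMDB vocab file

discarded = total_tokens - max_vocab
discarded_share = discarded / total_tokens
ratio_to_imdb_vocab = total_tokens / imdb_vocab

print("%40s : %s" % ("Discarded tokens", discarded))
print("%40s : %.1f%%" % ("Discarded share", discarded_share * 100))
print("%40s : %.2f" % ("Tokens vs. IMDB vocab file", ratio_to_imdb_vocab))
```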

### 2.6.3B -- Dig Deep in Tokenizer

In [None]:
import textwrap
# print out random 200 unused words, max_limit (60,000) is from fast.ai
@add_method(AUD3)
def print_random_unused_words(self, wc=200, max_limit=60000, is_reversed=False):
  k = list(self._nlp_3of6.vocab.stoi.keys())
  vocab_size = len(k)  # number of words in the vocabulary lookup
  self._ph()
  if is_reversed:
    i = numpy.random.randint(1, high=(max_limit-wc))
    self._pp("Random Good Words, Tokens, count", wc)
  else:
    i = numpy.random.randint(max_limit, high=(vocab_size-wc))
    self._pp("Random Unknown, Unused Words, count", wc)
  self._pp("Index", i)
  j = i + wc
  self._ph()
  words = str(k[i:j])
  print(textwrap.fill(words,width=80))
  self._ph()
  return

In [None]:
# print random unknown, unused words; run it a dozen times or more.
wallaby.print_random_unused_words()

In [None]:
#print random valid token
wallaby.print_random_unused_words(is_reversed=True)

- After running the “print_random_unused_words” code-cell a dozen times or more, Wallaby has a few recommendations to improve the tokenizer process.

1. There are many combined words with a “period without space,” such as “somewhat.like,” “like.that,” “floor.what,” “drunk.and,” “score.well.defies,” “as.well.something,” or “convincing.all.”

2. Similar to the above, combined words with a “comma without space” or a “dash without space” are also a problem.

3. By purposely discarding the misspelled words, the system is biased against reviewers who can’t spell. In other words, if a user is a lazy and lousy speller, e.g., writes a ten-word review with five misspelled words, the NLP model will not be able to predict it correctly.

4. No one worries about biases in a Kaggle competition, but one should speak out about the intentional biases in a real-world project.

5. For example, if the NLP is used to monitor and recommend your newsfeed and advertising-feed and doesn’t correct your spelling, the system will classify you with a false persona.

6. Wallaby is a dog, but he loves to read science fiction stories. In his story, a government drone uses NLP to grant entrance to the castle. It’s a very dull and desolate castle because the drone gave access only to privileged lawyers and English-major students. :-)

7. Wallaby can choose to include misspelled words or made-up words. The NLP model does what he tells it to do. In other words, the NLP model can be used to write like a Democrat and not a Republican, write like a Nobel laureate, or, at the opposite end, write like Batman's Joker. It could be propaganda or fake news, but the salient point is that the decision about which words to tokenize or discard contributes substantially to the NLP model, and it can adversely affect the accuracy with intentional biases.

8. Furthermore, if you read the NLP’s valid words, some are spelled incorrectly. The algorithm is: if the term is used twice and the maximum buffer has not been exceeded, then count the word as valid.

9. Wallaby read the valid tokens by setting the parameter “is_reversed=True.” There are many misspelled words and made-up words. In other words, if a couple of people write “Kastle” instead of “Castle,” and “Kastle” appears earlier in the tokenizer process, then “Kastle” is valid. There is no grammar checking in the tokenizer.
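
Two of the observations above can be tried out on a toy example. The sketch below is a simplified stand-in and not the Fast.ai implementation. `split_glued_words` and `build_vocab` are hypothetical helpers, and the three reviews are made up. The `max_vocab=60000` and `min_freq=2` defaults mirror the "60,000 tokens" and "used twice" rules described in the text.

```python
import collections
import re

def split_glued_words(text):
  # insert a space after a period or comma squeezed between two letters
  return re.sub(r"(?<=[A-Za-z])([.,])(?=[A-Za-z])", r"\1 ", text)

def build_vocab(reviews, max_vocab=60000, min_freq=2):
  counts = collections.Counter()
  for review in reviews:
    counts.update(review.lower().split())
  # most_common() sorts by frequency, so rare words fall off first
  return [word for word, count in counts.most_common(max_vocab) if count >= min_freq]

# made-up reviews for the demo
reviews = [
  "the kastle was convincing.all of it",
  "a dull kastle and a dull plot",
  "the plot was somewhat.like a dream",
]

print(split_glued_words(reviews[0]))
print(build_vocab(reviews))
```

The misspelled “kastle” survives because two reviewers used it, while the glued “convincing.all” is discarded as a rare word. The regular expression is a blunt tool: it would also split abbreviations written without spaces, and it leaves dashes alone because hyphenated words are often legitimate.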

### 2.6.3C -- Tokenizer Graphs, an Original Thought

- There is too much data to ingest. Wallaby is a dog. He can’t count past five, and I can’t do much better. Therefore, we draw graphs. 

- There are 100,000 files, and each file has, on average, 200 words, so that is 20 million tokens. I have read extensively in NLP books and blogs, and nowhere did I see illustrated graphs for the NLP data set. 

- <h2>Why not?</h2>

- Wallaby and I do not shy away from original thoughts, so we drew graph after graph. We used Monty’s ability to draw 2D diagrams and 3D charts, and Wallaby found what he was hoping to find, an original thought. Below is the cleaned-up code-cell version. 


In [None]:
# return, tokens-count, unique token-count, unknown token-count
@add_method(AUD3)
def _fetch_token_info(self, token_arr):
  i = len(token_arr)
  j = len(numpy.unique(token_arr))
  k = i - numpy.count_nonzero(token_arr)
  return i, j, k
#
#
#
@add_method(AUD3)
def fetch_token_graph_data(self):
  # set up
  t = self._nlp_3of6.train.x.items
  self.train_count = len(t)
  self.train_graph_data = numpy.ones((self.train_count,3))
  v = self._nlp_3of6.valid.x.items
  self.valid_count = len(v)
  self.valid_graph_data = numpy.ones((self.valid_count,3))
  for i in range(self.train_count):
    a,b,c = self._fetch_token_info(t[i])
    self.train_graph_data[i,:] = [a, b, c]
  #
  for i in range(self.valid_count):
    a,b,c = self._fetch_token_info(v[i])
    self.valid_graph_data[i,:] = [a, b, c]
  #
  return
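
To see what the three numbers per review mean, here is a toy run of the same counting logic on one made-up review. The token ids are invented for the demo, and, as in the sections above, id zero stands for the unknown token.

```python
import numpy

def fetch_token_info(token_arr):
  token_count = len(token_arr)
  unique_count = len(numpy.unique(token_arr))
  # id 0 is the "unknown" token, so every zero is a discarded word
  unknown_count = token_count - numpy.count_nonzero(token_arr)
  return token_count, unique_count, unknown_count

# made-up token ids for one short review
review_tokens = numpy.array([2, 5, 9, 0, 9, 12, 0, 5, 9])
token_count, unique_count, unknown_count = fetch_token_info(review_tokens)

print("%40s : %s" % ("Token count", token_count))
print("%40s : %s" % ("Unique token count", unique_count))
print("%40s : %s" % ("Unknown token count", unknown_count))
```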


In [None]:
# do it
wallaby.fetch_token_graph_data()

In [None]:
# draw the sorted token, unique-token, and discarded-token counts for the train and valid sets
@add_method(AUD3)
def draw_tokenizer(self):
  # set up
  mx_train_token = numpy.ones((self.train_count,2))
  mx_train_unique = numpy.ones((self.train_count,2))
  mx_train_discard = numpy.ones((self.train_count,2))
  #
  mx_train_token[:,0] = numpy.arange(0, self.train_count)
  mx_train_token[:,1] = self.train_graph_data[:,0]
  mx_train_token[:,1].sort()
  #
  mx_train_unique[:,0] = numpy.arange(0, self.train_count)
  mx_train_unique[:,1] = self.train_graph_data[:,1]
  mx_train_unique[:,1].sort()
  #
  mx_train_discard[:,0] = numpy.arange(0, self.train_count)
  mx_train_discard[:,1] = self.train_graph_data[:,2]
  mx_train_discard[:,1].sort()
  #
  #
  mx_valid_token = numpy.ones((self.valid_count,2))
  mx_valid_unique = numpy.ones((self.valid_count,2))
  mx_valid_discard = numpy.ones((self.valid_count,2))
  #
  mx_valid_token[:,0] = numpy.arange(0, self.valid_count)
  mx_valid_token[:,1] = self.valid_graph_data[:,0]
  mx_valid_token[:,1].sort()
  #
  mx_valid_unique[:,0] = numpy.arange(0, self.valid_count)
  mx_valid_unique[:,1] = self.valid_graph_data[:,1]
  mx_valid_unique[:,1].sort()
  #
  mx_valid_discard[:,0] = numpy.arange(0, self.valid_count)
  mx_valid_discard[:,1] = self.valid_graph_data[:,2]
  mx_valid_discard[:,1].sort()
  #
  # draw area/line graph
  frame, pic = monty.fetch.graph_canvas(row=1,col=2,size=(18,9))
  monty.draw.graph_line(pic[0], mx_train_token,is_shade_area=True, shade_alpha=0.5, shade_color=monty.bag.color.teal)
  monty.draw.graph_line(pic[0], mx_train_unique,is_shade_area=True, shade_alpha=0.5, shade_color=monty.bag.color.yellow)
  monty.draw.graph_line(pic[0], mx_train_discard,is_shade_area=True, shade_alpha=0.5, shade_color=monty.bag.color.red)
  #
  monty.draw.graph_line(pic[1], mx_valid_token,is_shade_area=True, shade_alpha=0.5, shade_color=monty.bag.color.teal)
  monty.draw.graph_line(pic[1], mx_valid_unique,is_shade_area=True, shade_alpha=0.5, shade_color=monty.bag.color.yellow)
  monty.draw.graph_line(pic[1], mx_valid_discard,is_shade_area=True, shade_alpha=0.5, shade_color=monty.bag.color.red)
  # label them
  xlab = "Movie Reviews Sorted By Token Count"
  ylab = "Token Count (Teal), Unique (Yellow/Green), Discard (Red)"
  monty.draw.graph_label(pic[0],xlabel=xlab, ylabel=ylab, head="Tokenizer, Training Set")
  monty.draw.graph_label(pic[1],xlabel=xlab, ylabel=ylab, head="Tokenizer, Validation Set")
  frame.show()
  return

In [None]:
# do it
Wallaby.draw_tokenizer()

- They are beautiful graphs. At a glance, we verify the tokenizer is working magnificently, and the people at the IMDB curated a perfect set of NLP data. 

- When you are working on your NLP project, the NLP data should look like these two graphs. It’s so well normalized. Wallaby is ready to re-use this graphing method for all of her NLP projects. 

- Wallaby was worried about the “63% discarded tokens,” and that the token count (163,207) is almost twice as many as in the IMDB’s vocab file (89,527). 

- The graphs show that the “discarded tokens” per file are tiny. They barely register on the chart. Therefore, it does not matter how many total tokens we discarded, as long as the number of discarded tokens per file is low. 

- We will ask Wallaby to graph the average per file count to verify the point above. 

In [None]:
# draw tokenizer average
@add_method(AUD3)
def draw_tokenizer_average(self):
  # set up
  mx_data = numpy.ones((3,2))
  mx_data[:,0] = numpy.arange(1,4)
  i = (self.train_graph_data[:,0].mean() + self.valid_graph_data[:,0].mean()) / 2
  mx_data[0,1] = i
  #
  j = (self.train_graph_data[:,1].mean() + self.valid_graph_data[:,1].mean()) / 2
  mx_data[1,1] = j
  #
  k = (self.train_graph_data[:,2].mean() + self.valid_graph_data[:,2].mean()) / 2
  mx_data[2,1] = k
  # draw bar chart
  frame, pic = monty.fetch.graph_canvas()
  monty.draw.graph_bar(pic,mx_data, is_hand=True,color=monty.bag.color.orange)
  xlab = "Token (" + str(round(i,2)) + "), Unique (" + str(round(j,2)) + "), Discarded (" + str(round(k,2)) + ")"
  monty.draw.graph_label(pic,xlabel=xlab, ylabel="Token Count", head="Tokenizer, Average Token Per File")
  frame.show()
  return

In [None]:
# do it
Wallaby.draw_tokenizer_average()

- The “Tokenizer Average” graph proves it. The tokenizer is working correctly, and the default 60,000 maximum token buffer is sufficient. 

- Wallaby doesn't need to increase the maximum token buffer or raise the required word frequency from twice to three times before accepting a word as a valid token. 

- On average, for a movie review of 297 words, the system only discarded 1.64 words. That is 0.55% per file. 

- At first look, 63% of the tokens are discarded. Wallaby thought that the system dropped every other word in a movie review, but in actuality the system only dropped one or two words.

- The “Tokenizer graphs” count as an “original thought.” Therefore, Wallaby wins 1,000 gold coins. 


><center><h2><i>Wallaby has an original thought, Yippy!</i></h2></center>
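
Why can a large share of the vocabulary be discarded while each review loses only a word or two? A minimal sketch on a made-up corpus shows the effect. The function `discard_rates` and its parameter names `min_freq` and `max_vocab` are hypothetical names for this illustration; they stand in for the "seen at least twice" rule and the 60,000 maximum token buffer mentioned above, and they are not taken from the Fast.ai source.

```python
from collections import Counter

def discard_rates(reviews, min_freq=2, max_vocab=60000):
    """Return (share of unique words discarded, share of all tokens discarded)."""
    counts = Counter(word for review in reviews for word in review)
    # keep the most frequent words that appear at least min_freq times
    kept = {w for w, c in counts.most_common(max_vocab) if c >= min_freq}
    unique_discarded = 1 - len(kept) / len(counts)
    total_tokens = sum(counts.values())
    dropped_tokens = sum(c for w, c in counts.items() if w not in kept)
    return unique_discarded, dropped_tokens / total_tokens

# made-up corpus: common words repeat, misspellings show up only once
reviews = [
    "the movie was great and the cast was great".split(),
    "the movie was dull and the kastle set was dull".split(),
    "the plot was great and the endng was dull".split(),
]
vocab_rate, token_rate = discard_rates(reviews)
print(f"unique words discarded: {vocab_rate:.1%}")
print(f"tokens discarded:       {token_rate:.1%}")

# the notebook's own averages: 1.64 discarded out of 297 tokens per review
per_file_rate = 1.64 / 297
print(f"per-file discard rate:  {per_file_rate:.2%}")
```

Rare words make up a big slice of the unique-word list but a small slice of the running text, so the first rate is always the larger one. The real data set has a far longer tail of one-off words than this toy corpus, which is why the gap there is so much wider.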

### 2.6.3D -- Normalize Token Words

- Moving forward, we will ask Wallaby to do the following tasks.

1. Teach Wallaby to add spaces before and after “period, comma, and dash.”

2. We could hack the Fast.ai libraries, but that would take a bit longer. Wallaby, with her tail wagging, is eager to help, so Wallaby, go for it.

3. Re-run the first three steps.

In [None]:
import re
#
# 
# clean glob words: insert a space after ".", ",", or "-" when the next character is not whitespace
@add_method(AUD3)
def _clean_glob_line(self, line):
  return re.sub(r'(?<=[.,-])(?=[^\s])', r' ', line)
#
#
# clean glob in file: rewrite the file line by line through _clean_function
@add_method(AUD3)
def _clean_glob_file(self, original_file, _clean_function):
  src = "_tmp1__.txt"
  is_ok = True
  # the "with" block closes both files on exit, so no explicit close() is needed
  with open(original_file) as old, open(src, 'w') as new:
    for line in old:
      a = _clean_function(line)
      new.write(a)
  os.rename(src, original_file)
  return is_ok
#
# -------------------------------------- + ----------------------------------------
#
@add_method(AUD3)
def clean_glob_dir(self, src_dir,_clean_function):
  is_ok = True
  i = 0
  self._ph()
  if (os.path.isdir(src_dir)):
    for root, dirs, files in os.walk(src_dir):  # @UnusedVariable
      for name in files:
        a = pathlib.Path(root, name)
        self._clean_glob_file(a, _clean_function)
        i = i + 1
    self._pp("Total files count", i)
  else:
    self._pp("**Error not a directory", src_dir)
    is_ok = False
  self._ph()
  return is_ok
#


In [None]:
# do it
Wallaby.clean_glob_dir(pathlib.Path("data/imdb"), Wallaby._clean_glob_line)
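
To see what the cleaning pattern does to a single line, here is a small sketch on a made-up string. The first function reuses the pattern from `_clean_glob_line`, which inserts a space only after the punctuation. The second function, `pad_both_sides`, is a hypothetical variant (not part of the notebook) that puts a space on both sides, for anyone who wants the "before and after" behavior described in the task list.

```python
import re

def space_after(line):
    # the notebook's pattern: add a space after ".", ",", or "-"
    # when the next character is not whitespace
    return re.sub(r'(?<=[.,-])(?=[^\s])', ' ', line)

def pad_both_sides(line):
    # hypothetical variant: exactly one space on each side of ".", ",", or "-"
    # ([ \t]* instead of \s* so the end-of-line newline is left alone)
    return re.sub(r'[ \t]*([.,-])[ \t]*', r' \1 ', line)

sample = "great.But it's well-known,really"
after_only = space_after(sample)
both_sides = pad_both_sides(sample)
print(after_only)
print(both_sides)
```

Either way, “great.But” stops being one odd token. Be aware that both patterns also split decimals and hyphenated words, so “3.5” and “well-known” each become separate tokens.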

- Wallaby, good job!

- Re-run the first 3 steps of the data-bunch.

### 2.6.1 (Rerun) Read the data and inspect it.

In [None]:
# do it
Wallaby.fetch_1of6_read_data()

- It is the same result as the first time. The good news is that Wallaby has not broken anything.

### 2.6.2 (Rerun) Split the “train” and “valid” set and inspect it.

In [None]:
# do it
Wallaby.fetch_2of6_split_data()

- It is the same result as the first time. Good job, Wallaby.

### 2.6.3 (Rerun) Label and tokenize the data and inspect it.

In [None]:
# do it
Wallaby.fetch_3of6_label_data()

In [None]:
# print random unknown, unused words; run it a dozen times or more.
Wallaby.print_random_unused_words()

In [None]:
# print random valid (used) words; run it a dozen times or more.
Wallaby.print_random_unused_words(is_reversed=True)

- From the above results, in section #3, the total tokens were reduced to 156,091. That is a reduction of 7,116 tokens. Good job, Wallaby!

- In section #4, the total discarded token percentage dropped from 63% to 62%. It’s a small step in the right direction, but as the “Tokenizer” graphs show, the discarded tokens per file carry more weight.


In [None]:
# do it
Wallaby.fetch_token_graph_data()
Wallaby.draw_tokenizer()

In [None]:
# do it
Wallaby.draw_tokenizer_average()

- It is amazing that with over 20 million data points, we can verify results at a glance. We didn’t break the tokenizer, and we improved it by the tiniest of margins. The average discarded tokens per file dropped from 1.64 tokens to 1.63 tokens.

### 2.6.4 Add data augmentation and inspect it. 

- With the image classifier or image segmentation ANN model, Wallaby uses data augmentation with a notable effect. By flipping, skewing, and warping the images, she increases the training data set by a factor of four to twelve. 

- So why can’t Wallaby augment written words? In English, you can’t flip characters or words. Skewing and warping do not apply to words. 

- Wallaby doesn't know how, and she has not read any book or article about word-augmentation. She doesn't even know if it would be possible. She needs to consult a linguist. 

- Even if everyone said, “NO, it can’t be done,” Wallaby would not stop exploring “why” because it would be hubris to think that “if I can’t do it, no one else can.”

- Wallaby could smell another adventure for the next “sandbox” project. 

### 2.6.5 Fetch the batch-size, the data-bunch, and inspect it.

- Wallaby needs a better method of calculating batch-size. For now, Monty bases it only on the available free GPU RAM. It should be a straightforward math equation of free GPU RAM, input data size, and the number of layers in the base architecture. 

- Wallaby did a quick online search, and she did not find any article or blog about a formal batch-size calculation. It would make a fun next “sandbox” project. 
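
Until that “sandbox” project happens, the idea can be sketched as a rough heuristic. To be clear, this is not Monty's formula and not a formal method: the function `estimate_batch_size`, the `safety` margin, and the “bytes per sample” figure are all assumptions for illustration. In practice, the memory one sample needs (which is where input size and the number of layers come in) would have to be measured on the actual model.

```python
def estimate_batch_size(free_gpu_bytes, bytes_per_sample, safety=0.8, multiple_of=8):
    """Hypothetical heuristic: how many samples fit in the usable GPU RAM."""
    # leave some headroom for the framework and memory fragmentation
    usable = free_gpu_bytes * safety
    raw = int(usable // bytes_per_sample)
    # round down to a multiple of 8, but never go below one multiple
    return max(multiple_of, (raw // multiple_of) * multiple_of)

# made-up figures: 12 GiB free, 150 MiB per sample (activations and gradients)
free_gpu_bytes = 12 * 1024**3
bytes_per_sample = 150 * 1024**2
batch_size = estimate_batch_size(free_gpu_bytes, bytes_per_sample)
print(batch_size)
```

The per-sample figure is the hard part. It grows with the sequence length and with the depth of the base architecture, which is exactly why a simple free-RAM rule is only a starting point.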



In [None]:
# GPU and RAM info
monty.print.gpu_info()

In [None]:
# 
# data bunch
@add_method(AUD3)
def fetch_5of6_databunch(self):
  self.batch_size = monty.fetch.batch_size()
  self._nlp_5of6 = self._nlp_3of6.databunch(bs=self.batch_size)
  monty.print.inspect_fastai_data_bunch(self._nlp_5of6)
  return

In [None]:
# do it
Wallaby.fetch_5of6_databunch()

- Once again, the Fast.ai library does all the work. Inspecting the data-bunch, section #1 shows that the train data-set is correct. Section #2 shows the valid data-set is accurate, and section #3 shows the batch-size is 48, and we are using the GPU-Cuda interface. 

- Section #4 shows the tokenizer is working reliably.

- To double-check the data-bunch, we use Fast.ai’s show_batch() method. 

In [None]:
Wallaby._nlp_5of6.show_batch()

### 2.6.6 Normalize the input-data for the selected base-architecture.

- Similar to the “data augmentation” discussion, Wallaby routinely tags the normalization function onto the data-bunch. For an image classifier project that uses the “resnet-34 or resnet-50” base architecture, she uses the “imagenet_stats” as an input to the normalization function. 

- The data-bunch normalization function is not the same as “NLP Normalization.” Some literature refers to “NLP Normalization” as transforming the words to a standard format, such as converting all words to lowercase.

- One more time, “Why not?”

- That makes four detours in the “AUD3” journey that Wallaby didn’t take. She will need a pack of friends for upcoming adventures.

- That concludes the “next word” data-bunch. Wallaby will move forward to define the “sentiment prediction” data-bunch.
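
The two meanings of “normalization” in this section can be put side by side in a minimal sketch. The function names are hypothetical, and the code only illustrates the two ideas; it is not how the Fast.ai data-bunch does it internally. The mean and standard deviation values are the widely published ImageNet per-channel statistics.

```python
# ImageNet per-channel statistics (red, green, blue)
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    # image normalization: shift and scale each channel, (value - mean) / std
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

def normalize_text(text):
    # one kind of "NLP normalization": lowercase and collapse extra whitespace
    return " ".join(text.lower().split())

centered = normalize_pixel(IMAGENET_MEAN)
cleaned = normalize_text("  The  KASTLE was GREAT ")
print(centered)
print(cleaned)
```

The first works on numbers and needs statistics from the data set that the base architecture was trained on. The second works on words and needs no statistics at all, which is why the two should not be confused.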

## 2.8 Milestone

- Henna is happy to see the milestone mark. She has completed the AUD3 journey, except for a quick detour to hyper-parameters.

- At the beginning of the journey, we promised to show Henna how the movie reviews affect the hyper-parameters. 

In [None]:
#

## 2.10 Wrap Up 

- We are back from the hyper-parameters detour, and Henna is happy to head back to the home base. Henna has been on a similar journey with real-world data. It was for a national retail customer feedback blog site. The difference is that Henna spent more time cleaning, labeling, augmenting, and segmenting customer feedback. In other words, it was not as normalized as the IMDB movie reviews.

- Henna starts by downloading the IMDB movie reviews. She looks at the data structure, counts words, and draws graphs. The combined word-counts and standard deviation charts in section #2.3 are very useful. 

- Henna moves forward to reading the movie reviews, the “README” file, and other supporting files. She finds a few issues, such as HTML-tags and unprintable characters that need to be cleaned. 

- After cleaning the movie reviews, Henna splits the “labeled” data into “75-15-10”, i.e., 75% for training, 15% for validating, and 10% for testing. The “75-15-10” split might disqualify Henna in a competition, but it is a balanced distribution for a real-world project. 

- Henna did not cheat. She didn’t use the “validation” movie reviews for “training,” and the “test” data are kept separate. The “75-15-10” hand-drawn bar chart in section #2.5 confirmed that Henna did it correctly.

- At first glance at the tokenizer result, Henna panicked because the system tokenizes 47% of all the available unique words. It implies the system dropped every other word in the movie reviews. 

- However, through Henna’s original thinking and graphs, the “tokenizer charts” in section #2.6.3C show that in actuality, for the average 232-word movie review, the system only discards 1.62 words. The “tokenizer charts” conclusively prove that the system discards 0.7% per movie review. 

- Henna discovered a few intentional biases in the movie reviews. In her biases discussion, she explains how to correct them and encourages data scientists to document their data’s intentional biases. 

- In the last part of the journey, Henna creates the “next word” data-bunch and the “sentiment classification” data-bunch. She inspects each step in the data-bunch creation. 

- It is an enjoyable journey, and Henna hopes you will be back for the next adventure.


## 2.11 Bonus Section, NLP Model Loss and Accuracy Result

- Henna was asked about the NLP model training accuracy results using her two data-bunches, so she asked “Spooky,” who is from the Jupyter notebook that does the model training. 

- Henna is happy to report that the final accuracy rate is 96.800%. For comparison, in the Kaggle IMDB competition from about two years ago, the top of the “[ml410-IMDb](https://www.kaggle.com/c/ml410-imdb/leaderboard)” Private Leaderboard is 90.528%.

- Granted, Spooky is using the Fast.ai library, which includes the last two years’ advances in CNN, but beating the leaderboard by 6.272 percentage points is a substantial margin when Kaggle competition winners are separated by tenths or hundredths of a percentage point. 

- Spooky’s Jupyter notebook is not (yet) published on the public GitHub, but Henna has permission to include the result here. 

- Using Pandas, the raw data are as follows. The system recalculates the learning rate after each unfreezing-layers training session. In addition, Henna splits the clean-data into the “75-15-10” split. See section #2.5 above.


In [None]:
# read it
henna.df_result = pandas.read_csv("spooky_result.csv")

In [None]:
# display it
henna.df_result

- Henna likes to draw, so here is a beautiful Loss and Accuracy graph.

In [None]:
# draw loss and accuracy training graph
@add_method(AUD3)
def draw_loss_acu_result(self,is_logr=False):
  try:
    #set up train set
    row = len(self.df_result)
    mx_accuracy = numpy.ones((row,2))
    mx_accuracy[:,0] = numpy.arange(0,row,1)
    mx_accuracy[:,1] = self.df_result["accuracy"]
    # set up train_loss
    mx_train_loss = mx_accuracy.copy()
    mx_train_loss[:,1] = self.df_result["train_loss"]
    #
    mx_valid_loss = mx_accuracy.copy()
    mx_valid_loss[:,1] = self.df_result["valid_loss"]
    # 
    mx_1_line = numpy.zeros((2,2))
    mx_1_line[0,1] = 1.0
    mx_1_line[1,0] = row - 1
    mx_1_line[1,1] = 1.0
    # 
    #
    # draw it
    frame, pic = monty.fetch.graph_canvas()
    monty.draw.graph_line(pic,mx_1_line,is_grid=True)
    monty.draw.graph_line(pic,mx_accuracy,is_shade_area=True,shade_alpha=0.8,shade_color=monty.bag.color.yellow)
    monty.draw.graph_line(pic,mx_valid_loss,is_shade_area=True,shade_alpha=0.8,shade_color=monty.bag.color.blue)
    monty.draw.graph_line(pic,mx_train_loss,is_shade_area=True,shade_alpha=0.7,shade_color=monty.bag.color.teal)
    #
    #
    x = "Epoch"
    y = "Loss and Accuracy"
    h = "Accuracy (Yellow), Train Loss (Teal), Valid Loss (Blue)"
    xl = list(map(str, list(self.df_result["epoch"])))
    monty.draw.graph_label(pic,xlabel=x, ylabel=y, head=h)
    pic.set_xticks(mx_accuracy[:,0])
    pic.set_xticklabels(xl)
    if (is_logr):
      pic.set_yscale('log')
    pic.grid(True)
    frame.show()
  except Exception:
    self._pp("**Error, can not draw graph", "Did you create my buddy, Monty?")
  return
# set_xticklabels

In [None]:
# do it
henna.draw_loss_acu_result()

In [None]:
# do it using logarithmic scale
henna.draw_loss_acu_result(is_logr=True)

In [None]:
monty.draw.graph_line??

><h2><center>The end.</center></h2>

# 3 - Conclusion




The “AUD3” is a typical Jupyter notebook in my workday. The difference is that the actual notebooks are messier, containing many detours, dead ends, and mistakes.

I chose the IMDB data because I can’t share the actual customer data, and I am pleasantly surprised by how clean the movie reviews are. The IMDB folks rightfully deserve the credit for making the NLP data freely available. 

I discovered that graphing NLP data is my secret weapon, especially the “tokenizer graph per file.” I can’t believe that no one has done it before. 

Typically, I spend more time cleaning, augmenting, segmenting, and labeling the NLP data in a real-world project. Furthermore, data scientists and project managers rarely document intentional biases. They are not hard to spot once you compare the data with the project objectives.

As with the previous “sandbox” project, [the 3D visualizing](https://www.linkedin.com/pulse/python-3d-visualization-hackable-step-by-step-jupyter-duc-haba/), I encourage everyone to publish articles or lessons using the Jupyter notebook. The readers can hack the notebook and make it their own. That is where learning truly takes root. It is by reading and doing it.

> <h2>“A doer of deeds…”</h2>

I am looking forward to seeing you again in the next “sandbox” adventure, and if you read this on LinkedIn or GitHub, give me a “thumbs up” and send me feedback.

- LinkedIn, "Demystify Python 3D Visualization -- A Hackable Step-by-step Jupyter Notebook", (add link)

- If you read this on LinkedIn, what are you waiting for? Head over to GitHub, use Google Colab or your favorite Jupyter notebook option, and hack away. (add link)

In [None]:
# end of jupyter notebook