# Assignment #1

**DISCLAIMER** Portions of this assignment were completed with AI assistance and those instances are noted with details on the AI used, the prompt, and the output or the general outcome of the interaction.

## Introduction

For this assignment, I wanted to improve my Lisp skills by challenging myself to use only Lisp for the entire project. I believe many packages for statistics and machine learning are available for Lisp through avenues such as Quicklisp (a repository for packages, similar to PyPI), and I will want to use them for more complicated problems. However, the best way to learn the language is to build something useful and sufficiently complicated so as to explore the 80% of the features which are used 20% of the time (Pareto principle). Applying my learning to the area where I want to use the language (ML, NLP, etc.) makes the learning much more fit for purpose, and some skills, such as processing JSON data and manipulating tokens, will apply even when using packages. Each cell completes some aspect of this program, and the cells should be run in sequence (or all at once, if you prefer). Some cells are separated out so that you can play with them and enter your own data or parameters.

## How I Used This Notebook
Since the use of a notebook is typically out of order and exploratory, you will notice that not all cells reveal all information that is supported in the notebook markdown (such as all results); this is because each variation / experiment was run by modifying variables and parameters and running them again to receive new output. The output was then saved off-notebook and ultimately placed back into the explanatory portions of the notebook. If this is not your style of running a notebook, enjoy being a free moral agent.

Throughout the notebook, I will indicate difficult programming issues that came up and how I approached solving them. Where I used AI to help learn the language, I will give the prompts and general outputs or, in the interest of time, a summary of the interaction, the trajectory of the code improvements, and the final result.

For the model evaluation aspect of this assignment, I will go into the various choices of model, setup and evaluation, some key insights and takeaways. Specifically:
 - Feature engineering / data choices (reasoning)
    - consider out of distribution words, word negation, etc.
 - Implementation of at least 3 models (possibly variations of 1 kind of model)
 - Evaluation of models
    - metrics (acc, F1, precision, recall)
    - more robust?
    - time complexity, space?
    - general insights
 - Writing and Presentation
    - (throughout this notebook, discussion)
    - slides on the work in these evaluations / programming
    - why I chose different models,
    - practical considerations for these models, deployment



## (Part 1) Introduction to Naive Bayes Classifier From Scratch in Common Lisp (SBCL)

### Packages Used
The primary packages used to assist this portion of the assignment are `cl-json` for loading JSON objects, `cl-ppcre` for regular expression handling, and `alexandria`, a collection of general-purpose utility functions and macros.

In [1]:
;;;; quick install all our needed libraries
(ql:quickload :cl-json)
(ql:quickload :cl-ppcre)
(ql:quickload :alexandria)

;;;; setup our packages and namespace
(defpackage :mine (:use :cl :alexandria))
(in-package :mine)

To load "cl-json":
  Load 1 ASDF system:
    cl-json


(:CL-JSON)

; Loading "cl-json"

To load "cl-ppcre":
  Load 1 ASDF system:
    cl-ppcre
; Loading "cl-ppcre"
.

(:CL-PPCRE)


To load "alexandria":
  Load 1 ASDF system:
    alexandria


(:ALEXANDRIA)

#<PACKAGE "MINE">

#<PACKAGE "MINE">

; Loading "alexandria"



## Utility Functions
Here I define some utility functions for processing the data. The `split-set` function gathers the train, validation, and test splits into pairs of text and labels. The only included samples are those for the "student" category. The function is defined specifically for this dataset, that is, it is made only for the Wizard of Tasks dataset.
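The positional accessors in `split-set` depend on the key order of the decoded JSON. As a hedged alternative, fields can be looked up by key with `assoc`, since `cl-json` decodes each object into an alist with keyword keys (`:TEXT`, `:INTENT`, and `:ROLE` in the output printed in this notebook). The helper names `turn-field` and `student-pair` and the hand-written `*example-turn*` are illustrative assumptions, not part of the assignment code.

```lisp
(defun turn-field (turn key)
  "Look up KEY in a decoded turn alist; returns NIL when the key is absent."
  (cdr (assoc key turn)))

(defun student-pair (turn)
  "Return (text intent) for a student turn, or NIL for any other role."
  (when (equal (turn-field turn :role) "student")
    (list (turn-field turn :text) (turn-field turn :intent))))

;; A hand-written turn that mirrors the shape of the decoded dataset.
(defparameter *example-turn*
  '((:text . "What ingredients do I need to start?")
    (:turn--counter . 1)
    (:intent . "ask_question_ingredients_tools")
    (:role . "student")))

(student-pair *example-turn*)
;; => ("What ingredients do I need to start?" "ask_question_ingredients_tools")
```

Key-based lookup keeps working if fields are added or reordered, at the cost of a linear search through each small alist.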

The `count-labels` function does just that: it counts the class labels from the samples. This way, I can get the keys for the classes right from the training data and use them for making all the other hash tables. As disclosed, this function mostly came from ChatGPT when I was learning Lisp early in the assignment, by prompting with the question "I'm doing machine learning for intent classification with common lisp. How can I count up the class labels from the dataset? I do not know the different number of classes."

The `tokenize` function was another function for which I used ChatGPT, in order to define the regular expressions for tokenizing the sample text. In the context of Common Lisp, I prompted ChatGPT: "I need a series of regex expressions for tokenizing English text." Because of that context, it generated a tokenizing function as well, which did well in covering the cases I was interested in.

## Decisions for Tokenizing
The main driver for these decisions was to keep things simple and not get bogged down in the details of tokenization, since I was still learning the language and had not yet written the Naive Bayes Classifier. However, some research led me to the Stanford NLP course material, which discusses the Naive Bayes Classifier and some of the intuitive choices one might make. For example, as discussed in class, we might remove stop words: words such as "a", "the", and "and" that are expected to appear in nearly every class and so are not discriminative. The course material advises that removing them does not do much, so in practice they are kept in. Another choice is to mark negation by combining words such as "not like" into a super token "NOT_like" so that the negative is considered in the classification. I left this out as well, for the simple reason that negation is not expected to affect intent in this problem context: the classes describe what kind of interaction the user is having, not their sentiment. Had this been a sentiment classification task, I might have chosen to handle negation specially. For my resources on this topic, I used our course material, the suggested books, and this presentation from the Stanford NLP course: [Text Classification and Naive Bayes](https://web.stanford.edu/class/cs124/lec/naivebayes2021.pdf)
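To make the negation alternative concrete, here is a minimal sketch of one common way to mark it: every token after a negation word is prefixed with `NOT_` until the next punctuation token. The function `mark-negation` and the two word lists are illustrative assumptions and are not used anywhere in this notebook.

```lisp
;; Hypothetical word lists, chosen only for this illustration.
(defparameter *negation-words* '("not" "no" "never" "n't"))
(defparameter *clause-enders* '("." "," "!" "?" ";" ":"))

(defun mark-negation (tokens)
  "Prefix NOT_ to each token that follows a negation word, up to the next clause ender."
  (let ((negating nil))
    (loop for tok in tokens
          collect (cond ((member tok *clause-enders* :test #'string=)
                         (setf negating nil)   ; punctuation closes the negated span
                         tok)
                        ((member tok *negation-words* :test #'string-equal)
                         (setf negating t)     ; start marking from the next token
                         tok)
                        (negating (concatenate 'string "NOT_" tok))
                        (t tok)))))

(mark-negation '("I" "do" "n't" "like" "it" "." "Next"))
;; => ("I" "do" "n't" "NOT_like" "NOT_it" "." "Next")
```

For intent classification the marked tokens would mostly fragment the vocabulary, which is consistent with leaving this step out.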




In [2]:
;;;; utility functions definitions

(defun split-set (split-string raw-json)
"Takes the split string, which is matched against the DATA_SPLIT field,
and gathers the TEXT and INTENT from the student turns for that split.
Splits are 'test', 'validation', and 'train'."
 (let ((split))
 (loop :for item :in raw-json
 :when (string= split-string (cdar (cddr item)))                 ;; this is "test","train",or "validation"
 do (setq split (append split 
    (loop :for turns :in (cdar (cdddr item))                    ;; this is a set of turns in a given split
     :when (string= "student" (cdar (nthcdr 10 turns)))          ;; this checks the role of the turn, we want student.
     :collect (list (cdar turns) (cdar (nthcdr 4 turns))))))     ;; (cdar turns) ;; the text
                                                                ;; (cdar (nthcdr 4 turns)) ;; intent
 :finally (return split))))

;; This function was chatGPT assisted, by the following prompt / query:
;; "I'm doing machine learning for intent classification with common lisp. How can I 
;; count up the class labels from the dataset? I do not know the different number of classes."
(defun count-labels (dataset)
  "Count how many times each class label appears in DATASET.
   DATASET is a list of (input label) pairs, or just labels."
  (let ((counts (make-hash-table :test 'equalp)))
    (dolist (item dataset)
      ;; if dataset items are just labels, use ITEM directly
      ;; if they are (input label) pairs, use (second item)
      (let* ((label (if (consp item) (second item) item))
             (current (gethash label counts 0)))
        (setf (gethash label counts) (1+ current))))
    counts))

;; This function was chatGPT assisted, using the following prompt / query:
;; "I need a series of regex expressions for tokenizing English text."
(defun tokenize (text)
"This function takes in text, processes it, and returns a list of tokens."
  (let ((s text))
    (setf s (cl-ppcre:regex-replace-all "\\s+" s " ")) ;; white space collapse
    (setf s (cl-ppcre:regex-replace-all "([.,!?;:\"(){}\\[\\]])" s " \\1 ")) ;; punctuation
    (setf s (cl-ppcre:regex-replace-all "(\\w+)('ll|'re|'ve|n't|'s|'m|'d)" s "\\1 \\2")) ;; contractions
    (setf s (cl-ppcre:regex-replace-all "(\\w+)-(\\w+)" s "\\1 - \\2")) ;; dashes
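    ;; REMOVE defaults to EQL, which never matches a freshly made "" string,
    ;; so the test must be STRING= for empty tokens to be dropped.
    (remove "" (cl-ppcre:split "\\s+" s) :test #'string=)))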


SPLIT-SET

COUNT-LABELS

TOKENIZE

## Evaluating
The following function creates the confusion matrix from the ground truth labels and the predicted labels, and then returns a function which can be called for specific metrics, either classwise or macro-averaged. The helper functions inside `evaluation` were discovered by prompting ChatGPT with the following prompt:
> In Lisp, I am writing an evaluation function, which will take in the list of all classes, a list of ground truth labels, and a list of predicted labels. I want this function to build the confusion matrix and save it in a closure over a lambda function which can be called to generate precision, recall, F1 score, and accuracy.

The output mirrored my original function, but used the `labels` special operator, which allows locally defined named functions. This made the evaluator function much cleaner: the helper functions go through the confusion matrix for the true positives, false positives, and false negatives, while the evaluator either calls them for a single class or computes the macro average.

### Macro-Average
The decision to use the macro average instead of the micro average was due to two reasons. Firstly, I am concerned with overall performance across the range of classes, i.e. I want to know whether I have a decent classifier generally. Secondly, as described in the course material, micro-averaging gives more weight to the frequent classes, and since there are significant class imbalances, micro-averaging would be misleading.
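A small worked example may help show why the two averages diverge under class imbalance. The sketch below computes macro- and micro-averaged recall on a toy confusion matrix with rows as ground truth and columns as predictions; the function names and the counts are hypothetical and independent of the `evaluation` function.

```lisp
(defun class-recall (matrix i)
  "Recall of class I: the diagonal entry over the row total."
  (let ((row-total (loop for j from 0 below (array-dimension matrix 1)
                         sum (aref matrix i j))))
    (if (zerop row-total) 0 (/ (aref matrix i i) row-total))))

(defun macro-recall (matrix)
  "Unweighted mean of the per-class recalls."
  (let ((n (array-dimension matrix 0)))
    (/ (loop for i from 0 below n sum (class-recall matrix i)) n)))

(defun micro-recall (matrix)
  "Pooled recall: all correct predictions over all samples."
  (let ((n (array-dimension matrix 0)))
    (/ (loop for i from 0 below n sum (aref matrix i i))
       (loop for i from 0 below n
             sum (loop for j from 0 below n sum (aref matrix i j))))))

;; Hypothetical counts: 90 samples of a frequent class, all correct;
;; 10 samples of a rare class, all mislabeled as the frequent class.
(defparameter *toy-matrix*
  (make-array '(2 2) :initial-contents '((90 0) (10 0))))

(macro-recall *toy-matrix*) ; => 1/2
(micro-recall *toy-matrix*) ; => 9/10
```

The micro average looks healthy even though the rare class is never recovered, while the macro average exposes the failure.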

In [3]:
;;;; Metric functions for evaluating the classifier
(defun evaluation (class-labels y-list pred-list)
"Builds the confusion matrix (rows: ground truth, columns: predictions) and
returns a lambda closed over it, which dispatches on the requested metric.
Metrics available are: accuracy, precision, recall, F1 (per class or macro-averaged), and cmatrix."
(let* ((cls (length class-labels))
       (results (make-array (list cls cls) :initial-element 0.0d0))
       (to-idx #'(lambda (x) (position-if #'(lambda (y) (equalp x y)) class-labels))))

      ;; build confusion matrix
      (loop for y in y-list
            for p in pred-list
            do (incf (aref results (funcall to-idx y) (funcall to-idx p))))
      
      ;; local function declarations! Cool. Helper functions for precision, recall, and F1.
      ;; For this portion of the function, ChatGPT was used by prompting as indicated in the markdown cell.

      (labels ((precision (i) ;; tp / tp + fp
                  (let ((tp (aref results i i))
                        (fp (loop for k from 0 below cls
                                 :sum (if (= k i) 0 (aref results k i)))))
                        (if (zerop (+ tp fp)) 0 (/ (float tp) (+ tp fp)))))
               
               (recall (i) ;; tp / tp + fn
                  (let ((tp (aref results i i))
                        (fn (loop for k from 0 below cls
                                 :sum (if (= k i) 0 (aref results i k)))))
                        (if (zerop (+ tp fn)) 0 (/ (float tp) (+ tp fn)))))
               
               (f1 (i) ;;
                  (let ((p (precision i))
                        (r (recall i)))
                        (if (zerop (+ p r)) 0 (/ (* 2 p r) (+ p r))))))

      ;; evaluator function
      (lambda (&key metric k)
      "Takes a metric: accuracy, precision, recall, f1
      and takes a class k. If no class, then macro-average is returned.
      Return the confusion matrix with metric as cmatrix."
            (ecase metric
             (:accuracy 
                  (let ((correct (loop for i from 0 below cls
                                    :sum (aref results i i)))
                        (total (loop for i from 0 below cls
                                    :sum (loop for j from 0 below cls
                                          :sum (aref results i j)))))
                  (if (zerop total) 0 (/ (float correct) total))))
             (:precision
                  (if k (precision (funcall to-idx k))
                        (/ (reduce #'+ (loop for i from 0 below cls :collect (precision i))) cls)))
             
             (:recall
                  (if k (recall (funcall to-idx k))
                        (/ (reduce #'+ (loop for i from 0 below cls :collect (recall i))) cls)))
            
             (:f1
                  (if k (f1 (funcall to-idx k))
                        (/ (reduce #'+ (loop for i from 0 below cls :collect (f1 i))) cls)))
             (:cmatrix
                  results))))))

EVALUATION

## The Naive Bayes Classifier

In this code, I define the classifier function. The definition of this function follows the "let over lambda" concept to encapsulate an environment in the returned function. Initially, the function takes in the training document, optionally printing some statistics, and then builds ("trains") the classifier. This is primarily made up of building hash tables to contain the classes, the priors, word tables for each class, and the overall vocabulary. The training data is tokenized inside the function as well.
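For reference, the textbook multinomial naive Bayes decision rule with add-α smoothing can be written as below. This is the standard formulation rather than a transcription of the cell that follows, which takes the class denominator from the number of distinct grams in the class and skips grams that were never seen in a class. The symbols count(w, c), V (the vocabulary), and α (the smoothing factor) are notation introduced here.

```latex
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big],
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + \alpha}{\sum_{w' \in V} \mathrm{count}(w', c) + \alpha\,|V|}
```

Working in log space turns the product of many small probabilities into a sum, which avoids floating-point underflow on longer inputs.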

### The Classifier Function
The purpose of the `naive-bayes-classifier` function is to build the training statistics, and return a function that can be called on some string, in other words, it is a function factory for making NBCs on a particular set of training data. The training data is set at the time of the function call, but the smoothing can be selected when calling the returned function.
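The "let over lambda" pattern can be seen in isolation in the following minimal sketch: state is created once in a `let`, and the returned lambda dispatches on a command keyword with `ecase`, much like the classifier does. The `make-counter` function is a toy example invented for illustration.

```lisp
(defun make-counter (&optional (start 0))
  "Return a closure over a private count that understands :increment and :value."
  (let ((count start))                      ; captured by the lambda below
    (lambda (command &rest args)
      (ecase command
        (:increment
         (destructuring-bind (&optional (step 1)) args
           (incf count step)))
        (:value count)))))

(defparameter *counter* (make-counter))
(funcall *counter* :increment)    ; => 1
(funcall *counter* :increment 5)  ; => 6
(funcall *counter* :value)        ; => 6
```

Each call to the factory produces an independent closure, which is what allows several classifiers with different training options to coexist.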

In [4]:
;;;; build an NBC classifier which takes in a dataset and builds all the internal tracking,
;;;; and returns a function which takes a sample string and returns the predicted class.

(defun naive-bayes-classifier (train &key (print-stats nil) 
                                          (fit-prior "data")  
                                          (binary nil) 
                                          (ngrams 1)
                                          (filter-list '()))
"NBC that takes the training data as a list of ('string of text' 'label') lists
and returns a naive Bayes classifier."

;; disclaimer: the two local functions here, make-ngrams and ngram-strings, were written with the assistance
;; of chatGPT, with the following prompt: 
;; "how to write a function which turns a list of ordered tokens into a list of bigrams?"
;; The follow up question from chatGPT was:
;; "Do you want me to also extend this so it can make n-grams in general (with n as a parameter), not just bigrams?"
;; after prompting with "yes", both functions were complete, however, they were not written as local functions as here.
(labels ((make-ngrams (tokens n &key include-unigrams)
  "Create a list of n-gram tokens from a list of 1-gram tokens.
If INCLUDE-UNIGRAMS is non-nil, also include the unigrams."
  (let* ((len (length tokens))
         (ngrams (loop for i from 0 to (- len n)
                       collect (subseq tokens i (+ i n)))))
    (if include-unigrams
        (append (loop for i from 0 below len
                      collect (list (elt tokens i)))
                ngrams)
        ngrams)))
         (ngram-strings (tokens n)
            "take a list of n-gram tokens and turn them into a list of n-gram strings."
            (mapcar #'(lambda (ng) (format nil "~{~a~^ ~}" ng)) (make-ngrams tokens n :include-unigrams T)))
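         ;; chatGPT was used to look up the function that was used to 
         ;; remove items from one list in another, set-difference, with the prompt:
         ;; "how do you remove from one list any items that are in another?"
         ;; The answer, "That’s exactly what set-difference is for in Common Lisp ✅"
         ;; set-difference does not guarantee the order of its result, and token order
         ;; matters once n-grams are built, so remove-if with member is used instead.
         (filter (x) (remove-if #'(lambda (tok) (member tok filter-list :test #'string-equal)) x)))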
      
(let ((vocab (make-hash-table :size 10000 :test 'equalp))
      (priors (make-hash-table :test 'equalp))
      (label-counts (make-hash-table :test 'equalp))
      (word-counts (make-hash-table :test 'equalp))
      (training-tokens (map 'list #'(lambda (n) 
                        (list (cadr n) (if binary (remove-duplicates 
                                                   (ngram-strings (filter (tokenize (car n))) ngrams) 
                                                      :test #'string-equal) 
                                                  (ngram-strings (filter (tokenize (car n))) ngrams)))) 
                                                  train)))

;; get label counts (and so the keys for classes)
      (setf label-counts (count-labels train))
;; build prior
      (cond ((string= fit-prior "data")
                  (loop for k being the hash-keys in label-counts using (hash-value v)
                  do (setf (gethash k priors) (/ (float v) (length training-tokens)))))
            ((string= fit-prior "uniform")
                  (loop for k being the hash-keys in label-counts
                  do (setf (gethash k priors) (/ (float 1) (hash-table-count label-counts)))))
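            ((string= fit-prior "none")
                  (loop for k being the hash-keys in label-counts
                  do (setf (gethash k priors) 1.0d0)))
            ;; fail early on an unrecognized option instead of leaving the priors empty
            (t (error "Unknown :fit-prior ~S; expected \"data\", \"uniform\", or \"none\"." fit-prior)))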

;; build per-class word counts
      ;; build a hash table for the words in each class.
      (loop for k being the hash-keys in label-counts
      do (setf (gethash k word-counts) (make-hash-table :size 1000 :test 'equalp)))

      ;; go through the token class pairs, adding all words to vocab, and 
      ;; to the class hash tables
      (dolist (pair training-tokens)
            (destructuring-bind (label words) pair
                  (dolist (w words)
                        (incf (gethash w vocab 0))
                        (incf (gethash w (gethash label word-counts) 0)))))

;; if stats are requested, print them when creating the classifier
(when print-stats
  (format t "parameters:~% .... fit-prior: ~S ~% .... binary: ~A ~% .... ngrams: ~D ~2%"
          fit-prior binary ngrams)
  (format t "stop words: ~A ~%" filter-list)
  (format t "Class labels and counts: ~%")
  (maphash #'(lambda (k v) (format t "~S: ~D~%" k v)) label-counts)
  (format t "Class priors: ~%")
  (maphash #'(lambda (k v) (format t "~A: ~D~%" k v)) priors)
  (format t "Total words: ~D~%" (hash-table-count vocab))
  ;; the per-class table sizes are part of the stats, so they are
  ;; printed only when PRINT-STATS is true
  (maphash (lambda (k v)
             (format t "Label: ~A → table ~A (~D grams)~%"
                     k v (if v (hash-table-count v) -1)))
           word-counts))
;; make the classifier
#'(lambda (command &rest args)
"Options are: predict (x &optional smooth get-probs)
      x is the string to predict on.
      smooth is the smoothing factor alpha, and default is 0.0d0 (no smoothing.)
      get-probs set to T returns the list of class labels and probabilities
      instead of a class prediction. Default is nil.
              get-word-table (key)
              get-vocab
              get-priors
              get-classes"
   (ecase command

      (:predict
        (destructuring-bind (x &optional (smooth 0.0d0) (get-probs nil)) args
            (let ((token-input (if binary (remove-duplicates 
                                                (ngram-strings (filter (tokenize x)) ngrams) 
                                                      :test #'string-equal)
                                          (ngram-strings (filter (tokenize x)) ngrams)))
                   (results '()))
                   (maphash #'(lambda (k wtable)
                        (unless (hash-table-p wtable)
                              (error "class ~S has non-hash value ~S" k wtable))
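                        (let* ((V (hash-table-count vocab))      ;; vocabulary size: distinct grams over all classes
                               (WC (hash-table-count wtable))     ;; distinct grams seen in this class (not the total token count)
                               (denom (+ WC (* smooth V)))
                               (likelihood (loop for w in token-input
                                                        ;; a gram never seen in this class (num = 0) adds 0 to the log score,
                                                        ;; i.e. it is skipped, so smoothing only adjusts grams seen in the class.
                                                        ;; a zero denominator means the class has no grams at all.
                                           summing (let ((num (gethash w wtable 0)))
                                           (if (or (zerop denom) (zerop num))
                                           0.0d0
                                           (log (/ (+ smooth num) denom)))))))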
                        (push (list k (+ (log (gethash k priors 0.0d0)) likelihood)) results)))
                        word-counts)
                  (if get-probs
                  results
                  (car (reduce #'(lambda (a b) (if (> (cadr a)(cadr b)) a b))
                  results))))))
            
      (:get-word-table
        (destructuring-bind (key) args
         (gethash key word-counts)))
         
      (:get-vocab
       vocab)
       
      (:get-priors
       priors)

      (:get-classes
       (sort (loop for k being the hash-keys in label-counts
            :collect k) #'string-lessp)))))))


NAIVE-BAYES-CLASSIFIER

## Loading the Data
The asterisk-flanked variables are _dynamic_ (global, special) variables which can be used throughout the program once defined. The asterisks ("earmuffs") are a naming convention rather than a language requirement, but a strongly observed one: Steel Bank Common Lisp (SBCL) emits a style warning when an earmuffed name is used for something other than a special variable, for example when it is bound lexically. The Wizard of Tasks data is loaded with the `cl-json` package, and then the custom split function is applied to get the three splits.

You may notice that some dynamic variables are defined with `defparameter` and some with `defvar`. The only difference is that `defparameter` always assigns the initial value, overwriting any existing one, while `defvar` initializes the variable only if it is not already bound to a value. So, as I understand it, for variables which I plan on re-running with different values, I'd use `defparameter`.
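The difference between the two forms is easy to see by evaluating each one twice. This is a small sketch with throwaway variable names (`*demo-var*` and `*demo-param*` are hypothetical), assuming neither is already bound in the running image.

```lisp
(defvar *demo-var* 1)
(defvar *demo-var* 2)          ; no effect: *demo-var* is already bound, so it stays 1

(defparameter *demo-param* 1)
(defparameter *demo-param* 2)  ; always assigns: *demo-param* is now 2

(list *demo-var* *demo-param*) ; => (1 2)
```

In a notebook this means that re-running a cell resets a `defparameter` variable but leaves a `defvar` variable at whatever value it currently holds.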

In [5]:
;;;; setup and prepare our data here.
(defparameter *data-location* "./data")
(defparameter *dataset* "/wizard_of_tasks_cooking_v1.0.json")
;;(defparameter *dataset* "/wizard_of_tasks_diy_v1.0.json")

;; the raw json text
(defparameter *json-text* nil)

;; all our splits from the json text
(defparameter *test-list* nil)
(defparameter *train-list* nil)
(defparameter *valid-list* nil)

;; NLTK stop word list
(defparameter *stop-words-nltk* '("i" "me" "my" "myself" "we" "our" "ours" "ourselves" 
"you" "your" "yours" "yourself" "yourselves" "he" "him" "his" "himself" "she" "her" "hers" 
"herself" "it" "its" "itself" "they" "them" "their" "theirs" "themselves" "what" "which" 
"who" "whom" "this" "that" "these" "those" "am" "is" "are" "was" "were" "be" "been" "being" 
"have" "has" "had" "having" "do" "does" "did" "doing" "a" "an" "the" "and" "but" "if" "or" 
"because" "as" "until" "while" "of" "at" "by" "for" "with" "about" "against" "between" "into" 
"through" "during" "before" "after" "above" "below" "to" "from" "up" "down" "in" "out" "on" "off" 
"over" "under" "again" "further" "then" "once" "here" "there" "when" "where" "why" "how" "all" 
"any" "both" "each" "few" "more" "most" "other" "some" "such" "no" "nor" "not" "only" "own" 
"same" "so" "than" "too" "very" "s" "t" "can" "will" "just" "don" "should" "now"))

;; stop word list
(defparameter *stop-words* '("I" "?" "the" "What" "do" "to" "," "should" "."
                            "is" "it" "for" "!" "next" "Now" "and" "This" 
                            "That" "need" "you" "'s" "How" "a" "be" "have" "in" "of"))
                            
;; load the json file and preprocess data into the splits.
;; with-open-file closes the stream once decoding finishes
(setf *json-text*
      (with-open-file (in (concatenate 'string *data-location* *dataset*))
        (cl-json:decode-json-from-source in)))
(setf *test-list* (split-set "test" *json-text*))
(setf *train-list* (split-set "train" *json-text*))
(setf *valid-list* (split-set "validation" *json-text*))

*DATA-LOCATION*

*DATASET*

*JSON-TEXT*

*TEST-LIST*

*TRAIN-LIST*

*VALID-LIST*

*STOP-WORDS-NLTK*

*STOP-WORDS*

((:*WIZARD-OF-*TASK-FOOD-1
  (:DOCUMENT--URL
   . "https://www.wholefoodsmarket.com/recipes/labneh-fresh-herbs-and-olive-oil")
  (:DATA--SPLIT . "test")
  (:TURNS
   ((:TEXT
     . "Hi! I love labneh but I've never mixed it with herbs into a spread before, looks amazing. What ingredients do I need to start? Thank you :)")
    (:TURN--COUNTER . 1) (:DANGEROUS--TOOLS) (:SHARED--DATA)
    (:INTENT . "ask_question_ingredients_tools") (:REAL--LIFE--ACTION . "N/A")
    (:RELEVANT . "yes") (:USEFUL . "yes") (:WORKER--ID . 111)
    (:PREVIOUS--WORKER--ID) (:ROLE . "student"))
   ((:TEXT . "Here are the ingredients you will need!") (:TURN--COUNTER . 2)
    (:DANGEROUS--TOOLS)
    (:SHARED--DATA "2 cups plain Greek yogurt"
     "1 tablespoon extra-virgin olive oil" "1 teaspoon lemon zest"
     "1/2  teaspoon fine sea salt" "1/4  teaspoon ground black pepper"
     "1 teaspoon chopped fresh chives" "1 teaspoon chopped fresh thyme"
     "1 tablespoon chopped fresh tarragon" "1 tablespoon chopped fr

(("Hi! I love labneh but I've never mixed it with herbs into a spread before, looks amazing. What ingredients do I need to start? Thank you :)"
  "ask_question_ingredients_tools")
 ("After I've chopped all of the herbs and gathered the other ingredients, do I mix them altogether. "
  "request_next_step")
 ("The ingredients are now mixed. What should be done now?"
  "request_next_step")
 ("Once the ingredients are mixed how should I proceed?" "request_next_step")
 ("Can I let the ingredients sit for longer to make the flavors stronger?"
  "ask_question_recipe_steps")
 ("I let them sit for 15 minutes. What step comes next?" "request_next_step")
 ("Do you think I could freeze this recipe?" "ask_question_recipe_steps")
 ("Please restate and answer, I want to know if I can make this recipe and then freeze it for later when I'm busy, \"interesting the task\" is meaningless. Please answer."
  "request_next_step")
 ("Okay and are there any other steps that I should take? "
  "request_next_step

(("How much cream cheese and other ingredients will I need?"
  "ask_question_ingredients_tools")
 ("What are the few other ingredients?" "ask_question_recipe_steps")
 ("Okay, my mistake!  So what's the first step?" "request_next_step")
 ("That's been blended, what's next? " "request_next_step")
 ("And are there any other steps I need to take? " "request_next_step")
 ("This is great! Thank you for the help! " "stop")
 ("Will I be using premade pasta from the box or making it myself?"
  "ask_question_ingredients_tools")
 ("Okay and what other ingredients do I need? "
  "ask_question_ingredients_tools")
 ("I am all out of garlic-infused olive oil. Is regular okay to use?"
  "ask_question_recipe_steps")
 ("That's too time-consuming for me to do today. Now that I have all of the things on your list, can you help me start cooking?"
  "request_next_step")
 ("Gotcha!  Anything I can do while I wait for it to boil?"
  "request_next_step")
 ("Okay, the veggies are ready and the water is boiling.

(("What is required to make the turkey with chile-citrus butter?"
  "ask_question_ingredients_tools")
 ("I have all of the ingredients listed. What is the first step to the recipe?"
  "request_next_step")
 ("Those ingredients are now in a bowl. What is the second step?"
  "request_next_step")
 ("I beat the ingredients with an electric mixer, so they are now combined. What is the next step in the recipe?"
  "request_next_step")
 ("Can I use any kind of citrus juice or is there a specific one that is recommended?"
  "ask_question_recipe_steps")
 ("Why do I need to reserve the skins of the lemons and limes?"
  "ask_question_recipe_steps")
 ("Do the rinds get thrown out once the turkey is roasted?"
  "ask_question_recipe_steps")
 ("Do you think those rinds could be composted so they don't go to total waste?"
  "ask_question_recipe_steps")
 ("What should I do once I've composted and gotten rid of the pork rinds?"
  "request_next_step")
 ("I prepped the bird just like you suggested.  What no

## Declaring the Classifier
I can define a dynamic variable which holds the classifier by calling `naive-bayes-classifier` on the training data. The returned function is called with `funcall`, and a different amount of smoothing can be applied on each call.

In [6]:
;; get a classifier.
(defparameter *nbc* (naive-bayes-classifier *train-list* :print-stats T))

*NBC*

parameters:
 .... fit-prior: "data" 
 .... binary: NIL 
 .... ngrams: 1 

stop words: NIL 
Class labels and counts: 
"ask_question_ingredients_tools": 477
"ask_question_recipe_steps": 1086
"request_next_step": 1519
"stop": 228
"chitchat": 11
"misc": 16
Class priors: 
ask_question_ingredients_tools: 0.14294276
ask_question_recipe_steps: 0.32544202
request_next_step: 0.45519927
stop: 0.06832484
chitchat: 0.0032963739
misc: 0.0047947257
Total words: 2308
Label: ask_question_ingredients_tools → table #<HASH-TABLE :TEST EQUALP :COUNT 905 {10055B5E33}> (905 grams)
Label: ask_question_recipe_steps → table #<HASH-TABLE :TEST EQUALP :COUNT 1606 {10055BDE33}> (1606 grams)
Label: request_next_step → table #<HASH-TABLE :TEST EQUALP :COUNT 1255 {10055C5E33}> (1255 grams)
Label: stop → table #<HASH-TABLE :TEST EQUALP :COUNT 320 {10055CDE33}> (320 grams)
Label: chitchat → table #<HASH-TABLE :TEST EQUALP :COUNT 62 {10055D5E33}> (62 grams)
Label: misc → table #<HASH-TABLE :TEST EQUALP :COUNT 93 {10055D

## Example String
Here is a place to run the classifier on your own string. The string following the `:predict` keyword can be changed to anything you like, and the number following that is the smoothing factor. Leaving off the number is the same as a smoothing factor of 0, or no smoothing.

In [7]:
;; nbc with a smoothing factor of 0.2
(let ((cls (funcall *nbc* :predict "Can you tell me what I should do next?" 0.2d0)))
    (format t "the predicted class is: ~A ~%" cls))

NIL

the predicted class is: request_next_step 
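When `get-probs` is set to T, `:predict` returns `(label score)` lists whose scores are unnormalized log values rather than probabilities. A sketch of converting such scores into values that sum to 1, using the usual max-subtraction trick for numerical stability, might look like this; `normalize-log-scores` and the example scores are hypothetical.

```lisp
(defun normalize-log-scores (scored-classes)
  "SCORED-CLASSES is a list of (label log-score) lists.
Returns (label probability) lists whose probabilities sum to 1."
  (let* ((max-score (reduce #'max scored-classes :key #'second))
         ;; subtracting the maximum keeps EXP from underflowing to zero everywhere
         (weights (mapcar (lambda (pair) (exp (- (second pair) max-score)))
                          scored-classes))
         (total (reduce #'+ weights)))
    (mapcar (lambda (pair weight) (list (first pair) (/ weight total)))
            scored-classes weights)))

(normalize-log-scores '(("a" -1.0d0) ("b" -1.0d0)))
;; => (("a" 0.5d0) ("b" 0.5d0))
```

In the notebook this could be applied as `(normalize-log-scores (funcall *nbc* :predict "Can you tell me what I should do next?" 0.2d0 t))`.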


## Validation of Different Models
With several aspects of the model to explore, identifying the most performant hyperparameters for the Naive Bayes classifier requires a search through the available parameter space. The independent choices to make are:

* binary or all occurrences (remove duplicates in training and during inference)
* which sort of prior (proportional to the class labels, a uniform prior, or no prior)
* unigrams or bigrams (although "independent" is a funny word here, since this feature takes precedence over the binary option: the n-grams are built first and then deduplicated)
* the smoothing alpha.

Other choices could have been made, including changing aspects of the tokenizer, but I feel that this is a sufficient number of parameters to examine the NBC performance.
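One way to enumerate the non-smoothing choices listed above is to build every combination as a property list of keyword arguments. The sketch below assumes the option values shown in its defaults; `parameter-grid` is a hypothetical helper and not the procedure actually used in this notebook, where parameters were changed by hand between runs.

```lisp
(defun parameter-grid (&key (binary-options '(nil t))
                            (prior-options '("data" "uniform" "none"))
                            (ngram-options '(1 2)))
  "Return one keyword plist per combination of the given options."
  (loop for binary in binary-options
        append (loop for prior in prior-options
                     append (loop for n in ngram-options
                                  collect (list :binary binary
                                                :fit-prior prior
                                                :ngrams n)))))

(length (parameter-grid)) ; => 12
(first (parameter-grid))  ; => (:BINARY NIL :FIT-PRIOR "data" :NGRAMS 1)
```

Because each entry uses the same keywords as `naive-bayes-classifier`, a configuration could be applied with `(apply #'naive-bayes-classifier *train-list* config)`, and smoothing would still be varied at prediction time.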



## Setup of Experimental Models
The following cell is used for generating the different models to use for evaluation.

In [8]:
;; options: :print-stats T :fit-prior (or "data" "uniform" "none"), default "data" :binary (or nil T), default nil
;;          :ngrams int, default 1. :filter-list *stop-words*, default '()

(defparameter *nbc-eval* (naive-bayes-classifier *train-list* :print-stats T))

*NBC-EVAL*

parameters:
 .... fit-prior: "data" 
 .... binary: NIL 
 .... ngrams: 1 

stop words: NIL 
Class labels and counts: 
"ask_question_ingredients_tools": 477
"ask_question_recipe_steps": 1086
"request_next_step": 1519
"stop": 228
"chitchat": 11
"misc": 16
Class priors: 
ask_question_ingredients_tools: 0.14294276
ask_question_recipe_steps: 0.32544202
request_next_step: 0.45519927
stop: 0.06832484
chitchat: 0.0032963739
misc: 0.0047947257
Total words: 2308
Label: ask_question_ingredients_tools → table #<HASH-TABLE :TEST EQUALP :COUNT 905 {1007B7DE33}> (905 grams)
Label: ask_question_recipe_steps → table #<HASH-TABLE :TEST EQUALP :COUNT 1606 {1007B85E33}> (1606 grams)
Label: request_next_step → table #<HASH-TABLE :TEST EQUALP :COUNT 1255 {1007B8DE33}> (1255 grams)
Label: stop → table #<HASH-TABLE :TEST EQUALP :COUNT 320 {1007B95E33}> (320 grams)
Label: chitchat → table #<HASH-TABLE :TEST EQUALP :COUNT 62 {1007B9DE33}> (62 grams)
Label: misc → table #<HASH-TABLE :TEST EQUALP :COUNT 93 {1007BA
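
The class priors printed above are consistent with each label's count divided by the total number of training samples (for example, 477 / 3337 ≈ 0.1429). A minimal sketch of that calculation follows; `class-priors` and `*label-counts*` are hypothetical names, and only the counts are copied from the output.

```lisp
;; Hypothetical helper: turn an alist of (label . count) into (label . prior).
(defun class-priors (label-counts)
  "Return an alist of (label . count/total) for LABEL-COUNTS."
  (let ((total (reduce #'+ label-counts :key #'cdr)))
    (loop for (label . n) in label-counts
          collect (cons label (/ n total)))))

;; Counts copied from the printed statistics above.
(defparameter *label-counts*
  '(("ask_question_ingredients_tools" . 477)
    ("ask_question_recipe_steps" . 1086)
    ("request_next_step" . 1519)
    ("stop" . 228)
    ("chitchat" . 11)
    ("misc" . 16)))

;; Print each prior as a decimal for comparison with the output above.
(loop for (label . prior) in (class-priors *label-counts*)
      do (format t "~A: ~,4F~%" label prior))
```

The heavy skew toward `request_next_step` and the tiny `chitchat` and `misc` classes are visible directly in these ratios.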

### Collecting evaluator functions
The following cell sets the `*wizard-evaluations*` variable to a list of (smoothing level . function) pairs; each function can be called to get the evaluation results for that level. Essentially, the `evaluation` function builds a confusion matrix and then returns a function that computes a requested metric from that matrix.

In [9]:
;; use classifier on validation set for various smoothing levels.
(defparameter *wizard-evaluations* nil)

(setf *wizard-evaluations*
      (let* ((smoothing-levels '(0.0d0 0.05d0 0.1d0 0.5d0 1.0d0 1.5d0 2.0d0))
             (classes (funcall *nbc-eval* :get-classes))
             ;; for each level, collect (level truths predictions)
             (val-results
               (loop for level in smoothing-levels
                     :collect (cons level
                                    (loop for sample in *valid-list*
                                          :collect (funcall *nbc-eval* :predict (first sample) level) into pred
                                          :collect (second sample) into y
                                          :finally (return (list y pred)))))))
        ;; pair each level with its evaluator closure
        (loop for r in val-results
              :collect (cons (car r) (evaluation classes (cadr r) (caddr r))))))

*WIZARD-EVALUATIONS*

((0.0d0 . #<FUNCTION (LAMBDA (&KEY :METRIC :K) :IN EVALUATION) {10032A193B}>)
 (0.05d0 . #<FUNCTION (LAMBDA (&KEY :METRIC :K) :IN EVALUATION) {10032A344B}>)
 (0.1d0 . #<FUNCTION (LAMBDA (&KEY :METRIC :K) :IN EVALUATION) {10032A4F5B}>)
 (0.5d0 . #<FUNCTION (LAMBDA (&KEY :METRIC :K) :IN EVALUATION) {10032A6A6B}>)
 (1.0d0 . #<FUNCTION (LAMBDA (&KEY :METRIC :K) :IN EVALUATION) {10032A857B}>)
 (1.5d0 . #<FUNCTION (LAMBDA (&KEY :METRIC :K) :IN EVALUATION) {10032AA08B}>)
 (2.0d0 . #<FUNCTION (LAMBDA (&KEY :METRIC :K) :IN EVALUATION) {10032ABB9B}>))
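
The pattern of building a confusion matrix once and returning a closure that answers metric queries can be sketched as follows. This is not the notebook's `evaluation` function: `make-evaluator` is a hypothetical stand-in, it omits F1 and the `:k` argument, and the macro-averaging of precision and recall is an assumption.

```lisp
;; Hypothetical sketch of the closure pattern; the notebook's own
;; EVALUATION function may differ in its details and averaging.
(defun make-evaluator (classes truths predictions)
  "Build a confusion matrix and return a closure that computes metrics from it."
  (let ((matrix (make-hash-table :test #'equal)))
    ;; Key each cell by (true-label . predicted-label).
    (loop for truth in truths
          for pred in predictions
          do (incf (gethash (cons truth pred) matrix 0)))
    (flet ((cell (truth pred) (gethash (cons truth pred) matrix 0))
           (safe-div (num den) (if (zerop den) 0 (/ num den))))
      (lambda (&key (metric :accuracy))
        (ecase metric
          (:accuracy
           (safe-div (loop for c in classes sum (cell c c))
                     (length truths)))
          (:precision                   ; macro-averaged (an assumption)
           (safe-div (loop for c in classes
                           sum (safe-div (cell c c)
                                         (loop for r in classes sum (cell r c))))
                     (length classes)))
          (:recall                      ; macro-averaged (an assumption)
           (safe-div (loop for c in classes
                           sum (safe-div (cell c c)
                                         (loop for p in classes sum (cell c p))))
                     (length classes))))))))

;; Toy labels, not the notebook's data.
(defparameter *toy-eval*
  (make-evaluator '("stop" "misc")
                  '("stop" "stop" "misc" "misc")
                  '("stop" "misc" "misc" "misc")))

(funcall *toy-eval* :metric :accuracy)   ; => 3/4
(funcall *toy-eval* :metric :precision)  ; => 5/6
```

Returning a closure means the matrix is computed once per smoothing level, and each metric is derived from it on demand.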

## Validation Results
This cell was used to view and save the results of each model, for deciding the final model to use and report on the test data.

In [10]:
(loop for ev in *wizard-evaluations*
    do  (format t "smoothing: ~5,4F : " (car ev))
        (let ((metrics (loop for m in '(:accuracy :precision :recall :f1)
                        collect (funcall (cdr ev) :metric m))))
             (destructuring-bind (a p r f) metrics
                (format t " .......... acc: ~5,4F, pr: ~5,4F, re: ~5,4F, f1: ~5,4F ~%" a p r f))))

NIL

smoothing: .0000 :  .......... acc: .4310, pr: .4087, re: .3342, f1: .2637 
smoothing: .0500 :  .......... acc: .4975, pr: .3857, re: .3841, f1: .2806 
smoothing: .1000 :  .......... acc: .5074, pr: .3592, re: .3828, f1: .2778 
smoothing: .5000 :  .......... acc: .4877, pr: .3281, re: .3278, f1: .2456 
smoothing: 1.0000 :  .......... acc: .4631, pr: .2873, re: .2982, f1: .2223 
smoothing: 1.5000 :  .......... acc: .4384, pr: .2794, re: .2837, f1: .2103 
smoothing: 2.0000 :  .......... acc: .4212, pr: .2829, re: .2773, f1: .2076 


### Stop Word List
To try to improve the performance, I listed the vocabulary of a model in order of frequency and copied some of the highest-frequency words to a stop word list. Different lists could be better, but the 28th word was "ingredients", so I figured that words from that point on might still be helpful to the classifier and stopped the list just before it.

In [11]:
(let* ((vocab (funcall *nbc-eval* :get-vocab))
       (v-lst (loop for k being the hash-keys of vocab using (hash-value v)
                :collect (list k v))))
        (mapcar #'(lambda (x) (format t "\"~A\" " (car x))) (sort v-lst #'> :key #'cadr)))

(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL
 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL

"I" "?" "the" "What" "do" "to" "," "should" "." "is" "it" "for" "!" "next" "Now" "and" "This" "That" "need" "you" "'s" "How" "a" "be" "have" "in" "of" "ingredients" "step" "after" "can" "are" "once" "my" "Okay" "'ve" "'m" "use" "or" "done" "with" "all" "thanks" "will" "me" "on" "recipe" "make" "there" "long" "help" "them" "If" "n't" "So" "dish" "first" "am" "does" "Ok" "take" "GOOD" "know" "oven" "much" "time" "up" "ready" "add" "Anything" "just" "get" "could" "oil" "added" "They" "cooking" "any" "minutes" "would" "other" "your" "great" "Thank" "heat" "think" "mixture" "Baking" "these" "else" "at" "before" "water" "Why" "cook" "go" "going" "supposed" "while" "out" "like" "as" "instead" "when" "been" "into" "serve" "cooked" "put" "from" "making" "temperature" "eat" "let" "then" "-" "Cool" "finished" "but" "start" "those" "one" "off" "about" "sauce" "cheese" "has" "please" "Salad" "right" "more" "tell" "well" "using" "bowl" "together" "'ll" "pan" "mixed" "'re" "grill" "recommend" "everyt

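Picking the most frequent tokens and filtering them out can be sketched as below. Both helpers are hypothetical and not part of the notebook; the sketch assumes the vocabulary is a hash table from token to frequency, as the cell above suggests, and uses `equalp` so that matching is case-insensitive like the printed tables.

```lisp
;; Hypothetical helpers; VOCAB is assumed to map each token to its frequency.
(defun top-n-tokens (vocab n)
  "Return the N most frequent keys of the hash table VOCAB."
  (let ((pairs (loop for token being the hash-keys of vocab
                       using (hash-value freq)
                     collect (cons token freq))))
    (mapcar #'car
            (subseq (sort pairs #'> :key #'cdr)
                    0 (min n (length pairs))))))

(defun remove-stop-words (tokens stop-words)
  "Return TOKENS without any member of STOP-WORDS (case-insensitive)."
  (remove-if (lambda (tok) (member tok stop-words :test #'equalp)) tokens))

;; Toy vocabulary, not the notebook's data.
(defparameter *toy-vocab* (make-hash-table :test #'equalp))
(setf (gethash "the" *toy-vocab*) 9
      (gethash "I" *toy-vocab*) 12
      (gethash "oven" *toy-vocab*) 2)

(top-n-tokens *toy-vocab* 2)                      ; => ("I" "the")
(remove-stop-words '("I" "preheat" "the" "oven")
                   (top-n-tokens *toy-vocab* 2))  ; => ("preheat" "oven")
```
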
## Testing
After comparing the models on validation, I settled on a smoothing alpha of about 0.07 (between the best-performing values of 0.05 and 0.1); the run recorded below uses 0.1. The model uses priors fitted to the data, no binarization, no stop words, and unigrams only.

In [12]:
;; use classifier on the test set at the selected smoothing level.
(defparameter *wizard-test* nil)

(setf *wizard-test*
      (let* ((smoothing-levels '(0.1d0))
             (classes (funcall *nbc-eval* :get-classes))
             ;; for each level, collect (level truths predictions)
             (test-results
               (loop for level in smoothing-levels
                     :collect (cons level
                                    (loop for sample in *test-list*
                                          :collect (funcall *nbc-eval* :predict (first sample) level) into pred
                                          :collect (second sample) into y
                                          :finally (return (list y pred)))))))
        ;; pair each level with its evaluator closure
        (loop for r in test-results
              :collect (cons (car r) (evaluation classes (cadr r) (caddr r))))))

*WIZARD-TEST*

((0.1d0 . #<FUNCTION (LAMBDA (&KEY :METRIC :K) :IN EVALUATION) {1003E74B6B}>))

In [13]:
(loop for ev in *wizard-test*
    do  (format t "smoothing: ~5,4F : " (car ev))
        (let ((metrics (loop for m in '(:accuracy :precision :recall :f1)
                        collect (funcall (cdr ev) :metric m))))
             (destructuring-bind (a p r f) metrics
                (format t " .......... acc: ~5,4F, pr: ~5,4F, re: ~5,4F, f1: ~5,4F ~%" a p r f))))

NIL

smoothing: .1000 :  .......... acc: .4530, pr: .3351, re: .2617, f1: .2470 


### Collected Results

When validating the performance of the NBC, we examined the effects of the different hyperparameter choices and came to some interesting conclusions, supported by the Jurafsky / Martin book as well.

Again, the parameters we initially examined were:
* smoothing choice
* prior (learned from data or uniform)
* binary (removal of duplicates)
* n-grams, and combinations of n-grams
* stop word list, or no stop word list

* Smoothing: validation put the best value around 0.1, with 0.05 and 0.5 close behind.
* Prior: across both WoT datasets, the prior learned from the data was the best prior, and any deviation from it worsened performance.
* Binary: removing duplicates did little to improve the results but did not cause many problems, presumably because few words are removed in the typical case.
* N-grams: the features were adjusted so that bigrams and trigrams could also include unigrams (bigrams + unigrams, for example). No setting with ngrams > 1 helped, and the results were very bad when it was set to 2 or greater. Strangely, however, choosing ngrams = 1 and additionally adding unigrams (in other words, unigrams + unigrams) improved performance across the data. I believe this is related to short samples being preferred; essentially doubling the words in the text appears to improve the probabilities, though I have not pinned down why.
* Stop words: the stop word list did not help, neither the NLTK list nor the custom lists.
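
The n-gram options discussed above can be sketched as follows. These helpers are hypothetical and are not the notebook's tokenizer; `:add-unigrams` stands in for whatever switch the notebook uses to append unigrams to the chosen n-grams.

```lisp
;; Hypothetical helpers, not the notebook's tokenizer.
(defun make-ngrams (tokens n)
  "Return the list of N-grams in TOKENS, each as a list of N tokens (N >= 1)."
  (loop for tail on tokens
        while (nthcdr (1- n) tail)   ; stop when fewer than N tokens remain
        collect (subseq tail 0 n)))

(defun ngram-features (tokens n &key add-unigrams)
  "Return N-grams of TOKENS, optionally followed by the plain unigrams."
  (append (make-ngrams tokens n)
          (when add-unigrams (make-ngrams tokens 1))))

(make-ngrams '("what" "is" "next") 2)
;; => (("what" "is") ("is" "next"))

(ngram-features '("what" "is" "next") 1 :add-unigrams t)
;; => (("what") ("is") ("next") ("what") ("is") ("next"))
```

In the unigram + unigram case every token appears twice, so under the independence assumption each token's log-likelihood is counted twice while the log prior is counted once. That shifts weight from the prior to the likelihood, which may be part of the observed effect, though I have not verified this.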



## Final Cooking Test Results

The final results use the model choices and the smoothing value that performed best on validation.

```
smoothing: .1000 :  .......... acc: .4530, pr: .3351, re: .2617, f1: .2470
```
## Cooking Validation Results

### No added unigram

#### data fit better than uniform
```
parameters:
 .... fit-prior: "data" 
 .... binary: NIL 
 .... ngrams: 1 

stop words: NIL 

smoothing: .0000 :  .......... acc: .3719, pr: .3449, re: .3013, f1: .2283 
smoothing: .0500 :  .......... acc: .4360, pr: .3502, re: .3289, f1: .2385 
smoothing: .1000 :  .......... acc: .4557, pr: .3454, re: .3370, f1: .2466 
smoothing: .5000 :  .......... acc: .4458, pr: .3006, re: .3095, f1: .2261 
smoothing: 1.0000 :  .......... acc: .4212, pr: .2734, re: .2757, f1: .2003 
smoothing: 1.5000 :  .......... acc: .4039, pr: .2760, re: .2697, f1: .1979 
smoothing: 2.0000 :  .......... acc: .3744, pr: .2624, re: .2424, f1: .1756


parameters:
 .... fit-prior: "uniform" 
 .... binary: NIL 
 .... ngrams: 1 

stop words: NIL 

smoothing: .0000 :  .......... acc: .2586, pr: .4094, re: .2393, f1: .1919 
smoothing: .0500 :  .......... acc: .3744, pr: .3555, re: .3030, f1: .2282 
smoothing: .1000 :  .......... acc: .3966, pr: .3513, re: .3137, f1: .2314 
smoothing: .5000 :  .......... acc: .3990, pr: .3042, re: .2806, f1: .2112 
smoothing: 1.0000 :  .......... acc: .3768, pr: .2976, re: .2622, f1: .2011 
smoothing: 1.5000 :  .......... acc: .3202, pr: .2619, re: .2361, f1: .1730 
smoothing: 2.0000 :  .......... acc: .2882, pr: .2573, re: .2246, f1: .1641
```
#### Binary comparison
```
parameters:
 .... fit-prior: "data" 
 .... binary: T 
 .... ngrams: 1 

stop words: NIL

smoothing: .0000 :  .......... acc: .3153, pr: .3474, re: .2705, f1: .2123 
smoothing: .0500 :  .......... acc: .3892, pr: .3407, re: .3008, f1: .2216 
smoothing: .1000 :  .......... acc: .4089, pr: .3324, re: .3029, f1: .2236 
smoothing: .5000 :  .......... acc: .4064, pr: .2756, re: .2757, f1: .1985 
smoothing: 1.0000 :  .......... acc: .3645, pr: .2839, re: .2501, f1: .1840 
smoothing: 1.5000 :  .......... acc: .3202, pr: .2753, re: .2236, f1: .1651 
smoothing: 2.0000 :  .......... acc: .2808, pr: .2628, re: .2032, f1: .1466
```
#### Stop words
```
parameters:
 .... fit-prior: "data" 
 .... binary: NIL 
 .... ngrams: 1 

stop words: (i me my myself we our ours ourselves you your yours yourself
             yourselves he him his himself she her hers herself it its itself
             they them their theirs themselves what which who whom this that
             these those am is are was were be been being have has had having
             do does did doing a an the and but if or because as until while of
             at by for with about against between into through during before
             after above below to from up down in out on off over under again
             further then once here there when where why how all any both each
             few more most other some such no nor not only own same so than too
             very s t can will just don should now)

smoothing: .0000 :  .......... acc: .1552, pr: .1794, re: .2053, f1: .1290 
smoothing: .0500 :  .......... acc: .1823, pr: .1492, re: .2232, f1: .1178 
smoothing: .1000 :  .......... acc: .1970, pr: .1676, re: .2287, f1: .1246 
smoothing: .5000 :  .......... acc: .1823, pr: .1511, re: .2014, f1: .1106 
smoothing: 1.0000 :  .......... acc: .1650, pr: .1439, re: .1872, f1: .0992 
smoothing: 1.5000 :  .......... acc: .1527, pr: .1409, re: .1774, f1: .0929 
smoothing: 2.0000 :  .......... acc: .1330, pr: .1353, re: .1594, f1: .0817

parameters:
 .... fit-prior: "data" 
 .... binary: NIL 
 .... ngrams: 1 

stop words: (I ? the What do to , should . is it for ! next Now and This That
             need you 's How a be have in of)

smoothing: .0000 :  .......... acc: .0493, pr: .1481, re: .1329, f1: .0406 
smoothing: .0500 :  .......... acc: .0591, pr: .1541, re: .1530, f1: .0492 
smoothing: .1000 :  .......... acc: .0616, pr: .1898, re: .1535, f1: .0587 
smoothing: .5000 :  .......... acc: .0394, pr: .1298, re: .1239, f1: .0333 
smoothing: 1.0000 :  .......... acc: .0320, pr: .1205, re: .1158, f1: .0278 
smoothing: 1.5000 :  .......... acc: .0271, pr: .1096, re: .1141, f1: .0249 
smoothing: 2.0000 :  .......... acc: .0271, pr: .1099, re: .1141, f1: .0252
```
#### N-grams = 2
```
parameters:
 .... fit-prior: "data" 
 .... binary: NIL 
 .... ngrams: 2 

stop words: NIL 

smoothing: .0000 :  .......... acc: .0025, pr: .0009, re: .0833, f1: .0018 
smoothing: .0500 :  .......... acc: .0025, pr: .0008, re: .0833, f1: .0016 
smoothing: .1000 :  .......... acc: .0025, pr: .0008, re: .0833, f1: .0016 
smoothing: .5000 :  .......... acc: .0025, pr: .0008, re: .0833, f1: .0016 
smoothing: 1.0000 :  .......... acc: .0025, pr: .0008, re: .0833, f1: .0016 
smoothing: 1.5000 :  .......... acc: .0025, pr: .0008, re: .0833, f1: .0016 
smoothing: 2.0000 :  .......... acc: .0025, pr: .0008, re: .0833, f1: .0016
```

### Unigram + Unigram

#### prior fit
```
parameters:
 .... fit-prior: "data" 
 .... binary: NIL 
 .... ngrams: 1 

stop words: NIL

smoothing: .0000 :  .......... acc: .4310, pr: .4087, re: .3342, f1: .2637 
smoothing: .0500 :  .......... acc: .4975, pr: .3857, re: .3841, f1: .2806 
smoothing: .1000 :  .......... acc: .5074, pr: .3592, re: .3828, f1: .2778 
smoothing: .5000 :  .......... acc: .4877, pr: .3281, re: .3278, f1: .2456 
smoothing: 1.0000 :  .......... acc: .4631, pr: .2873, re: .2982, f1: .2223 
smoothing: 1.5000 :  .......... acc: .4384, pr: .2794, re: .2837, f1: .2103 
smoothing: 2.0000 :  .......... acc: .4212, pr: .2829, re: .2773, f1: .2076

parameters:
 .... fit-prior: "uniform" 
 .... binary: NIL 
 .... ngrams: 1

smoothing: .0000 :  .......... acc: .3818, pr: .4036, re: .3047, f1: .2435 
smoothing: .0500 :  .......... acc: .4631, pr: .3872, re: .3691, f1: .2743 
smoothing: .1000 :  .......... acc: .4754, pr: .3809, re: .3657, f1: .2766 
smoothing: .5000 :  .......... acc: .4729, pr: .3427, re: .3249, f1: .2498 
smoothing: 1.0000 :  .......... acc: .4384, pr: .3052, re: .2966, f1: .2264 
smoothing: 1.5000 :  .......... acc: .4187, pr: .2861, re: .2761, f1: .2093 
smoothing: 2.0000 :  .......... acc: .4039, pr: .2811, re: .2697, f1: .2035
```
#### Binary
```
parameters:
 .... fit-prior: "data" 
 .... binary: T 
 .... ngrams: 1 

smoothing: .0000 :  .......... acc: .3153, pr: .3474, re: .2705, f1: .2123 
smoothing: .0500 :  .......... acc: .3892, pr: .3407, re: .3008, f1: .2216 
smoothing: .1000 :  .......... acc: .4089, pr: .3324, re: .3029, f1: .2236 
smoothing: .5000 :  .......... acc: .4064, pr: .2756, re: .2757, f1: .1985 
smoothing: 1.0000 :  .......... acc: .3645, pr: .2839, re: .2501, f1: .1840 
smoothing: 1.5000 :  .......... acc: .3202, pr: .2753, re: .2236, f1: .1651 
smoothing: 2.0000 :  .......... acc: .2808, pr: .2628, re: .2032, f1: .1466
```

#### Stop Words
```
parameters:
 .... fit-prior: "data" 
 .... binary: NIL 
 .... ngrams: 1 

stop words: (i me my myself we our ours ourselves you your yours yourself
             yourselves he him his himself she her hers herself it its itself
             they them their theirs themselves what which who whom this that
             these those am is are was were be been being have has had having
             do does did doing a an the and but if or because as until while of
             at by for with about against between into through during before
             after above below to from up down in out on off over under again
             further then once here there when where why how all any both each
             few more most other some such no nor not only own same so than too
             very s t can will just don should now)

smoothing: .0000 :  .......... acc: .1576, pr: .2245, re: .2196, f1: .1542 
smoothing: .0500 :  .......... acc: .1995, pr: .2269, re: .2400, f1: .1621 
smoothing: .1000 :  .......... acc: .2069, pr: .2140, re: .2351, f1: .1506 
smoothing: .5000 :  .......... acc: .2020, pr: .2030, re: .2257, f1: .1424 
smoothing: 1.0000 :  .......... acc: .1798, pr: .1823, re: .2009, f1: .1253 
smoothing: 1.5000 :  .......... acc: .1601, pr: .1773, re: .1803, f1: .1094 
smoothing: 2.0000 :  .......... acc: .1453, pr: .1735, re: .1752, f1: .1033
```

#### N-Grams = 2
```
parameters:
 .... fit-prior: "data" 
 .... binary: NIL 
 .... ngrams: 2 

stop words: NIL

smoothing: .0000 :  .......... acc: .0099, pr: .0507, re: .1026, f1: .0293 
smoothing: .0500 :  .......... acc: .0123, pr: .0276, re: .1090, f1: .0280 
smoothing: .1000 :  .......... acc: .0123, pr: .0225, re: .1090, f1: .0253 
smoothing: .5000 :  .......... acc: .0074, pr: .0125, re: .0962, f1: .0141 
smoothing: 1.0000 :  .......... acc: .0025, pr: .0010, re: .0833, f1: .0019 
smoothing: 1.5000 :  .......... acc: .0025, pr: .0010, re: .0833, f1: .0019 
smoothing: 2.0000 :  .......... acc: .0025, pr: .0010, re: .0833, f1: .0019
```


## Final Insights
The primary insight into a generative model such as Naive Bayes is that the data distribution is very important, and that with less data, more heuristics and expert decisions are needed to achieve the best performance. A reasonable model can be built from very little data, as the scikit-learn model showed, but the assumptions made about the data must hold in real-world use. Generative models trained on gobs and gobs of data do not need to worry about this, since the latent space is discovered implicitly by improving on the simple task of next-word prediction across text domains that include practically all public knowledge. The Naive Bayes classifier is small, simple, and yet powerful, and its simplifying assumption that tokens are independent works well for the classification task. As a generator, however, it would be expected to do poorly precisely because the token probabilities _are_ treated as independent: each draw from the distribution is completely disconnected from the others, which makes for a very terrible sentence generator.
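
To illustrate why independent draws make a poor sentence generator, here is a minimal sketch that samples tokens from a single class's unigram distribution. The function names and the probabilities are made up for illustration and do not come from the trained model.

```lisp
;; Hypothetical sketch: draw tokens independently from one class's
;; unigram distribution. The probabilities below are made up.
(defun sample-token (distribution &optional (state *random-state*))
  "Draw one token from DISTRIBUTION, an alist of (token . probability)."
  (let ((u (random 1.0d0 state)))
    (loop for (token . p) in distribution
          for cumulative = p then (+ cumulative p)
          when (< u cumulative) return token
          ;; Guard against rounding: fall back to the last token.
          finally (return (car (first (last distribution)))))))

(defun generate-tokens (distribution n)
  "Return N tokens, each drawn independently of the others."
  (loop repeat n collect (sample-token distribution)))

(defparameter *toy-distribution*
  '(("next" . 0.4d0) ("step" . 0.3d0) ("what" . 0.2d0) ("?" . 0.1d0)))

;; The result is random; word order carries no information at all.
(generate-tokens *toy-distribution* 6)
```

Because no draw depends on the previous one, the output is a bag of likely words for the class rather than a sentence.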

Another insight is that data preparation is very important for a task that relies on relatively little data. Even simple decisions, such as leaving in an article, removing a certain set of stop words, or keeping duplicates, can make all the difference. Compare this to the LLM paradigm: across all English text, the cases in which semantic meaning is subverted by some example, or altered by random issues such as spelling variations or errors, are overwhelmed by the sheer number of meaningful examples. This resembles the way boosting ensembles a number of weak learners into a strong one; the documents that provide even slightly useful semantic value will, in the totality of English knowledge, "ensemble" to give the LLM a picture of language meaning. We have no such advantage here with Naive Bayes; we must rely on good, representative examples and on expert knowledge to make a model. Still, expert knowledge and good examples seem to be a fairly cheap way to build a half-decent classifier.

The final insight for this section of the assignment is that there is a reason not to roll your own: people have worked very hard to make code bases and models work well and consistently, with as few bugs as possible. In an operating environment, offload risk by relying on tried-and-true infrastructure _unless_ there is a very good reason to do otherwise. There is also a good reason to make your own: to learn something more deeply than by simply running pre-packaged one-liners, which my ten-year-old child could do with some basic instruction.