# Assignment #1

## (Introduction)

For this assignment, I took the opportunity to improve my Lisp skills by challenging myself to use only Lisp for the entire project. I believe many packages for statistics and machine learning are available for Lisp through Quicklisp (a library manager and package repository, similar to pip and PyPI), and I will want to use them for more complicated problems. However, the best way to learn a language is to build something useful and complicated enough to explore the 80% of features that are used only 20% of the time (an inversion of the Pareto principle). Applying what I learn to the area where I want to use the language (ML, NLP, etc.) makes the learning fit for purpose, and some skills, such as processing the JSON data and manipulating tokens, will carry over even when using packages. Each cell completes some aspect of the program, and the cells should be run in sequence (or all at once, if you prefer). Some cells are separated out so that you can experiment with them and enter your own data or parameters.

Throughout the notebook, I will point out difficult programming issues that came up and how I approached solving them. Where I used AI to help learn the language, I will give the prompts and general outputs or, in the interest of time, a summary of the interaction, the trajectory of the code improvements, and the final result.

For the model evaluation part of this assignment, I will discuss the choices of model, setup, and evaluation, along with some key insights and takeaways. Specifically:
 - Feature engineering / data choices (reasoning)
    - consider out of distribution words, word negation, etc.
 - Implementation of at least 3 models (possibly variations of 1 kind of model)
 - Evaluation of models
    - metrics (acc, F1, precision, recall)
    - more robust?
    - time complexity, space?
    - general insights
 - Writing and Presentation
    - (throughout this notebook, discussion)
    - slides on the work in these evaluations / programming
    - why I chose different models,
    - practical considerations for these models, deployment

Finally, similar evaluation will be performed for a second, minor task on entity recognition.

##  (Part 1.1) Introduction to Naive Bayes Classifier From Scratch in Common Lisp (SBCL)



In [None]:
;;;; quick install all our needed libraries
(ql:quickload :cl-json)
(ql:quickload :cl-ppcre)
(ql:quickload :alexandria)

;;;; setup our packages and namespace
(defpackage :mine (:use :cl :alexandria))
(in-package :mine)

## Utility functions
In developing this from scratch implementation, I 

In [None]:
;;;; utility functions definitions

(defun split-set (split-string raw-json)
  "Gather (TEXT INTENT) pairs from the student turns of one data split.
SPLIT-STRING is matched against each item's DATA_SPLIT field and is one of
'test', 'validation', or 'train'. Fields are read by position, so this
relies on the key order of the decoded JSON."
  (loop :for item :in raw-json
        :when (string= split-string (cdar (cddr item)))          ;; this is "test", "train", or "validation"
          :append (loop :for turn :in (cdar (cdddr item))        ;; the turns of this item
                        :when (string= "student" (cdar (nthcdr 10 turn))) ;; role of the turn; we want student
                          :collect (list (cdar turn)             ;; the text
                                         (cdar (nthcdr 4 turn))))))  ;; the intent

;; This function was ChatGPT assisted, using the following prompt / query:
;; "I'm doing machine learning for intent classification with common lisp. How can I 
;; count up the class labels from the dataset? I do not know the different number of classes."
(defun count-labels (dataset)
  "Count how many times each class label appears in DATASET.
   DATASET is a list of (input label) pairs, or just labels."
  (let ((counts (make-hash-table :test 'equalp)))
    (dolist (item dataset)
      ;; if dataset items are just labels, use ITEM directly
      ;; if they are (input label) pairs, use (second item)
      (let* ((label (if (consp item) (second item) item))
             (current (gethash label counts 0)))
        (setf (gethash label counts) (1+ current))))
    counts))

;; This function was ChatGPT assisted, using the following prompt / query:
;; "I need a series of regex expressions for tokenizing English text."
(defun tokenize (text)
  "Take TEXT, normalize it, and return a list of token strings."
  (let ((s text))
    (setf s (cl-ppcre:regex-replace-all "\\s+" s " ")) ;; white space collapse
    (setf s (cl-ppcre:regex-replace-all "([.,!?;:\"(){}\\[\\]])" s " \\1 ")) ;; punctuation
    (setf s (cl-ppcre:regex-replace-all "(\\w+)('ll|'re|'ve|n't|'s|'m|'d)" s "\\1 \\2")) ;; contractions
    (setf s (cl-ppcre:regex-replace-all "(\\w+)-(\\w+)" s "\\1 - \\2")) ;; dashes
    ;; SPLIT yields a leading "" when S starts with whitespace. REMOVE compares
    ;; with EQL by default, which does not match a fresh string, so pass a :test.
    (remove "" (cl-ppcre:split "\\s+" s) :test #'string=)))


## The Naive Bayes Classifier

In this code, I define the classifier function. The definition follows the "let over lambda" pattern: the returned function closes over the environment built during training. The function takes the training data, tokenizes it, builds ("trains") the classifier, and optionally prints some statistics. Training consists mainly of building hash tables for the class counts, the priors, a word-count table for each class, and the overall vocabulary.
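
For reference, the standard multinomial Naive Bayes decision rule with additive smoothing is written out below. The notation is an assumption introduced here and does not come from the code: $\alpha$ is the smoothing value, $N_{w,c}$ is how often token $w$ occurs in the training text of class $c$, $N_c = \sum_{w} N_{w,c}$ is the total token count of class $c$, $|V|$ is the vocabulary size, $D_c$ is the number of training examples labeled $c$, and $D$ is the total number of training examples. For an input with tokens $w_1, \dots, w_n$:

$$
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log \frac{N_{w_i,c} + \alpha}{N_c + \alpha\,|V|} \right],
\qquad P(c) = \frac{D_c}{D}
$$

With $\alpha = 0$ a token that never occurs in class $c$ gives $\log 0$, which is why an implementation either smooths or skips such tokens.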

### The lambda function
The purpose of the `naive-bayes-classifier` function is to build the training statistics and return a function that can be called on a string. In other words, it is a function factory that makes NBCs for a particular set of training data. The training data is fixed when the factory is called, but the smoothing can be chosen each time the returned function is called.
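
To make the function-factory idea concrete, here is a minimal sketch of the same pattern on a much smaller problem. The names `make-tally`, `*tally-a*`, and `*tally-b*` are hypothetical and exist only for this illustration; the point is that each returned function keeps its own private state and dispatches on a command keyword.

```lisp
;; A function factory: MAKE-TALLY closes over COUNT, so each returned
;; function keeps its own private state between calls.
(defun make-tally (&optional (start 0))
  (let ((count start))
    (lambda (command &rest args)
      (ecase command
        (:increment (incf count (if args (first args) 1)))
        (:value count)))))

(defparameter *tally-a* (make-tally))
(defparameter *tally-b* (make-tally 10))

(funcall *tally-a* :increment)    ; => 1
(funcall *tally-a* :increment 5)  ; => 6
(funcall *tally-b* :value)        ; => 10, unaffected by *tally-a*
```

The classifier uses the same shape: the hash tables play the role of `count`, and the command keyword selects what to do with them.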

In [None]:
;;;; build an NBC classifier which takes in a dataset and builds all the internal tracking,
;;;; and returns a function which takes a sample string and returns the predicted class.

(defun naive-bayes-classifier (train &optional (print-stats nil))
"NBC that takes the training data as a list of ('string of text' 'label') lists
and returns a naive Bayes classifier function."
(let ((vocab (make-hash-table :size 10000 :test 'equalp))
      (priors (make-hash-table :test 'equalp))
      (label-counts (make-hash-table :test 'equalp))
      (word-counts (make-hash-table :test 'equalp))
      (training-tokens (map 'list #'(lambda (n) (list (cadr n) (tokenize (car n)))) train)))
;; get label counts (and so the keys for classes)
      (setf label-counts (count-labels train))
;; build prior
      (loop for k being the hash-keys in label-counts using (hash-value v)
      do (setf (gethash k priors) (/ (coerce v 'double-float) (length training-tokens))))
;; build per-class word counts
      ;; build a hash table for the words in each class.
      (loop for k being the hash-keys in label-counts
      do (setf (gethash k word-counts) (make-hash-table :size 1000 :test 'equalp)))

      ;; go through the token class pairs, adding all words to vocab, and 
      ;; to the class hash tables
      (dolist (pair training-tokens)
            (destructuring-bind (label words) pair
                  (dolist (w words)
                        (incf (gethash w vocab 0))
                        (incf (gethash w (gethash label word-counts) 0)))))

;; if stats are requested, print them when creating the classifier
(when print-stats
  (format t "Class labels and counts: ~%")
  (maphash #'(lambda (k v) (format t "~S: ~D~%" k v)) label-counts)
  (format t "Class priors: ~%")
  (maphash #'(lambda (k v) (format t "~A: ~F~%" k v)) priors)
  (format t "Vocabulary size: ~D~%" (hash-table-count vocab))
  ;; per-class tables: the count shown is the number of distinct words
  (maphash (lambda (k v)
             (format t "Label: ~A → table ~A (~D words)~%"
                     k v (if v (hash-table-count v) -1)))
           word-counts))
;; make the classifier
#'(lambda (command &rest args)
"Options are: predict (string and optional smoothing)
              get-word-table (key)
              get-vocab
              get-priors"
   (ecase command

      (:predict
        (destructuring-bind (x &optional (smooth 0.0d0) (get-probs nil)) args
            (let ((token-input (tokenize x))
                   (results '()))
                   (maphash #'(lambda (k wtable)
                        (unless (hash-table-p wtable)
                              (error "class ~S has non-hash value ~S" k wtable))
                        (let* ((V (hash-table-count vocab))
                               ;; total token count in this class (not the number of distinct words)
                               (WC (loop for c being the hash-values in wtable sum c))
                               (denom (+ WC (* smooth V)))
                               (likelihood (loop for w in token-input
                                                 ;; a zero smoothed count (unseen word, no smoothing) or a
                                                 ;; zero denominator (empty class) would give log(0): skip the word.
                                                 summing (let ((num (+ smooth (gethash w wtable 0))))
                                                           (if (or (zerop denom) (zerop num))
                                                               0.0d0
                                                               (log (/ num denom)))))))
                          ;; add the log prior; priors lie in (0, 1], so no clamping is needed
                          (push (list k (+ (log (gethash k priors)) likelihood)) results)))
                        word-counts)
                  (if get-probs
                  results
                  (car (reduce #'(lambda (a b) (if (> (cadr a)(cadr b)) a b))
                  results))))))
            
      (:get-word-table
        (destructuring-bind (key) args
         (gethash key word-counts)))
         
      (:get-vocab
       vocab)
       
      (:get-priors
       priors)))))


In [None]:
;;;; setup and prepare our data here.
(defvar *data-location* "./data")
;; the raw json text
(defvar *json-text* nil)

;; all our splits from the json text
(defvar *test-list* nil)
(defvar *train-list* nil)
(defvar *valid-list* nil)

;; load the json file and preprocess data into the splits.
;; with-open-file closes the stream once decoding is done.
(with-open-file (in (concatenate 'string *data-location* "/wizard_of_tasks_cooking_v1.0.json"))
  (setf *json-text* (cl-json:decode-json-from-source in)))
(setf *test-list* (split-set "test" *json-text*))
(setf *train-list* (split-set "train" *json-text*))
(setf *valid-list* (split-set "validation" *json-text*))

In [None]:
;; get a classifier.
;; defparameter (unlike defvar) rebuilds the classifier each time this cell is re-run.
(defparameter *nbc* (naive-bayes-classifier *train-list* T))

In [None]:
;; nbc with additive smoothing (0.2); omit the last argument to predict without smoothing
(let ((cls (funcall *nbc* :predict "Can you tell me what I should do next?" 0.2d0)))
    (format t "the predicted class is: ~A ~%" cls))
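
The metrics named in the introduction (accuracy, precision, recall, F1) can all be computed from a list of (gold predicted) label pairs. The following is a minimal sketch under that assumption; `accuracy`, `precision-recall-f1`, and `*toy-pairs*` are hypothetical names, and the labels in the toy data are made up rather than taken from the dataset.

```lisp
;; PAIRS is a list of (gold predicted) label pairs. These helpers are
;; illustrative only and are not part of the classifier above.
(defun accuracy (pairs)
  "Fraction of PAIRS whose gold and predicted labels match."
  (if (null pairs)
      0
      (/ (count-if (lambda (p) (equalp (first p) (second p))) pairs)
         (length pairs))))

(defun precision-recall-f1 (label pairs)
  "Return precision, recall and F1 for LABEL as three values."
  (let ((tp 0) (fp 0) (fn 0))
    (dolist (p pairs)
      (destructuring-bind (gold predicted) p
        (cond ((and (equalp gold label) (equalp predicted label)) (incf tp))
              ((equalp predicted label) (incf fp))    ; predicted LABEL, gold differs
              ((equalp gold label) (incf fn)))))      ; gold LABEL, prediction missed it
    (let* ((precision (if (zerop (+ tp fp)) 0 (/ tp (+ tp fp))))
           (recall (if (zerop (+ tp fn)) 0 (/ tp (+ tp fn))))
           (f1 (if (zerop (+ precision recall))
                   0
                   (/ (* 2 precision recall) (+ precision recall)))))
      (values precision recall f1))))

;; toy example with made-up labels
(defparameter *toy-pairs*
  '(("ask" "ask") ("ask" "confirm") ("confirm" "confirm") ("other" "ask")))

(accuracy *toy-pairs*)                   ; => 1/2
(precision-recall-f1 "ask" *toy-pairs*)  ; => 1/2, 1/2, 1/2
```

For the classifier in this notebook, the pairs could be built from the validation split with something like `(mapcar (lambda (ex) (list (second ex) (funcall *nbc* :predict (first ex) 0.2d0))) *valid-list*)`, since each example is a (text intent) list.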