<a id='top'></a><a name='top'></a>
# Chapter 1: Wordy machines


* [Introduction](#introduction)
* [1.0 Imports and Setup](#1.0)
* [1.1 Natural language vs. programming language](#1.1)
* [1.2 The magic](#1.2)
    - [1.2.1 Machines that converse](#1.2.1)
    - [1.2.2 The math](#1.2.2)
* [1.3 Practical applications](#1.3)
* [1.4 Language through a computer's "eyes"](#1.4)
    - [1.4.1 The language of locks](#1.4.1)
    - [1.4.2 Regular expression](#1.4.2)
    - [1.4.3 A simple chatbot](#1.4.3)
    - [1.4.4 Another way](#1.4.4)
* [1.5 A brief overflight of hyperspace](#1.5)
* [1.6 Word order and grammar](#1.6)
* [1.7 A chatbot natural language pipeline](#1.7)
* [1.8 Processing in depth](#1.8)
* [1.9 Natural language IQ](#1.9)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

### Dataset

* No real datasets

### Explore

* What natural language processing (NLP) is
* Why NLP is hard and has only recently become widespread
* When word order and grammar are important and when they can be ignored
* How a chatbot combines many of the tools of NLP
* How to use a regular expression to build the start of a tiny chatbot

### Key points

* NLP can be very useful
* The meaning and intent of words can be deciphered by machines
* A smart NLP pipeline will be able to deal with ambiguity
* We can teach machines common sense knowledge without spending a lifetime training them
* Chatbots can be thought of as semantic search engines
* Regular expressions are useful for more than just search

---
<a name='1.0'></a><a id='1.0'></a>
# 1.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
import os
if not os.path.exists('setup'):
    os.mkdir('setup')

In [2]:
req_file = "setup/requirements_01.txt"

In [3]:
%%writefile {req_file}
isort
scikit-learn-intelex
watermark

Overwriting setup/requirements_01.txt


In [4]:
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

Running locally.


In [5]:
if IS_COLAB:
    from sklearnex import patch_sklearn
    patch_sklearn()

In [6]:
%%writefile setup/chp01_imports.py
import locale
import pprint
import random
import re
import warnings
from collections import Counter
from itertools import permutations

import numpy as np
import seaborn as sns
from tqdm.auto import tqdm
from watermark import watermark

Overwriting setup/chp01_imports.py


In [7]:
!isort setup/chp01_imports.py
!cat setup/chp01_imports.py

import locale
import pprint
import random
import re
import warnings
from collections import Counter
from itertools import permutations

import numpy as np
import seaborn as sns
from tqdm.auto import tqdm
from watermark import watermark


In [8]:
import locale
import pprint
import random
import re
import warnings
from collections import Counter
from itertools import permutations

import numpy as np
import seaborn as sns
from tqdm.auto import tqdm
from watermark import watermark

In [9]:
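def HR():
    """Print a horizontal rule to separate blocks of output."""
    print("-" * 40)


def getpreferredencoding(do_setlocale=True):
    """Always report UTF-8, regardless of the system locale."""
    return "UTF-8"


# Override the locale lookup so the preferred encoding is always UTF-8
locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('ignore')
sns.set_style("darkgrid")
tqdm.pandas(desc="progress-bar")
pp = pprint.PrettyPrinter(indent=4)
random.seed(23)  # fixed seed for reproducible random choices

print(watermark(iversions=True, globals_=globals(), python=True, machine=True))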

Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.34.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.3)
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

seaborn: 0.12.1
sys    : 3.8.12 (default, Dec 13 2021, 20:17:08) 
[Clang 13.0.0 (clang-1300.0.29.3)]
numpy  : 1.23.5
re     : 2.2.1



---
<a name='1.1'></a><a id='1.1'></a>
# 1.1 Natural language vs. programming language
<a href="#top">[back to top]</a>

Problem: Natural languages are not intended to be translated into a finite set of mathematical operations, as programming languages are.

Idea: Create representations that enable mathematical and statistical operations on natural languages. 

---
<a name='1.2'></a><a id='1.2'></a>
# 1.2 The magic
<a href="#top">[back to top]</a>

Problem: There is a massive amount of natural language text, so we need a way to automatically process it.

Idea: Use programming languages and machines to enable processing natural language text in a scalable manner.

<a name='1.2.1'></a><a id='1.2.1'></a>
## 1.2.1 Machines that converse
<a href="#top">[back to top]</a>

Problem: Natural languages cannot be directly translated into a precise set of mathematical operations.
    
Idea: Text contains information and instructions that can be extracted, then stored, indexed, searched, or acted on immediately.

<a name='1.2.2'></a><a id='1.2.2'></a>
## 1.2.2 The math
<a href="#top">[back to top]</a>

Problem: Processing natural languages to extract useful information entails tedious statistical bookkeeping.

Idea: Use machines to systematically extract structured numerical data from text in the form of vectors. Then, use mathematical operations (especially from linear algebra) to find and build on statistical relationships between words, instead of a system of grammatical rules.
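
As a minimal sketch of this idea, the cell below turns two short texts into word-count vectors and compares them with a dot product. The example sentences, the `vocab` ordering, and the helper name `to_count_vector` are assumptions made for illustration, not part of the chapter.

```python
from collections import Counter

import numpy as np

# Hypothetical example sentences, chosen only for illustration.
docs = ["good morning rosa", "good evening rosa"]

# Build a sorted vocabulary so every word gets a fixed vector position.
vocab = sorted({word for doc in docs for word in doc.split()})


def to_count_vector(text, vocab):
    """Return a NumPy vector of word counts ordered like `vocab`."""
    counts = Counter(text.split())
    return np.array([counts[word] for word in vocab])


vectors = [to_count_vector(doc, vocab) for doc in docs]

# Because every count here is 0 or 1, the dot product equals
# the number of words the two sentences share.
shared = int(vectors[0] @ vectors[1])
print(vocab)
print(vectors[0], vectors[1], shared)
```

No grammar rules are involved: the comparison relies only on which words occur and how often.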

---
<a name='1.3'></a><a id='1.3'></a>
# 1.3 Practical applications
<a href="#top">[back to top]</a>

Problem: What are some useful applications of NLP?

Idea: Search, editing, dialog, writing, email, text mining, law, news, attribution, sentiment analysis, behavior prediction, creative writing.

---
<a name='1.4'></a><a id='1.4'></a>
# 1.4 Language through a computer's "eyes"
<a href="#top">[back to top]</a>

Problem: How can a computer respond to text input?

Idea: A primitive approach is to create a finite state machine (FSM), which contains a manually created nested tree of conditionals. This is a pattern-based approach to NLP.

<a name='1.4.1'></a><a id='1.4.1'></a>
## 1.4.1 The language of locks
<a href="#top">[back to top]</a>

Problem: How to better understand a simple language processing machine?

Idea: Use the analogy of a mechanical lock and pattern matching.

<a name='1.4.2'></a><a id='1.4.2'></a>
## 1.4.2 Regular expression
<a href="#top">[back to top]</a>

Problem: How do regular expressions operate?

Idea: Regular expressions use a special class of formal language grammar called a *regular grammar*. A machine that processes this kind of language is a formal mathematical object called a finite state machine (FSM).
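
To make the connection concrete, here is a hedged sketch of a hand-written FSM that accepts the same strings as the toy pattern `ab*c`. The pattern, the state names, and the `accepts` helper are assumptions made for illustration; Python's `re` module does not necessarily process patterns this way internally.

```python
import re

# Transition table for the toy regular language "ab*c":
# state -> {input character -> next state}
TRANSITIONS = {
    "start": {"a": "seen_a"},
    "seen_a": {"b": "seen_a", "c": "done"},
    "done": {},
}
ACCEPTING = {"done"}


def accepts(text):
    """Return True if the FSM ends in an accepting state after reading text."""
    state = "start"
    for char in text:
        state = TRANSITIONS[state].get(char)
        if state is None:  # no valid transition: reject
            return False
    return state in ACCEPTING


# The FSM and the regular expression agree on every sample.
samples = ["ac", "abc", "abbbc", "a", "abca", "bc", ""]
for sample in samples:
    assert accepts(sample) == bool(re.fullmatch(r"ab*c", sample))
print([s for s in samples if accepts(s)])
```

The machine needs no memory beyond its current state, which is what limits it to regular languages.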

<a name='1.4.3'></a><a id='1.4.3'></a>
## 1.4.3 A simple chatbot
<a href="#top">[back to top]</a>

Problem: How to create a pattern matching chatbot?

Idea: Write a regular expression, which is a compact description of an FSM, that recognizes a small *regular language*: here, simple greetings. This is the pattern-based approach to NLP. The challenge is to write elegant patterns that capture what you want without too many lines of regex code.

In [10]:
r = "(hi|hello|hey)[ ]*([a-z]*)"
re.match(r, 'Hello Rosa', flags=re.IGNORECASE)

<re.Match object; span=(0, 10), match='Hello Rosa'>

In [11]:
re.match(r, "hi ho, hi ho, it's off to work ...", flags=re.IGNORECASE)

<re.Match object; span=(0, 5), match='hi ho'>

In [12]:
re.match(r, "hey, what's up", flags=re.IGNORECASE)

<re.Match object; span=(0, 3), match='hey'>

In [13]:
# More detailed regex
r = r"[^a-z]*([y]o|[h']?ello|ok|hey|(good[ ])?(morn[gin']{0,3}|"\
    r"afternoon|even[gin']{0,3}))[\s,;:]{1,3}([a-z]{1,20})"

In [14]:
re_greeting = re.compile(r, flags=re.IGNORECASE)

In [15]:
re_greeting.match('Hello Rosa')

<re.Match object; span=(0, 10), match='Hello Rosa'>

In [16]:
re_greeting.match('Hi Rosa')

In [17]:
re_greeting.match('Hello Rosa').groups()

('Hello', None, None, 'Rosa')

In [18]:
re_greeting.match("Good morning Rosa")

<re.Match object; span=(0, 17), match='Good morning Rosa'>

In [19]:
re_greeting.match("Good Manning Rosa")

In [20]:
re_greeting.match('Good evening Rosa Parks').groups()

('Good evening', 'Good ', 'evening', 'Rosa')

In [21]:
re_greeting.match("Good Morn'n Rosa")

<re.Match object; span=(0, 16), match="Good Morn'n Rosa">

In [22]:
re_greeting.match("yo Rosa")

<re.Match object; span=(0, 7), match='yo Rosa'>

In [23]:
# Add an output generator
my_names = {'rosa', 'rose', 'chatty', 'chatbot', 'bot', 'chatterbot'}

curt_names = {'hal', 'you', 'u'}
greeter_name = ''

# Manually assign input
match = re_greeting.match("Hello Rosa")

if match:
    # The last capture group holds the name that follows the greeting
    at_name = match.groups()[-1]
    # Lowercase before both lookups so 'HAL' and 'Rosa' are handled alike
    if at_name.lower() in curt_names:
        print("Good one.")
    elif at_name.lower() in my_names:
        greeter_name = at_name
        print(f"Hi {greeter_name}, How are you?")

Hi Rosa, How are you?


<a name='1.4.4'></a><a id='1.4.4'></a>
## 1.4.4 Another way
<a href="#top">[back to top]</a>

Problem: How to replace the fragile pattern-based approach?

Idea: Use a statistical or machine-learning approach, in which we use vector representations of words.

This enables the simple but powerful mechanism of measuring the difference or similarity in meaning between character sequences.

In [24]:
Counter("Guten Morgen Rosa".split())

Counter({'Guten': 1, 'Morgen': 1, 'Rosa': 1})

In [25]:
Counter("Good morning, Rosa!".split())

Counter({'Good': 1, 'morning,': 1, 'Rosa!': 1})
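
The second result shows that a plain `split()` leaves punctuation glued to the words (`'morning,'`, `'Rosa!'`). The sketch below is one possible way to normalize tokens before counting and then score how much two bags of words overlap. The helpers `bag_of_words` and `jaccard` are hypothetical names, and Jaccard overlap is only a simple stand-in here: it measures shared words, not shared meaning.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def jaccard(bag_a, bag_b):
    """Ratio of shared distinct words to all distinct words."""
    words_a, words_b = set(bag_a), set(bag_b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


english = bag_of_words("Good morning, Rosa!")
german = bag_of_words("Guten Morgen Rosa")
print(english)
print(jaccard(english, german))
```

The two greetings mean the same thing but share only the name, so a word-overlap score stays low. Closing that gap is what motivates richer vector representations.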

---
<a name='1.5'></a><a id='1.5'></a>
# 1.5 A brief overflight of hyperspace
<a href="#top">[back to top]</a>

Problem: How to constrain a huge amount of possible patterns into a manageable number?

Idea: Create a reduced dimension vector space model of messages.

This lets us label each statement with a vector of continuous float values. The number of dimensions should be much smaller than the number of possible statements, and statements that mean the same thing should have similar vector values. We can simplify further by clustering similar statements together where appropriate. The simplest form of a vector space model is the "one-hot encoded" model.
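
A minimal sketch of one-hot encoding follows, assuming a made-up message and a vocabulary built from that message alone. Each token becomes a row with a single 1 in the column for its word.

```python
import numpy as np

# Hypothetical message, used only for illustration.
tokens = "good morning rosa good day".split()

# One column per distinct word, in sorted order.
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}

# One row per token; each row has a single 1 in that word's column.
one_hot = np.zeros((len(tokens), len(vocab)), dtype=int)
for row, word in enumerate(tokens):
    one_hot[row, index[word]] = 1

print(vocab)
print(one_hot)
# Summing the rows collapses the sequence into a bag-of-words count vector.
print(one_hot.sum(axis=0))
```

The one-hot table keeps word order but needs one dimension per vocabulary word, which is why it is a starting point and not yet a reduced-dimension model.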

---
<a name='1.6'></a><a id='1.6'></a>
# 1.6 Word order and grammar
<a href="#top">[back to top]</a>

Problem: Is word order important?

Idea: Word order matters little when encoding the general sense and sentiment of a short sentence. It matters a great deal in longer sentences, which rely on word order to convey logical relationships between things.

In [26]:
# 3 words can be arranged in factorial(3) == 6 different orders
[" ".join(combo) for combo in permutations("Good morning Rosa!".split(), 3)]

['Good morning Rosa!',
 'Good Rosa! morning',
 'morning Good Rosa!',
 'morning Rosa! Good',
 'Rosa! Good morning',
 'Rosa! morning Good']

In [27]:
s = """Find textbooks with titles containing 'NLP', or 'natural' and 'language', or 'computational' and 'linguistics'."""
len(set(s.split()))

12

In [28]:
# Equivalent to factorial(12)
np.arange(1, 12 + 1).prod()

479001600
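
The sketch below, which uses only the standard library, illustrates both halves of the point under one assumption: that the text is represented as a bag of words. Every ordering of the words collapses to the same word counts, while the number of orderings that a word-order-aware model must tell apart grows factorially.

```python
import math
from collections import Counter
from itertools import permutations

words = "Good morning Rosa!".split()

# Every ordering of the words produces the same bag-of-words counts.
bags = [Counter(order) for order in permutations(words)]
assert all(bag == bags[0] for bag in bags)

# The number of orderings grows factorially with the number of distinct words.
for n in (3, 12):
    print(n, math.factorial(n))
```

`math.factorial` returns an exact Python integer, so it does not overflow for larger word counts the way a fixed-width NumPy integer product can.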

---
<a name='1.7'></a><a id='1.7'></a>
# 1.7 A chatbot natural language pipeline
<a href="#top">[back to top]</a>

Problem: What machinery does a chatbot need for processing and state management?

Idea: We need stages for parsing, analysis, generation, and execution. We also need a database to maintain a memory of past statements and responses.

1. Parse: Extract features, structured numerical data, from natural language text.
2. Analyze: Generate and combine features by scoring text for sentiment, grammaticality, and semantics.
3. Generate: Compose possible responses using templates, search, or language models.
4. Execute: Plan statements based on conversation history and objectives, and select the next response.
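
The four stages above can be sketched as a chain of small functions. Everything in this cell is a hypothetical illustration: the function bodies, the greeting list, the response templates, and the plain Python list standing in for the database are assumptions, not the chapter's implementation.

```python
import re


def parse(text):
    """Parse: extract simple features (lowercased word tokens) from text."""
    return re.findall(r"[a-z']+", text.lower())


def analyze(tokens):
    """Analyze: score the tokens; here, only flag whether a greeting is present."""
    greetings = {"hi", "hello", "hey", "yo"}
    return {"tokens": tokens, "is_greeting": any(t in greetings for t in tokens)}


def generate(features):
    """Generate: propose candidate responses from fixed templates."""
    if features["is_greeting"]:
        return ["Hello!", "Hi, how are you?"]
    return ["Sorry, I did not understand that."]


def execute(candidates, history):
    """Execute: pick the first candidate not already used, then record it."""
    unused = [c for c in candidates if c not in history]
    response = (unused or candidates)[0]
    history.append(response)
    return response


history = []  # stands in for the database of past responses
for statement in ["Hello Rosa", "hey again", "what time is it"]:
    print(execute(generate(analyze(parse(statement))), history))
```

Keeping the conversation history outside the stage functions lets the last stage avoid repeating itself while the earlier stages stay stateless.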


---
<a name='1.8'></a><a id='1.8'></a>
# 1.8 Processing in depth
<a href="#top">[back to top]</a>

Problem: How to conceptualize an NLP pipeline?

Idea: Think of the stages of an NLP pipeline as layers in a feed-forward neural network. Deep learning is all about creating more complex models and behavior by adding processing layers. We add these layers to the standard two-layer machine learning model architecture of feature extraction and modeling.

Misc: What is *inference*?

Inferences are logical extrapolations from a set of conditions detected in the environment, like the logic contained in the statement of a chatbot user. 


---
<a name='1.9'></a><a id='1.9'></a>
# 1.9 Natural language IQ
<a href="#top">[back to top]</a>

Problem: How to best measure the power of an NLP pipeline?

Idea: Conceptualize this by measuring the breadth and depth of complexity (as two axes) of the NLP pipeline.