<a href="https://colab.research.google.com/github/donghuna/AI-Expert/blob/main/%5BStudent%2C_Samsung%5D_Day1_2_Language_Model_and_Bayes_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 2: Language Model and Bayes Classifier

## Language Model

A language model assigns probabilities to sequences of words. Because it can score how likely a sentence is and predict which word is likely to come next, it underlies tasks such as text generation, translation, and sentiment analysis.

Contents of this section:
*   Practice: Tokenization
*   Practice: Language Model



### Setting Up the Environment
Install the required Python packages.

In [None]:
!pip install nltk
!pip install tokenizers



In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('book')

## Practice: Tokenization

Tokenization is a fundamental step in natural language processing (NLP) that involves breaking down text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the specific task and the method used. The process of tokenization helps in structuring the text in a way that it can be easily analyzed and processed by NLP models.
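
As a rough illustration of these granularities, the sketch below splits one sentence at the whitespace, word-and-punctuation, and character level using only the standard library. The regular expression is an assumption chosen for this example; it is not the rule set that NLTK applies.

```python
import re

text = 'To be, or not to be, that is the question.'

# Whitespace split: punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

# Word level: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r'\w+|[^\w\s]', text)

# Character level: every character of a word becomes its own token.
char_tokens = list('question')

print(whitespace_tokens)
print(word_tokens)
print(char_tokens)
```

Subword tokenizers such as BPE sit between the word-level and character-level results.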

### Rule-based Tokenizer
The Natural Language Toolkit (NLTK) is a powerful Python library designed for working with human language data, often referred to as Natural Language Processing (NLP). NLTK provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

In [None]:
from nltk.tokenize import word_tokenize

text = 'To be, or not to be, that is the question.' # TODO: Observe how the result changes by inputting different strings.
tokens = word_tokenize(text)
print(tokens)

['To', 'be', ',', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question', '.']


### Subword Tokenizer
Byte Pair Encoding (BPE) began as a data compression technique that iteratively merges the most frequent pair of adjacent bytes. Used as a tokenizer, it starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols in a text corpus until a specified vocabulary size (or number of merges) is reached. Its main advantage is that rare or unseen words are broken into smaller, more frequent subword units, which reduces the number of out-of-vocabulary (OOV) words and helps with morphologically rich languages.

In [None]:
from collections import Counter, defaultdict

def get_stats(vocab):
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = ... # TODO: Separate a word into symbols based on spaces
        for i in range(len(symbols)-1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs

In [None]:
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 12
pairs = get_stats(vocab)
pairs

defaultdict(int,
            {('l', 'o'): 7,
             ('o', 'w'): 7,
             ('w', '</w>'): 5,
             ('w', 'e'): 8,
             ('e', 'r'): 2,
             ('r', '</w>'): 2,
             ('n', 'e'): 6,
             ('e', 'w'): 6,
             ('e', 's'): 9,
             ('s', 't'): 9,
             ('t', '</w>'): 9,
             ('w', 'i'): 3,
             ('i', 'd'): 3,
             ('d', 'e'): 3})

In [None]:
best = max(pairs, key=pairs.get)
best

('e', 's')

In [None]:
def merge_vocab(pair, v_in):
    '''
    Examples:
        pair: ['l', 'o']
        v_in: {'l o w </w>': 5, 'l o w e r </w>': 2}
        -------------------------
        v_out: {'lo w </w>': 5, 'lo w e r </w>': 2}
    '''
    v_out = {}
    bigram = ... # TODO: fill here
    replacement = ... # TODO: fill here
    for word in v_in:
        w_out = word.replace(...) # TODO: fill here
        v_out[w_out] = v_in[word]
    return v_out

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 12
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print('Merge:', best)
    print('Vocab:', vocab)


Merge: ('e', 's')
Vocab: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}
Merge: ('es', 't')
Vocab: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}
Merge: ('est', '</w>')
Vocab: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
Merge: ('l', 'o')
Vocab: {'lo w </w>': 5, 'lo w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
Merge: ('lo', 'w')
Vocab: {'low </w>': 5, 'low e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
Merge: ('n', 'e')
Vocab: {'low </w>': 5, 'low e r </w>': 2, 'ne w est</w>': 6, 'w i d est</w>': 3}
Merge: ('ne', 'w')
Vocab: {'low </w>': 5, 'low e r </w>': 2, 'new est</w>': 6, 'w i d est</w>': 3}
Merge: ('new', 'est</w>')
Vocab: {'low </w>': 5, 'low e r </w>': 2, 'newest</w>': 6, 'w i d est</w>': 3}
Merge: ('low', '</w>')
Vocab: {'low</w>': 5, 'low e r </w>': 2, 'newest</w>': 6, 'w i d est</w>': 3}
Merge: ('w', 'i')
Vocab: {'low</w>': 5, 'low e r </w>': 2, 'newest</w
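
Once merges have been learned, a new word is segmented by replaying them in order. The sketch below is one way to do that, assuming the first ten merges printed above are written out by hand; `apply_merges` is a hypothetical helper, not part of the exercise code, and it works on a list of symbols rather than on a space-joined string.

```python
def apply_merges(word, merges):
    # Start from single characters plus the end-of-word marker.
    symbols = list(word) + ['</w>']
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Merge only when the two symbols are adjacent and match exactly.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Merges copied from the training output, in the order they were learned.
merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w'),
          ('n', 'e'), ('ne', 'w'), ('new', 'est</w>'), ('low', '</w>'), ('w', 'i')]

print(apply_merges('lowest', merges))  # ['low', 'est</w>']
print(apply_merges('newest', merges))  # ['newest</w>']
```

The word `lowest` never appears in the toy vocabulary, yet it is covered by two learned subwords, which is how BPE reduces OOV words.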

### NLTK vs. BPE

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, processors

import re
from collections import Counter, defaultdict

class BPETokenizer:
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.vocab = {}   # merged symbol -> pair frequency at merge time
        self.merges = []  # learned merges, in the order they were learned

    def get_stats(self, vocab):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = defaultdict(int)
        for token, freq in vocab.items():
            symbols = token.split()
            for i in range(len(symbols)-1):
                pairs[(symbols[i], symbols[i+1])] += freq
        return pairs

    def merge_pair(self, pair, token):
        # Merge `pair` only where it lines up with whole symbols, so that
        # ('e', 's') does not fire inside a longer symbol such as 'ae s'.
        pattern = r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)'
        return re.sub(pattern, lambda m: ''.join(pair), token)

    def merge_vocab(self, pair, vocab):
        return {self.merge_pair(pair, token): freq for token, freq in vocab.items()}

    def fit(self, text):
        words = re.findall(r'\w+', text)
        vocab = Counter(' '.join(word) for word in words)

        while len(self.vocab) < self.vocab_size:
            pairs = self.get_stats(vocab)
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            vocab = self.merge_vocab(best, vocab)
            self.merges.append(best)
            self.vocab[''.join(best)] = pairs[best]

        self.vocab = dict(sorted(self.vocab.items(), key=lambda item: item[1], reverse=True))

    def tokenize(self, word):
        # Replay the learned merges in training order; symbols that were
        # never merged during fit() stay split.
        word = ' '.join(word)
        for pair in self.merges:
            word = self.merge_pair(pair, word)
        return word.split()

text = '''
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despised love, the law's delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscovered country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pitch and moment
With this regard their currents turn awry,
And lose the name of action.--Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remember'd.
'''
text = ' '.join(text.split())

words = word_tokenize(text)
print('NLTK Tokenization:', words)

tokenizer = BPETokenizer(vocab_size=50)
tokenizer.fit(text)
words = []
for word in re.findall(r'\w+', text):
    words.extend(tokenizer.tokenize(word))
print('BPE Tokenization:', words)

## Practice: Language Model
In this tutorial, we will explore language models, how they compute the probability of words, and the application of the chain rule in probability. We will also discuss the practical considerations of using unigram and n-gram models, including the trade-offs between them. To make things more concrete, we will use examples from Shakespeare's works.

In [None]:
import numpy as np
import pandas as pd
from collections import defaultdict, Counter
from matplotlib import pyplot as plt
from nltk.tokenize import RegexpTokenizer
import math
import random

tokenizer = RegexpTokenizer(r'\w+')  # raw string avoids an invalid-escape warning
text = nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt').lower()
words = tokenizer.tokenize(text)
print(words)

In [None]:
text[:10]

### Markov Assumption and Chain Rule

The chain rule of probability states that:

\[ P(A, B, C) = P(A) P(B|A) P(C|A, B) \]

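Under a Markov assumption, each word depends on at most a fixed number of preceding words (none for a unigram model, one for a bigram model), so the chain rule is approximated as: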

- Unigram: \( P(A, B, C) \approx P(A) P(B) P(C) \)
- Bigram: \( P(A, B, C) \approx P(A) P(B|A) P(C|B) \)
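
To make the two approximations concrete, the sketch below scores the phrase "to be" on a six-word toy corpus. The corpus and the helper names `unigram_prob` and `bigram_prob` are assumptions made for this illustration, and words or contexts that never occur in the corpus are not handled.

```python
from collections import Counter
from fractions import Fraction

corpus = 'to be or not to be'.split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
# How often each word occurs as the first element of a bigram.
context_counts = Counter(first for first, _ in bigram_counts.elements())

def unigram_prob(sentence):
    # P(w_1) P(w_2) ... P(w_n)
    prob = Fraction(1)
    for word in sentence:
        prob *= Fraction(unigram_counts[word], len(corpus))
    return prob

def bigram_prob(sentence):
    # P(w_1) P(w_2 | w_1) ... P(w_n | w_{n-1})
    prob = Fraction(unigram_counts[sentence[0]], len(corpus))
    for prev, word in zip(sentence, sentence[1:]):
        prob *= Fraction(bigram_counts[(prev, word)], context_counts[prev])
    return prob

print(unigram_prob(['to', 'be']))  # 1/9
print(bigram_prob(['to', 'be']))   # 1/3
```

The bigram score is higher because "be" always follows "to" in this corpus, which the unigram model cannot see.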


### Unigram Model

In a unigram model, we assume that the probability of each word is independent of the previous words. Therefore, the probability of a sequence of words \( w_1, w_2, \ldots, w_n \) is simply the product of the probabilities of each word:

\[ P(w_1, w_2, \ldots, w_n) = P(w_1) P(w_2) \ldots P(w_n) \]



In [None]:
unigram_pairs = words
unigram_counts = Counter(unigram_pairs)
unigram_total = ... # TODO: Total count of all unigrams

unigram_probs = {pair: ... for pair in unigram_counts} # TODO: fill here
sorted_unigram_probs = dict(sorted(unigram_probs.items(), key=lambda item: item[1], reverse=True))
df = pd.DataFrame.from_dict(data=sorted_unigram_probs, orient='index')
df

In [None]:
df[0].plot(kind='line', figsize=(8, 4), title=0)
plt.gca().spines[['top', 'right']].set_visible(False)

In [None]:
def unigram_model(start_word, max_token=100):
    result = [start_word]
    for _ in range(max_token):
        best_pair = ... # TODO: Select the word with the highest probability
        result.append(best_pair)
    return result
generated_text = unigram_model('the')
print(' '.join(generated_text))
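
Greedy selection under a unigram model returns the same word at every step, because the most probable word does not depend on what came before. A common alternative is to sample each word in proportion to its probability. The sketch below contrasts the two on a toy word list; `toy_words`, `greedy_unigram`, and `sampled_unigram` are hypothetical names for this example and are separate from the exercise above.

```python
import random
from collections import Counter

toy_words = 'to be or not to be that is the question'.split()
toy_counts = Counter(toy_words)
vocabulary = list(toy_counts)
weights = [toy_counts[w] for w in vocabulary]

def greedy_unigram(max_token=5):
    # The most frequent word never changes, so greedy decoding repeats it.
    best = max(toy_counts, key=toy_counts.get)
    return [best] * max_token

def sampled_unigram(max_token=5, seed=0):
    # Draw each word independently, in proportion to its count.
    rng = random.Random(seed)
    return rng.choices(vocabulary, weights=weights, k=max_token)

print(greedy_unigram())
print(sampled_unigram())
```

Sampling produces varied output, but the words are still unrelated to each other, since a unigram model has no context.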

### Bigram Model

In a bigram model, we assume that the probability of each word depends only on the previous word. Therefore, the probability of a sequence of words \( w_1, w_2, \ldots, w_n \) is given by:

\[ P(w_1, w_2, \ldots, w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) \ldots P(w_n | w_{n-1}) \]


In [None]:
bigram_pairs = []
... # TODO: Make bigram pairs (e.g. a b c d -> [(a, b), (b, c), (c, d)])
bigram_counts = Counter(bigram_pairs)
bigram_total = sum(bigram_counts.values())
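# Relative frequency of each pair, count(w1, w2) / total. For a fixed w1 this ranks
# next words in the same order as P(w2 | w1) = count(w1, w2) / count(w1).
bigram_probs = {pair: bigram_counts[pair] / bigram_total for pair in bigram_counts}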
sorted_bigram_probs = dict(sorted(bigram_probs.items(), key=lambda item: item[1], reverse=True))
df = pd.DataFrame.from_dict(data=sorted_bigram_probs, orient='index')
df

In [None]:
df[0].plot(kind='line', figsize=(8, 4), title=0)
plt.gca().spines[['top', 'right']].set_visible(False)

In [None]:
def bigram_model(start_word, max_token=100):
    result = [start_word]
    for _ in range(max_token):
        bigram_subset_probs = ... # TODO: Select a subset of bigram_probs where the first element of the pair matches the last word of the result
        best_pair = max(bigram_subset_probs, key=bigram_subset_probs.get)
        result.append(best_pair[1])
    return result
generated_text = bigram_model('the')
print(' '.join(generated_text))

### General N-gram Model

An n-gram model generalizes this idea by assuming that the probability of each word depends only on the previous $n-1$ words. For a sequence of $m$ words $w_1, w_2, \ldots, w_m$, the n-gram model gives:

$ P(w_1, w_2, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i | w_{i-n+1}, \ldots, w_{i-1}) $

At the start of the sequence the context is shorter: for $i < n$ it contains only $w_1, \ldots, w_{i-1}$.

In [None]:
n = 10
ngram_pairs = []
for i in range(1, len(words)):
    ngram_pairs.append((' '.join(words[max(0, i-n+1):i]), words[i]))
ngram_counts = Counter(ngram_pairs)
ngram_total = sum(ngram_counts.values())
ngram_probs = {pair: ngram_counts[pair] / ngram_total for pair in ngram_counts}
sorted_ngram_probs = dict(sorted(ngram_probs.items(), key=lambda item: item[1], reverse=True))
df = pd.DataFrame.from_dict(data=sorted_ngram_probs, orient='index')
df

In [None]:
df[0].plot(kind='line', figsize=(8, 4), title=0)
plt.gca().spines[['top', 'right']].set_visible(False)

In [None]:
def ngram_model(start_word, max_token=100, n=10):
    result = [start_word]
    for _ in range(max_token):
        context = ... # TODO: Join the last (n-1) words of result with spaces
        ngram_subset_probs = {pair : ngram_probs[pair] for pair in ngram_probs if pair[0] == context}
        best_pair = max(ngram_subset_probs, key=ngram_subset_probs.get)
        result.append(best_pair[-1])
    return result
generated_text = ngram_model('the')
print(' '.join(generated_text))

### Trade-offs of Unigram vs. N-gram Models

- **Unigram Model**: Simple and computationally efficient, but ignores context, leading to less accurate predictions.
- **Bigram Model**: Considers the immediate context (previous word), providing better predictions than unigram but still limited in capturing longer dependencies.
- **N-gram Model**: Captures longer contexts, improving prediction accuracy at the cost of increased computational complexity and data sparsity.
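
The data-sparsity cost can be seen even on a tiny example. The sketch below counts, for n = 1, 2, 3, how many distinct n-grams a ten-word toy text contains and what share of them occur exactly once. The toy text and the helper names are assumptions for illustration only.

```python
from collections import Counter

toy_tokens = 'to be or not to be that is the question'.split()

def ngram_counts_for(tokens, n):
    # Count every run of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def singleton_fraction(tokens, n):
    # Share of distinct n-grams that occur exactly once.
    counts = ngram_counts_for(tokens, n)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

for n in (1, 2, 3):
    print(n, len(ngram_counts_for(toy_tokens, n)), singleton_fraction(toy_tokens, n))
# 1 8 0.75
# 2 8 0.875
# 3 8 1.0
```

By n = 3 every n-gram in this text is seen only once, so the model has a single observed continuation for each context and ends up reproducing the training text.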


## N-gram for Email Classification

In [None]:
!pip install --quiet gdown pandas nltk

# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

# download spam/ham dataset
import gdown
from pathlib import Path

url = 'https://drive.google.com/uc?id=1GaUS8wMlWQwqhgCX2wsvhOkQ8_Uw_x2r'
dataset_path = Path('/content/drive/MyDrive/spam_dataset.tsv')
gdown.download(url, str(dataset_path), quiet=False)

In [None]:
import pandas as pd
df = pd.read_csv(dataset_path, sep="\t")
df

In [None]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(df["text"],df["label_num"], test_size=0.2, random_state=10)

In [None]:
train_X.iloc[0]  # positional access; label 0 may have landed in the test split

In [None]:
# concat all train_X to make .txt
# TODO: {text} {spam/ham} \n
email_text_train = ""
for x, l in zip(train_X, train_y):
  ...

In [None]:
email_words_train = tokenizer.tokenize(email_text_train)

In [None]:
email_text_train[:10]

In [None]:
email_words_train[:3]

In [None]:
# Token clean-up: strip punctuation, then drop stopwords, one-character tokens, and numbers
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

from string import punctuation

PUNCTUATIONS = set(punctuation)

def regularize_tokens(tokens):
    tokens = [t.strip() for t in tokens]
    tokens = [t.strip("".join(PUNCTUATIONS)) for t in tokens]
    tokens = [t for t in tokens if len(t) > 1]
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [t for t in tokens if t not in PUNCTUATIONS]
    tokens = [t for t in tokens if not re.match(r"^\d+?\.\d+?$", t)]  # e.g., 1.23
    tokens = [t for t in tokens if not re.match(r"^\d+?\,\d+?$", t)]  # e.g., 1,234
    tokens = [t for t in tokens if not t.isnumeric()]  # e.g., 123
    return tokens

In [None]:
email_words_train = regularize_tokens(email_words_train)

In [None]:
# make ngram model
...

In [None]:
train_df[0].plot(kind='line', figsize=(8, 4), title=0)
plt.gca().spines[['top', 'right']].set_visible(False)

In [None]:
def ngram_model(start_words, n=10):
    result = start_words
    context = ' '.join(result[-n+1:])
    ngram_subset_probs = {pair : ngram_probs[pair] for pair in ngram_probs if pair[0] == context}
    return ngram_subset_probs

In [None]:
regularized_email_example = ... # TODO: tokenize and regularize
regularized_email_example

In [None]:
generated_text_probs = ngram_model(
    regularized_email_example
)
generated_text_probs

In [None]:
from tqdm import tqdm
test_email_preds = []
for email_text in tqdm(test_X):
  email_words = regularize_tokens(tokenizer.tokenize(email_text))  # same RegexpTokenizer as the training text
  ngram_output = ngram_model(email_words)
  # TODO: if ngram output exists && if the highest probability shows spam for next token
  #       return spam
  #       else
  #       return ham
  if ...:
    if ...:
      test_email_preds.append(1)
      continue

  test_email_preds.append(0)

In [None]:
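# Evaluate the model: compare the predictions with the held-out labels
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(test_y, test_email_preds))
print(classification_report(test_y, test_email_preds))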
