# Chunking

We implement a simple, document-level form of chunking: each document is split into chunks of a fixed number of tokens, with some overlap between consecutive chunks. The main reasons for this choice are:
<ul>
    <li>The code is much simpler.</li>
    <li>It is very fast, because chunk boundaries are not computed with embedding-based or hierarchical-clustering-based similarity measures.</li>
    <li>Chunks are drawn from a single document, and we posit that most of the text in one document relates to at least one umbrella topic, so the chunks are inherently somewhat similar.</li>
</ul>
Other forms of chunking could include:
<ul>
    <li>Embedding-based similarity chunking</li>
    <li>Chunking by similarity of semantic meaning</li>
    <li>Agentic chunking</li>
</ul>
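
The fixed-size chunking with overlap described above can be sketched as follows. This is a minimal illustration and not necessarily the implementation used here: the function names `chunk_tokens` and `chunk_text`, the default sizes, and the use of whitespace-separated words as "tokens" are all assumptions made for the example. A model tokenizer could be substituted for `str.split`.

```python
def chunk_tokens(tokens, chunk_size=200, overlap=20):
    """Split a token list into fixed-size windows that share `overlap` tokens.

    Hypothetical helper: the name and default sizes are illustrative only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document,
        # so no trailing chunk made only of overlap is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(text, chunk_size=200, overlap=20):
    """Chunk a document string; whitespace tokens are an assumption."""
    tokens = text.split()
    return [" ".join(c) for c in chunk_tokens(tokens, chunk_size, overlap)]


if __name__ == "__main__":
    for chunk in chunk_text("a b c d e", chunk_size=3, overlap=1):
        print(chunk)
```

Because no similarity scores are computed, the cost is a single pass over the tokens. The overlap means a sentence cut at a chunk boundary still has its first part repeated at the start of the next chunk.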

In [1]:
sample = [{'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 0,
  'text': 'Text block:\n# Byte-Pair Encoding tokenization\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 1,
  'text': 'Text block:\nBPE training starts by computing the unique set of words used in the corpus (after the normalization and pre-tokenization steps are completed), then building the vocabulary by taking all the symbols used to write those words. As a very simple example, let’s say our corpus uses these five words:\n\nThis material was adapted from the Huggingface tutorial available here:\n\nhttps://huggingface.co/learn/nlp-course/chapter6/5?fw=pt\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 2,
  'text': 'Code block:\ncorpus = ["hug", "pug", "pun", "bun", "hugs"]\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 3,
  'text': "Code block:\nvocab = set([ c for w in corpus for c in w ])\nprint(vocab)\nOutput:\n{'n', 'g', 'h', 'p', 'b', 's', 'u'}\n\n"},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 4,
  'text': 'Text block:\nAfter getting this base vocabulary, we add new tokens until the desired vocabulary size is reached by learning merges, which are rules to merge two elements of the existing vocabulary together into a new one. So, at the beginning these merges will create tokens with two characters, and then, as training progresses, longer subwords.\n\nAt any step during the tokenizer training, the BPE algorithm will search for the most frequent pair of existing tokens (by “pair,” here we mean two consecutive tokens in a word). That most frequent pair is the one that will be merged, and we rinse and repeat for the next step.\n\nGoing back to our previous example, let’s assume the words had the following frequencies:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 5,
  'text': 'Code block:\ncorpus = [("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)]\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 6,
  'text': 'Text block:\n"hug" was present 10 times in the corpus, "pug" 5 times, "pun" 12 times, "bun" 4 times, and "hugs" 5 times. We start the training by splitting each word into characters (the ones that form our initial vocabulary) so we can see each word as a list of tokens:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 7,
  'text': 'Code block:\ncorpus = [("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)]\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 8,
  'text': 'Text block:\nThen we look at pairs. The pair ("h", "u") is present in the words "hug" and "hugs", so 15 times total in the corpus. It’s not the most frequent pair, though: the most frequent is ("u", "g"), which is present in "hug", "pug", and "hugs", for a grand total of 20 times in the vocabulary.\n\nThus, the first merge rule learned by the tokenizer is ("u", "g") -> "ug", which means that "ug" will be added to the vocabulary, and the pair should be merged in all the words of the corpus. At the end of this stage, the vocabulary and corpus look like this:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 9,
  'text': 'Code block:\nvocab = ["b", "g", "h", "n", "p", "s", "u", "ug"]\ncorpus = [("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)]\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 10,
  'text': 'Text block:\nNow we have some pairs that result in a token longer than two characters: the pair ("h", "ug"), for instance (present 15 times in the corpus). The most frequent pair at this stage is ("u", "n"), however, present 16 times in the corpus, so the second merge rule learned is ("u", "n") -> "un". Adding that to the vocabulary and merging all existing occurrences leads us to:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 11,
  'text': 'Code block:\nvocab = ["b", "g", "h", "n", "p", "s", "u", "ug", "un"]\ncorpus = [("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)]\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 12,
  'text': 'Text block:\nNow the most frequent pair is ("h", "ug"), so we learn the merge rule ("h", "ug") -> "hug", which gives us our first three-letter token. After the merge, the corpus looks like this:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 13,
  'text': 'Code block:\nvocab = ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]\ncorpus = [("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)]\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 14,
  'text': 'Text block:\nAnd we continue like this until we reach the desired vocabulary size. Usually we provide the number of merges we want to obtain a particular vocabulary size.\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 15,
  'text': 'Text block:\n### Tokenization Algorithm\n\nTokenization follows the training process closely, in the sense that new inputs are tokenized by applying the following steps:\n\n1. Normalization\n1. Pre-tokenization\n1. Splitting the words into individual characters\n1. Applying the merge rules learned in order on those splits\n\nLet’s take the example we used during training, with the three merge rules learned:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 16,
  'text': 'Text block:\n```\n("u", "g") -> "ug"\n("u", "n") -> "un"\n("h", "ug") -> "hug"\n```\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 17,
  'text': 'Text block:\nThe word "bug" will be tokenized as ["b", "ug"]. "mug", however, will be tokenized as ["[UNK]", "ug"] since the letter "m" was not in the base vocabulary. Likewise, the word "thug" will be tokenized as ["[UNK]", "hug"]: the letter "t" is not in the base vocabulary, and applying the merge rules results first in "u" and "g" being merged and then "hu" and "g" being merged.\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 18,
  'text': 'Text block:\n**Question**: How do you think the word "unhug" will be tokenized?\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 19,
  'text': 'Text block:\n### Implementing BPE for sub-word tokenization\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 20,
  'text': 'Text block:\nInstall the Transformers, Datasets, and Evaluate libraries to run this notebook.\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 21,
  'text': 'Code block:\n!pip install datasets evaluate transformers[sentencepiece]\nOutput:\nCollecting datasets\n  Using cached datasets-3.0.1-py3-none-any.whl.metadata (20 kB)\nCollecting evaluate\n  Using cached evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)\nRequirement already satisfied: transformers[sentencepiece] in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (4.45.1)\nRequirement already satisfied: filelock in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from datasets) (3.16.1)\nRequirement already satisfied: numpy>=1.17 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from datasets) (1.26.4)\nCollecting pyarrow>=15.0.0 (from datasets)\n  Using cached pyarrow-17.0.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (3.3 kB)\nCollecting dill<0.3.9,>=0.3.0 (from datasets)\n  Using cached dill-0.3.8-py3-none-any.whl.metadata (10 kB)\nRequirement already satisfied: pandas in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from datasets) (2.2.3)\nRequirement already satisfied: requests>=2.32.2 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from datasets) (2.32.3)\nRequirement already satisfied: tqdm>=4.66.3 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from datasets) (4.66.5)\nCollecting xxhash (from datasets)\n  Using cached xxhash-3.5.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (12 kB)\nCollecting multiprocess (from datasets)\n  Using cached multiprocess-0.70.17-py312-none-any.whl.metadata (7.2 kB)\nCollecting fsspec<=2024.6.1,>=2023.1.0 (from fsspec[http]<=2024.6.1,>=2023.1.0->datasets)\n  Using cached fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)\nCollecting aiohttp (from datasets)\n  Downloading aiohttp-3.10.9-cp312-cp312-macosx_11_0_arm64.whl.metadata (7.6 kB)\nRequirement 
already satisfied: huggingface-hub>=0.22.0 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from datasets) (0.25.1)\nRequirement already satisfied: packaging in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from datasets) (24.1)\nRequirement already satisfied: pyyaml>=5.1 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from datasets) (6.0.2)\nRequirement already satisfied: regex!=2019.12.17 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from transformers[sentencepiece]) (2024.9.11)\nRequirement already satisfied: safetensors>=0.4.1 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from transformers[sentencepiece]) (0.4.5)\nRequirement already satisfied: tokenizers<0.21,>=0.20 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from transformers[sentencepiece]) (0.20.0)\nCollecting protobuf (from transformers[sentencepiece])\n  Using cached protobuf-5.28.2-cp38-abi3-macosx_10_9_universal2.whl.metadata (592 bytes)\nCollecting sentencepiece!=0.1.92,>=0.1.91 (from transformers[sentencepiece])\n  Using cached sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (7.7 kB)\nCollecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)\n  Using cached aiohappyeyeballs-2.4.3-py3-none-any.whl.metadata (6.1 kB)\nCollecting aiosignal>=1.1.2 (from aiohttp->datasets)\n  Using cached aiosignal-1.3.1-py3-none-any.whl.metadata (4.0 kB)\nRequirement already satisfied: attrs>=17.3.0 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from aiohttp->datasets) (24.2.0)\nCollecting frozenlist>=1.1.1 (from aiohttp->datasets)\n  Using cached frozenlist-1.4.1-cp312-cp312-macosx_11_0_arm64.whl.metadata (12 kB)\nCollecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n  
Using cached multidict-6.1.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (5.0 kB)\nCollecting yarl<2.0,>=1.12.0 (from aiohttp->datasets)\n  Downloading yarl-1.14.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (52 kB)\nRequirement already satisfied: typing-extensions>=3.7.4.3 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from huggingface-hub>=0.22.0->datasets) (4.12.2)\nRequirement already satisfied: charset-normalizer<4,>=2 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from requests>=2.32.2->datasets) (3.3.2)\nRequirement already satisfied: idna<4,>=2.5 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from requests>=2.32.2->datasets) (3.10)\nRequirement already satisfied: urllib3<3,>=1.21.1 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from requests>=2.32.2->datasets) (2.2.3)\nRequirement already satisfied: certifi>=2017.4.17 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from requests>=2.32.2->datasets) (2024.8.30)\nINFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. 
This could take a while.\nCollecting multiprocess (from datasets)\n  Using cached multiprocess-0.70.16-py312-none-any.whl.metadata (7.2 kB)\nRequirement already satisfied: python-dateutil>=2.8.2 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from pandas->datasets) (2.9.0.post0)\nRequirement already satisfied: pytz>=2020.1 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from pandas->datasets) (2024.2)\nRequirement already satisfied: tzdata>=2022.7 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from pandas->datasets) (2024.2)\nRequirement already satisfied: six>=1.5 in /Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\nCollecting propcache>=0.2.0 (from yarl<2.0,>=1.12.0->aiohttp->datasets)\n  Downloading propcache-0.2.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (7.7 kB)\nUsing cached datasets-3.0.1-py3-none-any.whl (471 kB)\nUsing cached evaluate-0.4.3-py3-none-any.whl (84 kB)\nUsing cached dill-0.3.8-py3-none-any.whl (116 kB)\nUsing cached fsspec-2024.6.1-py3-none-any.whl (177 kB)\nDownloading aiohttp-3.10.9-cp312-cp312-macosx_11_0_arm64.whl (391 kB)\nUsing cached pyarrow-17.0.0-cp312-cp312-macosx_11_0_arm64.whl (27.2 MB)\nUsing cached sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl (1.2 MB)\nUsing cached multiprocess-0.70.16-py312-none-any.whl (146 kB)\nUsing cached protobuf-5.28.2-cp38-abi3-macosx_10_9_universal2.whl (414 kB)\nUsing cached xxhash-3.5.0-cp312-cp312-macosx_11_0_arm64.whl (30 kB)\nUsing cached aiohappyeyeballs-2.4.3-py3-none-any.whl (14 kB)\nUsing cached aiosignal-1.3.1-py3-none-any.whl (7.6 kB)\nUsing cached frozenlist-1.4.1-cp312-cp312-macosx_11_0_arm64.whl (51 kB)\nUsing cached multidict-6.1.0-cp312-cp312-macosx_11_0_arm64.whl (29 kB)\nDownloading yarl-1.14.0-cp312-cp312-macosx_11_0_arm64.whl (85 
kB)\nDownloading propcache-0.2.0-cp312-cp312-macosx_11_0_arm64.whl (45 kB)\nInstalling collected packages: sentencepiece, xxhash, pyarrow, protobuf, propcache, multidict, fsspec, frozenlist, dill, aiohappyeyeballs, yarl, multiprocess, aiosignal, aiohttp, datasets, evaluate\n  Attempting uninstall: fsspec\n    Found existing installation: fsspec 2024.9.0\n    Uninstalling fsspec-2024.9.0:\n      Successfully uninstalled fsspec-2024.9.0\nSuccessfully installed aiohappyeyeballs-2.4.3 aiohttp-3.10.9 aiosignal-1.3.1 datasets-3.0.1 dill-0.3.8 evaluate-0.4.3 frozenlist-1.4.1 fsspec-2024.6.1 multidict-6.1.0 multiprocess-0.70.16 propcache-0.2.0 protobuf-5.28.2 pyarrow-17.0.0 sentencepiece-0.2.0 xxhash-3.5.0 yarl-1.14.0\n\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 22,
  'text': 'Text block:\nFirst we need a corpus, so let’s create a simple one with a few sentences:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 23,
  'text': 'Code block:\ncorpus = [\n    "This is a sample corpus.",\n    "This corpus will be used to show how subword tokenization works.",\n    "This section shows several tokenizer algorithms.",\n    "Hopefully, you will be able to understand how they are trained and generate tokens.",\n]\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 24,
  'text': 'Text block:\nNext, we need to pre-tokenize that corpus into words. Since we are replicating a BPE tokenizer (like GPT-2), we will use the gpt2 tokenizer for the pre-tokenization:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 25,
  'text': 'Code block:\nfrom transformers import AutoTokenizer\n\ntokenizer = AutoTokenizer.from_pretrained("gpt2")\nOutput:\n/Users/anoop/git-repos/teaching/nlp-class-hw/bertchunker/venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884\n  warnings.warn(\n\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 26,
  'text': 'Text block:\nThen we compute the frequencies of each word in the corpus as we do the pre-tokenization:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 27,
  'text': "Code block:\nfrom collections import defaultdict\n\nword_freqs = defaultdict(int)\n\nfor text in corpus:\n    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)\n    new_words = [word for word, offset in words_with_offsets]\n    for word in new_words:\n        word_freqs[word] += 1\n\nprint(word_freqs)\nOutput:\ndefaultdict(<class 'int'>, {'This': 3, 'Ġis': 1, 'Ġa': 1, 'Ġsample': 1, 'Ġcorpus': 2, '.': 4, 'Ġwill': 2, 'Ġbe': 2, 'Ġused': 1, 'Ġto': 2, 'Ġshow': 1, 'Ġhow': 2, 'Ġsubword': 1, 'Ġtokenization': 1, 'Ġworks': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġalgorithms': 1, 'Hopefully': 1, ',': 1, 'Ġyou': 1, 'Ġable': 1, 'Ġunderstand': 1, 'Ġthey': 1, 'Ġare': 1, 'Ġtrained': 1, 'Ġand': 1, 'Ġgenerate': 1, 'Ġtokens': 1})\n\n"},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 28,
  'text': 'Text block:\nThe next step is to compute the base vocabulary, formed by all the characters used in the corpus:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 29,
  'text': "Code block:\nalphabet = []\n\nfor word in word_freqs.keys():\n    for letter in word:\n        if letter not in alphabet:\n            alphabet.append(letter)\nalphabet.sort()\n\nprint(alphabet)\nOutput:\n[',', '.', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ']\n\n"},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 30,
  'text': 'Text block:\nWe also add the special tokens used by the model at the beginning of that vocabulary. In the case of GPT-2, the only special token is `<|endoftext|>`:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 31,
  'text': "Code block:\nvocab = ['<|endoftext|>'] + alphabet.copy()\nOutput:\n"},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 32,
  'text': 'Text block:\nWe now need to split each word into individual characters, to be able to start training:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 33,
  'text': 'Code block:\nsplits = {word: [c for c in word] for word in word_freqs.keys()}\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 34,
  'text': 'Text block:\nNow that we are ready for training, let’s write a function that computes the frequency of each pair. We’ll need to use this at each step of the training:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 35,
  'text': 'Code block:\ndef compute_pair_freqs(splits):\n    pair_freqs = defaultdict(int)\n    for word, freq in word_freqs.items():\n        split = splits[word]\n        if len(split) == 1:\n            continue\n        for i in range(len(split) - 1):\n            pair = (split[i], split[i + 1])\n            pair_freqs[pair] += freq\n    return pair_freqs\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 36,
  'text': 'Text block:\nLet’s have a look at a part of this dictionary after the initial splits:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 37,
  'text': 'Code block:\npair_freqs = compute_pair_freqs(splits)\n\nfor i, key in enumerate(pair_freqs.keys()):\n    print(f"{key}: {pair_freqs[key]}")\n    if i >= 5:\n        break\nOutput:\n(\'T\', \'h\'): 3\n(\'h\', \'i\'): 3\n(\'i\', \'s\'): 4\n(\'Ġ\', \'i\'): 1\n(\'Ġ\', \'a\'): 5\n(\'Ġ\', \'s\'): 6\n\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 38,
  'text': 'Text block:\nFinding the most frequent pair only takes a quick loop:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 39,
  'text': 'Code block:\nbest_pair = ""\nmax_freq = None\n\nfor pair, freq in pair_freqs.items():\n    if max_freq is None or max_freq < freq:\n        best_pair = pair\n        max_freq = freq\n\nprint(best_pair, max_freq)\nOutput:\n(\'Ġ\', \'t\') 7\n\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 40,
  'text': "Text block:\nSo the first merge to learn is ('Ġ', 't') -> 'Ġt', and we add 'Ġt' to the vocabulary:\n"},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 41,
  'text': 'Code block:\nmerges = {("Ġ", "t"): "Ġt"}\nvocab.append("Ġt")\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 42,
  'text': 'Text block:\nTo continue, we need to apply that merge in our splits dictionary. Let’s write another function for this:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 43,
  'text': 'Code block:\ndef merge_pair(a, b, splits):\n    for word in word_freqs:\n        split = splits[word]\n        if len(split) == 1:\n            continue\n\n        i = 0\n        while i < len(split) - 1:\n            if split[i] == a and split[i + 1] == b:\n                split = split[:i] + [a + b] + split[i + 2 :]\n            else:\n                i += 1\n        splits[word] = split\n    return splits\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 44,
  'text': 'Text block:\nAnd we can have a look at the result of the first merge:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 45,
  'text': 'Code block:\nsplits = merge_pair("Ġ", "t", splits)\nprint(splits["Ġtrained"])\nOutput:\n[\'Ġt\', \'r\', \'a\', \'i\', \'n\', \'e\', \'d\']\n\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 46,
  'text': 'Text block:\nNow we have everything we need to loop until we have learned all the merges we want. Let’s aim for a vocab size of 50:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 47,
  'text': 'Code block:\nvocab_size = 50\n\nwhile len(vocab) < vocab_size:\n    pair_freqs = compute_pair_freqs(splits)\n    best_pair = ""\n    max_freq = None\n    for pair, freq in pair_freqs.items():\n        if max_freq is None or max_freq < freq:\n            best_pair = pair\n            max_freq = freq\n    splits = merge_pair(*best_pair, splits)\n    merges[best_pair] = best_pair[0] + best_pair[1]\n    vocab.append(best_pair[0] + best_pair[1])\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 48,
  'text': 'Text block:\nAs a result, we’ve learned 19 merge rules (the initial vocabulary had a size of 31 — 30 characters in the alphabet, plus the special token):\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 49,
  'text': "Code block:\nprint(merges)\nOutput:\n{('Ġ', 't'): 'Ġt', ('Ġ', 's'): 'Ġs', ('Ġ', 'a'): 'Ġa', ('o', 'r'): 'or', ('Ġt', 'o'): 'Ġto', ('i', 's'): 'is', ('h', 'o'): 'ho', ('ho', 'w'): 'how', ('e', 'n'): 'en', ('e', 'r'): 'er', ('T', 'h'): 'Th', ('Th', 'is'): 'This', ('u', 's'): 'us', ('Ġ', 'w'): 'Ġw', ('l', 'l'): 'll', ('Ġto', 'k'): 'Ġtok', ('Ġtok', 'en'): 'Ġtoken', ('n', 'd'): 'nd', ('l', 'e'): 'le', ('Ġ', 'c'): 'Ġc', ('Ġc', 'or'): 'Ġcor'}\n\n"},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 50,
  'text': "Code block:\nprint(vocab)\nOutput:\n['<|endoftext|>', ',', '.', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ', 'Ġt', 'Ġs', 'Ġa', 'or', 'Ġto', 'is', 'ho', 'how', 'en', 'er', 'Th', 'This', 'us', 'Ġw', 'll', 'Ġtok', 'Ġtoken', 'nd', 'le', 'Ġc', 'Ġcor']\n\n"},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 51,
  'text': 'Text block:\nTo tokenize a new text, we pre-tokenize it, split it, then apply all the merge rules learned:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 52,
  'text': 'Code block:\ndef tokenize(text):\n    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)\n    pre_tokenized_text = [word for word, offset in pre_tokenize_result]\n    splits = [[l for l in word] for word in pre_tokenized_text]\n    for pair, merge in merges.items():\n        for idx, split in enumerate(splits):\n            i = 0\n            while i < len(split) - 1:\n                if split[i] == pair[0] and split[i + 1] == pair[1]:\n                    split = split[:i] + [merge] + split[i + 2 :]\n                else:\n                    i += 1\n            splits[idx] = split\n\n    return sum(splits, [])\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 53,
  'text': 'Text block:\nWe can try this on any text composed of characters in the alphabet:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 54,
  'text': 'Code block:\ntokenize("This is not a token.")\nOutput:\n[\'This\', \'Ġ\', \'is\', \'Ġ\', \'n\', \'o\', \'t\', \'Ġa\', \'Ġtoken\', \'.\']\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 55,
  'text': 'Text block:\nOur implementation will throw an error if there is an unknown character since we didn’t do anything to handle them. GPT-2 doesn’t actually have an unknown token (it’s impossible to get an unknown character when using byte-level BPE), but this could happen here because we did not include all the possible bytes in the initial vocabulary. This aspect of BPE is beyond the scope of this section, so we’ve left the details out.\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 56,
  'text': 'Text block:\n### Training a transformers library tokenizer\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 57,
  'text': "Code block:\ntraining_corpus = [ [i] for i in corpus ]\nprint(training_corpus)\nOutput:\n[['This is a sample corpus.'], ['This corpus will be used to show how subword tokenization works.'], ['This section shows several tokenizer algorithms.'], ['Hopefully, you will be able to understand how they are trained and generate tokens.']]\n\n"},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 58,
  'text': 'Code block:\nbpe_tokenizer = tokenizer.train_new_from_iterator(training_corpus, 275) # do 275 merges\ntokens = bpe_tokenizer.tokenize("This is not a token")\nprint(tokens)\nOutput:\n\n\n\n[\'This\', \'Ġ\', \'is\', \'Ġ\', \'n\', \'o\', \'t\', \'Ġa\', \'Ġtoken\']\n\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 59,
  'text': "Text block:\n## OpenAI vocab\n\nThis 50K vocabulary is created using OpenAI's variant of BPE sub-word tokenization called [tiktoken](https://github.com/openai/tiktoken) and is available here:\n\nhttps://huggingface.co/gpt2/blob/main/vocab.json\n"},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 60,
  'text': "Code block:\nimport json\ngpt_vocab = None\nwith open('gpt_vocab.json', 'r') as f:\n    gpt_vocab = json.load(f)\nif gpt_vocab:\n    print(gpt_vocab['<|endoftext|>'])\n    try:\n        print(gpt_vocab['Anoop'])\n    except:\n        print('Anoop does not exist')\n    print(gpt_vocab['An'])\n    print(gpt_vocab['oop'])\nOutput:\n50256\nAnoop does not exist\n2025\n11224\n\n"},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 61,
  'text': 'Text block:\n## End\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 62,
  'text': 'Code block:\nfrom IPython.core.display import HTML\n\n\ndef css_styling():\n    styles = open("../css/notebook.css", "r").read()\n    return HTML(styles)\ncss_styling()\nOutput:\n<IPython.core.display.HTML object>\n'}
         ]
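
Since chunking is done at the document level, records like the ones in `sample` would first need to be combined into one text per document. The sketch below shows one way this might be done, assuming records are grouped by `file_name` and ordered by `marker`. The helper name `group_by_document`, the newline separator, and the tiny stand-in records are hypothetical and only mirror the keys seen in `sample`.

```python
from collections import defaultdict


def group_by_document(records):
    """Concatenate record texts per file, ordered by their `marker` value.

    Hypothetical helper: joining with a newline is an illustrative choice.
    """
    by_file = defaultdict(list)
    for rec in records:
        by_file[rec["file_name"]].append(rec)

    return {
        name: "\n".join(r["text"] for r in sorted(recs, key=lambda r: r["marker"]))
        for name, recs in by_file.items()
    }


# Stand-in records with the same keys as `sample` (contents are made up).
demo = [
    {"file_type": "ipynb", "file_name": "a.ipynb", "marker": 1, "text": "second"},
    {"file_type": "ipynb", "file_name": "a.ipynb", "marker": 0, "text": "first"},
    {"file_type": "html", "file_name": "b.html", "marker": 0, "text": "only"},
]

docs = group_by_document(demo)
print(docs["a.ipynb"])
```

Each resulting string can then be passed to a fixed-size chunker, so that no chunk mixes text from two different files.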


In [2]:
sample

[{'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 0,
  'text': 'Text block:\n# Byte-Pair Encoding tokenization\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 1,
  'text': 'Text block:\nBPE training starts by computing the unique set of words used in the corpus (after the normalization and pre-tokenization steps are completed), then building the vocabulary by taking all the symbols used to write those words. As a very simple example, let’s say our corpus uses these five words:\n\nThis material was adapted from the Huggingface tutorial available here:\n\nhttps://huggingface.co/learn/nlp-course/chapter6/5?fw=pt\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 2,
  'text': 'Code block:\ncorpus = ["hug", "pug", "pun", "bun", "hugs"]\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 3,
  'text': "Code b

In [3]:
sample_html = [{'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Graph models\nInternal Helpers\nCustom Layers and Utilities Utilities for pipelines Utilities for Tokenizers Utilities for Trainer Utilities for Generation Utilities for Image Processors Utilities for Audio processing General Utilities Utilities for Time Series\nGeneration with LLMs\nLLMs, or Large Language Models, are the key component behind text generation.',
  'marker': 0},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text.',
  'marker': 1},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Since they predict one token at a time, you need to do something more elaborate to generate new sentences other than just calling the model — you need to do autoregressive generation.',
  'marker': 2},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Autoregressive generation is the inference-time procedure of iteratively calling a model with its own generated outputs, given a few initial inputs.',
  'marker': 3},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'In 🤗 Transformers, this is handled by the generate() method, which is available to all models with generative capabilities.',
  'marker': 4},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'This tutorial will show you how to:\nGenerate text with an LLM\nAvoid common pitfalls\nNext steps to help you get the most out of your LLM\nBefore you begin, make sure you have all the necessary libraries installed:\nCopied\npip install transformers bitsandbytes>=0.39.0 -q\nGenerate text\nA language model trained for causal language modeling takes a sequence of text tokens as input and returns the probability distribution for the next token.',
  'marker': 5},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': '"Forward pass of an LLM"\nA critical aspect of autoregressive generation with LLMs is how to select the next token from this probability distribution.',
  'marker': 6},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Anything goes in this step as long as you end up with a token for the next iteration.',
  'marker': 7},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'This means it can be as simple as selecting the most likely token from the probability distribution or as complex as applying a dozen transformations before sampling from the resulting distribution.',
  'marker': 8},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': '"Autoregressive generation iteratively selects the next token from a probability distribution to generate text"\nThe process depicted above is repeated iteratively until some stopping condition is reached.',
  'marker': 9},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Ideally, the stopping condition is dictated by the model, which should learn when to output an end-of-sequence (EOS) token.',
  'marker': 10},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'If this is not the case, generation stops when some predefined maximum length is reached.',
  'marker': 11},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Properly setting up the token selection step and the stopping condition is essential to make your model behave as you’d expect on your task.',
  'marker': 12},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'That is why we have a GenerationConfig file associated with each model, which contains a good default generative parameterization and is loaded alongside your model.',
  'marker': 13},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Let’s talk code!',
  'marker': 14},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'If you’re interested in basic LLM usage, our high-level Pipeline interface is a great starting point.',
  'marker': 15},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'However, LLMs often require advanced features like quantization and fine control of the token selection step, which is best done through generate() .',
  'marker': 16},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Autoregressive generation with LLMs is also resource-intensive and should be executed on a GPU for adequate throughput.',
  'marker': 17},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'First, you need to load the model.',
  'marker': 18},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Copied\n>>> from transformers import AutoModelForCausalLM >>> model = AutoModelForCausalLM.from_pretrained( ... "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True ... )\nYou’ll notice two flags in the from_pretrained call:\ndevice_map ensures the model is moved to your GPU(s)\nload_in_4bit applies 4-bit dynamic quantization to massively reduce the resource requirements\nThere are other ways to initialize a model, but this is a good baseline to begin with an LLM.',
  'marker': 19},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Next, you need to preprocess your text input with a tokenizer .',
  'marker': 20},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Copied\n>>> from transformers import AutoTokenizer >>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left") >>> model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")\nThe model_inputs variable holds the tokenized text input, as well as the attention mask.',
  'marker': 21},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'While generate() does its best effort to infer the attention mask when it is not passed, we recommend passing it whenever possible for optimal results.',
  'marker': 22},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'After tokenizing the inputs, you can call the generate() method to returns the generated tokens.',
  'marker': 23},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'The generated tokens then should be converted to text before printing.',
  'marker': 24},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': "Copied\n>>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 'A list of colors: red, blue, green, yellow, orange, purple, pink,'\nFinally, you don’t need to do it one sequence at a time!",
  'marker': 25},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'You can batch your inputs, which will greatly improve the throughput at a small latency and memory cost.',
  'marker': 26},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'All you need to do is to make sure you pad your inputs properly (more on that below).',
  'marker': 27},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Copied\n>>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don\'t have a pad token by default >>> model_inputs = tokenizer( ... ["A list of colors: red, blue", "Portugal is"], return_tensors="pt", padding=True ... ).to("cuda") >>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)\n[\'A list of colors: red, blue, green, yellow, orange, purple, pink,\', \'Portugal is a country in southwestern Europe, on the Iber\']\nAnd that’s it!',
  'marker': 28},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'In a few lines of code, you can harness the power of an LLM.',
  'marker': 29},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Common pitfalls\nThere are many generation strategies , and sometimes the default values may not be appropriate for your use case.',
  'marker': 30},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'If your outputs aren’t aligned with what you’re expecting, we’ve created a list of the most common pitfalls and how to avoid them.',
  'marker': 31},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Copied\n>>> from transformers import AutoModelForCausalLM, AutoTokenizer >>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") >>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don\'t have a pad token by default >>> model = AutoModelForCausalLM.from_pretrained( ... "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True ... )\nGenerated output is too short/long\nIf not specified in the GenerationConfig file, generate returns up to 20 tokens by default.',
  'marker': 32},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'We highly recommend manually setting max_new_tokens in your generate call to control the maximum number of new tokens it can return.',
  'marker': 33},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Keep in mind LLMs (more precisely, decoder-only models ) also return the input prompt as part of the output.',
  'marker': 34},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Copied\n>>> model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda") >>> # By default, the output will contain up to 20 tokens >>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] \'A sequence of numbers: 1, 2, 3, 4, 5\' >>> # Setting `max_new_tokens` allows you to control the maximum length >>> generated_ids = model.generate(**model_inputs, max_new_tokens=50) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] \'A sequence of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\'\nIncorrect generation mode\nBy default, and unless specified in the GenerationConfig file, generate selects the most likely token at each iteration (greedy decoding).',
  'marker': 35},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Depending on your task, this may be undesirable; creative tasks like chatbots or writing an essay benefit from sampling.',
  'marker': 36},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'On the other hand, input-grounded tasks like audio transcription or translation benefit from greedy decoding.',
  'marker': 37},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Enable sampling with do_sample=True, and you can learn more about this topic in this blog post .',
  'marker': 38},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Copied\n>>> # Set seed for reproducibility -- you don\'t need this unless you want full reproducibility >>> from transformers import set_seed >>> set_seed(42) >>> model_inputs = tokenizer(["I am a cat.',
  'marker': 39},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': '"], return_tensors="pt").to("cuda") >>> # LLM + greedy decoding = repetitive, boring output >>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] \'I am a cat.',
  'marker': 40},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'I am a cat.',
  'marker': 41},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'I am a cat.',
  'marker': 42},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': "I am a cat' >>> # With sampling, the output becomes more creative!",
  'marker': 43},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': ">>> generated_ids = model.generate(**model_inputs, do_sample=True) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 'I am a cat.",
  'marker': 44},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Specifically, I am an indoor-only cat.',
  'marker': 45},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': "I'\nWrong padding side\nLLMs are decoder-only architectures, meaning they continue to iterate on your input prompt.",
  'marker': 46},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'If your inputs do not have the same length, they need to be padded.',
  'marker': 47},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Since LLMs are not trained to continue from pad tokens, your input needs to be left-padded.',
  'marker': 48},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Make sure you also don’t forget to pass the attention mask to generate!',
  'marker': 49},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Copied\n>>> # The tokenizer initialized above has right-padding active by default: the 1st sequence, >>> # which is shorter, has padding on the right side.',
  'marker': 50},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Generation fails to capture the logic.',
  'marker': 51},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': '>>> model_inputs = tokenizer( ... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" ... ).to("cuda") >>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] \'1, 2, 33333333333\' >>> # With left-padding, it works as expected!',
  'marker': 52},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': '>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left") >>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don\'t have a pad token by default >>> model_inputs = tokenizer( ... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" ... ).to("cuda") >>> generated_ids = model.generate(**model_inputs) >>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] \'1, 2, 3, 4, 5, 6,\'\nWrong prompt\nSome models and tasks expect a certain input prompt format to work properly.',
  'marker': 53},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'When this format is not applied, you will get a silent performance degradation: the model kinda works, but not as well as if you were following the expected prompt.',
  'marker': 54},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'More information about prompting, including which models and tasks need to be careful, is available in this guide .',
  'marker': 55},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Let’s see an example with a chat LLM, which makes use of chat templating :\nCopied\n>>> tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha") >>> model = AutoModelForCausalLM.from_pretrained( ... "HuggingFaceH4/zephyr-7b-alpha", device_map="auto", load_in_4bit=True ... ) >>> set_seed(0) >>> prompt = """How many helicopters can a human eat in one sitting?',
  'marker': 56},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Reply as a thug."""',
  'marker': 57},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': '>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda") >>> input_length = model_inputs.input_ids.shape[1] >>> generated_ids = model.generate(**model_inputs, max_new_tokens=20) >>> print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0]) "I\'m not a thug, but i can tell you that a human cannot eat" >>> # Oh no, it did not follow our instruction to reply as a thug!',
  'marker': 58},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Let\'s see what happens when we write >>> # a better prompt and use the right template for this model (through `tokenizer.apply_chat_template`) >>> set_seed(0) >>> messages = [ ... { ... "role": "system", ... "content": "You are a friendly chatbot who always responds in the style of a thug", ... }, ... {"role": "user", "content": "How many helicopters can a human eat in one sitting?',
  'marker': 59},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': '"}, ... ] >>> model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda") >>> input_length = model_inputs.shape[1] >>> generated_ids = model.generate(model_inputs, do_sample=True, max_new_tokens=20) >>> print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0]) \'None, you thug.',
  'marker': 60},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': "How bout you try to focus on more useful questions?'",
  'marker': 61},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': '>>> # As we can see, it followed a proper thug style 😎\nFurther resources\nWhile the autoregressive generation process is relatively straightforward, making the most out of your LLM can be a challenging endeavor because there are many moving parts.',
  'marker': 62},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'For your next steps to help you dive deeper into LLM usage and understanding:\nAdvanced generate usage\nGuide on how to control different generation methods , how to set up the generation configuration file, and how to stream the output;',
  'marker': 63}]

In [4]:
sample_html

[{'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Graph models\nInternal Helpers\nCustom Layers and Utilities Utilities for pipelines Utilities for Tokenizers Utilities for Trainer Utilities for Generation Utilities for Image Processors Utilities for Audio processing General Utilities Utilities for Time Series\nGeneration with LLMs\nLLMs, or Large Language Models, are the key component behind text generation.',
  'marker': 0},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text.',
  'marker': 1},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Since they predict one token at a time, you need to do something more elaborate to generate new sentences other than just calling th

In [5]:
def sequential_chunk_text(data, max_tokens=512, overlap=50):
    """Split each entry's text into chunks of roughly max_tokens words.

    Sizes are measured in whitespace-separated words, an approximation of the
    token count. Consecutive chunks of one entry share trailing lines worth at
    most `overlap` words.
    """
    chunked_data = []

    for entry in data:
        lines = entry['text'].split("\n")  # Split text into lines
        chunks = []
        chunk = []
        current_tokens = 0
        sub_marker = 0

        for line in lines:
            line_tokens = len(line.split())  # Approximation of token count
            # Only finalize a non-empty chunk; a single over-long line becomes its own chunk
            if chunk and current_tokens + line_tokens > max_tokens:
                chunk_text = " ".join(chunk)
                chunks.append({
                    **entry,  # Copy the original dictionary fields
                    'text': chunk_text,
                    'sub_marker': sub_marker,
                    'first_10_words': " ".join(chunk_text.split()[:10])
                })
                sub_marker += 1

                # Start a new chunk from the trailing lines that fit in the overlap budget
                overlap_lines = []
                overlap_tokens = 0
                for prev in reversed(chunk):
                    prev_tokens = len(prev.split())
                    if overlap_tokens + prev_tokens > overlap:
                        break
                    overlap_lines.insert(0, prev)
                    overlap_tokens += prev_tokens
                chunk = overlap_lines
                current_tokens = overlap_tokens

            # Add the current line to the chunk
            chunk.append(line)
            current_tokens += line_tokens

        # Add the last chunk
        if chunk:
            chunk_text = " ".join(chunk)
            chunks.append({
                **entry,
                'text': chunk_text,
                'sub_marker': sub_marker,
                'first_10_words': " ".join(chunk_text.split()[:10])
            })

        # Append all chunks for this entry to the result
        chunked_data.extend(chunks)

    return chunked_data
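
The chunker above approximates the token count with `len(sentence.split())`. Subword tokenizers usually produce more tokens than there are whitespace-separated words, so a 512-word chunk may exceed a 512-token model limit. The sketch below compares the whitespace count with a rough regex-based count that treats punctuation as separate tokens. Both `whitespace_count` and `regex_count` are hypothetical helpers introduced for illustration, and the regex is only a stand-in for a real tokenizer, not what any particular model uses.

```python
import re

def whitespace_count(text):
    """Word count by whitespace, the approximation used by the chunker."""
    return len(text.split())

def regex_count(text):
    """Rough stand-in for a subword tokenizer (an assumption for illustration):
    each run of word characters and each punctuation mark counts as one token."""
    return len(re.findall(r"\w+|[^\w\s]", text))

line = "Enable sampling with do_sample=True, and you can learn more."
print(whitespace_count(line), regex_count(line))
```

If the gap matters for the downstream model, one option is to pass the counting function into the chunker as a parameter, or simply to lower `max_tokens` to leave headroom.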

In [6]:
chunks = sequential_chunk_text(sample)

In [7]:
chunks

[{'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 0,
  'text': 'Text block: # Byte-Pair Encoding tokenization ',
  'sub_marker': 0,
  'first_10_words': 'Text block: # Byte-Pair Encoding tokenization'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 1,
  'text': 'Text block: BPE training starts by computing the unique set of words used in the corpus (after the normalization and pre-tokenization steps are completed), then building the vocabulary by taking all the symbols used to write those words. As a very simple example, let’s say our corpus uses these five words:  This material was adapted from the Huggingface tutorial available here:  https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt ',
  'sub_marker': 0,
  'first_10_words': 'Text block: BPE training starts by computing the unique set'},
 {'file_type': 'ipynb',
  'file_name': '../data/cmpt-713/notebooks/bpe.ipynb',
  'marker': 2,
  'text': 'Co

In [8]:
chunks_html = sequential_chunk_text(sample_html)
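
Printing the full chunk list gets long quickly, so a compact summary can be easier to check. The sketch below counts chunks per file and reports the largest chunk size in words. `summarize_chunks` and the `toy` data are hypothetical and only assume the `file_name` and `text` keys seen in the chunk dictionaries above.

```python
from collections import Counter

def summarize_chunks(chunks):
    """Hypothetical helper: chunk count per file and the largest chunk size in words."""
    per_file = Counter(c['file_name'] for c in chunks)
    longest = max((len(c['text'].split()) for c in chunks), default=0)
    return per_file, longest

# Toy chunks shaped like the chunker's output (values are made up for illustration)
toy = [
    {'file_name': 'a.html', 'marker': 0, 'sub_marker': 0, 'text': 'one two three'},
    {'file_name': 'a.html', 'marker': 1, 'sub_marker': 0, 'text': 'four five'},
    {'file_name': 'b.ipynb', 'marker': 0, 'sub_marker': 0, 'text': 'six'},
]
per_file, longest = summarize_chunks(toy)
print(dict(per_file), longest)
```

A `longest` value above `max_tokens` would point to a single line that is longer than the budget on its own.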

In [9]:
chunks_html

[{'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'Graph models Internal Helpers Custom Layers and Utilities Utilities for pipelines Utilities for Tokenizers Utilities for Trainer Utilities for Generation Utilities for Image Processors Utilities for Audio processing General Utilities Utilities for Time Series Generation with LLMs LLMs, or Large Language Models, are the key component behind text generation.',
  'marker': 0,
  'sub_marker': 0,
  'first_10_words': 'Graph models Internal Helpers Custom Layers and Utilities Utilities for'},
 {'file_type': 'html',
  'file_name': '../data/cmpt-713/references/Generation with LLMs.html',
  'text': 'In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text.',
  'marker': 1,
  'sub_marker': 0,
  'first_10_words': 'In a nutshell, they consist of large pretrained transformer models'},
 {'file_type': 'htm