# MinBPE Exercise

At this point you have everything you need to build your own GPT-4 tokenizer. This is the [exercise progression](https://github.com/karpathy/minbpe/blob/master/exercise.md) you may wish to follow. You'll note that it is part of the [minbpe](https://github.com/karpathy/minbpe) repo, which is the solution to that exercise, and is a cleaned up version of the code above.

## Task A - 

Write the BasicTokenizer class, with the following three core functions:

- def train(self, text, vocab_size, verbose=False)
- def encode(self, text)
- def decode(self, ids)

Train your tokenizer on whatever text you like and visualize the merged tokens. Do they look reasonable? 
One default test you may wish to use is the text file tests/taylorswift.txt.

In [77]:
with open('taylorswift.txt', 'r') as file:
    text = file.read()
    
print(text)

Copy paste of the Wikipedia article on Taylor Swift, as of Feb 16, 2024.
---

Main menu

WikipediaThe Free Encyclopedia

Search
Create account
Log in

Personal tools
Contents  hide
(Top)
Life and career
Toggle Life and career subsection
Artistry
Toggle Artistry subsection
Accolades and achievements
Cultural status
Toggle Cultural status subsection
Wealth
Toggle Wealth subsection
Discography
Filmography
Tours
See also
Footnotes
References
Toggle References subsection
External links
Taylor Swift

136 languages
Article
Talk
Read
View source
View history

Tools
 Featured article
Page semi-protected
From Wikipedia, the free encyclopedia
For the album, see Taylor Swift (album).
Taylor Swift
Portrait of Taylor Swift in a cocktail dress
Swift at the 2023 MTV Video Music Awards
Born	Taylor Alison Swift
December 13, 1989 (age 34)
West Reading, Pennsylvania, US
Occupations
Singer-songwriter producer director businesswoman actress
Years active	2004â€“present
Works
Albumssinglessongsvideosperforman

In [3]:
class BasicTokenizer():
    
    def __init__(self):
        self.vocab = None
        self.merges = {} # (int, int) -> int
        self.ids = None    
        pass
    
    
    def get_stats(self, ids):
        counts = {}
        for pair in zip(ids, ids[1:]):
            counts[pair] = counts.get(pair, 0) + 1    
        return counts
    
    
    def merge(self, ids, pair, idx):
        newids = []
        i = 0
        while i < len(ids):
            # Given our current location isn't the end, and the current and next values match pair
            if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
                # Add in the Replacement value
                newids.append(idx)
                # Skip two locations, as we've replaced them with idx.
                i += 2
            else:
                # If not a match, add in the appropriate i-th value from ids.
                newids.append(ids[i])
                # Move along by one.
                i += 1
        return newids
        
    
    def train(self, text, vocab_size, verbose=False):
        """
        :param text: Text used to train tokenizer model.
        :param vocab_size: Size of vocabulary. Start size is 256, defines desirable final vocab size.
        :param verbose: Provides verbose output.
        :return: 
        """
        assert vocab_size >= 256
        
        if self.ids is not None:
            print("Tokenizer already trained.")
            return None
        
        byte_text = text.encode("utf-8")
        ids = list(map(int, byte_text))
        self.ids = ids
        
        if verbose:
            print(f"length of text: {len(text)}")
            print(f"length of tokens: {len(self.ids)}")
            
        num_merges = vocab_size - 256
        
        for i in range(num_merges):
            stats = self.get_stats(self.ids)
            pair = max(stats, key=stats.get)
            idx = 256 + i # Existing tokens are 0...255, so create from 256 onwards.
            
            if verbose:
                print(f"Merging {pair} into a new token {idx}")
            
            self.ids = self.merge(self.ids, pair, idx)
            self.merges[pair] = idx
        pass
    
    
    def encode(self, text):
        tokens = list(text.encode("utf-8"))
        
        # Apply Merges:
        while len(tokens) >= 2: 
            stats = self.get_stats(tokens)
            pair = min(stats, key=lambda p: self.merges.get(p, float('inf')))
            if pair not in self.merges:
                break
            idx = self.merges[pair]
            tokens = self.merge(tokens, pair, idx)
            
        return tokens
    
    def decode(self, ids):
        vocab = {idx: bytes([idx]) for idx in range(256)}
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]
    
        self.vocab = vocab
        
        tokens = b"".join(vocab[idx] for idx in ids) # b"" for bytes
        text = tokens.decode("utf-8", errors="replace") # Decode from utf-8 bytes to a string.
        
        return text
        
        
        
        

In [4]:
tokenizer = BasicTokenizer()
tokenizer.train(text, 286, verbose=True)

length of text: 185767
length of tokens: 186258
Merging (101, 32) into a new token 256
Merging (44, 32) into a new token 257
Merging (100, 32) into a new token 258
Merging (46, 32) into a new token 259
Merging (114, 32) into a new token 260
Merging (50, 48) into a new token 261
Merging (115, 32) into a new token 262
Merging (105, 110) into a new token 263
Merging (111, 110) into a new token 264
Merging (114, 105) into a new token 265
Merging (116, 32) into a new token 266
Merging (116, 104) into a new token 267
Merging (101, 258) into a new token 268
Merging (257, 261) into a new token 269
Merging (97, 110) into a new token 270
Merging (97, 114) into a new token 271
Merging (101, 260) into a new token 272
Merging (121, 32) into a new token 273
Merging (97, 108) into a new token 274
Merging (267, 256) into a new token 275
Merging (118, 268) into a new token 276
Merging (119, 105) into a new token 277
Merging (101, 114) into a new token 278
Merging (264, 32) into a new token 279
Merging 

In [5]:
tokenizer.decode(tokenizer.ids)

'Copy paste of the Wikipedia article on Taylor Swift, as of Feb 16, 2024.\n---\n\nMain menu\n\nWikipediaThe Free Encyclopedia\n\nSearch\nCreate account\nLog in\n\nPersonal tools\nContents  hide\n(Top)\nLife and career\nToggle Life and career subsection\nArtistry\nToggle Artistry subsection\nAccolades and achievements\nCultural status\nToggle Cultural status subsection\nWealth\nToggle Wealth subsection\nDiscography\nFilmography\nTours\nSee also\nFootnotes\nReferences\nToggle References subsection\nExternal links\nTaylor Swift\n\n136 languages\nArticle\nTalk\nRead\nView source\nView history\n\nTools\n Featured article\nPage semi-protected\nFrom Wikipedia, the free encyclopedia\nFor the album, see Taylor Swift (album).\nTaylor Swift\nPortrait of Taylor Swift in a cocktail dress\nSwift at the 2023 MTV Video Music Awards\nBorn\tTaylor Alison Swift\nDecember 13, 1989 (age 34)\nWest Reading, Pennsylvania, US\nOccupations\nSinger-songwriter producer director businesswoman actress\nYears active

In [6]:
tokenizer.encode(text)

[67,
 111,
 112,
 273,
 112,
 97,
 115,
 116,
 256,
 111,
 102,
 32,
 275,
 87,
 105,
 107,
 105,
 112,
 101,
 100,
 105,
 97,
 32,
 271,
 116,
 105,
 99,
 108,
 256,
 279,
 84,
 97,
 121,
 108,
 283,
 282,
 116,
 257,
 97,
 262,
 111,
 102,
 32,
 70,
 101,
 98,
 32,
 49,
 54,
 269,
 50,
 52,
 46,
 10,
 45,
 45,
 45,
 10,
 10,
 77,
 97,
 263,
 32,
 109,
 101,
 110,
 117,
 10,
 10,
 87,
 105,
 107,
 105,
 112,
 101,
 100,
 105,
 97,
 84,
 104,
 256,
 70,
 114,
 101,
 256,
 69,
 110,
 99,
 121,
 99,
 108,
 111,
 112,
 101,
 100,
 105,
 97,
 10,
 10,
 83,
 101,
 271,
 284,
 10,
 67,
 114,
 101,
 97,
 116,
 256,
 97,
 99,
 99,
 111,
 117,
 110,
 116,
 10,
 76,
 111,
 103,
 32,
 263,
 10,
 10,
 80,
 278,
 115,
 264,
 274,
 32,
 116,
 111,
 111,
 108,
 115,
 10,
 67,
 264,
 116,
 101,
 110,
 116,
 262,
 32,
 104,
 105,
 100,
 101,
 10,
 40,
 84,
 111,
 112,
 41,
 10,
 76,
 105,
 102,
 256,
 270,
 258,
 99,
 271,
 101,
 278,
 10,
 84,
 111,
 103,
 103,
 108,
 256,
 76,
 105,
 102,
 256,
 270,

## Task B:
Convert your BasicTokenizer into a RegexTokenizer, which takes a regex pattern and splits the text exactly as GPT-4 would. Process the parts separately as before, then concatenate the results. Retrain your tokenizer and compare the results before and after. You should see that you will now have no tokens that go across categories (numbers, letters, punctuation, more than one whitespace). Use the GPT-4 pattern:

In [7]:
import regex as re
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
gpt4pat = re.compile(GPT4_SPLIT_PATTERN)

print(re.findall(gpt4pat, "Hello! You're such a SUPER person!"))
print(re.findall(gpt4pat, "Hello! You're such a                    SUPER person!"))



text_chunks = re.findall(gpt4pat, text)
ids = [list(ch.encode("utf-8")) for ch in text_chunks]


# The Text Below Translated into Encoded Chunks:
print(text[:27])
print(ids[:5])

['Hello', '!', ' You', "'re", ' such', ' a', ' SUPER', ' person', '!']
['Hello', '!', ' You', "'re", ' such', ' a', '                   ', ' SUPER', ' person', '!']
Copy paste of the Wikipedia
[[67, 111, 112, 121], [32, 112, 97, 115, 116, 101], [32, 111, 102], [32, 116, 104, 101], [32, 87, 105, 107, 105, 112, 101, 100, 105, 97]]


In [80]:
class RegexTokenizer():
    
    def __init__(self, pattern=None):
        self.vocab = {idx: bytes([idx]) for idx in range(256)}
        self.merges = {} # (int, int) -> int
        self.ids = None
        
        self.pattern = GPT4_SPLIT_PATTERN if pattern is None else pattern
        self.compiled_pattern = re.compile(self.pattern)
        pass
    
    
    def get_stats(self, ids, counts=None):
        counts = {} if counts is None else counts # If Existing Counts already exist, we can count ONTOP of them.
        for pair in zip(ids, ids[1:]):
            counts[pair] = counts.get(pair, 0) + 1    
        return counts
    
    
    def merge(self, ids, pair, idx):
        newids = []
        i = 0
        while i < len(ids):
            # Given our current location isn't the end, and the current and next values match pair
            if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
                # Add in the Replacement value
                newids.append(idx)
                # Skip two locations, as we've replaced them with idx.
                i += 2
            else:
                # If not a match, add in the appropriate i-th value from ids.
                newids.append(ids[i])
                # Move along by one.
                i += 1
        return newids
        
    
    def train(self, text, vocab_size, verbose=False):
        """
        :param text: Text used to train tokenizer model.
        :param vocab_size: Size of vocabulary. Start size is 256, defines desirable final vocab size.
        :param verbose: Provides verbose output.
        :return: 
        """
        assert vocab_size >= 256
        num_merges = vocab_size - 256
        
                
        text_chunks = re.findall(self.pattern, text)
        ids = [list(ch.encode("utf-8")) for ch in text_chunks]
        self.ids = ids
        
        
        if verbose:
            print(f"length of text: {len(text)}")
            print(f"length of tokens: {len(ids)}")
            
        for i in range(num_merges):
            stats = {}
            for chunk_ids in ids:
                self.get_stats(chunk_ids, stats)
                
            pair = max(stats, key=stats.get)
            idx = 256 + i # Existing tokens are 0...255, so create from 256 onwards.
            
            if verbose:
                print(f"Merging {pair} into a new token {idx}")
            
            ids = [self.merge(chunk_ids, pair, idx) for chunk_ids in ids] # Replace across all text chunks in ids
            self.merges[pair] = idx
            self.vocab[idx] = self.vocab[pair[0]] + self.vocab[pair[1]]
        pass
    
    
    def _encode_chunk(self, text_bytes):
        ids = list(text_bytes)
        print(ids)
        while len(ids) >= 2:
            stats = self.get_stats(ids)
            pair = min(stats, key=lambda p: self.merges.get(p, float('inf')))
            
            if pair not in self.merges:
                break
            
            idx = self.merges[pair]
            ids = self.merge(ids, pair, idx)
            
        return ids
    
    def encode(self, text):
        text_chunks = re.findall(self.pattern, text)
        # Encode chunks separately then join.
        ids = []
        for chunk in text_chunks:
            chunks_bytes = chunk.encode("utf-8")
            chunk_ids = self._encode_chunk(chunks_bytes)
            ids.extend(chunk_ids)
        return ids
    
    
    def decode(self, ids):
        
        tokens = b"".join(self.vocab[idx] for idx in ids) # b"" for bytes
        text = tokens.decode("utf-8", errors="replace") # Decode from utf-8 bytes to a string.
        
        return text
    
    def decode2(self, ids):
        text = ""
        for chunk_ids in ids:
            chunk_tokens = b"".join(self.vocab[idx] for idx in chunk_ids)
            chunk_text = chunk_tokens.decode("utf-8", errors="replace")   
            text += chunk_text
        return text

In [81]:
tokenizer = RegexTokenizer()
tokenizer.train(text, 286, verbose=True)

length of text: 185767
length of tokens: 46303
Merging (101, 114) into a new token 256
Merging (50, 48) into a new token 257
Merging (111, 114) into a new token 258
Merging (105, 110) into a new token 259
Merging (101, 100) into a new token 260
Merging (32, 116) into a new token 261
Merging (111, 110) into a new token 262
Merging (104, 101) into a new token 263
Merging (32, 83) into a new token 264
Merging (97, 114) into a new token 265
Merging (97, 110) into a new token 266
Merging (32, 65) into a new token 267
Merging (261, 263) into a new token 268
Merging (97, 108) into a new token 269
Merging (114, 105) into a new token 270
Merging (118, 260) into a new token 271
Merging (115, 116) into a new token 272
Merging (119, 105) into a new token 273
Merging (32, 82) into a new token 274
Merging (257, 49) into a new token 275
Merging (32, 102) into a new token 276
Merging (257, 50) into a new token 277
Merging (32, 84) into a new token 278
Merging (102, 116) into a new token 279
Merging (9

In [82]:
tokenizer.vocab

{0: b'\x00',
 1: b'\x01',
 2: b'\x02',
 3: b'\x03',
 4: b'\x04',
 5: b'\x05',
 6: b'\x06',
 7: b'\x07',
 8: b'\x08',
 9: b'\t',
 10: b'\n',
 11: b'\x0b',
 12: b'\x0c',
 13: b'\r',
 14: b'\x0e',
 15: b'\x0f',
 16: b'\x10',
 17: b'\x11',
 18: b'\x12',
 19: b'\x13',
 20: b'\x14',
 21: b'\x15',
 22: b'\x16',
 23: b'\x17',
 24: b'\x18',
 25: b'\x19',
 26: b'\x1a',
 27: b'\x1b',
 28: b'\x1c',
 29: b'\x1d',
 30: b'\x1e',
 31: b'\x1f',
 32: b' ',
 33: b'!',
 34: b'"',
 35: b'#',
 36: b'$',
 37: b'%',
 38: b'&',
 39: b"'",
 40: b'(',
 41: b')',
 42: b'*',
 43: b'+',
 44: b',',
 45: b'-',
 46: b'.',
 47: b'/',
 48: b'0',
 49: b'1',
 50: b'2',
 51: b'3',
 52: b'4',
 53: b'5',
 54: b'6',
 55: b'7',
 56: b'8',
 57: b'9',
 58: b':',
 59: b';',
 60: b'<',
 61: b'=',
 62: b'>',
 63: b'?',
 64: b'@',
 65: b'A',
 66: b'B',
 67: b'C',
 68: b'D',
 69: b'E',
 70: b'F',
 71: b'G',
 72: b'H',
 73: b'I',
 74: b'J',
 75: b'K',
 76: b'L',
 77: b'M',
 78: b'N',
 79: b'O',
 80: b'P',
 81: b'Q',
 82: b'R',
 83: b'

In [88]:
tokenizer.decode(tokenizer.encode(text))

[67, 111, 112, 121]
[32, 112, 97, 115, 116, 101]
[32, 111, 102]
[32, 116, 104, 101]
[32, 87, 105, 107, 105, 112, 101, 100, 105, 97]
[32, 97, 114, 116, 105, 99, 108, 101]
[32, 111, 110]
[32, 84, 97, 121, 108, 111, 114]
[32, 83, 119, 105, 102, 116]
[44]
[32, 97, 115]
[32, 111, 102]
[32, 70, 101, 98]
[32]
[49, 54]
[44]
[32]
[50, 48, 50]
[52]
[46, 10]
[45, 45, 45, 10, 10]
[77, 97, 105, 110]
[32, 109, 101, 110, 117]
[10, 10]
[87, 105, 107, 105, 112, 101, 100, 105, 97, 84, 104, 101]
[32, 70, 114, 101, 101]
[32, 69, 110, 99, 121, 99, 108, 111, 112, 101, 100, 105, 97]
[10, 10]
[83, 101, 97, 114, 99, 104]
[10]
[67, 114, 101, 97, 116, 101]
[32, 97, 99, 99, 111, 117, 110, 116]
[10]
[76, 111, 103]
[32, 105, 110]
[10, 10]
[80, 101, 114, 115, 111, 110, 97, 108]
[32, 116, 111, 111, 108, 115]
[10]
[67, 111, 110, 116, 101, 110, 116, 115]
[32]
[32, 104, 105, 100, 101]
[10]
[40, 84, 111, 112]
[41, 10]
[76, 105, 102, 101]
[32, 97, 110, 100]
[32, 99, 97, 114, 101, 101, 114]
[10]
[84, 111, 103, 103, 108, 10

'Copy paste of the Wikipedia article on Taylor Swift, as of Feb 16, 2024.\n---\n\nMain menu\n\nWikipediaThe Free Encyclopedia\n\nSearch\nCreate account\nLog in\n\nPersonal tools\nContents  hide\n(Top)\nLife and career\nToggle Life and career subsection\nArtistry\nToggle Artistry subsection\nAccolades and achievements\nCultural status\nToggle Cultural status subsection\nWealth\nToggle Wealth subsection\nDiscography\nFilmography\nTours\nSee also\nFootnotes\nReferences\nToggle References subsection\nExternal links\nTaylor Swift\n\n136 languages\nArticle\nTalk\nRead\nView source\nView history\n\nTools\n Featured article\nPage semi-protected\nFrom Wikipedia, the free encyclopedia\nFor the album, see Taylor Swift (album).\nTaylor Swift\nPortrait of Taylor Swift in a cocktail dress\nSwift at the 2023 MTV Video Music Awards\nBorn\tTaylor Alison Swift\nDecember 13, 1989 (age 34)\nWest Reading, Pennsylvania, US\nOccupations\nSinger-songwriter producer director businesswoman actress\nYears active

In [84]:
tokenizer.encode(text)

[67, 111, 112, 121]
[32, 112, 97, 115, 116, 101]
[32, 111, 102]
[32, 116, 104, 101]
[32, 87, 105, 107, 105, 112, 101, 100, 105, 97]
[32, 97, 114, 116, 105, 99, 108, 101]
[32, 111, 110]
[32, 84, 97, 121, 108, 111, 114]
[32, 83, 119, 105, 102, 116]
[44]
[32, 97, 115]
[32, 111, 102]
[32, 70, 101, 98]
[32]
[49, 54]
[44]
[32]
[50, 48, 50]
[52]
[46, 10]
[45, 45, 45, 10, 10]
[77, 97, 105, 110]
[32, 109, 101, 110, 117]
[10, 10]
[87, 105, 107, 105, 112, 101, 100, 105, 97, 84, 104, 101]
[32, 70, 114, 101, 101]
[32, 69, 110, 99, 121, 99, 108, 111, 112, 101, 100, 105, 97]
[10, 10]
[83, 101, 97, 114, 99, 104]
[10]
[67, 114, 101, 97, 116, 101]
[32, 97, 99, 99, 111, 117, 110, 116]
[10]
[76, 111, 103]
[32, 105, 110]
[10, 10]
[80, 101, 114, 115, 111, 110, 97, 108]
[32, 116, 111, 111, 108, 115]
[10]
[67, 111, 110, 116, 101, 110, 116, 115]
[32]
[32, 104, 105, 100, 101]
[10]
[40, 84, 111, 112]
[41, 10]
[76, 105, 102, 101]
[32, 97, 110, 100]
[32, 99, 97, 114, 101, 101, 114]
[10]
[84, 111, 103, 103, 108, 10

[67,
 111,
 112,
 121,
 32,
 112,
 97,
 272,
 101,
 32,
 111,
 102,
 268,
 32,
 87,
 105,
 107,
 105,
 112,
 260,
 105,
 97,
 32,
 265,
 116,
 105,
 99,
 108,
 101,
 32,
 262,
 278,
 280,
 108,
 258,
 284,
 44,
 32,
 97,
 115,
 32,
 111,
 102,
 32,
 70,
 101,
 98,
 32,
 49,
 54,
 44,
 32,
 277,
 52,
 46,
 10,
 45,
 45,
 45,
 10,
 10,
 77,
 97,
 259,
 32,
 109,
 101,
 110,
 117,
 10,
 10,
 87,
 105,
 107,
 105,
 112,
 260,
 105,
 97,
 84,
 263,
 32,
 70,
 114,
 101,
 101,
 32,
 69,
 110,
 99,
 121,
 99,
 108,
 111,
 112,
 260,
 105,
 97,
 10,
 10,
 83,
 101,
 265,
 285,
 10,
 67,
 114,
 101,
 97,
 116,
 101,
 32,
 97,
 99,
 99,
 111,
 117,
 110,
 116,
 10,
 76,
 111,
 103,
 32,
 259,
 10,
 10,
 80,
 256,
 115,
 262,
 269,
 261,
 111,
 111,
 108,
 115,
 10,
 67,
 262,
 116,
 101,
 110,
 116,
 115,
 32,
 32,
 104,
 105,
 100,
 101,
 10,
 40,
 84,
 111,
 112,
 41,
 10,
 76,
 105,
 102,
 101,
 32,
 266,
 100,
 32,
 99,
 265,
 101,
 256,
 10,
 84,
 111,
 103,
 103,
 108,
 101,
 32,
 76,
 105

## Step 3 
You're now ready to load the merges from the GPT-4 tokenizer and show that your tokenizer produces the identical results for both encode and decode, matching tiktoken.

In [45]:
# match this
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # this is the GPT-4 tokenizer
ids = enc.encode("hello world!!!? (안녕하세요!) lol123 😉")
text = enc.decode(ids) # get the same text back

In [85]:
enc2 = tokenizer.encode(text)

[67, 111, 112, 121]
[32, 112, 97, 115, 116, 101]
[32, 111, 102]
[32, 116, 104, 101]
[32, 87, 105, 107, 105, 112, 101, 100, 105, 97]
[32, 97, 114, 116, 105, 99, 108, 101]
[32, 111, 110]
[32, 84, 97, 121, 108, 111, 114]
[32, 83, 119, 105, 102, 116]
[44]
[32, 97, 115]
[32, 111, 102]
[32, 70, 101, 98]
[32]
[49, 54]
[44]
[32]
[50, 48, 50]
[52]
[46, 10]
[45, 45, 45, 10, 10]
[77, 97, 105, 110]
[32, 109, 101, 110, 117]
[10, 10]
[87, 105, 107, 105, 112, 101, 100, 105, 97, 84, 104, 101]
[32, 70, 114, 101, 101]
[32, 69, 110, 99, 121, 99, 108, 111, 112, 101, 100, 105, 97]
[10, 10]
[83, 101, 97, 114, 99, 104]
[10]
[67, 114, 101, 97, 116, 101]
[32, 97, 99, 99, 111, 117, 110, 116]
[10]
[76, 111, 103]
[32, 105, 110]
[10, 10]
[80, 101, 114, 115, 111, 110, 97, 108]
[32, 116, 111, 111, 108, 115]
[10]
[67, 111, 110, 116, 101, 110, 116, 115]
[32]
[32, 104, 105, 100, 101]
[10]
[40, 84, 111, 112]
[41, 10]
[76, 105, 102, 101]
[32, 97, 110, 100]
[32, 99, 97, 114, 101, 101, 114]
[10]
[84, 111, 103, 103, 108, 10

In [87]:
text2 = tokenizer.decode(enc2)
text2

'Copy paste of the Wikipedia article on Taylor Swift, as of Feb 16, 2024.\n---\n\nMain menu\n\nWikipediaThe Free Encyclopedia\n\nSearch\nCreate account\nLog in\n\nPersonal tools\nContents  hide\n(Top)\nLife and career\nToggle Life and career subsection\nArtistry\nToggle Artistry subsection\nAccolades and achievements\nCultural status\nToggle Cultural status subsection\nWealth\nToggle Wealth subsection\nDiscography\nFilmography\nTours\nSee also\nFootnotes\nReferences\nToggle References subsection\nExternal links\nTaylor Swift\n\n136 languages\nArticle\nTalk\nRead\nView source\nView history\n\nTools\n Featured article\nPage semi-protected\nFrom Wikipedia, the free encyclopedia\nFor the album, see Taylor Swift (album).\nTaylor Swift\nPortrait of Taylor Swift in a cocktail dress\nSwift at the 2023 MTV Video Music Awards\nBorn\tTaylor Alison Swift\nDecember 13, 1989 (age 34)\nWest Reading, Pennsylvania, US\nOccupations\nSinger-songwriter producer director businesswoman actress\nYears active