Start by downloading 10 files.

The code in [dataset.py](https://github.com/karpathy/nanochat/blob/master/nanochat/dataset.py) and [scripts/tok_train.py](https://github.com/karpathy/nanochat/blob/master/scripts/tok_train.py) seems straightforward enough. Rather than recreating all of it at this point, "copy" just enough to `my_dataset.py` to learn.

In [1]:
from my_dataset import download_single_file, parquets_iter_batched, text_iterator

In [2]:
from multiprocessing import Pool

In [6]:
# Try downloading 10 files with 2 workers (in the style of dataset.py)

ids_to_download = list(range(10))
with Pool(processes=2) as pool:
    results = pool.map(download_single_file, ids_to_download)
results

[True, True, True, True, True, True, True, True, True, True]
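Each `True` above is one successful download. As a rough sketch of the kind of helper that could sit behind a call like this (this is not the code in `dataset.py`; the name `download_with_retries`, the injected `fetch` callable, and the retry and backoff parameters are all hypothetical), the idea is to skip files that already exist, retry on failure, and write to a temporary file before renaming so a partial download never looks complete:

```python
import os
import tempfile
import time


def download_with_retries(fetch, dest_path, max_attempts=3, base_delay=1.0):
    """Call fetch() -> bytes and write the result to dest_path; return True on success."""
    if os.path.exists(dest_path):
        return True  # already downloaded, nothing to do
    for attempt in range(max_attempts):
        try:
            data = fetch()
            tmp_path = dest_path + ".tmp"
            with open(tmp_path, "wb") as f:
                f.write(data)
            os.replace(tmp_path, dest_path)  # rename only once the write finished
            return True
        except OSError:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return False


# Usage with a stand-in fetcher (hypothetical) that fails once, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] == 1:
        raise OSError("simulated network error")
    return b"parquet bytes would go here"


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "shard_00000.parquet")
    ok = download_with_retries(flaky_fetch, path, base_delay=0.0)
    print(ok, calls["n"])
```

Returning a boolean instead of raising is what makes a list of results like the one above easy to scan.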

In [3]:
!ls -lh | grep parquet

-rw-r--r--  1 ericsilberstein  staff    90M Oct 26 08:26 shard_00000.parquet
-rw-r--r--  1 ericsilberstein  staff    90M Oct 26 08:27 shard_00001.parquet
-rw-r--r--  1 ericsilberstein  staff    89M Oct 26 08:26 shard_00002.parquet
-rw-r--r--  1 ericsilberstein  staff    89M Oct 26 08:27 shard_00003.parquet
-rw-r--r--  1 ericsilberstein  staff    91M Oct 26 08:27 shard_00004.parquet
-rw-r--r--  1 ericsilberstein  staff    89M Oct 26 08:27 shard_00005.parquet
-rw-r--r--  1 ericsilberstein  staff    89M Oct 26 08:27 shard_00006.parquet
-rw-r--r--  1 ericsilberstein  staff    89M Oct 26 08:27 shard_00007.parquet
-rw-r--r--  1 ericsilberstein  staff    89M Oct 26 08:27 shard_00008.parquet
-rw-r--r--  1 ericsilberstein  staff    89M Oct 26 08:27 shard_00009.parquet


### Look at one of the parquet files with pyarrow

In [4]:
import pyarrow.parquet as pq

In [5]:
pf = pq.ParquetFile("shard_00000.parquet")

In [6]:
pf.schema

<pyarrow._parquet.ParquetSchema object at 0x103f03240>
required group field_id=-1 schema {
  optional binary field_id=-1 text (String);
}

In [7]:
pf.num_row_groups

52

In [8]:
rg = pf.read_row_group(0); rg

pyarrow.Table
text: string
----
text: [["Shipment & Transport-Sea, Air, Rail, Road, Pipeline
The mode of transportation is an important con (... 8601 chars omitted)","12. Definition — In this Part, unless the context otherwise requires, “the State” includes t (... 9050 chars omitted)","Gúthwinë was the sword that belonged to Éomer.
It was borne by him at the Battle of the Hornbur (... 713 chars omitted)","The robot in the picture above is called YOLO, which stands for “your own living object.” It� (... 14639 chars omitted)","Metal additive manufacturing (AM) is growing at a fast-paced spurring the world’s current econom (... 1730 chars omitted)",...,"Coconut oil is often touted as a wonder oil. It speeds up the metabolism, improves calcium and mag (... 5327 chars omitted)","Posted on Apr 01, 2017, 6 a.m.
WHO (World Health Organization) report confirms that air pollution  (... 3686 chars omitted)","Their suffering under racist Italian colonialism pushed Eritreans to retreat into their m

In [9]:
len(rg)

1024

In [10]:
print(rg.column('text').to_pylist()[0][:1000])

Shipment & Transport-Sea, Air, Rail, Road, Pipeline
The mode of transportation is an important consideration when planning the shipment process. Besides the costs, the urgency of the shipment, the value of the goods being shipped as well as the size and weight of the goods need to be evaluated when determining the form of transportation.
Seaborne trade accounts for about 90% of the global trade, and as per UNCTAD, 1687 million tons (2015 estimate) were carried in around 177.6 million containers (2015 estimate) covering 998 billion ton-miles (2016 estimate).
Because of size or volume, there are several types of cargoes that cannot be or is economically unviable to move by other modes of transport than the sea.
Ocean freight is a less expensive method of shipping goods, but the drawback is a longer transit time. Another benefit for ocean freight is while size and weight may be an issue for air; it is not for ocean freight.
Ocean freight is used quite extensively for the movement of bulk 

In [11]:
print(rg.column('text').to_pylist()[50][:1000])

What do we mean by disability?
Under the Disability Discrimination Act 2005 (updated by Equality Act 2010) a disabled student may be a student with:
- specific Learning Difficulties (SpLD) including dyslexia, dyspraxia, dyscalculia, Attention Deficit Disorder (ADD)
- mental Health difficulties (including anxiety and depressive disorders, psychological and psychiatric illness)
- long term medical conditions such as arthritis, epilepsy, diabetes, asthma, chronic fatigue syndrome
- autistic Spectrum Disorders such as Asperger’s’ Syndrome
- sensory impairments
- neurological conditions such as Multiple Sclerosis, Cerebral Palsy
- mobility difficulties
A person can be defined as disabled if their physical or mental impairment:
- has a substantial effect on them
- is long term and has lasted or is expected to last 12 months or more
- has an adverse effect on his/her ability to carry out normal day to day activities
Sharing information and confidentiality
Many disabled people have impairments

### Now try our functions for iterating through the parquet files

In [12]:
for i, texts in enumerate(parquets_iter_batched('train')):
    for j, text in enumerate(texts):
        if i % 100 == 0 and j % 100 == 0:
            excerpt = text[:30].replace('\n','\\n')
            print(f"{i},{j}: {excerpt}")

0,0: Shipment & Transport-Sea, Air,
0,100: The Center for Integrative Sci
0,200: - Special Sections\n- Public No
0,300: Regarded as one of the greates
0,400: 2-D or 3-D plot of output from
0,500: We love nurses. Their dedicati
0,600: In China, bamboo is a symbol o
0,700: Students will become actively 
0,800: Letter S Tracing\nThis workshee
0,900: Many scientists have done focu
0,1000: A dental filling is a dental r
100,0: H.323 is a standard that speci
100,100: Table of Contents\nWhy People w
100,200: Siyavush and Afrasiab\nThe Lege
100,300: One of the most effective ways
100,400: Most students don’t want to mi
100,500: In the existing world, the INT
100,600: The treatment of crop residues
100,700: During this trying time, UC Sa
100,800: Buckthorn is a tree, from whic
100,900: Energy security has received s
100,1000: Students in Perrysburg High Sc
200,0: As much as social media platfo
200,100: Teaching Effective Classroom R
200,200: “Everything was trial and erro
200,300: Did you think

In [13]:
# this is the text_iterator intended for training the tokenizer
for i, doc in enumerate(text_iterator(max_chars=1000, doc_cap=50)):
    doc = doc.replace('\n','\\n')
    print(f"{i}: {doc}")

0: Shipment & Transport-Sea, Air, Rail, Road, Pipelin
1: 12. Definition — In this Part, unless the context 
2: Gúthwinë was the sword that belonged to Éomer.\nIt 
3: The robot in the picture above is called YOLO, whi
4: Metal additive manufacturing (AM) is growing at a 
5: The region Bergisches Land is located in southern 
6: The investigation of past cultures of the modern n
7: Agreement On Food Safety\nInternational trade rules
8: Many good novels in the past have had films produc
9: In January we began a survey of the history of Ame
10: Wednesday, October 13, 2010\nSEEDS OF CHANGE- JEN C
11: Japan is set to be nuclear power-free, for just th
12: Modern humans crowded out Europes Neanderthals\nSci
13: USGS Groundwater Information\nGroundwater Resources
14: We all know the human heart helps pump blood throu
15: It is frequently cited that more than half of us n
16: At the start of the 20th century:\nLawrence produce
17: The ease of doing business rankings for 2017 was r
18: They may b
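To make the two parameters concrete, here is a minimal sketch of how an iterator like this could work, assuming `doc_cap` truncates each document and `max_chars` bounds the running total. The function name `capped_text_iterator` and the `batches` argument are made up for illustration, and the real `text_iterator` in `my_dataset.py` may handle the final document differently:

```python
def capped_text_iterator(batches, max_chars, doc_cap):
    """Yield documents truncated to doc_cap chars until roughly max_chars are emitted."""
    nchars = 0
    for batch in batches:
        for doc in batch:
            doc = doc[:doc_cap]  # cap the length of every document
            nchars += len(doc)
            yield doc
            if nchars >= max_chars:
                return  # stop once the character budget is used up


# Toy batches standing in for the lists of texts read from parquet row groups.
batches = [["a" * 100, "b" * 10], ["c" * 60]]
docs = list(capped_text_iterator(batches, max_chars=100, doc_cap=50))
print([len(d) for d in docs])  # [50, 10, 50]
```

Note that this version can overshoot `max_chars` by up to one capped document, since it only checks the budget after yielding.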

### Now train our tokenizer

In [14]:
from itertools import chain, islice, count
import sys
sys.path.append('../challenge-07-rust-and-python-simplified-tokenizer')
from my_tokenizer import MyTokenizer, SPLIT_PATTERN

#### first with very little text (1000 chars)

In [15]:
tokenizer = MyTokenizer.train_from_iterator(text_iterator(max_chars=1000, doc_cap=50), vocab_size = 65536)

In [16]:
tokenizer.enc.n_vocab

700

In [17]:
def print_tokens(ids):
    for token_id in ids:
        print(f"{token_id} -> {tokenizer.decode([token_id])}")

In [18]:
print_tokens(chain(
    islice(count(699, -1), 5),
    islice(count(600, -1), 5),
    islice(count(300, -1), 5),
    islice(count(70, -1), 5),
))

699 -> <bos>
698 -> Wednesday
697 -> Lawrence
696 -> International
695 -> Gúthwinë
600 -> less
599 -> bot
598 ->  ‘
597 ->  —
596 ->  locat
300 ->  R
299 ->  O
298 ->  I
297 ->  be
296 -> ing
70 -> F
69 -> E
68 -> D
67 -> C
66 -> B
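The `chain(islice(count(...)))` pattern is just a compact way to sample a few descending ids from several points in the vocabulary. A small standalone check of what it generates, using the same start values as the first and last groups above:

```python
from itertools import chain, count, islice

# count(699, -1) counts down forever from 699; islice takes the first 5 values.
top = list(islice(count(699, -1), 5))
print(top)  # [699, 698, 697, 696, 695]

# chain concatenates the groups into one stream of ids.
ids = list(chain(islice(count(699, -1), 5), islice(count(70, -1), 5)))
print(ids)  # [699, 698, 697, 696, 695, 70, 69, 68, 67, 66]

# The same group written with range, for comparison.
print(top == list(range(699, 694, -1)))  # True
```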


#### now with more text (a million chars)

In [19]:
tokenizer = MyTokenizer.train_from_iterator(text_iterator(max_chars=1_000_000), vocab_size = 65536)

In [20]:
tokenizer.enc.n_vocab

34288

Since 34,288 is less than the requested vocab_size of 65,536, training must have run out of pairs to merge, which means every word was fully merged (turned into a single token). Does that seem right?
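A toy version of the training loop shows why the vocabulary can stop short of the requested size. This is a simplified sketch, not `MyTokenizer`: the function `train_toy_bpe` is made up for illustration, it assumes merges never cross word boundaries, and it rebuilds pair counts from scratch on every step:

```python
from collections import Counter


def train_toy_bpe(words, vocab_size):
    """Toy byte-level BPE over pre-split words; returns (merges, final sequences)."""
    word_counts = Counter(words)
    # Every unique word starts as a list of raw byte values.
    seqs = {w: list(w.encode("utf-8")) for w in word_counts}
    merges = []
    next_id = 256
    while next_id < vocab_size:
        pair_counts = Counter()
        for word, seq in seqs.items():
            for pair in zip(seq, seq[1:]):
                pair_counts[pair] += word_counts[word]
        if not pair_counts:
            break  # every word is already a single token, nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        for word, seq in seqs.items():
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    merged.append(next_id)
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            seqs[word] = merged
        next_id += 1
    return merges, seqs


words = ["the", " cat", "the", " hat"]
merges, seqs = train_toy_bpe(words, vocab_size=65536)
print(len(merges), 256 + len(merges))
print(all(len(seq) == 1 for seq in seqs.values()))  # True
```

With so little text the loop exits through the `break` long before `vocab_size` is reached, and at that point every unique word maps to exactly one token.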

Let's count total and unique words as a sanity check.

In [21]:
import regex
total_words = 0
unique_words = set()
for doc in text_iterator(max_chars=1_000_000):
    for word in regex.findall(SPLIT_PATTERN, doc):
        total_words += 1
        unique_words.add(word)
print(f"total words: {total_words:,}\nunique_words: {len(unique_words):,}\nchars per word: {(1_000_000 / total_words):.2f}")

total words: 193,066
unique_words: 22,386
chars per word: 5.18


5.2 chars per word seems high, but maybe not, because many "words" include a leading space.
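A quick way to see the leading-space effect without the real pattern: the regex below is a hypothetical ASCII-only stand-in for `SPLIT_PATTERN` (written for the stdlib `re` module, not the pattern from `my_tokenizer`), but it attaches a preceding space to each word in the same spirit:

```python
import re

# Hypothetical stand-in for SPLIT_PATTERN: optional leading space + letters,
# short digit groups, runs of punctuation, or runs of whitespace.
STAND_IN_PATTERN = re.compile(r" ?[A-Za-z]+|\d{1,2}|[^\sA-Za-z\d]+|\s+")

text = "the cat sat on the mat"
words = STAND_IN_PATTERN.findall(text)
print(words)  # ['the', ' cat', ' sat', ' on', ' the', ' mat']

with_space = len(text) / len(words)
without_space = sum(len(w.strip()) for w in words) / len(words)
print(f"chars per word: {with_space:.2f} (with spaces), {without_space:.2f} (letters only)")
```

Because the pieces cover the whole text, dividing total chars by the number of pieces counts every space as part of some "word", which pushes the average up.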

In [22]:
import random
for word in random.sample(list(unique_words), 10):
    print(f"{word} -> {tokenizer.encode(word)}")

 Programs -> [16968]
 anthems -> [17959]
extra -> [27169]
 cameras -> [29648]
-HIV -> [28586]
sebaceous -> [14906]
 extract -> [4616]
 thrusts -> [18482]
Intervention -> [24531]
 productive -> [12497]


In [23]:
print_tokens(chain(
    islice(count(34287, -1), 10),
    islice(count(32000, -1), 5),
    islice(count(25000, -1), 5),
    islice(count(5000, -1), 5),
    islice(count(1000, -1), 5),
    islice(count(500, -1), 5),
    islice(count(70, -1), 5),
))

34287 -> <bos>
34286 ->  чрезвычайной
34285 ->  人类的交往可建立信任
34284 ->  водонагревателя
34283 ->  Русский
34282 ->  θερμος
34281 ->  KINDERGARTEN
34280 ->  RELATIVES
34279 ->  riboflavin
34278 ->  BANKRUPTCY
32000 -> -coloured
31999 ->  delightful
31998 ->  retrieval
31997 ->  retriever
31996 ->  immunomod
25000 ->  –

24999 ->  estuarine
24998 ->  estates
24997 -> urns
24996 ->  heralded
5000 -> Carbon
4999 ->  juice
4998 ->  Physical
4997 ->  moderate
4996 ->  precise
1000 -> vern
999 ->  law
998 ->  dr
997 ->  school
996 -> 17
500 ->  all
499 -> ally
498 ->  J
497 ->  which
496 ->  whe
70 -> F
69 -> E
68 -> D
67 -> C
66 -> B


In [24]:
tokenizer.encode('the')

[956]

#### now with even more text, ten million chars

This is still 200x smaller than the 2B characters Karpathy trains on in `speedrun.sh` and 400x smaller than the 4B in `run1000.sh`.

In [25]:
tokenizer = MyTokenizer.train_from_iterator(text_iterator(max_chars=10_000_000), vocab_size = 65536)

In [26]:
tokenizer.enc.n_vocab

65537
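The three `n_vocab` values seen so far (700, 34288, 65537) can be broken down with a little arithmetic. This assumes the vocabulary is 256 byte tokens, plus one token per merge, plus a single special token; the single special token is a guess based on `<bos>` showing up as the last id in each run:

```python
N_BYTE_TOKENS = 256   # one token per possible byte value
N_SPECIAL_TOKENS = 1  # assumption: <bos> is the only special token


def merges_learned(n_vocab):
    """Number of merges implied by a vocabulary size under the assumptions above."""
    return n_vocab - N_BYTE_TOKENS - N_SPECIAL_TOKENS


for label, n_vocab in [("1,000 chars", 700), ("1M chars", 34288), ("10M chars", 65537)]:
    print(f"{label}: {merges_learned(n_vocab):,} merges")

# In the 10M run the merges fill exactly vocab_size - 256 slots.
print(merges_learned(65537) == 65536 - N_BYTE_TOKENS)  # True
```

Under these assumptions the 10M-char run is the first one that hits the `vocab_size` cap, with the special token added on top of it.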

In [27]:
print_tokens(chain(
    islice(count(65536, -1), 10),
    islice(count(50000, -1), 10),
    islice(count(40000, -1), 10),
    islice(count(32000, -1), 5),
    islice(count(25000, -1), 5),
    islice(count(5000, -1), 5),
    islice(count(1000, -1), 5),
    islice(count(500, -1), 5),
    islice(count(70, -1), 5),
))

65536 -> <bos>
65535 ->  HIPC
65534 ->  Hammon
65533 ->  Wanless
65532 ->  Wanat
65531 ->  BSEs
65530 ->  Breedlove
65529 ->  Breeders
65528 ->  BETA
65527 ->  Boudnath
50000 ->  discernment
49999 ->  obliv
49998 ->  Yamamoto
49997 ->  Yamuna
49996 ->  Vilayph
49995 ->  unrestrained
49994 ->  unlawful
49993 ->  unlucky
49992 ->  Jonathan
49991 ->  Laubach
40000 -> -selected
39999 -> -seat
39998 ->  acquies
39997 ->  atmospheres
39996 ->  Serving
39995 ->  Sanctuary
39994 ->  dedicating
39993 ->  vulnerabilities
39992 ->  electrically
39991 ->  Remedies
32000 ->  Osborne
31999 ->  stupas
31998 ->  Tibbits
31997 -> iliary
31996 ->  erectus
25000 ->  Behavioral
24999 ->  acknowledging
24998 ->  instrumentalist
24997 ->  Showers
24996 ->  Affordable
5000 ->  Cath
4999 -> iration
4998 ->  Sal
4997 -> igure
4996 ->  northern
1000 ->  If
999 ->  good
998 ->  proble
997 ->  def
996 -> ices
500 -> end
499 ->  O
498 ->  ad
497 ->  will
496 ->  which
70 -> F
69 -> E
68 -> D
67 -> C
66 -> B


In [28]:
ids = tokenizer.encode("That is a cat."); ids

[5187, 310, 257, 2483, 46]

In [29]:
print_tokens(ids) # as expected with these common words, each has its own token

5187 -> That
310 ->  is
257 ->  a
2483 ->  cat
46 -> .


In [30]:
ids = tokenizer.encode("Griffonage and proxinosini are rare words."); ids

[37027, 609, 262, 488, 288, 9034, 33135, 7346, 345, 3066, 2085, 46]

In [31]:
print_tokens(ids)

37027 -> Gr
609 -> iff
262 -> on
488 -> age
288 ->  and
9034 ->  prox
33135 -> inos
7346 -> ini
345 ->  are
3066 ->  rare
2085 ->  words
46 -> .


In [32]:
ids = tokenizer.encode("一会儿去看电影"); len(ids)

21

In [33]:
print_tokens(ids)
# there must not be much Chinese text in what we're training on because none of the bytes got merged into common chars

228 -> �
184 -> �
128 -> �
228 -> �
188 -> �
154 -> �
229 -> �
132 -> �
191 -> �
229 -> �
142 -> �
187 -> �
231 -> �
156 -> �
139 -> �
231 -> �
148 -> �
181 -> �
229 -> �
189 -> �
177 -> �
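The 21 ids above are simply the raw UTF-8 bytes of the string: each of the 7 characters encodes to 3 bytes, and a lone byte from a multi-byte sequence is not valid UTF-8 on its own, which is why each one prints as the replacement character. This can be checked with plain Python, no tokenizer needed:

```python
text = "一会儿去看电影"
raw = text.encode("utf-8")

print(len(text), len(raw))  # 7 21
print(list(raw[:6]))        # [228, 184, 128, 228, 188, 154]

# A single byte of a 3-byte sequence cannot be decoded by itself.
print(bytes([228]).decode("utf-8", errors="replace"))  # prints U+FFFD, the replacement character

# The three bytes together decode back to the first character.
print(raw[:3].decode("utf-8"))  # 一
```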


In [34]:
ids = tokenizer.encode("Я говорю по-русски"); len(ids)

17

In [36]:
print_tokens(ids)
# similar for Russian, though there must be a bit of it, since "ов" and "ск", both very common in Russian words and names, got merged

208 -> �
175 -> �
13875 ->  �
179 -> �
64329 -> ов
12288 -> о
14761 -> р
209 -> �
142 -> �
63808 ->  п
12288 -> о
45 -> -
14761 -> р
23850 -> у
16975 -> с
29353 -> ск
10608 -> и


In [37]:
ids = tokenizer.encode("El gato compró un mapa."); len(ids)

10

In [38]:
print_tokens(ids)

15747 -> El
313 ->  g
7442 -> ato
522 ->  comp
114 -> r
8073 -> ó
572 ->  un
4179 ->  map
97 -> a
46 -> .


In [39]:
ids = tokenizer.encode("10 11 12 13 14 15 16 17 18 19 20 90 91 92 93 94 95 96 97 98 99"); len(ids)

41

In [40]:
print_tokens(ids)
# as expected, at least these 21 two-digit numbers all got their own token

734 -> 10
32 ->  
1306 -> 11
32 ->  
1029 -> 12
32 ->  
1301 -> 13
32 ->  
1312 -> 14
32 ->  
1105 -> 15
32 ->  
1321 -> 16
32 ->  
1167 -> 17
32 ->  
832 -> 18
32 ->  
559 -> 19
32 ->  
503 -> 20
32 ->  
2166 -> 90
32 ->  
5101 -> 91
32 ->  
4373 -> 92
32 ->  
5241 -> 93
32 ->  
4565 -> 94
32 ->  
3358 -> 95
32 ->  
4191 -> 96
32 ->  
3910 -> 97
32 ->  
3843 -> 98
32 ->  
4037 -> 99
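The count of 41 works out as 21 numbers plus the 20 spaces between them, with each space kept as its own token (id 32) instead of being attached to the following number. A sketch of that split, using a hypothetical ASCII-only stand-in for `SPLIT_PATTERN` (not the real pattern from `my_tokenizer`) in which only letters may absorb a leading space:

```python
import re

# Hypothetical stand-in for SPLIT_PATTERN: a leading space may attach to letters,
# but digit groups are matched on their own.
STAND_IN_PATTERN = re.compile(r" ?[A-Za-z]+|\d{1,2}|[^\sA-Za-z\d]+|\s+")

numbers = list(range(10, 21)) + list(range(90, 100))
text = " ".join(str(n) for n in numbers)
pieces = STAND_IN_PATTERN.findall(text)

digit_pieces = [p for p in pieces if p.isdigit()]
space_pieces = [p for p in pieces if p == " "]
print(len(digit_pieces), len(space_pieces), len(pieces))  # 21 20 41
```

If every two-digit piece has its own token and the space is the byte token 32, the encoded length equals the number of pieces, which matches the 41 above.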
