# Building Makemore From Scratch


**Makemore** is an educational language-modeling project by [Andrej Karpathy](https://github.com/karpathy) that demonstrates how to build text generation models from scratch.<br>
The name comes from its purpose: training models that can *"make more"* examples of a given dataset (e.g., generate new names, words, or text). It shows how character-level language models are built step by step.



## Part-1: Bigrams

### 1. Loading and Inspecting the Data

In [2]:
words = open('names.txt', 'r').read().splitlines()  # read the file into a list, one name per line

In [3]:
words[:10]

['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

In [4]:
len(words) # total number of words

32033

In [5]:
# finding minimum and maximum length of words
min_len = min(len(w) for w in words)
max_len = max(len(w) for w in words)
print(f"Minimum length: {min_len} and Maximum length: {max_len}")

Minimum length: 2 and Maximum length: 15


### 2. Building the Bigram Model (Dictionary Approach)

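A bigram is a pair of two consecutive characters. We're going to count how many times each bigram occurs across the entire dataset. Every word is padded with a start token `<S>` and an end token `<E>`, so the counts also capture which characters tend to begin and end a name.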
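
To make the definition concrete before counting over the whole dataset, here is a minimal sketch that extracts the bigrams of a single word. The helper name `bigrams` is hypothetical (it does not appear in the notebook); it uses the same `zip(chs, chs[1:])` idiom as the cell that follows.

```python
def bigrams(word, start='<S>', end='<E>'):
    """Return the list of consecutive character pairs in `word`.

    `bigrams` is an illustrative helper, not part of the original notebook.
    """
    chs = [start] + list(word) + [end]  # pad with start and end tokens
    # zip stops at the shorter sequence, so each item is paired with its successor
    return list(zip(chs, chs[1:]))

pairs = bigrams('emma')
print(pairs)
# [('<S>', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '<E>')]
```

A word of length `n` yields `n + 1` bigrams, because the two padding tokens each contribute one extra pair.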

In [16]:
b = {}  # bigram dictionary: (ch1, ch2) -> count
for w in words:  # iterate over every word in the dataset
    chs = ['<S>'] + list(w) + ['<E>']  # add start and end tokens to each word
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = (ch1, ch2)  # create the bigram tuple
        b[bigram] = b.get(bigram, 0) + 1  # increment the count, treating a missing bigram as 0
        # print(ch1, ch2)
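
As an aside, the same counting can be sketched with `collections.Counter` from the standard library, which also makes it easy to rank bigrams by frequency. This is an alternative to the plain-dictionary approach above, not the notebook's own method; the function name `count_bigrams` and the three-word `toy_words` list are assumptions chosen so the result can be checked by hand.

```python
from collections import Counter

def count_bigrams(words):
    """Count consecutive character pairs, including start/end tokens.

    `count_bigrams` is a hypothetical helper for illustration only.
    """
    counts = Counter()
    for w in words:
        chs = ['<S>'] + list(w) + ['<E>']
        counts.update(zip(chs, chs[1:]))  # add one to each pair seen
    return counts

toy_words = ['emma', 'ava', 'mia']  # toy data, not the full names.txt
counts = count_bigrams(toy_words)

print(counts[('a', '<E>')])   # 3: all three toy names end in 'a'
print(counts.most_common(1))  # [(('a', '<E>'), 3)]
```

On the full dataset, `most_common` would give the same ranking as sorting `b.items()` by count in descending order.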

In [15]:
b

{('<S>', 'e'): 1531,
 ('e', 'm'): 769,
 ('m', 'm'): 168,
 ('m', 'a'): 2590,
 ('a', '<E>'): 6640,
 ('<S>', 'o'): 394,
 ('o', 'l'): 619,
 ('l', 'i'): 2480,
 ('i', 'v'): 269,
 ('v', 'i'): 911,
 ('i', 'a'): 2445,
 ('<S>', 'a'): 4410,
 ('a', 'v'): 834,
 ('v', 'a'): 642,
 ('<S>', 'i'): 591,
 ('i', 's'): 1316,
 ('s', 'a'): 1201,
 ('a', 'b'): 541,
 ('b', 'e'): 655,
 ('e', 'l'): 3248,
 ('l', 'l'): 1345,
 ('l', 'a'): 2623,
 ('<S>', 's'): 2055,
 ('s', 'o'): 531,
 ('o', 'p'): 95,
 ('p', 'h'): 204,
 ('h', 'i'): 729,
 ('<S>', 'c'): 1542,
 ('c', 'h'): 664,
 ('h', 'a'): 2244,
 ('a', 'r'): 3264,
 ('r', 'l'): 413,
 ('l', 'o'): 692,
 ('o', 't'): 118,
 ('t', 't'): 374,
 ('t', 'e'): 716,
 ('e', '<E>'): 3983,
 ('<S>', 'm'): 2538,
 ('m', 'i'): 1256,
 ('a', 'm'): 1634,
 ('m', 'e'): 818,
 ('<S>', 'h'): 874,
 ('r', 'p'): 14,
 ('p', 'e'): 197,
 ('e', 'r'): 1958,
 ('r', '<E>'): 1377,
 ('e', 'v'): 463,
 ('v', 'e'): 568,
 ('l', 'y'): 1588,
 ('y', 'n'): 1826,
 ('n', '<E>'): 6763,
 ('b', 'i'): 217,
 ('i', 'g'): 428,
