In [3]:
#Boilerplate

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

#select one of the devices; note that the GPU (Nvidia or Metal) is currently slower than the CPU here

#cpu
device = torch.device("cpu")

#nvidia cuda
#device = torch.device("cuda")

#high-performance training on Metal GPU for Mac - https://pytorch.org/docs/stable/notes/mps.html
#device = torch.device("mps")
#%env PYTORCH_ENABLE_MPS_FALLBACK=0

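#load the words (names) and build the character <-> index lookup tables
with open('AI4-names.txt', 'r') as f:
    words = f.read().splitlines()
chars = sorted(set(''.join(words)))
stoi = {s:i+1 for i,s in enumerate(chars)} #character indices start at 1
stoi['.'] = 0 #'.' marks both the start and the end of a word
itos = {i:s for s, i in stoi.items()}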
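
The two lookup tables turn a name into a list of indices and back. A minimal sketch of that round trip, assuming a two-word toy list (`toy_words` is a stand-in for the contents of `AI4-names.txt`, not real data from the file):

```python
toy_words = ["emma", "ava"]                      # hypothetical stand-in for the names file
chars = sorted(set(''.join(toy_words)))          # ['a', 'e', 'm', 'v']
stoi = {s: i + 1 for i, s in enumerate(chars)}   # character -> index, starting at 1
stoi['.'] = 0                                    # index 0 is the start/end marker
itos = {i: s for s, i in stoi.items()}           # index -> character

encoded = [stoi[ch] for ch in "emma" + "."]      # word followed by the end marker
decoded = ''.join(itos[i] for i in encoded)
print(encoded)   # [2, 3, 3, 1, 0]
print(decoded)   # emma.
```

Reserving index 0 for `.` lets the same symbol pad the initial context and signal the end of a generated word.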

# Previously Achieved Results

## BORA'S FAST NN SETUP
The setup below, with `17735` parameters, reaches losses of ~`1.9850` on the training set and ~`2.0858` on the dev/validation set:

`block_size = 4 | d = 4 | hl = 400 | mb_size = 64 | lr=[0.2, 0.1, 0.05, 0.01] | loop_lr = 200000`
>`block_size` is the context length (how many characters are used to predict the next one),<br>
>`d` is the embedding size,<br>
>`hl` is the number of neurons in the hidden layer,<br>
>`mb_size` is the minibatch size,<br>
>`lr` lists the four learning rates, applied in order: each of the two loops uses one rate for its first half and the next for its second half,<br>
>`loop_lr` is the number of gradient descent steps in each loop (two loops in total).

Generated words with Uniqueness Score of **~95%** and other stats as follows:

| First Run| Second Run| Third Run |
|---|---|---|
| xevan, snylie, olden, sandanerimya, trus, renulia, judiya, madhaiser, edin+, roon, kelie, bradashiyan, suka, carriusiah, sanny, brixna, delviha, krith, kaeliyah, kinz | vick, aswynnie, fonyn, maramine, chanory, iuganor, kobia, alasimyurdei, kipelyn, elaysen, elezarias, jedy, annis, wafley, briyona, thaidonie, bexlee, tashad, srion, keltan | tyla, eikdy, dairo, corron, araiya, alvier, ocland, axina, drenli, malia, kaniya+, akima, koteh, bouro, abdor, katthilanarthli, juony, cai+, aymager, broken |
| 1 no. (+ marked) generated words are exact copies from the training set; uniqueness score 19/20, 95%. | 0 no. (+ marked) generated words are exact copies from the training set; uniqueness score 20/20, 100%. | 2 no. (+ marked) generated words are exact copies from the training set; uniqueness score 18/20, 90%. |
| Losses @ Train & Dev Sets: 1.9902 & 2.0940 | Losses @ Train & Dev Sets: 1.9883 & 2.0854 | Losses @ Train & Dev Sets: 1.9765 & 2.0779 |

<hr/>

## BORA'S OPTIMUM NN SETUP
The setup below, with `26662` parameters, reaches losses of ~`1.9115` on the training set and ~`2.0564` on the dev/validation set:

`block_size = 5 | d = 5 | hl = 500 | mb_size = 80 | lr=[0.2, 0.1, 0.05, 0.01] | loop_lr = 200000`
>`block_size` is the context length (how many characters are used to predict the next one),<br>
>`d` is the embedding size,<br>
>`hl` is the number of neurons in the hidden layer,<br>
>`mb_size` is the minibatch size,<br>
>`lr` lists the four learning rates, applied in order: each of the two loops uses one rate for its first half and the next for its second half,<br>
>`loop_lr` is the number of gradient descent steps in each loop (two loops in total).

Generated words with Uniqueness Score of **~92%** and other stats as follows:

| First Run| Second Run| Third Run |
|---|---|---|
| tyma, eindy, dairo, corron, aibre, zalanis, oclaun, axiah, drenor, malia, kaniya+, frica, kator, brumi, abdore, jillie, warthli, juones, gicay, agar | angelise, gabrier, bella, zannin, finsly, iuganous, kyrael, derry, addie+, costyn, elays, rhee, zariah+, jedya, anney, palest, aidona, thaison, mebibeh, mirsh | xossa, veyli, soldee, samaaher, myran, breya+, jusniya, kyatt, dechetriel, jerron+, kelse, britang, yuvinna, diang, mairse, manyia, manila, kamot, kristophe, dannga |
| 1 no. (+ marked) generated words are exact copies from the training set; uniqueness score 19/20, 95%. | 2 no. (+ marked) generated words are exact copies from the training set; uniqueness score 18/20, 90%. | 2 no. (+ marked) generated words are exact copies from the training set; uniqueness score 18/20, 90%. |
| Losses @ Train & Dev Sets: 1.9150 & 2.0533 | Losses @ Train & Dev Sets: 1.9145 & 2.0598 | Losses @ Train & Dev Sets: 1.9050 & 2.0560 |

<hr/>

## BORA'S BIG NN SETUP
The setup below, with `64297` parameters, reaches losses of ~`2.0337` on the training set and ~`2.2310` on the dev/validation set:

`block_size = 10 | d = 10 | hl = 500 | mb_size = 100 | lr=[0.2, 0.1, 0.05, 0.01] | loop_lr = 200000`
>`block_size` is the context length (how many characters are used to predict the next one),<br>
>`d` is the embedding size,<br>
>`hl` is the number of neurons in the hidden layer,<br>
>`mb_size` is the minibatch size,<br>
>`lr` lists the four learning rates, applied in order: each of the two loops uses one rate for its first half and the next for its second half,<br>
>`loop_lr` is the number of gradient descent steps in each loop (two loops in total).

Generated words with Uniqueness Score of **~97%** and other stats as follows:

| First Run| Second Run| Third Run |
|---|---|---|
| xobif, veylin, oleejah, zoshern, marthey, renulyn, fudryan, jahiyah, ferin, roou, kelie, britasgh, anzisa, diano, maira+, fannykan, anilia, dahmand, raylae, danngo | jazaya, tahron, valenia, eyocin, kaniella, madoneia, bezren, bitsiel, anslen, miuft, lulabe, waina, ajaril, tnony, deyson, zeloo, veliah, rankellond, kaisyncl, kaesiyah | jazaya, tahanne, tirihase, kylen, deley, aliah+, aria, bervewve, aureona, syana, nufayla, kayvan, alaija, alien, nyliey, falleegh, vaigah, raykel, tahar, leylalose |
| 1 no. (+ marked) generated words are exact copies from the training set; uniqueness score 19/20, 95%. | 0 no. (+ marked) generated words are exact copies from the training set; uniqueness score 20/20, 100%. | 1 no. (+ marked) generated words are exact copies from the training set; uniqueness score 19/20, 95%. |
| Losses @ Train & Dev Sets: 2.0386 & 2.2338 | Losses @ Train & Dev Sets: 2.0277 & 2.2251 | Losses @ Train & Dev Sets: 2.0348 & 2.2341 |

<hr/>
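
The parameter counts quoted for the three setups follow from the shapes of the five tensors the network trains (`C`, `W1`, `b1`, `W2`, `b2`). As a minimal sketch, assuming the 27-symbol vocabulary used throughout this notebook, the helper below reproduces them (`count_parameters` is a hypothetical name, not part of the notebook):

```python
def count_parameters(block_size, d, hl, vocab_size=27):
    """Count the trainable values in C, W1, b1, W2 and b2."""
    embedding = vocab_size * d        # C:  (vocab_size, d)
    hidden_w = block_size * d * hl    # W1: (block_size * d, hl)
    hidden_b = hl                     # b1: (hl,)
    output_w = hl * vocab_size        # W2: (hl, vocab_size)
    output_b = vocab_size             # b2: (vocab_size,)
    return embedding + hidden_w + hidden_b + output_w + output_b

setups = {"fast": (4, 4, 400), "optimum": (5, 5, 500), "big": (10, 10, 500)}
for name, (block_size, d, hl) in setups.items():
    print(name, count_parameters(block_size, d, hl))
# fast 17735
# optimum 26662
# big 64297
```

Most of the growth comes from `W1`: its size is the product of `block_size`, `d` and `hl`, so raising context length and embedding size together scales it quadratically.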


In [8]:
#parameters of NN

#learning rates, same for all NN setups
lr=[0.2, 0.1, 0.05, 0.01] #four learning rates; the rate steps down halfway through each loop defined below
loop_lr = 200000 #number of gradient descent steps per loop; the loop runs twice

#NN parameters to experiment with (defaults, overridden by the selection below)
name_parameterSet = ""
block_size = 1 #context length: how many characters are used to predict the next one
d = 1 #embedding size
hl = 10 #number of neurons in the hidden layer
mb_size = 10 #minibatch size

#make selection here        
selection = "fast" #fast, optimum or big

#from these predefined sets
if selection == "fast":
    # setup 1
    name_parameterSet = "BORA'S FAST NN SETUP"
    block_size = 4; d = 4; hl = 400; mb_size = 64
    
elif selection == "optimum":
    # setup 2
    name_parameterSet = "BORA'S OPTIMUM NN SETUP"
    block_size = 5; d = 5; hl = 500; mb_size = 80
    
elif selection == "big":
    # setup 3
    name_parameterSet = "BORA'S BIG NN SETUP"
    block_size = 10; d = 10; hl = 500; mb_size = 100
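
The four entries of `lr` are consumed in order across the two training loops. A minimal sketch of that schedule as a standalone function, assuming a 0-based global step between 0 and `2 * loop_lr - 1` (`learning_rate_at` is a hypothetical helper; the notebook computes the rate inline in the training loop):

```python
def learning_rate_at(step, lr=(0.2, 0.1, 0.05, 0.01), loop_lr=200000):
    """Return the learning rate applied at a 0-based global training step."""
    loop, i = divmod(step, loop_lr)   # which of the two loops, and the position inside it
    first_half = i < loop_lr / 2
    if loop < 1:
        return lr[0] if first_half else lr[1]   # first loop
    return lr[2] if first_half else lr[3]       # second loop

for step in (0, 99999, 100000, 200000, 300000, 399999):
    print(step, learning_rate_at(step))
# 0 0.2
# 99999 0.2
# 100000 0.1
# 200000 0.05
# 300000 0.01
# 399999 0.01
```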

In [9]:
#build dataset
def build_dataset(words):
    X, Y = [], []
    
    for w in words:
    
        #print(w)
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            #print(''.join(itos[i] for i in context), '--->', itos[ix])
            context = context[1:] + [ix] #crop and append
    
    X = torch.tensor(X, device=device)
    Y = torch.tensor(Y, device=device)
    print(X.shape, Y.shape)
    return X, Y

import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8*len(words))
n2 = int(0.9*len(words))

# training split, dev/validation split, test split
# 80%, 10%, 10%
Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])

torch.Size([182580, 4]) torch.Size([182580])
torch.Size([22767, 4]) torch.Size([22767])
torch.Size([22799, 4]) torch.Size([22799])
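
Each word contributes one training example per character plus one for the end marker, which is why the tensors above have far more rows than there are words. A minimal sketch of the sliding context window for a single word, assuming a toy vocabulary and `block_size = 3` (`context_windows`, `toy_stoi` and `toy_itos` are hypothetical names used only for illustration):

```python
def context_windows(word, block_size, stoi):
    """Yield (context, target) index pairs for one word; the context is padded with 0 ('.')."""
    context = [0] * block_size
    for ch in word + '.':
        ix = stoi[ch]
        yield list(context), ix
        context = context[1:] + [ix]   # crop the oldest index and append the new one

toy_stoi = {'.': 0, 'a': 1, 'e': 2, 'm': 3}      # toy vocabulary
toy_itos = {i: s for s, i in toy_stoi.items()}

for context, target in context_windows("emma", 3, toy_stoi):
    print(''.join(toy_itos[i] for i in context), '--->', toy_itos[target])
# ... ---> e
# ..e ---> m
# .em ---> m
# emm ---> a
# mma ---> .
```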


In [10]:
lossT_all = 0.0
lossD_all = 0.0
bestquality_all = 0.0
bestgenwords_all = []
bestcount_all = []
loss_all = []

print("\nSelected NN parameter set: " + name_parameterSet)

for run in range(3):
    print('\nRun #' + str(run+1))
    td = block_size * d 
    g = torch.Generator().manual_seed(2147483647) #fixed seed: identical initial weights in every run (minibatch sampling below is not seeded)
    C = torch.randn((27,d), generator=g).to(device) 
    W1 = torch.randn((td,hl), generator=g).to(device) 
    b1 = torch.randn(hl, generator=g).to(device) 
    W2 = torch.randn((hl,27), generator=g).to(device) 
    b2 = torch.randn(27, generator=g).to(device) 
    parameters = [C, W1, b1, W2, b2]

    for p in parameters:
        p.requires_grad = True

    lossi = []
    stepi = []

    pp = 0
    for j in range(2):
        for i in range(loop_lr):

            #minibatch
            ix = torch.randint(0, Xtr.shape[0], (mb_size,), device=device) #use training set

            #forward pass
            emb = C[Xtr[ix]] #use training set
            h = torch.tanh(emb.view(-1, td) @ W1 + b1) #hidden layer activations, shape (mb_size, hl)
            logits = h @ W2 + b2 
            loss = F.cross_entropy(logits, Ytr[ix]) #use training set
            #print(loss.item())

            #backward pass
            for p in parameters:
                p.grad = None
            loss.backward()

            lr_current = 0

            if j < 1:
                lr_current = lr[0] if i < loop_lr/2 else lr[1] #first loop
            else:
                lr_current = lr[2] if i < loop_lr/2 else lr[3] #second loop

            for p in parameters:
                p.data += -lr_current * p.grad

            #print progress
            currentStep = i + 1 + loop_lr*j
            totalNoSteps = loop_lr * 2
            currentProgress = int(float(currentStep) / float(totalNoSteps) *100)
            reportingStep = 2
            if (currentStep > pp) and (currentProgress % reportingStep == 0):
                progress = "Progressed " + str(currentProgress) + "%" + " at Step " + str(currentStep) + " of " + str(totalNoSteps)
                print(progress, end="\r")
                pp = currentStep + (totalNoSteps * 2 / 100) - reportingStep #cut-off until next step

            #track stats
            stepi.append(i+loop_lr*j)
            lossi.append(loss.log10().item())

    print('\nFinal Loss @ MiniBatch ' + str(loss.item()))
    # plt.plot(stepi, lossi)

    with torch.no_grad(): #evaluation only, so no gradient graph is needed
        emb = C[Xtr] #use training set to check
        h = torch.tanh(emb.view(-1, td) @ W1 + b1)
        logits = h @ W2 + b2
        lossT = F.cross_entropy(logits, Ytr) #use training set to check
        lossT_all += lossT.item() / 3.0 #running average over the three runs

        emb = C[Xdev] #use dev set to evaluate
        h = torch.tanh(emb.view(-1, td) @ W1 + b1)
        logits = h @ W2 + b2
        lossD = F.cross_entropy(logits, Ydev) #use dev set to evaluate
        lossD_all += lossD.item() / 3.0 #running average over the three runs

    #finally, let's sample from the new NN model

    #sample with 11 different seeds and keep the batch with the highest uniqueness score
    bestquality = 0.0
    bestgenwords = ""
    bestcountexct = 0

    for rand in range(11):
        g = torch.Generator().manual_seed(2147483647+rand)

        generation = []
        numberofgen = 20
        for i in range(numberofgen):

            out = []
            context = [0] * block_size #start with dot
            while True:
                emb = C[torch.tensor([context])] # (1, block_size, d)
                h = torch.tanh(emb.view(1, -1) @ W1 + b1)
                logits = h @ W2 + b2
                probs = F.softmax(logits, dim=1)
                ix = torch.multinomial(probs, num_samples=1, generator=g).item()
                context = context[1:] + [ix]
                out.append(ix)
                if ix == 0:
                    break

            generated = ''.join(itos[i] for i in out).rstrip(".")
            generation.append(generated) 


        #find the generated words which already exist in the training set
        trainwords = set(words[:n1]) #a set gives fast membership checks
        genwords = ""
        countexct = 0
        #c = [0 for x in range(0, len(generation))] #just marking the index
        for i, x in enumerate(generation):
            if len(genwords) > 0:
                genwords += ", "

            genwords += x

            if x in trainwords:
                #c[i] = 1
                countexct += 1
                genwords += "+"

        quality = (100*(1.0 - float(countexct)/float(numberofgen)))

        if quality > bestquality:
            bestgenwords = genwords
            bestcountexct = countexct
            bestquality = quality

    bestquality_all += bestquality / 3.0
    bestgenwords_all.append(bestgenwords)
    bestcount_all.append(str(bestcountexct) + " no. (+ marked) generated words are exact copies from the training set; uniqueness score " + str(numberofgen-bestcountexct) + "/" + str(numberofgen) + ", " + f'{bestquality:.0f}' + "%.")
    loss_all.append("Losses @ Train & Dev Sets: " + f'{lossT.item():.4f}' + " & " +  f'{lossD.item():.4f}')


Selected NN parameter set: BORA'S FAST NN SETUP

Run #1
Progressed 100% at Step 400000 of 400000
Final Loss @ MiniBatch 2.0236551761627197

Run #2
Progressed 100% at Step 400000 of 400000
Final Loss @ MiniBatch 2.006887674331665

Run #3
Progressed 100% at Step 400000 of 400000
Final Loss @ MiniBatch 1.7776563167572021
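
The uniqueness score reported for each run is the share of generated words that do not appear verbatim in the training split. A minimal sketch of that calculation, assuming toy word lists (`uniqueness_score` is a hypothetical helper, and the training list below is made up for illustration):

```python
def uniqueness_score(generated, training_words):
    """Return (comma-joined words with '+' marks, exact-copy count, uniqueness in percent)."""
    training = set(training_words)   # set membership keeps each lookup cheap
    marked, copies = [], 0
    for word in generated:
        if word in training:
            copies += 1
            marked.append(word + "+")   # '+' marks an exact copy of a training word
        else:
            marked.append(word)
    score = 100.0 * (1.0 - copies / len(generated))
    return ", ".join(marked), copies, score

text, copies, score = uniqueness_score(
    ["malia", "kaniya", "xevan", "zolin"],   # toy generated words
    ["kaniya", "emma"],                      # hypothetical training words
)
print(text)           # malia, kaniya+, xevan, zolin
print(copies, score)  # 1 75.0
```

A score of 100% only says that no sampled word is memorized; it does not measure how name-like the words are, which is why the losses are reported alongside it.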


In [11]:
#Markdown output
from IPython.display import display, Markdown
output = """\
## {name}
The setup below, with `{parameters}` parameters, reaches losses of ~`{lossT}` on the training set and ~`{lossD}` on the dev/validation set:

`block_size = {block_size} | d = {d} | hl = {hl} | mb_size = {mb_size} | lr={lr} | loop_lr = {loop_lr}`
>`block_size` is the context length (how many characters are used to predict the next one),<br>
>`d` is the embedding size,<br>
>`hl` is the number of neurons in the hidden layer,<br>
>`mb_size` is the minibatch size,<br>
>`lr` lists the four learning rates, applied in order: each of the two loops uses one rate for its first half and the next for its second half,<br>
>`loop_lr` is the number of gradient descent steps in each loop (two loops in total).

Generated words with Uniqueness Score of **~{bestquality}%** and other stats as follows:

| First Run| Second Run| Third Run |
|---|---|---|
| {bestgen1} | {bestgen2} | {bestgen3} |
| {bestcount1} | {bestcount2} | {bestcount3} |
| {loss1} | {loss2} | {loss3} |
<hr/>
""".format(
    name=name_parameterSet,
    parameters=sum(p.nelement() for p in parameters),
    lossT=f'{lossT_all:.4f}',
    lossD=f'{lossD_all:.4f}',
    block_size = block_size, #context length: how many characters do we take to predict the next one
    d = d, #embedding size
    hl = hl, #number of neurons in the hidden layer
    mb_size = mb_size, #minibatch size
    lr= lr,#learning rates decaying in four parts, halfway for each loop defined below
    loop_lr = loop_lr, #single loop no for learning rates, applied twice
    bestquality = f'{bestquality_all:.0f}',
    bestgen1 = bestgenwords_all[0],
    bestgen2 = bestgenwords_all[1],
    bestgen3 = bestgenwords_all[2],
    bestcount1 = bestcount_all[0],
    bestcount2 = bestcount_all[1],
    bestcount3 = bestcount_all[2],
    loss1 = loss_all[0],
    loss2 = loss_all[1],
    loss3 = loss_all[2]
)
display(Markdown(output))

## BORA'S FAST NN SETUP
The setup below, with `17735` parameters, reaches losses of ~`1.9773` on the training set and ~`2.0701` on the dev/validation set:

`block_size = 4 | d = 4 | hl = 400 | mb_size = 64 | lr=[0.2, 0.1, 0.05, 0.01] | loop_lr = 200000`
>`block_size` is the context length (how many characters are used to predict the next one),<br>
>`d` is the embedding size,<br>
>`hl` is the number of neurons in the hidden layer,<br>
>`mb_size` is the minibatch size,<br>
>`lr` lists the four learning rates, applied in order: each of the two loops uses one rate for its first half and the next for its second half,<br>
>`loop_lr` is the number of gradient descent steps in each loop (two loops in total).

Generated words with Uniqueness Score of **~93%** and other stats as follows:

| First Run| Second Run| Third Run |
|---|---|---|
| garri, yadi, tayn, aleja, ayler, allo, jubel, deppe, dalen, kaizy, kardo, eeng, cosley, reei, abhet, elson+, zolin, braxlyn, kaedence, anell | abilyani, mihe, raydel+, yosmeeya, aadtan, silma, jashaanela, pynzamiroy, idam, braquen, sohaia, maysika, jaseus, karley+, megunnday, hawilker, avysi, oliyah, dishanvika, erianna | victon, wylah, dern, nolana, naka, finos, micka, orson+, falayae, kyurdai, kiphustoe, austocee, zarias, jedy, annistophir, nasiyora, thahdon, melie, houd, shodessa |
| 1 no. (+ marked) generated words are exact copies from the training set; uniqueness score 19/20, 95%. | 2 no. (+ marked) generated words are exact copies from the training set; uniqueness score 18/20, 90%. | 1 no. (+ marked) generated words are exact copies from the training set; uniqueness score 19/20, 95%. |
| Losses @ Train & Dev Sets: 1.9790 & 2.0685 | Losses @ Train & Dev Sets: 1.9771 & 2.0738 | Losses @ Train & Dev Sets: 1.9756 & 2.0681 |
<hr/>


In [12]:
display(Markdown("## Raw Markdown code:")); print(output)

## Raw Markdown code:

## BORA'S FAST NN SETUP
The setup below, with `17735` parameters, reaches losses of ~`1.9773` on the training set and ~`2.0701` on the dev/validation set:

`block_size = 4 | d = 4 | hl = 400 | mb_size = 64 | lr=[0.2, 0.1, 0.05, 0.01] | loop_lr = 200000`
>`block_size` is the context length (how many characters are used to predict the next one),<br>
>`d` is the embedding size,<br>
>`hl` is the number of neurons in the hidden layer,<br>
>`mb_size` is the minibatch size,<br>
>`lr` lists the four learning rates, applied in order: each of the two loops uses one rate for its first half and the next for its second half,<br>
>`loop_lr` is the number of gradient descent steps in each loop (two loops in total).

Generated words with Uniqueness Score of **~93%** and other stats as follows:

| First Run| Second Run| Third Run |
|---|---|---|
| garri, yadi, tayn, aleja, ayler, allo, jubel, deppe, dalen, kaizy, kardo, eeng, cosley, reei, abhet, elson+, zolin, braxlyn, kaedence, anell | abilyani, mihe, raydel+, yosmeeya, aadtan, silma, jashaanela, pynzamiroy, idam, braquen, soh