#### Natural Language Processing
We will analyze the text of the book *War and Peace* and try to generate new text in the same style. This part starts with a character-level bigram model built on a list of names.

In [1]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import copy
np.random.seed(42)
import plotly.express as px
import plotly.graph_objects as go

In [2]:
with open("D://Datasets/names.txt", 'r') as file:
    names = file.read().splitlines()

In [3]:
names[0:5]

['emma', 'olivia', 'ava', 'isabella', 'sophia']

In [4]:
vocabulary = sorted(list(set(''.join(names))))
chartoidx = {}
idxtochar = {}
chartoidx['.'] = 0   # Special token that marks both the start and the end of a word.
idxtochar[0] = '.'
for i,char in enumerate(vocabulary):
    chartoidx[char] = i+1
    idxtochar[i+1] = char

chartoidx

{'.': 0,
 'a': 1,
 'b': 2,
 'c': 3,
 'd': 4,
 'e': 5,
 'f': 6,
 'g': 7,
 'h': 8,
 'i': 9,
 'j': 10,
 'k': 11,
 'l': 12,
 'm': 13,
 'n': 14,
 'o': 15,
 'p': 16,
 'q': 17,
 'r': 18,
 's': 19,
 't': 20,
 'u': 21,
 'v': 22,
 'w': 23,
 'x': 24,
 'y': 25,
 'z': 26}

##### A bigram is a pair of characters occurring next to each other.

In [5]:
print(list(zip(names[0], names[0][1:])))

[('e', 'm'), ('m', 'm'), ('m', 'a')]
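Padding a word with the `.` token before zipping also yields the start and end bigrams, which the counting step relies on. A minimal sketch of that idea; the helper name `bigrams_of` is hypothetical and not part of the notebook:

```python
def bigrams_of(word, token="."):
    """Return adjacent character pairs of `word`, padded with a boundary token."""
    chars = [token] + list(word) + [token]
    return list(zip(chars, chars[1:]))

pairs = bigrams_of("emma")
print(pairs)
```

The first pair tells the model which characters tend to start a word, and the last pair tells it which characters tend to end one.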


A simple way to build a bigram model is to count how many times each bigram occurs in the dataset and then sample according to the resulting probability distribution.

In [6]:
bigram_dict = {}
for word in names:
    word =  ['.'] + list(word) + ['.']
    for char1, char2 in zip(word, word[1:]):
        bigram = (char1,char2)
        bigram_dict[bigram] = bigram_dict.get(bigram, 0)+1

# Let's see the most frequent bigrams by sorting the bigram dict by count, in descending order
sorted(bigram_dict.items(), key= lambda kv: -kv[1])
# print(bigram_dict)

[(('n', '.'), 6763),
 (('a', '.'), 6640),
 (('a', 'n'), 5438),
 (('.', 'a'), 4410),
 (('e', '.'), 3983),
 (('a', 'r'), 3264),
 (('e', 'l'), 3248),
 (('r', 'i'), 3033),
 (('n', 'a'), 2977),
 (('.', 'k'), 2963),
 (('l', 'e'), 2921),
 (('e', 'n'), 2675),
 (('l', 'a'), 2623),
 (('m', 'a'), 2590),
 (('.', 'm'), 2538),
 (('a', 'l'), 2528),
 (('i', '.'), 2489),
 (('l', 'i'), 2480),
 (('i', 'a'), 2445),
 (('.', 'j'), 2422),
 (('o', 'n'), 2411),
 (('h', '.'), 2409),
 (('r', 'a'), 2356),
 (('a', 'h'), 2332),
 (('h', 'a'), 2244),
 (('y', 'a'), 2143),
 (('i', 'n'), 2126),
 (('.', 's'), 2055),
 (('a', 'y'), 2050),
 (('y', '.'), 2007),
 (('e', 'r'), 1958),
 (('n', 'n'), 1906),
 (('y', 'n'), 1826),
 (('k', 'a'), 1731),
 (('n', 'i'), 1725),
 (('r', 'e'), 1697),
 (('.', 'd'), 1690),
 (('i', 'e'), 1653),
 (('a', 'i'), 1650),
 (('.', 'r'), 1639),
 (('a', 'm'), 1634),
 (('l', 'y'), 1588),
 (('.', 'l'), 1572),
 (('.', 'c'), 1542),
 (('.', 'e'), 1531),
 (('j', 'a'), 1473),
 (('r', '.'), 1377),
 (('n', 'e'),

Now we create a two-dimensional tensor that stores how many times the second character of a bigram follows the first. Rows index the first character and columns index the second.

In [7]:
bigram_tensor = torch.zeros((27,27), dtype = torch.int32)
print(bigram_tensor.shape)

torch.Size([27, 27])


Now we fill the 2D tensor with the number of occurrences of each character pair, using the `chartoidx` dict to turn characters into indices.

In [8]:
for key, value in bigram_dict.items():
    char1, char2 = key
    ch1_idx, ch2_idx = chartoidx[char1], chartoidx[char2]
    bigram_tensor[ch1_idx][ch2_idx] = value
bigram_tensor

tensor([[   0, 4410, 1306, 1542, 1690, 1531,  417,  669,  874,  591, 2422, 2963,
         1572, 2538, 1146,  394,  515,   92, 1639, 2055, 1308,   78,  376,  307,
          134,  535,  929],
        [6640,  556,  541,  470, 1042,  692,  134,  168, 2332, 1650,  175,  568,
         2528, 1634, 5438,   63,   82,   60, 3264, 1118,  687,  381,  834,  161,
          182, 2050,  435],
        [ 114,  321,   38,    1,   65,  655,    0,    0,   41,  217,    1,    0,
          103,    0,    4,  105,    0,    0,  842,    8,    2,   45,    0,    0,
            0,   83,    0],
        [  97,  815,    0,   42,    1,  551,    0,    2,  664,  271,    3,  316,
          116,    0,    0,  380,    1,   11,   76,    5,   35,   35,    0,    0,
            3,  104,    4],
        [ 516, 1303,    1,    3,  149, 1283,    5,   25,  118,  674,    9,    3,
           60,   30,   31,  378,    0,    1,  424,   29,    4,   92,   17,   23,
            0,  317,    1],
        [3983,  679,  121,  153,  384, 1271,   82,

In [9]:
# Now we write some plotly code to visualize the bigram count tensor
x = px.imshow(bigram_tensor, x = list(chartoidx.keys()),
                     y = list(chartoidx.keys()),
                      color_continuous_scale='blues',
                      width= 800,
                      height = 700,
                      
                      )
x.update_traces(texttemplate = "%{y}%{x}<br>%{z}")
x.update_layout(margin=dict(l=0, r=0, t=50, b=50))
x.update_coloraxes(showscale = False)
x.show()

In [10]:
# The raw counts of the bigrams are given below
bigram_tensor[0]

tensor([   0, 4410, 1306, 1542, 1690, 1531,  417,  669,  874,  591, 2422, 2963,
        1572, 2538, 1146,  394,  515,   92, 1639, 2055, 1308,   78,  376,  307,
         134,  535,  929], dtype=torch.int32)

What we will do now is convert these raw counts to probabilities, i.e. values that sum to 1. Then we will sample from these probabilities to generate text that follows this distribution.

In [11]:
# To create a probability distribution here we will have to divide the above values by their sum.
# This gives us probability of every single character to be the first character.
prob_tensor = bigram_tensor[0].float()/ bigram_tensor[0].sum()
prob_tensor

tensor([0.0000, 0.1377, 0.0408, 0.0481, 0.0528, 0.0478, 0.0130, 0.0209, 0.0273,
        0.0184, 0.0756, 0.0925, 0.0491, 0.0792, 0.0358, 0.0123, 0.0161, 0.0029,
        0.0512, 0.0642, 0.0408, 0.0024, 0.0117, 0.0096, 0.0042, 0.0167, 0.0290])

In [12]:
# To sample from this distribution we use torch.multinomial, which draws samples from a given probability distribution.
# To make everything deterministic we use a PyTorch generator.
generator = torch.Generator().manual_seed(2147483647+1)
# Sample once from prob_tensor, using the generator for determinism.
# The sampled number is an index into the distribution.
ix = torch.multinomial(prob_tensor, generator=generator, num_samples=1, replacement=True).item()
# Map the sampled index back to its character with idxtochar.
char = idxtochar[ix]
# Similarly, we can take the row of this character and sample the character that follows it.
char

'e'
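As a sanity check on what `torch.multinomial` does, the sketch below draws many samples from a small made-up distribution and compares the empirical frequencies with the input probabilities. The three probabilities and the seed are illustrative values, not taken from the names dataset.

```python
import torch

# Toy distribution over three outcomes (illustrative values).
p = torch.tensor([0.6, 0.3, 0.1])
g = torch.Generator().manual_seed(0)

# Draw many samples with replacement; each sample is an index into p.
samples = torch.multinomial(p, num_samples=20000, replacement=True, generator=g)

# Empirical frequency of each index; should be close to p.
freqs = torch.bincount(samples, minlength=3).float() / samples.numel()
print(freqs)
```

With enough samples the frequencies approach `p`, which is why sampling from a row of bigram probabilities reproduces the statistics of the dataset.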

There is one inefficiency here: every time we sample, we would index into the counts matrix and convert that row to a probability distribution again. Instead, we convert the whole counts matrix to probabilities once and then sample directly from it.

In [13]:
# bigram_tensor.sum() is not what we want because it sums all the counts in the matrix.
# Instead we need each row to sum to 1, so we take the row sums with sum(dim=1).

In [14]:
sum_array = bigram_tensor.sum(dim=1, keepdims = True)
print(sum_array.shape)

torch.Size([27, 1])
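Keeping the summed dimension matters for broadcasting. The sketch below uses a small made-up count matrix to show what happens with and without it; it uses `keepdim`, the canonical PyTorch spelling of the `keepdims` argument used above.

```python
import torch

# Toy 2x2 count matrix (illustrative values); row sums are 4 and 8.
counts = torch.tensor([[1., 3.],
                       [4., 4.]])

row_sums_keep = counts.sum(dim=1, keepdim=True)   # shape (2, 1)
row_sums_flat = counts.sum(dim=1)                 # shape (2,)

# (2, 2) / (2, 1): each row is divided by its own sum.
p_rows = counts / row_sums_keep
# (2, 2) / (2,): the sums are broadcast as a row, so column j is divided by row j's sum.
p_wrong = counts / row_sums_flat

print(p_rows.sum(dim=1))
print(p_wrong.sum(dim=1))
```

Both divisions run without error, so the second one is a silent bug: its rows no longer sum to 1.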


In [15]:
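# Now we divide the whole matrix by sum_array. The +1 is add-one smoothing so that no bigram has zero probability.
# Note: sum_array was computed before adding 1, so each row sums to slightly more than 1 (see the row sum printed below).
# torch.multinomial does not need normalized weights, so sampling still works; for exact probabilities, add 1 before computing sum_array.
p_matrix = (bigram_tensor.float()+1)/sum_array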
print(p_matrix.shape)
print(p_matrix[0].sum())

torch.Size([27, 27])
tensor(1.0008)
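The row sum of 1.0008 above comes from adding 1 to the counts after the row sums were computed. A minimal sketch of a normalization that includes the smoothing in the denominator, on a made-up 3x3 count matrix (the values are illustrative):

```python
import torch

# Toy count matrix with some zero counts (illustrative values).
counts = torch.tensor([[0, 3, 1],
                       [2, 0, 2],
                       [5, 0, 0]], dtype=torch.int32)

smoothed = counts.float() + 1                      # add-one smoothing
p = smoothed / smoothed.sum(dim=1, keepdim=True)   # normalize by the smoothed row sums

print(p.sum(dim=1))
```

Every row now sums to 1 and every entry is strictly positive.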


In [16]:
# Now we write some plotly code to visualize the probability matrix
x = px.imshow(p_matrix, x = list(chartoidx.keys()),
                     y = list(chartoidx.keys()),
                      color_continuous_scale='blues',
                      width= 800,
                      height = 700,
                      
                      )
x.update_traces(texttemplate = "%{y}%{x}<br>%{z}")
x.update_layout(margin=dict(l=0, r=0, t=50, b=50))
x.update_coloraxes(showscale = False)
x.show()

In [17]:
# We can write a loop that generates a name one character at a time, starting from the start token and sampling from the probability matrix.
generator = torch.Generator().manual_seed(2147483647)

for i in range(10):
    ix = 0 # start 
    out = []
    while True:
        p = p_matrix[ix]
        # p = bigram_tensor[ix].float()
        # p = p/p.sum()
        ix = torch.multinomial(p, generator=generator, num_samples=1, replacement=True).item()
        out.append(idxtochar[ix])
        if ix == 0:
            # We have an end token. We would exit the loop
            break
    print(''.join(out))

mor.
axx.
minaymoryles.
kondlaisah.
anchshizarie.
odaren.
iaddash.
h.
jhinatien.
egushl.


1. We trained a bigram language model by literally counting how often each pair occurs and normalizing the counts into probabilities. To generate text, we start from the start token and repeatedly sample the next character from the row of the current character.
2. The elements of the array P (`p_matrix`) are the parameters of our bigram language model.
3. Now we would like to summarize the quality of this model in a single number: the training loss.
4. To do that, we look at the probabilities that the trained model assigns to the bigrams in the training set.
5. If every character were equally likely, each bigram would have a probability of 1/27, roughly 4%. So if the model assigns a probability above 4% to a bigram that occurs in the data, it has learned something from the counts.
6. A very good model would assign probabilities near 1 to every bigram in the data, meaning it predicts the next character correctly given the current one.
7. How can we combine these probabilities into a single number that measures the quality of the model?
8. The literature on maximum likelihood estimation uses the likelihood, which is the product of all the individual probabilities. It is the probability that the model assigns to the entire dataset.
9. Because every factor is at most 1, this product is a very small number, so we work with the `log-likelihood` instead.
10. The log-likelihood is convenient because the log of a product is the sum of the logs of the individual probabilities:\
log(a*b*c) = log(a) + log(b) + log(c)
11. When all the probabilities are 1, the log-likelihood is zero. As the probabilities drop below 1, their logs become negative, so the log-likelihood becomes more and more negative. A loss should be low when the model is good, so we multiply the log-likelihood by -1 to get the negative log-likelihood (nLL), which is zero at best and grows as the model gets worse.
12. Some people also prefer to report the nLL as an average rather than a total, so they divide it by the number of bigrams.
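The steps above can be sketched on a handful of probabilities using only the standard library. The three values are illustrative, and the uniform baseline assumes the 27-symbol vocabulary built earlier.

```python
import math

# Illustrative probabilities that a model assigns to three bigrams.
probs = [0.1377, 0.1605, 0.1625]

likelihood = math.prod(probs)                      # product of probabilities
log_likelihood = sum(math.log(p) for p in probs)   # sum of log probabilities
nll = -log_likelihood                              # negative log-likelihood
avg_nll = nll / len(probs)                         # average nLL per bigram

# Average nLL of a model that assigns 1/27 to every bigram.
uniform_nll = math.log(27)

print(likelihood, log_likelihood, avg_nll, uniform_nll)
```

An average nLL below `uniform_nll` (about 3.30) means the model does better than guessing uniformly over the 27 symbols.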

In [18]:
log_likelihood = 0.0
n = 0
for word in names:
    word =  ['.'] + list(word) + ['.']
    for char1, char2 in zip(word, word[1:]):
        prob = p_matrix[chartoidx[char1]][chartoidx[char2]]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n+=1
        # print(f"{char1}{char2}: {prob.item():.4f} : {log_prob: .4f}")

print(f"log_likelihood = {log_likelihood}")
print(f"nLL = {-log_likelihood}")
print(f"normalized_nLL = {-log_likelihood/n}")

log_likelihood = -559322.6875
nLL = 559322.6875
normalized_nLL = 2.4515998363494873


We can also get the probability of any word we want by replacing the names in the dataset with our own word.

In [19]:
log_likelihood = 0.0
n = 0
for word in ['anant']:
    word =  ['.'] + list(word) + ['.']
    for char1, char2 in zip(word, word[1:]):
        prob = p_matrix[chartoidx[char1]][chartoidx[char2]]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n+=1
        print(f"{char1}{char2}: {prob.item():.4f} : {log_prob: .4f}")

print(f"log_likelihood = {log_likelihood}")
print(f"nLL = {-log_likelihood}")
print(f"normalized_nLL = {-log_likelihood/n}")

.a: 0.1377 : -1.9827
an: 0.1605 : -1.8294
na: 0.1625 : -1.8171
an: 0.1605 : -1.8294
nt: 0.0242 : -3.7203
t.: 0.0869 : -2.4431
log_likelihood = -13.621914863586426
nLL = 13.621914863586426
normalized_nLL = 2.2703192234039307


A normalized nLL of 2.27 is not a very good score for our name: it is only slightly lower than the dataset average of 2.45.

If we give the model an arbitrary string that contains a bigram with zero probability, the loss becomes infinite, which is not desirable. To combat this, we add a small number to all the counts so that unseen bigrams become very unlikely but not impossible. `This is called model smoothing.`
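A minimal sketch of why a zero count leads to an infinite loss and how add-one smoothing avoids it. The single row of counts is made up for illustration.

```python
import torch

# Toy row of counts; the middle bigram was never seen (illustrative values).
counts = torch.tensor([4., 0., 6.])

p_raw = counts / counts.sum()                   # unsmoothed probabilities
p_smooth = (counts + 1) / (counts + 1).sum()    # add-one smoothing

loss_raw = -torch.log(p_raw[1])        # -log(0) is infinite
loss_smooth = -torch.log(p_smooth[1])  # finite, but still large

print(loss_raw, loss_smooth)
```

Adding a larger constant would flatten the distribution further, so the constant trades off fidelity to the counts against robustness to unseen bigrams.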

In [20]:
# With smoothing (the +1 in p_matrix), the unseen bigram 'tq' gets a small but non-zero probability, so the loss stays finite.
# Without smoothing its probability would be zero and the loss would be infinite.
log_likelihood = 0.0
n = 0
for word in ['anantq']:
    word =  ['.'] + list(word) + ['.']
    for char1, char2 in zip(word, word[1:]):
        prob = p_matrix[chartoidx[char1]][chartoidx[char2]]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n+=1
        print(f"{char1}{char2}: {prob.item():.4f} : {log_prob: .4f}")

print(f"log_likelihood = {log_likelihood}")
print(f"nLL = {-log_likelihood}")
print(f"normalized_nLL = {-log_likelihood/n}")

.a: 0.1377 : -1.9827
an: 0.1605 : -1.8294
na: 0.1625 : -1.8171
an: 0.1605 : -1.8294
nt: 0.0242 : -3.7203
tq: 0.0002 : -8.6252
q.: 0.1066 : -2.2385
log_likelihood = -22.04250717163086
nLL = 22.04250717163086
normalized_nLL = 3.1489295959472656


The above model assigned a very low but non-zero probability to `tq`, so the loss was high but not infinite.

#### Now we would like to cast the bigram model as a neural network
The neural network will predict the next character of a bigram given the first character. We will also compute a loss and use it to optimize the network, adjusting its weights so that the loss goes down.

In [21]:
# The first thing we do is build the training dataset of bigrams (only the first name for now)
xs = []
ys = []

for word in names[:1]:
    word =  ['.'] + list(word) + ['.']
    for char1, char2 in zip(word, word[1:]):
        ix1 = chartoidx[char1]
        ix2 = chartoidx[char2]

        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)


In [22]:
xs, ys

(tensor([ 0,  5, 13, 13,  1]), tensor([ 5, 13, 13,  1,  0]))

In [23]:
# We use one-hot encoding to represent each character index in the vocabulary. Each encoded vector is 27 elements long.
import torch.nn.functional as F
xenc = F.one_hot(xs, num_classes = 27).float()
yenc = F.one_hot(ys, num_classes = 27).float()

In [24]:
xenc

tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [25]:
xenc.shape

torch.Size([5, 27])

In [26]:
px.imshow(xenc, color_continuous_scale="viridis")

Now we will code our first layer of neurons: 27 neurons, each with 27 inputs.

In [27]:
# Let's define the weights of our neurons
w = torch.randn((27,27), requires_grad=True, generator=generator)
xenc@w   #(5,27) @ (27,27) = (5,27)

tensor([[-0.0123,  1.4722, -2.1259,  0.9604,  1.2482,  0.2534,  2.8188, -0.3398,
          1.7807,  1.4590, -0.1902, -0.6965,  1.7039,  0.7420,  0.9737,  0.3003,
         -2.2396, -0.7125, -0.8790,  0.1066,  1.8598,  0.0558,  1.2815, -0.6318,
         -0.7340,  2.0002, -0.3946],
        [ 0.5259, -0.6117,  0.5482, -0.2568, -1.5437,  0.3795, -1.7705, -1.2085,
          0.9477,  0.1029, -0.6808,  0.7951,  0.5766, -0.7378, -1.5264,  0.7117,
          1.4056,  1.3924,  0.4346,  0.4979,  0.1130, -0.4185,  0.1791,  0.2348,
          0.7351, -0.3884, -0.8240],
        [-0.4116, -1.6739, -0.9180,  1.5021, -0.6285, -0.4425,  0.5689,  1.2803,
         -0.5540, -0.1041,  1.4335, -0.5862, -0.2828,  0.5339, -0.9939, -1.6996,
          1.8362, -0.3288,  0.7960, -0.3506,  0.7560, -0.9363, -0.0841, -1.6361,
          1.0224,  0.0985,  1.1773],
        [-0.4116, -1.6739, -0.9180,  1.5021, -0.6285, -0.4425,  0.5689,  1.2803,
         -0.5540, -0.1041,  1.4335, -0.5862, -0.2828,  0.5339, -0.9939, -1.6996
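Because each input row is one-hot, the matrix multiplication simply selects rows of the weight matrix, which is why rows with the same input character are identical in the output above. A small sketch of that equivalence; `w_demo` and the seed are made up for the demo and are not the notebook's weights.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
w_demo = torch.randn((27, 27), generator=g)   # hypothetical weights for the demo
xs_demo = torch.tensor([0, 5, 13, 13, 1])     # indices of '.', 'e', 'm', 'm', 'a'

xenc_demo = F.one_hot(xs_demo, num_classes=27).float()

rows_via_matmul = xenc_demo @ w_demo   # (5, 27) @ (27, 27) -> (5, 27)
rows_via_index = w_demo[xs_demo]       # direct row lookup, also (5, 27)

print(torch.allclose(rows_via_matmul, rows_via_index))
```

Seen this way, row `i` of the weight matrix holds the logits for "which character follows character `i`", which mirrors a row of the count matrix.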

In [28]:
# Now we want to turn the outputs of the matrix multiplication into probabilities for the next character.
# Probabilities have a special structure: they are non-negative and sum to 1.
# The raw outputs can be negative and are not integers, so they cannot be read directly as counts or probabilities.
# So we interpret the 27 outputs of the neural net as log-counts, also called LOGITS.
# To get counts, we exponentiate the log-counts.
# Exponentiation maps negative numbers to values between 0 and 1 and positive numbers to values greater than 1.
# The probabilities are then just the normalized counts.
logits = (xenc@w)   # Log-counts
counts = logits.exp()  # counts
probs = counts/counts.sum(dim=1, keepdims = True)  # So the probabilities are just the counts normalized
probs      # You get a 5,27 tensor of the probabilites of each character.

tensor([[0.0130, 0.0575, 0.0016, 0.0345, 0.0460, 0.0170, 0.2212, 0.0094, 0.0783,
         0.0568, 0.0109, 0.0066, 0.0725, 0.0277, 0.0350, 0.0178, 0.0014, 0.0065,
         0.0055, 0.0147, 0.0848, 0.0140, 0.0475, 0.0070, 0.0063, 0.0976, 0.0089],
        [0.0463, 0.0148, 0.0474, 0.0212, 0.0058, 0.0400, 0.0047, 0.0082, 0.0706,
         0.0303, 0.0139, 0.0606, 0.0487, 0.0131, 0.0059, 0.0558, 0.1116, 0.1101,
         0.0423, 0.0450, 0.0306, 0.0180, 0.0327, 0.0346, 0.0571, 0.0186, 0.0120],
        [0.0157, 0.0044, 0.0095, 0.1064, 0.0126, 0.0152, 0.0419, 0.0853, 0.0136,
         0.0214, 0.0994, 0.0132, 0.0179, 0.0404, 0.0088, 0.0043, 0.1486, 0.0171,
         0.0525, 0.0167, 0.0505, 0.0093, 0.0218, 0.0046, 0.0659, 0.0262, 0.0769],
        [0.0157, 0.0044, 0.0095, 0.1064, 0.0126, 0.0152, 0.0419, 0.0853, 0.0136,
         0.0214, 0.0994, 0.0132, 0.0179, 0.0404, 0.0088, 0.0043, 0.1486, 0.0171,
         0.0525, 0.0167, 0.0505, 0.0093, 0.0218, 0.0046, 0.0659, 0.0262, 0.0769],
        [0.0216, 0.0378,

1. All of the operations above are differentiable, so we can backpropagate through the network.
2. Now the question is how to find values of `w` for which the output probabilities are good predictions of the next character. We measure how good they are with the loss function.
3. The last two operations, exponentiating and normalizing the logits, are together called the softmax. The softmax function turns logits into probabilities.
4. Right now the probabilities are very bad at predicting the next character. We could resample `w` with a different generator seed and hope for better probabilities, but a more systematic way is to use the gradient of the loss to improve `w`.
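The softmax described above can be checked against PyTorch's built-in `F.softmax`. A minimal sketch on made-up logits:

```python
import torch
import torch.nn.functional as F

# Illustrative logits: two rows over three classes.
logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.0,  0.0, 0.0]])

counts = logits.exp()                                      # exponentiate
probs_manual = counts / counts.sum(dim=1, keepdim=True)    # normalize each row
probs_builtin = F.softmax(logits, dim=1)                   # same thing in one call

print(probs_manual)
print(probs_builtin)
```

Equal logits, as in the second row, give a uniform distribution.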

In [29]:
# Each of the 5 rows has 27 entries and sums to 1, so the total over all rows is 5
print(probs.shape, probs.sum())

torch.Size([5, 27]) tensor(5., grad_fn=<SumBackward0>)


In [30]:



# Forward Pass
logits = (xenc@w)   # Log-counts
counts = logits.exp()  # counts
probs = counts/counts.sum(dim=1, keepdims = True)  # So the probabilities are just the counts normalized
probs      # You get a 5,27 tensor of the probabilites of each character.


# We are interested in the probabilities that the model assigns to the labels.
# So from the model's output probabilities we pick out the entries at the label indices ys.
# The loss below averages those probabilities, takes the log, and negates it so that the loss is positive.
# Note: this is the log of the mean probability. The average nLL defined earlier is the mean of the logs,
# i.e. -probs[torch.arange(5), ys].log().mean(); the two are generally different numbers.
loss = -probs[torch.arange(5), ys].mean().log()
print(loss)


# Now we will do the backward pass
# First reset the gradients
w.grad = None

# Doing the backward pass
loss.backward()


tensor(3.9469, grad_fn=<NegBackward0>)
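The loss in the cell above takes the log of the mean probability. The sketch below, on made-up logits, compares that form with the mean of the logs (the average nLL from the counting section) and with `F.cross_entropy`, which computes the latter directly from logits. The names `logits_demo` and `ys_demo` and the seed are illustrative.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
logits_demo = torch.randn((5, 27), generator=g)   # made-up logits for 5 examples
ys_demo = torch.tensor([5, 13, 13, 1, 0])         # labels for '.emma' -> 'emma.'

probs_demo = F.softmax(logits_demo, dim=1)
picked = probs_demo[torch.arange(5), ys_demo]     # probability assigned to each label

log_of_mean = -picked.mean().log()   # log of the average probability
mean_of_log = -picked.log().mean()   # average negative log-likelihood
ce = F.cross_entropy(logits_demo, ys_demo)

print(log_of_mean.item(), mean_of_log.item(), ce.item())
```

The mean of the logs matches `F.cross_entropy`, while the log of the mean is never larger than it, so the two losses are not interchangeable when comparing against the 2.45 obtained by counting.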


In [31]:
# printing the gradient shape
print(w.grad.shape)

# updating the weights
w.data += -0.1*w.grad
# If we recalculate the forward pass loss should be lower

torch.Size([27, 27])
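To check that one update actually lowers the loss, the forward pass can be repeated after the step. The sketch below does this on the five `emma` bigrams with made-up weights; it uses `F.cross_entropy` as the loss and `torch.no_grad()` for the update, which are assumptions that differ slightly from the cell above (which uses its own loss expression and `w.data`).

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
xs_demo = torch.tensor([0, 5, 13, 13, 1])
ys_demo = torch.tensor([5, 13, 13, 1, 0])
xenc_demo = F.one_hot(xs_demo, num_classes=27).float()
w_demo = torch.randn((27, 27), generator=g, requires_grad=True)  # hypothetical weights

def nll(w):
    # Average negative log-likelihood of the labels under softmax(xenc @ w).
    return F.cross_entropy(xenc_demo @ w, ys_demo)

loss_before = nll(w_demo)
loss_before.backward()

# One gradient descent step; no_grad keeps the update out of the autograd graph.
with torch.no_grad():
    w_demo -= 0.1 * w_demo.grad

loss_after = nll(w_demo)
print(loss_before.item(), loss_after.item())
```

Repeating the step in a loop, with the gradient reset each time, is gradient descent.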


In [32]:
# The first thing that we would do is to make the training dataset of bigrams
xs = []
ys = []

for word in names[:1]:
    word =  ['.'] + list(word) + ['.']
    for char1, char2 in zip(word, word[1:]):
        ix1 = chartoidx[char1]
        ix2 = chartoidx[char2]

        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)

import torch.nn.functional as F
xenc = F.one_hot(xs, num_classes = 27).float()
yenc = F.one_hot(ys, num_classes = 27).float()

w = torch.randn((27,27), requires_grad=True, generator=generator)

In [34]:

# Let's rearrange everything to run in a loop
for k in range(300):
    logits = (xenc@w)   # Log-counts
    counts = logits.exp()  # counts
    probs = counts/counts.sum(dim=1, keepdims = True)  # So the probabilities are just the counts normalized
    loss = -probs[torch.arange(5), ys].mean().log() + 0.01*(w**2).mean()   # Adding regularization to the loss.
    w.grad = None
    loss.backward()
    w.data += -0.1*w.grad

    if k%10 == 0:
        print(f"Epoch: {k} | Loss: {loss}")

Epoch: 0 | Loss: 1.169492483139038
Epoch: 10 | Loss: 1.1404589414596558
Epoch: 20 | Loss: 1.113649606704712
Epoch: 30 | Loss: 1.0892647504806519
Epoch: 40 | Loss: 1.0673104524612427
Epoch: 50 | Loss: 1.0476566553115845
Epoch: 60 | Loss: 1.030092716217041
Epoch: 70 | Loss: 1.014372706413269
Epoch: 80 | Loss: 1.000246286392212
Epoch: 90 | Loss: 0.9874756932258606
Epoch: 100 | Loss: 0.9758449196815491
Epoch: 110 | Loss: 0.9651618599891663
Epoch: 120 | Loss: 0.955258846282959
Epoch: 130 | Loss: 0.9459893703460693
Epoch: 140 | Loss: 0.9372262954711914
Epoch: 150 | Loss: 0.9288581609725952
Epoch: 160 | Loss: 0.9207869172096252
Epoch: 170 | Loss: 0.9129248261451721
Epoch: 180 | Loss: 0.9051932692527771
Epoch: 190 | Loss: 0.8975202441215515
Epoch: 200 | Loss: 0.8898394703865051
Epoch: 210 | Loss: 0.8820886611938477
Epoch: 220 | Loss: 0.8742100596427917
Epoch: 230 | Loss: 0.8661486506462097
Epoch: 240 | Loss: 0.8578534126281738
Epoch: 250 | Loss: 0.8492769598960876
Epoch: 260 | Loss: 0.84037762

Regularization restricts the growth of the weights. It is added as an extra term to the loss that pulls the weights toward zero. If the weights grow large in either the positive or the negative direction, the mean of the squared weights grows too, which increases the loss. Gradient descent then pushes the weights back toward zero to reduce this extra loss, with a stronger push on larger weights. This has a sort of smoothing effect on the weights.
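The pull toward zero can be read off the gradient of the regularization term alone. A minimal sketch with made-up weights; `lam` stands for the 0.01 coefficient used in the loop and `w_demo` is hypothetical.

```python
import torch

g = torch.Generator().manual_seed(0)
w_demo = torch.randn((27, 27), generator=g, requires_grad=True)  # hypothetical weights

lam = 0.01
reg = lam * (w_demo ** 2).mean()   # regularization term only
reg.backward()

# Analytic gradient of lam * mean(w^2): 2 * lam * w / (number of weights).
expected_grad = 2 * lam * w_demo.detach() / w_demo.numel()

print(torch.allclose(w_demo.grad, expected_grad, rtol=1e-4, atol=1e-9))
```

The gradient is proportional to each weight, so a descent step shrinks every weight toward zero by an amount that grows with its magnitude.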