<a href="https://colab.research.google.com/github/ferjorosa/learn-pytorch/blob/main/Examples/cbow_human_numbers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Objective of this notebook:

* To implement a simple CBOW (continuous bag-of-words) model and compare its results on the "human numbers" dataset with those of the LSTM and GRU models from chapter 12 of the fastai book.

* To better understand the output of `nn.Embedding` when multiple words are provided.

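With the data used here, a context of 3 words, a batch size of 64, and an embedding dimension of 64, the tensors in the model's forward pass have the following shapes. `inputs` holds the token indices, `x` the embedding of each context word, `y` the mean of those embeddings over the context dimension, and `out` one score per vocabulary word (the vocabulary has 30 tokens):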

```python
> inputs.shape
torch.Size([64, 3])
> x.shape
torch.Size([64, 3, 64])
> y.shape
torch.Size([64, 64])
> out.shape
torch.Size([64, 30])
```
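
These shapes can be reproduced outside the notebook with plain PyTorch. The sketch below is a minimal illustration, not the notebook's training code: it assumes a vocabulary of 30 tokens (taken from the last dimension of `out.shape` above) and uses random indices in place of real data.

```python
import torch
import torch.nn as nn

# Assumed sizes, chosen to match the shapes listed above
vocab_size, emb_dim, context, batch = 30, 64, 3, 64

embedding = nn.Embedding(vocab_size, emb_dim)
linear = nn.Linear(emb_dim, vocab_size)

# Random token indices stand in for a real batch of contexts
inputs = torch.randint(0, vocab_size, (batch, context))

x = embedding(inputs)   # one embedding vector per context word
y = x.mean(dim=1)       # average over the context dimension
out = linear(y)         # one score per vocabulary word

print(inputs.shape, x.shape, y.shape, out.shape)
```

`nn.Embedding` appends the embedding dimension to whatever shape the index tensor has, which is why a `[64, 3]` batch of indices becomes a `[64, 3, 64]` tensor.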

In [1]:
#hide (Google Colab)
# !pip install fastai --upgrade -q
import fastai
print(fastai.__version__)

# !pip install -Uqq fastbook
import fastbook
fastbook.setup_book()


2.6.0


In [2]:
# hide (debugging)
# !pip install -Uqq ipdb
# import ipdb
# %pdb on


In [None]:
import torch.nn as nn
from fastbook import *
from fastai.text.all import *

In [None]:
path = untar_data(URLs.HUMAN_NUMBERS)

Path.BASE_PATH = path

In [None]:
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines

In [None]:
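# Join all lines into a single stream, with ' . ' as the separator token
text = ' . '.join([l.strip() for l in lines])
tokens = text.split(' ')
vocab = L(*tokens).unique()                   # unique tokens
word2idx = {w:i for i,w in enumerate(vocab)}  # token -> integer index
nums = L(word2idx[i] for i in tokens)         # the whole corpus as token indices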

In [None]:
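# Each sample pairs 3 consecutive tokens (the context) with the token that follows them (the target).
# The step of 3 makes the context windows non-overlapping.
#seqs_raw = L((tokens[i:i+3], tokens[i+3]) for i in range(0,len(tokens)-4,3)) # raw (string) form

seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3)) # index-encoded form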
seqs

**Note:** `seqs_raw` cannot be used directly because the model expects tensors, and **tensors can only hold numeric data**. This is why each token is first mapped to its index in `vocab`.
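
A small sketch of this point, using a hypothetical three-token example rather than the notebook's data: building a tensor from strings fails, while the same tokens mapped through an index lookup convert cleanly. The names `toy_tokens` and `toy_word2idx` are illustrative only.

```python
import torch

toy_tokens = ["one", ".", "two"]  # hypothetical example tokens

# Strings cannot be stored in a tensor, so PyTorch raises an error here
try:
    torch.tensor(toy_tokens)
    strings_ok = True
except (ValueError, TypeError, RuntimeError):
    strings_ok = False

# Map each token to an integer index, then build the tensor from the indices
toy_word2idx = {w: i for i, w in enumerate(dict.fromkeys(toy_tokens))}
encoded = torch.tensor([toy_word2idx[w] for w in toy_tokens])

print(strings_ok, encoded)
```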

In [None]:
bs = 64
cut = int(len(seqs) * 0.8)  # first 80% for training, last 20% for validation
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False) # train, validation

In [None]:
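class CBOW(Module):
  "Averages the embeddings of the context words and maps the result to vocabulary scores"

  def __init__(self, vsz, nh):
    self.i_h = nn.Embedding(vsz, nh)  # token index -> embedding vector
    self.h_o = nn.Linear(nh, vsz)     # averaged embedding -> one score per vocabulary word
  
  def forward(self, inputs):
    x = self.i_h(inputs)       # (bs, context) -> (bs, context, nh)
    y = torch.mean(x, dim=1)   # average over the context words -> (bs, nh)
    out = self.h_o(y)          # (bs, nh) -> (bs, vsz)
    #ipdb.set_trace()
    return out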
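
Because the context embeddings are averaged, the model's output does not depend on the order of the context words. This is the "bag of words" part of CBOW and a key difference from recurrent models such as LSTMs and GRUs. The sketch below checks this property with a plain `nn.Module` version of the same architecture (named `PlainCBOW` here, a hypothetical stand-in so the example runs without fastai) and assumed sizes.

```python
import torch
import torch.nn as nn

class PlainCBOW(nn.Module):
    """Plain-PyTorch stand-in for the CBOW class above (hypothetical name)."""
    def __init__(self, vsz, nh):
        super().__init__()
        self.i_h = nn.Embedding(vsz, nh)
        self.h_o = nn.Linear(nh, vsz)

    def forward(self, inputs):
        # Embed each context word, average over the context, project to vocabulary scores
        return self.h_o(self.i_h(inputs).mean(dim=1))

torch.manual_seed(0)
model = PlainCBOW(vsz=30, nh=64)  # assumed sizes

context = torch.tensor([[4, 7, 1]])   # arbitrary token indices
shuffled = context[:, [2, 0, 1]]      # same words, different order

with torch.no_grad():
    # A small tolerance absorbs floating-point differences from the summation order
    same = torch.allclose(model(context), model(shuffled), atol=1e-6)

print(same)
```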

In [None]:
learn = Learner(dls, CBOW(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)