<a href="https://colab.research.google.com/github/belisards/nlp_intro/blob/main/ConferenceNotebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to NLP (inspired by and adapted from [Harry Berg of AiCore](https://github.com/life-efficient/How-ChatGPT-Works))

## Essential Tools and Concepts within NLP

### What is a corpus?

A corpus is a body of text that represents your data. One classic example would be the [Gutenberg corpus](https://zenodo.org/record/2422561#.Y8NpV-zP06E), which contains the text of over 50,000 books.

### What is a token?

A token is an atomic unit of text. Often you can think of tokens as individual words, but a token may also be a common part of a word, such as a suffix, or even an individual character.

### What is a tokeniser?

A tokeniser is a function that takes in raw text and turns it into a sequence of tokens.
This process is called tokenisation.
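
As a minimal sketch of the idea, the two functions below tokenise at the word level and at the character level. The names `simple_tokenise` and `char_tokenise` are made up for this illustration, and real tokenisers (including the BERT tokeniser used later) are more sophisticated.

```python
import re


def simple_tokenise(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def char_tokenise(text):
    """Treat every character (including spaces) as its own token."""
    return list(text)


print(simple_tokenise("How does this sentence get tokenised?"))
# ['how', 'does', 'this', 'sentence', 'get', 'tokenised', '?']
print(char_tokenise("NLP!"))
# ['N', 'L', 'P', '!']
```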


### What is a vocab?

A vocab (short for vocabulary) is an assignment of an integer index to each token.
If you imagine a list of all the distinct tokens, the index of each token is its position in that list.

![](https://github.com/life-efficient/How-ChatGPT-Works/blob/main/5.%20Intro%20to%20AI%20for%20Text%20Data/0.%20Intro%20to%20NLP/images/Vocab.png?raw=1)
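
A vocab can be sketched as a plain dictionary from token to index. The helper `build_vocab` and the toy sentence are assumptions for illustration; here indices are assigned in order of first appearance, which is only one of several possible conventions.

```python
def build_vocab(tokens):
    """Assign each distinct token the next free integer index."""
    token_to_index = {}
    for token in tokens:
        if token not in token_to_index:
            token_to_index[token] = len(token_to_index)
    return token_to_index


corpus_tokens = ["the", "cat", "sat", "on", "the", "mat"]
token_to_index = build_vocab(corpus_tokens)

# The inverse mapping is just the list of tokens: position -> token.
index_to_token = list(token_to_index)

# Replace every token in the text by its index.
ids = [token_to_index[token] for token in corpus_tokens]

print(token_to_index)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)             # [0, 1, 2, 3, 0, 4]
```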

## How do we represent words?

![](https://github.com/life-efficient/How-ChatGPT-Works/blob/main/5.%20Intro%20to%20AI%20for%20Text%20Data/0.%20Intro%20to%20NLP/images/One-hot%20Vector.png?raw=1)

![](https://github.com/life-efficient/How-ChatGPT-Works/blob/main/5.%20Intro%20to%20AI%20for%20Text%20Data/0.%20Intro%20to%20NLP/images/One-hot%20Word%20Embeddings.png?raw=1)

> Overall, we want to avoid using 1-hot encodings to represent our words and try something else... word embeddings
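
One way to see the problem with 1-hot encodings is to build a few by hand. This is a sketch with a made-up three-word vocab, not necessarily the argument made in the figures above. Each vector is as long as the vocab, and any two different words have a dot product of zero, so the encoding says nothing about which words are related.

```python
def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1 at position `index`."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


vocab = ["cat", "kitten", "car"]  # toy vocab, chosen for illustration
vectors = {token: one_hot(i, len(vocab)) for i, token in enumerate(vocab)}

print(vectors["cat"])                          # [1, 0, 0]
print(dot(vectors["cat"], vectors["kitten"]))  # 0
print(dot(vectors["cat"], vectors["car"]))     # 0
```

With a realistic vocab of tens of thousands of tokens, each of these vectors would also have tens of thousands of entries, almost all of them zero.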

### Word embeddings

Word embeddings are vector representations of tokens that capture something about what each token means, so that tokens with similar meanings end up with similar vectors.

![](https://github.com/life-efficient/How-ChatGPT-Works/blob/main/5.%20Intro%20to%20AI%20for%20Text%20Data/0.%20Intro%20to%20NLP/images/Dense%20Word%20Embeddings.png?raw=1)

Where 1-hot encodings are "sparse", containing mostly zeros, word embeddings are "dense".
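
To illustrate what "dense" buys us, the sketch below compares made-up 3-dimensional embeddings using cosine similarity. The numbers are invented for this example and are not taken from any trained model. Real embeddings have hundreds of dimensions and are learnt from data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical dense embeddings: every entry carries some information.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

# In this toy example "cat" is much closer to "kitten" than to "car".
print(f"cat vs kitten: {cat_kitten:.2f}")
print(f"cat vs car:    {cat_car:.2f}")
```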

## Pre-trained word embeddings

Learning meaningful word representations can take a lot of time and compute.
Thankfully, we can take the embeddings learnt by others straight off the shelf.

One of the most influential machine learning models is BERT:

- BERT stands for Bidirectional Encoder Representations from Transformers
- It is trained to fill in missing (masked) words in text
- It contains the word embeddings within its first layer's parameters. These BERT embeddings are widely used as a good starting point for word embeddings.

Let's start by downloading the model.

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
from transformers import BertModel

model_name = 'bert-base-uncased'
model = BertModel.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


You can see the layers that the model contains by printing it. Below we print its `modules` method, whose printed representation includes the full model structure.

In [3]:
print(model.modules)

<bound method Module.modules of BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): 

You can see that the very first layer is the embedding layer. Its `word_embeddings` parameters form a matrix with one 768-dimensional row for each of the 30,522 tokens which BERT recognises.
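
As a quick back-of-the-envelope check, the sizes printed above tell us how many numbers each embedding table holds. The variable names below are chosen for this sketch; the values are read directly from the `Embedding(...)` lines in the printout.

```python
# Sizes copied from the printed model structure.
vocab_size, hidden_size = 30522, 768   # word_embeddings: Embedding(30522, 768)
max_positions = 512                    # position_embeddings: Embedding(512, 768)
n_token_types = 2                      # token_type_embeddings: Embedding(2, 768)

# Each table has one row per entry and `hidden_size` numbers per row.
word_params = vocab_size * hidden_size
position_params = max_positions * hidden_size
token_type_params = n_token_types * hidden_size

print(word_params)        # 23440896
print(position_params)    # 393216
print(token_type_params)  # 1536
```

So the word embedding table alone accounts for roughly 23 million parameters.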

Now, let's get those embeddings.

In [4]:
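n_embeddings = 30000  # keep only the first 30,000 rows of the embedding matrix

# detach() returns the weights without tracking gradients
embedding_matrix = model.embeddings.word_embeddings.weight.detach()

embedding_matrix = embedding_matrix[:n_embeddings]
# print(embedding_matrix)
print("Embedding shape:", embedding_matrix.shape)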

Embedding shape: torch.Size([30000, 768])


Now we have the embedding matrix, but we don't know which word each of those embeddings corresponds to. This is where we need the vocab, which maps from the index of a word (its row in the embedding matrix) to the word itself.

In HuggingFace, the vocab is accessible through the tokeniser. In the same way that we loaded in a pre-trained BERT model, we can load in the corresponding tokeniser.

In [5]:
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained(model_name)
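sentence = "How does this sentence get tokenised?"
# encode() returns integer token ids (vocab indices), with BERT's special
# start and end tokens added around the sentence
tokens = bert_tokenizer.encode(sentence)

print(tokens)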

[101, 2129, 2515, 2023, 6251, 2131, 19204, 5084, 1029, 102]
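
The output has ten ids for a sentence of six words and a question mark. Two of them, `101` and `102`, are the special `[CLS]` and `[SEP]` tokens that BERT adds at the start and end, which leaves eight ids for seven words and punctuation marks, so one word must have been split into two pieces. You can check which by calling `bert_tokenizer.convert_ids_to_tokens(tokens)`. BERT's tokeniser uses WordPiece subword tokens, where a `##` prefix marks a piece that continues a word. The sketch below shows the greedy longest-match idea behind it on a tiny made-up vocab; `wordpiece_tokenise` and `toy_vocab` are illustrative assumptions, not the HuggingFace implementation.

```python
def wordpiece_tokenise(word, vocab, unknown_token="[UNK]"):
    """Greedily split one word into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched, so the whole word is unknown.
            return [unknown_token]
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"token", "##ised", "##s", "sentence", "get"}

print(wordpiece_tokenise("tokenised", toy_vocab))  # ['token', '##ised']
print(wordpiece_tokenise("sentence", toy_vocab))   # ['sentence']
print(wordpiece_tokenise("xyz", toy_vocab))        # ['[UNK]']
```

Splitting rare words into common pieces is what lets a vocab of about 30,000 tokens cover text containing far more distinct words.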


In [6]:
from torch.utils.tensorboard import SummaryWriter
from time import time


def create_embedding_labels():
    # Each entry defines one metadata column shown alongside the word;
    # add new columns here
    label_functions = {
        "Length": lambda word: len(word),
        "# vowels": lambda word: len([char for char in word if char in "aeiou"]),
        "is number": lambda word: word.isdigit(), # boolean label for numbers
        # "is preposition": lambda word: word in prepositions
    }
    labels = [
        [
            word,
            *[label_function(word) for label_function in label_functions.values()]
        ]
        for word in list(bert_tokenizer.ids_to_tokens.values())[:n_embeddings]
    ]

    label_names = ["Word", *list(label_functions.keys())]

    return labels, label_names


def visualise_embeddings(embeddings, labels=None, label_names="Label"):
    print("Embedding")

    # By default, SummaryWriter logs to a new subfolder of ./runs
    writer = SummaryWriter()
    start = time()
    writer.add_embedding(
        mat=embeddings,
        metadata=labels,
        metadata_header=label_names
    )
    print("Total time:", time() - start)
    writer.close()  # flush everything to disk before opening TensorBoard

    print("Embedding done")

labels, label_names = create_embedding_labels()
visualise_embeddings(embedding_matrix, labels, label_names)

Embedding
Total time: 28.160290002822876
Embedding done


Now open TensorBoard by running the cell below.


In [None]:
%load_ext tensorboard
%tensorboard --logdir runs

_Note that this is a 3D projection of 768-dimensional embeddings, so most of the information is lost in the visualisation._