# Lab 1: Introduction

First, we need to install the transformers library.
Other than that, the packages we need (e.g., PyTorch) are already installed in Colab.

In [None]:
!pip install transformers

In [None]:
# Imports
import torch
from transformers import BertTokenizer
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random

In [None]:
# Set plotting style
sns.set(style='darkgrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (10,5)

## Inspect BERT Vocabulary

Let us start by inspecting the BERT vocabulary, that is, the words, subwords, and characters whose embeddings BERT learned during pretraining.

### Vocabulary
First, we'll retrieve the entire list of "tokens" and write them out to a text file so we can see them.

In [None]:
# Load pre-trained model tokenizer, and write each token on a new line
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# The vocabulary contains non-ASCII characters, so set the encoding explicitly.
with open("vocabulary.txt", 'w', encoding='utf-8') as f:
    for token in tokenizer.vocab.keys():
        f.write(token + '\n')

Now, if you open the file we've just written, you'll see the vocabulary BERT uses. For example:

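* The first 999 tokens appear to be reserved, and most are of the form [unused957]. Positions in this list are 1-indexed line numbers in the file, so each is one higher than the token's ID (e.g., [CLS] has ID 101).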
    * 1   - [PAD]
    * 101 - [UNK]
    * 102 - [CLS]
    * 103 - [SEP]
    * 104 - [MASK]
* Rows 1000-1996 appear to be a dump of individual characters. 
    * They don't appear to be sorted by frequency (e.g., the letters of the alphabet are all in sequence).
* The first word is "the" at position 1997.
    * From there, the words appear to be sorted by frequency. 
    * The top ~18 words are whole words, and then number 2016 is ##s, the most common subword.
    * The last whole word is at 29612, "necessitated"
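The positions listed above are line numbers in the dumped file, while token IDs start at 0. The sketch below illustrates that offset on a tiny made-up vocabulary file. It assumes the file holds one token per line in ID order, as the cell above writes it; `load_vocab_with_ids` and the toy tokens are hypothetical names used only for illustration.

```python
import os
import tempfile


def load_vocab_with_ids(path):
    """Map each token to its 0-indexed ID, i.e. its 1-indexed line number minus one."""
    with open(path, encoding="utf-8") as f:
        return {
            line.rstrip("\n"): line_no - 1
            for line_no, line in enumerate(f, start=1)
        }


# Hypothetical toy vocabulary, NOT the real BERT vocabulary.
toy_tokens = ["[PAD]", "[unused0]", "[UNK]", "the", "##s"]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_vocabulary.txt")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_tokens) + "\n")
    toy_ids = load_vocab_with_ids(toy_path)

# "[PAD]" sits on line 1 of the file but has ID 0.
print(toy_ids)
```

Pointing the same function at `vocabulary.txt` should give IDs that are each one lower than the line numbers quoted in the list above.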

### Single Characters

As discussed earlier, the BERT vocabulary contains subwords and characters, which are useful for representing input text that does not appear in the vocabulary as a whole word. This largely avoids the need for unknown ([UNK]) tokens.
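To make this concrete, here is a minimal sketch of greedy longest-match-first splitting, the general idea behind WordPiece tokenization. It is a simplified illustration, not the library's implementation: the real tokenizer also lowercases, splits punctuation, and caps word length. The function name `wordpiece_tokenize` and the toy vocabulary are assumptions made for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into vocabulary pieces, always taking the longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the '##' prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real BERT vocabulary.
toy_vocab = {"play", "##ing", "##s", "x", "##y", "##z"}

print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['x', '##y', '##z']
print(wordpiece_tokenize("q", toy_vocab))        # ['[UNK]']
```

The second call shows why single characters and their '##' versions matter: a word with no whole-word or multi-character match can still be spelled out character by character, and only a character missing from the vocabulary forces [UNK].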

Let's investigate how much of the vocabulary consists of single characters and single-character subwords. Subwords have '##' as a prefix, so **##s** is a subword and **s** is a character.


The following code prints out all of the single-character tokens in the vocabulary, as well as all of the single-character tokens preceded by '##'.

It turns out that these are matching sets: for every standalone character there is also a '##' version. There are 997 single-character tokens.

The following cell iterates over the vocabulary, pulling out all of the single character tokens.

In [None]:
# Collect tokens that are single characters (length 1),
# and tokens that are single-character subwords (length 3, starting with '##').

one_chars = []
one_chars_subwords = []

for token in tokenizer.vocab.keys():
    if len(token) == 1:
        one_chars.append(token)
    
    elif len(token) == 3 and token[0:2] == '##':
        one_chars_subwords.append(token)

print('Number of single character tokens:', len(one_chars), '\n')

print('Number of single character subwords:', len(one_chars_subwords), '\n')

In [None]:
# Print all of the single characters, 40 per row.
for i in range(0, len(one_chars), 40):
    print(' '.join(one_chars[i:i + 40]))

In [None]:
# Print all of the single character subwords, 40 per row, without the hashes.
one_chars_subwords = [token.replace('##', '') for token in one_chars_subwords]

for i in range(0, len(one_chars_subwords), 40):
    print(' '.join(one_chars_subwords[i:i + 40]))

In [None]:
# We see that each character can also be a subword
print('Are the two sets identical?', set(one_chars) == set(one_chars_subwords))

### Subwords vs. Whole-words

Now, let's gather some statistics on the vocabulary.

In [None]:
# Measure the length of every token in the vocab.
token_lengths = [len(token) for token in tokenizer.vocab.keys()]

# Plot the number of tokens of each length.
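# Pass the data as the keyword argument x; recent seaborn versions no longer
# accept it as a positional argument.
sns.countplot(x=token_lengths)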
plt.title('Vocab Token Lengths')
plt.xlabel('Token Length')
plt.ylabel('# of Tokens')

print('Maximum token length:', max(token_lengths))

##  <span style="color:red">Your turn. </span>

1. **Count the number of subwords and whole words in the vocabulary.**
2. **Plot the lengths of the subwords and whole words.**
3. **Compute the percentage of subwords and whole words out of the whole vocabulary.**

In [None]:
# Count the number of subwords in the vocabulary.

In [None]:
# Plot the subword lengths (not including the two '##' characters).

In [None]:
# Calculate the percentage of words that are '##' subwords.

### Names



Let's see if the BERT vocabulary contains any names. We'll use a list of popular names provided by Project Gutenberg [here](http://www.gutenberg.org/files/3201/files/NAMES.TXT). First, we'll download it using wget.

In [None]:
!pip install wget
import wget

url = 'http://www.gutenberg.org/files/3201/files/NAMES.TXT'
wget.download(url, 'first-names.txt')

In [None]:
# Read and decode the names, then convert them to lowercase, and strip newlines.

with open('first-names.txt', 'rb') as f:
    names_encoded = f.readlines()

names = []
for name in names_encoded:
    try:
        names.append(name.rstrip().lower().decode('utf-8'))
    except UnicodeDecodeError:
        # Skip lines that are not valid UTF-8.
        continue

print('Number of names: {:,}'.format(len(names)))
print('Example:', random.choice(names))

##  <span style="color:red">Your turn.</span>

1. **Count how many names are in the vocabulary.**
2. **Count how many numbers are in the vocabulary.**

### Names

In [None]:
# Count the number of names in the vocabulary.


### Numbers

In [None]:
# Count how many numbers are in the vocabulary.
