# What is a Token?

A **token** is a unit of text that a tokenizer produces and a model processes as a single item. Depending on the tokenizer, it can be:  
- A **word** (e.g., `"cat"`),  
- A **subword** (e.g., `"##ing"` from `"jumping"`, where the `##` prefix marks a piece that continues the previous token),  
- A **single character** (e.g., `"A"`),  
- Or a **special symbol** (e.g., `[UNK]` for unknown words).  

---

## Examples of Tokenization

| **Text**         | **Word Tokens**               | **Subword Tokens (BERT-style, illustrative)** | **Character Tokens**                           |
|------------------|-------------------------------|--------------------------------|------------------------------------------------|
| `"I ate apples!"` | `["I", "ate", "apples", "!"]` | `["i", "ate", "apple", "##s", "!"]` | `["I", " ", "a", "t", "e", " ", "a", "p", "p", "l", "e", "s", "!"]` |
| `"ChatGPT"`      | `["ChatGPT"]` (or `[UNK]`)    | `["chat", "##g", "##pt"]`       | `["C", "h", "a", "t", "G", "P", "T"]`          |

---

## Key Idea

Tokens are like **building blocks** for NLP models. Just as you break a sentence into words to understand it, a tokenizer breaks text into tokens for a computer to process.

**Example:**  
- **Input:** `"Don't panic!"`  
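- **Word-level tokens:** `["Do", "n't", "panic", "!"]` (the contraction is split by a linguistic rule)  
- **Subword-level tokens:** `["don", "'", "t", "panic", "!"]` (BERT uncased: text is lowercased and punctuation is split off)  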
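
To make the word-level and character-level cases concrete, here is a minimal sketch in plain Python. The helper names `word_tokenize` and `char_tokenize` are made up for this illustration and are not part of any library. The regular expression is a simplifying assumption: it splits off every punctuation mark, so it will not reproduce the `"n't"` split shown above, which needs dedicated contraction rules.

```python
import re

def word_tokenize(text):
    # Hypothetical helper: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Hypothetical helper: every character, including spaces, becomes a token.
    return list(text)

print(word_tokenize("I ate apples!"))   # ['I', 'ate', 'apples', '!']
print(word_tokenize("Don't panic!"))    # ['Don', "'", 't', 'panic', '!']
print(char_tokenize("ChatGPT"))         # ['C', 'h', 'a', 't', 'G', 'P', 'T']
```

Word-level tokens keep the vocabulary readable but cannot represent unseen words, while character-level tokens can represent anything at the cost of much longer sequences. Subword tokenization sits between the two.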


# Importing the Tokenizer

In [2]:
from transformers import BertTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


# BertTokenizer and Hugging Face Transformers

## BertTokenizer

`BertTokenizer` is a class from the **Hugging Face Transformers** library that converts text into the input format expected by **BERT** (Bidirectional Encoder Representations from Transformers) and BERT-derived models that use the same WordPiece vocabulary. Other transformer models ship with their own tokenizer classes. In short, it handles the text preprocessing that must happen before the text is passed to the model.

## Hugging Face Transformers

**Hugging Face Transformers** is an open-source library developed by Hugging Face that provides easy access to a variety of state-of-the-art **natural language processing (NLP)** models, such as **BERT**, **GPT**, **T5**, and more. 

These models are pre-trained on large text corpora and can be used as-is or fine-tuned for a wide range of tasks, including:
- Text classification
- Sentiment analysis
- Question answering
- Translation

# Loading the Pretrained Tokenizer

In [3]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


# `.from_pretrained("bert-base-uncased")`

- This call loads a **pre-trained tokenizer** that matches the `"bert-base-uncased"` model. On first use, its vocabulary files are downloaded from the Hugging Face Hub and cached locally.
- `"bert-base-uncased"` is one of several BERT checkpoints (the base-size English model).
  - The word **"uncased"** means that the tokenizer lowercases all text before splitting it, so "Apple" and "apple" produce the same tokens.
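
The effect of "uncased" can be sketched without the library. The function below is a hypothetical stand-in, not the tokenizer's actual code: it lowercases the text and, on the assumption that the uncased BERT tokenizers also strip accents by default, removes combining accent marks.

```python
import unicodedata

def uncased_normalize(text):
    # Hypothetical helper that mimics "uncased" preprocessing.
    text = text.lower()
    # Decompose accented letters into base letter + combining mark (NFD),
    # then drop the combining marks (Unicode category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Apple"))  # apple
print(uncased_normalize("Café"))   # cafe
```

After this step, "Apple" and "apple" are the same string, so they map to the same tokens.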

## What does it do?

- After this line runs, the `tokenizer` object can:
  1. Take any sentence you give it.
  2. Split it into smaller parts (**tokens**).
  3. Convert those tokens into numbers (**IDs**). 

The BERT model takes these IDs as its input; it never sees the raw text.
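
The three steps above can be sketched in plain Python with a toy vocabulary. Everything here is an assumption made for illustration: `VOCAB`, `wordpiece`, and `tokenize` are hypothetical names, the vocabulary has only a handful of entries (the real one has roughly 30,000), and the greedy longest-match split is a simplified version of the WordPiece idea rather than the library's implementation. Special tokens such as `[CLS]` and `[SEP]` are left out.

```python
import re

# Toy vocabulary (hypothetical): maps each known token to an integer ID.
VOCAB = {"[UNK]": 0, "i": 1, "ate": 2, "apple": 3, "##s": 4,
         "!": 5, "jump": 6, "##ing": 7}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

def tokenize(sentence, vocab):
    # Steps 1 and 2: lowercase, split into words and punctuation, then subwords.
    words = re.findall(r"\w+|[^\w\s]", sentence.lower())
    return [tok for word in words for tok in wordpiece(word, vocab)]

tokens = tokenize("I ate apples!", VOCAB)
ids = [VOCAB[tok] for tok in tokens]  # Step 3: look up each token's ID.
print(tokens)  # ['i', 'ate', 'apple', '##s', '!']
print(ids)     # [1, 2, 3, 4, 5]
```

With this toy vocabulary, `wordpiece("jumping", VOCAB)` gives `['jump', '##ing']`, and a word with no matching pieces, such as `"zebra"`, falls back to `['[UNK]']`.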

# Defining Text