<h1 style="text-align:center;">🦜 Natural Language Processing with Transformers 🤗</h1><br><br>

<h2 style="text-align:center;">Chapter 2. Text Classification</h2><br><br>

<h4 style="text-align:center;"><b>Christopher Akiki</b></h4>

<ul>
    <li><h3>The Dataset</h3></li>
    <br>
    <li><h3>From Text to Tokens</h3></li>
    <br>
    <li><h3>Training a Text Classifier</h3></li>
</ul>

<center><img src="images/chapter02_hf-libraries.png" width=1800></center>

<h1 style="text-align:center;">🤗 Datasets</h1>

# Apache Arrow Backend ➡️ Low RAM Use

In [1]:
import os
import psutil
from datasets import load_dataset


mem_before = psutil.Process(os.getpid()).memory_info().rss / (1024**2)

stack_smol = load_dataset("bigcode/the-stack-smol", split="train")

mem_after = psutil.Process(os.getpid()).memory_info().rss / (1024**2)

print(f"RAM usage when loading a {stack_smol.dataset_size / (1024**3):.3f}GB dataset: {(mem_after - mem_before)} MB")

Using custom data configuration bigcode--the-stack-smol-2e98eace392455c7
Found cached dataset json (/mnt/1da05489-3812-4f15-a6e5-c8d3c57df39e/cache/huggingface/bigcode___json/bigcode--the-stack-smol-2e98eace392455c7/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab)


RAM usage when loading a 2.547GB dataset: 244.859375 MB
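The small RAM footprint above is consistent with how 🤗 Datasets keeps data in Arrow files on disk and memory-maps them instead of reading everything into memory. The sketch below illustrates memory mapping in general using Python's standard-library `mmap` module, not Arrow itself; the file name, record layout, and sizes are hypothetical and chosen only for illustration.

```python
import mmap
import os
import tempfile

# Hypothetical data file: 100,000 fixed-width records of 16 bytes each.
record_size = 16
num_records = 100_000
path = os.path.join(tempfile.mkdtemp(), "records.bin")

with open(path, "wb") as f:
    for i in range(num_records):
        # "record-" (7 bytes) + 8 digits + newline = 16 bytes
        f.write(f"record-{i:08d}\n".encode("ascii"))

# Map the file into the address space; no bulk read happens here.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # Slicing copies out only the requested bytes; the OS pages in
        # just the part of the file that backs this range.
        start = 50_000 * record_size
        chunk = mapped[start:start + 2 * record_size]

rows = chunk.decode("ascii").splitlines()
print(rows)  # ['record-00050000', 'record-00050001']
```

Random access into the middle of the file costs roughly the same as access to the start, which is the property that slicing a large on-disk dataset relies on.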



# Apache Arrow Backend ➡️ Fast Iteration
<br>

In [2]:
%%time
batch_size = 1000
for i in range(0, len(stack_smol), batch_size):
    batch = stack_smol[i:i + batch_size]

CPU times: user 2.97 s, sys: 318 ms, total: 3.29 s
Wall time: 3.67 s


<h2 style="text-align:center;">Loading your own files</h2>
<br><br>

<table><thead><tr><th align="center">Data format</th> <th align="center">Loading script</th> <th align="center">Example</th></tr></thead> <tbody><tr><td align="center">CSV &amp; TSV</td> <td align="center"><code>csv</code></td> <td align="center"><code>load_dataset("csv", data_files="my_file.csv")</code></td></tr> <tr><td align="center">Text files</td> <td align="center"><code>text</code></td> <td align="center"><code>load_dataset("text", data_files="my_file.txt")</code></td></tr> <tr><td align="center">JSON &amp; JSON Lines</td> <td align="center"><code>json</code></td> <td align="center"><code>load_dataset("json", data_files="my_file.jsonl")</code></td></tr> <tr><td align="center">Pickled DataFrames</td> <td align="center"><code>pandas</code></td> <td align="center"><code>load_dataset("pandas", data_files="my_dataframe.pkl")</code></td></tr></tbody></table>

<h1 style="text-align:center;">🤗 Tokenizers</h1>

<center><img src="images/tokenization_pipeline.svg" width=1200></center>

In [3]:
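from transformers import AutoTokenizer

# Load the pretrained tokenizer that matches the bert-base-uncased checkpoint
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Informal, misspelled input to probe how subword tokenization copes
sentence = "Hello how are U tday?"

# Inspect the subword tokens produced for the sentence
bert_tokenizer.tokenize(sentence)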
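To make the subword idea concrete without downloading a model, here is a minimal sketch of WordPiece-style greedy longest-match-first splitting. The vocabulary `TOY_VOCAB` and the helpers `wordpiece` and `toy_tokenize` are hypothetical and written for this illustration; this is not the real `bert-base-uncased` vocabulary, and the pieces the actual tokenizer produces may differ.

```python
# Toy WordPiece-style tokenizer over a tiny, hypothetical vocabulary.
TOY_VOCAB = {"[UNK]", "hello", "how", "are", "u", "t", "##day", "?"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into subword pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text, vocab=TOY_VOCAB):
    # Crude normalization (lowercasing) and pre-tokenization (split off "?").
    words = text.lower().replace("?", " ?").split()
    return [piece for word in words for piece in wordpiece(word, vocab)]


tokens = toy_tokenize("Hello how are U tday?")
print(tokens)  # ['hello', 'how', 'are', 'u', 't', '##day', '?']
```

With this toy vocabulary, the misspelled word `tday` is not mapped to a single unknown token but decomposed into the known pieces `t` and `##day`.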

<h1 style="text-align:center;">Classifying Text</h1>

<ul>
    <li><h3>Transformers as Feature Extractors</h3></li>
    <br>
    <li><h3>Fine-tuning Transformers</h3></li>
</ul>

<h1 style="text-align:center;">More Ways of Classifying Text</h1>

<ul>
    <li><h3>Sentence Transformers as Feature Extractors</h3></li>
    <br>
    <li><h3>Pattern-Exploiting Training (PET, ADAPET)</h3></li>
    <br>
    <li><h3>Sentence Transformer Fine-tuning (SetFit)</h3></li>
</ul>

<h1 style="text-align:center;">(Re)sources</h1>

- https://github.com/nlp-with-transformers/notebooks

- https://huggingface.co/docs

- https://github.com/huggingface/course / https://github.com/huggingface/notebooks

- https://github.com/NielsRogge/Transformers-Tutorials

<center><a href="https://www.oreilly.com/library/view/natural-language-processing/9781098103231/"><img src="images/book_cover.png" width=400></a></center>