# Lecture 6: Pretraining

Lecture 6 | CMU ANLP Spring 2025 | Instructor: Sean Welleck


This notebook for [CMU CS11-711 Advanced NLP](https://cmu-l3.github.io/anlp-spring2025/) generates text with a pretrained model and shows an example document from a pretraining dataset.

Install [HuggingFace Transformers](https://huggingface.co/docs/transformers/en/index)

In [None]:
!pip install transformers --upgrade

Load tokenizer and model

- Find models at https://huggingface.co/

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

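# Keep the checkpoint ID under its own name so `model` always refers to the loaded model.
model_name = "HuggingFaceTB/SmolLM2-360M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)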

Generate

In [None]:
prompts = [
    "The meaning of life is",
    "142 + 29 =",
    "On February 12th 2025, the temperature in Pittsburgh was"
]

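for prompt in prompts:
    # Tokenize the prompt into PyTorch tensors (input IDs and attention mask).
    inputs = tokenizer(prompt, return_tensors="pt")
    # No length limit is passed, so generation stops at the library's default
    # maximum length, which is why the outputs are cut off mid-sentence.
    outputs = model.generate(
        **inputs, pad_token_id=tokenizer.eos_token_id
    )
    # outputs[0] holds the prompt tokens followed by the generated continuation.
    print(tokenizer.decode(outputs[0]))
    print("====")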

The meaning of life is to be happy.

The meaning of life is to be happy.

The meaning of
====
142 + 29 = 161

(ii) 123 + 29 = 1
====
On February 12th 2025, the temperature in Pittsburgh was 100 degrees Fahrenheit.

The temperature in Pittsburgh was 100 degrees Fahrenheit
====
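
The repeated sentences above are what one would expect from greedy decoding, which (assuming default settings, with no sampling options passed to `generate`) picks the single highest-scoring next token at every step. The toy sketch below illustrates why that tends to loop. It is not how the model works internally: the table `NEXT_TOKEN_SCORES` and the function `greedy_decode` are hypothetical, and the table looks only at the previous token, whereas the real model conditions on the whole prefix.

```python
# Hypothetical next-token scores keyed by the previous token (illustration only).
NEXT_TOKEN_SCORES = {
    "life": {"is": 0.9, "was": 0.1},
    "is": {"to": 0.6, "a": 0.4},
    "to": {"be": 0.7, "live": 0.3},
    "be": {"happy": 0.8, "free": 0.2},
    "happy": {".": 0.9, "and": 0.1},
    ".": {"life": 0.6, "love": 0.4},
}


def greedy_decode(prompt_tokens, max_new_tokens):
    """Append the highest-scoring next token at each step (greedy decoding)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not scores:
            break  # no known continuation for this token
        # argmax over candidate next tokens: always the same choice for the same input
        tokens.append(max(scores, key=scores.get))
    return tokens


result = greedy_decode(["life"], max_new_tokens=12)
print(" ".join(result))
```

Because the choice at each step is deterministic, once the sequence returns to a state it has already visited it repeats the same continuation until the length limit cuts it off.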


### Inspect DOLMA data

- Paper: https://arxiv.org/abs/2402.00159
- Data: https://huggingface.co/datasets/allenai/dolma


In [26]:
# From https://huggingface.co/datasets/allenai/dolma/blob/main/urls/v1_5-sample.txt
!wget https://olmo-data.org/dolma-v1_5-1913B-sample/cc_en_head/cc_en_head-0000.json.gz


--2025-01-30 12:59:01--  https://olmo-data.org/dolma-v1_5-1913B-sample/cc_en_head/cc_en_head-0000.json.gz
Resolving olmo-data.org (olmo-data.org)... 172.67.175.213, 104.21.83.123
Connecting to olmo-data.org (olmo-data.org)|172.67.175.213|:443... connected.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


HTTP request sent, awaiting response... 200 OK
Length: 1947344553 (1.8G) [application/json]
Saving to: ‘cc_en_head-0000.json.gz’


2025-01-30 13:00:01 (31.6 MB/s) - ‘cc_en_head-0000.json.gz’ saved [1947344553/1947344553]
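
The `huggingface/tokenizers` message in the output above appears because `!` shell commands fork the notebook process after the tokenizer has already been used. As the warning itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly should silence it. A minimal sketch, assuming it is run in a cell before the tokenizer is first used:

```python
import os

# Set the variable named in the warning explicitly.
# "false" turns off the tokenizer's internal parallelism; "true" would keep it on.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```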



In [30]:
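# The download is gzip-compressed: decompress to stdout and keep only the first line (one JSON document).
!gunzip -c cc_en_head-0000.json.gz | head -n 1 > sample.json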

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
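
As an alternative to the shell command above, the first record can be read straight from the compressed file in Python, without decompressing the whole 1.8 GB shard. This is a minimal sketch that assumes the shard stores one JSON object per line (which is what taking the first line and loading it as JSON implies). The helper `read_first_record` and the small stand-in file are hypothetical, used only so the example runs without the download.

```python
import gzip
import json
import os
import tempfile


def read_first_record(path):
    """Return the first JSON object from a gzip-compressed, one-object-per-line file."""
    # gzip.open in text mode decompresses lazily, so only the start of the file is read.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.loads(f.readline())


# Tiny stand-in for the real shard (hypothetical contents).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_records.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"id": "doc-0", "text": "hello"}) + "\n")
    f.write(json.dumps({"id": "doc-1", "text": "world"}) + "\n")

record = read_first_record(demo_path)
for k, v in record.items():
    print(k, v)
```

With the real download, the call would be `read_first_record("cc_en_head-0000.json.gz")`.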


In [31]:
import json
with open("sample.json") as f:
    data = json.load(f)

for k, v in data.items():
    print(k, v)

added 2023-04-11T09:57:03.044571+00:00
attributes {'random_number_v1__random_number_v1__random': [[0, 9626, 0.11918]]}
created 2020-01-17T12:48:23Z
id http://250news.theexplorationplace.com/www.250news.com/65595.html
metadata {'bucket': 'head', 'cc_segment': 'crawl-data/CC-MAIN-2020-05/segments/1579250589560.16/wet/CC-MAIN-20200117123339-20200117151339-00000.warc.wet.gz', 'date_download': '2020-01-17T12:48:23Z', 'digest': 'sha1:K2RJ3HQ73U7W4LKKI2JVVGZ4ABQFRF6Q', 'language': 'en', 'language_score': 0.97, 'length': 13925, 'line_ids': [27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 62, 63, 64, 65, 67, 68, 69, 70, 71, 72, 73, 75, 76, 77, 78, 79, 80, 81, 82, 84, 85, 86, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 102, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 116, 119, 120, 121, 122, 123, 124], 'nlines': 85, 'original_length': 15502, 'original_nlines': 130, 'perplexity': 321.5, 'source_domain':