# Lecture 6: Pretraining

Lecture 6 | CMU ANLP Spring 2025 | Instructor: Sean Welleck


This is a notebook for [CMU CS11-711 Advanced NLP](https://cmu-l3.github.io/anlp-spring2025/) that generates text with a pretrained model and shows an example from a pretraining dataset.

Install [HuggingFace Transformers](https://huggingface.co/docs/transformers/en/index)

In [None]:
!pip install transformers --upgrade

Load tokenizer and model

- Find models at https://huggingface.co/

In [6]:
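import os

from transformers import AutoTokenizer, AutoModelForCausalLM

# Optional: route downloads through a local proxy.
# Remove these two lines if you do not need a proxy.
os.environ["http_proxy"] = "http://127.0.0.1:7897"
os.environ["https_proxy"] = "http://127.0.0.1:7897"

# Keep the model ID in its own variable so it is not overwritten by the model object
model_name = "HuggingFaceTB/SmolLM2-360M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)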


model.safetensors:   0%|          | 0.00/724M [00:00<?, ?B/s]
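The weights file for SmolLM2-360M was reported as roughly 724 MB when this notebook was run. As a rough sanity check, the sketch below estimates the parameter count from that size. It assumes the weights are stored in a 16-bit format (2 bytes per parameter), that "724M" means 724 × 10^6 bytes, and that file header overhead is negligible; `estimate_param_count` is a hypothetical helper, not part of Transformers.

```python
def estimate_param_count(checkpoint_bytes, bytes_per_param=2):
    """Estimate the number of parameters from the checkpoint size.

    Assumption: every parameter takes `bytes_per_param` bytes
    (2 for fp16/bf16, 4 for fp32) and metadata overhead is negligible.
    """
    return checkpoint_bytes / bytes_per_param


checkpoint_bytes = 724e6  # size reported for model.safetensors (assumed to be in bytes)
n_params = estimate_param_count(checkpoint_bytes)
print(f"~{n_params / 1e6:.0f}M parameters")  # prints "~362M parameters"
```

Under these assumptions the estimate lands close to the 360M in the model name; with 4-byte weights the same file would imply about half as many parameters.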

Generate

In [17]:
prompts = [
    "The meaning of life is",
    "142 + 29 =",
    "On February 123th 2025, the temperature in Pittsburgh was",
    "河南科技学院是 "
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Generation parameters
    outputs = model.generate(
        **inputs,
        max_length=50,  # Limit the total length (prompt plus generated tokens)
        num_return_sequences=1,
        temperature=0.7,  # Control randomness
        do_sample=True,
        repetition_penalty=1.2,  # Reduce repetition
        pad_token_id=tokenizer.eos_token_id
    )

    # Skip special tokens when decoding
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Generated: {generated_text}")
    print("="*50)

Prompt: The meaning of life is
Generated: The meaning of life is: to live!
Life means, it's something you can't stop or change. So why not just enjoy what comes your way? What better time than now for that adventure in the sun with your family and friends!"
Prompt: 142 + 29 =
Generated: 142 + 29 =
R = (20 x 5) / (7 x 3) - 6
= 100/21 - 18
= 2.6
Question no
Prompt: On February 123th 2025, the temperature in Pittsburgh was
Generated: On February 123th 2025, the temperature in Pittsburgh was 48 degrees Fahrenheit (9.67 Celsius). A couple had to be evacuated out of their home and were later found dead due to exposure at a
Prompt: 河南科技学院是 
Generated: 河南科技学院是 2014年自然储存师加工化中心的主要研设。 如
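The `temperature` and `repetition_penalty` arguments used above both reshape the next-token scores before a token is sampled. Here is a minimal sketch of the idea in plain Python, assuming a toy three-token vocabulary with made-up logits. The function names are hypothetical and this is not the Transformers implementation; the penalty rule (divide positive logits by the penalty, multiply negative ones) follows the CTRL-style rule as I understand it.

```python
import math
import random


def apply_repetition_penalty(logits, generated_ids, penalty):
    """Make already-generated tokens less likely (assumed CTRL-style rule)."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both lower it
        out[tok] = out[tok] * penalty if out[tok] < 0 else out[tok] / penalty
    return out


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, generated_ids, temperature=0.7,
                      repetition_penalty=1.2, rng=None):
    """Penalize repeats, apply temperature, then sample one token id."""
    rng = rng or random.Random()
    penalized = apply_repetition_penalty(logits, generated_ids, repetition_penalty)
    probs = softmax_with_temperature(penalized, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
print(softmax_with_temperature(toy_logits, 1.0))
print(softmax_with_temperature(toy_logits, 0.7))  # more mass on token 0
print(sample_next_token(toy_logits, generated_ids=[0], rng=random.Random(0)))
```

Lowering the temperature concentrates probability on the highest-scoring token, while the penalty shifts probability away from tokens that have already appeared.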


### Inspect DOLMA data

- Paper: https://arxiv.org/abs/2402.00159
- Data: https://huggingface.co/datasets/allenai/dolma


In [18]:
# From https://huggingface.co/datasets/allenai/dolma/blob/main/urls/v1_5-sample.txt
!wget https://olmo-data.org/dolma-v1_5-1913B-sample/cc_en_head/cc_en_head-0000.json.gz


'wget' is not recognized as an internal or external command,
operable program or batch file.


In [30]:
# Decompress the downloaded .gz file and keep only the first JSON line
!gunzip -c cc_en_head-0000.json.gz | head -n 1 > sample.json

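The shell commands above depend on `wget` and `head`, which are not available on every platform (for example, a default Windows shell). A portable alternative is to do both steps in Python with the standard library. The sketch below assumes the file is gzip-compressed JSON Lines (one JSON object per line); `download_file` and `read_first_record` are hypothetical helper names.

```python
import gzip
import json
import urllib.request


def download_file(url, path):
    """Download `url` to the local file `path` (a stand-in for wget)."""
    urllib.request.urlretrieve(url, path)
    return path


def read_first_record(path):
    """Return the first JSON object from a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        first_line = f.readline()  # reads one line without decompressing the whole file
    return json.loads(first_line)


# Example usage (performs a large download, so it is left commented out):
# url = "https://olmo-data.org/dolma-v1_5-1913B-sample/cc_en_head/cc_en_head-0000.json.gz"
# download_file(url, "cc_en_head-0000.json.gz")
# data = read_first_record("cc_en_head-0000.json.gz")
```

Reading the first line directly from the compressed file avoids writing an uncompressed copy or an intermediate `sample.json` to disk.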


In [31]:
import json
with open("sample.json") as f:
    data = json.load(f)

for k, v in data.items():
    print(k, v)

added 2023-04-11T09:57:03.044571+00:00
attributes {'random_number_v1__random_number_v1__random': [[0, 9626, 0.11918]]}
created 2020-01-17T12:48:23Z
id http://250news.theexplorationplace.com/www.250news.com/65595.html
metadata {'bucket': 'head', 'cc_segment': 'crawl-data/CC-MAIN-2020-05/segments/1579250589560.16/wet/CC-MAIN-20200117123339-20200117151339-00000.warc.wet.gz', 'date_download': '2020-01-17T12:48:23Z', 'digest': 'sha1:K2RJ3HQ73U7W4LKKI2JVVGZ4ABQFRF6Q', 'language': 'en', 'language_score': 0.97, 'length': 13925, 'line_ids': [27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 62, 63, 64, 65, 67, 68, 69, 70, 71, 72, 73, 75, 76, 77, 78, 79, 80, 81, 82, 84, 85, 86, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 102, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 116, 119, 120, 121, 122, 123, 124], 'nlines': 85, 'original_length': 15502, 'original_nlines': 130, 'perplexity': 321.5, 'source_domain':