# Causal Language Modeling

## About
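- Finetune DistilGPT2 on the first 5,000 training examples of the `dany0407/eli5_category` dataset (questions from r/explainlikeimfive), split 80/20 into train and test sets.
- Use the finetuned model for inference.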

## Load ELI5 Dataset

In [2]:
from datasets import load_dataset

eli5 = load_dataset(
    "dany0407/eli5_category",
    split="train[:5000]"
)


In [3]:
eli5 = eli5.train_test_split(test_size=0.2)
eli5['train'][0]

{'q_id': '76p019',
 'title': 'Why do businesses get to offset their costs against their tax liability but not individuals?',
 'selftext': '',
 'category': 'Economics',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dofoc9w', 'dofob2e', 'dofq4dn', 'dog4dve'],
  'text': ['If a business has 1 million in revenue, but 1.2 million in costs where are they going to take the money from to pay taxes on the 1 million?',
   'Their costs are the cost of doing business. It is the expenses they incur to make a profit. You do get to deduct certain things if you go to the trouble, things specifically bought to do your trade. It gets too complicated for this thread. But ordinary living expenses cannot be deducted. Business expenses do.',
   'I don’t know where you’re from but in Portugal there are several expenses you can use to offset your taxes as an individual. Medical, education, donation to charity, rent/mortgage, hair saloon, mechanic (cars) and even generic expenses (via VAT). Of cours

## Preprocessing

In [4]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")


In [5]:
eli5 = eli5.flatten()
eli5['train'][0]

{'q_id': '76p019',
 'title': 'Why do businesses get to offset their costs against their tax liability but not individuals?',
 'selftext': '',
 'category': 'Economics',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dofoc9w', 'dofob2e', 'dofq4dn', 'dog4dve'],
 'answers.text': ['If a business has 1 million in revenue, but 1.2 million in costs where are they going to take the money from to pay taxes on the 1 million?',
  'Their costs are the cost of doing business. It is the expenses they incur to make a profit. You do get to deduct certain things if you go to the trouble, things specifically bought to do your trade. It gets too complicated for this thread. But ordinary living expenses cannot be deducted. Business expenses do.',
  'I don’t know where you’re from but in Portugal there are several expenses you can use to offset your taxes as an individual. Medical, education, donation to charity, rent/mortgage, hair saloon, mechanic (cars) and even generic expenses (via VAT). Of cour

In [6]:
def preprocess_fn(examples):
    # With batched=True, each x is one question's list of answers; join them into a single string.
    return tokenizer([" ".join(x) for x in examples['answers.text']])

In [10]:
# Inspect one record to sanity-check the input that preprocess_fn will receive
rec_0 = eli5['train'][0]

rec_0

{'q_id': '76p019',
 'title': 'Why do businesses get to offset their costs against their tax liability but not individuals?',
 'selftext': '',
 'category': 'Economics',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dofoc9w', 'dofob2e', 'dofq4dn', 'dog4dve'],
 'answers.text': ['If a business has 1 million in revenue, but 1.2 million in costs where are they going to take the money from to pay taxes on the 1 million?',
  'Their costs are the cost of doing business. It is the expenses they incur to make a profit. You do get to deduct certain things if you go to the trouble, things specifically bought to do your trade. It gets too complicated for this thread. But ordinary living expenses cannot be deducted. Business expenses do.',
  'I don’t know where you’re from but in Portugal there are several expenses you can use to offset your taxes as an individual. Medical, education, donation to charity, rent/mortgage, hair saloon, mechanic (cars) and even generic expenses (via VAT). Of cour

In [11]:
rec_0['answers.text']

['If a business has 1 million in revenue, but 1.2 million in costs where are they going to take the money from to pay taxes on the 1 million?',
 'Their costs are the cost of doing business. It is the expenses they incur to make a profit. You do get to deduct certain things if you go to the trouble, things specifically bought to do your trade. It gets too complicated for this thread. But ordinary living expenses cannot be deducted. Business expenses do.',
 'I don’t know where you’re from but in Portugal there are several expenses you can use to offset your taxes as an individual. Medical, education, donation to charity, rent/mortgage, hair saloon, mechanic (cars) and even generic expenses (via VAT). Of course it’s not a 1:1 reduction though. That is because you pay taxes on your income but companies pay taxes on their profits.',
 'People aren’t profit making entities whereas that is all businesses are. Business pays tax on retained profit. To calculate profit you need to account for the

In [12]:
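# Note: on a single record x is one answer string, so join() spaces out its characters; in a batched map() x is a list of answers.
[" ".join(x) for x in rec_0['answers.text']]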

['I f   a   b u s i n e s s   h a s   1   m i l l i o n   i n   r e v e n u e ,   b u t   1 . 2   m i l l i o n   i n   c o s t s   w h e r e   a r e   t h e y   g o i n g   t o   t a k e   t h e   m o n e y   f r o m   t o   p a y   t a x e s   o n   t h e   1   m i l l i o n ?',
 'T h e i r   c o s t s   a r e   t h e   c o s t   o f   d o i n g   b u s i n e s s .   I t   i s   t h e   e x p e n s e s   t h e y   i n c u r   t o   m a k e   a   p r o f i t .   Y o u   d o   g e t   t o   d e d u c t   c e r t a i n   t h i n g s   i f   y o u   g o   t o   t h e   t r o u b l e ,   t h i n g s   s p e c i f i c a l l y   b o u g h t   t o   d o   y o u r   t r a d e .   I t   g e t s   t o o   c o m p l i c a t e d   f o r   t h i s   t h r e a d .   B u t   o r d i n a r y   l i v i n g   e x p e n s e s   c a n n o t   b e   d e d u c t e d .   B u s i n e s s   e x p e n s e s   d o .',
 'I   d o n ’ t   k n o w   w h e r e   y o u ’ r e   f r o m   b u t   i n   P o r t u g a l 
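The character-spaced output above comes from iterating over a single record rather than a batch. The following minimal sketch, using made-up answer strings (`batch` and `single` are hypothetical stand-ins, not rows from the dataset), contrasts the two cases:

```python
# Hypothetical batch in the shape that a batched map() passes to preprocess_fn:
# one list of answer strings per question.
batch = {"answers.text": [["Answer A1.", "Answer A2."], ["Answer B1."]]}

# Each x is a list of strings, so join() concatenates whole answers.
joined_batch = [" ".join(x) for x in batch["answers.text"]]
print(joined_batch)  # ['Answer A1. Answer A2.', 'Answer B1.']

# Hypothetical single record: just one list of answer strings.
single = ["Hi there.", "Yes."]

# Each x is now a string, so join() inserts a space between characters.
joined_chars = [" ".join(x) for x in single]
print(joined_chars)  # ['H i   t h e r e .', 'Y e s .']

# To mimic what the batched call does for one record, join the list itself.
joined_single = " ".join(single)
print(joined_single)  # Hi there. Yes.
```

So `preprocess_fn` behaves as intended inside `map(..., batched=True)`; only the single-record check iterates one level too deep.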

In [8]:
tokenized_eli5 = eli5.map(
    preprocess_fn,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

Token indices sequence length is longer than the specified maximum sequence length for this model (3077 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1027 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3376 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1259 > 1024). Running this sequence through the model will result in indexing errors


Token indices sequence length is longer than the specified maximum sequence length for this model (2110 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1049 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1480 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1129 > 1024). Running this sequence through the model will result in indexing errors


In [17]:
block_size = 128

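def group_texts(examples):
    # Concatenate every sequence in the batch into one long list per column.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # Drop the tail that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size

    # Split each column into chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # For causal LM the labels are the inputs; the model shifts them internally.
    result["labels"] = result["input_ids"].copy()
    return result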
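To make the chunking concrete, here is a small sketch that runs the same logic on a toy batch. The `block_size` of 4 and the token ids are illustrative assumptions; the notebook itself uses 128 and real tokenizer output.

```python
block_size = 4  # toy value for illustration; the notebook uses 128

def group_texts(examples):
    # Concatenate all sequences per column, then cut into fixed-size blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Three made-up tokenized sequences with 3 + 5 + 2 = 10 tokens in total.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

grouped = group_texts(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Ten tokens yield two full blocks of four, and the last two tokens (9 and 10) are discarded. Block boundaries ignore the original sequence boundaries, which is why the over-length warnings from the tokenization step do not cause problems once the data is regrouped.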

In [18]:
lm_dataset = tokenized_eli5.map(
    group_texts, 
    batched=True, 
    num_proc=4
)


In [20]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
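With `mlm=False` the collator pads each batch and derives the labels from the input ids, replacing pad positions with -100 so the loss ignores them. The sketch below is a simplified pure-Python approximation of that behaviour, not the library's code; `collate_causal` is a hypothetical helper, and right-side padding is assumed.

```python
PAD_ID = 50256       # DistilGPT2's EOS id, reused as the pad id in this notebook
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def collate_causal(batch, pad_id=PAD_ID):
    """Hypothetical stand-in for the collator: pad to the longest sequence, build labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad
        input_ids.append(padded)
        # 1 marks real tokens, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; every pad-id position is masked out.
        labels.append([IGNORE_INDEX if tok == pad_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

collated = collate_causal([[10, 11, 12], [20]])
print(collated["input_ids"])  # [[10, 11, 12], [20, 50256, 50256]]
print(collated["labels"])     # [[10, 11, 12], [20, -100, -100]]
```

Because the pad token is the EOS token here, any genuine EOS token in the text would be masked in the labels as well. In this notebook nearly every block is exactly `block_size` tokens long, so padding should rarely be needed.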

## Train

In [26]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

In [27]:
training_args = TrainingArguments(
    output_dir="eli5_causal_lm_model",
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset['train'],
    eval_dataset=lm_dataset['test'],
    data_collator=data_collator,
    # `tokenizer=` is deprecated in recent transformers releases; `processing_class=` replaces it.
    processing_class=tokenizer
)


In [28]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.9212        | 3.794562        |
| 2     | 3.8293        | 3.783511        |
| 3     | 3.794         | 3.782328        |




TrainOutput(global_step=3960, training_loss=3.848212872129498, metrics={'train_runtime': 1087.4025, 'train_samples_per_second': 29.12, 'train_steps_per_second': 3.642, 'total_flos': 1034245200936960.0, 'train_loss': 3.848212872129498, 'epoch': 3.0})
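As a rough sanity check, the numbers in `TrainOutput` can be used to estimate how many 128-token blocks the training set contains. This sketch assumes the `TrainingArguments` defaults that the notebook does not override (batch size 8, one device, no gradient accumulation) and that the reported throughput counts samples across all epochs; treat the result as approximate.

```python
import math

# Figures copied from the TrainOutput above.
global_step = 3960
epochs = 3.0
train_runtime = 1087.4025         # seconds
train_samples_per_second = 29.12  # rounded in the log

# Assumption: default batch size of 8 on a single device, no gradient accumulation.
batch_size = 8

steps_per_epoch = global_step / epochs
blocks_per_epoch = train_samples_per_second * train_runtime / epochs
tokens_per_epoch = blocks_per_epoch * 128  # block_size from the preprocessing step

print(f"steps per epoch:   {steps_per_epoch:.0f}")
print(f"blocks per epoch: ~{blocks_per_epoch:.0f}")
print(f"tokens per epoch: ~{tokens_per_epoch:,.0f}")
print("consistent with batch size 8:",
      math.ceil(blocks_per_epoch / batch_size) == steps_per_epoch)
```

Under these assumptions the 4,000 training questions expand to a little over ten thousand blocks, which matches 1,320 optimizer steps per epoch.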

In [None]:
import math

eval_results = trainer.evaluate()

In [30]:
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}") 

Perplexity: 43.92
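Perplexity is the exponential of the mean cross-entropy loss, so it can be recomputed for every epoch directly from the validation losses in the training log. A small sketch using those logged values:

```python
import math

# Validation losses copied from the training log above.
val_losses = {1: 3.794562, 2: 3.783511, 3: 3.782328}

# perplexity = exp(mean cross-entropy loss in nats)
perplexities = {epoch: math.exp(loss) for epoch, loss in val_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")
```

The value for epoch 3 reproduces the 43.92 reported above. The validation loss barely moves after the first epoch, so most of the improvement happens early.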


## Inference

In [31]:
prompt = "Somatic hypermutation allows the immune system to"

In [35]:
from transformers import pipeline

generator = pipeline(
    'text-generation', 
    model='eli5_causal_lm_model/checkpoint-3960',
)

generator(prompt)

Device set to use mps:0


[{'generated_text': "Somatic hypermutation allows the immune system to create a better response to the threat. This is why you can kill a person to make sure they get the proper dosage of the drug. The immune system can't get enough of the drug's molecules to make it effective and the immune system can't do enough to make it effective. It could be a case of a person being in a panic mode. This is why you need a lot of oxygen to combat a certain type of infection. There are lots of other ways to survive these types of infections. The immune system can't use oxygen in the most basic way. But if you're a person with an upper respiratory infection (e.g. pneumonia), you should be able to do it. It's not exactly a medical emergency, but it's a survival emergency. We need to use oxygen to do it. When you're sick, you need to be able to stop the infection. So every person that is sick should be able to stop the infection. If you are sick, you need to keep the infection alive and keep it in che

In [38]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("eli5_causal_lm_model/checkpoint-3960")
inputs = tokenizer(prompt, return_tensors="pt").input_ids

In [41]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('eli5_causal_lm_model/checkpoint-3960')
outputs = model.generate(
    inputs,
    # Single unpadded prompt, so every position is a real token and the mask is all ones.
    # It must be passed explicitly: pad and EOS share one id, so it cannot be inferred.
    attention_mask=inputs.new_ones(inputs.shape),
    max_new_tokens=100,
    do_sample=True,
    top_k=50,
    top_p=0.95
)



In [42]:
tokenizer.batch_decode(outputs, skip_special_tokens=True)

['Somatic hypermutation allows the immune system to adapt to a more drastic response. When the immune system is overwhelmed, the immune system does NOT automatically adapt. Instead, when someone is over-imaging their own body, they can become over-stimulating. They can start to suffer from hypermutations by learning to eat more food more quickly. This is why the immune system is called a "lunatic system". When you take a meal without being able to sense how quickly it takes to digest food, that\'s why people are afraid']
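The `top_k` and `top_p` arguments passed to `generate` restrict which tokens can be sampled at each step. The following pure-Python sketch illustrates the idea on a made-up next-token distribution; `top_k_top_p_filter` and `toy_probs` are hypothetical, and this is a simplified illustration rather than the library's implementation.

```python
import random

def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return a renormalised {token: prob} dict after top-k, then nucleus (top-p) filtering."""
    # Rank tokens from most to least likely and keep at most top_k of them.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Renormalise over the top-k survivors before applying the nucleus cutoff.
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise again so the kept probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Made-up distribution over five candidate next tokens.
toy_probs = {"adapt": 0.5, "respond": 0.3, "learn": 0.15, "eat": 0.04, "panic": 0.01}

filtered = top_k_top_p_filter(toy_probs, top_k=4, top_p=0.9)
print(sorted(filtered))  # ['adapt', 'learn', 'respond']

# Sample the next token from the filtered distribution.
rng = random.Random(0)
next_token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(next_token)
```

Here top-k removes "panic", and the nucleus cutoff then removes "eat", leaving three candidates. Filtering trims the unlikely tail, but the generated samples above show that it cannot compensate for a small model trained briefly on limited data: the text is fluent yet factually unreliable.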