# Feedback Prize - English Language Learning
### Experiment fine-tuning a small GPT model to autocomplete scores for English essays in the Feedback Prize - English Language Learning Kaggle competition

### Author: Adrián Melic
Twitter: [@adrianmelic](https://www.twitter.com/adrianmelic)

Kaggle: [adrianmelic](https://www.kaggle.com/adrianmelic)

## Kaggle Notebook

This is a Kaggle Notebook competition. To run this notebook you can:

* Go to [the Kggle competition](https://www.kaggle.com/competitions/feedback-prize-english-language-learning/code)
* Click on New Notebook
* Choose Acceleration GPU P100
* Internet on if this is the first time running the notebook
* Run this code

In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForCausalLM, IntervalStrategy

MAX_LENGHT = 2048
USE_LOCAL_MODEL = False

## Load tokenizer and model
The model "EleutherAI/gpt-neo-1.3B" is too big for Kaggle Notebooks memory. "EleutherAI/gpt-neo-125M" seems to be on the limit of the machine.


In [3]:
torch.manual_seed(42)
# tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B", bos_token='<|startoftext|>',
#                                           eos_token='<|endoftext|>', pad_token='<|pad|>')
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M", bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')

Downloading:   0%|          | 0.00/560 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/0.98k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/357 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
if USE_LOCAL_MODEL:
    # Load the EleutherAI/gpt-neo-125M fine-tuned model after emptying the GPU memory
    model = AutoModelForCausalLM.from_pretrained('./fine-tuned-model').cuda()
else:
    # model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").cuda()
    # model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M").cuda()

Downloading:   0%|          | 0.00/502M [00:00<?, ?B/s]

In [10]:
model.resize_token_embeddings(len(tokenizer))

Embedding(50259, 768)

## Load train and test data and prompt functions

In [20]:
train = pd.read_csv('../input/feedback-prize-english-language-learning/train.csv')
test = pd.read_csv('../input/feedback-prize-english-language-learning/test.csv')

In [43]:
def create_prompt(row, train):
    """
    Make a new text for fine-tune the model with the essay and the feedback
    Example:
    <|startoftext|>
    Assess the language proficiency of 8th-12th grade English Language Learners
    (ELLs). Utilizing a dataset of essays written by ELLs will help to develop
    proficiency models that better supports all students.

    This essay will be scored according to six analytic measures:
    cohesion, syntax, vocabulary, phraseology, grammar, and conventions.

    Each measure represents a component of proficiency in essay writing, with
    greater scores corresponding to greater proficiency in that measure. The
    scores range from 1.0 to 5.0 in increments of 0.5. Your task is to predict
    the score of each of the six measures for the essay given.

    Essay:
    "The hardest part of school is getting ready. you wake up go brush [...]"
    
    Scores:
    coheshion: 3.5
    syntax: 3.5
    vocabulary: 3.0
    phraseology: 3.0
    grammar: 4.0
    conventions: 3.0
    <|endoftext|>
    """
    prompt = (
        '<|startoftext|>'
        'Assess the language proficiency of 8th-12th grade English Language Learners (ELLs). Utilizing a dataset of essays written by ELLs will help to develop proficiency models that better supports all students.\n'
        'This essay will be scored according to six analytic measures:\n'
        'cohesion, syntax, vocabulary, phraseology, grammar, and conventions.\n'
        '\n'
        'Each measure represents a component of proficiency in essay writing, with greater scores corresponding to greater proficiency in that measure. The scores range from 1.0 to 5.0 in increments of 0.5. Your task is to predict the score of each of the six measures for the essay given.\n'
        '\n'
        'Essay:\n'
        f'"{row["full_text"]}"\n'
        '\n'
        'Scores:\n'
        'coheshion:'
    )
    if train:
        prompt += (
            f' {row["cohesion"]}\n'
            f'syntax: {row["syntax"]}\n'
            f'vocabulary: {row["vocabulary"]}\n'
            f'phraseology: {row["phraseology"]}\n'
            f'grammar: {row["grammar"]}\n'
            f'conventions: {row["conventions"]}\n'
        )
    # Add the end of text
    prompt += '<|endoftext|>'
    return prompt

In [49]:
def filter_prompts(texts):
    """Ignore fine_tune_texts that once tokenized are longer than MAX_LENGHT tokens"""
    ingored_texts = []
    filtered_texts = []
    for text in texts:
        if len(tokenizer.encode(text)) > MAX_LENGHT:
            ingored_texts.append(text)
        else:
            filtered_texts.append(text)
    print(f'Percentage of ignored texts: {len(ingored_texts) / len(filtered_texts) * 100:.2f}%')
    return filtered_texts

In [50]:
train_prompts = []
for idx, row in train.iterrows():
    train_prompts.append(create_prompt(row, train=True))
filtered_train_prompts = filter_prompts(train_prompts)

test_prompts = []
for idx, row in test.iterrows():
    test_prompts.append(create_prompt(row, train=False))
filtered_test_prompts = filter_prompts(test_prompts)

Token indices sequence length is longer than the specified maximum sequence length for this model (2328 > 2048). Running this sequence through the model will result in indexing errors


Percentage of ignored texts: 0.62%
Percentage of ignored texts: 0.00%


## Fine-tune model

In [8]:
class FeedbackDataset(Dataset):
    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for txt in txt_list:
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

In [10]:
if not USE_LOCAL_MODEL:
    # Avoid wandb.ai logs
    import os
    os.environ["WANDB_DISABLED"] = "true"
    # Trick to remove this after using os.environ["WANDB_DISABLED"]
    # os.environ.pop("WANDB_DISABLED")

    dataset = FeedbackDataset(filtered_fine_tune_texts, tokenizer, max_length=MAX_LENGHT)
    train_size = int(0.9 * len(dataset))
    train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
    training_args = TrainingArguments(output_dir='./results', num_train_epochs=5, logging_steps=5000,
                                      save_strategy=IntervalStrategy.NO,
                                      per_device_train_batch_size=2, per_device_eval_batch_size=2,
                                      warmup_steps=100, weight_decay=0.01, logging_dir='./logs')
    Trainer(model=model, args=training_args, train_dataset=train_dataset,
            eval_dataset=val_dataset, data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                                                  'attention_mask': torch.stack([f[1] for f in data]),
                                                                  'labels': torch.stack([f[0] for f in data])}).train()
     model.save_pretrained('./fine-tuned-model')

***** Running training *****
  Num examples = 3499
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 8750
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33madrianmelic[0m. Use [1m`wandb login --relogin`[0m to force relogin


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Step,Training Loss
5000,1.3399




Training completed. Do not forget to share your model on huggingface.co/models =)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0: Assess the language proficiency of 8th-12th grade English Language Learners.
This essay will be scored according to four analytic measures:
cohesion, syntax, vocabularyaries, phraseology, grammar, and conventions.

This essay will be one weekend break."

Parents decide to Government wanted to "Dear and or first time in summer?  that means you dont wanna be as something faster to be in this forever and more than something. is that we're going back away by thele of someone that is anEss wrote and I feel cool way. When people has been practicing so hard times, I still wanted was because I'm gonna understand what, that way on there lives in all of life.

With everyone looks good because when I like what happens. The reason are wonderful stuff because one moment has a bad impact on the ofection has gotten anoder.

It can cause you can't happen because, my uncle's about one and thats what there first thought or when I want it. Also that people see me when someone likes Emerson is perfect 

## Try the fine-tunned model with the test prompts

In [65]:
# Using our fine-tuned model, generate text using the filtered_test_prompts prompts
for test_prompt in filtered_test_prompts[0]:
    generated = tokenizer(test_prompt, return_tensors="pt").input_ids.cuda()
    sample_outputs = model.generate(
        generated,
        do_sample=True,
        top_k=50,
        bos_token='<|startoftext|>',
        eos_token='<|endoftext|>',
        pad_token='<|pad|>',
        pad_token_id=50256,
        max_length=MAX_LENGHT,
        top_p=0.95,
        temperature=0.7,
        num_return_sequences=1
    )
    for i, sample_output in enumerate(sample_outputs):
        print(f'Output {i}' + '\n')
        # Sepparate text prompt from the generated text of sample_output
        # Substract the test_prompt from the sample_output
#         print('test_prompt:\n' + test_prompt + '\n')
        print('generated: \n' + tokenizer.decode(sample_output, skip_special_tokens=True))
        break

Output 0

generated: 
<Assess the language proficiency of 8th-12th grade English Language Learners.
This essay will be scored according to six analytic measures:
cohesion, syntax, vocabulary, phraseology, grammar, and conventions.

Each measure represents a component of proficiency in essay writing, with greater scores corresponding to greater proficiency in that measure. The scores range from 1.0 to 5.0 in increments of 0.5. Your task is to predict the score of each of the six measures for the essay given.

Essay:
"I think that we like to the first impression is good, because we can be in a good idea to be a good attitude and positive attitude, and some people can have a positive attitude, and good attitude, and be a good attitude, and be a positive attitude, but its good attitude can see this people who did.

It's the key to do more and the people need to be positive attitude and make a positive attitude in life, and others people will have a good way to have positive attitude and if

KeyboardInterrupt: 