# Baseball Play Description Inference Model - Game Information

This notebook further fine-tunes an LLM that was previously trained to return the number of runs a baseball team scored in a game, given descriptions of that game's scoring plays. The new model builds on the previous one by returning the winner of the game and the score for both teams.

## Step 0: Load Necessary Python Modules

The modules **datasets**, **accelerate**, and **peft** are not included in the Google Colab environment, but they are required to run the Trainer object. Upon opening this notebook or connecting to a new runtime, run this cell and restart the runtime to let the changes take effect.

In [None]:
!pip install datasets
!pip install accelerate
!pip install peft
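
After the restart, it can help to confirm that the three packages are importable before running the later cells. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, not part of the original notebook.

```python
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means all three packages are available to import
print(missing_modules(["datasets", "accelerate", "peft"]))
```

If the printed list is not empty, rerunning the install cell and restarting the runtime again is the likely fix.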

## Step 1: Mount Google Drive and Import Functions

The following cell connects the notebook to Google Drive, imports the necessary functions from external modules, and sets the path to the project folder.

In [None]:
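# Mount Google Drive so the project folder is reachable from Colab
from google.colab import drive
drive.mount('/content/drive')

# Training utilities, PEFT model loading, and data handling
from transformers import AutoTokenizer, Seq2SeqTrainer, Seq2SeqTrainingArguments
from peft import get_peft_model, LoraConfig, TaskType, PeftConfig, PeftModel, AutoPeftModelForSeq2SeqLM
from datasets import Dataset
from pathlib import Path
import pandas as pd

# Project folder on Drive that holds the saved model and the data files
filepath = Path('/content/drive/MyDrive/Text Mining Project')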

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
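
Step 2 reads a saved model folder and two CSV files from the project folder, so a quick existence check can catch a wrong path early. This is only a sketch: `missing_paths` is a hypothetical helper, and the three names are the ones used in the next cell.

```python
from pathlib import Path

def missing_paths(base, names):
    """Return the entries in names that do not exist under base."""
    base = Path(base)
    return [name for name in names if not (base / name).exists()]

# Files and folders that Step 2 expects to find in the project folder
required = ["score-flan-t5", "train_game_docs.csv", "test_game_docs.csv"]
print(missing_paths("/content/drive/MyDrive/Text Mining Project", required))
```

An empty list means everything Step 2 needs is in place.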


## Step 2: Load and Prepare the Model and Data

The following cell loads a previously fine-tuned language model (trained using Run_Calculation_Trainer.ipynb), imports the dataset, and preprocesses the data for training. The model is loaded as a PEFT (parameter-efficient fine-tuning) model, which reduces the computational workload so that training fits on the available Colab hardware.

In [None]:
### Load the pre-trained model and tokenizer:

model_name = filepath / 'score-flan-t5'
model = AutoPeftModelForSeq2SeqLM.from_pretrained(model_name,
                                                  is_trainable = True)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast = True)
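# Recompute activations during the backward pass to save memory
model.gradient_checkpointing_enable({"use_reentrant": False})
# The generation cache is not needed during training
model.config.use_cache = False
# Let gradients flow through the inputs of the frozen base model
model.enable_input_require_grads()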


### Load, prepare, and tokenize the documents for training

train_df = pd.read_csv(filepath / 'train_game_docs.csv').iloc[:,1:]
test_df = pd.read_csv(filepath / 'test_game_docs.csv').iloc[:,1:]
train_info = Dataset.from_pandas(train_df)
test_info = Dataset.from_pandas(test_df)

def preprocess(docs):
  """
  A helper function to preprocess/tokenize the documents and
  responses for a set of game prompts and game information

  Inputs:
  docs - a batch of rows from a Dataset object, with 'document'
         and 'score' columns

  Outputs:
  inputs - the tokenized batch, with the tokenized targets stored
           under 'labels' for use with a Trainer object
  """

  # Tokenize inputs
  inputs = tokenizer(docs['document'], padding = 'max_length',
                     max_length = 810, truncation = True)

  # Tokenize targets
  targets = tokenizer(docs['score'], padding = 'max_length',
                      max_length = 25, truncation = True)

  # Put tokenized targets in inputs Dataset
  inputs['labels'] = targets['input_ids']

  return inputs

train_data = train_info.map(preprocess, batched = True)
test_data = test_info.map(preprocess, batched = True)



Map:   0%|          | 0/5221 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
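
As written, `preprocess` pads the targets to 25 tokens and stores the padded ids directly as labels, so the padded positions count toward the training loss. A common alternative is to replace the pad ids with -100, the default ignore index of PyTorch's cross-entropy loss. The sketch below shows that step on plain lists; `mask_padding` is a hypothetical helper, and the notebook does not show whether this change would affect the results here.

```python
def mask_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding ids in each label sequence with the loss ignore index."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Toy example: 0 stands in for the pad id, the other ids are arbitrary
print(mask_padding([[5, 6, 1, 0, 0]], pad_token_id=0))
```

Inside `preprocess`, this would amount to `inputs['labels'] = mask_padding(targets['input_ids'], tokenizer.pad_token_id)`.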

## Step 3: Run the Training

The following cell runs the training and saves the resulting fine-tuned model.

In [None]:
### Set the training arguments:

training_args = Seq2SeqTrainingArguments(
    output_dir = "./baseball-flan-t5",
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 8,
    logging_steps = 100,
    num_train_epochs = 3,
)


### Build and run the trainer:

trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = train_data,
    eval_dataset = test_data,
)

trainer.train()


### Save the fine-tuned model

model.save_pretrained(filepath / 'baseball-flan-t5')
tokenizer.save_pretrained(filepath / 'baseball-flan-t5')

Step,Training Loss
100,9.2576
200,3.4413
300,2.1039
400,1.6504
500,1.4257
600,1.2504
700,1.1339
800,1.0276
900,0.9327
1000,0.8611



('/content/drive/MyDrive/Text Mining Project/baseball-flan-t5/tokenizer_config.json',
 '/content/drive/MyDrive/Text Mining Project/baseball-flan-t5/special_tokens_map.json',
 '/content/drive/MyDrive/Text Mining Project/baseball-flan-t5/tokenizer.json')
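
As a rough cross-check on the training log, the number of optimizer steps can be estimated from the dataset size and the batch size. The sketch below assumes a single device and no gradient accumulation, neither of which is stated in the notebook. `optimizer_steps` is a hypothetical helper, and 5221 is the training-set size reported by the `Map` output in Step 2.

```python
import math

def optimizer_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total steps), counting a final partial batch."""
    per_epoch = math.ceil(num_examples / batch_size)
    return per_epoch, per_epoch * num_epochs

# Values taken from the training arguments and the Map output above
per_epoch, total = optimizer_steps(5221, 8, 3)
print(per_epoch, total)
```

Under those assumptions one epoch is 653 steps, so the last logged step (1000) falls partway through the second epoch.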