In [2]:
%pip install unidecode
%pip install pybaseball

Collecting unidecode
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.3.8
Note: you may need to restart the kernel to use updated packages.
Collecting pybaseball
  Downloading pybaseball-2.2.7-py3-none-any.whl.metadata (11 kB)

In [3]:
import warnings
warnings.filterwarnings('ignore')
from pybaseball import statcast
from pybaseball import playerid_reverse_lookup
from unidecode import unidecode 
import os
import json
import time
import pickle
import random

## Encoding a dataset for the [Alpaca LoRA](https://github.com/tloen/alpaca-lora) format.

### What's a [LoRA](https://github.com/microsoft/LoRA)? 

[Paper](https://www.microsoft.com/en-us/research/publication/lora-low-rank-adaptation-of-large-language-models/)

LoRA stands for Low-Rank Adaptation of Large Language Models. It reduces the number of trainable parameters by learning pairs of rank-decomposition matrices while keeping the original weights frozen. This greatly reduces the storage needed for large language models adapted to specific tasks and enables efficient task switching during deployment, all without introducing inference latency. According to the LoRA authors, it also outperforms several other adaptation methods, including adapters, prefix-tuning, and fine-tuning.
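To make the parameter savings concrete, here is a minimal pure-Python sketch of the low-rank update, W' = W + BA, where W stays frozen and only the small matrices B and A are trained. The helper names (`lora_param_counts`, `matmul`, `apply_lora`), the `scale` argument, and the 4096 x 4096 layer size are assumptions chosen for illustration; they are not taken from the LoRA codebase.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # training W directly
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w, b, a, scale=1.0):
    """Return W + scale * (B @ A) as a new matrix; the frozen W is not modified."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical layer size, chosen only to show the order of magnitude.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full, lora, full // lora)  # 16777216 65536 256

# Tiny rank-1 example on a 2x2 weight.
w = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight
b = [[1.0], [2.0]]            # 2 x 1
a = [[0.5, 0.5]]              # 1 x 2
print(apply_lora(w, b, a))    # [[1.5, 0.5], [1.0, 2.0]]
```

Because the product BA has the same shape as W, it can be added into W once before deployment, which is why the adapted model runs with no extra inference cost.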

### [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) 
Stanford Alpaca is an instruction-following LLaMA model.

### [LLaMA](https://github.com/facebookresearch/llama)

Meta developed and released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Meta's fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. According to Meta, Llama-2-Chat models outperform open-source chat models on most benchmarks it tested and, in its human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM.

**Model Architecture** Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

**Research Paper** More information can be found in the paper "Llama-2: Open Foundation and Fine-tuned Chat Models", available at https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/.

**Where to send questions or comments about the model** Instructions on how to provide feedback or comments on the model can be found in the model [README](README.md).

## **Intended Use**
**Intended Use Cases** Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.

**Out-of-scope Uses** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 2 Community License. Use in languages other than English\*.

\*Note: Developers may fine-tune Llama 2 models for languages beyond English provided they comply with the Llama 2 Community License and the Acceptable Use Policy.

**CO<sub>2</sub> emissions during pretraining**

|Model|Time (GPU hours)|Power Consumption (W)|Carbon Emitted (tCO<sub>2</sub>eq)|
|---|---|---|---|
|Llama 2 7B|184320|400|31.22|
|Llama 2 13B|368640|400|62.44|
|Llama 2 34B|1038336|350|153.90|
|Llama 2 70B|1720320|400|291.42|
|Total|3311616||539.00|

## **Training Data**
**Overview** Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.

**Data Freshness** The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.

## **Prompt Template**

```
{
    "description": "Template used by Alpaca-LoRA.",
    "prompt_input": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
    "prompt_no_input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n",
    "response_split": "### Response:"    
}
```

```"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"```
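As a rough illustration of how this template is meant to be used, the sketch below fills in the placeholders with `str.format` and recovers the answer from generated text by splitting on `response_split`. The function names `generate_prompt` and `get_response` and the sample strings are assumptions made for this example, not a copy of the Alpaca-LoRA implementation.

```python
# The three template fields shown above, copied into a dict.
TEMPLATE = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n### Instruction:\n{instruction}\n\n"
        "### Input:\n{input}\n\n### Response:\n"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n### Instruction:\n"
        "{instruction}\n\n### Response:\n"
    ),
    "response_split": "### Response:",
}


def generate_prompt(instruction, input_text=None, label=None):
    """Fill the template; append the expected answer when building training text."""
    if input_text:
        prompt = TEMPLATE["prompt_input"].format(instruction=instruction, input=input_text)
    else:
        prompt = TEMPLATE["prompt_no_input"].format(instruction=instruction)
    if label:
        prompt += label
    return prompt


def get_response(generated_text):
    """Return only the text that follows the response marker."""
    return generated_text.split(TEMPLATE["response_split"])[1].strip()


# Made-up example values.
text = generate_prompt("Say hello", "politely", "Good afternoon!")
print(text)
print(get_response(text))  # Good afternoon!
```

Keeping the blank lines and `###` markers out of the `instruction` and `input` values themselves means the template alone controls the layout of the final prompt.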

In [9]:
import pandas as pd  # used to unpickle the season DataFrames and to test for missing values

path = "../data/seasons/"
files = sorted(os.listdir(path))
prompts = []    # fully rendered prompt strings
json_data = []  # instruction/input/output records in the Alpaca JSON layout
for file in files:
    print(file)
    # Each pickle is expected to hold one season of pitch data as a DataFrame.
    with open(os.path.join(path, file), "rb") as pickle_in:
        every_pitch = pickle.load(pickle_in)
    # Reverse the row order (the pickles are assumed to be stored newest pitch first).
    every_pitch = every_pitch.iloc[::-1]
    for index, row in every_pitch.iterrows():
        # Keep only rows whose `events` field is a string, i.e. pitches with a recorded event.
        if isinstance(row.events, str):
            instruction = f"what is the outcome of pitcher {row.pitcher} pitching to batter {row.batter}"
            input_text = f"{row.inning_topbot} of the {row.inning} inning with {row.outs_when_up} outs"
            # Describe the pitch only when a release speed was recorded (NaN is skipped).
            if pd.notna(row.release_speed):
                pitch = f" a {int(row.release_speed)} miles per hour {row.pitch_name}"
            else:
                pitch = ""
            response = f"With {row.strikes} strikes and {row.balls} balls, {row.pitcher} throws{pitch} to {row.batter}. {row.des}"
            # Same layout as the "prompt_input" template, with the response appended.
            prompts.append(
                "Below is an instruction that describes a task, paired with an input that provides further context. "
                "Write a response that appropriately completes the request.\n\n"
                f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{response}\n"
            )
            json_data.append({
                "instruction": instruction,
                "input": input_text,
                "output": response
            })

with open('2011_2023_alpaca_text_encoded.json', 'w', encoding='utf-8') as f:
    json.dump(prompts, f, ensure_ascii=True, indent=4, allow_nan=True)

with open('2011_2023_alpaca_json.json', 'w', encoding='utf-8') as f:
    json.dump(json_data, f, ensure_ascii=True, indent=4, allow_nan=True)


season-2011.pickle
season-2012.pickle
season-2014.pickle
season-2015.pickle
season-2016.pickle
season-2018.pickle
season-2019.pickle
season-2020.pickle
season-2021.pickle
season-2022.pickle
season-2023.pickle
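
To show what a single encoded record looks like, here is a small self-contained sketch that applies the same instruction/input/output wording to one row. The `build_record` helper is hypothetical, and the row values are made up for illustration (they are not real Statcast data); a `SimpleNamespace` stands in for a DataFrame row so the example runs without the season pickles.

```python
import json
import math
from types import SimpleNamespace


def build_record(row):
    """Turn one pitch row into an Alpaca-style instruction/input/output record."""
    speed = row.release_speed
    # Skip the pitch description when the speed is missing (None or NaN).
    has_speed = isinstance(speed, float) and not math.isnan(speed)
    pitch = f" a {int(speed)} miles per hour {row.pitch_name}" if has_speed else ""
    return {
        "instruction": f"what is the outcome of pitcher {row.pitcher} pitching to batter {row.batter}",
        "input": f"{row.inning_topbot} of the {row.inning} inning with {row.outs_when_up} outs",
        "output": (
            f"With {row.strikes} strikes and {row.balls} balls, "
            f"{row.pitcher} throws{pitch} to {row.batter}. {row.des}"
        ),
    }


# Made-up row for illustration only.
row = SimpleNamespace(
    pitcher="Pitcher A", batter="Batter B",
    inning_topbot="Top", inning=3, outs_when_up=1,
    strikes=2, balls=1, release_speed=94.6, pitch_name="Slider",
    des="Batter B strikes out swinging.",
)
record = build_record(row)
print(json.dumps(record, indent=4))
```

Each record carries exactly the three keys the Alpaca JSON layout uses, so a list of such records can be written out with `json.dump` as in the cell above.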
