#### Neural next-step prediction | part 1: data
Tutorial on neural theorem proving\
Author: Sean Welleck

----------------

#### High-level goal

Our goal is to train a neural next-step prediction model, $p(y_t|x_t)$. Here $x_t$ is a _proof state_, and $y_t$ is a next-step. 

To do so, we will create a dataset $\mathcal{D}=\{(x_t,y_t)\}$ from human-written proofs. 

We can then train a next-step prediction model using a next-token prediction loss on the dataset.
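
As a sketch of what this loss looks like, suppose the next step $y_t$ is tokenized as $(y_t^{1},\ldots,y_t^{L_t})$ and the model has parameters $\theta$. These symbols are introduced here for illustration and are not used elsewhere in the tutorial. A standard next-token prediction loss is then:

\begin{align}
\mathcal{L}(\theta) = -\sum_{(x_t,y_t)\in\mathcal{D}}\ \sum_{i=1}^{L_t} \log p_\theta\left(y_t^{i} \mid x_t, y_t^{<i}\right),
\end{align}

where $y_t^{<i}$ denotes the tokens of $y_t$ that precede position $i$. Only the tokens of the next step contribute to the sum; the proof state $x_t$ is conditioning context.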

#### Simple example

To see what proof states and next-steps look like, let's look at an example human-written theorem and proof:



In [1]:
!cat ../ntp-training-data/Examples/Example0.lean

import Mathlib.Data.Nat.Prime

theorem test_thm (m n : Nat) (h : m.Coprime n) : m.gcd n = 1 := by
  rw [Nat.Coprime] at h
  exact h


We would like to transform this theorem and proof into a sequence of (proof_state, next_step) examples.

First, notice that the proof has two steps:

1. $y_1=$ `rw [Nat.Coprime] at h`
2. $y_2=$ `exact h`

We can inspect the proof states manually in VS Code.

For example, placing the cursor before $y_1$ gives us the proof state $x_1$ (shown as "Tactic state"):

<img src="images/proof_state_1.png" width=600px>


That is, the image above corresponds to $(x_1,y_1)$ defined as:

  $x_1$: 
  ```
    m n : ℕ
    h : Nat.Coprime m n
    ⊢ Nat.gcd m n = 1
  ```

  $y_1$: `rw [Nat.Coprime] at h`


Similarly, we can get the proof state $x_2$ prior to the step $y_2$ (`exact h`):


<img src="images/proof_state_2.png" width=600px>

After step $y_2$, the proof is complete: the proof state $x_3$ says we have "No goals":

<img src="images/proof_state_3.png" width=600px>


In summary, it is possible to *manually* transform the theorem and proof into a sequence $[(x_1,y_1),(x_2,y_2),(x_3)]$.

Formally, the steps $y_t$ are known as "tactics". As a result, we will often use "step" and "tactic" interchangeably.
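
To make the sequence concrete, here is a minimal sketch that writes the three states and two tactics of `Example0.lean` as plain Python data and pairs them up. The helper name `to_pairs` is hypothetical and is not part of `ntp-training-data`; the final "no goals" state has no tactic, so it does not produce a training pair.

```python
# States x_1, x_2, x_3 and tactics y_1, y_2 from Example0.lean.
states = [
    "m n : ℕ\nh : Nat.Coprime m n\n⊢ Nat.gcd m n = 1",
    "m n : ℕ\nh : Nat.gcd m n = 1\n⊢ Nat.gcd m n = 1",
    "no goals",
]
tactics = ["rw [Nat.Coprime] at h", "exact h"]


def to_pairs(states, tactics):
    """Pair each state with the tactic applied to it (hypothetical helper).

    The last state is the result of the last tactic, so it has no partner.
    """
    if len(states) != len(tactics) + 1:
        raise ValueError("expected exactly one more state than tactics")
    return list(zip(states, tactics))  # zip stops after the last tactic


pairs = to_pairs(states, tactics)
for t, (x, y) in enumerate(pairs, start=1):
    print(f"x_{t}:\n{x}\ny_{t}: {y}\n")
```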

---------------------

## Automatically extracting proof states and next-steps 

To scale up data collection, we need a way to *automatically* extract proof states and next-steps (tactics) from human-written proofs.

The [ntp-training-data](../ntp-training-data/) directory contains Lean code that automatically extracts proof states and tactics. 

It is a modified version of Scott Morrison's [lean-training-data](https://github.com/semorrison/lean-training-data).

#### 1. Transform a Lean file

The extraction is done by [training_data.lean](../ntp-training-data/scripts/training_data.lean), which conceptually implements:

$\quad f_{\text{extract}}(\text{lean file})\rightarrow \{(x,y,c)_i\}$,

where $\{(x,y,c)_i\}$ denotes a collection of (proof state, next-tactic) pairs along with additional context $c$.

We can run it on the [Example0.lean](../ntp-training-data/Examples/Example0.lean) file from above:

In [None]:
!cd ../ntp-training-data && lake exe training_data Examples/Example0.lean > ../notebooks/data/example0.jsonl
!cat data/example0.jsonl

This output is a `.jsonl` file where each line is an example of the form:
```json
{
   "state": "{tactic state}",
   "nextTactic" : "{pretty-printed next tactic}",
   "srcUpToTactic" : "{source code in the file up to the tactic invocation}",
   "declUpToTactic" : "{source code in the declaration up to the tactic invocation}",
   "decl": "{declaration without proof (e.g., statement of a theorem)}",
   "declId": "{unique identifier of the declaration}"
}
```
Here are the extracted (state, next-tactic) examples:

In [3]:
import json
with open('data/example0.jsonl') as f:
    examples = [json.loads(line) for line in f.readlines()]

for example in examples:
    print("=== State:", example['state'], '', sep='\n')
    print("--- Next tactic", example['nextTactic'], '', sep='\n')

=== State:
m n : ℕ
h : Nat.Coprime m n
⊢ Nat.gcd m n = 1

--- Next tactic
rw [Nat.Coprime] at h

=== State:
m n : ℕ
h : Nat.gcd m n = 1
⊢ Nat.gcd m n = 1

--- Next tactic
exact h



Notice that the proof states are the ones we saw above in VS Code.

We can also view additional extracted information, such as the source up to the tactic invocation:

In [4]:
for example in examples:
    print("==========", example['srcUpToTactic'], sep='\n')
    print("/- State:", example['state'], '-/', sep='\n')
    print("/- Next tactic -/", example['nextTactic'], '', sep='\n')

import Mathlib.Data.Nat.Prime

theorem test_thm (m n : Nat) (h : m.Coprime n) : m.gcd n = 1 := by
  
/- State:
m n : ℕ
h : Nat.Coprime m n
⊢ Nat.gcd m n = 1
-/
/- Next tactic -/
rw [Nat.Coprime] at h

import Mathlib.Data.Nat.Prime

theorem test_thm (m n : Nat) (h : m.Coprime n) : m.gcd n = 1 := by
  rw [Nat.Coprime] at h
  
/- State:
m n : ℕ
h : Nat.gcd m n = 1
⊢ Nat.gcd m n = 1
-/
/- Next tactic -/
exact h
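
Because each line is an independent JSON object, a loader can also check that every record carries the six fields listed earlier. The following is a minimal sketch, not code from `ntp-training-data`: the function name `read_examples` is hypothetical, and the sample record uses placeholder values for the fields we do not need.

```python
import io
import json

# The six keys described in the output format above.
REQUIRED_KEYS = {
    "state", "nextTactic", "srcUpToTactic", "declUpToTactic", "decl", "declId",
}


def read_examples(lines):
    """Parse JSONL lines, skipping blank lines and checking the expected keys."""
    examples = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise KeyError(f"line {line_number} is missing {sorted(missing)}")
        examples.append(record)
    return examples


# A hand-written record standing in for a file; "..." values are placeholders.
sample = io.StringIO(json.dumps({
    "state": "m n : ℕ\nh : Nat.Coprime m n\n⊢ Nat.gcd m n = 1",
    "nextTactic": "rw [Nat.Coprime] at h",
    "srcUpToTactic": "...",
    "declUpToTactic": "...",
    "decl": "...",
    "declId": "...",
}) + "\n\n")

examples = read_examples(sample)
print(len(examples), examples[0]["nextTactic"])
```

Passing an open file handle instead of the `StringIO` object works the same way, since both yield one line at a time.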



-----------------

## Scaling up data extraction

We can run the above script on a full repository:

$\quad f_{\text{extract}}(\text{lean repository})\rightarrow \mathcal{D}.$

Doing so involves calling the script and keeping track of files, which we prefer to do in a scripting language such as Python.

Thus `ntp-training-data` uses two Python scripts:
1. [extract_repos.py](../ntp-training-data/scripts/extract_repos.py), which reads a configuration file with repository information and, for each repository, calls the second script.
2. [run_pipeline.py](../ntp-training-data/scripts/run_pipeline.py), which runs `lake exe training_data` and keeps track of files.

#### Extracting 300k examples from Mathlib

We use these scripts to extract data from [Mathlib](https://github.com/leanprover-community/mathlib4), a large community-driven library of mathematics in Lean. 

To do so, we first specify Mathlib in [ntp-training-data/configs/config.json](../ntp-training-data/configs/config.json):

```json
    {
        "repo": "https://github.com/leanprover-community/mathlib4",
        "commit": "cf8e23a62939ed7cc530fbb68e83539730f32f86",
        "lean": "leanprover/lean4:v4.4.0",
        "name": "mathlib",
        "import_file": "Mathlib.lean",
        "imports": ["Mathlib"]
    }
```
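
Before starting a long extraction run, it can help to check that an entry has all of the fields shown above. This is a minimal sketch under one assumption: that the configuration file holds a JSON list of such entries. The actual top-level layout of `config.json` is not shown here, and `check_entry` is a hypothetical helper rather than part of the scripts.

```python
import json

# Assumption: the config file is a JSON list of entries like the one above.
config_text = """
[
    {
        "repo": "https://github.com/leanprover-community/mathlib4",
        "commit": "cf8e23a62939ed7cc530fbb68e83539730f32f86",
        "lean": "leanprover/lean4:v4.4.0",
        "name": "mathlib",
        "import_file": "Mathlib.lean",
        "imports": ["Mathlib"]
    }
]
"""

EXPECTED_FIELDS = ("repo", "commit", "lean", "name", "import_file", "imports")


def check_entry(entry):
    """Return the expected fields that an entry lacks (hypothetical helper)."""
    return [field for field in EXPECTED_FIELDS if field not in entry]


for entry in json.loads(config_text):
    missing = check_entry(entry)
    status = "ok" if not missing else f"missing {missing}"
    # Print the repository name with a shortened commit hash.
    print(f"{entry.get('name', '?')} @ {entry.get('commit', '?')[:8]}: {status}")
```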

Then we execute the [extract_repos.py](../ntp-training-data/scripts/extract_repos.py) script. 

On a MacBook Pro (M3 Max, 14 CPU cores), it takes around 2 hours.

In [None]:
!cd ../ntp-training-data && python scripts/extract_repos.py

In [3]:
# Number of files
!ls ../ntp-training-data/Examples/Mathlib/TacticPrediction/ | wc -l

    3759


#### Output data

The extracted data is in [ntp-training-data/Examples/Mathlib/TacticPrediction](../ntp-training-data/Examples/Mathlib/TacticPrediction).

It is organized by file. For instance, here are extracted examples for Mathlib's Algebra/AddTorsor.lean:

In [None]:
!head -n 2 ../ntp-training-data/Examples/Mathlib/TacticPrediction/Mathlib_Algebra_AddTorsor.jsonl
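
Since each output file is JSONL, the number of extracted examples per Mathlib file is just a count of non-empty lines. Here is a minimal sketch; `count_examples` is a hypothetical helper, and the demo runs on a temporary directory with made-up contents so that it does not depend on the extraction having finished.

```python
import tempfile
from pathlib import Path


def count_examples(directory):
    """Map each .jsonl file name in `directory` to its number of non-empty lines."""
    counts = {}
    for path in sorted(Path(directory).glob("*.jsonl")):
        with open(path, encoding="utf-8") as f:
            counts[path.name] = sum(1 for line in f if line.strip())
    return counts


# Demo with hypothetical file contents. In the tutorial layout the directory
# would be ../ntp-training-data/Examples/Mathlib/TacticPrediction/
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "Mathlib_Algebra_AddTorsor.jsonl"
    demo.write_text('{"state": "s1"}\n{"state": "s2"}\n', encoding="utf-8")
    counts = count_examples(tmp)

print(counts)
```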

#### Extracted data on Hugging Face

We provide extracted Mathlib data (i.e., the result of running the command above) on Hugging Face:
- [`l3lab/ntp-mathlib`](https://huggingface.co/datasets/l3lab/ntp-mathlib)

Additional repositories can be extracted by adding them to `configs/config.json`.

*If you use this data or code, we kindly ask that you cite this neural theorem proving tutorial*.

--------------
## Fine-tuning data

Finally, we would like to format the data so that we can fine-tune a language model with a standard fine-tuning script.

To this end, we format the data into (prompt, completion) examples using [ntp-training-data/scripts/instruction_tuning.py](../ntp-training-data/scripts/instruction_tuning.py).

Notationally, our formatted dataset is of the form:

\begin{align}
\mathcal{D}=\{(f_{\text{prompt}}(x_t), f_{\text{completion}}(y_t))\},
\end{align}

where  $f_{\text{prompt}}$ maps a state $x_t$ to a string, and $f_{\text{completion}}$ maps a next-tactic $y_t$ to a string.

Here is the prompt and completion created from the first (state, tactic) pair from our simple example above (`example0.jsonl`):

In [6]:
import sys, json
sys.path.append('../ntp-training-data/scripts/')

from instruction_tuning import prompt_state_tactic

# Use a context manager so the file is closed after reading.
with open('data/example0.jsonl') as f:
    examples = [json.loads(line) for line in f]

prompt, completion = prompt_state_tactic(examples[0])
print(f'=== Prompt:\n{prompt}', f'=== Completion:\n{completion}', sep='')

=== Prompt:
/- You are proving a theorem in Lean 4.
You are given the following information:
- The current proof state, inside [STATE]...[/STATE]

Your task is to generate the next tactic in the proof.
Put the next tactic inside [TAC]...[/TAC]
-/
[STATE]
m n : ℕ
h : Nat.Coprime m n
⊢ Nat.gcd m n = 1
[/STATE]
[TAC]
=== Completion:
rw [Nat.Coprime] at h
[/TAC]


Notice that we added a natural language description of the task into the prompt, commonly known as an "instruction" (and hence we call the script [instruction_tuning.py](../ntp-training-data/scripts/instruction_tuning.py)). In this case the instruction may not be strictly necessary, but including it suggests several other directions to experiment with (one of which we will see later in part 5). When we eventually incorporate our model into a tool (part 6), the instruction format will also allow us to swap in other off-the-shelf models that support instruction following.
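
To make $f_{\text{prompt}}$ and $f_{\text{completion}}$ concrete, here is a minimal sketch that reproduces the prompt and completion printed above. It is written from the printed output, not copied from `instruction_tuning.py`, so the names `format_prompt` and `format_completion` are hypothetical and the real script may be organized differently.

```python
# Instruction text as it appears in the printed prompt above.
INSTRUCTION = (
    "/- You are proving a theorem in Lean 4.\n"
    "You are given the following information:\n"
    "- The current proof state, inside [STATE]...[/STATE]\n"
    "\n"
    "Your task is to generate the next tactic in the proof.\n"
    "Put the next tactic inside [TAC]...[/TAC]\n"
    "-/\n"
)


def format_prompt(state):
    """Sketch of f_prompt: instruction, then the state in [STATE] tags, then an open [TAC]."""
    return f"{INSTRUCTION}[STATE]\n{state}\n[/STATE]\n[TAC]\n"


def format_completion(tactic):
    """Sketch of f_completion: the tactic followed by the closing [/TAC] tag."""
    return f"{tactic}\n[/TAC]"


state = "m n : ℕ\nh : Nat.Coprime m n\n⊢ Nat.gcd m n = 1"
tactic = "rw [Nat.Coprime] at h"

prompt = format_prompt(state)
completion = format_completion(tactic)
print(prompt + completion)
```

Ending the prompt with an open `[TAC]` tag and the completion with `[/TAC]` means a model trained on this format emits a natural stopping marker after a single tactic.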


Let's convert all of the extracted examples into (prompt, completion) data:

In [8]:
!cd ../ntp-training-data/ && python scripts/instruction_tuning.py --mathlib-only  # flag excludes e.g. Example0.lean

Examples/Mathlib
num_train_decls	59587
num_dev_decls	1568
num_test_decls	1568
num_dev_file_split_decls	0
num_test_file_split_decls	0
num_train	291262
num_dev	7735
num_test	8016
num_file_split_dev	0
num_file_split_test	0
instructions/state_tactic_mathlib_only
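
The counts above can be summarized with a little arithmetic. The sketch below copies the numbers from the script output and computes each split's share of the examples and the average number of tactic examples per declaration.

```python
# Counts copied from the script output above.
num_examples = {"train": 291262, "dev": 7735, "test": 8016}
num_decls = {"train": 59587, "dev": 1568, "test": 1568}

total_examples = sum(num_examples.values())
total_decls = sum(num_decls.values())

for split in num_examples:
    share = num_examples[split] / total_examples
    per_decl = num_examples[split] / num_decls[split]
    print(f"{split:5s} {share:6.1%} of examples, {per_decl:.2f} examples per declaration")

print("total examples:", total_examples)
print("total declarations:", total_decls)
```

The totals come to 307,013 examples across 62,723 declarations, which is consistent with the roughly 300k examples mentioned earlier.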


In [26]:
# Upload to huggingface (this will only work with authorized accounts, 
# but we leave this command here so you can use a similar pattern for your own projects)
# 
# !cd ../ntp-training-data/instructions/ && bash upload.sh

#### Data on Hugging Face

We provide formatted fine-tuning data on Hugging Face:

- [`l3lab/ntp-mathlib-instruct-st`](https://huggingface.co/datasets/l3lab/ntp-mathlib-instruct-st) (Mathlib)

Alternate formats can be produced using variants of [scripts/instruction_tuning.py](../ntp-training-data/scripts/instruction_tuning.py).

*If you use this data or code, we kindly ask that you cite this neural theorem proving tutorial*.

In [None]:
from datasets import load_dataset

dataset = load_dataset('l3lab/ntp-mathlib-instruct-st')

In [8]:
print(len(dataset['train']))
dataset['train'][0]

291262


{'task': 'tactic_predition',
 'prompt': '/- You are proving a theorem in Lean 4.\nYou are given the following information:\n- The current proof state, inside [STATE]...[/STATE]\n\nYour task is to generate the next tactic in the proof.\nPut the next tactic inside [TAC]...[/TAC]\n-/\n[STATE]\nJ : Type v\ninst✝¹ : SmallCategory J\ninst✝ : IsFiltered J\nF : J ⥤ GroupCat\nx y : (j : J) × ↑(F.obj j)\nh : Types.FilteredColimit.Rel (F ⋙ forget GroupCat) x y\n⊢ colimitInvAux F x = colimitInvAux F y\n[/STATE]\n[TAC]\n',
 'completion': 'apply G.mk_eq\n[/TAC]',
 'metadata': {'task': 'tactic_prediction',
  'project': 'Examples/Mathlib',
  'file': 'Examples/Mathlib/TacticPrediction/Mathlib_Algebra_Category_GroupCat_FilteredColimits.jsonl',
  'declId': 'Mathlib.Algebra.Category.GroupCat.FilteredColimits.83_0.OlIvs5vXvq1jJtZ',
  'target': 'apply G.mk_eq',
  'split': 'train'}}

In [9]:
print(dataset['train'][0]['prompt'])

/- You are proving a theorem in Lean 4.
You are given the following information:
- The current proof state, inside [STATE]...[/STATE]

Your task is to generate the next tactic in the proof.
Put the next tactic inside [TAC]...[/TAC]
-/
[STATE]
J : Type v
inst✝¹ : SmallCategory J
inst✝ : IsFiltered J
F : J ⥤ GroupCat
x y : (j : J) × ↑(F.obj j)
h : Types.FilteredColimit.Rel (F ⋙ forget GroupCat) x y
⊢ colimitInvAux F x = colimitInvAux F y
[/STATE]
[TAC]



In [10]:
print(dataset['train'][0]['completion'])

apply G.mk_eq
[/TAC]
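
When working with these examples, it is often convenient to recover the raw proof state and tactic from the formatted strings. The following is a minimal sketch that relies only on the `[STATE]`, `[/STATE]`, and `[/TAC]` markers visible above; the helper names `extract_state` and `extract_tactic` are hypothetical, and the prompt below is shortened for illustration.

```python
import re


def extract_state(prompt):
    """Return the text between the [STATE] and [/STATE] lines, or None if absent.

    Requiring a newline right after [STATE] skips the mention of
    "[STATE]...[/STATE]" inside the instruction text.
    """
    match = re.search(r"\[STATE\]\n(.*?)\n\[/STATE\]", prompt, flags=re.DOTALL)
    return match.group(1) if match else None


def extract_tactic(completion):
    """Return the tactic text that precedes the closing [/TAC] tag."""
    return completion.split("[/TAC]")[0].strip()


# Shortened prompt (instruction abbreviated) and a completion in the dataset's format.
prompt = (
    "/- ...\n- The current proof state, inside [STATE]...[/STATE]\n-/\n"
    "[STATE]\nm n : ℕ\n⊢ Nat.gcd m n = 1\n[/STATE]\n[TAC]\n"
)
completion = "apply G.mk_eq\n[/TAC]"

print(extract_state(prompt))
print(extract_tactic(completion))
```

The same `extract_tactic` logic could be applied to text generated by a model, where anything after `[/TAC]` should be discarded.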


#### Next steps

In part 2, we'll train a next-step generation model on the fine-tuning dataset.