Topic gpt experiments (#173)

* Added configuration file that initializes training with an existing model * Added more detailed info to README files
empirical-org · Jan 3, 2024 · 0efc8bd · 0efc8bd
1 parent 79c9ccc
commit 0efc8bd
Show file tree

Hide file tree

Showing 6 changed files with 201 additions and 109 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,24 @@
 # Quill NLP Tools and Datasets
 
-This is the respository for Quill's NLP experiments. Most importantly, it contains the code for creating data with synthetic grammar errors, and our investigation of large language models for student feedback.
+## Background: NLP at Quill
+
+At Quill, we want to help students become better writers. More specifically, we are developing automatic methods for identifying the argumentation students use in their texts, and for checking the grammatical correctness of sentences. To this goal, we are using Natural Language Processing (NLP), the subfield of Artificial Intelligence that deals with the automatic processing of text.
+
+### Grammar Correction
+
+In Quill's [Reading for evidence](https://www.quill.org/tools/evidence), students read a nonfiction text and are asked to support a series of claims with evidence sourced form the text. First we want to check their responses for grammatical correctness. We're focusing in particular on a range of grammar errors that we frequently see in students’ writings, such as confusion between *it’s* and *its*, between *than* and *then*, between a possessive form (*year’s*) and a plural form of the same word (*years*), and subject-verb agreement errors. Our goal is to automatically spot these errors, so that we can inform students about them and ask them to correct the error.
+
+To to this, we’re training machine learning models that automatically assign particular labels to words in a text. Training such a machine learning model for grammar correction is done by showing the computer thousands of example sentences where the grammar errors have already been labeled, and then evaluating to what degree the model is able to identify in sentences that have not been labeled yet.
+
+Unfortunately, we don’t have thousands of example sentences at hand where the errors have already been identified. To deal with this challenge, we mainly work with so-called synthetic data &mdash; sentences from sources like Wikipedia where we’ve automatically replaced a word by an incorrect alternative. For example, by replacing _it’s_ by _its_ in the sentence _it’s a sunny day_, we’ve automatically created a grammar error and we can tell our model what word in the sentence is incorrect.
+
+Since around 2012, neural networks are the standard model type for solving this type of task in NLP. In the last few years transformer models have emerged as the most popular type of neural network for language tasks. To train such models, we use spaCy, one of the most popular open-source NLP libraries.
+
+This repository contains our code for generating synthetic data with the types of grammatical errors that we're interested in, and for creating a spaCy transformer model to find these errors automatically.
+
+### Feedback
+
+Second, we are investigating generative AI models to help students develop strong argumentation skills. These models should give custom, targeted feedback to the arguments in students' responses, so that students strengthen their reading comprehension and hone their writing skills. This repository contains the code for our experiments with OpenAI's GPT models in particular. These experiments are focused on both prompt engineering, where we feed GPT a custom prompt with elaborate instructions and examples, and model finetuning, where we finetune a custom GPT model to give relevant feedback to students.
 
 ## Setup
 
@@ -23,16 +41,21 @@ python myScript
 deactivate
 ```
 
-## Grammar
+
+## Grammar Correction: Technical Details
+
+This repository contains the scripts for creating synthetic data and training a grammar model with spaCy.
+
+### Grammar
 
 Quill has developed a grammar pipeline that labels sentences with frequent grammar errors, such as subject-verb agreement errors and plural-possessive errors.
 The goal is to give students feedback on their writing, so that they can correct grammatical errors.
 This pipeline is a combination of simple rules and a machine-learning model. The machine-learning model is trained on a mix of real data from students and data with
 synthetic grammar errors. This repository has the code for creating such synthetic grammar errors and preparing a training corpus for spaCy.
 
-### Data
+#### Data
 
-#### Option 1: Get existing training data
+##### Option 1: Get existing training data
 
 All grammar errors in the grammar model that are identified with a machine-learning model already have synthetically generated data.
 This data is stored in a Google Cloud bucket and can be pulled with our DVC account:
@@ -43,11 +66,11 @@ This data is stored in a Google Cloud bucket and can be pulled with our DVC acco
 
 The training data will be downloaded to the `data/training` directory of this repository.
 
-#### Option 2: Generate synthetic data
+##### Option 2: Generate synthetic data
 
 Alternatively, it is possible to create new synthetic training data. Every grammar error has an `ErrorGenerator`
 that takes an input sentence and inserts a synthetic error in that sentence (if possible). For example, the `SubjectVerbAgreementWithSimpleNounErrorGenerator`
-takes a sentence and replaces the present verb by another verb form if the subject contains a simple noun.
+takes a sentence and replaces the present verb by another verb form if the subject contains a simple noun. The README file in the directory `scripts/quillgrammar` contains more detailed information about this synthetic data generation.
 
 The error generators can be run with the script `create_grammar_training_corpus.py`:
 
@@ -66,7 +89,7 @@ Add this training data to the directory `data/training` and upload it to the Goo
 > dvc push
 ```
 
-### SpaCy training corpus
+#### SpaCy training corpus
 
 We train our grammar model as a spaCy pipeline. As a result, we need to prepare a training and development corpus
 that spaCy can work with. This is done in the script `prepare_spacy_grammar_corpus`.
@@ -87,7 +110,7 @@ This script has the following output:
 - `<output_path>/test.spacy`: a test file that can be used for testing the grammar model after training
 - `<outputpath>/train/*.spacy`: one or more training files on which the grammar model will train
 
-### Training
+#### Training
 
 Now the grammar model can be trained with spaCy's standard training command:
 
@@ -98,20 +121,25 @@ spacy train config_distilbert.cfg --output output_path \
 --gpu-id 0
 ```
 
-## Large Language Models for student feedback
+With `config_distilbert.cfg` as a configuration file, this command trains a model from scratch with the training corpus in `paths.train`. With `config_distilbert_add.cfg` as a configuration file, spaCy will load an already trained model from the directory `quillgrammar/models/current`, and continue training on the data in `paths.train`. This is convenient if an
+existing model needs to be updated with some additional (e.g. manually labelled Quill) data. To create a training corpus
+with new Quill data, format the data like the other training files (see for example `data/training/quill_labels_20231101_train.ndjson`), and rerun the previous step (`prepare_spacy_training_corpus.py`) with only the new
+files in `grammar_files.csv`.
+
+## Large Language Models for Student Feedback: Technical Details
 
-Second, this corpus contains all data and scripts for our experiments with Large Language Models for student feedback.
+Second, this repository contains all data and scripts for our experiments with Large Language Models for student feedback.
 The goal of this task is to provide automatic feedback on the content of student responses.
 The files with examples of human feedback are in `data/automl`, organized by passage and prompt. The scripts are in `scripts/gpt`.
 
-## GPT scripts
+### GPT scripts
 
 There are several scripts for our experiments with GPT:
 - `finetune.py`: finetune a GPT model with Quill's feedback
 - `test_openai_for_feedback.py`: evaluate the output of a large language model against Quill's feedback
 - `moderate_feedbac.py`: moderate GPT feedback by an additional GPT step that removes undesired elements
 
-### Finetuning script
+#### Finetuning script
 
 First, this repo contains a script to finetune a GPT-3.5-turbo model with Quill's human feedback. This can be done with the script `finetune.py`:
 
@@ -121,7 +149,7 @@ First, this repo contains a script to finetune a GPT-3.5-turbo model with Quill'
 > python scripts/gpt/finetune.py <output_file>.json
 ```
 
-### Evaluation script
+#### Evaluation script
 
 Second, it is possible to evaluate GPT-3.5, GPT-4 or a finetuned GPT model by comparing their feedback to Quill's human feedback, using `test_openai_for_feedback.py`:
 
@@ -137,10 +165,9 @@ For example:
 > python scripts/test_openai_for_feedback.py gpt-3.5-turbo gpt3-5-turbo
 ```
 
-### Moderation script
+#### Moderation script
 
-The moderation script is a basic script that calls a GPT model to moderate automatic feedback. It takes Quill feedback as input, asks the
-GPT model to remove any undesired elements, and writes the output to a file. It is used in the following way:
+Finally, the moderation script is a basic script that calls a GPT model to moderate automatic feedback. This moderation step can be necessary when GPT gives feedback that does not focus on argumentation: comments on spelling or grammar, clarity or conciceness, or when GPT gives away the correct answer. The moderation script takes one or more pieces of feedback as input, asks the GPT model to remove any undesired elements, and writes the output to a file. It is used in the following way:
 
 ```
 > python scripts/moderate_feedback.py <gpt_model> <output_file> --verbose <True/False>

diff --git a/config_distilbert_add.cfg b/config_distilbert_add.cfg
@@ -0,0 +1,104 @@
+[paths]
+train = null
+dev = null
+vectors = null
+init_tok2vec = null
+
+[system]
+gpu_allocator = "pytorch"
+seed = 0
+
+[nlp]
+lang = "en"
+pipeline = ["transformer","ner"]
+tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+disabled = []
+before_creation = null
+after_creation = null
+after_pipeline_creation = null
+
+[components]
+
+[components.ner]
+source = "/home/yves/projects/grammar-api3/grammar-api/quillgrammar/models/current/"
+
+[components.transformer]
+source = "/home/yves/projects/grammar-api3/grammar-api/quillgrammar/models/current/"
+
+[corpora]
+
+[corpora.dev]
+@readers = "spacy.Corpus.v1"
+path = ${paths.dev}
+max_length = 30
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[corpora.train]
+@readers = "spacy.Corpus.v1"
+path = ${paths.train}
+max_length = 30
+gold_preproc = false
+limit = 2000000
+augmenter = null
+
+[training]
+accumulate_gradient = 3
+dev_corpus = "corpora.dev"
+train_corpus = "corpora.train"
+seed = ${system.seed}
+gpu_allocator = ${system.gpu_allocator}
+dropout = 0.1
+patience = 1000000
+max_epochs = 1
+max_steps = 200000
+eval_frequency = 200
+frozen_components = []
+before_to_disk = null
+
+[training.batcher]
+@batchers = "spacy.batch_by_padded.v1"
+discard_oversize = true
+size = 100
+buffer = 256
+get_length = null
+
+[training.logger]
+@loggers = "spacy.WandbLogger.v1"
+project_name = "quill-grammar"
+remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
+
+[training.optimizer]
+@optimizers = "Adam.v1"
+beta1 = 0.9
+beta2 = 0.999
+L2_is_weight_decay = true
+L2 = 0.01
+grad_clip = 1.0
+use_averages = false
+eps = 0.00000001
+
+[training.optimizer.learn_rate]
+@schedules = "warmup_linear.v1"
+warmup_steps = 100
+total_steps = 50000
+initial_rate = 0.000005
+
+[training.score_weights]
+ents_per_type = null
+ents_f = 1.0
+ents_p = 0.0
+ents_r = 0.0
+
+[pretraining]
+
+[initialize]
+vectors = ${paths.vectors}
+init_tok2vec = ${paths.init_tok2vec}
+vocab_data = null
+lookups = null
+
+[initialize.components]
+
+[initialize.tokenizer]
diff --git a/quillnlp/corpora/notw.py b/quillnlp/corpora/notw.py
@@ -155,8 +155,6 @@ def read_sentences(corpus_dir: str, max_sentences: int=None, is_complex: bool=Fa
                                         if max_sentences is not None and len(notw_sentences) > max_sentences:
                                             break
 
-                print(len(notw_sentences))
-
                 if max_sentences is not None and len(notw_sentences) > max_sentences:
                     break
 

diff --git a/scripts/get_complex_sentences.py b/scripts/get_complex_sentences.py
diff --git a/scripts/quillgrammar/README.md b/scripts/quillgrammar/README.md
@@ -0,0 +1,53 @@
+# How to create synthetic data for the grammar model
+
+General information about the grammar scripts can be found in the README file in the top directory. This README contains more information about our methods for creating synthetic data.
+
+## General
+
+Synthetic data can be generated with the script `create_grammar_training_corpus.py`.
+
+```bash
+> python scripts/quillgrammar/create_grammar_training_corpus.py <path_to_newsoftheworld_corpus>
+```
+
+The errors are created by taking sentences from US sources in the news of the world corpus, and injecting grammar errors into these sentences. By default, this injection is done with a probability of 50%. This means that:
+
+- We take an initial set of sentences from the News of the World corpus.
+- An error generator determines if a sentence is relevant for a particular error (e.g. it contains a commonly confused word) and injects an error by replacing the original word with an alternative word or word form.
+- Of all these relevant sentences, 50% are written to the output file in their original (correct) form, and 50% are written to the output file in their synthetic (incorrect) form. In this way, we ensure that the grammar model sees both correct and incorrect examples during its training process. The output file takes the name of the grammar error, with the extension `ndjson`.
+
+All output files that are used for training a spaCy model have to be added to `scripts/grammar/training_files.csv`.
+
+To change the error that is generated by the script, update the relevant error generator on line 56 of `scripts/quillgrammar/create_grammar_training_corpus.py`
+
+## Error generators
+
+Errors are created by so-called error generators. These can be found in `quillnlp.grammar.generation`, `quillnlp.grammar.fragments` (for fragment generators) and `quillnlp.grammar.verb` (for verb error generators).
+
+The general `ErrorGenerator` class is defined in `quillnlp.grammar.generation`. Crucially. it has a `generate_from_text` method, which takes a text and injects an error. All specific error generators inherit from this class.
+
+### Simple errors
+
+For simple errors, in which one word is confused with another, use the `TokenReplacementErrorGenerator` from `quillnlp.grammar.generation`. Upon creation, this error generator takes at least two arguments: a dictionary that describes which correct tokens should be replaced by which incorrect tokens, and the name of the grammar error. An optional third argument describes the probabilities of the replacement tokens.
+
+For example, the following code creates an error generator that replaces instances of *than* by *then*, instances of *then* by *than*, and labels the resulting error as `than_versus_then`.
+
+```python
+generator = TokenReplacementErrorGenerator({"then": ["than"], "than": ["then"]}, GrammarError.THAN_THEN.value)
+```
+
+Similarly, the code below creates an error generator that replaces instances of *too* by *two* or *to* and labels the resulting error as `to_vs_too_vs_two_too_optimal`. The list of probabilities ensures it uses *to* as a replacement 90% of the time (because this is the more frequent error), and *two* 10% of the time.
+
+```python
+generator = TokenReplacementErrorGenerator({'too': ['two', 'to']},
+                                            GrammarError.TO_TWO_TOO_TOO_OPTIMAL.value,
+											probs=[0.1, 0.9])
+```
+
+### Pronoun errors
+
+Pronoun error generators are created from the `PronounReplacementErrorGenerator`, defined in `quillnlp.grammar.generation`. This is slightly more advanced than the token replacement error generator. For example, it allows us to check the spaCy dependency role of words, in order to focus on subject/object/posessive/… pronouns. `quillnlp.grammar.generation` already has pronoun error generators for the most common pronoun errors.
+
+### Complex errors
+
+Other, more complex errors, have their own specific error generator, such as the `PassiveWithIncorrectBeErrorGenerator`. The `name` field of such a specific error generator specifies what grammar error they generate.