Topic gpt experiments (#173)
* Added configuration file that initializes training with an existing model

* Added more detailed info to README files
yvespeirsman committed Jan 3, 2024
1 parent 79c9ccc commit 0efc8bd
Showing 6 changed files with 201 additions and 109 deletions.
59 changes: 43 additions & 16 deletions README.md
@@ -1,6 +1,24 @@
# Quill NLP Tools and Datasets

This is the repository for Quill's NLP experiments. Most importantly, it contains the code for creating data with synthetic grammar errors, and our investigation of large language models for student feedback.
## Background: NLP at Quill

At Quill, we want to help students become better writers. More specifically, we are developing automatic methods for identifying the argumentation students use in their texts, and for checking the grammatical correctness of sentences. To this end, we are using Natural Language Processing (NLP), the subfield of Artificial Intelligence that deals with the automatic processing of text.

### Grammar Correction

In Quill's [Reading for evidence](https://www.quill.org/tools/evidence), students read a nonfiction text and are asked to support a series of claims with evidence sourced from the text. First we want to check their responses for grammatical correctness. We're focusing in particular on a range of grammar errors that we frequently see in students’ writings, such as confusion between *it’s* and *its*, between *than* and *then*, between a possessive form (*year’s*) and a plural form of the same word (*years*), and subject-verb agreement errors. Our goal is to automatically spot these errors, so that we can inform students about them and ask them to correct the error.

To do this, we’re training machine learning models that automatically assign particular labels to words in a text. Training such a machine learning model for grammar correction is done by showing the computer thousands of example sentences where the grammar errors have already been labeled, and then evaluating to what degree the model is able to identify errors in sentences that have not been labeled yet.

Unfortunately, we don’t have thousands of example sentences at hand where the errors have already been identified. To deal with this challenge, we mainly work with so-called synthetic data — sentences from sources like Wikipedia where we’ve automatically replaced a word by an incorrect alternative. For example, by replacing _it’s_ by _its_ in the sentence _it’s a sunny day_, we’ve automatically created a grammar error and we can tell our model what word in the sentence is incorrect.
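
The idea can be illustrated with a minimal sketch. It assumes whitespace tokenization, and the helper name `inject_error` and its return format are hypothetical; they are not taken from this repository, whose actual error generators are described further down.

```python
from typing import Optional, Tuple


def inject_error(sentence: str, correct: str, incorrect: str) -> Optional[Tuple[str, int, int]]:
    """Replace the first whole-token occurrence of `correct` with `incorrect`.

    Returns the corrupted sentence plus the character span (start, end) of the
    injected error, or None if the sentence does not contain the target token.
    This is an illustrative stand-in, not the repository's implementation.
    """
    tokens = sentence.split(" ")
    offset = 0  # character position of the current token
    for i, token in enumerate(tokens):
        if token.lower() == correct.lower():
            # Preserve sentence-initial capitalization of the original token.
            replacement = incorrect.capitalize() if token[0].isupper() else incorrect
            tokens[i] = replacement
            return " ".join(tokens), offset, offset + len(replacement)
        offset += len(token) + 1  # +1 for the space that followed the token
    return None


print(inject_error("it's a sunny day", "it's", "its"))
# → ('its a sunny day', 0, 3)
```

Because the replacement position is known at generation time, the label for the incorrect word comes for free, which is what makes synthetic data cheap to produce.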

Since around 2012, neural networks have been the standard model type for solving this type of task in NLP. In the last few years, transformer models have emerged as the most popular type of neural network for language tasks. To train such models, we use spaCy, one of the most popular open-source NLP libraries.

This repository contains our code for generating synthetic data with the types of grammatical errors that we're interested in, and for creating a spaCy transformer model to find these errors automatically.

### Feedback

Second, we are investigating generative AI models to help students develop strong argumentation skills. These models should give custom, targeted feedback to the arguments in students' responses, so that students strengthen their reading comprehension and hone their writing skills. This repository contains the code for our experiments with OpenAI's GPT models in particular. These experiments are focused on both prompt engineering, where we feed GPT a custom prompt with elaborate instructions and examples, and model finetuning, where we finetune a custom GPT model to give relevant feedback to students.

## Setup

@@ -23,16 +41,21 @@ python myScript
deactivate
```

## Grammar

## Grammar Correction: Technical Details

This repository contains the scripts for creating synthetic data and training a grammar model with spaCy.

### Grammar

Quill has developed a grammar pipeline that labels sentences with frequent grammar errors, such as subject-verb agreement errors and plural-possessive errors.
The goal is to give students feedback on their writing, so that they can correct grammatical errors.
This pipeline is a combination of simple rules and a machine-learning model. The machine-learning model is trained on a mix of real data from students and data with
synthetic grammar errors. This repository has the code for creating such synthetic grammar errors and preparing a training corpus for spaCy.

### Data
#### Data

#### Option 1: Get existing training data
##### Option 1: Get existing training data

All grammar errors in the grammar model that are identified with a machine-learning model already have synthetically generated data.
This data is stored in a Google Cloud bucket and can be pulled with our DVC account:
@@ -43,11 +66,11 @@ This data is stored in a Google Cloud bucket and can be pulled with our DVC acco

The training data will be downloaded to the `data/training` directory of this repository.

#### Option 2: Generate synthetic data
##### Option 2: Generate synthetic data

Alternatively, it is possible to create new synthetic training data. Every grammar error has an `ErrorGenerator`
that takes an input sentence and inserts a synthetic error in that sentence (if possible). For example, the `SubjectVerbAgreementWithSimpleNounErrorGenerator`
takes a sentence and replaces the present verb by another verb form if the subject contains a simple noun.
takes a sentence and replaces the present verb by another verb form if the subject contains a simple noun. The README file in the directory `scripts/quillgrammar` contains more detailed information about this synthetic data generation.

The error generators can be run with the script `create_grammar_training_corpus.py`:

@@ -66,7 +89,7 @@ Add this training data to the directory `data/training` and upload it to the Goo
> dvc push
```

### SpaCy training corpus
#### SpaCy training corpus

We train our grammar model as a spaCy pipeline. As a result, we need to prepare a training and development corpus
that spaCy can work with. This is done in the script `prepare_spacy_grammar_corpus`.
@@ -87,7 +110,7 @@ This script has the following output:
- `<output_path>/test.spacy`: a test file that can be used for testing the grammar model after training
- `<output_path>/train/*.spacy`: one or more training files on which the grammar model will train

### Training
#### Training

Now the grammar model can be trained with spaCy's standard training command:

@@ -98,20 +121,25 @@ spacy train config_distilbert.cfg --output output_path \
--gpu-id 0
```

## Large Language Models for student feedback
With `config_distilbert.cfg` as a configuration file, this command trains a model from scratch with the training corpus in `paths.train`. With `config_distilbert_add.cfg` as a configuration file, spaCy will load an already trained model from the directory `quillgrammar/models/current`, and continue training on the data in `paths.train`. This is convenient if an
existing model needs to be updated with some additional (e.g. manually labelled Quill) data. To create a training corpus
with new Quill data, format the data like the other training files (see for example `data/training/quill_labels_20231101_train.ndjson`), and rerun the previous step (`prepare_spacy_training_corpus.py`) with only the new
files in `grammar_files.csv`.

## Large Language Models for Student Feedback: Technical Details

Second, this corpus contains all data and scripts for our experiments with Large Language Models for student feedback.
Second, this repository contains all data and scripts for our experiments with Large Language Models for student feedback.
The goal of this task is to provide automatic feedback on the content of student responses.
The files with examples of human feedback are in `data/automl`, organized by passage and prompt. The scripts are in `scripts/gpt`.

## GPT scripts
### GPT scripts

There are several scripts for our experiments with GPT:
- `finetune.py`: finetune a GPT model with Quill's feedback
- `test_openai_for_feedback.py`: evaluate the output of a large language model against Quill's feedback
- `moderate_feedback.py`: moderate GPT feedback by an additional GPT step that removes undesired elements

### Finetuning script
#### Finetuning script

First, this repo contains a script to finetune a GPT-3.5-turbo model with Quill's human feedback. This can be done with the script `finetune.py`:

@@ -121,7 +149,7 @@ First, this repo contains a script to finetune a GPT-3.5-turbo model with Quill'
> python scripts/gpt/finetune.py <output_file>.json
```

### Evaluation script
#### Evaluation script

Second, it is possible to evaluate GPT-3.5, GPT-4 or a finetuned GPT model by comparing their feedback to Quill's human feedback, using `test_openai_for_feedback.py`:

@@ -137,10 +165,9 @@ For example:
> python scripts/test_openai_for_feedback.py gpt-3.5-turbo gpt3-5-turbo
```

### Moderation script
#### Moderation script

The moderation script is a basic script that calls a GPT model to moderate automatic feedback. It takes Quill feedback as input, asks the
GPT model to remove any undesired elements, and writes the output to a file. It is used in the following way:
Finally, the moderation script is a basic script that calls a GPT model to moderate automatic feedback. This moderation step can be necessary when GPT gives feedback that does not focus on argumentation (for example, comments on spelling, grammar, clarity or conciseness), or when GPT gives away the correct answer. The moderation script takes one or more pieces of feedback as input, asks the GPT model to remove any undesired elements, and writes the output to a file. It is used in the following way:

```
> python scripts/moderate_feedback.py <gpt_model> <output_file> --verbose <True/False>
104 changes: 104 additions & 0 deletions config_distilbert_add.cfg
@@ -0,0 +1,104 @@
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null

[components]

[components.ner]
source = "/home/yves/projects/grammar-api3/grammar-api/quillgrammar/models/current/"

[components.transformer]
source = "/home/yves/projects/grammar-api3/grammar-api/quillgrammar/models/current/"

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 30
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 30
gold_preproc = false
limit = 2000000
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1000000
max_epochs = 1
max_steps = 200000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 100
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.WandbLogger.v1"
project_name = "quill-grammar"
remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 100
total_steps = 50000
initial_rate = 0.000005

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null

[initialize.components]

[initialize.tokenizer]
2 changes: 0 additions & 2 deletions quillnlp/corpora/notw.py
@@ -155,8 +155,6 @@ def read_sentences(corpus_dir: str, max_sentences: int=None, is_complex: bool=Fa
if max_sentences is not None and len(notw_sentences) > max_sentences:
break

print(len(notw_sentences))

if max_sentences is not None and len(notw_sentences) > max_sentences:
break

88 changes: 0 additions & 88 deletions scripts/get_complex_sentences.py

This file was deleted.

53 changes: 53 additions & 0 deletions scripts/quillgrammar/README.md
@@ -0,0 +1,53 @@
# How to create synthetic data for the grammar model

General information about the grammar scripts can be found in the README file in the top directory. This README contains more information about our methods for creating synthetic data.

## General

Synthetic data can be generated with the script `create_grammar_training_corpus.py`.

```bash
> python scripts/quillgrammar/create_grammar_training_corpus.py <path_to_newsoftheworld_corpus>
```

The errors are created by taking sentences from US sources in the News of the World corpus, and injecting grammar errors into these sentences. By default, this injection is done with a probability of 50%. This means that:

- We take an initial set of sentences from the News of the World corpus.
- An error generator determines if a sentence is relevant for a particular error (e.g. it contains a commonly confused word) and injects an error by replacing the original word with an alternative word or word form.
- Of all these relevant sentences, 50% are written to the output file in their original (correct) form, and 50% are written to the output file in their synthetic (incorrect) form. In this way, we ensure that the grammar model sees both correct and incorrect examples during its training process. The output file takes the name of the grammar error, with the extension `ndjson`.
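
The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the function `build_examples`, the `inject` callback, and the `text`/`label` field names in the output lines are hypothetical and do not necessarily match the format written by `create_grammar_training_corpus.py`.

```python
import json
import random
from typing import Callable, Iterable, Iterator, Optional


def build_examples(
    sentences: Iterable[str],
    inject: Callable[[str], Optional[str]],
    error_name: str,
    probability: float = 0.5,
    seed: int = 0,
) -> Iterator[str]:
    """Yield one ndjson line per relevant sentence.

    `inject` returns the corrupted sentence, or None when the sentence is not
    relevant for this error type. Relevant sentences are emitted in corrupted
    form with the given probability and in their original form otherwise.
    """
    rng = random.Random(seed)  # seeded so that a corpus can be regenerated
    for sentence in sentences:
        corrupted = inject(sentence)
        if corrupted is None:
            continue  # sentence is not relevant for this error type
        if rng.random() < probability:
            yield json.dumps({"text": corrupted, "label": error_name})
        else:
            yield json.dumps({"text": sentence, "label": None})
```

Each yielded string is one line of an `ndjson` file, so the output can be written line by line without holding the whole corpus in memory.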

All output files that are used for training a spaCy model have to be added to `scripts/grammar/training_files.csv`.

To change the error that is generated by the script, update the relevant error generator on line 56 of `scripts/quillgrammar/create_grammar_training_corpus.py`.

## Error generators

Errors are created by so-called error generators. These can be found in `quillnlp.grammar.generation`, `quillnlp.grammar.fragments` (for fragment generators) and `quillnlp.grammar.verb` (for verb error generators).

The general `ErrorGenerator` class is defined in `quillnlp.grammar.generation`. Crucially, it has a `generate_from_text` method, which takes a text and injects an error. All specific error generators inherit from this class.

### Simple errors

For simple errors, in which one word is confused with another, use the `TokenReplacementErrorGenerator` from `quillnlp.grammar.generation`. Upon creation, this error generator takes at least two arguments: a dictionary that describes which correct tokens should be replaced by which incorrect tokens, and the name of the grammar error. An optional third argument describes the probabilities of the replacement tokens.

For example, the following code creates an error generator that replaces instances of *than* by *then*, instances of *then* by *than*, and labels the resulting error as `than_versus_then`.

```python
generator = TokenReplacementErrorGenerator({"then": ["than"], "than": ["then"]}, GrammarError.THAN_THEN.value)
```

Similarly, the code below creates an error generator that replaces instances of *too* by *two* or *to* and labels the resulting error as `to_vs_too_vs_two_too_optimal`. The list of probabilities ensures it uses *to* as a replacement 90% of the time (because this is the more frequent error), and *two* 10% of the time.

```python
generator = TokenReplacementErrorGenerator({'too': ['two', 'to']},
                                           GrammarError.TO_TWO_TOO_TOO_OPTIMAL.value,
                                           probs=[0.1, 0.9])
```
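
To make the role of the probabilities concrete, here is a minimal standalone sketch of weighted replacement using only the standard library. The function `pick_replacement` is hypothetical and is not how `TokenReplacementErrorGenerator` is necessarily implemented; it only illustrates how a list of weights aligned with the list of replacement tokens can drive the choice.

```python
import random
from typing import Dict, List, Optional, Sequence


def pick_replacement(
    token: str,
    replacements: Dict[str, List[str]],
    probs: Optional[Sequence[float]] = None,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick an incorrect alternative for `token`, or None if it has none.

    `probs` is aligned with the list of alternatives; when it is None,
    every alternative is equally likely.
    """
    candidates = replacements.get(token.lower())
    if not candidates:
        return None
    chooser = rng or random  # fall back to the module-level generator
    return chooser.choices(candidates, weights=probs, k=1)[0]


# With weights [0.1, 0.9], "to" should be chosen roughly nine times out of ten.
rng = random.Random(0)
draws = [pick_replacement("too", {"too": ["two", "to"]}, [0.1, 0.9], rng) for _ in range(1000)]
print(draws.count("to") / len(draws))
```

Keeping the weights in a separate list, in the same order as the replacement tokens, mirrors the `probs` argument shown above.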

### Pronoun errors

Pronoun error generators are created from the `PronounReplacementErrorGenerator`, defined in `quillnlp.grammar.generation`. This is slightly more advanced than the token replacement error generator. For example, it allows us to check the spaCy dependency role of words, in order to focus on subject/object/possessive/… pronouns. `quillnlp.grammar.generation` already has pronoun error generators for the most common pronoun errors.

### Complex errors

Other, more complex errors have their own specific error generators, such as the `PassiveWithIncorrectBeErrorGenerator`. The `name` field of such a specific error generator specifies which grammar error it generates.