Merge pull request #163 from empirical-org/topic-gpt-experiments
Topic gpt experiments
happythenewsad committed Nov 17, 2023
2 parents 15405cd + 9fe192e commit 79c9ccc
Showing 229 changed files with 35,115 additions and 26,928 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -21,4 +21,5 @@ data/fragments
db
nohup.out
env/*
env*
.idea/*
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.11.6
195 changes: 103 additions & 92 deletions README.md
@@ -1,142 +1,153 @@
# Quill NLP Tools and Datasets
Notebooks, scrapers, corpora, and utilities built and maintained by Quill.org.

## About the Repo
This repo contains all of our data for Quill.org's machine learning models. This includes both grammar models that will be used across multiple products, and the algorithms for Quill Comprehension, a product that builds critical thinking skills. Quill Comprehension uses a topic classification algorithm to identify the main pieces of evidence in a student's writing in order to serve feedback that pushes the student to use more precise evidence.
This is the repository for Quill's NLP experiments. Most importantly, it contains the code for creating data with synthetic grammar errors, and our investigation of large language models for student feedback.

## Setup

## Quill Comprehension's Grading Logic
To understand Quill Comprehension's grading process, see the document linked below. To process this data, Quill first uses a script, built on AllenNLP, that extracts features from the student's writing for both the data-labelling process and the machine learning models. The document explains what the script does and why each step is necessary. [Find the document here.](https://www.notion.so/Quill-Comprehension-Grading-Logic-395e3ba566484790a9187ddeb7cdfc6a#e34312ec6830435ba5e1c5b70737898e)
All scripts have been tested with Python 3.11.6 and pip 23.2.1.

Different scripts in this repo rely on different pip packages. We currently use Python's `venv` standard-library module to manage dependencies.

## Structure
Here's how to set up the two virtual envs we currently use:

```bash
.
├── data # data we use for our experiments
├── interim # preprocessed data
├── raw # original, unprocessed data
└── validated # validated gold standard data for evaluation
├── demo # D3 visualization that demonstrates NLP capabilities
├── experiments # the json configuration files for our experiments
├── genmodel
├── models # saved models for classification and other NLP tasks
├── notebooks # Jupyter notebooks for data exploration & simple experiments
├── quillnlp # the main package with the NLP code, including the dataset readers,
# models and predictors for AllenNLP
├── scrapers # data collection tools
├── scripts # scripts for data processing, etc.
├── tests # unit and more high-level tests
├── utils # useful tools and scripts including document parsing
├── LICENSE
├── README.md # this file
└── __init__.py
```

## Tell version control how to handle ipynb files
```shell
python -m venv env-grammar
python -m venv env-gpt
```

```bash
$ # ensure you are in the top level of the project before running these commands
$
$ source activate <YOUR CONDA ENV>
$ conda install -c conda-forge nbstripout
$ nbstripout --install
$ nbstripout --install --attributes .gitattributes
```
Here's how to use a virtualenv in the context of running a script:

```shell
source env-myEnvName/bin/activate
python myScript.py
deactivate
```

Running the above commands will ensure generated output from the notebooks is
not versioned, but that regular code changes will still be reflected.
## Grammar

Quill has developed a grammar pipeline that labels sentences with frequent grammar errors, such as subject-verb agreement errors and plural-possessive errors.
The goal is to give students feedback on their writing, so that they can correct grammatical errors.
This pipeline is a combination of simple rules and a machine-learning model. The machine-learning model is trained on a mix of real data from students and data with
synthetic grammar errors. This repository has the code for creating such synthetic grammar errors and preparing a training corpus for spaCy.

Note: this means that switching branches could mean changes to notebook state.
Be aware of this and don't be alarmed.
### Data

## Experiments how-to
#### Option 1: Get existing training data

### Set up
All grammar errors that the grammar model identifies with a machine-learning component already have synthetically generated training data.
This data is stored in a Google Cloud bucket and can be pulled with our DVC account:

#### Run the install script
```
sh bootstrap.sh
> dvc pull
```
This will install Python and all of the required dependencies, mostly within a virtual environment. The script should be idempotent: it can be run multiple times without breaking your environment (it will update your dependencies, though).

### Experiments
The training data will be downloaded to the `data/training` directory of this repository.

Experiments follow the general pattern:
#### Option 2: Generate synthetic data

1. Start Virtual Environment.
2. **Run Experiments/Training.**
3. Close Virtual Environment.
Alternatively, it is possible to create new synthetic training data. Every grammar error has an `ErrorGenerator`
that takes an input sentence and inserts a synthetic error in that sentence (if possible). For example, the `SubjectVerbAgreementWithSimpleNounErrorGenerator`
takes a sentence and replaces the present verb by another verb form if the subject contains a simple noun.
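
To make this concrete, here is a heavily simplified sketch of the idea. The real generators live in the `quillnlp` package and have a richer interface; the class, method name, and verb-swapping rule below are illustrative only.

```python
# Illustrative sketch of an error generator; not the actual quillnlp implementation.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")


class ToySubjectVerbAgreementErrorGenerator:
    def generate_from_text(self, sentence: str):
        """Return (synthetic_sentence, error_spans), or (sentence, []) if no error fits."""
        doc = nlp(sentence)
        for token in doc:
            # Look for a present-tense verb whose subject is a plain noun.
            if token.tag_ in ("VBZ", "VBP"):
                has_simple_noun_subject = any(
                    child.dep_ == "nsubj" and child.pos_ == "NOUN" for child in token.children
                )
                if has_simple_noun_subject:
                    # Naive form swap: singular <-> plural (the real generators use
                    # proper verb inflection, not string surgery).
                    wrong = token.text[:-1] if token.tag_ == "VBZ" else token.text + "s"
                    start = token.idx
                    synthetic = sentence[:start] + wrong + sentence[start + len(token.text):]
                    return synthetic, [(start, start + len(wrong), "Subject-verb agreement")]
        return sentence, []


generator = ToySubjectVerbAgreementErrorGenerator()
print(generator.generate_from_text("The dog runs in the park."))
# ('The dog run in the park.', [(8, 11, 'Subject-verb agreement')])
```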

Start a virtual environment with:
The error generators can be run with the script `create_grammar_training_corpus.py`:

```bash
> export PYTHONPATH=.
> python scripts/quillgrammar/create_grammar_training_corpus.py \
path_to_newsoftheworld_corpus
```
```
source env/bin/activate
```

Close it with:
It will generate a synthetic training file for each of the error generators called in the script.

Add this training data to the directory `data/training` and upload it to the Google Cloud with

```
deactivate
> dvc commit
> dvc push
```

**Note: if you are running multiple experiments, you can activate the environment once, do all of your work, and then deactivate it.**
### SpaCy training corpus

#### Preparing Data
We train our grammar model as a spaCy pipeline. As a result, we need to prepare a training and development corpus
that spaCy can work with. This is done with the script `prepare_spaCy_grammar_corpus.py`,
which takes as its only argument the directory to which the corpus files will be written:

1. Put all labelled data in a file. This should be a tab-separated file
with two columns: the first column contains the sentence (prompt and response),
and the second column contains the label. Save this file in the directory `data/raw`.
```bash
> export PYTHONPATH=.
> python scripts/quillgrammar/prepare_spaCy_grammar_corpus.py output_path
```

2. Process the file with the script `create_train_and_test_data`:
The list of synthetic error files that will be used for the corpus can be adapted in `scripts/quillgrammar/grammar_files.csv`.
This csv file contains a list of the error files that will be used, together with the number of training items and the number
of dev/test items that will be taken from the file. The more difficult the error, the more training (and dev/test) items we
collect.

This script has the following output:
- `<output_path>/dev.spacy`: a development file on which the grammar model will be tested repeatedly during training
- `<output_path>/test.spacy`: a test file that can be used for testing the grammar model after training
- `<output_path>/train/*.spacy`: one or more training files on which the grammar model will train
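
The generated `.spacy` files are standard spaCy `DocBin` archives, so they can be inspected directly. A minimal sketch (the path is a placeholder, and whether the annotations show up as span groups or as categories depends on how the corpus was built):

```python
# Quick sanity check of a generated corpus file.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("output_path/dev.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))

print(f"{len(docs)} documents")
doc = docs[0]
print(doc.text)
print(doc.spans)  # grammar-error spans, if stored as span groups
print(doc.cats)   # or document-level labels, depending on the corpus format
```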

### Training

Now the grammar model can be trained with spaCy's standard training command:

From the directory root:
```
source env/bin/activate
```
```
spacy train config_distilbert.cfg --output output_path \
--paths.train <prepare_spacy_grammar_corpus.py's output_path>/train \
--paths.dev <prepare_spacy_grammar_corpus.py's output_path>/dev.spacy \
--gpu-id 0
```

```
python3 scripts/create_train_and_test_data.py --input_file data/raw/example.tsv
```
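
After training, spaCy writes the best checkpoint to `output_path/model-best`, which can be loaded like any other pipeline. A minimal usage sketch (which attributes carry the predictions depends on how the pipeline's components are configured):

```python
# Sketch: load the trained grammar pipeline and inspect its predictions.
import spacy

nlp = spacy.load("output_path/model-best")  # spaCy's default best-model directory
doc = nlp("The dog run in the park.")

print(doc.spans)  # predicted error spans, if the pipeline uses a span categorizer
print(doc.cats)   # sentence-level labels, if it uses a text classifier
```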
## Large Language Models for student feedback

This will create three ndjson files in the `data/interim` directory: a train file
with the training data, a dev file with the development data and a test file with
the test data.
Second, this corpus contains all data and scripts for our experiments with Large Language Models for student feedback.
The goal of this task is to provide automatic feedback on the content of student responses.
The files with examples of human feedback are in `data/automl`, organized by passage and prompt. The scripts are in `scripts/gpt`.

#### Run the baseline experiments:
## GPT scripts

```python3 scripts/train_baseline.py --train data/interim/example_train.ndjson --test data/interim/example_test.ndjson```
There are several scripts for our experiments with GPT:
- `finetune.py`: finetune a GPT model with Quill's feedback
- `test_openai_for_feedback.py`: evaluate the output of a large language model against Quill's feedback
- `moderate_feedback.py`: moderate GPT feedback with an additional GPT step that removes undesired elements

This will train a simple classifier. After evaluation, it prints out the overall
accuracy and the performance per label.
### Finetuning script

#### Run the AllenNLP experiments.
This repo contains a script, `finetune.py`, that finetunes a GPT-3.5-turbo model on Quill's human feedback:

Download the GloVe 6B 300d data set **(800 MB)** from this [website](https://nlp.stanford.edu/projects/glove/)
```
> pip install -r requirements-gpt.txt
> export OPENAI_API_KEY=<YOUR_KEY>
> python scripts/gpt/finetune.py <output_file>.json
```
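
For reference, OpenAI's chat fine-tuning endpoint expects a JSONL file in which each line is a complete conversation. A record assembled from Quill's prompt/feedback pairs might look roughly like the sketch below; the field contents are made up, and the exact prompt structure used by `finetune.py` may differ.

```python
# Illustrative only: one training record in the chat fine-tuning JSONL format.
import json

example = {
    "messages": [
        {"role": "system", "content": "You give feedback on a student's sentence."},
        {"role": "user", "content": "Prompt: Schools should serve healthier lunches because... "
                                    "Response: it tastes good"},
        {"role": "assistant", "content": "Re-read the passage: what health benefits does the "
                                         "author mention?"},
    ]
}

with open("finetuning_data.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```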

Here is the direct [800 MB download link](http://nlp.stanford.edu/data/glove.6B.zip)
### Evaluation script

Create a configuration file in the `experiments` directory. Start from
`example.json`, where you fill in the paths to your train, dev (validation)
and test files. If your machine does not have a GPU, set `cuda_device` (towards
the bottom) to `-1`. Otherwise, set it to 0. Since our experiments are small,
they can be run without a GPU. Also, update the `example.json` to point to the glove data set on your laptop.
It is also possible to evaluate GPT-3.5, GPT-4, or a finetuned GPT model by comparing its feedback to Quill's human feedback, using `test_openai_for_feedback.py`:

#### Train an AllenNLP model:
```
> pip install -r requirements-gpt.txt
> export OPENAI_API_KEY=<YOUR_KEY>
> python scripts/gpt/test_openai_for_feedback.py <model> <tag_for_output_file>
```

```allennlp train experiments/example.json -s /tmp/example --include-package quillnlp```
For example:

Evaluate the AllenNLP model. We have our own script for this,
`evaluate_topic_classification`, which takes as first argument the test file,
and as second argument the directory where the model was saved:
```
> python scripts/gpt/test_openai_for_feedback.py gpt-3.5-turbo gpt3-5-turbo
```

```python3 -m scripts.evaluate_topic_classification data/interim/example_test.ndjson /tmp/example/```
### Moderation script

#### Run the Google Sentence Encoder scripts:
The moderation script calls a GPT model to moderate automatic feedback: it takes Quill feedback as input, asks the
GPT model to remove any undesired elements, and writes the output to a file. It is used in the following way:

```python3 scripts/sentence_encoder_tests.py --train data/interim/example_train.ndjson --dev data/interim/example_dev.ndjson --test data/interim/example_test.ndjson --out /tmp/classifier```
```
> python scripts/moderate_feedback.py <gpt_model> <output_file> --verbose <True/False>
```

#### Deactivate the virtual environment:
For example:

```deactivate```
```
> python scripts/moderate_feedback.py gpt-4 feedback_output.csv --verbose False
```
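
Under the hood, moderation of this kind is essentially a single chat-completion call per feedback item. Below is a rough sketch of the idea using the `openai` 1.x Python client; the actual prompt and client wiring in `moderate_feedback.py` may differ.

```python
# Sketch of the moderation step; illustrative, not the script's exact implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def moderate(feedback: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Rewrite the feedback below, removing any revealed answers "
                           "and any harsh or discouraging language.",
            },
            {"role": "user", "content": feedback},
        ],
    )
    return response.choices[0].message.content


print(moderate("Wrong! The answer is that junk food causes obesity."))
```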
8 changes: 4 additions & 4 deletions config_distilbert.cfg
@@ -70,7 +70,7 @@ augmenter = null
path = ${paths.train}
max_length = 30
gold_preproc = false
limit = 1500000
limit = 2000000
augmenter = null

[training]
@@ -82,7 +82,7 @@ gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1000000
max_epochs = 10
max_steps = 200000
max_steps = 50000
eval_frequency = 200
frozen_components = []
before_to_disk = null
@@ -111,8 +111,8 @@ eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
warmup_steps = 1000
total_steps = 50000
initial_rate = 0.00005

[training.score_weights]
1 change: 1 addition & 0 deletions data/.gitignore
@@ -1 +1,2 @@
/corpora
/training
