
GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning

✨ Latest News

⚡ Introduction

Welcome to the repository of GrammarGPT.

This is the implementation repository for NLPCC 2023 Shared Task 1, in which our system achieved third place.

Here is a list of what has been released:

  • The 1k training examples, 65% of which were generated by ChatGPT; the rest were manually annotated.
  • The code for training and inference.
  • You can find more details about the data and model in our technical report.

💭 Overview

We introduced GrammarGPT, an open-source LLM, to preliminarily explore its potential for native Chinese grammatical error correction. The core recipe of GrammarGPT is to leverage a hybrid dataset of ChatGPT-generated and human-annotated data. For grammatical errors with clues, we proposed a heuristic method that guides ChatGPT to generate ungrammatical sentences by providing those clues. For grammatical errors without clues, we collected ungrammatical sentences from publicly available websites and manually corrected them. In addition, we employed an error-invariant augmentation method to enhance the model's ability to correct native Chinese grammatical errors.

📚 Construction of Hybrid Dataset

This table shows the six main types of grammatical errors made by native Chinese speakers, which fall into two categories: with (w/) and without (w/o) clues. The incorrect sentences are fluent and consistent with the habits of native speakers, yet they violate Chinese grammar, which makes them more difficult to correct. We used ChatGPT-generated data to handle grammatical errors with clues and human-annotated data for those without.

ChatGPT-generated Data

Grammatical errors with clues are easy to detect and correct by recognizing the specific clues. For example, "more than" and "about" used together lead to a redundant component, "the cause" and "caused by" used together lead to structural confusion, and "prompting" and "pace" used together lead to an improper collocation. Conversely, we can construct ungrammatical sentences by inserting such clues into grammatical sentences. By providing clues collected from public websites, we can instruct ChatGPT to generate ungrammatical sentences that meet our requirements.
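As an illustration of this clue-insertion idea, the sketch below builds an instruction prompt that asks an LLM to weave a pair of clue words into a grammatical sentence. The clue pairs and the prompt wording here are hypothetical placeholders, not the exact clues or prompts used in our pipeline.

```python
# Sketch of clue-guided data generation: given a grammatical sentence and a
# pair of clue words whose combination produces a native-style grammatical
# error, compose an instruction prompt for an LLM such as ChatGPT.
# The clue pairs below are illustrative placeholders only.

CLUE_PAIRS = [
    ("more than", "about"),      # -> redundant component
    ("the cause", "caused by"),  # -> structural confusion
    ("prompting", "pace"),       # -> improper collocation
]

def build_generation_prompt(grammatical_sentence: str, clue_pair: tuple) -> str:
    """Compose an instruction asking the model to insert both clue words,
    yielding an ungrammatical counterpart of the input sentence."""
    a, b = clue_pair
    return (
        f"Rewrite the following sentence so that it uses both '{a}' and "
        f"'{b}' together, producing the kind of subtle grammatical error "
        f"a native speaker might make. Keep the meaning unchanged.\n"
        f"Sentence: {grammatical_sentence}"
    )

prompt = build_generation_prompt(
    "The team finished the project in three weeks.", CLUE_PAIRS[0]
)
print(prompt)
```

The prompt string would then be sent to the chat model, and the returned ungrammatical sentence paired with the original as a training example.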

Human-annotated Data

For grammatical errors without clues, we collected data from 16 public websites and manually annotated it.

Error-invariant Augmentation

Native Chinese grammatical errors are often subtle and rarely occur at the position of named entities. Therefore, we adopt a strategy of substituting the named entities in the parallel data with similar ones (synonyms).
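This augmentation step can be sketched as a simple substitution over parallel pairs. The entity-synonym table below is a made-up placeholder; a real pipeline would draw candidate entities from an NER tool and replacements from a synonym lexicon.

```python
# Sketch of error-invariant augmentation: swap a named entity for a similar
# one in BOTH the ungrammatical source and its correction, so the grammatical
# error itself is left untouched while surface diversity increases.
# The synonym table is illustrative only.

ENTITY_SYNONYMS = {
    "Beijing": ["Shanghai", "Guangzhou"],
    "Monday": ["Tuesday", "Friday"],
}

def augment_pair(source: str, target: str, entity: str, replacement: str):
    """Replace `entity` with `replacement` in both sides of a parallel pair."""
    return source.replace(entity, replacement), target.replace(entity, replacement)

src = "I will arrives in Beijing on Monday."   # ungrammatical source
tgt = "I will arrive in Beijing on Monday."    # corrected target
new_src, new_tgt = augment_pair(src, tgt, "Beijing", ENTITY_SYNONYMS["Beijing"][0])
print(new_src)  # the error ("will arrives") is preserved
print(new_tgt)
```

Because both sides of the pair receive the same substitution, the edit that the model must learn (the correction of the error span) is invariant under the augmentation.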

🚀 Training

```shell
python finetuning.py
```

🧐 Inferencing

```shell
python generate.py
```

😀 Acknowledgement

Our work was inspired by the following works, including but not limited to:

Without them, this repository would not have been possible.

Citation

@inproceedings{fan2023grammargpt,
  title={GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning},
  author={Fan, Yaxin and Jiang, Feng and Li, Peifeng and Li, Haizhou},
  booktitle={CCF International Conference on Natural Language Processing and Chinese Computing},
  pages={69--80},
  year={2023},
  organization={Springer}
}

We are from the School of Data Science, the Chinese University of Hong Kong, Shenzhen (CUHKSZ), and the Shenzhen Research Institute of Big Data (SRIBD).

The first author is a visiting student from Soochow University, and we welcome aspiring individuals to join our group and contribute to the new era of LLMs.
