
T5 for Sentence Split in English:

Sentence splitting is the task of dividing a complex sentence into two simpler sentences. For example, the complex sentence

Mary likes to play football in her free time whenever she meets with her friends that are very nice people.

can be split into

Mary likes to play football in her free time whenever she meets with her friends.

and

Her friends are very nice people.

Goal:

To build the best sentence split model available to date.

Demo:

Check out the Demo

(image: demo UI screenshot)

How to use it in your Python code:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("flax-community/t5-base-wikisplit")
model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/t5-base-wikisplit")

complex_sentence = "This comedy drama is produced by Tidy , the company she co-founded in 2008 with her husband David Peet , who is managing director ."
sample_tokenized = tokenizer(complex_sentence, return_tensors="pt")

# Generate the split with beam search (num_beams=5).
answer = model.generate(
    sample_tokenized["input_ids"],
    attention_mask=sample_tokenized["attention_mask"],
    max_length=256,
    num_beams=5,
)
gene_sentence = tokenizer.decode(answer[0], skip_special_tokens=True)
print(gene_sentence)

"""
Output:
This comedy drama is produced by Tidy. She co-founded Tidy in 2008 with her husband David Peet, who is managing director.
"""

Applications:

  • Sentence Simplification
  • Data Augmentation (see the sketch after this list)
  • Sentence Rephrasing
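
For data augmentation in particular, one simple recipe (an illustrative sketch under our own assumptions, not code from this project) is to keep several beam-search candidates per input via num_return_sequences and treat each candidate split as an extra training example:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("flax-community/t5-base-wikisplit")
model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/t5-base-wikisplit")

sentence = "Mary likes to play football in her free time whenever she meets with her friends that are very nice people."
inputs = tokenizer(sentence, return_tensors="pt")

# Keep the top 3 beams (num_return_sequences must not exceed num_beams).
answers = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=256,
    num_beams=5,
    num_return_sequences=3,
)
for candidate in tokenizer.batch_decode(answers, skip_special_tokens=True):
    print(candidate)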

Current baseline from the paper:

(image: baseline results table from the paper)

Our Results:

Model                    Exact match  SARI     BLEU
t5-base-wikisplit        17.93        67.5438  76.9
t5-v1_1-base-wikisplit   18.1207      67.4873  76.9478
byt5-base-wikisplit      11.3582      67.2685  73.1682
t5-large-wikisplit       18.6632      68.0501  77.1881

Accomplishments:

  • All of our models beat the baseline models on two metrics (exact match and SARI).
  • Our t5-base-wikisplit and t5-v1_1-base-wikisplit models achieve comparable results at roughly half the model size, which enables faster inference.
  • We added a WikiSplit metric, freely available in Hugging Face Datasets, which makes it easy to compute the relevant scores for this task from now on (see the sketch below).
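
As a rough guide to reproducing such scores, the sketch below loads the combined WikiSplit metric through the Hugging Face evaluate library; the metric name matches the contribution described above, but the loading API and the toy example strings are our own assumptions.

import evaluate

# Complex source sentences, model predictions, and gold reference splits
# (one list of references per source). The strings here are illustrative only.
sources = ["Mary likes to play football in her free time whenever she meets with her friends that are very nice people."]
predictions = ["Mary likes to play football in her free time whenever she meets with her friends. Her friends are very nice people."]
references = [["Mary likes to play football in her free time whenever she meets with her friends. Her friends are very nice people."]]

# wiki_split combines exact match, SARI, and a BLEU score in a single metric.
wiki_split = evaluate.load("wiki_split")
print(wiki_split.compute(sources=sources, predictions=predictions, references=references))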

Contributors:

To Do:

  • t5-base training on WikiSplit
  • t5-v1_1-base training on WikiSplit
  • byt5-base training on WikiSplit
  • t5-large training on WikiSplit
  • Streamlit UI for the app
  • Add a single WikiSplit evaluation metric to Hugging Face Datasets
  • Challenge: beat the performance of roberta2roberta_L-24_wikisplit
  • Improve performance through further research
  • Tackle gender bias and fairness in text generation
  • Benchmarking and experimenting with WebSplit