Hi @geoffreya, thanks for bringing up this interesting topic. In Haystack, we support a variety of summarization models via the Summarizer node within pipelines: https://haystack.deepset.ai/reference/summarizer However, summarization hasn't been our focus so far. We are closely following NLP research, including work on long document summarization, but there are no specific plans on our side. Maybe somebody from the community is interested and would like to implement/integrate a recent research paper?
I would like to know what plans, if any, Deepset has to build some kind of long document summarizer.
So far, I have found out the following about this subject on my own. My concrete test case is summarizing the closed captions (CC, i.e., the transcript) of a particular YouTube video: the asteroid video by Veritasium, a science educator, in which he shows the asteroid size you should be worried about hitting the Earth. Sorry, I don't have the video URL handy; I can update with the specific URL later if you ask. My goal is a summary of 20 words or less, because the summary of the CC will be used as one way to evaluate how accurate this video's title is. YouTube titles are typically very short texts (relative to the whole CC of the video) and are, of course, somewhat related to the contents of the video.
1 - I directly experimented with Hugging Face models specifically intended for summarization, like BART and a few others. Problems encountered in some models but not others included: the model was too large to fit in my workstation's memory; a single inference took minutes when I wanted it to complete in seconds; and the output contained redundant sentences, i.e., poor quality. I was eventually able to find a model or two that could run on my workstation, but inference was very slow, taking several minutes and consuming all of my workstation's GPU or CPU resources while running. So I need something lighter or faster in order to serve lots of users on a production web host. I have not attempted any fine-tuning of models; so far I have only used models strictly off the shelf from Hugging Face.
2 - DeepAI offers a generally state-of-the-art summarization model online (https://deepai.org/machine-learning-model/summarization), similar in spirit to OpenAI's GPT-3-based models. I used this web UI to try it on my CC text example. According to the instructions, the model reduces the text to roughly 20% of its original word count. So my approach was to run this model three times, feeding its output back into the input each time. This recursive usage should produce a final summary of about 0.2 × 0.2 × 0.2 = 1/125 of the original size. The actual output was the following gibberish, which is clearly not good enough to use in production:
These became the asteroids,
an asteroid impact
the asteroid around.
of an asteroid impact.
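To make the recursive scheme in point 2 concrete, here is a minimal sketch. The `summarize()` function is a hypothetical stand-in for whatever model or web API you actually call; here it just keeps the first 20% of the words, mimicking the ~20% compression ratio described above.

```python
def summarize(text: str) -> str:
    # Hypothetical stand-in for a real summarization model/API:
    # keep roughly the first 20% of the words.
    words = text.split()
    keep = max(1, len(words) // 5)
    return " ".join(words[:keep])

def recursive_summarize(text: str, rounds: int = 3) -> str:
    # Feed the output back into the input `rounds` times,
    # compressing to roughly 0.2**rounds of the original size.
    for _ in range(rounds):
        text = summarize(text)
    return text

doc = " ".join(f"word{i}" for i in range(1000))
summary = recursive_summarize(doc, rounds=3)
print(len(summary.split()))  # 1000 * 0.2**3 = 8 words
```

As the gibberish output above shows, the weakness of this approach is that each round compounds the previous round's errors; the size arithmetic works out, but the quality degrades multiplicatively too.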
3 - Perhaps Haystack's two-phase QA technique, with a coarse pass followed by a fine pass, could also be used for summaries. Not sure.
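A purely illustrative sketch of what such a coarse/fine pipeline might look like: a cheap extractive pass first selects the top-k sentences (analogous to Haystack's retriever), and only those are handed to an expensive summarizer (analogous to the reader). `summarize_abstractive()` is a hypothetical stand-in for a real model; the word-frequency scoring is just the simplest possible coarse heuristic.

```python
from collections import Counter

def coarse_select(sentences: list, k: int = 3) -> list:
    # Cheap extractive pass: score each sentence by the document-level
    # frequency of its words, then keep the top-k in original order.
    freq = Counter(w.lower() for s in sentences for w in s.split())
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sum(freq[w.lower()] for w in sentences[i].split()),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

def summarize_abstractive(text: str) -> str:
    # Hypothetical stand-in for an expensive neural summarizer;
    # here it just truncates to a 20-word budget.
    return " ".join(text.split()[:20])

def coarse_fine_summary(sentences: list, k: int = 3) -> str:
    return summarize_abstractive(" ".join(coarse_select(sentences, k)))
```

The point of the design is cost: the expensive model only ever sees the k selected sentences, not the whole transcript.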
4 - Perhaps summarize short segments one at a time, concatenate the results, and then repeat the process as many times as necessary until the output text reaches the required size?
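Point 4 is essentially a map-reduce loop over chunks. A minimal sketch, again with a hypothetical `summarize()` stub in place of a real model (here it keeps each chunk's first sentence):

```python
def summarize(chunk: str) -> str:
    # Hypothetical stand-in for a real summarization model:
    # keep only the chunk's first sentence.
    return chunk.split(". ")[0].rstrip(".") + "."

def chunked_summary(text: str, chunk_words: int = 100, max_words: int = 20) -> str:
    # Split into fixed-size word chunks, summarize each, concatenate,
    # and repeat until the result fits within the word budget.
    while len(text.split()) > max_words:
        words = text.split()
        chunks = [" ".join(words[i:i + chunk_words])
                  for i in range(0, len(words), chunk_words)]
        shorter = " ".join(summarize(c) for c in chunks)
        if len(shorter.split()) >= len(text.split()):
            break  # no progress; avoid an infinite loop
        text = shorter
    return text
```

Word-based chunking like this can split mid-sentence; a real implementation would presumably chunk on sentence boundaries and respect the model's token limit instead.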
5 - Perhaps summarize top-down and bottom-up, as described in the 2021 paper "Long Document Summarization with Top-Down and Bottom-Up Representation Inference" by Pang et al. (https://openreview.net/forum?id=xiXOrugVHs).
I can provide the actual test YouTube video's CC text to anyone who asks, no problem! Just ask.
Thanks to all in the community, and to Deepset, for any suggestions or plans for Haystack regarding summarization of long documents!