Hi @geoffreya, thanks for bringing up this interesting topic. In Haystack, we support a variety of summarization models via the Summarizer node within pipelines: https://haystack.deepset.ai/reference/summarizer However, summarization hasn't been our focus so far. We are closely following NLP research, including work on long document summarization, but there are no specific plans on our side. Maybe somebody from the community is interested and would like to implement/integrate a recent research paper?
I would like to know what plans, if any, Deepset has to build some kind of long document summarizer.
So far, I have found out the following about this subject on my own. My concrete test case is summarizing the closed captions (CC, i.e., the transcript) of a particular YouTube video: the asteroid video by Veritasium, a science educator, in which he shows the asteroid size you should be worried about hitting the Earth. Sorry, I don't have the video URL handy; I can update with the specific URL later if you ask. My goal is a summary of 20 words or less, because the summary of the CC will be used as one way to evaluate how accurate this video's title is. YouTube titles are typically very short texts (relative to the whole CC of the video) and are, of course, somewhat related to the contents of the video.
1 - I directly experimented with Hugging Face models specifically intended for summarization, like BART and a few others. Problems encountered in some models but not others included: the model was too large to fit in my workstation's memory; a single inference took minutes when I wanted it to complete in seconds; and the output contained redundant sentences, i.e., poor quality. I was eventually able to find a model or two that could run on my workstation, but inference was very slow, taking several minutes and consuming all of my workstation's GPU or CPU resources while running. So I need something lighter or faster in order to serve lots of users on a production web host. I have not attempted any fine-tuning of models; so far I have only used models strictly off the shelf from Hugging Face.
2 - DeepAI offers a generally state-of-the-art summarization model online (https://deepai.org/machine-learning-model/summarization), similar in spirit to OpenAI's GPT-3-based models. I used this web UI to try it on my CC text example. According to the instructions, the model reduces the text to roughly 20% of its original word count. So my approach was to run this model three times, feeding its output back into the input each time. This recursive usage should produce a final summary of about 0.2 × 0.2 × 0.2 = 1/125 of the original size. The actual output was the following gibberish, which is clearly not good enough to use in production:
These became the asteroids,
an asteroid impact
the asteroid around.
of an asteroid impact.
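To make the recursive scheme in point 2 concrete, here is a minimal sketch. The `summarize()` function is a hypothetical stand-in for whatever model or web API you actually call; here it just keeps the first 20% of the words, mimicking the ~20% compression ratio described above.

```python
def summarize(text: str) -> str:
    # Hypothetical stand-in for a real summarization model/API:
    # keep roughly the first 20% of the words.
    words = text.split()
    keep = max(1, len(words) // 5)
    return " ".join(words[:keep])

def recursive_summarize(text: str, rounds: int = 3) -> str:
    # Feed the output back into the input `rounds` times,
    # compressing to roughly 0.2**rounds of the original size.
    for _ in range(rounds):
        text = summarize(text)
    return text

doc = " ".join(f"word{i}" for i in range(1000))
summary = recursive_summarize(doc, rounds=3)
print(len(summary.split()))  # 1000 * 0.2**3 = 8 words
```

As the gibberish output above shows, the weakness of this approach is that each round compounds the previous round's errors; the size arithmetic works out, but the quality degrades multiplicatively too.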
3 - Perhaps Haystack's two-phase QA technique, with a coarse pass followed by a fine pass, could also be used for summaries. Not sure.
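A purely illustrative sketch of what such a coarse/fine pipeline might look like: a cheap extractive pass first selects the top-k sentences (analogous to Haystack's retriever), and only those are handed to an expensive summarizer (analogous to the reader). `summarize_abstractive()` is a hypothetical stand-in for a real model; the word-frequency scoring is just the simplest possible coarse heuristic.

```python
from collections import Counter

def coarse_select(sentences: list, k: int = 3) -> list:
    # Cheap extractive pass: score each sentence by the document-level
    # frequency of its words, then keep the top-k in original order.
    freq = Counter(w.lower() for s in sentences for w in s.split())
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sum(freq[w.lower()] for w in sentences[i].split()),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

def summarize_abstractive(text: str) -> str:
    # Hypothetical stand-in for an expensive neural summarizer;
    # here it just truncates to a 20-word budget.
    return " ".join(text.split()[:20])

def coarse_fine_summary(sentences: list, k: int = 3) -> str:
    return summarize_abstractive(" ".join(coarse_select(sentences, k)))
```

The point of the design is cost: the expensive model only ever sees the k selected sentences, not the whole transcript.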
4 - Perhaps summarize short segments one at a time, concatenate the results, and then repeat the process as many times as necessary until the output text reaches the required size?
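Point 4 is essentially a map-reduce loop over chunks. A minimal sketch, again with a hypothetical `summarize()` stub in place of a real model (here it keeps each chunk's first sentence):

```python
def summarize(chunk: str) -> str:
    # Hypothetical stand-in for a real summarization model:
    # keep only the chunk's first sentence.
    return chunk.split(". ")[0].rstrip(".") + "."

def chunked_summary(text: str, chunk_words: int = 100, max_words: int = 20) -> str:
    # Split into fixed-size word chunks, summarize each, concatenate,
    # and repeat until the result fits within the word budget.
    while len(text.split()) > max_words:
        words = text.split()
        chunks = [" ".join(words[i:i + chunk_words])
                  for i in range(0, len(words), chunk_words)]
        shorter = " ".join(summarize(c) for c in chunks)
        if len(shorter.split()) >= len(text.split()):
            break  # no progress; avoid an infinite loop
        text = shorter
    return text
```

Word-based chunking like this can split mid-sentence; a real implementation would presumably chunk on sentence boundaries and respect the model's token limit instead.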
5 - Perhaps summarize top-down and bottom-up, as described in the 2021 paper "Long Document Summarization with Top-Down and Bottom-Up Representation Inference" by Pang et al. (https://openreview.net/forum?id=xiXOrugVHs).
I can provide the actual test YouTube video's CC text to anyone who asks, no problem! Just ask.
Thanks to all in the community, and to Deepset, for any suggestions or plans for Haystack regarding summarization of long documents!