# Text summarization

Approaches can be roughly categorized into *abstractive* and *extractive* summarization. Extractive summarization works more closely with the input given, *extracting* the most important sentence(s) from a text and making a summary out of that. Abstractive summarization tries to synthesize the text in a more holistic way, producing completely new sentences.

The (extractive) summarization pipeline follows three steps:

1. Sentence scoring: Which sentences are the most important?
2. Sentence selection: Which of the sentences from step 1 carry complementary information?
3. Sentence reformulation: Which material can I reformulate / compress further?

<div class="alert alert-block alert-info"> <b>Discussion.</b> Many approaches are frequency based. That is, they assume that the most important information in a text will appear more frequently in the text. Is this a reasonable assumption?
</div>



# ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE-N is the n-gram recall between a candidate summary and a set of reference summaries. 

$$\Large \text{ROUGE-N}_{\text{recall}} = \frac{\sum_{S \in \{ \text{Reference Summaries} \}}\sum_{gram_n \in S} \text{Count}_{\text{match}}(gram_n)}{\sum_{S \in \{ \text{Reference Summaries} \}}\sum_{gram_n \in S} \text{Count}(gram_n)},$$

where $n$ is the length of the n-gram, and $\text{Count}_{\text{match}}$ is the maximum number of n-grams co-occurring in a candidate summary and a reference summary $S$.
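
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes lowercased whitespace tokenization, and the names `ngrams` and `rouge_n_recall` are made up for this example. Matches are clipped, so an n-gram counts at most as often as it occurs in the candidate.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=2):
    """N-gram recall of a candidate summary against a list of reference summaries."""
    # Assumption: lowercased whitespace tokenization, no stemming or stopword removal.
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = 0
    total = 0
    for reference in references:
        ref_counts = ngrams(reference.lower().split(), n)
        # Count_match: each reference n-gram is matched at most as often
        # as it appears in the candidate.
        matched += sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


reference = "Government reduces taxes next Monday"
candidate = "The government reduces income taxes starting the following week"
print(rouge_n_recall(candidate, [reference], n=1))  # 0.6
print(rouge_n_recall(candidate, [reference], n=2))  # 0.25
```

Only one of the four reference bigrams ("government reduces") appears in the candidate, which is why ROUGE-2 is much lower than ROUGE-1 here.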

ROUGE-L is based on the longest common subsequence shared between the reference and the candidate summary (N.B.: the common words are not necessarily consecutive, just in the same sequence).

$$\Large \text{ROUGE-L} = \frac{LCS(S,X)}{m},$$

where $S$ is the reference summary, $X$ is the candidate summary, and $m$ is the length of the reference summary in words. So if *Government reduces taxes next Monday* is the reference summary and our candidate is *The government reduces income taxes starting the following week*, the LCS is "government reduces taxes" (ignoring case): 3 words out of a reference summary of 5 (i.e., a ROUGE-L of 0.6).
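
As a rough sketch, ROUGE-L as defined here can be computed with the standard dynamic-programming LCS. The function names `lcs_length` and `rouge_l` are hypothetical, and the tokenization (lowercasing, splitting on whitespace) is an assumption made for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate, reference):
    """LCS length divided by the reference length (recall-style ROUGE-L)."""
    # Assumption: lowercased whitespace tokenization.
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens:
        return 0.0
    return lcs_length(ref_tokens, cand_tokens) / len(ref_tokens)


print(rouge_l(
    "The government reduces income taxes starting the following week",
    "Government reduces taxes next Monday",
))  # 0.6
```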

<div class="alert alert-block alert-info"> <b>Discussion.</b> What are the advantages and weaknesses of ROUGE-N and ROUGE-L?
</div>


# Term Frequency–Inverse Document Frequency (TF-IDF)

$$\Large \text{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d}f_{t',d}},$$

where $f_{t,d}$ is the raw count of a term $t$ in a document $d$.

$$\Large \text{idf}(t,D) = \log\frac{|D|}{|\{ d \in D: t \in d \}|}$$

$$\Large \text{tf-idf(t,d,D)} = \text{tf}(t,d) * \text{idf}(t,D)$$

Intuitively, TF-IDF measures how important a term is in a document, discounted by how common that term is across all documents.

Since we have a single document to summarize, and our task is to find the sentences that carry the most information, "document" here refers to a sentence, and $D$ is the set of all sentences in the text.
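
A small worked example may help, with each sentence treated as a document. The numbers are invented for illustration, and the natural logarithm is an assumption. Suppose the text has $|D| = 3$ sentences, and a term $t$ occurs 2 times in a 10-word sentence $d$ and in no other sentence:

$$\Large \text{tf}(t,d) = \frac{2}{10} = 0.2, \qquad \text{idf}(t,D) = \log\frac{3}{1} \approx 1.10, \qquad \text{tf-idf}(t,d,D) \approx 0.2 \times 1.10 = 0.22$$

A term that occurs in all three sentences has $\text{idf}(t,D) = \log\frac{3}{3} = 0$, so it contributes nothing to any sentence, however often it appears.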


## TF-IDF-based extractive summarization
The weight of each sentence is the sum of the TF-IDF scores of its words. Pick the $n$ highest-scoring sentences.
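
One possible reading of this procedure is sketched below. Several choices are assumptions made for the example, not part of the method description: the regex sentence splitter, the `\w+` tokenizer, the natural logarithm, summing over the distinct terms of each sentence, and returning the selected sentences in their original order. The name `tfidf_summarize` is hypothetical.

```python
import math
import re
from collections import Counter


def tfidf_summarize(text, n=2):
    """Return the n sentences with the highest summed TF-IDF, in original order."""
    # Assumption: sentences end in '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    num_sentences = len(sentences)

    # Document frequency: in how many sentences does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(num_sentences / doc_freq[term])
            score += tf * idf
        scores.append(score)

    # Indices of the n best-scoring sentences, restored to document order.
    best = sorted(range(num_sentences), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]


text = "The cat sat on the mat. The cat sat on the mat. Stock markets fell sharply."
print(tfidf_summarize(text, n=1))  # ['Stock markets fell sharply.']
```

Because term frequency is normalized by sentence length, the score in this sketch is effectively the average IDF of a sentence's tokens, so sentences made of rare words win regardless of length. Whether that is desirable is part of the discussion question that follows.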

<div class="alert alert-block alert-info"> <b>Discussion.</b> What are the advantages and weaknesses of TF-IDF-based summarization?
</div>

1. Download the DailyMail data from the [CNN/DailyMail summarization dataset](https://github.com/abisee/cnn-dailymail). You can download a (slightly pre-processed) version directly from here (second link): [https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail](https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail)
2. Write a script that computes the ROUGE-2 of a model's summary based on a story, benchmarked against a concatenation of its highlights
3. Implement a LEAD-2 model and evaluate its ROUGE-2 on the entire DailyMail dataset
4. Implement a TF-IDF-based extractive summarization model. Find the optimal number $n$ of highest-ranking sentences to pick. What is its ROUGE-2? How does it compare to LEAD-2?

# Notes

For class: They come with the dataset prepared to use. They need to know about ROC/AUC to pick the optimal $n$ for the TF-IDF model, and about splitting datasets


# For task 2/3

Implement an extractive summarization model that uses LEAD-5 together with at least two linguistically motivated compressive capabilities (e.g., two tree-trimming rules), and compare it to either (i) a fully abstractive summarization model or (ii) another extractive model that uses regression for importance prediction. Explain the main features of your implementation. Evaluate it using ROUGE-L on both the English MLSUM subset and a non-English subset. Discuss the quality of the summaries in connection with your results, both within each subset and across subsets.


# For preparation
* Read sections 1 and 2.3 of "Recent Advances in Document Summarization" and then the full "MLSUM: The Multilingual Summarization Corpus" paper
https://wanxiaojun.github.io/summ_survey_draft.pdf
* Prepare DailyMail data for processing: Script that retrieves the main text to summarize AND a concatenation of the highlights
Maybe read: https://j.mecs-press.net/ijieeb/ijieeb-v11-n3/IJIEEB-V11-N3-5.pdf