# UN ML Assessment
### Rebecca Calinsky

### a. How did you approach the problem?

I used HuggingFace's transformers library, which provides pretrained models for abstractive summarization. One issue I came across is that the original texts are long, well over the model's input limit (1,024 tokens). I resolved this by breaking each original text into groups of sub-texts, summarizing them separately, and then re-summarizing the merged result. Due to computational constraints, I generated only the first 150 (2%) of the 7,507 summaries.
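The split, summarize, and re-summarize flow described above can be sketched independently of any particular model. This is a minimal illustration, not the exact notebook code: `chunk_by_words` and `summarize_long_text` are hypothetical helper names, the `summarize` argument stands in for any function that maps a string to a shorter string, and the 512-word group size is an assumed conservative limit.

```python
from typing import Callable, List


def chunk_by_words(text: str, max_words: int = 512) -> List[str]:
    """Greedily group newline-separated lines into chunks of about max_words words."""
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for line in text.split("\n"):
        words = line.split()
        if not words:
            continue  # skip empty lines
        current.append(" ".join(words))
        count += len(words)
        if count >= max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
    if current:
        # keep the trailing group, even if it is shorter than max_words
        chunks.append(" ".join(current))
    return chunks


def summarize_long_text(
    text: str,
    summarize: Callable[[str], str],
    max_words: int = 512,
) -> str:
    """Summarize each chunk, merge the partial summaries, then summarize the merge."""
    partial = [summarize(chunk) for chunk in chunk_by_words(text, max_words)]
    merged = " ".join(s for s in partial if s)
    return summarize(merged)
```

Passing the summarizer in as a plain callable keeps the chunking logic testable without loading a model; in the notebook, the function wrapping the transformers pipeline would play that role.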

### b. In this problem, how can you measure the performance?

One way to measure performance would be to randomly choose N of the original texts, summarize each of them myself (or, ideally, have a few independent, unbiased annotators do so), and then compare these human-written summaries with the generated ones. I would rate each generated summary by the share of salient points from the human-written summary that it contains. The overall performance rating would be the average of these N ratings.
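The rating scheme can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the judgment of whether a salient point is present is assumed to come from a human rater (recorded as one boolean per point), and each rating is taken to be the fraction of points covered.

```python
import random
from statistics import mean
from typing import Dict, List


def sample_indices(num_texts: int, n: int, seed: int = 0) -> List[int]:
    """Randomly choose n distinct text indices to evaluate (seeded for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_texts), n))


def coverage_rating(points_present: List[bool]) -> float:
    """Fraction of the annotator's salient points found in the generated summary."""
    if not points_present:
        raise ValueError("need at least one salient point per text")
    return sum(points_present) / len(points_present)


def overall_rating(judgments: Dict[int, List[bool]]) -> float:
    """Average the per-text ratings; judgments maps text index -> per-point booleans."""
    return mean(coverage_rating(points) for points in judgments.values())
```

For example, a summary that covers three of four salient points would be rated 0.75, and the overall rating is the mean of such values across the N sampled texts.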

### load and preview data

In [1]:
# load data
import pandas as pd
Data = pd.read_csv("un-general-debates.csv")

In [2]:
Data

Unnamed: 0,session,year,country,text
0,44,1989,MDV,It is indeed a pleasure for me and the member...
1,44,1989,FIN,"\nMay I begin by congratulating you. Sir, on ..."
2,44,1989,NER,"\nMr. President, it is a particular pleasure ..."
3,44,1989,URY,\nDuring the debate at the fortieth session o...
4,44,1989,ZWE,I should like at the outset to express my del...
...,...,...,...,...
7502,56,2001,KAZ,This session\nthat is taking place under extr...
7503,56,2001,LBR,I am honoured to\nparticipate in this histori...
7504,56,2001,BDI,It\nis for me a signal honour to take the flo...
7505,56,2001,HUN,"First, may I congratulate Mr. Han Seung-soo o..."


### summarize texts

In [3]:
# import summarization transformer
from transformers import pipeline
summarization = pipeline("summarization")

# function to summarize text
def summarize(input_text):
    try:
        summary = summarization(input_text)[0]['summary_text']

    # fall back to an empty summary if the pipeline raises (e.g. on over-long input);
    # Exception, unlike a bare except, still lets KeyboardInterrupt propagate
    except Exception:
        summary = ''

    return summary


In [4]:
from pathlib import Path

# process a subset of texts due to computational constraints (each summarization takes ~5 mins)
num_texts = 150

# the summarizer accepts at most 1024 tokens; group ~512 words at a time as a conservative limit
word_threshold = 512

# create save directory for individual files
save_dir = Path('summaries')
save_dir.mkdir(exist_ok = True)

for i in range(num_texts):
    # temporary file to save individual summary
    summary_filename = save_dir / f'{i}.txt'
    print(f"processing '{summary_filename}'")

    # skip if already summarized
    if summary_filename.exists():  continue
    
    # get text
    original_text = Data.at[i, 'text']
    # break into lines (the source wraps text with '\n', so these are not true sentences)
    sentences = original_text.split('\n')

    # overall summary
    total_summary = ''

    # reset word count
    word_count = 0
    grouped_text = ''
    
    # reverse sentence order to facilitate pop
    sentences.reverse()

    # summarize each group of lines
    while sentences:
        next_sentence = sentences.pop()
        # split() ignores repeated whitespace, so empty strings are not counted as words
        num_words = len(next_sentence.split())

        word_count += num_words
        # join with a space so words at line boundaries do not run together
        grouped_text += next_sentence + ' '

        if word_count >= word_threshold:
            # summarize the group and append to total summary
            total_summary += summarize(grouped_text) + ' '

            # reset word count and grouped_text
            word_count = 0
            grouped_text = ''

    # summarize the final group, which may be shorter than word_threshold
    if grouped_text.strip():
        total_summary += summarize(grouped_text) + ' '

    # finally, summarize group-wise summary
    summary_text = summarize(total_summary)

    # save summary
    with open(summary_filename, 'w') as file:
        file.write(summary_text)


processing 'summaries\0.txt'
processing 'summaries\1.txt'
processing 'summaries\2.txt'
processing 'summaries\3.txt'
processing 'summaries\4.txt'
processing 'summaries\5.txt'
processing 'summaries\6.txt'
processing 'summaries\7.txt'
processing 'summaries\8.txt'
processing 'summaries\9.txt'
processing 'summaries\10.txt'
processing 'summaries\11.txt'
processing 'summaries\12.txt'
processing 'summaries\13.txt'
processing 'summaries\14.txt'
processing 'summaries\15.txt'
processing 'summaries\16.txt'
processing 'summaries\17.txt'
processing 'summaries\18.txt'
processing 'summaries\19.txt'
processing 'summaries\20.txt'
processing 'summaries\21.txt'
processing 'summaries\22.txt'
processing 'summaries\23.txt'
processing 'summaries\24.txt'
processing 'summaries\25.txt'
processing 'summaries\26.txt'
processing 'summaries\27.txt'
processing 'summaries\28.txt'
processing 'summaries\29.txt'
processing 'summaries\30.txt'
processing 'summaries\31.txt'
processing 'summaries\32.txt'
processing 'summarie

### save to csv

In [5]:
# name of output file
output_csv = 'summaries.csv'

# copy dataframe for storing summaries; rename column from text to summary
summaries_df = Data.copy()
summaries_df.rename(columns = {'text': 'summary'}, inplace = True)

# .at is meant for single-cell access; use .loc to assign a placeholder to the whole column
summaries_df.loc[:, 'summary'] = 'Summary not available due to computational constraints.'

for i in range(num_texts):
    summary_filename = Path(f'summaries/{i}.txt')
    
    with open(summary_filename, 'r') as file:
        summary_text = file.read()
        summaries_df.at[i, 'summary'] = summary_text


# write to csv
summaries_df.to_csv(output_csv)
