# TLDR text summarization

GitHub repo: https://github.com/sajastu/reddit_collector

Paper PDF: https://aclanthology.org/2021.newsum-1.15.pdf

IR Lab: https://ir.cs.georgetown.edu/

In [None]:
import os
import json
import pandas as pd
import pprint

## Data Loading

The following code loads the entire dataset in memory creating 3 dataframes (train, validation, test)

In [None]:
files = [os.path.join(dirpath,f) for (dirpath, dirnames, filenames) in os.walk("Dataset_TLDRHQ/") for f in filenames]
#files = files[0:14] # subset composed of train, val, test

test = pd.DataFrame()
train = pd.DataFrame()
val = pd.DataFrame()

for file in files:
    temp = pd.read_json(file, lines=True)
    temp.set_index("id", inplace=True)
        
    if "test" in file:
        test = pd.concat([test, temp])
    if "train" in file:
        train = pd.concat([train, temp])
    else:
        val = pd.concat([val, temp])

: 

Sizes of the dataframes

In [3]:
print("train elements \t", len(train))
print("val elements \t", len(val))
print("test elements \t", len(test))
print("tot elements \t", len(train)+len(val)+len(test))

train elements 	 1590132
val elements 	 80967
test elements 	 40486
tot elements 	 1711585


In [4]:
train

Unnamed: 0_level_0,document,summary,ext_labels,rg_labels
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
train-TLDR_RS_2019-07-25907.json,"hey y' all , i 've been a lurker in this commu...",i 'm publishing betas of some stuff i 've been...,"[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.040456223279431006, 0.022386650144119002, 0..."
train-TLDR_RS_2011-07-7721.json,"as the title says , dreams that really scared ...",dreamed i was a cop that got shot in the face ...,"[1, 1, 0, 0, 1]","[0.17657869627445003, 0.382656749215711, 0.157..."
train-TLDR_RC_2019-08-cm-9972.json,"from u / orangejews4u here 's my "" must read ""...",not worth ) ] ( https://www.reddit.com/r/howto...,"[0, 1, 0, 1]","[0.23651113155692202, 0.336435558104879, 0.165..."
train-TLDR_RS_2015-04-39164.json,"hello / r / relationships , i did n't really k...","wife wants a break , has started going out par...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.044352260202115, 0.016666897769790003, 0.02..."
train-TLDR_RC_2016-05-cm-45089.json,i agree with you to be honest .</s><s> my dad ...,i agree with you . regular dad things . some s...,"[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]","[0.43353065193323603, 0.054993317369966, 0.047..."
...,...,...,...,...
TLDR_RS_2021-02-cm-5342.json,basically last year before corona and everythi...,"fck web dev , i 'm off to learn how to make an...","[0, 1, 0, 0, 0, 0, 0, 0]","[0.11267163735327002, 0.31056765001268505, 0.1..."
TLDR_RS_2021-02-cm-11094.json,i really like idea of homebrew spells but this...,i do n't find such op spells fun and fair .,"[1, 0, 0, 0]","[0.49651190762569103, 0.26557613663699703, 0.0..."
TLDR_RS_2021-03-cm-31347.json,"when i ( 21f ) first met my boyfriend , he was...",my boyfriend is getting too sensitive and clin...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[0.098505453044571, 0.053650019684145005, 0.07..."
TLDR_RS_2021-03-cm-33971.json,for those wondering i 'm in the 9th grade and ...,i insulted my teacher in an online class . the...,"[0, 0, 0, 0, 0, 0, 1, 0, 1]","[0.10565948286030301, 0.026185176187118003, 0...."


Attributes
- **id**: The ID of the reddit post,
    - RS: submission (reddit post)
    - RC: comment
- **document**: User's post text (source)
    - is split by the sentences; hence, you will find `</s><s>` tokens within the document's text, indicating the sentence boundaries.
- **summary**: User-written summary/TL;DR of the post,
- **ext_labels**: Extractive labels of the post's sentences (one per sentence)
- **rg_labels**: The rouge scores of the post's sentences (one per sentence)

### Train set exploration

#### Example document

In [5]:
train.iloc[0]

document      hey y' all , i 've been a lurker in this commu...
summary       i 'm publishing betas of some stuff i 've been...
ext_labels    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
rg_labels     [0.040456223279431006, 0.022386650144119002, 0...
Name: train-TLDR_RS_2019-07-25907.json, dtype: object

In [6]:
train['document'].iloc[0]

'hey y\' all , i \'ve been a lurker in this community for eons , and it \'s about time i contributed something .</s><s> ordinarily this is something more properly posted in r / jailbreak - but seeing as how it seems to be an eternal dumpsterfire , i \'m sure you can understand my preference to share this information here .</s><s> what i \'ve got for y\' all today are some betas for various tools i \'ve been working on for the past couple months , and have been holding off on publishing until they were all ready .</s><s> ## iksof ( ios kernel symbol offset finder )</s><s> just another in the long list of offset finders out there .</s><s> when i first started writing this , the idea was to finally have a nice platform binary , rather than a shell script for finding symbol offsets in an ipsw file \'s kernelcache .</s><s> i recently realized that there \'s actually a few out there already and that i just was n\'t looking hard enough .</s><s> regardless , this is a thing i made , and i thin

In [7]:
#pd.set_option('display.max_colwidth', None)
example_doc = pd.DataFrame({"sentence": train['document'].iloc[0].split("</s><s>"),
                       "ext": train['ext_labels'].iloc[0],
                       "rg": train['rg_labels'].iloc[0]})
example_doc

Unnamed: 0,sentence,ext,rg
0,"hey y' all , i 've been a lurker in this commu...",0,0.040456
1,ordinarily this is something more properly po...,0,0.022387
2,what i 've got for y' all today are some beta...,0,0.065609
3,## iksof ( ios kernel symbol offset finder ),1,0.097218
4,just another in the long list of offset finde...,0,0.031838
5,"when i first started writing this , the idea ...",0,0.023288
6,i recently realized that there 's actually a ...,0,0.020016
7,"regardless , this is a thing i made , and i t...",0,0.015217
8,"in essence , it tries to get all symbol offse...",0,0.019918
9,it 's open - source and on github here :,0,0.00834


In [8]:
example_doc[example_doc['ext']==1]

Unnamed: 0,sentence,ext,rg
3,## iksof ( ios kernel symbol offset finder ),1,0.097218
32,"in short , it 's a modded version of xcode th...",1,0.059257
42,"right now , i 've been focusing on the jailbr...",1,0.062502


In [9]:
train['summary'].iloc[0]

"i 'm publishing betas of some stuff i 've been working on : * ** iksof ** - ios kernel symbol offset finder * ** logos + + ** - a superset of logos that supports swift * ** xpwnd ** - a modded version of xcode designed to aid each level of the jailbreak stack stay tuned , ~ tomnific"

#### Statistics on the number of sentences per document

In [10]:
es = train['ext_labels'].str.len()
es.describe().apply("{0:.0f}".format)

count    1590132
mean          16
std           15
min            1
25%            7
50%           11
75%           19
max          976
Name: ext_labels, dtype: object

### From dataframe to dictionary

In [None]:
train = train.to_dict(orient="index")
val = val.to_dict(orient="index")
test = test.to_dict(orient="index")