In [1]:
# Uncomment if notebook is run in Colab
# %%capture
# !pip install datasets
# !pip install rouge-score

**Step 0:** Download dataset

**Step 1:** (Filtering based on source texts: filter duplicates)

**Step 2:** (Filtering based on source texts: filter non-useful)

**Step 3:** (Filtering based on summaries: filter non-useful)

**Step 4:** Aggregate the indices of the above & filter them from the dataset

**Step 5:** (Filtering based on summaries: filter duplicates)

* A duplicate summary field ('tldr') does not necessarily indicate a duplicate data point
* Find candidate duplicates based on duplicate summaries

**Step 6:** Cross-check whether the candidate duplicates of step 5 are indeed duplicates

**Step 7:** Filter the indices defined in step 6 from the dataset

In [2]:
import datasets
import pandas as pd
import nltk
import re
import matplotlib.pyplot as plt
import numpy as np
import random
from datasets import load_dataset, load_metric
from IPython.display import display, HTML

import warnings
warnings.filterwarnings('ignore')

In [3]:
# Note: load_metric is deprecated in recent versions of datasets in favour of the evaluate library
rouge = load_metric('rouge', seed=42)

# Helper functions

def show_random_elements(dataset, num_examples=3):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

def rouge_2_recall(target_text_1, target_text_2):
    # target_text_1 is used as the reference and target_text_2 as the prediction;
    # both are lists of strings
    rouge_output = rouge.compute(predictions=target_text_2, references=target_text_1)
    # Average the low/mid/high bootstrap estimates of ROUGE-2 recall and round to one decimal
    rouge2 = rouge_output['rouge2']
    return round((rouge2.low.recall + rouge2.mid.recall + rouge2.high.recall) / 3, 1)

def clean_string(string):
    # Raw string so that '\*' is passed to the regex engine as an escaped asterisk
    string = re.sub(r'\*', '', string).lower().rstrip()
    return string

def remove_duplicate_sets_from_list(candidate_duplicates):
    # Treat each candidate group as an unordered set and keep one copy of each distinct set
    unique_sets = set(frozenset(item) for item in candidate_duplicates)
    return [list(item) for item in unique_sets]

**Step 0: Download Reddit-TIFU**

- No train/validation/test split is provided or mentioned anywhere for this dataset
- We download Reddit TIFU from Hugging Face Datasets
- Passing `split='train'` downloads the whole dataset

In [4]:
reddit_tifu = load_dataset('reddit_tifu', 'long', split='train')

Found cached dataset reddit_tifu (C:/Users/Anna/.cache/huggingface/datasets/reddit_tifu/long/1.1.0/3136b11fbef3f2517de1d720621af110bd29e6083aebeab0d8ec198c9f95dc95)


In [5]:
reddit_tifu

Dataset({
    features: ['ups', 'num_comments', 'upvote_ratio', 'score', 'documents', 'tldr', 'title'],
    num_rows: 42139
})

In [6]:
reddit_tifu[8200]

{'ups': 0.0,
 'num_comments': 0.0,
 'upvote_ratio': 0.5,
 'score': 0.0,
 'documents': 'so this happened last week. \ni am a college junior and i am in a business communications course which is probably the most time consuming class i\'ve taken in college. \n\nanyways... the way this course is designed is your group gets a real live client (a local organization) and you do some sort of consulting for them. you always get a project manager (usually a ta) who you report everything you do to, including a prescreen of the final presentation to our client -- which is where the fuck up happens. \n\ngearing up for our final presentation, our project manager (who we will call gabe for the rest of this story) asked us to do the presentation for him before we present to the client. we all have extremely busy schedules, so the only time and place that worked for gabe and the team was 9pm in one of the reservable group rooms at the library. gabe had requested that we all show up in business profess

In [7]:
# 3 random examples from the Reddit TIFU dataset:

show_random_elements(reddit_tifu)

Unnamed: 0,ups,num_comments,upvote_ratio,score,documents,tldr,title
0,5.0,0.0,0.87,5.0,"a few years back, i was living in a really cold part of australia called armidale. i was in the scouts, and can usually light a fire with one match. but on one particularly cold night, i just could not get the fire to light. i must have used half a box of matches and firelighter bricks. would not start. so, then i had the brilliant idea ""hey, there's a bottle of methylated spirits under the sink, i'll slosh some of that in, bet that works!"", so i sloshed a heap of metho on the wood, lit a match, and went to light it...\n\nnow, here's where the lack of forward thinking comes in. usually if you use metho to start a fire, you get a happy little whumpf and the fire is lit. however... metho is a volatile. and although the wood wasn't burning, it was extremely hot from all those fire lighters and failed attempts. so when i leant forward with my lit match, rather than lighting a nice safe pool of liquid metho, and getting a happy little whumpf, what the match met was actually a gaseous fuel-air mix, and i got a kkka-fucking-krummppfo which blew the door off the fire place, blew several now happily burning bits of wood into the room, set the carpet on fire, and took off all the hair on my right arm and the right side of my face. but got the fire lit, so there's that.\n\ni'm just glad my genius brain remembered the metho under the sink, and not the petrol in the jerry can in the garage...",blew up my fireplace,lighting a fire
1,8.0,22.0,0.63,8.0,"*this was typed on monday*\n\nso today was a normal monday, comming back from baseball practice and immediately getting in the shower to go to boy scouts. everything seems normal getting into the shower, no stomach ache and feeling completely normal. \n\nso, i get into the shower and start doing my normal routine. i start to feel this fart comming along, completely normal. the fart itself felt normal, but i hear a little bit more than usual water hit the floor. i look behind me and i see diarrhea all over the shower behind me, i immediately run to the toilet mi'd shit. when i made it onto the toilet there is a trail of shit on the floor. after that i was like ok that was weird and got into the car for boy scouts.\n\ni am at boy scouts and i feel it again. i grab a buddy and run to the bathroom. i shit my fucking brains out in a church bathroom, go back into the room and tell my mom that we need to leave now, we get into the car and drive home. i run into the house farting constantly, thankfully i make it to the bathroom before i liquid-shit everywhere. \n\ni write to you now still sitting on the toilet because my stomach feels like it is going to explode.","thought i was gonna fart in the shower, dirrhea everywhere. i thought i was fine going to boy scouts but, again, i liquid-shit everywhere. still sitting on the toilet.",farting in the shower. (nsfw)
2,2.0,5.0,0.75,2.0,"big block of text\n\n \n\nthis is completely my fault for what she did but i want some insight on if what she did was because of me.\n\ntwo days ago i was in a band play we skipped 4 classes to perform for our schools football team. this happened two days in a role so i was completely lost in math. my friend got a copy of the homework that was due the very next day. and my friend was also in my band class. as the usual high school students we are we all decided to copy each other. before i get to copy it, my friend messaged me about how the teacher caught him cheating using google docs to copy and paste the assignment. but my friend told her he wasn't done fair thing he gets to turn it in later. my friend warns me not to copy word for word. which i didn't, i changed the words up and tried to do it my self. all work done and turned in.\n\nvery next day the math teacher wasn't here. we then decided to play some cards, cool. 15 minutes later she comes and and makes us put the cards away. she then called out 6 people. i thought i got away with this. but then she called out another 4 and my friend and i was in those name. she told us blah blah blah no grade for this assignment, and i would drop your grades. fine fair deal. got what i deserved!\n\nthe very next day she came and grabbed my friend and i again. this time she spoke to us individually. i had a feeling she wasn't going to write my recommendation anymore because of what happen. she told me exactly what i type. ""blah blah blah... because of the recent incident i do not feel comfortable writing your recommendation anymore"". this slowly sank in to me and it kind of made me worry a bit, because i was expecting her not to write me it, but i wasn't really expecting her to not write it.\n\none hour later we have math class. she announces there would be a test on everything we learned from 11th grade till 12th. this was because of the people that cheated. 
our topic we are own wasn't completely finished but i knew she was upset about it. she would never do this to us having a test. later i find out 4 of my other friends, had the same treatment. \n\nbut further on when i went home to remove her recommendation from my college list. i noticed something very weird about the recommendation she was about to ""write"" for me. on the bottom for her name ""leva lincoln""- started on november 20 2015.\nthen on top of it that it said ""date added, september 20th 2015""\nthis suddenly struck me, that probably her ""not feeling comfortable"" was her excuse for not writing it. i told her 2 days in advance i need it by thanksgiving. she said ""okay i would have it done by monday"". but in my head i am thinking right now what if that was her excuse to not write it because she was late to start it? i gave her 2 month times to write it, but i am shocked i found out today that she started to write it today.\n\nso i am asking you guys, just for your input did she not write it because of us cheating and breaking her trust or she did not write it because she didn't have time to do the rest of ours. she made us fill out a 15 questionnaire each consisting of a paragraph of text each 2 days after we asked her to write it. \n\nso did she really, not feel comfortable or was it an excuse that was really well thought out. thing is 2 other people didn't get their recommendations cancel, i can go into more details about this if you guys want. \n\nactions have consequences, i take all responsibilities but i just want you guys input on this.","teacher caught us cheating, 4 people got no recommendations 1 unit test, or did she write it?",copying my friend on his homework causing a domino effect.


**Step 1:** (Filtering based on source texts: filter duplicates)
* Inspect Reddit-TIFU for duplicates of the source texts ('documents' column)

In [8]:
reddit_tifu_df = reddit_tifu.to_pandas()

In [9]:
reddit_tifu_df.iloc[20094] # Random element

ups                                                           5.0
num_comments                                                  6.0
upvote_ratio                                                 0.87
score                                                         5.0
documents       earlier this week*\n\nso, i have this intervie...
tldr            had an interview. forgot interviewers name. ca...
title                           asking an interviewer for a name.
Name: 20094, dtype: object

In [10]:
len(reddit_tifu_df['documents'].value_counts())

42101

- The value 42101 (the number of distinct 'documents' values) is smaller than the number of examples in the dataset (42139).
- This indicates that the 'documents' column contains duplicates.
- 42139-42101=38 *exact* duplicates that should be removed
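
The count above can be cross-checked with pandas' `duplicated`, which flags every occurrence of a value after the first. The following is a minimal sketch on a toy frame; `toy_df` and its contents are made up for illustration and are not taken from Reddit TIFU.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'documents' column
toy_df = pd.DataFrame({'documents': ['story a', 'story b', 'story a', 'story c', 'story a']})

# Number of rows minus number of distinct values gives the number of redundant rows
n_redundant = len(toy_df) - toy_df['documents'].nunique()

# duplicated(keep='first') marks every repeat of a value except its first occurrence
duplicate_indices = toy_df.index[toy_df['documents'].duplicated(keep='first')].tolist()

print(n_redundant)
print(duplicate_indices)
```

Both quantities describe the same rows: every index returned by `duplicated(keep='first')` is one redundant copy, so the length of that list should equal the difference between the row count and the distinct count.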

In [11]:
# Find the row indices of reddit_tifu_df whose 'documents' value is an exact duplicate
# and store them in the exact_duplicates_texts_indices variable

# Count the values of the field 'documents' that occur more than once 
# print(len(reddit_tifu_df['documents'].value_counts()[reddit_tifu_df['documents'].value_counts() > 1]))

# Identify exact duplicates in the 'documents' column
# exact_duplicates is a Series mapping each duplicated document to its number of occurrences
documents_value_counts = reddit_tifu_df['documents'].value_counts()
exact_duplicates = documents_value_counts[documents_value_counts > 1]

exact_duplicates_df = pd.DataFrame({'value': exact_duplicates.index, 'occurencies_count': exact_duplicates.values})

# exact_duplicates_df['occurencies_count'].sum()

exact_duplicates_texts_indices_lists = []

for element in exact_duplicates_df['value'].to_list():
    element_occurence_indices = reddit_tifu_df.index[reddit_tifu_df['documents'] == element].tolist()
    exact_duplicates_texts_indices_lists.append(element_occurence_indices)

# In each list, the first index is treated as the "original" element and is kept;
# only the remaining indices are stored in exact_duplicates_texts_indices,
# since they are the duplicates to remove

exact_duplicates_texts_indices = []

for element in exact_duplicates_texts_indices_lists:
    for i in range(1, len(element)):
        exact_duplicates_texts_indices.append(element[i])

In [12]:
exact_duplicates_df

Unnamed: 0,value,occurencies_count
0,so this happened last week. \ni am a college j...,8
1,so this happened last week. \ni am a college j...,5
2,so this happened last week. \ni am a college j...,4
3,so this happened last week. \ni am a college j...,4
4,today i was invited to a mavericks game by my ...,2
5,"so this date backs to a couple of days ago, bu...",2
6,obligatory this didn't happen today. this happ...,2
7,a little bit of context for this. i am a 16 ye...,2
8,so i'm a young male and therefore an avid tind...,2
9,"this happened two days ago, and the only reaso...",2


In [13]:
len(exact_duplicates_texts_indices) 

38

The length of the *exact_duplicates_texts_indices* list confirms our initial finding: "42139-42101=38 exact duplicates that should be removed".

In [14]:
exact_duplicates_texts_indices

[8191,
 8192,
 8193,
 8194,
 8195,
 8196,
 8197,
 8209,
 8210,
 8211,
 8212,
 8187,
 8188,
 8189,
 8201,
 8202,
 8203,
 13634,
 3012,
 14806,
 15694,
 9779,
 8215,
 35357,
 17026,
 34357,
 16555,
 19022,
 32749,
 6682,
 8199,
 34186,
 10984,
 5846,
 11615,
 8205,
 9107,
 14940]

In [15]:
with open("Reddit_TIFU_exact_duplicates_texts_indices.txt", "w") as file:
    for idx in exact_duplicates_texts_indices:
        file.write(str(idx) + "\n")

**Step 2:** (Filtering based on source texts: filter non-useful)
* Inspect dataset for non-useful/problematic source texts ('documents' column)

In [16]:
not_useful_texts_indices = []

# Find the indices of the 'documents' that are empty or not text (e.g., punctuation marks only)

# A regular expression that describes text: a letter or a digit 1-9
# followed by at least one more (non-newline) character
text_pattern = re.compile(r"([a-z1-9])+..*", re.IGNORECASE)

for i in range(len(reddit_tifu_df)):
    document = reddit_tifu_df['documents'].loc[i]
    # Check the length of the individual document, not of the whole column
    if len(document) == 0 or not text_pattern.search(document):
        not_useful_texts_indices.append(i)

In [17]:
not_useful_texts_indices

[24268]

**Step 3:** (Filtering based on summaries: filter non-useful)

* Inspect dataset for non-useful/non-informative summaries ('tldr' column):
    *   nonsensical tldrs (e.g., punctuation marks only),
    *   tldrs that clearly are not a summary (e.g., "see title") 
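
To make the two checks concrete, here is a small sketch that applies the notebook's text pattern and a copy of the `clean_string` helper to a handful of strings. The example strings are made up for illustration. Note that the pattern requires a letter (or a digit 1-9) followed by at least one more character, so a single-letter summary such as "k" is also treated as not being text.

```python
import re

# Same pattern as used in this notebook
text_pattern = re.compile(r"([a-z1-9])+..*", re.IGNORECASE)

def clean_string(string):
    # Mirrors the notebook helper: drop '*', lowercase, strip trailing whitespace
    return re.sub(r'\*', '', string).lower().rstrip()

# Hypothetical tldr values, made up for illustration
examples = ['**', '?', 'k', '**See Title.** ', 'peppermint + bath']

# Summaries that the pattern does not recognise as text
not_text = [s for s in examples if not text_pattern.search(s)]

# Normalised summaries, ready to be compared against the list of non-useful phrases
cleaned = [clean_string(s) for s in examples]

print(not_text)
print(cleaned)
```

After cleaning, `'**See Title.** '` becomes `'see title.'`, which is how markdown-decorated variants end up matching entries in the list of non-useful phrases.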

In [18]:
# Find the indices of the items that are not useful (not informative);
#   - nonsensical tldrs (e.g., punctuation marks only),
#   - tldrs that clearly are not a summary (e.g., "see title") 

not_useful_tldrs_indices = []

# Find the indices of the TLDRs that are empty or not text (e.g., punctuation marks only)

# A regular expression that describes text: a letter or a digit 1-9
# followed by at least one more (non-newline) character
text_pattern = re.compile(r"([a-z1-9])+..*", re.IGNORECASE)

for i in range(len(reddit_tifu_df)):
    tldr = reddit_tifu_df['tldr'].loc[i]
    # Check the length of the individual tldr, not of the whole column
    if len(tldr) == 0 or not text_pattern.search(tldr):
        not_useful_tldrs_indices.append(i)

# Find the indices of the TLDRs that are not useful, e.g., "see title"

not_useful_tldrs = ['title', 'title.', 'title!',
                    'see title', 'see title.', 'see title!',
                    'in the title', 'in the title.',
                    'read title', 'read title.', 'read title!', 'read the title', 'read the title.', 'read the title!',
                    'read up', 'read up!', 'read up.',
                    'read it', 'read it!', 'read it.',
                    'at bottom', 'at bottom.', 'at bottom!',
                    'at the bottom', 'at the bottom.', 'at the bottom!',
                    'at the end', 'at the end.', 'at the end!',
                    'at the top', 'at the top.', 'at the top!',
                    'version:',
                    'upvote', 'upvote.', 'upvote!',
                    'mandatory summary/question!']

for i in range(len(reddit_tifu_df)):
    # clean_string removes the special character * that appears often in the original 'tldr' field but offers no practical value 
    clean_tldr = clean_string(reddit_tifu_df.loc[i]['tldr'])
    if clean_tldr in not_useful_tldrs:
        not_useful_tldrs_indices.append(i)

for indx in not_useful_tldrs_indices:
    print(reddit_tifu_df.loc[indx]['tldr'])

?
---------
**
( ͡° ͜ʖ ͡°)
,
**
**
~~
k
**
**
**
(( ͡° ͜ʖ ͡°)͜ʖ( ͡° ͜ʖ ͡°))*
-
**:
**
**
**
**
--
"
**
???
**
)**
*
:
*
**:
**
:
⬆️
**
:
,
;
/╲/( ͡° ͡° ͜ʖ ͡° ͡°)/\╱\
**
,
;
?**
**
**
]
*
**
-
**
*
**
**
:
;
**
**
:
**
:
**
**
.
**
;
/
**
;
*
💨 💨 🐝💦💦💻 😯😐
?
*
*
:
)
,
,
'
'
.**
'
]
.**
**
read the title
at the bottom
read the title
read the title
see title
version:
title.
at the bottom.
in the title.
version:
at the end.
see title
read the title.
read the title
at the bottom.
see title.
read the title.
title.
read it.
see title
read the title.
read the title.
at the bottom.
at the bottom.
at bottom.
at the bottom.
see title
read it.
at the bottom.
title
at bottom.
title.
read the title
read title
at the bottom.
title
title
read the title.
at the bottom.
at the bottom.
at the bottom
title
at the end.
title.
at bottom.
at the bottom
at bottom
at the bottom
at bottom.
read the title.
at the bottom.
at the bottom
at the end.


In [19]:
len(not_useful_tldrs_indices)

135

**Step 4:** Aggregate the indices of the above & filter them from the dataset

* Aggregate all the indices that should be removed, found so far

In [20]:
len(exact_duplicates_texts_indices)

38

In [21]:
len(not_useful_texts_indices)

1

In [22]:
len(not_useful_tldrs_indices)

135

In [23]:
# Aggregate all the indices that should be removed
indices_to_remove = exact_duplicates_texts_indices + not_useful_texts_indices + not_useful_tldrs_indices

In [24]:
len(indices_to_remove)

174

In [25]:
# Select the reddit_tifu indices to keep by removing the indices to remove

# A set makes the membership test below O(1) per index
indices_to_remove_set = set(indices_to_remove)

indices_to_keep = [x for x in range(len(reddit_tifu_df)) if x not in indices_to_remove_set]

In [26]:
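# .copy() so that a column can be added to the filtered frame later without a SettingWithCopyWarning
reddit_tifu_filtered_df = reddit_tifu_df.iloc[indices_to_keep].copy()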

In [27]:
reddit_tifu_filtered_df.head()

Unnamed: 0,ups,num_comments,upvote_ratio,score,documents,tldr,title
0,115.0,23.0,0.88,115.0,this actually happened a couple of years ago. ...,confuse a 5th grade girl for a boy in front of...,gender-stereotyping
1,16.0,12.0,0.79,16.0,"it was last october, but i'm feeling the fall-...","i found my estranged dad, thought i loved him ...",telling my dad that i love him.
2,55.0,10.0,0.85,55.0,so i had the brilliant idea to use veet hair r...,had my balls burned by sauron and was left dev...,i was deveeted...
3,90.0,20.0,0.92,90.0,today i was going to have a bath after a long ...,peppermint + bath = burning cold ladybits.,wanting a pepperminty bath.
4,81.0,18.0,0.79,81.0,"i haven't had a bath in practically years so, ...","got too high and too hot in the bath, almost c...",having a spliff in the bath.


**Step 5:** (Filtering based on summaries: filter duplicates)

* Next, we look for candidate duplicates based on the values of the 'tldr' column
* A duplicate summary ('tldr') in Reddit TIFU does not necessarily indicate a duplicate data point, so we call these data points 'candidate' duplicates
* Find candidate duplicates based on duplicate summaries
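
One way to collect such candidates is to group rows by their cleaned summary and keep only the groups with more than one member. The sketch below does this on a toy frame; `toy_df` is hypothetical, and the string operations are assumed to be equivalent to the notebook's `clean_string` helper (drop `*`, lowercase, strip trailing whitespace).

```python
import pandas as pd

# Hypothetical toy data; the notebook itself works on reddit_tifu_filtered_df
toy_df = pd.DataFrame({
    'tldr': ['Think before you speak',
             'i am a monster',
             '**think before you speak** ',
             'unique summary'],
})

# Normalise the summaries: drop '*', lowercase, strip trailing whitespace
toy_df['clean_tldr'] = (
    toy_df['tldr'].str.replace('*', '', regex=False).str.lower().str.rstrip()
)

# Map each cleaned summary that occurs more than once to the row indices sharing it
candidate_groups = {
    summary: indices.tolist()
    for summary, indices in toy_df.groupby('clean_tldr').groups.items()
    if len(indices) > 1
}

print(candidate_groups)
```

The rows in each group are only candidates: their source texts still have to be compared before any of them is treated as a duplicate.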

In [28]:
# E.g., the following two elements of Reddit TIFU,
# have the same 'tldr' but are not duplicates

print("\n**Reddit TIFU indx 20074**")
print(f"TLDR SUMMARY: {reddit_tifu_df.loc[20074]['tldr']}")
print(f"SOURCE TEXT: {reddit_tifu_df.loc[20074]['documents']}")

print("\n**Reddit TIFU indx 23123**")
print(f"TLDR SUMMARY: {reddit_tifu_df.loc[23123]['tldr']}")
print(f"SOURCE TEXT: {reddit_tifu_df.loc[23123]['documents']}")


**Reddit TIFU indx 20074**
TLDR SUMMARY: think before you speak
SOURCE TEXT: yipee, this just happened (+5 tifu points)

i was watching the anzac ceremony in gallipoli on tv, like a new zealander should, but the volume was on 2. usually my family listens to the tv on volume 8-11 so my ears struggled to pick up the sounds.

i was not in the vicinity of the remote so i was unable to do it myself.

heres the fu:

i said "its very quiet" while they were playing the national anthem of turkey.

in my head i was wanting to hear the national anthem as i had never heard it before. instead it sounded like an offensive joke because there was a silence while the anthems play. awkward looks ensued.

**Reddit TIFU indx 23123**
TLDR SUMMARY: think before you speak
SOURCE TEXT: little background info: sometimes i blurt out things before i realize it wasn't a good idea to say it.

anyway were standing around and my friend was talking about what this guy could play when he uses the wah wah pedal ( http

"think before you speak" seems to be a popular conclusion :)

**Step 6:** Cross-check whether the candidate duplicates of step 5 are indeed duplicates

In [52]:
summaries_duplicates_value_counts = reddit_tifu_filtered_df['tldr'].apply(clean_string).value_counts()

In [53]:
summaries_duplicates_value_counts

tldr
i'm an idiot.                                                                                                                                                                                                                                                                                                              8
doesn't matter, had sex.                                                                                                                                                                                                                                                                                                   3
4th grade me thought i was a good magician, failed miserably doing a magic show in front of the class, and cried miserably at home.                                                                                                                                                                                        3
my girlfriends dad found out about our secre

In [58]:
exact_duplicates = reddit_tifu_filtered_df['tldr'].value_counts()[reddit_tifu_filtered_df['tldr'].value_counts() > 1]

In [62]:
# for sanity check
len(exact_duplicates)

69

In [32]:
reddit_tifu_filtered_df['clean_tldr'] = reddit_tifu_filtered_df['tldr'].apply(clean_string)

# Initialize a dictionary to store duplicate indices
candidate_duplicates_indices_dict = {}

# Find duplicates in the 'clean_tldr' column
duplicates_mask = reddit_tifu_filtered_df.duplicated(subset=['clean_tldr'], keep=False)

# Filter the DataFrame to get rows with duplicated values
duplicates_df = reddit_tifu_filtered_df[duplicates_mask]

# Iterate over each unique value in the 'clean_tldr' column
for value in duplicates_df['clean_tldr'].unique():
    # Get indices of rows with this value
    indices = duplicates_df[duplicates_df['clean_tldr'] == value].index.tolist()
    
    # Store indices in the dictionary
    candidate_duplicates_indices_dict[value] = indices

In [33]:
candidate_duplicates_indices_dict

{'i am a monster': [583, 11581],
 "i'm an idiot": [698, 30080],
 'never trust a fart.': [729, 13791, 25478],
 'delicious pocket snack inadvertently triggers a tsunami of shit that wrecks up our house, wife and i clean up with comically ineffectual tools.': [1183,
  10660],
 "sht happens and it doesn't care where or with whom you are.": [1687, 10869],
 'left hostel door open and woke up with a room full of pussy': [1814, 19438],
 'i grabbed a hose in the dark and got a shitload more than i bargained for.': [1831,
  2516],
 '4th grade me thought i was a good magician, failed miserably doing a magic show in front of the class, and cried miserably at home.': [2083,
  4230,
  14912],
 "got a girl's number, walked to what i thought was the outdoors, turns out it was a glass wall": [2203,
  2213],
 'pierced my member, posted pics online, got infected, washed out in cup, friend drank from cup, mom and dad found posted pics.': [2398,
  16932],
 "don't lie to your parents": [2655, 5015],
 'went 

In [34]:
len(candidate_duplicates_indices_dict)

69

* After finding the candidate duplicates based on the 'tldr' column, we compare the corresponding source texts ('documents' column) to determine whether they are actual duplicates
* ROUGE-2 recall is used to measure the similarity of the source texts
* Two texts are considered duplicates if their ROUGE-2 recall, rounded to one decimal place by the `rouge_2_recall` helper, is ≥ 0.8
* This way of computing similarity is based on the approach used in *Zhang, J., Zhao, Y., Saleh, M., & Liu, P. (2020, November). Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning (pp. 11328-11339). PMLR.*
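
For intuition, ROUGE-2 recall is the fraction of the reference text's bigrams that also appear in the other text. The sketch below is a simplified stand-in, not the `rouge` metric used in this notebook: it assumes plain whitespace tokenization and does no bootstrap aggregation, so its values will not match the library exactly. The function names are hypothetical.

```python
from collections import Counter

def bigram_counts(text):
    # Simplifying assumption: lowercase and split on whitespace
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def simple_rouge_2_recall(reference, candidate):
    ref = bigram_counts(reference)
    cand = bigram_counts(candidate)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter intersection clips each bigram to its smaller count in the two texts
    overlap = sum((ref & cand).values())
    return overlap / total

reference = "the cat sat on the mat"
near_copy = "the cat sat on a mat"

print(simple_rouge_2_recall(reference, reference))
print(simple_rouge_2_recall(reference, near_copy))
```

Recall is measured against the reference, so the score is not symmetric: swapping the two texts can give a different value when they differ in length.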

In [35]:
print("The following data points will have the same summary-\"tldr\":\n")
print(reddit_tifu_filtered_df.loc[583])
print("\n")
print(reddit_tifu_filtered_df.loc[11581])

The following data points will have the same summary-"tldr":

ups                                                         181.0
num_comments                                                 24.0
upvote_ratio                                                 0.91
score                                                       181.0
documents       i recently got new car with manual transmissio...
tldr                                               i am a monster
title           making my 4 year old running face first into g...
clean_tldr                                         i am a monster
Name: 583, dtype: object


ups                                                          10.0
num_comments                                                 19.0
upvote_ratio                                                  0.8
score                                                        10.0
documents       like so many of these stories, this happen a w...
tldr                                               i 

In [36]:
import itertools

duplicates_tldrs_indices = []
pairs_of_duplicates_tldrs_indices = []

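for _, value in candidate_duplicates_indices_dict.items():
    
    # All index pairs within a group of posts that share the same cleaned tldr
    pairs = list(filter(lambda x: x[0] <= x[1], itertools.combinations(value, 2)))

    for pair in pairs:

        indx_1 = pair[0]
        indx_2 = pair[1]

        target_1 = reddit_tifu_filtered_df.loc[indx_1]['documents']
        target_2 = reddit_tifu_filtered_df.loc[indx_2]['documents']
      
        # Mark indx_2 (the later index) as a duplicate when the source texts overlap enough;
        # in groups of three or more, the same index can be appended more than once
        if rouge_2_recall([target_1], [target_2])>=0.8:        
            duplicates_tldrs_indices.append(indx_2)
            pairs_of_duplicates_tldrs_indices.append(pair)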

In [37]:
pairs_of_duplicates_tldrs_indices

[(1183, 10660),
 (1687, 10869),
 (1814, 19438),
 (1831, 2516),
 (2083, 4230),
 (2083, 14912),
 (4230, 14912),
 (2203, 2213),
 (2398, 16932),
 (2655, 5015),
 (2682, 4823),
 (2929, 3864),
 (3008, 3010),
 (3425, 7720),
 (3702, 13027),
 (3865, 13881),
 (3995, 13287),
 (4738, 4739),
 (4921, 4923),
 (4921, 4926),
 (4923, 4926),
 (5414, 5990),
 (6070, 7992),
 (6070, 34588),
 (7992, 34588),
 (6378, 19823),
 (6539, 30581),
 (6699, 6701),
 (8166, 37384),
 (8190, 8200),
 (8206, 8207),
 (8206, 8208),
 (8207, 8208),
 (8213, 8214),
 (8563, 8564),
 (9930, 9960),
 (11607, 11612),
 (11695, 20396),
 (12273, 15165),
 (12559, 12579),
 (12705, 12706),
 (14133, 16483),
 (14721, 14747),
 (14983, 16673),
 (15678, 15679),
 (15856, 21575),
 (17505, 17516),
 (20121, 20123),
 (22133, 22138),
 (26339, 35285),
 (27895, 34946),
 (29700, 34060),
 (29863, 38575),
 (29974, 38085),
 (30045, 31548),
 (32106, 34216),
 (38416, 40432),
 (41524, 41525)]

In [51]:
reddit_tifu_df.loc[29974]['documents']

'hello tifu community,\n\ni am /u/shylo132 one of the *[newer moderators](http://i.imgur.com/t97yu6n.jpg?1)* that came to be in our last round of hiring! i am also probably known as one of the nicer mods and a heavy meme poster when interacted with.\n\nthe moderator team and i want to get to know you a [little better.](http://i.imgur.com/ibxo734.jpg) we also want your opinion on how we can make our awesome subreddit grow and expand to heights it has not yet seen before!\n\nto do this we need your help by filling out **[tifu\'s survey for 2016!](https://docs.google.com/forms/d/1sc5qz1wzntoykbj7rxmt8dpr2z8qlaizzqzdyafqlry/viewform)** this survey will stay up for 2 weeks to allow everyone a chance for input. the *[end date](http://i.imgur.com/9ornxnc.jpg)* of this survey is **12 june 16**\n\nthank you for your participation and may your *[fuck ups](http://i.imgur.com/c4moylz.jpg)* be all our entertainment.\n\n---\n \n\nupdate: be sure to attach this "[i voted](http://ecx.images-amazon.com

In [50]:
reddit_tifu_df.loc[38085]['documents']

'hello tifu community, sadly due to lack of contest participation we get to resort to this copy/pasta post! yay...\n\ni am /u/shylo132 one of the now well *[seasoned moderators!](http://i.imgur.com/t97yu6n.jpg?1)* i am known as the nicest mod and a meme poster when interacted with.\n\nthe moderator team and i want to get to know you a [little better.](http://i.imgur.com/ibxo734.jpg) we also want your opinion on how we can make our awesome subreddit grow and expand to heights it has not yet seen before!\n\nto do this we need your help by filling out **[tifu\'s survey for 2017!](https://docs.google.com/forms/d/e/1faipqlsei583a6bxsgtm2p1xt80tnynhyperlrqgtsrrp-9bro1eqnq/viewform)** this survey will stay up for 2 weeks to allow everyone a chance for input. the *[end date](https://www.timeanddate.com/countdown/to?iso=20170609t00&p0=37&msg=tifu+census+2017+closed&ud=1&font=sanserif&csz=1)* of this survey is **9 june 17**.\n\nthank you for your participation and may all your *[fuck ups](http:/

In [40]:
# Sanity check: there should be no common elements in the two lists
set(duplicates_tldrs_indices) & set(indices_to_remove)

set()

**Step 7:** Filter the indices defined in step 6 from the dataset

**Reddit TIFU indices that correspond to duplicates**

In [41]:
reddit_tifu_duplicates_indices = exact_duplicates_texts_indices + duplicates_tldrs_indices

In [42]:
len(reddit_tifu_duplicates_indices)

96

**Reddit TIFU indices that will be removed from the dataset (duplicates + not useful)**

In [43]:
reddit_tifu_indices_to_remove = exact_duplicates_texts_indices \
                                + duplicates_tldrs_indices \
                                + not_useful_texts_indices \
                                + not_useful_tldrs_indices

In [44]:
len(reddit_tifu_indices_to_remove)

232

In [45]:
# Select the reddit_tifu indices to keep by removing the indices to remove

# A set makes the membership test below O(1) per index
reddit_tifu_indices_to_remove_set = set(reddit_tifu_indices_to_remove)

reddit_tifu_indices_to_keep = [element for element in range(len(reddit_tifu_df)) if element not in reddit_tifu_indices_to_remove_set]

In [46]:
len(reddit_tifu_indices_to_keep)

41911
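
Note that 42139 - 41911 = 228 rows are dropped, while the list of indices to remove has 232 entries. This is consistent with the pair list above: when three posts share a summary and every pair matches, as with (2083, 14912) and (4230, 14912), the later index is appended once per pair. The repeats do not affect the final selection, because the membership test ignores them. A minimal sketch of the effect, using made-up indices:

```python
# Hypothetical matching pairs for one group of three posts sharing a summary
pairs = [(10, 20), (10, 30), (20, 30)]

# Appending the second index of each pair repeats index 30
duplicate_indices = [second for _, second in pairs]

# Deduplicate while preserving first-seen order
unique_duplicate_indices = list(dict.fromkeys(duplicate_indices))

# Filtering with a set gives the same result whether or not repeats are present
to_remove = set(duplicate_indices)
indices_to_keep = [i for i in range(40) if i not in to_remove]

print(duplicate_indices)
print(unique_duplicate_indices)
print(len(indices_to_keep))
```

Deduplicating before reporting lengths makes the counts of removed and kept rows add up to the dataset size.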

In [47]:
with open('reddit_tifu_indices_to_keep.txt', 'w') as f:
    for item in reddit_tifu_indices_to_keep:
        f.write("%s\n" % item)