In [2]:
import numpy as np
import pandas as pd
import json

***Initial download and exploration of samsum dataset***

In [3]:
with open('samsum/data/train.json', 'r') as f:
    train = json.load(f)
with open('samsum/data/test.json', 'r') as f:
    test = json.load(f)
with open('samsum/data/val.json', 'r') as f:
    val = json.load(f)

In [4]:
print(f'''Train size: {len(train)}
Test size:  {len(test)}
Val size:   {len(val)}''')

Train size: 14732
Test size:  819
Val size:   818


In [5]:
sam_train = pd.DataFrame(train)
sam_test = pd.DataFrame(test)
sam_val = pd.DataFrame(val)

display(sam_train.head())
display(sam_train.isna().sum())

Unnamed: 0,id,summary,dialogue
0,13818513,Amanda baked cookies and will bring Jerry some...,Amanda: I baked cookies. Do you want some?\r\...
1,13728867,Olivia and Olivier are voting for liberals in ...,Olivia: Who are you voting for in this electio...
2,13681000,Kim may try the pomodoro technique recommended...,"Tim: Hi, what's up?\r\nKim: Bad mood tbh, I wa..."
3,13730747,Edward thinks he is in love with Bella. Rachel...,"Edward: Rachel, I think I'm in ove with Bella...."
4,13728094,"Sam is confused, because he overheard Rick com...",Sam: hey overheard rick say something\r\nSam:...


id          0
summary     0
dialogue    0
dtype: int64

In [6]:
# randint's upper bound is exclusive, so len(sam_train) lets every row be picked
ex = sam_train.iloc[np.random.randint(0, len(sam_train))]

print('Dialogue:')
print(ex.dialogue)
print()
print('Summary')
print(ex.summary)


Dialogue:
Jean: Is anybody in Brussels now?
Terry: I'm here
Terry: But I'm leaving tomorrow morning
Theresa: I'm heading to Brussels again tomorrow
Jean: ahahah, just like Theresa May, and also not happy it seems
Terry: 🤣🤣🤣
Theresa: hahah, indeed not excited 
Jean: to negotiate a deal?
Theresa: not really
Theresa: However, marriage is a kind of a deal! right?
Terry: Sure, it is
Theresa: We're marring with Paul on Friday
Terry: what??? And I don't know anything? 😱
Theresa: it's only to get the Belgian citizenship after Brexit, pure formality, no party, not even parents evolved
Jean: We can at least have some drinks Friday night
Theresa: With pleasure, let's celebrate a fake wedding with a fake party 😛

Summary
Terry is in Brussels, he's leaving tomorrow morning. Theresa is going to Brussels again tomorrow. She's marrying Paul on Friday as a formality to get the Belgian citizenship after Brexit. Jean will celebrate with Theresa on Friday night. 


***Initial download and exploration of dialoguesum dataset***

In [7]:
da_train = pd.read_csv('dialoguesum/train.csv')
da_test = pd.read_csv('dialoguesum/test.csv')
da_val = pd.read_csv('dialoguesum/validation.csv')
da_hold = pd.read_csv('dialoguesum/holdout.csv')

display(da_train.head())
display(da_train.isna().sum())

Unnamed: 0,id,dialogue,summary,topic
0,train_0,"#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. ...","Mr. Smith's getting a check-up, and Doctor Haw...",get a check-up
1,train_1,"#Person1#: Hello Mrs. Parker, how have you bee...",Mrs Parker takes Ricky for his vaccines. Dr. P...,vaccines
2,train_2,"#Person1#: Excuse me, did you see a set of key...",#Person1#'s looking for a set of keys and asks...,find keys
3,train_3,#Person1#: Why didn't you tell me you had a gi...,#Person1#'s angry because #Person2# didn't tel...,have a girlfriend
4,train_4,"#Person1#: Watsup, ladies! Y'll looking'fine t...",Malik invites Nikki to dance. Nikki agrees if ...,dance


id          0
dialogue    0
summary     0
topic       0
dtype: int64

In [8]:
# randint's upper bound is exclusive, so len(da_train) lets every row be picked
ex2 = da_train.iloc[np.random.randint(0, len(da_train))]
print(ex2.dialogue)
print()
print(ex2.summary)

#Person1#: Do you want to go to sleep, or do you want to stay up and watch a movie? I'm pretty tired, but I'm always up for a horror movie. It is Halloween, after all...
#Person2#: I'd love to, but not tonight. I ate too much candy, and I'm so exhausted from trick-or-treating all night with the boys from the neighborhood. I need to rest!

#Person1# would like to see a horror movie but #Person2# is too tired from trick-or-treating and needs a rest.


The samsum and dialoguesum datasets share nearly the same format: each row has an id, a dialogue, and a summary. Dialoguesum adds a "topic" column that holds only a short label for each conversation, which we don't need here. We'll drop the "topic" column and combine the two datasets split by split, with the dialoguesum holdout set going into the combined test split.
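Before combining, it can help to confirm that both sources really expose the same columns. The sketch below is one possible way to do that, not the method used in this notebook: `combine_splits` is a hypothetical helper, and the two toy frames are made-up stand-ins for `sam_train` and `da_train`.

```python
import pandas as pd

def combine_splits(frames, keep=("id", "dialogue", "summary")):
    """Concatenate frames after checking that each one has the kept columns.

    Hypothetical helper for illustration; column names follow the notebook.
    """
    keep = list(keep)
    for i, frame in enumerate(frames):
        missing = set(keep) - set(frame.columns)
        if missing:
            raise ValueError(f"frame {i} is missing columns: {sorted(missing)}")
    # Selecting `keep` drops extras such as "topic" and fixes the column order.
    return pd.concat([frame[keep] for frame in frames], ignore_index=True)

# Toy stand-ins with made-up rows (not real dataset content)
toy_sam = pd.DataFrame({"id": ["1"], "summary": ["s1"], "dialogue": ["A: hi"]})
toy_da = pd.DataFrame({
    "id": ["train_0"],
    "dialogue": ["#Person1#: hi"],
    "summary": ["s2"],
    "topic": ["greeting"],
})

combined = combine_splits([toy_sam, toy_da])
print(combined.columns.tolist())
print(len(combined))
```

Selecting the shared columns explicitly, rather than dropping "topic" by name, means any other unexpected column would also be left out, and a missing column fails loudly instead of silently producing NaN values.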

In [11]:
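# ignore_index=True gives each combined split a fresh 0..n-1 index instead of repeated row labels
df_train = pd.concat([sam_train, da_train.drop(columns='topic')], ignore_index=True)
df_val = pd.concat([sam_val, da_val.drop(columns='topic')], ignore_index=True)
df_test = pd.concat([sam_test, da_test.drop(columns='topic'), da_hold.drop(columns='topic')], ignore_index=True)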

display(df_train.head())
print(len(df_train))

display(df_val.head())
print(len(df_val))

display(df_test.head())
print(len(df_test))

Unnamed: 0,id,summary,dialogue
0,13818513,Amanda baked cookies and will bring Jerry some...,Amanda: I baked cookies. Do you want some?\r\...
1,13728867,Olivia and Olivier are voting for liberals in ...,Olivia: Who are you voting for in this electio...
2,13681000,Kim may try the pomodoro technique recommended...,"Tim: Hi, what's up?\r\nKim: Bad mood tbh, I wa..."
3,13730747,Edward thinks he is in love with Bella. Rachel...,"Edward: Rachel, I think I'm in ove with Bella...."
4,13728094,"Sam is confused, because he overheard Rick com...",Sam: hey overheard rick say something\r\nSam:...


27192


Unnamed: 0,id,summary,dialogue
0,13817023,A will go to the animal shelter tomorrow to ge...,"A: Hi Tom, are you busy tomorrow’s afternoon?\..."
1,13716628,Emma and Rob love the advent calendar. Lauren ...,Emma: I’ve just fallen in love with this adven...
2,13829420,Madison is pregnant but she doesn't want to ta...,Jackie: Madison is pregnant\r\nJackie: but she...
3,13819648,Marla found a pair of boxers under her bed.,Marla: <file_photo>\r\nMarla: look what I foun...
4,13728448,Robert wants Fred to send him the address of t...,Robert: Hey give me the address of this music ...


1318


Unnamed: 0,id,summary,dialogue
0,13862856,Hannah needs Betty's number but Amanda doesn't...,"Hannah: Hey, do you have Betty's number?\nAman..."
1,13729565,Eric and Rob are going to watch a stand-up on ...,Eric: MACHINE!\r\nRob: That's so gr8!\r\nEric:...
2,13680171,Lenny can't decide which trousers to buy. Bob ...,"Lenny: Babe, can you help me with something?\r..."
3,13729438,Emma will be home soon and she will let Will k...,"Will: hey babe, what do you want for dinner to..."
4,13828600,Jane is in Warsaw. Ollie and Jane has a party....,"Ollie: Hi , are you in Warsaw\r\nJane: yes, ju..."


2419


In [13]:
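# Uncomment to save the combined splits; index=False keeps the row index out of the files
#df_train.to_csv('train.csv', index=False)
#df_val.to_csv('val.csv', index=False)
#df_test.to_csv('test.csv', index=False)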