The shelved submissions and comments datasets were downloaded from https://nicschrading.com/data/.

In [1]:
import shelve
import pandas as pd

**redditAbuseSubmissions**:
This dataset is a balanced set of 552 abuse submissions and 552 non-abuse submissions. Each submission has been parsed by the Illinois Curator to obtain semantic role labels. It has the variables:

data: A list of submission titles and text concatenated, 1 entry per submission.

labels: A list of labels (abuse or non_abuse), 1 entry per submission.

subIds: A list of reddit submission ids, 1 entry per submission.

roles: A list of lists. Each inner list has the semantic role labels in a submission. 1 list per submission.

predicates: A list of lists. Each inner list holds the (predicate, sense number) tuples found in a submission. 1 list per submission.
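
Because every variable holds one entry per submission, all of the lists should have the same length, which is what lets the shelf contents be passed straight to `pd.DataFrame`. The following is a minimal sketch of that sanity check. The helper `shelf_lengths`, the file name `toyShelf`, and the toy values are hypothetical and stand in for the real `redditAbuseSubmissions` file.

```python
import os
import shelve
import tempfile


def shelf_lengths(path):
    """Return {key: len(value)} for every entry stored in the shelf at `path`."""
    with shelve.open(path, flag="r") as db:
        return {key: len(db[key]) for key in db}


# Build a tiny throwaway shelf (hypothetical data) that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toyShelf")
    with shelve.open(path) as db:
        db["data"] = ["title one text one", "title two text two"]
        db["labels"] = ["abuse", "non_abuse"]
        db["subIds"] = ["id1", "id2"]

    lengths = shelf_lengths(path)

# Every variable should report the same number of entries.
all_equal = len(set(lengths.values())) == 1
print(lengths, all_equal)
```

If the lengths differed, building a DataFrame from the dictionary would raise an error, so running a check like this first gives a clearer message about which variable is off.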

In [2]:
# Open the shelf in a context manager so the file is closed after loading
with shelve.open('redditAbuseSubmissions') as db:
    submissions = pd.DataFrame(dict(db))

In [3]:
submissions.head()

| | data | subIds | roles | predicates | labels |
|---|---|---|---|---|---|
| 0 | I cant eat pls help I need help\nMy anxiety ha... | 2wjl43 | [am-adv, am-adv, am-adv, am-dir, am-dis, am-mn... | [(cause, 01), (focus, 01), (need, 01), (go, 01... | non_abuse |
| 1 | Financial Independence I am 18, with no job an... | 2tdh8q | [am-mnr, am-mod, am-rec, am-tmp, am-tmp, taker... | [(live, 01), (hope, 01), (be, 01), (be, 01), (... | abuse |
| 2 | Who decided that online calculus assignments w... | 2vwei8 | [am-loc, causer, frustrater, comment, decision... | [(frustrate, 01), (frustrate, 01), (be, 01), (... | non_abuse |
| 3 | My friend recently told me she was abused as a... | p013r | [am-cau, am-cau, am-cau, am-dir, am-dir, am-di... | [(be, 01), (be, 01), (know, 01), (come, 01), (... | abuse |
| 4 | How's it going on this monday night? I am list... | 2xrdhg | [am-mnr, am-rec, am-tmp, entity in motion/goer... | [(go, 01), (be, 01), (go, 01), (go, 01), (cata... | non_abuse |


In [4]:
submissions.shape

(1104, 5)

In [5]:
submissions.labels.value_counts()

non_abuse    552
abuse        552
Name: labels, dtype: int64

**redditAbuseComments**: This data contains all the comments posted on the submissions in the balanced set described above. It has the variables:

commData: A dictionary, where the key is a reddit submission id and the value is a list of comments in that submission.

commLabels: A dictionary, where the key is a reddit submission id and the value is a list of labels given to the comments (abuse or non_abuse).
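
Since both variables are dictionaries keyed by submission id, building a DataFrame from them puts the ids in the index rather than in a column, and `reset_index` is then needed to get a regular `subIds` column. The sketch below illustrates this with hypothetical toy values. It uses a single label string per submission, which is an assumption based on how the labels are displayed in the notebook output.

```python
import pandas as pd

# Hypothetical toy stand-ins for the two shelved dictionaries.
toy = {
    "commLabels": {"id1": "abuse", "id2": "non_abuse"},
    "commData": {
        "id1": ["first comment", "second comment"],
        "id2": ["only comment"],
    },
}

# A dict of dicts becomes one column per outer key,
# with the inner keys (submission ids) as the index.
toy_comments = pd.DataFrame(toy)

# Move the ids out of the index into a named column.
flat = toy_comments.reset_index().rename(columns={"index": "subIds"})
print(flat)
```

Each row still holds the whole list of comments for one submission, so the frame has one row per submission rather than one row per comment.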

In [6]:
# Open the shelf in a context manager so the file is closed after loading
with shelve.open('redditAbuseComments') as db:
    comments = pd.DataFrame(dict(db))

In [7]:
comments.head()

| | commLabels | commData |
|---|---|---|
| hgz26 | abuse | [If the cops won't do anything about it... may... |
| 2vpmnh | non_abuse | [I re-read your post and realized it's past ti... |
| 2urnes | non_abuse | [Dealing with seniors? What exactly does that ... |
| 2l78vr | non_abuse | [Yeah I am the same I get terrible anxiety, te... |
| 136f5k | abuse | [So what I am getting is you're suffering from... |


In [8]:
comments=comments.reset_index().rename(columns={'index':'subIds'})
comments.head()

| | subIds | commLabels | commData |
|---|---|---|---|
| 0 | hgz26 | abuse | [If the cops won't do anything about it... may... |
| 1 | 2vpmnh | non_abuse | [I re-read your post and realized it's past ti... |
| 2 | 2urnes | non_abuse | [Dealing with seniors? What exactly does that ... |
| 3 | 2l78vr | non_abuse | [Yeah I am the same I get terrible anxiety, te... |
| 4 | 136f5k | abuse | [So what I am getting is you're suffering from... |


In [9]:
comments.shape

(1104, 3)

In [10]:
#merge the comments data with the corresponding posts
df=pd.merge(comments, submissions, on='subIds')

In [11]:
#check if labels are the same for comments & posts
(df.commLabels==df.labels).value_counts()

True    1104
dtype: int64
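
The merge above uses the default inner join, and the row count of 1104 matching both inputs suggests no submission was dropped or duplicated. A stricter variant, sketched below on hypothetical toy frames, makes both assumptions explicit: `validate="one_to_one"` raises an error if a `subIds` value repeats on either side, and `indicator=True` with an outer join exposes any id that appears in only one frame.

```python
import pandas as pd

# Hypothetical toy frames standing in for `comments` and `submissions`.
left = pd.DataFrame({"subIds": ["id1", "id2"],
                     "commLabels": ["abuse", "non_abuse"]})
right = pd.DataFrame({"subIds": ["id1", "id2"],
                      "labels": ["abuse", "non_abuse"]})

merged = pd.merge(
    left, right,
    on="subIds",
    how="outer",            # keep ids that appear on only one side
    validate="one_to_one",  # raise if an id is duplicated in either frame
    indicator=True,         # add a `_merge` column: both / left_only / right_only
)

# Rows that failed to match on both sides (expected to be empty).
unmatched = merged[merged["_merge"] != "both"]

# Same check as in the notebook, reduced to a single boolean.
labels_agree = (merged["commLabels"] == merged["labels"]).all()
print(len(unmatched), labels_agree)
```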

In [12]:
#rename columns
df.rename(columns={'subIds':'ID', 'commData':'comments', 'data':'posts'}, inplace=True)
#drop redundant comment labels column
df.drop('commLabels', axis=1, inplace=True)

df.head()

| | ID | comments | posts | roles | predicates | labels |
|---|---|---|---|---|---|---|
| 0 | hgz26 | [If the cops won't do anything about it... may... | My father abuses me and I cant do anything abo... | [am-adv, am-adv, am-adv, am-adv, am-dis, am-di... | [(do, 02), (know, 01), (make, 01), (watch, 01)... | abuse |
| 1 | 2vpmnh | [I re-read your post and realized it's past ti... | [Help]Finger Prick I know how dumb it sounds o... | [am-adv, am-adv, am-dis, am-mnr, am-mnr, am-mn... | [(!think, 01), (be, 01), (have, 03), (do, 02),... | non_abuse |
| 2 | 2urnes | [Dealing with seniors? What exactly does that ... | Going to college stress. I've had two panic at... | [am-mnr, emotion or sensation, end point, end ... | [(do, 02), (feel, 01), (go, 01), (go, 01), (go... | non_abuse |
| 3 | 2l78vr | [Yeah I am the same I get terrible anxiety, te... | Just started my Sertraline medication today. I... | [am-adv, am-adv, am-dis, am-mnr, am-mnr, am-mo... | [(change, 01), (start, 01), (change, 01), (aff... | non_abuse |
| 4 | 136f5k | [So what I am getting is you're suffering from... | hear me, i cry again and again, and maybe i am... | [am-adv, am-adv, am-adv, am-adv, am-adv, am-ad... | [(!be, 01), (be, 01), (do, 02), (ignore, 01), ... | abuse |


In [13]:
df.to_csv('reddit.csv')
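
One thing to keep in mind when saving to CSV: columns that hold Python lists (here `comments`, `roles`, and `predicates`) are written as their string representation, so they come back as plain strings when the file is read again. The sketch below shows one possible way to restore them with `ast.literal_eval` as a `read_csv` converter. It uses a hypothetical one-row frame and an in-memory buffer instead of `reddit.csv`, and it assumes the cells contain only literals such as strings, lists, and tuples.

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame with list-valued columns like those in `df`.
toy = pd.DataFrame({
    "ID": ["id1"],
    "comments": [["a comment", "another"]],
    "predicates": [[("be", "01"), ("go", "01")]],
})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# A plain read returns the list columns as strings.
raw = pd.read_csv(io.StringIO(csv_text))

# Parsing each cell with ast.literal_eval rebuilds the original objects.
restored = pd.read_csv(
    io.StringIO(csv_text),
    converters={"comments": ast.literal_eval, "predicates": ast.literal_eval},
)
print(type(raw.loc[0, "comments"]), type(restored.loc[0, "comments"]))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this purpose. A format that preserves nested objects, such as pickle, would avoid the round-trip issue altogether.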