## Purpose

Load the AskDocs data and create query/response pairs in different ways.

## Initialize

In [3]:
import pandas as pd
import pickle

# Custom
from processing import tag_utterances
from processing import load_sem_types
from processing import DataPipeline
pd.set_option('display.max_columns', 500) # more columns displayed at once

# Set path for importing data
data_instance = DataPipeline(comments_path = '../data/reddit_comments_askDocs_2014_to_2018_03.gz',
                            posts_path = '../data/original_posts_under_askDocs_subreddit_id.gz')

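There are several ways to format the data as training conversations:

* Option 1: All responses are equal
    * Treat every thread as a conversation
    * Treat every comment in the thread as a response to the original AskDocs post
* Option 2: Every comment is a question and answer
    * Treat every comment as a response to its parent and as a query for its replies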

## Option 1: All responses are equal

* The title of each thread is used as the question
* Every comment in that thread is used as an answer
* The body of the original post is ignored; only its title is used
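
The pairing rule above can also be expressed as a single merge instead of a per-thread loop. This is a minimal sketch on a made-up table, not the notebook's pipeline: the column names mirror the real data, but the rows, the `toy` frame and the `toy_query`/`toy_answer` names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the combined table; column names mirror the notebook,
# the rows are invented for illustration.
toy = pd.DataFrame({
    'link_id_short':   ['aaa', 'aaa', 'aaa', 'bbb', 'bbb'],
    'is_thread_start': [1.0, 0.0, 0.0, 1.0, 0.0],
    'author':          ['op1', 'doc1', 'op1', 'op2', 'doc2'],
    'title':           ['Sore throat?', None, None, 'Knee pain', None],
    'body':            ['post body', 'See a GP.', 'Thanks!', 'post body', 'Rest it.'],
})

# One row per thread: the original post supplies the title and the asker.
starts = (toy.loc[toy['is_thread_start'] == 1, ['link_id_short', 'author', 'title']]
             .rename(columns={'author': 'thread_author'}))

# Every other row is a comment; attach the thread title to each one.
comments = toy.loc[toy['is_thread_start'] != 1, ['link_id_short', 'author', 'body']]
pairs = comments.merge(starts, on='link_id_short', how='inner')

toy_query = [{'author': r.thread_author, 'reader': r.author, 'utterance': r.title}
             for r in pairs.itertuples()]
toy_answer = [{'author': r.author, 'reader': r.thread_author, 'utterance': r.body}
              for r in pairs.itertuples()]
```

Note that the original poster's own follow-up comments (such as `'Thanks!'` here) are kept as answers, which matches the "all responses are equal" rule.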

In [2]:

df = data_instance.load_full_thread()

print('\nCount of threads:')
df['is_thread_start'].value_counts()

Comments Table Shape: (557648, 24)
Posts table shape: (43615, 35)
30710
Final combined table shape: (139535, 28)
Count of threads


0.0    108825
1.0     30710
Name: is_thread_start, dtype: int64

In [8]:
%%time
list_of_threads = df['link_id_short'].unique().tolist()

query = []
answer = []
DEBUG = False  # set to True to print per-thread details

# Loop through all threads
for thread in list_of_threads:
    try:
        df_subset = df.loc[df['link_id_short'] == thread]
        # The unique is_thread_start flags must sum to 1, i.e. the thread
        # contains a row flagged as the original post
        assert sum(df_subset['is_thread_start'].unique()) == 1

        df_start = df_subset.loc[df_subset['is_thread_start'] == 1]
        thread_author = str(df_start['author'].unique()[0]).strip()
        # Body of the original post (not used below; Option 1 pairs on the title)
        thread_question = str(df_start['body'].iloc[0]).strip()

        # .iloc[0] takes the first non-null title by position, not by index label
        thread_title = df_subset.title[df_subset.title.notnull()].iloc[0]

        # First non-null url; for a Reddit permalink the second-to-last
        # path segment is the short title
        thread_url = df_subset.url.dropna().iloc[0]
        try:
            thread_title_short = thread_url.split('/')[-2]
        except (AttributeError, IndexError):
            thread_title_short = thread_url

        # Authors of comments that reply directly to the original post
        thread_readers = df_subset.loc[df_subset['parent_id_short'] == thread].author.tolist()

        if DEBUG:
            print('thread:', thread)
            print('thread_author:', thread_author)
            print('url:', thread_url)
            print('thread_title short:', thread_title_short)
            print('thread_title:', thread_title)
            print('thread_readers:', thread_readers)

        for index, row in df_subset.loc[df_subset['is_thread_start'] != 1].iterrows():
            query.append({'author': thread_author, 'reader': row['author'], 'utterance': thread_title})
            answer.append({'author': row['author'], 'reader': thread_author, 'utterance': row['body']})
    except Exception as exc:
        # Report the thread that failed and why, then continue with the next one
        print(thread, repr(exc))

assert len(query) == len(answer)

CPU times: user 6min 25s, sys: 2.38 s, total: 6min 27s
Wall time: 6min 28s


In [17]:
# Materialize the pairs as a list and close the file handle when done
with open('../data/all_responses_equal.p', 'wb') as f:
    pickle.dump(list(zip(query, answer)), f)
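
For reference, a small round-trip sketch of saving and reloading pairs in this shape. The pairs and the temporary file name (`pairs_demo.p`) are made up for illustration, and the pairs are stored as a list so they can be iterated more than once after loading.

```python
import os
import pickle
import tempfile

# Made-up pairs in the same shape as the notebook's query/answer dicts.
toy_query = [{'author': 'op1', 'reader': 'doc1', 'utterance': 'Sore throat?'}]
toy_answer = [{'author': 'doc1', 'reader': 'op1', 'utterance': 'See a GP.'}]

# A temporary path is used so this sketch does not touch ../data.
path = os.path.join(tempfile.mkdtemp(), 'pairs_demo.p')

with open(path, 'wb') as f:
    pickle.dump(list(zip(toy_query, toy_answer)), f)

with open(path, 'rb') as f:
    loaded = pickle.load(f)

for q, a in loaded:
    print('Q:', q['utterance'], '| A:', a['utterance'])
```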

In [170]:
for idx, row in df_subset.iterrows():
    print(df_subset.title[df_subset.title.notnull()][0])
    #print(row['url'],row['body'])
    print()

Out of hand tonsil infection. Help!

Out of hand tonsil infection. Help!

Out of hand tonsil infection. Help!

Out of hand tonsil infection. Help!

Out of hand tonsil infection. Help!

Out of hand tonsil infection. Help!

Out of hand tonsil infection. Help!



In [171]:
for idx in range(len(query)):
    print("Q:", query[idx])
    print("A:", answer[idx])
    print()
print('Number of Q/A:',len(query))

Q: {'author': 'dpeters14fuck', 'reader': 'ebast', 'utterance': 'Out of hand tonsil infection. Help!'}
A: {'author': 'ebast', 'reader': 'dpeters14fuck', 'utterance': 'Well, then just be careful and try to avoid any kind of abdominal trauma. Best of luck! Get better soon so you can enjoy your vacations :)'}

Q: {'author': 'dpeters14fuck', 'reader': 'dpeters14fuck', 'utterance': 'Out of hand tonsil infection. Help!'}
A: {'author': 'dpeters14fuck', 'reader': 'dpeters14fuck', 'utterance': "Thanks for the advice and well wishing! It's now wednesday and I've been all good at work so I guess as long as I'm not straining myself too much I'll be all good"}

Q: {'author': 'dpeters14fuck', 'reader': 'ebast', 'utterance': 'Out of hand tonsil infection. Help!'}
A: {'author': 'ebast', 'reader': 'dpeters14fuck', 'utterance': "You're welcome :) well, that's hard to say actually. Contact sports are like the classic thing you tell patients not to do because the spleen gets a little bigger during mono. I'

In [91]:
for index,row in df_subset.iterrows():
    print(row['body'])

I just got them today, and it was my first time wearing contacts in a few years. They weren't moving every time I blinked, but occasionally, maybe once every few minutes. Then after a while, I noticed it happening much less often, although it still happened occasionally. Maybe because I was distracted, I wasn't focusing on it as much. Maybe I was blinking less? I was playing a video game/watching TV for a while, and it was much better. 


Other than that, no reddness or discomfort from them. Tomorrow I'm going to take them for a spin during the day and wear them a little longer. (i'll bring my glasses just in case though).
Does it happen every time you blink? How many days have you been wearing them? They do take a while to get used to although if they are moving every time you blink they may not be the best fit for you. Astigmatism is harder to fit then regular prescriptions as the lens or the cornea is irregularly shaped. 

If your eyes become slightly red or irritated from the lense

## Option 2: Every comment is a question and answer

There may be a sub-option here: we get more data if we do not join with posts (i.e. the post that started each thread). Each comment whose parent_id is the original post would then become a top-level query, but it would be dropped if it received no replies.
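
One way to sketch the "every comment is both a question and an answer" idea is a self-join of the table on `parent_id_short == id`. The frame below is invented for illustration (only the column names come from the real data), and dropping replies to `[deleted]` entries is an assumption about the desired behaviour rather than something this notebook settles.

```python
import pandas as pd

toy = pd.DataFrame({
    'id':              ['p1', 'c1', 'c2', 'c3', 'c4'],
    'parent_id_short': [None, 'p1', 'c1', 'p1', 'c3'],
    'body':            ['My knee hurts.', 'Since when?', 'Two weeks.',
                        '[deleted]', 'Same here.'],
})

# Drop removed entries, then pair each row with the row it replies to.
kept = toy[toy['body'] != '[deleted]']
pairs = kept.merge(
    kept[['id', 'body']],
    left_on='parent_id_short', right_on='id',
    how='inner', suffixes=('_response', '_query'),
)

# Each tuple is (text of the parent, text of the reply).
qr_pairs = list(zip(pairs['body_query'], pairs['body_response']))
```

Rows with no surviving parent (the original post itself, and `c4`, whose parent was deleted) fall out of the inner join, so only entries that actually have a query to respond to produce a pair.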

In [4]:
df = data_instance.load_full_thread()

Comments Table Shape: (557648, 24)
Posts table shape: (43615, 35)
30710
Final combined table shape: (139535, 28)


In [5]:
# Here's the situation we're dealing with when working out which post belongs to which
df[['link_id_short','parent_id','parent_id_short','post_id','id']].head()

Unnamed: 0,link_id_short,parent_id,parent_id_short,post_id,id
1662,37o1az,t3_37o1az,37o1az,cvkbr58,cvkbr58
1663,37o1az,t1_cvnw8ly,cvnw8ly,cvnwkg4,cvnwkg4
2156,3exs68,t3_3exs68,3exs68,cwphyrq,cwphyrq
2549,399nb8,t1_cw2sr75,cw2sr75,cw2svpt,cw2svpt
2550,399nb8,t1_cw1xikq,cw1xikq,cw2fe2o,cw2fe2o
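
In the table above, `parent_id` carries Reddit's type prefix: `t3_` marks a parent that is a post (so the comment is a top-level reply), while `t1_` marks a parent that is another comment. A minimal sketch of splitting the prefix from the short id, using ids from the output above; the `parent_kind` and `is_top_level` column names are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'parent_id': ['t3_37o1az', 't1_cvnw8ly', 't1_cw2sr75']})

# Split "t3_37o1az" into the type prefix and the short id.
parts = toy['parent_id'].str.split('_', n=1, expand=True)
toy['parent_kind'] = parts[0].map({'t1': 'comment', 't3': 'post'})
toy['parent_id_short'] = parts[1]

# A comment is top-level when its parent is the post itself.
toy['is_top_level'] = toy['parent_kind'] == 'post'
```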


In [6]:
# all original posts
all_threads = df['link_id_short'].unique().tolist()
print(len(all_threads))

30710


In [33]:
imp_columns = ['link_id_short','parent_id','parent_id_short','post_id','id']
# example link_id 37o1az
df_example = df[df['link_id_short'] == '37o1az']

print('Num total posts:',df_example.shape)
print('Num first responses:',sum(df_example['parent_id_short'] == df_example['link_id_short']))

# original post
df_example[df_example['body'].str.contains("Pertinent facts")][imp_columns]

query = []
response = []

qr_pair = []
# Append top query
query_id = df_example[df_example['parent_id'].isnull()]['link_id_short'].iloc[0]
query_original = df_example[df_example['parent_id'].isnull()]['body'].iloc[0]
# Append responses to top query 
df_example2 = df_example[(~df_example['parent_id'].isnull()) &
                         (df_example['body'] != '[deleted]') & 
                         (df_example['parent_id_short'].str.contains(query_id))] # get the children

for resp in df_example2['body'].tolist():
    qr_pair.append((query_original,resp))
# Create list of ids to find children
len(df_example2['post_id'].unique())

Num total posts: (84, 28)
Num first responses: 21


18

In [76]:
# Sanity check: look up one known direct reply in the example thread, skipping deleted entries
df[(df['parent_id_short'].isin(['37o1az'])) &
   (df['body'] != '[deleted]') # some entries have been removed for some reason
  & (df['id'] == 'cvkbr58')]['id'].tolist()

['cvkbr58']

In [77]:
class QueryResponse:
    """
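    Recursive search to create query/response pairs.

    Reads from a module-level DataFrame named pd_data_frame, which must be
    assigned before an instance is created.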
    """
    def __init__(self,parent_id_lst):
        self.parent_id_lst = parent_id_lst
        print('num parents:', len(self.parent_id_lst))
        if type(self.parent_id_lst) == list:
            self.children_id_lst = pd_data_frame[(pd_data_frame['parent_id_short'].isin(self.parent_id_lst)) &
                                                 # some entries have been removed for some reason
                                                 (pd_data_frame['body'] != '[deleted]') 
                                                ]['id'].tolist()
        else:
            self.children_id_lst = pd_data_frame[(pd_data_frame['parent_id_short'] == self.parent_id_lst) &
                                                 # some entries have been removed for some reason
                                                 (pd_data_frame['body'] != '[deleted]') 
                                                ]['id'].tolist()
            
        print('num children:', len(self.children_id_lst))
        # Given a list of parent ids, turn the corresponding text into queries,
        # and turn the entries that reply to those parent ids into responses.
        query_response = []
        for parent_id in parent_id_lst:
            query = pd_data_frame[pd_data_frame['id'] == parent_id]['body'].iloc[0]
            children_ids = pd_data_frame[(pd_data_frame['parent_id_short']==parent_id) &
                                              # some entries have been removed for some reason
                                              (pd_data_frame['body'] != '[deleted]') 
                                             ]['id'].tolist()
            for child_id in children_ids:
                response = pd_data_frame[pd_data_frame['id'] == child_id]['body'].iloc[0]
                query_response.append((query,response))
                
        self.query_response = query_response     
    
    @property
    def child_elements(self):
        return [QueryResponse(a) for a in self.children_id_lst]
    
    # Return the list of (query,response) tuples
    @property
    def get_value(self):
        
        return self.query_response
    
def node_recurse_generator(node):
    """
    Iterates through all response/query pairs. "node" is a QueryResponse object.
    """
    yield node.query_response
    for n in node.child_elements:
        yield from node_recurse_generator(n)

In [89]:
# test loop list length
things = '37o1az'#['37o1az','37o1az']
test_lst = things
count = 0
while count < len(test_lst):
    print(test_lst[count])
    count += 1

37o1az


In [81]:
test_lst

['3', '7', 'o', '1', 'a', 'z']
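
The output above shows why passing a bare id string where a list is expected is risky: a Python string is itself a sequence of characters. A small hypothetical helper (`as_id_list` is not part of this notebook) that normalizes either form:

```python
def as_id_list(ids):
    """Return ids as a list, wrapping a bare string instead of splitting it."""
    if isinstance(ids, str):
        return [ids]
    return list(ids)

print(list('37o1az'))          # one entry per character
print(as_id_list('37o1az'))    # one entry holding the whole id
print(as_id_list(['37o1az']))  # already a list, contents unchanged
```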

In [78]:
pd_data_frame = df_example
a = QueryResponse(['37o1az','37o1az'])
list(node_recurse_generator(a))

num parents: 2
num children: 18
num parents: 7
num children: 1


IndexError: single positional indexer is out-of-bounds
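
The second `num parents: 7` equals the length of the child id `'cvkbr58'`, which is consistent with `child_elements` passing a single id string that the `for parent_id in parent_id_lst` loop then iterates character by character; a one-character id matches no row, so `.iloc[0]` raises. The sketch below is one possible alternative under that assumption, not the notebook's final method: it walks the reply tree with a queue of whole ids and dictionary lookups. The `toy` frame and the `collect_pairs` name are made up for illustration.

```python
from collections import deque
import pandas as pd

toy = pd.DataFrame({
    'id':              ['p1', 'c1', 'c2', 'c3', 'c4'],
    'parent_id_short': [None, 'p1', 'c1', 'p1', 'c2'],
    'body':            ['post', 'reply A', 'reply to A', '[deleted]', 'deep reply'],
})

def collect_pairs(frame, root_id):
    """Walk the reply tree breadth-first and return (query, response) tuples."""
    kept = frame[frame['body'] != '[deleted]']
    body_by_id = dict(zip(kept['id'], kept['body']))
    # Map each parent id to the ids of its direct replies.
    children = kept.groupby('parent_id_short')['id'].apply(list).to_dict()

    pairs = []
    queue = deque([root_id])  # wrapped in a list so the id is not split into characters
    while queue:
        parent = queue.popleft()
        for child in children.get(parent, []):
            if parent in body_by_id:
                pairs.append((body_by_id[parent], body_by_id[child]))
            queue.append(child)
    return pairs

toy_pairs = collect_pairs(toy, 'p1')
```

Building the two lookups once also avoids re-filtering the DataFrame for every node, which the recursive class does on each instantiation.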

In [13]:
t = QueryResponse(df_example)
t.children

Unnamed: 0,archived,author,author_flair_css_class,author_flair_text,body,controversiality,created_utc,distinguished,downs,gilded,id,is_thread_start,link_id,link_id_short,name,over_18,parent_id,parent_id_short,post_id,removal_reason,retrieved_on,score,score_hidden,subreddit,subreddit_id,title,ups,url
1662,,kql,default,This user has not yet been verified.,"Hey, how's your husband doing now? Hope everyt...",0.0,1443691046,,,0,cvkbr58,0.0,t3_37o1az,37o1az,,,t3_37o1az,37o1az,cvkbr58,,1446704000.0,1,1.0,AskDocs,t5_2xtuc,,1.0,
63247,,BrownIRL,default,This user has not yet been verified.,A few weeks ago I was suffering from the same ...,0.0,1444599293,,,0,cvw8hsi,0.0,t3_37o1az,37o1az,,,t3_37o1az,37o1az,cvw8hsi,,1446909000.0,2,1.0,AskDocs,t5_2xtuc,,2.0,
459301,False,Maysj18,default,This user has not yet been verified.,Hi OP- not sure if anyone has mentioned this t...,0.0,1432917637,,0.0,0,croytlr,0.0,t3_37o1az,37o1az,t1_croytlr,,t3_37o1az,37o1az,croytlr,,1433377000.0,4,0.0,AskDocs,t5_2xtuc,,4.0,
473529,False,fusepark,default,This user has not yet been verified.,1. Has he been seen by an immunologist?\n\n2. ...,0.0,1432911358,,0.0,0,crouney,0.0,t3_37o1az,37o1az,t1_crouney,,t3_37o1az,37o1az,crouney,,1433335000.0,5,0.0,AskDocs,t5_2xtuc,,5.0,
476864,False,Bockabock,default,This user has not yet been verified.,My heart goes out to you and your husband. I'm...,0.0,1432870996,,0.0,0,crogddm,0.0,t3_37o1az,37o1az,t1_crogddm,,t3_37o1az,37o1az,crogddm,,1433328000.0,9,0.0,AskDocs,t5_2xtuc,,9.0,
477759,False,Medicine7,default,This user has not yet been verified.,Any update?,0.0,1435278413,,0.0,0,csivqbr,0.0,t3_37o1az,37o1az,t1_csivqbr,,t3_37o1az,37o1az,csivqbr,,1437355000.0,1,0.0,AskDocs,t5_2xtuc,,1.0,
486234,False,bigpandas,default,This user has not yet been verified.,With all the medical professionals not able to...,0.0,1436350254,,0.0,0,csw194g,0.0,t3_37o1az,37o1az,t1_csw194g,,t3_37o1az,37o1az,csw194g,,1437681000.0,-1,0.0,AskDocs,t5_2xtuc,,-1.0,
487528,False,lilleboff,verified-doc,Physician,With this kind of extensive diagnostics withou...,0.0,1432896786,,0.0,0,croo9ub,0.0,t3_37o1az,37o1az,t1_croo9ub,,t3_37o1az,37o1az,croo9ub,,1433332000.0,13,0.0,AskDocs,t5_2xtuc,,13.0,
487968,False,lurkERdoc,default,This user has not yet been verified.,I'm so sorry you're going through this! You se...,0.0,1432904040,,0.0,0,croqoy9,0.0,t3_37o1az,37o1az,t1_croqoy9,,t3_37o1az,37o1az,croqoy9,,1433333000.0,5,0.0,AskDocs,t5_2xtuc,,5.0,
500652,False,TuxPenguin1,default,This user has not yet been verified.,How is he doing?\n,0.0,1437193944,,0.0,0,ct7ee6q,0.0,t3_37o1az,37o1az,t1_ct7ee6q,,t3_37o1az,37o1az,ct7ee6q,,1437959000.0,1,0.0,AskDocs,t5_2xtuc,,1.0,


In [15]:
print(len(df.post_id.unique()))
print(len(df.id.unique()))

108714
139423


In [10]:
df.columns

Index(['archived', 'author', 'author_flair_css_class', 'author_flair_text',
       'body', 'controversiality', 'created_utc', 'distinguished', 'downs',
       'gilded', 'id', 'is_thread_start', 'link_id', 'link_id_short', 'name',
       'over_18', 'parent_id', 'parent_id_short', 'post_id', 'removal_reason',
       'retrieved_on', 'score', 'score_hidden', 'subreddit', 'subreddit_id',
       'title', 'ups', 'url'],
      dtype='object')

In [21]:
df[['link_id_short','parent_id','parent_id_short','post_id','id']].head()

Unnamed: 0,link_id_short,parent_id,parent_id_short,post_id,id
1662,37o1az,t3_37o1az,37o1az,cvkbr58,cvkbr58
1663,37o1az,t1_cvnw8ly,cvnw8ly,cvnwkg4,cvnwkg4
2156,3exs68,t3_3exs68,3exs68,cwphyrq,cwphyrq
2549,399nb8,t1_cw2sr75,cw2sr75,cw2svpt,cw2svpt
2550,399nb8,t1_cw1xikq,cw1xikq,cw2fe2o,cw2fe2o


In [164]:
df[df['parent_id_short']=='cvnwkg4']

Unnamed: 0,archived,author,author_flair_css_class,author_flair_text,body,controversiality,created_utc,distinguished,downs,gilded,id,is_thread_start,link_id,link_id_short,name,over_18,parent_id,parent_id_short,post_id,removal_reason,retrieved_on,score,score_hidden,subreddit,subreddit_id,title,ups,url
326708,,RissaWasTaken,default,This user has not yet been verified.,Thank you so much! I may have something to act...,0.0,1444319711,,,0,cvsl6ms,0.0,t3_37o1az,37o1az,,,t1_cvnwkg4,cvnwkg4,cvsl6ms,,1446846000.0,2,1.0,AskDocs,t5_2xtuc,,2.0,
