# Conversation Graphs

This notebook builds node and edge lists from the Twitter and Reddit sample conversations for use in visualizations. It assumes the s3://tweets.pull/ S3 bucket is mounted in a data directory one level above the directory that contains this notebook.

In [64]:
import json
import pathlib

tweets_dir = pathlib.Path('../data/tweets.pull')
reddit_dir = pathlib.Path('../data/reddit.pull')
convs_dir = pathlib.Path('./convs/data')

## Get the Data

We want to iterate through the sample zip files and load the CSVs they contain as Pandas DataFrames. Given the directory where the sample zip files live, this function iterates through the zips, looks for CSVs in them, and yields each one along with the name of the dataset (which matches the search criteria used to generate the data) and the conversation id (a tweet id or Reddit post id).

In [2]:
import io
import csv
import pandas
import zipfile

def get_conv_df(data_dir):
    for zip_path in data_dir.glob('*_30.zip'):
        with zipfile.ZipFile(zip_path) as z:
            for filename in z.namelist():
                if filename.endswith('.csv'):
                    # str.strip('.csv') removes any of the characters ".csv" from
                    # both ends and can eat into an id, so cut off only the suffix
                    name, conv_id = filename[:-len('.csv')].split('/')
                    with z.open(filename) as f:
                        table = pandas.read_csv(f)
                    yield name, conv_id, table
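
The function above expects each archive member to be named `<dataset name>/<conversation id>.csv`. Here is a minimal, standard-library-only sketch of that layout, built against an in-memory archive. The helper `list_conversations` and the toy member names are hypothetical and are not part of the sample data.

```python
import io
import zipfile
from pathlib import PurePosixPath

def list_conversations(zip_bytes):
    """Return (dataset_name, conv_id) for every CSV member of a zip archive."""
    found = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        for member in z.namelist():
            path = PurePosixPath(member)
            if path.suffix == '.csv':
                # the parent directory is the dataset name, the file stem is the id
                found.append((path.parent.name, path.stem))
    return found

# build a tiny archive in memory with the assumed <dataset>/<conv_id>.csv layout
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('toy_convs_30/abc12s.csv', 'id,text\n1,hello\n')
    z.writestr('toy_convs_30/README.txt', 'not a csv')

conversations = list_conversations(buf.getvalue())
print(conversations)
```

Taking the path stem leaves ids that end in `c`, `s` or `v` intact, which character-based stripping would not.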

We can test it out to get the first tweet dataset:

In [3]:
twitter_name, twitter_conv_id, twitter_df = next(get_conv_df(tweets_dir))
print(f'dataset name: {twitter_name}')
print(f'conversation_id: {twitter_conv_id}')

twitter_df

dataset name: tweets_wealth_convs_30
conversation_id: 1277729142390304768


Unnamed: 0,id,created_at,text,attachments.media,attachments.media_keys,attachments.poll.duration_minutes,attachments.poll.end_datetime,attachments.poll.id,attachments.poll.options,attachments.poll.voting_status,...,source,withheld.scope,withheld.copyright,withheld.country_codes,type,__twarc.retrieved_at,__twarc.url,__twarc.version,Unnamed: 93,sentiment
0,1277729145917771776,2020-06-29T22:22:22.000Z,-Minimum $1000 cash payments to all Americans ...,,,,,,,,...,Twitter Web App,,,,replied_to,2021-09-05T13:12:11+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.4.3,,0.7269
1,1277729148509851648,2020-06-29T22:22:22.000Z,#RussForUs #NJ06\n\n#AndrewYangEndorsed\n\nCan...,,,,,,,,...,Twitter Web App,,,,,2021-09-05T13:12:11+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.4.3,,0.0
2,1277729142390304768,2020-06-29T22:22:21.000Z,A Universal Basic Income will unleash our pote...,,,,,,,,...,Twitter Web App,,,,replied_to,2021-09-05T13:12:11+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.4.3,,-0.6238


And we can try it on the Reddit data:

In [10]:
reddit_name, reddit_conv_id, reddit_df = next(get_conv_df(reddit_dir))
print(f'dataset name: {reddit_name}')
print(f'conversation_id: {reddit_conv_id}')

reddit_df

dataset name: reddit_racial_wealth_gap_convs_30
conversation_id: m4ljb0


Unnamed: 0,all_awardings,approved_at_utc,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,top_awarded_type,total_awards_received,treatment_tags,sentiment
0,[],,,oath2order,,,[],,,,...,1615701284,1,True,False,scotus,t5_2rfsw,,0,[],0.0
1,[],,,King_Posner,,,[],,,,...,1615721998,1,True,False,scotus,t5_2rfsw,,0,[],0.7271
2,[],,,Sandra_Day_Rehnquist,,,[],,,,...,1615725644,1,True,False,scotus,t5_2rfsw,,0,[],0.5423
3,[],,,King_Posner,,,[],,,,...,1615747166,1,True,False,scotus,t5_2rfsw,,0,[],0.0307
4,[],,,arbivark,,,[],,,,...,1615747249,1,True,False,scotus,t5_2rfsw,,0,[],0.1779
5,[],,,Cwagmire,,,[],,,,...,1615749793,1,True,False,scotus,t5_2rfsw,,0,[],-0.7486
6,[],,,Cwagmire,,,[],,,,...,1615750133,1,True,False,scotus,t5_2rfsw,,0,[],0.4266
7,[],,,[deleted],,,,,,dark,...,1615750257,1,True,False,scotus,t5_2rfsw,,0,[],0.0
8,[],,,King_Posner,,,[],,,,...,1615750776,1,True,False,scotus,t5_2rfsw,,0,[],0.8805
9,[],,,Sandra_Day_Rehnquist,,,[],,,,...,1615751097,1,True,False,scotus,t5_2rfsw,,0,[],0.7448


## Extract Network

The Twitter and Reddit data are shaped differently, but we can process each one into a data structure of nodes and edges. This is somewhat tedious because the head node of each conversation, in both Twitter and Reddit, is not included in the thread data other than as an id that replies point to. So this code goes through some contortions to fetch it after the fact, which makes the network easier to display.
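
To see how the two shapes differ, here is a rough sketch of how each platform encodes the "replies to" pointer that becomes an edge. It assumes, as the functions in this section do, that `referenced_tweets` is a JSON string and that Reddit's `parent_id` carries a type prefix such as `t1_` or `t3_`. The helper names and the toy values are hypothetical.

```python
import json

def twitter_parent(referenced_tweets):
    """Return the id of the tweet being replied to, or None."""
    if not isinstance(referenced_tweets, str):
        return None  # missing values show up as NaN (a float) in a DataFrame
    for ref in json.loads(referenced_tweets):
        if ref.get('type') == 'replied_to':
            return ref['id']
    return None

def reddit_parent(parent_id):
    """Drop the type prefix from a Reddit parent id, or return None."""
    if not isinstance(parent_id, str):
        return None
    return parent_id.split('_', 1)[-1]

# toy values, not taken from the sample data
tw = twitter_parent('[{"type": "quoted", "id": "1"}, {"type": "replied_to", "id": "2"}]')
rd = reddit_parent('t3_abc123')
print(tw, rd)
```

Either way the result is a `(child_id, parent_id)` pair, and the parent may be an id that never appears as a row of its own.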

In [85]:
import os
import time

import twarc
twitter = twarc.client2.Twarc2(bearer_token=os.environ.get('BEARER_TOKEN'))

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()

def make_twitter_graph(df):   
    nodes = []
    edges = []
    
    for _, row in df.iterrows():
        
        node_id = str(row['id'])
        if node_id not in [n['id'] for n in nodes]:
            nodes.append({
                "id": node_id,
                "url": f"https://twitter.com/{row['author.username']}/status/{row['id']}",
                "user": row["author.username"],
                "text": row["text"],
                "sentiment": row["sentiment"]
            })
        
        # referenced_tweets is NaN (a float) when the tweet references nothing
        if isinstance(row['referenced_tweets'], str):
            reply_to_id = None
            for ref in json.loads(row['referenced_tweets']):
                if ref['type'] == 'replied_to':
                    reply_to_id = ref['id']
            if reply_to_id:
                edges.append((node_id, reply_to_id))
                
    # make sure all the edge ids have node information
    # the head of the conversation thread is usually missing
    
    node_ids = set([n['id'] for n in nodes])
    all_node_ids = set([n[0] for n in edges] + [n[1] for n in edges])
    missing_ids = all_node_ids - node_ids
    
    if missing_ids:
        print(f'looking up missing ids: {missing_ids}')
        # only call the API when there is something to look up
        for result in twitter.tweet_lookup(list(missing_ids)):
            if 'data' not in result:
                continue
            for tweet in twarc.ensure_flattened(result):
                nodes.append({
                    "id": tweet["id"],
                    "url": f"https://twitter.com/{tweet['author']['username']}/status/{tweet['id']}",
                    "user": tweet['author']['username'],
                    "text": tweet['text'],
                    "sentiment": vader.polarity_scores(tweet['text'])['compound']
                })
            time.sleep(1)
        
    return {"nodes": nodes, "edges": edges}
        
make_twitter_graph(twitter_df)

{'nodes': [{'id': '1277729145917771776',
   'url': 'https://twitter.com/UBI2021/status/1277729145917771776',
   'user': 'UBI2021',
   'text': '-Minimum $1000 cash payments to all Americans + $500 per dependent\\n-Eradicate poverty\\n-Narrow the racial wealth gap \\n-End the “scarcity mindset” transition to a “plentiful mindset” \\n-Incentivize socially beneficial work or education to curb the ramifications of automation\\n\\n #NJ06',
   'sentiment': 0.7269},
  {'id': '1277729148509851648',
   'url': 'https://twitter.com/UBI2021/status/1277729148509851648',
   'user': 'UBI2021',
   'text': '#RussForUs #NJ06\\n\\n#AndrewYangEndorsed\\n\\nCan you contribute today?\\nhttps://t.co/tyAKzgNT1j',
   'sentiment': 0.0},
  {'id': '1277729142390304768',
   'url': 'https://twitter.com/UBI2021/status/1277729142390304768',
   'user': 'UBI2021',
   'text': 'A Universal Basic Income will unleash our potential as human beings.\\n\\nThe incumbent Corporate Democrat Frank Pallone, DOES NOT support this an

And we can make a similar function for the Reddit conversations:

In [86]:
import requests

def make_reddit_graph(df):   
    nodes = []
    edges = []
    
    for _, row in df.iterrows():
        
        node_id = row['id']
        if node_id not in [n['id'] for n in nodes]:
            nodes.append({
                "id": row["id"],
                "url": "https://www.reddit.com" + row["permalink"],
                "user": row["author"],
                "text": row["body"],
                "sentiment": row["sentiment"]
            })
        
        # a missing parent_id is NaN (a float), which is truthy, so test for a string
        if isinstance(row['parent_id'], str):
            edges.append((node_id, row['parent_id'].split('_')[1]))
            
    # make sure all the edge ids have node information
    # the head of the conversation thread is usually missing
    
    node_ids = set([n['id'] for n in nodes])
    all_node_ids = set([n[0] for n in edges] + [n[1] for n in edges])
    missing_ids = all_node_ids - node_ids
    
    if missing_ids:
        print(f'looking up missing_ids: {missing_ids}')

        # only query Pushshift when there is something to look up
        time.sleep(1)
        resp = requests.get(f'https://api.pushshift.io/reddit/search/submission/?ids={",".join(missing_ids)}')
        if resp.status_code == 200:
            for post in resp.json().get('data', []):
                nodes.append({
                    "id": post["id"],
                    "url": post['full_link'],
                    "user": post['author'],
                    "text": post.get('selftext', ''),
                    "sentiment": vader.polarity_scores(post.get('selftext', ''))['compound']
                })
    
    return {"nodes": nodes, "edges": edges}
        
make_reddit_graph(reddit_df)

looking up missing_ids: {'m4ljb0'}
<Response [200]>


{'nodes': [{'id': 'gqvihaf',
   'url': 'https://www.reddit.com/r/scotus/comments/m4ljb0/the_racial_wealth_gap_is_a_civil_liberties_issue/gqvihaf/',
   'user': 'oath2order',
   'text': "What's the relevance to this sub?",
   'sentiment': 0.0},
  {'id': 'gqw3iva',
   'url': 'https://www.reddit.com/r/scotus/comments/m4ljb0/the_racial_wealth_gap_is_a_civil_liberties_issue/gqw3iva/',
   'user': 'King_Posner',
   'text': 'This one has it, if you read. The constitutional law evolutionary argument is pretty solid, I don’t agree with the conclusion as they only prove economic class is an issue, not an intersected, legally speaking, but it’s a good analysis of the caselaw.',
   'sentiment': 0.7271},
  {'id': 'gqw7kg4',
   'url': 'https://www.reddit.com/r/scotus/comments/m4ljb0/the_racial_wealth_gap_is_a_civil_liberties_issue/gqw7kg4/',
   'user': 'Sandra_Day_Rehnquist',
   'text': 'What do they want the court to do about it, rule that wealth must be redistributed on the basis of race?',
   'sent

## Save the Data

We'll just save the node and edge data to files (one JSON file plus an edges CSV and a nodes CSV per conversation) and do the visualization elsewhere.

In [87]:
for name, conv_id, df in get_conv_df(tweets_dir):
    g = make_twitter_graph(df)
    f = convs_dir / f"{name}_{conv_id}.json"
    # use context managers so every file is flushed and closed
    with f.open('w') as fh:
        json.dump(g, fh, indent=2)
    
    edges_csv = convs_dir / f"{name}_{conv_id}_edges.csv"
    with edges_csv.open('w', newline='') as fh:
        out = csv.writer(fh)
        out.writerow(['source', 'target'])
        out.writerows(g['edges'])
    
    nodes_csv = convs_dir / f"{name}_{conv_id}_nodes.csv"
    with nodes_csv.open('w', newline='') as fh:
        out = csv.DictWriter(fh, fieldnames=['id', 'url', 'user', 'text', 'sentiment'])
        out.writeheader()  # the edges file has a header row, so write one here too
        out.writerows(g['nodes'])
    
    print(f)

convs/data/tweets_wealth_convs_30_1277729142390304768.json
looking up missing ids: {'1300401392952258561'}
convs/data/tweets_wealth_convs_30_1300299144435773441.json
convs/data/tweets_wealth_convs_30_1269772851416051713.json
looking up missing ids: {'1394332284791054336', '1394348469884751872'}
convs/data/tweets_wealth_convs_30_1394330467474698245.json
convs/data/tweets_wealth_convs_30_1281424682953121794.json
convs/data/tweets_wealth_convs_30_1314248442936528898.json
convs/data/tweets_wealth_convs_30_1325261109516001280.json
convs/data/tweets_wealth_convs_30_1351464750559985664.json
convs/data/tweets_wealth_convs_30_1352109589169274880.json
looking up missing ids: {'1271606461915967489', '1271566802536013824', '1271570524477493249', '1271606016258510849', '1271592215337480193', '1271567304585883651', '1271571260129017863', '1271571053190557698', '1271562900281360385', '1271589588650405888', '1271568979300102144', '1271570320063975431', '1271571752481697792', '1271630526277922816', '12

looking up missing ids: {'1368235875562258437', '1368236513469882368', '1368103582172602368', '1369096803988041729', '1368140372505935872', '1367936086702456841', '1368184671578816512', '1368003505055666177', '1368243660253368322', '1368159038207365125', '1367965960724963334', '1368233000605990913', '1368687706759340037', '1368332529770635273', '1368175377701629958', '1368172693753856002', '1368120247438340096', '1368022861248565250', '1368009846910574593', '1367941259663400963', '1368169491545358338', '1368024915111206913', '1368334232163147776', '1368024054813233152', '1368278698676387846', '1368540565651001350', '1368244147975561221', '1368302122060705795', '1368010373031473155', '1369696565724471296', '1367898628191305728', '1367988748835827714', '1367987803108302851', '1368052739314442246', '1368346393492996109', '1368063288261681154', '1367983496103395333', '1368152474234814467', '1367967443893710848', '1368314297063903232', '1368714741221359618', '1368666293847662593', '13679658

looking up missing ids: {'1274925873976639488', '1275214578020954115', '1274817933261393920', '1274735112966397954', '1274886260612423680', '1274696891473653764', '1274620006135934976', '1274566810512302080', '1274841122192404481', '1274570880824356864', '1275104576174202882', '1274892478131236864', '1275019640809365505', '1274994959901429760', '1274582891628949505', '1274627869524799488', '1274874125014704132', '1274767565504630784', '1274885616191275008', '1275218477234741250', '1274566651241955331', '1274745164448387076', '1275114385115975680', '1274751727158333442', '1274993277578682372', '1274634391747166208', '1274628068263579650', '1274574400344346624', '1274659069215866881', '1274725375281696771', '1275075408308789253', '1274804160999886849', '1274575491308023811', '1274564191416913920', '1274702804637097985', '1271913234199728129', '1275447422735413256', '1274623946386800640', '1274712726405865478', '1275084840522715137', '1274614735883943941', '1274752331276582913', '12745582

looking up missing ids: {'1396840098600214528', '1396847577170157574', '1396997102488723456', '1396789597682679810', '1397211849930674177', '1396569753037250568', '1396848665751326723', '1396786928213807111', '1396547948515897355', '1396783597219205121', '1396911654680645642', '1397641346006102016', '1396767918852030465', '1396939882572877824', '1396607425143349252', '1396858899173679107', '1396574636884758530', '1396790153570627588', '1396753136614707203', '1396836170261274626', '1396788377958195200', '1397221176808706048', '1396829601058459649', '1396909530752200708', '1396589764267192329', '1396768356963848195', '1396888712009723904', '1396878945073631241', '1396829420351107078', '1397124181020577794', '1396962222916751368', '1396930064588021763', '1396835905630089216', '1397033857250115585', '1396565507382419458', '1396902637069643776', '1396570674542616585', '1396864818376806403', '1397224560135745542', '1396852141852069894', '1397221285340618760', '1396550059647512579', '13965787

In [88]:
for name, conv_id, df in get_conv_df(reddit_dir):
    g = make_reddit_graph(df)
    f = convs_dir / f"{name}_{conv_id}.json"
    # use context managers so every file is flushed and closed
    with f.open('w') as fh:
        json.dump(g, fh, indent=2)
    
    edges_csv = convs_dir / f"{name}_{conv_id}_edges.csv"
    with edges_csv.open('w', newline='') as fh:
        out = csv.writer(fh)
        out.writerow(['source', 'target'])
        out.writerows(g['edges'])
    
    nodes_csv = convs_dir / f"{name}_{conv_id}_nodes.csv"
    with nodes_csv.open('w', newline='') as fh:
        out = csv.DictWriter(fh, fieldnames=['id', 'url', 'user', 'text', 'sentiment'])
        out.writeheader()  # the edges file has a header row, so write one here too
        out.writerows(g['nodes'])
        
    print(f)

looking up missing_ids: {'m4ljb0'}
<Response [200]>
convs/data/reddit_racial_wealth_gap_convs_30_m4ljb0.json
looking up missing_ids: {'ifotnt'}
<Response [200]>
convs/data/reddit_racial_wealth_gap_convs_30_ifotnt.json
looking up missing_ids: {'jwh0qy'}
<Response [200]>
convs/data/reddit_racial_wealth_gap_convs_30_jwh0qy.json
looking up missing_ids: {'kswcrt'}
<Response [200]>
convs/data/reddit_racial_wealth_gap_convs_30_kswcrt.json
looking up missing_ids: {'kg1vjx'}
<Response [200]>
convs/data/reddit_racial_wealth_gap_convs_30_kg1vjx.json
looking up missing_ids: {'hbwbcl'}
<Response [200]>
convs/data/reddit_racial_wealth_gap_convs_30_hbwbcl.json
looking up missing_ids: {'ht2qiq'}
<Response [200]>
convs/data/reddit_racial_wealth_gap_convs_30_ht2qiq.json
looking up missing_ids: {'hggqkb'}
<Response [200]>
convs/data/reddit_racial_wealth_gap_convs_30_hggqkb.json
looking up missing_ids: {'hwdozb'}
<Response [200]>
convs/data/reddit_racial_wealth_gap_convs_30_hwdozb.json
looking up missing_
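
The Twitter and Reddit save loops differ only in which graph builder they call, so the file writing could be shared. Here is one possible sketch of such a helper, exercised on a toy graph in a temporary directory. The name `save_graph`, the `NODE_FIELDS` constant and the toy data are hypothetical. The header row on the nodes CSV is also an assumption about what the downstream visualization expects.

```python
import csv
import json
import pathlib
import tempfile

NODE_FIELDS = ['id', 'url', 'user', 'text', 'sentiment']

def save_graph(graph, out_dir, stem):
    """Write <stem>.json, <stem>_edges.csv and <stem>_nodes.csv into out_dir."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    json_path = out_dir / f'{stem}.json'
    with json_path.open('w') as fh:
        json.dump(graph, fh, indent=2)

    with (out_dir / f'{stem}_edges.csv').open('w', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerow(['source', 'target'])
        writer.writerows(graph['edges'])

    with (out_dir / f'{stem}_nodes.csv').open('w', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=NODE_FIELDS)
        writer.writeheader()
        writer.writerows(graph['nodes'])

    return json_path

# toy graph, not taken from the sample data
toy = {
    'nodes': [
        {'id': 'a', 'url': 'https://example.com/a', 'user': 'u1', 'text': 'hi', 'sentiment': 0.0},
        {'id': 'b', 'url': 'https://example.com/b', 'user': 'u2', 'text': 'yo', 'sentiment': 0.5},
    ],
    'edges': [('b', 'a')],
}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_graph(toy, tmp, 'toy_convs_30_a')
    reloaded = json.loads(saved.read_text())
    with (pathlib.Path(tmp) / 'toy_convs_30_a_edges.csv').open(newline='') as fh:
        edge_rows = list(csv.reader(fh))
```

With a helper like this, each loop body would reduce to building the graph and calling `save_graph(g, convs_dir, f"{name}_{conv_id}")`.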

In [89]:
x = json.load(open('convs/data/reddit_black_people_convs_30_i4tskq.json'))

In [90]:
node_ids = set([n['id'] for n in x['nodes']])
node_ids

{'g0kxm4t',
 'g0laq1c',
 'g0lb2qc',
 'g0lc2d5',
 'g0lk87u',
 'g0lk93k',
 'g0lkadg',
 'g0lkv9v',
 'g0lmsuv',
 'g0lnmim',
 'g0lnnik',
 'g0lq92j',
 'g0lqn79',
 'g0lzkm1',
 'g0m8eb6',
 'g0m9a1j',
 'g0n268x',
 'g0n9aq5',
 'g0o08ab',
 'g0peluh',
 'i4tskq'}

In [91]:
edge_ids = set([n[0] for n in x['edges']] + [n[1] for n in x['edges']])
edge_ids

{'g0kxm4t',
 'g0laq1c',
 'g0lb2qc',
 'g0lc2d5',
 'g0lk87u',
 'g0lk93k',
 'g0lkadg',
 'g0lkv9v',
 'g0lmsuv',
 'g0lnmim',
 'g0lnnik',
 'g0lq92j',
 'g0lqn79',
 'g0lzkm1',
 'g0m8eb6',
 'g0m9a1j',
 'g0n268x',
 'g0n9aq5',
 'g0o08ab',
 'g0peluh',
 'i4tskq'}

In [92]:
len(node_ids)

21

In [93]:
len(edge_ids)

21

In [94]:
edge_ids - node_ids

set()

In [75]:
df = pandas.read_csv('../data/reddit.pull/reddit_black_people_convs_30/i4tskq.csv')
df

Unnamed: 0,all_awardings,approved_at_utc,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,top_awarded_type,total_awards_received,treatment_tags,sentiment
0,[],,,CartophorustheGreat,,,[],,,,...,1596736920,3,True,False,HillaryForPrison,t5_3d6h1,,0,[],-0.6705
1,[],,,kapow,,,[],,,,...,1596744239,1,True,False,HillaryForPrison,t5_3d6h1,,0,[],0.0
2,[],,,PChE1,,,[],,,,...,1596744433,3,True,False,HillaryForPrison,t5_3d6h1,,0,[],-0.3313
3,[],,,[deleted],,,,,,dark,...,1596744982,1,True,False,HillaryForPrison,t5_3d6h1,,0,[],0.0
4,[],,,greg_jenningz,,,[],,,,...,1596749629,3,True,False,HillaryForPrison,t5_3d6h1,,0,[],0.3818
5,[],,,[deleted],,,,,,dark,...,1596749644,1,True,False,HillaryForPrison,t5_3d6h1,,0,[],0.0
6,[],,,panxerox,,,[],,,,...,1596749665,1,True,False,HillaryForPrison,t5_3d6h1,,0,[],0.0
7,[],,,CartophorustheGreat,,,[],,,,...,1596749997,3,True,False,HillaryForPrison,t5_3d6h1,,0,[],-0.7042
8,[],,,Thats_Cool_bro,,,[],,,,...,1596751086,-1,True,False,HillaryForPrison,t5_3d6h1,,0,[],0.0
9,[],,,Tantalus4200,,,[],,,,...,1596751555,1,True,False,HillaryForPrison,t5_3d6h1,,0,[],-0.3182


In [82]:
g = make_reddit_graph(df)

edge_ids = set([n[0] for n in g['edges']] + [n[1] for n in g['edges']])
node_ids = set([n['id'] for n in g['nodes']])

edge_ids - node_ids

missing_ids: {'i4tskq'}
<Response [200]>


set()

In [80]:
edge_ids = set([n[0] for n in x['edges']] + [n[1] for n in x['edges']])
node_ids = set([n['id'] for n in x['nodes']])

edge_ids - node_ids

{'i4tskq'}
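
The ad hoc checks in the cells above all compare the ids used in edges against the ids that have node records. That comparison could be wrapped in a small helper, roughly as follows. The name `dangling_ids` and the toy graph are hypothetical. An empty result means every edge endpoint has node information.

```python
def dangling_ids(graph):
    """Return ids referenced by an edge that have no entry in graph['nodes']."""
    node_ids = {n['id'] for n in graph['nodes']}
    edge_ids = {endpoint for edge in graph['edges'] for endpoint in edge}
    return edge_ids - node_ids

# toy graph in which the head post has no node record yet
toy = {
    'nodes': [{'id': 'c1'}, {'id': 'c2'}],
    'edges': [['c1', 'post'], ['c2', 'c1']],
}
missing = dangling_ids(toy)
print(missing)

# once the head post is added, nothing is left dangling
toy['nodes'].append({'id': 'post'})
print(dangling_ids(toy))
```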