# Twitter Conversations

This notebook analyzes previously collected tweets to identify *conversations* of interest, and to collect those conversations more thoroughly for closer analysis. In a previous set of notebooks, stored at `s3://api.pull.code`, a series of tweet JSON datasets was created using keyword searches against the Twitter Search API. These datasets were stored in the `s3://json.terms.files` bucket as a set of files, one file per search criterion. With the `s3://json.terms.files` bucket mounted, this notebook walks through the tweets and identifies chatty conversations.

## Chattiness

This research is specifically about *bystander interventions* in social media. Part of the argument here is that a bystander intervention in public Twitter manifests as a conversation thread where two (or more) users are engaged in what looks like a conversation. In order to identify these threads, a measure of *chattiness* will be generated using two pieces of information available in each tweet:

* the `.public_metrics.reply_count` value available for each tweet, which is a count of how many times the tweet has been replied to
* the `.conversation_id` value, which is an identifier for each threaded conversation

In practice the number of users who participate in a thread is important too (a bystander intervention can't just be a user creating their own thread with no interaction). But we won't be able to ascertain that until we fetch the complete thread.
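
Once a complete thread has been fetched, the participant check described above could be sketched roughly as follows. This is only an illustration and not part of the notebook: `count_participants` and the sample tweets are hypothetical, and the sketch assumes each fetched tweet is a dictionary carrying an `author_id` key.

```python
def count_participants(thread_tweets):
    """Return the number of distinct authors in a fetched thread."""
    return len({tweet['author_id'] for tweet in thread_tweets})

# hypothetical thread: two users replying to each other
thread = [
    {'author_id': '1', 'text': 'original tweet'},
    {'author_id': '2', 'text': 'a reply'},
    {'author_id': '1', 'text': 'a reply to the reply'},
]

# a thread is only a candidate intervention if more than one user took part
is_candidate = count_participants(thread) > 1
print(count_participants(thread), is_candidate)
```

A thread where a single user replies to themselves would give a participant count of 1 and be excluded by this check.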

## Conversation Events

The first step is to extract *conversation events* from the data retrieved from the Twitter Search API. `get_conv_events()` takes a `Path` as an argument and generates conversation event objects, each represented as a dictionary with the following keys: `conversation_id` and `reply_count`. Only tweets that have received at least one reply are yielded. It filters out retweets, which are important signals but are not directly relevant to identifying conversation threads and bystander interventions.

In [83]:
import json

def get_conv_events(tweets_file):
    
    # parse each line of json
    for line in tweets_file.open():
        response = json.loads(line)
        
        # some responses don't include any data
        if 'data' not in response:
            continue
        
        # iterate through each tweet and yield any conversation info
        for tweet in response['data']:      
            
            # ignore retweets
            if 'retweeted' in [ref['type'] for ref in tweet.get('referenced_tweets', [])]:
                continue
            
            # if the tweet has been replied to it's an event!
            if tweet['public_metrics']['reply_count'] > 0:
                yield({
                    'conversation_id': tweet['conversation_id'],
                    'reply_count': tweet['public_metrics']['reply_count']
                })

Let's test it out, just looking at the first 10 or so results:

In [65]:
from pathlib import Path

count = 0
for event in get_conv_events(Path('/home/ubuntu/jupyter/data/json.terms.files/prison_pipe_achievement_gap.json')):
    print(event)
    
    # stop once 11 events have been printed
    count += 1
    if count > 10:
        break
                                              

{'conversation_id': '1396979032340717571', 'reply_count': 3}
{'conversation_id': '1396968282972905476', 'reply_count': 1}
{'conversation_id': '1396958637080391680', 'reply_count': 1}
{'conversation_id': '1396953594616717316', 'reply_count': 1}
{'conversation_id': '1396951539785285635', 'reply_count': 4}
{'conversation_id': '1396950852452167680', 'reply_count': 1}
{'conversation_id': '1396950478324371457', 'reply_count': 1}
{'conversation_id': '1396938780330938368', 'reply_count': 19}
{'conversation_id': '1396866492625563651', 'reply_count': 1}
{'conversation_id': '1396866900236423170', 'reply_count': 1}
{'conversation_id': '1396861451776708610', 'reply_count': 4}
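
The retweet filter used in `get_conv_events()` relies on the `referenced_tweets` field of each tweet. The sketch below isolates that check so its behavior is easier to see. The helper name `is_retweet` and the sample tweets are hypothetical, and the tweets are trimmed to just the fields needed here.

```python
def is_retweet(tweet):
    """True if the tweet references another tweet with type 'retweeted'."""
    refs = tweet.get('referenced_tweets', [])
    return any(ref['type'] == 'retweeted' for ref in refs)

# hypothetical tweets, trimmed to the fields used here
original = {'id': '1', 'conversation_id': '1'}
reply = {'id': '2', 'conversation_id': '1',
         'referenced_tweets': [{'type': 'replied_to', 'id': '1'}]}
retweet = {'id': '3', 'conversation_id': '3',
           'referenced_tweets': [{'type': 'retweeted', 'id': '1'}]}

# keep everything except the retweet
kept = [t['id'] for t in (original, reply, retweet) if not is_retweet(t)]
print(kept)
```

Original tweets have no `referenced_tweets` key at all, which is why the lookup falls back to an empty list.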


## Aggregate Conversations

Next we need to aggregate the conversation events by ID. `get_convs()` reads in the conversation events and returns a list of conversations, each with its `conversation_id` and total `reply_count`, sorted from most to fewest replies.

In [79]:
def get_convs(events):
    convos = {}
    for e in events:
        conv_id = e['conversation_id']
        if conv_id in convos:
            convos[conv_id]['reply_count'] += e['reply_count']
        else:
            convos[conv_id] = {
                'conversation_id': conv_id,
                'reply_count': e['reply_count'],
            }
    
    # return the sorted conversations
    convos = convos.values()
    return sorted(convos, key=lambda c: c['reply_count'], reverse=True)

We can test this one too, by looking at the 11 conversations with the most replies:

In [80]:
count = 0

for conv in get_convs(get_conv_events(Path('/home/ubuntu/jupyter/data/json.terms.files/prison_pipe_achievement_gap.json'))):
    print(conv)
    
    count += 1 
    if count > 10: break

{'conversation_id': '1271571630045696001', 'reply_count': 735}
{'conversation_id': '1289699350197673986', 'reply_count': 570}
{'conversation_id': '1266783358731931648', 'reply_count': 522}
{'conversation_id': '1269010708886360066', 'reply_count': 261}
{'conversation_id': '1283068355000373250', 'reply_count': 226}
{'conversation_id': '1322871949815676930', 'reply_count': 197}
{'conversation_id': '1384218399333511169', 'reply_count': 176}
{'conversation_id': '1312127989363040258', 'reply_count': 171}
{'conversation_id': '1314436759489400832', 'reply_count': 166}
{'conversation_id': '1273988360286146561', 'reply_count': 151}
{'conversation_id': '1279598008426954752', 'reply_count': 135}
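
For comparison, the same per-conversation summing can be sketched with `collections.Counter` from the standard library. This is an alternative illustration rather than the notebook's method: `count_replies` and the sample events are hypothetical, and the result is a `Counter` keyed by `conversation_id` instead of a list of dictionaries.

```python
from collections import Counter

def count_replies(events):
    """Sum reply_count per conversation_id."""
    totals = Counter()
    for event in events:
        totals[event['conversation_id']] += event['reply_count']
    return totals

# hypothetical events: conversation 'a' appears twice
events = [
    {'conversation_id': 'a', 'reply_count': 3},
    {'conversation_id': 'b', 'reply_count': 1},
    {'conversation_id': 'a', 'reply_count': 4},
]

totals = count_replies(events)
top = totals.most_common(1)
print(top)
```

`most_common(n)` returns the `n` largest totals already sorted, so no manual loop counter is needed to inspect the top conversations.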


## Process the JSON files

Now we need to get the JSON files and process each one! The conversation counts for each file are written next to the tweets they came from, in a file with a `-convs.json` suffix.

In [None]:
from pathlib import Path

data_dir = Path('/home/ubuntu/jupyter/data/json.terms.files')
for path in data_dir.iterdir():

    # ignore the convs files that we are generating
    if path.suffix == '.json' and 'convs' not in path.name:
        results = get_convs(get_conv_events(path))
        convs_path = path.with_name(f'{path.stem}-convs.json')
        # use a context manager so the output file is flushed and closed
        with convs_path.open('w') as fh:
            json.dump(results, fh, indent=2)
        print(f'{path.name} had {len(results)} conversations')

black_ppl.json had 94664 conversations
prison_pipe_achievement_gap.json had 12920 conversations
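
As a possible next step, the generated `-convs.json` files could be read back to pick out the chattiest conversations for full collection. The sketch below is only an assumption about how that might look: `chatty_conversations`, the `min_replies` threshold of 10, and the sample data are all hypothetical and do not come from the notebook.

```python
import json
import tempfile
from pathlib import Path

def chatty_conversations(convs_path, min_replies):
    """Return the conversation IDs whose total reply_count is at least min_replies."""
    with Path(convs_path).open() as fh:
        convs = json.load(fh)
    return [c['conversation_id'] for c in convs if c['reply_count'] >= min_replies]

# write a tiny stand-in for a -convs.json file
sample = [
    {'conversation_id': 'a', 'reply_count': 12},
    {'conversation_id': 'b', 'reply_count': 2},
]
with tempfile.TemporaryDirectory() as tmp:
    convs_path = Path(tmp) / 'example-convs.json'
    convs_path.write_text(json.dumps(sample))
    chatty = chatty_conversations(convs_path, min_replies=10)

print(chatty)
```

Because the files are already sorted by `reply_count`, the returned IDs keep that order.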
