# Twitter Conversation Collection

This notebook analyzes previously collected tweets to identify *conversations* of interest, and to collect those conversations more thoroughly for closer analysis. In a previous set of notebooks, stored in the `s3://api.pull.code` bucket, a series of tweet JSON datasets were created using keyword searches against the Twitter Search API. These datasets were stored in the `s3://json.terms.files` bucket as a set of files: one file per search criterion. With the `s3://json.terms.files` bucket mounted, this notebook walks through the tweets and identifies chatty conversations.

## Chattiness

This research is specifically about *bystander interventions* in social media. Part of the argument here is that a bystander intervention in public Twitter manifests as a conversation thread where two (or more) users are engaged in what looks like a conversation. In order to identify these threads, a measure of *chattiness* will be generated using two pieces of information available in each tweet:

* the `.public_metrics.reply_count` value available for each tweet, which is a count of how many times a tweet has been replied to
* the `.conversation_id` value which is an identifier for each threaded conversation

In practice the number of users who participate in a thread is important too (a bystander intervention can't just be a user creating their own thread with no interaction). But we won't be able to ascertain that until we fetch the complete thread.
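
Once a complete thread has been fetched, the participant check described above could be sketched as follows. This is only an illustration: `count_participants` is a hypothetical helper, and the `thread` list is hand-made data that assumes each tweet dictionary carries an `author_id` field.

```python
def count_participants(thread_tweets):
    """Return the number of distinct authors in a list of tweet dicts."""
    # tweets without an author_id are skipped rather than raising an error
    return len({t["author_id"] for t in thread_tweets if "author_id" in t})


# hypothetical thread: two users, one of whom tweets twice
thread = [
    {"id": "1", "author_id": "a", "conversation_id": "1"},
    {"id": "2", "author_id": "b", "conversation_id": "1"},
    {"id": "3", "author_id": "a", "conversation_id": "1"},
]

participants = count_participants(thread)
print(participants)
```

A thread where this count is 1 is a user replying only to themselves, which would not qualify as a bystander intervention under the argument above.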

## Conversation Events

The first step is to extract *conversation events* from the data retrieved from the Twitter Search API. `get_conv_events()` takes a `Path` as an argument and generates conversation event objects, each represented as a dictionary with the following keys: `conversation_id` and `reply_count`. It filters out retweets, which are important signals but are not directly relevant to identifying conversation threads and bystander interventions.
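
The retweet filter can be looked at in isolation. The sketch below uses a hypothetical helper, `is_retweet`, and hand-made tweet dictionaries; it assumes that a retweet is marked by an entry of type `retweeted` in the tweet's `referenced_tweets` list, and that original tweets may have no such list at all.

```python
def is_retweet(tweet):
    """Return True if any referenced tweet marks this tweet as a retweet."""
    refs = tweet.get("referenced_tweets", [])
    return any(ref.get("type") == "retweeted" for ref in refs)


# hypothetical tweets for illustration only
original = {"id": "1", "conversation_id": "1"}
reply = {"id": "2", "conversation_id": "1",
         "referenced_tweets": [{"type": "replied_to", "id": "1"}]}
retweet = {"id": "3", "conversation_id": "3",
           "referenced_tweets": [{"type": "retweeted", "id": "1"}]}

kept = [t["id"] for t in (original, reply, retweet) if not is_retweet(t)]
print(kept)
```

Replies and original tweets pass through; only the retweet is dropped.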

In [10]:
import json

def get_conv_events(tweets_file):
    """Yield a conversation event for each replied-to, non-retweet tweet in a file."""

    # parse each line of JSON, closing the file when we are done
    with tweets_file.open() as fh:
        for line in fh:
            response = json.loads(line)

            # some responses don't have any data
            if 'data' not in response:
                continue

            # iterate through each tweet and yield any conversation info
            for tweet in response['data']:

                # ignore retweets
                if 'retweeted' in [ref['type'] for ref in tweet.get('referenced_tweets', [])]:
                    continue

                # if the tweet has been replied to it's an event!
                if tweet['public_metrics']['reply_count'] > 0:
                    yield {
                        'conversation_id': tweet['conversation_id'],
                        'reply_count': tweet['public_metrics']['reply_count']
                    }

Let's test it out by looking at just the first 10 or so results:

In [5]:
from pathlib import Path

count = 0
for event in get_conv_events(Path('/home/ubuntu/jupyter/data/json.terms.files/prison_pipe_achievement_gap.json')):
    print(event)
    
    # stop after 10
    count += 1
    if count > 10:
        break
                                              

{'conversation_id': '1396979032340717571', 'reply_count': 3}
{'conversation_id': '1396968282972905476', 'reply_count': 1}
{'conversation_id': '1396958637080391680', 'reply_count': 1}
{'conversation_id': '1396953594616717316', 'reply_count': 1}
{'conversation_id': '1396951539785285635', 'reply_count': 4}
{'conversation_id': '1396950852452167680', 'reply_count': 1}
{'conversation_id': '1396950478324371457', 'reply_count': 1}
{'conversation_id': '1396938780330938368', 'reply_count': 19}
{'conversation_id': '1396866492625563651', 'reply_count': 1}
{'conversation_id': '1396866900236423170', 'reply_count': 1}
{'conversation_id': '1396861451776708610', 'reply_count': 4}


## Aggregate Conversations

Next we need to aggregate the conversation events by ID. `get_convs()` reads in the conversation events and returns a list of conversations, each with its `conversation_id` and total `reply_count`, sorted from the most replies to the fewest.
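
To make the aggregation concrete, here is a small sketch on toy data. It is not the notebook's implementation: `aggregate_reply_counts` is a hypothetical variant that uses `collections.Counter` to sum the reply counts per conversation, and the event values are made up.

```python
from collections import Counter


def aggregate_reply_counts(events):
    """Sum reply_count per conversation_id, highest total first."""
    totals = Counter()
    for e in events:
        totals[e["conversation_id"]] += e["reply_count"]
    # most_common() returns (key, count) pairs sorted by descending count
    return [
        {"conversation_id": conv_id, "reply_count": total}
        for conv_id, total in totals.most_common()
    ]


# hypothetical events: conversation "A" appears twice
events = [
    {"conversation_id": "A", "reply_count": 3},
    {"conversation_id": "B", "reply_count": 1},
    {"conversation_id": "A", "reply_count": 4},
]

ranked = aggregate_reply_counts(events)
print(ranked)
```

Conversation `A` ends up first because its two events are summed before ranking.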

In [3]:
def get_convs(events):
    convos = {}
    for e in events:
        conv_id = e['conversation_id']
        if conv_id in convos:
            convos[conv_id]['reply_count'] += e['reply_count']
        else:
            convos[conv_id] = {
                'conversation_id': conv_id,
                'reply_count': e['reply_count'],
            }
    
    # return the sorted conversations
    convos = convos.values()
    return sorted(convos, key=lambda c: c['reply_count'], reverse=True)

We can test this one too, by looking at the top 10 or so conversations:

In [4]:
count = 0

for conv in get_convs(get_conv_events(Path('/home/ubuntu/jupyter/data/json.terms.files/prison_pipe_achievement_gap.json'))):
    print(conv)
    
    count += 1 
    if count > 10: break

{'conversation_id': '1271571630045696001', 'reply_count': 735}
{'conversation_id': '1289699350197673986', 'reply_count': 570}
{'conversation_id': '1266783358731931648', 'reply_count': 522}
{'conversation_id': '1269010708886360066', 'reply_count': 261}
{'conversation_id': '1283068355000373250', 'reply_count': 226}
{'conversation_id': '1322871949815676930', 'reply_count': 197}
{'conversation_id': '1384218399333511169', 'reply_count': 176}
{'conversation_id': '1312127989363040258', 'reply_count': 171}
{'conversation_id': '1314436759489400832', 'reply_count': 166}
{'conversation_id': '1273988360286146561', 'reply_count': 151}
{'conversation_id': '1279598008426954752', 'reply_count': 135}


## Extract all the Conversations

Now we need to get the JSON files and process each one! We can write the counts data alongside the tweets they came from.
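
One way to derive the name of the counts file that sits alongside a tweets file is sketched below. `convs_path_for` is a hypothetical helper and the paths are placeholders; it uses `Path.with_name`, which changes only the final path component and so leaves directory names untouched even if they happen to contain `.json`.

```python
from pathlib import Path


def convs_path_for(tweets_path):
    """Return the sibling path for a tweets file: name.json -> name_convs.json."""
    return tweets_path.with_name(tweets_path.stem + "_convs.json")


# placeholder paths for illustration only
simple = convs_path_for(Path("data/example.json"))
tricky = convs_path_for(Path("archive.json.d/example.json"))
print(simple)
print(tricky)
```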

In [6]:
data_dir = Path('/home/ubuntu/jupyter/data/json.terms.files')

In [6]:
from pathlib import Path

for path in data_dir.iterdir():

    # ignore the convs files that we are generating
    if path.suffix == '.json' and '_convs' not in path.name:
        results = get_convs(get_conv_events(path))
        convs_path = path.as_posix().replace('.json', '_convs.json')
        # use a context manager so the output is flushed and closed
        with open(convs_path, 'w') as fh:
            json.dump(results, fh, indent=2)
        print(f'{path.name} had {len(results)} conversations')

black_ppl.json had 231400 conversations
prison_pipe_achievement_gap.json had 12920 conversations
black_us.json had 87058 conversations
crime.json had 133919 conversations
police_100.json had 79 conversations
wealth.json had 8725 conversations
police_violence.json had 43243 conversations
police.json had 123869 conversations
business.json had 81583 conversations
floyd_chauvin.json had 46533 conversations
blm.json had 215291 conversations
racism.json had 214756 conversations


## Getting the Conversations

So what do these conversations look like? That's really the subject for another notebook, as this one is concerned with *collecting* the conversations. But we do have one more step to fetch each of the conversations. All we have are pieces of conversations that came back from our searches, and pointers to some of those threads.

Fortunately the Twitter API now supports searching for tweets by their `conversation_id`, which allows the complete conversation thread to be fetched. This next bit of code takes the top 100 conversations for each dataset, fetches each full conversation thread as JSON, converts it to CSV, and then removes the intermediate JSON. The CSV files are written to a directory named after the dataset. Having the data as CSV should help when analyzing the threads in other tools.

To fetch data from the Twitter API you will need to have previously run `twarc2 configure` in the environment where this notebook is running.
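
Before spending API quota it can help to preview which commands would be run. The following dry-run sketch only builds the argument lists and calls nothing; `plan_commands` is a hypothetical helper, the directory and conversation IDs are placeholders, and the arguments mirror the `twarc2 conversation --archive` and `twarc2 csv` invocations used in this notebook.

```python
from pathlib import Path


def plan_commands(convs, conv_dir, limit=100):
    """Build (fetch, convert) argument lists for the top conversations."""
    plans = []
    for conv in convs[:limit]:
        conv_id = conv["conversation_id"]
        conv_json = conv_dir / f"{conv_id}.json"
        conv_csv = conv_dir / f"{conv_id}.csv"
        plans.append([
            # fetch the full thread as JSON
            ["twarc2", "conversation", "--archive", conv_id, str(conv_json)],
            # convert the JSON to CSV
            ["twarc2", "csv", str(conv_json), str(conv_csv)],
        ])
    return plans


# placeholder conversations, already sorted by reply_count
toy_convs = [
    {"conversation_id": "111", "reply_count": 5},
    {"conversation_id": "222", "reply_count": 3},
]

plans = plan_commands(toy_convs, Path("example_convs"), limit=1)
for fetch, convert in plans:
    print(" ".join(fetch))
    print(" ".join(convert))
```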

In [None]:
import os

from sh import twarc2

for conv_file in data_dir.glob('*convs.json'):
    # read the ranked conversations, closing the file afterwards
    with open(conv_file) as fh:
        convs = json.load(fh)
    print(f'processing {conv_file}')

    conv_dir = data_dir / conv_file.name.replace('_convs.json', '_convs')
    if not conv_dir.is_dir():
        conv_dir.mkdir()
    
    # get the full threads for the top 100 conversation ids
    for conv in convs[0:100]:
        print(conv)
        conv_id = conv['conversation_id']
        conv_json = conv_dir / f'{conv_id}.json' 
        conv_csv = conv_dir / f'{conv_id}.csv'
        
        # don't re-generate the csv if we already have it!
        if conv_csv.is_file():
            continue

        # get the json, convert to csv and remove the json
        twarc2('conversation', '--archive', conv_id, conv_json)
        
        # sometimes there is nothing to retrieve for the conversation_id
        if conv_json.is_file():
            twarc2('csv', conv_json, conv_csv)
            os.remove(conv_json)

## Random Sample

In addition to getting the top 100 conversations for each tweet dataset we can get a random sample of all the conversations, and save them as CSV for analysis.
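
If the sample needs to be repeatable across runs, a fixed seed can be used. The sketch below is a stand-in that uses the standard library's `random` module rather than pandas; `sample_conversation_ids` and its `seed` parameter are hypothetical, and the conversation IDs are made up. With pandas, the equivalent knob is the `random_state` argument to `DataFrame.sample`.

```python
import random


def sample_conversation_ids(convs, n, seed=None):
    """Return up to n conversation ids chosen at random without replacement."""
    ids = [c["conversation_id"] for c in convs]
    rng = random.Random(seed)
    # the sample size cannot be bigger than the population
    return rng.sample(ids, min(n, len(ids)))


# hypothetical conversations for illustration only
toy_convs = [{"conversation_id": str(i), "reply_count": i} for i in range(1, 6)]

picked = sample_conversation_ids(toy_convs, 3, seed=42)
picked_again = sample_conversation_ids(toy_convs, 3, seed=42)
everything = sample_conversation_ids(toy_convs, 30, seed=42)
print(picked)
```

Using the same seed yields the same IDs, and asking for more conversations than exist simply returns all of them in random order.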

In [17]:
import sh
import pandas

seen = {}
twitter_json_dir = Path('/home/ubuntu/jupyter/data/tweets.pull')

def sample(convs_file, n):
    convs_dir = twitter_json_dir / (convs_file.stem + f"_{n}")
    if not convs_dir.is_dir():
        convs_dir.mkdir()

    # read the ranked conversations, closing the file afterwards
    with open(convs_file) as fh:
        convs = json.load(fh)
    df = pandas.DataFrame(convs)
    
    # sample size cannot be bigger than the dataframe
    if n > len(df):
        n = len(df)

    s = df.sample(n)
    
    for conv_id in s["conversation_id"]:
        conv_json = convs_dir / f"{conv_id}.jsonl"
        conv_csv = convs_dir / f"{conv_id}.csv"
        if conv_id in seen:
            print(f"using already fetched {conv_csv}")
            sh.cp(seen[conv_id], conv_csv)
        else:
            print(conv_csv)
            sh.twarc2("conversation", "--archive", conv_id, conv_json)
            # if the conversation_id no longer yields any tweets the json file will not exist
            if not conv_json.is_file():
                print(f"conversation {conv_id} no longer exists")
                sh.touch(conv_json)
            else:
                sh.twarc2("csv", conv_json, conv_csv)
                sh.rm(conv_json)
                seen[conv_id] = conv_csv

In [None]:
for convs_file in data_dir.glob("*convs.json"):
    sample(convs_file, 30)    