# Ferguson Activists

This notebook contains some work I've been doing with Bergis Jules and Mosi Secret
to investigate tweets from Ferguson activists who have died. These activists
include:

* Darren Seals
* Edward Crawford
* Bassem Masri
* Deandre Joshua
* Danye Jones

The analysis depends on several tweet datasets that were collected as part of
the Documenting the Now project, so you will need access to those datasets in order to work with the original data.

## Fetch the data

```
aws s3 sync s3://mith-bags/4D41FEA7-9E85-45B8-9499-362212278CAB data/4D41FEA7-9E85-45B8-9499-362212278CAB
aws s3 sync s3://mith-bags/AE0A86DE-E17D-438E-BCDF-AA1F04851CAF data/AE0A86DE-E17D-438E-BCDF-AA1F04851CAF
aws s3 sync s3://mith-bags/D651C3F6-5619-4A42-A8BC-7C22B7A9A44A data/D651C3F6-5619-4A42-A8BC-7C22B7A9A44A
aws s3 sync s3://mith-bags/fe28a093-d3f4-42d7-83ba-f5ba1b1cc765 data/fe28a093-d3f4-42d7-83ba-f5ba1b1cc765
```

## Filter the data

Rather than working on the full dataset of many millions of tweets, we will take one pass through all the tweets and keep only the ones that are relevant to the users we are interested in. Note that this will run for a day or so:

In [None]:
#!/usr/bin/env python3

import csv
import sys
import glob
import gzip
import json

from twarc.json2csv import get_row, get_headings

queries = [
    {
        "name": "Darren Seals",
        "screen_name": "KingDSeals",
        "user_id": "2747681903"
    },
    {
        "name": "Edward Crawford",
        "screen_name": "eyeFLOODpanties",
        "user_id": "84946406"
    },
    {
        "name": "Bassem Masri",
        "screen_name": "bassem_masri",
        "user_id": "2734647354"
    },
    {
        "name": "Deandre Joshua",
        "screen_name": None,
        "user_id": None
    },
    {
        "name": "Danye Jones",
        "screen_name": None,
        "user_id": None
    }
]

data_dirs = [
    {
        "name": "Beyond the Hashtags",
        "glob": "data/AE0A86DE-E17D-438E-BCDF-AA1F04851CAF/data/tweets/*.txt.gz"
    },
    {
        "name": "BlackLivesMatter",
        "glob": "data/4D41FEA7-9E85-45B8-9499-362212278CAB/data/*.json.gz",
    },
    {
        "name": "Ferguson Scrape",
        "glob": "data/D651C3F6-5619-4A42-A8BC-7C22B7A9A44A/data/*.json.gz",
    },
    {
        "name": "Ferguson",
        "glob": "data/fe28a093-d3f4-42d7-83ba-f5ba1b1cc765/data/*.json.gz"
    }
]

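def filter_tweets():
    # newline="" lets the csv module control line endings; the with block
    # makes sure the file is flushed and closed when the run finishes
    with open("data/ferguson-activists.csv", "w", newline="") as fh:
        out = csv.writer(fh)
        out.writerow(get_headings() + ['dataset', 'file', 'user_match', 'match_type'])
        for d in data_dirs:
            for f in glob.glob(d['glob']):
                sys.stdout.write('\n{}:'.format(f))
                sys.stdout.flush()
                process_file(d['name'], f, out)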

def process_file(source, json_path, out):
    for line in gzip.open(json_path):
        try:
            tweet = json.loads(line)
        except ValueError:
            # skip lines that are not valid JSON
            continue
        match = tweet_match(tweet)
        if match:
            sys.stdout.write('.')
            sys.stdout.flush()
            out.writerow(get_row(tweet) + [source, json_path] + match)

def tweet_match(t):
    for q in queries:

        # tweet by the user?
        if q['user_id'] == t['user']['id_str']:
            if t['in_reply_to_user_id_str']:
                return [q['name'], 'replied']
            elif t.get('retweeted_status') is not None:
                return [q['name'], 'retweeted']
            else:
                return [q['name'], 'posted']

        # someone replied to a tweet by the user?
        if q['user_id'] and q['user_id'] == t['in_reply_to_user_id_str']:
            return [q['name'], 'replied to']

        # user retweeted by someone else?
        rt = t.get('retweeted_status')
        if rt and q['user_id'] == rt['user']['id_str']:
            return [q['name'], 'user retweeted']

        # user mentioned by someone else?
        for u in t['entities'].get('user_mentions', []):
            if q['user_id'] == u['id_str']:
                return [q['name'], 'user mentioned']

        # someone mentioned them by name?
        # fall back to an empty string so a tweet without text cannot raise
        text = t.get('text') or t.get('full_text') or ''
        text = text.lower()
        if q['name'].lower() in text:
            return [q['name'], 'name mention']

    return None
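
One caveat with the matching code: `process_file` only guards the JSON parse, while `tweet_match` indexes `t['user']`, `t['in_reply_to_user_id_str']`, and `t['entities']` directly. If any of the collected files contain lines that are valid JSON but not tweets (streaming collections sometimes include limit or delete notices; whether these datasets do is an assumption, not something checked here), the run would stop with a `KeyError`. A minimal sketch of a guard, where `looks_like_tweet` is a hypothetical helper that is not part of twarc:

```python
def looks_like_tweet(t):
    """Return True if t has the keys that tweet_match reads without a default.

    Hypothetical helper for illustration; not part of twarc.
    """
    return (
        isinstance(t, dict)
        and isinstance(t.get("user"), dict)
        and "id_str" in t["user"]
        and "in_reply_to_user_id_str" in t
        and isinstance(t.get("entities"), dict)
    )
```

It could be called in `process_file` right after `json.loads`, skipping the line when it returns `False`.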


## Query the data

We created a CSV of the tweet data in the previous step; now let's convert it to a SQLite database so we can query it.

In [2]:
import sqlite3

db = sqlite3.connect('data/ferguson-activists.db')
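
The conversion itself is not shown in this notebook. A minimal sketch of one way to do it, assuming the CSV written by the filtering step has a header row and that the table should be called `results` (the name the queries in this notebook use). `csv_to_sqlite` is a hypothetical helper, and every column is stored as `TEXT` for simplicity:

```python
import csv
import sqlite3

def csv_to_sqlite(csv_path, db, table="results"):
    """Load a CSV file with a header row into a SQLite table of TEXT columns.

    Hypothetical helper for illustration. Any existing table with the
    same name is dropped first.
    """
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        headings = next(reader)
        # quote column names so unusual headings cannot break the SQL
        cols = ", ".join('"{}" TEXT'.format(h.replace('"', '""')) for h in headings)
        placeholders = ", ".join("?" for _ in headings)
        db.execute('DROP TABLE IF EXISTS "{}"'.format(table))
        db.execute('CREATE TABLE "{}" ({})'.format(table, cols))
        db.executemany(
            'INSERT INTO "{}" VALUES ({})'.format(table, placeholders),
            reader,
        )
    db.commit()
```

With the connection above this would be called as `csv_to_sqlite('data/ferguson-activists.csv', db)`.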

See what users the activists replied to the most:

In [3]:
q = """
    SELECT 
      user_screen_name,
      in_reply_to_screen_name,
      COUNT(*) AS total
    FROM results
    WHERE
      match_type = 'replied'
    GROUP BY
      user_screen_name,
      in_reply_to_screen_name
    HAVING
      total > 1
    ORDER BY
      user_screen_name,
      total DESC
    """
    
for result in db.execute(q):
    print(result)


('KingDSeals', 'KingDSeals', 12)
('KingDSeals', 'DefJamRecords', 6)
('KingDSeals', 'MrChuckD', 6)
('KingDSeals', 'deray', 6)
('KingDSeals', 'kendricklamar', 5)
('KingDSeals', 'AZEALIABANKS', 4)
('KingDSeals', 'TefPoe', 4)
('KingDSeals', 'myfabolouslife', 4)
('KingDSeals', 'youngbuck', 4)
('KingDSeals', 'BarackObama', 3)
('KingDSeals', 'MasterPMiller', 3)
('KingDSeals', 'ShetheStreet', 3)
('KingDSeals', 'TalibKweli', 3)
('KingDSeals', 'TheRevAl', 3)
('KingDSeals', 'lostvoices14', 3)
('KingDSeals', 'AntonioFrench', 2)
('KingDSeals', 'Complex', 2)
('KingDSeals', 'DeplorableSunny', 2)
('KingDSeals', 'FTP_2015', 2)
('KingDSeals', 'FortuneMagazine', 2)
('KingDSeals', 'FunnyMaine', 2)
('KingDSeals', 'JColeNC', 2)
('KingDSeals', 'JuanMThompson', 2)
('KingDSeals', 'KingJames', 2)
('KingDSeals', 'Nettaaaaaaaa', 2)
('KingDSeals', 'SybrinaFulton', 2)
('KingDSeals', 'TheDreadPoet', 2)
('KingDSeals', 'TupacShakurST', 2)
('KingDSeals', 'allhiphopcom', 2)
('KingDSeals', 'bassem_masri', 2)
('KingDSeals

What does this look like as a network? First create the graph.

In [4]:
import networkx

replies_g = networkx.DiGraph()
for from_user, to_user, count in db.execute(q):
    if from_user == to_user:
        continue
    replies_g.add_node(from_user)
    replies_g.add_node(to_user)
    replies_g.add_edge(from_user, to_user, weight=count)

Now create a function to visualize the graph with pyvis. I'm copying this handy little function from another project:

In [6]:
import pyvis

def label(s):
    if len(s) > 40:
        return s[0:10]
    return s

def vis(g, html_path, width="100%", height=600):
    degree = g.degree()
        
    v = pyvis.network.Network(notebook=True, height=height, width=width, directed=True)
    for src, dst, data in g.edges(data=True):
        v.add_node(src, label(src), title=src, value=degree[src], color='#ccc')
        v.add_node(dst, label(dst), title=dst, value=degree[dst], color='lightgreen')
        v.add_edge(src, dst, value=data['weight'])
        
    
    v.set_options('{"edges":{"color":{"inherit":true},"smooth":false},"physics":{"forceAtlas2Based":{"springLength":50},"minVelocity":0.75,"solver":"forceAtlas2Based"}}')
    return v.show(html_path)

vis(replies_g, 'data/replies.html')

We can do a similar query to see who the users are retweeting:

In [7]:
q = """
    SELECT 
      user_screen_name,
      retweet_or_quote_screen_name,
      COUNT(*) AS total
    FROM results
    WHERE
      match_type = 'retweeted'
    GROUP BY
      user_screen_name,
      retweet_or_quote_screen_name
    HAVING
      total > 1
    ORDER BY
      user_screen_name,
      total DESC
    """
    
for result in db.execute(q):
    print(result)

('KingDSeals', 'bassem_masri', 30)
('KingDSeals', 'VanguardTNT', 10)
('KingDSeals', 'davidbanner', 7)
('KingDSeals', 'RE_invent_ED', 5)
('KingDSeals', 'TheOutlawz', 5)
('KingDSeals', 'ImRikaJai', 4)
('KingDSeals', 'T_DUBB_O', 4)
('KingDSeals', 'YOUNG_NOBLE1', 4)
('KingDSeals', 'brothercartan', 4)
('KingDSeals', 'handsupunited_', 4)
('KingDSeals', 'tariqnasheed', 4)
('KingDSeals', 'Gimme_A_Break1', 3)
('KingDSeals', 'TefPoe', 3)
('KingDSeals', 'jjmcphatter', 3)
('KingDSeals', 'lostvoices14', 3)
('KingDSeals', 'mikebrowncover', 3)
('KingDSeals', 'neweryork', 3)
('KingDSeals', 'Cherrell_Brown', 2)
('KingDSeals', 'ChuckModi1', 2)
('KingDSeals', 'Khan_SHEGOG', 2)
('KingDSeals', 'Nettaaaaaaaa', 2)
('KingDSeals', 'OwlsAsylum', 2)
('KingDSeals', 'dlatchison011', 2)
('KingDSeals', 'guygruber', 2)
('KingDSeals', 'peaceful_birdie', 2)
('KingDSeals', 'youthradio', 2)
('King_D_Seals', 'THEREALBANNER', 7)
('bassem_masri', 'Delo_Taylor', 203)
('bassem_masri', 'sarahkendzior', 45)
('bassem_masri', 'de

The retweet network looks interesting because bassem_masri and KingDSeals are retweeting each other. Let's create a network visualization of the retweets.

In [9]:
retweets_g = networkx.DiGraph()
for from_user, to_user, count in db.execute(q):
    if from_user == to_user:
        continue
    retweets_g.add_node(from_user)
    retweets_g.add_node(to_user)
    retweets_g.add_edge(from_user, to_user, weight=count)

vis(retweets_g, 'data/retweets.html')

This one has a bit more structure. Of the three activists we have retweet data for, the only Twitter user that all three retweeted is Nettaaaaaaaa (Johnetta Elzie).
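
An observation like this can be checked directly from the query rows rather than by eye. The following is a minimal sketch, not the method used in this notebook: `retweeted_by_all` is a hypothetical helper that takes `(from_user, to_user, count)` rows such as those returned by `db.execute(q)`, plus an explicit list of source screen names. Passing the sources explicitly matters because the output above lists both `KingDSeals` and `King_D_Seals`, which may be the same account under two screen names.

```python
from collections import defaultdict

def retweeted_by_all(rows, sources):
    """Return the accounts that every screen name in `sources` has retweeted.

    rows: iterable of (from_user, to_user, count) tuples
    sources: screen names that must all appear as retweeters
    """
    wanted = set(sources)
    retweeters = defaultdict(set)
    for from_user, to_user, _count in rows:
        # ignore rows from other accounts and self-retweets
        if from_user in wanted and from_user != to_user:
            retweeters[to_user].add(from_user)
    return sorted(t for t, who in retweeters.items() if who == wanted)
```

Because the retweet query keeps only pairs with `total > 1`, running this on its rows finds accounts that each source retweeted at least twice.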