In [None]:
from pandas import to_datetime, Grouper, set_option, read_csv
set_option("display.max_rows",100)
set_option("display.max_colwidth",500)

Following how stories move
-------------------------

In this notebook we will apply our basic Twitter skills to examining how information or misinformation spreads. Our goal is to look at how a hashtag, a meme, a story or, broadly, a concerted messaging campaign moves between people, publishers and platforms. We will use the openness of Twitter to dissect some of the components that can be mobilized to spread information. This is just the start of our exploration, which will eventually take us into journalism's changing business model and so-called adtech. We began our treatment of these tools in the last class, but mostly exhibited what might be possible, without a real narrative arc. 

In this notebook, we are going to begin with [the article from Wired](https://www.wired.com/story/how-liberals-amped-up-a-parkland-shooting-conspiracy-theory/) that we are reading for Monday about how well-meaning liberals might have helped spread the idea of "Crisis Actors" posing as students for news interviews after the Parkland shooting.
<br>
<br>

<img src="https://github.com/computationaljournalism/columbia2018/raw/master/images/EXPOSED.jpg" style="width: 50%; border: #000000 1px outset;"/>
<br><br>

The Wired article is a mix of simple web searching and (I'm assuming) programmatic use of Twitter's API. They focused on a report in The Gateway Pundit [*"EXPOSED: School Shooting Survivor Turned Activist David Hogg’s Father in FBI, Appears To Have Been Coached On Anti-Trump Lines \[VIDEO\]"*](http://www.thegatewaypundit.com/2018/02/exposed-school-shooting-surviver-turned-activist-david-hoggs-father-fbi-appears-coached-anti-trump-lines-video/) and an accompanying YouTube video that has since been removed. The report focused mostly on one student, raising the suspicion that he was coached to make statements that were anti-Trump and pro-gun legislation.

>*One student, in particular, David Hogg has been astonishingly articulate and highly skilled at propagating a new anti-Conservative/anti-Trump narrative behind the recent school shooting. Few have seen this type of rapid media play before, and when they have it has come from well-trained political operatives and MSM commentators.
<br><br>
**Immediately, these students-turned-activists threw up some red flags.***

They end their report stating

>*... in a recently uncovered early cut from one of his [Hogg's] interviews it appears he was heavily coached on lines and is merely reciting a script. Frequently seen in the footage mouthing the lines he should be reciting. Hogg becomes flustered multiple times, is seen apologizing, and asking for re-takes.*

In class we read [an article from USA Today](https://www.usatoday.com/story/tech/talkingtech/2018/02/24/7-days-fringe-mainstream-how-conspiracy-theory-ricocheted-around-web/361446002/) discussing how misinformation spread about who was being interviewed by news sources - were they students dealing with a horrific situation, or were they paid actors? The [Wired article focuses on the role of the GatewayPundit in this story](https://www.wired.com/story/how-liberals-amped-up-a-parkland-shooting-conspiracy-theory/) and we will use it to develop some basic tools for looking at misinformation and assessing the credibility of sources.

Wired begins by casting a wide net in identifying how the "crisis actors" story spread. Platforms like Facebook and YouTube are characterized as having struggled to deal with algorithmic recommendations and trends that spread the content.

>*In the days that followed the shooting, social media companies scrambled to deal with complaints about the proliferation of the crisis actors conspiracy across their platforms—even as their own algorithms helped to promote that same content. There were new rounds of statements from Facebook, YouTube, and Google about addressing the problematic content and assurances that more AI and human monitors must be enlisted in this cause.*

Let's follow Wired's approach and start when The Gateway Pundit first posts its report on 2/19. There is an almost immediate reaction. 

>*Of the 660 tweets and retweets of the “crisis actors” Gateway Pundit conspiracy story during the hour after it was posted, 200 (30 percent) came from accounts that have tweeted more than 45,000 times. Human, cyborg, or bot, these accounts are acting with purpose to amplify content (more on this in a moment). And this machinery of curation, duplication, and amplification both cultivates echo chambers that keep human users engaged and impacts how social media companies’ algorithms decide what is important, trending, and promoted to other users—part of triggering a feedback loop to win the “algorithmic popularity contest.”*

Let's get a handle on what's going on. We've pulled all the tweets that include a link to The Gateway Pundit article. This will include both original tweets as well as retweets, and we will focus, in particular, on the retweets. You can [download the file from Dropbox](https://www.dropbox.com/s/05p0wyy4ew0qqzq/thegatewaypundit.json.gz?dl=0), unzip it to produce `thegatewaypundit.json`, and place it in the same folder as this notebook.

**Flattening tweets**

Tweets, as we have seen multiple times, are represented by Twitter as JSON strings. Once parsed, they become dictionaries in Python terms. The structure can hold different content depending on the kind of tweet it is. For example, a retweet will include a key `retweeted_status` that holds, essentially, the tweet that is being retweeted. If that key is not present, then the tweet is not a retweet. In addition, there are elements in a tweet to hold all of the hashtags or mentions it contains. These data are variable in number, as someone could use one, two or many hashtags. Or none. 
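
To make the retweet check concrete, here is a minimal sketch using two made-up tweets. The IDs, screen names and text are invented for illustration; only the key names (`retweeted_status`, `entities`, `hashtags`, `user`, `screen_name`) follow the layout we have been looking at. The helper names `is_retweet` and `hashtag_texts` are our own.

```python
from json import loads

# Two invented tweets, written as JSON strings the way they would sit in a file.
original_json = '{"id": 1, "text": "Read this #news #florida", "user": {"screen_name": "alice"}, "entities": {"hashtags": [{"text": "news"}, {"text": "florida"}]}}'
retweet_json = '{"id": 2, "text": "RT @alice: Read this #news #florida", "user": {"screen_name": "bob"}, "entities": {"hashtags": []}, "retweeted_status": {"id": 1, "user": {"screen_name": "alice"}}}'

original = loads(original_json)
retweet = loads(retweet_json)

def is_retweet(tweet):
    # a retweet carries the tweet being retweeted under "retweeted_status"
    return "retweeted_status" in tweet

def hashtag_texts(tweet):
    # hashtags are variable in number: this list may hold zero, one or many
    return [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]

print(is_retweet(original), hashtag_texts(original))
print(is_retweet(retweet), hashtag_texts(retweet))
```

The first tweet is not a retweet and has two hashtags; the second is a retweet with none, which is exactly the kind of unevenness a single table struggles with.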

This kind of data is different from what we started the term with, when everything was represented as a table. That is, each row was a unit of observation (a person given a survey, or a county, or a particular trial in an experiment) and each column stood for the measurements or qualities we recorded for each unit of observation (heights and weights, or population and high school graduation rate, or whether a patient got better under some treatment or not). 

One way around this mismatch is to "flatten" the JSON object, pulling out the information we need and loading it into tables. We'll see on Thursday that this fracturing of data into different tables (one for hashtags, or one for mentions, or even one for the Twitter accounts involved) is a classic move in database design. Those of you who have had experience with relational databases and SQL (the Structured Query Language) will be familiar with this idea.
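
As a rough sketch of that "fracturing", the code below splits two invented tweets into two tables: one with a row per tweet, and one with a row per (tweet, hashtag) pair. The tweets and the variable names `tweet_rows` and `hashtag_rows` are assumptions made up for this example; the shared `id` column is what would let us join the tables back together later.

```python
# Invented tweets, already parsed into dictionaries.
tweets = [
    {"id": 1, "user": {"screen_name": "alice"},
     "entities": {"hashtags": [{"text": "news"}, {"text": "florida"}]}},
    {"id": 2, "user": {"screen_name": "bob"},
     "entities": {"hashtags": []}},
]

tweet_rows = []     # one row per tweet
hashtag_rows = []   # one row per (tweet, hashtag) pair

for tweet in tweets:
    tweet_rows.append([tweet["id"], tweet["user"]["screen_name"]])
    for tag in tweet["entities"]["hashtags"]:
        # a tweet with no hashtags simply contributes no rows here
        hashtag_rows.append([tweet["id"], tag["text"]])

print(tweet_rows)
print(hashtag_rows)
```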

### A single tweet

So, let's start this off by reading in some data. Each row in `thegatewaypundit.json` is a JSON string for a tweet that includes a link to the GatewayPundit story. Here, to remind ourselves, is one tweet represented as a Python dictionary. The file `thegatewaypundit.json` is organized in time order, with the oldest tweets first. Since `readline()` reads the first line of a file, what we get is the tweet that started the story. Have a look.

1. What was the content of the tweet?
2. Did it contain anything besides the link to the GatewayPundit article?
3. What screen name is associated with this first tweet?
4. What can you tell me about this person?

In [None]:
from json import loads

d = open("thegatewaypundit.json").readline()
loads(d)

Looking over the data, we see that the first tweet to launch the story was by an account by the name of Tokaise. This account is responsible for over 80k tweets since its owner joined Twitter two years ago. If we assume a 12-hour tweet day and no breaks for the weekends, that's about 9 tweets per hour! Every hour. 

In [None]:
80422/(365*2*12)

**Twarc**

We are going to start combining different kinds of data from Twitter into our investigations. For that, we are going to use a tool that is a bit more friendly when it comes to "industrial strength" analyses. Tweepy was great for small problems, but we really want an interface to Twitter's API that handles gracefully the rate limiting we've experienced (sleeping automatically) and can be placed "in the cloud" to operate autonomously. 
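
To give a feel for what "sleeping automatically" means, here is a minimal sketch of the general pattern. This is not Twarc's actual code: the `RateLimited` exception, the `call_with_backoff` helper and the `flaky_fetch` stand-in are all hypothetical names invented for this illustration, and the 900-second wait is just an example value.

```python
import time

class RateLimited(Exception):
    """Hypothetical error carrying the number of seconds to wait."""
    def __init__(self, wait_seconds):
        super().__init__("rate limited, retry in %d seconds" % wait_seconds)
        self.wait_seconds = wait_seconds

def call_with_backoff(fetch, sleep=time.sleep, max_tries=5):
    """Call fetch(); if it raises RateLimited, sleep and try again."""
    for attempt in range(max_tries):
        try:
            return fetch()
        except RateLimited as err:
            sleep(err.wait_seconds)
    raise RuntimeError("still rate limited after %d tries" % max_tries)

# A stand-in for an API call that is refused twice and then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited(wait_seconds=900)
    return ["tweet 1", "tweet 2"]

naps = []   # record the requested sleeps instead of really waiting
result = call_with_backoff(flaky_fetch, sleep=naps.append)
print(result, naps)
```

Passing `naps.append` in place of `time.sleep` lets us watch the waiting logic without actually waiting; a real client would leave the default in place.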

[Twarc](https://github.com/DocNow/twarc) is both a stand-alone UNIX tool and a Python module. Today, we are going to use it as a Python module and Thursday we take it to the cloud! So, as with all Python modules, we install it using `pip` in a cell that we tell the notebook to interpret as UNIX commands. Thursday, our little venture skyward will mean we will also learn a bit about UNIX so this "cell magic" (`%%sh`) will seem less magical. 

But for now...

In [None]:
%%sh
pip install twarc

Now, let's put this to good use. 

We can now cull the last 3200 tweets from the user Tokaise using Twarc. This will give us some insight into their behavior. A big part of the Wired story was about coordination among accounts to amplify a message, so tweeting behavior is important to characterize in some way. 

In the cell below we set up our credentials from Twitter first. We then instantiate a Twarc object, just as we did a Twitter API object with Tweepy. We call the command `Twarc`, using our credentials. The resulting object has a series of methods for accessing Twitter's API. Below we are asking for the `timeline` of the user with screen name `tokaise`. Twitter treats user names as case-insensitive, so we've spelled it out in lowercase. 

The call to `timeline()` produces an "iterator" that we could step through with a `for` loop. Instead, we just use the command `list()` to turn the 3200 tweets into a list, a list of dictionaries -- one for each tweet. If we were to write these out to a file, we would have something similar to `thegatewaypundit.json` -- something we will do Thursday.
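
If iterators are new to you, here is a small sketch of the two ways of consuming one. The generator `fake_timeline` is a made-up stand-in that hands back invented tweets one at a time; it does not talk to Twitter.

```python
from itertools import islice

def fake_timeline(n):
    # stand-in for an API call: hands back one made-up tweet at a time
    for i in range(n):
        yield {"id": i, "full_text": "tweet number %d" % i}

# An iterator can be walked with a for loop, one item at a time...
ids = []
for tweet in fake_timeline(3):
    ids.append(tweet["id"])

# ...or emptied into a list all at once with list()
all_tweets = list(fake_timeline(5))

# islice() takes just the first few items without producing the rest
first_two = list(islice(fake_timeline(1000), 2))

print(ids, len(all_tweets), first_two[1]["full_text"])
```

The `islice()` trick is handy while experimenting, since it lets you look at a couple of items without waiting for everything.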

In [None]:
from twarc import Twarc

consumer_key = "Urq5NyCqyjxiGF2gLoXg7o3UZ"
consumer_secret = "KKiNtI8403O6R7MXUowWfM2mGB71eLJX2jeIMsgjGQ5SJrMaDl"
access_token = "20743-PbvM6FZjT2LoDSKTfAUpWwSwLKwPrXj25VVyIe5s3mya"
access_token_secret = "FdqcOey0FdwIhFhTyIuCJOFXwjFOX1EIDHG5vojPq3W51"

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
tweets = list(t.timeline(screen_name="tokaise"))

Check that we have a list, check its length, and have a look at the 10th entry, say (the entry with index 9).

In [None]:
type(tweets)

In [None]:
len(tweets)

In [None]:
tweets[9]

We can now loop over these tweets and pull out just the information we need, flattening the structure into a table. Below we prepare to write the data out to a CSV file which might be handy for later processing. There are a million ways to turn our tweets into a table. This is just one and not necessarily the best.

In the code below, we keep just the ID of each tweet, the time it was created, the software the person used to issue the tweet, and the text of the tweet. The tweet's `source` will be `null` in the JSON object (`None` once parsed into Python) if it is not given for some reason. We'd rather set it to "", an empty string, to be consistent with the way we will handle things later. So there's one `if-else` statement in the code in the next cell.
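
Here is a small sketch of how a missing `source` behaves once the JSON is parsed. The two tweets are invented (and real `source` values look different from the plain name used here); the helper `source_or_blank` is our own name for a compact alternative to the `if-else`.

```python
from json import loads

with_source = loads('{"id": 1, "source": "Twitter for iPhone"}')
without_source = loads('{"id": 2, "source": null}')

# JSON null arrives in Python as None
print(without_source["source"] is None)

def source_or_blank(tweet):
    # `or` falls through to "" when the value is None (or already empty);
    # .get() also covers the case where the key is absent altogether
    return tweet.get("source") or ""

print(repr(source_or_blank(with_source)), repr(source_or_blank(without_source)))
```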

In [None]:
from csv import writer

# open a CSV file for writing the tweet components; the "with" block closes
# the file when we are done, so every row is on disk before we read it back
with open("tokaise.csv","w",encoding='utf-8',newline="") as csv_file:

    tweet_file = writer(csv_file)

    # write out the header
    tweet_file.writerow(["id","created_at","tweet_source","text"])

    # now iterate over Tokaise's most recent tweets. We could use the 
    # result from calling timeline() as we have done here or we could use
    # the list we made above. 

    for tweet in t.timeline(screen_name="tokaise"):

        # for each tweet, keep some data -- for now, the id, the date it was created,
        # the tweet's "source" and the text of the tweet.

        # the "source" will be a "null" object if it is not given by the user's
        # computer or phone and with this code we say it's an empty string "" if
        # it's missing instead

        if tweet['source']:
            tweet_source = tweet['source']
        else:
            tweet_source = ""

        # store the row of data for the given tweet as a list and write it
        # out to our CSV file

        out = [tweet["id"],tweet["created_at"],tweet_source,tweet["full_text"]]
        tweet_file.writerow(out)


Now, let's read the data in using the pandas `read_csv()` function. We will call the DataFrame holding the flattened data from the last 3200 tweets from Tokaise, well, `tokaise`. 

In [None]:
tokaise = read_csv("tokaise.csv")
tokaise.head(50)

In [None]:
tokaise.shape

We should have roughly 3200 tweets from Tokaise. Now, here's a quick Pandas test. Tell me how many different software programs Tokaise used to author their tweets, and how frequently they used each.

In [None]:
# put your code here



We will now create a simple plot of the times Tokaise tweets. To do this, we need a "datetime" object to represent these timings and not just the `created_at` string. We form the object using `to_datetime()` from Pandas and pass along the "format" of the timestamp. The format is another mini-language that can express just about any date representation. [You can read about it here,](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior) under the heading of "directives". In the format below, for example, `%a` represents an abbreviated day of the week and `%S` represents seconds.
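
Before applying the format to a whole column, it can help to try it on one string. The sketch below uses Python's own `datetime.strptime()`, which understands the same directives; the timestamp itself is invented but follows the layout of Twitter's `created_at` field.

```python
from datetime import datetime

# an invented timestamp in the same layout as Twitter's created_at field
created_at = "Mon Feb 19 22:15:03 +0000 2018"
fmt = "%a %b %d %H:%M:%S +0000 %Y"

# strptime() reads a string according to the format...
stamp = datetime.strptime(created_at, fmt)
print(stamp)

# ...and strftime() goes the other way, writing a date out in any layout
print(stamp.strftime("%Y-%m-%d %H:%M"))
```

Note that the `+0000` in the format is matched as literal text, so this only works for timestamps recorded in UTC, which is how these timestamps are written.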

We will create these datetime objects from the `created_at` strings and store them in a column called `stamp` for "timestamp". These can now be used in plots and other things.

In [None]:
tokaise["stamp"] = to_datetime(tokaise["created_at"],format='%a %b %d %H:%M:%S +0000 %Y')
tokaise.head()

Now, let's make a simple plot of how Tokaise tweets by counting the number of tweets they post every 3 hours. We will use the function `Grouper()` from Pandas to group up the time objects in `stamp` in three hour chunks. We specify the frequency for grouping with an alias `3H`. This, too, is an expressive little system and you can represent a number of different groupings. [You can read about them here.](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) 

In the code below, we take our data frame holding Tokaise's tweets, group them into 3 hour periods, and for each period simply count the number of tweets that they posted. We have studied `groupby()` and `agg()` already. The last little piece of code below assigns our counts of id's the name `count`.
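
To see what grouping into three hour periods amounts to, here is a plain-Python sketch of the same idea on a handful of invented timestamps. It is only an illustration of the binning, not how pandas implements it; the helper `three_hour_bin` and the name `bin_counts` are our own.

```python
from collections import Counter
from datetime import datetime

# invented timestamps standing in for the `stamp` column
stamps = [
    datetime(2018, 2, 19, 0, 10),
    datetime(2018, 2, 19, 2, 59),
    datetime(2018, 2, 19, 3, 0),
    datetime(2018, 2, 19, 10, 30),
]

def three_hour_bin(ts):
    # round the hour down to a multiple of 3 and zero out the rest
    return ts.replace(hour=ts.hour - ts.hour % 3, minute=0, second=0, microsecond=0)

# count how many timestamps fall into each three hour period
bin_counts = Counter(three_hour_bin(ts) for ts in stamps)
for start in sorted(bin_counts):
    print(start, bin_counts[start])
```

One difference worth knowing: pandas also reports the empty periods in between (06:00 here) with a count of 0, which this sketch does not. Those zeros matter for a plot, since they show when an account is quiet.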

In [None]:
counts = tokaise.groupby(Grouper(key="stamp",freq='3H')).agg({"id":"count"}).rename(columns={"id":"count"})
counts.head()

As we have seen, `groupby()` returns a new data frame, but one that has the group labels as rownames. We can turn them from rownames into a column using the `reset_index()` command (with the `inplace` meaning that the object is updated without having to reassign the new object to its old name again -- like `x = changed x`.)

In [None]:
counts.reset_index(inplace=True)
counts.head()

And finally make a plot of the counts of tweets from Tokaise every three hours.

In [None]:
from plotly.plotly import iplot, sign_in
import plotly.graph_objs as go

sign_in("cocteautt","8YLww0QuMPVQ46meAMaq")

myplot_parts = [go.Scatter(x=counts["stamp"],y=counts["count"],mode="lines")]
mylayout = go.Layout(autosize=False, width=1000,height=500)
myfigure = go.Figure(data = myplot_parts, layout = mylayout)
iplot(myfigure,filename="crisis")

What do you notice? What do you think? 

Now, Wired mentions 10 accounts that retweet the `gatewaypundit` account, the first of which is rlyor. Use the code above to collect the tweets from rlyor, store them in a CSV file called `rlyor.csv`, read them into a DataFrame and make a plot like the one above to see what their tweeting timing is like.

In [None]:
# put your code here



Is rlyor tweeting more or less often than Tokaise? What kinds of questions would you ask to compare these two accounts?

How would you see if rlyor and Tokaise tend to retweet the same material? What would you want to add to your flattened data sets to do this? Then what kind of computation would you do?

### All users retweeting `thegatewaypundit`

We are now going to use the tweets in the file `thegatewaypundit.json`. 
If you haven't done it already, you can [download the file from Dropbox](https://www.dropbox.com/s/05p0wyy4ew0qqzq/thegatewaypundit.json.gz?dl=0), unzip it to produce `thegatewaypundit.json`, and place it in the same folder as this notebook.
 
Again, the rows of this file represent all the tweets that include a reference to the URL for The Gateway Pundit story about David Hogg. Each row in the file is a JSON string that represents a single tweet. In terms of workflow, for Tokaise's tweets, we read them in using Twarc and wrote out a flattened CSV. We could (and will do Thursday) instead store the tweets whole as rows of JSON as we did with `thegatewaypundit.json`. The difference? In the latter case, we store all the data and can re-flatten it later using different fields. Eventually we will store JSON in a special database (MongoDB) that lets us effectively keep all the information and not flatten it unnaturally. 

So we have a file, one row per tweet, each row in JSON format. We open the file, use `loads` to turn the string into a Python dictionary and then, as we did above, keep only certain fields that we're interested in. This will leave us with a file called `thegatewaypundit.csv`.
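
Here is the same workflow in miniature, as a sketch you can run without the data file. Two `StringIO` objects stand in for the input and output files, and the two tweets are invented; only the pattern (read a line, `loads` it, pick out fields, write a row) matches what we are about to do.

```python
from csv import writer
from io import StringIO
from json import loads

# stand-in for the data file: two invented tweets, one JSON string per line
data = StringIO(
    '{"id": 1, "user": {"screen_name": "alice"}}\n'
    '{"id": 2, "user": {"screen_name": "bob"}, '
    '"retweeted_status": {"id": 1, "user": {"screen_name": "alice"}}}\n'
)

out = StringIO()   # stand-in for the CSV file we would write
rows = writer(out, lineterminator="\n")
rows.writerow(["id", "screen_name", "retweeted_screen_name"])

for line in data:
    tweet = loads(line)
    # fill the retweet column with "" when the tweet is not a retweet
    if "retweeted_status" in tweet:
        retweeted = tweet["retweeted_status"]["user"]["screen_name"]
    else:
        retweeted = ""
    rows.writerow([tweet["id"], tweet["user"]["screen_name"], retweeted])

print(out.getvalue())
```

The empty final field on the first data row is what `read_csv()` will later turn into a missing value.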

In [None]:
from csv import writer

# open our data file for reading and a CSV file for writing; the "with" block
# closes both when we are done, so every row is on disk before we read it back
with open("thegatewaypundit.json") as data, \
     open("thegatewaypundit.csv","w",encoding='utf-8',newline="") as csv_file:

    # write out the header
    tweets = writer(csv_file)

    tweets.writerow(["id","screen_name","user_id","verified",
                     "text","created_at","tweet_source",
                     "retweet_count",
                     "retweeted_screen_name","retweeted_status_id",
                     "statuses_count","followers_count","friends_count"])

    # and process each tweet, each row, one at a time
    for d in data:

        tweet = loads(d)

        # to flatten the tweet, we will look for slots that are present (various screen_names)
        # and when they are not present, fill the table with a "" (which will end up being a
        # missing value)

        # if this is a retweet, what tweet are they retweeting and who created it?
        if "retweeted_status" in tweet:
            id_of_original_tweet = tweet["retweeted_status"]["id"]
            originator_of_original_tweet = tweet["retweeted_status"]["user"]["screen_name"]
        else:
            id_of_original_tweet = ""
            originator_of_original_tweet = ""

        # the platform the tweet came from - tweetdeck? twitter for ios?
        if tweet['source']:
            tweet_source = tweet['source']
        else:
            tweet_source = ""

        out = [tweet["id"],tweet["user"]["screen_name"],tweet["user"]["id"],tweet["user"]["verified"],
               tweet["text"],tweet["created_at"],tweet_source,
               tweet["retweet_count"],
               originator_of_original_tweet,id_of_original_tweet,
               tweet["user"]["statuses_count"],tweet["user"]["followers_count"],tweet["user"]["friends_count"]]

        tweets.writerow(out)

Having flattened the tweets into a CSV, let's read it back in as a DataFrame and have a look.

In [None]:
df = read_csv("thegatewaypundit.csv")
df.shape

So we have 19,310 tweets that include a link to The Gateway Pundit story since it first posted on 2/19. Let's have a look at the first few -- remember the tweets are sorted in the file from oldest to newest.

In [None]:
df.head()

We see Tokaise beating `thegatewaypundit` itself in tweeting about the story. And then we see a series of other accounts, many retweeting the `thegatewaypundit` Twitter account. We can create some simple summaries of the different columns. Who created the greatest number of tweets containing the link to The Gateway Pundit report?

In [None]:
df["screen_name"].value_counts().head(10)

Which accounts were retweeted most often?

In [None]:
df["retweeted_screen_name"].value_counts().head(20)

What programs did people use when creating these tweets?

In [None]:
df["tweet_source"].value_counts().head(10)

And here are the first few total status counts for the accounts tweeting about the Gateway Pundit story. They're pretty big numbers! 

In [None]:
df["statuses_count"].head()

Wired reports that 200 of the first 660 tweets and retweets about this report came from accounts that had each produced more than 45,000 statuses over their lifetimes. This is a huge number, which suggests something strange about these accounts. We can come close to that figure by counting how many of the first 660 tweets in our file come from such accounts. The number we get is a little different because some time has passed since the Wired story and some accounts have since moved past the 45,000 mark. 

In [None]:
sum(df["statuses_count"][:660]>45000)

Now, let's use `Grouper()` again and come up with a timeline of the activity around this report. 
Again, first we create a datetime object...

In [None]:
df["stamp"] = to_datetime(df["created_at"],format='%a %b %d %H:%M:%S +0000 %Y')
df.head()

And then use `Grouper()` to form, this time, 30 minute counts of tweets. We then make a plot to see how this story progressed on Twitter.

In [None]:
counts = df.groupby(Grouper(key="stamp",freq='30min')).agg({"id":"count"}).rename(columns={"id":"count"})
counts.reset_index(inplace=True)

counts.head()

In [None]:
from plotly.plotly import iplot, sign_in
import plotly.graph_objs as go

sign_in("cocteautt","8YLww0QuMPVQ46meAMaq")

myplot_parts = [go.Scatter(x=counts["stamp"],y=counts["count"],mode="lines")]
mylayout = go.Layout(autosize=False, width=1000,height=500)
myfigure = go.Figure(data = myplot_parts, layout = mylayout)
iplot(myfigure,filename="crisis")

**Networks**

The Wired article alludes to network effects that act in tandem to promote the Gateway Pundit Story. We can get a sense of who is retweeting whom by creating a network graph. We know accounts like `@gatewaypundit` and `@lucianwintrich` (the reporter on the story) had high retweet counts, but did the same people retweet them? 

Let's start by focusing on just the period leading to the peak, so before midnight on 2/20. (These times, remember, are UTC and are 5 hours ahead of NYC, say.) Here we keep just the portion of our data frame where the time is before midnight on 2/20. We see that leaves 1948 tweets.
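
As a quick check on the time zone arithmetic, the sketch below converts the UTC cutoff into New York time. It assumes a fixed offset of five hours, which is right for February (standard time) but would be off by an hour in the summer; the name `new_york_winter` is our own.

```python
from datetime import datetime, timedelta, timezone

# the cutoff we use below, marked explicitly as UTC
utc_cutoff = datetime(2018, 2, 20, 0, 0, tzinfo=timezone.utc)

# assumption: a fixed UTC-5 offset, which is New York's standard (winter) time
new_york_winter = timezone(timedelta(hours=-5), "EST")

local_cutoff = utc_cutoff.astimezone(new_york_winter)
print(local_cutoff.strftime("%Y-%m-%d %H:%M %Z"))
```

So midnight UTC on 2/20 is 7pm on 2/19 for someone in New York.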

In [None]:
lead = df[df["stamp"] <= "2018-02-20 00:00:00"]
lead.shape

We have called the reduced data set `lead`. We now keep just those tweets in `lead` that are retweets. We do that by keeping those that have a non-null (not an NaN) value under `retweeted_screen_name`. We can use the logical function `isnull()` to determine which have empty `retweeted_screen_name` fields and then use the tilde to make it "not" (flipping True and False). So the first line below subsets `lead`, keeping only those rows where the `retweeted_screen_name` exists. 

The result is named `retweets`. We then do another `groupby()`, this time forming groups based on the `screen_name` doing the retweeting and the `retweeted_screen_name` they are retweeting. Like before, we summarize these groups with their `count()` but this time we call the resulting column `Weight`. Resetting the index as before we have a simple data frame of counts. 
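
The counting step is easier to see on a toy example. Here is a plain-Python sketch that counts invented (retweeter, retweeted) pairs with a `Counter`; the screen names and the name `pair_weights` are made up, and the resulting counts play the role of the `Weight` column.

```python
from collections import Counter

# invented (screen_name, retweeted_screen_name) pairs, one per retweet
retweet_rows = [
    ("bob", "alice"),
    ("bob", "alice"),
    ("carol", "alice"),
    ("carol", "dave"),
]

# each distinct pair becomes one group; its count is the edge weight
pair_weights = Counter(retweet_rows)

for (retweeter, retweeted), weight in sorted(pair_weights.items()):
    print(retweeter, "->", retweeted, weight)
```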

In [None]:
retweets = lead[~lead["retweeted_screen_name"].isnull()]
retweet_pairs = retweets[["id","screen_name","retweeted_screen_name"]].groupby(["screen_name","retweeted_screen_name"]).agg({"id":"count"}).rename(columns={"id":"Weight"})
retweet_pairs.reset_index(inplace=True)

retweet_pairs.head()

We are going to load this data set into [Graph Commons](https://graphcommons.com/), which requires a CSV with columns

>`FromType, FromName, Edge, ToType, ToName, Weight`

where the From's refer to the person retweeting and the To's refer to the person being retweeted. The lines below create three new columns, two made up of the same entry repeated, "Person", and one made up of "Retweeted". The `Edge` column is meant to specify the relationship between the From's and To's. 

The code below renames columns to adhere to this form, using `inplace` to make the change without having to reassign the result. 
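
To picture the file we are aiming for, here is a sketch that writes an edge list in this column order using only the `csv` module. The pair counts are invented and a `StringIO` object stands in for the file, so nothing is written to disk.

```python
from csv import writer
from io import StringIO

# invented pair counts: (retweeter, retweeted) -> number of retweets
pair_weights = {("bob", "alice"): 2, ("carol", "alice"): 1}

out = StringIO()   # stand-in for the CSV file
edges = writer(out, lineterminator="\n")
edges.writerow(["FromType", "FromName", "Edge", "ToType", "ToName", "Weight"])

# one row per edge: who retweeted whom, and how often
for (retweeter, retweeted), weight in pair_weights.items():
    edges.writerow(["Person", retweeter, "Retweeted", "Person", retweeted, weight])

print(out.getvalue())
```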

In [None]:
retweet_pairs["FromType"] = "Person"
retweet_pairs["ToType"] = "Person"
retweet_pairs["Edge"] = "Retweeted"
retweet_pairs.rename(columns={"screen_name":"FromName","retweeted_screen_name":"ToName"},inplace=True)
retweet_pairs.head(10)

Finally, we put the columns in the order Graph Commons wants them and we export to a CSV file called `For_Graph_Commons.csv`.

In [None]:
retweet_pairs[["FromType","FromName","Edge","ToType","ToName","Weight"]].to_csv("For_Graph_Commons.csv",index=False)

We can now navigate to Graph Commons, create a new graph and "import" data. You will see the option to import an edge file and you should select the `For_Graph_Commons.csv` that we created. Clicking through as you go, you will eventually end up with a display like this.

We see clearly the accounts with large numbers of retweets (encoded as the size of the dots and of their name). We also see clusters. 

<img src=https://github.com/computationaljournalism/columbia2018/raw/master/images/gwp.png>

**A bit more with Twarc**

We could now reasonably ask, among the first few accounts that tweeted about the Gateway Pundit story, how many of their followers went on to tweet about it. Below, we pull the followers for `@rlyor`, `@ahernandez85b` and `@mandersonhare1`. So far we have seen the `timeline()` method from a Twarc object and now we see the `follower_ids()` method. Here we pass a user id or a screen name and we will get back a list of the account's followers (well, their ID's). Here we put all of the followers into lists.

Note that Twitter returns follower lists so that the first elements are the most recent followers of an account and the last are the oldest.

In [None]:
ah = list(t.follower_ids(user="ahernandez85b"))
rl = list(t.follower_ids(user="rlyor"))
ma = list(t.follower_ids(user="mandersonhare1"))

print(len(ah),len(rl),len(ma))

So we see a relatively low number of followers. How many of `rlyor`'s followers went on to tweet about the story? Well, we can use the `user_id` column of our data frame of tweets to see which rows have ID's that are among rlyor's followers, say. That will give us a column of True's and False's that we can add up to tell us how many True's we have.
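
Here is the membership test just described in plain Python, on invented IDs. It also flags two things to watch for when interpreting the total: adding up the True's counts tweets rather than distinct people, and IDs stored as strings will never match IDs stored as integers.

```python
# invented IDs: the user_id of each tweet's author, in file order
tweet_user_ids = [101, 202, 303, 202, 404]

# invented follower IDs for one account
follower_ids = [202, 404, 505]

followers = set(follower_ids)            # sets make membership tests fast
flags = [uid in followers for uid in tweet_user_ids]
print(flags, sum(flags))                 # True counts as 1 when summed

# the sum counts tweets; to count distinct followers who tweeted, intersect sets
distinct = set(tweet_user_ids) & followers
print(len(distinct))

# a caution: "202" (a string) and 202 (an integer) are different values,
# so mixed types would quietly match nothing
as_strings = set(str(i) for i in follower_ids)
print(sum(uid in as_strings for uid in tweet_user_ids))
```

If a count comes back as zero when you expected matches, checking the type of the IDs on each side is a good first step.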

In [None]:
sum(df["user_id"].isin(rl))

So 200 or so out of 1,300 or so. What do you think of that number? 

One last Twarc feature. We can also take a list of screen names or user ID's and create a list of their associated user JSON data. Here we look at `@ahernandez85b`'s most recent 10 followers' descriptions.

In [None]:
ah_followers = list(t.user_lookup(user_ids=ah[:10]))

for a in ah_followers:
    print(a["description"])
    print("---"*10)

### A brief look ahead

We want to end up back with last lecture's treatment of #crisisactors. This time, we used a "premium" search feature from Twitter to pull all the tweets about this hashtag as well as just the two words "crisis actors". You can download the file from Dropbox, unzip it and put it in the same folder as this notebook. As we did with the Gateway Pundit, let's read in the data and "flatten" it out to a file. This time we'll call it `crisisactors.csv`. 

In [None]:
from csv import writer
from json import loads

# open our data file for reading and a CSV file for writing; the "with" block
# closes both when we are done, so every row is on disk before we read it back
with open("crisisactors.json") as data, \
     open("crisisactors.csv","w",encoding='utf-8',newline="") as csv_file:

    # write out the header
    tweets = writer(csv_file)

    tweets.writerow(["id","screen_name","user_id","verified",
                     "text","created_at","tweet_source",
                     "retweet_count",
                     "retweeted_screen_name","retweeted_status_id",
                     "statuses_count","followers_count","friends_count"])

    # and process each tweet, each row, one at a time
    for d in data:

        tweet = loads(d)

        # to flatten the tweet, we will look for slots that are present (various screen_names)
        # and when they are not present, fill the table with a "" (which will end up being a
        # missing value)

        # if this is a retweet, what tweet are they retweeting and who created it?
        if "retweeted_status" in tweet:
            id_of_original_tweet = tweet["retweeted_status"]["id"]
            originator_of_original_tweet = tweet["retweeted_status"]["user"]["screen_name"]
        else:
            id_of_original_tweet = ""
            originator_of_original_tweet = ""

        # the platform the tweet came from - tweetdeck? twitter for ios?
        if tweet['source']:
            tweet_source = tweet['source']
        else:
            tweet_source = ""

        out = [tweet["id"],tweet["user"]["screen_name"],tweet["user"]["id"],tweet["user"]["verified"],
               tweet["text"],tweet["created_at"],tweet_source,
               tweet["retweet_count"],
               originator_of_original_tweet,id_of_original_tweet,
               tweet["user"]["statuses_count"],tweet["user"]["followers_count"],tweet["user"]["friends_count"]]

        tweets.writerow(out)

Now, in one cell, we will read the data back in and make a plot...

In [None]:
from pandas import to_datetime, Grouper, set_option, read_csv

from plotly.plotly import iplot, sign_in
import plotly.graph_objs as go

# read in the flattened data
df = read_csv("crisisactors.csv")

# create a timestamp
df["stamp"] = to_datetime(df["created_at"],format='%a %b %d %H:%M:%S +0000 %Y')

# group the tweets into 30 minute intervals using this new timestamp
counts = df.groupby(Grouper(key="stamp",freq='30min')).agg({"id":"count"}).rename(columns={"id":"count"})
counts.reset_index(inplace=True)

# and create a plotly plot of the resulting structure
sign_in("cocteautt","8YLww0QuMPVQ46meAMaq")

myplot_parts = [go.Scatter(x=counts["stamp"],y=counts["count"],mode="lines")]
mylayout = go.Layout(autosize=False, width=1000,height=500)
myfigure = go.Figure(data = myplot_parts, layout = mylayout)
iplot(myfigure,filename="crisis")

Recall the times the hashtag trended. These correspond to the largest peak in the plot. 

<pre>
2018-02-21 20:30:22
2018-02-21 20:45:24
2018-02-21 21:00:26
2018-02-21 21:15:23
2018-02-21 21:30:22
2018-02-21 21:45:20
2018-02-21 22:01:09
2018-02-21 22:15:53
2018-02-21 22:31:07
2018-02-21 22:46:07
2018-02-21 23:01:05
2018-02-21 23:15:23
2018-02-21 23:30:21
2018-02-21 23:45:22
2018-02-22 00:00:23
2018-02-22 00:15:21
2018-02-22 00:30:21
2018-02-22 00:45:21
2018-02-22 01:00:23
2018-02-22 01:15:23
2018-02-22 01:30:21
2018-02-22 01:45:22
2018-02-22 02:00:22
2018-02-22 02:15:21
</pre>

Using a more complete search function, we see that it takes 3,000 or so tweets in 30 minutes, or 100 a minute to get something to trend. Well, we should be careful as we don't know if Twitter adds other tweets to the same cluster and represents it as #crisisactors.

What questions should we be asking about the time this trended? 