## Part 2. The madding crowd

<img src='https://media1.s-nbcnews.com/j/newscms/2019_16/2761146/190221-mueller-report-live-blog-main-kh_7d773c19c99bc6c4c7ebafe761809094.fit-760w.jpg'>

Now, let's have a look at Twitter and how the Mueller Report spread across the network. We collected tweets from just before 11am to around 11:20am ET, 83,478 of them in all. This drill will help you load the tweets into a pandas DataFrame and start to look at what the crowd is talking about as the report drops.

To refresh your memory on the timeline of events for April 18:
- [Barr](https://en.wikipedia.org/wiki/William_Barr) holds a [press conference](https://www.nytimes.com/2019/04/18/us/politics/barr-conference-transcript.html) at 9:30 am ET to discuss the report.
- Around 11am, the Mueller Report is made available for download ([pdf](https://www.justice.gov/storage/report.pdf))

Before Monday's class, run through the following notebook which will help you load the Tweets into a DataFrame. Have a look at the Tweets and come to class on Monday sharing something you found. Some ideas of what you might look at:
- what is the discussion / what are they saying about the report?
- who do the tweets mention?
- what's the first mention of the report dropping? when was it?
- who is tweeting during this time window?
- what media outlets are covering the report? how long after the report drops are news outlets publishing stories about it?
- any bot activity?

Remember to use some of the new skills we learned on Wednesday (e.g. spacy) to look through the tweets. You might start by using spacy to look at the entities in the tweets: which people, places, and organizations are being mentioned. Take a look...have fun!
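
For the "who do the tweets mention?" question, a plain regular expression may be enough for a first pass before reaching for spacy. The sketch below is only an illustration: the sample texts are made up, and `count_mentions` is a hypothetical helper name, not something from class.

```python
import re
from collections import Counter

# hypothetical sample tweets, made up for illustration (not from the data file)
sample_texts = [
    "RT @nytimes: The Mueller report is out",
    "Reading the report now. @nytimes @washingtonpost",
    "No mentions in this one",
]

# a Twitter handle is '@' followed by up to 15 letters, digits, or underscores
MENTION_PATTERN = re.compile(r"@(\w{1,15})")

def count_mentions(texts):
    """Count how often each @handle appears across an iterable of tweet texts."""
    counts = Counter()
    for text in texts:
        # lowercase so @NYTimes and @nytimes are counted together
        counts.update(handle.lower() for handle in MENTION_PATTERN.findall(text))
    return counts

mention_counts = count_mentions(sample_texts)
print(mention_counts.most_common(5))
```

Once the tweets are in a DataFrame, the same helper could be called on its `text` column, e.g. `count_mentions(tweets_df['text'])`.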

### Tweet Data
To get started, [download the tweets from Dropbox](https://www.dropbox.com/s/1u53sw3v74ra2u8/mueller_tweets.jsonl.gz?dl=0) and put the file in the same folder as your notebook.

The file is about 67 MB (it's compressed) and contains 83,478 tweets. The file was compressed using a program called [gzip](https://en.wikipedia.org/wiki/Gzip). Uncompressed, it's about 500-600 MB. Let's leave it compressed so we don't take up half a gig on your laptop's hard drive!

The file itself is in a format called "json lines" - the file extension being `.jsonl`. Each line of the file contains a single tweet (as a json string) followed by a newline (`\n`).

We can use the command line to peek at the first few lines of the file. We'll run the `gunzip` command to uncompress the file, but with the `-c` option, which writes the uncompressed output to the screen and leaves the file on the hard drive compressed. So, let's look at the first line of the file with some of our old UNIX commands:

In [None]:
!gunzip -c mueller_tweets.jsonl.gz | head -n 1

This should have printed out a single tweet. If this didn't work for you (prob on a Windows laptop?), then don't worry about it. Let's move on!
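
If the shell command isn't available, the same peek can be done in Python with the standard library. This is a minimal sketch, assuming the file is UTF-8 encoded; `peek_jsonl_gz` is a hypothetical helper name made up for this example.

```python
import gzip
import json
from itertools import islice

def peek_jsonl_gz(path, n=1):
    """Return the first n records of a gzip-compressed JSON lines file."""
    # mode 'rt' decompresses on the fly and yields text lines;
    # the file on disk stays compressed
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        # islice stops after n lines, so the rest of the file is never read
        return [json.loads(line) for line in islice(f, n)]
```

For example, `peek_jsonl_gz('mueller_tweets.jsonl.gz')[0].keys()` would list the top-level fields of the first tweet.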

The following cell will open up the tweets data file and extract some of the fields from each tweet that we want to use later on in our analysis. I'm pulling out a few fields like the tweet text, retweet/fave counts and a bit of info about the user. Feel free to edit it to pull other info you might want.

In [None]:
import gzip
import json

# we'll save a subset of each tweet in a list that we'll then load into a DataFrame later
tweets = []

# gzip (compressed) file of 83478 tweets
# for the format of the file: each line is a tweet (as a JSON string) followed by a newline ('\n')
# the .jsonl means "json lines"
mueller_tweets_file = 'mueller_tweets.jsonl.gz'

# open the gzip file
with gzip.open(mueller_tweets_file) as tweets_file:
    
    # loop through the file, line-by-line
    for line in tweets_file:
        
        # load the tweet into a dictionary
        tweet = json.loads(line)

        # let's save a few things from each tweet:
        # "created at" time
        # the tweet id "string"
        # retweet and fave counts
        # the tweet text itself
        # the user's screen name
        # the user's follower + following counts
        # how many tweets the user tweeted in the past
        
        user = tweet['user']
        
        # save a list of the tweet and user info we want for analysis
        tweets.append(
            [
                tweet['created_at'],
                tweet['id_str'],
                tweet['favorite_count'],
                tweet['retweet_count'],
                tweet['text'],
                user['screen_name'], 
                user['followers_count'],
                user['friends_count'],
                user['statuses_count'],
            ]
        )
        
print('done loading the tweets from our file')

In [None]:
# how many tweets do we have?
len(tweets)

In [None]:
# tweets is a list of lists. let's print out the first few rows and take a look
tweets[:3]

In [None]:
# pandas has not been imported in this notebook yet, so do that first
import pandas as pd

# load the tweets into a DataFrame
columns = ['created_at', 'id', 'favorite_count', 'retweet_count', 'text', 'screen_name', 'followers_count', 'friends_count', 'statuses_count']

tweets_df = pd.DataFrame(tweets, columns=columns)

In [None]:
tweets_df.head()

In [None]:
# as we saw in class on Wednesday, we convert the created_at column (which is a string) into a proper datetime object
from pandas import to_datetime

# convert the date as a string into a datetime object and store it in a new column named "time"
tweets_df['time'] = to_datetime(tweets_df['created_at'].astype(str), format='%a %b %d %H:%M:%S +0000 %Y')

In [None]:
tweets_df.head()
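
The parsed times are in GMT/UTC, while the timeline at the top of this notebook is in US Eastern time. One way to line the two up is sketched below on two made-up timestamps in the same format as `created_at`; the sample values and the choice of the `America/New_York` zone name are assumptions for illustration.

```python
import pandas as pd

# two made-up timestamps in the same format as the 'created_at' field
sample = pd.Series(['Thu Apr 18 14:56:03 +0000 2019', 'Thu Apr 18 15:20:41 +0000 2019'])

# parse the strings; the '+0000' is matched literally, so the result has no timezone attached
parsed = pd.to_datetime(sample, format='%a %b %d %H:%M:%S +0000 %Y')

# mark the times as UTC, then convert to US Eastern time
eastern = parsed.dt.tz_localize('UTC').dt.tz_convert('America/New_York')

print(eastern.dt.strftime('%H:%M:%S').tolist())
# → ['10:56:03', '11:20:41']
```

The same two calls could be applied to `tweets_df['time']` if you prefer to read the charts in Eastern time.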

Let's plot the tweets per minute and see if it looks like the tweet per minute chart we created in part 1 (in class on Wednesday). To do this, we need to group our new 'time' column and create a count of tweets which occur every minute. We do that this way:

In [None]:
# now, let's group the tweets by minute
counts = tweets_df.groupby(pd.Grouper(key='time', freq='60s')).agg({'id': 'count'}).rename(columns={'id': 'count'})
counts.reset_index(inplace=True)

# counts is now a DataFrame with time (each minute from 14:56 GMT to 15:21 GMT) and the count of tweets per minute
counts.head()
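
To see what the grouping does, here is the same pattern on a toy DataFrame with three made-up timestamps. It is only a sketch for illustration; note that a minute with no tweets still gets a row, with a count of 0.

```python
import pandas as pd

# made-up timestamps: two tweets in 14:56, none in 14:57, one in 14:58
toy = pd.DataFrame({
    'id': ['1', '2', '3'],
    'time': pd.to_datetime(['2019-04-18 14:56:05', '2019-04-18 14:56:40', '2019-04-18 14:58:10']),
})

# one bin per 60 seconds, counting the ids that fall in each bin
toy_counts = (
    toy.groupby(pd.Grouper(key='time', freq='60s'))
       .agg({'id': 'count'})
       .rename(columns={'id': 'count'})
       .reset_index()
)

print(toy_counts['count'].tolist())
# → [2, 0, 1]
```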

Now, let's abuse Mark's plotly account again and do a quick plot of time (by each minute) on the X-axis and the count of tweets per minute on the Y-axis:

In [None]:
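# note: in plotly 4 and later, these functions live in the separate chart_studio package (chart_studio.plotly)
from plotly.plotly import iplot, sign_in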
from plotly.graph_objs import Scatter, Figure

# sign into the service (get your own credentials!)
sign_in("cocteautt","8YLww0QuMPVQ46meAMaq")

# create a plot of a single line tracking tweets over time
myplot_parts = [Scatter(x=counts["time"],y=counts["count"],mode="lines")]

# make a figure from this line plot...
myfigure = Figure(data=myplot_parts)

# ... and plot it (the filename is a convention plotly needs in case you want to use it later)
iplot(myfigure,filename="tweets_per_min")

Does this look like the plot we did at the end of class on Wednesday? Even though the counts of tweets per minute won't exactly match the data we looked at on Wednesday, they are directionally aligned. (The counts not matching up exactly is [documented on Twitter's API site](https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search.html), but don't worry about it for now.)

**Now it's your turn!** Remember, the report has just dropped and you're tasked with looking at the public and media coverage and reaction. What's being said? Who is Twitter talking about? What media outlets are starting to cover it?
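
As one possible starting point for "who is tweeting?", the sketch below counts tweets per account and the share of retweets. It runs on a made-up stand-in for `tweets_df` (the names and texts are invented), and treating text that starts with `RT @` as a retweet is a simplifying assumption. An account posting many times in a short window is a hint worth a closer look, not proof of a bot.

```python
import pandas as pd

# a made-up stand-in for tweets_df with just the columns used here
toy_df = pd.DataFrame({
    'screen_name': ['alice', 'bob', 'alice', 'carol', 'alice'],
    'text': [
        'RT @nytimes: report is out',
        'reading it now',
        'RT @cnn: live updates',
        'RT @nytimes: report is out',
        'page 1...',
    ],
})

# who is tweeting the most in the window?
tweets_per_user = toy_df['screen_name'].value_counts()

# assumption: classic retweets start with 'RT @' in the text field
is_retweet = toy_df['text'].str.startswith('RT @')
retweet_share = is_retweet.mean()

print(tweets_per_user.head())
print(f'{retweet_share:.0%} of the toy tweets are retweets')
```

Swapping `toy_df` for `tweets_df` should give the same two summaries for the real data.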

Take a look over the weekend. Make sure you run through the notebook and we'll continue from here on Monday.