## Parts 2 and 3: The madding crowd and news coverage of the report

<img src='https://media1.s-nbcnews.com/j/newscms/2019_16/2761146/190221-mueller-report-live-blog-main-kh_7d773c19c99bc6c4c7ebafe761809094.fit-760w.jpg'>

Today we will continue our drill from last Wednesday, when we looked at the Mueller Report. We are going to move from the report itself to 1) the public reaction on Twitter as the report dropped, and 2) news coverage of the report.

Over the weekend, we had you look at the tweets we pulled from April 18th (84K tweets from around 11-11:20am) and how the report spread across the network. If you didn't have a chance to look at the tweets, you will get to now :-)

## Part 2: The Crowd

Let's start by looking at how Twitter reacted to the report as it was dropped around 11am on April 18th.

To refresh your memory a bit, we collected around 84,000 tweets from just before 11am to just around 11:20am. The report was made public shortly after 11am on the 18th.

### Tweet Data
To get started, [download the tweets from Dropbox](https://www.dropbox.com/s/1u53sw3v74ra2u8/mueller_tweets.jsonl.gz?dl=0) and put the file in the same folder as your notebook.

The file is about 67 MB (it's compressed) and contains 83,478 tweets. The file was compressed using a program called [gzip](https://en.wikipedia.org/wiki/Gzip). Uncompressed it's about 500-600 MB. Let's leave it compressed so we don't take up half a gigabyte on your laptop hard drive!

The file itself is in a format called "json lines" - the file extension being `.jsonl`. Each line of the file contains a single tweet (as a json string) followed by a newline (`\n`).
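
To make the format concrete, here is a small sketch that builds a two-line "json lines" file in memory, gzips it, and reads it back one tweet per line. The two tweets and the names `sample_lines` and `sample_tweets` are made up for illustration; the real file has many more fields per tweet.

```python
import gzip
import io
import json

# two made-up tweets, written the way a .jsonl file stores them: one JSON object per line
sample_lines = [
    json.dumps({"id_str": "1", "text": "The report is out"}),
    json.dumps({"id_str": "2", "text": "Reading it now"}),
]
raw = ("\n".join(sample_lines) + "\n").encode("utf-8")

# compress the bytes in memory, standing in for a .jsonl.gz file on disk
compressed = gzip.compress(raw)

# gzip.open accepts a file object as well as a filename; mode "rt" yields text lines
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as sample_file:
    sample_tweets = [json.loads(line) for line in sample_file]

print(len(sample_tweets), sample_tweets[0]["text"])
```

Because every line is a complete JSON document, you can stop reading at any point and still have valid tweets, which is handy for a file this size.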

We can use the command line to peek at the first few lines of the file. We'll run the `gunzip` command to uncompress the file, with the `-c` option so that the uncompressed data is written to the command's output while the file on the hard drive stays compressed. So, let's look at the first line of the file with some of our old UNIX commands:

In [None]:
!gunzip -c mueller_tweets.jsonl.gz | head -n 1

This should have printed out a single tweet. If this didn't work for you (prob on a Windows laptop?), then don't worry about it. Let's move on!
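
If the shell command isn't available to you, a rough Python equivalent is sketched below. The helper name `peek_first_tweet` is ours (it is not part of any library), and the sketch assumes the file sits next to the notebook under the name used above.

```python
import gzip
import json
import os

def peek_first_tweet(path):
    """Return the first tweet in a gzipped JSON-lines file as a dictionary."""
    with gzip.open(path, mode="rt", encoding="utf-8") as tweets_file:
        first_line = tweets_file.readline()
    return json.loads(first_line)

# only run against the real file if it has been downloaded next to the notebook
if os.path.exists("mueller_tweets.jsonl.gz"):
    first_tweet = peek_first_tweet("mueller_tweets.jsonl.gz")
    # print just the field names so the output stays short
    print(sorted(first_tweet.keys()))
```

Printing the keys rather than the whole tweet is a quick way to see which fields are available before deciding what to extract.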

The following code will open up the tweets data file and extract some of the fields from each tweet that we want to use later on in our analysis. I'm pulling out a few fields like the tweet, retweet/fave counts and a bit of info about the user. Feel free to edit to pull other info you might want.

In [None]:
import gzip
import json

# we'll save a subset of each tweet in a list that we'll then load into a DataFrame later
tweets = []

# gzip (compressed) file of 83478 tweets
# for the format of the file: each line is a tweet (as a JSON string) followed by a newline ('\n')
# the .jsonl means "json lines"
mueller_tweets_file = 'mueller_tweets.jsonl.gz'

# open the gzip file
with gzip.open(mueller_tweets_file) as tweets_file:
    
    # loop through the file, line-by-line
    for line in tweets_file:
        
        # load the tweet in to a dictionary
        tweet = json.loads(line)

        # lets save a few things from each tweet:
        # "created at" time
        # the tweet id "string"
        # retweet and fave counts
        # the tweet text itself
        # the user's screen name
        # the user's follower + following counts
        # how many tweets the user tweeted in the past
        
        user = tweet['user']
        
        # save a list of the tweet and user info we want for analysis
        tweets.append(
            [
                tweet['created_at'],
                tweet['id_str'],
                tweet['favorite_count'],
                tweet['retweet_count'],
                tweet['text'],
                user['screen_name'], 
                user['followers_count'],
                user['friends_count'],
                user['statuses_count'],
            ]
        )
        
print('done loading the tweets from our file')

In [None]:
# how many tweets do we have?
len(tweets)

In [None]:
# tweets is a list of lists. let's print out the first few rows and take a look
tweets[:3]

In [None]:
# load the tweets in to a DataFrame
import pandas as pd

columns = ['created_at', 'id', 'favorite_count', 'retweet_count', 'text', 'screen_name', 'followers_count', 'friends_count', 'statuses_count']

tweets_df = pd.DataFrame(tweets, columns=columns)

In [None]:
tweets_df.head()

In [None]:
# as we saw in class on Wednesday, we convert the created_at column (which is a string) into a proper datetime object
# convert the date as a string into a datetime object and store it in a new column named "time"
tweets_df['time'] = pd.to_datetime(tweets_df['created_at'].astype(str), format='%a %b %d %H:%M:%S +0000 %Y')

In [None]:
tweets_df.head()

Let's plot the tweets per minute and see if it looks like the tweet per minute chart we created in part 1 (in class on Wednesday). To do this, we need to group our new 'time' column and create a count of tweets which occur every minute. We do that this way:

In [None]:
# now, let's group and count the tweets by minute
counts_df = tweets_df.groupby([pd.Grouper(key='time', freq='60s')]).size().reset_index(name='count')

# counts_df is now a DataFrame with time (each minute from 14:56 GMT to 15:21 GMT) and the count of tweets per minute
counts_df

Now, let's abuse Mark's plotly account again and do a quick plot of time (by each minute) on the X-axis and the count of tweets per minute on the Y-axis:

In [None]:
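# note: in plotly 4 and later this module lives in the separate chart_studio package
# (there the import would be: from chart_studio.plotly import iplot, sign_in)
from plotly.plotly import iplot, sign_in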
from plotly.graph_objs import Scatter, Figure

# sign into the service (get your own credentials!)
sign_in("cocteautt","8YLww0QuMPVQ46meAMaq")

# create a plot of a single line tracking tweets over time
myplot_parts = [Scatter(x=counts_df["time"],y=counts_df["count"],mode="lines")]

# make a figure from this line plot...
myfigure = Figure(data=myplot_parts)

# ... and plot it (the filename is a convention plotly needs in case you want to use it later)
iplot(myfigure,filename="tweets_per_min")

Does this look like the plot we did at the end of class on Wednesday? Even though the count of tweets per minute won't exactly match the data we looked at on Wednesday, it is directionally aligned. (The counts not matching up exactly is [documented on Twitter's API site](https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search.html), but don't worry about it for now.)

**OK, now it's your turn!** Take 5 minutes to look at the data and tell us a little about who is tweeting.
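
If you want a starting point, here is one possible sketch. It uses a tiny made-up DataFrame (`sample_df`) with the same column names as `tweets_df`, so the numbers mean nothing; swap in `tweets_df` to run it on the real data.

```python
import pandas as pd

# made-up rows with the same column names as tweets_df
sample_df = pd.DataFrame({
    "screen_name": ["alice", "bob", "alice", "carol"],
    "followers_count": [120, 45000, 120, 8],
    "statuses_count": [3000, 150000, 3000, 12],
})

# who tweeted most often in this window?
tweets_per_user = sample_df["screen_name"].value_counts()

# one row per account, largest audiences first
accounts = (
    sample_df.drop_duplicates(subset="screen_name")
    .sort_values("followers_count", ascending=False)
)

print(tweets_per_user.head())
print(accounts[["screen_name", "followers_count"]])
```

Accounts with very few followers and very high tweet counts (or the reverse) are often worth a closer look.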

In [None]:
# Your Code



Next up...look through the data to see what's being mentioned in the tweets. What is the conversation about? Who is being talked about? 

Remember last week we used the [spacy](https://spacy.io/) library to extract entities (people, locations, organizations, etc.)?
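
Before reaching for spacy, a cheaper first pass is to count the @mentions and #hashtags in the tweet text. To be clear, this is a plain regular-expression tally and not entity extraction, and the sample tweets below are invented; on the real data you would loop over `tweets_df['text']` instead of `sample_texts`.

```python
import re
from collections import Counter

# made-up tweet texts standing in for tweets_df['text']
sample_texts = [
    "RT @newsdesk: The #MuellerReport is now public",
    "Reading the #muellerreport with @newsdesk and @lawprof",
    "No tags here",
]

# \w+ matches the letters, digits and underscores allowed in handles and tags
mention_pattern = re.compile(r"@(\w+)")
hashtag_pattern = re.compile(r"#(\w+)")

mentions = Counter()
hashtags = Counter()
for text in sample_texts:
    # lowercase so that #MuellerReport and #muellerreport count as the same tag
    mentions.update(name.lower() for name in mention_pattern.findall(text))
    hashtags.update(tag.lower() for tag in hashtag_pattern.findall(text))

print(mentions.most_common(3))
print(hashtags.most_common(3))
```

This tells you which accounts and tags dominate the conversation; spacy is still the better tool for names that appear without an @ or #.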

In [None]:
# Your Code



Last, take 20-30 minutes and run with one of the next few prompts. Or, come up with another question or angle you'd like to explore in the data.

- what is the discussion / what are they saying about the report?
- what's the first mention of the report dropping? when was it?
- what media outlets are covering the report? how long after the report drops are news outlets publishing stories about it?
- any bot activity? what message(s) are they pushing?

**NOTE** for some of these, you may need to go back to the code where we read the tweets data file - there is a lot more info in each tweet that you may need for some of this analysis (e.g. parsed URLs from each tweet).
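
As a sketch of what that could look like for links: in the standard Twitter API payload, each tweet carries an `entities` dictionary whose `urls` list holds the shortened and expanded versions of every link. We are assuming the tweets in our file follow that layout; the sample tweet and the helper `linked_domains` below are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# a made-up tweet shaped like the standard Twitter API payload (an assumption about this file)
sample_tweet = {
    "text": "Read it here https://t.co/abc",
    "entities": {
        "urls": [
            {"url": "https://t.co/abc", "expanded_url": "https://www.example.com/politics/report"}
        ]
    },
}

def linked_domains(tweet):
    """Return the domain of every expanded URL attached to a tweet."""
    urls = tweet.get("entities", {}).get("urls", [])
    domains = []
    for item in urls:
        expanded = item.get("expanded_url")
        if expanded:
            domains.append(urlparse(expanded).netloc.lower())
    return domains

domain_counts = Counter(linked_domains(sample_tweet))
print(domain_counts)
```

Calling `linked_domains(tweet)` inside the file-reading loop and adding the result to each row would give you a column of domains to group by.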

## Part 3: News Coverage of the Mueller Report

Finally, we're going to look over the news coverage from April 18th, 2019. We've collected nearly 3500 articles (from publications big and small) from [NewsAPI](https://newsapi.org/) that were published on the 18th. Grab the file [`mueller_report_articles_0418.csv`](https://github.com/computationaljournalism/columbia2019/blob/master/data/mueller_report_articles_0418.csv) and move it to the same directory as this notebook.

We converted the stories from NewsAPI into a CSV file with a few fields: title, domain, url, description and published time.

Let's take a look!

In [None]:
# load up the articles into a DataFrame
import pandas as pd

news_df = pd.read_csv('mueller_report_articles_0418.csv')

In [None]:
# what do we have in the CSV file?
news_df.head()

Again, we'll need to convert the published date/time to a proper datetime object. We can do that with the following code:

In [None]:
# convert the date as a string into a datetime object and store it in a new column named "published"
news_df['published'] = pd.to_datetime(news_df['published_at'].astype(str), utc=True)

In [None]:
# look better?
news_df.head()

**Your Turn**: one of the first things we want to do is plot the number of articles per hour. We've done this a million times in class (and we even did it above ^^ in this notebook), so let's do it one last time. Take the `news_df` DataFrame and plot the count of stories per hour. Ready? Go!

In [None]:
# Your Code



Great! Now, let's take a look at the articles over the 11a-12p ET hour right as the report was dropped. A few things to look at:
- when were the first articles published that mention the report (after 11a)?
- what does the early coverage look like? remember that publications are just starting to read over the 400+ page report.

How do you filter just the stories from that hour? Remember that our data is in UTC (or GMT) timezone, so you're looking for the 15:00 to 16:00 range.

Spend the next 10-15 minutes looking over this hour range.
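
One way to filter that hour, sketched on a made-up three-row DataFrame (`sample_news`) that mimics the `published` column we created above. Because that column is timezone-aware (UTC), the start and end timestamps need a timezone too.

```python
import pandas as pd

# made-up articles: one before, one during and one after the 15:00-16:00 UTC hour
sample_news = pd.DataFrame({
    "title": ["Before", "During", "After"],
    "published": pd.to_datetime(
        ["2019-04-18T14:59:00Z", "2019-04-18T15:05:00Z", "2019-04-18T16:00:00Z"], utc=True
    ),
})

# 11am-12pm ET on April 18th is 15:00-16:00 UTC
start = pd.Timestamp("2019-04-18 15:00", tz="UTC")
end = pd.Timestamp("2019-04-18 16:00", tz="UTC")

# keep rows at or after the start and strictly before the end
in_hour = sample_news[(sample_news["published"] >= start) & (sample_news["published"] < end)]
print(in_hour.sort_values("published"))
```

Sorting by `published` puts the earliest stories first, which is what you want when hunting for the first mention.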

In [None]:
# Your Code



Lastly, try to combine the Twitter data with the news article data. Which of these stories are people sharing on Twitter? How many people are sharing the links, and who are they? What does the conversation look like for people sharing the articles from publishers like dailycaller, infowars, foxnews, breitbart, thehill, nytimes, etc.? What other questions do you have of the data?
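
A minimal sketch of one approach: put the links shared in tweets into one DataFrame, the articles into another, and join them on the URL. Both DataFrames below are made up, and matching on the exact URL string is a simplifying assumption; real links often differ by tracking parameters or a trailing slash, so you may need to normalize them first.

```python
import pandas as pd

# made-up table of links shared in tweets (one row per tweet/link pair)
shared_links = pd.DataFrame({
    "screen_name": ["alice", "bob", "carol"],
    "url": [
        "https://example.com/story-1",
        "https://example.com/story-1",
        "https://example.org/other",
    ],
})

# made-up articles with the same kinds of fields as the news CSV
sample_articles = pd.DataFrame({
    "title": ["Story one", "Story two"],
    "domain": ["example.com", "example.com"],
    "url": ["https://example.com/story-1", "https://example.com/story-2"],
})

# an inner join keeps only the tweets whose link matches an article we collected
matched = shared_links.merge(sample_articles, on="url", how="inner")

# how many distinct accounts shared each story?
shares_per_story = matched.groupby("title")["screen_name"].nunique()
print(shares_per_story)
```

Grouping `matched` by `domain` instead of `title` gives the same count per publisher.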

In [None]:
# Your Code

