#### The Mueller Report: A Drill in Three Parts

<img src=https://media.necn.com/images/652*367/trump-back-of-the-head.jpg width=500>

Today we are going to refresh our skills a little and learn something new. You are going to be divided into four newsrooms. Over the course of the next two meetings, you will cover a piece of "breaking data," in this case the Mueller Report. 

**Part 1. Picture perfect**

First, the report itself. Let's download the copy that was made available on April 18. Copies are floating all over the web but we'll stick to [the one hosted at Scribd](https://www.scribd.com/document/406726026/Mueller-report). It is the original and we are playing out this drill as if it were April 18. So, download the file. You have five minutes to tell me something about its contents. How is the argument structured? Who are the main characters? The significant places and times? **You can only work with this file.**

Go!

In [None]:
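# A baby timer: prints the seconds as they pass and, each minute, the minutes left

from time import sleep

# change this to reuse the timer for the longer exercises below
minutes = 5

for i in range(1, minutes * 60 + 1):

    sleep(1)

    if i % 60:
        print(i, end=" ")
    else:
        # integer division keeps the countdown a whole number
        print("\nminutes left:", minutes - i // 60)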

What is the issue with the file? It is a PDF but *that* kind of PDF. Groups noticed immediately that the PDF was based on a scan, [prompting speculation about how it was produced.](https://www.theverge.com/2019/4/19/18508148/pdf-association-mueller-report-scanning-redaction-sad-formatting) There is another [retelling of the story in Quartz.](https://qz.com/1601873/the-pdf-of-the-mueller-report-has-been-updated-to-be-more-accessible/) While this version might be hopeless for text, what *is* it good for? Here's a hint.
<br><br>
<img src="https://i0.wp.com/flowingdata.com/wp-content/uploads/2019/04/all_pages-cropped.png?w=2427&ssl=1" width=500>
<br><br>

What other things can you do with the images? [Axios took a middle-of-the-road approach](https://www.axios.com/explore-a-detailed-version-of-the-mueller-report-5f7cab5b-9c53-46bc-abaa-bd6b7b3e6d66.html) and colored things according to people and places. 
<br><br>
<img src="https://i1.wp.com/flowingdata.com/wp-content/uploads/2019/04/Tagged-Mueller-Report.png?w=1822&ssl=1" width=600>
<br><br>
Let's see what we can do here. We are going to use a newish package for working with PDF files. It's called `PyMuPDF`, although the library name you load to use it is `fitz`. (There's some not particularly funny reason for this that for once I won't recount.)

Let's install the package and fire it up!

In [None]:
%%sh
pip install PyMuPDF

We can now use this package to open the file we downloaded. Again, make sure the PDF is in the same folder as this notebook. We are going to use the function `open()` in the `fitz` library. There is a simple tutorial [here](https://pymupdf.readthedocs.io/en/latest/tutorial/).

In [None]:
import fitz
doc = fitz.open("406726026-Mueller-report.pdf")
type(doc)

So we have a PDF document. Remember software objects have data and functions. In this case the `Document` object contains information like the number of pages...

In [None]:
doc.pageCount

... and metadata about its creation.

In [None]:
doc.metadata

As we saw in the Quartz article, there is a table of contents that suggests a division of labor for redacting the document. Here's that division.

In [None]:
doc.getToC()
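
The table of contents comes back as a list of `[level, title, page]` entries, where the level says how deeply an entry is nested. Here is a minimal sketch of turning such a list into an indented outline. The entries below are made-up placeholders, not the report's actual bookmarks, and `outline` is a helper name invented for this illustration.

```python
# Placeholder entries in the [level, title, page] shape (NOT the real report's ToC)
toc = [
    [1, "Section A", 1],
    [2, "Subsection A.1", 3],
    [2, "Subsection A.2", 7],
    [1, "Section B", 12],
]

def outline(toc):
    """Turn [level, title, page] entries into indented outline lines."""
    lines = []
    for level, title, page in toc:
        indent = "  " * (level - 1)   # two spaces per nesting level
        lines.append(f"{indent}{title} (p. {page})")
    return lines

for line in outline(toc):
    print(line)
```

In the notebook you could pass the result of `doc.getToC()` to the same function instead of the placeholder list.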

All of this is fine, but what we really want is the contents. We suspect that this is a PDF made up of images; we can verify that and work with the data. So, let's start by loading a page. Here we use the method `.loadPage()` to grab one, say the page at index 100 (pages are numbered from zero, so this is the 101st)...

In [None]:
page = doc.loadPage(100)
type(page)

In [None]:
page

To grab the image, we can use a method called `.getPixmap()` that pulls the pixels. It will tell us how big the image is...

In [None]:
pix = page.getPixmap()
pix

... and as we have done before, we can use a Jupyter tool to display it. We use the method `.getImageData()` from our pixelmap to return an image we can show. 

In [None]:
from IPython.display import Image
Image(pix.getImageData(),width=300)

This is officially getting in the weeds, but we can pull the commands into one place and have a look by changing the index. We could, if we wanted to, scroll through the pages this way. Thankfully we have PDF previewers that make that a lot easier.

Here we use a simpler command attached to the document and not the page, saving us a line of code. 

In [None]:
pix = doc.getPagePixmap(200)
Image(pix.getImageData(),width=300)

Before we leave this, let's look at how we might make a grid in Jupyter. (There are so many ways to do this, it's crazy, but here's an easy one now that we're here.) What we'll do is export the image to file -- a `.png`. This is how we write one...

In [None]:
pix.writePNG("0.png")

... and this is how we write 25 of them. You can use any offset you like so that you don't have to take the first 25. But anyway, write them out and then just use the `%%HTML` cell magic to display them in a table. Voila!

In [None]:
for i in range(25):
    pix = doc.getPagePixmap(i)
    pix.writePNG(str(i)+".png")

In [None]:
%%HTML
<table>
<tr><td><img src=0.png width=100></td><td><img src=1.png width=100></td><td><img src=2.png width=100></td><td><img src=3.png width=100></td><td><img src=4.png width=100></td></tr>
<tr><td><img src=5.png width=100></td><td><img src=6.png width=100></td><td><img src=7.png width=100></td><td><img src=8.png width=100></td><td><img src=9.png width=100></td></tr>
<tr><td><img src=10.png width=100></td><td><img src=11.png width=100></td><td><img src=12.png width=100></td><td><img src=13.png width=100></td><td><img src=14.png width=100></td></tr>
<tr><td><img src=15.png width=100></td><td><img src=16.png width=100></td><td><img src=17.png width=100></td><td><img src=18.png width=100></td><td><img src=19.png width=100></td></tr>
<tr><td><img src=20.png width=100></td><td><img src=21.png width=100></td><td><img src=22.png width=100></td><td><img src=23.png width=100></td><td><img src=24.png width=100></td></tr>
</table>
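
Typing 25 table cells by hand gets old quickly. As a sketch, the same markup can be built with a loop. The function name `image_grid` and its parameters are invented for this illustration, and it assumes the page images were saved as `<n>.png` as in the loop above.

```python
def image_grid(start=0, rows=5, cols=5, width=100):
    """Build an HTML table of <img> tags for files named '<n>.png'."""
    html = ["<table>"]
    for r in range(rows):
        cells = []
        for c in range(cols):
            n = start + r * cols + c          # file number for this cell
            cells.append(f"<td><img src={n}.png width={width}></td>")
        html.append("<tr>" + "".join(cells) + "</tr>")
    html.append("</table>")
    return "\n".join(html)

grid = image_grid()
print(grid.splitlines()[1])   # the first row of the table

# In a notebook, display it with:
#   from IPython.display import HTML
#   HTML(grid)
```

If you pass a different `start`, the PNG files have to exist under the matching names, so the export loop would need the same offset.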

Now, PDFs have images and text. Just like we had a `.getPixmap()`, we also have `.getText()`. In this case, we don't expect very much.

In [None]:
page = doc.loadPage(100)
page.getText("text")

Now, as we saw, the Mueller Report was quickly re-released using OCR to pull text from the scanned images. We'll again download from Scribd. Download [this version of the report now](https://www.scribd.com/document/406728825/Mueller-Report-searchable) and put it in the same folder as this notebook. 

Let's load it up and pull the text...

In [None]:
doc = fitz.open("406728825-Mueller-Report-searchable.pdf")

Now, we can `.loadPage()` the page and use a method `.getText()` on the `page` object, or we can just `.getPageText()` directly from the `Document` object, saving us a little typing.

In [None]:
text = doc.getPageText(100)
text

What we get is a string. It's messy. We should try to find a cleaner version of the PDF that doesn't have so much stuff attached to it. Still, it will give us a taste of what we can do. We are going to start to use the text as data to help us understand what is being talked about.
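
As a first taste of text as data, here is a minimal sketch that counts words in a string using only the standard library. The sample string is a single sentence from the report standing in for a full page; in the notebook you would use the string returned by `doc.getPageText()`. The variable names are chosen for this illustration.

```python
from collections import Counter
import re

# Stand-in for one page of text
sample = ("Over the past few months, I have been working with a company based "
          "in Russia regarding the development of a Trump Tower-Moscow project "
          "in Moscow City.")

# lowercase everything and keep only runs of letters as "words"
words = re.findall(r"[a-z]+", sample.lower())

word_counts = Counter(words)
print(word_counts.most_common(5))
```

Splitting on letters only is crude (it breaks "Tower-Moscow" in two and drops numbers), but it is often enough to see what a page is about.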

Now, without anything fancy, what can we do with each page of text? Give yourself 15 minutes on the clock above and try something out.

In [None]:
# your code here



Now we are going to add some formalism. We have already seen one platform for so-called natural language processing, `TextBlob`. Today we are going to use `spacy` instead. [A complete guide](https://spacy.io/usage/spacy-101) will give you all you need to be fully dangerous. 

First, let's install the package...

In [None]:
%%sh
pip install spacy

... and then install a language model. You will see from the documentation that there are a variety of languages available. We will stick with English.

In [None]:
%%sh
python -m spacy download en_core_web_sm

Now, let's start! We can create an object called `nlp` that will do the language processing for us. Here we take one sentence from page 81 and print the so-called named entities. You can see the kinds of things they recognize from [a spacy list.](https://spacy.io/api/annotation#named-entities)

In [None]:
print(doc.getPageText(81))

In [None]:
import spacy

# load spacy with a language model
nlp = spacy.load("en_core_web_sm")

# now perform a simple nlp computation on one of the sentences from page 81
parsed = nlp("Over the past few months, I have been working with a company based in Russia regarding the development of a Trump Tower-Moscow project in Moscow City.")

for ent in parsed.ents:
    print(ent.text.strip(), ent.label_)

We can also extract the so-called noun chunks (a noun plus words describing the noun).

In [None]:
for nc in parsed.noun_chunks:
    print(nc.text.strip())

And as with TextBlob we can pull sentences...

In [None]:
parsed = nlp(doc.getPageText(81))

for sent in parsed.sents:
    print(sent)
    print("--"*20)

Take another 10 minutes on the clock and do something with this new skill. What can you learn about the report?

In [None]:
# your code here



There is a lot more structure we can infer using spacy. Let's take the same sentence and consider the parts of speech it contains and the [dependency](https://spacy.io/api/annotation#dependency-parsing) between the words. It's probably a lot to get into the finer points of NLP today, but there are things we can do with just a little knowledge. The good thing is that most of what we need is [viewable graphically.](https://nlp.stanford.edu/software/dependencies_manual.pdf)

In [None]:
import spacy

# load spacy with a language model
nlp = spacy.load("en_core_web_sm")

# now perform a simple nlp computation on one of the sentences from page 81
parsed = nlp("Over the past few months, I have been working with a company based in Russia regarding the development of a Trump Tower-Moscow project in Moscow City.")

for token in parsed:
    print(token.text, token.dep_, token.head, token.head.pos_)

In [None]:
from spacy.symbols import nsubj, VERB

parsed = nlp(doc.getPageText(81))

verbs = []

for token in parsed:
    if token.dep == nsubj and token.text == "Cohen" and token.head.pos == VERB:
        verbs.append(token.head)

print(verbs)

And now what can you do?

In [None]:
# Your code here


**Part 2. The madding crowd**

Now, let's have a look at Twitter and how this document spread across the network. We collected tweets from just before 11am to just around 11:20am. The total volume counts [are located here](https://github.com/computationaljournalism/columbia2019/raw/master/data/mueller_counts.json). Download the file and put it in the same folder as this notebook. Let's read it in and see what kind of thing we have. We are going to use `loads()` because we know it's a JSON object.

In [193]:
from json import loads
recs = loads(open("mueller_counts.json").read())
recs



{'results': [{'timePeriod': '201904180000', 'count': 1284},
  {'timePeriod': '201904180001', 'count': 1224},
  {'timePeriod': '201904180002', 'count': 1272},
  {'timePeriod': '201904180003', 'count': 1262},
  {'timePeriod': '201904180004', 'count': 1295},
  {'timePeriod': '201904180005', 'count': 1310},
  {'timePeriod': '201904180006', 'count': 1189},
  {'timePeriod': '201904180007', 'count': 1268},
  {'timePeriod': '201904180008', 'count': 1258},
  {'timePeriod': '201904180009', 'count': 1256},
  {'timePeriod': '201904180010', 'count': 1255},
  {'timePeriod': '201904180011', 'count': 1215},
  {'timePeriod': '201904180012', 'count': 1282},
  {'timePeriod': '201904180013', 'count': 1300},
  {'timePeriod': '201904180014', 'count': 1312},
  {'timePeriod': '201904180015', 'count': 1279},
  {'timePeriod': '201904180016', 'count': 1313},
  {'timePeriod': '201904180017', 'count': 1325},
  {'timePeriod': '201904180018', 'count': 1313},
  {'timePeriod': '201904180019', 'count': 1264},
  {'timeP

Ha! A dictionary that contains a list of dictionaries. That list of dictionaries can be made into a data frame trivially with each list element being a row. Remember?

In [194]:
from pandas import DataFrame

counts = DataFrame.from_records(recs["results"])
counts.head()

| | count | timePeriod |
|---|---|---|
| 0 | 1284 | 201904180000 |
| 1 | 1224 | 201904180001 |
| 2 | 1272 | 201904180002 |
| 3 | 1262 | 201904180003 |
| 4 | 1295 | 201904180004 |
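
To see why a list of dictionaries maps so naturally onto rows and columns, here is a small sketch in plain Python using the first five records shown above. Each dictionary is a row and each key is a column. The variable names are chosen for this illustration.

```python
# The first five records from the output above
results = [
    {"timePeriod": "201904180000", "count": 1284},
    {"timePeriod": "201904180001", "count": 1224},
    {"timePeriod": "201904180002", "count": 1272},
    {"timePeriod": "201904180003", "count": 1262},
    {"timePeriod": "201904180004", "count": 1295},
]

# pivot the rows into columns: one list of values per key
columns = {key: [rec[key] for rec in results] for key in results[0]}
print(columns["count"])

# the row with the largest count
busiest = max(results, key=lambda rec: rec["count"])
print(busiest["timePeriod"], busiest["count"])
```

The data frame does this bookkeeping for us, which is why one call is enough to build it.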


The time here is awful - to fix the string, we are going to use our datetime conversions. We need to specify the format of the string, meaning which numbers mean what. Clearly it's year-month-day-hour-minute. The format we need to decode this can be [found here](http://strftime.org/). We'll add a new column called `time` to represent a time object and not just a really big number.

In [209]:
from pandas import to_datetime
counts["time"] = to_datetime(counts["timePeriod"].astype(str),format="%Y%m%d%H%M")
counts.head()

| | count | timePeriod | time |
|---|---|---|---|
| 0 | 1284 | 201904180000 | 2019-04-18 00:00:00 |
| 1 | 1224 | 201904180001 | 2019-04-18 00:01:00 |
| 2 | 1272 | 201904180002 | 2019-04-18 00:02:00 |
| 3 | 1262 | 201904180003 | 2019-04-18 00:03:00 |
| 4 | 1295 | 201904180004 | 2019-04-18 00:04:00 |
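
To see how the format string lines up with the digits, here is a sketch that decodes a single value with the standard library's `datetime`, which uses the same format codes. The variable names are chosen for this illustration.

```python
from datetime import datetime

stamp = "201904180004"

# %Y = four-digit year, %m = month, %d = day, %H = hour (24-hour clock), %M = minute
when = datetime.strptime(stamp, "%Y%m%d%H%M")

print(when)
print(when.year, when.month, when.day, when.hour, when.minute)
```

Once the value is a real time object, you can do arithmetic on it or group by hour, which a twelve-character string won't let you do.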


What are we irresistibly drawn to do?

In [207]:
from plotly.plotly import iplot, sign_in
from plotly.graph_objs import Scatter, Figure

# sign into the service (get your own credentials!)
sign_in("cocteautt","8YLww0QuMPVQ46meAMaq")

# create a plot of a single line tracking tweets over time
myplot_parts = [Scatter(x=counts["time"],y=counts["count"])]

# make a figure from this line plot...
myfigure = Figure(data=myplot_parts)

# ... and plot it (the filename is a convention plotly needs in case you want to use it later)
iplot(myfigure,filename="madding")

Now, set another 15 minutes on the clock and let's see what we can do!

In [None]:
# your code here

