#Sharpiegate
---------------

<img src="https://img.buzzfeed.com/buzzfeed-static/static/2020-02/1/17/asset/ff90b1278a8b/sub-buzz-346-1580578270-8.png?downsize=700%3A%2A&output-quality=auto&output-format=auto" width=300>

As you may recall, on September 1, 2019, President Trump tweeted that Alabama and other southern states would "most likely be hit (much) harder than anticipated" by Hurricane Dorian. The scenes of devastation from the Bahamas were unbelievable and the prospect of Dorian reaching Alabama was frightening. Three days later, President Trump produced the map above, with what appeared to be an extra bump drawn in to add Alabama to the storm's path.

In [79]:
%%HTML
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">In addition to Florida - South Carolina, North Carolina, Georgia, and Alabama, will most likely be hit (much) harder than anticipated. Looking like one of the largest hurricanes ever. Already category 5. BE CAREFUL! GOD BLESS EVERYONE!</p>&mdash; Donald J. Trump (@realDonaldTrump) <a href="https://twitter.com/realDonaldTrump/status/1168174613827899393?ref_src=twsrc%5Etfw">September 1, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

A FOIA request from Buzzfeed and other news outlets produced [this dump of emails](https://www.documentcloud.org/documents/6749506-LEOPOLD-FOIA-NOAA-Hurricane-Dorian-Sharpie-Gate.html) Friday night. You can find articles summarizing what the over 1,000 pages contain at
[Axios](https://www.axios.com/noaa-trump-dorian-emails-release-key-takeaways-8a836b5c-d215-417c-86f7-fd0abb993e7a.html) and [Buzzfeed](https://www.buzzfeednews.com/article/zahrahirji/sharpiegate-fake-hurricane-map-emails). Continuing on our mission to understand how to go from unstructured to structured data - all the while teaching us about the built-in objects in Python - we will use this as a test case. 

Download the file and sketch out what things you might summarize and how they can be used to tell the story of what happened. 

Your ideas here




As an exercise, let's add some of the players you see in the file. Look up a couple of names and add them to the dictionary below. I've created the names and titles for a few people I noticed by skimming the files.

In [None]:
cast = {
        "Mary Erickson - NOAA Federal":"Deputy Director of NOAA's National Weather Service",
        "Louis Uccellini - NOAA Federal":"Director, NWS",
        "Craig McLean - NOAA Federal":"NOAA Acting Chief Scientist",
        "George Jungbluth - NOAA Federal":"Chief of Staff, NWS",
        "Julie Roberts - NOAA Federal":"Director of Communications for NOAA",
        "Scott Smullen - NOAA Federal":"Deputy Director of NOAA Communications",
        "Christopher Vaccaro - NOAA Federal":"Senior Media Relations Specialist",
        "Susan Buchanan - NOAA Federal":"Media Specialist, National Weather Service Severe weather, NOAA Weather Radio, weather safety information",
        "Chris Darden - NOAA Federal":"NWS Birmingham, Meteorologist-in-Charge",
       }

Create a DataFrame by making a list of the information for three "senders". This might be data you look up, things contained in the file, etc. Create at least five columns (attributes) and three rows (people or mail aliases).
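One possible shape for this, sketched with two entries copied from the `cast` dictionary above rather than new reporting: build one dictionary per person and hand the list to `DataFrame`. The column names (`alias`, `name`, `affiliation`, `title`, `is_nws`) and the variable names are illustrative choices, not anything the FOIA file defines.

```python
from pandas import DataFrame

# Two entries copied from the `cast` dictionary; add a third of your own.
cast_sample = {
    "Louis Uccellini - NOAA Federal": "Director, NWS",
    "Julie Roberts - NOAA Federal": "Director of Communications for NOAA",
}

rows = []
for alias, title in cast_sample.items():
    # The aliases follow a "Name - Affiliation" pattern, so split on " - ".
    name, affiliation = alias.split(" - ")
    rows.append({
        "alias": alias,
        "name": name,
        "affiliation": affiliation,
        "title": title,
        "is_nws": "NWS" in title,  # crude flag: does the title mention NWS?
    })

# Each dictionary becomes a row; the keys become the column names.
senders = DataFrame(rows)
senders
```

A list of dictionaries is convenient here because each row carries its own column labels, so the order you type things in matters less.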

In [None]:
# Your work here



Below we have two lists that summarize the email counts per day in the FOIA dump. Use what you know about subsetting lists to pull out the entry of `counts` for September 4 (the day the president unveiled his map). Then from `dates` pull out the string that corresponds to the day with the second-largest count. Yes, this is all a little tedious but we want to make sure you are good with lists and dictionaries!

In [None]:
counts = [3,10,93,49,97,60,105]
dates = ["2019-09-02","2019-09-03","2019-09-04","2019-09-05","2019-09-06","2019-09-07","2019-09-08"]
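As a reminder of the list tools involved, here is a small sketch on two made-up lists (`scores` and `labels` are illustrative and unrelated to the FOIA counts), so the exercise above is still yours to solve.

```python
# Illustrative lists, unrelated to the FOIA counts above.
scores = [4, 9, 2, 7]
labels = ["a", "b", "c", "d"]

# Positions start at 0, so the third item sits at index 2.
third_score = scores[2]

# Look up a position by value in one list, then reuse it in the other.
pos = labels.index("d")
score_for_d = scores[pos]

# Second-largest value: sort a copy from high to low and take position 1.
second_largest = sorted(scores, reverse=True)[1]
label_for_second = labels[scores.index(second_largest)]
```

Note that `.index()` returns the first match only, so this approach assumes the values are not tied.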

We have started to create a DataFrame that includes information about each email in the FOIA release. It's called `noaa2.csv` and is on our GitHub page. You can read it in directly from there.

In [None]:
from pandas import read_csv
noaa = read_csv("https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/data/noaa2.csv")

noaa.head()

Each row is an email. How many emails were released in the FOIA dump?
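Counting rows can be done a couple of ways. The sketch below uses a tiny stand-in DataFrame called `toy` (made-up rows, not the real release) so it runs without the download; the same two calls should work on `noaa`.

```python
from pandas import DataFrame

# A made-up stand-in for `noaa`, with one row per email.
toy = DataFrame({
    "number": [1, 2, 3],
    "date": ["2019-09-04", "2019-09-04", "2019-09-05"],
})

# len() of a DataFrame is its number of rows.
n_rows = len(toy)

# .shape gives (rows, columns) in one go.
n_rows_again, n_cols = toy.shape
```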

**Subsetting columns**

Now, let's summarize some columns. First, remember we subset out columns by simply giving their name in square brackets. So this extracts all the `date` data.

In [None]:
noaa["date"]

Now, let's use the method `.value_counts()` on this column of the DataFrame to tally again how many emails were sent per day.

In [None]:
noaa["date"].value_counts()
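By default `.value_counts()` orders its output from most to least frequent. If you would rather read the tally in calendar order, chaining `.sort_index()` is one option. The sketch below uses a made-up Series, `toy_dates`, in place of `noaa["date"]`.

```python
from pandas import Series

# Made-up dates standing in for noaa["date"].
toy_dates = Series([
    "2019-09-04", "2019-09-05", "2019-09-04", "2019-09-03", "2019-09-04",
])

# Default ordering: the busiest day comes first.
by_frequency = toy_dates.value_counts()

# Sorting on the index (the dates) puts the days in calendar order.
by_day = toy_dates.value_counts().sort_index()
```

This works on ISO-style date strings because they sort alphabetically in the same order as they do chronologically.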

Now, summarize two other columns and tell me what they seem to indicate.

In [None]:
# First


# Second



We can use our line plot to track the finer structure of the emails within a day. Here we plot the email number (its position in the FOIA dump, 1 being the first in time order, 417 being the last). On the x-axis we put time, on the y-axis we put number.

In [None]:
from plotly.express import line
fig = line(noaa, x="dt", y="number")
fig.show()

A more natural summary is just a histogram. That is, we divide the x-axis (in this case time) into equally sized intervals and on the y-axis show how many emails fall into each bin. Here we use `plotly.express` again and ask for up to 200 time bins (`nbins=200`) across the days covered by the release.

In [None]:
from plotly.express import histogram

fig = histogram(noaa, x="dt",nbins=200)
fig.show()
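To see what the histogram is counting, the binning can be sketched by hand in pandas. This is not how plotly computes its bins; it is a rough equivalent that assumes hour-long bins and uses a made-up `toy` DataFrame with a `dt` column like the one in `noaa`.

```python
from pandas import DataFrame, to_datetime

# Made-up timestamps standing in for noaa["dt"].
toy = DataFrame({
    "dt": ["2019-09-04 14:05:00", "2019-09-04 14:40:00", "2019-09-04 16:10:00"],
})

# Make sure the column holds real timestamps, not strings.
toy["dt"] = to_datetime(toy["dt"])

# Round each timestamp down to the start of its hour, then count per hour.
per_hour = toy["dt"].dt.floor("60min").value_counts().sort_index()
```

Unlike the plotted histogram, this tally only lists hours that contain at least one email, so empty bins (3pm here) simply do not appear.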

**Subsetting rows**

Recall that we have seen two kinds of subsetting for DataFrames -- one involves supplying a character string or a list of strings between square brackets `[  ]`. Here we pull out all the senders.

In [None]:
noaa["sender"]

The second kind of subsetting involves using booleans to identify which rows to keep. So here we ask which sender is Chris Vaccaro (returning `True`) and which is not (returning `False`).

In [None]:
noaa["sender"]=="Christopher Vaccaro - NOAA Federal"

There is one boolean per row and using this inside `noaa[  ]` will produce another DataFrame that just has Vaccaro's emails. 

In [None]:
noaa[noaa["sender"]=="Christopher Vaccaro - NOAA Federal"]

So far, then, we have seen **two forms of subsetting for DataFrames.** One for columns, providing a column name or a list of names, and one for rows, providing `True` or `False` to specify which to keep and which to kill.
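The two forms can also be combined. As a sketch on a made-up `toy` DataFrame (the rows are invented for illustration), you can either chain the two kinds of square brackets or use `.loc`, which takes the row booleans and the column names together.

```python
from pandas import DataFrame

# Made-up rows for illustration; only the sender names come from `cast`.
toy = DataFrame({
    "sender": [
        "Christopher Vaccaro - NOAA Federal",
        "Scott Smullen - NOAA Federal",
        "Christopher Vaccaro - NOAA Federal",
    ],
    "number": [1, 2, 3],
    "date": ["2019-09-04", "2019-09-04", "2019-09-05"],
})

# One boolean per row: True where the sender matches.
is_cv = toy["sender"] == "Christopher Vaccaro - NOAA Federal"

# Rows first, then columns, in two steps...
cv_two_step = toy[is_cv][["number", "date"]]

# ...or both at once with .loc[rows, columns].
cv_one_step = toy.loc[is_cv, ["number", "date"]]
```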

You might wonder why. The bottom line is that we end up doing this kind of subsetting a lot. And so making it easy to specify means our lives are easier. Let's try this out and have a look at just the histogram of Chris Vaccaro's involvement with `#Sharpiegate`.

In [None]:
from plotly.express import histogram

cv = noaa[noaa["sender"]=="Christopher Vaccaro - NOAA Federal"]

fig = histogram(cv, x="dt",nbins=100)
fig.show()

Try this with someone else or with another condition, perhaps a particular subject line.
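If you want more than one person at a time, `.isin()` is one way to build the booleans. The sketch below uses placeholder sender names (`"A"`, `"B"`, `"C"`) in a made-up `toy` DataFrame; swap in real aliases from `noaa["sender"]`.

```python
from pandas import DataFrame

# Placeholder senders, not real names from the release.
toy = DataFrame({
    "sender": ["A", "B", "C", "A"],
    "number": [1, 2, 3, 4],
})

# True for rows whose sender is any of the names in the list.
keep = toy["sender"].isin(["A", "C"])

subset = toy[keep]
```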

Now, let's go back to the overall activity. We are going to create more complex boolean expressions to subset rows. Here's the complete email histogram again.

In [None]:
fig = histogram(noaa, x="dt",nbins=200)
fig.show()

A big spike hit at about 2 in the afternoon on the 4th (when was the president's briefing?). We might want to subset the emails from between 2 and 3 in the afternoon. As a first step, here is everything sent after 2pm.

In [None]:
noaa[noaa["dt"]>"2019-09-04 14:00:00"]

And here we look at those between 2 and 3pm. The simplest thing is to apply a second condition to the result of the first one.

In [None]:
start2 = noaa[noaa["dt"]>"2019-09-04 14:00:00"]
btwn23 = start2[start2["dt"]<"2019-09-04 15:00:00"]

btwn23

We can combine these two conditions into one, as we did with booleans, but instead of `and` and `or` we use `&` and `|`. There are technical reasons why, but the short version is that `and` in `(3<5) and (2>10)` makes a single comparison, whereas in Pandas we are making one comparison per row. So the symbols are a little different. Wrap each condition in parentheses, because `&` binds more tightly than `>` and `<`. Test it out...

In [None]:
(noaa["dt"]>"2019-09-04 14:00:00") & (noaa["dt"]<"2019-09-04 15:00:00")

And now feed these into `noaa[  ]` for subsetting, keeping just the rows in the time period.

In [None]:
noaa[(noaa["dt"]>"2019-09-04 14:00:00") & (noaa["dt"]<"2019-09-04 15:00:00")]
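A time window needs one "after" test and one "before" test, joined row by row. The sketch below shows that on a made-up `toy` DataFrame, and also shows what happens if you reach for `and` instead of `&`. It assumes `dt` holds real timestamps; the names `start`, `end`, and `in_window` are illustrative.

```python
from pandas import DataFrame, Timestamp, to_datetime

# Made-up timestamps: one before, one inside, one after the window.
toy = DataFrame({
    "dt": to_datetime([
        "2019-09-04 13:50:00",
        "2019-09-04 14:20:00",
        "2019-09-04 15:30:00",
    ]),
})

start = Timestamp("2019-09-04 14:00:00")
end = Timestamp("2019-09-04 15:00:00")

# & compares the two boolean Series one row at a time.
in_window = (toy["dt"] > start) & (toy["dt"] < end)

# `and` wants a single True/False, which a whole Series cannot supply.
try:
    (toy["dt"] > start) and (toy["dt"] < end)
    and_failed = False
except ValueError:
    and_failed = True

# Keep just the rows inside the window.
window = toy[in_window]
```

The error from `and` is pandas declining to guess what "true" should mean for a whole column of values.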

What do you see? What do you want to do next?

**Crosstabulations**

Next, we might look across the days and see who is active or what subjects are talked about. A crosstabulation makes this easy. Here we import another function from pandas, `crosstab`, just as we did with `read_csv`. Instead of simply tabulating the days emails were sent or the most frequent senders, we now cross the two to see which days each person was active.

In [None]:
from pandas import crosstab

crosstab(noaa["sender"],noaa["date"])
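If you also want row and column totals, `crosstab` accepts `margins=True`. Here is a sketch on a made-up `toy` DataFrame with placeholder senders `"A"` and `"B"`.

```python
from pandas import DataFrame, crosstab

# Placeholder senders and made-up dates, for illustration only.
toy = DataFrame({
    "sender": ["A", "A", "B"],
    "date": ["2019-09-04", "2019-09-05", "2019-09-04"],
})

# Rows are senders, columns are dates, cells are email counts.
# margins=True adds an "All" row and column holding the totals.
table = crosstab(toy["sender"], toy["date"], margins=True)
table
```

Combinations that never occur (sender `"B"` on September 5 here) show up as 0 rather than being left out.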

Finally, we might want to know how many emails referred to Buzzfeed or ABC. When a column in a DataFrame is made up of strings, we can use the `.str` accessor to reach all the methods we are used to with strings. So for example, we can make all the subjects uppercase...

In [None]:
noaa["subject"].str.upper()

... and then we can see if they contain `"ABC"`...

In [None]:
noaa["subject"].str.upper().str.contains("ABC")

... and use the boolean values to keep just the rows that have the condition we're after. 

In [None]:
noaa[noaa["subject"].str.upper().str.contains("ABC")]
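A slightly more defensive variant, sketched on made-up subject lines: `.str.contains()` can ignore case itself with `case=False`, and `na=False` treats a missing subject as "no match". Whether `noaa["subject"]` actually has missing values is an assumption to check, not something established above.

```python
from pandas import DataFrame

# Made-up subject lines, including one missing value.
toy = DataFrame({
    "subject": ["Re: abc request", "Forecast update", None, "ABC News inquiry"],
})

# case=False ignores upper/lower case; na=False maps missing subjects to False.
mentions_abc = toy["subject"].str.contains("abc", case=False, na=False)

abc_rows = toy[mentions_abc]
```

Keep in mind this is a plain substring match, so it would also flag any longer word that happens to contain those three letters.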

Now, ask some questions about the data set -- explore a little. Go back to the original PDF and see how we might expand the data we have so far.

Your thoughts here




**Finally, Iowa**

We are going to try to build a model to predict county-level outcomes 