<a href="https://colab.research.google.com/github/cocteau/computing2021/blob/main/notebooks/01_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Computational Journalism](https://cdn.wallpapersafari.com/50/92/VhFYsk.jpg
 "Computational Journalism")

## Computational Journalism
## Summer, 2021

### Background

We'll start with some basic questions. First, the term *computational journalism.* What does it mean for journalism to be computational? In some sense, the use of computation in journalism is not new. Pulitzer himself wrote about how important "data" was to journalism.
<br><br>

<img src="https://github.com/cocteau/computing2021/raw/main/images/Screen%20Shot%202021-05-25%20at%2011.59.05%20AM.png" width=600>
<br><br>
In data he said we could find "romance, human interest, humor and fascinating revelations."
<br><br>

<img src="https://github.com/cocteau/computing2021/raw/main/images/Screen%20Shot%202021-05-25%20at%2011.58.34%20AM.png" width=400>

In ways that Pulitzer probably could not have imagined, over the last few decades data and computing have become part of our everyday lives. They regulate and shape our interactions with the physical and virtual worlds. Organizations increasingly equate (though not without problems) "data release" with transparency. Sensing (sound, light, air quality) is cheap and plentiful, and easily deployed by the general public. Our actions online generate vast quantities of digital data. 

And, increasingly, computer systems exercise real power in the world through the insertion of machine learning (statistical models, artificial intelligence) alongside or in place of human decision making. In all of this, we can find new ways to ask questions about the world, how it's organized and how it functions. But the keys to this new digitized kingdom are data, code and algorithms. The curiosity, the questioning spirit, you developed last semester in your reporting classes finds an outlet in new and unexpected ways, mediated by data, code and algorithms. Hence, *computational journalism.* It is simply a response to our new condition of living in a computational society.

In this year's edition of the course, we will be focusing on computational tools and techniques that, while not necessarily new, certainly achieved new prominence in the national election in 2016 and beyond. The vast networks of information that are created every day are simply too large for us to examine in their entirety. To get a sense of "what's on," we take feeds from algorithmic recommender systems, we scan trending topics, we focus on information shared with us by our friends or people we trust. Recently, we have seen how these tools and strategies for directing our attention can be hacked. This year, we are going to place special emphasis on understanding machine learning or artificial intelligence and how it impacts journalism — from helping your reporting, to creating new kinds of story forms, to its use in the distribution of journalism.

### Class Themes

Along the way, we will cover a variety of topics that will help you in your journalistic practice. You will better understand the media "ecology" you interact with daily, and you will learn to look to these systems as a source of stories, perhaps even as inspiration to build a new platform to support journalism. We will learn a variety of tools, and our primary programming language will be Python. We will talk through how we came to this decision, but for the moment, know that it is a flexible language that lets you easily connect to networks like Twitter, assemble and analyze data from formal databases and the web, and build responsive services based on all these inputs. Over the next 12 weeks, we will introduce you to the following technologies and tools:

* Python programming
* Basic data analysis
* Data collection using APIs and [scraping](https://en.wikipedia.org/wiki/Web_scraping)
* Machine Learning: ML to report on, ML to report with and ML as a distribution tool
* Bots (text and voice)
* Deep fakes
* Regular Expressions
* Natural Language Processing
* UNIX command-line tools
* Data Visualization
* Database technologies

Our class typically attracts people with different skill levels, most having no background in computation, some having recently been introduced to Python, and occasionally one or two who are already proficient in many of the topics we are covering. The course assignments will be structured so that everyone has something new to do, with those needing less in the way of a technical introduction focusing on applications to reporting or the work of journalism broadly.

We will use Python from within this "notebook" framework. The notebook is an ideal way to address your journalistic and programming needs. Beyond simply commenting on what
your code is doing, these notebooks are a legitimate authoring system that you will use
to create (and publish) pitches and memos for this class. One of your humble instructors has [lectured on why the Jupyter notebook is ideal for journalists.](https://conferences.oreilly.com/jupyter/jup-ny/public/schedule/detail/70966)

### Why we code

The goal of this course is to introduce computation, broadly defined, as a tool for both finding and telling stories. This means "reporting on" computation and its role in the world, as well as "reporting with" computing tools in pursuit of a story, and any combination of the two.

When teaching computation (or any "technology") as part of a course, people often refer to "literacy" as a goal. For the most part, that term implies "functional literacy" — do you understand how to use something? Can you write a program, say, to assemble a data set from the web? 

Stuart Selber, a professor of English at Penn State, writes about two other facets to being literate. After functional literacy, he defines "critical literacy." Here are characteristics of a critically literate student.

>*Design cultures.* A critically literate student scrutinizes the dominant perspectives that shape computer design cultures and their artifacts.
<br><br>
*Use contexts.* A critically literate student sees use contexts as an inseparable aspect of computers that helps to contextualize and constitute them.
<br><br>
*Institutional forces.* A critically literate student understands the institutional forces that shape computer use.
<br><br>
*Popular representations.* A critically literate student scrutinizes representations of 
computers in the public imagination.

The third kind of literacy is "rhetorical." 

>*Persuasion.* A rhetorically literate student understands that persuasion permeates interface design contexts in both implicit and explicit ways and that it always involves larger structures and forces (e.g., use contexts, ideology).
<br><br>
*Deliberation.* A rhetorically literate student understands that interface design problems are ill-defined problems whose solutions are representational arguments that have been arrived at through various deliberative activities.
<br><br>
*Reflection.* A rhetorically literate student articulates his or her interface design knowledge at a conscious level and subjects their actions and practices to critical assessment.
<br><br>
*Social action.* A rhetorically literate student sees interface design as a form of social versus technical action.

From the standpoint of a journalism student, all of this might best be wrapped up in the following equivalences (shamelessly cribbed from Ian Bogost at Georgia Tech).
<br><br>
<center><b>Digital Technology = Model of the World  = Argument</b></center>
<br>

In short, every piece of digital technology embeds within it a model of the world. You might think of this as the dominant "use case" a designer had in mind. The net effect is that some actions are natural, "designed for", easy, while others are hard. And this is the argument. It is the way that technology trains you to adopt its conventions, its embedded model of the world. You are led to do the easy things and avoid the hard things. 

In this class, we will spend a great deal of time learning a programming language, Python. And yes, any given coding language has its own model of the world and makes its own arguments for certain kinds of practices (certain metaphors for actually writing code). But with a coding language comes almost unbounded flexibility to create. Unlike many of the designed systems we interact with, coding gives us the freedom to build, to imagine the world in new ways. 

All of these ideas take on particular resonance with our theme, "Hacking your attention." We are inviting you to not only investigate existing algorithms for computing "trending topics," but also to try out your own ideas about how this should be done. In an age when people are arguing for "algorithmic accountability" and "explainable" artificial intelligence, it's the perfect time to consider a reporting practice that investigates by building.

### Instructors

In addition to your humble professor, we will have an excellent TA to travel with us on this computing adventure.

>**Bernat Ivancsics** *is a PhD candidate in communications and a CJS '16 alum. He's writing his dissertation on how news stories become data (archive-ready and machine-readable) through data-warehousing and data-processing tools and workflows that are deployed within print news organizations. This is a longer history that begins in the 1920s and iterates to the present day. Bernat is currently affiliated with Columbia's Data Science Institute, and has held all sorts of research positions at the Tow Center, the NYT R&D Lab, and MSN.*

### Assignments

**Each week**, you will receive notebooks, like this one, that you will work through outside of class. They will usually be due before the next class meeting, but specific deadlines will be given with each. There will be one or two per week and their level of detail will depend on the material. Sometimes they will be more drill-like, and other times they will challenge you to create something new. But don't worry, we do not assume you know anything about Python, in particular, or coding, in general. 

**You may work on your assignment in groups, but you should answer any questions in your own words. No copying. It is important that we see how well you are understanding the material.**

In addition, **each week beginning June 1** you will find a story or some technology (program, platform, web site) that deals with the themes of the class. You will write a summary/critique, and submit it via Courseworks. To help you, here are the kinds of questions you might address about a story you read.

1. What is the story about? Use no more than two sentences.
2. What drew you to this story, and why does it enhance our class discussion?
3. What data is used in the story, if any? How did the journalist obtain the data?
4. How did the computing help in telling the story? Who performed the computations?
5. Did the journalist "show their work" and could you recreate their results?
6. What non-computing sources were used, and how do they contribute to the story?
7. What would you do to follow up on this story? Where would you go next? 

These writeups are **due by 5pm Tuesday evenings.**

**The class will culminate in a final project, the largest component of your grade.** You will work in groups of 2-3 students. Your project is meant to be an act of computational journalism. This might mean building and documenting a new data set or computing service, or using computation to probe an existing platform or data set to tell a new story. No matter what path is chosen, we expect a well-written, well-reported story memo that accompanies your analysis or technology development. 

**A significant story pitch describing your project is due Tuesday, June 29 by 5pm.** This should be of sufficient detail that it’s clear you will have a strong, finished project by the end of the term — you might have started building something, reporting on something and analyzing data, etc. The purpose of this midterm check-in is to avoid end-of-term surprises as data fall through, holes emerge or analyses break down.

**Each Thursday by 5pm beginning June 10,** students will update a Jupyter notebook corresponding to their final project. Initially, this might consist entirely of text and straight-up reporting, along with questions about a story idea and how to proceed. It might also consist of computations and progress toward a final story memo. We expect just one update per group. And we understand that groups will shift during this period. We just want to see that people are thinking about their projects early.

**Grading**

Grades will be divided between weekly writing assignments (computing or data story
writeups and project updates), weekly coding drills, your final project, and attendance/
participation. **We expect you to complete and submit each of these by the deadline.** If you are having trouble keeping up, let us know right away. **We expect you to attend every class.** Here is how grading breaks down.

> 15% Attendance and participation<br>
> 15% Blog contributions (computing stories) and presentations<br>
> 15% Project updates<br>
> 15% Coding homework<br>
> 40% Final project
 
*We will make use of the “low pass” option for grading.*

Participation includes contributing to and occasionally leading class discussions, offering reflections on how course topics can impact journalistic practice, and helping steer the course into topics that are "breaking" in the tech world.

### Python and Jupyter

Python is a programming language created by a guy, [Guido van Rossum](https://en.wikipedia.org/wiki/Guido_van_Rossum). Van Rossum began work on Python in the late 1980s, and version 1.0 was released in 1994. Python now has a considerable development community and you can find important resources at the [Python web site.](https://www.python.org/) According to that site, Python is "a high-level general-purpose programming language that can be applied to many different classes of problems."

Those problems include string manipulation (looking at the words or sentences in a document, say). Python is conversant in network protocols, which means you can use it to access web sites and services. This will help with web scraping or pulling data from Twitter. There are add-ons contributed by the community that let you make wonderful maps and data visualizations, perform analysis on tabular data (but not in a wonky Excel fashion), and access data stored in a variety of different databases.

In the late 1990s van Rossum wrote a proposal entitled ["Computer Programming for Everybody"](https://www.python.org/doc/essays/cp4e/). To give you a sense of van Rossum as a designer of technology, consider this passage.

>In the dark ages, only those with power or great wealth (and selected experts) possessed reading and writing skills or the ability to acquire them. It can be argued that literacy of the general population (while still not 100%), together with the invention of printing technology, has been one of the most emancipatory forces of modern history.
<br><br>
We have only recently entered the information age, and it is expected that computer and communication technology will soon replace printing as the dominant form of information distribution technology. About half of all US households already own at least one personal computer, and this number is still growing.
<br><br>
However, while many people nowadays use a computer, few of them are computer programmers. Non-programmers aren't really "empowered" in how they can use their computer: they are confined to using applications in ways that "programmers" have determined for them. One doesn't need to be a visionary to see the limitations here.

Later he envisions a world with millions or even billions of computer programmers at various levels of proficiency. His is a world where people are not trained by expert-created platforms, but instead have sufficient facility with computation to help shape the software systems around them.

In the rest of this Jupyter notebook, we introduce Python as a language and prepare you for its basic "syntax" — as a language, what are the nouns and verbs and what grammar glues them together? We will also introduce you to the Jupyter notebook itself.

Jupyter, by the way, comes from the original core languages that the notebook supported: Julia, Python and R. You might have heard about Python and R, but probably not Julia. In fact, new languages are being created all the time, often tailored to particular kinds of problems. Python is a bit of a generalist, while R is great for statistical computations. [Here is a very long list](https://en.wikipedia.org/wiki/List_of_programming_languages) of programming languages.

But our choice is made — Python. Let's have a look!

### Introduction

So, to begin. The notebook we are using is made up of two kinds of "cells". One holds text and one holds code. The genius of this system is that it supports a kind of "literate" programming, building an interesting form of narrative -- observations based on code output, graphics and data visualizations, as well as straight story telling amplified by computation.

**This text we have been reading is in a text cell -- it is written in "Markdown,"** a kind of pre-language for creating HTML. You can double-click on this "cell" to see the raw Markdown, and then shift-enter to render it as HTML. Notice that you can still recognize lists and emphasized text from the Markdown additions, and that's the other point of this. Your documents, while written in plain text, make use of typographical conventions that make the document's highlighting understandable even without translation to HTML. That's a good trick! 

You can find [the Markdown description here](http://daringfireball.net/projects/markdown/). To warm up, please go through the [Markdown Tutorial](http://markdowntutorial.com). Then create a new text cell and write a brief summary of a story you've read recently or a project you're dying to work on... in Markdown.



**1. Computing with objects**

Again, your notebook knows two kinds of cells and we will spend our time with Markdown and Python. The cell below this is a "code" cell — it contains simple Python instructions or "expressions." You "execute" the code in the cell by simply clicking in the cell and then pressing the "shift" and "enter" keys at the same time. 

In [None]:
5+30

You can also assign "variables" — that is, we take the result of some expression or computation on the righthand side of the "equals" sign and let the name on the lefthand side refer to it. Here, "p" is associated with the sum of 5 and 30 and wherever we refer to p, that value of 35 is substituted.

In [None]:
p = 5+30
12+p

Working with Python is about creating and evolving "software objects". For example, the number 35 is an object that, like objects in the real world, has things you can do with it (add it to or multiply it by another number, say) and various properties (for example, 35 is smaller than 38). Python's creators designed a series of powerful objects that will help us do a lot of work, and, importantly, they left open a backdoor so you can make new kinds of objects. Why might we do that?

Community members have created objects to work with images and sound, to manipulate tabular data and not just single values like 35, to make requests for data across the web, or to suck the data out of PDF files. All of this will become second nature. But for now, the important thing is that **Python is an object-oriented language**, meaning that software objects are used to organize data and computations. 

You can get the type or "class" of any object by asking with the "function" `type()`. A function is a series of Python commands that are executed based on some input you provide. `type()` takes an object as input and then returns a short description of the kind of object it is. If there's an object type that you don't understand, there is plenty of online documentation to help you. The [docs.python.org](https://docs.python.org/3/tutorial/introduction.html) site has a nice introduction to the simple data types that come "built-in" with Python.

Here we execute `type()` for the number 35.

In [None]:
type(35)

In the output, `int` stands for "integer" which we (hopefully) remember from grade school as numbers like 1,2,3 and -10,-11,-12. 

Before we explain what functions like `type()` are formally and how you (yes you!) write them to perform actions, let's look at some other built-in data types. There are objects to represent "real" numbers, strings of characters and even objects that contain other objects, perhaps organizing them into a list.

In [None]:
type(5.0/30.0 + 2.3)

Wait, "float"? What's that? Hmm. 

Lucky thing Python even knows about more elaborate objects like YouTube videos. But we're getting ahead of ourselves. The type "float" represents a "floating point number" which is a computer representation of numbers that have a decimal point. 

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('PZRI1IfStY0')

As we think about the kinds of data we come across everyday browsing the web, certainly numbers are important. But so too are sequences of characters or "strings". These might represent people's names or addresses, for example. We create a string in Python by surrounding a series of characters with quotation marks.

In [None]:
type("romance, human interest, humor and fascinating revelations")

We can again introduce variables to store this data descriptively, and work with the names as easily as we would the underlying data.

In [None]:
p = "You can find truth there if you know how to get at it, and "
p + "romance, human interest, humor and fascinating revelations"

This is a nice example of computations changing depending on the type of the objects involved. Add two numbers and you get their sum. Add two strings and you get a concatenation. What about multiplication?

In [None]:
"fascinating revelations "*10

**Note on quotes**: *You can create a string by surrounding it with double quotes, single quotes or even triple single or double quotes. Why so many choices? So "Trump" and 'Trump' represent the same string as does """Trump""". Look up (AKA Google) why we might need triple quotes!*
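To make the note concrete, here is a small sketch showing that the three spellings produce the same string, and one thing the choice of quote buys us. The variable names and the two example sentences are ours, made up for illustration. Triple quotes are still left for you to look up.

```python
# Three ways to write the same string
a = "Trump"
b = 'Trump'
c = """Trump"""
print(a == b == c)

# Using one kind of quote on the outside lets the other kind appear inside
headline = "Pulitzer's view of data"
quoted = 'He called them "fascinating revelations"'
print(headline)
print(quoted)
```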

We said that objects are the way Python organizes its data and computations. Much of what we do in a Python task is make and evolve objects. **What kinds of things might we want to do with strings, for example? What computations make sense? Open a new cell in markdown and write a few ideas.**

**2. Methods**

To access the data and computations (they're called "methods") unique to a particular object, we use so-called "dot" or "." notation. The methods provided by Python for strings, say, were chosen because the operations have proven useful in working with data or in completing general programming tasks — in short, they are used often and so we want to make sure they are easy to execute on the object. 

Here we use the methods `upper()` and `lower()` to, well, change the case of the string to all uppercase or all lowercase.

In [None]:
bernat = "romance, human interest, humor and fascinating revelations"

bernat.upper()

In [None]:
p.lower()

Why would we ever use this (aside from needing to yell in a tweet)? In addition to case changes, we can count the number of times certain patterns occur in a string or find where the pattern starts. Here we count the number of "i"'s.

In [None]:
p.count("i")
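Counting is shown above. For finding where a pattern starts, a minimal sketch might use the string method `find()`. We use a separate variable, `quote`, so that `p` is left alone. The name is ours and is not used elsewhere in the notebook.

```python
quote = "romance, human interest, humor and fascinating revelations"

# find() returns the position where the pattern first starts (counting from 0)
start = quote.find("human")
print(start)

# find() returns -1 when the pattern does not appear at all
missing = quote.find("data")
print(missing)
```

Positions count from zero, so the first character of the string sits at position 0.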

And here we take our original string and replace all "t"'s with "g"'s. Again, why might this come in handy?

In [None]:
p.replace("t", "g")

Here's a small aside about the notebook. Jupyter has been "printing" out the result of the last computation in the cell. So `p.replace("t", "g")` performed a computation and the result of that operation was printed below the cell. If we want to see the results of other computations in the same cell, we need to call the `print()` function, as we do below.

In [None]:
p = "romance, human interest, humor and fascinating revelations"

print(p.upper())
print(p.lower())

print(p.count("i"))
print(p.replace("t", "g"))

We can also save the result of the computation in another variable for use later.

In [None]:
p = "romance, human interest, humor and fascinating revelations"
rant = p.upper()

rant*10

Notice that when we are taking action like translating something to uppercase or counting the number of "i"'s in the string, we end the method with parentheses. Same is true when we ask for an object's `type()` or `print()` something to the notebook. Think back to your algebra when you were introduced to functions — maybe `y = f(x)` on a graphing calculator. It's the same concept here. Ah but sometimes functions require "arguments" in the parentheses to specify what we want done (like when we replaced the "t"'s with "g"'s) and sometimes they do not (like when we turned the string to upper or lowercase).
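A short sketch of the point about parentheses and arguments, using a throwaway variable `phrase` (our name, chosen so we don't disturb `p`). It also shows what happens if the parentheses are left off.

```python
phrase = "fascinating revelations"

# No arguments: the empty parentheses still tell Python to run the method
shout = phrase.upper()
print(shout)

# Two arguments: what to look for, and what to put in its place
swapped = phrase.replace("t", "g")
print(swapped)

# Without parentheses nothing is run; we just get the method itself back
method = phrase.upper
print(callable(method))
```

Forgetting the parentheses is a common slip. Python does not complain, it simply hands back the method rather than its result.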

Finally, methods can be (and likely will be) unique to the kind of object we are dealing with. This will toss up an error because it's not clear how one turns a number into uppercase.

In [None]:
p = 40
p.upper()

Python has a simple help facility to let you see what kinds of things you can do to an object and what kinds of data it has. `help()` is another function, by the way. (This means we've seen two kinds of functions — `help()` and `type()` and `print()` are so-called "globals" that can be applied widely, whereas `upper()` and `count()` are associated with specific object types and are called with the dot notation.)

In [None]:
p = "fascinating revelations"
help(type(p))

In [None]:
p = 1.5
help(type(p))

Here you see all the things you can do to a float. Like, say, turn it into the ratio of two integers...

In [None]:
p.as_integer_ratio()

Before we leave this introduction, just a comment on how you can extend the capabilities of Python. It knows about numbers and strings and a lot of different kinds of "built-in" objects. But sometimes you want to work with other objects not considered by the language's designers. Here we "import" functionality from other packages or modules contributed by community members. In the case below, we create an object representing a YouTubeVideo and play it. Be warned! This one is not as exciting as floating point numbers. It's about Jupyter :)

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('GMKZD1Ohlzk')

I should add that the Jupyter notebook is quite a thing on its own. You can publish it as a document, you can send it around for others to use. Google offers the notebook as a kind of Google Doc that lets you run Python in their cloud and even share notebooks. 

The notebook itself is also capable of "magic," allowing us to tell the notebook to interpret the code in a cell as Python (default) or R or HTML or even UNIX. Here's the HTML code for embedding one of President Biden's Tweets, taken directly from Twitter.

Here we use `%%HTML` to tell Jupyter that the code that follows is HTML and to render it as such in the browser. The result is an embedded Tweet.

In [None]:
%%HTML
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">It’s been one year since George Floyd was murdered. In that time, George’s family has shown extraordinary courage. Last month’s conviction was a step towards justice – but we cannot stop there. <br><br>We face an inflection point. We have to act.</p>&mdash; President Biden (@POTUS) <a href="https://twitter.com/POTUS/status/1397237895140888582?ref_src=twsrc%5Etfw">May 25, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

**3. Objects in action**

We are going to use three examples from the [Documenting COVID-19](https://documentingcovid19.io/) project here at CJS. It is a repository of FOIA requests made to county, state and federal health departments. When making records requests, the results can be uneven. 
<br><br>

<img src="https://github.com/cocteau/computing2021/raw/main/images/Screen%20Shot%202020-11-09%20at%209.22.36%20AM.png" width=500>

<br><br>

Example 1

We will look at three requests and see how we might think about the output computationally. Let's start with a story that involves farmworkers in California and the rollout of the COVID vaccines. [Here is the original FOIA request](https://docs.google.com/document/d/1nJ1yuCemkh655cZE7_Pr-kb2DOi21AjLWoXJPzQfO58/edit?usp=sharing) and [here is one file received as a response](https://drive.google.com/file/d/1fZZqH8a9-U-ETyHsVZIMmmhN5BJuCbfB/view?usp=sharing). Much of the correspondence returned in this file is with the head of the Riverside University Health Services.

The team published a story with [CalMatters](https://calmatters.org/economy/workplace/2021/05/growers-vaccinate-farmworkers/) describing how growers made use of alternative services and sidestepped county and state programs. Have a look at the FOIA documents and think about how you interact with them. What kinds of "computation" can we perform? What are the technical means by which we find our story?
<br><br>

<img src="https://i2.wp.com/calmatters.org/wp-content/uploads/2021/05/Californian_farmworkervaccine_022521_02.jpg?w=2000&ssl=1" width=500>



One thing you will notice is that the document is just an image. Yes, it's a PDF, but it's essentially a sequence of images. This means we need to perform some kind of optical character recognition (OCR) to translate patterns of pixels into letters and numbers and punctuation. It can be an easy or a hard problem depending on the resolution of the image and other issues. As an example of how effective OCR is, we can look at [another set of Riverside documents.](https://documentingcovid19.io/embed/233)

Use the search tool to look for various terms and you will see the underlying OCR. How well did it do?

It's important to keep in mind what we are looking at with these files. These are essentially images representing printouts of the requested emails. In the second group, we used OCR to recover some of the text. 

Finally, a number of emails were shared with the team through sources as opposed to FOIA requests. [Here is an example of a thread shared by Blaine Carian, a business owner in Coachella](https://drive.google.com/file/d/1CGcfbX0-D9S-5XNg0I7V2TBG44gSofh7/view?usp=sharing). 

Example 2

The next story we'll look at comes from Michigan and was published in [the Detroit Free Press](https://www.freep.com/story/news/local/michigan/2021/04/18/michigan-british-covid-19-variant-coronavirus/7220315002/). It has to do with Michigan and the UK variant of COVID. The records request sought emails containing terms like "outbreak", "variant" and "Bellamy" (a correctional facility that was part of the early spread of the variant). You can see the Washtenaw response [here](https://drive.google.com/file/d/1sp6ns0MGh40D093MtFbIcxtrq2O-C1Ab/view?usp=sharing). What sort of "affordances" does this PDF provide for finding a story through computational means? (FWIW page 311 deals with the county's decision not to enforce a public health order on the University of Michigan campus during a B117 variant outbreak in late January because of a threat of litigation. "Probably the most meaningful thread," said Derek of Documenting COVID-19.)

In discussing this story, Derek mentioned searching the corpus for the word "attorney" to find places where the county might be thinking about lawsuits over lockdown orders.
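Here is a rough sketch of what that kind of keyword search could look like in Python, assuming we already had the text of each page as a string. The three "pages" below are invented placeholders, not text from the Washtenaw records, and this is only one simple way to do it.

```python
# Placeholder text standing in for the extracted text of three PDF pages.
# These sentences are invented for illustration; they are not from the records.
pages = [
    "Please forward the outbreak numbers to the health officer.",
    "Our Attorney advised waiting before issuing the order.",
    "The variant was confirmed by the state lab on Friday.",
]

term = "attorney"

# Collect the page numbers (counting from 1) whose text mentions the term.
# Lowercasing each page means "Attorney" and "attorney" both match.
hits = []
for number, text in enumerate(pages, start=1):
    if term in text.lower():
        hits.append(number)

print(hits)
```

Notice that `lower()` earns its keep here: without it, a capitalized "Attorney" would slip past the search.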

<img src="https://www.gannett-cdn.com/presto/2020/10/20/PDTF/fb3409f5-3869-4d3b-a5bb-1644760d0ff8-102020_University_of_Michig_6.jpg?width=1320&height=960&fit=crop&format=pjpg&auto=webp" width=500>




Example 3

The next story appeared in [the Food & Environment Reporting Network](https://thefern.org/2020/12/documents-show-scope-of-covid-19-in-north-carolina-meat-industry/) and is about the size of the COVID outbreak in the meat industry. 

>At the height of the first wave of the Covid-19 pandemic, the number of positive cases at 10 North Carolina meatpacking plants was 75 percent higher than reported publicly, internal health department records reveal, showing the huge gulf between what was known by officials and the public about the scope of workplace infections.

The FOIA request involved terms like "farm", "H-2A" and "migrant", as well as employers like "Butterball". Finally, there had been a publicized outbreak at "Sleepy Creek" drive in "Harrells", so those names were among the search terms too. The results of the FOIA [look like this.](https://drive.google.com/drive/folders/1MEpo3C2d9HdacIiXtTHJp4uQZ5gYqeSn?usp=sharing) Each folder is named after a requested keyword, and the contents are .eml files, each representing an email message.

You can read about the EML file format [here](https://docs.fileformat.com/email/eml/). I don't think this is a particularly good source, but it does the job. The point is that what we have are not printouts of emails or scanned printouts of emails, but the actual email files themselves. While that might sound horrifying, we can use Python as an email program, read the email in and then, well, compute with it.
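To make that concrete, here is a small sketch using the `email` module that ships with Python. The message is made up, written in the headers-then-blank-line-then-body layout that an EML file uses. The name `example` and the addresses are placeholders. This is one way to read such a message, not the only one.

```python
from email import message_from_string
from email.policy import default

# A made-up message in EML form: header lines, a blank line, then the body.
raw = """\
From: alice@example.org
To: bob@example.org
Subject: Test message
Date: Fri, 04 Jun 2021 09:30:00 -0400

This is the body of the email.
"""

# Parse the text into a message object we can ask questions of.
example = message_from_string(raw, policy=default)

print(example["Subject"])        # the subject header
print(example["Date"].datetime)  # the date, parsed into a Python datetime
print(example.get_content())     # the plain-text body
```

Once the message is an object, the date is something we can sort and count by, and the body is a string we can search.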

The first thing we need to do is make a shortcut to this folder on your own Google Drive. To do that, go [here](https://drive.google.com/drive/folders/1T0E9OP0cF4VfyV-j9mTiO5L8N79oJLz0?usp=sharing), highlight the Bladen County folder and select "add shortcut".

<img src="https://github.com/cocteau/computing2021/raw/main/images/Screen%20Shot%202021-05-25%20at%204.42.37%20PM.png" width=500>

OK. Now we can look at the files through Python. We "mount" our Google drive so that we can see the files. 


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Next, we are going to give Python some special powers. We are going to install a new *package* that will help us work with email messages. This will add new functionality to Python. Packages are often contributed by hardworking community members who are devoted to making a particular task easier, or making a name for themselves. Either way, we'll be using a number of packages that don't come with Python, but are contributed by others. 

We use the Python package installer, `pip`, for this.

In [None]:
!pip install mail-parser

We now `import` from this package a function that will let us read in an EML file.

In [None]:
from mailparser import parse_from_file


filename = "/content/drive/My Drive/Bladen County, North Carolina/sleepycreek/054LkYJI022594.eml"
msg = parse_from_file(filename)

Here we read in one file. It is stored as a special "object". As we have seen, software objects are like objects in the physical world, containing data and performing operations. What data might we associate with an email message? What operations might we perform?

In [None]:
type(msg)


Here's the subject line...

In [None]:
msg.subject

... and the date... 

In [None]:
msg.date

... and the body.

In [None]:
print(msg.body)

This message has an attachment and we can look at it...

In [None]:
msg.attachments

... and then write it out and download it.

In [None]:
msg.write_attachments(".")

In [None]:
from google.colab import files

files.download('COVIDdailyupdateJun4.pptx')
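If you are curious what an attachment is underneath, here is a sketch that uses Python's built-in `email` module rather than mail-parser. We build a made-up message with a tiny fake attachment and then walk over its attachments. The file name `update.bin` and its contents are invented for illustration.

```python
from email.message import EmailMessage

# Build a made-up message with one small binary attachment.
note = EmailMessage()
note["Subject"] = "Daily update"
note.set_content("Slides attached.")
note.add_attachment(
    b"fake slide bytes",
    maintype="application",
    subtype="octet-stream",
    filename="update.bin",
)

# Each attachment is itself a message part with a file name and content.
found = []
for part in note.iter_attachments():
    data = part.get_content()  # bytes for a binary attachment
    found.append((part.get_filename(), len(data)))
    print(part.get_filename(), len(data), "bytes")
```

An attachment is just another piece of data riding inside the message, so we can list attachments, measure them or save them with a few lines of code.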

The great beauty of this kind of approach is that now we can "iterate" over the emails, extracting data as we go. Consider all the email files in the `farm` folder.

In [None]:
from os import listdir

path = "/content/drive/My Drive/Bladen County, North Carolina/farm/"

filenames = listdir(path)
filenames

Don't worry if this looks strange. The syntax will become clear soon enough. For now, know that we can loop over all the emails and pull out the subject lines or the dates or the attachments.

In [None]:
# Print the file name, date and subject line of the first 50 emails
for name in filenames[:50]:
  msg = parse_from_file(path + name)
  print(name, msg.date, msg.subject)
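The same kind of loop can collect data instead of just printing it. The sketch below is a stand-in under a few assumptions: it writes three made-up messages to a temporary folder so that it runs without Google Drive, and it reads them with Python's built-in `email` module rather than mail-parser. `summarize` is a hypothetical helper. It sorts the messages by date and then counts how many were sent each day.

```python
import tempfile
from collections import Counter
from email import message_from_binary_file
from email.policy import default
from pathlib import Path

# Made-up messages standing in for real .eml files.
samples = {
    "a.eml": "Subject: Farm visit\nDate: Thu, 04 Jun 2020 09:00:00 -0400\n\nFirst.\n",
    "b.eml": "Subject: Re: Farm visit\nDate: Thu, 04 Jun 2020 15:30:00 -0400\n\nSecond.\n",
    "c.eml": "Subject: Daily update\nDate: Fri, 05 Jun 2020 08:15:00 -0400\n\nThird.\n",
}

def summarize(folder):
    """Return (file name, date, subject) for every .eml file in folder, oldest first."""
    rows = []
    for eml in Path(folder).glob("*.eml"):
        with open(eml, "rb") as f:
            m = message_from_binary_file(f, policy=default)
        rows.append((eml.name, m["Date"].datetime, str(m["Subject"])))
    return sorted(rows, key=lambda row: row[1])

# Write the samples to a temporary folder, then read them back in.
with tempfile.TemporaryDirectory() as folder:
    for name, text in samples.items():
        Path(folder, name).write_text(text)
    rows = summarize(folder)

# Count messages per calendar day.
per_day = Counter(date.date().isoformat() for _, date, _ in rows)
print(per_day)
```

A count of emails per day is a simple timeline. A sudden jump in volume can point you to the days worth reading closely.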

Just think of where we can go next! Again, this was a quick tour, meant to give you a brief sense of what we will be doing in this class and of the ways computing lets you go further with your reporting.

Until next time...