![Computational Journalism](https://images.squarespace-cdn.com/content/59c7d7d63e00be8678b32954/1506270218234-8KTU3BWDYNZB5VQ8YQ0F/IMG_20170924_110421.jpg?format=1000w&content-type=image%2Fjpeg
 "Computational Journalism")


# Computational Journalism
## Technology, Media and Democracy


### Background

We'll start with some basic questions. First, the term *computational journalism.* What does it mean for journalism to be computational? Over the last few decades, computers have become part of our everyday lives. They regulate and shape our interactions with the physical and virtual worlds. Organizations increasingly equate (though not without problems) "data release" with transparency. Sensing (sound, light, air quality) is cheap and plentiful, and easily deployed by the general public. Our actions online generate vast quantities of digital data. And, increasingly, computer systems exercise real power in the world through the insertion of machine learning (statistical models, artificial intelligence) alongside or in place of human decision making. In all of this, we can find new ways to ask questions about the world, how it's organized and how it functions. But the keys to this new digitized kingdom are data, code and algorithms. The curiosity, the questioning spirit, you developed last semester in your reporting classes finds an outlet in new and unexpected ways, mediated by data, code and algorithms. Hence, *computational journalism.* It is simply a response to our new condition of living in a computational society.

In this year's edition of the course, we will be focusing on computational tools and techniques that, while not necessarily new, certainly achieved new prominence in the national election in 2016 and beyond. The vast networks of information that are created every day are simply too large for us to examine in their entirety. To get a sense of "what's on," we take feeds from algorithmic recommender systems, we scan trending topics, we focus on information shared with us by our friends or people we trust. Recently, we have seen how these tools and strategies for directing our attention can be hacked. This year, we are going to place special emphasis on understanding machine learning or artificial intelligence and how it impacts journalism — from helping your reporting, to creating new kinds of story forms, to its use in the distribution of journalism.

Our course is part of a city-wide effort to create new technologies, to look for new kinds of stories, that respond to this new societal condition, to these "threats to journalism." There are five major New York City Universities involved: Columbia, Cornell Tech, The New School, Queens College and NYU. We are a mix of engineers, journalists and media studies students. 

From the all-city class syllabus:

>How does the information ecosystem contribute to the health of representative democracies such as the United States? How do we move towards communities that are equipped with the knowledge and deliberation mechanisms required to address the challenges that face us? How does the relationship between technology platforms, media, governments and citizens determine which voices get heard?
<br><br>
These are some of the central questions that will animate Tech, Media & Democracy 2020. This course will bring together students with backgrounds in engineering, computer science, design, journalism and other relevant disciplines to understand:

>1. New investigative tools to build our understanding of technology, media and social issues;
2. The current structures, mechanisms and designs of our information ecosystem platforms and their impact on informed society outcomes;
3. How media, propaganda and misinformation impact knowledge and participation in the functions of democracy, such as elections and the census;
4. How the voices of vulnerable populations of people - such as minorities, immigrants and the poor - fare in our information and media landscape;
5. The decline of local journalism, its effect on the health of communities and what can be done to repair it.

>The city-wide course sessions will feature ten joint group lectures and activities taking place on Monday nights across New York City at the participating universities; a hackathon that will compel teams to develop solutions to the problems posed in the course; and an opportunity to present prototype solutions publicly at the close of the semester. The goal of the course is to bring diverse perspectives to these challenging problems.

### Schedule

We'll be meeting each Monday and Wednesday from 5:30-8:30pm over the next few months. On ten of the Mondays, we'll be meeting at various locations around the city with the Tech, Media and Democracy (TMD) group. On days that we meet for TMD lectures or events, we'll meet as a class from 5:30-7pm and then with the TMD group from 7-8:30pm.

The following is our TMD schedule. On any other Monday or Wednesday not listed here, we'll be meeting at Columbia from 5:30-8:30pm as normal.

Monday, January 27th @ Columbia
* Class from 5:30-7p
* TMD from 7-8:30p - Intro, faculty panel discussion and group exercises

Monday, February 3rd @ The New School
* Class from 5:30-7p
* TMD from 7-8:30p - Complex problems, systems thinking

Monday, February 10th @ Cornell Tech
* Class from 5:30-7p
* TMD from 7-8:30p - Problems of the platforms: challenges to elections

Monday, March 2nd @ Cornell Tech
* Class from 5:30-7p
* TMD from 7-8:30p - Problems of the platforms: policy, regulation and antitrust

Monday, March 9th @ NYU/TNS
* Class from 5:30-7p
* TMD from 7-8:30p - Economic drivers: ad tech, fraud and the news economy

Monday, March 23rd @ NYU/TNS
* Class from 5:30-7p
* TMD from 7-8:30p - Technology drivers: AI, synthetic media

Monday, April 6th @ TNS
* Class from 5:30-7p
* TMD from 7-8:30p - Civic drivers: Civic technology, citizen journalism

Monday, April 20th @ NYU
* Class from 5:30-7p
* TMD from 7-8:30p - Hackathon idea jam and preparation

Saturday, April 25th @ NYU or Columbia 
* TMD Hackathon! Details to come.

Monday, April 27th @ Columbia
* Class from 5:30-7p
* TMD from 7-8:30p - Faculty panel: Reviewing, debating, synthesizing

Monday, May 4th @ Cornell Tech
* Class from 5:30-7p
* TMD from 7-8:30p - Final presentations and celebration

**Note:** Columbia Spring Break is from March 16 to March 20 so we will not have class that week.


### Class Themes

Along the way, we will cover a variety of topics that will help you in your journalistic practice. You will better understand the media "ecology" you interact with daily, and you will learn to look to these systems as a source of stories, and perhaps even as inspiration to build some kind of new platform to support journalism. We will learn a variety of tools, and our primary programming language will be Python. We will talk through how we came to this decision, but for the moment, know that it is a flexible language that lets you easily connect to networks like Twitter, assemble and analyze data from formal databases and the web, and build responsive services based on all these inputs. Over the next few months, we will introduce you to the following technologies and tools:

* Python programming
* Basic data analysis
* Data collection using APIs and [scraping](https://en.wikipedia.org/wiki/Web_scraping)
* Machine Learning: ML to report on, ML to report with and ML as a distribution tool
* Bots (text and voice)
* Regular Expressions and command-line tools
* Data Visualization
* Database technologies

Our class typically attracts people with different skill levels, most having no background in computation, some having recently been introduced to Python, and occasionally one or two who are already proficient in many of the topics we are covering. The course assignments will be structured in ways that everyone has something new to do, with those needing less in the way of a technical introduction focusing on applications to reporting or the work of journalism broadly.

We will use Python from within this "notebook" framework. The notebook is an ideal way to address your journalistic and programming needs. Beyond simply commenting on what
your code is doing, these notebooks are a legitimate authoring system that you will use
to create (and publish) pitches and memos for this class. One of your humble instructors has [lectured on why the Jupyter notebook is ideal for journalists.](https://conferences.oreilly.com/jupyter/jup-ny/public/schedule/detail/70966)

### Why we code

The goal of this course is to introduce computation, broadly defined, as a tool for both finding and telling stories. This means "reporting on" computation and its role in the world, as well as "reporting with" computing tools in pursuit of a story, and any combination of the two.

When teaching computation (or any "technology") as part of a course, people often refer to "literacy" as a goal. For the most part, that term implies "functional literacy" — do you understand how to use something? Can you write a program, say, to assemble a data set from the web? 

Stuart Selber, a professor of English at Penn State, writes about two other facets to being literate. After functional literacy, he defines "critical literacy." Here are characteristics of a critically literate student.

>*Design cultures.* A critically literate student scrutinizes the dominant perspectives that shape computer design cultures and their artifacts.
<br><br>
*Use contexts.* A critically literate student sees use contexts as an inseparable aspect of computers that helps to contextualize and constitute them.
<br><br>
*Institutional forces.* A critically literate student understands the institutional forces that shape computer use.
<br><br>
*Popular representations.* A critically literate student scrutinizes representations of 
computers in the public imagination.

The third kind of literacy is "rhetorical." 

>*Persuasion.* A rhetorically literate student understands that persuasion permeates interface design contexts in both implicit and explicit ways and that it always involves larger structures and forces (e.g., use contexts, ideology).
<br><br>
*Deliberation.* A rhetorically literate student understands that interface design problems are ill-defined problems whose solutions are representational arguments that have been arrived at through various deliberative activities.
<br><br>
*Reflection.* A rhetorically literate student articulates his or her interface design knowledge at a conscious level and subjects their actions and practices to critical assessment.
<br><br>
*Social action.* A rhetorically literate student sees interface design as a form of social versus technical action.

From the standpoint of a journalism student, all of this might best be wrapped up in the following equivalences (shamelessly cribbed from Ian Bogost at Georgia Tech).
<br><br>
<center><b>Digital Technology = Model of the World  = Argument</b></center>
<br><br>
In short, every piece of digital technology embeds within it a model of the world. You might think of this as the dominant "use case" a designer had in mind. The net effect is that some actions are natural, "designed for", easy, while others are hard. And this is the argument. It is the way that technology trains you to adopt its conventions, its embedded model of the world. You are led to do the easy things and avoid the hard things. 

In this class, we will spend a great deal of time learning a programming language, Python. And yes, any given coding language has its own model of the world and makes its own arguments for certain kinds of practices (certain metaphors for actually writing code). But with a coding language comes almost unbounded flexibility to create. Unlike many of the designed systems we interact with, coding gives us the freedom to build, to imagine the world in new ways. 

All of these ideas take on particular resonance with our theme, "Hacking your attention." We are inviting you to not only investigate existing algorithms for computing "trending topics," but also to try out your own ideas about how this should be done. In an age when people are arguing for "algorithmic accountability" and "explainable artificial intelligence", it's the perfect time to consider a reporting practice that investigates by building.

### Instructors

Given our ambitions for this course, we have an additional instructor who will be leading discussions, giving lectures and assignments, and providing assistance with your projects. 

>**Michael Young** *is the Director of Machine Learning Engineering at The New York Times. His teams at The NYTimes are building Machine Learning Infrastructure to help power personalization, recommendations, business-side optimizations as well as new tools for the editors/reporters. This is his second stint at The NYTimes - Michael was the lead Creative Technologist in the NYT R&D Lab from 2006-2010. In between his two NYTimes gigs, he was the CTO of Digg.com and News.me, two personalized news services. This is Michael's fourth year of helping with the Computational Journalism class.*

### Assignments

**Each week**, you will receive notebooks, like this one, that you will work through outside of class. They will usually be due before the next class meeting, but specific deadlines will be given with each. There will be one or two per week and their level of detail will depend on the material. Sometimes they will be more drill-like, and other times they will challenge you to create something new. But don't worry, we do not assume you know anything about Python, in particular, or coding, in general. 

**You may work on your assignment in groups, but you should answer any questions in your own words. No copying. It is important that we see how well you are understanding the material.**

In addition, **each week beginning February 17** you will find a story or some technology (program, platform, web site) that deals with the themes of the class. You will write a summary/critique, and contribute it to the course Tumblr page. To help you, here are the kinds of questions you might address about a story you read.

1. What is the story about? Use no more than two sentences.
2. What drew you to this story, and why does it enhance our class discussion?
3. What data is used in the story, if any? How did the journalist obtain the data?
4. How did the computing help in telling the story? Who performed the computations?
5. Did the journalist "show their work" and could you recreate their results?
6. What non-computing sources were used, and how do they contribute to the story?
7. What would you do to follow up on this story? Where would you go next? 

These writeups are **due by 5pm Monday evenings.**

**The class will culminate in a final project, the largest component of your grade.** You will work in groups of 2-3 students. Your project is meant to be an act of computational journalism. This might mean building and documenting a new data set or computing service, or using computation to probe an existing platform or data set to tell a new story. No matter what path is chosen, we expect a well-written, well-reported story memo that accompanies your analysis or technology development. 

**A significant story pitch describing your project is due Monday, March 30 by 5pm.** This should be of sufficient detail that it’s clear you will have a strong, finished project by the end of the term. By this point you might have started building something, reporting on something, analyzing data, etc. The purpose of this midterm check-in is to avoid end-of-term surprises as data fall through, holes emerge or analyses break down.

**Each Wednesday by 5pm beginning February 17,** students will update a Jupyter notebook corresponding to their final project. Initially, this might consist entirely of text and straight-up reporting, along with questions about a story idea and how to proceed. It might also consist of computations and progress toward a final story memo. We expect just one update per group. And we understand that groups will shift during this period. We just want to see that people are thinking about their projects early.

**Grading**

Grades will be divided between weekly writing assignments (computing or data story
writeups and project updates), weekly coding drills, your final project, and attendance/participation. **We expect you to complete and submit each of these by the deadline.** If you are having trouble keeping up, let us know right away. **We expect you to attend every class, including the TMD joint classes.** Here is how grading breaks down.

> 15% Attendance and participation<br>
> 15% Blog contributions (computing stories) and presentations<br>
> 15% Project updates<br>
> 15% Coding homework<br>
> 40% Final project
 
We will make use of the “low pass” option for grading.

### Python and Jupyter

Python is a programming language created by a guy, [Guido van Rossum](https://en.wikipedia.org/wiki/Guido_van_Rossum). van Rossum began work on Python in the late 1980s and version 1.0 was released in 1994. Python now has a considerable development community and you can find important resources at the [Python web site.](https://www.python.org/) According to that site, Python is "a high-level general-purpose programming language that can be applied to many different classes of problems." 

Those problems include string manipulation: looking at the words or sentences in a document, say. Python is conversant in network protocols, which means you can use it to access web sites and services; this will help with web scraping or pulling data from Twitter. There are add-ons contributed by the community that let you make wonderful maps and data visualizations, perform analysis on tabular data (but not in a wonky Excel fashion), and access data stored in a variety of different databases.

In the late 1990s van Rossum wrote a proposal entitled ["Computer Programming for Everybody"](https://www.python.org/doc/essays/cp4e/). To give you a sense of van Rossum as a designer of technology, consider this passage.

>In the dark ages, only those with power or great wealth (and selected experts) possessed reading and writing skills or the ability to acquire them. It can be argued that literacy of the general population (while still not 100%), together with the invention of printing technology, has been one of the most emancipatory forces of modern history.
<br><br>
We have only recently entered the information age, and it is expected that computer and communication technology will soon replace printing as the dominant form of information distribution technology. About half of all US households already own at least one personal computer, and this number is still growing.
<br><br>
However, while many people nowadays use a computer, few of them are computer programmers. Non-programmers aren't really "empowered" in how they can use their computer: they are confined to using applications in ways that "programmers" have determined for them. One doesn't need to be a visionary to see the limitations here.

Later he envisions a world with millions or even billions of computer programmers at various levels of proficiency. His is a world where people are not trained by expert-created platforms, but instead have sufficient facility with computation to help shape the software systems around them.

In the rest of this Jupyter notebook, we introduce Python as a language and prepare you for its basic "syntax" — as a language, what are the nouns and verbs and what grammar glues them together? We will also introduce you to the Jupyter notebook itself.

Jupyter, by the way, comes from the original core languages that the notebook supported: Julia, Python and R. You might have heard about Python and R, but probably not Julia. In fact, new languages are being created all the time, often tailored to particular kinds of problems. Python is a bit of a generalist, while R is great for statistical computations. [Here is a very long list](https://en.wikipedia.org/wiki/List_of_programming_languages) of programming languages.

But our choice is made — Python. Let's have a look!

### Introduction

We will begin our introduction to Python with some of the most infamous artifacts since the 2016 election — Donald Trump's tweets. Trump's use of Twitter might be a topic for a final project, and this article in the New York Times, [**10 Times Trump Spread Fake News**](https://www.nytimes.com/interactive/2017/business/media/trump-fake-news.html?em_pos=small&emc=edit_tu_20170118&nl=bits&nl_art=1&nlid=16428923&ref=headline&te=1&_r=0&smid=tw-share), suggests broader connections with the themes of our class.

>His sourcing highlights the bounty of misinformation accessible on the web and its power in a deeply divided America — especially when endorsed by someone of Mr. Trump’s influence and visibility.
<br><br>
He offered this explanation for his actions while discussing an altered YouTube video he had tweeted as part of an unsubstantiated claim that a protester at one of his rallies had ties to the Islamic State: “I don’t know what they made up; all I can do is play what’s there,” Mr. Trump said on NBC’s “Meet the Press.”
<br><br>
“All I know is what’s on the internet.”

So, to begin. **This text is written in "Markdown,"** a kind of pre-language for creating HTML. You can double-click on this "cell" to see the raw Markdown, and then shift-enter to render it as HTML. Notice that you can still recognize lists and emphasized text from the Markdown additions, and that's the other point of this. Your documents, while written in plain text, make use of typographical conventions that make the document's highlighting understandable even without translation to HTML. That's a good trick! 

You can find [the Markdown description here.](http://daringfireball.net/projects/markdown/) For Monday, please go through the [Markdown Tutorial](http://markdowntutorial.com). There might be other learning resources that we should share with the class, so let us know if you find something really helpful!

**1. Computing with objects**

Your notebook knows a few kinds of cells and we will spend our time primarily with Markdown and Python. The cell below this is a "code" cell — it contains simple Python instructions or "expressions." You "execute" the code in the cell by simply clicking in the cell and then pressing the "shift" and "enter" keys at the same time. 

In [None]:
5+30

You can also assign "variables" — that is, we take the result of some expression or computation on the righthand side of the "equals" sign and let the name on the lefthand side refer to it. Here, "p" is associated with the sum of 5 and 30 and wherever we refer to p, that value of 35 is substituted.

In [None]:
p = 5+30
12+p

Working with Python is about creating and evolving "software objects". For example, the number 35 is an object that, like objects in the real world, has things you can do with it (add it to or multiply it by another number, say) and various properties (for example, 35 is smaller than 38). Python's creators designed a series of powerful objects that will help us do a lot of work, and, importantly, they left open a backdoor so you can make new kinds of objects. Why might we do that?

Community members have created objects to work with images and sound, to manipulate tabular data and not just single values like 35, to make requests for data across the web, or to suck the data out of PDF files. All of this will become second nature. But for now, the important thing is that **Python is an object-oriented language**, meaning that software objects are used to organize data and computations. 
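As a quick illustration of the idea that an object comes with things you can do with it and properties it has, here is a small sketch using the number 35 from above. The variable name `n` is just an illustrative choice.

```python
n = 35

# Things you can do with the object: arithmetic
print(n + 3)    # 38
print(n * 2)    # 70

# A property of the object: how it compares to another number
print(n < 38)   # True
```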

You can get the type or "class" of any object by asking with the "function" `type()`. A function is a series of Python commands that are executed based on some input you provide. `type()` takes an object as input and then returns a short description of the kind of object it is. If there's an object type that you don't understand, there is plenty of online documentation to help you. The [docs.python.org](https://docs.python.org/3/tutorial/introduction.html) site has a nice introduction to the simple data types that come "built-in" with Python.

Here we execute `type()` for the number 35.

In [None]:
type(35)

In the output, `int` stands for "integer" which we (hopefully) remember from grade school as numbers like 1,2,3 and -10,-11,-12. 

Before we explain what functions like `type()` are formally and how you (yes you!) write them to perform actions, let's look at some other built-in data types. There are objects to represent "real" numbers, strings of characters and even objects that contain other objects, perhaps organizing them into a list.

In [None]:
type(5.0/30.0 + 2.3)

Wait, "float"? What's that? Hmm. 

Lucky thing Python even knows about more elaborate objects like YouTube videos. But we're getting ahead of ourselves. The type "float" represents a "floating point number" which is a computer representation of numbers that have a decimal point. 

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('PZRI1IfStY0')
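Returning to floats for a moment: because a float is the computer's approximation of a number with a decimal point, arithmetic can produce tiny rounding surprises. The sketch below is a standard illustration rather than anything specific to this course; `math.isclose` comes from Python's standard library, and the variable name `total` is our own.

```python
import math

total = 0.1 + 0.2

print(type(total))               # <class 'float'>
print(total == 0.3)              # False: the stored values are approximations
print(math.isclose(total, 0.3))  # True: equal up to a small tolerance
```

This matters in reporting work whenever you compare computed figures for exact equality; comparing within a tolerance is usually the safer habit.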

As we think about the kinds of data we come across everyday browsing the web, certainly numbers are important. But so too are sequences of characters or "strings". These might represent people's names or addresses, for example. We create a string in Python by surrounding a series of characters with quotation marks.

In [None]:
type("Is he really going to make us read Trump's Tweets?")

We can again introduce variables to store this data descriptively, and work with the names as easily as we would the underlying data.

In [None]:
p = "Is he really going to make us read Trump's Tweets?"
p + " Heaven help us."

This is a nice example of computations changing depending on the type of the objects involved. Add two numbers and you get their sum. Add two strings and you get a concatenation. What about multiplication?

In [None]:
"Tweets "*10

**Note on quotes**: *You can create a string by surrounding it with double quotes, single quotes or even triple single or double quotes. Why so many choices? So "Trump" and 'Trump' represent the same string as does """Trump""". Look up (AKA Google) why we might need triple quotes!*
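A quick check, offered as a sketch, that the quoting styles really do produce the same string, and that the quotes are what make something a string in the first place. The variable names are our own; the question of why triple quotes exist is still left to you.

```python
a = "Trump"
b = 'Trump'
c = """Trump"""

print(a == b == c)   # True: the quotes are not part of the string

# Double quotes outside let an apostrophe sit inside, and vice versa
print("Trump's Tweets")
print('He said "Read the Transcript!"')

# Quotes change the type, and with it the meaning of an operation
print(3 * 10)        # 30, a number
print("3" * 10)      # 3333333333, a string repeated ten times
print(int("3") * 10) # 30 again, after converting the string to an integer
```

The last three lines come up often with scraped data, where numbers frequently arrive as strings and need converting before you compute with them.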

We said that objects are the way Python organizes its data and computations. Much of what we do in a Python task is make and evolve objects. **What kinds of things might we want to do with strings, for example? What computations make sense? Open a new cell in markdown and write a few ideas.**

**2. Methods**

To access the data and computations (they're called "methods") unique to a particular object, we use so-called "dot" or "." notation. The methods provided by Python for strings, say, were chosen because the operations have proven useful in working with data or in completing general programming tasks — in short, they are used often and so we want to make sure they are easy to execute on the object. 

Here we use the methods `upper()` and `lower()` to, well, change the case of the string to all uppercase or all lowercase.

In [None]:
p = "Schiff must release the IG report, without changes or tampering, which is said to be yet further exoneration of the Impeachment Hoax. He refuses to give it. Does it link him to Whistleblower? Why is he so adamant?"

p.upper()

In [None]:
p.lower()

Why would we ever use this (aside from needing to yell in a tweet)? In addition to case changes, we can count the number of times certain patterns occur in a string or find where the pattern starts. Here we count the number of lowercase "i"'s.

In [None]:
p.count("i")
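Counting is case-sensitive, and the string method `find()` reports where a pattern first starts. The sketch below uses a short excerpt of the tweet above (stored in a variable we call `s`) so the results are easy to check by eye; positions are counted from 0.

```python
s = "Impeachment Hoax. He refuses to give it."

print(s.count("i"))          # 2: lowercase only ("give", "it")
print(s.count("I"))          # 1: uppercase only ("Impeachment")
print(s.lower().count("i"))  # 3: both, after lowercasing

print(s.find("Hoax"))        # 12: the index where "Hoax" starts
print(s.find("Tweet"))       # -1 means the pattern was not found
```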

And here we take our original string and replace all "t"'s with "g"'s. Again, why might this come in handy?

In [None]:
p.replace("t", "g")

Here's a small aside about the notebook. Jupyter has been "printing" out the result of the last computation in the cell. So `p.replace("t", "g")` performed a computation and the result of that operation was printed below the cell. If we want to see the results of other computations in the same cell, we need to call the `print()` function on each one, as we do below.

In [None]:
p = "Schiff must release the IG report, without changes or tampering, which is said to be yet further exoneration of the Impeachment Hoax. He refuses to give it. Does it link him to Whistleblower? Why is he so adamant?"

print(p.upper())
print(p.lower())

print(p.count("i"))
print(p.replace("t", "g"))

We can also save the result of the computation in another variable for use later.

In [None]:
p = "Schiff must release the IG report, without changes or tampering, which is said to be yet further exoneration of the Impeachment Hoax. He refuses to give it. Does it link him to Whistleblower? Why is he so adamant?"
rant = p.upper()

rant

Notice that when we are taking action like translating something to uppercase or counting the number of "i"'s in the string, we end the method with parentheses. Same is true when we ask for an object's `type()` or `print()` something to the notebook. Think back to your algebra when you were introduced to functions — maybe `y = f(x)` on a graphing calculator. It's the same concept here. Ah but sometimes functions require "arguments" in the parentheses to specify what we want done (like when we replaced the "t"'s with "g"'s) and sometimes they do not (like when we turned the string to upper or lowercase).

Finally, methods can be (and likely will be) unique to the kind of object we are dealing with. This will toss up an error because it's not clear how one turns a number into uppercase.

In [None]:
p = 40
p.upper()
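For reference, the error Python raises in this situation is an `AttributeError`. The sketch below uses `try`/`except`, a construct we have not formally introduced yet, to catch the error instead of stopping, and the built-in `hasattr()` to ask in advance whether an object has a given method.

```python
p = 40

try:
    p.upper()
except AttributeError as err:
    # Integers have no upper() method, so Python raises AttributeError
    print("Python complained:", err)

print(hasattr(p, "upper"))         # False: integers have no upper()
print(hasattr("Tweets", "upper"))  # True: strings do
```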

Python has a simple help facility to let you see what kinds of things you can do to an object and what kinds of data it has. `help()` is another function, by the way. (This means we've seen two kinds of functions: `help()`, `type()` and `print()` are built-in functions that can be applied widely, whereas `upper()` and `count()` are methods associated with specific object types and are called with the dot notation.)

In [None]:
p = "Tweets"
help(type(p))

In [None]:
p = 1.5
help(type(p))

Here you see all the things you can do to a float. Like, say, turn it into the ratio of two integers...

In [None]:
p.as_integer_ratio()
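As a small sketch of what these float methods report, and of a more compact alternative to `help()`: the built-in `dir()` returns the names of an object's attributes, and filtering out names that start with an underscore leaves the everyday methods. The list-building syntax in the last line is something we will cover properly later.

```python
p = 1.5

print(p.as_integer_ratio())   # (3, 2), since 1.5 is 3/2
print(p.is_integer())         # False
print((2.0).is_integer())     # True

# Names of float methods that don't start with an underscore
print([name for name in dir(p) if not name.startswith("_")])
```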

Before we leave this introduction, just a comment on how you can extend the capabilities of Python. It knows about numbers and strings and a lot of different kinds of "built-in" objects. But sometimes you want to work with other objects not considered by the language's designers. Here we "import" functionality from other packages or modules contributed by community members. In the case below, we create an object representing a YouTubeVideo and play it. Be warned! This one is not as exciting as floating point numbers. It's about Jupyter :)

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('GMKZD1Ohlzk')

I should add that the Jupyter notebook is quite a thing on its own. You can publish it as a document, you can send it around for others to use. Google offers the notebook as a kind of Google Doc that lets you run Python in their cloud and even share notebooks. 

The notebook itself is also capable of "magic," allowing us to tell the notebook to interpret the code in a cell as Python (default) or R or HTML or even UNIX. Here's the HTML code for embedding one of Trump's Tweets, taken directly from Twitter.

Here we use the `%%` to tell Jupyter that the code that follows is HTML and to render it as such in the browser. The result is an embedded Tweet. 

In [None]:
%%HTML
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I NEVER told John Bolton that the aid to Ukraine was tied to investigations into Democrats, including the Bidens. In fact, he never complained about this at the time of his very public termination. If John Bolton said this, it was only to sell a book. With that being said, the...</p>&mdash; Donald J. Trump (@realDonaldTrump) <a href="https://twitter.com/realDonaldTrump/status/1221663763138588672?ref_src=twsrc%5Etfw">January 27, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

**3. Putting it to work**

For a final task, let's reexamine strings and look at a few of our President's tweets. Technically, a tweet is a pretty complex object. Here are a few. **What kinds of data do they consist of?** Open a new Markdown window and take some notes. 

In [35]:
%%HTML
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Great Editorial in today’s Wall Street Journal, “And Congress Shall Be King.” Bottom line: “The President becomes a vassal of King Congress. This is another reason for the Senate to repudiate this House Impeachment as its own abuse of power.” A partisan Hoax!</p>&mdash; Donald J. Trump (@realDonaldTrump) <a href="https://twitter.com/realDonaldTrump/status/1221443936952233984?ref_src=twsrc%5Etfw">January 26, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [36]:
%%HTML
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">“Again: Read the Transcript!” Michael Goodwin, New York Post, Sunday.</p>&mdash; Donald J. Trump (@realDonaldTrump) <a href="https://twitter.com/realDonaldTrump/status/1221449400834252800?ref_src=twsrc%5Etfw">January 26, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [37]:
%%HTML
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">After having been exposed as a fraud and corrupt, can anyone, including Sleepyeyes Chuck Todd of Fake <a href="https://twitter.com/NBCNews?ref_src=twsrc%5Etfw">@NBCNews</a>, continue to listen to his con?</p>&mdash; Donald J. Trump (@realDonaldTrump) <a href="https://twitter.com/realDonaldTrump/status/1221454586717839366?ref_src=twsrc%5Etfw">January 26, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

These are just a few Tweets out of oh so many, and their frequency is only increasing. You can access all of Trump's Tweets through the [Trump Twitter Archive](http://www.trumptwitterarchive.com/). Have a look at the site. What kinds of features does it offer for each tweet?

We will work with data from this site today, for your homework, and on Wednesday. For now, we'll use a special version we created that is basically a big text file, one line per tweet from the first 27 days of 2020. The data are sorted from oldest (top of the file) to newest (bottom of the file). [Download it here](https://github.com/computationaljournalism/columbia2020/raw/master/data/trump_2020.txt) and place it in the same folder as this notebook. On a Mac, you can hold down the "Option" key and click the link in the previous sentence to download the file directly.

We then use a function called `open()` to open the file (creating an object that represents a file of data) and then invoke a `read()` method to scan its contents. Over the semester, we'll have a lot to say about how you load data for Python to work with. For now, we are taking the file of Tweets and reading them in as one long text string.

Notice here we are creating a new object `tweetfile` (what is its type?) and then executing a method to `read()` its contents into a string `tweets`.

In [None]:
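# open() returns a file object; read() returns its entire contents as one string
tweetfile = open("trump_2020.txt")
tweets = tweetfile.read()
tweetfile.close()  # release the file now that we have its contents
tweets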

Exhibiting the string in this way highlights the fact that it includes special characters like `\n`. This particular special character means a "newline" — it is what happens when you hit "enter" when you are typing in a Word document. We can have the `\n`'s print as actual newlines by using the command `print()` instead. (This is one of the differences between having the notebook exhibit the result of your last computation and formally calling Python's `print()` command.) Here we will get one line (perhaps wrapped) per Tweet.

In [None]:
print(tweets)
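
To see the open-then-read pattern and the two ways of showing a string in isolation, here is a minimal sketch on a throwaway file. The file name, folder and contents are hypothetical stand-ins, not the real tweet data. It also uses the `with` form, a common alternative to calling `close()` yourself, which closes the file automatically when the indented block ends.

```python
import os
import tempfile

# Write a tiny stand-in file (made-up contents, not real tweets)
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# open() gives a file object; read() returns its contents as one string
with open(path, encoding="utf-8") as f:
    text = f.read()

print(type(text))  # <class 'str'>
print(repr(text))  # like the notebook's display: each newline appears as \n
print(text)        # print() turns each newline into an actual line break
```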

This is a slightly silly way to store these "data". You can think of having all of the president's tweets in one big string with `\n` newlines to separate the tweets. There are better ways to structure this information, and there's more information to be had than just the text of each tweet. As you no doubt listed, there's time and retweet counts and so on. But for now, we'll keep it really simple and store the data as one long string.
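
Even this simple storage supports some counting. The sketch below uses a made-up three-line string as a stand-in for the real file, and it assumes every line, including the last, ends in `\n`, which may not hold for the actual data. It shows that `count()` is case-sensitive and that counting newlines is one rough way to count tweets.

```python
# Hypothetical stand-in for the real file: three made-up "tweets", one per line
sample = "Big Rally tonight!\nThe rally was huge.\nThank you Ohio!\n"

# count() is case-sensitive, so "Rally" is not counted here
print(sample.count("rally"))          # 1

# Uppercasing everything first makes the count ignore case
print(sample.upper().count("RALLY"))  # 2

# Each line ends in a newline here, so counting "\n" counts the lines
print(sample.count("\n"))             # 3
```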

**A. As the object `tweets` is one long string (type `str`), write some code and tell us something about Trump's tweets from 2020 so far — how many times did he mention a hoax?**

In [None]:
# Your code here



We have also prepared [a file with Trump's tweets from 2019 during roughly the same first 27 days of the year](https://github.com/computationaljournalism/columbia2020/raw/master/data/trump_2019.txt). Download it and put it in the same folder as you've stored this notebook.  

**B. Read in the tweets as you did before, but call the string something other than `tweets`. We want you to compare 2019 and 2020. What was the president concerned about in 2019 versus 2020? Note that we are aiming for some big goals with very modest tools, so it's worth also noting what you'd like to do and where the code you've learned doesn't quite get you there yet.**

In [None]:
# Your code here



**C. Find online documentation about "string methods" in Python (version 3) and try something out that we haven't used yet in this notebook.**

In [None]:
# Your code here



**D. Come up with 3 story ideas about Trump's tweets, perhaps comparing 2019 to 2020, perhaps some other topic. Think broadly and not just with respect to the data you've been working with. For each, describe the kind of computation you would need to do ("I need to count how many...") and whether this simple form of storing tweets as one long string is up to the task. For example, this is a limiting way to work with text: what kinds of things would you like to be able to do or "read" from the text that our simple set of commands like `count()` and `upper()` don't do? Also, there is a lot of data about a tweet other than its text, and perhaps your story idea needs some of this other information. Basically, in anticipation of the next lecture, when we start to add more structure to data, we want you to think about what you might need to tell a story.**

Write your ideas here

