<a href="https://colab.research.google.com/github/cocteau/computing2021/blob/main/notebooks/02_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://thumbs.dreamstime.com/b/robot-reading-newspaper-isolated-white-background-robot-reading-newspaper-101645621.jpg" width=500 />

### Case Study - The News as Data

**1. Introduction**

As we have mentioned, in this class our main language for computation will be [Python](http://www.python.org). It was chosen because of its emphasis on **readability and code sharing.** It is also exceedingly popular, with a large user community and an active group of contributors constantly adding new tools in the language. Not surprisingly, there are plenty of online sources to help you on your journey learning the language, from [cheatsheets](https://www.pythoncheatsheet.org/) to [online tutorials](https://www.coursera.org/learn/python?specialization=python). You will quickly find that the web is a great place to find examples of code to do what you need to do. So, suppose you can't remember how to concatenate two strings...

<img src="https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/images/cc.jpeg" width=500 style="border:1px solid black">

A word of caution: Make sure that when you find an answer on the web, it refers to **Python version 3.** There are two popular versions of the language in use, 2 and 3. This is a good example of eventually needing to move on from whatever technological environment you are used to — apps have upgrades, your phone's operating system asks to be upgraded, and programming languages evolve. I mean everything can get better, right? We are learning Python Version 3.

This semester, we will program in Python (write and execute Python expressions) using the Jupyter notebook -- the name "Jupyter" being a mix of [Julia](http://julialang.org/), [Python](https://www.python.org/) and [R](https://www.r-project.org/), the three languages it originally supported. Programming with the notebook is often referred to as **"literate computing"** -- by that we mean that you code a little, have a look, write a little, come up with more ideas, code a little more, write a little more and so on. To support this, as we have seen, Jupyter has two kinds of "cells" that you can either write or program in. Think of it as a modern reporter's notebook.
<br><br>

<img src="https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/images/notebook.jpeg" width=500>
<br>

Your writing is done in **Markdown cells**. Markdown (as opposed to markup) is a language that lets you write in a plain text editor (in the simplest cases Notepad or TextEdit would work fine too), using simple typographical shorthands to **make text bold** or *to put text in italics*. You can also make lists like

+ Bread
+ Milk
+ Dog food
+ Swiffer pads

These Markdown conventions are then translated into HTML — the upshot being that you don't have to know anything about HTML to create documents that look reasonably good, certainly good enough for your reporting notes. 

Double click in this window to see the "raw" Markdown. Notice that you can still recognize lists and emphasized text from the Markdown additions, and that's the other point of this. Your documents, while written in plain text, make use of typographical conventions that make the document's highlighting understandable even without translation to HTML. That's a good trick! 

You can find [the Markdown description here](http://daringfireball.net/projects/markdown/). And hopefully you've been through the [Markdown Tutorial](http://markdowntutorial.com) we started in the last class. 

*To re-render the Markdown in this cell into HTML, click in the cell and hit Shift-Enter to execute the transformation.*

One last note. Many organizations (scientific research labs, journalistic organizations, and so on) are "publishing" dual works — one that goes into an official journal like Science or The New York Times summarizing a set of computations, and then another that is published in the form of a notebook that documents the author's computational work. Here are some examples from science and journalism.

> ["Peeling back the curtain — How the Economist is opening the data behind our reporting."](https://medium.economist.com/peeling-back-the-curtain-487bd3be0c47) In their words "We published these calculations in a Jupyter notebook, a tidy format for breaking scripts into small blocks and annotating them."

>[BuzzFeedNews/everything — An index of all our open-source data, analysis, libraries, tools, and guides.](https://github.com/BuzzFeedNews/everything#data-and-analyses) Data and code for many of their major stories including ["The Graveyard Doesn't Lie"](https://www.buzzfeednews.com/article/peteraldhous/texas-winter-storm-power-outage-death-toll), an "[a]nalysis of excess deaths caused by the February 2021 winter storm and power outages in Texas", and ["How Russia’s Online Trolls Engaged Unsuspecting American Voters — And Sometimes Duped The Media"](https://www.buzzfeednews.com/article/peteraldhous/russia-online-trolls-viral-strategy).

>The New York Times has published analyses as well. As an example, for the article ["What Drives Gun Sales: Terrorism, Obama and Calls for Restrictions"](https://www.nytimes.com/interactive/2015/12/10/us/gun-sales-terrorism-obama-restrictions.html?), they also offered [detailed background computations.](https://github.com/nytimes/gunsales)

>["The Need for Openness in Data Journalism"](http://nbviewer.jupyter.org/github/brianckeegan/Bechdel/blob/master/Bechdel_test.ipynb) by Brian Keegan. This is a little old, but Keegan makes good points about the benefits of working with a notebook.

>["Why Jupyter is data scientists’ computational notebook of choice,"](https://www.nature.com/articles/d41586-018-07196-1) a recent overview piece in Nature that marks the rise of notebooks this way — "One analysis of the code-sharing site GitHub counted more than 2.5 million public Jupyter notebooks in September 2018, up from 200,000 or so in 2015."  

>["The Architecture of Jupyter — Interactive by design."](http://scisoftdays.org/pdf/2016_slides/perez.pdf) Starting on page 23 of this PDF, Fernando Perez, one of the designers of Jupyter, describes how notebooks have been published alongside papers in journals like Nature, Science and Scientific American, to name a few.

You get the idea. There are an increasing number of projects like this — the fact that you can take someone else's notebook and examine the steps they followed to arrive at their conclusions is, ultimately, an important step toward transparency in data or computational journalism. *The notebooks become objects of coordination.*

**2. Case Study**

Today we are going to branch out from our humble start on Tuesday and dig more deeply into how we  build more complex objects into our programming environment and conduct more interesting analyses. In short, we will be learning to code. We will also try to emphasize how you break a problem down into pieces, tasks that can then be realized in code. 

We are going to start with an article that appeared in the Times in April. It is called ["Swelling Anti-Asian Violence: Who Is Being Attacked Where"](https://www.nytimes.com/interactive/2021/04/03/us/anti-asian-attacks.html). Here's their goal.

>The New York Times attempted to capture a sense of the rising tide of anti-Asian bias nationwide. 

*Read the article and in the "text cell" below, map out what you think a reporting strategy for this article might have been.* Create a list of steps you would follow to gather data for their story.

The Times piece makes use of other news articles as sources. In corresponding with the lead reporter, I was told that the team relied on extensive Google News searches with different combinations of keywords and locations, paired with Twitter searches. They followed reporters focused on the issue as well as city hate crime departments. 

In my mind, one effective thing about this article is how it organizes and then displays the primary sources, the original news articles. 

<img src="https://github.com/cocteau/computing2021/raw/main/images/Screen%20Shot%202021-05-27%20at%206.43.26%20AM.png" width=500>

<img src="https://github.com/cocteau/computing2021/raw/main/images/Screen%20Shot%202021-05-27%20at%206.43.53%20AM.png" width=500>


If we were to repeat this analysis, what keywords might we use? Have a read of a few of the articles linked to the NYT piece and identify some words to search for. In addition, find a couple of reporters we might follow on Twitter to keep up with the topic. *Make a list in the text cell below.*

With these in mind, how might we transform a one-time analysis into something that we could refresh over time? Consider similar projects like [The Counted](https://www.theguardian.com/us-news/ng-interactive/2015/jun/01/the-counted-police-killings-us-database) by the Guardian that attempts to create a public record of, in this case, people killed by police.

**An Aside: More on the affordances of data formats** 

The news articles in the NYT piece are listed at the end. Right now, they are in a form where we can click through and read more about an individual incident, returning to the source news agency. But the actual sources, the extra facts they contain, and any clues about other subcategories of aggression or new keywords are hard to identify  by reading articles one at a time. 

This is a common situation. There is data we'd like to have access to, but it is structured in the form of a web page. The data could be quite regular and might even fit into a spreadsheet (meaning each violent incident has the same facts recorded about it like date and age of victims and perhaps the URL of the news story providing the full description). But for now, the data are formatted in HTML as a web document for you to read and interact with. Let's try to grab the data and do something more with it.

If we look at the HTML source for the page we see that the list of incidents is coded in a very clear way. First, let's recall a little about HTML, the Hyper Text Markup Language. Each web page essentially looks like this.
<br><br>
<pre>
&lt;html lang="en"&gt;
  &lt;head&gt;
    &lt;meta name="news_keywords" content="Asian American,Hate crime,Discrimination,Korean American,Chinese American,Japanese American,Vandalism,Assault,US,Coronavirus;COVID-19"/&gt;
    &lt;title&gt;
       Swelling Anti-Asian Violence: Who Is Being Attacked Where - The New York Times
    &lt;/title&gt;
  &lt;/head&gt;
  &lt;body&gt;
     &lt;p&gt; Over the last year, in an unrelenting 
      series of episodes with clear racial animus, people of Asian descent have been pushed,
      beaten, kicked, spit on and called slurs.
      Homes and businesses have been vandalized.
      The violence has known no boundaries,
      spanning generations, income brackets and
      regions.
      &lt;/p&gt;
      &lt;p&gt; The New York Times attempted to
      capture a sense of the rising tide of
      anti-Asian bias nationwide. Using media
      reports from across the country, The Times
      found more than 110 episodes since March 2020
      in which there was clear evidence of
      race-based hate.
      &lt;/p&gt;
...
   &lt;/body&gt;
&lt;/html&gt;
</pre>

HTML defines a series of "tags" that describe elements of a document. Here we have the main content contained in paragraph tags, where `<p>` opens a paragraph and `</p>` closes it. We also see the opening `<html>` tag that indicates what follows is HTML (as opposed to some other format). The `<head>` tag is used to specify so-called meta-data about a page like keywords or the title that is to be displayed in the upper tab of your browser. 

Of course the actual HTML for an NYT story is more complicated than what I've given you here, but the structure is essentially what I've shown -- information structured around tags that describe different parts of the document. We can see the full "source" for this page by choosing View->Developer->View Source in Chrome (other browsers have a similar menu item). Pictorially this means...
<br><br>

<img src="https://github.com/cocteau/computing2021/raw/main/images/Screen%20Shot%202021-05-27%20at%206.44.46%20AM.png" width=400>
<br>

You should then get another tab that shows you the underlying HTML. You can search (Command-F) for the `<head>` tag and the `<body>` tag and some of the text of the story to see that it's all there... plus a lot of other things. Now, look for the listing of incidents in the source. You'll see something like the following.

<img src="https://github.com/cocteau/computing2021/raw/main/images/Screen%20Shot%202021-05-27%20at%207.04.22%20AM.png" width=500>

Notice that each of the incidents is described in a `<div>` tag -- these help the web designer mark out different divisions of the document to be "styled" differently. The styling in this case is in the form of the "class" attribute of the `<div>` tag. Notice that they all have the class `incident_line`. 

With this source, we could imagine a program that "parses" the HTML code and lets you pull out the pieces you want, like the URLs of the stories. Like Tuesday's example of parsing email files, such a program would give you direct access to the components on a web page **as data**. So instead of using Command-F to hunt for certain tags, can we work with the document in a way that's more natural? As luck would have it, there is such a package for Python and it is called [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/). 

So, suppose we want to grab the URLs for the source articles. What do we need to do?

1. Get the HTML page from the New York Times web site
2. Parse it into a Python object that lets us search for specific tags and their content
3. Extract all the URLs.
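
The three steps above can be sketched without touching the network. In the sketch below, a short made-up HTML string stands in for the downloaded page (step 1), and Python's built-in `html.parser` module stands in for the parsing package (step 2) so that the example runs anywhere. The class name `LinkCollector` and the example.com URLs are assumptions for illustration only, and the sketch assumes the `div` tags are not nested inside one another.

```python
from html.parser import HTMLParser

# Step 1 (stand-in): pretend this string is what the web server sent back
page = """
<div class="incident_line"><a href="https://example.com/story-one">Story one</a></div>
<div class="incident_line"><a href="https://example.com/story-two">Story two</a></div>
<div class="footer"><a href="https://example.com/about">About</a></div>
"""

class LinkCollector(HTMLParser):
    """Hypothetical helper: collects links found inside incident_line divs."""

    def __init__(self):
        super().__init__()
        self.inside = False   # are we inside a div of class incident_line?
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)   # attrs arrives as a list of (name, value) pairs
        if tag == "div":
            self.inside = attrs.get("class") == "incident_line"
        elif tag == "a" and self.inside and "href" in attrs:
            self.urls.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

# Step 2: parse the page
collector = LinkCollector()
collector.feed(page)

# Step 3: read off the URLs
print(collector.urls)
```

Only the two links inside `incident_line` divs are kept; the link in the footer is ignored. A dedicated package hides this bookkeeping, which is why we reach for one below.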

Let's give it a try! First grab the page. Python can act like your browser making requests to a web server. In this case we want the page

<pre>https://www.nytimes.com/interactive/2021/04/03/us/anti-asian-attacks.html</pre>

We will use the `requests` package in Python that lets us make requests for resources on the web. We just need one function from that package called `get()` to go `get()` the HTML for our URL. We will cover this in excruciating detail in a later lecture.



In [None]:
from requests import get

url = "https://www.nytimes.com/interactive/2021/04/03/us/anti-asian-attacks.html"
response = get(url)

Now `response` holds what the New York Times web server returned in response to our request. It's an object...


In [None]:
type(response)

... and you can look at things like the "status code". For example, 404 is the code for "File Not Found".

In [None]:
response.status_code

Here we have 200 for a successful transaction. You can see a complete list [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status). The source of our web page is stored in `content` as one long sequence of raw bytes; `response.text` holds the same source decoded into a string (we met strings on Tuesday).

In [None]:
response.content

We will use this to create a `BeautifulSoup` object that will let us work more effectively with the article.

In [None]:
from bs4 import BeautifulSoup
# name the parser explicitly so BeautifulSoup does not have to guess (and warn)
nyt = BeautifulSoup(response.content, "html.parser")

type(nyt)

One of the things we can do with this object is `find_all()` of the tags with a given class. In this case we want `div` tags that are of class `incident_line`. The result of this function call is a `list` of the tags we're after. A list is just a collection of objects that you access sequentially. We'll see a lot about lists later today. For the moment know that lists store a sequence of objects.

In [None]:
articles = nyt.find_all("div","incident_line")

In [None]:
len(articles)

We can now access each `div` tag individually...

In [None]:
articles[3]

In [None]:
print(articles[3].prettify())

Notice that each `div` tag of the kind we want is in fact another object that has more nested HTML tags. This is the brilliance of having an object that represents HTML. We now can operate on it in ways that make sense to the format. Here we might want to look inside the mini-HTML snippets and search for an anchor `a` tag that is the link to a source article, and then extract the `href` attribute that is the URL of the article you go to when you click the link.

Below we use a loop like we did in our last session, going over each `div` tag we found and printing out the links. Don't worry about the syntax of this code, we will come back to it in great detail. This is more inspirational... showing you where we are headed.

In [None]:
for article in articles:
  anchor = article.find("a")
  if anchor: print(anchor["href"])

What do you notice from this list? 

To sum up, we used an object that understands the structure of HTML to identify pieces of the document we're interested in. We then pulled them out as a list or some other object and can do data analysis. What were the common sources for stories on anti-Asian bias and aggression? Can we use our `requests` package again to `get()` the actual articles (as we did the original NYT story) and extract other features from the text?
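
One way to start on the question of common sources is to reduce each URL to its domain and count how often each domain appears. Here is a minimal sketch, assuming the links have already been collected into a list. The `links` below are made-up placeholders, not the actual sources from the NYT piece.

```python
from collections import Counter
from urllib.parse import urlparse

# Placeholder URLs standing in for the links printed by the loop above
links = [
    "https://www.example-news.com/2021/03/story-a",
    "https://www.example-news.com/2021/04/story-b",
    "https://local.example.org/news/story-c",
]

# urlparse splits a URL into its parts; netloc is the domain
domains = [urlparse(link).netloc for link in links]

# Counter tallies how many times each domain occurs
counts = Counter(domains)
print(counts.most_common())
```

With the real list of links in place of the placeholders, the most frequent domains would rise to the top of the output.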

**Another aside: APIs**

Now, suppose we want to make this project "live" and keep adding articles. While it won't be the same in terms of rigor, we could use a source for news articles and scan them for the keywords we identified. The News API is one good example of a live wire of news stories. 

API stands for Application Programming Interface. It is a way for us to ask for **data** from a server, rather than a web page. Notice that in the NYT example, we had to parse the HTML to find the links we wanted. With an API, we can ask for the information we want and have it returned in a format better suited for the data we requested.

Have a look at the [News API website](https://newsapi.org/). You'll see on the home page an example of how the service works. You create a special URL that specifies the data you want. The server looks for your data and returns it as a JSON string. JSON stands for JavaScript Object Notation, and it turns out that JSON and Python play very well together. The API returns essentially a list of articles, each article specifying its publication, author, title and publication date, among other things. We'll come back to JSON and these data structures shortly.

In creating your URL, one of the components is a query string. It is the keyword (or words) you would like to search their corpus for. Another component of the URL is a so-called API key. Go ahead and register for a key on the News API site -- it only takes a second. 

The API key is how the server you're asking data from knows who you are. Among other things, it's used to enforce limits (only so many requests in one day, at least some amount of time between requests) to keep their service working smoothly so that other users can make requests at the same time. 

Below I use string concatenation to put together the URL I want. If you read the News API documentation, you'll find lots of things you can specify in this URL to affect the data that comes back. This one just asks for the last month of data and the keyword "anti-asian". Replace my key with yours and let's see what we get.

In [None]:
key = "9b49997ed4274749871f14355ec9cd3f"

query = "anti-asian"
url = "https://newsapi.org/v2/everything?q="+query+"&from=2021-04-27&sortBy=publishedAt&apiKey="+key

url
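
String concatenation works for a single keyword. For a query containing spaces or other special characters, one option is to let Python's built-in `urllib.parse.urlencode` assemble the query string. The sketch below is an illustration, not part of the News API instructions: the two-word query and the `YOUR_API_KEY` placeholder are assumptions, and the parameters are held in a dictionary, a container we will meet shortly.

```python
from urllib.parse import urlencode

# Placeholder values; replace the key with your own
params = {
    "q": "anti-asian hate",
    "from": "2021-04-27",
    "sortBy": "publishedAt",
    "apiKey": "YOUR_API_KEY",
}

# urlencode joins the pairs with & and escapes characters such as spaces
safe_url = "https://newsapi.org/v2/everything?" + urlencode(params)
print(safe_url)
```

The space in the query comes out as a `+`, which is how a space is written inside a URL's query string.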

Now, let's use `get()` from the `requests` package to get our news stories. 

In [None]:
from requests import get

response = get(url)
response.status_code

Perfect! Now, we can get our JSON object into Python like this.

In [None]:
news = response.json()

news
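
To see what a call like `response.json()` is doing, here is a small sketch that converts a JSON string by hand with Python's built-in `json` module. The string `raw` is a made-up, much shortened payload; its headlines are placeholders rather than real API output.

```python
import json

# A made-up JSON string shaped like a small API response
raw = '{"status": "ok", "totalResults": 2, "articles": [{"title": "First headline"}, {"title": "Second headline"}]}'

# json.loads turns the JSON text into Python dictionaries and lists
parsed = json.loads(raw)

print(type(parsed))
print(parsed["totalResults"])
print(parsed["articles"][0]["title"])
```

The curly braces in the JSON text become a Python dictionary and the square brackets become a Python list, which is why the two formats fit together so easily.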

OK let's now take a step back and figure out what all this means and how we got here.

**3. Review: The basic data types**

As we've said, whether we code using a notebook or some other interface, our basic language will be Python. Python is an **object-oriented language**. Software objects are a kind of programming abstraction, a particular way of organizing information and actions. Software objects try to mimic the notion of objects in the physical world — that means they contain properties or **data** and also might have certain operations or **methods** that you can use to transform the object in some way. Take as an example one of our lovely rolling chairs — it has operations (it can support you when you sit, it has a foldable arm for writing, and it can move around the room) as well as "data" (it has a seat height, a desk height, maybe even an RGB value specifying its lovely seat color).

**Python has a series of built-in types of objects, meaning certain types of information that are so basic that they are needed by just about every programming exercise you'll attempt**. Which have we covered so far? As a kind of quiz, write some code that creates a variable having each of the data types we've seen so far using information from the first article among the NYT's sources ["Asian Americans In Chicago Feel The Bite Of Prejudice During The Spread Of The Coronavirus"](https://www.wbez.org/stories/asian-americans-in-chicago-feel-the-bite-of-prejudice-during-the-spread-of-the-coronavirus/687b0f4e-fed8-4fca-90c4-b7c3c495b4cf)

<img src="https://github.com/cocteau/computing2021/raw/main/images/Screen%20Shot%202021-05-27%20at%2012.34.38%20PM.png" width=400>

In [None]:
# put your code here

date = "March 31, 2020"
location = "Chicago"
age = 31
type_attack = "verbal"


**4. Combining data**

So far, we have created variables that contain a single value — a number or a string. With the article above, we see that we might want to represent an object in the world, say an article or a tweet or an Instagram post, as a collection of simple data types. Python provides a few built-in objects that are containers. The first we'll look at is a **dictionary.** 

Think back for a moment to how you used a literal dictionary (Websters?). Word definitions  were referenced by words. Finding a definition meant specifying the word we were after. This is the idea behind a dictionary in Python — store data (**"values"**) according to a name (a word or some kind of **"key"**). The result is a collection of key-value pairs.

For example, yesterday the News API had 165 articles including the term "anti-asian". We can store these facts in a variable called "activity", say.

In [None]:
activity = {"date":"May 26, 2021", "count":165}

# have a look at what we built
activity

The curly braces (not parentheses!) mean we are creating a dictionary, a set of key-value pairs. The names we give to the data (the word, if you will, we associate with the dictionary entry) are "date" and "count" (on the left of the colons) and the values, the data, are on the right. 

If we want to lookup or access data, we provide a name in square brackets.

In [None]:
activity["count"]

In [None]:
# extract the date

activity["date"]

Now, recreate "activity" but add the fact that yesterday was a Wednesday. 

In [None]:
# your code here

activity = {"date":"May 26, 2021", "count":165, "day_of_week":"Wednesday"}
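
Recreating the dictionary from scratch works. As an alternative sketch, a new key-value pair can also be added to an existing dictionary by assigning to a key it does not have yet.

```python
activity = {"date": "May 26, 2021", "count": 165}

# assigning to a key that does not exist yet adds a new key-value pair
activity["day_of_week"] = "Wednesday"

# "in" checks whether a key is present
print("day_of_week" in activity)
print(activity)
```

Either route ends with the same three entries in `activity`.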


**A common mistake for people learning Python is to confuse the parentheses we use to indicate a function (or taking action) like `p.count('i')` with the square brackets we will use to extract or subset data. They look similar so be careful! 😉**
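
Here is a short side-by-side sketch of the two kinds of brackets. The string `p` and the dictionary `facts` are made up for the illustration.

```python
p = "Mississippi"        # a made-up string
facts = {"state": p}     # a made-up dictionary holding that string

# parentheses: take an action, here counting the letter i
n_i = p.count("i")
print(n_i)

# square brackets: look up data stored under a key
looked_up = facts["state"]
print(looked_up)
```

The first line of output is a number computed by the method; the second is simply the value that was stored.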

Now, using the article from WBEZ we presented above, create a dictionary that encapsulates as much of the story as you can. Call the dictionary `story`. 

In [None]:
# Put your code here -- 

story = {"date":"March 31, 2020","location":"Chicago","age":31,"type_attack":"verbal"}
story

In [None]:
story.keys()

When we made a call to the News API we received an object that looks a lot like a dictionary. Recall we called it `news`.

In [None]:
news.keys()

In [None]:
news["totalResults"]

For another example, below we present a tweet and then create a dictionary version of it called `tweet` -- it  was returned by Twitter's [Application Programming Interface](https://developer.twitter.com/en/docs/api-reference-index). We can explore a dictionary visually by printing it out (too easy), or we can also ask for the kinds of data it contains with a method of a dictionary called `.keys()`.

In [None]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">finally putting in the effort to learn to code python and it has been enjoyable and rewarding thus far</p>&mdash; halleJOKEL (@halleJOKEL) <a href="https://twitter.com/halleJOKEL/status/1086994877496332288?ref_src=twsrc%5Etfw">January 20, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [None]:
tweet = {"created_at": "Sun Jan 20 14:32:28 +0000 2019", "id": 1086994877496332288, "id_str": "1086994877496332288", "full_text": "finally putting in the effort to learn to code python and it has been enjoyable and rewarding thus far", "truncated": False, "display_text_range": [0, 102], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": []}, "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>", "in_reply_to_status_id": None, "in_reply_to_status_id_str": None, "in_reply_to_user_id": None, "in_reply_to_user_id_str": None, "in_reply_to_screen_name": None, "user": {"id": 144210472, "id_str": "144210472", "name": "halleJOKEL", "screen_name": "halleJOKEL", "location": "Raleigh, NC", "description": "sports was a mistake", "url": None, "entities": {"description": {"urls": []}}, "protected": False, "followers_count": 273, "friends_count": 482, "listed_count": 6, "created_at": "Sat May 15 16:24:53 +0000 2010", "favourites_count": 2044, "utc_offset": None, "time_zone": None, "geo_enabled": False, "verified": False, "statuses_count": 5736, "lang": None, "contributors_enabled": False, "is_translator": False, "is_translation_enabled": False, "profile_background_color": "131516", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme14/bg.gif", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme14/bg.gif", "profile_background_tile": True, "profile_image_url": "http://pbs.twimg.com/profile_images/1069453101311172608/s8lDcFQT_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1069453101311172608/s8lDcFQT_normal.jpg", "profile_banner_url": "https://pbs.twimg.com/profile_banners/144210472/1536632996", "profile_image_extensions_alt_text": None, "profile_banner_extensions_alt_text": None, "profile_link_color": "000000", "profile_sidebar_border_color": "EEEEEE", "profile_sidebar_fill_color": "EFEFEF", "profile_text_color": "333333", 
"profile_use_background_image": True, "has_extended_profile": True, "default_profile": False, "default_profile_image": False, "can_media_tag": True, "followed_by": False, "following": False, "follow_request_sent": False, "notifications": False, "translator_type": "none"}, "geo": None, "coordinates": None, "place": None, "contributors": None, "is_quote_status": False, "retweet_count": 0, "favorite_count": 4, "favorited": False, "retweeted": False, "lang": "en"}

In [None]:
tweet.keys()

This tells you there are keys or words that are used to reference data. For example, under `created_at` we have the time the tweet was authored (in GMT). We access the information as we did above, providing the key or word to look up.

In [None]:
tweet["created_at"]


Explore a little and tell me about the kinds of data that are packaged by Twitter with a tweet. If you need some help, consult [Twitter's description of their tweet objects. ](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object)

In [None]:
# explore a little here



There is some notation that we haven't learned yet. When you come across it, take a note and let's talk about it. What did you find?
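
One piece of notation you may have run into is nesting: a value in a dictionary can itself be a dictionary or a list. A minimal sketch, using a made-up record called `post` that is far smaller than the real `tweet`:

```python
# A made-up, much smaller record with the same nesting idea as a tweet
post = {
    "text": "learning python",
    "user": {"screen_name": "example_user", "followers_count": 10},
    "display_text_range": [0, 15],
}

# a dictionary inside a dictionary: chain the square brackets
print(post["user"]["screen_name"])

# a list inside a dictionary: look up the key, then the position
print(post["display_text_range"][1])

# .get() returns None instead of raising an error when a key is missing
print(post.get("place"))
```

Reading the chained brackets left to right mirrors how you would drill down by eye: first the `user` entry, then the `screen_name` within it.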

Finally, let's create a dictionary for one story that's a lot like how the News API outputs its articles.

In [None]:
story = {
    "author":"Tina Moore, Amanda Woods",
    "publishedAt":"2021-05-27T16:38:52Z",
    "title":"Elderly Asian woman punched in the face in Queens, police sources say",
    "url":"https://reason.com/volokh/2021/05/27/censor-of-anti-china-speech-among-us/"
}

Now, from this dictionary pull the `author` and the `url` entries.

In [None]:
# put your code here
print(story["author"])
print(story["url"])

**5. From single variables to lists**

It's one thing to store single values (a single number or a single string or a single dictionary), but as we know, we tend to collect a lot of data on different aspects of a person or thing in the world: we might offer a survey to 100 people that consists of 10 questions, or we might record facts about the last 100 articles containing a given keyword.

A **list** is another built-in data structure used to group information. As its name suggests, it is simply an **ordered collection** of objects. It has a well-defined first entry, a second entry and a last entry. It can hold different kinds of objects in each position. It is constructed using square brackets [ ] (as opposed to the curly braces for a dictionary).

Below we make a list called `dates` that holds the publication times for the latest 5 articles from when I ran the code this afternoon (Z is GMT). 

In [None]:
dates = ['2021-05-27T16:38:52Z','2021-05-27T16:27:54Z','2021-05-27T16:23:37Z','2021-05-27T16:03:39Z','2021-05-27T15:35:39Z']

print("The type of 'dates' is", type(dates), "and its length is", len(dates))

Note that we've been tricky with `print()`. We have several items to print in one line, all separated by commas. Sssssslick! Also, we meet a new "global function" called `len()`. This function returns the number of elements in a list, or its **length.** It is a global function because it can be called meaningfully on a lot of objects. For example, it will also tell you the length or number of characters in a string.

In [None]:
len("fascinating revelations")
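
To illustrate that `len()` works across the containers we have met, here is a small sketch. The names `two_dates` and `facts` are made up so that nothing defined earlier in the notebook is overwritten.

```python
two_dates = ['2021-05-27T16:38:52Z', '2021-05-27T16:27:54Z']
facts = {"date": "May 26, 2021", "count": 165}

print(len(two_dates))       # number of items in the list
print(len(facts))           # number of key-value pairs in the dictionary
print(len(two_dates[0]))    # number of characters in the first string
```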

As an object, a list carries both data as well as methods that you can apply. What kinds of things would you like to be able to do to this type of object? 

*Maybe add new objects to the list? `append()` does that.*

In [None]:
# print out the list of pub dates

dates

In [None]:
# now add something to the back of the list

dates.append('2021-05-27T15:35:39Z')
dates

What else would we like to do with a list? Below we create a list of the authors associated with the most recent 5 articles (the pub dates for which are in the list `dates`).

*Maybe sort the list? `sort()` does that.*

In [None]:
authors = ['Tina Moore, Amanda Woods','Eugene Volokh','Robert Repino','Ed Browne','RFE/RL']

authors.sort()
authors
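
One thing to keep in mind is that `sort()` rearranges the list in place, so after sorting `authors` its positions no longer line up with the positions in `dates`. The built-in function `sorted()` returns a sorted copy and leaves the original untouched. A minimal sketch with a made-up list called `names`:

```python
names = ["Tina", "Eugene", "Robert"]   # made-up list for illustration

# sorted() returns a new list and leaves the original alone
alphabetical = sorted(names)
print(alphabetical)
print(names)

# .sort() rearranges the list itself and returns None
result = names.sort()
print(result)
print(names)
```

A common slip is writing `names = names.sort()`, which replaces the list with `None`.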

As a container object (an object that holds or groups other objects), the most obvious set of operations you would like to perform should involve storing and retrieving data from the list. As we said, a list stores objects in a well-defined order. There is a first, a second, a third, and so on. You access these objects using **an index.**  A small catch: Python refers to positions starting at 0 and not at 1. So the first object has index 0, the second has index 1 and so on. 

In [None]:
# the first element
authors[0]

In [None]:
# the third element
authors[2]

In [None]:
# the fifth element
authors[4]

Sometimes counting places from the back or righthand side of the list is easier. We use negative indices for that.

In [None]:
# the last element -- sneaky, right?
authors[-1]

In [None]:
# the fourth from the right
authors[-4]

We can take out "slices" from a list by asking for not just a single index but a range. The construction `m:n` means starting from index `m` take all the data in a list up to, but not including, the index `n`. So `3:6` means the data stored at indices 3, 4 and 5 (or actual positions 4, 5 and 6 since we count from zero). A slice returns another list containing just the specified objects. 

In [None]:
# Finally, you can pull more than one element with the : symbol to create a 'slice'
print("From the fourth element to the end:", authors[3:], "\n")

In [None]:
print("Up to but not including the third element:", authors[:2], "\n")

In [None]:
print("From the third up to but not including the fifth element:", authors[2:4], "\n")
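
Our `authors` list is too short to try `3:6`, so here is a quick sketch with a longer, made-up list of letters. It also checks two handy facts about slices: `m:n` always gives back `n - m` objects (when the list is long enough), and the original list is not changed.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

middle = letters[3:6]
print(middle)          # ['d', 'e', 'f']
print(len(middle))     # 3, which is 6 - 3

# the slice is a new list; the original still has all 8 objects
print(len(letters))    # 8

# cutting a list at index 3 and gluing the halves together gives the list back
print(letters[:3] + letters[3:] == letters)   # True
```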

**6. Comparing lists and dictionaries**

Next, we will use data returned by the News API to make some comparisons. We will represent the first story from the News API call I made earlier today as a list. How does this compare to the dictionary we saw before? Compare the two ways of storing the same information. What is lost? What is gained?

In [None]:
story = [
    "Tina Moore, Amanda Woods",
    "2021-05-27T16:38:52Z",
    "Elderly Asian woman punched in the face in Queens, police sources say",
    "https://reason.com/volokh/2021/05/27/censor-of-anti-china-speech-among-us/"
]
        
story

And let's look at how we extract data from this container. We will use the square brackets again — `story[0]` representing the first item in the list, `story[2]` representing the third data item stored in the list and so on. Counting from zero rather than 1 is confusing but it will become natural.

In [None]:
print("The list has", len(story), "elements", "\n")

# the first element in the list has index 0
print("The first element:", story[0], "\n")

# and the third has index 2
print("The third element:", story[2], "\n")

# and the last has index -1 — the negative indices count from the right!
print("The last element:", story[-1], "\n")
print("The third from the last:", story[-3], "\n")

# Finally, a slice - remember this returns a new list
print("From the second up to but not including the fourth element:", story[1:3], "\n")
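
To help with the comparison, here is a small side-by-side sketch. The key names `author`, `publishedAt` and `title` are our assumption about how the dictionary version was labeled, and the variable names are made up; the values are the first three items of `story`.

```python
story_as_list = [
    "Tina Moore, Amanda Woods",
    "2021-05-27T16:38:52Z",
    "Elderly Asian woman punched in the face in Queens, police sources say"
]

# Assumed key names, for illustration only
story_as_dict = {
    "author": "Tina Moore, Amanda Woods",
    "publishedAt": "2021-05-27T16:38:52Z",
    "title": "Elderly Asian woman punched in the face in Queens, police sources say"
}

# Two ways to ask for the same headline
print(story_as_list[2])
print(story_as_dict["title"])

# The values are the same; only the way we look them up differs
print(list(story_as_dict.values()) == story_as_list)   # True
```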

Just as you can pull data from a list, you can also change the contents of one or more elements of a list.

In [None]:
story[0] = "Joseph Pulitzer"
story

In [None]:
story[1:3] = [1,10]
story
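
One thing to watch with slice assignment: the replacement does not have to be the same length as the slice it replaces, so the list can shrink or grow. A minimal sketch with a made-up list:

```python
items = ['a', 'b', 'c', 'd']

# replace two objects (indices 1 and 2) with a single one
items[1:3] = ['X']
print(items)    # ['a', 'X', 'd']

# an empty slice like 1:1 removes nothing, so this only inserts
items[1:1] = ['p', 'q']
print(items)    # ['a', 'p', 'q', 'X', 'd']
```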

Certain operations produce lists. For example, we can divide a character string into pieces by "splitting" on a character using the method `split()`. This gives us a crude way to pull words from a string that represents a sentence.

In [None]:
line = 'After a particularly painful year for the Asian American and Pacific Islander community, three leaders call for the workplace to become safer for all.'

# divide into substrings using the space character " " as a breakpoint — this gives a rough division into words.
rough_words = line.split(" ")
rough_words

In [None]:
print("There are roughly", len(rough_words), "words in this description.")

Add text to this cell explaining what might be wrong with this approach to pulling words from text.



Finally, use "e" as a breakpoint to split the string. Make sure you understand what happened here.

In [None]:
line.split("e")
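
One way to check your understanding of `split()` is to try undoing it. The string method `join()` glues a list of strings back together, putting a separator between each pair. The sketch below uses a shortened piece of the description (stored under the made-up name `short_line`) to keep the output readable.

```python
short_line = 'three leaders call for the workplace to become safer for all.'

pieces = short_line.split("e")
print(pieces)

# every "e" becomes a cut, so there is one more piece than there are e's
print(len(pieces) - 1 == short_line.count("e"))   # True

# putting an "e" back between the pieces recovers the original string
print("e".join(pieces) == short_line)             # True
```

Look at the second piece in the output: two e's in a row leave an empty string between them.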

Make sure you understand lists. Create one and try out forming subsets, changing values and so on.

In [None]:
# put your work here


**7. Higher-level objects: A DataFrame**

So we have seen lists and dictionaries, built-in structures that help us group data that are associated in some way. With dictionaries, we use names or keys to look up data. With lists, we use position to look things up. In many cases we actually need a mixture of both kinds of structures. The most common example is a table. 

Think about a spreadsheet. The basic structure involves rows and columns. In many cases the rows refer to different objects in the real world and the columns represent things we measure or record about each object. For example, if instead of one article from the News API, we had 100 or 1,000, we would have a series of rows. In each row, the first entry could be the publication, the second could be the author, the third could be the title and so on. This happens so often that researchers have created a special object to emulate a spreadsheet.

To see it in action, let's apply it to the most recent 10 articles I pulled from the News API earlier today.

We will store each article as a dictionary, with one key for the publication, one for the author and one for the title. This is VERY close to what the News API puts out (go back and look at their [homepage](https://newsapi.org/)).

In [None]:
stories = [
 {'author': 'Tina Moore, Amanda Woods',
  'source': 'New York Post',
  'title': 'Elderly Asian woman punched in the face in Queens, police sources say'},
 {'author': 'Eugene Volokh',
  'source': 'Reason',
  'title': '[Eugene Volokh] Censor of Anti-China Speech Among Us'},
 {'author': 'Robert Repino',
  'source': 'Slate Magazine',
  'title': 'The Real Delaware County Is Nothing Like Mare of Easttown'},
 {'author': 'Ed Browne',
  'source': 'Newsweek',
  'title': 'Scientists React to COVID Origin Investigation Ordered by Joe Biden'},
 {'author': 'RFE/RL',
  'source': 'Radio Free Europe/ Radio Liberty',
  'title': 'Russian Supreme Court Backs Government In Patent Dispute Over Remdesivir COVID Treatment'},
 {'author': 'RFE/RL',
  'source': 'Radio Free Europe/ Radio Liberty',
  'title': 'Russian Supreme Court Backs Government In Patent Dispute Over Remdesivir COVID Treatment'},
 {'author': 'Cynthia Choi, Manjusha P. Kulkarni, Russell Jeung',
  'source': 'Quartz India',
  'title': 'How businesses can create lasting change to advance racial equity for Asian Americans & Pacific Islanders'},
 {'author': 'Henry A. Giroux',
  'source': 'Truthout',
  'title': 'Tucker Carlson Is Just the Tip of the Iceberg in Right-Wing Media’s War on Truth'},
 {'author': 'The Economist',
  'source': 'The Economist',
  'title': 'Is Kamala Harris a gift to the Republicans?'},
 {'author': 'by Katherine Donlevy, Associate Editor',
  'source': 'Qchron.com',
  'title': '‘The entire Jewish community is on edge’'}
]

print(type(stories))
print(len(stories))

This is certainly a fine way to store the data. We can select information about the third publication by "subsetting" just the third row, say.

In [None]:
stories[2]

# why 2?

In [None]:
# extract the publisher of the article in the third row

stories[2]["source"]
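
Picking out one row is easy. Picking out one "column" (say, every source) takes more work, because we have to visit each dictionary in turn. Here is a rough sketch using a `for` loop over just the first three stories; the names `first_three` and `sources` are made up for this example.

```python
first_three = [
 {'author': 'Tina Moore, Amanda Woods',
  'source': 'New York Post',
  'title': 'Elderly Asian woman punched in the face in Queens, police sources say'},
 {'author': 'Eugene Volokh',
  'source': 'Reason',
  'title': '[Eugene Volokh] Censor of Anti-China Speech Among Us'},
 {'author': 'Robert Repino',
  'source': 'Slate Magazine',
  'title': 'The Real Delaware County Is Nothing Like Mare of Easttown'}
]

# build up the "source" column one row at a time
sources = []
for article in first_three:
    sources.append(article["source"])

print(sources)   # ['New York Post', 'Reason', 'Slate Magazine']
```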

We will, from time to time, make our data sets "by hand" like this, so it's worth seeing how it might be done. Our data format, the list of dictionaries, is trying really hard to create essentially a **table**. That is, a grid of data, where each row refers to an article and each column refers to the author, the source or the title. For our simple data above, that would be a table with 10 rows and 3 columns.

Interacting with even this simple data in this format is a little cumbersome. We can appeal to a higher-level object to create a proper table for us. You are probably familiar with Excel or some spreadsheet. These programs are all about tables. In Python, the answer to Excel (or a popular answer) is a so-called Pandas **DataFrame**. Pandas refers to a package contributed by a Python developer who wanted to make working with tabular data easier. 

[You can read more about Pandas here](http://pandas.pydata.org/)

[And there are simple tutorials here](http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%201%20-%20Reading%20from%20a%20CSV.ipynb)

Pandas is a **package**, which means its author has published data, functions and a host of new objects for the community to use. Whereas the built-in objects are basic and get us pretty far, often we need something special to make our lives easier. In the case of Pandas, an object of type DataFrame will help us manipulate (compute with, make graphs of, etc.) simple tabular data.

We can take the `stories` object (the list of dictionaries) and turn it into a DataFrame using the function `DataFrame()`. (Yeah, that might be confusing: the type of the object is "DataFrame" and the function that turns your data into an object of that type is also called "DataFrame". This is a fairly common naming convention, and functions like this are called "constructors.") As its argument, it takes the data itself (the list of dictionaries).

We **import** the function "DataFrame" from the pandas package first. The import command is giving us super powers from the Pandas package to do things not built into the basic Python system. We will see this construction a lot.

In [None]:
from pandas import DataFrame

df = DataFrame(stories)
df
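
A DataFrame really is a mixture of the two structures we have seen: you can look up a column by name, the way you would with a dictionary key, and a row by position, the way you would with a list index. Here is a minimal sketch on a three-row table (the variable name `small_df` is made up, and we use shortened copies of the first three stories).

```python
from pandas import DataFrame

small_df = DataFrame([
    {'author': 'Tina Moore, Amanda Woods', 'source': 'New York Post'},
    {'author': 'Eugene Volokh', 'source': 'Reason'},
    {'author': 'Robert Repino', 'source': 'Slate Magazine'}
])

# one column, looked up by name (like a dictionary key)
print(small_df["source"])

# one row, looked up by position with .iloc (like a list index)
print(small_df.iloc[2])

# one cell: the row at position 2, then the "source" column
print(small_df.iloc[2]["source"])   # Slate Magazine
```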

Notice that the way our data looks has changed. It's much more like an actual table now with column headings and the like. The DataFrame has lots of wonderful things you can do to it — lots of ways to compute with the data contained in the underlying table. 

One simple thing is just to get its size. How many rows and columns? This is an attribute, a piece of information stored with the object, that we can again access with "dot" notation. Because we are looking up information and not computing something (like making strings lowercase, say), we don't need parentheses.

In [None]:
df.shape
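
The value stored in `shape` is a pair of numbers, rows first and then columns. A small sketch on a two-row table (the name `tiny_df` is made up) showing how to pull the two numbers apart, along with another attribute, `columns`, that holds the column headings:

```python
from pandas import DataFrame

tiny_df = DataFrame([
    {'author': 'Ed Browne', 'source': 'Newsweek',
     'title': 'Scientists React to COVID Origin Investigation Ordered by Joe Biden'},
    {'author': 'The Economist', 'source': 'The Economist',
     'title': 'Is Kamala Harris a gift to the Republicans?'}
])

# shape is (number of rows, number of columns)
n_rows, n_cols = tiny_df.shape
print(n_rows, n_cols)           # 2 3

# len() of a DataFrame counts its rows
print(len(tiny_df))             # 2

# the column headings come from the dictionary keys
print(list(tiny_df.columns))    # ['author', 'source', 'title']
```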

To close this off, we can make our DataFrame directly from the output of the News API. It is, after all, returning a list of dictionaries as part of its output. So, from the top...

In [None]:
key = "9b49997ed4274749871f14355ec9cd3f"

query = "anti-asian"
url = "https://newsapi.org/v2/everything?q="+query+"&from=2021-04-27&sortBy=publishedAt&apiKey="+key
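
Gluing strings together with `+` works here, but it can break if the query contains spaces or other special characters. One alternative, sketched below, is to store the pieces in a dictionary and let `urlencode()` from Python's standard library build the query string. The names `demo_key`, `demo_params` and `demo_url`, the placeholder key and the two-word query are all made up for illustration.

```python
from urllib.parse import urlencode

demo_key = "YOUR_API_KEY"   # placeholder, not a real key

demo_params = {
    "q": "anti-asian hate",       # a query with a space in it
    "from": "2021-04-27",
    "sortBy": "publishedAt",
    "apiKey": demo_key
}

# urlencode() joins the pairs with & and replaces the space with +
demo_url = "https://newsapi.org/v2/everything?" + urlencode(demo_params)
print(demo_url)
```

The `get` function from the requests package also accepts a `params` argument that does this encoding for you.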

In [None]:
from requests import get

response = get(url)
news = response.json()

news.keys()

In [None]:
from pandas import DataFrame

df = DataFrame(news["articles"])

In [None]:
df.head()

The point is that we could imagine polling the News API once every few hours and adding to our table of articles. Notice that we might want to do some screening before including them as examples of anti-Asian aggression. 
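
If we did poll the API repeatedly, the same story would often come back more than once (look closely at our `stories` list and you'll spot a repeat already). Here is one rough way the "adding to our table" step might go, skipping any story whose title we have already stored. The function `add_new_stories` and the variables `table` and `batch` are hypothetical names for this sketch, and matching on the title alone is a simplifying assumption.

```python
# Hypothetical helper: add stories from a new batch, skipping titles we already have
def add_new_stories(table, batch):
    seen_titles = set()
    for row in table:
        seen_titles.add(row["title"])

    added = 0
    for row in batch:
        if row["title"] not in seen_titles:
            table.append(row)
            seen_titles.add(row["title"])
            added = added + 1
    return added

# what we have stored so far
table = [
    {'source': 'Reason', 'title': '[Eugene Volokh] Censor of Anti-China Speech Among Us'}
]

# a new batch: one story we already have, and one new story listed twice
batch = [
    {'source': 'Reason', 'title': '[Eugene Volokh] Censor of Anti-China Speech Among Us'},
    {'source': 'Newsweek', 'title': 'Scientists React to COVID Origin Investigation Ordered by Joe Biden'},
    {'source': 'Newsweek', 'title': 'Scientists React to COVID Origin Investigation Ordered by Joe Biden'}
]

print(add_new_stories(table, batch))   # 1
print(len(table))                      # 2
```

The screening step mentioned above could be added inside the second loop, as one more condition a story must pass before it is appended.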

**Bonus: Automation**

The Times provided a summary of each event -- just two sentences. These were presumably hand-written, but they could be "automated" as well. Map out some instructions and the underlying data for how a summary might be created about an incident from a news story.