Introduction 
-------------

**1. IPython or Jupyter Notebook**

As we have mentioned, in this class our main language for computation will be [Python](http://www.python.org). It was chosen because of its emphasis on **readability and code sharing.** These and other desirable aspects of the language have helped attract a large community of users who have contributed an incredible range of capabilities to the language. Community-written code will let us pull data from web pages and PDFs, it will let us manipulate tweets and other posts from social media, and will even give us the capacity to analyze images and sound. *The more we see of the world as data, as open to computation in some way, the more our reporting skills expand and the deeper our stories become.*

We will program in Python (assemble Python expressions) using the Jupyter notebook -- the name "Jupyter" being a mix of [Julia](http://julialang.org/), [Python](https://www.python.org/) and [R](https://www.r-project.org/), the three languages it originally supported. Programming with the notebook is often referred to as "literate computing" -- by that we mean that you code a little, have a look, write a little, come up with more ideas, code a little more, write a little more and so on. To support this, there are two kinds of "cells": one kind you write in and one kind you program in.

The writing is done in **Markdown cells**. Markdown (as opposed to Mark-up) is a language that lets you write in a plain text editor (in the simplest cases Notepad or TextEdit would work fine too), with simple typographical shorthands to **make text bold** or *to put text in italics*. You can also make lists like

+ Bread
+ Milk
+ Dog food
+ Swiffer pads

These Markdown conventions are then translated into HTML -- the upshot being that you don't have to know anything about HTML to create documents that look reasonably good, certainly good enough for your reporting notes. 

Double click in this window to see the "raw" Markdown. Notice that you can still recognize lists and emphasized text from the Markdown additions, and that's the other point of this. Your documents, while written in plain text, make use of typographical conventions that make the document's highlighting understandable even without translation to HTML. That's a good trick! 

You can find [the Markdown description here](http://daringfireball.net/projects/markdown/). For Monday, please go through the [Markdown Tutorial](http://markdowntutorial.com). There might be other learning resources that we should share with the class, so let us know if you find something really helpful!

*To re-render the Markdown in this cell into HTML, click in the cell and hit Shift-Enter to execute the transformation.*

One last note. Many organizations (scientific research labs, journalistic organizations, and so on) are "publishing" dual works -- one that goes into an official journal like Science or The New York Times summarizing a set of computations, and then another that is published in the form of a notebook that documents the author's computational work. Here are some examples from science and journalism.

> ["Peeling back the curtain -- How the Economist is opening the data behind our reporting."](https://medium.economist.com/peeling-back-the-curtain-487bd3be0c47) In their words "We published these calculations in a Jupyter notebook, a tidy format for breaking scripts into small blocks and annotating them."

>[BuzzFeedNews/everything -- An index of all our open-source data, analysis, libraries, tools, and guides.](https://github.com/BuzzFeedNews/everything#data-and-analyses) Data and code for many of their major stories including ["Shoot Someone In A Major US City, And Odds Are You’ll Get Away With It"](https://www.buzzfeednews.com/article/sarahryley/police-unsolved-shootings?bftw=&utm_term=4ldqpfp#4ldqpfp) and ["How Russia’s Online Trolls Engaged Unsuspecting American Voters — And Sometimes Duped The Media"](https://www.buzzfeednews.com/article/peteraldhous/russia-online-trolls-viral-strategy).

>["The Need for Openness in Data Journalism"](http://nbviewer.jupyter.org/github/brianckeegan/Bechdel/blob/master/Bechdel_test.ipynb) by Brian Keegan. This is a little old, but Keegan makes good points about the benefits of working with a notebook.

>["Why Jupyter is data scientists’ computational notebook of choice,"](https://www.nature.com/articles/d41586-018-07196-1) a recent overview piece in Nature that marks the rise of notebooks this way -- "One analysis of the code-sharing site GitHub counted more than 2.5 million public Jupyter notebooks in September 2018, up from 200,000 or so in 2015."  

>["The Architecture of Jupyter -- Interactive by design."](http://scisoftdays.org/pdf/2016_slides/perez.pdf) Starting on page 23 of this PDF, Fernando Perez, one of the designers of Jupyter, describes how notebooks have been published alongside papers in journals like Nature and Science and Scientific American, to name a few.

You get the idea. There are an increasing number of projects like this -- the fact that you can take someone else's notebook and examine the steps they followed to arrive at their conclusions is, ultimately, an important step toward transparency in data or computational journalism. *The notebooks become objects of coordination.*

**2. Background**

In the wake of the layoffs at news outlets last week, a meme started -- "Learn to code." [Know Your Meme](https://knowyourmeme.com/memes/learn-to-code) began tracking it a few days ago calling it "an expression used to mock journalists who were laid off from their jobs, encouraging them to learn software development as an alternate career path." 

<img src=https://github.com/computationaljournalism/columbia2019/raw/master/images/tc.jpg width=600>

At some point, it was suggested that Twitter was suspending accounts directing this meme at fired journalists. 

In [None]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">I am told by a person in the know that tweeting &quot;learn to code&quot; at any recently laid off journalist will be treated as &quot;abusive behavior&quot; and is a violation of Twitter&#39;s Terms of Service</p>&mdash; Jon Levine (@LevineJonathan) <a href="https://twitter.com/LevineJonathan/status/1089905702146060288?ref_src=twsrc%5Etfw">January 28, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

There were reports that the meme was organized on 4chan and later that groups had coordinated on gab. 

In [None]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">btw, if any other journos targeted by layoffs are getting masses of “learn to code” harassment, it was coordinated on 4chan (of course) <a href="https://t.co/DtpinjWhID">pic.twitter.com/DtpinjWhID</a></p>&mdash; Talia Lavin (@chick_in_kiev) <a href="https://twitter.com/chick_in_kiev/status/1088590587731808256?ref_src=twsrc%5Etfw">January 25, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

Some "news" reports circulated about the meme (in places you might expect) --
[The Ringer](https://www.theringer.com/tech/2019/1/29/18201695/learn-to-code-twitter-abuse-buzzfeed-journalists), 
[Fox News](https://www.foxnews.com/tech/twitter-fights-harassment-against-fired-journalists-told-to-learn-to-code), 
[Breitbart](https://www.breitbart.com/tech/2019/01/28/twitter-telling-fired-journalists-to-learn-to-code-is-targeted-harassment/), 
and 
[The Daily Wire](https://www.dailywire.com/news/42783/heres-where-learn-code-meme-originated-hint-not-ashe-schow). As much as it pains me to conflate our mission in this class with this meme, we're diving in. Let's use this as a test case for how we might "scale up our reporting on this story." In parallel, we will review some of the basic data types we've seen already from Python, introduce a few new ones and then talk about how we might create "structures" for data -- that is, how can we organize these basic types to record data on more complex things like tweets.

**3. Introduction to Python: Some basic data types**

As we've said, whether we code using a notebook or some other interface, our basic language will be Python. Python is an **object-oriented language**. Software objects are a kind of programming abstraction, a particular way of organizing information and actions. Software objects try to mimic the notion of objects in the physical world -- that means they contain properties or **data** and also might have certain operations or **methods** that you can use to transform the object in some way. Take as an example one of our lovely rolling chairs -- it has operations (it can support you when you sit, it has a foldable arm for writing, and it can move around the room) as well as "data" (it has a seat height, a desk height, maybe even an RGB value specifying its lovely seat color).

**Python has a series of built-in types of objects, meaning certain types of information that are so basic that they are needed by just about every programming exercise you'll attempt**. What might those be?

As we have seen, some of these basic or built-in types of objects are really really basic like numbers (counting numbers or, more generally, integers, like 2, 5 and -3; as well as "real" numbers -- also known as floating point numbers -- those having a decimal point, like 5.67) and text strings (sequences of characters like "Is he really going to make us read Trump's Tweets?"). These data have *types* `int` and `float` and `str` for integer, floating point and string, respectively. You can tell the type of any object in Python with a built-in function `type()`. 

**Yes, some operations are so common or basic that they are also "built-in" as *functions*.**

In the lines below, we create strings (objects of type `str`) by surrounding some text with quotation marks. The quotation marks can be double or single (or even triple double quotes), and Python doesn't care which as long as they match at the beginning and the end of the string. The built-in function `print()` will, well, print whatever is passed to it, and we can print groups of things by separating them with commas.
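Here is a minimal sketch of the quoting rule. The variable names (`single`, `double`, `triple`) are made up for illustration; the point is only that Python treats the three strings as identical.

```python
# Three ways to write the same string; only the quote marks differ.
single = 'learn to code'
double = "learn to code"
triple = """learn to code"""

# Both comparisons are True, and all three objects have type str.
print(single == double, double == triple)
print(type(single), type(double), type(triple))

# Mixing quote styles lets you put one kind of quote inside the other.
print("Trump's tweets", 'the "learn to code" meme')
```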

(For the record, we're now about to see three built-in types -- integers, real or floating point numbers, and strings -- and two built-in functions, `type()` and `print()`. And it's only our first few cells of code!)

In the first case, we look at a sequence of characters that constitute a string. We'll use tweet text again...

In [None]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">finally putting in the effort to learn to code python and it has been enjoyable and rewarding thus far</p>&mdash; halleJOKEL (@halleJOKEL) <a href="https://twitter.com/halleJOKEL/status/1086994877496332288?ref_src=twsrc%5Etfw">January 20, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

This is a tweet posted on Sunday at 9:30 NYC time before the phrase took a turn. Remember that the notebook will always exhibit the result of the last computation you perform in a cell, so we could have the notebook exhibit the string by simply doing this...

In [None]:
tweet = "finally putting in the effort to learn to code python and it has been enjoyable and rewarding thus far"
tweet

... but if we wanted to see two outputs, we'd need to use the `print()` command. So, again, if you want to see the output of a single computation, you don't need to use `print()`, the notebook will exhibit the result for you. Otherwise, `print()` lets you be more intentional about what gets printed when.
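As a small sketch of that difference (the shortened tweet text is just for illustration): in a notebook cell, only the final bare expression is echoed, while `print()` shows everything you hand it.

```python
tweet = "finally putting in the effort to learn to code python"

# With two bare expressions in one cell, the notebook only echoes the last one,
# so the first line below produces no visible output.
tweet
len(tweet)

# print() shows both, in the order we ask for them.
print(tweet)
print(len(tweet))
```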

OK back to strings. `"it has been enjoyable and rewarding thus far"` is a string and we can tell by looking at its `type()`.

In [None]:
type("it has been enjoyable and rewarding thus far")

Or if we did this all in one cell, we could use the `print()` command to exhibit the data and indicate its type.

In [None]:
print("it has been enjoyable and rewarding thus far")
print(type("it has been enjoyable and rewarding thus far"))

From strings we move to integers, numbers like 1, 5 or -3.

In [None]:
print(2)
print(type(2))

Change the 2 above to some other *integer* value and see that the type is the same. Now, here is a real or floating point number like 2.1 or 3.14159.

In [None]:
print(5.7)
print(type(5.7))

Change the 5.7 above to some other number with a decimal point and see that the type is the same. 

It's worth being a little precise here -- although it takes us off our main course briefly. A *floating point* number is a representation of real numbers on the computer. This has some complexity because some numbers have decimal representations that never end. Take 1/3 for example -- it's 0.333333... and the 3's repeat forever. A computer can't deal with never-ending things and has to make compromises. The video below explains the difference between a *real* number like 1/3 and its *floating point* representation -- as told by a charming British man.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('PZRI1IfStY0')

Note that the example he references in the video, adding 0.1 and 0.2, happens in Python too. The issue isn't with the language, it's with the way the computer represents numbers -- and what happens when a number has a really long representation.

In [None]:
0.1 + 0.2
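
One practical consequence, sketched below: testing two floating point numbers with `==` can give a surprising answer. The standard library's `math.isclose()` is one common way around this; the variable name `total` is just for illustration.

```python
import math

total = 0.1 + 0.2

# The stored result is very close to 0.3, but not exactly equal to it.
print(total)
print(total == 0.3)              # False

# math.isclose() asks whether two floats agree up to a small tolerance.
print(math.isclose(total, 0.3))  # True

# round() is another option when a fixed number of decimals is enough.
print(round(total, 2) == 0.3)    # True
```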

Let's leave that aside for the moment because we never want to stray too far from why we're learning this in the first place. Our goal is to use computation to heighten our understanding of the world around us. We will take an occasional detour to discuss nerdier issues since you'll see them and might wonder "Why?".

Returning to the built-in data types, there's one more important one. It has only two values.

In [None]:
print(True)
print(type(True))

The data type above is called *Boolean.* It represents just two states -- true and false. Boolean data are generated by typing the special sequences of characters <font face=courier>True</font> or <font face=courier>False</font> (without quotations because they are not strings). Change the <font face=courier>True</font> in the above expression to <font face=courier>False</font> to make sure that it, too, is a Boolean value.

You will primarily encounter Booleans as the output of some **logical expression.** Here are some examples of expressions that return Boolean (True/False) data. Try the expressions below -- each asks whether a relationship holds or not, is true or false.

*Riddle me this: Is 3 bigger than 5?*

In [None]:
3 > 5

*Riddle me this: Is 10 smaller than 100?*

In [None]:
10.0 < 100.0

*Riddle me this: Is the letter 'e' in 'John Levine'?*

In [None]:
"e" in "John Levine"

*Riddle me this: Is the letter 'z' in 'Donald Trump'?*

In [None]:
"z" in "Donald Trump"

The first two logical expressions in the code above make **comparisons** while the last two test for **membership**. 

The Boolean type will be important when we start to write code that "branches" its behavior depending on whether some condition is true or false -- we might want to take one action if something is true, but another action if that thing is false. For example, we might want to analyze only the tweets coming from the President's mobile device and would use a Boolean to separate out those cases.
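We haven't covered branching yet, but as a preview, here is a minimal sketch of how a Boolean can steer a program. The `source` value and the names `is_mobile` and `label` are assumptions made up for this example.

```python
# Hypothetical value standing in for one tweet's "source" field.
source = "Twitter for iPhone"

# The membership tests, joined with "or", produce a single Boolean...
is_mobile = "iPhone" in source or "Android" in source
print(is_mobile)

# ...and an if/else statement takes one action or the other based on it.
if is_mobile:
    label = "mobile"
else:
    label = "not mobile"

print("Device category:", label)
```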

With this in mind, sometimes, you will want to take action based on a combination of conditions. For this, we use "and", "or" and "not" to build more complicated expressions. 

*A series of logical expressions joined with 'and', for example, is True only if all the expressions are True.* 

Here we use the `print()` command to exhibit the output from two comparisons.

In [None]:
print("e" in "John Levine" and 3 < 5)
print("z" in "John Levine" and 3 < 5)

*A series of conditions joined with 'or' is True if at least one expression is True.*

In [None]:
print("u" in "John Levine" or 3 < 5)
print(3>10 or 5>100)

*And you can flip from True to False (and vice versa) using the word 'not'. This is also called "negation".*

In [None]:
print(not "u" in "Donald Trump")
print(3<10 and not 5>100)

*As with simple algebraic expressions like (1+3)\*5, we can use parentheses to make sure our expressions are evaluated in the right order.*

In [None]:
print(6 > 5 or (2 > 5 and "u" in "Trump"))

Make sure you see why this evaluated to True. Finally, keep in mind, all of these expressions return a Boolean object.

In [None]:
print(type(3<10 and not 5>100))

Experiment a little on your own and make sure you understand how these expressions work.

In [None]:
# Your work here


**4. Operators**

Technically, all of these symbols (">", "<", "in", "and", "or" and "not") are examples of **operators** in Python. The simplest kind of operators are arithmetic. They probably would have been a better place to start (as we did in class) -- they underscore the idea that Python acts like a big calculator. Here "+" and "\*" and "/" are called **arithmetic operators.**

In [None]:
print(3+10)
print(100*5)
print(3*(100+2.5))

As we have seen, Python 'overloads' its operators so they can behave differently depending on what arguments you pass them. And the results can be surprising. Take arithmetic operations on strings, for example...

In [None]:
print(5*"Tweets ")
print("#LearnToCode"+" tweets")

You can read about the various operators in Python [here](http://www.tutorialspoint.com/python/python_basic_operators.htm). It's a clean summary.
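The cells above did not exercise division, so here is a short sketch of `/` along with a few related arithmetic operators. The two counts in the last example are the first two daily tweet counts used later in this notebook.

```python
# "/" is true division and always returns a float, even for whole results.
print(10 / 2)     # 5.0
print(7 / 2)      # 3.5

# "//" is floor division and "%" gives the remainder.
print(7 // 2)     # 3
print(7 % 2)      # 1

# "**" raises a number to a power.
print(2 ** 10)    # 1024

# Averaging two daily counts.
average = (474 + 540) / 2
print(average)    # 507.0
```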

**5. Variables**

Creating data in cells and printing it out is awesome and all, but most of what you will be doing in object-oriented programming is making and evolving objects. We can take any of the objects above and store them in a **variable**, literally associating them with a name that we can use to refer to them later. The equals sign here is called an **assignment operator**. 

Notice that you can use variables to catch the output of all the expressions we've seen so far.  

In [None]:
tweet = "#code #100daysofCode #learntocode working on #FreeCodeCamp final challenges for fun"

isg = "g" in tweet

print(tweet)
print("Is 'g' in this post?", isg)

A new thing in the cell above. We printed out several different expressions in one line, separating each with a comma. You can chain several expressions in this way to make a tidier printout.

Anyway, more examples of variables and operators -- this time with numeric data...

In [None]:
x = 4.3
y = 100.2

print("Adding x and y to get", x+y)

... or a mix of types if the expression makes sense.

In [None]:
w = 5
x = "Trump\t"
y = "Multiplication with strings:\n"

print(y+x*w)

This last cell introduces something new. The "\n" in the expression above represents a **newline** character. It, well, moves your cursor to the start of a new line, and hence we see the five copies of "Trump" on a separate line. We also see another special character "\t" for **tab**. 

In general, the "\\" is an **escape character**. It "encodes difficult-to-type characters into a string" like a tab or a new line. As another example, you can use \' and \" to encode quotes inside a string... 

In [None]:
print("He said, \"I cannot believe we are looking at Trump's tweets\"")

Here we used double quotes to start and end our string and so we had to "escape" the double quotes on the inside of the string. Otherwise Python would confuse them for the start or end of our string. You can read more about escape sequences [here](https://www.quackit.com/python/reference/python_3_escape_sequences.cfm). But for the moment, it's enough to know that the backslash is a special character so that you won't be surprised when you see it.
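A short sketch of what escaping means in practice: each escape sequence is typed as two characters but stored as one. The strings and variable names below are made up for illustration.

```python
# An escape sequence is typed as two characters but stored as one.
print(len("\n"), len("\t"), len("\""))   # 1 1 1

# To get a literal backslash, escape the backslash itself.
path_like = "one\\two"
print(path_like)        # one\two
print(len(path_like))   # 7

# Using single quotes on the outside avoids escaping the double quotes.
a = "He said, \"hello\""
b = 'He said, "hello"'
print(a == b)           # True
```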

As a final example, "\U" is used to specify the "32-bit hex value" for a Unicode string. Oy. We will come back to this, but characters can be represented for a computer in a variety of ways. Unicode contains the most expansive character set among all the representations, literally allowing you to specify characters in any human language... and Klingon. So the "\U" means the next 8 letters/numbers are used to specify a character. It could be in Cyrillic or Arabic or Tibetan or even an emoji. [Here's a reference on Unicode -- but we'll come back to it.](http://unicode.org/charts/)

In [None]:
print("\U0001f63b")

In [None]:
print("\U0001f601", "\U0000270c", "\U0001f370", "\U0001f423")
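
To see the connection between a character and its number, the built-in functions `ord()` and `chr()` convert in each direction. This is a minimal sketch; the variable name `cat` is just for illustration.

```python
cat = "\U0001f63b"

# ord() gives the character's code point as an integer; hex() shows it in base 16.
print(hex(ord(cat)))          # 0x1f63b

# chr() goes the other way, from a code point to a one-character string.
print(chr(0x1f63b) == cat)    # True

# For code points that fit in four hex digits, the shorter "\u" escape also works.
print("\u270c" == "\U0000270c")   # True

# However it was typed, the emoji counts as a single character.
print(len(cat))               # 1
```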

Finally, as we mentioned in class, if you ever need help on something, there is a built-in function called, well, `help()`. Remember in the last cell, we made x a string...

In [None]:
help(type(x))

In [None]:
help(type(w))

Notice in the code above that the variable x was first set to equal the number 4.3 and then it was assigned a string. Python is flexible that way. (It's called **dynamic typing** if you want to get formal -- if you have programmed before you might know that not all languages are so loose.) 
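The reassignment can be watched directly by asking for the type before and after. This sketch reuses the two values x held earlier; `first_type` and `second_type` are names introduced only for this example.

```python
x = 4.3
first_type = type(x)
print(x, first_type)

# Reassigning the same name to a string is allowed; the name simply points elsewhere.
x = "Trump\t"
second_type = type(x)
print(x, second_type)

print(first_type is float, second_type is str)   # True True
```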

Technically, when you create a variable, the name on the lefthand side of the expression is associated with, or "points to," the value you assign to it. At any time, you can have that name reassigned and point to another object. This flexibility is really handy. Here is an example that drives the point home. Variables are just names that "point to" objects, acting like labels. You can reapply the labels as you see fit...

In [None]:
x = 3
y = x
print("Here x and y are both pointing to the value 3:", x, y)

y = 5
print("But now y points to the value 5, and x is unchanged:", x, y)

**6. Objects**

As we said in class, everything in Python is an **object**. Just like objects in the real world, you might be able to guess about the kinds of things you'd like to be able to do to a particular software object. 

As we saw in class, a string (of type `str`) is just a sequence of characters. What might you want to do to such an object?

In [None]:
tweet = "Why kindergartners need to learn to code https://t.co/ovV6IMS0Hf via @BostonGlobe"

print("There are", tweet.count("e"), "occurrences of the letter e","\n")
print("There are", tweet.count("a"), "occurrences of the letter a","\n")

print("Make the tweet all caps:", tweet.upper())
print("Or all lowercase:", tweet.lower(), "\n")

print("Or swap e's for a's:")
print(tweet.replace("e","a"), "\n")

print("... or 'learn' with 'forget':")
print(tweet.replace('learn', 'forget'))

OK even with this tiny bit of programming, we were a little dangerous. In class, we read in Trump's tweets from 2018 and 2019 as a long string object and counted occurrences of certain string patterns -- maybe you looked for "RT" or "#" or "!" or "wall." Rather than continue with strings, we are going to examine another kind of basic object in Python, a **list**.

**7. From single variables to lists**

It's one thing to store single values (a single number or a single string), but as we know, we tend to collect a lot of data on different aspects of a person or thing in the world -- we might offer a survey to 100 people that consists of 10 questions; or we might record facts about the last 100 of Donald Trump's tweets, including the time he tweeted and the number of retweets each earned; or we might consider all the tweets that promote the meme #LearnToCode. 

A **list** is simply an **ordered collection** of objects. It has a well-defined first entry, a second entry and a last entry. It can hold different kinds of objects in each position. It is constructed using square brackets [ ]. 

**A common mistake for people learning Python is to confuse the parentheses we use to indicate a function like "tweet.count('RT')" with the square brackets we will use to group objects into a list. They look similar so be careful! 😉**

For the moment we will look at counts of tweets containing "learn to code" or #learntocode or #learn2code. These were pulled from the Twitter [Application Programming Interface or API](https://developer.twitter.com/en/docs/api-reference-index). We will have a lot to say about APIs and data access over time. Each count represents the number of tweets appearing on a single day, where the days range from January 20, 2019 to January 29, 2019.

In [None]:
counts = [474,540,679,970,6279,8412,7448,9209,37595,20250]

print("The type of 'counts' is", type(counts), "and its length is", len(counts))

The `len()` function returns the number of elements in a list, or its **length.** It also can be used to tell you the length or number of characters in a string.


In [None]:
len("learn to code")

As an object, a list carries both data as well as methods that you can apply. What kinds of things would you like to be able to do to this type of object? 

*Maybe add new objects to the list? append() does that.*

In [None]:
print(counts)

counts.append(2767)
print(counts)

The number 2767 is the count for today, but only up until 10am so the data are not complete. What else would we like to do with a list?

*Maybe sort the list? sort() does that.*

In [None]:
counts.sort()
print(counts)
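
One thing to watch for: `sort()` rearranges the list itself, so the day-by-day order is lost. The built-in function `sorted()` returns a sorted copy instead. The sketch below uses a separate name, `daily_counts`, so it doesn't disturb the `counts` list above.

```python
daily_counts = [474, 540, 679, 970, 6279, 8412, 7448, 9209, 37595, 20250]

# sorted() returns a new list and leaves the original order alone.
ordered = sorted(daily_counts)
print(ordered)
print(daily_counts)              # still in day-by-day order

# The sort() method rearranges the list itself and returns None.
result = daily_counts.sort()
print(result)                    # None
print(daily_counts == ordered)   # True

# Both accept reverse=True to sort from largest to smallest.
print(sorted(daily_counts, reverse=True)[0])   # 37595
```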

As a container object (an object that holds or groups other objects), the most obvious set of operations you would like to perform should involve storing and retrieving data from the list. As we said, a list stores objects in a well-defined order. There is a first, a second, a third, and so on. You access these objects using **an index.**  A small catch: Python refers to positions starting at 0 and not at 1. So the first object has index 0, the second has index 1 and so on. 

Next, we will use a list to store data from a tweet by Donald Trump Jr. on the meme. In this list we store the time and day he tweeted, the content of the tweet, its ID, the number of people who retweeted it, the number of people who "liked" it, whether it was a retweet and the platform he used to tweet from.

Here's the tweet.

In [None]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Could someone explain to me why if I tell my kids to “learn to code” it’s likely sound parenting, but if I told a journalist the same it’s grounds for a <a href="https://twitter.com/Twitter?ref_src=twsrc%5Etfw">@twitter</a> suspension?</p>&mdash; Donald Trump Jr. (@DonaldJTrumpJr) <a href="https://twitter.com/DonaldJTrumpJr/status/1089958848742518785?ref_src=twsrc%5Etfw">January 28, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

And let's look at how we extract data from this container.

In [None]:
djt = [
    "Mon Jan 28 18:50:14 +0000 2019",
    "Could someone explain to me why if I tell my kids to 'learn to code' it’s likely sound parenting, but if I told a journalist the same it’s grounds for a @twitter suspension?",
    "1089958848742518785",
    7565,
    31098,
    False,
    "Twitter for iPhone"
]
        
print(djt, "\n")

print("The list has", len(djt), "elements", "\n")

# the first element in the list has index 0
print("The first element:", djt[0], "\n")

# and the fourth has index 3
print("The fourth element:", djt[3], "\n")

# and the last has index -1 -- the negative indices count from the right!
print("The last element:", djt[-1], "\n")
print("The third from the last:", djt[-3], "\n")

# Finally, you can pull more than one element with the : symbol to create a 'slice'
print("From the fourth element to the end:", djt[3:], "\n")
print("Up to but not including the third element:", djt[:2], "\n")
print("From the third up to but not including the sixth element:", djt[2:5], "\n")

Just as you can pull data from a list,  you can also change the contents of one or more elements of a list.

In [None]:
djt[0] = 6000
print(djt, "\n")

djt[1:4] = [1,10,100]
print(djt)

Certain operations produce lists. For example, we can divide a character string into pieces by "splitting" on a character using the method `split()`. This gives us a crude way to pull words from a string that represents a sentence.

In [None]:
line = "Could someone explain to me why if I tell my kids to 'learn to code' it’s likely sound parenting, but if I told a journalist the same it’s grounds for a @twitter suspension?"

# divide into substrings using the space character " " as a breakpoint -- this gives a rough division into words.
rough_words = line.split(" ")
print(rough_words,"\n")

print("There are roughly", len(rough_words), "words in this tweet.")

Explain here what might be wrong with this approach to pulling words from text.



Finally, use "e" as a breakpoint to split the string. Make sure you understand what happened here.

In [None]:
print(line.split("e"))

Make sure you understand lists. Create one and try out forming subsets, changing values and so on.

In [None]:
# put your work here


**8. Higher-level objects: A DataFrame**

Let's start with a simplified version of our data on the meme. For each day, we'll look at the date and the number of tweets that day containing "learn to code" or one of its hashtag variants. So each day consists of those two pieces of information. 

We will store each day as a list, where the first element is the date and the second is the tweet count. Then, in the cell below, we form a data set consisting of 11 days. We store each day (each list with the data for a day) in a list. Yes, a **list of lists.**

In [None]:
times = [
    ["2019-01-30",2767],
    ["2019-01-29",20250],
    ["2019-01-28",37595],
    ["2019-01-27",9209],
    ["2019-01-26",7448],
    ["2019-01-25",8412],
    ["2019-01-24",6279],
    ["2019-01-23",970],
    ["2019-01-22",679],
    ["2019-01-21",540],
    ["2019-01-20",474]
  ]

print(type(times))
print(len(times))

This is certainly a fine way to store the data. We can select information about the third day by "subsetting" just the third row, say.

In [None]:
times[2]

# why 2?

We will, from time to time, make our data sets "by hand" like this, so it's worth seeing how it might be done. Our data format, the list of lists, is trying really hard to create essentially a **table**. That is, a grid of data, where each row refers to a time period and then each column refers to either the timestamp or the tweet count for the meme. For our simple data above, that's 11 rows and 2 columns.
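To make the row-and-column picture concrete, here is a sketch using a shortened, three-row copy of the table (named `rows` so it doesn't overwrite `times`). The last step uses a "list comprehension," a compact loop we haven't covered yet, so treat it as a preview.

```python
# A shortened copy of the table, just three rows, for illustration.
rows = [
    ["2019-01-30", 2767],
    ["2019-01-29", 20250],
    ["2019-01-28", 37595],
]

# Two indices in a row: first pick the row, then the column within it.
print(rows[2])       # ['2019-01-28', 37595]
print(rows[2][1])    # 37595

# Pulling out a whole column means visiting every row.
count_column = [row[1] for row in rows]
print(count_column)        # [2767, 20250, 37595]
print(max(count_column))   # 37595
```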

Interacting with even this simple data in this format is a little cumbersome. We can appeal to a higher-level object to create a proper table for us. You are probably familiar with Excel or some spreadsheet. These programs are all about tables. In Python, the answer to Excel (or a popular answer) is a so-called Pandas **DataFrame**. Pandas refers to a package contributed by a Python developer who wanted to make working with tabular data easier. 

[You can read more about Pandas here](http://pandas.pydata.org/)

[And there are simple tutorials here](http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%201%20-%20Reading%20from%20a%20CSV.ipynb)

Pandas is a **package**, which means its author has published data, functions and a host of new objects for the community to use. Whereas the built-in objects are basic and get us pretty far, often we need something special to make our lives easier. In the case of Pandas, an object of type DataFrame will help us manipulate (compute with, make graphs of, etc.) simple tabular data. 

We can use the `times` object (the list of lists) and turn it into a DataFrame using the function `DataFrame()`. (Yeah, that might be confusing -- the type of the object is "DataFrame" and the name of the function to turn your data into an object of that type is also called "DataFrame". This is a fairly common naming convention, and functions like this are called "constructors.") As arguments, it takes the data itself (the list of lists) and then optionally a list of strings that represent the column names.

We **import** the function "DataFrame" from the pandas package first. The import command is giving us super powers from the Pandas package to do things not built into the basic Python system. We will see this construction a lot.

In [None]:
from pandas import DataFrame

tweets = DataFrame(times, columns=["time", "count"])
tweets

Notice that the way our data looks has changed. It's much more like an actual table now with column headings and the like. The DataFrame has lots of wonderful things you can do to it -- lots of ways to compute with the data contained in the underlying table. 

One simple thing is just to get its size. How many rows and columns? This is an attribute, information, stored with the object that we can again access with "dot" notation. Because we are looking up information and not computing something (like making strings lowercase, say), we don't need parentheses.

In [None]:
tweets.shape
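To make the constructor and the `shape` attribute concrete, here is a minimal sketch on a tiny table we build by hand. The name `sample_times` and the timestamps and counts in it are made up for illustration; they are not the `times` data from above.

```python
from pandas import DataFrame

# hypothetical rows of [timestamp, tweet count] -- made-up values for illustration
sample_times = [
    ["2019-01-28 18:00", 12],
    ["2019-01-28 18:10", 30],
    ["2019-01-28 18:20", 21],
]

# the constructor takes the list of lists and (optionally) the column names
sample = DataFrame(sample_times, columns=["time", "count"])

# shape is an attribute (no parentheses): a pair of (rows, columns)
n_rows, n_cols = sample.shape
print(n_rows, "rows and", n_cols, "columns")

# the column names we supplied to the constructor
print(list(sample.columns))
```

Because `shape` is a pair of numbers, we can unpack it into two variables, one for the rows and one for the columns.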

**Aside -- Installing Packages**

Python has a set of built-in functionality and data types that it knows about. We have seen some really basic things so far -- numbers and `print()`ing, say. The power of the platform is that people are constantly adding new functionality, making way for new kinds of data and new kinds of computation. This new capacity is organized into packages. Hence, `pandas`. 

Now some packages come with Python, some are added by Anaconda and still others you have to install yourself. You can search through the collection [here](https://pypi.org/). In particular, we are going to add plotting functionality to our notebook. It's basic so don't get overly excited yet. The package uses the service `plot.ly`. 

Below we use a "UNIX shell command" called `pip` to install the Python package `plotly` that provides us with access to its plotting facilities from within Python. You can read a bit about it [here](https://plot.ly/python/).

In [None]:
%%sh 
pip install plotly
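If you are not sure whether a package is already available, one way to check from inside Python is sketched below. The helper name `is_installed` is our own invention for this example; it leans on `find_spec()` from the standard `importlib` package, which looks for a package without actually importing it.

```python
from importlib.util import find_spec

def is_installed(package_name):
    """Return True if Python can find a top-level package with this name."""
    return find_spec(package_name) is not None

# json comes with Python, so this should be found
print("json:", is_installed("json"))

# plotly is only found if it has been installed on your machine
print("plotly:", is_installed("plotly"))
```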

Now, we `import` a few functions. The first line gives us a function that lets us sign in (it's OK to use my credentials for these notebooks, but for your projects please visit `plot.ly` to get your own). The second line of imports relates to kinds of plotting -- we are going to create a figure that contains a single scatterplot. (The `Scatter()` function is in a list because you could have multiple lines on the graph -- we'll see that shortly.)

In [None]:
from plotly.plotly import iplot, sign_in
from plotly.graph_objs import Scatter, Figure

# sign into the service (get your own credentials!)
sign_in("cocteautt","8YLww0QuMPVQ46meAMaq")

# create a plot of a single line tracking tweets over time
myplot_parts = [Scatter(x=tweets["time"],y=tweets["count"])]

# make a figure from this line plot...
myfigure = Figure(data=myplot_parts)

# ... and plot it (the filename is a convention plotly needs in case you want to use it later)
iplot(myfigure,filename="Learn to code")

**9. More with DataFrames**

Writing out data like we did to create the `tweets` data frame is really limiting. Instead, we can read data into a data frame from a variety of formats. The easiest is a [CSV](https://en.wikipedia.org/wiki/Comma-separated_values).

We have pulled tweets from Twitter and binned the counts into 10-minute intervals and stored them in a [CSV file](https://github.com/computationaljournalism/columbia2019/raw/master/data/learn_counts.csv). Click on the link and have a look. For each row in the file, you will see two fields separated by a comma. The first row of the file is called a "header" and gives you the names of the variables recorded in each row. So you will see `time` and `count`. There are two entries in the header so each row has two entries.

Each row after the first represents a ten minute period from the last few days, arranged so that the most recent are first and the oldest appear last in the file. Following the names in the header, the first entry in each row is the "created at time" and the second is a count of references to "learn to code" or #LearnToCode or #Learn2Code. Each row arranges the data about its time period according to the labels in the first row, and separates the entries by a comma. Hence CSV.
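To see that a CSV really is just text with a header line and comma-separated rows, here is a small sketch that pulls one apart using Python's built-in `csv` package. The three data rows in `sample_text` are made up for illustration (they are not taken from the real file), and `StringIO` simply lets us treat a string as if it were a file.

```python
import csv
from io import StringIO

# a made-up miniature of the file: a header line, then one line per time period
sample_text = (
    "time,count\n"
    "2019-02-01 10:20,14\n"
    "2019-02-01 10:10,9\n"
    "2019-02-01 10:00,11\n"
)

# csv.reader splits each line at the commas, giving us a list of lists
rows = list(csv.reader(StringIO(sample_text)))

header = rows[0]     # the first row names the variables
records = rows[1:]   # every row after that is one time period

print(header)
print(records[0])
print(len(records), "time periods")
```

Notice that every entry comes back as a string, even the counts. One of the conveniences of `read_csv()` is that it works out which columns hold numbers.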

In the cell below, we first import the function `read_csv()`. Unlike `DataFrame()`, which builds a DataFrame from data already in Python, `read_csv()` takes a CSV file and creates a DataFrame from it. Its argument is either the URL of a CSV or the location of a CSV file on your computer. Here we supply the URL on GitHub.

In [None]:
from pandas import read_csv

# read in the tweets from the CSV file in our github data directory
tweets = read_csv("https://github.com/computationaljournalism/columbia2019/raw/master/data/learn_counts.csv")
print(type(tweets))

In [None]:
tweets.shape

So we have 1,403 time periods (rows in the table) and 2 variables recorded for each. We can have a look at the "top" and "bottom" of the data set. These are printed with the `head()` and `tail()` methods.

In [None]:
tweets.head()

In [None]:
tweets.tail()

The `head()` and `tail()` methods of a DataFrame give you five time periods from the start and end of the data (and you can give an argument to see more or fewer). It's important to look at the top and bottom of the file to check that everything looks consistent (column entries seem to mean what they should) and to see how the data might be organized.
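Here is a small sketch of passing an argument to `head()` and `tail()`. To keep it self-contained we read a made-up CSV from a string (via `StringIO`) rather than from the course file, so the timestamps and counts are purely illustrative.

```python
from io import StringIO
from pandas import read_csv

# made-up data in the same two-column layout, most recent period first
sample_text = (
    "time,count\n"
    "2019-02-01 10:40,8\n"
    "2019-02-01 10:30,3\n"
    "2019-02-01 10:20,14\n"
    "2019-02-01 10:10,9\n"
    "2019-02-01 10:00,11\n"
)

sample = read_csv(StringIO(sample_text))

# ask for two rows from the top and three from the bottom
first_two = sample.head(2)
last_three = sample.tail(3)

print(first_two)
print(last_three)
```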

We can now have a look at these values in a plot. Again, basic, but it motivates our investigations. Each point on the line is a time period. What should we be asking?

In [None]:
# create a plot of a single line tracking tweets over time
myplot_parts = [Scatter(x=tweets["time"],y=tweets["count"])]

# make a figure from this line plot...
myfigure = Figure(data=myplot_parts)

# ... and plot it (the filename is a convention plotly needs in case you want to use it later)
iplot(myfigure,filename="Learn to code")

In the code above we have "subset" the `tweets` data frame, pulling out one column for the x-axis and one for the y-axis. We will cover this in more detail shortly, but you can subset the columns by simply using square brackets and including the name of the column you want. If you want multiple columns, you put their names in a list.

In [None]:
tweets["time"]
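As a sketch of the difference between asking for one column and asking for several, consider the tiny table below. The column names `day` and the values are made up for this example. A single name in the brackets hands back one column; a list of names hands back a smaller table.

```python
from pandas import DataFrame

# a made-up table with three columns
sample = DataFrame(
    [
        ["2019-02-01 10:00", 4, "Friday"],
        ["2019-02-01 10:10", 6, "Friday"],
        ["2019-02-02 09:00", 5, "Saturday"],
    ],
    columns=["time", "count", "day"],
)

# one column: a single name inside the square brackets
one_column = sample["count"]
print(type(one_column))

# several columns: a list of names inside the square brackets
two_columns = sample[["time", "day"]]
print(type(two_columns))
print(two_columns.shape)
```

The doubled brackets in the second case are not special syntax -- the outer pair does the subsetting and the inner pair is just an ordinary Python list.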

Just to cement ideas, we can also split out retweets from tweets. So if a tweet was a reply or an original tweet, we include it in the `tweet_count` total below and otherwise if it is a retweet we count it in the `retweet_count` column.

In [None]:
tweets2 = read_csv("https://github.com/computationaljournalism/columbia2019/raw/master/data/learn_counts2.csv")
tweets2.head()

In [None]:
# create a plot with two lines: retweets and tweets over time
myplot_parts = [Scatter(x=tweets2["time"],y=tweets2["retweet_count"],name="retweets"),
                Scatter(x=tweets2["time"],y=tweets2["tweet_count"],name="tweets")]

# make a figure from these two lines...
myfigure = Figure(data=myplot_parts)

# ... and plot it (the filename is a convention plotly needs in case you want to use it later)
iplot(myfigure,filename="Learn to code")

**10. One more built-in type -- Dictionaries**

When we discussed lists, we stored facts about Donald Trump Jr.'s addition to the meme on Twitter. Do you remember what the different entries represented?

In [None]:
djt = [
    "Mon Jan 28 18:50:14 +0000 2019",
    "Could someone explain to me why if I tell my kids to 'learn to code' it’s likely sound parenting, but if I told a journalist the same it’s grounds for a @twitter suspension?",
    "1089958848742518785",
    7565,
    31098,
    False,
    "Twitter for iPhone"
]
  
djt

In some cases, like a literal dictionary (Websters?) we prefer to store data under "keys" like words rather than in a simple order. If we had to look up data in the dictionary by its numeric order in the text, we'd be lost. Instead, we search by word. This is the idea behind a dictionary in Python -- store the data according to words.

Here is the `djt` data, but stored in dictionary form. Notice that instead of surrounding the content with square brackets, we enclose the structure with curly braces. Inside, we have a list of *key* and *value* pairs, each joined with a colon. The individual entries are again separated by commas. 

Have a look.

In [None]:
djt = {
     "created_at":"Mon Jan 28 18:50:14 +0000 2019",
     "full_text":"Could someone explain to me why if I tell my kids to 'learn to code' it’s likely sound parenting, but if I told a journalist the same it’s grounds for a @twitter suspension?",
     "id":"1089958848742518785",
     "retweet_count":7565,
     "favorite_count":31098,
     "retweet":False,
     "source":"Twitter for iPhone"
}

We can then ask what keys the dictionary contains...

In [None]:
djt.keys()

... and extract data using any one of the keys.

In [None]:
djt["source"]
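A few more things you can do with a dictionary are sketched below, using a smaller dictionary built from three of the entries above. The name `tweet_facts` and the added `"note"` entry are ours, just for illustration.

```python
# a smaller dictionary using three of the key/value pairs from above
tweet_facts = {
    "source": "Twitter for iPhone",
    "retweet_count": 7565,
    "favorite_count": 31098,
}

# ask whether a key is present before looking it up
print("source" in tweet_facts)
print("place" in tweet_facts)

# get() looks up a key but returns a fallback value if the key is missing
print(tweet_facts.get("place", "unknown"))

# add a new key/value pair (or replace an old one) with square brackets
tweet_facts["note"] = "part of the learn to code meme"
print(tweet_facts.keys())
```

Asking for a missing key with plain square brackets, as in `tweet_facts["place"]`, would raise a `KeyError`, which is why `in` and `get()` are handy when you are not sure what a dictionary holds.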

In [None]:
# Pick another key and extract the value associated with it...



Now, download another "classic" example of a dictionary -- in this case, a tweet. Tweet data are stored as [JSON files (JavaScript Object Notation)](https://www.json.org/) but it's really really close to just a Python dictionary. Let's download Trump Jr.'s tweet from [our GitHub site](https://github.com/computationaljournalism/columbia2019/raw/master/data/djt.json) and put it in the same folder as your notebook.

You can have a look, it's a file with a single line. We can read it like we did Trump's tweets last time.  

In [None]:
tweetstring = open("djt.json").read()
print(type(tweetstring),"\n")
print(tweetstring)

OK so we have a string. Great. And it seems to have a lot of information in it. We see numbers and strings in the data. We also see list structures -- those square brackets we learned about last time. And then there are our curly braces or { }'s. Again, while square brackets marked off lists, curly braces mark off dictionaries. 

To explore a bit, let's turn the tweet string into a Python object. We will `import` two functions from the package `json`: `loads` and `dumps`. The former takes a string and makes Python objects; the latter takes a Python object (made up of built-in parts like numbers and lists and booleans and dictionaries) and turns it into a string.

In [None]:
from json import dumps,loads

tweet = loads(tweetstring)
print(type(tweet))
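To see what `loads` does with the different pieces of a JSON string, here is a sketch on a tiny made-up example. The string `sample_string` is hypothetical and far simpler than a real tweet, but it has the same ingredients: strings, numbers, a boolean, a list, and a dictionary nested inside a dictionary.

```python
from json import loads

# a made-up JSON string -- not a real tweet
sample_string = (
    '{"id": "1", "text": "hello", "truncated": false, '
    '"user": {"screen_name": "example_user", "followers_count": 10}, '
    '"hashtags": ["first", "second"]}'
)

sample_tweet = loads(sample_string)
print(type(sample_tweet))

# JSON's lowercase false becomes Python's False
print(sample_tweet["truncated"])

# a dictionary inside a dictionary: use one key after the other
print(sample_tweet["user"]["screen_name"])

# a list inside a dictionary: a key, then a position
print(sample_tweet["hashtags"][0])
```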

Again, we look up values in a dictionary by their "keys". To list the keys themselves we call `keys()` which, not surprisingly, is a method for the dictionary type of object. To pull out a value, we use the same square-bracket notation we used for subsetting a list, but instead of a position we put the name of the data we want. 

Here are our choices of names, keys, that are associated with a tweet. What do you see? We then pull some data from the djt tweet.  If you need some help, consult [Twitter's description of their tweet objects. ](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object)

In [None]:
tweet.keys()

The text of the tweet...

In [None]:
tweet["full_text"]

... a string representing when the tweet occurred (in UTC)...

In [None]:
tweet["created_at"]

... and even a boolean to suggest if it has been truncated to the new 280 character limit or not.

In [None]:
tweet["truncated"]

In [None]:
# your turn -- pull some data and tell us what you find



In [None]:
# extract some data about Trump Jr. from his user information that is bundled with the tweet



Now, we can turn this from a dictionary, made up of these simple, built-in data types, back to a string that we could dump into a file. The command is "dumps" and there's an argument "indent" that might make the whole thing a little more readable. 

In [None]:
# default, not so pretty

print(dumps(tweet))

In [None]:
# some indentation to mark off the different components, much more pretty :)

print(dumps(tweet,indent=5))
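As a sketch of the full round trip, the cell below turns a small dictionary into a string with `dumps`, writes that string to a file, reads it back and rebuilds the dictionary with `loads`. The dictionary `sample` is made up, and we write into a temporary folder (the `tempfile` package is part of Python) so nothing is left behind in your notebook folder; in your own work you would just pick a filename.

```python
import os
import tempfile
from json import dumps, loads

# a made-up dictionary built only from built-in types
sample = {
    "text": "hello",
    "retweet_count": 3,
    "truncated": False,
    "hashtags": ["first", "second"],
}

# turn the dictionary into an indented string
pretty = dumps(sample, indent=2)
print(pretty)

# write the string to a file in a temporary folder
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample_tweet.json")
with open(path, "w") as outfile:
    outfile.write(pretty)

# read the file back in and rebuild the dictionary
with open(path) as infile:
    restored = loads(infile.read())

print(restored == sample)
```

The indentation only changes how the string looks; `loads` gives back the same dictionary either way.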

**To launch you into the world...**

We have created a CSV file where each row is a tweet. This file is big so you have to download it. I put it up [on Dropbox](https://www.dropbox.com/s/ene3qllvkwzolof/learn_tweets.csv?dl=0) -- download it and put it in the same folder as your notebook. Then you should be able to read it in using the commands below.

In [None]:
from pandas import read_csv, set_option

# set the maximum number of characters in any cell
set_option("display.max_colwidth", 280)

tweets = read_csv("learn_tweets.csv")

In [None]:
tweets.head()

In homework, we will give you some tools to help you extract the content from this table and ask about tweeting activity. Who tweeted the most in this little episode? Who was the most tweeted *at*? For example, we can figure out who tweeted the most on this topic with a method called `value_counts()`.

In [None]:
tweets['screen_name'].value_counts()
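To see what `value_counts()` hands back, here is a sketch on a tiny made-up table. The screen names and texts are invented for this example and have nothing to do with the real file.

```python
from pandas import DataFrame

# a made-up table of six tweets from three invented accounts
sample = DataFrame(
    [
        ["ana", "first"],
        ["ben", "second"],
        ["ana", "third"],
        ["cy", "fourth"],
        ["ana", "fifth"],
        ["ben", "sixth"],
    ],
    columns=["screen_name", "text"],
)

# count how many rows each screen name appears in, largest count first
counts = sample["screen_name"].value_counts()
print(counts)

# the most frequent name and its count sit in the first position
top_name = counts.index[0]
top_count = counts.iloc[0]
print(top_name, top_count)
```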

A bit later we'll see how we can subset not just columns but also rows. Here we pull out the rows belonging to two of the top retweeters.

In [None]:
tweets[tweets["screen_name"]=="ham_gretsky"]

In [None]:
tweets[tweets["screen_name"]=="SherlockRobo"]
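The row subsetting above happens in two steps, which the sketch below pulls apart on a small made-up table (the names and counts are invented). First the comparison produces a column of `True` and `False` values, one per row; then putting that column inside the square brackets keeps only the rows marked `True`.

```python
from pandas import DataFrame

# a made-up table of tweets
sample = DataFrame(
    [
        ["ana", "first", 12],
        ["ben", "second", 0],
        ["ana", "third", 3],
        ["cy", "fourth", 25],
    ],
    columns=["screen_name", "text", "retweet_count"],
)

# step 1: compare every entry in the column to a value
is_ana = sample["screen_name"] == "ana"
print(list(is_ana))

# step 2: keep only the rows where the comparison was True
ana_rows = sample[is_ana]
print(ana_rows)

# the same idea works with numbers
popular = sample[sample["retweet_count"] > 10]
print(popular)
```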