Introduction 
-------------

**IPython or Jupyter Notebook**

As we have mentioned, in this class our main language for computation will be [Python](http://www.python.org). It was chosen because of its emphasis on **readability and code sharing.** These and other desirable aspects of the language have helped attract a large community of users who have contributed an incredible range of capabilities to the language. Community-written code will let us pull data from web pages and PDFs, it will let us manipulate tweets and other posts from social media, and it will even give us the capacity to analyze images and sound. *The more we see of the world as data, as open to computation in some way, the more our reporting skills expand and the deeper our stories become.*

We will program in Python (assemble Python expressions) using the Jupyter notebook -- the name "Jupyter" being a mix of [Julia](http://julialang.org/), [Python](https://www.python.org/) and [R](https://www.r-project.org/), the three languages it originally supported. Programming with the notebook is often referred to as "literate computing" -- by that we mean that you code a little, have a look, write a little, come up with more ideas, code a little more, write a little more and so on. To support this, there are two kinds of "cells": one for writing and one for programming. 

The writing is done in **Markdown cells**. Markdown (as opposed to Mark-up) is a language that lets you write in a plain text editor (in the simplest case, Notepad or TextEdit would work fine too), with simple typographical shorthands to **make text bold** or *to put text in italics*. You can also make lists like

+ Bread
+ Milk
+ Dog food
+ Swiffer pads

These Markdown conventions are then translated into HTML -- the upshot being that you don't have to know anything about HTML to create documents that look reasonably good, certainly good enough for your reporting notes. 

Double click in this window to see the "raw" Markdown. Notice that you can still recognize lists and emphasized text from the Markdown additions, and that's the other point of this. Your documents, while written in plain text, make use of typographical conventions that make the document's highlighting understandable even without translation to HTML. That's a good trick! 

You can find [the Markdown description here](http://daringfireball.net/projects/markdown/). Don't forget that for our last class, you were expected to have gone through the [Markdown Tutorial](http://markdowntutorial.com). There might be other learning resources that we should share with the class, so let us know if you find something really helpful!

*To re-render the Markdown in this cell into HTML, click in the cell and hit Shift-Enter to execute the transformation.*

One last note. Many organizations (scientific research labs, journalistic organizations, and so on) are "publishing" dual works -- one that goes into an official journal like Science or The New York Times summarizing a set of computations, and then another that is published in the form of a notebook that documents the author's computational work. Here are some examples from science and journalism.

>[Fact-Checking Facebook Politics Pages — Analysis](https://github.com/BuzzFeedNews/2016-10-facebook-fact-check/blob/master/notebooks/facebook-fact-check.ipynb). This analysis notebook was created by Jeremy Singer-Vine for a Buzzfeed article [Hyperpartisan Facebook Pages Are Publishing False And Misleading Information At An Alarming Rate](https://www.buzzfeed.com/craigsilverman/partisan-fb-pages-analysis)

>[The Need for Openness in Data Journalism](http://nbviewer.jupyter.org/github/brianckeegan/Bechdel/blob/master/Bechdel_test.ipynb) by Brian Keegan. This is a little old, but Keegan makes good points about the benefits of working with a notebook.

>[The Architecture of Jupyter -- Interactive by design](http://scisoftdays.org/pdf/2016_slides/perez.pdf) Starting on page 23 of this PDF, Fernando Perez, one of the designers of Jupyter, describes how notebooks have been published alongside papers in journals like Nature and Science and Scientific American, to name a few.

You get the idea. There are an increasing number of projects like this -- the fact that you can take someone else's notebook and examine the steps they followed to arrive at their conclusions is, ultimately, an important step toward transparency in data or computational journalism. The notebooks become objects of coordination.

**Introduction to Python: Some basic data types**

Whether we code using a notebook or some other interface, our basic language will be Python. As we mentioned in class, Python is an **object-oriented language**. Software objects are a kind of programming abstraction, a particular way of organizing information and actions. Software objects try to mimic the notion of objects in the physical world -- that means they contain properties or **data** and also might have certain operations or **methods** that you can use to transform the object in some way. We used the example of one of our lovely purple rolling chairs -- it has operations (it can support you when you sit, it has a foldable arm for writing, and it can move around the room) as well as "data" (it has a seat height, a desk height, maybe even an RGB value specifying the lovely purple color).

Python has a series of built-in types of objects, meaning certain types of information that are so basic that they are needed by just about every programming exercise you'll attempt. What might those be?

As we have seen, some of the basic or built-in types of objects are really really basic like numbers (counting numbers or, more generally, integers, like 2, 5 and -3; as well as "real" numbers, those having a decimal point, like 5.67) and text strings (sequences of characters like "Is he really going to make us read Trump's Tweets?"). These data have *types* 'int' and 'float' and 'str' respectively. You can tell the type of any object in Python with a built-in function type(). Yes, some operations are so common or basic that they are also "built-in" as *functions*. 

In the lines below, we create strings (objects of type 'str') by surrounding some text with quotation marks. They can be double, single (or even triple double quotes) and Python doesn't care which as long as they match at the beginning and the end of the string. The built-in statement "print" will, well, print whatever follows it. We can print groups of things by separating them with commas.

(For the record, we're now about to see three built-in types -- integers, real or floating point numbers, and strings -- a built-in function, type(), and a built-in statement, print. And it's only our first few cells of code!)

In the first case, we look at a sequence of characters that constitute a string.

In [None]:
print "Is he really going to make us read Trump's Tweets?"
print type("Is he really going to make us read Trump's Tweets?")

From strings we move to integers, numbers like 1, 5 or -3.

In [None]:
print 2
print type(2)

Change the 2 above to some other *integer* value and see that the type is the same. Now, here is a real or floating point number like 2.1 or 3.14159.

In [None]:
print 5.7
print type(5.7)

Change the 5.7 above to some other number with a decimal point and see that the type is the same. 

It's worth being a little precise here -- although it takes us off our main course briefly. A *floating point* number is a representation of real numbers on the computer. This has some complexity because some numbers have decimal representations that never end. Take 1/3 for example -- it's 0.333333... and the 3's repeat forever. A computer can't deal with never-ending things and has to make compromises. The video below explains the difference between a *real* number like 1/3 and its *floating point* representation -- as told by a charming British man.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('PZRI1IfStY0')

Note that the example he references in the video, adding 0.1 and 0.2, also happens in Python. The issue isn't with the language; it's with the way the computer represents numbers -- and what happens when a number has a really long representation.

In [None]:
0.1+0.2
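
To make the consequence concrete, here is a small sketch (not something from class) of why comparing floating point numbers with == can surprise you, and one common workaround. The tolerance value 1e-9 is an arbitrary assumption, and each print() is given a single argument so the cell behaves the same under Python 2 and Python 3.

```python
# Floating point sums can be off by a tiny amount.
total = 0.1 + 0.2
print(repr(total))      # repr shows the stored digits rather than a rounded display

# A direct comparison is fragile...
exactly_equal = (total == 0.3)
print(exactly_equal)    # False

# ...so compare within a small tolerance instead (1e-9 is an arbitrary choice).
tolerance = 1e-9
close_enough = abs(total - 0.3) < tolerance
print(close_enough)     # True
```

The practical habit is to ask whether two computed numbers are "close" rather than whether they are identical.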

Let's leave that aside for the moment because we never want to stray too far from why we're learning this in the first place. Our goal is to use computation to heighten our understanding of the world around us. We will take an occasional detour to discuss nerdier issues since you'll see them and might wonder "Why?".

Returning to the built-in data types, there's one more important one. It has only two values.

In [None]:
print True
print type(True)

The data type above is called *Boolean.* It represents just two states -- true and false. Boolean data are generated by typing the special sequences of characters <font face=courier>True</font> or <font face=courier>False</font> (without quotations because they are not strings). Change the <font face=courier>True</font> in the above expression to <font face=courier>False</font> to make sure that it, too, is a Boolean value.

You will primarily encounter Booleans as the output of some **logical expression.** Here are some examples of expressions that return Boolean (True/False) data. Try the expressions below -- each asks whether a relationship holds or not, is true or false.

*Riddle me this: Is 3 bigger than 5?*

In [None]:
print 3 > 5

*Riddle me this: Is 10 smaller than 100?*

In [None]:
print 10.0 < 100.0

*Riddle me this: Is the letter 'e' in 'Jeb Bush'?*

In [None]:
print "e" in "Jeb Bush"

*Riddle me this: Is the letter 'z' in 'Donald Trump'?*

In [None]:
print "z" in "Donald Trump"

The first two logical expressions in the code above make **comparisons** while the last two test for **membership**. 

The Boolean type will be important when we start to write code that "branches" its behavior depending on whether some condition is true or false -- we might want to take one action if something is true, but another action if that thing is false. For example, we might want to analyze only the tweets coming from the President's mobile device and would use a Boolean to separate out those cases.
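
As a preview of that kind of branching, here is a minimal sketch. The if/else construction is not something we have covered yet, and the variable names (source, is_android, label) are made up for illustration.

```python
# A sample "source" value; treat it as a stand-in for one tweet's source field.
source = "Twitter for Android"

# A membership test gives us the Boolean we branch on.
is_android = "Android" in source

if is_android:
    label = "sent from an Android phone"
else:
    label = "sent from somewhere else"

print(label)
```

Change the source string to "Twitter Web Client" and the other branch runs instead.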

With this in mind, sometimes, you will want to take action based on a combination of conditions. For this, we use "and", "or" and "not" to build more complicated expressions. 

*A series of logical expressions joined with 'and', for example, is True only if all the expressions are True.*

In [None]:
print "e" in "Jeb Bush" and 3 < 5
print "z" in "Jeb Bush" and 3 < 5

*A series of conditions joined with 'or' is True if at least one expression is True.*

In [None]:
print "u" in "Jeb Bush" or 3 < 5
print 3>10 or 5>100

*And you can flip from True to False (and vice versa) using the word 'not'. This is also called "negation".*

In [None]:
print not "u" in "Jeb Bush"
print 3<10 and not 5>100

*As with simple algebraic expressions like (1+3)\*5, we can use parentheses to make sure our expressions are evaluated in the right order.*

In [None]:
print 6 > 5 or (2 > 5 and "u" in "Trump")

Make sure you see why this evaluated to True. Finally, keep in mind, all of these expressions return a Boolean object.

In [None]:
print type(3<10 and not 5>100)
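
If you are unsure how Python groups a long logical expression, the following sketch may help. It relies on Python's precedence rules: "not" is applied first, then "and", then "or". The variable names are only for illustration, and print() is used with a single argument so the cell runs the same under Python 2 and Python 3.

```python
# Without parentheses, "and" is evaluated before "or".
no_parens = True or False and False        # read as: True or (False and False)
with_parens = (True or False) and False    # parentheses force the "or" first

print(no_parens)      # True
print(with_parens)    # False

# "not" binds more tightly than "and" and "or".
negated = not False and False              # read as: (not False) and False
print(negated)        # False
```

When in doubt, add the parentheses; they cost nothing and make your intent visible to the next reader.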

Experiment a little on your own and make sure you understand how these expressions work.

In [None]:
# Your work here



**Operators**

Technically, all of these symbols (">", "<", "in", "and", "or" and "not") are examples of **operators** in Python. The simplest kind of operators are arithmetic. They probably would have been a better place to start (as we did in class) -- they underscore the idea that Python acts like a big calculator. Here "+" and "\*" and "/" are called **arithmetic operators.**

In [None]:
print 3+10
print 100*5
print 3*(100+2.5)

As we have seen, Python also 'overloads' these arithmetic operators to do something sensible with various data types. For example, a common snag for Python beginners is that division with integer data leaves off the remainder. 

Below, 100/3 is just 33. But if at least one number in the expression is a floating point number, then Python 'promotes' all the numbers to floating point (2 becomes 2.0, say) and you get the fractional part back. Small point.

Compare these calculations. Operators at work!

In [None]:
print 100/3
print 100.0/3
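
A short sketch of ways to be explicit about which kind of division you want. The operators // and %, and the built-in divmod(), were not covered above; they are included here as an aside. They behave the same in Python 2 and Python 3, whereas a bare / between two integers returns a floating point number in Python 3.

```python
whole = 100 // 3         # floor division: drops the remainder
leftover = 100 % 3       # modulo: just the remainder
both = divmod(100, 3)    # (quotient, remainder) in one step
exact = 100.0 / 3        # a float operand gives a floating point answer

print(whole)       # 33
print(leftover)    # 1
print(both)        # (33, 1)
print(exact)
```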

As we have seen, this 'overloading' can be surprising. Take arithmetic operations on strings, for example...

In [None]:
print 5*"Tweets "
print "Trump's"+" tweets"

You can read about the various operators in Python [here](http://www.tutorialspoint.com/python/python_basic_operators.htm). It's a clean summary.

**Variables**

Creating data in cells and printing it out is awesome and all, but most of what you will be doing in object-oriented programming is making and evolving objects. We can take any of the objects above and store them in a **variable**, literally associating them with a name that we can use to refer to later. The equals sign here is called an **assignment operator**. 

Notice that you can use variables to catch the output of all the expressions we've seen so far.  

In [None]:
tweet = "Just tried watching Saturday Night Live - unwatchable!"
ise = "e" in tweet

print tweet
print "Is 'e' in this post?", ise

A new thing in the cell above. We printed out several different expressions in one line, separating each with a comma. You can chain several expressions in this way to make a tidier printout.

Anyway, more examples of variables and operators -- this time with numeric data...

In [None]:
x = 4.3
y = 100.2

print "Adding x and y to get", x+y

... or a mix of types if the expression makes sense.

In [None]:
w = 5
x = "Trump\t"
y = "Multiplication with strings:\n"

print y+x*w

In [None]:
print u"\U0001f63b"

The cell that multiplied a string introduces something new. The "\n" in that expression represents a **newline** character. It, well, moves your cursor to the start of a new line, and hence we see the five "Trump"'s on a separate line. We also see another special character "\t" for **tab**. 

In general, the "\" is an **escape character**. It "encodes difficult-to-type characters into a string" like a tab or a new line. As another example, you can use \' and \" to encode quotes inside a string... 

In [None]:
print "He said, \"I cannot believe we are looking at Trump's tweets\""

Here we used double quotes to start and end our string and so we had to "escape" the double quotes on the inside of the string. Otherwise Python would confuse them for the start or end of our string. You can read more about escape sequences [here](https://learnpythonthehardway.org/book/ex10.html). But for the moment, it's enough to know that the backslash is a special character so that you won't be surprised when you see it. 

(Like when we work with emoji and need to declare them in "Unicode" with the "\U" escape sequence. More on this later.)

In [None]:
print u"\U0001f601", u"\U0000270c", u"\U0001f370", u"\U0001f423"
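
One way to convince yourself that an escape sequence is a single character, even though you type two, is to measure the string with len(). This is a small sketch with made-up variable names.

```python
text = "Trump\t"
print(len(text))      # 6: five letters plus one tab character

print(len("\n"))      # 1: backslash-n is a single newline character

quoted = "He said, \"hello\""
print(quoted)

# With single quotes on the outside, the double quotes inside need no escaping.
same = 'He said, "hello"'
print(quoted == same)   # True: both spell the same sequence of characters
```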

Finally, as we mentioned in class, if you ever need help on something, there is a built-in function called, well, help(). Remember in the last cell, we made x a string...

In [None]:
help(type(x))

In [None]:
help(type(w))

Notice in the code above that the variable x was first set to equal the number 4.3 and then it was assigned a string. Python is flexible that way. (It's called **dynamic typing** if you want to get formal -- if you have programmed before you might know that not all languages are so loose.) 

Technically, when you create a variable, the name on the left-hand side of the expression is associated with, or "points to," the value you assign to it. At any time, you can have that name reassigned and point to another object. This flexibility is really handy. Here is an example that drives the point home. Variables are just names that "point to" objects, acting like labels. You can reapply the labels as you see fit...

In [None]:
x = 3
y = x
print "Here x and y are both pointing to the value 3:", x, y

y = 5
print "But now y points to the value 5, and x is unchanged:", x, y

**Objects**

As we said in class, everything in Python is an **object**. Just like objects in the real world, you might be able to guess about the kinds of things you'd like to be able to do to a particular software object. 

As we saw in class, a string (of type 'str') is just a sequence of characters -- what might you want to do to such an object?

In [None]:
tweet = "Just tried watching Saturday Night Live - unwatchable!"

print "There are",tweet.count("e"),"occurrences of the letter e","\n"
print "There are",tweet.count("a"),"occurrences of the letter a","\n"

print "Make the first line all caps:", tweet.upper()
print "Or all lowercase:", tweet.lower(),"\n"

print "Or swap e's for a's:"
print tweet.replace("e","a"),"\n"

print "... or 'un' with 'Super ':"
print tweet.replace('un','Super ')
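
Two things about string methods are worth seeing in a smaller example: they return new strings rather than changing the original, and because of that they can be chained. The text below is made up for illustration and is not real tweet data.

```python
# A short made-up string standing in for a batch of posts.
posts = "RT good morning! #news RT big day! Wow! #jobs"

print(posts.count("RT"))    # 2
print(posts.count("#"))     # 2
print(posts.count("!"))     # 3

# String methods return new strings; the original is left unchanged.
shouted = posts.upper()
print(shouted)
print(posts == shouted)     # False

# Because each method returns a string, calls can be chained.
print(posts.lower().count("rt"))   # 2
```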

OK, even with this tiny bit of programming, we were a little dangerous. In class, we read in Trump's last 1000 tweets as a long string object and counted occurrences of certain string patterns like "RT" or "#" or "!". Rather than continue with strings, we are going to examine another kind of basic object in Python, a **list**.

**From single variables to lists**

It's one thing to store single values (a single number or a single string), but as we know, we tend to collect a lot of data on different aspects of a person or thing in the world - we might offer a survey to 100 people that consists of 10 questions; or we might record facts about the last 100 of Donald Trump's tweets, including the time he tweeted and the number of retweets each earned; or we might look at the last month of the top trending topics on Facebook, recording them every 15 minutes. 

A **list** is simply an **ordered collection** of objects. It has a well-defined first entry, a second entry and a last entry. It can hold different kinds of objects in each position. It is constructed using square brackets [ ]. 

A common mistake for people learning Python is to confuse the parentheses we use to indicate a function like "tweet.count('RT')" with the square brackets we will use to group objects into a list. They look similar so be careful! 😉

For the moment, we are going to look again at Donald Trump's most recent tweets. Again, you can have access to all Trump's Tweets through the [Trump Twitter Archive](http://www.trumptwitterarchive.com/). The underlying data [are stored here.](https://github.com/bpb27/political_twitter_archive/tree/master/realdonaldtrump) 

Returning to lists, here we store the 15 retweet counts for Trump's most recent tweets.

In [None]:
counts = [5181,3811,6146,6859,5133,5396,411,545,5564,8911,9115,22118,6877,5947,12107]

print "The type of 'counts' is",type(counts), "and its length is", len(counts)

The len() function returns the number of elements in a list, or its **length.**

As an object, a list carries both data as well as methods that you can apply. What kinds of things would you like to be able to do to this type of object? 

*Maybe add new objects to the list? append() does that.*

In [None]:
print counts

counts.append(1614)
print counts

*Maybe sort the list? sort() does that.*

In [None]:
counts.sort()
print counts

As a container object (an object that holds or groups other objects), the most obvious set of operations you would like to perform should involve storing and retrieving data from the list. As we said, a list stores objects in a well-defined order. There is a first, a second, a third, and so on. You access these objects using **an index.**  A small catch: Python refers to positions starting at 0 and not at 1. So the first object has index 0, the second has index 1 and so on. 

Here we have a row of data from the Trump Twitter Archive. The archive records the "favorite_count", the "source", the actual "text" of the tweet, the "in_reply_to_screen_name", whether it "is_retweet", when it was "created_at", its "retweet_count", and its Twitter "id_str". The list below holds seven of these values; the "in_reply_to_screen_name" is left out.

Let's look at how we extract data from this container.

In [None]:
x = [21740,"Twitter Web Client","Thank you to General Motors and Walmart for starting the big jobs push back into the U.S.!",
     False,"Tue Jan 17 17:55:38 +0000 2017",5181,821415698278875136]

print x, "\n"

print "The list has", len(x), "elements", "\n"

# the first element in the list has index 0
print "The first element:", x[0], "\n"

# and the fourth has index 3
print "The fourth element:", x[3], "\n"

# and the last has index -1 -- the negative indices count from the right!
print "The last element:", x[-1], "\n"
print "The third from the last:", x[-3], "\n"

# Finally, you can pull more than one element with the : symbol to create a 'slice'
print "From the fourth element to the end:", x[3:], "\n"
print "Up to but not including the third element:", x[:2], "\n"
print "From the fifth up to but not including the seventh element:", x[4:6], "\n"
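
The "up to but not including" rule for slices is easier to check on a tiny list. Here is a sketch using a made-up list of letters; the names are only for illustration.

```python
letters = ["a", "b", "c", "d", "e", "f"]

middle = letters[1:4]       # indices 1, 2 and 3 -- the stop index 4 is excluded
print(middle)               # ['b', 'c', 'd']
print(len(middle))          # 3, which is stop minus start

# Because the stop index is excluded, two slices that share an index fit together exactly.
rebuilt = letters[:2] + letters[2:]
print(rebuilt == letters)   # True

print(letters[-2:])         # ['e', 'f'], the last two elements
```

That last comparison is one reason the convention exists: x[:k] and x[k:] never overlap and never leave a gap.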

Just as you can pull data from a list,  you can also change the contents of one or more elements of a list.

In [None]:
x[0] = 6000
print x, "\n"

x[2:5] = [1,10,100]
print x

Certain operations produce lists. For example, we can divide a character string into pieces by "splitting" on a character using the method split(). This gives us a crude way to pull words from a string that represents a sentence.

In [None]:
line = "Thank you to General Motors and Walmart for starting the big jobs push back into the U.S.!"

# divide into substrings using the space character " " as a breakpoint -- this gives a rough division into words.
rough_words = line.split(" ")
print rough_words

print "There are roughly", len(rough_words), "words in this tweet."

Explain here what might be wrong with this approach to pulling words from text.



Finally, use "e" as a breakpoint to split the string. Make sure you understand what happened here.

In [None]:
print line.split("e")

Make sure you understand lists. Create one and try out forming subsets, changing values and so on.

In [None]:
# put your work here


**Higher-level objects: A DataFrame**

Let's start with a simplified version of a tweet. We'll look at the "source" or client the President used to author the tweet (a desktop or mobile phone, say), the time and day the tweet was posted to Twitter, and the number of retweets it received. So each tweet consists of those three pieces of information. 

We will store each tweet as a list, where the first element is the source, the second is the date and the third is the retweet count. Then, in the cell below, we form a data set consisting of 6 tweets. We store each tweet (each list with the data for a tweet) in a list. Yes, a  **list of lists.**

In [None]:
dat = [
    
    ["Twitter Web Client","Tue Jan 17 17:55:38 +0000 2017",5181],
    ["Twitter Web Client","Tue Jan 17 17:36:45 +0000 2017",3811],  
    ["Twitter for Android","Tue Jan 17 14:52:39 +0000 2017",6146],  
    ["Twitter for Android","Tue Jan 17 14:46:27 +0000 2017",6859],  
    ["Twitter for Android","Tue Jan 17 14:36:26 +0000 2017",5133],
    ["Twitter for Android","Tue Jan 17 14:30:19 +0000 2017",5396]
]

print type(dat)
print len(dat)

This is certainly a fine way to store the data. We can select information about the third tweet by "subsetting" just the third row, say.

In [None]:
dat[2]

# why 2?

We will, from time to time, make our data sets "by hand" like this. Suppose, for example, we wanted to add to our tweet data, simple indicators about whether the President used an exclamation mark or we wanted to count the number of words in the tweet. We might consider building the data row by row as we did above. 

Our data format, the list of lists, is trying really hard to create essentially a **table**. That is, a grid of data, where each row refers to a tweet and then each column refers to a different attribute of the tweet. For our simple tweet data above, that's 6 rows and 3 columns.
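
To see what treating a list of lists as a table involves, here is a sketch using the first three rows from above under the made-up name rows. The last step uses a list comprehension, a construction we have not covered; read it as "visit each row and keep its third entry."

```python
rows = [
    ["Twitter Web Client", "Tue Jan 17 17:55:38 +0000 2017", 5181],
    ["Twitter Web Client", "Tue Jan 17 17:36:45 +0000 2017", 3811],
    ["Twitter for Android", "Tue Jan 17 14:52:39 +0000 2017", 6146],
]

n_rows = len(rows)       # number of tweets
n_cols = len(rows[0])    # number of entries in the first tweet
print(n_rows)            # 3
print(n_cols)            # 3

# One cell: pick the row first, then the position within that row.
third_retweets = rows[2][2]
print(third_retweets)    # 6146

# A whole column takes more work: visit each row and keep its third entry.
retweets = [row[2] for row in rows]
print(retweets)          # [5181, 3811, 6146]
```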

Interacting with even this simple data in this format is a little cumbersome. We can appeal to a higher-level object to create a proper table for us. You are probably familiar with Excel or some spreadsheet. These programs are all about tables. In Python, the answer to Excel (or a popular answer) is a so-called Pandas **DataFrame**. Pandas refers to a package contributed by a Python developer who wanted to make working with tabular data easier. 

[You can read more about Pandas here](http://pandas.pydata.org/)

[And there are simple tutorials here](http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%201%20-%20Reading%20from%20a%20CSV.ipynb)

Pandas is a **package**, which means its author has published data, functions and a host of new objects for the community to use. Whereas the built-in objects are basic and get us pretty far, often we need something special to make our lives easier. In the case of Pandas, an object of type DataFrame will help us manipulate (compute with, make graphs of, etc) simple tabular data. 

We can use the "dat" object (the list of lists) and turn it into a DataFrame using the function DataFrame(). (Yeah, that might be confusing -- the type of the object is "DataFrame" and the name of the function to turn your data into an object of that type is also called "DataFrame". This is a fairly common naming convention, and functions like this are called "constructors.") As arguments, it takes the data itself (the list of lists) and then optionally a list of strings that represent the column names.

We **import** the function "DataFrame" from the pandas package first. The import command is giving us super powers to do things not built into the basic Python system. We will see this construction a lot.

In [None]:
from pandas import DataFrame

tweets = DataFrame(dat,columns=["source","created_at","retweet_count"])
tweets

Notice that the way our data looks has changed. It's much more like an actual table now with column headings and the like. The DataFrame has lots of wonderful things you can do to it -- lots of ways to compute with the data contained in the underlying table. 

One simple thing is just to get its size. How many rows and columns? This is an attribute, information, stored with the object that we can again access with "dot" notation.

In [None]:
tweets.shape

Because we are looking up information and not computing something (like making strings lowercase, say), we don't need parentheses.
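
A small sketch of working with that attribute, assuming pandas is installed. The names rows and small are made up, and the three rows are copied from the data above. The shape comes back as a pair (rows, columns) that you can unpack into two variables.

```python
from pandas import DataFrame

rows = [
    ["Twitter Web Client", "Tue Jan 17 17:55:38 +0000 2017", 5181],
    ["Twitter Web Client", "Tue Jan 17 17:36:45 +0000 2017", 3811],
    ["Twitter for Android", "Tue Jan 17 14:52:39 +0000 2017", 6146],
]
small = DataFrame(rows, columns=["source", "created_at", "retweet_count"])

# shape is an attribute (no parentheses) holding (number of rows, number of columns).
n_rows, n_cols = small.shape
print(n_rows)                 # 3
print(n_cols)                 # 3

# columns is another attribute; len() on a DataFrame counts its rows.
print(list(small.columns))
print(len(small))             # 3
```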

**More with DataFrames**
 

Rather than play around with this shortened data, let's read in all the tweets in the Trump Twitter Archive.
The Trump Twitter Archive gives you the complete set of tweets from Donald Trump's account. They make the data available in two ways -- the first is a table stored in a CSV (comma-separated-values) file.

Have a look at the [CSV file from the archive](https://github.com/bpb27/political_twitter_archive/blob/master/realdonaldtrump/realdonaldtrump.csv), choosing to view the "Raw Data". For each row in the file, you will see a number of fields separated by commas. The first row of the file is called a "header" and gives you the names of the variables recorded on each tweet. So you will see things like "created_at" and the tweets' IDs. There are eight entries in the header, separated by commas, and 8 entries in every subsequent row. 

Each row after the first represents a tweet from the President, arranged so that the most recent are first and the oldest appear last in the file. Following the names in the header, the first entry in each row is the "favorite_count", the second is "source", the platform the President used to send the tweet, and so on. Each row arranges the data about its tweet according to the labels in the first row, and separating the entries by commas. Hence CSV.

For each tweet, however, Twitter actually publishes a lot more data than these 8 fields, so this table represents a subset of the information we get when the President tweets. These 8 were probably chosen because the publishers made most frequent use of these fields in their analysis. The table format is also easy to work with. We'll return to the complete tweets in our next drill.

In the cell below, we first import two functions, read_csv() and set_option(). Unlike DataFrame(), read_csv()  takes a CSV file and creates a DataFrame. Oh and it takes as its argument either the URL of a CSV or the location of a CSV file on your computer. Here we supply the URL on github.

The set_option() function lets us specify options about the way pandas displays a DataFrame. Basically, we can control what the lovely table above looks like. The option we will set says how many characters are printed in each cell. Because we have tweets, we want to allow for up to 140 characters.

In [None]:
from pandas import read_csv, set_option

# set the maximum number of characters in any cell
set_option("display.max_colwidth",140)

# read in the tweets from the CSV file on the Trump Twitter Archive's github page
tweets = read_csv("https://github.com/bpb27/political_twitter_archive/raw/master/realdonaldtrump/realdonaldtrump.csv")
print type(tweets)
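
If you want to see what read_csv() does with the header row without downloading anything, here is a sketch that reads a tiny CSV held in a string. It assumes pandas is installed; the names csv_text and toy are made up, and the two data rows are copied from the small data set earlier. Wrapping the text in StringIO makes it look like a file, which read_csv() also accepts.

```python
from io import StringIO
from pandas import read_csv

# The first line is the header; each later line is one tweet.
csv_text = u"""source,created_at,retweet_count
Twitter Web Client,Tue Jan 17 17:55:38 +0000 2017,5181
Twitter for Android,Tue Jan 17 14:52:39 +0000 2017,6146
"""

toy = read_csv(StringIO(csv_text))

print(list(toy.columns))             # the header became the column names
print(toy.shape)                     # (2, 3): two tweets, three fields
print(toy["retweet_count"].sum())    # 11327
```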

In [None]:
tweets.shape

So we have 30,315 tweets (rows in the table) and 8 variables recorded for each. We can have a look at the "top" and "bottom" of the data set. These are printed with head() and tail() methods.

In [None]:
tweets.head()

In [None]:
tweets.tail()

The head() and tail() methods of a DataFrame give you five tweets from the start and end of the data (and you can give an argument to see more). It's important to look at the top and bottom of the file to check that everything looks consistent (column entries seem to mean what they should) and see how the data might be organized (here it's in time order of the tweet, from newest to oldest).

It's typically a good idea to look at each column in a data set and determine what kind of measurement it represents -- is it qualitative or quantitative? That determination will help you decide what computations make sense and which are a little silly. We can create simple numerical summaries of data that look like numbers (means and so on) using a method of the DataFrame object called describe(). We do this below. What do you notice? Do you see any issues with the summaries presented?

In [None]:
tweets.describe()

The numerical summaries here are applied if Python feels like the column is made of numbers. These summaries are great for quantitative data like favorite_count (What is the average favorite count for Trump's tweets?) but might not make sense for data that look like quantitative measurements but are not. Do you see any of those mistakes above? Answer in the space below.

.

.

.

DataFrames implement a number of methods that help you extract data you might want to look at more closely. As with our list of lists, if we provide a range of numbers (a slice), we can form subsets of rows of the table. Here we select just rows 1000 through 1025...

In [None]:
tweets[999:1025]

We are often interested in extracting other kinds of subsets of data from a table and not just ranges of rows. For example, we might want to pull out all 30,315 favorite_counts in the table. We again use the square brackets, but now we just supply the name of the column we are after (or list of names if we want several columns).

In [None]:
tweets["favorite_count"]

... and maybe compute the mean once we've extracted the values.

In [None]:
tweets["favorite_count"].mean()

Notice that the square brackets are doing very different things. In one case, we are pulling out ranges of rows and in another named columns. The designer of the software decided that these are common uses and wanted to make them easy for you. There is a more formal subset mechanism that we will discuss later. 

You can nest these subsetting operations as well. For example, here are the first five values of retweet_count.

In [None]:
tweets["retweet_count"][:5]
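
The text above mentions passing a list of names to get several columns. Here is a sketch of that, assuming pandas is installed and using a made-up three-row table (the names rows and small are for illustration). A single name returns one column, which pandas calls a Series; a list of names returns a smaller DataFrame.

```python
from pandas import DataFrame

rows = [
    ["Twitter Web Client", "Tue Jan 17 17:55:38 +0000 2017", 5181],
    ["Twitter Web Client", "Tue Jan 17 17:36:45 +0000 2017", 3811],
    ["Twitter for Android", "Tue Jan 17 14:52:39 +0000 2017", 6146],
]
small = DataFrame(rows, columns=["source", "created_at", "retweet_count"])

# A list of names (note the double square brackets) gives a smaller DataFrame.
pair = small[["source", "retweet_count"]]
print(pair.shape)                  # (3, 2)

# A single name gives one column, a Series.
one = small["retweet_count"]
print(type(one).__name__)          # Series

# Nesting: take the column, then slice its first two values.
first_two = small["retweet_count"][:2].tolist()
print(first_two)                   # [5181, 3811]
```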

For qualitative data, we might summarize the different categories based on counts. Here the source of the President's tweets might be something of interest. We can create a tabulation of the different values using a method called value_counts().

In [None]:
tweets["source"].value_counts()

In [None]:
# Pick another qualitative variable and produce a table of counts. 
# What story does it tell?




If we look back at the output of describe() we see that the maximum retweet_count was 352174. That's high. We can count how many tweets had a count higher than 50000, say, by using our comparison operator, making a new data set of True's and False's and then tabulating them.

In [None]:
tweets["retweet_count"] > 50000

In [None]:
(tweets["retweet_count"] > 50000).value_counts()

We can also use Booleans to subset our DataFrame. Let's look at the tweets with over 50,000 retweets. Here we see a third kind of subsetting. First it was ranges of numbers for rows. Then names (or list of names) for columns, and now Booleans to subset those rows that evaluate to True. Inside the square brackets we see the logical expression from two cells back.

In [None]:
tweets[ tweets["retweet_count"] > 50000 ]
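
To watch the three steps (compare, count, subset) on a table small enough to check by eye, here is a sketch that assumes pandas is installed. The names rows, small, mask and big are made up, and the cutoff of 5000 is chosen only so that the tiny table gives a mixed answer.

```python
from pandas import DataFrame

rows = [
    ["Twitter Web Client", "Tue Jan 17 17:55:38 +0000 2017", 5181],
    ["Twitter Web Client", "Tue Jan 17 17:36:45 +0000 2017", 3811],
    ["Twitter for Android", "Tue Jan 17 14:52:39 +0000 2017", 6146],
]
small = DataFrame(rows, columns=["source", "created_at", "retweet_count"])

# Step 1: the comparison produces one Boolean per row.
mask = small["retweet_count"] > 5000
print(mask.tolist())     # [True, False, True]

# Step 2: True counts as 1, so the sum is the number of rows that match.
print(mask.sum())        # 2

# Step 3: putting the Booleans inside square brackets keeps only the True rows.
big = small[mask]
print(len(big))          # 2
```

Storing the comparison in a variable like mask is optional, but it lets you count and subset with the same condition without retyping it.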

So what we have done is extracted data, made simple summaries, and created subsets. Try making a subset on your own using whether the tweet is a retweet or not. Or maybe compare favorites or retweets by the source of the tweet.

In [None]:
# Put your work here

