<img src=https://github.com/computationaljournalism/columbia2019/raw/master/images/rectanguler-wooden-block-500x500.jpg>

A little syntax
----------------

We start today with a little rudimentary work and then a much more open-ended session. The "basics" this time have to do with blocks of code, iteration and conditional evaluation. Skim this if you are feeling Python-powerful. Read more closely if you have been hanging on. 

**Blocks of code: Loops and conditional evaluation**

A block of code is nothing more than a group of Python commands. Typically, this group hangs together because, when executed in sequence, the commands perform a single high-level task or a complete component of some task. Most modern programming languages have some kind of block structure. *Python identifies blocks through common indentation.* This language requirement, forcing common indentation, also makes the code more readable. Remember, Guido wants code that can be shared, that can be read by others.

![stack](http://www.python-course.eu/images/blocks.png)

"So, how does it work? All statements with the same distance to the right belong to the same block of code, i.e. the statements within a block line up vertically. The block ends at a line less indented or the end of \[your notebooks' code cell\]. If a block has to be more deeply nested, it is simply indented further to the right... There is another aspect of structuring in Python, which we haven't mentioned so far, which you can see in the example below. Loops and Conditional statements end with a colon ":" - the same is true for functions and other structures introducing blocks. So, we should have said Python structures by colons and indentation." (Cribbed from the [Python Tutorial](http://www.python-course.eu/python3_blocks.php))

**Conditional expressions**

Blocks of code that are executed only if certain conditions apply are called "conditional blocks" and we'd like to document them formally here. In the cell below we have a simple example -- the code that is indented to the same level (and the notebook helps you here) is all executed if the expression between the "if" and the colon ":" is true. You can put any Boolean expression in here, including operators like `and`, `or` and `not`.

We'll use an example from the trending `#snowday`.

In [None]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">How much snow is outside your window? Tweet us your <a href="https://twitter.com/hashtag/snowstorm?src=hash&amp;ref_src=twsrc%5Etfw">#snowstorm</a> photos and we’ll share them on CTV News at 5. <a href="https://twitter.com/hashtag/ottnews?src=hash&amp;ref_src=twsrc%5Etfw">#ottnews</a> <a href="https://twitter.com/hashtag/snowday?src=hash&amp;ref_src=twsrc%5Etfw">#snowday</a></p>&mdash; CTV Ottawa (@ctvottawa) <a href="https://twitter.com/ctvottawa/status/1095663232705089536?ref_src=twsrc%5Etfw">February 13, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [None]:
x = "How much snow is outside your window?"

if "snow" in x:
    x = x.replace("snow","ice")
    x = x.upper()
    print(x)

Recall that `"snow" in x` is a logical expression that is either `True` or `False`. Depending on which value is returned, the code in the block following the `if` statement is executed or not.

In [None]:
x = "How much snow is outside your window?"

"snow" in x

Things can get a bit more interesting in that you can branch off of an `if` statement. That is, you can take one action when the logical expression is `True` and one when it is `False`. Again, the alternative actions are represented as blocks of code.

In [None]:
x = "How much snow is outside your window?"

if "snow" in x:
    x = x.replace("snow","ice")
    x = x.upper()
    print(x)
else:
    print("This is a lousy tweet about #snowday!")
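Any Boolean expression can drive an `if`/`else`, not just a single `in` test. Here is a small sketch combining `and`, `or` and `not`. The variable names (`tweet`, `has_snow_question` and so on) are made up for illustration.

```python
tweet = "How much snow is outside your window?"

# `and` is True only when both parts are True
has_snow_question = "snow" in tweet and tweet.endswith("?")

# `or` is True when at least one part is True
mentions_weather = "snow" in tweet or "ice" in tweet

# `not` flips True to False and False to True
no_hashtag = not ("#" in tweet)

if has_snow_question and no_hashtag:
    print("A snow question without a hashtag.")
else:
    print("Something else entirely.")
```

Try adding a hashtag to `tweet`, or removing the question mark, and see which branch runs.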

Change `x` above to different strings and make sure it does what you think it should. Next, here is another example where we take one of two actions depending on whether the initial condition is `True` or `False`. There's a lot of indentation in this example, corresponding to **nested blocks** -- you only get to the `if` statement asking whether `x` is larger than 100 if `x` is larger than 25. Make sure you understand what is getting executed when.

In [None]:
x = 300

if x > 25:
    print(x,"is a big number.")
    if x>100:
        print("Actually, it's a really big number")
else:
    print("small number")

Again, change `x` to a few different values and make sure you understand what's happening. 

Finally, we can specify as many subconditions to an `if` statement as we like. That is, it's not just `if-else`, it's `if-elif-elif-...elif-else`. In the cell below, if `x` is larger than 100 we print out `"really big"`. On the other hand, if it's just bigger than 25 (between 25 and 100), we print out `"big"` and otherwise (`x` is less than or equal to 25) we call it `"small"`.  Change the values of `x` to make sure you know what this is doing. Try adding other conditions.

Note that in this case, the subconditions are putting tighter constraints on the value of `x`. This is common.

In [None]:
x = 7

if x > 100:
    print("really big")
elif x > 25:
    print("big")
else:
    print("small")

And just to be clear, the choice of the variable name `x` is arbitrary (and lazy). You can name the object anything you like.

In [None]:
arnold = 7

if arnold > 100:
    print("really big")
elif arnold > 25:
    print("big")
else:
    print("small")
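One thing worth checking for yourself: Python tests the conditions from top to bottom and runs only the first block whose condition is `True`. The sketch below deliberately lists the looser condition first to show what goes wrong. The name `label` is just an illustrative choice.

```python
x = 300

# conditions are tested top to bottom; only the first True branch runs
if x > 25:
    label = "big"
elif x > 100:
    # never reached: any x above 100 is also above 25
    label = "really big"
else:
    label = "small"

print(x, "is labeled", label)
```

Even though 300 is larger than 100, the first condition already matched, so the `elif` is skipped. This is why the tightest condition usually goes first.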

**Loops**

Another common reason for using blocks is that we'd like to repeat the group of operations several times, with inputs that we specify. The `for` loop basically iterates over a list-like data set and executes the subsequent code block with each data point.

For example, `range()` specifies a set of integers in, well, a given range. You can choose a start and an end and an increment. Technically, in Python 3, `range()` returns its own class of object. *Again, if you use something often enough, the designers of the language attempt to make things easier syntactically (how you write it) or more efficient (faster execution).*

But, using `list()` we can create an actual list of integers from this `range` object. We don't have to do this to iterate over its content -- we are doing it just to look at the numbers `range()` returns.

In [None]:
type(range(10))

In [None]:
# from 0 up to but not including 10
list(range(10))

In [None]:
# from 5 up to but not including 20
list(range(5,20))

In [None]:
# from 100 up to but not including 2000 in steps of 200
list(range(100,2000,200))

Again, using `range()` we essentially get a list of integers that run from the start value, up to but not including the end value, in steps of the increment.

The simplest `for` loop just iterates over an object like `range`. In the cell below, we have our source data as `range(10)`, or the integers from 0 through 9. The loop proceeds by successively assigning the variable `i` to each element of the list of integers. First, `i` stands for the number 0, then `i` is 1, then `i` is 2 and so on up until 9. **With each new value, we execute the code in the block**, here just the single line that prints the value of `i`.

Because it is a variable name, `i` could have been any name. `pineapple` or `sheep` or `diet_coke` would all work if you substituted every occurrence of the name `i` with your new choice.

In [None]:
for i in range(10):
    print(i)

Next we run through the integers 1 through 10 and test if the number is odd or not, printing one thing if it is and another if it's even. Notice that again we have **nested blocks**. The conditional block that does the printing is nested in the looping block. We also have a new operator here, `%` -- `a%b` returns the remainder of the division of `a` by `b`. (And so an even number has 0 remainder after division by 2.)

In [None]:
print(3%2)
print(20%2)

In [None]:
print(30%5)
print(24%5)

OK let's put this to work

In [None]:
for i in range(1,11):
    
    if i % 2 == 0:
        print(i,"is an even number")
        
    else:
        print(i, "is an odd number")

So all of this is a little boring but it's a good start. You can iterate over lots of different kinds of things, chiefly lists. Remember, they store data in order so the loop below assigns each name from the list `students` successively to the variable name `student`. It then carries that value into the loop block and creates a slightly lame sentence with the indicated name. So we start with `student` being "Yaling" and end with `student` being "Ethan."

In [None]:
students = ["Yaling","Sophia","Alex","Erin","Ellen","Isabelle","Ethan"]

for student in students:
    
    drill = student + " is learning about code blocks."
    print(drill)
    print("---")

In [None]:
students = ["Yaling","Sophia","Alex","Erin","Ellen","Isabelle","Ethan"]

for student in students:
    
    drill = student + " is learning about code blocks."

    if len(student) % 2 == 0:
        drill = drill + " Their name has an even number of letters."
    else:
        drill = drill + " Their name has an odd number of letters."

    print(drill)
    print("---")

print("And that's it!")

Again, `student` is the name of a variable and is arbitrary -- we could have used anything. Replace `student` with the letter `s` or the word `lightbulb`. 

Before we finish iteration, there is one other kind of looping construction. The `while` loop keeps executing its block for as long as some condition is `True`, and stops once the condition becomes `False`. For example, you might want to run through a list of sentences until you find the first that is less than 280 characters, or one that contains the word "snow".
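The sentence-hunting idea might be sketched like this. The list `sentences` is invented for illustration, and the loop keeps stepping forward while there are sentences left and the current one does not mention snow.

```python
sentences = [
    "Schools are closed today.",
    "The trains are running late.",
    "How much snow is outside your window?",
    "Stay warm out there!",
]

i = 0

# keep going while we have not run out of sentences
# and the current sentence does not contain "snow"
while i < len(sentences) and "snow" not in sentences[i]:
    i = i + 1

if i < len(sentences):
    print("First snow sentence:", sentences[i])
else:
    print("No sentence mentions snow.")
```

The first half of the condition matters: without it, a list with no snow sentence would push `i` past the end of the list and raise an error.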

Below we will use the command `sample()` from the `random` package. The package contains a number of tools for generating random variables. For example, `sample()` -- as its name might suggest -- takes a list as an argument and, in computer style, puts the contents of the list into a hat and pulls out some number of the elements at random, a number you specify. Here we take 3 from the list of integers from 0 to 9, or we take two student names from the list of students. 

Execute this code several times to make sure you see what it's doing.

In [None]:
from random import sample

# 3 draws from the collection 0,...,9
print(sample(range(10),3))

# 2 draws from our list of students
print(sample(students,2))

Notice that `sample()` returns a list. So if we ask for 1 randomly selected element, we will get a list with one element. Often we don't want a list with one element, but we want the student name, say, that we selected. You can do this with the following command (where the square brackets ask for the entry with index 0, the first and only element in the list).

In [None]:
# 1 randomly selected student name
student = sample(students,1)
print(student)

# 1 randomly selected student name, but a string
print(student[0])

We can do this in one line too. Here we select a single item from the list `["H","T"]`. It's like a 50-50 coin toss every time you execute the code below. Each time you run it, Python puts the "T" and "H" in a hat, mixes it up and selects one. Try it a few times!

In [None]:
sample(["H","T"],1)[0]

Below we will use the command `sample()` from the `random` package to pick either "H" or "T" with 50% chance for each and print out how many "coin tosses" it took to get the first "H". We will use the counter `count` (again an arbitrary name), which we start at 1 to account for the first toss and increment each time we have to toss again.

The code starts with a flip. If it was "H", then we never execute the `while` loop. If it was tails, "T", we go into the loop and keep flipping until we get an "H". Got it? Execute this a few times and make sure you understand what it's doing.

In [None]:
flip = sample(["H","T"],1)[0]
print(flip)
count = 1

while flip == "T":
    flip = sample(["H","T"],1)[0]
    print(flip)
    count = count + 1
    
print("--->", count, "flip(s)")

**Putting this into action (enough with the coin flipping!)**

We've used loops before when looking at the list-like objects returned by calls from Tweepy to the Twitter API. Let's gear up and do that again. It will give us an opportunity to learn a bit more about Tweepy and introduce a new general kind of iteration in Python -- the list comprehension.

First, fire up your Twitter API.

In [None]:
# grab your keys from a previous notebook or https://apps.twitter.com

CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""

In [None]:
# before we can make Twitter API calls, we need to initialize a few things...
from tweepy import OAuthHandler, API

# setup the authentication
auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# create an object we will use to communicate with the Twitter API
api = API(auth,wait_on_rate_limit=True,wait_on_rate_limit_notify=True)

Now, recalling the Tweepy documentation, we can find our own followers (or the followers of other accounts if we specify an account name) by simply using the method `.followers()`. This returns a list-like object where each element represents the details of a follower. We look first at the `._json` attribute of one of these elements to get a better sense of what it contains.

(This information is available in the Twitter API description as well -- all Tweepy is doing is grabbing the JSON string from Twitter and turning it into a Python dictionary.)

In [None]:
followers = api.followers()

print(type(followers))
print(len(followers))

In [None]:
followers[0]._json

What's the one piece of data that seems to be missing here that would be super valuable to know?

Recall that Tweepy makes all this easy to access through their objects. In this case, they encapsulate the results from the API call into an object, where each element in `followers` is of type `User`. We can access the data (as we do for any object) with the `.` or dot notation. So this means we can either use the `._json` representation of our followers as Python lists and dictionaries or we can use the Tweepy object.

In [None]:
type(followers[0])

Let's iterate!

In [None]:
for f in followers:
    print(f.followers_count,f.screen_name)

**List comprehensions**

Mike loves loops. 😍*It's almost pathological.* Mark finds loops hard to read, even with Python's graceful indentation. There are other forms of iteration that are worth commenting on, both of which Mark believes are more elegant than loops. Suppose, for example, I wanted to create a list of the follower counts of the people who follow me. I could loop over the `followers` result set, extract from each account their `.followers_count` one at a time, and build a new list. 

*Or, perhaps I can store them in a list more directly.* In walks list comprehensions. Have a gander.

In [None]:
counts1 = [f.followers_count for f in followers]
counts1
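For comparison, here is a sketch of the loop version next to the comprehension. So that it runs without the Twitter API, it uses stand-in objects (`fake_followers`, built with `SimpleNamespace`) whose names and counts are invented. They are not real Tweepy `User` objects.

```python
from types import SimpleNamespace

# stand-ins for Tweepy User objects; names and counts are invented
fake_followers = [
    SimpleNamespace(screen_name="ada", followers_count=120),
    SimpleNamespace(screen_name="grace", followers_count=4500),
    SimpleNamespace(screen_name="linus", followers_count=87),
]

# loop version: start with an empty list and append one count at a time
counts_loop = []
for f in fake_followers:
    counts_loop.append(f.followers_count)

# list comprehension version: the same result in one line
counts_comp = [f.followers_count for f in fake_followers]

print(counts_loop)
print(counts_comp)
```

Both produce the same list. The comprehension just folds the "make an empty list, loop, append" pattern into a single expression.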

And now we can maybe compute some summary statistics.

In [None]:
from statistics import mean, median

print(mean(counts1))
print(median(counts1))

Or we can create a list of follower names...

In [None]:
who = [f.screen_name for f in followers]
who

I can also add a conditional statement, choosing which elements to include and which to drop.

In [None]:
# make a list of just the names that contain the letter "a"
[name for name in who if "a" in name]

In [None]:
# make a list of just the counts that are larger than 1000
[n for n in counts1 if n > 1000]
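A comprehension with a condition corresponds to a loop with an `if` nested inside it. A minimal sketch, using a made-up list `example_counts` in place of real follower counts:

```python
# invented follower counts
example_counts = [120, 4500, 87, 1000, 23000]

# loop version: test each element and keep only the ones that pass
big_loop = []
for n in example_counts:
    if n > 1000:
        big_loop.append(n)

# comprehension version: the condition goes at the end
big_comp = [n for n in example_counts if n > 1000]

print(big_loop)
print(big_comp)
```

Note that 1000 itself is dropped, since the test is strictly "larger than".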

Now, going back to our original lists `counts1` and `who`, we'll add a `counts2` that gives us the `.friends_count` for each of our followers and then we'll bundle everything up into a data frame, making a dictionary of lists. 

I know. It's awesome.

In [None]:
counts2 = [f.friends_count for f in followers]

In [None]:
from pandas import DataFrame

# now make a data frame from a dictionary of lists
df = DataFrame({"screen_name":who,"followers":counts1,"friends":counts2})

df

<img src=https://c1.staticflickr.com/2/1710/23872688022_5bee85a7fe_b.jpg width=500>

One last form of iteration. This one implicit. DataFrames implement a kind of iteration, allowing for element-wise computations. So, in one form of bot detection, we look for accounts that follow many more people than they have following them. That is, the `"friends"` to `"followers"` ratio is large. In short, it's easier to follow people than it is to get them to follow you. 

We *could* loop along the elements in our two lists or the two columns of the data frame and compute the ratio. Or we could just do the following using our data frame.

In [None]:
df["friends"]/df["followers"]

Ha! Loops happened but no intermediate variables were harmed in their construction. The whole thing was handled implicitly. Again, rowwise operations are common enough that they should be simple to specify. Hence, Pandas.
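To see what Pandas is sparing us, here is roughly what the explicit loop could look like on two plain lists. The numbers and the names `example_friends`, `example_followers` and `ratios` are invented for illustration.

```python
# invented numbers standing in for the "friends" and "followers" columns
example_friends = [300, 5000, 40]
example_followers = [150, 50, 400]

ratios = []
for i in range(len(example_friends)):
    # divide the i-th friends count by the i-th followers count
    ratios.append(example_friends[i] / example_followers[i])

print(ratios)
```

One difference worth knowing: plain Python raises a `ZeroDivisionError` if a follower count is 0, while the Pandas version typically reports `inf` instead.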

We can also store the result in a new column called `"ratio"`...

In [None]:
df["ratio"] = df["friends"]/df["followers"]

df

... and maybe sort the whole thing by this ratio, highlighting the most lopsided accounts.

In [None]:
df.sort_values("ratio",ascending=False)

`badc0fee` looks a little suspect. [Have a look at his (its?) Twitter account](http://twitter.com/badc0fee). That's a lot of likes in two years! Let's pull some of this account's tweets. Via Tweepy, this means `.user_timeline()`.

In [None]:
recents = api.user_timeline("badc0fee")

print(len(recents))
print(type(recents[0]))

This gives us a list of 20 statuses -- their last 20 tweets. Have a look again at what this object represents in terms of the raw response data from Twitter.

In [None]:
recents[0]._json

We can get a sense of their tweeting frequency by looking at just the times each tweet was `created_at`. 

In [None]:
times = [t.created_at for t in recents]

times

And OK if we have to satisfy Mike 😍, we can create a loop to see how much time elapsed between each tweet. Recall the tweets are in reverse chronological order with the newest coming first. So the time between the most recent tweet and the second most recent tweet is `times[0]-times[1]`.

Let's see how much time passes between tweets...

In [None]:
for i in range(1,len(times)):
    print(times[i-1]-times[i])

Busy busy. 
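If the subtraction above looks mysterious, it helps to know that subtracting one `datetime` from another gives a `timedelta`, which you can convert to seconds. A small sketch with invented timestamps (newest first, like the timeline):

```python
from datetime import datetime

# invented timestamps, newest first
example_times = [
    datetime(2019, 2, 13, 12, 5, 0),
    datetime(2019, 2, 13, 12, 3, 30),
    datetime(2019, 2, 13, 12, 0, 0),
]

# the same differences as the loop above, written as a list comprehension
gaps = [example_times[i - 1] - example_times[i]
        for i in range(1, len(example_times))]

for gap in gaps:
    print(gap, "or", gap.total_seconds(), "seconds")
```

With three timestamps there are two gaps, which is why the loop starts at index 1 rather than 0.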

**Scaling up Tweepy**

Last time we fumbled slightly trying to do a search that returned more than 15 tweets. My bad. Today we'll fix that. We are going to use something in Tweepy called a `Cursor`. It is a construction that will make multiple requests for you and keep track of where you are. By that I mean, the Twitter API only allows so many tweets or users or trends to be returned by each request. The `Cursor` will keep track of how many you have been delivered so far and keep making requests until the total number you want is satisfied. 

Now, if you ask for too many, you will start to hit rate limits (the number of requests you can make with your API keys in 15 minute windows) and it will sleep for a while. Just so we get all the additions to our Tweepy setup in one place, let's remake our `api` object. 

In [None]:
# before we can make Twitter API calls, we need to initialize a few things...
from tweepy import OAuthHandler, API, Cursor

# setup the authentication
auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# create an object we will use to communicate with the Twitter API 
# the last two arguments have the program "wait" if you ask for
# too many things in a 15 minute window
api = API(auth,wait_on_rate_limit=True,wait_on_rate_limit_notify=True)
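To make the idea of a cursor concrete, here is a toy sketch of the bookkeeping involved. It is not how Tweepy implements `Cursor`. The pretend API (`fetch_page`), the page size and the function `toy_cursor` are all invented for illustration.

```python
# a pretend API that hands back at most PAGE_SIZE items per request
PAGE_SIZE = 20
ALL_ITEMS = list(range(95))   # pretend these are 95 tweets

def fetch_page(page):
    # return one "page" of results; an empty list means nothing is left
    start = page * PAGE_SIZE
    return ALL_ITEMS[start:start + PAGE_SIZE]

def toy_cursor(limit):
    # keep requesting pages until we have `limit` items or run out
    collected = []
    page = 0
    while len(collected) < limit:
        items = fetch_page(page)
        if not items:
            break
        collected = collected + items
        page = page + 1
    return collected[:limit]

print(len(toy_cursor(50)))
print(len(toy_cursor(500)))
```

Asking for 50 items takes three requests and trims the extras. Asking for 500 stops at 95 because the pretend API runs dry.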

Let's now get the last 500 tweets from `badc0fee`.

In [None]:
bad_tweets = [status for status in Cursor(api.user_timeline, screen_name="badc0fee").items(500)]
len(bad_tweets)

In [None]:
times = [t.created_at for t in bad_tweets]
texts = [t.text for t in bad_tweets]

df = DataFrame({"time":times,"text":texts})
df

Now, sort the data so the oldest tweets come first and work up to the newest. Then use the method `.diff()` to compute the differences in times...

In [None]:
df = df.sort_values("time")
df["time"].diff()

In [None]:
df["time"].diff().median()

Just about a minute between each tweet. Busy busy. 

The `Cursor` object will also let you search and perform other operations from the API, taking care of all your rate limiting etc and letting you focus on just the total number of objects you need. Trending in NYC today was the upsetting arrest of Maria Ressa. We can see the last 1000 tweets on this topic. Notice that the call to `Cursor` involves the `api` function you want and then any arguments you want to pass it. The `.items()` method tells the `Cursor` how many objects you're after. If you leave it empty it will go until the list is exhausted.

In [None]:
# pass the query as plain text; the request is URL-encoded before it is sent
searched_tweets = [status for status in Cursor(api.search, q="Maria Ressa").items(1000)]

Now we have a list of recent tweets. We can loop over them...

In [None]:
for tweet in searched_tweets[:10]:
    print(tweet.text)
    print("===")

... or create a data frame...

In [None]:
times = [t.created_at for t in searched_tweets]
texts = [t.text for t in searched_tweets]
users = [t.user.screen_name for t in searched_tweets]

df = DataFrame({"time":times,"user":users,"text":texts})
df

... and use basic Pandas functionality to see who is involved in the conversation. You can also examine retweets, who's mentioned and so on.

In [None]:
df["user"].value_counts()
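For a sense of what that tally involves, here is a loop-and-dictionary sketch of the same kind of count. The list `example_users` is invented, and unlike the Pandas result the dictionary is not sorted by count.

```python
# invented screen names standing in for the "user" column
example_users = ["ada", "grace", "ada", "linus", "ada", "grace"]

tally = {}
for user in example_users:
    if user in tally:
        # seen before: add one to the running count
        tally[user] = tally[user] + 1
    else:
        # first appearance: start the count at 1
        tally[user] = 1

print(tally)
```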

Loops, list comprehensions, rowwise operations... these are all methods to repeat operations. Your code will often be a mix of each and there is no right way to do things. Just make sure your code is readable and don't choose a loop over some unnatural attraction 😍.

**Next, your turn**

You now have a TON of tools at your disposal and it's time to start thinking like reporters again. You have trends from during the State of the Union. You have the capacity to search tweets, to pull people's timelines, to identify their followers. Let's tell a story about conversation happening on Twitter. Or let's look at a particular person like `badc0fee` and see what they are up to. 

Go! (Or reinforce your iteration knowledge [with this video](https://www.flocabulary.com/unit/coding-for-loops/) and then, go!)