Trending Topics: How would you do it?
------------------------------------

**Due 5pm on 2/1**

Having looked a little at Twitter's trending topics, we can now think about what we want from the concept. Remember our discussion last week where we surfaced ideas like

* Popularity
* Timeliness
* Impact
* Influence or promoted by influential people/organizations
* Likelihood of being "fake" material

Each of these concepts makes sense in words, but needs to be translated into data via some computation on tweets. We will unpack these ideas in this drill. 

The tweets we will examine come from Washington DC at noon on Inauguration Day. You might start by looking at what was trending there. I've pasted the code we used for this. I have added the limit by position, so we're only looking at trends that appear in the top 10. Remember our results changed when we did that and some patterns became clearer. You should do as you see fit, however. I just wanted to give you the code and remind you of the issue.

Use this code to examine DC during the noon hour (remember all our time stamps are 5 hours ahead of NYC time). Don't forget our handy "startswith()" trick that we used to narrow things down to particular days and times!

In [None]:
# 1. load up pandas and then read in the data
from pandas import read_csv,set_option
set_option('display.max_rows', 50)

trends = read_csv('twitter_trending_topics_for_us_120to122_mh2.csv')

In [None]:
# 2. look at one city
trendy = trends[(trends["city"] == "Las Vegas") & (trends["position"] <= 10)]

In [None]:
# 3. pull the top trending topics (maybe position <= 10 is too much? is the top 30 right?)
topics = trendy["topic_name"].value_counts()
tops = topics.index[:30]
tops

In [None]:
# 4. prepare to plot trends from the city
from plotly.plotly import iplot, sign_in
import plotly.graph_objs as go

sign_in("cocteautt","9psj3t57ti")

trendy_tops = trendy[trendy["topic_name"].isin(tops)]

mydata = [go.Scatter(x=trendy_tops["datetime"],y=trendy_tops["topic_name"],mode="markers")]
mylayout = go.Layout(autosize=False, width=1000,height=800,margin=go.Margin(l=150,r=50,b=100,t=100,pad=4))
myfigure = go.Figure(data = mydata, layout = mylayout)
iplot(myfigure)

In [None]:
# 5. Look at a single trend and plot it across the country
target_topic = 'To Sir With Love'

trends_topic = trends[(trends['topic_name']==target_topic)& (trends["position"]<=10)]

mydata = [go.Scatter(x=trends_topic["datetime"],y=trends_topic["city"],mode="markers")]
mylayout = go.Layout(autosize=False, width=1000,height=1500,margin=go.Margin(l=150,r=50,b=100,t=100,pad=4))
myfigure = go.Figure(data = mydata, layout = mylayout)
iplot(myfigure)
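As a reminder of how the startswith() trick narrows things down, here is a minimal sketch on a made-up table. It assumes the "datetime" column holds strings shaped like "2017-01-20 17:05:00" and that DC is stored as "Washington"; both are guesses, so look at a few rows of the real file before copying this. Noon in DC shows up as hour 17 because the time stamps run 5 hours ahead.

```python
from pandas import DataFrame

# Toy rows standing in for the trends table (all values are made up)
toy = DataFrame({
    "city": ["Washington", "Washington", "Washington", "Las Vegas"],
    "datetime": ["2017-01-20 16:55:00", "2017-01-20 17:05:00",
                 "2017-01-20 17:45:00", "2017-01-20 17:05:00"],
    "position": [3, 1, 12, 2],
})

# Keep one city, the noon hour (hour 17 in the data), and the top 10 positions
noon = toy[(toy["city"] == "Washington")
           & toy["datetime"].str.startswith("2017-01-20 17")
           & (toy["position"] <= 10)]
print(noon)
```

Only the 17:05 row survives: the 16:55 row is outside the hour and the 17:45 row sits at position 12.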

**1. Write up a summary here of what you found**

* Did it make sense?
* Did you get any clues about what seems broken?



**Tweets - An aside about APIs**

As we saw in class, Twitter makes its data available via an Application Programming Interface or API. The web site [ProgrammableWeb](https://www.programmableweb.com/) offers great introductory material about APIs as well as [an  API directory.](https://www.programmableweb.com/apis/directory) 

Let's first scan the API directory. 

Notice that companies like Twitter and Google and Flickr offer interfaces to their services for you to use -- for you to build new applications on.  Want to include a map on your web page? Or perhaps a scrolling list of your organization's most recent tweets? Want to pull down Yelp data from your neighborhood and cross it with the NYC violations data? Maybe you have a list of addresses that you would like to translate into longitude/latitude pairs? And then put them on a Google Map. 

APIs to the rescue. So many companies and organizations offer their data and computation to developers through APIs.

So, an API lets you build new services from old. If you parse out the acronym a bit, an **interface** is a bridge between two computer systems. A **programming** interface means that we are helping programmers instruct computers to make use of the bridge. As a practical matter, this means the programmer needs to know how to ask for the data or computation they want, and how their request will be answered -- what will the data look like?

We will see that requests for data, or for maps or for whatever a service is advertising, are usually in the form of a URL. That's right, the same way you would specify a web page to read in your browser, you can ask for data or computation from an API.  
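To make the "request as a URL" idea concrete, here is a small sketch that builds one. The address and the parameter names are hypothetical (example.com is a placeholder, not a Twitter endpoint); the point is only that a request is a base address plus named values tacked on after a "?".

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# Hypothetical endpoint and parameters, for illustration only
base = "https://api.example.com/trends"
params = [("city", "Washington"), ("date", "2017-01-20")]

# urlencode turns the name/value pairs into the query string after the "?"
url = base + "?" + urlencode(params)
print(url)
```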

The answer to your request, however, could come in a variety of forms. We will get into the details of all this later in the term, but for our Twitter example, all of our requests (for trends or for tweets) return data formatted in [JSON, the JavaScript Object Notation.](http://www.json.org/) This is a very common choice because it plays nicely with the programming language JavaScript that runs in most browsers -- many new services using APIs are built for the web, as web pages, and JavaScript is the language for pulling it all together. Having a data format that can be easily read into JavaScript is a huge benefit.

Of course there are plenty of competing ways to structure data. You have seen one already -- the humble CSV. In CSV format, data are organized as a table, with each row describing a different unit of observation, and, in each row, commas separate the different measurements taken on the corresponding unit. Your choice of data format depends on things like ease of use for your application (JSON and JavaScript, say) as well as the expressiveness of the format. There are things that CSV cannot do easily that JSON can. We'll see that shortly. A nice tool for exploring data formats is [Mr. Data Converter](https://shancarter.github.io/mr-data-converter/) by Shan Carter from the New York Times.

One final comment on APIs. An article from ProgrammableWeb motivates the concept of an API nicely, reminding us that the goal of an API is not so much for a human to use, but, once programmed, for computers to be able to chain data and computations from different places to make new services. This is the soul of "mashups" and Web 2.0. Skim this article if you're curious -- [APIs Are Like User Interfaces--Just With Different Users in Mind](https://www.programmableweb.com/news/apis-are-user-interfaces-just-different-users-mind/analysis/2015/12/03).

**Twitter's APIs**

Twitter has a [published JSON format for its tweets.](https://dev.twitter.com/overview/api/tweets) Have a look! You'll see sections of the data about the location where the tweet originated, details about the user, and details about the tweet like whether it contains hashtags or URLs. As we saw last time in class, a JSON object, while designed to be maximally useful with JavaScript, can be easily translated into basic, built-in objects in Python: numbers, strings, Boolean values, lists and (our newest built-in object) dictionaries.
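Here is a minimal sketch of that translation. The JSON string below is made up and only loosely shaped like a tweet (the field names are placeholders, not Twitter's full format); loads() from the "json" package turns each piece into the matching built-in Python object.

```python
from json import loads

# A made-up JSON snippet, loosely shaped like a tweet (not real data)
raw = ('{"text": "hello", "retweet_count": 3, "truncated": false, '
       '"hashtags": ["a", "b"], "user": {"name": "someone"}}')

obj = loads(raw)

# Print the Python type that each JSON value became
for key in sorted(obj):
    print("%s: %s" % (key, type(obj[key]).__name__))
```

The JSON number becomes an int, false becomes False, the square brackets become a list and the inner curly braces become a dictionary.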

We have used the Twitter API to capture 19,677 raw tweets from the DC area at noon on the day of the inauguration. The Twitter API does not return all the tweets from DC (that would require the so-called firehose); instead it returns a random sample of tweets.

[The data set is located here.](http://compute-cuj.org/inauguration_data_mh.tar.gz) **This is an updated file, so please download this new one!** 

Download it and move it to the folder where you have placed this notebook. Let's look at a tweet! There is a folder called "inauguration_data_mh" and in it is another folder called "noon_tweets". This latter folder holds the raw JSON files, one per tweet. You can open any of these files in TextEdit or Notepad to see what they look like. It's just text with formatting to describe the data comprising each tweet.

To load them into Python, we use the "json" package and a function called "loads". Here's a tweet sorta chosen at random. We open the file with the function open() as we did in the first drill and then pass the contents of the file to loads() to turn the tweet into a Python object.

In [None]:
from json import loads

tweet = loads(open("inauguration_data_mh/noon_tweets/822503956907773952.json").read())

In [None]:
tweet

First, notice the curly braces that start and end the tweet. These, in Python, define a **dictionary** just like square brackets [ ] were used to group data into a list. A dictionary is a container object like a list, except that instead of storing things sequentially, it stores them under names or words. Think of how you look things up in a dictionary... you don't ask for the definition of the 2,354th word, you ask for the definition of the word "asymptote", say. 

Here's how we see that tweet is a dictionary...

In [None]:
type(tweet)

Here's a mini-example. We will build a mini_tweet dictionary that has just some of the data of the tweet above. Here we chose to store the date the tweet was created, its source, the text of the tweet and some facts about the user who tweeted it. Data like the source or the tweet text are stored under a name. 

We refer to the names as "keys" and the data they refer to as "values". So below, the key "created_at" is associated with the value "Fri Jan 20 17:59:59 +0000 2017".

In [None]:
mini_tweet = {'created_at': 'Fri Jan 20 17:59:59 +0000 2017',
              'source': '<a href="http://instagram.com" rel="nofollow">Instagram</a>',
              'text': 'millwoodschool Our upper school boys are bonding on the slopes! #Wintergreen #AnnualSkiTrip @\u2026 https://t.co/XYn5wDVJAt',
              'user': {'followers_count': 121,
                       'friends_count': 121,
                       'id': 608401769,
                       'lang': 'en',
                       'location': 'Richmond, VA',
                       'name': 'ChristensCreations',
                       'statuses_count': 2778}
              }

And we access the data values by referring to the appropriate key.

In [None]:
print mini_tweet["created_at"]

In [None]:
print mini_tweet["text"]

In [None]:
print mini_tweet["user"]["name"]

The method keys() gives us all the names used to store data in a dictionary, and the method values() returns all the data.

In [None]:
mini_tweet.keys()

In [None]:
mini_tweet.values()

These commands both return lists. But the keys and values are not in any real order. Remember we are accessing data by name not by position so position doesn't matter. If it does, use a list! 
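If you want to print a dictionary's contents in a predictable order anyway, one option is to sort the keys first. A small sketch with a made-up dictionary (the name mini is just for illustration):

```python
mini = {"source": "Instagram",
        "text": "on the slopes",
        "created_at": "Fri Jan 20 17:59:59 +0000 2017"}

# sorted() returns the keys alphabetically, however the dictionary stores them
ordered_keys = sorted(mini.keys())

# Look each value up by name, in that sorted order
for key in ordered_keys:
    print("%s -> %s" % (key, mini[key]))
```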

**2. Now, go back to the full tweet and pull out some data that you think might be interesting. Again, our goal is to decide what's trending in the DC area on Twitter. What kinds of information do you want to pull from the tweet?**

*Small comment on the printout of the tweet a few cells back. The strings all look a little funny -- they have a "u" in front of them, like u"Richmond, VA". The u in front of the string means it is encoded in Unicode. This is a technicality about the characters that are available for making a string. With the u, or a Unicode encoding, we can create strings using the alphabet from just about every known language. Um, including emoji. So think of the u as indicating a string but one with the ability to express words in lots of languages. Otherwise it is just like any other string we've seen in terms of operations like subsetting or startswith() or count().*

In [None]:
# Extract some data from the tweet object and try 
# loading a different tweet. Finish by writing out what 
# features you'd like to use for your trending algorithm.



**A first pass -- A simple CSV representing 19,677 tweets**

We have taken tweets from the noon period on Inauguration Day in DC and boiled them down into a CSV for you. Have a look. Here we read in the CSV data. Again, it's updated from the version we handed out Thursday.

In [None]:
set_option("display.max_colwidth",140)

tweets = read_csv("inauguration_data/inauguration_tweets_at_noon_mh3.csv")

In [None]:
tweets.shape

In [None]:
tweets.head(50)

We have screen names, follower counts, the tweet's text, and so on. The time a tweet was created is given in a datetime string (coming from Twitter), in a timestamp (seconds since the UNIX epoch as with all our other data sets) and then a counter that tells you what 10-minute chunk of the hour the tweet came from. Minutes 0-9 are marked 0, minutes 10-19 are marked 10, and so on up to minutes 50-59, which are marked 50. Make sense? We added this so you could easily look at the number of tweets per 10-minute interval.
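If you are curious how a counter like that can be computed, here is one way, sketched under assumptions: it parses Twitter's "created_at" string and rounds the minute down to a multiple of 10. The function name ten_minute_bucket is made up, and this is not necessarily how the column in the CSV was built.

```python
from datetime import datetime

def ten_minute_bucket(created_at):
    """Return 0, 10, 20, 30, 40 or 50 for a Twitter-style created_at string."""
    # The "+0000" is matched literally, which assumes the time is given in UTC
    stamp = datetime.strptime(created_at, "%a %b %d %H:%M:%S +0000 %Y")
    # Integer division drops the last digit of the minute; times 10 restores the scale
    return (stamp.minute // 10) * 10

late = ten_minute_bucket("Fri Jan 20 17:59:59 +0000 2017")
early = ten_minute_bucket("Fri Jan 20 17:07:12 +0000 2017")
print(late)
print(early)
```

The first tweet lands in the 50 bucket and the second in the 0 bucket.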

Notice that we have several columns for the hashtags that are in a tweet. Many of these fields are missing in the CSV. This is a nice example of how JSON and CSVs differ. 

The JSON version of the tweet stores entities like hashtags and URLs in lists. No hashtags means that list is empty. But since it's a list, there's also no limit (other than the 140 characters) to how many hashtags the JSON object can store. For a CSV we need a column for each hashtag. That's why we have "hashtag \#1" and so on. In this case, the CSV feels awkward. (There are other ways to do this but having variable-length elements in a row is always awkward.)
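To see the awkwardness in miniature, here is a sketch that squeezes variable-length hashtag lists into a fixed number of columns. The tweets are made up and MAX_TAGS is a hypothetical column count, not the number used in our CSV.

```python
# Made-up tweets; each carries a hashtag list of a different length
toy_tweets = [
    {"text": "first", "hashtags": ["Inauguration", "DC"]},
    {"text": "second", "hashtags": []},
    {"text": "third", "hashtags": ["a", "b", "c", "d"]},
]

MAX_TAGS = 3  # hypothetical number of hashtag columns

rows = []
for t in toy_tweets:
    tags = t["hashtags"][:MAX_TAGS]                 # extra hashtags are dropped
    tags = tags + [None] * (MAX_TAGS - len(tags))   # short lists are padded with missing values
    rows.append([t["text"]] + tags)

for row in rows:
    print(row)
```

The second tweet becomes a row of missing values and the third loses its fourth hashtag, which are exactly the two costs of forcing a list into a table.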

OK let's start with some warm up questions that use all the operations we learned with DataFrames. (And no, we won't be using DataFrames so heavily all semester. They are just a good place to start.)

**3. How many tweets did we collect in each of the ten minute periods starting at noon on Inauguration day?**

In [None]:
# Your code here


**4. How many different people do we have in our data set? Tell me about the most frequent tweeters.**

In [None]:
# Your code here


**5. Which tweeter that hour had the largest number of followers?**

In [None]:
# Your code here


To find the tweet with the largest number of retweets, we could sort the table by "retweet count" and then use a head()...

In [None]:
tweets.sort_values(by="retweet count",ascending=False).head(10)

... or we can figure out what the largest retweet count is and then work from there. For that, we can use the method max() which comes along with min() and sum() and mean(), for example. Basic statistical summaries.

In [None]:
tweets["retweet count"].max()

In [None]:
tweets[tweets["retweet count"]==1961]
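Typing the maximum in by hand works, but it breaks if the data file changes. A small variation, sketched here on a made-up table, stores the result of max() in a variable and reuses it; it also shows that ties all come back.

```python
from pandas import DataFrame

# Toy table with the same column names as our tweets (values are made up)
toy = DataFrame({
    "user's screen_name": ["ann", "bob", "cat"],
    "retweet count": [4, 19, 19],
})

# Compute the maximum once, then keep every row that matches it
top = toy["retweet count"].max()
most_retweeted = toy[toy["retweet count"] == top]
print(most_retweeted)
```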

Now, suppose I want to figure out how many retweets each tweeter received during the hour. I might use our groupby() command from last class. Here we would take our DataFrame, groupby() the screen names and then apply a function like sum() to the groups of retweet counts. 

In [None]:
tot_retweets = tweets[["retweet count","user's screen_name"]].groupby("user's screen_name").sum()
type(tot_retweets)

This gives us back a DataFrame. In this case the index is not row number but group name -- the screen names. Notice it prints out differently.

In [None]:
tot_retweets

Instead of alphabetical order, we can sort by retweet totals.

In [None]:
tot_retweets.sort_values(by="retweet count",ascending=False)

**6. Look at how often different hashtags were used, finding the most frequent. For simplicity, use just the column "hashtag \#1".**

In [None]:
# Put your code here


In the example above, we used sum() to add up the number of retweets. There are several functions like mean() which will take the average of the values in the group, or min() which will find the minimum or count() which simply tells you the size of the group.
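If you want to compare several of these summaries side by side, pandas lets you ask for them all at once with agg(). A minimal sketch on a made-up table (the names and counts are invented):

```python
from pandas import DataFrame

# Toy table: two tweeters with a handful of tweets each (values are made up)
toy = DataFrame({
    "user's screen_name": ["ann", "ann", "bob", "bob", "bob"],
    "retweet count": [2, 4, 0, 9, 3],
})

grouped = toy.groupby("user's screen_name")["retweet count"]

# agg() applies several summaries at once, one column per function
summary = grouped.agg(["sum", "mean", "min", "count"])
print(summary)
```

Each row of the result is a screen name and each column is one of the summaries.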

**7. For a couple of hashtags, compute the number of times each appeared in the 6 ten minute windows.**

In [None]:
# put your code here



**8. For a couple hashtags, find the people tweeting them and compare them based on their follower counts.**

In [None]:
# put your code here



**Bonus.** Now, to look at the hashtag usage per ten minute period, we could use groupby() again. This time we will subset the groups to focus on just the hashtag values for each group ("hashtag \#1"). This gives us a series in each group to which we can then apply value_counts(). This will produce a final series instead of a data frame (as we got above when we used groupby), but it will be a series with a nested index. I know that hurts my head too, but have a look. 

In [None]:
cnts = tweets[["hashtag #1","ten minute"]].groupby("ten minute")["hashtag #1"].value_counts()
type(cnts)

In [None]:
cnts

You see it's nested. We have a series (look at the type of cnts) but the index is first on ten minute interval and then on hashtag. We can look at just those entries with at least 10 tweets using standard subsetting.

In [None]:
set_option('display.max_rows', 100)

cnts[cnts>=10]
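If the nested index is still making your head hurt, it can help to pull out one piece at a time. This sketch rebuilds a tiny version of the counts from made-up data (toy and toy_cnts are illustration names) and then uses .loc to select a single window or a single count.

```python
from pandas import DataFrame

# Toy table with the two columns used above (values are made up)
toy = DataFrame({
    "ten minute": [0, 0, 0, 10, 10, 10],
    "hashtag #1": ["Inauguration", "Inauguration", "DC", "DC", "DC", "Inauguration"],
})

toy_cnts = toy.groupby("ten minute")["hashtag #1"].value_counts()

# The first index level is the ten minute window, so .loc[10] keeps only that window
window_10 = toy_cnts.loc[10]
print(window_10)

# A (window, hashtag) pair picks out a single count
one_count = toy_cnts.loc[(0, "Inauguration")]
print(one_count)
```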

This gives us a pretty clean view of what the top hashtags were during each 10 minute period. OK, what should the trends be?