Trends in Twitter
----------------

Last time we looked extensively at a dataset consisting of Facebook's trending topics for 5 zones. Today, we are going to dig into Twitter. I have cleaned up the original Twitter data a fair bit. Recall that the ingestion script probes Twitter every 15 minutes and logs the TTL (trending topic list) plus underlying trends (users might not see these in the UI) for ~400 geographical locations worldwide. Again, [download the data](http://compute-cuj.org/twitter_trending_topics_for_us_120to122_mh2.csv.gz), uncompress it, and place it in the same folder as this notebook.

The cleanup fixes some issues with the timing of the robots and makes life generally happier. So, load it up!

In [None]:
from pandas import read_csv, set_option
set_option('display.max_rows', 50)

In [None]:
trends = read_csv('twitter_trending_topics_for_us_120to122_mh2.csv')
trends.shape

In [None]:
trends.head(50)

The data are sorted first by city and then by time, running from oldest to newest topic. The display of the first 50 lines shows this structure clearly. After 29 trends from 8 minutes after midnight January 20, we jump to the trends from Albuquerque 23 minutes after midnight. As with FB, the data collection here is meant to be every 15 minutes. 

We can look at the first and the last entries to get a sense of the period of our data collection from Twitter.

In [None]:
trends.head(1)

In [None]:
trends.tail(1)

So, data were collected from Twitter from midnight January 20th until just before midnight on January 22nd. As with Facebook, each pass of data collection scooped up all the trends for all the locations in the US. We see that there are 64 unique places (cities + the United States) and we can look at how many trends we have from each...

In [None]:
trends["city"].describe()

In [None]:
trends["city"].value_counts()

So we have 3 days of data collection, which is 3\*24\*4 = 288, or about 300, observation times. If we have about 11,000 topics for each city, that's somewhere between 35 and 40 trends per city for every time we pulled data from Twitter.
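
The arithmetic above is easy to check in a cell. This is only a sketch: the figure of 11,000 rows per city is the rough number quoted in the text, not something computed from the file.

```python
# Sampling every 15 minutes gives 4 observation times per hour.
days = 3
observations_per_hour = 4
observation_times = days * 24 * observations_per_hour

# Roughly 11,000 rows per city is the approximate figure quoted in the text.
rows_per_city = 11000
trends_per_observation = rows_per_city / float(observation_times)

print(observation_times)
print(round(trends_per_observation, 1))
```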

**Las Vegas (and Pandas Series)**

Now, let's pick a city and look at its trends. Las Vegas is a reasonably large city. Let's look at what kinds of topics trended there. We would usually use value_counts() to summarize a qualitative data column. 

In [None]:
trendy = trends[ trends["city"] == "Las Vegas" ]
topics = trendy["topic_name"].value_counts()

type(topics)

In [None]:
topics

The value_counts() method returns a single Pandas column of data, an object of type "Series." You can think of a Series as a list, but one that might have names attached to each entry. So our value_counts() Series has its core "list" data (the number of times each topic appeared in Las Vegas, say) but each data point also has a label (the topic name). 
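
To see the "list with labels" idea without the full dataset, here is a small sketch on made-up topic names (the names and counts are invented for illustration and are not from the Twitter file).

```python
from pandas import Series

# Hypothetical topic names standing in for one city's "topic_name" column.
toy_topics = Series(["#Inauguration", "SNL", "#Inauguration",
                     "Patriots", "#Inauguration", "SNL"])

# value_counts() returns a Series sorted from most to least frequent.
toy_counts = toy_topics.value_counts()

print(type(toy_counts))
print(list(toy_counts.index))    # the labels: topic names
print(list(toy_counts.values))   # the "list" data: how often each appeared
print(toy_counts["SNL"])         # look up one entry by its label
```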

So **you can subset it just like you would a list**. Here we take the 30 most frequent topics. Remember how slices work?

In [None]:
topics[:30]

As for the names: Pandas gives you the ability to use not just row numbers (as we have been doing) but also row names (strings that describe each row rather than a number). It calls them an "index". So far, our index has always been just a row number. Here's our trends data, for example.

In [None]:
trends.head(5)

You can get access to the index of a DataFrame or a Series by referring to ".index". It returns something that is again like a list. Here is the index for our "topics" Series.

In [None]:
topics.index

Finally, because the index is list-like, you can make a subset. Here we look at just the first 30 topics.

In [None]:
tops = topics.index[:30]
tops

We can then ask if various strings are in this (essentially) list. The operator "in" does that for single strings...

In [None]:
# print() with parentheses runs under both Python 2 and Python 3
print("#ThankYouObama" in tops)
print("Columbia Journalism" in tops)

... and if our strings are in a column of a Pandas DataFrame, we can use the equivalent method "isin".

In [None]:
trendy["topic_name"].isin(tops)

The "isin" method makes us a mask that we can use to subset our Las Vegas trends data, keeping only the rows associated with the top 30 trends, say. We can then plot these top 30 in a dot chart. I've made the code for plotly a bit easier to follow.

In [None]:
from plotly.plotly import iplot, sign_in
import plotly.graph_objs as go 

sign_in("cocteautt","9psj3t57ti")

trendy_tops = trendy[trendy["topic_name"].isin(tops)]

mydata = [go.Scatter(x=trendy_tops["datetime"],y=trendy_tops["topic_name"],mode="markers")]
mylayout = go.Layout(autosize=False, width=1000,height=800,margin=go.Margin(l=150,r=50,b=100,t=100,pad=4))
myfigure = go.Figure(data = mydata, layout = mylayout)
iplot(myfigure)
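
The membership test and the mask-and-subset step used above can be tried on a toy table. This is a sketch only; the column names mirror the notebook's, but the rows are invented.

```python
from pandas import DataFrame

# Invented rows; only the column names follow the real data.
toy = DataFrame({
    "topic_name": ["A", "B", "A", "C", "B", "A"],
    "position":   [1, 2, 1, 3, 2, 1],
})

# The two most frequent topics, taken from the index of value_counts().
toy_tops = toy["topic_name"].value_counts().index[:2]

# "in" asks about a single string...
print("A" in toy_tops)
print("Z" in toy_tops)

# ...while isin() asks about every row at once and returns a Boolean mask.
mask = toy["topic_name"].isin(toy_tops)
toy_subset = toy[mask]
print(toy_subset)
```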

**Methods (some generality for examining trends)**

Methods are a list of features that trend signals usually contain. While these might not be valuable individually to comprehend the nature of the trend, a combination of all features allows two things:

* Uniquely fingerprint a trend in the world
* Allow comparison between two trend signals.


**Origins (which has to do with time) and geospan**

The origin of a trend indicates where it was initiated. Origin is a fascinating feature for two main reasons: (1) big trends can originate in small cities and go national, and (2) some cities have an affinity for initiating certain categories of trends, e.g., many gaming trending topics originate in SF whereas numerous fashion trending topics originate in NY.
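
One possible way to compute an origin is to sort the rows by time and keep the first row for each topic. The sketch below assumes columns named "topic_name", "city" and "epoch" as in our data, but the rows and topic names are invented. If two cities share the earliest epoch for a topic, which of them comes first is arbitrary.

```python
from pandas import DataFrame

# Invented rows; topic names and epochs are hypothetical.
toy = DataFrame({
    "topic_name": ["#GameJam", "#GameJam", "#RunwayShow", "#GameJam", "#RunwayShow"],
    "city":       ["San Francisco", "Seattle", "New York", "United States", "Boston"],
    "epoch":      [1, 2, 2, 3, 4],
})

# Sort from oldest to newest, then keep each topic's earliest row.
first_seen = toy.sort_values(by="epoch").drop_duplicates(subset="topic_name", keep="first")

# A Series mapping each topic to the city where it was first observed.
origins = first_seen.set_index("topic_name")["city"]
print(origins)
```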

The trend "To Sir With Love" is presumably from the SNL skit in which two cast members said goodbye to President Obama. The times here require a little work. The datetime column ends in "Z", which means the time is recorded in UTC, Coordinated Universal Time. UTC is 5 hours ahead of NYC now, so this means the trend started before 1am NYC time.
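
The time conversion can be done with Python's standard library. In this sketch the timestamp string and its exact format are hypothetical (check a real value in the datetime column first), and the fixed 5-hour offset is the one mentioned above for January; it does not handle daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical UTC timestamp; the format of the real column may differ.
utc_stamp = "2017-01-22T05:38:00Z"

# Parse the string and mark it as UTC.
utc_time = datetime.strptime(utc_stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

# Fixed offset of 5 hours behind UTC (Eastern Standard Time in January).
eastern_standard = timezone(timedelta(hours=-5), "EST")
nyc_time = utc_time.astimezone(eastern_standard)

print(nyc_time.isoformat())
```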

Let's see where else it trended. 

In [None]:
target_topic = 'To Sir With Love'

trends_topic = trends[trends['topic_name']==target_topic].sort_values(by="timestamp")

In [None]:
mydata = [go.Scatter(x=trends_topic["datetime"],y=trends_topic["city"],mode="markers")]
mylayout = go.Layout(autosize=False, width=1000,height=800,margin=go.Margin(l=150,r=50,b=100,t=100,pad=4))
myfigure = go.Figure(data = mydata, layout = mylayout)
iplot(myfigure)

In [None]:
trends_topic.sort_values(by="timestamp").head(100)

**Your turn**

We began by looking at trends in Las Vegas and then moved out to various cities. We can extend this analysis in various ways. Starting with another city or the United States? Looking for less common trending topics (not the top 30 but maybe a middle 30)? What else?

Try!

I've pulled the important code into five short steps. There's a lot of, um, pedagogy up there!

In [None]:
# 1. load up pandas and then read in the data
from pandas import read_csv,set_option

trends = read_csv('twitter_trending_topics_for_us_120to122_mh2.csv')

In [None]:
# 2. look at one city
trendy = trends[ trends["city"] == "Las Vegas" ]

In [None]:
# 3. pull the top trending topics (maybe <= 10 is too much? is the top 30 right?)
topics = trendy["topic_name"].value_counts()
tops = topics.index[:30]

In [None]:
# 4. prepare to plot trends from the city
from plotly.plotly import iplot, sign_in
import plotly.graph_objs as go

sign_in("cocteautt","9psj3t57ti")

trendy_tops = trendy[trendy["topic_name"].isin(tops)]

mydata = [go.Scatter(x=trendy_tops["datetime"],y=trendy_tops["topic_name"],mode="markers")]
mylayout = go.Layout(autosize=False, width=1000,height=800,margin=go.Margin(l=150,r=50,b=100,t=100,pad=4))
myfigure = go.Figure(data = mydata, layout = mylayout)
iplot(myfigure)

In [None]:
# 5. Look at a single trend and plot it
target_topic = 'To Sir With Love'

trends_topic = trends[(trends['topic_name']==target_topic)]

mydata = [go.Scatter(x=trends_topic["datetime"],y=trends_topic["city"],mode="markers")]
mylayout = go.Layout(autosize=False, width=1000,height=800,margin=go.Margin(l=150,r=50,b=100,t=100,pad=4))
myfigure = go.Figure(data = mydata, layout = mylayout)
iplot(myfigure)

The **geospan** of a trend signal signifies the various geo-locations at which it was observed. In the case of micro signals, this boils down to individuals from different locations acting upon the media related to the trend. Geospans are an important measure in identifying if a trend has gone national, in which case it will be visible in most geo-locations of the country!
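
A simple numeric version of geospan is the number of distinct places where a topic was seen. The sketch below uses invented rows; dividing by the total number of places to get a share is an assumption made here for illustration, not a definition from the text.

```python
from pandas import DataFrame

# Invented rows standing in for the trends table.
toy = DataFrame({
    "topic_name": ["X", "X", "X", "Y", "Y", "X"],
    "city":       ["Las Vegas", "Seattle", "Las Vegas", "Boston", "Boston", "Denver"],
})

# Number of distinct cities in which each topic appeared.
geospan = toy.groupby("topic_name")["city"].nunique()

# Share of all cities in the table that each topic reached.
share = geospan / float(toy["city"].nunique())

print(geospan)
print(share)
```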

**Persistence**

Persistence of a trend is the duration of continuous time units for which it kept trending in some geo-location, signified by continual presence in the trending topic list. This means during the persistence spell, a trend never fell out of the TTL and was not replaced by any other trend. 

So what does persistence really signify? Recall that a topic trends because people are tweeting about it. Two conditions are necessary for a trend to persist: 
1. a decent volume of tweets containing the trending word in a short amount of time and 
2. a failure of consolidation - i.e.  other tweets from the user group (either geo-location or follower group) fail to use the same trending word/ hash-tag in a consolidated fashion in enough tweets. This is also [the reason why #OccupyWallStreet **did not** trend in New York](http://www.niemanlab.org/2011/10/why-hasnt-occupywallstreet-trended-in-new-york/). 


<img src = "http://www.niemanlab.org/images/socialflow_twittertrending.png">

The first condition assures that the word is trending enough to be above the threshold or cut-off marker that qualifies as a trend. The second condition assures that other trends are not competing hard enough to enter into the TTL. 

A smart way scientists visualize persistence is through something called a dispersion plot. The Y-axis represents geo-locations whereas the X-axis represents units of time since origin. You can then place a dot for every time the trend was observed at a location, and a blank if it wasn't. The result is continuous lines indicating persistence and gaps indicating a lack of it.

As we said, the persistence of a trend can be defined as the longest sequence of consecutive time periods that it was popular. We might take that to mean it was in the top 10, say. The sequence of consecutive time periods can be turned into actual time. If a topic persisted for 20 time periods that's 20\*0.25 = 5 hours. 
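
Before turning to pandas, the definition can be sketched in plain Python. The list of 0s and 1s below is invented, and the sketch assumes there is one entry for every 15-minute sampling time with none missing. It uses itertools.groupby, which is a different tool from the pandas groupby() used later.

```python
from itertools import groupby

# Hypothetical record: 1 = topic was in the top 10 at that sampling time, 0 = it was not.
in_top = [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1]

# groupby() bundles consecutive equal values; keep the lengths of the runs of 1s.
run_lengths = [len(list(run)) for value, run in groupby(in_top) if value == 1]

# Persistence is the longest run, converted to hours at 15 minutes per period.
longest_run = max(run_lengths) if run_lengths else 0
hours = longest_run * 0.25

print(run_lengths)
print(longest_run, hours)
```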

Let's pick a trend and a city and see what its persistence is like. We start by creating a smaller DataFrame that has just the rows for one topic in one city. We then add a new column called "tops" to this DataFrame that is True if the position is 10 or smaller and False otherwise. We add 0 to the Boolean because Python converts True to 1 and False to 0 when you include it in an arithmetic calculation.
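
As a quick check of the Boolean arithmetic, here is a sketch on a few invented positions.

```python
from pandas import Series

# Plain Python: adding 0 turns a Boolean into a number.
print(True + 0)
print(False + 0)

# The same trick works on a whole column at once (positions are invented).
positions = Series([3, 12, 10, 27])
in_top_10 = positions <= 10      # a column of True/False
as_numbers = in_top_10 + 0       # a column of 1/0

print(list(in_top_10))
print(list(as_numbers))
```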

In [None]:
target_trend = '#USofScience'
city = 'Seattle'

trendy = trends[(trends['topic_name'] == target_trend) & (trends['city']==city)].sort_values(by='timestamp')

# add a new column
trendy["tops"] = (trendy["position"]<=10)+0
trendy.head(20)

Notice that the syntax is consistent. We use [ ]'s and a string to access the data in a column, so it seems only fair that we can create a column in the same way.

For persistence, we need to create another column, one that marks runs of whether a topic is in the top 10 or not. For that we compare the value of each entry to the entry from the time period just before it. The command shift() will take a column in a DataFrame and, well, shift it by one. So a column of [a,b,c] becomes [NaN,a,b], where the first entry in the shifted column is a missing value.

In the code below we compare the shifted and unshifted columns, creating a True if the topic's status in one period was NOT the same as that for the period immediately preceding it, or if the two adjacent periods are separated by more than one sampling epoch. A second cell then takes the cumulative sum (cumsum) of these True values, which gives each run its own block number.
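
A small sketch of shift() on an invented column of 1s and 0s shows the comparison. Note that the first comparison is always True because a number is never equal to a missing value.

```python
from pandas import Series, isnull

# Invented "tops" values: 1 = in the top 10, 0 = not.
tops = Series([1, 1, 0, 0, 1])

# Move every value down one row; the first row becomes a missing value.
shifted = tops.shift(1)

# True wherever a value differs from the one just before it.
changed = tops != shifted

print(list(shifted))
print(list(changed))
```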

In [None]:
trendy['block'] = ((trendy['tops']!= trendy['tops'].shift(1)) | (trendy['epoch'] - trendy['epoch'].shift(1) > 1 ))
trendy.head(20)

In [None]:
trendy['block'] = ((trendy['tops']!= trendy['tops'].shift(1)) | (trendy['epoch'] - trendy['epoch'].shift(1) > 1 )).cumsum()
trendy.head(20)

In [None]:
trendy[trendy["tops"]==1].groupby("block").size()

That means we have a group of 10 sampling windows or about 2.5 hours. Here's the plot.

In [None]:
trendy_tops = trendy[trendy["tops"]==1]

# color by the block of the plotted rows (trendy_tops), so colors line up with the points
mydata = [go.Scatter(x=trendy_tops["datetime"],y=trendy_tops["topic_name"],mode="markers",marker=go.Marker(size=10,color=trendy_tops["block"],colorscale='Viridis'))]
mylayout = go.Layout(autosize=False, width=1000,height=400,margin=go.Margin(l=150,r=50,b=100,t=100,pad=4))
myfigure = go.Figure(data = mydata, layout = mylayout)
iplot(myfigure)

Arguably, this is not really what we mean by persistence. The gaps out of the top 10 are really short. We might want to relax the definition a little and look at top 20 with maybe 2 or 3 epoch skips. That would let the topic drop off for 15 or 30 minutes but still come back and be part of the persistence window.
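
One way to experiment with a looser definition is to wrap the two column calculations in a function with the cutoff and the allowed gap as parameters. This is a sketch: the function name label_blocks and its parameters top_n and max_gap are made up here, and the rows are invented. It only forgives sampling times where the topic is missing from the data entirely; a row that is present but ranked below the cutoff still starts a new block.

```python
from pandas import DataFrame

def label_blocks(frame, top_n=10, max_gap=1):
    """Return a sorted copy of frame with 'tops' and 'block' columns added."""
    out = frame.sort_values(by="epoch").copy()
    out["tops"] = (out["position"] <= top_n) + 0
    # A new block starts when the status flips or when too many epochs were skipped.
    status_changed = out["tops"] != out["tops"].shift(1)
    gap_too_big = (out["epoch"] - out["epoch"].shift(1)) > max_gap
    out["block"] = (status_changed | gap_too_big).cumsum()
    return out

# Invented rows for one topic in one city; epochs 4, 7 and 8 are missing.
toy = DataFrame({
    "epoch":    [1, 2, 3, 5, 6, 9],
    "position": [4, 8, 15, 7, 9, 3],
})

strict = label_blocks(toy, top_n=10, max_gap=1)
relaxed = label_blocks(toy, top_n=20, max_gap=2)

print(strict[strict["tops"] == 1].groupby("block").size())
print(relaxed[relaxed["tops"] == 1].groupby("block").size())
```

With the strict settings the toy topic splits into three short runs; with the relaxed settings the first five rows merge into a single block.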

**Your turn**

Below are the essential portions of the code to calculate persistence. In cell 1 you set your targets. In cell 2 you define what it means to persist. You could look at top 10 or top 20 trends. You can say that one sampling window gap declares a new block or you could be more forgiving.

Finally, we use the groupby() operation to determine the runs.

In [None]:
# 1. define target trend and city and subset the trends data
target_trend = '#USofScience'
city = 'Seattle'

trendy = trends[(trends['topic_name'] == target_trend) & (trends['city']==city)].sort_values(by='timestamp')

In [None]:
# 2. add new columns to help with the grouping
trendy["tops"] = (trendy["position"]<=10)+0
trendy['block'] = ((trendy['tops']!= trendy['tops'].shift(1)) | (trendy['epoch'] - trendy['epoch'].shift(1) > 1 )).cumsum()

In [None]:
# 3. Examine persistence
trendy[trendy["tops"]==1].groupby("block").size()

In [None]:
# 4. Make a plot
trendy_tops = trendy[trendy["tops"]==1]

# color by the block of the plotted rows (trendy_tops), so colors line up with the points
mydata = [go.Scatter(x=trendy_tops["datetime"],y=trendy_tops["topic_name"],mode="markers",marker=go.Marker(size=10,color=trendy_tops["block"],colorscale='Viridis'))]
mylayout = go.Layout(autosize=False, width=1000,height=400,margin=go.Margin(l=150,r=50,b=100,t=100,pad=4))
myfigure = go.Figure(data = mydata, layout = mylayout)
iplot(myfigure)

**Recurrence of a trend**

The recurrence of a trend is the number of times the trend reappears in the TTL (Trending Topic List) after initially dropping out of the TTL.
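
In plain Python the definition amounts to counting the separate stretches of presence and subtracting the first one. The sketch below uses an invented presence record with one entry per sampling time.

```python
from itertools import groupby

# Hypothetical record: 1 = topic was in the TTL at that sampling time, 0 = it was not.
in_ttl = [1, 1, 0, 0, 1, 0, 1, 1, 1, 0]

# Count the separate stretches of consecutive 1s.
appearances = sum(1 for present, _ in groupby(in_ttl) if present == 1)

# The first stretch is the initial appearance; every later one is a recurrence.
recurrence = max(appearances - 1, 0)

print(appearances, recurrence)
```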

The phenomenon causing recurrence is intuitively more challenging to comprehend than persistence. First, it makes sense to assume that if a trend can persist for longer its chances of recurrence are lower, because **sustained attention is hard!** Recurrence indicates disrupted or unsteady attention spans among users in the community. The reappearance of a trend could be due to many factors, including a drop in attention to one trend caused by a sudden relative increase in attention to another trend.

Here's another fascinating tidbit about recurrence: **data show that the origin location of a trend plays an important role in the recurrence score.** In fact, the recurrence score is higher if the location's population is larger and more diverse. For example, trends will recur more often in New York than Tallahassee. This is because a big city with a diverse population tweeting many different things disperses attention more quickly compared to the more homogeneous crowds of smaller cities, where people might have limited topics to tweet about.

Recurrence is also common after people wake up from sleep. Because you don't tweet in bed (or do you?)

<img src="https://cdn-images-1.medium.com/max/2000/1*4YreqD2g2mgtnrBlv0RNsw.gif">

In the previous code blocks, we saw 3 periods of time when the topic was active, meaning an initial window of popularity and then 2 more. So its recurrence is 2. Here we include the same basic code as above but the example is a different topic in a different city. It also computes the recurrence explicitly, using the len() function.

In [None]:
# 1. define target trend and city and subset the trends data
target_trend = 'Richard Spencer'
city = 'Las Vegas'

trendy = trends[(trends['topic_name'] == target_trend) & (trends['city']==city)].sort_values(by='timestamp')

In [None]:
# 2. add new columns to help with the grouping - here top 20 targets with a gap as big as 2 sampling times
trendy["tops"] = (trendy["position"]<=20)+0
trendy['block'] = ((trendy['tops']!= trendy['tops'].shift(1)) | (trendy['epoch'] - trendy['epoch'].shift(1) > 2 )).cumsum()

In [None]:
# 3. Examine recurrence
grps = trendy[trendy["tops"]==1].groupby("block")

# report target_trend (set in step 1), not target_topic from the earlier section
print("The topic {} recurs {} times in {}".format(target_trend, len(grps)-1, city))
print(grps.size())

In [None]:
# 4. Make a plot
trendy_tops = trendy[trendy["tops"]==1]

# color by the block of the plotted rows (trendy_tops), so colors line up with the points
mydata = [go.Scatter(x=trendy_tops["datetime"],y=trendy_tops["topic_name"],mode="markers",marker=go.Marker(size=10,color=trendy_tops["block"],colorscale='Viridis'))]
mylayout = go.Layout(autosize=False, width=1000,height=400,margin=go.Margin(l=150,r=50,b=100,t=100,pad=4))
myfigure = go.Figure(data = mydata, layout = mylayout)
iplot(myfigure)

**Drift**

In simple terms, the drift of a trend is the chronological order of geo-locations that it touches on its way to becoming a national trend (sometimes it doesn't go national but only local). The reason we calculate drift is to observe two powerful network effects:

* Drift can tell us which cities have low attention grasping capability, i.e. they can quickly catch up to another city's trending topic.
* Drift can tell us which cities have similar interests, which is one of the reasons the trend spreads to that city.
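
A rough way to compute drift for one topic is to find the first time it appeared in each city and then sort the cities by that time. The sketch below assumes "city" and "epoch" columns as in our data; the rows are invented and the cities are only placeholders.

```python
from pandas import DataFrame

# Invented rows for a single topic.
toy = DataFrame({
    "city":  ["Paris", "Lyon", "Paris", "New York", "Madrid", "Lyon", "Sydney"],
    "epoch": [1, 2, 2, 4, 5, 5, 7],
})

# Earliest epoch at which the topic was seen in each city, oldest first.
first_seen = toy.groupby("city")["epoch"].min().sort_values()

# The drift is the order in which cities picked up the topic.
drift = list(first_seen.index)
print(drift)
```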

Shown below is the drift of the #JesuisCharlie trend. It begins in Paris and then spreads to other French cities. After that, however, it simultaneously drifts to some US cities (like NY and San Diego) and European cities (Madrid, Düsseldorf) within a very short time. The final cities to be affected by the trend are South American and Australian cities.

<img src="https://cdn-images-1.medium.com/max/2000/1*nmDuxI2vBA-R5gwIb1xjeg.gif">

#### Bias

Now let’s think about the bias issue. Bias means certain responses are more probable than others. This might cause a data sensor to detect some changes more promptly than others. Bias is not always social, it can be dependent on sampling. 

Sometimes, it is caused by the inherent signal generation. A nice example of this is determining which news articles are most read by users. One could pick a signal like ‘# of RTs the tweets with that news article received in Twitter’. But note Twitter has lots of bots, algorithms that could tweet out links based on domains or keywords. Thus, a link that has been RT-ed a lot might be subject to bot bias. On the other hand, think about an app like Instapaper, which flags a ‘read’ every time the user scrolls down the page to reach 20% distance from the end. This signal has much less bias, because bots cannot scroll.

#### Algorithmic Curation

* How do we start thinking about ways to have editors work in tandem with algorithms to identify trends?

* What could happen if humans are not in the loop?

Here are some more interesting reads about humans and algorithmic trend capture:

| Article | Description |
| ------ | ----------- |
|1. [Fake news in Trends](https://www.washingtonpost.com/news/the-intersect/wp/2016/10/12/facebook-has-repeatedly-trended-fake-news-since-firing-its-human-editors/?utm_term=.ec1c1e47ca49)   |  Facebook fires editors, algorithm can't detect fake news. |
|2. [Is this how the Trending Algorithm works?](https://qz.com/769413/heres-how-facebooks-automated-trending-bar-probably-works/) | And does that make it vulnerable? |
|3. [The real problem with facebook and trending](https://stratechery.com/2016/the-real-problem-with-facebook-and-the-news/) | Is there a solution: editorial or algorithmic? |


**Tweets in the wild**

Twitter makes its data available via an Application Programming Interface or API. Think of it as a kind of web server that, instead of publishing HTML pages, offers you data. Web services are the soul of what we called Web 2.0. Data are interchangeable between machines allowing for automated processing. In this case, we are looking at trends, but the underlying Tweet data is available too. 

For a series of services to share data, you obviously have to agree on what the data looks like: its format. Twitter has a [published format for its tweets.](https://dev.twitter.com/overview/api/tweets) Have a look! Tweets are stored in JSON objects, where JSON stands for the JavaScript Object Notation. It looks a lot like native Python objects. We'll explain a bit more shortly.
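
To see how closely JSON maps onto Python objects, here is a sketch that parses a tiny hand-written JSON string. The values are invented and the string is far smaller than a real tweet.

```python
from json import loads

# A hand-written JSON string; note the lowercase false and null.
raw = '{"id": 1, "text": "hello #world", "user": {"name": "someone", "verified": false}, "coordinates": null}'

parsed = loads(raw)

print(type(parsed))                 # JSON objects become Python dictionaries
print(parsed["user"]["name"])       # nested objects become nested dictionaries
print(parsed["user"]["verified"])   # JSON false becomes Python False
print(parsed["coordinates"])        # JSON null becomes Python None
```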

We have captured 20,000 raw tweets from the DC area at noon during the inauguration. The data set is located here. Download it and move it to the folder where you have placed this notebook. Let's look at a tweet! 

In [None]:
from json import loads

tweet = loads(open("inauguration_data/noon_tweets/822503956907773952.json").read())

In [None]:
tweet

In [None]:
type(tweet)

To understand a tweet, we need one more built-in object in Python -- a dictionary. While lists let you store data sequentially (a first object, a second and a last), dictionaries store data under keys, which are usually words (technically, any immutable object can serve as a key). Here's an example.

In [None]:
mini_tweet = {'created_at': 'Fri Jan 20 17:59:59 +0000 2017',
              'source': '<a href="http://instagram.com" rel="nofollow">Instagram</a>',
              'text': 'millwoodschool Our upper school boys are bonding on the slopes! #Wintergreen #AnnualSkiTrip @\u2026 https://t.co/XYn5wDVJAt',
              'user': {'followers_count': 121,
                       'friends_count': 121,
                       'id': 608401769,
                       'lang': 'en',
                       'location': 'Richmond, VA',
                       'name': 'ChristensCreations',
                       'statuses_count': 2778}
              }

# print() with parentheses runs under both Python 2 and Python 3
print(mini_tweet["created_at"] + "\n")
print(mini_tweet["text"] + "\n")
print(mini_tweet["user"]["name"])

A dictionary, like a list, can hold anything. Lists, other dictionaries, boolean, numeric data... you name it. We are going to work with these tweets to pull out trends. What kind of data might be interesting? Have a look at some tweets and see what you can find.
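
As one idea of what might be interesting, hashtags can be pulled out of the tweet text and counted. The sketch below works on invented tweets shaped like the mini_tweet dictionary; using a regular expression on the "text" field is an assumption made for illustration, not the method the rest of this notebook relies on.

```python
import re
from collections import Counter

# Invented tweets with the same nesting as mini_tweet.
toy_tweets = [
    {"text": "Watching from the Mall #Inauguration", "user": {"name": "a"}},
    {"text": "Crowds everywhere #inauguration #DC", "user": {"name": "b"}},
    {"text": "Lunch time", "user": {"name": "c"}},
]

hashtag_counts = Counter()
for t in toy_tweets:
    # A hashtag here is a "#" followed by letters, digits or underscores.
    tags = re.findall(r"#\w+", t["text"])
    # Lowercase the tags so #Inauguration and #inauguration count together.
    hashtag_counts.update(tag.lower() for tag in tags)

print(hashtag_counts.most_common(2))
```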

**A first pass**

We have taken tweets from the noon period and boiled them down into a CSV for you. Have a look. Let's talk about what you need to make your own trending algorithm!

In [None]:
set_option("display.max_colwidth",140)

tweets = read_csv("inauguration_data/inauguration_tweets_at_noon.csv")
tweets.head(50)