In [None]:
%matplotlib inline

<h1 align="center">Trending Topics - A Signal for Attention </h1>
<hr>
<img src="http://noble66.github.io/Viz/mh370drift_world2.gif" style="width: 65%;"/>
<br>

Welcome to the first class where we will explore topics that trend on social media platforms, specifically Twitter and Facebook. 

There are **so** many questions that revolve around this particular feature available to users on most social networks today, that it would be hard to cover everything in a single class. However, we will do our best to discuss the relevant parts and defer the rest of the material via readings outside class. Main points from these reading materials will be explored further in [discussion sessions](#Discussion).

Here's the flow for our session on **trending topics**: 

> We will start with some [introductory](#Introduction) questions: How and where do trends come from? Who decides what's on the trending topic list? Can trends really be considered **a signal for "attention"** on social media platforms? It is important to understand these factors before we embark on parsing and finding patterns in data about computed trends from Facebook or Twitter. This is fundamental to any data science analysis: the incentives and procedures through which the data source generates and publishes data can often give us significant knowledge about how to interrogate it. 

>Following that, we will explore some real [datasets](#Data) of trends collected from Facebook and Twitter. This will give you hands-on experience analyzing trends, assessing their patterns and rhythms. On Thursday, we will flip things around and you will be asked to come up with trending algorithms.

>The next section will be [methods](#Methods), where we will study some metrics that help **quantify features in the trend signal**. Example features include how long a trend lasted in some location, or how many geolocations it spread to. Note that these features will not necessarily reveal something "interesting" about the trend. The idea of this exercise is to help you calibrate your thinking about trends as a signal of social attention. In other words, assuming trends were a signal of "attention", what metrics can be used to compare two trends? 

>Finally, we will [discuss](#Discussion) more general readings about trending topics, including the recent controversy around Facebook's Trending Topics, the limitations of algorithmic methods in surfacing trends and the unintended appearance of fake news in trending topic lists! 

[Introduction](#Introduction) | [Data](#Data) | [Methods](#Methods) | [Discussion](#Discussion)

<hr>

# Introduction

We will start this session with some web work and a discussion. 

1. The idea of a trending or popular topic has become a fixture of the web and almost a navigational strategy for many of us. But what does it mean for a topic to be trending? How do these notions relate to other uses of the word, say from fashion? Next, let's find situations where trends are published, and talk about what they mean. What's trending on these sites?

2. Next, let's focus just on Twitter and Facebook. What are the differences between their ideas of what's trending? What are your options for exploring trends on each platform? What comparisons can you make and what changes?

In a broad sense, the idea behind trending topics is to surface something that signifies what the community is interested in, or paying attention to. Thus, when a group of users on Twitter increasingly RT (Retweet) a message or tweet about some topic, Twitter labels this as a "trend". 

Due to the incredible amounts of user-generated messages that these social platforms see every day, Trending Topics are usually determined by an "algorithm" (a computer program or a set of guidelines for humans to follow... or a mixture). The role of the algorithm is to collect a substantial chunk of data generated on the social media site and strive to find **words or phrases that occur more frequently than others, or words that are being produced with high "velocity"**. 

In [None]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Trending topics are determined by an algorithm measuring velocity of Tweets about a topic, not overall popularity: <a href="https://t.co/8gJJOT3gCw">https://t.co/8gJJOT3gCw</a></p>&mdash; Policy (@policy) <a href="https://twitter.com/policy/status/742642510862950400">June 14, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

What does the term "velocity" mean here? Take a moment to figure it out. What have you found? 

Note that velocity is probably a better measure than frequency for satisfying the real-time condition: the overall frequency of a word might be very high while its usage in the last few minutes is low, meaning that it is not actually "trending" now. A simplistic way to calculate velocity is:

**Velocity_of_word (trending nature) = number of times a word was observed / time duration**
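
As a rough illustration of the formula above, the sketch below counts how often a word appears inside a time window and divides by the window length. The function name `velocity`, the `(minute, text)` message format and the sample messages are all assumptions made for this example; this is not Twitter's actual algorithm.

```python
def velocity(messages, word, window_start, window_end):
    """Occurrences of `word` per minute between window_start and window_end.

    `messages` is a list of (minute, text) pairs; minutes are plain numbers.
    """
    duration = window_end - window_start
    if duration <= 0:
        raise ValueError("window_end must be later than window_start")
    hits = 0
    for minute, text in messages:
        if window_start <= minute < window_end:
            hits += text.lower().split().count(word.lower())
    return hits / float(duration)


# Hypothetical messages: "superbowl" is frequent overall, "power" spikes late.
messages = [
    (0, "superbowl kickoff"), (5, "superbowl ads"), (12, "superbowl halftime"),
    (20, "superbowl snacks"), (51, "power outage"), (52, "no power here"),
    (53, "power is out"), (54, "superbowl paused"), (55, "power still out"),
]

print(velocity(messages, "superbowl", 0, 60))   # over the whole hour
print(velocity(messages, "power", 50, 60))      # over the last ten minutes
print(velocity(messages, "superbowl", 50, 60))  # over the last ten minutes
```

In this made-up data "superbowl" has the higher overall count, but "power" has the higher velocity in the most recent window, which is the distinction the formula is meant to capture.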


Remember when the power went out during the Super Bowl of 2013?
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">All the lights are out!! It’s pandemonium!! Thank god we have out Beyonce finger lights! <a href="http://t.co/JxUKr27i">pic.twitter.com/JxUKr27i</a></p>&mdash; Neil Patrick Harris (@ActuallyNPH) <a href="https://twitter.com/ActuallyNPH/status/298245477042900992">February 4, 2013</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

Here is a chart comparing two trending topics "superbowl" and "power" that evening. 

<img src="http://www.socialflow.com/wp-content/uploads/2013/02/power_superbowl.jpg" style="width: 80%;"/>


### 1.1 Sampling

Trending algorithms often begin by creating a **sample** of tweets or FB status messages. Why work with a subset of these messages rather than the whole thing, you ask? Well, for one, the data might simply be too big to analyze quickly. In this case, some form of random sampling would help reduce the size of the data, but hopefully preserve the important patterns. Twitter, for example, offers users a random sample of tweets from their API.

Also, depending on your notion of a trend, we might want to identify patterns in certain cohorts/groups/communities within a social network. This means creating a sample of data from just these members, possibly using statistical sampling again if the number is too large.

So sampling, being the first phase of the trend algorithm, plays a critical role in what trends are surfaced. For example, if you sample all Facebook messages originating from India vs. those originating from the USA, then there is bound to be a significant difference in the trending topic list, because the status updates would probably be different. This is true unless there is a global event that captures the attention of both nations simultaneously. 

Which sample you choose really goes back to your goals for computing a trending topic in the first place. What do you intend, or what do your readers expect?
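
To make the point about sampling concrete, the sketch below draws a simple random sample from a made-up list of status messages and compares the top topic for different cohorts. The data, the counts and the helper name `top_topic` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical (country, topic) pairs standing in for status messages.
messages = ([("IN", "cricket")] * 600 + [("IN", "election")] * 200 +
            [("US", "superbowl")] * 500 + [("US", "election")] * 300)

rng = random.Random(42)              # fixed seed so the sample is repeatable
sample = rng.sample(messages, 200)   # simple random sample, no replacement


def top_topic(rows, country=None):
    """Most frequent topic among rows, optionally restricted to one country."""
    topics = [topic for c, topic in rows if country is None or c == country]
    return Counter(topics).most_common(1)[0][0]


print(top_topic(messages, "IN"))   # cohort: India only
print(top_topic(messages, "US"))   # cohort: USA only
print(top_topic(messages))         # everyone pooled together
print(top_topic(sample))           # top topic in the random sample
```

The cohort you sample from changes which topic comes out on top, even though the underlying messages are the same.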

### 1.2 Personalization

Both Twitter and Facebook personalize trends based on many signals. The main signals include: 
* (1) Your interests: Pages you follow or things you LIKE.  
* (2) People in your social network, and
* (3) Your location: Twitter provides trends in more than 400 geolocations world wide. Facebook collects trends in only 5 zones, namely CA, IN, US, GB and AU. 

Twitter calls this sort of personalization "Tailored Trends", which is on by default, although you can click on "change" and retrieve trending topics for ANY geolocation. The Facebook trending list is personalized by default as well, but more importantly it is *not possible* to retrieve the global list of trends for any Facebook zone in Facebook's UI. This means that on Twitter you can see what's trending in Paris even if you live in Boston. On Facebook, you cannot see what's trending in Australia unless you live there. 

Luckily for us, our APIs have the ability to extract the global list of Facebook trends from all its 5 zones. 

### 1.3 Trend Propagation

So the next obvious question is: why does a trend start in one place and spread all over the network? To understand this, we must delve a bit deeper. A sequence of activity resembles what we call time series data. Activity in a social network can be tracked in three levels: (1) Micro, (2) Meso and (3) Macro. 

* Micro: You tweet or RT something
* Meso: All the people in Washington DC tweet/RT something
* Macro: The entire network activity

Meso signals resemble time series data generated from the activity of a collection of users. We do not know exactly how such activities correlate with one another, but most previous research hints at information cascades as one reason.

**Information cascades** resemble a mathematical system where each node decides whether or not to forward the message/information based on its previous experience (is this an influential person?) and the current signal information (is this tweet good enough to RT?). In general, the *n*th agent considers the decisions of the previous *n-1* agents, and his/her own signal. He/she then makes a decision based on some reasoning to determine the most rational choice. The result is a set of activities on information propagating from node to node. 
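
The sequential decision process described above can be sketched with a deliberately simplified toy model. It assumes each agent follows a plain majority vote over all earlier decisions plus its own private signal, with ties broken by the private signal. The majority rule and the name `run_cascade` are assumptions for illustration, not a model taken from the readings.

```python
def run_cascade(private_signals):
    """Sequential forwarding decisions under a simple majority rule.

    Each agent sees every earlier decision plus a private signal
    (True = "worth forwarding") and follows the majority of those votes.
    """
    decisions = []
    for signal in private_signals:
        votes_for = sum(decisions) + (1 if signal else 0)
        votes_against = (len(decisions) - sum(decisions)) + (0 if signal else 1)
        if votes_for == votes_against:
            decision = signal          # tie: trust the private signal
        else:
            decision = votes_for > votes_against
        decisions.append(decision)
    return decisions


# Two early forwards are enough to outvote every later negative signal.
print(run_cascade([True, True, False, False, False]))
# A mixed start keeps private signals relevant for longer.
print(run_cascade([False, True, False, True, True]))
```

In the first run, the last three agents forward the message even though their own signals say not to, which is the cascade effect.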

<img src="https://cdn-images-1.medium.com/max/800/1*7988kEm4d2hDZdCpQoMGiQ.png">

Because meso signals are generated from micro signal (individual) activity and individual activity is influenced by network topology (who you follow and who follows you), the trend signal is indirectly caused by information dynamics governed by network topology. 

### 1.4 The Attention Signal

Thus, the **trending topic over time is a SIGNAL** - a signal for what has captured the attention of a particular demographic (i.e. in a geo-location or within a community). Plotting such topic signals can often reveal interesting facts about information diffusion. 

For example, let's look at the trending topic '#Ferguson' in various US cities. 

<img src="https://cdn-images-1.medium.com/max/800/1*NXer4xgyr9qFEnUEjuIFJA.gif">

Here, the X-axis represents a fixed duration of time from when the trend was first seen in St. Louis. The Y-axis represents the score of the trend in the trending topic list (a score of 10 represents rank 1).

### 1.5 Introductory Data Reads

| Article | Description |
| ------ | ----------- |
|1. [FB Newsroom Trending Topics Guideline](https://fbnewsroomus.files.wordpress.com/2016/05/full-trending-review-guidelines.pdf)   |  How your news gets to be on the Trending Topic List |
|2. [The Digital Flames of Ferguson](https://medium.com/i-data/the-digital-flames-of-ferguson-87c4eb9aaae4#.5cuci5dm0) | Quantifying how news spreads from location to location using Twitter trends. |
|3. [#TrumpWon: Trend vs. Reality](https://medium.com/i-data/trumpwon-trend-vs-reality-16cec3badd60#.lspl71ll7)   | Our reality is decided by what we pay attention to. |
|4. [Hong Kong, the World and Facebook Trends](https://medium.com/i-data/hong-kong-the-world-and-facebook-trends-28331ee8f2d1#.60t0s8k3j)   | A product's user interface design can affect what becomes a trend |

<hr>

### Some Helper Scripts before we begin

##### Counters...

In [None]:
# A Counter is a really useful thing: it counts how often each item occurs in a list
# and can then return the items ordered from most to least common.
from collections import Counter
x = ['ford','chevy','honda','ford','chevy','lincoln','ford']
print(Counter(x))
print('')
print(Counter(x).most_common(2))

##### Sets...

In [None]:
x = ['ford','chevy','honda','ford','chevy','lincoln','ford']
print(set(x))

In [None]:
y = ['ford','chevy','aston martin']
set(x).intersection(set(y))

<hr>

# Data
[Introduction](#Introduction) | [Data](#Data) | [Methods](#Methods) | [Discussion](#Discussion)

There are two sources of trends data we will work with: Facebook and Twitter.

**Facebook Data Description**: The ingestion script probes Facebook every 15 minutes and logs the trends for five zones: CA, AU, US, IN, GB.  [Download the data](http://sumandebroy.com/columbia/fb_trends.csv.gz), uncompress it, and place it in the same folder as this notebook.

**Twitter Data Description**: The ingestion script probes Twitter every 15 minutes and logs the TTL (trending topic list) + underlying trends (users might not see this in the UI) for ~400 geographical locations worldwide. Again, [download the data](http://sumandebroy.com/columbia/twitter_trending_topics_for_us_120to122.csv.gz), uncompress it, and place it in the same folder as this notebook.

Let's do some basic analysis on the two datasets:
  * [Facebook](#tt-fb)
  * [Twitter](#tt-tw)

In [None]:
from pandas import read_csv

### 2.1 Facebook Trends <a id="tt-fb"></a>

In [None]:
#load fb trends data
trends = read_csv("fb_trends.csv")

In [None]:
# what does this data look like.. 
trends.head()

In [None]:
trends.count() 

**[Q]**: Why are there fewer headlines? Facebook does not return headlines for very low trending news. 

In [None]:
# what zones am I looking at? 
trends["zone"].value_counts()

In [None]:
trends['topic_name'].value_counts().head(20)

In [None]:
trends['topic_name'].describe()

In [None]:
trends[2160:2170]

In [None]:
# view some row
trends[1849:1850]

In [None]:
# you can also combine two columns !
trends[['position','topic_name']][:4]

In [None]:
# most common entries in a column, lets say topic_name
trends['topic_name'].value_counts().head(20)

In [None]:
# what were the 10 most popular trends in the US
trends['topic_name'][ trends['zone']=='US' ].value_counts().head(10)

In [None]:
# how many items are there on the 22nd ? 
mask = (trends['datetime'] > '2017-01-22') & (trends['datetime'] <= '2017-01-23')
mask.value_counts()

*An aside about columns of strings*

When a column in a DataFrame is made up of strings, you can access all the string methods you learned in your drill using the ".str" object of the column. So, here we might look at the uppercase versions of the topic_names.

In [None]:
trends["topic_name"].str.upper().head(10)

You can also use the subsetting operator to extract out portions of a string. 

In [None]:
trends["topic_name"].str[:5].head(10)

In [None]:
# using this, show me data from a given date
mask = trends['datetime'].str[:10] == '2017-01-22'
mask.value_counts()

Of course, there are several ways to do this. In this case, we could also use the string method startswith()

In [None]:
mask = trends['datetime'].str.startswith('2017-01-22')
mask.value_counts()

We'll have more to say about some of the nice syntax in a DataFrame, but for now, it's enough to know that the material we covered about strings is not lost.
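
As one more alternative to the string-based date filters above, the timestamps can be parsed once with `pd.to_datetime` and then compared against a half-open interval. The small frame below is made up for illustration; only its column names mirror the Facebook data, and the actual file may need extra parsing options.

```python
import pandas as pd

# Small made-up frame; the column names mirror the Facebook data.
toy = pd.DataFrame({
    "datetime": ["2017-01-21 23:45:00", "2017-01-22 00:00:00",
                 "2017-01-22 12:15:00", "2017-01-23 00:00:00"],
    "topic_name": ["A", "B", "C", "D"],
})

# String approach: match on the date prefix
by_string = toy["datetime"].str.startswith("2017-01-22")

# Timestamp approach: parse once, then keep rows with start <= time < end
parsed = pd.to_datetime(toy["datetime"])
start, end = pd.Timestamp("2017-01-22"), pd.Timestamp("2017-01-23")
by_time = (parsed >= start) & (parsed < end)

print(by_string.tolist())
print(by_time.tolist())
```

Both masks select the same rows here. The half-open interval makes it explicit that midnight on the 23rd belongs to the next day.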

In [None]:
# trends that reached rank 1 in India
trends[(trends['zone']=='IN') & (trends['position']==1)]  

Zero rows returned... looks fishy !
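
A quick way to investigate a surprise like this is to look at how pandas stored the column. The sketch below uses a small made-up frame (not the class data) in which the rank is stored as text, and shows one possible remedy with `pd.to_numeric`. The column name `position_num` is an assumption for this example.

```python
import pandas as pd

# Made-up frame where the rank is stored as text rather than as a number
toy = pd.DataFrame({"zone": ["IN", "IN", "US"],
                    "position": ["1", "2", "1"]}, dtype=object)

print(toy["position"].dtype)            # object: generic Python values, here strings
print((toy["position"] == 1).sum())     # text compared with a number: no matches
print((toy["position"] == "1").sum())   # text compared with text: matches

# One option: convert once; unparseable entries become NaN instead of raising
toy["position_num"] = pd.to_numeric(toy["position"], errors="coerce")
print((toy["position_num"] == 1).sum())
```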

In [None]:
# hmm.. whats going on here.
trends[(trends['zone']=='IN') & (trends['position']=='1')]  #example of dirty data, format returned is different 

In [None]:
# trends that reached rank 1 in GB on the 21st? 
trends[(trends['zone']=='GB') & (trends['position']=='1') & (trends['datetime'].str.startswith('2017-01-21'))]  

In [None]:
# any common trends between US and GB on the 20th  ?

gbTrends = trends['topic_name'][(trends['zone']=='GB') & (trends['datetime'].str.startswith('2017-01-20'))]
usTrends = trends['topic_name'][(trends['zone']=='US') & (trends['datetime'].str.startswith('2017-01-20'))]

print(set(gbTrends).intersection(set(usTrends)))

### 2.2 Twitter Trends <a id="tt-tw"></a>

In [None]:
#load twitter trends data
trends = read_csv('twitter_trending_topics_for_us_120to122.csv',names=["datetime","timestamp","country","city","position","topic_name"])

In [None]:
# what this data looks like
trends.head()

In [None]:
# how big is this dataset?
trends.count()  

**[Q]**: Why is the count of cities lower than the overall count?

In [None]:
# Describe topics trended on these days?
trends['topic_name'].describe()

In [None]:
# what are the most common trends in this time period
trends["topic_name"].value_counts().head(50)

In [None]:
# What trended in LA ? 
mask = (trends['city'] == 'Los Angeles')
trends[mask].head(10)

In [None]:
# more concise report on what trended in LA
trends['topic_name'][mask].value_counts().head(10)

**[Q]**: How do you find the top trend in other cities? 

In [None]:
# Limit by date: keep only rows logged on the 21st
mask = trends['datetime'].str.startswith('2017-01-21')
before22 = trends[mask]

before22['topic_name'].value_counts().head(10)

In [None]:
# what was trending in Seattle on the 21st of Jan? 
before22['topic_name'][ before22["city"]=='Seattle' ].value_counts().head(10)

In [None]:
# what trends reached rank 1 in Boston? 
trends['topic_name'][(trends['position']==1) & (trends['city']=='Boston')].value_counts()

In [None]:
# what trend was emerging, but never made it to Boston's TTL
trends['topic_name'][(trends['position']>10) & (trends['position']<15) & (trends['city']=='Boston')].value_counts()

In [None]:
# what were the most common trends in some city ? 
trends['topic_name'][(trends['city']=='Atlanta')].value_counts().head(50)

In [None]:
# Exercises

<hr>

# Methods
[Introduction](#Introduction) | [Data](#Data) | [Methods](#Methods) | [Discussion](#Discussion)

The methods in this section are features that trend signals usually contain. While these might not be valuable individually for comprehending the nature of a trend, a combination of all features allows us to do two things:

* Uniquely fingerprint a trend in the world
* Compare two trend signals

### Measures to capture features in a trend signal
  * [Origin](#chapter-1)
  * [Persistence](#chapter-2)
  * [Recurrence](#chapter-3)
  * [Geospan](#chapter-4)
  * [Drift](#chapter-5)
  * [Volatility](#chapter-6)

### 3.1 Origin<a id="chapter-1"></a>

The origin of a trend indicates where it was initiated. Origin is a fascinating feature for two main reasons: (1) big trends can originate in small cities and go national, and (2) some cities have an affinity for initiating certain categories of trends, e.g., many gaming trending topics originate in SF whereas numerous fashion trending topics originate in NY. 

In [None]:
# Exercise: Within the time limits of this data, find where "#WomensMarch" or "#USofScience" originated ! 

In [None]:
# Exercise: Can you find a city that did not originate any trend, but always latched on to existing ones? 

### 3.2 Persistence of a Trend <a id="chapter-2"></a>

Persistence of a trend is the duration of continuous time units for which it kept trending in some geo-location, signified by continual presence in the trending topic list. This means that during the persistence spell, a trend never fell out of the TTL and was not replaced by any other trend. 

So what does persistence really signify? Recall that a topic trends because people are tweeting about it. Two conditions are necessary for a trend to persist: 
* (1) a decent volume of tweets containing the trending word in a short amount of time and 
* (2) a failure of consolidation - i.e.  other tweets from the user group (either geo-location or follower group) fail to use the same trending word/ hash-tag in a consolidated fashion in enough tweets. This is also [the reason why #OccupyWallStreet **did not** trend in New York](http://www.niemanlab.org/2011/10/why-hasnt-occupywallstreet-trended-in-new-york/). 


<img src = "http://www.niemanlab.org/images/socialflow_twittertrending.png">

The first condition ensures that the word is trending enough to be above the threshold or cut-off marker that qualifies as a trend. The second condition ensures that other trends are not competing hard enough to enter the TTL. 
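
One possible way to turn this definition into a number is to find the longest unbroken run of polling intervals in which the trend was on the list. The sketch below assumes observations have already been converted to integer slot numbers, one per 15-minute poll. The function name `longest_persistence` and the sample slots are hypothetical.

```python
def longest_persistence(slots_present):
    """Length of the longest unbroken run of polling slots with the trend.

    `slots_present` holds integer slot numbers (0, 1, 2, ...), one for each
    polling interval in which the trend was on the list.
    """
    best = 0
    run = 0
    previous = None
    for slot in sorted(set(slots_present)):
        if previous is not None and slot == previous + 1:
            run += 1      # still inside the same spell
        else:
            run = 1       # a gap (or the first slot) starts a new spell
        best = max(best, run)
        previous = slot
    return best


# Hypothetical observations: on the list for slots 3-6, absent at 7, back for 8-9.
slots = [3, 4, 5, 6, 8, 9]
POLL_MINUTES = 15  # the data description says trends are logged every 15 minutes

print(longest_persistence(slots))                 # longest spell, in slots
print(longest_persistence(slots) * POLL_MINUTES)  # the same spell, in minutes
```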


#### Visualizing Persistence:

A smart way scientists visualize persistence is through something called a dispersion plot. The Y-axis represents geo-locations whereas the X-axis represents units of time since origin. You can then place a dot for every time the trend was observed at a location, and a blank if it wasn't. The result is continuous lines indicating persistence and gaps indicating the lack of it. 

<img src="drogba.png" width="60%">

Shown above is a dispersion plot of **Euro2012**, a soccer tournament. Notice how the trend persists/ sticks more in European cities than the American counterparts, indicating more interest or attention in the former locations.  
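
To get a feel for how such a plot is built, here is a minimal text-only sketch: one row per location, a `*` where the trend was observed in a time slot and a blank where it was not. The cities, slots and the helper name `dispersion_rows` are made up for illustration; a real plot would use the trend data and a plotting library.

```python
def dispersion_rows(observations, n_slots):
    """Build one text row per location: '*' if the trend was seen, ' ' if not."""
    rows = {}
    for city, slots in observations.items():
        seen = set(slots)
        rows[city] = "".join("*" if t in seen else " " for t in range(n_slots))
    return rows


# Hypothetical data: time slots (since origin) in which each city showed the trend.
observations = {
    "London":   [0, 1, 2, 3, 4, 5, 6, 7],
    "Madrid":   [1, 2, 3, 4, 5, 7],
    "New York": [2, 3, 6],
}

rows = dispersion_rows(observations, n_slots=8)
for city in sorted(rows):
    print("{:<9}|{}|".format(city, rows[city]))
```

Long unbroken runs of `*` correspond to persistence, and the gaps show where the trend dropped out.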

In [None]:
# calculate persistence of trend '#USofScience' in Boston. 
trends[(trends['topic_name'] == '#USofScience') & (trends['city']=='Boston') & (trends['position']<11)]

### 3.3 Recurrence of a Trend <a id="chapter-3"></a>

The recurrence of a trend is the number of times the trend reappears in the TTL (Trending Topic List) after initially dropping out of the TTL. 
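
Under this definition, recurrence can be counted by walking through a time-ordered presence record and counting every re-entry after the first appearance. The sketch below assumes one boolean per polling interval. The function name `recurrence` and the timeline are hypothetical.

```python
def recurrence(on_list):
    """Count how many times a trend re-enters the list after dropping out.

    `on_list` is a sequence of booleans, one per polling interval, ordered in
    time: True if the trend was on the trending topic list in that interval.
    """
    reappearances = 0
    seen_before = False   # has the trend been on the list at any earlier time?
    previous = False
    for present in on_list:
        if present and not previous and seen_before:
            reappearances += 1
        if present:
            seen_before = True
        previous = present
    return reappearances


# Hypothetical timeline: trends, drops out, returns, drops out, returns again.
timeline = [False, True, True, False, False, True, False, True, True]
print(recurrence(timeline))
```

The first appearance is not counted, only the returns that follow a gap.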

The phenomenon causing recurrence is intuitively more challenging to comprehend than persistence. Firstly, it makes sense to assume that if a trend can persist for longer its chances of recurrence are lower, because **sustained attention is hard!** Recurrence indicates disrupted or unsteady attention spans among users in the community. A trend may reappear due to many factors, including a drop in attention to one trend caused by a sudden relative increase in attention to another trend.  

Here's another fascinating tidbit about recurrence: **data shows that the origin location of a trend plays an important role in the recurrence score.** In fact, the recurrence score is higher if the location's population is larger and more diverse. For example, trends will recur more often in New York than in Tallahassee. This is because a big city with a diverse population tweeting many different things disperses attention more quickly than the more homogeneous crowds of smaller cities, where people might have limited topics to tweet about. 

Recurrence is also common after people wake up from sleep. Because you don't tweet in bed (or do you?)

<img src="https://cdn-images-1.medium.com/max/2000/1*4YreqD2g2mgtnrBlv0RNsw.gif">

In [None]:
# Exercise: 
# Find the recurrence of some trends from the Twitter dataset

### 3.4 GeoSpan <a id="chapter-4"></a>

The geospan of a trend signal signifies the various geo-locations at which it was observed. In the case of micro signals, this boils down to individuals from different locations acting upon the media related to the trend. Geospans are an important measure in identifying whether a trend has gone national, in which case it will be visible in most geo-locations of the country! 

In [None]:
# How many  cities in the US are we tracking? 
trends['city'][(trends['country'] == 'United States')].describe()

In [None]:
# How many cities did #TrumpInauguration trend at? 
trends['city'][(trends['topic_name'] == '#TrumpInauguration')].describe()

In [None]:
# How many cities did #WomensMarch trend at? 
trends['city'][(trends['topic_name'] == '#WomensMarch')].describe()

In [None]:
# Exercise: find the cities where #TrumpInauguration did not trend! 
noTrendHere = set(trends['city'][(trends['country'] == 'United States')]) - set(trends['city'][(trends['topic_name'] == '#TrumpInauguration')])
print(noTrendHere)

### 3.5 Drift <a id="chapter-5"></a>

In simple terms, the drift of a trend is the chronological order of geo-locations that it touches on its way to becoming a national trend (sometimes it doesn't go national but only local). The reason we calculate drift is to observe two powerful network effects:

* Drift can tell us which cities are quick to pick up attention, i.e. they quickly catch up to another city's trending topic.
* Drift can tell us which cities have similar interests, which is one of the reasons the trend spreads to that city.
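
The chronological ordering that defines drift can be sketched by recording the first time a trend was seen in each city and sorting on that time. The sightings below are made up, and the function name `drift` is an assumption for this example. Ties are broken alphabetically, which is an arbitrary choice.

```python
def drift(observations):
    """Order locations by the first time the trend was seen in each.

    `observations` is an iterable of (timestamp, city) pairs in any order.
    Returns a list of (city, first_seen) pairs sorted chronologically.
    """
    first_seen = {}
    for timestamp, city in observations:
        if city not in first_seen or timestamp < first_seen[city]:
            first_seen[city] = timestamp
    return sorted(first_seen.items(), key=lambda item: (item[1], item[0]))


# Hypothetical sightings; timestamp strings in this format sort chronologically.
sightings = [
    ("2017-01-21 09:30", "Seattle"),
    ("2017-01-21 09:00", "Boston"),
    ("2017-01-21 10:15", "Atlanta"),
    ("2017-01-21 09:15", "Boston"),
    ("2017-01-21 10:15", "Los Angeles"),
    ("2017-01-21 09:45", "Seattle"),
]

for city, when in drift(sightings):
    print("{} first seen at {}".format(city, when))
```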

Shown below is the drift of the #JesuisCharlie trend. It begins in Paris and then spreads to other French cities. After that, it drifts almost simultaneously to some US cities (like NY and San Diego) and European cities (Madrid, Düsseldorf) within a very short time. The final cities to be affected by the trend are South American and Australian cities. 

<img src="https://cdn-images-1.medium.com/max/2000/1*nmDuxI2vBA-R5gwIb1xjeg.gif">

In [None]:
# drift exercises time permitting. 

### 3.6 Volatility <a id="chapter-6"></a>

In [None]:
# More mathematical, time permitting. 

<hr>

# Discussion
[Introduction](#Introduction) | [Data](#Data) | [Methods](#Methods) | [Discussion](#Discussion)

#### Bias

Now let’s think about the bias issue. Bias means certain responses are more probable than others. This might cause a data sensor to detect some changes more promptly than others. Bias is not always social, it can be dependent on sampling. 

Sometimes, it is caused by the inherent signal generation. A nice example of this is determining which news articles are most read by users. One could pick a signal like ‘# of RTs the tweets with that news article received in Twitter’. But note that Twitter has lots of bots: algorithms that could tweet out links based on domains or keywords. Thus, a link that has been RT-ed a lot might be subject to bot bias. On the other hand, think about an app like Instapaper, which flags a ‘read’ every time the user scrolls down the page to reach 20% distance from the end. This signal has much less bias, because bots cannot scroll. 

#### Algorithmic Curation

* How do we start thinking about ways to have editors work in tandem with algorithms to identify trends?

* What could happen if humans are not in the loop?

Here are some more interesting reads about humans and algorithmic trend capture: 

| Article | Description |
| ------ | ----------- |
|1. [Fake news in Trends](https://www.washingtonpost.com/news/the-intersect/wp/2016/10/12/facebook-has-repeatedly-trended-fake-news-since-firing-its-human-editors/?utm_term=.ec1c1e47ca49)   |  Facebook fires editors, algorithm can't detect fake news. |
|2. [Is this how the Trending Algorithm works?](https://qz.com/769413/heres-how-facebooks-automated-trending-bar-probably-works/) | And does that make it vulnerable? |
|3. [The real problem with Facebook and trending](https://stratechery.com/2016/the-real-problem-with-facebook-and-the-news/) | Is there a solution: editorial or algorithmic? |




## Next Session:

Trends-2:
>Build your own trends from raw Twitter data. We will use raw tweets collected from Washington DC and New York to compile the trending topic list in these locations. All the personalization belongs to you.

> More visualizations