# Social Media Analytics
Twitter data can be useful in a number of different ways for journalism, such as helping to identify events, to understand the aggregate flows and trends of information, or to locate key sources within the network. 

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# Makes it so that you can scroll horizontally to see all columns of an output DataFrame
pd.set_option('display.max_columns', None)
# Make it so URLs and tweets won't get truncated when we print them out
# (None means no limit and requires pandas 1.0 or newer; older versions used -1)
pd.set_option('display.max_colwidth', None)

# This magic function allows you to see the charts directly within the notebook. 
%matplotlib inline

# This command will make the plots more attractive by adopting the common style of ggplot
matplotlib.style.use("ggplot")

### Gamergate
Andy Baio collected thousands of tweets over a period of 72 hours which used the #Gamergate hashtag. His analysis is [here](https://medium.com/message/72-hours-of-gamergate-e00513f7cf5d#.c12plmtcf) but below we'll start from his raw data and do some of our own analysis. You can download the data [here](https://www.dropbox.com/s/5zeuic9qr8v8y4n/gamergate_tweets_hydrated.csv?dl=0). 

In [None]:
# parse_dates expects a list of column names
gg_df = pd.read_csv("Data/gamergate_tweets_hydrated.csv", parse_dates=["created_at"])

# Print out the column headings to see what kind of data we have
gg_df.columns

### Analyzing Conversation

A hashtag conversation on Twitter can be characterized in different ways: 
1. How many tweets were sent and by how many unique users suggests how broad or concentrated the conversation is; 
2. The number of original tweets vs. retweets can indicate how much new information is being created in comparison to information that is being passed along; 
3. The number of tweets that are replies directly to another user can be a measure of how "chatty" the event is;  
4. The distribution of the conversation across different languages could indicate different interest groups as well as if the event has spread internationally;
5. Metrics can also be aggregated at the level of users. The total number of retweets, favorites, or followers for each user could indicate "important" or at least "interesting" people within the conversation

Let's walk through each of these analyses.

In [None]:
# Number of tweets sent
gg_df.shape[0]

You might notice that this number differs from the number that Andy indicates in the original blog post he wrote. The difference stems from the fact that the current data we're analyzing was "re-hydrated" using a tool called [Twarc](https://github.com/edsu/twarc). Twitter's TOS do not allow you to redistribute tweets. You can only share tweet IDs, and all of the tweet metadata has to be re-hydrated or re-constituted from those tweet IDs. The downside is that some people have deleted their accounts or deleted the original tweets (especially for a controversial topic like this). That means we can't rehydrate all of the original tweets. In fact we only have 215,153 which is about 68%. For this reason it's important to be timely in collecting Twitter data if you want to publish news with it.

In [None]:
# Number of Unique screen names that sent tweets
gg_df["user_screen_name"].unique().shape[0]

**1. How many tweets were sent and by how many unique users suggests how broad or concentrated the conversation is.**

To understand how broad or concentrated the conversation is, a histogram of the number of users against the number of tweets each one sent can help.

In [None]:
vc = gg_df["user_screen_name"].value_counts()
# Setting the number of bins to the max value count makes each bin roughly one tweet wide
# density=True normalizes the y-axis so it reads as roughly a proportion of users (older matplotlib versions called this parameter normed)
plt.hist(vc, bins=vc.max(), density=True)

# Crop the plot to show only up to 10 tweets per user
plt.xlim(0, 10)

# Make it bigger
fig = plt.gcf()
fig.set_size_inches(12,8)
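
A single summary number can complement the histogram. The sketch below computes the share of all tweets sent by the most active 10% of users. It runs on a small made-up Series, `toy_counts`, standing in for `vc`. The helper name `top_share` and the 10% cutoff are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def top_share(counts, fraction=0.10):
    """Share of all tweets sent by the most active `fraction` of users (hypothetical helper)."""
    ordered = counts.sort_values(ascending=False)
    # Always include at least one user in the "top" group
    n_top = max(1, int(np.ceil(len(ordered) * fraction)))
    return ordered.iloc[:n_top].sum() / float(ordered.sum())

# Made-up tweet counts per user, standing in for gg_df["user_screen_name"].value_counts()
toy_counts = pd.Series([50, 20, 10, 5, 5, 3, 3, 2, 1, 1],
                       index=["user%d" % i for i in range(10)])
share = top_share(toy_counts)
print("Top 10%% of users sent %.0f%% of tweets" % (100 * share))
```

A value near the cutoff itself (here, 10%) would suggest a broad conversation, while a much larger value suggests that a few accounts dominate.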

**2. The number of original tweets vs. retweets can indicate how much new information is being created in comparison to information that is being passed along;**

To calculate how many original tweets versus retweets are in the data, we can do some counting:

In [None]:
# Number of retweets vs. original tweets
# If reweet_id is null then the tweet is original, otherwise it is a retweet of the given reweet_id
n_total = gg_df.shape[0]
n_retweets = gg_df["reweet_id"].count()  # count() ignores null values
print("Num Retweets: %d, which is %.2f%% of total." % (n_retweets, 100.0 * n_retweets / n_total))
print("Num Original Tweets: %d, which is %.2f%% of total." % (n_total - n_retweets, 100.0 * (n_total - n_retweets) / n_total))
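
The same proportion can be read off a boolean mask, since the mean of a True/False column is the fraction of True values. This is a minimal sketch on a made-up frame, `toy_df`. It assumes, as in the cell above, that a null `reweet_id` marks an original tweet.

```python
import pandas as pd

# Made-up rows; None becomes a null value, marking an original tweet
toy_df = pd.DataFrame({"reweet_id": [None, 101.0, None, 102.0, 103.0]})

# True where the row is a retweet
is_retweet = toy_df["reweet_id"].notnull()

# The mean of a boolean column is the proportion of True values
retweet_share = is_retweet.mean()
print("Retweets: %.0f%%, originals: %.0f%%" % (100 * retweet_share, 100 * (1 - retweet_share)))
```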

**3. The number of tweets that are replies directly to another user can be a measure of how "chatty" the event is;**


In [None]:
# Number of Reply Tweets
# in_reply_to_status_id is not null if the tweet is a reply to another tweet. 
n_total = gg_df.shape[0]
n_replies = gg_df["in_reply_to_status_id"].count()  # count() ignores null values
print("Num Replies: %d, which is %.2f%% of total." % (n_replies, 100.0 * n_replies / n_total))
print("Num Non-Replies: %d, which is %.2f%% of total." % (n_total - n_replies, 100.0 * (n_total - n_replies) / n_total))

**4. The distribution of the conversation across different languages could indicate different interest groups as well as if the event has spread internationally;**

In [None]:
# Histogram across languages ("lang" is the language of the account not necessarily of the message of the tweet)
vcounts = gg_df["lang"].value_counts(ascending=True)
print(vcounts)

# Because the "en" and "und" variables will dominate the bar chart, we remove them before plotting
del vcounts["en"]
del vcounts["und"]
vcounts.plot(kind="barh")

# Make it bigger
fig = plt.gcf()
fig.set_size_inches(12,8)

**5. Metrics can also be aggregated at the level of users. The total number of retweets, favorites, or followers for each user could indicate "important" or at least "interesting" people within the conversation**

Let's look at activity aggregated by user.

In [None]:
gg_user_grouped = gg_df.groupby("user_screen_name")
print("Top 50 Users by # Tweets")
top_50_users = gg_user_grouped.size().sort_values(ascending=False)[0:50]
top_50_users

Just how active were those top 50 users? Let's tabulate the average and median # of RTs for each of the people in the top 50 most active users.

In [None]:
# First count up the number of tweets from those top users
gg_top_50_users_df = gg_df[gg_df["user_screen_name"].isin(top_50_users.index.values)]
print("%d tweets from top 50 users \n" % gg_top_50_users_df.shape[0])
gg_top_50_grouped = gg_top_50_users_df.groupby("user_screen_name")

# Top 50 most active users ranked by average RTs / tweet
print("Top 50 most active users ranked by average RTs / tweet")
print(gg_top_50_grouped["retweet_count"].mean().sort_values(ascending=False))
print("\n")

# Top 50 users in terms of median RTs
print("Top 50 most active users ranked by median RTs / tweet")
print(gg_top_50_grouped["retweet_count"].median().sort_values(ascending=False))
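
Several per-user statistics can also be computed in one pass by giving `agg` a list of function names. This is a sketch on made-up data, `toy_df`. The column names mirror the ones used above, but the values are invented for illustration.

```python
import pandas as pd

# Made-up tweets: user "a" sent three tweets, user "b" sent two
toy_df = pd.DataFrame({
    "user_screen_name": ["a", "a", "a", "b", "b"],
    "retweet_count": [0, 2, 10, 1, 3],
})

# One row per user, with the tweet count plus mean and median retweets per tweet
summary = (toy_df.groupby("user_screen_name")["retweet_count"]
           .agg(["count", "mean", "median"])
           .sort_values("mean", ascending=False))
print(summary)
```

Putting the mean and median side by side makes skew easy to spot: a user whose mean is far above their median probably had one or two tweets that spread widely.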

**Exercise**: How can we adapt the code above to compute the mean and median number of favorites across all users? 

### The Pulse of the Event
What's the shape of the event in terms of the hashtags that are used? 

Let's first examine the set of hashtags that are used at all:

In [None]:
# Hashtag trend
# The hashtags field can hold multiple hashtags separated by spaces, so we need to split them apart to be able to count them
hashtags_list = []

def parse_hashtags(hashtags):
    hashtags_list.extend(hashtags.split(" "))
    
gg_df["hashtags"].dropna().map(parse_hashtags)

hashtags_df = pd.DataFrame(hashtags_list, columns=["hashtag"])
print("Number of unique hashtags: %d " % hashtags_df["hashtag"].unique().shape[0])
print("\nTop Ten Hashtags:")
print(hashtags_df["hashtag"].value_counts()[0:10])

# Convert all the hashtags to lowercase since otherwise we have variations based on capitalization
print("")
hashtags_df["hashtag"] = hashtags_df["hashtag"].map(lambda x: x.lower())
print("Number of unique hashtags: %d " % hashtags_df["hashtag"].unique().shape[0])
print("\nTop Ten Hashtags:")
print(hashtags_df["hashtag"].value_counts()[0:10])

print("")
top_ten_hashtags = hashtags_df["hashtag"].value_counts()[0:10].index.values
print(top_ten_hashtags)
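
The split-and-count step can also be written without a helper function, using pandas string methods and `explode`. This is a minimal sketch on a made-up Series, `toy_tags`. It assumes pandas 0.25 or newer (where `Series.explode` was introduced) and, like the cell above, that hashtags are separated by single spaces.

```python
import pandas as pd

# Made-up hashtag strings; None stands for a tweet without hashtags
toy_tags = pd.Series(["GamerGate NotYourShield", None, "gamergate", "Gamergate ethics"])

tag_counts = (toy_tags.dropna()      # drop tweets that have no hashtags
              .str.lower()           # merge capitalization variants
              .str.split(" ")        # one list of hashtags per tweet
              .explode()             # one row per hashtag
              .value_counts())
print(tag_counts)
```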

Now let's plot these hashtags over time so we can see the shape of how they were used and if there are any patterns. 

In [None]:
# We need both the hashtags and the creation date in the same list to plot one against the other
# Here just tabulate for the top 10 hashtags
hashtags_list = []

def parse_hashtags(row):
    htags = row["hashtags"].split(" ")
    for h in htags:
        # Lowercase before comparing, because top_ten_hashtags holds lowercased hashtags
        h = h.lower()
        if h in top_ten_hashtags:
            hashtags_list.append([h, row["created_at"]])
    
gg_df.dropna(subset=["hashtags"]).apply(parse_hashtags, axis=1)

# Create a data frame from the list
hashtags_df = pd.DataFrame(hashtags_list, columns=["hashtag", "created_at"])
# Need to parse the created_at field as a datetime
hashtags_df["created_at"] = pd.DatetimeIndex(hashtags_df["created_at"])
# Now generate the histogram
hashtags_df.hist(column="created_at", by="hashtag", bins=72, figsize=(12,12), sharex=True)
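
If a table is easier to scan than a set of histograms, the same shape can be tabulated as counts per hour and hashtag. The sketch below uses a made-up frame, `toy`, with invented timestamps. Flooring to 60-minute bins is an illustrative choice, not something the original analysis prescribes.

```python
import pandas as pd

# Made-up (hashtag, timestamp) pairs in the same layout as hashtags_df
toy = pd.DataFrame({
    "hashtag": ["gamergate", "gamergate", "notyourshield", "gamergate"],
    "created_at": pd.to_datetime([
        "2014-01-01 00:05:00", "2014-01-01 00:40:00",
        "2014-01-01 00:50:00", "2014-01-01 02:10:00",
    ]),
})

# Round each timestamp down to the start of its hour, then count hashtags per hour
hour = toy["created_at"].dt.floor("60min")
hourly = pd.crosstab(hour, toy["hashtag"])
print(hourly)
```

Each row of `hourly` is an hour and each column a hashtag, so `hourly.plot()` would draw one line per hashtag if a chart is wanted.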

---
### Identifying Breaking News Content
Besides the people participating in a conversation, we may also be interested in identifying key content. This can be very helpful for finding content that's relevant in breaking news situations. The data used below comes from the day of the 2013 DC Navy Yard shooting and can be downloaded [here](https://www.dropbox.com/s/m6dlp6oacyt8vhi/navyyard_tweets_hydrated.csv?dl=0). Useful rankings include:

- Most RTed Tweets
- Most Favorited Tweets
- Most RTed Images
- Most Favorited Images
- Most Replied-to Tweets

In [None]:
# Notes on data: "I collected data using the public API's search endpoint, using the Navy Yard's coordinates as a center point and a mile radius. I also backfilled users found through this query using the user timeline endpoint. This gives me tweets leading up to the event." What are the implications of using only geotagged tweets?
import pytz
import datetime

# created_at is left as a string here because utc_to_local below parses it with strptime
ny_df = pd.read_csv("Data/navyyard_tweets_hydrated.csv")

# The pytz allows us to convert from Universal time to eastern time (Note: special thanks to Jennifer Stark for this code)
local_tz = pytz.timezone('US/Eastern')
def utc_to_local(row):
    # Parse the string UTC date into a datetime python object
    utc_dt = datetime.datetime.strptime(row, "%Y-%m-%d %H:%M:%S")
    # Change the timezone to eastern and output the datetime as a string again
    return utc_dt.replace(tzinfo=pytz.utc).astimezone(local_tz).strftime('%Y-%m-%d %H:%M:%S')

ny_df["created_at"] = ny_df["created_at"].apply(utc_to_local)

# We know the first tweet relating to the event was at about 8:30am on Sept 16th so let's filter for that.
ny_df = ny_df[ny_df["created_at"] > "2013-09-16 08:30:00"]

ny_df[["created_at", "text"]]
#ny_df.shape
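
The same UTC-to-Eastern conversion can be done on a whole column at once with pandas' own timezone methods. This is a sketch on made-up timestamps, `toy_utc`. It assumes the strings follow the `%Y-%m-%d %H:%M:%S` format used above and that the `US/Eastern` zone is available in the local timezone database.

```python
import pandas as pd

# Made-up UTC timestamps stored as strings
toy_utc = pd.Series(["2013-09-16 12:30:00", "2013-09-16 16:00:00"])

local = (pd.to_datetime(toy_utc, format="%Y-%m-%d %H:%M:%S")
         .dt.tz_localize("UTC")          # mark the naive times as UTC
         .dt.tz_convert("US/Eastern"))   # shift them to Eastern time

# Back to strings so they can be compared the same way as in the cell above
local_str = local.dt.strftime("%Y-%m-%d %H:%M:%S")
print(local_str.tolist())
```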

In [None]:
print(ny_df.columns)

In [None]:
print(ny_df.shape[0])

A good filter to find original breaking news content is to look for tweets that are NOT retweets, but that have been retweeted or favorited themselves.

In [None]:
# filter for non-RTs, and sort by RT count
ny_original_df = ny_df[ny_df["reweet_id"].isnull()]
ny_original_df.sort_values(["retweet_count"], ascending=False)[0:10][["tweet_url", "text", "retweet_count"]]

In [None]:
# filter for non-RTs, and sort by favorite count
ny_original_df = ny_df[ny_df["reweet_id"].isnull()]
ny_original_df.sort_values(["favorite_count"], ascending=False)[0:10][["tweet_url", "text", "favorite_count"]]
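
The filter described above, non-retweets that were themselves retweeted or favorited, can be expressed as one boolean mask. This is a sketch on made-up tweets, `toy_df`. The combined `engagement` score (retweets plus favorites) is an illustrative assumption, not a metric from the original analysis.

```python
import pandas as pd

# Made-up tweets; a null reweet_id marks an original tweet
toy_df = pd.DataFrame({
    "text": ["eyewitness report", "RT eyewitness report", "unnoticed post", "photo from scene"],
    "reweet_id": [None, 1.0, None, None],
    "retweet_count": [12, 12, 0, 3],
    "favorite_count": [4, 0, 0, 7],
})

is_original = toy_df["reweet_id"].isnull()
has_engagement = (toy_df["retweet_count"] > 0) | (toy_df["favorite_count"] > 0)

# Keep original tweets that got at least one retweet or favorite
candidates = toy_df[is_original & has_engagement].copy()
candidates["engagement"] = candidates["retweet_count"] + candidates["favorite_count"]
candidates = candidates.sort_values("engagement", ascending=False)
print(candidates[["text", "engagement"]])
```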

In [None]:
# filter for non-RTs, and for tweets with images, and sort by RT count
ny_original_df = ny_df[ny_df["reweet_id"].isnull()]
ny_original_df = ny_original_df.dropna(subset=["media"])
ny_original_df = ny_original_df.sort_values(["retweet_count"], ascending=False)[0:10][["media", "text", "retweet_count"]]
ny_original_df

In [None]:
# We can review some of these images (sorted by RT counts)
from IPython.display import Image
from IPython.core.display import HTML 
from IPython.display import display
images = []
for i in np.arange(0,10):
    images.append(Image(url=ny_original_df.iloc[i].media))
    
display(*images)


In [None]:
# filter for non-RTs, and for tweets with images, and sort by Favorite count
ny_original_df = ny_df[ny_df["reweet_id"].isnull()]
ny_original_df = ny_original_df.dropna(subset=["media"])
ny_original_df = ny_original_df.sort_values(["favorite_count"], ascending=False)[0:10][["media", "text", "favorite_count"]]
ny_original_df

In [None]:
images = []
for i in np.arange(0,10):
    images.append(Image(url=ny_original_df.iloc[i].media))
    
display(*images)

In [None]:
# sort by most replied to tweet
ny_repliedto_df = ny_df[ny_df["in_reply_to_status_id"].notnull()]
print(ny_repliedto_df["in_reply_to_status_id"].value_counts().sort_values(ascending=False))
# Apparently no tweet was replied to more than once in this dataset
ny_repliedto_df[["in_reply_to_status_id", "text"]]

### Network Analysis
An important aspect of social media is that people are linked to other people. Those links can help define groups of people (e.g. if a set of people are all interconnected), or help in identifying central participants who may play key information roles (e.g. if one user is connected to many others, or talks to many others). The strength of a connection between users could indicate how well they know each other. 

NetworkX is a great library for doing network analysis. Its documentation is here: [https://networkx.github.io/documentation/latest/](https://networkx.github.io/documentation/latest/)

We'll cover some basics of using NetworkX here, including:
- How to create a graph with nodes and edges, annotate nodes and edges, draw a graph
- How to construct a graph from the reply network of users
- How to calculate some basic centrality measures for identifying "interesting" nodes

In [None]:
import networkx as nx

g = nx.Graph()
g

In [None]:
print(g.nodes())
print(g.edges())

In [None]:
g.add_node(1)
print(g.nodes())

In [None]:
g.add_nodes_from([2,3])
print(g.nodes())

In [None]:
g.add_edge(1,2)
g.add_edges_from([(2,3), (1,3)])
print(g.edges())

In [None]:
print("# Nodes: %d" % g.number_of_nodes())
print("# Edges: %d" % g.number_of_edges())

We can add metadata to edges or nodes, like a weight value for an edge, or a name for a node.

In [None]:
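g[1][2]["weight"] = 5
g[1][3]["weight"] = 10
print(g[1])

# g.nodes[...] gives access to a node's attribute dictionary (older NetworkX versions used g.node[...])
g.nodes[1]["name"] = "Nick"
print(g.nodes[1])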

And we can draw the labeled graph. 

In [None]:
# nx.draw positions the nodes with a spring force layout by default
nx.draw(g, node_color="#ffaaaa", with_labels=True)

Going back to the Navy Yard event, let's see who is talking to whom in terms of @ replies.

In [None]:
# Let's construct a graph from the reply network of users
# Keep only the tweets that are replies to another user
ny_df_filtered = ny_df.dropna(subset=["in_reply_to_screen_name"])
# To determine unique users we must consider user_screen_name as well as in_reply_to_screen_name fields
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
unique_users = pd.concat([ny_df_filtered["user_screen_name"], ny_df_filtered["in_reply_to_screen_name"]]).unique()
print("# Users: %d" % unique_users.shape[0])

# Create a graph
ny_g = nx.Graph()
# for each unique user, add a node to the graph
for n in unique_users:
    ny_g.add_node(n)

for i in ny_df_filtered.index:
    n1 = ny_df_filtered.loc[i]["user_screen_name"]
    n2 = ny_df_filtered.loc[i]["in_reply_to_screen_name"]
    # If it already has the edge, then just increment the weight; otherwise add a new edge with weight = 1
    if ny_g.has_edge(n1,n2):
        ny_g[n1][n2]["weight"] += 1
    else:
        ny_g.add_edge(n1,n2,weight=1)

# Lots of parameters to tweak for the visualization
nx.draw(ny_g, node_color="#ff8888", node_size=500, alpha=.8, with_labels=False, font_size=10, pos=nx.spring_layout(ny_g, k=(3/np.sqrt(len(ny_g.nodes()))), weight="weight", iterations=70))
plt.gcf().set_size_inches(12,12)
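
Replies have a direction: one user replies to another. The undirected graph above ignores that direction, so a directed variant can show who receives replies rather than just who is connected. This is a sketch on a made-up reply table, `toy_replies`. Using `nx.DiGraph` here is an alternative to the approach above, not a replacement for it.

```python
import networkx as nx
import pandas as pd

# Made-up replies: each row is one tweet from user_screen_name to in_reply_to_screen_name
toy_replies = pd.DataFrame({
    "user_screen_name": ["alice", "bob", "alice", "carol"],
    "in_reply_to_screen_name": ["bob", "alice", "bob", "bob"],
})

# Count how many times each (sender, target) pair occurs; that count becomes the edge weight
edge_weights = toy_replies.groupby(["user_screen_name", "in_reply_to_screen_name"]).size()

reply_g = nx.DiGraph()
for (sender, target), weight in edge_weights.items():
    reply_g.add_edge(sender, target, weight=int(weight))

# Weighted in-degree = total number of replies each user received
replies_received = dict(reply_g.in_degree(weight="weight"))
print(replies_received)
```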

We could also look at the most connected people (in terms of the reply network we've created) by calculating the degree of each node. The degree of a node measures the number of incident edges.

In [None]:
nx.degree(ny_g)

[Centrality measures](https://networkx.github.io/documentation/latest/reference/algorithms.centrality.html) can be used to calculate the importance of nodes in various ways. For instance, the degree centrality of a node is the fraction of the other nodes it is connected to. 

In [None]:
# The degree_centrality function returns a dictionary, so we use the from_dict pandas function to create a dataframe from that. 
dc = pd.DataFrame.from_dict(nx.degree_centrality(ny_g), orient="index")
dc.columns=["deg_centrality"]
dc.sort_values("deg_centrality", ascending=False)
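
To make the definition concrete, the sketch below recomputes degree centrality by hand as degree divided by (n - 1) and compares it with NetworkX's result on a small made-up graph, `toy_g`. It also computes betweenness centrality as a second example of a centrality measure. The graph and the choice of second measure are illustrative assumptions.

```python
import networkx as nx

# Made-up graph: a chain a-b-c, with c also linked to d and e
toy_g = nx.Graph()
toy_g.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")])

# Degree centrality by hand: number of neighbors divided by the number of other nodes
n = toy_g.number_of_nodes()
manual_dc = {node: deg / float(n - 1) for node, deg in dict(toy_g.degree()).items()}
toy_dc = nx.degree_centrality(toy_g)

# Betweenness centrality: how often a node lies on shortest paths between other nodes
toy_bc = nx.betweenness_centrality(toy_g)

print(toy_dc)
print(toy_bc)
```

Node `c` scores highest on both measures here. In a real reply network the two rankings can diverge, for example when a user with few connections bridges two otherwise separate groups.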


### What other questions could we ask of this data? 
If there's time let's brainstorm a few and see if we can work out the answers. 