
# Part 1 - Graph Centrality Measures

In this part, we load the Kite network data and compute several graph centrality measures, some of them by hand. First we load the data again:

In [5]:
# Hide some silly output
import logging
logging.getLogger("requests").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)

# Import everything we need
import graphlab as gl

# Load Data
kite_vertices = gl.SFrame.read_csv('../Week1/kite_vertices.csv')
kite_edges = gl.SFrame.read_csv('../Week1/kite_edges.csv')

# Create graph
g_kite = gl.SGraph()
g_kite = g_kite.add_vertices(vertices=kite_vertices, vid_field='name')
g_kite = g_kite.add_edges(edges=kite_edges, src_field='src', dst_field='dst')
g_kite = g_kite.add_edges(edges=kite_edges, src_field='dst', dst_field='src')

# Visualize graph?
gl.canvas.set_target('ipynb')
g_kite.show(vlabel="id")

PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Week1/kite_vertices.csv
PROGRESS: Parsing completed. Parsed 10 lines in 0.034121 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Week1/kite_vertices.csv
PROGRESS: Parsing completed. Parsed 10 lines in 0.019032 secs.
PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Week1/kite_edges.csv
PROGRESS: Parsing completed. Parsed 18 lines in 0.019352 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,str]
If parsing fails due to incorrect types, you can


I will look at degree centrality, closeness centrality, betweenness centrality, eigenvector centrality and PageRank.

## Degree Centrality

In [67]:
from graphlab import degree_counting
deg = degree_counting.create(g_kite)
deg_graph = deg['graph']
in_degree = deg_graph.vertices[['__id', 'in_degree']]
in_degree

__id,out_degree
Beverly,4
Fernando,5
Diane,6
Jane,1
Ed,3
Garth,5
Andre,4
Carol,3
Ike,2
Heather,3


As we can see, Diane is the most central node by this metric, which matches intuition: Diane has the most connections (six).

## Closeness Centrality

Closeness centrality is computed by finding the shortest paths between all pairs of nodes and then, for each node, computing its average distance to the nodes in the graph, dividing that by the maximum distance, and taking the reciprocal. The code below averages over all ten vertices, counting the node itself at distance 0. GraphLab allows us to find shortest paths starting from a node. We will use that to build an algorithm that calculates this measure for all nodes:

In [19]:
from graphlab import shortest_path

# Find shortest paths to all nodes for each vertex in graph
shortestPaths = {}
for vertex in g_kite.get_vertices():
    shortestPaths[vertex['__id']] = shortest_path.create(g_kite, source_vid=vertex['__id'], verbose=False)

# Find maximum distance
maxDistance = 0
for node, sp in shortestPaths.iteritems():
    maxDistance = max(maxDistance, max(sp['distance']['distance']))

# Calculate a closeness metric for all nodes
closeness = {}
for vertex, sp in shortestPaths.iteritems():
    closeness[vertex] = sum(sp['distance']['distance']) / float(len(sp['distance']['distance']))
    closeness[vertex] = maxDistance / closeness[vertex]
closeness

{'Andre': 2.3529411764705883,
 'Beverly': 2.3529411764705883,
 'Carol': 2.2222222222222223,
 'Diane': 2.6666666666666665,
 'Ed': 2.2222222222222223,
 'Fernando': 2.857142857142857,
 'Garth': 2.857142857142857,
 'Heather': 2.6666666666666665,
 'Ike': 1.9047619047619047,
 'Jane': 1.3793103448275863}

This looks like an interesting measure: Fernando and Garth are tied for first, with Diane and Heather next. It seems more informative than degree centrality, since some nodes may have only a few connections but still end up on many shortest paths. Heather, for example, has only three connections yet ties with Diane.
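
To make the calculation concrete without GraphLab, here is a minimal pure-Python sketch. The edge list is an assumption: it is the standard Krackhardt kite network, which matches the degrees listed earlier but is not read from `kite_edges.csv`. The helper names (`build_adjacency`, `bfs_distances`, `closeness_scores`) are illustrative. Like the cell above, it averages over all ten vertices (the source included, at distance 0) and divides the graph's maximum distance by that average.

```python
from collections import deque

# Assumed edge list: the standard Krackhardt kite network (undirected)
KITE_EDGES = [
    ("Andre", "Beverly"), ("Andre", "Carol"), ("Andre", "Diane"),
    ("Andre", "Fernando"), ("Beverly", "Diane"), ("Beverly", "Ed"),
    ("Beverly", "Garth"), ("Carol", "Diane"), ("Carol", "Fernando"),
    ("Diane", "Ed"), ("Diane", "Fernando"), ("Diane", "Garth"),
    ("Ed", "Garth"), ("Fernando", "Garth"), ("Fernando", "Heather"),
    ("Garth", "Heather"), ("Heather", "Ike"), ("Ike", "Jane"),
]

def build_adjacency(edges):
    """Return a dict mapping each vertex to the set of its neighbours."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def bfs_distances(adjacency, source):
    """Hop counts from source to every reachable vertex (source itself is 0)."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return distances

def closeness_scores(adjacency):
    """Maximum distance in the graph divided by each vertex's mean distance."""
    all_distances = {v: bfs_distances(adjacency, v) for v in adjacency}
    max_distance = max(max(d.values()) for d in all_distances.values())
    scores = {}
    for vertex, d in all_distances.items():
        mean_distance = sum(d.values()) / float(len(d))
        scores[vertex] = max_distance / mean_distance
    return scores

adjacency = build_adjacency(KITE_EDGES)
closeness = closeness_scores(adjacency)
for name in sorted(closeness, key=closeness.get, reverse=True):
    print("%-9s %.4f" % (name, closeness[name]))
```

Under the assumed edge list, this gives the same scores as the GraphLab-based cell, for example 4 / 1.4 for Fernando and 4 / 2.9 for Jane.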

## Betweenness Centrality

I am not sure how accurate this is, especially since GraphLab returns only one shortest path between two nodes even when several exist, whereas betweenness should count all of them.

In [16]:
x = shortestPaths['Carol']
x.get_path('Garth')
#'Andre' in [ p[0] for p in x.get_path('Diane')]

[('Carol', 0.0), ('Diane', 1.0), ('Garth', 2.0)]

In [20]:
betweenness = { }
for v in g_kite.get_vertices():
    numShortestPathsWithVertex = 0
    numShortestPaths = 0
    for s in g_kite.get_vertices():
        for t in g_kite.get_vertices():
            if v != s and v != t and s != t:
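                sp = shortestPaths[s['__id']]
                # Use sp (paths from s), not x (Carol's paths from the previous cell).
                # The output recorded below predates this fix, hence Carol's 1.0.
                if v['__id'] in [ p[0] for p in sp.get_path(t['__id'])]: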
                    numShortestPathsWithVertex = numShortestPathsWithVertex + 1
                numShortestPaths = numShortestPaths + 1
    betweenness[v['__id']] = numShortestPathsWithVertex / float(numShortestPaths)
betweenness

{'Andre': 0.1111111111111111,
 'Beverly': 0.0,
 'Carol': 1.0,
 'Diane': 0.2222222222222222,
 'Ed': 0.0,
 'Fernando': 0.3333333333333333,
 'Garth': 0.0,
 'Heather': 0.2222222222222222,
 'Ike': 0.1111111111111111,
 'Jane': 0.0}
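
Since the single-path limitation is the main worry here, the following is a minimal pure-Python sketch that counts all shortest paths between each pair. It is not GraphLab's method. The edge list is again an assumption (the standard Krackhardt kite network, not read from `kite_edges.csv`), the helper names are illustrative, and the graph is assumed to be connected. For each unordered pair (s, t), a vertex v receives the fraction of shortest s-t paths that pass through it; scores are left un-normalized.

```python
from collections import deque
from itertools import combinations

# Assumed edge list: the standard Krackhardt kite network (undirected)
KITE_EDGES = [
    ("Andre", "Beverly"), ("Andre", "Carol"), ("Andre", "Diane"),
    ("Andre", "Fernando"), ("Beverly", "Diane"), ("Beverly", "Ed"),
    ("Beverly", "Garth"), ("Carol", "Diane"), ("Carol", "Fernando"),
    ("Diane", "Ed"), ("Diane", "Fernando"), ("Diane", "Garth"),
    ("Ed", "Garth"), ("Fernando", "Garth"), ("Fernando", "Heather"),
    ("Garth", "Heather"), ("Heather", "Ike"), ("Ike", "Jane"),
]

def build_adjacency(edges):
    """Return a dict mapping each vertex to the set of its neighbours."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def bfs_counts(adjacency, source):
    """Return (distance, number of shortest paths) from source to each vertex."""
    distance = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                paths[neighbour] = 0
                queue.append(neighbour)
            # Every shortest path to current extends to a next-level neighbour
            if distance[neighbour] == distance[current] + 1:
                paths[neighbour] += paths[current]
    return distance, paths

def betweenness_scores(adjacency):
    """Sum, over unordered pairs (s, t), the share of shortest paths through v."""
    info = {v: bfs_counts(adjacency, v) for v in adjacency}
    scores = {v: 0.0 for v in adjacency}
    for s, t in combinations(sorted(adjacency), 2):
        dist_s, paths_s = info[s]
        dist_t, paths_t = info[t]
        for v in adjacency:
            if v in (s, t):
                continue
            # v lies on a shortest s-t path exactly when the distances add up
            if dist_s[v] + dist_t[v] == dist_s[t]:
                scores[v] += paths_s[v] * paths_t[v] / float(paths_s[t])
    return scores

betweenness_all_paths = betweenness_scores(build_adjacency(KITE_EDGES))
for name in sorted(betweenness_all_paths, key=betweenness_all_paths.get, reverse=True):
    print("%-9s %.2f" % (name, betweenness_all_paths[name]))
```

Under the assumed edge list, Heather scores highest: every shortest path from the other seven nodes to Ike or Jane must pass through her, which gives 7 × 2 = 14 pairs.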

## PageRank

In [42]:
pr = gl.pagerank.create(g_kite)
pr.get('pagerank').topk(column_name='pagerank')

PROGRESS: Counting out degree
PROGRESS: Done counting out degree
PROGRESS: +-----------+-----------------------+
PROGRESS: | Iteration | L1 change in pagerank |
PROGRESS: +-----------+-----------------------+
PROGRESS: | 1         | 3.32917               |
PROGRESS: | 2         | 3.19104               |
PROGRESS: | 3         | 2.71239               |
PROGRESS: | 4         | 2.04452               |
PROGRESS: | 5         | 1.2017                |
PROGRESS: | 6         | 0.715013              |
PROGRESS: | 7         | 0.25379               |
PROGRESS: | 8         | 0.0283844             |
PROGRESS: | 9         | 0                     |
PROGRESS: +-----------+-----------------------+


__id,pagerank,delta
Jane,0.91209238429,0.0
Ike,0.896579275635,0.0
Heather,0.87832855957,0.0
Garth,0.683254915365,0.0
Fernando,0.347204427083,0.0
Diane,0.310703125,0.0
Ed,0.289563802083,0.0
Beverly,0.181875,0.0
Carol,0.181875,0.0
Andre,0.15,0.0


# Part 2 - Crawling Social Data

In this part, I will use the Twitter / Facebook / LinkedIn APIs to download my portion of the social graph, and attempt to visualize it in Gephi or Neo4j.

## 2.1 - Facebook API

In [21]:
import os
import requests
import json

# Read the access token from the environment rather than hard-coding it;
# FACEBOOK_ACCESS_TOKEN is a placeholder variable name
ACCESS_TOKEN = os.environ['FACEBOOK_ACCESS_TOKEN']
base_url = 'https://graph.facebook.com/me'

# Get 10 likes for 10 friends
fields = 'id,name,friends.fields(likes.limit(10))'
url = '%s?fields=%s&access_token=%s' % (base_url, fields, ACCESS_TOKEN,)

# Interpret the response as JSON and convert back to Python data structures
content = requests.get(url).json()

# Pretty-print the JSON and display it
print json.dumps(content, indent=1)

{
 "friends": {
  "data": [], 
  "summary": {
   "total_count": 214
  }
 }, 
 "id": "684051972564", 
 "name": "James Quacinella"
}


What? No friends?! Researching this led me to https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/issues/191, which explains that the API only returns friends who have themselves granted permission to my OAuth app. Many of the API calls and the Graph Explorer have changed as well, so much of the book (Mining the Social Web, 2nd Edition) no longer applies.

Moving on to LinkedIn ...

## 2.2 - LinkedIn

Sadly, the [same issue](https://github.com/ozgur/python-linkedin/issues/78) has arisen with LinkedIn. LinkedIn no longer gives out access via OAuth 1.0 tokens and has substantially altered its API. Even the book's website has an [open GitHub issue](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/issues/274) about this.

## 2.3 - Twitter

As a substitute, I will try to download the list of accounts I follow on Twitter and see how far I can expand it. There is a big difference here with Twitter: relationships are directed, since following someone does not mean they follow you back. Another issue to look out for is API rate limits. Let's start querying Twitter:

In [3]:
# Let's create our API object with OAuth parameters
import os
import twitter

# Credentials are read from the environment rather than hard-coded;
# the variable names below are placeholders
api = twitter.Api(consumer_key=os.environ['TWITTER_CONSUMER_KEY'],
                consumer_secret=os.environ['TWITTER_CONSUMER_SECRET'],
                access_token_key=os.environ['TWITTER_ACCESS_TOKEN_KEY'],
                access_token_secret=os.environ['TWITTER_ACCESS_TOKEN_SECRET'])

user = api.GetUser(screen_name='mrquintopolous')
print json.dumps(user.AsDict(), indent=1)

{
 "status": {
  "lang": "en", 
  "favorited": false, 
  "truncated": false, 
  "text": "\"Syriza\u2019s Red Lines\" - http://t.co/wjKxwCI2JT", 
  "created_at": "Wed Jun 10 14:07:34 +0000 2015", 
  "retweeted": false, 
  "source": "<a href=\"http://bufferapp.com\" rel=\"nofollow\">Buffer</a>", 
  "urls": {
   "http://t.co/wjKxwCI2JT": "http://buff.ly/1JDO0wN"
  }, 
  "id": 608636625443278848
 }, 
 "lang": "en", 
 "profile_background_tile": false, 
 "statuses_count": 338, 
 "description": "Living with Analysis Paralysis", 
 "friends_count": 328, 
 "profile_link_color": "0084B4", 
 "created_at": "Thu Oct 02 17:27:48 +0000 2008", 
 "profile_sidebar_fill_color": "http://abs.twimg.com/images/themes/theme16/bg.gif", 
 "utc_offset": -14400, 
 "profile_image_url": "https://pbs.twimg.com/profile_images/434041506114985984/AdJ3cim3_normal.jpeg", 
 "name": "mrquintopolous", 
 "profile_text_color": "333333", 
 "followers_count": 45, 
 "protected": false, 
 "profile_background_color": "9AE4E8", 
 "f

This is my personal Twitter account. Let's get the list of accounts I follow (in effect, a search going one level deep):

In [5]:
import pickle
following = api.GetFriendIDs(screen_name='mrquintopolous')
pickle.dump(following, open("following1", "wb"))
following = pickle.load(open("following1", "rb"))

I wrote a script to crawl one level further, which can be found on [GitHub](https://github.com/jquacinella/IndependentStudy/blob/master/Week2/crawlFollowing.py). The script needs to take into account timeouts and API limits from Twitter. To help speed up the process, I ran the script in parallel using two different sets of API keys. Annoying but effective. Here I will merge their results and import the data into an igraph graph.
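
The general shape of a rate-limit-aware crawl can be sketched as follows. This is not the linked `crawlFollowing.py`, just a minimal illustration: `RateLimited`, `crawl_following`, and the `fetch_following` callback are hypothetical names, and the 15-minute wait is an assumption about the length of Twitter's rate-limit window.

```python
import time

class RateLimited(Exception):
    """Hypothetical signal that the API refused a call because of rate limits."""

def crawl_following(user_ids, fetch_following, wait_seconds=15 * 60,
                    max_retries=3, sleep=time.sleep):
    """Fetch the followed-account IDs for each user, pausing on rate limits.

    fetch_following is a caller-supplied function (hypothetical here) that
    returns a list of IDs or raises RateLimited.
    """
    results = {}
    for user_id in user_ids:
        for attempt in range(max_retries):
            try:
                results[user_id] = fetch_following(user_id)
                break
            except RateLimited:
                # Wait out the rate-limit window, then retry the same user
                sleep(wait_seconds)
        # Users that still fail after max_retries are skipped, not fatal
    return results

# Demo with a fake fetcher that is rate limited on its first call only
calls = {"count": 0}

def fake_fetch(user_id):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RateLimited()
    return [user_id * 10, user_id * 10 + 1]

waits = []  # record requested waits instead of actually sleeping
crawled = crawl_following([1, 2], fake_fetch, sleep=waits.append)
print(crawled)
print(waits)
```

Passing `sleep` as a parameter keeps the demo instant; a real crawl would leave it at the default `time.sleep`. Splitting `user_ids` into two halves and running one process per set of API keys would give the parallel setup described above.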

In [7]:
# Load the results from twitter
following_depth_part1 = pickle.load(open('following_depth2.part1', 'rb'))
following_depth_part2 = pickle.load(open('following_depth2.part2', 'rb'))
following_depth_part1.update(following_depth_part2)

In [11]:
# Construct an igraph from the crawl
from igraph import Graph, plot

me = '16562593'
following = [str(follow) for follow in following]

# Collect every account ID once: igraph's add_vertices does not de-duplicate
# names, so adding the same ID twice would create two separate vertices
names = set(following)
names.add(me)
edges = [(follower, me) for follower in following]

for follower, following_depth2 in following_depth_part1.iteritems():
    names.add(str(follower))
    for f in following_depth2:
        names.add(str(f))
        edges.append((str(f), str(follower)))

# Create the graph, then add all vertices and edges in one pass
twitter_graph = Graph()
twitter_graph.add_vertices(sorted(names))
twitter_graph.add_edges(edges)

In [None]:
#import pickle
#pickle.dump(twitter_graph, open("twitter_graph", "wb"))
#twitter_graph = pickle.load(open("twitter_graph", "rb"))
layout = twitter_graph.layout("large")
plot(twitter_graph, layout = layout)

This cell produces no output simply because the layout takes too long to compute for a graph of this size.

In [None]:
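twitter_edges_f = open('twitter_egdes', 'w')
twitter_vertices_f = open('twitter_vertices', 'w')

# Header rows, so read_csv yields the 'name', 'src' and 'dst' columns used below
twitter_vertices_f.write("name\n")
twitter_edges_f.write("src,dst\n")

# One record per line. Writing without newlines and without closing the files
# is the likely reason the output recorded below reports 0 parsed lines.
me = '16562593'
twitter_vertices_f.write("%s\n" % me)
for f in following:
    twitter_vertices_f.write("%s\n" % f)
    twitter_edges_f.write("%s,%s\n" % (f, me))

for follower, following_depth2 in following_depth_part1.iteritems():
    twitter_vertices_f.write("%s\n" % follower)

    for f in following_depth2:
        twitter_vertices_f.write("%s\n" % f)
        twitter_edges_f.write("%s,%s\n" % (f, follower))

# Close (and flush) both files before they are read back in
twitter_edges_f.close()
twitter_vertices_f.close()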

# Load Data
# Hide some silly output
import logging
logging.getLogger("requests").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)

# Import everything we need
import graphlab as gl

twitter_vertices = gl.SFrame.read_csv('twitter_vertices')
twitter_edges = gl.SFrame.read_csv('twitter_egdes')

# Create graph
g_twitter = gl.SGraph()
g_twitter = g_twitter.add_vertices(vertices=twitter_vertices, vid_field='name')
g_twitter = g_twitter.add_edges(edges=twitter_edges, src_field='src', dst_field='dst')
#g_twitter = g_twitter.add_edges(edges=kite_edges, src_field='dst', dst_field='src')

# Visualize graph?
gl.canvas.set_target('ipynb')
g_twitter.show(vlabel="id")

PROGRESS: Finished parsing file /home/james/Development/Masters/IndependentStudy/Week2/twitter_vertices
PROGRESS: Parsing completed. Parsed 0 lines in 0.070235 secs.
Insufficient number of rows to perform type inference

# Part 3 - Textual Analysis of Tweets from Political Think Tanks

In this part, I will download the tweet streams from different political 'think tanks' and perform a simple frequency analysis to see if there are any insights we can derive about the political leanings of these institutions. These are the tweet streams I will parse:

- https://twitter.com/fairmediawatch - Fairness and Accuracy In Reporting
- https://twitter.com/AccuracyInMedia - Accuracy In Media
- https://twitter.com/ips_dc - Institute for Policy Studies
- https://twitter.com/heritage - Heritage Foundation

## 3.1 - Prep Work

In [9]:
# Lets load up the Twitter API
import twitter
import prettytable

# Grab FAIR's tweet stream
#
# NOTE: do not include retweets, too many dupes (though for text analysis this might be a 
#      way to weigh more heavily text from tweets that are being retweeted by the account)
statuses = api.GetUserTimeline(screen_name='fairmediawatch', count=500, include_rts=False)

# Create a pretty table of tweet contents and any expanded urls
pt = prettytable.PrettyTable(["Tweet Status", "Expanded URLs"])
pt.align["Tweet Status"] = "l" # Left-align the tweet text
pt.align["Expanded URLs"] = "l" # Left-align the expanded URLs
pt.max_width = 60 
pt.padding_width = 1 # One space between column edges and contents (default)

# Add rows to the pretty table
for status in statuses:
    pt.add_row([status.text, "".join([url.expanded_url for url in status.urls]) ])

# Lets see the results!
print pt

+--------------------------------------------------------------+----------------------------------------------+
| Tweet Status                                                 | Expanded URLs                                |
+--------------------------------------------------------------+----------------------------------------------+
| That most US terrorists aren't Muslim "may come as a         | http://bit.ly/1J7XVYL                        |
| surprise"--especially if you rely on corporate media.        |                                              |
| http://t.co/J5bn1tQzRY                                       |                                              |
| Baltimore "gang threat" swallowed by media was found to be   | http://bit.ly/1RxKvr6                        |
| "non-credible" by FBI. @Vice @AdamJohnsonNYC                 |                                              |
| http://t.co/4kZSXwnRka                                       |                                        

## 3.2 - Getting Recent Tweets from All Accounts

Let's get the tweets for all the accounts, and store them in a dictionary:

In [10]:
# List of accounts to process, and our results dict
accounts = ['fairmediawatch', 'AccuracyInMedia', 'ips_dc', 'heritage']
allStatuses = { }

# For each account, query Twitter for recent tweets
for account in accounts:
    allStatuses[account] = api.GetUserTimeline(screen_name=account, count=500, include_rts=False)

# Save results
import pickle
pickle.dump( allStatuses, open( "allStatuses", "wb" ) )

Using NLTK, we will process the tweets and split them into words (while filtering out stopwords) and store all hashtag mentions.
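
Splitting on whitespace alone leaves punctuation, URLs and HTML entities such as `&amp;` in the token stream. As a hedged alternative, here is a minimal sketch of a regex-based tokenizer. It is self-contained, so it uses a tiny hand-written stopword set as a stand-in for NLTK's English list, and the two sample tweets are invented for illustration. The names `tokenize` and `split_tokens` are illustrative as well.

```python
import re
from collections import Counter

try:
    from html import unescape          # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape

# Tiny stand-in for NLTK's English stopword list (assumption, not the real list)
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "by", "for", "on"}

# Hashtags, mentions, or plain words (allowing internal apostrophes)
TOKEN_RE = re.compile(r"[#@]?\w+(?:'\w+)*")
URL_RE = re.compile(r"https?://\S+")

def tokenize(text):
    """Decode HTML entities, drop URLs, lowercase, and remove stopwords."""
    text = URL_RE.sub(" ", unescape(text))
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    return [t for t in tokens if t not in STOPWORDS]

def split_tokens(tokens):
    """Separate plain words from hashtags; mentions are dropped from both."""
    hashtags = [t[1:] for t in tokens if t.startswith("#")]
    words = [t for t in tokens if not t.startswith(("#", "@"))]
    return words, hashtags

# Invented sample tweets, for illustration only
tweets = [
    "Workers &amp; unions say #StopFastTrack - the #TPP is a bad deal http://t.co/abc",
    "Labor leaders on the #TPP: corporations win, workers lose",
]

word_counts, hashtag_counts = Counter(), Counter()
for tweet in tweets:
    words, hashtags = split_tokens(tokenize(tweet))
    word_counts.update(words)
    hashtag_counts.update(hashtags)

print(word_counts.most_common(3))
print(hashtag_counts.most_common(3))
```

One trade-off of this pattern is that abbreviations with internal periods, such as "u.s.", are split into separate tokens, so a real pipeline might special-case them.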

In [136]:
from collections import Counter
from nltk.corpus import stopwords

words = { }
hashtags = { }
counters = { "words": { }, "hashtags": { } }

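# Build the stopword set once; calling stopwords.words() for every word is slow
english_stopwords = set(stopwords.words('english'))

for account in accounts:
    words[account] = [ w.lower() for t in allStatuses[account] for w in t.text.split() if w.lower() not in english_stopwords ]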
    counters["words"][account] = Counter(words[account])
    
    hashtags[account] = [ hashtag.text.lower() for status in allStatuses[account] for hashtag in status.hashtags ]
    counters["hashtags"][account] = Counter(hashtags[account])

Let's print out the most prominent words from each account:

In [146]:
for account in counters["words"]:
    pt = prettytable.PrettyTable(field_names=['Word', 'Count'])
    for kv in counters["words"][account].most_common(20):
        pt.add_row(kv)
    pt.align['Word'], pt.align['Count'] = 'l', 'r' # Set column alignment
    print account
    print pt
    print

ips_dc
+----------------+-------+
| Word           | Count |
+----------------+-------+
| u.s.           |    21 |
| via            |    12 |
| -              |    12 |
| #andstillirise |    12 |
| #tpp           |    12 |
| women          |     9 |
| black          |     9 |
| &amp;          |     9 |
| labor          |     7 |
| world          |     7 |
| corporations   |     6 |
| ↑              |     5 |
| leaders        |     5 |
| movement       |     5 |
| could          |     5 |
| americans      |     5 |
| #stopfasttrack |     5 |
| take           |     5 |
| justice        |     5 |
| w/             |     5 |
+----------------+-------+

fairmediawatch
+------------------------+-------+
| Word                   | Count |
+------------------------+-------+
| media                  |     9 |
| corporate              |     5 |
| &gt;@nytimes           |     4 |
| @jnaureckas            |     4 |
| @adamjohnsonnyc        |     4 |
| still                  |     3 |
| @deanbaker13

Now let's print out the most prominent hashtags from all the accounts:

In [147]:
for account in counters["hashtags"]:
    pt = prettytable.PrettyTable(field_names=['Hashtag', 'Count'])
    for kv in counters["hashtags"][account].most_common(20):
        pt.add_row(kv)
    pt.align['Hashtag'], pt.align['Count'] = 'l', 'r' # Set column alignment
    print account
    print pt
    print

ips_dc
+-------------------------+-------+
| Hashtag                 | Count |
+-------------------------+-------+
| tpp                     |    19 |
| andstillirise           |    12 |
| stopfasttrack           |     5 |
| fasttrack               |     4 |
| studentdebt             |     4 |
| fightfor15              |     3 |
| popefrancis             |     3 |
| notpp                   |     3 |
| ttip                    |     2 |
| 1u                      |     2 |
| blackworkersmatter      |     2 |
| isis                    |     2 |
| blackworkingwomenmatter |     1 |
| charleston              |     1 |
| wallstreet              |     1 |
| climatechange           |     1 |
| estatetax               |     1 |
| g7summit                |     1 |
| congress                |     1 |
| domesticworkersday      |     1 |
+-------------------------+-------+

fairmediawatch
+---------------+-------+
| Hashtag       | Count |
+---------------+-------+
| berniesanders |     2 |
| tpp    

Looking at the results, things make sense. The two left-leaning accounts use hashtags like "tpp, stopfasttrack, fasttrack, studentdebt, fightfor15, notpp, blackworkersmatter, blackworkingwomenmatter, berniesanders", which are all left-leaning causes. FAIR used words like 'corporate' and 'media', which makes sense since it is a media watchdog.

The two right-leaning accounts mention typical phrases and hashtags from that side, including the most prominent conservative hashtag, #tcot, but also "scotus, kingvburwell, benghazi, prolife, obamacare, exim, forfeiturereform, ndaa, religiousliberty, obama", all prominent conservative issues.

## Future Research

- Stemming
- Extend to bigrams and trigrams
- Link words to topics or linked open data
- Sentiment analysis
- Topic Modeling
- Crawl articles linked to by tweets and use boilerpipe to extend analysis
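
As a starting point for the bigram and trigram item, here is a minimal sketch of n-gram counting over already-tokenized tweets. The function names and the sample token lists are illustrative assumptions, not part of the analysis above. N-grams are built per tweet so that they never span two different tweets.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-token tuples that occur consecutively in tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(token_lists, n):
    """Count n-grams across many token lists, one list per tweet."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))
    return counts

# Invented sample data: two tokenized tweets
tokenized_tweets = [
    ["corporate", "media", "bias"],
    ["corporate", "media", "coverage"],
]

bigram_counts = count_ngrams(tokenized_tweets, 2)
trigram_counts = count_ngrams(tokenized_tweets, 3)
print(bigram_counts.most_common(2))
print(trigram_counts.most_common(2))
```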