# Making networks with Twitter data

This notebook will walk you through how to create and analyze networks using Twitter data.

## Data preprocessing: getting data into NetworkX

To make a network in NetworkX using external data, the connections between nodes must be represented as a list of tuples, where each tuple holds the pair of nodes that one connection joins. In this first section, we'll walk through some data preprocessing techniques together to get our data ready for analysis.
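As a minimal sketch of that format, the toy example below builds a graph from a list of tuples. The names `toy_pairs` and `toy_graph` and the handles in them are made up for illustration and are not part of the Twitter data.

```python
import networkx as nx

# Hypothetical toy edge list: each tuple is one connection between two nodes.
# The first pair is repeated on purpose.
toy_pairs = [("alice", "bob"), ("bob", "carol"), ("alice", "bob")]

toy_graph = nx.Graph()
toy_graph.add_edges_from(toy_pairs)

# An undirected Graph stores each connection once, so the repeated
# pair does not create a second edge.
node_count = toy_graph.number_of_nodes()
edge_count = toy_graph.number_of_edges()
print(node_count, edge_count)
```

Nodes are created automatically from the handles that appear in the tuples, so there is no need to add them separately.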

Let's take a look at the data we're working with.

In [None]:
import json

# Use a context manager so the file is closed once the data is loaded.
with open('../materials/data/friends/list.PyTennessee.json') as f:
    data = json.load(f)
pairs = []

for user in data['users']:
    pairs.append(('PyTennessee', str(user['screen_name'])))

pairs[:10]

Running the cell below adds the friend and follower pairs from all of the remaining files to `pairs`. You should end up with 1286 pairs.

In [None]:
# Because the relationship data is split across files, we need to
# walk through all of them to get the data.
import os

relationship_dir = '../materials/data/friend_relationships/'

# Take only the files at the top level of the directory.
_, _, files = next(os.walk(relationship_dir))

for file_name in files:
    with open(os.path.join(relationship_dir, file_name)) as p:
        pair_data = json.load(p)
    # Each key is split on whitespace into two handles.
    for k, info in pair_data.items():
        twitter_pair = k.split()
        source = info['relationship']['source']
        # Order each pair as (follower, followed).
        if source['following'] is True:
            pairs.append((str(twitter_pair[0]), str(twitter_pair[1])))
        elif source['followed_by'] is True:
            pairs.append((str(twitter_pair[1]), str(twitter_pair[0])))
                
len(pairs)

## Make networks with Twitter data

Using the NetworkX methods we've learned before, let's do some network analysis on PyTennessee's Twitter friends. We're going to look at the Twitter handles that PyTennessee follows, as well as the relationships between those handles.

### Undirected graph

In [None]:
%matplotlib inline
import networkx as nx

# Build an undirected graph.

In [None]:
# Just from looking at it, is this network connected or unconnected?


In [None]:
# Hint: if you want to sort a dictionary to easily 
# find the highest and lowest values, use this function 
# on the output of the centrality measures like degree_centrality():

import operator

def centrality_sort(centrality_dict):
    return sorted(centrality_dict.items(), key=operator.itemgetter(1))

# ex. degree_sorted = centrality_sort(degree_vals)
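
The sorting hint above can be tried out on a small graph first. This is only a sketch: `toy_graph` and its node names are hypothetical, and the sort is written inline so that the example runs on its own in Python 3.

```python
import operator
import networkx as nx

# Hypothetical toy graph: a "hub" node connected to three other nodes.
toy_graph = nx.Graph([("hub", "a"), ("hub", "b"), ("hub", "c")])

# degree_centrality() returns a dict mapping each node to its score.
degree_vals = nx.degree_centrality(toy_graph)

# Sort the (node, score) pairs by score, lowest first.
degree_sorted = sorted(degree_vals.items(), key=operator.itemgetter(1))

lowest = degree_sorted[0]
highest = degree_sorted[-1]
print(lowest, highest)
```

Because the list is sorted in ascending order, the lowest scores sit at the front and the highest at the end, so slices such as `degree_sorted[:5]` and `degree_sorted[-5:]` give both extremes.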

In [None]:
# Which nodes have the highest/lowest degree centrality?


In [None]:
# Which nodes have the highest/lowest betweenness centrality?


In [None]:
# Which nodes have the highest/lowest closeness centrality?


In [None]:
# Let's look at subsections of the graph. We'll do this together.


### Directed graph

Let's add some direction to the graph. When we processed our data, we ordered the pairs so that the first handle in the pair is a follower of the second handle. We're not worrying about pairs that mutually follow each other right now.
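
To illustrate how the ordering of each pair matters, here is a small sketch with made-up handles, assuming the same (follower, followed) ordering described above. The names `toy_pairs` and `toy_digraph` are hypothetical.

```python
import networkx as nx

# Hypothetical pairs, ordered as (follower, followed).
toy_pairs = [("a", "b"), ("c", "b"), ("b", "a")]

# In a DiGraph, each tuple becomes an edge pointing from the first
# node to the second.
toy_digraph = nx.DiGraph(toy_pairs)

# Both measures are normalized by the number of other nodes in the graph.
in_vals = nx.in_degree_centrality(toy_digraph)
out_vals = nx.out_degree_centrality(toy_digraph)
print(in_vals)
print(out_vals)
```

In this toy graph, `b` is followed by both other nodes, while `c` has no followers at all.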

In [None]:
# Build a directed graph.


In [None]:
# Run some degree centrality measures for directed graphs:
# in_degree_centrality(): fraction of other nodes with a connection into this node (people following you)
# out_degree_centrality(): fraction of other nodes this node connects out to (people you follow)


In [None]:
# Let's look at subsections of the graph, just like we did above.

# Top 20 highest in-degree centrality scores:


In [None]:
# Top 20 highest out-degree centrality scores:


### Network models

Does our network match any of the network models we discussed earlier?
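
One possible way to approach this question is to compare a measured property of the graph against a generated model of the same size. The sketch below assumes a random graph is one of the models covered earlier, and it uses NetworkX's built-in `karate_club_graph()` as a stand-in for the Twitter graph; both choices are assumptions made for illustration.

```python
import networkx as nx

# Stand-in for the Twitter graph, used here only for illustration.
observed = nx.karate_club_graph()

n = observed.number_of_nodes()
m = observed.number_of_edges()

# Random graph with the same number of nodes and edges.
# The seed is fixed so the result is repeatable.
random_graph = nx.gnm_random_graph(n, m, seed=42)

# Average clustering is one property that can be compared across the two.
observed_clustering = nx.average_clustering(observed)
random_clustering = nx.average_clustering(random_graph)
print(observed_clustering, random_clustering)
```

A large gap between the two values would suggest that a purely random model is a poor match for the observed network.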

In [None]:
# Analyze the models here.


# Back to [tutorial.ipynb](http://localhost:8888/notebooks/notebooks/tutorial.ipynb#visual) for Visualizations!