# Visualizing Stack Overflow Data in Python

In this notebook, we visualize posts on Stack Overflow from September 2017. The data was compiled from searches on the [Stack Exchange Data Explorer](https://data.stackexchange.com/stackoverflow/query/new). Location information was added using the [Google Maps API](https://developers.google.com/maps/).

<div id="contents"></div>
## Table of Contents
1. [Load the Data](#load)
1. [Visualize Completeness](#completeness)
1. [Visualize Time](#time)
1. [Visualize Tags](#tags)
1. [Explore Text](#explore)
1. [Plot Place](#place)
1. [Plot Connections](#network)
1. [Conclusion](#conclusion)

Make sure to go through the first two sections (data and completeness) first. The other sections can be done out of order.

## Load Libraries
This cell imports all the libraries needed for this notebook.

In [None]:
# General purpose libraries
# A nice library for reading in csv data
import pandas as pd
# A library which most visualization libraries in Python are built on.
# We will start by using it to make some plots with pandas
%matplotlib inline
import matplotlib.pyplot as plt
# A library for doing math
import numpy as np
# A library for turning unicode fields into ASCII fields
import unicodedata
# a regex library
import re
# a class which makes counting the number of times something occurs in a list easier
from collections import Counter

# some functions for displaying html in a notebook
from IPython.display import display, HTML

# A library to visualize holes in a dataset
import missingno as msno
# A library to make wordclouds
import wordcloud

# Libraries for Word Trees
# lets us use graphviz in python
from pydotplus import graphviz
# to display the final Image
from IPython.display import Image

# Libraries for interactive charts
from bokeh.io import output_notebook
# display interactive charts inline
output_notebook()
from bokeh.palettes import Viridis6 as palette
from bokeh.plotting import figure, show
from bokeh.models import HoverTool, ColorBar, LinearColorMapper, FixedTicker, ColumnDataSource, LogColorMapper
# to make patches into glyphs and treat counties and states differently
from bokeh.models.glyphs import Patches

# shape files for US counties
from bokeh.sampledata.us_counties import data as counties
# shape files for US states
from bokeh.sampledata import us_states as us_states_data

<div id="load"></div>
## Load the Data
*[Table of Contents](#contents)*

In [None]:
# load the data
posts = pd.read_csv('SeptemberPosts.csv')
posts.head()

## Columns
What do all these columns mean?
- PostId = the id in Stack Overflow's database of this post
- Score = the score given to the post by people voting up and down on it
- PostType = What type of post is this?
- CreationDate = When was this post posted?
- Title = The text in the title of the post
- UserId = The id of the user who posted in the Stack Overflow database
- Reputation = The reputation of the user who posted
- Location = The location the user put down as their home on their profile
- Tags = Tags which are associated with this post
- QuestionId = The question this post is linked to

Let us convert CreationDate to a datetime type.

In [None]:
posts['CreationDate'] = pd.to_datetime(posts['CreationDate'])

<div id="completeness"></div>
## Visualizing Completeness
*[Table of Contents](#contents)*

I'd like to know how complete our data is, so let's look at which fields have null values for the answers and questions using [missingno](https://github.com/ResidentMario/missingno).

In [None]:
msno.matrix(posts)

Is there any correlation between variables as to when they are blank?

Missingno has a lot of ways of visualizing data completeness. One of them is to see the correlation in missing-ness between fields.

In [None]:
msno.heatmap(posts)
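
To get a feel for what the heatmap summarizes, we can compute the correlation of missing-ness by hand with pandas. This is a minimal sketch on a made-up toy frame. I am assuming the heatmap is close to a plain correlation of the null masks, so treat it as an approximation rather than missingno's exact method.

```python
import pandas as pd

# Toy frame (hypothetical values): questions carry Title and Tags,
# answers carry QuestionId
toy = pd.DataFrame({
    'Title': ['How to merge?', None, 'Why error?', None],
    'Tags': ['<pandas>', None, '<python>', None],
    'QuestionId': [None, 1.0, None, 3.0],
})

# 1 where a value is present, 0 where it is missing
present = toy.notnull().astype(int)

# correlation of presence between every pair of columns
nullity_corr = present.corr()
print(nullity_corr)
```

A value near 1 means two fields tend to be filled in together, and a value near -1 means one is filled in when the other is blank.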

We can see that posts with a title or tags never have a question id. I wonder if these groups make up different types of posts. Let's investigate which post types we have.

In [None]:
post_type_counts = posts['PostType'].value_counts()
post_type_counts.plot(kind='bar', color='DarkBlue')
plt.show()

Is there a way to make this chart interactive?

Let's use [Bokeh](https://bokeh.pydata.org/en/latest/)

In [None]:
TOOLS = "pan,wheel_zoom,reset,hover,save"

p = figure(
    title="Post Types",
    tools=TOOLS,
    x_range=post_type_counts.index.tolist(),
    plot_height=400
)
p.vbar(x=post_type_counts.index.values, top=post_type_counts.values, width=0.9)

hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [("Number of Posts", "@top")]

show(p)

Most posts are either questions or answers. These types of posts serve very different purposes, so let's separate them out and see how complete each is.

This time we'll try a different method of visualizing missing data in which we count up how often each attribute is not missing.

In [None]:
questions = posts[posts['PostType'] == 'Question']
answers = posts[posts['PostType'] == 'Answer']

print("Questions")
msno.bar(questions)
print("Answers")
msno.bar(answers)
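
If you only want the numbers behind these bars, pandas can count the non-missing values per post type directly. A small sketch on toy data (the values are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'PostType': ['Question', 'Answer', 'Question', 'Answer'],
    'Title': ['A', None, 'B', None],
    'UserId': [1.0, 2.0, None, 4.0],
})

# groupby(...).count() counts the non-null entries of every column per group
completeness = toy.groupby('PostType').count()
print(completeness)
```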

Usually, when you find out that you have missing data, you want to know why and what is going on with those points. Fortunately, Pandas makes this very easy.

In [None]:
posts[posts['UserId'].isnull()].head()

If we want to investigate these posts further, we can see posts and answers with null `UserId`s using Jupyter's HTML capabilities.

In [None]:
def get_link(p):
    if p['PostType'] == 'Answer':
        link = '"https://stackoverflow.com/questions/{0}#answer-{1}"'.format(int(p['QuestionId']), int(p['PostId']))
        return '<a href='+link+' target="_blank">Answer without user</a>'
    else:
        link = '"https://stackoverflow.com/questions/{0}"'.format(int(p['PostId']))
        return '<a href='+link+' target="_blank">Question without user</a>'

display(HTML('<br/>'.join(posts[posts['UserId'].isnull()].head().apply(lambda p: get_link(p), axis=1))))

<div id="time"></div>
## Visualizing Time
*[Table of Contents](#contents)*

1. What time of day do people post?
1. [How long did it take to get an answer in September?](#firstReply)

In [None]:
# use 25 bin edges so that each hour from 0 to 23 gets its own bin
posts['CreationDate'].apply(lambda x: x.hour).hist(bins=range(25))
plt.show()
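
The same counts can be read off as a table instead of a plot. Here is a minimal sketch using the `.dt` accessor on a few made-up timestamps; `reindex` fills in hours with no posts so that all 24 hours show up.

```python
import pandas as pd

# hypothetical timestamps standing in for posts['CreationDate']
stamps = pd.Series(pd.to_datetime([
    '2017-09-01 00:15:00', '2017-09-01 13:05:00',
    '2017-09-02 13:45:00', '2017-09-03 23:59:00',
]))

# count posts per hour of day, keeping hours with zero posts
hour_counts = stamps.dt.hour.value_counts().reindex(range(24), fill_value=0)
print(hour_counts)
```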

<div id="firstReply"></div>
When was the first reply for each question?

In [None]:
# get the earliest answer time for each question
answers_by_question = answers.groupby('QuestionId')['CreationDate'].min()
# store it in a data frame keyed by the question's PostId
first_reply = pd.DataFrame({'PostId':answers_by_question.index.values, 'EarliestReply':answers_by_question.values})
# add the time of the earliest answer to the questions data frame (filtering out questions which were not answered)
first_reply = pd.merge(first_reply, questions, how='inner', on=['PostId'])

In [None]:
# get the time it took to get an answer
gap = (first_reply['EarliestReply']-first_reply['CreationDate'])
# convert to minutes
gap /= pd.Timedelta(minutes=1)

# find the median
print('Median answer time for questions asked and answered in September 2017 is {0} min.'.format(gap.median()))
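
To check the logic, it helps to run the same steps on a tiny example where the answer is obvious. The frames below are made up: question 1 gets its first answer after 10 minutes, question 2 after 60 minutes, and question 3 is never answered.

```python
import pandas as pd

questions_toy = pd.DataFrame({
    'PostId': [1, 2, 3],
    'CreationDate': pd.to_datetime(
        ['2017-09-01 10:00', '2017-09-01 11:00', '2017-09-01 12:00']),
})
answers_toy = pd.DataFrame({
    'QuestionId': [1, 1, 2],
    'CreationDate': pd.to_datetime(
        ['2017-09-01 10:30', '2017-09-01 10:10', '2017-09-01 12:00']),
})

# earliest answer per question, renamed so it can be merged on PostId
earliest = (answers_toy.groupby('QuestionId')['CreationDate'].min()
            .rename('EarliestReply').reset_index()
            .rename(columns={'QuestionId': 'PostId'}))

# the inner join drops question 3, which has no answers
merged = questions_toy.merge(earliest, on='PostId', how='inner')

# dividing by a one-minute Timedelta converts the gap to minutes
gap_minutes = (merged['EarliestReply'] - merged['CreationDate']) / pd.Timedelta(minutes=1)
print(gap_minutes.tolist(), gap_minutes.median())
```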

In [None]:
# plot
plt.hist(gap.tolist(), bins = 50)
#plt.yscale('log')
plt.ylabel('Number of Questions')
plt.xlabel('Time in Minutes')
plt.show()

<div id="tags"></div>
## Visualize Tags
*[Table of Contents](#contents)*

What do we ask about when we ask about data and errors in programming?

In [None]:
def text_to_wordcloud(series, title):
    # stitch all the text together
    text = ' '.join(series.tolist())
    # make a wordcloud from the text
    title_wordcloud = wordcloud.WordCloud().generate(text)
    # we want the words in our cloud to all be the same color
    title_wordcloud.recolor(color_func=lambda word, **kwargs:'white')
    # turn the wordcloud into an image
    plt.imshow(title_wordcloud, interpolation='bilinear')
    # we don't want an x and y axis
    plt.axis("off")
    plt.title(title + ' (' + str(len(series)) + ' questions)')
    plt.show()

def get_tags(words):
    # raw string so that \w reaches the regex engine unchanged
    words = [r'[^\w]{}[^\w]'.format(w) for w in words]
    tags = questions[questions['Title'].str.lower().str.contains('|'.join(words))]['Tags']
    tags = tags.apply(lambda x: x.strip('<>').split('><'))
    return tags

data_tags = get_tags(['data', 'scrape', 'clean', 'open', 'load'])
text_to_wordcloud(data_tags.apply(lambda x: ' '.join(x)), 'Data Tags')
error_tags = get_tags(['error', 'wrong', 'throws', 'throw', 'why'])
text_to_wordcloud(error_tags.apply(lambda x: ' '.join(x)), 'Error Tags')
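
The `Tags` column stores all of a question's tags in one string such as `<python><pandas>`. The sketch below shows the splitting step from `get_tags` on its own, using a few made-up tag strings and a `Counter`. `split_tags` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
from collections import Counter

def split_tags(tag_string):
    # strip the outer angle brackets, then split on the '><' between tags
    return tag_string.strip('<>').split('><')

tag_strings = ['<python><pandas>', '<python><csv>', '<r>']

# flatten the per-question tag lists and count each tag
tag_counts = Counter(t for s in tag_strings for t in split_tags(s))
print(tag_counts.most_common())
```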

This is really just turning words into numbers.

In [None]:
def get_tag_count(series):
    return pd.Series([t for tag_list in series for t in tag_list]).value_counts()

tag_counts = pd.concat([get_tag_count(data_tags), get_tag_count(error_tags)], axis=1)
tag_counts = tag_counts.rename(columns={0:'data', 1:'error'})
tag_counts = tag_counts[tag_counts['data'].notnull() & tag_counts['error'].notnull()]

x = tag_counts['data'].values
y = tag_counts['error'].values
TOOLS = "pan,wheel_zoom,reset,hover,save"

source = ColumnDataSource(data=dict(
    x=x,
    y=y,
    data_perc=[round(val/len(data_tags)*100,2) for val in x],
    error_perc=[round(val/len(error_tags)*100,2) for val in y],
    name=tag_counts.index.tolist()
))

# uncomment the lines to switch this to a log scale to see the number for less popular languages
p = figure(
    title="Tags",
    tools=TOOLS,
#     x_axis_type="log",
#     y_axis_type="log",
    plot_height=400
)
p.circle('x', 'y', size=10, source=source)
p.xaxis.axis_label = "Data Posts"
p.yaxis.axis_label = "Error Posts"

hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [("Tag", "@name"), ("% of data posts:", "@data_perc"), ("% of error posts:", "@error_perc")]

show(p)

<div id="explore"></div>
## Text Visualization
*[Table of Contents](#contents)*

What do users post about when they post a question tagged 'python'? The Body column has a lot of HTML tags in its values, which I don't feel like dealing with right now, so let's look at what question askers put in the titles of their posts.

In [None]:
python_titles = questions[questions['Tags'].str.contains('python')]['Title'].str.lower()
text_to_wordcloud(python_titles, 'Python Titles')

### Other Methods of Looking at Word Frequencies

One problem with word clouds is that they divorce the words in them from any context. This isn't much of a problem when the words are tags, which aren't part of a sentence. But when looking at titles it would be nice to see what users are saying when they use the words 'python', 'using' and 'file'.

To really understand all 19,462 questions tagged 'python', we'd have to break out some machine learning methods, but there are ways to visualize context better than a word cloud. One of them is called a [Word Tree](http://hint.fm/papers/wordtree_final2.pdf) and was developed by the Many Eyes group at IBM. You can see [examples](https://www.jasondavies.com/wordtree/) of this style of visualization made in d3.js.

Since Jupyter notebooks can embed HTML elements, word trees rendered in d3 can be embedded in a notebook. However, for this tutorial we are sticking to Python, so I will demonstrate how to build a word tree using a library which runs [graphviz](http://graphviz.org) in Python.

In [None]:
# a variable to help us mark nodes as distinct when they have the same label
node_counter = 0

# a class to keep track of a node and its connections
class Node:
    def __init__(self, word, count, matching_strings, graph, reverse=False, branching=3, highlight=False):
        global node_counter
        if highlight:
            self.node = graphviz.Node(node_counter, label=word+'\n'+str(count), peripheries=2, fontsize=20)
        else:
            self.node = graphviz.Node(node_counter, label=word+'\n'+str(count))
        node_counter += 1
        graph.add_node(self.node)
        if count > 1:
            self.generate_children(matching_strings, graph, reverse, branching)
    
    def generate_children(self, matching_strings, graph, reverse, branching):
        if len(matching_strings) == 0:
            return
        matching_strings = matching_strings[matching_strings.apply(len) > 0]
        all_children = Counter(matching_strings.apply(lambda x:x[-1 if reverse else 0]))
        children = all_children.most_common(branching)
        for word, count in children:
            if not reverse:
                child_matches = matching_strings[matching_strings.apply(lambda x:x[0]) == word].apply(lambda x:x[1:])
                c_node = Node(word, count, child_matches, graph=graph, reverse=reverse, branching=branching)
                graph.add_edge(graphviz.Edge(self.node, c_node.node))
            else:
                child_matches = matching_strings[matching_strings.apply(lambda x:x[-1]) == word].apply(lambda x:x[:-1])
                c_node = Node(word, count, child_matches, graph=graph, reverse=reverse, branching=branching)
                graph.add_edge(graphviz.Edge(c_node.node, self.node))
        left_over = sum(all_children.values()) - sum([x[1] for x in children])
        if left_over > 0:
            c_node = Node('...', left_over, [], graph=graph, reverse=reverse, branching=branching)
            if reverse:
                graph.add_edge(graphviz.Edge(c_node.node, self.node))
            else:
                graph.add_edge(graphviz.Edge(self.node, c_node.node))

def build_tree(root_string, suffixes, prefixes):
    graph = graphviz.Dot()
    root = Node(root_string, len(suffixes), suffixes, graph, reverse=False, highlight=True)
    root.generate_children(prefixes, graph, True, 3)
    return Image(graph.create_png())

def get_end(string, sub_string, reverse):
    side = 0 if reverse else -1
    return [x for x in re.split(r'[^\w]+', string.lower().split(sub_string)[side]) if len(x) > 0]

def select_text(phrase):
    series = questions['Title']
    instances = series[series.str.lower().str.contains(phrase)]
    suffixes = instances.apply(lambda x: get_end(x, phrase, False))
    prefixes = instances.apply(lambda x: get_end(x, phrase, True))
    return build_tree(phrase, suffixes, prefixes)

select_text('file using python')

We can use our word tree function to explore how any phrase is used in question titles.

In [None]:
select_text('tableau')
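
Under the drawing code, each level of a word tree is just a count of which word comes next. The sketch below shows that counting step without graphviz, on a few made-up titles. `next_words` is a hypothetical helper, and like `get_end` it looks at the text after the last occurrence of the phrase.

```python
import re
from collections import Counter

def next_words(titles, phrase):
    """Count the first word that follows `phrase` in each matching title."""
    counts = Counter()
    for title in titles:
        lowered = title.lower()
        if phrase not in lowered:
            continue
        # text after the last occurrence of the phrase, split into words
        tail = lowered.split(phrase)[-1]
        words = [w for w in re.split(r'[^\w]+', tail) if w]
        if words:
            counts[words[0]] += 1
    return counts

titles = [
    'Read a file using Python 3',
    'Parse CSV file using Python pandas',
    'Upload file using Python requests',
    'Delete a file using Python 3',
]
followers = next_words(titles, 'file using python')
print(followers.most_common(3))
```

The `Node` class repeats this count recursively on the remaining words and keeps only the `branching` most common children at each level.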

<div id="place"></div>
## Plot Places
*[Table of Contents](#contents)*

Where do people say they are from?

In this example we'll look at how to make maps in python and why making maps is so hard.

Let's start by adding information on each location from the [Google Maps API](https://developers.google.com/maps/).

In [None]:
# ast.literal_eval parses the stored Python literals without executing arbitrary code
import ast

location_data = pd.read_csv('loc_data.csv')
location_data['Google_Data'] = location_data['Google_Data'].apply(ast.literal_eval)

In [None]:
location_data['Google_Data'].apply(len).hist()
plt.xlabel('Number of Matches')
plt.ylabel('Number of Unique Place Strings')
plt.show()

location_data = location_data[location_data['Google_Data'].apply(len) == 1]
posts_with_location = pd.merge(posts, location_data, how='inner', on='Location')

How precise is our location information?

In [None]:
def get_lowest_component(address_list):
    components = address_list[0]['address_components']
    all_parts = []
    for c in components:
        if len(c['types']) == 2 and c['types'][1] == 'political':
            all_parts.append(c['types'][0])
    if len(all_parts) > 0:
        return all_parts[0]
    return None

posts_with_location = pd.merge(posts, location_data, how='inner', on='Location')
lowest_components = posts_with_location['Google_Data'].apply(get_lowest_component).value_counts()
lowest_components.plot(kind='bar', color='DarkBlue')
plt.show()

In [None]:
def get_part(address_list, level='country', name_type='long_name'):
    components = address_list[0]['address_components']
    for c in components:
        if c['types'] == [level, 'political']:
            return c[name_type]
    return None

for part in lowest_components.index:
    print(part)
    posts_with_location[part] = posts_with_location['Google_Data'].apply(lambda x: get_part(x, level=part))
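
The functions above assume each `Google_Data` entry is a list of geocoding results, each with an `address_components` list and a `geometry` block. Here is a hand-written sample in that shape (the values are illustrative, not taken from the dataset) together with a lookup in the style of `get_part`. `find_component` is a hypothetical name.

```python
# one geocoding result in the assumed shape
sample = [{
    'address_components': [
        {'long_name': 'Seattle', 'short_name': 'Seattle',
         'types': ['locality', 'political']},
        {'long_name': 'King County', 'short_name': 'King County',
         'types': ['administrative_area_level_2', 'political']},
        {'long_name': 'Washington', 'short_name': 'WA',
         'types': ['administrative_area_level_1', 'political']},
        {'long_name': 'United States', 'short_name': 'US',
         'types': ['country', 'political']},
    ],
    'geometry': {'location': {'lat': 47.6062, 'lng': -122.3321}},
}]

def find_component(address_list, level, name_type='long_name'):
    # return the name of the first component whose types are [level, 'political']
    for component in address_list[0]['address_components']:
        if component['types'] == [level, 'political']:
            return component[name_type]
    return None

print(find_component(sample, 'country'))
print(find_component(sample, 'administrative_area_level_1', 'short_name'))
```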

One way to plot location is to just make a convex hull around each set of latitude and longitude points.

In [None]:
def get_lat_long(address_list):
    location = address_list[0]['geometry']['location']
    return location['lat'], location['lng']

posts_with_location['lat_lng'] = posts_with_location['Google_Data'].apply(get_lat_long)
posts_with_location['lat'] = posts_with_location['lat_lng'].apply(lambda x: x[0])
posts_with_location['lng'] = posts_with_location['lat_lng'].apply(lambda x: x[1])
msno.geoplot(posts_with_location, x='lng', y='lat', by='country')

Let's make a better map.

In order to make a better map we need shape files which contain information on the boundaries of each area we are interested in. We also need to link our data to these shape files, which is usually far from trivial. In this example we will plot US counties because shape files on US states and counties are easy to find.

I don't go over map projections here, but the projection of a map can be changed fairly easily. Shape files typically store borders as latitude and longitude. You can write a projection function which is applied to each coordinate before the map is plotted, to capture the fact that the world is not square.
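
As one possible sketch of such a projection function, here is the spherical (Web) Mercator formula applied to lists of longitudes and latitudes. This is not used in the maps in this notebook. `web_mercator` and `project_shape` are hypothetical helper names, and the choice of Mercator is just an example.

```python
import math

EARTH_RADIUS_M = 6378137.0  # equatorial radius commonly used for Web Mercator

def web_mercator(lon_deg, lat_deg):
    # x grows linearly with longitude; y stretches as latitude nears the poles
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def project_shape(lons, lats):
    # project every border point of one shape
    projected = [web_mercator(lon, lat) for lon, lat in zip(lons, lats)]
    xs = [point[0] for point in projected]
    ys = [point[1] for point in projected]
    return xs, ys

# a made-up square near the equator
xs, ys = project_shape([0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0])
print(xs, ys)
```

To use something like this, you would apply `project_shape` to the `lons` and `lats` of each shape before handing them to the plotting code.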

In [None]:
county_posts = posts_with_location.loc[posts_with_location['country'] == 'United States',:]
county_posts = county_posts.groupby(['administrative_area_level_1', 'administrative_area_level_2'])

def get_lang_count(series, lang='python'):
    return len(series[series.notnull() & (series.str.find('<'+lang+'>') > -1)])

county_stats = county_posts.agg({'lat':np.mean, 'lng':np.mean, 'PostId':len, 'Tags':get_lang_count,
                                'PostType':lambda x : len(x[x=='Question'])})

county_stats.reset_index(inplace=True)
county_stats = county_stats.rename(columns={'PostId':'posts', 'Tags':'python questions', 'PostType':'questions', 'administrative_area_level_1':'State', 'administrative_area_level_2':'County'})
county_stats.head()

In [None]:
# get shape data for counties and states
counties = {
    code: county for code, county in counties.items()
}

us_states = us_states_data.data.copy()

name_to_code = dict([(counties[code]['detailed name'], code) for code in counties])

def match_county(county):
    state = county['State']
    county_name = county['County']
    # take out non-ascii characters which are not in Bokeh file
    county_name = unicodedata.normalize('NFKD', county_name).encode('ascii','ignore').decode("utf-8")
    full_name = county_name + ', ' + state
    if full_name in name_to_code:
        return name_to_code[full_name]
    close_matches = [n for n in name_to_code.keys() if n.endswith(state) and n.startswith(county_name.split(' ')[0])]
    if len(close_matches) == 0:
        print(full_name)
        return None
    full_name = min(close_matches, key=len)
    return name_to_code[full_name]

county_stats['code'] = pd.Series(county_stats.apply(match_county, axis=1))
county_stats.head()
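
The accent-stripping step inside `match_county` can be tried on its own. In this small sketch, `to_ascii` is a hypothetical helper and the county names are just sample strings; the sketch does not check whether they appear in the Bokeh data.

```python
import unicodedata

def to_ascii(name):
    # NFKD splits an accented letter into a base letter plus combining marks;
    # encoding to ASCII with errors='ignore' then drops the marks
    return unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('utf-8')

print(to_ascii('Doña Ana County'))
print(to_ascii('King County'))
```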

In [None]:
def build_map(county_stats, county_slice=None, language='python'):
    color_mapper = LogColorMapper(palette=palette)
    
    if county_slice is not None:
        county_stats = county_stats[county_slice]

    county_xs = county_stats['code'].apply(lambda code: counties[code]["lons"]).tolist()
    county_ys = county_stats['code'].apply(lambda code: counties[code]["lats"]).tolist()
    county_names = (county_stats["County"]+', '+county_stats["State"]).tolist()
    
    language_perc = county_posts['Tags'].agg(lambda x: get_lang_count(x, language))
    language_perc = language_perc.reset_index()
    if county_slice is not None:
        language_perc = language_perc[county_slice]
    language_perc = (language_perc['Tags']/county_stats['questions'])
    language_perc = (language_perc*100).tolist()

    posts_source = ColumnDataSource(data=dict(
        x=county_xs,
        y=county_ys,
        name=county_names,
        posts=county_stats['posts'].tolist(),
        questions=county_stats['questions'].tolist(),
        lang_posts=language_perc
    ))
    
    TOOLS = "pan,wheel_zoom,reset,save"

    p = figure(
        title="Posts by County", tools=TOOLS,
        x_axis_location=None, y_axis_location=None,
        plot_width=900
    )
    p.grid.grid_line_color = None

    county_patches = Patches(xs="x", ys="y",
              fill_color={'field': 'lang_posts', 'transform': color_mapper},
              fill_alpha=0.7, line_color="white", line_width=0.5)
    county_patches_render = p.add_glyph(posts_source, county_patches)
    
    # add hover tooltip
    hover = HoverTool(renderers=[county_patches_render], tooltips=[
        ("Name", "@name"),
        ("Posts", "@posts"),
        ("Questions", "@questions"),
        ("% "+language.capitalize(), "@lang_posts")])
    p.add_tools(hover)
    
    # -----------
    # Add state outlines
    # -----------
    filter_fun = lambda x : x != 'AK' and x != 'HI'
    # get lat and long as x and y
    state_xs = [us_states[code]["lons"] for code in us_states if filter_fun(code)]
    state_ys = [us_states[code]["lats"] for code in us_states if filter_fun(code)]
    
    # draw state lines
    p.patches(state_xs, state_ys, fill_alpha=0.0, line_color="#"+('9'*6), line_width=0.5)

    show(p)

build_map(county_stats)

In [None]:
build_map(county_stats, county_stats['posts'] > 30, language='python')

<div id="network"></div>
## Plot Connections
*[Table of Contents](#contents)*

We can model user interactions as a graph by making an edge from each person who answers a question to the person who posted that question. We can then use graphviz to visualize this graph.

To start with, we need to compute the edges between non-null users.

In [None]:
# filter out null users and get question ids
from_edges = answers.loc[answers['UserId'].notnull(),['UserId', 'QuestionId']]
from_edges.rename(columns={'UserId':'AnswerUID', 'QuestionId':'PostId'}, inplace=True)
# filter out null users and get the ids of the questions they asked
to_edges = questions.loc[questions['UserId'].notnull(),['UserId','PostId']]
# merge on question id
edges = pd.merge(from_edges, to_edges, on='PostId', how='inner')
# use a counter to merge duplicate edges to get edge weights
edges = Counter(edges.apply(lambda x:(x['AnswerUID'], x['UserId']), axis=1).tolist())
edges.most_common(10)
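
Here is the same idea on a handful of made-up ids, using only the standard library. Each answer becomes an edge from the answerer to the asker, and the `Counter` turns repeated edges into weights.

```python
from collections import Counter

# (answerer id, question id) pairs; values are invented for illustration
answers_toy = [(10, 1), (10, 2), (11, 1), (10, 1)]
# question id -> id of the user who asked it
question_owner = {1: 20, 2: 20}

# one edge per answer, pointing from the answerer to the asker
edge_weights = Counter(
    (answerer, question_owner[qid])
    for answerer, qid in answers_toy
    if qid in question_owner
)
print(edge_weights.most_common())
```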

Let's use user reputation to color the nodes in our graph.

In [None]:
repuatation_map = dict(zip(posts['UserId'], posts['Reputation']))

Let's make some functions to visualize our graph.

Graphviz has a few algorithms for deciding where nodes go. You can choose between them using the `prog` argument when you create the graph image. The default option is dot, which works well for trees, but makes less sense for visualizing social networks. The full list of options is:
* dot - "hierarchical" or layered drawings of directed graphs. This is the default tool to use if edges have directionality.
* neato - "spring model" layouts. This is the default tool to use if the graph is not too large (about 100 nodes) and you don't know anything else about it. Neato attempts to minimize a global energy function, which is equivalent to statistical multi-dimensional scaling.
* fdp - "spring model" layouts similar to those of neato, but does this by reducing forces rather than working with energy.
* twopi - radial layouts, after Graham Wills 97. Nodes are placed on concentric circles depending on their distance from a given root node.
* circo - circular layout, after Six and Tollis 99, Kauffman and Wiese 02. This is suitable for certain diagrams of multiple cyclic structures, such as certain telecommunications networks.

In this example, nodes don't have labels and are filled according to a user's reputation. If you would like to label nodes with the user's id, just change `label=''` to `label=uid` in `make_node`. Darker colors signify users with less reputation. To change this, just add `val = 255 - val` in `get_fill`. I like magenta, but if you want a different overall color for the graph, change the return statement of `get_fill`. These are some simple options:
* gray = [val, val, val]
* red = [val, 0, 0]
* green = [0, val, 0]
* blue = [0, 0, val]
* yellow = [val, val, 0]
* cyan = [0, val, val]

In [None]:
# a function to get a fill color based on reputation
def get_fill(uid):
    rep = repuatation_map[int(uid)]
    # The distribution of reputations is very lop-sided, so let's use log reputation for our scale
    # the value should be between 0 and 255
    val = np.log(rep)/np.log(max(repuatation_map.values())) * 255
    return to_hex([val, 0, val])

# turn a RGB triplet into a hex color graphviz will understand
def to_hex(triple):
    output = '#'
    for val in triple:
        # the hex function returns a string of the form 0x<number in hex>
        val = hex(int(val)).split('x')[1]
        if len(val) < 2:
            val = '0'+val
        output += val
    return output

# The function to visualize our network graph
# It takes in a list of edges with weights
def build_network(edges_with_weights, prog='neato'):
    # The function which builds each node. You can change the node style here.
    make_node = lambda uid: graphviz.Node(uid, label='', shape='circle', style='filled', fillcolor=get_fill(uid))
    graph = graphviz.Dot()
    # A dictionary to keep track of node objects
    nodes = {}
    for pair in edges_with_weights:
        e, w = pair
        e = (str(int(e[0])), str(int(e[1])))
        # Add nodes to the graph if they don't exist yet
        if e[0] not in nodes:
            nodes[e[0]] = make_node(e[0])
            graph.add_node(nodes[e[0]])
        if e[1] not in nodes:
            nodes[e[1]] = make_node(e[1])
            graph.add_node(nodes[e[1]])
        graph.add_edge(graphviz.Edge(nodes[e[0]], nodes[e[1]], penwidth=(float(w)/2)))
    return Image(graph.create_png(prog=prog))

# Let's build a small network from the edges with the highest weights.
build_network(edges.most_common(10))

As we can see from the sample of edges above, this may not be a connected graph. Let's pick a sample of edges and plot a connected subgraph.

In [None]:
def find_connected_subgraphs(edges):
    nodes = list(set([n for e in edges for n in e]))
    mappings = dict(zip(nodes, range(len(nodes))))
    flipped_mappings = dict(zip(range(len(nodes)), [[n] for n in nodes]))
    for e in edges:
        c_1 = mappings[e[0]]
        c_2 = mappings[e[1]]
        if c_1 == c_2:
            continue
        if len(flipped_mappings[c_1]) > len(flipped_mappings[c_2]):
            tmp = c_1
            c_1 = c_2
            c_2 = tmp
        for n in flipped_mappings[c_1]:
            mappings[n] = c_2
            flipped_mappings[c_2].append(n)
    return mappings

num_edges = 10000
connection_mapping = find_connected_subgraphs([x[0] for x in edges.most_common(num_edges)])
Counter(connection_mapping.values()).most_common(10)
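
Another common way to label connected components is a union-find (disjoint-set) structure. The sketch below is an alternative to the function above, not the method it uses. `connected_components` is a hypothetical name and the edges are made up.

```python
from collections import Counter

def connected_components(edge_list):
    parent = {}

    def find(node):
        # follow parent links to the root, shortening the path as we go
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edge_list:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # join the two components
            parent[root_a] = root_b

    # label every node with the root of its component
    return {node: find(node) for node in list(parent)}

labels = connected_components([(1, 2), (2, 3), (4, 5)])
print(Counter(labels.values()).most_common())
```

As with `connection_mapping`, counting the labels gives the size of each component, and the label of the largest one can be used to filter the edges.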

100 nodes seems like a large enough graph.

*Note: If you plot a graph with over 500 nodes you might overwhelm graphviz*

In [None]:
e_list = [x for x in edges.most_common(num_edges) if connection_mapping.get(x[0][0],None) == 3586]
build_network(e_list)

<div id="conclusion"></div>
## Conclusion
*[Table of Contents](#contents)*