In [9]:
import seaborn as sns
sns.set()     

# The New York Social Graph


[New York Social Diary](https://www.newyorksocialdiary.com/) provides a
fascinating lens onto New York's socially well-to-do.  The data forms a natural social graph for New York's social elite.  Take a look at a recent page, [To Love Unconditionally](https://www.newyorksocialdiary.com/to-love-unconditionally/).

You will notice the photos have carefully annotated captions labeling the people who appear in them.  We can think of this as implying a social graph: there is a connection between two individuals if they appear in a picture together.

For this project, we will assemble the social graph from photo captions for parties.  Using this graph, we can make guesses at the most popular socialites, the most influential people, and the most tightly coupled pairs.

We will attack the project in three phases:
1. Get a list of all the photo pages to be analyzed.
2. Get all captions for each party, parse them, and extract the guests' names.
3. Assemble the graph, analyze it, and answer the questions.

## Phase One


The first step is to crawl the data.  We want photos from parties on or before December 1st, 2014.  Go to the [Party Pictures Archive](https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures) to see a list of (party) pages.  We want to get the url for each party page, along with its date.

Here are some packages that you may find useful.  You are welcome to use others, if you prefer.

In [10]:
import requests
import dill
from bs4 import BeautifulSoup
from datetime import datetime

We recommend using Python [Requests](http://docs.python-requests.org/en/master/) to download the HTML pages, and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) to process the HTML.  Let's start by getting the [first page](https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures).

In [11]:
page = requests.get("https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures")
page.status_code
print(page.text)

<!DOCTYPE html>
  <!--[if IEMobile 7]><html class="no-js ie iem7" lang="en" dir="ltr"><![endif]-->
  <!--[if lte IE 6]><html class="no-js ie lt-ie9 lt-ie8 lt-ie7" lang="en" dir="ltr"><![endif]-->
  <!--[if (IE 7)&(!IEMobile)]><html class="no-js ie lt-ie9 lt-ie8" lang="en" dir="ltr"><![endif]-->
  <!--[if IE 8]><html class="no-js ie lt-ie9" lang="en" dir="ltr"><![endif]-->
  <!--[if (gte IE 9)|(gt IEMobile 7)]><html class="no-js ie" lang="en" dir="ltr" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#"><![endif]-->
  <!--[if !IE]><!--><html class="no-js" lang="en" dir="ltr" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#"><!--<![endif]-->
<head><scrip

Now, we process the text of the page with BeautifulSoup.

In [12]:
soup = BeautifulSoup(page.text, "lxml")

This page has links to 50 party pages. Look at the structure of the page and determine how to isolate those links.  Your browser's developer tools (usually `Cmd`-`Option`-`I` on Mac, `Ctrl`-`Shift`-`I` on others) offer helpful tools to explore the structure of the HTML page.

Once you have found a pattern, use BeautifulSoup's [select](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) or [find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) methods to get those elements.

In [13]:
links = dict(zip(soup.find_all('span', {"class": "views-field views-field-title"}), soup.find_all('span', {"class": "views-field views-field-created"}) ))

links = dict((  link.find_all('a')[0].get('href') ,date.find_all('span', {'class': 'field-content'})[0].getText()) for (link, date) in links.items())

print(links)
print("Total Links: ", len(links))

{'/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/kicks-offs-sing-offs-and-pro-ams': 'Friday, September 11, 2015', '/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/grand-finale-of-the-hampton-classic-horse-show': 'Tuesday, September 1, 2015', '/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/riders-spectators-horses-and-more': 'Wednesday, August 26, 2015', '/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/artist-and-writers-and-designers': 'Thursday, August 20, 2015', '/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/garden-parties-kickoffs-and-summer-benefits': 'Monday, August 17, 2015', '/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/the-summer-set': 'Wednesday, August 12, 2015', '/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/midsummer-parties': 'Wednesday, August 5, 2015', '/web/20150913224145/http://www

There should be 50 per page.

In [14]:
len(links) == 50

True

Let's take a look at that first link.  Figure out how to extract the URL of the link, as well as the date.  You probably want to use `datetime.strptime`.  See the [format codes for dates](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) for reference.
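As a minimal sketch of the date step, assuming the date text has the same shape as the `'Friday, September 11, 2015'` strings printed above, `datetime.strptime` can parse it in one call. The name `parse_party_date` and the constant `DATE_FORMAT` are illustrative, and `%A`/`%B` assume English day and month names (they are locale-dependent).

```python
from datetime import datetime

# Assumed format: weekday name, month name, day of month, four-digit year.
DATE_FORMAT = "%A, %B %d, %Y"

def parse_party_date(text):
    """Parse a date string such as 'Friday, September 11, 2015'."""
    return datetime.strptime(text.strip(), DATE_FORMAT)

print(parse_party_date("Friday, September 11, 2015"))
```

The result is a `datetime` at midnight, so it can be compared directly against a cutoff such as `datetime(2014, 12, 1)`.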

In [15]:
# Fragment of the first link element's HTML:
# "/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/grand-finale-of-the-hampton-classic-horse-show">Grand Finale of the Hampton Classic Horse Show</a>

In [None]:
import re

link = list(links.keys())[0]
print(link)

# NOTE: the 14-digit run in the path is the Wayback Machine snapshot
# timestamp (YYYYMMDDhhmmss), not the date of the party.
datePattern = re.compile(r"[0-9]{14}")
snapshot = datePattern.search(link).group(0)
print(snapshot)

# Reformat the snapshot timestamp as "YYYY/MM/DD hh:mm:ss".
time_data = "{}/{}/{} {}:{}:{}".format(
    snapshot[0:4], snapshot[4:6], snapshot[6:8],
    snapshot[8:10], snapshot[10:12], snapshot[12:14])
print(time_data)

# The party date itself is the text stored alongside the link.
print(links[link])

urlPattern = re.compile(r"http.*")
url = urlPattern.search(link).group(0)
print(url)

# Check that the title and date match what you see visually.

For purposes of code reuse, let's put that logic into a function.  It should take the link element and return the URL and date parsed from it.

In [19]:
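import re
import datetime  # NOTE: rebinds `datetime` from the class imported earlier to the module

def get_link_date(el, links):
    """Return (url, date) for the archive link `el`, a key of `links`.

    `links` maps each archive href to its date text,
    e.g. 'Friday, September 11, 2015'.
    """
    # Parse the date text; the format codes assume English day and month names.
    date = datetime.datetime.strptime(links[el].strip(), "%A, %B %d, %Y")

    # Drop the "/web/<timestamp>/" archive prefix, keeping the original URL.
    url = re.search(r"http.*", el).group(0)
    return url, date

#print(links)
#link = list(links.keys())[0]
#print(get_link_date(link, links))
#print(get_link_date(link, links)[1].month)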

You may want to check that it works as you expected.

Once that's working, let's write another function to parse all of the links on a page.  Thinking ahead, we can make it take a Requests [Response](https://requests.readthedocs.io/en/master/api/#requests.Response) object and do the BeautifulSoup parsing within it.

In [20]:
def get_links(response):
    
    soup = BeautifulSoup(response.text, "lxml")
    
    links_preprocessed = dict(zip(soup.find_all('span', {"class": "views-field views-field-title"}), soup.find_all('span', {"class": "views-field views-field-created"}) ))

    links_preprocessed = dict((link.find_all('a')[0].get('href') ,date.find_all('span', {'class': 'field-content'})[0].getText()) for (link, date) in links_preprocessed.items())

    links_processed = {}

    for k, v in links_preprocessed.items():
        url, date = get_link_date(k, links_preprocessed)
        links_processed[url]  = date 
        
    return links_processed

#links = get_links(page)

#print(links)


If we run this on the previous response, we should get 50 pairs.

In [None]:
# These should be the same links from earlier
len(get_links(page)) == 50

But we only want parties with dates on or before the first of December, 2014.  Let's write a function to filter our list of dates to those at or before a cutoff.  Using a keyword argument, we can set a default cutoff while still being able to test with other values.

In [21]:
def filter_by_date(links, cutoff=datetime.datetime(2014, 12, 1)):
    # Return only the elements with date <= cutoff
    # (works for an empty dict too, which the first page should produce)
    return {k: v for (k, v) in links.items() if v <= cutoff}

#filtered_links = filter_by_date(links)

#print(filtered_links)

With the default cutoff, there should be no valid parties on the first page.  Adjust the cutoff date to check that it is actually working.

In [None]:
# Double check the dates are being extracted correctly
len(filter_by_date(get_links(page))) == 0

Now we should be ready to get all of the party URLs.  Click through a few of the index pages to determine how the URL changes.  Figure out a strategy to visit all of them.

HTTP requests are generally IO-bound.  This means that most of the time is spent waiting for the remote server to respond.  If you use `requests` directly, you can only wait on one response at a time.  [requests-futures](https://github.com/ross/requests-futures) lets you wait for multiple requests at a time.  You may wish to use this to speed up the downloading process.
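To illustrate why concurrency helps with IO-bound work, here is a minimal sketch using only the standard library's `ThreadPoolExecutor` rather than `requests-futures`. The function `fake_fetch` is a hypothetical stand-in that sleeps instead of making a real HTTP request, so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(page_number):
    """Hypothetical stand-in for an HTTP request: sleeps, then returns a label."""
    time.sleep(0.1)  # simulated network latency
    return "page-{}".format(page_number)

def fetch_all(page_numbers, max_workers=8):
    """Run fake_fetch concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, page_numbers))

start = time.perf_counter()
results = fetch_all(range(8))
elapsed = time.perf_counter() - start
print(results[:3], "in {:.2f}s".format(elapsed))
```

Run one after another, eight calls would take about 0.8 seconds; with eight worker threads the waits overlap, so the total should be close to a single call's latency. Threads are usually sufficient here because the time is spent waiting rather than computing.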

In [46]:
from requests_futures.sessions import FuturesSession
from concurrent.futures import  ProcessPoolExecutor
from concurrent.futures import as_completed
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

request_list = []

session  = FuturesSession(executor=ProcessPoolExecutor(max_workers=8))
retries = 5
status_forcelist = [429, 500, 502, 503, 504]
retry = Retry(
     total=retries,
     read=retries,
     connect=retries,
     backoff_factor=10,  
     status_forcelist=status_forcelist
)

adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)

request = session.get("https://web.archive.org/web/20150918040534/http://www.newyorksocialdiary.com/party-pictures")
request_list.append(get_links(request.result()))

futures = [session.get('https://web.archive.org/web/20150918040534/http://www.newyorksocialdiary.com/party-pictures?page={}'.format(i)) for i in range(1,27)]
for future in as_completed(futures):
    request_list.append(get_links(future.result()))


print(request_list)
print(len(request_list))
#soup = BeautifulSoup(page.text, "lxml")

# You can use link_list.extend(others) to add the elements of others
# to link_list.

[{'http://www.newyorksocialdiary.com/party-pictures/2015/coaching-and-kickoffs': datetime.datetime(2015, 9, 16, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2015/kicks-offs-sing-offs-and-pro-ams': datetime.datetime(2015, 9, 11, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2015/grand-finale-of-the-hampton-classic-horse-show': datetime.datetime(2015, 9, 1, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2015/riders-spectators-horses-and-more': datetime.datetime(2015, 8, 26, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2015/artist-and-writers-and-designers': datetime.datetime(2015, 8, 20, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2015/garden-parties-kickoffs-and-summer-benefits': datetime.datetime(2015, 8, 17, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2015/the-summer-set': datetime.datetime(2015, 8, 12, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2015/midsummer-parties': datetime.datetime(2015, 8, 5, 0

In the end, you should have 1193 parties.

In [47]:
# Make sure you are using the same /web/stringofdigits/... for each page
# This is to prevent the archive from accessing later copies of the same page
# If you are off by just a few, that can be the archive misbehaving
def merge_dicts(data):
    """
    Function to merge a list of dictionaries into one

    :param data: List of dictionaries
    :return: Dictionary
    """
    merged_dict = {}  # start empty so data[0] is not mutated in place
    for d in data:
        merged_dict.update(d)
    return merged_dict

link_list = filter_by_date(merge_dicts(request_list))
print(link_list)
print(len(link_list))
len(link_list) == 1193

{'http://www.newyorksocialdiary.com/party-pictures/2014/the-thanksgiving-day-parade-from-the-ground-up': datetime.datetime(2014, 12, 1, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2014/gala-guests': datetime.datetime(2014, 11, 24, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2014/equal-justice': datetime.datetime(2014, 11, 20, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2014/celebrating-the-treasures': datetime.datetime(2014, 11, 18, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2014/associates-and-friends': datetime.datetime(2014, 11, 17, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2014/michaels-25th': datetime.datetime(2014, 11, 13, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2014/new-york-lifelines': datetime.datetime(2014, 11, 12, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2014/legends-and-leaders': datetime.datetime(2014, 11, 10, 0, 0), 'http://www.newyorksocialdiary.com/party-pictures/2014/fa

False

In case we need to restart the notebook, we should save this information to a file.  There are many ways you could do this; here's one using `dill`.

In [48]:
# Use a context manager so the file is flushed and closed.
with open('nysd-links.pkd', 'wb') as f:
    dill.dump(link_list, f)

To restore the list, we can just load it from the file.  When the notebook is restarted, you can skip the code above and just run this command.

In [None]:
with open('nysd-links.pkd', 'rb') as f:
    link_list = dill.load(f)

### Question 1: In which month did most of the parties occur? (10 p)
### Question 2: What is the overall trend of parties from 2007 to 2014? (10 p)
Use visualizations to answer the two questions above. Ensure that you interpret your plots thoroughly.

## Phase Two


In this phase, we concentrate on getting the names out of captions for a given page.  We'll start with [the benefit cocktails and dinner](https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/celebrating-the-neighborhood) for [Lenox Hill Neighborhood House](http://www.lenoxhill.org/), a neighborhood organization for the East Side.

Take a look at that page.  Note that some of the text on the page consists of captions, while other text describes the event.  Determine how to select only the captions.

In [54]:
page = requests.get("https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/celebrating-the-neighborhood")

soup = BeautifulSoup(page.text, "lxml")
    
captions = soup.find_all('div', {"class": "photocaption"})

captions = [ caption.text for caption in captions]

print(captions)

print(len(captions))

["Glenn Adamson, Simon Doonan, Victoire de Castellane, Craig Leavitt, Jerome Chazen, Andi Potamkin, Ralph Pucci, Kirsten Bailey, Edwin Hathaway, and Dennis Freedman at the Museum of Art and Design's annual MAD BALL. ", ' Randy Takian ', ' Kamie Lightburn and Christopher Spitzmiller ', ' Christopher Spitzmiller and Diana Quasha ', ' Mariam Azarm, Sana Sabbagh, and Lynette Dallas ', ' Christopher Spitzmiller, Sydney Shuman, and Matthew Bees', ' Christopher Spitzmiller and Tom Edelman ', ' Warren Scharf and Sydney Shuman ', ' Amory McAndrew and Sean McAndrew ', ' Sydney Shuman, Mario Buatta, and Helene Tilney ', ' Katherine DeConti and Elijah Duckworth-Schachter ', ' John Rosselli and Elizabeth Swartz ', ' Stephen Simcock, Lee Strock, and Thomas Hammer ', ' Searcy Dryden, Lesley Dryden, Richard Lightburn, and Michel Witmer ', ' Jennifer Cacioppo and Kevin Michael Barba ', ' Virginia Wilbanks and Lacary Sharpe ', ' Valentin Hernandez, Yaz Hernandez, Chele Farley, and James Farley', ' Harry

By our count, there are about 110.  But if you're off by a couple, you're probably okay.

In [55]:
# These are for the specific party referenced in the text
abs(len(captions) - 110) < 5

True

Let's encapsulate this in a function.  As with the links pages, we want to avoid downloading a given page the next time we need to run the notebook.  While we could save the files by hand, as we did before, a checkpoint library like [ediblepickle](https://pypi.python.org/pypi/ediblepickle/1.1.3) can handle this for you.  (Note, though, that you may not want to enable this until you are sure that your function is working.)
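The checkpoint idea can be sketched by hand with the standard library. This is not `ediblepickle`'s API; it is a hypothetical helper, `cached`, that uses `pickle` and only shows the underlying pattern: load the saved result if the file exists, otherwise compute and save it.

```python
import os
import pickle
import tempfile

def cached(cache_path, compute):
    """Return the pickled result at cache_path, computing and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result

# Usage with a hypothetical stand-in for an expensive download.
cache_file = os.path.join(tempfile.mkdtemp(), "captions.pkl")
first = cached(cache_file, lambda: ["Randy Takian"])
second = cached(cache_file, lambda: ["this is never computed"])
print(first == second)
```

The second call never runs its `compute` argument because the file already exists, which is also why caching is best enabled only once the function being cached is known to be correct.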

You should also keep in mind that HTTP requests fail occasionally, for transient reasons.  You should plan how to detect and react to these failures.   The [retrying module](https://pypi.python.org/pypi/retrying) is one way to deal with this.
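One way to react to transient failures is to retry with an increasing delay. The sketch below is a hand-written decorator, not the `retrying` module's API; the names `retry` and `flaky_download` and the delay values are assumptions chosen for illustration.

```python
import time
from functools import wraps

def retry(attempts=3, base_delay=0.01, exceptions=(Exception,)):
    """Retry the wrapped function, doubling the delay after each failure."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator

calls = {"count": 0}

@retry(attempts=4, exceptions=(ConnectionError,))
def flaky_download():
    """Hypothetical download that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(flaky_download(), "after", calls["count"], "calls")
```

Restricting `exceptions` to the errors you consider transient keeps genuine bugs, such as a parsing error, from being silently retried.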

In [63]:
from requests_futures.sessions import FuturesSession
from concurrent.futures import  ProcessPoolExecutor
from concurrent.futures import as_completed
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

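def get_captions(path):
    """Download an archived party page and return its caption strings.

    `path` is the archive path, e.g. "/web/<timestamp>/http://...".
    """
    # Fetch the page named by `path` instead of reusing the global `page`.
    response = requests.get("https://web.archive.org" + path)
    soup = BeautifulSoup(response.text, "lxml")

    captions = soup.find_all('div', {"class": "photocaption"})
    return [caption.text for caption in captions]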

caption_list = []

session  = FuturesSession(executor=ProcessPoolExecutor(max_workers=8))
retries = 5
status_forcelist = [429, 500, 502, 503, 504]
retry = Retry(
     total=retries,
     read=retries,
     connect=retries,
     backoff_factor=10,  
     status_forcelist=status_forcelist
)

adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)

#print(list(link_list.keys())[0])

#futures = [session.get('https://web.archive.org/web/20150918040534/{}'.format(link)) for link in list(link_list.keys())]

# A plain loop avoids building (and echoing) a list of None values.
for link in link_list:
    print('https://web.archive.org/web/20150918040534/{}'.format(link))

#for future in as_completed(futures):
#    caption_list.append(get_captions(future.result()))
        
#print(caption_list)
#print(len(caption_list))


https://web.archive.org/web/20150918040534/http://www.newyorksocialdiary.com/party-pictures/2014/the-thanksgiving-day-parade-from-the-ground-up
https://web.archive.org/web/20150918040534/http://www.newyorksocialdiary.com/party-pictures/2014/gala-guests
https://web.archive.org/web/20150918040534/http://www.newyorksocialdiary.com/party-pictures/2014/equal-justice
https://web.archive.org/web/20150918040534/http://www.newyorksocialdiary.com/party-pictures/2014/celebrating-the-treasures
https://web.archive.org/web/20150918040534/http://www.newyorksocialdiary.com/party-pictures/2014/associates-and-friends
https://web.archive.org/web/20150918040534/http://www.newyorksocialdiary.com/party-pictures/2014/michaels-25th
https://web.archive.org/web/20150918040534/http://www.newyorksocialdiary.com/party-pictures/2014/new-york-lifelines
https://web.archive.org/web/20150918040534/http://www.newyorksocialdiary.com/party-pictures/2014/legends-and-leaders
https://web.archive.org/web/20150918040534/http:/

This should get the same captions as before.

In [None]:
# This cell is expecting get_captions to return a list of the captions themselves
# Other routes to a solution might need to adjust this cell a bit
captions == get_captions("/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/celebrating-the-neighborhood")

Now that we have some sample captions, let's start parsing names out of those captions.  There are many ways of going about this, and we leave the details up to you.  Some issues to consider:

  1. Some captions are not useful: they contain long narrative texts that explain the event.  Try to find some heuristic rules to separate captions that are a list of names from those that are not.  A few heuristics include:
    - Look for sentences (which have verbs) as opposed to lists of nouns. For example, [`nltk` does part-of-speech tagging](http://www.nltk.org/book/ch05.html), but it is a little slow. There may also be heuristics that accomplish the same thing.
    - Similarly, spaCy's [entity recognition](https://spacy.io/docs/usage/entity-recognition) could be useful here.
    - Look for commonly repeated threads (e.g. you might end up picking up the photo credits or people such as "a friend").
    - Long captions are often not lists of people.  The cutoff is subjective, but for grading purposes, *set that cutoff at 250 characters*.
  2. You will want to separate the captions based on various forms of punctuation.  Try using `re.split`, which is more sophisticated than `str.split`. **Note**: Use regex exclusively for name parsing.
  3. This site is pretty formal and likes to say things like "Mayor Michael Bloomberg" after his election but "Michael Bloomberg" before his election.  Can you find other titles that are being used?  They should probably be filtered out because they ultimately refer to the same person: "Michael Bloomberg."
  4. There is a special case you might find where couples are written as, e.g., "John and Mary Smith". You will need to write some extra logic to make sure this properly parses to two names: "John Smith" and "Mary Smith".
  5. When parsing names from captions, it can help to look at your output frequently and address the problems that you see coming up, iterating until you have a list that looks reasonable. Because we can only asymptotically approach perfect identification and entity matching, we have to stop somewhere.
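
The heuristics in the list above can be combined into a first-pass, regex-only parser. This is a rough sketch under stated assumptions, not a complete solution: the title list in `TITLES` is a short hypothetical sample, `COUPLE` only recognises the simplest "John and Mary Smith" shape, and trailing event descriptions (such as "at the Museum of ...") are not removed.

```python
import re

MAX_CAPTION_LENGTH = 250  # cutoff stated in the assignment text

# Hypothetical, deliberately short title list; extend it as you inspect output.
TITLES = re.compile(r"\b(?:Mayor|Dr|Mr|Mrs|Ms|Hon|Sir)\.?\s+")

# "John and Mary Smith": a lone first name, "and", then a first and last name.
COUPLE = re.compile(r"^([A-Z][a-z]+) and ([A-Z][a-z]+) ([A-Z][\w'-]+)$")

def parse_names(caption):
    """Split one caption into a list of names using only regular expressions."""
    caption = caption.strip()
    if not caption or len(caption) > MAX_CAPTION_LENGTH:
        return []  # empty, or long enough to be narrative text
    caption = TITLES.sub("", caption)
    names = []
    # Split on commas, swallowing the "and" that follows a final comma.
    for chunk in re.split(r",\s*(?:and\s+)?", caption):
        chunk = chunk.strip()
        couple = COUPLE.match(chunk)
        if couple:
            first_a, first_b, last = couple.groups()
            names += ["{} {}".format(first_a, last), "{} {}".format(first_b, last)]
        else:
            names += [part.strip() for part in re.split(r"\s+and\s+", chunk) if part.strip()]
    return names

print(parse_names(" Mariam Azarm, Sana Sabbagh, and Lynette Dallas "))
print(parse_names("John and Mary Smith"))
```

Running this over real captions and reading the output is the quickest way to find the next rule to add, for example stripping photo credits or filtering out placeholders like "a friend".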
  
**Questions worth considering:**
  1. Who is Patrick McMullan and should he be included in the results? How would you address this?
  2. What else could you do to improve the quality of the graph's information?

Once you feel that your algorithm is working well on these captions, parse all of the captions and extract all the names mentioned.  

Now, run this sort of test on a few other pages.  You will probably find that other pages have a slightly different HTML structure, as well as new captions that trip up your caption parser.  But don't worry if the parser isn't perfect -- just try to get the easy cases.

Once you are satisfied that your caption scraper and parser are working, run this for all of the pages.  If you haven't implemented some caching of the captions, you probably want to do this first.

In [None]:
# Scraping all of the pages could take 10 minutes to an hour

## Phase 3: Graph Analysis

For the remaining analysis, we think of the problem in terms of a
[network](http://en.wikipedia.org/wiki/Computer_network) or a
[graph](https://en.wikipedia.org/wiki/Graph_%28discrete_mathematics%29).  Any time a pair of people appear in a photo together, that is considered a link.  It is an example of an **undirected weighted graph**. We recommend using Python's [`networkx`](https://networkx.github.io/) library.

In [None]:
import itertools  # itertools.combinations may be useful
import networkx as nx

All in all, you should end up with over 100,000 captions and more than 110,000 names, connected in about 200,000 pairs.
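
The rule that every pair of people sharing a photo forms a link can be sketched with `itertools.combinations` and a `Counter`, before any graph library is involved. The function name `count_pairs` and the sample names are made up for illustration; each inner list stands for the names parsed from one caption.

```python
from collections import Counter
from itertools import combinations

def count_pairs(photos):
    """Count how often each unordered pair of names shares a photo."""
    weights = Counter()
    for names in photos:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(names)), 2):
            weights[pair] += 1
    return weights

photos = [["Ann Lee", "Bob Ray", "Cy Dunn"], ["Bob Ray", "Ann Lee"]]
for pair, weight in sorted(count_pairs(photos).items()):
    print(pair, weight)
```

Each `(pair, weight)` item can then become a weighted edge, for instance with `G.add_edge(a, b, weight=w)` in `networkx`.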

## Question 3: Graph EDA (20 p)


- Use parsed names to create the undirected weighted network and visualize it (5 p)
- Report the number of nodes and edges (5 p)
- What is the diameter of this graph? (5 p)
- What is the average clustering coefficient of the graph? How do you interpret this number? (5 p)
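
To make the clustering coefficient concrete, here is a small pure-Python sketch of the unweighted definition: for each node, the fraction of pairs of its neighbors that are themselves connected, averaged over all nodes (nodes with fewer than two neighbors count as 0). In practice `nx.average_clustering(G)` computes this for you; the function and toy graph below are illustrative only.

```python
from itertools import combinations

def average_clustering(adjacency):
    """adjacency: {node: set of neighbors} for an undirected graph."""
    total = 0.0
    for node, nbrs in adjacency.items():
        k = len(nbrs)
        if k < 2:
            continue  # coefficient is taken as 0 when there are no neighbor pairs
        # Count the neighbor pairs that are directly connected to each other.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adjacency[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adjacency)

# Toy graph: a triangle a-b-c with one extra person d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(round(average_clustering(toy), 4))
```

A value near 1 means that people who appear with the same person also tend to appear with each other; a value near 0 means they rarely do.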

## Question 4: Graph properties (20 p)

What real-world graph properties does this graph exhibit? Please show your work and interpret your answer. Does the result make sense given the nature of the graph?

## Question 4: Who are the most photogenic persons? (10 p)

The simplest question to ask is "who is the most popular"?  The easiest way to answer this question is to look at how many connections everyone has.  Return the top 100 people and their degree.  Remember that if an edge of the graph has weight 2, it counts for 2 in the degree.
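
As a sketch of the weighted degree described above, assuming edge weights are held in a dict keyed by name pairs, each edge adds its weight to both endpoints. With `networkx`, `G.degree(weight='weight')` gives the same quantity; the helper names and sample data here are hypothetical.

```python
from collections import Counter

def weighted_degrees(weights):
    """Sum edge weights per person; weights maps (name_a, name_b) -> count."""
    degree = Counter()
    for (a, b), w in weights.items():
        degree[a] += w
        degree[b] += w
    return degree

def top_people(weights, n=100):
    """Return the n (name, degree) pairs with the largest weighted degree."""
    return weighted_degrees(weights).most_common(n)

sample = {("Ann Lee", "Bob Ray"): 2, ("Ann Lee", "Cy Dunn"): 1}
print(top_people(sample, n=2))
```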

**Checkpoint:** Some aggregate stats on the solution

    "count": 100.0
    "mean": 189.92
    "std": 87.8053034454
    "min": 124.0
    "25%": 138.0
    "50%": 157.0
    "75%": 195.0
    "max": 666.0

In [None]:
degree = None  # TODO: replace with the top 100 (name, degree) pairs


## Question 5: Centrality analysis (20 p)


Use eccentricity centrality, closeness centrality, betweenness centrality, prestige, and PageRank to identify the top 10 individuals with the highest centrality for each measure. How do you interpret the results?

Use 0.85 as the damping parameter for page rank, so that there is a 15% chance of jumping to another vertex at random.
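
To show what the damping parameter does, here is a hand-rolled power-iteration sketch of weighted PageRank on an undirected graph. It is not `networkx`'s implementation (the library call is `nx.pagerank(G, alpha=0.85)`); it uses a fixed iteration count and assumes every node has at least one edge, which holds when nodes come only from co-occurrence pairs.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """weights maps (u, v) -> edge weight for an undirected graph."""
    neighbors = {}
    for (u, v), w in weights.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w
    n = len(neighbors)
    rank = {node: 1.0 / n for node in neighbors}
    strength = {node: sum(adj.values()) for node, adj in neighbors.items()}
    for _ in range(iterations):
        # With probability 1 - damping the walker jumps to a random node.
        new_rank = {node: (1.0 - damping) / n for node in neighbors}
        # Otherwise it follows an edge, chosen in proportion to its weight.
        for node, adj in neighbors.items():
            for other, w in adj.items():
                new_rank[other] += damping * rank[node] * w / strength[node]
        rank = new_rank
    return rank

star = {("hub", "a"): 1, ("hub", "b"): 1, ("hub", "c"): 1}
ranks = pagerank(star)
print(max(ranks, key=ranks.get))
```

The scores always sum to 1, which is why the checkpoint values for a graph with tens of thousands of nodes are so small.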

**Checkpoint:** Some aggregate stats on the solution for pagerank

    "count": 100.0
    "mean": 0.0001841088
    "std": 0.0000758068
    "min": 0.0001238355
    "25%": 0.0001415028
    "50%": 0.0001616183
    "75%": 0.0001972663
    "max": 0.0006085816

## Question 6: best_friends (10 p)


Another interesting question is which people tend to co-occur with each other.  Give us the 100 edges with the highest weights.

Google these people and see what their connection is.  Can we use this to detect instances of infidelity?
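
Selecting the heaviest edges is a top-n query. A minimal sketch, assuming the weights are held in a dict keyed by name pairs, uses `heapq.nlargest` so the full edge list does not need to be sorted. The sample data is invented; the returned items have the shape `((name_a, name_b), weight)`.

```python
import heapq

def heaviest_edges(weights, n=100):
    """Return the n (pair, weight) items with the largest weights, heaviest first."""
    return heapq.nlargest(n, weights.items(), key=lambda item: item[1])

sample_weights = {
    ("Ann Lee", "Bob Ray"): 5,
    ("Ann Lee", "Cy Dunn"): 1,
    ("Bob Ray", "Cy Dunn"): 3,
}
print(heaviest_edges(sample_weights, n=2))
```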

**Checkpoint:** Some aggregate stats on the solution

    "count": 100.0
    "mean": 25.84
    "std": 16.0395470855
    "min": 14.0
    "25%": 16.0
    "50%": 19.0
    "75%": 29.25
    "max": 109.0

In [32]:
# template form for the answer
best_friends = [(('Michael Kennedy', 'Eleanora Kennedy'), 41)] * 100
