# New York Social Diary

[New York Social Diary](http://www.newyorksocialdiary.com/) casts a fascinating lens onto New York's socially well-to-do.  The data forms a natural social graph for New York's social elite.  Take a look at this page of a recent run-of-the-mill holiday party:

`http://www.newyorksocialdiary.com/party-pictures/2014/holiday-dinners-and-doers`

Besides the brand-name celebrities, you will notice the photos have carefully annotated captions labeling those that appear in the photos.  We can think of this as implying a social graph: there is a connection between two individuals if they appear in a picture together.  In this project, we will scrape data from this website, parse the captions to find which people occur in photos together, and build a social graph of the result.

The first step is to fetch the data.  This comes in two phases.

The first phase is to crawl the data.  We want photos from parties before December 1st, 2014.  Go to
`http://www.newyorksocialdiary.com/party-pictures`
to see a list of (party) pages.  For each party's page, grab all the captions.

*Hints*:

  1. Click through the pages of the index and see how the url changes.  Use this to determine a strategy to get all the data.

  2. Notice that each party has a date on the index page. You can use python's `datetime.strptime` function to parse it.

  3. Some captions are not useful: they contain long narrative texts that explain the event.  Usually in two stage processes like this, it is better to keep more data in the first stage and then filter it out in the second stage.  This makes your work more reproducible.  It's usually faster to download more data than you need now than to have to redownload more data later.
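
The date hint above can be sketched as follows. This is a minimal illustration, assuming the index page prints dates in the form `Monday, December 1, 2014` and that the locale is English; the helper name `is_before_cutoff` and the sample strings are made up for the example.

```python
from datetime import datetime

# Cutoff from the text: keep parties before December 1st, 2014.
CUTOFF = datetime(2014, 12, 1)

def is_before_cutoff(date_text, cutoff=CUTOFF):
    """Parse an index-page date string and compare it with the cutoff."""
    # Assumed format: full weekday, full month name, day, four-digit year.
    parsed = datetime.strptime(date_text.strip(), "%A, %B %d, %Y")
    return parsed < cutoff

# Hypothetical sample dates, not scraped from the site.
samples = ["Monday, December 1, 2014", "Friday, November 28, 2014"]
kept = [s for s in samples if is_before_cutoff(s)]
```

Only the November date survives the filter, because the comparison is strict.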

Now that you have a list of all captions, you should probably save the data on disk so that you can quickly retrieve it.  Now comes the parsing part.

  1. Try to find some heuristic rules to separate captions that are a list of names from those that are not. For one, consider that long captions are often not lists of people.  The cutoff is subjective so to be definitive, *let's set that cutoff at 250 characters*.

  2. You will want to separate the captions based on various forms of punctuation.  Try using `re.split`, which is more sophisticated than `str.split`.

  3. You might find a person named "ra Lebenthal".  There is no one by this name.  Can anyone spot what's happening here?

  4. This site is pretty formal and likes to say things like "Mayor Michael Bloomberg" after his election but "Michael Bloomberg" before his election.  Can you find other ('optional') titles that are being used?  They should probably be filtered out because they ultimately refer to the same person: "Michael Bloomberg."
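
One possible shape for these heuristics is sketched below. It is a rough starting point rather than a complete parser: the `TITLES` tuple is a deliberately short, hypothetical list, the function name `parse_caption` is invented, and the sample caption is made up. Note that the split pattern requires whitespace around "and", so a first name that merely contains those letters stays intact.

```python
import re

MAX_CAPTION_LENGTH = 250  # cutoff fixed in the text above

# Hypothetical and incomplete list of optional titles to strip.
TITLES = ("Mayor", "Governor", "Dr.")

def parse_caption(caption):
    """Return the names in a caption, or [] if it looks like narrative text."""
    caption = " ".join(caption.split())  # collapse runs of whitespace
    if len(caption) >= MAX_CAPTION_LENGTH:
        return []
    # Split on ", and ", ", ", " and " or " with "; the longest form comes first.
    parts = re.split(r",\s+and\s+|,\s+|\s+and\s+|\s+with\s+", caption)
    names = []
    for part in parts:
        words = [w for w in part.split() if w not in TITLES]
        if words:
            names.append(" ".join(words))
    return names

# Made-up caption for illustration.
names = parse_caption("Mayor Michael Bloomberg, Jane Doe, and Alexandra Lebenthal")
```

A real pass over the data would also need to check that each piece actually looks like a name, which this sketch does not attempt.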

For the analysis, we think of the problem in terms of a [network](http://en.wikipedia.org/wiki/Computer_network) or a [graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29).  Any time a pair of people appear in a photo together, that is considered a link.  What we have described is more appropriately called an (undirected) [multigraph](http://en.wikipedia.org/wiki/Multigraph) with no self-loops, but this has an obvious analog in terms of an undirected [weighted graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29#Weighted_graph).  In this problem, we will analyze the social graph of the New York social elite.
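
The step from multigraph to weighted graph can be illustrated with plain Python: every photo contributes one parallel edge per pair of people, and the weight of a pair is simply the number of such edges. The function name `count_pairs` and the sample photos are invented for this sketch.

```python
from collections import Counter
from itertools import combinations

def count_pairs(photos):
    """Map each unordered pair of people to the number of photos they share."""
    weights = Counter()
    for names in photos:
        # set() drops repeated names (so no self-loops); sorted() gives each
        # pair one canonical order, which makes the edge undirected.
        for pair in combinations(sorted(set(names)), 2):
            weights[pair] += 1
    return weights

# Made-up photos: each inner list is the people named in one caption.
photos = [["Ann", "Bob", "Cy"], ["Ann", "Bob"], ["Cy"]]
weights = count_pairs(photos)
```

Here Ann and Bob share two photos, so their edge has weight 2, while the photo with a single person adds no edge at all.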

For this problem, we recommend using python's `networkx` library.


## I. Degree

The simplest question you might want to ask is 'who is the most popular'?  The easiest way to answer this question is to look at how many connections everyone has.  Write a function that returns the top 100 people and their degree.  Remember that if an edge of the graph has weight 2, it counts for 2 in the degree.
  
*Checkpoint:*

    Top 100 .describe()
    count    100.000000
    mean     106.340000
    std       51.509579
    min       69.000000
    25%       77.000000
    50%       85.500000
    75%      116.500000
    max      372.000000
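
As a small illustration of weighted degree, the sketch below sums edge weights per person from a dictionary of pair counts. The input format (`{(person1, person2): weight}`) and the name `top_by_weighted_degree` are assumptions made for the example; with `networkx`, `G.degree(weight="weight")` computes the same quantity on a graph object.

```python
from collections import Counter

def top_by_weighted_degree(edge_weights, n=100):
    """Return the n people with the highest weighted degree, largest first."""
    degree = Counter()
    for (person1, person2), weight in edge_weights.items():
        # An edge of weight w adds w to the degree of both endpoints.
        degree[person1] += weight
        degree[person2] += weight
    return degree.most_common(n)

# Made-up edge weights for illustration.
edge_weights = {("Ann", "Bob"): 2, ("Ann", "Cy"): 1}
top_two = top_by_weighted_degree(edge_weights, n=2)
```

Ann appears in both edges, so her weighted degree is 3, ahead of Bob with 2.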


## II. PageRank

A similar way to determine popularity is to look at each person's [PageRank](http://en.wikipedia.org/wiki/PageRank).  PageRank is used for web ranking and was originally [patented](http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=6285999) by Google; it is essentially the [stationary distribution](http://en.wikipedia.org/wiki/Markov_chain#Stationary_distribution_relation_to_eigenvectors_and_simplices) of a [Markov chain](http://en.wikipedia.org/wiki/Markov_chain) implied by the social graph.

Use 0.85 as the damping parameter so that there is a 15% chance of jumping to another vertex at random.

*Checkpoint:*

    Top 100 .describe()
    count    100.000000
    mean       0.000185
    std        0.000076
    min        0.000124
    25%        0.000138
    50%        0.000162
    75%        0.000200
    max        0.000623
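
To make the Markov chain picture concrete, here is a small power-iteration sketch written from scratch. It is an illustration only, not the routine `networkx` uses, and it assumes every node has at least one edge (true for a graph built purely from co-appearances). The adjacency format and the function name are choices made for this example.

```python
def pagerank_power_iteration(adjacency, damping=0.85, iterations=100):
    """Approximate PageRank for an undirected weighted graph.

    adjacency maps node -> {neighbor: weight}; every node needs an edge.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    # Total edge weight at each node, used to normalise transition probabilities.
    strength = {node: sum(adjacency[node].values()) for node in nodes}
    for _ in range(iterations):
        # With probability 1 - damping the walker jumps to a uniformly random node.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            for neighbor, weight in adjacency[node].items():
                # Otherwise it follows an edge with probability proportional to weight.
                new_rank[neighbor] += damping * rank[node] * weight / strength[node]
        rank = new_rank
    return rank

# Made-up star graph: B is linked to both A and C.
star = {"A": {"B": 1}, "B": {"A": 1, "C": 1}, "C": {"B": 1}}
ranks = pagerank_power_iteration(star)
```

In this toy graph the hub B ends up with the largest rank, the two leaves tie, and the ranks sum to 1.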
   

## III. Best Friends

Another interesting question is who tends to co-occur with whom.  Give us the 100 edges with the highest weights.  Write a function which returns a list of 100 tuples of the form `((person1, person2), count)` in descending order of count.

*Checkpoint:*

    Top 100 .describe()
    count    100.000000
    mean      25.070000
    std       15.647154
    min       13.000000
    25%       15.000000
    50%       19.000000
    75%       28.500000
    max      107.000000
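
The requested return format can be sketched from a dictionary of pair counts. The input shape (`{(person1, person2): count}`), the function name `best_friends` and the sample data are assumptions for the example.

```python
def best_friends(edge_weights, n=100):
    """Return the n heaviest edges as ((person1, person2), count) tuples."""
    ranked = sorted(edge_weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Made-up pair counts for illustration.
edge_weights = {("Ann", "Bob"): 5, ("Ann", "Cy"): 1, ("Bob", "Cy"): 3}
top_two = best_friends(edge_weights, n=2)
```

The result is already in descending order of count, so slicing the first `n` entries is enough.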
   

In [None]:
%xmode Context
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
from collections import namedtuple
from datetime import datetime 
PartyPictures = namedtuple('PartyPictures', 'title, url, date,html')
PartyPicturesDetails = namedtuple('PartyPicturesDetails','imgurl,text')

def get_party_pictures(fromdatetime=datetime(2014, 12, 1)):
    # Keep only parties dated strictly before the cutoff (December 1st, 2014).
    party_pictures=[]
    for i in range(2,29):
        response = requests.get("http://www.newyorksocialdiary.com/party-pictures", params={"page": str(i)})
        soup = BeautifulSoup(response.text, "lxml")
        parent_div = soup.find_all("div", class_="view-content")
        rows = parent_div[0].find_all("div",class_="views-row")
        for row in rows:
            url = row.select("span.views-field-title")[0].select("span.field-content")[0].select("a")[0].get('href')
            title = row.select("span.views-field-title")[0].select("span.field-content")[0].select("a")[0].getText()
            date = row.select("span.views-field-created")[0].select("span.field-content")[0].getText()
            formatteddate = datetime.strptime(date, '%A, %B %d, %Y')
            if (formatteddate < fromdatetime):
                party_pictures.append(PartyPictures(url=url,title=title,date=date,html=''))
    return party_pictures

import pickle

def get_party_picture_html(response,party_pic):
    return PartyPictures(url=party_pic.url,title=party_pic.title,date=party_pic.date,html=response.text)
    

from requests_futures.sessions import FuturesSession
def get_people_party_pictures_details():
    party_pictures_details = []
    url_base = "http://www.newyorksocialdiary.com"
    session = FuturesSession(max_workers=15)
    party_pics = get_party_pictures()
    futures = [session.get(urljoin(url_base, party_pic.url)) for party_pic in party_pics]
    party_pictures_details = [get_party_picture_html(future.result(), party_pic) for future, party_pic in zip(futures, party_pics)]
    # Cache the downloaded pages on disk; the context manager closes the file.
    with open("party_picture_details.p", "wb") as handle:
        pickle.dump(party_pictures_details, handle)
    return party_pictures_details


def get_images_party_pictures(party_pictures):
    my_dict = {}
    for party_picture in party_pictures:
        party_images = []
        soup = BeautifulSoup(party_picture.html, "lxml")
        images = soup.find_all("img")
        for image in images:
            tables = image.find_all_previous("table")
            if (len(tables)) > 0:
                photocaption = tables[0].select(".photocaption")
                if (len(photocaption)) > 0:
                    text = photocaption[0].getText()
                    if ("for NYSD Contents" not in text):
                        party_images.append(PartyPicturesDetails(imgurl=image.get('src'),text=text))
                else:
                    font = tables[0].select("font")
                    if (len(font)) > 0:
                        text = font[0].getText()
                        if ("for NYSD Contents" not in text):
                            party_images.append(PartyPicturesDetails(imgurl=image.get('src'),text=text))
        my_dict[party_picture.url] = party_images 
    return my_dict

In [None]:
party_pictures_details_list = get_people_party_pictures_details()

In [None]:
with open("party_picture_details.p", "rb") as handle:
    party_pictures_details_list = pickle.load(handle)

In [None]:
my_dict = get_images_party_pictures(party_pictures_details_list)

In [None]:
import re
def get_network_for_party_pictures(partypictures_list_of_lists):
    my_network = {}
    #all_people_list = []
    for partypicture_list in partypictures_list_of_lists:
        for partypicture in partypicture_list:
            finallist = []
            excludes = ['Click','Dinner','Violet Ball', 'Steering Committee','Girl Scouts', 'Future Woman',
                        'Chicago Botanic Garden Summer Gala','Sending','Birthday Party',
                        'Hudson Terrace','The Plaza Hotel Grand Ballroom','The','Media Award',
                        'Interm President','Lobel Modern', 'Century Gallery','Guests','Fountain',
                        'Los Angeles','East Hampton Village','Dinner','Villa Pisani','Glow Zone',
                        'Marlborough','Hospital','Museum','Foundation','Clockwise','Trustee',
                        'Special Surgery','Honoree','Chicago','M.D','Jr','Yugoslavia','New York',
                        'The Society','Guest','Trustees','Members', "York", 'L','Directors', 'Universe',
                        'Dancing', 'Arriving', 'African Mammals','Cocktails','The Metropolitan Club',
                        'Dressed','Keynote','MD','PhD','PsyB','The','The Society','Memorial Sloan Kettering',
                        'The Four Seasons Restaurant','Committee','Jay Heritage Center','CEO NCR','Ceremonies',
                        'NY State Assemblyman','Children Award Executive Director','Governors Award',
                        'Children Award Board','Directors','The Soul Rebels','Island Weiss','YAGP',
                        'American Ballet Theatre','Brooklyn Museum','Ocean Life','Natural History',
                        'Mayor','Governor','Click','Contents','Miss','Dr.','Medical',
                        'Langone','Countess','Photographs','Distinction', 'Ambassador',
                        'Councilwoman','Annual', 'Designer', 'Showhouse',
                        'Cocktails', 'Esplanade', 'Publicists','Benefit', 'Mstr', 'Sgt',
                        'Presented','Choreographer','Actor','Congresswoman','County',
                        'Legislator','Honorary','Gala','Chairman','Co-Chairmen','Board','Member',
                        'Executive','Director','Chair','Co-Chairs','Gallery','Artist','Event',
                        'President','Steering', 'Committee','Duchess','Jr','Vice','Frick',
                        'New', 'Art', 'Co', 'Front','(Front)', 'Back','(Back)','Center','Former', 'Author','Deputy',
                        'Curator', 'Honorees', 'Chief', 'Serbia',
                        'Museum Gala Chairs','Museum Gala Chair','A','The MAD',"Young Attending Award","CEO","M","Mr",
                        'American Ballet','School','Miss USA','Miss Teen USA',"H"]

            if (len(partypicture.text) >= 250):
                continue
            text = partypicture.text.strip()
            #print(text)
            
            #Remove capitalized titles and roles that are not part of a name
            text = re.sub(r'(\(Trey\))|Brooklyn DA|(\(Museum\s+Trustee\))|(Museum\s+Gala\s+Chair)|(Honorable\s+)|(Trustees\s+)|(Trustee\s+)|(Honorees\s+)|(Honoree\s+)|(JHC\s+President\s+)|(Museum\s+Trustee\s+)|(Former\s+MA\s+)|(Former\s+DA\s+)|(Museum\s+President\s+)|(Co-Chair\s+)','',text)
            
            #Remove extra spaces and replace with one space
            text = re.sub(r'\s{2,}',' ',text)
            
            #print(text)
            
            # Treat captions like "Mr. and Mrs. Will Smith" as just "Will Smith"
            searchresult = re.search(r'Mr\.\s+and\sMrs\.((\s)?([A-Za-z]\w*)(\s[A-Za-z]\w*)?)',text.strip())
            if searchresult:
                text = searchresult.group(1).strip()
            
            # Deal with sentences like  Will and Jada Smith as Will Smith and Jada Smith at start of sentence
            replacesearchresult = re.search(r'^[A-Z]{1}\w*\s+and\s+[A-Z]{1}\w*\s+([A-Z]{1}\w*(\s+[A-Z]{1}\w*)?)(,\s+Jr.{0,1})?',text)
            if replacesearchresult:
                text = re.sub(r' and', " "+replacesearchresult.group(1)+" and", text)
            
            #Split the sentence
            wordlist = re.split(r': |,\sand |, |\sand |\swith |\sof |\sat',text)
            
            
            #Finding the names
            for newwords in wordlist:
                searchresult = re.search(r"^(([A-Z]{1}\w*\.\s+)?([A-Z]{1}\w*)(\s+du)?(\s+van)?(\s+von)?(\s+del)?(\s+van\s+der)?(\s+de\s+la)?(\s+de)?(\s+[A-Z]([a-z])?\.)?(\s+[A-Z]{1}\w*|(-\w*))?('[A-Z]{1}\w*)?(\s+[A-Z]{1}\w*)?(,\s+Ph.D)?)|([A-Z]{1}\w*\.\s+)?([A-Z]{1}\w*)(\s+du)?(\s+van)?(\s+von)?(\s+del)?(\s+van\s+der)?(\s+de\s+la)?(\s+de)?(\s+[A-Z]([a-z])?\.)?(\s+[A-Z]{1}\w*|(-\w*))?('[A-Z]{1}\w*)?(\s+[A-Z]{1}\w*)?(,\s+Ph.D)?\Z",newwords.strip())
                if searchresult:
                    if (searchresult.group().strip() in excludes):
                         continue
                    else:
                        finallist.append(searchresult.group().strip())
            #print(finallist)            
            #Dict with person --> friends (one entry per shared photo)
            if (len(finallist) > 1):
                for name in finallist:
                    templist = list(finallist)
                    templist.remove(name)
                    my_network.setdefault(name, []).extend(templist)
    return my_network

In [None]:
my_network_dict = get_network_for_party_pictures(list(my_dict.values()))

In [None]:
import networkx as nx
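G = nx.Graph()
for node, friends in my_network_dict.items():
    for each_node in friends:
        # Every shared photo is stored under both people, so count it only
        # once (when node < each_node); this also skips self-loops.
        if node >= each_node:
            continue
        edge_dict = G.get_edge_data(node,each_node)
        if (edge_dict is None):
            G.add_edge(node,each_node,weight=1)
        else:
            G.add_edge(node,each_node,weight=(edge_dict["weight"]+1))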

In [None]:
#Degree, Degree Weighted and Page Rank
import pandas as pd

# dict(...) works whether G.degree returns a plain dict or a degree view
my_df_full = pd.DataFrame({
    "Degree": dict(G.degree()),
    "DegreeWeighted": dict(G.degree(weight="weight")),
    "Pagerank": nx.pagerank(G, alpha=0.85),
})
my_df_full.index.name = "Name"
my_df_full = my_df_full.sort_values(by="Degree", ascending=False)

my_df_full

In [None]:
def my_friends_tuple(top_n=100):
    # G.edges yields each undirected edge once, so no de-duplication is needed
    my_tuple_list = [((person1, person2), attrs["weight"])
                     for person1, person2, attrs in G.edges(data=True)]
    my_tuple_list.sort(key=lambda tup: tup[1],reverse=True)
    return my_tuple_list[:top_n]

my_friends_tuple()