In [1]:
import seaborn as sns
sns.set()

In [2]:
from static_grader import grader

# The New York Social Graph


[New York Social Diary](https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/) provides a
fascinating lens onto New York's socially well-to-do.  The data forms a natural [social graph](https://en.wikipedia.org/wiki/Social_graph) for New York's social elite.  Take a look at this page of a recent [run-of-the-mill holiday party](https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2014/holiday-dinners-and-doers).

Besides the brand-name celebrities, you will notice the photos have carefully annotated captions labeling those that appear in the photos.  We can think of this as implicitly implying a social graph: there is a connection between two individuals if they appear in a picture together.

For this project, we will assemble the social graph from photo captions for parties _dated December 1, 2014 and before_.  Using this graph, we can make guesses at the most popular socialites, the most influential people, and the most tightly coupled pairs.

These pages are hosted on the Internet Archive, which can be quite slow and unreliable. To get around this, we have created an API that provides the captions. This API lives at `https://party-captions.tditrain.com`. The [documentation](https://party-captions.tditrain.com) describes how this API works in detail. At a high level, it's divided into two parts
- An endpoint that provides a list of parties, `/parties`
- An endpoint that provides the captions for a given party, `/captions`

Both take parameters that allow us to select what we're looking for.

To get the social graph that we want, we'll attack the problem in several steps:
1. Get a list of the parties we want
1. Parse the names from each caption for one party
1. Parse the names for the rest of the parties
1. Assemble the graph

## Getting the parties


The `/parties` endpoint provides us with a list of party names and dates (the date the party occurred). It can only provide up to 100 at a time, and there are over 1000 parties in the data set. By using the `limit` and `offset` parameters, as described in the documentation, get a list of all of the parties and their dates.

As we did in class, we recommend using [`requests`](http://docs.python-requests.org/en/master/) to hit the endpoint. The checkpoints are expecting a list where each element corresponds to one party. How you want to represent this party (as a tuple, a dictionary, or something else) is up to you.

The API will return a JSON object containing a list of party names and dates, and some metadata.  Here is an example, only returning the first ten parties for our convenience.

We want the information in the `parties` element. You will need to call the API multiple times to get all the parties.

In [3]:
import requests

offset_val=[*range(0, 1200, 100)]
listy=[]

for number in offset_val:
    print(f"https://party-captions.tditrain.com/parties?limit=100&offset={number}")
    urls= f"https://party-captions.tditrain.com/parties?limit=100&offset={number}"
    newlisty=requests.get(urls).json()
    newlisty=newlisty.get("parties")
    listy.append(newlisty)

https://party-captions.tditrain.com/parties?limit=100&offset=0
https://party-captions.tditrain.com/parties?limit=100&offset=100
https://party-captions.tditrain.com/parties?limit=100&offset=200
https://party-captions.tditrain.com/parties?limit=100&offset=300
https://party-captions.tditrain.com/parties?limit=100&offset=400
https://party-captions.tditrain.com/parties?limit=100&offset=500
https://party-captions.tditrain.com/parties?limit=100&offset=600
https://party-captions.tditrain.com/parties?limit=100&offset=700
https://party-captions.tditrain.com/parties?limit=100&offset=800
https://party-captions.tditrain.com/parties?limit=100&offset=900
https://party-captions.tditrain.com/parties?limit=100&offset=1000
https://party-captions.tditrain.com/parties?limit=100&offset=1100


In [4]:
from datetime import datetime
from datetime import date
import time

date_string = '1 December, 2014'
obj = datetime.strptime(date_string, "%d %B,  %Y")
obj

####for loop

party_list = []
for i in range(len(listy)):
    dataList = listy[i]
    
    for j in range(len(dataList)):
        date_test = dataList[j]['date']
        date_test

        obj2 = datetime.strptime(date_test, "%Y-%m-%d")
        

        if obj >= obj2:
            #save to partylist then append
            party_list.append(listy[i][j])

In [5]:
#party_list
import re
party_list777 = ', '.join(d['name'] for d in party_list)
party_list777 = re.split(", ",party_list777) 

Now that we have our list of parties, we'll need to remove those that occurred after December 1st, 2014 (we keep the ones that occurred _on_ or before that date). The API provided us with the dates, as strings. One option would be to use `datetime`'s `strptime` method and the [format codes for dates](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) to parse this into dates for comparison.

In [6]:
party_list = party_list

In [7]:
# If you have successfully gotten all of the parties, there should be 1145 of them
# Double check that you are not skipping or duplicating parties
# if you are, look at how you are incrementing your offset
# Have you filtered by the date?
grader.check(len(party_list) == 1145)

True

To avoid having to get the party list from the API again if we restart the notebook, we should save this list to a file. There are many ways to do it, here's how with `dill`.

In [None]:
import dill

with open('nysd-parties.pkd', 'wb') as f:
    dill.dump(party_list, f)

And to read it back later, we just `load` it.

In [22]:
import dill
with open('nysd-parties.pkd', 'rb') as f:
    party_list = dill.load(f)

## Question 1: histogram


Get the number of party pages for each of the 95 months (that is, month-year pair) in the data.  Represent this histogram as a list of 95 tuples, each of the form `("Dec-2014", 1)`.  Note that you can convert `datetime` objects into these sort of strings with `strftime` and the [format codes for dates](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) from before.

The grader is expecting the list of tuples. 

Plot the histogram for yourself.  Do you see any trends?

In [9]:
from collections import Counter

In [10]:
party_dates = [datetime.strftime(datetime.strptime(p['date'], "%Y-%m-%d"), "%b-%Y")  for p in party_list]

In [11]:
#turn counter object into items/list
party_dates = list(Counter(party_dates).items())

In [12]:
len(party_dates)

95

In [13]:
histogram = party_dates

In [None]:
grader.score('graph__histogram', histogram)

## Parsing captions


We now have all of the parties.  For each party, we'll need to get the captions, then find who appears in each caption. Let's start with a single party, [the benefit cocktails and dinner](https://web.archive.org/web/20150913224145/http://www.newyorksocialdiary.com/party-pictures/2015/celebrating-the-neighborhood) for [Lenox Hill Neighborhood House](http://www.lenoxhill.org/), a neighborhood organization for the East Side. In our API, this corresponds to the party named `2015-celebrating-the-neighborhood`.  Let's get the captions for it.

In our API, the `/captions` endpoint takes a parameter `party`, which we give the name of the party we want. We can then extract the captions from the JSON it returns.

In [10]:
response = requests.get('https://party-captions.tditrain.com/captions?party=2015-celebrating-the-neighborhood')
captions = response.json()['captions']

# We'll need to do this for all of our parties later, so we should make it a function we can call. Take the party name as an argument and return the list of captions. 

We want to avoid having to hit the API repeatedly the next time we need to run the notebook.  While you could save the files by hand, as we did before, a checkpoint library like [ediblepickle](https://pypi.python.org/pypi/ediblepickle/1.1.3) can handle this for you.  (Note, though, that you may not want to enable this until you are sure that your function is working.)

You should also keep in mind that HTTP requests fail occasionally, for transient reasons.  You should plan how to detect and react to these failures.   The [retrying module](https://pypi.python.org/pypi/retrying) is one way to deal with this.

In [8]:
def get_captions(party_name):
    return requests.get('https://party-captions.tditrain.com/captions?party=' + party_name).json()['captions']

If things have gone according to plan, this should get the same captions as before.

In [11]:
# This cell is expecting get_captions to return a list of the captions themselves
# Other routes to a solution might need to adjust this cell a bit
grader.check(get_captions('2015-celebrating-the-neighborhood') == captions)

True

In [55]:
testlist = get_captions('2015-celebrating-the-neighborhood')
testlist

["Glenn Adamson, Simon Doonan, Victoire de Castellane, Craig Leavitt, Jerome Chazen, Andi Potamkin, Ralph Pucci, Kirsten Bailey, Edwin Hathaway, and Dennis Freedman at the Museum of Art and Design's annual MAD BALL. ",
 ' Randy Takian ',
 ' Kamie Lightburn and Christopher Spitzmiller ',
 ' Christopher Spitzmiller and Diana Quasha ',
 ' Mariam Azarm, Sana Sabbagh, and Lynette Dallas ',
 ' Christopher Spitzmiller, Sydney Shuman, and Matthew Bees',
 ' Christopher Spitzmiller and Tom Edelman ',
 ' Warren Scharf and Sydney Shuman ',
 ' Amory McAndrew and Sean McAndrew ',
 ' Sydney Shuman, Mario Buatta, and Helene Tilney ',
 ' Katherine DeConti and Elijah Duckworth-Schachter ',
 ' John Rosselli and Elizabeth Swartz ',
 ' Stephen Simcock, Lee Strock, and Thomas Hammer ',
 ' Searcy Dryden, Lesley Dryden, Richard Lightburn, and Michel Witmer ',
 ' Jennifer Cacioppo and Kevin Michael Barba ',
 ' Virginia Wilbanks and Lacary Sharpe ',
 ' Valentin Hernandez, Yaz Hernandez, Chele Farley, and James 

In [80]:
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag

def extract_names(party_name):
    names = []
    capshun_list = get_captions(party_name)
    for caption in capshun_list:
        words = word_tokenize(caption)
        tagged_words = pos_tag(words)
        for i in range(len(tagged_words)):
            if tagged_words[i][1] == 'NNP':  # NNP stands for proper noun
                name = tagged_words[i][0]
                j = i + 1
                while j < len(tagged_words) and tagged_words[j][1] == 'NNP':
                    name += ' ' + tagged_words[j][0]
                    j += 1
                if re.match(r'^[A-Z][a-z]*\s[A-Z][a-z]*$', name):  # match full names only
                    names.append(name)
    return names
jj = extract_names('2015-celebrating-the-neighborhood')

In [99]:
#Alternative route for name parsing using nltk outputting a tuple with assocciating values for each caption
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag

def extract_names_as_tuple(party_name):
    names = []
    capshun_list = get_captions(party_name)
    for index, caption in enumerate(captions):
        words = word_tokenize(caption)
        tagged_words = pos_tag(words)
        for i in range(len(tagged_words)):
            if tagged_words[i][1] == 'NNP':  # NNP stands for proper noun
                name = tagged_words[i][0]
                j = i + 1
                while j < len(tagged_words) and tagged_words[j][1] == 'NNP':
                    name += ' ' + tagged_words[j][0]
                    j += 1
                if re.match(r'^[A-Z][a-z]*\s[A-Z][a-z]*$', name):  # match full names only
                    names.append((index, name))
    namesonly_ = [element for sublist in names for element in sublist]
    return names,namesonly_
#jj = extract_names_as_tuple(testlist)

In [65]:
#taking in a list of tuples, joining the names found in each caption to return a list of lists
def join_names(input_names):
    name_dict = {}
    for index, name in input_names:
        if index in name_dict:
            name_dict[index].append(name)
        else:
            name_dict[index] = [name]
    output_names = list(name_dict.values())
    return output_names

In [127]:
#combined functions
def extract_and_join_names(party_name):
    names = []
    capshun_list = get_captions(party_name)
    for index, caption in enumerate(capshun_list):
        words = word_tokenize(caption)
        tagged_words = pos_tag(words)
        for i in range(len(tagged_words)):
            if tagged_words[i][1] == 'NNP':  # NNP stands for proper noun
                name = tagged_words[i][0]
                j = i + 1
                while j < len(tagged_words) and tagged_words[j][1] == 'NNP':
                    name += ' ' + tagged_words[j][0]
                    j += 1
                if re.match(r'^[A-Z][a-z]*\s[A-Z][a-z]*$', name):  # match full names only
                    names.append((index, name))
    
    name_dict = {}
    for index, name in names:
        if index in name_dict:
            name_dict[index].append(name)
        else:
            name_dict[index] = [name]
    
    output_names = list(name_dict.values())
   
    return output_names

In [128]:
extract_and_join_names('2014-celebrating-the-treasures')

[['Fujiko Nakaya', 'Glass House'],
 ['Mary Lattimore'],
 ['Charles Renfro', 'Tony Vidler', 'James Welling', 'Daniel Gortler'],
 ['Margaret Russell', 'Henry Urbach', 'Margareth Henriquez'],
 ['Carlos Souza'],
 ['Bonnie Morrison', 'Douglas Friedman'],
 ['Painting Gallery'],
 ['Martha Stewart', 'Charles Renfro'],
 ['Margaret Russell',
  'Martha Stewart',
  'Magrino Dunning',
  'Margareth Henriquez',
  'Reed Krakoff',
  'John Yunis',
  'John Calcagno'],
 ['Lee Mindel',
  'Margaret Russell',
  'Tomas Maier',
  'Martha Stewart',
  'Magrino Dunning'],
 ['Bill Katz',
  'Bonnie Morrison',
  'Douglas Friedman',
  'Henry Urbach',
  'Laurie Beckelman',
  'James Sanders',
  'Tony Vidler',
  'Charles Renfro'],
 ['Rich Kosann', 'Reed Krakoff'],
 ['Hendel Teicher'],
 ['Henry Urbach', 'Charles Renfro'],
 ['Lori Tritsch', 'William Lauder'],
 ['Margaret Russell'],
 ['Martha Stewart', 'Laura Pla', 'Tomas Maier', 'Andrew Preston'],
 ['Margaret Russell', 'Carlos Souza'],
 ['Honoree Guillaume', 'Elizabeth St

Now that we have some sample captions, let's start parsing names out of those captions.  There are many ways of going about this, and we leave the details up to you.  Some issues to consider:

  1. Some captions are not useful: they contain long narrative texts that explain the event.  Try to find some heuristic rules to separate captions that are a list of names from those that are not.  A few heuristics include:
    - Look for sentences (which have verbs) and as opposed to lists of nouns. For example, [`nltk` does part of speech tagging](http://www.nltk.org/book/ch05.html) but it is a little slow. There may also be heuristics that accomplish the same thing.
    - Similarly, spaCy's [entity recognition](https://spacy.io/docs/usage/entity-recognition) could be useful here, but like `nltk` using `spaCy` will add to processing time.
    - Look for commonly repeated threads (e.g. you might end up picking up the photo credits or people such as "a friend").
    - Long captions are often not lists of people.  The cutoff is subjective, but for grading purposes **we set that cutoff at 250 characters**.
  1. Many of the captions contain extraneous whitespace or other formatting issues you may need to deal with.
  1. You will want to separate the captions based on various forms of punctuation.  Try using `re.split`, which is more sophisticated than `string.split`. **Note**: The reference solution uses regex exclusively for name parsing.
  1. You might find a person named "ra Lebenthal".  There is no one by this name.  Any idea what might cause that?
  1. This site is pretty formal and likes to say things like "Mayor Michael Bloomberg" after his election but "Michael Bloomberg" before his election.  Can you find other ('optional') titles that are being used?  They should probably be filtered out because they ultimately refer to the same person: "Michael Bloomberg."
  1. There is a special case you might find where couples are written as e.g. "John and Mary Smith". You will need to write some extra logic to make sure this properly parses to two names: "John Smith" and "Mary Smith".
  1. When parsing names from captions, it can help to look at your output frequently and address the problems that you see coming up, iterating until you have a list that looks reasonable. This is the approach used in the reference solution. Because we can only asymptotically approach perfect identification and entity matching, we have to stop somewhere.
  1. Your eye is very good at doing this sort of parsing.  You will find it helpful to look at a caption and the names you parse of out it. Do this for a selection of captions to detect potential issues.
  1. You want to keep the names in a caption together - that's how we can tell they're connected to each other! You should get one list of names for each caption.
  
**Questions worth considering:**
  1. Who is Patrick McMullan and should he be included in the results? How would you address this?
  2. What else could you do to improve the quality of the graph's information?

In [None]:
# You will want to make a function that takes in a caption and returns a list of names

In [14]:
#library defs

import requests
import re
#import spacy
#from spacy.tokens import Doc

In [31]:
def titlremover(capshun):
    # Define the titles to be removed
    titles = ['Sir', 'Count', 'Countess', 
              'CEO', 'Lord', 'Dutchess', 
              'Consul', 'Counsel', 'Lady', 
              'General', 'Major', 'Senator', 
              'Mayor', 'Officer', 'Chief', 
              'President', 'Executive', 'Princess', 
              'Trustee', 'honoree', 'Curator', 'Honorees', 'Honoree']

    # Split the caption into words
    words = capshun.split()

    # Create a new list to store the words without titles
    filtered_words = []

    # Iterate through the words and add them to the new list if they are not in titles
    for word in words:
        if word not in titles:
            filtered_words.append(word)

    # Join the filtered words back into a string with spaces
    capshun_without_titles = ' '.join(filtered_words)
    return capshun_without_titles

In [38]:
#function for removing titles

def titlremover(capshun):
    

#remove titles
    titles = ['Sir','Count','Countess', 
              'CEO', 'Lord', 'Dutchess', 
              'Consul','Counsel', 'Lady', 
              'General', 'Major', 'Senator', 
              'Mayor', 'Officer', 'Chief', 
              'President', 'Executive', 'Princess', 
              'Trustee', 'honoree', 'Curator','Honorees','Honoree','Co-chairs','Baroness']

    title_del = capshun.split(' ')
    for t in titles:
        try:
            #trim white space to left of title
            title_index = title_del.index(t)
            del title_del[0:title_index + 1]
            capshun = title_del
            capshun = ' '.join(capshun)
        #dummy line to say move on past no title found error
        except ValueError as e:
            ty=0
    return capshun

In [16]:
#trimming white space

def whitetrimmer(capshun):

    #remove parentheses and quotes
    capshun = re.sub(r'\([^())]*\)', '', capshun)
    capshun = re.sub(r'\"[^"]*\"', '', capshun)
    
    #trim new line characters
    capshun = re.sub('\n','',capshun)

    #extra space deletion
    capshun = re.sub(' +',' ',capshun)
    capshun = capshun.rstrip(' ')
    capshun = capshun.lstrip(' ')
    
    #remove apostrophes
    capshun = re.sub('\'s','',capshun)
    
    
    return capshun

In [17]:
#Separating by spouse

def spouse_sep(capshun):
    #split by comma
    capshun = re.split(r"\,\s| with ",capshun)
    
    #matrix to be filled in case of list containing a couple
    flatlist=[]
    
    #capshun = re.split(' with ',str(capshun))
    #sub incidental and at beginning of name
    for i in range(len(capshun)): 
        capshun[i] = re.sub('^and ','',capshun[i])
        #regex split by and
        splitted = re.split(r" and\sand | and ",capshun[i])
        #for cases left with one name (prob just last name)
        if len(splitted) == 1:
            nu_split = re.split(' ',str(splitted))
            
            #this is for one name [no last]
            if len(nu_split) == 1:
                return None
            #this is for one full name
            else:
                capshun[i] = splitted
        
        else:
            #checking if married or unmarried couple
            if len(splitted[0].split(' ')) == 1:
                #married couple, so split by space on second name
                marsplit = splitted[1].split(' ')
                
                if len(marsplit) == 1:
                    del splitted
                
                else:

                    splitted[0] = splitted[0]+ ' ' + marsplit[1]

                    capshun[i] = splitted

            else:
                capshun[i] = splitted
        
    capshun=[element for sublist in capshun for element in sublist]
        
    return capshun

In [18]:
#remove mr/mrs/dr

def prefixrem(capshun):
    capshun = re.sub('[DM]rs?\.?\s','',capshun)
    
    return capshun

In [19]:
#limit caption to 250 characters

def charlimit(capshun):
    #limit caption character count to 250
    if len(capshun) > 200:
        return None
    #if its less than 200 characters return the caption
    else:
        return capshun

In [20]:
#remove articles such as 'a, an, the...'

def artrem(capshun):
    #if a capshun starts with any of the articles, skip it
    #first split by space then check for articles in [0]
    splitted = re.split(' ',capshun)
    
    if splitted[0] == 'A' or splitted[0] == 'The' or splitted[0] == 'An':
        return None
    else:
        return capshun

In [21]:
def unfriend(capshun):
    capshun = re.sub('(\san?d?)\s?a?\s?(friends?)','',capshun)
    
    return capshun

In [17]:
#using spacy to filter non name captions
import spacy
from spacy.tokens import Doc
def spacy_fun(capshun):
    nlp = spacy.load("en_core_web_sm")

    doc = nlp(capshun)
    sent_mat = []
    for token in doc:
        sent_mat.append(token.pos_)

    try:
        if sent_mat[0] == 'ADJ':
            return None

        else:
            name_check = re.split(' ',capshun)
            if len(name_check) == 1:
                del capshun
            else:
                x = re.search('^[a-z]',name_check[1])
                y = x is None

                if y == True:
                    return capshun
                else:
                    return None
    except IndexError as e:
        #print('bad capshun')
        return None

In [32]:
#Final functions
def get_names(party_name):
    final_names = []
    #for i in range(len(party_list)):
    capshun_list = get_captions(party_name)
    for i in range(len(capshun_list)):
        capshun = capshun_list[i]
        capshun = whitetrimmer(capshun)
        x = charlimit(capshun)
        y = x is None
        if y != True:
            x = artrem(capshun)
            y = x is None
            if y != True:  
                capshun = prefixrem(capshun)
                capshun = titlremover(capshun)
                capshun = unfriend(capshun)
                x = spouse_sep(capshun)
                y = x is None
                if y != True:
                    final_names.append(x)
                else:
                    ty=0
            else:
                ty=0
        else:
            ty=0
    #make a flattened list of final names
    namesonly = [element for sublist in final_names for element in sublist]
    return final_names, namesonly

In [91]:
def get_captions(party_name):
    return requests.get('https://party-captions.tditrain.com/captions?party=' + party_name).json()['captions']

final_names, namesonly = get_names('2015-celebrating-the-neighborhood')
#final_names
len(namesonly)
namesonly

['Randy Takian',
 'Kamie Lightburn',
 'Christopher Spitzmiller',
 'Christopher Spitzmiller',
 'Diana Quasha',
 'Mariam Azarm',
 'Sana Sabbagh',
 'Lynette Dallas',
 'Christopher Spitzmiller',
 'Sydney Shuman',
 'Matthew Bees',
 'Christopher Spitzmiller',
 'Tom Edelman',
 'Warren Scharf',
 'Sydney Shuman',
 'Amory McAndrew',
 'Sean McAndrew',
 'Sydney Shuman',
 'Mario Buatta',
 'Helene Tilney',
 'Katherine DeConti',
 'Elijah Duckworth-Schachter',
 'John Rosselli',
 'Elizabeth Swartz',
 'Stephen Simcock',
 'Lee Strock',
 'Thomas Hammer',
 'Searcy Dryden',
 'Lesley Dryden',
 'Richard Lightburn',
 'Michel Witmer',
 'Jennifer Cacioppo',
 'Kevin Michael Barba',
 'Virginia Wilbanks',
 'Lacary Sharpe',
 'Valentin Hernandez',
 'Yaz Hernandez',
 'Chele Farley',
 'James Farley',
 'Harry Heissmann',
 'Angela Clofine',
 'Michael Clofine',
 'Jared Goss',
 'Kristina Stewart Ward',
 'Alex Papachristidis',
 'Mario Buatta',
 'Nick Olsen',
 'Lindsey Coral Harper',
 'Alberto Villalobos',
 'David Duncan',
 

## Question 2: sample_names


Once you feel that your algorithm is working well, parse all of the captions we got from the `2015-celebrating-the-neighborhood` party and extract all the names mentioned.  Sort them alphabetically, by first name, and return the first hundred unique names.

In [75]:
sample_names = np.unique(namesonly)[:100]

grader.score('graph__sample_names', sample_names)

Your score: 1.0000


Now, test your tools on a few other parties.  You will probably find that other parties have new issues in their captions that trip up your caption parser.  But don't worry if the parser isn't perfect - just try to get the easy cases for now. You may need to come back and refine it more for the later questions, however.

## Parsing all the parties


Once you are satisfied that your parser is working, we want to run it for all of our parties. First, get the captions for all of the parties in our party list. If you haven't implemented some caching of the captions, you probably want to do this first.

In [None]:
# It may take several minutes to fetch all the captions

In [None]:
#Getting captions for each party in party_list
lister = []
#returns just captions for all parties in party list
for i in range(len(party_list)):
        x = get_captions(party_list[i]['name'])
        lister.append(x)
        
#flatten list of captions for each party
flat_list = [item for sublist in lister for item in sublist]
len(flat_list)

#remove extraneous new line entries
new_list = [x for x in flat_list if re.search(r'\w', x)]
len(new_list)

In [19]:
import dill

#with open('nysd-partycaptions.pkd', 'wb') as f:
    #dill.dump(new_list, f)
    
with open('nysd-partycaptions.pkd', 'rb') as f:
    new_list = dill.load(f)

In [23]:
#get a list of just party names
party_list777 = ', '.join(d['name'] for d in party_list)
party_list777 = re.split(", ",party_list777)

'2014-gala-guests'

In [130]:
#edit Final functions to take in new_list
final_names1 = []

for i in range(len(party_list777)-1):
    i=i+1
    #print(party_list777[i])
    final_names = extract_and_join_names(party_list777[i])
    final_names1.append(final_names)

namesonly_final = [element for sublist in final_names1 for element in sublist]

#save final names
import dill

with open('final_names_en.pkd', 'wb') as f:
    dill.dump(final_names1, f)

#save names only
import dill

with open('namesonly_final_en.pkd', 'wb') as f:
    dill.dump(namesonly_final, f)

In [148]:
import dill
with open('final_names2.pkd', 'rb') as f:
    final_names1 = dill.load(f)

In [2]:
with open('namesonly_final.pkd', 'rb') as f:
    namesonly_final = dill.load(f)

In [135]:
namesonly_final = [element for sublist in namesonly_final for element in sublist]

In [136]:
import numpy
#number of names
len(set(namesonly_final))

87166

In [65]:
regex_pattern = r'\d|the|&|presents|[Aa]ward|for|Dinner|[Mm]ember|guest|cocktail|board|Sponsorship|group'
filtered_list = [item for item in namesonly_final if not re.search(regex_pattern, item)]

len(set(filtered_list))

100938

In [54]:
#number of captions
sum(len(inner) for inner in modified_data)

99635

And parse the names in each caption.

In [None]:
# You should have a list of names for each caption
# Depending on how you set up your parser, this may take quite a while

You should end up with over 100,000 captions and roughly 110,000 names.

## Building the graph

For the remaining analysis, we think of the problem in terms of a
[network](http://en.wikipedia.org/wiki/Computer_network) or a
[graph](https://en.wikipedia.org/wiki/Graph_%28discrete_mathematics%29).  Any time a pair of people appear in a caption together, that is considered a link.  What we have described is more appropriately called an (undirected)
[multigraph](http://en.wikipedia.org/wiki/Multigraph) with no self-loops, but this has an obvious analog in terms of an undirected [weighted graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29#Weighted_graph).

In the remainder of this miniproject, we will analyze the social graph of the New York social elite.  We recommend using python's [`networkx`](https://networkx.github.io/) library to build this social graph.

In [5]:
import itertools  # itertools.combinations may be useful
import networkx as nx

You should find you have roughly 200,000 distinct pairs of people appearing in photos together - corresponding to how many (weighted) edges there are in our graph.

## Question 3: degree


The simplest question to ask is "who is the most popular"?  The easiest way to answer this question is to look at how many connections everyone has.  Return the top 100 people and their degree.  Remember that if an edge of the graph has weight 2, it counts for 2 in the degree.

**Checkpoint:** Some aggregate stats on the solution:
    
    count:  100.0
    mean:   202.2
    std:     94.1
    min:    132.0
    25%:    146.8
    50%:    169.5
    75%:    212.0
    max:    713.0
    
Note that these checkpoints are guidelines, you may not match them exactly.

In [150]:
#last clean up to remove single letters and other extras
modified_data = [
    [
        [i for i in nested_nested if not (len(i) == 1 or re.search(r'\d|the|&|presents|[Aa]ward|for|Dinner|[Mm]ember|guest|cocktail|board|Sponsorship|group', i))]
        for nested_nested in nested
        if any(not (len(i) == 1 or re.search(r'\d|the|&|presents|[Aa]ward|for|Dinner|[Mm]ember|guest|cocktail|board|Sponsorship|group', i)) for i in nested_nested)
    ]
    for nested in final_names1
    if any(any(not (len(i) == 1 or re.search(r'\d|the|&|presents|[Aa]ward|for|Dinner|[Mm]ember|guest|cocktail|board|Sponsorship|group', i)) for i in nested_nested) for nested_nested in nested)
]

In [152]:
#flaten list of all captions

finalnames_final11= [element for sublist in modified_data for element in sublist]

#Making network graph

import itertools  # itertools.combinations may be useful
import networkx as nx

# Create an empty graph
graph = nx.Graph()

# Loop through each list in the list of lists
for name_list in finalnames_final11:
    # Update the graph with combinations of names within each list
    combinations = itertools.combinations(name_list, 2)
    for pair in combinations:
        name1, name2 = pair
        if graph.has_edge(name1, name2):
            # If the edge already exists, increment the edge weight
            graph[name1][name2]['weight'] += 1
        else:
            # If the edge does not exist, add it with weight 1
            graph.add_edge(name1, name2, weight=1)

sorted_degrees = sorted(graph.degree, key=lambda x: x[1], reverse=True)
top_100_degrees = sorted_degrees[:100]

for key, value in top_100_degrees:
    print(f"{key}: {value}")

Jean Shafiroff: 397
Mark Gilbertson: 356
Gillian Miniter: 239
Geoffrey Bradfield: 208
Andrew Saffir: 205
Mario Buatta: 202
Somers Farkas: 189
Yaz Hernandez: 188
Alexandra Lebenthal: 186
Kamie Lightburn: 183
Alina Cho: 179
Eleanora Kennedy: 175
Lucia Hwong Gordon: 175
Debbie Bancroft: 174
Sharon Bush: 167
Muffie Potter Aston: 154
Bettina Zilkha: 150
Patrick McMullan: 149
Martha Stewart: 146
Barbara Tober: 146
Allison Aston: 146
Amy Fine Collins: 142
Jamee Gregory: 139
Grace Meigher: 129
Liliana Cavendish: 128
Deborah Norville: 127
Margo Langenberg: 125
Fernanda Kellogg: 124
Karen Klopp: 124
Leonard Lauder: 124
Dennis Basso: 123
Audrey Gruss: 122
Nicole Miller: 121
Christopher Hyland: 121
Lydia Fenet: 120
Donna Karan: 119
Jennifer Creel: 118
Elizabeth Stribling: 117
Kipton Cronkite: 117
Evelyn Lauder: 115
Liz Peek: 112
Bonnie Comley: 112
Russell Simmons: 112
Annette Rickel: 111
Anka Palitz: 109
Karen LeFrak: 109
Michael Bloomberg: 109
Adelina Wong Ettelson: 109
Fe Fendi: 108
Felicia Tayl

In [153]:
import heapq  # Heaps are efficient structures for tracking the largest
              # elements in a collection.  Use introspection to find the
              # function you need.
degree = [('Alec Baldwin', 144)] * 100

grader.score('graph__degree', top_100_degrees)

Your score: 0.9100


## Question 4: PageRank


A similar way to determine popularity is to look at their
[PageRank](http://en.wikipedia.org/wiki/PageRank).  PageRank is used for web ranking and was originally
[patented](http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=6285999) by Google and is essentially the stationary distribution of a [Markov
chain](http://en.wikipedia.org/wiki/Markov_chain) implied by the social graph. You can implement this yourself or use the version in `networkx`.

Use 0.85 as the damping parameter so that there is a 15% chance of jumping to another vertex at random.

**Checkpoint:** Some aggregate stats on the solution:


    count:  100.000000
    mean:     0.000193
    std:      0.000080
    min:      0.000130
    25%:      0.000144
    50%:      0.000169
    75%:      0.000206
    max:      0.000646


In [157]:
# Calculate PageRank with damping parameter 0.85
pagerank_scores = nx.pagerank(graph, alpha=0.85)

# Get the top 100 scores using heapq
top_100_scores = heapq.nlargest(100, pagerank_scores.items(), key=lambda item: item[1])

# Print the top 100 scores for each name
print("Top 100 PageRank Scores:")
for name, score in top_100_scores:
    print(f"{name}: {score:.4f}")

Top 100 PageRank Scores:
Jean Shafiroff: 0.0007
Mark Gilbertson: 0.0005
Gillian Miniter: 0.0005
Geoffrey Bradfield: 0.0004
Andrew Saffir: 0.0004
Alexandra Lebenthal: 0.0003
Yaz Hernandez: 0.0003
Mario Buatta: 0.0003
Somers Farkas: 0.0003
Debbie Bancroft: 0.0003
Sharon Bush: 0.0003
Kamie Lightburn: 0.0003
Eleanora Kennedy: 0.0003
Alina Cho: 0.0003
Barbara Tober: 0.0003
Lucia Hwong Gordon: 0.0003
Bonnie Comley: 0.0002
Muffie Potter Aston: 0.0002
Jamee Gregory: 0.0002
Bettina Zilkha: 0.0002
Martha Stewart: 0.0002
Elizabeth Stribling: 0.0002
Amy Fine Collins: 0.0002
Patrick McMullan: 0.0002
Allison Aston: 0.0002
Fernanda Kellogg: 0.0002
Christopher Hyland: 0.0002
Daniel Benedict: 0.0002
Liliana Cavendish: 0.0002
Russell Simmons: 0.0002
Margo Langenberg: 0.0002
Grace Meigher: 0.0002
Lydia Fenet: 0.0002
Dawne Marie Grannum: 0.0002
Stewart Lane: 0.0002
Dennis Basso: 0.0002
Donna Karan: 0.0002
Deborah Norville: 0.0002
Karen LeFrak: 0.0002
Leonard Lauder: 0.0002
Barbara Regna: 0.0002
Karen Klop

In [158]:
pagerank = [('Martha Stewart', 0.00019312108706213307)] * 100

grader.score('graph__pagerank', top_100_scores)

Your score: 0.9900


## Question 5: best_friends


Another interesting question is who tend to co-occur with each other.  Give us the 100 edges with the highest weights.

Google these people and see what their connection is.  Can we use this to detect instances of infidelity?

**Checkpoint:** Some aggregate stats on the solution:

    count:  100.0
    mean:    27.5
    std:     16.9
    min:     14.0
    25%:     17.0
    50%:     21.5
    75%:     31.3
    max:    122.0

In [161]:
# Get the top 100 edges with the highest weights
top_100_edges = sorted(graph.edges(data=True), key=lambda edge: edge[2]['weight'], reverse=True)[:100]

# Format the output as a list of ((name1, name2), weight) tuples
output = [((name1, name2), weight['weight']) for name1, name2, weight in top_100_edges]

# Print the top 100 edges with their weights
print("Top 100 Edges with Highest Weights:")
for edge, weight in output:
    print(f"{edge}: Weight - {weight}")

Top 100 Edges with Highest Weights:
('Gillian Miniter', 'Sylvester Miniter'): Weight - 119
('Bonnie Comley', 'Stewart Lane'): Weight - 74
('Jamee Gregory', 'Peter Gregory'): Weight - 74
('Daniel Benedict', 'Andrew Saffir'): Weight - 66
('Geoffrey Bradfield', 'Roric Tobin'): Weight - 64
('Barbara Tober', 'Donald Tober'): Weight - 60
('Somers Farkas', 'Jonathan Farkas'): Weight - 55
('Jean Shafiroff', 'Martin Shafiroff'): Weight - 50
('Campion Platt', 'Tatiana Platt'): Weight - 49
('Alexandra Lebenthal', 'Jay Diamond'): Weight - 46
('Yaz Hernandez', 'Valentin Hernandez'): Weight - 44
('Eleanora Kennedy', 'Michael Kennedy'): Weight - 44
('Chappy Morris', 'Melissa Morris'): Weight - 43
('Guy Robinson', 'Elizabeth Stribling'): Weight - 41
('Peter Regna', 'Barbara Regna'): Weight - 41
('Deborah Norville', 'Karl Wellner'): Weight - 41
('Grace Meigher', 'Chris Meigher'): Weight - 40
('Jonathan Tisch', 'Lizzie Tisch'): Weight - 40
('Margo Catsimatidis', 'John Catsimatidis'): Weight - 40
('Sessa

In [164]:
best_friends = [(('Michael Kennedy', 'Eleanora Kennedy'), 41)] * 100

grader.score('graph__best_friends', output)

Your score: 0.9800


*Copyright &copy; 2022 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*