# Assignment 1 group 1
The GitHub repo can be found at: https://github.com/c-wejendorp/ComSocSciAssignment1Group1
<br>
Contributions: s204090 0%, s204145 0% and s216135 0%


In [1]:
#import the necessary libraries used in this notebook
from bs4 import BeautifulSoup 
import requests
import re
import csv
import pandas as pd
import warnings
import time

#### Part 1: Using web-scraping to gather data
We are interested in identifying the researchers who attended the most important scientific conference in Computational Social Science in 2019. To do this we scrape the 2019 programmes, which can be found at the following links:<br>
- https://2019.ic2s2.org/oral-presentations/
- https://2019.ic2s2.org/posters/

We scraped the websites using the code below:



In [None]:
oralPresenters = set()
#first link
link1="https://2019.ic2s2.org/oral-presentations/"
#get the HTML content
r = requests.get(link1)
#use BeautifulSoup to parse the content. 
soup = BeautifulSoup(r.content)
#locate the HTML tag containing the author names by inspecting the website using a browser.
# We see that the names are located in the tag <p> 
paragraphs=soup.find_all("p")
#relevant paragraphs chosen by quick manual inspection
paragraphs=paragraphs[3:-7]
#we note that each room has a "chairperson" which WILL NOT be included. 

for paragraph in paragraphs:
    #print(paragraph)
    text=str(paragraph)
    #split by the "<br>"
    text=re.split("<br>|<br/>",text)
    #print(text)
    #print(re.split(r'\s*\–\s*|\.\s*',text[0]))
    #locate the names by splitting by "–" and "." and choosing idx 2    
    for line in text[2:]:        
        lineList=re.split(r'\s*\–\s*|\.\s*',line)        
        #get each individual name 
        # we check the length of lineList as sometimes there is no "." between the author names and the topic.
        # Instead there is a "–". Thus we do not include the authors where this is the case. 
        # The names are located at index 2, so lineList must have more than 2 elements.
        if len(lineList)>2:
            
            names=re.split(r',\s*',lineList[2])    
            for name in names:
                oralPresenters.add(name)

print(f"We found {len(oralPresenters)} authors when scraping the oral presentations")     
print("We had one odd entry in our set: '(Moved to 3D Text Analysis) Ivan Smirnov', which was easily fixed")
oralPresenters.add('Ivan Smirnov')
oralPresenters.remove('(Moved to 3D Text Analysis) Ivan Smirnov')

In [None]:
posterPresenters = set()
#second link
link2= "https://2019.ic2s2.org/posters/"
#get the HTML content
r = requests.get(link2)
#use BeautifulSoup to parse the content. 
soup = BeautifulSoup(r.content)
#locate the HTML tag <li> as the author names are located within these tags. 
lists=soup.find_all("li")
#We did some manual inspection and chose the relevant <li> tags
relevantLists=lists[32:-7]

#when we inspected the posterPresenters set, it became clear that some unwanted "names" were added to the set. An example is: "<strong>Evolution of Employment in the United States: A Longitudinal Study of Job Polarisation</strong>"
#thus we only add a name if the string doesn't contain any typical HTML notation such as "<", ">" and "/"
htmlPattern = re.compile(r'[<>/]')

for item in relevantLists:    
    text=str(item)   
    # split on "<li>", "</li>" and "<span>" to isolate the text between the first "<li>" and the first "<span>"
    text=re.split(r"\s*<li>|\s*</li>|<span>",text)
    # the string starts with "<li>", so the first element after the split is an empty string and the author names are located at index 1.    
    namesString=text[1]
    #split the names string into individual names. They are separated by "," and "and". However we must be careful not to split a name such as "Alessandro Cossard" into "Aless" and "ro Cossard"
    # therefore we require the char before "and" to be a space. This is achieved by: (?<=\s)    
    names=re.split(r"\s*,\s*|(?<=\s)and\s*",namesString) 
    
    for name in names:
        if htmlPattern.search(name):
            #uncomment below if you want to see the "names" that contain unwanted characters             
            #print(name)
            continue
        posterPresenters.add(name)            

print(f"We found {len(posterPresenters)} authors when scraping the poster presentations")          
    
    

In [None]:
#unique authors:
uniqueAuthors=oralPresenters|posterPresenters
print(f"We found {len(uniqueAuthors)} unique researchers in 2019") 

#save the names into a csv. The with-statement makes sure the file is closed again.
with open('data/initialNames.csv',"w",newline="") as f:
    writer = csv.writer(f,lineterminator = '\n')
    for name in uniqueAuthors:    
        writer.writerow([name])

During the web scraping we were very careful when extracting names from lines of text.
As an example, we could not carelessly split a string such as "Jacob SomethingCool, Alessandro Cossard and Lars AnotherName"
by the pattern "," or "and", as this would lead to the names
["Jacob SomethingCool", "Aless", "ro Cossard", "Lars AnotherName"], and we would thus overestimate the number of authors. In the case of the 2019 programmes it was fairly easy to perform some manual checks to estimate whether the number of researchers found seemed plausible. 
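The difference between the careless and the careful split can be illustrated with a minimal sketch. It uses the made-up example string from above, and the careful pattern is the one from the poster scraping. Note that the lookbehind leaves the space before "and" attached to the preceding name, so the sketch also strips whitespace, which is an addition compared to the scraping code.

```python
import re

# made-up example string from the text above, not real scraped data
line = "Jacob SomethingCool, Alessandro Cossard and Lars AnotherName"

# careless split: every occurrence of "and" is a separator, also inside names
careless = [name.strip() for name in re.split(r"\s*,\s*|\s*and\s*", line)]

# careful split: "and" only counts as a separator when preceded by whitespace
careful = [name.strip() for name in re.split(r"\s*,\s*|(?<=\s)and\s*", line)]

print(careless)  # four "names", one real name is cut in two
print(careful)   # three names
```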



#### Part 2: Getting data from the Semantic Scholar API
In this notebook we only use the researchers from the 2019 conference, as the notebook would otherwise take hours to run. Starting from these 914 conference attendees, we end up with 54589 unique IDs once all their collaborators have been found.
  
When searching for a given author's name, the API returns multiple hits. We have chosen to only use the first returned name and ID. We also note that when using the API to search for author names, only one name can be sent per request, as opposed to searching for authors based on IDs, where up to 100 IDs can be included in a single request. As requests sometimes fail, we also keep track of the "faulty names". If interested, see getAuthorIDs.ipynb.
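The first-hit strategy and the bookkeeping of faulty names could look roughly like the sketch below. This is not the code from getAuthorIDs.ipynb. The function names `pick_first_hit` and `collect_author_ids` are hypothetical, the search function is passed in so the example runs without network access, and the response is assumed to be a dict with a `data` list of entries holding `authorId` and `name`.

```python
def pick_first_hit(response_json):
    """Return (authorId, name) of the first hit, or None if there are no hits."""
    hits = (response_json or {}).get("data") or []
    if not hits:
        return None
    first = hits[0]
    return first.get("authorId"), first.get("name")


def collect_author_ids(names, search):
    """Look up one name per request and keep track of the names that fail.

    `search` is a hypothetical callable that takes a name and returns the
    parsed JSON response, or raises an exception if the request fails.
    """
    found, faulty = {}, []
    for name in names:
        try:
            hit = pick_first_hit(search(name))
        except Exception:
            hit = None
        if hit is None or hit[0] is None:
            faulty.append(name)
        else:
            found[hit[0]] = hit[1]
    return found, faulty


# stand-in for the real API call, using made-up IDs
def fake_search(name):
    if name == "Broken Request":
        raise RuntimeError("simulated request failure")
    if name == "Nobody Known":
        return {"total": 0, "data": []}
    return {"total": 2, "data": [{"authorId": "1", "name": name},
                                 {"authorId": "2", "name": name}]}


found, faulty = collect_author_ids(
    ["Alessandro Cossard", "Nobody Known", "Broken Request"], fake_search)
print(found)
print(faulty)
```

The resulting dict maps ID to name, which is the same layout as the `allAuthorIDs` dict loaded from the csv.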

For now we will focus on the task of creating the three dataframes: "Author dataset", "Paper dataset" and "Paper abstract dataset". As we are going to make quite a lot of requests, we continuously save smaller dataframes as pickles in the folder "syltedeAgurker". We later merge these and create the aforementioned dataframes. The starting point will b

In [2]:
#Let's load the IDs of all authors from the csv. 
allAuthorIDs={}
with open('data/allAuthorIDs.csv', 'r') as f:
    for line in csv.reader(f):
        #avoid the name "id", as it shadows the built-in function
        authorId = line[0]
        name = line[1]
        allAuthorIDs[authorId]=name 

In [3]:
#we split into appropriate batch sizes and request sizes. 
# we can make 100 requests per 5 min and each request can contain up to 100 IDs
numRequests = 100
batchSize = 100
sleepTIme= 60*5

authorList=list(allAuthorIDs.keys())


authorBatches = [authorList[i:i + batchSize] for i in range(0, len(authorList), batchSize)]
requestList = [authorBatches[i:i + numRequests] for i in range(0, len(authorBatches),numRequests)]

BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/batch"
my_url = BASE_URL + VERSION + RESOURCE
params = {
            "fields":"authorId,name,aliases,citationCount,papers,papers.externalIds,papers.title,papers.abstract,papers.year,papers.citationCount,papers.s2FieldsOfStudy,papers.authors"}

#This is a pandas df that contains all the relevant info; it will later be split up into the three required dataframes. 
authorDfColumns=["authorId","name","aliases","citationCount","papersId","papersExternalId","papersTitle","papersAbstract","papersYear","papersCitationCount","papers_s2FieldsOfStudy","papersAuthors"]
authorDF=pd.DataFrame(columns=authorDfColumns)

# sometimes we get an internal server error, so we save these batches for later inspection
theBadBunch = []
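As a sanity check of the batching above, the number of batches and request groups can be computed directly from the number of IDs. The sketch below assumes that all 54589 IDs mentioned earlier are present in the csv and uses placeholder integers instead of real author IDs. The helper `chunk` is hypothetical and simply wraps the list comprehension used above.

```python
def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


batchSize = 100      # IDs per request
numRequests = 100    # requests per 5 minute window
ids = list(range(54589))  # placeholders for the author IDs

batches = chunk(ids, batchSize)
groups = chunk(batches, numRequests)

# number of batches, size of the last batch, number of request groups
print(len(batches), len(batches[-1]), len(groups))
# total time spent sleeping in minutes, as the loop sleeps after every group
print(len(groups) * 5)
```

This gives 546 batches, of which the last holds only 89 IDs, and 6 request groups, which agrees with the "out of 6" in the output of the request loop.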




In [9]:
#we collect one row per author and build the df once per batch, as DataFrame.append is deprecated (and removed in pandas 2.0)
for num,request in enumerate(requestList):
    print(f"I am a request number {num+1} out of {len(requestList)}")
    for batchnum,batch in enumerate(request):        
        json_data={"ids": batch}
        r=requests.post(my_url,json=json_data,params = params)

        statuscode = r.reason
        if statuscode != "OK":
            #save the failed batch and the reason for later inspection
            theBadBunch.append((statuscode,batch))
            continue

        rows = []
        for person in r.json():
            #check if we received useful information
            if person is None:
                continue
            if person.get("authorId") is None:
                print(f"Bad response from API, unable to use this entry: {person}")
                continue

            authorDetails = {
                "authorId": person["authorId"],
                "name": person["name"],
                "aliases": person["aliases"],
                "citationCount": person["citationCount"]}
            #check if the author has written some papers
            if person.get("papers") is not None:
                papers = person["papers"]
                #update the authorDetails dict with the update operator "|=" (requires Python 3.9+)
                authorDetails |= {
                    "papersId": [paper['paperId'] for paper in papers],
                    "papersExternalId": [paper['externalIds'] for paper in papers],
                    "papersTitle": [paper['title'] for paper in papers],
                    "papersAbstract": [paper['abstract'] for paper in papers],
                    "papersYear": [paper['year'] for paper in papers],
                    "papersCitationCount": [paper['citationCount'] for paper in papers],
                    "papers_s2FieldsOfStudy": [paper['s2FieldsOfStudy'] for paper in papers],
                    "papersAuthors": [paper['authors'] for paper in papers]}
            rows.append(authorDetails)

        #save the batch as a pickle
        if rows:
            batchauthorDF = pd.DataFrame(rows,columns=authorDfColumns)
            pickleName = f"data/syltedeAgurker/agurk{num},{batchnum}.pkl"
            batchauthorDF.to_pickle(pickleName)
    print("I am going to sleep")        
    time.sleep(sleepTIme)
    print("I woke up")     

I am a request number 4 out of 6
I am going to sleep
I woke up
I am a request number 5 out of 6
I am going to sleep
I woke up
I am a request number 6 out of 6
I am going to sleep
I woke up


In [12]:
import pickle
with open("data/theBadBunch.pkl", 'wb') as file:
    pickle.dump(theBadBunch, file)
#each failed batch holds at most 100 IDs, so the number below is an upper bound
print(f'We have failed to retrieve data for {len(theBadBunch)*100} authors, due to errors when requesting')
print("These author IDs are now stored in theBadBunch.pkl if you want to do something about this")

We have failed to retrieve data for 33100 authors, due to errors when requesting
These author IDs are now stored in theBadBunch.pkl if you want to do something about this


In [11]:
len(theBadBunch)


331
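One thing that could be done about the failed batches is to retry them a few times with an increasing pause. The sketch below is only an illustration under assumptions and was not part of our pipeline: `retry_batches` is a hypothetical helper, the `post` callable stands in for the `requests.post` call used above and is assumed to return an object with the attributes `ok` and `reason`, and the pause lengths are arbitrary.

```python
import time


def retry_batches(badBunch, post, maxAttempts=3, pause=1.0, sleep=time.sleep):
    """Retry failed batches; return (responses, batches that still fail).

    badBunch is a list of (reason, batch) tuples like theBadBunch.
    """
    responses, stillBad = [], []
    for _, batch in badBunch:
        reason = None
        for attempt in range(maxAttempts):
            r = post(batch)
            if r.ok:
                responses.append(r)
                break
            reason = r.reason
            # wait longer after each failed attempt
            sleep(pause * 2 ** attempt)
        else:
            # no attempt succeeded, keep the batch for later inspection
            stillBad.append((reason, batch))
    return responses, stillBad


# stand-ins for the real request and response, with made-up behaviour
class FakeResponse:
    def __init__(self, ok, reason):
        self.ok, self.reason = ok, reason


calls = {}

def fake_post(batch):
    # the batch starting with "a" succeeds on its second attempt, others always fail
    key = batch[0]
    calls[key] = calls.get(key, 0) + 1
    ok = key == "a" and calls[key] >= 2
    return FakeResponse(ok, "OK" if ok else "Internal Server Error")


responses, stillBad = retry_batches(
    [("Internal Server Error", ["a", "b"]), ("Internal Server Error", ["c"])],
    fake_post, sleep=lambda seconds: None)
print(len(responses), stillBad)
```

With the real API, `post` could be something like `lambda batch: requests.post(my_url, json={"ids": batch}, params=params)`, and the pauses would have to respect the rate limit.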