Link to github: https://github.com/albert-moller/CSSAssignment1.git

Group members: Albert Frisch Møller (s214610) and Mark Andrawes (s214654)

For this assignment each group member contributed equally to every aspect of the assignment

### Assignment 1

#### Part 1: Web-Scraping

In [1]:
#Importing necessary Web Scraping and Data Storage packages
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import os
from tqdm import tqdm
from joblib import Parallel, delayed
import networkx as nx
import json

if not os.path.exists("./data"):
    os.mkdir("./data")

In [2]:
LINK = "https://ic2s2-2023.org/program"
r = requests.get(LINK)
soup = BeautifulSoup(r.content)
researchers = []

table_rows = soup.find_all("tr")
for row in table_rows:
    for tag in row.find_all("a"):
        if re.search("Keynote", tag.text):
            string = tag.text
            string = string.replace("Keynote - ", "")
            string = string.strip()
            researchers.append(string)

plenaries = soup.find_all(class_='nav_list')

for plenary in plenaries:
    italic_names = plenary.find_all('i')
    italic_names = [i.get_text() for i in italic_names]

    for entry in italic_names:
        speakers = entry.split(',')
        for speaker in speakers:
            name = speaker.strip()
            if name not in researchers:
                researchers.append(speaker)

researchers_df = pd.DataFrame(researchers, columns=['Full Name'])
researchers_df.to_csv('data/ics2_researchers.csv', index=False)

print(f"The number of unique researchers is {len(researchers_df)}")

The number of unique researchers is 1856


#### How many unique researchers do you get?

We obtained that there are 1856 unique researchers present at the International Conference in Computational Social Science for 2023

##### Explain the process you followed to web-scrape the page. Which choices did you make to accurately retreive as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices. (answer in max 150 words)

To webscrape the IC2S2 2023 program page, we inspected the HTML structure to identify the elements containing researcher names. Using BeautifulSoup, we collected the table rows ('tr') to extract the names of the keynote speakers. Additionally, we collected each 'i' tag within each plenary (found in the class 'nav_list') which contained multiple names of participants - we split these names and appended each one to the list. Finally, we stored the data in a Pandas DataFrame and saved as csv.


To retrieve the names accurately, we removed the 'Keynote -' prefix to cleanly retrieve the names of the keynote speakers. We used the '.strip()' function to remove any leading/trailing whitespace. This was done to ensure that we can accurately compare names when checking for duplicates. We assessed the quality by checking for duplicates by only adding new names to the list. This ensured the list only included unique participants.



#### Part 2: Ready Made vs Custom Made Data

##### What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book (answer in max 150 words)

In Centola's experiment, custom-made data was collected specifically for the study, allowing for full control over the environment and the variables. As explained in “Bit by Bit”, this enhances the relevance and accuracy of the findings to the research question, which is a significant advantage. Additionally, this reduces ethical concerns in the data. On the other hand, custom-made data requires significant resources and time to create, which is a disadvantage.


In Nicolaides's study, ready-made data was used. As described in “Bit by Bit”, an advantage of this is the free accessibility of the data and its broader context. However, this type of data usually contains biases and may require adjustments in the methodology to align with the available data. Moreover, there may be issues with the quality and completeness of the data.


##### How do you think these differences can influence the interpretation of the results in each study? (answer in max 150 words)

The differences between custom-made and ready-made data can significantly influence the interpretation of results in the studies. For Centola's experiment, the control over the experimental conditions allowed for clearer ways of linking effects to specific causes, leading to these insights being more reliable. However, this control also means that the results might be less generalizable to real-world settings, as the controlled environment may not capture all realistic external factors. 

On the other hand, Nicolaides's use of ready-made data derived from existing datasets offers insights that are more reflective of real-world behaviors. This enhances the generalizability of the findings but makes it challenging to determine links between effects and causes due to potential confounders and biases in the data.
Hence, while custom-made data provides cleaner, more controlled insights, ready-made data offers broader perspectives on natural behaviors. These differences mean that we must carefully interpret results based on the data type.


#### Part 3: Gathering Research Articles using the OpenAlex API

In [3]:
#Step 1) Obtain ISC2 Research OpenAlex IDs (using the "authors" endpoint)

researchers = pd.read_csv("data/ics2_researchers.csv")
openalexids_df = pd.DataFrame(columns = ['id', 'display_name', 'works_api_url', 'h_index', 'works_count', 'country_code', 'cited_by_count'])

url = "https://api.openalex.org/authors"
index = 0

for researcher in tqdm(researchers["Full Name"], desc="Obtaining authors dataset"):
    params = {'search': researcher}
    response = requests.get(url, params=params)
    data = response.json()

    if not data['results']:
        continue

    author_data = data['results']
    if len(author_data) > 1:
        author_data = max(author_data, key=lambda x: x.get('relevance_score', 0))
    else:
        author_data = data['results'][0]

    try:
        id = author_data['id']
        display_name = author_data['display_name']
        works_api_url = author_data['works_api_url']
        h_index = author_data['summary_stats']['h_index']
        works_count = author_data['works_count']
        country_code = author_data['last_known_institution']['country_code']
        cited_by_count = author_data['cited_by_count'] 
    except:
        continue

    df_index = len(openalexids_df)
    openalexids_df.loc[df_index] = [id, display_name, works_api_url, h_index, works_count, country_code, cited_by_count]

    if index % 30 == 0:
        openalexids_df.to_csv("data/ics2_authors.csv", index=False)    

Obtaining authors dataset:   0%|          | 0/1856 [00:00<?, ?it/s]

Obtaining authors dataset: 100%|██████████| 1856/1856 [22:32<00:00,  1.37it/s]


In [4]:
#Step 2) Use the "concepts" endpoint to obtain the concepts IDs for Sociology, Psychology, Economics, Political Science, Mathematics, Physics, Computer Science

fields = ["Sociology", "Psychology", "Economics", "Political Science", "Mathematics", "Physics", "Computer Science"]
concepts_id = {}

for field in fields:
    url = f"https://api.openalex.org/concepts?search={field}"
    response = requests.get(url)
    data = response.json()
    if data['results']:
        concept_id = data['results'][0]['id']
        _, concept_id = os.path.split(concept_id)
        concepts_id[field] = concept_id

quantitative_disciplines_filter = f"{concepts_id['Mathematics']}|{concepts_id['Physics']}|{concepts_id['Computer Science']}"
css_filter = f"{concepts_id['Sociology']}|{concepts_id['Psychology']}|{concepts_id['Economics']}|{concepts_id['Political Science']}"

print(f"Quantiative Disciplines Filter: {quantitative_disciplines_filter}")
print(f"Computational Social Science Filter: {css_filter}")

Quantiative Disciplines Filter: C33923547|C121332964|C41008148
Computational Social Science Filter: C144024400|C15744967|C162324750|C17744445


In [5]:
#Step 3) Use the "works" endpoint to obtain all the research articles authored by ICS2 participants

def fetch_author_works(author_id, social_sciences_ids, quantitative_disciplines_ids):
    _, openalex_id = os.path.split(author_id)
    filters = (
        f'author.id:{openalex_id}|{openalex_id}',
        f'cited_by_count:>10',
        f'authors_count:<10',
        f'concepts.id:({social_sciences_ids})',
        f'concepts.id:({quantitative_disciplines_ids})'
    )
    url = f"https://api.openalex.org/works?filter={','.join(filters)}per-page=200"

    response = requests.get(url)
    data = response.json()
    return data['results']

def process_author_works(author_id):
    author_works = fetch_author_works(author_id, css_filter, quantitative_disciplines_filter)
    papers_data = []
    abstracts_data = []
    for work in author_works:
        work_id = work['id']
        publication_year = work['publication_year']
        cited_by_count = work['cited_by_count']
        title = work['title']
        abstract_inverted_index = work.get('abstract_inverted_index', {})
        author_ids = [author['author']['id'] for author in work['authorships']]
        
        papers_data.append({
            'id': work_id,
            'publication_year': publication_year,
            'cited_by_count': cited_by_count,
            'author_ids': author_ids,
        })
        
        abstracts_data.append({
            'id': work_id,
            'title': title,
            'abstract_inverted_index': abstract_inverted_index,
        })
    return papers_data, abstracts_data
        
    
ICS2_authors = pd.read_csv("data/ics2_authors.csv")
ICS2_authors_filtered = ICS2_authors[(ICS2_authors['works_count'] >= 5) & (ICS2_authors['works_count'] <= 5000)]
author_ids = ICS2_authors_filtered['id'].tolist()

all_papers_data = []
all_abstracts_data = []

results = Parallel(n_jobs=4, backend="threading")(delayed(process_author_works)(author_id) for author_id in tqdm(author_ids))

for papers_data, abstracts_data in results:
    all_papers_data.extend(papers_data)
    all_abstracts_data.extend(abstracts_data)

papers_df = pd.DataFrame(all_papers_data, columns=["id", "publication_year", "cited_by_count", "author_ids"])
abstracts_df = pd.DataFrame(all_abstracts_data, columns=["id", "title", "abstract_inverted_index"])
papers_df.to_csv('data/papers.csv', index=False)
abstracts_df.to_csv('data/abstracts.csv', index=False)

100%|██████████| 1501/1501 [05:16<00:00,  4.75it/s]


In [6]:
# Step 4) Determine how many unique researchers have co-authored these works

co_authors = papers_df['author_ids']
unique_co_authors = []

for row in co_authors:
    for author in row:

        if author not in unique_co_authors:
            unique_co_authors.append(author)

print(f"Number of works listed in the ICSS2 papers dataframe: {len(papers_df)}")
print(f"Number of unique researchers that have co-authored the found works: {len(unique_co_authors)}")

Number of works listed in the ICSS2 papers dataframe: 12986
Number of unique researchers that have co-authored the found works: 13027


##### How many works are listed in your IC2S2 papers dataframe? How many unique researchers have co-authored these works?

There are 12972 works listed in the ICS2S2 papers dataframe. Additionally, there are 13015 unique researchers that have co-authored the found works.  

#### Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time? (answer in max 150 words)

To enhance code efficiency, our approach included batching requests to the OpenAlex API to query multiple authors simultaneously and applying filters directly within API requests to minimize data processing. We used Joblib's Parallel function which enabled concurrent processing, significantly reducing the total execution time. By adjusting the API's per-page limit to 200 works and implementing pagination, the script efficiently handled large datasets. These strategies ensured that the data retrieval process was both fast and resource-efficient, and the executing time was massively reduced by implementing these strategies compared to the case where they were not implemented.  

#### Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices? (answer in max 150 words)

The purpose of the specified filtering criteria is to enhance the dataset’s relevance and manageability. Setting thresholds for an author’s total work count ensured that we only included those with a significant contribution to the field of Computational Social Science, while the citation count filter ensured that only works with greater impact were included. Limiting works authored by fewer than 10 individuals helped focus on collaborations that are more typical in Computational Social Science, and avoided overly large teams that might dilute the focus of the paper. Including only works that are relevant to Computational Social Science and a quantitative discipline ensured the dataset’s relevance to computational methodologies. These filters may lead to an underrepresentation of emerging research with fewer citations. On the other hand, well-cited collaborative research within the traditional Computational Social Science disciplines may be overrepresented. 