# Formalia

Please read the [assignment overview page](https://github.com/TheYuanLiao/comsocsci2025/wiki/Assignments) carefully before proceeding. The page contains information about formatting (including formats etc), group sizes, and many other aspects of handing in the assignment.

These exercises are a subset of the exercises you did in class and you could just copy-paste the solution you developed in class.

__If you fail to follow these simple instructions, it will negatively impact your grade!__

**Due date and time**: The assignment is due on Mar 4th at 23:59. Hand in your Jupyter notebook file (with extension `.ipynb`) via DTU Learn _(Assignment 1)_. 

Remember to include in the first cell of your notebook:
* the link to your group's Git repository 
* group members' contributions


Link to Git repository: 

https://github.com/cruesli/CSS_group13

# Contributions

Simon 33%$\\$
Gustav 33%$\\$
Magnus 33%$\\$

## Part 1: Web-scraping
Week 1, ex 3.

> **Exercise: Web-scraping the list of participants to the International Conference in Computational Social Science**    
>
> You can find the programme of the 2023 edition of the conference at [this link](https://ic2s2-2023.org/program). As you can see the conference programme included many different contributions: keynote presentations, parallel talks, tutorials, posters. 
> 1. Inspect the HTML of the page and use web-scraping to get the names of all researchers that contributed to the conference in 2023. The goal is the following: (i) get as many names as possible including: keynote speakers, chairs, authors of parallel talks and authors of posters; (ii) ensure that the collected names are complete and accuarate as reported in the website (e.g. both first name and family name); (iii) ensure that no name is repeated multiple times with slightly different spelling. 
> 2. Some instructions for success: 
>    * First, inspect the page through your web browser to identify the elements of the page that you want to collect. Ensure you understand the hierarchical structure of the page, and where the elements you are interested in are located within this nested structure.   
>    * Use the [BeautifulSoup Python package](https://pypi.org/project/beautifulsoup4/) to navigate through the hierarchy and extract the elements you need from the page. 
>    * You can use the [find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) method to find elements that match specific filters. Check the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) of the library for detailed explanations on how to set filters.  
>    * Parse the strings to ensure that you retrieve "clean" author names (e.g. remove commas, or other unwanted charachters)
>    * The overall idea is to adapt the procedure I have used [here](https://nbviewer.org/github/lalessan/comsocsci2023/blob/master/additional_notebooks/ScreenScraping.ipynb) for the specific page you are scraping. 
> 3. Create the set of unique researchers that joined the conference and *store it into a file*.
>     * *Important:* If you notice any issue with the list of names you have collected (e.g. duplicate/incorrect names), come up with a strategy to clean your list as much as possible. 
> 4. *Optional:* For a more complete represenation of the field, include in your list: (i) the names of researchers from the programme committee of the conference, that can be found at [this link](https://ic2s2-2023.org/program_committee); (ii) the organizers of tutorials, that can be found at [this link](https://ic2s2-2023.org/tutorials)
> 5. How many unique researchers do you get?
> 6. Explain the process you followed to web-scrape the page. Which choices did you make to accurately retreive as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices __(answer in max 150 words)__.

We started by by inspecting the HTML of the page and identified that the names of the speakers were contained within unordered lists (ul) with class name "nav_list", and the names of the chair's was contained within a h2 header. Using BeatifulSoups find_all() function we retrieved all ul's with class name "nav_list". Within each of the ul's the names of the authors are contained within a <i> tag (italic), which were used to retrieve the names. The process was then to loop through all the ul lists, find all <i> within each list. Then loop through all <i<>s, split up the names and remove unwanted characters to get clean names. The same was done with the h2 headers were the word chair was removed to leave us with the names. Lastly we removed all duplicates. This approach should retrieve all the names from the program page.

In [71]:
from bs4 import BeautifulSoup
import requests
import re

Link = "https://ic2s2-2023.org/program"
r = requests.get(Link)
soup = BeautifulSoup(r.content)

ulist = soup.find_all('ul',{'class':'nav_list'})
h2s = soup.find_all('h2')
names = []
for h2 in h2s:
    nsl = h2.find_all('i')
    if nsl:
        ns = str(nsl[0])[10:]
        cleaned_ns = re.sub(r"[^\w\s\,]", "", ns) # Removes any unwanted characters
        cleaned_ns = cleaned_ns[:len(cleaned_ns)-1] # Removes the i left over from </i>
        names.append(cleaned_ns)
for lists in ulist:
    nsl = lists.find_all('i')
    for nl in nsl:
        ns = nl.text.split(', ')
        cleaned_ns = [re.sub(r"[^\w\s\,]", "", n) for n in ns] # Removes any unwanted characters
        names.append(cleaned_ns)
names = [item for sublist in names for item in sublist] # flatten list
unique_names = sorted(list(set(names)), key = lambda name: name[0]) # Remove any duplicates and sort in alphabetical order
print(f'Number of unique names: {len(unique_names)}')

Number of unique names: 1523


## Part 2: Ready Made vs Custom Made Data
Week 2, ex 1.

> **Exercise: Ready made data vs Custom made data** In this exercise, I want to make sure you have understood they key points of my lecture and the reading. 
>
> 1. What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book __(answer in max 150 words)__.
> 2. How do you think these differences can influence the interpretation of the results in each study? __(answer in max 150 words)__

> 1. Centola's experiment used custom made data, meaning the environment and interactions were fully designed by the researchers. One big pro is that they could presicely trach how people's behaviour spread and after identify which factors caused these shifts. By designing the setup themselves, they can minimize unpredictable outside factors that might otherwise blur the results. The downside is that this highly controlled setting may not fully represent real world situations very well, so the findings may be less generalizable. On the other hand, Nicolaide's study used data collected from real people's everyday interactions. This makes the results more realistic, since it shows how people behave without being prompted by an experiment. However, ready-made data often includes many uncontrollable variables, making it harder to specify what led to certain outcomes. Researchers also risk missing important details if the data was not gathered with their specific research questions in mind. 
> 2. Centola's controlled envrionment lets us see precisely how each factor affects behaviour, making it easier to discuss cause and effect with confidence. But because people most people don't live under the tight conditions set by the experiment, we have to be careful about generalizing the results. Nicolaides' real world data gives a more realistic picture of how behaviours spread in everyday life, even though its harder to figure out which factors matter most, because theres more noise.  Therefore, we interpret Centola's findings with stronger centainty about specific influences, while Nicolaide's study shows how these processes play out in a natural setting. Both view can inform each other.

## Part 3: Gathering Research Articles using the OpenAlex API
Week 3, ex 1.

> **Exercise : Collecting Research Articles from IC2S2 Authors**
>
>In this exercise, we'll leverage the OpenAlex API to gather information on research articles authored by participants of the IC2S2 2024 (NOT 2023) conference, referred to as *IC2S2 authors*. **Before you start, please ensure you read through the entire exercise.**
>
> 
> **Steps:**
>  
> 1. **Retrieve Data:** Starting with the *authors* you identified in Week 2, Exercise 2, use the OpenAlex API [works endpoint](https://docs.openalex.org/api-entities/works) to fetch the research articles they have authored. For each article, retrieve the following details:
>    - _id_: The unique OpenAlex ID for the work.
>    - _publication_year_: The year the work was published.
>    - _cited_by_count_: The number of times the work has been cited by other works.
>    - _author_ids_: The OpenAlex IDs for the authors of the work.
>    - _title_: The title of the work.
>    - _abstract_inverted_index_: The abstract of the work, formatted as an inverted index.
> 
>     **Important Note on Paging:** By default, the OpenAlex API limits responses to 25 works per request. For more efficient data retrieval, I suggest to adjust this limit to 200 works per request. Even with this adjustment, you will need to implement pagination to access all available works for a given query. This ensures you can systematically retrieve the complete set of works beyond the initial 200. Find guidance on implementing pagination [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging).
>
> 2. **Data Storage:** Organize the retrieved information into two Pandas DataFrames and save them to two files in a suitable format:
>    - The *IC2S2 papers* dataset should include: *id, publication\_year, cited\_by\_count, author\_ids*.
>    - The *IC2S2 abstracts* dataset should include: *id, title, abstract\_inverted\_index*.
>  
>
> **Filters:**
> To ensure the data we collect is relevant and manageable, apply the following filters:
> 
>    - Only include *IC2S2 authors* with a total work count between 5 and 5,000.
>    - Retrieve only works that have received more than 10 citations.
>    - Limit to works authored by fewer than 10 individuals.
>    - Include only works relevant to Computational Social Science (focusing on: Sociology OR Psychology OR Economics OR Political Science) AND intersecting with a quantitative discipline (Mathematics OR Physics OR Computer Science), as defined by their [Concepts](https://docs.openalex.org/api-entities/works/work-object#concepts). *Note*: here we only consider Concepts at *level=0* (the most coarse definition of concepts). 
>
> **Efficiency Tips:**
> Writing efficient code in this exercise is **crucial**. To speed up your process:
> - **Apply filters directly in your request:** When possible, use the [filter parameter](https://docs.openalex.org/api-entities/works/filter-works) of the *works* endpoint to apply the filters above directly in your API request, ensuring only relevant data is returned. Learn about combining multiple filters [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists).  
> - **Bulk requests:** Instead of sending one request for each author, you can use the [filter parameter](https://docs.openalex.org/api-entities/works/filter-works) to query works by multiple authors in a single request. *Note: My testing suggests that can only include up to 25 authors per request.*
> - **Use multiprocessing:** Implement multiprocessing to handle multiple requests simultaneously. I highly recommmend [Joblib’s Parallel](https://joblib.readthedocs.io/en/stable/) function for that, and [tqdm](https://tqdm.github.io/) can help monitor progress of your jobs. Remember to stay within [the rate limit](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication) of 10 requests per second.
>
>
>   
> For reference, employing these strategies allowed me to fetch the data in about 30 seconds using 5 cores on my laptop. I obtained a dataset of approximately 25 MB (including both the *IC2S2 abstracts* and *IC2S2 papers* files).
> 
>
> **Data Overview and Reflection questions:** Answer the following questions: 
> - **Dataset summary.** How many works are listed in your *IC2S2 papers* dataframe? How many unique researchers have co-authored these works? 
> - **Efficiency in code.** Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time? __(answer in max 150 words)__
> - **Filtering Criteria and Dataset Relevance** Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices? __(answer in max 150 words)__


In [72]:
# Retrieving Author Ids from web scraped names
# This can probably be massively improved upon
import pandas as pd
BASE_URL = 'https://api.openalex.org/'
RESOURCE = 'works'
complete_url = BASE_URL + RESOURCE

base_url = "https://api.openalex.org/authors"

author_ids = {}
n = len(unique_names)
i = 0
for name in unique_names:
    
    params = { # Query parameters. Added filter to only include authors who have published between 5 and 5000 works
        'filter': f'display_name.search:{name},works_count:>5|<5000',
        'select': 'id'
    }
    
    # Send request to OpenAlex API
    response = requests.get(base_url, params=params)
    
    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()
        if data['results']:
            author_ids[name] = data.get('results')[0]['id']
            print('Found id')
        else:
            print('No id found')
    else:
        print('Failed to retrieve')
    i += 1
    print(f'{i}/{n}')

print(len(author_ids))

Found id
1/1523
No id found
2/1523
No id found
3/1523
Found id
4/1523
Found id
5/1523
Found id
6/1523
Found id
7/1523
No id found
8/1523
Found id
9/1523
Found id
10/1523
Found id
11/1523
Found id
12/1523
Found id
13/1523
Found id
14/1523
Found id
15/1523
Found id
16/1523
Found id
17/1523
Found id
18/1523
Found id
19/1523
Found id
20/1523
Found id
21/1523
No id found
22/1523
Found id
23/1523
Found id
24/1523
Found id
25/1523
Found id
26/1523
Found id
27/1523
Found id
28/1523
Found id
29/1523
No id found
30/1523
Found id
31/1523
Found id
32/1523
Found id
33/1523
Found id
34/1523
Found id
35/1523
Found id
36/1523
Found id
37/1523
Found id
38/1523
Found id
39/1523
Found id
40/1523
Found id
41/1523
Found id
42/1523
Found id
43/1523
Found id
44/1523
Found id
45/1523
Found id
46/1523
Found id
47/1523
Found id
48/1523
Found id
49/1523
Found id
50/1523
Found id
51/1523
Found id
52/1523
Found id
53/1523
Found id
54/1523
Found id
55/1523
Found id
56/1523
No id found
57/1523
Found id
58/1523
Found

In [77]:
import pandas as pd
import requests
import time
from joblib import Parallel, delayed
from tqdm import tqdm

URL = "https://api.openalex.org/works"

BATCH_SIZE = 20

def fetch_papers_for_authors(authors_batch, batch_index, total_batches):
    papers_dataset = []
    abstract_dataset = []

    author_filters = '|'.join([f"\"{author_id}\"" for author_id in authors_batch])
    
    page = 1
    while True:
        params = {
            "filter": f"authorships.author.id:\"{author_filters}\",cited_by_count:>10,authors_count:<10",
            "per_page": 200,
            "page": page
        }

        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = requests.get(URL, params=params, timeout=10).json()
                break 
            except requests.exceptions.RequestException:
                if attempt < max_retries - 1:
                    time.sleep(5) 
                else:
                    return [], [], f"Skipping batch {batch_index} due to repeated request failures."

    
        if 'results' not in response or not response['results']:
            break

        for paper in response['results']:
            first_concept, second_concept = False, False

            for concept in paper.get('concepts', []):
                if concept.get('level') != 0:
                    continue
                if concept.get('display_name') in ['Sociology', 'Psychology', 'Economics', 'Political Science']:
                    first_concept = True
                if concept.get('display_name') in ['Mathematics', 'Computer Science', 'Physics']:
                    second_concept = True

            if not (first_concept and second_concept):
                continue

            paper_info = {
                'id': paper.get('id'),
                'publication_year': paper.get('publication_year'),
                'cited_by_count': paper.get('cited_by_count'),
                'author_ids': ', '.join([author['author']['id'] for author in paper.get('authorships', [])]),
            }
            papers_dataset.append(paper_info)

            abstract_info = {
                'id': paper.get('id'),
                'title': paper.get('title'),
                'abstract_inverted_index': paper.get('abstract_inverted_index', '')
            }
            abstract_dataset.append(abstract_info)

        page += 1

    return papers_dataset, abstract_dataset

batches = [list(author_ids.values())[i:i + BATCH_SIZE] for i in range(0, len(list(author_ids.values())), BATCH_SIZE)]

# Run in parallel
num_cores = 12  # Adjust based on your CPU
results = Parallel(n_jobs=num_cores)(
    delayed(fetch_papers_for_authors)(batch, idx, len(batches))
    for idx, batch in tqdm(enumerate(batches), total=len(batches))
)

papers_dataset = [paper for res in results for paper in res[0]]
abstract_dataset = [abstract for res in results for abstract in res[1]]

df_papers = pd.DataFrame(papers_dataset)
df_abstracts = pd.DataFrame(abstract_dataset)

df_papers.to_csv('papers_dataset.csv', index=False)
df_abstracts.to_csv('abstracts_dataset.csv', index=False)

print("Processing complete. Datasets saved.")


100%|██████████| 69/69 [00:29<00:00,  2.36it/s]


Processing complete. Datasets saved.


In [78]:
print(len(df_papers), len(df_abstracts))

7700 7700


In [79]:
df_papers['author_ids'].nunique()

5186

- Note that we do find more works, but i did not have time to re run the notebook

There are 3797 works listed in my IC2S2 papers dataframe. Here there are 3104 unique co-authors for these works.

To improve the efficiency of the code, joblib.parallel was implemented, which gave the biggest boost in efficiency. Searching for multiple authors at the same time also helped a lot, where it was chosen to search for 20 authors at a time, since this would be a very stable yet still effective number. I also added error handling & retries to prevent failures, so that if the api failed, it could just be retried, and the entire program wouldn't have to restart.

For the choices of filtering, keeping the works count for an author between 5 and 5000 makes sure that we only have serious authors that both have actually contributed and also don't just put their name on anything. The filter that papers needed at least 10 citations is done to show that the papers have an actual impact, and the co-authors to fewer than 10 filter highlights more focused research. We also filtered for topics in computational social science, like sociology and somputer science, to keep the dataset relevant. However, this can make it so ones with topics in different fields that might still have great impact on computational social science, are filtered out.

## Part 4: The Network of Computational Social Scientists
Week 4, ex 1. Please use the final dataset you collected from both authors and co-authors (IC2S2 2024).

> **Exercise: Constructing the Computational Social Scientists Network**
>
> In this exercise, we will create a network of researchers in the field of Computational Social Science using the NetworkX library. In our network, nodes represent authors of academic papers, with a direct link from node _A_ to node _B_ indicating a joint paper written by both. The link's weight reflects the number of papers written by both _A_ and _B_.
>
> **Part 1: Network Construction**
>
> 1. **Weighted Edgelist Creation:** Start with your dataframe of *papers*. Construct a _weighted edgelist_ where each list element is a tuple containing three elements: the _author ids_ of two collaborating authors and the total number of papers they've co-authored. Ensure each author pair is listed only once. 
>
> 2. **Graph Construction:**
>    - Use NetworkX to create an undirected [``Graph``](https://networkx.org/documentation/stable/reference/classes/graph.html).
>    - Employ the [`add_weighted_edges_from`](https://networkx.org/documentation/stable/reference/classes/generated/networkx.Graph.add_weighted_edges_from.html#networkx.Graph.add_weighted_edges_from) function to populate the graph with the weighted edgelist from step 1, creating a weighted, undirected graph.
>
> 3. **Node Attributes:**
>    - For each node, add attributes for the author's _display name_, _country_, _citation count_, and the _year of their first publication_ in Computational Social Science. The _display name_ and _country_ can be retrieved from your _authors_ dataset. The _year of their first publication_ and the _citation count_  can be retrieved from the _papers_ dataset.
>    - Save the network as a JSON file.
>      
> **Part 2: Preliminary Network Analysis**
> Now, with the network constructed, perform a basic analysis to explore its features.
> 1. **Network Metrics:**
>    - What is the total number of nodes (authors) and links (collaborations) in the network? 
>    - Calculate the network's density (the ratio of actual links to the maximum possible number of links). Would you say that the network is sparse? Justify your answer.
>    - Is the network fully connected (i.e., is there a direct or indirect path between every pair of nodes within the network), or is it disconnected?
>    - If the network is disconnected, how many connected components does it have? A connected component is defined as a subset of nodes within the network where a path exists between any pair of nodes in that subset. 
>    - How many isolated nodes are there in your network?  An isolated node is defined as a node with no connections to any other node in the network.
>    - Discuss the results above on network density, and connectivity. Are your findings in line with what you expected? Why?  __(answer in max 150 words)__
> 
> 3. **Degree Analysis:**
>    - Compute the average, median, mode, minimum, and maximum degree of the nodes. Perform the same analysis for node strength (weighted degree). What do these metrics tell us about the network? __(answer in max 150 words)__
> 
> 4. **Top Authors:**
>    - Identify the top 5 authors by degree. What role do these node play in the network? 
>    - Research these authors online. What areas do they specialize in? Do you think that their work aligns with the themes of Computational Social Science? If not, what could be possible reasons? __(answer in max 150 words)__


In [80]:
import pandas as pd
import networkx as nx
import json
from itertools import combinations

papers_df = pd.read_csv('papers_dataset.csv', header=0, dtype=str) 
papers_df.columns = ["id","publication_year","cited_by_count","author_ids"]

authors_df = pd.read_csv('authors_data_single_name_search.csv')

papers_df["cited_by_count"] = pd.to_numeric(papers_df["cited_by_count"], errors="coerce").fillna(0).astype(int)

papers_df["publication_year"] = pd.to_numeric(papers_df["publication_year"], errors="coerce").fillna(0).astype(int)

coauthor_counts = {}
author_citations = {}
author_first_pub_year = {}

 
for _, row in papers_df.iterrows():
    if pd.isna(row["author_ids"]):
        continue
    
    author_list = [author.strip() for author in row["author_ids"].split(",")]
    author_list = list(set(author_list)) 

    
    for author in author_list:
        author_citations[author] = author_citations.get(author, 0) + row["cited_by_count"]
        
        if author not in author_first_pub_year:
            author_first_pub_year[author] = row["publication_year"]
        else:
            author_first_pub_year[author] = min(author_first_pub_year[author], row["publication_year"])

    for author1, author2 in combinations(sorted(author_list), 2):
        coauthor_counts[(author1, author2)] = coauthor_counts.get((author1, author2), 0) + 1

weighted_edgelist = [(author1, author2, weight) for (author1, author2), weight in coauthor_counts.items()]

G = nx.Graph()

G.add_weighted_edges_from(weighted_edgelist)

for _, row in authors_df.iterrows():
    author_name = row["display_name"].strip()
    author_id = row["id"]
    
    if author_id in G.nodes:
        G.nodes[author_id]["display_name"] = author_name
        G.nodes[author_id]["country"] = row.get("country_code", "Unknown")
        G.nodes[author_id]["citation_count"] = author_citations.get(author_id, 0)
        G.nodes[author_id]["first_pub_year"] = author_first_pub_year.get(author_id, "Unknown")

graph_data = nx.node_link_data(G)

with open("coauthorship_network.json", "w") as f:
    json.dump(graph_data, f, indent=4)

print("✅ Co-authorship network saved as JSON!")


✅ Co-authorship network saved as JSON!


The default value will be `edges="edges" in NetworkX 3.6.


  nx.node_link_data(G, edges="links") to preserve current behavior, or
  nx.node_link_data(G, edges="edges") for forward compatibility.
