# ADA 25: Exam (100 Points Total)

This exam consists of three main tasks, each with several questions. The tasks are independent of each other and can be solved in any order. 

**DO NOT WORRY IF YOU CAN NOT SOLVE ALL OF THESE TASKS!** The larger number of questions is intended to give you more opportunities to show what you have learned during the course :).

A shared "Google Doc" is available [here](https://docs.google.com/document/d/e/2PACX-1vQWG_J3UfU8WaVCly1HpPwCSf49sDHT6hjOQUSSFoJasENqUq_x2HHpqQoJ5m66gt2RmSJ0q9-ZJMQD/pub):
- Any announcements will be posted in this document
- Any questions emailed to <a href="mailto:ada-core-assistants-2025@groupes.epfl.ch">ada-core-assistants-2025@groupes.epfl.ch</a> will appear here

### Random Seed
**Important**: Always use `RANDOM_SEED = 4`; do **NOT** change this value.

In [None]:
import random
import numpy as np

RANDOM_SEED = 4

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

## Task 1 (40 Points): Good News Can't Wait?

For your first assignment you'll be exploring the conference acceptance process for a top-tier AI conference. Let's start with some background:

**The ICLR Review Process: A Brief Overview:**

The International Conference on Learning Representations (ICLR) is one of the most prestigious AI conferences. It employs a "double-blind" peer-review system, meaning the identities of both authors and reviewers are concealed throughout the process. The decision-making workflow generally follows these stages:

1. **Submission & Assignment**: Each paper is typically assigned 3 to 4 independent reviewers based on subject matter expertise, overseen by an Area Chair (AC).
2. **Initial Review Phase**: Reviewers submit independent scores and qualitative feedback.
3. **Discussion & Rebuttal**: Authors can respond to reviews, and reviewers may update their scores based on these discussions.
4. **Recommendation**: The AC synthesizes the reviews and the discussion to provide a Meta-Review and a final recommendation to the Program Chairs.
5. **Final Decision**: The Program Chairs make the ultimate acceptance decision. This is influenced not only by individual paper quality but also by the target acceptance rate for the conference and the total capacity of the venue.

In an effort to promote transparency, ICLR makes the reviews of both the accepted **and** the rejected submissions available to the general public.

The ICLR review process is hosted on a website called "OpenReview".
Each paper submission gets its own webpage that shows the paper's title and abstract (short description of the paper) followed by the reviews.
Due to a "quirk" of OpenReview, the webpage lists the reviews by the date they were first posted (the most recent review is shown first).
For the remainder of this task, we can thus assume the order is "randomly" assigned.
 
**The main focus of this task is to investigate if the order used to present the reviews matters for paper acceptance**

### 1.a: Causal Diagram (5 pts)
From the **AC's point of view**, please describe the causal DAG representing the acceptance recommendation using the following entities:
- paper
- reviews
- order
- AC recommendation

Describe the directed acyclic graph (DAG) using equations of the form: A = f(B), C = g(E), etc.  
For each relation, provide a **short** justification.

Provide your solution in **markdown** (just simple text, no fancy plotting needed :))

/Discuss/:

**your answer**


### Data task 1 and 2

The data dictionary for task 1 and 2:
- `paper_id`: (str) unique paper ID
- `abstract`: (str) short description of the paper
- `track`: (str) the subject matter category the paper was submitted to
- `r1_score`: (float) the review score shown first
- `r2_score`: (float) the review score shown second
- `r3_score`: (float) the review score shown third
- `r4_score`: Optional(float) IF four reviews, the review score shown fourth, ELSE None
- `num_reviews`: (int) the number of reviews the paper received
- `is_accepted`: (int) the final acceptance decision (0 is rejected, 1 is accepted)

In [None]:
# THIS IS THE DATA USED FOR TASK 1 and 2
import pandas as pd

df = pd.read_csv('data/task_1_2_data.csv')
df = df.astype({
    'paper_id': str,
    'abstract': str,
    'track': str,
    'is_accepted': int,
    'r1_score': float,
    'r2_score': float,
    'r3_score': float,
    'r4_score': float,
    'num_reviews': int
    })
df.head()

### 1.b Inspect the data (10 pts)
We'll start our investigation with some high-level data analysis:

1. (2 pts) Compute the overall acceptance rate of the conference and print it

2. (2 pts) Compute the average score per paper and store it in a new column `score_avg`

3. (2 pts) Plot a density histogram of these score averages using 20 bins.

4. (2 pts) On a second plot, create a bar plot with `score_avg` bins on the x-axis (use bins with size 0.5) and the acceptance rate on the y-axis. Additionally, add a horizontal line with the overall acceptance rate. Make sure your plot has a clear legend and axes labels.

**Hint**: The "acceptance rate" is the mean of `is_accepted` for all papers in the applicable `score_avg` bin.

5. (2 pts) /Discuss/: Based on the two plots, provide score cutoffs for a "clear reject", "clear accept", and "borderline" region

**your answer**
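
As a rough illustration of the per-bin acceptance rate described in the hint above, the sketch below bins a small made-up table with `pd.cut` and averages `is_accepted` within each bin. The values and the names `toy`, `bin_edges`, and `rate_per_bin` are assumptions for illustration only, not the exam data or a reference solution.

```python
import numpy as np
import pandas as pd

# Made-up scores and decisions, for illustration only (not the exam data).
toy = pd.DataFrame({
    "score_avg":   [3.2, 3.4, 5.1, 5.3, 5.4, 7.0, 7.2],
    "is_accepted": [0,   0,   0,   1,   1,   1,   1],
})

# Bin edges of width 0.5 covering the toy score range: 3.0, 3.5, ..., 7.5.
bin_edges = np.arange(3.0, 7.5 + 0.5, 0.5)
toy["score_bin"] = pd.cut(toy["score_avg"], bins=bin_edges, right=False)

# Acceptance rate per bin = mean of is_accepted within the bin.
# observed=True drops bins that contain no papers.
rate_per_bin = toy.groupby("score_bin", observed=True)["is_accepted"].mean()
overall_rate = toy["is_accepted"].mean()

print(rate_per_bin)
print(f"overall acceptance rate: {overall_rate:.3f}")
```

The resulting series can be passed to a bar plot, with `overall_rate` drawn as a horizontal reference line.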


### 1.c. Factors of interest (15 pts)
We will start by first creating some features of interest to aid our analysis (1 to 3), which will then be used to fit a logistic regression (4), the results of which will be analyzed in (5).

1. (2 pts) Write a function that checks if the first score is the highest score and add this to the dataframe as new column `is_highest`.
    - Compute and print the average acceptance rate conditioned on the `is_highest` feature. That is: Compute and print the acceptance rate separately for papers where the first review score is the highest and for papers where it is not.
    - highest means: >=  **NOT**  >

2. (2 pts) Write a function that computes the "consensus" of the reviewers measured by **standard deviation of the scores multiplied by -1** and adds this to the dataframe as new column `score_consensus`.
    - Compute and print the mean value of `score_consensus` across the dataset.

3. (2 pts) Assume the borderline region (in terms of `score_avg`) starts at 4 and ends at 6.5 (both inclusive).
    - Add an `is_borderline` column and print the average acceptance rate for borderline papers.

4. (4 pts) Use the `logit` function from `statsmodels.formula.api` to perform a logistic regression on the subset of borderline papers (use the `is_borderline` column you added to filter these).
    - `is_accepted` is your target variable
    - `is_highest`, `score_consensus`, `num_reviews`, and `track` are your explanatory variables

5. (5 pts) /Discuss/: Based on the results of (4), would you tentatively conclude that the effect of review order on the final acceptance decision is significant at the 95% confidence level? How do you interpret the `score_consensus` results?
    - First: compute the Odds Ratio (OR) for the `is_highest` variable, then interpret this value in plain English, e.g., "All things equal, the odds of a paper being accepted are [X] times higher if ..."
    - Second: what might be the relation between `is_highest` and `score_consensus`?


**Hint for 5**: 

_What happens to `is_highest` when there is perfect agreement among reviewers?_
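
The row-wise features from (1) and (2) can be sketched on a small made-up table as shown below. The helper names `first_is_highest` and `consensus`, the toy values, the choice of the population standard deviation (`ddof=0`), and the coefficient `beta_example` are all assumptions for illustration; the exam does not prescribe them.

```python
import numpy as np
import pandas as pd

SCORE_COLS = ["r1_score", "r2_score", "r3_score", "r4_score"]

# Made-up review scores; NaN marks a missing fourth review.
toy = pd.DataFrame({
    "r1_score": [6.0, 3.0, 5.0],
    "r2_score": [5.0, 6.0, 5.0],
    "r3_score": [6.0, 5.0, 5.0],
    "r4_score": [np.nan, 8.0, np.nan],
})

def first_is_highest(row: pd.Series) -> int:
    """1 if the first-shown score is >= every available score, else 0."""
    scores = row[SCORE_COLS].dropna()
    return int(row["r1_score"] >= scores.max())

def consensus(row: pd.Series) -> float:
    """Negative standard deviation of the available scores."""
    scores = row[SCORE_COLS].dropna().to_numpy(dtype=float)
    return -float(scores.std(ddof=0))  # ddof=0 is an assumption, not specified

toy["is_highest"] = toy.apply(first_is_highest, axis=1)
toy["score_consensus"] = toy.apply(consensus, axis=1)
print(toy)

# A logit coefficient beta corresponds to an odds ratio of exp(beta).
beta_example = 0.25  # hypothetical coefficient, not a result from the exam data
print(f"odds ratio: {np.exp(beta_example):.3f}")
```

In the third toy row all reviewers agree, so the first score is trivially the highest and the consensus value is zero.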

### 1.d Statistical Twins (10 pts)
To truly isolate the impact of the position of a review (the "champion" effect) from the quality of the paper, we use a matching strategy. By grouping papers into "Statistical Twins", that is, papers that received the exact same set of scores (e.g., three reviewers gave scores of 5, 6, and 8), we effectively hold paper quality constant. Within these identical sets, we can then ask: Does the specific behavior or timing of the 'highest' reviewer (the 8) change the outcome, even when the `signature` of the paper is identical? This approach mimics a controlled experiment, allowing us to see if the Area Chair is influenced by how a score sits within the distribution, rather than just the raw scores themselves.

1. (2 pts) Write a function that adds a review score `signature` column. That is: the multiset of scores a paper received, independent of the order in which they are shown.

**Hint**: Papers with (4, 5, 4) and (4, 4, 5) should get the same signature, while (5, 5, 4) should get a different one.

2. (2 pts) Filter out any non-borderline papers and any signatures that have fewer than 10 papers.

3. (4 pts) Perform an appropriate statistical test and report the p-value

4. (2 pts) /Discuss/: What do you conclude at the 95% confidence level?

**your answer**
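
One possible way to build an order-independent signature and keep only sufficiently large groups is sketched below on made-up scores. The string encoding of the signature, the helper name `score_signature`, and the toy threshold `MIN_PAPERS = 2` are assumptions for illustration (the exam asks for at least 10 papers per signature); the statistical test itself is deliberately left out.

```python
import numpy as np
import pandas as pd

SCORE_COLS = ["r1_score", "r2_score", "r3_score", "r4_score"]

# Made-up review scores; NaN marks a missing fourth review.
toy = pd.DataFrame({
    "r1_score": [4.0, 4.0, 5.0, 5.0],
    "r2_score": [5.0, 4.0, 5.0, 4.0],
    "r3_score": [4.0, 5.0, 4.0, 4.0],
    "r4_score": [np.nan, np.nan, np.nan, np.nan],
})

def score_signature(row: pd.Series) -> str:
    """Sorted available scores joined into a string, so display order is ignored."""
    scores = sorted(row[SCORE_COLS].dropna().tolist())
    return "-".join(f"{s:g}" for s in scores)

toy["signature"] = toy.apply(score_signature, axis=1)

# Keep only signatures shared by at least MIN_PAPERS papers.
MIN_PAPERS = 2  # toy threshold for this example only
group_size = toy.groupby("signature")["signature"].transform("size")
twins = toy[group_size >= MIN_PAPERS]
print(twins)
```

Sorting before joining is what makes (4, 5, 4) and (4, 4, 5) collapse to the same signature while (5, 5, 4) stays distinct.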

## Task 2 (20 Points): You had me at Hello

While numerical scores and review order provide a structural view of the conference, the content of the research itself, expressed through the abstract, contains the actual "signal" of innovation. In this section, we shift from review metadata to Natural Language Processing (NLP). Research papers often follow linguistic trends; certain methodologies (e.g., "diffusion models," "transformers") or buzzwords may be associated with higher acceptance rates in a given year.

### 2.a Buzzz words (10 pts)
1. (3 pts) Use `sklearn.feature_extraction.text.TfidfVectorizer` to transform the abstract column into a feature matrix $X$
    - Constraint 1: Remove English stop words to discard uninformative frequent terms.
    - Constraint 2: Use both unigrams (1-gram) and bigrams (2-gram) to capture phrases like "neural network".
    - Constraint 3: Limit max_features to 2,000 to avoid creating a matrix that is too sparse or high-dimensional.

2. (2 pts) Fit a Logistic Regression model (`sklearn.linear_model.LogisticRegression`) to predict is_accepted using the TF-IDF matrix.
    - Please use L1 (Lasso) regularization to penalize large weights, as text data often has more features than documents ($D > N$) which leads to overfitting. Use the following hyperparameters: `l1_ratio=0, C=1.0, solver='liblinear'`.

3. (3 pts) Extract the model coefficients ($\beta$) and list the top 10 words (features) with the highest positive coefficients and the top 10 words with the most negative coefficients.

Once a paper is accepted, authors are asked to provide a _camera-ready_ version. This means authors can update their paper, which can also change the abstract.

4. (2 pts) /Discuss/: Carefully examine the top words for the rejected and accepted papers:
    - If your goal was to help an author predict if their paper would be accepted based only on the abstract, why is this current model "useless"?


**Hint for 4**:

_Use your knowledge of the double-blind review policy (where the identity of the authors and reviewers should remain anonymous)_

**your answer**:

### 2.b Topics Matter? (10 pts)
Individual keywords are often noisy and can be influenced by formatting or procedural artifacts. To capture the underlying research themes of the ICLR corpus, we must shift our analysis from vocabulary to latent concepts.

1. (3 pts) Use `TruncatedSVD` from `sklearn.decomposition` (Latent Semantic Analysis) to factorize your TF-IDF matrix into 50 latent components, ensuring you set `random_state=RANDOM_SEED`.
    - Identify the Theme: Extract and print the top 10 words associated with the second component (index 1).
    - **Mathematical Context**: _This process approximates the decomposition $T \approx USV^T$. By analyzing the components, we are examining the "Topic Space" where words that frequently co-occur are grouped into singular axes of variance._

2. (3 pts) Create a new column in your dataframe called `topic_1_strength` using the values from this second component. Generate a scatter plot with a linear regression trend line to visualize the relationship between a paper's alignment with this topic and its `score_avg`.

3. (4 pts) Calculate the Pearson correlation coefficient ($r$) and its associated p-value.
    - /Discuss/: Based on your p-value, is the relationship between this topic and review scores statistically significant at the $\alpha = 0.05$ level?
    - /Discuss/: Examine the magnitude of the correlation coefficient. In the context of a competitive peer-review process, how much of the variance in scores is actually explained by a paper’s alignment with this dominant topic? Contrast the statistical significance with the practical meaningfulness of this effect.
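
To make the distinction between statistical significance and explained variance concrete, the sketch below computes Pearson's $r$, its p-value, and $r^2$ on synthetic data with a deliberately weak linear relationship. The sample size, the slope of 0.05, and the variable names are assumptions chosen for illustration and say nothing about the exam data.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(4)

# Synthetic data with a deliberately weak linear relationship.
n = 20_000
topic_strength = rng.normal(size=n)
score = 0.05 * topic_strength + rng.normal(size=n)

r, p_value = pearsonr(topic_strength, score)
r_squared = r ** 2  # share of variance explained by a linear fit

print(f"r = {r:.4f}, p = {p_value:.2e}, r^2 = {r_squared:.4f}")
```

With a large sample, even a very small correlation can yield a small p-value, while $r^2$ shows how little of the variance the relationship accounts for.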

In [None]:
# Coding part

**Discussion answers**:

## Task 3 (40 Points): Minds of a Matter Publish Together

In the final stage of our analysis, we move beyond the text of the paper and the mechanics of the review process to examine the social topology of the ICLR community. Science is rarely a solitary endeavor; it is conducted within a network of collaborators, laboratories, and institutions.

In this section, you will treat the conference as a Co-authorship Network. By representing authors as nodes and their collaborations as edges, you will investigate whether the "connectedness" of an author—their position within the global research web—relates to the likelihood of their work being accepted. Is acceptance purely a meritocracy of ideas, or do structural advantages within the social graph play a role in the "minds of a matter" publishing together?

### Data task 3

The data dictionary for task 3:
- `paper_id`: (str) unique paper ID
- `authorids`: (list[str]) the unique IDs of authors listed on a paper
- `is_accepted`: (int) the final acceptance decision (0 is rejected, 1 is accepted)

In [None]:
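import ast

import pandas as pd

df_papers = pd.read_csv('data/task_3_data.csv')
# Parse the stringified author lists safely (literal_eval only accepts Python literals)
df_papers['authorids'] = df_papers['authorids'].apply(ast.literal_eval)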
df_papers.head()

### 3.a: Preparing the data (8 pts)
1. (2 pts) /Discuss/: We have removed all papers that had more than 13 authors (<1% of the papers). Why might we do this?

**your answer**

2. (2 pts) /Discuss/: We also want to exclude all authors that have fewer than 3 papers from this analysis. Why?

**your answer**

3. (2 pts) Use `df_papers` to create an undirected graph where authors are nodes and edges indicate collaboration on a paper.

**Important**: Only include authors that have at least 3 papers. This means that you should **not** add edges between authors *with* at least 3 papers and authors *without* at least 3 papers.

4. (2 pts) Print the total number of nodes and edges
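
A minimal sketch of building a filtered co-authorship graph with `networkx` is shown below on made-up author lists. The toy papers and the decision to keep active authors as nodes even when they end up without edges are assumptions for illustration, not requirements stated in the task.

```python
from collections import Counter
from itertools import combinations

import networkx as nx

# Made-up papers; each entry is the author list of one paper.
toy_papers = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "b", "d"],
    ["c", "a"],
    ["c", "e"],
]

MIN_PAPERS = 3
# Count each author once per paper.
paper_counts = Counter(author for authors in toy_papers for author in set(authors))
active = {author for author, count in paper_counts.items() if count >= MIN_PAPERS}

G = nx.Graph()
G.add_nodes_from(active)  # assumption: keep active authors even if isolated
for authors in toy_papers:
    kept = sorted(set(authors) & active)
    # One undirected edge per pair of active co-authors on the same paper.
    G.add_edges_from(combinations(kept, 2))

print(G.number_of_nodes(), G.number_of_edges())
```

Because `nx.Graph` stores each undirected edge once, repeated collaborations between the same pair do not create duplicate edges.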

### 3.b: Network Feature Engineering (12 pts)
To understand what drives success, we must move beyond simple counts (like number of papers) and quantify an author's structural position. Are they a "hub" connected to many? Are they "influential" by connecting to other key players? Or are they stuck in a tight "clique"? We will create features to measure these distinct properties.

#### 3.b.1: Network Metrics (6 pts)
1. (2 pts) Plot the degree distribution using a log plot.
    - /Discuss/: Does this network follow a power-law distribution?


2. (2 pts) Use the network graph you created to compute the following network metrics for each author: degree, PageRank, clustering coefficient

3. (2 pts) Create a dataframe called `df_features_network` that has the columns: "author_id", "degree", "pagerank", "cluster_coeff" and print the first five rows
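
The three node-level metrics can be collected into a dataframe roughly as sketched below, here on a four-node toy graph. The graph and the name `toy_features` are assumptions for illustration only.

```python
import networkx as nx
import pandas as pd

# Toy graph: a triangle (a, b, c) plus a pendant node d attached to c.
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

degree = dict(G.degree())          # number of neighbours per node
pagerank = nx.pagerank(G)          # PageRank score per node (sums to ~1)
cluster_coeff = nx.clustering(G)   # local clustering coefficient per node

nodes = list(G.nodes())
toy_features = pd.DataFrame({
    "author_id": nodes,
    "degree": [degree[n] for n in nodes],
    "pagerank": [pagerank[n] for n in nodes],
    "cluster_coeff": [cluster_coeff[n] for n in nodes],
})
print(toy_features.head())
```

Node `c` has the highest degree but a low clustering coefficient, because its neighbour `d` is not connected to `a` or `b`.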

#### 3.b.2: Paper Metrics (6 pts)

Use the `df_papers` dataframe as provided at the beginning of Task 3 to create a new dataframe called `df_features_author` that contains the following columns:
1. `author_id`
2. `number_of_papers`, the number of papers the author wrote
3. `average_num_collaborators`, the average number of collaborators (other authors) per paper
4. `acceptance_rate`, the mean of `is_accepted` across the author's papers

Please print the first 5 rows.
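
One way to derive per-author statistics from a paper-level table is to expand the author lists with `explode` and then aggregate, as sketched below on made-up data. The toy table and the intermediate names `num_collaborators` and `per_author` are assumptions for illustration.

```python
import pandas as pd

# Made-up papers, for illustration only.
toy_papers = pd.DataFrame({
    "paper_id": ["p1", "p2", "p3"],
    "authorids": [["a", "b", "c"], ["a", "b"], ["a"]],
    "is_accepted": [1, 0, 1],
})

# Collaborators on a paper = all listed authors except the author themselves.
toy_papers["num_collaborators"] = toy_papers["authorids"].apply(len) - 1

# One row per (paper, author) pair.
per_author = toy_papers.explode("authorids").rename(
    columns={"authorids": "author_id"}
)

toy_author_features = (
    per_author.groupby("author_id")
    .agg(
        number_of_papers=("paper_id", "nunique"),
        average_num_collaborators=("num_collaborators", "mean"),
        acceptance_rate=("is_accepted", "mean"),
    )
    .reset_index()
)
print(toy_author_features.head())
```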

### 3.c Analysis (10 pts)
The average acceptance rate for top conferences often lies around 33%. We could therefore classify "successful authors" as those with an acceptance rate of >40%.
1. (2 pts) First merge `df_features_network` and `df_features_author` using an `inner` merge operation to create `df_features` 

2. (2 pts) Add a new column to your dataframe for successful authors called `is_successful` using the rule `acceptance_rate` > 0.4

3. (3 pts) Compute the correlations of the features with respect to the `is_successful` column. Additionally, perform a groupby on `is_successful` with respect to the features printing the mean and standard errors.

4. (3 pts) /Discuss/:  Observe the correlations for `degree` and `cluster_coeff`. Using the "structural holes" theory, how could one interpret these results?

**your answer**
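
The merge, thresholding, and grouped summary steps of this subtask can be sketched on two small made-up tables as follows. All values and the names `toy_network`, `toy_author`, and `feature_cols` are assumptions for illustration and are not meant to suggest what the real data shows.

```python
import pandas as pd

# Made-up feature tables, for illustration only.
toy_network = pd.DataFrame({
    "author_id": ["a", "b", "c", "d"],
    "degree": [5, 2, 7, 1],
    "cluster_coeff": [0.2, 0.9, 0.1, 1.0],
})
toy_author = pd.DataFrame({
    "author_id": ["a", "b", "c", "e"],
    "acceptance_rate": [0.5, 0.25, 0.75, 1.0],
})

# Inner merge keeps only authors present in both tables (a, b, c).
toy = toy_network.merge(toy_author, on="author_id", how="inner")
toy["is_successful"] = (toy["acceptance_rate"] > 0.4).astype(int)

feature_cols = ["degree", "cluster_coeff"]
# Pearson correlation of each feature with the binary success flag.
correlations = toy[feature_cols].corrwith(toy["is_successful"])
# Mean and standard error of each feature per group (sem is NaN for groups of one).
summary = toy.groupby("is_successful")[feature_cols].agg(["mean", "sem"])

print(correlations)
print(summary)
```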

### 3.d Prediction (10 pts)
We have seen some interesting correlations in the previous question. Let's see if we can use this to "predict" if a paper will be accepted!
1. (2 pts) Create three new columns in the `df_papers` dataframe: `avg_degree, avg_pagerank, avg_cluster_coeff`. Populate these by calculating the "mean" of the respective metrics for all **active authors** on that paper (these should be the ones in the `df_features` dataframe!).

2. (1 pt) Define `X` as the features data using the three new features and `y` as the `is_accepted` column

3. (1 pt) Split the data into 80% training and 20% testing sets using `random_state=RANDOM_SEED`

4. (1 pt) Train a `LogisticRegression` classifier on the training set

5. (1 pt) Predict the `y` for the test set.

6. (2 pts) Print the accuracy score and print the `Coefficients` of the model to identify which network feature is the strongest predictor of acceptance.

7. (2 pts) /Discuss/: Given that the average baseline acceptance rate is ~33%, do the network features provide a meaningful signal for predicting paper acceptance?

**your answer**
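
The split, fit, and evaluation steps can be sketched on synthetic data as shown below. The random features, the label rate of roughly one third, and the added majority-class comparison are assumptions for illustration; they are not the exam data or a reference solution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 4
rng = np.random.default_rng(RANDOM_SEED)

# Synthetic features and labels with roughly one third positives.
n = 600
X = rng.normal(size=(n, 3))
y = (rng.random(n) < 1 / 3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Accuracy of always predicting the most frequent training label.
majority_class = int(np.bincount(y_train).argmax())
baseline = accuracy_score(y_test, np.full_like(y_test, majority_class))

print(f"accuracy: {accuracy:.3f}, majority-class baseline: {baseline:.3f}")
print("coefficients:", model.coef_.ravel())
```

Comparing the model's accuracy with the majority-class figure gives a reference point for judging whether the features add any predictive signal.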