In [159]:
from datetime import datetime, timedelta, timezone
import os
import io

import dr_util.file_utils as fu

In [8]:
# Path constants (TODO: move into a config file)
RAW_PDF_DIR = "/Users/daniellerothermel/drotherm/data/raw_pdfs/"
PARSED_PDF_DIR = "/Users/daniellerothermel/drotherm/data/parsed_pdfs/"
METADATA_DIR = "/Users/daniellerothermel/drotherm/data/pdf_metadata/"
AUTHORS = [
    "Pavel Izmailov",
    "Mengye Ren",
    "Eunsol Choi",
    "Tal Linzen",
    "He He",
    "Lerrel Pinto",
    "Rajesh Ranganath",
    "Kyunghyun Cho",
]
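
The comment on the path constants flags them as candidates for a config. A minimal sketch of one way to do that, assuming a hypothetical `PDF_DATA_ROOT` environment variable and a hypothetical `PdfPaths` dataclass (neither is defined anywhere in this notebook):

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PdfPaths:
    """Hypothetical container for the three data directories."""

    root: Path

    @property
    def raw_pdf_dir(self) -> Path:
        return self.root / "raw_pdfs"

    @property
    def parsed_pdf_dir(self) -> Path:
        return self.root / "parsed_pdfs"

    @property
    def metadata_dir(self) -> Path:
        return self.root / "pdf_metadata"


def load_paths(default_root: str = "data") -> PdfPaths:
    # PDF_DATA_ROOT is an assumed variable name used for illustration only.
    return PdfPaths(root=Path(os.environ.get("PDF_DATA_ROOT", default_root)))


paths = load_paths()
print(paths.parsed_pdf_dir)
```

Keeping the root in one place means the three subdirectory names stay consistent and the notebook no longer depends on one user's home directory.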

In [168]:
AUTHOR_INFO = {
    "He He": """
Assistant Professor of Computer Science and Data Science

Bio: He He is an Assistant Professor in Computer Science and Data Science. She is broadly interested in natural language processing and machine learning. Her recent research focuses on understanding large language models, improving their trustworthiness, and human-AI interaction. Prior to joining NYU, she obtained her PhD from University of Maryland, did a post-doc at Stanford, and spent one year at AWS working in dialogue platforms.

Research Areas:

- Machine learning
- Deep learning
- Natural language processing

I want to build intelligent systems that can communicate with humans effectively and enable individuals to achieve their goals. Today’s systems are often opaque, brittle, and difficult to control, which limits their usefulness in human-centered applications. To make them our trustworthy collaborators, my research aims to (i) understand the computational foundation of generalization in novel scenarios, and (ii) build interactive systems that align with user’s goals.

I am an Assistant Professor of Computer Science and Center for Data Science at New York University. I am affiliated with the CILVR Lab, the Machine Learning for Language Group, and the Alignment Research Group.

Here are some directions I’m excited about nowadays:

- Robustness: Machine learning models are trained on a fixed and often biased dataset, but face a constantly-changing world. How can we build predictors that align with human rationales, avoid spurious correlations, and generalize to out-of-distribution data? How can models adapt quickly given new information?
- Truthfulness: We are increasingly relying on machine learning models (e.g., large language models) for critical tasks. How can we make sure that the model outputs conform to facts? Does the model know what it doesn’t know? Can it output a “proof” for its answer? How do we evaluate factuality efficiently for questions beyond the ability of an average person?
- Human-AI collaboration: We want AI agents to deal with our daily minutiae, support our decision-making, and teach us complex concepts. How should the agent infer user intention and preferences, allow for fine-grained control, and take (natural language) feedback? How will this collaboration shape the future workforce?
    """,
    "Eunsol Choi": """
Assistant Professor of Computer Science and Data Science

Bio: Eunsol Choi is an assistant professor of computer science and data science at New York University. Her research spans natural language processing and machine learning, with a focus on interpreting and reasoning about text in dynamic real-world contexts. Prior to joining NYU, she was an assistant professor at the University of Texas at Austin. She also spent a year at Google AI as a visiting researcher. She holds a Ph.D. in computer science and engineering from the University of Washington and a B.A. in mathematics and computer science from Cornell University. Outside of her academic pursuits, she enjoys nature and museums.

Research Areas:
- Machine learning
- Natural language processing

I am an assistant professor in Computer Science (Courant Institute) and Data Science at New York University. I was an assistant professor in the Computer Science department at the University of Texas at Austin from 2020. Before UT, I was a researcher at Google AI in NYC and a Ph.D. student at UW, advised by Luke Zettlemoyer and Yejin Choi.

I enjoy studying real-world language usage with simple and generalizable models. I also build benchmarks that allow us to evaluate NLP models, conduct model analysis, and bring the progress in English NLP to a wider range of languages. Here are research topics that I am currently interested in:

- Continual Learning and Knowledge Editing: While LMs retain vast amounts of world knowledge seen during pretraining, such knowledge can get outdated. I am interested in retrieval augmentation and updating parametric knowledge in LMs.
- Long-form Question Answering: Enabling systems to produce paragraph-level answers opens up possibilities to handle more complicated questions and provide more comprehensive answers. LFQA merges two challenging research areas -- information retrieval and text generation. Furthermore, we have to synthesize information from multiple documents.
- Human-LM Interaction: NLP systems are getting deployed fast and widely. I am interested in improving human interactions with LM, for example, how should we present information such that users will not be misled by plausible yet imperfect model predictions? The deployment of models also creates opportunities to learn from interaction with users.
- Spoken Language Processing: Spoken language exhibits richer prosodic features that are absent in written text. Can we build textless NLP system which can work on speech signals, opening doors to handle languages without written scripts?
    """,
    "Mengye Ren": """
Assistant Professor of Computer Science and Data Science

Bio: Mengye Ren is an assistant professor of computer science and data science at New York University (NYU). Before joining NYU, he was a visiting faculty researcher at Google Brain Toronto. He received Ph.D. in Computer Science from the University of Toronto. He was also a senior research scientist at Uber Advanced Technologies Group (ATG) and Waabi, working on self-driving vehicles. His research focuses on making machine learning more natural and human-like, in order for AIs to continually learn, adapt, and reason in naturalistic environments.

Research Areas:

- Deep learning
- Computer vision
- Machine learning

Research
Areas: machine learning, computer vision, representation learning, meta-learning, few-shot learning, brain & cognitively inspired learning, robot learning, self-driving vehicles

My key research question is: how do we enable human-like, agent-based machine intelligence to continually learn, adapt, and reason in naturalistic environments? I am interested in the emergence of intelligence by learning from a point-of-view experience. Current research topics in my group are:

- Memorization and forgetting in sequentially changing environments

- Visual representation learning in the wild using egocentric videos

- Few-shot learning, reasoning, and abstraction in vision and language

- Human-AI alignment in personalized AI

Recent Talks:

- Lifelong and human-like learning in foundation models
- Visual learning in the open world
    """,
    "Rajesh Ranganath": """
Associate Professor of Computer Science and Data Science

Research Areas:

- Probabilistic modeling
- Approximate inference
- Bayesian nonparametric statistics
- Machine learning for healthcare and causal inference

I am an Assistant Professor at the Courant Institute at NYU in Computer Science and at the Center for Data Science. I am also part of the CILVR group. My research interests include causal, statistical, and probabilistic inference, out-of-distribution detection and generalization, deep generative modeling, interpretability, and machine learning for healthcare. Before joining NYU, I earned degrees in computer science; my PhD was completed at Princeton University working with Dave Blei and my undergraduate was done at Stanford University. I have also spent time as a research affiliate at MIT’s Institute for Medical Engineering and Science.
    """,
    "Tal Linzen": """
Associate Professor of Linguistics and Data Science

Bio: Tal Linzen is an Associate Professor of Linguistics and Data Science at New York University and a Research Scientist at Google. Before moving to NYU in 2020, he was a faculty member at Johns Hopkins University, and before that, a postdoctoral researcher at the École Normale Supérieure in Paris. He received his PhD from NYU in 2015. At NYU, he directs the Computation and Psycholinguistics (CAP) Lab, which studies the connections between machine learning and human language comprehension and acquisition, with a particular focus on understanding and mimicking humans’ ability to learn quickly and generalize effectively. As part of this enterprise, the CAP lab develops novel evaluation and analysis techniques for neural network models, drawing inspiration from linguistics and cognitive science. He has received a Google Faculty Award and a National Science Foundation CAREER award.

Research Areas:

- Natural language processing
- Language models
- Computational cognitive science
- Psycholinguistics

I am an Associate Professor of Linguistics and Data Science at New York University, and a Research Scientist at Google. At NYU, I direct the Computation and Psycholinguistics Lab; we use behavioral experiments and computational methods to study how people learn and understand language. We also develop methods for evaluating, understanding and improving computational systems for language processing.
    """,
    "Kyunghyun Cho": """
Professor of Computer Science and Data Science

Research Areas:

- Machine learning
- Natural language processing (NLP)
- Application of data science in medicine

Bio: Kyunghyun Cho is a professor of computer science and data science at New York University and a senior director of frontier research at the Prescient Design team within Genentech Research & Early Development (gRED). He is also a CIFAR Fellow of Learning in Machines & Brains and an Associate Member of the National Academy of Engineering of Korea. He served as a (co-)Program Chair of ICLR 2020, NeurIPS 2022 and ICML 2022. He is also a founding co-Editor-in-Chief of the Transactions on Machine Learning Research (TMLR). He was a research scientist at Facebook AI Research from June 2017 to May 2020 and a postdoctoral fellow at University of Montreal until Summer 2015 under the supervision of Prof. Yoshua Bengio, after receiving MSc and PhD degrees from Aalto University April 2011 and April 2014, respectively, under the supervision of Prof. Juha Karhunen, Dr. Tapani Raiko and Dr. Alexander Ilin. He received the Samsung Ho-Am Prize in Engineering in 2021. He tries his best to find a balance among machine learning, natural language processing, and life, but almost always fails to do so.
    """,
    "Lerrel Pinto": """
    I am an Assistant Professor of Computer Science at NYU Courant and part of the CILVR group. Before that, I was at UC Berkeley for a postdoc, at CMU Robotics Institute for a PhD, and at IIT Guwahati for undergrad.

Research: I run the General-purpose Robotics and AI Lab (GRAIL) with the goal of getting robots to generalize and adapt in the messy world we live in. Our research focuses broadly on robot learning and decision making, with an emphasis on large-scale learning (both data and models), representation learning for sensory data, developing algorithms to model actions and behavior, reinforcement learning for adapting to new scenarios, and building open-sourced affordable robots. A talk on our recent robotics efforts is here. If you are interested in joining our lab, please read this.

Here are some public talks that cover my recent research:

- (March 2023) CMU RI Seminar - A Constructivist’s Guide to Robot Learning.

- (May 2022) ETHZ Robot Autonomy Seminar - The Surprising Effectiveness of Representations for Robotics.

- (May 2020) MIT EI Seminar - Diverse Data and Efficient Algorithms for Robot Learning.
    """,
    "Pavel Izmailov": """
    I am a Researcher at Anthropic. I am primarily interested in reasoning, AI for science and AI alignment.

Starting in Fall 2025, I will be joining NYU as an Assistant Professor in the Tandon CSE department, and Courant CS department by courtesy. I am also a member of the NYU CILVR Group.

Previously, I worked on reasoning and superintelligent AI alignment at OpenAI.

My research interests are broadly in understanding how deep neural networks work. I am excited about a broad array of topics in core machine learning, including:
- Improving reasoning and problem-solving in AI
- Interpretability of deep learning models, including both large language models and computer vision models
- AI for scientific discovery
- Out-of-distribution generalization and robustness of large-scale models
- Technical AI alignment
- Probabilistic deep learning, uncertainty estimation and Bayesian methods
Recent Highlights
I contributed to the recent OpenAI o1 models, a new state-of-the-art in LLM reasoning. Our work on weak-to-strong generalization was covered by WIRED, MIT Technology Review and others. Our work on Bayesian model selection was recognized with an Outstanding Paper Award 🏆 at ICML 2022!
"""
}

In [None]:
## Utils

In [5]:
def get_author_metadata_path(author):
    assert author in AUTHORS
    return f'{METADATA_DIR}{author.replace(" ", "_")}_query_metadata.json'

def get_author_metadata(author):
    md_path = get_author_metadata_path(author)
    md = fu.load_file(md_path)
    return md

In [7]:
def get_parsed_pdf_path(pdf_name):
    return f'{PARSED_PDF_DIR}{pdf_name}.pkl'

def get_parsed_pdf(pdf_name):
    ppdf_path = get_parsed_pdf_path(pdf_name)
    if os.path.exists(ppdf_path):
        return fu.load_file(ppdf_path)
    return None
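
The path helpers encode a naming convention rather than any lookup logic. A small sketch of that convention, assuming hypothetical relative directories and a made-up PDF name (`example_paper`), with the function names shortened for the example:

```python
# Hypothetical directories for illustration; the notebook uses absolute paths.
METADATA_DIR = "data/pdf_metadata/"
PARSED_PDF_DIR = "data/parsed_pdfs/"


def author_metadata_path(author: str) -> str:
    # Spaces in the author name become underscores in the file name.
    return f'{METADATA_DIR}{author.replace(" ", "_")}_query_metadata.json'


def parsed_pdf_path(pdf_name: str) -> str:
    # Parsed PDFs are stored as pickles named after the PDF.
    return f"{PARSED_PDF_DIR}{pdf_name}.pkl"


print(author_metadata_path("He He"))
# data/pdf_metadata/He_He_query_metadata.json
print(parsed_pdf_path("example_paper"))
# data/parsed_pdfs/example_paper.pkl
```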

In [6]:
def get_author_parsed_papers(author):
    """Return the author's metadata entries that have a parsed PDF on disk."""
    md = get_author_metadata(author)
    pdfs_dict = md['pdfs_metadata']
    parsed_pdfs = []  # list of per-paper dicts, one per parsed PDF
    for pdf_name, pdf_data in pdfs_dict.items():
        ppdf = get_parsed_pdf(pdf_name)
        if ppdf is None:
            # Skip papers whose PDF has not been parsed yet
            continue
        ppdf_dict = {**pdf_data}
        ppdf_dict['parsed_pdf'] = ppdf
        parsed_pdfs.append(ppdf_dict)
    return parsed_pdfs
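
The function above relies on the metadata file having a `pdfs_metadata` mapping keyed by PDF name. The sketch below shows the assumed shape, inferred only from the keys this notebook reads later (`title`, `published`, `authors`, `pdf_link`); all values are made up, and the loader is injected as a hypothetical `load_parsed` argument so the example runs without any files:

```python
# Assumed metadata shape; the entries are invented for illustration.
example_md = {
    "pdfs_metadata": {
        "paper_a": {
            "title": "Paper A",
            "published": "2024-01-01T00:00:00Z",
            "authors": ["A. Author"],
            "pdf_link": "http://example.org/paper_a",
        },
        "paper_b": {
            "title": "Paper B",
            "published": "2024-02-01T00:00:00Z",
            "authors": ["B. Author"],
            "pdf_link": "http://example.org/paper_b",
        },
    }
}


def attach_parsed(md, load_parsed):
    """Same merge as get_author_parsed_papers, with the loader passed in."""
    papers = []
    for pdf_name, pdf_data in md["pdfs_metadata"].items():
        parsed = load_parsed(pdf_name)
        if parsed is None:
            continue  # no parsed version available for this PDF
        papers.append({**pdf_data, "parsed_pdf": parsed})
    return papers


# Only paper_a has a parsed version in this fake store.
fake_store = {"paper_a": ["block 0", "block 1"]}
papers = attach_parsed(example_md, fake_store.get)
print([p["title"] for p in papers])  # ['Paper A']
```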

In [None]:
## Load Parsed, Extract Structure

In [15]:
parsed_pdfs_pavel = get_author_parsed_papers(AUTHORS[0])
print(f">> Number of parsed papers for {AUTHORS[0]}: {len(parsed_pdfs_pavel)}")

>> Number of parsed papers for Pavel Izmailov: 17


In [18]:
test_ppdf = parsed_pdfs_pavel[0]

In [23]:
print(test_ppdf['title'])
print(test_ppdf['published'])
print(test_ppdf['authors'])
print(test_ppdf['pdf_link'])
print(f">> Num blocks in parsed pdf: {len(test_ppdf['parsed_pdf'])}")

Can a Confident Prior Replace a Cold Posterior?
2024-03-02T17:28:55Z
['Martin Marek', 'Brooks Paige', 'Pavel Izmailov']
http://arxiv.org/pdf/2403.01272v1
>> Num blocks in parsed pdf: 19


### Utils

In [157]:
def reconstruct_split_text(split_text, verbose=False):
    buff = io.StringIO()
    for section in split_text:
        if verbose:
            buff.write(f"\n\n ===== Heading: {section['heading']} \n\n")
        buff.write("\n\n".join(section['lines']))
        buff.write("\n\n")
    return buff.getvalue()
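
A quick usage sketch for `reconstruct_split_text`, with the function repeated so the example runs on its own. The two toy sections are invented; they follow the section shape used in this notebook, where the heading line is also the first entry of `lines`:

```python
import io


def reconstruct_split_text(split_text, verbose=False):
    buff = io.StringIO()
    for section in split_text:
        if verbose:
            buff.write(f"\n\n ===== Heading: {section['heading']} \n\n")
        # Lines within a section are separated by blank lines
        buff.write("\n\n".join(section['lines']))
        buff.write("\n\n")
    return buff.getvalue()


sections = [
    {"heading": "Abstract", "lines": ["# Abstract", "We study priors."]},
    {"heading": "1. Introduction", "lines": ["# 1. Introduction", "Intro text."]},
]
text = reconstruct_split_text(sections)
print(text)
```

With `verbose=True` each section is additionally preceded by a `===== Heading: ...` marker, which can help when checking where section boundaries fell.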

In [107]:
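def split_by_heading(text, title):
    """Split one parsed block into sections keyed by their '# ' heading lines."""
    tls = text.split("\n")
    title_str = f"# {title}"
    sections = []

    start_tl_strip = tls[0].strip()
    # startswith() also handles an empty first line, where indexing [0] would raise
    if title_str in start_tl_strip or not start_tl_strip.startswith("#"):
        # No heading of its own: the block continues the previous one.
        # Note that the first line itself is not kept in this case.
        start_heading = "From Previous Block"
        start_lines = []
    else:
        start_heading = start_tl_strip[2:]
        start_lines = [start_tl_strip]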
        
    curr_section = {"heading": start_heading, "lines": start_lines}
    for tl in tls[1:]:
        tl_strip = tl.strip()
        if len(tl_strip) == 0 or tl_strip[0].isdigit():
            continue

        if tl_strip[0] == "#":
            # Drop all header mentions of the title, we'll add it back in
            if title_str in tl_strip:
                continue
            # Otherwise start a new section
            sections.append(curr_section)
            curr_section = {"heading": tl_strip[2:], "lines": []}
        curr_section['lines'].append(tl_strip)
    
    sections.append(curr_section)
    return sections
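
To make the splitting rules concrete, here is a worked example on a made-up block. It uses a self-contained copy of the function (with `startswith` for the first-line check) and an invented title, `My Paper`:

```python
def split_by_heading(text, title):
    tls = text.split("\n")
    title_str = f"# {title}"
    sections = []

    start = tls[0].strip()
    if title_str in start or not start.startswith("#"):
        curr = {"heading": "From Previous Block", "lines": []}
    else:
        curr = {"heading": start[2:], "lines": [start]}

    for tl in tls[1:]:
        tl_strip = tl.strip()
        # Skip blank lines and lines that start with a digit
        if len(tl_strip) == 0 or tl_strip[0].isdigit():
            continue
        if tl_strip[0] == "#":
            if title_str in tl_strip:
                continue  # repeated title header
            sections.append(curr)
            curr = {"heading": tl_strip[2:], "lines": []}
        curr["lines"].append(tl_strip)

    sections.append(curr)
    return sections


text = "\n".join([
    "# My Paper",
    "# Abstract",
    "We study things.",
    "",
    "3",
    "# 1. Introduction",
    "Some intro text.",
])
for s in split_by_heading(text, "My Paper"):
    print(s["heading"], "|", s["lines"])
# From Previous Block | []
# Abstract | ['# Abstract', 'We study things.']
# 1. Introduction | ['# 1. Introduction', 'Some intro text.']
```

The heading line itself is kept as the first entry of `lines`, which is why a section with one paragraph reports two lines in the listings further down. The digit check drops the stray `3` here, but it would equally drop any body line that begins with a digit.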

In [137]:
def get_all_sects(input_ppdf, input_title):
    all_sects = []
    for i, block in enumerate(input_ppdf):
        sects = split_by_heading(block.text, input_title)
        if i == 0:
            # Drop the title section
            all_sects.extend(sects[1:])
        else:
            all_sects.extend(sects)
    return all_sects

In [135]:
def group_sections(sections):
    grouped_sections = []

    figs = []
    last_was_fig = False
    for section in sections:
        if len(section['lines']) == 0:
            continue
            
        heading = section['heading']
        
        # For ease of reading split the starting case out
        if len(grouped_sections) == 0:
            grouped_sections.append({
                'heading': heading,
                'lines': [],
            })
            
        if heading.startswith("Figure"):
            figs.append(section)
            last_was_fig = True
            continue

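        if last_was_fig:
            last_was_fig = False
            # Empty sections were skipped above, so these checks only guard
            # against malformed input; the section is shown on failure.
            assert len(section['lines']) > 0, section
            assert len(section['lines'][0]) > 0, section
            if section['lines'][0][0].islower():
                # A lowercase first line right after a figure is treated as a
                # continuation of the text that preceded the figure
                first_l = f"{section['heading']} {section['lines'][0]}"
                grouped_sections[-1]['lines'].append(first_l)
                grouped_sections[-1]['lines'].extend(section['lines'][1:])
                continue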
        
        if (heading != "From Previous Block" and
            grouped_sections[-1]['heading'] != heading
        ):
            grouped_sections.append({
                'heading': heading,
                'lines': [],
            })
        grouped_sections[-1]['lines'].extend(section['lines'])    
    return grouped_sections, figs
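
A small sketch of what the grouping does on invented input: empty sections are dropped, figure captions are collected separately, and "From Previous Block" sections are folded into the preceding group. The function below is a condensed copy without the sanity checks, written so the example is self-contained:

```python
def group_sections(sections):
    grouped, figs = [], []
    last_was_fig = False
    for section in sections:
        if not section["lines"]:
            continue  # drop empty sections
        heading = section["heading"]
        if not grouped:
            grouped.append({"heading": heading, "lines": []})
        if heading.startswith("Figure"):
            figs.append(section)  # captions are collected separately
            last_was_fig = True
            continue
        if last_was_fig:
            last_was_fig = False
            if section["lines"][0][0].islower():
                # Lowercase first line: continue the previous group
                first_l = f"{heading} {section['lines'][0]}"
                grouped[-1]["lines"].append(first_l)
                grouped[-1]["lines"].extend(section["lines"][1:])
                continue
        if heading != "From Previous Block" and grouped[-1]["heading"] != heading:
            grouped.append({"heading": heading, "lines": []})
        grouped[-1]["lines"].extend(section["lines"])
    return grouped, figs


toy = [
    {"heading": "From Previous Block", "lines": []},
    {"heading": "Abstract", "lines": ["# Abstract", "We study priors."]},
    {"heading": "Figure 1. Accuracy.", "lines": ["# Figure 1. Accuracy."]},
    {"heading": "1. Introduction", "lines": ["# 1. Introduction", "Intro text."]},
    {"heading": "From Previous Block", "lines": ["More intro text."]},
]
grouped, figs = group_sections(toy)
print([(g["heading"], len(g["lines"])) for g in grouped])
# [('Abstract', 2), ('1. Introduction', 3)]
print([f["heading"] for f in figs])
# ['Figure 1. Accuracy.']
```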

In [146]:
def ppdf_to_body_refs_figs(input_ppdf):
    all_s = get_all_sects(input_ppdf['parsed_pdf'], input_ppdf['title'])
    print(f">> There are {len(all_s)} sections total.")

    grouped_s, figs_s = group_sections(all_s)
    print(f">> There are {len(grouped_s)} grouped sections and {len(figs_s)} figures.")

    body_s = []
    references = None
    for s in grouped_s:
        if 'References' in s['heading']:
            references = s
            break
        body_s.append(s)
    return body_s, figs_s, references
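
The loop at the end of `ppdf_to_body_refs_figs` keeps everything before the first heading containing `References` as body and discards whatever follows it, which in practice means appendices. A minimal sketch of just that step, using a hypothetical helper name and invented headings:

```python
def split_body_and_references(grouped):
    """Hypothetical helper isolating the body/references split."""
    body, references = [], None
    for s in grouped:
        if "References" in s["heading"]:
            references = s
            break  # sections after References are not returned
        body.append(s)
    return body, references


toy_grouped = [
    {"heading": "Abstract", "lines": ["# Abstract"]},
    {"heading": "9. Discussion", "lines": ["# 9. Discussion"]},
    {"heading": "References", "lines": ["# References", "Some citation."]},
    {"heading": "A. Model comparison", "lines": ["# A. Model comparison"]},
]
body, refs = split_body_and_references(toy_grouped)
print([s["heading"] for s in body])  # ['Abstract', '9. Discussion']
print(refs["heading"])               # References
```

If a paper has no heading containing `References`, the references value stays `None` and every grouped section ends up in the body.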

## Test Full Flow

In [147]:
bd_s, fg_s, rfs = ppdf_to_body_refs_figs(test_ppdf)

>> There are 76 sections total.
>> There are 49 grouped sections and 8 figures.


In [161]:
# print(reconstruct_split_text(bd_s + fg_s))
# rfs

In [148]:
for gt in bd_s:
    print(f"{len(gt['lines']):2} | {gt['heading']}")

 2 | Abstract
 2 | 1. Introduction
10 | In a regression setting
 2 | 2. Background
 7 | 2.1. Bayesian neural networks
 3 | 2.3. What is wrong with cold posteriors?
 1 | 3. Related work
 2 | Cold posteriors and data augmentation.
 3 | Label noise.
 2 | Prior misspecification.
 5 | Improved prior distributions.
 4 | 4. Confidence of a Normal prior
 6 | 5. Dirichlet prior
19 | log density
 1 | Train accuracy
22 | Test accuracy
 3 | 7. Clipped Dirichlet prior
 2 | 7.1. Training stability
12 | 1.0
22 | 8. Confidence priorpost. density
 6 | 9. Discussion
 2 | Summary
 2 | Impact statement
 2 | Acknowledgements


### Sub Section Tests

In [138]:
all_sections_test = get_all_sects(test_ppdf['parsed_pdf'], test_ppdf['title'])
print(f">> There are {len(all_sections_test)} sections total")

>> There are 76 sections total


In [139]:
grouped_test, figs_test = group_sections(all_sections_test)
print(f">> There are {len(grouped_test)} grouped sections")

>> There are 49 grouped sections


In [145]:
for gt in grouped_test:
    print(f"{len(gt['lines']):3} | {gt['heading']}")

  2 | Abstract
  2 | 1. Introduction
 10 | In a regression setting
  2 | 2. Background
  7 | 2.1. Bayesian neural networks
  3 | 2.3. What is wrong with cold posteriors?
  1 | 3. Related work
  2 | Cold posteriors and data augmentation.
  3 | Label noise.
  2 | Prior misspecification.
  5 | Improved prior distributions.
  4 | 4. Confidence of a Normal prior
  6 | 5. Dirichlet prior
 19 | log density
  1 | Train accuracy
 22 | Test accuracy
  3 | 7. Clipped Dirichlet prior
  2 | 7.1. Training stability
 12 | 1.0
 22 | 8. Confidence priorpost. density
  6 | 9. Discussion
  2 | Summary
  2 | Impact statement
  2 | Acknowledgements
 38 | References
  9 | Appendix outline
  3 | A. Model comparison
  1 | B. Further discussion of DirClip prior
 23 | B.1. Clipping value is reached
  8 | B.2. Low likelihood
  2 | B.3. Fine-tuning has converged
  3 | D. Proof that confidence prior converges to a cold likelihood
 17 | Tempered categ. likelihood
  3 | E.2. Proof that Dirichlet diverges
 16 | F. Di

## Putting it All Together

In [180]:
def make_author_page(author):
    bio = AUTHOR_INFO[author]
    
    buff = io.StringIO()
    buff.write(f"# Research Summary for {author}\n\n")
    buff.write(f"## Bio\n{bio}\n\n")
    

    buff.write("## Recent Papers\n\n")
    parsed_pdfs_author = get_author_parsed_papers(author)
    for ppdf in parsed_pdfs_author:
        buff.write(f"# Title: {ppdf['title']}\nPublished: {ppdf['published']}\n")
        buff.write("Authors: " + ", ".join(ppdf['authors']) + "\n\n")

        bd_s, fg_s, rfs = ppdf_to_body_refs_figs(ppdf)
        # References (rfs) are left out of the page for now
        buff.write(reconstruct_split_text(bd_s + fg_s))
        # Trailing newlines keep the next paper's title on its own line
        buff.write(f"\n\n -------------- End Paper: {ppdf['title']}\n\n")
    return buff.getvalue()

In [181]:
bio_and_one_paper = make_author_page(AUTHORS[0])

>> There are 76 sections total.
>> There are 49 grouped sections and 8 figures.
>> There are 144 sections total.
>> There are 115 grouped sections and 19 figures.
>> There are 95 sections total.
>> There are 67 grouped sections and 7 figures.
>> There are 120 sections total.
>> There are 96 grouped sections and 21 figures.
>> There are 103 sections total.
>> There are 92 grouped sections and 4 figures.
>> There are 145 sections total.
>> There are 134 grouped sections and 6 figures.
>> There are 90 sections total.
>> There are 88 grouped sections and 1 figures.
>> There are 25 sections total.
>> There are 23 grouped sections and 1 figures.
>> There are 184 sections total.
>> There are 140 grouped sections and 11 figures.
>> There are 155 sections total.
>> There are 141 grouped sections and 10 figures.
>> There are 114 sections total.
>> There are 104 grouped sections and 6 figures.
>> There are 103 sections total.
>> There are 72 grouped sections and 5 figures.
>> There are 47 section

In [183]:
fu.dump_file(bio_and_one_paper, '/Users/daniellerothermel/drotherm/data/pavel_izmailov_summary_markdown.txt')

True

In [184]:
for author in AUTHORS:
    author_page = make_author_page(author)
    out_name = f'{author.replace(" ", "_").lower()}_summary_markdown.txt'
    fu.dump_file(author_page, f'/Users/daniellerothermel/drotherm/data/{out_name}', verbose=True)

>> There are 76 sections total.
>> There are 49 grouped sections and 8 figures.
>> There are 144 sections total.
>> There are 115 grouped sections and 19 figures.
>> There are 95 sections total.
>> There are 67 grouped sections and 7 figures.
>> There are 120 sections total.
>> There are 96 grouped sections and 21 figures.
>> There are 103 sections total.
>> There are 92 grouped sections and 4 figures.
>> There are 145 sections total.
>> There are 134 grouped sections and 6 figures.
>> There are 90 sections total.
>> There are 88 grouped sections and 1 figures.
>> There are 25 sections total.
>> There are 23 grouped sections and 1 figures.
>> There are 184 sections total.
>> There are 140 grouped sections and 11 figures.
>> There are 155 sections total.
>> There are 141 grouped sections and 10 figures.
>> There are 114 sections total.
>> There are 104 grouped sections and 6 figures.
>> There are 103 sections total.
>> There are 72 grouped sections and 5 figures.
>> There are 47 section