
### 👋 Welcome to the technical screening for the **Research Engineer** position at Uryclea.

The research engineer role requires a broad range of engineering skills and **this screening tests for two of them: ML research engineering acumen and a product development mindset** when it comes to research tooling for AI product development.

Another very important aspect of this role is reviewing and improving existing code and providing feedback and input to engineering teams. This aspect is not the primary focus of the current test but will be tested in the future during the work trial.


### Instructions

1. Make a copy of this notebook.
2. Read through the tasks and rules below and plan ahead. You can tackle the tasks in whichever order you prefer.
3. Fill in your solutions directly in this notebook. The places to fill are marked with "✍️".
4. When you're finished, please share the notebook with the reviewing team (see instructions at the bottom).

If you encounter any issues during the test, please contact us in the private Slack channel.

Good luck!


---
*Note: In the invitation e-mail we asked you to:*
- *use Colab if you do not have Jupyter Notebook support.*
- *sign up for Colab Pro if you aim to train or finetune models.*

*This is purely precautionary. It is entirely possible to do this task without either of these. So, please do not worry if you end up not relying on them in your solution.*

---

### Task Overview

1. **ML Engineering**:
   - Topic: Parsed financial data is brittle and messy, and this easily undermines retrieval systems. It is very easy to build a prototype and very hard to build a production-level system.
   - Goal: Implement a simple MVP of how parsing and processing such data works in practice and add your thoughts on how to extend it.
   - See below for detailed instructions.

2. **Product Development / Research Acceleration**:
   - Topic: Accelerating research for downstream usage.
   - Goal: Propose ideas that could accelerate research for products being developed downstream.
   - See below for detailed instructions.

  *While both tasks are somewhat open-ended, we **expect task 1 to be significantly more time-demanding**. You're free to prioritize as you wish, of course, but by default we suggest allotting at least 2/3 of your time to task 1.*

### Rules

- Please don't update your notebook after submission.
- Please don't share your notebook with anyone else.
- While we of course encourage you to use AI assistance during your daily work, we also want to understand your non-AI-mediated thinking and writing. **Please do not use AI assistance during this work test, with the exceptions below**.
- Allowed tools:
  - Internet search for code documentation or snippets is fine.
  - Autocomplete inside Google Colab is fine.
  - Using AI assistants for dataset creation is fine (but should be commented on in the code).

Thanks for respecting the process!

In [3]:
# ✍️ Please agree to the above rules by filling in your name and e-mail below
name = ""
email = ""
if name and email:
  print(f"🤗 Thank you, {name} ({email}), for consenting to our rules!")
else:
  print("🖐️ Please fill in your name and e-mail to proceed.")


🖐️ Please fill in your name and e-mail to proceed.


# Task 1: ML Research Engineering

Estimated Time: 40 minutes to 4 hours.

## Objective

Implement a simplified version of a **retrieval system processing HTML text**. This task involves selecting appropriate tools, preparing datasets, implementing an evaluation process, and analyzing the results.


## Background

A [Form 8-K](https://www.investor.gov/introduction-investing/investing-basics/glossary/form-8-k#:~:text=Form%208%2DK%20is%20known,that%20triggers%20the%20filing%20requirement) contains exhibits as part of the filing. See [here](https://www.sec.gov/ix?doc=/Archives/edgar/data/40729/000119312524271543/d916570d8k.htm) for an example.

These exhibits can come in any shape or form. We can obtain these forms and exhibits in an HTML format. However, the path from this to *talking to your HTML* is not as straightforward.

Important:

*   You are free to deviate from the specific approach in the post.
*   We are looking for an MVP of a simple retrieval system.

We expect this task to be hard to fully solve in the allotted time. So please don't panic and just see how far you get. Good luck!

## Instructions

The main task components are:
1. Parse HTML files.
2. Prepare and process text and vector embeddings for archiving into a vector database.
3. Design a retrieval system that answers users' questions about the exhibits.
4. Evaluate the system's outputs on sample queries.
5. Provide brief thoughts on scaling to alternative settings and challenges.
6. *Optional; only if time allows: Attempt extensions such as finetuning models and agentic parsing/retrieval.*

## YOUR SOLUTION

*✍️ Fill in your solution below*

*Feel free to add your thinking / observations / where you got stuck directly as comments in your code*

### Data Preparation

We provide a sample of parsed HTML files in the `html_exhibits` directory. Import them here and examine the text carefully.
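
As a starting point only, here is a minimal sketch of loading the files for inspection. It assumes the exhibits carry a `.htm` or `.html` extension and are UTF-8 encoded; neither is guaranteed by this notebook, and the helper name `load_html_files` is hypothetical. Feel free to discard it and load the data your own way.

```python
from pathlib import Path


def load_html_files(directory):
    """Return {file name: raw HTML string} for every .htm/.html file in `directory`.

    Assumption: files are UTF-8; undecodable bytes are replaced rather than raising.
    """
    docs = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in {".htm", ".html"}:
            docs[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return docs


# Only runs when the directory is present next to the notebook.
if Path("html_exhibits").is_dir():
    exhibits = load_html_files("html_exhibits")
    for name, raw in list(exhibits.items())[:3]:
        # Print the size and the first 200 characters of a few files.
        print(name, len(raw), raw[:200].replace("\n", " "))
```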


In [None]:
# ✍️ 

### Parse and Process Text

Now, import and use tools to parse and process these HTML files into text that a retrieval system can use.
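
For orientation, the sketch below shows one possible baseline using only the standard library's `html.parser`. It is not the expected solution: the class `TextExtractor`, the function `html_to_text`, and the choice of which tags count as block-level are all illustrative assumptions. It drops `<script>` and `<style>` content, puts block elements on their own lines, and separates table cells with ` | ` so that rows stay readable.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags start a new line in the extracted text.
BLOCK_TAGS = {"p", "div", "br", "tr", "li", "table", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring script and style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag in {"td", "th"}:
            self.parts.append(" | ")  # keep table cells visually separated

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Convert an HTML string to plain text with one block element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).replace("\xa0", " ")  # normalize non-breaking spaces
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

A baseline like this loses layout information such as merged cells and nested tables, which is part of what makes real exhibits hard to parse.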

In [None]:
# ✍️ 

### Archive Embeddings

Next, using an embedding model, archive the text parsed above into a vector store of your choice. Justify your choices.

In [None]:
# ✍️ 

### Retrieval System

Design a retrieval system that can answer users' questions about the exhibits.

In [None]:
# ✍️ 

### Evaluation

Come up with sample queries and **evaluate** the performance of the retrieval system.
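
If you label which chunks are relevant to each sample query, two common ranking metrics are recall@k and reciprocal rank. The sketch below is one possible way to compute them; the function names and the idea of identifying chunks by string IDs are assumptions for illustration, and you are free to evaluate in any other way you can justify.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs that appear among the top-k retrieved IDs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID in the ranking, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical example: ranked chunk IDs returned for one query, plus hand labels.
ranking = ["chunk_a", "chunk_b", "chunk_c"]
labels = ["chunk_b", "chunk_d"]
print(recall_at_k(ranking, labels, k=2), reciprocal_rank(ranking, labels))
```

Averaging the reciprocal rank over all sample queries gives the mean reciprocal rank (MRR).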

In [None]:
# ✍️ 

## Analysis and Reflection

✍️ *Briefly (at most 2-3 sentences / bullet points each) discuss the following questions, filling in your answers directly below.*

***Note: you do not need to repeat anything that you already commented on in your solution above.***

- How might you expand this approach to arbitrary exhibits, spreadsheets, and PDFs with financial tables?


- How might you compare the effectiveness of different parsing, embedding, and retrieval methods?
    
    
- Ideas for efficiently redoing the architecture design?


- Other potential baselines or evaluation methods you would implement with more time.

### Optional Extensions of Your Solution (only if time permits)

In [5]:
# ✍️ If time permits (optional): Experiment with redoing the retrieval system.

# Task 2: Accelerating AI Research Engineering and Synthesis via Product Discovery

Estimated Time: 20 minutes to 2 hours.

## Objective

The goal of this task is to **propose ways for a research engineer to work more efficiently with respect to the tasks that a product team carries out**. We'd like you to:

1. Consider common challenges research engineers might face in reviewing product specs from an engineering team in the PRD process.
2. Propose a process that could help researchers more effectively discover, organize, or draw insights from relevant specs to propose new initiatives.
3. Briefly describe how you would prototype and test this (technical/non-technical) process.

*Note that this is a purposefully vague task. In fact, the most fun part about this role is that you get to define it, so we want you to go wild with your imagination here :-)*

## Deliverable

Write a short proposal (3-4 paragraphs) that addresses one or more challenges in a way that goes beyond existing approaches. You may want to start by reflecting on (a) which challenge you would like to prioritize, (b) how current approaches geared to software engineers (if any) might fall short in this role, and (c) how you might use AI to build something fundamentally better.

Aim to write something that would convince an AI startup to dedicate resources to further developing and implementing your idea.

## Your Proposal

*✍️ Fill in your proposal here*

# Submission Instructions

Congratulations on finishing the technical screening!

**To submit your work please notify us in the private Slack channel and share this notebook with the following accounts for review:**

aman@uryclea.com, parik@uryclea.com, founders@uryclea.com.

**Note: Please do not share this notebook with anyone else. Thank you!**



---
---