# core

> Here's where the main `formalyzer` workflow is defined.

```markdown
formalyzer: 

Reads a PDF recommendation letter, fills in admissions form(s)

usage: 
  formalyzer <recc_letter.pdf> <url_list.txt>

Instead of url_list.txt, a single URL can be given (esp. for testing purposes) 

Description: 
Formalyzer will scrape the text from the PDF recc letter, 
and for each URL in url_list, it will: 
- launch a browser tab for that url 
- fill in the form using what the LLM has gleaned from the recc letter
- attach the PDF via the form's upload/attachment button
...and do no more. 
The user will need to review the page and press the Submit button manually.


Requirements: 
- Playwright 
- ANTHROPIC_API_KEY env var. (Could support other LLMs later)
- pypdf  

Author: Scott H. Hawley, @drscotthawley
```



In [None]:
#| default_exp core

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| export
import os 

def read_urls_file(urls_file:str) -> list:
    "read a text file where each line is a url of a submission site" 
    with open(os.path.expanduser(urls_file)) as f:
        return f.read().splitlines()

In [None]:
urls = read_urls_file("~/recc_urls.txt") 
urls

['https://mx.technolutions.net/ss/c/u001.Jg_sqgPJs1lCjHcCRsBB8SQ8i7hVV8mUmfF3Ua3b3jA7jTKcZ8CEJmZqPMpTVFlQs4qGqBuZuPykKKVRr1AWA0mxAmwUbeeuxsixKZJEz_g/4ls/LUSQagWaTLqWrYPg-MeXgA/h2/h001.W3ldv8fA9NqIuPtRb5trEubIT5nGeGc8_L8S3hDU1dg',
 'https://apply.grad.ucla.edu/apply/refer?key=0229711527540226',
 'https://grad.apply.colorado.edu/apply/refer?key=1906989233732298',
 'https://insight.uoregon.edu/apply/refer?key=0001235261988027',
 'https://www.graddiv.ucsb.edu/eapp/lor/Recommender.aspx?guid=913ed35d-448e-4ec3-8760-60062a082923',
 'https://apply.gsas.columbia.edu/apply/refer?key=9864322409134170',
 'https://gradadmit.wustl.edu/apply/refer?key=8586723267110540',
 'https://apply.grad.uci.edu/apply/refer?key=2554401942524274',
 'https://gradapply.wisc.edu/apply/refer?key=5578605407822667',
 'https://gradapp.berkeley.edu/apply/refer?key=0629559341024567',
 'https://apply.grad.uw.edu/apply/refer?key=2935558351001331',
 '']
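The list above ends with an empty string, which suggests the URL file contains a blank line. Opening a browser tab for `''` would not be useful, so here is a minimal sketch of a variant that skips blank lines and trims whitespace. The name `read_urls_clean` is hypothetical and not part of the exported module.

```python
import os

def read_urls_clean(urls_file: str) -> list[str]:
    "Hypothetical variant of `read_urls_file` that drops blank lines."
    with open(os.path.expanduser(urls_file)) as f:
        # Strip surrounding whitespace, then keep only non-empty lines
        return [line.strip() for line in f if line.strip()]
```

With this variant, a file that has blank lines between or after the URLs yields only the URLs themselves.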

In [None]:
#| export
from pypdf import PdfReader
import logging
logging.getLogger("pypdf").setLevel(logging.ERROR)

def read_pdf_text(pdf_file:str) -> str:
    "Extract the text of every page in `pdf_file`, joined with newlines."
    reader = PdfReader(os.path.expanduser(pdf_file))
    return "\n".join(page.extract_text() for page in reader.pages)

In [None]:
letter_text = read_pdf_text("~/recc_letter.pdf")
letter_text 

'  Dear Graduate Admissions Committee,  It is with absolute confidence that I strongly recommend Brody Blackwood for admission to your program. He is extremely well-suited for graduate work and ranks within the top 2% of students I have worked with in my 19 years of undergraduate teaching.  It has been my great pleasure to work with Brody as a one-on-one research supervisor, as well as an instructor in two “upper-level physics” courses: “PHY3110 – Analytical Mechanics,” “PHY4410 – Survey of Advanced Physics” and “PHY4420 – Deep Learning and AI Ethics.”   He is a rare student who excels in everything he does. His work in my courses was excellent, and in his research, he has been conscientious and motivated.  He is easy-going, gets along well with other students, and works well either individually or in team settings.  Quite simply, Brody works: he is one of the rare students who does what you tell him to do, on time, and does it very well. His homework was always fully complete, well-wr

Okay, we've got the basic functionality where we can read in the letter as $letter_text and the url list as $urls. 
Now, what we want to do is pick the first URL and try to open a browser tab at that URL using Playwright. Ideally we'd be doing this from my laptop browser, not your server, so we might need to do some fancy stuff like SSH tunneling. What do you think? Refer to the lesson transcript above where Jeremy was using Playwright to control his browser. 

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

Great question! Yes, to control your *local* browser (where you're logged in, have cookies, etc.), you'll need the approach Jeremy showed in the transcript. Here's the game plan:

**The key pieces:**

1. **Start Chrome with remote debugging enabled** on your laptop - Chrome needs to be launched with `--remote-debugging-port=9222` and (importantly, as Jeremy discovered) a custom `--user-data-dir` because Chrome now blocks debugging on the default profile for security.

2. **SSH tunnel** - Forward port 9222 from solveit to your laptop so Playwright on solveit can connect to your local Chrome.

3. **Connect via Playwright's CDP** - Use `playwright.chromium.connect_over_cdp()` to attach to your running browser.
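
To make the three pieces concrete, here's a rough sketch under some assumptions: the helper names (`chrome_debug_command`, `ssh_forward_command`, `open_tab`) are made up for illustration, the Chrome binary path depends on your OS so it is passed in, and the SSH host and port depend on how your tunnel is set up. Playwright is imported inside `open_tab` so the helpers can be used even where Playwright isn't installed.

```python
def chrome_debug_command(chrome_path: str, user_data_dir: str, port: int = 9222) -> list[str]:
    "Step 1: argv for launching Chrome with remote debugging and a custom profile."
    return [chrome_path, f"--remote-debugging-port={port}", f"--user-data-dir={user_data_dir}"]

def ssh_forward_command(laptop: str, port: int = 9222, ssh_port: int = 22) -> list[str]:
    "Step 2: argv for an ssh local forward, run on the server, to the laptop's debug port."
    # -N: no remote command; -L: forward local `port` to localhost:`port` on the laptop
    return ["ssh", "-N", "-p", str(ssh_port), "-L", f"{port}:localhost:{port}", laptop]

def open_tab(url: str, port: int = 9222) -> str:
    "Step 3: attach to the already-running Chrome over CDP and open `url` in a new tab."
    from playwright.sync_api import sync_playwright  # lazy import; needs `pip install playwright`
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(f"http://localhost:{port}")
        # Reuse the existing browser context (cookies, logins) if there is one
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        page.goto(url)
        # Return the title as a quick check that the connection worked
        return page.title()
```

You could pass the first two results to `subprocess.Popen` on the appropriate machine, then call `open_tab(urls[0])`. One caveat: Playwright's sync API refuses to run inside an already-running asyncio loop, so inside a notebook you would likely need `playwright.async_api` instead.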

Before we dive in, a few questions:

- **What OS is your laptop?** (Mac, Windows, Linux?) This affects how we launch Chrome with the right flags.
- **Do you already have SSH tunneling set up to your laptop** (like bore.pub or ngrok as shown in the transcript), or do we need to set that up first?

If you don't have the tunnel yet, we'd first need to get something like bore or ngrok running so solveit can SSH to your machine. Let me know where you're at and we can take the next step!

# `formalyzer` CLI script

In [None]:
#| export
import os 
from fastcore.script import call_parse

@call_parse
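def main(
    pdf_path: str,        # Path to the recommendation letter PDF
    urls: str,            # Path to a text file of URLs (one per line), or a single URL
    debug: bool = False,  # Print the parsed URLs and letter text
):
    "Read the recommendation letter PDF and the list of form URLs."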
    pdf_path = os.path.expanduser(pdf_path) 
    assert os.path.exists(pdf_path), f"File not found: {pdf_path}"
    if os.path.exists(os.path.expanduser(urls)): 
        if debug: print(f"File {urls} exists. Reading.")
        urls = read_urls_file(urls)
    else: 
        print(f"No file {urls}. Treating it as a single url") 
        urls = [urls] 
    if debug: print("urls =\n",urls)
    letter_text = read_pdf_text(pdf_path)
    if debug: print("letter_text =\n",letter_text) 

In [None]:
main("~/recc_letter.pdf","~/recc_urls.txt")
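
Since `main` treats any `urls` argument that is not an existing file as a single URL, a mistyped filename would silently be passed along as a URL. A minimal sketch of a sanity check that could guard that branch follows; `looks_like_url` is a hypothetical helper, not part of the exported module.

```python
from urllib.parse import urlparse

def looks_like_url(s: str) -> bool:
    "Hypothetical check: True if `s` has an http(s) scheme and a host."
    parts = urlparse(s)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

In `main`, the `else` branch could assert `looks_like_url(urls)` before wrapping the argument in a list, so a typo in the file path fails early instead of reaching the browser.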

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()