-
-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v0 pdf rendering #364
base: main
Are you sure you want to change the base?
v0 pdf rendering #364
Conversation
@shreevatsa @akprasad Let me know if you have any suggestions for improvement. As @akprasad knows, I'm rather new to the building things business; most of my past work has been analytics, so I'm happy to get feedback! |
Thanks, this is a start! One way to put off thinking about deployment etc is to simply generate a In fact, we could put it off for quite a while: could simply generate the |
Ah yes, that makes sense w.r.t. deployment. I think we can think about the S3 bucket + Overleaf API to do the compilation. But that's for later. I'll start testing this weekend. See if I can get a good number of the text we have up working. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main comment is that there's an easier way to manage the template logic by using Jinja, which is what we use for our HTML. Jinja gives us:
- variable substitutions
- loops and conditionals
- macros
So it's well-suited to this use case. Here's a minimal example of how to use it:
from jinja2 import Environment
env = Environment()
template = env.from_string("This is {{foo}}")
print(template.render(foo="my foo"))
If Jinja's use of {}
characters is confusing given the TeX context, you can pass your own overrides:
env = Environment(
block_start_string='<%',
block_end_string='%>',
variable_start_string='<<',
variable_end_string='>>',
comment_start_string='<#',
comment_end_string='#>')
render_pdf/tex_files/runscript
Outdated
@@ -0,0 +1,4 @@ | |||
xelatex template.tex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be an option to suppress these output files. I think this is better in case we end up deleting other files called .out, .log, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xelatex
itself doesn't have an option for cleaning up, but look into invoking it via say latexmk
(it has -c
for clean IIRC) or arara
: its documentation mentions a directive like
% arara: clean: { extensions: [ aux, log ] }
— other options I haven't looked at are latexrun
or (further afield) tectonic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, and I need xelatex for babel/font support. I'll fix this later; I'd rather not waste time now for something that has a clear/simple solution. And it's also not the end of the world if I leave the log files too. I think
- we can process a pdf, store it in a bucket to cache, and
- overwrite it for the next run of render_pdf.
They're fairly small files, so it's not too large a concern.
Ohhh I didn't realize jinja could be used for this. Thank you! |
@shreevatsa I've converted the code to a jinja template.
|
Should I focus on:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is ready for review or not, so I kept it fairly high-level
render_pdf/src/parser.py
Outdated
@@ -0,0 +1,129 @@ | |||
from lxml import etree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of our transitive dependencies uses lxml, but let's use xml.etree.ElementTree
instead, which has 99% of the functionality and is also in the standard lib. grep in ambuda
for examples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not rewrite; any specific reason why you want to move to lxml? If possible, I'll leave this rewrite for another PR. @akprasad
# God this is ugly # TODO | ||
try: | ||
return root.find(tagname("head")).text | ||
except: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no bare except
-- generally good practice to catch a specific exception like except AttributeError
.find(tagname("fileDesc")) \ | ||
.find(tagname("titleStmt")) | ||
return tag.find(tagname("title")).text | ||
except: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no bare except here
@skmnktl About supporting more texts v/s getting the pdf to look nicer, it's up to you (whatever you're more motivated to do), though if you're neutral between the two options, then I'd say that getting the PDF to look nicer can be done indefinitely, after a first version gets in. Also consider a third option, that of cleaning up the PR for production. I could think of a sequence of PRs like, say:
But could also think of doing it in other orders. BTW what did you mean by "halting latex compile messages"? |
@shreevatsa For the tasks outlined:
This is what make_latex.py does. I'm writing the resultant
Do we want to wait on this? Should I use AWS? Or perhaps a Google Cloud Storage Bucket? Or perhaps a mongo instance? If the main app will move to a more permanent storage, I can just piggyback off that. If you have preferences on what that storage will be in the future, we can do that as well.
I'm hesitant to write this since the storage decisions have yet to be made. We could go the render on the fly every time route, and for that I can make a FastAPI endpoint. There's not that many texts right now anyway. |
@skmnktl Hi, to be clear, by "sequence of PRs" I meant doing each of those things in a separate PR (so to answer your latter two questions, yes definitely wait on storing in the cloud, and don't write the app now) — I was actually talking about removing stuff from this PR (or moving to another one), to keep it minimal and ready to merge. :-) So this PR would have only 3 files:
if I understand correctly. (Because In the meantime I'll download and compile |
@skmnktl I looked into Let's start with a template where we understand why each line is needed, and which compiles without any errors or warnings, and add things as needed (e.g. the current template has a line or two of math-related stuff that I'm sure we won't need for a long time if ever). The file compiles fine without any errors or warnings if I replace lines 1–53 of
or, if you want to keep
Feel free to add your font choices back in later. There were also some errors from Babel's Sanskrit setup; we don't seem to need it for this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments and I'm yet to take a closer look atparser.py
, but this is actually looking quite reasonable and close to merging after some small changes — thanks for working on this! We can continue improving after having a first version.
@shreevatsa I took your comments above and did a few things with my last commit:
|
@skmnktl Great! I still see the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass. Also, feel free to move this to ambuda/render_pdf
so that it's easier to reuse and share utils with the rest of the project.
@@ -0,0 +1,80 @@ | |||
from typing import Union | |||
import parser | |||
import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Run isort render_pdf/**/*.py
to standardize and reorder imports
except: | ||
try: | ||
root = alt_root | ||
tag = root.find(tagname("teiHeader"))\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if using ElementTree:
try:
return xml.find('./teiHeader/fileDesc/titleStmt').text
except AttributeError:
return None
By the way, if it helps, feel free to consider splitting this PR further, and checking in just one of the two:
(E.g. it's ok to merge just (2) for now, and work on (1) in parallel.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High-level comments:
- remove the zivopaniSad file from this PR
- use
xml.etree.ElementTree
instead of lxml -- they have basically the same API, but ElementTree is built-in and we'll probably removelxml
in the future. - use xpath (
xml.find('./element/anotherElement/thirdElement)
) instead of chainingtagname
- no bare
except
s anywhere - run
black
over the code
@@ -0,0 +1,5 @@ | |||
python make_latex.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better name -- maybe create_pdf.sh
, give it the .sh
extension and add #!/usr/bin/sh
at the top of the file, then chmod +x create_pdf.sh
to mark it as executable
|
||
if __name__ == "__main__": | ||
input_file, output_file = sys.argv[1], sys.argv[2] | ||
verses, analysis, title = parse(url=link) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is link
defined?
|
||
|
||
# probably need to use asyncio here? | ||
def parse(url=None, data_location = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
run black
over the file to fix the spacing
|
||
|
||
# probably need to use asyncio here? | ||
def parse(url=None, data_location = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are two entirely different code paths here and it doesn't make sense to make them a single function. Better to have parse_url
andparse_file
and split them up accordingly
def tagname(tag): | ||
return f"{{{root_url}}}{tag}" | ||
|
||
def parse_from_root(root) -> (dict, dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a type
.find(tagname("titleStmt")) | ||
author = tag.find(tagname("author")).text | ||
except: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to have explicit return None
author = tag.find(tagname("author")).text | ||
except: | ||
return | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try / except / else
is kind of funky. Can you rewrite this?
something like:
author = root.find("./teiHeader/fileDesc/titleStmt/author')
if author:
return author.text
else:
return None
return verse_id, phrases | ||
|
||
def parse_note(note): | ||
corresp = note.get("corresp").replace("#","") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this fail if corresp
is missing?
return parse_lg(note.find(tagname("lg")), corresp) | ||
|
||
def parse_div(div): | ||
verses = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return [parse_lg(lg) for lg in div]
analysis = {} | ||
for child in root.iter(): | ||
if child.tag == tagname("lg"): | ||
v,txt = parse_lg(child) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spaces after commas. black
will do this for you
|
||
|
||
# CLUNKY Fix this. Closure maybe? | ||
root_url = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend just looping over the document and deleting this root
prefix. see _remove_namespace
in tei_parser.py
for examples
Returning to this PR -- I suggest we scope this PR down to the simplest thing that adds new functionality then iterate on it with follow-up PRs. |
Do you want to meet this weekend to finalize this? I can wrap it up in the
next couple of days?
…On Thu, Apr 6, 2023 at 9:32 PM Arun Prasad ***@***.***> wrote:
Returning to this PR -- I suggest we scope this PR down to the simplest
thing that adds new functionality then iterate on it with follow-up PRs.
—
Reply to this email directly, view it on GitHub
<#364 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AE2BJBATJKU2CSUW6P4PON3W75VDFANCNFSM6AAAAAAQSTANOE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Sure, just scheduled a weekly sync (see #general on Discord) or we can find some time elsewhere |
Tasks to be done: