feat: scrape [html + markdown] #889
Merged
83 commits
01a2678 wip: get html (amhsirak)
994142a fix: define browser context (amhsirak)
0c9dc89 feat: get input text for llm (amhsirak)
560f5a3 feat: get llm ready text (amhsirak)
9b71cfc fix: return empty empty str on error (amhsirak)
191ac52 fix: return empty empty str on error (amhsirak)
af95706 fix: get important content (amhsirak)
a3891f6 wip: markdown + plain text (amhsirak)
dae4e83 wip: markdown + plain text (amhsirak)
28f1bf8 fix: better markdown output (amhsirak)
1651763 fix: better markdown output (amhsirak)
6e6d6c6 chore(deps): install cheerio, turndown (amhsirak)
f22f6ef debug(temporary): turndown x amzn (amhsirak)
0fa5397 debug(temporary): test url -> llm text (amhsirak)
4158896 chore: link replace (amhsirak)
6c8850a chore: link replace (amhsirak)
7da4647 wip: to markdown (amhsirak)
dd1a9a8 chore(deps): install koffi (amhsirak)
ec49565 chore: ignore build files (amhsirak)
f0d6712 chore: build (amhsirak)
da48d46 chore: build (amhsirak)
713d374 feat: to markdown (amhsirak)
6c93cbc feat: html -> markdown (amhsirak)
0837ac5 fix: go parser path (amhsirak)
66d8291 fix: export convert fxn (amhsirak)
1d65f90 feat: use parser to scrape (amhsirak)
3fd9bb5 chore(debug): test (amhsirak)
1a291c2 chore: cleanup (amhsirak)
ecaa23f chore: install scrape plugins (amhsirak)
767fa5f chore: del go (amhsirak)
b4644ba feat: use turndown (amhsirak)
b14d84d fix: -rm debug turndown (amhsirak)
839f9fa fix: plugin imports (amhsirak)
0a7a1eb fix: make baseUrl optional param (amhsirak)
9257b15 feat: pass url param (amhsirak)
d1f13cf feat: add robot markdown creation section ui (RohitR311)
c1373d8 feat: display separate field md content (RohitR311)
0d45d1d feat: markdownify manual, scheduled, api runs (RohitR311)
b19e02f feat: add markdown route (RohitR311)
05d2d1b feat: add optional type and url fields (RohitR311)
d444756 chore: add static markdown import (RohitR311)
ddcb3df feat: extend turndown + clean (amhsirak)
28d2288 Merge branch 'markdownify' of https://github.com/getmaxun/maxun into … (amhsirak)
8346c96 chore: cleanup (amhsirak)
924d687 feat: add create markdown api (RohitR311)
e711326 feat: extract (amhsirak)
51a0c3a chore: remove icon (amhsirak)
672a182 feat: extract (amhsirak)
d0b8d0c chore: remove icon (amhsirak)
53bf9eb feat: scrape (amhsirak)
8428314 feat: scrape (amhsirak)
6de6c3b feat: remove header (amhsirak)
ef43116 feat: markdown (amhsirak)
f745089 feat: markdown (amhsirak)
81d69a4 chore: lint (amhsirak)
eb86b6e feat: markdown (amhsirak)
febc6c1 feat: markdown (amhsirak)
dbb6c87 feat: change mui default tabs (amhsirak)
606790e chore: lint (amhsirak)
3dac1a0 feat: change mui default tabs (amhsirak)
9601905 feat: turn to markdown (amhsirak)
930c7b6 fix: lesser restrictions (amhsirak)
691dedc fix: lesser restrictions (amhsirak)
7f48e27 chore: lint (amhsirak)
fef038b chore: cleanup wanted deps (amhsirak)
5aafe6e feat: add html (amhsirak)
418100c feat: scrape robot (amhsirak)
f3c79bd feat: scrape robot (amhsirak)
e90cd99 feat: add html scrape support (RohitR311)
a9a8e20 fix: resolve merge conflicts (RohitR311)
c89b2af feat: modify scrape api to support html (RohitR311)
0987183 chore: increase goto timeout scrape 100s (RohitR311)
ac0c70e feat: disable sheets and airtable scrape robot (RohitR311)
f646713 fix: format (amhsirak)
174a09f fix: use p instead of formlabel (amhsirak)
565c858 fix: remove margin (amhsirak)
7f22a77 Merge branch 'markdownify' of https://github.com/getmaxun/maxun into … (amhsirak)
6477fee fix: dont show selected output format (amhsirak)
b2b5a91 chore: add telemetry for scrape robots and runs (RohitR311)
467ffe3 feat: rm display integrations scrape robot (RohitR311)
25fd74e feat: center tabs (amhsirak)
5cfbd1e Merge branch 'markdownify' of https://github.com/getmaxun/maxun into … (amhsirak)
a1b2117 fix: less gap (amhsirak)
Missing validation: `requestedFormats` should be validated before use.

When `requestedFormats` is provided, line 672 filters it to only include valid formats, but there's no check to ensure the resulting array isn't empty. If a caller passes `formats: ['invalid']`, the filtered array would be empty and no output would be generated, but the run would still succeed.

Apply this diff to add validation:

     // Override if API request defines formats
     if (requestedFormats && Array.isArray(requestedFormats) && requestedFormats.length > 0) {
       formats = requestedFormats.filter((f): f is 'markdown' | 'html' => ['markdown', 'html'].includes(f));
    +  if (formats.length === 0) {
    +    throw new Error('No valid formats specified. Supported formats: markdown, html');
    +  }
     }
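
To make the failure mode concrete, here is a minimal standalone sketch of the check described in this comment. The names `OutputFormat`, `isOutputFormat`, and `resolveFormats` are hypothetical and do not appear in the PR; only the filtering on `'markdown' | 'html'` and the error message are taken from the suggested diff. How the real handler obtains its default formats is not shown in this thread, so they are passed in as a parameter here.

```typescript
// Hypothetical type for illustration; the PR inlines 'markdown' | 'html'.
type OutputFormat = 'markdown' | 'html';

// Type guard: narrows an arbitrary value to a supported output format.
function isOutputFormat(value: unknown): value is OutputFormat {
  return value === 'markdown' || value === 'html';
}

// Hypothetical helper: returns the formats to generate for a run.
// - No override (missing, non-array, or empty array): keep the defaults.
// - Override present: keep only supported entries, and fail loudly if
//   none survive instead of letting the run succeed with no output.
function resolveFormats(
  defaults: OutputFormat[],
  requestedFormats?: unknown
): OutputFormat[] {
  if (!Array.isArray(requestedFormats) || requestedFormats.length === 0) {
    return defaults;
  }
  const formats = requestedFormats.filter(isOutputFormat);
  if (formats.length === 0) {
    throw new Error('No valid formats specified. Supported formats: markdown, html');
  }
  return formats;
}
```

With this shape, `resolveFormats(['markdown'], ['html', 'pdf'])` keeps only `'html'`, while `resolveFormats(['markdown'], ['invalid'])` throws rather than silently producing nothing. Whether unsupported entries mixed in with valid ones should also be rejected is a separate design choice that the review comment does not address.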