Index the title and content of URLs in page #41
Comments
https://github.com/ageitgey/node-unfluff/blob/master/README.md
https://github.com/craftzdog/extract-main-text-node: some users report that it is more stable than unfluff on CJK text. |
https://github.com/inspiredjw/oembed-auto: some media sites may support oEmbed. |
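To check whether a page advertises oEmbed, one option is to look for the discovery `<link>` tags described in the oEmbed spec. The following is only a rough sketch using the Python standard library; it is not how oembed-auto works internally, and the names `OEmbedLinkParser` and `find_oembed_endpoints` are made up for illustration.

```python
from html.parser import HTMLParser

# MIME types used for oEmbed discovery links in the oEmbed spec.
OEMBED_TYPES = {"application/json+oembed", "text/xml+oembed"}


class OEmbedLinkParser(HTMLParser):
    """Collects href values of <link> tags that advertise an oEmbed endpoint."""

    def __init__(self):
        super().__init__()
        self.endpoints = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        link_type = (attr.get("type") or "").lower()
        if link_type in OEMBED_TYPES and attr.get("href"):
            self.endpoints.append(attr["href"])


def find_oembed_endpoints(html):
    """Hypothetical helper: return all oEmbed endpoint URLs found in the HTML."""
    parser = OEmbedLinkParser()
    parser.feed(html)
    return parser.endpoints
```

If the returned list is empty, the site most likely does not support oEmbed discovery, and we would fall back to one of the content extractors discussed in this thread.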
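Webpage loading & rendering: we can use headless Chrome directly to handle SPAs, and take a screenshot while we are at it. Alternatively, use Rendertron: no API integration is needed, just start a Docker container to do the prerendering. |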
From Slack: https://g0v-tw.slackarchive.io/cofacts/page-17/ts-1506900396000054
Quite impressive, and it can handle Chinese:
How it works: http://www.keyvan.net/2011/03/content-extraction/
Python goose3 can also handle Chinese: |
As for URL normalization, we can rely on https://github.com/g0v/url-normalizer.js. But this should not be very important, since we mostly use the URL's content for matching. |
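For reference, here is a minimal sketch of what URL normalization could look like. It uses only the Python standard library and does not reproduce the rules of url-normalizer.js; the function name `normalize_url` and the `TRACKING_PARAMS` list are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters to drop; not taken from url-normalizer.js.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "fbclid"}

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Lowercase scheme and host, drop default ports, fragments and tracking params,
    and sort the remaining query parameters.

    Userinfo and IPv6 hosts are not handled in this sketch.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    if parts.port and DEFAULT_PORTS.get(scheme) != parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    query = urlencode(sorted(kept))
    # The empty last element drops the fragment.
    return urlunsplit((scheme, host, path, query, ""))
```

With this, two links that differ only in letter case, default port, tracking parameters or fragment map to the same key, which is enough for deduplicating fetches even if matching itself is done on content.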
Pure goose3 test: https://docs.google.com/spreadsheets/d/1y1GGc04HBhpU76D6LvX5hqt877X_Rfatc1vSQNJmZ6Q/edit#gid=0
|
Another possible alternative is Boilerpipe, written in Java.
Here is a comparison between goose and boilerpipe:
From the Hacker News thread https://news.ycombinator.com/item?id=2526127, we also have the following candidates:
|
Mozilla/Readability + Puppeteer test results:
Test script: https://gist.github.com/MrOrz/fb48f27f0f21846d0df521728fda19ce (pure JS!)
Quoting my feedback from Slack:
However, there is one web page that triggers a bug in Readability. Maybe we should report it. |
Requirement: https://hackmd.io/s/SyqhWqLKz#Proposal
To be implemented
|
1 and 2 are always used together; consider implementing them in the Migration script. |
Problem to solve:
If we could record each link's
and include them in full-text search as well, it would be much more convenient.
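As a starting point for the goal in the issue title (indexing the title and content of URLs), the sketch below pulls URLs out of a message and reads the `<title>` of an already-fetched HTML page. It uses only the Python standard library, does no fetching or main-content extraction (that is what unfluff, goose3 or Readability would be for), and the names `extract_urls`, `extract_title` and `URL_PATTERN` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Simplistic URL pattern for illustration; stops at whitespace, quotes and angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self.title_parts.append(data)


def extract_urls(text):
    """Return all http(s) URLs found in a message, in order of appearance."""
    return URL_PATTERN.findall(text)


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return "".join(parser.title_parts).strip()
```

The extracted title (and, later, the main text) could then be stored next to each URL and added to the full-text search index.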