
Index the title and content of URLs in page #41

Closed
MrOrz opened this issue Aug 2, 2017 · 12 comments

MrOrz (Member) commented Aug 2, 2017

Problems to solve:

  1. During retrieval, we basically cannot do anything when we encounter a link; even if the article behind the link is highly relevant, it cannot be found.
  2. When the LINE bot lists the found articles for the user to choose from, it is quite unfriendly to links.
  3. Editors have to click through to read the link, which is cumbersome, and the related-article feature does not work well for links.

If we could record, for each link, its

  1. Title
  2. Canonical URL (after redirect)
  3. Content

and include them in the full-text search, it would be much more convenient.
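The idea above can be sketched as follows: once a link's title and content are recorded, they can be appended to the message text that goes into full-text search. Everything here (the `scrapedUrls` map, `buildSearchableText`, the sample data) is a hypothetical illustration, not the project's actual API.

```javascript
// Hypothetical store of scraped results, keyed by the URL as it appears
// in the message. A real scraper would fill in these fields.
const scrapedUrls = new Map([
  [
    "https://example.com/rumor",
    {
      canonicalUrl: "https://example.com/rumor", // after following redirects
      title: "Rumor article title",
      content: "Full extracted article text ...",
    },
  ],
]);

// Build the text fed into full-text search: the original message plus
// the title and content of every link it contains.
function buildSearchableText(message) {
  const urls = message.match(/https?:\/\/\S+/g) || [];
  const extras = urls
    .map((u) => scrapedUrls.get(u))
    .filter(Boolean)
    .map((e) => `${e.title}\n${e.content}`);
  return [message, ...extras].join("\n");
}

const text = buildSearchableText("Is this true? https://example.com/rumor");
```

With this, a query that only matches the linked article's body can still retrieve the message that merely contains the link.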

MrOrz (Member, Author) commented Aug 19, 2017

https://github.com/ageitgey/node-unfluff/blob/master/README.md
This definitely won't work yet for languages like Chinese / Arabic / Korean / etc that need smarter word tokenization.

A commenter says that https://github.com/craftzdog/extract-main-text-node is more stable than unfluff for CJK content.

MrOrz (Member, Author) commented Aug 19, 2017

Some media sites may support oEmbed: https://github.com/inspiredjw/oembed-auto
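oEmbed needs no HTML scraping at all: for providers that support it (YouTube does), we only have to build the provider's oEmbed endpoint URL and parse the returned JSON, which includes the title. A minimal sketch using YouTube's documented oEmbed endpoint:

```javascript
// Construct the oEmbed request URL for a YouTube video. Fetching this
// URL returns JSON with title, author_name, thumbnail_url, etc.
function youtubeOembedUrl(videoUrl) {
  const u = new URL("https://www.youtube.com/oembed");
  u.searchParams.set("url", videoUrl);
  u.searchParams.set("format", "json");
  return u.toString();
}

const oembed = youtubeOembedUrl("https://www.youtube.com/watch?v=dQw4w9WgXcQ");
```

The actual HTTP fetch and JSON parsing are left out here; only the URL construction is shown.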

MrOrz (Member, Author) commented Aug 21, 2017

For webpage loading & rendering, we can use headless Chrome to handle SPAs, and grab a screenshot while we are at it:
https://github.com/GoogleChrome/puppeteer#readme

Alternatively, use Rendertron: no API integration needed; just spin up a Docker container that does the prerendering:
https://github.com/GoogleChrome/rendertron
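A sketch of what the Rendertron option could look like: it serves rendered HTML at `/render/<url>` (and screenshots at `/screenshot/<url>`) over plain HTTP, so prerendering becomes URL construction plus a GET request. The localhost base URL is an assumption (a local Docker container).

```javascript
// Assumed local Rendertron instance (e.g. started via Docker).
const RENDERTRON = "http://localhost:3000";

// Build the Rendertron endpoint that returns the prerendered HTML
// of `target`. The target URL must be percent-encoded.
function rendertronRenderUrl(target) {
  return `${RENDERTRON}/render/${encodeURIComponent(target)}`;
}

const renderUrl = rendertronRenderUrl("https://example.com");
```

The actual fetch of `renderUrl` (and the screenshot variant) is omitted; this only shows how little glue code the Rendertron route needs compared to driving Puppeteer directly.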

MrOrz (Member, Author) commented Nov 9, 2017

MrOrz (Member, Author) commented Nov 10, 2017

Goose3 seems really promising!
We just need to provide the URL. How neat!

(screenshot)

MrOrz (Member, Author) commented Nov 28, 2017

As for URL normalization, we can rely on https://github.com/g0v/url-normalizer.js
and the canonical URL field.

But this should not be very important, since we mostly use the content to do the matching.
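A minimal normalization sketch, assuming all we need is to drop `utm_*` tracking parameters and the fragment and to lowercase the host (this is not the url-normalizer.js API, just an illustration of the idea):

```javascript
// Normalize a URL for deduplication: strip utm_* query parameters
// and the #fragment. The URL constructor lowercases the host for us.
function normalizeUrl(raw) {
  const u = new URL(raw);
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith("utm_")) u.searchParams.delete(key);
  }
  u.hash = "";
  return u.toString();
}
```

When a page declares a correct canonical URL, that value can be preferred over the normalized one.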

MrOrz (Member, Author) commented Mar 11, 2018

Pure goose3 test:

https://docs.google.com/spreadsheets/d/1y1GGc04HBhpU76D6LvX5hqt877X_Rfatc1vSQNJmZ6Q/edit#gid=0

  1. Failed to resolve some of the URL shorteners. Maybe they involve multiple redirects?
  2. YouTube pages contain no server-rendered content (other than the meta description), so there is no cleaned_text.
  3. The canonical URL works for removing utm_source parameters, but some sites may not implement canonical URLs correctly.
  4. Weibo content cannot be resolved.
  5. The layout of some Weixin articles causes some text to be recognized as figure captions and excluded from cleaned_text.
  6. When Chinese stopwords are used, we cannot extract the cleaned_text of English documents.
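Issue 6 suggests the stopword list should follow the document's language rather than always being Chinese. A crude, hypothetical way to pick it is comparing the counts of CJK and Latin characters:

```javascript
// Choose which stopword list to use by counting CJK ideographs
// vs. Latin letters. Ties (e.g. empty input) default to "zh" since
// most of our messages are Chinese.
function pickStopwordLang(text) {
  const cjk = (text.match(/[\u4e00-\u9fff]/g) || []).length;
  const latin = (text.match(/[a-zA-Z]/g) || []).length;
  return cjk >= latin ? "zh" : "en";
}
```

A real implementation might use a proper language-detection library, but even this heuristic would stop Chinese stopwords from emptying out English documents.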

MrOrz (Member, Author) commented Mar 19, 2018

Another possible alternative is Boilerpipe, in Java.
Although it is old, it does have an API and supports Chinese.

Here is a comparison between goose and boilerpipe:
https://gist.github.com/eldilibra/5637215

From the Hacker News thread https://news.ycombinator.com/item?id=2526127 , we also have the following candidates:

MrOrz (Member, Author) commented Mar 19, 2018

Mozilla/Readability + Puppeteer test results:

https://docs.google.com/spreadsheets/d/1y1GGc04HBhpU76D6LvX5hqt877X_Rfatc1vSQNJmZ6Q/edit#gid=1885459841

test script: https://gist.github.com/MrOrz/fb48f27f0f21846d0df521728fda19ce (Pure JS!)

Quoting my feedback from Slack:

Although it seems to extract a bit more junk than goose3 (some extra dates at the beginning and end; for some reason the main text of the MyGoPen article is duplicated @@), I think it is acceptable for indexing purposes.

This JS solution is good enough for me; for now I have no motivation to test other Python-based solutions XD

YouTube extraction is still quite bad, and unfortunately YouTube links are very common. Looks like we need special handling for YouTube @@

However, there is one web page that triggers a bug in Readability. Maybe we should report it.
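Special handling for YouTube could start with extracting the video ID, after which a YouTube-specific API (e.g. the oEmbed endpoint mentioned earlier) can supply the title instead of scraping the mostly empty server-rendered page. A sketch; the function name is hypothetical:

```javascript
// Extract a YouTube video ID from the common URL shapes:
// youtu.be/<id> and *.youtube.com/watch?v=<id>.
// Returns null for non-YouTube URLs.
function youtubeVideoId(raw) {
  const u = new URL(raw);
  if (u.hostname === "youtu.be") return u.pathname.slice(1);
  if (u.hostname.endsWith("youtube.com")) return u.searchParams.get("v");
  return null;
}
```

Further shapes (embeds, shorts, playlists) would need extra branches, but these two cover the links most commonly forwarded.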

MrOrz self-assigned this on Apr 21, 2018
MrOrz (Member, Author) commented Apr 21, 2018

Requirement

https://hackmd.io/s/SyqhWqLKz#Proposal

To be implemented

  • a utility function that, given a URL, returns the fetched result and writes new entries to the urls index (or just returns the cached result; or checks the cache first before fetching)
  • after fetching a URL, update the title & summary for all articles and replies using scripted updates
  • a script that populates urls from articles & replies (and fills in their own hyperlinks fields), given a date range to scan
  • a mechanism to override / import the urls index
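The first bullet's cache-first behavior can be sketched like this; the in-memory `Map` stands in for the real urls index and `fetcher` for the real scraper, so all names here are assumptions rather than the actual implementation:

```javascript
// In-memory stand-in for the urls index.
const urlCache = new Map();

// Return the cached result for `url` if present; otherwise fetch it
// via `fetcher` and write the new entry to the cache before returning.
function fetchUrlCached(url, fetcher) {
  if (urlCache.has(url)) return urlCache.get(url); // cache hit: no fetch
  const result = fetcher(url); // real impl: async scrape + render
  urlCache.set(url, result); // "write new entries in urls index"
  return result;
}
```

In production the fetch would be asynchronous and the cache check would hit the urls index, but the control flow is the same.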

MrOrz (Member, Author) commented Jul 9, 2018

CreateArticle, CreateReply:

  1. Extract URLs from string fields.
  2. Invoke scrapUrls with cache turned on. After fetching, insert newly scraped URLs into the urls index.
  3. Write to hyperlinks when scrapUrls returns.

ListArticles + moreLikeThis filter:

  1. Extract URLs from string fields.
  2. Invoke scrapUrls with cache turned on. After fetching, insert newly scraped URLs into the urls index.
  3. Perform the search with the summary returned by scrapUrls.

Steps 1 and 2 are always used together; consider combining them into scrapUrl().

Migration script
For each article & reply, perform CreateArticle's steps 1–3.
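Step 1, shared by all three flows above, can be sketched as a simple regex scan. The regex is a simplified assumption; the production matcher may accept more URL shapes:

```javascript
// Extract http(s) URLs from a free-form text field. Stops at
// whitespace and at ASCII/fullwidth commas, which commonly follow
// links in forwarded messages.
function extractUrls(text) {
  return text.match(/https?:\/\/[^\s,，]+/g) || [];
}
```

The returned list would then be handed to scrapUrls (step 2) with caching enabled.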

MrOrz (Member, Author) commented Oct 17, 2018

With #97, #98 and #104 deployed to production, we can now close this 🎉
