
Index the title and content of URLs in page #41

Closed
MrOrz opened this issue Aug 2, 2017 · 12 comments

MrOrz (Member) commented Aug 2, 2017

Problems to solve:

  1. During retrieval, we basically cannot do anything when we encounter a link; even if the article behind the link is highly relevant, it cannot be found.
  2. When the LINE bot lists the found articles for the user to choose from, it is quite unfriendly to links.
  3. Editors have to click through to read the link, which is cumbersome, and the related-article feature does not work well for links.

If we could record, for each link, its

  1. Title
  2. Canonical URL (after redirect)
  3. Content

and include them in the full-text search, it would be much more convenient.
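The idea above can be sketched as follows: once a link's title and content are recorded, they can be appended to the message text that goes into full-text search. Everything here (the `scrapedUrls` map, `buildSearchableText`, the sample data) is a hypothetical illustration, not the project's actual API.

```javascript
// Hypothetical store of scraped results, keyed by the URL as it appears
// in the message. A real scraper would fill in these fields.
const scrapedUrls = new Map([
  [
    "https://example.com/rumor",
    {
      canonicalUrl: "https://example.com/rumor", // after following redirects
      title: "Rumor article title",
      content: "Full extracted article text ...",
    },
  ],
]);

// Build the text fed into full-text search: the original message plus
// the title and content of every link it contains.
function buildSearchableText(message) {
  const urls = message.match(/https?:\/\/\S+/g) || [];
  const extras = urls
    .map((u) => scrapedUrls.get(u))
    .filter(Boolean)
    .map((e) => `${e.title}\n${e.content}`);
  return [message, ...extras].join("\n");
}

const text = buildSearchableText("Is this true? https://example.com/rumor");
```

With this, a query that only matches the linked article's body can still retrieve the message that merely contains the link.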

MrOrz (Member, Author) commented Aug 19, 2017

https://github.com/ageitgey/node-unfluff/blob/master/README.md
This definitely won't work yet for languages like Chinese / Arabic / Korean / etc that need smarter word tokenization.

A commenter says that https://github.com/craftzdog/extract-main-text-node is more stable than unfluff for CJK content.

MrOrz (Member, Author) commented Aug 19, 2017

Some media sites may support oEmbed: https://github.com/inspiredjw/oembed-auto
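oEmbed needs no HTML scraping at all: for providers that support it (YouTube does), we only have to build the provider's oEmbed endpoint URL and parse the returned JSON, which includes the title. A minimal sketch using YouTube's documented oEmbed endpoint:

```javascript
// Construct the oEmbed request URL for a YouTube video. Fetching this
// URL returns JSON with title, author_name, thumbnail_url, etc.
function youtubeOembedUrl(videoUrl) {
  const u = new URL("https://www.youtube.com/oembed");
  u.searchParams.set("url", videoUrl);
  u.searchParams.set("format", "json");
  return u.toString();
}

const oembed = youtubeOembedUrl("https://www.youtube.com/watch?v=dQw4w9WgXcQ");
```

The actual HTTP fetch and JSON parsing are left out here; only the URL construction is shown.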

MrOrz (Member, Author) commented Aug 21, 2017

For webpage loading & rendering, we can use headless Chrome to handle SPAs, and grab a screenshot while we are at it:
https://github.com/GoogleChrome/puppeteer#readme

Alternatively, use Rendertron: no API integration needed; just spin up a Docker container that does the prerendering:
https://github.com/GoogleChrome/rendertron
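A sketch of what the Rendertron option could look like: it serves rendered HTML at `/render/<url>` (and screenshots at `/screenshot/<url>`) over plain HTTP, so prerendering becomes URL construction plus a GET request. The localhost base URL is an assumption (a local Docker container).

```javascript
// Assumed local Rendertron instance (e.g. started via Docker).
const RENDERTRON = "http://localhost:3000";

// Build the Rendertron endpoint that returns the prerendered HTML
// of `target`. The target URL must be percent-encoded.
function rendertronRenderUrl(target) {
  return `${RENDERTRON}/render/${encodeURIComponent(target)}`;
}

const renderUrl = rendertronRenderUrl("https://example.com");
```

The actual fetch of `renderUrl` (and the screenshot variant) is omitted; this only shows how little glue code the Rendertron route needs compared to driving Puppeteer directly.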

MrOrz (Member, Author) commented Nov 9, 2017

MrOrz (Member, Author) commented Nov 10, 2017

Goose3 seems really promising!
We just need to provide the URL. How neat!

(screenshot)

MrOrz (Member, Author) commented Nov 28, 2017

As for URL normalization, we can rely on https://github.com/g0v/url-normalizer.js
and the canonical URL field.

But this should not be very important, since we mostly use the content to do the matching.
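A minimal normalization sketch, assuming all we need is to drop `utm_*` tracking parameters and the fragment and to lowercase the host (this is not the url-normalizer.js API, just an illustration of the idea):

```javascript
// Normalize a URL for deduplication: strip utm_* query parameters
// and the #fragment. The URL constructor lowercases the host for us.
function normalizeUrl(raw) {
  const u = new URL(raw);
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith("utm_")) u.searchParams.delete(key);
  }
  u.hash = "";
  return u.toString();
}
```

When a page declares a correct canonical URL, that value can be preferred over the normalized one.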

MrOrz (Member, Author) commented Mar 11, 2018

Pure goose3 test:

https://docs.google.com/spreadsheets/d/1y1GGc04HBhpU76D6LvX5hqt877X_Rfatc1vSQNJmZ6Q/edit#gid=0

  1. Failed to resolve some of the URL shorteners. Maybe they involve multiple redirects?
  2. YouTube pages contain no server-rendered content (other than the meta description), so there is no cleaned_text.
  3. The canonical URL works for removing utm_source parameters, but some sites may not implement canonical URLs correctly.
  4. Weibo content cannot be resolved.
  5. The layout of some Weixin articles causes some text to be recognized as figure captions and excluded from cleaned_text.
  6. When Chinese stopwords are used, we cannot extract the cleaned_text of English documents.
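Issue 6 suggests the stopword list should follow the document's language rather than always being Chinese. A crude, hypothetical way to pick it is comparing the counts of CJK and Latin characters:

```javascript
// Choose which stopword list to use by counting CJK ideographs
// vs. Latin letters. Ties (e.g. empty input) default to "zh" since
// most of our messages are Chinese.
function pickStopwordLang(text) {
  const cjk = (text.match(/[\u4e00-\u9fff]/g) || []).length;
  const latin = (text.match(/[a-zA-Z]/g) || []).length;
  return cjk >= latin ? "zh" : "en";
}
```

A real implementation might use a proper language-detection library, but even this heuristic would stop Chinese stopwords from emptying out English documents.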

MrOrz (Member, Author) commented Mar 19, 2018

Another possible alternative is Boilerpipe, in Java.
Although it is old, it does have an API and supports Chinese.

Here is a comparison between goose and boilerpipe:
https://gist.github.com/eldilibra/5637215

From the Hacker News thread https://news.ycombinator.com/item?id=2526127 , we also have the following candidates:

MrOrz (Member, Author) commented Mar 19, 2018

Mozilla/Readability + Puppeteer test results:

https://docs.google.com/spreadsheets/d/1y1GGc04HBhpU76D6LvX5hqt877X_Rfatc1vSQNJmZ6Q/edit#gid=1885459841

test script: https://gist.github.com/MrOrz/fb48f27f0f21846d0df521728fda19ce (Pure JS!)

Quoting my feedback from Slack:

Although it seems to extract a bit more junk than goose3 (some extra dates at the beginning and end; for some reason the main text of the MyGoPen article is duplicated @@), I think it is acceptable for indexing purposes.

This JS solution is good enough for me; for now I have no motivation to test other Python-based solutions XD

YouTube extraction is still quite bad, and unfortunately YouTube links are very common. Looks like we need special handling for YouTube @@

However, there is one web page that triggers a bug in Readability. Maybe we should report it.
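Special handling for YouTube could start with extracting the video ID, after which a YouTube-specific API (e.g. the oEmbed endpoint mentioned earlier) can supply the title instead of scraping the mostly empty server-rendered page. A sketch; the function name is hypothetical:

```javascript
// Extract a YouTube video ID from the common URL shapes:
// youtu.be/<id> and *.youtube.com/watch?v=<id>.
// Returns null for non-YouTube URLs.
function youtubeVideoId(raw) {
  const u = new URL(raw);
  if (u.hostname === "youtu.be") return u.pathname.slice(1);
  if (u.hostname.endsWith("youtube.com")) return u.searchParams.get("v");
  return null;
}
```

Further shapes (embeds, shorts, playlists) would need extra branches, but these two cover the links most commonly forwarded.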

MrOrz self-assigned this on Apr 21, 2018
MrOrz (Member, Author) commented Apr 21, 2018

Requirement

https://hackmd.io/s/SyqhWqLKz#Proposal

To be implemented

  • a utility function that, given a URL, returns the fetched result and writes new entries to the urls index (or just returns the cached result; or checks the cache first before fetching)
  • after fetching a URL, update the title & summary for all articles and replies using scripted updates
  • a script that populates urls from articles & replies (and fills in their own hyperlinks fields), given a date range to scan
  • a mechanism to override / import the urls index
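The first bullet's cache-first behavior can be sketched like this; the in-memory `Map` stands in for the real urls index and `fetcher` for the real scraper, so all names here are assumptions rather than the actual implementation:

```javascript
// In-memory stand-in for the urls index.
const urlCache = new Map();

// Return the cached result for `url` if present; otherwise fetch it
// via `fetcher` and write the new entry to the cache before returning.
function fetchUrlCached(url, fetcher) {
  if (urlCache.has(url)) return urlCache.get(url); // cache hit: no fetch
  const result = fetcher(url); // real impl: async scrape + render
  urlCache.set(url, result); // "write new entries in urls index"
  return result;
}
```

In production the fetch would be asynchronous and the cache check would hit the urls index, but the control flow is the same.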

MrOrz (Member, Author) commented Jul 9, 2018

CreateArticle, CreateReply:

  1. Extract URLs from string fields.
  2. Invoke scrapUrls with cache turned on. After fetching, insert newly scraped URLs into the urls index.
  3. Write to hyperlinks when scrapUrls returns.

ListArticles + moreLikeThis filter:

  1. Extract URLs from string fields.
  2. Invoke scrapUrls with cache turned on. After fetching, insert newly scraped URLs into the urls index.
  3. Perform the search with the summary returned by scrapUrls.

Steps 1 and 2 are always used together; consider combining them into scrapUrl().

Migration script
For each article & reply, perform CreateArticle's steps 1–3.
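Step 1, shared by all three flows above, can be sketched as a simple regex scan. The regex is a simplified assumption; the production matcher may accept more URL shapes:

```javascript
// Extract http(s) URLs from a free-form text field. Stops at
// whitespace and at ASCII/fullwidth commas, which commonly follow
// links in forwarded messages.
function extractUrls(text) {
  return text.match(/https?:\/\/[^\s,，]+/g) || [];
}
```

The returned list would then be handed to scrapUrls (step 2) with caching enabled.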

MrOrz (Member, Author) commented Oct 17, 2018

With #97, #98 and #104 deployed to production, we can now close this 🎉
