Splitting the text in scoring #113

haziyevv · 2019-02-13T08:28:40Z

In the score_paragraphs method, the content score is calculated like this:
content_score += len(inner_text.split(','))

But I think it should be as below, because the text may contain no commas at all.
content_score += len(re.split(' |,',inner_text))

I also think this could be added, to ignore non-word characters and words shorter than 3 characters:
inner_text = " ".join(re.findall(r"[^\d\W]{3,}", inner_text))
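For comparison, here is a minimal sketch of what the three expressions quoted above do on a single string. The function names and the sample sentence are made up for illustration and are not taken from the package.

```python
import re

def comma_score(inner_text):
    # Current behaviour quoted above: one point per comma-separated chunk.
    return len(inner_text.split(','))

def space_or_comma_score(inner_text):
    # Proposed variant: split on either a single space or a comma.
    return len(re.split(' |,', inner_text))

def filter_words(inner_text):
    # Proposed pre-filter: keep only runs of 3 or more word characters
    # that are not digits, dropping numbers, punctuation and short words.
    return " ".join(re.findall(r"[^\d\W]{3,}", inner_text))

# Made-up sample sentence that contains no commas.
sample = "The council met on 12 May to vote on it"

print(comma_score(sample))           # 1
print(space_or_comma_score(sample))  # 10
print(filter_words(sample))          # The council met May vote
```

With no commas in the text, the comma-based count stays at 1 however long the sentence is, while the space-or-comma count grows with every word.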

buriy · 2019-03-08T06:41:12Z

This is a typical counter-intuitive situation where a "more is better" strategy doesn't work. More separators aren't better, because this count was designed with a different goal in mind.
"|" is rarely used in texts, but often in titles -- so scoring it would have a negative impact.
"," is rarely used in titles, but often in longer texts -- that's why it's counted.
Counting spaces would require rescaling the score, and spaces don't distinguish between good content and bad content.
The last suggestion looks partially valid: symbols don't make the text better, but punctuation is a sign of real text, so what's the purpose of ignoring it?
Have you evaluated the impact of your changes in practice?

haziyevv · 2019-03-08T06:47:06Z

Thank you for replying. Yes, I applied the changes and they were actually effective. Before, I was not able to get the content of a page, just the footer, but after those changes I was able to get the content. Maybe it is because of the input I used, which was news pages. On most of those pages the main text is a large block that may contain no commas, while the footer contains lots of commas. For example,

this department is situated in Baku, Azerbaijan, 21thditsti, postcode xx

buriy · 2019-03-08T06:55:55Z

Thanks for a valid counter-example. This package is designed for news pages, but it was modeled on English ones and doesn't consider such a use case. I would rather suggest a discount on comma counting, then, and will consider implementing it in the next update -- I'm trying to release package updates at least once every 3 months.
This package is made to collect from hundreds or thousands of news sources and could behave badly on some specific ones. For quick tuning, positive/negative keywords should work better than other solutions.
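The thread does not say what form a "discount on comma counting" would take. The sketch below shows one possible reading, in which the number of comma-separated chunks is capped by the amount of text. The function names, the `words_per_comma` parameter and the article sentence are assumptions made up for illustration, not the package's implementation.

```python
def plain_comma_score(inner_text):
    # Behaviour quoted at the top of the thread.
    return len(inner_text.split(','))

def discounted_comma_score(inner_text, words_per_comma=5):
    # Hypothetical discount: a text earns at most one comma point per
    # `words_per_comma` words (plus one), so short comma-dense strings
    # such as addresses cannot outscore longer prose on commas alone.
    chunks = len(inner_text.split(','))
    words = len(inner_text.split())
    cap = 1 + words // words_per_comma
    return min(chunks, cap)

# Footer example from the thread, and a made-up article sentence.
footer = "this department is situated in Baku, Azerbaijan, 21thditsti, postcode xx"
article = ("The council met on Tuesday, approved the budget after a long debate, "
           "and scheduled the next session for June")

print(plain_comma_score(footer), discounted_comma_score(footer))    # 4 3
print(plain_comma_score(article), discounted_comma_score(article))  # 3 3
```

Under these assumptions the footer loses a point while the prose sentence keeps its score. Whether such a cap helps across many news sources would need the kind of evaluation asked about earlier in the thread.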

buriy · 2019-04-03T12:42:41Z

@faridhaziyev please don't close this issue.
Once I have time for maintenance, I'll add this improvement.

buriy closed this as completed Mar 8, 2019

buriy reopened this Mar 8, 2019

haziyevv closed this as completed Apr 3, 2019

buriy reopened this Apr 3, 2019
