Splitting the text in scoring #113

Open
haziyevv opened this issue Feb 13, 2019 · 4 comments

haziyevv commented Feb 13, 2019

In the score_paragraphs method, the content score is calculated like this:
content_score += len(inner_text.split(','))

But I think it should be as below, because a text may contain no commas:
content_score += len(re.split(' |,', inner_text))

I also think this could be added, so that non-word tokens and words shorter than 3 characters are not taken into account:
inner_text = " ".join(re.findall(r"[^\d\W]{3,}", inner_text))
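To make the difference between the three expressions above concrete, here is a minimal sketch that wraps each one in a small function. The function names are invented for illustration and are not part of the library; only the expressions themselves come from this issue.

```python
import re


def comma_score(inner_text):
    # Current behaviour quoted above: number of comma-separated pieces.
    return len(inner_text.split(','))


def space_or_comma_score(inner_text):
    # Proposed behaviour: split on a single space OR a comma.
    # A ", " sequence yields an empty string between the two separators,
    # and that empty string is counted as a piece too.
    return len(re.split(' |,', inner_text))


def keep_long_words(inner_text):
    # Proposed filter: keep runs of 3+ word characters that are not digits.
    return " ".join(re.findall(r"[^\d\W]{3,}", inner_text))


text = "no commas here at all"
print(comma_score(text))              # 1
print(space_or_comma_score(text))     # 5
print(space_or_comma_score("a, b"))   # 3, because of the empty piece
print(keep_long_words("The cat sat on 42 mats, ok"))  # The cat sat mats
```

One detail worth noticing is that the proposed split counts the empty strings produced by ", ", so the number is not exactly a word count.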

Owner

buriy commented Mar 8, 2019

This is a typical counter-intuitive situation where a "more is better" strategy doesn't work. More separators aren't better, because this scoring was designed with a different goal in mind.
"|" is rarely used in body text but often in titles, so scoring it would have a negative impact.
"," is rarely used in titles but often in longer texts; that's why it's counted.
As for counting spaces: the score would need to be rescaled, and spaces don't distinguish good content from bad content.
The last suggestion looks partially valid: symbols don't make the text better. But punctuation is a sign of real text, so what's the purpose of ignoring it?
Have you evaluated the impact of your changes in practice?

buriy closed this as completed Mar 8, 2019

Author

haziyevv commented Mar 8, 2019

Thank you for replying. Yes, I applied the changes and they were effective. Before, I could not get the content of a page, only the footer; after the changes I was able to get the content. Maybe that is because of the input I used: news pages. On most of these pages the main text may contain no commas even though it is a large block of text, while the footer contains lots of commas. For example,

this department is situated in Baku, Azerbaijan, 21thditsti, postcode xx

.
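The effect described above can be reproduced with a small sketch. The footer string is the one quoted in this comment; the article sentence is a made-up, comma-free example, and the two helper functions are illustrative wrappers around the expressions from the first comment, not library code.

```python
import re

# Footer line quoted in the comment above.
footer = "this department is situated in Baku, Azerbaijan, 21thditsti, postcode xx"
# Hypothetical comma-free news sentence, invented for this comparison.
article = ("The ministry announced a new transport plan for the capital "
           "and said work will begin next month")


def comma_score(text):
    # Comma-only counting, as in the current score_paragraphs expression.
    return len(text.split(','))


def space_or_comma_score(text):
    # Proposed counting: split on a single space or a comma.
    return len(re.split(' |,', text))


for name, text in (("footer", footer), ("article", article)):
    print(name, comma_score(text), space_or_comma_score(text))
```

With comma-only counting the footer scores higher than the article (4 against 1), while with the proposed split the article scores higher (17 against 13). This is one hand-picked pair, so it illustrates the report rather than proving the change helps in general.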

Owner

buriy commented Mar 8, 2019

Thanks for a valid counter-example. This package is designed for news pages, but it was modeled on English ones and doesn't consider this use case. I would rather suggest a discount on comma counting, and I will consider implementing it in the next update -- I'm trying to do package updates at least once every 3 months.
This package is made to collect from hundreds or thousands of news sources and could behave badly on some specific ones. For quick tuning, positive/negative keywords should work better than other solutions.
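The thread does not say what a "discount on comma counting" would look like, so the following is only one possible reading, not the maintainer's implementation. The function name, the `weight` and `cap` parameters, and their default values are all assumptions made up for this sketch.

```python
def discounted_comma_score(inner_text, weight=0.5, cap=3.0):
    # Hypothetical discount: each comma adds only `weight` points,
    # and the total comma bonus never exceeds `cap`.
    # The leading 1 mirrors len(text.split(',')) for a comma-free text.
    commas = inner_text.count(',')
    return 1 + min(commas * weight, cap)


print(discounted_comma_score("no commas"))   # 1
print(discounted_comma_score("a,b,c"))       # 2.0
print(discounted_comma_score("," * 10))      # 4.0
```

Under this assumed scheme a comma-heavy footer can gain at most `cap` points from commas, which limits how far it can outrank comma-free body text.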

buriy reopened this Mar 8, 2019
haziyevv closed this as completed Apr 3, 2019
buriy reopened this Apr 3, 2019
Owner

buriy commented Apr 3, 2019

@faridhaziyev please don't close this issue.
Once I have time for maintenance, I'll add this improvement.
