Is this project still being maintained? #878

Open
lodenrogue opened this issue Mar 23, 2021 · 15 comments

Comments

@lodenrogue

No description provided.

@johnbumgarner

johnbumgarner commented Mar 23, 2021

ref: #813

The owner of this project hasn't responded to any inquiries about its status since June 2020. The project likely needs to be forked and updated, because the last published update by @codelucas was on Jun 13, 2017.

@ghost

ghost commented Mar 31, 2021

The owner did a podcast interview in September in which he expressed interest in continuing to maintain the library, but said he was having trouble keeping up with it (if my memory serves me).

The interview:
https://www.pythonpodcast.com/newspaper-data-extraction-episode-280/

@planktonrobo

If this project were to be forked & updated, what suggestions do you have for updates needed? @johnbumgarner

@ghost

ghost commented Apr 9, 2021

Has anyone tried to reach out to the developer yet? I may reach out offering support.
The biggest thing this project needs in my opinion is more transparent and direct access to the cached articles. If there are methods to access the cache, I have not found them yet.

@johnbumgarner

Yes. Reference: #813

> Has anyone tried to reach out to the developer yet? I may reach out offering support.
> The biggest thing this project needs in my opinion is more transparent and direct access to the cached articles. If there are methods to access the cache, I have not found them yet.

@johnbumgarner

> If this project were to be forked & updated, what suggestions do you have for updates needed? @johnbumgarner

Based on some of the past issues, the extraction piece of this module would require the most changes. After that, likely the NLP piece of the code.

@AlviseSembenico

Shall we fork it once and for all and work on it? It seems a lot of time has passed since the last conversation about this.

@lodenrogue
Author

Yes please

@johnbumgarner

> Shall we fork it once and for all and work on it? It seems a lot of time has passed since the last conversation about this.

@AlviseSembenico most likely, because the module's creator won't respond to emails about the status of the code base. The question is how much to keep and how much to redesign from scratch. The rule-based extraction is still useful, but it might be better to rebuild it using some type of machine learning technique that can "guess at a page's structure and tags." I have started doing research into that, but I'm not an expert on ML or modeling.

I have also been exploring all the issues with the current version by reading all the pull requests and open/closed issues.

@RaedShabbir

@johnbumgarner Would love to contribute on that

@AlviseSembenico

@johnbumgarner That's a good point. Let's bear in mind that a "fast" version should remain available, since some use cases require speed and might run on less powerful computers. I have an ML background, so I can do research. Have you already checked whether there is a project going in that direction?

@RaedShabbir

@AlviseSembenico
The best I've found is https://github.com/fhamborg/news-please

They recently released a paper with a non transformer based model https://github.com/fhamborg/NewsMTSC

It would be great to see a version of that library powered by Hugging Face!

@AlviseSembenico

@RaedShabbir I worked with news-please, and it is a great project. However, it uses Newspaper and other heuristics under the hood, so it is not a radical change in paradigm.

@edvilme

edvilme commented May 22, 2021

Hello! I recently stumbled upon this repo.
Despite not being maintained anymore, how reliable would you say this project is? And is news-please any more reliable?
If not, has anyone made an updated fork?

@mxdev88

mxdev88 commented Jun 6, 2023

> And is news-please any more reliable?

news-please depends on newspaper3k, so it cannot be considered more reliable. However, news-please is an active project. We are better off getting in touch with the news-please maintainer. newspaper3k could potentially be made an optional dependency and replaced by another extractor.
