Trafilatura can't be loaded after installing it to local folder #30

Closed
HaIDsIEx opened this issue Nov 13, 2020 · 5 comments
Labels
wontfix This will not be worked on

Comments

@HaIDsIEx

HaIDsIEx commented Nov 13, 2020

Hi,

I just tried to install your awesome project to a local folder (pip install --target {path}/package trafilatura). After installing it, I can't load it with from package import trafilatura:

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

~\anaconda3\lib\site-packages\pkg_resources\__init__.py in get_provider(moduleOrReq)
    359     try:
--> 360         module = sys.modules[moduleOrReq]
    361     except KeyError:

KeyError: 'trafilatura'


During handling of the above exception, another exception occurred:

ModuleNotFoundError                       Traceback (most recent call last)

<ipython-input-37-57dd4ca0a626> in <module>
      1 from pprint import pprint
----> 2 from package import trafilatura
      3 import time
      4 
      5 trafilatura

~{path}\package\trafilatura\__init__.py in <module>
     14 import logging
     15 
---> 16 from .core import extract, process_record
     17 from .utils import fetch_url
     18 

~{path}\package\trafilatura\core.py in <module>
     17 
     18 # own
---> 19 from .external import justext_rescue, sanitize_tree, SANITIZED_XPATH, try_readability
     20 from .filters import content_fingerprint, duplicate_test, language_filter, text_chars_test
     21 from .htmlprocessing import (convert_tags, discard_unwanted,

~{path}\package\trafilatura\external.py in <module>
     36 from .settings import JUSTEXT_LANGUAGES, MANUALLY_STRIPPED
     37 from .utils import trim, HTML_PARSER
---> 38 from .xml import TEI_VALID_TAGS
     39 
     40 

~{path}\package\trafilatura\xml.py in <module>
     21 LOGGER = logging.getLogger(__name__)
     22 # validation
---> 23 TEI_SCHEMA = pkg_resources.resource_filename('trafilatura', 'data/tei-schema.pickle')
     24 TEI_VALID_TAGS = {'body', 'cell', 'code', 'del', 'div', 'fw', 'head', 'hi', 'item', \
     25                   'lb', 'list', 'p', 'quote', 'row', 'table'}

~\anaconda3\lib\site-packages\pkg_resources\__init__.py in resource_filename(self, package_or_requirement, resource_name)
   1143     def resource_filename(self, package_or_requirement, resource_name):
   1144         """Return a true filesystem path for specified resource"""
-> 1145         return get_provider(package_or_requirement).get_resource_filename(
   1146             self, resource_name
   1147         )

~\anaconda3\lib\site-packages\pkg_resources\__init__.py in get_provider(moduleOrReq)
    360         module = sys.modules[moduleOrReq]
    361     except KeyError:
--> 362         __import__(moduleOrReq)
    363         module = sys.modules[moduleOrReq]
    364     loader = getattr(module, '__loader__', None)

ModuleNotFoundError: No module named 'trafilatura'

Python code used:

from package import trafilatura
trafilatura

Note:
I'm using Python 3.7.6 with pip 20.0.2.
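
The traceback suggests that pkg_resources.resource_filename('trafilatura', ...) looks the package up by its bare top-level name, whereas from package import trafilatura registers it under the dotted name package.trafilatura. Below is a minimal sketch of that mismatch using a throwaway package. The names outer_target_demo and inner_pkg_demo are hypothetical stand-ins for package and trafilatura, and nothing here touches trafilatura itself.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

# Hypothetical names standing in for "package" and "trafilatura".
OUTER, INNER = "outer_target_demo", "inner_pkg_demo"

with tempfile.TemporaryDirectory() as root:
    # Mimic the layout produced by installing into <root>/<OUTER>/.
    pkg_dir = Path(root) / OUTER / INNER
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text("")

    # Only the parent of OUTER is on sys.path, as in the report.
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        module = importlib.import_module(f"{OUTER}.{INNER}")
        registered_as = module.__name__
        # A lookup by the bare inner name is what fails in the traceback.
        top_level_registered = INNER in sys.modules
        top_level_findable = importlib.util.find_spec(INNER) is not None
    finally:
        sys.path.remove(root)

print(registered_as)
print(top_level_registered, top_level_findable)
```

Under these assumptions the module is known only by its dotted name, so any lookup by the bare name fails, which is consistent with the ModuleNotFoundError above.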

Edit:
A quick fix (for me) is to replace TEI_SCHEMA = pkg_resources.resource_filename('trafilatura', 'data/tei-schema.pickle') in "xml.py" with TEI_SCHEMA = './data/tei-schema.pickle', and to
change line 11 in "settings.py" from "from trafilatura import __version__" to "from ..trafilatura import __version__".
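
One caveat with a path such as './data/tei-schema.pickle' is that it is resolved against the current working directory, not against the package folder. A minimal sketch of an alternative is to build the path from the module's own location. The helper name resource_path is hypothetical, and this is not necessarily how the project resolves the file.

```python
from pathlib import Path


def resource_path(module_file, *parts):
    """Build a path to a data file relative to the module that ships it.

    module_file is expected to be the module's __file__ value, so the
    result does not depend on the working directory or on the name under
    which the package was imported.
    """
    return Path(module_file).resolve().parent.joinpath(*parts)


# Hypothetical use inside a module that sits next to a "data" folder:
# TEI_SCHEMA = str(resource_path(__file__, "data", "tei-schema.pickle"))
```

This keeps working whether the package is imported as trafilatura or as package.trafilatura, since only the file location matters.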

@adbar
Owner

adbar commented Nov 13, 2020

Thank you, it was indeed a problem. I fixed it in 1a57635. Could you please confirm by trying the version straight from the repository (pip install -U git+https://github.com/adbar/trafilatura.git with your --target)?

@HaIDsIEx
Author

Hey, thanks for the fast response! I just made a new environment and tried to set everything up. However, even though the package gets downloaded into the "package" folder, I get the following error:

Traceback (most recent call last):
  File "{path}/function.py", line 1, in <module>
    from package import trafilatura
  File "{path}\package\trafilatura\__init__.py", line 16, in <module>
    from .core import extract, process_record
  File "{path}\package\trafilatura\core.py", line 16, in <module>
    from lxml import etree, html
ModuleNotFoundError: No module named 'lxml'
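
A likely reading of this error is that dependencies installed with --target land in the target folder as top-level packages, while only the folder's parent is on sys.path. The sketch below, with the hypothetical helper name ensure_importable, checks whether a top-level module can be found and adds the target folder to sys.path if it cannot. It assumes the dependency was actually installed into that folder for the current platform.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def ensure_importable(module_name, target_dir):
    """Return True if the top-level module_name can be found.

    If it is not found at first, target_dir is prepended to sys.path
    and the lookup is retried once.
    """
    if importlib.util.find_spec(module_name) is not None:
        return True
    target = str(Path(target_dir).resolve())
    if target not in sys.path:
        sys.path.insert(0, target)
        importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None


# Hypothetical use before importing from the target folder:
# if ensure_importable("lxml", "package"):
#     from package import trafilatura
```

With the target folder on sys.path, a plain import trafilatura should also resolve, which sidesteps the dotted-name lookup entirely.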

@adbar
Owner

adbar commented Nov 16, 2020 via email

@HaIDsIEx
Author

AFAIK, lxml needs to be compiled and installed for each machine. Therefore, "portable" compatibility (install it to a folder, copy it anywhere, and run it) can't be achieved with lxml. In my case, I would like to deploy it to AWS Lambda (via localstack, where I can't install things easily). I guess it won't work as long as this project builds on lxml. However, I have already found a workaround for now: I expose your project via a REST service on my machine for development purposes. Later (on AWS), I should be able to use EC2 to install all necessary packages such as lxml.
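
The portability concern can be made concrete: compiled extension modules carry a file suffix tied to the interpreter and platform, so a folder built on one machine may not load on another. The following is a rough heuristic sketch, with the hypothetical helper name loadable_here, that checks file names only. It does not verify that the binary's own shared-library dependencies are present.

```python
from importlib.machinery import EXTENSION_SUFFIXES


def loadable_here(filename):
    """Guess from the name whether this interpreter would import the file
    as an extension module.

    The name must be a plain module name followed by one of the
    interpreter's accepted extension suffixes.
    """
    for suffix in EXTENSION_SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            # A leftover dot means the file carries a different platform tag.
            if stem and "." not in stem:
                return True
    return False


# Suffixes accepted by the running interpreter, most specific first.
print(EXTENSION_SUFFIXES)
```

Running this check over the compiled files in the target folder on the deployment machine gives a quick hint as to whether the copied lxml build matches that environment.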

@adbar adbar added the wontfix This will not be worked on label Dec 7, 2020
@adbar
Owner

adbar commented Jun 14, 2021

Hi @HaIDsIEx, please refer to this answer and this code snippet; both show how to solve the issue with LXML.

@adbar adbar closed this as completed Jun 14, 2021