Document setup procedure for OSX #7

Closed
cogdog opened this issue Jan 19, 2022 · 16 comments

@cogdog

cogdog commented Jan 19, 2022

I did not get far.

  1. Created a CSV with 3 PDF URLs and 3 web URLs, as specified, named keep_this.csv, in a source/ directory. (Note: the README suggests this setup; it might help if the distro included a sample CSV one could run as a first test.)
  2. Modified archive.py with the new file name
  3. Ran python archive.py -p csv

I get an error message

  File "archive.py", line 143
    print(f"Processing {url}\n")
                              ^
SyntaxError: invalid syntax

I know nothing of python and it's likely a rookie error.

@billfitzgerald
Owner

No such thing as a rookie error -

Can you upload the csv here and I'll try and process it, and/or see if I can spot anything amiss?

@billfitzgerald
Owner

And: #8 - I'll get this done tomorrow.

@cogdog
Author

cogdog commented Jan 19, 2022

Oi, I forgot the file, of course... It was really just a random set of things to test

keep_this.csv

@billfitzgerald
Owner

I'll run this tomorrow, but the error you got might be related to your python version.

At the command line, run python -V - ideally, you will see 3.8.x or above. This was developed in 3.8 on Linux.

Or, specify that you want to run python 3.x by running python3 archive.py -p csv
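
For context: the f-string on line 143 is syntax that Python 2.7 (and anything before 3.6) cannot parse, which is why the error shows up before any code runs. Below is a minimal sketch of a version guard, assuming a 3.8 minimum based on the development version mentioned above. `MIN_VERSION`, `version_ok`, and `require_version` are hypothetical names, not part of archive.py. It deliberately avoids f-strings so that older interpreters can still parse it.

```python
import sys

# Assumed minimum, based on the "developed in 3.8" note in this thread.
MIN_VERSION = (3, 8)


def version_ok(version_info=None, minimum=MIN_VERSION):
    """Return True when the (major, minor) version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


def require_version(minimum=MIN_VERSION):
    """Exit with a readable message if the interpreter is too old."""
    if not version_ok(minimum=minimum):
        found = "%d.%d" % tuple(sys.version_info[:2])
        wanted = "%d.%d" % tuple(minimum)
        sys.exit(
            "Python %s or newer is needed, found %s. "
            "Try: python3 archive.py -p csv" % (wanted, found)
        )
```

Because a SyntaxError is raised when the whole file is compiled, a guard like this only helps if it lives in a separate launcher file that contains no newer syntax and calls `require_version()` before importing the main script.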

@billfitzgerald
Owner

Okay - after running this, I believe that the syntax error is due to the python version.

@cogdog - can you report back what you get when you enter python -V ?

Related: the csv was processed successfully, but pulling data from the pdf files threw an error, so I opened #11 to address this.

@cogdog
Author

cogdog commented Jan 19, 2022

Yes, I get 2.7.10

I ran a Homebrew update... It looks like I have 3.9.10

python3 --version
Python 3.9.10

so as you suggest, I am trying python3.

New error, I am missing a "BeautifulSoup module"

python3  archive.py -p csv
Traceback (most recent call last):
  File "/Users/cogdog/Documents/webdev/trapper-keeper/archive.py", line 9, in <module>
    from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'

So I get that via

pip3 install beautifulsoup4

Next error: a missing module named "pandas", so I do

pip3 install pandas

Next missing module is tldextract so...

pip3 install tldextract

And onward to add modules pyfiglet and ocrmypdf
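
The install-one-module-at-a-time loop above can be shortened by checking for everything up front. Here is a rough sketch, assuming the import names and pip package names seen in this thread, plus selenium, which appears in the traceback below. The `REQUIRED` mapping and `missing_packages` are illustrative names, not part of the project.

```python
import importlib.util

# Import name -> pip package name (assumed from the errors in this thread).
REQUIRED = {
    "bs4": "beautifulsoup4",
    "pandas": "pandas",
    "tldextract": "tldextract",
    "pyfiglet": "pyfiglet",
    "ocrmypdf": "ocrmypdf",
    "selenium": "selenium",
}


def missing_packages(required=REQUIRED):
    """Return pip names for modules that cannot be found (nothing is imported)."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Print a single install command covering everything that is absent.
        print("pip3 install " + " ".join(missing))
    else:
        print("All required modules were found.")
```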

Finally on! and more errors:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/common/service.py", line 71, in start
    self.process = subprocess.Popen(cmd, env=self.env,
  File "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 1821, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/cogdog/Documents/webdev/trapper-keeper/archive.py", line 68, in <module>
    driver = webdriver.Firefox()
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
    self.service.start()
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/common/service.py", line 81, in start
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH. 

And now, I need a break...

@billfitzgerald
Owner

Alan - so close!

I think that there are a small number of issues that currently stand between where you are now and getting this running locally.

  1. Install geckodriver for Firefox. Really, it's downloading the latest driver and making sure the downloaded file is in your PATH - these instructions go over the details: https://selenium-python.readthedocs.io/installation.html#drivers
  2. Pull the latest updates from the repo - @jgraham909 added a requirements.txt that looks like it has everything
  3. Use the attached csv file - I modified it from the one you shared, and used it to support some work on better error handling.
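
To confirm that step 1 worked before rerunning the script, it can help to check whether geckodriver is visible on PATH. This is a small sketch using only the standard library; `find_driver` is a made-up helper name.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the named executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_driver()
    if location:
        print("geckodriver found at " + location)
    else:
        # With the selenium version in the traceback above, webdriver.Firefox()
        # fails in this situation.
        print("geckodriver is not on PATH")
```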

One of the PDFs you linked to could not be processed when I ran it: it gets downloaded, but the processing/text extraction chokes and I haven't had a chance to explore why. PDFs are screwy beasts.

I ran this successfully before responding here, so it works on Linux. There are definitely differences between Linux and macOS, so this will be helpful to trap for any of those and get them documented.

Thank you for the testing you have done/are doing. It's hugely appreciated.

keep_this_clean.csv

@billfitzgerald
Owner

This resource for installing Geckodriver is for Ubuntu, but the same principles apply on Mac: https://askubuntu.com/questions/851401/where-to-find-geckodriver-needed-by-selenium-python-package

@billfitzgerald
Owner

Some initial install instructions are in place here: https://github.com/billfitzgerald/trapper-keeper/wiki/B.-Getting-Started

@billfitzgerald
Owner

@cogdog - Just an FYI/heads up - I'm going to work on install instructions for OSX this weekend (tomorrow, hopefully) - I'm working from a fresh machine so I'll hopefully hit all the steps.

@billfitzgerald
Owner

Currently getting this running on OSX in Python 3.9

The web page cleanup works with no issue.

The pdf processing is throwing some hiccups. More to come.

@billfitzgerald
Owner

Note: https://ocrmypdf.readthedocs.io/en/v13.2.0/api.html

Need to adjust the code to support osx and windows.
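
One adjustment that page appears to call for is guarding the call to `ocrmypdf.ocr()` with `if __name__ == "__main__":`, because macOS and Windows start worker processes by re-importing the main script. This thread does not say whether that is the exact change made in the repo, so treat the following as a sketch. `ocr_pdf` and `main` are hypothetical names, and the call passes only the input and output files.

```python
import sys


def ocr_pdf(input_pdf, output_pdf):
    """Run OCR on one PDF and write the result to output_pdf."""
    # Imported lazily so this file can be loaded without ocrmypdf installed.
    import ocrmypdf

    ocrmypdf.ocr(input_pdf, output_pdf)


def main(argv):
    """Expect: script.py INPUT.pdf OUTPUT.pdf. Returns a process-style status."""
    if len(argv) != 3:
        print("usage: python3 ocr_sketch.py INPUT.pdf OUTPUT.pdf")
        return 2
    ocr_pdf(argv[1], argv[2])
    return 0


if __name__ == "__main__":
    # Without this guard, a re-import of the script by a worker process
    # would run the OCR call again at import time.
    main(sys.argv)
```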

@billfitzgerald
Owner

I did a fair amount of testing over the weekend, and have some good notes on the osx setup.

This ticket has evolved into two parallel issues: getting reliable documentation for OSX setup, and troubleshooting issues related to ocrmypdf on OSX.

I have the OCRMyPDF issue documented here: #17

I'm going to edit the title of this issue to reflect the need to get OSX setup better documented. These two things will happen in parallel, as addressing the OCRMyPDF issues in the code is a blocker to using the full script in OSX.

The good news is that it looks like the problems are isolated to this single issue.

The bad news is that this issue effectively blocks cleaning of pdfs for people working in OSX.

@billfitzgerald changed the title from "Syntax Error" to "Document setup procedure for OSX" on Jan 24, 2022

@cogdog
Author

cogdog commented Jan 24, 2022

Thanks for all the followup. I was just wanting to test it out and do not have a use case yet. And for my interests, it would be less about PDFs than web sites.

@billfitzgerald
Owner

Cool - and for what it's worth, working through the questions and feedback you have raised in this thread has resulted in some nice improvements, so thank you!

I'm still nailing down the final details and steps, but I'll be sharing out some reasonably precise documentation for getting up and running on OSX, in addition to some better code that is tested in OSX on Python 3.9

@billfitzgerald
Owner

Closing this out.

The script is now running cleanly in OSX and has been tested against 3.9.10.

Instructions to get up and running are in the wiki: https://github.com/billfitzgerald/trapper-keeper/wiki/Installation-in-OSX
