Document setup procedure for OSX #7

cogdog · 2022-01-19T00:58:28Z

I did not get far.

Created a csv with 3 pdf URLs and 3 web URLs, as specified, named keep_this.csv, in a source/ directory (note: the readme suggests using this; it might help if the distro included a sample csv one could run as a first test)
Modified archive.py with the new file name
Ran python archive.py -p csv

I get an error message

  File "archive.py", line 143
    print(f"Processing {url}\n")
                              ^
SyntaxError: invalid syntax

I know nothing of python and it's likely a rookie error.

billfitzgerald · 2022-01-19T02:11:04Z

No such thing as a rookie error -

Can you upload the csv here and I'll try and process it, and/or see if I can spot anything amiss?

billfitzgerald · 2022-01-19T02:13:46Z

And: #8 - I'll get this done tomorrow.

cogdog · 2022-01-19T03:08:59Z

Oi, I forgot the file, of course... It was really just a random set of things to test

keep_this.csv

billfitzgerald · 2022-01-19T04:15:17Z

I'll run this tomorrow, but the error you got might be related to your python version.

At the command line, run python -V - ideally, you will see 3.8.x or above. This was developed in 3.8 on Linux.

Or, specify that you want to run python 3.x by running python3 archive.py -p csv
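For anyone landing here with the same SyntaxError, the version check can also be done from a small script. This is only a sketch: the function name `check_version` and the two thresholds are illustrative assumptions, not part of archive.py. It relies on two facts: f-strings were added in Python 3.6, and this thread says the script was developed on 3.8. The snippet deliberately avoids f-strings so it can still run under Python 2.7.

```python
import sys

# f-strings (the syntax on line 143 of archive.py) were added in Python 3.6.
MIN_FSTRING = (3, 6)
# The thread says the script was developed on 3.8, so treat that as the target.
RECOMMENDED = (3, 8)


def check_version(version_info=None):
    """Return a short verdict for the given (major, minor, ...) version tuple.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    current = tuple(version_info[:2])
    if current < MIN_FSTRING:
        return "too old: f-strings will raise SyntaxError"
    if current < RECOMMENDED:
        return "may work, but older than the version the script was developed on"
    return "ok"


if __name__ == "__main__":
    # Print the running interpreter's version next to the verdict.
    print(sys.version.split()[0] + " -> " + check_version())
```

Run it once with `python` and once with `python3` to see which interpreter each command actually points to.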

billfitzgerald · 2022-01-19T18:08:47Z

Okay - after running this, I believe that the syntax error is due to the python version.

@cogdog - can you report back what you get when you enter python -V ?

Related: the csv was processed successfully, but pulling data from the pdf files threw an error, so I opened #11 to address this.

cogdog · 2022-01-19T20:16:43Z

Yes, I get 2.7.10

I ran a Homebrew update... It looks like I have 3.9.10

python3 --version
Python 3.9.10

so as you suggest, I am trying python3.

New error, I am missing a "BeautifulSoup module"

python3  archive.py -p csv
Traceback (most recent call last):
  File "/Users/cogdog/Documents/webdev/trapper-keeper/archive.py", line 9, in <module>
    from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'

So I get that via

pip3 install beautifulsoup4

Next error: missing module named "pandas". And I do

pip3 install pandas

Next missing module is tldextract so...

pip3 install tldextract

And onward to add modules pyfiglet and ocrmypdf
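Rather than discovering the missing modules one traceback at a time, a short script can list them all at once. This is a rough sketch, not something shipped with the repo: the `REQUIRED` mapping and the `missing_packages` helper are assumptions that cover only the modules named in this thread, plus selenium, which archive.py also uses. If the repo provides a requirements.txt, installing from that is the simpler route.

```python
import importlib.util

# Import names seen in this thread, mapped to the pip package providing each.
# This list is an assumption based on the errors reported here, not an
# authoritative dependency list for archive.py.
REQUIRED = {
    "bs4": "beautifulsoup4",
    "pandas": "pandas",
    "tldextract": "tldextract",
    "pyfiglet": "pyfiglet",
    "ocrmypdf": "ocrmypdf",
    "selenium": "selenium",
}


def missing_packages(required=REQUIRED):
    """Return the pip package names whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Print a ready-to-paste install command for everything missing.
        print("pip3 install " + " ".join(missing))
    else:
        print("All listed modules are importable.")
```

Run it with the same interpreter you use for archive.py (here, `python3`), since each Python installation has its own set of installed packages.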

Finally on! and more errors:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/common/service.py", line 71, in start
    self.process = subprocess.Popen(cmd, env=self.env,
  File "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 1821, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/cogdog/Documents/webdev/trapper-keeper/archive.py", line 68, in <module>
    driver = webdriver.Firefox()
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
    self.service.start()
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/common/service.py", line 81, in start
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

And now, I need a break...

billfitzgerald · 2022-01-19T21:22:29Z

Alan - so close!

I think that there are a small number of issues that currently stand between where you are now and getting this running locally.

Install geckodriver for Firefox. Really, it's downloading the latest driver and making sure the downloaded file is in your PATH - these instructions go over the details: https://selenium-python.readthedocs.io/installation.html#drivers
Pull the latest updates from the repo - @jgraham909 added a requirements.txt that looks like it has everything
Use the attached csv file - I modified it from the one you shared, and used it to support some work on better error handling.
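To confirm the first step worked, you can check whether geckodriver is visible on PATH before running the script. A minimal sketch using only the standard library; the helper name `find_driver` is hypothetical and not part of archive.py or Selenium:

```python
import shutil


def find_driver(name="geckodriver", path=None):
    """Return the full path of the executable, or None if it is not found.

    With path=None, shutil.which searches the current PATH.
    """
    return shutil.which(name, path=path)


if __name__ == "__main__":
    location = find_driver()
    if location:
        print("geckodriver found at " + location)
    else:
        print("geckodriver is not on PATH")
```

If it reports that geckodriver is not on PATH, the downloaded file is either in a directory that PATH does not include or is not marked executable.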

One of the pdfs you linked to wasn't able to be processed when I ran it - it gets downloaded, but the processing/text extraction chokes and I haven't had a chance to explore why. PDFs are screwy beasts.

I ran this successfully before responding here, so it worked in Linux. There are definitely differences between Linux and macOS, so this will be helpful to trap any of those and get them documented.

Thank you for the testing you have done/are doing. It's hugely appreciated.

keep_this_clean.csv

billfitzgerald · 2022-01-19T21:41:25Z

This resource for installing Geckodriver is for Ubuntu, but it's the same principles for Mac: https://askubuntu.com/questions/851401/where-to-find-geckodriver-needed-by-selenium-python-package

billfitzgerald · 2022-01-20T15:56:14Z

Some initial install instructions are in place here: https://github.com/billfitzgerald/trapper-keeper/wiki/B.-Getting-Started

billfitzgerald · 2022-01-22T22:52:27Z

@cogdog - Just an FYI/heads up - I'm going to work on install instructions for OSX this weekend (tomorrow, hopefully) - I'm working from a fresh machine so I'll hopefully hit all the steps.

billfitzgerald · 2022-01-23T14:27:40Z

Currently getting this running on OSX in Python 3.9

The web page cleanup works with no issue.

The pdf processing is throwing some hiccups. More to come.

billfitzgerald · 2022-01-23T14:40:28Z

Note: https://ocrmypdf.readthedocs.io/en/v13.2.0/api.html

Need to adjust the code to support osx and windows.

billfitzgerald · 2022-01-24T13:04:47Z

I did a fair amount of testing over the weekend, and have some good notes on the osx setup.

This ticket has evolved into two parallel issues: getting reliable documentation for OSX setup, and troubleshooting issues related to ocrmypdf on OSX.

I have the OCRMyPDF issue documented here: #17

I'm going to edit the title of this issue to reflect the need to get OSX setup better documented. These two things will happen in parallel, as addressing the OCRMyPDF issues in the code is a blocker to using the full script in OSX.

The good news is that it looks like the problems are isolated to this single issue.

The bad news is that this issue effectively blocks cleaning of pdfs for people working in OSX.

cogdog · 2022-01-24T20:50:26Z

Thanks for all the followup, I was just wanting to test it out and do not have a use case yet. And for my interests, it would not be as much on PDF as web sites.

billfitzgerald · 2022-01-25T15:28:31Z

Cool - and for what it's worth, working through the questions and feedback you have raised in this thread has resulted in some nice improvements, so thank you!

I'm still nailing down the final details and steps, but I'll be sharing out some reasonably precise documentation for getting up and running on OSX, in addition to some better code that is tested in OSX on Python 3.9

billfitzgerald · 2022-01-27T00:00:42Z

Closing this out.

The script is now running cleanly in OSX and has been tested against 3.9.10.

Instructions to get up and running are in the wiki: https://github.com/billfitzgerald/trapper-keeper/wiki/Installation-in-OSX

billfitzgerald mentioned this issue Jan 19, 2022

Improve error handling when processing pdfs #11

Closed

billfitzgerald mentioned this issue Jan 20, 2022

Document the setup process so people have an easier time getting started. #13

Closed

billfitzgerald changed the title ~~Syntax Error~~ Document setup procedure for OSX Jan 24, 2022

billfitzgerald closed this as completed Jan 27, 2022
