Document setup procedure for OSX #7
Comments
No such thing as a rookie error. Can you upload the CSV here? I'll try to process it and/or see if I can spot anything amiss.
And: #8 - I'll get this done tomorrow.
Oi, I forgot the file, of course... It was really just a random set of things to test.
I'll run this tomorrow, but the error you got might be related to your Python version. At the command line, check which version of Python you have, or specify explicitly that you want to run Python 3.x.
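For anyone following along, here is a minimal sketch of the same check done from inside Python rather than at the shell. The helper name `is_python3` is hypothetical and not part of this project; it only reports whether the interpreter that runs it is Python 3.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple is Python 3 or newer."""
    return version_info[0] >= 3


if __name__ == "__main__":
    # sys.version starts with the version number, e.g. "3.9.10 (main, ...)"
    print("Running Python " + sys.version.split()[0])
    if not is_python3():
        print("This looks like Python 2; try launching the script with python3.")
```

Saving this to a scratch file and running it once with `python` and once with `python3` should show which interpreter each command points at.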
Yes, I get 2.7.10. I ran a Homebrew update... It looks like I have 3.9.10,
so as you suggest, I am trying python3. New error: I am missing a "BeautifulSoup module".
So I get that via
Next error: missing module named "pandas". And I do
Next missing module is tldextract, so...
And onward to add modules. Finally on! And more errors:
And now, I need a break...
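Rather than discovering missing modules one error at a time, a small pre-flight check can list them all at once. This is only a sketch: the `REQUIRED` mapping covers just the modules named in this thread (the script's full dependency list is an assumption and may be longer), and `missing_modules` is a hypothetical helper, not something shipped with the project.

```python
import importlib.util

# Import name -> pip package name. Only modules mentioned in this thread;
# the complete list the script needs is an assumption.
REQUIRED = {
    "bs4": "beautifulsoup4",  # BeautifulSoup is imported as bs4
    "pandas": "pandas",
    "tldextract": "tldextract",
}


def missing_modules(required=REQUIRED):
    """Return the pip names of top-level modules that cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages: " + ", ".join(missing))
        print("Try: python3 -m pip install " + " ".join(missing))
    else:
        print("All listed modules are importable.")
```

Using `python3 -m pip` ties the install to the same interpreter that runs the script, which matters on a Mac where `python` and `python3` can be different versions.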
Alan - so close! I think that there are a small number of issues that currently stand between where you are now and getting this running locally.
One of the PDFs you linked to couldn't be processed when I ran it: it gets downloaded, but the processing/text extraction chokes and I haven't had a chance to explore why. PDFs are screwy beasts. I ran this successfully before responding here, so it worked in Linux. There are definitely differences between Linux and MacOS, so this will be helpful to trap for any of those and get them documented. Thank you for the testing you have done/are doing. It's hugely appreciated.
This resource for installing Geckodriver is for Ubuntu, but the same principles apply on Mac: https://askubuntu.com/questions/851401/where-to-find-geckodriver-needed-by-selenium-python-package
Some initial install instructions are in place here: https://github.com/billfitzgerald/trapper-keeper/wiki/B.-Getting-Started
@cogdog - Just an FYI/heads up - I'm going to work on install instructions for OSX this weekend (tomorrow, hopefully) - I'm working from a fresh machine so I'll hopefully hit all the steps.
Currently getting this running on OSX in Python 3.9. The web page cleanup works with no issue. The PDF processing is throwing some hiccups. More to come.
Note: https://ocrmypdf.readthedocs.io/en/v13.2.0/api.html - Need to adjust the code to support OSX and Windows.
I did a fair amount of testing over the weekend, and have some good notes on the OSX setup. This ticket has evolved into two parallel issues: getting reliable documentation for OSX setup, and troubleshooting issues related to ocrmypdf on OSX. I have the OCRMyPDF issue documented here: #17. I'm going to edit the title of this issue to reflect the need to get OSX setup better documented. These two things will happen in parallel, as addressing the OCRMyPDF issues in the code is a blocker to using the full script in OSX. The good news is that it looks like the problems are isolated to this single issue. The bad news is that this issue effectively blocks cleaning of PDFs for people working in OSX.
Thanks for all the followup. I was just wanting to test it out and do not have a use case yet. And for my interests, it would be less about PDFs than about web sites.
Cool - and for what it's worth, working through the questions and feedback you have raised in this thread has resulted in some nice improvements, so thank you! I'm still nailing down the final details and steps, but I'll be sharing out some reasonably precise documentation for getting up and running on OSX, in addition to some better code that is tested in OSX on Python 3.9.
Closing this out. The script is now running cleanly in OSX and has been tested against 3.9.10. Instructions to get up and running are in the wiki: https://github.com/billfitzgerald/trapper-keeper/wiki/Installation-in-OSX
I did not get far.
keep_this.csv
in a source/ directory (note: the readme suggests using this; it might help if the distro included a sample CSV one could run as a first test)
python archive.py -p csv
I get an error message
I know nothing of Python and it's likely a rookie error.