
Feature python3 support lsh tweaks #364

Merged: 10 commits, Dec 5, 2017

Conversation

@lsh-0 (Contributor) commented Nov 29, 2017

... ok.

the first and most important point: I have the tests running for python3, so thanks to @seanwiseman for all the hard work.

but also:

  • updated fixtures
  • updated api-raml
  • updated install script so it uses python3
  • the scripts (scrape-article, scrape-random, etc) are now working again
  • update-api-raml no longer dies when it tries to change back to the previous dir (if you're using a symlink to a shared api-raml dir like me)

some controversial edits:

I haven't gone through the original PR too closely yet.

@lsh-0 (Contributor, Author) commented Nov 29, 2017

ok, I see there is an issue unpickling py2 objects in py3: requests-cache/requests-cache#83

I'll see if I can't convert it

@lsh-0 (Contributor, Author) commented Nov 29, 2017

problem can be replicated with this:

import sqlite3
c = sqlite3.connect('cache/requests-cache.sqlite3')
cur = c.cursor()
results = cur.execute('select value from responses limit 1').fetchone()
value = results[0]  # first value of only row

import pickle
pickle.loads(value, encoding='utf-8')  # fails: the py2-pickled value can't be decoded as utf-8

hrm, I can get the value to unpickle in python 3 by changing it to:

import _dummy_thread as dummy_thread  # py3 name for the module py2 pickles reference as dummy_thread

import sqlite3
c = sqlite3.connect('cache/requests-cache.sqlite3')
cur = c.cursor()
results = cur.execute('select value from responses limit 1').fetchone()
value = results[0]

import pickle
value = pickle.loads(value, encoding='bytes')  # decode py2 str objects as bytes

but there are so many broken references after that. we should think about rebuilding the cache.
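For illustration, the py2-vs-py3 string difference behind this can be seen with a hand-built protocol-0 pickle of a py2 str (a minimal sketch; the bytes literal below is a synthetic example, not data from the cache):

```python
import pickle

# b"S'abc'\n." is the protocol-0 pickle python 2 produces for the str 'abc'
py2_pickle = b"S'abc'\n."

# default decoding turns the py2 str into a py3 str
print(pickle.loads(py2_pickle))                    # abc
# encoding='bytes' keeps it as raw bytes, matching what py2 code actually stored
print(pickle.loads(py2_pickle, encoding="bytes"))  # b'abc'
```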

I'll change the cache db path to include the version of python or leave unchanged if python 2
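A version-suffixed cache path could look something like this (a sketch; the base name and the leave-unchanged-on-python-2 rule follow the comment above, and the helper name is hypothetical):

```python
import sys

def cache_db_path(base='cache/requests-cache'):
    # hypothetical helper: suffix the cache db with the python version on py3,
    # but leave the original name alone on py2 so the existing cache keeps working
    if sys.version_info[0] == 2:
        return base + '.sqlite3'
    return '%s-py%d.%d.sqlite3' % (base, sys.version_info[0], sys.version_info[1])

print(cache_db_path())
```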

@lsh-0 (Contributor, Author) commented Nov 29, 2017

ha - er. the moment I push this change it's probably going to thrash iiif once it gets past the app tests and I'm signing off in a moment. Attaching the change as a patch.

patch.diff.txt

@seanwiseman (Contributor) commented:

Nice work @lsh-0. The elife-tools commit it uses is fully Python 2/3 compatible. There were a few import changes I had to make to get it to work when imported as a dependency. I will now tidy up that PR and get it into the develop branch; once done I will update the requirements here.

@giorgiosironi (Contributor) commented:

We could boot iiif--ci (4 servers), but we don't want to do that for every build. Down the road we could consider deploying several containers to a shared infrastructure to handle the build.

So I think I'll turn on iiif--ci, lock it, apply the patch along with a URL to build the cache, then revert it.

@giorgiosironi (Contributor) commented:

Nope, that won't work: when we switch back, the URL will be different.

@giorgiosironi (Contributor) commented:

Tweak

Parallel(n_jobs=-1)(delayed(render)(path, json_output_dir) for path in paths)
to use just 2 of the 4 processes, wait for a very long build, then turn it back to the original?
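The effect of capping the worker count can be sketched with stdlib concurrency (a sketch using concurrent.futures rather than the joblib call above; render here is a stand-in for the real render(path, json_output_dir)):

```python
from concurrent.futures import ThreadPoolExecutor

def render(path):
    # stand-in for rendering one article's XML to article-json
    return path.upper()

paths = ["elife-1.xml", "elife-2.xml", "elife-3.xml"]

# max_workers=2 caps concurrent work at 2, instead of one worker per core,
# throttling the flood of requests a full run would otherwise make
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(render, paths))

print(results)  # ['ELIFE-1.XML', 'ELIFE-2.XML', 'ELIFE-3.XML']
```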

@lsh-0 (Contributor, Author) commented Nov 30, 2017

that would definitely slow the huge number of requests made to iiif, but each article still does N requests, so it would be a steady stream instead of a waterfall.

I'll do as you suggest though and keep an eye on the iiif server. We do have a ticket open somewhere to ensure iiif can survive a backfill.

- we're going to be stuck with it from here on out, so let's not suffix it with 'py3'
- reduced the number of processes used while generating article-json so we don't flood iiif; this change should be reverted once the cache is rebuilt
@lsh-0 (Contributor, Author) commented Dec 1, 2017

ok - dropped the number of processes down to 2 and changed the cache name; it shouldn't thrash iiif now, but I'll keep an eye on the alerts

@lsh-0 (Contributor, Author) commented Dec 1, 2017

ah - green is a great little tool, but its behaviour on encountering import errors is to swallow them quietly and report 'no tests found'. It looks like a chunk of tests with import src.foo.bar style imports were not being run. Fixing them up now.
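One way to surface imports a runner might swallow is a quick pre-flight loop (a sketch; the module names below are illustrative, not the project's real test modules):

```python
import importlib

# try importing each test module up front so ImportErrors fail loudly,
# rather than being silently reported as 'no tests found'
for name in ["json", "sqlite3", "src.foo.bar"]:
    try:
        importlib.import_module(name)
        print(name, "ok")
    except ImportError as exc:
        print(name, "FAILED:", exc)
```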

@lsh-0 (Contributor, Author) commented Dec 1, 2017

cool - it's going through the corpus now and I can see all the cache misses

@lsh-0 (Contributor, Author) commented Dec 1, 2017

(would love to know where this is originating from:)

/ext/jenkins-libraries-runner/workspace/or_PR-364-CEJKMGU523MRCQW4DFGD/venv/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available XML parser for this system ("lxml-xml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 50 of the file src/generate_article_json.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml-xml")

afaict elife-tools is doing the right thing, and it only instantiates bs4 once
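To find where a warning like this actually originates, one option is to promote it to an exception so the traceback points at the real call site (a sketch using the stdlib warnings filter):

```python
import warnings

# escalate this specific UserWarning to an exception; the resulting
# traceback then shows exactly which call triggered it
warnings.filterwarnings("error", message="No parser was explicitly specified")

try:
    warnings.warn("No parser was explicitly specified, so ...", UserWarning)
except UserWarning as exc:
    print("warning raised as error:", exc)
```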


@gnott (Member) commented Dec 1, 2017

The warning is from https://github.com/elifesciences/elife-tools/blob/develop/elifetools/parseJATS.py#L15

    return BeautifulSoup(xml, ["lxml", "xml"])

should be

    return BeautifulSoup(xml, "lxml-xml")

I believe it became like this after the BeautifulSoup version in use was incremented. I'll see if it works and arrange a PR on the elife-tools project.

@lsh-0 (Contributor, Author) commented Dec 3, 2017

updating requirements.txt to install elifetools as an editable clone and changing that line (which I thought was still valid anyway) to what was requested didn't suppress the warning for me. Maybe I did something wrong.

@lsh-0 (Contributor, Author) commented Dec 3, 2017

build failed with:


INFO - 2017-12-01 08:13:53,806 - Requesting url https://prod--iiif.elifesciences.org/lax:13964/elife-13964-fig4-v3.tif/info.json (cache key '5fcd4b882d0d2ef2e6850167a0aa6b12a76e546a2ed91ef55b56875045ba0aee') -- {"stack_info": null}
/ext/jenkins-libraries-runner/workspace/or_PR-364-CEJKMGU523MRCQW4DFGD/venv/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available XML parser for this system ("lxml-xml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 882 of the file /usr/lib/python3.5/threading.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml-xml")

  markup_type=markup_type))
INFO - 2017-12-01 08:13:53,982 - elife-13943-v2.xml -> elife-13943-v2.xml.json => success -- {"stack_info": null}
/ext/jenkins-libraries-runner/workspace/or_PR-364-CEJKMGU523MRCQW4DFGD/venv/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available XML parser for this system ("lxml-xml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 882 of the file /usr/lib/python3.5/threading.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml-xml")

  markup_type=markup_type))
INFO - 2017-12-01 08:13:54,036 - elife-13943-v1.xml -> elife-13943-v1.xml.json => success -- {"stack_info": null}
Terminated
script returned exit code 143

investigating

@lsh-0 (Contributor, Author) commented Dec 4, 2017

this failure:

INFO - 2017-12-04 03:11:29,953 - Requesting url https://prod--iiif.elifesciences.org/lax:03043/elife-03043-fig7-v1.tif/info.json (cache key 'c3e4b1e37b2dc2949011c289cfda6f1364d0364577287354eb985c7cb0221c24') -- {"stack_info": null}
INFO - 2017-12-04 03:11:30,179 - elife-03043-v1.xml -> elife-03043-v1.xml.json => success -- {"stack_info": null}
/ext/jenkins-libraries-runner/workspace/or_PR-364-CEJKMGU523MRCQW4DFGD/venv/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available XML parser for this system ("lxml-xml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 50 of the file src/generate_article_json.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml-xml")

  markup_type=markup_type))
INFO - 2017-12-04 03:11:30,275 - elife-03035-v1.xml -> elife-03035-v1.xml.json => success -- {"stack_info": null}
Sending interrupt signal to process
/ext/jenkins-libraries-runner/workspace/or_PR-364-CEJKMGU523MRCQW4DFGD/venv/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available XML parser for this system ("lxml-xml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 882 of the file /usr/lib/python3.5/threading.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml-xml")

  markup_type=markup_type))
INFO - 2017-12-04 03:11:30,906 - elife-03032-v1.xml -> elife-03032-v1.xml.json => success -- {"stack_info": null}
Terminated
script returned exit code 143

@lsh-0 (Contributor, Author) commented Dec 4, 2017

@giorgiosironi, is there a timeout on the builds? I see 'Sending interrupt signal to process' this time.

@giorgiosironi (Contributor) commented:

Yes, there is a default timeout of 2 hours. I'll override this.

@gnott (Member) commented Dec 4, 2017

elifesciences/elife-tools#256 in elife-tools will use "lxml-xml", and hopefully it will get rid of the warning (once merged and integrated).

@lsh-0 (Contributor, Author) commented Dec 4, 2017

excellent, thanks both. I'll integrate your change @gnott and push it up to trigger the rebuild. I think it gets about 2/3 of the way through the corpus before it starts getting cache misses. I see it got rebuilt on Giorgio's change to the Jenkins file.

- this should suppress the warning about the parser being used
@lsh-0 lsh-0 merged commit 2f783a6 into feature-python3-support Dec 5, 2017
@lsh-0 lsh-0 deleted the feature-python3-support-lsh-tweaks branch December 5, 2017 01:07