Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#65 partial support for multi-file fulltext records #67

Merged
merged 1 commit into from
Jan 12, 2018

Conversation

spacemansteve
Copy link
Contributor

the fulltext is spread over multiple files for over 11k bibcodes. currently, fulltext fails on all of them. with this change only the first file of the list of files will be processed. this is very much a partial solution, fortunately the first file holds the bulk of the text. other files in the list typically hold text from tables.

the fulltext is spread over multiple files for over 11k bibcodes.  currently, fulltext fails on all of them.  with this change only the first file of the list of files will be processed.  this is very much a partial solution, fortunately the first file holds the bulk of the text.  other files in the list typically hold text from tables.
@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 87.254% when pulling 08728bf on spacemansteve:master into d411452 on adsabs:master.

@@ -268,3 +270,10 @@ def check_if_extract(message_list, extract_path):

return {'Standard': publish_list_of_standard_dictionaries,
'PDF': publish_list_of_pdf_dictionaries}

def filename_cleanup(filename):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can split a string by a given text with this:

filenames = filename.split(",")

This will return an array like ["file1", "file2"] that you can iterate, if there is no comma then you just get an array with one element like this ["file"]. Hence, you don't need to check if there are commas or not and you can just use filename.split(",")[0] and probably you don't need a function just for that :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Steve was trying to write a more robust file-splitting that would allow a filename containing a "," to be handled correctly. I thought this may help us someday if we decide to name source files according to the pairtree syntax based on an identifier (e.g. 20/01/gr/,q/c,/,,/,1/00/42/A/). However, as you can see there are going to be plenty of cases where the sequence ",/" appears in here as well preventing this splitting logic to work. AFAIK we don't have right now any filenames which contain commas, so we should not agonize too much about the splitting business but on the other hand we may want to consider some day changing the way we concatenate filenames if we are going to use pairtree to name them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Albergo mentioned, commas can appear in the strangest places. Here's a line from fulltext/all.links showing a comma in the middle of a filename:
2008bhgs.confE...1V /proj/ads/fulltext/sources/downloads/cache/POS/pos.sissa.it//archive/conferences/075/001/BHs,%20GR%20and%20Strings_001.pdf POS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, but we can still use the shorter code filename.split(",/")[0]. Or am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't claim to be pythonic, but I'd rather not create an array when I just want to get the prefix of a string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's not get blocked here and go ahead :-)

@spacemansteve spacemansteve merged commit 90c6479 into adsabs:master Jan 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants