#65 partial support for multi-file fulltext records #67

spacemansteve · 2018-01-10T21:23:30Z

The fulltext is spread over multiple files for over 11k bibcodes. Currently, fulltext fails on all of them. With this change, only the first file in the list of files will be processed. This is very much a partial solution, but fortunately the first file holds the bulk of the text; the other files in the list typically hold text from tables.

coveralls · 2018-01-10T21:28:25Z

Coverage increased (+0.2%) to 87.254% when pulling 08728bf on spacemansteve:master into d411452 on adsabs:master.

marblestation · 2018-01-10T22:15:22Z

adsft/checker.py

@@ -268,3 +270,10 @@ def check_if_extract(message_list, extract_path):

    return {'Standard': publish_list_of_standard_dictionaries,
            'PDF': publish_list_of_pdf_dictionaries}
+
+def filename_cleanup(filename):


You can split a string on a given separator like this:

filenames = filename.split(",")

This returns a list like ["file1", "file2"] that you can iterate over. If there is no comma, you just get a list with one element, like ["file"]. Hence you don't need to check whether there are commas: you can just use filename.split(",")[0], and you probably don't need a function just for that :-)
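For illustration, a minimal sketch of the behaviour described above. The filename strings are made up for the example and are not taken from the ADS data:

```python
# Hypothetical comma-separated filename strings, for illustration only.
multi = "file1,file2"
single = "file"

# str.split always returns a list, even when the separator is absent.
print(multi.split(","))
print(single.split(","))

# Taking element 0 therefore works in both cases, with no comma check needed.
first_multi = multi.split(",")[0]
first_single = single.split(",")[0]
```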

I think Steve was trying to write a more robust file-splitting routine that would correctly handle a filename containing a ",". I thought this might help us someday if we decide to name source files according to the pairtree syntax based on an identifier (e.g. 20/01/gr/,q/c,/,,/,1/00/42/A/). However, as you can see, there will be plenty of cases where the sequence ",/" appears in such names as well, preventing this splitting logic from working. AFAIK we don't currently have any filenames that contain commas, so we should not agonize too much over the splitting business. On the other hand, we may someday want to consider changing the way we concatenate filenames if we are going to use pairtree to name them.

As Alberto mentioned, commas can appear in the strangest places. Here's a line from fulltext/all.links showing a comma in the middle of a filename:
2008bhgs.confE...1V /proj/ads/fulltext/sources/downloads/cache/POS/pos.sissa.it//archive/conferences/075/001/BHs,%20GR%20and%20Strings_001.pdf POS
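To make the two failure modes concrete, here is a small sketch. It assumes that multi-file records are stored as absolute paths joined by ","; the first path is the one from the all.links line above, the second path is hypothetical, and the pairtree string is the example quoted earlier in the thread.

```python
# Path from the all.links line above (note the comma inside the filename).
pos_path = ("/proj/ads/fulltext/sources/downloads/cache/POS/pos.sissa.it/"
            "/archive/conferences/075/001/BHs,%20GR%20and%20Strings_001.pdf")
# Hypothetical second file, only here to build a multi-file string.
second_path = "/proj/ads/fulltext/sources/example/tables.txt"
combined = pos_path + "," + second_path

# Splitting on "," cuts the first filename at its embedded comma.
naive = combined.split(",")[0]

# Splitting on ",/" only matches the comma that precedes the next absolute path.
robust = combined.split(",/")[0]

# A pairtree-style name can itself contain ",/", so even ",/" is ambiguous.
pairtree = "20/01/gr/,q/c,/,,/,1/00/42/A/"
pairtree_prefix = pairtree.split(",/")[0]
```

Under these assumptions, `naive` is truncated in the middle of the filename, `robust` recovers the full first path, and `pairtree_prefix` is cut short even though the string is a single name.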

I see, but we can still use the shorter code filename.split(",/")[0]. Or am I missing something?

I don't claim to be pythonic, but I'd rather not create an array when I just want to get the prefix of a string.
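One way to get the prefix without building a list is `str.find` plus slicing. This is only a sketch of the idea, not the code in this PR: the body of `filename_cleanup` is not shown in the hunk above, and the helper name `first_filename` and its `sep` parameter are hypothetical.

```python
def first_filename(filename, sep=",/"):
    """Return the part of `filename` before the first `sep`.

    Hypothetical helper: if `sep` does not occur, the input is returned
    unchanged, which covers the single-file case.
    """
    index = filename.find(sep)  # -1 when sep is absent
    return filename if index == -1 else filename[:index]


# Made-up inputs for illustration.
multi_first = first_filename("/a/b.txt,/c/d.txt")
single_first = first_filename("/a/b.txt")
embedded_comma_first = first_filename("/a/b,c.txt,/d.txt")
```

`filename.partition(sep)[0]` gives the same result in one call, at the cost of allocating a three-element tuple.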

Ok, let's not get blocked here and go ahead :-)

marblestation requested changes Jan 10, 2018

marblestation approved these changes Jan 11, 2018

spacemansteve merged commit 90c6479 into adsabs:master Jan 12, 2018

spacemansteve commented Jan 10, 2018

coveralls commented Jan 10, 2018

marblestation Jan 10, 2018

aaccomazzi Jan 11, 2018

spacemansteve Jan 11, 2018

marblestation Jan 11, 2018

spacemansteve Jan 11, 2018

marblestation Jan 11, 2018
