let preston cat dereference content line ranges #128

Closed
mielliott opened this issue Jun 21, 2021 · 14 comments

Comments

@mielliott
Collaborator

@mielliott Just installed preston 0.3.0 and found that

https://deeplinker.bio/cat/line:zip:hash://sha256/29d30b566f924355a383b13cd48c3aa239d42cba0a55f4ccfc2930289b88b43c!/occurrence.txt!/L1

works like a charm (see attached screenshot). Note that the hash refers to the (huge) eBird dataset.

[Image: Screenshot from 2021-06-18 16-20-14]

I had the urge to use a line range, e.g., L1-2. Is that something you had in mind too?

Originally posted by @jhpoelen in #109 (comment)

@mielliott
Collaborator Author

mielliott commented Oct 28, 2021

Crazy idea - perhaps we could specify disjoint sets of lines, e.g. L1,10 or L1,10-20. This would make it simple to pair any row of a TSV with the header row - without the header, any other row's meaning can be difficult to parse.

e.g. line:zip:hash://sha256/29d30b566f924355a383b13cd48c3aa239d42cba0a55f4ccfc2930289b88b43c!/occurrence.txt!/L1,L10-13

id	institutionCode	collectionCode	basisOfRecord	occurrenceID	catalogNumber	recordedBy	individualCount	year	month	day	country	stateProvince	county	locality	decimalLatitude	decimalLongitude	scientificName	kingdom	phylum	class	order	family	genus	specificEpithet	publishingCountry
OBS142128464	CLO	EBIRD	HumanObservation	URN:catalog:CLO:EBIRD:OBS142128464	OBS142128464	obsr277523		2012	02	24	United States	Florida	Pinellas	Boyd Hill Nature Park & Lake Maggiore	27.7322367	-82.6521206	Corvus ossifragus	Animalia	Chordata	Aves	Passeriformes	Corvidae	Corvus	ossifragus	US
OBS142128484	CLO	EBIRD	HumanObservation	URN:catalog:CLO:EBIRD:OBS142128484	OBS142128484	obsr277523		2012	02	24	United States	Florida	Pinellas	Boyd Hill Nature Park & Lake Maggiore	27.7322367	-82.6521206	Sterna forsteri	Animalia	Chordata	Aves	Charadriiformes	Laridae	Sterna	forsteri	US
OBS142128461	CLO	EBIRD	HumanObservation	URN:catalog:CLO:EBIRD:OBS142128461	OBS142128461	obsr277523		2012	02	24	United States	Florida	Pinellas	Boyd Hill Nature Park & Lake Maggiore	27.7322367	-82.6521206	Ardea herodias	Animalia	Chordata	Aves	Pelecaniformes	Ardeidae	Ardea	herodias	US
OBS142128485	CLO	EBIRD	HumanObservation	URN:catalog:CLO:EBIRD:OBS142128485	OBS142128485	obsr277523		2012	02	24	United States	Florida	Pinellas	Boyd Hill Nature Park & Lake Maggiore	27.7322367	-82.6521206	Ardea alba	Animalia	Chordata	Aves	Pelecaniformes	Ardeidae	Ardea	alba	US
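
The same kind of disjoint selection can be tried on any local text file with sed, which helps when reasoning about what a selector should return. This is only a rough sketch: selector_to_sed is a hypothetical helper (not part of preston), it assumes selectors shaped like L1,L10-13 or L1-L2,L10-L13, and it does no validation.

```shell
# Hypothetical helper (not part of preston):
# turn a selector such as "L1,L10-13" into the sed script "1p;10,13p".
selector_to_sed() {
  printf '%s\n' "$1" | sed -e 's/,/p;/g' -e 's/-/,/g' -e 's/L//g' -e 's/$/p/'
}

# Print line 1 (the "header") plus lines 10 through 13 of a stand-in 20-line file.
seq 1 20 | sed -n "$(selector_to_sed 'L1,L10-13')"
```

Because sed prints lines in file order, the header comes out first no matter where it appears in the selector.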

And, if there's any use for it, this could also be extended to cut ranges. And now that we're using "line" operations, it could be fun to add "column" operations (e.g. the default behavior of the unix cut)... oh, the possibilities are endless! But let's not get carried away...
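
For reference, column selection on tab-separated text is what cut -f already does, so a column operation would presumably mirror something like the following. This is plain cut applied to an abridged copy of the header and last row shown above (only three of the columns are kept), not an existing preston feature.

```shell
# Abridged header and row from the example above: id, country, scientificName.
printf 'id\tcountry\tscientificName\nOBS142128485\tUnited States\tArdea alba\n' |
  cut -f1,3   # keep the first and third tab-separated fields
```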

@jhpoelen
Member

jhpoelen commented Oct 28, 2021

@mielliott Introducing the L1,L5-61 notation would be pretty neat. I usually end up using two cat calls instead.

A more general approach would be to say something like: here's the schema definition associated with this piece of content. With DwC-A, this schema definition would be a fragment of the meta.xml.

And . . . your proposed notation sounds neat for reasons other than just getting a sense for a schema. I imagine that subsetting a disjoint range of records from a (potentially) giant dataset would be very useful.

mielliott added a commit that referenced this issue Apr 21, 2022
@mielliott
Collaborator Author

Still gotta squash some bugs. Bear with me.

mielliott added a commit that referenced this issue Apr 21, 2022
@mielliott
Collaborator Author

Tests are passing in my IntelliJ (Windows) but failing when running mvn clean package (bash on Ubuntu). Will resume debugging on Friday.

@jhpoelen
Member

@mielliott sounds good. Sometimes I just commit known failures to share the joys of fixing the bug or test error. . .

mielliott added a commit that referenced this issue Apr 22, 2022
@jhpoelen
Member

@mielliott Thanks for sharing your code.

After fixing the test case, I tried:

$ preston track "https://duckduckgo.com"
[...]
$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1'
<!DOCTYPE html>
$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L2'
<!--[if IEMobile 7 ]> <html lang="en-US" class="no-js iem7"> <![endif]-->

However, when I tried:

$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1-L2'

the command didn't complete; it just seemed stuck.

Instead, I expected:

$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1-L2'
<!DOCTYPE html>
<!--[if IEMobile 7 ]> <html lang="en-US" class="no-js iem7"> <![endif]-->

Any idea what is going on?

@mielliott
Collaborator Author

Yeah, I think I figured it out. Just a sec.

mielliott added a commit that referenced this issue Apr 25, 2022
mielliott added a commit that referenced this issue Apr 25, 2022
@mielliott
Collaborator Author

There were some weird bugs due to some inherited code in the SelectedLinesReader class, but it's sorted out now:

$ preston track "https://duckduckgo.com"
...
<https://duckduckgo.com> <http://purl.org/pav/hasVersion> <hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9> <urn:uuid:2cb167f2-a461-4bd8-af3b-80d9b9f83bd3> .

$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1'
<!DOCTYPE html>

$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L2'
<!--[if IEMobile 7 ]> <html lang="en-US" class="no-js iem7"> <![endif]-->

$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1-L2'
<!DOCTYPE html>
<!--[if IEMobile 7 ]> <html lang="en-US" class="no-js iem7"> <![endif]-->

And you can also enjoy lists of lines

$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1,L3,L5'
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="ie6 lt-ie10 lt-ie9 lt-ie8 lt-ie7 no-js" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="ie8 lt-ie10 lt-ie9 no-js" lang="en-US"> <![endif]-->

as well as lists of line ranges

$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1-L2,L10-L13'
<!DOCTYPE html>
<!--[if IEMobile 7 ]> <html lang="en-US" class="no-js iem7"> <![endif]-->
        <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta http-equiv="content-type" content="text/html; charset=UTF-8;charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=1" />
<meta name="HandheldFriendly" content="true"/>

@mielliott
Collaborator Author

Note that the trailing newline character is not printed. This was the existing behavior for single-line queries, so it is preserved for multi-line queries. For example, line:blah!/L1 and line:blah!/L1-L1 should print the same thing (no trailing newline character).

@mielliott
Collaborator Author

Without the trailing newline character, some possibly unexpected things happen. e.g.

$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1' | wc -l
0

$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1' | echo | wc -l
1

$ cat <(preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1') <(preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L2') | wc -l
0

$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1,L2' | wc -l
1

$ preston cat 'line:hash://sha256/a1c07898c5bdc4f43460ba67550dcca6a5299115f91f7c0023f35b5c432d5ad9!/L1-5' | wc -l
4

Maybe we should include the "\n" at the end of lines. Is it part of the line? I feel like it is. @jhpoelen thoughts?

Also note that cut is so bold as to add a newline even when there isn't one (I don't know how I feel about this, though).

$ echo -n "no newline at the end of this" > txt
$ cat txt
no newline at the end of this$
$ cut txt -b4-
newline at the end of this
$

Notice where the $ appears after running each command.
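
The wc -l results above make sense once you recall that wc -l counts newline characters, not lines of text. A small illustration, using printf as a stand-in for preston output:

```shell
printf 'one' | wc -l            # 0: no newline character at all
printf 'one\ntwo' | wc -l       # 1: two lines of text, one newline character
printf 'one\ntwo\n' | wc -l     # 2: every line is terminated
```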

@jhpoelen
Member

jhpoelen commented Apr 26, 2022

@mielliott neat examples.

I'd say, follow the wisdom of cat:

cat doesn't add a newline, but echo without -n does:

$ echo -n "bla" | cat | wc -l 
0
$ echo -n "bla" | cat | echo | wc -l
1

with cut appending a newline, except when using -z (where the line delimiter is NUL instead):

$ echo -n "bla" | cat | cut -b1-2 | wc -l
1
$ echo -n "bla" | cat | cut -z -b1-2 | wc -l
0

@mielliott
Collaborator Author

Agreed, I don't think we should be adding \n where there isn't one.

I was more wondering about preston's current behavior of removing the \n at the end of a line. For example, in catting line:blah!/L1, preston does not print the \n at the end of the line, but head does. Note that head does not add \n. e.g.

$ preston get 'line:hash://sha256/d15a9b5273914ed4b5033e3d22e4de2e2740c947bbfab1e3ba85c948e39640b1!/L1' | wc -l
0
$ preston get 'hash://sha256/d15a9b5273914ed4b5033e3d22e4de2e2740c947bbfab1e3ba85c948e39640b1' | head -n1 | wc -l
1
$ echo -ne "borkbork" | head | wc -l
0
$ echo -ne "borkbork\n" | head | wc -l
1

So I suggest that we have preston print the \n if there is one.
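
One practical upside of keeping the \n is that single-line results can be concatenated again without merging lines. A rough simulation, where stripped and preserved are hypothetical stand-ins for the current and the suggested behavior (printf is used instead of preston):

```shell
# Hypothetical stand-ins for a single-line query result (not preston commands).
stripped()  { printf '%s' "$1"; }     # current behavior: line terminator removed
preserved() { printf '%s\n' "$1"; }   # suggested behavior: line terminator kept

{ stripped 'first'; stripped 'second'; } | wc -l     # 0: the results run together as "firstsecond"
{ preserved 'first'; preserved 'second'; } | wc -l   # 2: the lines stay separate
```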

@jhpoelen
Member

sounds good!

@jhpoelen
Member

@mielliott thanks for implementing the line range feature!

I just tried:

$ preston track https://duckduckgo.com
...
<https://duckduckgo.com> <http://purl.org/pav/hasVersion> <hash://sha256/29a040825232ab4fb9fa48e47e155e6138a4551814064f7bfe89c2a9260f463f> <urn:uuid:bd23ebc9-3a2f-486b-aa22-4f9e784a3699> .

and

$ preston cat 'line:hash://sha256/29a040825232ab4fb9fa48e47e155e6138a4551814064f7bfe89c2a9260f463f!/L1,L3'
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="ie6 lt-ie10 lt-ie9 lt-ie8 lt-ie7 no-js" lang="en-US"> <![endif]-->

and

$ preston cat 'line:hash://sha256/29a040825232ab4fb9fa48e47e155e6138a4551814064f7bfe89c2a9260f463f!/L1-L3'
<!DOCTYPE html>
<!--[if IEMobile 7 ]> <html lang="en-US" class="no-js iem7"> <![endif]-->
<!--[if lt IE 7]> <html class="ie6 lt-ie10 lt-ie9 lt-ie8 lt-ie7 no-js" lang="en-US"> <![endif]-->

very neat!
