
Python collections wrapper and passage retrieval setup #645

Merged
merged 23 commits into from May 15, 2019

2 participants
@emmileaf
Collaborator

commented May 13, 2019

Keeping an open PR with initial setup for iterating over collections with Pyjnius and segmenting documents, as well as discussion for adapting this to generally address #602.

@emmileaf emmileaf requested a review from lintool May 13, 2019

@emmileaf

Collaborator Author

commented May 13, 2019

Currently the setup uses the Java document generator classes for the collection-specific content parsing logic. These classes are tightly coupled to other parts of the code, such as IndexCollection:

public LuceneDocumentGenerator(IndexCollection.Args args, IndexCollection.Counters counters) {
  this.transform = null;
  config(args);
  setCounters(counters);
}

An alternative could be to provide a Python interface for the collection classes only, and skip the generator classes that convert SourceDocument into LuceneDocument? This would move the content parsing and HTML cleaning logic into Python, where it would be handled or rewritten.
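To make the second option concrete, here is a minimal sketch of what HTML cleaning on the Python side might look like, using only the standard library. The helper name `clean_html` and the class `_TextExtractor` are hypothetical and not part of this PR; the Java generators may apply different cleaning rules, so this is an illustration under those assumptions rather than a drop-in replacement.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with a space, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def clean_html(raw):
    """Hypothetical name: return the visible text of a raw HTML string."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any buffered trailing text
    return parser.text()
```

The trade-off is the one raised above: this logic would have to be maintained separately from the Java generators, and the two could drift apart.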


JBaseFileSegmentStatus = autoclass('io.anserini.collection.BaseFileSegment$Status')

JCarCollection = autoclass('io.anserini.collection.CarCollection')

@lintool

lintool May 13, 2019

Member

Can we use Python enums for collections?
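One way the enum suggestion could look is sketched below. The names `JCollections` and `resolve` are hypothetical. The `CarCollection` class path comes from the diff above; the `TrecCollection` entry is an assumed second member added only to illustrate the pattern. In the real wrapper, the resolved string would presumably be passed to `autoclass`.

```python
from enum import Enum


class JCollections(Enum):
    """Hypothetical enum: each value is a fully qualified Java class name."""

    CarCollection = "io.anserini.collection.CarCollection"  # taken from the diff
    TrecCollection = "io.anserini.collection.TrecCollection"  # assumed, for illustration


def resolve(name):
    """Hypothetical helper: map a collection name to its Java class name."""
    try:
        return JCollections[name].value
    except KeyError:
        valid = ", ".join(member.name for member in JCollections)
        raise ValueError(
            f"Unknown collection {name!r}; expected one of: {valid}"
        ) from None
```

Compared with one module-level variable per collection, an enum gives a single place to list supported collections and a clear error when a caller passes an unknown name.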

@@ -0,0 +1,54 @@
import jnius_config

@lintool

lintool May 13, 2019

Member

And you're planning on moving common code to something like src/main/python/io/anserini/collection/? We probably want to mirror the Java package structure?

@@ -0,0 +1,40 @@
### Segmenting collection

@lintool

lintool May 13, 2019

Member

Let's call this passage_retrieval?
So move everything to src/main/python/passage_retrieval/?

emmileaf added some commits May 13, 2019

@emmileaf

Collaborator Author

commented May 14, 2019

PR changes:

collection = Collection(collection_class, input_path)

for (i, fs) in enumerate(collection.segments):
    for (j, doc) in enumerate(fs):
        foo(doc)
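For readers without the Anserini jar at hand, the iteration pattern can be sketched with pure-Python stand-ins. Everything here is hypothetical: `FileSegment`, the `segment_data` argument, and `walk` are illustrative names, and the in-memory dictionary replaces the Pyjnius calls that the real wrapper makes into the JVM.

```python
class FileSegment:
    """Stand-in for one file segment; iterating over it yields documents."""

    def __init__(self, path, docs):
        self.path = path
        self._docs = docs

    def __iter__(self):
        return iter(self._docs)


class Collection:
    """Stand-in for the wrapper; exposes an iterable `segments` attribute."""

    def __init__(self, collection_class, input_path, segment_data):
        self.collection_class = collection_class
        self.input_path = input_path
        # segment_data is a hypothetical {file name: [documents]} mapping
        # used here in place of discovering segments through the JVM.
        self.segments = [
            FileSegment(f"{input_path}/{name}", docs)
            for name, docs in segment_data.items()
        ]


def walk(collection):
    """Visit every document, recording (segment index, doc index, doc)."""
    seen = []
    for i, fs in enumerate(collection.segments):
        # A separate inner index keeps the segment index i intact.
        for j, doc in enumerate(fs):
            seen.append((i, j, doc))
    return seen
```

Keeping `segments` and each segment plain iterables means callers only need nested `for` loops, regardless of how the documents are produced underneath.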

passage_retrieval/collection is what I can move into src/main/python/io/anserini/collection/ later. In terms of mirroring the Java package structure, are we expecting other subdirectories to fall under src/main/python/io/anserini/ in the future? @lintool

logger.info("Total duration: %s", str(datetime.timedelta(seconds=elapsed)))


#IterCollection('C:/cygwin64/home/Emily/usra/collection/disk45',

@lintool

lintool May 14, 2019

Member

probably want to remove this?

@lintool

Member

commented May 14, 2019

> passage_retrieval/collection is what I can move into src/main/python/io/anserini/collection/ later. In terms of mirroring the Java package structure, are we expecting other subdirectories to fall under src/main/python/io/anserini/ in the future? @lintool

Yes, please move it over. I anticipate that we'll gradually mirror and "export" Java classes as Python gets more use. For example, we probably want to expose the topic readers in Python too, and I can imagine doing learning-to-rank in Python as well.

Let's do the src/main/python/io/anserini/collection/ move, and then we should merge this.

@emmileaf emmileaf changed the title from "Iterating over collections and segmenting documents with Python" to "Python collections wrapper and passage retrieval setup" May 15, 2019

@lintool lintool merged commit 284d019 into castorini:master May 15, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed