Possibility to ignore certain WoS-tags #143

Open
ghost opened this issue May 18, 2016 · 5 comments

ghost commented May 18, 2016

Hi,

First of all, thanks for this amazing Python package.

The corpus object in my analysis gets really huge. For example, I don't need funding information, email addresses, and a few other fields. Is it possible to ignore certain Web of Science field tags when populating the Paper objects?

Like this?
ignore_tags = ['EM', 'FU']

@erickpeirson
Collaborator

@Epipremnum Yeah, that's a great idea. Just the last few days I have been working on "streaming" representations of corpora and papers (i.e. on disk, in a database) to cut down on memory overhead -- the logic is basically as you describe, to pass over the metadata records once and load into memory only the immediately-needed fields. So your suggestion is a logical continuation of that line of work. I'll keep this thread up to date as we work on it!
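
To make the "pass over the records once, keep only the needed fields" idea concrete, here is a minimal sketch of a streaming reader. It is not tethne's implementation. The function name `stream_records` and the sample data are made up for illustration, and the sketch assumes the usual WoS plain-text layout: a two-character field tag at the start of a line, indented continuation lines, and `ER` closing each record.

```python
from typing import Dict, Iterable, Iterator, List, Set


def stream_records(lines: Iterable[str],
                   wanted: Set[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per record, holding only the wanted field tags.

    Hypothetical helper, not part of tethne. Assumes a two-character tag
    at the start of a line, indented continuation lines, and 'ER' as the
    end-of-record marker.
    """
    record: Dict[str, List[str]] = {}
    tag = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank separator between records
        if not line[0].isspace():
            # A new field starts here: tag in the first two columns.
            tag, value = line[:2], line[3:].strip()
        else:
            # Indented line: continuation of the previous field.
            value = line.strip()
        if tag == "ER":
            yield record  # hand the record over, then start a fresh one
            record, tag = {}, None
            continue
        if tag in wanted:
            record.setdefault(tag, []).append(value)


# Made-up sample data for illustration only.
SAMPLE = """FN Example export
VR 1.0
PT J
AU Doe, J
   Roe, R
TI An example title
EM doe@example.org
FU Some funder
PY 2016
ER

EF
"""

for rec in stream_records(SAMPLE.splitlines(), {"AU", "TI", "PY"}):
    print(rec)
```

Because this is a generator, only one record is held in memory at a time, and unwanted fields such as `EM` and `FU` are never stored at all.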

Thanks for using tethne -- it would be good to hear more about your use-case, if you're willing to share. :-)

erickpeirson added this to the v0.8-beta milestone May 18, 2016
erickpeirson self-assigned this May 18, 2016

ghost commented May 18, 2016

Hi Erick,

Thanks for the quick response. That would be great.

For now, I wrote a script to preprocess my bibliography files: I copied all lines corresponding to the tags I wanted to keep into a new file. Tethne then gets only the lines I am interested in as input. It works, but a lot of memory is still used.
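
The script itself was not posted in this thread, so what follows is only a rough sketch of what such a preprocessing step could look like. The names `filter_wos_lines`, `filter_file` and `STRUCTURAL` are hypothetical. It assumes the field-tagged layout (tag in the first two columns, indented continuation lines), and that the structural tags `FN`, `VR`, `PT`, `ER` and `EF` should always be kept so that the output still looks like a valid export.

```python
from typing import Iterable, Iterator, Set

# Assumption: these tags delimit the file and its records, so they are
# always copied regardless of the keep list.
STRUCTURAL = {"FN", "VR", "PT", "ER", "EF"}


def filter_wos_lines(lines: Iterable[str], keep_tags: Set[str]) -> Iterator[str]:
    """Copy through only the lines belonging to kept or structural tags."""
    keep = True
    for line in lines:
        if not line.strip():
            yield line  # preserve blank separator lines
            continue
        if not line[0].isspace():
            # New field: decide once, and let continuation lines follow it.
            tag = line[:2]
            keep = tag in STRUCTURAL or tag in keep_tags
        if keep:
            yield line


def filter_file(src: str, dst: str, keep_tags: Set[str]) -> None:
    """Write a filtered copy of src to dst (paths are the caller's choice)."""
    # utf-8-sig drops a leading byte-order mark if the export has one.
    with open(src, encoding="utf-8-sig") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(filter_wos_lines(fin, keep_tags))
```

Usage would be something like `filter_file('export.txt', 'export_small.txt', {'AU', 'TI', 'PY'})`, with both file names being placeholders. This only shrinks the input file. It does not change how the reader holds the parsed corpus in memory, which fits with memory use staying high after preprocessing.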

Thank you.

@erickpeirson
Collaborator

This will be TETHNE-124.

@erickpeirson
Collaborator

@Epipremnum On the develop branch I have added a parameter called parse_only to the WoS and DfR readers. If you have a moment, it would be great to hear whether or not this addresses your need.

The relevant tests are here: https://github.com/diging/tethne/blob/develop/tethne/tests/test_readers_parseonly.py

@erickpeirson
Collaborator

This is now in v0.8.1.dev2, which can be installed via pip with --pre:

$ pip install -U tethne --pre

Example:

>>> from tethne.readers.wos import read
>>> corpus = read('/path/to/my/data', parse_only=['title', 'date'])
