Run xmltweet on the data set #15
What's the ratio of input size to output size?
It's pretty close to 1:1 actually, so I guess definitely not enough space. Here's an example. Raw:
Tokenized:
Yeah - we could do a subset, but we are definitely lean on space. There is D
@davclark new xmltweet.exe is now compiled and moved to /var/local/destress/scripts/
@anasrferreira Is this your version or the new official BIDMach version?
This is from the most recent BIDMach pull onto my mercury account.
Actually I think it does... I think it is in BIDMach/bin P
I announced this on Slack - but just so we're clear, there's 2TB now on /var/local
Running xmltweet on everything now.
Whee!
Ran in 115 minutes. The dictionaries etc. have been output to
Awesome! P
@davclark Do we have disk space to run the tokenizer at this point? I uploaded a script into process_data/tokenize_files.sh which can be run with minimal changes (just select the input and output folders).
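For the disk space question, a rough pre-flight check could look something like the sketch below. It is only an illustration, not part of tokenize_files.sh: the helper names `dir_size_bytes` and `has_room` are made up here, and the default ratio of 1.0 just reflects the roughly 1:1 input-to-output size mentioned in this thread.

```python
import os
import shutil


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked data is not counted twice.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def has_room(input_dir, output_dir, ratio=1.0):
    """True if the filesystem holding output_dir has at least
    ratio * (input size) bytes free. ratio=1.0 is an assumption
    based on the roughly 1:1 estimate above."""
    needed = dir_size_bytes(input_dir) * ratio
    free = shutil.disk_usage(output_dir).free
    return free >= needed
```

Calling `has_room(input_folder, output_folder)` with the same two folders selected in the script would give a quick yes/no before starting a long run; raising `ratio` a bit leaves headroom for dictionaries and other side outputs.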