
Run xmltweet on the data set #15

Closed
cdschillaci opened this issue Mar 4, 2015 · 12 comments
Comments

@cdschillaci

@davclark Do we have disk space to run the tokenizer at this point? I uploaded a script into process_data/tokenize_files.sh which can be run with minimal changes (just select the input and output folders).
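For readers without access to the repo, here is a hypothetical sketch of what a driver like process_data/tokenize_files.sh might do. The function name, the dry-run echo, and the `<input.xml> <output-prefix>` argument order are all assumptions, not the actual script; check the real script for the flags the compiled xmltweet.exe expects.

```shell
#!/bin/sh
# Hypothetical driver loop in the spirit of tokenize_files.sh.
# The xmltweet argument order is an assumption; verify against the binary.
tokenize_dir() {
    in_dir=$1    # input folder with raw .xml dumps
    out_dir=$2   # output folder for .imat/.sbmat files
    xmltweet=$3  # path to the compiled tokenizer binary
    mkdir -p "$out_dir"
    for f in "$in_dir"/*.xml; do
        [ -e "$f" ] || continue              # no .xml files: do nothing
        base=$(basename "$f" .xml)
        # Dry run: print the command; drop the echo to actually execute it.
        echo "$xmltweet" "$f" "$out_dir/$base"
    done
}

# e.g. tokenize_dir /path/to/raw /path/to/tokenized /var/local/destress/scripts/xmltweet.exe
```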

@davclark
Member

davclark commented Mar 4, 2015

What's the ratio of input size to output size?

@cdschillaci
Author

It's pretty close to 1:1 actually, so definitely not enough space. Here's an example.

Raw:

-rw-rw-r-- 1 schillaci schillaci 1.8G Mar 3 19:37 ba.xml

Tokenized:

-rw-rw-r-- 1 schillaci schillaci 1.6G Mar 3 19:38 ba.xml.imat
-rw-rw-r-- 1 schillaci schillaci 27M Mar 3 19:38 ba_dict.imat
-rw-rw-r-- 1 schillaci schillaci 134M Mar 3 19:38 ba_dict.sbmat
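Given that near-1:1 ratio, a quick pre-flight check can estimate whether a target disk has room before kicking off the tokenizer. This is a generic sketch; the ~10% margin for the dictionary files is an assumption based on the sizes above, not a project setting.

```shell
#!/bin/sh
# Pre-flight disk check assuming tokenized output is ~1:1 with the input,
# plus a ~10% margin for the *_dict.imat / *_dict.sbmat files (assumed).
check_space() {
    input_dir=$1
    input_kb=$(du -sk "$input_dir" | cut -f1)            # total input size (KiB)
    needed_kb=$(( input_kb + input_kb / 10 ))            # 1:1 output + 10% dicts
    avail_kb=$(df -Pk "$input_dir" | awk 'NR==2 {print $4}')  # free space (KiB)
    if [ "$avail_kb" -lt "$needed_kb" ]; then
        echo "NOT ENOUGH: need ~${needed_kb}K, have ${avail_kb}K"
        return 1
    fi
    echo "OK: need ~${needed_kb}K, have ${avail_kb}K"
}

# e.g. check_space /path/to/raw
```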

@davclark
Member

davclark commented Mar 4, 2015

Yeah - we could do a subset, but we are definitely lean on space. There is
a 3TB drive coming via FedEx...

D


@anasrferreira
Contributor

@davclark new xmltweet.exe is now compiled and moved to /var/local/destress/scripts/
@cdschillaci tokenize_files.sh has been updated with new xmltweet.exe path.

@cdschillaci
Author

@anasrferreira Is this your version or the new official BIDMach version?

@anasrferreira
Contributor

This is from the most recent BIDMach pull onto my mercury account.
BIDMach doesn't come with xmltweet.exe. It needs to be compiled.

@peparedes
Contributor

Actually, I think it does... I believe it's in BIDMach/bin.

P


@davclark
Member

I announced this on Slack, but just so we're clear: there's 2TB now on /var/local.

@coryschillaci
Contributor

Running xmltweet on everything now.

@davclark
Member

Whee!


@coryschillaci
Contributor

Ran in 115 minutes. The dictionaries etc. have been output to /var/local/destress/tokenized.

@peparedes
Contributor

Awesome!

P
