Multi core support for training on large number of instances #67

Open
napsternxg opened this issue Jun 6, 2016 · 19 comments

@napsternxg

I think CRFSuite can be optimized to utilize the multiple cores available on all machines these days. A simple fix I thought of was parallelizing the score computation in the for loop of encoder_objective_and_gradients_batch, especially at the line

for (i = 0;i < N;++i) {

An additional dependency might be needed if we want to use a multiprocessing library like OpenMP to implement this feature, which could be switched on or off using a flag.

Some API changes might also be needed in order to ensure the proper aggregation of results from each of the parallel jobs.
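
For illustration, here is a minimal standalone sketch (not CRFsuite's actual code; instance_score, N, and K are hypothetical placeholders for the per-instance work inside encoder_objective_and_gradients_batch) of how such a batch loop could be parallelized with OpenMP, including the per-thread gradient aggregation:

    /* Rough sketch only; compile with e.g. gcc -std=c99 -fopenmp sketch.c */
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical per-instance work: returns this instance's log-likelihood
       term and adds its gradient contribution into grad[0..K-1]. */
    static double instance_score(int i, double *grad, int K)
    {
        double logl = -(double)(i % 7);        /* placeholder computation */
        for (int k = 0; k < K; ++k)
            grad[k] += (double)((i + k) % 3);  /* placeholder gradient */
        return logl;
    }

    int main(void)
    {
        const int N = 100000;   /* number of training instances */
        const int K = 64;       /* number of model parameters */
        double *grad = calloc(K, sizeof(double));
        double objective = 0.0;

        /* Each thread accumulates into a private gradient buffer; the buffers
           are merged afterwards, and the objective is summed via an OpenMP
           reduction. This is the aggregation step mentioned above. */
        #pragma omp parallel reduction(+:objective)
        {
            double *local = calloc(K, sizeof(double));
            #pragma omp for schedule(dynamic)
            for (int i = 0; i < N; ++i)
                objective += instance_score(i, local, K);

            #pragma omp critical
            for (int k = 0; k < K; ++k)
                grad[k] += local[k];
            free(local);
        }

        printf("objective = %g, grad[0] = %g\n", objective, grad[0]);
        free(grad);
        return 0;
    }

Whether this actually pays off depends on how much per-instance state the real loop shares, which is presumably where the API changes mentioned above would come in.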

I would love to have feedback on this and to know if anyone else is working on such a patch.

@napsternxg
Author

@kmike @chokkan @ogrisel What do you guys think about it?

@tianjianjiang

I just submitted pull request #68 for it, with the different loops annotated.

@usptact

usptact commented Jul 4, 2016

@tianjianjiang With all due respect to the author of CRFSuite (who did a really great job), it would take a while to get your improvement merged in. Perhaps your best bet would be to fork the project and work there. Thanks for your contribution.

@bratao

bratao commented Jul 4, 2016

@usptact, I think he already did: https://github.com/tianjianjiang/crfsuite-openmp

@tianjianjiang

@usptact Not a problem at all.
@bratao Thanks for the clarification.

In fact, it's rather a good idea to wait for a while. I've noticed that on different OSes with different compilers and on certain data sets, the calculation can be inefficient or even hang (0% CPU time).

@tianjianjiang

Pull request #68 has just been updated to improve the performance. It finally seems faster than the original version now.

@napsternxg
Author

@tianjianjiang thanks for the work. Can you add some test scripts for benchmarking the performance? An IPython notebook would be a very good option.

@CSabty

CSabty commented Feb 18, 2017

Hi, I am new to the field of multiprocessing and I just want to know how to run CRFsuite with the OpenMP library, since without it, it is extremely slow for big data sets.
Thank you in advance

@usptact

usptact commented Feb 18, 2017

@CSabty If you need speed for learning from very large datasets, please take a look at Wapiti or use Vowpal Wabbit in learning-to-search mode. I use the latter when I need to train an NER model very quickly.

@bratao

bratao commented Feb 18, 2017

@usptact could you please share the command line you used for NER with Vowpal? I was never able to come up with a working command line for tagging.

@usptact

usptact commented Feb 18, 2017

@bratao Sure, here you go:

vw  --data train.feat \
    --learning_rate 0.5 \
    --cache --kill_cache \
    --threads \
    --passes 10 \
    --search_task sequence \
    --search $NUM_LABELS \
    --search_rollin=policy \
    --search_rollout=none \
    --named_labels "$(< labels)" \
    -b 28 \
    --l1=1e-7 \
    -f $MODEL \
    --readable_model $MODEL.txt

You will need the training file ("train.feat") in multi-line format (see the docs) and a file "labels" with string labels, which are BIO tags in my case. If there are only a few, you can list the tags as a comma-separated list on the command line.
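
For illustration only (the feature names here are made up; see the VW search tutorial for the exact format), train.feat would have one token per line with a blank line between sentences, and labels would hold the comma-separated tag set that "$(< labels)" expands to:

    B-PER | w=John low=john suf3=ohn cap
    I-PER | w=Smith low=smith suf3=ith cap
    O | w=visited low=visited suf3=ted
    B-LOC | w=Paris low=paris suf3=ris cap

    O | w=The low=the suf3=the cap
    B-ORG | w=company low=company suf3=any

and labels:

    B-PER,I-PER,B-LOC,B-ORG,O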

@CSabty

CSabty commented Feb 20, 2017

@usptact Thank you so much for your reply; I am working on NER training as well. Do you think Wapiti or Vowpal Wabbit are better in performance (speed-wise) than CRF++? I was planning to use CRF++ with multi-core support because I feel it has more resources online and is maybe simpler compared to the other ones.

@usptact

usptact commented Feb 20, 2017

@CSabty In my experience, performance-wise, the CRF is still the best, although I did not do a thorough comparison.

@yiqingyang2012

@usptact

You will need the training file ("train.feat") in multi-line format (see doc) and a file "labels" with string labels that are BIO tags (in my case). If there are only few, you can list the tags as comma-separated list in console.

For a POS tagging task, can I use the same features with Vowpal Wabbit as with CRFsuite? In a CRFsuite training dataset a feature can be followed by a ":" and then a float scaling value, but it seems like the ":" is used to set the feature value rather than the feature importance in Vowpal Wabbit.

It's too painful to use Vowpal Wabbit; have you written any blog posts about sequence search?
Thanks!

@usptact

usptact commented Jun 29, 2017

In both CRFsuite and VW, the ":" character is special. In the former you can escape it like this: "\:", but in the latter you can't. This assumes you don't want to change the default weight of 1.0.
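
For example (made-up attribute names): in CRFsuite training data the attribute w[0]=city:0.5 scales w[0]=city by 0.5, while time=12\:30 stays a single attribute with the default weight of 1.0; in VW a feature written as w=city:0.5 always means "feature w=city with value 0.5", and there is no way to escape the colon in the feature name.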

@jbkoh

jbkoh commented Sep 21, 2017

I wonder whether this multicore CRF development is dead or not. I am dying for such a feature.

@usptact

usptact commented Sep 22, 2017

@jbkoh If you are looking for multi-CPU training of CRFs, take a look at https://github.com/zhongkaifu/CRFSharp

@tianjianjiang

In my experience, CRFsuite and libLBFGS are not OpenMP friendly. Of course there are other ways to add multi-core support, but OpenMP might even require fundamental changes in CRFsuite, which is probably an unacceptable cost.

@jbkoh

jbkoh commented Sep 22, 2017

@usptact @tianjianjiang Thanks for the information! I wish I could have exploited multiple cores with PyCRFSuite, but I can switch to the suggested library. Thank you all.
