Multi core support for training on large number of instances #67
Comments
I just submitted pull request #68 for it, with the modified loops annotated.
@tianjianjiang All due respect to the author of CRFSuite (who did a really great job), but it would take a while to get your improvement merged in. Perhaps the best bet for you would be to fork the project and work there. Thanks for your contribution.
@usptact, I think he already did: https://github.com/tianjianjiang/crfsuite-openmp
Pull request #68 has just been updated to improve the performance. It finally seems faster than the original version now.
@tianjianjiang Thanks for the work. Can you add some test scripts for benchmarking the performance? An IPython notebook would be a very good option.
Hi, I am new to the field of multiprocessing, and I just want to know how to run CRFsuite with OpenMP; without it, it is extremely slow for big data sets.
@CSabty If you need speed for learning from very large datasets, please take a look at Wapiti, or use Vowpal Wabbit in learning-to-search mode. I use the latter when I need to train an NER model very quickly.
@usptact Could you please share the command line you used for NER with Vowpal? I was never able to come up with a working command line for tagging.
@bratao Sure, here you go:
You will need the training file ("train.feat") in multi-line format (see the docs) and a file "labels" with the string labels, which are BIO tags in my case. If there are only a few, you can list the tags as a comma-separated list on the command line.
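The command line itself did not survive the thread formatting, so a plausible sketch of a Vowpal Wabbit learning-to-search invocation for sequence tagging follows. The label count (9) and output filename are assumptions, not the original poster's values:

```shell
# Hypothetical sketch: train a sequence tagger with VW's
# learning-to-search mode. "train.feat" is in VW multi-line format;
# --search 9 assumes 9 distinct BIO labels (an assumption).
vw --search 9 --search_task sequence \
   -d train.feat -c --passes 10 \
   -b 26 -f ner.model
```

The `--search_task sequence` reduction handles the per-token prediction loop; `-c --passes 10` enables multiple passes over a cached dataset.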
@usptact Thank you so much for your reply; I am working on NER training as well. Do you think Wapiti or Vowpal Wabbit perform better (speed-wise) than CRF++? I was planning to use CRF++ with multi-core support because it seems to have more resources online, and it may be simpler than the others.
@CSabty In my experience, performance-wise, CRF++ is still the best, although I did not do a thorough comparison.
In a POS task, can I use the same features with Vowpal Wabbit as with CRFsuite? In a CRFsuite training dataset, a feature can be followed by ":" and a float scaling value, but in Vowpal Wabbit the ":" seems to set the feature value rather than the feature importance. It is too painful to use Vowpal Wabbit; have you written any blog posts about sequence search?
In both CRFsuite and VW, the ":" character is special. In the former you can escape it like this: "\:", but in the latter you cannot. This assumes you don't want to change the default weight of 1.0.
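To make the difference concrete, here is a small hypothetical helper (the function names are mine, not from either tool) that prepares the same feature name for both formats, following the escaping rules described above:

```python
# In CRFsuite data, "name:2.0" means feature "name" scaled by 2.0, and a
# literal colon inside a feature name must be escaped as "\:".
# In VW, "name:2.0" means feature "name" with VALUE 2.0, and there is no
# escape mechanism, so colons in names must be replaced outright.

def crfsuite_feature(name, scale=None):
    escaped = name.replace(":", "\\:")
    return escaped if scale is None else f"{escaped}:{scale}"

def vw_feature(name, value=None):
    cleaned = name.replace(":", "_")  # no escaping in VW; substitute
    return cleaned if value is None else f"{cleaned}:{value}"

print(crfsuite_feature("w[0]=3:30pm", 2.0))  # w[0]=3\:30pm:2.0
print(vw_feature("w[0]=3:30pm"))             # w[0]=3_30pm
```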
I wonder whether this multicore CRF development is dead or not. I am dying for such a feature.
@jbkoh If you are looking for multi CPU training of CRFs, take a look at https://github.com/zhongkaifu/CRFSharp |
In my experience, CRFsuite and libLBFGS are not OpenMP-friendly. Of course there are other ways to add multi-core support, but OpenMP might require fundamental changes to CRFsuite, which is probably an unacceptable cost.
@usptact @tianjianjiang Thanks for the information! I wish I could have exploited multiple cores with PyCRFSuite, but I can switch to the suggested alternative. Thank you all.
I think CRFSuite can be optimized to utilize the multiple cores available on all machines these days. A simple fix I thought of is parallelizing the score computation in the for loop of `encoder_objective_and_gradients_batch`, specifically at crfsuite/lib/crf/src/crf1d_encode.c, line 825 (commit 8c0028c).
An additional dependency would be introduced if we use a multithreading library like OpenMP to implement the feature, which could be switched on or off with a build flag.
Some API changes might also be needed in order to ensure the proper aggregation of results from each of the parallel jobs.
I would love to have feedback on this, and to know whether anyone else is working on such a patch.