
EXP on [ "_Naive", "_Smote", "_TunedLearner , "_TunedSmote"] #9

Closed
WeiFoo opened this issue Nov 30, 2015 · 7 comments

WeiFoo (Contributor) commented Nov 30, 2015

Run

learners = ['naive_bayes']
methods = ["_Naive", "_Smote", "_TunedLearner", "_TunedSmote"]
for feature_num in [100, 400, 700, 1000]:
    for l in learners:
        for m in methods:
            seed(1)
            ten_folds_cross_valuation(l, m)

"_TunedLearner" and"_TunedSmote" happened in each fold

SMOTE params:

  • neighbors: k = [2, 15], default is 5
  • over-sampling size for each minority class (there are multiple classes): num = [10, max_num_majority].
    In the SMOTE paper, num can be 2x~5x the original size; here I set the range from 10 to the number of majority-class instances.

NB params:

  • alpha: [0.0, 1.0], additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing); default: 1.0
  • fit_prior: [False, True], whether to learn class prior probabilities; if False, a uniform prior is used (see the sketch below)
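
The alpha and fit_prior descriptions above match scikit-learn's MultinomialNB, so, assuming that is the learner being tuned, a candidate maps onto the learner roughly like this (a sketch, not necessarily the exact code):

# Assuming scikit-learn's MultinomialNB; alpha and fit_prior above match its
# parameters. A tuned candidate (alpha, fit_prior) builds the learner like so.
from sklearn.naive_bayes import MultinomialNB

def build_nb(alpha=1.0, fit_prior=True):
    # alpha in [0.0, 1.0], fit_prior in {False, True}
    return MultinomialNB(alpha=alpha, fit_prior=fit_prior)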

DE:

  • np = 10
  • cr = 0.3
  • f = 0.75
  • life = 5
  • max_repeats = 50 (iterates at most 50 times, even if life > 0; see the sketch below)
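
For concreteness, here is a minimal sketch of a DE loop with these settings tuning the SMOTE parameters on one fold. It is a generic DE/rand/1/bin sketch under my own assumptions, not the actual implementation: smote() and weighted_f1() are hypothetical helpers, build_nb() is the sketch from above, and bounds is a list of (low, high) integer ranges, [(2, 15)] for k followed by (10, max_num_majority) for each minority class.

# Minimal DE sketch with the settings above (np=10, cr=0.3, f=0.75,
# life=5, max_repeats=50), tuning SMOTE parameters for one fold.
# smote() and weighted_f1() are hypothetical stand-ins for the real
# SMOTE implementation and the scorer.
import random

def de_tune_smote(train, tune, bounds, np_=10, cr=0.3, f=0.75,
                  life=5, max_repeats=50):
    dim = len(bounds)  # 1 (k) + one over-sampling size per minority class

    def clip(i, v):
        lo, hi = bounds[i]
        return min(hi, max(lo, int(round(v))))

    def evaluate(cand):
        # over-sample the training data, fit NB, score on the tuning set
        X, y = smote(train, k=cand[0], nums=cand[1:])
        model = build_nb().fit(X, y)
        return weighted_f1(model, tune)

    pop = [[random.randint(lo, hi) for lo, hi in bounds] for _ in range(np_)]
    scores = [evaluate(c) for c in pop]

    for _ in range(max_repeats):
        if life <= 0:  # stop early when the frontier stops improving
            break
        improved = False
        for i in range(np_):
            a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
            j_rand = random.randrange(dim)
            trial = [clip(j, a[j] + f * (b[j] - c[j]))
                     if random.random() < cr or j == j_rand else pop[i][j]
                     for j in range(dim)]
            s = evaluate(trial)
            if s > scores[i]:
                pop[i], scores[i] = trial, s
                improved = True
        life = life if improved else life - 1

    best = max(range(np_), key=lambda i: scores[i])
    return pop[best], scores[best]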

Dataset:

anime.txt

Job submitted to the HPC.

WeiFoo (Contributor, Author) commented Nov 30, 2015

Why is tune_smote so slow?

  • 1st: 10-fold cross-validation (Zhe used 25); we need to tune SMOTE on each fold's training data, so 10 times in total.
  • 2nd: in DE, the usual rule of thumb for the population size is 10 * num_variables. Here we have 28 variables (k, plus 27 over-sampling sizes for the minority classes), which would mean np = 280. That is not acceptable, because evaluating each candidate means calling SMOTE(), generating new training data, fitting the NB learner and predicting labels before we get a score; that is a lot of work and a lot of time (30~60 seconds per run, depending on data size). So I decided to use np = 10.
  • 3rd: if the frontier keeps improving, we have to run more than 5 generations, about 10 on average.

Theoretically, we need to run 10 x 10 x 10 = 1000 rounds of SMOTE + fitting the learner + predicting (at 30~60 seconds per evaluation, that is roughly 8~17 hours of tuning).

For HPC

Based on previous experience, jobs asking for 4 cores start running immediately, while jobs asking for 8 cores wait in the queue so long that I have to kill them.

rahlk commented Dec 1, 2015

Have you tried multiprocessing? It might be the answer to our time issues. Specifically:

1st: 10-fold cross-validation (Zhe used 25); we need to tune SMOTE on each fold's training data, so 10 times in total.

Try parallelizing the 10 cross-validation folds on 10 processes. Each fold can run concurrently, so this will produce a massive speedup (if you use the HPC, you may even get close to a 10x speedup).
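
A minimal sketch of that idea; run_one_fold() is a hypothetical wrapper around the per-fold work, not an existing function:

# Sketch: run the 10 cross-validation folds concurrently, one worker each.
# run_one_fold(fold_id) is a hypothetical function wrapping the per-fold
# work described above (split the data, tune SMOTE/NB with DE on the
# training part, evaluate on the held-out part).
from multiprocessing import Pool

def run_one_fold(fold_id):
    ...  # placeholder for the real per-fold work

if __name__ == "__main__":
    with Pool(processes=10) as pool:
        fold_results = pool.map(run_one_fold, range(10))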

  • 2nd: in DE, the usual rule of thumb for the population size is 10 * num_variables. Here we have 28 variables (k, plus 27 over-sampling sizes for the minority classes), which would mean np = 280. That is not acceptable, because evaluating each candidate means calling SMOTE(), generating new training data, fitting the NB learner and predicting labels before we get a score; that is a lot of work and a lot of time (30~60 seconds per run, depending on data size). So I decided to use np = 10.
  • 3rd: if the frontier keeps improving, we have to run more than 5 generations, about 10 on average.
    Theoretically, we need to run 10 x 10 x 10 = 1000 rounds of SMOTE + fitting the learner + predicting.

A parallel DE might be a better option; we have a working version of a parallel DE, and it is ridiculously fast on the HPC.
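
Not the parallel DE itself, but a sketch of the simplest version of the idea: scoring all np candidates of a generation in parallel, where evaluate() is the hypothetical SMOTE + fit-NB + score step from the earlier sketch:

# Sketch: score a whole DE generation in parallel instead of one candidate at
# a time. evaluate(candidate) must be a picklable (module-level) function for
# multiprocessing to work; with np = 10 this keeps a 10-core node busy.
from multiprocessing import Pool

def score_generation(candidates, evaluate, workers=10):
    with Pool(processes=workers) as pool:
        return pool.map(evaluate, candidates)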

Based on previous experience, jobs asking for 4 cores start running immediately, while jobs asking for 8 cores wait in the queue so long that I have to kill them.

There are several workarounds for this.

  1. Check out bqueues -u <unity-id>; it tells you the various queues you can use. See below:
    [screenshot: bqueues output (Nov 30, 2015, 11:16 pm) listing the available queues]
  2. Use reasonable wait times; -W 6000 is too high.

Combine Multiprocessing with HPCs to get massive speedups.

WeiFoo (Contributor, Author) commented Dec 1, 2015

@rahlk Thanks a lot! I will switch to multiprocessing and modify my code accordingly. Very good suggestion!
@rahlk One more thing: I already use the mpi4py module to parallelize my Python code. Would that cause trouble or low-level conflicts with the multiprocessing module? Any ideas?

WeiFoo added the results label Dec 1, 2015
WeiFoo (Contributor, Author) commented Dec 1, 2015

Results

rank ,                                          name ,    med   ,  iqr 
----------------------------------------------------
   1 ,                    NB_Naive_100_mean_weighted ,      44  ,    13 (-*       --    |              ), 0.42,  0.43,  0.44,  0.56,  0.59
   2 ,                   NB_Naive_1000_mean_weighted ,      59  ,     7 (       -- *  --|-             ), 0.53,  0.57,  0.59,  0.63,  0.69
   2 ,                    NB_Naive_400_mean_weighted ,      60  ,     8 (        -  *  -|--            ), 0.54,  0.57,  0.60,  0.64,  0.70
   2 ,                    NB_Naive_700_mean_weighted ,      63  ,     3 (             *-|---           ), 0.59,  0.60,  0.63,  0.63,  0.71
   2 ,               NB_TunedSmote_100_mean_weighted ,      65  ,     8 (       ----    * -            ), 0.53,  0.60,  0.66,  0.69,  0.70
   2 ,                    NB_Smote_100_mean_weighted ,      65  ,     9 (     -------   *-             ), 0.50,  0.61,  0.66,  0.68,  0.69
   2 ,             NB_TunedLearner_100_mean_weighted ,      67  ,     7 (     --------- |*--           ), 0.50,  0.65,  0.67,  0.69,  0.72
   3 ,                    NB_Smote_400_mean_weighted ,      79  ,     2 (               |  ----- *     ), 0.70,  0.79,  0.80,  0.80,  0.81
   3 ,               NB_TunedSmote_400_mean_weighted ,      79  ,     3 (               |    --- *     ), 0.73,  0.79,  0.80,  0.80,  0.81
   3 ,             NB_TunedLearner_400_mean_weighted ,      80  ,     1 (               |   ----- *    ), 0.72,  0.80,  0.81,  0.82,  0.83
   4 ,               NB_TunedSmote_700_mean_weighted ,      82  ,     1 (               |       -- *   ), 0.78,  0.82,  0.83,  0.84,  0.85
   4 ,                    NB_Smote_700_mean_weighted ,      84  ,     2 (               |       -- *   ), 0.78,  0.82,  0.84,  0.84,  0.84
   4 ,              NB_TunedSmote_1000_mean_weighted ,      84  ,     2 (               |       --- *  ), 0.78,  0.83,  0.84,  0.85,  0.86
   4 ,             NB_TunedLearner_700_mean_weighted ,      85  ,     2 (               |        ---*- ), 0.80,  0.85,  0.85,  0.87,  0.87
   4 ,                   NB_Smote_1000_mean_weighted ,      84  ,     2 (               |        ---*- ), 0.79,  0.84,  0.85,  0.86,  0.87
   4 ,            NB_TunedLearner_1000_mean_weighted ,      86  ,     1 (               |       -----* ), 0.79,  0.86,  0.86,  0.88,  0.89



NB_TunedSmote_1000_mean_weighted means:

  • learner is Naive Bayes
  • SMOTE is tuned
  • 1000 features are selected
  • the goal is the weighted mean of the F-measure (we have multiple classes)

Time:
it takes 26 hours!

Observations

  • Tuning the learner is always better in this limited experiment
  • Tuned SMOTE performs about the same as plain SMOTE
  • The naive learner is the worst

azhe825 commented Dec 1, 2015

0.8 is great. Two things: 1. I prefer the unweighted mean of the F-measure; 2. do you have the results for the over-sampling rates after tuning?

rahlk commented Dec 1, 2015

@WeiFoo With multiprocessing? Or without? Also, Parallel DE?

WeiFoo (Contributor, Author) commented Dec 3, 2015

I think we have discussed this well, @rahlk @azhe825, so I will close it.

WeiFoo closed this as completed Dec 3, 2015