
Generate a model file and reuse model to classify new samples (eg streaming big data) #30

Closed
giorgio79 opened this issue Jan 13, 2016 · 6 comments

@giorgio79

Can auto-sklearn generate a model file that can be reused to classify new data? This would be useful for classifying big data streams.

mfeurer commented Jan 13, 2016

Yes, that would be useful, but so far it can't. What you can do is use show_models(), which outputs something like:

[(weight, constructor),
 (weight, constructor)]

which describes the final ensemble. You can use that to retrain the models on your full data and pickle the result in your own code.
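A minimal sketch of that suggestion, assuming scikit-learn-style estimators: rebuild the ensemble from the (weight, constructor) pairs, refit on the full data, and pickle it. The WeightedEnsemble class and the stand-in estimators are illustrative, not part of auto-sklearn.

```python
import pickle
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

class WeightedEnsemble:
    """Soft-voting ensemble over (weight, estimator) pairs."""
    def __init__(self, weighted_models):
        self.weighted_models = weighted_models

    def fit(self, X, y):
        for _, model in self.weighted_models:
            model.fit(X, y)
        return self

    def predict(self, X):
        # weighted sum of class probabilities, then argmax
        votes = sum(w * m.predict_proba(X) for w, m in self.weighted_models)
        return np.argmax(votes, axis=1)

# (weight, constructor) pairs, as reported by show_models();
# plain scikit-learn estimators stand in for the actual pipelines
ensemble = WeightedEnsemble([
    (0.6, RandomForestClassifier(n_estimators=100, random_state=0)),
    (0.4, LogisticRegression(max_iter=1000)),
])

X, y = make_classification(n_samples=200, random_state=0)
ensemble.fit(X, y)
model_blob = pickle.dumps(ensemble)  # persist the retrained ensemble
```

The pickled blob can then be loaded in a separate process to classify new samples without rerunning the search.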

@giorgio79

It looks like scikit-learn relies on external libraries for model persistence:
http://stackoverflow.com/questions/10592605/save-classifier-to-disk-in-scikit-learn
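The linked answer boils down to two standard options, sketched here with a stand-in LogisticRegression model:

```python
# Persisting a trained scikit-learn model: pickle (stdlib) or joblib.
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Option 1: plain pickle (in-memory round trip shown here)
blob = pickle.dumps(clf)
restored = pickle.loads(blob)

# Option 2: joblib, more efficient for models holding large numpy arrays
# from joblib import dump, load
# dump(clf, "model.joblib"); restored = load("model.joblib")
```

joblib is generally preferred for estimators that carry large numpy arrays internally, such as forests.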

Motorrat commented May 6, 2016

Is there a simple programmatic way to convert the output of show_models() into a string that can be used to construct the classifiers in code? Currently it comes out as:

(0.040000, SimpleClassificationPipeline(configuration={
  'balancing:strategy': 'weighting',
  'classifier:__choice__': 'random_forest',
  'classifier:random_forest:bootstrap': 'False',
  'classifier:random_forest:criterion': 'entropy',
  'classifier:random_forest:max_depth': 'None',
  'classifier:random_forest:max_features': 1.6519823800472522,
  'classifier:random_forest:max_leaf_nodes': 'None',
  'classifier:random_forest:min_samples_leaf': 14,
  'classifier:random_forest:min_samples_split': 13,
  'classifier:random_forest:min_weight_fraction_leaf': 0.0,
  'classifier:random_forest:n_estimators': 100,
  'imputation:strategy': 'mean',
  'one_hot_encoding:use_minimum_fraction': 'False',
  'preprocessor:__choice__': 'no_preprocessing',
  'rescaling:__choice__': 'min/max'})),
(0.040000, SimpleClassificationPipeline(configuration={
  'balancing:strategy': 'weighting',
  'classifier:__choice__': 'sgd',
  'classifier:sgd:alpha': 8.157889958167601e-05,
  'classifier:sgd:average': 'False',
  'classifier:sgd:eta0': 0.042599381735495594,
  'classifier:sgd:fit_intercept': 'True',
  'classifier:sgd:learning_rate': 'optimal',
  'classifier:sgd:loss': 'perceptron',
  'classifier:sgd:n_iter': 25,
  'classifier:sgd:penalty': 'l2',
  'imputation:strategy': 'median',
  'one_hot_encoding:minimum_fraction': 0.040130045634589266,
  'one_hot_encoding:use_minimum_fraction': 'True',
  'preprocessor:__choice__': 'no_preprocessing',
  'rescaling:__choice__': 'normalize'})),
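One ad-hoc approach (a sketch, not an auto-sklearn API): scrape the printed show_models() text with a regex to recover each member's weight and chosen classifier. This only extracts two fields; it does not reconstruct full pipeline objects.

```python
import re

# Abbreviated sample of the show_models() output quoted above
shown = """(0.040000, SimpleClassificationPipeline(configuration={
  'classifier:__choice__': 'random_forest',
  'classifier:random_forest:n_estimators': 100})),
(0.040000, SimpleClassificationPipeline(configuration={
  'classifier:__choice__': 'sgd',
  'classifier:sgd:loss': 'perceptron'})),"""

# capture the ensemble weight and the classifier choice per member
pattern = re.compile(
    r"\((\d+\.\d+), SimpleClassificationPipeline.*?"
    r"'classifier:__choice__': '(\w+)'",
    re.DOTALL,
)
members = [(float(w), name) for w, name in pattern.findall(shown)]
print(members)  # [(0.04, 'random_forest'), (0.04, 'sgd')]
```

Extending the regex to capture the remaining hyperparameter keys would be needed before the configurations could actually be re-instantiated.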

mfeurer commented May 6, 2016

Have a look at this.

Motorrat commented May 6, 2016

Also, show_models() can be very slow and use a lot of memory; in my case it takes tens of minutes and tens of GB.

Instead I am using:

ats=salted_temp_dir_of_autosklearn
for quality in $(grep obj $ats/log-run* | sed -e 's/^.*obj\ \(.*$\)/\1/' | sort | uniq | head -10); do
  grep final -A 1 $(grep -l "$quality" $ats/log-run* | sort | head -1)
done

to get the top 10 classifiers with the best scores from the log files, which takes virtually no time at all.
I wonder if there is a reason show_models() does what it does.
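The same log-scanning idea can be sketched in Python. Note the "obj <value>" line format here is an assumption inferred from the shell one-liner above, not a documented auto-sklearn log format.

```python
import glob
import re

def best_runs(log_glob, top_n=10):
    """Scan run logs matching log_glob for 'obj <value>' lines and
    return the top_n unique (score, path) pairs, lowest score first."""
    scores = []
    for path in glob.glob(log_glob):
        with open(path) as f:
            for line in f:
                m = re.search(r"obj\s+(\S+)", line)
                if m:
                    scores.append((float(m.group(1)), path))
    seen, best = set(), []
    for score, path in sorted(scores):  # assumes lower objective is better
        if score not in seen:
            seen.add(score)
            best.append((score, path))
    return best[:top_n]
```

This avoids reconstructing any model objects, which is presumably why it is so much faster than show_models().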

mfeurer commented Oct 17, 2016

The latest version of auto-sklearn features pickleable classifiers/regressors. If there is still an issue with model persistence, please open a new issue.
