
Number of features of the model must match the input #199

Closed
abhikjha opened this issue Mar 22, 2018 · 11 comments

@abhikjha commented Mar 22, 2018

Hi Aurelien,

I am facing a major issue when applying my classifier model to the test data for prediction. The error I am getting is "Number of features of the model must match the input". Could you please let me know how to resolve this issue?

Best Regards
Abhik

@ageron (Owner) commented Mar 25, 2018

Hi @abhikjha ,
This error says that the model was trained on a number of features n that is different from the number of features you are giving it at test time.
This could be due to a number of reasons, in particular:

  • if the test data has a different number of features than the training data;
  • if the preprocessing steps add a different number of features during training and testing.

A very common cause of the second case is calling fit_transform() on both the training set and the test set. For example, if you are using a CategoricalEncoder or a LabelBinarizer to encode a categorical feature, then when you call fit_transform() during training, it will see all possible categorical values and add one dummy variable per categorical value. If you call fit_transform() on the test data, and there are just a few instances, there may be fewer distinct categorical values in the test data, so the transformer will add fewer dummy variables. This would cause the error.

So make sure you call fit_transform() on the training data, but only transform() on the test data, never fit_transform().
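For instance, here is a minimal sketch of the failure mode and the fix (using OneHotEncoder, since CategoricalEncoder never made it into a stable scikit-learn release; the data is made up):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "red"]})  # fewer distinct categories

encoder = OneHotEncoder()

# Correct: fit on the training data only, then reuse the fitted encoder.
X_train = encoder.fit_transform(train)  # 3 columns: one per training category
X_test = encoder.transform(test)        # also 3 columns -> shapes are consistent

# Wrong: refitting on the test data only sees "red", so it produces
# 1 column instead of 3 -> "Number of features ... must match the input".
X_test_bad = encoder.fit_transform(test)
```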

If this does not explain your problem, please post more details, and the code you are trying to run.
Hope this helps,
Aurélien

@abhikjha (Author)

Hi Aurélien

Thanks very much for your prompt reply. I tried transform() only on the test data and fit_transform() on the training data, as you explained in your book, but I am still getting this error.

The process I followed earlier was as follows:

  1. Split the data into train and test sets using StratifiedShuffleSplit
  2. Separated the features and labels in both the train and test data
  3. Applied StandardScaler() and CategoricalEncoder() to the training data via Pipeline transformations (sketched below)
  4. Ran my ML models on the train data
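Concretely, the first process looked roughly like this (a runnable sketch; I am using ColumnTransformer and OneHotEncoder for illustration in place of the book's FeatureUnion + CategoricalEncoder setup, and the data and column names are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({
    "age":         [25, 32, 47, 51, 38, 29, 60, 44],
    "policy_type": ["term", "whole", "term", "term",
                    "whole", "term", "whole", "term"],
    "lapsed":      [0, 1, 0, 0, 1, 0, 1, 0],
})

# 1. Split into train and test sets with StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(split.split(data, data["lapsed"]))
train_set, test_set = data.iloc[train_idx], data.iloc[test_idx]

# 2. Separate features and labels
X_train, y_train = train_set.drop("lapsed", axis=1), train_set["lapsed"]
X_test, y_test = test_set.drop("lapsed", axis=1), test_set["lapsed"]

# 3. Fit the preprocessing pipeline on the training data only
full_pipeline = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(), ["policy_type"]),
])
X_train_prepared = full_pipeline.fit_transform(X_train)
X_test_prepared = full_pipeline.transform(X_test)  # transform() only!
```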

However, if I flip the process like this, I do not get any error, and the train and test data end up with the same number of columns (which is not the case with the method above):

  1. Separated the features and labels from the data
  2. Applied StandardScaler() and CategoricalEncoder() to the full data via Pipeline transformations
  3. Split this transformed data into train / test sets using train_test_split(), stratifying on the labels
  4. Ran my ML models on the train data

But the problem with the above method is that I now have predict_proba results, which I have converted into a pandas DataFrame, and I would like to combine them with my original test data, which I will later export to CSV. I need to show this to my management as the final result of my ML project. How can I get my original test data back (without standardization and one-hot encoding), with the column names intact?

I am really sorry for troubling you, and I look forward to your kind response on this query.

Best Regards
Abhik

@abhikjha (Author)

Apart from this, the sklearn Pipeline documentation does not list "transform" on its own as one of the methods. Does this mean it is always applying "fit_transform" even when I call "transform"?

http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

@ageron (Owner) commented Mar 26, 2018

Hi @abhikjha ,

The first process you describe looks good to me. The second should not be used (even though it works), because you are fitting on the test data, which means that whatever generalization error you measure on the test set at the end will be biased (too optimistic), which defeats the purpose of the test set.

So the correct option is to debug the first process. Could you please copy the code you are using here so I can take a look? If it is too large, you can use a gist.
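Regarding your question about combining the predict_proba results with the original test data: in the first process the raw test DataFrame is never modified, so you can line the probabilities up with it by index. A minimal sketch (clf, full_pipeline and X_test_raw are placeholder names):

```python
import pandas as pd

# X_test_raw: the untransformed test features (original column names intact)
# clf: a classifier fitted on the transformed training data
probas = clf.predict_proba(full_pipeline.transform(X_test_raw))
proba_df = pd.DataFrame(probas, columns=clf.classes_, index=X_test_raw.index)

# Because the raw test DataFrame was never modified, a simple concat
# aligns the probabilities with the original rows:
result = pd.concat([X_test_raw, proba_df], axis=1)
result.to_csv("predictions.csv", index=False)
```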

Cheers,
Aurélien

@abhikjha (Author)

Hi Aurélien

Thank you for your email and kind offer to review my code. However, before sending the code to you, I thought I would give it another try.

The first problem I faced when transforming was an error like "Found unknown categories during transform". I struggled with this for a bit, then saw that in the CategoricalEncoder code we need to set handle_unknown="ignore". At first I accidentally passed this to transform(), which unsurprisingly gave me an error again. Then I set it in the CategoricalEncoder step of the pipeline.
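In other words, handle_unknown is a constructor argument, not a transform() argument. A quick sketch of the difference (using OneHotEncoder, with made-up data):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_cat = pd.DataFrame({"policy_type": ["term", "whole", "term"]})
test_cat = pd.DataFrame({"policy_type": ["term", "universal"]})  # unseen category

# Correct: configure unknown-category handling when constructing the encoder.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cat)                 # learns the categories from training data
X_test = encoder.transform(test_cat)   # "universal" encodes as an all-zero row

# Wrong: transform() does not accept handle_unknown.
# encoder.transform(test_cat, handle_unknown="ignore")  # TypeError
```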

It all works smoothly now. The shapes of the transformed train and test sets match. :) Can't say how happy I am.

By the way, once I am done with everything, I will share my notebook with you, which is in HTML format. Can it be shared via a gist, or may I email it over to you?

Just to tell you about my project: I am building a predictive model that, based on some information, should predict which insurance policies will lapse and which will remain in force. This is my first ML project, and I have used all the concepts you taught in chapters 2 and 3. I cannot say how helpful these were to me.

Results of the model: very surprisingly, I got 98% accuracy with RandomForest; the F1 score is around 98% as well, and the ROC score is around 99%. This is all on the training set. Let's see if I get the same performance on the test set as well. Do the above scores look too good to be true?

Best Regards
Abhik

@ageron (Owner) commented Mar 26, 2018

Hi @abhikjha ,

Great news! I'm really glad you found my book helpful.

However, the performance on the training set may indeed be too good to be true, unfortunately... A RandomForest is a very powerful model: it can learn the training set "by heart" and give you amazing performance on the training set, yet fail to generalize well to new data. Try evaluating it on a validation set (or using cross-validation); if it still performs well, then it is indeed a good model. But if it performs poorly, you will need to regularize the RandomForest (and/or try other models).
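For example (a sketch; X_train_prepared and y_train are placeholders for your prepared training data and labels):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

forest = RandomForestClassifier(n_estimators=100, random_state=42)

# Each fold's score is measured on data the model was not trained on,
# so this estimates generalization rather than memorization:
scores = cross_val_score(forest, X_train_prepared, y_train, cv=5, scoring="f1")
print(scores.mean(), scores.std())
```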
You can definitely share the code using a gist, or simply create a GitHub project and push all your code to it.

Cheers,
Aurélien

@abhikjha (Author)

Thanks, Aurélien.

Just to let you know, my runs are finished on the training set - I used cross_val_score and cross_val_predict with 5 folds and got the same accuracy, ROC and F1 scores that I mentioned earlier. Since the model is still running (the total data size is more than 1 million instances - 0.8 training and 0.2 test), I will update you shortly with the results on the test set.

Best Regards
Abhik

@abhikjha (Author)

Furthermore, I used RandomizedSearchCV to find the hyperparameters of the RandomForest before running it with cross_val_score / cross_val_predict. Also, I am using an MLPClassifier as well, just to cross-check whether I get similar results with both classifiers...
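The search looked roughly like this (a sketch; the parameter ranges and variable names are illustrative):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 20),
    "min_samples_leaf": randint(1, 10),
}

rnd_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distribs,
    n_iter=20, cv=5, scoring="f1", random_state=42,
)
rnd_search.fit(X_train, y_train)
best_forest = rnd_search.best_estimator_  # then evaluated with cross_val_score
```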

@ageron (Owner) commented Mar 27, 2018

Thanks for the update. Perfect, your model is looking excellent, good job!

@abhikjha (Author)

Good morning, Aurélien. Thank you so much for your time and words of appreciation. The model has now finished on my test data as well, and although performance deteriorated very slightly, the overall results are still in the same range as mentioned above.

Can't express how helpful your book has been, and I would like to give all the credit for my successful modelling implementation to your book.

Looking forward to interacting with you and getting your kind guidance on other future projects as well.

Best Regards
Abhik

@ageron (Owner) commented Apr 2, 2018

I'm really glad everything ended up working fine! Thanks for your kind words.
Cheers, Aurélien

@ageron closed this as completed Apr 2, 2018