Number of features of the model must match the input #199
Comments
Hi @abhikjha ,
A very common cause for the second case is calling fit_transform() on the test set. Make sure you call fit_transform() (or fit() followed by transform()) only on the training set, and just transform() on the test set. If this does not explain your problem, please post more details, and the code you are trying to run.
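The rule above can be sketched as follows (the pipeline steps and toy data here are illustrative, not from this thread):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
X_test = np.array([[2.0, 4.0]])

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Fit (and transform) on the training data only...
X_train_prepared = pipeline.fit_transform(X_train)
# ...then reuse the SAME fitted pipeline on the test data.
X_test_prepared = pipeline.transform(X_test)

print(X_train_prepared.shape)  # (3, 2)
print(X_test_prepared.shape)   # (1, 2)
```

Because the pipeline is fitted once on the training set, both transformed arrays end up with the same number of columns, which is exactly what the model requires at prediction time.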
Hi Aurélien, Thanks very much for your prompt reply. I tried transform() only on the test data and fit_transform() on the training data as you explained in your book, but I am still getting this error. The way I followed the process earlier is as follows -
However, if I flip the process like this, I am not getting any error, and the columns of both train and test data are the same in number (this is not the case in the above method) -
But the problem with the above method is that I now have predict_proba results, which I have converted into a pandas DataFrame, and I would like to combine these with my original test data, which I will later export to CSV. I need to show this to my management as the final result of my ML project. How can I now get my original test data back (without standardization & one-hot encoding) with the names of the columns intact? I am really sorry for troubling you and await your kind response on this query. Best Regards
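One way to do what is described above is to keep an untouched copy of the raw test DataFrame and join the predicted probabilities back to it by index; the column names and data below are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

# Untouched copy of the raw test data (no scaling, no one-hot encoding).
test_raw = pd.DataFrame({"policy_id": [101, 102], "region": ["N", "S"]})

# Stand-in for the output of clf.predict_proba(X_test_prepared):
# one row per test instance, one column per class.
probas = np.array([[0.9, 0.1], [0.2, 0.8]])

proba_df = pd.DataFrame(probas, columns=["proba_inforce", "proba_lapse"],
                        index=test_raw.index)

# Rows line up because predict_proba preserves the row order of its input.
result = pd.concat([test_raw, proba_df], axis=1)
result.to_csv("predictions.csv", index=False)
```

The key point is that the transformed matrix is only needed for the model call; the original, human-readable DataFrame never has to be reconstructed from it.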
Apart from this, the sklearn documentation for Pipeline does not list transform() on its own as one of its methods. Does this mean it is always applying fit_transform() even when I call transform()? http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
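For the record, a Pipeline does expose transform() when every step supports it, and transform() never refits: it reuses the parameters learned during fit(). A minimal sketch with a single scaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# fit() learns mean=5 and std=5 from this data.
scaler = StandardScaler().fit(np.array([[0.0], [10.0]]))

# transform() applies the already-learned parameters; it does not refit.
print(scaler.transform(np.array([[5.0]])))   # [[0.]]
print(scaler.transform(np.array([[15.0]])))  # [[2.]] -- same fitted mean/std
```

If transform() secretly called fit_transform(), the second call would have recomputed the mean and standard deviation from [[15.0]] alone, which it clearly does not.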
Hi @abhikjha , The first process you describe looks good to me. The second should not be used (even though it works) because you are fitting the test data, which means that whatever generalization error you measure on the test set at the end will be biased (too optimistic), which defeats the purpose of the test set. So the correct option is to debug the first process. Could you please copy the code you are using here so I can take a look? If it is too large, you can use a gist. Cheers,
Hi Aurélien

Thank you for your email and kind offer to review my code. However, before sending the code to you, I thought I would give it a retry. The first problem I faced when transforming was something like "Found Unknown features during Transform". I struggled with this for a bit, then saw that in the CategoricalEncoder code we need to set handle_unknown="ignore". Accidentally I passed this to transform(), which not surprisingly gave me an error again. Then I made it part of the CategoricalEncoder in the pipeline. It all worked smoothly now. The shapes of the transformed train and test sets are the same. :) Can't say how happy I am.

By the way, once I am done with everything, I will share my notebook with you, which is in HTML format. Can it be shared via a gist, or can I email it over to you?

Just to tell you about my project: I am working on creating a predictive model where, based on some information, my model should predict which insurance policies will lapse and which will continue to be in force. This is my first ML project and I have used all the concepts you taught us in chapters 2 and 3. I cannot say how helpful these were to me.

Results of the model: very surprisingly, I got 98% accuracy with RandomForest, the F1 score is around 98% as well, and the ROC score is around 99%. This is all on the training set. Let's see if I get the same performance on the test set as well. Do the above scores look too good to be true?

Best Regards
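The fix described above, passing handle_unknown="ignore" to the encoder's constructor rather than to transform(), can be sketched like this. OneHotEncoder is used here in place of CategoricalEncoder (which was a pre-release name for the same functionality); the data is illustrative:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# handle_unknown is a constructor argument, not a transform() argument.
encoder = OneHotEncoder(handle_unknown="ignore")

X_train = pd.DataFrame({"region": ["N", "S", "N"]})
X_test = pd.DataFrame({"region": ["S", "E"]})  # "E" never appears in training

train_enc = encoder.fit_transform(X_train).toarray()
test_enc = encoder.transform(X_test).toarray()  # unknown "E" becomes all zeros

print(train_enc.shape, test_enc.shape)  # (3, 2) (2, 2) -- same column count
```

With handle_unknown="ignore", a category seen only at test time is encoded as a row of zeros instead of raising an error, so the train and test matrices keep the same width.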
Hi @abhikjha , Great news! I'm really glad you found my book helpful. However, the performance on the training set may indeed be too good to be true, unfortunately... Cheers,
Thanks Aurélien. Just to let you know, my runs on the training set are finished - I used cross_val_score and cross_val_predict with 5 folds and got the same results for accuracy, ROC score and F1 score that I mentioned earlier. Since the model is still running (the total size of the data is more than 1 million rows - 0.8 training and 0.2 test), I will shortly update you with the results on the test set. Best Regards
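The evaluation setup described above looks roughly like the following sketch; the dataset and model here are toy stand-ins, not the insurance data from the thread:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict

X, y = make_classification(n_samples=500, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation: one score per fold, all on the training set only.
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
roc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")

# cross_val_predict returns one out-of-fold prediction per training instance.
preds = cross_val_predict(clf, X, y, cv=5)

print(acc.mean(), roc.mean(), preds.shape)
```

Because every prediction from cross_val_predict is made by a model that never saw that instance during fitting, these scores are a more honest estimate than refitting and scoring on the same data.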
Furthermore, I used RandomizedSearchCV to find the hyperparameters of the RandomForest before running it with cross_val_score / cross_val_predict. Also, I am using MLPClassifier as well, just to cross-check whether I get similar results with both classifiers...
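A minimal RandomizedSearchCV sketch for tuning a RandomForest; the search space and toy data below are illustrative, not the ones used in this thread:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=42)

# Distributions to sample hyperparameters from (illustrative ranges).
param_dist = {
    "n_estimators": randint(10, 100),
    "max_depth": randint(2, 10),
    "max_features": randint(1, 10),
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist, n_iter=5, cv=3, random_state=42)
search.fit(X, y)  # tune on the training set only
print(search.best_params_)
```

As with the preprocessing pipeline, the search should be fitted on the training set only, so the test set stays untouched until the final evaluation.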
Thanks for the update. Perfect, your model is looking excellent, good job! |
Good Morning, Aurélien. Thank you so much for your time and words of appreciation. The model has now finished on my test data as well, and although performance deteriorated very slightly, the overall results are still in the same range as mentioned above. Can't express how helpful your book has been, and I would like to give all the credit for my successful modelling implementation to your book. Looking forward to interacting with you and getting your kind guidance on other future projects as well. Best Regards
I'm really glad everything ended up working fine! Thanks for your kind words. |
Hi Aurelien,
I am facing one major issue when applying classifier model on Test Data for prediction. The error which I am getting is "Number of features of the model must match the input". Could you please let me know how to resolve this issue?
Best Regards
Abhik