Chapter 2 error during prediction #646

Closed
singh-krishan opened this issue Jan 3, 2022 · 2 comments

Comments

@singh-krishan

Hi @ageron ,
I have defined the full pipeline as:

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs),
])

where the num_pipeline is defined as:

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scalar', StandardScaler()),
])

Now, when I execute this code:

some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

some_data_prepared = full_pipeline.fit_transform(some_data)
print('Predictions:', lin_reg.predict(some_data_prepared))
print('Labels:', list(some_labels))

I get this error:

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 13 is different from 11)

I noticed that when I pass the whole housing dataset to full_pipeline, the ocean_proximity column is transformed into 5 different columns, resulting in a total of 13 fields. But when I pass only a subset of the dataset (i.e. housing.iloc[:5]), the transformation does not seem to be applied to the ocean_proximity column in the same way, and I end up with fewer fields.

Any suggestions on what could be wrong?

Thanks a lot

@ageron
Owner

ageron commented Jan 10, 2022

Hi @singh-krishan ,

Thanks for your question. Make sure you fit estimators only on training data. This means you should call fit(), fit_transform() or fit_predict() only on the training data, never on other data (such as the validation set, the test set, or new data). In your code, you should therefore replace full_pipeline.fit_transform(some_data) with full_pipeline.transform(some_data). Before you do that, though, the pipeline must already have been fitted on the full training set.
So the code should look like:

housing_prepared = full_pipeline.fit_transform(housing)
some_data_prepared = full_pipeline.transform(some_data)

In the full training set, there are 5 distinct values in the ocean_proximity column. That's why, once full_pipeline has been fit on the training set, it outputs a one-hot vector of size 5 for each ocean_proximity value. But if some_data is small enough, it is likely to contain fewer categories, so fitting the pipeline again on some_data produces shorter one-hot vectors and fewer columns than the model was trained on, which is what you observed. If you call transform(some_data) instead of fit_transform(some_data), the pipeline keeps the categories it learned from the training set and still outputs one-hot vectors of size 5.
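To make the difference concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data, its column values and the `build_pipeline` helper are assumptions for illustration, not the actual housing dataset or notebook code). It compares the column count you get from refitting on a slice with the one you get from only transforming that slice:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the housing data: 2 numeric columns, 1 categorical.
toy = pd.DataFrame({
    "rooms": [3.0, 4.0, None, 5.0, 2.0, 6.0],
    "income": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY",
                        "NEAR OCEAN", "ISLAND", "<1H OCEAN"],
})
num_attribs = ["rooms", "income"]
cat_attribs = ["ocean_proximity"]


def build_pipeline():
    """Build an unfitted pipeline shaped like the one in this issue."""
    num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("std_scaler", StandardScaler()),
    ])
    return ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])


some_data = toy.iloc[:2]  # this slice only contains the "INLAND" category

# Correct: fit once on the full training set, then only transform new data.
full_pipeline = build_pipeline()
toy_prepared = full_pipeline.fit_transform(toy)
some_data_prepared = full_pipeline.transform(some_data)

# Incorrect: refitting on the slice relearns the categories from 2 rows.
refit_prepared = build_pipeline().fit_transform(some_data)

print("fit on full set:       ", toy_prepared.shape)
print("transform on slice:    ", some_data_prepared.shape)
print("fit_transform on slice:", refit_prepared.shape)
```

With this toy data, the fitted pipeline yields 2 numeric columns plus 5 one-hot columns for both the full set and the slice, while the refitted one yields 2 numeric columns plus a single one-hot column. A model trained on the first layout cannot consume the second, which is the kind of shape mismatch reported above.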

Hope this helps.

@singh-krishan
Author

thanks @ageron , makes sense
