Clean on Training and use the information from that to clean on Test #1

Open
VikasNS opened this issue Jul 19, 2018 · 1 comment
Labels
question Further information is requested

Comments

VikasNS commented Jul 19, 2018

I have a dataset in which both the training and test sets have missing values.
In sklearn, you fit a transformer on the training data with fit_transform and later reuse it to transform the test data.
I want to do the same here: clean the training set, then use the information learned from it to clean the test set.
How can I do this?
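
For reference, the fit-on-train, transform-on-test pattern described above can be sketched in plain Python. This is only an illustration: MeanImputer is a hypothetical class written for this example, not part of sklearn or boltzmannclean, and it fills gaps with column means rather than with an RBM.

```python
import math


class MeanImputer:
    """Hypothetical sklearn-style transformer, for illustration only."""

    def fit(self, rows):
        # Learn one mean per column, ignoring missing values (NaN).
        n_cols = len(rows[0])
        self.means_ = []
        for j in range(n_cols):
            observed = [r[j] for r in rows if not math.isnan(r[j])]
            self.means_.append(sum(observed) / len(observed))
        return self

    def transform(self, rows):
        # Fill gaps with the means learned in fit(); nothing is re-estimated.
        return [
            [self.means_[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows
        ]


nan = float("nan")
train = [[1.0, 10.0], [3.0, nan], [nan, 30.0]]
test = [[nan, nan], [5.0, 40.0]]

imputer = MeanImputer().fit(train)      # statistics come from training data only
train_clean = imputer.transform(train)
test_clean = imputer.transform(test)    # test gaps filled with training means
```

The point of the pattern is that the test set never influences what is learned; it is only transformed with parameters estimated from the training set.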

@srstevenson srstevenson added the question Further information is requested label Sep 14, 2018
@shwinnn
Contributor

shwinnn commented Oct 23, 2018

Hi @VikasNS, you will need to slightly modify the clean function for this use case. Make the following two changes, first to its signature:

-      def clean(dataframe, numerical_columns, categorical_columns, tune_rbm):
+      def clean(dataframe, numerical_columns, categorical_columns, tune_rbm, rbm=None):

and then to the line where the RBM is trained:

-      rbm = train_rbm(preprocessed_array, tune_hyperparameters=tune_rbm)
+      if rbm is None:
+          rbm = train_rbm(preprocessed_array, tune_hyperparameters=tune_rbm)
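
The idea behind this change is an optional pre-trained model argument: train only when none is supplied. A toy sketch of that pattern follows. The names train_model and clean_values are hypothetical stand-ins, not boltzmannclean functions, and the "model" here is just a mean.

```python
def train_model(values):
    # Hypothetical stand-in for train_rbm: "training" just records the mean.
    observed = [v for v in values if v is not None]
    return {"mean": sum(observed) / len(observed)}


def clean_values(values, model=None):
    # Train only when the caller did not pass a pre-trained model.
    if model is None:
        model = train_model(values)
    return [model["mean"] if v is None else v for v in values]


model = train_model([1.0, 2.0, 3.0])                    # fit on "training" data
reused = clean_values([None, 10.0], model=model)        # uses the training mean
retrained = clean_values([None, 10.0])                  # falls back to training on its input
```

Defaulting the new argument to None keeps existing calls working unchanged, which is why the same default is used in the modified signature.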

With those changes to clean in place, you can train the RBM once on the training data and reuse it for both datasets:

    import boltzmannclean
    import numpy as np

    # training_dataframe and test_dataframe are your own DataFrames.
    numerical_columns = ['list', 'of', 'numerical', 'column', 'names']
    categorical_columns = ['list', 'of', 'categorical', 'column', 'names']

    # Build the preprocessed array that train_rbm expects, from training data only.
    numerics, scaler = boltzmannclean.preprocess_numerics(
        training_dataframe, numerical_columns
    )
    categoricals, category_dict = boltzmannclean.preprocess_categoricals(
        training_dataframe, categorical_columns
    )
    preprocessed_array = np.hstack((numerics, categoricals))

    # Train the RBM once.
    pretrained_rbm = boltzmannclean.train_rbm(
        preprocessed_array, tune_hyperparameters=True  # or False, up to you
    )

    # Pass the pretrained RBM so the modified clean skips training.
    cleaned_training_dataframe = boltzmannclean.clean(
        training_dataframe, numerical_columns, categorical_columns,
        tune_rbm=False, rbm=pretrained_rbm
    )
    cleaned_test_dataframe = boltzmannclean.clean(
        test_dataframe, numerical_columns, categorical_columns,
        tune_rbm=False, rbm=pretrained_rbm
    )

acroz pushed a commit that referenced this issue Nov 29, 2019