Error in upload_model_and_df when target value is not declared in column_types #361

Closed
princyiakov opened this issue Nov 27, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@princyiakov
Contributor

🐛 Bug Report

Running upload_model_and_df raises KeyError: 'default' when the target column is not declared in column_types. The previous version accepted a dataset whose target was left out of column_types.

🔬 How To Reproduce

Steps to reproduce the behavior:

  1. Select Giskard Credit Scoring Demo Notebook

  2. Remove 'default' column from column_types
    column_types = {'account_check_status':"category",
    'duration_in_month':"numeric",
    'credit_history':"category",
    'purpose':"category",
    'credit_amount':"numeric",
    'savings':"category",
    'present_employment_since':"category",
    'installment_as_income_perc':"numeric",
    'sex':"category",
    'personal_status':"category",
    'other_debtors':"category",
    'present_residence_since':"numeric",
    'property':"category",
    'age':"numeric",
    'other_installment_plans':"category",
    'housing':"category",
    'credits_this_bank':"numeric",
    'job':"category",
    'people_under_maintenance':"numeric",
    'telephone':"category",
    'foreign_worker':"category"}

  3. Run the Notebook after updating your Token

Environment

  • OS: macOS
  • Python version: 3.9.7
  • giskard PyPI version: 1.7.0
  • giskard app version: 1.3.0


📈 Expected behavior: uploading the model and dataset should work without declaring the target in column_types, as it did in the previous version.

📎 Stack Trace

/Users/princyiakov/opt/anaconda3/lib/python3.9/site-packages/giskard/client/project.py:634: UserWarning: Feature 'people_under_maintenance' is declared as 'numeric' but has 2 (<= nuniques_category=2) distinct values. Are you sure it is not a 'category' feature?
  warning(
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/kp/fq5v4zt954n8wtmwm0n_n0j80000gn/T/ipykernel_60553/2045684294.py in <module>
----> 1 credit_scoring.upload_model_and_df(
      2     prediction_function=clf_logistic_regression.predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
      3     model_type='classification', # "classification" for classification model OR "regression" for regression model
      4     df=test_data, # the dataset you want to use to inspect your model
      5     column_types=column_types, # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values

~/opt/anaconda3/lib/python3.9/site-packages/giskard/client/project.py in upload_model_and_df(self, prediction_function, model_type, df, column_types, feature_names, target, model_name, dataset_name, classification_threshold, classification_labels)
    388         """
    389         self.analytics.track("Upload model and dataset")
--> 390         data, raw_column_types = self._validate_and_compress_data(column_types, df, target)
    391         classification_labels, model = self._validate_model(
    392             classification_labels,

~/opt/anaconda3/lib/python3.9/site-packages/giskard/client/project.py in _validate_and_compress_data(self, column_types, df, target)
    329         self.validate_columns_columntypes(df, column_types, target)
    330         self._validate_column_types(column_types)
--> 331         self._validate_column_categorization(df, column_types)
    332         raw_column_types = df.dtypes.apply(lambda x: x.name).to_dict()
    333         data = compress(save_df(df))

~/opt/anaconda3/lib/python3.9/site-packages/giskard/client/project.py in _validate_column_categorization(df, feature_types)
    630         for column in df.columns:
    631             if nuniques[column] <= nuniques_category and \
--> 632                     (feature_types[column] == SupportedColumnType.NUMERIC.value or \
    633                      feature_types[column] == SupportedColumnType.TEXT.value):
    634                 warning(

KeyError: 'default'
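
Reading the trace, the loop at line 630 appears to walk every column of df, target included, and look each one up in feature_types, so a target missing from column_types fails on the lookup. The following is a minimal sketch of that pattern and one possible guard. It is not giskard's code: check_strict, check_tolerant and count_unique are hypothetical names, and a plain dict of lists stands in for the DataFrame so that the snippet runs without any dependency.

```python
NUNIQUES_CATEGORY = 2  # threshold reported in the UserWarning above


def count_unique(columns):
    """Number of distinct values per column (stand-in for df.nunique())."""
    return {name: len(set(values)) for name, values in columns.items()}


def check_strict(columns, feature_types):
    """Mimics the pattern in the trace: indexes feature_types for every column."""
    nuniques = count_unique(columns)
    flagged = []
    for column in columns:
        if nuniques[column] <= NUNIQUES_CATEGORY and \
                feature_types[column] in ("numeric", "text"):
            flagged.append(column)
    return flagged


def check_tolerant(columns, feature_types):
    """Same check, but skips columns (such as the target) with no declared type."""
    nuniques = count_unique(columns)
    flagged = []
    for column in columns:
        declared = feature_types.get(column)
        if declared is None:
            continue  # undeclared column, e.g. the target: nothing to validate
        if nuniques[column] <= NUNIQUES_CATEGORY and declared in ("numeric", "text"):
            flagged.append(column)
    return flagged


# Toy data: 'default' plays the role of the undeclared target column.
columns = {
    "people_under_maintenance": [1, 2, 1, 2],
    "age": [25, 40, 31, 58],
    "default": ["Default", "Not default", "Default", "Not default"],
}
column_types = {"people_under_maintenance": "numeric", "age": "numeric"}

try:
    check_strict(columns, column_types)
    strict_error = None
except KeyError as exc:
    strict_error = exc.args[0]

tolerant_flagged = check_tolerant(columns, column_types)
print("strict check failed on:", strict_error)
print("tolerant check flagged:", tolerant_flagged)
```

Whether the library should skip undeclared columns or only the declared target is a design choice for the maintainers; the sketch only shows that the lookup, not the data, triggers the error.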
@princyiakov princyiakov added the bug Something isn't working label Nov 27, 2022
@princyiakov
Contributor Author

Code to reproduce the error:

import pandas as pd

from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# To download and read the credit scoring dataset
url = 'https://raw.githubusercontent.com/Giskard-AI/examples/main/datasets/credit_scoring_classification_model_dataset/german_credit_prepared.csv'
credit = pd.read_csv(url, sep=',', engine="python")  # The file is also available at https://github.com/Giskard-AI/giskard-client/tree/main/sample_data/classification

# Declare the type of each column in the dataset(example: category, numeric, text)
column_types = {'account_check_status':"category", 
               'duration_in_month':"numeric",
               'credit_history':"category",
               'purpose':"category",
               'credit_amount':"numeric",
               'savings':"category",
               'present_employment_since':"category",
               'installment_as_income_perc':"numeric",
               'sex':"category",
               'personal_status':"category",
               'other_debtors':"category",
               'present_residence_since':"numeric",
               'property':"category",
               'age':"numeric",
               'other_installment_plans':"category",
               'housing':"category",
               'credits_this_bank':"numeric",
               'job':"category",
               'people_under_maintenance':"numeric",
               'telephone':"category",
               'foreign_worker':"category"}

# feature_types declares the features the model is trained on
# (the filter is a no-op here because 'default' is not in column_types)
feature_types = {i:column_types[i] for i in column_types if i!='default'}

# Pipeline to fill missing values, transform and scale the numeric columns
columns_to_scale = [key for key in feature_types.keys() if feature_types[key]=="numeric"]
numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Pipeline to fill missing values and one hot encode the categorical values
columns_to_encode = [key for key in feature_types.keys() if feature_types[key]=="category"]
categorical_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore',sparse=False)) ])

# Perform preprocessing of the columns with the above pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, columns_to_scale),
      ('cat', categorical_transformer, columns_to_encode)
          ]
)

# Pipeline for the model Logistic Regression
clf_logistic_regression = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter =1000))])

# Split the data into train and test
Y=credit['default']
X= credit.drop(columns="default")
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.20,random_state = 30, stratify = Y)

# Fit and score your model
clf_logistic_regression.fit(X_train, Y_train)
clf_logistic_regression.score(X_test, Y_test)

# Prepare data to upload on Giskard
train_data = pd.concat([X_train, Y_train], axis=1)
test_data = pd.concat([X_test, Y_test ], axis=1)

"""# Upload the model in Giskard 🚀🚀🚀

### Initiate a project
"""

from giskard import GiskardClient

url = "http://localhost:19000"  # If Giskard is installed locally (for installation, see: https://docs.giskard.ai/start/guides/installation)
# url = "http://app.giskard.ai"  # To upload to the Giskard URL instead
token = "YOUR GENERATED TOKEN"  # Generate your API token in the Admin tab of the Giskard application

client = GiskardClient(url, token)

# your_project = client.create_project("project_key", "PROJECT_NAME", "DESCRIPTION")
# Choose the arguments you want. But "project_key" should be unique and in lower case
credit_scoring = client.create_project("credit_scoring", "German Credit Scoring", "Project to predict if user will default")

# If you've already created a project with the key "credit_scoring", use
# credit_scoring = client.get_project("credit_scoring")

"""### Upload your model and a dataset (see [documentation](https://docs.giskard.ai/start/guides/upload-your-model))"""

credit_scoring.upload_model_and_df(
    prediction_function=clf_logistic_regression.predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='classification', # "classification" for classification model OR "regression" for regression model
    df=test_data, # the dataset you want to use to inspect your model
    column_types=column_types, # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values
    target='default', # The column name in df corresponding to the actual target variable (ground truth).
    feature_names=list(feature_types.keys()), # List of the feature names of prediction_function
    classification_labels=clf_logistic_regression.classes_ ,  # List of the classification labels of your prediction
    model_name='logistic_regression_v1', # Name of the model
    dataset_name='test_data' # Name of the dataset
)
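
A possible workaround for the affected version, going by step 2 of the reproduction, is to put the target back into column_types before the upload while still excluding it from feature_types. The sketch below shows only the dictionary handling, with a shortened column list. The "category" type for the target is an assumption about what the demo notebook originally declared, and column_types_with_target is a hypothetical name.

```python
# Shortened version of the column_types dict from the script above.
column_types = {
    'account_check_status': "category",
    'duration_in_month': "numeric",
    'credit_amount': "numeric",
}

target = 'default'

# Declare the target explicitly (assumed type: "category").
column_types_with_target = {**column_types, target: "category"}

# The model's features still exclude the target.
feature_types = {
    name: kind
    for name, kind in column_types_with_target.items()
    if name != target
}

print(column_types_with_target)
print(list(feature_types.keys()))
```

With this change, column_types_with_target would be passed as column_types and list(feature_types.keys()) as feature_names in the upload_model_and_df call above.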
