Error in upload_model_and_df when target value is not declared in column_types #361

princyiakov · 2022-11-27T03:57:26Z

🐛 Bug Report

Running upload_model_and_df raises KeyError: 'default' when the target column is not declared in column_types. The previous version supported leaving the target undeclared.

🔬 How To Reproduce

Steps to reproduce the behavior:

1. Select the Giskard Credit Scoring Demo Notebook.
2. Remove the 'default' column from column_types:
   column_types = {'account_check_status':"category",
                  'duration_in_month':"numeric",
                  'credit_history':"category",
                  'purpose':"category",
                  'credit_amount':"numeric",
                  'savings':"category",
                  'present_employment_since':"category",
                  'installment_as_income_perc':"numeric",
                  'sex':"category",
                  'personal_status':"category",
                  'other_debtors':"category",
                  'present_residence_since':"numeric",
                  'property':"category",
                  'age':"numeric",
                  'other_installment_plans':"category",
                  'housing':"category",
                  'credits_this_bank':"numeric",
                  'job':"category",
                  'people_under_maintenance':"numeric",
                  'telephone':"category",
                  'foreign_worker':"category"}
3. Run the notebook after updating your token.

Environment

OS: macOS
Python version: 3.9.7
giskard PyPI version: 1.7.0
giskard app version: 1.3.0

📈 Expected behavior: upload_model_and_df should upload the model and data without the target being declared in column_types, as in the previous version.

📎 Stack Trace

/Users/princyiakov/opt/anaconda3/lib/python3.9/site-packages/giskard/client/project.py:634: UserWarning: Feature 'people_under_maintenance' is declared as 'numeric' but has 2 (<= nuniques_category=2) distinct values. Are you sure it is not a 'category' feature?
  warning(
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/kp/fq5v4zt954n8wtmwm0n_n0j80000gn/T/ipykernel_60553/2045684294.py in <module>
----> 1 credit_scoring.upload_model_and_df(
      2     prediction_function=clf_logistic_regression.predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
      3     model_type='classification', # "classification" for classification model OR "regression" for regression model
      4     df=test_data, # the dataset you want to use to inspect your model
      5     column_types=column_types, # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values

~/opt/anaconda3/lib/python3.9/site-packages/giskard/client/project.py in upload_model_and_df(self, prediction_function, model_type, df, column_types, feature_names, target, model_name, dataset_name, classification_threshold, classification_labels)
    388         """
    389         self.analytics.track("Upload model and dataset")
--> 390         data, raw_column_types = self._validate_and_compress_data(column_types, df, target)
    391         classification_labels, model = self._validate_model(
    392             classification_labels,

~/opt/anaconda3/lib/python3.9/site-packages/giskard/client/project.py in _validate_and_compress_data(self, column_types, df, target)
    329         self.validate_columns_columntypes(df, column_types, target)
    330         self._validate_column_types(column_types)
--> 331         self._validate_column_categorization(df, column_types)
    332         raw_column_types = df.dtypes.apply(lambda x: x.name).to_dict()
    333         data = compress(save_df(df))

~/opt/anaconda3/lib/python3.9/site-packages/giskard/client/project.py in _validate_column_categorization(df, feature_types)
    630         for column in df.columns:
    631             if nuniques[column] <= nuniques_category and \
--> 632                     (feature_types[column] == SupportedColumnType.NUMERIC.value or \
    633                      feature_types[column] == SupportedColumnType.TEXT.value):
    634                 warning(

KeyError: 'default'
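
The trace points at the loop over df.columns, which indexes feature_types[column] for every column of the dataframe, including the undeclared target. Below is a minimal standalone sketch of that pattern with one possible guard. It does not use giskard: `check_column_categorization`, the plain string type names, and the default threshold of 2 are assumptions modeled on the trace and the warning text, not the library's actual code or fix.

```python
import warnings

import pandas as pd


def check_column_categorization(df, feature_types, nuniques_category=2):
    """Return columns declared numeric/text that have few distinct values.

    Hypothetical stand-in for the loop shown in the trace. Columns without a
    declared type (such as an undeclared target) are skipped, not indexed.
    """
    nuniques = df.nunique()
    flagged = []
    for column in df.columns:
        declared = feature_types.get(column)  # .get avoids the KeyError
        if declared is None:
            continue
        if nuniques[column] <= nuniques_category and declared in ("numeric", "text"):
            warnings.warn(
                f"Feature '{column}' is declared as '{declared}' but has "
                f"{nuniques[column]} distinct values."
            )
            flagged.append(column)
    return flagged


# Toy frame: 'default' is the target and is absent from the declared types.
toy_df = pd.DataFrame(
    {
        "age": [23, 35, 41, 58],
        "people_under_maintenance": [1, 2, 1, 2],
        "default": ["Default", "Not default", "Default", "Not default"],
    }
)
toy_types = {"age": "numeric", "people_under_maintenance": "numeric"}
```

Calling check_column_categorization(toy_df, toy_types) still warns about 'people_under_maintenance' but no longer fails on 'default', whereas indexing toy_types['default'] directly raises the KeyError reported here.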


princyiakov · 2022-11-27T04:29:44Z

Code to reproduce the error:

import pandas as pd

from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# To download and read the credit scoring dataset
url = 'https://raw.githubusercontent.com/Giskard-AI/examples/main/datasets/credit_scoring_classification_model_dataset/german_credit_prepared.csv'
credit = pd.read_csv(url, sep=',',engine="python") #To download go to https://github.com/Giskard-AI/giskard-client/tree/main/sample_data/classification

# Declare the type of each column in the dataset(example: category, numeric, text)
column_types = {'account_check_status':"category", 
               'duration_in_month':"numeric",
               'credit_history':"category",
               'purpose':"category",
               'credit_amount':"numeric",
               'savings':"category",
               'present_employment_since':"category",
               'installment_as_income_perc':"numeric",
               'sex':"category",
               'personal_status':"category",
               'other_debtors':"category",
               'present_residence_since':"numeric",
               'property':"category",
               'age':"numeric",
               'other_installment_plans':"category",
               'housing':"category",
               'credits_this_bank':"numeric",
               'job':"category",
               'people_under_maintenance':"numeric",
               'telephone':"category",
               'foreign_worker':"category"}

# feature_types is used to declare the features the model is trained on
feature_types = {i:column_types[i] for i in column_types if i!='default'}

# Pipeline to fill missing values, transform and scale the numeric columns
columns_to_scale = [key for key in feature_types.keys() if feature_types[key]=="numeric"]
numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Pipeline to fill missing values and one hot encode the categorical values
columns_to_encode = [key for key in feature_types.keys() if feature_types[key]=="category"]
categorical_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore',sparse=False)) ])

# Perform preprocessing of the columns with the above pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, columns_to_scale),
        ('cat', categorical_transformer, columns_to_encode)
    ]
)

# Pipeline for the model Logistic Regression
clf_logistic_regression = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter =1000))])

# Split the data into train and test
Y=credit['default']
X= credit.drop(columns="default")
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.20,random_state = 30, stratify = Y)

# Fit and score your model
clf_logistic_regression.fit(X_train, Y_train)
clf_logistic_regression.score(X_test, Y_test)

# Prepare data to upload on Giskard
train_data = pd.concat([X_train, Y_train], axis=1)
test_data = pd.concat([X_test, Y_test ], axis=1)

"""# Upload the model in Giskard 🚀🚀🚀

### Initiate a project
"""

from giskard import GiskardClient

url = "http://localhost:19000" #if Giskard is installed locally (for installation, see: https://docs.giskard.ai/start/guides/installation)
#url = "http://app.giskard.ai" # If you want to upload on giskard URL
token = "YOUR GENERATED TOKEN" #you can generate your API token in the Admin tab of the Giskard application (for installation, see: https://docs.giskard.ai/start/guides/installation)

client = GiskardClient(url, token)

# your_project = client.create_project("project_key", "PROJECT_NAME", "DESCRIPTION")
# Choose the arguments you want. But "project_key" should be unique and in lower case
credit_scoring = client.create_project("credit_scoring", "German Credit Scoring", "Project to predict if user will default")

# If you've already created a project with the key "credit_scoring", use
#credit_scoring = client.get_project("credit_scoring")

"""### Upload your model and a dataset (see [documentation](https://docs.giskard.ai/start/guides/upload-your-model))"""

credit_scoring.upload_model_and_df(
    prediction_function=clf_logistic_regression.predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='classification', # "classification" for classification model OR "regression" for regression model
    df=test_data, # the dataset you want to use to inspect your model
    column_types=column_types, # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values
    target='default', # The column name in df corresponding to the actual target variable (ground truth).
    feature_names=list(feature_types.keys()), # List of the feature names of prediction_function
    classification_labels=clf_logistic_regression.classes_ ,  # List of the classification labels of your prediction
    model_name='logistic_regression_v1', # Name of the model
    dataset_name='test_data' # Name of the dataset
)
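
Until the validation tolerates an undeclared target, one possible workaround is to declare the target's type before the upload, as the unmodified demo notebook does. A minimal sketch: `with_target_type` is a hypothetical helper, and "category" is an assumed type for the 'default' column.

```python
def with_target_type(column_types, target, target_type="category"):
    """Return a copy of column_types that also declares the target column."""
    declared = dict(column_types)  # copy so the caller's dict is untouched
    declared.setdefault(target, target_type)
    return declared


# Short example; in the script above, pass the full column_types dict.
example_types = {"age": "numeric", "sex": "category"}
patched_types = with_target_type(example_types, "default")
```

Passing column_types=with_target_type(column_types, 'default') to upload_model_and_df should then give the validation loop an entry for every column in df.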

princyiakov added the bug Something isn't working label Nov 27, 2022

andreybavt assigned princyiakov and unassigned princyiakov Dec 12, 2022

luca-martial closed this as completed Oct 18, 2023
