LightGBM error special JSON characters in feature name #399

Closed
braaannigan opened this issue Apr 1, 2020 · 14 comments
Labels: bug, module: tabular

@braaannigan

Hi - great library everyone!

I'm doing tabular prediction with news article data. The LightGBM model doesn't run with the following error:

LightGBMError: Do not support special JSON characters in feature name.
Do not support special JSON characters in feature name.

The upstream issue in LightGBM is here: microsoft/LightGBM#2455

Basically, they check for special JSON characters (e.g. [],{}":) in feature names and throw an error if any are found. Could AutoGluon check for these characters and remove any offending features?
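As a rough illustration of the kind of check being requested, the sketch below lists the DataFrame columns whose names contain any of the characters mentioned above. The character set is taken from this comment and may not match LightGBM's exact list; the helper name find_offending_columns and the sample data are hypothetical.

```python
import pandas as pd

# Characters named in this issue; the exact set LightGBM rejects is an assumption.
SPECIAL_JSON_CHARS = set('[]{}",:')


def find_offending_columns(df):
    """Return the column names that contain at least one special JSON character."""
    return [
        col for col in df.columns
        if any(ch in SPECIAL_JSON_CHARS for ch in str(col))
    ]


# Made-up example data: two of the three column names contain special characters.
df = pd.DataFrame({"title": [1], 'tags["news"]': [2], "score:raw": [3]})
offending = find_offending_columns(df)
print(offending)
```

The returned list could then be used to drop or rename those columns before training.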

Liam

@Innixma Innixma added bug Something isn't working module: tabular labels Apr 1, 2020
@Innixma
Contributor

Innixma commented Apr 1, 2020

Good catch, thanks for the info! Just to clarify, LightGBM fails in this scenario, but the rest of the models train successfully and you are able to get a result from AutoGluon?

If so, I think I can look into fixing this. The question is whether I can get it to still use the features rather than dropping them entirely (I think I can, by clever renaming).

@braaannigan
Author

Yep, everything else works fine and I get a result at the end

@Innixma Innixma self-assigned this Apr 1, 2020
@Innixma
Contributor

Innixma commented Apr 2, 2020

To help me test that I am covering all of the failure cases, could you provide a sample of the news article training data you are experiencing the problem with, or the location from which you obtained the data? If you wish to keep it private that is fine, but it could help accelerate the fix on my end.

@braaannigan
Author

Sure - first 3 rows of csv are enough to trigger it - see attached zip.
err.tar.gz

Is there a way to access errors in the underlying models using the %debug in ipython?

@oguzhangur96

Hi,

I have encountered the same problem. I fixed it before training begins with a regular expression and a lambda function (pandas). The object named "data" must be a DataFrame.

import re
# Keep only letters, digits, and underscores in each column name.
data = data.rename(columns=lambda x: re.sub(r'[^A-Za-z0-9_]+', '', x))

I can provide data/column headers if needed.

@Innixma
Contributor

Innixma commented May 4, 2020

Thanks for the info @oguzhangur96! Right now the main difficulty here is to ensure that column names are always converted when entering the model (whether for feature importances, fit, inference) and also inversely converted back to the original upon leaving. This seems to be something that LightGBM itself should handle, and I don't have a good understanding of why they haven't done so.

Furthermore, it isn't sufficient to only remove the special characters, because we have to ensure no two columns have the same name. One simple way is to just rename columns 0-n as '0', '1', ...'n-1', 'n', but then this operation would also have to be done on all of the 99% of problems where this is not required, particularly for online-inference of single rows. I have to first benchmark inference times in these situations to ensure our online-inference speed isn't significantly slowed due to this fix.
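The positional renaming described above, together with the inverse conversion back to the original names, might look roughly like the sketch below. This is not the code from the eventual fix; the helper name make_name_maps is hypothetical, and the sketch assumes the original column names are unique.

```python
import pandas as pd


def make_name_maps(columns):
    """Build forward (original -> position) and inverse (position -> original) maps.

    Assumes the original column names are unique.
    """
    forward = {col: str(i) for i, col in enumerate(columns)}
    inverse = {safe: col for col, safe in forward.items()}
    return forward, inverse


# Made-up frame whose column names contain special JSON characters.
df = pd.DataFrame({'a[0]': [1, 2], 'b:"x"': [3, 4]})
forward, inverse = make_name_maps(df.columns)

safe_df = df.rename(columns=forward)        # names the model would see: '0', '1'
restored = safe_df.rename(columns=inverse)  # original names restored on the way out
```

Because the positional names are unique by construction, no collision check is needed, at the cost of applying a rename on every call into and out of the model.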

@Innixma
Contributor

Innixma commented May 6, 2020

@braaannigan , @oguzhangur96

#451 is merged and should fix this issue (Just in time for our upcoming 0.0.7 release!). I have tested this on various NLP problems which previously caused crashes, and it successfully trains.

Thanks again for highlighting this issue!

Best,
Nick

@Innixma Innixma closed this as completed May 6, 2020
@braaannigan
Author

Great - looking forward to the next release!

@ind-kum

ind-kum commented Jun 4, 2020

How can I fix this error: "LightGBMError: Do not support special JSON characters in feature name."?

@Innixma
Contributor

Innixma commented Jun 4, 2020

@ind-kum Please check your AutoGluon version and upgrade to 0.0.10 if you haven't. This should fix the issue.

@antoinecomp

Do not support special JSON characters in feature name: how to use LGBM in columns with Russian names?

I have a train part of a dataframe with Russian names:

shop__56 | sub_type_Сумки, Альбомы, Коврики д/мыши | shop__46 | sub_type_Для дома и офиса (Цифра) | shop__49 | shop__58 | shop__37 | sub_type_Служебные | sub_type_CD локального производства | shop__22 | ... | shop__48 | sub_type_Артбуки, энциклопедии | sub_type_Подарочные издания | sub_type_PSN | shop_id | sub_type_CD фирменного производства | sub_type_DVD | sub_type_PSVita | sub_type_Комиксы, манга | sub_type_Дополнительные издания
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 41 | 0 | 0 | 0 | 0 | 0

And when I apply LightGBM to it, it raises the same error:

---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
<ipython-input-125-711fbc08b2b9> in <module>
     11                            min_split_gain=0.0222415,
     12                            min_child_weight=40)
---> 13 model_lgb.fit(X_train, y_train)

/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    777                                        verbose=verbose, feature_name=feature_name,
    778                                        categorical_feature=categorical_feature,
--> 779                                        callbacks=callbacks, init_model=init_model)
    780         return self
    781 

/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    615                               evals_result=evals_result, fobj=self._fobj, feval=eval_metrics_callable,
    616                               verbose_eval=verbose, feature_name=feature_name,
--> 617                               callbacks=callbacks, init_model=init_model)
    618 
    619         if evals_result:

/opt/conda/lib/python3.7/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    229     # construct booster
    230     try:
--> 231         booster = Booster(params=params, train_set=train_set)
    232         if is_valid_contain_train:
    233             booster.set_train_data_name(train_data_name)

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in __init__(self, params, train_set, model_file, model_str, silent)
   2051                     break
   2052             # construct booster object
-> 2053             train_set.construct()
   2054             # copy the parameters from train_set
   2055             params.update(train_set.get_params())

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in construct(self)
   1323                                 init_score=self.init_score, predictor=self._predictor,
   1324                                 silent=self.silent, feature_name=self.feature_name,
-> 1325                                 categorical_feature=self.categorical_feature, params=self.params)
   1326             if self.free_raw_data:
   1327                 self.data = None

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
   1149             raise TypeError('Wrong predictor type {}'.format(type(predictor).__name__))
   1150         # set feature names
-> 1151         return self.set_feature_name(feature_name)
   1152 
   1153     def __init_from_np2d(self, mat, params_str, ref_dataset):

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in set_feature_name(self, feature_name)
   1630                 self.handle,
   1631                 c_array(ctypes.c_char_p, c_feature_name),
-> 1632                 ctypes.c_int(len(feature_name))))
   1633         return self
   1634 

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in _safe_call(ret)
     53     """
     54     if ret != 0:
---> 55         raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
     56 
     57 

LightGBMError: Do not support special JSON characters in feature name.

So how can I use LightGBM on this kind of features? Do I need to transform them to English column names?
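One possible approach, sketched below, is to replace only the characters named earlier in this thread (such as the commas in the column names above) and leave the Cyrillic text untouched. Note that the ASCII-only regular expression posted earlier would strip the Cyrillic characters entirely. Whether a given LightGBM version accepts non-ASCII feature names is an assumption to verify; the helper name replace_json_chars and the character set are hypothetical, and the cleaned names are not checked for uniqueness here.

```python
import re


def replace_json_chars(name):
    """Replace each of [ ] { } " , : with an underscore, keeping all other characters."""
    return re.sub(r'[\[\]{}",:]', '_', name)


# Two column names taken from the table above.
cols = ['sub_type_Комиксы, манга', 'shop__56']
cleaned = [replace_json_chars(c) for c in cols]
print(cleaned)
```

With a DataFrame, the same function could be passed to rename, for example X_train.rename(columns=replace_json_chars).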

@Innixma
Contributor

Innixma commented Mar 11, 2021

@antoinecomp This error doesn't appear to be from AutoGluon, are you using AutoGluon?

@ahbon123

ahbon123 commented Jul 7, 2022

Hi,

I have encountered the same problem. Fixed it before the training begins with a regular expression and lambda function (pandas). Object named "data" must be a dataframe.

import re
data = data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))

Could provide data/column headers if needed.

Hello,

After I added the line data = data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x)) to my code, it raised a new error: "Feature (PMI) appears more than one time". Could you please help? Thanks.

@NicolasMICAUX

Hello,
After I added the line data = data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x)) to my code, it raised a new error: "Feature (PMI) appears more than one time". Could you please help? Thanks.

Hi @ahbon123 , if you still have the problem, here is something you can try:

import re

# Strip the characters LightGBM rejects ("Do not support special JSON characters in feature name").
new_names = {col: re.sub(r'[^A-Za-z0-9_]+', '', col) for col in df.columns}
new_n_list = list(new_names.values())
# Append the column index when an earlier column already has the same cleaned name
# (avoids "Feature appears more than one time").
new_names = {col: f'{new_col}_{i}' if new_col in new_n_list[:i] else new_col
             for i, (col, new_col) in enumerate(new_names.items())}
df = df.rename(columns=new_names)

The idea is to append _34, for example, if the column at index 34 has the same name as the one at index 17 after removing the special JSON characters.
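An index suffix can itself collide in a rare case: if a later column's cleaned name already equals a suffixed name (for instance, an original column called PMI_1), two columns may still end up with the same name. A slightly more defensive variant is sketched below; the helper name sanitize_names and the sample column names are hypothetical.

```python
import re


def sanitize_names(columns):
    """Map each original name to a cleaned name that is unique across all columns."""
    seen = set()
    mapping = {}
    for i, col in enumerate(columns):
        base = re.sub(r'[^A-Za-z0-9_]+', '', str(col))
        name = base
        # On a clash (or an empty result), append the column index.
        if name in seen or name == '':
            name = f'{base}_{i}'
        # If the suffixed name is also taken, pad with underscores until it is free.
        while name in seen:
            name += '_'
        seen.add(name)
        mapping[col] = name
    return mapping


mapping = sanitize_names(['PMI', 'PMI:', 'P,M,I', 'PMI_1'])
print(mapping)
```

The mapping can be passed to df.rename(columns=mapping), and inverting it gives a way back to the original names.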
