LightGBM error special JSON characters in feature name #399

braaannigan · 2020-04-01T13:19:12Z

Hi - great library everyone!

I'm doing tabular prediction with news article data. The LightGBM model fails with the following error:

LightGBMError: Do not support special JSON characters in feature name.
Do not support special JSON characters in feature name.

The upstream issue in LightGBM is here: microsoft/LightGBM#2455

Basically, they check for special JSON characters (e.g. [],{}":) in feature names and throw an error if any are found. Could AutoGluon check for these characters and remove any offending features?
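
A minimal sketch of the kind of pre-check being asked for here. It assumes the rejected characters are exactly the ones listed above (taken from this report, not read from LightGBM's source), and `find_offending_columns` is a hypothetical helper name, not an AutoGluon or LightGBM function.

```python
import pandas as pd

# Assumption: this character set is the one quoted in the issue text above.
JSON_SPECIAL_CHARS = set('[]{}",:')


def find_offending_columns(df: pd.DataFrame) -> list:
    """Return the column names that contain at least one special JSON character."""
    return [col for col in df.columns if JSON_SPECIAL_CHARS & set(str(col))]


# Toy frame with one clean column and two problematic ones.
df = pd.DataFrame({"title": [1], 'tags["news"]': [2], "ratio:a/b": [3]})
offending = find_offending_columns(df)
print(offending)
```

Listing the offending names first makes it possible to decide whether to drop or rename those columns before fitting.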

Liam


Innixma · 2020-04-01T18:14:07Z

Good catch, thanks for the info! Just to clarify, LightGBM fails in this scenario, but the rest of the models train successfully and you are able to get a result from AutoGluon?

If so, I think I can look into fixing this, the question is if I can get it to still use the features rather than dropping them entirely (I think I can by clever renaming).

braaannigan · 2020-04-01T18:46:12Z

Yep, everything else works fine and I get a result at the end

Innixma · 2020-04-02T19:19:54Z

To help me test that I am covering all of the failure cases, could you provide a sample of the news article training data that triggers the problem, or the location where you obtained the data? If you wish to keep it private that is fine, but it could help accelerate the fix on my end.

braaannigan · 2020-04-03T06:16:15Z

Sure - first 3 rows of csv are enough to trigger it - see attached zip.
err.tar.gz

Is there a way to access errors in the underlying models using the %debug in ipython?

oguzhangur96 · 2020-05-04T08:21:59Z

Hi,

I have encountered the same problem. I fixed it before training begins with a regular expression and a lambda function (pandas). The object named "data" must be a DataFrame.

import re
data = data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))

Could provide data/column headers if needed.

Innixma · 2020-05-04T11:37:32Z

Thanks for the info @oguzhangur96! Right now the main difficulty here is to ensure that column names are always converted when entering the model (whether for feature importances, fit, inference) and also inversely converted back to the original upon leaving. This seems to be something that LightGBM itself should handle, and I don't have a good understanding of why they haven't done so.

Furthermore, it isn't sufficient to only remove the special characters, because we have to ensure no two columns have the same name. One simple way is to just rename columns 0-n as '0', '1', ...'n-1', 'n', but then this operation would also have to be done on all of the 99% of problems where this is not required, particularly for online-inference of single rows. I have to first benchmark inference times in these situations to ensure our online-inference speed isn't significantly slowed due to this fix.
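
The positional renaming described above, together with the inverse conversion on the way out, could be sketched roughly as follows. This is an illustration only, not the code merged into AutoGluon; `to_positional_names` and `restore_names` are hypothetical helper names, and the importance values are made up for the example.

```python
import pandas as pd


def to_positional_names(df: pd.DataFrame):
    """Rename columns to '0', '1', ... and return the new frame plus the inverse mapping."""
    safe_to_original = {str(i): col for i, col in enumerate(df.columns)}
    renamed = df.copy()
    renamed.columns = list(safe_to_original.keys())
    return renamed, safe_to_original


def restore_names(values_by_safe_name: dict, safe_to_original: dict) -> dict:
    """Map results keyed by positional name back to the original column names."""
    return {safe_to_original[name]: value for name, value in values_by_safe_name.items()}


X = pd.DataFrame({'word "a"': [1, 2], "word [b]": [3, 4]})
X_safe, mapping = to_positional_names(X)

# Placeholder numbers standing in for a model's feature importances.
fake_importances = {"0": 0.7, "1": 0.3}
restored = restore_names(fake_importances, mapping)
print(list(X_safe.columns), restored)
```

Positional names are unique by construction, which sidesteps the duplicate-name problem, at the cost of performing the rename on every call into the model.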

Innixma · 2020-05-06T20:24:15Z

@braaannigan , @oguzhangur96

#451 is merged and should fix this issue (Just in time for our upcoming 0.0.7 release!). I have tested this on various NLP problems which previously caused crashes, and it successfully trains.

Thanks again for highlighting this issue!

Best,
Nick

braaannigan · 2020-05-07T14:33:35Z

Great - looking forward to the next release!

ind-kum · 2020-06-04T10:52:01Z

How can I fix this error: "LightGBMError: Do not support special JSON characters in feature name."?

Innixma · 2020-06-04T18:20:35Z

@ind-kum Please check your AutoGluon version and upgrade to 0.0.10 if you haven't. This should fix the issue.

antoinecomp · 2021-03-11T11:41:37Z

Do not support special JSON characters in feature name: how to use LGBM in columns with Russian names?

I have a train part of a dataframe with Russian names:

shop__56 | sub_type_Сумки, Альбомы, Коврики д/мыши | shop__46 | sub_type_Для дома и офиса (Цифра) | shop__49 | shop__58 | shop__37 | sub_type_Служебные | sub_type_CD локального производства | shop__22 | ... | shop__48 | sub_type_Артбуки, энциклопедии | sub_type_Подарочные издания | sub_type_PSN | shop_id | sub_type_CD фирменного производства | sub_type_DVD | sub_type_PSVita | sub_type_Комиксы, манга | sub_type_Дополнительные издания
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 41 | 0 | 0 | 0 | 0 | 0

And when I apply LightGBM to it, it raises the same issue:

---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
<ipython-input-125-711fbc08b2b9> in <module>
     11                            min_split_gain=0.0222415,
     12                            min_child_weight=40)
---> 13 model_lgb.fit(X_train, y_train)

/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    777                                        verbose=verbose, feature_name=feature_name,
    778                                        categorical_feature=categorical_feature,
--> 779                                        callbacks=callbacks, init_model=init_model)
    780         return self
    781 

/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    615                               evals_result=evals_result, fobj=self._fobj, feval=eval_metrics_callable,
    616                               verbose_eval=verbose, feature_name=feature_name,
--> 617                               callbacks=callbacks, init_model=init_model)
    618 
    619         if evals_result:

/opt/conda/lib/python3.7/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    229     # construct booster
    230     try:
--> 231         booster = Booster(params=params, train_set=train_set)
    232         if is_valid_contain_train:
    233             booster.set_train_data_name(train_data_name)

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in __init__(self, params, train_set, model_file, model_str, silent)
   2051                     break
   2052             # construct booster object
-> 2053             train_set.construct()
   2054             # copy the parameters from train_set
   2055             params.update(train_set.get_params())

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in construct(self)
   1323                                 init_score=self.init_score, predictor=self._predictor,
   1324                                 silent=self.silent, feature_name=self.feature_name,
-> 1325                                 categorical_feature=self.categorical_feature, params=self.params)
   1326             if self.free_raw_data:
   1327                 self.data = None

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
   1149             raise TypeError('Wrong predictor type {}'.format(type(predictor).__name__))
   1150         # set feature names
-> 1151         return self.set_feature_name(feature_name)
   1152 
   1153     def __init_from_np2d(self, mat, params_str, ref_dataset):

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in set_feature_name(self, feature_name)
   1630                 self.handle,
   1631                 c_array(ctypes.c_char_p, c_feature_name),
-> 1632                 ctypes.c_int(len(feature_name))))
   1633         return self
   1634 

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in _safe_call(ret)
     53     """
     54     if ret != 0:
---> 55         raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
     56 
     57 

LightGBMError: Do not support special JSON characters in feature name.

So how can I use LightGBM on this kind of features? Do I need to transform them to English column names?

Innixma · 2021-03-11T22:16:05Z

@antoinecomp This error doesn't appear to be from AutoGluon, are you using AutoGluon?

ahbon123 · 2022-07-07T10:38:06Z

> Hi,
>
> I have encountered the same problem. I fixed it before training begins with a regular expression and a lambda function (pandas). The object named "data" must be a DataFrame.
> import re
> data = data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
> Could provide data/column headers if needed.

Hello,

After I added the line data = data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x)) to my code, it raised a new error: Feature (PMI) appears more than one time. Could you please help? Thanks.

NicolasMICAUX · 2022-11-07T19:10:01Z

> Hello,
> After I added the line data = data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x)) to my code, it raised a new error: Feature (PMI) appears more than one time. Could you please help? Thanks.

Hi @ahbon123 , if you still have the problem, here is something you can try:

import re

# Strip the characters LightGBM rejects ("Do not support special JSON characters in feature name.")
new_names = {col: re.sub(r'[^A-Za-z0-9_]+', '', col) for col in df.columns}
new_n_list = list(new_names.values())
# Append the column position to any name already seen ("Feature appears more than one time.")
new_names = {
    col: f'{new_col}_{i}' if new_col in new_n_list[:i] else new_col
    for i, (col, new_col) in enumerate(new_names.items())
}
df = df.rename(columns=new_names)

The idea is to append the column's position, for example _34, if the column at index 34 ends up with the same name as an earlier one (say, the one at index 17) after the special characters are removed.
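
A variant of the same idea, sketched under the assumption that only the characters named in the original report ([],{}":) need replacing. This keeps non-Latin column names, such as the Russian ones earlier in the thread, mostly intact instead of stripping them to nothing. `make_safe_unique` is a hypothetical helper name, and the counter-based suffix is an illustrative choice rather than anything LightGBM or AutoGluon prescribes.

```python
import re

import pandas as pd

# Assumption: only the characters listed in the original report are replaced.
_SPECIAL = re.compile(r'[\[\]{}",:]+')


def make_safe_unique(columns) -> list:
    """Replace special JSON characters with '_' and suffix repeats so every name is unique."""
    seen = set()
    result = []
    for col in columns:
        base = _SPECIAL.sub("_", str(col))
        name, n = base, 1
        # Keep incrementing the suffix until the candidate name has not been used.
        while name in seen:
            name = f"{base}_{n}"
            n += 1
        seen.add(name)
        result.append(name)
    return result


df = pd.DataFrame([[0, 1, 2]], columns=["sub_type_Комиксы, манга", "P:MI", "P,MI"])
df.columns = make_safe_unique(df.columns)
print(list(df.columns))
```

Checking each candidate against the set of names already issued also covers the corner case where a suffixed name would itself collide with an existing column.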

Innixma added the bug (Something isn't working) and module: tabular labels Apr 1, 2020

Innixma self-assigned this Apr 1, 2020

Innixma mentioned this issue May 6, 2020

Tabular: Fixed crash in LightGBM with special characters in column names #451

Merged

Innixma closed this as completed May 6, 2020
