LightGBM error special JSON characters in feature name #399
Comments
Good catch, thanks for the info! Just to clarify: LightGBM fails in this scenario, but the rest of the models train successfully and you are still able to get a result from AutoGluon? If so, I think I can look into fixing this. The question is whether I can get it to still use the features rather than dropping them entirely (I think I can, by clever renaming). |
Yep, everything else works fine and I get a result at the end |
To help me test that I am covering all of the failure cases, could you provide a sample of the news article training data you are experiencing the problem with, or the location from which you obtained it? If you wish to keep it private that is fine, but it could help accelerate the fix on my end. |
Sure - first 3 rows of csv are enough to trigger it - see attached zip. Is there a way to access errors in the underlying models using the %debug in ipython? |
Hi, I have encountered the same problem. I fixed it before training begins with a regular expression and a lambda function (pandas). The object named "data" must be a DataFrame.
import re
data = data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
I could provide data/column headers if needed. |
Thanks for the info @oguzhangur96! Right now the main difficulty here is to ensure that column names are always converted when entering the model (whether for feature importances, fit, inference) and also inversely converted back to the original upon leaving. This seems to be something that LightGBM itself should handle, and I don't have a good understanding of why they haven't done so. Furthermore, it isn't sufficient to only remove the special characters, because we have to ensure no two columns have the same name. One simple way is to just rename columns 0-n as '0', '1', ...'n-1', 'n', but then this operation would also have to be done on all of the 99% of problems where this is not required, particularly for online-inference of single rows. I have to first benchmark inference times in these situations to ensure our online-inference speed isn't significantly slowed due to this fix. |
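For anyone following along, the positional renaming described above could be sketched roughly like this. It is only an illustration, not the code that ended up in AutoGluon: the helper name make_safe_names and the sample column names are hypothetical, and it assumes the original column names are already unique.

```python
import pandas as pd


def make_safe_names(columns):
    """Build forward and inverse maps between original and positional names.

    Hypothetical helper for illustration only; assumes the original
    column names are unique.
    """
    # Positional names ('0', '1', ...) are unique by construction,
    # so two different originals can never collide after renaming.
    forward = {col: str(i) for i, col in enumerate(columns)}
    inverse = {safe: col for col, safe in forward.items()}
    return forward, inverse


# Made-up column names containing special JSON and non-ASCII characters.
df = pd.DataFrame({'price[usd]': [1.0], 'price{usd}': [2.0], 'заголовок': [3.0]})

forward, inverse = make_safe_names(df.columns)
safe_df = df.rename(columns=forward)        # names used when entering the model
restored = safe_df.rename(columns=inverse)  # original names restored on the way out
```

Note that stripping characters with a regex would map both 'price[usd]' and 'price{usd}' to 'priceusd', which is the collision problem mentioned above; positional names avoid it at the cost of doing the rename on every call.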
#451 is merged and should fix this issue (Just in time for our upcoming 0.0.7 release!). I have tested this on various NLP problems which previously caused crashes, and it successfully trains. Thanks again for highlighting this issue! Best, |
Great - looking forward to the next release! |
How can I fix this error: "LightGBMError: Do not support special JSON characters in feature name."? |
@ind-kum Please check your AutoGluon version and upgrade to 0.0.10 if you haven't. This should fix the issue. |
I have a train part of a dataframe with Russian names:
And when I apply LightGBM to it, it raises the same issue:
So how can I use LightGBM on this kind of features? Do I need to transform them to English column names? |
@antoinecomp This error doesn't appear to be from AutoGluon, are you using AutoGluon? |
Hello, After I add this line |
Hi @ahbon123, if you still have the problem, here is something you can try:
import re
# Change column names ([LightGBM] Do not support special JSON characters in feature name.)
new_names = {col: re.sub(r'[^A-Za-z0-9_]+', '', col) for col in df.columns}
new_n_list = list(new_names.values())
# [LightGBM] Feature appears more than one time.
new_names = {col: f'{new_col}_{i}' if new_col in new_n_list[:i] else new_col for i, (col, new_col) in enumerate(new_names.items())}
df = df.rename(columns=new_names)
The idea is to add |
Hi - great library everyone!
I'm doing tabular prediction with news article data. The LightGBM model doesn't run with the following error:
The upstream issue in LightGBM is here: microsoft/LightGBM#2455
Basically, they check for special JSON characters, e.g.
[],{}":
in feature names and throw an error if any are found. Could AutoGluon check for these characters and remove any offending features?
Liam
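To make the request concrete, here is a rough sketch of the kind of check I have in mind. The character set is just the one quoted above (the exact set LightGBM tests for is an assumption on my part), and find_offending_columns and the sample column names are hypothetical.

```python
import pandas as pd

# Characters quoted in this issue; the exact set LightGBM checks
# is an assumption here.
SPECIAL_JSON_CHARS = set('[]{}",:')


def find_offending_columns(df):
    """Return column names containing any of the special JSON characters."""
    return [
        col for col in df.columns
        if any(ch in SPECIAL_JSON_CHARS for ch in str(col))
    ]


# Made-up example frame: two of the three column names are problematic.
df = pd.DataFrame({'title': [1], 'tags["a"]': [2], 'body: text': [3]})
offending = find_offending_columns(df)
```

The offending columns could then be dropped or renamed before the data is handed to LightGBM.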