Customer Churn Prediction with XGBoost #161

johnl8888 · 2018-01-10T21:54:21Z

When I tried a different CSV data set with XGBoost, I got the following errors:

Arguments: train
[2018-01-10:21:51:56:INFO] Running standalone xgboost training.
[2018-01-10:21:51:56:INFO] File size need to be processed in the node: 38.24mb. Available memory size in the node: 8611.8mb
/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py:279: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
df = pd.read_csv(os.path.join(files_path, csv_file), sep=None, header=None)
/opt/amazon/lib/python2.7/site-packages/sage_xgboost/exceptions.py:19: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
message = getattr(exception, 'message', str(exception))
/opt/amazon/lib/python2.7/site-packages/sage_xgboost/exceptions.py:19: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
message = getattr(exception, 'message', str(exception))
[2018-01-10:21:52:06:ERROR] Algorithm Error: Could not determine delimiter (caused by Error)

Caused by: Could not determine delimiter
Traceback (most recent call last):
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train.py", line 34, in main
standalone_train(resource_config, train_config, data_config)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_methods.py", line 16, in standalone_train
train_job(resource_config, train_config, data_config)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 389, in train_job
dtrain = get_dmatrix(train_path, file_type, exceed_memory)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 317, in get_dmatrix
dmatrix = get_csv_dmatrix(files_path)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 279, in get_csv_dmatrix
df = pd.read_csv(os.path.join(files_path, csv_file), sep=None, header=None)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 315, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 645, in init
self._make_engine(self.engine)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 805, in _make_engine
self._engine = klass(self.f, **self.options)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 1601, in init
self._make_reader(f)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 1705, in _make_reader
sniffed = csv.Sniffer().sniff(line)
File "/opt/amazon/python2.7/lib/python2.7/csv.py", line 188, in sniff
raise Error, "Could not determine delimiter"
Error: Could not determine delimiter

ValueErrorTraceback (most recent call last)
in ()
16 num_round=100)
17
---> 18 xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name)
152 self.latest_training_job = _TrainingJob.start_new(self, inputs)
153 if wait:
--> 154 self.latest_training_job.wait(logs=logs)
155 else:
156 raise NotImplemented('Asynchronous fit not available')

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in wait(self, logs)
321 def wait(self, logs=True):
322 if logs:
--> 323 self.sagemaker_session.logs_for_job(self.job_name, wait=True)
324 else:
325 self.sagemaker_session.wait_for_job(self.job_name)

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in logs_for_job(self, job_name, wait, poll)
656
657 if wait:
--> 658 self._check_job_status(job_name, description)
659 if dot:
660 print()

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in _check_job_status(self, job, desc)
399 if status != 'Completed':
400 reason = desc.get('FailureReason', '(No reason provided)')
--> 401 raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))
402
403 def wait_for_endpoint(self, endpoint, poll=5):

ValueError: Error training xgboost-2018-01-10-21-46-25-058: Failed Reason: InternalServerError: We encountered an internal error. Please try again.

johnl8888 · 2018-01-11T00:08:05Z

Also tried the libsvm format using dump_svmlight_file:

Arguments: train
[2018-01-10:23:52:25:INFO] Running standalone xgboost training.
[2018-01-10:23:52:25:INFO] File size need to be processed in the node: 0.57mb. Available memory size in the node: 8618.42mb
[2018-01-10:23:52:25:ERROR] Customer Error: Blankspace and colon not found in the file. ContentType by defaullt is in libsvm. Please ensure the file is in libsvm format.
Traceback (most recent call last):
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train.py", line 34, in main
standalone_train(resource_config, train_config, data_config)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_methods.py", line 16, in standalone_train
train_job(resource_config, train_config, data_config)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 386, in train_job
validate_file_format(train_path, file_type)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 254, in validate_file_format
validate_libsvm_format(os.path.join(files_path, data_file))
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 273, in validate_libsvm_format
Please ensure the file is in libsvm format.")
CustomerError: Blankspace and colon not found in the file. ContentType by defaullt is in libsvm. Please ensure the file is in libsvm format.

ValueErrorTraceback (most recent call last)
in ()
16 num_round=100)
17
---> 18 xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

/home/ec2-user/anaconda3/envs/python2/lib/python2.7/site-packages/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name)
152 self.latest_training_job = _TrainingJob.start_new(self, inputs)
153 if wait:
--> 154 self.latest_training_job.wait(logs=logs)
155 else:
156 raise NotImplemented('Asynchronous fit not available')

/home/ec2-user/anaconda3/envs/python2/lib/python2.7/site-packages/sagemaker/estimator.pyc in wait(self, logs)
321 def wait(self, logs=True):
322 if logs:
--> 323 self.sagemaker_session.logs_for_job(self.job_name, wait=True)
324 else:
325 self.sagemaker_session.wait_for_job(self.job_name)

/home/ec2-user/anaconda3/envs/python2/lib/python2.7/site-packages/sagemaker/session.pyc in logs_for_job(self, job_name, wait, poll)
656
657 if wait:
--> 658 self._check_job_status(job_name, description)
659 if dot:
660 print()

/home/ec2-user/anaconda3/envs/python2/lib/python2.7/site-packages/sagemaker/session.pyc in _check_job_status(self, job, desc)
399 if status != 'Completed':
400 reason = desc.get('FailureReason', '(No reason provided)')
--> 401 raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))
402
403 def wait_for_endpoint(self, endpoint, poll=5):

ValueError: Error training xgboost-2018-01-10-23-46-28-723: Failed Reason: ClientError: Blankspace and colon not found in the file. ContentType by defaullt is in libsvm. Please ensure the file is in libsvm format.

djarpin · 2018-01-11T00:38:48Z

Thanks @johnl8888 , and sorry you're running into troubles. Would you be able to provide the top few records of both files? This will help us troubleshoot the issue.

johnl8888 · 2018-01-11T14:36:10Z

Hi David, attached is the CSV data sample. The last column is the outcome, and the first column should be dropped. I can schedule a quick GoToMeeting session with you to walk through the notebook if needed. Let me know. Thanks, John

djarpin · 2018-01-11T16:12:05Z

Thanks, @johnl8888 . Unfortunately, it doesn't look like the email attachment came through in my email or in the GitHub comments. However, CSVs passed to XGBoost need to be in a specific format:

1. No header row
2. Outcome variable in the first column, features in the rest of the columns (there's no ability to drop them during the training process)
3. All columns need to be numeric

In the example notebook we actually read in a CSV that doesn't conform to these standards and then transform it and re-output a version that does with .to_csv(). We then send the transformed version to S3 for the training job.
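The transformation described above can be sketched with the standard library alone. This is only an illustration under assumptions: the function name `to_xgboost_csv`, the column names, and the sample rows are hypothetical, and the example notebook does this step with pandas and `.to_csv()` rather than with the `csv` module.

```python
import csv
import io


def to_xgboost_csv(raw_text, label, drop=()):
    """Rewrite a headered CSV as header-less, label-first, all-numeric CSV text."""
    reader = csv.DictReader(io.StringIO(raw_text))
    # The label goes first; remaining columns keep their original order.
    features = [c for c in reader.fieldnames if c != label and c not in drop]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        values = [row[label]] + [row[c] for c in features]
        # float() raises ValueError on a non-numeric cell, so bad data fails early.
        for value in values:
            float(value)
        writer.writerow(values)  # no header row is ever written
    return out.getvalue()


# Hypothetical sample: an id column to drop and the outcome in the last column.
sample = "id,age,spend,churn\n17,34,120.5,1\n18,51,80.0,0\n"
converted = to_xgboost_csv(sample, label="churn", drop=("id",))
print(converted)
```

With pandas, the equivalent last step would be writing the reordered frame with `header=False` and `index=False` so that neither column names nor the row index end up in the file.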

If you still have trouble with running a training job after this, feel free to just dump the first 10 lines of the CSV you're passing to the algorithm into the comments section here, and that should give us the next direction to go for troubleshooting.

Thanks!

johnl8888 · 2018-01-11T18:33:44Z

Hi David, after loading the CSV file, I did the same transformation and cleanup: the outcome variable is moved to the first column, all columns are numeric, and .to_csv() is used to remove the header. I'll retry it today on a new AWS instance. If I'm still having trouble, would you mind if I scheduled a 30-minute GoToMeeting session with you to check the notebook and data? Thanks, John

johnl8888 · 2018-01-11T23:59:45Z

The CSV data saved in LibSVM format works now, but the same data in .csv format failed with a different error (similar to the error I got using the LibSVM format yesterday): CustomerError: Blankspace and colon found in the file. Please ensure the file is in csv format. I will try again. Thanks, John
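For readers wondering what the LibSVM version of the same data looks like, here is a minimal hand-rolled sketch. It assumes a header-less, label-first, all-numeric CSV as input; the function names and the 1-based feature indices are assumptions made for illustration, and this is not the code used earlier in the thread (which relied on scikit-learn's dump_svmlight_file).

```python
def to_libsvm_line(label, features):
    """Format one record as '<label> <index>:<value> ...' using 1-based indices."""
    pairs = ["{}:{}".format(i, v) for i, v in enumerate(features, start=1)]
    return " ".join([str(label)] + pairs)


def csv_to_libsvm(csv_text):
    """Convert header-less, label-first numeric CSV text to dense LibSVM text."""
    lines = []
    for row in csv_text.strip().splitlines():
        values = [float(x) for x in row.split(",")]
        # First value is the label, the rest are features.
        lines.append(to_libsvm_line(values[0], values[1:]))
    return "\n".join(lines) + "\n"


libsvm_text = csv_to_libsvm("1,34,120.5\n0,0,80.0\n")
print(libsvm_text)
```

Zero-valued features are written out rather than skipped, so every line contains both a space and a colon, which is what the format check in the error messages above appears to look for.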

djarpin · 2018-01-12T18:41:28Z

Thanks @johnl8888 . It sounds like at least you're up and running with the LibSVM format. Just to confirm, the CSV file should not contain any blank spaces or colons (which is what your error suggests it found). It should only contain numeric values with a single delimiter (typically a comma).

One other thing that might be happening: can you make sure the CSV is stored in its own S3 prefix? If you have both the LibSVM file and the CSV file sitting in s3://my-bucket/xgboost-test/train/, and you pass that as your training location to SageMaker, then both files will be loaded onto the training instance, and the algorithm may get confused and load the LibSVM file while treating it as a CSV.
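A quick way to spot the mixed-prefix situation is to group the object keys under the training prefix by extension. The sketch below is only illustrative: `formats_under_prefix` and the key names are hypothetical, and the list of keys is assumed to come from whatever S3 listing you already have (for example the console or a boto3 `list_objects_v2` call).

```python
import os
from collections import defaultdict


def formats_under_prefix(keys, prefix):
    """Group the object keys under an S3 prefix by file extension."""
    groups = defaultdict(list)
    for key in keys:
        # Skip keys outside the prefix and folder placeholder keys.
        if key.startswith(prefix) and not key.endswith("/"):
            ext = os.path.splitext(key)[1].lower() or "(none)"
            groups[ext].append(key)
    return dict(groups)


# Hypothetical listing in which a CSV and a LibSVM file share one prefix.
keys = [
    "xgboost-test/train/train.csv",
    "xgboost-test/train/train.libsvm",
    "xgboost-test/validation/validation.csv",
]
found = formats_under_prefix(keys, "xgboost-test/train/")
mixed = len(found) > 1  # more than one extension means the prefix is mixed
print(found, mixed)
```

If `mixed` is true, moving each format to its own prefix (say, one for CSV and one for LibSVM) keeps the training channel from picking up both files.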

djarpin · 2018-01-22T17:09:47Z

@johnl8888 , I'll close this issue for now as I'm hoping you were able to get up and running based on our last exchange. Feel free to re-open if needed. Thanks again for your interest in SageMaker Example Notebooks!

JavierLopezT · 2019-03-13T17:35:28Z

Hello,

I am having this problem and I don't know how to solve it. As far as I can tell, the CSV already matches the required format and the data is in its own bucket. The error I get is the following:

Error for Training job xgboost-2019-03-13-16-21-25-000: Failed Reason: ClientError: Blankspace and colon not found in firstline '0.0,0.0,99.0,314.07,1.0,0.0,0.0,0.0,0.48027846,0.0...' of file 'train.csv'

We can see that the label is in the first column and the others are the features, with no headers and everything numeric, so I am wondering what I am doing wrong.
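Judging only from the wording of the error messages in this thread, the format check seems to look for a blank space and a colon in the first line. The sketch below mimics that idea locally; it is a guess at the heuristic, not the algorithm container's actual code, and `guess_format` is a hypothetical name.

```python
def guess_format(first_line):
    """Return 'libsvm' if the line has both a space and a colon, else 'csv'."""
    line = first_line.strip()
    if " " in line and ":" in line:
        return "libsvm"
    return "csv"


# First line taken from the error message above, truncated where it was truncated.
csv_line = "0.0,0.0,99.0,314.07,1.0,0.0,0.0,0.0,0.48027846,0.0"
# Hypothetical LibSVM-style line for comparison.
libsvm_line = "0 1:0.0 2:99.0 3:314.07"

print(guess_format(csv_line), guess_format(libsvm_line))
```

A line that looks like valid CSV but fails with "Blankspace and colon not found" suggests the job was validating it as LibSVM, which the earlier log says is the default content type. If that is the case here, declaring the channel's content type as CSV when defining the training input (for example via the `content_type` argument of `sagemaker.s3_input`) may be what is missing; that is an inference from the logs, not something confirmed in this thread.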

smart-patrol · 2019-04-29T17:54:12Z

I am getting a similar error and followed the exact same steps.

1. No header row
2. Outcome variable in the first column, features in the rest of the columns (there's no ability to drop them during the training process)
3. All columns need to be numeric

Here is a snapshot of the data:
0.0,-1.0,-1.0,0.43,-0.6397578,-0.0030769934,-0.3481717,-0.6736527,0.52619594,-0.57142854,-0.12195122,-0.13138686,0.15079366,0.2798353,-0.07718044,0.3561645,-0.5319149,0.17164181,0.29268286,-1.0284938,-0.39880952,0.30730823,-0.09433937,0.1566265,-0.17105263,-0.4765625,-0.25,0.36363637,0.30769232,-0.25,-0.6870229,0.37499985

Should I switch to using LibSVM?

smart-patrol · 2019-05-01T16:48:54Z

I also switched to the LibSVM format and it worked.

djarpin closed this as completed Jan 22, 2018

atqy pushed a commit to atqy/amazon-sagemaker-examples that referenced this issue Aug 16, 2022

Only match regex at end of string (aws#161)

79785f0

* match only end of string * use os.path.basename
