
How to read an EBCDIC file with multiple columns #694

Open
SHUBHAMDw opened this issue Jul 17, 2024 · 26 comments
Labels
question Further information is requested

Comments

@SHUBHAMDw

Background [Optional]

I need to read a .dat file which can have multiple variable-length columns.

01 CUSTOMER-RECORD.
   05 SEGMENT-INDICATORS.
      10 CUSTOMER-DETAILS-PRESENT PIC X(1).
      10 ACCOUNT-INFORMATION-PRESENT PIC X(1).
      10 TRANSACTION-HISTORY-PRESENT PIC X(1).
   05 CUSTOMER-DETAILS.
      10 CUSTOMER-ID PIC X(10).
      10 CUSTOMER-NAME PIC X(30).
   05 ACCOUNT-INFORMATION.
      10 ACCOUNT-NUMBER PIC X(10).
   05 TRANSACTION-HISTORY.
      10 TRANSACTION-ID PIC X(10).

Question

Based on SEGMENT-INDICATORS we need to read the file, i.e. if CUSTOMER-DETAILS-PRESENT is 1 then the record will have CUSTOMER-DETAILS, if ACCOUNT-INFORMATION-PRESENT is 1 then it will have ACCOUNT-INFORMATION, and so on.
I am not able to read such a file in PySpark using Cobrix.

@SHUBHAMDw SHUBHAMDw added the question Further information is requested label Jul 17, 2024
@SHUBHAMDw
Author

@yruslan, can you please guide me on this?

@yruslan
Collaborator

yruslan commented Jul 19, 2024

Hi, this is not directly supported, unfortunately.

The only workaround I see right now is to have a REDEFINES for each use case, which is messy...

01 CUSTOMER-RECORD.
  05 SEGMENT-INDICATORS.
    10 CUSTOMER-DETAILS-PRESENT PIC X(1).
    10 ACCOUNT-INFORMATION-PRESENT PIC X(1).
    10 TRANSACTION-HISTORY-PRESENT PIC X(1).
  05 DETAILS100.
    10 CUSTOMER-DETAILS.
      15 CUSTOMER-ID PIC X(10).
  05 DETAILS010 REDEFINES DETAILS100.
    10 ACCOUNT-INFORMATION .
      15 ACCOUNT-NUMBER PIC X(10).
  05 DETAILS001 REDEFINES DETAILS100.
    10 TRANSACTION-HISTORY.
      15 TRANSACTION-ID PIC X(10).
  05 DETAILS110 REDEFINES DETAILS100.
    10 CUSTOMER-DETAILS.
      15 CUSTOMER-ID PIC X(10).
    10 ACCOUNT-INFORMATION .
      15 ACCOUNT-NUMBER PIC X(10).
  05 DETAILS011 REDEFINES DETAILS100.
    10 ACCOUNT-INFORMATION .
      15 ACCOUNT-NUMBER PIC X(10).
    10 TRANSACTION-HISTORY.
      15 TRANSACTION-ID PIC X(10).
  05 DETAILS101 REDEFINES DETAILS100.
    10 CUSTOMER-DETAILS.
      15 CUSTOMER-ID PIC X(10).
    10 TRANSACTION-HISTORY.
      15 TRANSACTION-ID PIC X(10).
  05 DETAILS111 REDEFINES DETAILS100.
    10 CUSTOMER-DETAILS.
      15 CUSTOMER-ID PIC X(10).
      15 CUSTOMER-NAME PIC X(30).
    10 ACCOUNT-INFORMATION .
      15 ACCOUNT-NUMBER PIC X(10).
    10 TRANSACTION-HISTORY.
      15 TRANSACTION-ID PIC X(10).

and then you can resolve fields based on indicators.
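For illustration, resolving fields from the redefined groups could look roughly like this in PySpark once the DataFrame (df) is loaded. This is only a sketch: the column names assume Cobrix's default hyphen-to-underscore renaming and nested struct output, so check df.printSchema() for the actual names.

from pyspark.sql import functions as F

# Build the 3-character indicator key, e.g. "110".
key = F.concat(
    F.col("SEGMENT_INDICATORS.CUSTOMER_DETAILS_PRESENT"),
    F.col("SEGMENT_INDICATORS.ACCOUNT_INFORMATION_PRESENT"),
    F.col("SEGMENT_INDICATORS.TRANSACTION_HISTORY_PRESENT"),
)

# Pick CUSTOMER-ID from whichever DETAILSxxx group applies to the record.
customer_id = (
    F.when(key == "100", F.col("DETAILS100.CUSTOMER_DETAILS.CUSTOMER_ID"))
     .when(key == "110", F.col("DETAILS110.CUSTOMER_DETAILS.CUSTOMER_ID"))
     .when(key == "101", F.col("DETAILS101.CUSTOMER_DETAILS.CUSTOMER_ID"))
     .when(key == "111", F.col("DETAILS111.CUSTOMER_DETAILS.CUSTOMER_ID"))
)

resolved = df.select(key.alias("segment_key"), customer_id.alias("CUSTOMER_ID"))

The same when-chain pattern applies to the account and transaction fields.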

@SHUBHAMDw
Author

V2_DATA

The above is the data I need to read.
"

  • Customer Data COBOL Copybook with Hexadecimal Fields and Segment Presence Indicators

     01  CUSTOMER-RECORD.
         05  SEGMENT-INDICATORS.
             10  CUSTOMER-DETAILS-PRESENT      PIC X(1) COMP-X.
             10  ACCOUNT-INFORMATION-PRESENT   PIC X(1) COMP-X.
             10  TRANSACTION-HISTORY-PRESENT   PIC X(1) COMP-X.
         05  CUSTOMER-DETAILS.
             10  CUSTOMER-ID                  PIC X(10) COMP-X.
             10  CUSTOMER-NAME                PIC X(30) COMP-X.
             10  CUSTOMER-ADDRESS             PIC X(50) COMP-X.
             10  CUSTOMER-PHONE-NUMBER        PIC X(15) COMP-X.
         05  ACCOUNT-INFORMATION.
             10  ACCOUNT-NUMBER               PIC X(10) COMP-X.
             10  ACCOUNT-TYPE                 PIC X(2) COMP-X.
             10  ACCOUNT-BALANCE              PIC X(12) COMP-X.
         05  TRANSACTION-HISTORY.
             10  TRANSACTION-ID               PIC X(10) COMP-X.
             10  TRANSACTION-DATE             PIC X(8) COMP-X.
             10  TRANSACTION-AMOUNT           PIC X(12) COMP-X.
             10  TRANSACTION-TYPE             PIC X(2) COMP-X.
    

"

With above format. So this is not feasible at present?

@SHUBHAMDw
Author

Hi @yruslan. Cobrix is not able to find the next record for the DETAILS100 permutation. I guess it cannot locate the next record when one of the segments is missing and the record length changes. Can you suggest anything?
The source data is a continuous binary file, and if a particular segment is not present, Cobrix gets confused about where the next record starts.

@SHUBHAMDw
Author

SHUBHAMDw commented Jul 22, 2024

I tried this copybook:
* You may obtain a copy of the License at *
* *
* http://www.apache.org/licenses/LICENSE-2.0 *
* *
* Unless required by applicable law or agreed to in writing, software *
* distributed under the License is distributed on an "AS IS" BASIS, *
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. *
* See the License for the specific language governing permissions and *
* limitations under the License. *
* *
****************************************************************************
01 CUSTOMER-RECORD.
10 SEGMENT-INDICATORS.
15 CUSTOMER-DETAILS-PRESENT PIC 9(1).
15 ACCOUNT-INFORMATION-PRESENT PIC 9(1).
15 TRANSACTION-HISTORY-PRESENT PIC 9(1).
01 CUSTOMER-DETAILS-TAB.
10 CUST-TAB OCCURS 1 TO 2 TIMES DEPENDING ON CUSTOMER-DETAILS-PRESENT.
15 CUSTOMER-ID PIC X(10).
15 CUSTOMER-NAME PIC X(30).
15 CUSTOMER-ADDRESS PIC X(50).
15 CUSTOMER-PHONE-NUMBER PIC X(15).
01 ACCOUNT-INFORMATION-TAB.
10 ACCT-INFO-TAB OCCURS 1 TO 2 TIMES DEPENDING ON ACCOUNT-INFORMATION-PRESENT.
15 ACCOUNT-NUMBER PIC X(10).
15 ACCOUNT-TYPE PIC X(2).
15 ACCOUNT-BALANCE PIC X(12).
01 TRANSACTION-HISTORY-TAB.
10 TRANS-TAB OCCURS 1 TO 2 TIMES DEPENDING ON TRANSACTION-HISTORY-PRESENT.
15 TRANSACTION-ID PIC X(10).
15 TRANSACTION-DATE PIC X(8).
15 TRANSACTION-AMOUNT PIC X(12).
15 TRANSACTION-TYPE PIC X(2).

Getting error:
za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 13: Invalid input 'EGMENT-INDICATORS' at position 13:6

Code:
df = (
    spark.read.format("cobol")
    .option("copybook", "/mnt/idfprodappdata/Client_Data/xx_Cobix_RnD/SCHEMA/T_3.cob")
    .option("record_format", "F")
    .option("variable_size_occurs", True)
    .option("variable_size_occurs", "true")
    .load("/mnt/idfprodappdata/x/xx/DATA/customer_data_file_V2.dat")
)
df.display()

@yruslan
Collaborator

yruslan commented Jul 23, 2024

Yes, I see that there is an additional complication. The record size varies, and it depends on the index fields. Currently, Cobrix supports record length mapping only if the segment field is a single field. Since your index fields are adjacent, you can combine them as a workaround.

Note the changes I've made (the indicators are combined into a single SEGMENT-ID field, with SEGMENT-INDICATORS redefining it):

01 CUSTOMER-RECORD.
  05  SEGMENT-ID    PIC X(3).
  05  SEGMENT-INDICATORS REDEFINES SEGMENT-ID.
      10  CUSTOMER-DETAILS-PRESENT      PIC X(1).
      10  ACCOUNT-INFORMATION-PRESENT   PIC X(1).
      10  TRANSACTION-HISTORY-PRESENT   PIC X(1).
  05 DETAILS100.
    10 CUSTOMER-DETAILS.
      15 CUSTOMER-ID PIC X(10).
  05 DETAILS010 REDEFINES DETAILS100.
    10 ACCOUNT-INFORMATION .
      15 ACCOUNT-NUMBER PIC X(10).
  05 DETAILS001 REDEFINES DETAILS100.
    10 TRANSACTION-HISTORY.
      15 TRANSACTION-ID PIC X(10).
  05 DETAILS110 REDEFINES DETAILS100.
    10 CUSTOMER-DETAILS.
      15 CUSTOMER-ID PIC X(10).
    10 ACCOUNT-INFORMATION .
      15 ACCOUNT-NUMBER PIC X(10).
  05 DETAILS011 REDEFINES DETAILS100.
    10 ACCOUNT-INFORMATION .
      15 ACCOUNT-NUMBER PIC X(10).
    10 TRANSACTION-HISTORY.
      15 TRANSACTION-ID PIC X(10).
  05 DETAILS101 REDEFINES DETAILS100.
    10 CUSTOMER-DETAILS.
      15 CUSTOMER-ID PIC X(10).
    10 TRANSACTION-HISTORY.
      15 TRANSACTION-ID PIC X(10).
  05 DETAILS111 REDEFINES DETAILS100.
    10 CUSTOMER-DETAILS.
      15 CUSTOMER-ID PIC X(10).
      15 CUSTOMER-NAME PIC X(30).
    10 ACCOUNT-INFORMATION .
      15 ACCOUNT-NUMBER PIC X(10).
    10 TRANSACTION-HISTORY.
      15 TRANSACTION-ID PIC X(10).

Then you can use the segment-id-to-record-size mapping. But you need to get the size info for each combination of indicators.

.option("record_format", "F")
.option("record_length_field", "SEGMENT_ID")
.option("record_length_map", """{"001":50,"010":30,"100":20,"011":80,"110":50,"101":70,"111":100}""") 
.option("segment_field", "SEGMENT-ID")
.option("redefine-segment-id-map:0", "DETAILS001 => 001")
.option("redefine-segment-id-map:1", "DETAILS010 => 010")
.option("redefine-segment-id-map:2", "DETAILS100 => 100")
.option("redefine-segment-id-map:3", "DETAILS011 => 011")
.option("redefine-segment-id-map:4", "DETAILS110 => 110")
.option("redefine-segment-id-map:5", "DETAILS101 => 101")
.option("redefine-segment-id-map:6", "DETAILS111 => 111")

@SHUBHAMDw
Author

@yruslan
Getting error:
IllegalStateException: The record length field SEGMENT_ID must be an integral type.

@SHUBHAMDw
Author

The input data is binary, with indicator values like 100, 001, 111, 101, 100.

@yruslan
Collaborator

yruslan commented Jul 23, 2024

Which version of Cobrix are you using?

You can add

.option("pedantic", "true")

to ensure all passed options are recognized.

@SHUBHAMDw
Author

SHUBHAMDw commented Jul 23, 2024

@yruslan This is the copybook:
* You may obtain a copy of the License at *
* *
* http://www.apache.org/licenses/LICENSE-2.0 *
* *
* Unless required by applicable law or agreed to in writing, software *
* distributed under the License is distributed on an "AS IS" BASIS, *
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. *
* See the License for the specific language governing permissions and *
* limitations under the License. *
* *
****************************************************************************
       01  CUSTOMER-RECORD.
           05  SEGMENT-ID                       PIC X(3).
           05  DETAILS100.
               10  CUSTOMER-DETAILS.
                   15  CUSTOMER-ID                  PIC X(10).
                   15  CUSTOMER-NAME                PIC X(30).
                   15  CUSTOMER-ADDRESS             PIC X(50).
                   15  CUSTOMER-PHONE-NUMBER        PIC X(15).
           05  DETAILS110 REDEFINES DETAILS100.
               10  CUSTOMER-DETAILS.
                   15  CUSTOMER-ID                  PIC X(10).
                   15  CUSTOMER-NAME                PIC X(30).
                   15  CUSTOMER-ADDRESS             PIC X(50).
                   15  CUSTOMER-PHONE-NUMBER        PIC X(15).
               10  ACCOUNT-INFORMATION.
                   15  ACCOUNT-NUMBER               PIC X(10).
                   15  ACCOUNT-TYPE                 PIC X(2).
                   15  ACCOUNT-BALANCE              PIC X(12).
           05  DETAILS101 REDEFINES DETAILS100.
               10  CUSTOMER-DETAILS.
                   15  CUSTOMER-ID                  PIC X(10).
                   15  CUSTOMER-NAME                PIC X(30).
                   15  CUSTOMER-ADDRESS             PIC X(50).
                   15  CUSTOMER-PHONE-NUMBER        PIC X(15).
               10  TRANSACTION-HISTORY.
                   15  TRANSACTION-ID               PIC X(10).
                   15  TRANSACTION-DATE             PIC X(8).
                   15  TRANSACTION-AMOUNT           PIC X(12).
                   15  TRANSACTION-TYPE             PIC X(2).
           05  DETAILS111 REDEFINES DETAILS100.
               10  CUSTOMER-DETAILS.
                   15  CUSTOMER-ID                  PIC X(10).
                   15  CUSTOMER-NAME                PIC X(30).
                   15  CUSTOMER-ADDRESS             PIC X(50).
                   15  CUSTOMER-PHONE-NUMBER        PIC X(15).
               10  ACCOUNT-INFORMATION.
                   15  ACCOUNT-NUMBER               PIC X(10).
                   15  ACCOUNT-TYPE                 PIC X(2).
                   15  ACCOUNT-BALANCE              PIC X(12).
               10  TRANSACTION-HISTORY.
                   15  TRANSACTION-ID               PIC X(10).
                   15  TRANSACTION-DATE             PIC X(8).
                   15  TRANSACTION-AMOUNT           PIC X(12).
                   15  TRANSACTION-TYPE             PIC X(2).

Read code:

df = (
    spark.read.format("cobol")
    .option("copybook", "/mnt/idfprodappdata/x/xx/SCHEMA/T_3_4.cob")
    .option("record_format", "F")
    .option("record_length_field", "SEGMENT_ID")
    .option("record_length_map", """{"001":50,"010":30,"100":20,"011":80,"110":50,"101":70,"111":100}""")
    .option("segment_field", "SEGMENT-ID")
    .option("redefine-segment-id-map:2", "DETAILS100 => 100")
    .option("redefine-segment-id-map:4", "DETAILS110 => 110")
    .option("redefine-segment-id-map:5", "DETAILS101 => 101")
    .option("redefine-segment-id-map:6", "DETAILS111 => 111")
    .option("pedantic", "true")
    .load("/mnt/idfprodappdata/x/xx/DATA/customer_data_file_V2.dat")
)

Getting error: IllegalArgumentException: Redundant or unrecognized option(s) to 'spark-cobol': record_length_map.

@SHUBHAMDw
Author

Version: za.co.absa.cobrix:spark-cobol_2.12:2.6.9

@SHUBHAMDw
Author

@yruslan I also tried to read the above file using OCCURS, but was getting a syntax error.

@yruslan
Collaborator

yruslan commented Jul 23, 2024

Version: za.co.absa.cobrix:spark-cobol_2.12:2.6.9

The record_length_map option was added in more recent versions of Cobrix; try 2.7.2.

Getting error : IllegalArgumentException: Redundant or unrecognized option(s) to 'spark-cobol': record_length_map.

This confirms that you need to update to 2.7.2 in order to use this option.

@SHUBHAMDw
Author

SHUBHAMDw commented Jul 24, 2024

@yruslan I have updated the version but am getting:
NumberFormatException: For input string: ""

The data looks like this: [screenshot attached]

@SHUBHAMDw
Author

@yruslan Hi, I have updated the version to za.co.absa.cobrix:spark-cobol_2.12:2.7.3.

@SHUBHAMDw
Author

I have also tried OCCURS, but that gives me a syntax error:
"Py4JJavaError: An error occurred while calling o493.load.
: za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 13: Invalid input 'EGMENT-INDICATORS' at position 13:6"

  * You may obtain a copy of the License at                                  *
  *                                                                          *
  *     http://www.apache.org/licenses/LICENSE-2.0                           *
  *                                                                          *
  * Unless required by applicable law or agreed to in writing, software      *
  * distributed under the License is distributed on an "AS IS" BASIS,        *
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. *
  * See the License for the specific language governing permissions and      *
  * limitations under the License.                                           *
  *                                                                          *
  ****************************************************************************
  01 CUSTOMER-RECORD.
	10 SEGMENT-INDICATORS.
		15  CUSTOMER-DETAILS-PRESENT      PIC 9(1).
		15  ACCOUNT-INFORMATION-PRESENT   PIC 9(1).
		15  TRANSACTION-HISTORY-PRESENT   PIC 9(1).
  01 CUSTOMER-DETAILS-TAB.
	10 CUST-TAB OCCURS 1 TO 2 TIMES DEPENDING ON CUSTOMER-DETAILS-PRESENT.
		15  CUSTOMER-ID                  PIC X(10).
		15  CUSTOMER-NAME                PIC X(30).
		15  CUSTOMER-ADDRESS             PIC X(50).
		15  CUSTOMER-PHONE-NUMBER        PIC X(15).
  01 ACCOUNT-INFORMATION-TAB.
	10 ACCT-INFO-TAB OCCURS 1 TO 2 TIMES DEPENDING ON ACCOUNT-INFORMATION-PRESENT.
		15  ACCOUNT-NUMBER               PIC X(10).
		15  ACCOUNT-TYPE                 PIC X(2).
		15  ACCOUNT-BALANCE              PIC X(12).
  01 TRANSACTION-HISTORY-TAB.
	10 TRANS-TAB OCCURS 1 TO 2 TIMES DEPENDING ON TRANSACTION-HISTORY-PRESENT.
		15 TRANSACTION-ID               PIC X(10).
		15 TRANSACTION-DATE             PIC X(8).
		15 TRANSACTION-AMOUNT           PIC X(12).
		15 TRANSACTION-TYPE             PIC X(2).

@yruslan
Collaborator

yruslan commented Jul 24, 2024

The error message is due to the padding of the copybook with spaces. Please fix the padding by making sure the first 6 characters of each line are part of a comment, or use these options to specify the padding for your copybook:

https://github.com/AbsaOSS/cobrix?tab=readme-ov-file#copybook-parsing-options

@SHUBHAMDw
Author

@yruslan Regarding the earlier point: I have updated the version but am still getting:
NumberFormatException: For input string: ""

@yruslan
Collaborator

yruslan commented Jul 24, 2024

@yruslan Regarding the earlier point: I have updated the version but am still getting: NumberFormatException: For input string: ""

Please post the full stack trace; it is hard to tell what is causing the error.

Also, make sure the segment-value-to-record-length map that you pass to record_length_map is valid JSON, and that the record lengths are correct. The values I posted were just examples of the feature.
Here is an example of valid JSON to pass to the record length mapping:

{
    "100": 20,
    "101": 70,
    "110": 50,
    "111": 100,
    "001": 50,
    "010": 30,
    "011": 80
}
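As a rough sketch, the sizes can be derived from the PIC clauses of your copybook, assuming the mapped value is the total record length, including the 3-byte SEGMENT-ID. Please verify the resulting values against the layout position table from the log.

import json

# Field sizes implied by the PIC clauses of the copybook posted earlier.
SEGMENT_ID_LEN = 3                          # SEGMENT-ID  PIC X(3)
CUSTOMER_DETAILS_LEN = 10 + 30 + 50 + 15    # CUSTOMER-ID .. CUSTOMER-PHONE-NUMBER
ACCOUNT_INFORMATION_LEN = 10 + 2 + 12       # ACCOUNT-NUMBER .. ACCOUNT-BALANCE
TRANSACTION_HISTORY_LEN = 10 + 8 + 12 + 2   # TRANSACTION-ID .. TRANSACTION-TYPE

length_map = {}
for segment_key in ["001", "010", "011", "100", "101", "110", "111"]:
    cust, acct, trans = (int(c) for c in segment_key)
    length_map[segment_key] = (SEGMENT_ID_LEN
                               + cust * CUSTOMER_DETAILS_LEN
                               + acct * ACCOUNT_INFORMATION_LEN
                               + trans * TRANSACTION_HISTORY_LEN)

# Pass the resulting string to .option("record_length_map", ...)
print(json.dumps(length_map))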

@SHUBHAMDw
Author

SHUBHAMDw commented Jul 24, 2024

@yruslan Stack trace:
NumberFormatException: For input string: ""
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5) (192.223.255.11 executor 1): java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike.toInt(StringLike.scala:304)
at scala.collection.immutable.StringLike.toInt$(StringLike.scala:304)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:33)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.$anonfun$fetchRecordUsingRecordLengthFieldExpression$1(VRLRecordReader.scala:200)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.$anonfun$fetchRecordUsingRecordLengthFieldExpression$1$adapted(VRLRecordReader.scala:195)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:193)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.fetchRecordUsingRecordLengthFieldExpression(VRLRecordReader.scala:195)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.fetchNext(VRLRecordReader.scala:98)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.next(VRLRecordReader.scala:75)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.fetchNext(VarLenNestedIterator.scala:84)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.next(VarLenNestedIterator.scala:74)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.next(VarLenNestedIterator.scala:43)
at za.co.absa.cobrix.spark.cobol.reader.VarLenNestedReader$RowIterator.next(VarLenNestedReader.scala:43)
at za.co.absa.cobrix.spark.cobol.reader.VarLenNestedReader$RowIterator.next(VarLenNestedReader.scala:40)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
at com.google.common.collect.Iterators$PeekingImpl.hasNext(Iterators.java:1139)
at com.databricks.photon.NativeRowBatchIterator.hasNext(NativeRowBatchIterator.java:44)
at 0xc246792 .HasNext(external/workspace_spark_3_5/photon/jni-wrappers/jni-row-batch-iterator.cc:50)
at com.databricks.photon.JniApiImpl.hasNext(Native Method)
at com.databricks.photon.JniApi.hasNext(JniApi.scala)
at com.databricks.photon.JniExecNode.hasNext(JniExecNode.java:76)
at com.databricks.photon.BasePhotonResultHandler$$anon$1.hasNext(PhotonExec.scala:891)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.$anonfun$hasNext$1(PhotonBasicEvaluatorFactory.scala:228)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.photon.PhotonResultHandler.timeit(PhotonResultHandler.scala:30)
at com.databricks.photon.PhotonResultHandler.timeit$(PhotonResultHandler.scala:28)
at com.databricks.photon.BasePhotonResultHandler.timeit(PhotonExec.scala:878)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.hasNext(PhotonBasicEvaluatorFactory.scala:228)
at com.databricks.photon.CloseableIterator$$anon$10.hasNext(CloseableIterator.scala:211)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$1(UnsafeRowBatchUtils.scala:68)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$2(Collector.scala:214)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:190)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:155)
at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:129)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:149)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:101)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$10(Executor.scala:1013)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:106)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:1016)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:903)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.$anonfun$failJobAndIndependentStages$1(DAGScheduler.scala:3910)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3908)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3822)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3809)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3809)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1680)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1665)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1665)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:4157)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:4069)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:4057)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:55)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$runJob$1(DAGScheduler.scala:1329)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1317)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:3034)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$runSparkJobs$1(Collector.scala:355)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:299)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$collect$1(Collector.scala:384)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:381)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:122)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:131)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:94)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:90)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:78)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$computeResult$1(ResultCacheManager.scala:552)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.collectResult$1(ResultCacheManager.scala:546)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.computeResult(ResultCacheManager.scala:563)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:400)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:399)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:318)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeCollectResult$1(SparkPlan.scala:560)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:557)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3840)
at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3831)
at org.apache.spark.sql.Dataset.$anonfun$withAction$3(Dataset.scala:4803)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:1152)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4801)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$9(SQLExecution.scala:398)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:713)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:278)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1180)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:165)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:650)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4801)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3830)
at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:324)
at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:100)
at com.databricks.backend.daemon.driver.PythonDriverLocalBase.generateTableResult(PythonDriverLocalBase.scala:856)
at com.databricks.backend.daemon.driver.JupyterDriverLocal.computeListResultsItem(JupyterDriverLocal.scala:1451)
at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.addCustomDisplayData(JupyterDriverLocal.scala:283)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike.toInt(StringLike.scala:304)
at scala.collection.immutable.StringLike.toInt$(StringLike.scala:304)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:33)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.$anonfun$fetchRecordUsingRecordLengthFieldExpression$1(VRLRecordReader.scala:200)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.$anonfun$fetchRecordUsingRecordLengthFieldExpression$1$adapted(VRLRecordReader.scala:195)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:193)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.fetchRecordUsingRecordLengthFieldExpression(VRLRecordReader.scala:195)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.fetchNext(VRLRecordReader.scala:98)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.next(VRLRecordReader.scala:75)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.fetchNext(VarLenNestedIterator.scala:84)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.next(VarLenNestedIterator.scala:74)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.next(VarLenNestedIterator.scala:43)
at za.co.absa.cobrix.spark.cobol.reader.VarLenNestedReader$RowIterator.next(VarLenNestedReader.scala:43)
at za.co.absa.cobrix.spark.cobol.reader.VarLenNestedReader$RowIterator.next(VarLenNestedReader.scala:40)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
at com.google.common.collect.Iterators$PeekingImpl.hasNext(Iterators.java:1139)
at com.databricks.photon.NativeRowBatchIterator.hasNext(NativeRowBatchIterator.java:44)
at 0xc246792 .HasNext(external/workspace_spark_3_5/photon/jni-wrappers/jni-row-batch-iterator.cc:50)
at com.databricks.photon.JniApiImpl.hasNext(Native Method)
at com.databricks.photon.JniApi.hasNext(JniApi.scala)
at com.databricks.photon.JniExecNode.hasNext(JniExecNode.java:76)
at com.databricks.photon.BasePhotonResultHandler$$anon$1.hasNext(PhotonExec.scala:891)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.$anonfun$hasNext$1(PhotonBasicEvaluatorFactory.scala:228)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.photon.PhotonResultHandler.timeit(PhotonResultHandler.scala:30)
at com.databricks.photon.PhotonResultHandler.timeit$(PhotonResultHandler.scala:28)
at com.databricks.photon.BasePhotonResultHandler.timeit(PhotonExec.scala:878)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.hasNext(PhotonBasicEvaluatorFactory.scala:228)
at com.databricks.photon.CloseableIterator$$anon$10.hasNext(CloseableIterator.scala:211)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$1(UnsafeRowBatchUtils.scala:68)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$2(Collector.scala:214)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:190)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:155)
at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:129)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:149)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:101)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$10(Executor.scala:1013)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:106)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:1016)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:903)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more

@yruslan
Collaborator

yruslan commented Jul 24, 2024

Looks like some of your options might be incorrect. Use:

.option("pedantic", "true")

to reveal the incorrect option. Also, please share all the options you are passing to spark-cobol.

@SHUBHAMDw
Author

@yruslan
df = (
    spark.read.format("cobol")
    .option("copybook", "/mnt/idfprodappdata/x/xx/SCHEMA/T_3_4.cob")
    .option("record_format", "F")
    .option("record_length_field", "SEGMENT_ID")
    .option("record_length_map", """{"001":50,"010":30,"100":20,"011":80,"110":50,"101":70,"111":100}""")
    .option("segment_field", "SEGMENT-ID")
    .option("redefine-segment-id-map:2", "DETAILS100 => 100")
    .option("redefine-segment-id-map:4", "DETAILS110 => 110")
    .option("redefine-segment-id-map:5", "DETAILS101 => 101")
    .option("redefine-segment-id-map:6", "DETAILS111 => 111")
    .option("pedantic", "true")
    .load("/mnt/idfprodappdata/x/xx/DATA/customer_data_file_V2.dat")
)
df.display()

@yruslan
Collaborator

yruslan commented Jul 25, 2024

The options look good, with the exception of the JSON you are passing to record_length_map. As I said, I just provided an example; it is up to you to figure out the record sizes for each of the cases.

I can help you if you send the layout position table that is printed in the log when you use spark-cobol.

UPDATE. You can also try

.option("record_length_field", "SEGMENT-ID")

instead of

.option("record_length_field", "SEGMENT_ID")

@yruslan
Collaborator

yruslan commented Jul 25, 2024

I had also tried occurs but that is giving me syntax error as : "Py4JJavaError: An error occurred while calling o493.load. : za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 13: Invalid input 'EGMENT-INDICATORS' at position 13:6"

  * You may obtain a copy of the License at                                  *
  *                                                                          *
  *     http://www.apache.org/licenses/LICENSE-2.0                           *
  *                                                                          *
  * Unless required by applicable law or agreed to in writing, software      *
  * distributed under the License is distributed on an "AS IS" BASIS,        *
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. *
  * See the License for the specific language governing permissions and      *
  * limitations under the License.                                           *
  *                                                                          *
  ****************************************************************************
  01 CUSTOMER-RECORD.
	10 SEGMENT-INDICATORS.
		15  CUSTOMER-DETAILS-PRESENT      PIC 9(1).
		15  ACCOUNT-INFORMATION-PRESENT   PIC 9(1).
		15  TRANSACTION-HISTORY-PRESENT   PIC 9(1).
  01 CUSTOMER-DETAILS-TAB.
	10 CUST-TAB OCCURS 1 TO 2 TIMES DEPENDING ON CUSTOMER-DETAILS-PRESENT.
		15  CUSTOMER-ID                  PIC X(10).
		15  CUSTOMER-NAME                PIC X(30).
		15  CUSTOMER-ADDRESS             PIC X(50).
		15  CUSTOMER-PHONE-NUMBER        PIC X(15).
  01 ACCOUNT-INFORMATION-TAB.
	10 ACCT-INFO-TAB OCCURS 1 TO 2 TIMES DEPENDING ON ACCOUNT-INFORMATION-PRESENT.
		15  ACCOUNT-NUMBER               PIC X(10).
		15  ACCOUNT-TYPE                 PIC X(2).
		15  ACCOUNT-BALANCE              PIC X(12).
  01 TRANSACTION-HISTORY-TAB.
	10 TRANS-TAB OCCURS 1 TO 2 TIMES DEPENDING ON TRANSACTION-HISTORY-PRESENT.
		15 TRANSACTION-ID               PIC X(10).
		15 TRANSACTION-DATE             PIC X(8).
		15 TRANSACTION-AMOUNT           PIC X(12).
		15 TRANSACTION-TYPE             PIC X(2).

This is also a possible solution, and it is more elegant than what I proposed. I think if you just fix the padding of the copybook, it might work (e.g. add 4 spaces to each line).
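If it helps, here is a throwaway sketch of that padding fix. The paths are placeholders, and the amount of padding should be chosen so that level numbers start after column 6 and the '*' comment lines end up with the asterisk in column 7.

def pad_copybook(src_path: str, dst_path: str, pad: int = 4) -> None:
    """Shift every line of the copybook to the right by `pad` spaces."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(" " * pad + line.rstrip("\n") + "\n")

pad_copybook("/path/to/original.cob", "/path/to/padded.cob")  # placeholder paths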

@SHUBHAMDw
Author

SHUBHAMDw commented Jul 25, 2024

  * You may obtain a copy of the License at                                  *
  *                                                                          *
  *     http://www.apache.org/licenses/LICENSE-2.0                           *
  *                                                                          *
  * Unless required by applicable law or agreed to in writing, software      *
  * distributed under the License is distributed on an "AS IS" BASIS,        *
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. *
  * See the License for the specific language governing permissions and      *
  * limitations under the License.                                           *
  *                                                                          *
  ****************************************************************************
  01 CUSTOMER-RECORD.
       10 SEGMENT-INDICATORS.
           15  CUSTOMER-DETAILS-PRESENT      PIC 9(1).
           15  ACCOUNT-INFORMATION-PRESENT   PIC 9(1).
           15  TRANSACTION-HISTORY-PRESENT   PIC 9(1).
  01 CUSTOMER-DETAILS-TAB.
       10 CUST-TAB OCCURS 1 TO 2 TIMES DEPENDING ON CUSTOMER-DETAILS-PRESENT.
           15  CUSTOMER-ID                  PIC X(10).
           15  CUSTOMER-NAME                PIC X(30).
           15  CUSTOMER-ADDRESS             PIC X(50).
           15  CUSTOMER-PHONE-NUMBER        PIC X(15).
  01 ACCOUNT-INFORMATION-TAB.
       10 ACCT-INFO-TAB OCCURS 1 TO 2 TIMES DEPENDING ON ACCOUNT-INFORMATION-PRESENT.
           15  ACCOUNT-NUMBER               PIC X(10).
           15  ACCOUNT-TYPE                 PIC X(2).
           15  ACCOUNT-BALANCE              PIC X(12).
  01 TRANSACTION-HISTORY-TAB.
       10 TRANS-TAB OCCURS 1 TO 2 TIMES DEPENDING ON TRANSACTION-HISTORY-PRESENT.
           15 TRANSACTION-ID               PIC X(10).
           15 TRANSACTION-DATE             PIC X(8).
           15 TRANSACTION-AMOUNT           PIC X(12).
           15 TRANSACTION-TYPE             PIC X(2).

@yruslan: I have fixed the padding, but it is still giving a syntax error.
Py4JJavaError: An error occurred while calling o409.load.
: za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 19: Invalid input '15' at position 19:15
at za.co.absa.cobrix.cobol.parser.antlr.ThrowErrorStrategy.recover(ANTLRParser.scala:38)
at za.co.absa.cobrix.cobol.parser.antlr.copybookParser.item(copybookParser.java:3305)
at za.co.absa.cobrix.cobol.parser.antlr.copybookParser.main(copybookParser.java:215)
at za.co.absa.cobrix.cobol.parser.antlr.ANTLRParser$.parse(ANTLRParser.scala:85)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.parseTree(CopybookParser.scala:282)
at za.co.absa.cobrix.cobol.reader.schema.CobolSchema$.fromReaderParameters(CobolSchema.scala:108)
at za.co.absa.cobrix.cobol.reader.VarLenNestedReader.loadCopyBook(VarLenNestedReader.scala:202)
at za.co.absa.cobrix.cobol.reader.VarLenNestedReader.<init>(VarLenNestedReader.scala:52)
at za.co.absa.cobrix.spark.cobol.reader.VarLenNestedReader.<init>(VarLenNestedReader.scala:37)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createVariableLengthReader(DefaultSource.scala:112)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.buildEitherReader(DefaultSource.scala:76)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:56)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:44)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:398)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:392)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:348)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:348)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
at java.lang.Thread.run(Thread.java:750)
File , line 1
----> 1 df = spark.read.format("cobol").option("copybook", "/mnt/idfprodappdata/x/xx/SCHEMA/T_3_2.cob").option("record_format", "F").option("variable_size_occurs", True).option("variable_size_occurs", "true").load("/mnt/idfprodappdata/x/xx/DATA/data_test25_occurs.dat")
2 df.display()
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))

@yruslan
Collaborator

yruslan commented Jul 25, 2024

I think you fixed only the padding at the beginning of each line, but not at the end: in copybooks, characters after position 72 are ignored.

Your copybook should look like this (including spaces):

      01 CUSTOMER-RECORD.
           10 SEGMENT-INDICATORS.
               15  CUSTOMER-DETAILS-PRESENT      PIC 9(1).
               15  ACCOUNT-INFORMATION-PRESENT   PIC 9(1).
               15  TRANSACTION-HISTORY-PRESENT   PIC 9(1).
      01 CUSTOMER-DETAILS-TAB.
           10 CUST-TAB OCCURS 1 TO 2 TIMES
                   DEPENDING ON CUSTOMER-DETAILS-PRESENT.
               15  CUSTOMER-ID                  PIC X(10).
               15  CUSTOMER-NAME                PIC X(30).
               15  CUSTOMER-ADDRESS             PIC X(50).
               15  CUSTOMER-PHONE-NUMBER        PIC X(15).
      01 ACCOUNT-INFORMATION-TAB.
           10 ACCT-INFO-TAB OCCURS 1 TO 2 TIMES
                   DEPENDING ON ACCOUNT-INFORMATION-PRESENT.
               15  ACCOUNT-NUMBER               PIC X(10).
               15  ACCOUNT-TYPE                 PIC X(2).
               15  ACCOUNT-BALANCE              PIC X(12).
      01 TRANSACTION-HISTORY-TAB.
           10 TRANS-TAB OCCURS 1 TO 2 TIMES
                   DEPENDING ON TRANSACTION-HISTORY-PRESENT.
               15 TRANSACTION-ID               PIC X(10).
               15 TRANSACTION-DATE             PIC X(8).
               15 TRANSACTION-AMOUNT           PIC X(12).
               15 TRANSACTION-TYPE             PIC X(2).
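With the copybook padded like this, you can retry roughly the same read as before; it might work now (paths are placeholders):

df = (
    spark.read.format("cobol")
    .option("copybook", "/path/to/fixed_copybook.cob")   # placeholder path
    .option("record_format", "F")
    .option("variable_size_occurs", "true")
    .option("pedantic", "true")
    .load("/path/to/customer_data_file_V2.dat")          # placeholder path
)
df.display()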
