{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":469954767,"defaultBranch":"master","name":"spark","ownerLogin":"djydata","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2022-03-15T00:51:01.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/119560218?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1647413884.534485","currentOid":""},"activityList":{"items":[{"before":"6277fc783f517656be66da76200b332807fd0595","after":"2a49feeb5d727552758a75fdcfbc49e8f6eed72f","ref":"refs/heads/master","pushedAt":"2023-12-12T04:10:47.000Z","pushType":"push","commitsCount":5884,"pusher":{"login":"vitojeng","name":"Vito Jeng","path":"/vitojeng","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/475186?s=80&v=4"},"commit":{"message":"[SPARK-43427][PROTOBUF] spark protobuf: allow upcasting unsigned integer types\n\n### What changes were proposed in this pull request?\n\nJIRA: https://issues.apache.org/jira/browse/SPARK-43427\n\nProtobuf supports unsigned integer types, including uint32 and uint64. When deserializing protobuf values with fields of these types, `from_protobuf` currently transforms them to the spark types of:\n\n```\nuint32 => IntegerType\nuint64 => LongType\n```\n\nIntegerType and LongType are [signed](https://spark.apache.org/docs/latest/sql-ref-datatypes.html) integer types, so this can lead to confusing results. Namely, if a uint32 value in a stored proto is above 2^31 or a uint64 value is above 2^63, their representation in binary will contain a 1 in the highest bit, which when interpreted as a signed integer will be negative (I.e. overflow). No information is lost, as `IntegerType` and `LongType` contain 32 and 64 bits respectively, however their representation can be confusing.\n\nIn this PR, we add an option (`upcast.unsigned.ints`) to allow upcasting unsigned integer types into a larger integer type that can represent them natively, i.e.\n\n```\nuint32 => LongType\nuint64 => Decimal(20, 0)\n```\n\nI added an option so that it doesn't break any existing clients.\n\n**Example of current behavior**\n\nConsider a protobuf message like:\n\n```\nsyntax = \"proto3\";\n\nmessage Test {\n uint64 val = 1;\n}\n```\n\nIf we compile the above and then generate a message with a value for `val` above 2^63:\n\n```\nimport test_pb2\n\ns = test_pb2.Test()\ns.val = 9223372036854775809 # 2**63 + 1\nserialized = s.SerializeToString()\nprint(serialized)\n```\n\nThis generates the binary representation:\n\nb'\\x08\\x81\\x80\\x80\\x80\\x80\\x80\\x80\\x80\\x80\\x01'\n\nThen, deserializing this using `from_protobuf`, we can see that it is represented as a negative number. I did this in a notebook so its easier to see, but could reproduce in a scala test as well:\n\n![image](https://github.com/apache/spark/assets/1002986/7144e6a9-3f43-455e-94c3-9065ae88206e)\n\n**Precedent**\nI believe that unsigned integer types in parquet are deserialized in a similar manner, i.e. put into a larger type so that the unsigned representation natively fits. https://issues.apache.org/jira/browse/SPARK-34817 and https://github.com/apache/spark/pull/31921. 
### Why are the changes needed?

Improve unsigned integer deserialization behavior.

### Does this PR introduce any user-facing change?

Yes, it adds a new option.

### How was this patch tested?

Unit testing.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43773 from justaparth/parth/43427-add-option-to-expand-unsigned-integers.

Authored-by: Parth Upadhyay
Signed-off-by: Hyukjin Kwon