Describe the bug
Auron's JNI Hadoop file wrappers may truncate paths containing a literal #.
This happens because JniBridge rebuilds paths via new Path(new URI(path)), where Java URI treats everything after # as a fragment, so Hadoop receives a truncated path.
For example, this intended path:
hdfs://mycluster/auron-it-hdfs-rbf-repro/raw#mini.txt
is opened as:
/auron-it-hdfs-rbf-repro/raw
To Reproduce
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("auron-hdfs-hash-repro").getOrCreate()
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
path = "hdfs://mycluster/auron-it-hdfs-rbf-repro/raw#mini.txt"
hadoop_path = jvm.org.apache.hadoop.fs.Path(path)
fs = hadoop_path.getFileSystem(conf)
print("path=", path)
if fs.exists(hadoop_path):
fs.delete(hadoop_path, False)
out = fs.create(hadoop_path, True)
out.write(bytearray(b"data"))
out.close()
print("plain_hadoop_create_ok")
plain_in = fs.open(hadoop_path)
plain_in.close()
print("plain_hadoop_open_ok")
jvm.org.apache.auron.jni.JniBridge.openFileAsDataInputWrapper(fs, path).close()
print("jni_open_ok")
spark.stop()
The relevant output is:
Expected behavior
Auron should preserve Hadoop Path(String) semantics when opening or creating files through JNI Hadoop wrappers.
JniBridge.openFileAsDataInputWrapper and JniBridge.createFileAsDataOutputWrapper should open the exact path string passed by the caller, including literal # characters in the Hadoop path.
Describe the bug
Auron's JNI Hadoop file wrappers may truncate paths containing a literal #.
This happens because
JniBridgerebuilds paths vianew Path(new URI(path)), where JavaURItreats everything after#as a fragment, so Hadoop receives a truncated path.For example, this intended path:
is opened as:
To Reproduce
The relevant output is:
Expected behavior
Auron should preserve Hadoop
Path(String)semantics when opening or creating files through JNI Hadoop wrappers.JniBridge.openFileAsDataInputWrapperandJniBridge.createFileAsDataOutputWrappershould open the exact path string passed by the caller, including literal#characters in the Hadoop path.