[BUG] The specified path does not exist.", 404, GET #1277
Additionally, job 1 also sees "com.microsoft.azure.storage.StorageException: The specified blob does not exist." along with org.apache.spark.SparkException: Task failed while writing rows.
I even changed the scheme for reading from wasbs to abfss and assigned public permissions on the container, but I am still getting this: "Caused by: Operation failed: "The specified path does not exist.", 404, GET, https://raddsstatsstorage.dfs.core.windows.net/stats-prod?upn=false&resource=filesystem&maxResults=500&directory=10m/per_ip/_delta_log&timeout=90&recursive=false, PathNotFound, "The specified path does not exist. RequestId:e0249403-c01f-0006-1eef-99b549000000 Time:2022-07-17T15:12:17.9805557Z""
I'm seeing the same issue. In our scenario, job 2 will see these errors at least once a day.
@0xdarkman @jeffco11 Looks like you can see the request id in the error. Would you be able to reach out to Azure to check these requests? This is unlikely a Delta issue, as the error complains that the path does not exist on the storage side.
@jeffco11 when you write the stream using the abfss scheme, do you create the container yourself upfront?
_delta_log should be there, should it not?
this is the response from Azure:
I am certain I should not be managing the existence of _delta_log at all.
@0xdarkman I assume _delta_log exists in your case?
I got a similar response from the Azure team. If you look for a file that doesn't exist, you should expect a 404 error. We don't manage the _delta_log, so I'm not sure what's happening. In our scenario the _delta_log always exists because we are constantly streaming data to the table, and we occasionally get the 404s when we try to read from it. We created the container upfront originally and have been writing to this table for well over a year.
Correct, _delta_log exists. I ensure I write some data first before I start reading from the path. Furthermore, _delta_log is created and managed by the Delta table. #1277 (comment) summarizes the problem well. As soon as I switch to Gen1 (change the storage account, abfss -> wasbs, dfs -> blob), everything works well.
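For illustration, a minimal sketch of that switch (the account names below are hypothetical placeholders; the wasbs path points at a separate Gen1 account):

val gen2Path = "abfss://stats-prod@mygen2account.dfs.core.windows.net/10m/per_ip"
val gen1Path = "wasbs://stats-prod@mygen1account.blob.core.windows.net/10m/per_ip"
// Reading via the Gen1/blob endpoint is stable; the Gen2/dfs path intermittently
// fails with the 404 on the _delta_log listing shown above.
spark.readStream.format("delta").load(gen1Path)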
Yep, this looks like either a hadoop-azure library issue or a server-side issue. It sounds more like a server issue, because the request URL seems correct to me. Delta just calls the FileSystems provided by hadoop-azure, so it's unlikely that the issue can be fixed inside Delta Lake.
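One way to check this outside Delta is a minimal diagnostic sketch like the following (the path is copied from the error above, and spark is assumed to be an active SparkSession):

import org.apache.hadoop.fs.Path

val logDir = new Path("abfss://stats-prod@raddsstatsstorage.dfs.core.windows.net/stats-prod/10m/per_ip/_delta_log")
val fs = logDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
// This is the same listing Delta's LogStore ends up issuing (listFrom -> listStatus).
fs.listStatus(logDir).foreach(s => println(s.getPath))

If this raw listing intermittently throws FileNotFoundException for a directory that does exist, the failure is below Delta Lake, in hadoop-azure or the storage service.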
What do you call the "server" in this case? The Spark application runs on Kubernetes in Azure and writes to a Delta table on Azure Storage Gen2.
The Azure storage server. Hadoop-azure basically just calls the Azure storage client SDK to send HTTP requests to the Azure storage server (I don't know what Azure calls it, though).
FWIW: We are seeing the same here.
It is resolved. MS implemented a bug fix a few days ago.
Thanks for the update. I will close this issue.
I'm still facing this issue. It seems the fix has not been rolled out yet. @0xdarkman - Did they fix the issue at the storage layer or the Spark layer?
What is the resolution?
@0xdarkman Could you let us know where the fix was made?
Same here. Where could I find the fix? Many thanks.
@li-plaintext would you or anyone be able to contact MSFT support to look at this issue?
I can confirm the issue is not resolved.
@0xdarkman did you manage to solve this with the help of Microsoft? We are currently facing the exact same issue.
They told me that they had applied the fix on the storage account, but after a couple of weeks they rolled the change back because of a tenant change. So, simply put, they did not fix it. Now I am waiting for the fix to be rolled out to all my storage accounts, but they still have not fixed it and are not telling me what the exact issue is. In short, lots of nonsense with no good explanation.
Thanks for the quick reply! That sucks though... but according to them it was something to be fixed at the storage account level, if I understand correctly, and not something we can fix ourselves. I guess there is no other option than to open a ticket with them.
No, I did not find a solution. I used Gen1 instead as intermediate storage between the two streams.
Hi @0xdarkman. We use ... The stack trace shows it is triggered when the ...
In our case, what happened was that the container that Synapse used in the data lake for its internal storage and compute operations had accidentally been deleted. Restoring this container (in our case, synws) resolved the issue. Maybe this helps you out as well.
Hi @0xdarkman, could you please share what the resolution was? I am facing the same error; the YARN logs show the error, but the job succeeds. However, when I verify the Delta lake on storage, the data is not present. Caused by: java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, PUT, https://redactedt/conatinerName/_temporary/0/_temporary/attempt_20231205133137_0183_m_000190_28900/Partition%3D19/TimeBucket%3D2023-12-02%2000%253A00%253A00/JobId%3D95ae4077-c42f-4fa4-9196-84166f638ccd/redacted/redacted/part-00190-f17dd298-a5a1-495e-9d10-3ec56650d0e6.c000.snappy.parquet?action=flush&retainUncommittedData=false&position=18418&close=true&timeout=90, PathNotFound, "The specified path does not exist. RequestId:87cdb709-c01f-0006-137f-277125000000 Time:2023-12-05T13:33:07.5183221Z"
Azure storage account premium file share throttling was the root cause. Increasing the file share size helped to raise the IOPS limits and bypass the problem.
I stream with Spark (Scala) from Kafka >> process the stream with job 1 and write to Delta Table 1 >> process the stream with job 2 and write to Delta Table 2.
Job 2 runs for a while but then fails because of the error below: The specified path does not exist.", 404, GET
It appends to the Delta table, so I do not understand why it gets a 404.
By the way, using abfss -> abfss does not help.
scalaVersion := "2.12.12"
sparkVersion = "3.2.1"
hadoopVersion = "3.3.0"
"com.microsoft.azure" % "azure-storage" % "8.6.6",
"io.delta" %% "delta-core" % "1.2.1",
Destination storage account: Gen2.
Job 1: reads from Kafka and writes to Delta Table 1 using the wasbs scheme and the blob.core.windows.net endpoint.
Job 2: reads from Delta Table 1 using the wasbs scheme and the blob.core.windows.net endpoint, and writes to Delta Table 2 using the abfss scheme and the dfs.core.windows.net endpoint.
When I write the stream I use "append" mode and partitionBy with multiple columns, roughly as sketched below.
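A minimal sketch of that write (the partition columns, checkpoint path, and table path are hypothetical placeholders, not taken from the actual job):

df.writeStream
  .format("delta")
  .outputMode("append")
  .partitionBy("colA", "colB") // hypothetical partition columns
  .option("checkpointLocation", checkpointPath)
  .start(deltaTable2Path)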
The intended behaviour would be for the job not to fail.
Spark uses this configuration for auth:
spark.conf.set(s"fs.azure.account.key.$accountName.blob.core.windows.net", accountKey)
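Since job 2 also writes through the dfs.core.windows.net endpoint, presumably the key has to be set for that endpoint as well; the following line is an assumption on my part, not part of the original configuration:

spark.conf.set(s"fs.azure.account.key.$accountName.dfs.core.windows.net", accountKey)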
What is the problem here, and how can it be fixed?
Caused by: Operation failed: "The specified path does not exist.", 404, GET, https://raddsstatsstorage.dfs.core.windows.net/stats-prod?upn=false&resource=filesystem&maxResults=500&directory=1h/per_customer_fqdn/_delta_log&timeout=90&recursive=false, PathNotFound, "The specified path does not exist. RequestId:6733a6ac-d01f-0074-18aa-99c477000000 Time:2022-07-17T06:54:17.9766705Z"
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:146)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:225)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:704)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:666)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:360)
... 58 more
22/07/17 06:54:18 INFO ShutdownHookManager: Shutdown hook called
22/07/17 06:54:18 INFO ShutdownHookManager: Deleting directory /tmp/spark-b04ed4ee-f724-45e7-b724-0c902c6de8b1
22/07/17 06:54:18 INFO ShutdownHookManager: Deleting directory /var/data/spark-f5e6d40d-907a-46d3-87b9-33cb85d7eb32/spark-dc28d981-c1b3-4578-be20-ada1a71003fd
22/07/17 06:54:18 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
22/07/17 06:54:18 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
22/07/17 06:54:18 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.
22/07/17 06:54:17 INFO OptimisticTransaction: [tableId=bf7d1277,txnId=c48d284b] Committed delta #3080 to abfss://stats-prod@raddsstatsstorage.dfs.core.windows.net/1h/per_customer_fqdn/_delta_log
22/07/17 06:54:17 INFO DeltaLog: Try to find Delta last complete checkpoint before version 3080
22/07/17 06:54:17 ERROR MicroBatchExecution: Query [id = 0147f9bb-9b4a-4a28-b364-d4cf15a02efa, runId = 649875a1-4ffa-4cb1-9dd5-6c8f429ec907] terminated with error
java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, GET, https://raddsstatsstorage.dfs.core.windows.net/stats-prod?upn=false&resource=filesystem&maxResults=500&directory=1h/per_customer_fqdn/_delta_log&timeout=90&recursive=false, PathNotFound, "The specified path does not exist. RequestId:6733a6ac-d01f-0074-18aa-99c477000000 Time:2022-07-17T06:54:17.9766705Z"
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1074)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:363)
at io.delta.storage.HadoopFileSystemLogStore.listFrom(HadoopFileSystemLogStore.java:59)
at org.apache.spark.sql.delta.storage.LogStoreAdaptor.listFrom(LogStore.scala:362)
at org.apache.spark.sql.delta.storage.DelegatingLogStore.listFrom(DelegatingLogStore.scala:125)
at org.apache.spark.sql.delta.Checkpoints.findLastCompleteCheckpoint(Checkpoints.scala:233)
at org.apache.spark.sql.delta.Checkpoints.findLastCompleteCheckpoint$(Checkpoints.scala:224)
at org.apache.spark.sql.delta.DeltaLog.findLastCompleteCheckpoint(DeltaLog.scala:64)
at org.apache.spark.sql.delta.SnapshotManagement.$anonfun$getSnapshotAt$1(SnapshotManagement.scala:568)
at scala.Option.orElse(Option.scala:447)