Implement Windows-1252 fallback logic for Encoding.Default #10190

Merged on Jun 10, 2024 (30 commits)
Changes from 27 commits
bc6af31
enable tests
radeusgd Jun 3, 2024
801b51d
test fallback in CSV read
radeusgd Jun 3, 2024
9b53599
add win fallback test
radeusgd Jun 3, 2024
72c2987
associate stream with its file
radeusgd Jun 3, 2024
443ef43
checkpoint
radeusgd Jun 3, 2024
5c19f58
introduce Restartable_Input_Stream and `is_peekable`
radeusgd Jun 4, 2024
aeed042
move
radeusgd Jun 4, 2024
0407f2b
checkpoint
radeusgd Jun 4, 2024
bb63dfc
WIP
radeusgd Jun 4, 2024
68636f0
comment
radeusgd Jun 5, 2024
aff68b2
move encoding heuristics from Java to Enso
radeusgd Jun 5, 2024
18acd6a
peek/skip bytes in Enso stream
radeusgd Jun 5, 2024
5c45640
javafmt
radeusgd Jun 5, 2024
c785e3d
fix test
radeusgd Jun 5, 2024
fde5dfd
implement the fallback check logic
radeusgd Jun 5, 2024
ebcf529
fix a test
radeusgd Jun 5, 2024
f02352f
fix a test 2
radeusgd Jun 5, 2024
94d6d33
fix a test 3
radeusgd Jun 5, 2024
6f9e65a
fix Temporary_File typo - renamed var
radeusgd Jun 5, 2024
db0c98c
remember to cleanup the byte-array input stream
radeusgd Jun 5, 2024
a059d88
input stream tests - peek / skip
radeusgd Jun 7, 2024
5bf15a3
restartable tests
radeusgd Jun 7, 2024
fdd1425
leave unspecified usage afterwards
radeusgd Jun 7, 2024
dc348ee
Merge branch 'refs/heads/develop' into wip/radeusgd/10148-win-1252-fa…
radeusgd Jun 8, 2024
8a8f7e6
adding tests
radeusgd Jun 8, 2024
ae9f516
restartable stream lifetime
radeusgd Jun 8, 2024
7564871
pending
radeusgd Jun 8, 2024
67f6671
Merge branch 'refs/heads/develop' into wip/radeusgd/10148-win-1252-fa…
radeusgd Jun 10, 2024
cbcce78
changelog
radeusgd Jun 10, 2024
d888a00
CR: typos in test comments
radeusgd Jun 10, 2024
@@ -62,7 +62,10 @@ type Encoding
        encoding. Otherwise, the input is decoded using UTF-8 unless it contains
        invalid UTF-8 sequences, in which case Windows-1252 is used as a fallback.

-       When used for encoding, it will always encode using UTF-8.
+       When used for encoding, it will either use the same encoding detection
+       heuristics as in read in case of Append mode. When writing a new file,
+       it will always use UTF-8.

        This encoding cannot be passed to some functions that require a Java
        Charset.
     default -> Encoding =
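
The read-side rule in the doc comment above (try UTF-8, fall back to Windows-1252 when the bytes are not valid UTF-8) can be sketched roughly as follows. This is an illustration in Python, not the Enso implementation from this PR: the function name `decode_with_fallback` is hypothetical, and the sketch ignores any earlier detection step that the truncated part of the doc comment may describe.

```python
def decode_with_fallback(data: bytes) -> tuple[str, str]:
    """Decode bytes as UTF-8, falling back to Windows-1252 on invalid UTF-8.

    Returns the decoded text together with the name of the encoding used.
    Hypothetical helper for illustration; not part of the PR.
    """
    try:
        # Strict decoding raises on the first invalid UTF-8 sequence.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few byte values unmapped, so
        # errors="replace" is used here to keep the fallback from raising.
        return data.decode("cp1252", errors="replace"), "windows-1252"


if __name__ == "__main__":
    print(decode_with_fallback("zażółć".encode("utf-8")))
    print(decode_with_fallback(b"caf\xe9"))
```

Valid UTF-8 input never reaches the fallback branch, so the fallback only changes behavior for inputs that would otherwise fail to decode.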

@@ -33,6 +33,7 @@ import project.Errors.Time_Error.Time_Error
 import project.Meta
 import project.Nothing.Nothing
 import project.Panic.Panic
+import project.System.Internal.Reporting_Stream_Decoder_Helper
 from project.Data.Boolean import Boolean, False, True
 from project.Data.Json import Invalid_JSON, JS_Object, Json
 from project.Data.Numbers import Float, Integer, Number, Number_Parse_Error
@@ -763,11 +764,7 @@ Text.bytes self (encoding : Encoding = Encoding.utf_8) (on_problems : Problem_Be
 @encoding Encoding.default_widget
 Text.from_bytes : Vector Integer -> Encoding -> Problem_Behavior -> Text
 Text.from_bytes bytes (encoding : Encoding = Encoding.default) (on_problems : Problem_Behavior = Problem_Behavior.Report_Error) =
-    result = Encoding_Utils.from_bytes bytes encoding.to_java_charset_or_null : WithProblems
-    if result.problems.is_empty then result.result else
-        problems = result.problems.map decoding_problem->
-            Encoding_Error.Error decoding_problem.message
-        on_problems.attach_problems_after result.result problems
+    Reporting_Stream_Decoder_Helper.decode_bytes_to_text bytes encoding on_problems

 ## ICON convert
    Returns a vector containing bytes representing the UTF-8 encoding of the

@@ -29,13 +29,13 @@ import project.System.File_Format.Infer
 import project.System.File_Format.Plain_Text_Format
 import project.System.File_Format_Metadata.File_Format_Metadata
 import project.System.Input_Stream.Input_Stream
+import project.System.Advanced.Restartable_Input_Stream.Restartable_Input_Stream
 from project.Data.Boolean import Boolean, False, True
 from project.Data.Text.Extensions import all
 from project.Metadata.Choice import Option
 from project.Metadata.Widget import Single_Choice
 from project.System.File_Format import format_types

-polyglot java import java.io.ByteArrayInputStream
 polyglot java import java.io.InputStream

 ## PRIVATE
@@ -62,50 +62,27 @@ type Response_Body
     Raw_Stream (raw_stream:Input_Stream) (metadata:File_Format_Metadata) uri:URI

     ## PRIVATE
-    Materialized_Byte_Array (bytes:Vector) (metadata:File_Format_Metadata) uri:URI
-
-    ## PRIVATE
-    Materialized_Temporary_File (temporary_file:Temporary_File) (metadata:File_Format_Metadata) uri:URI
+    Materialized_Stream (restartable_stream:Restartable_Input_Stream) (metadata:File_Format_Metadata) uri:URI

     ## PRIVATE
     with_stream : (Input_Stream -> Any ! HTTP_Error) -> Any ! HTTP_Error
     with_stream self action = case self of
         Response_Body.Raw_Stream raw_stream _ _ ->
             Managed_Resource.bracket raw_stream (_.close) action
-        Response_Body.Materialized_Byte_Array bytes _ _ ->
-            byte_stream = Input_Stream.new (ByteArrayInputStream.new bytes) (HTTP_Error.handle_java_exceptions self.uri)
-            Managed_Resource.bracket byte_stream (_.close) action
-        Response_Body.Materialized_Temporary_File temporary_file _ _ -> temporary_file.with_file file->
-            opts = [File_Access.Read.to_java]
-            stream = HTTP_Error.handle_java_exceptions self.uri (file.input_stream_builtin opts)
-            file_stream = Input_Stream.new stream (HTTP_Error.handle_java_exceptions self.uri) associated_file=temporary_file
-            Managed_Resource.bracket (file_stream) (_.close) action
+        Response_Body.Materialized_Stream restartable_stream _ _ ->
+            restartable_stream.with_fresh_stream action

     ## PRIVATE
        ADVANCED
        Materializes the stream into either a byte array or a temporary file and
        return a new Response_Body.
     materialize : Input_Stream
     materialize self = case self of
-        Response_Body.Raw_Stream _ _ _ ->
-            self.with_stream body_stream->
-                body_stream.with_java_stream body_java_stream->
-                    first_block = body_java_stream.readNBytes maximum_body_in_memory
-                    case first_block.length < maximum_body_in_memory of
-                        True -> Response_Body.Materialized_Byte_Array (Vector.from_polyglot_array first_block) self.metadata self.uri
-                        False -> Context.Output.with_enabled <|
-                            ## Write contents to a temporary file
-                            temp_file = Temporary_File.new self.uri.host
-                            r = temp_file.with_file file->
-                                file.with_output_stream [File_Access.Write, File_Access.Create, File_Access.Truncate_Existing] output_stream->
-                                    output_stream.with_java_stream java_output_stream->
-                                        java_output_stream.write first_block
-                                        body_java_stream.transferTo java_output_stream
-                                        java_output_stream.flush
-                                        Nothing
-                            r.if_not_error <|
-                                Response_Body.Materialized_Temporary_File temp_file self.metadata self.uri
-        _ -> self
+        Response_Body.Raw_Stream _ metadata uri ->
+            restartable_stream = self.with_stream body_stream->
+                body_stream.as_restartable_stream
+            Response_Body.Materialized_Stream restartable_stream metadata uri
+        Response_Body.Materialized_Stream _ _ _ -> self

     ## ALIAS parse
        GROUP Input

@@ -37,8 +37,12 @@ type Managed_Resource
        function once it is no longer in use.

        Arguments:
+       - resource: The resource to register.
        - function: The action to be executed on resource to clean it up when
          it is no longer in use.
+
+       Returns:
+       A `Managed_Resource` object that can be used to access the resource.
     register : Any -> (Any -> Nothing) -> Managed_Resource
     register resource function = @Builtin_Method "Managed_Resource.register"


@@ -0,0 +1,105 @@
+import project.Any.Any
+import project.Data.Text.Text
+import project.Data.Vector.Vector
+import project.Nothing.Nothing
+import project.Runtime.Context
+import project.Runtime.Managed_Resource.Managed_Resource
+import project.System.File.Advanced.Temporary_File.Temporary_File
+import project.System.File.File
+import project.System.File.File_Access.File_Access
+import project.System.Input_Stream.Input_Stream
+from project.Data.Boolean import Boolean, False, True
+
+## PRIVATE
+   An input stream that can be read multiple times.
+
+   It may be useful when multiple passes over the data are required.
+   If you need to check only the beginning of the stream, consider using a much
+   lighter `Input_Stream.as_peekable_stream`.
+
+   A generic stream can be converted to `Restartable_Input_Stream` by reading
+   all its contents and storing them either in memory or in a temporary file.
+   A stream backed by an existing file can be converted to
+   `Restartable_Input_Stream` at no cost.
+
+   ! Stream Lifetime
+
+     Note that if we use an existing file as a shortcut to avoid copying the
+     data, we need to assume that the file will not be modified in the meantime.
+     Thus the `Restartable_Input_Stream` does not fully guarantee immutability
+     of the data. The lifetime of such `Restartable_Input_Stream` is tied to the
+     lifetime of its backing file.
+
+     If the stream should stay usable for a longer time, `extend_lifetime=True`
+     should be passed when creating it.
+type Restartable_Input_Stream
+    ## PRIVATE
+       `bytes` may be a Vector or a raw `byte[]` array (convertible to vector, but no annotation to avoid conversions).
+    private From_Bytes bytes
+
+    ## PRIVATE
+    private From_Existing_File file:File
+
+    ## PRIVATE
+    private From_Temporary_File temporary_file:Temporary_File
+
+    ## PRIVATE
+    to_text self -> Text =
+        suffix = case self of
+            Restartable_Input_Stream.From_Bytes _ -> "From_Bytes"
+            Restartable_Input_Stream.From_Existing_File file -> "From_Existing_File "+file.to_text
+            Restartable_Input_Stream.From_Temporary_File _ -> "From_Temporary_File"
+        "Restartable_Input_Stream."+suffix
+
+    ## PRIVATE
+    make (input_stream : Input_Stream) (extend_lifetime : Boolean) -> Restartable_Input_Stream =
+        case input_stream.associated_source of
+            temp_file : Temporary_File -> Restartable_Input_Stream.From_Temporary_File temp_file
+            file : File ->
+                if extend_lifetime then cache_generic_input_stream input_stream else
+                    Restartable_Input_Stream.From_Existing_File file
+            bytes : Vector -> Restartable_Input_Stream.From_Bytes bytes
+            _ -> cache_generic_input_stream input_stream
+
+    ## PRIVATE
+       Runs the provided action with a fresh input stream pointing to the
+       beginning of the data represented by this stream.
+
+       This method may be called multiple times, allowing multiple 'rounds' of
+       processing.
+    with_fresh_stream self (action : Input_Stream -> Any) -> Any =
+        case self of
+            Restartable_Input_Stream.From_Bytes bytes ->
+                Managed_Resource.bracket (Input_Stream.from_bytes bytes) (.close) action
+            Restartable_Input_Stream.From_Existing_File file ->
+                file.with_input_stream [File_Access.Read] action
+            Restartable_Input_Stream.From_Temporary_File temp_file ->
+                temp_file.with_file file->
+                    file.with_input_stream [File_Access.Read] action
+
+## PRIVATE
+   Maximum size for a stream to be held in memory.
+   If the amount of data exceeds this limit, it will be stored in a temporary file.
+max_in_memory_size =
+    # 64 KiB
+    64 * 1024
+
+## PRIVATE
+private cache_generic_input_stream (input_stream : Input_Stream) -> Restartable_Input_Stream =
+    input_stream.with_java_stream java_input_stream->
+        first_block = java_input_stream.readNBytes max_in_memory_size
+        case first_block.length < max_in_memory_size of
+            True ->
+                Restartable_Input_Stream.From_Bytes first_block
+            False ->
+                Context.Output.with_enabled <|
+                    temp_file = Temporary_File.new "restartable-input-stream"
+                    r = temp_file.with_file file->
+                        file.with_output_stream [File_Access.Write, File_Access.Create, File_Access.Truncate_Existing] output_stream->
+                            output_stream.with_java_stream java_output_stream->
+                                java_output_stream.write first_block
+                                java_input_stream.transferTo java_output_stream
+                                java_output_stream.flush
+                                Nothing
+                    r.if_not_error <|
+                        Restartable_Input_Stream.From_Temporary_File temp_file
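
The caching strategy in `cache_generic_input_stream` above (read one block of up to 64 KiB; keep it in memory if the stream ended early, otherwise spill the block and the rest of the stream to a temporary file) can be sketched in Python as below. This is an illustrative analogue under stated assumptions, not the Enso code: the names `RestartableStream`, `read_n_bytes`, `fresh_stream`, and `in_memory` are hypothetical, and the existing-file shortcut and `extend_lifetime` handling from `make` are left out.

```python
import io
import os
import shutil
import tempfile
from contextlib import contextmanager

MAX_IN_MEMORY_SIZE = 64 * 1024  # 64 KiB, same threshold as max_in_memory_size


def read_n_bytes(stream, n):
    """Read up to n bytes, looping until n bytes arrive or the stream ends."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class RestartableStream:
    """Hypothetical analogue of Restartable_Input_Stream, for illustration only."""

    def __init__(self, source, limit=MAX_IN_MEMORY_SIZE):
        first_block = read_n_bytes(source, limit)
        if len(first_block) < limit:
            # The whole stream fit below the limit: keep it in memory.
            self._bytes = first_block
            self._path = None
        else:
            # Spill the first block plus the rest of the stream to a temp file.
            self._bytes = None
            fd, self._path = tempfile.mkstemp(prefix="restartable-input-stream")
            with os.fdopen(fd, "wb") as out:
                out.write(first_block)
                shutil.copyfileobj(source, out)

    @property
    def in_memory(self):
        return self._path is None

    @contextmanager
    def fresh_stream(self):
        """Yield a new stream positioned at the start; may be used repeatedly."""
        if self._path is None:
            with io.BytesIO(self._bytes) as stream:
                yield stream
        else:
            with open(self._path, "rb") as stream:
                yield stream

    def close(self):
        """Delete the backing temporary file, if there is one."""
        if self._path is not None:
            os.remove(self._path)
            self._path = None


if __name__ == "__main__":
    cached = RestartableStream(io.BytesIO(b"hello world"), limit=16)
    for _ in range(2):
        with cached.fresh_stream() as stream:
            print(stream.read())
    cached.close()
```

As in the Enso code, a stream of exactly `limit` bytes goes to the temporary file, because only a short first read proves that the whole stream has been consumed.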

@@ -246,11 +246,11 @@ type File
              file.with_input_stream [File_Access.Create, File_Access.Read] action
     with_input_stream : Vector File_Access -> (Input_Stream -> Any ! File_Error) -> Any ! File_Error
     with_input_stream self (open_options : Vector) action =
-        new_input_stream : File -> Vector File_Access -> Output_Stream ! File_Error
+        new_input_stream : File -> Vector File_Access -> Input_Stream ! File_Error
         new_input_stream file open_options =
             opts = open_options . map (_.to_java)
             stream = File_Error.handle_java_exceptions file (file.input_stream_builtin opts)
-            Input_Stream.new stream (File_Error.handle_java_exceptions self)
+            Input_Stream.new stream (File_Error.handle_java_exceptions self) associated_source=self

         if self.is_directory then Error.throw (File_Error.IO_Error self "File '"+self.path+"' is a directory") else
             open_as_data_link = (open_options.contains Data_Link_Access.No_Follow . not) && (Data_Link.is_data_link self)

@@ -96,7 +96,7 @@ type Temporary_File
        If the stream is already backed by a temporary or regular file, that file is returned.
     from_stream_light : Input_Stream -> Temporary_File | File
     from_stream_light stream =
-        case stream.associated_file of
+        case stream.associated_source of
             tmp : Temporary_File -> tmp
             file : File -> file
             _ -> Temporary_File.from_stream stream