SAS date formats implementation #86

Merged
merged 7 commits into from
Feb 11, 2021


xantorohara
Contributor

This change adds SAS-like formatting of dates, times and date-times.
I'll put the details in the comments below.
Please don't merge it until I have described the "what and why" here.

@xantorohara
Contributor Author

xantorohara commented Feb 10, 2021

Formats

SAS has a huge number of temporal formats. Parso supports most of them (but not all) in terms of parsing (when the result is represented as epoch seconds or as a Java Date object). Formatting dates is a tricky task, and for now Parso supports fewer output formats than it can parse. Here are the lists of supported formats:

Date formats

Implemented:

  • DATE
  • DAY
  • DDMMYY
  • DDMMYYB
  • DDMMYYC
  • DDMMYYD
  • DDMMYYN
  • DDMMYYP
  • DDMMYYS
  • MMDDYY
  • MMDDYYB
  • MMDDYYC
  • MMDDYYD
  • MMDDYYN
  • MMDDYYP
  • MMDDYYS
  • YYMMDD
  • YYMMDDB
  • YYMMDDC
  • YYMMDDD
  • YYMMDDN
  • YYMMDDP
  • YYMMDDS
  • MMYY
  • MMYYC
  • MMYYD
  • MMYYN
  • MMYYP
  • MMYYS
  • YYMM
  • YYMMC
  • YYMMD
  • YYMMN
  • YYMMP
  • YYMMS
  • JULIAN
  • JULDAY
  • MONTH
  • YEAR
  • MONYY
  • YYMON
  • B8601DA
  • E8601DA
  • MONNAME
  • WEEKDATE
  • WEEKDATX
  • WEEKDAY
  • DOWNAME
  • WORDDATE
  • WORDDATX
  • QTR

Date-time formats

Implemented:

  • DATETIME
  • B8601DN
  • E8601DN
  • DTDATE
  • DTMONYY
  • DTYEAR
  • TOD

Not implemented:

  • B8601DT
  • B8601DX
  • B8601DZ
  • B8601LX
  • E8601DT
  • E8601DX
  • E8601DZ
  • E8601LX
  • DATEAMPM
  • DTWKDATX
  • MDYAMPM

Time formats

Implemented:

  • TIME
  • MMSS
  • HHMM
  • HOUR
  • E8601TM

Not implemented:

  • TIMEAMPM
  • E8601LZ

@xantorohara

How to use it

This change adds two new options into the OutputDateType enum:

  • SAS_FORMAT_EXPERIMENTAL
  • SAS_FORMAT_TRIM_EXPERIMENTAL

The first one outputs dates using the full width specified in the format. The second one trims leading spaces (the same way SAS Universal Viewer shows dates by default, or with the "Trim formatted values" option checked).
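As a rough illustration of the difference between the two options, here is a minimal sketch. It is not Parso code: the class, the method names and the sample value "11FEB21" are made up for the example, and it only assumes that a formatted value is right-aligned within the format width.

```java
class TrimDemo {
    // Right-aligns a value within the given width by padding with leading spaces.
    // Assumes width >= 1 (hypothetical helper, not part of Parso).
    static String fullWidth(String value, int width) {
        return String.format("%" + width + "s", value);
    }

    // Same value, but with the leading padding removed.
    static String trimmed(String value, int width) {
        return fullWidth(value, width).replaceFirst("^ +", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + fullWidth("11FEB21", 10) + "]"); // [   11FEB21]
        System.out.println("[" + trimmed("11FEB21", 10) + "]");   // [11FEB21]
    }
}
```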

These options have the "_EXPERIMENTAL" suffix to signal that this is not a final solution and that something may change in the future
(new formats can be added, some algorithms may be reworked, and so on). Think of it as a kind of beta implementation.

The reader class can be created as before, but with one of the new options. For example:

SasFileReader reader = new SasFileReaderImpl(is, null, SAS_FORMAT_EXPERIMENTAL);

After that dates will be produced as formatted strings.
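To show what "formatted strings" means compared to raw values, here is a small standalone sketch. It relies on the fact that SAS stores a date as the number of days since 1 January 1960. The class name, the helper names and the "ddMMMyyyy" pattern (an approximation of DATE9.) are assumptions for illustration and are not taken from the Parso sources.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class SasDateDemo {
    // SAS date values count days from this day.
    static final LocalDate SAS_EPOCH = LocalDate.of(1960, 1, 1);

    // Approximation of a DATE9.-style layout; an assumption for this example only.
    static final DateTimeFormatter DATE9_LIKE =
            DateTimeFormatter.ofPattern("ddMMMyyyy", Locale.ENGLISH);

    // Converts a raw SAS date value (days since 1960-01-01) to a LocalDate.
    static LocalDate toLocalDate(long sasDays) {
        return SAS_EPOCH.plusDays(sasDays);
    }

    // Formats the raw value as an upper-case string such as 01JAN1960.
    static String format(long sasDays) {
        return toLocalDate(sasDays).format(DATE9_LIKE).toUpperCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(format(0));   // 01JAN1960
        System.out.println(format(366)); // 01JAN1961 (1960 is a leap year)
    }
}
```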

@xantorohara

Test datasets

To have all possible formats at hand, a lot of datasets were created.

Each dataset consists of:

  • SAS program to generate the dataset
  • Dataset in ".sas7bdat" format (produced by the SAS program)
  • Dataset in ".tsv" text format (".sas7bdat" file was opened in SAS Universal Viewer and the formatted result was copy-pasted into the tab-separated file)

These datasets are needed for testing purposes and are placed in the "src/test/resources/dates/sas" directory
(files like "date_format_dtmonyy.sas", "date_format_dtmonyy.sas7bdat", "date_format_dtdate.tsv").

Some formats have two datasets:

  • one with a small number of manually selected key dates
  • another with a large number of auto-generated dates (to prove the implementation on a large population)

For now these datasets contain about one thousand date format variations (including width and precision combinations) and tens of thousands of date samples.

@xantorohara

Rounding issues

There is a slight difference between SAS and Parso results: the rounding of fractional values.
SAS is really strange. Sometimes its rounding rules don't follow any visible logic, and we don't have a chance to look into the SAS code that formats dates. Moreover, different formats have their own nuances.

For example:
The expected result for 0.55 rounded half-up to precision 1 is 0.6.
But in SAS it is sometimes 0.5 and sometimes 0.6. And sometimes it is 0.4. ¯\_(ツ)_/¯
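For reference, this is what plain half-up rounding looks like in Java. The sketch also shows one general way a rounded result can shift: rounding the exact binary value of a double instead of the decimal literal. This is only an illustration of the arithmetic. Whether anything like it explains the SAS behavior is unknown, and the class and method names are made up for the example.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class RoundingDemo {
    // Rounds the decimal literal exactly as written, half-up.
    static BigDecimal roundDecimal(String literal, int precision) {
        return new BigDecimal(literal).setScale(precision, RoundingMode.HALF_UP);
    }

    // Rounds the exact binary value stored in the double, half-up.
    // new BigDecimal(double) keeps the full binary expansion of the value.
    static BigDecimal roundBinary(double value, int precision) {
        return new BigDecimal(value).setScale(precision, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(roundDecimal("0.55", 1)); // 0.6
        System.out.println(roundDecimal("0.15", 1)); // 0.2
        // The double 0.15 is stored as 0.1499999999999999944..., so it rounds down.
        System.out.println(roundBinary(0.15, 1));    // 0.1
    }
}
```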

I researched a lot, but for some formats I did not find a combination of arithmetic operations that gives exactly the same result as SAS. For some formats I found ways to get nearly the same result, but the resulting code was huge, non-obvious and hard to understand, so I rolled it back in favor of simpler source code.

Anyway, the implemented solution produces values that are more correct in terms of arithmetic, but the result sometimes differs from SAS.

@xantorohara

Unit tests

All code created in the scope of this implementation is 100% covered by unit tests.

The unit tests compare the original SAS-formatted dates with the Parso-formatted dates and check that they are equal.

In the comment above I mentioned rounding issues. The unit tests try to find and bypass cases where the rounded results differ. Some unit tests can "skip" a rounding issue if the actual and expected numbers differ by no more than 1 in the least significant position.

Tests report such cases:

Ignored [308] ([0.28%]) SAS rounding bugs for [date_format_tod_loop]
Ignored [59] ([0.04%]) SAS rounding bugs for [date_format_hhmm_loop]
Ignored [4] ([0.0%]) SAS rounding bugs for [date_format_hour_loop]
Ignored [5] ([0.0%]) SAS rounding bugs for [date_format_mmss_loop]
Ignored [14] ([0.36%]) SAS rounding bugs for [date_format_time]
Ignored [43] ([0.02%]) SAS rounding bugs for [date_format_time_loop]
Ignored [10] ([0.51%]) SAS rounding bugs for [date_format_e8601tm]
Ignored [4] ([0.02%]) SAS rounding bugs for [date_format_e8601tm_loop]

Such values make up less than 1% of each dataset, but for some users this may not be acceptable.
So again, this implementation is marked as EXPERIMENTAL.
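The tolerance rule used by the tests (a mismatch is skipped when the values differ by no more than 1 in the least significant position) can be sketched roughly as follows. This is a simplified illustration for plain decimal strings, not the actual test code. The class and method names are hypothetical, and values containing separators such as colons would need extra handling.

```java
import java.math.BigDecimal;

class RoundingTolerance {
    // Returns true when both values have the same number of fractional digits
    // and differ by at most one unit in the last decimal position.
    static boolean withinOneUnit(String expected, String actual) {
        BigDecimal e = new BigDecimal(expected);
        BigDecimal a = new BigDecimal(actual);
        if (e.scale() != a.scale()) {
            return false; // different precision: treat as a real mismatch
        }
        // One unit in the last position, e.g. 0.1 for scale 1, 0.01 for scale 2.
        BigDecimal unit = BigDecimal.ONE.movePointLeft(e.scale());
        return e.subtract(a).abs().compareTo(unit) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(withinOneUnit("0.6", "0.5"));     // true
        System.out.println(withinOneUnit("0.6", "0.4"));     // false
        System.out.println(withinOneUnit("12.35", "12.34")); // true
    }
}
```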

@xantorohara

I've also updated Rocket table to use this unreleased version of Parso.
Version "rocket-table-1.1.3-beta.zip" now also supports the format options.

It can be enabled using the "--sas-date-format-type" command line option. For example:

java -jar rocket-table.jar --sas-date-format-type=SAS_FORMAT_TRIM_EXPERIMENTAL

So it is possible to visually explore how Parso now formats dates with this new feature.

Next, I will add a UI control to switch the format ON/OFF.

@xantorohara

Fallback formats

As I mentioned in the first comment, not all declared formats are implemented.
For example, Parso can read dates with the "DATEAMPM" format from a ".sas7bdat" file. It can show such a date as a Java Date, as epoch seconds or as a raw SAS value. But a formatter for "DATEAMPM" is not yet implemented.

In such cases a fallback format is used to format dates:

  • DATE7. - for dates
  • TIME8. - for times
  • DATETIME16. - for date-times
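The fallback rule above can be sketched as a simple lookup. This is an illustration only: the class, the Kind enum, the resolve method and the small IMPLEMENTED set (a subset of the format names listed in the first comment) are hypothetical and do not mirror the actual Parso classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class FallbackDemo {
    enum Kind { DATE, TIME, DATETIME }

    // A few implemented format names, for illustration only.
    static final Set<String> IMPLEMENTED = new HashSet<>(
            Arrays.asList("DATE", "DDMMYY", "TIME", "HHMM", "DATETIME", "DTDATE"));

    // Returns the format to apply: the requested one if it is implemented,
    // otherwise the fallback for the given kind of value.
    static String resolve(String name, int width, Kind kind) {
        if (IMPLEMENTED.contains(name)) {
            return name + width + ".";
        }
        switch (kind) {
            case DATE:
                return "DATE7.";
            case TIME:
                return "TIME8.";
            default:
                return "DATETIME16.";
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("DDMMYY", 10, Kind.DATE));       // DDMMYY10.
        System.out.println(resolve("DATEAMPM", 22, Kind.DATETIME)); // DATETIME16.
        System.out.println(resolve("TIMEAMPM", 11, Kind.TIME));     // TIME8.
    }
}
```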

There is also a QTRR SAS format used in the test sources, which is neither declared in the main Parso code nor implemented as a format. It is only used in unit tests, to check how Parso handles unknown formats.

@xantorohara

Performance and thread-safety

Formats are declared as Java Enums:

  • SasDateFormat
  • SasDateTimeFormat
  • SasTimeFormat

Each enum element produces a format function for the given width and precision. This function is a kind of closure; it holds pre-calculated patterns, adjusted precision or other things that can be calculated once for the specific format.

SasTemporalFormatter caches all these closures and uses them to format dates.
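The idea of an enum element producing a closure, with a formatter caching those closures, can be sketched as below. This is a simplified illustration under stated assumptions, not the Parso implementation: DemoDateFormat and DemoFormatter are hypothetical stand-ins, the two patterns approximate the ISO 8601 extended and basic date layouts, and the width handling is reduced to left-padding.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class FormatterCacheDemo {
    // Hypothetical stand-in for an enum such as SasDateFormat.
    enum DemoDateFormat {
        E8601DA("yyyy-MM-dd"),
        B8601DA("yyyyMMdd");

        private final String pattern;

        DemoDateFormat(String pattern) {
            this.pattern = pattern;
        }

        // Builds the DateTimeFormatter once and captures it in the returned closure.
        Function<LocalDate, String> createFormatFunction(int width) {
            DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern, Locale.ROOT);
            return date -> {
                String text = date.format(formatter);
                // Simplification: pad to the width; real width rules are format-specific.
                return text.length() >= width ? text : String.format("%" + width + "s", text);
            };
        }
    }

    // Hypothetical stand-in for the caching role described for SasTemporalFormatter.
    static class DemoFormatter {
        private final Map<String, Function<LocalDate, String>> cache = new HashMap<>();
        int created = 0; // counts how many closures were actually built

        String format(DemoDateFormat format, int width, LocalDate date) {
            String key = format.name() + ":" + width;
            Function<LocalDate, String> fn = cache.computeIfAbsent(key, k -> {
                created++;
                return format.createFormatFunction(width);
            });
            return fn.apply(date);
        }
    }

    public static void main(String[] args) {
        DemoFormatter formatter = new DemoFormatter();
        LocalDate date = LocalDate.of(2021, 2, 11);
        System.out.println(formatter.format(DemoDateFormat.E8601DA, 10, date)); // 2021-02-11
        System.out.println(formatter.format(DemoDateFormat.E8601DA, 10, date)); // 2021-02-11 (cached closure)
        System.out.println(formatter.created); // 1
    }
}
```

The plain HashMap is deliberate in this sketch: it is only safe when each formatter instance is used from a single thread.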

Formatting is not as fast as presenting a date as a raw value, epoch seconds or a Java Date. The BigDecimal, DecimalFormat and DateTimeFormatter Java classes are involved in the formatting, so it has some overhead compared to plain arithmetic operations.

Normally each instance of SasFileParser has its own instance of SasTemporalFormatter. SasFileParser itself is single-threaded, so no thread-safety issues are expected here.

@xantorohara

Looks like this is all I was going to say. Now it can be reviewed.

@xantorohara

Cleaned it up a bit, removed commented-out unused lines.

@printsev printsev merged commit 246ea55 into epam:master Feb 11, 2021