Optimize the performance of TimestampParser#parse(String) #145

muga · 2015-03-18T20:58:57Z

muga · 2015-03-27T01:46:34Z

After fixing the page size, I profiled the performance again, using the local file input plugin and the null output plugin. Timestamp parsing consumes about 90% of the CPU time.

frsyuki · 2015-04-09T18:01:00Z

One implementation idea is to write the timestamp parser in Java. JRuby implements its timestamp formatter in Java, and the parser should be able to follow the same design as the formatter.
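To make the idea concrete, here is a minimal sketch of a parser that compiles the format string once and then reuses the compiled form for every row. Everything in it is an assumption for illustration: the class name `CompiledTimestampParser`, the six supported directives (`%Y %m %d %H %M %S`), fixed-width fields, and treating input as UTC. It is not JRuby's or Embulk's code, and it requires Java 8 for `java.time`.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

/** Hypothetical sketch: compile a strptime-style format once, then reuse it for every row. */
final class CompiledTimestampParser {
    // Directive letters, in the same order as the field array used in parseEpochSecond.
    private static final String SUPPORTED = "YmdHMS";

    // Each entry is a directive letter (flagged in isDirective) or a literal character.
    private final char[] tokens;
    private final boolean[] isDirective;
    private final int count;

    CompiledTimestampParser(String format) {
        tokens = new char[format.length()];
        isDirective = new boolean[format.length()];
        int n = 0;
        for (int i = 0; i < format.length(); i++) {
            char c = format.charAt(i);
            if (c == '%') {
                if (i + 1 >= format.length() || SUPPORTED.indexOf(format.charAt(i + 1)) < 0) {
                    throw new IllegalArgumentException("unsupported format: " + format);
                }
                tokens[n] = format.charAt(++i);
                isDirective[n] = true;
            } else {
                tokens[n] = c;
            }
            n++;
        }
        count = n;
    }

    /** Parses text as UTC and returns seconds since the epoch. */
    long parseEpochSecond(String text) {
        int[] f = {1970, 1, 1, 0, 0, 0}; // year, month, day, hour, minute, second
        int pos = 0;
        for (int t = 0; t < count; t++) {
            if (!isDirective[t]) {
                // Literal characters must match the input exactly.
                if (pos >= text.length() || text.charAt(pos) != tokens[t]) throw bad(text, pos);
                pos++;
                continue;
            }
            int width = tokens[t] == 'Y' ? 4 : 2; // fixed-width fields only
            if (pos + width > text.length()) throw bad(text, pos);
            int value = 0;
            for (int i = 0; i < width; i++) {
                int digit = text.charAt(pos++) - '0';
                if (digit < 0 || digit > 9) throw bad(text, pos - 1);
                value = value * 10 + digit;
            }
            f[SUPPORTED.indexOf(tokens[t])] = value;
        }
        if (pos != text.length()) throw bad(text, pos);
        return LocalDateTime.of(f[0], f[1], f[2], f[3], f[4], f[5]).toEpochSecond(ZoneOffset.UTC);
    }

    private static IllegalArgumentException bad(String text, int pos) {
        return new IllegalArgumentException("cannot parse '" + text + "' at index " + pos);
    }
}
```

A caller would build one `new CompiledTimestampParser("%Y-%m-%d %H:%M:%S")` per column and call `parseEpochSecond` per value. A real implementation would also need time zones, variable-width fields, text directives such as month names, and sub-second precision.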

hito4t · 2015-04-24T16:11:10Z

I've added issue #164, which may be related to this issue.
Please check the issue.

hata · 2015-04-28T06:04:35Z

Hello.

I would like to share what I found about the performance of TimestampParser.

I compared a Java implementation with the original JRuby implementation. I measured with the time command, and the results are below (all tests ran on Mac OS X, Intel Core i5/2.7GHz):

The CSV file contains one column of timestamp text such as '1989-07-25 07:24:36'.
The file size is 100MB and the number of rows is 5000000.

When the file was tested with embulk 0.6.5, embulk finished in about 4 minutes 48 seconds in my environment.

real    4m48.649s
user    4m56.812s
sys     0m3.264s

When the same file was tested with embulk 0.6.5 plus my change (26ea959), embulk finished in about 13.4 seconds.

real    0m13.403s
user    0m17.465s
sys     0m0.535s

Given this, timestamp parsing looks worth improving. The code I tested parses some patterns in Java, so it may help users who only need patterns supported by SimpleDateFormat.
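For reference, the two wall-clock times above work out to roughly 17,300 rows per second before the change and roughly 373,000 rows per second after it, a speedup of about 21.5 times. The small sketch below only redoes that arithmetic from the reported "real" times; the class and method names are made up for illustration.

```java
/** Hypothetical helper that turns the reported "real" times into throughput figures. */
final class BenchmarkSummary {
    static final long ROWS = 5_000_000L;

    /** Converts a "XmY.YYYs" reading from the time command into seconds. */
    static double seconds(int minutes, double secs) {
        return minutes * 60 + secs;
    }

    static double rowsPerSecond(double elapsedSeconds) {
        return ROWS / elapsedSeconds;
    }

    public static void main(String[] args) {
        double before = seconds(4, 48.649);  // embulk 0.6.5
        double after = seconds(0, 13.403);   // embulk 0.6.5 + 26ea959
        System.out.printf("before:  %.0f rows/s%n", rowsPerSecond(before));
        System.out.printf("after:   %.0f rows/s%n", rowsPerSecond(after));
        System.out.printf("speedup: %.1fx%n", before / after);
    }
}
```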

Sample testcase

Generate a timestamp CSV file.

import java.io.FileWriter;
import java.io.PrintWriter;
import java.text.SimpleDateFormat;
import java.util.Date;

public class GenTimestampColumns
{
    private static final long TIME_INIT_MSEC = 100000000;
    private static final long TIME_STEP_MSEC = 123456;
    private static final String RESULT_FILE_NAME = "result.csv";

    // Usage: java GenTimestampColumns <rowCount> <pattern> [<pattern> ...]
    // Example pattern: yyyy-MM-dd HH:mm:ss  ->  2015-01-27 19:23:49
    // (HH is the 24-hour field; hh would write a 12-hour value with no AM/PM marker.)
    public static void main(String[] args) throws Exception {
        System.out.println("Generate Timestamp Columns ...");
        int rowCount = Integer.parseInt(args[0]);
        SimpleDateFormat[] formats = new SimpleDateFormat[args.length - 1];

        // One formatter per output column.
        for (int i = 1; i < args.length; i++) {
            formats[i - 1] = new SimpleDateFormat(args[i]);
        }

        long currentTime = TIME_INIT_MSEC;

        // try-with-resources closes the file even if formatting fails.
        try (PrintWriter writer = new PrintWriter(new FileWriter(RESULT_FILE_NAME))) {
            StringBuilder buffer = new StringBuilder();
            for (long lineNum = 0; lineNum < rowCount; lineNum++) {
                buffer.setLength(0);
                for (SimpleDateFormat format : formats) {
                    if (buffer.length() > 0) {
                        buffer.append(",");
                    }
                    buffer.append(format.format(new Date(currentTime)));
                }
                writer.println(buffer.toString());
                currentTime += TIME_STEP_MSEC;
            }
        }

        System.out.println("Finished(output file:" + RESULT_FILE_NAME + ") ...");
    }
}

Example script to generate the sample timestamp CSV file. The following script generates a 100MB test file.

#!/bin/sh

ROW_COUNT=5000000

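# HH (24-hour) matches the %H directive used in config.yml; hh would be the 12-hour field.
DATE_FORMAT="yyyy-MM-dd HH:mm:ss"

# javac -d requires the output directory to exist.
mkdir -p classes
javac -d classes src/*.java
java -classpath classes GenTimestampColumns "$ROW_COUNT" "$DATE_FORMAT"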

For example, use a config.yml file like the following to test the generated file.

exec: {}
in:
  type: file
  path_prefix: <Set timestamp file path>
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: ''
    skip_header_lines: 0
    columns:
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
out: {type: 'null'}

Sample change to improve performance for some patterns

My enhancement is in 26ea959. It speeds up parsing for patterns that SimpleDateFormat supports. It does not handle unsupported patterns such as micro/nanoseconds.
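As an illustration of this kind of fast path (not the code in 26ea959, which is not reproduced here), the sketch below translates a small subset of strptime directives into a SimpleDateFormat pattern and returns null for anything else, so a caller could fall back to the existing JRuby parser. The class name, method names, and the supported subset are assumptions.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical sketch: use SimpleDateFormat when the strptime format is simple enough. */
final class StrptimeToSimpleDateFormat {
    /** Returns an equivalent SimpleDateFormat pattern, or null if the format is unsupported. */
    static String translate(String strptimeFormat) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < strptimeFormat.length(); i++) {
            char c = strptimeFormat.charAt(i);
            if (c == '%') {
                if (++i >= strptimeFormat.length()) return null;
                switch (strptimeFormat.charAt(i)) {
                    case 'Y': out.append("yyyy"); break;
                    case 'm': out.append("MM"); break;
                    case 'd': out.append("dd"); break;
                    case 'H': out.append("HH"); break;
                    case 'M': out.append("mm"); break;
                    case 'S': out.append("ss"); break;
                    default: return null; // e.g. %N (nanoseconds): caller should fall back
                }
            } else if (Character.isLetter(c) || c == '\'') {
                // Literal letters and quotes need quoting in SimpleDateFormat;
                // this sketch simply treats them as unsupported.
                return null;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Parses text as UTC and returns epoch milliseconds; only for supported formats. */
    static long parseUtcMillis(String strptimeFormat, String text) {
        String pattern = translate(strptimeFormat);
        if (pattern == null) {
            throw new UnsupportedOperationException("no fast path for " + strptimeFormat);
        }
        // SimpleDateFormat is not thread-safe, so real code should keep one per thread.
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false);
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

In real use the translated pattern and the SimpleDateFormat instance would be created once per column rather than per value; they are rebuilt on each call here only to keep the sketch short.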

I considered making a pull request for my change but decided not to, because it would need more checking of locale-related behavior (would it be better to set the default locale to 'en'?) and more test patterns. I guess this performance issue is still under investigation and my code only improves some patterns, so this is just FYI.
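The locale question matters for text fields such as month names. A minimal sketch, assuming the parser should behave the same on every machine: pass an explicit Locale (and time zone) to SimpleDateFormat instead of relying on the JVM defaults. The class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical sketch: pin locale and time zone so parsing does not depend on JVM defaults. */
final class LocalePinnedParse {
    static long parseEnglishUtcMillis(String pattern, String text) {
        // new SimpleDateFormat(pattern) without a Locale would use the JVM default locale,
        // so an input like "25 Jul 1989" may fail on a machine whose default is not English.
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseEnglishUtcMillis("dd MMM yyyy", "25 Jul 1989"));
    }
}
```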

frsyuki · 2015-04-28T07:32:59Z

Thank you for the information. Your benchmark result reaches the same conclusion as @muga's CPU profile result.

Here is what @muga and I have in mind:

Writing fast code is easy, but writing precise code is difficult. Maintaining the code is also difficult.
JRuby has a problem where Time.strptime can't parse nanoseconds, although CRuby can. This is one of the incompatibilities with CRuby that should be fixed.
It would be great if we could contribute new, fast Time.strptime code to the JRuby project, so that the parser implementation is improved for all JRuby users, a much larger community than Embulk users.
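On the nanosecond point, note that java.util.Date (and therefore SimpleDateFormat) only carries millisecond precision, so a Java parser needs a different result type to keep nanoseconds. The sketch below shows one possibility using Java 8's java.time, which is an assumption here since it may not be available on every Java version Embulk has to support. The class name and pattern are illustrative only.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical sketch: keep nanosecond precision by parsing into java.time.Instant. */
final class NanosecondParse {
    // Nine 'S' letters mean a nine-digit fraction of a second in DateTimeFormatter.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSS");

    /** Parses text as UTC; the returned Instant holds seconds plus a nanosecond adjustment. */
    static Instant parseUtc(String text) {
        return LocalDateTime.parse(text, FORMAT).toInstant(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        Instant t = parseUtc("1989-07-25 07:24:36.123456789");
        System.out.println(t.getEpochSecond() + " s + " + t.getNano() + " ns");
    }
}
```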

@muga, by the way, did you bring this idea to the JRuby community? What's the status of the code? Did you find reusable test cases for Time.strptime in the JRuby code?

hata · 2015-04-28T14:50:40Z

Thank you very much for the comment and it sounds good to me.

flicker581 · 2016-02-29T09:29:52Z

Maintaining an open issue for years is easy. Closing it is difficult.

dmikurube · 2017-11-16T04:36:08Z

AFAIU, this has been solved in #611. Closing this ticket. Cc: @muga

frsyuki assigned muga Apr 9, 2015

frsyuki mentioned this issue Apr 28, 2015

TimestampParser doesn't support values under microseconds #164

Open

muga added this to the v0.9 milestone Apr 12, 2016

dmikurube added topic:timestamp and removed topic:timestamp labels Mar 16, 2017

dmikurube closed this as completed Nov 16, 2017
