Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize the performance of TimestampParser#parse(String) #145

Closed
muga opened this issue Mar 18, 2015 · 8 comments
Closed

Optimize the performance of TimestampParser#parse(String) #145

muga opened this issue Mar 18, 2015 · 8 comments
Assignees
Milestone

Comments

@muga
Copy link
Contributor

muga commented Mar 18, 2015

@muga
Copy link
Contributor Author

muga commented Mar 27, 2015

After fixing page size, I tried to profile the performance again. I used localfile input plugin and null output plugin. The timestamp parsing consumes almost CPU about 90%.

@frsyuki
Copy link
Contributor

frsyuki commented Apr 9, 2015

An implementation idea is implementing timestamp parser in Java. JRuby implements timestamp formatter in Java. Parser should be able to be the same design with formatter.

@hito4t
Copy link
Contributor

hito4t commented Apr 24, 2015

I've added the issue #164 , which my be related to this issue.
Please check the issue.

@hata
Copy link
Contributor

hata commented Apr 28, 2015

Hello.

I would like to inform my knowledge about the performance of TimestampParser.

I tested differences between Java and JRuby(original) implementation. I used time command and the result is like this(All tests run on Mac OS X Intel Core i5/2.7GHz):

  • csv file contains 1 column and the column is timestamp text like '1989-07-25 07:24:36'
  • the file size is 100MB . The number of rows is 5000000 .

When the file was tested for embulk 0.6.5, embulk finished about 4:48 in my environment.

real    4m48.649s
user    4m56.812s
sys 0m3.264s

When the same file was tested for embulk 0.6.5 + my change ( 26ea959 ), embulk finished about 13.4 secs.

real    0m13.403s
user    0m17.465s
sys 0m0.535s

From this, it may be better to improve parsing timestamp. My tested code can parse some patterns using java. It may help for some users who only use patterns supported by SimpleDateFormat.

Sample testcase

Generate timestamp csv file.

import java.io.*;
import java.util.*;
import java.text.*;

public class GenTimestampColumns
{
    private static final long TIME_INIT_MSEC = 100000000;
    private static final long TIME_STEP_MSEC = 123456;
    private static final String RESULT_FILE_NAME = "result.csv";

    // 2015-01-27 19:23:49
    // yyyy-MM-dd hh:mm:ss
    public static void main(String[] args) throws Exception {
        System.out.println("Generate Timestamp Columns ...");
        int rowCount = Integer.parseInt(args[0]);
        SimpleDateFormat[] formats = new SimpleDateFormat[args.length -1];

        for (int i = 1;i < args.length;i++) {
            formats[i -1] = new SimpleDateFormat(args[i]);
        }

        PrintWriter writer = new PrintWriter(new FileWriter(RESULT_FILE_NAME));
        StringBuffer buffer = new StringBuffer();
        long currentTime = TIME_INIT_MSEC;

        for (long lineNum = 0;lineNum < rowCount;lineNum++) {
            for (SimpleDateFormat format : formats) {
                if (buffer.length() > 0) {
                    buffer.append(",");
                }
                buffer.append(format.format(new Date(currentTime)));
            }
            writer.println(buffer.toString());
            buffer = new StringBuffer();
            currentTime += TIME_STEP_MSEC;
        }

        writer.close();
        System.out.println("Finished(output file:" + RESULT_FILE_NAME + ") ...");
    }
}

Example script to generate timestamp sample csv file. The following script generates 100MB test file.

#!/bin/sh

ROW_COUNT=5000000

DATE_FORMAT="yyyy-MM-dd hh:mm:ss"

javac -d classes src/*.java
java -classpath classes GenTimestampColumns $ROW_COUNT "$DATE_FORMAT"

For example, use like the following config.yml file to test the above file.

exec: {}
in:
  type: file
  path_prefix: <Set timestamp file path>
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: ''
    skip_header_lines: 0
    columns:
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
out: {type: 'null'}

Sample change to improve some type of performance

My enhancement code is 26ea959 . This can improve some patterns supported by SimpleDateFormat. This doesn't handle unsupported patterns like micro/nano seconds.

I considered I make a pull request for my change and I decided not to do it because it may be required to check more locale related behavior(default locale is better to set 'en' ?) and test patterns. I guess this performance behavior is under investigation and my code only improve some patterns. So, this is just FYI for the performance issue.

@frsyuki
Copy link
Contributor

frsyuki commented Apr 28, 2015

Thank you for your information. Your benchmark result reaches the same conclusion with @muga's CPU profile result.

Here is the idea of @muga and me:

  • Writing fast code is easy but writing precise code is difficult. Maintaining the code is also difficult.
  • JRuby has a problem where Time.strptime can't parse nanoseconds. But CRuby can parse. So, it's one of the incompatibility with CRuby which should be enhanced.
  • It would be great if we can contribute the new fast Time.strptime code to JRuby project so that the parser implementation will be improved by all jruby users, which is much larger community than embulk users.

@muga did you bring this idea to the JRuby community, by the way? What's the status of the code? Did you find the reusable test cases of Time.strptime in JRuby code?

@hata
Copy link
Contributor

hata commented Apr 28, 2015

Thank you very much for the comment and it sounds good to me.

@flicker581
Copy link

Maintaining an open issue for years is easy. Closing it is difficult.

@muga muga added this to the v0.9 milestone Apr 12, 2016
@dmikurube
Copy link
Member

ASAIU, this has been solved in #611. Closing this ticket. Cc: @muga

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

6 participants