New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove parsing perf bottleneck in WordEmbeddingsTransform #1599

Merged
merged 19 commits into from Nov 21, 2018

Conversation

@adamsitnik
Member

adamsitnik commented Nov 11, 2018

This PR improves the performance of reading large text files and affects two of our most time-consuming benchmarks.

Info:

BenchmarkDotNet=v0.11.2, OS=Windows 10.0.17134.345 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
Frequency=3507503 Hz, Resolution=285.1031 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-009697
  [Host]     : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
  Job-OXDQNP : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT

Before:

Method Mean
WikiDetox_WordEmbeddings_OVAAveragedPerceptron 286.7 s
WikiDetox_WordEmbeddings_SDCAMC 184.1 s

After:

Method Mean
WikiDetox_WordEmbeddings_OVAAveragedPerceptron 169.02 s
WikiDetox_WordEmbeddings_SDCAMC 65.32 s

Which is two minutes less to read the huge file for both benchmarks which results in a x3 boost for WikiDetox_WordEmbeddings_SDCAMC and 40% improvement for WikiDetox_WordEmbeddings_OVAAveragedPerceptron

Reading the file was a bottleneck:

image

I have applied all possible optimizations and parallelized this operation.

I am going to post a detailed description on Monday.

@adamsitnik adamsitnik changed the title from Remove parsing perf bottleneck in WikiDetox_WordEmbeddings benchmarks to Remove file read bottleneck in WikiDetox_WordEmbeddings benchmarks Nov 11, 2018

@adamsitnik

This comment has been minimized.

Member

adamsitnik commented Nov 11, 2018

@adamsitnik adamsitnik changed the title from Remove file read bottleneck in WikiDetox_WordEmbeddings benchmarks to Remove parsing perf bottleneck in WordEmbeddingsTransform Nov 11, 2018

@sfilipi sfilipi requested a review from yaeldekel Nov 12, 2018

@danmosemsft danmosemsft requested a review from tannergooding Nov 13, 2018

@tannergooding

This comment has been minimized.

Member

tannergooding commented Nov 13, 2018

This seems like just the kind of thing the Utf8Parser in System.Memory was meant to handle....

adamsitnik added some commits Nov 13, 2018

@adamsitnik

This comment has been minimized.

Member

adamsitnik commented Nov 13, 2018

@danmosemsft

This comment has been minimized.

Member

danmosemsft commented Nov 16, 2018

Seems all feedback was addressed? Just needs rebase?

@adamsitnik

This comment has been minimized.

Member

adamsitnik commented Nov 16, 2018

Just needs rebase?

And a two approvals ;)

@danmosemsft

This comment has been minimized.

Member

danmosemsft commented Nov 16, 2018

@shauheen do you have further feedback?

@shauheen

Branch still needs to be updated with master.

@shauheen shauheen requested a review from Zruty0 Nov 17, 2018

changes have been made

@eerhardt

This comment has been minimized.

Member

eerhardt commented Nov 19, 2018

using System;

need a copyright header


Refers to: src/Microsoft.ML.Core/Utilities/LineParser.cs:1 in 0190b0c. [](commit_id = 0190b0c, deletion_comment = False)

@eerhardt

This comment has been minimized.

Member

eerhardt commented Nov 19, 2018

using Microsoft.ML.Runtime.Internal.Utilities;

Add copyright header.


Refers to: test/Microsoft.ML.Tests/Transformers/LineParserTests.cs:1 in 0190b0c. [](commit_id = 0190b0c, deletion_comment = False)

@adamsitnik

This comment has been minimized.

Member

adamsitnik commented Nov 20, 2018

@shauheen @eerhardt I have addressed all issues, PTAL one more time.

@tannergooding thanks for pointing the Utf8Parser. I did not use it in this PR because I don't know if every input file is Utf8. In the next PR I will provide "fast path" for UTF8 files.

@eerhardt

This comment has been minimized.

Member

eerhardt commented Nov 20, 2018

using System;

Still needs a copyright header.


In reply to: 439979964 [](ancestors = 439979964)


Refers to: src/Microsoft.ML.Core/Utilities/LineParser.cs:1 in 0190b0c. [](commit_id = 0190b0c, deletion_comment = False)

@adamsitnik adamsitnik merged commit feddc72 into dotnet:master Nov 21, 2018

2 checks passed

MachineLearning-CI #20181120.8 succeeded
Details
license/cla All CLA requirements met.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment