
Remove parsing perf bottleneck in WordEmbeddingsTransform #1599

Merged
merged 19 commits into dotnet:master on Nov 21, 2018

Conversation

adamsitnik
Member

@adamsitnik adamsitnik commented Nov 11, 2018

This PR improves the performance of reading large text files and affects two of our most time-consuming benchmarks.

Info:

BenchmarkDotNet=v0.11.2, OS=Windows 10.0.17134.345 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
Frequency=3507503 Hz, Resolution=285.1031 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-009697
  [Host]     : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
  Job-OXDQNP : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT

Before:

| Method | Mean |
|--------|-----:|
| WikiDetox_WordEmbeddings_OVAAveragedPerceptron | 286.7 s |
| WikiDetox_WordEmbeddings_SDCAMC | 184.1 s |

After:

| Method | Mean |
|--------|-----:|
| WikiDetox_WordEmbeddings_OVAAveragedPerceptron | 169.02 s |
| WikiDetox_WordEmbeddings_SDCAMC | 65.32 s |

That is roughly two minutes less spent reading the huge file in both benchmarks, which translates into a roughly 3x speedup for WikiDetox_WordEmbeddings_SDCAMC and a 40% improvement for WikiDetox_WordEmbeddings_OVAAveragedPerceptron.

Reading the file was a bottleneck:

[image: profiling results]

I have applied all possible optimizations and parallelized this operation.
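For illustration, a minimal sketch of the idea (this is not the actual LineParser code from the PR; names and structure are illustrative only): parse each "word value value ..." line in a single pass instead of relying on String.Split, and process the lines in parallel.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public static class EmbeddingsParsingSketch
{
    // Parse one line of the form "word 0.1 0.2 ..." into a key and its vector,
    // scanning the line once instead of allocating a string[] via String.Split.
    public static (string Key, float[] Values) ParseLine(string line)
    {
        int firstSpace = line.IndexOf(' ');
        string key = line.Substring(0, firstSpace);

        var values = new List<float>();
        int start = firstSpace + 1;
        for (int i = start; i <= line.Length; i++)
        {
            if (i == line.Length || line[i] == ' ')
            {
                if (i > start)
                    values.Add(float.Parse(line.Substring(start, i - start),
                        NumberStyles.Float, CultureInfo.InvariantCulture));
                start = i + 1;
            }
        }
        return (key, values.ToArray());
    }

    // Read and parse the embeddings file in parallel, preserving the line order.
    public static (string Key, float[] Values)[] ParseFile(string path)
        => File.ReadLines(path)
               .AsParallel()
               .AsOrdered()
               .Select(ParseLine)
               .ToArray();
}
```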

I am going to post a detailed description on Monday.

@adamsitnik adamsitnik changed the title from "Remove parsing perf bottleneck in WikiDetox_WordEmbeddings benchmarks" to "Remove file read bottleneck in WikiDetox_WordEmbeddings benchmarks" on Nov 11, 2018
@adamsitnik
Member Author

/cc @danmosemsft

@adamsitnik adamsitnik changed the title from "Remove file read bottleneck in WikiDetox_WordEmbeddings benchmarks" to "Remove parsing perf bottleneck in WordEmbeddingsTransform" on Nov 11, 2018
@tannergooding
Member

This seems like just the kind of thing the Utf8Parser in System.Memory was meant to handle....
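For illustration, a minimal sketch of what Utf8Parser offers (not code from this PR, and it assumes UTF-8 input): it parses numbers directly from raw bytes without first decoding them to a string.

```csharp
using System;
using System.Buffers.Text;
using System.Text;

// Parse a float straight out of UTF-8 bytes; bytesConsumed tells us where the
// next token starts, so a whole line can be walked with no string allocations.
ReadOnlySpan<byte> utf8 = Encoding.UTF8.GetBytes("0.12345 -3.5");
if (Utf8Parser.TryParse(utf8, out float value, out int bytesConsumed))
{
    Console.WriteLine($"{value} ({bytesConsumed} bytes consumed)");
}
```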

@adamsitnik
Member Author

@shauheen done #1608

@danmoseley
Member

Seems all feedback was addressed? Just needs rebase?

@adamsitnik
Member Author

> Just needs rebase?

And two approvals ;)

@danmoseley
Member

@shauheen do you have further feedback?

Contributor

@shauheen shauheen left a comment


Branch still needs to be updated with master.

@shauheen shauheen requested a review from Zruty0 November 17, 2018 17:08
@shauheen shauheen dismissed their stale review November 17, 2018 17:08

changes have been made

@eerhardt
Member

using System;

need a copyright header


Refers to: src/Microsoft.ML.Core/Utilities/LineParser.cs:1 in 0190b0c.
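For reference, the standard header used across dotnet repositories at the time looks like the following (check the repo's existing files for the exact wording):

```csharp
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
```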

@eerhardt
Member

using Microsoft.ML.Runtime.Internal.Utilities;

Add copyright header.


Refers to: test/Microsoft.ML.Tests/Transformers/LineParserTests.cs:1 in 0190b0c.

@adamsitnik
Member Author

@shauheen @eerhardt I have addressed all issues, PTAL one more time.

@tannergooding thanks for pointing out Utf8Parser. I did not use it in this PR because I don't know whether every input file is UTF-8. In the next PR I will provide a "fast path" for UTF-8 files.
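A hypothetical sketch of what such a fast path could look like (illustrative names only, not ML.NET APIs): use Utf8Parser on the raw bytes when the file is known to be UTF-8, and fall back to char-based parsing otherwise.

```csharp
using System;
using System.Buffers.Text;
using System.Globalization;

public static class FloatParsingSketch
{
    // Fast path: parse directly from UTF-8 bytes; general path: parse from chars.
    public static float ParseValue(ReadOnlySpan<byte> utf8, ReadOnlySpan<char> chars, bool isUtf8)
    {
        if (isUtf8 && Utf8Parser.TryParse(utf8, out float value, out _))
            return value; // no string allocation, no char decoding

        return float.Parse(chars, NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}
```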

@eerhardt
Member

using System;

Still needs a copyright header.


In reply to: 439979964


Refers to: src/Microsoft.ML.Core/Utilities/LineParser.cs:1 in 0190b0c.

@adamsitnik adamsitnik merged commit feddc72 into dotnet:master Nov 21, 2018
@msftbot msftbot bot locked as resolved and limited conversation to collaborators Mar 26, 2022