Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Samples template for ranking catalog #3338

Merged
merged 3 commits into from Apr 17, 2019

Conversation

Projects
None yet
4 participants
@artidoro
Copy link
Contributor

artidoro commented Apr 15, 2019

Tracked under #2522.

This PR adds samples for FastTree ranking.

This PR also rewrites the samples for LightGbm ranking using a template file which will make it easier to maintain going forward.

@codecov

This comment has been minimized.

Copy link

codecov bot commented Apr 15, 2019

Codecov Report

Merging #3338 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3338      +/-   ##
==========================================
- Coverage   72.71%   72.69%   -0.02%     
==========================================
  Files         807      807              
  Lines      145172   145172              
  Branches    16225    16225              
==========================================
- Hits       105559   105537      -22     
- Misses      35195    35220      +25     
+ Partials     4418     4415       -3
Flag Coverage Δ
#Debug 72.69% <ø> (-0.02%) ⬇️
#production 68.23% <ø> (-0.02%) ⬇️
#test 88.97% <ø> (ø) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.FastTree/TreeTrainersCatalog.cs 94.18% <ø> (ø) ⬆️
src/Microsoft.ML.Core/Data/ProgressReporter.cs 70.95% <0%> (-6.99%) ⬇️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs 89.26% <0%> (-0.63%) ⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 84.9% <0%> (+0.2%) ⬆️
GroupId = (uint)(i / groupSize),
// Create random features that are correlated with the label.
// For data points with larger labels, the feature values are slightly increased by adding a constant.
Features = Enumerable.Repeat(randomFloat() + label * 0.1f, 50).ToArray()

This comment has been minimized.

Copy link
@shmoradims

shmoradims Apr 15, 2019

Contributor

Repeat(randomFloat() + label * 0.1f, 50).ToArray() [](start = 42, length = 50)

is this correct? I think it repeats the same number 50 times which is not what you want.

in other samples we do it like this:
Enumerable.Repeat(label, 50).Select(x => x + randomFloat()).ToArray() #Resolved

/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[FastTreeBinaryClassification](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/FastTree.cs)]

This comment has been minimized.

Copy link
@rogancarr

rogancarr Apr 16, 2019

Contributor

FastTreeBinaryClassification [](start = 26, length = 28)

Please have the names match the destination file. #Resolved

This comment has been minimized.

Copy link
@artidoro

artidoro Apr 16, 2019

Author Contributor

Thank you for pointing out!


In reply to: 275908194 [](ancestors = 275908194)

var mlContext = new MLContext(seed: 0);

// Create a list of training data points.
var dataPoints = GenerateRandomDataPoints(1000);

This comment has been minimized.

Copy link
@rogancarr

rogancarr Apr 16, 2019

Contributor

var dataPoints = GenerateRandomDataPoints(1000); [](start = 11, length = 49)

There is an asymmetry between the dataset loading for train and test.

I would prefer to have one call to GenerateRandomDataPoints, followed by a call to LoadFromEnumerable, and then followed by a TrainTestSplit that uses the GroupingColumn (or whatever we called it) option. This would be a good place to remind people that they need to be careful about the GroupId column. #Resolved

This comment has been minimized.

Copy link
@artidoro

artidoro Apr 16, 2019

Author Contributor

You are right, I like your suggested approach better, I am going to write an issue about it, and fix it for all the samples (not just ranking).
This template is pretty standard across the different samples at the moment.


In reply to: 275909222 [](ancestors = 275909222)

var predictions = mlContext.Data.CreateEnumerable<Prediction>(transformedTestData, reuseRowObject: false).ToList();

// Look at 5 predictions
foreach (var p in predictions.Take(5))

This comment has been minimized.

Copy link
@rogancarr

rogancarr Apr 16, 2019

Contributor

predictions.Take(5) [](start = 30, length = 19)

Don't use Linq. Instead, show how to do this using mlContext.Data.TakeRows prior to creating an enumerable. This puts more of sample coverage over our codebase. #Resolved

This comment has been minimized.

Copy link
@artidoro

artidoro Apr 16, 2019

Author Contributor

Ok will do that.


In reply to: 275909687 [](ancestors = 275909687)

{
[KeyType(5)]
public uint Label { get; set; }
[KeyType(100)]

This comment has been minimized.

Copy link
@Ivanidzo4ka

Ivanidzo4ka Apr 16, 2019

Member
        [](start = 26, length = 12)

can you remove this extra spaces. #Resolved

This comment has been minimized.

Copy link
@artidoro

artidoro Apr 16, 2019

Author Contributor

now I can see them haha :) Thank you for pointing out.


In reply to: 275927787 [](ancestors = 275927787)

[KeyType(5)]
public uint Label { get; set; }
[KeyType(100)]
public uint GroupId { get; set; }

This comment has been minimized.

Copy link
@Ivanidzo4ka

Ivanidzo4ka Apr 16, 2019

Member
        [](start = 45, length = 12)

and this one #Resolved

float randomFloat() => (float)random.NextDouble();
for (int i = 0; i < count; i++)
{
var label = random.Next(0, 5);

This comment has been minimized.

Copy link
@Ivanidzo4ka

Ivanidzo4ka Apr 16, 2019

Member

var label = random.Next(0, 5) [](start = 16, length = 29)

isn't label = 0 while it uint/key means this is a missing label? #Pending

This comment has been minimized.

Copy link
@artidoro

artidoro Apr 16, 2019

Author Contributor

So the underlying value can be 0. When reading as a KeyType there is mapping that will shift everything by +1. So that 0 maps to 1.
I looked at the IDataView that was generated and the label values go from 1 to 5.


In reply to: 275928728 [](ancestors = 275928728)

Microsoft.ML.SamplesUtils.ConsoleUtils.PrintMetrics(metrics);
FeatureFraction = 0.9
},
RowGroupColumnName = "GroupId"

This comment has been minimized.

Copy link
@Ivanidzo4ka

Ivanidzo4ka Apr 16, 2019

Member

RowGroupColumnName = "GroupId" [](start = 16, length = 30)

I find it strange what we use this line only in one example out of 4.
If it's important, why not in all 4, if it's not, why we even show it? #Resolved

This comment has been minimized.

Copy link
@artidoro

artidoro Apr 16, 2019

Author Contributor

So in the two samples without option, the RowGrouColumnName column defaults to 'GroupID'.
However, I just noticed that in the advanced options constructor, the RowGroupColumnName defaults to null. We should probably fix this.


In reply to: 275931195 [](ancestors = 275931195)

This comment has been minimized.

Copy link
@artidoro

artidoro Apr 16, 2019

Author Contributor

In the meantime I am going to include this line in the other sample with options.


In reply to: 275985563 [](ancestors = 275985563,275931195)

var transformedTestData = model.Transform(testData);

// Take the top 5 rows.
var topTransformedTestData = mlContext.Data.TakeRows(transformedTestData, 5);

This comment has been minimized.

Copy link
@Ivanidzo4ka

Ivanidzo4ka Apr 16, 2019

Member

what @wschin did in his PR, he just generate 5 rows of data. What's the point of having 500 test samples if you use only 5 of them?

This comment has been minimized.

Copy link
@artidoro

artidoro Apr 16, 2019

Author Contributor

Yes you are right, I don't like this approach anyways, I have filed an issue to start using TrainTestSplit instead of generating training and test data separately.

I will fix that for all samples at once.

@Ivanidzo4ka
Copy link
Member

Ivanidzo4ka left a comment

:shipit:

@shmoradims
Copy link
Contributor

shmoradims left a comment

:shipit:

@artidoro artidoro force-pushed the artidoro:rankingsamples branch from 3aa7cd2 to 1877e7d Apr 17, 2019

@artidoro artidoro force-pushed the artidoro:rankingsamples branch from 1877e7d to 76aa0d4 Apr 17, 2019

@artidoro artidoro merged commit cc40049 into dotnet:master Apr 17, 2019

3 checks passed

MachineLearning-CI #20190417.19 succeeded
Details
MachineLearning-CodeCoverage #20190417.19 succeeded
Details
license/cla All CLA requirements met.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.