Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update samples #238

Closed
wants to merge 7 commits into from
Closed

Update samples #238

wants to merge 7 commits into from

Conversation

OliaG
Copy link
Member

@OliaG OliaG commented May 24, 2018

No description provided.


[Column(ordinal: "4")]
[ColumnName("Label")]
public string Label;
Copy link
Contributor

@zeahmed zeahmed May 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Column(ordinal: "4")] 
[ColumnName("Label")] 
public string Label;

I think this is not needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, just removed

@OliaG OliaG force-pushed the samples branch 2 times, most recently from 9ceb805 to ed14986 Compare May 24, 2018 22:55
@OliaG OliaG requested a review from terrajobst May 24, 2018 23:01
@OliaG OliaG changed the title WIP: Update samples Update samples May 24, 2018
@TomFinley
Copy link
Contributor

TomFinley commented May 24, 2018

Hi @OliaG , thanks for updating the PR. So I see we are deleting the files for taxi, at least, so I judge from the change line count (+34,393 −2,048,602), what I see of test/data/taxi-fare-train.csv, etc. Is that correct?

While the addition of these large files was unfortunate, due to the way git works with cloning preserving the whole history, the damage is already done in some sense. Any sort of "edits" on this file to delete or whatever merely compound the problem, especially if we decide to do a rebase to solve the problem once and for all any touching will just complicate that process. Perhaps we should just avoid touching it?

@TomFinley
Copy link
Contributor

Maybe if we did git move on those files it would realize this is really a move, but if I look at the diff over here it seems to think we've deleted and added.

@TomFinley
Copy link
Contributor

5 784 152:3 153:18 154:18 155:18 156:126 157:136 158:175 159:26 160:166 161:255 162:247 163:127 176:30 177:36 178:94 179:154 180:170 181:253 182:253 183:253 184:253 185:253 186:225 187:172 188:253 189:242 190:195 191:64 203:49 204:238 205:253 206:253 207:253 208:253 209:253 210:253 211:253 212:253 213:251 214:93 215:82 216:82 217:56 218:39 231:18 232:219 233:253 234:253 235:253 236:253 237:253 238:198 239:182 240:247 241:241 260:80 261:156 262:107 263:253 264:253 265:205 266:11 268:43 269:154 289:14 290:1 291:154 292:253 293:90 319:139 320:253 321:190 322:2 347:11 348:190 349:253 350:70 376:35 377:241 378:225 379:160 380:108 381:1 405:81 406:240 407:253 408:253 409:119 410:25 434:45 435:186 436:253 437:253 438:150 439:27 463:16 464:93 465:252 466:253 467:187 493:249 494:253 495:249 496:64 518:46 519:130 520:183 521:253 522:253 523:207 524:2 544:39 545:148 546:229 547:253 548:253 549:253 550:250 551:182 570:24 571:114 572:221 573:253 574:253 575:253 576:253 577:201 578:78 596:23 597:66 598:213 599:253 600:253 601:253 602:253 603:198 604:81 605:2 622:18 623:171 624:219 625:253 626:253 627:253 628:253 629:195 630:80 631:9 648:55 649:172 650:226 651:253 652:253 653:253 654:253 655:244 656:133 657:11 676:136 677:253 678:253 679:253 680:212 681:135 682:132 683:16

This looks like MNIST to me.

Generally I'd prefer that related sample data be stored in directories. If we're going to move all data files to a single directory, then we will need to give them meaningful names.

"Train-Tiny" is only a meaningful name if you know it is MNIST. I was only able to guess it was MNIST because I am very familiar with this dataset, other people may not have that advantage.


Refers to: examples/datasets/Train-Tiny-28x28.txt:1 in 9647337. [](commit_id = 9647337, deletion_comment = False)

{
for (var j = 0; j < metrics.ConfusionMatrix.ClassNames.Count; j++)
{
Console.Write("\t" + metrics.ConfusionMatrix[i, j] + "\t");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Console.Write("\t" + metrics.ConfusionMatrix[i, j] + "\t"); [](start = 20, length = 59)

Were the double tabs on both the start and end intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was intentional; it's using Console.Write not WriteLine, so the trailing tab is for subsequent writes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? I could see having a trailing tab, or a leading tab, but I don't think having both is intentional?

Let's imagine you're outputting, on one row, the numbers, say, 0, 1, 2, 3. The output is \t0\t\t1\t\t2\t\t3\t. Right? That seems wrong.


In reply to: 190758071 [](ancestors = 190758071)

private static async Task<PredictionModel<TaxiTrip, TaxiTripFarePrediction>> TrainAsync()
{
// LearningPipeline holds all steps of the learning process: data, transforms, learners.
var pipeline = new LearningPipeline
Copy link
Contributor

@TomFinley TomFinley May 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are very nice comments. #ByDesign

@OliaG
Copy link
Member Author

OliaG commented May 24, 2018

@TomFinley

I see we are deleting the files for taxi, at least, so I judge from the change line count (+34,393 −2,048,602), what I see of test/data/taxi-fare-train.csv, etc. Is that correct?

That's correct. I've undone the shrinking so now it's a pure move without changes.

While the addition of these large files was unfortunate, due to the way git works with cloning preserving the whole history, the damage is already done in some sense.

Good point.

Maybe if we did git move on those files it would realize this is really a move, but if I look at the diff over here it seems to think we've deleted and added.

Git doesn't track renames. git move is syntax sugar for add and remove. Git will track renames later based on a similarity threshold, which is AFAIK 75% by default and the massive shrinking would be above that threshold.

"Train-Tiny" is only a meaningful name if you know it is MNIST.

OK, so I'll rename Train-Tiny-28x28 to MNIST-Train-Tiny-28x28.

@OliaG OliaG added API Issues pertaining the friendly API and removed API Issues pertaining the friendly API labels May 24, 2018
<Folder Include="datasets\" />
</ItemGroup>

</Project>
Copy link
Member

@codemzs codemzs May 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale behind creating a separate project for "Binary Classification sentiment analysis"? Same for MC_Iris, TaxiFarePrediction.

What I'm seeing is you have taken the scenario test and broke it down into many files. What does this accomplish?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to have separate project for different problem types so it aligns with our docs

[Column("1", name: "Label")]
public float Sentiment;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this class needs its own file? You can keep SentimentData, SentimentPrediction and TestSentimentData in Program.cs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general .NET coding guidelines use one class per file and we want our samples to use idiomatic C#.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OliaG Can you please point me to this guideline? Vast majority of ML.Net code has multiple class definitions in one file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't find the written up rule, but based on my conversation with @terrajobst that's what the .NET team is usually doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there isn't a written rule then we should keep things consistent with the current code base. Here is an example from .Net Corefx repo that shows multiple classes defined in one file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codemzs

If there isn't a written rule then we should keep things consistent with the current code base

I disagree. Samples aren't part of the ML.NET codebase. They are for our customers and the vast majority doesn't need to know nor care about the ML.NET coding conventions. That's why we try to follow the conventions that the vast majority of our customers are using.

That being said, I don't have a problem if we were to merge all these classes into one file as it might even help to keep the sample easier to understand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guideline is interesting, and one we generally try to obey ourselves, but not when separating definitions would lead to less comprehensible code. So we tend to declare types meant to be understood together as a unit, in the same file. (This isn't Java after all. :) ) This clearly falls into that bucket. Especially in this case where the intent is pedagogical, we would "bend" any rules we usually follow about style (even if this was an actual policy, which, as Zeeshan points out, it is not).

In this case what I'd rather do is have a (private?) nested type, declared right next to their usages in the method. This will structure the example so that the method and class can be understood. Then you can get rid of these three files, and have only one, more easily understood example.


In reply to: 190942775 [](ancestors = 190942775)

{
var pipeline = new LearningPipeline();

pipeline.Add(new TextLoader<GitHubIssue>(DataPath, useHeader: true));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pipeline.Add(new TextLoader(DataPath, useHeader: true)); [](start = 9, length = 72)

How is this line even compiling? Are you using some outdated nuget package? The loader API has been changed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are samples, which means they are building against the latest package that is published to nuget.org, not what is currently in master. Hence, they are working as we're consuming what our customers would, which is what we want.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. The alternate argument is, if they're checked in with the repo, they ought to target the code in the repo, since then the samples are simply always up to date, if we make them part of our build/test cycle. Indeed until I saw that we were using this outdated stuff, I had fully expected that to be the case.

The nuget story makes the samples far more difficult to maintain as a code artifact. Especially in these days, when the API is still in heavy flux. I'm considering two worlds:

  • The samples are incrementally changed so that as we move towards 0.2, changing the samples one at a time by the person authoring that PR (and so, hopefully, understands how it should be use), I see how development happens. Also at any given point, the samples are correct.
  • The samples remain unchanged. Come release time, it is someone's job (whose?) to comprehend all PRs against the API in that time, so they have a clue of how the sample should be updated, and then we hope they do all that correctly. Meanwhile the release is blocked on this point.

Generally anything that reduces maintainability and adds costs to our release process, is something we ought to be hostile towards.


In reply to: 190764164 [](ancestors = 190764164)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually yeah, I was thinking about it, and the nuget approach just seems deeply wrong. You couldn't even write the PR to update the samples until after you've actually published the nuget. This means that any release branch we make will either have by design outdated samples, or we will have to accept a world in which the nuget published for version X never corresponds to the branch release/X. Regarding it being what we want, I'm not sure who wants that, but whoever they are, let's have them want something else. :)


In reply to: 190949466 [](ancestors = 190949466,190764164)

OliaG added 6 commits May 24, 2018 17:13
- Iris sample
- Sentiment Analysis sample
- TaxiFare sample
- GitHub issues classification sample
Moved datasets from tests to examples
Removed model files
Copy link
Member

@codemzs codemzs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

@asthana86 asthana86 self-requested a review May 25, 2018 13:10
@OliaG OliaG closed this May 25, 2018
@OliaG OliaG deleted the samples branch May 30, 2018 19:40
@dotnet dotnet locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants