Adding Multiple Training Files to the Pipeline? #192

Open
cflint987 opened this Issue May 19, 2018 · 3 comments


cflint987 commented May 19, 2018

System information
OS version/distro: Windows 7 Home
.NET Version (e.g., dotnet --info): ML.NET v0.1.0

Issue:
What is the correct way to add multiple training files to a Learning Pipeline?

In the Taxi Fare example, just adding another TextLoader and/or ColumnCopier, etc. does not seem to be correct.

Example:
pipeline.Add(new TextLoader(DataPath, useHeader: true, separator: ","));
pipeline.Add(new TextLoader(DataPath2, useHeader: true, separator: ","));

cflint987 changed the title from ".Net Framework Support?" to "Adding Multiple Training Files to the Pipeline?" May 19, 2018

shauheen added the question label May 21, 2018

Contributor

GalOshri commented May 21, 2018

Thanks for asking! This is not currently possible, but let's use this issue to track enabling multiple inputs in a pipeline.

Just to clarify: is your intention to concatenate the two files as soon as they are loaded, or to apply different transforms/trainers to them?

A potential workaround for now is to read in the examples from both files into memory and use the CollectionDataSource (see example usage here). You could also concatenate the two files into one CSV outside of the ML.NET pipeline.
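
For the second workaround (concatenating the files outside of the pipeline), here is a rough sketch of one way it could be done in C#. It is only an illustration, not part of ML.NET: the `CsvConcat` class and its `Concatenate` method are hypothetical names, and the code assumes every input file starts with the same header row. It copies the files line by line, so the data never has to fit in memory all at once.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not an ML.NET API.
public static class CsvConcat
{
    // Streams each input file into outputPath one line at a time.
    // Assumes all inputs share an identical header row, which is
    // written to the output only once. outputPath must not be one
    // of the input files.
    public static void Concatenate(IEnumerable<string> inputPaths, string outputPath)
    {
        string header = null;
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (var path in inputPaths)
            {
                using (var reader = new StreamReader(path))
                {
                    var firstLine = reader.ReadLine();
                    if (firstLine == null)
                    {
                        continue; // empty file, nothing to copy
                    }

                    if (header == null)
                    {
                        // First non-empty file defines the header.
                        header = firstLine;
                        writer.WriteLine(header);
                    }
                    else if (firstLine != header)
                    {
                        throw new InvalidDataException(
                            $"Header mismatch in '{path}'.");
                    }

                    // Copy the data rows, skipping blank lines.
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        if (line.Length > 0)
                        {
                            writer.WriteLine(line);
                        }
                    }
                }
            }
        }
    }
}
```

Assuming this helper, something like `CsvConcat.Concatenate(new[] { DataPath, DataPath2 }, combinedPath)` would produce one file, and the pipeline would then need only a single `TextLoader` pointed at `combinedPath` (a hypothetical variable holding the output location).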

cflint987 commented May 22, 2018

My intention is to make creating and testing ML structures with large datasets more modular and less taxing on file transfers to/from servers. For example, moving 100 GB to a server is easier if the data is split by time or another parameter. It also allows ML structures to be updated as new data comes in without having to concatenate onto what is already a large file.

Reducing the memory footprint by loading subsets of the data would be nice, but as I understand it, that is not possible for all ML structures.

I have concatenated the files and it works properly, but this would be a nice feature to have.

Thanks for the answer.

Contributor

glebuk commented May 23, 2018

@cflint987,
We have a work item to address your exact scenario. Please take a look at PR #61. Feel free to comment and ask @tyclintw.
