[FEATURE] Expect to support the filesystem not implementing the append-mode. #391

Open
3 tasks done
Tracked by #1030
yuyang733 opened this issue Dec 7, 2022 · 16 comments

Comments

@yuyang733

yuyang733 commented Dec 7, 2022

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the feature

Cloud object storage and its accelerating cache layers are widely used in architectures that separate storage from compute.
However, many of them either do not implement append mode, or implement it with write amplification that is hard to manage.

Would it be possible to also support storing each block independently as its own data file?

Motivation

Cloud object storage and its accelerating cache layers are widely used in architectures that separate storage from compute.
However, many of them either do not implement append mode, or implement it with write amplification that is hard to manage.

Describe the solution

The initial idea is to introduce abstract, non-append storage handlers, such as AbstractObjectStorageWriteHandler, AbstractObjectStorageReadHandler, and AbstractObjectDeleteHandler, that implement the basic interfaces RSS requires.

Subclasses for each specific storage backend can then implement the actual read, write, and delete operations.
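
To make the proposal concrete, here is a minimal Java sketch of what such handlers might look like. Only the three abstract class names come from the text above; every method name, the key layout (`<prefix>/block-<id>.data`), and the `InMemoryObjectStore` stand-in are assumptions made for illustration and are not Uniffle's actual API.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: method names and key layout are assumptions, not Uniffle code.

/** Writes every block as its own immutable object, so append is never needed. */
abstract class AbstractObjectStorageWriteHandler {
  private final String partitionPrefix;

  AbstractObjectStorageWriteHandler(String partitionPrefix) {
    this.partitionPrefix = partitionPrefix;
  }

  /** Stores one block under a key derived from its block id and returns that key. */
  String writeBlock(long blockId, byte[] data) throws IOException {
    String key = partitionPrefix + "/block-" + blockId + ".data";
    putObject(key, data);
    return key;
  }

  /** Backend-specific single-shot upload, supplied by each storage subclass. */
  protected abstract void putObject(String key, byte[] data) throws IOException;
}

/** Reads one block object back by key. */
abstract class AbstractObjectStorageReadHandler {
  byte[] readBlock(String key) throws IOException {
    byte[] data = getObject(key);
    if (data == null) {
      throw new IOException("Block object not found: " + key);
    }
    return data;
  }

  protected abstract byte[] getObject(String key) throws IOException;
}

/** Deletes all objects that share a key prefix. */
abstract class AbstractObjectDeleteHandler {
  /** Returns the number of objects removed. */
  int deleteByPrefix(String prefix) throws IOException {
    List<String> keys = listKeys(prefix);
    for (String key : keys) {
      deleteObject(key);
    }
    return keys.size();
  }

  protected abstract List<String> listKeys(String prefix) throws IOException;

  protected abstract void deleteObject(String key) throws IOException;
}

/** In-memory stand-in for a real object store, for demonstration only. */
class InMemoryObjectStore {
  final Map<String, byte[]> objects = new ConcurrentHashMap<>();

  AbstractObjectStorageWriteHandler writer(String partitionPrefix) {
    return new AbstractObjectStorageWriteHandler(partitionPrefix) {
      @Override
      protected void putObject(String key, byte[] data) {
        objects.put(key, data.clone());
      }
    };
  }

  AbstractObjectStorageReadHandler reader() {
    return new AbstractObjectStorageReadHandler() {
      @Override
      protected byte[] getObject(String key) {
        return objects.get(key);
      }
    };
  }

  AbstractObjectDeleteHandler deleter() {
    return new AbstractObjectDeleteHandler() {
      @Override
      protected List<String> listKeys(String prefix) {
        List<String> keys = new ArrayList<>();
        for (String key : objects.keySet()) {
          if (key.startsWith(prefix)) {
            keys.add(key);
          }
        }
        return keys;
      }

      @Override
      protected void deleteObject(String key) {
        objects.remove(key);
      }
    };
  }
}
```

A real backend (S3, COS, or an HCFS file system) would replace the in-memory map with its own upload, download, list, and delete calls while the abstract classes stay unchanged.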

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@zuston
Member

zuston commented Dec 7, 2022

Do you mean that one dataFlushEvent could directly write a single new OSS file? If so, how would the index file be maintained?

@yuyang733
Author

> Do you mean that one dataFlushEvent could directly write a single new OSS file? If so, how would the index file be maintained?

Aren't several blocks appended to a data file at present? I want to write each block into its own data file, so that each data file corresponds to one index file.

@yuyang733
Author

> Do you mean that one dataFlushEvent could directly write a single new OSS file? If so, how would the index file be maintained?

The goal is to provide a write path that works on storage without append support.

@zuston
Member

zuston commented Dec 7, 2022

Could you provide a data layout diagram that shows how data blocks map to OSS file names?

Besides, we should also consider how to support local_order in the object store.

@yuyang733
Author

> Could you provide a data layout diagram that shows how data blocks map to OSS file names?
>
> Besides, we should also consider how to support local_order in the object store.

Okay, I will provide a design diagram describing it later.

@zuston
Member

zuston commented Dec 7, 2022

> Aren't several blocks appended to a data file at present?

Yes. I know it's better to avoid append mode for performance. But if the data layout changes, do we still need an index file? In the local file and HDFS storage types, the index file records the offset of each block within the single data file.
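
For readers unfamiliar with the append-based layout, the role of the index file can be sketched in Java as follows. This is a simplified illustration, not Uniffle's actual on-disk format: `IndexEntry` and `AppendingDataFile` are hypothetical names, the "file" is an in-memory buffer, and a real index record would likely carry more fields than the three shown here.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** One index record: where a block lives inside the shared data file (simplified). */
final class IndexEntry {
  final long blockId;
  final long offset;
  final int length;

  IndexEntry(long blockId, long offset, int length) {
    this.blockId = blockId;
    this.offset = offset;
    this.length = length;
  }
}

/** Appends blocks to one in-memory "data file" and records an index entry for each. */
final class AppendingDataFile {
  private final ByteArrayOutputStream data = new ByteArrayOutputStream();
  private final List<IndexEntry> index = new ArrayList<>();

  /** Appends the block at the current end of the file and remembers its position. */
  void append(long blockId, byte[] block) {
    index.add(new IndexEntry(blockId, data.size(), block.length));
    data.write(block, 0, block.length);
  }

  List<IndexEntry> index() {
    return index;
  }

  /** Uses the index entry to slice exactly one block out of the data file. */
  byte[] read(IndexEntry entry) {
    byte[] all = data.toByteArray();
    int from = (int) entry.offset;
    return Arrays.copyOfRange(all, from, from + entry.length);
  }
}
```

With one block per object, the offset is always zero, which is why the question of whether an index is still needed comes up at all.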

@yuyang733
Author

> > Aren't several blocks appended to a data file at present?
>
> Yes. I know it's better to avoid append mode for performance. But if the data layout changes, do we still need an index file? In the local file and HDFS storage types, the index file records the offset of each block within the single data file.

Great, thanks for the reminder; the index file seems to be no longer needed.

@zuston
Member

zuston commented Dec 7, 2022

I have another question: will we use the object store API directly, or go through the Hadoop FileSystem API?

If it is the former, it may be better to introduce a Uniffle-dedicated filesystem API that wraps the different concrete filesystems, including COS.

@zuston
Member

zuston commented Dec 7, 2022

> the index file seems to be no longer needed.

Hmm, maybe not. Without an index structure, we have no global view for finding the files we need in the object store, especially for local order.

WDYT? @jerqi
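
One possible reading of the "global view" concern, sketched in Java under stated assumptions: a small per-partition manifest lists every block object together with the task attempt that produced it, so a reader can select the objects it needs without listing or opening each one. `BlockObjectEntry`, `PartitionManifest`, and the task-attempt range filter are hypothetical and were not proposed in this thread in this form.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical manifest record describing one block stored as its own object. */
final class BlockObjectEntry {
  final String objectKey;
  final long blockId;
  final long taskAttemptId;
  final int length;

  BlockObjectEntry(String objectKey, long blockId, long taskAttemptId, int length) {
    this.objectKey = objectKey;
    this.blockId = blockId;
    this.taskAttemptId = taskAttemptId;
    this.length = length;
  }
}

/** Hypothetical per-partition manifest: one small index covering many block objects. */
final class PartitionManifest {
  private final List<BlockObjectEntry> entries = new ArrayList<>();

  void add(BlockObjectEntry entry) {
    entries.add(entry);
  }

  /** Returns keys of blocks whose task attempt id lies in [startInclusive, endExclusive). */
  List<String> keysForTaskRange(long startInclusive, long endExclusive) {
    List<String> keys = new ArrayList<>();
    for (BlockObjectEntry entry : entries) {
      if (entry.taskAttemptId >= startInclusive && entry.taskAttemptId < endExclusive) {
        keys.add(entry.objectKey);
      }
    }
    return keys;
  }
}
```

How such a manifest would itself be written without append (for example, rewritten on each flush or split into several manifest objects) is exactly the open design question here.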

@jerqi
Contributor

jerqi commented Dec 7, 2022

Every data file should have an index file.

@zuston
Member

zuston commented Dec 7, 2022

> Every data file should have an index file.

Performance will not be good, because we would have to read all the index files at once to split segments.

@jerqi
Contributor

jerqi commented Dec 7, 2022

> > Every data file should have an index file.
>
> Performance will not be good, because we would have to read all the index files at once to split segments.

We can read only one index file.

@yuyang733
Author

> I have another question: will we use the object store API directly, or go through the Hadoop FileSystem API?
>
> If it is the former, it may be better to introduce a Uniffle-dedicated filesystem API that wraps the different concrete filesystems, including COS.

In fact, could Uniffle provide an abstract storage layer, with each vendor implementing the necessary storage interfaces?

The layer would not care whether the concrete class uses a native API (such as S3) or an HCFS file system.

@advancedxy
Contributor

> In fact, could Uniffle provide an abstract storage layer, with each vendor implementing the necessary storage interfaces?

It may require a detailed design doc to illustrate your idea and proposal.

In fact, I believe Uniffle (like other RSS systems) makes strong assumptions about filesystem capabilities, such as append support. If we want to support new storage types and drop the append requirement, we should reconsider the data layout and all the features that depend on it, such as the read path and data distribution. cc @zuston and @LuciferYang.

P.S. I think it would be nice to support object stores as a new storage type. We just need to think it through thoroughly, make sure it doesn't introduce too much complexity, and keep the flexibility to add more storage types.

@LuciferYang
Contributor

Thanks for the ping, @advancedxy. I am not very familiar with object storage, but I think it is better to design it separately, to avoid data layout incompatibilities and any negative impact on the performance of the current implementation.

@zuston
Member

zuston commented Jan 16, 2023

Any update on this? @yuyang733 cc @jerqi @advancedxy

If object stores are supported, I will use them to store huge partitions and reduce HDFS pressure at iQiyi. This is an important feature for Uniffle.


5 participants