[Help wanted] Issue of too large repo, too many files, causing git process slow, need a new structure. #3941

Open
kumarharsh opened this Issue Oct 18, 2014 · 33 comments

@kumarharsh (Contributor) commented Oct 18, 2014

I was cloning the cdnjs repo and, looking at the following screenshot (and disregarding the download speed 🐌), it seems to be packing **more than 1 GB** of content, even when cloning with `--depth 1`.

[screenshot: git clone output showing over 1 GB being packed]
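(For reference, the clone in question would have looked roughly like this; only the `--depth 1` flag is taken from the comment above, the rest is the standard invocation.)

```sh
# Shallow clone: fetch only the latest commit instead of the full history.
git clone --depth 1 https://github.com/cdnjs/cdnjs.git
```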

As it stands, this repo is prohibitively large, and with every new library added it only grows, making further additions more expensive.

I'm new to cdnjs, so I'm not sure how it'll work out, or if it's feasible, but still...

I'd like to suggest some way to split this repo into two, one containing just the package.json files for the different libraries, and another which actually contains the libraries. Perhaps the new autoupdate feature would work seamlessly with this?

@PeterDaveHello (Member) commented Oct 18, 2014

should tag @thomasdavis @ryankirkman @drewfreyling for comments

@kumarharsh (Contributor) commented Oct 18, 2014

Probably break the project into 2 submodules?

For others facing the same issue, I did come upon this blog post by Atlassian.
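(A minimal sketch of what such a split might look like; the second repository name is hypothetical, not an existing cdnjs repo.)

```sh
# Hypothetical: keep metadata in the main repo and track the actual
# library files as a submodule mounted at ajax/libs.
git submodule add https://github.com/cdnjs/cdnjs-libraries.git ajax/libs
git commit -m "Track library contents as a submodule"

# Contributors who only need the metadata could then clone the main repo alone:
git clone https://github.com/cdnjs/cdnjs.git
# ...and opt in to the full contents only when needed:
git submodule update --init ajax/libs
```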

@kumarharsh (Contributor) commented Oct 18, 2014

After completing the shallow checkout with depth=4, it turns out that the /ajax/libs folder is a whopping 4.6 gigabytes in size.

As an aside, perhaps the node_modules folder could be kept out of the repo entirely; the package.json file takes care of it anyway.
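(A minimal sketch of dropping node_modules from version control while keeping it on disk; it assumes the folder is committed at the repo root.)

```sh
# Stop tracking node_modules; `npm install` can restore it from package.json.
echo "node_modules/" >> .gitignore
git rm -r --cached node_modules
git commit -m "Remove node_modules from the repo; rely on package.json"
```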

@PeterDaveHello (Member) commented Oct 18, 2014

We keep node_modules because it saves time on CI builds, but Circle CI is fast enough (we switched to it just a few weeks ago) and it has a good cache mechanism, so I'll consider removing that folder. Thanks for your comments.

@thomasdavis (Member) commented Nov 13, 2014

So we will most likely be moving towards replacing this repo with one that only contains package.json files.

Though for now I will add instructions for people to just clone with `--depth 3` or `4`.

@kumarharsh (Contributor) commented Nov 13, 2014

oh that's nice!
umm... any ETA on that? just asking, no pressure :)

@PeterDaveHello (Member) commented Apr 27, 2015

BTW, the size of .git behind this repo is currently about 647MB.
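(For anyone who wants to reproduce this measurement in a local clone, something like the following works; the exact commands are not from the comment.)

```sh
du -sh .git              # total size of the repository database
git count-objects -v -H  # object counts and pack sizes, human readable
```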

PeterDaveHello changed the title from "Suggestions for improvement in the submission flow" to "[Help wanted] Issue of too large repo, too many files, causing git process slow" on Apr 27, 2015

@pieroxy (Contributor) commented Apr 27, 2015

Do you have an ETA on an alternative way to handle adding/updating libraries? This repo is close to the point where the process is so cumbersome that people will simply not submit their libs to it (not only my impression), so I feel something is needed very soon.

@PeterDaveHello (Member) commented Apr 27, 2015

No ETA yet. We don't have enough manpower, or a clear idea of how to handle it. I tried git-lfs and realized that its overhead is too high; we need help on this issue.
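(For context, a typical git-lfs setup looks roughly like the sketch below; the tracking pattern is only an illustration, not necessarily what was tested here.)

```sh
git lfs install                 # set up the lfs filters in git config
git lfs track "ajax/libs/**"    # route matching files through lfs pointers
git add .gitattributes
git commit -m "Track library files with git-lfs"
```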

PeterDaveHello changed the title from "[Help wanted] Issue of too large repo, too many files, causing git process slow" to "[Help wanted] Issue of too large repo, too many files, causing git process slow, need a new structure." on Apr 27, 2015

@thomasdavis (Member) commented Apr 27, 2015

We will seemingly just have to manage a normal static directory on a master server somewhere. We can back it up with rsync every time we update it to emulate version control.
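(A sketch of that rsync-based snapshotting idea; the hostname and paths are hypothetical.)

```sh
# Mirror the static directory to a backup host, one dated snapshot per update.
rsync -a --delete /srv/cdnjs/ backuphost:/backups/cdnjs-$(date +%F)/
```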

@pieroxy (Contributor) commented Apr 28, 2015

I don't think the Git way is going to scale any further for this purpose (nor will GitHub). What I feel is needed is a small UI people can upload their lib to: a simple form that takes the https URL of the GitHub repo, and the package.json in that repo should do the rest.

Eventually, you could maintain that in a git repo containing one text file per lib, each holding the URL of that lib's git repo. But again, that's not what git was meant for.

@ryankirkman (Member) commented Apr 29, 2015

We're on the waiting list for Git Large File Storage for cdnjs, so hopefully that sorts our storage problem out :)

/cc @PeterDaveHello @thomasdavis

@ryankirkman (Member) commented Apr 29, 2015

@kumarharsh FYI, you can use git sparse checkout to only work on the part of the repo you care about. It should help reduce the size of what you have to check out locally.
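(A minimal sketch of sparse checkout as it worked in git at the time, applied to an existing clone; the library path below is just an example.)

```sh
git config core.sparseCheckout true
echo "ajax/libs/jquery/" >> .git/info/sparse-checkout   # example library path
echo "package.json"      >> .git/info/sparse-checkout
git read-tree -mu HEAD   # re-populate the working tree from the patterns
```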

@pieroxy (Contributor) commented Apr 29, 2015

@ryankirkman: Git Large File Storage is about handling large files, not a large number of files. It will not help one bit here. Git sparse checkout, while it helps a bit, is not going to scale either.

What you are looking for is a way to store a HUGE number of small files. Git is not the answer for this. I've been working in related fields for more than 20 years now and I'll probably stop following this thread soon, so I'll give you my last piece of advice here.

A revision control system is JUST NOT THE RIGHT TOOL FOR THE JOB. STOP LOOKING AT WORKING AROUND GIT. It will not scale. It was never meant to. It will never scale the way you want, no matter what "plugins" or workarounds you find, simply because git was not built for what you are trying to do. @PeterDaveHello, asking for help because git is too slow is not the right message. You don't need a way to make git faster; you need a system to host a large number of files and a way for individual users to update their files. This has nothing to do with git. Any workaround for git will at best buy you a few more months, and you will stay trapped in a system that doesn't work (at least not properly) until the death of the project.

What you guys need is a replicated file system (across all the mirrors you host). That is little more than FTP plus rsync. Simple tools exist; just find the proper way to combine them.

@PeterDaveHello (Member) commented Apr 30, 2015

@ryankirkman git-lfs won't help; I already tried it. The waiting list is only for access to git-lfs on GitHub, but we can already use it with another implementation, and when I tried, the overhead was too high. It's not designed for small files and it won't fix our problem.

@PeterDaveHello (Member) commented Apr 30, 2015

@pieroxy thank you for your advice. I know that git-lfs won't help, and I'm still looking for a solution.

@kumarharsh (Contributor) commented May 4, 2015

@pieroxy I agree with everything you said there. There is really no way the current system is going to scale at all.

@PeterDaveHello (Member) commented May 4, 2015

BTW, I wrote up the steps to use sparse checkout plus a shallow clone/pull as a workaround until we fix this issue.
One option will be a web interface where contributors can update files and libs; we would then do all the work behind our server, but still commit as the contributor.

@arasmussen (Contributor) commented Jun 25, 2015

Honestly, you already have a great tool (cdnjs-importer) that takes a git repo as input and does the rest for you. Can't you just build a tiny website with a "library git repo" input and a submit button and automate the rest? Submitting a pull request feels like complete overkill.

@arasmussen (Contributor) commented Jun 25, 2015

And by tiny website I mean add a "submit" page to cdnjs.com.

@PeterDaveHello (Member) commented Jun 25, 2015

Yes, will do, but no schedule for it yet.

@IonicaBizau (Member) commented Jun 25, 2015

@arasmussen That's a neat idea!
@PeterDaveHello That would be very easy using cdnjs-importer as a library.

@swcheon commented Aug 15, 2015

I want to keep many libraries up to date, but it is hard to commit because the git repo is so large.
When I open SourceTree, the app crashes.

@PeterDaveHello (Member) commented Aug 15, 2015

Try sparse-checkout + shallow clone:
sparseCheckout.md
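(Roughly the kind of workflow such a document describes: a fresh shallow clone combined with sparse checkout. The depth, branch, and library path below are examples, not a quote from sparseCheckout.md.)

```sh
git init cdnjs && cd cdnjs
git remote add origin https://github.com/cdnjs/cdnjs.git
git config core.sparseCheckout true
echo "ajax/libs/jquery/" >> .git/info/sparse-checkout   # only the lib you work on
git pull --depth 1 origin master                        # shallow fetch of that subset
```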

@fiznool commented Dec 2, 2015

+1. I came here to submit a PR to add Imager.js but I don't have the time or bandwidth to wait for the entire git repo to clone. A simple website submission would be ideal.

@PeterDaveHello (Member) commented Dec 2, 2015

@fiznool sorry about that, it's on our todo list. In the meantime, you can just open a request issue ticket and we will add it. Thanks.

@clayreimann commented May 4, 2016

@PeterDaveHello What would you guys say to splitting out the package definitions from the actual contents?

This repo (or a new one) could be the destination for the package.json files, and a separate repository could be where the actual content gets stored. That way, when someone is trying to add a library (like I am), they would only need to clone the config repo.

I would be willing to help with this, as I want to add a library but can't because my editor crashes (and sometimes my shell) when I try to work with this repository.

@PeterDaveHello (Member) commented May 4, 2016

@clayreimann we'll need much more time to discuss how to handle the files, but in the meantime you can actually submit a PR on GitHub with only a single package.json; please take a look at #7149.

@clayreimann commented May 4, 2016

Where will (is?) that discussion happening?

@PeterDaveHello (Member) commented May 4, 2016

@clayreimann some on GitHub, some on Gitter. Sorry, there is in fact no full-time developer here, so things may be a bit messy, but I really appreciate that you would like to help solve this problem.

@clayreimann commented May 4, 2016

@PeterDaveHello Is there a place where the architecture of cdnjs is described? I.e. where is it hosted (gh-pages?), where does CloudFlare get the assets it's caching, and where does PeterBot live?

@PeterDaveHello (Member) commented May 4, 2016

@clayreimann: CloudFlare pulls the cdnjs repo periodically to their edge servers; PeterBot lives on my own VPS.
