
[Help wanted] Issue of too large repo, too many files, causing git process slow, need a new structure. #3941

Closed
kumarharsh opened this issue Oct 18, 2014 · 35 comments

Comments

@kumarharsh
Contributor

kumarharsh commented Oct 18, 2014

I was cloning the cdnjs repo, and looking at the following screenshot (and disregarding the download speed 🐌), it seems to be packing *>1GB* of content, even when cloning with --depth 1:

(screenshot: git clone output for cdnjs showing >1GB of objects being packed)

As it stands, this repo is prohibitively large, and with each new library added it will only grow, making new additions more expensive.

I'm new to cdnjs, so I'm not sure how it'll work out, or if it's feasible, but still...

I'd like to suggest some way to split this repo into two, one containing just the package.json files for the different libraries, and another which actually contains the libraries. Perhaps the new autoupdate feature would work seamlessly with this?



@PeterDaveHello
Contributor

should tag @thomasdavis @ryankirkman @drewfreyling for comments

@kumarharsh
Contributor Author

Maybe break the project into two submodules?

For others facing the same issue, I did come upon this blog post by Atlassian.
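For anyone exploring that route, here is a minimal sketch of one way such a split could be done with git subtree (the target repo name and branch are hypothetical):

```sh
# Extract the history of ajax/libs into its own branch,
# then push that branch to a new, separate repository.
git subtree split --prefix=ajax/libs -b libs-only
git push git@github.com:cdnjs/cdnjs-libs.git libs-only:master

# Remove the directory here and re-attach it as a submodule.
git rm -r ajax/libs
git commit -m "Move ajax/libs to its own repository"
git submodule add git@github.com:cdnjs/cdnjs-libs.git ajax/libs
```

Note that the main repo's history would still carry the old blobs unless it is rewritten as well, so this alone does not shrink clones of the existing repo.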

@kumarharsh
Contributor Author

After completing the shallow checkout with depth=4, it turns out that the /ajax/libs folder is a whopping 4.6 gigabytes in size.

As an aside, perhaps the node_modules folder could be kept out of the repo entirely. The package.json file takes care of the dependencies anyway.
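If that route is taken, a minimal sketch of dropping node_modules from tracking while keeping it on disk might look like this:

```sh
# Stop tracking node_modules but leave it in the working directory.
git rm -r --cached node_modules

# Ignore it from now on.
echo "node_modules/" >> .gitignore

git add .gitignore
git commit -m "Stop tracking node_modules; dependencies are declared in package.json"
```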

@PeterDaveHello
Contributor

We keep node_modules because it can save time on CI builds, but CircleCI is fast enough (we just switched to it a few weeks ago) and has a good cache mechanism, so I'll consider removing that folder. Thanks for your comments.

@thomasdavis
Member

So we will most likely be moving towards replacing this repo with a repo that only contains package.json files.

Though for now I will add instructions for people to just clone with depth=3/4.
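Until those instructions land, a shallow clone is roughly this (the depth value follows the comment above):

```sh
# Fetch only the most recent commits instead of the full history.
git clone --depth 3 https://github.com/cdnjs/cdnjs.git
cd cdnjs

# Later updates can stay shallow as well.
git pull --depth 3
```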

@kumarharsh
Contributor Author

oh that's nice!
umm... any ETA on that? just asking, no pressure :)

@PeterDaveHello
Contributor

BTW, the size of .git behind this repo is currently about 647MB.

@PeterDaveHello PeterDaveHello changed the title Suggestions for improvement in the submission flow [Help wanted] Issue of too large repo, too many files, causing git process slow Apr 27, 2015
@pieroxy
Contributor

pieroxy commented Apr 27, 2015

Do you have an ETA on an alternative way to handle adding/updating libraries? This repo is close to the point of implosion where the process is so cumbersome that people simply won't submit their libs to it (not only my impression), so I feel something needs to happen very soon.

@PeterDaveHello
Contributor

No ETA yet; we don't have enough people or a clear idea of how to handle it. I tried git-lfs and realized that its overhead is too high. We need help on this issue.

@PeterDaveHello PeterDaveHello changed the title [Help wanted] Issue of too large repo, too many files, causing git process slow [Help wanted] Issue of too large repo, too many files, causing git process slow, need a new structure. Apr 27, 2015
@thomasdavis
Member

We will seemingly just have to manage a normal static directory on a master server somewhere. We can back it up with rsync every time we update it to emulate version control.
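A rough sketch of that rsync-based backup, with hypothetical paths and host names:

```sh
# Mirror the static library tree to a backup host; each run writes a
# dated snapshot directory, which emulates coarse version control.
rsync -a /srv/cdnjs/ajax/libs/ backup-host:/backups/cdnjs/$(date +%F)/
```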

@pieroxy
Contributor

pieroxy commented Apr 28, 2015

I don't think the Git way is going to scale any further for this purpose (nor will GitHub). What I feel is needed is a small UI that people can upload their lib to: a simple form that takes the https URL of the GitHub repo, and the package.json in that repo should do the rest.

Eventually, you could maintain that in a git repo containing txt files (one per lib), where each file holds the URL of the library's git repo. But again, that's not what git was meant for.

@ryankirkman
Member

We're on the waiting list for Git Large File Storage for cdnjs, so hopefully that sorts our storage problem out :)

/cc @PeterDaveHello @thomasdavis

@ryankirkman
Member

@kumarharsh FYI you can use git sparse checkout to only work on the part of the repo you care about. It should help reduce the size of your working copy.
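A minimal sketch of that, combined with a shallow clone (the library path is a placeholder):

```sh
# Clone without populating the working tree, and only the latest commit.
git clone --depth 1 --no-checkout https://github.com/cdnjs/cdnjs.git
cd cdnjs

# Enable sparse checkout and list just the paths you care about.
git config core.sparseCheckout true
echo "ajax/libs/your-library/" >> .git/info/sparse-checkout

# Populate the working tree with only those paths.
git checkout master
```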

@pieroxy
Contributor

pieroxy commented Apr 29, 2015

@ryankirkman: Git Large File Storage is about handling large files, not a large number of files. It will not help one bit here. Git sparse checkout, while it helps a bit, is not going to scale either.

What you are looking for here is a way to store a HUGE number of small files. Git is not the answer for this. I've been working in related fields for more than 20 years now and I'll probably stop following this thread soon, so I'll give you my last piece of advice here.

A revision control system is JUST NOT THE RIGHT TOOL FOR THE JOB. STOP LOOKING AT WORKING AROUND GIT. It will not scale. It was never meant to. It will never scale the way you want, no matter what "plugins" or workarounds you find, simply because git was not made for what you are trying to do. @PeterDaveHello asking for help because git is too slow is not the right message. You don't need a way to make git faster; you need a system to host a large number of files and a system for individual users to update their files. This has nothing to do with git. Any workaround on top of git is at best going to buy you a few more months, but that is all. And you will be trapped in a system that doesn't work (at least not properly) until the death of the project.

What you guys need is a replicated file system (among all the mirrors you host). That is little more than FTP and rsync. Simple tools exist; just find the proper way to combine them.

@PeterDaveHello
Contributor

@ryankirkman git-lfs won't help, I already tried it. The waiting list is for git-lfs access on GitHub, but we can use it with other implementations, and when I tried, the overhead was too high. It's not designed for small files, and it won't fix our problem.
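For context, the experiment looks roughly like this; every tracked file becomes a small pointer in git plus a separate object transfer, which is where the per-file overhead comes from (the pattern is illustrative):

```sh
# Install the LFS hooks and track the whole library tree.
git lfs install
git lfs track "ajax/libs/**"

# The pattern is recorded in .gitattributes; each matching file is then
# stored as an LFS pointer and transferred individually.
git add .gitattributes
git commit -m "Experiment: track library files with git-lfs"
```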

@PeterDaveHello
Contributor

@pieroxy thank you for your advice. I know that git-lfs won't help, and I'm still looking for a solution.

@kumarharsh
Contributor Author

@pieroxy I agree with everything you said there. There is really no way the current system is going to scale at all.

@PeterDaveHello
Contributor

BTW, I wrote up the steps to use sparse checkout + shallow clone (and pull) as a workaround until we fix this issue.
One of the future methods will be a web interface for contributors to update their files and libs; we would then do all the work behind our server, but still commit as the contributor.

@arasmussen
Contributor

Honestly you already have a great tool (cdnjs-importer) that takes a git repo as an input and does the rest for you. Can't you just build a tiny website with an input "library git repo" and a submit button and automate the rest? I feel like submitting a pull request is incredibly overkill.

@arasmussen
Contributor

And by tiny website I mean add a "submit" page to cdnjs.com.

@PeterDaveHello
Contributor

Yes, will do, but no schedule for it yet.

@IonicaBizau
Contributor

@arasmussen That's a neat idea! ✨
@PeterDaveHello That would be very easy, using cdnjs-importer as library.

@swcheon

swcheon commented Aug 15, 2015

I want to keep many libraries up to date, but it is hard to commit. The git repo is far too large.
When I open it in SourceTree, the app crashes.

@PeterDaveHello
Contributor

Try sparse-checkout + shallow clone:
sparseCheckout.md

@fiznool

fiznool commented Dec 2, 2015

+1. I came here to submit a PR to add Imager.js but I don't have the time or bandwidth to wait for the entire git repo to clone. A simple website submission would be ideal.

@PeterDaveHello
Contributor

@fiznool sorry about that, it's on our todo list. In the meantime, you can just open a request issue ticket and we will add it, thanks.

@clayreimann

clayreimann commented May 4, 2016

@PeterDaveHello What would you guys say to splitting out the package definitions from the actual contents?

This repo (or a new one) could be the destination for package.json files, and a separate repository could be where the actual content gets stored. That way, when someone is trying to add a library (like I am), they only need to clone the config repo.

I would be willing to help with this, as I want to add a library but can't because my editor (and sometimes my shell) crashes when I try to work with this repository.

@PeterDaveHello
Contributor

@clayreimann we'll need much more time to discuss how to handle the files, but in the meantime you can actually submit a PR on GitHub with only a single package.json; please take a look at #7149.
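For illustration, such a metadata-only submission could be as small as one file. The exact fields cdnjs expects should be copied from an existing library's package.json; the ones below are assumptions used only to sketch the shape:

```sh
# Hypothetical example: add only the metadata file for a new library.
mkdir -p ajax/libs/example-lib
cat > ajax/libs/example-lib/package.json <<'EOF'
{
  "name": "example-lib",
  "version": "1.0.0",
  "description": "Illustrative entry only; copy the real fields from an existing library.",
  "filename": "example-lib.min.js",
  "repository": {
    "type": "git",
    "url": "https://github.com/example/example-lib.git"
  }
}
EOF
git add ajax/libs/example-lib/package.json
git commit -m "Add example-lib (metadata only)"
```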

@clayreimann

Where will (is?) that discussion happening?

@PeterDaveHello
Contributor

@clayreimann some on GitHub, some on Gitter. Sorry, there is in fact no full-time developer here, so things may be a bit messy, but I really appreciate that you would like to help solve this problem.

@clayreimann

@PeterDaveHello Is there a place where the architecture of cdnjs is described? i.e. where is it hosted (gh-pages?), where does cloudflare get the assets it's caching, where does PeterBot live?

@PeterDaveHello
Contributor

PeterDaveHello commented May 4, 2016

@clayreimann: CloudFlare pulls the cdnjs repo periodically to their edge servers; PeterBot lives on my own VPS.

@dbsanfte

dbsanfte commented Oct 2, 2019

I'd hate to see what a low priority issue looks like.

@MattIPv4
Member

We are working to split how cdnjs works so that there is a much smaller repository for humans to work with. Keep an eye on cdnjs/packages.
