This repository has been archived by the owner on Apr 11, 2023. It is now read-only.

Request for a smaller dataset for researchers with lesser resources #81

Closed
vj68 opened this issue Oct 23, 2019 · 8 comments

Comments

@vj68

vj68 commented Oct 23, 2019

Thank you for making this amazing problem statement public, along with a very comprehensive dataset!

Could a relatively small subset of the dataset be made available for independent developers/researchers who might try running this on their personal machines?

This will open up the problem for a larger audience and may bring in some innovative solutions!

@vj68 vj68 changed the title A smaller dataset for researchers with lesser resources Request for a smaller dataset for researchers with lesser resources Oct 23, 2019
@bzz
Contributor

bzz commented Oct 27, 2019

Hi @rajurajvijay619, did you try using just a single language for the experiments?

E.g. for Java, I find the total of 500k samples from 184 MB of .gz files to be very comfortably manageable on a laptop.

As one can see from the published analysis example (screenshot: per-language dataset sizes), languages like Go, JS or Ruby would give even smaller dataset sizes and fit on almost any local machine.

Hope this helps and good luck with experiments!
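For anyone taking the single-language route, here is a minimal sketch of loading one language's data. Note the `<language>/**/*.jsonl.gz` directory layout and the `load_samples` helper are assumptions for illustration, not something this repo guarantees:

```python
import gzip
import json
from pathlib import Path

def load_samples(data_dir, language):
    """Load every sample for one language from .jsonl.gz shards.

    Assumes shards live under data_dir/<language>/ (possibly nested),
    one JSON object per line -- adjust the glob to your actual layout.
    """
    samples = []
    for shard in sorted(Path(data_dir).glob(f"{language}/**/*.jsonl.gz")):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            for line in f:
                samples.append(json.loads(line))
    return samples
```

Since each shard is read line by line, memory stays proportional to the samples you keep, which is what makes a single language laptop-friendly.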

@sara-02

sara-02 commented Oct 28, 2019

@bzz just one question: when running for a single language (local machine), does the setup still require GPUs?

@hamelsmu
Contributor

@sara-02 you can download the data without GPUs; however, running the default models in this repo will be painfully slow without them. You can try training on a smaller sample of the data as @bzz proposes, and you can also set this parameter to limit the size of the data.

Also, Google Colab notebooks are great for free GPUs. Thanks for getting involved with this project ❤️
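If you would rather limit the data by hand after downloading, a reproducible random subsample is a simple alternative. This `subsample` helper is a hypothetical sketch, not the parameter the repo exposes:

```python
import random

def subsample(samples, n, seed=0):
    """Return a reproducible random subset of n samples.

    Hypothetical helper: fixing the seed makes experiments on the
    reduced dataset repeatable across runs and machines.
    """
    rng = random.Random(seed)
    if n >= len(samples):
        return list(samples)
    return rng.sample(samples, n)
```

Sampling after download keeps the subset representative of the full distribution, whereas truncating to the first N lines of a shard can bias toward whichever repositories happen to sort first.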

@hamelsmu
Contributor

@rajurajvijay619 can you describe your constraints a bit more? Is it the disk space needed to download the dataset? Could you download the entire dataset and just sample from that?

Thanks for your feedback

@sara-02

sara-02 commented Oct 29, 2019

> @sara-02 you can download data without GPUs, however running the default models in this repo will be painfully slow without gpus. However, you can try training on a smaller sample of the data as @bzz proposes, you can also set this parameter to limit the size of the data.
>
> Also, google colab notebooks are great for free GPUs. Thanks for getting involved with this project ❤️

Thanks. I will look into Colab as well as running it locally with only one language. I was hesitant to start because the first step in the setup states: "Additionally, you must install Nvidia-Docker to satisfy GPU-compute related dependencies." So I thought the code might not run as-is on a local system with GPUs.

@hamelsmu
Contributor

@sara-02 you are correct regarding Docker. I think in the end the Docker setup could make your life easier, as installing all the dependencies by hand can become very cumbersome and brittle.

Let me know where you are struggling with Docker and I will be more than happy to help! I wrote this tutorial on Docker in case a gentle introduction is useful.

Looking forward to seeing what you do with this dataset! Please do not be shy about asking questions!

@hamelsmu
Contributor

If you are using Colab, I do not believe you will be able to use Docker; in that case you will have to pip-install, in the Colab notebook, all the dependencies defined in the Dockerfile.

@hamelsmu
Contributor

I'll go ahead and close this issue; please let me know if there are any more questions.
