Request for a smaller dataset for researchers with lesser resources #81
Comments
Hi @rajurajvijay619, did you try using just a single language for the experiments? E.g., for Java, I find a total of 500k samples from 184Mb of As one can see from published analysis example, Hope this helps and good luck with the experiments!
@bzz just one question: when running for a single language (on a local machine), does the setup still require GPUs?
@sara-02 you can download the data without GPUs, but running the default models in this repo will be painfully slow without them. You can try training on a smaller sample of the data, as @bzz proposes, and you can also set this parameter to limit the size of the data. Also, Google Colab notebooks are great for free GPUs. Thanks for getting involved with this project ❤️
@rajurajvijay619 can you describe your constraints a bit more? Is it disk size for downloading the dataset? Can you download the entire dataset and just sample from that? Thanks for your feedback
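For anyone taking the "download everything, then sample" route, here is a minimal sketch of drawing a fixed-size random subset without loading a whole file into memory. It assumes the downloaded data is stored as gzipped JSON Lines (one JSON object per line), which this thread does not confirm. The function names and file paths are hypothetical and not part of this repo.

```python
import gzip
import json
import random
from typing import Iterable, List


def reservoir_sample(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Return up to k lines chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            sample.append(line)
        else:
            # Keep the new line with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample


def sample_jsonl_gz(src_path: str, dst_path: str, k: int, seed: int = 0) -> int:
    """Write a random subset of at most k records; return the number written.

    Assumes src_path is a gzipped JSON Lines file (an assumption, see above).
    """
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        subset = reservoir_sample((ln for ln in src if ln.strip()), k, seed)
    with gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for ln in subset:
            json.loads(ln)  # fail early if a record is not valid JSON
            dst.write(ln if ln.endswith("\n") else ln + "\n")
    return len(subset)
```

A call such as `sample_jsonl_gz("full.jsonl.gz", "subset.jsonl.gz", k=10000)` (hypothetical paths) would keep at most 10,000 records. Fixing `seed` makes the subset reproducible across runs.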
Thanks. I will look into Colab as well as running it locally with only one language. I was hesitant to start because the first step in setup states that
@sara-02 you are correct regarding Docker. I think in the end it could make your life easier to use the Docker setup, as installing all the dependencies by hand can become very cumbersome and brittle. Let me know where you are struggling with Docker and I will be more than happy to help! I wrote this tutorial regarding Docker in case a gentle introduction is useful. Looking forward to seeing what you do with this dataset! Please do not be shy in asking questions!
If you are using Colab, I do not believe you will be able to use Docker. In that case you will have to install via
I'll go ahead and close this issue. Please let me know if there are any more questions.
Thank you for making this amazing problem statement public, along with a very comprehensive dataset!
Can a relatively small subset of the dataset be made available for independent developers and researchers who might try running this on their personal machines?
This will open up the problem to a larger audience and may bring in some innovative solutions!