[Structure] Dockerizing the project #30
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
…estration Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
We'll download the csv file during run time to help reduce the repo size.
Looks great, I've removed the large csv (it's a dataset containing 10,000 protein sequences) because we'll be downloading that during runtime. I'll run this locally and see if it works for me.
@fissoreg have a play with this as well?
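To make the "download at runtime" idea concrete, here is a minimal sketch of how the first-run download could look. The URL and the default path are placeholders (the thread does not state where the CSV is hosted), and `ensure_dataset` is a hypothetical helper name, not something taken from the PR.

```python
import os
import urllib.request

# Hypothetical values for illustration; the real URL and path are not given in this thread.
DATASET_URL = "https://example.com/pdb_data_seq.csv"
DATASET_PATH = os.path.join("data", "pdb_data_seq.csv")


def ensure_dataset(path=DATASET_PATH, url=DATASET_URL, fetch=urllib.request.urlretrieve):
    """Download the dataset to `path` unless it is already there.

    Returns True if a download happened, False if the existing file was reused.
    """
    if os.path.isfile(path):
        return False
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Download under a temporary name first so an interrupted download
    # is not mistaken for a complete file on the next run.
    tmp_path = path + ".part"
    fetch(url, tmp_path)
    os.replace(tmp_path, path)
    return True
```

Calling `ensure_dataset()` once at backend start-up would keep the CSV out of the repository while only paying the download cost on the first run. The `fetch` parameter is there so the function can be exercised without network access.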
@georgeamccarthy how do I prevent
These are the logs that the
Other than that, the Also, a folder is created under Notice in the 5th download line, it is pulling something 1.68G in size (scary 😨 ).
Checked. So I deleted the containers entirely, removed After this log output, the container spends exactly 10 minutes (600000 ms) trying to reach a
The intended behaviour is for the file to be downloaded on first run. So if that's what it's doing that's ok, it's an unavoidably large file because it's needed for computing the embeddings. Have I understood your question?
Not really sure what a Peapod is, but this means our flow is unable to start after 600000.0 ms of trying. I'm having a similar issue, #31, which I think might be related. Not sure what to suggest for now, I'm looking into it. It's possible that it's unrelated to Docker.
The backend will need to start successfully before the frontend will connect. Hopefully once the previous issue is sorted it will work.
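One way to handle the start-up ordering on the frontend side is to poll the backend's port before sending the first query. The following is only a sketch under assumptions: `wait_for_port` is a hypothetical helper, the host and port must be supplied by the caller, and the default timeout of 600 seconds simply mirrors the 600000 ms seen in the logs above.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=600.0, interval_s=1.0):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection could be made within timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Covers ConnectionRefusedError (errno 111 on Linux) and timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

This only checks that something is listening on the port. It does not confirm that the flow has finished loading, so a real readiness check may still be needed on top of it.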
😱 Yes please! Currently we're having the host computer compute the embeddings on first run.
Yep!
Interesting.
Noted.
Noted.
… Dockerizing Removing data/pdb_data_seq.csv
Some general remarks:
Cool stuff @Rubix982, welcome to the project! ;)
The problem with this is that I have to figure out (and I'm not sure if this is possible) how to cache that. I did try doing Python and Docker have this weird thing that if I install any dependency with the command I'll try it again in some time if this indeed takes care of the caching the way I want it to. Docker best practices do not like the idea of reinstalling dependencies over and over again. This is a big no-no in distributed computing.
I tried this ... and the Docker image only offered some jina related commands, and I'm not familiar with CLI that the By default, Docker creates the container and enters as root. More documentation reading required here.
For this, I need to know whether you guys want me to remove the data folder entirely. As homework for me, I have to look into,
@fissoreg I'm waiting for our 1-to-1 so you can introduce me to @jina-ai more so I can contribute as well. 👍
In addition to some reading, you might find the Jina Slack helpful; they really welcome discussion on anything from simple to advanced. I've found it super helpful! :) http://slack.jina.ai
I would have said that wasted bandwidth is better than managing dependencies manually...
...but this is a convincing remark! Anyways, the following is strange:
So I think that the best thing would be to try to understand what is happening there and fix it. Let me add, as @gmelodie would say: "Research time is not wasted time".
I hope this will be helpful! :)
Related to #18
RUN useradd --create-home jina

# Add the models folder locally to container
COPY ./models /app/models
This should be loaded from a config file.
def initialize_executor():

    # If the model is not already cached ...
    if not os.path.isdir("./models/prot_bert"):
Here the path should be loaded from a config file (the same as for the Dockerfile). For now we have backend_config.py.
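As a rough illustration of what "loaded from a config file" could mean here, the path can be defined once and the cache check written against it. This is a sketch, not the contents of the project's `backend_config.py`: the environment variable name `MODELS_DIR`, the constants, and the helper `model_is_cached` are all assumptions made for the example.

```python
import os
from pathlib import Path

# Assumption: a single environment variable (the name is hypothetical) overrides
# the default, so the Dockerfile and the Python code can agree on one location.
MODELS_DIR = Path(os.environ.get("MODELS_DIR", "./models"))
MODEL_NAME = "prot_bert"


def model_is_cached(models_dir=MODELS_DIR, model_name=MODEL_NAME):
    """Return True if the model directory already exists locally."""
    return (Path(models_dir) / model_name).is_dir()
```

With something like this in the config module, the executor would call `model_is_cached()` instead of repeating the literal `"./models/prot_bert"` path.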
I'll look into the YAML configuration for Jina tonight over how I can store variables there to act as a whole for the backend project.
Thanks for the referral!
@georgeamccarthy, @fissoreg I'm unable to merge by myself. Need help with that.
@fissoreg Unfortunately, there is no easy way of using YAML variables inside Dockerfiles. The only way is to use
The merge is here: Rubix982#1
Ok thanks @Rubix982! Maybe .env files are a good option. If you have other ideas, let's discuss. The important point is that it would be better to parametrize the various paths only once and in one place.
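For the .env idea, the Python side could read the same file that the container tooling consumes. Below is a minimal sketch of such a reader, assuming simple `KEY=VALUE` lines. `load_env_file` is a hypothetical helper written for this example, and the key names used in the usage note are made up.

```python
def load_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file.

    Blank lines, lines starting with '#', and lines without '=' are ignored.
    Surrounding single or double quotes around values are removed.
    """
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

A file containing, say, `MODELS_DIR=./models` would then yield `{"MODELS_DIR": "./models"}`, so the paths live in one place and are read by both sides. This does not handle multi-line values or variable interpolation.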
This is weird. I did not get a notification on my own repository. Thanks. I'll try to implement the
This is great, thanks! But let's make that into a different PR, maybe? So we can merge this one ASAP.
2. Full PDB dataset. 3. Minors.
2. Fixed `Dockerfile` and `Makefile` for backend. 3. Fixed dependencies.
After making the changes suggested by Cristian on Slack, I am able to finally get results from the endpoint
The changes have been made and pushed to my fork. After the fixes, the Streamlit application throws these errors,
These errors are thrown from line 95 of
# Execute the query on the transport
result = client.execute(query, variable_values={"ids": ids})
I believe these bug fixes are independent of this PR's objective. This should be merged and closed, and the issue solved in another PR. What do you think, @fissoreg?
Agreed, let's merge and move forward.
Pull Request Type
Purpose
Why?
Changes Introduced
- `requirements.txt` from root, splits dependencies amongst the `backend` and the `frontend`, by creating individual `requirements.txt`
- `data/` from the root into `backend/`
- `*.py` files in `backend` to `backend/src/`
- Docker Hub. This is the repository for the frontend, and the backend
Bugs (WIP)
- Errno 111 - Connection Refused
- `aiohttp` - to be added in `requirements.txt`
Notes
- `backend` container is gigantic, it's near 1 GB due to the `torch` dependency (831MB). I was able to cache the containers, which means you will only need to install the requirements once for both the containers, and it should practically load them given that the dependencies have not changed
- `docker`, `docker-compose` on a machine. The containers can be built and started with running `make docker` in the root. They can be temporarily closed with `Ctrl^C`, started again with `make up`, and removed with `make remove`
- `jina` in the Dockerfile was created because `pip` does not like installing as root
- `pdb_data_seq.csv` which is 10K lines long (hence so much green in this PR), I'm not sure why it did that
Feedback required over
Mentions