
[Structure] Dockerizing the project #30

Merged
fissoreg merged 35 commits into georgeamccarthy:main from the Dockerizing branch on Aug 14, 2021

Conversation

Rubix982
Contributor

@Rubix982 Rubix982 commented Jul 23, 2021

Pull Request Type

  • 🏆 Enhancements - this PR aims to dockerize the application

Purpose

  • Introduces containers, making it easier to build and add features without worrying about configuration

Why?

  • As per changes required for the demo

Changes Introduced

  • Removes requirements.txt from the root and splits the dependencies between the backend and the frontend by creating an individual requirements.txt for each
  • Moves data/ from the root into backend/
  • Moves all *.py files in backend to backend/src/
  • Introduces image tag names for the images to be pushed to Docker Hub, one repository for the frontend and one for the backend (resulting layout sketched below)
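
For reference, a sketch of the resulting layout described in the bullets above; only items mentioned in this PR are shown, and the exact locations of the Makefile and the frontend files are assumptions:

backend/
    Dockerfile
    requirements.txt
    data/
    src/            (all backend *.py files)
frontend/
    Dockerfile
    requirements.txt
docker-compose.yml
Makefile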

Bugs (WIP)

  • The frontend cannot connect to the backend (Errno 111 - Connection Refused)
  • Jina requires aiohttp, which needs to be added to requirements.txt
  • Testing to make sure the frontend and the backend work without any problems

Notes

  • The backend container is gigantic, nearly 1 GB, due to the torch dependency (831 MB). I was able to cache the builds, which means you will only need to install the requirements once for both containers; subsequent builds reuse them as long as the dependencies have not changed
  • To start the containers, you need docker and docker-compose on the machine. The containers can be built and started by running make docker in the root. They can be temporarily stopped with Ctrl^C, started again with make up, and removed with make remove
  • A user called jina is created in the Dockerfile because pip does not like installing as root (see the sketch after this list)
  • Jina by itself downloads a file called pdb_data_seq.csv which is 10K lines long (hence so much green in this PR); I'm not sure why it does that
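
A minimal sketch of the non-root pattern mentioned above; the base image, working directory, and paths are assumptions for illustration, not the exact contents of this PR's Dockerfile:

# Hypothetical backend Dockerfile excerpt
FROM python:3.8-slim

# pip complains when installing packages as root, so create and switch to a non-root user.
RUN useradd --create-home jina
USER jina
WORKDIR /home/jina/app

# User-level installs land in ~/.local/bin, so put that on PATH.
ENV PATH="/home/jina/.local/bin:${PATH}"

# Installing from a copied requirements.txt keeps this layer cacheable.
COPY --chown=jina:jina requirements.txt .
RUN pip install --user -r requirements.txt

COPY --chown=jina:jina . .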

Feedback required over

  • A quick pair of 👀 on the code
  • Discussion on the technical approach

Mentions

Rubix982 and others added 12 commits July 23, 2021 13:09
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
…estration

Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
We'll download the csv file during run time to help reduce the repo size.
Owner

@georgeamccarthy georgeamccarthy left a comment

Looks great, I've removed the large csv (it's a dataset containing 10,000 protein sequences) because we'll be downloading that during runtime. I'll run this locally and see if it works for me.

@fissoreg have a play with this as well?

@Rubix982
Contributor Author

Rubix982 commented Jul 23, 2021

@georgeamccarthy how do I prevent jina from downloading it? It does that if I remove the file. I'll try removing it again and see if it can work as expected without the .csv file?

@Rubix982
Contributor Author

Rubix982 commented Jul 23, 2021

These are the logs the backend generates after I've installed aiohttp. I need help with this since I'm not sure what a peapod is, or what resource it is trying to reach that it throws a TimeoutError for, referring to the message below:

protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f3328dada90>
can not be started
due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'),
Flow is aborted

Other than that, the frontend still cannot establish a connection.

Also, a folder is created under backend/ called embeddings that has a protein.json in it. Should I add embeddings in the .gitignore?

image

Notice in the 5th download line, it is pulling something 1.68G in size (scary 😨 ).

@Rubix982
Contributor Author

Rubix982 commented Jul 23, 2021

Checked. I deleted the containers entirely, removed embeddings and data/pdb_data_seq.csv, then built the containers from scratch. The backend still pulls them, as seen in the screenshot below. We can either figure out why this happens, or simply add this to .gitignore as well. Also, jina always downloads 4 things but doesn't mention what they are, as seen in the screenshot below,

image

After this log output, the container spends exactly 10 minutes (600000 ms) trying to reach a peapod, as shown in the screenshot in the previous comment.

@georgeamccarthy
Owner

georgeamccarthy commented Jul 23, 2021

@georgeamccarthy how do I prevent jina from downloading it? It does that if I remove the file. I'll try removing it again and see if it can work as expected without the .csv file?

The intended behaviour is for the file to be downloaded on first run. So if that's what it's doing that's ok, it's an unavoidably large file because it's needed for computing the embeddings. Have I understood your question?

These are the logs the backend generates after I've installed aiohttp. I need help with this since I'm not sure what a peapod is, or what resource it is trying to reach that it throws a TimeoutError for, referring to the message below:

protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f3328dada90>
can not be started
due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'),
Flow is aborted

Not really sure what a Peapod is but this means our flow is unable to start after 600000.0ms of trying. I'm having a similar issue #31 which I think might be related. Not sure what to suggest for now, I'm looking into it. It's possible that it's unrelated to Docker.

Other than that, the frontend still cannot establish a connection.

The backend will need to start successfully before the frontend will connect. Hopefully once the previous issue is sorted it will work.

Also, a folder is created under backend/ called embeddings that has a protein.json in it. Should I add embeddings in the .gitignore?
https://user-images.githubusercontent.com/41635766/126763589-504ae2ce-6120-492c-8e4c-8b55418cb069.png
Notice in the 5th download line, it is pulling something 1.68G in size (scary 😨 ).

😱 Yes please! Currently we're having the host computer compute the embeddings on first run.

@Rubix982
Contributor Author

The intended behaviour is for the file to be downloaded on first run. So if that's what it's doing that's ok, it's an unavoidably large file because it's needed for computing the embeddings. Have I understood your question?

Yep!

Not really sure what a Peapod is but this means our flow is unable to start after 600000.0ms of trying. I'm having a similar issue #31 which I think might be related. Not sure what to suggest for now, I'm looking into it. It's possible that it's unrelated to Docker.

Interesting.

The backend will need to start successfully before the frontend will connect. Hopefully once the previous issue is sorted it will work.

Noted.

Yes please! Currently we're having the host computer compute the embeddings on first run.

Noted.

Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
.gitignore (review thread resolved; outdated)
backend/Dockerfile (review thread resolved)
docker-compose.yml (review thread resolved)
@fissoreg
Collaborator

Some general remarks:

  • adding aiohttp as a dependency is probably not a good idea. To support the http protocol, Jina should be installed with pip install "jina[client,http]" according to the docs
  • as the base Docker image, we could use the official Jina image: https://hub.docker.com/r/jinaai/jina. This would avoid the problems with the PATH variables and the need to make a new user (see the sketch after this list).
  • the Dockerfiles for the backend and the frontend are pretty similar; they could be joined.
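
A rough sketch of what the base-image suggestion could look like; the tag, working directory, and entrypoint script name are assumptions, not part of this PR:

# Hypothetical backend Dockerfile built on the official Jina image
FROM jinaai/jina:latest

WORKDIR /app

# Only project-specific dependencies are installed here; jina itself
# already ships with the base image.
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# Override the default jina CLI entrypoint to run the backend
# (src/app.py is a placeholder name).
ENTRYPOINT ["python", "src/app.py"]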

Cool stuff @Rubix982 , welcome to the project! ;)

@Rubix982
Contributor Author

Rubix982 commented Jul 26, 2021

adding aiohttp as a dependency is probably not a good idea. To support the http protocol, Jina should be installed with pip install "jina[client,http]" according to the docs

The problem with this is that I have to figure out (and I'm not sure if this is possible) how to cache that. I did try doing pip install jina[client,http], but for me it was not caching, which means that on every container build it downloads those dependencies from scratch; that seemed like wasted bandwidth to me.

Python and Docker have this weird interaction where installing a dependency with the command RUN pip install pkg does not get cached, but adding a requirements.txt with the package name AND version specified does get cached (I'm not sure what magic this is).

I'll try it again in a while to see if this indeed takes care of the caching the way I want it to. Docker best practices do not like the idea of reinstalling dependencies over and over again; this is a big no-no in distributed computing.
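
For context, a minimal sketch of the caching pattern being described here; the file names are illustrative:

# Copy only the dependency manifest first, so this layer's cache key
# depends on requirements.txt alone, not on the rest of the source tree.
COPY requirements.txt .

# Reused from cache as long as requirements.txt has not changed. Extras
# such as jina[client,http] can be pinned in requirements.txt like any
# other entry.
RUN pip install -r requirements.txt

# Changes to the source code only invalidate the layers from this point on.
COPY . .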

as base Docker image, we could use the official Jina image: https://hub.docker.com/r/jinaai/jina. This would avoid the problems with the PATH variables and the need to make a new user.

I tried this ... and the Docker image only offered some jina-related commands, and I'm not familiar with the CLI that jina offers. I took a quick 10-15 minute look at it, but it did not seem to do what I had in mind for this PR.

By default, Docker creates the container and enters it as root. More documentation reading is required here.

Dockerfiles for backend and frontend are pretty similar, they could be joined.

For this, I need to know whether you guys want me to remove the data folder entirely.

As homework, I have to look into:

  1. How does the Jina Docker image work and what does it offer?
  2. Is it possible to provide the same build context and share it among two separate container builds?

@fissoreg I'm waiting for our 1-to-1 so you can introduce me to @jina-ai more so I can contribute as well. 👍

@georgeamccarthy georgeamccarthy linked an issue Jul 26, 2021 that may be closed by this pull request
@georgeamccarthy georgeamccarthy removed this from Next in protein_search 1.0 Jul 26, 2021
@georgeamccarthy
Owner

In addition to some reading, you might find the Jina Slack helpful; they really welcome discussion on anything from simple to advanced. I've found it super helpful! :) http://slack.jina.ai

@fissoreg
Collaborator

The problem with this is that I have to figure out (and I'm not sure if this is possible) how to cache that. I did try doing pip install jina[client,http], but for me it was not caching, which means that on every container build it downloads those dependencies from scratch; that seemed like wasted bandwidth to me.

I would have said that wasted bandwidth is better than managing dependencies manually...

I'll try it again in a while to see if this indeed takes care of the caching the way I want it to. Docker best practices do not like the idea of reinstalling dependencies over and over again; this is a big no-no in distributed computing.

...but this is a convincing remark!

Anyways, the following is strange:

Python and Docker have this weird interaction where installing a dependency with the command RUN pip install pkg does not get cached, but adding a requirements.txt with the package name AND version specified does get cached (I'm not sure what magic this is).

So I think that the best thing would be to try to understand what is happening there and fix it. Let me add, as @gmelodie would say: "Research time is not wasted time".

@fissoreg I'm waiting for our 1-to-1 so you can introduce me to @jina-ai more so I can contribute as well. +1

I hope this will be helpful! :)

Collaborator

@fissoreg fissoreg left a comment

Related to #18

RUN useradd --create-home jina

# Add the models folder locally to container
COPY ./models /app/models
Collaborator

This should be loaded from a config file.

def initialize_executor():

    # If the model is not already cached ...
    if not os.path.isdir("./models/prot_bert"):
Collaborator

Here the path should be loaded from a config file (the same as for the Dockerfile). For now we have backend_config.py.
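
A minimal sketch of that suggestion, assuming backend_config.py exposes a MODEL_PATH constant (the constant name is hypothetical):

# backend_config.py
MODEL_PATH = "./models/prot_bert"

# executors.py
import os

from backend_config import MODEL_PATH

def initialize_executor():
    # The path now comes from backend_config.py instead of a hard-coded literal.
    if not os.path.isdir(MODEL_PATH):
        ...  # download / build the model as before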

Contributor Author

I'll look into the YAML configuration for Jina tonight to see how I can store variables there that cover the backend project as a whole.

Thanks for the referral!

@Rubix982
Contributor Author

Rubix982 commented Aug 10, 2021

@georgeamccarthy, @fissoreg I'm unable to merge by myself. Need help with that.

@fissoreg Unfortunately, there is no easy way of using YAML variables inside Dockerfiles. The only way is to use .env files at the moment. As for executors.py, we can use the python-dotenv package to import the variables from the .env into a python script.
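
A minimal sketch of the .env approach, assuming a MODEL_PATH entry (the variable name is illustrative):

# Contents of .env (readable by both docker-compose and the backend):
#   MODEL_PATH=./models/prot_bert

# executors.py
import os

from dotenv import load_dotenv  # provided by the python-dotenv package

# Read key=value pairs from .env into the process environment.
load_dotenv()

MODEL_PATH = os.getenv("MODEL_PATH", "./models/prot_bert")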

@fissoreg
Collaborator

@georgeamccarthy, @fissoreg I'm unable to merge by myself. Need help with that.

The merge is here: Rubix982#1
There are some problems, but you can merge that PR so we have the merge commit here and we can keep the discussion flowing. Have a look before merging, to see if all the Dockerfile stuff looks good to you.

@fissoreg Unfortunately, there is no easy way of using YAML variables inside Dockerfiles. The only way is to use .env files at the moment. As for executors.py, we can use the python-dotenv package to import the variables from the .env into a python script.

Ok thanks @Rubix982! Maybe .env files are a good option. If you have other ideas, let's discuss. The important point is that it would be better to parametrize the various paths only once and in one place.

@Rubix982
Contributor Author

This is weird. I did not get a notification on my own repository. Thanks.

I'll try to implement the .env method on my repository and get back to you.

@fissoreg
Collaborator

I'll try to implement the .env method on my repository and get back to you.

This is great, thanks! But let's make that into a different PR, maybe? So we can merge this one ASAP.

2. Full PDB dataset.
3. Minors.
2. Fixed `Dockerfile` and `Makefile` for backend.
3. Fixed dependencies.
@Rubix982
Contributor Author

Rubix982 commented Aug 14, 2021

After making the changes suggested by Cristian on Slack, I am finally able to get results from the /search endpoint.

image

The changes have been made and pushed to my fork.

After the fixes, the Streamlit application throws these errors,

image

These errors are thrown from line 95 of frontend/app.py,

# Execute the query on the transport
result = client.execute(query, variable_values={"ids": ids})

I believe these bug fixes are independent of this PR's objective. This should be merged and closed, and the issue solved in another PR. What do you think, @fissoreg?

@georgeamccarthy
Owner

georgeamccarthy commented Aug 14, 2021

Agreed, let's merge and move forward.

@georgeamccarthy georgeamccarthy changed the title from [WIP - Structure] Dockerizing the project to [Structure] Dockerizing the project Aug 14, 2021
@fissoreg
Collaborator

I believe these bug fixes are independent of this PR's objective. This should be merged and closed, and the issue solved in another PR. What do you think, @fissoreg?

I didn't get this error but yes, let's merge and move on. We will also need to fix the automated tests.

Great job @Rubix982 !

@fissoreg fissoreg merged commit fdb596c into georgeamccarthy:main Aug 14, 2021
protein_search 1.0 automation moved this from Next to Done Aug 14, 2021
@Rubix982 Rubix982 deleted the Dockerizing branch August 14, 2021 09:34
Labels
deployment, documentation (Improvements or additions to documentation), feature, internal
Development

Successfully merging this pull request may close these issues.

Dockerize the application
3 participants