
[Structure] Dockerizing the project #30

Merged
fissoreg merged 35 commits into georgeamccarthy:main from the Dockerizing branch on Aug 14, 2021

Conversation

Rubix982
Contributor

@Rubix982 Rubix982 commented Jul 23, 2021

Pull Request Type

  • 🏆 Enhancements - this PR aims to dockerize the application

Purpose

  • Introduces containers, making it easier to build and add features without worrying about configuration

Why?

  • As per changes required for the demo

Changes Introduced

  • Removes requirements.txt from the root and splits the dependencies between the backend and the frontend by creating an individual requirements.txt for each
  • Moves data/ from the root into backend/
  • Moves all *.py files in backend to backend/src/
  • Introduces image tag names for the images to be pushed to Docker Hub, one repository for the frontend and one for the backend (resulting layout sketched below)
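
For reference, a sketch of the resulting layout described in the bullets above; only items mentioned in this PR are shown, and the exact locations of the Makefile and the frontend files are assumptions:

backend/
    Dockerfile
    requirements.txt
    data/
    src/            (all backend *.py files)
frontend/
    Dockerfile
    requirements.txt
docker-compose.yml
Makefile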

Bugs (WIP)

  • The frontend cannot connect to the backend (Errno 111 - Connection Refused)
  • Jina requires aiohttp, which needs to be added to requirements.txt
  • Testing to make sure the frontend and the backend work without any problems

Notes

  • The backend container is gigantic, nearly 1 GB, due to the torch dependency (831 MB). I was able to cache the builds, which means you will only need to install the requirements once for both containers; subsequent builds reuse them as long as the dependencies have not changed
  • To start the containers, you need docker and docker-compose on the machine. The containers can be built and started by running make docker in the root. They can be temporarily stopped with Ctrl^C, started again with make up, and removed with make remove
  • A user called jina is created in the Dockerfile because pip does not like installing as root (see the sketch after this list)
  • Jina by itself downloads a file called pdb_data_seq.csv which is 10K lines long (hence so much green in this PR); I'm not sure why it does that
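
A minimal sketch of the non-root pattern mentioned above; the base image, working directory, and paths are assumptions for illustration, not the exact contents of this PR's Dockerfile:

# Hypothetical backend Dockerfile excerpt
FROM python:3.8-slim

# pip complains when installing packages as root, so create and switch to a non-root user.
RUN useradd --create-home jina
USER jina
WORKDIR /home/jina/app

# User-level installs land in ~/.local/bin, so put that on PATH.
ENV PATH="/home/jina/.local/bin:${PATH}"

# Installing from a copied requirements.txt keeps this layer cacheable.
COPY --chown=jina:jina requirements.txt .
RUN pip install --user -r requirements.txt

COPY --chown=jina:jina . .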

Feedback required over

  • A quick pair of 👀 on the code
  • Discussion on the technical approach

Mentions

Rubix982 and others added 12 commits July 23, 2021 13:09
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
…estration

Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
We'll download the csv file during run time to help reduce the repo size.
Owner

@georgeamccarthy georgeamccarthy left a comment

Looks great, I've removed the large csv (it's a dataset containing 10,000 protein sequences) because we'll be downloading that during runtime. I'll run this locally and see if it works for me.

@fissoreg have a play with this as well?

@Rubix982
Contributor Author

Rubix982 commented Jul 23, 2021

@georgeamccarthy how do I prevent jina from downloading it? It does that if I remove the file. I'll try removing it again and see if it can work as expected without the .csv file?

@Rubix982
Contributor Author

Rubix982 commented Jul 23, 2021

These are the logs the backend generates after I've installed aiohttp. I need help with this since I'm not sure what a peapod is, or what resource it is trying to reach that it throws a TimeoutError for, referring to the message below:

protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f3328dada90>
can not be started
due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'),
Flow is aborted

Other than that, the frontend still cannot establish a connection.

Also, a folder is created under backend/ called embeddings that has a protein.json in it. Should I add embeddings in the .gitignore?

image

Notice in the 5th download line, it is pulling something 1.68G in size (scary 😨 ).

@Rubix982
Contributor Author

Rubix982 commented Jul 23, 2021

Checked. I deleted the containers entirely, removed embeddings and data/pdb_data_seq.csv, then built the containers from scratch. The backend still pulls them, as seen in the screenshot below. We can either figure out why this happens, or simply add this to .gitignore as well. Also, jina always downloads 4 things but doesn't mention what they are, as seen in the screenshot below,

image

After this log output, the container spends exactly 10 minutes (600000 ms) trying to reach a peapod, as shown in the screenshot in the previous comment.

@georgeamccarthy
Owner

georgeamccarthy commented Jul 23, 2021

@georgeamccarthy how do I prevent jina from downloading it? It does that if I remove the file. I'll try removing it again and see if it can work as expected without the .csv file?

The intended behaviour is for the file to be downloaded on first run. So if that's what it's doing that's ok, it's an unavoidably large file because it's needed for computing the embeddings. Have I understood your question?

These are the logs the backend generates after I've installed aiohttp. I need help with this since I'm not sure what a peapod is, or what resource it is trying to reach that it throws a TimeoutError for, referring to the message below:

protein-search-backend | Flow@ 1[E]:pod0:<jina.peapods.pods.Pod object at 0x7f3328dada90>
can not be started
due to TimeoutError('jina.peapods.peas.BasePea:pod0 can not be initialized after 600000.0ms'),
Flow is aborted

Not really sure what a Peapod is but this means our flow is unable to start after 600000.0ms of trying. I'm having a similar issue #31 which I think might be related. Not sure what to suggest for now, I'm looking into it. It's possible that it's unrelated to Docker.

Other than that, the frontend still cannot establish a connection.

The backend will need to start successfully before the frontend will connect. Hopefully once the previous issue is sorted it will work.

Also, a folder is created under backend/ called embeddings that has a protein.json in it. Should I add embeddings in the .gitignore?
https://user-images.githubusercontent.com/41635766/126763589-504ae2ce-6120-492c-8e4c-8b55418cb069.png
Notice in the 5th download line, it is pulling something 1.68G in size (scary 😨 ).

😱 Yes please! Currently we're having the host computer compute the embeddings on first run.

@Rubix982
Contributor Author

The intended behaviour is for the file to be downloaded on first run. So if that's what it's doing that's ok, it's an unavoidably large file because it's needed for computing the embeddings. Have I understood your question?

Yep!

Not really sure what a Peapod is but this means our flow is unable to start after 600000.0ms of trying. I'm having a similar issue #31 which I think might be related. Not sure what to suggest for now, I'm looking into it. It's possible that it's unrelated to Docker.

Interesting.

The backend will need to start successfully before the frontend will connect. Hopefully once the previous issue is sorted it will work.

Noted.

Yes please! Currently we're having the host computer compute the embeddings on first run.

Noted.

Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
Signed-off-by: Saif Ul Islam <saifulislam84210@gmail.com>
.gitignore (review thread resolved; outdated)
backend/Dockerfile (review thread resolved)
docker-compose.yml (review thread resolved)
@fissoreg
Collaborator

Some general remarks:

  • adding aiohttp as a dependency is probably not a good idea. To support the http protocol, Jina should be installed with pip install "jina[client,http]" according to the docs
  • as the base Docker image, we could use the official Jina image: https://hub.docker.com/r/jinaai/jina. This would avoid the problems with the PATH variables and the need to make a new user (see the sketch after this list).
  • the Dockerfiles for the backend and the frontend are pretty similar; they could be joined.
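
A rough sketch of what the base-image suggestion could look like; the tag, working directory, and entrypoint script name are assumptions, not part of this PR:

# Hypothetical backend Dockerfile built on the official Jina image
FROM jinaai/jina:latest

WORKDIR /app

# Only project-specific dependencies are installed here; jina itself
# already ships with the base image.
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# Override the default jina CLI entrypoint to run the backend
# (src/app.py is a placeholder name).
ENTRYPOINT ["python", "src/app.py"]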

Cool stuff @Rubix982 , welcome to the project! ;)

@Rubix982
Contributor Author

Rubix982 commented Jul 26, 2021

adding aiohttp as a dependency is probably not a good idea. To support the http protocol, Jina should be installed with pip install "jina[client,http]" according to the docs

The problem with this is that I have to figure out (and I'm not sure if this is possible) how to cache that. I did try doing pip install jina[client,http], but for me it was not caching, which means that on every container build it downloads those dependencies from scratch; that seemed like wasted bandwidth to me.

Python and Docker have this weird interaction where installing a dependency with the command RUN pip install pkg does not get cached, but adding a requirements.txt with the package name AND version specified does get cached (I'm not sure what magic this is).

I'll try it again in a while to see if this indeed takes care of the caching the way I want it to. Docker best practices do not like the idea of reinstalling dependencies over and over again; this is a big no-no in distributed computing.
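
For context, a minimal sketch of the caching pattern being described here; the file names are illustrative:

# Copy only the dependency manifest first, so this layer's cache key
# depends on requirements.txt alone, not on the rest of the source tree.
COPY requirements.txt .

# Reused from cache as long as requirements.txt has not changed. Extras
# such as jina[client,http] can be pinned in requirements.txt like any
# other entry.
RUN pip install -r requirements.txt

# Changes to the source code only invalidate the layers from this point on.
COPY . .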

as base Docker image, we could use the official Jina image: https://hub.docker.com/r/jinaai/jina. This would avoid the problems with the PATH variables and the need to make a new user.

I tried this ... and the Docker image only offered some jina-related commands, and I'm not familiar with the CLI that jina offers. I took a quick 10-15 minute look at it, but it did not seem to do what I had in mind for this PR.

By default, Docker creates the container and enters it as root. More documentation reading is required here.

Dockerfiles for backend and frontend are pretty similar, they could be joined.

For this, I need to know whether you guys want me to remove the data folder entirely.

As homework, I have to look into:

  1. How does the Jina Docker image work and what does it offer?
  2. Is it possible to provide the same build context and share it among two separate container builds?

@fissoreg I'm waiting for our 1-to-1 so you can introduce me to @jina-ai more so I can contribute as well. 👍

@georgeamccarthy georgeamccarthy linked an issue Jul 26, 2021 that may be closed by this pull request
@georgeamccarthy georgeamccarthy removed this from Next in protein_search 1.0 Jul 26, 2021
@georgeamccarthy
Owner

In addition to some reading, you might find the Jina Slack helpful; they really welcome discussion on anything from simple to advanced. I've found it super helpful! :) http://slack.jina.ai

@fissoreg
Collaborator

The problem with this is that I have to figure out (and I'm not sure if this is possible) how to cache that. I did try doing pip install jina[client,http], but for me it was not caching, which means that on every container build it downloads those dependencies from scratch; that seemed like wasted bandwidth to me.

I would have said that wasted bandwidth is better than managing dependencies manually...

I'll try it again in a while to see if this indeed takes care of the caching the way I want it to. Docker best practices do not like the idea of reinstalling dependencies over and over again; this is a big no-no in distributed computing.

...but this is a convincing remark!

Anyways, the following is strange:

Python and Docker have this weird interaction where installing a dependency with the command RUN pip install pkg does not get cached, but adding a requirements.txt with the package name AND version specified does get cached (I'm not sure what magic this is).

So I think that the best thing would be to try to understand what is happening there and fix it. Let me add, as @gmelodie would say: "Research time is not wasted time".

@fissoreg I'm waiting for our 1-to-1 so you can introduce me to @jina-ai more so I can contribute as well. +1

I hope this will be helpful! :)

Collaborator

@fissoreg fissoreg left a comment

Related to #18

RUN useradd --create-home jina

# Add the models folder locally to container
COPY ./models /app/models
Collaborator

This should be loaded from a config file.

def initialize_executor():

    # If the model is not already cached ...
    if not os.path.isdir("./models/prot_bert"):
Collaborator

Here the path should be loaded from a config file (the same as for the Dockerfile). For now we have backend_config.py.
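
A minimal sketch of that suggestion, assuming backend_config.py exposes a MODEL_PATH constant (the constant name is hypothetical):

# backend_config.py
MODEL_PATH = "./models/prot_bert"

# executors.py
import os

from backend_config import MODEL_PATH

def initialize_executor():
    # The path now comes from backend_config.py instead of a hard-coded literal.
    if not os.path.isdir(MODEL_PATH):
        ...  # download / build the model as before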

Contributor Author

I'll look into the YAML configuration for Jina tonight to see how I can store variables there that cover the backend project as a whole.

Thanks for the referral!

@Rubix982
Contributor Author

Rubix982 commented Aug 10, 2021

@georgeamccarthy, @fissoreg I'm unable to merge by myself. Need help with that.

@fissoreg Unfortunately, there is no easy way of using YAML variables inside Dockerfiles. The only way is to use .env files at the moment. As for executors.py, we can use the python-dotenv package to import the variables from the .env into a python script.
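
A minimal sketch of the .env approach, assuming a MODEL_PATH entry (the variable name is illustrative):

# Contents of .env (readable by both docker-compose and the backend):
#   MODEL_PATH=./models/prot_bert

# executors.py
import os

from dotenv import load_dotenv  # provided by the python-dotenv package

# Read key=value pairs from .env into the process environment.
load_dotenv()

MODEL_PATH = os.getenv("MODEL_PATH", "./models/prot_bert")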

@fissoreg
Collaborator

@georgeamccarthy, @fissoreg I'm unable to merge by myself. Need help with that.

The merge is here: Rubix982#1
There are some problems, but you can merge that PR so we have the merge commit here and we can keep the discussion flowing. Have a look before merging, to see if all the Dockerfile stuff looks good to you.

@fissoreg Unfortunately, there is no easy way of using YAML variables inside Dockerfiles. The only way is to use .env files at the moment. As for executors.py, we can use the python-dotenv package to import the variables from the .env into a python script.

Ok thanks @Rubix982! Maybe .env files are a good option. If you have other ideas, let's discuss. The important point is that it would be better to parametrize the various paths only once and in one place.

@Rubix982
Contributor Author

This is weird. I did not get a notification on my own repository. Thanks.

I'll try to implement the .env method on my repository and get back to you.

@fissoreg
Collaborator

I'll try to implement the .env method on my repository and get back to you.

This is great, thanks! But let's make that into a different PR, maybe? So we can merge this one ASAP.

2. Full PDB dataset.
3. Minors.
2. Fixed `Dockerfile` and `Makefile` for backend.
3. Fixed dependencies.
@Rubix982
Contributor Author

Rubix982 commented Aug 14, 2021

After making the changes suggested by Cristian on Slack, I am finally able to get results from the /search endpoint.

image

The changes have been made and pushed to my fork.

After the fixes, the Streamlit application throws these errors,

image

These errors are thrown from line 95 of frontend/app.py,

# Execute the query on the transport
result = client.execute(query, variable_values={"ids": ids})

I believe these bug fixes are independent of this PR's objective. This should be merged and closed, and the issue solved in another PR. What do you think, @fissoreg?

@georgeamccarthy
Owner

georgeamccarthy commented Aug 14, 2021

Agreed, let's merge and move forward.

@georgeamccarthy georgeamccarthy changed the title from [WIP - Structure] Dockerizing the project to [Structure] Dockerizing the project Aug 14, 2021
@fissoreg
Collaborator

I believe these bug fixes are independent of this PR's objective. This should be merged and closed, and the issue solved in another PR. What do you think, @fissoreg?

I didn't get this error but yes, let's merge and move on. We will also need to fix the automated tests.

Great job @Rubix982 !

@fissoreg fissoreg merged commit fdb596c into georgeamccarthy:main Aug 14, 2021
protein_search 1.0 automation moved this from Next to Done Aug 14, 2021
@Rubix982 Rubix982 deleted the Dockerizing branch August 14, 2021 09:34
Labels
deployment, documentation (Improvements or additions to documentation), feature, internal
Development

Successfully merging this pull request may close these issues.

Dockerize the application
3 participants