Refactoring and dockerizing - Merge.
[Structure] Dockerizing the project
fissoreg committed Aug 14, 2021
2 parents 3d193d6 + c292330 commit fdb596c
Showing 20 changed files with 579 additions and 202 deletions.
9 changes: 6 additions & 3 deletions .gitignore
@@ -131,6 +131,9 @@ dmypy.json
# Mac OS
.DS_Store

-data/*
-embeddings/*
-results/*
+# Custom - backend
+backend/embeddings/
+backend/data/
+backend/models/
+backend/tokenizers/
+backend/results
42 changes: 40 additions & 2 deletions Makefile
@@ -14,14 +14,52 @@ testdeps:
pip install black coverage flake8 pytest

format:
-	black backend tests
+	black backend frontend tests

lint: ## Lint
-	flake8 backend teasts
+	flake8 backend frontend tests

test: ## Run tests
pytest -ra

build:
make format
make coverage

# For building the docker compose
docker:

@ # Creating directory to store the models into
@ mkdir -p backend/models

@ # Creating directory to store the tokenizers into
@ mkdir -p backend/tokenizers

@ # Creating directory to store the data into
@ mkdir -p backend/data

@ # Creating directory to store the embeddings into
@ mkdir -p backend/embeddings

	@ # Allow both the Docker container and the host machine
	@ # to access the folder contents.
	@ # By default, only root on the container and the host
	@ # can access these folders.
@ sudo chmod -R 777 backend/models
@ sudo chmod -R 777 backend/tokenizers
@ sudo chmod -R 777 backend/data
@ sudo chmod -R 777 backend/embeddings

docker-compose -f docker-compose.yml up --build

# For starting the docker compose
up:
docker-compose -f docker-compose.yml up

# For removing containers
remove:
docker-compose down --remove-orphans

# List all containers
ps:
docker-compose ps -a
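
The `docker` and `up` targets assume a `docker-compose.yml` at the repository root, which is not part of this excerpt. As a hedged sketch only, such a file might wire the two services together roughly like this (service names, build contexts, and the frontend port are assumptions; 8020 matches the backend port used elsewhere in this commit, and 8501 is Streamlit's default):

```yaml
# Hypothetical sketch -- the actual docker-compose.yml is not shown in this excerpt.
version: "3.8"
services:
  backend:
    build: ./backend
    ports:
      - "8020:8020"  # Jina HTTP gateway, matching backend/src/app.py
    volumes:
      # Shared with the host; this is what the chmod calls above prepare for.
      - ./backend/models:/app/models
      - ./backend/tokenizers:/app/tokenizers
      - ./backend/data:/app/data
      - ./backend/embeddings:/app/embeddings
  frontend:
    build: ./frontend
    ports:
      - "8501:8501"  # Streamlit's default port
    depends_on:
      - backend
```
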
71 changes: 62 additions & 9 deletions README.md
@@ -10,25 +10,78 @@ Neural search through protein sequences using the ProtBert model and the Jina AI

## Setting up the environment

-Making a new `venv` virtual environment
First, clone the repository with `git`,

```bash
git clone https://github.com/georgeamccarthy/protein_search/ # Cloning
cd protein_search # Changing directory
```
-```
-$ cd *path_to*/protein_search
-$ python -m venv env
-$ source venv/bin/activate
-```

-Installing dependencies
### :heavy_check_mark: I have Docker

If you're familiar with `Docker`, you can simply run `make docker` (assuming you're running Linux).

The above command will:

1. Create the container for the `frontend`, install its dependencies, and start the `Streamlit` application
2. Create the container for the `backend`, install its dependencies, and start the `Jina` application
3. Provide you with links in the logs to access the two containers

Visually, you should see something like this:

![Successful Docker Setup](assets/img/demo.png)

From there on, you should be able to visit the Streamlit frontend and enter your protein-related query.

Some notes before you use this route:

1. `Docker` takes a few moments to build wheels for the dependencies, so the `pip` step in each of the containers may last as long as 1-2 minutes.
2. The `torch` dependency in `backend/requirements.txt` is 831.1 MB at the time of writing. Unless you see red-colored error logs, everything is fine; `torch` just takes a while to install.
3. This project uses the `Rostlab/prot_bert` pre-trained model from `HuggingFace`, which is 1.68 GB in size.

The great news is that you will need to install these dependencies and build the images only once. Docker will cache all of the layers and steps, and caching for the pre-trained model has been integrated.

Some more functionalities provided are:

- To stop the logs from `docker`, press `Ctrl+C`
- For resuming, run `make up`
- To remove the containers from the background, run `make remove`
- To build the containers again, run `make docker`

When introducing new changes, neither container needs to be restarted for them to take effect.

### :x: I don't use Docker

For each of the folders `frontend` and `backend`, run the following commands:

- Making a new `venv` virtual environment

```bash
cd folder_to_go_into/ # `folder_to_go_into` is either `frontend` or `backend`
python3 -m venv env
source env/bin/activate
```
-```
-$ pip install -r requirements.txt
-```

- Installing dependencies

```bash
pip install -r requirements.txt
```

-or using `make`
If in `backend`, run `python3 src/app.py`.

Open a new terminal, head back into the `frontend` folder, repeat `venv` creation and dependency
installation, and run `streamlit run app.py`.
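
Once both are running, the backend can also be queried directly over HTTP. A minimal sketch, assuming the default Jina HTTP endpoints and the port 8020 configured in `backend/src/app.py` (the example sequence is arbitrary, and the exact response shape may differ across Jina versions):

```python
import requests

# Hypothetical query against the running Jina gateway.
resp = requests.post(
    "http://localhost:8020/search",
    json={"data": [{"text": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}]},
)
resp.raise_for_status()
print(resp.json())  # matches for the query protein are inside the returned docs
```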

<!-- or using `make`
```
$ make deps
```
`make deps` should be updated to the new structure
-->

## Formatting, linting and testing

Binary file added assets/img/demo.png
44 changes: 44 additions & 0 deletions backend/Dockerfile
@@ -0,0 +1,44 @@
# Get image for Python
FROM python:3.8

# Set working directory
WORKDIR /app/

# pip does not like being run as root, so create a user
RUN useradd --create-home jina

# Add needed folders locally to container
COPY ./models /app/models
COPY ./tokenizers /app/tokenizers
COPY ./data /app/data
COPY ./embeddings /app/embeddings

# Give jina user permission to the folders
RUN chown jina models
RUN chown jina tokenizers
RUN chown jina data
RUN chown jina embeddings

# Switch to user
USER jina

# Path change needed for huggingface-cli and jina
ENV PYTHONPATH "${PYTHONPATH}:/home/jina/.local/bin"
ENV PATH "${PATH}:/home/jina/.local/bin"

# Copy the requirements over to the container
COPY ./requirements.txt /app/requirements.txt

# Install dependencies in the requirements
RUN pip3 install -r requirements.txt

# Add the src folder locally to container
ADD ./src /app/src

RUN python src/init.py

# Expose port
EXPOSE 8020

# Run the application
CMD ["python", "src/app.py" ]
123 changes: 0 additions & 123 deletions backend/my_executors.py

This file was deleted.

21 changes: 21 additions & 0 deletions backend/requirements.txt
@@ -0,0 +1,21 @@
torch==1.9.0
transformers==4.9.0
asgiref==3.4.1
click==8.0.1
h11==0.12.0
uvicorn==0.14.0
fastapi==0.67.0
pydantic==1.8.2
starlette==0.14.2
typing-extensions==3.10.0.0
uvloop==0.15.3
aiohttp==3.7.4.post0
async-timeout==3.0.1
attrs==21.2.0
chardet==4.0.0
idna==3.2
multidict==5.1.0
yarl==1.6.3
pandas
jina
6 changes: 3 additions & 3 deletions backend/app.py → backend/src/app.py
@@ -1,7 +1,7 @@
from jina.types.document.generators import from_csv
from jina import DocumentArray, Flow

-from my_executors import ProtBertExecutor, MyIndexer
+from executors import ProtBertExecutor, MyIndexer
from backend_config import pdb_data_path, embeddings_path, pdb_data_url
from helpers import cull_duplicates, download_csv, log

@@ -20,7 +20,7 @@ def index():
docs_generator = from_csv(
fp=data_file, field_resolver={"sequence": "text", "structureId": "id"}
)
-proteins = DocumentArray(docs_generator)
+proteins = DocumentArray(docs_generator).shuffle()
log(f"Loaded {len(proteins)} proteins from {pdb_data_path}.")

log("Building index.")
@@ -38,7 +38,7 @@ def main():

log("Creating flow.")
flow = (
-Flow(port_expose=12345, protocol="http")
+Flow(port_expose=8020, protocol="http")
.add(uses=ProtBertExecutor)
.add(uses=MyIndexer)
)
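
For context, a Jina `Flow` like the one above is normally opened as a context manager, with documents posted to it once it is up. A hedged sketch of how the pieces typically fit together in the Jina 2.x API of the time (not necessarily the literal body of `main()`):

```python
# Hypothetical usage sketch, assuming the Jina 2.x API current at commit time.
with flow:
    # Send the proteins through ProtBertExecutor (embedding), then MyIndexer (storage).
    flow.post(on="/index", inputs=proteins)
    # Keep the HTTP gateway on port 8020 alive to serve search requests.
    flow.block()
```
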
backend/src/backend_config.py
@@ -2,7 +2,7 @@
backend_model = "Rostlab/prot_bert"

# dataset link
pdb_data_url = "http://www.lri.fr/owncloud/index.php/s/fxIqHWvg1Zsq0JW/download"
pdb_data_url = "https://www.lri.fr/owncloud/index.php/s/eq7aCSJP3Ci0Vyq/download"

# Number of search results to show.
top_k = 10
