Feat/infilling #163

Merged on May 26, 2022
Commits (36)
72842e3
Fix requirements.txt (#84)
AndriiHura Jan 24, 2022
a67e15c
fix itsdangerous requirements
mtalimanchuk Feb 18, 2022
0f8ef0e
pin itsdangerous requirements for all flask==1.1.1 servers
mtalimanchuk Feb 18, 2022
6f0684a
Merge pull request #102 from deepmipt/fix/combined-classification-fla…
mtalimanchuk Feb 18, 2022
d237711
Merge pull request #103 from deepmipt/dev
dilyararimovna Feb 18, 2022
e990264
Merge pull request #107 from deepmipt/dev
dilyararimovna Mar 2, 2022
3208f71
Merge pull request #119 from deepmipt/dev
dilyararimovna Mar 11, 2022
ab44553
Merge pull request #123 from deepmipt/dev
dilyararimovna Mar 18, 2022
1c9a463
Merge pull request #137 from deepmipt/dev
dilyararimovna Apr 8, 2022
548a0ac
infilling added (not tested)
olkaso Apr 19, 2022
b2d22ca
files moved, some paths fixed
olkaso Apr 25, 2022
c6d1158
[DGM-49] path to model fixed, test added, seems working
olkaso Apr 27, 2022
f8e4a59
Merge pull request #145 from deepmipt/dev
dilyararimovna Apr 30, 2022
48872a6
Merge pull request #150 from deepmipt/dev
dilyararimovna May 4, 2022
ed42f0c
Merge pull request #153 from deepmipt/dev
dilyararimovna May 5, 2022
30f290c
Merge pull request #155 from deepmipt/dev
dilyararimovna May 6, 2022
de510bc
Merge pull request #158 from deepmipt/dev
dilyararimovna May 11, 2022
527d119
takes a batch, bigger test added
olkaso May 12, 2022
32869ac
assert added to test
olkaso May 12, 2022
820fba7
assert added to test
olkaso May 12, 2022
85f1c78
minor changes
olkaso May 24, 2022
3134088
Merge branch 'main' into feat/infilling
dilyararimovna May 25, 2022
907c39f
fix: codestyle
dilyararimovna May 25, 2022
b6cac93
fix: proxy pass
dilyararimovna May 25, 2022
bd89b3a
Merge remote-tracking branch 'upstream/dev'
dilyararimovna May 25, 2022
65bb530
Merge branch 'main' into feat/infilling
dilyararimovna May 25, 2022
5f8b422
fix: yml configs
dilyararimovna May 25, 2022
c6b1a4d
fix: refactor infilling and usage
dilyararimovna May 25, 2022
9a22412
fix: paths
dilyararimovna May 25, 2022
e7ccbc4
fix: dockerfile
dilyararimovna May 25, 2022
1050312
fix: upd files
dilyararimovna May 25, 2022
5133dd7
fix: working version
dilyararimovna May 25, 2022
b736ebf
fix: codestyle
dilyararimovna May 25, 2022
115f1d0
fix: codestyle
dilyararimovna May 25, 2022
2422cb3
fix: works on gpu
dilyararimovna May 26, 2022
f1bdc2e
fix: readme
dilyararimovna May 26, 2022
2 changes: 2 additions & 0 deletions .env
@@ -29,3 +29,5 @@ WIKIDATA_DIALOGUE_SERVICE_URL=http://wikidata-dial-service:8092/model
NEWS_API_ANNOTATOR_URL=http://news-api-annotator:8112/respond
WIKI_FACTS_URL=http://wiki-facts:8116/respond
FACT_RANDOM_SERVICE_URL=http://fact-random:8119/respond
INFILLING_SERVICE_URL=http://infilling:8122/respond

6 changes: 6 additions & 0 deletions README.md
@@ -187,6 +187,12 @@ Dream Architecture is presented in the following image:
| User Persona Extractor | 40 MiB RAM | determines which age category the user belongs to based on some key words |
| Wiki parser | 100 MiB RAM | extracts Wikidata triplets for the entities detected with Entity Linking |

## Services
| Name | Requirements | Description |
|---------------------------|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| DialoGPT | 1.3 GiB RAM, 1 GiB GPU | generative service based on a Transformers generative model; the model is set via the docker-compose argument `PRETRAINED_MODEL_NAME_OR_PATH` (for example, `microsoft/DialoGPT-small`, 0.2-0.5 sec per response on GPU) |
| Infilling | 1.7 GiB RAM, 1 GiB GPU | generative service based on the Infilling model; for a given utterance, returns the utterance with each `_` in the original text replaced by generated tokens |

## Skills
| Name | Requirements | Description |
|---------------------------|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
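For context, a minimal sketch of what the new Infilling service is expected to do. The payload key `texts` and the response key `infilled_text` are taken from `common/infilling.py` added below; the sample sentence and the generated words are purely illustrative:

```python
import requests

# Hypothetical call to a locally running infilling service (port 8122, per .env).
response = requests.post(
    "http://0.0.0.0:8122/respond",
    json={"texts": ["I went to the _ yesterday and bought some _."]},
    timeout=1,
).json()

# One infilled utterance per input text, with each `_` replaced by generated tokens,
# e.g. ["I went to the store yesterday and bought some flowers."]
print(response["infilled_text"])
```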
3 changes: 3 additions & 0 deletions assistant_dists/dream/cpu.yml
@@ -49,3 +49,6 @@ services:
dialogpt:
environment:
CUDA_VISIBLE_DEVICES: ""
infilling:
environment:
CUDA_VISIBLE_DEVICES: ""
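Setting `CUDA_VISIBLE_DEVICES` to an empty string in this CPU override hides all GPUs from the container, so the service falls back to CPU inference. A quick illustrative check from inside such a container (assumes PyTorch is installed, as it is in the service image):

```python
import torch

# With CUDA_VISIBLE_DEVICES="" no GPU is visible, so this prints False
# and the infilling model runs on CPU.
print(torch.cuda.is_available())
```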
5 changes: 5 additions & 0 deletions assistant_dists/dream/dev.yml
@@ -399,6 +399,11 @@ services:
- "./services/dialogpt:/src"
ports:
- 8125:8125
infilling:
volumes:
- "./services/infilling:/src"
ports:
- 8122:8122
dff-template-skill:
volumes:
- "./skills/dff_template_skill:/src"
21 changes: 19 additions & 2 deletions assistant_dists/dream/docker-compose.override.yml
@@ -19,7 +19,7 @@ services:
dff-funfact-skill:8104, dff-bot-persona-skill:8105, news-api-annotator:8112,
dff-gossip-skill:8109, dff-wiki-skill:8111, dff-gaming-skill:8115, topic-recommendation:8113,
user-persona-extractor:8114, wiki-facts:8116, dff-music-skill:8099, entity-detection:8103, dff-art-skill:8117,
midas-predictor:8121, dialogpt:8125, dff-template-skill:8120"
midas-predictor:8121, dialogpt:8125, infilling:8122, dff-template-skill:8120"
WAIT_HOSTS_TIMEOUT: ${WAIT_TIMEOUT:-480}
convers-evaluator-annotator:
env_file: [.env]
@@ -1144,7 +1144,7 @@ services:
memory: 50M
reservations:
memory: 50M

dialogpt:
env_file: [ .env ]
build:
@@ -1164,6 +1164,23 @@
reservations:
memory: 2G

infilling:
env_file: [ .env ]
build:
context: ./services/infilling/
args:
SERVICE_PORT: 8122
command: flask run -h 0.0.0.0 -p 8122
environment:
- CUDA_VISIBLE_DEVICES=0
- FLASK_APP=server
deploy:
resources:
limits:
memory: 2.5G # ?
reservations:
memory: 2.5G # ?

dff-template-skill:
env_file: [.env]
build:
4 changes: 4 additions & 0 deletions assistant_dists/dream/gpu1.yml
@@ -189,6 +189,10 @@ services:
restart: unless-stopped
environment:
- CUDA_VISIBLE_DEVICES=9
infilling:
restart: unless-stopped
environment:
- CUDA_VISIBLE_DEVICES=7
dff-template-skill:
restart: unless-stopped
version: '3.7'
11 changes: 10 additions & 1 deletion assistant_dists/dream/proxy.yml
@@ -593,7 +593,16 @@ services:
environment:
- PROXY_PASS=dream.deeppavlov.ai:8125
- PORT=8125


infilling:
command: [ "nginx", "-g", "daemon off;" ]
build:
context: dp/proxy/
dockerfile: Dockerfile
environment:
- PROXY_PASS=dream.deeppavlov.ai:8122
- PORT=8122

dff-template-skill:
command: [ "nginx", "-g", "daemon off;" ]
build:
3 changes: 3 additions & 0 deletions assistant_dists/dream/test.yml
@@ -130,5 +130,8 @@ services:
dialogpt:
environment:
- CUDA_VISIBLE_DEVICES=6
infilling:
environment:
- CUDA_VISIBLE_DEVICES=8
dff-template-skill:
version: '3.7'
10 changes: 10 additions & 0 deletions common/infilling.py
@@ -0,0 +1,10 @@
import os
import requests


INFILLING_SERVICE_URL = os.getenv("INFILLING_SERVICE_URL", "http://0.0.0.0:8122/respond")


def infill_texts(texts, timeout=1):
result = requests.post(INFILLING_SERVICE_URL, json={"texts": texts}, timeout=timeout).json()["infilled_text"]
return result
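A possible usage sketch for the new helper; this caller code is not part of the PR and assumes the `infilling` container is reachable at `INFILLING_SERVICE_URL`:

```python
from common.infilling import infill_texts

masked = [
    "My favourite food is _ because it is _.",
    "Yesterday I watched _ and it was really _.",
]

try:
    # One infilled utterance is returned per input text.
    infilled = infill_texts(masked, timeout=1)
except Exception:
    # On timeout or service failure, fall back to the original masked texts.
    infilled = masked

for src, out in zip(masked, infilled):
    print(f"{src} -> {out}")
```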
34 changes: 34 additions & 0 deletions services/infilling/Dockerfile
@@ -0,0 +1,34 @@
# syntax=docker/dockerfile:experimental

FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime

RUN apt-get update && apt-get install -y --allow-unauthenticated wget

WORKDIR /src

ARG MODEL_DIR=/data/
ENV MODEL_DIR ${MODEL_DIR}
ARG SERVICE_PORT
ENV SERVICE_PORT ${SERVICE_PORT}

COPY ./requirements.txt /src/requirements.txt
RUN pip install -r /src/requirements.txt

COPY . /src

RUN mkdir /data/
RUN ls
RUN python download_model.py model sto ilm | bash
WORKDIR /data/

RUN wget http://files.deeppavlov.ai/dream/infilling/additional_ids_to_tokens.pkl
RUN wget http://files.deeppavlov.ai/dream/infilling/vocab.bpe
RUN wget http://files.deeppavlov.ai/dream/infilling/encoder.json
RUN wget http://files.deeppavlov.ai/dream/infilling/config.json

WORKDIR /src

HEALTHCHECK --interval=5s --timeout=90s --retries=3 CMD curl --fail 127.0.0.1:${SERVICE_PORT}/healthcheck || exit 1


CMD gunicorn --workers=1 server:app -b 0.0.0.0:${SERVICE_PORT} --timeout=300
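Note that `download_model.py` (added later in this diff) does not download anything itself: it prints `mkdir`/`wget` shell commands to stdout, which is why the Dockerfile pipes its output into `bash`. A rough, illustrative Python equivalent of that build step:

```python
import subprocess

# Generate the shell commands for the sto_ilm checkpoint
# (pytorch_model.bin, config.json, additional_ids_to_tokens.pkl).
commands = subprocess.run(
    ["python", "download_model.py", "model", "sto", "ilm"],
    capture_output=True,
    text=True,
    check=True,
).stdout

# Execute them, mirroring the `| bash` in the Dockerfile.
subprocess.run(["bash", "-c", commands], check=True)
```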
5 changes: 5 additions & 0 deletions services/infilling/README.md
@@ -0,0 +1,5 @@
GPU RAM = 1 GiB
CPU time = 0.5-2 sec
GPU time = 0.1-0.5 sec

Inference time is highly variable.
3 changes: 3 additions & 0 deletions services/infilling/constants.py
@@ -0,0 +1,3 @@
GPT2_MODEL_NAMES = ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]
GPT2_TOKENIZER_LEN = 50257
GPT2_EOS_TOKEN_ID = 50256
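These constants match the stock GPT-2 tokenizer. A quick sanity check, assuming the `transformers` package is available in your environment:

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
assert len(tok) == 50257          # GPT2_TOKENIZER_LEN
assert tok.eos_token_id == 50256  # GPT2_EOS_TOKEN_ID
```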
124 changes: 124 additions & 0 deletions services/infilling/download_model.py
@@ -0,0 +1,124 @@
PREMASKED_DATA = {
"train": {
"sto_mixture": "https://drive.google.com/open?id=1LxlyPqz3OvAZsYRRC8yRdSoaCKGB0Ucg",
"abs_mixture": "https://drive.google.com/open?id=1rw45GKP4iRJLzXnRtX-rnk_NeGXOqWkU",
"lyr_mixture": "https://drive.google.com/open?id=1jGCjboxlFUF0jqvB0_-L0eeylhKWfZJV",
},
"valid": {
"sto_mixture": "https://drive.google.com/open?id=1Y4HRYrBnqwtdbziF5Q6b5WaIFxJd1v7m",
"abs_mixture": "https://drive.google.com/open?id=1hHdXX43qkkm-zpUCJz_iuv1vRRpyfbaP",
"lyr_mixture": "https://drive.google.com/open?id=1xR0LC5WHV1UQDPjWTN0HcOQ9C5jsXYef",
},
"test": {
# Table 1/6
"sto_sentence": "https://drive.google.com/open?id=1w02hGewoBk_Pq-thrtbOcU1JPGRdGL_U",
"abs_sentence": "https://drive.google.com/open?id=18aNMfcqC1wyC8wWJHbCfMxLY49Dbg-du",
"lyr_sentence": "https://drive.google.com/open?id=18Szj-HYwh3sjLmmfF8TNwAmB2oEddool",
# Table 3
"sto_document": "https://drive.google.com/open?id=1ydEjL0SMbX8p-1w6XeLWNrzeVn8TleGT",
"abs_document": "https://drive.google.com/open?id=1UjPh51URE8hvK-yTw3xkVwkBwEcCo6Uz",
"lyr_document": "https://drive.google.com/open?id=1KNvdzn1xhpw0Xdh0pMWN3CtrDKEK8d4N",
# Table 4
"sto_mixture": "https://drive.google.com/open?id=1Zsuj8Plrcs49f-5rV6dvJ5W2kIz_C30u",
"abs_mixture": "https://drive.google.com/open?id=1TA3ySrvcWxaNtoDPpN8Jk7uqjGKdPqda",
"lyr_mixture": "https://drive.google.com/open?id=1FGEL3CGzLvnWpgvUYWsHsOUgW65DVxgw",
# Table 5
"sto_paragraph": "https://drive.google.com/open?id=1MBM96hfN2cGJidG-mi_4bE0K07xgWAxT",
"abs_paragraph": "https://drive.google.com/open?id=1xXJfjCNzRLXYZgHgUrNimP4CtUW0Ziph",
"lyr_paragraph": "https://drive.google.com/open?id=10ScpFR8sG3Ur0WpWdkPYxAsT94jNNmZh",
# Table 7
"sto_ngram": "https://drive.google.com/open?id=1x8RBys_jbreSFO1zMdmwiT2ref2F8q_C",
"abs_ngram": "https://drive.google.com/open?id=1JJyh7clJjyPF-rm4rHFLyX7Y-l_doD0K",
"lyr_ngram": "https://drive.google.com/open?id=1dbCCc68TvY6segwTrrxYS1ukVbdC7zgJ",
# Table 8
"sto_word": "https://drive.google.com/open?id=178joxkympgzDwZoExnalWujRq2Jv_37P",
"abs_word": "https://drive.google.com/open?id=1PdVg-TnG5VQt8GCQOQA841AGw1GR44yl",
"lyr_word": "https://drive.google.com/open?id=1Td-yr6g5cTxW4yoz_Wv4gSi-wbu1376R",
},
}

PRETRAINED_MODELS = {
# Trained on stories
"sto_lm": "https://drive.google.com/open?id=1-FGKu-bodqOsCGrFCYY6Yyp2rTk2rRpc",
"sto_lmrev": "https://drive.google.com/open?id=1_uCgugc57tPGfFofKbU8doJN23cf4lEY",
"sto_lmall": "https://drive.google.com/open?id=1dPOLkggPbe-Pzn8VVkcrinuGJv2yRieR",
"sto_ilm": "https://drive.google.com/open?id=1oYFLxkX6mWbmpEwQH8BmgE7iKix2W0zU",
"sto_lmscratch": "https://drive.google.com/open?id=1vGxdfZUWtOB5ajpDgSGUXuHK5_BGY9GA",
"sto_lmrevscratch": "https://drive.google.com/open?id=1xbyQ5bMJpTxlsPtL1YsH2jmUUh_49gOI",
"sto_lmallscratch": "https://drive.google.com/open?id=1Qy13Dw60Jd5HqN89q8WvCMtwvTXJw7tj",
"sto_ilmscratch": "https://drive.google.com/open?id=14BFLWSaPi2JSsKsa68lcTSnCOnYV9jPm",
# Trained on abstracts
"abs_lm": "https://drive.google.com/open?id=1BSIFfuSTznmHIKa4R-AnwIxN93b1Ap-b",
"abs_lmrev": "https://drive.google.com/open?id=1yl36oZq9R_d3IhlFWLlMGq46n8F9Lq1q",
"abs_lmall": "https://drive.google.com/open?id=1qyM0OCL8pI5dL7sfAag-y9X_bnlTS_1Z",
"abs_ilm": "https://drive.google.com/open?id=1FBY9DR60WWX05orILaFHuyZYlB4ChTpS",
"abs_lmscratch": "https://drive.google.com/open?id=103Cw2ZSb5g5PlTKslmbmhqCaxn3N65OO",
"abs_lmrevscratch": "https://drive.google.com/open?id=1HeuxA2A6iEs6SW26jlCom3x_tFQHnIGu",
"abs_lmallscratch": "https://drive.google.com/open?id=1XU61GMduqJeCzYqDk8BQ7S4M8tbzqF9g",
"abs_ilmscratch": "https://drive.google.com/open?id=1ZTZOO5fVTlnPBw7EC_4OOEzHmcs6tAFO",
# Trained on lyrics
"lyr_lm": "https://drive.google.com/open?id=1FJBgz26lZPcanZTEf0iWxZCXIEM6esu6",
"lyr_lmrev": "https://drive.google.com/open?id=1XAug1jhm7sa5lksDV6GMyF8sFQLwk1Y6",
"lyr_lmall": "https://drive.google.com/open?id=1nrNkd4cBsdZS0eajA3wD1i5b6t6R6bow",
"lyr_ilm": "https://drive.google.com/open?id=1nYuYCS5fDP2_vB7A92guk0PWh5CC2I5x",
"lyr_lmscratch": "https://drive.google.com/open?id=1JzDRUSWVeyGnNaWKVYM8t1BPAs58t6uB",
"lyr_lmrevscratch": "https://drive.google.com/open?id=1Kkli5Brmc3D6qE0b5ww5daZdZroaN1YB",
"lyr_lmallscratch": "https://drive.google.com/open?id=18JYIBOtDfnksZPl4TW9cOzjOh_qDBCJP",
"lyr_ilmscratch": "https://drive.google.com/open?id=1RObPpSttNtMw4UQ1bGiVzEM-94QqkwHT",
}

PRETRAINED_MODEL_CONFIG_JSON = "https://drive.google.com/open?id=15JnXi7L6LeEB2fq4dFK2WRvDKyX46hVi"
PRETRAINED_SPECIAL_VOCAB_PKL = "https://drive.google.com/open?id=1nTQVe2tfkWV8dumbrLIHzMgPwpLIbYUd"

PAPER_TASK_TO_INTERNAL = {
"lm": "lm",
"lmrev": "reverse_lm",
"lmall": "naive",
"ilm": "ilm",
}

_DOWNLOAD_TEMPLATE = """
wget -nc --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget \
--quiet --save-cookies /tmp/cookies.txt \
--keep-session-cookies \
--no-check-certificate \
'https://docs.google.com/uc?export=download&id={gdrive_id}' \
-O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\\1\\n/p')&id={gdrive_id}" \
-O {local_path} && rm -rf /tmp/cookies.txt
""".strip()


if __name__ == "__main__":
import os
import sys
from pathlib import Path

MODEL_DIR = Path(os.environ.get("MODEL_DIR", "/data/"))

if sys.argv[1] == "model":
data_tag, model_type = sys.argv[2:]
model_tag = "{}_{}".format(data_tag[:3], model_type)
gdrive_urls = [PRETRAINED_MODELS[model_tag], PRETRAINED_MODEL_CONFIG_JSON, PRETRAINED_SPECIAL_VOCAB_PKL]
local_fns = ["pytorch_model.bin", "config.json", "additional_ids_to_tokens.pkl"]
elif sys.argv[1] == "data_train":
data_tag = sys.argv[2][:3]
out_dir = MODEL_DIR.joinpath("data")
gdrive_urls = [PREMASKED_DATA[s]["{}_mixture".format(data_tag)] for s in ["train", "valid"]]
local_fns = ["{}_mixture_{}.pkl".format(data_tag, s) for s in ["train", "valid"]]
elif sys.argv[1] == "data_eval":
data_tag = sys.argv[2][:3]
out_dir = MODEL_DIR.joinpath("data")
gdrive_urls = [
PREMASKED_DATA["test"]["{}_{}".format(data_tag, g)]
for g in ["mixture", "document", "paragraph", "sentence", "ngram", "word"]
]
local_fns = [
"{}_{}_test.pkl".format(data_tag, g)
for g in ["mixture", "document", "paragraph", "sentence", "ngram", "word"]
]

print("mkdir -p {}".format(str(MODEL_DIR)))
for gdrive_url, local_fn in zip(gdrive_urls, local_fns):
print(
_DOWNLOAD_TEMPLATE.format(gdrive_id=gdrive_url.split("=")[1], local_path=str(MODEL_DIR.joinpath(local_fn)))
)