End-to-End MLOps system (FastAPI + MLflow + Docker + Terraform)
mlops-project/
├── src/
│   ├── __init__.py
│   ├── preprocess.py
│   ├── train.py
│   ├── evaluate.py
│   └── api/
│       ├── app.py            # FastAPI app
│       ├── model_loader.py   # Load MLflow model
│       ├── Dockerfile
│       └── requirements.txt
├── mlflow/                   # MLflow service
│   └── Dockerfile
├── data/
├── docker-compose.yml
├── docker-compose-base.yml   # base
├── docker-compose.dev.yml    # dev override
├── docker-compose.prod.yml   # prod override
├── .env.dev
├── .env.prod
├── requirements.txt
├── terraform/
│   └── main.tf
├── params.yaml
└── README.md
- Need to check whether mounts are correct.
- Change the owner of the mount directories from root to the actual user before creating mount files, and give proper permissions.
- Check the backend URI path and be mindful whether it is relative or absolute: use 3 slashes (sqlite:///path) for a relative path and 4 slashes (sqlite:////path) for an absolute path.
- If artifacts are not visible at the mount path, check the docker run command and use --artifacts-destination together with --serve-artifacts.
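The slash rule above can be checked mechanically. This stdlib-only sketch (the helper name is ours, not MLflow's) mirrors how SQLAlchemy-style sqlite URIs resolve the database path:

```python
def sqlite_db_path(uri: str) -> str:
    # After "sqlite://" comes an empty host, then "/" + the database path.
    # So "sqlite:///x" resolves to the relative path "x", while
    # "sqlite:////x" resolves to the absolute path "/x".
    rest = uri[len("sqlite://"):]
    return rest[1:]  # drop the separator slash before the db path

print(sqlite_db_path("sqlite:///mlflow.db"))             # mlflow.db (relative)
print(sqlite_db_path("sqlite:////mlflow/db/mlflow.db"))  # /mlflow/db/mlflow.db (absolute)
```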
# Useful commands:
chmod 400 <private_key>   # e.g. chmod 400 ~/Downloads/vm.pem
ssh -i <private_key> azureuser@<vm-public-ip>
export MLFLOW_TRACKING_URI=http://40.75.103.57:5000
echo $MLFLOW_TRACKING_URI
python -m src.train
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./artifacts \
  --host 0.0.0.0 \
  --port 5000
- uvicorn api.app:app --reload
- uvicorn api.app:app --host 0.0.0.0 --port 8000 --reload
- Swagger UI: 127.0.0.1:8000/docs
docker build -t mlflow-server ./mlflow
mkdir -p /home/azureuser/mlartifacts
sudo chmod -R 777 /home/azureuser/mlartifacts
mkdir -p /home/azureuser/mldb
Give permission --> check the container user:
docker exec -it mlflow id
response: uid=1000 gid=1000
sudo chown -R 1000:1000 /home/azureuser/mldb
sudo chmod -R 777 /home/azureuser/mldb
export MLFLOW_TRACKING_URI=http://40.75.103.57:5000
echo $MLFLOW_TRACKING_URI
docker run -d -p 5000:5000 --name mlflow mlflow-server
docker run -d -p 5000:5000 -v /home/azureuser/mlartifacts:/mlflow/artifacts -v /home/azureuser/mldb:/mlflow/db --name mlflow mlflow-server
docker run -d -p 5000:5000 \
  -u $(id -u):$(id -g) \
  -v /home/azureuser/mldb:/mlflow/db \
  -v /home/azureuser/mlartifacts:/mlflow/artifacts \
  --name mlflow \
  mlflow-server:latest \
  server \
  --backend-store-uri sqlite:////mlflow/db/mlflow.db \
  --artifacts-destination /mlflow/artifacts \
  --serve-artifacts \
  --allowed-hosts "*" \
  --host 0.0.0.0 \
  --port 5000
docker run -p 5000:5000 mlflow-server \
  server \
  --host 0.0.0.0 \
  --port 5000 \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root /mlflow/artifacts \
  --allowed-hosts "*"
docker run -it --entrypoint sh <image>
docker run -p 5000:5000 mlflow-server server --host 0.0.0.0 --port 5000 --backend-store-uri sqlite:///mlflow.db --default-artifact-root /mlflow/artifacts --allowed-hosts "*"
Run inside the VM:
curl http://localhost:5000
From the local machine:
curl http://<vm-ip>:5000
Go inside the container:
docker exec -it mlflow sh
ps aux | grep mlflow
netstat -tuln | grep 5000
# if netstat is not available:
apt-get update && apt-get install net-tools curl -y
ss -tuln | grep 5000
curl http://localhost:5000
# if curl is not available:
python -c "import requests; print(requests.get('http://localhost:5000').status_code)"
# expected response: 200
python -c "import requests; print(requests.get('http://localhost:5000').text[:200])"
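When neither curl nor requests is available inside a container, a stdlib-only probe works too (the function name here is ours, for illustration):

```python
import urllib.error
import urllib.request

def probe(url: str, timeout: float = 3.0) -> str:
    """Return 'OK <status>' if the URL answers, else 'FAIL <reason>'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as r:
            return f"OK {r.status}"
    except (urllib.error.URLError, OSError) as e:
        return f"FAIL {e}"

print(probe("http://localhost:5000"))
```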
docker build -t ml-api ./api
docker run -d -p 8000:8000 \
  -u $(id -u):$(id -g) \
  --name ml-api-container \
  -e MLFLOW_TRACKING_URI=http://<vm-ip>:5000 \
  -e MODEL_URI=models:/iris-model@latest \
  ml-api
docker run -d -p 8000:8000 \
  --name ml-api-container \
  -e MLFLOW_TRACKING_URI=http://40.75.103.57:5000 \
  -e MODEL_URI=models:/iris-model@latest \
  ml-api
docker ps
docker ps -a
docker stop <container>
docker rm <container>
docker rm -f <container>
docker images
docker rmi <image>
docker restart <container>
docker rm $(docker ps -aq -f status=exited)
docker container prune -f
docker system prune -a   # removes stopped containers, unused images, and build cache
docker exec -it <container> sh
echo $MLFLOW_TRACKING_URI
Should return the correct URI.
python -c "import requests; r = requests.get('http://<vm-ip>:5000'); print(r.status_code)"
expected response: 200
Run python, then inside the REPL run the below to list registered models:
from mlflow.tracking import MlflowClient
client = MlflowClient()
models = client.search_registered_models()
for m in models:
    print(f"Model name: {m.name}")
    for v in m.latest_versions:
        print(f"  Version: {v.version}, Stage: {v.current_stage}")
import mlflow.pyfunc
model_uri = "models:/iris-model@latest"  # or just "models:/iris-model"
model = mlflow.pyfunc.load_model(model_uri)
print("Model loaded successfully!")
from mlflow.tracking import MlflowClient
client = MlflowClient()
models = client.search_registered_models()
print([m.name for m in models])
docker exec -it <container> python
docker exec -it <container> sh
- Step 1: Confirm the MLflow server config
docker ps
docker inspect mlflow-container | grep -A 20 Cmd
docker inspect mlflow --format '{{json .Mounts}}'
- Step 2: Check actual DB usage
docker exec -it mlflow-container bash
ls /mlflow/db
ls /mlflow/artifacts
docker stop mlflow
docker rm mlflow
rm -rf mlruns
mkdir -p /home/azureuser/mldb
mkdir -p /home/azureuser/mlartifacts
chmod -R 777 /home/azureuser/mldb
chmod -R 777 /home/azureuser/mlartifacts
ls -ld /home/azureuser/mlartifacts
ls -ld /home/azureuser/mldb
Change owner:
sudo chown -R azureuser:azureuser /home/azureuser/mlartifacts
sudo chown -R azureuser:azureuser /home/azureuser/mldb
Then give permission:
chmod -R 755 /home/azureuser/mlartifacts
mkdir -p /home/azureuser/data
sudo chown -R azureuser:azureuser /home/azureuser/data
chmod -R 755 /home/azureuser/data
sudo rm -rf /home/azureuser/mlartifacts
sudo rm -rf /home/azureuser/mldb
docker run -d -p 5000:5000 \
  -v /home/azureuser/mldb:/mlflow/db \
  -v /home/azureuser/mlartifacts:/mlflow/artifacts \
  --name mlflow \
  mlflow-server:latest \
  server \
  --backend-store-uri sqlite:///mlflow/db/mlflow.db \
  --default-artifact-root /mlflow/artifacts \
  --allowed-hosts "*" \
  --host 0.0.0.0 \
  --port 5000
-u $(id -u):$(id -g)
Runs the container as your VM user (azureuser) and prevents root-owned files.
Ensures both /home/azureuser/mldb and /home/azureuser/mlartifacts stay writable.
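A quick sketch to verify the claim above: check that both mount directories exist and are writable by the current user (paths are the ones used in these notes; adjust as needed):

```shell
# Touch-and-remove a probe file in each bind-mounted directory.
for d in /home/azureuser/mldb /home/azureuser/mlartifacts; do
  if [ -d "$d" ] && touch "$d/.write_test" 2>/dev/null; then
    rm -f "$d/.write_test"
    echo "$d: writable"
  else
    echo "$d: missing or NOT writable"
  fi
done
```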
docker run -d -p 5000:5000 \
  -u $(id -u):$(id -g) \
  -v /home/azureuser/mldb:/mlflow/db \
  -v /home/azureuser/mlartifacts:/mlflow/artifacts \
  --name mlflow \
  mlflow-server:latest \
  server \
  --backend-store-uri sqlite:////mlflow/db/mlflow.db \
  --default-artifact-root /mlflow/artifacts \
  --allowed-hosts "*" \
  --host 0.0.0.0 \
  --port 5000
docker inspect mlflow-server:latest | grep -A 5 Entrypoint
docker inspect mlflow-container | grep -A 20 Cmd
docker logs mlflow
docker inspect mlflow --format '{{.State.Status}} {{.State.ExitCode}} {{.State.Error}}'
Using 3 slashes for an absolute path, sqlite:///mlflow/db/mlflow.db,
MLflow interprets it as:
./mlflow/db/mlflow.db (relative)
but the expected absolute form uses 4 slashes:
sqlite:////mlflow/db/mlflow.db
docker inspect mlflow-server:latest --format '{{json .Config.Entrypoint}} {{json .Config.Cmd}}'
python -c "import requests; print(requests.get('http://127.0.0.1:5000').status_code)"
expected response: 200
ls /home/azureuser/mlartifacts
python -c "import mlflow; print(mlflow.get_tracking_uri())"
docker exec -it mlflow id
expected response: uid=1000 gid=1000
docker exec -it mlflow bash
ls /mlflow/db
ls /mlflow/artifacts
docker run -d -p 8000:8000 \
  -u $(id -u):$(id -g) \
  --name ml-api-container \
  -e MLFLOW_TRACKING_URI=http://40.75.103.57:5000 \
  -e MODEL_URI=models:/iris-model@latest \
  ml-api
ls -al /home/azureuser/mldb
# mlflow.db should be present
ls -al /home/azureuser/mlartifacts
sqlite3 /home/azureuser/mldb/mlflow.db "select run_uuid, artifact_uri from runs order by start_time desc limit 5;"
output: /mlflow/artifacts/1/5f9419cde1684a33a143aaf9dd7a86fc/artifacts
This is a local-filesystem artifact URI. Follow up with:
find /home/azureuser/mlartifacts -maxdepth 6 -type f | head -50
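The same check can be scripted with Python's built-in sqlite3 module instead of the sqlite3 CLI. The helper name is ours; the runs table and columns are the ones MLflow's SQL backend creates:

```python
import sqlite3

def latest_artifact_uris(db_path: str, limit: int = 5):
    """Return (run_uuid, artifact_uri) for the most recent runs."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "select run_uuid, artifact_uri from runs "
            "order by start_time desc limit ?",
            (limit,),
        ).fetchall()
    finally:
        con.close()

# e.g. latest_artifact_uris("/home/azureuser/mldb/mlflow.db")
```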
docker rm -f mlflow 2>/dev/null
docker run -d -p 5000:5000 \
  -u $(id -u):$(id -g) \
  -v /home/azureuser/mldb:/mlflow/db \
  -v /home/azureuser/mlartifacts:/mlflow/artifacts \
  --name mlflow \
  mlflow-server:latest \
  server \
  --backend-store-uri sqlite:////mlflow/db/mlflow.db \
  --default-artifact-root /mlflow/artifacts \
  --serve-artifacts \
  --allowed-hosts "*" \
  --host 0.0.0.0 \
  --port 5000
docker run -d -p 5000:5000 \
  -u $(id -u):$(id -g) \
  -v /home/azureuser/mldb:/mlflow/db \
  -v /home/azureuser/mlartifacts:/mlflow/artifacts \
  --name mlflow \
  mlflow-server:latest \
  server \
  --backend-store-uri sqlite:////mlflow/db/mlflow.db \
  --artifacts-destination /mlflow/artifacts \
  --serve-artifacts \
  --allowed-hosts "*" \
  --host 0.0.0.0 \
  --port 5000
After adding --serve-artifacts, if the output of the below:
sqlite3 /home/azureuser/mldb/mlflow.db "select run_uuid, artifact_uri from runs order by start_time desc limit 5;"
sqlite3 /home/azureuser/mldb/mlflow.db "select experiment_id, name, artifact_location from experiments order by experiment_id;"
is still pointing to a local path, change the experiment name (an experiment's artifact location is fixed when the experiment is created, so a new experiment picks up the proxied location).
ls -al /home/azureuser/mlartifacts
find /home/azureuser/mlartifacts -maxdepth 6 -type f | head -50
echo $MLFLOW_TRACKING_URI
python -c "import mlflow; print(mlflow.get_tracking_uri())"
Both should point to http://<vm-ip>:5000.
Local: python -c "import mlflow; print(mlflow.__version__)"
Inside the VM: docker exec -it mlflow python -c "import mlflow; print(mlflow.__version__)"
Output will show the lifecycle state and status:
curl -X POST http://127.0.0.1:5000/api/2.0/mlflow/runs/search \
  -H "Content-Type: application/json" \
  -d '{"experiment_ids":["2"]}'
sqlite3 /home/azureuser/mldb/mlflow.db \
  "select run_uuid, experiment_id, lifecycle_stage, status from runs order by start_time desc limit 10;"
find /home/azureuser/mlartifacts -maxdepth 6 -type f | head -50
Expected output:
find /home/azureuser/mlartifacts -maxdepth 6 -type f | head -50
/home/azureuser/mlartifacts/3/360d485c204d4559b9ca78c56d0f7475/artifacts/params.yaml
/home/azureuser/mlartifacts/3/models/m-5487e19c80df4c7ebf7eca83abbdf887/artifacts/MLmodel
/home/azureuser/mlartifacts/3/models/m-5487e19c80df4c7ebf7eca83abbdf887/artifacts/model.pkl
/home/azureuser/mlartifacts/3/models/m-5487e19c80df4c7ebf7eca83abbdf887/artifacts/requirements.txt
/home/azureuser/mlartifacts/3/models/m-5487e19c80df4c7ebf7eca83abbdf887/artifacts/python_env.yaml
/home/azureuser/mlartifacts/3/models/m-5487e19c80df4c7ebf7eca83abbdf887/artifacts/conda.yaml
sqlite3 /home/azureuser/mldb/mlflow.db "select experiment_id, name, artifact_location from experiments order by experiment_id;"
Expected output:
0|Default|/mlflow/artifacts/0
1|mlops-production-iris|/mlflow/artifacts/1
2|mlops-production-iris-remote|/mlflow/artifacts/2
3|mlops-production-iris-remote1-|mlflow-artifacts:/3
sqlite3 /home/azureuser/mldb/mlflow.db "select run_uuid, artifact_uri from runs order by start_time desc limit 5;"
expected o/p:
360d485c204d4559b9ca78c56d0f7475|mlflow-artifacts:/3/360d485c204d4559b9ca78c56d0f7475/artifacts
docker build -t ml-train -f train.Dockerfile .
docker run --rm --name ml-train-container \
  -u $(id -u):$(id -g) \
  -e MLFLOW_TRACKING_URI=http://40.75.103.57:5000 \
  -v /home/azureuser/MLOps-Regression_model/data:/app/data \
  ml-train
Check the Docker path:
azureuser@vm26:~/MLOps-Regression_model$ which docker
/usr/bin/docker
Log in to the VM, open crontab (crontab -e), and choose an editor (nano).
Add a training job scheduled every day at 2 AM:
0 2 * * * /usr/bin/docker run --rm --name ml-train-container -u $(id -u):$(id -g) -e MLFLOW_TRACKING_URI=http://40.75.103.57:5000 -v /home/azureuser/MLOps-Regression_model/data:/app/data ml-train >> /home/azureuser/train.log 2>&1
(a crontab entry must be a single line, so the command is not split with backslashes)
python3 -m src.pipeline >> /home/azureuser/train.log 2>&1
Write and save (Ctrl+O), press Enter, then exit (Ctrl+X).
crontab -l
cat /home/azureuser/train.log
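One way to keep the crontab entry short is a wrapper script the cron line calls instead. This is a hypothetical /home/azureuser/train_job.sh sketch (the docker invocation mirrors the entry above and is commented out here so the sketch runs anywhere):

```shell
#!/bin/sh
# Hypothetical wrapper: crontab would call this with
#   0 2 * * * /home/azureuser/train_job.sh
set -e
echo "[$(date)] starting scheduled training"
# /usr/bin/docker run --rm --name ml-train-container \
#   -e MLFLOW_TRACKING_URI=http://40.75.103.57:5000 \
#   -v /home/azureuser/MLOps-Regression_model/data:/app/data \
#   ml-train
echo "[$(date)] training finished"
```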
To disable a scheduled job, just remove the line from crontab or comment it out.
Profiles in docker-compose mean the service will not run by default; it only runs when explicitly requested.
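A minimal compose fragment illustrating this (service and image names taken from these notes; the profile name is an assumption):

```yaml
services:
  api:
    image: ml-api            # starts with a plain "docker compose up"
  train:
    image: ml-train
    profiles: ["training"]   # only runs with --profile training, or via "compose run"
```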
docker-compose --version
docker compose version
Whichever of docker-compose or docker compose works, use that form consistently in the commands below.
docker compose build
docker compose up -d mlflow api
docker-compose build train
docker-compose --profile training build
docker compose run --rm train
docker compose --profile training up
Multiple profiles:
docker-compose --profile train --profile debug up
docker-compose build
docker-compose build train
docker compose up -d
docker compose run --rm train
After only API code changes:
docker-compose build api
docker compose up -d api
stop = pause containers; down = remove containers + network
docker-compose start (restarts the same containers that were stopped)
docker-compose stop (only stops containers)
docker-compose down (removes containers and networks)
docker-compose ps
How to run: DEV
docker-compose --env-file .env.dev \
  -f docker-compose-base.yml \
  -f docker-compose.dev.yml \
  build
docker-compose --env-file .env.dev \
  -f docker-compose-base.yml \
  -f docker-compose.dev.yml \
  run --rm trainer
docker-compose --env-file .env.dev \
  -f docker-compose-base.yml \
  -f docker-compose.dev.yml up -d
Run trainer:
docker-compose --env-file .env.dev \
  -f docker-compose-base.yml \
  -f docker-compose.dev.yml \
  run --rm trainer
PROD
docker-compose --env-file .env.prod \
  -f docker-compose-base.yml \
  -f docker-compose.prod.yml pull
docker-compose --env-file .env.prod \
  -f docker-compose-base.yml \
  -f docker-compose.prod.yml up -d
alias dcdev='docker-compose --env-file .env.dev -f docker-compose-base.yml -f docker-compose.dev.yml'
dcdev build
dcdev up -d
dcdev run --rm trainer
docker-compose run --rm -it debug sh
or, if debug is already running:
docker exec -it debug sh
inside container:
Verify Docker internal DNS:
ping -c 2 mlflow
ping -c 2 api
- docker-compose stop
What it does: stops containers but keeps them on disk.
Next time, docker-compose start resumes the same containers.
- docker-compose down
What it does: stops containers, removes the containers, and removes the network.
1. Go inside the debug container
Best way:
docker-compose run --rm -it debug sh
If debug is already running:
docker exec -it debug sh

2. Verify Docker internal DNS
Inside debug container:
ping -c 2 mlflow
ping -c 2 api
If ping is unavailable, use:
nslookup mlflow
nslookup api

3. Verify MLflow is reachable from inside the Docker network
Inside debug container:
wget -qO- http://mlflow:5000
Or if curl exists:
curl http://mlflow:5000
Expected:
HTML or response content, no connection refused.

4. Verify the MLflow runs API
Inside debug container:
wget -qO- --header="Content-Type: application/json" \
  --post-data='{"experiment_ids":["3"]}' \
  http://mlflow:5000/api/2.0/mlflow/runs/search
If curl exists:
curl -X POST http://mlflow:5000/api/2.0/mlflow/runs/search \
  -H "Content-Type: application/json" \
  -d '{"experiment_ids":["3"]}'
Use your actual experiment id if different.
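If neither wget nor curl is available in the debug container, the same runs/search call can be made with the Python stdlib (the function name is ours, for illustration):

```python
import json
import urllib.error
import urllib.request

def search_runs(base: str, experiment_id: str, timeout: float = 5.0):
    """POST to MLflow's runs/search endpoint; return the parsed JSON or an error dict."""
    req = urllib.request.Request(
        f"{base}/api/2.0/mlflow/runs/search",
        data=json.dumps({"experiment_ids": [experiment_id]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as r:
            return json.load(r)
    except (urllib.error.URLError, OSError) as e:
        return {"error": str(e)}

# e.g. search_runs("http://mlflow:5000", "3")
```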
5. Verify the API is reachable from inside the Docker network
Inside debug container:
wget -qO- http://api:8000/docs
Or:
curl http://api:8000/docs

6. Verify the API health endpoint
If you have /health:
wget -qO- http://api:8000/health
Expected:
{"status":"ok"}
or your equivalent.
7. Verify environment variables inside the API container
From VM shell:
docker exec -it api sh
Then:
env | grep MLFLOW
env | grep MODEL
You should see something like:
MLFLOW_TRACKING_URI=http://mlflow:5000
MODEL_URI=models:/iris-model@latest
If @latest gives issues, use:
models:/iris-model/latest

8. Verify model loading manually inside the API container
Inside API container:
python -c "import os, mlflow; mlflow.set_tracking_uri(os.getenv('MLFLOW_TRACKING_URI')); print(mlflow.pyfunc.load_model(os.getenv('MODEL_URI')))"
If this fails, it will show the real model loading error immediately.
9. Check API logs live
From VM shell:
docker-compose logs -f api
Watch for:
model load errors, connection errors to MLflow, import errors, FastAPI startup errors

10. Check MLflow logs live
docker-compose logs -f mlflow
Watch for:
backend store errors, artifact write errors, invalid host header, permission denied, failed requests from API/trainer

11. Check trainer logs
If trainer is a one-time job:
docker-compose run --rm trainer
If it fails, you'll see logs directly.
If trainer container remains:
docker logs trainer

12. Verify mounted files inside the MLflow container
docker exec -it mlflow sh
Then:
ls -la /mlflow/db
ls -la /mlflow/artifacts
find /mlflow/artifacts -maxdepth 6 -type f | head -50
Expected:
mlflow.db exists, artifact files exist

13. Verify the DB has experiments and runs
From VM shell:
sqlite3 /home/azureuser/mldb/mlflow.db "select experiment_id, name, artifact_location from experiments;"
sqlite3 /home/azureuser/mldb/mlflow.db "select run_uuid, experiment_id, lifecycle_stage, status from runs order by start_time desc limit 10;"

14. Verify artifact URIs for the latest runs
sqlite3 /home/azureuser/mldb/mlflow.db "select run_uuid, artifact_uri from runs order by start_time desc limit 5;"
Healthy new runs should show proxied style for your new experiment, like:
mlflow-artifacts:/...

15. Verify the API prediction endpoint from the VM
From VM shell:
curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"data":[[5.1,3.5,1.4,0.2]]}'
Or from local machine:
curl -X POST http://<vm-ip>:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"data":[[5.1,3.5,1.4,0.2]]}'
16. Check Compose containers status
docker-compose ps
17. Clean restart when debugging stale state
docker-compose down
docker-compose up -d mlflow api
Then rerun trainer:
docker-compose run --rm trainer

18. Most useful quick flow for your exact setup
When something breaks, do this in order:
docker-compose ps
docker-compose logs -f mlflow
docker-compose logs -f api
docker-compose run --rm -it debug sh
Inside debug:
wget -qO- http://mlflow:5000
wget -qO- http://api:8000/docs
Then inside API container:
docker exec -it api sh
env | grep MLFLOW
env | grep MODEL
python -c "import os, mlflow; mlflow.set_tracking_uri(os.getenv('MLFLOW_TRACKING_URI')); print(mlflow.pyfunc.load_model(os.getenv('MODEL_URI')))"
Then DB check:
sqlite3 /home/azureuser/mldb/mlflow.db "select run_uuid, artifact_uri from runs order by start_time desc limit 5;"

19. Your most common issue map
If MLflow is unreachable from debug: a network or service-name issue.
If the API is unreachable from debug: an API startup failure or port issue.
If the API container env is wrong: a compose env issue.
If model loading fails in the API: a wrong MODEL_URI, a registry issue, or an artifact-serving issue.
If the DB shows runs but the UI doesn't: an MLflow UI cache/filter issue; the backend is still okay.

20. Recommended exact commands for your current stack
From VM:
docker-compose ps
docker-compose logs -f mlflow
docker-compose logs -f api
docker-compose run --rm -it debug sh
Inside debug:
wget -qO- http://mlflow:5000
wget -qO- http://api:8000/docs
Inside API container:
docker exec -it api sh
env | grep MLFLOW
env | grep MODEL
python -c "import os, mlflow; mlflow.set_tracking_uri(os.getenv('MLFLOW_TRACKING_URI')); print(mlflow.pyfunc.load_model(os.getenv('MODEL_URI')))"
MLOps Debug Commands Cheat Sheet
- Check compose service/container status:
  docker-compose ps
- See MLflow logs live:
  docker-compose logs -f mlflow
- See API logs live:
  docker-compose logs -f api
- Open a debug container shell:
  docker-compose run --rm -it debug sh
- From inside debug: verify Docker DNS for MLflow:
  ping -c 2 mlflow
- From inside debug: verify Docker DNS for the API:
  ping -c 2 api
- From inside debug: verify the MLflow UI/service is reachable:
  wget -qO- http://mlflow:5000
- From inside debug: verify the MLflow runs search API:
  wget -qO- --header="Content-Type: application/json" \
    --post-data='{"experiment_ids":["3"]}' \
    http://mlflow:5000/api/2.0/mlflow/runs/search
- From inside debug: verify the API docs are reachable:
  wget -qO- http://api:8000/docs
- From inside debug: verify the API health endpoint:
  wget -qO- http://api:8000/health
- Enter the API container shell:
  docker exec -it api sh
- Inside the API container: verify MLflow env variables:
  env | grep MLFLOW
  env | grep MODEL
- Inside the API container: verify model loading manually:
  python -c "import os, mlflow; mlflow.set_tracking_uri(os.getenv('MLFLOW_TRACKING_URI')); print(mlflow.pyfunc.load_model(os.getenv('MODEL_URI')))"
- Enter the MLflow container shell:
  docker exec -it mlflow sh
- Inside the MLflow container: inspect DB and artifact folders:
  ls -la /mlflow/db
  ls -la /mlflow/artifacts
  find /mlflow/artifacts -maxdepth 6 -type f | head -50
- On the VM host: inspect experiments in the SQLite DB:
  sqlite3 /home/azureuser/mldb/mlflow.db "select experiment_id, name, artifact_location from experiments;"
- On the VM host: inspect the latest runs in the SQLite DB:
  sqlite3 /home/azureuser/mldb/mlflow.db "select run_uuid, experiment_id, lifecycle_stage, status from runs order by start_time desc limit 10;"
- On the VM host: inspect the latest artifact URIs:
  sqlite3 /home/azureuser/mldb/mlflow.db "select run_uuid, artifact_uri from runs order by start_time desc limit 5;"
- Test the API prediction endpoint from the VM host:
  curl -X POST http://127.0.0.1:8000/predict \
    -H "Content-Type: application/json" \
    -d '{"data":[[5.1,3.5,1.4,0.2]]}'
- Clean restart the full stack and rerun training:
  docker-compose down
  docker-compose up -d mlflow api
  docker-compose run --rm trainer
Health:
docker-compose ps
docker-compose logs -f mlflow
docker-compose logs -f api
Network
docker-compose run --rm -it debug sh
wget -qO- http://mlflow:5000
wget -qO- http://api:8000/docs
API model loading
docker exec -it api sh
env | grep MLFLOW
env | grep MODEL
python -c "import os, mlflow; mlflow.set_tracking_uri(os.getenv('MLFLOW_TRACKING_URI')); print(mlflow.pyfunc.load_model(os.getenv('MODEL_URI')))"
MLflow storage
docker exec -it mlflow sh
ls -la /mlflow/db
ls -la /mlflow/artifacts
find /mlflow/artifacts -maxdepth 6 -type f | head -50
sqlite3 /home/azureuser/mldb/mlflow.db "select experiment_id, name, artifact_location from experiments;"
sqlite3 /home/azureuser/mldb/mlflow.db "select run_uuid, artifact_uri from runs order by start_time desc limit 5;"
End-to-end
curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"data":[[5.1,3.5,1.4,0.2]]}'
docker login -u <username>
docker tag mlops-regression_model_api:latest debago/iris-api:v1
docker push debago/iris-api:v1
Run in sequence:
docker-compose stop mlflow
or
docker stop mlflow
cp /home/azureuser/mldb/mlflow.db /home/azureuser/mldb/mlflow.db.bak
docker run --rm -u
az version
az login
az aks install-cli
kubectl version --client
Powershell:
az account list -o table
az account set --subscription <subscription name/id>
az group create --name mlops-rg `
  --location eastus
az aks create --resource-group mlops-rg `
  --name mlops-aks --node-count 1 `
  --node-vm-size Standard_B2s --enable-managed-identity `
  --generate-ssh-keys
az aks get-credentials --resource-group mlops-rg `
  --name mlops-aks
kubectl get nodes
kubectl create namespace mlops
kubectl apply -f .\k8s\api-deployment.yml
kubectl apply -f .\k8s\api-service.yml
kubectl apply -f .\k8s\mlflow-deployment.yml
kubectl apply -f .\k8s\mlflow-service.yml
kubectl delete job trainer-job -n mlops
kubectl apply -f .\k8s\trainer-job.yml
kubectl delete -f .\k8s\api-deployment.yml
kubectl get pods -n mlops
kubectl get svc -n mlops
kubectl logs deployment/mlflow -n mlops
First, check the pod name:
kubectl get pods -n mlops -l app=api
Log in to the pod:
kubectl exec -it <pod-name> -n mlops -- sh
Inside the pod, test MLflow:
wget -qO- http://mlflow:5000
If wget is unavailable, use python:
python -c "import urllib.request; print(urllib.request.urlopen('http://mlflow:5000').status)"
expected: 200
python -c "import socket; print(socket.gethostbyname('mlflow'))"
kubectl logs deployment/api -n mlops
kubectl apply -f k8s
kubectl get jobs -n mlops
kubectl get pods -n mlops
kubectl logs job/trainer -n mlops
kubectl get pods -n mlops --show-labels
kubectl get endpoints -n mlops
Expected output: api-service <pod-ip>:8000; if empty, the labels mismatch.
Debug commands:
kubectl describe svc api-service -n mlops
Golden rule:
Deployment labels == Service selector
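A sketch of the rule (the Service name follows the api-service used in these notes; the label values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api          # pod label
    spec:
      containers:
        - name: api
          image: debago/iris-api:v1
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api              # must equal the pod label above
  ports:
    - port: 8000
      targetPort: 8000
```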
kubectl get all -n mlops
kubectl delete all --all -n mlops
kubectl delete namespace mlops   # deletes everything inside the namespace
kubectl delete job trainer-job -n mlops
kubectl delete pod -n mlops --field-selector=status.phase=Failed
kubectl delete pod -n mlops --field-selector=status.phase=Unknown
kubectl delete pod -n mlops --field-selector=status.phase!=Running
kubectl delete pod -n mlops --all
kubectl rollout restart deployment mlflow -n mlops
kubectl port-forward svc/mlflow 5000:5000 -n mlops
az aks nodepool list --resource-group mlops-rg `
  --cluster-name mlops-aks `
  -o table
kubectl get nodes -o wide
kubectl get pods -o wide
Add new nodepool to aks:
az aks nodepool add --resource-group mlops-rg `
  --cluster-name mlops-aks --name nodepool2 `
  --node-count 1 `
  --node-vm-size Standard_B4ms
Drain old nodes:
kubectl drain aks-nodepool1-23665012-vmss000000 --ignore-daemonsets --delete-emptydir-data
The old pods are automatically shifted to the new nodes.
After the PVC is applied, check:
kubectl get pods -n mlops -l app=mlflow
kubectl exec -it mlflow-56cb5b7bd5-sdqhv -n mlops -- sh
kubectl describe pod <pod-name> -n mlops
kubectl logs deployment/api -n mlops
kubectl run debug-api `
  --image=debago/iris-api:dev `
  -n mlops `
  --command -- sleep 3600