What version of Dgraph are you using?
20.03.1
Have you tried reproducing the issue with the latest release?
This is the latest release
What is the hardware spec (RAM, OS)?
Kubernetes running on 3 m5.xlarge nodes (4 vCPU / 16 GB RAM each)
Steps to reproduce the issue (command/config used to run Dgraph).
K8s yaml for cluster setup
```yaml
apiVersion: v1
kind: Service
metadata:
  name: dgraph-alpha-public
  labels:
    app: dgraph-alpha
    monitor: alpha-dgraph-io
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    targetPort: 8080
    name: alpha-http
  - port: 9080
    targetPort: 9080
    name: alpha-grpc
  selector:
    app: dgraph-alpha
---
apiVersion: v1
kind: Service
metadata:
  name: dgraph-alpha-private
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
  labels:
    app: dgraph-alpha
    monitor: alpha-dgraph-io
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    targetPort: 8080
    name: alpha-http
  - port: 9080
    targetPort: 9080
    name: alpha-grpc
  selector:
    app: dgraph-alpha
# ---
# # This service is created in order to debug & profile a specific alpha.
# # You can create one for each alpha that you need to profile.
# # For more general HTTP API use, prefer the services above.
# apiVersion: v1
# kind: Service
# metadata:
#   name: dgraph-alpha-0-http-public
#   labels:
#     app: dgraph-alpha
# spec:
#   type: LoadBalancer
#   ports:
#   - port: 8080
#     targetPort: 8080
#     name: alpha-http
#   selector:
#     statefulset.kubernetes.io/pod-name: dgraph-alpha-0
---
apiVersion: v1
kind: Service
metadata:
  name: dgraph-ratel-public
  labels:
    app: dgraph-ratel
spec:
  type: LoadBalancer
  ports:
  - port: 8000
    targetPort: 8000
    name: ratel-http
  selector:
    app: dgraph-ratel
---
# This is a headless service which is necessary for discovery for a dgraph-zero StatefulSet.
# https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#creating-a-statefulset
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero
  labels:
    app: dgraph-zero
spec:
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc
  clusterIP: None
  # We want all pods in the StatefulSet to have their addresses published for
  # the sake of the other Dgraph Zero pods even before they're ready, since they
  # have to be able to talk to each other in order to become ready.
  publishNotReadyAddresses: true
  selector:
    app: dgraph-zero
---
# This is a headless service which is necessary for discovery for a dgraph-alpha StatefulSet.
# https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#creating-a-statefulset
apiVersion: v1
kind: Service
metadata:
  name: dgraph-alpha
  labels:
    app: dgraph-alpha
spec:
  ports:
  - port: 7080
    targetPort: 7080
    name: alpha-grpc-int
  clusterIP: None
  # We want all pods in the StatefulSet to have their addresses published for
  # the sake of the other Dgraph alpha pods even before they're ready, since they
  # have to be able to talk to each other in order to become ready.
  publishNotReadyAddresses: true
  selector:
    app: dgraph-alpha
---
# This StatefulSet runs 3 Dgraph Zero.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dgraph-zero
spec:
  serviceName: "dgraph-zero"
  replicas: 3
  selector:
    matchLabels:
      app: dgraph-zero
  template:
    metadata:
      labels:
        app: dgraph-zero
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["dgraph-zero"]
              topologyKey: kubernetes.io/hostname
      containers:
      - name: zero
        image: dgraph/dgraph:v20.03.1
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5080
          name: zero-grpc
        - containerPort: 6080
          name: zero-http
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: GODEBUG
          value: madvdontneed=1
        command:
          - bash
          - "-c"
          - |
            set -ex
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            idx=$(($ordinal + 1))
            if [[ $ordinal -eq 0 ]]; then
              exec dgraph zero --my=$(hostname -f):5080 --idx $idx --replicas 3 -v=2
            else
              exec dgraph zero --my=$(hostname -f):5080 --peer dgraph-zero-0.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080 --idx $idx --replicas 3 -v=2
            fi
        livenessProbe:
          httpGet:
            path: /health
            port: 6080
          initialDelaySeconds: 15
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6
          successThreshold: 1
        readinessProbe:
          httpGet:
            path: /state
            port: 6080
          initialDelaySeconds: 15
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6
          successThreshold: 1
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: 5Gi
---
# This StatefulSet runs 3 replicas of Dgraph Alpha.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dgraph-alpha
spec:
  serviceName: "dgraph-alpha"
  replicas: 3
  selector:
    matchLabels:
      app: dgraph-alpha
  template:
    metadata:
      labels:
        app: dgraph-alpha
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["dgraph-alpha"]
              topologyKey: kubernetes.io/hostname
      containers:
      - name: alpha
        image: dgraph/dgraph:v20.03.1
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 7080
          name: alpha-grpc-int
        - containerPort: 8080
          name: alpha-http
        - containerPort: 9080
          name: alpha-grpc
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "12Gi"
            cpu: "2"
        env:
          # This should be the same namespace as the dgraph-zero
          # StatefulSet to resolve a Dgraph Zero's DNS name for
          # Alpha's --zero flag.
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: GODEBUG
            value: madvdontneed=1
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph alpha --my=$(hostname -f):7080 --lru_mb 2048 --zero dgraph-zero-0.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080 -v=2
        livenessProbe:
          httpGet:
            path: /health?live=1
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6
          successThreshold: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6
          successThreshold: 1
      terminationGracePeriodSeconds: 600
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dgraph-ratel
  labels:
    app: dgraph-ratel
spec:
  selector:
    matchLabels:
      app: dgraph-ratel
  template:
    metadata:
      labels:
        app: dgraph-ratel
    spec:
      containers:
      - name: ratel
        image: dgraph/dgraph:v20.03.1
        ports:
        - containerPort: 8000
        command:
          - dgraph-ratel
```
Hit the Alpha load balancer with ~25-75 mutations/sec to ingest data into the graph; really, any consistent flow of data into the Alpha nodes reproduces the problem.
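For reference, the ingest load can be approximated with a small script against an Alpha's HTTP `/mutate` endpoint. This is only a sketch: `ALPHA_ADDR` and the `<name>`/`item-*` RDF payload are placeholders, not our real schema or data.

```shell
#!/usr/bin/env sh
# Build one RDF set-mutation payload (pure string work, no network).
# The predicate and value are hypothetical placeholders.
make_payload() {
  printf '{ set { _:node <name> "item-%s" . } }' "$1"
}

# POST the payload to an Alpha's HTTP /mutate endpoint; commitNow=true
# commits the mutation immediately. Point ALPHA_ADDR at the load
# balancer to approximate the reported traffic.
send_mutation() {
  curl -s -X POST "http://${ALPHA_ADDR:-localhost:8080}/mutate?commitNow=true" \
       -H 'Content-Type: application/rdf' \
       --data "$(make_payload "$1")"
}

# Roughly 50 mutations in one burst (repeat once per second):
#   for i in $(seq 1 50); do send_mutation "$i" & done; wait
make_payload 7
```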
Expected behaviour and actual result.
Expected behaviour:
- Alpha nodes release unused memory and do NOT OOM-kill themselves and restart. In practice they consume every byte of memory available to them; even after the limit was raised to 12 GB, that memory was never released. Given a mutation rate of only 25-75/sec, 12 GB per Alpha pod should be plenty.
- This behavior continued even with GODEBUG=madvdontneed=1 set.
Actual result:
- The graphs below show memory spiking and Alpha nodes failing (captured back when the memory limit was 6 GB; the same thing happens at 12 GB).

- Continual cycle of OOM kills and CrashLoopBackOff restarts (screenshot)
- Pod cycling through OOM errors (screenshot)
pprof heap profiles do not show the same memory pressure: they report only ~2 GB in use even while the container is dying. pprof inuse_objects does, however, show a rather high object count.
I believe the issue lies in a lack of GC, or potentially a memory leak, within the Alpha pods.
pprof.dgraph.alloc_objects.alloc_space.inuse_objects.inuse_space.009.pb.gz
pprof.dgraph.samples.cpu.001.pb.gz
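For anyone reproducing the profiling step: the profiles above were pulled from an Alpha's HTTP port, which serves Go's standard /debug/pprof endpoints. A sketch, assuming the pod and port names from the manifest above (`kubectl port-forward` or the commented-out per-pod debug Service can expose the port):

```shell
#!/usr/bin/env sh
# Helper that builds the pprof endpoint URL for a given host:port and
# profile name (heap, profile, goroutine, ...).
pprof_url() {
  printf 'http://%s/debug/pprof/%s' "$1" "$2"
}

# Capture a heap profile from one Alpha:
#   kubectl port-forward pod/dgraph-alpha-0 8080:8080 &
#   curl -s "$(pprof_url localhost:8080 heap)" -o heap.pb.gz
# Then compare live bytes vs. live object counts:
#   go tool pprof -inuse_space -top heap.pb.gz
#   go tool pprof -inuse_objects -top heap.pb.gz
pprof_url localhost:8080 heap
```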
This issue is blocking our team, so any help would be greatly appreciated.