
Q: Memory consumption when using in-memory/internal mode #54

Closed
rmoriz opened this issue Jul 24, 2017 · 27 comments

rmoriz commented Jul 24, 2017

I have a test setup here with 6 chef-clients running against goiardi on a VPS, and goiardi is consuming almost all of the available memory:

Number of nodes: 6

Options:
'GOIARDI_LOG_EVENTS=1',
'GOIARDI_LOG_EVENT_KEEP=100',
'GOIARDI_LOG_LEVEL=debug'

(stdout shows event purge works)

config file:

index-file = "/var/lib/goiardi/goiardi-index.bin"
data-file = "/var/lib/goiardi/goiardi-data.bin"
freeze-interval = 10
time-slew = "15m"
conf-root = "/etc/goiardi"
use-auth = true
use-ssl = false
https-urls = true
disable-webui = false
local-filestore-dir = "/var/lib/goiardi/lfs"

Size of /var/lib/goiardi is 101MB

docker ps:

CONTAINER           CPU %               MEM USAGE / LIMIT    MEM %               NET I/O             BLOCK I/O           PIDS
72b6855d467d        0.00%               999.8MiB / 1000MiB   99.98%              1.34MB / 1.55MB     575MB / 696MB       7

Is this expected behaviour? Do you have numbers/experiences on how much memory goiardi needs?
Thanks!

ctdk (Owner) commented Jul 24, 2017

Goiardi's known to eat up a lot of memory in at least some configurations, although that's more than I would expect with only six clients. A couple of questions, though, just to get things figured out:

  1. Are you adding and removing clients constantly? (I'm curious there if there's memory not being freed.)
  2. If you restart goiardi, does the memory usage drop significantly?
  3. Does reindexing help?
  4. Do the nodes have an unusual number of attributes or something?

The size of the search index has been a definite problem in the past, although it certainly should be better than that with six clients. For a more immediate solution, since you're running it in docker anyway, have you considered running goiardi with postgres and the postgres search?

rmoriz (Author) commented Jul 24, 2017

1.) No.
2.) Yes, it starts with around 200MB with all data loaded, but keeps growing over time.
3.) I've never done this. knife index rebuild?
4.) No, not really. I have a couple of data bags (~50) that contain encrypted TLS certificate chains, so I've added index-val-trim=2000 (see the sketch below).
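
For context, index-val-trim (as I understand it) caps how much of each attribute value gets stored in the search index, so a multi-kilobyte encrypted certificate chain only contributes a bounded amount per entry. A minimal sketch of the idea, not goiardi's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// trimValue caps an indexed value at max bytes; max <= 0 leaves the value
// untouched, mirroring the documented behavior of index-val-trim = 0
// disabling trimming.
func trimValue(v string, max int) string {
	if max > 0 && len(v) > max {
		return v[:max]
	}
	return v
}

func main() {
	blob := strings.Repeat("A", 8192)       // stand-in for an encrypted data bag value
	fmt.Println(len(trimValue(blob, 2000))) // 2000
}
```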

Here's my setup shortly after a reboot of the host/restart of the containers:

$ docker stats --no-stream
NAME                CPU %               MEM USAGE / LIMIT    MEM %               NET I/O             BLOCK I/O           PIDS
chef-goiardi        0.00%               355.2MiB / 1000MiB   35.52%              323kB / 1.31MB      48.6MB / 82.9MB     7
chef-proxy          0.00%               14.95MiB / 200MiB    7.47%               1.64MB / 735kB      20.3MB / 4.1kB      3
chef-webui          0.07%               92.13MiB / 500MiB    18.43%              1.53kB / 0B         60.1MB / 0B         2
chef-browser        0.04%               76.37MiB / 400MiB    19.09%              1.44kB / 0B         49.5MB / 0B         6

$ docker ps
CONTAINER ID        IMAGE                             COMMAND                  CREATED             STATUS              PORTS                                      NAMES
72b6855d467d        chef-goiardi:latest               "/go/bin/goiardi -..."   16 hours ago        Up 6 minutes        4545/tcp                                   chef-goiardi
520897942db6        rmoriz/openresty_autossl:latest   "/usr/local/openre..."   16 hours ago        Up 6 minutes        0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp   chef-proxy
b589392147a2        chef-webui:latest                 "bundle exec rails..."   17 hours ago        Up 6 minutes        3000/tcp                                   chef-webui
7f3b7abb4585        chef-browser:latest               "./bin/rackup -o 0..."   17 hours ago        Up 6 minutes        9292/tcp                                   chef-browser

ctdk (Owner) commented Jul 24, 2017

Re: 3, the command is indeed knife index rebuild. I'm looking into the general memory usage as well (along with the suddenly more urgent need to get up to speed with Chef 12+ support, of course, and everything else going on) - I haven't forgotten about you all, just busy and not entirely well as usual.

rmoriz (Author) commented Jul 29, 2017

I now have ~20 nodes and goiardi takes 10+ GB of RAM, with very high CPU usage (spikes). nginx is also complaining about timed-out connections to goiardi.

NAME                         CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
chef-goiardi                 42.36%              11.39GiB / 16GiB      71.19%              264MB / 597MB       115kB / 0B          21

Looks like the number of pids grows, too.

ctdk (Owner) commented Jul 31, 2017

Bizarrely, the in-mem index is only using 22.7MB RAM with 50 nodes generated with fauxhai on OS X. This is... odd; I'm pretty sure I've also done this test with goiardi on Linux (I had to make some changes a long time ago to get the index more under control because some folks noticed it was taking up a lot of RAM), but it sure feels like there might be something different going on between OS X and Linux I hadn't noticed before.
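
For anyone wanting to reproduce this kind of measurement: a rough way to gauge in-process memory around an index rebuild is to snapshot runtime.MemStats before and after. This is a generic sketch, not how goiardi reports its numbers, and HeapAlloc/HeapInuse are approximations rather than exact accounting of a single data structure:

```go
package main

import (
	"fmt"
	"runtime"
)

// heapMB returns current heap usage in MiB after forcing a collection, so
// the numbers aren't dominated by collectible garbage.
func heapMB() (allocMB, inuseMB float64) {
	var m runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&m)
	return float64(m.HeapAlloc) / (1 << 20), float64(m.HeapInuse) / (1 << 20)
}

func main() {
	beforeAlloc, beforeInuse := heapMB()

	// Stand-in workload; in practice this would be the index rebuild.
	data := make([][]byte, 0, 1024)
	for i := 0; i < 1024; i++ {
		data = append(data, make([]byte, 32*1024))
	}

	afterAlloc, afterInuse := heapMB()
	fmt.Printf("heap alloc: %.1f -> %.1f MiB, in use: %.1f -> %.1f MiB (%d items held)\n",
		beforeAlloc, afterAlloc, beforeInuse, afterInuse, len(data))
}
```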

One question: did you build the binaries yourself, get the packages from packagecloud.io, or get one of the individual binaries from github?

rmoriz (Author) commented Jul 31, 2017

I'm using the following Dockerfile:

FROM golang:1.8-alpine

# This is set to the most recent goiardi release when the image was built
ENV GOIARDI_VERSION=master

RUN set -ex \
    && apk add --no-cache --virtual .build-deps \
    git \
    mercurial

RUN go get -v -d github.com/ctdk/goiardi \
  && cd /go/src/github.com/ctdk/goiardi \
  && if [ -n "$GOIARDI_VERSION" ]; then git checkout "$GOIARDI_VERSION"; fi \
  && go install github.com/ctdk/goiardi \
  && apk del .build-deps
...

The one thing that is different from regular setups is that my chef-clients don't report all attributes to the server; maybe some attributes that goiardi expects are missing and cause those issues?

A typical client.rb looks like:

automatic_attribute_whitelist ["fqdn", "os", "os_version", "ohai_time", "hostname", "keys/ssh", "roles", "recipes", "ipaddress", "ip6address", "platform", "platform_version", "cloud/", "cloud_v2/", "chef_packages", "pspdfkit_docker", "status", "uptime", "uptime_seconds", "node_name", "plattform", "plattform_family", "plattform_version", "os", "os_version", "ohai/"]
chef_server_url "https://...chef-server.../"
default_attribute_whitelist []
encrypted_data_bag_secret "/etc/chef/encrypted_data_bag_secret"
log_location "/var/log/chef/client.log"
override_attribute_whitelist []
validation_client_name "chef-validator"
verify_api_cert true
# Using default node name (fqdn)



# Do not crash if a handler is missing / not installed yet
begin
rescue NameError => e
  Chef::Log.error e
end

rmoriz (Author) commented Aug 1, 2017

Now around 16GB of RAM is in use, and gob is not able to persist anymore:

2017/08/01 22:23:02 [ERROR] [main.setSaveTicker] gob: encoder: message too big

A restart now loads the old data. The data directory is filled with ds-store###### temp files.
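
The error above is encoding/gob refusing to write a single message past its internal size limit, which suggests the persisted snapshot itself had grown enormous. One way to keep an eye on how large the saved blob is getting - a sketch with made-up types, not goiardi's actual save path - is to encode into a counting writer instead of a file:

```go
package main

import (
	"encoding/gob"
	"fmt"
	"log"
)

// countingWriter tallies bytes without storing them.
type countingWriter struct{ n int64 }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.n += int64(len(p))
	return len(p), nil
}

// snapshot is a stand-in for whatever the data store actually persists.
type snapshot struct {
	Nodes   map[string]map[string]string
	Reports []map[string]string
}

func main() {
	snap := snapshot{
		Nodes:   map[string]map[string]string{"node01": {"fqdn": "node01.example.com"}},
		Reports: []map[string]string{{"status": "success"}},
	}

	var cw countingWriter
	if err := gob.NewEncoder(&cw).Encode(snap); err != nil {
		log.Fatalf("gob encode failed: %v", err)
	}
	fmt.Printf("encoded snapshot: %d bytes\n", cw.n)
}
```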

Unfortunately in a docker setup pprof is not reachable due to the typical default networking:

2017/08/01 22:28:14 [DEBUG] [main] (goiardi.go:(*interceptHandler).ServeHTTP:299) Serving /debug/pprof/heap -- GET
2017/08/01 22:28:14 [DEBUG] [main] (goiardi.go:(*interceptHandler).ServeHTTP:311) remote ip candidates: ra: '172.18.0.1', ''
2017/08/01 22:28:14 [DEBUG] [main] (goiardi.go:(*interceptHandler).ServeHTTP:314) ips now: '"172.18.0.1"' '"<nil>"'
2017/08/01 22:28:14 [DEBUG] [main] (goiardi.go:(*interceptHandler).ServeHTTP:315) local? '%!q(bool=false)' '%!q(bool=false)'
2017/08/01 22:28:14 [DEBUG] [main] (goiardi.go:(*interceptHandler).ServeHTTP:317) blocked 172.18.0.1 (x-forwarded-for: <nil>) from accessing /debug/pprof!

I suggest adding a config option to whitelist a subnet/CIDR for debugging.
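
A sketch of what such an option could look like - the flag and handler names are made up here, this is not goiardi's code - gating /debug/pprof behind a configurable CIDR, e.g. the Docker bridge network that shows up as 172.18.0.1 in the log above:

```go
package main

import (
	"log"
	"net"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

// cidrAllow rejects clients whose remote address is outside the allowed network.
func cidrAllow(allowed *net.IPNet, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		host, _, err := net.SplitHostPort(r.RemoteAddr)
		ip := net.ParseIP(host)
		if err != nil || ip == nil || !allowed.Contains(ip) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Hypothetical config value, e.g. --pprof-whitelist=172.18.0.0/16.
	_, allowed, err := net.ParseCIDR("172.18.0.0/16")
	if err != nil {
		log.Fatal(err)
	}

	mux := http.NewServeMux()
	// Route /debug/ to the default mux (where pprof registered itself) only
	// after the CIDR check passes.
	mux.Handle("/debug/", cidrAllow(allowed, http.DefaultServeMux))
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n")) // stand-in for the real application handler
	})
	log.Fatal(http.ListenAndServe(":6060", mux))
}
```

The auth issue in Update 1 below is the other half of the problem: even with the IP allowed, the authentication layer would still need to exempt these paths.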

Update 1

When goiardi runs in use-auth mode, the pprof endpoints require Chef authentication, so they won't work even from localhost.

Update 2

I was able to get a heap dump, but I wasn't able to get any useful information out of it (sorry, Go noob here).

I copied the data and index files to my macOS workstation and was able to export them (700MB of JSON); however, an import with empty data and index files fails:

➜  goiardi-test goiardi -i data/goiardi-index.bin   -D data/goiardi-data.bin -g debug -m foo
2017/08/02 04:04:41 Logging at debug level
2017/08/02 04:04:41 [INFO] [github.com/rmoriz/goiardi/vendor/github.com/ctdk/goiardi/config] Trimming values in search index disabled
2017/08/02 04:04:41 [WARNING] [github.com/rmoriz/goiardi/vendor/github.com/ctdk/goiardi/config] index-val-trim's default behavior when not set or set to 0 is to disable search index value trimming; this behavior will change with the next goiardi release
Importing data from foo....
2017/08/02 04:04:51 [DEBUG] [github.com/rmoriz/goiardi/vendor/github.com/ctdk/goiardi/indexer] (file_index.go:(*FileIndex).Save:668) Index has changed, saving to disk
2017/08/02 04:05:02 [INFO] [main] Importing data, version 1.1 created on 2017-08-02 03:55:17.017145184 +0200 CEST
2017/08/02 04:05:02 [INFO] [main] Loading clients
2017/08/02 04:05:02 [INFO] [main] Loading users
2017/08/02 04:05:02 [INFO] [main] Loading filestore
2017/08/02 04:05:02 [INFO] [main] Loading cookbooks
2017/08/02 04:05:02 [INFO] [main] Loading data bags
2017/08/02 04:05:02 [INFO] [main] Loading environments
2017/08/02 04:05:02 [INFO] [main] Loading nodes
2017/08/02 04:05:02 [INFO] [main] Loading roles
2017/08/02 04:05:02 [INFO] [main] Loading sandboxes
2017/08/02 04:05:02 [INFO] [main] Loading loginfo
panic: interface conversion: interface {} is json.Number, not float64

goroutine 1 [running]:
github.com/rmoriz/goiardi/vendor/github.com/ctdk/goiardi/loginfo.Import(0xc42104c8a0, 0xc42007cc30, 0x18187bd)
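
That panic is the classic symptom of JSON being decoded with json.Decoder.UseNumber() and a value then being type-asserted straight to float64. A defensive sketch of how an importer can accept both representations (this is not the actual loginfo.Import code):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// asInt64 converts a decoded JSON value to int64 whether it arrived as a
// float64 (plain json.Unmarshal) or a json.Number (decoder with UseNumber).
func asInt64(v interface{}) (int64, error) {
	switch n := v.(type) {
	case float64:
		return int64(n), nil
	case json.Number:
		return n.Int64()
	default:
		return 0, fmt.Errorf("unexpected numeric type %T", v)
	}
}

func main() {
	dec := json.NewDecoder(strings.NewReader(`{"id": 3908923}`))
	dec.UseNumber() // numbers decode as json.Number, not float64

	var row map[string]interface{}
	if err := dec.Decode(&row); err != nil {
		panic(err)
	}
	id, err := asInt64(row["id"])
	fmt.Println(id, err) // 3908923 <nil>
}
```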

Update 3
Node attributes are modest:

$ for i in $(knife node list); do knife node show -l -F json $i > $i.json; done
$ ls -1 *.json|wc -l
27
$  du -hc *.json
12K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
40K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
12K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
24K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
20K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
12K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
80K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
16K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
12K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
24K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
36K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
60K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
36K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
8,0K	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.json
496K	total

Theory

I suspect a leak in the gob/in-memory data store code that gets persisted and reloaded over time. I'll set up a new instance with the same data but the postgres backend; maybe that will confirm the suspicion.

ctdk (Owner) commented Aug 2, 2017

I was just thinking this morning that there might be a problem with the memory data store. I'm still working on duplicating your issue, but I suspect that's the culprit.

rmoriz (Author) commented Aug 3, 2017

I've just moved the setup to postgres and resource consumption is amazingly small (same data/nodes/cookbooks).

NAME           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
goiardi        7.77%               1.456GiB / 32GiB      4.55%               65.5MB / 28MB       0B / 11.8MB         16
postgres       5.71%               117.4MiB / 4GiB       2.87%               17.1MB / 49.2MB     0B / 178MB          10

IIRC, even right after the initial import into the in-memory store, the total memory usage was higher than the combined usage of goiardi and pg at this point 😲

ctdk (Owner) commented Aug 3, 2017

I've pushed up a branch with some helpful changes for this at https://github.com/ctdk/goiardi/tree/mem-bonkers-tmp. It adds an argument to whitelist IPs for /debug/pprof and fixes it so you don't need to be authenticated to use it. (That might be a regression.)

Looking at the pprof output, I think I've found some good candidates for the problem. I believe it's something to do with the data store (which you already pointed out), but I have a suspicion that it's something to do with the event log specifically. I haven't used the event log much with the in-memory mode, except to make sure it appeared to work. (When I do use it it's with Postgres, generally, since that's the main mode I use.)
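
On the event-log suspicion: log-event-keep is meant to bound the in-memory log, and the failure mode would be the log (or whatever references it) growing anyway between purges. A generic sketch of the bounded-log behavior, not goiardi's implementation:

```go
package main

import "fmt"

type event struct {
	ID   int
	Info string
}

type eventLog struct {
	keep   int
	events []event
}

// add appends an event and trims the log to the most recent `keep` entries,
// copying into a fresh slice so dropped entries can actually be collected
// instead of lingering in the old backing array.
func (l *eventLog) add(e event) {
	l.events = append(l.events, e)
	if over := len(l.events) - l.keep; over > 0 {
		trimmed := make([]event, l.keep)
		copy(trimmed, l.events[over:])
		l.events = trimmed
	}
}

func main() {
	l := &eventLog{keep: 100}
	for i := 0; i < 250; i++ {
		l.add(event{ID: i, Info: "node run"})
	}
	fmt.Println(len(l.events), l.events[0].ID) // 100 150
}
```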

One question with moving that setup to postgres: are you still using the in-mem search, or is that using the postgres search?

rmoriz (Author) commented Aug 3, 2017

I switched to pg search:

env [
  'GOIARDI_INDEX_VAL_TRIM=100',
  'GOIARDI_PROXY_HOSTNAME=chef',
  'GOIARDI_HOSTNAME=chef-goiardi',
  'GOIARDI_PROXY_PORT=443',
  'GOIARDI_HTTPS_URLS=https://..../',
  'GOIARDI_USE_SERF=true',
  'GOIARDI_SERF_ADDR=...:7373',
  'GOIARDI_USE_SHOVEY=true',
  'GOIARDI_SIGN_PRIV_KEY=/etc/goiardi/shovey.key',
  'GOIARDI_LOG_LEVEL=debug',
  'GOIARDI_USE_POSTGRESQL=1',
  'GOIARDI_POSTGRESQL_USERNAME=goiardi',
  'GOIARDI_POSTGRESQL_PASSWORD=pw',
  'GOIARDI_POSTGRESQL_DBNAME=db',
  'GOIARDI_POSTGRESQL_HOST=host',
  'GOIARDI_POSTGRESQL_SSL_MODE=disable',
  'GOIARDI_PG_SEARCH=1',
  'GOIARDI_DB_POOL_SIZE=4',
  'GOIARDI_MAX_CONN=12',
]

ctdk (Owner) commented Aug 3, 2017

That's cool (and definitely what I would do - I've found the pg search to be superior); I asked because I was curious if the mem usage you were reporting included the search or not.

ctdk (Owner) commented Aug 3, 2017

I've made progress on fixing the search query parser to deal with the negated range queries too, but am still being pulled in 800 directions all the time. :-/ I'm getting closer to getting them all wrapped up at least.

rmoriz (Author) commented Aug 4, 2017

We don't use complicated search terms - usually we just search for data bags and nodes (based on "positive" assertions on fqdn, tags, roles, or attributes).

I like to keep things as simple as possible, because most complicated Chef setups tend to fail (or end up in an unmaintainable rabbit hole) - at least that's my experience after ~7 years.

ctdk (Owner) commented Aug 4, 2017

No argument from me there -- when I was refactoring the search to fix the broken NOT queries, I looked at NOT range and a) could find no evidence it was used anywhere, at least from a cursory glance and a run against chef-pedant, and b) it seemed iffy whether Lucene/Solr even supported that particular query*.

Anecdotally, upwards of 95% of chef search queries are extremely basic. I've never felt that Solr was the right choice for chef search (like, it's both massive overkill and not particularly well suited to this kind of indexing), which was my motivation for coming up with a couple of replacements. Still, I'm only one guy with other things going on as well, and I'll admit that I've kind of neglected the in-memory aspect of goiardi in favor of the postgres part. I've even been considering dropping MySQL from 1.0.0 when that's ready, but I haven't decided yet.

  * The Lucene docs weren't especially clear on it, at least from what I was able to gather. However, there are also multiple query parsers available, so while as far as I could tell the default query parser didn't support NOT [ ..... range ..... ], either I was mistaken or chef server uses a different parser.

rmoriz (Author) commented Aug 17, 2017

So… now I have an issue with the pg search ;)

Ohai by default provides a couple of automatic attributes per node, e.g.

{
  "cloud_v2": {
     "provider": "digital_ocean"
  }
}

However, the search does not find any:

➜   knife search node 'provider:*' -i
0 items found

# versus chef-zero with local data
➜   knife search node 'provider:*' -iz
5 items found
node01.domain.com
node02.domain.com
node03.domain.com
node04.domain.com
node05.domain.com

Any idea?

knife node show node01.domain.com -l does show the attribute and its values.

Env/config:

'GOIARDI_INDEX_VAL_TRIM=100',
'GOIARDI_PROXY_HOSTNAME=...',
'GOIARDI_HOSTNAME=...',
'GOIARDI_PROXY_PORT=443',
'GOIARDI_HTTPS_URLS=https://.../',
'GOIARDI_USE_SERF=true',
'GOIARDI_SERF_ADDR=...:7373',
'GOIARDI_USE_SHOVEY=true',
'GOIARDI_SIGN_PRIV_KEY=/etc/goiardi/shovey.key',
'GOIARDI_LOG_LEVEL=debug',
'GOIARDI_USE_POSTGRESQL=1',
'GOIARDI_POSTGRESQL_USERNAME=goiardi',
'GOIARDI_POSTGRESQL_PASSWORD=...',
'GOIARDI_POSTGRESQL_DBNAME=goiardi',
'GOIARDI_POSTGRESQL_HOST=...',
'GOIARDI_POSTGRESQL_SSL_MODE=disable',
'GOIARDI_CONVERT_SEARCH=true',
'GOIARDI_PG_SEARCH=1',
'GOIARDI_DB_POOL_SIZE=4',
'GOIARDI_MAX_CONN=12',

Update:

Okay, I forgot to change the separator. This works:

knife search node 'cloud.v2.provider*' -i
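
For anyone else hitting this: from the behavior above, the pg search keys values by their full flattened attribute path, while chef-zero evidently also matches the bare leaf key, which is why provider:* found nodes there. An illustrative sketch of the difference (the separator and key naming here are made up):

```go
package main

import "fmt"

// flatten walks a nested attribute map and records one entry per leaf under
// its full joined path.
func flatten(prefix string, attrs map[string]interface{}, sep string, out map[string]string) {
	for k, v := range attrs {
		key := k
		if prefix != "" {
			key = prefix + sep + k
		}
		if child, ok := v.(map[string]interface{}); ok {
			flatten(key, child, sep, out)
		} else {
			out[key] = fmt.Sprint(v)
		}
	}
}

func main() {
	node := map[string]interface{}{
		"cloud_v2": map[string]interface{}{"provider": "digital_ocean"},
	}
	index := map[string]string{}
	flatten("", node, "_", index)

	for _, query := range []string{"provider", "cloud_v2_provider"} {
		_, hit := index[query]
		fmt.Printf("%-20s matched: %v\n", query, hit)
	}
	// Only the full flattened path matches; the bare leaf key finds nothing
	// unless the indexer adds it as a separate entry.
}
```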

rmoriz (Author) commented Aug 17, 2017

Looks like the indexing of array attributes into the search index does not work properly:

goiardi=# select * from search_items where path = 'corp.docker.images';
   id    | organization_id | search_collection_id |               item_name                |                                                    value                                                    |          path          
---------+-----------------+----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------+------------------------
 3908923 |               1 |                    3 | docker.corp.com             | map\[Containers:-1 Labels:map\[\] SharedSize:-1 VirtualSize:19636526 id:sha256:xxxxxx     | corp.docker.images

The attribute is an array created by a custom ohai plugin that just queries the local dockerd about all images, containers, and volumes:

"automatic": {
  "corp_docker": {
    "containers": [
      {
      ...
      },
      {
      ...
      }
    ],
    "images": [
      {
        "Containers": -1,
        "Created": 1502464017,
        "Labels": {

        },
        "ParentId": "",
        "RepoDigests": [
          "hub.docker.corp/app1@sha256:...."
        ],
        "RepoTags": [
          "hub.docker.corp/app1:unstable"
        ],
        "SharedSize": -1,
        "Size": 1004576564,
        "VirtualSize": 1004576564,
        "id": "sha256:..."
      },
      {
        "Containers": -1,
        "Created": 1502464017,
        "Labels": {

        },
        "ParentId": "",
        "RepoDigests": [
          "hub.docker.corp/app2@sha256:...."
        ],
        "RepoTags": [
          "hub.docker.corp/app2:unstable"
        ],
        "SharedSize": -1,
        "Size": 1004576564,
        "VirtualSize": 1004576564,
        "id": "sha256:..."
      },
      ...
    ]
  }
}

We have one case where we use this information to find all nodes that have a specific image deployed and/or are running a container built from that image.

knife search node 'corp.docker.images:RepoTags:*app1*' -l

Guess I'll have to refactor our plugin to just dump the data in hashes into ohai/chef…
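
The value column in the search_items row above ("map[Containers:-1 Labels:map[] ...]") is what you get when a whole Go map is stringified rather than walked, which matches the symptom: nothing inside the array elements is individually searchable. A sketch of the intended behavior - not goiardi's code - recursing into slices and into maps inside slices, emitting one flattened entry per leaf:

```go
package main

import "fmt"

// walk recurses through nested maps and slices, collecting one flattened
// path per leaf value instead of stringifying whole maps.
func walk(path string, v interface{}, out map[string][]string) {
	switch val := v.(type) {
	case map[string]interface{}:
		for k, child := range val {
			walk(join(path, k), child, out)
		}
	case []interface{}:
		for _, child := range val {
			// Slice elements share the parent path, so every map inside the
			// array contributes its own leaf entries.
			walk(path, child, out)
		}
	default:
		out[path] = append(out[path], fmt.Sprint(val))
	}
}

func join(prefix, key string) string {
	if prefix == "" {
		return key
	}
	return prefix + "." + key
}

func main() {
	node := map[string]interface{}{
		"corp_docker": map[string]interface{}{
			"images": []interface{}{
				map[string]interface{}{"RepoTags": []interface{}{"hub.docker.corp/app1:unstable"}, "Size": 1004576564},
				map[string]interface{}{"RepoTags": []interface{}{"hub.docker.corp/app2:unstable"}, "Size": 19636526},
			},
		},
	}
	out := map[string][]string{}
	walk("", node, out)
	fmt.Println(out["corp_docker.images.RepoTags"])
	// [hub.docker.corp/app1:unstable hub.docker.corp/app2:unstable]
}
```

With per-leaf entries like that, a query along the lines of the RepoTags search above has concrete values to match against.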

ctdk (Owner) commented Aug 17, 2017

Thanks for the heads up on that being broken too. I think this is a regression, although from when I'm not sure yet. I do know where it would be coming from though.

ctdk (Owner) commented Aug 18, 2017

It just occurred to me that the arrays not being broken down correctly might be the source of the ridiculous memory usage (and why I wasn't able to duplicate it - I was creating large numbers of fauxhai nodes, which IIRC don't have a lot of that kind of data). I'll run some more tests for that.

ctdk (Owner) commented Aug 23, 2017

I have some changes brewing in the mem-bonkers-tmp branch that will fix the problem with the map slices not being indexed correctly (it's working for basic cases, but I still need to handle some of the more obscure possibilities), and I strongly suspect this will also fix the ridiculous memory usage you were seeing.

Thanks again for being such a thorough tester, and thanks for your patience.

rmoriz (Author) commented Aug 24, 2017

Thanks! I only have 2 days a week for ops work, so I'm sorry my response time is high.

FYI: here's an example node JSON for one of our nodes: https://gist.github.com/rmoriz/b1e3e335eb95ef268c42502c3c5f4b78 - we are already stripping out most ohai data using a custom client.rb configuration:

automatic_attribute_whitelist ["fqdn", "os", "os_version", "ohai_time", "hostname", "keys/ssh", "roles", "recipes", "ipaddress", "ip6address", "platform", "platform_version", "cloud/", "cloud_v2/", "chef_packages", "example-app_docker", "status", "uptime", "uptime_seconds", "node_name", "plattform", "plattform_family", "plattform_version", "os", "os_version", "ohai/"]

The example-app_docker attribute is produced by a custom ohai plugin that queries dockerd for volumes, images, containers, etc.

ctdk (Owner) commented Sep 1, 2017

It still needs cleanup, and the unicode changes for the parser need to be added back in, but the mem-bonkers-tmp branch has some pretty significant internal changes that should improve memory usage considerably. There's still room for improvement, but just between the previous commit and this one, memory usage after reindexing about 40 fauxhai nodes with complicated slices of maps (using postgres search) went from ~300MB RAM to about 60MB. In-mem had some pretty big improvements as well.

Before it's all merged, I ought to get some actual numbers together for it.

ctdk (Owner) commented Sep 12, 2017

The https://github.com/ctdk/goiardi/tree/0.11.6-merging branch has the fixes for this, with (from informal testing so far) vastly better memory usage, both in general and with arrays of maps or arrays (which it now also handles correctly), along with the whitelisting for /debug/pprof and such.

@rmoriz It also has the serf reconnect fix. I just need to run the various chef-pedant tests, and maybe add some more golang tests to test the slices of maps. I'm planning on also getting some better numbers for memory usage.

ctdk (Owner) commented Jan 28, 2018

Merged and released.

ctdk closed this as completed Jan 28, 2018
rmoriz (Author) commented Apr 14, 2018

Just a follow-up on this issue, which is now solved:

I'm still running my personal goiardi instance (~10 nodes) with the gob backend (as a docker container), and it crashed regularly due to low memory (the VM has 2GB RAM + 1GB SSD swap), while another instance with the pg backend and ~50 nodes has worked fine for months.

I replicated everything locally on my Mac using docker and added some debug output to show the number of various objects in the data store. The number of reports was at 65xxx - just setting the purge option I had obviously missed originally didn't help, because the purge only runs every 2 hours, and by then the container is far past OOM and has been restarted.

I changed my local instance to purge every minute, waited until the reports were purged and written to disk, then copied goiardi-data.bin and goiardi-index.bin back to the VM and redeployed the container with GOIARDI_PURGE_REPORTS_AFTER=24h set, to prevent running into the same issue again.

After the purge I had to restart goiardi/the container in both cases to regain memory; not sure why. It's now running at around 300MB, down from over 2GB. This also cleaned up lots of orphaned temporary index and data files (I assume the OOM kill happened before the stream was fully persisted).
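
A minimal sketch of the timing issue (the names here are hypothetical, not goiardi's code): if reports are only purged on a fixed 2-hour schedule, a small container can hit its memory limit before the first purge ever fires, so purging at startup and on a shorter, configurable interval closes that window:

```go
package main

import (
	"fmt"
	"time"
)

type report struct {
	RunID   string
	EndTime time.Time
}

// purgeOlderThan drops reports that ended before now-keep and returns how
// many were removed.
func purgeOlderThan(reports []report, keep time.Duration, now time.Time) ([]report, int) {
	cutoff := now.Add(-keep)
	kept := reports[:0]
	for _, r := range reports {
		if r.EndTime.After(cutoff) {
			kept = append(kept, r)
		}
	}
	return kept, len(reports) - len(kept)
}

func main() {
	now := time.Now()
	reports := []report{
		{RunID: "old", EndTime: now.Add(-48 * time.Hour)},
		{RunID: "recent", EndTime: now.Add(-time.Hour)},
	}

	// Run once at startup, then periodically (e.g. every minute via a
	// time.Ticker) with the retention from GOIARDI_PURGE_REPORTS_AFTER.
	reports, removed := purgeOlderThan(reports, 24*time.Hour, now)
	fmt.Printf("removed %d, kept %d\n", removed, len(reports))
}
```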

There is still an unresolved issue with orphaned sandboxes (127 in my case), but I guess they don't take up that much memory. With the pg instance I have a cron job to clean up old data via SQL; that part is missing with gob.

ctdk (Owner) commented Apr 24, 2018

I haven't forgotten about the sandbox thing, I just haven't gotten an answer yet on how long they need to hang around. I've asked again, though, and hopefully I'll get an answer this time.

As an aside, I'm also starting to re-evaluate the storage backends: whether it's really worthwhile to keep the in-mem/file-backed storage (and the MySQL one, honestly), whether to tear out in-mem and totally redo it, or whether to just settle on Postgres for everything and provide a more convenient docker-compose config to run everything together. The in-mem search, while I'm quite proud of what I did there and it has improved a lot thanks to everyone's bug reports, is still kind of a memory hog.

I'm certainly open to suggestions on what I should do here. I'm still slowly moving along on the last strange problems for basic Chef Server 12 behavior, although it's constrained by day job stuff and how well I'm feeling on a given day - if I'm worn out and whatever ails me is teeing off on me more than usual, I'm not going to be real productive while riding the train home.

ctdk (Owner) commented May 1, 2018

Good news, @rmoriz. I finally got an answer on the sandbox thing, and they are purged. (Apparently back in the old days it happened less often than it does now or something, but they fixed that up; it certainly wasn't obvious when I was digging around in the old 10.x server code.) I'll get purging of old sandboxes turned around pretty soon - the in-mem purge may take longer, but the SQL ones won't. Thanks for your patience.
