
fatal error: concurrent map read and map write #11716

Closed

effofxprime opened this issue Apr 21, 2022 · 10 comments

Comments

@effofxprime

effofxprime commented Apr 21, 2022

Summary of Bug

I am running a full archival RPC node. https://rpc.erialos.me, https://trpc.rpc.erialos.me, https://grpc.rpc.erialos.me, https://rest.rpc.erialos.me.

I am also running a Ping.Pub explorer (on another server) that calls my RPC to populate the explorer data:
https://explorer.erialos.me
If I refresh the explorer enough times, I will wind up with a fatal error:

fatal error: concurrent map read and map write

I've been having this issue since I started my RPC node. I have started to wonder if it has something to do with my LVM configuration for the drive mounted to store the blockchain data: it's a six-drive striped LVM, and I wonder if some performance hangup within LVM might be causing this, even though the node doesn't otherwise show a performance problem. I don't actually suspect this is the issue, but it has been going on long enough that I'm trying to consider all possibilities.

I have also thought that maybe it was the version of the Cosmos SDK our chain is using. There are a few closed bug reports noting this error that appear to have been resolved: https://github.com/cosmos/cosmos-sdk/issues?q=is%3Aissue+is%3Aclosed+concurrent+map+read+and+map+write
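
For context on the message itself: it comes from the Go runtime, which aborts the process when a map is read in one goroutine while another goroutine writes to it without synchronization. A minimal, purely illustrative Go sketch (not SDK or Tendermint code) that triggers the same fatal error:

package main

// Illustration only: an unsynchronized map shared between a writer goroutine
// and a reader. The Go runtime detects the race and aborts with
// "fatal error: concurrent map read and map write".
func main() {
        m := map[string]int{"height": 0}

        go func() {
                for i := 0; ; i++ {
                        m["height"] = i // concurrent writer
                }
        }()

        for {
                _ = m["height"] // concurrent reader; crashes shortly after start
        }
}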

Version

vidulum@rpc:~$ vidulumd tendermint version
tendermint: 0.34.12
abci: 0.17.0
blockprotocol: 11
p2pprotocol: 8

Cosmos SDK v0.44.0

Consensus Dump:
https://trpc.rpc.erialos.me/dump_consensus_state?

Steps to Reproduce

Refresh explorer enough times until RPC crashes.
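
A more direct way to reproduce, without the explorer in the loop, is to fire concurrent queries at the node yourself. A rough Go sketch of that kind of load (the endpoint and request counts are placeholders; any busy RPC path should do):

package main

// Hammers the Tendermint RPC with concurrent /status requests, roughly what
// repeated explorer refreshes do. The URL and counts are placeholders.
import (
        "fmt"
        "net/http"
        "sync"
)

func main() {
        const url = "https://trpc.rpc.erialos.me/status"
        var wg sync.WaitGroup

        for round := 0; round < 100; round++ {
                for i := 0; i < 50; i++ { // 50 concurrent requests per round
                        wg.Add(1)
                        go func() {
                                defer wg.Done()
                                resp, err := http.Get(url)
                                if err != nil {
                                        fmt.Println("request error:", err)
                                        return
                                }
                                resp.Body.Close()
                        }()
                }
                wg.Wait()
        }
}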

Complete Crash Log

https://pastebin.com/ZdDtHwM4


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@alexanderbez
Contributor

@marbar3778 I can't recall, didn't we address the fatal error: concurrent map read and map write issue?

@alexanderbez
Contributor

#11114

@tac0turtle
Member

Not sure if this was backported to 0.44, but 0.45 has this solved.

@effofxprime
Author

I wasn't sure from the bug search I did on concurrent map read/write whether mine was the same type of concurrent map read/write error.

Assuming that it is, would my only option at this time be to wait for a chain upgrade to have this resolved? Or is there a patch for 0.44 that I could apply?

@alexanderbez
Contributor

I'm not sure how much effort we're going to be putting into maintaining 0.44.x anyway. Is there a reason you can't bump to 0.45?

@effofxprime
Author

I am fine with bumping to 0.45, but I do not maintain the Vidulum chain. AFAIK, wouldn't the chain have to update as a whole for me to be able to update to cosmos sdk 0.45?

I can also wait for our upgrade; speaking with the chain's lead dev, I think we will be looking at doing so soon (probably on a timeframe of months).

Mostly I wanted to make sure that the concurrent fatal error I was hitting was the same as the earlier ones and had already been solved upstream.

@alexanderbez
Contributor

wouldn't the chain have to update as a whole for me to be able to update to cosmos sdk 0.45?

Yes :) but in that case, you might as well upgrade to 0.46

@effofxprime
Author

Great. I'm going to recommend this to the lead dev and just inform everyone that it will be an issue until we update.
I appreciate your response, Alexander. Have a great day.

@alexanderbez
Contributor

My pleasure!

@effofxprime
Author

I have, in essence, resolved the concurrent map read/write errors via nginx.
Looking into how the root problem was found, it appears to happen when the service is pushed to do more, faster.
I've created connection limits for the Tendermint RPC (trpc) and gRPC endpoints.

At the top of your site-specific nginx conf, define the following:

# Create a rate limiting zone and apply to all endpoints
# limit_req zone=request burst=21 delay=3; <-- goes in `server/location` directives
limit_req_zone $binary_remote_addr zone=request:100m rate=33r/s;

# Limit number of connections
# limit_conn servers 100; <-- place in server directives
limit_conn_zone $server_name zone=servers:100m;

# Limit bandwidth: Use the following directives inside a location or server block.
# limit_conn servers 100;
# limit_rate 500k;
#
# ALT way:
# limit_rate_after 3m;
# limit_rate $response_rate;

## DYNAMIC BANDWIDTH CONTROL
## https://docs.nginx.com/nginx/admin-guide/security-controls/controlling-access-proxied-http/
##
map $ssl_protocol $response_rate {
    "TLSv1.1" 10k;
    "TLSv1.2" 100k;
    "TLSv1.3" 500k;
}

# Proper headers mapping for websocket endpoint
map $http_upgrade $connection_upgrade {
        default Upgrade;
        '' close;
}

My gRPC configuration is as follows:

upstream grpc-proxy {
        server 127.0.0.1:9090 weight=3 max_conns=10 max_fails=30 fail_timeout=360s;
        keepalive 60;
}

upstream grpc-web-proxy {
        server 127.0.0.1:9091 weight=3 max_conns=10 max_fails=30 fail_timeout=360s;
        keepalive 60;
}

server {
        listen 443 ssl http2;

        listen 194.195.209.168:9090 ssl;
            listen 194.195.209.168:9091 ssl;
        server_name grpc.rpc.erialos.me;

        limit_conn servers 50;
        limit_req zone=request burst=10 delay=3;
        limit_rate_after 3m;
        limit_rate $response_rate;


        # Enable logging, set log format
        error_log /var/log/nginx/9090-error.log;

        ssl_certificate_key     /etc/letsencrypt/live/rpc.erialos.me/privkey.pem;
        ssl_certificate         /etc/letsencrypt/live/rpc.erialos.me/fullchain.pem;
        ssl_protocols TLSv1.2 TLSv1.3;

        location / {
                grpc_pass grpc://grpc-proxy;
        }
        location  /websocket {
                grpc_pass grpc://grpc-web-proxy;

        }

## reimplement later
#        error_page 497 = @foobar;
#
#        location @foobar {
#            return 301 https://$host:$server_port$request_uri;
#        }


}

Two areas of importance here. First, these settings in the server directive:

        limit_conn servers 50;                   # cap concurrent connections counted in the "servers" zone
        limit_req zone=request burst=10 delay=3; # throttle the request rate using the "request" zone
        limit_rate_after 3m;                     # start limiting bandwidth after the first 3 megabytes
        limit_rate $response_rate;               # per-connection rate taken from the $ssl_protocol map above

And second, these parameters on the upstream server line:

weight=3 max_conns=10 max_fails=30 fail_timeout=360s

Here max_conns caps the simultaneous connections nginx will open to the node, and max_fails/fail_timeout take the upstream out of rotation after repeated failures.

I have not experimented with increasing the connection limits yet to see where it fails, but I figure this will give most people a head start if, like us, they're still on 0.44.x.

Lastly, here is the current trpc section of my nginx config too. I notice a lot of people struggle with a proper websocket proxy.

upstream tendermint_rpc {
        server unix:///dev/shm/vidulum/trpc.socket weight=3 max_conns=50 max_fails=30 fail_timeout=360s;
        keepalive 60;
}

server {
        listen 443 ssl;
        listen 194.195.209.168:26657 ssl;

        server_name trpc.rpc.erialos.me;

        limit_conn servers 50;
        limit_req zone=request burst=10 delay=3;
        limit_rate_after 3m;
        limit_rate $response_rate;


        error_log /var/log/nginx/26657_error.log;

        ssl_certificate_key         /etc/letsencrypt/live/rpc.erialos.me/privkey.pem;
        ssl_certificate             /etc/letsencrypt/live/rpc.erialos.me/fullchain.pem;

        location / {
            #proxy_headers_hash_bucket_size 128;
            proxy_pass http://tendermint_rpc;


            #ssl certs
            #proxy_ssl_trusted_certificate   /etc/letsencrypt/archive/rpc.erialos.me/chain2.pem;
            #proxy_ssl_certificate           /etc/letsencrypt/archive/rpc.erialos.me/privkey2.pem;
            #proxy_ssl_certificate_key       /etc/letsencrypt/archive/rpc.erialos.me/fullchain2.pem;

            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;

            proxy_set_header Host $http_host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            # $scheme was $http/$https
            proxy_set_header X-Forwarded-Host $http_host;
            proxy_set_header X-Forwarded-Port $server_port;
            #
            # These proxy settings were required for tendermint rpc to work/display properly via web
            #
            proxy_hide_header   Access-Control-Allow-Origin;
            proxy_set_header           Connection "Keep-Alive";

            #
            # Pingpub CORS proxy
            #
            proxy_set_header Access-Control-Allow-Origin *;
            proxy_set_header Access-Control-Max-Age 3600;
            proxy_set_header Access-Control-Expose-Headers Content-Length;

            #proxy_ssl_protocols TLSv1.2 TLSv1.3;
            #proxy_ssl_server_name on;
            #proxy_ssl_verify    on;
            #proxy_ssl_verify_depth  2;



        }

        location /websocket {
            #proxy_headers_hash_bucket_size 128;
            #proxy_buffering off;
            proxy_pass http://tendermint_rpc/websocket;


            #ssl certs
            #proxy_ssl_certificate           /etc/letsencrypt/archive/rpc.erialos.me/privkey2.pem;
            #proxy_ssl_certificate_key       /etc/letsencrypt/archive/rpc.erialos.me/fullchain2.pem;
            #proxy_ssl_verify    on;
            #proxy_ssl_verify_depth  2;
            #proxy_hide_header   Access-Control-Allow-Origin;
            #proxy_set_header           Connection "Keep-Alive";

            #
            # Pingpub CORS proxy
            #
            #proxy_set_header Access-Control-Allow-Origin *;
            #proxy_set_header Access-Control-Max-Age 3600;
            #proxy_set_header Access-Control-Expose-Headers Content-Length;


            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;

            proxy_set_header Host $http_host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            #was https for the forwarded proto
            proxy_set_header X-Forwarded-Host $http_host;
            proxy_set_header X-Forwarded-Port $server_port;

        }

#        error_page 497 = @foobar;
#
#        location @foobar {
#            return 301 https://$host:$server_port$request_uri;
#        }

}
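
To sanity-check that the /websocket location proxies the upgrade correctly, a small client can subscribe to NewBlock events through the proxy. A sketch using github.com/gorilla/websocket (the URL is my endpoint; substitute your own):

package main

// Connects through the nginx proxy and subscribes to NewBlock events via the
// Tendermint JSON-RPC websocket. If events arrive, the Upgrade/Connection
// headers above are being passed through correctly.
import (
        "fmt"
        "log"

        "github.com/gorilla/websocket"
)

func main() {
        conn, _, err := websocket.DefaultDialer.Dial("wss://trpc.rpc.erialos.me/websocket", nil)
        if err != nil {
                log.Fatal("dial: ", err)
        }
        defer conn.Close()

        // Tendermint JSON-RPC subscription request
        sub := map[string]interface{}{
                "jsonrpc": "2.0",
                "method":  "subscribe",
                "id":      1,
                "params":  map[string]string{"query": "tm.event='NewBlock'"},
        }
        if err := conn.WriteJSON(sub); err != nil {
                log.Fatal("subscribe: ", err)
        }

        // Print the first few messages; new blocks should show up every few seconds.
        for i := 0; i < 3; i++ {
                _, msg, err := conn.ReadMessage()
                if err != nil {
                        log.Fatal("read: ", err)
                }
                fmt.Println(string(msg))
        }
}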
