
Multipath not restored after interface is restarted, OMR becomes single path TCP router. #2936

Closed
ioogithub opened this issue Aug 24, 2023 · 60 comments

Comments

@ioogithub

ioogithub commented Aug 24, 2023

Expected Behavior

Multipath should be restored to the same state after omr-tracker stops and starts an interface, with all WAN connections in use again.

Current Behavior

Multipath is broken after omr-tracker restarts an interface: traffic now only uses one WAN, and OpenMPTCProuter becomes OpenSPTCProuter (a single-path TCP router).

Steps to Reproduce the Problem

  1. Port forwarding is using v2ray and is working as expected; traffic uses both wan1 and wan2 interfaces.
  2. Start a file upload.
  3. Observe on the bandwidth page that OMR is correctly using all WAN connections (wan1 and wan2):
    https://ibb.co/2YZDzBB
  4. OMR-tracker switches the wan2 interface off (due to a ping or check error):
Thu Aug 24 17:54:56 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched off because check error and ping from 10.0.0.202 error (9.9.9.9,1.0.0.1,114.114.115.115)
Thu Aug 24 17:54:56 2023 user.notice post-tracking-post-tracking: Delete default route to x.x.x.x via y.y.y.y dev eth2
  5. OMR-tracker switches the wan2 interface back on:
Thu Aug 24 18:01:45 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: Interface route not yet set, set route ip r add default via y.y.y.y dev eth2 metric 4
Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: New public ip detected for wan2 (eth2): x.x.x.x
Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: Reload MPTCP for eth2
  6. Start a new upload; traffic now only uses wan1, and the router is no longer multi-path:
    https://ibb.co/94QZJLq

  7. The router never recovers from this state. Start another upload 1 hour later and it still only uses wan1.

  8. ip r shows there are routes and default routes for both wan1 and wan2, but MPTCP refuses to use wan2 after OMR-tracker restarts it.

Possible Solution 1

I tried two steps to fix the problem:

  1. First I tried resetting using this command: /etc/init.d/openmptcprouter-vps restart
  2. When I watch logread -f on OMR and journalctl -f on the VPS I do not see any log events after this command! The command executes and exits, but it does not do anything observable in the logs.

Possible Solution 2:

  1. Click Save and Apply from the wizard page.
  2. This ultimately fixes the problem and traffic starts using wan1 and wan2 again; however, this is not a good solution because it severely disrupts the network and requires manual intervention.
  3. I can see from the log above:
    Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: Reload MPTCP for eth2
    I guess this is where the bug is: Reload MPTCP is not properly restoring the MPTCP bond. Is there a way to get more information on what MPTCP is doing here? Is there any debug mode?
  • The best solution would be for OMR to fix itself properly after the omr-tracker event.
  • Is there a temporary solution where I can detect that OMR-tracker has restarted an interface and then run a command to restore the MPTCP bond properly? (A rough sketch of such a workaround is below.)
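
As a stopgap, a watchdog along these lines might work; this is only an untested sketch that watches the system log for the omr-tracker "switched up" message and restarts the proxy, with the caveat that the restart itself is disruptive:

#!/bin/sh
# Hypothetical workaround: restart v2ray whenever omr-tracker brings an
# interface back up, so the proxy starts using the restored path again.
logread -f | while read -r line; do
    case "$line" in
        *post-tracking*"switched up"*)
            /etc/init.d/v2ray restart
            ;;
    esac
done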

I have been tracking this problem for a long time, during which I have watched the performance of the system degrade.
I didn't have the knowledge until recently to isolate the bug and report it. I am available to test any solutions.

Context (Environment)

The issue is serious because it effectively breaks OMR. If only a single path is used after omr-tracker restarts an interface, there is no purpose in running OMR at all. There is also currently no automatic way to recover from this problem.

Specifications

  • OpenMPTCProuter version: v0.59.1-5.4 r0+16594-ce92de8c8c
  • OpenMPTCProuter VPS version: VPS 0.1028
  • OpenMPTCProuter VPS provider:
  • OpenMPTCProuter platform: x86_64
@ioogithub ioogithub added the bug label Aug 24, 2023
@ioogithub
Author

ioogithub commented Aug 25, 2023

I have more information. I can reproduce the bug not with omr-tracker but by stopping the interface (wan2) and starting it again. After I start it, OMR never uses it for MPTCP, so it becomes a single-path router. Here is the log when wan2 is started again:

Thu Aug 24 19:54:50 2023 kern.info kernel: [23103.714924] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Thu Aug 24 19:54:56 2023 daemon.notice netifd: wan2 (30320): udhcpc: sending select for x.x.x.x
Thu Aug 24 19:54:56 2023 daemon.notice netifd: wan2 (30320): udhcpc: lease of x.x.x.x obtained, lease time 172800
Thu Aug 24 19:54:56 2023 daemon.notice netifd: Interface 'wan2' is now up
Thu Aug 24 19:54:56 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Aug 24 19:54:56 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Aug 24 19:54:57 2023 user.notice MPTCP: Flush route cache
Thu Aug 24 19:54:57 2023 user.notice firewall: Reloading firewall due to ifup of wan2 (eth2)
Thu Aug 24 19:54:57 2023 user.notice firewall.omr-server: Firewall reload, set server part firewall reloading
Thu Aug 24 19:54:57 2023 user.notice mptcp: Reloading mptcp config due to ifup of wan2 (eth2)
Thu Aug 24 19:54:59 2023 user.notice post-tracking-post-tracking: Set firewall on server vps
Thu Aug 24 19:55:00 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Aug 24 19:55:00 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Aug 24 19:55:00 2023 kern.info kernel: [23113.674687] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Thu Aug 24 19:55:02 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Thu Aug 24 19:55:46 2023 user.notice post-tracking-post-tracking: Check API configuration...
Thu Aug 24 19:55:52 2023 user.notice post-tracking-post-tracking: Check API configuration... Done

So I guess Thu Aug 24 19:54:57 2023 user.notice mptcp: Reloading mptcp config due to ifup of wan2 (eth2) is the problem.

I checked ip r before restarting the interface and after bringing it back up. All routes are the same, but how can I tell MPTCP to use both wan1 and wan2 after wan2 comes back online?
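
For reference, the per-interface MPTCP status can be inspected and toggled with OMR's multipath helper; whether toggling it actually re-establishes the subflows is exactly what is in question here:

# show the MPTCP status of all interfaces
multipath

# force wan2's device out of and back into MPTCP
multipath eth2 off
multipath eth2 on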

@ioogithub ioogithub changed the title MPTCP Bond not restored after omr-tracker stops and starts an interface, OMR becomes single path TCP router. Multipath not restored after omr-tracker stops and starts an interface, OMR becomes single path TCP router. Aug 25, 2023
@fareign

fareign commented Aug 25, 2023

try to reload v2ray? only some VPN use MPTCP.

@ioogithub
Author

ioogithub commented Aug 25, 2023

try to reload v2ray? only some VPN use MPTCP.
I tried to reload with /etc/init.d/openmptcprouter-vps restart but it executes and exits and doesn't restart anything that I can see in the logs.

If I reload with /etc/init.d/v2ray reload it doesn't seem to fix the problem, traffic still comes in over only 1 wan and I see these errors:

[info] DNS disabled: main_dns
[info] Setting transparent proxy on port: 1234
[info] Transparent proxy mode: default
Flush terminated
ss-rules6: unknown option def
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
Warning: Section 'zone_vpn' cannot resolve device of network 'omr6in4'
Warning: Option @redirect[3].v2ray is unknown
Warning: Option @redirect[4].v2ray is unknown
Warning: Section @redirect[4] (19224 on v2ray) does not specify a protocol, assuming TCP+UDP

If I restart with /etc/init.d/v2ray restart
then it does work, but it creates a new problem: all clients on the network get disconnected, so this is too disruptive given that omr-tracker can restart an interface at any time.

MPTCP works perfectly until an interface is restarted or omr-tracker brings an interface down and up. After that, the interface cannot be used for multi-path.

There has to be a way to tell v2ray that the second path is available to be used again. When I bring the interface down, OMR knows how to fail over to one interface successfully; it should be able to recover when the interface comes back up.

@ioogithub ioogithub changed the title Multipath not restored after omr-tracker stops and starts an interface, OMR becomes single path TCP router. Multipath not restored after interface is restarted, OMR becomes single path TCP router. Aug 25, 2023
@Ysurac
Owner

Ysurac commented Aug 25, 2023

Do you have the same issue using Shadowsocks?
And do you have the same issue using the 6.1 snapshot image?

@ioogithub
Author

ioogithub commented Aug 25, 2023

Do you have the same issue using Shadowsocks?
I just did 3 tests with v2ray and 3 tests with shadowsocks. The problem affects both v2ray and shadowsocks proxies.

Test:

  1. reboot OMR, start file upload transfer, observe normal multi-path
  2. restart wan2 interface
  3. transfer a file (upload), observe traffic on only wan1 interface.

I can repeat the test by restarting wan1 with the same result. After an interface restarts, the proxy does not use it for multi-path.

Here is the result:

Normal multi-path:
(screenshot: before restart)

After wan2 interface is reset, multi-path is broken:
(screenshot: after restart)

I am looking at the logs after restarting an interface trying to find out what is not running. I see one log entry:
Fri Aug 25 13:10:49 2023 daemon.notice netifd: wan2 (23088): Command failed: Permission denied
but it doesn't say which command failed.

Please help me to troubleshoot:

  • Which script tells the proxy that multi-path is available and proxy should use multi-path?
  • Is there a uci setting that is not being set properly?
  • Maybe something is failing on restart but there is no log showing failure.

I think this is a serious problem because omr-tracker can restart an interface at any time, which disables multipath on the OMR router; the user will not realize that OMR is running with degraded performance.

If OMR can successfully degrade the router to a single path, it should be able to upgrade it back to multi-path as well, right?

@Ysurac
Owner

Ysurac commented Aug 25, 2023

It's not a known problem and I don't have this issue on my 0.59.1 (and still not with the latest snapshot).
What is the result of multipath, ip a and ip r commands ?

@ioogithub
Author

Did you test it with an upload, so it would use the port forward and the proxy? Either shadowsocks or v2ray gives the same result for me.

I think I saw a download working once, but I am not sure because I am testing uploads now.

I am looking at ip a, ip r, ip rule and multipath. I will compare the results from a working fresh reboot of OMR with the results after an interface is restarted and incoming transfers stop using multi-path.

@ioogithub
Author

ioogithub commented Aug 25, 2023

I did lots of tests:

  1. restart OMR, capture ip r, ip a, ip rule, multipath settings
  2. restart wan2
  3. capture all settings
  4. compare settings from step 1 and 3.

The routes, rules, interfaces and multipath settings are all identical when it is working and when it is broken after restarting an interface.

It's not a known problem and I don't have this issue on my 0.59.1 (and still not with the latest snapshot). What is the result of multipath, ip a and ip r commands ?

Here are the outputs; everything looks normal and nothing stands out to me:

ip r

default via 10.255.255.1 dev tun0 
default metric 1 
        nexthop via 10.6.0.1 dev eth1 weight 2 
        nexthop via 10.0.0.1 dev eth2 weight 1 
default via 10.6.0.1 dev eth1 metric 3 
default via 10.0.0.1 dev eth2 metric 4 
default via 10.255.255.1 dev tun0 metric 1200 
10.0.0.0/24 via 10.0.0.1 dev eth2 
10.0.0.0/24 dev eth2 scope link metric 4 
10.255.255.1 dev tun0 proto kernel scope link src 10.255.255.2 
10.255.255.2 dev tun0 scope link metric 1200 
6.1.2.2 dev eth1 proto static scope link src 10.6.8.1 metric 3 
10.6.0.0/10 dev eth1 scope link metric 3 
127.0.0.0/8 dev lo proto static scope link metric 6 
x.x.x.x metric 1 
        nexthop via 10.6.0.1 dev eth1 weight 10 
        nexthop via 10.0.0.1 dev eth2 weight 1 

multipath:

bond0 is deactivated
erspan0 is deactivated
eth0 is deactivated
eth1 is in default mode
eth2 is in default mode
eth3 is deactivated
gre0 is deactivated
gretap0 is deactivated
ifb4tun0 is deactivated
ip6gre0 is deactivated
ip6tnl0 is deactivated
lo is deactivated
sit0 is deactivated
teql0 is deactivated
tun0 is deactivated

ip a:

1: lo: <LOOPBACK,UP,LOWER_UP,80000> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ip6tnl0@NONE: <NOARP,80000> mtu 1452 qdisc noop state DOWN group default qlen 1000
    link/tunnel6 :: brd ::
3: sit0@NONE: <NOARP,80000> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
4: gre0@NONE: <NOARP,80000> mtu 1476 qdisc noop state DOWN group default qlen 1000
    link/gre 0.0.0.0 brd 0.0.0.0
5: gretap0@NONE: <BROADCAST,MULTICAST,80000> mtu 1462 qdisc noop state DOWN group default qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
6: erspan0@NONE: <BROADCAST,MULTICAST,80000> mtu 1450 qdisc noop state DOWN group default qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
7: ip6gre0@NONE: <NOARP,80000> mtu 1448 qdisc noop state DOWN group default qlen 1000
    link/gre6 :: brd ::
8: bond0: <BROADCAST,MULTICAST,MASTER,80000> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b1:b1:b1:b1:b1:b1 brd ff:ff:ff:ff:ff:ff
9: teql0: <NOARP,80000> mtu 1500 qdisc noop state DOWN group default qlen 100
    link/void 
10: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP,80000> mtu 1500 qdisc mq state UP group default qlen 100
    link/ether a:aa:aa:aa:aa:aa brd ff:ff:ff:ff:ff:ff
    inet 192.168.x.x/24 brd 192.168.200.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 bbbb:bbbb:bbbb::1/60 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 cccc:cccc:cccc:cccc:cccc/64 scope link 
       valid_lft forever preferred_lft forever
11: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 100
    link/ether dd:dd:dd:dd:dd:dd brd ff:ff:ff:ff:ff:ff
    inet 10.6.8.1/10 brd 100.127.255.255 scope global eth1
       valid_lft forever preferred_lft forever
12: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP group default qlen 100
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.2/24 brd 10.0.0.255 scope global eth2
       valid_lft forever preferred_lft forever
13: eth3: <NO-CARRIER,BROADCAST,MULTICAST,UP,80000> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a1:a1:a1:a1:a1:a1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.a.b/24 brd 192.168.200.255 scope global eth3
       valid_lft forever preferred_lft forever
21: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP,80000> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 100
    link/none 
    inet 10.255.255.2 peer 10.255.255.1/32 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 a2a2:a2a2:a2a2:a2a2:a2a2/64 scope link stable-privacy 
       valid_lft forever preferred_lft forever
24: ifb4tun0: <BROADCAST,NOARP,80000> mtu 1500 qdisc noop state DOWN group default qlen 32
    link/ether a3:a3:a3:a3:a3:a3 brd ff:ff:ff:ff:ff:ff

Do you have any ideas of what else I can look at? It is like the proxy gets stuck on a single path, or it doesn't get notified that multi-path is available again.

Questions:

  • What script or settings tell the proxy to use a single path when an interface is down?
  • What scripts or settings tell the proxy to use multi-path when all interfaces are up?

If ip r, ip rule, etc. are the same when it is working and when it is not, it must be something else. What else should I look for?

@Ysurac
Owner

Ysurac commented Aug 25, 2023

Only the mptcp init script is used, to set the route table, and the multipath command to set the multipath status of an interface.
I don't understand the issue exactly; is it only with port forwarding? In this case, with shadowsocks as proxy only the VPN is used, and with v2ray it's used if you have the checkbox enabled in the firewall configuration.
Is it aggregated on download and not on port-forwarded upload in both cases?

What is the MPTCP scheduler used ?

@ioogithub
Author

mptcp init script is used, to set route table

This script sets the route table: /etc/init.d/mptcp

multipath command to set multipath status of an interface

Is this command inside the mptcp init script?

But if the route table and the multipath status are the same before and after the interface is restarted, then the problem is somewhere else.

Does this mean that only the routes and the multipath status of an interface determine whether the proxy will use multipath? So the proxy pushes data out and the kernel uses the routing table to route it over multiple interfaces? Does that mean the proxy doesn't know about MPTCP, and there are no other settings that notify it?

I don't understand exactly the issue, it's only with port forwarding ?

I think it is only uploads: when a client from outside the network requests data from a computer inside the network after an interface has been restarted, OMR does not use that interface; it sends data only on the other interface. In this case port forwarding is used. I don't know if port forwarding is involved, but downloads do not use that port forwarding.

In this case with shadowsocks as Proxy only VPN is used

For this I used different port forwarding rules when switching to shadowsocks.

with v2ray it's used if you have the checkbox enabled in firewall configuration.

Yes I am using the v2ray checkbox port forwarding rule.

All of this works normally before an interface is stopped and started or restarted.
After an interface is restarted there is no way to get OMR to use that interface for uploads.

I see the problem with both v2ray and shadowsocks:

  • After I restart wan1 and try a transfer, OMR only uses wan2 (broken)
  • After I restart wan2 and try a transfer, OMR only uses wan1 (broken)
  • If I reboot OMR, it uses wan1 and wan2 and it aggregates the bandwidth (working).

It seems like the proxy gets stuck, it doesn't get notified that both wan1 and wan2 are available.

I will start a fresh test:

  1. reboot omr (known working state)
  2. restart wan1
  3. start an upload and look at bandwidth for aggregate.
  4. start a download and look at bandwidth for aggregate.

What is the MPTCP scheduler used ?

scheduler: BLEST
congestion control: reno
path manager: full mesh
TCP SYN retries: 2
MPTCP version: 0
MPTCP checksum: disabled

There is an MPTCP debug setting; if I enable it, will we get more data in the logs?
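
On the debug question: assuming the out-of-tree MPTCP v0 kernel used by the 5.4 images, these settings are exposed as sysctls (the uci option network.globals.mptcp_debug appears to map to the same knob), so they can be inspected and the debug flag raised at runtime:

# read the current MPTCP settings (out-of-tree MPTCP v0 kernel assumed)
sysctl net.mptcp.mptcp_enabled net.mptcp.mptcp_scheduler net.mptcp.mptcp_path_manager

# enable verbose MPTCP messages in the kernel log (visible via dmesg / logread)
sysctl -w net.mptcp.mptcp_debug=1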

If you have any more ideas please let me know.

@ioogithub
Author

I have done a lot of testing today, here is the result:

v2ray multi-path

uploads

  • after OMR reboot: working
  • after one interface restart: NOT working traffic never returns to the restarted interface.

downloads

  • after OMR reboot: working
  • after one interface restart: working

shadowsocks multi-path

uploads

  • after fresh OMR reboot: working
  • after one interface restart: working, but with unusual behavior

downloads

  • after fresh OMR reboot: working
  • after one interface restart: working

Conclusion

I initially thought the problem affected uploads for both shadowsocks and v2ray, but I was wrong. With shadowsocks, sometimes it starts working only after a delay of up to 2 minutes, and sometimes it starts right away. In my previous tests I did not wait long enough; I did see it working today.

So the problem is isolated to a specific use case: v2ray uploads (traffic initiated from outside the network using the v2ray port forwarding) only. I have never seen this work.

I have also tried the following:

  • all different combinations of path schedulers, congestion control, and other MPTCP settings.
  • I put MPTCP into debug mode and watched the kernel sync messages, everything looks normal, at least it looks the same when mptcp is aggregating versus not aggregating.
  • I switched MPTCP master from wan1 to wan2, same result.

Nothing I can do will restore aggregated multi-path for uploads with v2ray once an interface is restarted. Traffic always flows through the one interface that was not restarted.

  • If I restart wan1, from that moment all upload traffic stops going out over wan1 and never recovers
  • If I restart wan2, from that moment all upload traffic stops going out over wan2 and never recovers

The only way to recover from this is to click Save and Apply on the Wizard screen or reboot OMR.

Question: how does MPTCP actually work on OMR? The v2ray config.json doesn't have MPTCP set in the stream settings, it is just TCP, so I guess v2ray doesn't even know about MPTCP. How does it work at all?
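
My understanding (to be confirmed): with the MPTCP-patched kernel, ordinary TCP sockets are turned into MPTCP connections by the kernel itself, so the proxy needs no MPTCP-specific configuration. Assuming the out-of-tree MPTCP v0 kernel, the active MPTCP connections and their subflows can be inspected directly, which should show whether the restarted interface ever rejoins a connection:

# list current MPTCP connections and subflows (out-of-tree MPTCP v0 kernel assumed)
cat /proc/net/mptcp_net/mptcp

# MPTCP counters (MP_CAPABLE / MP_JOIN attempts, fallbacks, etc.)
cat /proc/net/mptcp_net/snmp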

It's not a known problem and I don't have this issue on my 0.59.1 (and still not with the latest snapshot). What is the result of multipath, ip a and ip r commands ?

You tested uploads with port forwarding using v2ray as the proxy on 0.59.1 and it is working for you, correct?

  1. You are able to start an upload
  2. see traffic flowing over multiple interfaces in the bandwidth graph
  3. click on restart for one interface
  4. observe the interface is back online on the status page and logread
  5. observe that traffic stops flowing over the restarted interface.
  6. wait 10 minutes
  7. start another upload, observe traffic still only flowing on the interface that was not restarted.

If this works for you and it does not work for me, that suggests a configuration problem. Tomorrow I will set up a VPS with the 6.1 upstream kernel, boot my router with openmptcprouter-v0.59.2alpha-6.1-r0+23789-ce6ad123e7-x86-64-generic-ext4-combined-efi, configure v2ray and port forwarding, and repeat the tests above. I will report the results back here.

@ioogithub
Author

ioogithub commented Aug 27, 2023

Here is the result of testing with openmptcprouter v0.59.2alpha-6.1 r0+23789-ce6ad123e7.

  1. create new VPS: wget -O - http://www.openmptcprouter.com/server-test/debian-x86_64.sh | UPSTREAM6="yes" sh
  2. boot the router with image: openmptcprouter-v0.59.2alpha-6.1-r0+23789-ce6ad123e7-x86-64-generic-ext4-combined-efi.img.gz

Configure openwrt:

  1. Set LAN IP, set wan1 and wan2 to default and set force link
  2. Set server IP and server key.
  3. Create one port forwarding rule
  4. Everything else defaults.

shadowsocks upload: multipath working
shadowsocks upload after interface reset: multipath working

Configure V2ray:

  1. Set v2ray vless
  2. Create port forwarding, no v2ray check.
  3. Edit port forward and add v2ray check.
  4. Check /etc/shorewall/rules, correct.
  5. Confirm proxy is receiving traffic on OMR status page. Known issue: No proxy displayed here.
  6. Confirm traffic on port 65228 (tcpdump -i eth port 65228): yes

Test v2ray upload after interface restart.

  1. Start an upload using v2ray: multi-path working
    (screenshot: v2ray before restart)

  2. Reset wan1 interface: multi-path not working
    (screenshot: v2ray after restarting wan1)

Same results as the latest stable release. Uploads do not use aggregate multi-path with v2ray as the proxy after an interface is restarted.

There is currently no way to recover from this.

  • Restarting v2ray (/etc/init.d/v2ray restart) does not work.
  • Pressing Save and Apply on the wizard page did not work and it broke the port forward.
  • Rebooting OMR still left the port forward inaccessible, so there was no way to test this again.

With stable, at least Save and Apply or rebooting OMR will fix the problem. In 6.1 this does not work, so given this issue it is currently better to stay on stable until a workaround is found.

@ioogithub
Author

This is an example of an event that kills aggregate uploads:

Aug 28 03:33:19 OpenMPTCProuter user.notice post-tracking-post-tracking: wan2 (eth2) switched off because check error and ping from x.x.x.x error (1.0.0.1,114.114.115.115,1.2.4.8)
Aug 28 03:33:19 OpenMPTCProuter user.notice post-tracking-post-tracking: Delete default route to y.y.y.y via x.x.x.x dev eth2
Aug 28 03:41:15 OpenMPTCProuter daemon.notice netifd: Network device 'eth2' link is down
Aug 28 03:41:15 OpenMPTCProuter daemon.notice netifd: Interface 'wan2' has link connectivity loss
Aug 28 03:41:22 OpenMPTCProuter daemon.notice netifd: Network device 'eth2' link is up
Aug 28 03:41:22 OpenMPTCProuter daemon.notice netifd: Interface 'wan2' has link connectivity 
Aug 28 03:41:25 OpenMPTCProuter user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Aug 28 03:41:25 OpenMPTCProuter user.notice post-tracking-post-tracking: Interface route not yet set, set route ip r add default via x.x.x.x dev eth2 metric 4
Aug 28 03:41:26 OpenMPTCProuter user.notice post-tracking-post-tracking: New public ip detected for wan2 (eth2): a.a.a.a
Aug 28 03:41:26 OpenMPTCProuter user.notice post-tracking-post-tracking: Reload MPTCP for eth2

OMR-tracker brings the interface down and up again. After this, OMR is in single-path mode and will not use wan2 again until OMR is restarted, so this issue has a big impact on OMR stability and reliability.

There has to be a way to have uploads recover without manual user intervention after these interface reset events.

@Ysurac
Owner

Ysurac commented Aug 28, 2023

On latest snapshot, I'm not able to reproduce the issue for now using "iperf3" to test upload speed with V2Ray VLESS.
How do you test upload speed ? By downloading externally using port forward ?

@ioogithub
Author

ioogithub commented Aug 28, 2023

Yes, I am doing a real world test like this:

external client -> VPS (public IP) -> port forward OMR via v2ray vless -> host on LAN.

What exact test are you running? omr-iperf on OMR? If I run that right now this is what I get, but I only see one WAN connection used, so maybe this is not the correct test:
https://ibb.co/L5W55Tg

For my test, after OMR is restarted, if I do a download from an external client (an upload from OMR's point of view) in a good state, I get this:
https://ibb.co/LvLDFn8

If I do a download from an external client (upload) after an interface was restarted (by omr-tracker or manually), I get this:
https://ibb.co/tDfRGD6

It never recovers from this. omr-tracker reset wan2 8 hours ago:
Aug 28 07:43:41 OpenMPTCProuter user.notice post-tracking-post-tracking: wan2 (eth2) switched off because check error and ping from x.x.x.x error (1.0.0.1,114.114.115.115,1.2.4.8)
OMR has been sitting in single-path mode for uploads ever since.

If I run /etc/init.d/v2ray restart or validate the wizard on the stable version, it returns to normal aggregation; however, this operation is disruptive (it kills all client connections), it is manual, and it only lasts until omr-tracker resets one of the interfaces again.

@ioogithub
Author

ioogithub commented Aug 28, 2023

Here is a new upload test as discussed, using iperf and a new port forward set up for the test.

iperf test:

  • on client outside network: iperf -c VPN_Public_IP -p 19223 -t 180 -i 180
  • on server behind OMR: iperf -s -p 19223
  • firewall rule: ACCEPT net $FW tcp 19223 # OMR openmptcprouter open router 19223 port tcp --- V2Ray to 192.168.x.x:19223
  • wan2 restarted on interface page.

Result:

Bandwidth never returns to wan2: https://ibb.co/x8tJqQf
Start a new file transfer: https://ibb.co/PY1vNsz
You can see on the first screenshot that when wan2 goes down, OMR switches to wan1, but after wan2 comes back up it never recovers.

The problem is that on stable, /etc/init.d/v2ray restart would fix the problem but disconnects all clients on the network. With the 6.1 test build, after running /etc/init.d/v2ray restart, packets do not leave the VPS and reach OMR at all. The only way to recover and start a new test is to restart both OMR and the VPS. Even after validating the wizard and restarting v2ray, packets do not get past the VPS according to tcpdump -i eth0 port 19223.

@Ysurac
Owner

Ysurac commented Aug 28, 2023

I've tested using the V2Ray VLESS protocol, redirecting a port to an internal HTTP server and downloading a file from outside, and there is no problem: when I remove a connection and put it back, it's used again.
I also have no problem with a v2ray restart.

Can you give me full log when you test disconnect and then reconnect a connection ?
I will reset my install to a fresh config for more tests.

@ioogithub
Author

Can you give me full log when you test disconnect and then reconnect a connection ?

Yes, please let me know exactly which log you are looking for: logread from OMR, journalctl -f from the VPS, or something else? Is there any debug mode I can enable to get more info?

@Ysurac
Owner

Ysurac commented Aug 29, 2023

I would need system log from the router, so the result of logread.

I tested with a new VPS and router installation and I have no issue. What is the result of uci show network via SSH on router ?

@ioogithub
Author

I have been testing with the 6.1 kernel and switched back to 0.59.1 recently, as it is easier for me to recover.

What is the result of uci show network

network.loopback=interface
network.loopback.proto='static'
network.loopback.ipaddr='127.0.0.1'
network.loopback.netmask='255.0.0.0'
network.loopback.multipath='off'
network.loopback.device='lo'
network.loopback.metric='6'
network.globals=globals
network.globals.mptcp_subflows='3'
network.globals.mptcp_add_addr_accepted='1'
network.globals.mptcp_add_addr_timeout='120'
network.globals.mptcp_fullmesh_num_subflows='1'
network.globals.mptcp_fullmesh_create_on_err='1'
network.globals.mptcp_ndiffports_num_subflows='1'
network.globals.ula_prefix='abac:4c32:5cbb::/48'
network.globals.mptcp_version='0'
network.globals.mptcp_rr_cwnd_limited='Y'
network.globals.mptcp_rr_num_segments='1'
network.globals.multipath='enable'
network.globals.mptcp_syn_retries='2'
network.globals.mptcp_scheduler='blest'
network.globals.congestion='reno'
network.globals.mptcp_path_manager='fullmesh'
network.globals.mptcp_checksum='0'
network.globals.mptcp_debug='0'
network.lan=interface
network.lan.proto='static'
network.lan.ipaddr='192.168.x.x'
network.lan.netmask='255.255.255.0'
network.lan.device='eth0'
network.lan.ifname='eth0'
network.lan.delegate='0'
network.lan.addlatency='0'
network.lan.multipath='off'
network.lan.ip4table='lan'
network.lan.metric='7'
network.lan.label='toMesh'
network.lan.defaultroute='0'
network.lan.peerdns='0'
network.lan.ip6assign='60'
network.lan.ipv6='0'
network.lan.force_link='1'
network.lan_rule=rule
network.lan_rule.lookup='lan'
network.lan_rule.priority='100'
network.wan1=interface
network.wan1.device='eth1'
network.wan1.ip4table='wan'
network.wan1.defaultroute='0'
network.wan1.addlatency='0'
network.wan1.ipv6='0'
network.wan1.peerdns='0'
network.wan1.proto='dhcp'
network.wan1.label='starlink'
network.wan1.force_link='1'
network.wan1.metric='3'
network.wan1.multipath='master'
network.wan1_dev=device
network.wan1_dev.name='eth1'
network.wan1_dev.txqueuelen='20'
network.wan2=interface
network.wan2.device='eth2'
network.wan2.ip4table='wan'
network.wan2.defaultroute='0'
network.wan2.addlatency='0'
network.wan2.ipv6='0'
network.wan2.peerdns='0'
network.wan2.proto='dhcp'
network.wan2.force_link='1'
network.wan2.metric='4'
network.wan2.label='lte'
network.wan2.multipath='on'
network.wan2_dev=device
network.wan2_dev.name='eth2'
network.wan2_dev.txqueuelen='20'
network.omrvpn=interface
network.omrvpn.device='tun0'
network.omrvpn.ip4table='vpn'
network.omrvpn.multipath='off'
network.omrvpn.leasetime='12h'
network.omrvpn.type='tunnel'
network.omrvpn.txqueuelen='100'
network.omrvpn.metric='1200'
network.omrvpn.proto='none'
network.omr6in4=interface
network.omr6in4.proto='6in4'
network.omr6in4.ip4table='vpn'
network.omr6in4.multipath='off'
network.omr6in4.ipaddr='10.255.255.2'
network.omr6in4.peeraddr='10.255.255.1'
network.omr6in4.ip6addr='fe80::a00:2/126'
network.omr6in4.gateway='fe80::a00:1/126'
network.omr6in4.force_link='1'
network.omr6in4.metric='1201'
network.lan_dev=device
network.lan_dev.name='eth0'
network.wan3_dev=device
network.wan3_dev.type='macvlan'
network.wan3_dev.mode='vepa'
network.wan3_dev.ifname='eth0'
network.wan3_dev.name='wan3'
network.wan3_dev.txqueuelen='20'
network.localNIC=interface
network.localNIC.proto='static'
network.localNIC.device='eth3'
network.localNIC.netmask='255.255.255.0'
network.localNIC.multipath='off'
network.localNIC.addlatency='0'
network.localNIC.metric='11'
network.localNIC.ipaddr='192.168.y.y'
network.localNIC.ipv6='0'
network.localNIC.defaultroute='0'
network.localNIC.peerdns='0'
network.localNIC_dev=device
network.localNIC_dev.name='eth3'
network.@route[2]=route
network.@route[2].interface='lan'
network.@route[2].target='192.168.a.a'
network.@route[2].netmask='255.255.255.0'
network.@route[2].gateway='192.168.x.xyy'

I would need system log from the router, so the result of logread.

Here is the log from the moment I clicked reset on the wan2 interface. After this I just see expected lease renewal messages, nothing else related to connections and the log is quiet.

Thu Aug 31 14:13:42 2023 daemon.notice netifd: Interface 'wan2' is now down
Thu Aug 31 14:13:42 2023 kern.info kernel: [ 3351.120795] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Thu Aug 31 14:13:42 2023 daemon.notice netifd: Interface 'wan2' is disabled
Thu Aug 31 14:13:42 2023 daemon.notice netifd: Interface 'wan2' is enabled
Thu Aug 31 14:13:42 2023 daemon.notice netifd: Interface 'wan2' is setting up now
Thu Aug 31 14:13:42 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Aug 31 14:13:42 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Aug 31 14:13:42 2023 daemon.notice netifd: wan2 (29374): udhcpc: started, v1.33.2
Thu Aug 31 14:13:42 2023 daemon.notice netifd: wan2 (29374): udhcpc: sending discover
Thu Aug 31 14:13:42 2023 kern.err kernel: [ 3351.487765] __mptcp_init4_subsockets: token 0xa463fb6a bind() to y.y.y.y index 12 failed, error -99
Thu Aug 31 14:13:43 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 14:13:44 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched off because wan2 may have ip issues, interface have no IPv4, interface have no IPv4 gateway
Thu Aug 31 14:13:44 2023 user.notice post-tracking-post-tracking: Delete default route to x.x.x.x via  dev eth2
Thu Aug 31 14:13:45 2023 daemon.notice netifd: wan2 (29374): udhcpc: sending discover
Thu Aug 31 14:13:46 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Aug 31 14:13:46 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Aug 31 14:13:46 2023 kern.info kernel: [ 3354.700767] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Thu Aug 31 14:13:49 2023 daemon.notice netifd: wan2 (29374): udhcpc: sending discover
Thu Aug 31 14:13:52 2023 daemon.notice netifd: wan2 (29374): udhcpc: sending select for y.y.y.y
Thu Aug 31 14:13:52 2023 daemon.notice netifd: wan2 (29374): udhcpc: lease of y.y.y.y obtained, lease time 172800
Thu Aug 31 14:13:52 2023 daemon.notice netifd: Interface 'wan2' is now up
Thu Aug 31 14:13:52 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Aug 31 14:13:52 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Aug 31 14:13:52 2023 user.notice firewall: Reloading firewall due to ifup of wan2 (eth2)
Thu Aug 31 14:13:52 2023 user.notice firewall.omr-server: Firewall reload, set server part firewall reloading
Thu Aug 31 14:13:53 2023 user.notice post-tracking-post-tracking: Set firewall on server vps
Thu Aug 31 14:13:54 2023 daemon.err glorytun[14226]: read: Operation timed out
Thu Aug 31 14:13:54 2023 daemon.info glorytun[14226]: STOPPED tun0
Thu Aug 31 14:13:55 2023 daemon.err glorytun[14226]: x.x.x.x.65001: connected
Thu Aug 31 14:13:55 2023 daemon.info glorytun[14226]: STARTED tun0
Thu Aug 31 14:13:55 2023 user.notice MPTCP: Add route z.z.z.z/24 via z.z.z.b dev eth2
Thu Aug 31 14:13:55 2023 user.notice MPTCP: Add route a.a.a.a via a.a.a.b dev eth0
Thu Aug 31 14:13:55 2023 user.notice mptcp: Reloading mptcp config due to ifup of wan2 (eth2)
Thu Aug 31 14:13:55 2023 user.notice MPTCP: Add route z.z.z.z/24 via z.z.z.b dev eth2
Thu Aug 31 14:13:56 2023 user.notice MPTCP: Add route a.a.a.a via a.a.a.b dev eth0
Thu Aug 31 14:13:56 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Aug 31 14:13:56 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Aug 31 14:13:56 2023 kern.info kernel: [ 3364.671214] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Thu Aug 31 14:13:56 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 14:13:56 2023 user.notice MPTCP: Add route z.z.z.z/24 via z.z.z.b dev eth2
Thu Aug 31 14:13:56 2023 user.notice MPTCP: Add route a.a.a.a via a.a.a.b dev eth0
Thu Aug 31 14:13:56 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 14:13:58 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Thu Aug 31 14:14:01 2023 user.notice MPTCP: Add route z.z.z.z/24 via z.z.z.b dev eth2
Thu Aug 31 14:14:01 2023 user.notice MPTCP: Add route a.a.a.a via a.a.a.b dev eth0
Thu Aug 31 14:14:45 2023 user.notice post-tracking-post-tracking: Check API configuration...
Thu Aug 31 14:14:53 2023 user.notice post-tracking-post-tracking: Check API configuration... Done

After this point wan2 does not recover:
Screenshot_20230831_141709

  • I can come back in 6 hours and see the same bandwidth graph, only wan1 traffic for uploads
  • I can repeat the test by restarting wan1: same result, only traffic on wan2 now.
  • I also tried resetting the lan interface and ensuring that any existing connections timed out before starting a new transfer, same results.
  • tested file transfers with sftp, curl from an http server, iperf, etc.
  • only manual intervention (/etc/init.d/v2ray restart) will fix the problem and allow upload traffic to use both wan1 and wan2, but this causes other problems because it kills all active connections on the network.

@Ysurac
Owner

Ysurac commented Aug 31, 2023

Can you try to do multipath eth2 off && multipath eth2 on to check if this changes something?

@ioogithub
Author

multipath eth2 off && multipath eth2 on

  1. start a transfer after an interface reset, when OMR is in the single-WAN state
  2. multipath eth2 off && multipath eth2 on
  3. No interruption or change to the transfer
  4. Also no notice in logread -f from this command.

Result: I cannot see any effect from this command.

There was one error after I reset wan2:
Thu Aug 31 14:13:42 2023 kern.err kernel: [ 3351.487765] __mptcp_init4_subsockets: token 0xa463fb6a bind() to y.y.y.y index 12 failed, error -99

Could this be the problem? It happens when wan2 is brought down.
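
For what it's worth, error -99 should be -EADDRNOTAVAIL ("Cannot assign requested address"), so my reading (not confirmed) is that the MPTCP subflow tried to bind() to wan2's previous address while the interface had no usable IPv4 address yet. One quick check the next time it happens would be whether the address is actually present at that moment:

# if this is empty right after the restart, the -99 is just a side effect
# of the interface losing its address, not necessarily the bug itself
ip -4 addr show dev eth2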

I am going to reboot OMR to a known good state and watch the logs closely again.

@ioogithub
Author

Here is a log with two questions I have.

  1. OMR rebooted at 15:58. At 16:14 I see this message:

Thu Aug 31 16:14:21 2023 kern.info kernel: [ 1079.764399] igc 0000:04:00.0: changing MTU from 1500 to 1460
and this change causes eth2 to go down and up:

Thu Aug 31 16:14:21 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Aug 31 16:14:21 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Aug 31 16:14:25 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Aug 31 16:14:25 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Aug 31 16:14:25 2023 kern.info kernel: [ 1083.381921] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

Is there an OMR process that monitors something, tries to optimize the MTU, and changes it while the interface is up and running? The consequence is that the change restarts the interface, and the interface then isn't used for uploads because of this issue.

  2. I also see these messages repeated:
Thu Aug 31 16:02:06 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 16:02:06 2023 user.notice MPTCP: Add route 10.x.x.x/24 via 10.x.x.y dev eth2
Thu Aug 31 16:02:06 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Aug 31 16:02:07 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Aug 31 16:02:07 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Aug 31 16:02:07 2023 user.notice omr-tracker: Launching...
Thu Aug 31 16:02:14 2023 user.notice omr-tracker: Launched
Thu Aug 31 16:02:15 2023 daemon.info omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1)

Sometimes I see these messages every few minutes and sometimes I do not see them for an hour in the log. What are they for, and what determines how often they run? Is this normal?

Also, is glorytun actually restarting every time this is logged:
Thu Aug 31 16:02:07 2023 daemon.info glorytun: starting glorytun vpn instance vpn

@ioogithub
Author

I rebooted OMR at 17:00, same issue, the kernel or some OMR process changed the MTU again at 17:13:

Thu Aug 31 17:14:35 2023 kern.info kernel: [  966.328211] igc 0000:04:00.0: changing MTU from 1500 to 1460
Thu Aug 31 17:14:36 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Aug 31 17:14:36 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Aug 31 17:14:39 2023 user.notice post-tracking-post-tracking: Check API configuration...
Thu Aug 31 17:14:39 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Aug 31 17:14:39 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Aug 31 17:14:39 2023 kern.info kernel: [  969.825546] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

and then the log file is quiet; no omr-tracker or glorytun log messages for 2 hours.

What is the purpose of resetting the MTU, and why don't I see the omr-tracker, glorytun, and MPTCP flush cache messages any longer?

@ioogithub
Author

As soon as I initiate 1 file upload, these messages appear again, every few minutes. Are they related to connections or file transfers? Are they related to this issue?

Thu Aug 31 19:29:27 2023 user.notice omr-tracker: Launching...
Thu Aug 31 19:29:34 2023 user.notice omr-tracker: Launched
Thu Aug 31 19:29:34 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Aug 31 19:29:34 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 19:29:34 2023 user.notice MPTCP: Add route 10.a.a.a/24 via 10.0.0.1 dev eth2
Thu Aug 31 19:29:34 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Aug 31 19:29:35 2023 daemon.info omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1)
Thu Aug 31 19:29:35 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Aug 31 19:29:35 2023 user.notice omr-tracker: Launching...
Thu Aug 31 19:29:42 2023 user.notice omr-tracker: Launched
Thu Aug 31 19:29:42 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Aug 31 19:29:42 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 19:29:43 2023 user.notice MPTCP: Add route 10.a.a.a/24 via 10.0.0.1 dev eth2
Thu Aug 31 19:29:43 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Aug 31 19:29:43 2023 daemon.info omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1)
Thu Aug 31 19:29:43 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Aug 31 19:29:51 2023 user.notice omr-tracker: Launching...
Thu Aug 31 19:29:58 2023 user.notice omr-tracker: Launched
Thu Aug 31 19:29:58 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Aug 31 19:29:58 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 19:29:58 2023 user.notice MPTCP: Add route 10.a.a.a/24 via 10.0.0.1 dev eth2
 Aug 31 19:29:58 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Aug 31 19:29:59 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Aug 31 19:29:59 2023 daemon.info omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1)

@Ysurac
Owner

Ysurac commented Sep 1, 2023

You can fix the MTU in Network->Interfaces for each interface, then it will not try to calculate the MTU (but changing the MTU should not put the interface down).
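
From the CLI, a sketch of pinning the MTU, assuming the standard OpenWrt device option on the wan2_dev section shown in the config dump above (not verified against OMR's own MTU handling):

# pin eth2's MTU so it is no longer re-calculated
uci set network.wan2_dev.mtu='1460'
uci commit network
/etc/init.d/network reload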

@ioogithub
Author

You can fix the MTU in Network->Interfaces for each interface, then it will not try to calculate the MTU (but changing the MTU should not put the interface down).

Are you sure? I always see a reset after an MTU change:

I get this:

Fri Sep 1 01:08:08 2023 kern.info kernel: [ 3654.871161] igc 0000:04:00.0: changing MTU from 1500 to 1460

Fri Sep  1 01:08:08 2023 daemon.notice netifd: Network device 'eth2' link is down
Fri Sep  1 01:08:08 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Fri Sep  1 01:08:11 2023 daemon.notice netifd: Network device 'eth2' link is up
Fri Sep  1 01:08:11 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Fri Sep  1 01:08:11 2023 kern.info kernel: [ 3658.468358] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

I can also reproduce it with this:

ip link set dev eth0 mtu 1460

it will always reset the interface.
Fri Sep 1 01:08:08 2023 daemon.notice netifd: Network device 'eth0' link is down

But the MTU change only contributes to the problem; it doesn't fix the root cause of aggregation being broken after an interface is restarted.

What about this error after I reset wan2:
Thu Aug 31 14:13:42 2023 kern.err kernel: [ 3351.487765] __mptcp_init4_subsockets: token 0xa463fb6a bind() to y.y.y.y index 12 failed, error -99
Could this be causing the issue, or is it just from the interface going down?

@Ysurac
Owner

Ysurac commented Sep 6, 2023

The attack is on the VPS, not related with eth2 on the router.

You can increase timeout, tries and retry but it's a connection issue.

Strange that v2ray restart on VPS change something...
You can set to restart v2ray on interface up: uci set openmptcprouter.wan2.script_alert_up='/etc/init.d/v2ray restart'

@ioogithub
Author

I found something interesting while looking for a solution. I noticed that v2ray spawns a lot of unconnected UDP sockets on the VPS that never have any data in their send or receive queues. Here is an example from this command: ss -tuln -p:

udp            UNCONN          0               0                                     *:19912                              *:*             users:(("v2ray",pid=1339875,fd=109))
udp            UNCONN          0               0                                     *:37331                              *:*             users:(("v2ray",pid=1339875,fd=90))
udp            UNCONN          0               0                                     *:11731                              *:*             users:(("v2ray",pid=1339875,fd=410))
udp            UNCONN          0               0                                     *:43477                              *:*             users:(("v2ray",pid=1339875,fd=320))
udp            UNCONN          0               0                                     *:27097                              *:*             users:(("v2ray",pid=1339875,fd=456))
udp            UNCONN          0               0                                     *:18911                              *:*             users:(("v2ray",pid=1339875,fd=536))
udp            UNCONN          0               0                                     *:44513                              *:*             users:(("v2ray",pid=1339875,fd=429))
udp            UNCONN          0               0                                     *:56803                              *:*             users:(("v2ray",pid=1339875,fd=347))
udp            UNCONN          0               0                                     *:43491                              *:*             users:(("v2ray",pid=1339875,fd=539))
udp            UNCONN          0               0                                     *:41446                              *:*             users:(("v2ray",pid=1339875,fd=427))
udp            UNCONN          0               0                                     *:26092                              *:*             users:(("v2ray",pid=1339875,fd=274))
udp            UNCONN          0               0                                     *:28143                              *:*             users:(("v2ray",pid=1339875,fd=466))
udp            UNCONN          0               0                                     *:19952                              *:*             users:(("v2ray",pid=1339875,fd=498))
udp            UNCONN          0               0                                     *:63992                              *:*             users:(("v2ray",pid=1339875,fd=541))

v2ray was restarted 1h20m ago and it currently has 276 of these open, unused sockets, which seems high. Is this normal?

I also see a flood of connection messages in journalctl on the VPS like this:

Sep 07 00:06:56 vps v2ray[1339875]: 2023/09/07 00:06:56 a.a.a.a:2557 accepted udp:x.x.x.x:28777 [direct] email: openmptcprouter
Sep 07 00:06:56 vps v2ray[1339875]: 2023/09/07 00:06:56 a.a.a.a:17109 accepted udp:x.x.x.x:51415 [direct] email: openmptcprouter
Sep 07 00:06:56 vps v2ray[1339875]: 2023/09/07 00:06:56 a.a.a.a:42446 accepted udp:x.x.x.x:58168 [direct] email: openmptcprouter
Sep 07 00:06:56 vps v2ray[1339875]: 2023/09/07 00:06:56 a.a.a.a:60231 accepted udp:x.x.x.x:49001 [direct] email: openmptcprouter
Sep 07 00:06:56 vps v2ray[1339875]: 2023/09/07 00:06:56 a.a.a.a:3317 accepted udp:x.x.x.x:51035 [direct] email: openmptcprouter

For example, for this one second (00:06:56) there were 30 entries; usually it's closer to 1-5 every second. Are these journal entries related to the open sockets? I recognize a lot of these IP addresses from the omr-tracker page. Does omr-tracker open a new UDP socket for every connection attempt? Another strange thing about these connection logs is that the address (a.a.a.a) is always wan1's; I never see any for wan2. Is this also normal?

My current fix for this single-path aggregation problem is to restart v2ray, after which incoming transfers on my forwarded port use aggregation again. Roughly, I have to restart every 1 or 2 hours to restore multipath aggregation. Is v2ray becoming overwhelmed by this large number of open sockets, or is this a normal amount? Why are they opened and never closed? Why always UDP?

@Ysurac
Owner

Ysurac commented Sep 7, 2023

You shouldn't have so many UDP sockets if you don't use UDP much.
What are the destination IPs ? OMR-Tracker only does HTTP requests for the proxy check and ping requests for the WAN part if you didn't set it to DNS check.
In the log you see the initial WAN connection, it's normal.
It's a large amount with random destination ports, this seems really strange...

@ioogithub
Author

ioogithub commented Sep 7, 2023

You shouldn't have so many UDP sockets if you don't use UDP much.

I don't use much UDP; this behavior doesn't look right to me.

What are the destination IPs ?

The destination IPs are always the IP address of WAN1. I did this journalctl | grep wan2IP and I don't see any results.

OMR-Tracker only do http request for the proxy part and ping request for wan part if you didn't set it to DNS check.

  • I don't see DNS check but here are all the checks from the omr-tracker page.
  • Does omr-tracker use udp on random ports or does it use 80?

Proxy:
Enabled

Server tracker:
Enabled

Default settings:
Enabled
Mail Alert
Restart if down

OMR VPN:
Enabled
Server http test
Mail alert

There are so many messages in the journalctl that I always have to do something like this journalctl -f -n 50000 | grep -v -e accepted to view it for general troubleshooting.

Also, these show up as warnings. Why are connection events warnings? config.json has "loglevel": "warning", so what is v2ray warning about if these are just connection events?

Why do these requests only involve wan1, and why are there so many of them? Maybe this has something to do with the problem: after a while (around 1-2 hours) v2ray aggregation stops working, and it always works again after a restart. Also, after 1-2 hours there are hundreds of unconnected UDP sockets open.

I restarted v2ray 30 minutes ago and I am already at 242 sockets open:
sudo ss -tuln -p | grep v2ray | wc -l
253

They are all like this, on different port numbers:
udp UNCONN 0 0 *:13591 *:* users:(("v2ray",pid=3146957,fd=264))

Here is an example of a legitimate port that I am forwarding:
tcp LISTEN 0 4096 *:19223 *:* users:(("v2ray",pid=3146957,fd=9))
But what is opening all these unconnected UDP sockets?

In the log you see the initial wan connection, it's normal.

I will double-check: are you asking for the journalctl log for the WAN connection when the VPS is booted, or something else?

@Ysurac
Owner

Ysurac commented Sep 7, 2023

Destination IP is WAN IP ? I think it's source IP. Destination IP is x.x.x.x in your journalctl log.
OMR-Tracker uses HTTP, so port 80, for the proxy check.
For the OMR-Tracker connection status, DNS check is used only if you change this (default is ping):
(screenshot of the OMR-Tracker settings)
V2Ray is verbose in warning mode; I changed this to error mode in the develop script.
As I said, it's normal that you see only one IP, OpenMPTCProuter doesn't do round robin.
You need to find what is opening so many UDP ports, maybe something running on a client computer on the LAN side.
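
One way to see where those UDP flows originate, assuming they traverse the router and that eth0 is the LAN device as in the config dump above:

# watch UDP coming in on the LAN side (ignoring DNS) to spot the chatty client
tcpdump -ni eth0 -c 50 'udp and not port 53'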

@ioogithub
Author

I just had a chance to test this change we made earlier:
uci set openmptcprouter.wan2.script_alert_up='/etc/init.d/v2ray restart'
It adds this config to /etc/config/openmptcprouter:

config interface 'wan2'
        option metric '4'
        option multipathvpn '0'
        option asn 'ROGERS-COM-DPS'
        option mptcp_status 'MPTCP enabled'
        option multipath 'on'
        option mtu '1460'
        option manufacturer 'huawei'
        option script_alert_up '/etc/init.d/v2ray restart'
        option publicip '97.109.191.12'
        option state 'up'
        option lc '1694116043'
        option latency_previous '41'
        option latency '30'

but it did not work. eth2 was just restarted by omr-tracker; it came back online but v2ray did not restart. Here is the log:

Thu Sep  7 15:46:17 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched off because check error and ping from 10.x.x.x error (9.9.9.9,1.0.0.1,114.114.115.115)
Thu Sep  7 15:46:17 2023 user.notice post-tracking-post-tracking: Delete default route to 1vpsip via 10.0.0.1 dev eth2
Thu Sep  7 15:46:19 2023 user.notice post-tracking-post-tracking: Restart wan2
Thu Sep  7 15:46:19 2023 daemon.notice netifd: wan2 (28088): udhcpc: received SIGTERM
Thu Sep  7 15:46:19 2023 daemon.notice netifd: wan2 (28088): udhcpc: unicasting a release of 10.x.x.x to 10.0.0.1
Thu Sep  7 15:46:19 2023 daemon.notice netifd: wan2 (28088): udhcpc: sending release
Thu Sep  7 15:46:19 2023 daemon.notice netifd: wan2 (28088): udhcpc: entering released state
Thu Sep  7 15:46:19 2023 daemon.notice netifd: wan2 (28088): Command failed: Permission denied
Thu Sep  7 15:46:19 2023 daemon.notice netifd: Interface 'wan2' is now down
Thu Sep  7 15:46:19 2023 kern.info kernel: [88622.022865] igc 0000:04:00.0: changing MTU from 1460 to 1500
Thu Sep  7 15:46:19 2023 daemon.notice netifd: Interface 'wan2' is disabled
Thu Sep  7 15:46:19 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Sep  7 15:46:19 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Sep  7 15:46:20 2023 user.notice MPTCP: Flush route cache
Thu Sep  7 15:46:24 2023 kern.info kernel: [88626.981973] igc 0000:04:00.0: changing MTU from 1500 to 1460
Thu Sep  7 15:46:24 2023 daemon.notice netifd: Interface 'wan2' is enabled
Thu Sep  7 15:46:24 2023 daemon.notice netifd: Interface 'wan2' is setting up now
Thu Sep  7 15:46:24 2023 daemon.notice netifd: wan2 (11044): udhcpc: started, v1.33.2
Thu Sep  7 15:46:24 2023 daemon.notice netifd: wan2 (11044): udhcpc: sending discover
Thu Sep  7 15:46:27 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Sep  7 15:46:27 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Sep  7 15:46:27 2023 kern.info kernel: [88629.977387] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Thu Sep  7 15:46:27 2023 daemon.notice netifd: wan2 (11044): udhcpc: sending discover
Thu Sep  7 15:46:33 2023 daemon.notice netifd: Interface 'wan2' is now up
Thu Sep  7 15:46:33 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Sep  7 15:46:33 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Sep  7 15:46:33 2023 user.notice firewall: Reloading firewall due to ifup of wan2 (eth2)
Thu Sep  7 15:46:33 2023 user.notice firewall.omr-server: Firewall reload, set server part firewall reloading
Thu Sep  7 15:46:33 2023 user.notice mptcp: Reloading mptcp config due to ifup of wan2 (eth2)
Thu Sep  7 15:46:34 2023 user.notice MPTCP: Flush route cache
Thu Sep  7 15:46:35 2023 user.notice post-tracking-post-tracking: Set firewall on server vps
Thu Sep  7 15:46:37 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Sep  7 15:46:37 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Sep  7 15:46:37 2023 kern.info kernel: [88640.106992] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Thu Sep  7 15:46:38 2023 user.notice MPTCP: Flush route cache
Thu Sep  7 15:46:39 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Thu Sep  7 15:48:05 2023 user.notice MPTCP: Flush route cache
Thu Sep  7 15:48:06 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Sep  7 15:48:06 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Sep  7 15:48:06 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Sep  7 15:48:06 2023 user.notice omr-tracker: Launching...
Thu Sep  7 15:48:13 2023 user.notice omr-tracker: Launched
Thu Sep  7 15:48:14 2023 daemon.info omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1)
Thu Sep  7 15:48:37 2023 user.notice omr-tracker: Launching...
Thu Sep  7 15:48:44 2023 user.notice omr-tracker: Launched
Thu Sep  7 15:48:44 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Sep  7 15:48:44 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Sep  7 15:48:45 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Sep  7 15:48:45 2023 user.notice MPTCP: Flush route cache
Thu Sep  7 15:48:49 2023 user.notice omr-bypass: Starting OMR-ByPass...

Here are the log lines showing wan2 being detected as back up:

Thu Sep  7 15:46:37 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Sep  7 15:46:37 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Sep  7 15:46:39 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up

post-tracking-post-tracking can see wan2 is switched up but /etc/init.d/v2ray restart never ran. Is this the right place or is there somewhere else I can put this?

@Ysurac
Copy link
Owner

Ysurac commented Sep 8, 2023

Yes, I made a mistake, it's uci set omr-tracker.wan2=interface && uci set omr-tracker.wan2.script_alert_up='/etc/init.d/v2ray restart'

@ioogithub
Copy link
Author

Yes, I made a mistake, it's uci set omr-tracker.wan2=interface && uci set omr-tracker.wan2.script_alert_up='/etc/init.d/v2ray restart'

Thanks, I will try this new command. Is there a reverse command to delete the old one?

Also, I am seeing some error messages on the vps like this:


Sep 06 00:59:57 vps v2ray[2580939]: 2023/09/06 00:59:57 x.x.x.x:17668 rejected  proxy/vless/encoding: failed to read request version > read tcp b.b.b.b:65228->x.x.x.x:17668: read: connection reset by peer
Sep 08 16:30:16 vps v2ray[3697154]: 2023/09/08 16:30:16 x.x.x.x:53988 rejected  proxy/vless/encoding: failed to read request version > tls: first record does not look like a TLS handshake
Sep 08 16:30:41 vps v2ray[3697154]: 2023/09/08 16:30:41 x.x.x.x:60627 rejected  proxy/vless/encoding: failed to read request version > tls: first record does not look like a TLS handshake

x.x.x.x is the public IP of my wan1 router interface. Are these errors typical, what causes them, and could they be a problem?

@Ysurac
Copy link
Owner

Ysurac commented Sep 9, 2023

uci del openmptcprouter.wan2.script_alert_up
No idea for the V2Ray errors.

@ioogithub
Copy link
Author

No idea for the V2Ray errors.

Have you never seen this error before? I have another one on the VPS; a.a.a.a is wan1 and x.x.x.x is the VPS IP:

Sep 09 11:21:30 vps v2ray[3697154]: 2023/09/09 11:21:30 a.a.a.a:9035 rejected proxy/vless/encoding: failed to read request version > read tcp x.x.x.x:65228->a.a.a.a:9035: read: connection reset by peer

There is no corresponding error on the router at this time.

What does the message mean? Is the v2ray connection between the VPS and the router being terminated, and is there any way to troubleshoot this further?
Here is what I found: "This could be caused by various factors such as incorrect configuration files, inconsistent system time or id between the server and client, or blocked v2ray servers"

  • time is synced on both router and vps
  • config files haven't changed
  • it can't be blocked since it works most of the time.

@ioogithub
Copy link
Author

It seems to happen once per day on average:

Sep 11 15:55:59 vps v2ray[818]: 2023/09/11 15:55:59 y.y.y.y:12679 rejected proxy/vless/encoding: failed to read request version > read tcp x.x.x.x:65228->y.y.y.y:12679: read: connection reset by peer

Can you think of any way to troubleshoot this? Why is the v2ray server rejecting the v2ray client on OMR? Maybe this is contributing to the problem.
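One hedged way to get more detail on those rejections is to temporarily raise the v2ray log level on the VPS. This is a generic sketch, not an OMR-specific procedure: the config path /etc/v2ray/v2ray.json and the service name v2ray are assumptions (adjust them to your install), and omr-admin may rewrite the file on the next configuration push.

# On the VPS: switch "loglevel" in the "log" section of the v2ray config
# from "warning"/"error" to "debug", restart, and watch the journal.
sed -i 's/"loglevel": *"[a-z]*"/"loglevel": "debug"/' /etc/v2ray/v2ray.json
systemctl restart v2ray
journalctl -u v2ray -f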

@ioogithub
Copy link
Author

ioogithub commented Sep 27, 2023

Yes, I made a mistake, it's uci set omr-tracker.wan2=interface && uci set omr-tracker.wan2.script_alert_up='/etc/init.d/v2ray restart'

I am modifying my new install to restart glorytun instead of v2ray when an interface restarts, because this is what restores aggregation.
It is not working. Here is the new command:

uci set omr-tracker.wan2=interface && uci set omr-tracker.wan2.script_alert_up='/etc/init.d/glorytun restart'

uci show:

omr-tracker.wan2=interface
omr-tracker.wan2.script_alert_up='/etc/init.d/glorytun restart'

When I restart eth2 and run logread | grep glorytun:
<no results>

The command does not run. It did work when it was the v2ray restart; why won't it run to restart glorytun? Is the command correct?

I confirm that aggregation is not automatically restored. I do need to find a way to automatically restart glorytun on an interface restart.

@ioogithub
Copy link
Author

ioogithub commented Sep 27, 2023

Okay, it is actually working; my test was wrong because I was only restarting the interface. When I did a proper stop and start, it did trigger:


Wed Sep 27 16:49:35 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Wed Sep 27 16:49:35 2023 daemon.info glorytun[22860]: STOPPED tun0
Wed Sep 27 16:49:35 2023 daemon.notice netifd: Network device 'tun0' link is down
Wed Sep 27 16:49:35 2023 daemon.notice netifd: Interface 'omrvpn' has link connectivity loss
Wed Sep 27 16:49:35 2023 daemon.notice netifd: Interface 'omrvpn' is now down
Wed Sep 27 16:49:35 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Wed Sep 27 16:49:35 2023 daemon.notice netifd: Interface 'omrvpn' is disabled
Wed Sep 27 16:49:36 2023 daemon.info glorytun[27727]: INITIALIZED tun0
Wed Sep 27 16:49:36 2023 daemon.notice netifd: Interface 'omrvpn' is enabled
Wed Sep 27 16:49:36 2023 daemon.notice netifd: Network device 'tun0' link is up
Wed Sep 27 16:49:36 2023 daemon.notice netifd: Interface 'omrvpn' has link connectivity
Wed Sep 27 16:49:36 2023 daemon.notice netifd: Interface 'omrvpn' is setting up now
Wed Sep 27 16:49:36 2023 daemon.notice netifd: Interface 'omrvpn' is now up

I did not see this before. Aggregation is restored, but this is not a great solution because it is too disruptive. Is there any other way to fix this issue? I am using v2ray for TCP and UDP, so why does restarting the glorytun VPN fix aggregation issues with v2ray?

The output of
ip route show table 3
and
ip route show table 4

is the same when there is no aggregation and when it is working again.
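For what it is worth, a quick way to compare the full routing state between the broken and the working case (standard iproute2 only; the snapshot file names are arbitrary):

# Snapshot rules, all route tables and addresses while aggregation is broken...
ip rule show > /tmp/rt-broken.txt
ip -4 route show table all >> /tmp/rt-broken.txt
ip addr show >> /tmp/rt-broken.txt
# ...then again after glorytun is restarted and aggregation works, and diff:
ip rule show > /tmp/rt-working.txt
ip -4 route show table all >> /tmp/rt-working.txt
ip addr show >> /tmp/rt-working.txt
diff /tmp/rt-broken.txt /tmp/rt-working.txt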

@ioogithub
Copy link
Author

After restarting glorytun I got a kernel panic, the first time I have seen this on OMR:

Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.153871] ------------[ cut here ]------------
Wed Sep 27 16:51:16 2023 kern.crit kernel: [ 5122.154394] Kernel BUG at dql_completed+0x140/0x150 [verbose debug info unavailable]
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.155374] invalid opcode: 0000 [#1] SMP NOPTI
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.155837] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G         C        5.4.194 #0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.156816] Hardware name: redacted
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.157748] RIP: 0010:dql_completed+0x140/0x150
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.158212] Code: f8 ea 00 48 89 77 58 e9 4d ff ff ff 45 85 e4 41 0f 95 c4 41 39 db 0f 95 c0 41 84 c4 74 05 45 85 ed 78 09 89 c8 e9 2f ff ff ff <0f> 0b 01 f6 89 c8 29 f0 41 0f 48 c2 eb 8d 90 90 55 48 89 e5 41 56
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.159706] RSP: 0018:ffffc90000003e38 EFLAGS: 00010287
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.160196] RAX: 0000000000000002 RBX: ffffc90004a05d20 RCX: 0000000000000056
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.160660] RDX: 000000000000004a RSI: 00000000000000a0 RDI: ffff888173817340
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.161114] RBP: ffffc90000003e50 R08: 0000000000000000 R09: 0000000000017389
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.161557] R10: 0000000000000000 R11: 000000000001733f R12: ffff8881752daa80
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.161991] R13: ffff888173817280 R14: ffff888173135460 R15: 00000000ffffff46
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.162424] FS:  0000000000000000(0000) GS:ffff88817c000000(0000) knlGS:0000000000000000
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.163363] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.163790] CR2: 0000560d17df9000 CR3: 0000000003208000 CR4: 0000000000340ef0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.164216] Call Trace:
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.164628]  <IRQ>
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.165084]  igc_del_mac_steering_filter+0x4e1/0x2470 [igc]
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.165512]  __napi_poll+0x2c/0xb0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.165908]  net_rx_action+0xca/0x1f0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.166304]  __do_softirq+0xc9/0x20e
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.166694]  irq_exit+0x6f/0x80
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.167072]  do_IRQ+0x57/0xe0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.167444]  common_interrupt+0xf/0xf
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.167807]  </IRQ>
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.168159] RIP: 0010:cpuidle_enter_state+0x137/0x2c0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.168535] Code: 89 c6 e8 7c 0f 9f ff 45 84 ff 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 67 01 00 00 31 ff e8 40 fa a2 ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 c5 00 00 00 49 63 cd 4c 2b 75 d0 48 8d 04 49 48 c1
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.169694] RSP: 0018:ffffffff82203e08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd7
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.170427] RAX: ffff88817c0246c0 RBX: ffffffff822550e0 RCX: 000000000000001f
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.170881] RDX: 000004a898279135 RSI: 00000000401a41a4 RDI: 0000000000000000
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.171286] RBP: ffffffff82203e38 R08: 0000000000000000 R09: 0000000000023f40
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.171678] R10: 000009513d583da4 R11: ffffc90005627cf8 R12: ffff88817a8ac400
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.172078] R13: 0000000000000003 R14: 000004a898279135 R15: 0000000000000000
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.172440]  ? cpuidle_enter_state+0x114/0x2c0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.172800]  cpuidle_enter+0x2f/0x40
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.173157]  call_cpuidle+0x1e/0x40
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.173516]  do_idle+0xf4/0x1c0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.173861]  cpu_startup_entry+0x18/0x20
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.174201]  rest_init+0xa4/0xa6
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.174562]  arch_call_rest_init+0x9/0xc
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.174885]  start_kernel+0x57a/0x588
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.175200]  x86_64_start_reservations+0x29/0x2b
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.175511]  x86_64_start_kernel+0x71/0x74
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.175824]  secondary_startup_64+0xa4/0xb0

not sure what the implications of this are.

@ioogithub
Copy link
Author

ioogithub commented Sep 27, 2023

Here is a real world example. eth2 went down:

Wed Sep 27 18:32:56 2023 daemon.notice netifd: Network device 'eth2' link is down
Wed Sep 27 18:32:56 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Wed Sep 27 18:32:56 2023 kern.info kernel: [ 6087.417383] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Down
Wed Sep 27 18:33:03 2023 daemon.notice netifd: Network device 'eth2' link is up
Wed Sep 27 18:33:03 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Wed Sep 27 18:33:03 2023 kern.info kernel: [ 6094.447380] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Wed Sep 27 18:33:04 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Wed Sep 27 18:33:04 2023 user.notice post-tracking-post-tracking: Interface route not yet set, set route ip r add default via x.x.0.1 dev eth2 metric 4
Wed Sep 27 18:33:06 2023 user.notice post-tracking-post-tracking: New public ip detected for wan2 (eth2): x.x.x.x
Wed Sep 27 18:33:06 2023 user.notice post-tracking-post-tracking: Reload MPTCP for eth2

but this did not run:

omr-tracker.wan2=interface
omr-tracker.wan2.script_alert_up='/etc/init.d/glorytun restart'

I had to run /etc/init.d/glorytun restart manually, and then it returned to aggregation.

Is there another way to get glorytun to restart automatically when an interface goes down and comes back up?
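One generic OpenWrt mechanism for this, independent of omr-tracker, is an interface hotplug script: netifd runs every script in /etc/hotplug.d/iface/ with ACTION and INTERFACE set for each interface event. This is an untested sketch; the file name and the choice to restart glorytun on wan1/wan2 ifup are assumptions.

#!/bin/sh
# /etc/hotplug.d/iface/99-restart-glorytun (hypothetical file name)
[ "$ACTION" = "ifup" ] || exit 0
case "$INTERFACE" in
    wan1|wan2)
        logger -t hotplug-glorytun "restarting glorytun because $INTERFACE came up"
        /etc/init.d/glorytun restart
        ;;
esac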

@ioogithub
Copy link
Author

The problem is that this command does not persist across reboots:

uci set omr-tracker.wan2=interface && uci set omr-tracker.wan2.script_alert_up='/etc/init.d/glorytun restart'

The value is not written to /etc/config/omr-tracker.

The setting does work; how can I get it to persist?

@ioogithub
Copy link
Author

ioogithub commented Sep 28, 2023

I saw this for the first time ever in the log today:

Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: service stopped (unbound 1.15.0).
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: server stats for thread 0: 1177 queries, 0 answers from cache, 1177 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: server stats for thread 0: requestlist max 4 avg 0.475786 exceeded 0 jostled 0
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: average recursion processing time 0.000397 sec
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: histogram of recursion processing times
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: [25%]=0.00031344 median[50%]=0.000380463 [75%]=0.000447487
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: lower(secs) upper(secs) recursions
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000008    0.000016 4
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000016    0.000032 4
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000032    0.000064 10
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000064    0.000128 3
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000128    0.000256 21
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000256    0.000512 1122
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000512    0.001024 11

I don't think this has been running properly before.

@ioogithub
Copy link
Author

ioogithub commented Sep 28, 2023

I believe I was able to get the command to persist by editing /etc/config/omr-tracker and adding the following:

config interface 'wan2'
        option script_alert_up '/etc/init.d/glorytun restart'

Will this cause any issues with other scripts that read this file?

I am seeing a new problem. Sometimes restarting glorytun is not working now; a few times I had to restart v2ray instead, and that restored aggregated uploads. Now I don't know which one to restart, v2ray or glorytun, and I don't know what to look for to tell which service needs the restart.

Also, the timing is not right now. During the last restart, glorytun was restarted before this log line:

Wed Sep 27 23:10:58 2023 user.notice post-tracking-post-tracking: New public ip detected for wan2 (eth2): x.x.x.x

but it needs to be restarted after that point in order to restore aggregation. Is this possible?
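A hedged workaround for both points (untested; the script path, the sleep length and the choice to restart both services are guesses) would be to point script_alert_up at a small wrapper script instead of a single init script, and let the wrapper wait until after the public-IP / MPTCP reload steps before restarting anything:

#!/bin/sh
# /usr/share/omr/restart-tunnels.sh (hypothetical path, chmod +x it)
# Run the work in the background so the tracker hook returns immediately,
# give post-tracking time to finish its reload, then restart both services.
(
    sleep 15
    /etc/init.d/glorytun restart
    /etc/init.d/v2ray restart
) &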

@Ysurac
Copy link
Owner

Ysurac commented Sep 28, 2023

To keep uci settings after a reboot, you need to do a uci commit.
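For example, a sketch of the full sequence for the glorytun hook discussed above; uci commit writes it to /etc/config/omr-tracker so it survives a reboot:

uci set omr-tracker.wan2=interface
uci set omr-tracker.wan2.script_alert_up='/etc/init.d/glorytun restart'
uci commit omr-tracker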

@ioogithub
Copy link
Author

Ysurac can you explain:

I was looking at why tun0 was disconnecting and I realized this:

When using v2ray with port forwarding, OMR sets this rule on the VPS in /etc/shorewall/rules:
DNAT net vpn:$OMR_ADDR tcp 12345 # OMR openmptcprouter redirect router 12345 port tcp
But this doesn't make sense: v2ray is the proxy, yet this rule forwards the traffic from the VPS to the router over the VPN. The VPN by default is glorytun.

I verified it by stopping tun0, and that does prevent any uploads. So this means that uploads use glorytun's tun0??

This also explains why running /etc/init.d/glorytun restart returns the bandwidth graph to aggregate mode again.

How does this work? Do all uploads use the glorytun VPN? Why does v2ray open ports for glorytun?

But I want uploads to use v2ray the same as downloads; how do I configure this?

Is my whole upload aggregation problem in this thread actually a glorytun problem, not a v2ray problem?

@Ysurac
Copy link
Owner

Ysurac commented Oct 13, 2023

V2Ray is used only for port forwarding when the checkbox V2Ray is checked in the port forwarding configuration.

@ioogithub
Copy link
Author

ioogithub commented Oct 14, 2023

V2Ray is used only for port forwarding when the checkbox V2Ray is checked in the port forwarding configuration.

Yes, this is what I want, but I don't think it is working. According to the status page and to restarting services, uploads are using tun0 (glorytun), not v2ray.

To test:

  1. Set up port forwarding on port 1234 with the V2Ray checkbox checked. Confirm the rule on the VPS:
DNAT            net             vpn:$OMR_ADDR   tcp     1234    # OMR openmptcprouter redirect router 1234 port tcp
  2. From a server on the internet, download a file from port 1234: curl https://domain:1234/file.bin -o /dev/null

  3. Observe the status page: https://host/cgi-bin/luci/admin/system/openmptcprouter/status
    -> Proxy traffic (v2ray) is not increasing, VPN traffic (tun0) is increasing!

  4. On the router, restart v2ray while the file is transferring: /etc/init.d/v2ray restart.
    Observe the bandwidth page (https://host/cgi-bin/luci/admin/network/mptcp/bandwidth):
    -> Transfer continues

  5. On the router, restart glorytun while the file is transferring: /etc/init.d/glorytun restart.
    Observe the bandwidth page (https://host/cgi-bin/luci/admin/network/mptcp/bandwidth):
    -> Transfer is stopped

So uploads through the port forward seem not to use the proxy (v2ray) but actually use tun0 (the glorytun VPN).

@ioogithub
Copy link
Author

I double checked: V2Ray was not checked!!! On this new install I was using glorytun all along for uploads.

Okay, this makes sense now; the behavior is as expected.

@xzjq
Copy link
Sponsor

xzjq commented Nov 25, 2023

I seem to experience this issue as well. I have three WANs and they do not remain aggregated for bandwidth for established connections over time (status page indicates everything is fine, but the MPTCP bandwidth page only shows one or two of the interfaces being used for traffic). As @ioogithub noted, the disaggregation seems to occur after omr-tracker slays an interface, e.g. "post-tracking-001-post-tracking: wan1 (eth0.300) switched off because check error and ping" even though the interface is subsequently switched up within a minute or two.

New connections will continue to use aggregated bandwidth, but established connections (e.g. a TCP-based VPN on a client device that is tunneled via the router) will only use 1 or 2 of the WANs. Aggregation for those existing connections can be restored by going to the Settings Wizard and clicking "save and apply" with nothing changed, and of course this flushes the MPTCP route table, etc., as can be seen in logread. Similarly, kicking the (non-OMR) VPN tunnel established on the client device will also restore aggregation for that connection.

This is on [v0.60beta1-5.4 r0+16862-170d9e447d]

@xzjq
Copy link
Sponsor

xzjq commented Dec 10, 2023

The scenario seems to be that MPTCP has aggregated connections, but when an interface drops it loses the subflows for that connection (this makes sense). If the interface comes back, MPTCP does not "heal" and add replacement subflows using that interface for existing connections.

I do not know whether MPTCP even has the ability to add subflows for currently existing/ongoing connections. However, when the interface is up/status page is all green, it seems to create new connections that use multipath/aggregation.

E.g. when I have a client device TCP VPN running for hours, it will eventually disaggregate and use only 1-2 WANs (one is always the master), as evidenced by looking at the bandwidth graphs where one or more WANs has 0 traffic.

Restarting the client device TCP VPN can restore aggregated performance, where all WANs show 40+ Mbps traffic during speedtests.

Copy link

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days
