Multipath not restored after interface is restarted, OMR becomes single path TCP router. #2936
Comments
I have more information. I can reproduce the bug, not with omr-tracker but by stopping the interface (wan2) and starting it again. After I start it, OMR never uses it for MPTCP, so it becomes a single-path router. Here is the log after wan2 started again:
So I guess I checked |
Try to reload v2ray? Only some VPNs use MPTCP. |
If I reload with
If I restart with MPTCP is working perfectly until an interface is restarted or omr-tracker brings an interface down and up. After that, this interface cannot be used for multipath. There has to be a way to tell v2ray that multipath is available to be used again. When I bring the interface down, OMR knows how to fail over to one interface successfully; it should be able to recover when the interface comes back up. |
Do you have same issue using Shadowsocks ? |
It's not a known problem and I don't have this issue on my 0.59.1 (and still not with the latest snapshot). |
Did you test it with an upload, so it would use the port forward and the proxy? Either shadowsocks or v2ray gives the same result for me. I think I saw a download working once, but I am not sure because I am testing uploads now. I am looking at ip a, ip r, ip rule and multipath. I will compare the results from a working, freshly rebooted OMR with the results after an interface is restarted and incoming transfers are not using multi-path. |
I did lots of tests:
The routes, rules, interfaces and multipath settings are all identical when it's working and when it is broken after restarting an interface.
Here are the outputs; everything looks normal, nothing stands out to me:
Do you have any ideas of what else I can look at? It's like the proxy gets stuck on a single path or doesn't get notified that multi-path is available again. Questions:
If ip r, ip rule etc. are the same when it is working and when it is not, it must be something else. What else should I look for? |
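A minimal sketch of how the before/after comparison described above could be made repeatable. It assumes `diff` is installed on the router and that `multipath` is the OMR helper mentioned earlier in the thread; the function name `snapshot` and the `/tmp/state-*` directories are hypothetical.

```shell
#!/bin/sh
# Sketch: capture the output of a list of commands into one file per command,
# so a "working" and a "broken" state can be compared with diff.

# snapshot DIR CMD... : run each CMD and store its output in DIR/<n>.txt
snapshot() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    n=0
    for cmd in "$@"; do
        n=$((n + 1))
        # The first line of each file records which command produced it.
        {
            echo "# $cmd"
            sh -c "$cmd" 2>&1
        } > "$dir/$n.txt"
    done
}

# Usage on the router (the command list is the one named in the thread):
#   snapshot /tmp/state-good   "ip a" "ip r" "ip rule" "multipath"
#   ... restart wan2 and reproduce the single-path state ...
#   snapshot /tmp/state-broken "ip a" "ip r" "ip rule" "multipath"
#   diff -r /tmp/state-good /tmp/state-broken && echo "no difference"
```

Address lifetimes printed by `ip a` can differ between two runs even when nothing relevant changed, so a non-empty diff is only a starting point for what to look at.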
Only mptcp init script is used, to set route table and multipath command to set multipath status of an interface. What is the MPTCP scheduler used ? |
this script sets route table: /etc/init.d/mptcp
Is this command inside the mptcp init script? But if the route table and multipath status are the same before and after the interface is restarted, then the problem is somewhere else. Does this mean that only the routes and the multipath status of an interface determine whether the proxy will use multipath? So the proxy pushes data out and the kernel uses the routing table to route over multiple interfaces? Does this mean the proxy doesn't know about MPTCP? Are there no other settings that notify the proxy?
I think it is only uploads, when a client from outside the network requests data from a computer inside the network after an interface has been restarted, it does not use that interface. It sends data only on the other interface. In this case port forwarding is used. I don't know if port forwarding is involved but downloads do not use that port forwarding.
For this I used different port forwarding rules when switching to shadowsocks.
Yes I am using the v2ray checkbox port forwarding rule. All of this works normally before an interface is stopped and started or restarted. I see the problem with both v2ray and shadowsocks:
It seems like the proxy gets stuck, it doesn't get notified that both wan1 and wan2 are available. I will start a fresh test:
Scheduler: BLEST
There is an MPTCP debug setting; if I set it, will we get more data in the logs? If you have any more ideas please let me know. |
I have done a lot of testing today, here is the result:
v2ray multi-path
uploads
downloads
shadowsocks multi-path
uploads
downloads
Conclusion
I initially thought the problem was uploads for both shadowsocks and v2ray, but I was wrong. With shadowsocks, it sometimes starts working only after a delay of up to 2 minutes, and sometimes it starts right away. In my previous tests I did not wait long enough; I did see it working today. So the problem is isolated to a specific use case: v2ray uploads (traffic initiated from outside the network using the v2ray port forwarding) only. I have never seen this work. I have also tried the following:
Nothing I can do will restore aggregate multi-path for uploads with v2ray once an interface is restarted. Traffic always flows through the one interface that was not restarted.
The only way to recover from this is to click Save and Apply on the Wizard screen or reboot OMR. Question: how is the MPTCP actually working on OMR? The v2ray config.json doesn't have MPTCP set in the stream settings, it is just TCP so I guess v2ray doesn't even know about MPTCP so how does it even work?
You tested uploads with port forwarding using v2ray as the proxy with 0.59.1 and it is working for you, correct?
If this works for you and it does not work for me that suggests a configuration problem. Tomorrow I will setup a VPS with the 6.1 upstream kernel and boot my router with openmptcprouter-v0.59.2alpha-6.1-r0+23789-ce6ad123e7-x86-64-generic-ext4-combined-efi, configure v2ray and port forwarding and repeat the tests above. I will report the results back here. |
Here is the result of testing with openmptcprouter v0.59.2alpha-6.1 r0+23789-ce6ad123e7.
Configure openwrt:
shadowsocks upload: multipath working Configure V2ray:
Test v2ray upload after interface restart. Same results as the latest stable release: uploads do not use aggregate multi-path with v2ray as the proxy after an interface is restarted. There is currently no way to recover from this.
With stable, at least Save and Apply or rebooting OMR will fix the problem. In 6.1 this does not work. So for this issue, at this time, it is better to stay with stable until a workaround is found. |
This is an example of an event that kills aggregate uploads:
The OMR-tracker brings the interface down and up again. After this, OMR is in single mode and will not use wan2 again until the server is restarted so this issue has a big impact on OMR stability and reliability. There has to be a way to have uploads recover without manual user intervention after these interface reset events. |
On latest snapshot, I'm not able to reproduce the issue for now using "iperf3" to test upload speed with V2Ray VLESS. |
Yes, I am doing a real-world test like this: external client -> VPS (public IP) -> port forward OMR via v2ray vless -> host on LAN. What exact test are you running?
For my test, after OMR is restarted, if I do a download from an external client (upload) in a good state, I get this:
If I do a download from an external client (upload) after an interface was restarted (by omr-tracker or manually), I get this:
It never recovers from this. omr-tracker reset wan2 8 hours ago:
If I do this: |
Here is a new upload test as discussed, using iperf and a new port forward setup for the test. iperf test:
Result: bandwidth never returns to wan2: https://ibb.co/x8tJqQf The problem is with stable |
I've tested, using V2RAY VLESS protocol redirect port to internal http server and downloading file from external and no problem, when I remove a connection and put it back it's used again. Can you give me full log when you test disconnect and then reconnect a connection ? |
Yes, please let me know exactly which log you are looking for ? logread from OMR, journalctl -f from vps or something else? Is there any debug mode I can enable to get more info? |
I would need system log from the router, so the result of I tested with a new VPS and router installation and I have no issue. What is the result of |
Can you try to do |
Result: I cannot see any effect from this command. There was one error after I reset wan2: Could this be the problem? It happens when wan2 is brought down. I am going to reboot OMR to a known good state and watch the logs closely again. |
Here is a log with two questions I have.
Is there an OMR process that is monitoring something and trying to optimize MTU and changes it while the interface is up and running? The consequence is that it restarts the interface after and then the interface isn't used for uploads because of this issue.
Sometimes I see these messages every few minutes and sometimes I do not see them in the log for 1 hour. What are these for, and what determines how often they run? Is this normal? Also, is glorytun actually restarting every time this is logged: |
I rebooted OMR at 17:00, same issue, the kernel or some OMR process changed the MTU again at 17:13:
and then the log file is quiet: no omr-tracker or glorytun log messages for 2 hours. What is the purpose of resetting the MTU, and why don't I see the omr-tracker, glorytun, and MPTCP flush cache messages any longer? |
As soon as I initiate 1 file upload, these messages appear again, every few minutes. Are they related to connections or file transfers? Are they related to this issue?
|
You can fix the MTU in Network->Interfaces for each interface, then it will not try to calculate the MTU (but changing the MTU should not put the interface down). |
Are you sure? I always see a reset after an MTU change: I get this:
I can also reproduce it with this:
It will always reset the interface. But this MTU change just contributes to the problem; fixing it doesn't fix the root cause of aggregate being broken after an interface is restarted. What about this error after I reset wan2: |
The attack is on the VPS, not related to eth2 on the router. You can increase timeout, tries and retry, but it's a connection issue. Strange that a v2ray restart on the VPS changes something... |
I found something interesting while looking for a solution. I noticed that v2ray spawns a lot of unconnected UDP sockets on the VPS that never have any data in their send or receive queues. Here is an example from this command:
v2ray was restarted 1h20m ago and it currently has 276 of these open, unused sockets, which seems high. Is this normal? I also see a flood of connect messages in the journalctl from the VPS like this:
For example, for this second (00:06:56) there were 30 entries. Usually it's closer to 1-5 every second. Are these journal logs related to the open sockets? I recognize a lot of these IP addresses from the omr-tracker page. Does omr-tracker open a new UDP socket for every connection attempt? Another strange thing about these connection logs is that the address (a.a.a.a) is always wan1; I never see any for wan2. Is this also normal? My current fix for this single-path aggregate problem is to restart v2ray, after which incoming transfers on my port use aggregate again. I noticed that I have to restart roughly every 1 or 2 hours to restore multipath aggregate. Is v2ray becoming overwhelmed by this large number of open sockets, or is this a normal amount? Why are they opened and not closed? Why always UDP? |
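The command used above did not survive in this copy of the thread. As a stand-in, here is a hedged sketch of one way to count unconnected UDP sockets owned by v2ray on the VPS, so the number can be tracked over time. It relies on the `ss -uanp` output format from iproute2; the function name `count_unconn_udp` is hypothetical.

```shell
#!/bin/sh
# count_unconn_udp NAME : read `ss -uanp` output on stdin and print how many
# UDP sockets in state UNCONN belong to a process called NAME.
count_unconn_udp() {
    awk -v name="$1" '
        # ss prints the owning process as users:(("NAME",pid=...,fd=...))
        $1 == "UNCONN" && index($0, "(\"" name "\",") { n++ }
        END { print n + 0 }
    '
}

# Usage on the VPS (root is needed for -p to show process names):
#   ss -uanp | count_unconn_udp v2ray
```

Running this every few minutes and noting the value next to the time aggregation stops would show whether the socket count and the single-path state are actually correlated.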
You shouldn't have so many UDP sockets if you don't use UDP too much. |
I don't have many; this behavior doesn't look right to me.
The destination IPs are always the IP address of WAN1. I did this
Proxy:
Server tracker:
Default settings:
OMR VPN:
There are so many messages in journalctl that I always have to do something like this:
Also, these show up as warnings. Why are connection events warnings? From config.json:
Why do these requests only go to wan1, and why are there so many of them? Maybe this has something to do with the problem: after a while (around 1-2 hours) v2ray aggregate stops working, and it always works after a restart. Also, after 1-2 hours there are hundreds of unconnected UDP sockets open. I restarted v2ray 30 minutes ago and I am already at 242 open sockets:
They are all like this, on different port numbers:
Here is an example of a legitimate port that I am forwarding:
I will double check. Are you asking for the journalctl log for the wan connection when the VPS is booted, or something else? |
I just had a chance to test this change we made earlier:
but it did not work. eth2 was just restarted by omr-tracker; it came back online but v2ray did not restart. Here is the log:
Here is the code that detected wan2 was back up:
|
Yes, I made a mistake, it's |
Thanks, I will try this new command. Is there a reverse command to delete the old one? Also, I am seeing some error messages on the vps like this:
x.x.x.x is the public IP of my wan1 router interface. Are these errors typical? What causes them, and could they be a problem? |
|
Have you never seen this error before? I have another one on the VPS; a.a.a.a is wan1 and x.x.x.x is the VPS IP:
There is no corresponding error on the router at this time. What does the message mean? Is the v2ray connection between the VPS and the router being terminated? Is there any way to troubleshoot this further?
|
It seems to happen once per day on average: Sep 11 15:55:59 vps v2ray[818]: 2023/09/11 15:55:59 y.y.y.y:12679 rejected proxy/vless/encoding: failed to read request version > read tcp x.x.x.x:65228->y.y.y.y:12679: read: connection reset by peer Can you think of any way to troubleshoot this? Why is the v2ray server rejecting the v2ray client on OMR? Maybe this is contributing to the problem. |
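To check the "once per day on average" estimate against the journal, a small filter along these lines could count the rejected entries per day. This is only a sketch: it assumes the default journalctl line format shown in the log line above, and the function name `rejected_per_day` and the unit name `v2ray` are assumptions.

```shell
#!/bin/sh
# rejected_per_day : read syslog-style v2ray log lines on stdin and print
# "<month> <day> <count>" for every day that has "rejected" entries.
rejected_per_day() {
    awk '
        / rejected / { count[$1 " " $2]++ }
        END { for (day in count) print day, count[day] }
    ' | sort
}

# Usage on the VPS (the unit name "v2ray" is an assumption):
#   journalctl -u v2ray --no-pager | rejected_per_day
```

The output is sorted as text, not chronologically, so entries from different months may appear out of calendar order.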
I am modifying my new install to restart glorytun, not v2ray, when an interface restarts, because this fixes aggregate.
uci show:
When I restart eth2, run This command does not run. It did work when I had the v2ray command; why won't this command run to restart glorytun? Is this command correct? I confirm, aggregate is not automatically restored. I do need to find a way to automatically restart glorytun on an interface restart. |
Okay, it is actually working; my test was wrong. It was restarting the interface; when I did a proper stop and start, it did trigger:
I did not see this before. Aggregate is restored, but this is not a great solution because it is too disruptive. Is there any other way to fix this issue? I am using v2ray for TCP and UDP, so why does restarting the glorytun VPN fix aggregate issues with v2ray? ip route show table 3 is the same when there is no aggregate and when it is working again. |
After restarting glorytun I got a kernel panic, first time I have seen this on omr:
not sure what the implications of this are. |
Here is a real world example. eth2 went down:
but this did not run:
I had to run this manually, then it returned to aggregate: /etc/init.d/glorytun restart. Is there another way to get glorytun to restart automatically when an interface goes down and up again? |
The problem is that this command does not persist across reboots:
The value is not written to /etc/config/omr-tracker. This setting does work; how can I get it to persist? |
I saw this for the first time ever in the log today:
I don't think this has been running properly before. |
I believe I was able to get the command to persist by editing /etc/config/omr-tracker and adding the following:
Will this cause any issues with other scripts that read this file? I am seeing a new problem: sometimes restarting glorytun does not work now. A few times I had to restart v2ray, and then it restored aggregate uploads. Now I don't know which one to restart, v2ray or glorytun, and I don't know what to look for to tell which service needs to be restarted. Also, the order is not right now. During the last restart, glorytun was restarted before this log line:
but it needs to be restarted after this time in order to restore aggregate. Is this possible? |
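One possible way to get a delayed restart, sketched with the generic OpenWrt hotplug mechanism, not with omr-tracker's own hooks. Scripts in /etc/hotplug.d/iface/ are called with ACTION and INTERFACE set; the file name, the 30-second delay, and the variable names below are assumptions, and how this interacts with omr-tracker is untested.

```shell
#!/bin/sh
# Sketch of an OpenWrt hotplug handler, e.g. /etc/hotplug.d/iface/99-restart-tunnel
# (hypothetical file name). netifd exports ACTION and INTERFACE to these scripts.

WATCHED_IFACE="wan2"                      # logical interface from the thread
SETTLE_SECONDS="${SETTLE_SECONDS:-30}"    # assumed delay, tune it to your logs
RESTART_CMD="${RESTART_CMD:-/etc/init.d/glorytun restart}"

# handle_iface_event ACTION INTERFACE : restart the tunnel some time after ifup
handle_iface_event() {
    [ "$1" = "ifup" ] || return 0
    [ "$2" = "$WATCHED_IFACE" ] || return 0
    # Wait so the restart lands after the MPTCP reload, not before it.
    sleep "$SETTLE_SECONDS"
    $RESTART_CMD
}

# When installed as a hotplug script, run the handler in the background so
# the sleep does not hold up other hotplug scripts:
#   handle_iface_event "$ACTION" "$INTERFACE" &
```

As noted earlier in the thread, restarting glorytun is disruptive and was once followed by a kernel panic, so this is a workaround to experiment with, not a fix for the root cause.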
To keep uci settings after reboot, you need to do a |
Ysurac, can you explain? I was looking at why tun0 was disconnecting and I realized this: when using v2ray with port forwarding, OMR sets these rules on the VPS in /etc/shorewall/rules:
I verified it by stopping tun0, and it does prevent any uploads. So this means that uploads use the glorytun tun0? This also explains why '/etc/init.d/glorytun restart' returns the bandwidth graph to aggregate mode again. How does this work? Do all uploads use the glorytun VPN? Why does v2ray open ports for glorytun? I want uploads to use v2ray the same as downloads; how do I configure this? Is my whole upload aggregate problem in this thread actually a glorytun problem, not a v2ray problem? |
V2Ray is used only for port forwarding when the checkbox V2Ray is checked in the port forwarding configuration. |
Yes this is what I want but I don't think it is working. I think uploads are using the tun0 (glorytun) not the v2ray according to status page and restarting services. To test:
So uploads using port forward seems to not use proxy (v2ray) but actually uses tun0 glorytun (vpn). |
I double checked, v2ray was not checked!!! On this new install I was using glorytun all along for uploads. Okay, this makes sense now; the behavior is as expected. |
I seem to experience this issue as well. I have three WANs and they do not remain aggregated for bandwidth for established connections over time (status page indicates everything is fine, but the MPTCP bandwidth page only shows one or two of the interfaces being used for traffic). As @ioogithub noted, the disaggregation seems to occur after omr-tracker slays an interface, e.g. New connections will continue to use aggregated bandwidth, but established connections (e.g. a TCP-based VPN on a client device that is tunneled via the router) will only use 1 or 2 of the WANS. Aggregation for those existing connections can be restored by going to the Settings Wizard and clicking "save and apply" with nothing changed, and of course this flushes the MPTCP route table, etc, as can be seen in logread. Similarly, kicking the (non-OMR) VPN tunnel established on the client device will also restore aggregation for that connection. This is on |
The scenario seems to be that MPTCP has aggregated connections, but when an interface drops it loses the subflows for that connection (this makes sense). If the interface comes back, MPTCP does not "heal" and add replacement subflows using that interface for existing connections. I do not know whether MPTCP even has the ability to add subflows for currently existing/ongoing connections. However, when the interface is up/status page is all green, it seems to create new connections that use multipath/aggregation. E.g. when I have a client device TCP VPN running for hours, it will eventually disaggregate and use only 1-2 WANs (one is always the master), as evidenced by looking at the bandwidth graphs where one or more WANs has 0 traffic. Restarting the client device TCP VPN can restore aggregated performance, where all WANs show 40+ Mbps traffic during speedtests. |
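A rough way to confirm which WAN devices carry traffic during a transfer, without relying on the bandwidth graphs, is to sample the kernel's per-interface byte counters. This is a sketch under assumptions: the function names and the `SYSFS_NET` override are hypothetical, and it shows traffic per device, not MPTCP subflows as such.

```shell
#!/bin/sh
# Sketch: report how many bytes an interface transmitted during a short
# window, to spot a WAN that carries no traffic while a transfer is running.

# Root of the per-interface counters; overridable only to make testing easy.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# tx_bytes IFACE : print the kernel transmit byte counter for IFACE
tx_bytes() {
    cat "$SYSFS_NET/$1/statistics/tx_bytes"
}

# tx_delta SECONDS IFACE : print "<iface> <bytes sent during the window>"
tx_delta() {
    start=$(tx_bytes "$2") || return 1
    sleep "$1"
    end=$(tx_bytes "$2") || return 1
    echo "$2 $((end - start))"
}

# Usage during an upload; eth2 is wan2 in the thread, eth1 is an assumed
# device name for wan1. Each sample runs in its own background subshell.
#   for dev in eth1 eth2; do tx_delta 10 "$dev" & done; wait
```

A device that reports a value near zero while the others report large values matches the disaggregated state described above.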
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days |
Expected Behavior
Multipath should be restored to the same state after omr-tracker stops and starts an interface with all wan connections used.
Current Behavior
Multipath is broken after omr-tracker restarts an interface, traffic now only uses one wan, OpenMPTCProuter becomes OpenSPTCProuter (single path TCP router).
Steps to Reproduce the Problem
https://ibb.co/2YZDzBB
Start a new upload, traffic now only uses wan1, router is not Multi-path any longer:
https://ibb.co/94QZJLq
Router never recovers from this state. Start another upload 1 hour later and it still only uses wan1.
ip r shows there are routes and default routes for both wan1 and wan2, but MPTCP refuses to use wan2 after omr-tracker restarts it.
Possible Solution 1
I tried two steps to fix the problem:
Run /etc/init.d/openmptcprouter-vps restart while watching logread -f on OMR and journalctl -f on the VPS. I do not see any log events after this command! The command executed and exited, but it didn't do anything observable from the logs.
Possible Solution 2
Save and Apply from the wizard page.
Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: Reload MPTCP for eth2
I guess this is where the bug is: Reload MPTCP is not properly restoring the MPTCP bond. Is there a way to get more information on what MPTCP is doing here? Is there any debug mode?
I have been tracking this problem for a long time where I see the performance of the system degrade over time.
I didn't have the knowledge until recently to isolate the bug and report it. I am available to test any solutions.
Context (Environment)
The issue is bad because it effectively breaks OMR. If only a single path is used after OMR-tracker then there is no purpose to run OMR at all. Also there is no way currently to recover from this problem.
Specifications