[CLOUDSTACK-10346] Problem with NAT configuration and VMs not accessing each other via public IPs #2514

Closed
wants to merge 1 commit into base: 4.11

Conversation

@rafaelweingartner
Member

rafaelweingartner commented Mar 27, 2018

Description

When users create a VPC and configure a NAT rule from a public IP to an application in a VM, that VM (application) is not accessible via the public IP from other VMs in the same VPC.

The problem is in the NAT table. If you take a closer look at the rules, you will see something like:
-A PREROUTING -d publicIP/32 -i eth1 -p tcp -m tcp --dport 80 -j DNAT --to-destination internalIp:80

The problem is that, according to this rule, only packets coming in via eth1 (the public interface) will be “redirected” to the internal IP. We need an extra entry for each NAT configuration. For the rule above, we would need something like:
-A PREROUTING -d publicIP/32 -i eth2 -p tcp -m tcp --dport 80 -j DNAT --to-destination internalIp:80
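
For illustration only, a minimal shell sketch of the idea (addresses are placeholder values; in CloudStack the actual rules are generated by configure.py on the VR):

# emit one DNAT entry per router interface, not only the public NIC (eth1)
publicIP=203.0.113.10; internalIp=10.1.1.50   # placeholder addresses
for nic in eth1 eth2; do
    iptables -t nat -A PREROUTING -d ${publicIP}/32 -i ${nic} -p tcp -m tcp --dport 80 -j DNAT --to-destination ${internalIp}:80
done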

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

Locally in a development environment, with both XenServer 6.5 and 7.2.

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@blueorangutan package

@blueorangutan

blueorangutan commented Mar 27, 2018

@rafaelweingartner a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

blueorangutan commented Mar 27, 2018

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1842

@resmo

Member

resmo commented Mar 27, 2018

any reason this should not be targeted for 4.11.1?

@rafaelweingartner rafaelweingartner changed the base branch from master to 4.11 Mar 27, 2018

@rafaelweingartner rafaelweingartner changed the base branch from 4.11 to master Mar 27, 2018

@rafaelweingartner

Member

rafaelweingartner commented Mar 27, 2018

No reason at all. It is just my habit of opening PRs against master directly.

@rhtyd

Member

rhtyd commented Mar 28, 2018

@rafaelweingartner since it's a useful bugfix, can you change the base branch and rebase the PR against 4.11 branch?

@ustcweizhou

Contributor

ustcweizhou commented Mar 28, 2018

@rafaelweingartner We faced the same issue before.
It is fixed by adding some rules to the ip route tables:

$ ip route show table Table_eth1
throw 10.10.1.0/24  proto static
throw 10.10.2.0/24  proto static

eth1 is the public interface.
10.10.1.0/24 and 10.10.2.0/24 are the CIDRs of the VPC tiers.

The idea came from the code before 4.7 (the systemvm refactoring):
https://github.com/apache/cloudstack/blob/4.5/systemvm/patches/debian/config/opt/cloud/bin/ipassoc.sh
see copy_routes_from_main at line 122
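
A minimal shell sketch of that workaround (tier CIDRs and table name taken from the example above):

# add one throw route per VPC tier CIDR, so a lookup in the public table
# gives up for intra-VPC destinations instead of using the default route
for cidr in 10.10.1.0/24 10.10.2.0/24; do
    ip route add throw ${cidr} table Table_eth1 proto static
done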

@rafaelweingartner

Member

rafaelweingartner commented Mar 28, 2018

To tell you the truth, I do not understand how you solved this in the routing table. I saw your PR, and it is merged in 4.9.3.0, 4.11, and master, yet the problem persists.

I am solving it in iptables, making it consistent with other configurations that we already have for different scenarios. This solution (the one I am introducing here) is already used when we "attach"/"direct connect" a public IP to a VM. Check configure.py at lines 816-836.

When a packet comes from eth2 (the internal interface) to one of our public IPs, we need to apply NAT to packets from eth2 as well if we want VMs to access each other via their public IPs.

@ustcweizhou

Contributor

ustcweizhou commented Mar 29, 2018

@rafaelweingartner thanks for your reply.
I will test it and let you know the result. It might take a few days.

This issue does not exist on our platform based on 4.7.1 (with some other changes).

@borisstoyanov

Contributor

borisstoyanov commented Apr 2, 2018

@blueorangutan package

@blueorangutan

blueorangutan commented Apr 2, 2018

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

blueorangutan commented Apr 2, 2018

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1857

@borisstoyanov

Contributor

borisstoyanov commented Apr 2, 2018

@blueorangutan

blueorangutan commented Apr 2, 2018

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

[CLOUDSTACK-10346] VPC-NAT configuration, access via public IPs does not work

When users create a VPC and configure a NAT rule from a public IP to an application in a VM, that VM (application) is not accessible via the public IP from other VMs in the same VPC.

The problem is in the NAT table. If you take a closer look at the rules, you will see something like:

-A PREROUTING -d publicIP/32 -i eth1 -p tcp -m tcp --dport 80 -j DNAT --to-destination internalIp:80

The problem is that, according to this rule, only packets coming in via eth1 (the public interface) will be “redirected” to the internal IP. We need an extra entry for each NAT configuration. For the rule above, we would need something like:

-A PREROUTING -d publicIP/32 -i eth2 -p tcp -m tcp --dport 80 -j DNAT --to-destination internalIp:80

@rafaelweingartner rafaelweingartner changed the base branch from master to 4.11 Apr 2, 2018

@rafaelweingartner

Member

rafaelweingartner commented Apr 2, 2018

@rhtyd changed the target branch to 4.11

@rafaelweingartner rafaelweingartner modified the milestones: 4.12.0.0, 4.11.1.0 Apr 2, 2018

@blueorangutan

blueorangutan commented Apr 3, 2018

Trillian test result (tid-2443)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 91103 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2514-t2443-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_certauthority_root.py
Intermittent failure detected: /marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_routers.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 65 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_04_restart_network_wo_cleanup Failure 4.00 test_routers.py
test_hostha_enable_ha_when_host_in_maintenance Error 1.45 test_hostha_kvm.py
test_hostha_kvm_host_degraded Error 1992.79 test_hostha_kvm.py

@borisstoyanov

@rafaelweingartner Marvin tests look good. Can you add a message describing how you've tested this in your environment?

@rafaelweingartner

Member

rafaelweingartner commented Apr 3, 2018

Sure. I did the following to test this issue:

  • created a VPC
  • then allocated an IP
  • deployed a VM
  • configured a NAT rule for port 22 to the recently deployed VM
  • tried to access the VM (from itself) via SSH using the public IP

Without the changes introduced by this PR, it is not possible to execute the last step.
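
In shell terms, the last step is roughly the following (user and address are placeholders):

# run from the deployed VM itself; without the fix the connection never establishes
ssh -p 22 user@PUBLIC_IP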

@borisstoyanov

Thank you @rafaelweingartner.
LGTM

@rafaelweingartner rafaelweingartner changed the title from Problem with NAT configuration and VMs not accessing each other via public IPs to [CLOUDSTACK-10346] Problem with NAT configuration and VMs not accessing each other via public IPs Apr 3, 2018

@rafaelweingartner

Member

rafaelweingartner commented Apr 6, 2018

@ustcweizhou are you ok with the changes introduced here?

@rhtyd

Member

rhtyd commented Apr 18, 2018

@rafaelweingartner okay, I did a quick test. I allocated a public IP to my VPC, but did not SNAT it to any VM. Next, I used the console proxy and was able to SSH from another VM to the VM that the public IP was port-forwarded to (NAT).

iptables rules outputs:

#iptables -t nat -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-A PREROUTING -d 192.168.1.53/32 -p tcp -m tcp --dport 22 -j DNAT --to-destination 10.1.1.50:22
-A OUTPUT -d 192.168.1.53/32 -p tcp -m tcp --dport 22 -j DNAT --to-destination 10.1.1.50:22
-A POSTROUTING -s 10.1.1.0/24 -o eth2 -j SNAT --to-source 10.1.1.1
-A POSTROUTING -o eth1 -j SNAT --to-source 192.168.1.56
-A POSTROUTING -d 192.168.1.53/32 -p tcp -m tcp --dport 22 -j SNAT --to-source 10.1.1.50:22

# iptables -t mangle -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N ACL_OUTBOUND_eth2
-N VPN_STATS_eth1
-A PREROUTING -i eth2 -m state --state RELATED,ESTABLISHED -j CONNMARK --restore-mark --nfmask 0xffffffff --ctmask 0xffffffff
-A PREROUTING -s 10.1.1.0/24 ! -d 10.1.1.1/32 -i eth2 -m state --state NEW -j ACL_OUTBOUND_eth2
-A FORWARD -j VPN_STATS_eth1
-A POSTROUTING -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill
-A ACL_OUTBOUND_eth2 -d 224.0.0.18/32 -j ACCEPT
-A ACL_OUTBOUND_eth2 -j ACCEPT
-A ACL_OUTBOUND_eth2 -d 225.0.0.50/32 -j ACCEPT
-A VPN_STATS_eth1 -o eth1 -m mark --mark 0x525
-A VPN_STATS_eth1 -i eth1 -m mark --mark 0x524


# iptables -t filter -S
-P INPUT DROP
-P FORWARD DROP
-P OUTPUT ACCEPT
-N ACL_INBOUND_eth2
-N NETWORK_STATS
-N NETWORK_STATS_eth1
-A INPUT -d 10.1.1.1/32 -i eth2 -p tcp -m tcp --dport 443 -m state --state NEW -j ACCEPT
-A INPUT -d 10.1.1.1/32 -i eth2 -p tcp -m tcp --dport 80 -m state --state NEW -j ACCEPT
-A INPUT -d 10.1.1.1/32 -i eth2 -p tcp -m tcp --dport 53 -j ACCEPT
-A INPUT -d 10.1.1.1/32 -i eth2 -p udp -m udp --dport 53 -j ACCEPT
-A INPUT -j NETWORK_STATS
-A INPUT -i eth2 -p udp -m udp --dport 67 -j ACCEPT
-A INPUT -s 10.1.1.0/24 -i eth2 -p udp -m udp --dport 53 -j ACCEPT
-A INPUT -s 10.1.1.0/24 -i eth2 -p tcp -m tcp --dport 53 -j ACCEPT
-A INPUT -i eth2 -p tcp -m tcp --dport 80 -m state --state NEW -j ACCEPT
-A INPUT -i eth2 -p tcp -m tcp --dport 8080 -m state --state NEW -j ACCEPT
-A INPUT -d 224.0.0.18/32 -j ACCEPT
-A INPUT -d 225.0.0.50/32 -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --dport 3922 -m state --state NEW,ESTABLISHED -j ACCEPT
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -j NETWORK_STATS_eth1
-A FORWARD -j NETWORK_STATS
-A FORWARD -s 10.0.0.0/8 ! -d 10.0.0.0/8 -j ACCEPT
-A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -d 10.1.1.0/24 -o eth2 -j ACL_INBOUND_eth2
-A OUTPUT -j NETWORK_STATS
-A ACL_INBOUND_eth2 -d 225.0.0.50/32 -j ACCEPT
-A ACL_INBOUND_eth2 -d 224.0.0.18/32 -j ACCEPT
-A ACL_INBOUND_eth2 -j ACCEPT
-A ACL_INBOUND_eth2 -j DROP
-A NETWORK_STATS -i eth0 -o eth2 -p tcp
-A NETWORK_STATS -i eth2 -o eth0 -p tcp
-A NETWORK_STATS ! -i eth0 -o eth2 -p tcp
-A NETWORK_STATS -i eth2 ! -o eth0 -p tcp
-A NETWORK_STATS_eth1 -s 10.1.1.0/24 -o eth1
-A NETWORK_STATS_eth1 -d 10.1.1.0/24 -i eth1

You can check my lab setup here https://lab.yadav.cloud/stack/ (use cloud:cloud, it's a read only admin :) )

@rhtyd

Member

rhtyd commented Apr 18, 2018

@rafaelweingartner what are the ACLs for the VPC? I used an allow-all policy (i.e., both ingress and egress allow all).

@rafaelweingartner

Member

rafaelweingartner commented Apr 18, 2018

I just tested in your system, and it is indeed working. I applied your changes here and am now restarting the network to see what happens.

@rafaelweingartner

Member

rafaelweingartner commented Apr 18, 2018

No success :(
Here are my iptables rules (with PR #2579):

# Generated by iptables-save v1.4.14 on Wed Apr 18 17:44:29 2018
*nat
:PREROUTING ACCEPT [3:180]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [6:456]
:POSTROUTING ACCEPT [0:0]
-A PREROUTING -d 10.0.2.102/32 -i eth1 -p tcp -m tcp --dport 22 -j DNAT --to-destination 172.16.0.214:22
-A PREROUTING -d 10.0.2.103/32 -i eth1 -p tcp -m tcp --dport 22 -j DNAT --to-destination 172.16.0.168:22
-A OUTPUT -d 10.0.2.102/32 -p tcp -m tcp --dport 22 -j DNAT --to-destination 172.16.0.214:22
-A OUTPUT -d 10.0.2.103/32 -p tcp -m tcp --dport 22 -j DNAT --to-destination 172.16.0.168:22
-A POSTROUTING -s 172.16.0.0/24 -o eth2 -j SNAT --to-source 172.16.0.1
-A POSTROUTING -o eth1 -j SNAT --to-source 10.0.2.130
-A POSTROUTING -o eth1 -j SNAT --to-source 10.0.2.102
-A POSTROUTING -d 10.0.2.102/32 -p tcp -m tcp --dport 22 -j SNAT --to-source 172.16.0.214:22
-A POSTROUTING -d 10.0.2.103/32 -p tcp -m tcp --dport 22 -j SNAT --to-source 172.16.0.168:22
COMMIT
# Completed on Wed Apr 18 17:44:29 2018
# Generated by iptables-save v1.4.14 on Wed Apr 18 17:44:29 2018
*mangle
:PREROUTING ACCEPT [215:13820]
:INPUT ACCEPT [195:11692]
:FORWARD ACCEPT [23:2308]
:OUTPUT ACCEPT [166:12048]
:POSTROUTING ACCEPT [189:14356]
:ACL_OUTBOUND_eth2 - [0:0]
:VPN_STATS_eth1 - [0:0]
-A PREROUTING -i eth2 -m state --state RELATED,ESTABLISHED -j CONNMARK --restore-mark --nfmask 0xffffffff --ctmask 0xffffffff
-A PREROUTING -s 172.16.0.0/24 ! -d 172.16.0.1/32 -i eth2 -m state --state NEW -j ACL_OUTBOUND_eth2
-A FORWARD -j VPN_STATS_eth1
-A POSTROUTING -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill
-A ACL_OUTBOUND_eth2 -d 224.0.0.18/32 -j ACCEPT
-A ACL_OUTBOUND_eth2 -j ACCEPT
-A ACL_OUTBOUND_eth2 -d 225.0.0.50/32 -j ACCEPT
-A VPN_STATS_eth1 -o eth1 -m mark --mark 0x525
-A VPN_STATS_eth1 -i eth1 -m mark --mark 0x524
COMMIT
# Completed on Wed Apr 18 17:44:29 2018
# Generated by iptables-save v1.4.14 on Wed Apr 18 17:44:29 2018
*filter
:INPUT DROP [3:180]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [166:12048]
:ACL_INBOUND_eth2 - [0:0]
:FW_EGRESS_RULES - [0:0]
:NETWORK_STATS - [0:0]
:NETWORK_STATS_eth1 - [0:0]
-A INPUT -d 172.16.0.1/32 -i eth2 -p tcp -m tcp --dport 443 -m state --state NEW -j ACCEPT
-A INPUT -d 172.16.0.1/32 -i eth2 -p tcp -m tcp --dport 80 -m state --state NEW -j ACCEPT
-A INPUT -d 172.16.0.1/32 -i eth2 -p tcp -m tcp --dport 53 -j ACCEPT
-A INPUT -d 172.16.0.1/32 -i eth2 -p udp -m udp --dport 53 -j ACCEPT
-A INPUT -j NETWORK_STATS
-A INPUT -d 224.0.0.18/32 -j ACCEPT
-A INPUT -d 225.0.0.50/32 -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --dport 3922 -m state --state NEW,ESTABLISHED -j ACCEPT
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i eth2 -p udp -m udp --dport 67 -j ACCEPT
-A INPUT -s 172.16.0.0/24 -i eth2 -p udp -m udp --dport 53 -j ACCEPT
-A INPUT -s 172.16.0.0/24 -i eth2 -p tcp -m tcp --dport 53 -j ACCEPT
-A INPUT -i eth2 -p tcp -m tcp --dport 80 -m state --state NEW -j ACCEPT
-A INPUT -i eth2 -p tcp -m tcp --dport 8080 -m state --state NEW -j ACCEPT
-A FORWARD -j NETWORK_STATS_eth1
-A FORWARD -j NETWORK_STATS
-A FORWARD -s 172.16.0.0/16 ! -d 172.16.0.0/16 -j ACCEPT
-A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -d 172.16.0.0/24 -o eth2 -j ACL_INBOUND_eth2
-A OUTPUT -j NETWORK_STATS
-A ACL_INBOUND_eth2 -d 225.0.0.50/32 -j ACCEPT
-A ACL_INBOUND_eth2 -d 224.0.0.18/32 -j ACCEPT
-A ACL_INBOUND_eth2 -j ACCEPT
-A ACL_INBOUND_eth2 -j DROP
-A NETWORK_STATS -i eth0 -o eth2 -p tcp
-A NETWORK_STATS -i eth2 -o eth0 -p tcp
-A NETWORK_STATS ! -i eth0 -o eth2 -p tcp
-A NETWORK_STATS -i eth2 ! -o eth0 -p tcp
-A NETWORK_STATS_eth1 -s 172.16.0.0/24 -o eth1
-A NETWORK_STATS_eth1 -d 172.16.0.0/24 -i eth1
COMMIT
# Completed on Wed Apr 18 17:44:29 2018

@rafaelweingartner

This comment has been minimized.

Member

rafaelweingartner commented Apr 18, 2018

I forgot to mention. In this test environment I am using ACS 4.9.
You said that when you tested this PR it failed. Did you test with 4.11 or 4.12?

@rhtyd

Member

rhtyd commented Apr 18, 2018

@rafaelweingartner can you test against 4.11? I tested it against 4.11 (you can verify the lab env yourself). A lot has changed between 4.9 and 4.11, including the Python-based systemvm codebase and the VR template; it is possible that my fix does not work in your 4.9-based env.

@rafaelweingartner

Member

rafaelweingartner commented Apr 18, 2018

That is what I am doing now. I am building 4.11 with your PR to test.

@rhtyd

Member

rhtyd commented Apr 18, 2018

@rafaelweingartner On further investigation, and with hints from Jayapal's reply on dev@, I found the issue in 4.11/master was caused by a missing ip route rule, which @ustcweizhou had advised. My hack worked because packets were no longer marked, and the marking was tied to the routing rules. Adding this route, instead of the MARK rules, worked for me at last:
ip route add throw 10.1.1.0/24 table Table_eth1 proto static

@rafaelweingartner

Member

rafaelweingartner commented Apr 18, 2018

That is what you do in #2579, right?

What does ip route add throw mean?

I mean, I understand the other commands such as add gw, but this throw clause...

@rafaelweingartner

Member

rafaelweingartner commented Apr 18, 2018

The first time I saw this throw clause, in @ustcweizhou's comments, I googled it and found this:

throw - a special control route used together with policy rules. If such a route is selected, lookup in this table is terminated pretending that no route was found. Without policy routing it is equivalent to the absence of the route in the routing table. The packets are dropped and the ICMP message net unreachable is generated. The local senders get an ENETUNREACH error.

That is why I said I do not understand how that throw option can solve the NAT translation problem that we are having.

@rhtyd

Member

rhtyd commented Apr 18, 2018

@rafaelweingartner yes, I've updated #2579, do review that. Meanwhile, can you post the following from your env:

ip route show table Table_eth1

Mine looks like this:

# ip route show table Table_eth1
default via 192.168.1.1 dev eth1 proto static 
throw 10.1.1.0/24 proto static 
throw 10.1.2.0/24 proto static 
throw 192.168.1.0/24 proto static

You can add routing rules like this (replace the CIDRs with your VPC tiers' CIDRs; this assumes that eth1 is the public NIC of the VPC VR, which it usually is):

ip route add throw 10.1.1.0/24 table Table_eth1 proto static
ip route add throw 10.1.2.0/24 table Table_eth1 proto static

@rhtyd

Member

rhtyd commented Apr 18, 2018

@rafaelweingartner okay, let me try to explain what I understand (and btw, this is wrt 4.11 and may not apply to older ACS). The mangle table gets rules to mark some packets in PREROUTING (incoming packets); this is done mainly by configure.py, and you can get hints from CsRule.py. You can put routing rules based on marked packets. For example, I see this in my VR:

0:	from all lookup local 
32761:	from all fwmark 0x3 lookup Table_eth3 
32762:	from all fwmark 0x2 lookup Table_eth2 
32763:	from all fwmark 0x1 lookup Table_eth1 

By the above, packets marked 0x1 will use the routing table Table_eth1, which we can list as:

# ip route show table Table_eth1
default via 192.168.1.1 dev eth1 proto static 
throw 10.1.1.0/24 proto static 
throw 10.1.2.0/24 proto static 
throw 192.168.1.0/24 proto static

The issue, at least for 4.11/master, was that these routing table rules were missing (the throw stuff may not be necessary, but the important part is that Table_eth1 can route the VPC tier CIDRs). I added that, and tests confirm it works.

Just for reference, the same could be obtained without using a throw route (I preferred the throw route to make the lookup fail explicitly when a route is not found; also, you don't need to specify the dev interface with that syntax):

# ip route add dev eth2 10.1.1.0/24  table Table_eth1 
# ip route add dev eth3 10.1.2.0/24  table Table_eth1 
# ip route show table Table_eth1
default via 192.168.1.1 dev eth1 proto static 
10.1.1.0/24 dev eth2 scope link 
10.1.2.0/24 dev eth3 scope link 
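
Either way, one can sanity-check how a marked packet will route; a quick verification sketch (addresses assumed from the tables above):

# should now resolve via the tier NIC (or fall through to the main table with
# a throw route), instead of going out the default route via eth1
ip route get 10.1.1.50 mark 0x1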

@rhtyd rhtyd referenced this pull request Apr 18, 2018

Merged

router: fix routing table for marked packets #2579

@rafaelweingartner

Member

rafaelweingartner commented Apr 19, 2018

@rhtyd sorry for the late reply, but the testing took much more time than expected.
PR #2579 is not working here. System VMs start just fine, but VRs do not. The VRs go to the Running state, then ACS starts to apply configurations to them; once the scripts you changed are executed, the VR stops being reachable via the XAPI interface.

I tested with current master, and it is working just fine. However, unlike in our 4.9.3.0, where we noticed the problem with NAT, in master SNAT is not working (in our 4.9 it was working just fine), while normal NAT is working (which was not working in our 4.9 here). Go figure!? Can you get master and do some testing too?

My hosts are XenServer 6.5/7.2. Did you test with XenServer or only with KVM?

PR #2514 was intended to fix the normal NAT, which no longer seems to be broken in master. Therefore, I think we can close it. We do, however, need to check these errors before releasing a new version. I will re-test with master before closing my PR. If something changes, I will let you know.

@rhtyd

Member

rhtyd commented Apr 19, 2018

@rafaelweingartner I've made further changes on the PR; did you use the latest (see the diff)? I've tested only on KVM. I can attempt testing next week on VMware and XenServer; I am waiting for the test matrix to return results soon.

@rafaelweingartner

Member

rafaelweingartner commented Apr 19, 2018

Hmm, I think I used a different commit.

git checkout mangle-free-vpc-nat
Switched to branch 'mangle-free-vpc-nat'
[root@centos7 packaging]# git log -1
commit d61eed07af366e4e8753d049c182e196af7b2b34

I will try again tomorrow then with your newest commit.

@rhtyd

Member

rhtyd commented Apr 19, 2018

Cool, I've updated some commentary above. I found the right symptom but fixed it wrongly at first.

@blueorangutan

blueorangutan commented Apr 19, 2018

Trillian test result (tid-2518)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 86394 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2514-t2518-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_routers.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Smoke tests completed. 65 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_04_restart_network_wo_cleanup Failure 2.83 test_routers.py
test_05_rvpc_multi_tiers Failure 318.41 test_vpc_redundant.py
test_05_rvpc_multi_tiers Error 343.51 test_vpc_redundant.py

@rafaelweingartner

Member

rafaelweingartner commented Apr 19, 2018

I just finished testing master again. Indeed the problem that I was solving with this PR does not exist in master. Therefore, I will close this PR.

@blueorangutan

blueorangutan commented Apr 19, 2018

@rafaelweingartner a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

blueorangutan commented Apr 19, 2018

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1950

@izenk

izenk commented Jun 22, 2018

@rhtyd
Sorry for asking here, but I can't tell whether my situation is covered by this fix or not.

Do I understand correctly that in 4.11.0.0 we can have the following situation:
vm1 -> vpc1 -> vr1:SNAT:public_ip (192.168.1.11)
vm2 -> vpc2 -> vr2:SNAT:public_ip (192.168.1.12)

Public network (192.168.0.0/16) gateway is 192.168.1.1

The public network is on eth1 on the VR, and there is an iptables mark rule for it:

#iptables-save |grep 0x1
-A PREROUTING -i eth1 -m state --state NEW -j CONNMARK --set-xmark 0x1/0xffffffff

vr1 has a rule for the public network:

32763:	from all fwmark 0x1 lookup Table_eth1

and the routes in that table are:

#ip route show table Table_eth1
default via 192.168.1.1 dev eth1 proto static

Such a configuration means that vm2 can't connect to vm1 via public_ip (192.168.1.11), because vr2 marks the request packets with 0x1 and then tries to route them through the gateway 192.168.1.1, when it should not (because 192.168.1.11 and 192.168.1.12 are in the same network).

#ip route get 192.168.1.12 mark 0x1
192.168.1.12 via 192.168.1.1 dev eth1 table Table_eth1 src 192.168.1.11 mark 1

So, the resulting table should be:

#ip route show table Table_eth1
default via 192.168.1.1 dev eth1 proto static
192.168.0.0/16 dev eth1 scope link
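
A sketch of the command that would add the missing route (assuming Table_eth1 is the public routing table on the VR):

ip route add 192.168.0.0/16 dev eth1 scope link table Table_eth1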

@rhtyd

Member

rhtyd commented Jun 27, 2018

@izenk can you test 4.11.1 (rc3)? I think it should be fixed, and you're right about the expected routes in the table.

@izenk

izenk commented Jun 27, 2018

@rhtyd
Sorry, I can't test. I can only trust you again :-)

@rhtyd

Member

rhtyd commented Jun 27, 2018

Thanks @izenk. I'm not sure, but I hope it is fixed in 4.11.1; of course, time will tell.

@izenk

izenk commented Sep 24, 2018

@rhtyd
works in 4.11.1 except for one case: when the VMs are inside the same VPC and the same tier.
In that case, vm1 can't connect to vm2 through the public IP.

@rhtyd

Member

rhtyd commented Oct 2, 2018

Let me test that this week, @izenk, and see if I can reproduce the bug (guest VMs inside the same VPC/same tier unable to access each other via public IP).

@izenk

izenk commented Oct 3, 2018

@rhtyd maybe this helps.
It looks like routing goes in such a way that packets do not go through the SNAT rule on the VR, which leads to the following situation:
if two VMs are in one tier and vm1 wants to connect to vm2 through the public IP, the flow should look like:
vm1 internal IP -> VR SNAT -> VR DNAT -> vm2 internal IP
Right now, on vm2 I can see packets from vm1, but the source IP is set to vm1's internal IP (not the VR SNAT address).

So if I try telnet from vm1 to the VR's publicIP:80 (which is forwarded to vm2:80), on vm2 I can see packets on port 80, but these packets are from vm1's internal IP (they should be from the VR SNAT address). Next, I can even see replies from vm2 on vm1, but because these replies come from vm2 directly (to vm1's internal IP), the connection is in fact not established.
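
A quick way to observe this on vm2 (interface name assumed; port 80 as in the example above):

# if SNAT were applied, the source address would be the VR's, not vm1's internal IP
tcpdump -ni eth0 tcp port 80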

@rhtyd

Member

rhtyd commented Oct 5, 2018

@izenk The basic tests of ingress/egress between user VMs using a public SNAT (attached) IP pass for me when the VMs are (a) in different tiers of the same VPC, (b) across VPCs, and (c) in the same tier. Packet marking in 4.11.2.0 is slightly changed to fix a bug where marking a packet with 0x0 failed in the new VR. Routing etc. LGTM; I tested the latest 4.11 branch (4.11.2.0 rc2 + two new commits not related to the VR).
