![](http://spark.apache.org/images/spark-logo.png) ![](https://upload.wikimedia.org/wikipedia/commons/f/f8/Python_logo_and_wordmark.svg)

Set Operations on RDDs
===============================



Spark supports many of the operations defined on mathematical sets, such as union and intersection, even though RDDs are not sets in the strict sense. Note that these operations require both RDDs to hold elements of the same type.


Set operations are straightforward to understand because they work as expected. The only caveat is that RDDs are not real sets, so operations such as the union of two RDDs do not remove duplicates. In this notebook we will take a brief look at:
- subtract, 
- distinct, and 
- cartesian.
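
As a rough local analogy (plain Python lists standing in for RDDs, not Spark code), the sketch below shows why a union keeps duplicates and how a distinct step removes them. The names `rdd_a` and `rdd_b` and their contents are made up for illustration.

```python
# Plain Python lists stand in for RDDs here; this is an analogy, not Spark code.
rdd_a = ["tcp", "udp", "tcp"]
rdd_b = ["udp", "icmp"]

# Like an RDD union, concatenation keeps every element, duplicates included.
union_like = rdd_a + rdd_b

# A distinct step is what actually removes duplicates.
# Sorting is only there to make the printed order predictable.
distinct_like = sorted(set(union_like))

print(union_like)
print(distinct_like)
```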

## 1. Getting the data and creating the RDD

In this case we will use the complete dataset provided for the KDD Cup 1999, which contains nearly five million network interactions. The file is provided as a gzip archive that we will download locally.

In [1]:
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2
f = urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz", "kddcup.data.gz")

In [2]:
data_file = "./kddcup.data.gz"
raw_data = sc.textFile(data_file)

## 2. Getting attack interactions using ```subtract```

The ```subtract``` transformation returns the elements of one RDD that are not present in another. Here we first filter out the normal interactions and then subtract them from the full dataset to obtain the attacks.


In [3]:
normal_raw_data = raw_data.filter(lambda x: "normal." in x)

In [4]:
attack_raw_data = raw_data.subtract(normal_raw_data)

Let's do some counts to check our results.

In [5]:
from time import time

# count all
t0 = time()
raw_data_count = raw_data.count()
tt = time() - t0

print("All count in {} secs".format(round(tt, 3)))

All count in 16.635 secs


In [6]:
# count normal
t0 = time()
normal_raw_data_count = normal_raw_data.count()
tt = time() - t0
print("Normal count in {} secs".format(round(tt, 3)))

Normal count in 16.226 secs


In [7]:
# count attacks
t0 = time()
attack_raw_data_count = attack_raw_data.count()
tt = time() - t0
print("Attack count in {} secs".format(round(tt, 3)))

Attack count in 114.181 secs


In [8]:
print("There are {} normal interactions and {} attacks, "
      "from a total of {} interactions".format(
          normal_raw_data_count, attack_raw_data_count, raw_data_count))

There are 972781 normal interactions and 3925650 attacks, from a total of 4898431 interactions
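
The attack count takes several times longer than the other two counts. A likely reason is that ```subtract``` has to compare the two RDDs against each other, while ```filter``` inspects each line on its own. When the excluded elements are defined by a predicate, as they are here, negating the predicate should select the same lines. The sketch below checks that equivalence on a few made-up lines, using plain Python lists instead of RDDs; `sample_lines` and `subtract_like` are hypothetical names, not part of the Spark API.

```python
# Made-up, shortened stand-ins for lines of the dataset; the last field is the label.
sample_lines = [
    "0,tcp,http,SF,normal.",
    "0,icmp,ecr_i,SF,smurf.",
    "0,tcp,http,SF,normal.",
    "0,tcp,private,S0,neptune.",
    "0,icmp,ecr_i,SF,smurf.",
]


def subtract_like(left, right):
    """Keep every element of `left` that has no equal element in `right`."""
    excluded = set(right)
    return [x for x in left if x not in excluded]


normal_lines = [x for x in sample_lines if "normal." in x]

# Route 1: mirrors raw_data.subtract(normal_raw_data).
attacks_by_subtract = subtract_like(sample_lines, normal_lines)

# Route 2: negate the predicate, so no comparison between collections is needed.
attacks_by_filter = [x for x in sample_lines if "normal." not in x]

print(attacks_by_subtract == attacks_by_filter)
print(len(attacks_by_filter))
```

In the notebook, the second route would correspond to `raw_data.filter(lambda x: "normal." not in x)`.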


## 3. Protocol and service combinations using ```cartesian```

We can compute the Cartesian product between two RDDs by using the ```cartesian``` transformation. It returns all possible pairs of elements between two RDDs.

In our case we will use it to generate all the possible combinations between service and protocol in our network interactions.

First we need to isolate each collection of values in its own RDD. For that we will use ```distinct``` on the CSV-parsed dataset. From the dataset description we know that protocol is the second column and service is the third (the tag is the last column, not the first as it appears on the description page).


**dataset description:**
```text
back,buffer_overflow,ftp_write,guess_passwd,imap,ipsweep,land,loadmodule,multihop,neptune,nmap,normal,perl,phf,pod,portsweep,rootkit,satan,smurf,spy,teardrop,warezclient,warezmaster.
duration: continuous.
protocol_type: symbolic.
service: symbolic.
flag: symbolic.
src_bytes: continuous.
dst_bytes: continuous.
land: symbolic.
wrong_fragment: continuous.
urgent: continuous.
hot: continuous.
num_failed_logins: continuous.
logged_in: symbolic.
num_compromised: continuous.
root_shell: continuous.
su_attempted: continuous.
num_root: continuous.
num_file_creations: continuous.
num_shells: continuous.
num_access_files: continuous.
num_outbound_cmds: continuous.
is_host_login: symbolic.
is_guest_login: symbolic.
count: continuous.
srv_count: continuous.
serror_rate: continuous.
srv_serror_rate: continuous.
rerror_rate: continuous.
srv_rerror_rate: continuous.
same_srv_rate: continuous.
diff_srv_rate: continuous.
srv_diff_host_rate: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.
dst_host_same_srv_rate: continuous.
dst_host_diff_srv_rate: continuous.
dst_host_same_src_port_rate: continuous.
dst_host_srv_diff_host_rate: continuous.
dst_host_serror_rate: continuous.
dst_host_srv_serror_rate: continuous.
dst_host_rerror_rate: continuous.
dst_host_srv_rerror_rate: continuous.
```
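
To make the column positions concrete, the following sketch splits one made-up record and picks out the fields by index. The record is shortened to a handful of fields for readability (a real line carries all the features listed above plus the tag), and `sample_record` is a hypothetical name.

```python
# A shortened, made-up record: duration, protocol_type, service, flag, ..., tag.
sample_record = "0,tcp,http,SF,215,45076,normal."

fields = sample_record.split(",")

protocol = fields[1]  # second column
service = fields[2]   # third column
tag = fields[-1]      # the tag is the last column, not the first

print(protocol, service, tag)
```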

In [10]:
# Get the protocols
csv_data = raw_data.map(lambda x: x.split(","))
protocols = csv_data.map(lambda x: x[1]).distinct()
protocols.collect()

[u'udp', u'icmp', u'tcp']

In [11]:
# Get Services
services = csv_data.map(lambda x: x[2]).distinct()
services.collect()

[u'urp_i',
 u'http_443',
 u'Z39_50',
 u'smtp',
 u'domain',
 u'private',
 u'echo',
 u'time',
 u'shell',
 u'red_i',
 u'eco_i',
 u'sunrpc',
 u'ftp_data',
 u'urh_i',
 u'pm_dump',
 u'pop_3',
 u'pop_2',
 u'systat',
 u'ftp',
 u'uucp',
 u'whois',
 u'harvest',
 u'netbios_dgm',
 u'efs',
 u'remote_job',
 u'daytime',
 u'ntp_u',
 u'finger',
 u'ldap',
 u'netbios_ns',
 u'kshell',
 u'iso_tsap',
 u'ecr_i',
 u'nntp',
 u'http_2784',
 u'printer',
 u'domain_u',
 u'uucp_path',
 u'courier',
 u'exec',
 u'aol',
 u'netstat',
 u'telnet',
 u'gopher',
 u'rje',
 u'sql_net',
 u'link',
 u'ssh',
 u'netbios_ssn',
 u'csnet_ns',
 u'X11',
 u'IRC',
 u'tftp_u',
 u'login',
 u'supdup',
 u'name',
 u'nnsp',
 u'mtp',
 u'http',
 u'bgp',
 u'ctf',
 u'hostnames',
 u'klogin',
 u'vmnet',
 u'tim_i',
 u'discard',
 u'imap4',
 u'auth',
 u'other',
 u'http_8001']

Now we can do the cartesian product.

In [12]:
product = protocols.cartesian(services).collect()
print("There are {} combinations of protocol X service".format(len(product)))

There are 210 combinations of protocol X service


Obviously, for such small RDDs it doesn't really make sense to use Spark's Cartesian product. We could just as well have collected the values after using ```distinct``` and computed the Cartesian product locally. Moreover, ```distinct``` and ```cartesian``` are expensive operations, so they must be used with care when the datasets involved are large.
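
A minimal sketch of that local alternative, using `itertools.product` from the Python standard library. The two short lists are made-up stand-ins for the collected values; in the notebook they would come from `protocols.collect()` and `services.collect()`.

```python
import itertools

# Small made-up stand-ins for protocols.collect() and services.collect().
protocols_local = ["udp", "icmp", "tcp"]
services_local = ["http", "smtp", "ftp", "domain"]

# itertools.product yields every (protocol, service) pair on the driver,
# without distributing any work to the cluster.
pairs = list(itertools.product(protocols_local, services_local))

print(len(pairs))  # 3 protocols x 4 services
print(pairs[0])
```

The number of pairs is the product of the two list sizes, which is why the same approach applied to the 3 protocols and 70 services above gives 210 combinations.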