![](http://spark.apache.org/images/spark-logo.png) ![](https://upload.wikimedia.org/wikipedia/commons/f/f8/Python_logo_and_wordmark.svg)

# MLlib: Basic Statistics and Exploratory Data Analysis
[MLlib](https://spark.apache.org/docs/latest/mllib-guide.html) is Spark's machine learning library.

In [3]:
# Import Packages
import numpy as np
from pyspark.mllib.stat import Statistics 
from math import sqrt 


In [2]:
# Getting the data and creating the RDD
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)


# Local vectors
A [local vector](https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector) is often used as a base type for RDDs in Spark MLlib. A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values.


For dense vectors, MLlib uses either Python lists or the NumPy array type. The latter is recommended, so you can simply pass NumPy arrays around.


For sparse vectors, users can construct a SparseVector object from MLlib or pass SciPy scipy.sparse column vectors if SciPy is available in their environment. The easiest way to create sparse vectors is to use the factory methods implemented in Vectors.
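To make the two storage layouts concrete, here is a minimal plain-Python sketch. It does not use MLlib: the helper name `sparse_to_dense` is hypothetical and only illustrates how the parallel `indices` and `values` arrays of a sparse vector map onto the dense layout. In MLlib itself, the corresponding factory methods are `Vectors.dense` and `Vectors.sparse` in `pyspark.mllib.linalg`.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (indices, values) pair into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense

# The same 5-element vector in both layouts.
dense_vector = [1.0, 0.0, 0.0, 3.0, 0.0]
sparse_indices = [0, 3]      # 0-based positions of the non-zero entries
sparse_values = [1.0, 3.0]   # the entries stored at those positions

expanded = sparse_to_dense(5, sparse_indices, sparse_values)
print(expanded == dense_vector)
```

The sparse layout only pays off when most entries are zero, since it stores one index per stored value on top of the value itself.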


# An RDD of dense vectors
Let's represent each network interaction in our dataset as a dense vector. For that we will use the NumPy array type.


In [4]:
def parse_interaction(line):
    line_split = line.split(",")
    # keep just numeric and logical values
    symbolic_indices = [1,2,3,41]
    clean_line_split = [item for i, item in enumerate(line_split)
                              if i not in symbolic_indices]
    return np.array([float(x) for x in clean_line_split])

In [5]:
vector_data = raw_data.map(parse_interaction)
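As a quick local check of the parsing logic, the sketch below applies the same field selection to a single synthetic line, without Spark or NumPy. The line is made up for illustration (it is not a record from the dataset), and `parse_fields` is a hypothetical plain-list variant of `parse_interaction`.

```python
SYMBOLIC_INDICES = {1, 2, 3, 41}  # protocol type, service, flag, and label columns

def parse_fields(line):
    """Drop the symbolic columns and convert the remaining fields to floats."""
    fields = line.split(",")
    return [float(item) for i, item in enumerate(fields)
            if i not in SYMBOLIC_INDICES]

# Synthetic 42-field line: duration, three symbolic fields,
# 37 further numeric fields, and the label.
sample_line = ",".join(["12", "tcp", "http", "SF"] + ["0"] * 37 + ["normal."])

parsed = parse_fields(sample_line)
print(len(parsed))   # number of numeric fields kept
print(parsed[0])     # the duration stays at index 0
```

Because the duration is the first field and is not symbolic, it remains at index 0 of the parsed vector, which is why the statistics below read position 0.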


# Summary statistics
Spark's MLlib provides column summary statistics for ```RDD[Vector]``` through the function ```colStats``` available in Statistics. The method returns an instance of [```MultivariateStatisticalSummary```](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary), which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.

In [6]:
# Compute column summary statistics.
summary = Statistics.colStats(vector_data)

In [7]:
# print(...) with a single argument runs on both Python 2 and Python 3
print("Duration Statistics:")
print(" Mean: {}".format(round(summary.mean()[0],3)))
print(" St. deviation: {}".format(round(sqrt(summary.variance()[0]),3)))
print(" Max value: {}".format(round(summary.max()[0],3)))
print(" Min value: {}".format(round(summary.min()[0],3)))
print(" Total value count: {}".format(summary.count()))
print(" Number of non-zero values: {}".format(summary.numNonzeros()[0]))

Duration Statistics:
 Mean: 47.979
 St. deviation: 707.746
 Max value: 58329.0
 Min value: 0.0
 Total value count: 494021
 Number of non-zero values: 12350.0
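To make clear what each field of the summary holds, the following plain-Python sketch computes the same kinds of column-wise quantities for a tiny in-memory dataset. It is an illustration, not the MLlib implementation: `column_stats` is a hypothetical helper, the rows are made up, and we assume the variance is the unbiased sample variance (dividing by n - 1).

```python
def column_stats(rows):
    """Column-wise count, mean, sample variance, min, max, and non-zero count.

    Assumes at least two rows, since the sample variance divides by n - 1.
    """
    n = len(rows)
    columns = list(zip(*rows))  # transpose rows into columns
    means = [sum(col) / float(n) for col in columns]
    variances = [sum((x - m) ** 2 for x in col) / float(n - 1)
                 for col, m in zip(columns, means)]
    return {
        "count": n,
        "mean": means,
        "variance": variances,
        "min": [min(col) for col in columns],
        "max": [max(col) for col in columns],
        "numNonzeros": [sum(1 for x in col if x != 0) for col in columns],
    }

# Made-up data: three rows, two columns.
rows = [
    [0.0, 10.0],
    [2.0, 0.0],
    [4.0, 20.0],
]
stats = column_stats(rows)
print(stats["mean"])
print(stats["variance"])
```

Note that the count is a single number for the whole dataset, while every other statistic is a vector with one entry per column.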


# Summary statistics by label

The interesting part of summary statistics, in our case, comes from being able to obtain them by the type of network attack, or 'label', in our dataset. Doing so lets us characterise the dependent variable (the label) in terms of the ranges of values taken by the independent variables.


To do this, we can build an RDD with labels as keys and vectors as values, and then filter it by key. For that we just need to adapt our parse_interaction function to return a tuple with both elements.

In [8]:
def parse_interaction_with_key(line):
    line_split = line.split(",")
    # keep just numeric and logical values
    symbolic_indices = [1,2,3,41]
    clean_line_split = [item for i, item in enumerate(line_split)
                              if i not in symbolic_indices]
    return (line_split[41], np.array([float(x) for x in clean_line_split]))

In [9]:
label_vector_data = raw_data.map(parse_interaction_with_key)

The next step is not very sophisticated. We use filter on the RDD to keep only the label we want to gather statistics for.

In [10]:
normal_label_data = label_vector_data.filter(lambda x: x[0]=="normal.")


Now we can use the new RDD to call ```colStats``` on the values.

In [11]:
normal_summary = Statistics.colStats(normal_label_data.values())
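The filter-then-values pattern can be mimicked locally with plain Python lists, which may help when checking the logic without a cluster. This is an analogy only: the pairs below are made-up data, and list comprehensions stand in for the RDD `filter` and `values` calls.

```python
# Made-up (label, vector) pairs standing in for label_vector_data.
pairs = [
    ("normal.", [0.0, 181.0]),
    ("smurf.", [0.0, 1032.0]),
    ("normal.", [5.0, 239.0]),
]

# Stand-in for .filter(lambda x: x[0] == "normal.")
normal_pairs = [pair for pair in pairs if pair[0] == "normal."]

# Stand-in for .values(): drop the keys, keep the vectors.
normal_vectors = [vector for _, vector in normal_pairs]

print(normal_vectors)
```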

In [12]:
print("Duration Statistics for label: {}".format("normal"))
# round() was missing here, so the stray 3 was silently ignored by format()
print(" Mean: {}".format(round(normal_summary.mean()[0],3)))
print(" St. deviation: {}".format(round(sqrt(normal_summary.variance()[0]),3)))
print(" Max value: {}".format(round(normal_summary.max()[0],3)))
print(" Min value: {}".format(round(normal_summary.min()[0],3)))
print(" Total value count: {}".format(normal_summary.count()))
print(" Number of non-zero values: {}".format(normal_summary.numNonzeros()[0]))

Duration Statistics for label: normal
 Mean: 216.657
 St. deviation: 1359.213
 Max value: 58329.0
 Min value: 0.0
 Total value count: 97278
 Number of non-zero values: 11690.0


Instead of working with key/value pairs, we could have filtered the raw data directly using the label in column 41 and then parsed the result as before. That works just as well. However, organising our data as key/value pairs opens the door to richer manipulations. Since ```values()``` is a transformation on an RDD, and not an action, no computation is performed until we call ```colStats``` anyway.
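The difference between a transformation and an action has a loose analogue in plain Python generators, sketched below. This is only an analogy and not how Spark works internally: the generator expression plays the role of a transformation, `sum` plays the role of an action, and `parse` and `calls` are hypothetical names used to show when the work actually happens.

```python
calls = []  # records each time parse() actually runs

def parse(x):
    calls.append(x)
    return float(x)

raw = ["1", "2", "3"]

# Building the pipeline does no work yet (like a transformation).
parsed = (parse(x) for x in raw)
calls_before = len(calls)

# Consuming the pipeline triggers the computation (like an action).
total = sum(parsed)
calls_after = len(calls)

print(calls_before)
print(calls_after)
print(total)
```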

Let's wrap this in a function so we can reuse it with any label.

In [13]:
def summary_by_label(raw_data, label):
    label_vector_data = raw_data.map(parse_interaction_with_key).filter(lambda x: x[0]==label)
    return Statistics.colStats(label_vector_data.values())

In [14]:
normal_sum = summary_by_label(raw_data, "normal.")

print("Duration Statistics for label: {}".format("normal"))
# round() was missing here, so the stray 3 was silently ignored by format()
print(" Mean: {}".format(round(normal_sum.mean()[0],3)))
print(" St. deviation: {}".format(round(sqrt(normal_sum.variance()[0]),3)))
print(" Max value: {}".format(round(normal_sum.max()[0],3)))
print(" Min value: {}".format(round(normal_sum.min()[0],3)))
print(" Total value count: {}".format(normal_sum.count()))
print(" Number of non-zero values: {}".format(normal_sum.numNonzeros()[0]))

Duration Statistics for label: normal
 Mean: 216.657
 St. deviation: 1359.213
 Max value: 58329.0
 Min value: 0.0
 Total value count: 97278
 Number of non-zero values: 11690.0



Now let's try some of the network attack types. They are all listed [here](http://kdd.ics.uci.edu/databases/kddcup99/training_attack_types).


In [17]:
def print_label_summary(labels):
    for label in labels:
        label_summary = summary_by_label(raw_data, label)
        print("Duration Statistics for label: {}".format(label))
        # round() was missing here, so the stray 3 was silently ignored by format()
        print(" Mean: {}".format(round(label_summary.mean()[0],3)))
        print(" St. deviation: {}".format(round(sqrt(label_summary.variance()[0]),3)))
        print(" Max value: {}".format(round(label_summary.max()[0],3)))
        print(" Min value: {}".format(round(label_summary.min()[0],3)))
        print(" Total value count: {}".format(label_summary.count()))
        print(" Number of non-zero values: {}".format(label_summary.numNonzeros()[0]))
        print("\n\n######################################################\n\n")


In [18]:
labels = ["back.", "buffer_overflow."]
print_label_summary(labels)

Duration Statistics for label: back.
 Mean: 0.129
 St. deviation: 1.11
 Max value: 14.0
 Min value: 0.0
 Total value count: 2203
 Number of non-zero values: 40.0


######################################################


Duration Statistics for label: buffer_overflow.
 Mean: 91.7
 St. deviation: 97.515
 Max value: 321.0
 Min value: 0.0
 Total value count: 30
 Number of non-zero values: 22.0


######################################################




We could build a table with duration statistics for each type of interaction in our dataset. First we need to get a list of labels as described in the first line [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names).

In [19]:
label_list = ["back.","buffer_overflow.","ftp_write.","guess_passwd.",
              "imap.","ipsweep.","land.","loadmodule.","multihop.",
              "neptune.","nmap.","normal.","perl.","phf.","pod.","portsweep.",
              "rootkit.","satan.","smurf.","spy.","teardrop.","warezclient.",
              "warezmaster."]

Then we get a list of statistics for each label.

In [20]:
stats_by_label = [(label, summary_by_label(raw_data, label)) for label in label_list]

Next we extract the statistics for the duration column, which is the first column in our dataset (i.e. index 0).

In [21]:
duration_by_label = [ 
    (stat[0], np.array([float(stat[1].mean()[0]), float(sqrt(stat[1].variance()[0])), float(stat[1].min()[0]), float(stat[1].max()[0]), int(stat[1].count())])) 
    for stat in stats_by_label]

We can put these into a pandas data frame.

In [22]:
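from collections import OrderedDict
import pandas as pd
pd.set_option('display.max_columns', 50)

# DataFrame.from_items was deprecated in pandas 0.23 and removed in 1.0;
# from_dict with an OrderedDict keeps the labels in their original order.
stats_by_label_df = pd.DataFrame.from_dict(OrderedDict(duration_by_label), orient='index', columns=["Mean", "Std Dev", "Min", "Max", "Count"])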

In [23]:
print("Duration statistics, by label")
stats_by_label_df

Duration statistics, by label


| | Mean | Std Dev | Min | Max | Count |
|---|---|---|---|---|---|
| back. | 0.128915 | 1.110062 | 0.0 | 14.0 | 2203.0 |
| buffer_overflow. | 91.7 | 97.514685 | 0.0 | 321.0 | 30.0 |
| ftp_write. | 32.375 | 47.449033 | 0.0 | 134.0 | 8.0 |
| guess_passwd. | 2.716981 | 11.879811 | 0.0 | 60.0 | 53.0 |
| imap. | 6.0 | 14.17424 | 0.0 | 41.0 | 12.0 |
| ipsweep. | 0.034483 | 0.438439 | 0.0 | 7.0 | 1247.0 |
| land. | 0.0 | 0.0 | 0.0 | 0.0 | 21.0 |
| loadmodule. | 36.222222 | 41.408869 | 0.0 | 103.0 | 9.0 |
| multihop. | 184.0 | 253.851006 | 0.0 | 718.0 | 7.0 |
| neptune. | 0.0 | 0.0 | 0.0 | 0.0 | 107201.0 |