## K-Prototypes Part II - Applied

In the first section of this exploration of clustereing with mixed data, I covered some of the theroetical reasons for _why_ straight up K-Means is not aprpopriate for mixed data. I also wrote about some of the racticaly reasons for wanting to cluster data in a business conetxt.
<br> For this subsequent continuation of the tutorial, I will attempt to apply the implementation of  [K-Protoypes]( https://github.com/nicodv/kmodes#huang97) to a data set I have not worked with before. My only _a priori_ requierement is that it had to have mixed continuous and categorical data 

## KDD Cup 1999 Data 

Scikit learned has several pre-canned data sets you can play around with. I chose the KDD Cup 1999 Data set. According to the publishers of this data:

<br>_"This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99 The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment."_ <br>
<br>
A  noye on the options I applied to loading this data set: 
 1. This is simulated data and contains many more 'rare' events than real world intrusion data would. One of the goals of clustering I mentioned in Part I is the fact that it can be used to detect groups that over-index on outliers and rare events that you may eventually want to model. At first glance, this data set would not look like a good candidate foir that. However, there is an option (the 'SA' subset for  resampling the data that drastically cuts down the proportion of 'bad' connections and allows one to simulate it as a rare event.


In [1]:
import sklearn.datasets
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#Load up a small sample of the 'SA' subset of KDD Cup 1999
kddcup=sklearn.datasets.fetch_kddcup99 (subset='SA', random_state=666, percent10=True)

A list of [features](https://kdd.ics.uci.edu/databases/kddcup99/kddcup.names) and their data types. Note that they use the term 'symbolic' for what I've been calling categorical.

In [42]:
features={'duration': 'continuous',
'protocol_type': 'symbolic',
'service': 'symbolic',
'flag': 'symbolic',
'src_bytes': 'continuous',
'dst_bytes': 'continuous',
'land': 'symbolic',
'wrong_fragment': 'continuous',
'urgent': 'continuous',
'hot': 'continuous',
'num_failed_logins': 'continuous',
'logged_in': 'symbolic',
'num_compromised': 'continuous',
'root_shell': 'continuous',
'su_attempted': 'continuous',
'num_root': 'continuous',
'num_file_creations': 'continuous',
'num_shells':'continuous',
'num_access_files': 'continuous',
'num_outbound_cmds': 'continuous',
'is_host_login': 'symbolic',
'is_guest_login': 'symbolic',
'count': 'continuous',
'srv_count': 'continuous',
'serror_rate': 'continuous',
'srv_serror_rate': 'continuous',
'rerror_rate': 'continuous',
'srv_rerror_rate': 'continuous',
'same_srv_rate': 'continuous',
'diff_srv_rate': 'continuous',
'srv_diff_host_rate': 'continuous',
'dst_host_count': 'continuous',
'dst_host_srv_count': 'continuous',
'dst_host_same_srv_rate': 'continuous',
'dst_host_diff_srv_rate': 'continuous',
'dst_host_same_src_port_rate': 'continuous',
'dst_host_srv_diff_host_rate': 'continuous',
'dst_host_serror_rate': 'continuous',
'dst_host_srv_serror_rate': 'continuous',
'dst_host_rerror_rate': 'continuous',
 'dst_host_srv_rerror_rate': 'continuous'}
#Isolate the faeture names from this dict
keys, values =features.keys(), features.values()
features_list=list(keys)

In [None]:
#Convert the array to a pandas data frame and use the feature list for the column names 
import pandas as pd
kddcup_df=pd.DataFrame(kddcup['data'], columns=features_list)

In [65]:
#Make a list of the values we know are discrete based on the supplied dictionary. These will eventually be one-hotted. 
symbolic={k: v for k, v in features.items() if v=='symbolic'}
symbolic_keys_list=list(symbolic)
symbolic_keys_list

['protocol_type',
 'service',
 'flag',
 'land',
 'logged_in',
 'is_host_login',
 'is_guest_login']

Ok, we have a list of fields that we _know_ are discrete. I still like to do a check for other coninuous variables that may truly be discrete by checking whethe or not they have only a few unique values. <br> Often, if for example, when you find a continuous  variable that _onkly_ takes on the value of 1 or 0, it is likely (but not assured) that this is actually a discrete categorical dummy variable.

In [44]:
kddcup_df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0,b'tcp',b'http',b'SF',181,5450,0,0,0,0,...,9,9,1,0,0.11,0,0,0,0,0
1,0,b'tcp',b'http',b'SF',239,486,0,0,0,0,...,19,19,1,0,0.05,0,0,0,0,0
2,0,b'tcp',b'http',b'SF',235,1337,0,0,0,0,...,29,29,1,0,0.03,0,0,0,0,0
3,0,b'tcp',b'http',b'SF',219,1337,0,0,0,0,...,39,39,1,0,0.03,0,0,0,0,0
4,0,b'tcp',b'http',b'SF',217,2032,0,0,0,0,...,49,49,1,0,0.02,0,0,0,0,0


In [15]:
kddcup_df.shape

(100655, 41)

In [27]:
kddcup_df.rename(columns=features, inplace=True)

In [29]:
kddcup_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,31,32,33,34,35,36,37,38,39,40
0,0,b'tcp',b'http',b'SF',181,5450,0,0,0,0,...,9,9,1,0,0.11,0,0,0,0,0
1,0,b'tcp',b'http',b'SF',239,486,0,0,0,0,...,19,19,1,0,0.05,0,0,0,0,0
2,0,b'tcp',b'http',b'SF',235,1337,0,0,0,0,...,29,29,1,0,0.03,0,0,0,0,0
3,0,b'tcp',b'http',b'SF',219,1337,0,0,0,0,...,39,39,1,0,0.03,0,0,0,0,0
4,0,b'tcp',b'http',b'SF',217,2032,0,0,0,0,...,49,49,1,0,0.02,0,0,0,0,0


In [36]:
kddcup_df.rename(columns=keys, inplace=True)

TypeError: 'dict_keys' object is not callable

In [31]:
features_list

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']