# KDD Cup 99

This dataset contains the information needed to determine whether a connection attempt is safe or not. In this document, we build a Machine Learning model that makes this decision for us.

The first step is to import the libraries used in our code.

In [1]:
import pandas as pd
import numpy as np

# Data Analysis:

Because the dataset is very large, reading it in a single call can cause memory issues. To avoid that, we read it in chunks using the 'chunksize' parameter. The file has no dedicated header row, so we pass the 'header' parameter with the value 'None'. We also set the 'on_bad_lines' parameter to 'skip' so that malformed lines are skipped instead of raising an error.

In [2]:
df = pd.read_csv('data.csv',header=None,chunksize=10000,on_bad_lines='skip',na_filter=False)

Now that we have read our data, we combine its chunks into a single dataset so we can access it. We do so with the 'pandas.concat()' function. We could also pass 'ignore_index=True' to discard the index labels of the chunks and renumber the rows, but the chunks returned by 'read_csv()' already carry a continuous index, so we don't need it here.

Important note: for this data we used 'na_filter=False', which stops pandas from parsing empty strings as 'Null' values. Wouldn't leftover empty strings ruin our model? They would, but we have already checked this dataset for missing data and found none, so we can use the 'na_filter' parameter to speed up reading.

In [3]:
data = pd.concat(df)
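
The chunked read and the concatenation can be tried on a small scale first. The sketch below is only an illustration: the CSV text is a hypothetical stand-in for 'data.csv', and 'chunk_list' and 'data_small' are names introduced here.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: five short rows, no header.
csv_text = (
    "0,tcp,http,normal.\n"
    "0,udp,domain_u,normal.\n"
    "2,tcp,smtp,normal.\n"
    "0,icmp,ecr_i,smurf.\n"
    "1,tcp,http,normal.\n"
)

# With chunksize set, read_csv returns an iterator of DataFrames, not one DataFrame.
chunks = pd.read_csv(io.StringIO(csv_text), header=None, chunksize=2, na_filter=False)

chunk_list = list(chunks)            # 2 + 2 + 1 rows
data_small = pd.concat(chunk_list)   # one DataFrame again

print(len(chunk_list))
print(data_small.shape)
print(list(data_small.index))        # the chunk indexes continue from one chunk to the next
```

Because each chunk keeps counting where the previous one stopped, the concatenated index has no repeated labels even without 'ignore_index=True'.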

#### We can get a sample of our data using the 'head()' function.

In [4]:
data.head()

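| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 215 | 45076 | 0 | 0 | 0 | 0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 1 | 0 | tcp | http | SF | 162 | 4528 | 0 | 0 | 0 | 0 | ... | 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 2 | 0 | tcp | http | SF | 236 | 1228 | 0 | 0 | 0 | 0 | ... | 2 | 1.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 3 | 0 | tcp | http | SF | 233 | 2032 | 0 | 0 | 0 | 0 | ... | 3 | 1.0 | 0.0 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 4 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 4 | 1.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |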


# 1- Missing Data:

To check the amount of Null values in each column, we can do the following:

In [5]:
data.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
31    0
32    0
33    0
34    0
35    0
36    0
37    0
38    0
39    0
40    0
41    0
dtype: int64
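
Since the data was read with 'na_filter=False', empty fields stay as empty strings and 'isnull()' will not count them. A minimal sketch of an additional check, assuming a small hypothetical CSV in place of the real file ('raw' and 'empty_counts' are names introduced here):

```python
import io
import pandas as pd

# Hypothetical rows with two empty fields.
csv_text = "0,tcp,http\n1,,smtp\n2,udp,\n"

raw = pd.read_csv(io.StringIO(csv_text), header=None, na_filter=False)

# isnull() sees nothing, because empty fields were kept as '' instead of NaN.
print(raw.isnull().sum().sum())

# Counting empty strings per column reveals them.
empty_counts = (raw == "").sum()
print(empty_counts)
```

Running the same comparison on the full dataset is one way to confirm that skipping the Null filter did not hide anything.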

# 2- Duplicates:

To check the amount of duplicated rows in our dataset:

In [6]:
data.duplicated().sum()

3823439

Even though our dataset contains duplicated rows, we won't drop them, so that we can practice dealing with large data.
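
For reference, this is how the duplicates could be removed if we wanted to. The sketch uses a small hypothetical DataFrame ('toy'), not the real data:

```python
import pandas as pd

# Hypothetical rows; the second row repeats the first.
toy = pd.DataFrame({
    "protocol": ["tcp", "tcp", "udp", "tcp"],
    "service": ["http", "http", "domain_u", "smtp"],
    "label": ["normal.", "normal.", "normal.", "normal."],
})

n_dupes = toy.duplicated().sum()                  # rows identical to an earlier row
deduped = toy.drop_duplicates(ignore_index=True)  # keep first occurrence, renumber rows

print(n_dupes)
print(deduped.shape)
```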

# 3- Mixed Data:

Now, let's check if we have mixed data in our dataset:

In [7]:
def info():
    # Print the DataFrame summary, then each column's inferred type and unique count.
    print("\n################### DATASET INFO ######################\n")

    data.info()

    print("\n####################################################")
    print("\n####################################################")
    print("\n####################################################\n")

    for i, column in enumerate(data.columns):
        print(i, pd.api.types.infer_dtype(data[column]),
              "\t\tUniques:", data[column].nunique())

info()


################### DATASET INFO ######################

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898430 entries, 0 to 4898429
Data columns (total 42 columns):
 #   Column  Dtype  
---  ------  -----  
 0   0       int64  
 1   1       object 
 2   2       object 
 3   3       object 
 4   4       int64  
 5   5       int64  
 6   6       int64  
 7   7       int64  
 8   8       int64  
 9   9       int64  
 10  10      int64  
 11  11      int64  
 12  12      int64  
 13  13      int64  
 14  14      int64  
 15  15      int64  
 16  16      int64  
 17  17      int64  
 18  18      int64  
 19  19      int64  
 20  20      int64  
 21  21      int64  
 22  22      int64  
 23  23      int64  
 24  24      float64
 25  25      float64
 26  26      float64
 27  27      float64
 28  28      float64
 29  29      float64
 30  30      float64
 31  31      int64  
 32  32      int64  
 33  33      float64
 34  34      float64
 35  35      float64
 36  36      float64
 37  37   
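
To see what 'infer_dtype()' reports when a column really is mixed, here is a small sketch with two hypothetical Series ('clean' and 'mixed'):

```python
import pandas as pd

clean = pd.Series(["tcp", "udp", "icmp"])           # strings only
mixed = pd.Series(["tcp", 1, "udp"], dtype=object)  # strings and an integer

clean_kind = pd.api.types.infer_dtype(clean)
mixed_kind = pd.api.types.infer_dtype(mixed)

print(clean_kind)  # a single-type label
print(mixed_kind)  # a label starting with 'mixed'
```

A column whose label starts with 'mixed' would need cleaning before encoding.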

# 4- Dependent and Independent Variables:

There are no mixed data in our columns, but we have string-type data which we have to encode. But first, let's allocate our dependent and independent variables.

In [8]:
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values

'X' variable represents our dataset features, and 'y' variable represents our target. We stored all the columns except the last column in 'X', and the last column in our data in 'y'. 

# 5- LabelEncoder:

We have 3 string-type columns in our independent variables ('X'), and our dependent variable ('y') is a string-type column as well. Machine Learning models cannot deal with string-type data. Therefore, we have to convert any string-type data into numerical data. To do so, we can use 'sklearn.preprocessing.LabelEncoder()', which replaces each distinct value in a column with an integer code.

For the columns in the independent variables, we will have to use 'LabelEncoder()' and an additional encoding method to make the encoded data meaningful for our model. For the dependent variable, we only have to use 'LabelEncoder()' because it's our target.

In [9]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

X[:,1] = labelencoder.fit_transform(X[:,1]).astype(float)
X[:,2] = labelencoder.fit_transform(X[:,2]).astype(float)
X[:,3] = labelencoder.fit_transform(X[:,3]).astype(float)

y = labelencoder.fit_transform(y)

Notice that we cast the encoded 'X' columns to float. This is not strictly necessary, since 'CatBoostEncoder()' produces float values anyway.
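
To make the mapping concrete, here is a small sketch of what 'LabelEncoder()' does to a column. The 'protocols' array is a hypothetical example, not taken from the dataset:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

protocols = np.array(["tcp", "udp", "icmp", "tcp"])

enc = LabelEncoder()
codes = enc.fit_transform(protocols)

print(enc.classes_)                   # distinct values in sorted order
print(codes)                          # position of each value in classes_
print(enc.inverse_transform(codes))   # back to the original strings
```

An encoder only remembers the column it was fitted on last. If the original strings are needed later (for example to name the predicted classes), using a separate 'LabelEncoder()' per column keeps each mapping available.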

# 6- CatBoostEncoder:

Processing a dataset of this volume already takes a long time, and adding more columns would make it take even longer. For this reason, we use 'CatBoostEncoder()', which keeps one column per feature, instead of 'pandas.get_dummies()', which adds one column per category.

'CatBoostEncoder()' is an encoding method that relies on the target to determine the values of the encoded features. It's basically an enhanced version of 'TargetEncoder()' that is designed to reduce the target leakage problem 'TargetEncoder()' suffers from.

In [10]:
from category_encoders import CatBoostEncoder
encoder = CatBoostEncoder()

X[:,1:2] = encoder.fit_transform(X[:,1:2],y)
X[:,2:3] = encoder.fit_transform(X[:,2:3],y)
X[:,3:4] = encoder.fit_transform(X[:,3:4],y)

  uniques = Index(uniques)
  uniques = Index(uniques)
  uniques = Index(uniques)


Notice that we gave 'CatBoostEncoder()' two values: the column we need to encode and the target. The reason is that 'CatBoostEncoder()' relies on the target of our dataset to encode the data.

Notice also that the previous code raised a warning. It comes from pandas rather than from Python itself, and it announces that in a future version the new columns will be object-type, which is not a big issue since it can be fixed with a small code change.
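
The idea of encoding a row from the targets seen before it can be sketched in plain pandas. This is a simplified illustration of an ordered target statistic, not the library's actual code; the function name 'ordered_target_encode', the smoothing weight 'a', and the sample values are assumptions made for the example.

```python
import pandas as pd

def ordered_target_encode(categories, target, a=1.0):
    """Encode each row from the targets of EARLIER rows in the same category."""
    df = pd.DataFrame({"cat": list(categories), "y": list(target)})
    prior = df["y"].mean()                 # fallback when a category has no history yet
    grouped = df.groupby("cat")["y"]
    prev_sum = grouped.cumsum() - df["y"]  # sum of earlier targets, current row excluded
    prev_count = grouped.cumcount()        # number of earlier rows in the category
    return (prev_sum + prior * a) / (prev_count + a)

# Hypothetical values for illustration only.
encoded = ordered_target_encode(["tcp", "tcp", "udp", "tcp"], [1, 0, 1, 1])
print(encoded.tolist())
```

Because a row never sees its own target, the encoded value leaks less information than a plain per-category mean would.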

# 7- Model Training:

Now that we have solved all our problems with the dataset, we can start training our model. We first split the data: the 'test_size' parameter sets the fraction of rows held out for testing, and the 'random_state' parameter fixes the shuffle so the split is reproducible. Results can vary with both values, so it is worth trying several, running the model each time, and comparing the accuracy of each run before settling on a choice.

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
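
The effect of the two parameters can be checked on a toy array. The sketch below assumes hypothetical data ('X_toy', 'y_toy') and only shows the split sizes and that the same 'random_state' reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(40).reshape(20, 2)   # 20 rows, 2 features
y_toy = np.array([0] * 10 + [1] * 10)

split_a = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)                 # 75% / 25% of the rows
print(np.array_equal(X_te, split_b[1]))       # same seed, same test rows
```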

Note that we have to scale 'X_train' and 'X_test' before using them with 'LogisticRegression()'. To do so, we use 'sklearn.preprocessing.StandardScaler()', which rescales each feature to zero mean and unit variance. The scaler learns the mean and standard deviation from the training set only and then applies them to the test set.

In [12]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
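
A small sketch of what the scaler does, using hypothetical arrays ('X_train_toy', 'X_test_toy'). The statistics come from the training rows only, and the test row is transformed with those same statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[4.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learn mean/std, then scale
X_test_scaled = scaler.transform(X_test_toy)        # reuse the training mean/std

print(scaler.mean_)                   # per-feature mean of the training rows
print(X_train_scaled.mean(axis=0))    # approximately 0 for each feature
print(X_train_scaled.std(axis=0))     # approximately 1 for each feature
print(X_test_scaled)
```

Both features end up on the same scale even though the second one was a hundred times larger.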

# 8- LogisticRegression:

For this model, we will apply 'LogisticRegression()'. Other algorithms can be used and would probably give better results (such as SVC), but since they take way too long to run and since 'LogisticRegression()' gives us decent results, we will use it.

In [13]:
from sklearn.linear_model import LogisticRegression
regressor=LogisticRegression()

regressor.fit(X_train,y_train)
y_pred = regressor.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The warning in the previous code states that our algorithm reached the maximum number of iterations before converging. We can solve that by increasing the value of the 'max_iter' parameter, but that will make our code take much more time to run for a very small improvement in the accuracy of our model.
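
For completeness, this is roughly what raising the limit looks like. The sketch runs on synthetic data from 'make_classification()' rather than on our dataset, and the value 1000 is an arbitrary choice for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 300 rows, 10 features, 3 classes.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)

# Scaling and the classifier in one pipeline; max_iter raised above the default.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used.
n_iter = clf.named_steps["logisticregression"].n_iter_
print(n_iter)
```

Comparing 'n_iter_' with 'max_iter' shows whether the solver stopped because it converged or because it hit the limit.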

# 9- Result:

To show the accuracy of our model, we can use the 'confusion_matrix()'. Note that our target has many unique numbers (23 to be precise), so the 'confusion_matrix()' - ironically - can be confusing.

In [14]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

cm

array([[   534,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,     17,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0],
       [     0,      1,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      4,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0],
       [     0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      3,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0],
       [     0,      0,      0,     10,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0],
       [     0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      1,      2,      0,      0,      0,      0,
             0,      0,      0,      0,      0,     

#### A better method to show the accuracy of our model if we have a large amount of unique values in our target:

In [15]:
from sklearn.metrics import accuracy_score
print(round(accuracy_score(y_test, y_pred),5))

0.9993


#### Our model achieved an accuracy of 99.93%, which is marvelous!
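
With many classes of very different sizes, a single accuracy number can hide how the smaller classes fare. A minimal sketch of reading per-class recall off a confusion matrix, using hypothetical labels ('y_true', 'y_hat') rather than our model's output:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: class 0 is common, class 1 is rare.
y_true = np.array([0] * 8 + [1] * 2)
y_hat = np.array([0] * 8 + [0, 1])    # one of the two rare samples is missed

cm_toy = confusion_matrix(y_true, y_hat)

# Rows are true classes, so diagonal / row sum is the recall of each class.
per_class_recall = cm_toy.diagonal() / cm_toy.sum(axis=1)

print(accuracy_score(y_true, y_hat))
print(per_class_recall)
```

The same two lines applied to 'cm' from the cell above would give one recall value per class present in 'y_test'; classes with no test samples would need to be handled separately to avoid dividing by zero.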