# Detecting Botnets in the Wild

This project is a *binary classification* attempt at identifying botnets within a network of devices. This is done by extracting features from a graph representing the network of devices, and then trying various models to see how they perform at botnet detection.

The dataset and the library functions for extracting it from HDF5 format are from [this paper](https://github.com/harvardnlp/botnet-detection).

This notebook is organized into sections, in the order that I worked on them:

- Background
- Looking at the dataset provided
- Extracting a graph from the dataset provided
- Feature extraction from our graph
- Getting the data in the right format
- Model training
- More Data!! More Training!!
- Results and Discussion



## Bibliography:

Zhou, J., Xu, Z., Rush, A.M., Yu, M. (2020) *Automating Botnet Detection with Graph Neural Networks*. AutoML for Networking and Systems Workshop of MLSys 2020 Conference.


# Note

I also wrote a rough draft of a paper; it should be in the attached zip file.

In [3]:
# to display one equation
from IPython.display import display, Latex

In [4]:
# dataset loader and extractor from 
from botdet.data.dataset_botnet import BotnetDataset
from botdet.data.dataloader import GraphDataLoader

# libraries to plot and handle data
import networkx as nx
import matplotlib.pyplot as plt

In [5]:
# using the code provided, we extract the training set. 
# Even though it says the graph format
# is 'nx', it's a dictionary. I was lied to!!!

# because of this, I need to get the data into 'nx' (Networkx) 
# format myself.
botnet_dataset_train = BotnetDataset(name='chord', graph_format='nx')

In [6]:
print(botnet_dataset_train)

BotnetDataset(topology: chord | split: train | #graphs: 768 | graph format: nx)


In [7]:

data = botnet_dataset_train.data
botnet_dataset_train.path

'data/botnet/processed/chord_train.hdf5'

# Reminder Block
Don't worry about this block below, I'm just using it to learn how to use `networkx`

In [8]:
# reminder block

# create graph
G = nx.Graph()

# add node + node attribute
G.add_node(1,evil = False)

# add node w no attribute
G.add_node(2)

# add edge w no attribute
G.add_edge(1,3)

# add edge + edge attribute
G.add_edge(1,2,evil=True)

# print nodes
print(list(G.nodes))

# access a specific node
print(G.nodes[1])

# access a specific edge
print(G.edges[(1,2)])

# create new attribute for existing node 1
G.nodes[1]["new"] = 1

# print info on node 1
G.nodes[1]

# adding an edge
G.add_edge(2,3)
G.add_node(4)

# in our dataset provided, all nodes have an edge to themself
G.add_edge(1,1)
G.add_edge(2,2)
G.add_edge(3,3)
G.add_edge(4,4)

adjacents = G.adjacency()
# adjacents is an iterator, iterating through a tuple
# the first element of the tuple is the node label
# the second element is a dictionary, pointing to all of
# the nodes adjacent to the node stored in the first element
for i in G.adjacency():
    print(i)
    print(list(i[1].keys()))
    
    
# access neighbors doing
print(list(G.adj[1]))

[1, 2, 3]
{'evil': False}
{'evil': True}
(1, {3: {}, 2: {'evil': True}, 1: {}})
[3, 2, 1]
(2, {1: {'evil': True}, 3: {}, 2: {}})
[1, 3, 2]
(3, {1: {}, 2: {}, 3: {}})
[1, 2, 3]
(4, {4: {}})
[4]
[3, 2, 1]


# Looking at the Data Provided

We are provided with 6 datasets. Here are the names they were given and some information about them:
```
'chord' (synthetic, 10k botnet nodes)
'debru' (synthetic, 10k botnet nodes)
'kadem' (synthetic, 10k botnet nodes)
'leet' (synthetic, 10k botnet nodes)
'c2' (real, ~3k botnet nodes)
'p2p' (real, ~3k botnet nodes)
```

Each dataset has multiple graphs. For example, so far I'm only working with the 'chord' dataset, which has 768 graphs, and I'm only looking at one of them for now.

Within each graph, each node is a device, and an edge between two nodes represents communication between the two devices. In other words, each graph shows the communication within a large network of devices.

For each graph we're given the following:

- `x`: node signals/features. The authors don't explain what these are and I haven't deciphered them either, but that's OK! We'll get our own features.
- `edge_index`: describes all edges between nodes in the graph
    - This is an array of 2 arrays, let's call it `[A,B]`. `A` and `B` have the same length, equal to the total number of edges. Each item in `A` and `B` is a node label, and the nodes at the same index of `A` and `B` are adjacent, i.e. `A[0]` and `B[0]` are two node labels with an edge between them.
    - We use this information to create our graph
- `y`: node labels, has an entry for every node.
    - `1` means it's an evil node (part of a botnet)
    - `0` means it's not an evil node
- `edge_y`: edge labels, has an entry for every edge
    - same as `y`, however this doesn't exist in some of the other graphs, so I'm not using this.
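
To make the `[A,B]` layout concrete, here is a minimal sketch on a made-up three-node example. The arrays `toy_edge_index` and `toy_y` are invented for illustration and are not taken from the dataset; the point is only that `A[i]` and `B[i]` together name the two endpoints of edge `i`.

```python
import numpy as np
import networkx as nx

# Hypothetical toy data laid out like the dataset's `edge_index` and `y`.
toy_edge_index = np.array([[0, 0, 1, 2],
                           [1, 2, 2, 2]])  # the last pair (2, 2) is a self-loop
toy_y = np.array([0, 1, 0], dtype=np.uint8)

A, B = toy_edge_index
toy_graph = nx.Graph()

# One node per entry in `y`, labelled by its index.
toy_graph.add_nodes_from((i, {"evil": int(label)}) for i, label in enumerate(toy_y))

# A[i] and B[i] are the two endpoints of edge i.
toy_graph.add_edges_from(zip(A.tolist(), B.tolist()))

print(sorted(toy_graph.edges()))
print(toy_graph.nodes[1])
```

Passing whole iterables to `add_nodes_from` and `add_edges_from` avoids one Python-level call per node or edge, which may help on graphs with over a million edges.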

In [9]:
# To have a look at the data provided


print(data["0"])

print("\n")
print(len(data.keys()))

{'edge_index': array([[     0,      0,      0, ..., 143104, 143105, 143106],
       [   282,    430,    799, ..., 143104, 143105, 143106]]), 'edge_y': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), 'x': array([[1.],
       [1.],
       [1.],
       ...,
       [1.],
       [1.],
       [1.]], dtype=float32), 'y': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), 'num_edges': 1500299, 'num_evil_edges': 39937, 'num_evils': 10000, 'num_nodes': 143107}


776


# Graph Extraction

## create_graph
Run `create_graph` on the dictionary for one graph (the one with `edge_index`, `edge_y`, `x`, ...), along with a name for the graph.

Returns an `nx.Graph` object.

In [10]:
def create_graph(data,graph_name):
    G = nx.Graph(name=graph_name)
    
    node_labels = data["y"]
    
    # go through each node and add a node to the output graph G
    for index in range(0,len(node_labels)):
        G.add_node(index,evil = node_labels[index])
        # DEBUG:
        # print("Adding node " + str(index) + " evil = " + str(node_labels[index]))
    
    edge_node1 = data["edge_index"][0]
    edge_node2 = data["edge_index"][1]
    
    edge_labels = data["edge_y"]
        
    for index in range(0,len(edge_node1)):
        G.add_edge(edge_node1[index],edge_node2[index],evil = edge_labels[index])
        # DEBUG:
        # print("Adding edge " + str(edge_node1[index]) + " -> " + str(edge_node2[index]) + " " + str(edge_labels[index]))
    return G

In [11]:
# out = create_graph(data["0"],"0")

# Feature Extraction

## add_neighborhood
given an `nx.Graph` object:

This loops through each node and builds a dictionary for it. The keys are the selected node's neighbors, and the value for each key is a list of the selected node's other neighbors that are also adjacent to that key.

This dictionary will be added as another attribute of each given node

Adds 2 attributes to each node:
- the dictionary mentioned above
- `neighbor_count` the number of adjacent nodes

In [12]:
def add_neighborhood(graph):
    
    adjacency = graph.adjacency()
    
    # for each node
    for node_label in list(graph.nodes):
        
        # add number of neighbors attribute
        graph.nodes[node_label]["neighbor_count"] = len(graph.adj[node_label]) - 1

        out_dict = {}
        
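        # note: this list is created once per node, so every key in
        # out_dict ends up pointing at the same list
        neighbors_neighbors = []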
        
        # for each neighbor
        for neighbor in graph.adj[node_label]:
            
            for other_neighbor in graph.adj[node_label]:
                if neighbor != other_neighbor and \
                neighbor in graph.adj[other_neighbor] and \
                not other_neighbor in neighbors_neighbors:
                    neighbors_neighbors.append(other_neighbor)
            
            out_dict[neighbor] = neighbors_neighbors
        
        graph.nodes[node_label]["neighborhood"] = out_dict

In [13]:
# don't rerun unless you must, it takes a while
# add_neighborhood(out)

## Extracting Features from our Neighborhood

To do so, make sure you understand the "neighborhood" dictionary that we have. As a reminder, there's a neighborhood dictionary for each node. 

Let's consider node 1.


out.nodes[1]

Let's look at the neighborhood: node 1 is adjacent to nodes 141737, 142039, and 1 (itself). Let's ignore the key 1, as that's just the same node that we're looking at.

To find the number of triangles:
We know from drawing it out (literally or imaginatively) that there's one triangle, whose vertices are 1, 141737, and 142039. How can we find that from the dictionary data?

I suggest this formula:




In [14]:
display(Latex(r'$ \frac{\sum (\text{length of value} - 2)}{2} $'))

<IPython.core.display.Latex object>

So for the example above with node 1, we look at the `key` 141737 in the dictionary: the `value` is `[1,141737,142039]`, so the `length of the value - 2` is 1. Then we look at the `key` 142039: the `value` is `[1,141737,142039]`, so the `length of the value - 2` is again 1. `(1 + 1) / 2 = 1`. Therefore there is one triangle associated with node 1.

## get_num_of_triangles

Given a graph, this loops through the nodes, counts the triangles each node is part of, and adds the count to the node as an attribute.

In [15]:
def get_num_of_triangles(graph):
    
    # for each node
    for node in list(graph.nodes):
        
        # if it has no neighbors, it has no triangles
        if graph.nodes[node]["neighbor_count"] == 0:
            graph.nodes[node]["num_of_triangles"] = 0
            continue

        # loop through each neighbor in neighborhood
        num_triangs = 0
        for key,value in graph.nodes[node]["neighborhood"].items():
            # don't count if the neighbor is the target node itself
            if key == node:
                continue
            
            # add up length of value - 2
            if len(value) != 0:
                num_triangs += len(value) - 2
            
        # error check
        if num_triangs % 2 != 0:
            print("Something went wrong with node " + str(node))
        
        # divide sum by 2
        num_triangs /= 2
        
        graph.nodes[node]["num_of_triangles"] = num_triangs

In [16]:
# relatively fast
# get_num_of_triangles(out)
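
As a possible cross-check on the hand-rolled count, networkx ships `nx.triangles`, which returns the number of triangles each node belongs to. The sketch below runs it on a small made-up graph (the graph `toy` is an assumption for illustration, not part of the dataset). The self-loops are removed on a copy first, so they cannot be mistaken for triangle edges.

```python
import networkx as nx

# Hypothetical toy graph: one triangle (1, 2, 3), a pendant node 4,
# and a self-loop on every node, mimicking the dataset.
toy = nx.Graph()
toy.add_edges_from([(1, 2), (2, 3), (1, 3), (3, 4)])
toy.add_edges_from((n, n) for n in list(toy.nodes))

# Work on a copy without self-loops.
simple = toy.copy()
simple.remove_edges_from(list(nx.selfloop_edges(simple)))

# Dictionary mapping each node to the number of triangles it is part of.
triangle_counts = nx.triangles(simple)
print(triangle_counts)
```

Comparing a few entries of this dictionary against the `num_of_triangles` attribute would be one way to sanity-check the function above.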

## add_clustering_coef

Adds the clustering coefficient to each node in the graph as another attribute

This coefficient is found by doing
`number of pairs of A's friends who are also friends / number of pairs of A's friends`

I read about clustering coefficients on [this website](https://towardsdatascience.com/graph-machine-learning-with-python-pt-1-basics-metrics-and-algorithms-cc40972de113).

In [17]:
def add_clustering_coef(graph):
    
    
    for node in list(graph.nodes):
        
        if graph.nodes[node]["neighbor_count"] == 0:
            
            graph.nodes[node]["clust_coef"] = 0
        else:
            
            pairs_friends = 0
            neighborhood = graph.nodes[node]["neighborhood"]

            for neighbors in neighborhood.values():
                pairs_friends += len(neighbors)

            pairs_friends /= 2

            graph.nodes[node]["clust_coef"] = pairs_friends / graph.nodes[node]["neighbor_count"]

In [18]:
# add_clustering_coef(out)
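
To illustrate the ratio described above, here is a small sketch that computes it by brute force on a made-up graph and compares it with networkx's built-in `nx.clustering`. The helper `local_clustering` and the graph `toy` are hypothetical names introduced only for this example. Note that `add_clustering_coef` above divides by `neighbor_count` rather than by the number of pairs, so its values will not in general match these.

```python
from itertools import combinations

import networkx as nx


def local_clustering(graph, node):
    """Pairs of `node`'s friends that are linked, divided by all pairs of friends."""
    friends = [n for n in graph.adj[node] if n != node]  # ignore a self-loop
    pairs = list(combinations(friends, 2))
    if not pairs:
        return 0.0  # fewer than two friends means no pairs
    linked = sum(1 for a, b in pairs if graph.has_edge(a, b))
    return linked / len(pairs)


# Hypothetical toy graph: triangle (1, 2, 3) plus an extra edge 1-4.
toy = nx.Graph([(1, 2), (2, 3), (1, 3), (1, 4)])

by_hand = {n: local_clustering(toy, n) for n in toy.nodes}
built_in = nx.clustering(toy)
print(by_hand)
print(built_in)
```

Node 1 has three friends, so three pairs, of which only (2, 3) is linked, giving 1/3.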

## Degree Centrality

There's an `nx` function for it (yay!)

This is almost the same as the degree of each node, but `degree_centrality` also normalizes it: each degree is divided by `n - 1`, the largest degree a node can have in a simple graph with `n` nodes. That keeps the values comparable across graphs of different sizes, so overall this is a better feature to use than just the degree.

In [19]:
# Using our library function, we can get a dictionary mapping
# each node to its degree centrality

# nx.degree_centrality(out)
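
A minimal sketch of what the normalization does, on a made-up star graph (the names `star` and `by_hand` are only for illustration): dividing each degree by `n - 1` reproduces what `nx.degree_centrality` returns.

```python
import networkx as nx

# Star graph with hub 0 and leaves 1, 2, 3 (4 nodes in total).
star = nx.star_graph(3)
n = star.number_of_nodes()

# Degree divided by the largest possible degree in a simple graph.
by_hand = {node: deg / (n - 1) for node, deg in star.degree()}
built_in = nx.degree_centrality(star)

print(by_hand)
print(built_in)
```

The hub touches every other node, so its centrality is 1, while each leaf gets 1/3.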

## Eigenvector Centrality

In [20]:
#  eig = nx.eigenvector_centrality(out,max_iter=600)

## Betweenness Centrality

In [None]:
# bet = nx.betweenness_centrality(out)

#NOTE: NOT USING THIS BECAUSE IT RAN FOREVER!!!!!

## number_of_cliques

Finds the number of maximal cliques for each node

In [22]:
# cliq = nx.number_of_cliques(out)
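
For intuition about what is being counted, the sketch below tallies, for each node, how many maximal cliques contain it, using `nx.find_cliques` on a made-up graph. This is an illustrative alternative under my own assumptions (the graph `toy` and the `Counter`-based tally are not from the original notebook), not a replacement for the call above.

```python
from collections import Counter

import networkx as nx

# Hypothetical toy graph: triangle (1, 2, 3) plus an extra edge 3-4.
# Its maximal cliques are {1, 2, 3} and {3, 4}.
toy = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])

# nx.find_cliques yields each maximal clique as a list of nodes.
cliques_per_node = Counter(
    node for clique in nx.find_cliques(toy) for node in clique
)
print(dict(cliques_per_node))
```

Node 3 sits in both maximal cliques, so it gets a count of 2; the other nodes get 1.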

# Features of each Node
- **neighborhood**
    - a dictionary whose keys are the adjacent nodes i, and whose values are the nodes j, k, ... that i is also adjacent to
    
    - **HOWEVER**, we can't pass a dictionary in as a feature, we must extract features using this information. 
    
- as suggested by the prof: **number of triangles** 
    
- **clustering coefficient**
    - number of pairs of a chosen node's neighbors that are also adjacent to each other, divided by the total number of pairs of nodes adjacent to a chosen node
    
- **degree centrality**
    - normalized measure of degree of a given node

- **eigenvector centrality**

- **number of maximal cliques**

# Desired Data Format
- currently aiming to get a data format that we can put into a model
- so we want an X and a Y, where Y is just the label, and X is all the features
- `x` will have the following features in this order:
    - `number of triangles`, `clustering coefficient`, `degree centrality`, `eigenvector centrality`, `number of maximal cliques`
- `y` will have the following label
    - `0` if not evil, `1` if evil
- this'll be achieved using `get_x_and_y`


https://networkx.org/documentation/stable/reference/algorithms/index.html

In [23]:
def get_x_and_y(graph):
    
    cent = nx.degree_centrality(graph)
    eig = nx.eigenvector_centrality(graph,max_iter=600)
    cliq = nx.number_of_cliques(graph)
    
    x = []
    y = []
    count = 0
    for node in list(graph.nodes):
        x_i = []
        
        x_i.append(graph.nodes[node]["num_of_triangles"])
        x_i.append(graph.nodes[node]["clust_coef"])
        x_i.append(cent[node])
        x_i.append(eig[node])
        x_i.append(cliq[node])
        
        x.append(x_i)
        
        y.append(graph.nodes[node]["evil"])
        
        if count % 1000 == 0:
            print(".",end="")
        
        count += 1
    
    return x,y

In [37]:
# x,y = get_x_and_y(out)

................................................................................................................................................

Note that the cell above takes super long: there are about 143,000 nodes, and I print a `.` for every 1,000 nodes we clear.

After 30 minutes we'd only done about 20,000 nodes, so the whole run should take roughly 143,000 / 20,000 × 30 minutes, which is about 3.5 hours.

The code below also takes a while

In [38]:
# f = open("chord0","w")
# f.write("number of triangles,cluster coefficient, degree centrality, eigenvector centrality, max cliques, evil\n")
# for i in range(len(x)):
#     line = str(x[i][0]) +","+ str(x[i][1]) +","+ str(x[i][2]) + "," + str(x[i][3]) + "," + str(x[i][4]) +"," + str(y[i]) +"\n"
#     f.write(line)
# f.close()

In [43]:
# load the saved train and test features, graph 0 only

import autograd.numpy as np

train_data = np.loadtxt("chord0",delimiter=",",skiprows=1)
x = train_data[:,:-1]
y = train_data[:,-1]

test_data = np.loadtxt("chord0-test",delimiter=",",skiprows=1)
x_test = test_data[:,:-1]
y_test = test_data[:,-1]


# Model Training


## evaluate
From HW3. Takes in the actual labels and predicted labels and returns the confusion matrix counts (false positives, false negatives, true positives, true negatives) and the accuracy.

In [24]:
def evaluate(y_actual,y_pred):
    # count each cell of the confusion matrix
    
    true_positive = 0
    true_negative = 0
    false_positive = 0
    false_negative = 0
    
    for index in range(len(y_pred)):
        if y_actual[index] == 1 and y_pred[index] == 1:
            true_positive += 1
        elif y_actual[index] == 0 and y_pred[index] == 1:
            false_positive += 1
        elif y_actual[index] == 1 and y_pred[index] == 0:
            false_negative += 1
        else:
            true_negative += 1
    
    accuracy = (true_positive + true_negative) / (true_positive + true_negative + false_positive + false_negative)
    return false_positive, false_negative, true_positive, true_negative, accuracy
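
For comparison, scikit-learn offers `confusion_matrix` and `accuracy_score`, which compute the same quantities. The sketch below uses made-up labels (`toy_actual` and `toy_pred` are assumptions for illustration). Note the ordering: for binary labels, `confusion_matrix(...).ravel()` yields `tn, fp, fn, tp`, whereas `evaluate` above returns `fp, fn, tp, tn, accuracy`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, 1 = evil and 0 = not evil.
toy_actual = np.array([1, 0, 1, 0, 0, 1])
toy_pred = np.array([1, 0, 0, 0, 1, 1])

# Rows are actual classes, columns are predicted classes, in `labels` order.
tn, fp, fn, tp = confusion_matrix(toy_actual, toy_pred, labels=[0, 1]).ravel()
acc = accuracy_score(toy_actual, toy_pred)

print(tp, tn, fp, fn, acc)
```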


## Logistic Regression

Let's try sklearn's LogisticRegression

In [14]:
from sklearn.linear_model import LogisticRegression

In [39]:
lr = LogisticRegression()
lr.fit(x,y)
y_predicted = lr.predict_proba(x_test)

print(y_predicted)
# need to reformat output, right now it's [ chance of 0, chance of 1]
# we're changing it to [0] or [1]
# same as HW3 !!
result = []
for tupl in y_predicted:
    if tupl[0] > tupl[1]:
        result.append(0)
    else:
        result.append(1)

y_predicted = result

[[0.92426721 0.07573279]
 [0.92627257 0.07372743]
 [0.9162452  0.0837548 ]
 ...
 [0.92272417 0.07727583]
 [0.90423197 0.09576803]
 [0.89213025 0.10786975]]


In [40]:
fp,fn,tp,tn,acc = evaluate(np.array(y_test),np.array(y_predicted))

print("Accuracy: " + str(acc))
print("True positive: " + str(tp))
print("True negative: " + str(tn))
print("False positive: " + str(fp))
print("False negative: " + str(fn))

Accuracy: 0.9305135318328244
True positive: 0
True negative: 133163
False positive: 0
False negative: 9944


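So on graph 0 of our test data, we got an accuracy of 93%.

However, with zero true positives and zero false positives, it looks like our model is always predicting negative. ): The 93% is just the share of non-evil nodes in the test graph (133,163 of 143,107).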
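
The arithmetic behind that observation can be checked directly from the counts printed above. This is only a sketch; the variable names are mine.

```python
# Counts copied from the logistic regression output above.
tp, tn, fp, fn = 0, 133163, 0, 9944
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
negative_share = (tn + fp) / total  # fraction of nodes that are actually not evil
recall = tp / (tp + fn)             # fraction of evil nodes that were caught

print(accuracy, negative_share, recall)
```

The accuracy equals the share of negatives and the recall is zero, which is why accuracy alone says little on a dataset this imbalanced.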

## More Models

In [48]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn import svm  
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

## multi_test

Given the x and y arrays for training and testing, this fits several binary classification models and prints out the results for each.

In [49]:
def multi_test(x,y,x_test,y_test):
    
    # initialize models
    rfc = RandomForestClassifier()
    lsvc = LinearSVC()
    Svm = svm.SVC()
    mnb = MultinomialNB()
    sgdc = SGDClassifier()
    dtc = DecisionTreeClassifier()
    knc = KNeighborsClassifier()
    lr = LogisticRegression()

    models = [rfc,lsvc,Svm,mnb,sgdc,dtc,knc,lr]
    model_names = ["rfc","lsvc","svm","mnb","sgdc","dtc","knc","lr"]

    predictions = []
    evaluations = []

    for model in models:
        print(str(model))
        
        # fit models
        model.fit(x,y)
        
        # make prediction
        y_predicted = model.predict(x_test)
        predictions.append(y_predicted)
        
        # evaluate (once, then reuse the result)
        result = evaluate(np.array(y_test),np.array(y_predicted))
        evaluations.append(result)
        fp,fn,tp,tn,acc = result
        
        f1 = f1_score(y_test,y_predicted)
        auc = roc_auc_score(y_test,y_predicted)
        
        print("\tF1 Score " + str(f1))
        print("\tAUC Score " + str(auc))
        print("\tAccuracy: " + str(acc))
        print("\tTrue positive: " + str(tp))
        print("\tTrue negative: " + str(tn))
        print("\tFalse positive: " + str(fp))
        print("\tFalse negative: " + str(fn))
        print("\n")
        

In [50]:
multi_test(x,y,x_test,y_test)

RandomForestClassifier()
	F1 Score 0.5469541778975741
	AUC Score 0.7406655574310446
	Accuracy: 0.9415946793058635
	True positive: 5073
	True negative: 130414
	False positive: 3477
	False negative: 4927


LinearSVC()




	F1 Score 0.0
	AUC Score 0.4999962656190483
	Accuracy: 0.9304960004447811
	True positive: 0
	True negative: 133890
	False positive: 1
	False negative: 10000


SVC()
	F1 Score 0.0
	AUC Score 0.5
	Accuracy: 0.9305029501497661
	True positive: 0
	True negative: 133891
	False positive: 0
	False negative: 10000


MultinomialNB()
	F1 Score 0.31898899306971057
	AUC Score 0.8339088777438364
	Accuracy: 0.7097594707104683
	True positive: 9781
	True negative: 92347
	False positive: 41544
	False negative: 219


SGDClassifier()
	F1 Score 0.0
	AUC Score 0.5
	Accuracy: 0.9305029501497661
	True positive: 0
	True negative: 133891
	False positive: 0
	False negative: 10000


DecisionTreeClassifier()
	F1 Score 0.5310404499242916
	AUC Score 0.7321234474311195
	Accuracy: 0.9397321583698772
	True positive: 4910
	True negative: 130309
	False positive: 3582
	False negative: 5090


KNeighborsClassifier()
	F1 Score 0.5652349846811955
	AUC Score 0.7354501419064763
	Accuracy: 0.9477312688076391
	True positive: 4889

# More Data!! More Training!!

It's worth noting that everything we've done so far (turning the data into a graph, extracting its features, etc.) has been for *one graph* within a dataset of *768 graphs*, which is itself one of *6 datasets*.

The time it takes to run all the code above for one graph sums up to about 80 minutes. So on my computer, it should take about 6 weeks (768 × 80 minutes ≈ 1,024 hours ≈ 43 days) to do all 768 graphs for this one dataset.

It would be real nice if I had a server and more time!

Below, I run the same pipeline on a few more graphs.


## save_to_file

Given the desired graph number, the output file name (e.g. `chord0`, `chord0-test`), and the `.data` dictionary of a `BotnetDataset`:

This'll do all the graph creation and feature extraction, and write the result to a file, so that in the future we can just use numpy's `np.loadtxt`.

In [25]:
def save_to_file(graph_num,name,data):
    # create graph
    out1 = create_graph(data[str(graph_num)],str(graph_num))
    
    # extract features
    add_neighborhood(out1)
    get_num_of_triangles(out1)
    add_clustering_coef(out1)
    
    # get x and y
    x1,y1 = get_x_and_y(out1)
    
    # write results to a file (the with-block closes it for us)
    with open(name,"w") as f:
        f.write("number of triangles,cluster coefficient, degree centrality, eigenvector centrality, max cliques, evil\n")
        for i in range(len(x1)):
            # five features followed by the label, comma separated
            line = ",".join(str(v) for v in x1[i]) + "," + str(y1[i]) + "\n"
            f.write(line)
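
As an aside, the same kind of file can be written with `np.savetxt`, which pairs naturally with `np.loadtxt`. The sketch below round-trips a made-up feature matrix through a temporary file; `toy_x`, `toy_y`, the shortened header, and the file name are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical features (3 columns) and labels for two nodes.
toy_x = np.array([[1.0, 0.5, 0.1],
                  [0.0, 0.0, 0.2]])
toy_y = np.array([1, 0])
header = "number of triangles,cluster coefficient,degree centrality,evil"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-features")
    # comments="" stops savetxt from prefixing the header line with "# ".
    np.savetxt(path, np.column_stack((toy_x, toy_y)),
               delimiter=",", header=header, comments="")
    loaded = np.loadtxt(path, delimiter=",", skiprows=1)

# Last column is the label, everything before it is a feature.
x_back, y_back = loaded[:, :-1], loaded[:, -1]
print(x_back, y_back)
```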

The cell below is what I ran to extract graphs from the dataset and save them to files, so that I can call `np.loadtxt()` in the future.

In [26]:
# this code takes a long time; don't run it, I already did that and saved the results to files

botnet_dataset_train = BotnetDataset(name='chord', graph_format='nx')
data_train = botnet_dataset_train.data

botnet_dataset_test = BotnetDataset(name='chord', split='test', graph_format='nx')
data_test = botnet_dataset_test.data

print("0")
save_to_file(0,"chord0-test",data_test)

print("1")

save_to_file(1,"chord1",data_train)
save_to_file(1,"chord1-test",data_test)

print("2")
save_to_file(2,"chord2",data_train)
save_to_file(2,"chord2-test",data_test)

print("3)")
save_to_file(3,"chord3",data_train)
save_to_file(3,"chord3-test",data_test)

print("4")
save_to_file(4,"chord4",data_train)
save_to_file(4,"chord4-test",data_test)

print("5")
save_to_file(5,"chord5",data_train)
save_to_file(5,"chord5-test",data_test)

print("6")
save_to_file(6,"chord6",data_train)
save_to_file(6,"chord6-test",data_test)

0
................................................................................................................................................1
..................................................................................................................................................................................................................................................................................................2
................................................................................................................................................................................................................................................................................................3)
.............................................................................................................................................................................................................................................................................

In [51]:
def load_data(filename):
    data = np.loadtxt(filename,delimiter=",",skiprows=1)
    
    x = data[:,:-1]
    y = data[:,-1]
    
    return x,y

In [54]:
# load our data from files

import numpy as np

# graph 0
x0,y0 = load_data("chord0")
x0_test,y0_test = load_data("chord0-test")

# graph 1
x1,y1 = load_data("chord1")
x1_test,y1_test = load_data("chord1-test")

# graph 2
x2,y2 = load_data("chord2")
x2_test,y2_test = load_data("chord2-test")

# graph 3
x3,y3 = load_data("chord3")
x3_test,y3_test = load_data("chord3-test")

# graph 4
x4,y4 = load_data("chord4")
x4_test,y4_test = load_data("chord4-test")

# graph 5
x5,y5 = load_data("chord5")
x5_test,y5_test = load_data("chord5-test")

# graph 6
x6,y6 = load_data("chord6")
x6_test,y6_test = load_data("chord6-test")

[0. 0. 0. ... 0. 0. 0.]


In [55]:
# concatenate the loaded information

x_train = np.vstack((x0,x1,x2,x3,x4,x5,x6))
y_train = np.concatenate((y0,y1,y2,y3,y4,y5,y6))

x_test = np.vstack((x0_test,x1_test,x2_test,x3_test,x4_test,x5_test,x6_test))
y_test = np.concatenate((y0_test,y1_test,y2_test,y3_test,y4_test,y5_test,y6_test))


In [56]:
multi_test(x_train,y_train,x_test,y_test)

RandomForestClassifier()
	F1 Score 0.5584368299688498
	AUC Score 0.7362610573606113
	Accuracy: 0.9455328993264344
	True positive: 34510
	True negative: 912896
	False positive: 19085
	False negative: 35490


LinearSVC()




	F1 Score 0.00042827163842453143
	AUC Score 0.5000889021419351
	Accuracy: 0.9301194334024298
	True positive: 15
	True negative: 931947
	False positive: 34
	False negative: 69985


SVC()
	F1 Score 0.0
	AUC Score 0.5
	Accuracy: 0.9301383958378452
	True positive: 0
	True negative: 931981
	False positive: 0
	False negative: 70000


MultinomialNB()
	F1 Score 0.19446078977082967
	AUC Score 0.5904166605097866
	Accuracy: 0.7977945689588924
	True positive: 24455
	True negative: 774920
	False positive: 157061
	False negative: 45545


SGDClassifier()
	F1 Score 0.0
	AUC Score 0.5
	Accuracy: 0.9301383958378452
	True positive: 0
	True negative: 931981
	False positive: 0
	False negative: 70000


DecisionTreeClassifier()
	F1 Score 0.5499742658073239
	AUC Score 0.7426131913556792
	Accuracy: 0.9415328234766926
	True positive: 35797
	True negative: 907601
	False positive: 24380
	False negative: 34203


KNeighborsClassifier()
	F1 Score 0.5713000824470964
	AUC Score 0.7365357716519971
	Accuracy: 0.94862477