# Hacka (midterm) thon 

## Detecting Malicious URLs 

Today you are invited to retrace the path of the researchers who studied the detection of malicious URLs.
The data is an anonymized 120-day subset of the authors' ICML-09 data set.
It consists of about 2.4 million URLs (examples) and 3.2 million features.

#### 1. Download the data using the link below
[Download Dataset](http://www.sysnet.ucsd.edu/projects/url/url_svmlight.tar.gz)

#### 2. Description of Data (SVM-light)
Uncompressing the archive url_svmlight.tar.gz will yield a directory url_svmlight/ containing the following files:

1. **FeatureTypes**. A text file listing the feature indices that correspond to real-valued features.
2. **DayX.svm** (where X is an integer from 0 to 120). The data for day X in SVM-light format. A label of +1 corresponds to a malicious URL and -1 corresponds to a benign URL.
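
For orientation, each line of an SVM-light file holds a label followed by sparse `index:value` pairs. The snippet below is a minimal sketch of reading one such line by hand. `parse_svmlight_line` is a hypothetical helper written for illustration, and the sample line is made up rather than taken from the data set. In practice, `load_svmlight_file` from scikit-learn does this for whole files.

```python
def parse_svmlight_line(line):
    """Parse one SVM-light line into (label, {feature_index: value}).

    Hypothetical helper for illustration only.
    """
    # Drop an optional trailing comment introduced by '#', then tokenize.
    body = line.split('#', 1)[0].split()
    label = int(body[0])
    features = {}
    for pair in body[1:]:
        index, value = pair.split(':')
        features[int(index)] = float(value)
    return label, features


# Made-up example line: a malicious URL (+1) with three non-zero features.
label, features = parse_svmlight_line("+1 4:0.5 7:1 12:0.25")
print(label, features)
```

Only the non-zero features are stored, which is why a matrix with millions of columns still fits in files of a few megabytes per day.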


#### 3. Read article
Please familiarize yourself with the original research article. It will give you the required context.

*"**Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs**"* 

*Justin Ma, Lawrence K. Saul, Stefan Savage, Geoffrey M. Voelker* 

## Demo part

#### 1. Load data

In [1]:
import glob
import matplotlib.pyplot as plt
from sklearn.datasets import load_svmlight_file
files = glob.glob('./url_svmlight/*.svm')
print("There are %d files" % len(files))
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

There are 121 files


#### 2. What is inside

In [2]:
import tarfile
from sklearn.datasets import load_svmlight_file
import numpy as np

In [3]:
uri = './url_svmlight.tar.gz'
tar = tarfile.open(uri, "r:gz")
max_obs = 0       # largest number of rows (URLs) seen in one file
max_features = 0  # largest number of columns (features) seen in one file
i = 0
split = 5
for tarinfo in tar:
    print("extracting %s,f size %s" % (tarinfo.name, tarinfo.size))
    if tarinfo.isfile():
        f = tar.extractfile(tarinfo.name)
        X, y = load_svmlight_file(f)
        max_obs = max(max_obs, X.shape[0])
        max_features = max(max_features, X.shape[1])
    # Stop after the first few archive members; this is only a quick look.
    if i > split:
        break
    i += 1
print("max X = %s, max y dimension = %s" % (max_features, max_obs))

extracting url_svmlight,f size 0
extracting url_svmlight/Day33.svm,f size 18674876
extracting url_svmlight/Day32.svm,f size 18599211
extracting url_svmlight/Day53.svm,f size 18963938
extracting url_svmlight/Day20.svm,f size 18633460
extracting url_svmlight/Day7.svm,f size 18777054
extracting url_svmlight/Day117.svm,f size 18106370
max X = 3231952, max y dimension = 20000


#### 3. Train a baseline model

In [4]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

classes = [-1, 1]  # +1: malicious URL, -1: benign URL
sgd = SGDClassifier(loss='log')  # this loss is named 'log_loss' in scikit-learn 1.1 and later
n_features = 3231952
split = 5
i = 0
for tarinfo in tar:
    if i > split:
        break
    if tarinfo.isfile():
        f = tar.extractfile(tarinfo.name)
        X,y = load_svmlight_file(f,n_features=n_features)
        if i < split:
            sgd.partial_fit(X,y, classes = classes)
        if i == split:
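            # Note: classification_report expects (y_true, y_pred); with the
            # arguments in this order, precision and recall trade places.
            print(classification_report(sgd.predict(X), y))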
    i+=1

             precision    recall  f1-score   support

         -1       0.97      0.99      0.98     13979
          1       0.98      0.93      0.95      6021

avg / total       0.97      0.97      0.97     20000



## Midterm (Part 2)

### Grading criteria
- Complete solution - 50%
- F1 Score - 40%
    - The top scorer gets 40%
    - The lowest scorer gets 20%
    - Everyone else is placed on a scale between these two
- Code Style - 10%


### Deadline
9:00 AM MSK Monday

#### 1. Train, test
- Load the data (you can use the template above)
- Separate your dataset into train and test subsets of observations
- Use the 8:2 ratio: 80% train set, 20% test set
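
The 8:2 split can be sketched on labels alone. The snippet below is one possible approach, not the required one: it shuffles the row indices of each class with a seeded generator and sends the first 80% of each class to train, so both subsets keep roughly the same share of malicious URLs. `stratified_split` and the toy labels are assumptions made for illustration.

```python
import random


def stratified_split(labels, train_size=0.8, seed=0):
    """Return (train_idx, test_idx), split separately within each class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(train_size * len(indices) + 0.5)  # round half up
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 benign (-1) and 5 malicious (+1) examples.
labels = [-1] * 10 + [1] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which matters when F1 scores from different runs are compared.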

In [6]:
from random import random

def train_test_split(train_size=0.8):
    """Split every DayX.svm into train.svm / test.svm, stratified by label."""
    # The context managers make sure both output files are flushed and closed.
    with open('./url_svmlight/train.svm', 'w') as train, \
         open('./url_svmlight/test.svm', 'w') as test:
        summ = 0
        for i in range(121):
            path = './url_svmlight/Day' + str(i) + '.svm'
            print("file: " + str(i))

            # First pass: count benign (-1) and malicious (+1) examples.
            q = [0, 0]
            with open(path, 'r') as inn:
                for line in inn:
                    q[0 if line.split()[0] == "-1" else 1] += 1
            summ += q[0] + q[1]

            # Per-class train quota, rounded half up.
            check = [int(train_size * n + 0.5) for n in q]
            start = [0, 0]  # examples written to train so far, per class

            # Second pass: a coin flip picks the preferred subset; once a
            # subset has reached its quota, the line goes to the other one.
            # q[c] counts the examples of class c not yet sent to test.
            with open(path, 'r') as inn:
                for line in inn:
                    c = 0 if line.split()[0] == "-1" else 1
                    if random() > 0.5:
                        to_train = start[c] < check[c]
                    else:
                        to_train = q[c] <= check[c]
                    # Write exactly one newline per example (no blank lines).
                    record = line.rstrip("\n") + "\n"
                    if to_train:
                        train.write(record)
                        start[c] += 1
                    else:
                        test.write(record)
                        q[c] -= 1
            print("finish: " + str(i))
        print(summ)

In [7]:
from sklearn.datasets import load_svmlight_file
import numpy as np

train_test_split()

# Passing n_features explicitly keeps the train and test matrices the same width.
n_features = 3231961

data, target = load_svmlight_file("./url_svmlight/train.svm", n_features=n_features)

file: 0
finish: 0
file: 1
finish: 1
file: 2
finish: 2
file: 3
finish: 3
file: 4
finish: 4
file: 5
finish: 5
file: 6
finish: 6
file: 7
finish: 7
file: 8
finish: 8
file: 9
finish: 9
file: 10
finish: 10
file: 11
finish: 11
file: 12
finish: 12
file: 13
finish: 13
file: 14
finish: 14
file: 15
finish: 15
file: 16
finish: 16
file: 17
finish: 17
file: 18
finish: 18
file: 19
finish: 19
file: 20
finish: 20
file: 21
finish: 21
file: 22
finish: 22
file: 23
finish: 23
file: 24
finish: 24
file: 25
finish: 25
file: 26
finish: 26
file: 27
finish: 27
file: 28
finish: 28
file: 29
finish: 29
file: 30
finish: 30
file: 31
finish: 31
file: 32
finish: 32
file: 33
finish: 33
file: 34
finish: 34
file: 35
finish: 35
file: 36
finish: 36
file: 37
finish: 37
file: 38
finish: 38
file: 39
finish: 39
file: 40
finish: 40
file: 41
finish: 41
file: 42
finish: 42
file: 43
finish: 43
file: 44
finish: 44
file: 45
finish: 45
file: 46
finish: 46
file: 47
finish: 47
file: 48
finish: 48
file: 49
finish: 49
file: 50
finish: 50


#### 2. Find out whether the dimensionality can be reduced

In [None]:
# YOUR CODE HERE
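
One simple check, sketched here under assumptions rather than offered as the expected solution, is to count how many training examples actually use each feature and to drop the columns that are never (or almost never) non-zero. The helper name `drop_rare_features`, the threshold `min_count`, and the toy matrix are hypothetical. On the real data you would pass the sparse matrix returned by `load_svmlight_file`.

```python
import numpy as np
from scipy.sparse import csr_matrix


def drop_rare_features(X, min_count=1):
    """Keep only columns with at least `min_count` stored (non-zero) entries."""
    nnz_per_column = X.getnnz(axis=0)
    keep = np.flatnonzero(nnz_per_column >= min_count)
    return X[:, keep], keep


# Toy matrix: 3 examples, 5 features; columns 1 and 3 are entirely zero.
X_toy = csr_matrix(np.array([
    [1.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 0.0, 0.0],
]))
X_reduced, kept = drop_rare_features(X_toy)
print(X_toy.shape, "->", X_reduced.shape)
print(kept)
```

The same `kept` indices would have to be applied to the test matrix so that train and test share one feature space.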

#### 3. Create a model

In [8]:
from sklearn.linear_model import PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier()
pac.fit(data,target)



PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              fit_intercept=True, loss='hinge', max_iter=None, n_iter=None,
              n_jobs=1, random_state=None, shuffle=True, tol=None,
              verbose=0, warm_start=False)

#### 4. Get the quality
- precision
- recall
- f1-score
- support 
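
As a reminder of what these columns mean, the sketch below computes precision, recall and F1 for the malicious class (+1) directly from true and predicted labels. Support is simply the number of true examples of that class. The function name `binary_metrics` and the toy labels are illustrative assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1, support) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn
    return precision, recall, f1, support


# Toy example: 4 truly malicious URLs, one missed and one false alarm.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1, 1, -1, 1]
print(binary_metrics(y_true, y_pred))
```

Precision is computed over the predicted positives and recall over the true positives, so the two are not interchangeable when true and predicted labels are passed in the wrong order.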

In [9]:
from sklearn.metrics import classification_report

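# Reuse the names `data` and `target` for the held-out test set.
data, target = load_svmlight_file("./url_svmlight/test.svm", n_features=n_features)
# Note: classification_report expects (y_true, y_pred); with the arguments
# in this order, precision and recall trade places.
print(classification_report(pac.predict(data), target, digits=6))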

             precision    recall  f1-score   support

       -1.0   0.993186  0.990672  0.991927    321609
        1.0   0.981064  0.986131  0.983591    157617

avg / total   0.989199  0.989178  0.989185    479226

