# Hacka (midterm) thon 

## Detecting Malicious URLs 

Today you are invited to repeat the path of researchers Detecting Malicious URLs.
An anonymized 120-day subset of our ICML-09 data set.
The data set consists of about 2.4 million URLs (examples) and 3.2 million features. 

#### 1. Download data using link below
[Download Dataset](http://www.sysnet.ucsd.edu/projects/url/url_svmlight.tar.gz)

#### 2. Description of Data (SVM-light)
Uncompressing the archive url_svmlight.tar.gz will yield a directory url_svmlight/ containing the following files:

1. **FeatureTypes**. A text file list of feature indices that correspond to real-valued features.
2. **DayX.svm** (where X is an integer from 0 to 120) --- The data for day X in SVM-light format. A label of +1 corresponds to a malicious URL and -1 corresponds to a benign URL.


#### 3. Read article
Please familiarize yourself with original research article. It will give you required context.

*"**Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs**"* 

*Justin Ma, Lawrence K. Saul, Stefan Savage, Geoffrey M. Voelker* 

## Demo part

#### 1. Upload data

In [1]:
import glob
import matplotlib.pyplot as plt
from sklearn.datasets import load_svmlight_file
files = glob.glob('./url_svmlight/*.svm')
print("There are %d files" % len(files))
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

There are 121 files


#### 2. What is inside

In [2]:
import tarfile
from sklearn.datasets import load_svmlight_file
import numpy as np

In [3]:
uri = ('./url_svmlight.tar.gz')
tar = tarfile.open(uri, "r:gz")
max_obs = 0
max_vars = 0
i = 0
split = 5
for tarinfo in tar:
    print("extracting %s,f size %s" % (tarinfo.name, tarinfo.size))
    if tarinfo.isfile():
        f = tar.extractfile(tarinfo.name)
        X,y = load_svmlight_file(f)
        max_vars = np.maximum(max_vars, X.shape[0])
        max_obs = np.maximum(max_obs, X.shape[1])
    if i > split:
        break
    i+=1
print("max X = %s, max y dimension = %s" % (max_obs, max_vars)) 

extracting url_svmlight,f size 0
extracting url_svmlight/Day33.svm,f size 18674876
extracting url_svmlight/Day32.svm,f size 18599211
extracting url_svmlight/Day53.svm,f size 18963938
extracting url_svmlight/Day20.svm,f size 18633460
extracting url_svmlight/Day7.svm,f size 18777054
extracting url_svmlight/Day117.svm,f size 18106370
max X = 3231952, max y dimension = 20000


#### 3. What is inside

In [4]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

classes = [-1,1] # 1_:url- safety, -1: url- non-safety
sgd = SGDClassifier(loss='log')
n_features = 3231952
split = 5
i = 0
for tarinfo in tar:
    if i > split:
        break
    if tarinfo.isfile():
        f = tar.extractfile(tarinfo.name)
        X,y = load_svmlight_file(f,n_features=n_features)
        if i < split:
            sgd.partial_fit(X,y, classes = classes)
        if i == split:
            print (classification_report(sgd.predict(X),y))
    i+=1

             precision    recall  f1-score   support

         -1       0.97      0.99      0.98     13979
          1       0.98      0.93      0.95      6021

avg / total       0.97      0.97      0.97     20000



## Midterm (Part 2)

### Grading criteria
- Complete solution - 50%
- F1 Score - 40%
    - The first champion champion get 40%
    - Worst champion get 20%
    - All others are on a scale between them
- Code Style - 10%


### Deadline
9:00 AM MSK Monday

#### 1. Train, test
- Upload data (you can use template above)
- Separate your dataset into train and test subsets of observations
- Use the 8:2 ratio: 80% train set, 20% test set

In [None]:
# YOUR CODE HERE

#### 2. Find out whether it is possible to reduce the dimension?

In [None]:
# YOUR CODE HERE

#### 3. Create a model

In [None]:
# YOUR CODE HERE

#### 4. Get the quality
- precision
- recall
- f1-score
- support 

In [None]:
# YOUR CODE HERE