In [None]:
import pandas as pd
import numpy as np
from sklearn import feature_extraction
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics


#Decision Tree Worksheet
This worksheet will walk you through the process of creating a decision tree classifier, which you will use to determine whether a given URL is malicious or not.  Once you have implemented the classifier, the worksheet will walk you through evaluating your model.  

##Step 1:  Get your data
The first step in any data science project is gathering the data.  In the ```Data``` folder, you will find a file called ```urlData2.csv``` which contains a series of URLs known to be malicious and others known not to be malicious.  The code we used to build this dataset is included below:
```python
whitelist = pd.read_csv( '../Data/url-whitelist.csv', names=['rank','url'])
whitelist = whitelist.drop( 'rank', axis=1)
whitelist = whitelist.sample( 400 )
whitelist['isMalicious'] = 0

blacklist = pd.read_csv( '../Data/url-blacklist', names=['url','something'] )
blacklist = blacklist.drop( 'something', axis=1)
blacklist = blacklist.sample(400)
blacklist['isMalicious'] = 1
urlData = pd.concat( [whitelist, blacklist], ignore_index=True )

def getUrlData( row ):
    try:
        data = pd.read_json( 'https://whois.apitruck.com/:' + row['url'] )
        data = data.drop( 'error', axis=1 )
        data = data.T
        row['created'] = data['created'].iloc[0]
        row['changed'] = data['changed'].iloc[0]
        row['expires'] = data['expires'].iloc[0]
    
        row['dnssec'] = data['dnssec'].iloc[0]
        return row
    except Exception:
        print( "Could not find: " + row['url'] )
        row['created'] = False
        row['changed'] = False
        row['expires'] = False
        return row


urlData = urlData.apply( getUrlData, axis=1 )
urlData.to_csv( 'urlData2.csv' )
```

Read in the data, remove any unneeded columns, and convert the date columns to datetime.  This data is meant to simulate real-world data, so the date columns contain invalid values (the collection code above writes ```False``` whenever a lookup fails).  You can either drop the rows with invalid dates, or replace the invalid dates with some marker.  I'd recommend reading the documentation for the ```pd.to_datetime()``` function (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html), especially the ```errors='coerce'``` option.  Store your data in a DataFrame called ```urlData```.
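
As a rough illustration of the date handling described above, here is a minimal sketch.  It uses a small inline stand-in for ```urlData2.csv``` (the rows and the ```sample``` and ```cleaned``` names are made up for this example), so adapt the column names to whatever your file actually contains.

```python
import io
import pandas as pd

# Hypothetical stand-in for urlData2.csv so the example runs on its own.
sample_csv = io.StringIO(
    "url,isMalicious,created,changed,expires\n"
    "example.com,0,1995-08-14,2015-08-14,2025-08-13\n"
    "bad-site123.biz,1,False,False,False\n"
)
sample = pd.read_csv(sample_csv)

# errors='coerce' turns any value that cannot be parsed as a date into NaT.
for column in ['created', 'changed', 'expires']:
    sample[column] = pd.to_datetime(sample[column], errors='coerce')

# One option: drop the rows whose dates could not be parsed.
cleaned = sample.dropna(subset=['created', 'expires'])
print(cleaned)
```

Dropping rows is only one choice.  Keeping the ```NaT``` values and adding a flag column such as "has valid dates" is another, and a failed lookup may itself be a useful signal.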

##Step 2:  Add Features
Now that you have your data in memory, you need to extract features from the URL that you will later use for the classification exercise.  To begin with, use the techniques you have learned to extract the following features:
- Time difference between creation and expiration 
- The URL suffix (e.g. .com, .org)
- The number of periods in the URL
- The number of digits
- The position (or index) of the first digit

Feel free to add additional features.
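
One possible way to derive the features listed above is sketched here on a tiny made-up DataFrame.  The ```demo``` frame, the new column names, and the ```first_digit_position``` helper are all assumptions for illustration, not part of the worksheet's dataset.

```python
import pandas as pd

def first_digit_position(s):
    # Hypothetical helper: index of the first digit, or -1 if there is none.
    return next((i for i, c in enumerate(s) if c.isdigit()), -1)

# Made-up rows standing in for urlData.
demo = pd.DataFrame({
    'url': ['example.com', 'bad-site123.co.uk'],
    'created': pd.to_datetime(['2010-01-01', '2016-03-01']),
    'expires': pd.to_datetime(['2020-01-01', '2017-03-01']),
})

# Time difference between creation and expiration, in days.
demo['durationDays'] = (demo['expires'] - demo['created']).dt.days
# Everything after the last period is treated as the suffix.
demo['suffix'] = demo['url'].str.rsplit('.', n=1).str[-1]
# str.count takes a regular expression, so the period must be escaped.
demo['periodCount'] = demo['url'].str.count(r'\.')
demo['digitCount'] = demo['url'].str.count(r'\d')
demo['firstDigitIndex'] = demo['url'].apply(first_digit_position)
print(demo)
```

Taking the text after the last period is a simplification: it reports ```uk``` rather than ```co.uk``` for the second row, which may or may not be what you want.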

In [None]:
def firstDigitIndex( s ):
    # Return the index of the first digit in s, or -1 if s contains no digits
    for i, c in enumerate(s):
        if c.isdigit():
            return i
    return -1


##Step 3: Vectorize your Data
Once you have extracted your features from the dataset, you need to vectorize the categorical columns and remove any extraneous columns from the dataset.  In this instance, extraneous columns are any columns that are neither features nor the target column.  You can pretty much adapt code from the Play Golf example, but the basic steps are:
1.  Identify the columns which contain categorical data
2.  Extract those columns and convert them to a dictionary
3.  Use scikit-learn's ```DictVectorizer``` to vectorize the data
4.  Create a DataFrame with the Vectorized data
5.  Merge the vectorized data with the original dataset
6.  Drop the categorical columns and any other unneeded columns


In [None]:
#Identify Columns which contain categorical data


#Convert those columns to a dictionary


#Use the scikit-learn dict vectorizer to encode these columns


#Convert this back to a dataframe with the appropriate column names

#Get the vectorized column names

#Add those columns to the dataframe vector

#duplicate the index

#Drop unneeded columns


##Step 4:  Split the data
Next, we need to divide the dataset into two dataframes, one containing the features and the other containing the target column.  Traditionally, many machine learning texts refer to these as X and Y respectively.  However, I believe that unclear variable names are an excellent way to introduce errors and to produce unreadable code, so I will name these ```features``` and ```labels``` instead. 

Finally, we will need to randomly divide the dataset into a training and testing set.  To do this, use the ```train_test_split()``` function within scikit-learn.  For our purposes, divide the data up by 75/25 training to test data.
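
A minimal sketch of both splits, assuming a made-up ```demo``` DataFrame whose target column is ```isMalicious```.  The ```random_state``` value is arbitrary and only makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, already-vectorized data standing in for urlData.
demo = pd.DataFrame({
    'digitCount':  [0, 0, 3, 5, 1, 0, 4, 2],
    'periodCount': [1, 1, 3, 2, 1, 2, 4, 1],
    'isMalicious': [0, 0, 1, 1, 0, 0, 1, 1],
})

# Split the columns: everything except the target is a feature.
features = demo.drop(columns='isMalicious')
labels = demo['isMalicious']

# Split the rows: 75% for training, 25% held back for testing.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.25, random_state=42
)
print(len(features_train), len(features_test))
```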


In [None]:
# Split the data set into features and labels

#Split the data set into training and test data




##Step 5:  Train the Classifier
Finally, we have prepared and segmented the data. Let's start classifying!!  
1.  Use the scikit-learn ```DecisionTreeClassifier()``` to create a decision tree, and train it
2.  Next, pull a few random rows from the data and see if your classifier got them correct.

If you are interested in trying a real unknown URL, you'll have to create a function to generate the features for that URL before you run it through the classifier.  
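
As a sketch of these two steps, the following trains a tree on a handful of invented rows and checks a single prediction.  The data and the ```random_state``` values are assumptions for illustration only.

```python
import pandas as pd
from sklearn import tree

# Made-up training data standing in for your features and labels.
features = pd.DataFrame({
    'digitCount':  [0, 0, 3, 5, 1, 4],
    'periodCount': [1, 2, 3, 2, 1, 4],
})
labels = pd.Series([0, 0, 1, 1, 0, 1])

# Train a decision tree using the entropy criterion.
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(features, labels)

# Pull one random row and compare the prediction with the known label.
sample_row = features.sample(1, random_state=1)
prediction = clf.predict(sample_row)
print(prediction[0], labels.loc[sample_row.index[0]])
```

Because the sampled row was part of the training data, a correct prediction here is expected and says little about how the model handles unseen URLs.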

In [None]:
# Train the decision tree based on the entropy criterion

#Extract a row from the data

#Make the prediction


##Optional Step 5a:  Visualizing your Tree
As an optional step, you can actually visualize your tree.  The following code will generate a graph of your decision tree.  You will need graphviz (http://www.graphviz.org) and pydotplus (or pydot) installed for this to work.
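
One way this could be done is with scikit-learn's ```export_graphviz()```, which produces DOT text that GraphViz can render.  The sketch below uses invented data and class names, and the rendering calls are left commented out because they assume pydotplus and GraphViz are installed.

```python
from sklearn import tree

# Made-up data: columns are digitCount and periodCount.
X = [[0, 1], [0, 2], [3, 3], [5, 2]]
y = [0, 0, 1, 1]
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)

# out_file=None returns the DOT description as a string instead of writing a file.
dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=['digitCount', 'periodCount'],
    class_names=['benign', 'malicious'],
    filled=True,
)
print(dot_data)

# With pydotplus and GraphViz available, the DOT text could then be rendered:
# import pydotplus
# graph = pydotplus.graph_from_dot_data(dot_data)
# graph.write_png('tree.png')
```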

In [None]:
#These libraries are used to visualize the decision tree and require that you have GraphViz
#and pydot or pydotplus installed on your computer.




##Step 6:  Assessing your Model's Accuracy
Now that you have a model, the next step is to assess how accurate the model is.
1.  Call the ```.predict()``` method with your training data and store the results (a NumPy array) in a new variable called ```training_predictions```.
2.  Next, call the ```metrics.accuracy_score()``` function using your training target data and the predictions you just made.  
3.  Print out the result.
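
A minimal sketch of these three steps, using invented training data in place of your own:

```python
from sklearn import metrics, tree

# Made-up training data: columns are digitCount and periodCount.
features_train = [[0, 1], [0, 2], [3, 3], [5, 2], [1, 1], [4, 4]]
labels_train = [0, 0, 1, 1, 0, 1]

clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(features_train, labels_train)

# Predict on the same rows the model was trained on.
training_predictions = clf.predict(features_train)

# accuracy_score expects the true labels first, then the predictions.
training_accuracy = metrics.accuracy_score(labels_train, training_predictions)
print(training_accuracy)
```

An unpruned decision tree can usually memorize its training data, so a very high score here is not evidence that the model generalizes.  Running the same two calls on the held-back test data gives a more honest number.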


In [None]:
#How accurate is the classifier?


In [None]:
#Classification Report...  but this is all on known data.


#K-fold validation
Steps:
1.  Partition the dataset into *k* different subsets
2.  Create *k* different models, each trained on *k-1* subsets and tested on the one remaining subset
3.  Measure the performance on each of the models and take the average measure.
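
These steps can be sketched with scikit-learn's ```KFold``` and ```cross_val_score()```.  The example below uses synthetic data from ```make_classification()``` rather than the URL dataset, and the choice of 5 folds is an assumption, not something the worksheet prescribes.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the URL features and labels.
features, labels = make_classification(n_samples=200, n_features=5, random_state=0)

clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)

# Partition the data into k=5 shuffled subsets.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Train on 4 subsets and test on the remaining one, once per fold.
scores = cross_val_score(clf, features, labels, cv=folds)

# Average the per-fold accuracy.
print(scores, np.mean(scores))
```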


In [None]:
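from scipy.stats import sem  # standard error of the mean

# scores should hold the per-fold results from your k-fold run
def mean_score( scores ):
    return "Mean score: {0:.3f} (+/- {1:.3f})".format( np.mean(scores), sem( scores ))

print( mean_score( scores))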