### Objective

So far I have run the data through 4 different models with several combinations of settings and sample sizes. By far the best two models have been a pure Naive Bayes and an SGD classifier with the RBF kernel. The plan now is to do a better job of manipulating the data. The first part of that is exploring functions that help me determine how impactful each feature is.

#### Sklearn
Sklearn has a whole module dedicated to finding the importance of a feature, called "feature_selection". I'll practice using its functions in this notebook to get a better understanding of how they work.

In [56]:
import pandas as pd
import numpy as np
import sklearn.feature_selection as sel
from sklearn import linear_model

In [8]:
X = pd.read_csv('data/X.csv', index_col=0)
Y = pd.read_csv('data/Y.csv', index_col=0, names=['Crimes'])
y = Y['Crimes']

In [213]:
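def featRank(col, importance, boolean=True):
    # Pair each column name with its score, then sort (descending when boolean is True).
    ranks = [{'feature': c, 'importance': x} for c, x in zip(col, importance)]
    return sorted(ranks, key=lambda r: r['importance'], reverse=boolean)

def printRank(collection, integer=True):
    # Print "feature: position" when integer is True, otherwise "feature: score".
    # Keys are looked up by name rather than unpacked, so dict ordering does not matter.
    for n, entry in enumerate(collection):
        rank = n + 1 if integer else entry['importance']
        print("%s: %s" % (entry['feature'], rank))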

In [205]:
X = (1 + X - X.mean()) / (X.max() - X.min()) # mean-normalize each column; the +1 is intended to keep all values positive
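
One thing to watch in the cell above: because the `+1` sits inside the numerator, each column is shifted by `1 / (max - min)` rather than by 1, so positivity is only guaranteed when a column's range is small. A minimal sketch of a variant that does keep every value in (0, 2) is below. The toy frame and the helper name `mean_normalize_positive` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for X; column names and values are illustrative only.
toy = pd.DataFrame({"Year": [2003, 2009, 2015], "Day": [1, 15, 31]})

def mean_normalize_positive(df):
    """Mean-normalize each column, then shift by 1 so values lie in (0, 2)."""
    scaled = (df - df.mean()) / (df.max() - df.min())  # each value in (-1, 1)
    return 1 + scaled

shifted = mean_normalize_positive(toy)

# Same expression as the notebook cell, applied to the toy frame for comparison.
original = (1 + toy - toy.mean()) / (toy.max() - toy.min())

print(shifted.min())   # smallest value per column after the shifted scaling
print(original.min())  # smallest value per column with the notebook's formula
```

This matters for `chi2`, which raises an error on negative input. Scaling each column to [0, 1] with `(df - df.min()) / (df.max() - df.min())` would be another option under the same assumption of non-constant columns.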

#### Univariate Feature Selection

In [230]:
gus = sel.GenericUnivariateSelect(mode='percentile')
gus.fit(X,y)

GenericUnivariateSelect(mode='percentile', param=1e-05,
            score_func=<function f_classif at 0x000000000F1209E8>)

In [228]:
gus_ranks = featRank(X.columns,gus.scores_)
printRank(gus_ranks)

Address: 1
PdDistrict: 2
Year: 3
X: 4
DayOfWeek: 5
Month: 6
Day: 7
Y: 8
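
To make the univariate step easier to experiment with, here is a small self-contained sketch on synthetic data (the dataset from `make_classification` and the names `X_demo`, `y_demo` are assumptions, not the crime data). It assumes Python 3 and a recent scikit-learn. It also sets `param` explicitly: the default of `1e-05` shown in the repr above is a very small percentile, which does not affect `scores_` but would matter when calling `transform`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import GenericUnivariateSelect, f_classif

# Synthetic stand-in for (X, y): 6 features, 2 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)

# Keep the top 50% of features according to the ANOVA F-score.
selector = GenericUnivariateSelect(score_func=f_classif, mode="percentile", param=50)
X_reduced = selector.fit_transform(X_demo, y_demo)

# scores_ holds one F-score per feature; argsort gives a best-first ordering.
order = np.argsort(selector.scores_)[::-1]
kept = selector.get_support()  # boolean mask of the features that were kept

print("best-first feature indices:", order)
print("kept mask:", kept)
```

Ranking on `scores_`, as the notebook does, works regardless of `param`; only the mask from `get_support()` and the output of `transform` depend on it.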


In [208]:
chi2_importance = sel.chi2(X,y)

In [214]:
chi_ranks = featRank(X.columns,chi2_importance[0])
printRank(chi_ranks)

Address: 1
PdDistrict: 2
Year: 3
DayOfWeek: 4
Month: 5
Day: 6
X: 7
Y: 8
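
The `[0]` index used above is there because `chi2` returns a pair of arrays, the statistics and their p-values. A minimal sketch on made-up count data (the arrays below are hypothetical, not taken from the crime dataset) shows the shape of that return value:

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.RandomState(0)

# Hypothetical 3-class target and two non-negative integer features.
y_demo = rng.randint(0, 3, size=300)
informative = y_demo + rng.randint(0, 2, size=300)  # depends on the class
noise = rng.randint(0, 5, size=300)                 # independent of the class
X_demo = np.column_stack([informative, noise])

# chi2 returns (statistics, p-values), one entry per feature.
scores, pvalues = chi2(X_demo, y_demo)

print("chi2 statistics:", scores)
print("p-values:", pvalues)
```

A larger statistic (and smaller p-value) indicates a stronger dependence between the feature and the class label, which is why the notebook ranks on the first array in descending order.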


#### Recursive Feature Elimination

In [207]:
%%time
estimator = linear_model.SGDClassifier(loss='log', n_jobs=-1)
selector = sel.RFE(estimator, n_features_to_select=1)  # eliminate down to one feature so ranking_ orders all of them
selector = selector.fit(X,y)

Wall time: 2min 10s


In [215]:
rfe_rank = featRank(X.columns,selector.ranking_, False)
printRank(rfe_rank)

Address: 1
X: 2
PdDistrict: 3
Year: 4
DayOfWeek: 5
Y: 6
Day: 7
Month: 8
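
The reason `featRank` is called with `False` here is that `ranking_` is already a rank, with 1 marking the last surviving feature, so it has to be sorted in ascending order. A minimal sketch on synthetic data illustrates this. It swaps in `LogisticRegression` for the notebook's `SGDClassifier` purely to keep the example short and version-independent; the data and feature names are made up.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for (X, y) with hypothetical feature names f0..f4.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
names = ["f%d" % i for i in range(5)]

# Eliminate one feature per round until a single feature is left.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1)
rfe.fit(X_demo, y_demo)

# ranking_ is 1 for the surviving feature, 2 for the last one dropped, and so on.
ordered = [name for _, name in sorted(zip(rfe.ranking_, names))]
print(ordered)
```

With `n_features_to_select=1` and the default step of 1, every feature gets a distinct rank, which is what makes the output usable as a full ordering.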


#### Elimination through L1 prior 

In [209]:
lasso = linear_model.LassoLarsCV(cv = 20)
%time lasso.fit(X,y)

Wall time: 13.5 s


LassoLarsCV(copy_X=True, cv=20, eps=2.2204460492503131e-16,
      fit_intercept=True, max_iter=500, max_n_alphas=1000, n_jobs=1,
      normalize=True, precompute='auto', verbose=False)

In [224]:
print("Alpha: %s" % lasso.alpha_)
lasso_ranks = featRank(X.columns, lasso.coef_**2) # squared so features are ranked on magnitude only
printRank(lasso_ranks)

Alpha: 1.10094040294e-05
X: 1
Y: 2
PdDistrict: 3
Address: 4
Year: 5
DayOfWeek: 6
Month: 7
Day: 8
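
As a quick check on the squaring trick used above, the sketch below fits `LassoLarsCV` on synthetic regression data (an assumption for illustration, not the crime data) and compares ranking by squared coefficients with ranking by absolute value. Both are monotone in the coefficient magnitude, so they should give the same order.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV

# Synthetic stand-in: 6 features, only 2 of which drive the target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=1.0, random_state=0
)

lasso_demo = LassoLarsCV(cv=5).fit(X_demo, y_demo)
coef = lasso_demo.coef_

# Best-first orderings; a stable sort keeps tied (e.g. zeroed) coefficients in place.
by_abs = np.argsort(-np.abs(coef), kind="stable")
by_sq = np.argsort(-(coef ** 2), kind="stable")

print("coefficients:", coef)
print("ranking by |coef|:", by_abs)
print("ranking by coef**2:", by_sq)
```

Coefficients that the L1 penalty shrinks to exactly zero end up tied at the bottom of the ranking, so their relative order carries no information.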


#### Elimination through L1 and L2

In [219]:
elastic = linear_model.ElasticNetCV(cv=50, n_jobs=-1)
%time elastic.fit(X,y)

Wall time: 12.3 s


ElasticNetCV(alphas=None, copy_X=True, cv=50, eps=0.001, fit_intercept=True,
       l1_ratio=0.5, max_iter=1000, n_alphas=100, n_jobs=-1,
       normalize=False, positive=False, precompute='auto',
       random_state=None, selection='cyclic', tol=0.0001, verbose=0)

In [223]:
print("Alpha: %s" % elastic.alpha_)
elastic_ranks = featRank(X.columns, elastic.coef_**2) # squared so features are ranked on magnitude only
printRank(elastic_ranks)

Alpha: 0.000276666176685
X: 1
Y: 2
PdDistrict: 3
Address: 4
Year: 5
Month: 6
DayOfWeek: 7
Day: 8


### Results

The results of the elastic net and the L1 method are very similar, and the chi2, percentile, and RFE rankings also match up pretty well. Since chi2 and the generic univariate selection are faster and more flexible (they don't require an estimator), and the elastic net fit doesn't penalize collinearity, I'll be using these two for feature selection.
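
One way to put a number on how well two rankings match is Spearman's rank correlation. The sketch below is not part of the original analysis; it copies the five rankings printed in this notebook and uses a small hand-written helper (`spearman`, a hypothetical name) based on the standard formula for rankings without ties.

```python
# Rankings copied from the outputs above, best feature first.
rankings = {
    "univariate": ["Address", "PdDistrict", "Year", "X", "DayOfWeek", "Month", "Day", "Y"],
    "chi2": ["Address", "PdDistrict", "Year", "DayOfWeek", "Month", "Day", "X", "Y"],
    "rfe": ["Address", "X", "PdDistrict", "Year", "DayOfWeek", "Y", "Day", "Month"],
    "lasso": ["X", "Y", "PdDistrict", "Address", "Year", "DayOfWeek", "Month", "Day"],
    "elastic": ["X", "Y", "PdDistrict", "Address", "Year", "Month", "DayOfWeek", "Day"],
}

def spearman(order_a, order_b):
    """Spearman rank correlation between two orderings of the same items (no ties)."""
    n = len(order_a)
    pos_b = {feat: i for i, feat in enumerate(order_b)}
    d2 = sum((i - pos_b[feat]) ** 2 for i, feat in enumerate(order_a))
    return 1 - 6.0 * d2 / (n * (n ** 2 - 1))

methods = list(rankings)
for i, a in enumerate(methods):
    for b in methods[i + 1:]:
        print("%-10s vs %-10s %.3f" % (a, b, spearman(rankings[a], rankings[b])))
```

On these rankings the lasso and elastic net orderings correlate at about 0.98, and the univariate and chi2 orderings at about 0.86.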