# Step 2 - Identify Salient Features Using $\ell_1$-penalty

**NOTE: EACH OF THESE SHOULD BE WRITTEN SOLELY WITH REGARD TO STEP 2 - Identify Features**

### Domain and Data

Here we continue our exploration of the Madelon synthetic dataset. 

### Problem Statement

The primary task is to create a model that performs binary classification. In this phase, we carry out exploratory data analysis by constructing a model whose purpose is to identify salient features.

### Solution Statement

During this phase, we are hoping to identify a set of features that are salient to our prediction pipeline. 

### Metric

**TODO**: Write a statement about the metric you will be using. This is with regard to identifying features. This is the metric that will show you whether or not a feature is important. Provide a brief justification for choosing this metric.

### Benchmark

**TODO**: This may or may not directly connect to your metric. It would be good here to provide a statement about how many features you might be looking for.

## Implementation

Implement the following code pipeline using the functions you write in `lib/project_5.py`.

<img src="assets/identify_features.png" width="600px">

In [1]:
from lib.project_5_solution import load_data_from_database, \
                                   make_data_dict, \
                                   general_model, \
                                   general_transformer
from numpy import arange, where
from pandas import DataFrame, set_option
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

### Zero-like

We treat `1E-5` as sufficiently zero-like: a coefficient whose absolute value is less than `1E-5` is considered effectively zero.
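As a minimal sketch of this thresholding rule, the count of effectively-zero coefficients can be computed as shown here. The helper name `count_zero_like` and the sample values are illustrative and are not part of `lib/project_5.py`.

```python
def count_zero_like(coefs, tol=1e-5):
    """Return how many coefficients have an absolute value below tol."""
    return sum(1 for coef in coefs if abs(coef) < tol)


# Made-up coefficients: three of the five fall below the 1E-5 threshold.
sample_coefs = [0.0, 1e-6, -2e-6, 0.5, -0.3]
zero_like_count = count_zero_like(sample_coefs)
```

The comparison is strict, so a coefficient of exactly `1E-5` in magnitude is not counted as zero-like.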

### Pipeline Construction

In [2]:
madelon_df = load_data_from_database()

In [3]:
this_data_dictionary = make_data_dict(madelon_df)

In [4]:
this_data_dictionary = general_transformer(StandardScaler(),
                                           this_data_dictionary)

In [5]:
this_data_dictionary = general_model(LogisticRegression(C=1,penalty='l1'),
                                     this_data_dictionary)

### Analysis of Coefficients

In [6]:
this_data_dictionary['processes']

[StandardScaler(copy=True, with_mean=True, with_std=True),
 LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False)]

In [7]:
this_log_reg = this_data_dictionary['processes'][1]

In [17]:
this_log_reg.coef_[0].shape

(500,)

In [19]:
len([coef for coef in this_log_reg.coef_[0] if abs(coef) < 1E-5])

38

According to this, a logistic regression with $\ell_1$-regularization and `C=1` discarded 38 features in the run shown, i.e. 38 coefficients were zero-like.

In [20]:
this_data_dictionary = make_data_dict(madelon_df)
this_data_dictionary = general_transformer(StandardScaler(), this_data_dictionary)
this_data_dictionary = general_model(LogisticRegression(C=1E-1,
                                                        penalty='l1'), 
                                     this_data_dictionary)
this_log_reg = this_data_dictionary['processes'][1]
len([coef for coef in this_log_reg.coef_[0] if abs(coef) < 1E-5])

232

According to this, a logistic regression with $\ell_1$-regularization and `C=1E-1` discarded 232 features in the run shown.

In [21]:
this_data_dictionary = make_data_dict(madelon_df)
this_data_dictionary = general_transformer(StandardScaler(), this_data_dictionary)
this_data_dictionary = general_model(LogisticRegression(C=1E-2,
                                                        penalty='l1'), 
                                     this_data_dictionary)
this_log_reg = this_data_dictionary['processes'][1]
len([coef for coef in this_log_reg.coef_[0] if abs(coef) < 1E-5])

498

According to this, a logistic regression with $\ell_1$-regularization and `C=1E-2` discarded 498 features in the run shown.
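For reference, the scikit-learn documentation states the $\ell_1$-penalized logistic regression objective in the following form, with labels $y_i \in \{-1, 1\}$, weight vector $w$, and intercept $c$:

$$\min_{w,\,c} \; \|w\|_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i \left(x_i^T w + c\right)\right) + 1\right)$$

Because `C` multiplies the data-fit term, a smaller `C` gives the $\ell_1$ penalty relatively more weight. This is consistent with the trend in the three runs above, where more coefficients are driven to zero as `C` decreases.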

In [22]:
from math import pow

In [24]:
pow(2,3); pow(3,2)

9.0

In [25]:
arange(-2,0,.1)

array([-2. , -1.9, -1.8, -1.7, -1.6, -1.5, -1.4, -1.3, -1.2, -1.1, -1. ,
       -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1])
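The values above are exponents rather than values of `C` itself; the loops that follow pass each one through `pow(10, C)`. A small sketch of that conversion is shown here, building the exponents from integer steps so that no floating-point drift accumulates. The names `exponents` and `C_values` are illustrative.

```python
# Exponents -2.0, -1.9, ..., -0.1, built from integers to avoid drift.
exponents = [step / 10 for step in range(-20, 0)]

# The regularization strengths that would be handed to LogisticRegression.
C_values = [10 ** exponent for exponent in exponents]
```

The resulting `C_values` run from 0.01 up to just under 1, evenly spaced on a log scale.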

In [26]:
for C in arange(-2,0,.1):
    this_data_dictionary = make_data_dict(madelon_df)
    this_data_dictionary = general_transformer(StandardScaler(), this_data_dictionary)
    this_data_dictionary = general_model(LogisticRegression(C=pow(10,C), 
                                                            penalty='l1'),
                                         this_data_dictionary)
    this_log_reg = this_data_dictionary['processes'][1]
    zero_coefs = [coef for coef in this_log_reg.coef_[0] if abs(coef) < 1E-5]
    number_of_coefs = len(zero_coefs)
    print("C: {} no. discarded coefs: {}".format(C, number_of_coefs))

C: -2.0 no. discarded coefs: 499
C: -1.9 no. discarded coefs: 499
C: -1.8 no. discarded coefs: 498
C: -1.7 no. discarded coefs: 494
C: -1.6 no. discarded coefs: 483
C: -1.5 no. discarded coefs: 455
C: -1.4 no. discarded coefs: 418
C: -1.3 no. discarded coefs: 367
C: -1.2 no. discarded coefs: 327
C: -1.1 no. discarded coefs: 280
C: -1.0 no. discarded coefs: 248
C: -0.9 no. discarded coefs: 205
C: -0.8 no. discarded coefs: 153
C: -0.7 no. discarded coefs: 135
C: -0.6 no. discarded coefs: 103
C: -0.5 no. discarded coefs: 89
C: -0.4 no. discarded coefs: 73
C: -0.3 no. discarded coefs: 59
C: -0.2 no. discarded coefs: 52
C: -0.1 no. discarded coefs: 57
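The sweep above rebuilds the whole pipeline through the project helpers on each iteration. The following is a self-contained sketch of the same idea that does not depend on `lib/project_5.py`. It uses synthetic data from `make_classification` in place of Madelon; the function name `count_zero_like_by_exponent`, the synthetic dataset, and the fixed `random_state` are all assumptions made for illustration. `solver='liblinear'` is passed explicitly because recent scikit-learn releases default to a solver that does not support the $\ell_1$ penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def count_zero_like_by_exponent(X, y, exponents, tol=1e-5, seed=0):
    """Map each exponent e to the number of |coef| < tol when C = 10**e."""
    X_scaled = StandardScaler().fit_transform(X)
    counts = {}
    for exponent in exponents:
        model = LogisticRegression(C=10.0 ** exponent, penalty='l1',
                                   solver='liblinear', random_state=seed)
        model.fit(X_scaled, y)
        counts[exponent] = int(np.sum(np.abs(model.coef_[0]) < tol))
    return counts


# Synthetic stand-in for Madelon: 200 samples, 30 features, 5 informative.
X_demo, y_demo = make_classification(n_samples=200, n_features=30,
                                     n_informative=5, random_state=0)
demo_counts = count_zero_like_by_exponent(X_demo, y_demo, [-4, -1, 1])
```

Fixing `random_state` makes repeated sweeps comparable with one another, which helps when deciding whether a change in the count comes from `C` or from run-to-run variation.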


In [27]:
for C in arange(-1.7,-1.3,.01):
    this_data_dictionary = make_data_dict(madelon_df)
    this_data_dictionary = general_transformer(StandardScaler(), this_data_dictionary)
    this_data_dictionary = general_model(LogisticRegression(C=pow(10,C), penalty='l1'), this_data_dictionary)
    this_log_reg = this_data_dictionary['processes'][1]
    number_of_coefs = len([coef for coef in this_log_reg.coef_[0] if abs(coef) < 1E-5])
    print("C: {} no. coefs: {}".format(C, number_of_coefs))

C: -1.7 no. coefs: 497
C: -1.69 no. coefs: 497
C: -1.68 no. coefs: 493
C: -1.67 no. coefs: 494
C: -1.66 no. coefs: 490
C: -1.65 no. coefs: 489
C: -1.64 no. coefs: 489
C: -1.63 no. coefs: 489
C: -1.62 no. coefs: 487
C: -1.61 no. coefs: 484
C: -1.6 no. coefs: 485
C: -1.59 no. coefs: 479
C: -1.58 no. coefs: 477
C: -1.57 no. coefs: 473
C: -1.56 no. coefs: 470
C: -1.55 no. coefs: 469
C: -1.54 no. coefs: 461
C: -1.53 no. coefs: 463
C: -1.52 no. coefs: 464
C: -1.51 no. coefs: 451
C: -1.5 no. coefs: 447
C: -1.49 no. coefs: 452
C: -1.48 no. coefs: 443
C: -1.47 no. coefs: 449
C: -1.46 no. coefs: 437
C: -1.45 no. coefs: 434
C: -1.44 no. coefs: 423
C: -1.43 no. coefs: 439
C: -1.42 no. coefs: 425
C: -1.41 no. coefs: 412
C: -1.4 no. coefs: 412
C: -1.39 no. coefs: 409
C: -1.38 no. coefs: 408
C: -1.37 no. coefs: 400
C: -1.36 no. coefs: 392
C: -1.35 no. coefs: 406
C: -1.34 no. coefs: 376
C: -1.33 no. coefs: 381
C: -1.32 no. coefs: 378
C: -1.31 no. coefs: 372


### Display Non-Zero Coefficient Indices 

In [30]:
from numpy import array
this_arr = array((1,3,2,4))
where(this_arr < 3)

(array([0, 2]),)

In [31]:
C=-1.56
this_data_dictionary = make_data_dict(madelon_df)
this_data_dictionary = general_transformer(StandardScaler(), this_data_dictionary)
this_data_dictionary = general_model(LogisticRegression(C=pow(10,C), penalty='l1'), this_data_dictionary)
this_log_reg = this_data_dictionary['processes'][1]
where(this_log_reg.coef_[0] > 1E-5)


(array([  4,  18,  48,  74, 116, 177, 196, 204, 211, 241, 282, 329, 333,
        348, 424, 431, 475]),)
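Note that the comparison in the cell above has no `abs`, so it keeps only coefficients greater than `1E-5` and leaves out any negative ones. A small sketch with made-up coefficients (the array `demo_coefs` is illustrative) shows the difference between the signed and the absolute-value comparison:

```python
import numpy as np

demo_coefs = np.array([0.5, -0.3, 0.0, 2e-6, -4e-6])

# Signed comparison: only positive coefficients above the threshold.
positive_only = np.where(demo_coefs > 1e-5)[0]

# Absolute-value comparison: every coefficient that is not zero-like.
non_zero = np.where(np.abs(demo_coefs) > 1e-5)[0]
```

Here `positive_only` misses index 1, whose coefficient is `-0.3`, while `non_zero` includes it.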

In [32]:
import pandas as pd

In [43]:
set_option('display.max_columns', 70)
results = []
for C in arange(-1.6,-1.5,.01):
    this_data_dictionary = make_data_dict(madelon_df)
    this_data_dictionary = general_transformer(StandardScaler(),
                                               this_data_dictionary)
    this_data_dictionary = general_model(LogisticRegression(C=pow(10,C),
                                                            penalty='l1'),
                                         this_data_dictionary)
    this_log_reg = this_data_dictionary['processes'][1]    
    test = {
        'C' : C,
        'non-zero' : list(where(abs(this_log_reg.coef_[0]) > 1E-5))[0]
    }
    results.append(test)
results = DataFrame(results)

In [44]:
results

Unnamed: 0,C,non-zero
0,-1.6,"[18, 48, 55, 56, 149, 159, 204, 213, 241, 296,..."
1,-1.59,"[10, 48, 55, 65, 199, 204, 248, 264, 323, 343,..."
2,-1.58,"[18, 48, 55, 73, 136, 137, 204, 205, 224, 282,..."
3,-1.57,"[26, 48, 60, 73, 136, 137, 149, 199, 204, 217,..."
4,-1.56,"[5, 18, 48, 53, 136, 137, 140, 152, 161, 205, ..."
5,-1.55,"[4, 48, 55, 73, 109, 119, 136, 152, 161, 196, ..."
6,-1.54,"[5, 43, 44, 48, 56, 119, 136, 146, 199, 204, 2..."
7,-1.53,"[4, 43, 48, 53, 55, 109, 130, 161, 177, 193, 1..."
8,-1.52,"[18, 48, 53, 61, 91, 107, 119, 139, 140, 152, ..."
9,-1.51,"[5, 10, 24, 43, 46, 48, 53, 55, 105, 116, 121,..."


In [45]:
set_of_non_zero_indices = {None}
set_of_non_zero_indices

{None}

In [46]:
set_of_non_zero_indices = {None}

for ls in results['non-zero']:
    set_of_non_zero_indices = set_of_non_zero_indices.union(ls)

list_of_non_zero_indices = list(set_of_non_zero_indices)
list_of_non_zero_indices.sort()
    
def index_in_identified_coefs(ls):
    return int(this_index in ls)

for this_index in list_of_non_zero_indices:
    results[this_index] = results['non-zero'].apply(index_in_identified_coefs)
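Seeding the set with `{None}` is what produces the `None` column in the table below, and under Python 3 sorting a list that mixes `None` with integers raises a `TypeError`. A sketch of the same union that starts from an empty set instead is shown here; `union_of_indices` is a hypothetical helper and the sample lists are made up.

```python
def union_of_indices(index_lists):
    """Return the sorted union of several lists of feature indices."""
    all_indices = set()  # empty set, so no placeholder element is carried along
    for indices in index_lists:
        all_indices.update(indices)
    return sorted(all_indices)


demo_union = union_of_indices([[18, 48, 55], [10, 48, 55], [48, 204]])
```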



In [50]:
set_option('display.max_columns', 80)
results.drop('non-zero', axis=1, inplace=True)
results

Unnamed: 0,C,None,1,4,5,10,18,24,26,41,43,44,46,48,53,55,56,60,61,65,68,73,91,105,107,109,116,117,119,121,123,130,136,137,139,140,146,149,152,153,...,336,341,343,347,348,377,382,384,395,403,404,409,410,411,412,413,420,422,424,425,430,431,441,444,445,447,453,454,456,461,466,471,473,475,477,478,481,494,496,497
0,-1.6,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
1,-1.59,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,1,0
2,-1.58,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,1,0,1,0,1,1,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,1
3,-1.57,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,...,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
4,-1.56,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,...,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0
5,-1.55,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0
6,-1.54,0,0,0,1,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,...,0,0,0,0,1,1,0,1,0,0,0,0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,1,0,1,0
7,-1.53,0,0,1,0,0,0,0,0,0,1,0,0,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,1,0,1,0,0,0,1,0,1,0,1,0,0,1,0,1,1,0,0,1,0,0
8,-1.52,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,...,0,0,0,1,0,1,0,0,0,0,0,1,0,0,1,1,0,0,1,1,1,1,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,1
9,-1.51,0,0,0,1,1,0,1,0,0,1,0,1,1,1,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,1,0,...,0,0,0,0,0,1,1,1,0,0,0,0,1,1,0,1,1,0,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,1,0


In [51]:
means = results.drop('C', axis=1).mean()

In [52]:
DataFrame(means).sort_values(0, ascending=False)[:30]



Unnamed: 0,0
475,1.0
48,1.0
424,0.909091
430,0.909091
204,0.909091
323,0.818182
496,0.727273
205,0.636364
431,0.636364
377,0.636364
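The mean of each indicator column is the fraction of runs in which that feature kept a non-zero coefficient. The sketch below computes that selection frequency directly from per-run index lists and keeps the features above a cutoff. The function name, the sample runs, and the `0.9` cutoff are illustrative assumptions, not values taken from this analysis.

```python
from collections import Counter


def selection_frequency(index_lists):
    """Fraction of runs in which each feature index appears."""
    n_runs = len(index_lists)
    counts = Counter(idx for indices in index_lists for idx in set(indices))
    return {idx: count / n_runs for idx, count in counts.items()}


demo_runs = [[48, 475, 424], [48, 475], [48, 475, 424, 204], [48, 475, 424]]
demo_freq = selection_frequency(demo_runs)

# Keep only features selected in at least 90% of the runs.
stable_features = sorted(idx for idx, freq in demo_freq.items() if freq >= 0.9)
```

Features that survive across neighbouring values of `C`, such as 48 and 475 in the table above, are the more plausible candidates for salient features.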
