# Project 3: Feature Selection and Classification

## Goal: use machine learning techniques to identify the important features in a large dataset so that it can be reduced, then build a predictive model for the entire dataset based on this subset of features.

## Step 1. Pulled in three subsets of the data:
    
1. 0.1% of dataset (220 rows). Query time = 2.7 seconds. Memory usage = 8.4 MB.    
  
2. 0.5% of dataset (500 rows). Query time = 6.2 seconds. Memory usage = 19.1 MB.    
  
3. Used ANOVA (f_classif) to find the features that are the least important (have the highest p-values) across all subsets. Pulled subsets of 1,000 rows at a time, dropped the 4,500 features/columns with the highest p-values, and merged the reduced subsets into a dataframe of 20,000 rows and 500 columns.
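The chunked ANOVA filter in item 3 can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code that was actually run: the sizes (`n_keep`, `chunk_size`) are scaled-down stand-ins, and averaging p-values across chunks is an assumption, since the aggregation rule is not stated above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Synthetic stand-in for the real table (hypothetical, much smaller sizes)
X, y = make_classification(n_samples=600, n_features=50, n_informative=5,
                           n_redundant=0, random_state=42)

n_keep = 10       # stand-in for the 500 columns that were kept
chunk_size = 200  # stand-in for pulling 1,000 rows at a time

# Score every feature on each chunk and collect the ANOVA p-values
pvals = []
for start in range(0, len(y), chunk_size):
    rows = slice(start, start + chunk_size)
    _, p = f_classif(X[rows], y[rows])
    pvals.append(p)

# Assumption: aggregate by averaging the p-values over chunks,
# then keep the features with the lowest average p-value
mean_p = np.mean(pvals, axis=0)
keep = np.sort(np.argsort(mean_p)[:n_keep])
X_reduced = X[:, keep]
print(X_reduced.shape)
```

Because only the `keep` indices are needed downstream, later queries could pull just those columns instead of the full table.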



## Step 2. Built and compared modeling pipelines:

1. Built a pipeline to perform a naive logistic regression as a baseline model (using GridSearchCV for cross-validation), with a high C-value for minimal regularization.

2. Built three feature selection pipelines.

3. Chose the best of those three pipelines, and applied it to a GridSearched Logistic Regression to tune the model.


### Pipeline 0: GridSearchCV on a naive (empty param grid) Logistic Regression as a benchmark

In [13]:
# Empty param grid: GridSearchCV is used here only for its 5-fold cross-validation
logreg = GridSearchCV(LogisticRegression(C=1E10, random_state=42), param_grid={}, cv=5)

### Pipeline 1: GridSearchCV on StandardScaler, SelectFromModel(Lasso), LogisticRegression(C=1E10)

In [19]:
pipe_1_for_gs = Pipeline([
    ('scaler', StandardScaler()), 
    ('sfm', SelectFromModel(Lasso())), 
    ('lr', LogisticRegression(C=1E10))
])

In [20]:
params = {'sfm__estimator': [Lasso(alpha=.01), RandomForestClassifier()]}


In [21]:
gs_pipe_1 = GridSearchCV(pipe_1_for_gs, 
                         params, 
                         cv=StratifiedShuffleSplit())

### Pipeline 2: GridSearchCV on StandardScaler, SelectPercentile, LogisticRegression(C=1E10)

In [26]:
pipe_2_for_gs = Pipeline([
    ('scaler', StandardScaler()),
    ('anova', SelectPercentile()), 
    ('lr', LogisticRegression(C=1E10))
])

In [27]:
params = {'anova__percentile': [1,3,10,15,30]
}
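The cell that builds and fits `gs_pipe_2` is not shown here. A plausible sketch, assuming it follows the same pattern as the other pipelines, is below. The synthetic data, `max_iter`, `n_splits`, and `random_state` values are illustrative assumptions, not the original settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data (hypothetical sizes)
X, y = make_classification(n_samples=300, n_features=100, n_informative=5,
                           random_state=42)

pipe_2_for_gs = Pipeline([
    ('scaler', StandardScaler()),
    ('anova', SelectPercentile()),  # defaults to the f_classif score function
    ('lr', LogisticRegression(C=1E10, max_iter=1000)),
])
params = {'anova__percentile': [1, 3, 10, 15, 30]}

# Assumed to mirror the cross-validation setup of the other grid searches
gs_pipe_2 = GridSearchCV(pipe_2_for_gs, params,
                         cv=StratifiedShuffleSplit(n_splits=5, random_state=42))
gs_pipe_2.fit(X, y)
print(gs_pipe_2.best_params_)
```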


    gs_pipe_2.best_estimator_

    Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('anova', SelectPercentile(percentile=3,
             score_func=<function f_classif at 0x7fc82ea39840>)), ('lr', LogisticRegression(C=10000000000.0, class_weight=None, dual=False,
              fit_intercept=True, intercept_scaling=1, max_iter=100,
              multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
              solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])

### Pipeline 3: GridSearchCV on StandardScaler, SelectKBest, RandomForestClassifier, LogisticRegression(C=1E10)

In [35]:
pipe_3_for_gs = Pipeline([
    ('scaler', StandardScaler()),
    ('skb', SelectKBest()),
    ('rf', RandomForestClassifier()),
    ('lr', LogisticRegression(C=1E10))
])

In [36]:
params = {'rf__n_estimators': [3,5,10,12,15,30],
    'skb__k': [2,6,10,14,20] # k should correspond to what we know about the data and how many important features!
}

In [37]:
gs_pipe_3 = GridSearchCV(pipe_3_for_gs,
                         params,
                         cv=StratifiedShuffleSplit())

In [42]:
gs_pipe_3.best_estimator_

Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('skb', SelectKBest(k=14, score_func=<function f_classif at 0x7f5439cad6a8>)), ('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nod...ty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])
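A note for anyone rerunning Pipeline 3: in recent scikit-learn releases every intermediate pipeline step must implement `transform`, which `RandomForestClassifier` does not, so the pipeline above would need adapting. One possible adaptation, offered as a sketch rather than an equivalent of the original run, wraps the forest in `SelectFromModel` so that it acts as a feature selector. The data and the `n_estimators`, `max_iter`, and `random_state` values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data (hypothetical sizes)
X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=42)

pipe_3_sketch = Pipeline([
    ('scaler', StandardScaler()),
    ('skb', SelectKBest(k=14)),
    # The forest ranks the 14 surviving features; SelectFromModel keeps
    # those whose importance is at least the mean importance (its default)
    ('rf', SelectFromModel(RandomForestClassifier(n_estimators=30,
                                                  random_state=42))),
    ('lr', LogisticRegression(C=1E10, max_iter=1000)),
])
pipe_3_sketch.fit(X, y)

n_kept = int(pipe_3_sketch.named_steps['rf'].get_support().sum())
print(n_kept)
```

With this wrapping, the grid key for the forest size becomes `rf__estimator__n_estimators` instead of `rf__n_estimators`.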

### Pipeline 4: use parameters from the best Pipeline to tune the model
### GridSearchCV on StandardScaler, SelectPercentile, LogisticRegression, to find best values for LogisticRegression

In [46]:
best_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('anova', SelectPercentile(percentile=15)), 
    ('lr', LogisticRegression())
])

In [47]:
params = {
    'lr__penalty': ['l1', 'l2'],
    'lr__C': [0.01, .1, 1, 10, 25, 50, 75, 100, 1E3, 1E4]
}

In [48]:
gspipe = GridSearchCV(best_pipe,
                      params,
                      cv=StratifiedShuffleSplit())

best params: lr__C=100, lr__penalty='l1'  
second-best params: lr__C=0.01, lr__penalty='l1'
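The best and second-best settings can be read off `cv_results_` after fitting. The sketch below shows one way to do that on synthetic data. To keep it short and version-independent it searches a reduced grid over `lr__C` only, which is a simplification of the grid above, and the data and cross-validation settings are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data (hypothetical sizes)
X, y = make_classification(n_samples=300, n_features=100, n_informative=5,
                           random_state=42)

best_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('anova', SelectPercentile(percentile=15)),
    ('lr', LogisticRegression(max_iter=1000)),
])
params = {'lr__C': [0.01, 0.1, 1, 10, 100]}  # reduced grid for illustration

gspipe = GridSearchCV(best_pipe, params,
                      cv=StratifiedShuffleSplit(n_splits=5, random_state=42))
gspipe.fit(X, y)

# One row per parameter combination, best-ranked first
ranking = (pd.DataFrame(gspipe.cv_results_)
           [['param_lr__C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(ranking.head(2))
```

Comparing `mean_test_score` against `std_test_score` for the top rows gives a rough sense of whether the best setting is meaningfully better than the runner-up or within noise.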

In [1]:
import pandas as pd

In [4]:
results = pd.read_csv('Project3Results.csv')

In [5]:
results

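| Unnamed: 0 | data subset | pipeline | train score | test score | fit time | pipeline_memory usage | query time | data_memory usage |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1.0 | 0.59 | 0.26 | 500 bytes | 2.7 | 8.4 MB |
| 1 | 1 | 1 | 1.0 | 0.58 | 0.08 | 1.1 KB | | |
| 2 | 1 | 2 | 1.0 | 0.65 | 0.04 | 2.7 KB | | |
| 3 | 1 | 3 | 0.77 | 0.6 | 0.06 | 2.7 KB | | |
| 4 | 1 | 4 | 1.0 | 0.62 | 0.04 | 11.7 KB | | |
| 5 | 2 | 0 | 1.0 | 0.52 | 1.12 | 500 bytes | 6.2 | 19.1 MB |
| 6 | 2 | 1 | 1.0 | 0.53 | 0.19 | 1.1 KB | | |
| 7 | 2 | 2 | 1.0 | 0.53 | 0.14 | 2.7 KB | | |
| 8 | 2 | 3 | 0.56 | 0.52 | 0.08 | 2.7 KB | | |
| 9 | 2 | 4 | 1.0 | 0.56 | 0.1 | 11.7 KB | | |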
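To compare the two data subsets side by side, the scores in the table above can be pivoted by pipeline. The sketch below re-enters the train and test scores by hand so that it runs without `Project3Results.csv`. The `gap` and `change` column names are introduced here for illustration.

```python
import pandas as pd

# Train/test scores copied from the results table above
results = pd.DataFrame({
    'data subset': [1] * 5 + [2] * 5,
    'pipeline': [0, 1, 2, 3, 4] * 2,
    'train score': [1.0, 1.0, 1.0, 0.77, 1.0, 1.0, 1.0, 1.0, 0.56, 1.0],
    'test score': [0.59, 0.58, 0.65, 0.60, 0.62, 0.52, 0.53, 0.53, 0.52, 0.56],
})

# Train-minus-test gap: a large gap suggests overfitting
results['gap'] = results['train score'] - results['test score']

# Test score per pipeline, one column per data subset, plus the change
test_by_subset = results.pivot(index='pipeline', columns='data subset',
                               values='test score')
test_by_subset['change'] = test_by_subset[2] - test_by_subset[1]
print(test_by_subset.round(2))
print(results[['data subset', 'pipeline', 'gap']].round(2))
```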


### Interesting: when I pulled in more data, the training score dropped significantly for Pipeline 3 (0.77 to 0.56) while the other pipelines stayed at 1.0, and the test scores dropped only slightly (by 0.05 to 0.12)

### Next steps: check the final shape of the data after the transformations; see which features (columns) get kept and whether they're the same each time. If they are, we would only have to pull those features to use for predictions. Document the system architecture and engineering details (AWS, sklearn, etc.)

### Feedback: 1) How can I explore the feature selection that occurred, and what does it mean? 2) Did tuning the logistic regression in the final step actually do anything? (Scores didn't improve.)
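One way to explore which features the selection step keeps, and whether they are the same from one data subset to the next, is to read the selector's boolean mask with `get_support()`. The sketch below uses synthetic data. The `feature_names` array and the `kept_features` helper are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data (hypothetical sizes and names)
X, y = make_classification(n_samples=400, n_features=100, n_informative=5,
                           n_redundant=0, random_state=42)
feature_names = np.array([f'feat_{i}' for i in range(X.shape[1])])

def kept_features(X_sub, y_sub, percentile=15):
    """Fit the pipeline on one subset and return the names of kept features."""
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('anova', SelectPercentile(percentile=percentile)),
        ('lr', LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_sub, y_sub)
    mask = pipe.named_steps['anova'].get_support()  # True for kept columns
    return set(feature_names[mask])

# Two disjoint row subsets stand in for two separate data pulls
kept_a = kept_features(X[:200], y[:200])
kept_b = kept_features(X[200:], y[200:])
stable = sorted(kept_a & kept_b)

print(len(kept_a), len(kept_b), len(stable))
print(stable)
```

Features that appear in `stable` across many pulls are the candidates for a reduced query that fetches only those columns.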