# Modeling the Madelon Data Set

### Domain

The goal of this project was to create a data analysis pipeline with a consistent set of elemental steps: reading data from a remote SQL database, initial benchmarking, feature selection, model selection, and validation.

### Data

In this project I worked with the Madelon data set, a synthetic data set with many variables and, according to its source website, a high degree of non-linearity.  It contains 500 features and a binary classification label (-1, 1), which I rescaled to (0, 1).  There are 2000 entries in total, divided evenly between the two labels.
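
The label rescaling can be illustrated with a short sketch. The array `y_raw` below is a made-up example rather than actual Madelon data, and the use of NumPy is an assumption about the tooling.

```python
import numpy as np

# Hypothetical label vector in the original Madelon encoding (-1, 1).
y_raw = np.array([-1, 1, 1, -1, 1])

# Map -1 -> 0 and 1 -> 1 so the labels match the (0, 1) convention.
y = (y_raw + 1) // 2

print(y)  # [0 1 1 0 1]
```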

### Problem Statement

My goal in this project was to produce a model which accurately predicts the labels in the Madelon data set.  As the data is highly non-linear, this required significant feature selection and model selection.  I implemented three separate analysis pipelines, corresponding to an initial benchmark using logistic regression, feature selection using a logistic regression with lasso regularization, and the final model selection.

### Solution Statement

To build the pipeline, I wrote four key wrapper functions.

    load_data_from_database | Accesses the database and saves the data from the 'dsi' table in 'data'.
                            |
    make_data_dict          | Generates features and labels, then splits into training and validation sets.  The
                            | default split is 70% training/30% validation, and a fixed random seed was used
                            | throughout to ensure the same split in the notebook for each step.
                            |
    general_transformer     | Performs an arbitrary transformation on the training set, then applies that
                            | transformation to the validation set.
                            |
    general_model           | Fits and scores an arbitrary model, with any model parameters set on the model
                            | before it is passed to the function.
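
As a rough illustration of how the last three wrappers could fit together, here is a minimal sketch assuming scikit-learn. The function bodies, the dictionary keys, the seed value, and the synthetic stand-in data from `make_classification` are all assumptions made for this example and are not the notebook code. `load_data_from_database` is omitted because the connection details are not given here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def make_data_dict(X, y, test_size=0.3, random_state=42):
    """Split features and labels into training and validation sets."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return {"X_train": X_train, "X_val": X_val,
            "y_train": y_train, "y_val": y_val}


def general_transformer(transformer, data_dict):
    """Fit a transformer on the training set, then apply it to both sets."""
    transformer.fit(data_dict["X_train"], data_dict["y_train"])
    out = dict(data_dict)
    out["X_train"] = transformer.transform(data_dict["X_train"])
    out["X_val"] = transformer.transform(data_dict["X_val"])
    out["transformer"] = transformer
    return out


def general_model(model, data_dict):
    """Fit a pre-configured model and record its accuracy on both sets."""
    model.fit(data_dict["X_train"], data_dict["y_train"])
    out = dict(data_dict)
    out["model"] = model
    out["train_score"] = model.score(data_dict["X_train"], data_dict["y_train"])
    out["val_score"] = model.score(data_dict["X_val"], data_dict["y_val"])
    return out


# Synthetic stand-in data (assumption); the real pipeline reads Madelon
# from the database instead.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

data = make_data_dict(X, y)
data = general_transformer(StandardScaler(), data)
data = general_model(LogisticRegression(max_iter=1000), data)
print(data["train_score"], data["val_score"])
```

Passing the transformer and model in as arguments keeps the wrappers unchanged from step to step; only the objects handed to them differ.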

The first two functions have the same inputs and outputs in all three steps, but the last two functions have different inputs and outputs in each step:

    Step 1: Benchmarking        | general_transformer is used for normalization, general_model takes in an
                                | unregularized logistic regression
                                |
    Step 2: Feature Selection   | general_transformer is used for normalization, general_model takes in a series of
                                | logistic regressions with different Lasso regularization weights
                                |
    Step 3: Model Selection     | general_transformer is used for normalization and selecting the 'k' best 
                                | features, general_model takes in a set of grid search objects corresponding to
                                | l2-regularized logistic regressions, k-nearest neighbors classifiers, and SVC
                                | classifiers
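
The idea behind step 2 can be sketched as follows, assuming scikit-learn. The synthetic data, the grid of `C` values, and the variable names are placeholders chosen for illustration. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean a stronger Lasso penalty and typically fewer non-zero coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), not the Madelon data set.
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Normalize using statistics from the training set only.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)

results = []
for C in [0.01, 0.1, 1.0]:  # hypothetical grid of regularization weights
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X_train, y_train)
    # Features with a non-zero coefficient survive the Lasso penalty.
    n_kept = int(np.count_nonzero(model.coef_))
    results.append((C, n_kept, model.score(X_val, y_val)))

for C, n_kept, acc in results:
    print(f"C={C}: {n_kept} features kept, validation accuracy {acc:.3f}")
```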

Ultimately, the benchmark in step 1 serves as a baseline for the feature selection in step 2.  The reduced number of features that gives the best results in step 2 is then used as 'k' for the SelectKBest reduction in step 3.
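
A minimal sketch of what step 3 might look like, assuming scikit-learn, is shown here. The value `k=5`, the scoring function `f_classif`, the parameter grids, and the synthetic data are all illustrative assumptions and not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not the Madelon data set.
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Normalize, then keep the k best features; both are fit on training data only.
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)
X_train_k, X_val_k = selector.transform(X_train), selector.transform(X_val)

# One grid search per model family (hypothetical parameter grids).
searches = {
    "logreg": GridSearchCV(LogisticRegression(max_iter=1000),
                           {"C": [0.1, 1.0, 10.0]}, cv=3),
    "knn": GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": [3, 5, 7]}, cv=3),
    "svc": GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=3),
}

scores = {}
for name, search in searches.items():
    search.fit(X_train_k, y_train)
    # score() uses the refit best estimator; for classifiers this is accuracy.
    scores[name] = search.score(X_val_k, y_val)
    print(name, search.best_params_, round(scores[name], 3))
```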

### Metric

I used accuracy as the primary metric for this project.  Since the data is split equally between the two labels, the baseline accuracy is 50% and accuracy is not skewed by class imbalance, so no other metric is inherently better suited.  Moreover, since the data set is synthetic, there is no field-specific reason to prefer a different metric.  In this scenario, accuracy is the easiest metric to interpret, so I use it throughout.
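
The 50% baseline follows directly from the class balance. The short sketch below builds a label vector with the same 1000/1000 split as Madelon and scores a classifier that always predicts one class; the helper `accuracy` is written for this example only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Balanced labels: 2000 entries divided evenly between the two classes.
y_true = [0] * 1000 + [1] * 1000

# A constant classifier that always predicts label 1.
y_pred = [1] * len(y_true)

baseline = accuracy(y_true, y_pred)
print(baseline)  # 0.5
```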

### Benchmark



### Results - Step 1


### Results - Step 2


### Results - Step 3


### Results - Comparison of Models


### Conclusion