Step 0:
A few words of caution: 
1) Read all the way through the instructions. 
2) Models must be built using Python.
3) No additional data may be added or used. 
4) You do not need to use all of the data to build an adequate model, but making use of complex variables will help us identify high-performance candidates.
5) The predictions returned should be the class probabilities for belonging to the positive class, not the class itself (i.e. a decimal value, not just 1 or 0). Be sure to output a prediction for EACH of the 10,000 rows in the test dataset.  

Step 1:
Clean and prepare your data: Values have been deleted from several entries to simulate dirty data. Please clean the data with whatever method(s) you believe are most suitable. Note that some of the missing values are truly blank (unknown answers).
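One possible way to approach Step 1 is sketched below. It is only an illustration under stated assumptions, not a prescribed method: the helper name clean_frame, the toy columns, the choice of median imputation, and the "missing" label for blank categorical values are all hypothetical choices made for this example.

```python
import numpy as np
import pandas as pd


def clean_frame(df, numeric_fill=None):
    """Median-impute numeric columns and label missing categoricals.

    Assumption: blanks in non-numeric columns may be genuine unknowns,
    so they get their own "missing" category instead of a guessed value.
    """
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = [c for c in out.columns if c not in num_cols]
    if numeric_fill is None:
        # Learn the fill values from this frame (the training data).
        numeric_fill = out[num_cols].median()
    for col in num_cols:
        out[col] = out[col].fillna(numeric_fill[col])
    for col in cat_cols:
        out[col] = out[col].fillna("missing")
    return out, numeric_fill


# Toy stand-in for the real training data.
toy = pd.DataFrame({"x0": [1.0, np.nan, 3.0], "x1": ["a", None, "b"]})
cleaned, medians = clean_frame(toy)
print(cleaned)
```

The medians returned from the training frame can be passed back in as numeric_fill when cleaning the test frame, so that both sets are filled with the same values.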

Step 2:
Build your models: Please build two distinctly different machine learning/statistical models to predict the value for y. When writing the code associated with each model, please have the first part produce and save the model, followed by a second part that loads and applies the model.
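The two-part structure requested in Step 2 (produce and save, then load and apply) could look roughly like the sketch below. The model choices (logistic regression and a random forest), the synthetic data, the file names, and the use of pickle in a temporary directory are assumptions made for illustration; the exercise does not specify any of them.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the cleaned training data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model_dir = Path(tempfile.mkdtemp())  # hypothetical save location

# Part 1: fit each model and save it to disk.
models = {
    "model1": LogisticRegression(max_iter=1000),
    "model2": RandomForestClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    with open(model_dir / f"{name}.pkl", "wb") as fh:
        pickle.dump(model, fh)

# Part 2: load each saved model and apply it.
loaded = {}
for name in models:
    with open(model_dir / f"{name}.pkl", "rb") as fh:
        loaded[name] = pickle.load(fh)

# Column 1 of predict_proba is the probability of the positive class.
probs = {name: m.predict_proba(X)[:, 1] for name, m in loaded.items()}
```

Keeping the two parts separate means the second part can be run on its own against new data, as long as the saved model files are available.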

Step 3:
Create predictions on the test dataset using both of your trained models. The predictions should be the class probabilities for belonging to the positive class (labeled ‘1’). Be sure to output a prediction for EACH of the 10,000 rows in the test dataset. Save the results of the two models in separate CSV files titled “results1.csv” and “results2.csv”. Each result file should have a single column representing the output from one model.
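A minimal sketch of writing one result file with the checks Step 3 implies (one probability per test row, a single column). The helper save_predictions, the column name "probability", the temporary output directory, and the random stand-in values are all hypothetical; whether the file should carry a header row is not stated in the instructions.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def save_predictions(probabilities, path, expected_rows):
    """Write positive-class probabilities to a single-column CSV."""
    probabilities = np.asarray(probabilities, dtype=float)
    if probabilities.shape != (expected_rows,):
        raise ValueError(
            f"expected {expected_rows} predictions, got shape {probabilities.shape}"
        )
    if np.isnan(probabilities).any():
        raise ValueError("predictions contain NaN")
    if ((probabilities < 0) | (probabilities > 1)).any():
        raise ValueError("predictions must be probabilities in [0, 1]")
    # index=False keeps the file to a single column.
    pd.DataFrame({"probability": probabilities}).to_csv(path, index=False)


out_dir = Path(tempfile.mkdtemp())
rng = np.random.default_rng(0)
fake_probs = rng.random(10_000)  # stand-in for model.predict_proba(X_test)[:, 1]
save_predictions(fake_probs, out_dir / "results1.csv", expected_rows=10_000)
written = pd.read_csv(out_dir / "results1.csv")
print(written.shape)
```

The same call with the second model's probabilities and "results2.csv" would produce the second file.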

Step 4:
Submit your work: In addition to the two result files (CSV format), please submit all of your code for cleaning, prepping, and modeling your data (text, html, or PDF preferred), and a brief write-up comparing the pros and cons of the two modeling techniques you used (PDF preferred).
Please do not submit the original data back to us. Your work will be scored on the techniques used (appropriateness and complexity), model performance on the holdout data (measured by AUC), your understanding of the two techniques compared in your write-up, and your overall code.
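Because performance is measured by AUC on data the models have not seen, it can help to estimate AUC locally on a validation split before submitting. The sketch below assumes a stratified 75/25 split, a logistic regression model, and synthetic data with roughly 20% positives; none of these are specified by the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the cleaned training data.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Stratify so the validation split keeps the same class balance.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC is computed from positive-class probabilities, not hard labels.
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"validation AUC: {val_auc:.3f}")
```

Running the same evaluation for both models gives a like-for-like comparison that can feed into the write-up.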

In [7]:
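# Data handling
import pandas as pd
import numpy as np
# Modeling libraries (sklearn submodules such as sklearn.linear_model
# still need their own explicit import before use)
import tensorflow as tf
import sklearn as sk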

In [2]:
trainingSet = pd.read_csv('exercise_02_train.csv')
testingSet = pd.read_csv('exercise_02_test.csv')


In [5]:
trainingSet.describe()

statistic,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x90,x91,x92,x94,x95,x96,x97,x98,x99,y
count,39989.0,39990.0,39992.0,39991.0,39992.0,39994.0,39990.0,39991.0,39994.0,39993.0,...,39993.0,39996.0,39993.0,39992.0,39992.0,39985.0,39987.0,39994.0,39987.0,40000.0
mean,3.446069,-7.788884,1.706058,-0.072972,0.123077,-0.608624,0.035576,-0.052651,-2.909764,-0.024265,...,-9.002636,-0.001751,-0.005731,-0.014064,-0.09504,-0.807556,-2.514305,0.03837,0.043218,0.2036
std,16.247547,37.014862,38.385085,1.503243,16.289994,15.585122,9.041371,6.953403,13.149006,2.939895,...,96.666843,2.62684,4.60532,2.166326,27.516763,23.836194,18.554646,8.450995,1.114444,0.40268
min,-60.113902,-157.341119,-163.339956,-6.276969,-61.632319,-62.808995,-35.060656,-26.736717,-53.735586,-11.497395,...,-422.711982,-10.179216,-20.044113,-9.396153,-125.064735,-108.474714,-73.908741,-35.416133,-4.376614,0.0
25%,-7.602474,-32.740989,-24.141605,-1.088182,-10.896241,-11.183089,-6.090255,-4.747798,-11.722776,-2.004215,...,-73.209185,-1.777981,-3.113418,-1.491537,-18.465082,-16.826144,-15.026614,-5.645656,-0.710712,0.0
50%,3.448865,-8.019993,1.963977,-0.062389,0.104277,-0.574567,0.046812,-0.037727,-2.941234,-0.054526,...,-6.884549,-0.019422,-0.007618,-0.012195,0.099472,-0.651197,-2.509525,0.023663,0.042663,0.0
75%,14.266716,16.853383,27.5165,0.940612,11.078565,9.955357,6.100903,4.637982,5.865014,1.9551,...,56.67681,1.761629,3.100729,1.450074,18.514579,15.275896,9.889591,5.728781,0.797856,0.0
max,75.311659,153.469221,154.05106,5.837559,65.949709,63.424046,45.053946,34.267792,66.936936,11.271939,...,378.752405,11.29574,19.414284,9.136848,112.39071,92.926545,76.120119,34.170189,4.490209,1.0
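In the summary above, the count row falls below 40,000 for the feature columns (for example 39,989 for x0, which implies 11 missing values), the mean of y is 0.2036 (about 20% positives), and x93 does not appear between x92 and x94, which suggests it is non-numeric since describe() summarizes only numeric columns by default. A quick way to confirm these observations directly is sketched below on a toy frame; the toy values are made up, and on the real data the same calls would be applied to trainingSet.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one numeric gap, one non-numeric column, 20% positives.
toy = pd.DataFrame(
    {
        "x0": [1.0, np.nan, 3.0, 4.0, 5.0],
        "x93": ["a", None, "b", "a", "b"],
        "y": [0, 0, 1, 0, 0],
    }
)

# Missing values per column (describe() only hints at this via count).
missing_per_column = toy.isna().sum()

# Columns that describe() skips by default because they are not numeric.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Share of rows in the positive class.
positive_rate = toy["y"].mean()

print(missing_per_column)
print(non_numeric, positive_rate)
```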
