# Phase 3 Code Challenge: Quality Assurance

This assessment is designed to test your understanding of these areas:

1. Business Decision
    - Selecting an appropriate metric
1. Data Engineering
    - Importing data from a CSV
    - Handling missing values
    - Feature scaling
1. Machine Learning
    - Fitting a model on training data
    - Evaluating with cross-validation
    - Hyperparameter tuning
    - Model evaluation on test data

Make sure that your code is clean and readable, and that each step of your process is documented. In this challenge, each step builds upon the step before it. If you cannot finish one of the steps completely, move on to the next one so that you attempt every section. There will be occasional hints to help you move on if you get stuck, but follow the requirements whenever possible.

### The Business Problem

You have been asked by a manufacturer to build a machine learning model to help with quality assurance.  Some fraction of all parts created in their factory have manufacturing flaws, and therefore should not be shipped to customers.  The cost of processing a returned part (false negative) is relatively high, and the cost of a secondary inspection (false positive) is relatively low.

### Data Understanding

Contained in this repo is a CSV file named `quality_assurance.csv`.  Each record represents a single part.  The columns include features labeled `A` through `Z`, and a target called `flawed`.  If `flawed` is equal to 1, that means that the part has a manufacturing flaw and should not be shipped to customers.

## Tasks

### Business Decision

Given the business problem, what classification metric do you think would be appropriate for this model? Please write your answer in a markdown cell and explain your reasoning.
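When weighing candidate metrics, it can help to compute a few of them on a tiny example and see which kind of error each one penalizes. The sketch below uses made-up labels (not the challenge data) and two common sklearn metrics purely as an illustration; which metric fits the business problem is still for you to argue.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels invented for illustration: 1 = flawed, 0 = not flawed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# Recall: share of truly flawed parts that the model caught
# (lowered by false negatives).
recall = recall_score(y_true, y_pred)

# Precision: share of parts flagged as flawed that really are flawed
# (lowered by false positives).
precision = precision_score(y_true, y_pred)

print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Here two of the four flawed parts are missed and one good part is flagged, so the two metrics disagree about how well the predictions did.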

### Data Preparation

Import the quality assurance CSV data using Pandas and print out the head.
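A minimal sketch of this step is shown below. So that it runs anywhere, it reads a small in-memory stand-in (invented values, only two feature columns) rather than the real file; in the notebook you would pass the path `quality_assurance.csv` to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for the real CSV; the values here are made up.
# In the notebook: df = pd.read_csv("quality_assurance.csv")
csv_text = io.StringIO("A,B,flawed\n0.5,1.2,0\n0.1,,1\n0.9,3.4,0\n")
df = pd.read_csv(csv_text)

# Show the first rows; empty fields are read in as NaN.
print(df.head())
```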

#### Train-Test Split

Before performing any other preprocessing steps, split the data into training, validation, and testing sets. Use the `train_test_split` utility function from sklearn with a `random_state` of 42 for reproducibility. Because `train_test_split` produces only two subsets per call, you will need to call it twice. **NOTE:** You will be asked to validate your models throughout this notebook. You will need *three sets of data* to do this successfully.
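One way to get three sets is sketched below, assuming a synthetic stand-in DataFrame (three made-up feature columns) and the default split proportions; the variable names are illustrative, not required.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the challenge data (values are random).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["A", "B", "C"])
df["flawed"] = rng.integers(0, 2, size=100)

# Separate the features from the target.
X = df.drop(columns="flawed")
y = df["flawed"]

# First split: hold out the test set.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, random_state=42
)

# Second split: carve a validation set out of the remaining data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, random_state=42
)
```

With the default `test_size` of 0.25, each call holds out a quarter of the rows it is given.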

#### Missing Values

At least one column in this dataset contains missing values.  Use the sklearn `SimpleImputer` ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)) to fill in the missing values.  **Do not** add a "missing indicator".

If you get stuck at this step, you can drop the rows containing missing values instead, but expect this to hurt your model's performance.
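The imputation step might look like the sketch below. The arrays are tiny made-up examples, and the `"median"` strategy is an assumption; the challenge does not prescribe one. The key point is that the imputer is fit on the training data only and then reused on the validation data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny made-up arrays with missing values.
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_val = np.array([[np.nan, 40.0]])

# add_indicator defaults to False, so no missing-indicator columns are added.
imputer = SimpleImputer(strategy="median")  # strategy is an assumption

# Learn the fill values from the training data only...
X_train_imputed = imputer.fit_transform(X_train)
# ...and apply those same values to the validation data.
X_val_imputed = imputer.transform(X_val)
```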

#### Feature Scaling

Because we intend to use a model with regularization, the feature magnitudes need to be scaled in order to avoid overly penalizing the features that happen to have larger magnitudes.  Use the sklearn `StandardScaler` ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)) to scale all of the features.  It is okay if you lose the feature names at this stage.
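A minimal sketch of the scaling step follows, again on small made-up arrays standing in for the imputed data. As with the imputer, the scaler is fit on the training set and only applied to the validation set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for the imputed data; the second column has a
# much larger magnitude than the first.
X_train_imputed = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_val_imputed = np.array([[2.0, 400.0]])

scaler = StandardScaler()

# Learn each column's mean and standard deviation from the training data.
X_train_scaled = scaler.fit_transform(X_train_imputed)
# Reuse the training statistics on the validation data.
X_val_scaled = scaler.transform(X_val_imputed)
```

The result is a NumPy array, which is why the feature names are lost if the input was a DataFrame.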

### Modeling

#### Initial Model

Build a classification model and train it on the preprocessed training data. Use the sklearn `LogisticRegression` model ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)).

Check the performance of the logistic regression model on the training and validation data.

If you have time, this would also be a good time to show the confusion matrix.
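The fitting and evaluation steps could be sketched as follows. The data comes from `make_classification` as a synthetic stand-in for the preprocessed challenge data, and recall is used only as an example metric; substitute whichever metric you chose.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed training and validation data.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Fit a logistic regression with default hyperparameters.
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Score on both sets to spot over- or underfitting.
# Recall is an assumed metric for illustration.
train_recall = recall_score(y_train, model.predict(X_train))
val_recall = recall_score(y_val, model.predict(X_val))
print(f"train recall={train_recall:.3f}, validation recall={val_recall:.3f}")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_val, model.predict(X_val))
print(cm)
```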

#### Second Model - Hyperparameter Tuning

Build a second `LogisticRegression` model, this time adjusting at least one hyperparameter related to regularization (either the type or strength of regularization applied). Repeat the same performance checks on this model that you used for the first model. If you have time, consider writing a function for this evaluation process.

The company suspects that at least one of the features is not actually important for predicting whether a part will have a manufacturing flaw.  Inspect the coefficients of your second `LogisticRegression` model (`.coef_` attribute) and recommend whether the company can save costs by no longer collecting data on one or more of the features.  (Your findings will differ slightly based on your model choice, which is fine.)

If you are getting stuck at this step, skip it and continue with model iterations.
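One possible shape for this step is sketched below on synthetic data. It adjusts only the regularization strength (`C=0.01` is an arbitrary assumed value, and smaller `C` means stronger regularization), and the feature names are hypothetical placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
feature_names = ["A", "B", "C", "D", "E"]
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Default model for comparison, then a more strongly regularized one.
baseline = LogisticRegression(random_state=42).fit(X_train, y_train)
strong_reg = LogisticRegression(C=0.01, random_state=42).fit(X_train, y_train)

# coef_ has shape (1, n_features) for a binary problem.
coefs = strong_reg.coef_[0]

# Rank features from smallest to largest absolute coefficient;
# those near zero contribute least to the prediction.
ranked = sorted(zip(feature_names, coefs), key=lambda pair: abs(pair[1]))
for name, coef in ranked:
    print(f"{name}: {coef:+.4f}")
```

Coefficient magnitudes are only comparable across features when the features have been scaled, which is one reason the scaling step comes first.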

#### Third Model - Decision Tree

Create a `DecisionTreeClassifier` model ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)) and train it on the preprocessed training data. There is no need to adjust any hyperparameters and you can use the same data as your previous models.

Once again, check the performance of this decision tree model on both your training and validation sets. Compare it to your previous models in terms of your chosen metric.
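A minimal sketch of the decision tree iteration, on synthetic stand-in data and with recall as an assumed example metric. The `random_state` is set only so the sketch is reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training and validation data.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# An unconstrained tree keeps splitting until its leaves are pure.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Recall is an assumed metric; use the one you chose.
train_recall = recall_score(y_train, tree.predict(X_train))
val_recall = recall_score(y_val, tree.predict(X_val))
print(f"train recall={train_recall:.3f}, validation recall={val_recall:.3f}")
```

A large gap between the training and validation scores is a sign of overfitting, which is worth noting when you compare this model to the logistic regressions.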

#### Final Model - Grid Search

Your final model iteration will use `GridSearchCV` ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)) to determine the optimal hyperparameters for a decision tree classifier. 

Please supply the following hyperparameters to `GridSearchCV`:
- `estimator`: a basic `DecisionTreeClassifier` (do not adjust hyperparameters)
- `param_grid`: provide "gini" and "entropy" for the `criterion` hyperparameter, a range of 2-14 inclusive for `max_leaf_nodes`, and a range of 2-10 inclusive for `max_depth`
- `scoring`: a string indicating which classification metric you are using 

Once you have instantiated and fit your grid search, check the performance metrics of the best estimator and compare them to those of your best performing model so far.

Which of your four models performs best? Please transform `X_test` using the same imputer and scaler fit on `X_train` and evaluate your final model's performance on `X_test`.
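The grid search described above could be set up roughly as in the sketch below. It runs on synthetic stand-in data, and `"recall"` is an assumed scoring string; pass the one that matches your chosen metric.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training and test data.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# range() excludes its upper bound, so add 1 to make it inclusive.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_leaf_nodes": list(range(2, 15)),
    "max_depth": list(range(2, 11)),
}

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=param_grid,
    scoring="recall",  # assumed metric
)
# Each parameter combination is scored with cross-validation (5 folds by default).
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f"best cross-validated score: {grid.best_score_:.3f}")

# best_estimator_ is refit on all of the training data passed to fit().
test_recall = recall_score(y_test, grid.best_estimator_.predict(X_test))
print(f"test recall={test_recall:.3f}")
```

Because `GridSearchCV` cross-validates internally, it only needs the training data; the test set is touched once, at the very end.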

### Bonus

#### Pipeline

Using your best performing model, create a Pipeline ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)) which performs the entire modeling process. The Pipeline should include the `SimpleImputer` to handle missing data, the `StandardScaler` to scale your features, and your best performing model to train and make predictions with.

Please fit your Pipeline with the *unprocessed training data*, and make predictions using the *unprocessed training and testing data* (i.e., do not use data that has been imputed or scaled).
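A sketch of such a Pipeline is shown below. It uses synthetic data with randomly injected missing values as a stand-in for the raw challenge data, and a default `LogisticRegression` as a placeholder for whichever model performed best for you; the step names are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the raw data, with about 5% of values set to NaN.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Steps run in order: impute, scale, then model.
pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer()),
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),  # placeholder for your best model
    ]
)

# Fitting on the raw training data fits the imputer and scaler too.
pipe.fit(X_train, y_train)

# Predicting on raw data applies the fitted imputer and scaler first.
train_preds = pipe.predict(X_train)
test_preds = pipe.predict(X_test)
```

Bundling the steps this way means the imputer and scaler can never be accidentally refit on validation or test data.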

### Scoring

##### Business Decision - 1 Point
- [ ] Chose an appropriate metric given the business case and *explained your reasoning*

##### Data Prep - 1 Point
- [ ] Imported data, separated X and y, train-test split
- [ ] Filled in missing values of `X_train` and `X_val` with a `SimpleImputer`
- [ ] Scaled values of `X_train` and `X_val` with a `StandardScaler`

##### Initial Model - .5 Points
- [ ] Fit a `LogisticRegression` model and investigated its performance in terms of your chosen metric

##### Second Model Hyperparameter Tuning - 1 Point
- [ ] Created a second `LogisticRegression` model and adjusted the level of regularization using hyperparameters
- [ ] Fit the model on `X_train` and compared its performance to the previous model
- [ ] Interpreted the model's coefficients to determine unnecessary features

##### Third Model Decision Tree - .5 Points
- [ ] Fit a `DecisionTreeClassifier` and compared its performance to your current best performing model

##### Grid Search and Model Evaluation - 1 Point
- [ ] Performed a GridSearch on a basic `DecisionTreeClassifier`
- [ ] Transformed `X_test` with same imputer and scaler fitted on `X_train`
- [ ] Determined the best performing model and evaluated its performance on `X_test`

##### Bonus: Pipeline - 1 Point
- [ ] Fully functional Pipeline that performs imputing, scaling, and modeling