# Notebook 6: Complex ML Classification

Author: Randy J. Chase

### Goal: Train ML models using all features/predictors/inputs and all ML methods

#### Reminder of Problem Statement

As a reminder, these are the two ML tasks we want to accomplish in the paper:

1. Does this image contain a thunderstorm? <-- Classification
2. How many lightning flashes are in this image? <-- Regression

#### Background

To train our machine learning models we will use the [scikit-learn](https://scikit-learn.org/stable/) python package. Scikit-learn (also known as sklearn) has a wealth of resources for learning how to apply many ML methods. Furthermore, its documentation on any one method is extensive and very helpful. I encourage you to use that documentation if you want to go beyond what we show you in this tutorial. 

What is really nice about the sklearn package is that all of its models share the same syntax. In general, this is how you use any of them:

1. Create or load your input data. It must be of shape ```[n_samples,n_features]```. It is commonly written as ```X``` in the sklearn documentation. 
2. Create or load your output data. It must be of shape ```[n_samples]```. It is commonly written as ```y``` in the sklearn documentation. 
3. Initialize your model. To initialize a model in python you usually just need to add ```()``` at the end of the model name. 
4. Fit your model. To do this, all we need to do is call ```.fit(X,y)``` on the initialized model. 
5. Evaluate your trained model. To get the model predictions for evaluation we will use ```model.predict(X_val)``` (class labels) or ```model.predict_proba(X_val)``` (class probabilities). 
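
The five steps above can be sketched end to end as follows. This is only a minimal illustration, not the pipeline used in the paper: the synthetic data from ```make_classification```, the choice of ```LogisticRegression```, and the variable names (```X_train```, ```X_val```, and so on) are all assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Steps 1 & 2: synthetic stand-in data (assumption; the notebook loads real data).
# X has shape [n_samples, n_features], y has shape [n_samples].
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 3: initialize the model by adding () after the model name
model = LogisticRegression()

# Step 4: fit the model on the training data
model.fit(X_train, y_train)

# Step 5: get predictions on the validation data
yhat = model.predict(X_val)              # class labels, shape [n_samples]
yhat_proba = model.predict_proba(X_val)  # probabilities, shape [n_samples, k_classes]

print('X_train shape:', X_train.shape, 'y_train shape:', y_train.shape)
print('yhat shape:', yhat.shape, 'yhat_proba shape:', yhat_proba.shape)
```

Swapping in a different sklearn classifier should only require changing the import and the line in Step 3; the ```.fit``` and ```.predict``` calls stay the same.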

Now that we have the basic workflow, let's actually do this with the simple example in the paper, 1 input predictor. We will start with the classification task.

#### Step 1 & 2: Import packages and load data for Classification  
In the last notebook, we showed you how to split the data into train, validation, and test sets. We will now do all of that with our pre-made function. But we only want 1 feature, which is feature 0. So all we need to change is the ```features_to_keep``` keyword to get just the first feature. 
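
To make the shape requirement concrete, here is a small sketch of what keeping only feature 0 means for an array. It does not reproduce the pre-made function; the toy array ```X_all``` is hypothetical, and ```features_to_keep``` is used here as a plain variable rather than as that function's keyword.

```python
import numpy as np

# Hypothetical toy input with 4 samples and 3 features
X_all = np.arange(12, dtype=float).reshape(4, 3)

# Indexing with a list (not a bare integer) keeps the array 2-D,
# which is the [n_samples, n_features] shape sklearn expects
features_to_keep = [0]
X_one = X_all[:, features_to_keep]

print('X_all shape:', X_all.shape)
print('X_one shape:', X_one.shape)
```

Indexing with ```X_all[:, 0]``` instead would return a 1-D array of shape ```[n_samples]```, which sklearn models reject as input data.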

In [28]:
yhat = model.predict_proba(X_validate)
#print shape
print('yhat shape:',yhat.shape)

yhat shape: (86292, 2)


The shape here is ```[n_samples,k_classes]```, so there is one *column* of data per class. Let's look at the first sample.

In [30]:
yhat[0]

array([0.66475995, 0.33524005])

This sample would be labeled 0 by ```model.predict()```, because the probability of class 0 is the larger of the two. So to create a line on the performance diagram, we want to vary the threshold used for labeling, then re-calculate the contingency table and get the $POD$ and $SR$ for each threshold value. 
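
The threshold sweep described above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions: the helper ```pod_sr_curve``` and the toy arrays are hypothetical and not part of the notebook's code, and it uses the standard definitions $POD = hits/(hits+misses)$ and $SR = hits/(hits+false\ alarms)$.

```python
import numpy as np

def pod_sr_curve(y_true, p_class1, thresholds):
    """Hypothetical helper: POD and SR at each probability threshold."""
    y_true = np.asarray(y_true).astype(bool)
    p_class1 = np.asarray(p_class1)
    pod, sr = [], []
    for t in thresholds:
        # label a sample 1 when its class-1 probability reaches the threshold
        y_pred = p_class1 >= t
        # contingency table entries needed for POD and SR
        hits = np.sum(y_pred & y_true)
        misses = np.sum(~y_pred & y_true)
        false_alarms = np.sum(y_pred & ~y_true)
        # return NaN when a ratio is undefined (nothing observed / nothing predicted)
        pod.append(hits / (hits + misses) if (hits + misses) > 0 else np.nan)
        sr.append(hits / (hits + false_alarms) if (hits + false_alarms) > 0 else np.nan)
    return np.array(pod), np.array(sr)

# Toy values made up for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0])
p_class1 = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.7])
thresholds = np.array([0.3, 0.5, 0.9])

pod, sr = pod_sr_curve(y_true, p_class1, thresholds)
print('POD:', pod)
print('SR: ', sr)
```

With the notebook's variables, the class-1 probabilities would be ```yhat[:,1]``` from ```model.predict_proba(X_validate)```. Plotting ```sr``` against ```pod``` for a fine grid of thresholds gives the line on the performance diagram.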

While this is useful, in your adventures of using ML you might see some people plot a line on this diagram. How do we get a line on here? What does the line mean?

Well, what we have been showing so far is the ```model.predict``` output, which is actually doing something for us under the hood. The ```model.predict``` line of code is effectively running ```model.predict_proba```, which outputs a probability for each class (i.e., does not contain thunderstorm, contains thunderstorm), and then picking the column with the larger probability to assign the class. In other words, if the probability of class 0 (no thunderstorm) is less than 0.5, then the label is 1, and if it is greater than 0.5 the label is 0. This is not a bad place to start, but sometimes we can get better performance by using a different *threshold*.
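
A quick way to check this behavior is to compare ```model.predict``` against the argmax of ```model.predict_proba``` directly. The sketch below is an illustration under stated assumptions: it uses synthetic data and a ```RandomForestClassifier``` as a stand-in model, and the value 0.3 for ```threshold``` is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model (assumptions for this example only)
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# One column of probabilities per class: shape [n_samples, k_classes]
proba = model.predict_proba(X)

# Default labels versus taking the most probable class ourselves
labels_default = model.predict(X)
labels_argmax = model.classes_[np.argmax(proba, axis=1)]
same = np.array_equal(labels_default, labels_argmax)
print('predict matches argmax of predict_proba:', same)

# Using our own threshold on the class-1 probability instead
threshold = 0.3  # arbitrary value chosen for illustration
labels_custom = (proba[:, 1] >= threshold).astype(int)
print('n labeled 1 (default):', labels_default.sum())
print('n labeled 1 (custom): ', labels_custom.sum())
```

Lowering the threshold below 0.5 can only keep or increase the number of samples labeled 1, which generally trades a higher $POD$ for a lower $SR$.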


We can do this ourselves. Let's look at ```model.predict_proba```.