# Jupyter Notebooks Primer

- Jupyter notebooks allow you to run code in your browser
- In a nutshell, notebooks consist of cells that contain either [Markdown](https://en.wikipedia.org/wiki/Markdown) text (for comments etc.) or code
- Code cells are executed by selecting them and hitting "Shift-Enter" or clicking on Run above
- For more info have a look at this [introduction](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/What%20is%20the%20Jupyter%20Notebook.html)

# The Input Data

The next cell loads pre-processed data for hundreds of samples. The columns contain variant calls and some meta-information.

In [None]:
import pandas as pd
# load the csv file as pandas dataframe
url = "https://raw.githubusercontent.com/andreas-wilm/microsoft-roadshow-hongkong-09-2019/master/sample_matrix_clean_m-5-75.csv?token=AAILSCML6YI4HX6W6WXEVAC5P6Q2S"
df = pd.read_csv(url)


In [None]:
# display the dataframe
df

In [None]:
# Remove columns that don't go into AutoML as features but keep a copy
annotation = df[["ID", "Status", "Gender"]].copy()
df = df.drop(["ID", "Status"], axis=1)


In [None]:
# Split into training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df, annotation['Status'], test_size=0.2, random_state=42)
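
As a minimal sketch of what the split does, the following uses made-up toy data (the column names `var_a`, `var_b`, `var_c` and the 30/70 class ratio are assumptions, not taken from the real matrix). It also passes `stratify`, which the cell above does not use; this is an optional addition that may help if the classes are imbalanced. Note that the resulting subsets keep the original dataframe index, which is what later cells rely on when looking up annotations.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 3 made-up binary feature columns
rng = np.random.default_rng(0)
toy_features = pd.DataFrame(rng.integers(0, 2, size=(100, 3)),
                            columns=["var_a", "var_b", "var_c"])
toy_labels = pd.Series(["case"] * 30 + ["control"] * 70, name="Status")

# test_size=0.2 holds out 20% of the rows; random_state makes the split
# reproducible; stratify keeps the case/control ratio equal in both subsets
Xtr, Xte, ytr, yte = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=42,
    stratify=toy_labels)

print(len(Xtr), len(Xte))
print(yte.value_counts().to_dict())
```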


# AutoML run (local)


A very useful feature of AutoML is automatic preprocessing (see `preprocess` below), which can impute missing values, encode values, add features, embed words etc. See [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-portal-experiments#preprocess) for more information. Since the data set here is already clean, there is no need for this.
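
To give a feel for what imputing and encoding mean, here is a small sketch that performs both steps by hand with scikit-learn. This is not how AutoML implements its preprocessing; it is a stand-in technique on a made-up table (the `depth` column and its values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical raw table with one missing numeric value and one categorical column
raw = pd.DataFrame({
    "depth": [10.0, np.nan, 30.0, 20.0],
    "Gender": ["F", "M", "F", "M"],
})

# Imputation: replace the missing value with the mean of the observed values
imputer = SimpleImputer(strategy="mean")
depth_filled = imputer.fit_transform(raw[["depth"]])

# Encoding: turn each category into its own 0/1 column
encoder = OneHotEncoder(handle_unknown="ignore")
gender_encoded = encoder.fit_transform(raw[["Gender"]]).toarray()

print(depth_filled.ravel())
print(list(encoder.categories_[0]))
print(gender_encoded)
```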

To run AutoML, we first define some basic settings:

In [None]:
import logging

automl_settings = {
    "iteration_timeout_minutes": 1,
    "iterations": 10,
    "primary_metric": 'accuracy',
    "preprocess": False,
    "verbosity": logging.INFO,
    "n_cross_validations": 5
}
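
The `n_cross_validations` and `primary_metric` settings ask AutoML to score each candidate model by accuracy across five cross-validation folds. A rough sketch of that idea, using plain scikit-learn on synthetic data rather than AutoML itself (the logistic regression model and the generated data set are assumptions chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix and labels
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# cv=5 splits the data into five folds; each fold is held out once
# while the model is trained on the remaining four
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_toy, y_toy, cv=5, scoring="accuracy")

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```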

In [None]:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',
                             debug_log='automated_ml_errors.log',
                             X=X_train.values,
                             y=y_train.values.flatten(),
                             **automl_settings)

Connect to the ML workspace on Azure so that everything is logged there as well.

**Please note:** the following cell requires interactive authentication. Simply follow the instructions shown in the output.
 

In [None]:
from azureml.core import Workspace
ws = Workspace.from_config()

In [None]:
# submit the experiment.
# note how automl runs multiple algorithms with different parameters automatically for you
from azureml.core.experiment import Experiment
experiment = Experiment(ws, "vcf-classification-local")
local_run = experiment.submit(automl_config, show_output=True)

In [None]:
# Show the run details widget
from azureml.widgets import RunDetails
RunDetails(local_run).show()

# Predict outcome

In [None]:
# get the best model
best_run, fitted_model = local_run.get_output()


In [None]:
# predict outcome for all test samples and print the first 10
y_predict = fitted_model.predict(X_test.values)
print("Sample\tPredicted\tActual")
for idx, (dfidx, dfrow) in enumerate(X_test.iterrows()):
    print("{}\t{}\t{}".format(annotation.at[dfidx, 'ID'],
                              y_predict[idx],
                              annotation.at[dfidx, 'Status']))
    # top 10 is enough
    if idx == 9:
        break
print("...")
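
The loop above works because the test rows keep their original dataframe index, so each one can be matched to its annotation. The same table could probably be built without a loop; the sketch below shows one way on small made-up stand-ins (`toy_annotation`, `toy_test_index` and `toy_predictions` are hypothetical names, not objects from this notebook).

```python
import pandas as pd

# Hypothetical stand-ins for `annotation`, `X_test.index` and `y_predict`
toy_annotation = pd.DataFrame(
    {"ID": ["s1", "s2", "s3", "s4"],
     "Status": ["case", "control", "case", "control"]},
    index=[10, 11, 12, 13])
toy_test_index = pd.Index([12, 10])
toy_predictions = ["case", "control"]

# Select the annotation rows in test-set order, then add the predictions
comparison = (toy_annotation.loc[toy_test_index, ["ID", "Status"]]
              .rename(columns={"Status": "Actual"}))
comparison.insert(1, "Predicted", toy_predictions)

# Show at most the first 10 rows
print(comparison.head(10).to_string(index=False))
```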

## Print stats and plot a confusion matrix

In [None]:
# idea from https://datatofish.com/confusion-matrix-python/
# look up the true status of each test sample via its dataframe index
y_actual = annotation.loc[X_test.index, 'Status'].tolist()

data = {'y_Predicted': y_predict,
        'y_Actual': y_actual}
df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])

In [None]:
# print stats
from pandas_ml import ConfusionMatrix
Confusion_Matrix = ConfusionMatrix(df['y_Actual'], df['y_Predicted'])
Confusion_Matrix.print_stats()
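
Comparable statistics can also be computed with `sklearn.metrics`, which may be a useful fallback if `pandas_ml` is not installed. The sketch below uses made-up labels rather than the notebook's predictions; rows of the matrix are actual classes and columns are predicted classes.

```python
from sklearn import metrics

# Hypothetical true and predicted labels for six samples
toy_actual = ["case", "case", "control", "control", "control", "case"]
toy_predicted = ["case", "control", "control", "control", "case", "case"]
labels = ["case", "control"]

# Rows: actual class, columns: predicted class, both in the order of `labels`
cm = metrics.confusion_matrix(toy_actual, toy_predicted, labels=labels)
acc = metrics.accuracy_score(toy_actual, toy_predicted)

print(cm)
print("accuracy:", acc)
# Per-class precision, recall and F1 score
print(metrics.classification_report(toy_actual, toy_predicted, labels=labels))
```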

In [None]:
# plot confusion matrix
# idea from https://stackoverflow.com/questions/19233771/sklearn-plot-confusion-matrix-with-labels/48018785
import seaborn as sn

import matplotlib.pyplot as plt     
ax = plt.subplot()

confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], 
                               rownames=['Actual'], colnames=['Predicted'])

sn.heatmap(confusion_matrix, annot=True, ax=ax)

# labels, title and ticks
ax.set_xlabel('Predicted')
ax.set_ylabel('True')
ax.set_title('Confusion Matrix')

## Model Interpretability and Explainability

Microsoft has [six guiding AI principles](https://blogs.partner.microsoft.com/mpn/shared-responsibility-ai-2/). One of these is transparency, which states that it must be possible to understand how AI decisions were made. This is where [model interpretability](https://docs.microsoft.com/en-us/azure/machine-learning/service/machine-learning-interpretability-explainability) comes into play. Here we will use a TabularExplainer to understand the global behavior of our model.

In [None]:
from azureml.explain.model.tabular_explainer import TabularExplainer
# the "features" and "classes" arguments are optional and are not used here
explainer = TabularExplainer(fitted_model, X_train)

In [None]:
# Now run the explainer. This takes some time...
global_explanation = explainer.explain_global(X_train)

In [None]:
# Let's find the top features
sorted_global_importance_names = global_explanation.get_ranked_global_names()
print("Top 10 features")
print("\n".join(sorted_global_importance_names[:10]))


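This should give you an idea of which features the model relies on most in this data set. Keep in mind that a high importance indicates an association the model has learned, not necessarily a causal factor.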
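
For intuition about what a global feature ranking expresses, here is a sketch using permutation importance from scikit-learn. This is a different and simpler technique than the one behind `TabularExplainer`, shown on synthetic data where the label is deliberately constructed from a single feature (the feature names `var_0` to `var_3` and the random forest model are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: four binary features, label depends only on the first one
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(300, 4)).astype(float)
y_toy = X_toy[:, 0].astype(int)
feature_names = ["var_0", "var_1", "var_2", "var_3"]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Shuffle one feature at a time and measure how much the score drops
result = permutation_importance(model, X_toy, y_toy,
                                n_repeats=5, random_state=0)

# Rank features from most to least important
ranked = [feature_names[i] for i in np.argsort(result.importances_mean)[::-1]]
print("Top features:", ranked)
```

Because the label is built from `var_0` alone, that feature should come out on top; on real data the ranking only reflects what the fitted model uses, which is why it should be read with care.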
