diff --git a/images/conifer_v1.png b/images/conifer_v1.png new file mode 100644 index 00000000..f19a9ec5 Binary files /dev/null and b/images/conifer_v1.png differ diff --git a/part5_bdt.ipynb b/part5_bdt.ipynb new file mode 100644 index 00000000..c1630f18 --- /dev/null +++ b/part5_bdt.ipynb @@ -0,0 +1,292 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "209d2b58", + "metadata": {}, + "source": [ + "## Part 5: Boosted Decision Trees\n", + "\n", + "The `conifer` package was created out of `hls4ml`, providing a similar set of features but specifically targeting inference of Boosted Decision Trees. In this notebook we will train a `GradientBoostingClassifier` with scikit-learn, using the same jet tagging dataset as in the other tutorial notebooks. Then we will convert the model using `conifer`, and run bit-accurate prediction and synthesis as we did with `hls4ml` before.\n", + "\n", + "`conifer` is available on GitHub [here](https://github.com/thesps/conifer), and we have a publication describing the inference implementation and performance in detail [here](https://iopscience.iop.org/article/10.1088/1748-0221/15/05/P05026/pdf).\n", + "\n", + "![conifer](images/conifer_v1.png)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eda9b784", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "from sklearn.ensemble import GradientBoostingClassifier\n", + "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n", + "from sklearn.metrics import accuracy_score\n", + "import joblib\n", + "import conifer\n", + "import plotting\n", + "import matplotlib.pyplot as plt\n", + "import os\n", + "os.environ['PATH'] = '/opt/Xilinx/Vivado/2019.2/bin:' + os.environ['PATH']\n", + "np.random.seed(0)" + ] + }, + { + "cell_type": "markdown", + "id": "18354699", + "metadata": {}, + "source": [ + "## Load the dataset\n", + "Note that you need to have gone through `part1_getting_started` to download the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1574ed18", + "metadata": {}, + "outputs": [], + "source": [ + "X_train_val = np.load('X_train_val.npy')\n", + "X_test = np.load('X_test.npy')\n", + "y_train_val = np.load('y_train_val.npy')\n", + "y_test = np.load('y_test.npy', allow_pickle=True)\n", + "classes = np.load('classes.npy', allow_pickle=True)" + ] + }, + { + "cell_type": "markdown", + "id": "24658fb4", + "metadata": {}, + "source": [ + "We need to transform the test labels from their one-hot encoded values back to class labels." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "00f304bd", + "metadata": {}, + "outputs": [], + "source": [ + "le = LabelEncoder().fit(classes)\n", + "ohe = OneHotEncoder().fit(le.transform(classes).reshape(-1,1))\n", + "y_train_val = ohe.inverse_transform(y_train_val.astype(int))\n", + "y_test = ohe.inverse_transform(y_test)" + ] + }, + { + "cell_type": "markdown", + "id": "8305e22c", + "metadata": {}, + "source": [ + "## Train a `GradientBoostingClassifier`\n", + "We will use 20 estimators with a maximum depth of 3. The number of decision trees will be `n_estimators * n_classes`, so 100 for this dataset. If you are returning to this notebook having already trained the BDT once, set `train = False` to load the model rather than retraining it."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f5044231", + "metadata": {}, + "outputs": [], + "source": [ + "train = True\n", + "if train:\n", + " clf = GradientBoostingClassifier(n_estimators=20, learning_rate=1.0,\n", + " max_depth=3, random_state=0, verbose=1).fit(X_train_val, y_train_val.ravel())\n", + " if not os.path.exists('model_5'):\n", + " os.makedirs('model_5')\n", + " joblib.dump(clf, 'model_5/bdt.joblib')\n", + "else:\n", + " clf = joblib.load('model_5/bdt.joblib')" + ] + }, + { + "cell_type": "markdown", + "id": "5e9857c2", + "metadata": {}, + "source": [ + "## Create a conifer configuration\n", + "\n", + "Similarly to `hls4ml`, we can use a utility method to get a template for the configuration dictionary that we can modify." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5bab868f", + "metadata": {}, + "outputs": [], + "source": [ + "cfg = conifer.backends.xilinxhls.auto_config()\n", + "cfg['OutputDir'] = 'model_5/conifer_prj'\n", + "cfg['XilinxPart'] = 'xcu250-figd2104-2L-e'\n", + "plotting.print_dict(cfg)" + ] + }, + { + "cell_type": "markdown", + "id": "9e3ca740", + "metadata": {}, + "source": [ + "## Convert the model\n", + "The syntax for model conversion with `conifer` is a little different from `hls4ml`. We construct a `conifer.model` object, providing the trained BDT, the converter corresponding to the library we used, the conifer 'backend' that we wish to target, and the configuration.\n", + "\n", + "`conifer` has converters for:\n", + "- `sklearn`\n", + "- `xgboost`\n", + "- `tmva`\n", + "\n", + "And backends:\n", + "- `vivadohls`\n", + "- `vitishls`\n", + "- `xilinxhls` (uses whichever of `vivado` or `vitis` is on the path)\n", + "- `vhdl`\n", + "\n", + "Here we will use the `sklearn` converter, since that's how we trained our model, and the `vivadohls` backend. For larger BDTs with many more trees or greater depth, it may be preferable to generate VHDL directly using the `vhdl` backend to get the best performance. See [our paper](https://iopscience.iop.org/article/10.1088/1748-0221/15/05/P05026/pdf) for a performance comparison between these backends." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ebf5b06", + "metadata": {}, + "outputs": [], + "source": [ + "cnf = conifer.model(clf, conifer.converters.sklearn, conifer.backends.vivadohls, cfg)\n", + "cnf.compile()" + ] + }, + { + "cell_type": "markdown", + "id": "dc5e487b", + "metadata": {}, + "source": [ + "## Profile\n", + "Similarly to `hls4ml`, we can visualize the distribution of the parameters of the BDT to guide the choice of precision." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "993fef56", + "metadata": {}, + "outputs": [], + "source": [ + "cnf.profile()" + ] + }, + { + "cell_type": "markdown", + "id": "9c840ca4", + "metadata": {}, + "source": [ + "## Run inference\n", + "Now we can execute the BDT inference with `sklearn`, and also the bit-exact simulation using Vivado HLS. The output that the `conifer` BDT produces is equivalent to the output of scikit-learn's `decision_function` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9fd0fee", + "metadata": {}, + "outputs": [], + "source": [ + "y_skl = clf.decision_function(X_test)\n", + "y_cnf = cnf.decision_function(X_test)" + ] + }, + { + "cell_type": "markdown", + "id": "c486535e", + "metadata": {}, + "source": [ + "## Check performance\n", + "\n", + "Print the accuracy from the `sklearn` and `conifer` evaluations, and plot the ROC curves. 
We should see that we can get quite close to the accuracy of the neural networks from parts 1-4." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3a87c1b8", + "metadata": {}, + "outputs": [], + "source": [ + "yt = ohe.transform(y_test).toarray().astype(int)\n", + "print(\"Accuracy sklearn: {}\".format(accuracy_score(np.argmax(yt, axis=1), np.argmax(y_skl, axis=1))))\n", + "print(\"Accuracy conifer: {}\".format(accuracy_score(np.argmax(yt, axis=1), np.argmax(y_cnf, axis=1))))\n", + "fig, ax = plt.subplots(figsize=(9, 9))\n", + "_ = plotting.makeRoc(yt, y_skl, classes)\n", + "plt.gca().set_prop_cycle(None) # reset the colors\n", + "_ = plotting.makeRoc(yt, y_cnf, classes, linestyle='--')" + ] + }, + { + "cell_type": "markdown", + "id": "70c43d82", + "metadata": {}, + "source": [ + "## Synthesize\n", + "Now run the Vivado HLS C Synthesis step to produce an IP that we can use, and inspect the estimated resources and latency.\n", + "You can see some live output while the synthesis is running by opening a terminal from the Jupyter home page and executing:\n", + "`tail -f model_5/conifer_prj/vivado_hls.log`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "721814ef", + "metadata": {}, + "outputs": [], + "source": [ + "cnf.build()" + ] + }, + { + "cell_type": "markdown", + "id": "ad1efe07", + "metadata": {}, + "source": [ + "## Read report\n", + "We can use an `hls4ml` utility to read the Vivado report." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "578a62c3", + "metadata": {}, + "outputs": [], + "source": [ + "import hls4ml\n", + "hls4ml.report.read_vivado_report('model_5/conifer_prj/')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}
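A possible follow-up, not part of the notebook diff above: since `cnf.decision_function` is a bit-accurate emulation of the firmware, it can be useful to quantify how far the fixed-point scores drift from the floating-point `sklearn` ones before synthesizing. The sketch below assumes only the `y_skl` and `y_cnf` arrays produced in the "Run inference" cell; the helper name `compare_scores` is illustrative and not part of `conifer` or the tutorial.

```python
import numpy as np


def compare_scores(y_skl: np.ndarray, y_cnf: np.ndarray) -> None:
    """Compare floating-point sklearn scores with conifer's bit-accurate emulation."""
    # Largest absolute deviation between the two sets of decision scores
    max_diff = np.max(np.abs(y_skl - y_cnf))
    # Fraction of samples whose predicted class (argmax over the scores) agrees
    agreement = np.mean(np.argmax(y_skl, axis=1) == np.argmax(y_cnf, axis=1))
    print("max |y_skl - y_cnf| = {:.4f}".format(max_diff))
    print("argmax agreement    = {:.4f}".format(agreement))


# In the notebook this would be called after the "Run inference" cell:
# compare_scores(y_skl, y_cnf)
```

If the agreement is poor, the fixed-point precision used by the HLS backend can be widened in the configuration dictionary before rebuilding; recent `conifer` versions expose this as a `Precision` entry in the dictionary printed by `plotting.print_dict(cfg)`, but check the printed config for your version before relying on that key.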