Add a BDT/conifer notebook

fastmachinelearning · Apr 16, 2021 · a23a30c · a23a30c
1 parent b806996
commit a23a30c
Show file tree

Hide file tree

Showing 2 changed files with 292 additions and 0 deletions.
diff --git a/images/conifer_v1.png b/images/conifer_v1.png
diff --git a/part5_bdt.ipynb b/part5_bdt.ipynb
@@ -0,0 +1,292 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "209d2b58",
+   "metadata": {},
+   "source": [
+    "## Part 5: Boosted Decision Trees\n",
+    "\n",
+    "The `conifer` package was created out of `hls4ml`, providing a similar set of features but specifically targeting inference of Boosted Decision Trees. In this notebook we will train a `GradientBoostingClassifier` with scikit-learn, using the same jet tagging dataset as in the other tutorial notebooks. Then we will convert the model using `conifer`, and run bit-accurate prediction and synthesis as we did with `hls4ml` before.\n",
+    "\n",
+    "`conifer` is available from GitHub [here](https://github.com/thesps/conifer), and we have a publication describing the inference implementation and performance in detail [here](https://iopscience.iop.org/article/10.1088/1748-0221/15/05/P05026/pdf).\n",
+    "\n",
+    "<img src=\"images/conifer_v1.png\" width=\"250\" alt=\"conifer\">"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eda9b784",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from sklearn.ensemble import GradientBoostingClassifier\n",
+    "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
+    "from sklearn.metrics import accuracy_score\n",
+    "import joblib\n",
+    "import conifer\n",
+    "import plotting\n",
+    "import matplotlib.pyplot as plt\n",
+    "import os\n",
+    "os.environ['PATH'] = '/opt/Xilinx/Vivado/2019.2/bin:' + os.environ['PATH']\n",
+    "np.random.seed(0)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "18354699",
+   "metadata": {},
+   "source": [
+    "## Load the dataset\n",
+    "Note you need to have gone through `part1_getting_started` to download the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1574ed18",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "X_train_val = np.load('X_train_val.npy')\n",
+    "X_test = np.load('X_test.npy')\n",
+    "y_train_val = np.load('y_train_val.npy')\n",
+    "y_test = np.load('y_test.npy', allow_pickle=True)\n",
+    "classes = np.load('classes.npy', allow_pickle=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "24658fb4",
+   "metadata": {},
+   "source": [
+    "We need to transform the test labels from the one-hot encoded values to labels"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "00f304bd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "le = LabelEncoder().fit(classes)\n",
+    "ohe = OneHotEncoder().fit(le.transform(classes).reshape(-1,1))\n",
+    "y_train_val = ohe.inverse_transform(y_train_val.astype(np.int))\n",
+    "y_test = ohe.inverse_transform(y_test)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8305e22c",
+   "metadata": {},
+   "source": [
+    "## Train a `GradientBoostingClassifier`\n",
+    "We will use 20 estimators with a maximum depth of 3. The number of decision trees will be `n_estimators * n_classes`, so 100 for this dataset. If you are returning to this notebook having already trained the BDT once, set `train = False` to load the model rather than retrain."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f5044231",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train = True\n",
+    "if train:\n",
+    "    clf = GradientBoostingClassifier(n_estimators=20, learning_rate=1.0,\n",
+    "                                     max_depth=3, random_state=0, verbose=1).fit(X_train_val, y_train_val.ravel())\n",
+    "    if not os.path.exists('model_5'):\n",
+    "        os.makedirs('model_5')\n",
+    "    joblib.dump(clf, 'model_5/bdt.joblib')\n",
+    "else:\n",
+    "    clf = joblib.load('model_5/bdt.joblib')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5e9857c2",
+   "metadata": {},
+   "source": [
+    "## Create a conifer configuration\n",
+    "\n",
+    "Similarly to `hls4ml`, we can use a utility method to get a template for the configuration dictionary that we can modify."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5bab868f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cfg = conifer.backends.xilinxhls.auto_config()\n",
+    "cfg['OutputDir'] = 'model_5/conifer_prj'\n",
+    "cfg['XilinxPart'] = 'xcu250-figd2104-2L-e'\n",
+    "plotting.print_dict(cfg)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9e3ca740",
+   "metadata": {},
+   "source": [
+    "## Convert the model\n",
+    "The syntax for model conversion with `conifer` is a little different to `hls4ml`. We construct a `conifer.model` object, providing the trained BDT, the converter corresponding to the library we used, the conifer 'backend' that we wish to target, and the configuration.\n",
+    "\n",
+    "`conifer` has converters for:\n",
+    "- `sklearn`\n",
+    "- `xgboost`\n",
+    "- `tmva`\n",
+    "\n",
+    "And backends:\n",
+    "- `vivadohls`\n",
+    "- `vitishls`\n",
+    "- `xilinxhls` (use whichever `vivado` or `vitis` is on the path\n",
+    "- `vhdl`\n",
+    "\n",
+    "Here we will use the `sklearn` converter, since that's how we trained our model, and the `vivadohls` backend. For larger BDTs with many more trees or depth, it may be preferable to generate VHDL directly using the `vhdl` backend to get best performance. See [our paper](https://iopscience.iop.org/article/10.1088/1748-0221/15/05/P05026/pdf) for the performance comparison between those backends."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ebf5b06",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cnf = conifer.model(clf, conifer.converters.sklearn, conifer.backends.vivadohls, cfg)\n",
+    "cnf.compile()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dc5e487b",
+   "metadata": {},
+   "source": [
+    "## profile\n",
+    "Similarly to hls4ml, we can visualize the distribution of the parameters of the BDT to guide the choice of precision"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "993fef56",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cnf.profile()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9c840ca4",
+   "metadata": {},
+   "source": [
+    "## Run inference\n",
+    "Now we can execute the BDT inference with `sklearn`, and also the bit exact simulation using Vivado HLS. The output that the `conifer` BDT produces is equivalent to the `decision_function` method."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b9fd0fee",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "y_skl = clf.decision_function(X_test)\n",
+    "y_cnf = cnf.decision_function(X_test)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c486535e",
+   "metadata": {},
+   "source": [
+    "## Check performance\n",
+    "\n",
+    "Print the accuracy from `sklearn` and `conifer` evaluations, and plot the ROC curves. We should see that we can get quite close to the accuracy of the Neural Networks from parts 1-4."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3a87c1b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "yt = ohe.transform(y_test).toarray().astype(np.int)\n",
+    "print(\"Accuracy sklearn: {}\".format(accuracy_score(np.argmax(yt, axis=1), np.argmax(y_skl, axis=1))))\n",
+    "print(\"Accuracy conifer: {}\".format(accuracy_score(np.argmax(yt, axis=1), np.argmax(y_cnf, axis=1))))\n",
+    "fig, ax = plt.subplots(figsize=(9, 9))\n",
+    "_ = plotting.makeRoc(yt, y_skl, classes)\n",
+    "plt.gca().set_prop_cycle(None) # reset the colors\n",
+    "_ = plotting.makeRoc(yt, y_cnf, classes, linestyle='--')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "70c43d82",
+   "metadata": {},
+   "source": [
+    "## Synthesize\n",
+    "Now run the Vivado HLS C Synthesis step to produce an IP that we can use, and inspect the estimate resources and latency.\n",
+    "You can see some live output while the synthesis is running by opening a terminal from the Jupyter home page and executing:\n",
+    "`tail -f model_5/conifer_prj/vivado_hls.log`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "721814ef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cnf.build()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ad1efe07",
+   "metadata": {},
+   "source": [
+    "## Read report\n",
+    "We can use an hls4ml utility to read the Vivado report"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "578a62c3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import hls4ml\n",
+    "hls4ml.report.read_vivado_report('model_5/conifer_prj/')"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}