From a6c6e94fdc5baf17f4d106bcb61efe4f53b3fd1f Mon Sep 17 00:00:00 2001 From: Chris Woodward Date: Fri, 23 Oct 2020 17:13:31 -0400 Subject: [PATCH 1/2] Updated ext1 for Part 3 of ArangoML Series --- .../Arangopipe_Feature_Example_ext1.ipynb | 680 ++++++++++------ ...angopipe_Feature_Example_ext1_output.ipynb | 738 ++++++++++++++++++ 2 files changed, 1162 insertions(+), 256 deletions(-) create mode 100644 examples/examples_output/Arangopipe_Feature_Example_ext1_output.ipynb diff --git a/examples/Arangopipe_Feature_Example_ext1.ipynb b/examples/Arangopipe_Feature_Example_ext1.ipynb index c6328bf..16e8eb0 100644 --- a/examples/Arangopipe_Feature_Example_ext1.ipynb +++ b/examples/Arangopipe_Feature_Example_ext1.ipynb @@ -1,258 +1,426 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\"Open" - ] + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Arangopipe_Feature_Example_ext1.ipynb", + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Installation Pre-requisites" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install python-arango\n", - "!pip install arangopipe==0.0.6.9.3\n", - "!pip install pandas PyYAML==5.1.1 sklearn2\n", - "!pip install jsonpickle\n", - "!pip install seaborn\n", - "!pip install dtreeviz\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "data_url = \"https://raw.githubusercontent.com/arangoml/arangopipe/arangopipe_examples/examples/data/cal_housing.csv\"\n", - "df = pd.read_csv(data_url, error_bad_lines=False)\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Bias Variance Decompostion of Model Estimates" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Bootstrapping is used to estimate the bias of the regression model developed earlier. The bias tells us if the model suffers from systematically overestimating or underestimating at certain regions of the dataset. To illustrate the procedure, a sample of the dataset is used. It just takes longer to run the procedure on the full dataset. For details of the theory, please see:\n", - "\n", - "\n", - "1. [Section 2.2, Cosma Shalizi](https://www.stat.cmu.edu/~cshalizi/402/lectures/08-bootstrap/lecture-08.pdf)\n", - "2. 
[Tom Dietrich's lecture notes](https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn import linear_model\n", - "df['medianHouseValue'] = df['medianHouseValue'].apply(np.log)\n", - "preds = df.columns.tolist()\n", - "preds.remove('medianHouseValue')\n", - "SAMPLE_SIZE = 1000\n", - "df = df.sample(n = SAMPLE_SIZE)\n", - "df = df.reset_index()\n", - "\n", - "NUM_BOOTSTRAPS = 30\n", - "BOOTSTRAP_SAMPLE_SIZE = df.shape[0] - 1\n", - "bootstrap_Yest = {i : list() for i in range(df.shape[0])}\n", - "for index in range(df.shape[0]):\n", - " for bootrap_iteration in range(NUM_BOOTSTRAPS):\n", - " dfi = df.iloc[index, :]\n", - " dfb = df.sample(n = BOOTSTRAP_SAMPLE_SIZE, replace=True)\n", - " dfb = dfb.append(dfi)\n", - " X = dfb[preds].values\n", - " Y = dfb['medianHouseValue']\n", - "\n", - " clf = linear_model.Lasso(alpha=0.001, max_iter = 10000)\n", - " clf.fit(X, Y)\n", - " est_point = X[index, :].reshape(1, -1)\n", - " est_at_index = clf.predict(est_point)\n", - " bootstrap_Yest[index].append(est_at_index)\n", - " \n", - " if index % 100 == 0:\n", - " print('Completed estimating %4d points in the dataset' % (index))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "Xm = df[preds].values\n", - "Ym = df['medianHouseValue'].values\n", - "clf_0 = linear_model.Lasso(alpha=0.001, max_iter = 10000)\n", - "clf_0.fit(Xm, Ym)\n", - "Yhat_m = clf_0.predict(Xm)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Estimating the Bias at each point in the dataset\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# see section 2.2 from https://www.stat.cmu.edu/~cshalizi/402/lectures/08-bootstrap/lecture-08.pdf\n", - "# see https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf\n", - "Expval_at_i = { i : np.mean(np.array(bootstrap_Yest[i])) for i in range(df.shape[0])}\n", - "bias_at_i = {i : Expval_at_i[i] - Yhat_m[i] for i in range(df.shape[0])}\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Analysis of the Bias" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "%matplotlib inline\n", - "\n", - "import seaborn as sns\n", - "bias_values = [bias for (pt, bias) in bias_at_i.items()]\n", - "sns.kdeplot(bias_values)\n", - "plt.grid(True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Examine a Kernel Density plot of the bias to see the range of values. \n", - "\n", - "Note:\n", - "The response is log transformed, so the bias must be exponeniated to get the real difference from the true value" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.cluster import KMeans\n", - "cluster_labels = KMeans(n_clusters=5, random_state=0).fit_predict(Xm)\n", - "cluster_labels.shape" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To see where the model is making mistakes, we cluster the (sample of the) dataset and compute the average bias for each cluster. This provides insights into regions of the data space we are doing well (bias close to zero) and regions where we are not doing well. 
The table below shows the mean cluster bias and the size of the cluster. We see two large clusters where the bias is close to zero (cluster 0 and cluster 1). We see one outlier with a large error (cluster 3). Clusters 1 and 4 are also seem like outliers and need further analysis. This exercise illustrates how we can examine our model's characteristics. We can now link this model analysis activity to our project using Arangopipe. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df_bias = pd.DataFrame(Xm)\n", - "df_bias['cluster'] = cluster_labels\n", - "df_bias['bias'] = bias_values\n", - "df_bias.groupby('cluster')['bias'].agg([np.mean, np.size])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from arangopipe.arangopipe_storage.arangopipe_api import ArangoPipe\n", - "from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin\n", - "from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig\n", - "from arangopipe.arangopipe_storage.managed_service_conn_parameters import ManagedServiceConnParam\n", - "mdb_config = ArangoPipeConfig()\n", - "msc = ManagedServiceConnParam()\n", - "conn_params = { msc.DB_SERVICE_HOST : \"arangoml.arangodb.cloud\", \\\n", - " msc.DB_SERVICE_END_POINT : \"createDB\",\\\n", - " msc.DB_SERVICE_NAME : \"createDB\",\\\n", - " msc.DB_SERVICE_PORT : 8529,\\\n", - " msc.DB_CONN_PROTOCOL : 'https'}\n", - " \n", - "mdb_config = mdb_config.create_connection_config(conn_params)\n", - "admin = ArangoPipeAdmin(reuse_connection = False, config = mdb_config)\n", - "ap_config = admin.get_config()\n", - "ap = ArangoPipe(config = ap_config)\n", - "proj_info = {\"name\": \"Housing_Price_Estimation_Project\"}\n", - "proj_reg = admin.register_project(proj_info)\n", - "ds_info = {\"name\" : \"california-housing-dataset\",\\\n", - " \"description\": \"This dataset lists median house prices in Califoria. 
Various house features are provided\",\\\n", - " \"source\": \"UCI ML Repository\" }\n", - "ds_reg = ap.register_dataset(ds_info)\n", - "import numpy as np\n", - "df[\"medianHouseValue\"] = df[\"medianHouseValue\"].apply(lambda x: np.log(x))\n", - "featureset = df.dtypes.to_dict()\n", - "featureset = {k:str(featureset[k]) for k in featureset}\n", - "featureset[\"name\"] = \"log_transformed_median_house_value\"\n", - "fs_reg = ap.register_featureset(featureset, ds_reg[\"_key\"]) #\n", - "model_info = {\"name\": \"Bias Variance Analysis of LASSO model\", \"task\": \"Model Validation\"}\n", - "model_reg = ap.register_model(model_info, project = \"Housing_Price_Estimation_Project\")\n", - "import uuid\n", - "import datetime\n", - "import jsonpickle\n", - "\n", - "ruuid = str(uuid.uuid4().int)\n", - "model_perf = {'model_bias': bias_at_i, 'run_id': ruuid, \"timestamp\": str(datetime.datetime.now())}\n", - "\n", - "mp = clf.get_params()\n", - "mp = jsonpickle.encode(mp)\n", - "model_params = {'run_id': ruuid, 'model_params': mp}\n", - "\n", - "run_info = {\"dataset\" : ds_reg[\"_key\"],\\\n", - " \"featureset\": fs_reg[\"_key\"],\\\n", - " \"run_id\": ruuid,\\\n", - " \"model\": model_reg[\"_key\"],\\\n", - " \"model-params\": model_params,\\\n", - " \"model-perf\": model_perf,\\\n", - " \"tag\": \"Housing-Price-Hyperopt-Experiment\",\\\n", - " \"project\": \"Housing Price Estimation Project\"}\n", - "ap.log_run(run_info)" - ] - } - ], - "metadata": { - "language_info": { - "name": "python", - "pygments_lexer": "ipython3" - } - }, - "nbformat": 4, - "nbformat_minor": 1 -} + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "xVhnMXgv31aD" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LzLs75qO5B9g" + }, + "source": [ + "## **Intro**\n", + "In part 3 of the [Introduction to ArangoML series](https://www.arangodb.com/tag/arangoml/), we will take a look at what bootstrapping is and how it relates to determining bias-variance tradeoff and composition. We will dive into the concepts of these topics, leaving the heavily mathematical discussion to those who already offer great explanations for these topics(linked at the bottom). \n", + "\n", + "This post will:\n", + "\n", + "* Provide insight into the goals of these concepts\n", + "* Demonstrate a basic example\n", + "* Showcase the ease of use with ArangoML\n", + "\n", + "## **Bootstrapping**\n", + "\n", + "In the [previous post](https://www.arangodb.com/2020/10/arangoml-part-2-basic-arangopipe-workflow/), we explored the basic arangopipe workflow and discussed the concept of model building. One concept briefly mentioned was grabbing a sample of the large dataset to use during our model building exercise. This data sampling is referred to as bootstrap sampling and is one approach for tackling big data. To explain, we will continue using the housing prices dataset from the previous post.\n", + "\n", + "Bootstrapping is a statistical technique. There are situations where we have limited data but are interested in estimating the behavior of a statistic. For example, we have a single dataset, and we are interested in estimating how a model we have developed would perform on datasets we encounter in the future. Bootstrapping can give us useful estimates in these situations. \n", + "\n", + "Imagine that our dataset either currently or will in the future contain millions of documents for houses. 
Having this many documents would mean that every time we wanted to run a test when building the model, it would take a long time to get a result. Not only would there be a considerable time requirement but not having an accurate test dataset to test the model against would result in a potentially less precise model when used against future data. Bootstrap sampling combines the need for precision with the ability to timely develop a model. \n", + "\n", + "The intuition with bootstrapping is that each sample of the dataset is similar. The validity of our estimates depends on how reasonable it is to assume that the samples are similar. In situations where there is limited variability in the data, this assumption can be reasonable. \n", + "\n", + "With bootstrapping, we generate reasonable proxies for these other samples by sampling the dataset that we have with replacement. Replacement is allowing for all of the same documents to be used in different sample sets. Rather than removing the document from the pool of data, often referred to as a population, when used in a sample, it is returned to the pool. Returning the document to the dataset means that everytime we make a new sample set we have the exact same probability for choosing each document, this helps ensure that each sample set is equally reflective of the entire population of data. \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ijA6mwmB31aE" + }, + "source": [ + "# Installation Pre-requisites" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6O-nWbwu48kP" + }, + "source": [ + "" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PaRgJTEn31aE" + }, + "source": [ + "%%capture\n", + "!pip install python-arango\n", + "!pip install arangopipe==0.0.6.9.3\n", + "!pip install pandas PyYAML==5.1.1 sklearn2\n", + "!pip install jsonpickle\n", + "!pip install seaborn\n", + "!pip install dtreeviz\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NrhnTSjODX7m" + }, + "source": [ + "In case you have not done the previous examples, here is a quick look at the dataset we are working with.\n", + "\n", + "This dataset is available from the arangopipe repo and was originally made avaialble from the UCI ML Repository. The dataset contains data for housing in California, including:\n", + "\n", + "* The house configuration & location\n", + "* The median house values and ages\n", + "* The general population & number of households\n", + "* The median income for the area" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N0ehJ9Au31aH" + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "data_url = \"https://raw.githubusercontent.com/arangoml/arangopipe/arangopipe_examples/examples/data/cal_housing.csv\"\n", + "df = pd.read_csv(data_url, error_bad_lines=False)\n", + "df.head()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MqVwcaYd31aJ" + }, + "source": [ + "# Bias Variance Decompostion of Model Estimates" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z81t2ijH31aJ" + }, + "source": [ + "Once we have our sample datasets, we evaluate the model on these datasets and note its performance. The observed variation in performance is a proxy for the variation we are likely to observe with datasets in the future. In this example we apply the bootstrapping idea to estimate two important statistical qualities of the model we have developed. 
These are the bias and the variance of the model.\n", + "\n", + "The bias of the model captures the errors resulting from incorrect assumptions about the model. In this example, we have used a linear regression model to estimate house prices. \n", + "Some questions you may ask yourself: \n", + "* Is linear regression too simple a model for this problem? \n", + "* Do we need a more complex model? \n", + "* Perhaps, for example, a polynomial regression model could be better? \n", + "\n", + "Examining the bias associated with the model can answer this question. Another source of the error associated with the model is its sensitivity to the training set. The error associated with sensitivity to the training set is the variance. Choosing a model with the right level of complexity is a critical modeling decision and involves balancing the bias and variance associated with the model. This is called the bias-variance tradeoff. \n", + "\n", + "**Note**: The intent of this explanation is to motivate the problem and the need for bootstrapping. For this reason, we have refrained from a rigorous mathematical definition of the bias and variance terms.\n", + "\n", + "## Estimating the Bias at each point in the dataset\n", + "\n", + "Evaluating the bias requires us to calculate the expected value (average) of the difference between the model estimate and the true value of the estimated quantity at each point of the dataset. In this example, this implies we need to evaluate the expected value of the difference between the model estimate and the true value of the house at each point. In calculating the expected value, we need to average over all datasets. This poses a problem because we only have a single dataset. How do we determine an average value of the deviation at this point? Bootstrapping is one way to solve this problem. For each point (house) in our dataset, we construct bootstrapped datasets that include the house. \n", + "\n", + "This method has a straightforward implementation; For each house in the dataset, construct a bootstrapped dataset that has that house, along with a random selection of other houses, from the dataset. Repeat this process to generate sufficient bootstrap datasets(NUM_BOOTSTRAPS in the code segment below)." 
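,
 "\n",
 "For readers who want the single formula behind the code (a sketch only, since we are deliberately avoiding heavy math here): writing $\\hat{f}_b$ for the model fit on bootstrap replicate $b$ and $\\hat{f}$ for the model fit on the full sample, the bootstrap bias estimate at a point $x_i$ is $\\frac{1}{B}\\sum_{b=1}^{B}\\hat{f}_b(x_i) - \\hat{f}(x_i)$, with $B$ equal to NUM_BOOTSTRAPS. This is exactly the quantity the code accumulates in `bootstrap_Yest` and then averages into `Expval_at_i`."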
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "51znflaE31aJ" + }, + "source": [ + "from sklearn import linear_model\n", + "df['medianHouseValue'] = df['medianHouseValue'].apply(np.log)\n", + "preds = df.columns.tolist()\n", + "preds.remove('medianHouseValue')\n", + "SAMPLE_SIZE = 1000\n", + "df = df.sample(n = SAMPLE_SIZE)\n", + "df = df.reset_index()\n", + "\n", + "NUM_BOOTSTRAPS = 30\n", + "BOOTSTRAP_SAMPLE_SIZE = df.shape[0] - 1\n", + "bootstrap_Yest = {i : list() for i in range(df.shape[0])}\n", + "for index in range(df.shape[0]):\n", + " for bootrap_iteration in range(NUM_BOOTSTRAPS):\n", + " dfi = df.iloc[index, :]\n", + " dfb = df.sample(n = BOOTSTRAP_SAMPLE_SIZE, replace=True)\n", + " dfb = dfb.append(dfi)\n", + " X = dfb[preds].values\n", + " Y = dfb['medianHouseValue']\n", + "\n", + " clf = linear_model.Lasso(alpha=0.001, max_iter = 10000)\n", + " clf.fit(X, Y)\n", + " est_point = X[index, :].reshape(1, -1)\n", + " est_at_index = clf.predict(est_point)\n", + " bootstrap_Yest[index].append(est_at_index)\n", + " \n", + " if index % 100 == 0:\n", + " print('Completed estimating %4d points in the dataset' % (index))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "R9L-x4na31aM" + }, + "source": [ + "Xm = df[preds].values\n", + "Ym = df['medianHouseValue'].values\n", + "clf_0 = linear_model.Lasso(alpha=0.001, max_iter = 10000)\n", + "clf_0.fit(Xm, Ym)\n", + "Yhat_m = clf_0.predict(Xm)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E0N3TINm31aO" + }, + "source": [ + "# see section 2.2 from https://www.stat.cmu.edu/~cshalizi/402/lectures/08-bootstrap/lecture-08.pdf\n", + "# see https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf\n", + "Expval_at_i = { i : np.mean(np.array(bootstrap_Yest[i])) for i in range(df.shape[0])}\n", + "bias_at_i = {i : Expval_at_i[i] - Yhat_m[i] for i in range(df.shape[0])}\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8kMYAsUj31aQ" + }, + "source": [ + "## Analysis of the Bias\n", + "\n", + "We now have enough datasets that include the house at which we are interested in estimating the average deviation between the model and the truth. \n", + "To calculate the average deviation:\n", + "* We now develop the model on each of the bootstrapped datasets \n", + "* Then evaluate the difference between the truth and the model estimate\n", + "* We repeat this process for each bootstrap dataset\n", + "* Then average those quantities \n", + "\n", + "The average we end up with gives us the bootstrapped estimate of the bias at that point. It should be evident that the above procedure is computationally intensive. We generate bootstrap datasets that include each point and then we develop models on each of these datasets. For purposes of illustration, in this post, we will estimate the bias for a sample of the original dataset. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PgKs14p131aR" + }, + "source": [ + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "\n", + "import seaborn as sns\n", + "bias_values = [bias for (pt, bias) in bias_at_i.items()]\n", + "sns.kdeplot(bias_values)\n", + "plt.grid(True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AFg9anP931aU" + }, + "source": [ + "Examine a Kernel Density plot of the bias to see the range of values. 
\n", + "\n", + "Note:\n", + "The response is log transformed, so the bias must be exponeniated to get the real difference from the true value" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mYvySUyk31aU" + }, + "source": [ + "from sklearn.cluster import KMeans\n", + "cluster_labels = KMeans(n_clusters=5, random_state=0).fit_predict(Xm)\n", + "cluster_labels.shape" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k_CH2IPZ31aW" + }, + "source": [ + "After we have estimated the bias at each point, we can examine the bias values to see if there are systematic patterns in the bias. One simple evaluation technique to do this is to cluster the bias values. If there are no systematic patterns then we should not see explicit clustering tendencies when we cluster the bias estimates. Dense clusters would indicate areas where our model is misspecified - either too much of complexity or not enough complexity.\n", + "\n", + "\n", + "To see where the model is making mistakes, we cluster the (sample of the) dataset and compute the average bias for each cluster. This provides insights into regions of the data space we are doing well (bias close to zero) and regions where we are not doing well. \n", + "\n", + "The table below shows the mean cluster bias and the size of the cluster. We see two large clusters where the bias is close to zero (cluster 0 and cluster 1). We see one outlier with a large error (cluster 3). Clusters 1 and 4 are also seem like outliers and need further analysis. \n", + "\n", + "This exercise illustrates how we can examine our model's characteristics. We can now link this model analysis activity to our project using Arangopipe. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tCeWdvPk31aW" + }, + "source": [ + "df_bias = pd.DataFrame(Xm)\n", + "df_bias['cluster'] = cluster_labels\n", + "df_bias['bias'] = bias_values\n", + "df_bias.groupby('cluster')['bias'].agg([np.mean, np.size])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5CDZoueKGAZh" + }, + "source": [ + "## Storing in Arangopipe\n", + "\n", + "Calculation of the variance can be performed analogously. Bias and variance are model characteristics that are of interest to data scientists because they convey information about its strengths, limitations and performance. In this example we store the model bias for a linear regression model in arangopipe. Such an exercise may be performed by the data science team member to get a baseline profile for the modeling task. A coworker developing a more complex model, can see how his model performs in relation to the baseline model by retrieving these results from arangopipe. \n", + "\n", + "We start with setting up the connection to the ArangoML cloud database, hosted on ArangoDB Oasis." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "x17I1e8631aY" + }, + "source": [ + "from arangopipe.arangopipe_storage.arangopipe_api import ArangoPipe\n", + "from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin\n", + "from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig\n", + "from arangopipe.arangopipe_storage.managed_service_conn_parameters import ManagedServiceConnParam\n", + "mdb_config = ArangoPipeConfig()\n", + "msc = ManagedServiceConnParam()\n", + "conn_params = { msc.DB_SERVICE_HOST : \"arangoml.arangodb.cloud\", \\\n", + " msc.DB_SERVICE_END_POINT : \"createDB\",\\\n", + " msc.DB_SERVICE_NAME : \"createDB\",\\\n", + " msc.DB_SERVICE_PORT : 8529,\\\n", + " msc.DB_CONN_PROTOCOL : 'https'}\n", + " \n", + "mdb_config = mdb_config.create_connection_config(conn_params)\n", + "admin = ArangoPipeAdmin(reuse_connection = False, config = mdb_config)\n", + "ap_config = admin.get_config()\n", + "ap = ArangoPipe(config = ap_config)\n", + "print(\" \")\n", + "print(\"Your temporary database can be accessed using the following credentials:\")\n", + "mdb_config.get_cfg()\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X5Rz0YE4IbwL" + }, + "source": [ + "### Try it out!\n", + "Once the previous block has successfully executed you can navigate to https://arangoml.arangodb.cloud:8529 and sign in with the generated credentials to explore the temporary database.\n", + "\n", + "## Log the Project with Arangopipe\n", + "\n", + "Now that we have run our experiment it is time to save the metadata with arangopipe!\n", + "\n", + "As discussed in the 'Basic Arangopipe Workflow' notebook and post, arangopipe can be nesstled into or around your pre-existing machine learning pipelines. So, we are able to capture all of the important information we used in this experiment by simply dropping in the below code. \n", + "\n", + "This will create a project and store everything about this experiment including the various parameters used throughout it and the performance of the run as well." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q9j7RL8TGVj_" + }, + "source": [ + "\n", + "proj_info = {\"name\": \"Housing_Price_Estimation_Project\"}\n", + "proj_reg = admin.register_project(proj_info)\n", + "ds_info = {\"name\" : \"california-housing-dataset\",\\\n", + " \"description\": \"This dataset lists median house prices in California. 
Various house features are provided\",\\\n", + " \"source\": \"UCI ML Repository\" }\n", + "ds_reg = ap.register_dataset(ds_info)\n", + "import numpy as np\n", + "df[\"medianHouseValue\"] = df[\"medianHouseValue\"].apply(lambda x: np.log(x))\n", + "featureset = df.dtypes.to_dict()\n", + "featureset = {k:str(featureset[k]) for k in featureset}\n", + "featureset[\"name\"] = \"log_transformed_median_house_value\"\n", + "fs_reg = ap.register_featureset(featureset, ds_reg[\"_key\"]) \n", + "model_info = {\"name\": \"Bias Variance Analysis of LASSO model\", \"task\": \"Model Validation\"}\n", + "model_reg = ap.register_model(model_info, project = \"Housing_Price_Estimation_Project\")\n", + "import uuid\n", + "import datetime\n", + "import jsonpickle\n", + "\n", + "ruuid = str(uuid.uuid4().int)\n", + "model_perf = {'model_bias': bias_at_i, 'run_id': ruuid, \"timestamp\": str(datetime.datetime.now())}\n", + "\n", + "mp = clf.get_params()\n", + "mp = jsonpickle.encode(mp)\n", + "model_params = {'run_id': ruuid, 'model_params': mp}\n", + "\n", + "run_info = {\"dataset\" : ds_reg[\"_key\"],\\\n", + " \"featureset\": fs_reg[\"_key\"],\\\n", + " \"run_id\": ruuid,\\\n", + " \"model\": model_reg[\"_key\"],\\\n", + " \"model-params\": model_params,\\\n", + " \"model-perf\": model_perf,\\\n", + " \"tag\": \"Housing-Price-Hyperopt-Experiment\",\\\n", + " \"project\": \"Housing Price Estimation Project\"}\n", + "ap.log_run(run_info)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TL_HzB67LEVo" + }, + "source": [ + "The [Introduction to ArangoML series](https://www.arangodb.com/tag/arangoml/) will continue, so be sure to sign up for our newsletter to be notified of the next release!\n", + "\n", + "You can also join us on the [ArangoML Slack channel](https://arangodb-community.slack.com/archives/CN9LVJ24S) if you have any questions or comments." + ] + } + ] +} \ No newline at end of file diff --git a/examples/examples_output/Arangopipe_Feature_Example_ext1_output.ipynb b/examples/examples_output/Arangopipe_Feature_Example_ext1_output.ipynb new file mode 100644 index 0000000..6a57e97 --- /dev/null +++ b/examples/examples_output/Arangopipe_Feature_Example_ext1_output.ipynb @@ -0,0 +1,738 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Arangopipe_Feature_Example_ext1.ipynb", + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "xVhnMXgv31aD" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LzLs75qO5B9g" + }, + "source": [ + "## **Intro**\n", + "In part 3 of the [Introduction to ArangoML series](https://www.arangodb.com/tag/arangoml/), we will take a look at what bootstrapping is and how it relates to determining bias-variance tradeoff and composition. We will dive into the concepts of these topics, leaving the heavily mathematical discussion to those who already offer great explanations for these topics(linked at the bottom). \n", + "\n", + "This post will:\n", + "\n", + "* Provide insight into the goals of these concepts\n", + "* Demonstrate a basic example\n", + "* Showcase the ease of use with ArangoML\n", + "\n", + "## **Bootstrapping**\n", + "\n", + "In the [previous post](https://www.arangodb.com/2020/10/arangoml-part-2-basic-arangopipe-workflow/), we explored the basic arangopipe workflow and discussed the concept of model building. 
One concept briefly mentioned was grabbing a sample of the large dataset to use during our model building exercise. This data sampling is referred to as bootstrap sampling and is one approach for tackling big data. To explain, we will continue using the housing prices dataset from the previous post.\n", + "\n", + "Bootstrapping is a statistical technique. There are situations where we have limited data but are interested in estimating the behavior of a statistic. For example, we have a single dataset, and we are interested in estimating how a model we have developed would perform on datasets we encounter in the future. Bootstrapping can give us useful estimates in these situations. \n", + "\n", + "Imagine that our dataset either currently or will in the future contain millions of documents for houses. Having this many documents would mean that every time we wanted to run a test when building the model, it would take a long time to get a result. Not only would there be a considerable time requirement but not having an accurate test dataset to test the model against would result in a potentially less precise model when used against future data. Bootstrap sampling combines the need for precision with the ability to timely develop a model. \n", + "\n", + "The intuition with bootstrapping is that each sample of the dataset is similar. The validity of our estimates depends on how reasonable it is to assume that the samples are similar. In situations where there is limited variability in the data, this assumption can be reasonable. \n", + "\n", + "With bootstrapping, we generate reasonable proxies for these other samples by sampling the dataset that we have with replacement. Replacement is allowing for all of the same documents to be used in different sample sets. Rather than removing the document from the pool of data, often referred to as a population, when used in a sample, it is returned to the pool. Returning the document to the dataset means that everytime we make a new sample set we have the exact same probability for choosing each document, this helps ensure that each sample set is equally reflective of the entire population of data. \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ijA6mwmB31aE" + }, + "source": [ + "# Installation Pre-requisites" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6O-nWbwu48kP" + }, + "source": [ + "" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PaRgJTEn31aE" + }, + "source": [ + "%%capture\n", + "!pip install python-arango\n", + "!pip install arangopipe==0.0.6.9.3\n", + "!pip install pandas PyYAML==5.1.1 sklearn2\n", + "!pip install jsonpickle\n", + "!pip install seaborn\n", + "!pip install dtreeviz\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NrhnTSjODX7m" + }, + "source": [ + "In case you have not done the previous examples, here is a quick look at the dataset we are working with.\n", + "\n", + "This dataset is available from the arangopipe repo and was originally made avaialble from the UCI ML Repository. 
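It is a classic regression benchmark compiled from 1990 U.S. census data. 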
The dataset contains data for housing in California, including:\n", + "\n", + "* The house configuration & location\n", + "* The median house values and ages\n", + "* The general population & number of households\n", + "* The median income for the area" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "N0ehJ9Au31aH", + "outputId": "51bcca02-edf1-42e3-c7e4-2ccbe1caf03e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "data_url = \"https://raw.githubusercontent.com/arangoml/arangopipe/arangopipe_examples/examples/data/cal_housing.csv\"\n", + "df = pd.read_csv(data_url, error_bad_lines=False)\n", + "df.head()" + ], + "execution_count": 12, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
latlonghousingMedAgetotalRoomstotalBedroomspopulationhouseholdsmedianIncomemedianHouseValue
0-122.2237.862170991106240111388.3014358500.0
1-122.2437.855214671904961777.2574352100.0
2-122.2537.855212742355582195.6431341300.0
3-122.2537.855216272805652593.8462342200.0
4-122.2537.85529192134131934.0368269700.0
\n", + "
" + ], + "text/plain": [ + " lat long housingMedAge ... households medianIncome medianHouseValue\n", + "0 -122.22 37.86 21 ... 1138 8.3014 358500.0\n", + "1 -122.24 37.85 52 ... 177 7.2574 352100.0\n", + "2 -122.25 37.85 52 ... 219 5.6431 341300.0\n", + "3 -122.25 37.85 52 ... 259 3.8462 342200.0\n", + "4 -122.25 37.85 52 ... 193 4.0368 269700.0\n", + "\n", + "[5 rows x 9 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MqVwcaYd31aJ" + }, + "source": [ + "# Bias Variance Decompostion of Model Estimates" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z81t2ijH31aJ" + }, + "source": [ + "Once we have our sample datasets, we evaluate the model on these datasets and note its performance. The observed variation in performance is a proxy for the variation we are likely to observe with datasets in the future. In this example we apply the bootstrapping idea to estimate two important statistical qualities of the model we have developed. These are the bias and the variance of the model.\n", + "\n", + "The bias of the model captures the errors resulting from incorrect assumptions about the model. In this example, we have used a linear regression model to estimate house prices. \n", + "Some questions you may ask yourself: \n", + "* Is linear regression too simple a model for this problem? \n", + "* Do we need a more complex model? \n", + "* Perhaps, for example, a polynomial regression model could be better? \n", + "\n", + "Examining the bias associated with the model can answer this question. Another source of the error associated with the model is its sensitivity to the training set. The error associated with sensitivity to the training set is the variance. Choosing a model with the right level of complexity is a critical modeling decision and involves balancing the bias and variance associated with the model. This is called the bias-variance tradeoff. \n", + "\n", + "**Note**: The intent of this explanation is to motivate the problem and the need for bootstrapping. For this reason, we have refrained from a rigorous mathematical definition of the bias and variance terms.\n", + "\n", + "## Estimating the Bias at each point in the dataset\n", + "\n", + "Evaluating the bias requires us to calculate the expected value (average) of the difference between the model estimate and the true value of the estimated quantity at each point of the dataset. In this example, this implies we need to evaluate the expected value of the difference between the model estimate and the true value of the house at each point. In calculating the expected value, we need to average over all datasets. This poses a problem because we only have a single dataset. How do we determine an average value of the deviation at this point? Bootstrapping is one way to solve this problem. For each point (house) in our dataset, we construct bootstrapped datasets that include the house. \n", + "\n", + "This method has a straightforward implementation; For each house in the dataset, construct a bootstrapped dataset that has that house, along with a random selection of other houses, from the dataset. Repeat this process to generate sufficient bootstrap datasets(NUM_BOOTSTRAPS in the code segment below)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "51znflaE31aJ", + "outputId": "bc3aa430-cc3e-4951-f75d-403ded47577a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 187 + } + }, + "source": [ + "from sklearn import linear_model\n", + "df['medianHouseValue'] = df['medianHouseValue'].apply(np.log)\n", + "preds = df.columns.tolist()\n", + "preds.remove('medianHouseValue')\n", + "SAMPLE_SIZE = 1000\n", + "df = df.sample(n = SAMPLE_SIZE)\n", + "df = df.reset_index()\n", + "\n", + "NUM_BOOTSTRAPS = 30\n", + "BOOTSTRAP_SAMPLE_SIZE = df.shape[0] - 1\n", + "bootstrap_Yest = {i : list() for i in range(df.shape[0])}\n", + "for index in range(df.shape[0]):\n", + " for bootrap_iteration in range(NUM_BOOTSTRAPS):\n", + " dfi = df.iloc[index, :]\n", + " dfb = df.sample(n = BOOTSTRAP_SAMPLE_SIZE, replace=True)\n", + " dfb = dfb.append(dfi)\n", + " X = dfb[preds].values\n", + " Y = dfb['medianHouseValue']\n", + "\n", + " clf = linear_model.Lasso(alpha=0.001, max_iter = 10000)\n", + " clf.fit(X, Y)\n", + " est_point = X[index, :].reshape(1, -1)\n", + " est_at_index = clf.predict(est_point)\n", + " bootstrap_Yest[index].append(est_at_index)\n", + " \n", + " if index % 100 == 0:\n", + " print('Completed estimating %4d points in the dataset' % (index))" + ], + "execution_count": 13, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Completed estimating 0 points in the dataset\n", + "Completed estimating 100 points in the dataset\n", + "Completed estimating 200 points in the dataset\n", + "Completed estimating 300 points in the dataset\n", + "Completed estimating 400 points in the dataset\n", + "Completed estimating 500 points in the dataset\n", + "Completed estimating 600 points in the dataset\n", + "Completed estimating 700 points in the dataset\n", + "Completed estimating 800 points in the dataset\n", + "Completed estimating 900 points in the dataset\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "R9L-x4na31aM" + }, + "source": [ + "Xm = df[preds].values\n", + "Ym = df['medianHouseValue'].values\n", + "clf_0 = linear_model.Lasso(alpha=0.001, max_iter = 10000)\n", + "clf_0.fit(Xm, Ym)\n", + "Yhat_m = clf_0.predict(Xm)" + ], + "execution_count": 14, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E0N3TINm31aO" + }, + "source": [ + "# see section 2.2 from https://www.stat.cmu.edu/~cshalizi/402/lectures/08-bootstrap/lecture-08.pdf\n", + "# see https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf\n", + "Expval_at_i = { i : np.mean(np.array(bootstrap_Yest[i])) for i in range(df.shape[0])}\n", + "bias_at_i = {i : Expval_at_i[i] - Yhat_m[i] for i in range(df.shape[0])}\n" + ], + "execution_count": 15, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8kMYAsUj31aQ" + }, + "source": [ + "## Analysis of the Bias\n", + "\n", + "We now have enough datasets that include the house at which we are interested in estimating the average deviation between the model and the truth. \n", + "To calculate the average deviation:\n", + "* We now develop the model on each of the bootstrapped datasets \n", + "* Then evaluate the difference between the truth and the model estimate\n", + "* We repeat this process for each bootstrap dataset\n", + "* Then average those quantities \n", + "\n", + "The average we end up with gives us the bootstrapped estimate of the bias at that point. It should be evident that the above procedure is computationally intensive. 
We generate bootstrap datasets that include each point and then we develop models on each of these datasets. For purposes of illustration, in this post, we will estimate the bias for a sample of the original dataset. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PgKs14p131aR", + "outputId": "afb535d7-6ce5-4f55-c576-7756a49791d3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 265 + } + }, + "source": [ + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "\n", + "import seaborn as sns\n", + "bias_values = [bias for (pt, bias) in bias_at_i.items()]\n", + "sns.kdeplot(bias_values)\n", + "plt.grid(True)" + ], + "execution_count": 16, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAD4CAYAAADhNOGaAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3dd3hU953v8fd3Rr2ihgSSQEJCdIOwKAbb4BJjYsdgOxub2N4UJ05zNk5ys+ske53ESTabOJvuvTfeJDfrkrhksY1tbFzlRjGILkAghAQqgCrqZTS/+4ckR8YqIzFHZ8r39Tx60MwczXw4z4w+Oud3zu+IMQallFLBy2F3AKWUUvbSIlBKqSCnRaCUUkFOi0AppYKcFoFSSgW5ELsDjFVycrLJysqyO8aHtLW1ER0dbXcMn6TrZmS6foan62ZkY1k/RUVFdcaYlKEe87siyMrKYteuXXbH+JDCwkJWr15tdwyfpOtmZLp+hqfrZmRjWT8iUjHcY7prSCmlgpwWgVJKBTktAqWUCnJaBEopFeS0CJRSKshpESilVJDTIlBKqSCnRaDUBOl165Tvyjf53QllSvkTt9vwh3fK+K+3T1DX2sWs1Fj+6aqZrJ2fhojYHU8pQLcIlLKMq9fN5x/exb9tPsKcKXF8aVUOPb1uvvzYbn62pQS9KJTyFbpFoJRFfvt6Ka8dOcu/XjeHOy/NRkT45jWzuO/Zg/yfwuOEOoRvXDPL7phKaREoZYWd5Q389vVj3Lw4g89dNuP9+50O4Ufr59PT6+Y3r5eSPy0B3UGk7Ka7hpTyMmMMP37hMGlxEdy/bt6HHhcR7l83n9lpsXzjyb00dbltSKnU32kRKOVlb5ScZe+pJr561Uyiw4fe6I4IdfK7Ty6mrauXJ450T3BCpT5Ii0ApLzLG8MtXjpGZGMnHL84YcdncyTF8cdUMttX0su14/QQlVOrDtAiU8qKiikYOVJ3jS6tyCXWO/vH60upckiOF+549SE+v7iJS9tAiUMqLHttxktjwENYtmurR8pFhTm6bE8axs6386Z0TFqdTamhaBEp5SWNbNy8cqGF9fvqwYwNDyZ8cwtVzJvPr145Rc67DwoRKDU2LQCkv2binim6Xm08umzbmn/3ex+bR6zb8ZPMRC5IpNTItAqW85Nm9VcxPj2POlLgx/2xmYhRfWJXDpn3VvHeiwYJ0Sg1Pi0ApLyiva2N/5TluWOjZ2MBQvrQqh6nxEXxvU7FOUKcmlBaBUl7w/P5qAK67aPxFEBnm5DvXzeFwTTOP7zzprWhKjcrSIhCRa0WkRERKReTeIR6fJiJviMgeEdkvIh+1Mo9SVnluXw1LshJInxR5Qc9z3YIpLMtO5OdbSmhq1xPN1MSwrAhExAk8CKwF5gIbRGTueYv9K/CkMSYfuBX4T6vyKGWV47WtlJxp4boFUy74uUSE798wj3MdPfzylaNeSKfU6KzcIlgKlBpjyowx3cDjwLrzljHAwMhaPFBtYR6lLPHa4TMAXD031SvPN2dKHLcvn84j2ys4crrZK8+p1EjEqjnRReTjwLXGmM/1374DWGaMuXvQMlOAl4EEIBq42hhTNMRz3QXcBZCamnrx448/bknmC9Ha2kpMTIzdMXxSoK+bn+zooK3H8KNLo8b180Otn9Zuw7+83U5mrIN/WRIRtBexCfT3zoUay/q54ooriowxBUM9Zvc01BuAPxtj/kNELgEeEZH5xpgPnGtvjHkIeAigoKDArF69euKTjqKwsBBfzOULAnndNLV3U/ryq3xpVQ6rV4/v2gLDrZ9z8RV89+mDNE2ayY35I89bFKgC+b3jDd5aP1buGqoCMgfdzui/b7A7gScBjDHbgAgg2cJMSnlVYUktvW7DVXMme/25NyyZRv60Sfzo+cOca+/x+vMrNcDKItgJzBSRbBEJo28weNN5y5wErgIQkTn0FUGthZmU8qpXD58hOSachRmTvP7cDofw4/ULaOro4adb9IxjZR3LisAY4wLuBrYAh+k7OqhYRO4XkRv6F/sm8HkR2Qf8Ffi00Qu5Kj/R0+vmzaO1XDk7BYfDmn34c6fG8ZkVWfxlx0mKKhoteQ2lLB0jMMZsBjafd999g74/BKy0MoNSVtl5ooGWThdXzfHO0ULDuecjebxwoIbvPn2A5796KSEeTG+t1FjoO0qpcXr18FnCQhxcNtPaYa2Y8BC+97F5HDndwp+3llv6Wio4aREoNQ7GGF47coYVOUlEhVl/8N2aealcNXsyv3jlKNVNOlW18i4tAqXGoayujYr6dst3Cw0YOOPYbQw/eK54Ql5TBQ8tAqXG4e2jfQe3rc5LmbDXzEyM4mtX5bGl+AyvHjozYa+rAp8WgVLj8PaxOrKSoshMHN/ZxOP1ucuyyUuN4Xubimnvdk3oa6vApUWg1Bh1u9xsK6vnspkTtzUwINTp4EfrF1DV1MFvXiud8NdXgUmLQKkxKqpopL27l8sncLfQYEuzE/lEQQZ/eLuMktMttmRQgUWLQKkxevtYLSEOYfmMRNsy3Lt2DjERIfzkxcO2ZVCBQ4tAqTF661gti6clEBsRaluGxOgwvrgqh8KSWr3GsbpgWgRKjUF9axcHq5otP4nME5+6JIuU2HB+vqUEnZlFXQgtAqXG4J3SOgDbxgcGiwxz8k9X5vJeeQNvHauzO47yY1oESo3B28fqmBQVyvz0eLujAHDLkmlkJETywJYjulWgxk2LQCkPGWN4+1gtK3OTcVo02+hYhYU4uOfqPA5WNf
PSwdN2x1F+SotAKQ8dO9vKmeYuLveB8YHBbsxPZ0ZyNL97o1S3CtS4aBEo5aGt/eMDK3N9qwicDuELq2ZQXN3Mu6X1dsdRfkiLQCkPbT1ez7TEKDISJnZaCU+sz08nJTac37913O4oyg9pESjlgV63YXtZPStykuyOMqTwECefXZnN28fqOFTdbHcc5We0CJTywKHqZpo7XVzio0UAsGFpJhGhDh7eVm53FOVntAiU8sDW433jA5fM8N0imBQVxo356Tyzt4qm9m674yg/okWglAe2Hq8nd3IMk+Mi7I4yok+tyKKzx80TO0/ZHUX5ES0CpUbR7XKzs7zBZ8cHBpudFsfSrET++t5JPZRUeUyLQKlR7K9sor271y+KAODWpZmU17ezvUwno1Oe0SJQahRbj9cjAsuy/aMI1s6fQmxECE/sPGl3FOUntAiUGsXW43XMnRJHQnSY3VE8EhnmZP2idDYfPK2DxsojWgRKjaCzp5fdJ5v8ZrfQgE8UZNLtcvPCgRq7oyg/oEWg1Ah2VzTS7XKzIse3ppUYzfz0OHInx/DMniq7oyg/oEWg1Ai2Hq/H6RCWZNt3WcrxEBFuzE9nZ3kjpxra7Y6jfJwWgVIj2Hq8joUZ8cSEh9gdZczWLZoKwNO6VaBGoUWg1DBau1zsqzznd7uFBmQkRLE0O5Hn91fbHUX5OC0CpYax80QDvW7jdwPFg62dn8bRM60cr221O4ryYVoESg1j6/E6wkIcLJ6eYHeUcbt2fhqAXr1MjUiLQKlhbCurZ/G0SUSEOu2OMm5T4iNZlDlJi0CNSItAqSG0dPZwqLrZb84mHsna+WkcqDqnRw+pYWkRKDWEoopG3AaW+tlho0MZ2D20pVi3CtTQtAiUGsJ7JxoIcQj50ybZHeWCTU+KZs6UON09pIalRaDUEN470cCCjHiiwvzv/IGhrJ2fRtHJRs42d9odRfkgLQKlztPZ08v+ynMszfL/3UID1s5PwxjdPaSGpkWg1Hn2nmqiu9cdEOMDA2amxpKTEs1LWgRqCFoESp3nvRMNiEDB9MApAoCr56Ty3okG2rpcdkdRPsbSIhCRa0WkRERKReTeYZb5hIgcEpFiEfmLlXmU8sR7JxqYnRZHfFSo3VG8atWsFHp6De+W1tkdRfkYy4pARJzAg8BaYC6wQUTmnrfMTODbwEpjzDzgHqvyKOWJnl43u082sjTLf88mHk7B9ESiw5wUHq21O4ryMVZuESwFSo0xZcaYbuBxYN15y3weeNAY0whgjDlrYR6lRlVc3Ux7dy9LA+BEsvOFhThYmZvMmyW1emF79QFWHhuXDpwadLsSWHbeMnkAIvIu4AS+b4x56fwnEpG7gLsAUlNTKSwstCLvBWltbfXJXL7An9bNiyd6AOipOUxhQ8mEvOZErp+p0sPLTd385YU3SI/x/SFCf3rv2MFb68fug6RDgJnAaiADeEtEFhhjmgYvZIx5CHgIoKCgwKxevXqCY46usLAQX8zlC/xp3TxasZPs5DbWr1k9Ya85kesnr6mDPxe/TntcFqsvnzEhr3kh/Om9YwdvrR8r/ySoAjIH3c7ov2+wSmCTMabHGHMCOEpfMSg14Ywx7KpoZEkAjg8MmDopkrzUGAqP6l5Y9XdWFsFOYKaIZItIGHArsOm8ZZ6hb2sAEUmmb1dRmYWZlBrW8do2mtp7Au6w0fOtnjWZnSca9TBS9T7LisAY4wLuBrYAh4EnjTHFInK/iNzQv9gWoF5EDgFvAN8yxtRblUmpkRRVNAD49fUHPLE6L4XuXjdbj+tHTfWxdIzAGLMZ2HzeffcN+t4A3+j/UspWRRWNTIoKJScl2u4olirI6j+MtOQsH5mbancc5QN8/7ABpSbIropGLp6WgIjYHcVSYSEOLslJ5k09n0D10yJQCmho66asto2LA3igeLBLc5OobOzQi9UoQItAKQB2VzQCcPG04CiClbnJADrdhAK0CJQCoOhkI6FOYWGm/1+IxhO5k2NIiQ3nXR0wVmgRKAVAUXkj86bG+/WF6sdCRFiRk8S243U63YTSIlCq2+VmX2UTFwf4YaPnW5mTTF1rN0fPtNodRdlMi0AFveLqc3S53BQEWRGsyO2bWE/HCZQWgQp6RQMDxUFWBBkJUUxPimLrcS2CYKdFoIJeUUUjmYmRTI6LsDvKhFuRk8SOsgZcvW67oygbaRGooDYw0Vygzy80nBU5ybR0uThQdc7uKMpGWgQqqFU2dlDb0hXw8wsNZ0VO3ziBzjsU3LQIVFDb1T/RXLANFA9IiglndlqsDhgHOY+KQEQ2ish1IqLFoQJKUUUjseEh5KXG2h3FNityktlV0UhnT6/dUZRNPP3F/p/AJ4FjIvLvIjLLwkxKTZhd5Y0smjYJpyOwJ5obycrcJLpd7ven2VDBx6MiMMa8aoy5DVgMlAOvishWEfmMiIRaGVApq7R09lBypiXoDhs939LsRJwO4V09jDRoebyrR0SSgE8DnwP2AL+mrxhesSSZUhbbc7IJYwjaI4YGxEaEsjAjnndLdcA4WHk6RvA08DYQBXzMGHODMeYJY8xXgRgrAypllV0VjTgEFk0LjonmRrIiJ5n9lU00d/bYHUXZwNMtgv8yxsw1xvzEGFMDICLhAMaYAsvSKWWhoooGZqfFERNu6YX6/MKK3CTcBt4ra7A7irKBp0XwoyHu2+bNIEpNJFevmz0nmygIkgvRjGbxtATCQxw6ThCkRvxTSETSgHQgUkTygYFDK+Lo202klF86crqF9u7eoB8oHhAR6qQgK4FtemJZUBptm3gNfQPEGcAvBt3fAnzHokxKWW5gormCrOAeKB5sRU4yD2wpoa61i+SYcLvjqAk04q4hY8x/G2OuAD5tjLli0NcNxpiNE5RRKa/bVdHIlPgI0idF2h3FZwxcvlKnmwg+o+0aut0Y8yiQJSLfOP9xY8wvhvgxpXxeUXmD7hY6z/ypccSGh7DteB03LJxqdxw1gUbbNRTd/68eIqoCRnVTB9XnOrlLi+ADQpwOls1I0vMJgtCIRWCM+X3/vz+YmDhKWW/X+xei0fGB863MTeLVw2c41dBOZqIeDxIsPD2h7GciEicioSLymojUisjtVodTygpF5Q1EhTmZMyV4J5obzoqcvnECPXoouHh6HsE1xphm4Hr65hrKBb5lVSilrLSropFFmZMIcepkuufLS40hOSZczycIMp5+EgZ2IV0HPGWM0csZKb/U2uXicE1z0F5/YDQiwoqcJLYer8cYY3ccNUE8LYLnReQIcDHwmoikAJ3WxVLKGntPNuE2cLGePzCsFTlJ1LZ0UXq21e4oaoJ4Og31vcAKoMAY0wO0AeusDKaUFYoqGhGBfJ1oblgD5xO8o1ctCxpjmW1rNn3nEwz+mYe9nEcpS+2qaGBWaixxEXoZjeFkJkaRlRTF28fq+MzKbLvjqAngURGIyCNADrAXGLienUGLQPmRXrdhz8km1ufryVKjuTwvhad2VdLl6iU8xGl3HGUxT7cICoC5RkePlB8rOd1Ca5cr6C9E44lVeSk8vK2CXeWN7+8qU
oHL08Hig0CalUGUslpRRd9c+zq1xOiWz0gizOngzaO1dkdRE8DTIkgGDonIFhHZNPBlZTClvG1XRSOTY8PJSNCJ5kYTHR5CQVYCb5ZoEQQDT3cNfd/KEEpNhF3ljRRkJSAioy+sWJWXwk9ePMLpc52kxUfYHUdZyNPDR9+k74zi0P7vdwK7LcyllFdVNXVQ1dSh4wNjsGpWCgBv6e6hgOfpXEOfB/4G/L7/rnTgGatCKeVtO8r65s5ZPiPJ5iT+Y1ZqLKlx4bx5TIsg0Hk6RvAVYCXQDGCMOQZMHu2HRORaESkRkVIRuXeE5W4WESMiBR7mUWpMtpfVEx8Zyuw0nWjOUyLC5TNTeOdYHa5et91xlIU8LYIuY0z3wI3+k8pGPJRURJzAg8BaYC6wQUTmDrFcLPA1YIenoZUaqx0nGliWnYjDoeMDY7FqVgrnOnrYV6nTiwUyT4vgTRH5Dn0Xsf8I8BTw3Cg/sxQoNcaU9ZfI4ww9LcUPgZ+icxcpi1Q3dVBR3667hcbh0txkHAJvlpy1O4qykKdHDd0L3AkcAL4AbAb+MMrPpAOnBt2uBJYNXkBEFgOZxpgXRGTYaa1F5C7gLoDU1FQKCws9jD1xWltbfTKXL7B73WytdgHgbCijsLDCthzDsXv9jGbmJAcb3zvO4rCaCX9tX183dvPW+vGoCIwxbhF5BnjGGOOVkSMRcQC/AD7twes/BDwEUFBQYFavXu2NCF5VWFiIL+byBXavmxf/tp/4yNPccf2VPrlryO71M5rjISf44fOHyF6whOlJ0aP/gBf5+rqxm7fWz4i7hqTP90WkDigBSvqvTnafB89dBWQOup3Rf9+AWGA+UCgi5cByYJMOGCtv236inqU6PjBu18xNBWBL8WmbkyirjDZG8HX6jhZaYoxJNMYk0rd7Z6WIfH2Un90JzBSRbBEJA24F3j8b2RhzzhiTbIzJMsZkAduBG4wxu8b7n1HqfDXndHzgQmUmRjFvahxbis/YHUVZZLQiuAPYYIw5MXCHMaYMuB34x5F+0BjjAu4GtgCHgSeNMcUicr+I3HBhsZXyzI6yvvmFlmXriWQXYs28NHafbORsix7TEYhGK4JQY8yHrk7RP04w6oTuxpjNxpg8Y0yOMebH/ffdZ4z50DxFxpjVujWgvG17WT1xESHMmRJndxS/tmZeGsbAK4d0qyAQjVYE3eN8TCmfsL2snqXZSTh1fOCC5KXGkJUUpbuHAtRoRbBQRJqH+GoBFkxEQKXGq7qpg/L6dpbP0N1CF0pEWDMvjW3H62ju7LE7jvKyEYvAGOM0xsQN8RVrjNFr/SmfNjBZ2uV5KTYnCQxr5qfR02t444ieXBZoPD2zWCm/8+bRWtLiIpg5OcbuKAFhUcYk0uIieG7fxJ9YpqylRaACkqvXzTuldVyel6zXH/ASh0O4YdFU3jx6lsY2HSIMJFoEKiDtq2yipdOlu4W8bP2idHp6Dc8f0K2CQKJFoALSm0frcEjfpGnKe+ZOjWN2WixP7660O4ryIi0CFZDeOlrLwsxJTIoKsztKwFmfn87uk01U1LfZHUV5iRaBCjiNbd3sq2zi8pm6W8gK6xZNRQSe3lM1+sLKL2gRqIDzTmkdxuhho1aZEh/JJTOSeGZPFcaMeH0q5Se0CFTAeetoLXERISzMiLc7SsBan59OeX07e0412R1FeYEWgQooxhjeOlbLpTOTCXHq29sqa+enER7iYKMOGgcE/aSogHKoppkzzV2s0t1CloqNCOWjC6bw7J5qOrp77Y6jLpAWgQooW4rP4BC4ak6q3VEC3oal02jpcvH8/mq7o6gLpEWgAsqWg6cpyEokOSbc7igBb0lWAjkp0fz1vZN2R1EXSItABYwTdW2UnGlhzbw0u6MEBRFhw9Jp7D7ZxJHTzXbHURdAi0AFjIFr6q6Zp7uFJsrNizMIczp4/L1TdkdRF0CLQAWMLcWnmZ8eR0ZClN1RgkZCdBhrF6SxcXelDhr7MS0CFRDONHey52QT1+puoQm3Yek0mjtdbNaJ6PyWFoEKCC+/v1tIi2CiLctOZEZyNI/uqLA7ihonLQIVELYUn2FGSjS5ehGaCSci3LZ8OntONnGw6pzdcdQ4aBEov9fU3s32snrWzEvTi9DY5OOLM4gIdfCYbhX4JS0C5fdeOFCDy224bsEUu6MErfioUNYtTOeZPdWc69CL2/sbLQLl9zburmJWaizzpsbZHSWo3XHJdDp6enX+IT+kRaD8WnldG0UVjdy0OF13C9lsfno8izIn8cj2Cp2e2s9oESi/tnFPFQ7pmxZZ2e+O5dMpq21j6/F6u6OoMdAiUH7L7TZs3F3JytxkUuMi7I6jgOsumkJCVCiPbNNBY3+iRaD81q6KRiobO7h5cYbdUVS/iFAnn1iSySuHz1BzrsPuOMpDWgTKb23cXUl0mJNrdG4hn3Lb0um4jeGvOv+Q39AiUH6ps6eXF/bXsHbBFKLCQuyOowaZlhTF6rwU/vreSXp63XbHUR7QIlB+6YX9NbR0ubhpsQ4S+6I7LplObUsXLxefsTuK8oAWgfJLD28rJyclmktmJNkdRQ1hVd5kMhIieWR7ud1RlAe0CJTf2XuqiX2V5/jUiiw9d8BHOR3Cbcums72sgaNnWuyOo0ahRaD8zsNby4kJD+EmPVrIp92yJJOwEAePbtdDSX2dFoHyK3WtXTy/v4abF6cTE66DxL4sMTqM6xdMYePuKlq7XHbHUSPQIlB+5Ymdp+judXPHJVl2R1EeuP2S6bR2uXhmT5XdUdQItAiU33D1unl0ewWX5ibrdQf8RH7mJOZNjeNRnX/Ip2kRKL/xUvFpas518o+XTLc7ivKQiHDH8ukcOd3CropGu+OoYWgRKL9gjOHBN44zIyWaq+bomcT+5IZFU4mNCNH5h3yYpUUgIteKSImIlIrIvUM8/g0ROSQi+0XkNRHRP/XUkN4oOcvhmma+vDoXp0MPGfUnUWEhfPziDF48WENtS5fdcdQQLCsCEXECDwJrgbnABhGZe95ie4ACY8xFwN+An1mVR/kvYwy/e72U9EmRrFs01e44ahxuXz6dnl7DEztP2h1FDcHKLYKlQKkxpswY0w08DqwbvIAx5g1jTHv/ze2AHhiuPmR7WQO7TzbxxVUzCHXq3kx/lJMSw6W5yfxlx0lcOv+Qz7HyQOx0YPD0g5XAshGWvxN4cagHROQu4C6A1NRUCgsLvRTRe1pbW30yly+40HXzwM4O4sKE1PYTFBaWey2XrwiW905+rIt3Srv4zd9eZ3GqZ796gmXdjJe31o9PnJEjIrcDBcCqoR43xjwEPARQUFBgVq9ePXHhPFRYWIgv5vIFF7Ju9p5qovild/n22tlcsyrHu8F8RLC8dy7tdfPU8TfY0xrDN24Z6W/CvwuWdTNe3lo/Vm5nVwGZg25n9N/3ASJyNfBd4AZjjI4kqfcZY/jZS0dIig7jtuV6HIG/C3E6+OSyabx9rI4TdW12x1GDWFkEO4GZIpItImHArcCmwQuISD7we/pK4KyFWZQfevtYHVuP13P3lbk6nUSA
uHVJJiEO4TGdf8inWFYExhgXcDewBTgMPGmMKRaR+0Xkhv7FHgBigKdEZK+IbBrm6VSQcbsNP33pCBkJkXxy2TS74ygvmRwXwbXz03hy1yk6unvtjqP6WfpnljFmM7D5vPvuG/T91Va+vvJfzx+oobi6mV/espDwEKfdcZQX3bF8Os/vr+G5fdV8Yknm6D+gLKfH4imf0+1y8x8vlzA7LZZ1C/UKZIFmaXYieakxPLy9XOcf8hFaBMrnPLajgor6dv7l2tk49CzigDMw/9DBqmb2VZ6zO45Ci0D5mLrWLn7xylEum5nM6lkpdsdRFlmfn050mFPnH/IRWgTKpzzwUgkd3b1872Pz9DKUASw2IpQbF6fz3P5qGtu67Y4T9LQIlM/Yd6qJJ4tO8dlLs/V6A0HgjuVZdLvcPLnr1OgLK0tpESif4HYb7ttUTHJMOF+9MtfuOGoCzEqLZWl2Io/uqMDt1kFjO2kRKJ/wVNEp9p1q4jsfnU1sRKjdcdQEuWP5dE41dPDmsVq7owQ1LQJlu7PNnfz4hcMszU5k/SI9XDSYrJmXRnJMOI/qoLGttAiU7b7/XDGdLjf/ftMCHSAOMmEhDj65NJPXS85SerbF7jhBS4tA2WpL8Wk2HzjN166ayYwUHSAORp9emU1kqJPfvl5qd5SgpUWgbNPc2cN9zx5kdlosd10+w+44yiaJ0WHcsXw6z+2r5nhtq91xgpIWgbLNj54/RG1LFz/7+EV65bEg9/nLZxAW4uBB3SqwhX76lC1eLj7Nk7sq+cKqHC7KmGR3HGWz5Jhwbl82nWf2VlGu1yqYcFoEasLVtXbx7Y0HmDMljq9fnWd3HOUj7uq/JvWDb+hWwUTTIlATyhjDtzceoKXTxa9uWURYiL4FVZ/JsRF8ctk0Nu6p0rGCCaafQjWhntpVySuHzvCtNbOYlRZrdxzlY75yRS6RoU5++uIRu6MEFS0CNWGOnWnhe5uKWT4jkTsvzbY7jvJByTHhfGl1Di8fOsP2snq74wQNLQI1Idq7XXz5sd1Ehzv5za35ep0BNaw7L81manwEP3juEL06B9GE0CJQE+J7zxZTWtvKL29ZxOS4CLvjKB8WEerkf18/l8M1zbx20mV3nKCgRaAs925VD08VVXL3FblcNlMvNqNGd+38NC7PS2HjsW5qznXYHSfgaREoS+052cj/K+5m+YxEvnbVTLvjKD8hIvxw3TzcwJF5h9QAAAo9SURBVL3/c0CvbWwxLQJlmZpzHdz1SBEJ4cJ/3nYxIXr2sBqD6UnRfCIvjDeP1vL4Tr14jZX0k6ks0dHdy+cf3kV7l4t7FkeQGB1mdyTlh66cFsKKnCR+8FwxR8/o7KRW0SJQXufqdXPPE3sorm7mNxvySY/Vt5kaH4cIv7plETHhIXzlsd20dengsRX0E6q8yu02/PPf9rOl+Az3XT+Xq+ak2h1J+bnJcRH8+tZ8jte2cs8Te/WylhbQIlBeY4zhvk0H2binim9+JI/PrNSTxpR3rMxN5r7r5/LKoTP8ePNhHTz2shC7A6jA4HYbfrz5MI9uP8kXVs3gbr0AvfKyT63Iory+nT++c4LYiBDu0QkLvUaLQF2wbpebf/7bPp7ZW82nV2Rx77Wz9ZKTyutEhPuun0trl4tfvXoMV6/hm9fk6XvNC7QI1AVp6ezhi48W8W5pPd9aM4svr87RD6ayjMMh/PTmiwh1Cr97o5QzzZ38cP18IkKddkfza1oEatyOnWnhK3/ZTVltGz//h4V8/OIMuyOpIOB0CP924wImx0bw69eOUXKmhd9uyGd6UrTd0fyWDharMTPG8MTOk3zsd+/Q0NbNf392qZaAmlAiwtc/ksfv77iYE3VtXPurt/nTOyfodrntjuaXdItAjcnZlk5+8NwhXthfw8rcpL5J5GJ1EjlljzXz0rgoI557/+cA9z9/iP/eVs7/umYW1y2YojPcjoEWgfJIr9vw2I4KHthSQlePm2+tmcUXV+Xg1A+bstmU+Ej+/JklFB6t5acvHuGrf93Dg2+U8qkVWaxbNJWoMP01NxpdQ2pExhjePFrLA1tKKK5u5tLcZO5fN48ZKTF2R1PqfSLCFbMmc/nMFJ7dW8VDb5Xx7Y0H+LfNh7l5cQY3LU5nQXq8HsgwDC0CNSRjDNvLGvjFKyXsLG8kIyGSX9+6iBsWTtUPk/JZTodw0+IMbsxPp6iikYe3VfCXHSf589ZyclKiuTE/nfX56WQkRNkd1adoEfiRXrehs6cXhwgRoQ5LfiF3u9y8cKCaP71TzoGqc0yODeeH6+Zxy5JpeqF55TdEhIKsRAqyEjnX0cPmAzU8vbuKn798lJ+/fJSl2YnclJ/O2gVTiI8MtTuu7bQIfFRlYztbS+s5VNPMkdPNlJ5tpb6tm4Ez6yNCHUyNj2TKpAhmJMcwZ0occ6bEMjstjsiwsR1TbYxh98kmnttXzfP7q6lr7SYnJZof3zifmxdn6DHayq/FR4ayYek0NiydxqmGdp7dW8XGPVXcu/EA920q5uo5k7kxP4NVeSlB+8eOFoGPaOtyseNEPW8dreOtY7WU1bYBEBXmJC81lqtmp5IaH0F0mBO3gfrWLmrOdVLV1MEze6p4ZHsFAA6BrORo5kyJY27/15wpcaTGhSMiGGNo7nBxsqGdvacaKapo5L0TDVSf6yQsxMGVsyazYdk0LstN1qMuVMDJTIzi7itn8pUrctlfeY6n91Tx3L5qNh84TUJUKNdfNJX1+enkZ04Kqve/FoFNOrp72XOq75fwjrIGdlU00NNriAh1sCw7iduWTeeymcnkpsSM+oY0xlDZ2MGhmmYOVTdzuKaZ/ZVNvLC/5v1lnA4hzOmgu9f9gQuCT44N5+LpCXzzmlSumZdKbIRuJqvAJyIszJzEwsxJfPe6Obx9rJan91Tz5K5TPLK9gqToMFblpbBqVgqLpyWQkRAZ0GNjlhaBiFwL/BpwAn8wxvz7eY+HAw8DFwP1wC3GmHIrM1mt121o6ezhXEcPTe39/3b0/Vvb0kXp2RaOnmmlvK4Nl9sgAnPS4vjMymwun5lCQVbCmHfFiAiZiVFkJkaxZl7a+/c3d/ZQcrqFQ9XNnG3ppNvlJjzESXxkKBkJkSzIiCd9UmC/wZUaTajTwZWzU7lydiotnT28dvgshSVneaPkLBv3VAGQGB3G/PR4ZiRHM63/s5YUE0ZiVBgJUWHERoT49RaEZUUgIk7gQeAjQCWwU0Q2GWMODVrsTqDRGJMrIrcCPwVusSrTYMYYet2GXmNwu8Hldr//b3t3b/+Xi/buXtq6XHT09NLW1Utr1wd/wTf3/8I/3dBOV+EWWrpcDDdDrghMT4wid3Isa+alUjA9kcXTEywbrIqLCGVJViJLshIteX6lAk1sRCjr+48s6nUbDlU3s7eyif2nmjhY3UxReQNt3b0f+jmHQEJUGJOiQomLDCUmPIS4iFBiI0KICQ8hNiKUmIgQYiNCiO2/HRsRQkxECJGhTkKcQqjD0fev00GIQ3A6ZML+SLNyi2ApUGqMKQMQkceBdcDgIlgHfL//+78BvxM
RMRZMNv7Hd07ws5eO4O4vgAu5tkWIQ4iPDO37igolKSaMGLeDvKx04qPCiI8MZdKgxwd/Hx6iA69K+QOnQ1iQEc+CjHhYPh3o+wOysb2HUw3tNLR109jeTUNbN03tPTS2991u6XTR0umiuqmDlk4XrV19f1COR4hDcIgg0veH5Pc/No9bl07z5n+z73W8/ox/lw4MvuJ0JbBsuGWMMS4ROQckAXWDFxKRu4C7+m+2ikiJJYkvTDLn5Vbv03UzMl0/w9N1M8iGH8GGD941lvUzfbgH/GKw2BjzEPCQ3TlGIiK7jDEFdufwRbpuRqbrZ3i6bkbmrfVj5UGzVUDmoNsZ/fcNuYyIhADx9A0aK6WUmiBWFsFOYKaIZItIGHArsOm8ZTYBn+r//uPA61aMDyillBqeZbuG+vf53w1soe/w0T8ZY4pF5H5glzFmE/BH4BERKQUa6CsLf+XTu65sputmZLp+hqfrZmReWT+if4ArpVRwC86JNZRSSr1Pi0AppYKcFoEXicgDInJERPaLyNMiMsnuTL5CRP5BRIpFxC0iejggfVOwiEiJiJSKyL125/ElIvInETkrIgftzuKLRCRTRN4QkUP9n6uvXcjzaRF41yvAfGPMRcBR4Ns25/ElB4GbgLfsDuILBk3BshaYC2wQkbn2pvIpfwautTuED3MB3zTGzAWWA1+5kPePFoEXGWNeNsa4+m9up+/cCQUYYw4bY3zxjHC7vD8FizGmGxiYgkUBxpi36DuSUA3BGFNjjNnd/30LcJi+mRrGRYvAOp8FXrQ7hPJZQ03BMu4PsgpeIpIF5AM7xvscfjHFhC8RkVeBtCEe+q4x5tn+Zb5L36bbYxOZzW6erBullPeISAzwP8A9xpjm8T6PFsEYGWOuHulxEfk0cD1wVbCdJT3aulEf4MkULEoNS0RC6SuBx4wxGy/kuXTXkBf1X4jnn4EbjDHtdudRPs2TKViUGpL0Xajgj8BhY8wvLvT5tAi863dALPCKiOwVkf9rdyBfISI3ikglcAnwgohssTuTnfoPKhiYguUw8KQxptjeVL5DRP4KbANmiUiliNxpdyYfsxK4A7iy/3fNXhH56HifTKeYUEqpIKdbBEopFeS0CJRSKshpESilVJDTIlBKqSCnRaCUUkFOi0AppYKcFoFSSgW5/w/20CCA2VHc7AAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AFg9anP931aU" + }, + "source": [ + "Examine a Kernel Density plot of the bias to see the range of values. \n", + "\n", + "Note:\n", + "The response is log transformed, so the bias must be exponeniated to get the real difference from the true value" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mYvySUyk31aU", + "outputId": "3e0e7cf2-c0fe-4a65-d804-5ebb9010a2ef", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "from sklearn.cluster import KMeans\n", + "cluster_labels = KMeans(n_clusters=5, random_state=0).fit_predict(Xm)\n", + "cluster_labels.shape" + ], + "execution_count": 17, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(1000,)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k_CH2IPZ31aW" + }, + "source": [ + "After we have estimated the bias at each point, we can examine the bias values to see if there are systematic patterns in the bias. One simple evaluation technique to do this is to cluster the bias values. If there are no systematic patterns then we should not see explicit clustering tendencies when we cluster the bias estimates. Dense clusters would indicate areas where our model is misspecified - either too much of complexity or not enough complexity.\n", + "\n", + "\n", + "To see where the model is making mistakes, we cluster the (sample of the) dataset and compute the average bias for each cluster. This provides insights into regions of the data space we are doing well (bias close to zero) and regions where we are not doing well. \n", + "\n", + "The table below shows the mean cluster bias and the size of the cluster. We see two large clusters where the bias is close to zero (cluster 0 and cluster 1). We see one outlier with a large error (cluster 3). Clusters 1 and 4 are also seem like outliers and need further analysis. \n", + "\n", + "This exercise illustrates how we can examine our model's characteristics. We can now link this model analysis activity to our project using Arangopipe. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tCeWdvPk31aW", + "outputId": "714ece7b-03b5-4150-cf0b-7ee3134ee251", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 235 + } + }, + "source": [ + "df_bias = pd.DataFrame(Xm)\n", + "df_bias['cluster'] = cluster_labels\n", + "df_bias['bias'] = bias_values\n", + "df_bias.groupby('cluster')['bias'].agg([np.mean, np.size])" + ], + "execution_count": 18, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
meansize
cluster
00.061620505.0
1-0.27519523.0
2-0.11196695.0
30.0030682.0
4-0.031035375.0
\n", + "
" + ], + "text/plain": [ + " mean size\n", + "cluster \n", + "0 0.061620 505.0\n", + "1 -0.275195 23.0\n", + "2 -0.111966 95.0\n", + "3 0.003068 2.0\n", + "4 -0.031035 375.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5CDZoueKGAZh" + }, + "source": [ + "## Storing in Arangopipe\n", + "\n", + "Calculation of the variance can be performed analogously. Bias and variance are model characteristics that are of interest to data scientists because they convey information about its strengths, limitations and performance. In this example we store the model bias for a linear regression model in arangopipe. Such an exercise may be performed by the data science team member to get a baseline profile for the modeling task. A coworker developing a more complex model, can see how his model performs in relation to the baseline model by retrieving these results from arangopipe. \n", + "\n", + "We start with setting up the connection to the ArangoML cloud database, hosted on ArangoDB Oasis." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "x17I1e8631aY", + "outputId": "902bf8f8-e69a-4aa7-e651-53296833fcf8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 309 + } + }, + "source": [ + "from arangopipe.arangopipe_storage.arangopipe_api import ArangoPipe\n", + "from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin\n", + "from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig\n", + "from arangopipe.arangopipe_storage.managed_service_conn_parameters import ManagedServiceConnParam\n", + "mdb_config = ArangoPipeConfig()\n", + "msc = ManagedServiceConnParam()\n", + "conn_params = { msc.DB_SERVICE_HOST : \"arangoml.arangodb.cloud\", \\\n", + " msc.DB_SERVICE_END_POINT : \"createDB\",\\\n", + " msc.DB_SERVICE_NAME : \"createDB\",\\\n", + " msc.DB_SERVICE_PORT : 8529,\\\n", + " msc.DB_CONN_PROTOCOL : 'https'}\n", + " \n", + "mdb_config = mdb_config.create_connection_config(conn_params)\n", + "admin = ArangoPipeAdmin(reuse_connection = False, config = mdb_config)\n", + "ap_config = admin.get_config()\n", + "ap = ArangoPipe(config = ap_config)\n", + "print(\" \")\n", + "print(\"Your temporary database can be accessed using the following credentials:\")\n", + "mdb_config.get_cfg()\n" + ], + "execution_count": 19, + "outputs": [ + { + "output_type": "stream", + "text": [ + "API endpoint: https://arangoml.arangodb.cloud:8529/_db/_system/createDB/createDB\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. 
      {
        "cell_type": "code",
        "metadata": {
          "id": "x17I1e8631aY",
          "outputId": "902bf8f8-e69a-4aa7-e651-53296833fcf8",
          "colab": {
            "base_uri": "https://localhost:8080/",
            "height": 309
          }
        },
        "source": [
          "from arangopipe.arangopipe_storage.arangopipe_api import ArangoPipe\n",
          "from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin\n",
          "from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig\n",
          "from arangopipe.arangopipe_storage.managed_service_conn_parameters import ManagedServiceConnParam\n",
          "mdb_config = ArangoPipeConfig()\n",
          "msc = ManagedServiceConnParam()\n",
          "conn_params = { msc.DB_SERVICE_HOST : \"arangoml.arangodb.cloud\", \\\n",
          "                msc.DB_SERVICE_END_POINT : \"createDB\",\\\n",
          "                msc.DB_SERVICE_NAME : \"createDB\",\\\n",
          "                msc.DB_SERVICE_PORT : 8529,\\\n",
          "                msc.DB_CONN_PROTOCOL : 'https'}\n",
          "        \n",
          "mdb_config = mdb_config.create_connection_config(conn_params)\n",
          "admin = ArangoPipeAdmin(reuse_connection = False, config = mdb_config)\n",
          "ap_config = admin.get_config()\n",
          "ap = ArangoPipe(config = ap_config)\n",
          "print(\" \")\n",
          "print(\"Your temporary database can be accessed using the following credentials:\")\n",
          "mdb_config.get_cfg()\n"
        ],
        "execution_count": 19,
        "outputs": [
          {
            "output_type": "stream",
            "text": [
              "API endpoint: https://arangoml.arangodb.cloud:8529/_db/_system/createDB/createDB\n"
            ],
            "name": "stdout"
          },
          {
            "output_type": "stream",
            "text": [
              "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n",
              "  InsecureRequestWarning)\n"
            ],
            "name": "stderr"
          },
          {
            "output_type": "stream",
            "text": [
              "Host Connection: https://arangoml.arangodb.cloud:8529\n",
              " \n",
              "Your temporary database can be accessed using the following credentials:\n"
            ],
            "name": "stdout"
          },
          {
            "output_type": "execute_result",
            "data": {
              "text/plain": [
                "{'arangodb': {'DB_end_point': 'createDB',\n",
                "  'DB_service_host': 'arangoml.arangodb.cloud',\n",
                "  'DB_service_name': 'createDB',\n",
                "  'DB_service_port': 8529,\n",
                "  'arangodb_replication_factor': None,\n",
                "  'conn_protocol': 'https',\n",
                "  'dbName': 'MLt725appczvfcezt8cuhto',\n",
                "  'password': 'MLf710xbl8p1docu8etzcc5',\n",
                "  'username': 'MLpubqibo0hqoar84chbst3'},\n",
                " 'mlgraph': {'graphname': 'enterprise_ml_graph'}}"
              ]
            },
            "metadata": {
              "tags": []
            },
            "execution_count": 19
          }
        ]
      },
      {
        "cell_type": "markdown",
        "metadata": {
          "id": "X5Rz0YE4IbwL"
        },
        "source": [
          "### Try it out!\n",
          "Once the previous block has executed successfully, you can navigate to https://arangoml.arangodb.cloud:8529 and sign in with the generated credentials to explore the temporary database.\n",
          "\n",
          "## Log the Project with Arangopipe\n",
          "\n",
          "Now that we have run our experiment, it is time to save the metadata with arangopipe!\n",
          "\n",
          "As discussed in the 'Basic Arangopipe Workflow' notebook and post, arangopipe can be nestled into or around your pre-existing machine learning pipelines, so we can capture all of the important information from this experiment by simply dropping in the code below. \n",
          "\n",
          "This will create a project and store everything about this experiment, including the various parameters used throughout it and the performance of the run."
        ]
      },
      {
        "cell_type": "code",
        "metadata": {
          "id": "Q9j7RL8TGVj_"
        },
        "source": [
          "\n",
          "proj_info = {\"name\": \"Housing_Price_Estimation_Project\"}\n",
          "proj_reg = admin.register_project(proj_info)\n",
          "ds_info = {\"name\" : \"california-housing-dataset\",\\\n",
          "           \"description\": \"This dataset lists median house prices in California. 
Various house features are provided\",\\\n",
          "           \"source\": \"UCI ML Repository\" }\n",
          "ds_reg = ap.register_dataset(ds_info)\n",
          "import numpy as np\n",
          "# Note: medianHouseValue was already log transformed earlier in the\n",
          "# notebook, so we do not apply np.log to it again here.\n",
          "featureset = df.dtypes.to_dict()\n",
          "featureset = {k:str(featureset[k]) for k in featureset}\n",
          "featureset[\"name\"] = \"log_transformed_median_house_value\"\n",
          "fs_reg = ap.register_featureset(featureset, ds_reg[\"_key\"]) \n",
          "model_info = {\"name\": \"Bias Variance Analysis of LASSO model\", \"task\": \"Model Validation\"}\n",
          "model_reg = ap.register_model(model_info, project = \"Housing_Price_Estimation_Project\")\n",
          "import uuid\n",
          "import datetime\n",
          "import jsonpickle\n",
          "\n",
          "ruuid = str(uuid.uuid4().int)\n",
          "model_perf = {'model_bias': bias_at_i, 'run_id': ruuid, \"timestamp\": str(datetime.datetime.now())}\n",
          "\n",
          "mp = clf.get_params()\n",
          "mp = jsonpickle.encode(mp)\n",
          "model_params = {'run_id': ruuid, 'model_params': mp}\n",
          "\n",
          "run_info = {\"dataset\" : ds_reg[\"_key\"],\\\n",
          "            \"featureset\": fs_reg[\"_key\"],\\\n",
          "            \"run_id\": ruuid,\\\n",
          "            \"model\": model_reg[\"_key\"],\\\n",
          "            \"model-params\": model_params,\\\n",
          "            \"model-perf\": model_perf,\\\n",
          "            \"tag\": \"Housing-Price-Hyperopt-Experiment\",\\\n",
          "            \"project\": \"Housing Price Estimation Project\"}\n",
          "ap.log_run(run_info)"
        ],
        "execution_count": 20,
        "outputs": []
      },
      {
        "cell_type": "markdown",
        "metadata": {
          "id": "TL_HzB67LEVo"
        },
        "source": [
          "The [Introduction to ArangoML series](https://www.arangodb.com/tag/arangoml/) will continue, so be sure to sign up for our newsletter to be notified of the next release!\n",
          "\n",
          "You can also join us on the [ArangoML Slack channel](https://arangodb-community.slack.com/archives/CN9LVJ24S) if you have any questions or comments."
        ]
      }
    ]
}
\ No newline at end of file

From aefd5730be40ce53a6de04f48fadb7052c41ed0e Mon Sep 17 00:00:00 2001
From: Chris Woodward
Date: Fri, 23 Oct 2020 20:19:26 -0400
Subject: [PATCH 2/2] Updated Bootstrapping section

---
 examples/Arangopipe_Feature_Example_ext1.ipynb           | 4 +---
 .../Arangopipe_Feature_Example_ext1_output.ipynb         | 4 +---
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/examples/Arangopipe_Feature_Example_ext1.ipynb b/examples/Arangopipe_Feature_Example_ext1.ipynb
index 16e8eb0..9951cb8 100644
--- a/examples/Arangopipe_Feature_Example_ext1.ipynb
+++ b/examples/Arangopipe_Feature_Example_ext1.ipynb
@@ -40,9 +40,7 @@
 "\n",
 "In the [previous post](https://www.arangodb.com/2020/10/arangoml-part-2-basic-arangopipe-workflow/), we explored the basic arangopipe workflow and discussed the concept of model building. One concept briefly mentioned was grabbing a sample of the large dataset to use during our model building exercise. This data sampling is referred to as bootstrap sampling and is one approach for tackling big data. To explain, we will continue using the housing prices dataset from the previous post.\n",
 "\n",
-      "Bootstrapping is a statistical technique. There are situations where we have limited data but are interested in estimating the behavior of a statistic. For example, we have a single dataset, and we are interested in estimating how a model we have developed would perform on datasets we encounter in the future. Bootstrapping can give us useful estimates in these situations. \n",
\n", - "\n", - "Imagine that our dataset either currently or will in the future contain millions of documents for houses. Having this many documents would mean that every time we wanted to run a test when building the model, it would take a long time to get a result. Not only would there be a considerable time requirement but not having an accurate test dataset to test the model against would result in a potentially less precise model when used against future data. Bootstrap sampling combines the need for precision with the ability to timely develop a model. \n", + "Bootstrapping is a statistical technique. There are situations where we have limited data but are interested in estimating the behavior of a statistic. For example, we have a single dataset, and we are interested in estimating how a model we have developed would perform on datasets we encounter in the future. Bootstrapping can give us useful estimates in these situations. We circumvent the problem of not having sufficient data by creating synthetic datasets from the dataset we have by drawing samples from it.\n", "\n", "The intuition with bootstrapping is that each sample of the dataset is similar. The validity of our estimates depends on how reasonable it is to assume that the samples are similar. In situations where there is limited variability in the data, this assumption can be reasonable. \n", "\n", diff --git a/examples/examples_output/Arangopipe_Feature_Example_ext1_output.ipynb b/examples/examples_output/Arangopipe_Feature_Example_ext1_output.ipynb index 6a57e97..fa75c09 100644 --- a/examples/examples_output/Arangopipe_Feature_Example_ext1_output.ipynb +++ b/examples/examples_output/Arangopipe_Feature_Example_ext1_output.ipynb @@ -40,9 +40,7 @@ "\n", "In the [previous post](https://www.arangodb.com/2020/10/arangoml-part-2-basic-arangopipe-workflow/), we explored the basic arangopipe workflow and discussed the concept of model building. One concept briefly mentioned was grabbing a sample of the large dataset to use during our model building exercise. This data sampling is referred to as bootstrap sampling and is one approach for tackling big data. To explain, we will continue using the housing prices dataset from the previous post.\n", "\n", - "Bootstrapping is a statistical technique. There are situations where we have limited data but are interested in estimating the behavior of a statistic. For example, we have a single dataset, and we are interested in estimating how a model we have developed would perform on datasets we encounter in the future. Bootstrapping can give us useful estimates in these situations. \n", - "\n", - "Imagine that our dataset either currently or will in the future contain millions of documents for houses. Having this many documents would mean that every time we wanted to run a test when building the model, it would take a long time to get a result. Not only would there be a considerable time requirement but not having an accurate test dataset to test the model against would result in a potentially less precise model when used against future data. Bootstrap sampling combines the need for precision with the ability to timely develop a model. \n", + "Bootstrapping is a statistical technique. There are situations where we have limited data but are interested in estimating the behavior of a statistic. For example, we have a single dataset, and we are interested in estimating how a model we have developed would perform on datasets we encounter in the future. 
-      "\n",
-      "Imagine that our dataset either currently or will in the future contain millions of documents for houses. Having this many documents would mean that every time we wanted to run a test when building the model, it would take a long time to get a result. Not only would there be a considerable time requirement but not having an accurate test dataset to test the model against would result in a potentially less precise model when used against future data. Bootstrap sampling combines the need for precision with the ability to timely develop a model. \n",
+      "Bootstrapping is a statistical technique. There are situations where we have limited data but are interested in estimating the behavior of a statistic. For example, we have a single dataset, and we are interested in estimating how a model we have developed would perform on datasets we encounter in the future. Bootstrapping can give us useful estimates in these situations. We circumvent the problem of not having sufficient data by creating synthetic datasets from the dataset we have by drawing samples from it with replacement.\n",
 "\n",
 "The intuition with bootstrapping is that each sample of the dataset is similar. The validity of our estimates depends on how reasonable it is to assume that the samples are similar. In situations where there is limited variability in the data, this assumption can be reasonable. \n",
 "\n",