diff --git a/supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb b/supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb index 281f8a47..84dcc175 100644 --- a/supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb +++ b/supporting-blog-content/re-ranking-elasticsearch-hosted/re-ranking-elasticsearch-hosted.ipynb @@ -1,819 +1,821 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3" - }, - "language_info": { - "name": "python" - } - }, - "cells": [ - { - "cell_type": "markdown", - "source": [ - "*Reranking with a locally hosted reranker model from HuggingFace*" - ], - "metadata": { - "id": "bW9q8qD_bPhY" - } - }, - { - "cell_type": "markdown", - "source": [ - "# Setup the notebook" - ], - "metadata": { - "id": "BecBOzyDbWik" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Install required libs" - ], - "metadata": { - "id": "6ayhDP72bZAe" - } - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2Xz9uWQFbNkH" - }, - "outputs": [], - "source": [ - "!pip install -qqU elasticsearch\n", - "!pip install -qqU eland[pytorch]\n", - "!pip install -qqU datasets" - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Import the required python libraries" - ], - "metadata": { - "id": "LgHQaJh0bmJQ" - } - }, - { - "cell_type": "code", - "source": [ - "import os\n", - "from elasticsearch import Elasticsearch, helpers, exceptions\n", - "from urllib.request import urlopen\n", - "from getpass import getpass\n", - "import json\n", - "import time\n", - "from datasets import load_dataset\n", - "import pandas as pd" - ], - "metadata": { - "id": "CsL466H0bjNX" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "## Create an Elasticsearch Python client\n", - "\n" - ], - "metadata": { - "id": "gsQ4XIpkbpd4" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Free Trial\n", - "If you don't have an Elasticsearch cluster, or what one to test out. Head over to [cloud.elastic.co](https://cloud.elastic.co/registration?onboarding_token=search&cta=cloud-registration&tech=trial&plcmt=article%20content&pg=search-labs) and sign up. You can sign up and have a serverless project up and running in only a few mintues!" 
- ], - "metadata": { - "id": "M_JppGte1Wc5" - } - }, - { - "cell_type": "markdown", - "source": [ - "We are using an Elastic Cloud cloud_id and deployment (cluster) API key.\n", - "\n", - "[See this guide](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud) for finding the `cloud_id` and creating an `api_key`" - ], - "metadata": { - "id": "gdsEjwhV1tr5" - } - }, - { - "cell_type": "code", - "source": [ - "cloud_id = getpass(prompt=\"Enter your Elasticsearch Cloud ID: \")\n", - "api_key = getpass(prompt=\"Enter your Elasticsearch API key: \")\n", - "\n", - "\n", - "es = Elasticsearch(cloud_id=cloud_id, api_key=api_key)\n", - "\n", - "try:\n", - " es.info()\n", - " print(\"Successfully connected to Elasticsearch!\")\n", - "except exceptions.ConnectionError as e:\n", - " print(f\"Error connecting to Elasticsearch: {e}\")" - ], - "metadata": { - "id": "UY5WCB0HUVTb" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "# Ready Elasticsearch" - ], - "metadata": { - "id": "jQdzhNbB_e3n" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Hugging Face Reranking Model\n", - "Run this cell to:\n", - "- Use Eland's `eland_import_hub_model` command to upload the reranking model to Elasticsearch.\n", - "\n", - "For this example we've chosen the [`cross-encoder/ms-marco-MiniLM-L-6-v2`](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) text similarity model.\n", - "

\n", - "**Note**:\n", - "While we are importing the model for use as a reranker, Eland and Elasticsearch do not have a dedicated rerank task type, so we still use `text_similarity`" - ], - "metadata": { - "id": "5bsLLnqCfNKk" - } - }, - { - "cell_type": "code", - "source": [ - "model_id = \"cross-encoder/ms-marco-MiniLM-L-6-v2\"\n", - "\n", - "!eland_import_hub_model \\\n", - " --cloud-id $cloud_id \\\n", - " --es-api-key $api_key \\\n", - " --hub-model-id $model_id \\\n", - " --task-type text_similarity" - ], - "metadata": { - "id": "J2MTEYrUfk9R" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "## Create Inference Endpoint\n", - "Run this cell to:\n", - "- Create an inference Endpoint\n", - "- Deploy the reranking model we impoted in the previous section\n", - "We need to create an endpoint queries can use for reranking\n", - "\n", - "Key points about the `model_config`\n", - "- `service` - in this case `elasticsearch` will tell the inference API to use a locally hosted (in Elasticsearch) model\n", - "- `num_allocations` sets the number of allocations to 1\n", - " - Allocations are independent units of work for NLP tasks. Scaling this allows for an increase in concurrent throughput\n", - "- `num_threads` - sets the number of threads per allocation to 1\n", - " - Threads per allocation affect the number of threads used by each allocation during inference. Scaling this generally increased the speed of inference requests (to a point).\n", - "- `model_id` - This is the id of the model as it is named in Elasticsearch\n", - "\n" - ], - "metadata": { - "id": "-rrQV6SAgWz8" - } - }, - { - "cell_type": "code", - "source": [ - "model_config = {\n", - " \"service\": \"elasticsearch\",\n", - " \"service_settings\": {\n", - " \"num_allocations\": 1,\n", - " \"num_threads\": 1,\n", - " \"model_id\": \"cross-encoder__ms-marco-minilm-l-6-v2\",\n", - " },\n", - " \"task_settings\": {\"return_documents\": True},\n", - "}\n", - "\n", - "inference_id = \"semantic-reranking\"\n", - "\n", - "create_endpoint = es.inference.put(\n", - " inference_id=inference_id, task_type=\"rerank\", body=model_config\n", - ")\n", - "\n", - "create_endpoint.body" - ], - "metadata": { - "id": "Abu084BYgWCE", + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { "colab": { - "base_uri": "https://localhost:8080/" + "provenance": [] }, - "outputId": "d58cc940-281e-4d56-d6e6-040e4881e78a" - }, - "execution_count": null, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "{'inference_id': 'semantic-reranking',\n", - " 'task_type': 'rerank',\n", - " 'service': 'elasticsearch',\n", - " 'service_settings': {'num_allocations': 1,\n", - " 'num_threads': 1,\n", - " 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},\n", - " 'task_settings': {'return_documents': True}}" - ] - }, - "metadata": {}, - "execution_count": 6 + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" } - ] - }, - { - "cell_type": "markdown", - "source": [ - "### Verify it was created\n", - "\n", - "- Run the two cells in this section to verify:\n", - "- The Inference Endpoint has been completed\n", - "- The model has been deployed\n", - "\n", - "You should see JSON output with information about the semantic endpoint" - ], - "metadata": { - "id": "X8rQXMrHhMkS" - } }, - { - "cell_type": "code", - "source": [ - "check_endpoint = es.inference.get(\n", - " inference_id=inference_id,\n", - ")\n", - "\n", - 
"check_endpoint.body" - ], - "metadata": { - "id": "n3Yk7rgYhP-N", - "colab": { - "base_uri": "https://localhost:8080/" + "cells": [ + { + "cell_type": "markdown", + "source": [ + "*Reranking with a locally hosted reranker model from HuggingFace*" + ], + "metadata": { + "id": "bW9q8qD_bPhY" + } }, - "outputId": "d9e68225-5796-411e-964a-6db3be5541aa" - }, - "execution_count": null, - "outputs": [ { - "output_type": "execute_result", - "data": { - "text/plain": [ - "{'endpoints': [{'inference_id': 'semantic-reranking',\n", - " 'task_type': 'rerank',\n", - " 'service': 'elasticsearch',\n", - " 'service_settings': {'num_allocations': 1,\n", - " 'num_threads': 1,\n", - " 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},\n", - " 'task_settings': {'return_documents': True}}]}" + "cell_type": "markdown", + "source": [ + "# Setup the notebook" + ], + "metadata": { + "id": "BecBOzyDbWik" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Install required libs" + ], + "metadata": { + "id": "6ayhDP72bZAe" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2Xz9uWQFbNkH" + }, + "outputs": [], + "source": [ + "!pip install -qqU elasticsearch\n", + "!pip install -qqU eland[pytorch]\n", + "!pip install -qqU datasets" ] - }, - "metadata": {}, - "execution_count": 7 - } - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Create the index mapping\n", - "\n", - "We are going to index the `title` and `abstract` from the dataset. " - ], - "metadata": { - "id": "4vqimyNWAhWb" - } - }, - { - "cell_type": "code", - "source": [ - "index_name = \"arxiv-papers\"\n", - "\n", - "index_mapping = {\n", - " \"mappings\": {\n", - " \"properties\": {\"title\": {\"type\": \"text\"}, \"abstract\": {\"type\": \"text\"}}\n", - " }\n", - "}\n", - "\n", - "\n", - "try:\n", - " es.indices.create(index=index_name, body=index_mapping)\n", - " print(f\"Index '{index_name}' created successfully.\")\n", - "except exceptions.RequestError as e:\n", - " if e.error == \"resource_already_exists_exception\":\n", - " print(f\"Index '{index_name}' already exists.\")\n", - " else:\n", - " print(f\"Error creating index '{index_name}': {e}\")" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" }, - "outputId": "4bc9ba0b-9be3-410a-d1d6-f1da04bbfec7", - "id": "DPADF_7ytTmR" - }, - "execution_count": null, - "outputs": [ { - "output_type": "stream", - "name": "stdout", - "text": [ - "Index 'arxiv-papers' created successfully.\n" - ] - } - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Ready the dataset\n", - "We are going to use the [CShorten/ML-ArXiv-Papers](https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers) dataset." - ], - "metadata": { - "id": "FqQmaT5P-Nhx" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Download Dataset\n", - "**Note** You may get a warning *The secret `HF_TOKEN` does not exist in your Colab secrets*.\n", - "\n", - "You can safely ignore this." 
- ], - "metadata": { - "id": "aN0dbYO7oB47" - } - }, - { - "cell_type": "code", - "source": [ - "dataset = load_dataset(\"CShorten/ML-ArXiv-Papers\")" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" + "cell_type": "markdown", + "source": [ + "## Import the required python libraries" + ], + "metadata": { + "id": "LgHQaJh0bmJQ" + } }, - "id": "IVnpj5bBoEBL", - "outputId": "bc6371d9-d66f-482c-95f8-8cdb89e4f0ef" - }, - "execution_count": null, - "outputs": [ { - "output_type": "stream", - "name": "stderr", - "text": [ - "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", - "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", - "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", - "You will be able to reuse this secret in all of your notebooks.\n", - "Please note that authentication is recommended but still optional to access public models or datasets.\n", - " warnings.warn(\n" - ] - } - ] - }, - { - "cell_type": "markdown", - "source": [ - "### Index into Elasticsearch\n", - "\n", - "We will loop through the dataset and send batches of rows to Elasticsearch\n", - "- This may take a couple minutes depending on your cluster sizing." - ], - "metadata": { - "id": "GQxDITCpAKWb" - } - }, - { - "cell_type": "code", - "source": [ - "def bulk_insert_elasticsearch(dataset, index_name, chunk_size=1000):\n", - " actions = []\n", - " for record in dataset:\n", - " action = {\n", - " \"_index\": index_name,\n", - " \"_source\": {\"title\": record[\"title\"], \"abstract\": record[\"abstract\"]},\n", - " }\n", - " actions.append(action)\n", - "\n", - " if len(actions) == chunk_size:\n", - " helpers.bulk(es, actions)\n", - " actions = []\n", - "\n", - " if actions:\n", - " helpers.bulk(es, actions)\n", - "\n", - "\n", - "bulk_insert_elasticsearch(dataset[\"train\"], index_name)" - ], - "metadata": { - "id": "tDZ0qEbW-ozW" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "# Query with Reranking\n", - "\n", - "This containes a `text_similarity_reranker` retriever which:\n", - "1. Uses a Standard Retriever to :\n", - " 1. Perform a lexical query against `title field\n", - "2. Perform a reranking:\n", - " 1. Taks as input the top 100 results from the previous search\n", - " - `\"rank_window_size\": 100`\n", - " 2. Taks as input the query\n", - " - `\"inference_text\": query`\n", - " 3. 
Uses our previously created reranking API and model\n" - ], - "metadata": { - "id": "2bwvzLfRjJ2n" - } - }, - { - "cell_type": "code", - "source": [ - "query = \"sparse vector embedding\"\n", - "\n", - "# Query scored from score\n", - "response_scored = es.search(\n", - " index=\"arxiv-papers\",\n", - " body={\n", - " \"size\": 10,\n", - " \"retriever\": {\"standard\": {\"query\": {\"match\": {\"title\": query}}}},\n", - " \"fields\": [\"title\", \"abstract\"],\n", - " \"_source\": False,\n", - " },\n", - ")\n", - "\n", - "# Query with Semantic Reranker\n", - "response_reranked = es.search(\n", - " index=\"arxiv-papers\",\n", - " body={\n", - " \"size\": 10,\n", - " \"retriever\": {\n", - " \"text_similarity_reranker\": {\n", - " \"retriever\": {\"standard\": {\"query\": {\"match\": {\"title\": query}}}},\n", - " \"field\": \"abstract\",\n", - " \"inference_id\": \"semantic-reranking\",\n", - " \"inference_text\": query,\n", - " \"rank_window_size\": 100,\n", - " }\n", - " },\n", - " \"fields\": [\"title\", \"abstract\"],\n", - " \"_source\": False,\n", - " },\n", - ")" - ], - "metadata": { - "id": "HWXQBS35jQ3n" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "## Print the table comparing the scored and reranked results" - ], - "metadata": { - "id": "Hnam80Irbj6a" - } - }, - { - "cell_type": "code", - "source": [ - "titles_scored = [\n", - " paper[\"fields\"][\"title\"][0] for paper in response_scored.body[\"hits\"][\"hits\"]\n", - "]\n", - "titles_reranked = [\n", - " paper[\"fields\"][\"title\"][0] for paper in response_reranked.body[\"hits\"][\"hits\"]\n", - "]\n", - "\n", - "# Creating a DataFrame\n", - "df = pd.DataFrame(\n", - " {\"Scored Results\": titles_scored, \"Reranked Results\": titles_reranked}\n", - ")\n", - "\n", - "df_styled = df.style.set_properties(**{\"text-align\": \"left\"}).set_caption(\n", - " f\"Comparison of Scored and Semantic Reranked Results - Query: '{query}'\"\n", - ")\n", - "\n", - "# Display the table\n", - "df_styled" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 415 + "cells": [ + { + "cell_type": "markdown", + "source": [ + "*Reranking with a locally hosted reranker model from HuggingFace*" + ], + "metadata": { + "id": "bW9q8qD_bPhY" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Setup the notebook" + ], + "metadata": { + "id": "BecBOzyDbWik" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Install required libs" + ], + "metadata": { + "id": "6ayhDP72bZAe" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2Xz9uWQFbNkH" + }, + "outputs": [], + "source": [ + "!pip install -qqU elasticsearch\n", + "!pip install -qqU eland[pytorch]\n", + "!pip install -qqU datasets" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Import the required python libraries" + ], + "metadata": { + "id": "LgHQaJh0bmJQ" + } + }, + { + "cell_type": "code", + "source": [ + "import os\n", + "from elasticsearch import Elasticsearch, helpers, exceptions\n", + "from urllib.request import urlopen\n", + "from getpass import getpass\n", + "import json\n", + "import time\n", + "from datasets import load_dataset\n", + "import pandas as pd" + ], + "metadata": { + "id": "CsL466H0bjNX" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Create an Elasticsearch Python client\n", + "\n" + ], + "metadata": { + "id": "gsQ4XIpkbpd4" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Free Trial\n", + "If you don't have an Elasticsearch cluster, or want to test one out, head over to [cloud.elastic.co](https://cloud.elastic.co/registration?onboarding_token=search&cta=cloud-registration&tech=trial&plcmt=article%20content&pg=search-labs) and sign up. You can have a serverless project up and running in only a few minutes!" 
+ ], + "metadata": { + "id": "M_JppGte1Wc5" + } + }, + { + "cell_type": "markdown", + "source": [ + "We are using an Elastic Cloud cloud_id and deployment (cluster) API key.\n", + "\n", + "[See this guide](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud) for finding the `cloud_id` and creating an `api_key`" + ], + "metadata": { + "id": "gdsEjwhV1tr5" + } + }, + { + "cell_type": "code", + "source": [ + "cloud_id = getpass(prompt=\"Enter your Elasticsearch Cloud ID: \")\n", + "api_key = getpass(prompt=\"Enter your Elasticsearch API key: \")\n", + "\n", + "\n", + "es = Elasticsearch(cloud_id=cloud_id, api_key=api_key)\n", + "\n", + "try:\n", + " es.info()\n", + " print(\"Successfully connected to Elasticsearch!\")\n", + "except exceptions.ConnectionError as e:\n", + " print(f\"Error connecting to Elasticsearch: {e}\")" + ], + "metadata": { + "id": "UY5WCB0HUVTb" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Ready Elasticsearch" + ], + "metadata": { + "id": "jQdzhNbB_e3n" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Hugging Face Reranking Model\n", + "Run this cell to:\n", + "- Use Eland's `eland_import_hub_model` command to upload the reranking model to Elasticsearch.\n", + "\n", + "For this example we've chosen the [`cross-encoder/ms-marco-MiniLM-L-6-v2`](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) text similarity model.\n", + "

\n", + "**Note**:\n", + "While we are importing the model for use as a reranker, Eland and Elasticsearch do not have a dedicated rerank task type, so we still use `text_similarity`" + ], + "metadata": { + "id": "5bsLLnqCfNKk" + } + }, + { + "cell_type": "code", + "source": [ + "model_id = \"cross-encoder/ms-marco-MiniLM-L-6-v2\"\n", + "\n", + "!eland_import_hub_model \\\n", + " --cloud-id $cloud_id \\\n", + " --es-api-key $api_key \\\n", + " --hub-model-id $model_id \\\n", + " --task-type text_similarity" + ], + "metadata": { + "id": "J2MTEYrUfk9R" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Create Inference Endpoint\n", + "Run this cell to:\n", + "- Create an inference Endpoint\n", + "- Deploy the reranking model we impoted in the previous section\n", + "We need to create an endpoint queries can use for reranking\n", + "\n", + "Key points about the `model_config`\n", + "- `service` - in this case `elasticsearch` will tell the inference API to use a locally hosted (in Elasticsearch) model\n", + "- `num_allocations` sets the number of allocations to 1\n", + " - Allocations are independent units of work for NLP tasks. Scaling this allows for an increase in concurrent throughput\n", + "- `num_threads` - sets the number of threads per allocation to 1\n", + " - Threads per allocation affect the number of threads used by each allocation during inference. Scaling this generally increased the speed of inference requests (to a point).\n", + "- `model_id` - This is the id of the model as it is named in Elasticsearch\n", + "\n" + ], + "metadata": { + "id": "-rrQV6SAgWz8" + } }, - "id": "yTTNYCYcBtll", - "outputId": "0f0af538-13fd-4e62-e3d2-1dd185689904" - }, - "execution_count": null, - "outputs": [ { - "output_type": "execute_result", - "data": { - "text/plain": [ - "" + "cell_type": "code", + "source": [ + "model_config = {\n", + " \"service\": \"elasticsearch\",\n", + " \"service_settings\": {\n", + " \"num_allocations\": 1,\n", + " \"num_threads\": 1,\n", + " \"model_id\": \"cross-encoder__ms-marco-minilm-l-6-v2\",\n", + " },\n", + " \"task_settings\": {\"return_documents\": True},\n", + "}\n", + "\n", + "inference_id = \"semantic-reranking\"\n", + "\n", + "create_endpoint = es.inference.put(\n", + " inference_id=inference_id, task_type=\"rerank\", body=model_config\n", + ")\n", + "\n", + "create_endpoint.body" ], - "text/html": [ - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Comparison of Scored and Semantic Reranked Results - Query: 'sparse vector embedding'
 Scored ResultsReranked Results
0Compact Speaker Embedding: lrx-vectorScaling Up Sparse Support Vector Machines by Simultaneous Feature and\n", - " Sample Reduction
1Quantum Sparse Support Vector MachinesSpaceland Embedding of Sparse Stochastic Graphs
2Sparse Support Vector Infinite PushElliptical Ordinal Embedding
3The Sparse Vector Technique, RevisitedMinimum-Distortion Embedding
4L-Vector: Neural Label Embedding for Domain AdaptationFree Gap Information from the Differentially Private Sparse Vector and\n", - " Noisy Max Mechanisms
5Spaceland Embedding of Sparse Stochastic GraphsInterpolated Discretized Embedding of Single Vectors and Vector Pairs\n", - " for Classification, Metric Learning and Distance Approximation
6Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector\n", - " CorrelationAttention Word Embedding
7Stable Sparse Subspace Embedding for Dimensionality ReductionBinary Speaker Embedding
8Auto-weighted Mutli-view Sparse Reconstructive EmbeddingNetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization
9Embedding Words in Non-Vector Space with Unsupervised Graph LearningEstimating Vector Fields on Manifolds and the Embedding of Directed\n", - " Graphs
\n" + "metadata": { + "id": "Abu084BYgWCE", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "d58cc940-281e-4d56-d6e6-040e4881e78a" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'inference_id': 'semantic-reranking',\n", + " 'task_type': 'rerank',\n", + " 'service': 'elasticsearch',\n", + " 'service_settings': {'num_allocations': 1,\n", + " 'num_threads': 1,\n", + " 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},\n", + " 'task_settings': {'return_documents': True}}" + ] + }, + "metadata": {}, + "execution_count": 6 + } ] - }, - "metadata": {}, - "execution_count": 17 - } - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Print out Title and Abstract\n", - "This will print the title and the abstract for the top 10 results after semantic reranking." - ], - "metadata": { - "id": "A0HyNZoWyeun" - } - }, - { - "cell_type": "code", - "source": [ - "for paper in response_scored.body[\"hits\"][\"hits\"]:\n", - " print(\n", - " f\"Title {paper['fields']['title'][0]} \\n Abstract: {paper['fields']['abstract'][0]}\"\n", - " )" - ], - "metadata": { - "id": "4ZEx-46rn3in", - "colab": { - "base_uri": "https://localhost:8080/" }, - "outputId": "e83318ad-ca42-4aa7-98d4-37c4428eb70a" - }, - "execution_count": null, - "outputs": [ { - "output_type": "stream", - "name": "stdout", - "text": [ - "Title Compact Speaker Embedding: lrx-vector \n", - " Abstract: Deep neural networks (DNN) have recently been widely used in speaker\n", - "recognition systems, achieving state-of-the-art performance on various\n", - "benchmarks. The x-vector architecture is especially popular in this research\n", - "community, due to its excellent performance and manageable computational\n", - "complexity. In this paper, we present the lrx-vector system, which is the\n", - "low-rank factorized version of the x-vector embedding network. The primary\n", - "objective of this topology is to further reduce the memory requirement of the\n", - "speaker recognition system. We discuss the deployment of knowledge distillation\n", - "for training the lrx-vector system and compare against low-rank factorization\n", - "with SVD. On the VOiCES 2019 far-field corpus we were able to reduce the\n", - "weights by 28% compared to the full-rank x-vector system while keeping the\n", - "recognition rate constant (1.83% EER).\n", - "\n", - "Title Quantum Sparse Support Vector Machines \n", - " Abstract: We analyze the computational complexity of Quantum Sparse Support Vector\n", - "Machine, a linear classifier that minimizes the hinge loss and the $L_1$ norm\n", - "of the feature weights vector and relies on a quantum linear programming solver\n", - "instead of a classical solver. Sparse SVM leads to sparse models that use only\n", - "a small fraction of the input features in making decisions, and is especially\n", - "useful when the total number of features, $p$, approaches or exceeds the number\n", - "of training samples, $m$. We prove a $\\Omega(m)$ worst-case lower bound for\n", - "computational complexity of any quantum training algorithm relying on black-box\n", - "access to training samples; quantum sparse SVM has at least linear worst-case\n", - "complexity. 
However, we prove that there are realistic scenarios in which a\n", - "sparse linear classifier is expected to have high accuracy, and can be trained\n", - "in sublinear time in terms of both the number of training samples and the\n", - "number of features.\n", - "\n", - "Title Sparse Support Vector Infinite Push \n", - " Abstract: In this paper, we address the problem of embedded feature selection for\n", - "ranking on top of the list problems. We pose this problem as a regularized\n", - "empirical risk minimization with $p$-norm push loss function ($p=\\infty$) and\n", - "sparsity inducing regularizers. We leverage the issues related to this\n", - "challenging optimization problem by considering an alternating direction method\n", - "of multipliers algorithm which is built upon proximal operators of the loss\n", - "function and the regularizer. Our main technical contribution is thus to\n", - "provide a numerical scheme for computing the infinite push loss function\n", - "proximal operator. Experimental results on toy, DNA microarray and BCI problems\n", - "show how our novel algorithm compares favorably to competitors for ranking on\n", - "top while using fewer variables in the scoring function.\n", - "\n", - "Title The Sparse Vector Technique, Revisited \n", - " Abstract: We revisit one of the most basic and widely applicable techniques in the\n", - "literature of differential privacy - the sparse vector technique [Dwork et al.,\n", - "STOC 2009]. This simple algorithm privately tests whether the value of a given\n", - "query on a database is close to what we expect it to be. It allows to ask an\n", - "unbounded number of queries as long as the answer is close to what we expect,\n", - "and halts following the first query for which this is not the case.\n", - " We suggest an alternative, equally simple, algorithm that can continue\n", - "testing queries as long as any single individual does not contribute to the\n", - "answer of too many queries whose answer deviates substantially form what we\n", - "expect. Our analysis is subtle and some of its ingredients may be more widely\n", - "applicable. In some cases our new algorithm allows to privately extract much\n", - "more information from the database than the original.\n", - " We demonstrate this by applying our algorithm to the shifting heavy-hitters\n", - "problem: On every time step, each of $n$ users gets a new input, and the task\n", - "is to privately identify all the current heavy-hitters. That is, on time step\n", - "$i$, the goal is to identify all data elements $x$ such that many of the users\n", - "have $x$ as their current input. We present an algorithm for this problem with\n", - "improved error guarantees over what can be obtained using existing techniques.\n", - "Specifically, the error of our algorithm depends on the maximal number of times\n", - "that a single user holds a heavy-hitter as input, rather than the total number\n", - "of times in which a heavy-hitter exists.\n", - "\n", - "Title L-Vector: Neural Label Embedding for Domain Adaptation \n", - " Abstract: We propose a novel neural label embedding (NLE) scheme for the domain\n", - "adaptation of a deep neural network (DNN) acoustic model with unpaired data\n", - "samples from source and target domains. With NLE method, we distill the\n", - "knowledge from a powerful source-domain DNN into a dictionary of label\n", - "embeddings, or l-vectors, one for each senone class. 
Each l-vector is a\n", - "representation of the senone-specific output distributions of the source-domain\n", - "DNN and is learned to minimize the average L2, Kullback-Leibler (KL) or\n", - "symmetric KL distance to the output vectors with the same label through simple\n", - "averaging or standard back-propagation. During adaptation, the l-vectors serve\n", - "as the soft targets to train the target-domain model with cross-entropy loss.\n", - "Without parallel data constraint as in the teacher-student learning, NLE is\n", - "specially suited for the situation where the paired target-domain data cannot\n", - "be simulated from the source-domain data. We adapt a 6400 hours\n", - "multi-conditional US English acoustic model to each of the 9 accented English\n", - "(80 to 830 hours) and kids' speech (80 hours). NLE achieves up to 14.1%\n", - "relative word error rate reduction over direct re-training with one-hot labels.\n", - "\n", - "Title Spaceland Embedding of Sparse Stochastic Graphs \n", - " Abstract: We introduce a nonlinear method for directly embedding large, sparse,\n", - "stochastic graphs into low-dimensional spaces, without requiring vertex\n", - "features to reside in, or be transformed into, a metric space. Graph data and\n", - "models are prevalent in real-world applications. Direct graph embedding is\n", - "fundamental to many graph analysis tasks, in addition to graph visualization.\n", - "We name the novel approach SG-t-SNE, as it is inspired by and builds upon the\n", - "core principle of t-SNE, a widely used method for nonlinear dimensionality\n", - "reduction and data visualization. We also introduce t-SNE-$\\Pi$, a\n", - "high-performance software for 2D, 3D embedding of large sparse graphs on\n", - "personal computers with superior efficiency. It empowers SG-t-SNE with modern\n", - "computing techniques for exploiting in tandem both matrix structures and memory\n", - "architectures. We present elucidating embedding results on one synthetic graph\n", - "and four real-world networks.\n", - "\n", - "Title Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector\n", - " Correlation \n", - " Abstract: This work discusses the problem of sparse signal recovery when there is\n", - "correlation among the values of non-zero entries. We examine intra-vector\n", - "correlation in the context of the block sparse model and inter-vector\n", - "correlation in the context of the multiple measurement vector model, as well as\n", - "their combination. Algorithms based on the sparse Bayesian learning are\n", - "presented and the benefits of incorporating correlation at the algorithm level\n", - "are discussed. The impact of correlation on the limits of support recovery is\n", - "also discussed highlighting the different impact intra-vector and inter-vector\n", - "correlations have on such limits.\n", - "\n", - "Title Stable Sparse Subspace Embedding for Dimensionality Reduction \n", - " Abstract: Sparse random projection (RP) is a popular tool for dimensionality reduction\n", - "that shows promising performance with low computational complexity. However, in\n", - "the existing sparse RP matrices, the positions of non-zero entries are usually\n", - "randomly selected. Although they adopt uniform sampling with replacement, due\n", - "to large sampling variance, the number of non-zeros is uneven among rows of the\n", - "projection matrix which is generated in one trial, and more data information\n", - "may be lost after dimension reduction. 
To break this bottleneck, based on\n", - "random sampling without replacement in statistics, this paper builds a stable\n", - "sparse subspace embedded matrix (S-SSE), in which non-zeros are uniformly\n", - "distributed. It is proved that the S-SSE is stabler than the existing matrix,\n", - "and it can maintain Euclidean distance between points well after dimension\n", - "reduction. Our empirical studies corroborate our theoretical findings and\n", - "demonstrate that our approach can indeed achieve satisfactory performance.\n", - "\n", - "Title Auto-weighted Mutli-view Sparse Reconstructive Embedding \n", - " Abstract: With the development of multimedia era, multi-view data is generated in\n", - "various fields. Contrast with those single-view data, multi-view data brings\n", - "more useful information and should be carefully excavated. Therefore, it is\n", - "essential to fully exploit the complementary information embedded in multiple\n", - "views to enhance the performances of many tasks. Especially for those\n", - "high-dimensional data, how to develop a multi-view dimension reduction\n", - "algorithm to obtain the low-dimensional representations is of vital importance\n", - "but chanllenging. In this paper, we propose a novel multi-view dimensional\n", - "reduction algorithm named Auto-weighted Mutli-view Sparse Reconstructive\n", - "Embedding (AMSRE) to deal with this problem. AMSRE fully exploits the sparse\n", - "reconstructive correlations between features from multiple views. Furthermore,\n", - "it is equipped with an auto-weighted technique to treat multiple views\n", - "discriminatively according to their contributions. Various experiments have\n", - "verified the excellent performances of the proposed AMSRE.\n", - "\n", - "Title Embedding Words in Non-Vector Space with Unsupervised Graph Learning \n", - " Abstract: It has become a de-facto standard to represent words as elements of a vector\n", - "space (word2vec, GloVe). While this approach is convenient, it is unnatural for\n", - "language: words form a graph with a latent hierarchical structure, and this\n", - "structure has to be revealed and encoded by word embeddings. We introduce\n", - "GraphGlove: unsupervised graph word representations which are learned\n", - "end-to-end. In our setting, each word is a node in a weighted graph and the\n", - "distance between words is the shortest path distance between the corresponding\n", - "nodes. We adopt a recent method learning a representation of data in the form\n", - "of a differentiable weighted graph and use it to modify the GloVe training\n", - "algorithm. We show that our graph-based representations substantially\n", - "outperform vector-based methods on word similarity and analogy tasks. 
Our\n", - "analysis reveals that the structure of the learned graphs is hierarchical and\n", - "similar to that of WordNet, the geometry is highly non-trivial and contains\n", - "subgraphs with different local topology.\n", - "\n" - ] + "cell_type": "markdown", + "source": [ + "### Verify it was created\n", + "\n", + "- Run the two cells in this section to verify:\n", + "- The Inference Endpoint has been completed\n", + "- The model has been deployed\n", + "\n", + "You should see JSON output with information about the semantic endpoint" + ], + "metadata": { + "id": "X8rQXMrHhMkS" + } + }, + { + "cell_type": "code", + "source": [ + "check_endpoint = es.inference.get(\n", + " inference_id=inference_id,\n", + ")\n", + "\n", + "check_endpoint.body" + ], + "metadata": { + "id": "n3Yk7rgYhP-N", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "d9e68225-5796-411e-964a-6db3be5541aa" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'endpoints': [{'inference_id': 'semantic-reranking',\n", + " 'task_type': 'rerank',\n", + " 'service': 'elasticsearch',\n", + " 'service_settings': {'num_allocations': 1,\n", + " 'num_threads': 1,\n", + " 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},\n", + " 'task_settings': {'return_documents': True}}]}" + ] + }, + "metadata": {}, + "execution_count": 7 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Create the index mapping\n", + "\n", + "We are going to index the `title` and `abstract` from the dataset. " + ], + "metadata": { + "id": "4vqimyNWAhWb" + } + }, + { + "cell_type": "code", + "source": [ + "index_name = \"arxiv-papers\"\n", + "\n", + "index_mapping = {\n", + " \"mappings\": {\n", + " \"properties\": {\"title\": {\"type\": \"text\"}, \"abstract\": {\"type\": \"text\"}}\n", + " }\n", + "}\n", + "\n", + "\n", + "try:\n", + " es.indices.create(index=index_name, body=index_mapping)\n", + " print(f\"Index '{index_name}' created successfully.\")\n", + "except exceptions.RequestError as e:\n", + " if e.error == \"resource_already_exists_exception\":\n", + " print(f\"Index '{index_name}' already exists.\")\n", + " else:\n", + " print(f\"Error creating index '{index_name}': {e}\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "4bc9ba0b-9be3-410a-d1d6-f1da04bbfec7", + "id": "DPADF_7ytTmR" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Index 'arxiv-papers' created successfully.\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Ready the dataset\n", + "We are going to use the [CShorten/ML-ArXiv-Papers](https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers) dataset." + ], + "metadata": { + "id": "FqQmaT5P-Nhx" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Download Dataset\n", + "**Note** You may get a warning *The secret `HF_TOKEN` does not exist in your Colab secrets*.\n", + "\n", + "You can safely ignore this." 
+ ], + "metadata": { + "id": "aN0dbYO7oB47" + } + }, + { + "cell_type": "code", + "source": [ + "dataset = load_dataset(\"CShorten/ML-ArXiv-Papers\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IVnpj5bBoEBL", + "outputId": "bc6371d9-d66f-482c-95f8-8cdb89e4f0ef" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", + "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", + "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", + "You will be able to reuse this secret in all of your notebooks.\n", + "Please note that authentication is recommended but still optional to access public models or datasets.\n", + " warnings.warn(\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Index into Elasticsearch\n", + "\n", + "We will loop through the dataset and send batches of rows to Elasticsearch.\n", + "- This may take a couple of minutes depending on your cluster sizing." + ], + "metadata": { + "id": "GQxDITCpAKWb" + } + }, + { + "cell_type": "code", + "source": [ + "def bulk_insert_elasticsearch(dataset, index_name, chunk_size=1000):\n", + " actions = []\n", + " for record in dataset:\n", + " action = {\n", + " \"_index\": index_name,\n", + " \"_source\": {\"title\": record[\"title\"], \"abstract\": record[\"abstract\"]},\n", + " }\n", + " actions.append(action)\n", + "\n", + " if len(actions) == chunk_size:\n", + " helpers.bulk(es, actions)\n", + " actions = []\n", + "\n", + " if actions:\n", + " helpers.bulk(es, actions)\n", + "\n", + "\n", + "bulk_insert_elasticsearch(dataset[\"train\"], index_name)" + ], + "metadata": { + "id": "tDZ0qEbW-ozW" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Query with Reranking\n", + "\n", + "This contains a `text_similarity_reranker` retriever, which:\n", + "\n", + "- Uses a standard retriever to:\n", + " - Perform a lexical query against the `title` field\n", + "- Performs a reranking:\n", + " - Takes as input the top 100 results from the previous search\n", + " - `rank_window_size`: 100\n", + " - Takes as input the query\n", + " - `inference_text`: query\n", + "- Uses our previously created reranking API and model\n", + "\n" + ], + "metadata": { + "id": "2bwvzLfRjJ2n" + } + }, + { + "cell_type": "code", + "source": [ + "query = \"sparse vector embedding\"\n", + "\n", + "# Lexical query, ranked by BM25 score\n", + "response_scored = es.search(\n", + " index=\"arxiv-papers\",\n", + " body={\n", + " \"size\": 10,\n", + " \"retriever\": {\"standard\": {\"query\": {\"match\": {\"title\": query}}}},\n", + " \"fields\": [\"title\", \"abstract\"],\n", + " \"_source\": False,\n", + " },\n", + ")\n", + "\n", + "# Query with Semantic Reranker\n", + "response_reranked = es.search(\n", + " index=\"arxiv-papers\",\n", + " body={\n", + " \"size\": 10,\n", + " \"retriever\": {\n", + " \"text_similarity_reranker\": {\n", + " \"retriever\": {\"standard\": {\"query\": {\"match\": {\"title\": query}}}},\n", + " \"field\": \"abstract\",\n", + " \"inference_id\": \"semantic-reranking\",\n", + " \"inference_text\": query,\n", + " \"rank_window_size\": 100,\n", + " }\n", + " },\n", + " \"fields\": [\"title\", \"abstract\"],\n", + " \"_source\": False,\n", + " },\n", + ")" + ], + 
"metadata": { + "id": "HWXQBS35jQ3n" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Print the table comparing the scored and reranked results" + ], + "metadata": { + "id": "Hnam80Irbj6a" + } + }, + { + "cell_type": "code", + "source": [ + "titles_scored = [\n", + " paper[\"fields\"][\"title\"][0] for paper in response_scored.body[\"hits\"][\"hits\"]\n", + "]\n", + "titles_reranked = [\n", + " paper[\"fields\"][\"title\"][0] for paper in response_reranked.body[\"hits\"][\"hits\"]\n", + "]\n", + "\n", + "# Creating a DataFrame\n", + "df = pd.DataFrame(\n", + " {\"Scored Results\": titles_scored, \"Reranked Results\": titles_reranked}\n", + ")\n", + "\n", + "df_styled = df.style.set_properties(**{\"text-align\": \"left\"}).set_caption(\n", + " f\"Comparison of Scored and Semantic Reranked Results - Query: '{query}'\"\n", + ")\n", + "\n", + "# Display the table\n", + "df_styled" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 415 + }, + "id": "yTTNYCYcBtll", + "outputId": "0f0af538-13fd-4e62-e3d2-1dd185689904" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Comparison of Scored and Semantic Reranked Results - Query: 'sparse vector embedding'
 Scored ResultsReranked Results
0Compact Speaker Embedding: lrx-vectorScaling Up Sparse Support Vector Machines by Simultaneous Feature and\n", + " Sample Reduction
1Quantum Sparse Support Vector MachinesSpaceland Embedding of Sparse Stochastic Graphs
2Sparse Support Vector Infinite PushElliptical Ordinal Embedding
3The Sparse Vector Technique, RevisitedMinimum-Distortion Embedding
4L-Vector: Neural Label Embedding for Domain AdaptationFree Gap Information from the Differentially Private Sparse Vector and\n", + " Noisy Max Mechanisms
5Spaceland Embedding of Sparse Stochastic GraphsInterpolated Discretized Embedding of Single Vectors and Vector Pairs\n", + " for Classification, Metric Learning and Distance Approximation
6Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector\n", + " CorrelationAttention Word Embedding
7Stable Sparse Subspace Embedding for Dimensionality ReductionBinary Speaker Embedding
8Auto-weighted Mutli-view Sparse Reconstructive EmbeddingNetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization
9Embedding Words in Non-Vector Space with Unsupervised Graph LearningEstimating Vector Fields on Manifolds and the Embedding of Directed\n", + " Graphs
\n" + ] + }, + "metadata": {}, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Print out Title and Abstract\n", + "This will print the title and the abstract for the top 10 results after semantic reranking." + ], + "metadata": { + "id": "A0HyNZoWyeun" + } + }, + { + "cell_type": "code", + "source": [ + "for paper in response_reranked.body[\"hits\"][\"hits\"]:\n", + " print(\n", + " f\"Title {paper['fields']['title'][0]} \\n Abstract: {paper['fields']['abstract'][0]}\"\n", + " )" + ], + "metadata": { + "id": "4ZEx-46rn3in", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "e83318ad-ca42-4aa7-98d4-37c4428eb70a" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Title Compact Speaker Embedding: lrx-vector \n", + " Abstract: Deep neural networks (DNN) have recently been widely used in speaker\n", + "recognition systems, achieving state-of-the-art performance on various\n", + "benchmarks. The x-vector architecture is especially popular in this research\n", + "community, due to its excellent performance and manageable computational\n", + "complexity. In this paper, we present the lrx-vector system, which is the\n", + "low-rank factorized version of the x-vector embedding network. The primary\n", + "objective of this topology is to further reduce the memory requirement of the\n", + "speaker recognition system. We discuss the deployment of knowledge distillation\n", + "for training the lrx-vector system and compare against low-rank factorization\n", + "with SVD. On the VOiCES 2019 far-field corpus we were able to reduce the\n", + "weights by 28% compared to the full-rank x-vector system while keeping the\n", + "recognition rate constant (1.83% EER).\n", + "\n", + "Title Quantum Sparse Support Vector Machines \n", + " Abstract: We analyze the computational complexity of Quantum Sparse Support Vector\n", + "Machine, a linear classifier that minimizes the hinge loss and the $L_1$ norm\n", + "of the feature weights vector and relies on a quantum linear programming solver\n", + "instead of a classical solver. Sparse SVM leads to sparse models that use only\n", + "a small fraction of the input features in making decisions, and is especially\n", + "useful when the total number of features, $p$, approaches or exceeds the number\n", + "of training samples, $m$. We prove a $\\Omega(m)$ worst-case lower bound for\n", + "computational complexity of any quantum training algorithm relying on black-box\n", + "access to training samples; quantum sparse SVM has at least linear worst-case\n", + "complexity. However, we prove that there are realistic scenarios in which a\n", + "sparse linear classifier is expected to have high accuracy, and can be trained\n", + "in sublinear time in terms of both the number of training samples and the\n", + "number of features.\n", + "\n", + "Title Sparse Support Vector Infinite Push \n", + " Abstract: In this paper, we address the problem of embedded feature selection for\n", + "ranking on top of the list problems. We pose this problem as a regularized\n", + "empirical risk minimization with $p$-norm push loss function ($p=\\infty$) and\n", + "sparsity inducing regularizers. We leverage the issues related to this\n", + "challenging optimization problem by considering an alternating direction method\n", + "of multipliers algorithm which is built upon proximal operators of the loss\n", + "function and the regularizer. 
Our main technical contribution is thus to\n", + "provide a numerical scheme for computing the infinite push loss function\n", + "proximal operator. Experimental results on toy, DNA microarray and BCI problems\n", + "show how our novel algorithm compares favorably to competitors for ranking on\n", + "top while using fewer variables in the scoring function.\n", + "\n", + "Title The Sparse Vector Technique, Revisited \n", + " Abstract: We revisit one of the most basic and widely applicable techniques in the\n", + "literature of differential privacy - the sparse vector technique [Dwork et al.,\n", + "STOC 2009]. This simple algorithm privately tests whether the value of a given\n", + "query on a database is close to what we expect it to be. It allows to ask an\n", + "unbounded number of queries as long as the answer is close to what we expect,\n", + "and halts following the first query for which this is not the case.\n", + " We suggest an alternative, equally simple, algorithm that can continue\n", + "testing queries as long as any single individual does not contribute to the\n", + "answer of too many queries whose answer deviates substantially form what we\n", + "expect. Our analysis is subtle and some of its ingredients may be more widely\n", + "applicable. In some cases our new algorithm allows to privately extract much\n", + "more information from the database than the original.\n", + " We demonstrate this by applying our algorithm to the shifting heavy-hitters\n", + "problem: On every time step, each of $n$ users gets a new input, and the task\n", + "is to privately identify all the current heavy-hitters. That is, on time step\n", + "$i$, the goal is to identify all data elements $x$ such that many of the users\n", + "have $x$ as their current input. We present an algorithm for this problem with\n", + "improved error guarantees over what can be obtained using existing techniques.\n", + "Specifically, the error of our algorithm depends on the maximal number of times\n", + "that a single user holds a heavy-hitter as input, rather than the total number\n", + "of times in which a heavy-hitter exists.\n", + "\n", + "Title L-Vector: Neural Label Embedding for Domain Adaptation \n", + " Abstract: We propose a novel neural label embedding (NLE) scheme for the domain\n", + "adaptation of a deep neural network (DNN) acoustic model with unpaired data\n", + "samples from source and target domains. With NLE method, we distill the\n", + "knowledge from a powerful source-domain DNN into a dictionary of label\n", + "embeddings, or l-vectors, one for each senone class. Each l-vector is a\n", + "representation of the senone-specific output distributions of the source-domain\n", + "DNN and is learned to minimize the average L2, Kullback-Leibler (KL) or\n", + "symmetric KL distance to the output vectors with the same label through simple\n", + "averaging or standard back-propagation. During adaptation, the l-vectors serve\n", + "as the soft targets to train the target-domain model with cross-entropy loss.\n", + "Without parallel data constraint as in the teacher-student learning, NLE is\n", + "specially suited for the situation where the paired target-domain data cannot\n", + "be simulated from the source-domain data. We adapt a 6400 hours\n", + "multi-conditional US English acoustic model to each of the 9 accented English\n", + "(80 to 830 hours) and kids' speech (80 hours). 
NLE achieves up to 14.1%\n", + "relative word error rate reduction over direct re-training with one-hot labels.\n", + "\n", + "Title Spaceland Embedding of Sparse Stochastic Graphs \n", + " Abstract: We introduce a nonlinear method for directly embedding large, sparse,\n", + "stochastic graphs into low-dimensional spaces, without requiring vertex\n", + "features to reside in, or be transformed into, a metric space. Graph data and\n", + "models are prevalent in real-world applications. Direct graph embedding is\n", + "fundamental to many graph analysis tasks, in addition to graph visualization.\n", + "We name the novel approach SG-t-SNE, as it is inspired by and builds upon the\n", + "core principle of t-SNE, a widely used method for nonlinear dimensionality\n", + "reduction and data visualization. We also introduce t-SNE-$\\Pi$, a\n", + "high-performance software for 2D, 3D embedding of large sparse graphs on\n", + "personal computers with superior efficiency. It empowers SG-t-SNE with modern\n", + "computing techniques for exploiting in tandem both matrix structures and memory\n", + "architectures. We present elucidating embedding results on one synthetic graph\n", + "and four real-world networks.\n", + "\n", + "Title Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector\n", + " Correlation \n", + " Abstract: This work discusses the problem of sparse signal recovery when there is\n", + "correlation among the values of non-zero entries. We examine intra-vector\n", + "correlation in the context of the block sparse model and inter-vector\n", + "correlation in the context of the multiple measurement vector model, as well as\n", + "their combination. Algorithms based on the sparse Bayesian learning are\n", + "presented and the benefits of incorporating correlation at the algorithm level\n", + "are discussed. The impact of correlation on the limits of support recovery is\n", + "also discussed highlighting the different impact intra-vector and inter-vector\n", + "correlations have on such limits.\n", + "\n", + "Title Stable Sparse Subspace Embedding for Dimensionality Reduction \n", + " Abstract: Sparse random projection (RP) is a popular tool for dimensionality reduction\n", + "that shows promising performance with low computational complexity. However, in\n", + "the existing sparse RP matrices, the positions of non-zero entries are usually\n", + "randomly selected. Although they adopt uniform sampling with replacement, due\n", + "to large sampling variance, the number of non-zeros is uneven among rows of the\n", + "projection matrix which is generated in one trial, and more data information\n", + "may be lost after dimension reduction. To break this bottleneck, based on\n", + "random sampling without replacement in statistics, this paper builds a stable\n", + "sparse subspace embedded matrix (S-SSE), in which non-zeros are uniformly\n", + "distributed. It is proved that the S-SSE is stabler than the existing matrix,\n", + "and it can maintain Euclidean distance between points well after dimension\n", + "reduction. Our empirical studies corroborate our theoretical findings and\n", + "demonstrate that our approach can indeed achieve satisfactory performance.\n", + "\n", + "Title Auto-weighted Mutli-view Sparse Reconstructive Embedding \n", + " Abstract: With the development of multimedia era, multi-view data is generated in\n", + "various fields. Contrast with those single-view data, multi-view data brings\n", + "more useful information and should be carefully excavated. 
Therefore, it is\n", + "essential to fully exploit the complementary information embedded in multiple\n", + "views to enhance the performances of many tasks. Especially for those\n", + "high-dimensional data, how to develop a multi-view dimension reduction\n", + "algorithm to obtain the low-dimensional representations is of vital importance\n", + "but chanllenging. In this paper, we propose a novel multi-view dimensional\n", + "reduction algorithm named Auto-weighted Mutli-view Sparse Reconstructive\n", + "Embedding (AMSRE) to deal with this problem. AMSRE fully exploits the sparse\n", + "reconstructive correlations between features from multiple views. Furthermore,\n", + "it is equipped with an auto-weighted technique to treat multiple views\n", + "discriminatively according to their contributions. Various experiments have\n", + "verified the excellent performances of the proposed AMSRE.\n", + "\n", + "Title Embedding Words in Non-Vector Space with Unsupervised Graph Learning \n", + " Abstract: It has become a de-facto standard to represent words as elements of a vector\n", + "space (word2vec, GloVe). While this approach is convenient, it is unnatural for\n", + "language: words form a graph with a latent hierarchical structure, and this\n", + "structure has to be revealed and encoded by word embeddings. We introduce\n", + "GraphGlove: unsupervised graph word representations which are learned\n", + "end-to-end. In our setting, each word is a node in a weighted graph and the\n", + "distance between words is the shortest path distance between the corresponding\n", + "nodes. We adopt a recent method learning a representation of data in the form\n", + "of a differentiable weighted graph and use it to modify the GloVe training\n", + "algorithm. We show that our graph-based representations substantially\n", + "outperform vector-based methods on word similarity and analogy tasks. Our\n", + "analysis reveals that the structure of the learned graphs is hierarchical and\n", + "similar to that of WordNet, the geometry is highly non-trivial and contains\n", + "subgraphs with different local topology.\n", + "\n" + ] + } + ] } - ] - } - ] + ] } \ No newline at end of file