Docs: Quickstart updates + Tracing (#1244)

elenasamuylova authored Aug 13, 2024
1 parent 677d5de commit f39eeeb
Showing 28 changed files with 1,414 additions and 421 deletions.
Binary file added docs/book/.gitbook/assets/cloud/qs_denials.png
50 changes: 39 additions & 11 deletions docs/book/README.md
@@ -7,11 +7,45 @@ Evidently is available both as an open-source Python library and Evidently Cloud

# Get started

Choose a Quickstart (1-2min) or a Tutorial (15 min) to start.

<table data-view="cards"><thead><tr><th></th><th></th><th></th></tr></thead><tbody><tr><td><strong></strong><strong>LLM evaluations</strong><strong></strong></td><td>Run checks for text data and generative LLM outputs.</td><td><p><a href="get-started/quickstart-llm.md">→ LLM Quickstart</a><br><a href="get-started/tutorial-llm.md">→ LLM Tutorial</a></p></td></tr><tr><td><strong></strong><strong>Tabular data checks</strong><strong></strong></td><td>Run evaluations for tabular data and ML models.</td><td><p><a href="get-started/hello-world.md">→ Tabular Quickstart</a><br><a href="get-started/tutorial.md">→ Tabular Tutorial</a></p></td></tr><tr><td><strong></strong><strong>Monitoring Dashboard</strong><strong></strong></td><td>Get a live dashboard to track evaluation results over time.</td><td><p><a href="get-started/quickstart-cloud.md">→ Monitoring Quickstart</a><br><a href="get-started/tutorial-cloud.md">→ Monitoring Tutorial</a></p></td></tr></tbody></table>

You can explore more code [examples](examples/examples.md).
<table data-card-size="large" data-view="cards">
<thead>
<tr>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>
<strong>Evidently Cloud</strong>
</td>
<td>
AI evaluation and observability platform built on top of the Evidently Python library. Includes advanced features, collaboration, and support.
</td>
<td>
<a href="get-started/cloud_quickstart_llm.md">→ LLM Evals Quickstart</a><br>
<a href="get-started/cloud_quickstart_tabular.md">→ Data and ML Quickstart</a><br>
<a href="get-started/cloud_quickstart_tracing.md">→ LLM Tracing Quickstart</a>
</td>
</tr>
<tr>
<td>
<strong>Evidently Open-Source</strong>
</td>
<td>
An open-source Python library with 20m+ downloads. Helps evaluate, test and monitor data, ML and LLM-powered systems.
</td>
<td>
<a href="get-started/oss_quickstart_llm.md">→ LLM Evals Quickstart</a><br>
<a href="get-started/oss_quickstart_tabular.md">→ Data and ML Quickstart</a><br>
<a href="examples/tutorial-monitoring.md">→ Self-hosted dashboard</a>
</td>
</tr>
</tbody>
</table>

You can explore more in-depth [Examples and Tutorials](examples/examples.md).

# How it works

@@ -46,8 +80,6 @@ You can be as hands-off or hands-on as you like: start with Presets, and customi
* Configure alerts when metrics are out of bounds.

**Docs**:
* [Quickstart - LLM and text evals](get-started/quickstart-llm.md)
* [Quickstart - ML and tabular](get-started/hello-world.md)
* [Reference: available Metrics](reference/all-metrics.md)
* [User guide: how to get Reports](tests-and-reports/get-reports.md)
</details>
@@ -77,8 +109,6 @@ This interface helps automate your evaluations for regression testing, checks du
* Configure alerts on failed Tests.

**Docs**:
* [Tutorial - LLM and text evals](get-started/tutorial-llm.md)
* [Quickstart - ML and tabular](get-started/tutorial.md)
* [Reference: available Tests](reference/all-tests.md)
* [User guide: how to generate Tests](tests-and-reports/run-tests.md)
</details>
@@ -104,8 +134,6 @@ You can use Evidently Cloud or self-host. Evidently Cloud offers extra features
* For Evidently Cloud: get pre-built Tabs and manage everything in the UI.

**Docs**:
* [Get Started - Evidently Cloud](get-started/tutorial-cloud.md)
* [Get Started - Self-hosting](get-started/tutorial-monitoring.md)
* [Monitoring user guide](monitoring/monitoring_overview.md)
</details>

52 changes: 34 additions & 18 deletions docs/book/SUMMARY.md
@@ -3,13 +3,14 @@
* [What is Evidently?](README.md)
* [Core Concepts](introduction/core-concepts.md)
* [Get Started](get-started/README.md)
* [Evidently OSS Quickstart](get-started/hello-world.md)
* [Evidently LLM Quickstart](get-started/quickstart-llm.md)
* [Evidently Cloud Quickstart](get-started/quickstart-cloud.md)
* [Tutorial - Reports and Tests](get-started/tutorial.md)
* [Tutorial - Data & ML Monitoring](get-started/tutorial-cloud.md)
* [Tutorial - LLM Evaluation](get-started/tutorial-llm.md)
* [Self-host ML Monitoring](get-started/tutorial-monitoring.md)
* [Evidently Cloud](get-started/quickstart-cloud.md)
* [Quickstart - LLM tracing](get-started/cloud_quickstart_tracing.md)
* [Quickstart - LLM evaluations](get-started/cloud_quickstart_llm.md)
* [Quickstart - Data and ML checks](get-started/cloud_quickstart_tabular.md)
* [Quickstart - No-code evaluations](get-started/cloud_quickstart_nocode.md)
* [Evidently OSS](get-started/hello-world.md)
* [OSS Quickstart - LLM evals](get-started/oss_quickstart_llm.md)
* [OSS Quickstart - Data and ML monitoring](get-started/oss_quickstart_tabular.md)
* [Presets](presets/README.md)
* [All Presets](presets/all-presets.md)
* [Data Drift](presets/data-drift.md)
@@ -20,22 +21,28 @@
* [NoTargetPerformance](presets/no-target-performance.md)
* [Text Overview](presets/text-overview.md)
* [Recommender System](presets/recsys.md)
* [Examples](examples/examples.md)
* [Integrations](integrations/README.md)
* [Evidently integrations](integrations/evidently-integrations.md)
* [Notebook environments](integrations/notebook-environments.md)
* [Evidently and Airflow](integrations/evidently-and-airflow.md)
* [Evidently and MLflow](integrations/evidently-and-mlflow.md)
* [Evidently and DVCLive](integrations/evidently-and-dvclive.md)
* [Evidently and Metaflow](integrations/evidently-and-metaflow.md)
* [Tutorials and Examples](examples/README.md)
* [All Tutorials](examples/examples.md)
* [Tutorial - Tracing](examples/tutorial_tracing.md)
* [Tutorial - Reports and Tests](examples/tutorial_reports_tests.md)
* [Tutorial - Data & ML Monitoring](examples/tutorial-cloud.md)
* [Tutorial - LLM Evaluation](examples/tutorial-llm.md)
* [Self-host ML Monitoring](examples/tutorial-monitoring.md)

## Setup
* [Installation](installation/install-evidently.md)
* [Evidently Cloud](installation/cloud_account.md)

## User Guide
* [Installation](installation/install-evidently.md)

* [Input data](input-data/README.md)
* [Data requirements](input-data/data-requirements.md)
* [Column mapping](input-data/column-mapping.md)
* [Load data to pandas](input-data/load-data-to-pandas.md)
* [Data for recommendations](input-data/recsys_data.md)
* [Tracing](tracing/README.md)
* [Tracing overview](tracing/tracing_overview.md)
* [Set up tracing](tracing/set_up_tracing.md)
* [Tests and reports](tests-and-reports/README.md)
* [Pre-built reports](tests-and-reports/get-reports.md)
* [Create a custom report](tests-and-reports/custom-report.md)
@@ -57,9 +64,9 @@
* [Data drift parameters](customization/options-for-statistical-tests.md)
* [Embeddings drift parameters](customization/embeddings-drift-parameters.md)
* [Feature importance in data drift](customization/feature-importance.md)
* [Text descriptors parameters](customization/text-descriptors-parameters.md)
* [Text evals with HuggingFace](customization/huggingface_descriptor.md)
* [Text evals with LLM-as-judge](customization/llm_as_a_judge.md)
* [Text evals with HuggingFace](customization/huggingface_descriptor.md)
* [Text descriptors parameters](customization/text-descriptors-parameters.md)
* [Customize JSON output](customization/json-dict-output.md)
* [Show raw data in Reports](customization/report-data-aggregation.md)
* [Add text comments to Reports](customization/text-comments.md)
@@ -93,6 +100,15 @@
* [evidently.tests](api-reference/evidently.tests.md)
* [evidently.utils](api-reference/evidently.utils.md)

## Integrations
* [Integrations](integrations/README.md)
* [Evidently integrations](integrations/evidently-integrations.md)
* [Notebook environments](integrations/notebook-environments.md)
* [Evidently and Airflow](integrations/evidently-and-airflow.md)
* [Evidently and MLflow](integrations/evidently-and-mlflow.md)
* [Evidently and DVCLive](integrations/evidently-and-dvclive.md)
* [Evidently and Metaflow](integrations/evidently-and-metaflow.md)

## SUPPORT

* [Migration](support/migration.md)
172 changes: 154 additions & 18 deletions docs/book/customization/llm_as_a_judge.md
@@ -10,18 +10,119 @@ You can use external LLMs to score your text data. This method lets you evaluate

The LLM “judge” must return a numerical score or a category for each text in a column. You will then be able to view scores, analyze their distribution or run conditional tests through the usual Descriptor interface.

Evidently currently supports scoring data using Open AI LLMs. Use the `OpenAIPrompting()` descriptor to define your prompt and criteria.
Evidently currently supports scoring data using OpenAI LLMs (more LLMs coming soon). Use the `LLMEval` descriptor to define your prompt and criteria, or use one of the built-in evaluators.

# Code example
# LLM Eval

You can refer to an end-to-end example with different Descriptors:
## Code example

{% embed url="https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_evaluate_llm_with_text_descriptors.ipynb" %}
You can refer to a How-to example with different LLM judges:

{% embed url="https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_use_llm_judge_template.ipynb" %}

{% hint style="info" %}
**OpenAI key.** Add your OpenAI API key as an environment variable: [see docs](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety). You will incur costs when running this eval.
{% endhint %}
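
For example, you can set the key before running the eval. A minimal sketch, assuming the standard `OPENAI_API_KEY` variable that the OpenAI client reads (in practice, export it in your shell or use a secrets manager rather than hardcoding it):

```python
import os

# Illustrative only: set the OpenAI API key for the current process.
# Prefer exporting OPENAI_API_KEY in your shell or using a secrets manager.
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
```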

## Built-in templates

You can use built-in evaluation templates. They default to returning a binary category label with reasoning, using the `gpt-4o-mini` model from OpenAI.

Imports:

```python
from evidently.descriptors import LLMEval, NegativityLLMEval, PIILLMEval, DeclineLLMEval
```

To create a Report with these descriptors, simply list them:

```python
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        NegativityLLMEval(),
        PIILLMEval(),
        DeclineLLMEval()
    ])
])
```
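
Having defined the Report, you can run it on your data and render the results. A minimal sketch, assuming a pandas DataFrame with a `response` column; the `Report` and `TextEvals` import paths and the toy data are illustrative:

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import NegativityLLMEval, PIILLMEval, DeclineLLMEval

# Toy dataset with a "response" column to evaluate (illustrative).
eval_df = pd.DataFrame({
    "response": [
        "I'm sorry, I cannot help with that request.",
        "Sure! Here is the summary you asked for.",
    ]
})

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        NegativityLLMEval(),
        PIILLMEval(),
        DeclineLLMEval(),
    ])
])

# Note: this sends each text to the OpenAI API and will incur costs.
report.run(reference_data=None, current_data=eval_df)
report.show()        # render in a notebook
# report.as_dict()   # or export the results programmatically
```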

You can also use parameters to switch the output from `category` to `score` (0 to 1) or to exclude the reasoning:

```python
report = Report(metrics=[
    TextEvals(column_name="question", descriptors=[
        NegativityLLMEval(include_category=False),
        PIILLMEval(include_reasoning=False),
        DeclineLLMEval(include_score=True)
    ])
])
```

{% hint style="info" %}
**Which descriptors are there?** See the list of available built-in descriptors in the [All Metrics](../reference/all-metrics.md) page.
{% endhint %}

## Custom LLM judge

You can also create a custom LLM judge using the provided templates. You can specify the parameters, and Evidently will automatically generate the complete evaluation prompt to send to the LLM together with the evaluation data.

Imports:
```python
from evidently.features.llm_judge import BinaryClassificationPromptTemplate
```

**Binary Classification template**. Example of defining a "conciseness" prompt:

```python
custom_judge = LLMEval(
    subcolumn="category",
    template=BinaryClassificationPromptTemplate(
        criteria="""Conciseness refers to the quality of being brief and to the point, while still providing all necessary information.
A concise response should:
- Provide the necessary information without unnecessary details or repetition.
- Be brief yet comprehensive enough to address the query.
- Use simple and direct language to convey the message effectively.
""",
        target_category="concise",
        non_target_category="verbose",
        uncertainty="unknown",
        include_reasoning=True,
        pre_messages=[("system", "You are a judge which evaluates text.")],
    ),
    provider="openai",
    model="gpt-4o-mini"
)
```
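
Once defined, the custom judge plugs into a Report like any other descriptor. A short sketch, continuing the snippets above (the `eval_df` DataFrame with a `response` column is illustrative):

```python
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        custom_judge
    ])
])

# Runs the "conciseness" judge against each response via the OpenAI API.
report.run(reference_data=None, current_data=eval_df)
report.show()
```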

### LLMEval Parameters

| Parameter | Description |
|--------------|------------------------------------------------------------------|
| `subcolumn` | Specifies the type of descriptor. Available values: `category`, `score`. |
| `template` | Forces a specific template for evaluation. Available: `BinaryClassificationPromptTemplate`.|
| `provider` | The provider of the LLM to be used for evaluation. Available: `openai`. |
| `model` | Specifies the model used for evaluation within the provider, e.g., `gpt-3.5-turbo-instruct`. |

### BinaryClassificationPromptTemplate

| Parameter | Description |
|--------------------|-------------------------------------------------------------------------------------------------------------------------|
| `criteria` | Free-form text defining evaluation criteria. |
| `target_category` | Name of the desired or positive category. |
| `non_target_category` | Name of the undesired or negative category. |
| `uncertainty`      | Name of the category to return when the provided information is not sufficient to make a clear determination. |
| `include_reasoning`| Specifies whether to include reasoning in the classification. Available: `True`, `False`. It will be included with the result. |
| `pre_messages`     | List of system messages that set context or instructions before the evaluation task. For example, you can explain the evaluator role ("you are an expert...") or the context ("your goal is to grade the work of an intern..."). |


# OpenAIPrompting

There is an earlier implementation of this approach with the `OpenAIPrompting` descriptor. See the documentation below.

<details>

<summary>OpenAIPrompting Descriptor</summary>

To import the Descriptor:

```python
@@ -60,17 +161,52 @@ report = Report(metrics=[
```
You can do the same for Test Suites.
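
For instance, you can set a condition on a numerical LLM score. A hedged sketch: the test class, the `.on()` column binding, and the 1-5 scoring prompt are illustrative and may differ by library version:

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMean
from evidently.descriptors import OpenAIPrompting

scoring_prompt = """Rate the following answer for helpfulness on a scale from 1 to 5.
Return a single number and nothing else.

Answer: REPLACE
"""

helpfulness = OpenAIPrompting(
    prompt=scoring_prompt,
    prompt_replace_string="REPLACE",
    feature_type="num",
    model="gpt-3.5-turbo-instruct",
    display_name="Helpfulness (1-5)",
)

# eval_df: a DataFrame with a "response" column, as in the earlier snippets.
# Fail the test if the average helpfulness score drops below 4.
suite = TestSuite(tests=[
    TestColumnValueMean(column_name=helpfulness.on("response"), gte=4),
])
suite.run(reference_data=None, current_data=eval_df)
```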

# Descriptor parameters

| Parameter | Description |
|-------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `prompt: str` | <ul><li>The text of the evaluation prompt that will be sent to the LLM.</li><li>Include at least one placeholder string.</li></ul>|
| `prompt_replace_string: str` | <ul><li> A placeholder string within the prompt that will be replaced by the evaluated text. </li><li> The default string name is "REPLACE".</li></ul>|
| `feature_type: str` | <ul><li> The type of Descriptor the prompt will return. </li><li> Available: `num` (numerical) or `cat` (categorical). </li><li> This affects the statistics and default visualizations.</li></ul>|
| `context_replace_string: str` |<ul><li> An optional placeholder string within the prompt that will be replaced by the additional context. </li><li> The default string name is "CONTEXT".</li></ul>|
| `context: Optional[str]` | <ul><li> Additional context that will be added to the evaluation prompt, that **does not change** between evaluations. </li><li> Examples: a reference document, a set of positive and negative examples etc. </li><li> Pass this context as a string. </li><li> You cannot use `context` and `context_column` simultaneously. </li></ul>|
| `context_column: Optional[str]` | <ul><li> Additional context that will be added to the evaluation prompt, that is **specific to each row**. </li><li> Examples: a chunk of text retrieved from reference documents for a specific query. </li><li> Point to the column that contains the context. </li><li> You cannot use `context` and `context_column` simultaneously. </li></ul>|
| `model: str` | <ul><li> The name of the OpenAI model to be used for the LLM prompting, e.g., `gpt-3.5-turbo-instruct`. </li></ul> |
| `openai_params: Optional[dict]` | <ul><li> A dictionary with additional parameters for the OpenAI API call. </li><li> Examples: temperature, max tokens, etc. </li><li> Use parameters that OpenAI API accepts for a specific model.</li></ul> |
| `possible_values: Optional[List[str]]` | <ul><li> A list of possible values that the LLM can return.</li><li> This helps validate the output from the LLM and ensure it matches the expected categories. </li><li> If the validation does not pass, you will get None as a response label. </li></ul>|
| `display_name: Optional[str]` | <ul><li> A display name visible in Reports and as a column name in tabular export. </li><li>Use it to name your Descriptor.</li></ul>|
## Descriptor parameters

- **`prompt: str`**
- The text of the evaluation prompt that will be sent to the LLM.
- Include at least one placeholder string.

- **`prompt_replace_string: str`**
- A placeholder string within the prompt that will be replaced by the evaluated text.
- The default string name is "REPLACE".

- **`feature_type: str`**
- The type of Descriptor the prompt will return.
- Available types: `num` (numerical) or `cat` (categorical).
- This affects the statistics and default visualizations.

- **`context_replace_string: str`**
- An optional placeholder string within the prompt that will be replaced by the additional context.
- The default string name is "CONTEXT".

- **`context: Optional[str]`**
- Additional context that will be added to the evaluation prompt, which **does not change** between evaluations.
- Examples: a reference document, a set of positive and negative examples, etc.
- Pass this context as a string.
- You cannot use `context` and `context_column` simultaneously.

- **`context_column: Optional[str]`**
- Additional context that will be added to the evaluation prompt, which is **specific to each row**.
- Examples: a chunk of text retrieved from reference documents for a specific query.
- Point to the column that contains the context.
- You cannot use `context` and `context_column` simultaneously.

- **`model: str`**
- The name of the OpenAI model to be used for the LLM prompting, e.g., `gpt-3.5-turbo-instruct`.

- **`openai_params: Optional[dict]`**
- A dictionary with additional parameters for the OpenAI API call.
- Examples: temperature, max tokens, etc.
- Use parameters that OpenAI API accepts for a specific model.

- **`possible_values: Optional[List[str]]`**
- A list of possible values that the LLM can return.
- This helps validate the output from the LLM and ensure it matches the expected categories.
- If the validation does not pass, you will get `None` as a response label.

- **`display_name: Optional[str]`**
- A display name visible in Reports and as a column name in tabular export.
- Use it to name your Descriptor.
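
Putting several of these parameters together, here is a hedged sketch of a categorical judge that grades each response against a per-row retrieved context; the prompt text, column names, and import paths are illustrative:

```python
import pandas as pd

from evidently.descriptors import OpenAIPrompting
from evidently.metric_preset import TextEvals
from evidently.report import Report

# Toy data: each row carries its own retrieved context (illustrative).
eval_df = pd.DataFrame({
    "response": ["The refund window is 30 days."],
    "retrieved_context": ["Our policy allows refunds within 30 days of purchase."],
})

grounded_prompt = """Does the answer stay faithful to the provided context?
Reply with one word: GROUNDED or UNGROUNDED.

Context: CONTEXT

Answer: REPLACE
"""

groundedness = OpenAIPrompting(
    prompt=grounded_prompt,
    prompt_replace_string="REPLACE",
    context_replace_string="CONTEXT",
    context_column="retrieved_context",  # per-row context, e.g. retrieved chunks
    feature_type="cat",
    possible_values=["GROUNDED", "UNGROUNDED"],
    model="gpt-3.5-turbo-instruct",
    display_name="Groundedness",
)

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[groundedness])
])
report.run(reference_data=None, current_data=eval_df)
```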

</details>
