Merged
30 commits
8c4f0d9
added new metrics experiments
shahules786 Jul 20, 2023
9ec071c
crtique metrics
shahules786 Jul 20, 2023
276feba
merge main
shahules786 Jul 20, 2023
3e4757d
added critique metrics
shahules786 Jul 20, 2023
9e8fd16
rmv
shahules786 Jul 20, 2023
8a59c59
rmv
shahules786 Jul 20, 2023
edd8294
added critique experiments
shahules786 Jul 21, 2023
9d1d445
update readme
shahules786 Jul 21, 2023
5cf8174
rename metrics
shahules786 Jul 21, 2023
91d6725
added aspect critique
shahules786 Jul 21, 2023
70862f9
added new metrics to tests
shahules786 Jul 21, 2023
a166da6
formating
shahules786 Jul 21, 2023
3d0f644
Merge branch 'main' of https://github.com/explodinggradients/ragas in…
shahules786 Jul 21, 2023
fbfa533
Merge branch 'main' of https://github.com/explodinggradients/ragas in…
shahules786 Jul 21, 2023
f62caf6
update base class
shahules786 Jul 21, 2023
cc9625d
rmv binary metrics from ragas_score
shahules786 Jul 21, 2023
2a0f671
crtique assesments
shahules786 Jul 21, 2023
fe416d6
update metrics
shahules786 Jul 21, 2023
5210160
update aspects
shahules786 Jul 21, 2023
dc227cb
added documentation
shahules786 Jul 22, 2023
c90c92a
Merge branch 'main' of https://github.com/explodinggradients/ragas in…
shahules786 Jul 22, 2023
1f4ff75
change to default_factory
shahules786 Jul 22, 2023
0ad0cce
revert commit
shahules786 Jul 22, 2023
118e47e
fixed defualt factory
jjmachan Jul 22, 2023
80b4d74
fixed format
jjmachan Jul 22, 2023
75ff74f
smaller batch for benchmark
jjmachan Jul 22, 2023
4e5cfcb
fix types
shahules786 Jul 22, 2023
7a7a242
Merge branch 'dev-gptscore' of https://github.com/shahules786/ragas i…
shahules786 Jul 22, 2023
6a25fc0
fix types
shahules786 Jul 22, 2023
d5675c7
added supported aspects
shahules786 Jul 24, 2023
5 changes: 3 additions & 2 deletions README.md
@@ -80,12 +80,13 @@ results = evaluate(dataset)
If you want a more in-depth explanation of core components, check out our [quick-start notebook](./docs/quickstart.ipynb)
## :luggage: Metrics

- Ragas measures your pipeline's performance against two dimensions
+ Ragas measures your pipeline's performance against different dimensions
1. **Faithfulness**: measures the information consistency of the generated answer against the given context. Any claims made in the answer that cannot be deduced from the context are penalized.
2. **Context Relevancy**: measures how relevant the retrieved context is to the question. Ideally, the context should contain only the information necessary to answer the question; the presence of redundant information in the context is penalized.
3. **Answer Relevancy**: measures how relevant the generated answer is to the question. This does not ensure the factuality of the generated answer; rather, it penalizes the presence of redundant information in the answer.
4. **Aspect Critiques**: designed to judge the submission against defined aspects like harmlessness, correctness, etc. You can also define your own aspect and validate the submission against it. The output of aspect critiques is always binary.

- Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.
+ The final `ragas_score` is the harmonic mean of the individual metric scores.
**Member** commented: is the harmonic mean still relevant?

**Member Author** replied: This is specified in the docs.

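A minimal sketch of what the harmonic mean of the individual metric scores looks like (the per-metric values here are illustrative, not produced by this PR; binary critique scores are excluded, as noted above):

```python
from statistics import harmonic_mean

# Hypothetical per-metric scores; aspect critiques are binary and excluded.
scores = {"faithfulness": 0.90, "context_relevancy": 0.75, "answer_relevancy": 0.85}
ragas_score = harmonic_mean(list(scores.values()))
print(f"ragas_score = {ragas_score:.3f}")  # ragas_score = 0.829
```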
To read more about our metrics, check out the [docs](/docs/metrics.md).
## 🫂 Community
28 changes: 28 additions & 0 deletions docs/metrics.md
@@ -1,5 +1,6 @@
# Metrics


1. `faithfulness`: measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm: statements are first created from the generated answer, and each statement is then verified against the context. The score is scaled to the (0,1) range; higher is better.
```python
from ragas.metrics.factuality import Faithfulness
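# The diff collapses the rest of this snippet; a minimal sketch of the
# usage pattern, mirroring the other examples in this file (the
# `faithfulness` instance name and the dataset columns are assumptions,
# not shown in this diff):
from datasets import Dataset

faithfulness = Faithfulness()

# Dataset({
#     features: ['question', 'contexts', 'answer'],
#     num_rows: 25
# })
dataset: Dataset

results = faithfulness.score(dataset)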
```

@@ -41,6 +42,33 @@

```python
results = answer_relevancy.score(dataset)
```


4. `Aspect Critiques`: Critiques are LLM evaluators that judge your submission against a provided aspect. Several aspects like `correctness`, `harmfulness`, etc. come predefined with Ragas critiques (check `SUPPORTED_ASPECTS` for the full list), and you can also define your own aspect. The `strictness` parameter is used to ensure a level of self-consistency in predictions (ideal range 2-4). The output of aspect critiques is always binary, indicating whether the submission adhered to the given aspect definition or not. These scores are not considered for the final `ragas_score` due to their non-continuous nature.
- List of predefined aspects:
`correctness`, `harmfulness`, `coherence`, `conciseness`, `maliciousness`

```python
## check predefined aspects
from ragas.metrics.critique import SUPPORTED_ASPECTS
print(SUPPORTED_ASPECTS)

from ragas.metrics.critique import conciseness
from datasets import Dataset
# Dataset({
# features: ['question','answer'],
# num_rows: 25
# })
dataset: Dataset

results = conciseness.score(dataset)


## Define your critique
from ragas.metrics.critique import AspectCritique
mycritique = AspectCritique(name="my-critique", definition="Is the submission safe to children?", strictness=2)

```
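
Scoring with a custom critique follows the same pattern as the predefined aspects (a sketch; `mycritique` and `dataset` are the objects defined above):

```python
# Output is binary: 1 if the submission adheres to the aspect, 0 otherwise.
results = mycritique.score(dataset)
```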


## Why is ragas better than scoring with GPT-3.5 directly?
LLMs like GPT-3.5 struggle to score generated text directly. For instance, these models tend to generate only integer scores, and those scores vary across invocations. Ragas's solution is to apply advanced paradigms and techniques that leverage LLMs while minimizing this bias; the self-consistency idea behind the `strictness` parameter is sketched below.
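A conceptual sketch of that self-consistency idea (an illustration only, not ragas internals; `ask_llm` is a hypothetical callable that returns a 0/1 verdict):

```python
from collections import Counter
from typing import Callable

def self_consistent_verdict(ask_llm: Callable[[str], int], prompt: str, strictness: int = 3) -> int:
    # Ask the LLM for the same binary judgement several times and take the
    # majority vote, damping the run-to-run variance of a single call.
    votes = [ask_llm(prompt) for _ in range(strictness)]
    return Counter(votes).most_common(1)[0][0]
```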
<h1 align="center">