# Prepare Test Cases

- This notebook prepares test cases for evaluating different RAG pipelines.

# 1. Manually Prepare

## 1.1 Simple query

In [15]:
testing_dict = {
    "question":[],
    "expected_answer":[],
    "query_type":[],
}
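Every case in this notebook is added with the same three `append` calls. A small helper can keep the three lists aligned. The following is only a minimal sketch: `add_test_case` and `example_cases` are hypothetical names that the notebook does not define, and the helper strips the surrounding newlines that triple-quoted strings introduce, which the cells in this notebook do not do.

```python
def add_test_case(cases, question, expected_answer, query_type):
    """Append one test case so the three lists always stay the same length."""
    # Assumption: stripping surrounding whitespace is acceptable for the evaluation.
    cases["question"].append(question.strip())
    cases["expected_answer"].append(expected_answer.strip())
    cases["query_type"].append(query_type)
    return cases


# Hypothetical stand-in for testing_dict, so the sketch runs on its own.
example_cases = {"question": [], "expected_answer": [], "query_type": []}

add_test_case(
    example_cases,
    "\nWhat is train-serving skew?\n",
    "\nA model performs well in development but poorly in production.\n",
    "simple_query",
)
print(example_cases["question"])
```

With such a helper, each cell would reduce to one call per case instead of three separate appends.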

In [16]:
question_1 = """
What are data distribution shifts?
"""

answer_question_1 = """
Data Distribution Shifts Definition:

Data distribution shifts refer to changes in the underlying distribution of the data, which can affect the performance of machine learning models. These shifts can occur in various forms, including concept drift, covariate shift, and label shift.

Types of Data Distribution Shifts:
1. Covariate Shift: This occurs when the distribution of the input data (P(X)) changes, but the conditional probability of the output given the input (P(Y|X)) remains the same.
2. Label Shift: This occurs when the distribution of the output data (P(Y)) changes, but the conditional probability of the input given the output (P(X|Y)) remains the same.
3. Concept Drift: This occurs when the conditional probability of the output given the input (P(Y|X)) changes, but the distribution of the input data (P(X)) remains the same.
"""

testing_dict['question'].append(question_1)
testing_dict['expected_answer'].append(answer_question_1)
testing_dict['query_type'].append("simple_query")

In [17]:
question_2 = """
What is train-serving skew?
"""

answer_question_2 = """
Train-Serving Skew Definition and Explanation:

The train-serving skew refers to a common failure mode in machine learning (ML) models where a model performs well during development but poorly when deployed in production. This occurs when the training data and the real-world data (also known as unseen data) come from different distributions.

Causes of Train-Serving Skew:

1. Divergence between training and real-world data: The underlying distribution of real-world data is unlikely to be the same as the underlying distribution of the training data. This can be due to various selection and sampling biases, such as differences in encoding, data types, or other factors.
2. Non-stationarity of real-world data: Real-world data is constantly changing, and data distributions shift over time.
"""

testing_dict['question'].append(question_2)
testing_dict['expected_answer'].append(answer_question_2)
testing_dict['query_type'].append("simple_query")

In [18]:
question_3 = """
What are degenerate feedback loops? Give some examples.
"""

answer_question_3 = """
Degenerate Feedback Loops Definition:

Degenerate feedback loops occur when a system's outputs are used to generate its future inputs, which in turn influence the system's future outputs. This can lead to unintended consequences, such as perpetuating and magnifying biases embedded in the data.

Examples of Degenerate Feedback Loops:
1. Recommender Systems: A system recommends songs to users based on their past listening history. The songs that are ranked high by the system are shown first to users, which makes users click on them more. This, in turn, makes the system rank these songs even higher.
2. Resume-Screening Model: A model is trained to predict whether a candidate is qualified for a job based on their resume. The model finds that feature X (e.g., "went to Stanford") accurately predicts whether someone is qualified, so it recommends resumes with feature X. Recruiters only interview people whose resumes are recommended by the model, which means they only interview candidates with feature X. This, in turn, makes the model put even more weight on feature X.
3. Popularity Bias: Popular movies, books, or songs keep getting more popular, which makes it hard for new items to break into popular lists. This is an example of a degenerate feedback loop, where the popularity of an item is used to generate more popularity.
"""

testing_dict['question'].append(question_3)
testing_dict['expected_answer'].append(answer_question_3)
testing_dict['query_type'].append("simple_query")

In [19]:
question_4 = """
Give me 2 metrics to measure the diversity of the outputs of a recommender system.
"""

answer_question_4 = """
1. Aggregate Diversity: This metric measures the overall diversity of the system's outputs. It takes into account the variety of items recommended to users and the distribution of these items. A higher score indicates a more diverse set of recommendations.

2. Average Coverage of Long-Tail Items: This metric measures the extent to which the system recommends items from the long tail of the popularity distribution. The long tail consists of items that are rarely interacted with, but are still relevant to users. A higher score indicates that the system is better at recommending less popular items, which can help to reduce popularity bias.
"""

testing_dict['question'].append(question_4)
testing_dict['expected_answer'].append(answer_question_4)
testing_dict['query_type'].append("simple_query")

In [20]:
question_5 = """
What is the difference between machine learning in research and in production?
"""

answer_question_5 = """
1. Requirements:
Research: Focuses on achieving state-of-the-art model performance on benchmark datasets.
Production: Different stakeholders have different requirements, which can include accuracy, latency, scalability, and interpretability.

2. Computational Priority:
Research: Prioritizes fast training and high throughput to quickly experiment with different models and hyperparameters.
Production: Prioritizes fast inference and low latency to ensure real-time predictions and efficient processing.

3. Data:
Research: Data is often clean and well-formatted.
Production: Data is messy, noisy, unstructured, and constantly shifting. Labels, if there are any, might be sparse, imbalanced, or incorrect.

4. Fairness:
Research: Fairness is not a top consideration.
Production: Fairness must be considered when predictions are made for a wide variety of populations.

5. Interpretability:
Research: Interpretability is not a top consideration since research is mainly evaluated on model performance.
Production: Interpretability may be considered for both business leaders and end users, to understand why a decision is made so that they can trust a model.
"""

testing_dict['question'].append(question_5)
testing_dict['expected_answer'].append(answer_question_5)
testing_dict['query_type'].append("simple_query")

In [21]:
question_6 = """
What are the types of software system failures in ML pipelines?
"""

answer_question_6 = """
1. Dependency Failure: This type of failure occurs when a software package or codebase that the system depends on breaks, causing the system to fail. This is common when the dependency is maintained by a third party, especially if the third party no longer exists.

2. Deployment Failure: Deployment failures occur due to errors during deployment, such as deploying an older version of the model instead of the current version, or when the system lacks the necessary permissions to read or write certain files.

3. Hardware Failures: Hardware failures occur when the hardware used to deploy the model, such as CPUs or GPUs, does not behave as expected. For example, the CPUs may overheat and break down.

4. Downtime or Crashing: This type of failure occurs when a component of the system, such as a server, is down, causing the entire system to be unavailable.
"""

testing_dict['question'].append(question_6)
testing_dict['expected_answer'].append(answer_question_6)
testing_dict['query_type'].append("simple_query")

## 1.2 Complex query

In [22]:
question_7 = """
What are data distribution shifts? How do they happen? How can they be detected? How can they be solved?
"""

answer_question_7 = """
QUESTION: What is Data Distribution Shift?
ANSWER:
Data distribution shift refers to the change in the underlying distribution of the data, which can affect the performance of a machine learning model. It is often used interchangeably with concept drift, covariate shift, and label shift, but these are distinct subtypes of data shift.

Types of Data Distribution Shifts:

1. Covariate Shift: When the distribution of the input (X) changes, but the conditional probability of an output given an input (P(Y|X)) remains the same.
2. Label Shift: When the distribution of the output (Y) changes, but the conditional probability of an input given an output (P(X|Y)) remains the same.
3. Concept Drift: When the conditional probability of an output given an input (P(Y|X)) changes, but the distribution of the input (X) remains the same.

QUESTION: How Does Data Distribution Shift Happen?
ANSWER:
Data distribution shift can occur due to various reasons, including:

1. Biases in data selection: This can result in an unrepresentative sample of the population, leading to covariate shift.
2. Changes in the environment: This can cause concept drift, where the underlying relationships between the input and output variables change.

QUESTION: How to Detect Data Distribution Shift?
ANSWER:
Detecting data distribution shift can be challenging, but some methods include:

1. Monitoring accuracy-related metrics: This can help identify changes in the model's performance.
2. Monitoring input distribution: This can help identify changes in the distribution of the input data.
3. Monitoring label distribution: This can help identify changes in the distribution of the output data.
4. Using techniques such as Black Box Shift Estimation: This can help detect label shifts without requiring labels from the target distribution.

QUESTION: How to Solve Data Distribution Shift?
ANSWER:
1. Updating models: This can involve retraining the model from scratch (stateless retraining) or continuing to train it from the last checkpoint (stateful training).
2. Monitoring and observability: This can help identify changes in the data distribution and trigger updates to the model.

"""

testing_dict['question'].append(question_7)
testing_dict['expected_answer'].append(answer_question_7)
testing_dict['query_type'].append("complex_query")

In [23]:
question_8 = """
Can latency increase if my feature engineering produces too many features?
"""

answer_question_8 = """
Yes, having too many features can lead to increased latency in your model. This is particularly true when it comes to online prediction, where features need to be extracted from raw data in real-time.

There are several reasons why an excessive number of features can contribute to higher latency:

1. Increased Computational Requirements: More features require more computational resources to process, which can lead to slower inference times and increased latency.
2. Data Extraction and Processing: When features need to be extracted from raw data in real-time, having too many features can slow down the process, leading to increased latency.
3. Model Complexity: Models with too many features can become overly complex, leading to slower inference times and increased latency.
"""

testing_dict['question'].append(question_8)
testing_dict['expected_answer'].append(answer_question_8)
testing_dict['query_type'].append("complex_query")

In [24]:
question_9 = """
What is data leakage in the MLOps context? How does it happen, how can it be detected, and what are the solutions to it?
"""

answer_question_9 = """ 
QUESTION: What is Data Leakage in MLOps Context?
ANSWER:
Data leakage refers to the phenomenon when a form of the label "leaks" into the set of features used for making predictions, and this same information is not available during inference. This can cause machine learning models to fail in unexpected and spectacular ways, even after extensive evaluation and testing.

QUESTION: How Does Data Leakage Happen ?
ANSWER:
1. Splitting time-correlated data randomly: When data is time-correlated, splitting it randomly can cause information from the future to leak into the training process.
2. Scaling and normalization: Using statistics from the entire dataset to scale and normalize features can cause leakage.
3. Filling in missing data: Filling in missing data with statistics from the test split can cause leakage.
4. Poor handling of data duplication: Failing to remove duplicates before splitting data can cause the same samples to appear in both train and validation/test splits.
5. Group leakage: A group of examples with strongly correlated labels can be divided into different splits, causing leakage.

QUESTION: How to Detect Data Leakage?
ANSWER:
Detecting data leakage requires a deep understanding of the way data is collected and processed. Some ways to detect data leakage include:
1. Measure the predictive power of each feature or a set of features: If a feature has unusually high correlation with the target variable, investigate how this feature is generated and whether the correlation makes sense.
2. Keep track of the sources of your data and understand how it is collected and processed: This can help identify potential sources of leakage.

QUESTION: What are the solutions to Data Leakage?
ANSWER:
1. Split data by time: When data is time-correlated, split it by time to avoid leakage.
2. Use only statistics from the train split to scale and fill in missing values: This can prevent leakage caused by scaling and filling in missing values.
3. Remove duplicates before splitting: Check for duplicates before splitting and also after splitting to ensure that the same samples do not appear in both train and validation/test splits.
4. Normalize data: Normalize data so that data from different sources can have the same means and variances.
5. Incorporate subject matter experts: Incorporate subject matter experts into the ML design process to gain a deeper understanding of how data is collected and processed.
"""


testing_dict['question'].append(question_9)
testing_dict['expected_answer'].append(answer_question_9)
testing_dict['query_type'].append("complex_query")

In [25]:
question_10 = """ 
Can I just simply replace my existing model with my new retrained model ?
"""

answer_question_10 = """
Replacing an Existing Model with a New Retrained Model: Considerations and Best Practices:

No, you should not simply replace an existing model with a new retrained model. While it may seem like a simple swap, there are several factors to consider before making the change.

1. Retraining Strategies: There are two primary retraining strategies to consider: stateless retraining (training from scratch) and stateful training (fine-tuning). The choice between these two approaches depends on the specific use case and the characteristics of the data.
2. Data Considerations: When retraining a model, it's essential to consider the data used for retraining. This includes deciding which data to use (e.g., data from the last 24 hours, last week, last 6 months, or from the point when data has started to drift). The choice of data will impact the performance of the retrained model.
3. Continual Learning: Continual learning is a training paradigm where a model updates itself with every incoming sample in production. However, this approach can be challenging to implement, and most companies update their models in micro-batches instead.

Best Practices for Replacing an Existing Model:
1. Evaluate the retrained model: Ensure that the retrained model performs better than the existing model before deploying it.
2. Use a champion-challenger approach: Create a replica of the existing model and update this replica on new data. Only replace the existing model with the updated replica if the updated replica proves to be better.
3. Consider the retraining frequency: Determine the optimal retraining frequency for each model, taking into account the specific needs of each model.
"""

testing_dict['question'].append(question_10)
testing_dict['expected_answer'].append(answer_question_10)
testing_dict['query_type'].append("complex_query")
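Because the three lists are filled by hand, it is easy for one of them to fall out of step with the others. A quick consistency check before moving on can catch that. This is a sketch under stated assumptions: `validate_cases` and `sample_cases` are hypothetical names, and the sample data below only stands in for `testing_dict`.

```python
from collections import Counter


def validate_cases(cases):
    """Check that all columns align and no question or answer is blank."""
    lengths = {key: len(values) for key, values in cases.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    pairs = zip(cases["question"], cases["expected_answer"])
    for index, (question, answer) in enumerate(pairs):
        if not question.strip() or not answer.strip():
            raise ValueError(f"Case {index} has an empty question or answer")
    # Return how many cases exist per query type.
    return Counter(cases["query_type"])


# Hypothetical stand-in for testing_dict.
sample_cases = {
    "question": ["Q1", "Q2", "Q3"],
    "expected_answer": ["A1", "A2", "A3"],
    "query_type": ["simple_query", "simple_query", "complex_query"],
}
counts = validate_cases(sample_cases)
print(counts)
```

Calling the same function on `testing_dict` at this point would be expected to report six simple and four complex queries, matching the cells above.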

# 2. Use LangChain QAGenerateChain

In [4]:
from langchain.evaluation.qa import QAGenerateChain
from langchain_community.document_loaders import PyPDFLoader
from langchain_groq import ChatGroq


import config

In [3]:
# Prepare llm
groq_llama3_1_70b = ChatGroq(
    model="llama-3.1-70b-versatile",
    temperature=0,
    api_key=config.GROQ_API_KEY
)

In [5]:
# -------------------------------------------------------------------------
# 1. Load multiple PDFs using PyPDFLoader, only if the file is a PDF
# -------------------------------------------------------------------------
print("Loading PDFs Start.")

file_paths = config.PDF_FILE_PATHS
docs = []

for file_path in file_paths:
    # Check if the file is a PDF
    if file_path.lower().endswith(".pdf"):
        loader = PyPDFLoader(file_path)
        docs.extend(loader.load())  # Combine all loaded docs into a single list
    else:
        print(f"Skipping non-PDF file: {file_path}")

print("Loading PDFs End.")

# Preprocess page content
for doc in docs:
    # Replace tabs with spaces in each document's page_content
    doc.page_content = doc.page_content.replace("\t", " ")


Loading PDFs Start.
Loading PDFs End.


In [10]:
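import numpy as np

# Fix the seed so the same pages are sampled on every run
np.random.seed(1024)

# Sample 10 pages; np.random.choice draws with replacement by default,
# so the same page can be selected more than once
random_docs = list(np.random.choice(docs, size=10))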
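`np.random.choice` samples with replacement unless told otherwise, so a page can be drawn twice, and nearly empty pages give the LLM little to ask about. One possible alternative is sketched below using the standard library's `random.sample`, which never repeats an element. This is not what the notebook runs: `Page`, `sample_pages`, and the `min_chars` threshold are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded document page."""
    page_content: str


def sample_pages(pages, k, min_chars=200, seed=1024):
    """Sample up to k distinct pages that contain at least min_chars of text."""
    candidates = [p for p in pages if len(p.page_content.strip()) >= min_chars]
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    return rng.sample(candidates, k=min(k, len(candidates)))


pages = [Page("short"), Page("x" * 300), Page("y" * 500), Page("z" * 250)]
picked = sample_pages(pages, k=2)
print(len(picked))
```

A length threshold only removes near-empty pages; it would not exclude long but low-value pages such as a book index.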

In [11]:
random_docs

[Document(metadata={'source': '../designing-machine-learning-systems.pdf', 'page': 443}, page_content='LFs (labeling functions)\n, \nWeak supervision\nheuristics\n, \nWeak supervision\nlogs\n, \nData Sources\n, \nData Sources\nexperiment tracking\n, \nExperiment tracking\nmonitoring and\n, \nLogs\n-\nLogs\nstorage\n, \nData Sources\nloop tiling, model optimization\n, \nModel optimization\nloss curve\n, \nExperiment tracking\nloss functions\n, \nObjective Functions\n(\nsee also\n objective functions)\nlow-rank factorization\n, \nLow-Rank Factorization\n-\nLow-Rank Factorization\nM\nmaintainability\n, \nMaintainability\nManning, Christopher\n, \nMind Versus Data\nMAR (missing at random) values\n, \nHandling Missing Values\nMCAR (missing completely at random) values\n, \nHandling Missing Values\nmerge conflicts\n, \nVersioning\nmessage queue, dataflow and\n, \nData Passing Through Real-Time Transport\nMetaflow\n, \nData Science Workflow Management\nmetrics\nmonitoring and\n, \nMonitoring 

In [13]:
# Build the QA generation chain and produce one question/answer pair per sampled page
question_answer_chain = QAGenerateChain.from_llm(groq_llama3_1_70b)
llm_generated_cases = question_answer_chain.apply_and_parse(
    [{"doc": t} for t in random_docs]
)



In [14]:
llm_generated_cases

[{'qa_pairs': {'query': 'What type of values are described as being missing due to a reason that is related to the observed data, but not to the unobserved data?',
   'answer': 'MAR (missing at random) values.'}},
 {'qa_pairs': {'query': 'According to the text, what is the analogy Erik Bernhardsson uses to describe the expectation that data scientists should know about infrastructure, and what does it compare this expectation to?',
   'answer': "Erik Bernhardsson's analogy compares the expectation that data scientists should know about infrastructure to expecting app developers to know about how Linux kernels work."}},
 {'qa_pairs': {'query': 'In the context of a ride-sharing application like Lyft, what are the three services considered in a simplified microservice architecture, and what is the primary function of each service?',
   'answer': 'The three services are: '}},
 {'qa_pairs': {'query': 'What is the primary difference between batch processing and stream processing in terms of 

In [29]:
for case in llm_generated_cases:
    question = case['qa_pairs']['query']
    expect_answer = case['qa_pairs']['answer']

    testing_dict['question'].append(question)
    testing_dict['expected_answer'].append(expect_answer)
    testing_dict['query_type'].append("llm_generated")
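The generated output above includes a pair whose answer stops at `'The three services are: '`, so some generated cases are not usable as expected answers. A simple filter could be applied to `llm_generated_cases` before the loop that appends them. The sketch below is a heuristic under stated assumptions: `is_complete_case` and the `min_answer_chars` threshold are hypothetical, and the sample list copies two pairs from the output shown earlier.

```python
def is_complete_case(case, min_answer_chars=10):
    """Heuristic check that a generated pair has a usable question and answer."""
    pair = case.get("qa_pairs", {})
    query = pair.get("query", "").strip()
    answer = pair.get("answer", "").strip()
    if not query or len(answer) < min_answer_chars:
        return False
    # An answer ending with a colon was most likely cut off before its list.
    return not answer.endswith(":")


generated = [
    {"qa_pairs": {"query": "What type of values are described as being missing "
                           "due to a reason that is related to the observed data, "
                           "but not to the unobserved data?",
                  "answer": "MAR (missing at random) values."}},
    {"qa_pairs": {"query": "In the context of a ride-sharing application like Lyft, "
                           "what are the three services considered in a simplified "
                           "microservice architecture, and what is the primary "
                           "function of each service?",
                  "answer": "The three services are: "}},
]
kept = [case for case in generated if is_complete_case(case)]
print(len(kept))
```

Cases rejected by a heuristic like this are worth reviewing by hand rather than silently discarding, since the question itself may still be useful.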

# Final: Save the test cases

In [30]:
import joblib
joblib.dump(value=testing_dict, filename="test_cases.pkl")

['test_cases.pkl']
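A pickle file is convenient for reloading in Python but cannot be inspected or diffed by hand. As an optional, human-readable alternative to the `joblib` dump above, the same cases could be written as a JSON list of records. This is a sketch using only the standard library: `to_records`, `save_json`, `load_json`, and the file name are hypothetical, and the demo dictionary stands in for `testing_dict`.

```python
import json
import tempfile
from pathlib import Path


def to_records(cases):
    """Turn the column-oriented dict into a list of one dict per test case."""
    keys = list(cases)
    return [dict(zip(keys, row)) for row in zip(*(cases[k] for k in keys))]


def save_json(cases, path):
    """Write the cases as a readable JSON list of records."""
    text = json.dumps(to_records(cases), indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


def load_json(path):
    """Read the records back from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical stand-in for testing_dict.
demo = {
    "question": ["Q1", "Q2"],
    "expected_answer": ["A1", "A2"],
    "query_type": ["simple_query", "llm_generated"],
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "test_cases.json"
    save_json(demo, target)
    loaded = load_json(target)

print(len(loaded))
```

The record layout also makes it straightforward to filter by `query_type` when evaluating a pipeline on one group of cases at a time.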