## Search "ECB guide to internal models"

**A. Get Questions and Embeddings**:
- Load ECB Guide embeddings from a pickle file.
- Convert string embeddings to numpy arrays.
- Import questions from an Excel file.

**B. Helper Functions**:
- `search_docs`: Search documents and rank them based on cosine similarity of embeddings.
- `create_embedding`: Generate an embedding for a search phrase.
- `test_answers`: Test the questions against the embeddings to get relevant answers.
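- `display_rows`: Print the source label and text of the top-ranked rows of a DataFrame.
- `format_text`: Wrap text for display, starting a new paragraph at list markers such as "(a)" or "(ii)".
- `frequency_analysis`: Count how many questions have their expected passage ranked within given cut-offs.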

**C. Conduct Testing**:
- Run the `test_answers` function and display the top results.

**D. Perform Manual Testing**:
- Run specific queries by hand and display the top-ranked results.
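
The ranking step described for `search_docs` can be illustrated without any API call. The block below is a minimal, dependency-free sketch: the passage names and the three-dimensional vectors are made up for illustration, whereas real `text-embedding-ada-002` vectors have 1536 dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" (not taken from the ECB guide data)
passages = {
    "passage_a": [1.0, 0.0, 0.0],
    "passage_b": [0.6, 0.8, 0.0],
    "passage_c": [0.0, 0.0, 1.0],
}
query_vector = [0.8, 0.6, 0.0]

# Sort passage names from most to least similar to the query
ranked = sorted(
    passages,
    key=lambda name: cosine(passages[name], query_vector),
    reverse=True,
)
print(ranked)
```

Here `passage_b` ranks first because its vector points in nearly the same direction as the query, which is the same criterion the notebook applies to the stored guide embeddings.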

In [1]:
import pandas as pd
from tqdm import tqdm
import openai
import time
import tiktoken
from openai.embeddings_utils import get_embedding, cosine_similarity
import ast
import numpy as np
import matplotlib.pyplot as plt


# Settings
tqdm.pandas()
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

#### A. Get questions and embeddings

In [2]:
# Get the ECB Guide with embeddings and testing questions
embeddings = pd.read_pickle("ecb_guide_embeddings.pkl")
embeddings['embedding'] = embeddings['embedding'].apply(ast.literal_eval).apply(np.array)

questions = pd.read_excel("questions_ecb_extended.xlsx")

# Keep only the questions flagged as in scope
questions = questions[questions['in_scope'] == 1]
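
The two preparation steps in this cell (parsing string-encoded embeddings and keeping in-scope questions) can be sketched on plain Python data. The rows below are hypothetical stand-ins for the pickle and Excel contents; only the column names `embedding`, `Index`, `question` and `in_scope` come from the notebook.

```python
import ast

# Hypothetical embedding rows: vectors serialised as strings
embedding_rows = [
    {"Index": 1, "embedding": "[0.1, 0.2, 0.3]"},
    {"Index": 2, "embedding": "[0.4, 0.5, 0.6]"},
]

# literal_eval parses a Python literal and, unlike eval, does not execute code
for row in embedding_rows:
    row["embedding"] = ast.literal_eval(row["embedding"])

# Hypothetical question rows with an in-scope flag
question_rows = [
    {"Index": 1, "question": "Example question A", "in_scope": 1},
    {"Index": 2, "question": "Example question B", "in_scope": 0},
]

# Keep only the questions flagged as in scope
in_scope_questions = [q for q in question_rows if q["in_scope"] == 1]

print(embedding_rows[0]["embedding"])
print(len(in_scope_questions))
```

In the notebook the parsed lists are additionally wrapped in `np.array`, which the similarity computation can consume directly.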

#### B. Helper functions

In [3]:
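def search_docs(df, search_phrase):
    """Rank the rows of df by cosine similarity to the search phrase."""
    search_embedding = create_embedding(search_phrase)
    df = df.copy()  # avoid adding the similarity column to the caller's DataFrame
    df["similarity"] = df["embedding"].apply(
        lambda x: cosine_similarity(x, search_embedding))
    # After resetting the index, row position 0 holds the best match
    df = df.sort_values(by='similarity', ascending=False).reset_index(drop=True)
    first_value = df['Index'].iloc[0]

    return df, first_value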

def create_embedding(search_phrase):
    return get_embedding(
        search_phrase,
        engine="text-embedding-ada-002"
    )

def test_answers(embeddings, questions):
    """For each question, find the rank at which its expected passage is retrieved."""
    # Create an empty list to store the results
    results_list = []
    
    for index, row in questions.iterrows():
        query = row['question']
        question_number = row['Index']
        df, first_value = search_docs(embeddings, query)

        # Create a Boolean mask and find the index of the first occurrence
        mask = df['Index'] == question_number
        first_index = mask.idxmax() + 1 if mask.any() else 10000

        # Append a new dictionary to the results list
        results_list.append({
            'Query': query,
            'question_number': question_number,
            'first_value': first_value,
            'top_result': int(first_index),
            'total_documents': len(df)
        })

    # Convert the list of dictionaries to a DataFrame
    results_df = pd.DataFrame(results_list)

    # 'top_result' already holds integer ranks (10000 marks "not found"), so no NaN
    # values are expected; the fill is kept only as a safeguard
    results_df['top_result'] = results_df['top_result'].fillna(10000).astype(int)
    
    return results_df

def display_rows(df, top=3):
    """
    Display the first `top` rows of the DataFrame in a readable format.
    
    Parameters:
    - df (pd.DataFrame): The DataFrame containing the data, sorted by relevance
    - top (int): The number of leading rows to display
    
    """
    row_indices = list(range(top))
    
    for i, row_index in enumerate(row_indices):
        # Get the values of the specified row from the DataFrame
        row_data = df.iloc[row_index]
        
        # Extract the source label and the passage text from the row
        source_value = row_data['full_label']
        text_value = row_data['checked_sentence']
        
        # Display the data as specified
        print(f"Source: {source_value}\n")
        print(f"{format_text(text_value)}\n")
        
        # Print separator if not the last row
        if i < len(row_indices) - 1:
            print("-" * 10)


def format_text(text):
    # Split the text into words
    words = text.split()
    
    # Bullets and enumeration markers that should start a new paragraph
    special_markers = {"(a)", "(b)", "(c)", "(d)", "(e)", "(f)", "(g)", "(h)", "(i)", "(j)",
                       "(ii)", "(iii)", "(iv)", "(v)", "(vi)",
                       "(vii)", "(viii)", "(ix)", "(x)", "(xi)", "(xii)",
                       "•"}
    
    # Initialize the formatted text and a temporary line
    formatted_text = ""
    line = ""
    
    for word in words:
        # If the word is a bullet or special marker, start a new line with spacing
        if word in special_markers:
            formatted_text += line + "\n\n"  # Add two new lines for spacing
            line = word + " "
        # If adding the word keeps the line within 100 characters, add it to the line
        elif len(line + word) <= 100:
            line += word + " "
        # Otherwise, start a new line
        else:
            formatted_text += line + "\n"
            line = word + " "
    # Add the last line to the formatted text
    formatted_text += line
    
    return formatted_text

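def frequency_analysis(places):
    # Count how many expected passages were ranked within each cut-off
    top_1 = sum(places == 1)
    top_5 = sum(places <= 5)
    top_10 = sum(places <= 10)
    # Note: the threshold here is 10000, which is also the "not found" sentinel used in
    # test_answers, so this count includes every question despite the "Top 1000" label
    top_1000 = sum(places <= 10000)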
    
    print(f"Occurrences in Top 1: {top_1}")
    print(f"Occurrences in Top 5: {top_5}")
    print(f"Occurrences in Top 10: {top_10}")
    print(f"Occurrences in Top 1000: {top_1000}")
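
The rank lookup in `test_answers` and the cut-off counts in `frequency_analysis` can be sketched on plain lists. This is a simplified illustration, not the notebook's pandas implementation: the function names `first_rank` and `count_within` and the document IDs are hypothetical, while the sentinel value 10000 matches the one used in `test_answers`.

```python
NOT_FOUND = 10000  # same sentinel value as in test_answers

def first_rank(ranked_ids, expected_id, not_found=NOT_FOUND):
    """Return the 1-based position of expected_id in ranked_ids, or the sentinel."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == expected_id:
            return position
    return not_found

def count_within(ranks, k):
    """Count how many ranks fall within the top k."""
    return sum(1 for rank in ranks if rank <= k)

# Hypothetical search result order (document IDs sorted by similarity)
ranked_ids = [42, 7, 13]

rank_found = first_rank(ranked_ids, 7)     # expected passage is second
rank_missing = first_rank(ranked_ids, 99)  # expected passage is absent

# Hypothetical ranks collected over four questions
ranks = [1, 2, 6, NOT_FOUND]
print(rank_found, rank_missing)
print(count_within(ranks, 1), count_within(ranks, 5), count_within(ranks, 10))
```

Because the sentinel is far larger than any cut-off of interest, questions whose expected passage is never retrieved drop out of the top-k counts without needing special handling.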

#### C. Conduct testing

In [4]:
# Run the retrieval test on all in-scope questions
test_results = test_answers(embeddings, questions)

test_results[['Query', 'top_result', 'total_documents']]

| | Query | top_result | total_documents |
|---|---|---|---|
| 0 | Why is consistent implementation of internal model-related tasks crucial for banking groups? | 1 | 1121 |
| 1 | What needs to be documented for an LGD model? | 6 | 1121 |
| 2 | What are the components of model risk framework? | 3 | 1121 |
| 3 | We have just validated an internal model. There is a material changing coming in three months since the last validation. We will validate the model again during the next annual validation. Is this ok? | 1 | 1121 |
| 4 | Is it ok for a small bank to combine modelling and validation activities in the same department? | 1 | 1121 |
| 5 | May internal audit delegate its review of internal models to the model validation department? | 2 | 1121 |
| 6 | Is there any relation between credit and climate risk? | 4 | 1121 |
| 7 | How to measure the materiality of the rating system? | 2 | 1121 |
| 8 | What are the requirements related to outdated ratings? | 1 | 1121 |
| 9 | What information should data quality reports contain? | 3 | 1121 |


In [5]:
# Get top results
frequency_analysis(test_results['top_result'])

Occurrences in Top 1: 10
Occurrences in Top 5: 20
Occurrences in Top 10: 21
Occurrences in Top 1000: 24
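
The counts above are easier to compare as hit rates. The short sketch below converts them to percentages, assuming the final count (24) covers every question tested; that holds if all 24 questions either found their passage or were assigned the 10000 sentinel, but the notebook does not state the total explicitly.

```python
# Counts copied from the frequency_analysis output above
counts = {"Top 1": 10, "Top 5": 20, "Top 10": 21}
total_questions = 24  # assumption: the final count covers every question tested

# Share of questions whose expected passage falls within each cut-off
hit_rates = {label: count / total_questions for label, count in counts.items()}

for label, rate in hit_rates.items():
    print(f"{label}: {rate:.1%}")
```

Under that assumption, roughly 42% of questions retrieve the expected passage first, about 83% within the top 5, and 87.5% within the top 10.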


#### D. Perform manual testing

In [6]:
# Test 1
query = "What does the term 'initial validation' refer to?"
df, first_value = search_docs(embeddings, query)

display_rows(df, 2)

Source: General topics > 1 Overarching principles for internal models > 1.6 General principles for internal validation

All internal models and internal estimates should be subject to an initial and subsequently to an 
annual internal validation. For the avoidance of doubt, the term "initial validation" in the guide 
refers to the validation of new models as well as the validation of material changes and extensions 
to approved models. 

----------
Source: General topics > 4 Internal validation > 4.1 Relevant regulatory references

In the context of rating systems, the term "validation" encompasses a range of processes and 
activities that contribute to an assessment of whether ratings adequately differentiate risk, and 
whether estimates of risk parameters (such as PD, LGD and CCF) appropriately characterise the 
relevant aspects of risk. 



In [7]:
# Test 2
query = 'what are reference dates for EAD/CCF modelling?'
df, first_value = search_docs(embeddings, query)

display_rows(df, 2)

Source: Credit risk > 7 Conversion factors > 7.3 CCF structure > 7.3.1  Relevant regulatory references

199. For the purposes of Article 182(1)(a) of the CRR, institutions must compute realised CCF. To 
comply with this requirement, in the understanding of the ECB institutions should adopt the 
following approach. 

(a) Calculate realised CCF as the ratio of the difference between the EAD and the exposure at the 
reference date in the numerator, and the difference between the limit at reference date and the 
exposure at reference date (i.e. the amount available to be drawn at the reference date) in the 
denominator. This does not mean that, to address the issues with the "region of instability", 
institutions may not use direct EAD realisation (as referred to in paragraph 207(a) of this 
chapter). In any case, all the requirements regarding CCF risk quantification referred to in the 
applicable regulation apply, together with the ECB's understanding of those requirements as set out 
in