# Clustering

In this notebook, we build a representation of all financial assets in the database that allows us to group them by shared characteristics. This includes handling categorical and null values and defining a preprocessing function.


### Import statements

In [None]:
# Import statements 
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

### 1° Loading the dataset

In [None]:
filepath = '../../RFQ_Data_Challenge_HEC.csv'
df = pd.read_csv(filepath)
df.head()

### 2° Defining the preprocessing function

Insights on some features:
- **Deal_Date** : The date on which a financial deal is executed. Needs to be converted to datetime.
- **ISIN** : International Securities Identification Number, a unique identifier for Financial instruments.
- **company_short_name** : the name of the issuer of the financial instrument. It corresponds to the client name.
- **B_Price** : The bid price of the financial instrument. Currently stored as object; to be converted to int.
- **B_Side** : Natixis's position as buyer or seller of the financial instrument (currently 'NATIXIS SELL' or 'NATIXIS BUY'). Contains 8 null values.
- **Total_Requested_Volume** : The requested volume for buying or selling the financial instrument. Currently stored as object; to be converted to a numerical column. Contains 2 null values.
- **Total_Traded_Volume_Natixis** : The volume of the financial instrument traded by Natixis. Already has the correct data type.
- **Total_Traded_Volume_Away** : The volume of the financial instrument traded by other banks. Already has the correct data type.
- **Total_Traded_Volume** : The total volume of the financial instrument traded. Already has the correct data type.
- **BloomIndustrySector**, **BloomIndustryGroup**,**BloomIndustrySubGroup**
- **maturity** : The length of time during which interest is paid. Some null values are marked as NaT. We convert this column to datetime. Some maturities date back to 1900, which is not possible, so we drop those rows.
- **Rating_Fitch** : The credit rating of the financial instrument from Fitch Ratings.
- **Rating_Moodys** : The credit rating of the financial instrument from Moody's.
- **Rating_SP** : The credit rating of the financial instrument from S&P Global Ratings.
- **Ccy** : The currency in which the financial instrument is denominated.
- **Classification** : The activity sector of the company.
- **Tier** : The seniority level of the financial instrument. Many null values (627100), which we replace with UNKNOWN.
- **AssumedMaturity** : The assumed maturity date of the financial instrument. Also contains many null values, which we replace with the corresponding maturity values.
- **Coupon** : The interest rate of the financial instrument. Already a float.
- **Frequency** : The frequency of interest payments on the financial instrument. Takes values 1M, 3M, 6M, 12M. We delete the 'M' and convert the value into int.
- **Type** : The type of interest rate on the financial instrument (fixed or variable).
- **MidYTM** : The yield to maturity on the prime bid. Already a float.
- **YTWDate** : Yield to Worst - The yield on the first possible redemption date. 
- **SpreadvsBenchmarkMid** : The spread of the financial instrument versus the interpolated government bond curve.
- **MidASWSpread** : The spread of the financial instrument versus the swap curve.
- **MidZSpread** : The spread of the financial instrument versus the zero-coupon curve.
- **GSpreadMid** : The spread of the financial instrument versus the interpolated government bond curve.
- **MidModifiedDuration** : The modified duration of the financial instrument. 
- **MidConvexity** : The convexity of the financial instrument.
- **MidEffectiveDuration** : The effective duration of the financial instrument.
- **MidEffectiveConvexity** : The effective convexity of the financial instrument.

Features that can be deleted at first : 
- **Cusip**, same as **cusip** but with more null values 
- **Maturity**, same as **maturity**

Added columns :
- Year, month, day of deal_date
- Year, month, day of maturity
- days to maturity
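
The added date columns can be sketched on toy data as follows. The two rows and their dates are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy rows (hypothetical values); column names mirror the dataset
toy = pd.DataFrame({
    "Deal_Date": ["2021-03-15", "2022-11-02"],
    "maturity": ["2026-03-15 00:00:00.000", "2031-05-20 00:00:00.000"],
})

# Parse both columns as datetimes; unparseable maturities become NaT
toy["Deal_Date"] = pd.to_datetime(toy["Deal_Date"])
toy["maturity"] = pd.to_datetime(toy["maturity"], errors="coerce")

# Calendar parts of the deal date and of the maturity
toy["Year_dealdate"] = toy["Deal_Date"].dt.year
toy["Month_dealdate"] = toy["Deal_Date"].dt.month
toy["Day_dealdate"] = toy["Deal_Date"].dt.day
toy["Year_maturity"] = toy["maturity"].dt.year
toy["Month_maturity"] = toy["maturity"].dt.month
toy["Day_maturity"] = toy["maturity"].dt.day

# Remaining life of the bond at the time of the deal, in days
toy["Days_to_Maturity"] = (toy["maturity"] - toy["Deal_Date"]).dt.days
print(toy)
```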

In [None]:
# Null values analysis for columns with null values below 15000
below_threshold = df.isnull().sum().sort_values(ascending=False) < 15000
print("Columns with null values below 15000:")
print(below_threshold[below_threshold].index)

# Null values analysis for columns with null values above 15000
above_threshold = df.isnull().sum().sort_values(ascending=False) >= 15000
print("\nColumns with null values above or equal to 15000:")
print(above_threshold[above_threshold].index)


In [None]:
def preprocess_dataframe(df):
    """
    Preprocesses the input DataFrame with the following steps:
    1. Converts 'Deal_Date', 'maturity', 'AssumedMaturity', 'YTWDate' columns to datetime.
    2. Converts 'B_Side' column to an integer flag (1 for 'NATIXIS BUY', 0 for 'NATIXIS SELL').
    3. Converts 'B_Price' and 'Total_Requested_Volume' columns to integers.
    4. Fills null values in 'Tier' with 'UNKNOWN' and in 'AssumedMaturity' with the maturity date.
    5. Converts 'Frequency' feature values into integers (removing 'M' from the end).
    6. Drops the unused 'Cusip' and 'Maturity' columns.

    Parameters:
    - df (DataFrame): Input DataFrame.

    Returns:
    - DataFrame: Processed DataFrame.
    """

    df = df.copy()

    # Drop null values only for columns below the threshold
    columns_to_delete_null_values = ['MidYTM', 'Coupon', 'Ccy', 'cusip',
       'maturity', 'cdcissuerShortName', 'Frequency', 'MidPrice', 'cdcissuer',
       'company_short_name', 'BloomIndustrySubGroup', 'B_Price',
       'Total_Traded_Volume_Natixis', 'B_Side',
       'Total_Traded_Volume_Away', 'Total_Requested_Volume',
       'Total_Traded_Volume', 'Type', 'Maturity', 'ISIN', 'Deal_Date']
    df = df.dropna(subset=columns_to_delete_null_values)

    # Convert 'B_Price', 'Total_Requested_Volume', 'Frequency' to integers
    df['Frequency'] = df['Frequency'].str.replace('M', '', regex=False)
    numerical_columns = ['B_Price', 'Total_Requested_Volume', 'Frequency']
    for column in numerical_columns:
        df[column] = pd.to_numeric(df[column], errors='coerce')
    # Drop rows that could not be parsed, otherwise the integer cast fails on NaN
    df = df.dropna(subset=numerical_columns)
    df[numerical_columns] = df[numerical_columns].astype(int)

    # Fix the error in the B_Price column
    df = df[df['B_Price'] >= 20]

    # Replace NaT with null values in the 'maturity' column
    df['maturity'] = df['maturity'].replace({pd.NaT: np.nan})

    # Convert 'Deal_Date', 'maturity', 'AssumedMaturity', 'YTWDate' to datetime
    df['Deal_Date'] = pd.to_datetime(df['Deal_Date'])
    df['maturity'] = pd.to_datetime(df['maturity'], errors='coerce',  format='%Y-%m-%d %H:%M:%S.%f')
    df['AssumedMaturity'] = pd.to_datetime(df['AssumedMaturity'], errors='coerce')
    df['YTWDate'] = pd.to_datetime(df['YTWDate'], errors='coerce')

    # Add year, month, day for clustering 
    df['Year_dealdate'] = df['Deal_Date'].dt.year
    df['Month_dealdate'] = df['Deal_Date'].dt.month
    df['Day_dealdate'] = df['Deal_Date'].dt.day
    df['Year_maturity'] = df['maturity'].dt.year
    df['Month_maturity'] = df['maturity'].dt.month
    df['Day_maturity'] = df['maturity'].dt.day

    # Drop maturities before 2021 (deal dates start in 2021)
    df = df[df['maturity'].dt.year >= 2021]

    # Compute number of days between maturity and deal date
    df['Days_to_Maturity'] = (df['maturity'] - df['Deal_Date']).dt.days

    # Replace null values in 'AssumedMaturity' with values from 'Maturity' (parsed as datetime)
    df['AssumedMaturity'] = df['AssumedMaturity'].fillna(pd.to_datetime(df['Maturity'], errors='coerce'))

    # Convert 'B_Side' column to an integer flag (1 for 'NATIXIS BUY', 0 for 'NATIXIS SELL')
    df = df[df['B_Side'].isin(['NATIXIS SELL', 'NATIXIS BUY'])].copy()
    df['B_Side'] = df['B_Side'].map({'NATIXIS BUY': 1, 'NATIXIS SELL': 0})

    # Convert null values of 'Tier'
    df['Tier'] = df['Tier'].fillna('UNKNOWN')

    # Lower string names 
    df['Sales_Name'] = df['Sales_Name'].str.lower()
    df['company_short_name'] = df['company_short_name'].str.lower()

    # Drop unused columns
    columns_to_drop = ['Cusip', 'Maturity']
    df.drop(columns=columns_to_drop, inplace=True)

    return df

In [None]:
df_preprocessed = preprocess_dataframe(df)
pd.set_option('display.max_columns', None)
df_preprocessed.head()

In [None]:
df_preprocessed.shape

In [None]:
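# Function for imputing numerical missing values in the financial columns
def complete_nan_values(df):
    """
    Fills missing spread, duration and convexity values with the mean of bonds sharing
    the same 'Classification' and deal year (one row per ISIN), falling back to the
    'Classification' mean averaged over years.
    Note: the input DataFrame is modified in place and also returned.
    """
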
    df_unique_isin = df.groupby('ISIN').first()
    columns = ['Classification', 'SpreadvsBenchmarkMid', 'MidASWSpread', 'MidZSpread', 'GSpreadMid', 
               'MidModifiedDuration', 'MidConvexity', 'MidEffectiveDuration', 'MidEffectiveConvexity', 'Year_dealdate', 'Month_dealdate']
    df_by_classification = df_unique_isin[columns].copy()
    df_by_classification = df_by_classification.groupby(['Classification', 'Year_dealdate']).mean().reset_index()

    df_group_by_industry = df_by_classification.groupby('Classification').mean().reset_index()
    numeric_columns = ['SpreadvsBenchmarkMid', 'MidASWSpread', 'MidZSpread', 'GSpreadMid', 
                       'MidModifiedDuration', 'MidConvexity', 'MidEffectiveDuration', 'MidEffectiveConvexity']
    
    df_by_classification['additional_column'] = df_by_classification['Classification'].astype(str) + ' - ' + df_by_classification['Year_dealdate'].astype(str)
    df['additional_column'] = df['Classification'].astype(str) + ' - ' + df['Year_dealdate'].astype(str)

    for column in numeric_columns:
        df_by_classification[column] = df_by_classification[column].fillna(df_by_classification['Classification'].map(df_group_by_industry.set_index('Classification')[column]))

    for column in numeric_columns:
        df[column] = df[column].fillna(df['additional_column'].map(df_by_classification.set_index('additional_column')[column]))

    df.drop(columns=['additional_column'], inplace=True)
    
    return df

In [None]:
df_filled = complete_nan_values(df_preprocessed)
missing_values = df_filled.isnull().sum()
missing_values[missing_values!=0]
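
The two-level fallback used by `complete_nan_values` can be sketched on toy data. This is a simplified illustration, not the function above: it averages over rows rather than over one row per ISIN, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column names follow the notebook
toy = pd.DataFrame({
    "Classification": ["Financials", "Financials", "Financials", "Utilities", "Utilities"],
    "Year_dealdate": [2021, 2021, 2022, 2021, 2022],
    "MidZSpread": [100.0, np.nan, np.nan, 40.0, 60.0],
})

# First choice: mean of the same (Classification, year) group
by_class_year = toy.groupby(["Classification", "Year_dealdate"])["MidZSpread"].transform("mean")
# Fallback: mean of the whole Classification, used when the finer group has no data
by_class = toy.groupby("Classification")["MidZSpread"].transform("mean")

toy["MidZSpread_filled"] = toy["MidZSpread"].fillna(by_class_year).fillna(by_class)
print(toy)
```

The third row has no observed value for Financials in 2022, so it falls back to the Financials mean.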

Once the B_Price error is corrected (some entries actually hold yield values, which we filter out by requiring a minimum price of 20), only 314718 rows remain.
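
As a small illustration of this filter, the sketch below applies the same cut-off to made-up prices and reports the share of rows that would be dropped. The toy values and the `MIN_PRICE` name are assumptions for the example.

```python
import pandas as pd

# Toy quotes (hypothetical): small values look like yields rather than prices
toy = pd.DataFrame({"ISIN": ["A", "B", "C", "D"], "B_Price": [101, 3, 98, 0]})

MIN_PRICE = 20  # same cut-off as in preprocess_dataframe
kept = toy[toy["B_Price"] >= MIN_PRICE]
share_dropped = 1 - len(kept) / len(toy)

print(kept)
print(f"Share of rows dropped: {share_dropped:.0%}")
```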

### 4° Defining the preprocessing function for clustering

In [None]:
def preprocess_clustering(df, cols_to_exclude):

    # Drop the columns that we exclude
    df = df.drop(cols_to_exclude, axis=1, errors='ignore')

    # Identify numerical columns
    numerical_columns = df.select_dtypes(include=['number']).columns

    # Transform 'Ccy' to 'is_euro' boolean column
    df['is_euro'] = (df['Ccy'] == 'EUR').astype(int)
    # Transform 'Type' to 'is_fixed' boolean column
    df['is_fixed'] = (df['Type'] == 'Fixed').astype(int)
    # Drop the original 'Ccy' and 'Type' columns
    df = df.drop(['Ccy', 'Type'], axis=1, errors='ignore')

    # Ordinal encoding shared by 'Rating_Fitch' and 'Rating_SP'
    rating_mapping = {
        'AAA': 22,
        'AA+': 21,
        'AA': 20,
        'AA-': 19,
        'A+': 18,
        'A': 17,
        'A-': 16,
        'BBB+': 15,
        'BBB': 14,
        'BBB-': 13,
        'BB+': 12,
        'BB': 11,
        'BB-': 10,
        'B+': 9,
        'B': 8,
        'B-': 7,
        'CCC+': 6,
        'CCC': 5,
        'CCC-': 4,
        'CC': 3,
        'C': 2,
        'WD': 1,
        'D': 0,
        'NR': np.nan
    }

    rating_mapping_moodys = {
        'Aaa': 22,
        'Aa1': 21,
        'Aa2': 20,
        '(P)Aa2': 20,
        'Aa3': 19,
        '(P)Aa3': 19,
        'A1': 18,
        '(P)A1': 18,
        'A2': 17,
        '(P)A2': 17,
        'A3': 16,
        '(P)A3': 16,
        'Baa1': 15,
        '(P)Baa1': 15,
        'Baa2': 14,
        '(P)Baa2': 14,
        'Baa3': 13,
        'Ba1': 12,
        'Ba2': 11,
        'Ba3': 10,
        'B1': 9,
        'B2': 8,
        'B3': 7,
        'Caa1': 6,
        'Caa2': 5,
        'Caa3': 4,
        'Ca': 2.5,
        'C': 0
    }

    df['Rating_Fitch_encoded'] = df['Rating_Fitch'].map(rating_mapping)
    df['Rating_SP_encoded'] = df['Rating_SP'].map(rating_mapping)
    df['Rating_Moodys_encoded'] = df['Rating_Moodys'].map(rating_mapping_moodys)
    # Create a unique Rating that averages the 3 Ratings and ignores missing values
    df['Rating'] = df[['Rating_Fitch_encoded', 'Rating_SP_encoded', 'Rating_Moodys_encoded']].mean(axis=1)

    # Map values in 'Country' column
    valid_countries = ['FRANCE', 'ITALY', 'GERMANY', 'NETHERLANDS', 'SPAIN']
    df['Country'] = df['Country'].apply(lambda x: x if x in valid_countries else 'OTHER')
    # Perform one-hot encoding
    df = pd.get_dummies(df, columns=['Country'], prefix='is')

    # Map values in 'Classification' column
    valid_classes = ['Financials', 'Government', 'Industrials', 'Utilities']
    df['Classification'] = df['Classification'].apply(lambda x: x if x in valid_classes else 'OTHER')
    # Perform one-hot encoding
    df = pd.get_dummies(df, columns=['Classification'], prefix='is')

    # Add newly created boolean columns and 'Rating' to agg_dict with average
    agg_dict = {col: 'mean' for col in ['is_euro', 'is_fixed', 'Rating']}
    agg_dict.update({col: 'first' for col in ['is_FRANCE', 'is_ITALY', 'is_GERMANY', 'is_NETHERLANDS', 'is_SPAIN']})
    agg_dict.update({col: 'first' for col in ['is_Financials', 'is_Government', 'is_Industrials', 'is_Utilities']})
    agg_dict.update({num_col: ['min', 'max', 'median'] for num_col in numerical_columns})

    # Grouping by 'ISIN' and aggregating columns
    grouped_df = df.groupby('ISIN').agg(agg_dict).reset_index()

    # Flatten the multi-level column index
    grouped_df.columns = ['_'.join(col).strip() for col in grouped_df.columns.values]

    # Drop identical columns
    grouped_df = grouped_df.T.drop_duplicates().T

    return grouped_df
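
The "keep the main categories, bucket the rest as OTHER, then one-hot encode" step can be sketched in isolation. The toy countries and the shortened `valid` list are assumptions for illustration.

```python
import pandas as pd

# Toy column (hypothetical values), including a rare country and a missing value
toy = pd.DataFrame({"Country": ["FRANCE", "ITALY", "JAPAN", None]})

valid = ["FRANCE", "ITALY"]  # shortened list for the example
# Anything outside the valid list, including missing values, becomes 'OTHER'
toy["Country"] = toy["Country"].where(toy["Country"].isin(valid), "OTHER")

# One indicator column per remaining category, prefixed with 'is'
dummies = pd.get_dummies(toy, columns=["Country"], prefix="is")
print(dummies)
```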

In [None]:
cols_to_exclude = ['Deal_Date', 'cusip', 'B_Side', 'Instrument', 'Sales_Name', 'Sales_Initial', 'company_short_name',
                   'Total_Requested_Volume', 'Total_Traded_Volume_Natixis', 'Total_Traded_Volume_Away', 'Total_Traded_Volume',
                   'cdcissuer', 'Tier', 'Year_dealdate', 'Month_dealdate','Day_dealdate', 'Days_to_Maturity',
                   'cdcissuerShortName', 'lb_Platform_2']
df_clustering = preprocess_clustering(df_filled, cols_to_exclude)

In [None]:
df_clustering.head()
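
The rating step inside `preprocess_clustering` maps each agency's letter grade to a number and averages whatever grades are available. A minimal sketch with shortened scales and made-up rows:

```python
import numpy as np
import pandas as pd

# Shortened scales for illustration; the notebook's mappings cover the full scales
sp_fitch_scale = {"AAA": 22, "AA+": 21, "BBB": 14, "NR": np.nan}
moodys_scale = {"Aaa": 22, "Aa1": 21, "Baa2": 14}

# Toy rows (hypothetical): fully rated, partially rated, and unrated
toy = pd.DataFrame({
    "Rating_Fitch": ["AAA", "NR", None],
    "Rating_SP": ["AA+", "BBB", None],
    "Rating_Moodys": ["Aaa", None, None],
})

encoded = pd.DataFrame({
    "fitch": toy["Rating_Fitch"].map(sp_fitch_scale),
    "sp": toy["Rating_SP"].map(sp_fitch_scale),
    "moodys": toy["Rating_Moodys"].map(moodys_scale),
})

# Row-wise mean skips missing grades; a row with no grade at all stays NaN
toy["Rating"] = encoded.mean(axis=1)
print(toy)
```

Rows that remain NaN here are the ones imputed with the median later in the notebook.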

### 5° KMeans clustering (fixed number of clusters)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
df_clustering.set_index('ISIN_', inplace=True)
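
The trailing underscore in `ISIN_` comes from flattening the two-level column index produced by the aggregation. A small sketch on made-up data shows where the names come from:

```python
import pandas as pd

# Toy trades (hypothetical values): two rows for bond A, one for bond B
toy = pd.DataFrame({
    "ISIN": ["A", "A", "B"],
    "Coupon": [1.0, 1.0, 2.5],
    "MidPrice": [99.0, 101.0, 100.0],
})

# Mixing a single aggregation with a list of aggregations yields two-level columns
agg = toy.groupby("ISIN").agg({"Coupon": "first", "MidPrice": ["min", "max"]}).reset_index()

# Each column is a tuple such as ('ISIN', '') or ('MidPrice', 'min'); join the two levels
agg.columns = ["_".join(col).strip() for col in agg.columns.values]
print(agg.columns.tolist())
```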

We now turn to classical KMeans, which requires a dataset without missing values. The financial features were imputed earlier, so only the ratings remain; we fill them with the median.

Imputing missing values:

In [None]:
missing_values = df_clustering.isnull().sum()
missing_values[missing_values!=0] 

In [None]:
df_clustering_filled = df_clustering.copy()
df_clustering_filled['Rating_mean'] = df_clustering_filled['Rating_mean'].fillna(df_clustering['Rating_mean'].median())

In [None]:
scaler = StandardScaler()
df_normalized = scaler.fit_transform(df_clustering_filled)
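
For reference, standardisation rescales every feature to zero mean and unit variance, so that features on large scales do not dominate the Euclidean distances used by KMeans. A NumPy-only sketch of the same computation on made-up numbers:

```python
import numpy as np

# Toy feature matrix (hypothetical): two features on very different scales
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Column-wise z-score; StandardScaler also uses the population std (ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)
```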

We apply the elbow method to determine the optimal number of clusters for the KMeans approach.
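
As a rough sketch of what the elbow method looks at, the loop below fits KMeans for several values of k on synthetic blobs and records the within-cluster sum of squares (`inertia_`). The data is generated for illustration and stands in for the scaled bond features; the visualizer used in this notebook may score clusterings slightly differently.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic, well-separated blobs (assumption: toy data, not the bond features)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Within-cluster sum of squares for each candidate number of clusters
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The curve drops sharply until k reaches the true number of blobs and flattens afterwards; that bend is the "elbow".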

In [None]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from umap.umap_ import UMAP

In [None]:
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))

In [None]:
visualizer.fit(df_normalized)        # Fit the data to the visualizer
visualizer.show();                          # Finalize and render the figure

The elbow plot suggests 7 as the optimal number of clusters.
<br>
Now let's move on to exploring the results we obtain.

In [None]:
model = KMeans(n_clusters=7, verbose=0, random_state=42)

In [None]:
clusters = model.fit_predict(df_normalized)

In [None]:
pd.Series(clusters).value_counts()

In [None]:
embedding = UMAP(n_neighbors=50, learning_rate=0.5, init="random", min_dist=0.001
                      ).fit_transform(df_normalized)

In [None]:
sns.scatterplot(x=embedding[:,0], y=embedding[:,1], hue=clusters, palette='dark')

Explainability options:
- ExKMC
- Build a classification model for each label and look at SHAP values
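
The second option is a one-vs-rest setup: for each cluster, a binary target says whether a bond belongs to that cluster or not. A minimal sketch of building these targets from a made-up label array:

```python
import numpy as np

# Toy cluster assignments (hypothetical)
clusters = np.array([0, 2, 1, 2, 2, 0])

# One binary target per cluster: 1 if the bond is in that cluster, else 0
one_vs_rest = {int(label): (clusters == label).astype(int) for label in np.unique(clusters)}

for label, target in one_vs_rest.items():
    print(label, target.tolist())
```

Each target can then be fed to a separate binary classifier whose feature attributions describe what distinguishes that cluster from the rest.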

### Building an explainability classification model

In [None]:
from catboost import CatBoostClassifier
import shap

In [None]:
def preprocess_explainability(df, cols_to_exclude):

    # Drop the columns that we exclude
    df = df.drop(cols_to_exclude, axis=1, errors='ignore')

    # Identify numerical columns
    numerical_columns = df.select_dtypes(include=['number']).columns

    # Transform 'Ccy' to 'is_euro' boolean column
    df['is_euro'] = (df['Ccy'] == 'EUR').astype(int)
    # Transform 'Type' to 'is_fixed' boolean column
    df['is_fixed'] = (df['Type'] == 'Fixed').astype(int)
    # Drop the original 'Ccy' and 'Type' columns
    df = df.drop(['Ccy', 'Type'], axis=1, errors='ignore')

    # Ordinal encoding shared by 'Rating_Fitch' and 'Rating_SP'
    rating_mapping = {
        'AAA': 22,
        'AA+': 21,
        'AA': 20,
        'AA-': 19,
        'A+': 18,
        'A': 17,
        'A-': 16,
        'BBB+': 15,
        'BBB': 14,
        'BBB-': 13,
        'BB+': 12,
        'BB': 11,
        'BB-': 10,
        'B+': 9,
        'B': 8,
        'B-': 7,
        'CCC+': 6,
        'CCC': 5,
        'CCC-': 4,
        'CC': 3,
        'C': 2,
        'WD': 1,
        'D': 0,
        'NR': np.nan
    }

    rating_mapping_moodys = {
        'Aaa': 22,
        'Aa1': 21,
        'Aa2': 20,
        '(P)Aa2': 20,
        'Aa3': 19,
        '(P)Aa3': 19,
        'A1': 18,
        '(P)A1': 18,
        'A2': 17,
        '(P)A2': 17,
        'A3': 16,
        '(P)A3': 16,
        'Baa1': 15,
        '(P)Baa1': 15,
        'Baa2': 14,
        '(P)Baa2': 14,
        'Baa3': 13,
        'Ba1': 12,
        'Ba2': 11,
        'Ba3': 10,
        'B1': 9,
        'B2': 8,
        'B3': 7,
        'Caa1': 6,
        'Caa2': 5,
        'Caa3': 4,
        'Ca': 2.5,
        'C': 0
    }

    df['Rating_Fitch_encoded'] = df['Rating_Fitch'].map(rating_mapping)
    df['Rating_SP_encoded'] = df['Rating_SP'].map(rating_mapping)
    df['Rating_Moodys_encoded'] = df['Rating_Moodys'].map(rating_mapping_moodys)
    # Create a unique Rating that averages the 3 Ratings and ignores missing values
    df['Rating'] = df[['Rating_Fitch_encoded', 'Rating_SP_encoded', 'Rating_Moodys_encoded']].mean(axis=1)

    # Add newly created boolean columns and 'Rating' to agg_dict with average
    agg_dict = {col: 'mean' for col in ['is_euro', 'is_fixed', 'Rating']}
    agg_dict.update({col: 'first' for col in ['Country', 'Classification']})
    agg_dict.update({num_col: ['min', 'max', 'median'] for num_col in numerical_columns})

    # Grouping by 'ISIN' and aggregating columns
    grouped_df = df.groupby('ISIN').agg(agg_dict).reset_index()

    # Flatten the multi-level column index
    grouped_df.columns = ['_'.join(col).strip() for col in grouped_df.columns.values]

    # Drop identical columns
    grouped_df = grouped_df.T.drop_duplicates().T

    # Set back data types to numerical when needed
    grouped_df = grouped_df.astype({col: 'float' for col in grouped_df.columns if col not in ['Classification_first', 'Country_first', 'ISIN_']})

    # Replace missing values with empty string
    grouped_df['Country_first'] = grouped_df['Country_first'].fillna('')

    return grouped_df

In [None]:
df_exp = preprocess_explainability(df_filled, cols_to_exclude)
df_exp['cluster'] = clusters
df_exp.set_index(['ISIN_'], inplace=True)
df_exp.head()

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Function to train CatBoostClassifier for each cluster label
def train_catboost_classifier(df, cluster_labels, min_representation=100):
    classifiers = {}

    for label in cluster_labels:
        # Check if the cluster label is represented at least min_representation times
        if df['cluster'].value_counts().get(label, 0) < min_representation:
            print(f"Skipping cluster {label} as it has less than {min_representation} instances.")
            continue

        # Create binary labels for the current cluster
        lb = LabelBinarizer()
        binary_labels = lb.fit_transform(df['cluster'] == label).ravel()

        # Split the data into train and test sets
        X_train, X_test, y_train, y_test = train_test_split(
            df.drop('cluster', axis=1), binary_labels, test_size=0.2, random_state=42
        )

        # Initialize CatBoostClassifier
        clf = CatBoostClassifier(iterations=100, depth=5, learning_rate=0.1, loss_function='Logloss')

        # Train the classifier
        clf.fit(X_train, y_train, cat_features=['Country_first', 'Classification_first'], verbose=False)

        # Make predictions on the test set
        y_pred = clf.predict(X_test)

        # Evaluate the model
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred)

        # Store the classifier and evaluation results
        classifiers[label] = {'classifier': clf, 'accuracy': accuracy, 'classification_report': report}

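        # Retrain a fresh model on the full data so the held-out model stored above is not overwritten
        clf_full = CatBoostClassifier(iterations=100, depth=5, learning_rate=0.1, loss_function='Logloss')
        clf_full.fit(df.drop('cluster', axis=1), binary_labels, cat_features=['Country_first', 'Classification_first'], verbose=False)

        # Save the retrained model in classifiers dict
        classifiers[label]['classifier_full_data'] = clf_full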

    return classifiers


In [None]:
# Train CatBoostClassifier for each cluster label
cluster_classifiers = train_catboost_classifier(df_exp, cluster_labels=df_exp['cluster'].unique())

In [None]:
def plot_shap_explainability(model, df):
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(df)

    # Display beeswarm-style SHAP plot ("dot" is the summary_plot type that draws it)
    #shap.plots.beeswarm(shap_values)
    shap.summary_plot(shap_values, df, plot_type="dot")

In [None]:
def plot_shap_explainability_corrected(model, df):
    
    df_to_plot = df.reset_index().copy()
    ISIN_list = df_to_plot.ISIN_.to_list()
    index = pd.Index(ISIN_list)
    df_to_plot.drop(columns=["ISIN_"], inplace=True)
    df_to_plot = df_to_plot.set_index(index)
    df_to_plot = df_to_plot.drop('cluster', axis=1)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer(df_to_plot)
    shap.plots.beeswarm(shap_values)

In [None]:
# Assuming cluster_classifiers is already defined
for label, info in cluster_classifiers.items():
    print(f"Classifier for Cluster {label}:")
    print(f"Accuracy: {info['accuracy']:.2f}")
    print("Classification Report:")
    print(info['classification_report'])
    print("\n")

    # Check if 'classifier_full_data' key exists in the dictionary
    if 'classifier_full_data' in info:
        print(f"Plotting SHAP explainability for Cluster {label}")
        # Access the retrained model on full data
        full_data_model = info['classifier_full_data']
        
        # Plot SHAP explainability
        plot_shap_explainability_corrected(full_data_model, df_exp)
        
        plt.show()  # Display the plot
        print("\n")
    else:
        print(f"No 'classifier_full_data' available for Cluster {label}\n")


### Recommending similar bonds

In [None]:
df_clustering_filled

In [None]:
df_to_test = df_clustering_filled.reset_index().copy()
ISIN_list = df_to_test.ISIN_.to_list()
index = pd.Index(ISIN_list)
df_to_test.drop(columns=["ISIN_"], inplace=True)
df_to_test = df_to_test.set_index(index)
df_to_test

In [None]:
from sklearn.metrics import pairwise_distances

def recommend_n_bonds(row_id, df, kmeans_model, n=5):
    # NOTE: df must contain the same (scaled) features, in the same row order,
    # as the data the KMeans model was fitted on; otherwise clusters and distances are not comparable
    # Get the cluster of the given row
    cluster_id = kmeans_model.predict(df.loc[[row_id]].to_numpy(dtype=float))[0]
    # Get the positions of data points in the same cluster
    cluster_indices = np.where(kmeans_model.labels_ == cluster_id)[0]
    # Get the distances between the given row and all points in the cluster
    distances = pairwise_distances(df.loc[[row_id]], df.iloc[cluster_indices], metric='euclidean')[0]
    # Sort by distance and keep the n nearest positions (the row itself, if present, comes first)
    sorted_indices = np.argsort(distances)
    top_nearest_indices = cluster_indices[sorted_indices][:n]

    return top_nearest_indices.tolist()

# Example usage:
#row_id_to_check = 0  # Replace with the desired row index
#top5_nearest_ids = recommend_n_bonds('AT0000383864', df_to_test, model)

#print(f"Top 5 nearest ids for row {row_id_to_check}: {top5_nearest_ids}")
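
The idea behind the recommendation can be sketched with NumPy only: restrict the candidates to the query bond's cluster, measure Euclidean distances in the scaled feature space, and return the closest identifiers. The helper `nearest_in_cluster` and the toy data are hypothetical and not part of the notebook; unlike the function above, it returns identifiers rather than row positions and excludes the query itself.

```python
import numpy as np

def nearest_in_cluster(ids, X_scaled, labels, query_id, n=5):
    """Return the ids of the n closest points sharing the query's cluster."""
    ids = np.asarray(ids)
    pos = int(np.where(ids == query_id)[0][0])
    # Candidates: same cluster label as the query, excluding the query itself
    same_cluster = np.where(labels == labels[pos])[0]
    same_cluster = same_cluster[same_cluster != pos]
    # Euclidean distance from the query to every candidate
    dists = np.linalg.norm(X_scaled[same_cluster] - X_scaled[pos], axis=1)
    order = np.argsort(dists)[:n]
    return ids[same_cluster[order]].tolist()

# Toy data (hypothetical identifiers, features and cluster labels)
ids = ["A", "B", "C", "D", "E"]
X_scaled = np.array([[0.0, 0.0], [0.1, 0.0], [0.5, 0.5], [5.0, 5.0], [0.0, 0.3]])
labels = np.array([0, 0, 0, 1, 0])

neighbours = nearest_in_cluster(ids, X_scaled, labels, "A", n=2)
print(neighbours)
```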

In [None]:
# predict expects a 2D input in the scaled space the model was fitted on
model.predict(scaler.transform(df_clustering_filled.loc[['XS2717309855']]))

In [None]:
isin_to_check = 'XS2717309855'
top5_nearest_ids = recommend_n_bonds(isin_to_check, df_to_test, model)

print(f"Top 5 nearest ids for row {isin_to_check}: {top5_nearest_ids}")

In [None]:
df[(df['ISIN'] == 'XS2236363573')]

In [None]:
df_test = df.copy()
# float() cannot convert a whole Series; use pd.to_numeric instead
df_test['B_Price'] = pd.to_numeric(df['B_Price'], errors='coerce')

In [None]:
df_preprocessed[df_preprocessed['ISIN']=='XS2236363573']

## Unsupervised clustering

In [None]:
import hdbscan

In [None]:
pd.DataFrame(df_normalized, columns=df_clustering_filled.columns).head()

In [None]:
plt.scatter(x=embedding[:,0], y=embedding[:,1])

In [None]:
clusterer = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5, gen_min_span_tree=True)
clusters = clusterer.fit_predict(df_normalized)

In [None]:
sns.scatterplot(x=embedding[:,0], y=embedding[:,1], hue=clusters, palette='dark')

In [None]:
pd.Series(clusters).value_counts()
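
When reading these counts, keep in mind that HDBSCAN assigns the label -1 to points it treats as noise, so -1 is not a cluster. A small sketch on a made-up label array shows how to separate the two:

```python
import numpy as np

# Toy HDBSCAN output (hypothetical): -1 marks noise points
labels = np.array([0, 0, 1, -1, 1, -1, 0, 2])

noise_share = float(np.mean(labels == -1))
n_clusters = len(set(labels.tolist()) - {-1})

print(f"Clusters found: {n_clusters}, share of noise points: {noise_share:.0%}")
```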

## Predicting for Natixis test

In [None]:
df_preprocessed.head()

In [None]:
df_bond = df[df['ISIN']=='XS2236363573']
df_bond.head()

In [None]:
def preprocess_bond(df):
    """
    Preprocesses the rows of a single bond whose columns are misaligned in the raw file.
    Follows the same steps as preprocess_dataframe, with these differences:
    1. Shifts the columns from 'cdcissuerShortName' to 'maturity' one position to the left.
    2. Does not drop rows with nulls, with a 'B_Price' below 20, or with a maturity before 2021.

    Parameters:
    - df (DataFrame): Input DataFrame.

    Returns:
    - DataFrame: Processed DataFrame.
    """

    df = df.copy()

    # Shift back the columns to the correct place
    column_names = df.columns
    # Find the index of 'cdcissuerShortName' and 'maturity'
    cd_issuer_index = column_names.get_loc('cdcissuerShortName')
    maturity_index = column_names.get_loc('maturity')
    # Loop through each column and shift the data to the left
    for i in range(cd_issuer_index, maturity_index + 1):
        df.iloc[:, i] = df.iloc[:, i + 1]

    # Replace empty column with nans
    df.iloc[:, maturity_index+1] = np.nan

    # Convert 'B_Price', 'Total_Requested_Volume', 'Frequency' to integers
    df['Frequency'] = df['Frequency'].str.replace('M', '', regex=False)
    numerical_columns = ['B_Price', 'Total_Requested_Volume', 'Frequency']
    #df.dropna(subset=numerical_columns, inplace=True)
    for column in numerical_columns:
        # Missing or unparseable values become 0 so the integer cast does not fail
        df[column] = pd.to_numeric(df[column], errors='coerce').fillna(0).astype(int)

    # Fix the error in the B_Price column
    #df = df[df['B_Price'] >= 20]

    # Replace NaT with null values in the 'maturity' column
    df['maturity'] = df['maturity'].replace({pd.NaT: np.nan})

    # Convert 'Deal_Date', 'maturity', 'AssumedMaturity', 'YTWDate' to datetime
    df['Deal_Date'] = pd.to_datetime(df['Deal_Date'])
    df['maturity'] = pd.to_datetime(df['maturity'], errors='coerce',  format='%Y-%m-%d %H:%M:%S.%f')
    df['AssumedMaturity'] = pd.to_datetime(df['AssumedMaturity'], errors='coerce')
    df['YTWDate'] = pd.to_datetime(df['YTWDate'], errors='coerce')

    # Add year, month, day for clustering 
    df['Year_dealdate'] = df['Deal_Date'].dt.year
    df['Month_dealdate'] = df['Deal_Date'].dt.month
    df['Day_dealdate'] = df['Deal_Date'].dt.day
    df['Year_maturity'] = df['maturity'].dt.year
    df['Month_maturity'] = df['maturity'].dt.month
    df['Day_maturity'] = df['maturity'].dt.day

    # Delete maturities smaller than 2021 (as deal dates starts in 2021)
    #df = df[df['maturity'].dt.year >= 2021]

    # Compute number of days between maturity and deal date
    df['Days_to_Maturity'] = (df['maturity'] - df['Deal_Date']).dt.days

    # Replace null values in 'AssumedMaturity' with values from 'Maturity' (parsed as datetime)
    df['AssumedMaturity'] = df['AssumedMaturity'].fillna(pd.to_datetime(df['Maturity'], errors='coerce'))

    # Convert 'B_Side' column to an integer flag (1 for 'NATIXIS BUY', 0 for 'NATIXIS SELL')
    df = df[df['B_Side'].isin(['NATIXIS SELL', 'NATIXIS BUY'])].copy()
    df['B_Side'] = df['B_Side'].map({'NATIXIS BUY': 1, 'NATIXIS SELL': 0})

    # Convert null values of 'Tier'
    df['Tier'] = df['Tier'].fillna('UNKNOWN')

    # Lower string names 
    df['Sales_Name'] = df['Sales_Name'].str.lower()
    df['company_short_name'] = df['company_short_name'].str.lower()

    # Drop unused columns
    columns_to_drop = ['Cusip', 'Maturity']
    df.drop(columns=columns_to_drop, inplace=True)

    return df

In [None]:
df_bond = preprocess_bond(df_bond)

In [None]:
df_bond.head()

In [None]:
df_bond = pd.concat([df_preprocessed, df_bond], axis=0)
df_bond.tail()

In [None]:
df_bond_filled = complete_nan_values(df_bond)
df_bond.tail()

We impute the missing values for price.

In [None]:
# Assuming 'B_Price' and 'MidPrice' are columns in your DataFrame 'df'
correlation = df_bond['B_Price'].corr(df_bond['MidPrice'])

print(f"Correlation between B_Price and MidPrice: {correlation:.4f}")

In [None]:
from sklearn.linear_model import LinearRegression

# 'MidPrice' is the independent variable and 'B_Price' is the dependent variable
X = df_bond[['MidPrice']]
y = df_bond['B_Price']

# Create a linear regression model (named price_model so the KMeans `model` is not overwritten)
price_model = LinearRegression()

# Fit the model
price_model.fit(X, y)

# Print the coefficients
p_intercept = price_model.intercept_
p_slope = price_model.coef_[0]

print(f"Intercept: {p_intercept:.4f}")
print(f"Slope (Coefficient for MidPrice): {p_slope:.4f}")


We now set the missing prices as price = 0.9896 * MidPrice + 0.5937

In [None]:
# Use a lambda function to calculate the predicted values
fill_zero = lambda mid_price: p_intercept + p_slope * mid_price

# Create a boolean mask for values equal to 0 in 'B_Price'
mask = df_bond['B_Price'] == 0

# Apply the lambda function to replace zero values in 'B_Price'
df_bond.loc[mask, 'B_Price'] = df_bond.loc[mask, 'MidPrice'].apply(fill_zero)

df_bond.tail()
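
The imputation can be reproduced on toy numbers with NumPy alone. This sketch makes one assumption that differs from the cells above: the line is fitted only on rows whose bid price is known, so the zero placeholders do not influence the fit. All values are made up.

```python
import numpy as np

# Toy prices (hypothetical); 0 marks a missing bid price
mid = np.array([90.0, 100.0, 110.0, 120.0])
bid = np.array([89.5, 99.5, 109.5, 0.0])

# Fit bid = intercept + slope * mid on the rows with a known bid
known = bid != 0
slope, intercept = np.polyfit(mid[known], bid[known], deg=1)

# Keep known bids, replace the missing ones with the fitted value
bid_filled = np.where(known, bid, intercept + slope * mid)
print(bid_filled)
```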

In [None]:
cols_to_exclude = ['Deal_Date', 'cusip', 'B_Side', 'Instrument', 'Sales_Name', 'Sales_Initial', 'company_short_name',
                   'Total_Requested_Volume', 'Total_Traded_Volume_Natixis', 'Total_Traded_Volume_Away', 'Total_Traded_Volume',
                   'cdcissuer', 'Tier', 'Year_dealdate', 'Month_dealdate','Day_dealdate', 'Days_to_Maturity',
                   'cdcissuerShortName', 'lb_Platform_2']
df_bond_clustering = preprocess_clustering(df_bond, cols_to_exclude)

In [None]:
df_bond_clustering[df_bond_clustering['ISIN_']=='XS2236363573']

In [None]:
missing_values = df_bond_clustering.isnull().sum()
missing_values[missing_values!=0] 

In [None]:
df_bond_clustering_filled = df_bond_clustering.copy()
df_bond_clustering_filled['Rating_mean'] = df_bond_clustering['Rating_mean'].fillna(df_bond_clustering['Rating_mean'].median())

In [None]:
df_normalized = df_bond_clustering_filled.drop(columns=['ISIN_'])

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_normalized = scaler.fit_transform(df_normalized)

In [None]:
pd.DataFrame(df_normalized, columns=df_bond_clustering_filled.columns[1:]).head()

In [None]:
from sklearn.cluster import KMeans

In [None]:
# Fix the seed so the cluster assignment is reproducible
clusterer = KMeans(n_clusters=7, random_state=42)
clusters = clusterer.fit_predict(df_normalized)

In [None]:
df_bond_clustering_filled['cluster'] = clusters

In [None]:
df_bond_clustering_filled.head()

In [None]:
df_bond_clustering_filled[df_bond_clustering_filled['ISIN_']=='XS2236363573']

In [None]:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def get_nearest_rows(df_normalized, isin_string):
    # Find the index of the given ISIN string in df_bond_clustering_filled
    index = df_bond_clustering_filled[df_bond_clustering_filled['ISIN_'] == isin_string].index[0]
    
    # Calculate Euclidean distances between the selected row and all other rows
    distances = euclidean_distances(df_normalized, [df_normalized[index]])
    
    # Get the indices of the 5 nearest rows (excluding the row itself)
    nearest_indices = np.argsort(distances.flatten())[1:6]
    
    # Retrieve the corresponding rows from the original DataFrame
    nearest_rows = df_bond_clustering_filled.iloc[nearest_indices]
    
    return nearest_rows

# Example usage:
isin_to_search = 'XS2236363573'
result = get_nearest_rows(df_normalized, isin_to_search)


In [None]:
result

In [None]:
df_bond_clustering_filled[df_bond_clustering_filled['ISIN_']=='XS2236363573']