## This notebook demonstrates how to use our interface to 1) build a case base and a set of queries from a dataframe, 2) apply the Modified Condensed Nearest Neighbor (MCNN) Case Base Maintenance (CBM) method to reduce the case base size, 3) retrieve the top-k most similar cases to a query from both case bases, and 4) evaluate the similarity and diversity of the retrievals from both case bases.

### 1. Load data and build the original case base.

In [1]:
import pandas as pd
from utils.case import Query
from utils.casebase import CaseBase, MCNN_CaseBase
from utils.utils import retrieve_topk
from utils.eval import cal_diversity, cal_sim_matrix

# Modify to your own path
path = r'/home/dwd/proj/Diversity-Improvement-in-CBR/CleanedDATA V12-05-2021.csv'
df = pd.read_csv(path, sep=';', encoding='windows-1252')

In [2]:
import warnings
warnings.filterwarnings('ignore')

# Uncomment the last line of this cell (df.to_csv) to save the modification. Remove this cell once the data file has been updated.
series = df['Publication identifier,,,,,,,,,,,,,,,,,,']
for i in range(len(series)):
    j_comma = series[i].find(',')
    if j_comma > 0:
        series[i] = series[i][:j_comma]
df.rename(columns={'Publication identifier,,,,,,,,,,,,,,,,,,': 'Publication identifier'}, inplace=True)
# df.to_csv(path, sep=';', index=False, encoding='windows-1252')

In [3]:
# Define the parameters here.
rand_seed = 0 # provide determinism
n_cases = int(0.8 * len(df))
n_queries = len(df) - n_cases
k = 5 # number of cases to retrieve
thr_desc = 0.7 # modify this threshold for MCNN's description similarity
thr_sol = 0.7 # modify this threshold for MCNN's solution similarity
attr_names = ['Task', 'Case study type', 'Case study', 'Online/Off-line', 'Input for the model',
              'Model Approach', 'Model Type', 'Models', 'Data Pre-processing', 'Complementary notes', 'Publication identifier',
              'Performance indicator', 'Performance', 'Publication Year']
desc_attrs = attr_names[:5]
sol_attrs = attr_names[5:]
df = df[attr_names]

print(f"Number of cases in the dataset: {len(df)}")
print(f"Number of cases used to build the case base: {n_cases}")
print(f"Number of cases to retrieve: {k}")
print(f"Attributes for case: {attr_names}")
print(f"Attributes for description part: {desc_attrs}")
print(f"Attributes for solution part: {sol_attrs}")

# shuffle the rows of df
df_ = df.sample(frac=1, random_state=rand_seed).reset_index(drop=True)
df_cases = df_.iloc[:n_cases]
df_queries = df_.iloc[n_cases:][desc_attrs]

Number of cases in the dataset: 263
Number of cases used to build the case base: 210
Number of cases to retrieve: 5
Attributes for case: ['Task', 'Case study type', 'Case study', 'Online/Off-line', 'Input for the model', 'Model Approach', 'Model Type', 'Models', 'Data Pre-processing', 'Complementary notes', 'Publication identifier', 'Performance indicator', 'Performance', 'Publication Year']
Attributes for description part: ['Task', 'Case study type', 'Case study', 'Online/Off-line', 'Input for the model']
Attributes for solution part: ['Model Approach', 'Model Type', 'Models', 'Data Pre-processing', 'Complementary notes', 'Publication identifier', 'Performance indicator', 'Performance', 'Publication Year']


In [4]:
# Build the original case base from 80% of the data
cb = CaseBase.from_dataframe(df_cases)
print(f'Size of the case base: {len(cb.cases)}')

# Create the queries using the remaining 20% of the data
queries = []
for i in range(len(df_queries)):
    q = Query.from_series(df_queries.iloc[i], _id=i)
    queries.append(q)
print(f'Number of queries: {len(queries)}')

Size of the case base: 210
Number of queries: 53


### 2. Retrieve the top-k most similar cases to the query from the original case base.

In [5]:
# Retrieve the top-k most similar cases to the query.
query = queries[1]
case_sims = retrieve_topk(query, cb, weights=[1, 1, 1, 1, 1], k=k)
print(f"Retrieve top-{k} most similar cases to the query: {query}")
display(case_sims)
# Split the (case, similarity) pairs into cases, their solution parts, and similarities
retrieved_cases0 = [case for case, _ in case_sims]
retrieved_solutions0 = [case.to_desc_sol_pair(desc_attrs, sol_attrs)[1] for case in retrieved_cases0]
retrieved_sims0 = [sim for _, sim in case_sims]
print(f"Retrieved cases: {retrieved_cases0}")
print(f"Retrieved solutions: {retrieved_solutions0}")
print(f"Similarity list: {retrieved_sims0}")

Retrieve top-5 most similar cases to the query: Query 1


[(Case 27, 0.96),
 (Case 54, 0.96),
 (Case 74, 0.96),
 (Case 97, 0.96),
 (Case 110, 0.96)]

Retrieved cases: [Case 27, Case 54, Case 74, Case 97, Case 110]
Retrieved solutions: [Solution 27, Solution 54, Solution 74, Solution 97, Solution 110]
Similarity list: [0.96, 0.96, 0.96, 0.96, 0.96]
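
As a rough illustration of what a weighted top-k retrieval can look like, the sketch below scores each case by a weighted average of per-attribute similarities and keeps the k best. It is not the implementation of `retrieve_topk`: the exact-match local measure, the names `local_sim`, `global_sim`, and `topk`, and the toy data are all assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple


def local_sim(a: str, b: str) -> float:
    """Assumed local measure: 1.0 for identical attribute values, else 0.0."""
    return 1.0 if a == b else 0.0


def global_sim(query: Sequence[str], desc: Sequence[str],
               weights: Sequence[float]) -> float:
    """Weighted average of the local similarities over all attributes."""
    weighted = sum(w * local_sim(q, d) for w, q, d in zip(weights, query, desc))
    return weighted / sum(weights)


def topk(query: Sequence[str], cases: Dict[int, Sequence[str]],
         weights: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return the k (case_id, similarity) pairs with the highest similarity."""
    scored = [(case_id, global_sim(query, desc, weights))
              for case_id, desc in cases.items()]
    # sorted() is stable, so tied cases keep their insertion order
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy descriptions (hypothetical values, not taken from the dataset)
toy_cases = {
    0: ("diagnosis", "bearing", "offline"),
    1: ("diagnosis", "gearbox", "offline"),
    2: ("prognosis", "gearbox", "online"),
}
toy_query = ("diagnosis", "bearing", "offline")
top2 = topk(toy_query, toy_cases, weights=[1, 1, 1], k=2)
print(top2)
```

With equal weights every attribute contributes equally; raising one weight makes a mismatch on that attribute cost more.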


### 3. Apply Case Base Maintenance (CBM) method - Modified Condensed Nearest Neighbor (MCNN).
The new case base is built from exactly the same cases as the original case base, but it generalizes their descriptions and solutions, so it stores fewer distinct descriptions and solutions than there are cases.

In [9]:
# Initialize the MCNN Case Base
mcnn_cb = MCNN_CaseBase(cb.cases, desc_attrs, sol_attrs, thr_desc, thr_sol=thr_sol, 
                        _seed=rand_seed)

print("Number of descriptions in new case base:", len(mcnn_cb.descriptions))
print("Number of solutions in new case base:", len(mcnn_cb.solutions))

Number of descriptions in new case base: 12
Number of solutions in new case base: 34
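
The internals of `MCNN_CaseBase` are not shown in this notebook. To give an intuition for how a similarity threshold such as `thr_desc` can shrink a set of descriptions, here is a minimal sketch of a greedy grouping pass: an item is absorbed by the first prototype it resembles closely enough, and otherwise becomes a new prototype. This is a simplified stand-in, not MCNN and not the repository's algorithm; `overlap_sim`, `condense`, and the toy items are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def overlap_sim(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed similarity: fraction of positions holding the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def condense(items: Sequence[Sequence[str]],
             sim: Callable[[Sequence[str], Sequence[str]], float],
             threshold: float) -> Tuple[List[int], Dict[int, List[int]]]:
    """Greedy pass: group each item with the first sufficiently similar prototype."""
    prototypes: List[int] = []          # indices of the items kept as prototypes
    members: Dict[int, List[int]] = {}  # prototype index -> indices it represents
    for idx, item in enumerate(items):
        for p in prototypes:
            if sim(items[p], item) >= threshold:
                members[p].append(idx)  # absorbed by an existing prototype
                break
        else:
            prototypes.append(idx)      # nothing similar enough: new prototype
            members[idx] = [idx]
    return prototypes, members


# Toy items (hypothetical values, not taken from the dataset)
toy_items = [("a", "b", "c"), ("a", "b", "d"), ("x", "y", "z"), ("a", "b", "c")]
protos, groups = condense(toy_items, overlap_sim, threshold=0.6)
print(protos, groups)
```

In this toy pass, a higher threshold tends to absorb fewer items and therefore keeps more prototypes.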


### 4. Retrieve the top-k most similar cases to the query from the new case base.

In [10]:
rlt = mcnn_cb.retrieve_topk(query, k=k)
print(f"Retrieve top-{k} most similar cases to the query: {query}")
display(rlt)
# Each result is ((GC, solution), similarity); extract the solutions and the similarities
retrieved_solutions1 = [solution for (_, solution), _ in rlt]
retrieved_sims1 = [sim for _, sim in rlt]
print("Retrieved solutions:", retrieved_solutions1)
print("Similarity list:", retrieved_sims1)

Retrieve top-5 most similar cases to the query: Query 1


[((GC 104, Solution 104), 0.8965000000000001),
 ((GC 104, Solution 43), 0.8965000000000001),
 ((GC 104, Solution 13), 0.8965000000000001),
 ((GC 104, Solution 51), 0.8965000000000001),
 ((GC 104, Solution 196), 0.8965000000000001)]

Retrieved solutions: [Solution 104, Solution 43, Solution 13, Solution 51, Solution 196]
Similarity list: [0.8965000000000001, 0.8965000000000001, 0.8965000000000001, 0.8965000000000001, 0.8965000000000001]


### 5. Evaluate and compare the results of the two retrievals.

In [11]:
# Calculate the average similarity and diversity of the retrieved cases
aver_sim0 = sum(retrieved_sims0) / len(retrieved_sims0)
sim_matrix0 = cal_sim_matrix(retrieved_solutions0, case_base=cb)
div0 = cal_diversity(sim_matrix0)
print("========== Original Case Base =========")
print("Similarity matrix:")
print(sim_matrix0)
print("---------------------------------")
print("Average similarity:", aver_sim0)
print("Diversity:", div0)

# Calculate the average similarity and diversity of the retrieved solutions
aver_sim1 = sum(retrieved_sims1) / len(retrieved_sims1)
# NOTE: The similarity matrix can be useful when implementing the other CBM methods
# Calculate the inter-solution similarities
sim_matrix1 = cal_sim_matrix(retrieved_solutions1, case_base=mcnn_cb)
div1 = cal_diversity(sim_matrix1)
print("\n============ MCNN Case Base ===========")
print("Similarity matrix:")
print(sim_matrix1)
print("---------------------------------")
print("Average similarity:", aver_sim1)
print("Diversity:", div1)

========== Original Case Base =========
Similarity matrix:
[[1.         0.96666667 0.70913037 0.96666667 0.96666667]
 [0.96666667 1.         0.70913037 0.96666667 0.96666667]
 [0.70913037 0.70913037 1.         0.70913037 0.70913037]
 [0.96666667 0.96666667 0.70913037 1.         0.96666667]
 [0.96666667 0.96666667 0.70913037 0.96666667 1.        ]]
---------------------------------
Average similarity: 0.96
Diversity: 0.272695707070707

============ MCNN Case Base ===========
Similarity matrix:
[[1.         0.66821705 0.6666559  0.62436647 0.50756303]
 [0.66821705 1.         0.60844631 0.56929825 0.60213033]
 [0.6666559  0.60844631 1.         0.55065789 0.43472222]
 [0.62436647 0.56929825 0.55065789 1.         0.42358674]
 [0.50756303 0.60213033 0.43472222 0.42358674 1.        ]]
---------------------------------
Average similarity: 0.8965
Diversity: 0.8688711621042963
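
A common way to turn an inter-solution similarity matrix into a single diversity score is the mean pairwise dissimilarity, i.e. the average of `1 - sim` over all distinct pairs. The sketch below computes it for the two matrices printed above. The definition of `cal_diversity` is not shown in this notebook, so the helper `mean_pairwise_dissimilarity` is an assumption; for these two matrices, the reported diversity values equal twice this quantity, up to the rounding of the printed entries.

```python
from itertools import combinations
from typing import Sequence


def mean_pairwise_dissimilarity(sim: Sequence[Sequence[float]]) -> float:
    """Average of (1 - similarity) over all distinct pairs (needs at least 2 items)."""
    pairs = list(combinations(range(len(sim)), 2))
    return sum(1.0 - sim[i][j] for i, j in pairs) / len(pairs)


# Similarity matrices copied from the output above
sim_original = [
    [1.0, 0.96666667, 0.70913037, 0.96666667, 0.96666667],
    [0.96666667, 1.0, 0.70913037, 0.96666667, 0.96666667],
    [0.70913037, 0.70913037, 1.0, 0.70913037, 0.70913037],
    [0.96666667, 0.96666667, 0.70913037, 1.0, 0.96666667],
    [0.96666667, 0.96666667, 0.70913037, 0.96666667, 1.0],
]
sim_mcnn = [
    [1.0, 0.66821705, 0.6666559, 0.62436647, 0.50756303],
    [0.66821705, 1.0, 0.60844631, 0.56929825, 0.60213033],
    [0.6666559, 0.60844631, 1.0, 0.55065789, 0.43472222],
    [0.62436647, 0.56929825, 0.55065789, 1.0, 0.42358674],
    [0.50756303, 0.60213033, 0.43472222, 0.42358674, 1.0],
]

d0 = mean_pairwise_dissimilarity(sim_original)
d1 = mean_pairwise_dissimilarity(sim_mcnn)
print("Original case base:", d0, "(doubled:", 2 * d0, ")")
print("MCNN case base:    ", d1, "(doubled:", 2 * d1, ")")
```

Under either normalisation the ordering is the same: the solutions retrieved from the MCNN case base are more diverse, at the cost of a lower average similarity to the query (0.8965 versus 0.96).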


### TODOs:
- Unify performance metrics for the solutions (utils/case.py, line 182)
- Implement the other CBM methods
- Implement the other case base evaluation metrics (coverage, etc.)
- GUI + Batch test + Results visualization