This is where I train an SBERT model (more specifically, a **SimCSE** model for **unsupervised** learning).

**Steps:**
* Restructure `correlations.csv` into a more straightforward form
* Explore the tree, then use the `title` path from the `root_node` down to the current `topic_node` as `input_1`
* Use the `title` of each `content_item` as `input_2`
* Use the information above to build my dataset
* Train an SBERT model
* Predict, then select the `TOP_N` content items that have the highest cosine similarity and the same language as the given topic

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os
from IPython.display import display, Markdown
from pathlib import Path

DATA_PATH = "/kaggle/input/learning-equality-curriculum-recommendations/"

# Build my Dataset

## Reconstruct Correlations

First, the original `correlations.csv` file is not in an ideal form for training (each row contains one `topic_id` and several `content_ids`), so let's restructure it.

In [2]:
correlations_df = pd.read_csv(DATA_PATH + 'correlations.csv')
correlations_df.head()

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
1,t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
2,t_00069b63a70a,c_11a1dc0bfb99
3,t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
4,t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4


In [3]:
correlation = correlations_df.copy()
correlation.content_ids = correlation.content_ids.str.split()
correlation = correlation.explode('content_ids').rename(columns={"content_ids": "content_id"}).reset_index(drop=True)
correlation.head(10)

Unnamed: 0,topic_id,content_id
0,t_00004da3a1b2,c_1108dd0c7a5d
1,t_00004da3a1b2,c_376c5a8eb028
2,t_00004da3a1b2,c_5bc0e1e2cba0
3,t_00004da3a1b2,c_76231f9d0b5e
4,t_00068291e9a4,c_639ea2ef9c95
5,t_00068291e9a4,c_89ce9367be10
6,t_00068291e9a4,c_ac1672cdcd2c
7,t_00068291e9a4,c_ebb7fdf10a7e
8,t_00069b63a70a,c_11a1dc0bfb99
9,t_0006d41a73a8,c_0c6473c3480d


Now `correlation` is a dataframe with 279919 rows and 2 columns. Each row contains exactly one `topic_id` and one `content_id`.
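As a quick sanity check on the reshaping, here is a minimal sketch on a toy frame. The IDs are made up for illustration and the name `long_form` is hypothetical; the point is only that the long form should have one row per content ID.

```python
import pandas as pd

# Toy stand-in for correlations.csv (IDs are invented for illustration)
toy = pd.DataFrame({
    "topic_id": ["t_a", "t_b"],
    "content_ids": ["c_1 c_2 c_3", "c_4"],
})

# Same reshaping as above: split on whitespace, then one row per content ID
long_form = (
    toy.assign(content_ids=toy["content_ids"].str.split())
       .explode("content_ids")
       .rename(columns={"content_ids": "content_id"})
       .reset_index(drop=True)
)

# The long form should have exactly as many rows as there are IDs in total
expected_rows = toy["content_ids"].str.split().str.len().sum()
assert len(long_form) == expected_rows
print(long_form)
```

The same row-count assertion could be run against the real `correlations_df` and `correlation` to confirm that no IDs were lost.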

## Explore the Topic Tree

Next, we consider using the semantic information in each `topic_tree`. 

We can use some helpful code from the notebook **Tips and Recommendations from Hosts**.

In [4]:
topics_df = pd.read_csv(DATA_PATH + 'topics.csv', index_col=0)  # use `id` as index
topics_df.head()

id,title,description,channel,category,level,language,parent,has_content
t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True


In [5]:
# define some helper functions and classes to aid with data traversal
def print_markdown(md):
    display(Markdown(md))

class Topic:
    def __init__(self, topic_id):
        self.id = topic_id

    @property
    def parent(self):
        parent_id = topics_df.loc[self.id].parent
        if pd.isna(parent_id):
            return None
        else:
            return Topic(parent_id)

    @property
    def ancestors(self):
        ancestors = []
        parent = self.parent
        while parent is not None:
            ancestors.append(parent)
            parent = parent.parent
        return ancestors

    def get_breadcrumbs(self, separator=" >> ", include_self=True, include_root=True):
        ancestors = self.ancestors
        if include_self:
            ancestors = [self] + ancestors
        if not include_root:
            ancestors = ancestors[:-1]
        return separator.join(reversed([a.title for a in ancestors]))

    def __eq__(self, other):
        if not isinstance(other, Topic):
            return False
        return self.id == other.id

    def __getattr__(self, name):
        return topics_df.loc[self.id][name]

    def __str__(self):
        return self.title
    
    def __repr__(self):
        return f"<Topic(id={self.id}, title=\"{self.title}\")>"

In [6]:
# An example
topic = Topic("t_00004da3a1b2")
print("Topic title:\t'" + topic.title + "'")
print("Breadcrumbs:\t" + topic.get_breadcrumbs())

Topic title:	'Откриването на резисторите'
Breadcrumbs:	Khan Academy (български език) >> Наука >> Физика >> Открития и проекти >> Откриването на резисторите


Now we can use this `topic_tree` to get more semantic information.

For example, for the given topic `Откриването на резисторите`, we can use the whole title string from the root node to the current topic node, `Khan Academy (български език) >> Наука >> Физика >> Открития и проекти >> Откриването на резисторите`, to calculate the similarity between this topic and other content items.

Is this better than using only the title of the single topic?
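To use the breadcrumb string as `input_1` for every topic, it may help to compute all of them in one pass instead of creating a `Topic` object per row. The following is a minimal sketch, not the hosts' code: `build_breadcrumbs` is a hypothetical helper, and it assumes the frame is indexed by topic id with `title` and `parent` columns (as `topics_df` is) and that the parent links form a tree with no cycles.

```python
import pandas as pd

def build_breadcrumbs(topics, separator=" >> "):
    """Return a Series mapping topic id -> 'root >> ... >> topic' title path."""
    titles = topics["title"].fillna("").to_dict()
    parents = topics["parent"].to_dict()
    breadcrumbs = {}
    for topic_id in topics.index:
        chain = []
        current = topic_id
        # Walk towards the root; a missing parent (NaN or None) ends the walk
        while isinstance(current, str) and current in titles:
            chain.append(titles[current])
            current = parents[current]
        breadcrumbs[topic_id] = separator.join(reversed(chain))
    return pd.Series(breadcrumbs, name="breadcrumbs")

# Toy tree with invented ids: t_root -> t_sci -> t_phy
toy_topics = pd.DataFrame(
    {"title": ["Root", "Science", "Physics"],
     "parent": [None, "t_root", "t_sci"]},
    index=["t_root", "t_sci", "t_phy"],
)
crumbs = build_breadcrumbs(toy_topics)
print(crumbs["t_phy"])  # Root >> Science >> Physics
```

Plain dictionary lookups avoid a `topics_df.loc[...]` call per ancestor, which is likely to matter when this runs over the whole topics table.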

## Build Dataset

Before training our SBERT model, we need to prepare our data in a specific format: pairs of `input_1` (topic text) and `input_2` (content title).
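One way to assemble those pairs is to join the long-form correlations with the topic text and the content titles. This is only a sketch on toy data: the three inputs below stand in for `correlation`, the per-topic breadcrumb strings, and the content titles, and the names `pairs`, `topic_text`, `content_title` and `dataset` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins (ids and titles are invented for illustration)
pairs = pd.DataFrame({
    "topic_id": ["t_a", "t_a", "t_b"],
    "content_id": ["c_1", "c_2", "c_3"],
})
topic_text = pd.Series(
    {"t_a": "Math >> Algebra", "t_b": "Math >> Geometry"}, name="input_1"
)
content_title = pd.Series(
    {"c_1": "Linear equations", "c_2": None, "c_3": "Triangles"}, name="input_2"
)

# Attach both texts to each (topic, content) pair and drop incomplete rows
dataset = (
    pairs.join(topic_text, on="topic_id")
         .join(content_title, on="content_id")
         .dropna(subset=["input_1", "input_2"])
         .reset_index(drop=True)
)
print(dataset)
```

Rows with a missing title are dropped here because an empty sentence gives the model nothing to learn from; whether to drop them or fall back to another field is a design choice.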

Since this is an unsupervised task (there are no labels indicating how similar two sentences are), we can use a SimCSE model (another sentence-transformers model for unsupervised learning).

SentenceTransformers implements `MultipleNegativesRankingLoss`, which makes training with SimCSE trivial.

>**Links for more information:**
>
>[SimCSE](https://www.sbert.net/examples/unsupervised_learning/SimCSE/README.html)
>
>[How to train sentence-transformers](https://huggingface.co/blog/how-to-train-sentence-transformers)
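To make the idea behind `MultipleNegativesRankingLoss` concrete, here is a small NumPy illustration of the objective as I understand it: for each pair in a batch, the other positives in the same batch act as negatives, and the loss is a cross-entropy over scaled cosine similarities. This is a sketch of the principle, not the library's implementation; the function name is hypothetical and `scale=20.0` is an assumption meant to mirror the library's default.

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives.

    Row i of `anchors` is paired with row i of `positives`; every other row
    of `positives` in the batch acts as a negative for anchor i.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # shape (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i, i.e. the diagonal
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))

# Correctly aligned pairs should score a lower loss than misaligned ones
aligned = multiple_negatives_ranking_loss(anchors, anchors)
shuffled = multiple_negatives_ranking_loss(anchors, anchors[::-1])
print(aligned < shuffled)
```

In actual training, the embeddings would come from the SBERT model and the library's own loss class would be used; the toy vectors here only show why larger batches supply more negatives per pair.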

# Train the SBERT Model

# Make Prediction
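The prediction step from the overview (keep the `TOP_N` content items with the highest cosine similarity and the same language as the topic) can be sketched as follows. This is a minimal illustration on hand-written 2-d vectors, assuming the embeddings are already computed and non-zero; `top_n_same_language` and its parameters are hypothetical names.

```python
import numpy as np

def top_n_same_language(topic_vec, topic_lang, content_vecs,
                        content_langs, content_ids, top_n=5):
    """Return ids of the top_n most similar contents sharing the topic language."""
    topic_vec = np.asarray(topic_vec, dtype=float)
    content_vecs = np.asarray(content_vecs, dtype=float)
    # Keep only candidates in the same language as the topic
    mask = np.asarray(content_langs) == topic_lang
    if not mask.any():
        return []
    candidates = content_vecs[mask]
    candidate_ids = np.asarray(content_ids)[mask]
    # Cosine similarity between the topic and each remaining candidate
    sims = (candidates @ topic_vec) / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(topic_vec)
    )
    order = np.argsort(-sims)[:top_n]
    return [str(i) for i in candidate_ids[order]]

# Toy data: c_3 is very similar but in another language, so it is excluded
result = top_n_same_language(
    topic_vec=[1.0, 0.0],
    topic_lang="en",
    content_vecs=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.9, 0.1]],
    content_langs=["en", "en", "bg", "en"],
    content_ids=["c_1", "c_2", "c_3", "c_4"],
    top_n=2,
)
print(result)  # ['c_1', 'c_4']
```

Filtering by language before ranking keeps the similarity computation small and guarantees that a cross-language item never takes a slot in the top `TOP_N`.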

# What to do next?