This is where I train an SBERT model (more specifically, a **SimCSE**-style model for **unsupervised** learning).

**Steps:**
* Restructure `correlations.csv` into a more straightforward form
* Explore the topic tree, then use the `title`s from `root_node` down to the current `topic_node` as `input_1`
* Use the `title` of each `content_item` as `input_2`
* Use the information above to build my dataset
* Train an SBERT model and add it as a dataset for future use

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os
from IPython.display import display, Markdown
from pathlib import Path
from tqdm.auto import tqdm

DATA_PATH = "/kaggle/input/learning-equality-curriculum-recommendations/"

# Build my Dataset

## Reconstruct Correlations

First, the original `correlations.csv` file is not in an ideal form for training (each row contains one `topic_id` and several `content_ids`), so let's restructure it.

In [43]:
correlations_df = pd.read_csv(DATA_PATH + 'correlations.csv')
correlations_df.head()

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
1,t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
2,t_00069b63a70a,c_11a1dc0bfb99
3,t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
4,t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4


In [44]:
correlation = correlations_df.copy()
correlation.content_ids = correlation.content_ids.str.split()
correlation = correlation.explode('content_ids').rename(columns={"content_ids": "content_id"}).reset_index(drop=True)
correlation.head(10)

Unnamed: 0,topic_id,content_id
0,t_00004da3a1b2,c_1108dd0c7a5d
1,t_00004da3a1b2,c_376c5a8eb028
2,t_00004da3a1b2,c_5bc0e1e2cba0
3,t_00004da3a1b2,c_76231f9d0b5e
4,t_00068291e9a4,c_639ea2ef9c95
5,t_00068291e9a4,c_89ce9367be10
6,t_00068291e9a4,c_ac1672cdcd2c
7,t_00068291e9a4,c_ebb7fdf10a7e
8,t_00069b63a70a,c_11a1dc0bfb99
9,t_0006d41a73a8,c_0c6473c3480d


Now we have a dataframe `correlation` with 279919 rows and 2 columns. Each row contains only one `topic_id` and one `content_id`.

## Explore the Topic Tree

Next, we consider using the semantic information in each `topic_tree`. 

We can use some helpful code from the notebook **Tips and Recommendations from Hosts**.

In [78]:
topics_df = pd.read_csv(DATA_PATH + 'topics.csv', index_col=0).fillna({"title": ""})  # use `id` as index
topics_df.head()

id,title,description,channel,category,level,language,parent,has_content
t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True


In [79]:
# define some helper functions and classes to aid with data traversal
def print_markdown(md):
    display(Markdown(md))

class Topic:
    def __init__(self, topic_id):
        self.id = topic_id

    @property
    def parent(self):
        parent_id = topics_df.loc[self.id].parent
        if pd.isna(parent_id):
            return None
        else:
            return Topic(parent_id)

    @property
    def ancestors(self):
        ancestors = []
        parent = self.parent
        while parent is not None:
            ancestors.append(parent)
            parent = parent.parent
        return ancestors

    def get_breadcrumbs(self, separator="[SEP]", include_self=True, include_root=True):
        ancestors = self.ancestors
        if include_self:
            ancestors = [self] + ancestors
        if not include_root:
            ancestors = ancestors[:-1]
        return separator.join(reversed([a.title for a in ancestors]))

    def __eq__(self, other):
        if not isinstance(other, Topic):
            return False
        return self.id == other.id

    def __getattr__(self, name):
        return topics_df.loc[self.id][name]

    def __str__(self):
        return self.title
    
    def __repr__(self):
        return f"<Topic(id={self.id}, title=\"{self.title}\")>"

In [80]:
# An example
topic = Topic("t_00004da3a1b2")
print("Topic title:\t'" + topic.title + "'")
print("Breadcrumbs:\t" + topic.get_breadcrumbs())

Topic title:	'Откриването на резисторите'
Breadcrumbs:	Khan Academy (български език)[SEP]Наука[SEP]Физика[SEP]Открития и проекти[SEP]Откриването на резисторите


Now we can use this topic tree to get more semantic information.

For example, for the topic `Откриването на резисторите` ("The discovery of resistors"), we can use the whole title string from the root node down to the current topic node, `Khan Academy (български език) >> Наука >> Физика >> Открития и проекти >> Откриването на резисторите`, to calculate the similarity between this topic and the content items.

Is this better than using only the title of the single topic?
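To make the two candidate inputs concrete, here is a minimal sketch on a toy tree. The `parents` and `titles` dictionaries and the `breadcrumbs` helper are hypothetical stand-ins for `topics_df` and `Topic.get_breadcrumbs`, not part of the competition data. The cache means that topics sharing ancestors reuse the already-built prefix instead of walking up the tree again.

```python
from functools import lru_cache

# Toy stand-ins for topics_df (hypothetical values):
# child id -> parent id (None marks a root), and id -> title.
parents = {"root": None, "sci": "root", "phys": "sci", "res": "phys"}
titles = {
    "root": "Khan Academy",
    "sci": "Science",
    "phys": "Physics",
    "res": "Discovery of resistors",
}

@lru_cache(maxsize=None)
def breadcrumbs(topic_id, separator="[SEP]"):
    """Join titles from the root down to topic_id, caching shared prefixes."""
    parent_id = parents[topic_id]
    if parent_id is None:
        return titles[topic_id]
    return breadcrumbs(parent_id, separator) + separator + titles[topic_id]

title_only = titles["res"]      # option 1: just the topic's own title
full_path = breadcrumbs("res")  # option 2: the whole path from the root
print(title_only)
print(full_path)
```

Which of the two strings works better as model input is an empirical question; the sketch only shows how each would be built.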

## Build Dataset

Before training our SBERT model, we need to prepare our data in a certain format.

Since we have no labels indicating how similar two sentences are (only matched topic/content pairs), we can follow the SimCSE recipe, a sentence-transformers approach for training without similarity labels.

SentenceTransformers implements `MultipleNegativesRankingLoss`, which makes SimCSE-style training straightforward.

>**Links for more information:**
>
>[SimCSE](https://www.sbert.net/examples/unsupervised_learning/SimCSE/README.html)
>
>[How to train sentence-transformers](https://huggingface.co/blog/how-to-train-sentence-transformers)
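To get a feel for what `MultipleNegativesRankingLoss` optimizes, here is a plain-Python sketch of the in-batch negatives idea: for each anchor, its own paired text is the positive and every other pair's text in the batch is a negative. This is an illustration, not the library's implementation. The helper names are hypothetical, and the cosine similarity with a scale of 20 mirrors what I understand the library defaults to be.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct 'class' for row i is column i."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-d "embeddings" (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive points the same way as its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives assigned to the wrong anchors
loss_aligned = in_batch_ranking_loss(anchors, aligned)
loss_swapped = in_batch_ranking_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each anchor is closest to its own pair and large when it is closer to someone else's, which is why larger batches (more negatives) tend to give a stronger training signal.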

We can explore an example dataset to find out the required format.

In [57]:
!pip install datasets


In [58]:
from datasets import load_dataset
dataset = load_dataset("embedding-data/sentence-compression")


In [62]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 180000
    })
})


In [61]:
print(f"- The dataset has {dataset['train'].num_rows} examples.")
print(f"- Each example is a {type(dataset['train'][0])} with a {type(dataset['train'][0]['set'])} as value.")
print(f"- Examples look like this: {dataset['train'][0]}")

- The dataset has 180000 examples.
- Each example is a <class 'dict'> with a <class 'list'> as value.
- Examples look like this: {'set': ["The USHL completed an expansion draft on Monday as 10 players who were on the rosters of USHL teams during the 2009-10 season were selected by the League's two newest entries, the Muskegon Lumberjacks and Dubuque Fighting Saints.", 'USHL completes expansion draft']}


Now that we know the required format, we can build our own training dataset.

In [101]:
correlation_new = correlation.copy()
correlation_new.head()

Unnamed: 0,topic_id,content_id
0,t_00004da3a1b2,c_1108dd0c7a5d
1,t_00004da3a1b2,c_376c5a8eb028
2,t_00004da3a1b2,c_5bc0e1e2cba0
3,t_00004da3a1b2,c_76231f9d0b5e
4,t_00068291e9a4,c_639ea2ef9c95


We can also construct a `ContentItem` class to get each content item's title.

In [85]:
content_df = pd.read_csv(DATA_PATH + 'content.csv', index_col=0).fillna({"title": ""})  # use `id` as index

In [100]:
class ContentItem:
    def __init__(self, content_id):
        self.id = content_id
        
    def __getattr__(self, name):
        return content_df.loc[self.id][name]

    def __str__(self):
        return self.title
    
    def __repr__(self):
        return f"<ContentItem(id={self.id}, title=\"{self.title}\")>"

    def __eq__(self, other):
        if not isinstance(other, ContentItem):
            return False
        return self.id == other.id
    
    def get_title(self):
        return self.title

Put them together to make our `input_new` dataframe!
* `topic_id`: the id of one topic
* `content_id`: the id of one content item
* `topic_input`: titles from the root node to the topic node with the above `topic_id`
* `content_input`: title of the content with the above `content_id`

In [114]:
input_df = pd.DataFrame(columns=['topic_input', 'content_input'])

for index, row in tqdm(correlation_new.iterrows(), total=correlation_new.shape[0]):
    top = Topic(row['topic_id'])
    con = ContentItem(row['content_id'])
    input_df.loc[index] = (top.get_breadcrumbs(), con.get_title())
    
input_new = pd.concat([correlation_new, input_df], axis=1)

  0%|          | 0/279919 [00:00<?, ?it/s]
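The row-by-row loop above does one `.loc` assignment and one tree walk per pair, which is slow at 279919 rows. A possible alternative is to compute each text once per unique id and broadcast it with `Series.map`. The sketch below uses small toy frames; `toy_correlation`, `toy_breadcrumbs` and `toy_content_titles` are hypothetical stand-ins for `correlation_new`, the per-topic breadcrumb strings and `content_df["title"]`.

```python
import pandas as pd

# Toy stand-ins for the notebook's dataframes (hypothetical values).
toy_correlation = pd.DataFrame({
    "topic_id": ["t_1", "t_1", "t_2"],
    "content_id": ["c_a", "c_b", "c_a"],
})
# In the notebook this dict could be built once per unique topic id,
# for example from Topic(tid).get_breadcrumbs().
toy_breadcrumbs = {"t_1": "Root[SEP]Algebra", "t_2": "Root[SEP]Geometry"}
# In the notebook this would be content_df["title"], indexed by content id.
toy_content_titles = pd.Series({"c_a": "Intro video", "c_b": "Practice set"})

toy_input = toy_correlation.copy()
# Each id is looked up in the mapping, so repeated ids cost no extra tree walks.
toy_input["topic_input"] = toy_input["topic_id"].map(toy_breadcrumbs)
toy_input["content_input"] = toy_input["content_id"].map(toy_content_titles)
print(toy_input)
```

The result has the same four columns as `input_new`, so the later cells would not need to change.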

In [122]:
# check whether we need to drop certain rows
for index, row in tqdm(input_new.iterrows(), total=input_new.shape[0]):
    if row['topic_id'] == "" or row['content_id'] == "":
        print('yes')
        break

  0%|          | 0/279919 [00:00<?, ?it/s]
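The loop above inspects the id columns, which come straight from `correlations.csv`. Since missing titles were filled with empty strings when loading `topics.csv` and `content.csv`, the text columns may be the more informative place to look. Here is a vectorized sketch on a toy frame with hypothetical values; whether blank rows should actually be dropped is a judgment call I have not tested.

```python
import pandas as pd

# Toy frame with the same text columns as input_new (hypothetical values).
toy_input = pd.DataFrame({
    "topic_input": ["Root[SEP]Algebra", "Root[SEP]Geometry", "Root[SEP]Algebra"],
    "content_input": ["Intro video", "", "   "],
})

# Flag rows where either text column is empty or whitespace-only.
is_blank = toy_input[["topic_input", "content_input"]].apply(
    lambda col: col.str.strip().eq("")
).any(axis=1)

n_blank = int(is_blank.sum())
cleaned = toy_input[~is_blank].reset_index(drop=True)
print(n_blank, len(cleaned))
```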

In [123]:
# sentence lists for input
topic_input_list = input_new['topic_input'].to_list()
content_input_list = input_new['content_input'].to_list()

Then use these two lists to build the training dataset.

In [146]:
import sys
sys.path.append('../input/sentence-transformers-offline-install/sentence-transformers')
import sentence_transformers
from sentence_transformers import InputExample, losses, SentenceTransformer
from torch.utils.data import DataLoader

# Convert train sentences to sentence pairs
train_data = [InputExample(texts=[t, c]) for t, c in zip(topic_input_list, content_input_list)]

# DataLoader to batch your data
train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True)
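One caveat with in-batch negatives: a topic usually has several content items, so a shuffled batch can contain the same topic twice, and the other copy's content then acts as a false negative. Below is a plain-Python sketch of batching that avoids repeated texts within a batch. `no_duplicate_batches` is a hypothetical helper written for illustration; if I remember correctly, sentence-transformers also ships a `NoDuplicatesDataLoader` in `sentence_transformers.datasets` for this purpose.

```python
def no_duplicate_batches(pairs, batch_size):
    """Greedily group (topic, content) pairs so no text repeats inside a batch."""
    batches = []
    remaining = list(pairs)
    while remaining:
        batch, seen, leftover = [], set(), []
        for topic_text, content_text in remaining:
            fits = len(batch) < batch_size
            if fits and topic_text not in seen and content_text not in seen:
                batch.append((topic_text, content_text))
                seen.update((topic_text, content_text))
            else:
                # Defer the pair to a later batch.
                leftover.append((topic_text, content_text))
        batches.append(batch)
        remaining = leftover
    return batches

# Toy pairs (hypothetical): topic "A" has two contents, content "x" has two topics.
pairs = [("A", "x"), ("A", "y"), ("B", "x"), ("C", "z")]
batches = no_duplicate_batches(pairs, batch_size=2)
print(batches)
```

This greedy version rescans the leftovers on every pass, so it is only meant to show the idea, not to be run on the full 279919 pairs.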

# Train the SBERT Model

In [151]:
import torch
from transformers import AutoModel

In [152]:
# config
MODEL_PATH = "/kaggle/input/sentence-embedding-models/paraphrase-multilingual-mpnet-base-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"
EPOCHS = 5

In [153]:
# model: load as a SentenceTransformer so that `losses` and `model.fit` work
model = SentenceTransformer(MODEL_PATH, device=device)

# loss
train_loss = losses.MultipleNegativesRankingLoss(model)

In [154]:
# train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=EPOCHS,
    show_progress_bar=True
)

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Iteration:   0%|          | 0/2187 [00:00<?, ?it/s]

KeyboardInterrupt: 
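The `Iteration` total of 2187 follows directly from the dataset size and the batch size, and the same arithmetic gives the total number of optimizer steps. The sketch below reproduces it. The 10% warmup fraction is an assumed rule of thumb, not something from this notebook. `fit` accepts a `warmup_steps` argument, and in the sentence-transformers version I have used its default is 10000, which would cover most of the training here, so passing a smaller value may be worth trying.

```python
import math

n_pairs = 279919   # rows in `correlation`
batch_size = 128   # DataLoader batch size used above
epochs = 5         # EPOCHS

steps_per_epoch = math.ceil(n_pairs / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs
warmup_steps = int(0.1 * total_steps)  # assumed 10% warmup, a common heuristic
print(steps_per_epoch, total_steps, warmup_steps)
```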

In [None]:
# save your model


# Make Prediction

# What to do next?
* Use Swifter to speed up DataFrame processing