<a href="https://www.kaggle.com/code/kkkkkkc/lecr-eda?scriptVersionId=115795902" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

This is my notebook for the competition LECR :)

I'll do a simple data analysis, and try models like SBERT and SimCSE.

I'm new to Kaggle. Please leave your comments and I'll make improvements :)

# Introduction
This is where I briefly introduce the data provided by the host.

And I will mention some tips for this competition.

1. GOAL: The goal of this competition is to match content items to each topic *(WARNING: A TOPIC MAY HAVE MORE THAN ONE CORRELATED CONTENT ITEM)*.


2. DATASET:
    * `topics.csv`:  Contains a row for each topic in the dataset.
    * `content.csv`: Contains a row for each content item in the dataset.
    * `correlations.csv`: Contains `topic_id` with the correlated `content_ids`.
    * `sample_submission.csv`: A submission file in the correct format *(WARNING: YOU SHOULD ONLY SUBMIT PREDICTIONS FOR THOSE TOPICS LISTED IN THIS FILE)*.


3. EVALUATION:
    * LECR uses MEAN F2 SCORE for evaluation.
    * Since it's a CODE COMPETITION, the actual test set contains additional topics and content items. In the public version, the sample test data are drawn from the training set.


4. TIPS FROM THE HOST:
    * **Context matters! Explore the tree**: This is because other nodes in the topic tree may contain more semantic context than the given topic.
    * **Narrow down by language**: Most of the time the language of a topic will match the language of its correlated content. You can use this feature to narrow down or give priority.
    * **Focus on aligned and supplemental for performance**: The test set does not contain any topics from `source` channels, so it's important to focus on `aligned` and `supplemental` topics to achieve better performance.
    * **Balance the semantics of `title`, `description`, and `text`**: It's important to weight these fields carefully because the semantic information they contain varies across topics and content items.
    * **Disregard `copyright_holder` for training purposes**: This field is blanked out in the test data, so don't use it as a training feature.
    * **Restructure correlations for efficiency**: `correlations.csv` stores all `content_ids` of a topic as one space-separated string, so you may want to restructure it (for example, into one row per topic-content pair) before working with it.
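
One possible way to restructure the correlations is sketched below on a tiny made-up frame with the same two columns as `correlations.csv`. The ids and the names `toy_correlations`, `pairs`, and `topic_to_contents` are invented for illustration; this is not necessarily the layout the host has in mind.

```python
import pandas as pd

# Toy stand-in for correlations.csv; these ids are made up for illustration.
toy_correlations = pd.DataFrame({
    "topic_id": ["t_a", "t_b"],
    "content_ids": ["c_1 c_2 c_3", "c_2"],
})

# One row per (topic, content) pair instead of one space-separated string.
pairs = (
    toy_correlations
    .assign(content_id=toy_correlations["content_ids"].str.split())
    .explode("content_id")
    .drop(columns="content_ids")
    .reset_index(drop=True)
)

# Set-valued lookup, handy for fast membership checks when scoring predictions.
topic_to_contents = pairs.groupby("topic_id")["content_id"].apply(set).to_dict()

print(pairs)
print(topic_to_contents)
```

The long format is convenient for joins against `topics.csv` and `content.csv`, while the dictionary of sets is convenient for evaluation.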

# Exploratory Data Analysis
This is where I do some simple data analysis.

## Before Actual Work

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os

DATA_PATH = "/kaggle/input/learning-equality-curriculum-recommendations/"

In [2]:
content = pd.read_csv(DATA_PATH + 'content.csv')
topics = pd.read_csv(DATA_PATH + 'topics.csv')
correlations = pd.read_csv(DATA_PATH + 'correlations.csv')

## Topics

In [3]:
print(topics.shape)
topics.head()

(76972, 9)


Unnamed: 0,id,title,description,channel,category,level,language,parent,has_content
0,t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
1,t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
2,t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
3,t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
4,t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True


In [4]:
topics.describe(include='all')

Unnamed: 0,id,title,description,channel,category,level,language,parent,has_content
count,76972,76970,34953,76972,76972,76972.0,76972,76801,76972
unique,76972,45082,23067,171,3,,28,17512,2
top,t_00004da3a1b2,Assessments,v0.1,fef095,source,,en,t_344131c2889b,True
freq,1,558,371,5770,43487,,36161,270,61517
mean,,,,,,3.963026,,,
std,,,,,,1.099633,,,
min,,,,,,0.0,,,
25%,,,,,,3.0,,,
50%,,,,,,4.0,,,
75%,,,,,,4.0,,,


Things we know from the above information:
* Only 2 topics do not have `title`
* Nearly half of them have `description`
* Nearly all of them have `parent` (empty means it's the root node)

Things we need to find out:
* Language distribution of `title` and `description`
* Length of `title` and `description`
* Number of topics in each channel
* Distribution of `category`
* Distribution of `level`
* Distribution of `language`
* Sample tree
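
To look at a sample tree, one option is to walk the `parent` links from a topic up to its root and join the titles into a breadcrumb. The sketch below uses a made-up three-node tree; `parent_of`, `title_of`, and `breadcrumb` are hypothetical names, and it assumes that an empty `parent` has been mapped to `None`.

```python
# Toy topic tree; ids and titles are made up for illustration.
# With the real data, both dicts could be built from the `id`, `parent`,
# and `title` columns of topics.csv.
parent_of = {"t_root": None, "t_math": "t_root", "t_alg": "t_math"}
title_of = {"t_root": "Curriculum", "t_math": "Mathematics", "t_alg": "Algebra"}


def breadcrumb(topic_id, parent_of, title_of, sep=" > "):
    """Join the titles on the path from the root down to `topic_id`."""
    titles = []
    seen = set()
    current = topic_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles
        titles.append(title_of.get(current) or "")
        current = parent_of.get(current)
    return sep.join(reversed(titles))


print(breadcrumb("t_alg", parent_of, title_of))
```

A breadcrumb like this is also one way to give a short topic title more semantic context before embedding it.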

## Content

In [5]:
print(content.shape)
content.head()

(154047, 8)


Unnamed: 0,id,title,description,kind,text,language,copyright_holder,license
0,c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,
1,c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,
2,c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,
3,c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,\nNado de aproximação\nSaber nadar nas ondas ...,pt,Sikana Education,CC BY-NC-ND
4,c_00016694ea2a,geometry-m3-topic-a-overview.pdf,geometry-m3-topic-a-overview.pdf,document,Estándares Comunes del Estado de Nueva York\n\...,es,Engage NY,CC BY-NC-SA


In [6]:
content.describe(include="all")

Unnamed: 0,id,title,description,kind,text,language,copyright_holder,license
count,154047,154038,89456,154047,74035,154047,71821,74035
unique,154047,130937,76305,5,70687,27,148,7
top,c_00002381196d,Video,v0.1,video,Unsupported browser\n\nThe HTML5 content is no...,en,Khan Academy,CC BY-NC-SA
freq,1,504,903,61487,234,65939,17034,52088


Things we know from the above information:
* Nearly all of them have `title`
* About 58% of them have `description` and about 48% have `text`

Things we need to find out:
* Language distribution of `title`, `description`, and `text`
* Length of `title`, `description`, and `text`
* Distribution of `language`
* Distribution of `kind`
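
A minimal sketch of how these questions could be answered with pandas, shown on a made-up frame that only mimics the columns of `content.csv` (the rows and the names `toy_content`, `coverage`, and `lengths` are invented for illustration):

```python
import pandas as pd

# Toy stand-in for content.csv; rows are made up for illustration.
toy_content = pd.DataFrame({
    "title": ["Adding fractions", "Video", None],
    "description": ["Learn to add fractions.", None, None],
    "text": [None, "Full transcript here", None],
    "kind": ["video", "document", "video"],
    "language": ["en", "en", "es"],
})

text_fields = ["title", "description", "text"]

# Share of rows in which each text field is present.
coverage = toy_content[text_fields].notna().mean()

# Character length of each field; missing values stay NaN and are skipped by mean().
lengths = toy_content[text_fields].apply(lambda col: col.str.len())

print(coverage)
print(lengths.mean())
print(toy_content["kind"].value_counts())
print(toy_content["language"].value_counts())
```

The same calls should work on the real `content` frame by replacing `toy_content` with `content`.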

## Correlations

In [7]:
print(correlations.shape)
correlations.head()

(61517, 2)


Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
1,t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
2,t_00069b63a70a,c_11a1dc0bfb99
3,t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
4,t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4


In [8]:
correlations.describe(include='all')

Unnamed: 0,topic_id,content_ids
count,61517,61517
unique,61517,47299
top,t_00004da3a1b2,c_dd739e116435
freq,1,122


Things we know from the above information:
* Contains only 61517 rows, one per topic, which matches the number of topics with `has_content` set to `True`

Things we need to find out:
* The maximum, minimum, and average number of content items per topic
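
Counting content items per topic comes down to splitting the space-separated `content_ids` string. The sketch below does this on a made-up frame with the same two columns as `correlations.csv` (`toy_correlations`, `n_contents`, and `summary` are illustrative names):

```python
import pandas as pd

# Toy stand-in for correlations.csv; these ids are made up for illustration.
toy_correlations = pd.DataFrame({
    "topic_id": ["t_a", "t_b", "t_c"],
    "content_ids": ["c_1 c_2 c_3", "c_2", "c_4 c_5"],
})

# Number of content items linked to each topic.
n_contents = toy_correlations["content_ids"].str.split().str.len()

# Minimum, maximum, and average number of content items per topic.
summary = n_contents.agg(["min", "max", "mean"])
print(summary)
```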

# Modeling
This is where I try models like SBERT and SimCSE.

I'll explain why I chose these two models, and provide an example to illustrate the usage.

Considering the GOAL of this competition (matching content items to the given topic), this is actually a Semantic Textual Similarity task.

[SBERT](https://www.sbert.net/), or Sentence-BERT, is a framework for computing sentence/text embeddings, with pretrained models covering more than 100 languages **(multilingual)**. These embeddings can then be used in tasks like semantic similarity. It's easy to use through the `sentence-transformers` library and the models hosted on Hugging Face. SBERT actually uses a Siamese network structure: each text is embedded once and pairs are compared with cosine similarity, which is much **faster** than the usual Cross-Encoder structure that has to process every pair jointly.

[SimCSE](https://github.com/princeton-nlp/SimCSE) applies the idea of contrastive learning to sentence embeddings. Its authors reported **SOTA** performance on **unsupervised** sentence-embedding tasks.

In [9]:
# This is a simple example of how to use SBERT
# SentenceTransformers implements the MultipleNegativesRankingLoss, which makes training with SimCSE trivial
# Click the links above to see more details
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=0169088ebe333a5709cc503afc6f0509e53dc470b3bad47f55510ce78c07bd2d
  Stored in directory: /root/.cache/pip/wheels/bf/06/fb/d59c1e5bd1dac7f6cf61ec0036cc3a10ab8fecaa6b2c3d3ee9
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2

In [10]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')  # small, fast model that still gives good results

# lists of sentences
sentences1 = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome']
sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# compute embeddings
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# output the pairs with scores
for i in range(len(sentences1)):
    print("{} \t {} \t score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The cat sits outside 	 The dog plays in the garden 	 score: 0.2838
A man is playing guitar 	 A woman watches TV 	 score: -0.0327
The new movie is awesome 	 The new movie is so great 	 score: 0.8939
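
For this competition the same cosine similarity could be used to retrieve the closest content items for a topic, restricted to the topic's language. The sketch below is only an assumption about how that might look: it uses made-up 2-D vectors in place of real `model.encode(...)` output, and `top_k_same_language` is a hypothetical helper, not part of `sentence-transformers`. It applies a hard language filter, although languages only match most of the time, so a softer priority would also be reasonable.

```python
import numpy as np


def top_k_same_language(topic_vec, topic_lang, content_vecs, content_langs, k=2):
    """Return indices of the k most cosine-similar content items in the topic's language."""
    content_vecs = np.asarray(content_vecs, dtype=float)
    topic_vec = np.asarray(topic_vec, dtype=float)
    # Cosine similarity is the dot product of L2-normalised vectors.
    content_norm = content_vecs / np.linalg.norm(content_vecs, axis=1, keepdims=True)
    topic_norm = topic_vec / np.linalg.norm(topic_vec)
    scores = content_norm @ topic_norm
    # Mask out content in other languages so it can never be selected.
    same_lang = np.asarray(content_langs) == topic_lang
    scores = np.where(same_lang, scores, -np.inf)
    order = np.argsort(-scores)
    return [int(i) for i in order[:k] if np.isfinite(scores[i])]


# Made-up 2-D "embeddings"; real ones would come from model.encode(...).
content_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
content_langs = ["en", "es", "en", "en"]

print(top_k_same_language([1.0, 0.0], "en", content_vecs, content_langs, k=2))
```

Here the Spanish item at index 1 is skipped even though it is very similar, because the topic language is `en`.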


# Evaluation
This is where I explain the F2 SCORE.

When we evaluate a model, we often use metrics like Precision and Recall. However, it's hard to achieve good results on both at once, so we usually have to trade one off against the other. The **F-Score** combines the two into a single number.

The F-Score is the weighted harmonic mean of a system's precision and recall values. It can be calculated by the following formula:

$$F_\beta = (1+\beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision}+\text{Recall}}$$

If we set $\beta = n$, the $F_\beta$-Score becomes the $F_n$-Score. This means that if we set $\beta = 2$, we'll get the formula of the **F2-Score**. With $\beta = 2$, recall is weighted more heavily than precision, so missing a correlated content item costs more than predicting an extra one.
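
To make the formula concrete, here is a minimal sketch of the F2-Score for one topic and its mean over topics, using sets of content ids. The ids are made up, `f2_score` and `mean_f2` are illustrative names, and returning 0 when there is no overlap is my assumption rather than a rule confirmed by the host.

```python
def f2_score(true_ids, pred_ids, beta=2.0):
    """F-beta score for one topic, from sets of true and predicted content ids."""
    true_ids, pred_ids = set(true_ids), set(pred_ids)
    tp = len(true_ids & pred_ids)  # correctly predicted content items
    if tp == 0:
        return 0.0  # assumption: no overlap (or empty prediction) scores 0
    precision = tp / len(pred_ids)
    recall = tp / len(true_ids)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean_f2(truth, preds):
    """Average the per-topic F2 over all topics in `truth`."""
    return sum(f2_score(truth[t], preds.get(t, ())) for t in truth) / len(truth)


# Made-up example: t_a has 2 of 3 true items found plus 2 wrong ones; t_b is perfect.
truth = {"t_a": {"c_1", "c_2", "c_3"}, "t_b": {"c_4"}}
preds = {"t_a": {"c_1", "c_2", "c_8", "c_9"}, "t_b": {"c_4"}}

print(f2_score(truth["t_a"], preds["t_a"]))
print(mean_f2(truth, preds))
```

For `t_a`, Precision is 0.5 and Recall is about 0.667, which gives an F2 of about 0.625; averaged with the perfect score for `t_b`, the mean is about 0.8125.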