<h1 style="font-family:calibri;font-size:250%;text-align:center;">🧑‍🎓Curriculum Recommendations - EDA for everyone</h1>

<a id="table"></a>
<h1 style="background-color:mediumspringgreen;font-family:calibri;font-size:250%;text-align:center;border-radius: 25px 25px;">Table of Contents</h1>

* [1. Introduction](#1)

* [2. Data Exploration](#2)

    * [2.1 Explore content data](#2.1)
    
    * [2.2 Explore topics](#2.2)
    
    * [2.3 Explore correlations](#2.3)

<a id="1"></a>
## <p style="padding:10px;background-color:mediumspringgreen;margin:0;color:black;font-family:calibri;font-size:120%;text-align:center;border-radius: 25px 25px;overflow:hidden;font-weight:500">1. Introduction</p>

The goal of this competition is to streamline the process of matching educational content to specific topics in a curriculum. You will develop an accurate and efficient model trained on a library of K-12 educational materials that have been organized into a variety of topic taxonomies. These materials are in diverse languages, and cover a wide range of topics, particularly in STEM (Science, Technology, Engineering, and Mathematics).

This is my first public EDA on Kaggle for natural language data. I made this notebook primarily for myself, to practice my EDA skills and to understand this data. If you have corrections or advice, feel free to leave them in the comments section. I hope it will be useful and insightful for you😉

<a id="2"></a>
## <p style="padding:10px;background-color:mediumspringgreen;margin:0;color:black;font-family:calibri;font-size:120%;text-align:center;border-radius: 25px 25px;overflow:hidden;font-weight:500">2. Data Exploration</p>

Let's load the libraries first. I had issues with plotting on plotly 5.12.0, so I downgraded to 5.11.0.

In [1]:
!pip install plotly==5.11.0

Collecting plotly==5.11.0
  Downloading plotly-5.11.0-py2.py3-none-any.whl (15.3 MB)
Installing collected packages: plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 5.12.0
    Uninstalling plotly-5.12.0:
      Successfully uninstalled plotly-5.12.0
Successfully installed plotly-5.11.0

In [2]:
from pathlib import Path

import pandas as pd
import numpy as np
import plotly.express as px

And load the available data

In [3]:
data_path = Path("/kaggle/input/learning-equality-curriculum-recommendations")

In [4]:
content_data = pd.read_csv(data_path / "content.csv")
correlations_data = pd.read_csv(data_path / "correlations.csv")
topics_data = pd.read_csv(data_path / "topics.csv")
sample_submission = pd.read_csv(data_path / "sample_submission.csv")

<a id="2.1"></a>
## <p style="padding:10px;background-color:mediumspringgreen;margin:0;color:black;font-family:calibri;font-size:120%;text-align:center;border-radius: 25px 25px;overflow:hidden;font-weight:500">2.1 Explore content data</p>

Let's take a first look at this table

In [5]:
content_data

Unnamed: 0,id,title,description,kind,text,language,copyright_holder,license
0,c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,
1,c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,
2,c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,
3,c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,\nNado de aproximação\nSaber nadar nas ondas ...,pt,Sikana Education,CC BY-NC-ND
4,c_00016694ea2a,geometry-m3-topic-a-overview.pdf,geometry-m3-topic-a-overview.pdf,document,Estándares Comunes del Estado de Nueva York\n\...,es,Engage NY,CC BY-NC-SA
...,...,...,...,...,...,...,...,...
154042,c_fffcbdd4de8b,2. 12: Diffusion,,html5,What will eventually happen to these dyes?\n\n...,en,CSU and Merlot,CC BY-NC-SA
154043,c_fffe15a2d069,Sommare facendo gruppi da 10,Sal somma 5+68 spezzando il 5 in un 2 e un 3.\n\n,video,,it,,
154044,c_fffed7b0d13a,Introdução à subtração,Sal fala sobre o que significa subtrair. Os ex...,video,,pt,,
154045,c_ffff04ba7ac7,SA of a Cone,,video,,en,,


The table has the following columns:

  - id - A unique identifier for this content item.
  - title - Title text for this content item.
  - description - Description text. May be empty.
  - language - Language code representing the language of this content item.
  - kind - Describes what format of content this item represents, as one of:
      - document (text is extracted from a PDF or EPUB file)
      - video (text is extracted from the subtitle file, if available)
      - exercise (text is extracted from questions/answers)
      - audio (no text)
      - html5 (text is extracted from HTML source)
  - text - Extracted text content, if available and if licensing permitted (around half of content items have text content).
  - copyright_holder - If text was extracted from the content, indicates the owner of the copyright for that content. Blank for all test set items.
  - license - If text was extracted from the content, the license under which that content was made available. Blank for all test set items.

I want to add an "is_text_nan" column to see how missing text relates to the other features.

In [6]:
content_data["is_text_nan"] = content_data["text"].isna()

Now I can build some plots to understand my data better

In [7]:
fig = px.histogram(content_data, x="kind", pattern_shape="is_text_nan", pattern_shape_sequence=["x", ""],
                   text_auto=True, title="Number of content of each type").update_xaxes(categoryorder="total descending")
fig.show()

As we can see, a large share of videos and exercises have no associated text, while documents and html5 items have text far more often. This is probably related to how hard it is to extract text from each kind of source.

Also, no audio item has text. But audio makes up only a very small fraction of the train dataset. Hopefully the same holds for the test set.
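The same picture can be expressed as numbers instead of bar patterns. Here is a minimal sketch that computes the share of items without text for each kind. The `toy_content` frame is a made-up stand-in so the cell runs on its own; with the real data you would use `content_data` instead.

```python
import pandas as pd

# Hypothetical toy stand-in for content_data (the real table comes from content.csv).
toy_content = pd.DataFrame({
    "kind": ["video", "video", "document", "audio", "html5", "video"],
    "text": [None, "subtitles", "pdf text", None, "page text", None],
})

# The mean of a boolean flag is the fraction of True values,
# so this gives the share of items with missing text inside each kind.
missing_share = (
    toy_content["text"].isna()
    .groupby(toy_content["kind"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_share)
```

On the toy frame, audio comes out fully missing and video mostly missing; the real proportions have to be read from the actual data.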

In [8]:
fig = px.histogram(content_data, x="language", pattern_shape="is_text_nan", pattern_shape_sequence=["x", ""],
                   text_auto=True, log_y=True, title="Number of content items in each language (log scale)").update_xaxes(categoryorder="total descending")
fig.show()

As we see here, by far the largest share of the content is in English. There are two takeaways from that:
  - People from non-English-speaking countries will have much less content related to each topic.
  - Future models may overfit to English and give many people worse recommendations. This should be prevented.
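Because the plot uses a log scale, the imbalance is easy to underestimate by eye. A small sketch of how the language shares could be computed directly is shown here. The `toy_content` frame is a hypothetical stand-in; on the real data the same call would go on `content_data["language"]`.

```python
import pandas as pd

# Hypothetical toy stand-in for the language column of content_data.
toy_content = pd.DataFrame({"language": ["en", "en", "en", "es", "pt", "en"]})

# normalize=True returns fractions instead of raw counts, sorted descending.
language_share = toy_content["language"].value_counts(normalize=True)
print(language_share.head(10))
```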

In [9]:
n_words_in_title = content_data["title"].apply(lambda x: -1 if pd.isnull(x) else len(str(x).split()))  # To separate nans from empty values
fig = px.histogram(n_words_in_title, title="Number of words in titles")
fig.show()

In this plot we see that most titles are quite short, so they should be easy to feed into transformer-like models.
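To put a number on "short", one option is to look at quantiles of the word counts. This is only a rough sketch on made-up titles (`toy_titles` is hypothetical), and keep in mind that a word count is not a token count: a subword tokenizer will usually produce more tokens than words.

```python
import pandas as pd

# Hypothetical toy titles; with the real data this would be content_data["title"].
toy_titles = pd.Series([
    "Adding fractions",
    "SA of a Cone",
    None,
    "Intro",
    "Graphs of exponential functions in Algebra 2",
])

# Drop missing titles, split on whitespace, and count the pieces.
word_counts = toy_titles.dropna().str.split().str.len()

# High quantiles hint at how long an input a model would have to handle.
print(word_counts.quantile([0.5, 0.95, 0.99]))
```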

In [10]:
n_words_in_description = content_data["description"].apply(lambda x: -100 if pd.isnull(x) else len(str(x).split()))  # To separate nans from empty values
fig = px.histogram(n_words_in_description, log_y=True, title="Number of words in description (log scale)")
fig.show()

The descriptions also seem quite short. A large share of them are empty, as the organizers mentioned.

In [11]:
n_words_in_text = content_data["text"].apply(lambda x: -100 if pd.isnull(x) else len(str(x).split()))  # To separate nans from empty values
fig = px.histogram(n_words_in_text, log_y=True, title="Number of words in text (log scale)")
fig.show()

<a id="2.2"></a>
## <p style="padding:10px;background-color:mediumspringgreen;margin:0;color:black;font-family:calibri;font-size:120%;text-align:center;border-radius: 25px 25px;overflow:hidden;font-weight:500">2.2 Explore topics</p>

In [12]:
topics_data

Unnamed: 0,id,title,description,channel,category,level,language,parent,has_content
0,t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True
1,t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False
2,t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True
3,t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True
4,t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True
...,...,...,...,...,...,...,...,...,...
76967,t_fffb0bf2801d,4.3 Graph of functions,,e77b55,aligned,4,en,t_676e6a1a4dc7,False
76968,t_fffbe1d5d43c,Inscribed shapes problem solving,Use properties of inscribed angles to prove pr...,0c929f,source,4,sw,t_50145b9bab3f,True
76969,t_fffe14f1be1e,Lección 7,,6e90a7,aligned,6,es,t_d448c707984d,True
76970,t_fffe811a6da9,تحديد العلاقة بين الإحداثيّات القطبية والإحداث...,5b9e5ca86571f90499ea987f,9fd860,source,2,ar,t_5b4f3ba4eb7d,True


topics.csv - Contains a row for each topic in the dataset. These topics are organized into "channels", with each channel containing a single "topic tree" (which can be traversed through the "parent" reference). Note that the hidden dataset used for scoring contains additional topics not in the public version. You should only submit predictions for those topics listed in sample_submission.csv.

The table has the following columns:
  - id - A unique identifier for this topic.
  - title - Title text for this topic.
  - description - Description text (may be empty)
  - channel - The channel (that is, topic tree) this topic is part of.
  - category - Describes the origin of the topic.
      - source - Structure was given by original content creator (e.g. the topic tree as imported from Khan Academy). There are no topics in the test set with this category.
      - aligned - Structure is from a national curriculum or other target taxonomy, with content aligned from multiple sources.
      - supplemental - This is a channel that has to some extent been aligned, but without the same level of granularity or fidelity as an aligned channel.
  - language - Language code for the topic. May not always match apparent language of its title or description, but will always match the language of any associated content items.
  - parent - The id of the topic that contains this topic, if any. This field is empty if the topic is the root node for its channel.
  - level - The depth of this topic within its topic tree. Level 0 means it is a root node (and hence its title is the title of the channel).
  - has_content - Whether there are content items correlated with this topic. Most content is correlated with leaf topics, but some non-leaf topics also have content correlations.
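Since each channel is a tree that can be traversed through "parent", one thing worth trying is to walk up from a topic to its root and collect the titles along the way. Below is a minimal sketch of that idea. The `toy_topics` frame and the `topic_breadcrumb` helper are my own illustrative names, not something provided by the competition.

```python
import pandas as pd

# Hypothetical toy stand-in for topics_data with a single three-level tree.
toy_topics = pd.DataFrame({
    "id": ["t_root", "t_unit", "t_leaf"],
    "title": ["Math channel", "Unit 1", "Fractions"],
    "parent": [None, "t_root", "t_unit"],
})

# Plain dict lookups are much faster than filtering the frame on every step.
parent_of = dict(zip(toy_topics["id"], toy_topics["parent"]))
title_of = dict(zip(toy_topics["id"], toy_topics["title"]))


def topic_breadcrumb(topic_id, sep=" > "):
    """Join titles from the root down to topic_id by following parent links."""
    titles = []
    current = topic_id
    # A missing parent (None or NaN) is not a string, which ends the walk.
    while isinstance(current, str):
        title = title_of.get(current)
        # Titles can be missing too; fall back to an empty string.
        titles.append(title if isinstance(title, str) else "")
        current = parent_of.get(current)
    return sep.join(reversed(titles))


print(topic_breadcrumb("t_leaf"))  # Math channel > Unit 1 > Fractions
```

A breadcrumb like this could give a short topic title more context, which may help given how sparse the topic texts are.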

I wanted to add "is_title_nan" and "is_description_nan" columns to see how missing text relates to the other features. For now, the word counts below mark missing values with -1 instead.

In [13]:
n_words_in_title = topics_data["title"].apply(lambda x: -1 if pd.isnull(x) else len(str(x).split()))
fig = px.histogram(n_words_in_title, title="Number of words in title")
fig.show()

Most topic titles are also short, as in the content table.

In [14]:
n_words_in_description = topics_data["description"].apply(lambda x: -1 if pd.isnull(x) else len(str(x).split()))
fig = px.histogram(n_words_in_description, log_y=True, title="Number of words in description")
fig.show()

Most descriptions are empty or contain only a few words. That's bad, because we have very little information about each topic, so it will be hard to link topics to content.
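The missing-value flags themselves are easy to build. Here is a minimal sketch of "is_title_nan" and "is_description_nan", grouped by category as one example of a feature to compare against. The `toy_topics` frame is made up for illustration; on the real data the same lines would run on `topics_data`.

```python
import pandas as pd

# Hypothetical toy stand-in for topics_data.
toy_topics = pd.DataFrame({
    "title": ["Transcripts", "Lección 7", None],
    "description": [None, None, "Learn about functions."],
    "category": ["source", "aligned", "source"],
})

toy_topics["is_title_nan"] = toy_topics["title"].isna()
toy_topics["is_description_nan"] = toy_topics["description"].isna()

# Fraction of missing titles and descriptions within each category.
missing_by_category = (
    toy_topics
    .groupby("category")[["is_title_nan", "is_description_nan"]]
    .mean()
)
print(missing_by_category)
```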

In [15]:
fig = px.histogram(topics_data, x="category", title="Number of topics of each category").update_xaxes(categoryorder="total descending")
fig.show()

Most of the topics are from the "source" category, but there will be no topics from this category in the test set.
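One possible consequence, which is my own assumption and not something the organizers prescribe, is that a local validation set should look like the test set, so it should probably be drawn only from non-"source" topics that have content. A minimal sketch on a made-up `toy_topics` frame:

```python
import pandas as pd

# Hypothetical toy stand-in for topics_data.
toy_topics = pd.DataFrame({
    "id": ["t_1", "t_2", "t_3", "t_4"],
    "category": ["source", "aligned", "supplemental", "aligned"],
    "has_content": [True, True, True, False],
})

# Keep topics that resemble the test set: not "source", and with correlations to score.
holdout_candidates = toy_topics[
    (toy_topics["category"] != "source") & toy_topics["has_content"]
]
print(holdout_candidates)
```

The "source" topics can still be used for training; this only restricts where validation topics are sampled from.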

In [16]:
fig = px.histogram(topics_data, x="language", log_y=True, title="Number of topics of each language (log scale)").update_xaxes(categoryorder="total descending")
fig.show()

Most topics are also in English.

In [17]:
fig = px.histogram(topics_data, x="level", log_y=True, title="Number of topics of each level").update_xaxes(categoryorder="total descending")
fig.show()

As we see, most topics are not very deep in their tree, with a depth of around 4.

In [18]:
fig = px.histogram(
    topics_data,
    x="has_content",
    title="Number of topics with and without content"
).update_xaxes(categoryorder="total descending")

fig.show()

<a id="2.3"></a>
## <p style="padding:10px;background-color:mediumspringgreen;margin:0;color:black;font-family:calibri;font-size:120%;text-align:center;border-radius: 25px 25px;overflow:hidden;font-weight:500">2.3 Explore correlations</p>

Let's take a look at the correlations table

In [19]:
correlations_data

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
1,t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
2,t_00069b63a70a,c_11a1dc0bfb99
3,t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
4,t_0008768bdee6,c_34e1424229b4 c_7d1a964d66d5 c_aab93ee667f4
...,...,...
61512,t_fff830472691,c_61fb63326e5d c_8f224e321c87
61513,t_fff9e5407d13,c_026db653a269 c_0fb048a6412c c_20de77522603 c...
61514,t_fffbe1d5d43c,c_46f852a49c08 c_6659207b25d5
61515,t_fffe14f1be1e,c_cece166bad6a


Let's count how many content items are related to each topic

In [20]:
correlations_data["n_of_related_content"] = correlations_data["content_ids"].apply(lambda x: len(str(x).split()))

In [21]:
fig = px.histogram(
    correlations_data, x="n_of_related_content", log_y=True,
    title="Number of content items correlated with each topic"
).update_xaxes(categoryorder="total descending")

fig.show()

Most topics have only a few content items related to them.

A submission to this competition must have the same format as the correlations data, like this:
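For anything beyond counting, the space-separated "content_ids" strings are awkward to work with. A minimal sketch of turning them into one row per (topic, content) pair is below. The `toy_correlations` frame is a hypothetical stand-in for `correlations_data`.

```python
import pandas as pd

# Hypothetical toy stand-in for correlations_data.
toy_correlations = pd.DataFrame({
    "topic_id": ["t_a", "t_b"],
    "content_ids": ["c_1 c_2 c_3", "c_2"],
})

# Split each string into a list, then explode the lists into separate rows.
pairs = (
    toy_correlations
    .assign(content_id=toy_correlations["content_ids"].str.split())
    .explode("content_id")[["topic_id", "content_id"]]
    .reset_index(drop=True)
)
print(pairs)
```

In this long format the table can be merged with the topics and content tables on their ids.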

In [22]:
sample_submission

Unnamed: 0,topic_id,content_ids
0,t_00004da3a1b2,c_1108dd0c7a5d c_376c5a8eb028 c_5bc0e1e2cba0 c...
1,t_00068291e9a4,c_639ea2ef9c95 c_89ce9367be10 c_ac1672cdcd2c c...
2,t_00069b63a70a,c_11a1dc0bfb99
3,t_0006d41a73a8,c_0c6473c3480d c_1c57a1316568 c_5e375cf14c47 c...
4,t_4054df11a74e,c_3695c5dc1df6 c_f2d184a98231
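Going the other way, from predictions to this format, is a simple join. Here is a minimal sketch, assuming the predictions are kept in a dict from topic id to a list of content ids. The `predictions` dict and its ids are made up for illustration.

```python
import pandas as pd

# Hypothetical predictions: topic id -> list of predicted content ids.
predictions = {
    "t_a": ["c_1", "c_2"],
    "t_b": ["c_3"],
}

# One row per topic, with the content ids joined by single spaces.
submission = pd.DataFrame({
    "topic_id": list(predictions),
    "content_ids": [" ".join(ids) for ids in predictions.values()],
})

# index=False keeps the pandas row index out of the file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

To write a file instead of a string, pass a path such as `"submission.csv"` as the first argument of `to_csv`.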


That's all for now. There will be more updates to this notebook in the future. If you like my work, please upvote and subscribe; it's very important to me. Thanks for your attention☺️