# Recommender System

## Introduction
The project aims to improve the quality of education through AI technologies. Its primary objective is to demonstrate how AI can facilitate personalized education within the UCM college program. It does so by helping UCM students build their own individual learning pathway through the program and by providing teaching materials adapted to their individual needs. The longer-term goal is to develop, in collaboration with educational institutions, a system that meets societal needs and expectations and provides a responsive educational chain in a fast-paced world.

This notebook provides an overview of the recommender system we have built for UCM. It includes key aspects to help you understand the system's functionality and workflow. We'll begin with how to call the system and then delve into the details of each component.

Note: We are not software engineers, so the structure might not be ideal. We are open to suggestions and improvements.

Before starting, it's necessary to install the system's requirements. These are listed in the `requirements.txt` file. To install them, run the following command in your terminal:

```
pip install -r requirements.txt
```


Additionally, the first time you run the system, some transformer models will be downloaded, which takes some time. The system also uses MongoDB, so MongoDB must be installed on your machine. You can download it from [here](https://www.mongodb.com/try/download/community).

## Table of Contents:
1. [Explainability](#Explainability)
2. [Initialization and Recommender System Structure](#Initialization-and-Recommender-System-Structure)
3. [Keyword Similarity](#Keyword-Similarity)
4. [Bloom's Taxonomy](#Bloom's-Taxonomy)
5. [Explanation with Generative LLM](#Explanation-with-Generative-LLM)
6. [Planner (Integer Linear Programming)](#Planner-(Integer-Linear-Programming))
7. [Warning System](#Warning-System)
8. [Student Model](#Student-Model)
9. [Reranker](#Reranker)
10. [Knowledge Graph](#Knowledge-Graph)

The project is large, so the full code lives in a [GitHub repository](https://github.com/gabeha/recommenderSystemAPI).
 

## Explainability:

This part is the most crucial, since our model will be used in a real-world environment with human interaction. Explainability is therefore essential for the models to be accepted by users and to comply with national regulations.

**XAI requires:**

- **Transparency** is given “if the processes that extract model parameters from training data and generate labels from testing data can be described and motivated by the approach designer”.

- **Interpretability** describes the possibility to comprehend the ML model and to present the underlying basis for decision-making in a way that is understandable to humans.

- **Explainability** is a concept that is recognized as important, but a joint definition is not yet available. It is suggested that explainability in ML can be considered as “the collection of features of the interpretable domain, that have contributed for a given example to produce a decision (e.g., classification or regression)”.

Source: DARPA
<p align="center">
  <img src="images/darpa.png" alt="DARPA" title="DARPA"/>
</p>

Explanations are part of a social interaction:

- Valid explanations should be sound, cogent, convincing and create trust.
- The receiving party should be able to understand them.
- Knowledge should be transferred step by step.

Moreover, explanations:

- Must show the difference between possible outcomes of a decision.
- Should be relevant.
- Can be incomplete. Emphasizing a few examples may be sufficient. No need to be overly complete.
- Should be believable.

**XAI in NLP** is becoming more important as large language models are used for a variety of mission-critical and high-impact applications in our society.

Several overview papers have been written recently, investigating existing approaches for XAI in NLP. Here are two important ones:

1) [Analysis Methods in Neural Language Processing: A Survey, by Yonatan Belinkov and James Glass](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00254/43503/Analysis-Methods-in-Neural-Language-Processing-A)

2) [A Survey of the State of Explainable AI for Natural Language Processing, by Marina Danilevsky, Kun Qian, Ranit Aharonov, Yannis Katsis, Ban Kawas, Prithviraj Sen](https://arxiv.org/abs/2010.00711)

Relevant questions to ask are: what linguistic knowledge neural networks capture, why they make certain predictions, whether they are robust, how they represent language, and how they fail.

But how do you explain complex mathematics and machine learning algorithms to people who are not familiar with these concepts, let alone help them truly understand them?

These problems are discussed in the following sources:

1) [The Mythos of Model Interpretability by Zachary C. Lipton](https://arxiv.org/abs/1606.03490)

2) [DARPA program on Explainable Artificial Intelligence (XAI) by Dr. Matt Turek](https://www.darpa.mil/program/explainable-artificial-intelligence)

Both are important for understanding the psychological aspects of "explainability" and "understanding", as well as the different approaches taken in similar situations.

## Initialization and Recommender System Structure:

General Structure of the Recommender System:

<p align="center">
  <img src="images/rs.png" alt="Recommender System" title="Recommender System"/>
</p>

The file `instances/recommender_instances.py` contains the initialization of the recommender system. Here, we import all the components that are required for the system. The components are initialized once and then used throughout the system. The system's structure is kept as modular as possible. This modularity aids in the easy integration of new components into the system and in its maintenance.


In [1]:
import os
# First we change the directory to the root directory of the project. Where we usually run the app.py file.
os.chdir(os.getcwd().replace("\\notebooks", ""))

Recommender System Modules:
```
# Keyword Based
keyword_based = KeywordBased(model_name="all-MiniLM-L12-v2",
                             seed_help=True,
                             domain_adapt=True,
                             zero_adapt=True,
                             domain_type='title',
                             seed_type='title',
                             zero_type='title',
                             score_alg='sum',
                             distance='cos',
                             backend='keyBert',
                             scaler="MinMax",
                             sent_splitter=False,
                             precomputed_course=True)

# Content Based
content_based = ContentBased(
                             model_name='BAAI/bge-large-en-v1.5',
                             distance='cos',
                             score_alg='sum',
                             scaler='None')

# Explanation model
explanation = LLM(
    token="<YOUR_API_TOKEN>"  # placeholder: never commit a real key; load it from an environment variable or a secrets store
)

# Bloom Based
bloom_based = BloomBased(score_alg='sum',
                         scaler='None')

# Warning model
warning_model = WarningModel()

# Planner model
planner = None
```

In [3]:
# Here is example code for the initialization of the system.
from instances.recommender_instances import rs

# Now with the rs object we can call the components that we need.
input = { # Example of the input that we need for the system. Also, this is the input that we get from the front-end.
    "keywords": {
        "physics": 1.0,
        "maths": 1.0,
        "statistics": 1.0,
        "ai": 1.0,
        "computer science": 1.0,
        "chem": 1.0,
    },
    "blooms": {'create': 0.0,
               'understand': 0.0,
               'apply': 0.0,
               'analyze': 0.0,
               'evaluate': 0.0,
               'remember': 0.0}
}
student_node = rs.get_recommendation(input) # This will return the recommendations for the input as a Student Node object.
student_node

CourseBasedRecSys config: 
model_name: all-MiniLM-L12-v2
seed_help: True
domain_adapt: True
zero_adapt: True
seed_type: title
domain_type: title
zero_type: title
adaptive_thr: 0.0
minimal_similarity_zeroshot: 0.8
score_alg: sum
distance: cos
backend: keyBert
scaler: MinMax
sent_splitter: False
precomputed_course: True





RecSys settings: 
Validate_input: True 
Keyword_based: <rec_sys_uni.rec_systems.keyword_based_sys.keyword_based.KeywordBased object at 0x000001C8591AAE00> 
Bloom_based: <rec_sys_uni.rec_systems.bloom_based_sys.bloom_based.BloomBased object at 0x000001C823036080> 
Explanation: <rec_sys_uni.rec_systems.llm_explanation.LLM.LLM object at 0x000001C822B23550> 
Planner: None 
Top_n: 7 


<rec_sys_uni._rec_sys_helpers.StudentNode at 0x1c85943a3b0>

The `StudentNode` object contains all information about the student. It is a class we created to hold this information, and it has the following attributes:

1. **results**: contains results from the recommender system components.
2. **student_input**: contains keywords and Bloom's taxonomy of the student input.
3. **course_data**: currently we use the system course data, but later on the front-end should send the list of courses for which recommendations are computed.
4. **student_data**: contains all courses which the student took.
5. **id**: unique Student ID, which was assigned by MongoDB.

The `StudentNode` object is saved in the MongoDB database, with the `id` attribute as its unique identifier. The same `id` is used to retrieve the object from the database, which lets other components of the system be called independently without sending unnecessary information to the front-end.

Let's take a look at each component of the student node in detail.

### 1. Results:
`results`: A dictionary of recommended courses. For example:

- `recommended_courses`: A dictionary where each key is a `course_id` (String) and the value is another dictionary with the following keys:
  - `score`: A float representing the total score of each model.
  - `period`: A list of integers or a list of lists of integers. For example, `[1, 4]` or `[[1, 2, 3], [4, 5, 6]]` or `[[1,2]]`.
  - `warning`: A boolean value.
  - `warning_recommendation`: A list of warning recommendations.
  - `keywords`: A dictionary of scores for each keyword (only present after applying the CourseBased model).
  - `blooms`: A dictionary of scores for each bloom (only present after applying the BloomBased model).

- `sorted_recommended_courses`: A list of the courses that are included in the structured recommendation, where each course is a dictionary with the following keys:
  - `course_code`: A string.
  - `course_name`: A string.
  - `warning`: A boolean value.
  - `warning_recommendation`: A list of warning recommendations.
  - `keywords`: A dictionary where each key is a keyword (String) and the value is a weight (float).
  - `blooms`: A dictionary where each key is a bloom (String) and the value is a weight (float).
  - `score`: A float.

- `structured_recommendation`: A dictionary with the following structure:
  - `semester_1`: A dictionary with the following keys:
    - `period_1`: A list of courses, each a dictionary with at least `course_code` and `course_name`; the list has at most `top_n` entries.
    - `period_2`: A list of courses, each a dictionary with at least `course_code` and `course_name`; the list has at most `top_n` entries.
  - `semester_2`: A dictionary with the following keys:
    - `period_4`: A list of courses, each a dictionary with at least `course_code` and `course_name`; the list has at most `top_n` entries.
    - `period_5`: A list of courses, each a dictionary with at least `course_code` and `course_name`; the list has at most `top_n` entries.

Note: If you apply the `sort_by_periods` function, you will have `structured_recommendation` and `sorted_recommended_courses` keys in the results.
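
To make the `recommended_courses` structure more concrete, here is a minimal sketch that ranks courses by score and lists the ones that carry a warning. The `results` dictionary is hand-written with invented values (real results come from `rs.get_recommendation`), and `top_courses` and `courses_with_warnings` are hypothetical helper names, not part of the system.

```python
# Hand-written stand-in for student_node.results; all values are invented.
results = {
    "recommended_courses": {
        "COR1002": {"score": 0.82, "period": [1, 4], "warning": False,
                    "warning_recommendation": [], "keywords": {"physics": 0.7}},
        "SCI2010": {"score": 0.91, "period": [[1, 2]], "warning": True,
                    "warning_recommendation": ["<example warning text>"],
                    "keywords": {"physics": 0.9}},
        "HUM1003": {"score": 0.15, "period": [2], "warning": False,
                    "warning_recommendation": [], "keywords": {"physics": 0.1}},
    }
}


def top_courses(results, n):
    """Return the n highest-scoring (course_id, score) pairs."""
    courses = results["recommended_courses"]
    ranked = sorted(courses.items(), key=lambda item: item[1]["score"], reverse=True)
    return [(course_id, info["score"]) for course_id, info in ranked[:n]]


def courses_with_warnings(results):
    """Return the sorted ids of all courses whose `warning` flag is set."""
    courses = results["recommended_courses"]
    return sorted(cid for cid, info in courses.items() if info["warning"])


print(top_courses(results, 2))
print(courses_with_warnings(results))
```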


In [4]:
print(student_node.results.keys())
print("Number of courses: " + str(len(student_node.results["recommended_courses"].keys())))
print(student_node.results["sorted_recommended_courses"][0].keys())
print(student_node.results["structured_recommendation"].keys())
print(student_node.results["structured_recommendation"]["semester_1"].keys())
print(student_node.results["structured_recommendation"]["semester_1"]["period_1"][0].keys())

dict_keys(['recommended_courses', 'sorted_recommended_courses', 'structured_recommendation'])
Number of courses: 149
dict_keys(['course_code', 'course_name'])
dict_keys(['semester_1', 'semester_2'])
dict_keys(['period_1', 'period_2'])
dict_keys(['course_code', 'course_name'])


### 2. Student Input:
`student_input`: A dictionary that contains the student's input. For example:

- `keywords`: A dictionary where each key is a keyword (String) and the value is a weight (float).
- `blooms`: A dictionary where each key is a bloom (String) and the value is a weight (float).

Example of `student_input`:

```python
student_input = {
    "keywords": {
        "python": 0.5,
        "data science": 0.2
    },
    "blooms": {
        "create": 0.5,
        "understand": 0.75,
        "apply": 0.25,
        "analyze": 0.5,
        "evaluate": 0.0,
        "remember": 1.0
    }
}
```

### 3. Course Data:
`course_data`: A dictionary where each key is a `course_id` (String) and the value is another dictionary with the following keys:

- `course_name`: A string representing the name of the course. For example, "Philosophy of Science".
- `period`: A list of integers or a list of lists of integers. For example, `[1, 4]` or `[[1, 2, 3], [4, 5, 6]]` or `[[1,2]]`.
- `level`: An integer representing the level of the course. For example, 1, 2, or 3.
- `prerequisites`: A list of course IDs (Strings) that are prerequisites for the course. For example, `["COR1001", "COR1002"]`.
- `description`: A string providing a description of the course. For example, "This course is about ...".
- `ilos`: A list of strings representing the Intended Learning Outcomes (ILOs) of the course. For example, `["Be able to apply the scientific method to a given problem", "Be able to explain the difference between science and pseudoscience"]`.

Example of `course_data`:

```python
course_data = {
    "COR1001": {
        "course_name": "Philosophy of Science",
        "period": [1, 4],
        "level": 1,
        "prerequisites": ["COR1002"],
        "description": "This course is about ...",
        "ilos": [
            "Be able to apply the scientific method to a given problem",
            "Be able to explain the difference between science and pseudoscience"
        ]
    },
    # ... further courses follow the same structure
}
```

Note: We currently use the `course_data` from the system. Later on, the front-end should send the list of courses for which recommendations are computed.
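
Because the `period` field can be either a flat list or a list of lists, code that consumes `course_data` has to handle both shapes. The sketch below shows one way to do that; `periods_covered` and `courses_in_period` are hypothetical helpers, and treating a nested list simply as "all period numbers mentioned" is an assumption about the field's meaning, not something the system guarantees.

```python
def periods_covered(period):
    """Return the sorted, de-duplicated period numbers mentioned in a `period` field.

    Accepts both documented shapes: [1, 4] and [[1, 2, 3], [4, 5, 6]].
    """
    covered = set()
    for entry in period:
        if isinstance(entry, list):
            covered.update(entry)   # nested shape: a group of periods
        else:
            covered.add(entry)      # flat shape: a single period number
    return sorted(covered)


def courses_in_period(course_data, period_number):
    """Return the ids of all courses whose `period` field mentions period_number."""
    return [course_id for course_id, course in course_data.items()
            if period_number in periods_covered(course["period"])]


# Toy data that only contains the field used here; the course ids are invented.
toy_course_data = {
    "TOY1001": {"period": [1, 4]},
    "TOY1002": {"period": [[1, 2, 3], [4, 5, 6]]},
    "TOY1003": {"period": [[1, 2]]},
}

print(periods_covered(toy_course_data["TOY1002"]["period"]))
print(courses_in_period(toy_course_data, 4))
```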

### 4. Student Data:
`student_data`: A dictionary that contains the student's data. For example:

- `courses_taken`: A dictionary where each key is a `course_id` (String) and the value is another dictionary with the following keys:
  - `passed`: A boolean value indicating whether the student passed the course.
  - `grade`: A float representing the grade the student received in the course.
  - `period`: An integer or a list of two integers representing the period(s) when the course was taken.
  - `year`: An integer representing the year when the course was taken.

Example of `student_data`:

```python
student_data = {
    "courses_taken": {
        "COR1001": {
            "passed": True,
            "grade": 3.5,
            "period": [1, 2],
            "year": 2022
        },
        # ... further courses follow the same structure
    }
}
```

Note: We currently use the `student_data` from the system. Later on, the front-end should send the student's data along with the student ID provided by Maastricht University.
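
As an illustration of how `student_data` and the `prerequisites` field of `course_data` can be combined, the following sketch lists the prerequisites a student has not yet passed. It is not the system's warning logic: `passed_courses` and `missing_prerequisites` are hypothetical helpers, the data is invented, and counting a prerequisite as met only when `passed` is `True` is an assumption.

```python
def passed_courses(student_data):
    """Return the set of course ids the student has passed."""
    return {course_id
            for course_id, info in student_data["courses_taken"].items()
            if info["passed"]}


def missing_prerequisites(course, student_data):
    """Return the prerequisites of `course` that the student has not passed."""
    passed = passed_courses(student_data)
    return [prereq for prereq in course["prerequisites"] if prereq not in passed]


# Invented example data following the templates above.
student_data = {
    "courses_taken": {
        "COR1001": {"passed": True, "grade": 3.5, "period": [1, 2], "year": 2022},
        "COR1002": {"passed": False, "grade": 1.0, "period": 4, "year": 2022},
    }
}
course = {"course_name": "Toy Course", "prerequisites": ["COR1001", "COR1002"]}

print(missing_prerequisites(course, student_data))
```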

### Usage of MongoDB:

We use MongoDB to store the `StudentNode` object and the `id` attribute to retrieve it from the database. The example below retrieves the `StudentNode` object in JSON format.

Local server: `mongodb://localhost:27017/`

The RecSys object (`rs`) has a `db` attribute that points to the database in use, which is named `RecSys`. `StudentNode` objects are stored in the `student_results` collection, with the `id` attribute as the unique identifier of the student's session.

Note: Every time student input is sent to the system, a new `StudentNode` object is created and stored in the database.

Example: using the `id` attribute to retrieve the `StudentNode` object as JSON from the database.

In [5]:
# This is an example of id that we get from MongoDB.
print(student_node.id)

658851466c10295417bafda1


In [6]:
from bson.objectid import ObjectId
# First we need to get the database from the rs object.
db = rs.db

# Then we get the collection from the database.
collection = db["student_results"]

# We can retrieve the student node object in JSON format from the collection by its id.
# find_one returns the matching document as a dictionary (or None if the id is unknown).
student_info = collection.find_one({"_id": ObjectId(str(student_node.id))})
student_info.keys()

dict_keys(['_id', 'student_id', 'student_input', 'course_data', 'student_data', 'results', 'time'])
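
One practical detail when passing such a document on as JSON: the `_id` field is a BSON `ObjectId`, which the standard `json` module cannot serialize directly. The sketch below shows one possible workaround, converting unknown types to strings with `default=str`. To stay runnable without MongoDB it uses a stand-in class (`FakeObjectId`, hypothetical) and an invented document; with a real document, `bson.json_util.dumps` from the pymongo distribution is an alternative.

```python
import json


class FakeObjectId:
    """Stand-in for bson.ObjectId so that the sketch runs without MongoDB."""

    def __init__(self, hex_string):
        self._hex = hex_string

    def __str__(self):
        return self._hex


# Invented document with the same top-level keys as a stored StudentNode.
student_info = {
    "_id": FakeObjectId("658851466c10295417bafda1"),
    "student_id": "123",
    "student_input": {"keywords": {"physics": 1.0}},
    "time": "23/12/2023 14:05:10",
}

# default=str is called for every value json cannot serialize itself,
# so the id ends up as a plain string in the output.
as_json = json.dumps(student_info, default=str, indent=2)
restored = json.loads(as_json)
print(restored["_id"])
```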

Let's briefly explain the parameters and functions of the Recommender System:
```
def __init__(self,
                 keyword_based: KeywordBased = None,
                 bloom_based: BloomBased = None,
                 explanation: LLM = None,
                 warning_model: WarningModel = None,
                 planner: UCMPlanner = None,
                 validate_input: bool = True,
                 top_n: int = 7,
                 ):
        self.validate_input = validate_input
        self.keyword_based = keyword_based
        self.bloom_based = bloom_based
        self.explanation = explanation
        self.warning_model = warning_model
        self.planner = planner
        self.top_n = top_n
        self.db = pymongo.MongoClient("mongodb://localhost:27017/")["RecSys"]
```
- `keyword_based`: KeywordBased model.
- `bloom_based`: BloomBased model.
- `explanation`: LLM model.
- `warning_model`: WarningModel model.
- `planner`: UCMPlanner model.
- `validate_input`: Boolean value. If True, the input from the front-end is validated.
- `top_n`: Integer value. The maximum number of courses recommended per period.


```
def validate_system_input(self,
                              student_input,
                              course_data,
                              student_data,
                              system_course_data,
                              system_student_data):
        """
        function: validate_system_input
        description: validate the format of the input data
        """

        check_student_input(student_input)

        if system_student_data:
            student_data = get_student_data()

        if system_course_data:
            exception_courses = []
            if student_data:
                exception_courses = list(student_data['courses_taken'].keys())
            course_data = get_course_data(except_courses=exception_courses)

        check_course_data(course_data)

        if student_data is not None:
            check_student_data(student_data)

        return student_input, course_data, student_data
```
- function `validate_system_input`: Validates the format of the input data. If the data is not in the correct format, an error is raised; otherwise the input data is returned. When `system_course_data` or `system_student_data` is set, the system's own `course_data` or `student_data` is used instead of data from the front-end, and courses the student has already taken are excluded from the system course data. At a later stage, both should be provided by the front-end. All rules for the input data are defined in the `rec_sys_uni/error_checker/errors.py` file.
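
To give a feel for what such a format check can look like, here is a minimal sketch for `student_input`. It is not the code from `rec_sys_uni/error_checker/errors.py`: the function name `check_student_input_sketch`, the exact rules, and the use of `ValueError` are assumptions made for illustration.

```python
BLOOM_LEVELS = {"create", "understand", "apply", "analyze", "evaluate", "remember"}


def check_student_input_sketch(student_input):
    """Raise ValueError if student_input does not follow the documented template."""
    for key in ("keywords", "blooms"):
        if key not in student_input:
            raise ValueError(f"student_input does not have {key}")
        if not isinstance(student_input[key], dict):
            raise ValueError(f"{key} must be a dictionary")
        for name, weight in student_input[key].items():
            # bool is a subclass of int, so it is rejected explicitly.
            is_number = isinstance(weight, (int, float)) and not isinstance(weight, bool)
            if not isinstance(name, str) or not is_number:
                raise ValueError(f"{key} entry {name!r} must map a string to a number")
    unknown = set(student_input["blooms"]) - BLOOM_LEVELS
    if unknown:
        raise ValueError(f"unknown bloom levels: {sorted(unknown)}")


valid_input = {
    "keywords": {"python": 0.5, "data science": 0.2},
    "blooms": {"create": 0.5, "remember": 1.0},
}
check_student_input_sketch(valid_input)  # passes silently

try:
    check_student_input_sketch({"keyword": {"python": 0.5}, "blooms": {}})
except ValueError as error:
    print(error)
```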

In [7]:
input = { # Let's make a mistake in the input data.
    "keyword": { # Instead of "keywords", we wrote "keyword".
        "physics": 1.0,
        "maths": 1.0,
        "statistics": 1.0,
        "ai": 1.0,
        "computer science": 1.0,
        "chem": 1.0,
    },
    "blooms": {'create': 0.0,
               'understand': 0.0,
               'apply': 0.0,
               'analyze': 0.0,
               'evaluate': 0.0,
               'remember': 0.0}
}
try:
    student_node = rs.get_recommendation(input)
except Exception as e:
    print(e) 

student_input does not have keywords


The `get_recommendation` function is the main function of the system. It calls all the components that compute the recommendations and returns the `StudentNode` object. It takes the following parameters:
```
def get_recommendation(self,
                           student_input,
                           course_data=None,
                           student_data=None,
                           system_course_data=True,
                           system_student_data=False,
                           ):
        """
        function: get_recommendation
        description: get Student Node object
        """

        if self.validate_input:
            student_input, course_data, student_data = self.validate_system_input(student_input,
                                                                                  course_data,
                                                                                  student_data,
                                                                                  system_course_data,
                                                                                  system_student_data)

        # Make results template
        results = make_results_template(course_data)

        # Create StudentNode object
        student_info = StudentNode(results, student_input, course_data, student_data)

        # Compute recommendation and store in student_info
        compute_recommendation(self, student_info)

        # Compute warnings and store in student_info
        if self.warning_model:
            compute_warnings(self, student_info)

        # Sort by periods
        sort_by_periods(self, student_info, self.top_n, include_keywords=True, include_score=True,
                        include_blooms=False)

        # Save student_info to database
        now = datetime.now()

        collection = self.db["student_results"]
        input_dict = {
            "student_id": "123",
            "student_input": student_info.student_input,
            "course_data": student_info.course_data,
            "student_data": student_info.student_data,
            "results": student_info.results,
            "time": now.strftime("%d/%m/%Y %H:%M:%S")
        }
        object_db = collection.insert_one(input_dict)
        student_info.set_id(object_db.inserted_id)

        # Return StudentNode object
        return student_info
```
- `student_input`: A dictionary that contains the student's input.
- `course_data`: A dictionary that contains the course data.
- `student_data`: A dictionary that contains the student's data.
- `system_course_data`: Boolean value. If True, the system course data will be used.
- `system_student_data`: Boolean value. If True, the system student data will be used.
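
Each stored document carries a `time` string written with the format `"%d/%m/%Y %H:%M:%S"`. Because this format is day-first, the strings do not sort chronologically as plain text, so they need to be parsed before sessions are compared. A minimal sketch, with invented documents and a hypothetical helper `latest_session`:

```python
from datetime import datetime

TIME_FORMAT = "%d/%m/%Y %H:%M:%S"  # same format string as used when saving


def latest_session(documents):
    """Return the document with the most recent `time` value."""
    return max(documents, key=lambda doc: datetime.strptime(doc["time"], TIME_FORMAT))


# Invented documents; only the fields needed here are included.
sessions = [
    {"student_id": "123", "time": "02/01/2024 09:00:00"},
    {"student_id": "123", "time": "23/12/2023 18:30:00"},
]

print(latest_session(sessions)["time"])
```

Comparing the raw strings would pick the December session, since `"23/..."` sorts after `"02/..."`; parsing avoids that.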


The `generate_explanation` function generates the explanation for a recommendation. It takes the following parameters:
- `student_id`: A string representing the student's ID from MongoDB.
- `course_code`: A string representing the course's code.

The `make_timeline` function generates the timeline for a recommendation. It takes the following parameters:
- `student_id`: A string representing the student's ID from MongoDB.

## Keyword Similarity:

Our Keyword Similarity model assesses how similar a student's specified keywords are to the content of the course descriptions. Various models can be used for this, in particular the pre-trained sentence-embedding models available through the HuggingFace library ([leaderboard](https://huggingface.co/spaces/mteb/leaderboard)).

Besides the general HuggingFace models, an [Intel-based model](https://huggingface.co/Intel) can be used. This is particularly advantageous on Intel server CPUs, since the model is optimized to compute embeddings efficiently on these processors and may therefore be faster in environments that already run on Intel hardware.

The choice between these models depends on the requirements of the task, the available computational resources, and the desired balance between accuracy and efficiency. HuggingFace models offer a wider range of capabilities and potentially higher accuracy, owing to larger models and extensive training on diverse datasets. An Intel-optimized model may provide faster processing and better integration with existing Intel server infrastructure, which makes it a practical choice when efficiency and hardware compatibility matter most.

Ultimately, the model used for computing the similarity between keywords and course descriptions should be chosen after a thorough evaluation of the project's needs and the computational environment.
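
Independent of the model chosen, the core idea can be sketched in a few lines: embed keywords and courses as vectors, then score a course by the weighted sum of cosine similarities. The sketch below uses hand-made 2-D vectors instead of real embeddings, and `cosine` and `course_score` are hypothetical helpers. It illustrates the general technique under these assumptions and is not the system's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def course_score(keyword_weights, keyword_vectors, course_vector):
    """Weighted sum of keyword-to-course cosine similarities."""
    return sum(weight * cosine(keyword_vectors[keyword], course_vector)
               for keyword, weight in keyword_weights.items())


# Toy "embeddings": real ones come from a sentence-transformer model.
keyword_vectors = {"physics": [1.0, 0.0], "history": [0.0, 1.0]}
course_vectors = {"SCI_TOY": [0.6, 0.8], "HUM_TOY": [0.0, 2.0]}
keyword_weights = {"physics": 1.0, "history": 0.5}

for course_id, vector in course_vectors.items():
    print(course_id, round(course_score(keyword_weights, keyword_vectors, vector), 3))
```

Cosine similarity ignores vector length, which is why `HUM_TOY` with vector `[0.0, 2.0]` matches "history" exactly as well as a unit-length vector would; a plain dot product would not.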

Let's take a look at the `KeywordBased` class initialization:
```
# Keyword Based
keyword_based = KeywordBased(model_name="all-MiniLM-L12-v2",
                             seed_help=True,
                             domain_adapt=True,
                             zero_adapt=True,
                             domain_type='title',
                             seed_type='title',
                             zero_type='title',
                             score_alg='sum',
                             distance='cos',
                             backend='keyBert',
                             scaler="MinMax",
                             sent_splitter=False,
                             precomputed_course=True)
```

- `model_name`: name of the sentence-transformer model, specifically from the HuggingFace library.
- `seed_help`: apply seed help, which mixes the course description embeddings with seed-word embeddings.
- `domain_adapt`: apply domain adaptation, which adds an additional attention layer to the model.
- `zero_adapt`: apply zero-shot adaptation, a more advanced form of seed filtering.
- `seed_type`: type of seed help, either 'title' or 'domains'. We use 'title' for now, since 'domains' did not perform well.
- `domain_type`: type of domain adaptation, either 'title' or 'domains'. We use 'title' for now, since 'domains' did not perform well.
- `zero_type`: type of zero-shot adaptation, either 'title' or 'domains'. We use 'title' for now, since 'domains' did not perform well.
- `adaptive_thr`: adaptive threshold for the zero-shot adaptation.
- `minimal_similarity_zeroshot`: minimal similarity between a candidate and a domain word for the zero-shot adaptation.
- `score_alg`: scoring algorithm, either 'sum' or 'rrf'. We currently use 'sum'; Reciprocal Rank Fusion (RRF) can also be used, but it requires a different implementation of keyword similarity.
- `distance`: distance metric, either 'cos' or 'dot'. Some models require 'dot' instead of 'cos'.
- `backend`: backend, either 'keyBert' or 'Intel'. 'Intel' computes the embeddings on an Intel server CPU, but it only provides advanced models such as BGE, which can be too large for our system.
- `scaler`: scaler applied to the keyword weights, for example "MinMax" for a min-max scaler.
- `sent_splitter`: split the course description into sentences. This is intended for computing the scores with Linear Programming, which still needs to be implemented.
- `precomputed_course`: whether to use precomputed course embeddings.
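
Two of these options are easy to illustrate in isolation. The sketch below shows generic min-max scaling of keyword weights and a generic Reciprocal Rank Fusion of several rankings. Both functions are hypothetical helpers written for this notebook, not the system's code; the constant `k=60` is a commonly used default, and mapping all-equal weights to 1.0 is an assumption.

```python
def min_max_scale(weights):
    """Rescale a dict of weights linearly to the range [0, 1]."""
    low, high = min(weights.values()), max(weights.values())
    if high == low:
        # Assumption: when all weights are equal, map each of them to 1.0.
        return {key: 1.0 for key in weights}
    return {key: (value - low) / (high - low) for key, value in weights.items()}


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings (best item first) into one list, best first.

    Each item receives 1 / (k + rank) from every ranking it appears in.
    """
    fused = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            fused[item] = fused.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


print(min_max_scale({"python": 0.2, "statistics": 0.5, "ai": 0.8}))

# One ranking per keyword; the course ids are placeholders.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]
print(reciprocal_rank_fusion(rankings))
```

Unlike 'sum', RRF only looks at rank positions, so it is insensitive to the scale of the underlying similarity scores.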

If you are planning to use domain adaptation, zero-shot adaptation, or both, it is necessary to precompute the embeddings and the attention layer in advance. You can use the following code to precompute the embeddings and attention layer:

```
def calculate_zero_shot_adaptation(course_data, model_name, attention,
                                   adaptive_thr: float = 0.15,
                                   minimal_similarity_zeroshot: float = 0.8):
```
- `course_data`: A dictionary or list containing the course code. Please follow the template that we discussed earlier.
- `model_name`: The name of the sentence transformer model, specifically from the HuggingFace library, for which you want to precompute the embedding for zero-shot adaptation.
- `attention`: The specific words to which you want to apply attention in order to bring the embedding closer to the course domain within the embedding space.

```
def calculate_few_shot_adaptation(course_data, model_name, attention,
                                  lr=1e-4, epochs=100,  # Training Parameters
                                  start_index=0,
                                  include_description=True,
                                  include_title=False,
                                  include_ilos=False, ):
```
- `course_data`: A dictionary containing course descriptions, Intended Learning Outcomes (ILOs), and the course code. Please follow the template that we discussed earlier.
- `model_name`: The name of the sentence transformer model, specifically from the HuggingFace library, for which you want to precompute the embedding for few-shot adaptation.
- `attention`: The specific words to which you want to apply attention in order to bring the embedding closer to the course domain within the embedding space.
- `lr`: Learning rate.
- `epochs`: Number of epochs.

```
def calculate_precomputed_courses(course_data, model_name,
                                  include_description=True,
                                  include_title=False,
                                  include_ilos=False, ):
```
- `course_data`: A dictionary containing course descriptions, Intended Learning Outcomes (ILOs), and the course code. Please follow the template that we discussed earlier.
- `model_name`: The name of the sentence transformer model, specifically from the HuggingFace library, for which you want to precompute the embedding for the course.
- `include_description`: Boolean value. If True, the course description will be included in the embedding.
- `include_title`: Boolean value. If True, the course title will be included in the embedding.
- `include_ilos`: Boolean value. If True, the course ILOs will be included in the embedding.
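
The three `include_*` flags decide which parts of a course end up in the text that is embedded. One way to picture this is the sketch below; `build_course_text` is a hypothetical helper, and the ordering of the parts and joining them with spaces are assumptions, not the behavior of `calculate_precomputed_courses`.

```python
def build_course_text(course,
                      include_description=True,
                      include_title=False,
                      include_ilos=False):
    """Concatenate the selected course fields into one string for embedding."""
    parts = []
    if include_title:
        parts.append(course["course_name"])
    if include_description:
        parts.append(course["description"])
    if include_ilos:
        parts.extend(course["ilos"])
    return " ".join(parts)


# Course entry following the course_data template shown earlier.
course = {
    "course_name": "Philosophy of Science",
    "description": "This course is about ...",
    "ilos": ["Be able to apply the scientific method to a given problem"],
}

print(build_course_text(course))
print(build_course_text(course, include_title=True, include_ilos=True))
```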


In [16]:
from rec_sys_uni.datasets.datasets import calculate_precomputed_courses, calculate_few_shot_adaptation, calculate_zero_shot_adaptation, get_course_data
%load_ext autoreload
%autoreload 2

In [17]:
course_data = get_course_data()
course_data.keys()

dict_keys(['COR1002', 'COR1003', 'COR1004', 'COR1006', 'HUM1003', 'HUM1007', 'HUM1010', 'HUM1011', 'HUM1012', 'HUM1013', 'HUM1016', 'HUM2003', 'HUM2005', 'HUM2007', 'HUM2008', 'HUM2013', 'HUM2016', 'HUM2018', 'HUM2021', 'HUM2022', 'HUM2030', 'HUM2031', 'HUM2046', 'HUM2047', 'HUM2051', 'HUM2054', 'HUM2056', 'HUM2057', 'HUM2058', 'HUM2059', 'HUM2060', 'HUM3014', 'HUM3019', 'HUM3029', 'HUM3034', 'HUM3036', 'HUM3040', 'HUM3042', 'HUM3043', 'HUM3044', 'HUM3045', 'HUM3049', 'HUM3050', 'HUM3051', 'HUM3052', 'HUM3053', 'SCI1004', 'SCI1005', 'SCI1009', 'SCI1010', 'SCI1016', 'SCI2002', 'SCI2009', 'SCI2010', 'SCI2011', 'SCI2017', 'SCI2018', 'SCI2019', 'SCI2022', 'SCI2031', 'SCI2033', 'SCI2034', 'SCI2035', 'SCI2036', 'SCI2037', 'SCI2039', 'SCI2040', 'SCI2041', 'SCI2042', 'SCI2043', 'SCI2044', 'SCI3003', 'SCI3005', 'SCI3006', 'SCI3007', 'SCI3046', 'SCI3049', 'SCI3050', 'SCI3051', 'SCI3052', 'SSC1005', 'SSC1007', 'SSC1025', 'SSC1027', 'SSC1029', 'SSC1030', 'SSC2002', 'SSC2004', 'SSC2006', 'SSC2007', ...])

In [4]:
course_data['COR1002'].keys()

dict_keys(['course_name', 'period', 'level', 'prerequisites', 'description', 'ilos'])

In [None]:
calculate_precomputed_courses(course_data, "all-MiniLM-L12-v2",
                              include_description=True,
                              include_title=True,
                              include_ilos=True,
                              use_cuda=True)

In [7]:
len(course_data)

149

In [None]:
attention = {}
for i in course_data.keys():
    attention[i] = course_data[i]['course_name']
calculate_few_shot_adaptation(course_data, "all-MiniLM-L12-v2", attention=attention, use_cuda=True)

In [None]:
calculate_zero_shot_adaptation(course_data, "all-MiniLM-L12-v2", attention=attention)

You can find additional information about attention-based keyword similarity [here](https://arxiv.org/pdf/2211.07499.pdf). 

If you follow all the steps outlined above, you will be able to use the new model to compute the similarity between the student's keywords and the course descriptions. Note that all embeddings and layers will be downloaded automatically; just remember to change the model name when initializing `KeywordBased`.

General structure of Keyword-based model:
<p align="center">
  <img src="images/kb.png" alt="Keyword-based model" title="Keyword-based model"/>
</p>

In [13]:
from rec_sys_uni.rec_systems.keyword_based_sys.keyword_based import KeywordBased
keyword_based = KeywordBased(model_name="all-MiniLM-L12-v2",
                             seed_help=True,
                             domain_adapt=True,
                             zero_adapt=True,
                             domain_type='title',
                             seed_type='title',
                             zero_type='title',
                             score_alg='sum',
                             distance='cos',
                             backend='keyBert',
                             scaler="MinMax",
                             sent_splitter=False,
                             precomputed_course=True)

keywords_output = keyword_based.recommend(student_node.course_data,
                                          student_node.student_input['keywords'])
keywords_output['COR1002']

{'physics': 0.317,
 'maths': 0.2752,
 'statistics': 0.2255,
 'ai': 0.1399,
 'computer science': 0.2675,
 'chem': 0.17}
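
The scores above come from the `distance='cos'` and `scaler="MinMax"` settings. As a rough illustration only, the sketch below computes cosine similarity between one keyword embedding and a few course embeddings and then min-max scales the result. The vectors and course names are made up, and how `KeywordBased` applies the scaler internally (per keyword, per course, or otherwise) is an assumption here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def min_max_scale(scores):
    """Rescale a dict of scores linearly so that min -> 0 and max -> 1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: (s - lo) / (hi - lo) for name, s in scores.items()}


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
keyword_vec = [0.2, 0.7, 0.1]
course_vecs = {
    "COURSE_A": [0.1, 0.9, 0.0],
    "COURSE_B": [0.8, 0.1, 0.1],
    "COURSE_C": [0.3, 0.5, 0.2],
}

raw = {code: cosine_similarity(keyword_vec, vec) for code, vec in course_vecs.items()}
scaled = min_max_scale(raw)
print(raw)
print(scaled)
```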

#### Future Improvements of Keyword Similarity:
- Add evaluation metrics to compare performance of different models.
- Add an explanation of the keyword similarity based on attention layers, using exBERT, LIME, or other explanation techniques to surface the sentences and words that are similar to the keywords.

ExBERT is another tool for explaining the inner workings of Transformers, developed by MIT-IBM AI Labs and Harvard NLP Group.

ExBERT not only provides insight into attention mechanisms but also combines them with the contextual word embeddings to give better insight into these relations. According to the authors: "While static analyses of these models lead to targeted insights, interactive tools are more dynamic and can help humans better gain an intuition for the model internal reasoning process. We present EXBERT, an interactive tool named after the popular BERT language model, that provides insights into the meaning of the contextual representations by matching a human-specified input to similar contexts in a large annotated dataset. By aggregating the annotations of the matching similar contexts, EXBERT helps intuitively explain what each attention-head has learned."

The paper can be found here: [paper](https://arxiv.org/pdf/1910.05276.pdf) and the source code here: [github](https://github.com/bhoov/exbert)

Reference: EXBERT: A Visual Analysis Tool to Explore Learned Representations in Transformers Models. Hoover et al. 2019.

<p align="center">
  <img src="images/ExBERT.png" alt="ExBERT" title="ExBERT"/>
</p>

## Bloom's Taxonomy:

[Bloom's taxonomy](https://en.wikipedia.org/wiki/Bloom%27s_taxonomy) can be framed as a [multilabel classification](https://en.wikipedia.org/wiki/Multi-label_classification) problem, in which an intended learning outcome (ILO) is labeled according to some domain criterion. For our use case, we are specifically interested in the cognitive domain, where an ILO may be assigned to any of six categories:

- create
- understand
- apply
- analyze
- evaluate
- remember
  
These categories reflect cognitive skills a student is expected to develop when embarking on an academic trajectory. We allow students to indicate which skill(s) they would like to focus on each time they query a recommendation from our system.

Besides students' preferences, we also need to categorize courses according to their ILOs. First, we trained a multi-layer perceptron (MLP) for multilabel classification combining back-propagation, gradient descent, and Bayesian optimization for hyper-parameter search. The MLP expects a text embedding produced by the sentence transformer `all-MiniLM-L6-v2`. It is possible to use any sentence transformer, but that would require re-training the MLP. To classify an ILO, we encode it using the sentence transformer and then pass it through the MLP. For each ILO, we then obtain a vector with six values, each corresponding to one of the six cognitive skills. These values represent the independent probability that the ILO corresponds to that cognitive skill.

Given the time it takes to compute these probabilities for all courses, we pre-compute them before setting up the recommender. Currently, these values are stored in `catalog.json`, found under `rec_sys_uni/datasets/data/planners/`. Each dictionary in that file holds information about one course. To find the probabilities associated with each ILO of a specific course, look under the key `"proba"`.

To collapse the probabilities for each ILO into a single set of six probabilities per course, we employ the [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution). So, for each course and for each cognitive skill, we compute the collapsed probability by evaluating the following formula:
<center>

$$ P\left(\sum_{i} p_{s,i} \geq 1\right) = p_s $$
</center>
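
One way to read this formula is to treat each ILO $i$ as an independent Bernoulli trial that "hits" skill $s$ with probability $p_{s,i}$, so that $p_s$ is the probability of at least one hit, which equals $1 - \prod_i (1 - p_{s,i})$. The sketch below follows that reading. The independence assumption and the function name are ours; the aggregation notebook may compute the value differently.

```python
from math import prod


def collapse_skill_probability(ilo_probs):
    """P(at least one ILO targets the skill), assuming independent ILOs."""
    return 1.0 - prod(1.0 - p for p in ilo_probs)


# Made-up per-ILO probabilities for one cognitive skill of one course.
p_understand = collapse_skill_probability([0.5, 0.5, 0.2])
print(round(p_understand, 4))  # 1 - 0.5 * 0.5 * 0.8 = 0.8
```

Under this reading, a single ILO with a probability close to 1 is enough to push the course-level probability close to 1, which is consistent with the near-binary values in the example output later in this section.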

To learn more about how we trained the MLP, please check our notebook on [auto-labeling ILOs](Autolabeling_ILOs.ipynb). To learn more about how we collapsed the probabilities for each ILO, check our notebook on [aggregating ILO probabilities](Aggregating_ILO_Probs.ipynb).

**Note:** We don't recommend running the notebooks, because many things have changed since they were implemented, and not all files and directories may be in the places each notebook expects. However, the processes and methods used remain the same.

**Note:** The following information is deprecated. Our implementation of the Bloom-based recommender fell behind and should be upgraded to expect student inputs and course data using the new format.

The `BloomBased` class in the recommendation system is designed to generate course recommendations for students based on Bloom's Taxonomy. This taxonomy categorizes cognitive skills into six areas: create, understand, apply, analyze, evaluate, and remember. The system allows students to express their preference for specific cognitive skills, and courses are recommended accordingly.

#### Class Initialization
`BloomBased.__init__(self, precomputed_blooms: bool = True, top_n: int | None = 4, score_alg: str = 'sum', scaler: str = 'None')`
Initializes `BloomBased` with the specified parameters.

##### Parameters
- `precomputed_blooms` (bool, optional): Determines if precomputed Bloom's values are used. Default is `True`. Currently, setting this to `False` is not implemented.
- `top_n` (int, optional): Limits the number of recommendations, applying weights according to Bloom's taxonomy. If `None`, all courses are considered.
- `score_alg` (str): Specifies the algorithm used for scoring. Current implementation supports 'sum'.
- `scaler` (str): Specifies the scaling technique for the scores. Options are 'MaxMin' or 'None'. Current default is 'None'.

#### Method: `recommend`
##### `BloomBased.recommend(self, course_data, student_blooms)`
Generates course recommendations based on student input and course data.

##### Parameters
- `course_data`: A list of dictionaries containing data about each course. The format is expected to follow that of `catalog.json`.
- `student_blooms`: Bloom's taxonomy preferences indicated by the student. A dictionary with a key and value for each cognitive skill is expected.

##### Returns
- `blooms_output`: The same course data with updated scores.

##### Raises
- `AssertionError`: If `precomputed_blooms` is set to `False`.

**Notes**
- The current implementation requires `precomputed_blooms` to be `True`. Setting it to `False` will raise an error.
- The method reads precomputed Bloom's taxonomy values from a JSON file (`precomputed_blooms.json`) and uses them to match courses with student preferences.
- The `recommend` method does not currently implement any scoring or filtering based on `top_n` or `scaler` parameters, but they are included for potential future extensions.

**Deprecated Information**
- The current implementation of the Bloom-based recommender is outdated and does not support recent changes in student input and course data formats. An upgrade is recommended to align with new system requirements. You can find the Bloom's values in `rec_sys_uni/datasets/data/course/precomputed_blooms.json`.

#### Future Improvements of Bloom's taxonomy:
- Deploy and test the [DistilBERT](https://github.com/RyanLauQF/BloomBERT) or [DistilBERT EDM2022](https://github.com/SteveLEEEEE/EDM2022CLO) model for the Bloom-based recommender. These models may achieve better accuracy on the Bloom's taxonomy classification task: they obtained high accuracy on their own datasets, so it would be interesting to test them on our data as well.
- Add an explanation of the Bloom-based recommender based on Shapley values, Gini importance, or LIME.
<p align="center">
  <img src="images/Bloom_importance.png" alt="Bloom Explanation" title="Bloom Explanation"/>
</p>

Another type of explainability technique perturbs the input to find which regions of it change the prediction the most. One such method is [LIME](https://arxiv.org/abs/1602.04938), which builds a local linear approximator for each instance. It perturbs the input and tries to predict how the output changes with each local perturbation. The weights of the linear model for each token are used as saliency scores.

See this [book](https://christophm.github.io/interpretable-ml-book/lime.html) on interpretability for more information on LIME and its usage in NLP.

LIME can also be used to explain which words contribute most to the polarity in sentiment analysis. See, for instance, this LIME visualization used to examine the words used for classification in negation handling. It can be observed that DON'T is not recognized as DO NOT and thus does not contribute to the negative polarity of this sentence, hence the positive classification instead of the (correct) negative one.

<p align="center">
  <img src="images/LIME.png" alt="LIME" title="LIME"/>
</p>


General structure of Bloom's Taxonomy model:
<p align="center">
  <img src="images/bloom.png" alt="Bloom" title="Bloom"/>
</p>




In [14]:
from rec_sys_uni.rec_systems.bloom_based_sys.bloom_based import BloomBased
bloom_based = BloomBased(score_alg='sum',
                         scaler='None')

bloom_output = bloom_based.recommend(student_node.course_data,
                                     student_node.student_input['keywords'])
bloom_output['COR1002']

{'remember': 3.3306690738754696e-16,
 'understand': 1.0,
 'apply': 5.436325878349635e-08,
 'analyze': 2.886579864025407e-15,
 'evaluate': 1.3245626817592893e-11,
 'create': 8.374181348358434e-10}

## Explanation with Generative LLM:

Currently, we are utilizing OpenAI's model to provide explanations for keywords and to summarize course descriptions. This is achieved through a combination of a prompt template and few-shot learning examples, as implemented in our `rec_sys_uni/rec_systems/llm_explanation/LLM.py` script. One significant advantage of using OpenAI is its capability to generate outputs in JSON format, which facilitates easier integration and manipulation of data in various applications.

However, we are contemplating the development and deployment of our own model for generating explanations. The primary motivation behind this shift is the potential for greater customization and accuracy. By fine-tuning a model specifically on our dataset, we aim to achieve performance that surpasses what we currently obtain with OpenAI. Our own model would allow for more tailored and precise explanations, catering specifically to the unique characteristics and requirements of our data.

Furthermore, developing our own model could offer additional benefits such as enhanced control over the model's behavior, the ability to continuously improve and adapt the model based on incoming data and feedback, and potentially lower long-term costs than using a third-party service. While the initial development and training might require significant resources, the long-term benefits could be substantial, especially in terms of offering a more personalized and relevant user experience.

#### Future Improvements of Explanation with Generative LLM:
- Develop and deploy our own model for generating explanations.
- Split tasks into separate prompts for better control over the output, so that it conforms to the JSON format.
- Deploy the knowledge graph to reduce the number of hallucinations and provide more robust explanations; [Graph Retrieval Augmented Generation](https://medium.com/@nebulagraph/graph-rag-the-new-llm-stack-with-knowledge-graphs-e1e902c504ed).
- Add explanation of generated output based on retrieval from the knowledge graph.

General structure of the explanation:
<p align="center">
  <img src="images/lmm.png" alt="Explanation" title="Explanation"/>
</p>

In [17]:
# To get the explanation for the course, we need to provide the student id and course code.
explanation = rs.generate_explanation(student_node.id, "SCI2039")

In [18]:
explanation.keys()

dict_keys(['course_code', 'course_title', 'summary_description', 'keywords_explanation'])

# Example of Course Overview

## SCI2039: Computer Science

### Summary Description
Computer Science is a comprehensive course that provides an overview of the discipline, covering a wide range of topics including algorithmic foundations of informatics, hardware issues, such as number systems and computer architectures, and software issues, such as operating systems, programming languages, compilers, networks, the Internet, and artificial intelligence. The course also includes practical lab sessions to investigate the concepts introduced. By the end of the course, students are expected to have developed experience in applying techniques from informatics, computer science, and programming for their own research and educational purposes.

### Relevance to Other Disciplines

- **Physics**
  - While there is a moderate similarity between physics and the course description (keyword similarity: 0.9), the course does not directly cover physics topics. However, the analytical and problem-solving skills developed in physics can be beneficial in computer science.

- **Maths**
  - Mathematics is highly relevant to this course (keyword similarity: 0.67), as it forms the foundation of many computer science concepts, such as algorithms, logic, and discrete mathematics.

- **Statistics**
  - Statistics has some relevance to the course (keyword similarity: 0.28), particularly in the context of data analysis and interpretation, which are essential in computer science.

- **AI**
  - Artificial intelligence is directly relevant to the course (keyword similarity: 0.51), as it is explicitly mentioned in the course description. The course covers artificial intelligence as one of the key topics.

- **Computer Science**
  - The course is directly aligned with computer science (keyword similarity: 1.0), as it is the main subject of the course title. The course covers a broad range of computer science topics, providing a comprehensive introduction to the field.

- **Chem**
  - While there is a moderate similarity between chemistry and the course description (keyword similarity: 0.57), the course does not directly cover chemistry topics. However, the analytical and logical thinking skills developed in chemistry can be beneficial in computer science.


## Planner (Integer Linear Programming):

Our system has access to all courses' descriptions, titles, and ILOs. It combines all that information with user preferences to generate a preference score for each course. By incorporating course prerequisites and rules and regulations described in [Education and Examination Regulations (EER)](eer_2023-2024-mslas.pdf) into an expert system, we can go further than just ranking courses. Instead, we can generate academic trajectories that align with students' preferences. That is the role of our planner.

Our planner is powered by an integer linear programming (ILP) solver bundled with [scipy](https://scipy.org/). We formulated the problem of generating an academic trajectory as a constrained scheduling problem. The system tries to allocate courses such that the total preference score is maximized, but certain constraints are never broken.

At the time of writing, our planner employs nearly 50,000 constraints. Some of them come from the EER: we went through the entire document and manually formulated each of the functional constraints according to the specifications. These account for less than half of the total number of constraints employed by the planner. Most of the constraints are the result of including course prerequisites.

Course prerequisites are expressed in natural language. To get our planner to use them, they had to be translated into functional constraints. So, we translated each course prerequisite into propositional logic. Then, we developed a specialized parser to translate them from this intermediate form into functional constraints.

**Note:** Currently, our parser first writes the propositional formulas in conjunctive normal form (CNF). Then, it goes over each disjunctive clause and converts it to a functional constraint. While this process is easier to implement, it results in an explosion of functional constraints. A more efficient approach would be to write the propositional formulas in disjunctive normal form (DNF) and then convert them to functional constraints. This process would require including auxiliary variables, as specified in [AIMMS Modeling Guide - Integer Programming Tricks](https://download.aimms.com/aimms/download/manuals/AIMMS3OM_IntegerProgrammingTricks.pdf).
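
To illustrate the CNF step, the sketch below converts a single disjunctive clause over binary variables into one linear inequality: a positive literal `x` contributes `x`, a negated literal contributes `1 - x`, and the clause holds exactly when the contributions sum to at least 1. The function name and the clause encoding are hypothetical and are not taken from our parser.

```python
def clause_to_constraint(clause):
    """Convert a disjunctive clause into (coeffs, lower_bound).

    `clause` is a list of (variable, is_positive) literals. The clause is
    satisfied by a 0/1 assignment x exactly when
        sum(coeffs[v] * x[v]) >= lower_bound.
    Moving the constant 1 of every negated literal to the right-hand side
    gives lower_bound = 1 - (number of negated literals).
    """
    coeffs = {}
    negated = 0
    for var, is_positive in clause:
        coeffs[var] = coeffs.get(var, 0) + (1 if is_positive else -1)
        if not is_positive:
            negated += 1
    return coeffs, 1 - negated


# Hypothetical prerequisite "course B requires course A", i.e. (not B) or A.
clause = [("B", False), ("A", True)]
coeffs, lower = clause_to_constraint(clause)
print(coeffs, lower)  # {'B': -1, 'A': 1} 0, i.e. A - B >= 0
```

Each CNF clause yields one such inequality, which is why formulas with many clauses inflate the number of constraints.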

#### Class Initialization
##### `UCMPlanner.__init__(self, reclib: Sequence[dict] | str) -> None`
Initializes the `UCMPlanner` with the provided recommendation library (reclib).

##### Parameters
- `reclib` (Sequence[dict] | str): The recommendation library, which can be either a sequence of dictionaries representing courses or a string path to a JSON file containing this data.

#### Internal Workings
- Upon initialization, the planner processes course data, normalizes period information, and establishes various mappings and constraints crucial for the planning algorithm.
- The planner formulates the academic trajectory as a constrained scheduling problem, aiming to maximize the total preference score while adhering to numerous constraints, including EER and course prerequisites.
- Course prerequisites, initially in natural language, are converted into propositional logic and subsequently into functional constraints for the ILP solver.

#### Methods
##### `_update_ranks(self, courses: Sequence[dict])`
Updates course rankings based on provided scores.

##### Parameters
- `courses` (Sequence[dict]): A sequence of dictionaries, each containing a course code and a score.

##### `plan(self, courses: Sequence[dict])`
Generates an academic trajectory based on the updated course scores.

##### Parameters
- `courses` (Sequence[dict]): A sequence of dictionaries containing course information and up-to-date scores.

##### Returns
- A structured academic plan (`pretty_tree`), organizing courses into periods and years, or `None` if no feasible plan is found.

**Notes**
- The planning process involves the generation of linear constraints and the application of a Mixed Integer Linear Programming (MILP) solver from `scipy`.
- Additional constraints related to general education and concentration requirements (`gedu_ineq` and `conc_ineq`) are incorporated into the planning process as needed.
- Typically, an entire academic trajectory can be generated in under four seconds, although the process may occasionally take longer.
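
As a toy illustration of the kind of problem the planner solves, the sketch below uses `scipy.optimize.milp` (available from SciPy 1.9) to pick courses that maximize a total preference score under two constraints. The course names, scores, and constraints are made up for this example and are far simpler than the planner's actual formulation.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

courses = ["A", "B", "C", "D"]
scores = np.array([0.9, 0.6, 0.8, 0.3])  # hypothetical preference scores

# milp minimizes, so we negate the scores to maximize total preference.
c = -scores

# Row 1: take at most two courses      ->  x_A + x_B + x_C + x_D <= 2
# Row 2: course C requires course A    ->  x_C - x_A <= 0
A = np.array([[1, 1, 1, 1],
              [-1, 0, 1, 0]])
constraints = LinearConstraint(A, lb=-np.inf, ub=np.array([2, 0]))

result = milp(c,
              constraints=constraints,
              integrality=np.ones(len(courses)),  # all variables are integers
              bounds=Bounds(0, 1))                # restricted to 0 or 1

chosen = [name for name, x in zip(courses, result.x) if x > 0.5]
print(chosen, -result.fun)
```

Here the solver selects A and C: B and C together would violate the prerequisite, and A with B scores lower than A with C.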


## Warning System:

At present, our warning system is facing challenges in achieving accurate predictions due to imbalanced student data. This imbalance adversely affects the system's ability to correctly predict instances of academic failure. To address this issue, a more refined approach is being considered. We propose to define common features across courses, focusing on key skills such as physics, mathematics, law, etc. By identifying and utilizing these skill-based features, we aim to improve the prediction accuracy of potential academic failures. Furthermore, this approach will enable us to provide students with tailored recommendations on which courses to undertake in order to develop the relevant skills needed to succeed in courses they are predicted to struggle with.

Previously, we experimented with using embeddings, assigning weights based on students' grades. However, this method resulted in overly generalized embeddings, leading to indistinct differentiation between them. 

We currently employ a Random Forest Classifier to predict academic failures. This classifier offers feature importance analysis, which we use to determine the key aspects contributing to a predicted failure or pass. Additionally, we use a Sentence Transformer to compare candidate courses with the course the student is predicted to struggle with, and we report the matching score alongside each warning recommendation.

Warning model implementation and usage: you can find the warning model in the `rec_sys_uni/rec_systems/warning_model/warning_model.py` script and `notebooks/Warning_Model.ipynb` notebook. 

General structure of the warning model:
<p align="center">
  <img src="images/wp.png" alt="Warning" title="Warning"/>
</p>

### Collaborative Filtering as a Warning System:

We implemented a matrix factorization method to predict the course grade, utilizing student data that includes academic grades. However, this system is not yet fully integrated, as it requires access to real student IDs from Maastricht University. Additionally, the model necessitates periodic updates to maintain its relevance and accuracy.
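
For readers unfamiliar with the technique, the following is a minimal, dependency-free sketch of matrix factorization trained with stochastic gradient descent on observed grades only. It is not the implementation from our notebook: the student and course identifiers, the grades, and all hyper-parameters are made up.

```python
import random


def factorize(grades, n_factors=2, lr=0.05, reg=0.02, epochs=500, seed=0):
    """Fit latent vectors so that mean + P[s] . Q[c] approximates each grade."""
    rng = random.Random(seed)
    students = sorted({s for s, _, _ in grades})
    courses = sorted({c for _, c, _ in grades})
    P = {s: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for s in students}
    Q = {c: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for c in courses}
    mean = sum(g for _, _, g in grades) / len(grades)
    for _ in range(epochs):
        for s, c, g in grades:
            err = g - (mean + sum(p * q for p, q in zip(P[s], Q[c])))
            for f in range(n_factors):
                p, q = P[s][f], Q[c][f]
                # Gradient step on the squared error with L2 regularization.
                P[s][f] += lr * (err * q - reg * p)
                Q[c][f] += lr * (err * p - reg * q)
    return mean, P, Q


def predict(model, student, course):
    mean, P, Q = model
    return mean + sum(p * q for p, q in zip(P[student], Q[course]))


def mse(model, grades):
    return sum((g - predict(model, s, c)) ** 2 for s, c, g in grades) / len(grades)


# Hypothetical (student, course, grade) triples; most pairs are unobserved.
grades = [("s1", "c1", 8.0), ("s1", "c2", 6.0), ("s2", "c1", 7.5),
          ("s2", "c3", 5.5), ("s3", "c2", 6.5), ("s3", "c3", 6.0)]

untrained = factorize(grades, epochs=0)
trained = factorize(grades)
print(mse(untrained, grades), mse(trained, grades))
print(predict(trained, "s3", "c1"))  # prediction for an unobserved pair
```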

A significant challenge we face with this approach is the limited number of courses taken by individual students. This situation leads to a sparse matrix, which complicates the course grade prediction. Sparse data can significantly hinder the effectiveness of matrix factorization techniques, as these methods rely on a denser dataset to make accurate predictions and recommendations. As a further improvement, we could average the results of matrix factorization and a Restricted Boltzmann Machine (RBM), since this combination performed well on the [sparse Netflix dataset](https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429), or combine Non-negative Matrix Factorization with a Support Vector Machine. However, we do not have that much data or that many course features, so we first need to collect course features to get better results.

To address this issue, we are exploring alternative methods. One promising approach is the use of a Knowledge Graph. This method can effectively leverage the available student data to provide course recommendations. A Knowledge Graph, by its very nature, is adept at handling sparse data. It creates a network of entities (such as students, courses, skills) and their interrelations, allowing for a more nuanced understanding of student profiles and course attributes. This enhanced understanding can lead to more accurate and personalized course recommendations, even when dealing with limited student course histories.

By incorporating a Knowledge Graph, we aim to overcome the limitations posed by sparse data and improve the overall effectiveness of our recommendation system. This approach not only promises to deliver better-tailored course suggestions to students but also has the potential to evolve and scale as more data becomes available or as the educational offerings at Maastricht University expand and change.

Collaborative Filtering implementation: you can find the collaborative filtering model in the `notebooks/Collaborative_Filtering.ipynb` notebook.

## Student Model:

General structure of the student model:
<p align="center">
  <img src="images/sm.png" alt="Student" title="Student"/>
</p>

### Content-based:

Currently, we utilize the Sentence Transformer [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5), which showed the best performance on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) at the time of writing. We use this model to compare the courses a student has already taken with the courses in the catalog, so that we can find catalog courses similar to those already taken. However, this approach lacks transparency and explainability, since it is not clear why the recommended courses are similar to each other.

In [21]:
from rec_sys_uni.rec_systems.content_based_sys.content_based import ContentBased
from rec_sys_uni.datasets.datasets import get_student_data
content_based = ContentBased(
    model_name='BAAI/bge-large-en-v1.5',
    distance='cos',
    score_alg='sum',
    scaler='None')

student_data = get_student_data()

content_based_output = content_based.compute_course_similarity(
    student_node.results["recommended_courses"].keys(),
    student_data
)
content_based_output['COR1002']

{'SSC1005': 0.6391,
 'SSC1025': 0.5895,
 'SSC2019': 0.6427,
 'SSC2062': 0.6381,
 'SSC1007': 0.5868,
 'SSC2004': 0.5462}

In [19]:
student_data

{'courses_taken': {'SSC1005': {'passed': True,
   'grade': 7.1,
   'year': 2022,
   'period': 1},
  'COR1003': {'passed': True, 'grade': 6.5, 'year': 2022, 'period': 1},
  'SSC1025': {'passed': True, 'grade': 7.6, 'year': 2022, 'period': 2},
  'SSC2019': {'passed': True, 'grade': 6.0, 'year': 2022, 'period': 2},
  'SSC2062': {'passed': True, 'grade': 6.4, 'year': 2023, 'period': 5},
  'SSC1007': {'passed': True, 'grade': 7.2, 'year': 2023, 'period': 4},
  'COR1004': {'passed': False, 'grade': 5.7, 'year': 2023, 'period': 5},
  'SSC2004': {'passed': True, 'grade': 6.9, 'year': 2023, 'period': 5}}}

### Statistical-Student-Based: (Improvements of Recommendation System)

**Hidden Markov Models** (HMMs) are largely used to assign the correct label sequence to sequential data or to assess the probability of a given label and data sequence. These models are finite state machines characterised by a number of states, transitions between these states, and output symbols emitted while in each state. The HMM is an extension of the Markov chain. In a Markov chain, each state corresponds deterministically to a given event; in the HMM, the observation is a probabilistic function of the state. HMMs share the Markov chain's assumption that the probability of a transition from one state to another depends only on the current state, i.e. the series of states that led to the current state is not used. They are also time invariant.

The HMM is a directed graph with probability-weighted edges (representing the probability of a transition between the source and sink states), where each vertex emits an output symbol when entered. The symbol (or observation) is non-deterministically generated. For this reason, knowing that a sequence of output observations was generated by a given HMM does not mean that the corresponding sequence of states (and what the current state is) is known. This is the 'hidden' in the Hidden Markov Model.

Formally, an HMM can be characterised by:

- the output observation alphabet. This is the set of symbols which may be
  observed as output of the system.
- the set of states.
- the transition probabilities $a_{ij} = P(s_t = j | s_{t-1} = i)$. These
  represent the probability of transition to each state from a given state.
- the output probability matrix $b_i(k) = P(X_t = o_k | s_t = i)$. These
  represent the probability of observing each symbol in a given state.
- the initial state distribution. This gives the probability of starting
  in each state.
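
To make these components concrete, the sketch below implements the forward algorithm, which uses the initial distribution, the transition probabilities $a_{ij}$, and the output probabilities $b_i(k)$ to compute the probability of an observation sequence. The states, symbols, and numbers are made up for illustration and are not estimated from student data.

```python
def forward(observations, states, start_prob, trans_prob, emit_prob):
    """Return P(observations) under the HMM using the forward algorithm."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start_prob[s] * emit_prob[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            j: emit_prob[j][obs] * sum(alpha[i] * trans_prob[i][j] for i in states)
            for j in states
        }
    return sum(alpha.values())


# Hypothetical hidden states (study phase) and observed symbols (course level).
states = ["intro", "advanced"]
start_prob = {"intro": 0.8, "advanced": 0.2}
trans_prob = {
    "intro": {"intro": 0.6, "advanced": 0.4},
    "advanced": {"intro": 0.1, "advanced": 0.9},
}
emit_prob = {
    "intro": {"level1": 0.7, "level2": 0.3},
    "advanced": {"level1": 0.2, "level2": 0.8},
}

likelihood = forward(["level1", "level2"], states, start_prob, trans_prob, emit_prob)
print(round(likelihood, 4))  # 0.102 + 0.208 = 0.31
```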

An **HMM** can give the probabilities of next courses based on previous student data. However, HMMs model joint probabilities, so an error at one step of a course sequence can propagate to the next predicted course. Therefore, we can try a more advanced model, such as [Conditional Random Fields](https://en.wikipedia.org/wiki/Conditional_random_field) (CRFs), which give conditional probabilities of next courses based on previous student data. CRFs are a class of statistical modelling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas an ordinary classifier makes a prediction for a single sample without regard to "neighboring" samples, a CRF can take context into account: the prediction for one course may depend on other courses in the sequence.

Although the feature functions in both models can be course labels, it is better to consider more advanced feature extraction methods to generalize beyond sequence labels. With such features, predictions are aligned with course features instead of course labels.

## Course Reranker:

Since we receive different recommendations from our models, we need to apply a re-ranking strategy to determine the final recommendation. This can be achieved by using reciprocal rank fusion.
<center>

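$$ \sum_{d \in D} \frac{1}{k + r(d)} $$

</center>

- `k`: a constant that smooths the contribution of each rank; larger values reduce the gap between top-ranked and lower-ranked courses.
- `d`: a single model from the set of models `D`.
- `r(d)`: the rank (1 is best) of the course in the recommendation list of model `d`, sorted by that model's criterion. The sum is computed separately for every course.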
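
A minimal sketch of reciprocal rank fusion following the formula above. The three rankings are made up, and `k=60` is the value commonly used in the literature rather than a setting taken from our system.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of course codes into one ranking."""
    scores = {}
    for ranking in rankings:  # one ranked list per model d
        for rank, course in enumerate(ranking, start=1):  # rank is r(d)
            scores[course] = scores.get(course, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-3 lists from three models.
keyword_ranking = ["SCI2039", "SCI1010", "HUM1003"]
bloom_ranking = ["SCI1010", "SCI2039", "SSC1005"]
content_ranking = ["SCI1010", "HUM1003", "SCI2039"]

fused = reciprocal_rank_fusion([keyword_ranking, bloom_ranking, content_ranking])
print(fused)  # ['SCI1010', 'SCI2039', 'HUM1003', 'SSC1005']
```

A course missing from a model's list simply receives no contribution from that model, so courses recommended by several models tend to rise to the top.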



## Knowledge Graph:

Knowledge graphs are an active research topic in representation learning, data mining, and recommender systems.

Relevant papers:
- [Recommender systems based on graph embedding techniques: A comprehensive review](https://arxiv.org/pdf/2109.09587.pdf)
- [ColdGuess: A General and Effective Relational Graph Convolutional Network to Tackle Cold Start Cases](https://www.mlgworkshop.org/2022/papers/MLG22_paper_0866.pdf)
- [Topological Representation Learning for E-commerce Shopping Behaviors](https://www.mlgworkshop.org/2023/papers/MLG__KDD_2023_paper_13.pdf)

A Knowledge Graph can address problems with the explainability of models and provide more accurate results by introducing a structured knowledge base of courses and students. We attempt to create a knowledge graph of courses using current data by extracting relevant keywords from course descriptions and Intended Learning Outcomes (ILOs). These extracted keywords are then compared with other course descriptions and accepted as keywords if they meet a high similarity threshold. This approach demonstrates which courses and keywords share similar concepts.


General Knowledge graph explanation:
<p align="center">
  <img src="images/kg.png" alt="Knowledge Graph" title="Knowledge Graph"/>
</p>

All process of creating this knowledge graph can be found in the `notebooks/Knowledge_Graph.ipynb` notebook along with course clustering, LDAs, keywords extraction, and other techniques.

The knowledge graph itself is available in `knowledge_graph/knowledge_graph.html`.

Course clustering representation (`knowledge_graph/course_clustering.html`):
<p align="center">
  <img src="images/course_clustering.png" alt="Course Clustering" title="Course Clustering"/>
</p>

Latent Dirichlet Allocation (LDA) representation (`knowledge_graph/lda.html`):
<p align="center">
  <img src="images/lda.png" alt="LDA" title="LDA"/>
</p>

Moreover, we extract semantic role labels from all course descriptions, as they can be used to identify the primary concepts of a course. Semantic Role Labeling (SRL), also referred to as shallow semantic parsing, is a natural language processing task that assigns labels to the words in a sentence to identify their role within that sentence. The semantic roles, often called arguments, are assigned based on both the verb and its context. Consequently, the verb can determine the Bloom's taxonomy level, and the context of the verb can specify the subject to which this Bloom level applies. This approach can be used for Bloom's taxonomy explanation.

[Open Syllabus Galaxy](https://galaxy.opensyllabus.org/) is a dataset of syllabi from universities, together with the articles and books they reference, from around the world. It contains 6.1 million syllabi from 80 countries. Access to the dataset can be requested from its maintainers. This dataset is invaluable for creating a more comprehensive knowledge graph and for using its labels to define course features, which would aid Warning Prediction, Student Modeling, and training our Generative Large Language Model to provide explanations with references. Furthermore, there is ongoing work to create Intended Learning Outcomes (ILOs) for courses based on their descriptions and materials, which could be beneficial for articulating relevant ILOs.

By building a Conditional Random Field (CRF) model on student data, you can generate a student knowledge graph with transition probabilities and course features, where the course features can be derived from the Open Syllabus dataset.

Additionally, the European Commission released a [paper](https://esco.ec.europa.eu/en/about-esco/data-science-and-esco/leveraging-artificial-intelligence-maintain-esco-occupations-pillar) on aligning citizens' occupations with a skills taxonomy, which can be useful in our current work on creating a knowledge graph for the Recommender System.
