# Recommender System

## Introduction
This notebook provides an overview of the recommender system we have built for UCM. It includes key aspects to help you understand the system's functionality and workflow. We'll begin with how to call the system and then delve into the details of each component.

Note: We are not Software Engineers, so the structure might not be the best. We are open to suggestions and improvements.

Before starting, it's necessary to install the system's requirements. These are listed in the `requirements.txt` file. To install them, run the following command in your terminal:

```
pip install -r requirements.txt
```


Additionally, when you run the system for the first time, some transformer models will be downloaded, so the first run takes longer. The system also uses MongoDB, which must be installed on your machine. You can download MongoDB from [here](https://www.mongodb.com/try/download/community).

## Table of Contents:
1. [Initialization and Recommender System Structure](#Initialization-and-Recommender-System-Structure)
2. [Keyword Similarity](#Keyword-Similarity)
3. [Bloom's Taxonomy](#Bloom's-Taxonomy)
4. [Explanation with Generative LLM](#Explanation-with-Generative-LLM)
5. [Planner (Integer Linear Programming)](#Planner-(Integer-Linear-Programming))
6. [Warning System](#Warning-System)
7. [Collaborative Filtering](#Collaborative-Filtering)
8. [Knowledge Graph](#Knowledge-Graph)
 

## Initialization and Recommender System Structure:

The file `instances/recommender_instances.py` contains the initialization of the recommender system. Here, we import all the components that are required for the system. The components are initialized once and then used throughout the system. The system's structure is kept as modular as possible. This modularity aids in the easy integration of new components into the system and in its maintenance.


In [2]:
import os
# Change to the project root directory, from which app.py is usually run.
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir(os.path.dirname(os.getcwd()))

In [8]:
# Here is example code for the initialization of the system.
from instances.recommender_instances import rs

# With the rs object we can now call the components we need.
student_input = {  # Example input for the system; this is also what the front-end sends.
    "keywords": {
        "physics": 1.0,
        "maths": 1.0,
        "statistics": 1.0,
        "ai": 1.0,
        "computer science": 1.0,
        "chem": 1.0,
    },
    "blooms": {'create': 0.0,
               'understand': 0.0,
               'apply': 0.0,
               'analyze': 0.0,
               'evaluate': 0.0,
               'remember': 0.0}
}
student_node = rs.get_recommendation(student_input)  # Returns the recommendations as a StudentNode object.
student_node

{'COR1002': {'physics': 0.317, 'maths': 0.2752, 'statistics': 0.2255, 'ai': 0.1399, 'computer science': 0.2675, 'chem': 0.17}, 'COR1003': {'physics': 0.1223, 'maths': 0.1556, 'statistics': 0.1974, 'ai': 0.1037, 'computer science': 0.1756, 'chem': 0.0811}, 'COR1004': {'physics': 0.0678, 'maths': 0.0993, 'statistics': 0.1095, 'ai': 0.0654, 'computer science': 0.1255, 'chem': 0.1127}, 'COR1006': {'physics': 0.2941, 'maths': 0.2917, 'statistics': 0.1887, 'ai': 0.2249, 'computer science': 0.3374, 'chem': 0.1621}, 'HUM1003': {'physics': 0.114, 'maths': 0.1233, 'statistics': 0.1015, 'ai': -0.0003, 'computer science': 0.1589, 'chem': 0.0839}, 'HUM1007': {'physics': 0.1675, 'maths': 0.1606, 'statistics': 0.2046, 'ai': 0.1514, 'computer science': 0.2175, 'chem': 0.1048}, 'HUM1010': {'physics': 0.0917, 'maths': 0.0521, 'statistics': 0.1268, 'ai': 0.0118, 'computer science': 0.0549, 'chem': -0.0058}, 'HUM1011': {'physics': 0.1227, 'maths': 0.1435, 'statistics': 0.0216, 'ai': 0.1675, 'computer scie

KeyError: 'blooms'

The `StudentNode` class, which we created for this system, holds all information about a student. It has the following attributes:

1. **results**: contains results from the recommender system components.
2. **student_input**: contains keywords and Bloom's taxonomy of the student input.
3. **course_data**: the courses for which recommendations are computed. We currently use the system's own course data; later on, the front-end should send this course list.
4. **student_data**: contains all courses which the student took.
5. **id**: unique Student ID, which was assigned by MongoDB.
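
To make the attribute list concrete, here is a minimal sketch of such a container. It is not the project's actual `StudentNode` class: the name `StudentNodeSketch` and the use of a dataclass are assumptions made for illustration. Only the constructor argument order and the `set_id` method are taken from the `get_recommendation` code shown later in this notebook.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StudentNodeSketch:
    """Illustrative stand-in for the project's StudentNode (not the real class)."""
    results: dict                        # output of the recommender components
    student_input: dict                  # keywords and Bloom's taxonomy weights
    course_data: dict                    # courses to score, keyed by course_id
    student_data: Optional[dict] = None  # courses the student already took
    id: Any = None                       # filled in after the MongoDB insert

    def set_id(self, new_id: Any) -> None:
        # Mirrors the set_id call made after insert_one in get_recommendation.
        self.id = new_id


node = StudentNodeSketch(
    results={"recommended_courses": {}},
    student_input={"keywords": {"physics": 1.0}, "blooms": {}},
    course_data={},
)
node.set_id("65787b31d0727440ea3d5d6e")
print(node.id)
```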

The `StudentNode` object is saved in the MongoDB database, and its `id` attribute serves as the unique identifier used to retrieve it. This lets other components of the system be called independently, without sending unnecessary information to the front-end.

Let's take a look at each component of the student node in detail.

### 1. Results:
`results`: A dictionary holding the output of the recommender components. It can contain the following keys:

- `recommended_courses`: A dictionary where each key is a `course_id` (String) and the value is another dictionary with the following keys:
  - `score`: A float representing the total score of each model.
  - `period`: A list of integers or a list of lists of integers. For example, `[1, 4]` or `[[1, 2, 3], [4, 5, 6]]` or `[[1,2]]`.
  - `warning`: A boolean value.
  - `warning_recommendation`: A list of warning recommendations.
  - `keywords`: A dictionary of scores for each keyword (only present after applying the CourseBased model).
  - `blooms`: A dictionary of scores for each bloom (only present after applying the BloomBased model).

- `sorted_recommended_courses`: A list of the courses included in the structured recommendation, where each course is a dictionary with the following keys:
  - `course_code`: A string.
  - `course_name`: A string.
  - `warning`: A boolean value.
  - `warning_recommendation`: A list of warning recommendations.
  - `keywords`: A dictionary where each key is a keyword (String) and the value is a weight (float).
  - `blooms`: A dictionary where each key is a bloom (String) and the value is a weight (float).
  - `score`: A float.

- `structured_recommendation`: A dictionary with the following structure:
  - `semester_1`: A dictionary with the following keys:
    - `period_1`: A list of `course_id` (String), the length of the list is less than or equal to `top_n`.
    - `period_2`: A list of `course_id` (String), the length of the list is less than or equal to `top_n`.
  - `semester_2`: A dictionary with the following keys:
    - `period_4`: A list of `course_id` (String), the length of the list is less than or equal to `top_n`.
    - `period_5`: A list of `course_id` (String), the length of the list is less than or equal to `top_n`.

Note: If you apply the `sort_by_periods` function, you will have `structured_recommendation` and `sorted_recommended_courses` keys in the results.
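
As a rough illustration of how the nested `structured_recommendation` layout can be traversed, the sketch below flattens it into `(semester, period, course_id)` rows. The helper name `iter_structured` and the sample data are hypothetical; the data only follows the layout described above and is not real system output.

```python
def iter_structured(structured):
    """Yield (semester, period, course_id) rows from a structured recommendation."""
    for semester, periods in structured.items():
        for period, course_ids in periods.items():
            for course_id in course_ids:
                yield semester, period, course_id


# Hypothetical data following the layout described above.
structured_recommendation = {
    "semester_1": {"period_1": ["COR1002", "SCI2039"], "period_2": ["HUM1003"]},
    "semester_2": {"period_4": ["SCI1004"], "period_5": []},
}

rows = list(iter_structured(structured_recommendation))
for row in rows:
    print(row)
```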


In [11]:
print(student_node.results.keys())
print("Number of courses: " + str(len(student_node.results["recommended_courses"].keys())))
print(student_node.results["sorted_recommended_courses"][0].keys())
print(student_node.results["structured_recommendation"].keys())
print(student_node.results["structured_recommendation"]["semester_1"].keys())
print(student_node.results["structured_recommendation"]["semester_1"]["period_1"][0].keys())

dict_keys(['recommended_courses', 'sorted_recommended_courses', 'structured_recommendation'])
Number of courses: 149
dict_keys(['semester_1', 'semester_2'])
dict_keys(['period_1', 'period_2'])


### 2. Student Input:
`student_input`: A dictionary that contains the student's input. For example:

- `keywords`: A dictionary where each key is a keyword (String) and the value is a weight (float).
- `blooms`: A dictionary where each key is a bloom (String) and the value is a weight (float).
- `semester`: A float representing the semester.

Example of `student_input`:

```python
student_input = {
    "keywords": {
        "python": 0.5,
        "data science": 0.2
    },
    "blooms": {
        "create": 0.5,
        "understand": 0.75,
        "apply": 0.25,
        "analyze": 0.5,
        "evaluate": 0.0,
        "remember": 1.0
    }
}
```

### 3. Course Data:
`course_data`: A dictionary where each key is a `course_id` (String) and the value is another dictionary with the following keys:

- `course_name`: A string representing the name of the course. For example, "Philosophy of Science".
- `period`: A list of integers or a list of lists of integers. For example, `[1, 4]` or `[[1, 2, 3], [4, 5, 6]]` or `[[1,2]]`.
- `level`: An integer representing the level of the course. For example, 1, 2, or 3.
- `prerequisites`: A list of course IDs (Strings) that are prerequisites for the course. For example, `["COR1001", "COR1002"]`.
- `description`: A string providing a description of the course. For example, "This course is about ...".
- `ilos`: A list of strings representing the Intended Learning Outcomes (ILOs) of the course. For example, `["Be able to apply the scientific method to a given problem", "Be able to explain the difference between science and pseudoscience"]`.

Example of `course_data`:

```python
course_data = {
    "COR1001": {
        "course_name": "Philosophy of Science",
        "period": [1, 4],
        "level": 1,
        "prerequisites": ["COR1002"],
        "description": "This course is about ...",
        "ilos": [
            "Be able to apply the scientific method to a given problem",
            "Be able to explain the difference between science and pseudoscience"
        ]
    },
    ...
}
```
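
Because `period` can be either a flat list or a list of lists, code that consumes it may want a single shape. The sketch below converts both forms to a list of lists. It rests on an assumption that is not stated in this notebook: a flat entry such as `1` is treated as an offering in one period, and an inner list as one offering spanning several periods. The helper name `normalize_period` is hypothetical.

```python
def normalize_period(period):
    """Return the period field as a list of lists of ints."""
    normalized = []
    for entry in period:
        if isinstance(entry, int):
            normalized.append([entry])      # assumed: single-period offering
        else:
            normalized.append(list(entry))  # assumed: offering spanning several periods
    return normalized


print(normalize_period([1, 4]))
print(normalize_period([[1, 2, 3], [4, 5, 6]]))
print(normalize_period([[1, 2]]))
```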

Note: We are using the `course_data` from the system. However, later on, the front-end should send the list of courses for which we need to compute recommendations.

### 4. Student Data:
`student_data`: A dictionary that contains the student's data. For example:

- `courses_taken`: A dictionary where each key is a `course_id` (String) and the value is another dictionary with the following keys:
  - `passed`: A boolean value indicating whether the student passed the course.
  - `grade`: A float representing the grade the student received in the course.
  - `period`: An integer or a list of two integers representing the period(s) when the course was taken.
  - `year`: An integer representing the year when the course was taken.

Example of `student_data`:

```python
student_data = {
    "courses_taken": {
        "COR1001": {
            "passed": True,
            "grade": 3.5,
            "period": [1, 2],
            "year": 2022
        },
        ...
    }
}
```

Note: We are using the `student_data` from the system. However, later on, the front-end should send the student's data along with the student ID provided by Maastricht University.
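
One use of `courses_taken` is to leave out courses the student has already taken: `validate_system_input` passes the taken course IDs to `get_course_data` as `except_courses`. The sketch below shows the same idea on plain dictionaries. The helper `exclude_taken_courses`, the second course entry, and its name are hypothetical and only serve the illustration.

```python
def exclude_taken_courses(course_data, student_data):
    """Return course_data without the courses listed in student_data['courses_taken']."""
    taken = set()
    if student_data:
        taken = set(student_data.get("courses_taken", {}))
    return {cid: info for cid, info in course_data.items() if cid not in taken}


# Hypothetical miniature inputs in the formats shown above.
course_data = {
    "COR1001": {"course_name": "Philosophy of Science"},
    "COR1002": {"course_name": "Placeholder Course"},
}
student_data = {
    "courses_taken": {
        "COR1001": {"passed": True, "grade": 3.5, "period": [1, 2], "year": 2022}
    }
}

remaining = exclude_taken_courses(course_data, student_data)
print(list(remaining))
```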

### Usage of MongoDB:

We use MongoDB to store the `StudentNode` object, and we use its `id` attribute to retrieve it from the database. Below we show how to retrieve the `StudentNode` object from MongoDB in JSON format.

Local server: `mongodb://localhost:27017/`

The `RecSys` object (`rs`) has a `db` attribute, which is the database we are using, named `RecSys`. We store the `StudentNode` object in the `student_results` collection, with the `id` attribute as the unique identifier for the student's session.

Note: Every time student input is sent to the system, we create a new `StudentNode` object and store it in the database.

Example: How to use the `id` attribute to retrieve the `StudentNode` object as json from the database.

In [12]:
# This is an example of id that we get from MongoDB.
print(student_node.id)

65787b31d0727440ea3d5d6e


In [13]:
from bson.objectid import ObjectId
# First we need to get the database from the rs object.
db = rs.db

# Then we get the collection from the database.
collection = db["student_results"]

# We can retrieve the stored student node from the collection as a JSON-like dictionary by its id.
student_info = collection.find_one({"_id": ObjectId(str(student_node.id))})
student_info.keys()

dict_keys(['_id', 'student_id', 'student_input', 'course_data', 'student_data', 'results', 'time'])
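
The retrieved document is a Python dictionary, and its `_id` value is an `ObjectId`, which the standard `json` module cannot serialize directly. A minimal sketch of one way to turn such a document into a JSON string is shown below, falling back to `str()` for unsupported values. `FakeObjectId` is a stand-in so the example runs without MongoDB, and the `time` value is made up.

```python
import json


def to_json(document):
    """Serialize a MongoDB-style document; non-JSON values fall back to str()."""
    return json.dumps(document, default=str, indent=2)


class FakeObjectId:
    """Stand-in for bson.ObjectId so the example runs without MongoDB."""

    def __init__(self, value):
        self.value = value

    def __str__(self):
        return self.value


document = {
    "_id": FakeObjectId("65787b31d0727440ea3d5d6e"),
    "student_id": "123",
    "time": "14/12/2023 10:30:00",  # hypothetical timestamp
}
as_json = to_json(document)
print(as_json)
```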

Let's briefly explain the parameters and functions of the Recommender System:
```python
def __init__(self,
             keyword_based: KeywordBased = None,
             bloom_based: BloomBased = None,
             explanation: LLM = None,
             warning_model: WarningModel = None,
             planner: UCMPlanner = None,
             validate_input: bool = True,
             top_n: int = 7,
             ):
    self.validate_input = validate_input
    self.keyword_based = keyword_based
    self.bloom_based = bloom_based
    self.explanation = explanation
    self.warning_model = warning_model
    self.planner = planner
    self.top_n = top_n
    self.db = pymongo.MongoClient("mongodb://localhost:27017/")["RecSys"]
```
- `keyword_based`: KeywordBased model.
- `bloom_based`: BloomBased model.
- `explanation`: LLM model.
- `warning_model`: WarningModel model.
- `planner`: UCMPlanner model.
- `validate_input`: Boolean value. If True, the input from the front-end will be validated.
- `top_n`: Integer value. The maximum number of courses recommended per period.
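
To illustrate what `top_n` means in practice, the sketch below groups courses by period and keeps at most `top_n` of the highest-scoring ones in each. This is a simplified assumption about the behaviour, not the project's `sort_by_periods` implementation: the helper `top_n_per_period` is hypothetical, the scores are made up, and only flat period lists are handled.

```python
def top_n_per_period(recommended_courses, top_n=7):
    """Group course ids by period and keep the top_n highest scores in each."""
    by_period = {}
    for course_id, info in recommended_courses.items():
        for period in info["period"]:
            by_period.setdefault(period, []).append((info["score"], course_id))
    return {
        period: [cid for _, cid in sorted(pairs, reverse=True)[:top_n]]
        for period, pairs in sorted(by_period.items())
    }


# Hypothetical scores; periods are given as flat lists of integers here.
recommended = {
    "COR1002": {"score": 0.9, "period": [1]},
    "SCI2039": {"score": 0.7, "period": [1, 4]},
    "HUM1003": {"score": 0.4, "period": [1]},
}
print(top_n_per_period(recommended, top_n=2))
```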


```python
def validate_system_input(self,
                          student_input,
                          course_data,
                          student_data,
                          system_course_data,
                          system_student_data):
    """
    function: validate_system_input
    description: validate the format of the input data
    """

    check_student_input(student_input)

    if system_student_data:
        student_data = get_student_data()

    if system_course_data:
        exception_courses = []
        if student_data:
            exception_courses = list(student_data['courses_taken'].keys())
        course_data = get_course_data(except_courses=exception_courses)

    check_course_data(course_data)

    if student_data is not None:
        check_student_data(student_data)

    return student_input, course_data, student_data
```
- function `validate_system_input`: Validates the format of the input data. If the data is not in the correct format, it raises an error; otherwise, it returns the input data. The system's own `course_data` and `student_data` can be used instead (via the `system_course_data` and `system_student_data` flags) when the front-end does not provide them, although at a later stage the front-end should provide both. All rules for the input data are defined in the `rec_sys_uni/error_checker/errors.py` file.

In [14]:
bad_input = {  # Let's make a mistake in the input data.
    "keyword": {  # Instead of "keywords", we wrote "keyword".
        "physics": 1.0,
        "maths": 1.0,
        "statistics": 1.0,
        "ai": 1.0,
        "computer science": 1.0,
        "chem": 1.0,
    },
    "blooms": {'create': 0.0,
               'understand': 0.0,
               'apply': 0.0,
               'analyze': 0.0,
               'evaluate': 0.0,
               'remember': 0.0}
}
try:
    student_node = rs.get_recommendation(bad_input)
except Exception as e:
    print(e)

student_input does not have keywords
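
A check that produces a message like the one above might look roughly like the following sketch. The real rules live in `rec_sys_uni/error_checker/errors.py` and may differ. The function name `check_student_input_sketch`, the use of `ValueError`, and the checks beyond the missing-key test are assumptions made for illustration.

```python
BLOOM_LEVELS = {"create", "understand", "apply", "analyze", "evaluate", "remember"}


def check_student_input_sketch(student_input):
    """Raise ValueError when required keys are missing or values look wrong."""
    for key in ("keywords", "blooms"):
        if key not in student_input:
            raise ValueError(f"student_input does not have {key}")
    for keyword, weight in student_input["keywords"].items():
        # Assumed rule: every keyword weight must be numeric.
        if not isinstance(weight, (int, float)):
            raise ValueError(f"weight of keyword {keyword!r} is not a number")
    # Assumed rule: only the six Bloom levels used in this notebook are allowed.
    unknown = set(student_input["blooms"]) - BLOOM_LEVELS
    if unknown:
        raise ValueError(f"unknown Bloom levels: {sorted(unknown)}")


try:
    check_student_input_sketch({"keyword": {"physics": 1.0}, "blooms": {}})
except ValueError as error:
    message = str(error)
    print(message)
```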


The `get_recommendation` function is the main function of the system. It calls all the components that compute the recommendation and returns the `StudentNode` object. It takes the following parameters:
```python
def get_recommendation(self,
                       student_input,
                       course_data=None,
                       student_data=None,
                       system_course_data=True,
                       system_student_data=False,
                       ):
    """
    function: get_recommendation
    description: get Student Node object
    """

    if self.validate_input:
        student_input, course_data, student_data = self.validate_system_input(student_input,
                                                                              course_data,
                                                                              student_data,
                                                                              system_course_data,
                                                                              system_student_data)

    # Make results template
    results = make_results_template(course_data)

    # Create StudentNode object
    student_info = StudentNode(results, student_input, course_data, student_data)

    # Compute recommendation and store in student_info
    compute_recommendation(self, student_info)

    # Compute warnings and store in student_info
    if self.warning_model:
        compute_warnings(self, student_info)

    # Sort by periods
    sort_by_periods(self, student_info, self.top_n, include_keywords=True, include_score=True,
                    include_blooms=False)

    # Save student_info to database
    now = datetime.now()

    collection = self.db["student_results"]
    input_dict = {
        "student_id": "123",
        "student_input": student_info.student_input,
        "course_data": student_info.course_data,
        "student_data": student_info.student_data,
        "results": student_info.results,
        "time": now.strftime("%d/%m/%Y %H:%M:%S")
    }
    object_db = collection.insert_one(input_dict)
    student_info.set_id(object_db.inserted_id)

    # Return StudentNode object
    return student_info
```
- `student_input`: A dictionary that contains the student's input.
- `course_data`: A dictionary that contains the course data.
- `student_data`: A dictionary that contains the student's data.
- `system_course_data`: Boolean value. If True, the system course data will be used.
- `system_student_data`: Boolean value. If True, the system student data will be used.
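
The record saved by `get_recommendation` stores `time` as a string in day/month/year order. If you need to sort or compare records by time, the string can be parsed back into a `datetime` with the same format string. A small sketch follows; the date used here is made up.

```python
from datetime import datetime

TIME_FORMAT = "%d/%m/%Y %H:%M:%S"  # format used when the record is saved

# Hypothetical timestamp, formatted the same way as the stored "time" field.
stored = datetime(2023, 12, 14, 10, 30, 0).strftime(TIME_FORMAT)
parsed = datetime.strptime(stored, TIME_FORMAT)
print(stored, "->", parsed.isoformat())
```

Parsing matters because strings in this format do not sort chronologically: the day comes first, so plain string comparison orders records by day of the month.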


The `generate_explanation` function generates the explanation for a recommendation. It takes the following parameters:
- `student_id`: A string representing the student's ID from MongoDB.
- `course_code`: A string representing the course's code.

The `make_timeline` function generates the timeline for the recommendation. It takes the following parameters:
- `student_id`: A string representing the student's ID from MongoDB.

## Keyword Similarity:

Our Keyword Similarity model measures how similar a student's keywords are to the content of each course description. Various models can be used for this, in particular the pre-trained sentence-transformer models available through the HuggingFace library.

Besides the HuggingFace options, an Intel-based model can also be used. It is optimized to compute embeddings efficiently on Intel server CPUs, so it may offer better speed in environments that already run on Intel hardware.

The choice between these models depends on the requirements of the task, the available computational resources, and the desired balance between accuracy and efficiency. HuggingFace offers a wider range of models and potentially higher accuracy, while an Intel-optimized model may give faster processing and better integration with existing Intel server infrastructure.

Ultimately, the model for computing keyword-course description similarity should be chosen after evaluating the project's needs and the computational environment.

Let's take a look at the `KeywordBased` class initialization:
```python
# Keyword Based
keyword_based = KeywordBased(model_name="all-MiniLM-L12-v2",
                             seed_help=True,
                             domain_adapt=True,
                             zero_adapt=True,
                             domain_type='title',
                             seed_type='title',
                             zero_type='title',
                             score_alg='sum',
                             distance='cos',
                             backend='keyBert',
                             scaler=True,
                             sent_splitter=False,
                             precomputed_course=True)
```

- `model_name`: name of the sentence-transformer model from the HuggingFace library.
- `seed_help`: apply seed help, which mixes the course description embedding with seed word embeddings.
- `domain_adapt`: apply domain adaptation, which adds an attention layer to the model.
- `zero_adapt`: apply zero-shot adaptation, a more advanced form of seed filtering.
- `seed_type`: type of seed help, either 'title' or 'domains'. We use 'title' for now, since 'domains' did not perform well.
- `domain_type`: type of domain adaptation, either 'title' or 'domains'. We use 'title' for now, since 'domains' did not perform well.
- `zero_type`: type of zero-shot adaptation, either 'title' or 'domains'. We use 'title' for now, since 'domains' did not perform well.
- `adaptive_thr`: adaptive threshold for the zero-shot adaptation.
- `minimal_similarity_zeroshot`: minimal similarity between a candidate and a domain word for the zero-shot adaptation.
- `score_alg`: scoring algorithm, either 'sum' or 'rrf'. We currently use 'sum'; Reciprocal Rank Fusion (RRF) can be used, but only with another implementation of keyword similarity.
- `distance`: distance metric, either 'cos' or 'dot'. Some models require 'dot' instead of 'cos'.
- `backend`: either 'keyBert' or 'Intel'. 'Intel' computes the embeddings on an Intel server CPU, but it only provides advanced models such as BGE, which can be too large for our system.
- `scaler`: apply a min-max scaler to the keyword weights.
- `sent_splitter`: split the course description into sentences. It is intended for computing scores with Linear Programming, but that part is not implemented yet.
- `precomputed_course`: whether to use precomputed course embeddings.
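
The `distance`, `scaler`, and `score_alg='sum'` options can be illustrated with a small, self-contained sketch. It uses toy 2-d vectors instead of real sentence-transformer embeddings, and the helper names are hypothetical. How `KeywordBased` combines these steps internally is not shown in this notebook, so treat the weighted sum below as one plausible reading rather than the actual implementation.

```python
import math


def dot(u, v):
    """Dot product: sensitive to both direction and vector length."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: dot product of the length-normalized vectors."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def min_max_scale(weights):
    """Rescale keyword weights to [0, 1]; equal weights all map to 1.0."""
    low, high = min(weights.values()), max(weights.values())
    if high == low:
        return {k: 1.0 for k in weights}
    return {k: (w - low) / (high - low) for k, w in weights.items()}


def sum_score(similarities, weights):
    """Weighted sum of per-keyword similarities for one course."""
    return sum(similarities[k] * weights[k] for k in similarities)


# Toy 2-d vectors standing in for sentence-transformer embeddings.
course_vec = [1.0, 0.0]
keyword_vecs = {"physics": [1.0, 0.0], "chem": [0.0, 2.0], "ai": [3.0, 3.0]}

sims = {k: cosine(v, course_vec) for k, v in keyword_vecs.items()}
weights = min_max_scale({"physics": 1.0, "chem": 0.5, "ai": 0.75})
score = sum_score(sims, weights)
print(sims, weights, round(score, 4))

# The two metrics can disagree when vectors are not normalized.
print(round(cosine(keyword_vecs["ai"], course_vec), 4))
print(dot(keyword_vecs["ai"], course_vec))
```

Cosine similarity ignores vector length while the dot product does not, which is one reason a model trained with one metric may need the matching `distance` setting.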

If you are planning to use domain adaptation, zero-shot adaptation, or both, it is necessary to precompute the embeddings and the attention layer in advance. You can use the following code to precompute the embeddings and attention layer:

```python
def calculate_zero_shot_adaptation(course_data, model_name, attention,
                                   adaptive_thr: float = 0.15,
                                   minimal_similarity_zeroshot: float = 0.8):
```
- `course_data`: A dictionary or list containing the course codes. Please follow the template that we discussed earlier.
- `model_name`: The name of the sentence transformer model, specifically from the HuggingFace library, for which you want to precompute the embedding for zero-shot adaptation.
- `attention`: The specific words to which you want to apply attention in order to bring the embedding closer to the course domain within the embedding space.

```python
def calculate_few_shot_adaptation(course_data, model_name, attention,
                                  lr=1e-4, epochs=100,  # Training Parameters
                                  start_index=0,
                                  include_description=True,
                                  include_title=False,
                                  include_ilos=False, ):
```
- `course_data`: A dictionary containing course descriptions, Intended Learning Outcomes (ILOs), and the course code. Please follow the template that we discussed earlier.
- `model_name`: The name of the sentence transformer model, specifically from the HuggingFace library, for which you want to precompute the embedding for few-shot adaptation.
- `attention`: The specific words to which you want to apply attention in order to bring the embedding closer to the course domain within the embedding space.
- `lr`: Learning rate.
- `epochs`: Number of epochs.

```python
def calculate_precomputed_courses(course_data, model_name,
                                  include_description=True,
                                  include_title=False,
                                  include_ilos=False, ):
```
- `course_data`: A dictionary containing course descriptions, Intended Learning Outcomes (ILOs), and the course code. Please follow the template that we discussed earlier.
- `model_name`: The name of the sentence transformer model, specifically from the HuggingFace library, for which you want to precompute the embedding for the course.
- `include_description`: Boolean value. If True, the course description will be included in the embedding.
- `include_title`: Boolean value. If True, the course title will be included in the embedding.
- `include_ilos`: Boolean value. If True, the course ILOs will be included in the embedding.
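
The three `include_*` flags decide which course fields end up in the text that is embedded. A minimal sketch of that selection is shown below. The helper `build_course_text`, the field order, and the newline separator are assumptions for illustration; the project's own function may combine the fields differently.

```python
def build_course_text(course, include_description=True,
                      include_title=False, include_ilos=False):
    """Concatenate the selected course fields into one string to embed."""
    parts = []
    if include_title:
        parts.append(course["course_name"])
    if include_description:
        parts.append(course["description"])
    if include_ilos:
        parts.extend(course["ilos"])
    return "\n".join(parts)


# Course entry in the course_data format described earlier.
course = {
    "course_name": "Philosophy of Science",
    "description": "This course is about ...",
    "ilos": ["Be able to apply the scientific method to a given problem"],
}
text = build_course_text(course, include_title=True, include_ilos=True)
print(text)
```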


In [1]:
from rec_sys_uni.datasets.datasets import calculate_precomputed_courses, calculate_few_shot_adaptation, calculate_zero_shot_adaptation, get_course_data
%load_ext autoreload
%autoreload 2

In [3]:
course_data = get_course_data()
course_data.keys()

dict_keys(['COR1002', 'COR1003', 'COR1004', 'COR1006', 'HUM1003', 'HUM1007', 'HUM1010', 'HUM1011', 'HUM1012', 'HUM1013', 'HUM1016', 'HUM2003', 'HUM2005', 'HUM2007', 'HUM2008', 'HUM2013', 'HUM2016', 'HUM2018', 'HUM2021', 'HUM2022', 'HUM2030', 'HUM2031', 'HUM2046', 'HUM2047', 'HUM2051', 'HUM2054', 'HUM2056', 'HUM2057', 'HUM2058', 'HUM2059', 'HUM2060', 'HUM3014', 'HUM3019', 'HUM3029', 'HUM3034', 'HUM3036', 'HUM3040', 'HUM3042', 'HUM3043', 'HUM3044', 'HUM3045', 'HUM3049', 'HUM3050', 'HUM3051', 'HUM3052', 'HUM3053', 'SCI1004', 'SCI1005', 'SCI1009', 'SCI1010', 'SCI1016', 'SCI2002', 'SCI2009', 'SCI2010', 'SCI2011', 'SCI2017', 'SCI2018', 'SCI2019', 'SCI2022', 'SCI2031', 'SCI2033', 'SCI2034', 'SCI2035', 'SCI2036', 'SCI2037', 'SCI2039', 'SCI2040', 'SCI2041', 'SCI2042', 'SCI2043', 'SCI2044', 'SCI3003', 'SCI3005', 'SCI3006', 'SCI3007', 'SCI3046', 'SCI3049', 'SCI3050', 'SCI3051', 'SCI3052', 'SSC1005', 'SSC1007', 'SSC1025', 'SSC1027', 'SSC1029', 'SSC1030', 'SSC2002', 'SSC2004', 'SSC2006', 'SSC2007',

In [4]:
course_data['COR1002'].keys()

dict_keys(['course_name', 'period', 'level', 'prerequisites', 'description', 'ilos'])

In [6]:
calculate_precomputed_courses(course_data, "all-MiniLM-L12-v2"
                              , include_description=True
                              , include_title=True
                              , include_ilos=True
                              , use_cuda=True)

Calculate precomputed courses for the model: BAAI/bge-large-en-v1.5
System will calculate this list of courses: ['COR1002', 'COR1003', 'COR1004', 'COR1006', 'HUM1003', 'HUM1007', 'HUM1010', 'HUM1011', 'HUM1012', 'HUM1013', 'HUM1016', 'HUM2003', 'HUM2005', 'HUM2007', 'HUM2008', 'HUM2013', 'HUM2016', 'HUM2018', 'HUM2021', 'HUM2022', 'HUM2030', 'HUM2031', 'HUM2046', 'HUM2047', 'HUM2051', 'HUM2054', 'HUM2056', 'HUM2057', 'HUM2058', 'HUM2059', 'HUM2060', 'HUM3014', 'HUM3019', 'HUM3029', 'HUM3034', 'HUM3036', 'HUM3040', 'HUM3042', 'HUM3043', 'HUM3044', 'HUM3045', 'HUM3049', 'HUM3050', 'HUM3051', 'HUM3052', 'HUM3053', 'SCI1004', 'SCI1005', 'SCI1009', 'SCI1010', 'SCI1016', 'SCI2002', 'SCI2009', 'SCI2010', 'SCI2011', 'SCI2017', 'SCI2018', 'SCI2019', 'SCI2022', 'SCI2031', 'SCI2033', 'SCI2034', 'SCI2035', 'SCI2036', 'SCI2037', 'SCI2039', 'SCI2040', 'SCI2041', 'SCI2042', 'SCI2043', 'SCI2044', 'SCI3003', 'SCI3005', 'SCI3006', 'SCI3007', 'SCI3046', 'SCI3049', 'SCI3050', 'SCI3051', 'SCI3052', 'SSC100

  0%|          | 0/149 [00:00<?, ?it/s]

In [7]:
len(course_data)

149

In [None]:
attention = {}
for i in course_data.keys():
    attention[i] = course_data[i]['course_name']
calculate_few_shot_adaptation(course_data, "all-MiniLM-L12-v2", attention=attention, use_cuda=True)

In [9]:
calculate_zero_shot_adaptation(course_data, "all-MiniLM-L12-v2", attention=attention)

Calculate Zero-Shot Adaptation based on course title for the model: all-MiniLM-L12-v2
System will calculate this list of courses: ['COR1002', 'COR1003', 'COR1004', 'COR1006', 'HUM1003', 'HUM1007', 'HUM1010', 'HUM1011', 'HUM1012', 'HUM1013', 'HUM1016', 'HUM2003', 'HUM2005', 'HUM2007', 'HUM2008', 'HUM2013', 'HUM2016', 'HUM2018', 'HUM2021', 'HUM2022', 'HUM2030', 'HUM2031', 'HUM2046', 'HUM2047', 'HUM2051', 'HUM2054', 'HUM2056', 'HUM2057', 'HUM2058', 'HUM2059', 'HUM2060', 'HUM3014', 'HUM3019', 'HUM3029', 'HUM3034', 'HUM3036', 'HUM3040', 'HUM3042', 'HUM3043', 'HUM3044', 'HUM3045', 'HUM3049', 'HUM3050', 'HUM3051', 'HUM3052', 'HUM3053', 'SCI1004', 'SCI1005', 'SCI1009', 'SCI1010', 'SCI1016', 'SCI2002', 'SCI2009', 'SCI2010', 'SCI2011', 'SCI2017', 'SCI2018', 'SCI2019', 'SCI2022', 'SCI2031', 'SCI2033', 'SCI2034', 'SCI2035', 'SCI2036', 'SCI2037', 'SCI2039', 'SCI2040', 'SCI2041', 'SCI2042', 'SCI2043', 'SCI2044', 'SCI3003', 'SCI3005', 'SCI3006', 'SCI3007', 'SCI3046', 'SCI3049', 'SCI3050', 'SCI3051', 

  0%|          | 0/149 [00:00<?, ?it/s]

You can find additional information about attention-based keyword similarity [here](https://arxiv.org/pdf/2211.07499.pdf). 

If you follow all the steps outlined above, you will be able to use the new model to compute the similarity between the student's keywords and the course descriptions. Note that all embeddings and layers will be downloaded automatically; just remember to change the model name when initializing `KeywordBased`.

## Bloom's Taxonomy:

In [None]:
# TODO

## Explanation with Generative LLM:

Currently, we use OpenAI models to explain keywords and to summarize course descriptions. This is done with a prompt template and few-shot examples, implemented in our `rec_sys_uni/rec_systems/llm_explanation/LLM.py` script. One significant advantage of OpenAI is that it can generate output in JSON format, which makes the results easier to integrate and process.

However, we are considering developing and deploying our own model for generating explanations. The main motivation is greater customization and accuracy: by fine-tuning a model on our dataset, we aim to surpass the performance we currently get with OpenAI. Our own model would allow more tailored and precise explanations that fit the characteristics and requirements of our data.

Developing our own model could also give us more control over its behavior, the ability to keep improving it based on incoming data and feedback, and potentially lower long-term costs than a third-party service. The initial development and training might require significant resources, but the long-term benefits could be substantial, especially a more personalized and relevant user experience.

In [17]:
# To get the explanation for the course, we need to provide the student id and course code.
explanation = rs.generate_explanation(student_node.id, "SCI2039")

In [18]:
explanation.keys()

dict_keys(['course_code', 'course_title', 'summary_description', 'keywords_explanation'])
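
The prompt-and-parse flow described in this section can be sketched without calling any API. The code below fills a prompt template with few-shot examples and checks that a JSON reply contains the four keys shown above. The template text, the helper names `build_prompt` and `parse_explanation`, and the sample reply are all hypothetical; the actual prompt and parsing logic live in `rec_sys_uni/rec_systems/llm_explanation/LLM.py`.

```python
import json

EXPECTED_KEYS = {"course_code", "course_title",
                 "summary_description", "keywords_explanation"}


def build_prompt(template, examples, course_description, keywords):
    """Fill a prompt template with few-shot examples and the current request."""
    shots = "\n\n".join(f"Input: {ex['input']}\nOutput: {ex['output']}"
                        for ex in examples)
    return template.format(examples=shots, description=course_description,
                           keywords=", ".join(keywords))


def parse_explanation(raw_response):
    """Parse the model's JSON reply and check that the expected keys exist."""
    data = json.loads(raw_response)
    missing = EXPECTED_KEYS - set(data)
    if missing:
        raise ValueError(f"explanation is missing keys: {sorted(missing)}")
    return data


# Hypothetical template and few-shot example.
template = ("Explain keyword relevance as JSON.\n{examples}\n\n"
            "Input: {description} | keywords: {keywords}\nOutput:")
prompt = build_prompt(template, [{"input": "...", "output": "{}"}],
                      "This course is about ...", ["physics", "ai"])

# Hypothetical reply standing in for the model's output.
reply = json.dumps({"course_code": "SCI2039", "course_title": "Computer Science",
                    "summary_description": "...", "keywords_explanation": {}})
parsed_explanation = parse_explanation(reply)
print(sorted(parsed_explanation))
```

Validating the keys right after parsing means a malformed reply fails early instead of surfacing later in the front-end.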

# Course Overview

## SCI2039: Computer Science

### Summary Description
Computer Science is a comprehensive course that provides an overview of the discipline, covering a wide range of topics including algorithmic foundations of informatics, hardware issues, such as number systems and computer architectures, and software issues, such as operating systems, programming languages, compilers, networks, the Internet, and artificial intelligence. The course also includes practical lab sessions to investigate the concepts introduced. By the end of the course, students are expected to have developed experience in applying techniques from informatics, computer science, and programming for their own research and educational purposes.

### Relevance to Other Disciplines

- **Physics**
  - While there is a moderate similarity between physics and the course description (keyword similarity: 0.9), the course does not directly cover physics topics. However, the analytical and problem-solving skills developed in physics can be beneficial in computer science.

- **Maths**
  - Mathematics is highly relevant to this course (keyword similarity: 0.67), as it forms the foundation of many computer science concepts, such as algorithms, logic, and discrete mathematics.

- **Statistics**
  - Statistics has some relevance to the course (keyword similarity: 0.28), particularly in the context of data analysis and interpretation, which are essential in computer science.

- **AI**
  - Artificial intelligence is directly relevant to the course (keyword similarity: 0.51), as it is explicitly mentioned in the course description. The course covers artificial intelligence as one of the key topics.

- **Computer Science**
  - The course is directly aligned with computer science (keyword similarity: 1.0), as it is the main subject of the course title. The course covers a broad range of computer science topics, providing a comprehensive introduction to the field.

- **Chem**
  - While there is a moderate similarity between chemistry and the course description (keyword similarity: 0.57), the course does not directly cover chemistry topics. However, the analytical and logical thinking skills developed in chemistry can be beneficial in computer science.


## Planner (Integer Linear Programming):

In [None]:
# TODO

## Warning System:

At present, our warning system is facing challenges in achieving accurate predictions due to imbalanced student data. This imbalance adversely affects the system's ability to correctly predict instances of academic failure. To address this issue, a more refined approach is being considered. We propose to define common features across courses, focusing on key skills such as physics, mathematics, law, etc. By identifying and utilizing these skill-based features, we aim to improve the prediction accuracy of potential academic failures. Furthermore, this approach will enable us to provide students with tailored recommendations on which courses to undertake in order to develop the relevant skills needed to succeed in courses they are predicted to struggle with.
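The effect of imbalance can be illustrated with a small sketch. The numbers are made up (95 passes and 5 fails, not our real data), and the helper names are hypothetical. A model that always predicts "pass" reaches high accuracy while detecting none of the failures, which is why accuracy alone is a misleading measure for the warning system.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true `positive` cases that the model actually found."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        return 0.0
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)

# Hypothetical imbalanced labels: 0 = pass, 1 = fail.
y_true = [0] * 95 + [1] * 5
# A trivial baseline that always predicts "pass".
y_pred = [0] * 100

baseline_accuracy = accuracy(y_true, y_pred)   # high, because passes dominate
baseline_fail_recall = recall(y_true, y_pred)  # zero, because no fail is ever predicted
print(baseline_accuracy, baseline_fail_recall)
```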

Previously, we experimented with using embeddings, assigning weights based on students' grades. However, this method resulted in overly generalized embeddings, leading to indistinct differentiation between them. 

We currently use a Random Forest Classifier to predict academic failures. Its feature importance analysis lets us determine which factors contribute most to a predicted fail or pass. In addition, we use a Sentence Transformer to compare candidate courses with the course the student is predicted to struggle in, and we attach the resulting matching score to each warning recommendation.
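The matching step can be sketched as a cosine-similarity ranking. This is only an illustration and not the code in `warning_model.py`: the three-dimensional vectors are hand-made stand-ins for Sentence Transformer embeddings, and the course codes and the function `rank_preparatory_courses` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_preparatory_courses(at_risk_embedding, candidate_embeddings, top_k=2):
    """Return the top_k candidate courses most similar to the at-risk course."""
    scored = [
        (code, cosine_similarity(at_risk_embedding, embedding))
        for code, embedding in candidate_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embedding of the course the student is predicted to struggle in.
at_risk = [0.9, 0.1, 0.0]
# Toy embeddings of candidate courses (hypothetical codes).
candidates = {
    "COURSE_A": [0.8, 0.2, 0.0],
    "COURSE_B": [0.0, 0.1, 0.9],
    "COURSE_C": [0.5, 0.5, 0.0],
}

top_matches = rank_preparatory_courses(at_risk, candidates)
for code, score in top_matches:
    print(f"{code}: matching score {score:.2f}")
```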

Warning model implementation and usage: you can find the warning model in the `rec_sys_uni/rec_systems/warning_model/warning_model.py` script and `notebooks/Warning_Model.ipynb` notebook. 

## Collaborative Filtering:

We implemented a matrix factorization method to generate course recommendations, utilizing student data that includes academic grades. However, this system is not yet fully integrated, as it requires access to real student IDs from Maastricht University. Additionally, the model necessitates periodic updates to maintain its relevance and accuracy.

A significant challenge we face with this approach is the limited number of courses taken by individual students. This situation leads to a sparse matrix, which complicates the recommendation process. Sparse data can significantly hinder the effectiveness of matrix factorization techniques, as these methods rely on a denser dataset to make accurate predictions and recommendations.
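The sketch below shows matrix factorization trained by stochastic gradient descent on observed grades only. It is not the code from our notebook: the toy grades, the hyperparameters, and the function names are assumptions chosen for illustration. It also prints the matrix density, which makes the sparsity problem visible.

```python
import random

def predict(P, Q, student, course):
    """Predicted grade = dot product of student and course factor vectors."""
    return sum(p * q for p, q in zip(P[student], Q[course]))

def rmse(ratings, P, Q):
    """Root mean squared error over the observed (student, course, grade) entries."""
    squared = sum((g - predict(P, Q, s, c)) ** 2 for s, c, g in ratings)
    return (squared / len(ratings)) ** 0.5

def factorize(ratings, n_students, n_courses, n_factors=2,
              lr=0.01, reg=0.05, epochs=200, seed=0):
    """Learn factor matrices P (students) and Q (courses) with regularized SGD."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_students)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_courses)]
    for _ in range(epochs):
        for s, c, grade in ratings:  # only observed entries are visited
            err = grade - predict(P, Q, s, c)
            for k in range(n_factors):
                p_k, q_k = P[s][k], Q[c][k]
                P[s][k] += lr * (err * q_k - reg * p_k)
                Q[c][k] += lr * (err * p_k - reg * q_k)
    return P, Q

# Hypothetical toy data: (student index, course index, grade on a 0-10 scale).
N_STUDENTS, N_COURSES = 4, 5
RATINGS = [
    (0, 0, 8.0), (0, 1, 7.0), (1, 0, 6.5), (1, 2, 9.0),
    (2, 1, 7.5), (2, 3, 6.0), (3, 2, 8.5), (3, 4, 7.0),
]

density = len(RATINGS) / (N_STUDENTS * N_COURSES)
P0, Q0 = factorize(RATINGS, N_STUDENTS, N_COURSES, epochs=0)  # untrained factors
P, Q = factorize(RATINGS, N_STUDENTS, N_COURSES)

print(f"matrix density: {density:.2f}")
print(f"training RMSE before: {rmse(RATINGS, P0, Q0):.2f}, after: {rmse(RATINGS, P, Q):.2f}")
print(f"predicted grade for student 0 in unseen course 2: {predict(P, Q, 0, 2):.2f}")
```

With so few observed grades per student, the factors for an unseen student-course pair are constrained by very little evidence, so such predictions should be read with caution.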

To address this issue, we are exploring alternative methods. One promising approach is the use of a Knowledge Graph. This method can effectively leverage the available student data to provide course recommendations. A Knowledge Graph, by its very nature, is adept at handling sparse data. It creates a network of entities (such as students, courses, skills) and their interrelations, allowing for a more nuanced understanding of student profiles and course attributes. This enhanced understanding can lead to more accurate and personalized course recommendations, even when dealing with limited student course histories.
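As a rough sketch of the idea, a Knowledge Graph can be stored as (head, relation, tail) triples and queried for courses that share skills with a student's history. The entities, the relation names (`took`, `teaches`, `interested_in`), and the scoring rule below are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical triples linking students, courses, and skills.
TRIPLES = [
    ("student:1", "took", "course:A"),
    ("student:1", "interested_in", "skill:statistics"),
    ("course:A", "teaches", "skill:programming"),
    ("course:B", "teaches", "skill:programming"),
    ("course:B", "teaches", "skill:statistics"),
    ("course:C", "teaches", "skill:law"),
]

def build_graph(triples):
    """Index the triples as (head, relation) -> set of tails."""
    graph = defaultdict(set)
    for head, relation, tail in triples:
        graph[(head, relation)].add(tail)
    return graph

def recommend(graph, student, all_courses):
    """Score untaken courses by how many of the student's skills they teach."""
    taken = graph[(student, "took")]
    skills = set(graph[(student, "interested_in")])
    for course in taken:
        skills |= graph[(course, "teaches")]
    scores = {}
    for course in all_courses:
        if course in taken:
            continue
        overlap = len(graph[(course, "teaches")] & skills)
        if overlap:
            scores[course] = overlap
    # Highest overlap first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

graph = build_graph(TRIPLES)
recommendations = recommend(graph, "student:1", ["course:A", "course:B", "course:C"])
print(recommendations)
```

Even though this student has taken a single course, the skill links still connect them to other courses, which is the property that helps with sparse course histories.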

By incorporating a Knowledge Graph, we aim to overcome the limitations posed by sparse data and improve the overall effectiveness of our recommendation system. This approach not only promises to deliver better-tailored course suggestions to students but also has the potential to evolve and scale as more data becomes available or as the educational offerings at Maastricht University expand and change.

Collaborative Filtering implementation: you can find the collaborative filtering model in the `notebooks/Collaborative_Filtering.ipynb` notebook.

## Knowledge Graph: