<h1 align="center">LOGISTIC REGRESSION ON LARGE DATASET - SIDES </h1>


<h3 style="color:#ff6347">Data</h3>

<p align="center">
  <b><big>Format of the Data (train_data.csv)</big></b>
</p>

<p align="center">


<style>
table {
  font-size: 12px;
}
</style>

| question | answer_option | n_correct_answer_option | n_answer_option | type | specialty                      | specialty_id | student | result | test   | date     | time | date_time                  |
|----------|---------------|------------------------|-----------------|------|--------------------------------|--------------|---------|--------|--------|----------|------|----------------------------|
| 3141719  | QMA           | 1                      | 5               | DP   | endocrinology_metabolism_nutrition | 7            | 191572  | 1      | 2703967 | 20210216 | 1359 | 2021-02-16T00:13:59Z |
| 3160174  | QMA           | 1                      | 5               | DP   | endocrinology_metabolism_nutrition | 7            | 191572  | 1      | 2703967 | 20210216 | 1359 | 2021-02-16T00:13:59Z |
| 3160183  | QMA           | 2                      | 5               | DP   | endocrinology_metabolism_nutrition | 7            | 191572  | 1      | 2703967 | 20210216 | 1359 | 2021-02-16T00:13:59Z |
| 3160244  | QMA           | 1                      | 5               | DP   | endocrinology_metabolism_nutrition | 7            | 191572  | 0      | 2703967 | 20210216 | 1359 | 2021-02-16T00:13:59Z |
| 3160261  | QMA           | 1                      | 5               | DP   | endocrinology_metabolism_nutrition | 7            | 191572  | 1      | 2703967 | 20210216 | 1359 | 2021-02-16T00:13:59Z |


#### Properties of the Table:

1. The table is organized into rows and columns.
2. Each row represents one student's answer to one question (an interaction), along with its associated data.
3. Each column represents a specific attribute of that interaction (question, student, test, or result).
4. The data is structured in a tabular form for easy readability and comparison.
5. The table includes numeric, text, and date/time data types.
6. The first row is the header row, which provides the names of each column for clarity.

####  Column Descriptions:

| Column Name              | Description                                                                                         |
| ------------------------ | --------------------------------------------------------------------------------------------------- |
| question                 | The question ID.                                                                                    |
| answer_option            | Either QMA or QUA. QMA (question multiple answer) means that multiple answers are possible. QUA (question unique answer) means that only one answer is possible. |
| n_correct_answer_option  | The number of correct answers.                                                                      |
| n_answer_option          | The number of options.                                                                              |
| type                     | The type of question: QI (question isolée), DP (dossier progressif), or LCA (lecture critique d'article). These are SIDES-specific question types. |
| specialty                | The specialty of the question. If the question is tagged with multiple specialties, they are separated by a + sign. |
| specialty_id             | The specialty ID. This is a numeric value that corresponds to the specialty.                       |
| student                  | The student ID.                                                                                     |
| result                   | The student's result on the question: 0 means incorrect, 1 means correct. Raw values do not need to be binary; the code binarizes them during preprocessing. |
| test                     | The test ID.                                                                                        |
| date                     | The date of the test.                                                                               |
| time                     | The time of the test.                                                                               |
| date_time                | The date and time of the test.                                                                      |
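
As a minimal sketch of how a file in this format could be read, the snippet below parses a two-row excerpt of the table above with pandas. The in-memory sample and the names `sample_csv`, `df`, and `specialty_list` are illustrative assumptions; the notebook itself loads the data through `prepare_data_SIDES`.

```python
import io
import pandas as pd

# Two rows copied from the table above, written as CSV text (illustrative sample)
sample_csv = io.StringIO(
    "question,answer_option,n_correct_answer_option,n_answer_option,type,"
    "specialty,specialty_id,student,result,test,date,time,date_time\n"
    "3141719,QMA,1,5,DP,endocrinology_metabolism_nutrition,7,191572,1,"
    "2703967,20210216,1359,2021-02-16T00:13:59Z\n"
    "3160244,QMA,1,5,DP,endocrinology_metabolism_nutrition,7,191572,0,"
    "2703967,20210216,1359,2021-02-16T00:13:59Z\n"
)

# Parse the ISO 8601 column as a datetime; the other columns keep inferred dtypes
df = pd.read_csv(sample_csv, parse_dates=["date_time"])

# Questions tagged with several specialties use "+" as a separator
df["specialty_list"] = df["specialty"].str.split("+")

print(df.dtypes)
print(df[["question", "student", "result", "specialty_list"]])
```

Splitting `specialty` on `+` is only needed if the multi-specialty tags are to be handled individually.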





<h3 style="color:#ff6347">Import Necessary Libraries</h3>

In [1]:
# Standard Library Imports
import os
import time
import datetime
import logging
import argparse
import glob
from collections import defaultdict, Counter

# Third-Party Library Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import sparse
from tqdm import tqdm
from sklearn.preprocessing import OneHotEncoder, normalize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score, log_loss, mean_squared_error

# Project-Specific Imports
import json
import dataio
import utils.this_queue as OurQueue

# Additional Imports
import sys

from scipy.sparse import load_npz, hstack, csr_matrix, find
import joblib

# import prepare_data as pdr
import import_ipynb
import prepare_data_SIDES as sides_pdr

import warnings

importing Jupyter notebook from prepare_data_SIDES.ipynb


<h3 style="color:#ff6347">Parameters</h3>

In [2]:
# main path that includes all the functions and the main code
main_path = 'c:/Users/Ghislaine/Desktop/knowledge_traing_EloRating'

In [3]:
# determine the semester (key into the configurations dictionary)
semester = '2019-2020'

#### Preprocessing Parameters

In [4]:

configurations = {
    '2019-2020' : {
    'folder' : 'data/sides/',
    'education_year' : '2019-2020',
    'min_interactions_per_user' : 100,
    'min_answer_per_question' : 100,
    'kc_column' : 'specialty', 
    'train_file' : 'train_data.csv',
    'test_file' : 'test_data.csv',
    'already_preprocessed':False,
    'user_skill_item_already_encoded':False
    },
    
    '2020-2021' : {
    'folder' : 'data/sides/',
    'education_year' : '2020-2021',
    'min_interactions_per_user' : 100,
    'min_answer_per_question' : 100,
    'kc_column' : 'specialty', 
    'train_file' : 'train_data.csv',
    'test_file' : 'test_data.csv',
    'already_preprocessed':False,
    'user_skill_item_already_encoded':True
    }
}

#### Encoding & Logistic Regression Parameters

In [5]:
class Options():
    def __init__(self):
        
        # Encoding
        self.users = True
        self.user_skill_together= True
        self.items = True
        self.skills = True
        self.file_splits = 1
        self.by_spec = False
        self.spec_difficulty = True
        self.weighted_encoding = True
        
        # Logistic Regression
        self.iter = 300
        self.C = 1.0 #1e-1


In [6]:
options = Options()
# raise an error if user_skill_together is set to True but skills or users are set to False
if options.user_skill_together and (not options.skills or not options.users):
    raise ValueError('user_skill_together is set to True but skills or users are set to False. Please set user_skill_together to False or set skills and users to True.')


In [7]:
# set parameters
folder = configurations[semester]['folder']
education_year = configurations[semester]['education_year']
min_interactions_per_user = configurations[semester]['min_interactions_per_user']
min_answer_per_question = configurations[semester]['min_answer_per_question']
kc_column = configurations[semester]['kc_column']
train_file = configurations[semester]['train_file']
test_file = configurations[semester]['test_file']
already_preprocessed=configurations[semester]['already_preprocessed']
user_skill_item_already_encoded=configurations[semester]['user_skill_item_already_encoded']

#### Save results separately for separate versions of the model & save features

In [8]:
all_features = ['users', 'user_skill_together','spec_difficulty' ,'items', 'skills', 'weighted_encoding']
active_features = [features for features in all_features if vars(options)[features]]

# Check if 'user_skill_together', 'users', and 'skills' are all active features
if 'user_skill_together' in active_features and 'users' in active_features and 'skills' in active_features:
    active_features.remove('users')
    active_features.remove('skills')

additional_suffix = '-'.join(active_features)

EXPERIMENT_FOLDER = folder + '/' + education_year + '/result_logreg_' + additional_suffix + '/'

dataio.prepare_folder(EXPERIMENT_FOLDER)

# Save the configurations and parameters to a TXT file

# Combine configurations and hyper_params into a single dictionary
# Convert Options object to a dictionary
options_dict = options.__dict__
combined_dict = {
    'configurations': configurations[semester],
    'options': options_dict
}

# Write the configurations dictionary to the TXT file
with open(EXPERIMENT_FOLDER + 'features.txt', 'w') as file:
    json.dump(combined_dict, file, indent=4)

print("Features have been written to the file: features.txt in the ", EXPERIMENT_FOLDER)

Features have been written to the file: features.txt in the  data/sides//2019-2020/result_logreg_user_skill_together-spec_difficulty-items-weighted_encoding/


<h2 align="center" style="color:blue">3 STEPS FOR LOGISTIC REGRESSION</h2>


<h3 style="color:#228B22">Step-1: Preprocessing the Data</h3>


Preprocessing includes these major steps:
<small>
1. Removing duplicates
2. Removing Rows with Empty Values for KC (Knowledge Component)
3. Removing Rows with Empty Values for n_correct_options
4. Transforming Non-Binary Scores
5. Removing Users with Insufficient Interactions
6. Removing Items with Insufficient Interactions
7. Creating Variables and Transforming IDs: Variables are created, and user and item IDs are transformed to numeric values.
8. Renaming Questions/Skills: The questions/skills in the "item_skills" data are renamed using the item IDs.
9. Creating Q-matrix: A Q-matrix is created, where each row represents a question and each column represents a skill.
10. Saving the Data: The preprocessed data is saved as a CSV file.

</small>
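
The filtering steps of the list above (1, 2, 5 and 6) can be sketched on toy data as follows. This is a simplified illustration, not the code of `prepare_sides`: the function name `filter_interactions`, the toy values, and the use of the raw column names are assumptions.

```python
import pandas as pd

def filter_interactions(df, min_per_user, min_per_item):
    """Toy version of steps 1, 2, 5 and 6 (hypothetical helper)."""
    df = df.drop_duplicates()                       # step 1: duplicates
    df = df.dropna(subset=["specialty"])            # step 2: empty KC
    # step 5: drop users with too few interactions
    user_counts = df.groupby("student")["question"].transform("size")
    df = df[user_counts >= min_per_user]
    # step 6: drop questions with too few answers (counted after step 5)
    item_counts = df.groupby("question")["student"].transform("size")
    df = df[item_counts >= min_per_item]
    return df.reset_index(drop=True)

toy = pd.DataFrame({
    "student":   [1, 1, 1, 2, 2, 3, 3, 1],
    "question":  [10, 11, 12, 10, 11, 10, 11, 10],
    "specialty": ["a", "a", "b", "a", "a", "a", None, "a"],
    "result":    [1, 0, 1, 1, 0, 1, 0, 1],
})

clean = filter_interactions(toy, min_per_user=2, min_per_item=2)
print(clean)
```

In this toy run, the duplicate row and the row without a specialty are dropped first, then student 3 (one remaining interaction), then question 12 (one remaining answer). The order matters: question counts are computed after the user filter.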


<p align="center">
  <b><big>Preprocessed data</big></b>
</p>

<p align="center">


<style>
table {
  font-size: 12px;
}
</style>

| user_id | item_id | n_options | answer_type | timestamp | correct | kc_id                           | group |
|---------|---------|-----------|-------------|-----------|---------|---------------------------------|-------|
| 2       | 0       | 5         | QMA         | 0         | 1       | endocrinology_metabolism_nutrition | 0     |
| 2       | 1       | 5         | QMA         | 0         | 1       | endocrinology_metabolism_nutrition | 0     |
| 2       | 2       | 5         | QMA         | 0         | 1       | endocrinology_metabolism_nutrition | 0     |
| 2       | 3       | 5         | QMA         | 0         | 0       | endocrinology_metabolism_nutrition | 0     |
| 2       | 4       | 5         | QMA         | 0         | 1       | endocrinology_metabolism_nutrition | 0     |




In [9]:

# if the data is already preprocessed and preprocessed_data.csv exists in folder+'/'+ education_year+"/processed/"
if already_preprocessed and os.path.exists(folder+'/'+ education_year+"/processed/preprocessed_data.csv"):
    # print message
    print("Data already preprocessed. Reading preprocessed data...")
    # read csv file preprocessed_data.csv
    data= pd.read_csv(folder+'/'+ education_year+"/processed/preprocessed_data.csv")
    # read npz file q_mat.npz
    q_mat = sparse.load_npz(folder + '/' + education_year + "/processed/q_mat.npz").toarray()
    # # read json file config.json
    # with open(folder+'/'+ education_year+"/processed/config.json") as f:
    #     config = json.load(f)
        
    
else:
    print("Data not preprocessed. Preprocessing data...")
    warnings.filterwarnings(action='once')
    # process the raw SIDES data
    data, q_mat, listOfKC, dict_of_kc, train_set, test_set,skill_names_ids_map_df = sides_pdr.prepare_sides(folder, education_year, \
                                                                    train_file, \
                                                                    test_file,\
                                                                        kc_column, min_interactions_per_user, min_answer_per_question,\
                                                                        True, True, True,True)
            
    # delete unnecessary dataframes
    del train_set
    del test_set

# only keep the columns that are needed for the logistic regression model
data = data[['user_id', 'item_id', 'timestamp', 'correct', 'group']]
# sort the data by user_id and timestamp
data = data.sort_values(by=["user_id","timestamp"])



Data not preprocessed. Preprocessing data...
Opened SIDES train data. Output: 40618623 samples.
Opened SIDES test data. Output: 10154656 samples.
Removed 0 duplicated samples.
Removed 0 samples with NaN skills.
Removed 0 samples with NA answer_type.
Removed 0 samples with non-binary outcomes.
Removed 13231806 samples 
(users with less than 100 interactions).
Removed 5119382 samples 
(questions with less than 100 answers).
Computed q-matrix. Shape: (97647, 31).
Data preprocessing done. Final output: 32422091 samples.


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [10]:
# number of students
uSize = data['user_id'].nunique()
print("Number of unique students is:", uSize)

qSize = data['item_id'].nunique()
print("Number of unique questions is:", qSize)

tSize = q_mat.shape[1]
print("Number of unique specialty in data: ", tSize )

# initialize the number of attempts per question
attempt_counter_question = np.zeros(qSize)

# keep only the training rows (group == 0)
data = data[data['group']==0]
# number of attempts per question
#attempt_counter_question = data.groupby('item_id').size().values
attempt_counter_question = data.groupby('item_id').size().reset_index(name='count')

# save the number of attempts per question as csv file
attempt_counter_question = pd.DataFrame(attempt_counter_question)
attempt_counter_question.to_csv(folder+'/'+ education_year+"/processed/attempt_counter_question.csv", index=False)

# remove data to save memory
del data

Number of unique students is: 24757
Number of unique questions is: 97647
Number of unique specialty in data:  31


<h3 style="color:#228B22">Step-2: Encoding Sparse Matrix</h3>

Build the sparse feature dataset from the dense dataset and the Q-matrix.

<small>

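Arguments:
- `df`: Dense dataset, output of the preprocessing function
- `Q_mat`: Q-matrix, output of the preprocessing function
- `active_features`: Features used to build the dataset (list of strings), determined in the encoding parameters section
- `config`: Dictionary loaded from `config.json`, providing `n_users` and `n_items`

Output:
- `train_sparse_df`, `test_sparse_df`: Sparse train and test datasets, split on the `group` column

Depending on the encoding parameters, the sparse dataset is the one-hot encoded version of the dense dataset. Its first column holds the `correct` label, followed by one block of columns per active feature.

Example encoding that includes the user-skill, specialty_difficulty, and item features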

</small>
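
To make the column layout concrete, here is a minimal sketch of a one-hot encoding with users, items, and skills on toy data. The sizes and values are hypothetical, and the per-student loop of the notebook is replaced by direct array indexing for brevity.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy dense data: 3 interactions, 2 users, 3 items, 2 skills (hypothetical)
user_id = np.array([0, 0, 1])
item_id = np.array([0, 2, 1])
correct = np.array([1, 0, 1])
q_mat = np.array([[1, 0],   # item 0 -> skill 0
                  [0, 1],   # item 1 -> skill 1
                  [1, 1]])  # item 2 -> skills 0 and 1

# One-hot blocks with a fixed number of categories, as in the notebook
users = OneHotEncoder(categories=[np.arange(2)]).fit_transform(user_id.reshape(-1, 1))
items = OneHotEncoder(categories=[np.arange(3)]).fit_transform(item_id.reshape(-1, 1))
# Skill block: the Q-matrix row of each answered item
skills = sparse.csr_matrix(q_mat[item_id])

# Column layout: [correct | users | items | skills]
X = sparse.hstack(
    [sparse.csr_matrix(correct.reshape(-1, 1)), users, items, skills]
).tocsr()
print(X.toarray())
```

Each row has one non-zero entry in the user block, one in the item block, and as many in the skill block as the item has skills.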





In [11]:
def df_to_sparse(df, Q_mat, active_features, config, skip_sucessive=True):
    
	# Transform q-matrix into dictionary
	dt = time.time()
	dict_q_mat = {i:set() for i in range(Q_mat.shape[0])}
	for elt in np.argwhere(Q_mat == 1):
		dict_q_mat[elt[0]].add(elt[1])

	X={}
	if 'skills' in active_features:
		X["skills"] = sparse.csr_matrix(np.empty((0, Q_mat.shape[1])))

	X['df'] = np.empty((0,5)) # Keep only track of line index + user/item id + correctness

	q = defaultdict(lambda: OurQueue())  # Prepare counters for time windows
	wf_counters = defaultdict(lambda: 0)
	if len(set(active_features).intersection({"skills","attempts","wins","fails"})) > 0:
		for stud_id in tqdm(df["user_id"].unique()):
			df_stud = df[df["user_id"]==stud_id][["user_id", "item_id", "timestamp","correct","group"]].copy()
			df_stud_indices = np.array(df_stud.index).reshape(-1,1)
			#df_stud.sort_values(by="timestamp", inplace=True) # Sort values
			df_stud = np.array(df_stud)
			X['df'] = np.vstack((X['df'], np.hstack((df_stud[:,[0,1,3,4]], df_stud_indices))))

			skills_temp = Q_mat[df_stud[:,1].astype(int)].copy()
			if 'skills' in active_features:
				X['skills'] = sparse.vstack([X["skills"],sparse.csr_matrix(skills_temp)])
				# if options.weighted_encoding == True:
				# 	X["skills"]= X["skills"]/X["skills"].sum(axis=1)
			
	if 'users' in active_features:
		onehot = OneHotEncoder(categories=[np.arange(config["n_users"])])
		if len(set(active_features).intersection({"skills","attempts","wins","fails"})) > 0:
			X['users'] = onehot.fit_transform(X["df"][:,0].reshape(-1,1))
		else:
			X['users'] = onehot.fit_transform(df["user_id"].values.reshape(-1,1))
	if 'items' in active_features:
		onehot = OneHotEncoder(categories=[np.arange(config["n_items"])])
		if len(set(active_features).intersection({"skills","attempts","wins","fails"})) > 0:
			X['items'] = onehot.fit_transform(X["df"][:,1].reshape(-1,1))
		else:
			X['items'] = onehot.fit_transform(df["item_id"].values.reshape(-1,1))

	if len(set(active_features).intersection({"skills","attempts","wins","fails"})) > 0:
		sparse_df = sparse.hstack([sparse.csr_matrix(X['df'][:,-3].reshape(-1,1)),
			sparse.hstack([X[agent] for agent in active_features]),sparse.csr_matrix(X['df'][:,-2].reshape(-1,1))]).tocsr()
		sparse_df = sparse_df[np.argsort(X["df"][:,-1])] # sort matrix by original index
	else:
		sparse_df = sparse.hstack([sparse.csr_matrix(df["correct"].values.reshape(-1,1)),
			sparse.hstack([X[agent] for agent in active_features]),sparse.csr_matrix(df["group"].values.reshape(-1,1))]).tocsr()
		# No need to sort sparse matrix here

	# Split into train and test sparse matrices
	train_indices = np.nonzero(sparse_df[:, -1] == 0)[0]
	test_indices = np.nonzero(sparse_df[:, -1] == 1)[0]
	# Extract train_sparse_df and test_sparse_df directly from the sparse matrix and remove the last column (group)
	train_sparse_df = sparse_df[train_indices, :-1]
	test_sparse_df = sparse_df[test_indices, :-1]

	return train_sparse_df, test_sparse_df
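
The train/test split at the end of `df_to_sparse` relies on the last column holding the group flag (0 for train, 1 for test). The sketch below shows the same split on a toy matrix. It densifies only the group column before comparing, which is an assumed variation: comparing a whole sparse matrix with 0 is something SciPy usually flags as inefficient.

```python
import numpy as np
from scipy import sparse

# Toy matrix: the last column is the group flag (0 = train, 1 = test)
M = sparse.csr_matrix(np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
]))

# Densify the single group column, then select rows by index
group = M[:, -1].toarray().ravel()
train = M[np.flatnonzero(group == 0), :-1]
test = M[np.flatnonzero(group == 1), :-1]

print(train.toarray())
print(test.toarray())
```

Both outputs drop the group column, so the remaining layout is the label followed by the feature blocks.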

In [12]:
os.chdir(main_path)

In [13]:
if user_skill_item_already_encoded==True:
	# load the previously encoded train/test sparse matrices;
	# the file names carry features_suffix (e.g. 'uis' for users, items, skills)
	all_features = ['users', 'items', 'skills']
	active_features = [features for features in all_features if vars(options)[features]]
	features_suffix = ''.join([features[0] for features in active_features])
	# go to folder+'/'+ education_year+"/processed" and load the sparse matrices from there
	os.chdir(folder+'/'+ education_year+"/processed")
	train_sparse_df = sparse.load_npz('train_sparse_df-{:s}.npz'.format(features_suffix))
	test_sparse_df = sparse.load_npz('test_sparse_df-{:s}.npz'.format(features_suffix))

	with open("config.json") as f_in:
		dico = json.load(f_in)

else:
	dt = time.time()
	os.chdir(folder+'/'+ education_year+'/processed')
	all_features = ['users', 'items', 'skills']
	#options = Options()
	active_features = [features for features in all_features if vars(options)[features]]
	features_suffix = ''.join([features[0] for features in active_features])

	if options.by_spec == True:
		# run for each specialty
		qmat = sparse.load_npz('q_mat.npz').toarray()
		with open("config.json") as f_in:
			dico = json.load(f_in)
		# find the list of files that contain by_spec_preprocessed in their name	
		files = [file for file in os.listdir() if 'by_spec_preprocessed' in file]
		# for file that contains by_spec_preprocessed in its name
		for file in files:
			df = pd.read_csv(file)
			print('Loading data:', df.shape[0], 'samples in ', time.time() - dt, "seconds")
			# df_to_sparse returns (train, test); keep the training matrix for this specialty
			X, _ = df_to_sparse(df, qmat, active_features, dico)
			# save the sparse matrix by specialty name and features_suffix
			name = file.split('by_spec_preprocessed_')[1].split('.csv')[0]
			sparse.save_npz('X-{:s}-{:s}.npz'.format(features_suffix, name), X)

	elif options.by_spec == False:
		if options.file_splits == 1:
			df = pd.read_csv('preprocessed_data.csv')
			# only keep the columns that are needed for the logistic regression model
			df = df[['user_id', 'item_id', 'timestamp', 'correct', 'group']]
			# sort the data by user_id and timestamp
			df = df.sort_values(by=["user_id", "timestamp"]).reset_index(drop=True)
			qmat = sparse.load_npz('q_mat.npz').toarray()
			with open("config.json") as f_in:
				dico = json.load(f_in)
			print('Loading data:', df.shape[0], 'samples in ', time.time() - dt, "seconds")
			train_sparse_df, test_sparse_df = df_to_sparse(df, qmat, active_features, dico)
			sparse.save_npz('train_sparse_df-{:s}.npz'.format(features_suffix), train_sparse_df)
			sparse.save_npz('test_sparse_df-{:s}.npz'.format(features_suffix), test_sparse_df)
			# sparse.save_npz('train_sparse_df.npz'.format(features_suffix), train_sparse_df)
			# sparse.save_npz('test_sparse_df.npz'.format(features_suffix), test_sparse_df)
		elif options.file_splits > 1:
			df = pd.read_csv('preprocessed_data.csv')
			# only keep the columns that are needed for the logistic regression model
			df = df[['user_id', 'item_id', 'timestamp', 'correct', 'group']]
			# sort the data by user_id and timestamp
			df = df.sort_values(by=["user_id", "timestamp"]).reset_index(drop=True)
			qmat = sparse.load_npz('q_mat.npz').toarray()
			with open("config.json") as f_in:
				dico = json.load(f_in)
			print('Loading data:', df.shape[0], 'samples in ', time.time() - dt, "seconds")
			list_of_user_ids = np.array_split(np.arange(dico["n_users"]),options.file_splits)
			# remove old sparse dataframes in the folder
			for file in glob.glob("train_sparse_df-{:s}_[0-9].npz".format(features_suffix)):
				os.remove(file)
			for file in glob.glob("test_sparse_df-{:s}_[0-9].npz".format(features_suffix)):
				os.remove(file)
			# for each split, save the train and test sparse dataframes
			for i, arr in enumerate(list_of_user_ids):
				df = pd.read_csv('preprocessed_data.csv')
				# only keep the columns that are needed for the logistic regression model
				df = df[['user_id', 'item_id', 'timestamp', 'correct', 'group']]
				# sort the data by user_id and timestamp
				df.sort_values(by=["user_id","timestamp"], inplace=True)
				df = df[df["user_id"].isin(arr)]
				train_sparse_df, test_sparse_df  = df_to_sparse(df, qmat, active_features, dico)
				sparse.save_npz('train_sparse_df-{:s}_{}.npz'.format(features_suffix,i), train_sparse_df)
				sparse.save_npz('test_sparse_df-{:s}_{}.npz'.format(features_suffix,i), test_sparse_df)
				# sparse.save_npz('train_sparse_df.npz'.format(features_suffix), train_sparse_df)
				# sparse.save_npz('test_sparse_df.npz'.format(features_suffix), test_sparse_df)
			# TODO : remove old sparse dataframes
			train_sparse_df = sparse.vstack([sparse.load_npz(sparse_file) for sparse_file in sorted(glob.glob("train_sparse_df-{:s}_[0-9].npz".format(features_suffix)))])
			sparse.save_npz('train_sparse_df-{:s}.npz'.format(features_suffix), train_sparse_df)
			test_sparse_df= sparse.vstack([sparse.load_npz(sparse_file) for sparse_file in sorted(glob.glob("test_sparse_df-{:s}_[0-9].npz".format(features_suffix)))])
			sparse.save_npz('test_sparse_df-{:s}.npz'.format(features_suffix), test_sparse_df)
		else:
			print("Please select file_splits >= 1.")
	else:
		print("Please select by_spec = True or False.")

	# reload the saved train/test sparse matrices; the file names carry features_suffix (e.g. 'uis')
	#os.chdir(folder+'/'+ education_year+"/processed")
	all_features = ['users', 'items', 'skills']
	active_features = [features for features in all_features if vars(options)[features]]
	features_suffix = ''.join([features[0] for features in active_features])
	train_sparse_df = sparse.load_npz('train_sparse_df-{:s}.npz'.format(features_suffix))
	test_sparse_df = sparse.load_npz('test_sparse_df-{:s}.npz'.format(features_suffix))

	with open("config.json") as f_in:
		dico = json.load(f_in)


Loading data: 32422091 samples in  33.51363658905029 seconds


100%|██████████| 24757/24757 [2:50:53<00:00,  2.41it/s]  
  train_sparse_df, test_sparse_df = df_to_sparse(df, qmat, active_features, dico)


In [14]:
os.chdir(main_path)

In [15]:
if options.user_skill_together == True:
    
    # process the train and test sparse matrices separately
    for X in [train_sparse_df, test_sparse_df]:

        # find the columns 1:n_users of X
        users_sparse = X[:, 1:dico["n_users"]+1]
        indices_user = users_sparse.nonzero()
        nonzero_indices_user = list(zip(indices_user[0], indices_user[1]))

        # find the last n_skills columns of X
        skills_sparse = X[:, -dico["n_skills"]:]
        indices_skills = skills_sparse.nonzero()
        nonzero_indices_skills = list(zip(indices_skills[0], indices_skills[1]))

        # find the columns in between users_sparse and skills_sparse
        questions_sparse = X[:, dico["n_users"]+1:-dico["n_skills"]]

        # find the first column of X
        correct_sparse = X[:, 0]

        # Get the number of columns in the sparse matrices
        num_users = users_sparse.shape[1]
        num_skills = skills_sparse.shape[1]
        # Create the COO matrix
        row_indices = [row for row, _ in nonzero_indices_skills]
        col_indices = [nonzero_indices_user[row][1] * num_skills + col for row, col in nonzero_indices_skills]
        user_skill_coo = sparse.coo_matrix((np.ones(len(row_indices)), (row_indices, col_indices)),
                                            shape=(users_sparse.shape[0], num_users * num_skills),
                                            dtype=np.float64)

        user_skill_sparse = user_skill_coo.tocsr()

        if options.weighted_encoding == True:
            # user_skill_sparse= user_skill_sparse/user_skill_sparse.sum(axis=1)
            # skills_sparse= skills_sparse/skills_sparse.sum(axis=1)
            user_skill_sparse= normalize(user_skill_sparse, norm='l1')
            skills_sparse= normalize(skills_sparse, norm='l1')
            
        if options.spec_difficulty==True:
            X_sparse = sparse.hstack([correct_sparse, user_skill_sparse, skills_sparse, questions_sparse])
        else:
            X_sparse = sparse.hstack([correct_sparse, user_skill_sparse, questions_sparse])
            
        # first iteration: X is the train matrix
        if X is train_sparse_df:
            #  save X_sparse as train_sparse_ready.npz to EXPERIMENT_FOLDER
            sparse.save_npz(os.path.join(EXPERIMENT_FOLDER,'train_sparse_ready.npz'), X_sparse)
        # if second iteration
        if X is test_sparse_df:
            # save X_sparse as test_sparse_ready.npz to EXPERIMENT_FOLDER
            sparse.save_npz(os.path.join(EXPERIMENT_FOLDER,'test_sparse_ready.npz'), X_sparse)
            
        
            
else:
    # save train_sparse
    sparse.save_npz(os.path.join(EXPERIMENT_FOLDER,'train_sparse_ready.npz'), train_sparse_df)
    # save test_sparse
    sparse.save_npz(os.path.join(EXPERIMENT_FOLDER,'test_sparse_ready.npz'), test_sparse_df)
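
The user-skill interaction built in the cell above maps each (user, skill) pair to the column `user * num_skills + skill`. A minimal sketch on toy data, using array indexing instead of lists of tuples (an assumed simplification) and hypothetical sizes:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

n_users, n_skills = 2, 3                      # toy sizes (hypothetical)
user_of_row = np.array([0, 1, 1])             # one user per interaction
skills = sparse.csr_matrix(np.array([[1, 0, 1],
                                     [0, 1, 0],
                                     [1, 1, 0]]))

rows, cols = skills.nonzero()
# Column index of the (user, skill) pair: user * n_skills + skill
pair_cols = user_of_row[rows] * n_skills + cols
user_skill = sparse.coo_matrix(
    (np.ones(len(rows)), (rows, pair_cols)),
    shape=(skills.shape[0], n_users * n_skills),
).tocsr()

# Weighted encoding: L1-normalize so that each row sums to 1
user_skill = normalize(user_skill, norm="l1")
print(user_skill.toarray())
```

With the weighted encoding, a question tagged with two skills contributes 0.5 to each of the two user-skill columns instead of 1 to both.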

<h3 style="color:#228B22">Step-3: Run Logistic Regression</h3>
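
As a rough sketch of this step, the snippet below fits a logistic regression on a small random sparse matrix laid out like the saved ones (label in the first column, feature blocks after it) and then slices the coefficient vector by position. The toy sizes, the random data, and the two-block layout are assumptions made for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_rows, n_users, n_items = 200, 4, 5          # toy sizes (hypothetical)

users = rng.integers(0, n_users, n_rows)
items = rng.integers(0, n_items, n_rows)
y = rng.integers(0, 2, n_rows)

rows = np.arange(n_rows)
user_block = sparse.csr_matrix((np.ones(n_rows), (rows, users)), shape=(n_rows, n_users))
item_block = sparse.csr_matrix((np.ones(n_rows), (rows, items)), shape=(n_rows, n_items))

# Same layout as the saved matrices: label first, then the feature blocks
X = sparse.hstack([sparse.csr_matrix(y.reshape(-1, 1)), user_block, item_block]).tocsr()

target = X[:, 0].toarray().ravel().astype(int)
lr = LogisticRegression(solver="saga", max_iter=300, C=1.0).fit(X[:, 1:], target)

# Coefficients are read back by position, so the block order must match the encoding
user_coefs = lr.coef_[0][:n_users]
item_coefs = lr.coef_[0][-n_items:]
print(user_coefs.shape, item_coefs.shape)
```

Because the coefficients are recovered purely by position, any change to the order of the feature blocks in the encoding step has to be mirrored in the slicing.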

In [16]:
os.chdir(main_path)


In [17]:
experiment_args = vars(options)
today = datetime.datetime.now() # save date of experiment

# EXPERIMENT_FOLDER = folder + '/' + education_year + '/results/'
#dataio.prepare_folder(EXPERIMENT_FOLDER)
# load config file
with open(f'{folder}{education_year}/processed/config.json') as json_file:
    config = json.load(json_file)
    n_items = config["n_items"]
    n_users = config["n_users"]
    n_skills= config["n_skills"]
    
if options.by_spec == True:
    learner_competency = np.zeros((uSize, tSize)) 
    attempt_counter_student_spec=np.zeros((uSize, tSize))

    for specialty in skill_names_ids_map_df["specialty"].unique():
        
        X=csr_matrix(load_npz(f"{folder}{education_year}/processed/X-{features_suffix}-{specialty}.npz"))
        y = X[:,0].toarray().flatten()
        dt = time.time()
        # print a fitting message that mentions the specialty
        print("Fitting logistic regression for specialty: {}...".format(specialty))
        lr = LogisticRegression(solver="saga", max_iter=options.iter, C=options.C,n_jobs=-1).fit(X[:,1:],y)
        
        # find the corresponding specialty_id from skill_names_ids_map_df
        specialty_id = skill_names_ids_map_df[skill_names_ids_map_df["specialty"]==specialty]["specialty_id"].values[0]
        user_deltas = lr.coef_[:,:n_users]
        # fill the learner_competency matrix with the corresponding specialty_id
        learner_competency[:,specialty_id] = user_deltas.flatten()
        
        # load data for this specialty
        data_this_spec= pd.read_csv(f"{folder}{education_year}/processed/by_spec_preprocessed_{specialty}.csv")
        # count number of attempts per student
        num_attempts_per_student = data_this_spec.groupby("user_id").size()
        # fill the attempt_counter_student_spec matrix with the corresponding specialty_id and for the students who attempted the specialty
        attempt_counter_student_spec[num_attempts_per_student.index,specialty_id] = num_attempts_per_student.values

    # rename the columns of the learner_competency matrix with the specialty names
    learner_competency= pd.DataFrame(learner_competency, columns=skill_names_ids_map_df["specialty"].unique())
    attempt_counter_student_spec= pd.DataFrame(attempt_counter_student_spec, columns=skill_names_ids_map_df["specialty"].unique())
    # save the learner_competency as csv
    learner_competency.index.name = "user_id"
    learner_competency.to_csv(os.path.join(EXPERIMENT_FOLDER,'learner_competency.csv'))
    # save the attempt_counter_student_spec as csv to the processed folder, with the index named user_id
    attempt_counter_student_spec.index.name = "user_id"
    attempt_counter_student_spec.to_csv(folder+'/'+ education_year+"/processed/attempt_counter_student_spec.csv")
    
    
else:
    
    # Load sparsely encoded datasets from EXPERIMENT_FOLDER
    #X = csr_matrix(load_npz(f"{folder}{education_year}/processed/train_sparse_ready.npz"))
    X = csr_matrix(load_npz(os.path.join(EXPERIMENT_FOLDER, "train_sparse_ready.npz")))

    #X = csr_matrix(load_npz(f"{folder}{education_year}/processed/X-{features_suffix}.npz"))
    #X = csr_matrix(load_npz(folder+'/'+education_year+'/processed/X-uuis.npz'))
    y = X[:,0].toarray().flatten()
    #qmat = load_npz(f"{folder}{education_year}/processed/q_mat.npz")
    #n_items = qmat.shape[0]

    dt = time.time()
    print('fitting...')

    lr = LogisticRegression(solver="saga", max_iter=options.iter, C=options.C,n_jobs=-1).fit(X[:,1:],y)
    # save the model
    joblib.dump(lr, EXPERIMENT_FOLDER+'logistic_regression_model.pkl')
    # print the expected number of coefs and the real number of coefs
    if options.spec_difficulty == True:
        expected_number_of_coefs = n_users*n_skills+n_skills+n_items
    elif options.spec_difficulty == False:
        expected_number_of_coefs = n_users*n_skills+n_items
    print("Expected number of coefs: {}. Real number of coefs: {}".format(expected_number_of_coefs,len(lr.coef_[0])))
    
    # saving the output of the logistic regression to the appropriate folder
    
    if options.user_skill_together == True:
        
        # item_difficulty
        
        # Extract the item coefficients (the last n_items entries of lr.coef_)
        item_deltas = np.array(lr.coef_[0][-n_items:])
        item_indices = np.arange(item_deltas.shape[0])
        # item_indices as integers
        item_indices = item_indices.astype(int)
        # Combine item indices and deltas into a single array
        item_difficulty = np.column_stack((item_indices, item_deltas))
        # Convert to pandas DataFrame and rename columns
        item_difficulty_df = pd.DataFrame(item_difficulty, columns=['item_id', 'difficulty_irt'])
        # Save as CSV file
        csv_path = os.path.join(EXPERIMENT_FOLDER, 'question_difficulty.csv')
        item_difficulty_df.to_csv(csv_path, index=False)
        
        # learner_competency
        
        skill_names_ids_map_df = pd.read_csv(folder+'/'+education_year+'/processed/skill_names_ids_map.csv')
        learner_competency=np.array(lr.coef_[0][:n_users*n_skills]).reshape(n_users,n_skills)
        # turn into pandas dataframe
        learner_competency= pd.DataFrame(learner_competency, columns=skill_names_ids_map_df["specialty"].unique())
        # index as user_id
        learner_competency.index.name = "user_id"
        # save as csv
        csv_path = os.path.join(EXPERIMENT_FOLDER, 'learner_competency.csv')
        learner_competency.to_csv(csv_path)
        
        # specialty difficulty
        
        if options.spec_difficulty==True:
            # open an empty specialty_difficulty dataframe with specialty and specialty_difficulty as columns
            specialty_difficulty = pd.DataFrame(columns=["specialty","specialty_difficulty"])
            # fill the specialty column with the skill_names_ids_map_df["specialty"].unique()
            specialty_difficulty["specialty"] = skill_names_ids_map_df["specialty"].unique()
            # fill the specialty_difficulty column with the lr.coef_[0][n_users*n_skills:(n_users*n_skills+n_skills)]
            specialty_difficulty["specialty_difficulty"] = lr.coef_[0][n_users*n_skills:(n_users*n_skills+n_skills)]
            # leave the index unnamed (the CSV still gets a leading, unnamed index column)
            specialty_difficulty.index.name = None
            # save as csv
            csv_path = os.path.join(EXPERIMENT_FOLDER, 'specialty_difficulty.csv')
            specialty_difficulty.to_csv(csv_path)

                
        #  attempts per student in each specialty
        
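        # Columns 1 .. n_users*n_skills of X hold the (student, specialty) features; column 0 is the label
        X_filtered = X[:, 1:n_users * n_skills + 1]
        # With weighted encoding the entries are weights, so binarise them before counting attempts
        if options.weighted_encoding == True:
            X_filtered[X_filtered != 0] = 1
        # Column sums give the number of attempts per (student, specialty) pair
        num_attempts_per_student = np.asarray(X_filtered.sum(axis=0)).ravel()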
        # Reshape the attempts array to match n_users x n_skills
        attempt_counter_student = num_attempts_per_student.reshape(n_users, n_skills)
        # Create a DataFrame with the attempt_counter_student array
        attempt_counter_student_df = pd.DataFrame(attempt_counter_student, columns=skill_names_ids_map_df["specialty"].unique())
        # Set the index name to "user_id"
        attempt_counter_student_df.index.name = "user_id"
        # Save the DataFrame to CSV
        attempt_counter_student_df.to_csv(folder + '/' + education_year + '/processed/attempt_counter_student_spec.csv')
        
        # Attempt per specialty

        # Sum over all students to get the total number of attempts per specialty
        attempt_counter_spec = attempt_counter_student_df.sum(axis=0)
        # Create a new DataFrame with the total_attempts column
        attempt_counter_spec_df = attempt_counter_spec.to_frame()
        # Rename the column as "total_attempts"
        attempt_counter_spec_df.columns = ["total_attempts"]
        # Rename the index as "specialty"
        attempt_counter_spec_df.index.name = "specialty"
        # Save the DataFrame to CSV
        attempt_counter_spec_df.to_csv(folder + '/' + education_year + '/processed/attempt_counter_spec.csv')    
        
        
    else:
        
        #np.save(os.path.join(EXPERIMENT_FOLDER,'item_deltas.npy'), np.array(lr.coef_[0,-n_items:]))
        #np.save(os.path.join(EXPERIMENT_FOLDER,'item_deltas.npy'), np.array(lr.coef_[0][-n_items:]))
        np.save(os.path.join(EXPERIMENT_FOLDER,'item_deltas.npy'), np.array(lr.coef_[0][-(n_skills+n_items):-n_skills]))
        # save the last n_skills coefs as skill_deltas
        np.save(os.path.join(EXPERIMENT_FOLDER,'skill_deltas.npy'), np.array(lr.coef_[0][-n_skills:]))
        

fitting...
Expected number of coefs: 865145. Real number of coefs: 865145
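
The count check above follows from how the cell slices `lr.coef_[0]`: the first `n_users*n_skills` entries are reshaped into learner competencies, the next `n_skills` entries are specialty difficulties (when `spec_difficulty` is enabled), and the last `n_items` entries are item difficulties. A minimal sketch of that layout with toy sizes; the helper `split_coefficients` is hypothetical and not part of the notebook.

```python
import numpy as np

def split_coefficients(coef, n_users, n_skills, n_items, spec_difficulty=True):
    """Split a flat coefficient vector into the three blocks sliced in the cell above.

    Hypothetical helper for illustration only; it mirrors the slicing of lr.coef_[0].
    """
    n_competency = n_users * n_skills
    expected = n_competency + n_items + (n_skills if spec_difficulty else 0)
    if coef.shape[0] != expected:
        raise ValueError(f"expected {expected} coefficients, got {coef.shape[0]}")
    # One row per learner, one column per specialty
    learner_competency = coef[:n_competency].reshape(n_users, n_skills)
    # Specialty block sits directly after the competency block, if present
    specialty_difficulty = (
        coef[n_competency:n_competency + n_skills] if spec_difficulty else None
    )
    # Item difficulties are always the trailing n_items entries
    item_difficulty = coef[-n_items:]
    return learner_competency, specialty_difficulty, item_difficulty

# Toy sizes, chosen only for the example
n_users, n_skills, n_items = 3, 2, 4
coef = np.arange(n_users * n_skills + n_skills + n_items, dtype=float)
competency, spec_diff, item_diff = split_coefficients(coef, n_users, n_skills, n_items)
print(competency.shape, spec_diff.shape, item_diff.shape)
```

Raising on a length mismatch makes the same check as the printed comparison, but stops before any block is written with the wrong shape.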


# Validation Test

In [18]:
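import math
import numpy as np

def CalculateRMSE(Output, ground, I):
    # Root-mean-square error between predictions (Output) and labels (ground) over I samples
    Output = np.array(Output)
    ground = np.array(ground)
    error = Output - ground
    err_sqr = error*error
    RMSE = math.sqrt(err_sqr.sum()/I)
    return RMSE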
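
When the predictions are probabilities and the labels are 0/1, as in the validation cell, this RMSE is the square root of the mean squared error between the two (the Brier score). A short vectorised check on made-up numbers; the arrays are illustrative only and do not come from the dataset.

```python
import math
import numpy as np

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.6, 0.4, 0.1])

# Step-by-step version, following the same arithmetic as CalculateRMSE
error = p_hat - y_true
rmse_loop_style = math.sqrt((error * error).sum() / len(p_hat))

# Equivalent one-liner: square root of the mean squared error
rmse = float(np.sqrt(np.mean((p_hat - y_true) ** 2)))

print(rmse_loop_style, rmse)
```

Passing `len(y_pred)` as the third argument, as the notebook does, makes the two forms identical.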

In [19]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

import math

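# Evaluate the trained model on a sparse matrix whose first column is the label.
# NOTE: as written, this cell loads train_sparse_ready.npz, so the "Test" metrics below are
# computed on the training data; point it at the held-out file for a true test score.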
#test_sparse_df = sparse.load_npz(folder+'/'+education_year+'/processed/test_sparse_ready.npz')
test_sparse_df = sparse.load_npz(os.path.join(EXPERIMENT_FOLDER, "train_sparse_ready.npz"))
X = csr_matrix(load_npz(os.path.join(EXPERIMENT_FOLDER, "train_sparse_ready.npz")))

# read lr model from the EXPERIMENT_FOLDER
lr = joblib.load(os.path.join(EXPERIMENT_FOLDER, "logistic_regression_model.pkl"))
# predict the probability of correctness for each sample in the test_sparse_df
y_pred = lr.predict_proba(test_sparse_df[:,1:])
# save the predicted probabilities as csv
#np.savetxt(folder+'/'+education_year+'/results/y_pred.csv', y_pred, delimiter=',')
np.savetxt(os.path.join(EXPERIMENT_FOLDER, "y_pred.csv"), y_pred, delimiter=',')
# compare the predicted probabilities with the actual correctness
# load the actual correctness
test_target = test_sparse_df[:,0].toarray().flatten()
# compute the roc_auc_score
auc_score=roc_auc_score(test_target, y_pred[:,1])


# Make predictions on the test data
test_predictions = lr.predict(test_sparse_df[:, 1:]) 
accuracy = accuracy_score(test_target, test_predictions)
precision = precision_score(test_target, test_predictions)
recall = recall_score(test_target, test_predictions)
f1 = f1_score(test_target, test_predictions)


rmse_test = CalculateRMSE( y_pred[:,1], test_target, len(y_pred[:,1]))

print("Test RMSE: ", rmse_test)
print("Test AUC: ", auc_score)
print("Test ACC: ", accuracy)


# save the results to EXPERIMENT_FOLDER
with open(os.path.join(EXPERIMENT_FOLDER, "validation_results.txt"), "w") as f:
    f.write("Test RMSE: " + str(rmse_test) + "\n")
    f.write("Test AUC: " + str(auc_score) + "\n")
    f.write("Test ACC: " + str(accuracy) + "\n")
    f.write("Test Precision: " + str(precision) + "\n")
    f.write("Test Recall: " + str(recall) + "\n")
    f.write("Test F1: " + str(f1) + "\n")
    


Test RMSE:  0.4349058422885915
Test AUC:  0.7840478106690244
Test ACC:  0.7078667053508517
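
The metrics above fall into two groups: RMSE and AUC use the predicted probabilities directly, while accuracy, precision, recall and F1 use the hard labels from `lr.predict`, which for a binary logistic regression amounts to thresholding the positive-class probability at 0.5. A minimal NumPy sketch of the second group on made-up values (not the notebook's data):

```python
import numpy as np

# Made-up labels and positive-class probabilities, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1])
p_hat = np.array([0.9, 0.6, 0.7, 0.4, 0.1, 0.8])

# Hard predictions: positive when the probability exceeds 0.5
y_pred = (p_hat > 0.5).astype(int)

# Confusion-matrix counts
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Because these four numbers depend on the 0.5 cut-off, they can shift with class balance even when the AUC stays the same.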
