# Dataset 

This notebook creates the dataset used in all our experiments. Using one common dataset lets us evaluate the learning algorithm consistently against the omniscient policy.

We generate the dataset and save it to disk so that the learning algorithm can load it.

# Student Features 

Research suggests that students have preferred ways of learning. Although no single style suits everyone, there is a fair understanding of what students look for in the content they are trying to understand. We assume that a survey was conducted among students, asking how material should be taught so that they can understand it easily. The survey covers the following preferences:

1. Visual (V): Preference for visual explanations (videos, short films, movie clips, vlogs).
2. Text (T): Preference for written explanations (books, articles, blogs, research papers).
3. Demo-based (D): Preference for live experimentation that helps explain the concept.
4. Practical (P): Preference for having the topic explained or demonstrated and then performing it themselves.
5. Step-by-step (S): Preference for a guide that works through the topic in a systematic way.
6. Activity / Task-based (AT): Preference for interactive content that requires students to participate.
7. Lecture (L): Preference for passively listening to an expert explain the topic.
8. Audio (A): Preference for audio explanations (podcasts, music).
9. Self-evaluation (SE): The student's own evaluation of their readiness, motivation, and excitement for the course.
10. Pre-assessment (PA): An assessment of the prerequisites required for the course.

Other features to be considered later, to add complexity and completeness:
- Age: The student's age bracket (0-10, 11-20, 21-30, ..., 71-80).
- Gender: Male or Female (represented as 2 features).
- Features that can be captured in a live system:
    - Response times: Time taken by the student to answer questions or give feedback.
    - Correctness of answer: How correct the answer is. Answers are generally classified as either right or wrong, but research has shown that most answers fall somewhere between completely wrong and completely right. Measuring the degree of correctness is a research challenge and depends on the subject being taught.
    - Interactions: How the student interacts with the content being taught, for example highlighting text, bookmarking, pausing, or adding notes.
    - Forgetfulness: How well students remember the content they were taught.
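
As a rough illustration of how the age bracket and gender could later become model inputs, the sketch below one-hot encodes them in plain Python. The names `AGE_BRACKETS`, `GENDERS`, and `encode_demographics` are hypothetical, and the brackets are assumed to be non-overlapping ten-year ranges ending at 80. None of this is part of the current generator.

```python
# Assumed, non-overlapping age brackets (inclusive bounds); illustrative only.
AGE_BRACKETS = [(0, 10), (11, 20), (21, 30), (31, 40),
                (41, 50), (51, 60), (61, 70), (71, 80)]
# Gender is represented as 2 indicator features, as described in the list above.
GENDERS = ["Male", "Female"]


def encode_demographics(age, gender):
    """Return 0/1 indicators: one per age bracket, followed by one per gender."""
    age_flags = [1 if low <= age <= high else 0 for low, high in AGE_BRACKETS]
    if sum(age_flags) != 1:
        raise ValueError(f"age {age} does not fall in any bracket")
    if gender not in GENDERS:
        raise ValueError(f"unknown gender value: {gender!r}")
    gender_flags = [1 if gender == g else 0 for g in GENDERS]
    return age_flags + gender_flags


# A 25-year-old falls in the third bracket (21-30).
print(encode_demographics(25, "Female"))
```

These indicators could be appended to the ten preference columns if the extra features are adopted.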


# Content Features 

1. Ease of understanding (E): How easy the content is to understand. A value close to 0 means it is not easy, whereas a value close to 1 means it is comparatively easy.
2. Simple / Intuitive (I): Whether the content provides a simple, intuitive understanding of the topic.
3. Surface / In-depth (ID): Whether the content provides a surface-level or deep understanding of the topic.
4. Brief / Concise (C): Whether the content is short and to the point, or descriptive, verbose, and elaborate. Learners differ in how long they can maintain concentration and in how much they remember.
5. Thorough (T): How well the content covers the topic.
6. Preference / Well reviewed / Well rated (R): How well rated the explanation is. Note for the education write-up: we envision a system in which teachers share available resources from various sources and use them to teach their students.
7. Theoretical / Abstract (A): How theoretical or abstract the content is.
8. Practical / Hands-on (P): Whether the content is something that can be tried or experienced.
9. Experimental / Task-based (ETB): Whether a task must be completed to fully understand the content, such as collaborating with other students or carrying out some research.


# Topic Features 

Going forward, we would also like to consider the importance of a topic, which the skipping algorithm would use to decide whether skipping that topic is a good option.

# Dataset Sizes

- Very Small : 50 students with 10 topics
- Small :  100 students with 25 topics
- Medium : 200 students with 50 topics
- Large : 400 students with 100 topics
- Very Large : 800 students with 200 topics
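
The sizes listed above can be captured in a small lookup table so that one generator serves every configuration. The following is a minimal standard-library sketch, not the notebook's own generator (which uses NumPy and pandas and hard-codes one size). `DATASET_SIZES`, `make_dataset`, and the `seed` parameter are hypothetical names introduced for illustration; the feature column names and the 5 to 10 contents per topic follow the generator code in this notebook.

```python
import random

# (number of students, number of topics) for each named size in the list above.
DATASET_SIZES = {
    "very_small": (50, 10),
    "small": (100, 25),
    "medium": (200, 50),
    "large": (400, 100),
    "very_large": (800, 200),
}

STUDENT_FEATURES = ['S_V', 'S_T', 'S_D', 'S_P', 'S_S', 'S_AT', 'S_L', 'S_A', 'S_SE', 'S_PA']
CONTENT_FEATURES = ['C_E', 'C_I', 'C_ID', 'C_C', 'C_T', 'C_R', 'C_A', 'C_P', 'C_ETB']


def make_dataset(size_name, seed=0):
    """Generate uniform random student and content features for a named size."""
    n_students, n_topics = DATASET_SIZES[size_name]
    rng = random.Random(seed)  # seeded so the dataset is reproducible

    def random_row(features):
        # One value in [0, 1] per feature, rounded to 2 decimals.
        return {name: round(rng.uniform(0.0, 1.0), 2) for name in features}

    students = [random_row(STUDENT_FEATURES) for _ in range(n_students)]

    contents = {}        # content id -> feature row
    topic_content = {}   # topic id -> list of content ids
    for t in range(1, n_topics + 1):
        n_contents = rng.randint(5, 10)  # inclusive on both ends
        content_ids = [f"C_{t}_{c}" for c in range(1, n_contents + 1)]
        topic_content[f"T_{t}"] = content_ids
        for content_id in content_ids:
            contents[content_id] = random_row(CONTENT_FEATURES)
    return students, contents, topic_content


students, contents, topic_content = make_dataset("very_small")
print(len(students), len(topic_content), len(contents))
```

Fixing the seed is a design choice made here so that repeated runs produce the same dataset, which matters when several experiments are meant to share it.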

In [12]:
import numpy as np
import pandas as pd
import pickle

'''
Class that encapsulates the student & content data generators. It uses StudentDataGen & ContentDataGen to create the data.
'''
class DataGenerator:
    
    def __init__(self):
        self.studentDataGen = StudentDataGen()
        self.contentDataGenerator = ContentDataGen()
        
    def createData(self) : 
        self.studentData = self.studentDataGen.create()
        self.contentData , self.all_topics , self.topic_content = self.contentDataGenerator.getContentsFeatures()
        
        with open('student.pickle', 'wb') as student_file:
            print('Student Data : ', self.studentData.shape)
            pickle.dump(self.studentData, student_file, protocol=pickle.HIGHEST_PROTOCOL)
            
        with open('content.pickle', 'wb') as content_file:
            print('Content Data : ', self.contentData.shape)
            pickle.dump(self.contentData, content_file , protocol=pickle.HIGHEST_PROTOCOL)
            
        with open('topic.pickle', 'wb') as topics_file:
            print('Topics Data : ', self.all_topics)
            pickle.dump(self.all_topics, topics_file , protocol=pickle.HIGHEST_PROTOCOL)
            
        with open('topic_content.pickle', 'wb') as topic_content_file:
            print('Topic Content Data : ', self.topic_content)
            pickle.dump(self.topic_content, topic_content_file , protocol=pickle.HIGHEST_PROTOCOL)
            
    def createStudentData(self):
        self.studentData =  self.studentDataGen.create()

    def getStudentData(self):
        return self.studentData
    
    def createContentData(self):
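        # getContentsFeatures() returns (features DataFrame, topic ids, topic -> content map);
        # keep only the DataFrame here so getContentData() returns the features themselves.
        self.contentsFeatures, _, _ = self.contentDataGenerator.getContentsFeatures()
        self.topicContent = self.contentDataGenerator.getTopicContent()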
        
    def getContentData(self):
        return self.contentsFeatures
    
    def getTopicData(self):
        return self.topicContent

'''
This is the student data generator
'''
class StudentDataGen:
    def __init__(self):
        self.number_of_students = 400 # Students taking the course. 
        # Student preferences: Visual (V), Text (T), Demo-based (D), Practical (P), Step-by-step (S),
        # Activity / Task-based (AT), Lecture (L), Audio (A), Self-evaluation (SE), Pre-assessment (PA).
        # A student's preference for each way of learning could also be rated on a scale of 10 rather than being binary.
        self.student_context = ['S_V','S_T','S_D','S_P','S_S','S_AT','S_L','S_A','S_SE','S_PA']
    
    def create(self):
        ## Create Student Context Data
        student_data = np.random.uniform(0.0 , 1.0 , size=(self.number_of_students,len(self.student_context)))
        student_data = np.round(student_data,2)
        student_context_df = pd.DataFrame(data=student_data , columns = self.student_context)
        return student_context_df
    
'''
This is the content data generator
'''
class ContentDataGen:
    
    def __init__(self):
        self.number_of_topics = 100 # Number of topics in the course
        # Content Features 
        # Ease of understanding (E), Simple / Intuitive (I), Surface / In-depth (ID), Brief / Concise (C), Thorough (T),
        # Preference / Well reviewed / Well rated (R), Theoretical / Abstract (A), Practical / Hands-on (P),
        # Experimental / Task-based (ETB).
        # Each content feature could also be rated on a scale of 10 rather than being binary.
        self.content_context = ['C_E','C_I','C_ID','C_C','C_T','C_R','C_A','C_P','C_ETB']
        self.no_contents_per_topic = np.random.randint(5,11,self.number_of_topics) # Variable number of contents per topic.
    
    def create(self):
        all_contents = list()
        all_topics = list()
        topic_content = {}
        for i,j in enumerate(self.no_contents_per_topic):
            topic_id = "T_" + str(i+1) # e.g : T_10
            content_ids = [] # Temporary variable to help map topic to content. 
            for j_1 in range(1,j+1) : # Number of contents
                c_id = 'C_' + str(i+1) + '_' + str(j_1) # e.g. C_10_2 : content number 2 for topic 10
                content_ids.append(c_id)
                all_contents.append(c_id)
            topic_content[topic_id] = content_ids  
            all_topics.append(topic_id)
        return topic_content , all_topics , all_contents
    
    # Content related features
    def getContentsFeatures(self):
        self.topic_content , self.all_topics , self.all_contents = self.create()
        content_data = np.random.uniform(0.0 , 1.0 , size=(sum(self.no_contents_per_topic),len(self.content_context)))
        content_data = np.round(content_data,2)
        content_context_df = pd.DataFrame(data=content_data, 
                             columns = self.content_context , index=self.all_contents)
        return content_context_df , self.all_topics , self.topic_content
    
    def getTopicContent(self):
        return self.topic_content

In [13]:
dataGenerator = DataGenerator()
dataGenerator.createData()

Student Data :  (400, 10)
Content Data :  (720, 9)
Topics Data :  ['T_1', 'T_2', 'T_3', 'T_4', 'T_5', 'T_6', 'T_7', 'T_8', 'T_9', 'T_10', 'T_11', 'T_12', 'T_13', 'T_14', 'T_15', 'T_16', 'T_17', 'T_18', 'T_19', 'T_20', 'T_21', 'T_22', 'T_23', 'T_24', 'T_25', 'T_26', 'T_27', 'T_28', 'T_29', 'T_30', 'T_31', 'T_32', 'T_33', 'T_34', 'T_35', 'T_36', 'T_37', 'T_38', 'T_39', 'T_40', 'T_41', 'T_42', 'T_43', 'T_44', 'T_45', 'T_46', 'T_47', 'T_48', 'T_49', 'T_50', 'T_51', 'T_52', 'T_53', 'T_54', 'T_55', 'T_56', 'T_57', 'T_58', 'T_59', 'T_60', 'T_61', 'T_62', 'T_63', 'T_64', 'T_65', 'T_66', 'T_67', 'T_68', 'T_69', 'T_70', 'T_71', 'T_72', 'T_73', 'T_74', 'T_75', 'T_76', 'T_77', 'T_78', 'T_79', 'T_80', 'T_81', 'T_82', 'T_83', 'T_84', 'T_85', 'T_86', 'T_87', 'T_88', 'T_89', 'T_90', 'T_91', 'T_92', 'T_93', 'T_94', 'T_95', 'T_96', 'T_97', 'T_98', 'T_99', 'T_100']
Topic Content Data :  {'T_79': ['C_79_1', 'C_79_2', 'C_79_3', 'C_79_4', 'C_79_5', 'C_79_6', 'C_79_7', 'C_79_8'], 'T_89': ['C_89_1', 'C_89_2',

In [37]:
import numpy as np 

confidence_threshold = 100
threshold_updated_count = 1

# def updateConfidenceThreshold(self,rounds):
#     if np.log10(rounds) > self.threshold_updated_count : 
#         self.confidence_threshold /= np.log10(rounds)
#         self.threshold_updated_count += 1
#         print(rounds)
        
for i in range(1, 100020):  # start at 1: log10(0) is -inf and raises a RuntimeWarning
    if np.log10(i) > threshold_updated_count : 
        confidence_threshold /= np.log10(i)
        threshold_updated_count += 1
        print(i)
        print("**** confidence_threshold ****", confidence_threshold)
#     updateConfidenceThreshold(i)

11
**** confidence_threshold **** 96.02525677891276
101
**** confidence_threshold **** 47.909111799616106
1001
**** confidence_threshold **** 15.96739357122114
10001
**** confidence_threshold **** 3.991805054499464
100001
**** confidence_threshold **** 0.798360317456399
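
The cell above shrinks `confidence_threshold` each time the round counter passes a power of ten, dividing it by the base-10 logarithm of the current round. A minimal function form of that loop, assuming the same starting value of 100, might look like the sketch below. `threshold_schedule` and its parameters are illustrative names rather than part of the notebook's classes, and it uses `math.log10` instead of `np.log10` so that it runs with the standard library alone.

```python
import math


def threshold_schedule(max_rounds, initial_threshold=100.0):
    """Return (round, threshold) pairs for every round at which the threshold shrinks."""
    threshold = initial_threshold
    updated_count = 1  # the next update fires once log10(round) exceeds this value
    updates = []
    for rounds in range(1, max_rounds):  # start at 1 because log10(0) is undefined
        if math.log10(rounds) > updated_count:
            threshold /= math.log10(rounds)
            updated_count += 1
            updates.append((rounds, threshold))
    return updates


for rounds, threshold in threshold_schedule(100020):
    print(rounds, threshold)
```

Returning the update points instead of printing inside the loop makes the schedule easy to inspect or plot before wiring it into the learner.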


In [3]:
import os,pickle
import pandas as pd 
file_path = os.path.join(os.path.curdir, '..' , 'dataset' , 'small')
with open(os.path.join(file_path , 'student.pickle'), 'rb') as student_file:
    studentContext= pickle.load(student_file)
    
studentContext.head(1)

| | S_V | S_T | S_D | S_P | S_S | S_AT | S_L | S_A | S_SE | S_PA |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.87 | 0.82 | 0.88 | 0.36 | 0.6 | 0.06 | 0.66 | 0.56 | 0.66 | 0.07 |


In [11]:
reward_not_0 = pd.read_csv('logs_oracle_verySmall')
len(reward_not_0['arm_pulled'].unique())

75

In [6]:
import pandas as pd
import os,pickle
content_df = pd.DataFrame()
file_path = os.path.join(os.path.curdir, '..' , 'dataset' , 'very_small')
with open(os.path.join(file_path ,'content.pickle'), 'rb') as content_file:
    content_df = pickle.load(content_file)
content_df

| | C_E | C_I | C_ID | C_C | C_T | C_R | C_A | C_P | C_ETB |
|---|---|---|---|---|---|---|---|---|---|
| C_1_1 | 0.22 | 0.95 | 0.34 | 0.33 | 0.20 | 0.61 | 0.62 | 0.35 | 0.31 |
| C_1_2 | 0.49 | 0.01 | 0.19 | 0.95 | 0.94 | 0.58 | 0.04 | 0.80 | 0.06 |
| C_1_3 | 0.23 | 0.47 | 0.71 | 0.72 | 0.42 | 0.83 | 0.70 | 0.27 | 0.55 |
| C_1_4 | 0.32 | 0.32 | 0.38 | 0.99 | 0.15 | 0.59 | 0.46 | 0.34 | 0.39 |
| C_1_5 | 0.21 | 0.36 | 0.41 | 0.57 | 0.85 | 0.05 | 0.38 | 0.49 | 0.64 |
| C_1_6 | 0.62 | 0.58 | 0.06 | 0.11 | 0.37 | 0.18 | 0.82 | 0.90 | 0.33 |
| C_1_7 | 0.19 | 0.25 | 0.21 | 0.90 | 0.04 | 0.20 | 0.64 | 0.30 | 0.63 |
| C_1_8 | 0.28 | 0.37 | 0.45 | 0.29 | 0.19 | 0.11 | 0.90 | 0.09 | 0.05 |
| C_1_9 | 0.22 | 0.94 | 0.98 | 0.92 | 0.35 | 0.14 | 0.19 | 1.00 | 0.48 |
| C_1_10 | 0.03 | 0.29 | 0.15 | 0.14 | 0.52 | 0.26 | 0.85 | 0.65 | 0.59 |
