## Overview:

This notebook is a first attempt at optimizing classroom assignments, based on the 20191 (Spring) schedule. The objective is to maximize room efficiency, defined as: $$\frac{\text{seats offered (student limit)}}{\text{room capacity}}$$
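As a quick illustration of the definition above, here is a minimal sketch that computes room efficiency row by row. The three rows are toy data whose numbers mirror a few ACCT-410 sections that appear in this notebook's output; the column name `room_efficiency` is an assumption introduced only for this example.

```python
import pandas as pd

# Toy rows; the numbers mirror a few ACCT-410 sections shown in this notebook.
sections = pd.DataFrame({
    "section": ["14001", "14002", "14004"],
    "seats_offered": [46, 46, 48],
    "classroom_capacity": [48.0, 54.0, 48.0],
})

# Room efficiency = seats offered / room capacity, computed per section.
# "room_efficiency" is a hypothetical column name used only in this sketch.
sections["room_efficiency"] = sections["seats_offered"] / sections["classroom_capacity"]
print(sections)
```

A value of 1.0 means the room is exactly full on paper; a value above 1.0 would mean more seats were offered than the room holds.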

Later we will add happiness factors that reflect students' and professors' needs.

Created on: 5.1.2020

Created by: Nanchun Shi

## I. Data preprocessing

In [10]:
import pandas as pd
import numpy as np

In [11]:
## read in data

df = pd.read_csv('data/schedule_2015_to_2019.csv')

In [12]:
df.head()

Unnamed: 0,cancelled,date_of_cancellation,term,course,section,title,mode,units,level,department,...,second_instructor_uid,second_room,seats_offered,reg_count,adj_reg,wait_count,total_tuition_units,classroom_capacity,cap_remaining_seats,classroom_remaining_seats
0,True,,20153,,14025,External Financial Reporting Issues,C,4.0,,,...,,,42,24,,,96.0,46.0,18.0,22.0
1,False,,20153,ACCT-370,14025,External Financial Reporting Issues,C,4.0,,ACCT,...,,,42,24,,,96.0,46.0,18.0,22.0
2,True,,20153,,14026,External Financial Reporting Issues,C,4.0,,,...,,,42,40,,,160.0,46.0,2.0,6.0
3,False,,20153,ACCT-370,14026,External Financial Reporting Issues,C,4.0,,ACCT,...,,,42,40,,,160.0,46.0,2.0,6.0
4,True,,20153,,14027,External Financial Reporting Issues,C,4.0,,,...,,,42,42,,,168.0,46.0,0.0,4.0


In [13]:
df.shape

(7441, 38)

In [14]:
## select 20191 (Spring) term & not cancelled

df = df[(df.term == 20191) & (df.cancelled == 0)]

In [15]:
df.shape

(582, 38)

In [16]:
## check how many courses have no second room

df.second_room.isna().sum()

581

In [17]:
## since a second room is rare, drop the courses that have one

df = df[df.second_room.isna()]

In [18]:
## select necessary columns

df = df.iloc[:,np.r_[3:6,9:11,18:21,23,30,35]]
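Positional selection with `np.r_` depends on the column order of the CSV. As a hedged alternative, the sketch below selects the same eleven columns by name; the names are taken from the `isna()` summary in this notebook, while the one-row frame `raw` and its filler values are hypothetical.

```python
import pandas as pd

# Column names as they appear in this notebook's isna() summary.
wanted = ["course", "section", "title", "department", "type",
          "first_days", "first_begin_time", "first_end_time",
          "first_room", "seats_offered", "classroom_capacity"]

# Hypothetical one-row frame with two extra columns that should be dropped.
raw = pd.DataFrame([{**{c: "x" for c in wanted}, "term": 20191, "cancelled": False}])

# Selecting by name keeps working even if the CSV column order changes.
subset = raw.loc[:, wanted]
print(list(subset.columns))
```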

In [19]:
df.isna().sum()

course                 0
section                0
title                  0
department             0
type                   0
first_days             0
first_begin_time       0
first_end_time         0
first_room            35
seats_offered          0
classroom_capacity    59
dtype: int64

In [20]:
## since we need to use all information we selected, we need to drop null values:

df.dropna(inplace=True)

In [21]:
df.shape

(522, 11)

In [22]:
df.head()

Unnamed: 0,course,section,title,department,type,first_days,first_begin_time,first_end_time,first_room,seats_offered,classroom_capacity
6804,ACCT-410,14001,Foundations of Accounting,ACCT,Elective,TH,16:00:00,17:50:00,JFF322,46,48.0
6805,ACCT-410,14002,Foundations of Accounting,ACCT,Elective,TH,10:00:00,11:50:00,ACC310,46,54.0
6806,ACCT-410,14003,Foundations of Accounting,ACCT,Elective,TH,12:00:00,13:50:00,ACC310,48,54.0
6807,ACCT-410,14004,Foundations of Accounting,ACCT,Elective,MW,12:00:00,13:50:00,ACC201,48,48.0
6808,ACCT-410,14005,Foundations of Accounting,ACCT,Elective,MW,14:00:00,15:50:00,ACC201,47,48.0


In [23]:
df.first_days.value_counts()

MW     179
TH     162
T       45
M       43
W       37
H       31
F       20
S        3
MWF      2
Name: first_days, dtype: int64

In [15]:
# pd.to_datetime(df.first_end_time,format='%H:%M:%S') > pd.to_datetime(df.first_begin_time,format='%H:%M:%S')

In [16]:
## explore different time chunks

time_cks = pd.Series(map(lambda x: [x[0],x[1]], df[['first_begin_time','first_end_time']].values))

In [17]:
## for simplicity, courses in unusual time chunks are not considered
## after value counts, keep only the chunks that appear at least 2 times
## this needs to be done separately for each day-of-week pattern

time_cks_vc = time_cks.value_counts()
tcv_index = time_cks_vc.index

selected_time_cks = [tcv_index[i] for i,v in enumerate(time_cks_vc) if v > 1]

In [18]:
selected_time_cks[0]

['12:00:00', '13:50:00']

In [19]:
selected_begin_times, selected_end_times = zip(*[t for t in selected_time_cks])
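The cells above count time chunks stored as Python lists. Lists are not hashable, so `value_counts()` on such a Series may raise a `TypeError` in some pandas versions. A minimal sketch of an alternative, assuming the same two time columns, is to group on the columns directly so that each window is a hashable `(begin, end)` tuple. The frame `toy` and the names `window_sizes`, `shared_windows`, and `in_shared` are hypothetical.

```python
import pandas as pd

# Hypothetical mini schedule: two sections share the 10:00 window, one stands alone.
toy = pd.DataFrame({
    "first_begin_time": ["10:00:00", "10:00:00", "14:00:00"],
    "first_end_time":   ["11:50:00", "11:50:00", "15:20:00"],
})

# Count sections per (begin, end) pair; the group keys are hashable tuples.
window_sizes = toy.groupby(["first_begin_time", "first_end_time"]).size()

# Keep only the windows that hold at least two sections.
shared_windows = window_sizes[window_sizes > 1].index.tolist()
print(shared_windows)

# Row mask marking the sections that fall in a shared window.
shared_set = set(shared_windows)
in_shared = [pair in shared_set
             for pair in zip(toy["first_begin_time"], toy["first_end_time"])]
print(in_shared)
```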

## II. Demo for MW courses

In [24]:
new = df.copy()

In [25]:
new = new[(new.first_days == 'MW')]

In [26]:
dm_time_cks = pd.Series(map(lambda x: [x[0],x[1]], new[['first_begin_time','first_end_time']].values))

In [29]:
dm_time_cks_vc = dm_time_cks.value_counts()
dm_tcv_index = dm_time_cks_vc.index

dm_selected_time_cks = [dm_tcv_index[i] for i,v in enumerate(dm_time_cks_vc) if v > 1]

In [30]:
dm_selected_begin_times, dm_selected_end_times = zip(*[t for t in dm_selected_time_cks])

In [25]:
# new_bools = list(map(lambda t: True if (t[0] in dm_selected_begin_times and t[1] in dm_selected_end_times) else False,
#             zip(new.first_begin_time, new.first_end_time)))

In [31]:
new_bools = list(map(lambda t: True if ([t[0], t[1]] in dm_selected_time_cks) else False,
            zip(new.first_begin_time, new.first_end_time)))

In [32]:
new1 = new[new_bools].copy()
new1.head(3)

Unnamed: 0,course,section,title,department,type,first_days,first_begin_time,first_end_time,first_room,seats_offered,classroom_capacity
6807,ACCT-410,14004,Foundations of Accounting,ACCT,Elective,MW,12:00:00,13:50:00,ACC201,48,48.0
6808,ACCT-410,14005,Foundations of Accounting,ACCT,Elective,MW,14:00:00,15:50:00,ACC201,47,48.0
6812,ACCT-370,14025,External Financial Reporting Issues,ACCT,ACCT Core,MW,08:00:00,09:50:00,ACC303,37,46.0


In [33]:
new1.shape

(175, 11)

In [35]:
## for the first time chunk

new2 = new1[(new1.first_begin_time == dm_selected_begin_times[0])\
            & (new1.first_end_time == dm_selected_end_times[0])].copy()

In [36]:
new2.shape

(23, 11)

In [41]:
new2.head()

Unnamed: 0,course,section,title,department,type,first_days,first_begin_time,first_end_time,first_room,seats_offered,classroom_capacity
6813,ACCT-370,14026,External Financial Reporting Issues,ACCT,ACCT Core,MW,10:00:00,11:50:00,ACC303,38,46.0
6823,ACCT-373,14056,Introduction to Auditing and Assurance Services,ACCT,ACCT Core,MW,10:00:00,11:50:00,BRI5,36,42.0
6830,ACCT-377,14066,Valuation for Financial Statement Purposes,ACCT,ACCT Core,MW,10:00:00,11:50:00,BRI5,35,42.0
6834,ACCT-470,14115,Advanced External Financial Reporting Issues,ACCT,ACCT Core,MW,10:00:00,11:50:00,ACC201,45,48.0
6837,ACCT-473,14135,Financial Statement Auditing,ACCT,ACCT Core,MW,10:00:00,11:50:00,JFF328,36,36.0


In [43]:
## one room name is duplicated; either two classes share one room or it is a typo

len(new2.first_room.unique())

22

In [44]:
## so all courses are distinct; TODO: check this for the whole data set

new2.duplicated(subset=['course','section','title','department']).sum()

0

In [34]:
## actually, we should drop both of these records, since we should not
## arbitrarily assign two classes to the same classroom

# check = []
# new_names = []

# for r in new2.first_room:
#     if r in check:
#         new_r = r + '-' + str(check.count(r))
#     else:
#         new_r = r
#     check.append(r)
#     new_names.append(new_r)

In [35]:
# new2['first_room'] = new_names

In [45]:
new2.drop_duplicates(subset = ['first_room'], keep = False, inplace = True)
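To make the effect of `keep = False` concrete, here is a small sketch on hypothetical data. With `keep=False`, pandas drops every row whose `first_room` value occurs more than once, rather than keeping the first or last occurrence.

```python
import pandas as pd

# Hypothetical window: rooms A and C are used once, room B is listed twice.
toy = pd.DataFrame({
    "id": ["c1", "c2", "c3", "c4"],
    "first_room": ["A", "B", "B", "C"],
})

# keep=False removes all rows that share a room with another row,
# so neither of the two sections in room B survives.
unique_rooms = toy.drop_duplicates(subset=["first_room"], keep=False)
print(unique_rooms)
```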

In [46]:
new2.head(3)

Unnamed: 0,course,section,title,department,type,first_days,first_begin_time,first_end_time,first_room,seats_offered,classroom_capacity
6813,ACCT-370,14026,External Financial Reporting Issues,ACCT,ACCT Core,MW,10:00:00,11:50:00,ACC303,38,46.0
6834,ACCT-470,14115,Advanced External Financial Reporting Issues,ACCT,ACCT Core,MW,10:00:00,11:50:00,ACC201,45,48.0
6837,ACCT-473,14135,Financial Statement Auditing,ACCT,ACCT Core,MW,10:00:00,11:50:00,JFF328,36,36.0


In [47]:
new2['id'] = new2.course + new2.section

In [48]:
## calculate original average room efficiency

orig_re = np.mean(new2.seats_offered/new2.classroom_capacity)
orig_re

0.914922558077704

In [40]:
## optimization
## an earlier version gave an error because one course in the data was scheduled with limit > capacity
## this can make the model infeasible if the best option is to keep that course in its original room
## so we allow extra seats (y), but with an upper bound

course = new2[['id','seats_offered']].set_index('id')
room = new2[['first_room','classroom_capacity']].set_index('first_room')

from gurobipy import GRB, Model

mod = Model()

I = course.index
J = room.index
# extra = 1

x = mod.addVars(I, J, vtype = GRB.BINARY)
y = mod.addVars(I,J, lb = 0, ub = 5, vtype = GRB.INTEGER)

mod.setObjective(sum(x[i,j]*course.loc[i,'seats_offered']/room.loc[j,'classroom_capacity'] for i in I for j in J)-\
                 sum(y[i,j] for i in I for j in J), sense = GRB.MAXIMIZE)

for i in I:
    mod.addConstr(sum(x[i,j] for j in J) == 1)
    for j in J:
        mod.addConstr(x[i,j]*course.loc[i,'seats_offered'] <= y[i,j] + room.loc[j,'classroom_capacity'])
for j in J:
    mod.addConstr(sum(x[i,j] for i in I) == 1)

mod.setParam('outputflag',False)
mod.optimize()

Using license file /Users/aslanshi/gurobi.lic
Academic license - for non-commercial use only
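For intuition, the assignment idea can be checked without a solver on a tiny instance. The sketch below is not the Gurobi model above: it enumerates every one-to-one pairing of courses and rooms, enforces capacity as a hard limit (no extra-seat slack `y`), and keeps the pairing with the highest total efficiency. All course names, room names, and numbers are hypothetical.

```python
from itertools import permutations

# Hypothetical window with three courses and the three rooms they currently use.
seats = {"c1": 30, "c2": 40, "c3": 20}       # seats offered per course
capacity = {"r1": 48, "r2": 42, "r3": 32}    # capacity per room

courses = list(seats)
rooms = list(capacity)

def total_efficiency(assignment):
    """Sum of seats/capacity over (course, room) pairs."""
    return sum(seats[c] / capacity[r] for c, r in assignment)

best_value, best_assignment = float("-inf"), None
for perm in permutations(rooms):
    assignment = list(zip(courses, perm))
    # Hard capacity limit: skip pairings that put a course in a room that is too small.
    if any(seats[c] > capacity[r] for c, r in assignment):
        continue
    value = total_efficiency(assignment)
    if value > best_value:
        best_value, best_assignment = value, assignment

print(best_assignment, round(best_value / len(courses), 3))
```

Brute force only scales to a handful of courses, which is why the notebook uses an integer program. In this toy case the best pairing raises the mean efficiency but leaves one course in a much larger room, so the minimum efficiency drops relative to the in-order pairing.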


In [41]:
## improves a bit

mod.objval/len(new2)

0.9214041453539504

In [42]:
## new schedule (assigned room for each course id)

result = pd.DataFrame(index=I, columns=['Room'])
for i in I:
    for j in J:
        if x[i,j].x > 0.5:
            result.loc[i,'Room'] = j
result

id,Room
ACCT-37014026,HOH2
ACCT-47014115,ACC303
ACCT-47314135,JFF417
ACCT-43014144,BRI202
BAEP-47114403,JFF328
BUAD-30214650,JFF327
BUAD-30214652,JFF331
BUAD-30414721,ACC201
BUAD-30614788,JFF414
BUAD-31114909,HOH1


## III. For all day-of-week (DOW) types

In [43]:
new3 = df.copy()

In [44]:
dow_vc = new3.first_days.value_counts()
dow_vc

MW     179
TH     162
T       45
M       43
W       37
H       31
F       20
S        3
MWF      2
Name: first_days, dtype: int64

In [45]:
## no conflicts for these 2 classes

new3[new3.first_days == 'MWF']

Unnamed: 0,course,section,title,department,type,first_days,first_begin_time,first_end_time,first_room,seats_offered,classroom_capacity
7337,WRIT-340,66701,Advanced Writing,BUCO,WRIT,MWF,08:00:00,08:50:00,JFF331,19,36.0
7338,WRIT-340,66710,Advanced Writing,BUCO,WRIT,MWF,09:00:00,09:50:00,JFF331,19,36.0


### Description of the model:

The following optimization model groups courses that share the same day-of-week (DOW) pattern and the same time chunk; call each such group a window. For every window with more than one course, it outputs a new room assignment, built from the rooms used in the previous schedule, that maximizes room efficiency. Courses that stand alone in their window, or whose seats offered greatly exceeded the room capacity last term, keep their original assignment.
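Written out for a single window, the model built in the code can be stated as follows. The symbols are notation introduced here, not taken from the notebook: $s_i$ is the seats offered for course $i \in I$, $c_j$ is the capacity of room $j \in J$, $x_{ij}$ indicates that course $i$ is placed in room $j$, and $y_{ij}$ is the number of extra seats allowed for that pairing (at most 5, as in the code).

```latex
\begin{aligned}
\max_{x,\,y}\quad & \sum_{i \in I}\sum_{j \in J} \frac{s_i}{c_j}\,x_{ij} \;-\; \sum_{i \in I}\sum_{j \in J} y_{ij} \\
\text{s.t.}\quad & \sum_{j \in J} x_{ij} = 1 && \forall i \in I, \\
& \sum_{i \in I} x_{ij} = 1 && \forall j \in J, \\
& s_i\,x_{ij} \le c_j + y_{ij} && \forall i \in I,\ j \in J, \\
& x_{ij} \in \{0,1\},\quad y_{ij} \in \{0,1,\dots,5\} && \forall i \in I,\ j \in J.
\end{aligned}
```

Because each extra seat costs a full unit in the objective while a single pairing contributes roughly one unit of efficiency, the model uses extra seats only when that raises total efficiency by more than the seats cost or when no assignment is feasible without them.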

In [46]:
dow = []
tcks = []
orig_mean_re = []
orig_min_re = []
orig_max_re = []
opt_mean_re = []
opt_min_re = []
opt_max_re = []


for t in dow_vc.index:
    
    ## select corresponding dow schedule
    temp = new3[new3.first_days == t].copy()
    
    ## select time chunks in which there are more than 1 course 
    time_cks = pd.Series(map(lambda x: [x[0],x[1]], temp[['first_begin_time','first_end_time']].values))
    time_cks_vc = time_cks.value_counts()
    tcv_index = time_cks_vc.index
    
    selected_time_cks = [tcv_index[i] for i,v in enumerate(time_cks_vc) if v > 1]
    
    ## if no window has more than one course, skip this DOW pattern
    ## consider the MWF case above
    if len(selected_time_cks) == 0:
        continue
    
    selected_begin_times, selected_end_times = zip(*[t for t in selected_time_cks])
    
    bools = list(map(lambda t: True if ([t[0], t[1]] in selected_time_cks) else False,
            zip(temp.first_begin_time, temp.first_end_time)))
    
    ## select courses in the selected time chunks
    ## the remaining courses stand alone in their windows
    temp = temp[bools]
    
    for idx in range(len(selected_begin_times)):
        
        ## select courses belonging to the idx-th selected chunk
        temp1 = temp[(temp.first_begin_time == selected_begin_times[idx])\
                & (temp.first_end_time == selected_end_times[idx])].copy()
        
        ## in some cases, two different courses use the same classroom
        ## drop them, since we should not arbitrarily assign two classes to the same room in the same window
        temp1.drop_duplicates(subset = ['first_room'], keep = False, inplace = True)
        
        ## there are cases where 2 classes with different course ids share the same room in the same window
        ## not sure of the logic behind this
        ## if that happens and those are the only courses in the window, the resulting dataframe will be empty
        if len(temp1) == 0:
            continue
            
        dow.append(t)
        tcks.append((selected_begin_times[idx], selected_end_times[idx]))
        
        ## create a course ID
        temp1['id'] = temp1.course + temp1.section
        
        ## calculating original average/min/max room efficiency for this window
        orig_mean = np.mean(temp1.seats_offered/temp1.classroom_capacity)
        orig_min = min(temp1.seats_offered/temp1.classroom_capacity)
        orig_max = max(temp1.seats_offered/temp1.classroom_capacity)
        orig_mean_re.append(orig_mean)
        orig_min_re.append(orig_min)
        orig_max_re.append(orig_max)
        
        course = temp1[['id','seats_offered']].set_index('id')
        room = temp1[['first_room','classroom_capacity']].set_index('first_room')

        mod = Model()

        I = course.index
        J = room.index

        x = mod.addVars(I, J, vtype = GRB.BINARY)
        y = mod.addVars(I,J, lb = 0, ub = 5, vtype = GRB.INTEGER)
        
        ## the objective is to maximize total room efficiency while limiting the extra seats used
        RE = sum(x[i,j]*course.loc[i,'seats_offered']/room.loc[j,'classroom_capacity'] for i in I for j in J)
        extra = sum(y[i,j] for i in I for j in J)
        mod.setObjective(RE - extra, sense = GRB.MAXIMIZE)

        for i in I:
            mod.addConstr(sum(x[i,j] for j in J) == 1)
            for j in J:
                mod.addConstr(x[i,j]*course.loc[i,'seats_offered'] <= y[i,j] + room.loc[j,'classroom_capacity'])
        for j in J:
            mod.addConstr(sum(x[i,j] for i in I) == 1)

        mod.setParam('outputflag',False)
        mod.optimize()
        
        ## there may be cases where the number of seats offered exceeds the room capacity
        ## this may be due to a shortage of rooms, or because expected registration is low
        ## if so, optimization will fail (note: y has an upper bound)
        ## it's rare, so these courses will remain in the same classroom
        try:
            opt_mean = RE.getValue()/len(temp1)
            m = []
            for i in I:
                for j in J:
                    if x[i,j].x:
                        m.append(x[i,j].x*course.loc[i,:].values[0]/room.loc[j,:].values[0])
            opt_min = min(m)
            opt_max = max([x[i,j].x*course.loc[i,:].values[0]/room.loc[j,:].values[0] for i in I for j in J])
        except Exception:
            opt_mean = 'Original Schedule Overfit'
            opt_min = 'Original Schedule Overfit'
            opt_max = 'Original Schedule Overfit'
         
        opt_mean_re.append(opt_mean)
        opt_min_re.append(opt_min)
        opt_max_re.append(opt_max)
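The original mean/min/max efficiencies collected in the loop can also be computed in one pass with a grouped aggregation. This is a sketch on hypothetical data, and it deliberately skips the loop's filtering steps (dropping single-course windows and duplicated rooms), so it illustrates only the statistics, not the full pipeline. The names `toy`, `room_efficiency`, and `orig_stats` are assumptions.

```python
import pandas as pd

# Hypothetical sections across two (day pattern, window) groups.
toy = pd.DataFrame({
    "first_days": ["MW", "MW", "TH", "TH"],
    "first_begin_time": ["10:00:00", "10:00:00", "12:00:00", "12:00:00"],
    "first_end_time": ["11:50:00", "11:50:00", "13:50:00", "13:50:00"],
    "seats_offered": [24, 48, 30, 36],
    "classroom_capacity": [48.0, 48.0, 40.0, 36.0],
})
toy["room_efficiency"] = toy["seats_offered"] / toy["classroom_capacity"]

# Mean/min/max efficiency per window: the same statistics the loop stores
# in orig_mean_re, orig_min_re and orig_max_re.
orig_stats = (toy.groupby(["first_days", "first_begin_time", "first_end_time"])
                 ["room_efficiency"].agg(["mean", "min", "max"]))
print(orig_stats)
```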

In [47]:
result_df = pd.DataFrame(dict(zip(['DOW','Time Chunk','Orig_RE_Mean','Orig_RE_Min','Orig_RE_Max',
                                   'Opt_RE_Mean','Opt_RE_Min','Opt_RE_Max'],
                                  [dow,tcks,orig_mean_re,orig_min_re,orig_max_re,
                                   opt_mean_re,opt_min_re,opt_max_re])))

In [48]:
# result_df.to_excel('Classroom_optimization_result.xlsx')

In [49]:
result_df.head()

Unnamed: 0,DOW,Time Chunk,Orig_RE_Mean,Orig_RE_Min,Orig_RE_Max,Opt_RE_Mean,Opt_RE_Min,Opt_RE_Max
0,MW,"(10:00:00, 11:50:00)",0.914923,0.61745,1.0,0.921404,0.516779,1.0
1,MW,"(12:00:00, 13:50:00)",0.915698,0.489933,1.013699,0.916269,0.496644,1.0
2,MW,"(16:00:00, 17:50:00)",0.903737,0.271375,1.016667,0.908113,0.237918,1.01667
3,MW,"(14:00:00, 15:50:00)",0.920496,0.563758,1.0,0.920675,0.563758,1.0
4,MW,"(14:00:00, 15:20:00)",0.797411,0.538462,1.0,0.81655,0.384615,1.0


In [50]:
## keep only the windows where the optimization succeeded (numeric results)
filter1_bools = [not isinstance(t, str) for t in result_df.Opt_RE_Mean]
result_df = result_df[filter1_bools]

# result_df['Opt_Room_Efficiency'] = result_df['Opt_Room_Efficiency'].astype(float)

# filter2_bools = list(map(lambda t: True if t > 0 else False, result_df.Opt_RE_Mean))
# result_df = result_df[filter2_bools]
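Another way to separate numeric results from the `'Original Schedule Overfit'` marker is to coerce the column to numbers and keep the rows that survive. The sketch below uses a hypothetical three-element Series in place of `result_df.Opt_RE_Mean`.

```python
import pandas as pd

# Hypothetical result column mixing floats with the string marker used by the loop.
opt_mean = pd.Series([0.92, "Original Schedule Overfit", 0.81], dtype=object)

# Non-numeric entries become NaN, so notna() gives the row mask.
numeric = pd.to_numeric(opt_mean, errors="coerce")
kept = numeric[numeric.notna()]
print(kept.tolist())
```

A side effect of this approach is that the kept values already have a float dtype, so later calls such as `np.mean` do not need an explicit `astype(float)`.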

In [51]:
print(f'Original Average Room Efficiency Mean: {100*np.mean(result_df.Orig_RE_Mean):.1f}%.')
print(f'Optimized Average Room Efficiency Mean: {100*np.mean(result_df.Opt_RE_Mean):.1f}%.')

Original Average Room Efficiency Mean: 85.9%.
Optimized Average Room Efficiency Mean: 86.3%.


In [52]:
print(f'Original Average Room Efficiency Min: {100*np.mean(result_df.Orig_RE_Min):.1f}%.')
print(f'Optimized Average Room Efficiency Min: {100*np.mean(result_df.Opt_RE_Min):.1f}%.')

Original Average Room Efficiency Min: 62.6%.
Optimized Average Room Efficiency Min: 60.1%.


In [53]:
print(f'Original Average Room Efficiency Max: {100*np.mean(result_df.Orig_RE_Max):.1f}%.')
print(f'Optimized Average Room Efficiency Max: {100*np.mean(result_df.Opt_RE_Max):.1f}%.')

Original Average Room Efficiency Max: 97.3%.
Optimized Average Room Efficiency Max: 97.5%.
