# First-Year Writing Seminars

**Objectives:**
* Assign students to FWS sections so that they get one of their top 5 choices.
* Improve assignments by making changes to our assignment problem formulation.

**Key Ideas:**
* integrality property
* the assignment problem
* the transportation problem

**Brief description:** If you recall pre-enroll, you completed a separate ballot listing your top 5 picks for a First-Year Writing Seminar (FWS) that semester. You were later notified which class you were placed into, probably hoping it was your first choice. By now, this should not seem like magic; problems like these often enlist help from Operations Research, especially as the scale increases. Disclaimer: the following model is not actually used by Cornell.

In [None]:
# imports -- make sure to run this cell
import pandas as pd
import math, itertools
import matplotlib.pyplot as plt
%matplotlib inline
import networkx as nx
from networkx.algorithms import bipartite
from fws_lab import *
from ortools.linear_solver import pywraplp as OR

## Part 0: As a Maximum Matching Problem

As a first attempt, we may notice that we can think of matching as many students as possible to one of their top choices as a maximum matching problem. 
Starting with a list of classes and a list of students, we construct a bipartite graph with students on one side and all seats in all classes on the other. We will create an edge between a student and all seats in their preferred classes. Then each edge in a matching indicates the assignment of a student to a class, and the maximum matching assigns as many students as possible.
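
The seat-expansion idea can be sketched without a graph library. The following is a minimal illustration, not the notebook's implementation: it swaps in a simple augmenting-path search (Kuhn's algorithm) for the networkx routine, and the ballots in `toy_prefs` and the name `max_matching_with_seats` are made up for this example.

```python
def max_matching_with_seats(preferences, capacity):
    """Size and assignment of a maximum matching between students and class seats.

    preferences: dict mapping each student to a list of acceptable classes.
    capacity: number of seats in every class.
    """
    # A student is connected to every seat (class, seat index) of every listed class.
    adj = {s: [(c, i) for c in prefs for i in range(capacity)]
           for s, prefs in preferences.items()}
    seat_owner = {}  # seat -> student currently holding it

    def try_assign(student, visited):
        """Search for an augmenting path starting at `student`."""
        for seat in adj[student]:
            if seat in visited:
                continue
            visited.add(seat)
            # Take a free seat, or evict the owner if they can move elsewhere.
            if seat not in seat_owner or try_assign(seat_owner[seat], visited):
                seat_owner[seat] = student
                return True
        return False

    matched = sum(try_assign(s, set()) for s in preferences)
    assignment = {student: seat[0] for seat, student in seat_owner.items()}
    return matched, assignment

# Hypothetical ballots: 5 students, every class has 2 seats.
toy_prefs = {1: ['A', 'B'], 2: ['A', 'B'], 3: ['A', 'B'], 4: ['A', 'C'], 5: ['A', 'B']}
size, assignment = max_matching_with_seats(toy_prefs, capacity=2)
print(size, assignment)
```

In this toy instance, four students list only A and B, so the four seats in those classes fill up and student 4 must take a seat in C for everyone to be matched.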

Below, we've implemented a function that computes the maximum matching and outputs the maximum number of students that can be assigned to one of their top k choices.

In [None]:
def max_matching_size(preferences, capacity, k=5):
    STUDENTS = list(preferences.index)                         # students
    
    CLASSES = []
    for c in preferences.columns:
        CLASSES = CLASSES + list(preferences[c].unique())
    CLASSES = list(set(CLASSES))                               # classes
    CLASS_NODES = itertools.product(CLASSES,range(capacity))
    
    graph = nx.Graph()
    graph.add_nodes_from(CLASS_NODES, bipartite=0)
    graph.add_nodes_from(STUDENTS, bipartite=1)
    
    for s in STUDENTS:
        for c in preferences:
            if int(c) <= k:                   # only consider the top k preferences
                for i in range(capacity):
                    # add an edge for every corresponding class node to the student
                    graph.add_edge((preferences.at[int(s),c],i),s) 
            
    match = nx.bipartite.maximum_matching(graph)
    match_size = len(match)//2              # match includes each matched edge twice
    print("Number of students matched:", match_size)
    print("")
    # Output what class the first few students were matched to
    print("Student\tClass")
    if len(STUDENTS) < 10:
        num_to_print = len(STUDENTS)
    else:
        num_to_print = 5
    for i in range(1,num_to_print+1):
        if i in match:
            print("%s\t%s" % (i, match[i][0]))
    if num_to_print < len(STUDENTS):
        print("...")
    # Also print how many students are in each class.
    print("")
    print("Class\tNumber of students")
    for cls in sorted(CLASSES)[:5]:
        num_students = sum(((cls,i) in match) for i in range(capacity))
        print("%s\t%s" % (cls, num_students))
    if len(CLASSES) > 5:
        print("...")

First, we will try it on a small example. Here, there are 7 students (1-7) and 4 classes (A-D). Each class has a capacity of 2 students. Column `1` gives the student's first preference and `2` gives their second preference.

In [None]:
fws_match_data = pd.read_csv('data/fws_7_students.csv', index_col=0)
display(fws_match_data)

We can visualize our data as a graph, where a class is connected to a student if it is one of their choices.

In [None]:
ex0(fws_match_data)

Now, let's compute a maximum matching to try to assign students to one of their choices.

In [None]:
max_matching_size(fws_match_data, 2, k=2)

This seems like a good solution: we have assigned all the students to one of their top two choices. Next, we will try this strategy on some real data.

In [None]:
# import data
fws_f09_ballots = pd.read_csv('data/fws_f09_ballots.csv', index_col=0)
fws_f09_ballots.head()

There are 2886 total students, each of whom listed their top five preferences. We assume that each class can fit 16 students. How well can we do if we want to match all students to either their first or second preference?

**Q:** Run the cell below to find how many students are matched to one of their top two choices.

**A:** <font color='blue'>2362</font>  

In [None]:
max_matching_size(fws_f09_ballots, 16, k=2)

**Q:** Edit the following cell to find out how many students we can match if we try to assign all to one of their top three choices. What about top four? Five? How many of their top choices do we need to consider to be able to match all the students? 

**A:** <font color='blue'>Top 3: 2695, top 4: 2871, top 5: 2886. We need to use all 5 preferences to match all the students.</font> 

In [None]:
# TODO: Uncomment and choose appropriate values of k
# max_matching_size(fws_f09_ballots, 16, k=XXX)

### BEGIN SOLUTION
max_matching_size(fws_f09_ballots, 16, k=5)
### END SOLUTION

So we can achieve our goal of matching all the students to one of their preferred classes, but we would like to do even better. In this maximum matching version, a student being matched to their first choice is no different than being matched to their fifth choice. That doesn't feel quite right: a solution where most students get their first choice should be better than a solution where most students get their fifth choice. We want to capture this idea by giving higher priority to higher preferences. But to do that, we will need a different model.
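
To make the limitation concrete, here is a small sketch with made-up ballots (the names `ballots` and `total_rank` are illustrative only). Two assignments match the same number of students, so a maximum matching treats them as equally good, yet their preference totals differ.

```python
# Hypothetical ballots: each student's classes in order of preference.
ballots = {1: ['A', 'B'], 2: ['B', 'A']}

def total_rank(assignment, ballots):
    """Sum of the preference ranks (1 = first choice) over all students."""
    return sum(ballots[s].index(c) + 1 for s, c in assignment.items())

everyone_first = {1: 'A', 2: 'B'}
everyone_second = {1: 'B', 2: 'A'}

# Both assignments match both students, so their matching sizes are equal.
assert len(everyone_first) == len(everyone_second) == 2

# The rank totals, however, are different.
print(total_rank(everyone_first, ballots), total_rank(everyone_second, ballots))  # 2 4
```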

## Part 1: Brainstorming

We want to *assign* a class to each student. This sounds like an assignment problem. Recall in our description of the assignment problem, we assign workers to tasks and try to minimize the cost of the assignment (for example, minimize the time to complete all the tasks). However, we can assign more than one student to a class. To handle this situation, it may be helpful to think of each seat in a class separately.

**Example 1**  
The following table gives an instance with 8 students (1-8) and 4 classes (A-D). Each class has a capacity of 2 students. Column `1` gives the student's first preference and `2` gives their second preference.

In [None]:
fws_example_1 = pd.read_csv('data/fws_8_students_0.csv', index_col=0)
display(fws_example_1)

**Q:** What are the "workers" in this example?  

**A:** <font color='blue'>Seats in classes</font>  

**Q:** How many workers do we have?

**A:** <font color='blue'> Two workers for every class, one for each seat.</font>  

**Q:** What are the "tasks" in this example?  

**A:** <font color='blue'>Students</font>  

**Q:** How many tasks do we have?

**A:** <font color='blue'>1 for each student.</font>  

**Q:** What are the costs of assigning a particular worker to a particular task?  

**A:** <font color='blue'> We will let the cost of assigning a student to their first choice be 1 and the cost of assigning them to their second choice be 2. There are many reasonable choices for costs, but the cost of the first choice should be less than the cost of the second choice, since we are minimizing cost and we want to encourage more first choices.</font>

**Q:** How can we represent the cost for a class that was not among a student's top preferences? (In this case, we never want to assign the student to this class.)

**A:** <font color='blue'>We can let the assignment cost be infinity (in practice, we use a very large number). This way, the optimal assignment cannot match them.</font>
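
As a rough illustration of the seats-as-workers view with a "very large number" standing in for infinity, the sketch below solves a tiny made-up instance by trying every seat ordering. The instance, the constant `BIG`, and the helper `cost` are assumptions for this example; brute force is only practical at this scale.

```python
from itertools import permutations

BIG = 10**6  # stand-in for infinity; assumed larger than any realistic total cost

# Hypothetical instance: 3 classes with one seat each, 3 students.
seats = [('A', 0), ('B', 0), ('C', 0)]                  # the "workers"
ballots = {1: ['A', 'B'], 2: ['A', 'C'], 3: ['A', 'B']}  # the "tasks"
students = list(ballots)

def cost(seat, student):
    """Rank of the seat's class on the student's ballot, or BIG if not listed."""
    cls = seat[0]
    prefs = ballots[student]
    return prefs.index(cls) + 1 if cls in prefs else BIG

def total(order):
    """Total cost when the i-th seat in `order` goes to the i-th student."""
    return sum(cost(seat, s) for seat, s in zip(order, students))

best = min(permutations(seats), key=total)
best_cost = total(best)
best_assignment = {s: seat[0] for seat, s in zip(best, students)}
print(best_cost, best_assignment)
```

Here student 2 ends up in class C (their second choice), because any assignment that gives student 2 class A forces student 1 or 3 into C at cost `BIG`.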

**Q:** Visualize the example as a bipartite graph by completing the dictionary of finite assignment costs. We can also think of these assignment costs as edge weights in our graph. This will be used later as the input to the LP model.

In [None]:
# TODO: Uncomment and complete the dictionary of assignment costs
# S = ['A','B','C','D']
# D = [1,2,3,4,5,6,7,8]
# E = {('A',1): , 
#      ('B',1): , 
#      ('B',2): ,
#      ('A',2): ,
#      ('D',3): , 
#      ('C',3): , 
#      ('B',4): , 
#      ('A',4): , 
#      ('C',5): , 
#      ('B',5): , 
#      ('A',6): , 
#      ('B',6): , 
#      ('B',7): , 
#      ('A',7): ,
#      ('A',8): , 
#      ('C',8):  }

### BEGIN SOLUTION
S = ['A','B','C','D']
D = [1,2,3,4,5,6,7,8]
E = {('A',1):1, 
     ('B',1):2, 
     ('B',2):1,
     ('A',2):2,
     ('D',3):1, 
     ('C',3):2, 
     ('B',4):1, 
     ('A',4):2, 
     ('C',5):1, 
     ('B',5):2, 
     ('A',6):1, 
     ('B',6):2, 
     ('B',7):1, 
     ('A',7):2,
     ('A',8):1, 
     ('C',8):2 }
### END SOLUTION


ex1(S, D, E)

Remember that in the above graph, the class nodes (on the left) each represent multiple seats. To complete the simple model, we need to answer a few more questions.  

**Q:** Currently, this problem doesn't have a feasible assignment. Five students list only classes A and B, but only four of them can be in one of those classes. How can we make sure our assignment problem always has a feasible assignment, while still preferring to match students to one of their top choices?


**A:** <font color='blue'>Add a dummy class with lots of seats (enough for all the students) and set the cost of assigning a student to a seat to be larger than assigning to any of their choices. This means that if we cannot assign a student to anything else, we will match it to the dummy class, but we will always prefer to match to a real class.</font>  

**Q:** What do we do if we have more seats than students? How can we make sure that both sides of our assignment problem have the same number of objects? (Note: we will not show this in our visualizations, but we should know how to resolve this issue.)


**A:** <font color='blue'>Add dummy students until there are the same number of seats as students. Set the cost of assigning one of these students to any class to also be large.</font>  
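
The two balancing steps can be sketched as follows. This is an illustration under assumptions, not part of the lab code: the function name `balance` and the `'dummy_student'` labels are made up here.

```python
def balance(classes, students, capacity):
    """Return (seats, padded_students) with equal lengths.

    A 'dummy' class gets one seat per student so every student can be placed;
    dummy students then fill whatever seats are left over.
    """
    seats = [(c, i) for c in classes for i in range(capacity)]
    seats += [('dummy', i) for i in range(len(students))]
    extra = len(seats) - len(students)
    padded = list(students) + [('dummy_student', i) for i in range(extra)]
    return seats, padded

seats, padded = balance(['A', 'B', 'C', 'D'], range(1, 9), capacity=2)
print(len(seats), len(padded))  # 16 16
```

With 4 classes of 2 seats and 8 students, the dummy class adds 8 seats, so 8 dummy students are needed to make both sides the same size.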

**Q:** Create lists of students and classes and a dictionary of assignment costs (edge weights in our graph) like the ones above, but account for the dummy class node (which is labeled `'dummy'`).

In [None]:
# TODO: Uncomment and complete the dictionary of assignment costs
# S_dummy = ['A','B','C','D','dummy']
# D_dummy = [1,2,3,4,5,6,7,8]
# E_dummy = {
#    ('A',1):1, ('B',1):2, ('B',2):1, ('A',2):2, ('D',3):1, ('C',3):2, ('B',4):1, ('A',4):2,
#    ('C',5):1, ('B',5):2, ('A',6):1, ('B',6):2, ('B',7):1, ('A',7):2, ('A',8):1, ('C',8):2,
#    ('dummy',1):XXX, ('dummy',2):XXX, ('dummy',3):XXX, ('dummy',4):XXX,
#    ('dummy',5):XXX, ('dummy',6):XXX, ('dummy',7):XXX, ('dummy',8):XXX}


### BEGIN SOLUTION
S_dummy = ['A','B','C','D','dummy']
D_dummy = [1,2,3,4,5,6,7,8]
E_dummy = {('A',1):1, ('B',1):2, ('B',2):1, ('A',2):2, ('D',3):1, ('C',3):2, ('B',4):1, ('A',4):2, 
     ('C',5):1, ('B',5):2, ('A',6):1, ('B',6):2, ('B',7):1, ('A',7):2, ('A',8):1, ('C',8):2,
     ('dummy',1):3, ('dummy',2):3, ('dummy',3):3, ('dummy',4):3, 
     ('dummy',5):3, ('dummy',6):3, ('dummy',7):3, ('dummy',8):3}
### END SOLUTION

ex1(S_dummy, D_dummy, E_dummy)

## Part 2: Solving

Let's use OR-Tools to define our mathematical model. 

In [None]:
def fws(preferences, capacity, cost, integer = False):
    """A model for solving a first-year writing seminar assignment problem.
    
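    Args:
        preferences (pd.DataFrame): Preferred classes for each student.
        capacity (int): Capacity of each class.
        cost (Dict): Dictionary from edge types (preference rank or 'dummy') to cost.
        integer (bool): If True, use integer decision variables instead of continuous ones.

    Returns:
        The OR-Tools solver and the dictionary of decision variables keyed by (class, student).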
    """
    STUDENTS = list(preferences.index)                         # students
    
    CLASSES = []
    for c in preferences.columns:
        CLASSES = CLASSES + list(preferences[c].unique())
    CLASSES = list(set(CLASSES))  + ['dummy']                  # classes
    
    edge_costs = {}                                 
    for s in STUDENTS:
        for c in preferences:
            edge_costs[(preferences.at[int(s),c],s)] = cost[int(c)]
    
    # add dummy edges
    dummy_edges = list(itertools.product(['dummy'], STUDENTS))
    
    # add dummy edge costs
    for edge in dummy_edges:
        edge_costs[edge] = cost['dummy']
        
    EDGES = list(edge_costs)                                   # edges
    
    # define model
    m = OR.Solver('fws', OR.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
    
    # decision variables
    x = {}    
    for i,j in EDGES:
        if integer:
            x[i,j] = m.IntVar(0, m.infinity(), ('(%s, %s)' % (i,j)))
        else:
            x[i,j] = m.NumVar(0, m.infinity(), ('(%s, %s)' % (i,j)))
        
    # objective function
    m.Minimize(sum(edge_costs[i,j]*x[i,j] for i,j in EDGES))
    
    # subject to: each class is assigned to no more students than the capacity
    for k in CLASSES:
        m.Add(sum(x[i,j] for i,j in EDGES if i==k) <= capacity)
       
    # subject to: each student is assigned at least one class
    for k in STUDENTS:
        m.Add(sum(x[i,j] for i,j in EDGES if j==k) >= 1)
    
    return m,x

In [None]:
def solution_summary(preferences, x):
    '''Get the counts of every assigned preference for some solution.'''
    counts = {int(i):0 for i in list(preferences.columns)}
    unassigned = 0
    matches = {k[1] : k[0] for k,v in x.items() if v.solution_value() > 0.9 and k[0] != 'dummy'}
    for index, row in preferences.iterrows():
        class_to_rank = {v:k for k,v in row.to_dict().items()}
        if index in matches:
            pref=int(class_to_rank[matches[index]])
            counts[pref] += 1
        else:
            unassigned +=1
    print("Unmatched students:", unassigned)
    return counts

**Q:** Replace `XXX` with the dummy edge costs and then run the cell.

In [None]:
# TODO: Uncomment and replace XXX with the cost of the dummy edges
# costs = {1:1, 2:2, 'dummy':XXX}

### BEGIN SOLUTION
costs = {1:1, 2:2, 'dummy':3} # any larger value is also fine.
### END SOLUTION

m,x = fws(fws_example_1, 2, costs)
m.Solve()
solution_summary(fws_example_1, x)

6 students got their first choice, 1 student got their second choice, and 1 student was left unmatched.   

**Q:** Why can't all the students be assigned their top two choices? Or do you think the answer you got could be better?  

**A:** <font color='blue'>Limited capacity of classes</font> 

**Example 2**  
In this new instance with 8 students (1-8) and 4 classes (A-D), more students prefer A and B than C and D.

In [None]:
fws_8_students = pd.read_csv('data/fws_8_students.csv', index_col=0)
display(fws_8_students)

In [None]:
S2 = ['A','B','C','D']
D2 = [1,2,3,4,5,6,7,8]
E2 = {('A',1):1, ('B',1):2, ('B',2):1, ('A',2):2, ('C',3):1, ('D',3):2, ('A',4):1, ('B',4):2, 
      ('B',5):1, ('A',5):2, ('A',6):1, ('C',6):2, ('C',7):1, ('D',7):2, ('A',8):1, ('D',8):2}

ex2(S2, E2)

This example will show why it is important to be careful with our choices of edge costs. We start with letting the edge weights from the first choice, second choice, and dummy nodes be 1, 3, and 4 respectively.

In [None]:
# cost of assigning to dummy class is 4
costs = {1:1, 2:3, 'dummy':4}
m,x = fws(fws_8_students, 2, costs)
m.Solve()
solution_summary(fws_8_students, x)

There is 1 unmatched student when using 4 as the dummy assignment cost.  

**Q:** Why can't all the students be assigned one of their top 2 choices? Or do you think the answer you got could be better?  

**A:** <font color='blue'>By adjusting the costs of assigning to the dummy class, we can find an assignment in which all students are matched to a class.</font>  

Re-solve using 6 as the dummy cost.

In [None]:
costs = {1:1, 2:3, 'dummy':6}
m,x = fws(fws_8_students, 2, costs)
m.Solve()
solution_summary(fws_8_students, x)

**Q:** In both examples, the dummy costs of 4 and 6 are the largest weights. In some sense, the dummy edges are prioritized last: we would rather assign students to the classes with costs 1 and 3. However, we get different solutions. Why?

**A:** <font color='blue'>Compare the combined cost of one first-choice class plus one dummy assignment with the cost of two second-choice classes: $1+4 < 3+3$ when the dummy cost is 4, while $1+6 > 3+3$ when the dummy cost is 6.</font>  
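
The comparison can be written out as a few lines of arithmetic. This sketch only re-computes the two sums from the answer; the names `pair_cost`, `first_plus_dummy`, and `two_seconds` are illustrative.

```python
def pair_cost(option, costs):
    """Total cost of the two assignments listed in `option`."""
    return sum(costs[kind] for kind in option)

first_plus_dummy = [1, 'dummy']  # one first choice, one unmatched student
two_seconds = [2, 2]             # both students get their second choice

decisions = {}
for dummy in (4, 6):
    costs = {1: 1, 2: 3, 'dummy': dummy}
    a = pair_cost(first_plus_dummy, costs)
    b = pair_cost(two_seconds, costs)
    decisions[dummy] = 'leave one unmatched' if a < b else 'match both'
    print(dummy, a, b, decisions[dummy])
```

With a dummy cost of 4 the totals are 5 and 6, so leaving one student unmatched is cheaper; with a dummy cost of 6 they are 7 and 6, so matching both students wins.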

When solving the actual data, you will see that other subtle reasons might lead to unmatched students.

## Part 3: Solving the Actual Data

There are 2886 students and 183 class sections. Assume each class can have at most 16 students. As we already know, each student picks their top 5 classes.

In [None]:
fws_f09_ballots = pd.read_csv('data/fws_f09_ballots.csv', index_col=0)
fws_f09_ballots.head()

There are 6 columns with the first being student # and the other 5 being first, second, third, fourth, and fifth choice. Each row is a student, and class # indicates the class picked as the choice belonging to the column.

In [None]:
costs = {1:1, 2:2, 3:3, 4:4, 5:5, 'dummy':6}
m,x = fws(fws_f09_ballots, 16, costs)
m.Solve()
original  = solution_summary(fws_f09_ballots, x)
Histo(original, 15)

We got an answer! Unfortunately, there are 16 students who did not get any of their top 5 picks. Let's improve our model, so that no students are unmatched.  

The objective function is actually a weighted sum. The coefficients dictate how desirable each of the corresponding edges is. An edge with a small cost (weight) is more likely to be in the solution, while an edge with a large cost will tend to be avoided.

Let's leave the costs for the first - fifth choices (1-5) the same but set the cost of edges from the dummy node to an arbitrarily large number like 100,000. These edges are now very likely to be avoided in a solution.
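
A small sketch of why the dummy cost matters: the two solutions below are invented for a 10-student instance (they are not taken from the real data), and `total_cost` is a helper defined only for this illustration.

```python
def total_cost(counts, costs):
    """Objective value of a solution, given how many students got each outcome."""
    return sum(costs[outcome] * n for outcome, n in counts.items())

# Two hypothetical solutions for 10 students.
some_unmatched = {1: 9, 2: 0, 3: 0, 4: 0, 5: 0, 'dummy': 1}
all_matched = {1: 6, 2: 0, 3: 0, 4: 0, 5: 4, 'dummy': 0}

small = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 'dummy': 6}
large = {**small, 'dummy': 100000}

print(total_cost(some_unmatched, small), total_cost(all_matched, small))  # 15 26
print(total_cost(some_unmatched, large), total_cost(all_matched, large))  # 100009 26
```

Under the small dummy cost, the solution that leaves a student unmatched has the lower objective value; under the large dummy cost, the fully matched solution is far cheaper.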

**Q:** What is the real-world interpretation of having fewer edges from the dummy node in the solution?

**A:** <font color='blue'> If fewer edges from the dummy node are in the solution, fewer students are assigned to the dummy course. Hence, more students are getting one of their top 5 preferred courses. </font>  

**Q:** Set the cost of the dummy edges to 100,000. How many students are unmatched now?

**A:** <font color='blue'> There are now 0 unmatched students! </font>  

In [None]:
# TODO: Set the cost of the dummy edges to 100,000
#costs = {1:1, 2:2, 3:3, 4:4, 5:5, 'dummy':XXX}

### BEGIN SOLUTION
costs = {1:1, 2:2, 3:3, 4:4, 5:5, 'dummy':100000}
### END SOLUTION

m,x = fws(fws_f09_ballots, 16, costs)
m.Solve()
large_dummy_weight  = solution_summary(fws_f09_ballots, x)
Histo(large_dummy_weight, 15)

**Q:** Compare the distribution of received student preferences between our original solution and the new one with zero unmatched students.

**A:** <font color='blue'> All students are now matched, but the distribution is less skewed towards the first preference. Fewer students got their first or second preference and more got their third through fifth preference. </font>  

You may have noted that fewer students got their first choice. What if we want a solution that maximizes the number of students receiving their first choice, then maximizes the number receiving their second choice, and so on, all subject to the number of unmatched students being minimized (in this case, we know there should be zero unmatched students)? It turns out we can achieve complex behavior like this just by setting our weights cleverly.

By setting the weight of the dummy edges to be multiple orders of magnitude greater than the other edge weights, we essentially achieved a model that first minimizes the number of unmatched students without considering any preferences. This is because whether a student is matched or unmatched has a significantly larger effect on the value of a solution than which preference a student receives. We can apply this same approach again! 

We will use a dummy cost of 100,000 again. Now, we want to maximize the number of first preferences received. To do this, we set the cost of these edges to a large negative value whose magnitude is an order of magnitude smaller than the cost of the dummy edges: -10,000. We set the remaining edge weights in the same fashion.

**Q:** Choose the correct edge weights for the second through fifth preference edges. 

In [None]:
# TODO: Choose the correct edge weights
# costs = {1:-10000, 2:XXX, 3:XXX, 4:XXX, 5:XXX, 'dummy':100000}

### BEGIN SOLUTION
costs = {1:-10000, 2:-1000, 3:-100, 4:-10, 5:0, 'dummy':100000}
### END SOLUTION

m,x = fws(fws_f09_ballots, 16, costs)
m.Solve()
max_lexico  = solution_summary(fws_f09_ballots, x)
Histo(max_lexico, 15)

**Q:** How does the number of students with their first preference compare to the previous solutions? What was the number of unmatched students? Comment on the distribution you observe.

**A:** <font color='blue'> This solution met nearly 200 more first preferences, 1747 in total, while maintaining zero unmatched students. Satisfying so many first preferences restricted the solution greatly, so the remaining students are relatively evenly distributed among preferences 2-5.</font>  

What if we want to take the opposite approach? Again, we want to minimize unmatched students. However, rather than next maximizing the number of students who receive their first preference, we want to minimize the number of students who receive their last (fifth) preference. Then, we minimize the number of students who receive their fourth preference, and so on.

**Q:** Choose costs that achieve the described objective. (Hint: Use different orders of magnitude to choose the order in which objectives are considered. Negative edge weights act like maximizing and positive edge weights act like minimizing.)

In [None]:
# TODO: Choose costs that achieve the described objective.
# costs = {1:XXX, 2:XXX, 3:XXX, 4:XXX, 5:XXX, 'dummy':XXX}

### BEGIN SOLUTION
costs = {1:0, 2:10, 3:100, 4:1000, 5:10000, 'dummy':100000}
### END SOLUTION

m,x = fws(fws_f09_ballots, 16, costs)
m.Solve()
min_lexico  = solution_summary(fws_f09_ballots, x)
Histo(min_lexico, 15)

**Q:** How does the number of students with their fifth preference compare to the previous solutions? Compare this solution to the other solutions.

**A:** <font color='blue'> This solution has zero unmatched students and only 15 students with their fifth preference. Because the model is already more restricted by the time it considers higher preferences, the number of students with their first preference is significantly lower at only 1205.</font>  

As you can see, there is a wide variety of solutions with zero unmatched students. How did the Knight Institute choose a solution? They opted for a combination of the previous two models: the numbers of fifth and then fourth preferences were minimized, and then the numbers of first, second, and then third preferences were maximized. Let's see the solution! 

In [None]:
costs = {1:-100, 2:-10, 3:0, 4:1000, 5:10000, 'dummy':100000}
m,x = fws(fws_f09_ballots, 16, costs)
m.Solve()
knight  = solution_summary(fws_f09_ballots, x)
Histo(knight, 15)

**Q:** Compare this solution to the previous solutions. Do you think the Knight Institute made the right decision? How would you have done it differently?

**A:** <font color='blue'> We can see this solution has the minimum number of fourth and fifth preferences received but then maximizes the number of first preferences, achieving nearly 160 more first preferences satisfied than the previous solution. As a result, there are fewer second preferences but more third preferences.</font>  

In each of the previous models, we set the weights of different edges at different orders of magnitude to essentially order our objectives. However, an order of magnitude seems arbitrary. How can we determine exactly what multiplier is needed to guarantee that the model gives us the solution we want?
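
To see why a factor of 10 is not a guarantee in general, consider the sketch below. The two 20-student solutions are invented for illustration and may not both be feasible in any real instance; the costs are the "maximize first preferences" weights used earlier, and `total_cost` is a helper defined only here.

```python
def total_cost(counts, costs):
    """Objective value of a solution, given how many students got each outcome."""
    return sum(costs[outcome] * n for outcome, n in counts.items())

costs = {1: -10000, 2: -1000, 3: -100, 4: -10, 5: 0, 'dummy': 100000}

# Two hypothetical solutions for 20 students, all matched.
fewer_firsts = {1: 5, 2: 12, 3: 3, 4: 0, 5: 0, 'dummy': 0}
more_firsts = {1: 6, 2: 0, 3: 14, 4: 0, 5: 0, 'dummy': 0}

print(total_cost(fewer_firsts, costs))  # -62300
print(total_cost(more_firsts, costs))   # -61400
```

The solution with fewer first preferences has the lower (better) objective value, because twelve second preferences together outweigh one first preference. With a multiplier of only 10, the model is therefore not guaranteed to maximize first preferences before anything else.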

Let's consider a simple example in which there is only one preference, which is either met or not. We want to minimize the number of unmatched students. Let's assume we have 100 students that must be assigned. We need to choose weights $w_1$ and $w_{\text{dummy}}$.

**Q:** If one student is unmatched, what is the maximum total weight the other 99 students can achieve?

**A:** <font color='blue'> $99w_1$</font>  

**Q:** Using your answer to the previous question, give an inequality that your weights must satisfy.

**A:** <font color='blue'> $99w_1 < w_{\text{dummy}}$</font>  

**Q:** What multiplier needs to be used to ensure that the number of unmatched students is minimized?

**A:** <font color='blue'> $100$</font>  
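
The counting argument can be checked by brute force on small instances. This is a sketch under simplifying assumptions that are not part of the lab: every student receives exactly one outcome, outcomes are ordered from best to worst, and outcome `i` has weight `multiplier**i`. All function names are illustrative.

```python
from itertools import product

def count_vectors(n, levels):
    """All ways to split n students across `levels` outcomes (best outcome first)."""
    return [c for c in product(range(n + 1), repeat=levels) if sum(c) == n]

def weighted_cost(counts, multiplier):
    """Outcome i has weight multiplier**i, so worse outcomes cost much more."""
    return sum(c * multiplier**i for i, c in enumerate(counts))

def is_lexicographic(n, levels, multiplier):
    """True if ordering by weighted cost matches minimizing the worst outcome first,
    then the next worst, and so on."""
    vectors = count_vectors(n, levels)
    by_cost = sorted(vectors, key=lambda c: weighted_cost(c, multiplier))
    by_priority = sorted(vectors, key=lambda c: c[::-1])
    return by_cost == by_priority

# With 5 students, a multiplier equal to the number of students orders the objectives.
print(is_lexicographic(5, 3, 5))
# A multiplier of 2 is too small for 5 students.
print(is_lexicographic(5, 3, 2))
```

In these small cases, a multiplier equal to the number of students is enough, which mirrors the answer of 100 for 100 students; the check says nothing about weight schemes that use zero or negative costs.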