# First-Year Writing Seminars

**Objectives:**
* Assign students to FWS sections so that they get one of their top 5 choices.
* Improve assignments by making changes to our transportation formulation.

**Key Ideas:**
* the transportation problem
* the assignment problem

**Brief description:** If you recall pre-enroll, there was a separate ballot you completed by listing your top 5 picks for FWS that semester. You were later notified which class you got placed into, probably hoping it was your first choice. By now, this should not seem like magic; problems like these often enlist help from Operations Research, especially as the scale increases. Disclaimer: the following model is not actually used by Cornell.

In [None]:
# imports
from ortools.linear_solver import pywraplp as OR
import pandas as pd
import numpy as np
from fws_lab_new import inputData

## Part 1: Brainstorming

We want to *assign* a class to each student. This sounds like an assignment problem, which is a special case of the *transportation problem*. 

**Example 1**  
The following table gives a sample input with 8 students (1-8) and 4 classes (A-D). Each row lists a student along with their first and second class preferences. Assume each class has a capacity of 2 students. 

| Student | First | Second |
|:-------:|:-----:|:------:|
|    1    |   A   |    B   |
|    2    |   B   |    A   |
|    3    |   C   |    D   |
|    4    |   A   |    B   |
|    5    |   B   |    A   |
|    6    |   A   |    C   |
|    7    |   C   |    D   |
|    8    |   A   |    D   |

**Q:** What are the supply nodes $i$?  

**A:** <font color='blue'>A supply node $i$ corresponds to student $i$, $i \in \{1,...,8\}$.</font>

**Q:** What are the supply values $s_i$ of the supply nodes $i$? (Hint: How many units can be transferred?)

**A:** <font color='blue'>$s_i = 1$ for each supply node $i$.</font>

**Q:** What are the demand nodes $j$?  

**A:** <font color='blue'>A demand node $j$ corresponds to class $j$, $j \in \{A,B,C,D\}$.</font>

**Q:** What are the demand values $d_j$ of the demand nodes? (Hint: How many units can be received?)

**A:** <font color='blue'>$d_j = 2$ for each demand node $j$.</font>

**Q:** What does a directed edge $(i,j)$ from a supply node $i$ to a demand node $j$ indicate?  

**A:** <font color='blue'>A directed edge $(i,j)$ indicates that we can ship flow from supply node $i$ to demand node $j$, that is, "assign" student $i$ to class $j$.</font>

For each directed edge $(i,j)$, there is a corresponding edge cost $c(i,j)$. For now, we will just worry about finding feasible solutions, so we assume $c(i,j) = 1$ for all $(i,j)$.

**Q:** Is this a balanced input for the transportation problem? How do you know?

**A:** <font color='blue'>Yes, because the total supply is equal to the total demand: $\sum_{i} s_i = \sum_{j} d_j = 8$.</font>
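The balance condition can also be checked numerically. The sketch below hard-codes the supply and demand values from Example 1; the variable names are illustrative and are not part of the lab's helper module.

```python
# Example 1: eight students (supply 1 each) and four classes (demand 2 each).
supply = {student: 1 for student in range(1, 9)}
demand = {cls: 2 for cls in "ABCD"}

total_supply = sum(supply.values())
total_demand = sum(demand.values())

# A transportation input is balanced when total supply equals total demand.
is_balanced = total_supply == total_demand
print(total_supply, total_demand, is_balanced)
```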

Run the cell below to visualize the graph for this input. (First-preference edges are in blue; second-preference edges are in orange.)

In [None]:
from fws_lab_new import small_ex

supply_nodes = [1,2,3,4,5,6,7,8]       # students are the supply nodes
demand_nodes = ['A','B','C','D']       # classes are the demand nodes
edges_and_costs = {(1,'A'):1, (1,'B'):1, (2,'B'):1, (2,'A'):1, (3,'C'):1, (3,'D'):1, (4,'A'):1, (4,'B'):1, 
                   (5,'B'):1, (5,'A'):1, (6,'A'):1, (6,'C'):1, (7,'C'):1, (7,'D'):1, (8,'A'):1, (8,'D'):1}

small_ex(supply_nodes, demand_nodes, edges_and_costs)

**Q:** Give a feasible solution for this input by listing each class and the students assigned to it. Also, give the objective value of the solution. (You can look back at the table if that's easier.)

**A:** <font color='blue'>Answers may vary, but every feasible solution falls under one of the following two cases: either $C = \{3,6\}$ and $D = \{7,8\}$, or $C = \{6,7\}$ and $D = \{3,8\}$. The objective value of any solution is 8, since the unit cost of sending flow across any edge (that is, the cost of assigning a student) is 1 and we assign 8 students. </font>
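For an input this small, the claim about the two cases can be checked by brute force. The following is a minimal sketch, independent of the lab's helper functions, that tries every way of giving each student one of their two listed classes and keeps the assignments that fill every class exactly.

```python
from itertools import product

# Preferences from the Example 1 table: student -> (first choice, second choice).
prefs = {1: "AB", 2: "BA", 3: "CD", 4: "AB", 5: "BA", 6: "AC", 7: "CD", 8: "AD"}
capacity = 2
students = sorted(prefs)

feasible = []
# Try every way of giving each student one of their two listed classes.
for choice in product(*(prefs[s] for s in students)):
    roster = {cls: [] for cls in "ABCD"}
    for s, cls in zip(students, choice):
        roster[cls].append(s)
    # Feasible means every class receives exactly `capacity` students.
    if all(len(members) == capacity for members in roster.values()):
        feasible.append(roster)

# Collect the distinct (C, D) rosters that appear among feasible assignments.
cd_cases = {(tuple(r["C"]), tuple(r["D"])) for r in feasible}
print(len(feasible), sorted(cd_cases))
```

Students 1, 2, 4, and 5 only list A and B, so they fill those two classes between them. That forces student 6 into C and student 8 into D, leaving only students 3 and 7 to swap.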

**Example 2**  
The following year, there are only 7 students, again having two preferences each.

In [None]:
ex2 = pd.read_csv('fws_7_students.csv', index_col=0)
display(ex2)

In [None]:
costs = {1:1, 2:1}
S, D, E = inputData(ex2, costs) # a function that transforms a table of students' preferences into a graph input

# small_ex(S, D, E)

**Q:** Do feasible solutions exist for this new input? How do you know?

**A:** <font color='blue'>No, since the total supply (7) is less than the total demand (8).</font>

**Q:** What changes to the graph can we make to ensure all the demand can be met? (Hint: think about a new node.)

**A:** <font color='blue'>We can add a dummy supply node and directed edges from said node to every demand node. This allows us to send as much additional supply as is necessary to satisfy demands.</font>

In [None]:
costs = {1:1, 2:1, 'dummy':1}
S_dummy, D_dummy, E_dummy = inputData(ex2, costs)

small_ex(S_dummy, D_dummy, E_dummy)

## Part 2: Solving

Let's see if we can find a suitable assignment of students. To do this, we'll use a Python model, defined below.

Here's what's in the code:
* Flow variables $x[i,j]$ that tell us how much flow (i.e., how many student "units") we ship across each edge in the graph.
* Our objective function: we want to maximize the number of students assigned, that is, maximize the flow values.
* A constraint specifying that each student can be assigned at most one class. (The dummy supply node has no such restriction.)
* A constraint specifying that each class demand node must receive a number of students equal to its capacity.

Don't worry about being able to understand the code yet&mdash;just read through it and run the cell below to test the function on our input.

In [None]:
def simpleAssign(preferences, costs, csize):
    """A model for solving simple instances of the first-year writing seminar assignment problem.
    
    Args:
        preferences (pd.DataFrame): Preferred classes for each student.
        costs (Dict): Dictionary from edge types to unit costs.
        csize (int): Capacity of the classroom.        
    """
    students, classes, edges = inputData(preferences, costs)
    EDGES = list(edges.keys())      # create edge list
    
    c = edges.copy()                # define c[i,j]
    
    # define model
    m = OR.Solver('FWS', OR.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
    
    # decision variables
    x = {} 
    for i,j in EDGES:
        x[i,j] = m.IntVar(0, m.infinity(), ('(%s, %s)' % (i,j))) # x[i,j] == units shipped on edge (i,j)
        
    # define objective function here
    m.Maximize(sum(x[i,j] for i,j in EDGES)) # maximize flow
       
    # add constraint to ensure each student (besides the dummy) is assigned at most one class
    for k in students:
        if k != 'dummy':
            m.Add(sum(x[i,j] for i,j in EDGES if i==k) <= 1)
        
    # add constraint to ensure each class is filled to capacity
    for k in classes:
        m.Add(sum(x[i,j] for i,j in EDGES if j==k) == csize)
    
    # solve
    m.Solve()
    
    return m,x

In [None]:
# print solution details
def print_sol(m, x):
    print('Objective value:', m.Objective().Value())

    print('Flows across each edge:')
    for var in m.variables():
        print(var.name(), ':', var.solution_value())

In [None]:
m,x = simpleAssign(ex2, costs, 2)
print_sol(m,x)

**Q:** The function outputs the flows across each edge. In words, interpret the output. How did the solver assign students? Is this the result we were hoping for? 

**A:** <font color='blue'>In our transportation formulation, a unit of flow along an edge $(i,j)$ corresponds to a student $i$ being assigned to a class $j$. Here, we have flows of value 2 from the dummy supply node to each class ABCD, and all other flows are zero. So the solver assigned 2 dummy students to each class, but didn't actually assign any real students--so the output gives us nothing in terms of a usable solution.</font>

Let's see if we can find a more useful solution by adjusting our input and model.

**Q:** In some sense, we need the solver to value assigning "real" students to classes over assigning fake "dummy" students&mdash;that is, we want to make it so that the solver chooses to send flow across edges emanating from student supply nodes, rather than the dummy supply node. How should we adjust our input to implement this change? (Hint: given two edges, how might a computer quantitatively compare them?)

**A:** <font color='blue'>Revise the edge costs for dummy edges to be higher than those for student edges. This will entice the solver into picking the student edges, rather than the dummy edges (if it has the choice).</font>

In our code for the model, we'll make one small change. Our current objective function maximizes the total flow shipped across edges, regardless of cost: 

<code>m.Maximize(sum(x[i,j] for i,j in EDGES))</code>

**Q:** Give an argument as to why our current model will always give a solution with objective value 8.

**A:** <font color='blue'> To prove this, we argue that (1) any feasible solution has objective value 8 and (2) a feasible solution exists.

(1) We've specified in our constraints that each class node receives 2 students (real or fake). Thus, any feasible assignment has 4(2) = 8 students assigned, i.e., 8 units of flow.

(2) Since we have an unlimited supply of dummy filler students, we know we can always find a feasible assignment by just filling up each class with these students, as we saw in the output above.

Thus our model will always find a feasible solution, and whatever that solution is, it has objective value 8.</font>

If you understood the argument above, it should make sense that <code>sum(x[i,j] for i,j in EDGES)</code> is always 8. So which edges $(i,j)$ should the solver choose to ship flow on? Recalling our earlier answer about edge costs, in choosing between edges we'd rather ship flow on edges with smaller cost. Putting this all together, our objective function becomes the following:

<code>m.Minimize(sum(c[i,j]*x[i,j] for i,j in EDGES))</code>

Note that only edges $(i,j)$ that actually ship flow (i.e., <code>x[i,j] > 0</code>) contribute to the cost of a solution.

Run the cell below, which defines a function <code>Assign</code>. It's identical to the <code>simpleAssign</code> function above, but implements our new and improved objective function.

In [None]:
def Assign(preferences, costs, csize):
    """A model for solving an FWS assignment problem.
    
    Args:
        preferences (pd.DataFrame): Preferred classes for each student.
        costs (Dict): Dictionary from edge types to unit costs.
        csize (int): Capacity of the classroom.        
    """
    students, classes, edges = inputData(preferences, costs)
    EDGES = list(edges.keys())      # create edge list
    
    c = edges.copy()                # define c[i,j]
    
    # define model
    m = OR.Solver('FWS', OR.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
    
    # decision variables
    x = {}  # units to be shipped on each edge
    for i,j in EDGES:
        x[i,j] = m.IntVar(0, m.infinity(), ('(%s, %s)' % (i,j))) 
        
    # define objective function here
    m.Minimize(sum(c[i,j]*x[i,j] for i,j in EDGES))
       
    # add constraint to ensure each student (besides the dummy) is assigned at most one class
    for k in students:
        if k != 'dummy':
            m.Add(sum(x[i,j] for i,j in EDGES if i==k) <= 1)
        
    # add constraint to ensure each class is filled to capacity
    for k in classes:
        m.Add(sum(x[i,j] for i,j in EDGES if j==k) == csize)
    
    # solve
    m.Solve()
    
    return m,x

Now, try updating the cost of shipping flow across "dummy" edges below to be larger than the cost of shipping flow across "student" edges. Then run the cell below to see if we get a better solution.

In [None]:
# TODO: Update the unit cost of dummy edges to be 2. (Remember that the unit cost of regular student edges is 1.)
# costs['dummy'] = XXX

### BEGIN SOLUTION
costs['dummy'] = 2
### END SOLUTION

In [None]:
m,x = Assign(ex2, costs, 2)
print_sol(m,x)

from fws_lab_new import solution_summary # a function to help print and analyze solutions
solution_summary(ex2, x)

Nice! We now have a solution that assigns all 7 students to a class. 

The next logical step is to ask: can we do better? For instance, can we find an assignment where more students have their first choice?

Below, we set the costs of our different edge types so that the cheapest (most likely to be picked) edges are the first-preference edges, then the second-preference edges, then the dummy. Run the cell to see what happens.

In [None]:
costs = {1:1, 2:2, 'dummy':3}
m,x = Assign(ex2, costs, 2)
solution_summary(ex2, x)

**Q:** Compare the two solutions above. Which one do you think is better? Why? (Hint: there's no "right" answer!)

**A:** <font color='blue'>This question is all about recognizing the *trade-offs* between two different solutions. One might choose the first solution over the second solution because it assigns every student. On the other hand, one might go with the second solution because more students are assigned their "top choice" class.</font>

Re-solve using 4 as the cost of dummy edges.

In [None]:
costs = {1:1, 2:2, 'dummy':4}
m,x = Assign(ex2, costs, 2)
solution_summary(ex2, x)

**Q:** In the previous two solutions, we set the cost of dummy edges to be 3 and then 4. In both cases, by making the cost greater than 2, we ensured these edges were "prioritized last" in deciding which edges to include in a solution; the solver would rather send flow across cheaper edges with unit cost 1 or 2. However, the solutions are different. Why?

**A:** <font color='blue'> In the first solution, the solver probably opted to include a dummy edge that allowed it to "save later" by assigning more first choices. In the second solution, the cost of the dummy edges was high enough that this strategy of sacrificing an assignment to save costs in other ways wasn't worth it anymore.
    
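It may help to think about the cost of the solutions. The cost of the first solution is 1(6) + 2(0) + 3(1) = 9. The cost of the second solution is 1(5) + 2(2) + 4(0) = 9. (These totals count only the dummy unit that takes the place of a real student. With 8 seats and 7 students, every solution also ships one dummy unit to fill the leftover seat, which adds the same amount to both solutions and so does not affect the comparison.) Notice that, if we take the first solution's assignment and raise the dummy cost to 4, the solution's cost increases to 10. In the second solution, the solver is able to find a cheaper assignment (one that assigns every student) that costs 9 < 10, so it assigns students this way instead.</font>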
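The same comparison can be written out with every dummy unit counted, including the one that must fill the eighth seat in any assignment. This is a small sketch using the counts quoted in the answer above (6 first choices with one student unassigned, versus 5 first and 2 second choices); the function and variable names are illustrative.

```python
def total_cost(n_first, n_second, n_dummy, dummy_cost):
    """Objective value when first choices cost 1 and second choices cost 2."""
    return 1 * n_first + 2 * n_second + dummy_cost * n_dummy

SEATS = 4 * 2  # four classes with two seats each

# (first choices, second choices) in each of the two solutions discussed above.
solutions = {"six firsts, one unassigned": (6, 0), "everyone assigned": (5, 2)}

for dummy_cost in (3, 4):
    for name, (n_first, n_second) in solutions.items():
        n_dummy = SEATS - n_first - n_second  # dummies fill every remaining seat
        print(dummy_cost, name, total_cost(n_first, n_second, n_dummy, dummy_cost))
```

With a dummy cost of 3 the two assignments cost the same, so the solver may return either one. With a dummy cost of 4 the assignment that places every student is strictly cheaper.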

## Part 3: Solving with the Actual Data

In the actual data from the Spring 2021 semester, there are 2285 students and 141 class sections. Per the Knight Institute's rules, each class can have at most 17 students. As you already know, each student picks their top 5 classes. 

In [None]:
s21 = pd.read_csv('s21_fws_ballots.csv', index_col=0)
s21.head()

Each row in the data frame corresponds to a student (1-2285) and that student's first, second, third, fourth, and fifth choice FWS class (out of 141). 

Let's try running our model on the actual data! Since we now have 5 preferences per student, we define the cost of a student being assigned their $k$th preference to be $k$, and the cost of assigning a dummy student to be 6. (This cost can be interpreted as the cost of *not assigning a student* to one of their top five choices&mdash;do you understand why?)

In [None]:
costs = {1:1, 2:2, 3:3, 4:4, 5:5, 'dummy':6}
m,x = Assign(s21, costs, 17)
original = solution_summary(s21, x)

from fws_lab_new import Histo # a function to view solutions as a histogram
Histo(original)

We got an answer! Unfortunately, there are 72 students who were not assigned any of their top 5 picks. 

What if we set the cost of assigning 'dummy' students to be an absurdly high number, like 100,000? Try it out by running the cell below.

In [None]:
costs['dummy'] = 100000
    
m,x = Assign(s21, costs, 17)
large_dummy_cost = solution_summary(s21, x)
Histo(large_dummy_cost)

Yay! We found a solution that assigns every student to one of their top 5 FWS choices. So why does this work?

As we saw in the toy examples, edge costs dictate how much you want the solver to ship flow across those edges. Since we want to satisfy all demand at minimal cost, an edge with a smaller cost has a higher likelihood of being used in the solution, while an edge with a larger cost will potentially be avoided. (For instance, we want more first-choice than fifth-choice, so the cost of first-choice edges is lower than that of fifth-choice edges.) 

Applying this thinking to the dummy, setting the dummy edge costs to an enormous number like 100,000 essentially discourages the solver from ever choosing to ship across a dummy edge instead of a real student edge, unless it's absolutely necessary to satisfy demand constraints. 

More technically, the cost of not assigning just *one* student to one of their top 5 choices (i.e., the cost of assigning a dummy student in place of a real student) is greater than the cost of assigning *all* 2285 students their fifth choice: $100,000 > 5(2285) = 11,425$. So we'd rather assign every student (if such an assignment exists) than fail to assign just one student!
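The bound in this argument generalizes to other inputs. The sketch below assumes positive preference costs, as in `{1:1, ..., 5:5}`, and computes the smallest integer dummy cost that outweighs giving every student their worst listed choice; `dominating_dummy_cost` is a hypothetical helper, not part of the lab code.

```python
def dominating_dummy_cost(n_students, worst_preference_cost):
    """Smallest integer cost exceeding the total cost of giving every student their worst listed choice."""
    return n_students * worst_preference_cost + 1

n_students = 2285
worst_case_total = 5 * n_students          # every student receives their fifth choice
threshold = dominating_dummy_cost(n_students, 5)

print(worst_case_total, threshold, 100000 >= threshold)
```

Any dummy cost at or above this threshold makes one extra dummy assignment more expensive than any possible savings on preference edges, so 100,000 is comfortably large enough here.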

**Q:** Just looking at the histograms, compare the distribution of received student preferences between our original solution and the new one with zero unmatched students. (How does the number of first choices compare? Fifth choices?)

**A:** <font color='blue'>Both are monotonically decreasing (i.e., # first received > # second received > # third received...). Ignoring the fact that the original solution leaves some students unassigned, the original solution performs better both in terms of # of first choices assigned (1269 to 1179) and # of fifth choices assigned (19 to 55).</font>

It appears that in the new (zero unmatched) solution, fewer students get their first choice. Let's see if we can do better! To do so, let's find a solution that *maximizes* the number of students receiving their first choice. 

We've already shown how to minimize the number of dummy assignments&mdash;we just increase the cost of assigning dummy students. In the cell below, modify the cost of assigning a student their first choice to achieve our new objective. (Hint: how do you turn minimization into maximization?)

In [None]:
# TODO: Update the unit cost of first-choice edges to achieve our desired objective. 
# costs[1] = XXX

### BEGIN SOLUTION
costs[1] = -100000     # any number less than -100000 is also fine
### END SOLUTION

m,x = Assign(s21, costs, 17)
max_first = solution_summary(s21, x)
Histo(max_first)

**Q:** You should get a solution with over 1400 students receiving their first choice. (If not, fiddle around until you do!) However, notice that this solution leaves some students unassigned, even though the dummy cost is still 100,000! Why do you think this is?

**A:** <font color='blue'>If we set the cost of first-choice edges to a large enough negative number, then it becomes cheaper for the solver to sacrifice some assignments (by assigning dummy students) in order to assign more students their first-choice, since the 'incentive' for assigning first-choice is now so high.</font>

To remedy this, we'll set the cost of first-choice assignments to be an order of magnitude closer to zero than the cost of dummy assignments. This is like telling the solver, "Hey, we want to maximize the number of first-choice assignments if we can, but it's more important that there are no students unassigned."
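To see the effect on something small enough to enumerate, here is a sketch on a hypothetical three-student, three-class input (not taken from the FWS data) with one seat per class. It brute-forces the cheapest assignment under each first-choice cost, filling every empty seat with a dummy student as our model does.

```python
from itertools import product

# Hypothetical toy input: three classes with one seat each.
prefs = {"u": ("Z", "X"), "w": ("Z", "X"), "v": ("X", "Y")}
classes = ("X", "Y", "Z")
CAPACITY = 1

def best_assignment(first_cost, second_cost=2, dummy_cost=100000):
    """Brute-force the cheapest assignment; None means the student is left unassigned."""
    students = sorted(prefs)
    best = None
    for choice in product(*[prefs[s] + (None,) for s in students]):
        load = {c: 0 for c in classes}
        cost = 0
        for s, c in zip(students, choice):
            if c is None:
                continue
            load[c] += 1
            cost += first_cost if c == prefs[s][0] else second_cost
        if any(n > CAPACITY for n in load.values()):
            continue  # over capacity, not feasible
        # Dummy students fill every seat that no real student occupies.
        cost += dummy_cost * sum(CAPACITY - n for n in load.values())
        if best is None or cost < best[0]:
            best = (cost, dict(zip(students, choice)))
    return best

for first_cost in (-100000, -10000):
    cost, assignment = best_assignment(first_cost)
    unassigned = [s for s, c in assignment.items() if c is None]
    print(first_cost, cost, unassigned)
```

When the first-choice reward is as large as the dummy penalty, dropping one student to gain a second first-choice assignment lowers the total cost. When the reward is ten times smaller, the same trade no longer pays off and every student is placed.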

In [None]:
costs[1] = -10000

m,x = Assign(s21, costs, 17)
modified_max_first = solution_summary(s21, x)
Histo(modified_max_first)

What if we wanted to try the opposite approach? Again, we want to minimize the number of unmatched students above all else. But this time, instead of then trying to maximize the number of first choices received, let's try minimizing the number of fifth choices received. 

To do this, we set the cost of assigning a student their fifth preference to be much greater than the cost of assigning any other preference&mdash;but still an order of magnitude less than not assigning the student at all.

In [None]:
costs = {1:1, 2:2, 3:3, 4:4, 5:10000, 'dummy':100000}

m,x = Assign(s21, costs, 17)
min_fifth = solution_summary(s21, x)
Histo(min_fifth)

**Q:** Compare these two solutions (maximizing first choice versus minimizing fifth choice, both subject to minimizing unassigned). If you had to present one of these solutions to the FWS assignment committee, which would you present? Give a reason for your choice.

**A:** <font color='blue'>Answers may vary. One justification for the first solution is that well over half the students receive their top choice. One justification for the second solution is that it has a lower (meaning better) mean preference received, and no student receives their last choice (out of their top 5 choices, of course).</font>

In each of the previous two examples, we saw how setting edge weights with different orders of magnitude allowed us to achieve a greater degree of complexity and nuance in our solutions. 

Let's implement one more (slightly more complicated) approach. As always, we want to minimize the number of unassigned students first. Then, we want to maximize the number of students receiving their first choice, then maximize the number of students receiving their second choice, and so on down the line until we finally maximize the number of students receiving their fifth choice. (This notion of "ranking" or "prioritizing" different objectives is known as the *lexicographic method*.)

In the cell below, fill in the missing edge costs to implement the approach outlined above.

In [None]:
# TODO: Fill in the missing edge costs to implement the lexicographic-maximum ordering outlined above.
# Hint: use different orders of magnitude to rank the importance of each objective
# costs = {1:XXX, 2:-1000, 3:XXX, 4:XXX, 5:-1, 'dummy':100000}

### BEGIN SOLUTION
costs = {1:-10000, 2:-1000, 3:-100, 4:-10, 5:-1, 'dummy':100000}
### END SOLUTION

m,x = Assign(s21, costs, 17)
lexico_max = solution_summary(s21, x)
Histo(lexico_max)

**Q:** Compare this histogram to the previous solutions you've seen. How is the distribution different? Why might this be the case?

**A:** <font color='blue'>Unlike the previous solutions, the histogram here is not monotonically decreasing--there are more fifth choices assigned than fourth choices assigned. 
    
We might explain this by considering our numerous objectives. As we push as many students as possible into their first, second, and then third choices, classes will begin to fill up! When we are left to choose between fourth and fifth choice for the remaining students, it could be the case that there are some students whose fourth choice classes are already full, so the solver must assign them their fifth choice instead, so as not to leave them unassigned--remember that our most important objective is to assign every student. </font>
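The priority ordering itself can be illustrated without a solver. In the sketch below, each hypothetical plan is summarized by its counts of unassigned students and of first through fifth choices, and Python's tuple comparison ranks the plans in the lexicographic order described above. The plans and their counts are made up for illustration.

```python
# Each hypothetical solution: (unassigned, first, second, third, fourth, fifth) counts.
candidates = {
    "plan_a": (0, 10, 4, 3, 2, 1),
    "plan_b": (0, 10, 5, 1, 1, 3),
    "plan_c": (1, 12, 4, 2, 1, 0),
}

def lexicographic_key(counts):
    """Fewer unassigned first, then more first choices, then more second choices, and so on."""
    unassigned, *by_preference = counts
    return (unassigned, *(-n for n in by_preference))

best = min(candidates, key=lambda name: lexicographic_key(candidates[name]))
print(best)
```

Here `plan_c` loses despite having the most first choices because it leaves a student unassigned. `plan_b` beats `plan_a` on second choices even though it ends up with more fifth choices than fourth choices, which mirrors the non-monotone histogram discussed above.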

Important takeaways:

* We can model a seemingly difficult problem, with real-world implications, using concepts from ORIE!
* Implementing a "dummy node" can help turn an infeasible transportation input into a feasible one.
* Modifying edge costs in the graph changes the objective function, which allows us to find and compare a variety of feasible solutions.

## Bonus

We don't want to waste time trying different cost combinations if there is no solution where every student gets one of their top 5 picks. How can we check whether there exists a feasible solution with 0 unmatched students? 

**B1:** What is the cost of edges representing students' preferences?  

**A:** <font color='blue'>0</font>

**B2:** What is the cost of dummy edges?  
    
**A:** <font color='blue'>1 (any positive number will work, but 1 is easy)</font>

**B3:** What is the desired solution?  

**A:** <font color='blue'>Let's say there is a feasible solution that matches every student. There are (141 classes)(17 students / class) = 2397 seats, of which 2285 are filled by students. So 2397 - 2285 = 112 seats are filled by dummy students. This corresponds to a solution with objective value 1(112) = 112. So if the solver's optimal objective value is 112, we know there is a feasible solution with 0 unmatched students.
    
On the other hand, if there are unmatched students, then there will be more than 112 dummy students assigned. So if the optimal solution has objective value greater than 112, we can conclude there is no feasible solution with 0 unmatched students. </font>
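The check described in B3 amounts to a single comparison. A minimal sketch, assuming preference edges cost 0 and dummy edges cost 1 as in B1 and B2; `all_students_matched` is a hypothetical helper whose argument would be the optimal objective value reported by the solver.

```python
def all_students_matched(optimal_value, n_students=2285, n_classes=141, class_size=17):
    """Interpret the optimum when preference edges cost 0 and dummy edges cost 1."""
    seats = n_classes * class_size
    unavoidable_dummies = seats - n_students   # seats left over even if everyone is matched
    # Solvers report floating-point objective values, so round before comparing.
    return round(optimal_value) == unavoidable_dummies

print(141 * 17 - 2285)
print(all_students_matched(112.0), all_students_matched(115.0))
```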

**B4:** Describe another way that might use a different model.  

**A:** <font color='blue'>Answers may vary.</font>

# ====================================

# FWS: Min-Cost Flow

**Objective:**
* Make improvements to our existing FWS assignment model by interpreting it as a min-cost flow problem.

**Key Ideas:**
* the transportation problem
* the assignment problem
* the min-cost flow problem

**Reading Assignment:**
* Read Handout 7.5 on the min-cost flow problem.
* Read through the FWS lab to refresh your memory on key concepts.

**Brief description:** If you recall pre-enroll, there was a separate ballot you completed by listing your top 5 picks for FWS that semester. You were later notified which class you got placed into, probably hoping it was your first choice. By now, this should not seem like magic; problems like these often enlist help from Operations Research, especially as the scale increases. Disclaimer: the following model is not actually used by Cornell.

In [None]:
# imports
import pandas as pd
import numpy as np
from ortools.graph import pywrapgraph as ORMC
from fws_lab_new import inputData

### Recap

In the FWS lab, we dealt with an *assignment problem*: we were trying to assign students to different First-Year Writing Seminar sections, given a list of each student's preferences. This is a special case of the *transportation problem*, where we want to ship units from supply nodes to demand nodes. (Here, we are "shipping" students to classes!)

A small input for such a problem might be as follows: we have 4 classes (ABCD) that can each hold 2 students. There are 7 students available, whose preferred classes are listed in the table below.

| Student | First | Second |
|:-------:|:-----:|:------:|
|    1    |   A   |    B   |
|    2    |   D   |    C   |
|    3    |   A   |    C   |
|    4    |   B   |    D   |
|    5    |   C   |    B   |
|    6    |   A   |    B   |
|    7    |   B   |    A   |

To solve this problem, we set up a graph like the following:

In [None]:
from fws_lab_new import small_ex

S = [1,2,3,4,5,6,7]
D = ['A','B','C','D']
E = {(1,'A'):1, 
     (1,'B'):2, 
     (2,'D'):1, 
     (2,'C'):2, 
     (3,'A'):1, 
     (3,'C'):2, 
     (4,'B'):1, 
     (4,'D'):2, 
     (5,'C'):1, 
     (5,'B'):2, 
     (6,'A'):1, 
     (6,'B'):2, 
     (7,'B'):1, 
     (7,'A'):2 }

small_ex(S, D, E)

The supply nodes (representing students) are on the left, and the demand nodes (representing classes) are on the right. A directed edge $(i,j)$ from a supply node $i$ to a demand node $j$ means that we can "send" (assign) student $i$ to class $j$, at some unit cost $c[i,j]$. Note that "first-choice" edges are in blue, and "second-choice" edges are in orange.

**Q:** Right now, we have a demand of $(4)(2) = 8$ students, but we can only supply 7 students. What additional nodes and edges do we need to include in our graph to make sure we can satisfy our demand?

**A:** <font color='blue'>Include a "dummy" supply node that has arcs going from it to every demand node. </font>

Now our graph looks something like this:

In [None]:
S_dummy = S + ['dummy']
D_dummy = D
E_dummy = dict(E)
E_dummy.update({('dummy','A'):3,
                ('dummy','B'):3,
                ('dummy','C'):3,
                ('dummy','D'):3})

small_ex(S_dummy,D_dummy,E_dummy)

If we drew a graph for the actual FWS input, it would look similar to this, except with thousands of student supply nodes and hundreds of class demand nodes! 

As a reminder, in the real-world problem, each student gives up to 5 preferences and each class section is capped at 17 students.

## Min-Cost Flow Formulation

**Review** 

Recall the min-cost flow problem. It takes as input
* A directed graph $G = (V,A)$,
* costs $c(i,j)$ for shipping one unit of good from node $i$ to node $j$ for each arc $(i,j) \in A$,
* capacities $u(i,j)$ for each arc $(i,j) \in A$,
* supply values $b(i)$ for each node $i \in V$, such that $\sum_{i \in V} b(i) = 0$.

Remember also that at each node $i$, our supply value $b(i)$ is greater than 0 if there is supply at node $i$, less than 0 if there is demand at node $i$, and equal to 0 if there is neither supply nor demand at node $i$ (i.e., node $i$ is a transit node). Using max-flow terminology, supply nodes are "sources," demand nodes are "sinks," and transit nodes are interior nodes.

Our goal is to find a feasible flow that satisfies both flow-capacity constraints and flow-conservation constraints; that is, we wish to find a flow $f(i,j)$ on all arcs such that $0 \leq f(i,j) \leq u(i,j)$ for every arc $(i,j) \in A$ and $\sum_{j:(i,j) \in A} f(i,j) - \sum_{j:(j,i) \in A} f(j,i) = b(i)$ for every node $i \in V$ (the flow leaving node $i$ minus the flow entering node $i$ equals its supply value).

The objective value of a feasible solution is given by $\sum_{(i,j) \in A} c(i,j) \cdot f(i,j)$. We'd like to minimize this cost function; in other words, find a "min-cost" flow.
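The two feasibility conditions and the cost function translate directly into code. The following is a sketch of a checker for small inputs, written in plain Python rather than with OR-Tools; the dictionary layout and the name `check_flow` are assumptions made for this illustration.

```python
def check_flow(arcs, supplies, flow):
    """Return (is_feasible, cost) for a flow on a small min-cost flow input.

    arcs:     {(i, j): (capacity, unit_cost)}
    supplies: {node: b(node)}, positive for supply and negative for demand
    flow:     {(i, j): units shipped on arc (i, j)}
    """
    # Flow may only be placed on arcs that exist in the graph.
    if any(arc not in arcs for arc in flow):
        return False, None
    # Capacity constraints: 0 <= f(i,j) <= u(i,j) on every arc.
    for arc, (capacity, _) in arcs.items():
        if not 0 <= flow.get(arc, 0) <= capacity:
            return False, None
    # Conservation: flow out of each node minus flow into it equals b(node).
    for node, b in supplies.items():
        out_flow = sum(f for (i, _), f in flow.items() if i == node)
        in_flow = sum(f for (_, j), f in flow.items() if j == node)
        if out_flow - in_flow != b:
            return False, None
    cost = sum(arcs[arc][1] * f for arc, f in flow.items())
    return True, cost

# Hypothetical input: two students, one class with two seats.
arcs = {(1, "A"): (1, 1), (2, "A"): (1, 2)}
supplies = {1: 1, 2: 1, "A": -2}
print(check_flow(arcs, supplies, {(1, "A"): 1, (2, "A"): 1}))
```

Shipping only one of the two students would violate conservation at student 2 and at class A, so the checker rejects that flow.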

(For a more in-depth discussion, see Handout 7.5 and the min-cost flow lab.)

**Formulating the model**

In Handout 7.5, we learned that the transportation problem is really just a specific case of the min-cost flow problem. Let's use this fact, along with the transportation model we've already created for the FWS assignment problem, to formulate a min-cost flow model. As we'll see, using a min-cost flow approach will allow us to incorporate new information into the model.

Nodes for each student and class section, as well as the special 'dummy' supply node, remain the same as before, as do our arcs and edge costs; all we need to do to create a min-cost flow input is define (1) the arc capacities and (2) the supply values at each node, and we'll be all set!

**Q:** What should the capacity $u(i,j)$ on each arc $(i,j)$ be? (An arc $(i,j)$ connects a student node $i$ to a class node $j$.)

**A:** <font color='blue'>1</font>

**Q:** We also need to define the capacity $u(dummy,j)$ on each arc leaving the 'dummy' node to a class node. What is the maximum number of dummy students we can send to each class?

**A:** <font color='blue'> Set $u(dummy,j) = 17$. More generally, we set $u(dummy,j) = $ (max number of 'real' students) $ - $ (min number of 'real' students), or (in this case) $17 - 0 = 17$.</font>

Next, let's define our supply values $b(i)$.

**Q:** For a student node $i$, what should the supply value $b(i)$ be?

**A:** <font color='blue'>1</font>

**Q:** For a class node $j$, what should the supply value $b(j)$ be? (If there is demand at a node $k$, then $b(k) < 0$.)

**A:** <font color='blue'>-17</font>

Once again, we must account for our dummy supply node. Recall that for a min-cost flow input to be valid, the "net supply/demand" summed up over all nodes should be equal to 0:  $\sum_{i \in V} b(i) = 0$. 

Suppose we have $n$ students selecting from $m$ classes, each of which can have up to 17 students. 

**Q:** Using this information, what should the supply value $b(dummy)$ be?

**A:** <font color='blue'>To satisfy our input condition $\sum_{i \in V} b(i) = 0$, we must have $\sum_{students,i} b(i)$ + $\sum_{classes,j} b(j)$ + $b(dummy) = 0$. Thus $n(1) - m(17) + b(dummy) = 0$, which gives $b(dummy) = 17m - n$.</font>

This should make sense intuitively; essentially, we are saying that after every student has been assigned a class, whatever spots are left over should be filled by our "fake students." (Of course, this assumes there are enough spots for every "real" student!)
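Plugging in the Spring 2021 numbers makes the formula concrete. This sketch builds the list of supply values with the dummy node first, then students, then classes, and confirms that they sum to zero; `node_supplies` is a hypothetical helper written for this illustration.

```python
def node_supplies(n_students, n_classes, class_size=17):
    """Supply values b(i): dummy first, then students, then classes."""
    b_dummy = class_size * n_classes - n_students
    if b_dummy < 0:
        raise ValueError("more students than seats")
    return [b_dummy] + [1] * n_students + [-class_size] * n_classes

supplies = node_supplies(2285, 141)
print(supplies[0], sum(supplies))
```

With 2285 students and 141 sections of 17 seats, the dummy node supplies the 112 leftover seats.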

We now have everything we need to formulate the model. As a reminder, we set the unit cost of an edge $(i,j)$ from a student node to the class node representing their $k$th preference to be $k$. For "dummy" edges, we set the unit cost to be an arbitrarily large number (100,000) to discourage the solver from sending flow across those edges unless it absolutely has to.

Run the cell below, which implements our min-cost flow model in Python. (You can read through the code if you'd like.)

In [None]:
# A min-cost flow model for the FWS assignment problem
#
# 'dataset' is the name of the datafile
# 'minstudents' is the minimum number of (real) students that must be assigned to each section (between 0 and csize)
# 'csize' is the desired class size, filled with a combination of real and 'filler' students
# 'dcost' is the cost of not assigning a student to one of their top 5 preferences (i.e., cost of dummy edge)
def mincostflow(dataset='s21_fws_ballots.csv', minstudents=0, csize=17, dcost=100000):
    
    if minstudents > csize or minstudents < 0:
        raise ValueError('Error: minstudents must be in [0,class size].')
    
    # read in data and specify unit costs for student assignments
    data = pd.read_csv(dataset, index_col=0)
    costs = {1:1, 2:2, 3:3, 4:4, 5:5}
    students, classes, edges = inputData(data, costs)
           
    n = len(students) # number of students
    m = len(classes) # number of class sections
    dcapacity = csize - minstudents # number of dummy students we can send to each class
    
    # define supply b[i] at each node i
    # ORTools spec says nodes must be nonnegative integers indexed starting at 0 (dummy supply node),
    # so class numbers are indexed from (n + 1) to (n + m), where 
    # n is the number of students and m is the number of class sections  
    supplies = []
    supplies.append(csize*m - n) # dummy supply node
    for s in students:
        supplies.append(1) # each student node has supply 1
    for c in classes:
        supplies.append(-1*csize) # each class node has supply -csize (i.e., demand csize)

    # define parallel arrays, one index per arc in the min-cost flow graph 
    start_nodes = []
    end_nodes = []
    capacities = []
    unit_costs = []
    
    # add student edges    
    for i,j in edges:
        start_nodes.append(i) # arcs start at student node 
        end_nodes.append(j+n) # arcs end at class node
        capacities.append(1) 
        unit_costs.append(edges[i,j])
        
    # add dummy edges
    for j in classes:
        start_nodes.append(0)
        end_nodes.append(j+n)
        capacities.append(dcapacity)
        unit_costs.append(dcost)
    
    # create solver
    min_cost_flow = ORMC.SimpleMinCostFlow()
    
    # add arcs, capacities, unit costs to graph
    for i in range(0, len(start_nodes)):
        min_cost_flow.AddArcWithCapacityAndUnitCost(int(start_nodes[i]), int(end_nodes[i]), capacities[i], unit_costs[i])  
    
    # add node supplies to graph
    for i in range(0,len(supplies)):
        min_cost_flow.SetNodeSupply(i, supplies[i])

    return min_cost_flow

In [None]:
m = mincostflow()

from fws_lab_new import printmcf
printmcf(m) # helper function to print results

Success! If everything ran properly, you should now have a working min-cost flow formulation for the FWS assignment problem.

**Q:** Use the preference list above to calculate the overall cost of the solution. (There are 141 class sections in total, each with a capacity of 17. The cost of a student receiving their $k$th preference is $k$ and the cost of assigning a dummy student to a class section is 100,000.)

**A:** <font color='blue'>The total class capacity is $(141\:sections)(17\:\frac{students}{section}) = 2397$ "spots," of which $1182 + 629 + 299 + 123 + 52 = 2285$ are filled by "real" students and $2397 - 2285 = 112$ are filled by "dummy" students. Now the overall cost is just the summation of the flow on each edge type times the unit cost of that edge type: 
$(1182)(1) + (629)(2) + (299)(3) + (123)(4) + (52)(5) + (112)(100,000) = 11,204,089.$

You can verify this by adding the following lines to the code cell above:<br><code>m.Solve()</code><br><code>print(m.OptimalCost())</code></font>
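The arithmetic in this answer can also be reproduced directly. The sketch below assumes the preference counts quoted in the answer (1182, 629, 299, 123, 52); the dictionary and variable names are illustrative.

```python
# Sketch: recompute the total cost from the number of students at each preference.
counts = {1: 1182, 2: 629, 3: 299, 4: 123, 5: 52}   # students receiving their k-th choice
total_spots = 141 * 17                               # sections * capacity
dcost = 100000                                       # unit cost of a dummy edge

real_students = sum(counts.values())
dummy_students = total_spots - real_students         # spots backfilled by filler students
total_cost = sum(k * f for k, f in counts.items()) + dcost * dummy_students

print(real_students, dummy_students, total_cost)
```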

This is all well and good, but so far we've just repeated what the transportation model already found. What more can we do with min-cost flow?

You may have noticed that our formulation is fairly simple in terms of its assumptions. For example, based on your answer to **Q**, a feasible (though expensive) solution might involve assigning 17 fake 'filler' students to a section! It's also easy to imagine our model assigning just one or two "real" students to a less interesting section that doesn't rank as high on people's preferences.

The folks at the Knight Institute want to avoid the administrative headaches of a class with just one or two students, while at the same time ensuring students take full advantage of the diversity of FWS classes offered. So, they request that each class section have a minimum of six students enrolled, but no more than 17 (as before).

**Q:** We need to find a way to "force" our model to assign six real (that is, not filler) students to each class. How can we implement this "minimum class size constraint"? (Hint: take a look at **Q2**)

**A:** <font color='blue'>(taken from answer to Q2) Set $u(dummy,j) = $ (max number of 'real' students) $ - $ (min number of 'real' students), or (in this case) $17 - 6 = 11$. Now we can only send at most 11 dummy students to each class node, but since each class node has a demand of 17, we must fill at least 6 spots with real students.</font>

If you read through the Python function, you may have noticed that it can take as input a parameter called 'minstudents', which specifies the minimum number of "real" students assigned to each class section. (The code generalizes what you did in **Q**.)

In [None]:
m = mincostflow(minstudents=0)
printmcf(m)

**Q:** Try a few different values for the 'minstudents' parameter and see what outputs you get. What do you observe? Can the Knight Institute satisfy their minimum class size constraint? (If not, how might cancelling some classes help?) 

**A:** <font color='blue'>You should see that a feasible flow exists only if the value of minstudents is 0 or 1. The input becomes infeasible for values of minstudents from 2 to 17. Thus the Knight Institute can't satisfy its minimum class size constraint (6) without cancelling classes, as we'll see. Cancelling classes with low interest mitigates this problem by effectively forcing students into more popular classes. </font>

Run the following cell, which outputs the least popular class (or classes) among students' preferences. (Define "least popular" as appearing the least on students' list of preferences.) If you'd like, read the comments alongside each line of code to understand what the function does.

In [None]:
# Outputs the least popular FWS class section among students' listed preferences
def leastpopular(dataset='s21_fws_ballots.csv'):
    data = pd.read_csv(dataset) # reads in dataset
    
    a = data[['1','2','3','4','5']].to_numpy().ravel() # flattens all the class preferences students put into one array
    a = a[a != 0] # deletes preferences left blank (assumed to be recorded as 0)
    unique, counts = np.unique(a, return_counts=True) # counts how many times each class appears on the preference list
    classdict = dict(zip(unique, counts)) # creates a dictionary of class number : number of preferences
    
    least_students = min(classdict.values()) # finds the minimum number of preferences in the dictionary
    res = [c for c in classdict if classdict[c] == least_students] # finds class number corresponding to min number of prefs.
    
    print('The class (or classes) with the least students interested is ' + str(res) + '.')
    print('Only ' + str(least_students) + ' student(s) put this class as one of their top 5 preferences.') # prints results 
    
leastpopular()

**Q:** Does this output make sense based on what you observed in **Q8**? Explain.

**A:** <font color='blue'>The function output states that the class with the minimum number of students putting it as a preference has only 1 student interested. Thus only 1 "real" student can be assigned this class--so if we set 'minstudents' higher than 1, there aren't enough students interested to satisfy the minimum class size constraint, and the mincostflow function returns 'Infeasible'.</font>
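The reasoning in this answer amounts to a necessary condition for feasibility: no class can receive more real students than the number who listed it. The sketch below illustrates this on a small, hypothetical set of ballots (not the lab's data); the function name `max_feasible_minstudents` is made up for this example.

```python
from collections import Counter

def max_feasible_minstudents(ballots):
    """Largest 'minstudents' that could possibly be feasible:
    the smallest number of students listing any one class."""
    interest = Counter(c for prefs in ballots.values() for c in prefs)
    return min(interest.values())

# Hypothetical toy ballots: student -> ranked class preferences
ballots = {1: ['A', 'B'], 2: ['B', 'A'], 3: ['C', 'A'], 4: ['A', 'B']}

# Class C is listed by only one student, so any minstudents > 1 is infeasible.
print(max_feasible_minstudents(ballots))
```

Note that this bound is necessary but not sufficient (capacities and competing preferences can still make a smaller value infeasible), and a class that nobody lists would not appear in the count at all.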

To summarize, we started with a transportation model for the FWS assignment problem. Then we translated this into a more flexible min-cost flow model, which allowed us to incorporate new information (the Knight Institute's minimum class size constraint) into our formulation. 

In the future, we might want to add even more flexibility, such as the capability to cancel class sections. (Stay tuned!)

# ====================================

# FWS: Integer Program

**Objectives:**
* Improve on an existing model by adding important features and nuances.
* Construct and compare different solutions to the FWS assignment problem.
* Explore integrality properties.

**Key Ideas:**
* the transportation, assignment, and min-cost flow problems
* integer programming
* the integrality property

**Brief description:** If you recall pre-enroll, there was a separate ballot you completed by listing your top 5 picks for FWS that semester. You were later notified which class you got placed into, probably hoping it was your first choice. By now, this should not seem like magic; problems like these often enlist help from Operations Research especially as the scale increases. Disclaimer: the following model is not actually used by Cornell.

In [None]:
# imports
from ortools.linear_solver import pywraplp as OR
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from fws_lab_new import inputData, solution_summary, Histo # helper functions to format and print data/solutions

## Part 1: Improving Our Integer Program

The last time we saw the FWS assignment problem, we discovered that incorporating a new constraint (a minimum class size of 6 "real live" students) made the problem infeasible because at least one class didn't have enough students interested. 

We notify the Knight Institute that there seems to be a lack of enthusiasm for class \#1, since only one student listed it among their top five preferences. (Maybe it's at 8 AM on Monday, who knows?) They inquire if they can just cancel the class altogether and still find a full matching of students to FWS sections. 

Of course, this is not the only time where being able to decide which class sections actually run might be helpful. What if an instructor got sick before the school year, or the FWS budget decreased and some class offerings were cut? 

Let's see how we can build this idea into our original transportation model (copy/pasted below). We didn't mention it at the time, but the transportation model code is really just an integer program! We have:

* decision variables $x[i,j]$ representing the amount of "student flow" going from student $i$ to class $j$
* a matrix of costs $c[i,j]$ representing the "cost" of assigning students to each of their various class preferences (with better preferences costing less)
* additional decision variables $x[dummy,j]$ to fill leftover spots in classes with fake "filler" students if needed, with a very high edge cost $c[dummy,j]$ to make sure the solver fills classes with actual students first
* an objective function to minimize the total cost of the "flow" (assignment)
* constraints to ensure (a) no student is assigned multiple classes and (b) each class is filled to capacity with a combination of real and dummy students

In [None]:
def Assign(preferences, costs, csize):
    """A model for solving an FWS assignment problem.
    
    Args:
        preferences (pd.DataFrame): Preferred classes for each student.
        costs (Dict): Dictionary from edge types to unit costs.
        csize (int): Capacity of the classroom.        
    """
    students, classes, edges = inputData(preferences, costs)
    EDGES = list(edges.keys())      # create edge list
    
    c = edges.copy()                # define c[i,j]
    
    # define model
    m = OR.Solver('FWS', OR.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
    
    # decision variables
    x = {}  # units to be shipped on each edge
    for i,j in EDGES:
        x[i,j] = m.IntVar(0, m.infinity(), ('(%s, %s)' % (i,j))) 
        
    # define objective function here
    m.Minimize(sum(c[i,j]*x[i,j] for i,j in EDGES))
       
    # add constraint to ensure each student (besides the dummy) is assigned at most one class
    for k in students:
        if k != 'dummy':
            m.Add(sum(x[i,j] for i,j in EDGES if i==k) <= 1)
        
    # add constraint to ensure each class is filled to capacity
    for k in classes:
        m.Add(sum(x[i,j] for i,j in EDGES if j==k) == csize)
    
    # solve
    m.Solve()
    
    return m,x

In [None]:
s21 = pd.read_csv('s21_fws_ballots.csv', index_col=0) # Spring 2021: 2285 students, 141 class sections
s21.head() # preview

In [None]:
costs = {1:1, 2:2, 3:3, 4:4, 5:5, 'dummy':100000}

m,x = Assign(s21, costs, 17)
original_sol = solution_summary(s21, x)
print(original_sol)

Notice that the <code>Assign</code> function above doesn't include any notion of the minimum class size (6) that the Knight Institute wants! In terms of class size, we just have a constraint specifying that the number of students (real, dummy, or both) in each class must equal <code>csize</code>, in this case 17.

**Q:** Let's say we replace the constraint mentioned above with one that is a bit more lenient: each class must have between 6 and 17 students. Do we still need "dummy" students in our model? Why? 

**A:** <font color='blue'>The purpose of including the dummy students at all is to satisfy the constraint that every class must be totally filled to capacity with (real and/or fake) students, by backfilling classes that don't have enough real students. By relaxing this constraint, though, there's really no need to include the dummy student supply node, since each class doesn't have to be full as long as at least 6 (real) students are assigned to it. If a class has fewer than 6 (real) students assigned to it, it doesn't make sense for us to add dummy students until there are 6 students total, because that doesn't solve the problem of lack of interest. Instead, we should just return infeasible in that case.</font>

**Q:** We'd like to update our integer program to account for whether or not a class section runs. To do so, we'll need to define new binary decision variables. What are they?

**A:** <font color='blue'>Add a new binary decision variable for each class, set to 1 if the class runs and 0 if the class does not run.</font>

**Q:** Suppose we have a class section A that we'd like to run. What are the upper and lower bounds on the number of students we can assign to class A?

**A:** <font color='blue'>Upper bound = 17; lower bound = 6.</font>

**Q:** Suppose we have a class section B that we do not want to run. What are the upper and lower bounds on the number of students we can assign to class B?

**A:** <font color='blue'>Upper bound = 0; lower bound = 0.</font>

**Q:** Suppose we have a class section C and a binary variable $y$, such that if class C runs, $y = 1$, and if class C does not run, $y = 0$. In terms of $y$, what are the upper and lower bounds on the number of students we can assign to class C? 

**A:** <font color='blue'>Upper bound = $17y$; lower bound = $6y$.</font>
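As a quick check that the bounds $6y$ and $17y$ reproduce the two cases from the previous questions, here is a minimal sketch. The helper `class_size_bounds` is hypothetical and only mirrors the arithmetic; it is not part of the lab's code.

```python
def class_size_bounds(y, minstudents=6, csize=17):
    """Return (lower, upper) enrollment bounds for a section with run-indicator y."""
    if y not in (0, 1):
        raise ValueError('y must be 0 (class does not run) or 1 (class runs)')
    return minstudents * y, csize * y

print(class_size_bounds(1))   # class runs: between 6 and 17 students
print(class_size_bounds(0))   # class does not run: exactly 0 students
```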

Below, you'll see a modified version of the <code>Assign</code> function. Read through and run the cell.

In [None]:
def modifiedAssign(preferences, costs, minstudents, csize):
    """A modified FWS assignment model, which incorporates minimum class size and the option of not running class sections.
    
    Args:
        preferences (pd.DataFrame): Preferred classes for each student.
        costs (Dict): Dictionary from edge types to unit costs.
        minstudents (int): Minimum number of students in the classroom.
        csize (int): Capacity of the classroom.        
    """
    students, classes, edges = inputData(preferences, costs) 
    EDGES = list(edges.keys())      # create edge list
    
    c = edges.copy()                # define c[i,j]

    # define model
    m = OR.Solver('FWS', OR.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
    
    # decision variables
    x = {}  
    for i,j in EDGES:
        # define x(i,j) here
        x[i,j] = m.IntVar(0, m.infinity(), ('(%s, %s)' % (i,j))) 
        
    y = {}
    for j in classes:
        # define y_j here
        y[j] = m.BoolVar('y_%s' % j) # A BoolVar or Boolean variable is similar to an integer variable,
                                     # except that it can only take on values in {0,1}, where 0 represents "false"
                                     # and 1 represents "true." We could have also used an IntVar ranging from 0 to 1.
        
    # define objective function here
    m.Minimize(sum(c[i,j]*x[i,j] for i,j in EDGES))
       
    # add constraint to ensure each student is assigned exactly one class
    for k in students:
        m.Add(sum(x[i,j] for i,j in EDGES if i==k) == 1)
        
    # add constraint to ensure each class that runs satisfies minimum and maximum class size
    for k in classes:
        m.Add(sum(x[i,j] for i,j in EDGES if j==k) <= csize*y[k])
        m.Add(sum(x[i,j] for i,j in EDGES if j==k) >= minstudents*y[k])
    
    # solve
    status = m.Solve()

    if status == OR.Solver.INFEASIBLE:
        print('Infeasible')
        return
    
    classes_run = int(sum(y[j].solution_value() for j in classes))
    print(str(classes_run) + ' of ' + str(len(classes)) + ' classes run.')
    
    return m,x   

Pay close attention to the updates made to the decision variables and constraints. Do they make sense based on what you answered in the previous series of questions? (If you're confused, ask a TA before moving on!)

**Q:** In our previous <code>Assign</code> function, we added a constraint to ensure that each student (besides the dummy) was assigned at most one class. In our <code>modifiedAssign</code> function, we change this to strict equality: now each student is assigned exactly one class. Why? (Hint: think about how our new decision variables impact the cost of a solution)

**A:** <font color='blue'>If we did not update this constraint, the solver would opt to not assign any students by setting every $x[i,j]$ equal to zero (by way of setting every $y[k]$ equal to zero). This is because the cheapest solution (of cost 0) is just to not run any classes!

Sidenote: It's not quite enough to say, "Well, if we don't have strict equality, then we might not assign every student." This was true for the original model too! For this model, we can go one step further: if we don't have strict equality, then we <u>know</u> that zero students will get assigned (for the reason stated above).</font>
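To make this concrete, the sketch below brute-forces a tiny, hypothetical instance (three students, two classes, class sizes between 1 and 2) with and without the option of leaving students unassigned. It does not use the solver; it simply enumerates every assignment, and it assumes all preference costs are positive, as in the `costs` dictionary used here.

```python
from itertools import product

# Hypothetical toy instance: student -> {class: preference rank (= cost)}
prefs = {'s1': {'A': 1, 'B': 2}, 's2': {'A': 1, 'B': 2}, 's3': {'B': 1, 'A': 2}}
MIN_SIZE, MAX_SIZE = 1, 2

def best(allow_unassigned):
    """Brute-force the cheapest assignment; None means 'student left unassigned'."""
    options = []
    for s in prefs:
        choices = list(prefs[s])
        if allow_unassigned:
            choices.append(None)
        options.append(choices)
    best_cost, best_assign = None, None
    for assign in product(*options):
        sizes = {}
        for c in assign:
            if c is not None:
                sizes[c] = sizes.get(c, 0) + 1
        # a class that runs (size > 0) must respect the size bounds; empty classes do not run
        if any(not (MIN_SIZE <= k <= MAX_SIZE) for k in sizes.values()):
            continue
        cost = sum(prefs[s][c] for s, c in zip(prefs, assign) if c is not None)
        if best_cost is None or cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_cost, best_assign

print(best(allow_unassigned=True))    # "at most one class": nobody is assigned
print(best(allow_unassigned=False))   # "exactly one class": everyone is assigned
```

With the "at most one" constraint, the cheapest solution assigns no one and runs no classes; requiring exactly one class per student forces a real assignment.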

One advantage of our modified assignment function is that it's flexible in the case that a class must be canceled for some reason. For instance, if the Knight Institute wanted to cancel section 87, we'd just put the following line of code in with our constraints: 

<code>m.Add(y[87] == 0)</code>

Now, let's test out our new model! Run the cell below, which inputs the Knight Institute's constraints by setting the parameter 'minstudents' equal to 6 and 'csize' equal to 17. As a reminder, in our min-cost flow formulation&mdash;that is, before we included the ability to not run classes&mdash;this input returned "Infeasible."

In [None]:
costs = {1:1, 2:2, 3:3, 4:4, 5:5}

m,x = modifiedAssign(s21, costs, 6, 17)
modified_sol = solution_summary(s21, x)
print(modified_sol)

Woohoo! We found a feasible solution! As a bonus, we can check which class wasn't run:

In [None]:
from fws_lab_new import print_not_run # helper function to print classes not run
print_not_run(m)

Now, let's see how this solution compares to the solution from the original model, which had no minimum class size.

In [None]:
# print percentage of students receiving preference 1,...,5
def print_pct(sol):
    for pref in range(1,6):
        print(pref, ':', round(100*sol[pref]/2285, 2), '%')

print('Assign function solution: no minimum class size')
Histo(original_sol)
print_pct(original_sol)

print('')
    
print('Modified Assign function solution: minimum class size 6')
Histo(modified_sol)
print_pct(modified_sol)

**Q:** Compare the objective values of these two solutions (you can use the cell below for computations if you'd like). What do you observe? Does this make sense? (Remember we set the cost of a student receiving their $k$th preference to be $k$; that is, a student receiving their top choice costs 1, their second choice costs 2, and so forth. Ignore any contributions from dummy edges in the first solution.)

**A:** <font color='blue'>The first (less constrained) solution has objective value $(1179)(1) + (634)(2) + (301)(3) + (116)(4) + (55)(5) = 4089$, and the second (more constrained) solution has objective value $(1188)(1) + (616)(2) + (302)(3) + (126)(4) + (53)(5) = 4095$. This should make sense, as adding more constraints to the model (i.e., shrinking the feasible region) can only maintain or worsen the objective value.</font>

In [None]:
### CELL FOR COMPUTATIONS ###


**Q:** The Knight Institute is worried about money and so wants to know the fewest classes they can offer while still having every student get one of their top 5 preferences. To implement this, we can redefine the objective function in our model as follows:

<code>m.Minimize(sum(y[j] for j in classes))</code>

Doing this will give a feasible solution with an objective value of 135. We claim that this is optimal! Give an argument as to why they'll always need at least 135 sections. (Recall that there are 2285 students in total.)

**A:** <font color='blue'>The claim is that we can never run fewer than 135 class sections if we want a feasible solution. Suppose we fill each classroom to the max, that is, assign 17 students per class for as many classes as possible. We can fill $\frac{2285 - 2285\bmod17}{17} = 134$ classes this way, which seats only $134 \cdot 17 = 2278$ students, with a remaining $2285\bmod17 = 7$ students to put in the 135th class section.</font>
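The counting argument can be checked in a couple of lines. This is just a sketch of the arithmetic, with illustrative variable names:

```python
students, csize = 2285, 17

full_sections, leftover = divmod(students, csize)         # sections filled to capacity, students left over
min_sections = full_sections + (1 if leftover else 0)     # one more section if anyone is left over

print(full_sections, leftover, min_sections)
print(full_sections * csize < students)                   # 134 full sections cannot seat everyone
```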

The fact that we're able to find a feasible assignment with exactly 135 sections open may seem pretty magical. Disclaimer: this is indeed a bit magical, and may not always happen; we managed to get lucky with the data!

**Q:** Do you expect this solution to be better or worse (for students) than the min-cost solutions above? Explain your answer. 

**A:** <font color='blue'>We'd expect this solution to be worse for students, because it just tries to minimize the number of sections run while each student gets one of their listed preferences (it doesn't matter which). In contrast, the min-cost solutions incorporate the extra step of ranking of students' preferences once a feasible solution (where each student gets one of their listed preferences) is guaranteed.</font>

Run the cell below to get a visual of the "least-classes" solution!

In [None]:
from fws_lab_new import minimizeNumClassesAssign # same function as modifiedAssign, but with the objective function from above
m,x = minimizeNumClassesAssign(s21, costs, 6, 17)
least_classes_sol = solution_summary(s21, x)
Histo(least_classes_sol)
print_pct(least_classes_sol)

Clearly, there's a trade-off here: by running fewer classes, the university saves money, but doing so might leave some students dissatisfied with their FWS assignment. On the other hand, running more classes will probably increase student satisfaction, but will also increase expenses. This is a peek into the messy world of *multi-objective optimization*! 

In a sense, we want the "best of both worlds"&mdash;that is, we want to both minimize the number of sections run and minimize student dissatisfaction. We already know how to find each of these solutions individually, as well as what the solutions actually are: a section-optimal solution runs 135 sections, while a student-optimal solution has cost 4095. To find a good intersection between the two, one strategy is to "fix" one of these values (by way of constraints) and optimize the other as much as possible. (Simplex, anyone?)

**Q:** For our problem, we have two strategies: either fix the number of sections and minimize cost, or fix the cost and minimize number of sections. Give a reason why we might choose the first strategy.

**A:** <font color='blue'>Trying to decrease the number of sections after fixing the cost (which, as we'll see, is just an arbitrary number) probably won't give us anything interesting, because decreasing the number of sections should just increase the cost.\* On the other hand, there are likely several solutions (with varying costs) that run the same number of sections. 

\* Sam made a good point that we could write an IP like "How few sections can I have while staying within 5% as good a solution?", which would motivate the second strategy.</font>

**Q:** Write the constraint that our model must run exactly 135 classes, using the code from **Q8** as a guide. Your answer should look like this: m.Add(XXX)<br>(Don't worry, you don't need to paste your answer anywhere.)

**A:** <font color='blue'><code>m.Add(sum(y[j] for j in classes) == 135)</code></font>

In the cell below, we run the <code>modifiedAssign</code> function with the constraint you wrote above. Then we vary the number of classes that are run, minimizing cost (i.e., student dissatisfaction) at each step. 

Look at the scatter plot. It compares the cost of these solutions with the cost of the student-optimal solution&mdash;the closer the cost, the better. Does this surprise you?

In [None]:
from fws_lab_new import modifiedAssignWithNumClasses # modifiedAssign, but with constraint to specify # of classes that run

# Helper function: return the objective value of a solution dictionary of the form {edge cost : number of edges in solution}
def objValue(sol):
    return sum(int(k)*int(v) for k,v in sol.items())

# Find optimal solutions when exactly 135, 136, ..., 141 class sections are running
obj_values = {}
for i in range(135,142):
    print(str(i) + ' classes offered:')    
    sol_vars = modifiedAssignWithNumClasses(s21, costs, 6, 17, i)
    if sol_vars is not None:
        m,x = sol_vars
        class_sol = solution_summary(s21, x)
        obj_values[i] = objValue(class_sol)
        print('Objective value ' + str(objValue(class_sol)))
    print('')
    
# Display scatter plot of solution values compared to optimal solution value
sectionsOffered = list(obj_values.keys())
pctOptimal = [100 * objValue(modified_sol) / x for x in list(obj_values.values())]
plt.scatter(sectionsOffered, pctOptimal)
plt.xlabel('Number of Sections Offered')
plt.ylabel('Optimality Metric (%)')
plt.axhline(y=100, color='green', linestyle='dotted')
plt.show()

Take-homes from this section:
* By updating our integer program model, we accounted for "real-world" scenarios: both a minimum class size and the possibility of cancelling classes. (Adding the flexibility to cancel class sections happened to let us turn an infeasible problem into a feasible one!)
* With multiple feasible solutions, we may need to consider the trade-offs that come with prioritizing different objectives.

## Part 2: Objective Analysis

As you have probably noticed, there is more than one feasible solution to the FWS assignment problem&mdash;that is, there exist multiple (potentially many) matchings of students to FWS sections that ensure every student gets one of their top 5 picks. Having many solutions is great, but in a real-life scenario, we eventually have to pick one!

Once we know we have feasible solutions, deciding which of the solutions is "the best" depends on what our goal is. For instance, our goal could be to find the solution that maximizes the number of students receiving their first choice, or to minimize the number of students receiving their fifth choice.

In our transportation formulation for the FWS assignment problem, we saw how we can achieve some complex behavior in our solutions by simply making some clever adjustments to our objective function.

**Q:** When our problem input included a dummy node, we set the unit cost of edges coming from the dummy node to be an absurdly high number (100,000). Why? 

**A:** <font color='blue'>(copy/pasted from transportation section) In the FWS input, edge costs dictate how much you want the solver to select the corresponding edges. Since we are trying to minimize cost, an edge with a small cost has a higher likelihood of being in the solution while an edge with a large cost will potentially be avoided. (For instance, we want more first-choice than fifth-choice, so the cost of a first-choice edge is lower than the cost of a fifth-choice.) Applying this to the dummy, setting the edge cost to an enormous number like 100,000 essentially discourages the solver from ever choosing a dummy edge over a real student edge unless absolutely necessary to get a feasible solution. 

More technically, the cost of not assigning just *one* student to one of their top 5 choices (i.e., the cost of assigning a dummy student in place of a real student) is greater than the cost of assigning *all* 2285 students their fifth choice: $100,000 > 5(2285) = 11,425$. So we'd rather assign every student, if such an assignment exists, than fail to assign just one student! Using this strategy thus ensures we maximize the number of assigned students first before considering any notion of student preference.</font>
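The inequality in this answer can be verified numerically. The sketch below assumes the same figures (2285 students, worst preference rank 5, dummy cost 100,000); the variable names are illustrative.

```python
n_students = 2285
worst_rank = 5        # the most expensive "real" edge is a fifth choice
dcost = 100000        # unit cost of a dummy edge

worst_real_cost = worst_rank * n_students   # every student receives their fifth choice
print(worst_real_cost)
print(dcost > worst_real_cost)              # one dummy edge outweighs any full real assignment
```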

To find distinct feasible solutions, we played around with different values for the cost of assigning a student to each preference. But it may not be obvious how exactly changing the edge costs allows us to incorporate different objectives (and find customized solutions).

Take another look at our objective function:

$\min\:\sum_{(i,j) \in A} c(i,j)x(i,j)$

$ = \min\:\sum_{k=1}^{5} (\text{edge cost of preference } k)(\text{number of students assigned preference } k)$

$ = \min\:c_1 f_{1} + c_2 f_{2} + c_3 f_{3} + c_4 f_{4} + c_5 f_{5}$

It turns out our objective function is really a weighted sum of five objective functions! Each function $f_k$, $k \in \{1,..,5\}$, counts the number of students assigned preference $k$. 

Our costs $c_k$ for each edge type $k$, enumerated in the <code>costs</code> dictionary, specify the "weights" (or "priorities") of each function $f_k$ in our overall objective function. This strategy, known as the *weighted-sum method*, allows us to encode a ranking of importance among our five objectives. 

For instance, specifying costs $c_k = k, k \in \{1,..,5\}$, prioritizes minimizing $f_{5}$ over minimizing $f_{4}$, which is more important than minimizing $f_{3}$, and so on.
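To see the weighted sum $c_1 f_1 + \dots + c_5 f_5$ in action, the sketch below evaluates it for the preference counts reported earlier for the minimum-class-size solution (1188, 616, 302, 126, 53) under the costs $c_k = k$. The helper `weighted_sum` is a hypothetical name, not one of the lab's functions.

```python
def weighted_sum(costs, counts):
    """Evaluate sum over k of c_k * f_k for a cost dictionary and a count dictionary."""
    return sum(costs[k] * counts[k] for k in costs)

costs = {k: k for k in range(1, 6)}                   # c_k = k
counts = {1: 1188, 2: 616, 3: 302, 4: 126, 5: 53}     # f_k: students receiving preference k

print(weighted_sum(costs, counts))
```

Changing the entries of `costs` changes how much each $f_k$ contributes, which is how different priorities get encoded.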

**Q:** Suppose we define our <code>costs</code> dictionary as follows: <code>{1:42, 2:-9, 3:23, 4:1, 5:-100}</code>

What objectives (instructions) are we giving to the solver when it goes about assigning students? Rank them in order of importance. (Hint: there are five)

**A:** <font color='blue'>In decreasing order of importance: maximize fifth choice, minimize first choice, minimize third choice, maximize second choice, minimize fourth choice.</font>
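The ranking in this answer follows a simple rule of thumb: a larger absolute weight means a more important objective, a positive weight means "minimize," and a negative weight means "maximize." The sketch below applies that rule; `rank_objectives` is a hypothetical helper and assumes every weight is nonzero.

```python
def rank_objectives(costs):
    """Order objectives by |weight| (largest first); the sign gives the direction."""
    ordered = sorted(costs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [('minimize' if w > 0 else 'maximize', k) for k, w in ordered]

costs = {1: 42, 2: -9, 3: 23, 4: 1, 5: -100}
for direction, pref in rank_objectives(costs):
    print(direction, 'preference', pref)
```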

Get a visual of the solution generated by this (rather silly) input by running the cell below. Does it make sense based on your answer to **Q**?

In [None]:
costs = {1:42, 2:-9, 3:23, 4:1, 5:-100}
m,x = modifiedAssign(s21, costs, 6, 17)
silly_sol = solution_summary(s21, x)
Histo(silly_sol)

Previously, we've implemented several goals, such as maximizing the number of students receiving their first choice or minimizing the number of students receiving their fifth choice. 

We also saw a *lexicographic maximum* approach, where we tried to maximize the number of students receiving their first choice, then maximize the number of students receiving their second choice, and so on down the line until finally maximizing the number of students receiving their fifth choice:

In [None]:
costs = {1:-10000, 2:-1000, 3:-100, 4:-10, 5:-1}
m,x = modifiedAssign(s21, costs, 6, 17)
lexico_max = solution_summary(s21, x)
Histo(lexico_max)

Now, let's try the opposite!

A *lexicographic-minimum ordering* of objectives is as follows: first (most importantly), we minimize the number of fifth preferences assigned; second, we minimize the number of fourth preferences assigned; and so forth, until we reach our "least important" objective, minimizing the number of first preferences.
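
To make the ordering concrete, here is a minimal sketch that compares a few made-up solution histograms lexicographically. It relies on the fact that Python compares tuples entry by entry; the histograms and the helper `lex_min_key` are hypothetical and not part of the lab code.

```python
# Hypothetical histograms: counts[k] = number of students assigned preference k.
sol_a = {1: 5, 2: 2, 3: 1, 4: 1, 5: 0}
sol_b = {1: 8, 2: 0, 3: 0, 4: 0, 5: 1}
sol_c = {1: 4, 2: 2, 3: 3, 4: 0, 5: 0}

def lex_min_key(counts):
    """Order the counts as (f_5, f_4, f_3, f_2, f_1) so tuples compare by priority."""
    return tuple(counts[k] for k in (5, 4, 3, 2, 1))

# min() picks the histogram with the fewest fifth choices, breaking ties
# by the fewest fourth choices, and so on.
best = min([sol_a, sol_b, sol_c], key=lex_min_key)
print(best)
```

Here `sol_b` loses immediately because it uses a fifth choice, even though it has the most first choices; `sol_c` then beats `sol_a` because it avoids fourth choices.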

Below, fill in the missing edge costs to implement this approach. Then run the cell to see the solution!

In [None]:
# TODO: Fill in the missing edge costs to implement the lexicographic-minimum ordering outlined above.
# Hint: use different orders of magnitude to rank the importance of each objective
# costs = {1:XXX, 2:10, 3:XXX, 4:XXX, 5:XXX}

### BEGIN SOLUTION
costs = {1:1, 2:10, 3:100, 4:1000, 5:10000}
### END SOLUTION

m,x = modifiedAssign(s21, costs, 6, 17)
lexico_min = solution_summary(s21, x)
Histo(lexico_min)

As we've seen, there are many solutions to the FWS assignment problem! Choosing just one depends on how you define the "best" solution.

The Knight Institute decided on the following criteria in determining the optimal solution:
* First, the number of fifth, and then fourth, preferences should be minimized.
* After that, the number of first, second, and then third preferences should be maximized.
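
The two criteria above mix minimizing and maximizing. One way to express such a mixed ordering, sketched below with made-up histograms, is a sort key in which the counts to be maximized are negated. The helper `knight_key` is hypothetical and only describes the ordering; it does not tell you which edge costs achieve it.

```python
def knight_key(counts):
    """Smaller is better: minimize f_5, then f_4; then maximize f_1, f_2, f_3."""
    return (counts[5], counts[4], -counts[1], -counts[2], -counts[3])

# Hypothetical histograms: counts[k] = number of students assigned preference k.
sol_a = {1: 6, 2: 1, 3: 1, 4: 1, 5: 0}
sol_b = {1: 5, 2: 3, 3: 1, 4: 0, 5: 0}
sol_c = {1: 5, 2: 2, 3: 2, 4: 0, 5: 0}

best = min([sol_a, sol_b, sol_c], key=knight_key)
print(best)
```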

In the cell below, try to implement these specifications, using weights with different orders of magnitude for each successive objective.

In [None]:
# TODO: Fill in the missing edge costs to implement the Knight Institute approach outlined above.
# costs = {1:XXX, 2:XXX, 3:-1, 4:XXX, 5:XXX}

### BEGIN SOLUTION
costs = {1:-100, 2:-10, 3:-1, 4:1000, 5:10000}
### END SOLUTION

m,x = modifiedAssign(s21, costs, 6, 17)
knight = solution_summary(s21, x)
Histo(knight)

**Q:** Compare this solution to the previous solutions. Do you think the Knight Institute made the right decision? How would you have done it differently?

**A:** <font color='blue'>Answers may vary.</font>

**Supplemental Exercise: Guaranteeing an ordering**

Suppose we have 9 students to assign to FWS sections, and each student lists 4 preferences. We want to minimize the number of students receiving their fourth choice, then third choice, and so on. 

Let's set <code>costs</code> so that each preference costs 10 times as much as the one before it (that is, the costs are successive powers of 10):

<code>costs = {1:1, 2:10, 3:100, 4:1000}</code>

We might wonder if the IP solver could still assign someone their fourth choice in order to reduce the cost: for instance, maybe this allows us to give a lot more students their first choice. But this is impossible!

Why? Consider the cost of an assignment where all 9 students are assigned to their first or second or third choice, that is, no students are assigned their fourth choice. This will cost at most $9(100) = 900 < 1000$ (the cost of assigning just one student their fourth choice). Thus the IP solver will never opt to assign someone their fourth choice unless this is the only way to achieve a feasible solution&mdash;otherwise, it's just too expensive. 

Using the same reasoning, you can justify that the IP will then minimize the number of third-choice assignments (and so on).

More generally, if we have two objectives (e.g., number of third-choice and number of fourth-choice) that can take on values in the range $\{0,...,9\}$, then multiplying the weight of one objective by 10 would make it completely dominant over the other.

Thus, by cleverly choosing our edge costs, we can make it so that our higher-priority objectives 'dominate' lower-priority objectives. This is what we mean by a *lexicographic ordering*.

**E1:** Would this approach (using powers of 10) still work if we had 10 students to assign?

**A:** <font color=blue>Yes. Consider the minimum cost of a solution in which at least one student is assigned their fourth choice: $9(1) + 1000 = 1009$. Now compare this to the maximum cost of a solution in which no student is assigned their fourth choice: $10(100) = 1000 < 1009$. So the solver will still never choose to assign a student their fourth choice if it can be avoided.</font>

**E2:** Would this approach (using powers of 10) still work if we had 11 students to assign?

**A:** <font color=blue>No. For example, assigning 10 students their first choice and 1 student their fourth choice costs 1010, and assigning 1 student their second choice and 10 students their third choice would also cost 1010. So the solver might choose to assign a fourth choice when it could've avoided it.</font>

**E3:** Now imagine we have $n$ students to assign. By what factor should the objective weights differ to ensure we get the ordering we want?

**A:** <font color=blue>By a factor of at least $n$.</font>
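
The answers to E1 through E3 can be sanity-checked by brute force. The sketch below enumerates every possible histogram of preference counts for $n$ students and tests whether a cheaper weighted sum always corresponds to a lexicographically better histogram. The helper names are hypothetical, and the check is conservative: it considers all histograms, whether or not a particular set of student preferences could actually produce them.

```python
from itertools import product

def count_vectors(n, num_prefs):
    """All tuples (f_1, ..., f_K) of nonnegative integers summing to n."""
    return [f for f in product(range(n + 1), repeat=num_prefs) if sum(f) == n]

def guarantees_lex_min(n, weights):
    """True if a cheaper weighted sum always means a lexicographically better histogram."""
    # Rank histograms from best to worst: fewest last choices first, then the
    # next-to-last choice, and so on (this compares the reversed tuples).
    ranked = sorted(count_vectors(n, len(weights)), key=lambda f: f[::-1])
    totals = [sum(w * fk for w, fk in zip(weights, f)) for f in ranked]
    # The weights work if the cost strictly increases along the whole ranking.
    return all(a < b for a, b in zip(totals, totals[1:]))

powers_of_10 = (1, 10, 100, 1000)
for n in (9, 10, 11):
    print(n, guarantees_lex_min(n, powers_of_10))

# With 11 students, weights that differ by a factor of 11 restore the guarantee.
print(11, guarantees_lex_min(11, (1, 11, 11**2, 11**3)))
```

The check passes for 9 and 10 students and fails for 11, matching E1 and E2; the failure comes from ties such as the one described in E2.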

Returning to the actual FWS input, suppose we want to guarantee a lexicographic-minimum ordering when finding a solution. To do so, we might set our costs to be something like <code>{1:1, 2:n, 3:n\*\*2, 4:n\*\*3, 5:n\*\*4}</code>, where $n$ is the number of students (in this case, 2285). (In Python, we use the <code>**</code> operator for exponentiation.)
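
To get a feel for the size of these numbers, the sketch below builds the cost dictionary for $n = 2285$. Python integers have arbitrary precision, so `n**4` is computed exactly. The comparison against `2**53` assumes that the solver stores coefficients as 64-bit floats; that is an assumption about the solver, not something this lab states.

```python
n = 2285                                        # number of students
costs = {k: n ** (k - 1) for k in range(1, 6)}  # {1: 1, 2: n, ..., 5: n**4}

largest_edge_cost = costs[5]
worst_case_total = n * largest_edge_cost        # every student gets a fifth choice

# A 64-bit float represents every integer exactly only up to 2**53.
exact_float_limit = 2 ** 53

print(largest_edge_cost < exact_float_limit)    # is a single edge cost safely small?
print(worst_case_total < exact_float_limit)     # is the worst-case objective safely small?
```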

**E4:** An 1101 student sees the $n^4$ term in the edge costs above and is worried the computer they're using won't be able to handle such a big number. They decide to divide all the edge costs by $n^2$, so that the largest edge cost is now only $n^2$. Will this give the same solution? Explain.

**A:** <font color='blue'>Yes! Scaling the edge costs by any scaling factor $k > 0$ preserves the relationship between them (that consecutive edge types' costs differ by a factor of $n$).</font>
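
As a quick check of this scaling argument, the sketch below ranks every possible histogram for a small, hypothetical class of $n = 4$ students under both the original and the scaled costs. It uses `fractions.Fraction` so that dividing by $n^2$ is exact; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

n = 4  # small hypothetical class, so that every histogram can be enumerated

original = {k: Fraction(n) ** (k - 1) for k in range(1, 6)}   # 1, n, ..., n**4
scaled = {k: c / n**2 for k, c in original.items()}           # divide each cost by n**2

# Every histogram (f_1, ..., f_5) of nonnegative counts summing to n.
histograms = [f for f in product(range(n + 1), repeat=5) if sum(f) == n]

def total(costs, f):
    """Weighted-sum objective for histogram f under the given costs."""
    return sum(costs[k] * f[k - 1] for k in costs)

# Dividing every cost by the same positive number divides every total by that
# number, so the ranking of histograms from cheapest to most expensive is unchanged.
same_order = (sorted(histograms, key=lambda f: total(original, f))
              == sorted(histograms, key=lambda f: total(scaled, f)))
print(same_order)
```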

**E5:** Another 1101 student defines <code>costs</code> to be the following:

<code>costs = {1:0, 2:1, 3:n, 4:n\*\*2, 5:n\*\*3}</code>

Will this still give the solution we want? Explain.

**A:** <font color='blue'>Yes! Comparing preferences 2-3, 3-4, and 4-5, the factor of $n$ separating the respective edge costs creates the "ranking" we want. For preferences 1-2, first-choice edges now cost 0, so assigning even one student their second choice (cost 1) is more expensive than assigning all $n$ students their first choice (cost $n \cdot 0 = 0$). First choices are therefore still preferred over second choices.</font>

## Part 3: A Note on Integrality

The integer program we've developed so far has integer decision variables $x(i,j) \in \{0, 1, ...\}$, corresponding to the amount of flow to send on an arc $(i,j)$ from a student node $i$ to a class node $j$. As an improvement, we added new binary decision variables $y(j) \in \{0, 1\}$ that encode whether class $j$ runs or not. 

What happens if we remove ("relax") our integrality constraints on $x(i,j)$? (Dropping integrality requirements in this way is known as an *LP relaxation* of the integer program; here we relax only the $x(i,j)$, while the $y(j)$ stay binary.) To find out, we'll use the following function, which is pretty much the same as the <code>modifiedAssign</code> function we've been using, except we can specify whether our $x(i,j)$ are restricted to only taking on integer values.

In [None]:
def integralityAssign(preferences, costs, minstudents, csize, integer_only = True):
    """Same as function modifiedAssign, but with optional additional parameter 'integer_only'
    
    Args:
        preferences (pd.DataFrame): Preferred classes for each student.
        costs (Dict): Dictionary from edge types to unit costs.
        minstudents (int): Minimum number of students in the classroom.
        csize (int): Capacity of the classroom.
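        integer_only (bool): Whether or not decision variables x[i,j] are constrained to integers. (Default: True)

    Returns:
        The solved model m and the dictionary x of decision variables keyed by edge (i,j),
        or None if the model is infeasible.
    """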
    students, classes, edges = inputData(preferences, costs) 
    EDGES = list(edges.keys())      # create edge list
    
    c = edges.copy()                # define c[i,j]

    # define model
    m = OR.Solver('FWS', OR.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
    
    # decision variables
    x = {}  
    for i,j in EDGES:
        # define x(i,j) here
        if integer_only:
            x[i,j] = m.IntVar(0, m.infinity(), ('(%d, %s)' % (i,j)))
        else:
            x[i,j] = m.NumVar(0, m.infinity(), ('(%d, %s)' % (i,j)))
  
    print('Decision variables constrained to integer values.' if integer_only else 'Decision variables relaxed to continuous (nonnegative) values.')
        
    y = {}
    for j in classes:
        # define y_j here
        y[j] = m.BoolVar('y_%s' % j) # A BoolVar or Boolean variable is similar to an integer variable,
                                     # except that it can only take on values in {0,1}, where 0 represents "false"
                                     # and 1 represents "true." We could have also used an IntVar ranging from 0 to 1.
        
    # define objective function here
    m.Minimize(sum(c[i,j]*x[i,j] for i,j in EDGES))
       
    # add constraint to ensure each student is assigned exactly one class
    for k in students:
        m.Add(sum(x[i,j] for i,j in EDGES if i==k) == 1)
        
    # add constraint to ensure each class that runs satisfies minimum and maximum class size
    for k in classes:
        m.Add(sum(x[i,j] for i,j in EDGES if j==k) <= csize*y[k])
        m.Add(sum(x[i,j] for i,j in EDGES if j==k) >= minstudents*y[k])
    
    # solve
    status = m.Solve()

    if status == OR.Solver.INFEASIBLE:
        print('Infeasible')
        return
    
    classes_run = int(sum(y[j].solution_value() for j in classes))
    print(str(classes_run) + ' of ' + str(len(classes)) + ' classes run.')
    
    return m,x 

The Knight Institute saw our histograms from the previous section and decided they liked their approach the best: minimizing the number of students receiving their fifth choice, then minimizing fourth choice, then maximizing first, second, and then third choice. As a refresher, run the cell below to see what that solution looks like. (By setting the parameter 'integer_only' to <code>True</code>, we tell the model to create the decision variables as integer variables.)

In [None]:
costs = {1:-100, 2:-10, 3:-1, 4:1000, 5:10000}
integer_only = True

m,x = integralityAssign(s21, costs, 6, 17, integer_only)
int_sol = solution_summary(s21, x, integer_only)
Histo(int_sol)
print_pct(int_sol)

If all goes well, you should get the same solution you got in the previous section. 

Now, let's see if anything interesting happens when our $x(i,j)$ are no longer confined to the happy world of integers! We do this by setting our parameter 'integer_only' to <code>False</code>.

In [None]:
integer_only = False
m,x = integralityAssign(s21, costs, 6, 17, integer_only)
not_just_int_sol = solution_summary(s21, x, integer_only)
Histo(not_just_int_sol)
print_pct(not_just_int_sol)

Hmm, doesn't seem like much changed. Maybe we can just use our LP relaxation from now on, since solving an LP is generally cheaper than solving an IP anyway.

The Knight Institute sees our solution and wants to know if we can reduce the number of students assigned their fourth choice by one, from 103 to 102, so that they can brag that roughly 96% of students get one of their top three choices.

To see if this is possible, we add a constraint that caps the number of fourth-choice assignments at 102. Run the cell below to see if we can satisfy the Knight Institute's request.

In [None]:
integer_only = False

# adding constraint
edge_dict = inputData(s21, costs)[2] # dictionary {edges : edge costs}
edge_list = list(edge_dict.keys()) # list [edges]
m,x = integralityAssign(s21, costs, 6, 17, integer_only) # obtain solver object
m.Add(sum(x[i,j] for i,j in edge_list if edge_dict[i,j] == costs[4]) <= 102) # add constraint
m.Solve() # re-solve

add_constraint_sol = solution_summary(s21, x, integer_only)
Histo(add_constraint_sol)
print_pct(add_constraint_sol)

Uh-oh...we broke integrality! It doesn't make much sense to be assigning half a student their fifth-choice FWS preference.

**Q:** The Knight Institute isn't too keen on splitting students in half. They argue that there must be some way to assign whole (integer) students where (a) no student gets their fifth-choice preference, (b) 102 students (or fewer) get their fourth-choice preference, and (c) the rest get one of their top three preferences. Explain why this is impossible.

**A:** <font color='blue'>To satisfy this, we would need an integer solution that is feasible (i.e., assigns every student to a class), has 102 or fewer fourth-choice assignments, and has 0 fifth-choice assignments. Such a solution would have to be better than the one we just found, which has 0.5 fifth-choice assignments. But no integer solution can be better than the optimal solution of the LP relaxation: the IP-feasible region is contained within the LP-feasible region, so the best solution for the IP can be no better than the best solution for the corresponding LP relaxation.</font>
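
The bound used in this argument can be written out explicitly. In the sketch below, $P$ denotes the feasible region of the LP relaxation, and $z_{LP}$ and $z_{IP}$ denote the optimal values of the relaxation and of the integer program; this notation is introduced here for illustration and is not used elsewhere in the lab.

$$
z_{LP} \;=\; \min_{x \in P} \sum_{(i,j)} c(i,j)\, x(i,j) \;\le\; \min_{\substack{x \in P \\ x \text{ integer}}} \sum_{(i,j)} c(i,j)\, x(i,j) \;=\; z_{IP}
$$

Every integer-feasible assignment also lies in $P$, so minimizing over the larger set can only give a value that is smaller or equal.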

You may be wondering why adding this constraint in particular broke integrality. As it turns out, we're lucky that none of the previous constraints we added broke it first! 

The moral of the story is that we can't rely on an LP relaxation if we're looking for integral solutions. Instead, we need to use an IP formulation and an IP solver. 

**Q:** We've developed three distinct models for the FWS assignment problem: a transportation model, a min-cost flow model, and an integer program model. Suppose we relax each to an LP. Which model(s) will still guarantee us an integral optimal solution?

**A:** <font color='blue'>Transportation and min-cost flow. These are guaranteed to give integral optimal solutions so long as edge capacities and supplies at supply/demand nodes are integer (which they are). As we saw above, the IP model has no such guarantee.</font>