# Baseball Elimination Lab

**Objectives:**
* Introduce students to a sophisticated formulation using the
maximum flow problem.
* Demonstrate how to solve the application by the
Ford-Fulkerson algorithm.

**Key Ideas:**
* integrality property
* max-flow min-cut theorem
* the baseball elimination problem

**Reading Assignment:**
* Read Handout 6 on the baseball elimination problem.

**Brief description:** In this lab, we will learn how to formulate the so-called baseball elimination problem as a maximum flow problem and use the output of the Ford-Fulkerson algorithm to determine whether a team can still win the division, and if not, why. We will again use the Python package NetworkX.

## Part 1: Baseball Elimination Max-Flow Formulation


You have seen an example of the Baseball Elimination Problem in
class. Recall that the data for this problem consists of the following:
* a collection of teams $1, 2, \ldots, n$,
* the number of games $g(i,j)$ remaining to be played between teams $i$ and $j$ for all pairs of teams $i$ and $j$,
* the number of wins $w(i)$ team $i$ already has.

We would like to determine if team $n$ (our favorite) has
been eliminated already: that is, even if team $n$ wins all
of its remaining games, no matter how the games between the other
teams turn out, there will always be some team with more wins than
team $n$ at the end of the season.  If it is possible for
team $n$ to finish the season at least tied for first, then
it has not been eliminated.
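To make the definition concrete, here is a minimal brute-force sketch that tries every possible outcome of the remaining games and reports whether a chosen team can finish at least tied for first. The function name `can_finish_first` and the three-team league `X`, `Y`, `Z` are hypothetical and are not part of the lab data. Enumeration is only practical for a handful of games, which is one reason the max-flow formulation below is useful.

```python
from itertools import product

def can_finish_first(wins, games, fav):
    """Return True if some outcome of the remaining games leaves
    `fav` with at least as many wins as every other team."""
    # Expand each pairing into one entry per remaining game.
    matches = [pair for pair, k in games.items() for _ in range(k)]
    # Each game has two outcomes: 0 = first team wins, 1 = second team wins.
    for outcome in product((0, 1), repeat=len(matches)):
        final = dict(wins)
        for (a, b), result in zip(matches, outcome):
            final[b if result else a] += 1
        if final[fav] >= max(final.values()):
            return True
    return False

# Hypothetical league (not the lab data).
wins = {'X': 3, 'Y': 3, 'Z': 2}
games = {('X', 'Y'): 2, ('X', 'Z'): 1}

print(can_finish_first(wins, games, 'Z'))
print(can_finish_first(wins, games, 'X'))
```

In this made-up league, team `Z` can reach at most 3 wins, and neither `X` nor `Y` has passed 3 yet. The two games between `X` and `Y` still push one of them to at least 4, so looking at current win totals alone is not enough to decide elimination.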

Consider the following data for a 4 team league:

Team | Wins 
--- | ---
1 | 8 
2 | 10
3 | 10
4 | 1

Games remaining to be played:

vs. | 1 | 2 | 3 | 4
 -- | -- | -- | -- | --
**1** | - | 3 | 3 | 6
**2** | 3 | - | 6 | 3
**3** | 3 | 6 | - | 3
**4** | 6 | 3 | 3 | -


Our team (team 4) hasn't done very well so far. We would like
to determine whether it still has a chance to finish first at the end
of the season (we are satisfied with a tie for first place as
well). 

**Q1:** How many games can team 4 possibly win during the
season? Call this number $W$. (Hint: assume that team 4 wins all of its remaining games.)

In [None]:
W = 

Team 4 finishes first if no other team wins more games than $W$.  

**Q2:** What if (hypothetically) team 1 already had 14 wins? Can our team come out first?

**A:** 

Team 1 has already won 8 games.  

**Q3:** At most how many more games is this team allowed to win if we want to make sure that our team comes out first? What about teams 2 and 3?

**A:**  

**Q4:** How can you express this amount in general (in terms of $W$ and $w(i)$)
for team $i$?

**A:**

If one of two teams scheduled to play is team 4, then we assume
that the game outcome is decided in our favor. For the rest of the games, we
would like to assign a winner so that no team wins more games than the
number determined above. So for each pair of teams (other than those
containing our team) we would like to decide how many of the leftover
games between them are going to be won by one team or the other.


To illustrate this, we draw two sets of nodes: one set for all the
pairs of teams not containing our team (these will be called the 
*pair nodes*) and another one for the individual teams (these are
called the *team nodes*). We can interpret $g(i,j)$, the number of 
games remaining to be played between teams $i$ and $j$, as the amount
of "excess games" at the pair node. These games need to be "distributed" 
between the two corresponding team nodes as represented by edges.  

**Q5:** Complete the following code and run the cell to add edges to our graph.


In [None]:
import networkx as nx

G = nx.DiGraph()
# pair nodes
G.add_node('1,2', pos=(10,30))
G.add_node('1,3', pos=(10,20))
G.add_node('2,3', pos=(10,10))
# team nodes
G.add_node('1', pos=(20,30))
G.add_node('2', pos=(20,20))
G.add_node('3', pos=(20,10))

# FILL IN THE EDGES -- remember that they are directed edges, so the order of endpoints matters
G.add_edge('1,2', '1')
G.add_edge('1,2', '2')
G.add_edge()
G.add_edge()
G.add_edge()
G.add_edge()

# graph display
pos=nx.get_node_attributes(G,'pos')
nx.draw_networkx(G,pos,node_size=1000,node_color='lightblue')

To make a single-source, single-sink flow problem from
this model, we introduce two nodes: a node which is the "source
of all games" and a node which is the "sink of all played
games".  

**Q6:** What should be the capacity of (source, pair node) arcs? What about (team node, sink) arcs? Fill in the capacity values missing in the code below and run the cell to display the graph.

In [None]:
# adding our source node and sink node
G.add_node('s', pos=(0,20))
G.add_node('t', pos=(30,20))

# FILL IN THE CAPACITIES
G.add_edge('s', '1,2', capacity = )
G.add_edge('s', '1,3', capacity = )
G.add_edge('s', '2,3', capacity = )
G.add_edge('1', 't', capacity = )
G.add_edge('2', 't', capacity = )
G.add_edge('3', 't', capacity = )

# graph display
pos=nx.get_node_attributes(G,'pos')
cap=nx.get_edge_attributes(G,'capacity')
nx.draw_networkx(G,pos,node_size=1000,node_color='lightblue')
nx.draw_networkx_edge_labels(G,pos,edge_labels=cap);

**Q7:** In class, we used capacities
of $\infty$ on the (pair node, team node) arcs.  Why is this
appropriate?

**A:** 

**Q8:** What does a feasible flow, with all integer flow values,
in the above network correspond to? How can we interpret the value
of a flow? 

**A:** 

**Q9:** How can we tell whether our team is eliminated by solving the maximum
flow problem on the above network? What has to be the value of the
maximum flow if our team is not eliminated?

**A:**

**Q10:** Solve the maximum flow problem on the above network with
the Ford-Fulkerson algorithm. What is the value of the maximum
flow? What is the maximum flow (i.e. the actual flows on the
arcs)? What about the minimum cut (i.e., which nodes are on the $s$ side, which on the $t$)?

**A:** 

Check your answer by running the following cell, which will compute a maximum flow in the graph $G$ we defined previously.

In [None]:
flow_value, flow = nx.maximum_flow(G, 's', 't')
print("The value of the flow is", flow_value)
for i, j in G.edges:
    print("The flow on the arc from "+i+" to "+j+" is",flow[i][j])
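NetworkX can also report a minimum cut directly. The sketch below uses a small hypothetical network `H` (unrelated to the lab data) to show the return shape of `nx.minimum_cut`, and to illustrate that an edge added without a `capacity` attribute is treated by NetworkX as having infinite capacity.

```python
import networkx as nx

# Hypothetical toy network, not the baseball network from this lab.
H = nx.DiGraph()
H.add_edge('s', 'a', capacity=3)
H.add_edge('s', 'b', capacity=2)
H.add_edge('a', 'c')  # no capacity attribute: treated as infinite
H.add_edge('b', 'c')  # no capacity attribute: treated as infinite
H.add_edge('c', 't', capacity=4)

# minimum_cut returns the cut capacity and a partition (S, T) of the nodes.
cut_value, (S, T) = nx.minimum_cut(H, 's', 't')
flow_value, _ = nx.maximum_flow(H, 's', 't')

print("cut capacity:", cut_value, "max flow value:", flow_value)
print("s side:", sorted(S))
print("t side:", sorted(T))
```

By the max-flow min-cut theorem the two printed values agree. When a network has several minimum cuts, the partition NetworkX returns is only one of them.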

**Q11:** Has our team been eliminated already or not yet? If it
hasn't been eliminated, give a short scenario (i.e. a way
for the remaining games to turn out) by which our team could end
the season at least tied for first place. If it has been
eliminated, give a short explanation why. (Imagine you are trying
to explain it to a friend who doesn't know anything about the max
flow or min cut problems.)

**A:** 

**Q12a:** If team 4 was not eliminated, then how many games from the
rest of the season could it lose and still come in first place?
Does it matter which of its remaining games it loses?

**A:**  

**Q12b:** If team 4 was eliminated, how many additional games should it
have won in the first part of the season in order to have prevented
this early end to its competitive season? Does it matter against which 
teams these additional wins come?

**A:** 

Now assume that team 3 had only 9 wins (the total
number of games is one less than previously).  

**Q13:** How is the network going to change? Will team 4 be eliminated in this case? Also answer Q11 and Q12a/b for this case.

**A:** 

**Q14:** Going back to the general formulation of the Baseball
Elimination Problem given at the beginning of this lab, how many
pair nodes and how many team nodes are we going to have if the
number of teams is $n$?

**A:**  

**Q15:** Write down in terms of the general formulation what the
nodes, arcs and arc capacities correspond to.

**A:**

## Part 2: Proof of Elimination

Here is the data from another season.

Team | Wins 
--- | ---
1 | 7
2 | 7
3 | 3
4 | 3

Games remaining to be played:

vs. | 1 | 2 | 3 | 4
 -- | -- | -- | -- | --
**1** | - | 3 | 1 | 1
**2** | 3 | - | 1 | 1
**3** | 1 | 1 | - | 3
**4** | 1 | 1 | 3 | -

**Q16:** Finish the code and run the cell to display the new graph. We once again are cheering for team 4.  

In [None]:
# creating a new graph
G = nx.DiGraph()
G.add_node('s', pos=(0,20))
G.add_node('t', pos=(30,20))
G.add_node('1,2', pos=(10,30))
G.add_node('1,3', pos=(10,20))
G.add_node('2,3', pos=(10,10))
G.add_node('1', pos=(20,30))
G.add_node('2', pos=(20,20))
G.add_node('3', pos=(20,10))

# FILL IN THE EDGES -- remember that they are directed edges, so the order of endpoints matters
G.add_edge('1,2', '1')
G.add_edge('1,2', '2')
G.add_edge()
G.add_edge()
G.add_edge()
G.add_edge()

# FILL IN THE CAPACITIES
G.add_edge('s', '1,2', capacity = )
G.add_edge('s', '1,3', capacity = )
G.add_edge('s', '2,3', capacity = )
G.add_edge('1', 't', capacity = )
G.add_edge('2', 't', capacity = )
G.add_edge('3', 't', capacity = )

pos=nx.get_node_attributes(G,'pos')
cap=nx.get_edge_attributes(G,'capacity')
nx.draw_networkx(G,pos,node_size=1000,node_color='lightblue')
nx.draw_networkx_edge_labels(G,pos,edge_labels=cap);

**Q17:** Can you tell if team 4 is eliminated or not? If the team is not yet eliminated, give a short scenario where the team comes out first. If the team is eliminated, explain why to your friend who doesn't know anything about Operations Research.

**A:** 

**Q18:** Is it possible for a minimum $s$-$t$ cut in the general
network to have infinite capacity? Why?

**A:** 

**Q19:** Consider the pair node $i,j$ and the team nodes $i$ and $j$. Is it possible for the pair node to be in the minimum cut but not the team nodes? Why? (Hint: use what you learned in Q18)

**A:** 

**Bonus:** Consider the edges of infinite capacity. Why is $\hat G + 1$, where $\hat G$ is the sum of the capacities of all finite-capacity edges, a sufficiently large capacity to use in place of $\infty$?

**A:** 

Let's look at the min-cut for this graph by running the labeling algorithm after running Ford-Fulkerson to find an optimal flow. The Python package `max_flow` contains the functions you wrote in the max-flow lab with some additional visualizations.

In [None]:
from max_flow import *
ex = max_flow(add_infinite_capacities(G)) # create a max flow instance from the graph G
ex.ford_fulkerson(show=False) # run Ford-Fulkerson

**Q20:** Now that we have run Ford-Fulkerson, let's run the labeling algorithm. At each iteration, you will be asked to select the next node to explore from the set of unexplored nodes. Furthermore, the residual graph is plotted with the checked nodes colored red.

In [None]:
ex.label(auto=False,show=True)

**NOTE**: It is important to distinguish between the residual graph and the flow graph here. We run the labeling algorithm on the *residual* graph to find the set of nodes reachable from $s$. The residual graph for an optimal flow cannot contain an $s$-$t$ path, so $t$ is not reachable. Therefore, we can view the set of checked nodes as the $s$ side of a cut (more specifically, a min-cut). When we look at the capacity of this cut, we go back to looking at the *flow* graph.
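The distinction can be sketched in plain NetworkX, independently of the `max_flow` package. The helper `reachable_in_residual` and the toy network `H` below are hypothetical and only illustrate the idea: the search runs over residual arcs, while the cut capacity is summed over arcs of the original graph.

```python
from collections import deque
import networkx as nx

INF = float('inf')

def reachable_in_residual(G, flow, s):
    """Nodes reachable from s using residual arcs of the given flow."""
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        # Forward residual arc (u, v): some capacity on (u, v) is unused.
        for v in G.successors(u):
            if v not in seen and flow[u][v] < G[u][v].get('capacity', INF):
                seen.add(v)
                queue.append(v)
        # Backward residual arc (u, v): flow on (v, u) could be cancelled.
        for v in G.predecessors(u):
            if v not in seen and flow[v][u] > 0:
                seen.add(v)
                queue.append(v)
    return seen

# Hypothetical toy network, not the lab data.
H = nx.DiGraph()
H.add_edge('s', 'a', capacity=3)
H.add_edge('s', 'b', capacity=2)
H.add_edge('a', 'c')  # infinite capacity
H.add_edge('b', 'c')  # infinite capacity
H.add_edge('c', 't', capacity=4)

flow_value, flow = nx.maximum_flow(H, 's', 't')
S = reachable_in_residual(H, flow, 's')

# The cut capacity is measured on the original graph, not the residual one.
cut_capacity = sum(H[u][v].get('capacity', INF)
                   for u, v in H.edges if u in S and v not in S)
print(sorted(S), cut_capacity, flow_value)
```

Because the flow is maximum, the sink is not reached, and the capacity of the arcs leaving the reachable set matches the flow value.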

The following cell plots the flow graph with the final set of checked nodes. It may be helpful in answering the next question.

In [None]:
ex.plot_checked()

**Q21:** What min-cut does the labeling algorithm give us? What is its capacity?

**A:**

**Q22:** Are there any pair nodes in the cut? What do you notice about the capacity of the arc into the pair node in relation to the capacities of the arcs coming out of the team nodes? How can this be interpreted in terms of the baseball elimination problem?

**A:** 

**Q23:** For a general input, what can you say about the value of the maximum flow if team $n$ is eliminated? What can you say about the minimum cut? How can you use the minimum cut to explain to your friend (who has not taken 1101) why a team has been eliminated?

**A:** 

##  Bonus: MLB Example

Let's apply what we learned to an actual MLB season! We will look at the American League during the 2014 season on September 1. We will need the win record for each team as of September 1 and the number of games remaining to be played between each pair of teams. We load in this data below.

*This data was obtained from [Sports Reference](https://www.baseball-reference.com/boxes/?month=9&day=1&year=2014).*

In [None]:
import pandas as pd
w = pd.read_csv('data/standing.csv', index_col = 0)['W']
g = pd.read_csv('data/games_left.csv', index_col = 0)

In [None]:
w

In [None]:
g.head()

The Houston Astros (HOU) are having a rough season so far. Let's see if they can avoid being eliminated.

In [None]:
team = 'HOU'
W = w[team] + sum(g[team]) # total number of wins they could possibly get
u = {}  # limits to how many wins each team can get 
for tm in w.index:
    u[tm] = W - w[tm]

Now, we create the graph!

In [None]:
G = nx.DiGraph()

G.add_node('s', pos=(0,0.5))  # source 
G.add_node('t', pos=(1,0.5))  # sink 

for i in range(len(w)):
    if w.index[i] != team:
        G.add_node('%s'%(w.index[i]), pos=(0.75,10*i))  # team nodes
        for j in range(len(w)):
            if i < j and w.index[j] != team:
                G.add_node('%s,%s'%(w.index[i],w.index[j]), pos=(0.25,10*(i+j)))  # pair nodes

for i in range(len(w)):
    if w.index[i] != team:
        G.add_edge(w.index[i], 't', capacity = u[w.index[i]])  # sink edges
        for j in range(len(w)):
            if i < j and w.index[j] != team:
                pair_node = '%s,%s'%(w.index[i],w.index[j])
                i_node = w.index[i]
                j_node = w.index[j]
                G.add_edge('s', pair_node, capacity = g.at[i_node, j_node]) # source edges

                # create edges from pair nodes to team nodes
                G.add_edge(pair_node, '%s'%(i_node))
                G.add_edge(pair_node, '%s'%(j_node))

Lastly, we will run Ford-Fulkerson to get an optimal flow and then use the labeling algorithm to generate a minimum cut.

In [None]:
ex = max_flow(add_infinite_capacities(G)) # create a max flow instance from the graph G
ex.ford_fulkerson(show=False) # run Ford-Fulkerson

# print the set of checked nodes
checked_attr = nx.get_node_attributes(ex.G,'check')
for i in checked_attr:
    if checked_attr[i]:
        print(i)

Use `g.at['TM1','TM2']` to get the number of games left to play between `TM1` and `TM2`. For example, the following cell gives the number of games left to play between LAA and OAK.

In [None]:
g.at['LAA','OAK']

Use `u['TM1']` to get the number of additional games that `TM1` can win before the Houston Astros are eliminated. For example, the following cell indicates LAA must lose the rest of their games.

In [None]:
u['LAA']
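The two lookups above can be combined over a group of teams. The sketch below uses made-up teams `AAA`, `BBB`, `CCC` and made-up numbers in `g_demo` and `u_demo` (hypothetical stand-ins for `g` and `u`, not the 2014 data) to show one way to total the games left within a group and the wins that group is still allowed.

```python
from itertools import combinations
import pandas as pd

# Hypothetical data for three made-up teams; NOT the 2014 season.
teams = ['AAA', 'BBB', 'CCC']
g_demo = pd.DataFrame([[0, 4, 3],
                       [4, 0, 2],
                       [3, 2, 0]], index=teams, columns=teams)
u_demo = {'AAA': 2, 'BBB': 3, 'CCC': 1}

subset = ['AAA', 'BBB', 'CCC']
# Count each pairing once: games still to be played inside the group.
games_among = sum(g_demo.at[a, b] for a, b in combinations(subset, 2))
# Total additional wins the group may collect.
wins_allowed = sum(u_demo[t] for t in subset)

print("games left within the group:", games_among)
print("wins the group is allowed:", wins_allowed)
```

Every game played within the group adds one win to some team in the group, so comparing the two totals shows whether the group can stay within its allowance.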

**Q24:** How many games are left to play between LAA, OAK, and SEA? How many wins combined can these three teams have before the Houston Astros are eliminated? (Hint: Use the commands above to compute the answer.)

**A:** 

In [None]:
# USE THIS CELL FOR COMPUTATIONS


**Q25:** Will the Houston Astros be eliminated? If so, use the minimum cut to explain (to a non-OR student) why they are eliminated.

**A:**  